Inspecting variables

Querying and slicing case data

A qp.DataSet is mimicking pandas-like item access, i.e. passing a variable name into the []-accessor will return a pandas.DataFrame view of the case data component. That means that we can chain any pandas.DataFrame method to the query:

>>> ds['q9'].head()
     q9
0   99;
1  1;4;
2   98;
3  1;4;
4   99;

There is the same support for selecting multiple variables at once:

>>> ds[['q9', 'gender']].head()
     q9  gender
0   99;       1
1  1;4;       2
2   98;       1
3  1;4;       1
4   99;       1

To integrate array (masks) variables into this behaviour, passing an array name will automatically call its item list:

>>> ds['q6'].head()
   q6_1  q6_2  q6_3
0     1     1     1
1     1   NaN     1
2     1   NaN     2
3     2   NaN     2
4     2    10    10

This can be combined with the list-based selection as well:

>>> ds[['q6', 'q9', 'gender']].head()
   q6_1  q6_2  q6_3    q9  gender
0     1     1     1   99;       1
1     1   NaN     1  1;4;       2
2     1   NaN     2   98;       1
3     2   NaN     2  1;4;       1
4     2    10    10   99;       1

DataSet case data supports row-slicing based on complex logical conditions to inspect subsets of the data. We can use the take() with a Quantipy logic operation naturally for this:

>>> condition = intersection(
...    [{'gender': [1]},
...     {'religion': [3]},
...     {'q9': [1, 4]}])
>>> take = ds.take(condition)
>>> ds[take, ['gender', 'religion', 'q9']].head()
     gender  religion      q9
52        1         3  1;2;4;
357       1         3  1;3;4;
671       1         3  1;3;4;
783       1         3  2;3;4;
802       1         3      4;

See also

Please find an overview of Quantipy logical operators and data slicing and masking in the docs about complex logical conditions!

Variable and value existence

any, all, code_count, is_nan, var_exists, codes_in_data, is_like_numeric variables


We can use variables() and var_exists() to generally test the membership of variables inside DataSet. The former is showing the list of all variables registered inside the 'data file' set, the latter is checking if a variable’s name is found in either the 'columns' or 'masks' collection. For our example data, the variables are:

>>> dataset.variables()

So a test for the array 'q5' should be positive:

>>> dataset.var_exists('q5')
True

In addition to Quantipy’s complex logic operators, the DataSet class offers some quick case data operations for code existence tests. To return a pandas.Series of all empty rows inside a variable use is_nan() as per:

>>> dataset.is_nan('q8').head()
0    True
1    True
2    True
3    True
4    True
Name: q8, dtype: bool

Which we can also use to quickly check the number of missing cases…

>>> dataset.is_nan('q8').value_counts()
True     5888
False    2367
Name: q8, dtype: int64

… as well as use the result as slicer for the DataSet case data component, e.g. to show the non-empty rows:

>>> slicer = dataset.is_nan('q8')
>>> dataset[~slicer, 'q8'].head()
Name: q8, dtype: int64
7       5;
11      5;
13    1;4;
14    4;5;
23    1;4;
Name: q8, dtype: object

Especially useful for delimited set and array data, the code_count() method is creating the pandas.Series of response values found. If applied on an array, the result is expressed across all source item variables:

>>> dataset.code_count('q6').value_counts()
3    5100
2    3155
dtype: int64

… which means that not all cases contain answers in all three of the array’s items.

With some basic pandas we can double-check this result:

>>> pd.concat([dataset['q6'], dataset.code_count('q6')], axis=1).head()
   q6_1  q6_2  q6_3  0
0     1   1.0     1  3
1     1   NaN     1  2
2     1   NaN     2  2
3     2   NaN     2  2
4     2  10.0    10  3

code_count() can optionally ignore certain codes via the count_only and count_not parameters:

>>> q2_count = dataset.code_count('q2', count_only=[1, 2, 3])
>>> pd.concat([dataset['q2'], q2_count], axis=1).head()
         q2  0
0  1;2;3;5;  3
1      3;6;  1
2       NaN  0
3       NaN  0
4       NaN  0

Similarly, the any() and all() methods yield slicers for cases obeying the condition that at least one / all of the provided codes are found in the response. Again, for array variables the conditions are extended across all the items:

>>> dataset[dataset.all('q6', 5), 'q6']
      q6_1  q6_2  q6_3
374      5   5.0     5
2363     5   5.0     5
2377     5   5.0     5
4217     5   5.0     5
5530     5   5.0     5
5779     5   5.0     5
5804     5   5.0     5
6328     5   5.0     5
6774     5   5.0     5
7269     5   5.0     5
8148     5   5.0     5
>>> dataset[dataset.all('q8', [1, 2, 3, 4, 96]), 'q8']
845     1;2;3;4;5;96;
6242      1;2;3;4;96;
7321      1;2;3;4;96;
Name: q8, dtype: object
>>> dataset[dataset.any('q8', [1, 2, 3, 4, 96]), 'q8'].head()
13      1;4;
14      4;5;
23      1;4;
24    1;3;4;
25      1;4;
Name: q8, dtype: object

Variable types

To get a summary of the all variables grouped by type, call by_type() on the DataSet:

>>> ds.by_type()
size: 8255     single delimited set array            int     float string        date      time N/A
0              gender            q2    q5  record_number    weight    q8a  start_time  duration
1            locality            q3    q7      unique_id  weight_a    q9a    end_time
2           ethnicity            q8    q6            age  weight_b
3            religion            q9            birth_day
4                  q1                        birth_month
5                 q2b                         birth_year
6                  q4
7                q5_1
8                q5_2
9                q5_3
10               q5_4
11               q5_5
12               q5_6
13               q6_1
14               q6_2
15               q6_3
16               q7_1
17               q7_2
18               q7_3
19               q7_4
20               q7_5
21               q7_6

We can restrict the output to certain types by providing the desired ones in the types parameter:

>>> ds.by_type(types='delimited set')
size: 8255 delimited set
0                     q2
1                     q3
2                     q8
3                     q9
>>> ds.by_type(types=['delimited set', 'float'])
size: 8255 delimited set     float
0                     q2    weight
1                     q3  weight_a
2                     q8  weight_b
3                     q9       NaN

In addition to that, DataSet implements the following methods that return the corresponding variables as a list for easy iteration:

DataSet.singles
       .delimied_sets()
       .ints()
       .floats()
       .dates()
       .strings()
       .masks()
       .columns()
       .sets()
>>> ds.delimited_sets()
[u'q3', u'q2', u'q9', u'q8']
>>> for delimited_set in ds.delimited_sets():
...     print delimited_set
q3
q2
q9
q8

Slicing & dicing metadata objects

Although it is possible to access a DataSet meta component via its _meta attribute directly, the prefered way to inspect and interact with with the metadata is to use DataSet methods. For instance, the easiest way to view the most important meta on a variable is to use the meta() method:

>>> ds.meta('q8')
delimited set                                      codes                         texts missing
q8: Which of the following do you regularly skip?
1                                                      1                     Breakfast    None
2                                                      2          Mid-morning snacking    None
3                                                      3                         Lunch    None
4                                                      4        Mid-afternoon snacking    None
5                                                      5                        Dinner    None
6                                                     96                  None of them    None
7                                                     98  Don't know (it varies a lot)    None

This output is extended with the item metadata if an array is passed:

>>> ds.meta('q6')
single                                             items                   item texts  codes                        texts missing
q6: How often do you take part in any of the fo...
1                                                   q6_1               Exercise alone      1     Once a day or more often    None
2                                                   q6_2       Join an exercise class      2               Every few days    None
3                                                   q6_3  Play any kind of team sport      3                  Once a week    None
4                                                                                          4             Once a fortnight    None
5                                                                                          5                 Once a month    None
6                                                                                          6        Once every few months    None
7                                                                                          7        Once every six months    None
8                                                                                          8                  Once a year    None
9                                                                                          9  Less often than once a year    None
10                                                                                        10                        Never    None

If the variable is not categorical, meta() returns simply:

>>> ds.meta('weight_a')
                             float
weight_a: Weight (variant A)   N/A

DataSet also provides a lot of methods to access and return the several meta objects of a variable to make various data processing tasks easier:

Variable labels: quantipy.core.dataset.DataSet.text()

>>> ds.text('q8', text_key=None)
Which of the following do you regularly skip?

values object: quantipy.core.dataset.DataSet.values()

>>> ds.values('gender', text_key=None)
[(1, u'Male'), (2, u'Female')]

Category codes: quantipy.core.dataset.DataSet.codes()

>>> ds.codes('gender')
[1, 2]

Category labels: quantipy.core.dataset.DataSet.value_texts()

>>> ds.value_texts('gender', text_key=None)
[u'Male', u'Female']

items object: quantipy.core.dataset.DataSet.items()

>>> ds.items('q6', text_key=None)
[(u'q6_1', u'How often do you exercise alone?'),
 (u'q6_2', u'How often do you take part in an exercise class?'),
 (u'q6_3', u'How often do you play any kind of team sport?')]

Item 'columns' sources: quantipy.core.dataset.DataSet.sources()

>>> ds.sources('q6')
[u'q6_1', u'q6_2', u'q6_3']

Item labels: quantipy.core.dataset.DataSet.item_texts()

>>> ds.item_texts('q6', text_key=None)
[u'How often do you exercise alone?',
 u'How often do you take part in an exercise class?',
 u'How often do you play any kind of team sport?']