Inspecting variables¶
Querying and slicing case data¶
A qp.DataSet mimics pandas-like item access, i.e. passing a variable name into
the []-accessor will return a pandas.DataFrame view of the case data component.
That means that we can chain any pandas.DataFrame method to the query:
>>> ds['q9'].head()
q9
0 99;
1 1;4;
2 98;
3 1;4;
4 99;
There is the same support for selecting multiple variables at once:
>>> ds[['q9', 'gender']].head()
q9 gender
0 99; 1
1 1;4; 2
2 98; 1
3 1;4; 1
4 99; 1
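The q9 values above are delimited set responses: each cell is a string of codes, every code terminated by a semicolon. When such a cell has to be handled outside of Quantipy, it can be split into integer codes. The following is a minimal sketch in plain Python; the helper name parse_delimited_set is hypothetical and not part of the Quantipy API.

```python
def parse_delimited_set(cell):
    """Split a delimited set cell such as '1;4;' into a list of integer codes."""
    # Missing responses (None or NaN) carry no codes at all.
    # NaN is the only value that is not equal to itself.
    if cell is None or cell != cell:
        return []
    # The trailing semicolon leaves an empty last chunk, which is skipped.
    return [int(code) for code in cell.split(';') if code]


rows = ['99;', '1;4;', '98;', '1;4;', '99;']  # the q9 values shown above
parsed = [parse_delimited_set(cell) for cell in rows]
print(parsed)  # [[99], [1, 4], [98], [1, 4], [99]]
```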
To integrate array (masks) variables into this behaviour, passing an array name
will automatically call its item list:
>>> ds['q6'].head()
q6_1 q6_2 q6_3
0 1 1 1
1 1 NaN 1
2 1 NaN 2
3 2 NaN 2
4 2 10 10
This can be combined with the list-based selection as well:
>>> ds[['q6', 'q9', 'gender']].head()
q6_1 q6_2 q6_3 q9 gender
0 1 1 1 99; 1
1 1 NaN 1 1;4; 2
2 1 NaN 2 98; 1
3 2 NaN 2 1;4; 1
4 2 10 10 99; 1
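The expansion of an array name into its item columns can be pictured as a simple lookup. This is an illustrative sketch only: the array_items mapping and the expand_names helper are hypothetical stand-ins, not Quantipy internals.

```python
# Hypothetical lookup of array names to their item columns
# (the item names are taken from the output above).
array_items = {'q6': ['q6_1', 'q6_2', 'q6_3']}


def expand_names(names, array_items):
    """Replace every array name by its item columns, keeping the requested order."""
    expanded = []
    for name in names:
        # Plain columns are not in the lookup and are passed through unchanged.
        expanded.extend(array_items.get(name, [name]))
    return expanded


columns = expand_names(['q6', 'q9', 'gender'], array_items)
print(columns)  # ['q6_1', 'q6_2', 'q6_3', 'q9', 'gender']
```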
DataSet case data supports row-slicing based on complex logical conditions
to inspect subsets of the data. We can use the take() method with a Quantipy
logic operation for this:
>>> condition = intersection(
... [{'gender': [1]},
... {'religion': [3]},
... {'q9': [1, 4]}])
>>> take = ds.take(condition)
>>> ds[take, ['gender', 'religion', 'q9']].head()
gender religion q9
52 1 3 1;2;4;
357 1 3 1;3;4;
671 1 3 1;3;4;
783 1 3 2;3;4;
802 1 3 4;
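To make the condition above concrete, the following sketch rebuilds it with plain Python sets. It assumes, judging from the result shown, that a list of codes such as [1, 4] matches a case as soon as any one of the codes is present, and that intersection() keeps only the cases matching every sub-condition. The sample cases and the helper matching_rows are made up for illustration.

```python
# Made-up cases: row index -> {variable: list of codes given}
cases = {
    0: {'gender': [1], 'religion': [3], 'q9': [1, 2, 4]},
    1: {'gender': [2], 'religion': [3], 'q9': [1, 4]},
    2: {'gender': [1], 'religion': [1], 'q9': [4]},
    3: {'gender': [1], 'religion': [3], 'q9': [4]},
    4: {'gender': [1], 'religion': [3], 'q9': [99]},
}


def matching_rows(cases, variable, codes):
    """Return the row indices where the variable holds at least one of the codes."""
    return set(idx for idx, answers in cases.items()
               if set(answers[variable]) & set(codes))


conditions = [{'gender': [1]}, {'religion': [3]}, {'q9': [1, 4]}]

# Intersect the matches of all sub-conditions.
selected = None
for condition in conditions:
    for variable, codes in condition.items():
        rows = matching_rows(cases, variable, codes)
        selected = rows if selected is None else selected & rows

print(sorted(selected))  # [0, 3]
```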
See also
Please find an overview of Quantipy logical operators and data slicing and
masking in the docs about complex logical conditions!
Variable and value existence¶
any, all, code_count, is_nan, var_exists, codes_in_data, is_like_numeric variables
We can use variables() and var_exists() to test the membership of variables
inside a DataSet. The former shows the list of all variables registered inside
the 'data file' set, the latter checks if a variable’s name is found in either
the 'columns' or 'masks' collection. For our example data, the variables are:
>>> dataset.variables()
So a test for the array 'q5' should be positive:
>>> dataset.var_exists('q5')
True
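A membership test of this kind boils down to looking the name up in both collections. A minimal sketch, assuming a heavily simplified meta dictionary; the dictionary contents and the helper name name_exists are illustrative and do not reproduce the actual DataSet implementation.

```python
# Heavily simplified, hypothetical stand-in for the meta component.
meta = {
    'columns': {'gender': {}, 'q5_1': {}, 'q5_2': {}},
    'masks': {'q5': {}},
}


def name_exists(meta, name):
    """True if the name is registered as a column or as a mask (array)."""
    return name in meta['columns'] or name in meta['masks']


print(name_exists(meta, 'q5'))   # True, found in 'masks'
print(name_exists(meta, 'q99'))  # False, found in neither collection
```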
In addition to Quantipy’s complex logic operators, the DataSet class offers
some quick case data operations for code existence tests. To return a
pandas.Series flagging all empty rows inside a variable, use is_nan():
>>> dataset.is_nan('q8').head()
0 True
1 True
2 True
3 True
4 True
Name: q8, dtype: bool
Which we can also use to quickly check the number of missing cases…
>>> dataset.is_nan('q8').value_counts()
True 5888
False 2367
Name: q8, dtype: int64
… as well as use the result as a slicer for the DataSet case data component,
e.g. to show the non-empty rows:
>>> slicer = dataset.is_nan('q8')
>>> dataset[~slicer, 'q8'].head()
7 5;
11 5;
13 1;4;
14 4;5;
23 1;4;
Name: q8, dtype: object
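The mask-then-invert pattern used above can be written out in plain Python to show what happens at each step. The responses below are made up, with None marking an empty row; nothing here calls into Quantipy or pandas.

```python
from collections import Counter

# Made-up q8 responses; None marks a case without an answer.
q8 = [None, None, '5;', None, '1;4;']

# Step 1: flag the empty rows (the role of is_nan()).
is_empty = [cell is None for cell in q8]

# Step 2: count flagged vs. unflagged rows (the role of value_counts()).
counts = Counter(is_empty)

# Step 3: invert the mask to keep only the answered rows (the role of ~slicer).
non_empty = [(idx, cell) for idx, cell in enumerate(q8) if not is_empty[idx]]

print(counts[True], counts[False])  # 3 2
print(non_empty)                    # [(2, '5;'), (4, '1;4;')]
```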
Especially useful for delimited set and array data, the code_count() method
creates a pandas.Series holding the number of response values found per case.
If applied on an array, the result is expressed across all source item variables:
>>> dataset.code_count('q6').value_counts()
3 5100
2 3155
dtype: int64
… which means that not all cases contain answers in all three of the array’s items.
With some basic pandas we can double-check this result:
>>> pd.concat([dataset['q6'], dataset.code_count('q6')], axis=1).head()
q6_1 q6_2 q6_3 0
0 1 1.0 1 3
1 1 NaN 1 2
2 1 NaN 2 2
3 2 NaN 2 2
4 2 10.0 10 3
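The per-case count for an array can be reproduced by hand: for every row, count the items that hold an answer. The sketch below does this for the five rows printed above, using plain Python; count_answers is a hypothetical helper, not the code_count() implementation.

```python
import math

# The first five rows of the array items q6_1, q6_2, q6_3 as shown above.
rows = [
    (1, 1.0, 1),
    (1, float('nan'), 1),
    (1, float('nan'), 2),
    (2, float('nan'), 2),
    (2, 10.0, 10),
]


def count_answers(row):
    """Count the items of one case that hold an answer (i.e. are not NaN)."""
    return sum(1 for value in row if not math.isnan(value))


counts = [count_answers(row) for row in rows]
print(counts)  # [3, 2, 2, 2, 3]
```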
code_count() can optionally ignore certain codes via the count_only and
count_not parameters:
>>> q2_count = dataset.code_count('q2', count_only=[1, 2, 3])
>>> pd.concat([dataset['q2'], q2_count], axis=1).head()
q2 0
0 1;2;3;5; 3
1 3;6; 1
2 NaN 0
3 NaN 0
4 NaN 0
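The effect of the two filters can be sketched for a single delimited set cell. The count_only branch reproduces the numbers shown above; the count_not branch assumes the mirrored behaviour (the listed codes are left out of the count), which the example above does not demonstrate. The helper count_codes is hypothetical.

```python
def count_codes(cell, count_only=None, count_not=None):
    """Count the codes in a delimited set cell, optionally filtered."""
    if cell is None:
        return 0  # an empty response contains no codes
    codes = [int(code) for code in cell.split(';') if code]
    if count_only is not None:
        # Keep only the codes of interest.
        codes = [code for code in codes if code in count_only]
    if count_not is not None:
        # Assumption: the listed codes are excluded from the count.
        codes = [code for code in codes if code not in count_not]
    return len(codes)


q2 = ['1;2;3;5;', '3;6;', None]
only = [count_codes(cell, count_only=[1, 2, 3]) for cell in q2]
without = [count_codes(cell, count_not=[1, 2, 3]) for cell in q2]
print(only)     # [3, 1, 0]
print(without)  # [1, 1, 0]
```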
Similarly, the any() and all() methods yield slicers for cases obeying
the condition that at least one / all of the provided codes are found in the
response. Again, for array variables the conditions are extended across all
the items:
>>> dataset[dataset.all('q6', 5), 'q6']
q6_1 q6_2 q6_3
374 5 5.0 5
2363 5 5.0 5
2377 5 5.0 5
4217 5 5.0 5
5530 5 5.0 5
5779 5 5.0 5
5804 5 5.0 5
6328 5 5.0 5
6774 5 5.0 5
7269 5 5.0 5
8148 5 5.0 5
>>> dataset[dataset.all('q8', [1, 2, 3, 4, 96]), 'q8']
845 1;2;3;4;5;96;
6242 1;2;3;4;96;
7321 1;2;3;4;96;
Name: q8, dtype: object
>>> dataset[dataset.any('q8', [1, 2, 3, 4, 96]), 'q8'].head()
13 1;4;
14 4;5;
23 1;4;
24 1;3;4;
25 1;4;
Name: q8, dtype: object
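The difference between the two tests is easiest to see on individual delimited set cells. The sketch below uses set operations in plain Python; contains_any and contains_all are hypothetical helper names chosen for this illustration, and array variables are not covered.

```python
def codes_of(cell):
    """Turn a delimited set cell such as '1;4;' into a set of integer codes."""
    return set(int(code) for code in cell.split(';') if code)


def contains_any(cell, codes):
    """True if at least one of the codes was given in the response."""
    return bool(codes_of(cell) & set(codes))


def contains_all(cell, codes):
    """True if every one of the codes was given in the response."""
    return set(codes) <= codes_of(cell)


wanted = [1, 2, 3, 4, 96]
results = [(cell, contains_any(cell, wanted), contains_all(cell, wanted))
           for cell in ['1;2;3;4;5;96;', '1;4;', '5;']]
for cell, any_hit, all_hit in results:
    print('%s any=%s all=%s' % (cell, any_hit, all_hit))
```

Extra codes in the response (such as the 5 in the first cell) do not prevent an "all" match, which is consistent with row 845 in the output above.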
Variable types¶
To get a summary of all variables grouped by type, call by_type() on
the DataSet:
>>> ds.by_type()
size: 8255  single     delimited set  array  int            float     string  date        time      N/A
0           gender     q2             q5     record_number  weight    q8a     start_time  duration
1           locality   q3             q7     unique_id      weight_a  q9a     end_time
2           ethnicity  q8             q6     age            weight_b
3           religion   q9                    birth_day
4           q1                               birth_month
5           q2b                              birth_year
6           q4
7           q5_1
8           q5_2
9           q5_3
10          q5_4
11          q5_5
12          q5_6
13          q6_1
14          q6_2
15          q6_3
16          q7_1
17          q7_2
18          q7_3
19          q7_4
20          q7_5
21          q7_6
We can restrict the output to certain types by providing the desired ones in
the types parameter:
>>> ds.by_type(types='delimited set')
size: 8255 delimited set
0 q2
1 q3
2 q8
3 q9
>>> ds.by_type(types=['delimited set', 'float'])
size: 8255 delimited set float
0 q2 weight
1 q3 weight_a
2 q8 weight_b
3 q9 NaN
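Conceptually, such a summary groups variable names by their type and optionally filters the groups. The following sketch shows the idea on a small, hypothetical name-to-type mapping; it is not how by_type() is implemented, and group_by_type is an invented helper.

```python
from collections import defaultdict

# Hypothetical, heavily simplified mapping of variable name -> type.
variable_types = {
    'gender': 'single',
    'q2': 'delimited set',
    'q3': 'delimited set',
    'weight': 'float',
    'weight_a': 'float',
    'age': 'int',
}


def group_by_type(variable_types, types=None):
    """Group variable names by type, optionally restricted to the given types."""
    if isinstance(types, str):
        types = [types]  # accept a single type name as well as a list
    grouped = defaultdict(list)
    for name in sorted(variable_types):
        var_type = variable_types[name]
        if types is None or var_type in types:
            grouped[var_type].append(name)
    return dict(grouped)


one_type = group_by_type(variable_types, types='delimited set')
two_types = group_by_type(variable_types, types=['delimited set', 'float'])
print(one_type)
print(two_types)
```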
In addition to that, DataSet implements the following methods
that return the corresponding variables as a list for easy iteration:
DataSet.singles()
.delimited_sets()
.ints()
.floats()
.dates()
.strings()
.masks()
.columns()
.sets()
>>> ds.delimited_sets()
[u'q3', u'q2', u'q9', u'q8']
>>> for delimited_set in ds.delimited_sets():
... print delimited_set
q3
q2
q9
q8
Slicing & dicing metadata objects¶
Although it is possible to access a DataSet meta component via its _meta
attribute directly, the preferred way to inspect and interact with the metadata
is to use DataSet methods. For instance, the easiest way to view the most
important meta on a variable is to use the meta() method:
>>> ds.meta('q8')
delimited set codes texts missing
q8: Which of the following do you regularly skip?
1 1 Breakfast None
2 2 Mid-morning snacking None
3 3 Lunch None
4 4 Mid-afternoon snacking None
5 5 Dinner None
6 96 None of them None
7 98 Don't know (it varies a lot) None
This output is extended with the item metadata if an array is passed:
>>> ds.meta('q6')
single  items  item texts                   codes  texts                        missing
q6: How often do you take part in any of the fo...
1       q6_1   Exercise alone               1      Once a day or more often     None
2       q6_2   Join an exercise class       2      Every few days               None
3       q6_3   Play any kind of team sport  3      Once a week                  None
4                                           4      Once a fortnight             None
5                                           5      Once a month                 None
6                                           6      Once every few months        None
7                                           7      Once every six months        None
8                                           8      Once a year                  None
9                                           9      Less often than once a year  None
10                                          10     Never                        None
If the variable is not categorical, meta() simply returns:
>>> ds.meta('weight_a')
float
weight_a: Weight (variant A) N/A
DataSet also provides a number of methods to access and return the individual
meta objects of a variable to make various data processing tasks easier:
Variable labels: quantipy.core.dataset.DataSet.text()
>>> ds.text('q8', text_key=None)
Which of the following do you regularly skip?
values object: quantipy.core.dataset.DataSet.values()
>>> ds.values('gender', text_key=None)
[(1, u'Male'), (2, u'Female')]
Category codes: quantipy.core.dataset.DataSet.codes()
>>> ds.codes('gender')
[1, 2]
Category labels: quantipy.core.dataset.DataSet.value_texts()
>>> ds.value_texts('gender', text_key=None)
[u'Male', u'Female']
items object: quantipy.core.dataset.DataSet.items()
>>> ds.items('q6', text_key=None)
[(u'q6_1', u'How often do you exercise alone?'),
(u'q6_2', u'How often do you take part in an exercise class?'),
(u'q6_3', u'How often do you play any kind of team sport?')]
Item 'columns' sources: quantipy.core.dataset.DataSet.sources()
>>> ds.sources('q6')
[u'q6_1', u'q6_2', u'q6_3']
Item labels: quantipy.core.dataset.DataSet.item_texts()
>>> ds.item_texts('q6', text_key=None)
[u'How often do you exercise alone?',
u'How often do you take part in an exercise class?',
u'How often do you play any kind of team sport?']
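The lists returned by these methods combine well with plain Python. As a small sketch, the results shown above for 'gender' are copied in by hand (without the Python 2 u'' prefix) and used to translate raw codes into labels; the raw_codes list is made up.

```python
# Results copied from the examples above.
values = [(1, 'Male'), (2, 'Female')]   # ds.values('gender')
codes = [1, 2]                          # ds.codes('gender')
value_texts = ['Male', 'Female']        # ds.value_texts('gender')

# The values object pairs each code with its label ...
pairs = list(zip(codes, value_texts))

# ... which makes it easy to build a code -> label lookup.
code_to_label = dict(values)

raw_codes = [1, 2, 2, 1]  # made-up case data for 'gender'
labels = [code_to_label[code] for code in raw_codes]
print(labels)  # ['Male', 'Female', 'Female', 'Male']
```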