Release notes¶
Latest (09/04/2019)¶
New: Nesting in Batch.add_crossbreak()
Nested crossbreaks can be defined for Excel deliverables; the nesting is expressed as "var1 > var2". Nesting in more than two levels is available ("var1 > var2 > var3 > ..."), but nesting a group of variables ("var1 > (var2, var3)") is NOT supported.
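For instance, a nested crossbreak request could look like this (a minimal sketch; the Batch 'excel batch' and the variables q1, q2, gender and locality are assumed to exist):
>>> b = dataset.get_batch('excel batch')
>>> b.add_downbreak(['q1', 'q2'])
>>> b.add_crossbreak(['gender', 'gender > locality'])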
New: Leveling
Running Batch.level(array, levels={}) gives the option to aggregate leveled arrays. If no levels are provided, the Batch.yks are taken automatically.
New: DataSet.used_text_keys()
This new method loops over the text objects in DataSet._meta and returns all text_keys found.
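For example (the returned keys are illustrative):
>>> dataset.used_text_keys()
['en-GB', 'sv-SE']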
Update: Batch (transposed) summaries
As announced a while ago, Batch.make_summaries() is now fully deprecated and raises a NotImplementedError. By default, all arrays in the downbreak list are added to the Batch.x_y_map. The array-exclusive functionality (add the array, but skip its items) is now supported by the new method Batch.exclusive_arrays().
Additionally, Batch.transpose_array() is deprecated. Instead, Batch.transpose() is available, which no longer supports replace, because the "normal" arrays always need to be included. If the summaries are not requested in the deliverables, they can be hidden in the ChainManager.
Archived release notes¶
sd (14/01/2019)¶
New: Chain.export() / assign() and custom calculations
Expanding on the current Chain editing features provided via cut() and join(), it is now possible to calculate additional row and column results using plain pandas.DataFrame methods. Use Chain.export() to work on a simplified Chain.dataframe and assign() to rebuild it properly when finished.
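A minimal sketch of the intended round trip, assuming chain is an existing qp.Chain and that a row labelled 'mean' exists in its simplified dataframe (the label 'mean doubled' is purely illustrative):
>>> df = chain.export()                            # simplified pd.DataFrame copy
>>> df.loc['mean doubled'] = df.loc['mean'] * 2    # any plain pandas calculation
>>> chain.assign(df)                               # rebuild the Chain properly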
New: Batch.as_main(keep=True) to change qp.Batch relations
It is now possible to promote an .additional Batch to a main/regular one. Optionally, the original parent Batch can be erased by setting keep=False. Example:
Starting from:
>>> dataset.batches(main=True, add=False)
['batch 2', 'batch 5']
>>> dataset.batches(main=False, add=True)
['batch 4', 'batch 3', 'batch 1']
We turn batch 3
into a normal one:
>>> b = dataset.get_batch('batch 3')
>>> b.as_main()
>>> dataset.batches(main=True, add=False)
['batch 2', 'batch 5', 'batch 3']
>>> dataset.batches(main=False, add=True)
['batch 4', 'batch 1']
New: On-the-fly rebasing via Quantity.normalize(on='y', per_cell=False)
Quantipy's engine will now accept another variable's base for (column) percentage computations. Furthermore, it is possible to rebase the percentages on the cell frequencies of the other variable's cross-tabulation by setting per_cell=True, i.e. rebase variables with identical categories on their respective per-category results. The following example shows how the 'A1' results serve as cell bases for the percentages of 'A2':
>>> l = stack[stack.keys()[0]]['no_filter']['A1']['datasource']
>>> q = qp.Quantity(l)
>>> q.count()
Question datasource
Values All 1 2 3 4 5 6
Question Values
A1 All 6984.0 767.0 1238.0 2126.0 836.0 1012.0 1005.0
1 1141.0 503.0 78.0 109.0 102.0 155.0 194.0
2 2716.0 615.0 406.0 499.0 499.0 394.0 303.0
3 1732.0 603.0 89.0 128.0 101.0 404.0 407.0
4 5391.0 644.0 798.0 1681.0 655.0 796.0 817.0
5 4408.0 593.0 177.0 1649.0 321.0 818.0 850.0
6 3584.0 615.0 834.0 834.0 327.0 507.0 467.0
7 4250.0 588.0 724.0 1717.0 540.0 55.0 626.0
8 3729.0 413.0 1014.0 788.0 311.0 539.0 664.0
9 3575.0 496.0 975.0 270.0 699.0 230.0 905.0
10 4074.0 582.0 910.0 1148.0 298.0 861.0 275.0
11 2200.0 446.0 749.0 431.0 177.0 146.0 251.0
12 5554.0 612.0 987.0 1653.0 551.0 860.0 891.0
13 544.0 40.0 107.0 232.0 87.0 52.0 26.0
>>> l = stack[stack.keys()[0]]['no_filter']['A2']['datasource']
>>> q = qp.Quantity(l)
>>> q.count()
Question datasource
Values All 1 2 3 4 5 6
Question Values
A2 All 6440.0 727.0 1131.0 1894.0 749.0 960.0 979.0
1 568.0 306.0 34.0 32.0 48.0 63.0 85.0
2 1135.0 417.0 107.0 88.0 213.0 175.0 135.0
3 975.0 426.0 43.0 49.0 49.0 220.0 188.0
4 2473.0 350.0 267.0 599.0 431.0 404.0 422.0
5 2013.0 299.0 88.0 573.0 162.0 417.0 474.0
6 1174.0 342.0 219.0 183.0 127.0 135.0 168.0
7 1841.0 355.0 161.0 754.0 285.0 21.0 265.0
8 1740.0 265.0 376.0 327.0 160.0 212.0 400.0
9 1584.0 181.0 390.0 89.0 398.0 94.0 432.0
10 1655.0 257.0 356.0 340.0 137.0 443.0 122.0
11 766.0 201.0 241.0 101.0 76.0 53.0 94.0
12 2438.0 217.0 528.0 497.0 247.0 459.0 490.0
13 1532.0 72.0 286.0 685.0 118.0 183.0 188.0
>>> q.normalize(on='A1', per_cell=True)
Question datasource
Values All 1 2 3 4 5 6
Question Values
A2 All 92.210767 94.784876 91.357027 89.087488 89.593301 94.861660 97.412935
1 49.780894 60.834990 43.589744 29.357798 47.058824 40.645161 43.814433
2 41.789396 67.804878 26.354680 17.635271 42.685371 44.416244 44.554455
3 56.293303 70.646766 48.314607 38.281250 48.514851 54.455446 46.191646
4 45.872751 54.347826 33.458647 35.633551 65.801527 50.753769 51.652387
5 45.666969 50.421585 49.717514 34.748332 50.467290 50.977995 55.764706
6 32.756696 55.609756 26.258993 21.942446 38.837920 26.627219 35.974304
7 43.317647 60.374150 22.237569 43.913803 52.777778 38.181818 42.332268
8 46.661303 64.164649 37.080868 41.497462 51.446945 39.332096 60.240964
9 44.307692 36.491935 40.000000 32.962963 56.938484 40.869565 47.734807
10 40.623466 44.158076 39.120879 29.616725 45.973154 51.451800 44.363636
11 34.818182 45.067265 32.176235 23.433875 42.937853 36.301370 37.450199
12 43.896291 35.457516 53.495441 30.066546 44.827586 53.372093 54.994388
13 281.617647 180.000000 267.289720 295.258621 135.632184 351.923077 723.076923
New: DataSet.missings(name=None)
This new method returns the missing data definition for the provided variable or all missing definitions found in the dataset (if name is omitted).
>>> dataset.missings()
{u'q10': {u'exclude': [6]},
u'q11': {u'exclude': [977]},
u'q17': {u'exclude': [977]},
u'q23_1_new': {u'exclude': [8]},
u'q25': {u'exclude': [977]},
u'q32': {u'exclude': [977]},
u'q38': {u'exclude': [977]},
u'q39': {u'exclude': [977]},
u'q48': {u'exclude': [977]},
u'q5': {u'exclude': [977]},
u'q9': {u'exclude': [977]}}
Update: DataSet.batches(main=True, add=False)
The collection of Batch sets can be separated into main and additional ones (see above) to make analyzing Batch setups and relations easier. The default is still to return all Batch names.
Bugfix: Stack, other_source statistics failing for delimited sets
A bug that prevented other_source statistics from being computed on delimited set type variables has been resolved by adjusting the underlying data type checking mechanic.
sd (26/10/2018)¶
New: Filter variables in DataSet and Batch
To avoid complex logics being stored in the background (and the resulting problems with json serializing), the filter concept in DataSet and Batch has changed. Actual variables with the property recoded_filter are now added to the data and meta. Their values depend on the included logic, and all logics are summarized in the value 0: keep. Because of this, an easy logic can be used in several places in qp: {'filter_var': 0}
DataSet methods
All filters included in a DataSet can be shown by running dataset.filters().
A filter variable can easily be created with dataset.add_filter_var(name, logic, overwrite=False):
name is the name of the new filter variable.
logic should be (a list of) dictionaries in the form of:
>>> {
...  'label': 'reason',
...  'logic': {var: keys} / intersection / ...
... }
or strings (var_name), which are automatically transformed into the following dict:
>>> {
...  'label': 'var_name not empty',
...  'logic': {var_name: not_count(0)}
... }
If a list is provided, each item results in an own value of the filter variable.
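A small sketch of creating and listing such a filter (the variables gender and age, the helper frange and the returned list are assumptions for illustration):
>>> logic = [{'label': 'men', 'logic': {'gender': [1]}},
...          {'label': '18-25', 'logic': {'age': frange('18-25')}}]
>>> dataset.add_filter_var('myfilter', logic)
>>> dataset.filters()
['myfilter']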
An existing filter variable can also be extended with dataset.extend_filter_var(name, logic, extend_as=None):
name is the name of the existing filter variable.
logic should be the same as above; the additional categories are added to the filter and the 0 value is recalculated.
extend_as determines if a new filter variable is created or the initial variable is modified. If extend_as=None, the variable is modified inplace. Otherwise extend_as is used as suffix for the new filter variable.
Known methods like .copy(), .drop() and .rename() can be applied to filter variables; all others are not valid!
Batch methods
Batch.add_filter(filter_name, filter_logic=None, overwrite=False)
A filter can still be added to a Batch by providing a filter_logic, but it is also possible to pass only the filter_name of an existing filter variable. If filter_name is an existing filter variable, a filter_logic is provided and overwrite is turned off, the script will raise an error.
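Both usages could look like this (a sketch; 'myfilter' refers to an existing filter variable and 'gender' is assumed):
>>> batch.add_filter('myfilter')                   # reuse an existing filter variable
>>> batch.add_filter('men only', {'gender': [1]})  # or create a new one from a logic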
Batch.remove_filter()
This method only removes filters from the Batch definitions; the created filter variables still exist in the belonging DataSet object.
Batch methods that use filters, i.e. .extend_filter(), .add_y_on_y() and .add_open_ends(), create new extended filter variables if the used filter differs from the Batch's global filter. So it is recommended to add the global filter first; it is taken over automatically for the mentioned methods.
New: Summarizing and rearranging qp.Chain elements via ChainManager
cut(values, ci=None, base=False, tests=False)
join(title='Summary')
It is now possible to summarize View aggregation results from existing Chain items by restructuring and editing them via their ChainManager methods. The general idea behind building a summary Chain is to unify a set of results into one cohesive representation to offer an easy way to look at certain key figures of interest in comparison to each other. To achieve this, the ChainManager class has gained the new cut() and join() methods. Summaries are built post-aggregation and therefore rely on what has been defined (via the qp.Batch class) and computed (via the qp.Stack methods) at previous stages.
The intended way of working with this new feature can be outlined as reorder() -> cut() -> join() -> insert().
In more detail:
A) Grouping the results for the summary
Both methods will operate on the entire set of Chains collected in a ChainManager, so building a summary Chain will normally start with restricting a copy of an existing ChainManager to the question variables that you're interested in. This can be done via clone() with reorder(..., inplace=True) or by assigning back the new instance from reorder(..., inplace=False).
B) Selecting View results via cut()
This method lets you target the kind of results (nets, means, NPS scores, only the frequencies, etc.) from a given qp.Chain.dataframe. Elements must be targeted by their underlying regular index values, e.g. 'net_1', 'net_2', 'mean', 1, 'calc', etc. Use the base and tests parameters to also carry over the matching base rows and/or significance testing results. The ci parameter additionally allows targeting only the 'counts' or 'c%' results if cell items are grouped together.
C) Unifying the individual results with join()
Merging all new results into one, the join() method concatenates vertically and relabels the x-axis to separate all variable results by their matching metadata text that has also been applied while creating the regular set of Chain items. The new summary can then also be inserted back into its originating ChainManager with insert() if desired.
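Put together, a summary build could look roughly like this (a sketch under assumptions: cm is an existing ChainManager and the variables as well as the index values 'net_1' and 'mean' exist in its aggregations):
>>> summary = cm.clone()
>>> summary.reorder(['q1', 'q2', 'q3'], inplace=True)   # restrict to the key variables
>>> summary.cut(['net_1', 'mean'], ci='c%', base=True)  # keep nets, means and base rows
>>> summary.join(title='Key figures')                   # concatenate into one summary Chain
The joined result can then be placed back into its originating ChainManager via insert() if it should travel with the regular Chain collection.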
Update: Batch.add_variables(varlist)
A qp.Batch can now carry a collection of variables that is explicitly not directed towards any table-like builds. Variables from varlist will solely be used in non-aggregation based, data transformation and export oriented applications. To make this distinction more visible in the API, add_x() and add_y() have been renamed to add_downbreak() and add_crossbreak(). Users are warned and advised to switch to the new method versions via a DeprecationWarning. In a future version of the library add_x() and add_y() will be removed.
Update: Batch.copy() -> Batch.clone()
Since qp.Batch is a subclass of qp.DataSet, the copy() method is renamed to Batch.clone().
sd (01/10/2018)¶
New: “rewrite” of the Rules module (affecting sorting):
Sorting “normal” columns:
- sort_on: always ‘@’
- fix: any categories
- sort_by_weight: the default is unweighted (None), but each weight (included in the data) can be used. If sort_by_weight and the view-weight differ, a warning is shown.
Sorting “expanded net” columns:
- sort_on: always ‘@’
- fix: any categories
- sorting within or between net groups is available
- sort_by_weight: by default the weight of the first found expanded-net-view is taken. Only weights of aggregated net-views are possible.
Sorting “array summaries”:
- sort_on: can be any desc (‘median’, ‘stddev’, ‘sem’, ‘max’, ‘min’, ‘mean’, ‘upper_q’, ‘lower_q’) or net (‘net_1’, ‘net_2’, … enumerated by the net_def)
- sort_by_weight: by default the weight of the first found desc/net-view is taken. Only weights of aggregated desc/net-views are possible.
- sort_on can also be any category; here each weight can be used to sort on.
New: DataSet.min_value_count()
A new wrapper for DataSet.hiding() is included. All values that have fewer counts than the included number min are hidden. The used data can be weighted or filtered using the parameters weight and condition.
Usage as Batch method:
Batch.min_value_count() without the parameters weight and condition automatically grabs Batch.weights[0] and Batch.filter to calculate low value counts.
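As a DataSet method this might be used like so (a sketch; the variable 'q1', the weight name and the cut-off are assumptions):
>>> dataset.min_value_count('q1', min=50)                     # hide categories counted fewer than 50 times
>>> dataset.min_value_count('q1', min=50, weight='weight_a',
...                         condition={'gender': 1})          # weighted / filtered variant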
New: Prevent weak duplicates in data
As Python is case sensitive, it is possible to have two or more variables with the same name but in lower- and uppercase. Most other software does not support that, so a warning is shown if a weak dupe is created. Additionally, DataSet.write_dimensions() performs auto-renaming if weak dupes are detected.
New: Prevent single-cat delimited sets
DataSet.add_meta(..., qtype='delimited set', categories=[...], ...) automatically switches qtype to single if only one category is defined.
DataSet.convert(name, 'single') allows conversion from delimited set to single if the variable has only one category.
DataSet.repair() and DataSet.remove_values() convert delimited sets automatically to singles if only one category is included.
Update: merge warnings + merging delimited sets
Warnings in hmerge() and vmerge() are updated. If a column exists in both the left and the right dataset, the types are compared. Some type inconsistencies are allowed but return a warning, while others end up in a raise.
Delimited sets in vmerge():
If a column is a delimited set in the left dataset, but a single, int or float in the right dataset, the data of the right column is converted into a delimited set.
Delimited sets in hmerge(..., merge_existing=None):
For hmerge a new parameter merge_existing is included, which can be None, a list of variable names or 'all'.
If delimited sets are included in the left and right dataset:
- merge_existing=None: Only meta is adjusted. Data is untouched (left data is taken).
- merge_existing='all': Meta and data are merged for all delimited sets that are included in both datasets.
- merge_existing=[variable-names]: Meta and data are merged for all delimited sets that are listed and included in both datasets.
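For example (a sketch; dataset_r, the id column 'identity' and the delimited set 'q_multi' are assumptions):
>>> dataset.hmerge(dataset_r, on='identity', merge_existing=['q_multi'])  # merge meta + data for 'q_multi'
>>> dataset.hmerge(dataset_r, on='identity', merge_existing='all')        # or for all shared delimited sets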
Update: encoding in DataSet.get_batch(name)
The method is not as encoding sensitive anymore. It returns the corresponding Batch, no matter if '...', u'...' or '...'.decode('utf8') is passed as name.
Update: warning in weight engine
Missing codes in the sample are only alerted if the corresponding target is not 0.
Update: DataSet.to_array(..., variables, ...)
Duplicated vars in variables are not allowed anymore, as they were causing problems in the ChainManager class.
Update: Batch.add_open_ends()
The method raises an error if no vars are included in oe and break_by. The empty dataframe was causing issues in the ChainManager class.
Update: Batch.extend_x()
The method automatically checks if the included variables are arrays and adds them to Batch.summaries if they are not included yet.
sd (04/06/2018)¶
New: Additional variable (names) “getter”-like and resolver methods
DataSet.created()
DataSet.find(str_tags=None, suffixed=False)
DataSet.names()
DataSet.resolve_name()
A bunch of new methods enhancing the options of finding and testing for variable
names have been added. created()
will list all variables that have been added
to a dataset using core functions, i.e. add_meta()
and derive()
, resp.
all helper methods that use them internally (as band()
or categorize()
do
for instance).
The find()
method is returning all variable names that contain any of the
provided substrings in str_tags
. To only consider names that end with these
strings, set suffixed=True
. If no str_tags
are passed, the method will
use a default list of tags including ['_rc', '_net', ' (categories', ' (NET', '_rec']
.
Sometimes a dataset might contain “semi-duplicated” names, variables that differ
in respect to case sensitivity but have otherwise identical names. Calling
names()
will report such cases in a pd.DataFrame
that lists all name
variants under the respective str.lower()
version. If no semi-duplicates
are found, names()
will simply return DataSet.variables()
.
Lastly, resolve_name()
can be used to return the “proper”, existing representation(s) of a given variable name’s spelling.
New: Batch.remove()
Batches that are no longer needed can be removed from the meta, so they are not aggregated anymore.
New: Batch.rename(new_name)
Sometimes standard batches have long/complex names. They can now be changed into a custom name. Please take into account that for most hubs the name of omnibus batches should look like 'client ~ topic'.
Update: Handling verbatims in qp.Batch
Instead of holding the well-prepared open-end dataframe in batch.verbatims, the attribute is now filled by batch.add_open_ends() with instructions to create the open-end dataframe. This makes it easier to modify/overwrite existing verbatims. Therefore a new parameter overwrite=True is also included.
Update: Batch.copy(..., b_filter=None, as_addition=False)
It is now possible to define an additional filter for a copied batch and also to set it as addition to the master batch.
Update: Regrouping the variable list using DataSet.order(..., regroup=True)
A new parameter called regroup
will instruct reordering all newly created
variables into their logical position of the dataset’s main variable order, i.e.
attempting to place derived variables after the originating ones.
Bugfix: add_meta() and duplicated categorical values codes
Providing duplicated numerical codes while attempting to create new metadata using add_meta() will now correctly raise a ValueError to prevent corrupting the DataSet.
>>> cats = [(1, 'A'), (2, 'B'), (3, 'C'), (3, 'D'), (2, 'AA')]
>>> dataset.add_meta('test_var', 'single', 'test label', cats)
ValueError: Cannot resolve category definition due to code duplicates: [2, 3]
sd (04/04/2018)¶
New: Emptiness handlers in DataSet
and Batch
classes
DataSet.empty(name, condition=None)
DataSet.empty_items(name, condition=None, by_name=True)
DataSet.hide_empty_items(condition=None, arrays=None)
Batch.hide_empty(xks=True, summaries=True)
empty()
is used to test if regular variables are completely empty,
empty_items()
checks the same for the items of an array mask definition.
Both can be run on lists of variables. If a single variable is tested, the former
returns simply boolean, the latter will list all empty items. If lists are checked,
empty()
returns the sublist of empty variables, empty_items()
is mapping
the list of empty items per array name. The condition
parameter of these
methods takes a Quantipy logic
expression to restrict the test to a subset
of the data, i.e. to check if variables will be empty if the dataset is filtered
a certain way. A very simple example:
>>> dataset.add_meta('test_var', 'int', 'Variable is empty')
>>> dataset.empty('test_var')
True
>>> dataset[dataset.take({'gender': 1}), 'test_var'] = 1
>>> dataset.empty('test_var')
False
>>> dataset.empty('test_var', {'gender': 2})
True
The DataSet method hide_empty_items() uses the emptiness tests to automatically apply a hiding rule on all empty items found in the dataset. To restrict this to specific arrays only, their names can be provided via the arrays argument. Batch.hide_empty() takes into account the current Batch.filter setup and drops/hides all relevant empty variables from the xks list and summary aggregations by default. Summaries that would end up without valid items because of this are automatically removed from the summaries collection and the user is warned.
New: qp.set_option('fast_stack_filters', True)
A new option to enable a more efficient test for already existing filters
inside the qp.Stack
object has been added. Set the 'fast_stack_filters'
option to True
to use it, the default is False
to ensure compatibility
in different versions of production DP template workspaces.
Update: Stack.add_stats(..., factor_labels=True, ...)
The parameter factor_labels is now also able to take the string '()'; factors are then written in normal brackets next to the label (instead of []).
In the new version, factor labels are also only added if none are included already, except when new scales are used.
Bugfix: DataSet np.NaN insertion to delimited_set variables
np.NaN was incorrectly transformed when inserted into a delimited_set before, leading to either numpy type conflicts or type casting exceptions. This is now fixed.
sd (27/02/2018)¶
New: DataSet._dimensions_suffix
DataSet has a new attribute _dimensions_suffix, which is used as mask suffix while running DataSet.dimensionize(). The default is _grid and it can be modified with DataSet.set_dim_suffix().
Update: Stack._get_chain() (old chain)
The method has been sped up. If a filter is already included in the Stack, it is not calculated from scratch anymore. Additionally, the method has a new parameter described, which takes a describing dataframe of the Stack, so it no longer needs to be calculated in each loop.
Nets that are applied on array variables will now also create a new recoded array that reflects the net definitions if recoded is used. The method has only been creating the item versions before.
Update: Stack.add_stats()
The method will now create a new metadata property called 'factor'
for each
variable it is applied on. You can only have one factor assigned to one
categorical value, so for multiple statistic definitions (exclusions, etc.)
it will get overwritten.
Update: DataSet.from_batch() (additions parameter)
The additions parameter has been updated to also be able to create recoded variables from existing "additional" Batches that are attached to a parent one. Filter variables will get the new meta 'properties' tag 'recoded_filter' and only have one category (1, 'active'). They are named simply 'filter_1', 'filter_2' and so on. The possible values of the parameter are now:
- None: as_addition()-Batches are not considered.
- 'variables': Only cross- and downbreak variables are considered.
- 'filters': Only filters are recoded.
- 'full': 'variables' + 'filters'
Bugfix: ViewManager._request_views()
Cumulative sums are only requested if they are included in the belonging Stack. Additionally, the correct related sig-tests are now taken for cumulative sums.
sd (12/01/2018)¶
New: Audit
Audit
is a new class which takes DataSet
instances, compares and aligns
them.
The class compares/ reports/ aligns the following aspects:
- datasets are valid (DataSet.validate())
- mismatches (variables are not included in all datasets)
- different types (variables are in more than one dataset, but have different types)
- labels (variables are in more than one dataset, but have different labels for the same text_key)
- value codes (variables are in more than one dataset, but have different value codes)
- value texts (variables are in more than one dataset, but have different value texts)
- array items (arrays are in more than one dataset, but have different items)
- item labels (arrays are in more than one dataset, but their items have different labels)
This is the first draft of the class, so it will need some testing and probably adjustments.
New: DataSet.reorder_items(name, new_order)
The new method reorders the items of the included array. The ints in the new_order list match up with the numbers of the items (DataSet.item_no('item_name')), not with their positions.
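For example, assuming an array 'q5' with six items, the following would move the third item to the front (a sketch):
>>> dataset.reorder_items('q5', [3, 1, 2, 4, 5, 6])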
New: DataSet.valid_tks
, Arabic
Arabic (ar-AR
) is included as default valid text-key.
New: DataSet.extend_items(name, ext_items, text_key=None)
The new method extends the items of an existing array.
Update: DataSet.set_missings()
The method is now limited to DataSet
, Batch
does not inherit it.
Update: DataSet
The whole class is reordered and cleaned up. Some new deprecation warnings will appear.
Update: DataSet.add_meta()
/ DataSet.derive()
Both methods will now raise a ValueError: Duplicated codes provided. Value codes must be unique!
if categorical values
definitions try to apply duplicated codes.
sd (18/12/2017)¶
New: Batch.remove_filter()
Removes all defined (global + extended) filters from a Batch instance.
Update: Batch.add_filter()
It’s now possible to extend the global filter of a Batch instance. These options are possible.
Add first filter:
>>> batch.filter, batch.filter_names
'no_filter', ['no_filter']
>>> batch.add_filter('filter1', logic1)
>>> batch.filter, batch.filter_names
{'filter1': logic1}, ['filter1']
Extend filter:
>>> batch.filter, batch.filter_names
{'filter1': logic}, ['filter1']
>>> batch.add_filter('filter2', logic2)
>>> batch.filter, batch.filter_names
{'filter1' + 'filter2': intersection([logic1, logic2])}, ['filter1' + 'filter2']
Replace filter:
>>> batch.filter, batch.filter_names
{'filter1': logic}, ['filter1']
>>> batch.add_filter('filter1', logic2)
>>> batch.filter, batch.filter_names
{'filter1': logic2}, ['filter1']
Update: Stack.add_stats(..., recode)
The new parameter recode
defines if a new numerical variable is created which
satisfies the stat definitions.
Update: DataSet.populate()
A progress tracker is added to this method.
Bugfix: Batch.add_open_ends()
'=' is removed from all responses in the included variables, as it causes errors in the Excel-Painter.
Bugfix: Batch.extend_x()
and Batch.extend_y()
Check if included variables exist and unroll included masks.
Bugfix: Stack.add_nets(..., calc)
If the operator in calc is div
/ /
, the calculation is now performed
correctly.
sd (28/11/2017)¶
New DataSet.from_batch()
Creates a new DataSet
instance out of Batch
definitions (xks, yks,
filter, weight, language, additions, edits).
New: Batch.add_total()
Defines if total column @
should be included in the downbreaks (yks).
New: Batch.set_unwgt_counts()
If cellitems are cp
and a weight is provided, it is possible to request
unweighted count views (percentages are still weighted).
Update: Batch.add_y_on_y(name, y_filter=None, main_filter='extend')
Multiple y_on_y aggregations can now be added to a Batch instance and each can have its own filter. The y_on_y filter can extend or replace the main_filter of the Batch.
Update: Stack.add_nets(..., recode)
The new parameter recode defines if a new variable is created which satisfies the net definitions. Different options for recode are:
- 'extend_codes': The new variable contains all codes of the original variable and all nets as new categories.
- 'drop_codes': The new variable contains only all nets as new categories.
- 'collect_codes' or 'collect_codes@cat_name': The new variable contains all nets as new categories and another new category which sums all cases that are not in any net. The new category text can be defined by adding @cat_name to collect_codes. If none is provided, Other is used as default.
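A rough sketch of how these options might be passed (the variable 'q1', the net groups and the exact net definition format are assumptions for illustration):
>>> nets = [{'Top2': [1, 2]}, {'Bottom2': [4, 5]}]
>>> stack.add_nets('q1', nets, recode='extend_codes')           # original codes + nets
>>> stack.add_nets('q1', nets, recode='collect_codes@No net')   # nets + a 'No net' rest category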
Update: Stack.add_nets()
If a variable in the Stack
already has a net_view, it gets overwritten
if a new net is added.
Update: DataSet.set_missings(..., missing_map)
The parameter missing_map can also handle lists now. All included codes are flagged as 'exclude'.
Update: request_views(..., sums='mid')
(ViewManager
/query.py
)
Allow different positions for sums in the view-order. They can be placed in
the middle ('mid'
) between the basics/ nets and the stats or at the
'bottom'
after the stats.
Update/ New: write_dimensions()
Converting qp data to mdd and ddf files by using write_dimensions()
is
updated now. A bug regarding encoding texts is fixed and additionally all
included text_keys
in the meta are transferred into the mdd. Therefore
two new classes are included: DimLabels
and DimLabel
.
sd (13/11/2017)¶
New: DataSet.to_delimited_set(name, label, variables, from_dichotomous=True, codes_from_name=True)
Creates a new delimited set variable out of other variables. If the input variables are dichotomous (from_dichotomous), the new value codes can be taken from the variable names or from the order of the variables (codes_from_name).
Update: Stack.aggregate(..., bases={})
A dictionary in the form of:
bases = {
'cbase': {
'wgt': True,
'unwgt': False},
'cbase_gross': {
'wgt': True,
'unwgt': True},
'ebase': {
'wgt': False,
'unwgt': False}
}
defines what kind of bases will be aggregated. If bases is provided, the old parameter unweighted_base and any bases in the parameter views will be ignored. If bases is not provided and any base is included in views, a dictionary is automatically created out of views and unweighted_base.
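A call using such a definition could then look like this (a sketch; the views list and the already populated stack are assumptions):
>>> bases = {'cbase':       {'wgt': True,  'unwgt': True},
...          'cbase_gross': {'wgt': False, 'unwgt': False},
...          'ebase':       {'wgt': False, 'unwgt': False}}
>>> stack.aggregate(views=['counts', 'c%'], bases=bases)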
sd (17/10/2017)¶
New: del DataSet['var_name'] and 'var_name' in DataSet syntax support
It is now possible to test membership of a variable name simply using the in operator instead of DataSet.var_exists('var_name') and to delete a variable definition from the DataSet using the del keyword in place of the drop('var_name') method.
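For example (assuming a variable named 'q1' exists):
>>> 'q1' in dataset
True
>>> del dataset['q1']
>>> 'q1' in dataset
False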
New: DataSet.is_single(name)
, .is_delimited_set(name)
, .is_int(name)
, .is_float(name)
, .is_string(name)
, .is_date(name)
, .is_array(name)
These new methods make testing a variable’s type easy.
Update: DataSet.singles(array_items=True)
and all other non-array
type iterators
It is now possible to exclude array
items from singles()
, delimited_sets()
,
ints()
and floats()
variable lists by setting the new array_items
parameter to False
.
Update: Batch.set_sigtests(..., flags=None, test_total=None), Batch.sigproperties
The significance-test settings for flagging and testing against the total can now be modified by the two parameters flags and test_total. The Batch attribute siglevels is removed; instead, all sig-settings are stored in Batch.sigproperties.
Update: Batch.make_summaries(..., exclusive=False), Batch.skip_items
The new parameter exclusive can take a list of arrays or a boolean. If a list is included, these arrays are added to Batch.skip_items; if it is True, all variables from Batch.summaries are added to Batch.skip_items.
Update: quantipy.sandbox.sandbox.Chain.paint(..., totalize=True)
If totalize is True, @-Total columns of an (x-oriented) Chain.dataframe will be painted as 'Total' instead of showing the corresponding x-variable's question text.
Update: quantipy.core.weights.Rim.Rake
The weighting algorithm’s generate_report()
method can be caught up in a
MemoryError
for complex weight schemes run on very large sample sizes. This
is now prevented to ensure the weight factors are computed with priority and
the algorithm is able to terminate correctly. A warning is raised:
UserWarning: OOM: Could not finish writing report...
Update: Batch.replace_y()
Conditional replacements of y-variables of a Batch
will now always also
automatically add the @
-Total indicator if not provided.
Bugfix: DataSet.force_texts(..., overwrite=True)
Forced overwriting of existing text_key
meta data was failing for array
mask
objects. This is now solved.
sd (15/09/2017)¶
New: DataSet.meta_to_json(key=None, collection=None)
The new method allows saving parts of the metadata as a json file. The parameters
key
and collection
define the metaobject which will be saved.
New: DataSet.save() and DataSet.revert()
These two new methods are useful in interactive sessions like IPython or Jupyter notebooks. save() will make a temporary copy of the DataSet (only in memory, not written to disk) and store its current state. You can then use revert() to roll back to that snapshot of the data at a later stage (e.g. a complex recode operation went wrong, reloading from the physical files takes too long…).
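A sketch of the intended workflow ('age' is an assumed variable):
>>> dataset.save()         # keep an in-memory snapshot of the current state
>>> dataset['age'] = 99    # ...some recode that turns out to be a mistake...
>>> dataset.revert()       # roll back to the snapshot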
New: DataSet.by_type(types=None)
The by_type()
method is replacing the soon to be deprecated implementation
of variables()
(see below). It provides the same functionality
(pd.DataFrame
summary of variable types) as the latter.
Update: DataSet.variables()
absorbs list_variables()
and variables_from_set()
In conjunction with the addition of by_type()
, variables()
is
replacing the related list_variables()
and variables_from_set()
methods in order to offer a unified solution for querying the DataSet
’s (main) variable collection.
Update: Batch.as_addition()
The possibility to add multiple cell item iterations of one Batch
definition
via that method has been reintroduced (it was working by accident in previous
versions with subtle side effects and then removed). Have fun!
Update: Batch.add_open_ends()
The method will now raise an Exception
if called on a Batch
that has
been added to a parent one via as_addition()
to warn the user and prevent
errors at the build stage:
NotImplementedError: Cannot add open end DataFrames to as_addition()-Batches!
sd (31/08/2017)¶
New: DataSet.code_from_label(..., exact=True)
The new parameter exact is implemented. If exact=True, codes are returned whose belonging label is equal to the included text_label. Otherwise the method checks if the labels contain the included text_label.
New: DataSet.order(new_order=None, reposition=None)
This new method can be used to change the global order of the DataSet
variables. You can either pass a complete new_order
list of variable names to
set the order or provide a list of dictionaries to move (multiple) variables
before a reference variable name. The order is reflected in the case data
pd.DataFrame.columns
order and the meta 'data file'
set
object’s items.
New: DataSet.dichotomize(name, value_texts=None, keep_variable_text=True, ignore=None, replace=False, text_key=None)
Use this to convert a 'delimited set' variable into a set of binary coded 'single' variables. The new variables will have the values 1/0 and by default use 'Yes'/'No' as the corresponding labels. Use the value_texts parameter to apply custom labels.
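For instance (a sketch; 'q8' is an assumed delimited set):
>>> dataset.dichotomize('q8', value_texts=['Selected', 'Not selected'])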
New: Batch.extend_x(ext_xks)
The new method enables an easy extension of Batch.xks
. In ext_xks
included str
are added at the end of Batch.xks
. Values of included
dict
s are positioned in front of the related key.
Update: Batch.extend_y(ext_yks, ...)
The parameter ext_yks
now also takes dict
s, which define the position
of the additional yks
.
Update: Batch.add_open_ends(..., replacements)
The new parameter replacements
is implemented. The method loops over the
whole pd.DataFrame and replaces all keys of the included dict
with the belonging value.
Update: Stack.add_stats(..., other_source)
Statistic views can now be added to delimited sets if other_source
is used.
In this case other_source
must be a single or numerical variable.
Update: DataSet.validate(..., spss_limits=False)
The new parameter spss_limits
is implemented. If spss_limits=True
, the
validate output dataframe is extended by 3 columns which show if the SPSS label
limitations are satisfied.
Bugfix: DataSet.convert()
A bug that prevented conversions from single
to numeric types has been fixed.
Bugfix: DataSet.add_meta()
A bug that prevented the creation of numerical arrays outside of to_array() has been fixed. It is now possible to create array metadata without providing category references.
Bugfix: Stack.add_stats()
Checking the statistic views is skipped now if no single typed variables are included even if a checking cluster is provided.
Bugfix: Batch.copy()
Instead of using a deepcopy of the Batch
instance, a new instance is created
and filled with the attributes of the initial one. Then the copied instance can
be used as additional Batch
.
Bugfix: qp.core.builds.powerpoint
Access to bar-chart series and colour-filling is now working for
different Powerpoint versions. Also a bug is fixed which came up in
PowerPointpainter()
for variables which have fixed categories and whose
values are located in lib
.
sd (24/07/2017)¶
New: qp.set_option()
It is now possible to set library-wide settings registered in qp.OPTIONS
by providing the setting’s name (key) and the desired value. Currently supported
are:
OPTIONS = {
'new_rules': False,
'new_chains': False,
'short_item_texts': False
}
So for example, to work with the currently refactored Chain interim class we can use qp.set_option('new_chains', True).
New: qp.Batch()
This is a new object aimed at defining and structuring aggregation and build setups. Please see an extensive overview here.
New: Stack.aggregate()
/ add_nets()
/ add_stats()
/ add_tests()
/ …
Connected to the new Batch
class, some new Stack
methods to ease up
view creation have been added. You can find the docs here.
New: DataSet.populate()
Use this to create a qp.Stack
from Batch
definitions. This connects the
Batch
and Stack
objects; check out the Batch
and Analysis & aggregation docs.
New: DataSet.write_dimensions(path_mdd=None, path_ddf=None, text_key=None, mdm_lang='ENG', run=True, clean_up=True)
It is now possible to directly convert a DataSet
into a Dimensions .ddf/.mdd
file pair (given SPSS Data Collection Base Professional is installed on your machine). By default, files will be saved to the same location in which the DataSet resides and will keep its text_key.
New: DataSet.repair()
This new method can be used to try to fix common DataSet
metadata problems
stemming from outdated versions, incorrect manual editing of the meta dictionary
or other inconsistencies. The method checks and repairs the following issues:
- 'name' is present for all variable metadata
- 'source' and 'subtype' references for array variables
- correct 'lib'-based 'values' object for array variables
- text key-dependent 'x edits' / 'y edits' meta data
- ['data file']['items'] set entries exist in 'columns' / 'masks'
New: DataSet.subset(variables=None, from_set=None, inplace=False)
As a counterpart to filter()
, subset()
can be used to create a new
DataSet
that contains only a selection of variables. The new variables
collection can be provided either as a list of names or by naming an already
existing set containing the desired variables.
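For example (the variable and set names are assumptions):
>>> core_vars = ['identity', 'gender', 'age', 'q1']
>>> sub_ds = dataset.subset(variables=core_vars)     # new DataSet limited to these variables
>>> sub_ds2 = dataset.subset(from_set='core set')    # ...or based on an existing set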
New: DataSet.variables_from_set(setname)
Get the list of variables belonging to the passed set indicated by
setname
.
New: DataSet.is_like_numeric(name)
A new method to test if all of a string
variable’s values can be converted
to a numerical (int
/ float
) type. Returns a boolean True
/ False
.
Update: DataSet.convert()
It is now possible to convert inplace from string
to int
/ float
if
the respective internal is_like_numeric()
check identifies numeric-like values.
Update: DataSet.from_components(..., reset=True)
, DataSet.read_quantipy(..., reset=True)
Loaded .json
metadata dictionaries will get cleaned now by default from any
user-defined, non-native objects inside the 'lib'
and 'sets'
collections. Set reset=False to keep any extra entries (restoring the old behaviour).
Update: DataSet.from_components(data_df, meta_dict=None, ...)
It is now possible to create a DataSet
instance by providing a pd.DataFrame
alone, without any accompanying meta data. While reading in the case data, the meta
component will be created by inferring the proper Quantipy
variable types
from the pandas
dtype
information.
Update: Quantity.swap(var, ..., update_axis_def=True)
It is now possible to swap() the 'x' variable of an array based Quantity, as long as the length of the constructing 'items' collection is identical. In addition, the new parameter update_axis_def is now by default enforcing an update of the axis definitions (pd.DataFrame column names, etc.) while previously the method was keeping the original index and column names. The old behaviour can be restored by setting the parameter to False.
Array example:
>>> link = stack[name_data]['no_filter']['q5']['@']
>>> q = qp.Quantity(link)
>>> q.summarize()
Array q5
Questions q5_1 q5_2 q5_3 q5_4 q5_5 q5_6
Question Values
q5 All 8255.000000 8255.000000 8255.000000 8255.000000 8255.000000 8255.000000
mean 26.410297 22.260569 25.181466 39.842883 24.399758 28.972017
stddev 40.415559 38.060583 40.018463 46.012205 40.537497 41.903322
min 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 3.000000 3.000000 3.000000 3.000000 1.000000 3.000000
median 5.000000 3.000000 3.000000 5.000000 3.000000 5.000000
75% 5.000000 5.000000 5.000000 98.000000 5.000000 97.000000
max 98.000000 98.000000 98.000000 98.000000 98.000000 98.000000
Updated axis definition:
>>> q.swap('q7', update_axis_def=True)
>>> q.summarize()
Array q7
Questions q7_1 q7_2 q7_3 q7_4 q7_5 q7_6
Question Values
q7 All 1195.000000 1413.000000 3378.000000 35.000000 43.000000 36.000000
mean 5.782427 5.423213 5.795145 4.228571 4.558140 5.333333
stddev 2.277894 2.157226 2.366247 2.073442 2.322789 2.552310
min 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 4.000000 4.000000 4.000000 3.000000 3.000000 3.000000
median 6.000000 6.000000 6.000000 4.000000 4.000000 6.000000
75% 8.000000 7.000000 8.000000 6.000000 6.000000 7.750000
max 9.000000 9.000000 9.000000 8.000000 9.000000 9.000000
Original axis definition:
>>> q = qp.Quantity(link)
>>> q.swap('q7', update_axis_def=False)
>>> q.summarize()
Array q5
Questions q5_1 q5_2 q5_3 q5_4 q5_5 q5_6
Question Values
q5 All 1195.000000 1413.000000 3378.000000 35.000000 43.000000 36.000000
mean 5.782427 5.423213 5.795145 4.228571 4.558140 5.333333
stddev 2.277894 2.157226 2.366247 2.073442 2.322789 2.552310
min 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
25% 4.000000 4.000000 4.000000 3.000000 3.000000 3.000000
median 6.000000 6.000000 6.000000 4.000000 4.000000 6.000000
75% 8.000000 7.000000 8.000000 6.000000 6.000000 7.750000
max 9.000000 9.000000 9.000000 8.000000 9.000000 9.000000
Update: DataSet.merge_texts()
The method will now always overwrite existing text_key
meta, which makes it
possible to merge text
s from meta of the same text_key
as the master
DataSet
.
Bugfix: DataSet.band()
band(new_name=None)
’s automatic name generation was incorrectly creating
new variables with the name None_banded
. This is now fixed.
Bugfix: DataSet.copy()
The method will now check if the name of the copy already exists in the DataSet and drop the referenced variable if found to prevent inconsistencies. Additionally, it is no longer possible to copy isolated array items:
>>> dataset.copy('q5_1')
NotImplementedError: Cannot make isolated copy of array item 'q5_1'. Please copy array variable 'q5' instead!
sd (08/06/2017)¶
New: DataSet.extend_valid_tks()
, DataSet.valid_tks
DataSet
has a new attribute valid_tks
that contains a list of all valid
textkeys. All methods that take a textkey as parameter are checked against that
list.
If a datafile contains a special/ unusual textkey (for example 'id-ID'
or
'zh-TW'
), the list can be extended with DataSet.extend_valid_tks()
.
This extension can also be used to create a textkey for special conditions,
for example to create texts only for powerpoint outputs:
>>> dataset.extend_valid_tks('pptx')
>>> dataset.force_texts('pptx', 'en-GB')
>>> dataset.set_variable_text('gender','Gender label for pptx', text_key='pptx')
New: Equal error messages
All methods that use the parameters name
/var
, text_key
or
axis_edit
/ axis
now have a decorator that checks the provided values.
The following shows a few examples for the new error messages:
name
& var
:
'name' argument for meta() must be in ['columns', 'masks'].
q1 is not in ['columns', 'masks'].
text_key
:
'en-gb' is not a valid text_key! Supported are: ['en-GB', 'da-DK', 'fi-FI', 'nb-NO', 'sv-SE', 'de-DE']
axis_edit
& axis
:
'xs' is not a valid axis! Supported are: ['x', 'y']
New: DataSet.repair_text_edits(text_key)
This new method can be used in trackers that were drawn up in an older Quantipy version. Text objects can be repaired if they are not well prepared, for example if they look like this:
{'en-GB': 'some English text',
'sv_SE': 'some Swedish text',
'x edits': 'new text'}
DataSet.repair_text_edits()
loops over all text objects in the dataset and
matches the x edits
and y edits
texts to all included textkeys:
>>> dataset.repair_text_edits(['en-GB', 'sv-SE'])
{'en-GB': 'some English text',
'sv_SE': 'some Swedish text',
'x edits': {'en-GB': 'new text', 'sv-SE': 'new text'}}
Update: DataSet.meta()
/ .text()
/ .values()
/ .value_texts()
/ .items()
/ .item_texts()
All these methods now can take the parameters text_key
and axis_edit
.
The related text is taken from the meta information and shown in the output.
If a text key or axis edit is not included the text is returned as None.
Update: DataSet.compare(dataset, variables=None, strict=False, text_key=None)
The method is completely updated, works more precisely and contains a few new features. Generally, variables included in dataset are compared with eponymous variables in the main DataSet instance. You can specify which variables should be compared, whether question/value texts should be compared strict or not, and for which text_key.
Update: DataSet.validate(verbose=True)
A few new features are tested now and the output has changed. Set verbose=True
to see the definitions of the different error columns:
name: column/mask name and meta[collection][var]['name'] are not identical
q_label: text object is badly formatted or has empty text mapping
values: categorical var does not contain values, value text is badly
formatted or has empty text mapping
textkeys: dataset.text_key is not included or existing tks are not
consistent (also for parents)
source: parents or items do not exist
codes: codes in .data are not included in .meta
Update: DataSet.sorting()
/ .slicing()
/ .hiding()
These methods will now also work on lists of variable names.
Update: DataSet.set_variable_text()
, Dataset.set_item_texts()
If these methods are applied to an array item, the new variable text is also included in the meta information of the parent array. The same works also the other way around, if an array text is set, then the array item texts are modified.
Update: DataSet.__init__(self, name, dimensions_comp=True)
A few new features are included to handle data coming from Crunch. While
initializing a new DataSet
instance dimensions compatibility can be set to
False. In the custom template use t.get_qp_dataset(name, dim_comp=False)
in the load cells.
Bugfix: DataSet.hmerge()
If right_on
and left_on
are used and right_on
is also included in
the main file, it is not overwritten any more.
sd (17/05/2017)¶
Update: DataSet.set_variable_text(..., axis_edit=None)
, DataSet.set_value_texts(..., axis_edit=False)
The new axis_edit
argument can be used with one of 'x'
, 'y'
or ['x', 'y']
to instruct a text metadata change that will only be visible in build exports.
Warning
In a future version set_col_text_edit()
and set_val_text_text()
will
be removed! The identical functionality is provided via this axis_edit
parameter.
Update: DataSet.replace_texts(..., text_key=None)
The method loops over all meta text objects and replaces unwanted strings.
It is now possible to perform the replacement only for specified text_key
s.
If text_key=None
the method replaces the strings for all text_key
s.
Update: DataSet.force_texts(copy_to=None, copy_from=None, update_existing=False)
The method is now only able to force texts for all meta text objects (for
single variables use the methods set_variable_text()
and
set_value_texts()
).
Bugfix: DataSet.copy()
Copied variables get the tag created
and can be listed with
t.list_variables(dataset, 'created')
.
Bugfix: DataSet.hmerge()
, DataSet.vmerge()
Array meta information in merged datafiles is now updated correctly.
sd (04/05/2017)¶
New: DataSet.var_exists()
Returns True if the input variable/ list of variables are included in the
DataSet
instance, otherwise False.
New: DataSet.remove_html()
, DataSet.replace_texts(replace)
The DataSet
method clean_texts()
has been removed and split into two
methods to make usage more clear: remove_html()
will strip all text
metadata objects from any html and formatting tags. replace_texts()
will
use a dict
mapping of old to new str
terms to change the matching
text
throughout the DataSet
metadata.
New: DataSet.item_no(name)
This method will return the positional index number of an array item, e.g.:
>>> dataset.item_no('Q4A[{q4a_1}].Q4A_grid')
1
New: QuantipyViews
: counts_cumsum
, c%_cumsum
These two new views contain frequencies with cumulative sums which are computed over the x-axis.
Update: DataSet.text(name, shorten=True)
The new parameter shorten
is now controlling if the variable text
metadata
of array masks will be reported in short format, i.e. without the corresponding
mask label text. This is now also the default behaviour.
Update: DataSet.to_array()
Created mask meta information now also contains keys parent
and subtype
.
Variable names are compatible with Crunch and Dimensions meta.
Example in Dimensions mode:
>>> dataset.to_array('Q11', ['Q1', 'Q2', 'Q3', 'Q4', 'Q5'], 'label')
The new grid is named 'Q11.Q11_grid' and the source/column variables are 'Q11[{Q1}].Q11_grid' to 'Q11[{Q5}].Q11_grid'.
Bugfix: DataSet.derotate()
Meta is now Crunch and Dimensions compatible. Also mask meta information are updated.
sd (24/04/2017)¶
Update: DataSet.hiding(..., hide_values=True)
The new parameter hide_values
is only necessary if the input variable is a
mask. If False
, mask items are hidden, if True
mask values are hidden
for all mask items and for array summary sheets.
Bugfix: DataSet.set_col_text_edit(name)
If the input variable is an array item, the new column text is also added to meta['masks'][name]['items'].
Bugfix: DataSet.drop(name, ignore_items=False)
If a mask is dropped, but the items are kept, all items are handled now as
individual variables and their meta information is not stored in meta['lib']
anymore.
sd (06/04/2017)¶
Only small adjustments.
sd (29/03/2017)¶
New: DataSet.codes_in_data(name)
This method returns a list of codes that exist in the data of a variable. This information can be used for more complex recodes, for example copying a variable, but keeping only all categories with more than 50 ratings, e.g.:
>>> valid_code = dataset.codes_in_data('varname')
>>> keep_code = [x for x in valid_code if dataset['varname'].value_counts()[x] > 49]
>>> dataset.copy('varname', 'rc', copy_only=keep_code)
Update: DataSet.copy(..., copy_not=None)
The new parameter copy_not
takes a list of codes that should be ignored
for the copied version of the provided variable. The metadata of the copy will
be reduced as well.
Update: DataSet.code_count()
This method is now aligned with any() and all() in that it can be used on 'array' variables as well. In such a case, the resulting pandas.Series reports the number of answer codes found across all items per case data row, i.e.:
>>> code_count = dataset.code_count('Q4A.Q4A_grid', count_only=[3, 4])
>>> check = pd.concat([dataset['Q4A.Q4A_grid'], code_count], axis=1)
>>> check.head(10)
Q4A[{q4a_1}].Q4A_grid Q4A[{q4a_2}].Q4A_grid Q4A[{q4a_3}].Q4A_grid 0
0 3.0 3.0 NaN 2
1 NaN NaN NaN 0
2 3.0 3.0 4.0 3
3 5.0 4.0 2.0 1
4 4.0 4.0 4.0 3
5 4.0 5.0 4.0 2
6 3.0 3.0 3.0 3
7 4.0 4.0 4.0 3
8 6.0 6.0 6.0 0
9 4.0 5.0 5.0 1
sd (20/03/2017)¶
New: qp.DataSet(dimensions_comp=True)
The DataSet
class can now be explicitly run in a Dimensions compatibility
mode to control the naming conventions of array
variables (“grids”). This
is also the default behaviour for now. This comes with a few changes related to
meta creation and variable access using DataSet
methods. Please see a brief
case study on this topic here.
New: enriched items / masks meta data
masks will now also store the subtype (single, delimited set, etc.) while items elements will now contain a reference to the defining masks entries in a new parent object.
Update: DataSet.weight(..., subset=None)
Filters the dataset by giving a Quantipy complex logic expression and weights only the remaining subset.
Update: Defining categorical values
meta and array
items
Both values
and items
can now be created in three different ways when
working with the DataSet
methods add_meta()
, extend_values()
and
derive()
: (1) Tuples that map element code to label, (2) only labels or (3)
only element codes. Please see quick guide on that here
sd (07/03/2017)¶
Update: DataSet.code_count(..., count_not=None)
The new parameter count_not
can be used to restrict the set of codes feeding
into the resulting pd.Series
by exclusion (while count_only
restricts
by inclusion).
Update: DataSet.copy(..., copy_only=None)
The new parameter copy_only
takes a list of codes that should be included
for the copied version of the provided variable, all others will be ignored
and the metadata of the copy will be reduced as well.
Bugfix: DataSet.band()
There was a bug that was causing the method to crash for negative values. It is
now possible to create negative single value bands, while negative ranges
(lower and/or upper bound < 0) will raise a ValueError
.
sd (24/02/2017)¶
- Some minor bugfixes and updates. Please use latest version.
sd (16/02/2017)¶
New: DataSet.derotate(levels, mapper, other=None, unique_key='identity', dropna=True)
Create a derotated (“levelled”, responses-to-cases) DataSet
instance by
defining level variables, looped variables and other (simple) variables that
should be added.
View more information on the topic here.
New: DataSet.to_array(name, variables, label)
Combine column
variables with identical values
objects to an array
incl. all required meta['masks']
information.
Update: DataSet.interlock(..., variables)
It is now possible to add dict
s to variables
. In these dict
s a
derive()
-like mapper can be included which will then create a temporary
variable for the interlocked result. Example:
>>> variables = ['gender',
... {'agegrp': [(1, '18-34', {'age': frange('18-34')}),
... (2, '35-54', {'age': frange('35-54')}),
... (3, '55+', {'age': is_ge(55)})]},
... 'region']
>>> dataset.interlock('new_var', 'label', variables)
sd (04/01/2017)¶
New: DataSet.flatten(name, codes, new_name=None, text_key=None)
Creates a new delimited set
variable that groups grid item
answers to
categories. The items
become values
of the new variable. If an
item
contains one of the codes
it will be counted towards the categorical
case data of the new variable.
New: DataSet.uncode(target, mapper, default=None, intersect=None, inplace=True)
Remove codes from the target
variable’s data component if a logical
condition is satisfied.
New: DataSet.text(var, text_key=None)
Returns the question text label (per text_key
) of a variable.
New: DataSet.unroll(varlist, keep=None, both=None)
Replaces masks
names inside varlist
with their items
. Optionally,
individual masks
can be excluded or kept inside the list.
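For example (a sketch; 'q5' is assumed to be a mask with items q5_1 to q5_6):
>>> dataset.unroll(['q5', 'gender'])
['q5_1', 'q5_2', 'q5_3', 'q5_4', 'q5_5', 'q5_6', 'gender']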
New: DataSet.from_stack(stack, datakey=None)
Create a quantipy.DataSet
from the meta
, data
, data_key
and
filter
definition of a quantipy.Stack
instance.
sd (8/12/2016)¶
New:
DataSet.from_excel(path_xlsx, merge=True, unique_key='identity')
Returns a new DataSet
instance with data
from excel
. The meta
for all variables contains type='int'
.
Example: new_ds = dataset.from_excel(path, True, 'identity')
The function is able to modify dataset
inplace by merging new_ds
on
identity
.
Update:
DataSet.copy(..., slicer=None)
It is now possible to filter the data that satisfies the logical condition provided in the slicer.
Example:
>>> dataset.copy('q1', 'rec', True, {'q1': not_any([99])})
sd (23/11/2016)¶
Update:
DataSet.rename(name, new_name=None, array_item=None)
The function is able to rename columns, masks or mask items. Mask items are changed by position.
Update:
DataSet.categorize(..., categorized_name=None)
Provide a custom name string for categorized_name
will change the default
name of the categorized variable from OLD_NAME#
to the passed string.
sd (16/11/2016)¶
New:
DataSet.check_dupe(name='identity')
Returns a list with duplicated values for the variable provided via name
.
Identifies for example duplicated identities.
New:
DataSet.start_meta(text_key=None)
Creates an empty QP meta data document blueprint to add variable definitions to.
Update:
DataSet.create_set(setname='new_set', based_on='data file', included=None,
... excluded=None, strings='keep', arrays='both', replace=None,
... overwrite=False)
Add a new set to the meta['sets'] object. Variables from an existing set (based_on) can be included in new_set, or variables can be excluded from based_on with customized lists of variables. Control string variables and masks with the kwargs strings and arrays. Replace single variables in new_set with a dict.
Update:
DataSet.from_components(..., text_key=None)
Will now accept a text_key
in the method call. If querying a text_key
from the meta component fails, the method will no longer crash, but raise a
warning
and set the text_key
to None
.
Update:
DataSet.as_float()
DataSet.as_int()
DataSet.as_single()
DataSet.as_delimited_set()
DataSet.as_string()
DataSet.band_numerical()
DataSet.derive_categorical()
DataSet.set_mask_text()
DataSet.set_column_text()
These methods now print a UserWarning to prepare for their upcoming removal.
Bugfix:
DataSet.__setitem__()
Trying to set np.NaN failed the test against the metadata of categorical variables and raised a ValueError. This is fixed now.
sd (11/11/2016)¶
New:
DataSet.columns
DataSet.masks
DataSet.sets
DataSet.singles
DataSet.delimited_sets
DataSet.ints
DataSet.floats
DataSet.dates
DataSet.strings
New DataSet
instance attributes to quickly return the list of columns
,
masks
and sets
objects from the meta or query the variables by
type
. Use these to check for variables, for iteration, inspection, etc.
New:
DataSet.categorize(name)
Create a categorized version of int/string/date
variables. New variables
will be named as per OLD_NAME#
New:
DataSet.convert(name, to)
Wraps the individual as_TYPE()
conversion methods. to
must be one of
'int', 'float', 'string', 'single', 'delimited set'
.
New:
DataSet.as_string(name)
Only for completeness: Use DataSet.convert(name, to='string')
instead.
Converts int/float/single/date
typed variables into a string
and
removes all categorical metadata.
Update:
DataSet.add_meta()
Can now add date
and text
type meta data.
Bugfix:
DataSet.vmerge()
If masks in the right dataset that also exist in the left dataset have new items or values, these are added to meta['masks'], meta['lib'] and meta['sets'].
sd (09/11/2016)¶
New:
DataSet.as_float(name)
Converts int/single
typed variables into a float
and removes
all categorical metadata.
New:
DataSet.as_int(name)
Converts single typed variables into an int and removes all categorical metadata.
New:
DataSet.as_single(name)
Converts int
typed variables into a single
and adds numeric values as
categorical metadata.
New:
DataSet.create_set(name, variables, blacklist=None)
Adds a new set to the meta['sets'] object. Easily create sets from other sets while using a customised blacklist.
New:
DataSet.drop(name, ignore_items=False)
Removes all metadata and data referencing the variable. When passing an array mask, ignore_items can be set to True to keep the item columns incl. their metadata.
New:
DataSet.compare(dataset=None, variables=None)
Compare the metadata definition between the current and another dataset
,
optionally restricting to a pair of variables.
Update:
DataSet.__setitem__()
The [..]-indexer now checks scalars against categorical meta.
How-to-snippets¶
DataSet
Dimensions compatibility¶
DTO-downloaded and Dimensions-converted variable naming conventions follow specific rules for array names and corresponding items. DataSet offers a compatibility mode for Dimensions scenarios and handles the proper renaming automatically. Here is what you should know…
The compatibility mode¶
A DataSet
will (by default) support Dimensions-like array
naming for its connected data files when constructed. An array
masks
meta definition
of a variable called q5
looking like this…:
{u'items': [{u'source': u'columns@q5_1', u'text': {u'en-GB': u'Surfing'}},
{u'source': u'columns@q5_2', u'text': {u'en-GB': u'Snowboarding'}},
{u'source': u'columns@q5_3', u'text': {u'en-GB': u'Kite boarding'}},
{u'source': u'columns@q5_4', u'text': {u'en-GB': u'Parachuting'}},
{u'source': u'columns@q5_5', u'text': {u'en-GB': u'Cave diving'}},
{u'source': u'columns@q5_6', u'text': {u'en-GB': u'Windsurfing'}}],
u'subtype': u'single',
u'text': {u'en-GB': u'How likely are you to do each of the following in the next year?'},
u'type': u'array',
u'values': u'lib@values@q5'}
…will be converted into its “Dimensions equivalent” as per:
>>> dataset = qp.DataSet(name_data, dimensions_comp=True)
>>> dataset.read_quantipy(path_data+name_data, path_data+name_data)
DataSet: ../Data/Quantipy/Example Data (A)
rows: 8255 - columns: 75
Dimensions compatibility mode: True
>>> dataset.masks()
['q5.q5_grid', 'q6.q6_grid', 'q7.q7_grid']
>>> dataset._meta['masks']['q5.q5_grid']
{u'items': [{u'source': 'columns@q5[{q5_1}].q5_grid',
u'text': {u'en-GB': u'Surfing'}},
{u'source': 'columns@q5[{q5_2}].q5_grid',
u'text': {u'en-GB': u'Snowboarding'}},
{u'source': 'columns@q5[{q5_3}].q5_grid',
u'text': {u'en-GB': u'Kite boarding'}},
{u'source': 'columns@q5[{q5_4}].q5_grid',
u'text': {u'en-GB': u'Parachuting'}},
{u'source': 'columns@q5[{q5_5}].q5_grid',
u'text': {u'en-GB': u'Cave diving'}},
{u'source': 'columns@q5[{q5_6}].q5_grid',
u'text': {u'en-GB': u'Windsurfing'}}],
'name': 'q5.q5_grid',
u'subtype': u'single',
u'text': {u'en-GB': u'How likely are you to do each of the following in the next year?'},
u'type': u'array',
u'values': 'lib@values@q5.q5_grid'}
Accessing and creating array
data¶
Since new names are converted automatically by DataSet
methods, there is
no need to write down the full (DTO-like) Dimensions array
name when adding
new metadata. However, querying variables always requires the proper name:
>>> name, qtype, label = 'array_var', 'single', 'ARRAY LABEL'
>>> cats = ['A', 'B', 'C']
>>> items = ['1', '2', '3']
>>> dataset.add_meta(name, qtype, label, cats, items)
>>> dataset.masks()
['q5.q5_grid', 'array_var.array_var_grid', 'q6.q6_grid', 'q7.q7_grid']
>>> dataset.meta('array_var.array_var_grid')
single items item texts codes texts missing
array_var.array_var_grid: ARRAY LABEL
1 array_var[{array_var_1}].array_var_grid 1 1 A None
2 array_var[{array_var_2}].array_var_grid 2 2 B None
3 array_var[{array_var_3}].array_var_grid 3 3 C None
>>> dataset['array_var.array_var_grid'].head(5)
array_var[{array_var_1}].array_var_grid array_var[{array_var_2}].array_var_grid array_var[{array_var_3}].array_var_grid
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
As can be seen above, both the masks name and the array item elements are properly converted to match DTO/Dimensions conventions.
When using rename()
, copy()
or transpose()
, the same behaviour
applies:
>>> dataset.rename('q6.q6_grid', 'q6new')
>>> dataset.masks()
['q5.q5_grid', 'array_var.array_var_grid', 'q6new.q6new_grid', 'q7.q7_grid']
>>> dataset.copy('q6new.q6new_grid', suffix='q6copy')
>>> dataset.masks()
['q5.q5_grid', 'q6new_q6copy.q6new_q6copy_grid', 'array_var.array_var_grid', 'q6new.q6new_grid', 'q7.q7_grid']
>>> dataset.transpose('q6new_q6copy.q6new_q6copy_grid')
>>> dataset.masks()
['q5.q5_grid', 'q6new_q6copy_trans.q6new_q6copy_trans_grid', 'q6new_q6copy.q6new_q6copy_grid', 'array_var.array_var_grid', 'q6new.q6new_grid', 'q7.q7_grid']
Different ways of creating categorical values¶
The DataSet
methods add_meta()
, extend_values()
and derive()
offer three alternatives for specifying the categorical values of 'single'
and 'delimited set'
typed variables. The approaches differ with respect to
how the mapping of numerical value codes to value text labels is handled.
(1) Providing a list of text labels
By providing the category labels only as a list of str
, DataSet
is going to create the numerical codes by simple enumeration:
>>> name, qtype, label = 'test_var', 'single', 'The test variable label'
>>> cats = ['test_cat_1', 'test_cat_2', 'test_cat_3']
>>> dataset.add_meta(name, qtype, label, cats)
>>> dataset.meta('test_var')
single codes texts missing
test_var: The test variable label
1 1 test_cat_1 None
2 2 test_cat_2 None
3 3 test_cat_3 None
(2) Providing a list of numerical codes
If only the desired numerical codes are provided, the label information for all categories will consequently appear blank. In this case the user is, however, reminded to add the 'text' meta in a separate step:
>>> cats = [1, 2, 98]
>>> dataset.add_meta(name, qtype, label, cats)
...\\quantipy\core\dataset.py:1287: UserWarning: 'text' label information missing,
only numerical codes created for the values object. Remember to add value 'text' metadata manually!
>>> dataset.meta('test_var')
single codes texts missing
test_var: The test variable label
1 1 None
2 2 None
3 98 None
(3) Pairing numerical codes with text labels
To explicitly assign codes to corresponding labels, categories can also be defined as a list of tuples of codes and labels:
>>> cats = [(1, 'test_cat_1'), (2, 'test_cat_2'), (98, 'Don\'t know')]
>>> dataset.add_meta(name, qtype, label, cats)
>>> dataset.meta('test_var')
single codes texts missing
test_var: The test variable label
1 1 test_cat_1 None
2 2 test_cat_2 None
3 98 Don't know None
Note
All three approaches are also valid for defining the items
object for
array
-typed masks
.
Derotation¶
What is derotation¶
Derotation of data is necessary if brands, products or similar items (levels) are assessed and each respondent (case) rates a different selection of these levels, so each case has several responses. Derotation means that the data is switched from case level to response level.
Example: q1_1/q1_2
: On a scale from 1 to 10, how much do you like the
following drinks?
data
id | drink_1 | drink_2 | q1_1 | q1_2 | gender |
---|---|---|---|---|---|
case1 | 1 | 3 | 2 | 8 | 1 |
case2 | 1 | 4 | 9 | 5 | 2 |
case3 | 2 | 4 | 6 | 10 | 1 |
derotated data
id | drink | drink_levelled | q1 | gender |
---|---|---|---|---|
case1 | 1 | 1 | 2 | 1 |
case1 | 2 | 3 | 8 | 1 |
case2 | 1 | 1 | 9 | 2 |
case2 | 2 | 4 | 5 | 2 |
case3 | 1 | 2 | 6 | 1 |
case3 | 2 | 4 | 10 | 1 |
To identify which case rates which levels, some key-/level-variables are
included in the data
, in this example drink_1
and drink_2
.
Variables (for example gender) that are not part of this loop can also be added.
How to use DataSet.derotate()
¶
The DataSet method takes a few parameters:
levels : dict of list
    Contains all key-/level-variables and the name for the new levelled variable. All key-/level-variables must have the same value_map.
    >>> levels = {'drink': ['drink_1', 'drink_2']}
mapper : list of dict of list
    Contains the looped questions and the new column name into which the looped questions will be combined.
    >>> mapper = [{'q1': ['q1_1', 'q1_2']}]
other : str or list of str
    Contains all variables that should be added to the derotated data, but which are not included in the loop.
    >>> other = 'gender'
unique_key : str
    Name of the variable that identifies cases in the initial data.
    >>> unique_key = 'id'
dropna : bool, default True
    If a case rates fewer than the possible number of levels, these responses will be dropped.
>>> ds = dataset.derotate(levels = {'drink': ['drink_1', 'drink_2']},
... mapper = [{'q1': ['q1_1', 'q1_2']}],
... other = 'gender',
... unique_key = 'id',
... dropna = True)
What about arrays
?¶
It is possible that arrays are also looped. In this case a mapper can look like this:
>>> mapper = [{'q12_1': ['q12a[{q12a_1}].q12a_grid', 'q12b[{q12b_1}].q12b_grid',
... 'q12c[{q12c_1}].q12c_grid', 'q12d[{q12d_1}].q12d_grid']},
... {'q12_2': ['q12a[{q12a_2}].q12a_grid', 'q12b[{q12b_2}].q12b_grid',
... 'q12c[{q12c_2}].q12c_grid', 'q12d[{q12d_2}].q12d_grid']},
... {'q12_3': ['q12a[{q12a_3}].q12a_grid', 'q12b[{q12b_3}].q12b_grid',
... 'q12c[{q12c_3}].q12c_grid', 'q12d[{q12d_3}].q12d_grid']},
... {'q12_4': ['q12a[{q12a_4}].q12a_grid', 'q12b[{q12b_4}].q12b_grid',
... 'q12c[{q12c_4}].q12c_grid', 'q12d[{q12d_4}].q12d_grid']},
... {'q12_5': ['q12a[{q12a_5}].q12a_grid', 'q12b[{q12b_5}].q12b_grid',
... 'q12c[{q12c_5}].q12c_grid', 'q12d[{q12d_5}].q12d_grid']},
... {'q12_6': ['q12a[{q12a_6}].q12a_grid', 'q12b[{q12b_6}].q12b_grid',
... 'q12c[{q12c_6}].q12c_grid', 'q12d[{q12d_6}].q12d_grid']},
... {'q12_7': ['q12a[{q12a_7}].q12a_grid', 'q12b[{q12b_7}].q12b_grid',
... 'q12c[{q12c_7}].q12c_grid', 'q12d[{q12d_7}].q12d_grid']},
... {'q12_8': ['q12a[{q12a_8}].q12a_grid', 'q12b[{q12b_8}].q12b_grid',
... 'q12c[{q12c_8}].q12c_grid', 'q12d[{q12d_8}].q12d_grid']},
... {'q12_9': ['q12a[{q12a_9}].q12a_grid', 'q12b[{q12b_9}].q12b_grid',
... 'q12c[{q12c_9}].q12c_grid', 'q12d[{q12d_9}].q12d_grid']},
... {'q12_10': ['q12a[{q12a_10}].q12a_grid', 'q12b[{q12b_10}].q12b_grid',
... 'q12c[{q12c_10}].q12c_grid', 'q12d[{q12d_10}].q12d_grid']},
... {'q12_11': ['q12a[{q12a_11}].q12a_grid', 'q12b[{q12b_11}].q12b_grid',
... 'q12c[{q12c_11}].q12c_grid', 'q12d[{q12d_11}].q12d_grid']},
... {'q12_12': ['q12a[{q12a_12}].q12a_grid', 'q12b[{q12b_12}].q12b_grid',
... 'q12c[{q12c_12}].q12c_grid', 'q12d[{q12d_12}].q12d_grid']},
... {'q12_13': ['q12a[{q12a_13}].q12a_grid', 'q12b[{q12b_13}].q12b_grid',
... 'q12c[{q12c_13}].q12c_grid', 'q12d[{q12d_13}].q12d_grid']}]
The same mapper can also be written like this:
>>> mapper = []
>>> for y in frange('1-13'):
... q_group = []
... for x in ['a', 'b', 'c', 'd']:
... var = 'q12{}'.format(x)
... var_grid = var + '[{' + var + '_{}'.format(y) + '}].' + var + '_grid'
... q_group.append(var_grid)
... mapper.append({'q12_{}'.format(y): q_group})
The derotated dataset will lose its meta information about the mask; only the columns q12_1 to q12_13 will be added. To restore the mask structure, use the method dataset.to_array():
>>> variables = [{'q12_1': u'label 1'},
... {'q12_2': u'label 2'},
... {'q12_3': u'label 3'},
... {'q12_4': u'label 4'},
... {'q12_5': u'label 5'},
... {'q12_6': u'label 6'},
... {'q12_7': u'label 7'},
... {'q12_8': u'label 8'},
... {'q12_9': u'label 9'},
... {'q12_10': u'label 10'},
... {'q12_11': u'label 11'},
... {'q12_12': u'label 12'},
... {'q12_13': u'label 13'}]
>>> ds.to_array('qTP', variables, 'Var_name')
variables can also be a list of variable names; in that case the mask items are named after their corresponding columns. arrays included in other will keep their meta structure.
Data processing¶
DataSet components¶
Case and meta data¶
Quantipy
builds upon the pandas
library to feature the DataFrame
and Series
objects in the case data component of its DataSet
object.
Additionally, each DataSet
offers a metadata component to describe the
data columns and provide additional information on the characteristics of the
underlying structure. The metadata document is implemented as a nested dict
and provides the following keys
on its first level:
element | contains |
---|---|
'type' | case data type |
'info' | info on the source data |
'lib' | shared use references |
'columns' | info on DataFrame columns (Quantipy types, labels, etc.) |
'sets' | ordered groups of variables pointing to other parts of the meta |
'masks' | complex variable type definitions (arrays, dichotomous, etc.) |
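As a quick illustration (assuming a DataSet has already been loaded as dataset), the first-level keys can be listed directly from the metadata component:
>>> sorted(dataset.meta().keys())
['columns', 'info', 'lib', 'masks', 'sets', 'type']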
columns
and masks
objects¶
There are two variable collections inside a Quantipy
metadata document:
'columns' stores the meta for each accompanying pandas.DataFrame column object, while 'masks' build upon the regular 'columns' metadata but additionally employ special meta instructions to define complex data types. An example is the 'array' type that (in MR speak) maps multiple “question” variables to one “answer” object.
“Simple” data definitions supported by Quantipy can either be numeric 'float' and 'int' types, categorical 'single' and 'delimited set' variables, or of type 'string', 'date' and 'time'.
Languages: text
and text_key
mappings¶
Throughout Quantipy metadata, all label information, e.g. variable question texts and category descriptions, is stored in text objects that map different language (or context) versions of a label to a specific text_key. That way the metadata can support multi-language and multi-purpose (for example detailed/extensive vs. short question texts) label information in a digestible format that is easy to query:
>>> meta['columns']['q1']['text']
{'de-DE': 'Das ist ein langes deutsches Label',
u'en-GB': u'What is your main fitness activity?',
'x edits': {'de-DE': 'German build label', 'en-GB': 'English build label'}}
Valid text_key
settings are:
text_key | Language / context |
---|---|
'en-GB' | English |
'de-DE' | German |
'fr-FR' | French |
'da-DK' | Danish |
'sv-SV' | Swedish |
'nb-NO' | Norwegian |
'fi-FI' | Finnish |
'x edits' | Build label edit for x-axis |
'y edits' | Build label edit for y-axis |
Categorical values
object¶
single
and delimited set
variables restrict the possible case data
entries to a list of values
that consist of numeric answer codes and their
text
labels, defining distinct categories:
>>> meta['columns']['q1']['values']
[{'value': 1,
'text': {'en-GB': 'Dog'}
},
{'value': 2,
'text': {'en-GB': 'Cat'}
},
{'value': 3,
'text': {'en-GB': 'Bird'}
},
{'value': -9,
'text': {'en-GB': 'Not an animal'}
}]
The array
type¶
Turning to the masks
collection of the metadata, array
variables
group together a collection of variables that share a common response options
scheme, i.e. different statements (usually referencing a broader topic) that
are answered using the same scale. In the Quantipy
metadata document, an
array
variable has a subtype
that describes the type of the
constructing source variables listed in the items
object. In contrast to simple variable types, any
categorical values
metadata is stored inside the shared information collection
lib
, for access from both the columns
and masks
representation of
array
elements:
>>> meta['masks']['q5']
{u'items': [{u'source': u'columns@q5_1', u'text': {u'en-GB': u'Surfing'}},
{u'source': u'columns@q5_2', u'text': {u'en-GB': u'Snowboarding'}},
{u'source': u'columns@q5_3', u'text': {u'en-GB': u'Kite boarding'}},
{u'source': u'columns@q5_4', u'text': {u'en-GB': u'Parachuting'}},
{u'source': u'columns@q5_5', u'text': {u'en-GB': u'Cave diving'}},
{u'source': u'columns@q5_6', u'text': {u'en-GB': u'Windsurfing'}}],
u'name': u'q5',
u'subtype': u'single',
u'text': {u'en-GB': u'How likely are you to do each of the following in the next year?'},
u'type': u'array',
u'values': 'lib@values@q5'}
>>> meta['lib']['values']['q5']
[{u'text': {u'en-GB': u'I would refuse if asked'}, u'value': 1},
{u'text': {u'en-GB': u'Very unlikely'}, u'value': 2},
{u'text': {u'en-GB': u"Probably wouldn't"}, u'value': 3},
{u'text': {u'en-GB': u'Probably would if asked'}, u'value': 4},
{u'text': {u'en-GB': u'Very likely'}, u'value': 5},
{u'text': {u'en-GB': u"I'm already planning to"}, u'value': 97},
{u'text': {u'en-GB': u"Don't know"}, u'value': 98}]
Exploring the columns
meta of an array item shows the same values
reference pointer and informs about its parent
meta structure, i.e. the
array’s masks
definition:
>>> meta['columns']['q5_1']
{u'name': u'q5_1',
u'parent': {u'masks@q5': {u'type': u'array'}},
u'text': {u'en-GB': u'How likely are you to do each of the following in the next year? - Surfing'},
u'type': u'single',
u'values': u'lib@values@q5'}
I/O¶
Starting from native components¶
Using a standalone pd.DataFrame
¶
Quantipy can create a meta document from a standalone pd.DataFrame by inferring its variable types from its dtypes. In that process, int, float and string data types are created inside the meta component of the DataSet. In this basic form, text label information is missing. For example, given a pd.DataFrame as per:
>>> casedata = [[1000, 10, 1.2, 'text1'],
... [1001, 4, 3.4, 'jjda'],
... [1002, 8, np.NaN, 'what?'],
... [1003, 8, 7.81, '---' ],
... [1004, 5, 3.0, 'hello world!']]
>>> df = pd.DataFrame(casedata, columns=['identity', 'q1', 'q2', 'q3'])
>>> df
identity q1 q2 q3
0 1000 10 1.20 text1
1 1001 4 3.40 jjda
2 1002 8 NaN what?
3 1003 8 7.81 ---
4 1004 5 3.00 hello world!
… the conversion is adding matching metadata to the DataSet
instance:
>>> dataset = qp.DataSet(name='example', dimensions_comp=False)
>>> dataset.from_components(df)
Inferring meta data from pd.DataFrame.columns (4)...
identity: dtype: int64 - converted: int
q1: dtype: int64 - converted: int
q2: dtype: float64 - converted: float
q3: dtype: object - converted: string
>>> dataset.meta()['columns']['q2']
{'text': {'en-GB': ''}, 'type': 'float', 'name': 'q2', 'parent': {}, 'properties': {'created': True}}
.csv
/ .json
pairs¶
We can easily read in Quantipy native data with the read_quantipy() method by providing the paths to both the .csv and .json file (file extensions are handled automatically), e.g.:
>>> folder = './Data/'
>>> file_name = 'Example Data (A)'
>>> path_csv = path_json = folder + file_name
>>> dataset = qp.DataSet(name='example', dimensions_comp=False)
>>> dataset.read_quantipy(path_json, path_csv)
DataSet: ./Data/example
rows: 8255 - columns: 76
Dimensions compatibility mode: False
We can then access the case and metadata components:
>>> dataset.data()['q4'].head()
0 1
1 2
2 2
3 1
4 1
Name: q4, dtype: int64
>>> meta = dataset.meta()['columns']['q4']
>>> print json.dumps(meta, indent=2)
{
"values": [
{
"text": {
"en-GB": "Yes"
},
"value": 1
},
{
"text": {
"en-GB": "No"
},
"value": 2
}
],
"text": {
"en-GB": "Do you ever participate in sports activities with people in your household?"
},
"type": "single",
"name": "q4",
"parent": {}
}
Third party conversions¶
Supported conversions¶
In addition to providing plain .csv
/.json
data (pairs), source files
can be read into Quantipy using a number of I/O functions to deal with
standard file formats encountered in the market research industry:
Software | Format | Read | Write |
---|---|---|---|
SPSS Statistics | .sav | Yes | Yes |
SPSS Dimensions | .ddf/.mdd | Yes | Yes |
Decipher | tab-delimited .json/ .txt | Yes | No |
Ascribe | tab-delimited .xml/ .txt | Yes | No |
The following functions are designed to convert the different file formats’ structures into inputs understood by Quantipy.
SPSS Statistics¶
Reading:
>>> from quantipy.core.tools.dp.io import read_spss
>>> meta, data = read_spss(path_sav)
Note
On a Windows machine you MUST use ioLocale=None
when reading
from SPSS. This means if you are using a Windows machine your base
example for reading from SPSS is
meta, data = read_spss(path_sav, ioLocale=None)
.
When reading from SPSS you have the opportunity to specify a custom
dichotomous values map, that will be used to convert all dichotomous
sets into Quantipy delimited sets, using the dichot
argument.
The entire read operation will use the same map on all dichotomous
sets so they must be applied uniformly throughout the SAV file. The
default map that will be used if none is provided will be
{'yes': 1, 'no': 0}
.
>>> meta, data = read_spss(path_sav, dichot={'yes': 1, 'no': 2})
SPSS dates will be converted to pandas dates by default but
if this results in conversion issues or failures you can read
the dates in as Quantipy strings to deal with them later, using the
dates_as_strings
argument.
>>> meta, data = read_spss(path_sav, dates_as_strings=True)
Writing:
>>> from quantipy.core.tools.dp.io import write_spss
>>> write_spss(path_sav, meta, data)
By default SPSS files will be generated from the 'data file'
set found in meta['sets']
, but a custom set can be named instead
using the from_set
argument.
>>> write_spss(path_sav_analysis, meta, data, from_set='sav-export')
The custom set must be well-formed:
>>> "sets" : {
... "sav-export": {
... "items": [
... "columns@Q1",
... "columns@Q2",
... "columns@Q3",
... ...
... ]
... }
... }
Dimensions¶
Reading:
>>> from quantipy.core.tools.dp.io import read_dimensions
>>> meta, data = read_dimensions(path_mdd, path_ddf)
Decipher¶
Reading:
>>> from quantipy.core.tools.dp.io import read_decipher
>>> meta, data = read_decipher(path_json, path_txt)
Ascribe¶
Reading:
>>> from quantipy.core.tools.dp.io import read_ascribe
>>> meta, data = read_ascribe(path_xml, path_txt)
DataSet management¶
Setting the variable order¶
The global variable order of a DataSet
is dictated by the content of the
meta['sets']['data file']['items']
list and reflected in the structure of
the case data component’s pd.DataFrame.columns
. There are two ways to set
a new order using the order(new_order=None, reposition=None)
method:
Define a full order
Using this apporach requires that all DataSet
variable names are passed
via the new_order
parameter. Providing only a subset of the variables will
raise a ValueError
:
>>> dataset.order(['q1', 'q8'])
ValueError: 'new_order' must contain all DataSet variables.
Text…
Change positions relatively
Often only a few changes to the natural order of the DataSet
are necessary,
e.g. derived variables should be moved alongside their originating ones or specific
sets of variables (demographics, etc.) should be grouped together. We can achieve
this using the reposition
parameter as follows:
Text…
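A purely hypothetical sketch of such a call, assuming reposition accepts a (list of) {anchor: [variables to move]} mapping(s); the exact format may differ:
>>> # move 'q3_rec' next to its originating variable 'q3' (illustrative mapping format)
>>> dataset.order(reposition=[{'q3': ['q3_rec']}])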
Cloning, filtering and subsetting¶
Sometimes you want to cut the data into sections defined by either case/respondent conditions (e.g. a survey wave) or a collection of variables (e.g.
a specific part of the questionnaire). To not permanently change an existing
DataSet
by accident, draw a copy of it first:
>>> copy_ds = dataset.clone()
Then you can use filter()
to restrict cases (rows) or subset()
to keep
only a selected range of variables (columns). Both methods can be used inplace
but will return a new object by default.
>>> keep = {'Wave': [1]}
>>> copy_ds.filter(alias='first wave', condition=keep, inplace=True)
>>> copy_ds._data.shape
(1621, 76)
After the filter has been applied, the DataSet
only shows cases that contain the value 1 in the 'Wave'
variable. The filter alias (a short name
to describe the arbitrarily complex filter condition
) is attached to the
instance:
>>> copy_ds.filtered
only first wave
We are now further reducing the DataSet
by dropping all variables except the three array
variables 'q5'
, 'q6'
, and 'q7'
using subset()
.
>>> reduced_ds = copy_ds.subset(variables=['q5', 'q6', 'q7'])
We can see that only the requested variables (masks
definitions and the
constructing array
items) remain in reduced_ds
:
>>> reduced_ds.by_type()
size: 1621 single delimited set array int float string date time N/A
0 q5_1 q5
1 q5_2 q7
2 q5_3 q6
3 q5_4
4 q5_5
5 q5_6
6 q6_1
7 q6_2
8 q6_3
9 q7_1
10 q7_2
11 q7_3
12 q7_4
13 q7_5
14 q7_6
Merging¶
Intro text… As opposed to reducing an existing file…
Vertical (cases/rows) merging¶
Text
Horizontal (variables/columns) merging¶
Text
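Until the full write-up is added, here is a hedged sketch of both directions (the second DataSet objects and the 'identity' key are assumptions for illustration):
>>> # vertical: stack the cases of a second wave underneath the current ones
>>> dataset.vmerge(wave2_ds)
>>> # horizontal: add new variables for the same respondents, matched on 'identity'
>>> dataset.hmerge(recontact_ds, on='identity')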
Savepoints and state rollback¶
When working with big DataSet
s and needing to perform a lot of data
preparation (deriving large amounts of new variables, lots of meta editing,
complex cleaning, …) it can be beneficial to quickly store a snapshot of a
clean and consistent state of the DataSet
. This is most useful when working
in interactive sessions like IPython or Jupyter notebooks and might
prevent you from reloading files from disk or waiting for previous processes
to finish.
Savepoints are stored via save()
and can be restored via revert()
.
Note
Savepoints only exist in memory and are not written to disk. Only one
savepoint can exist, so repeated save()
calls will overwrite any previous
versions of the DataSet
. To permanently save your data, please use one
of the write
methods, e.g. write_quantipy()
.
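A minimal usage sketch (assuming ds is a DataSet instance; the convert() call simply stands in for any destructive editing):
>>> ds.save()                    # snapshot the current state in memory
>>> ds.convert('q1', to='int')   # ...some potentially destructive editing...
>>> ds.revert()                  # roll back to the savepoint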
Inspecting variables¶
Querying and slicing case data¶
A qp.DataSet
mimics pandas-like item access, i.e. passing a variable
name into the []
-accessor will return a pandas.DataFrame
view of the
case data component. That means that we can chain any pandas.DataFrame
method to
the query:
>>> ds['q9'].head()
q9
0 99;
1 1;4;
2 98;
3 1;4;
4 99;
There is the same support for selecting multiple variables at once:
>>> ds[['q9', 'gender']].head()
q9 gender
0 99; 1
1 1;4; 2
2 98; 1
3 1;4; 1
4 99; 1
To integrate array
(masks
) variables into this behaviour, passing an
array
name will automatically call its item list:
>>> ds['q6'].head()
q6_1 q6_2 q6_3
0 1 1 1
1 1 NaN 1
2 1 NaN 2
3 2 NaN 2
4 2 10 10
This can be combined with the list
-based selection as well:
>>> ds[['q6', 'q9', 'gender']].head()
q6_1 q6_2 q6_3 q9 gender
0 1 1 1 99; 1
1 1 NaN 1 1;4; 2
2 1 NaN 2 98; 1
3 2 NaN 2 1;4; 1
4 2 10 10 99; 1
DataSet
case data supports row-slicing based on complex logical conditions
to inspect subsets of the data. We can naturally use take() with a Quantipy logic operation for this:
>>> condition = intersection(
... [{'gender': [1]},
... {'religion': [3]},
... {'q9': [1, 4]}])
>>> take = ds.take(condition)
>>> ds[take, ['gender', 'religion', 'q9']].head()
gender religion q9
52 1 3 1;2;4;
357 1 3 1;3;4;
671 1 3 1;3;4;
783 1 3 2;3;4;
802 1 3 4;
See also
Please find an overview of Quantipy
logical operators and data slicing
and masking in the docs about complex logical conditions!
Variable and value existence¶
Methods covered: any(), all(), code_count(), is_nan(), var_exists(), codes_in_data(), is_like_numeric(), variables().
We can use variables() and var_exists() to generally test the membership of variables inside DataSet. The former shows the list of all variables registered inside the 'data file' set, the latter checks whether a variable’s name is found in either the 'columns' or 'masks' collection. For our example data, the variables are:
>>> dataset.variables()
So a test for the array
'q5'
should be positive:
>>> dataset.var_exists('q5')
True
In addition to Quantipy
’s complex logic operators, the DataSet
class
offers some quick case data operations for code existence tests. To return a
pandas.Series
of all empty rows inside a variable use is_nan()
as per:
>>> dataset.is_nan('q8').head()
0 True
1 True
2 True
3 True
4 True
Name: q8, dtype: bool
Which we can also use to quickly check the number of missing cases…
>>> dataset.is_nan('q8').value_counts()
True 5888
False 2367
Name: q8, dtype: int64
… as well as use the result as slicer for the DataSet
case data component,
e.g. to show the non-empty rows:
>>> slicer = dataset.is_nan('q8')
>>> dataset[~slicer, 'q8'].head()
7 5;
11 5;
13 1;4;
14 4;5;
23 1;4;
Name: q8, dtype: object
Especially useful for delimited set and array data, the code_count() method creates a pandas.Series holding the number of responses found per case. If applied to an array, the result is expressed across all source item variables:
>>> dataset.code_count('q6').value_counts()
3 5100
2 3155
dtype: int64
… which means that not all cases contain answers in all three of the array’s items.
With some basic pandas
we can double-check this result:
>>> pd.concat([dataset['q6'], dataset.code_count('q6')], axis=1).head()
q6_1 q6_2 q6_3 0
0 1 1.0 1 3
1 1 NaN 1 2
2 1 NaN 2 2
3 2 NaN 2 2
4 2 10.0 10 3
code_count()
can optionally ignore certain codes via the count_only
and
count_not
parameters:
>>> q2_count = dataset.code_count('q2', count_only=[1, 2, 3])
>>> pd.concat([dataset['q2'], q2_count], axis=1).head()
q2 0
0 1;2;3;5; 3
1 3;6; 1
2 NaN 0
3 NaN 0
4 NaN 0
Similarly, the any()
and all()
methods yield slicers for cases obeying
the condition that at least one / all of the provided codes are found in the
response. Again, for array
variables the conditions are extended across all
the items:
>>> dataset[dataset.all('q6', 5), 'q6']
q6_1 q6_2 q6_3
374 5 5.0 5
2363 5 5.0 5
2377 5 5.0 5
4217 5 5.0 5
5530 5 5.0 5
5779 5 5.0 5
5804 5 5.0 5
6328 5 5.0 5
6774 5 5.0 5
7269 5 5.0 5
8148 5 5.0 5
>>> dataset[dataset.all('q8', [1, 2, 3, 4, 96]), 'q8']
845 1;2;3;4;5;96;
6242 1;2;3;4;96;
7321 1;2;3;4;96;
Name: q8, dtype: object
>>> dataset[dataset.any('q8', [1, 2, 3, 4, 96]), 'q8'].head()
13 1;4;
14 4;5;
23 1;4;
24 1;3;4;
25 1;4;
Name: q8, dtype: object
Variable types¶
To get a summary of all variables grouped by type, call by_type()
on
the DataSet
:
>>> ds.by_type()
size: 8255 single delimited set array int float string date time N/A
0 gender q2 q5 record_number weight q8a start_time duration
1 locality q3 q7 unique_id weight_a q9a end_time
2 ethnicity q8 q6 age weight_b
3 religion q9 birth_day
4 q1 birth_month
5 q2b birth_year
6 q4
7 q5_1
8 q5_2
9 q5_3
10 q5_4
11 q5_5
12 q5_6
13 q6_1
14 q6_2
15 q6_3
16 q7_1
17 q7_2
18 q7_3
19 q7_4
20 q7_5
21 q7_6
We can restrict the output to certain types by providing the desired ones in
the types
parameter:
>>> ds.by_type(types='delimited set')
size: 8255 delimited set
0 q2
1 q3
2 q8
3 q9
>>> ds.by_type(types=['delimited set', 'float'])
size: 8255 delimited set float
0 q2 weight
1 q3 weight_a
2 q8 weight_b
3 q9 NaN
In addition to that, DataSet
implements the following methods
that return the corresponding variables as a list
for easy iteration:
DataSet.singles
.delimited_sets()
.ints()
.floats()
.dates()
.strings()
.masks()
.columns()
.sets()
>>> ds.delimited_sets()
[u'q3', u'q2', u'q9', u'q8']
>>> for delimited_set in ds.delimited_sets():
... print delimited_set
q3
q2
q9
q8
Slicing & dicing metadata objects¶
Although it is possible to access a DataSet
meta component via its _meta
attribute directly, the preferred way to inspect and interact with the metadata
is to use DataSet
methods. For instance, the easiest way to view the most
important meta on a variable is to use the meta()
method:
>>> ds.meta('q8')
delimited set codes texts missing
q8: Which of the following do you regularly skip?
1 1 Breakfast None
2 2 Mid-morning snacking None
3 3 Lunch None
4 4 Mid-afternoon snacking None
5 5 Dinner None
6 96 None of them None
7 98 Don't know (it varies a lot) None
This output is extended with the item
metadata if an array
is passed:
>>> ds.meta('q6')
single items item texts codes texts missing
q6: How often do you take part in any of the fo...
1 q6_1 Exercise alone 1 Once a day or more often None
2 q6_2 Join an exercise class 2 Every few days None
3 q6_3 Play any kind of team sport 3 Once a week None
4 4 Once a fortnight None
5 5 Once a month None
6 6 Once every few months None
7 7 Once every six months None
8 8 Once a year None
9 9 Less often than once a year None
10 10 Never None
If the variable is not categorical, meta()
returns simply:
>>> ds.meta('weight_a')
float
weight_a: Weight (variant A) N/A
DataSet
also provides a lot of methods to access and return the several
meta objects of a variable to make various data processing tasks easier:
Variable labels: quantipy.core.dataset.DataSet.text()
>>> ds.text('q8', text_key=None)
Which of the following do you regularly skip?
values
object: quantipy.core.dataset.DataSet.values()
>>> ds.values('gender', text_key=None)
[(1, u'Male'), (2, u'Female')]
Category codes: quantipy.core.dataset.DataSet.codes()
>>> ds.codes('gender')
[1, 2]
Category labels: quantipy.core.dataset.DataSet.value_texts()
>>> ds.value_texts('gender', text_key=None)
[u'Male', u'Female']
items
object: quantipy.core.dataset.DataSet.items()
>>> ds.items('q6', text_key=None)
[(u'q6_1', u'How often do you exercise alone?'),
(u'q6_2', u'How often do you take part in an exercise class?'),
(u'q6_3', u'How often do you play any kind of team sport?')]
Item 'columns'
sources: quantipy.core.dataset.DataSet.sources()
>>> ds.sources('q6')
[u'q6_1', u'q6_2', u'q6_3']
Item labels: quantipy.core.dataset.DataSet.item_texts()
>>> ds.item_texts('q6', text_key=None)
[u'How often do you exercise alone?',
u'How often do you take part in an exercise class?',
u'How often do you play any kind of team sport?']
Editing metadata¶
Creating meta from scratch¶
It is very easy to add new variable metadata to a DataSet
via add_meta()
which lets you create all supported variable types. Each new variable needs at
least a name
, qtype
and label
. With this information a string
,
int
, float
or date
variable can be defined, e.g.:
>>> ds.add_meta(name='new_int', qtype='int', label='My new int variable')
>>> ds.meta('new_int')
int
new_int: My new int variable N/A
Using the categories
parameter we can create categorical variables of type
single
or delimited set
. We can provide the categories
in two
different ways:
>>> name, qtype, label = 'new_single', 'single', 'My new single variable'
Providing a list of category labels (codes will be enumerated starting
from 1
):
>>> cats = ['Category A', 'Category B', 'Category C']
>>> ds.add_meta(name, qtype, label, categories=cats)
>>> ds.meta('new_single')
single codes texts missing
new_single: My new single variable
1 1 Category A None
2 2 Category B None
3 3 Category C None
Providing a list of tuples pairing codes and labels:
>>> cats = [(1, 'Category A'), (2, 'Category B'), (99, 'Category C')]
>>> ds.add_meta(name, qtype, label, categories=cats)
>>> ds.meta('new_single')
single codes texts missing
new_single: My new single variable
1 1 Category A None
2 2 Category B None
3 99 Category C None
Note
add_meta()
is preventing you from adding ill-formed or
inconsistent variable information, e.g. it is not possible to add categories
to an int
…
>>> ds.add_meta('new_int', 'int', 'My new int variable', cats)
ValueError: Numerical data of type int does not accept 'categories'.
…and you must provide categories
when trying to add categorical data:
>>> ds.add_meta(name, 'single', label, categories=None)
ValueError: Must provide 'categories' when requesting data of type single.
Similar to the usage of the categories argument, items controls the creation of an array, i.e. specifying items automatically prepares the 'masks' and 'columns' metadata. The qtype argument in this case always refers to the type of the corresponding 'columns'.
>>> name, qtype, label = 'new_array', 'single', 'My new array variable'
>>> cats = ['Category A', 'Category B', 'Category C']
Again, there are two alternatives to construct the items
object:
Providing a list of item labels (item identifiers will be enumerated
starting from 1
):
>>> items = ['Item A', 'Item B', 'Item C', 'Item D']
>>> ds.add_meta(name, qtype, label, cats, items=items)
>>> ds.meta('new_array')
single items item texts codes texts missing
new_array: My new array variable
1 new_array_1 Item A 1 Category A None
2 new_array_2 Item B 2 Category B None
3 new_array_3 Item C 3 Category C None
4 new_array_4 Item D
Providing a list of tuples pairing item identifiers and labels:
>>> items = [(1, 'Item A'), (2, 'Item B'), (97, 'Item C'), (98, 'Item D')]
>>> ds.add_meta(name, qtype, label, cats, items)
>>> ds.meta('new_array')
single items item texts codes texts missing
new_array: My new array variable
1 new_array_1 Item A 1 Category A None
2 new_array_2 Item B 2 Category B None
3 new_array_97 Item C 3 Category C None
4 new_array_98 Item D
Note
For every created variable, add_meta()
is also adding the relevant columns
into the pd.DataFrame
case data component of the DataSet
to keep
it consistent:
>>> ds['new_array'].head()
new_array_1 new_array_2 new_array_97 new_array_98
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
Renaming¶
It is possible to attach new names to DataSet
variables. Using the rename()
method will replace all former variable keys
and other mentions inside the
metadata document and exchange the DataFrame
column names. For array
variables only the 'masks'
name reference is updated by default – to rename
the corresponding items
a dict mapping item position number to new name can
be provided.
>>> ds.rename(name='q8', new_name='q8_with_a_new_name')
As mentioned, renaming a 'masks'
variable will leave the items untouched:
>>>
But we can simply provide their new names as per:
>>>
>>>
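A hedged sketch of what such a call could look like, based on the rename(name, new_name=None, array_item=None) signature from the release notes (the item names below are illustrative):
>>> ds.rename('q6', new_name='q6new',
...           array_item={1: 'q6new_alone', 2: 'q6new_class', 3: 'q6new_team'})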
Changing & adding text
info¶
All text
-related DataSet
methods expose the text_key
argument to
control to which language or context a label is added. For instance we can add
a German variable label to 'q8'
with set_variable_text()
:
>>> ds.set_variable_text(name='q8', new_text='Das ist ein deutsches Label', text_key='de-DE')
>>> ds.text('q8', 'en-GB')
Which of the following do you regularly skip?
>>> ds.text('q8', 'de-DE')
Das ist ein deutsches Label
To change the text
inside the values
or items
metadata, we can
similarly use set_value_text()
and set_item_text()
:
>>>
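A hedged sketch, assuming both methods address the value code / item position to edit as shown below (the exact signatures may differ):
>>> ds.set_value_text('q8', 1, 'Deutsches Label', text_key='de-DE')
>>> ds.set_item_text('q6', 1, 'Deutsches Item-Label', text_key='de-DE')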
When working with multiple language versions of the metadata, it might be required
to copy one language’s text
meta to another one’s, for instance if there are
no fitting translations or the correct translation is missing. In such cases you
can use force_texts()
to copy the meta of a source text_key
(specified
in the copy_from
parameter) to a target text_key
(indicated via copy_to
).
>>>
>>>
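For instance, copying all English labels over to the German text_key could look like this (a hedged sketch using the copy_from/copy_to parameters named above):
>>> ds.force_texts(copy_to='de-DE', copy_from='en-GB')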
With clean_texts()
you also have the option to replace specific characters,
terms or formatting tags (i.e. html
) from all text
metadata of the
DataSet
:
>>>
Extending the values
object¶
We can add new category definitions to existing values
meta with the
extend_values()
method. As when adding full metadata for categorical
variables, new values
can be generated by either providing only labels or
tuples of codes and labels.
>>>
While the method will never allow adding duplicated numeric values for the
categories, setting safe
to False
will enable you to add duplicated text
meta, i.e. values
could contain both
{'text': {'en-GB': 'No answer'}, 'value': 98}
and
{'text': {'en-GB': 'No answer'}, 'value': 99}
. By default, however,
the method will strictly prohibit any duplicates in the resulting values
.
>>>
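A hedged sketch (the added code/label pair is illustrative):
>>> ds.extend_values('q8', [(99, 'No answer')])
>>> ds.codes('q8')
[1, 2, 3, 4, 5, 96, 98, 99]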
Reordering the values
object¶
Removing DataSet
objects¶
Transforming variables¶
Copying¶
It’s often recommended to draw a clean copy of a variable before starting to
edit its meta or case data. With copy()
you can add a copy to the
DataSet
that is identical to the original in all respects but its name. By
default, the copy’s name will be suffixed with '_rec'
, but you can apply a
custom suffix by providing it via the suffix
argument (leaving out the
'_'
which is added automatically):
>>> ds.copy('q3')
>>> ds.copy('q3', suffix='version2')
>>> ds.delimited_sets
[u'q3', u'q2', u'q9', u'q8', u'q3_rec', u'q3_version2']
Querying the DataSet
, we can see that all three versions look identical:
>>> ds[['q3', 'q3_rec', 'q3_version2']].head()
q3 q3_rec q3_version2
0 1;2;3; 1;2;3; 1;2;3;
1 1;2;3; 1;2;3; 1;2;3;
2 1;2;3; 1;2;3; 1;2;3;
3 1;3; 1;3; 1;3;
4 2; 2; 2;
We can, however, prevent copying the case data and simply add an “empty” copy
of the variable by passing copy_data=False
:
>>> ds.copy('q3', suffix='no_data', copy_data=False)
>>> ds[['q3', 'q3_rec', 'q3_version2', 'q3_no_data']].head()
q3 q3_rec q3_version2 q3_no_data
0 1;2;3; 1;2;3; 1;2;3; NaN
1 1;2;3; 1;2;3; 1;2;3; NaN
2 1;2;3; 1;2;3; 1;2;3; NaN
3 1;3; 1;3; 1;3; NaN
4 2; 2; 2; NaN
If we wanted to only copy a subset of the case data, we could also use a
logical slicer and supply it in the copy()
operation’s
slicer
parameter:
>>> slicer = {'gender': [1]}
>>> ds.copy('q3', suffix='only_men', copy_data=True, slicer=slicer)
>>> ds[['q3', 'gender', 'q3_only_men']].head()
q3 gender q3_only_men
0 1;2;3; 1 1;2;3;
1 1;2;3; 2 NaN
2 1;2;3; 1 1;2;3;
3 1;3; 1 1;3;
4 2; 1 2;
Inplace type conversion¶
You can change the characteristics of existing DataSet
variables by
converting from one type
to another. Conversions happen inplace
, i.e.
no copy of the variable is taken prior to the operation. Therefore, you might
want to take a DataSet.copy()
before using the convert(name, to)
method.
Conversions need to modify both the meta
and data
component of the
DataSet
and are limited to transformations that keep the original and new
state of a variable consistent. The following conversions are currently
supported:
name (from-type) | to='single' | to='delimited set' | to='int' | to='float' | to='string' |
---|---|---|---|---|---|
'single' | [X] | X | X | X | X |
'delimited set' | [X] | ||||
'int' | X | [X] | X | X |
'float' | [X] | X |||||
'string' | X | X* | X* | [X] |
'date' | X | X |
* If all values of the variable are numerical, i.e. DataSet.is_like_numeric()
returns True
.
Each of these conversions will rebuild the variable metadata to match the to type. This means that, for instance, a variable that is single will lose its values object when transforming to int, while the reverse operation will create a values object that categorizes the unique numeric codes found in the
case data with their str
representation as text
meta. Consider the
variables q1
(single
) and age
(int
):
From type single
to int
:
>>> ds.meta('q1')
single codes texts missing
q1: What is your main fitness activity?
1 1 Swimming None
2 2 Running/jogging None
3 3 Lifting weights None
4 4 Aerobics None
5 5 Yoga None
6 6 Pilates None
7 7 Football (soccer) None
8 8 Basketball None
9 9 Hockey None
10 96 Other None
11 98 I regularly change my fitness activity None
12 99 Not applicable - I don't exercise None
>>> ds.convert('q1', to='int')
>>> ds.meta('q1')
int
q1: What is your main fitness activity? N/A
From type int
to single
:
>>> ds.meta('age')
int
age: Age N/A
>>> ds.convert('age', to='single')
>>> ds.meta('age')
single codes texts missing
age: Age
1 19 19 None
2 20 20 None
3 21 21 None
4 22 22 None
5 23 23 None
6 24 24 None
7 25 25 None
8 26 26 None
9 27 27 None
10 28 28 None
11 29 29 None
12 30 30 None
13 31 31 None
14 32 32 None
15 33 33 None
16 34 34 None
17 35 35 None
18 36 36 None
19 37 37 None
20 38 38 None
21 39 39 None
22 40 40 None
23 41 41 None
24 42 42 None
25 43 43 None
26 44 44 None
27 45 45 None
28 46 46 None
29 47 47 None
30 48 48 None
31 49 49 None
Banding and categorization¶
In contrast to convert()
, the categorize()
method creates a new
variable of type single
, acting as a short-hand for creating a renamed copy
and then type-transforming it. Therefore, it lets you quickly categorize
the unique values of a text
, int
or date
variable, storing
values
meta in the form of {'text': {'en-GB': str(1)}, 'value': 1}
.
>>>
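A brief sketch (assuming the int variable 'age' used in the examples above; per the release notes the default new name follows the OLD_NAME# pattern, here 'age#'):
>>> ds.categorize('age')
>>> ds.categorize('age', categorized_name='age_cat')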
Flexible banding of numeric data is provided through DataSet.band(): a banded variable is, by default, added to the DataSet under the original’s name suffixed with 'banded', e.g. 'age_banded', keeping the originating variable’s text label. The new_name and label parameters can be used to create custom variable names and labels. The banding
of the incoming data is controlled with the bands
argument that expects a
list containing int
, tuples
or dict
, where each type is used for a
different kind of group definition.
Banding with int and tuple:
- Use an int to make a band of only one value
- Use a tuple to indicate (inclusive) group limits; the values text meta is inferred
- Example: [0, (1, 10), (11, 14), 15, (16, 25)]
Banding with dict:
- The dict key will dictate the group’s text label meta
- The dict value can pick up an int/tuple (see above)
- Example: [{'A': 0}, {'B': (1, 10)}, {'C': (11, 14)}, {'D': 15}, {'E': (16, 25)}]
- Mixing allowed: [0, {'A': (1, 10)}, (11, 14), 15, {'B': (16, 25)}]
For instance, we could band 'age'
into a new variable called 'grouped_age'
with bands
being:
>>> bands = [{'Younger than 35': (19, 34)},
... (35, 39),
... {'Exactly 40': 40},
... 41,
... (42, 60)]
>>> ds.band(name='age', bands=bands, new_name='grouped_age', label=None)
>>> ds.meta('grouped_age')
single codes texts missing
grouped_age: Age
1 1 Younger than 35 None
2 2 35-39 None
3 3 Exactly 40 None
4 4 41 None
5 5 42-60 None
>>> ds.crosstab('age', 'grouped_age')
Question grouped_age. Age
Values All Younger than 35 35-39 Exactly 40 41 42-60
Question Values
age. Age All 8255 4308 1295 281 261 2110
19 245 245 0 0 0 0
20 277 277 0 0 0 0
21 270 270 0 0 0 0
22 323 323 0 0 0 0
23 272 272 0 0 0 0
24 263 263 0 0 0 0
25 246 246 0 0 0 0
26 252 252 0 0 0 0
27 260 260 0 0 0 0
28 287 287 0 0 0 0
29 270 270 0 0 0 0
30 271 271 0 0 0 0
31 264 264 0 0 0 0
32 287 287 0 0 0 0
33 246 246 0 0 0 0
34 275 275 0 0 0 0
35 258 0 258 0 0 0
36 236 0 236 0 0 0
37 252 0 252 0 0 0
38 291 0 291 0 0 0
39 258 0 258 0 0 0
40 281 0 0 281 0 0
41 261 0 0 0 261 0
42 290 0 0 0 0 290
43 267 0 0 0 0 267
44 261 0 0 0 0 261
45 257 0 0 0 0 257
46 259 0 0 0 0 259
47 243 0 0 0 0 243
48 271 0 0 0 0 271
49 262 0 0 0 0 262
Array transformations¶
Transposing arrays
DataSet
offers tools to simplify common array
variable operations.
You can switch the structure of items
vs. values
by producing the one
from the other using transpose()
. The transposition of an array will always
result in items
that have the delimited set
type in the corresponding
'columns'
metadata. That is because the transposed array collects which former items have been assigned per former value:
>>> ds.transpose('q5')
Original
>>> ds['q5'].head()
q5_1 q5_2 q5_3 q5_4 q5_5 q5_6
0 2 2 2 2 1 2
1 5 5 3 3 3 5
2 5 98 5 5 1 5
3 5 5 1 5 3 5
4 98 98 98 98 98 98
>>> ds.meta('q5')
single items item texts codes texts missing
q5: How likely are you to do each of the follow...
1 q5_1 Surfing 1 I would refuse if asked None
2 q5_2 Snowboarding 2 Very unlikely None
3 q5_3 Kite boarding 3 Probably wouldn't None
4 q5_4 Parachuting 4 Probably would if asked None
5 q5_5 Cave diving 5 Very likely None
6 q5_6 Windsurfing 97 I'm already planning to None
7 98 Don't know None
Transposition
>>> ds['q5_trans'].head()
q5_trans_1 q5_trans_2 q5_trans_3 q5_trans_4 q5_trans_5 q5_trans_97 q5_trans_98
0 5; 1;2;3;4;6; NaN NaN NaN NaN NaN
1 NaN NaN 3;4;5; NaN 1;2;6; NaN NaN
2 5; NaN NaN NaN 1;3;4;6; NaN 2;
3 3; NaN 5; NaN 1;2;4;6; NaN NaN
4 NaN NaN NaN NaN NaN NaN 1;2;3;4;5;6;
>>> ds.meta('q5_trans')
delimited set items item texts codes texts missing
q5_trans: How likely are you to do each of the ...
1 q5_trans_1 I would refuse if asked 1 Surfing None
2 q5_trans_2 Very unlikely 2 Snowboarding None
3 q5_trans_3 Probably wouldn't 3 Kite boarding None
4 q5_trans_4 Probably would if asked 4 Parachuting None
5 q5_trans_5 Very likely 5 Cave diving None
6 q5_trans_97 I'm already planning to 6 Windsurfing None
7 q5_trans_98 Don't know
The method’s ignore_items
and ignore_values
arguments can pick up
items
(indicated by their order number) and values
to leave aside
during the transposition.
Ignoring items
The new values
meta’s numerical codes will always be enumerated from 1 to
the number of valid items for the transposition, so ignoring items 2, 3 and 4
will lead to:
>>> ds.transpose('q5', ignore_items=[2, 3, 4])
>>> ds['q5_trans'].head(1)
q5_trans_1 q5_trans_2 q5_trans_3 q5_trans_4 q5_trans_5 q5_trans_97 q5_trans_98
0 2; 1;3; NaN NaN NaN NaN NaN
>>> ds.values('q5_trans')
[(1, 'Surfing'), (2, 'Cave diving'), (3, 'Windsurfing')]
Ignoring values
>>> ds.transpose('q5', ignore_values=[1, 97])
>>> ds['q5_trans'].head(1)
q5_trans_2 q5_trans_3 q5_trans_4 q5_trans_5 q5_trans_98
0 1;2;3;4;6; NaN NaN NaN NaN
>>> ds.items('q5_trans')
[('q5_trans_2', u'Very unlikely'),
('q5_trans_3', u"Probably wouldn't"),
('q5_trans_4', u'Probably would if asked'),
('q5_trans_5', u'Very likely'),
('q5_trans_98', u"Don't know")]
Ignoring both items and values
>>> ds.transpose('q5', ignore_items=[2, 3, 4], ignore_values=[1, 97])
>>> ds['q5_trans'].head(1)
q5_trans_2 q5_trans_3 q5_trans_4 q5_trans_5 q5_trans_98
0 1;3; NaN NaN NaN NaN
>>> ds.meta('q5_trans')
delimited set items item texts codes texts missing
q5_trans: How likely are you to do each of the ...
1 q5_trans_2 Very unlikely 1 Surfing None
2 q5_trans_3 Probably wouldn't 2 Cave diving None
3 q5_trans_4 Probably would if asked 3 Windsurfing None
4 q5_trans_5 Very likely
5 q5_trans_98 Don't know
Flatten item answers
flatten()
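flatten() is documented in the release notes above; a hedged usage sketch (the codes and new name are illustrative) could look like this:
>>> # count each 'q5' item towards the new variable wherever it carries code 4 or 5
>>> ds.flatten('q5', codes=[4, 5], new_name='q5_likely')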
Logic and set operators¶
Ranges¶
The frange()
function takes a string of abbreviated ranges, possibly delimited
by a comma (or some other character) and extrapolates its full,
unabbreviated list of ints.
>>> from quantipy.core.tools.dp.prep import frange
Basic range:
>>> frange('1-5')
[1, 2, 3, 4, 5]
Range in reverse:
>>> frange('15-11')
[15, 14, 13, 12, 11]
Combination:
>>> frange('1-5,7,9,15-11')
[1, 2, 3, 4, 5, 7, 9, 15, 14, 13, 12, 11]
May include spaces for clarity:
>>> frange('1-5, 7, 9, 15-11')
[1, 2, 3, 4, 5, 7, 9, 15, 14, 13, 12, 11]
Complex logic¶
Multiple conditions can be combined using union
or intersection
set
statements. Logical mappers can be arbitrarily nested as long as they are
well-formed.
union
¶
union
takes a list of logical conditions that will be treated with
or logic.
Where any of logic_A, logic_B or logic_C are True
:
>>> union([logic_A, logic_B, logic_C])
intersection
¶
intersection
takes a list of conditions that will be
treated with and logic.
Where all of logic_A, logic_B and logic_C are True
:
>>> intersection([logic_A, logic_B, logic_C])
“List” logic¶
Instead of using the verbose has_any
operator, we can express simple, non-nested
or logics simply as a list of codes. For example {"q1_1": [1, 2]}
is an
example of list-logic, where [1, 2]
will be interpreted as has_any([1, 2])
,
meaning if q1_1 has any of the values 1 or 2.
q1_1
has any of the responses 1, 2 or 3:
>>> l = {"q1_1": [1, 2, 3]}
has_any
¶
q1_1
has any of the responses 1, 2 or 3:
>>> l = {"q1_1": has_any([1, 2, 3])}
q1_1
has any of the responses 1, 2 or 3 and no others:
>>> l = {"q1_1": has_any([1, 2, 3], exclusive=True)}
not_any
¶
q1_1
doesn’t have any of the responses 1, 2 or 3:
>>> l = {"q1_1": not_any([1, 2, 3])}
q1_1
doesn’t have any of the responses 1, 2 or 3 but has some others:
>>> l = {"q1_1": not_any([1, 2, 3], exclusive=True)}
has_all
¶
q1_1
has all of the responses 1, 2 and 3:
>>> l = {"q1_1": has_all([1, 2, 3])}
q1_1
has all of the responses 1, 2 and 3 and no others:
>>> l = {"q1_1": has_all([1, 2, 3], exclusive=True)}
not_all
¶
q1_1
doesn’t have all of the responses 1, 2 and 3:
>>> l = {"q1_1": not_all([1, 2, 3])}
q1_1
doesn’t have all of the responses 1, 2 and 3 but has some others:
>>> l = {"q1_1": not_all([1, 2, 3], exclusive=True)}
has_count
¶
q1_1
has exactly 2 responses:
>>> l = {"q1_1": has_count(2)}
q1_1
has 1, 2 or 3 responses:
>>> l = {"q1_1": has_count([1, 3])}
q1_1
has 1 or more responses:
>>> l = {"q1_1": has_count([is_ge(1)])}
q1_1
has 1, 2 or 3 responses from the response group 5, 6, 7, 8 or 9:
>>> l = {"q1_1": has_count([1, 3, [5, 6, 7, 8, 9]])}
q1_1
has 1 or more responses from the response group 5, 6, 7, 8 or 9:
>>> l = {"q1_1": has_count([is_ge(1), [5, 6, 7, 8, 9]])}
not_count
¶
q1_1
doesn’t have exactly 2 responses:
>>> l = {"q1_1": not_count(2)}
q1_1
doesn’t have 1, 2 or 3 responses:
>>> l = {"q1_1": not_count([1, 3])}
q1_1
doesn’t have 1 or more responses:
>>> l = {"q1_1": not_count([is_ge(1)])}
q1_1
doesn’t have 1, 2 or 3 responses from the response group 5, 6, 7, 8 or 9:
>>> l = {"q1_1": not_count([1, 3, [5, 6, 7, 8, 9]])}
q1_1
doesn’t have 1 or more responses from the response group 5, 6, 7, 8 or 9:
>>> l = {"q1_1": not_count([is_ge(1), [5, 6, 7, 8, 9]])}
Boolean slicers and code existence¶
any()
, all()
code_count()
, is_nan()
Custom data recoding¶
The recode()
method in detail¶
This function takes a mapper of {key: logic}
entries and injects the
key into the target column where its paired logic is True. The logic
may be arbitrarily complex and may refer to any other variable or
variables in data. Where a pre-existing column has been used to
start the recode, the injected values can replace or be appended to
any data found there to begin with. Note that this function does
not edit the target column, it returns a recoded copy of the target
column. The recoded data will always comply with the column type
indicated for the target column according to the meta.
method: recode(target, mapper, default=None, append=False, intersect=None, initialize=None, fillna=None, inplace=True)
target
¶
target
controls which column meta should be used to control the
result of the recode operation. This is important because you cannot
recode multiple responses into a ‘single’-typed column.
The target
column must already exist in meta.
The recode
function is effectively a request to return a copy of
the target
column, recoded as instructed. recode
does not
edit the target
column in place, it returns a recoded copy of it.
If the target
column does not already exist in data
then a new
series, named accordingly and initialized with np.NaN
, will begin
the recode.
Return a recoded version of the column radio_stations_xb
edited
based on the given mapper:
>>> recoded = recode(
... meta, data,
... target='radio_stations_xb',
... mapper=mapper
... )
By default, recoded data resulting from the mapper will replace any data already sitting in the target column (on a cell-by-cell basis).
mapper
¶
A mapper is a dict of {value: logic}
entries where value represents
the data that will be injected for cases where the logic is True.
Here’s a simplified example of what a mapper looks like:
>>> mapper = {
... 1: logic_A,
... 2: logic_B,
... 3: logic_C,
... }
1 will be generated where logic_A
is True
, 2 where logic_B
is
True
and 3 where logic_C
is True
.
The recode function, by referencing the type indicated by the meta, will manage the complications involved in single vs delimited set data.
>>> mapper = {
... 901: {'radio_stations': frange('1-13')},
... 902: {'radio_stations': frange('14-20')},
... 903: {'radio_stations': frange('21-25')}
... }
This means: inject 901 if the column radio_stations
has any of the
values 1-13, 902 where radio_stations
has any of the values 14-20
and 903 where radio_stations
has any of the values 21-25.
default
¶
If you had lots of values to generate from the same reference column
(say most/all of them were based on radio_stations
) then we can
omit the wildcard logic format and use recode’s default parameter.
>>> recoded = recode(
... meta, data,
... target='radio_stations_xb',
... mapper={
... 901: frange('1-13'),
... 902: frange('14-20'),
... 903: frange('21-25')
... },
... default='radio_stations'
... )
This means that all unkeyed logic defaults to being keyed to radio_stations. In this case the three codes 901, 902 and 903 will be generated based on the data found in radio_stations.
You can combine this with reference to other columns, but you can only provide one default column.
>>> recoded = recode(
... meta, data,
... target='radio_stations_xb',
... mapper={
... 901: frange('1-13'),
... 902: frange('14-20'),
... 903: frange('21-25'),
... 904: {'age': frange('18-34')}
... },
... default='radio_stations'
... )
Given that logic can be arbitrarily complicated, mappers can be as well. You’ll see an example of a mapper that recodes a segmentation in Example 4, below.
append¶
If you want the recoded data to be appended to whatever may already be in the target column (this is only applicable for ‘delimited set’-typed columns), then you should use the append parameter.
>>> recoded = recode(
... meta, data,
... target='radio_stations_xb',
... mapper=mapper,
... append=True
... )
The precise behaviour of the append parameter can be seen in the following examples.
Given the following data:
>>> df['radio_stations_xb']
1 6;7;9;13;
2 97;
3 97;
4 13;16;18;
5 2;6;
Name: radio_stations_xb, dtype: object
We generate a recoded value of 901 if any of the values 1-13 are
found. With the default append=False
behaviour we will return the
following:
>>> target = 'radio_stations_xb'
>>> recode(meta, data, target, mapper)
1 901;
2 97;
3 97;
4 901;
5 901;
Name: radio_stations_xb, dtype: object
However, if we instead use append=True
, we will return the following:
>>> target = 'radio_stations_xb'
>>> recode(meta, data, target, mapper, append=True)
1 6;7;9;13;901;
2 97;
3 97;
4 13;16;18;901;
5 2;6;901;
Name: radio_stations_xb, dtype: object
intersect¶
One way to help simplify complex logical conditions, especially when
they are in some way repetitive, is to use intersect
, which
accepts any logical statement and forces every condition in the mapper
to become the intersection of both it and the intersect condition.
For example, we could limit our recode to males by giving a logical
condition to that effect to intersect
:
>>> recoded = recode(
... meta, data,
... target='radio_stations_xb',
... mapper={
... 901: frange('1-13'),
... 902: frange('14-20'),
... 903: frange('21-25'),
... 904: {'age': frange('18-34')}
... },
... default='radio_stations',
... intersect={'gender': [1]}
... )
initialize¶
You may also initialize your copy of the target column as part of your recode operation. You can initialize with either np.NaN (to overwrite anything that may already be there when your recode begins) or by naming another column. When you name another column, a copy of the data from that column is used to initialize your recode.
Initialization occurs before your recode.
>>> recoded = recode(
... meta, data,
... target='radio_stations_xb',
... mapper={
... 901: frange('1-13'),
... 902: frange('14-20'),
... 903: frange('21-25'),
... 904: {'age': frange('18-34')}
... },
... default='radio_stations',
... initialize=np.NaN
... )
>>> recoded = recode(
... meta, data,
... target='radio_stations_xb',
... mapper={
... 901: frange('1-13'),
... 902: frange('14-20'),
... 903: frange('21-25'),
... 904: {'age': frange('18-34')}
... },
... default='radio_stations',
... initialize='radio_stations'
... )
fillna¶
You may also provide a fillna value that will be used as per pd.Series.fillna() after the recode has been performed.
>>> recoded = recode(
... meta, data,
... target='radio_stations_xb',
... mapper={
... 901: frange('1-13'),
... 902: frange('14-20'),
... 903: frange('21-25'),
... 904: {'age': frange('18-34')}
... },
... default='radio_stations',
... initialize=np.NaN,
... fillna=99
... )
Custom recode examples¶
Building a net code¶
Here’s an example of copying an existing question and recoding onto it a net code.
Create the new metadata:
>>> meta['columns']['radio_stations_xb'] = copy.copy(
... meta['columns']['radio_stations']
... )
>>> meta['columns']['radio_stations_xb']['values'].append(
... {
... "value": 901,
... "text": {"en-GB": "NET: Listened to radio in past 30 days"}
... }
... )
Initialize the new column. In this case we’re starting with a copy of
the radio_stations
column:
>>> data['radio_stations_xb'] = data['radio_stations'].copy()
Recode the new column by appending the code 901 to it as indicated by the mapper:
>>> data['radio_stations_xb'] = recode(
... meta, data,
... target='radio_stations_xb',
... mapper={
... 901: {'radio_stations': frange('1-23, 92, 94, 141')}
... },
... append=True
... )
Check the result:
>>> data[['radio_stations', 'radio_stations_xb']].head(20)
radio_stations radio_stations_xb
0 5; 5;901;
1 97; 97;
2 97; 97;
3 97; 97;
4 97; 97;
5 4; 4;901;
6 11; 11;901;
7 4; 4;901;
8 97; 97;
9 97; 97;
10 97; 97;
11 92; 92;901;
12 97; 97;
13 1;13;17; 1;13;17;901;
14 6; 6;901;
15 1;5;6;10; 1;5;6;10;901;
16 6; 6;901;
17 2;4;16; 2;4;16;901;
18 6;10; 6;10;901;
19 6; 6;901;
Create-and-fill¶
Here’s an example where the value 1 is generated based on some logic and then all remaining cases are given the value 2 using the pandas.Series.fillna() method.
Create the new metadata:
>>> meta['columns']['age_xb'] = {
... 'type': 'single',
... 'text': {'en-GB': 'Age'},
... 'values': [
... {'value': 1, 'text': {'en-GB': '16-25'}},
... {'value': 2, 'text': {'en-GB': 'Others'}}
... ]
... }
Initialize the new column:
>>> data['age_xb'] = np.NaN
Recode the new column:
>>> data['age_xb'] = recode(
... meta, data,
... target='age_xb',
... mapper={
... 1: {'age': frange('16-40')}
... }
... )
Fill all cases that are still empty with the value 2:
>>> data['age_xb'].fillna(2, inplace=True)
Check the result:
>>> data[['age', 'age_xb']].head(20)
age age_xb
0 22 1
1 68 2
2 32 1
3 44 2
4 33 1
5 52 2
6 54 2
7 44 2
8 62 2
9 49 2
10 64 2
11 73 2
12 43 2
13 28 1
14 66 2
15 39 1
16 51 2
17 50 2
18 77 2
19 42 2
Numerical banding¶
Here’s a typical example of recoding age into custom bands.
In this case we're using a list comprehension to generate the first ten value objects and then concatenating that with a final '65+' value object, which doesn't follow the same label format.
Create the new metadata:
>>> meta['columns']['age_xb_1'] = {
... 'type': 'single',
... 'text': {'en-GB': 'Age'},
... 'values': [
... {
... 'value': i,
... 'text': {'en-GB': '{}-{}'.format(r[0], r[1])}
... }
... for i, r in enumerate(
... [
... [18, 20],
... [21, 25], [26, 30],
... [31, 35], [36, 40],
... [41, 45], [46, 50],
... [51, 55], [56, 60],
... [61, 65]
... ],
... start=1
... )
... ] + [
... {
... 'value': 11,
... 'text': {'en-GB': '65+'}
... }
... ]
... }
Initialize the new column:
>>> data['age_xb_1'] = np.NaN
Recode the new column:
>>> data['age_xb_1'] = recode(
... meta, data,
... target='age_xb_1',
... mapper={
... 1: frange('18-20'),
... 2: frange('21-25'),
... 3: frange('26-30'),
... 4: frange('31-35'),
... 5: frange('36-40'),
... 6: frange('41-45'),
... 7: frange('46-50'),
... 8: frange('51-55'),
... 9: frange('56-60'),
... 10: frange('61-65'),
... 11: frange('66-99')
... },
... default='age'
... )
Check the result:
>>> data[['age', 'age_xb_1']].head(20)
age age_xb_1
0 22 2
1 68 11
2 32 4
3 44 6
4 33 4
5 52 8
6 54 8
7 44 6
8 62 10
9 49 7
10 64 10
11 73 11
12 43 6
13 28 3
14 66 11
15 39 5
16 51 8
17 50 7
18 77 11
19 42 6
Complicated segmentation¶
Here’s an example of using a complicated, nested series of logic statements to recode an obscure segmentation.
The segmentation was given with the following definition:
1 - Self-directed:
- If q1_1 in [1,2] and q1_2 in [1,2] and q1_3 in [3,4,5]
2 - Validators:
- If q1_1 in [1,2] and q1_2 in [1,2] and q1_3 in [1,2]
3 - Delegators:
- If (q1_1 in [3,4,5] and q1_2 in [3,4,5] and q1_3 in [1,2])
- Or (q1_1 in [3,4,5] and q1_2 in [1,2] and q1_3 in [1,2])
- Or (q1_1 in [1,2] and q1_2 in [3,4,5] and q1_3 in [1,2])
4 - Avoiders:
- If (q1_1 in [3,4,5] and q1_2 in [3,4,5] and q1_3 in [3,4,5])
- Or (q1_1 in [3,4,5] and q1_2 in [1,2] and q1_3 in [3,4,5])
- Or (q1_1 in [1,2] and q1_2 in [3,4,5] and q1_3 in [3,4,5])
5 - Others:
- Everyone else.
Create the new metadata:
>>> meta['columns']['segments'] = {
... 'type': 'single',
... 'text': {'en-GB': 'Segments'},
... 'values': [
... {'value': 1, 'text': {'en-GB': 'Self-directed'}},
... {'value': 2, 'text': {'en-GB': 'Validators'}},
... {'value': 3, 'text': {'en-GB': 'Delegators'}},
... {'value': 4, 'text': {'en-GB': 'Avoiders'}},
... {'value': 5, 'text': {'en-GB': 'Other'}},
... ]
... }
Initialize the new column:
>>> data['segments'] = np.NaN
Create the mapper separately, since it’s pretty massive!
See the Complex logic section for more information and examples
related to the use of union
and intersection
.
>>> mapper = {
... 1: intersection([
... {"q1_1": [1, 2]},
... {"q1_2": [1, 2]},
... {"q1_3": [3, 4, 5]}
... ]),
... 2: intersection([
... {"q1_1": [1, 2]},
... {"q1_2": [1, 2]},
... {"q1_3": [1, 2]}
... ]),
... 3: union([
... intersection([
... {"q1_1": [3, 4, 5]},
... {"q1_2": [3, 4, 5]},
... {"q1_3": [1, 2]}
... ]),
... intersection([
... {"q1_1": [3, 4, 5]},
... {"q1_2": [1, 2]},
... {"q1_3": [1, 2]}
... ]),
... intersection([
... {"q1_1": [1, 2]},
... {"q1_2": [3, 4, 5]},
... {"q1_3": [1, 2]}
... ]),
... ]),
... 4: union([
... intersection([
... {"q1_1": [3, 4, 5]},
... {"q1_2": [3, 4, 5]},
... {"q1_3": [3, 4, 5]}
... ]),
... intersection([
... {"q1_1": [3, 4, 5]},
... {"q1_2": [1, 2]},
... {"q1_3": [3, 4, 5]}
... ]),
... intersection([
... {"q1_1": [1, 2]},
... {"q1_2": [3, 4, 5]},
... {"q1_3": [3, 4, 5]}
... ])
... ])
... }
Recode the new column:
>>> data['segments'] = recode(
... meta, data,
... target='segments',
... mapper=mapper
... )
Note
Anything not at the top level of the mapper will not benefit from using
the default
parameter of the recode function. In this case, for example,
saying default='q1_1'
would not have helped. Everything in a nested level
of the mapper, including anything in a union
or intersection
list,
must use the explicit dict form {"q1_1": [1, 2]}
.
Fill all cases that are still empty with the value 5:
>>> data['segments'].fillna(5, inplace=True)
Check the result:
>>> data[['q1_1', 'q1_2', 'q1_3', 'segments']].head(20)
q1_1 q1_2 q1_3 segments
0 3 3 3 4
1 3 3 3 4
2 1 1 3 1
3 1 1 2 2
4 2 2 2 2
5 1 1 5 1
6 2 3 2 3
7 2 2 3 1
8 1 1 4 1
9 3 3 3 4
10 3 3 4 4
11 2 2 4 1
12 1 1 5 1
13 2 2 4 1
14 1 1 1 2
15 2 2 4 1
16 2 2 3 1
17 1 1 5 1
18 5 5 1 3
19 1 1 4 1
Variable creation¶
Adding derived variables¶
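No separate walkthrough is given here, but a derived variable can be built with exactly the recode() workflow documented above. A minimal sketch, assuming the meta, data, recode, intersection and frange objects from the previous chapter; the variable name 'heavy_listener' and its definition are purely illustrative:
>>> meta['columns']['heavy_listener'] = {
...     'type': 'single',
...     'text': {'en-GB': 'Heavy radio listener'},
...     'values': [
...         {'value': 1, 'text': {'en-GB': 'Yes'}},
...         {'value': 2, 'text': {'en-GB': 'No'}}
...     ]
... }
>>> data['heavy_listener'] = np.NaN
>>> data['heavy_listener'] = recode(
...     meta, data,
...     target='heavy_listener',
...     mapper={
...         1: intersection([
...             {'radio_stations': frange('1-13')},
...             {'age': frange('18-34')}
...         ])
...     }
... )
>>> data['heavy_listener'].fillna(2, inplace=True)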
Interlocking variables¶
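Likewise, an interlocked variable (e.g. gender by age band) can be expressed as a recode mapper in which every category is an intersection of its source categories. A minimal sketch reusing the helpers introduced above; the name 'gender_age' and the band cut-offs are illustrative, and its column metadata would be created first, analogous to the single-coded examples above:
>>> mapper = {
...     1: intersection([{'gender': [1]}, {'age': frange('18-34')}]),
...     2: intersection([{'gender': [1]}, {'age': frange('35-99')}]),
...     3: intersection([{'gender': [2]}, {'age': frange('18-34')}]),
...     4: intersection([{'gender': [2]}, {'age': frange('35-99')}])
... }
>>> data['gender_age'] = recode(meta, data, target='gender_age', mapper=mapper)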
Condition-based code removal¶
Weights¶
Background and methodology¶
quantipy
utilizes the Rim (sometimes also called Raking) weighting method,
an iterative fitting algorithm that tries to balance out multiple sample
frequencies simultaneously. It is rooted in the mathematical model developed in
the seminal academic paper by Deming/Stephan (1940) ([DeSt40]). The following chapters
draw heavily from it.
The statistical problem¶
More often than not, market research professionals (and not only them!) are required to weight their raw data collected via a survey to match a known specific real-world distribution. This is the case when you try to weight your sample to reflect the population distribution of a certain characteristic to make it “representative” in one or more terms. Leaving unconsidered what a “representative” sample actually is in the first place, let’s see what “weighting data” comes down to and why weighting in order to achieve representativeness can be quite a difficult task. Look at the following two examples:
1. Your data contains an equal number of male and female respondents, while in the real world you know that women are a little bit more frequent than men. In relative terms you have sampled 2 percentage points too many men (and 2 too few women):
      | Sample (N=100) | Population | Factors      |
Men   | 50%            | 48%        | 48/50 = 0.96 |
Women | 50%            | 52%        | 52/50 = 1.04 |
That one is easy because you know each cell's population frequencies and can simply find the factors that will correct your sample to mirror the real-world population. To weight, you simply compute the relevant factors by dividing the desired population figure by the sample frequency and assign each case in your data the respective result (based on his or her gender). The factors come from the one-dimensional weighting matrix above.
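In code, this one-dimensional case is nothing more than dividing the two distributions. A minimal plain-pandas sketch (not Quantipy API), assuming gender is coded 1=men, 2=women:
>>> import pandas as pd
>>> sample_pct = pd.Series({1: 50.0, 2: 50.0})   # men, women in the sample
>>> target_pct = pd.Series({1: 48.0, 2: 52.0})   # known population split
>>> factors = target_pct / sample_pct
>>> factors
1    0.96
2    1.04
dtype: float64
Each case would then receive the factor belonging to its gender code as its weight.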
2. You have a survey project that requires the sample to match the gender and age distributions in real-world Germany and additionally should take into account the distribution of iPad owners and the population frequencies of the federal states.
Again, to weight the data you would need to calculate the cell ratios of target vs. sample figures for the different sample characteristics. While you may be able to find the joint distribution of age categories by gender, you will have a hard time coming up e.g. with the correct figures for a joint distribution of iPad owners per federal state by gender and age group.
To put it differently: You will not know the population's cell target figures for all weighting dimensions in all relevant cells of the multi-dimensional weighting matrix. Since you need this information to assign each case a weight factor that produces the correct weighted distributions for the four sample characteristics, you would not be able to weight the data. The table below illustrates the complexity of such a weighting scheme:
╔═════════╦═════════╦═══════════════════════╦═══════════════════════╦═════╗
║ State: ║ ║ Bavaria ║ Saxony ║ ║
╠═════════╬═════════╬═══════╦═══════╦═══════╬═══════╦═══════╦═══════╬═════╣
║ Age: ║ ║ 18-25 ║ 26-35 ║ 36-55 ║ 18-25 ║ 26-35 ║ 36-55 ║ ... ║
╠═════════╬═════════╬═══╦═══╬═══╦═══╬═══╦═══╬═══╦═══╬═══╦═══╬═══╦═══╬═════╣
║ Gender: ║ ║ m ║ f ║ m ║ f ║ m ║ f ║ m ║ f ║ m ║ f ║ m ║ f ║ ... ║
╠═════════╬═════════╬═══╬═══╬═══╬═══╬═══╬═══╬═══╬═══╬═══╬═══╬═══╬═══╬═════╣
║ ║ iPad ║ ? ║ ? ║ ? ║ ? ║ ? ║ ? ║ ? ║ ? ║ ? ║ ? ║ ? ║ ? ║ ? ║
╠═════════╬═════════╬═══╬═══╬═══╬═══╬═══╬═══╬═══╬═══╬═══╬═══╬═══╬═══╬═════╣
║ ║ no iPad ║ ? ║ ? ║ ? ║ ? ║ ? ║ ? ║ ? ║ ? ║ ? ║ ? ║ ? ║ ? ║ ? ║
╚═════════╩═════════╩═══╩═══╩═══╩═══╩═══╩═══╩═══╩═══╩═══╩═══╩═══╩═══╩═════╝
Note that you would also need to take into account the other joint distributions of age by gender per federal state, iPad owners by age, and so on to get the correct weight factors step by step: all cross-tabulation information for the population that will not be available to you. Additionally, even if you had all the information necessary for your calculations, try to imagine the amount of work required to come up with the weight factors per cell: getting all possible combinations right, then creating variables, recoding those variables and finally computing the ratios.
What is available regularly, however, is the distribution of people living in Germany’s federal states and the distribution of iPad owners in general (as per “Yes, have one,” “do not own one”), plus the age and gender frequencies. This is where rim weighting comes into play.
Rim weighting concept¶
Rim weighting in short can be described as an iterative data fitting process that aims to apply a weight factor to each respondent’s case record in order to match the target figures by altering the sample cell frequencies relevant to the weighting matrix. Doing that, it will find the single cell’s ratios that are required to come up with the correct targets per weight dimension – it will basically estimate all the joint distribution information that is unknown.
The way this works can be summarized as follows: For each interlocking cell coming from all categories of all the variables that are being weighted to, the algorithm computes the proportion needed in that specific cell so that, summed per column (or respectively per row), the column (row) total per category matches the target distribution. However, having balanced a column total to match, the row totals will typically be off. This is where one iteration ends and another one begins, now starting with the weighted values from the previous run. This iterative process continues until a satisfying result is reached, i.e. an acceptably low amount of mismatch between the produced sample results and the weight targets.
In short: Simultaneous adjustment of all weight variables with the smallest amount of data manipulation possible while forcing the maximum match between sample and weight scheme.
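The core of that iteration can be illustrated with a toy, pure-NumPy version of the fitting loop. This is only a sketch of the general raking idea, not Quantipy's Rake implementation, and the cell counts and targets are made up:
>>> import numpy as np
>>> observed = np.array([[30., 20.],      # toy joint sample counts, e.g. gender x age group
...                      [25., 25.]])
>>> row_targets = np.array([48., 52.])    # desired marginal totals (same overall total)
>>> col_targets = np.array([60., 40.])
>>> fitted = observed.copy()
>>> for _ in range(100):
...     fitted *= (row_targets / fitted.sum(axis=1))[:, None]   # balance the row totals
...     fitted *= (col_targets / fitted.sum(axis=0))[None, :]   # balance the column totals
...     if np.allclose(fitted.sum(axis=1), row_targets):
...         break
>>> factors = fitted / observed   # per-cell weight factors applied to the cases in each cell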
References
[DeSt40] Deming, W. Edwards; Stephan, Frederick F. (1940): On a Least Squares Adjustment of a Sampled Frequency Table When the Expected Marginal Totals are Known. In: Ann. Math. Statist. 11, no. 4, pp. 427-444.
Weight scheme setup¶
Using the Rim class¶
The Rim
object’s purpose is to define the required setup of the weighting process, i.e. the weight scheme that should be used to compute the actual factor results per case in the dataset. While its main purpose is to provide a simple interface to structure weight schemes of all complexities, it also offers advanced options that control the underlying weighting algorithm itself and thus might impact the results.
To start working with a Rim
object, we only need to think of a name for our scheme:
>>> scheme = qp.Rim('my_first_scheme')
Target distributions¶
A major and (probably the most important) step in specifying a weight scheme
is mapping the desired target population proportions to the categories of the related variables inside the data. This is done via a dict
mapping.
For example, to equally weight female and male respondents in our sample, we simply define:
>>> gender_targets = {}
>>> gender_targets['gender'] = {1: 50.0, 2: 50.0}
>>> gender_targets
{'gender': {1: 50.0, 2: 50.0}}
Since we are normally dealing with multiple variables at once, we collect
them in a list
, adding other variables naturally in the same way:
>>> dataset.band('age', [(19, 25), (26, 35), (36, 49)])
>>> age_targets = {'age_banded': {1: 45.0, 2: 29.78, 3: 25.22}}
>>> all_targets = [gender_targets, age_targets]
The set_targets()
method can now use the all_targets
list to apply the target distributions to the Rim
weight scheme setup (we are also providing an optional name for our group of variables).
>>> scheme.set_targets(targets=all_targets, group_name='basic weights')
The Rim instance now also allows inspecting these targets (you can see the group_name parameter reflected here; it would fall back to '_default_name_' if none was provided):
>>> scheme.groups['basic weights']['targets']
[{'gender': {1: 50.0, 2: 50.0}}, {'age_banded': {1: 45.0, 2: 29.78, 3: 25.22}}]
Weight groups and filters¶
For more elaborate weight schemes, we instead use the add_group() method, which is effectively a generalized version of set_targets() that supports addressing subsets of the data by filtering. For example, differing target distributions (or even the scheme-defining variables of interest) might be required across several market segments or between survey periods.
We can illustrate this using the variable 'Wave'
from the dataset:
>>> dataset.crosstab('Wave', text=True, pct=True)
Question Wave. Wave
Values @
Question Values
Wave. Wave All 100.0
Wave 1 19.6
Wave 2 20.2
Wave 3 20.5
Wave 4 19.8
Wave 5 19.9
Let’s assume we want to use the original targets for the first three waves but the remaining two waves need to reflect some changes in both gender and the age distributions. We first define a new set of targets that should apply only to the waves 4 and 5:
>>> gender_targets_2 = {'gender': {1: 30.0, 2: 70.0}}
>>> age_targets_2 = {'age_banded': {1: 35.4, 2: 60.91, 3: 3.69}}
>>> all_targets_2 = [gender_targets_2, age_targets_2]
We then set the filter expressions for the respective subsets of the data, as per:
>>> filter_wave1 = 'Wave == 1'
>>> filter_wave2 = 'Wave == 2'
>>> filter_wave3 = 'Wave == 3'
>>> filter_wave4 = 'Wave == 4'
>>> filter_wave5 = 'Wave == 5'
And add our weight specifications accordingly:
>>> scheme = qp.Rim('my_complex_scheme')
>>> scheme.add_group(name='wave 1', filter_def=filter_wave1, targets=all_targets)
>>> scheme.add_group(name='wave 2', filter_def=filter_wave2, targets=all_targets)
>>> scheme.add_group(name='wave 3', filter_def=filter_wave3, targets=all_targets)
>>> scheme.add_group(name='wave 4', filter_def=filter_wave4, targets=all_targets_2)
>>> scheme.add_group(name='wave 5', filter_def=filter_wave5, targets=all_targets_2)
Note
For historical reasons, the logic operators currently do not work within the Rim module. This means that all filter definitions need to be valid string expressions suitable for the pandas.DataFrame.query() method. We are planning to remove this limitation as soon as possible to enable easier and more complex filters that are consistent with the rest of the library.
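Until then, anything pandas.DataFrame.query() accepts can be used, so compound filters are possible as plain strings. A purely illustrative example (the group name is made up and not part of the scheme built above):
>>> filter_men_w4 = '(gender == 1) and (Wave == 4)'
>>> scheme.add_group(name='men wave 4', filter_def=filter_men_w4, targets=all_targets_2)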
Setting group targets¶
At this stage it might also be necessary to balance out the survey waves themselves in a certain way, e.g. make each wave count exactly the same (as you can see above, each wave accounts for roughly, but not exactly, 20% of the full sample).
With Rim.group_targets()
we can apply an outer weighting to the between
group distribution while keeping the already set inner target proportions within each of them. Again we are using a dict
, this time mapping the
group names from above to the desired outcome percentages:
>>> group_targets = {'wave 1': 20.0,
... 'wave 2': 20.0,
... 'wave 3': 20.0,
... 'wave 4': 20.0,
... 'wave 5': 20.0}
>>> scheme.group_targets(group_targets)
To sum it up: Our weight scheme consists of five groups based on 'Wave' that need to match two different sets of target distributions on the 'gender' and 'age_banded' variables, with each group coming out as 20% of the full sample.
Integration within DataSet¶
The computational core of the weighting algorithm is the
quantipy.core.weights.rim.Rake
class which can be accessed by working
with qp.WeightEngine()
, but it is much easier to directly use the DataSet.weight()
method. Its full signature looks as follows:
DataSet.weight(weight_scheme,
weight_name='weight',
unique_key='identity',
subset=None,
report=True,
path_report=None,
inplace=True,
verbose=True)
Weighting and weighted aggregations¶
As can be seen, we can simply provide our weight scheme Rim
instance to
the method. Since the dataset already contains a variable called 'weight'
(and we do not want to overwrite that one) we set weight_name
to be
'weights_new'
. We also need to set unique_key='unique_id'
as that is our
identifying key variable (that is needed to map the weight factors back into our
dataset):
>>> dataset.weight(scheme, weight_name='weights_new', unique_key='unique_id')
Before we take a look at the report that is printed (because of report=True
),
we want to manually check our results. For that, we can simply analyze some cross-tabulations, weighted by our new weights! For a start, we check if we arrived at
the desired proportions for 'gender'
and 'age_banded'
per 'Wave'
:
>>> dataset.crosstab(x='gender', y='Wave', w='weights_new', pct=True)
Question Wave. Wave
Values All Wave 1 Wave 2 Wave 3 Wave 4 Wave 5
Question Values
gender. What is your gender? All 100.0 100.0 100.0 100.0 100.0 100.0
Male 42.0 50.0 50.0 50.0 30.0 30.0
Female 58.0 50.0 50.0 50.0 70.0 70.0
>>> dataset.crosstab(x='age_banded', y='Wave', w='weights_new', pct=True,
... decimals=2)
Question Wave. Wave
Values All Wave 1 Wave 2 Wave 3 Wave 4 Wave 5
Question Values
age_banded. Age All 100.00 100.00 100.00 100.00 100.00 100.00
19-25 41.16 45.00 45.00 45.00 35.40 35.40
26-35 42.23 29.78 29.78 29.78 60.91 60.91
36-49 16.61 25.22 25.22 25.22 3.69 3.69
Both results accurately reflect the desired proportions from our scheme. And we
can also verify the weighted distribution of 'Wave'
, now completely
balanced:
>>> dataset.crosstab(x='Wave', w='weights_new', pct=True)
Question Wave. Wave
Values @
Question Values
Wave. Wave All 100.0
Wave 1 20.0
Wave 2 20.0
Wave 3 20.0
Wave 4 20.0
Wave 5 20.0
The isolated weight dataframe¶
By default, the weighting operates inplace
, i.e. the weight vector will
be placed into the DataSet
instance as a regular columns
element:
>>> dataset.meta('weights_new')
float
weights_new: my_first_scheme weights N/A
>>> dataset['weights_new'].head()
unique_id weights_new
0 402891 0.885593
1 27541022 1.941677
2 335506 0.984491
3 22885610 1.282057
4 229122 0.593834
It is also possible to return a new pd.DataFrame
that contains all relevant Rim
scheme variables incl. the factor vector for external use cases or further
analysis:
>>> wdf = dataset.weight(scheme, weight_name='weights_new', unique_key='unique_id',
...                      inplace=False)
>>> wdf.head()
unique_id gender age_banded weights_new Wave
0 402891 1 1.0 0.885593 4
1 27541022 2 1.0 1.941677 1
2 335506 1 2.0 0.984491 3
3 22885610 1 2.0 1.282057 5
4 229122 1 3.0 0.593834 1
Diagnostics¶
We have not yet taken a look at the default weight report, which offers some additional information on the weighting outcome and even the algorithm process itself (the report lists the internal weight variable name, which is always just the scheme name prefixed with 'weights_'):
Weight variable weights_my_complex_scheme
Weight group wave 1 wave 2 wave 3 wave 4 wave 5
Weight filter Wave == 1 Wave == 2 Wave == 3 Wave == 4 Wave == 5
Total: unweighted 1621.000000 1669.000000 1689.000000 1637.000000 1639.000000
Total: weighted 1651.000000 1651.000000 1651.000000 1651.000000 1651.000000
Weighting efficiency 74.549628 78.874120 77.595143 53.744060 50.019937
Iterations required 13.000000 8.000000 11.000000 12.000000 10.000000
Mean weight factor 1.018507 0.989215 0.977501 1.008552 1.007322
Minimum weight factor 0.513928 0.562148 0.518526 0.053652 0.050009
Maximum weight factor 2.243572 1.970389 1.975681 2.517704 2.642782
Weight factor ratio 4.365539 3.505106 3.810189 46.926649 52.846124
The weighting efficiency¶
After all, getting the sample to match to the desired population proportions always comes at a cost. This cost is captured in a statistical measure called the weighting efficiency and is featured in the report as well. It is a metric for evaluation of the sample vs. targets match, i.e. the sample balance compared to the weight scheme. You can also inversely view it as the amount of distortion that was needed to arrive at the weighted figures, that is, how much the data is manipulated by the weighting. A low efficiency indicates a larger bias introduced by the weights.
Let \(w\) denote our weight vector containing the factor for each respondent \(i\) and \(n\) the number of cases; then the mathematical definition of the (total) weighting efficiency \(we\) is:
\[ we = \frac{\left(\sum_{i} w_i\right)^2 / n}{\sum_{i} w_i^2} \times 100 \]
which is the squared sum of weights divided by the number of cases, divided again by the sum of squared weights (expressed as a percentage).
We can manually check the figure for group 'wave 1'. We first recreate the filter that has been used, from which we can also derive the number of cases n:
>>> f = dataset.take({'Wave': [1]})
>>> n = len(f)
>>> n
1621
The sum of weights squared sws
is then:
>>> sws = (dataset[f, 'weights_new'].sum()) ** 2
>>> sws
2725801.0
And the sum of squared weights ssw
:
>>> ssw = (dataset[f, 'weights_new']**2).sum()
>>> ssw
2255.61852968
Which enables us to calculate the weighting efficiency we
as per:
>>> we = (sws / n) / ssw * 100
>>> we
74.5496275503
Generally, weighting efficiency results below the 80% mark indicate a high sample vs. population mismatch. Dropping below 70% should be a reason to reexamine the weight scheme specifications or analysis design.
To better understand why the weighting efficiency is good for judging the quality of the weighting, we can look at its relation to the effective sample size (the effective base). In our example, the effective base of the weight group would be around 0.75 * 1621 = 1215.75. This means that we are dealing with an effective sample of only about 1216 cases for weighted statistical analysis and inference. In other words, the weighting reduces the reliability of the sample as if we had sampled roughly 400 (about 25%) fewer respondents.
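Expressed in code, the effective base follows directly from the efficiency computed above (the ~1216 quoted in the text uses the efficiency rounded to 75%):
>>> eff_base = we / 100 * n   # ~0.745 * 1621, i.e. roughly 1208 effective cases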
Gotchas¶
[A] Subsets and targets
In the example we have defined five weight groups, one for each of the waves, although we only had two differing sets of targets we wanted to match. One could be tempted to only set two weight groups because of this, using the filters:
>>> f1 = 'Wave in [1, 2, 3]'
and
>>> f2 = 'Wave in [4, 5]'
It is crucial to remember that the algorithm is applied on the weight group’s overall data base, i.e. the above definition would achieve the targets inside the two groups (Waves 1/2/3 and Waves 4/5) and not within each of the waves.
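For contrast, the tempting shortcut would look like this (the scheme and group names are illustrative); it is perfectly valid code, but the targets would then only be met for the pooled Waves 1-3 and Waves 4-5, not within each individual wave:
>>> pooled = qp.Rim('pooled_scheme')
>>> pooled.add_group(name='waves 1-3', filter_def=f1, targets=all_targets)
>>> pooled.add_group(name='waves 4-5', filter_def=f2, targets=all_targets_2)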
Batch¶
qp.Batch is a subclass of qp.DataSet and is a container for structuring a qp.Link collection's specifications.
qp.Batch is not only a subclass of qp.DataSet, it also takes a DataSet instance as an input argument, inheriting a few of its attributes, e.g. _meta, _data, valid_tks and text_key.
All other Batch attributes are used as construction plans for populating a qp.Stack; these get stored in the belonging DataSet meta component in _meta['sets']['batches'][batchname].
In general, it does not matter in which order Batch attributes are set by methods; the class ensures that all attributes are kept consistent.
All of the following sections work with this qp.DataSet instance:
import quantipy as qp
dataset = qp.DataSet('Example Data (A)')
dataset.read_quantipy('Example Data (A).json', 'Example Data (A).csv')
The json and csv files can be found in quantipy/tests.
Creating/ Loading a qp.Batch instance¶
As mentioned, a Batch
instance has a close connection to its belonging
DataSet
instance and we can easily create a new Batch
from a DataSet
as per:
batch1 = dataset.add_batch(name='batch1')
batch2 = dataset.add_batch(name='batch2', ci=['c'], weights='weight')
It is also possible to load an already existing instance out of the meta
stored in dataset._meta['sets']['batches']
:
batch = dataset.get_batch('batch1')
Both methods, .add_batch()
and .get_batch()
, are an easier way to
use the __init__()
method of qp.Batch
.
Another way to get a new qp.Batch instance is to copy an existing one; in that case all added open ends are removed from the new instance:
copy_batch = batch.copy('copy_of_batch1')
Adding variables to a qp.Batch instance¶
x-keys and y-keys¶
The variables included in a Batch constitute the main structure for the qp.Stack construction plan. Variables can be added as x-keys or y-keys; for arrays, all belonging items are automatically added and the qp.Stack gets populated with all cross-tabulations of these keys:
>>> batch.add_x(['q1', 'q2', 'q6'])
>>> batch.add_y(['gender', 'q1'])
Array summaries setup: Creating ['q6'].
X-specific y-keys can be produced by manipulating the main y-keys; this edit can extend or replace the existing keys:
>>> batch.extend_y(['locality', 'ethnicity'], on=['q1'])
>>> batch.replace_y(['locality', 'ethnicity'], on=['q2'])
With these settings the construction plan looks like this:
>>> print batch.x_y_map
OrderedDict([('q1', ['@', 'gender', 'q1', 'locality', 'ethnicity']),
('q2', ['locality', 'ethnicity']),
('q6', ['@']),
(u'q6_1', ['@', 'gender', 'q1']),
(u'q6_2', ['@', 'gender', 'q1']),
(u'q6_3', ['@', 'gender', 'q1'])])
Arrays¶
A special case exists if the added variables contain arrays. By default, array summaries are created for all arrays in the x-keys (array as x-key and '@'-referenced total as y-key), see the output above (Array summaries setup: Creating ['q6'].).
If array summaries are requested only for a selection of variables or for none,
use .make_summaries()
:
>>> batch.make_summaries(None)
Array summaries setup: Creating no summaries!
Arrays can also be transposed ('@'-referenced total as x-key and array name as y-key). If they are not already in the batch summary list, they are automatically added, and depending on the replace parameter either only the transposed or both types of summaries are added to the qp.Stack:
>>> batch.transpose_array('q6', replace=False)
Array summaries setup: Creating ['q6'].
The construction plan now shows that both summary types are included:
>>> print batch.x_y_map
OrderedDict([('q1', ['@', 'gender', 'q1', 'locality', 'ethnicity']),
('q2', ['locality', 'ethnicity']),
('q6', ['@']),
('@', ['q6']),
(u'q6_1', ['@', 'gender', 'q1']),
(u'q6_2', ['@', 'gender', 'q1']),
(u'q6_3', ['@', 'gender', 'q1'])])
Verbatims/ open ends¶
Another special case are verbatims. They will not be aggregated in a qp.Stack
,
but they have to be defined in a qp.Batch
to add them later to a qp.Cluster
.
There are two different ways to add verbatims: Either all to one qp.Cluster
key or each gets its own key. But both options can be done with the same method.
For splitting the verbatims, set split=True
and insert as many titles as
included verbatims/ open ends:
>>> batch.add_open_ends(['q8a', 'q9a'], break_by=['record_number', 'age'],
...                     split=True, title=['oe_q8', 'oe_q9'])
For collecting all verbatims in one Cluster key, set split=False
and add
only one title
or use the default parameters:
>>> batch.add_open_ends(['q8a', 'q9a'], break_by=['record_number', 'age'])
Special aggregations¶
It is possible to add some special aggregations to a qp.Batch
, that are
not stored in the main construction plan .x_y_map
. One option is to give a
name for a Cluster key in which all y-keys are cross-tabulated against each
other:
>>> batch.add_y_on_y('y-keys')
Another possibility is to add a qp.Batch instance to another instance. The added Batch loses all information about verbatims and .y_on_y, which means only the main construction plan in .x_y_map gets adopted. Each of the two batches is aggregated separately in the qp.Stack, but the added instance gets included in the qp.Cluster of the first qp.Batch under a key named after its instance name.
>>> batch1 = dataset.get_batch('batch1')
>>> batch2 = dataset.get_batch('batch2')
>>> batch2.add_x('q2b')
Array summaries setup: Creating no summaries!
>>> batch2.add_y('gender')
>>> batch2.as_addition('batch1')
Batch 'batch2' specified as addition to Batch 'batch1'. Any open end summaries and 'y_on_y' agg. have been removed!
The connection between the two qp.Batch instances can be seen in .additional for the added instance and in ._meta['sets']['batches']['batchname']['additions'] for the first instance.
Set properties of a qp.Batch¶
The previous section explained how the main construction plan (batch.x_y_map) is built, describing which x-keys and y-keys are used to add qp.Links to a qp.Stack. Now you will learn how the missing information for the Links is defined and which specific views get extracted for the qp.Cluster by adding some property options to the qp.Batch instance.
Filter, weights and significance testing¶
qp.Links can be added to a qp.Stack data_key level by defining their x- and y-keys, which is already done in .x_y_map, and by setting a filter. This property can be edited in a qp.Batch instance with the following methods:
>>> batch.add_filter('men only', {'gender': 1})
>>> batch.extend_filter({'q1': {'age': [20, 21, 22, 23, 24, 25]}})
Filters can be added globally or for a selection of x-keys only. From the global filter, .sample_size is automatically calculated for each qp.Batch definition.
Now all information is collected in the qp.Batch instance and the Stack can be populated with Links in the form stack[data_key][filter_key][x_key][y_key].
For each Link, qp.Views can be added; these views depend on a weight definition, which is also set in the qp.Batch:
>>> batch.set_weights(['weight_a'])
Significance tests are a special View; the sig. levels on which they are calculated can be added to the qp.Batch like this:
>>> batch.set_sigtests(levels=[0.05])
Cell items and language¶
As qp.Stack is a container for a large number of aggregations, it will accommodate various qp.Views. The qp.Batch property .cell_items is used to define which specific Views will be taken to create a qp.Cluster:
>>> batch.set_cell_items(['c', 'p'])
The property .language
allows the user to define which text
labels from
the meta data should be used for the extracted Views
by entering a valid
text key:
>>> batch.set_language('en-GB')
Inherited qp.DataSet methods¶
Being a qp.DataSet subclass, qp.Batch inherits some of its methods. The most important ones are those that allow the manipulation of the meta component. That means meta-edits can be applied globally (run the methods on qp.DataSet) or Batch-specifically (run the methods on qp.Batch). Batch meta-edits always overwrite global meta-edits, and while building a qp.Cluster from a qp.Batch, the modified meta information is taken from .meta_edits.
The following methods can be used to create meta-edits for a qp.Batch:
>>> batch.hiding('q1', [2], axis='y')
>>> batch.sorting('q2', fix=[97, 98])
>>> batch.slicing('q1', [1, 2, 3, 4, 5], axis='x')
>>> batch.set_variable_text('gender', 'Gender???')
>>> batch.set_value_texts('gender', {1: 'Men', 2: 'Women'})
>>> batch.set_property('q1', 'base_text', 'This var has a second filter.')
Some methods are not allowed to be used for a Batch
. These will raise a
NotImplementedError
to prevent inconsistent case and meta data states.
Analysis & aggregation¶
Collecting aggregations¶
All computational results are collected in a so-called qp.Stack object, which acts as a container for a large number of aggregations in the form of qp.Links.
What is a qp.Link?¶
A qp.Link
is defined by four attributes that make it unique and set how it is
stored in a qp.Stack
. These four attributes are data_key
, filter
,
x
(downbreak) and y
(crossbreak), which are positioned in a qp.Stack
similar to a tree diagram:
- Each Stack can have various data_keys.
- Each data_key can have various filters.
- Each filter can have various xs.
- Each x can have various ys.
Consequently qp.Stack[dk][filter][x][y] is one qp.Link that can be added using add_link(self, data_keys=None, filters=['no_filter'], x=None, y=None, ...).
qp.Links store different qp.Views (frequencies, statistics, etc. - all kinds of computations) that are applied to the same four data attributes.
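For a single Link, the signature above translates into a call like the following. A minimal sketch, assuming a stack that already holds the 'Example Data (A)' data_key; in practice the Batch-driven populate() workflow shown next is preferred:
>>> stack.add_link(data_keys='Example Data (A)', filters=['no_filter'],
...                x=['q2b'], y=['gender'])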
Populating a qp.Stack¶
A qp.Stack is able to cope with a large number of aggregations, so it is impractical to add Links one by one with repeated Stack.add_link() calls. It is much easier to create a "construction plan" using a qp.Batch and apply the settings saved in DataSet._meta['sets']['batches'] to populate a qp.Stack instance. In the following, let's assume dataset contains the definitions of two qp.Batches; a qp.Stack can then be created by running:
stack = dataset.populate(batches='all')
For the Batch
definitions from here, you
will get the following construction plans:
>>> batch1 = dataset.get_batch('batch1')
>>> batch1.add_y_on_y('y_keys')
>>> print batch1.x_y_map
OrderedDict([('q1', ['@', 'gender', 'q1', 'locality', 'ethnicity']),
('q2', ['locality', 'ethnicity']),
('q6', ['@']),
('@', ['q6']),
(u'q6_1', ['@', 'gender', 'q1']),
(u'q6_2', ['@', 'gender', 'q1']),
(u'q6_3', ['@', 'gender', 'q1'])])
>>> print batch1.x_filter_map
OrderedDict([('q1', {'(men only)+(q1)': (<function _intersection at 0x0000000019AE06D8>, [{'gender': 1}, {'age': [20, 21, 22, 23, 24, 25]}])}),
('q2', {'men only': {'gender': 1}}),
('q6', {'men only': {'gender': 1}}),
('q6_1', {'men only': {'gender': 1}}),
('q6_2', {'men only': {'gender': 1}}),
('q6_3', {'men only': {'gender': 1}})])
>>> batch2 = dataset.get_batch('batch2')
>>> print batch2.x_y_map
OrderedDict([('q2b', ['@', 'gender'])])
>>> print batch2.x_filter_map
OrderedDict([('q2b', 'no_filter')])
As both Batches refer to the same data file, the same data_key (in this case the name of dataset) defines all Links.
After populating, the Stack content can be viewed using .describe():
>>> stack.describe()
data filter x y view #
0 Example Data (A) men only q1 q1 NaN 1
1 Example Data (A) men only q1 @ NaN 1
2 Example Data (A) men only q1 gender NaN 1
3 Example Data (A) men only @ q6 NaN 1
4 Example Data (A) men only q2 ethnicity NaN 1
5 Example Data (A) men only q2 locality NaN 1
6 Example Data (A) men only q6_1 q1 NaN 1
7 Example Data (A) men only q6_1 @ NaN 1
8 Example Data (A) men only q6_1 gender NaN 1
9 Example Data (A) men only q6_2 q1 NaN 1
10 Example Data (A) men only q6_2 @ NaN 1
11 Example Data (A) men only q6_2 gender NaN 1
12 Example Data (A) men only q6_3 q1 NaN 1
13 Example Data (A) men only q6_3 @ NaN 1
14 Example Data (A) men only q6_3 gender NaN 1
15 Example Data (A) men only gender q1 NaN 1
16 Example Data (A) men only gender @ NaN 1
17 Example Data (A) men only gender gender NaN 1
18 Example Data (A) men only q6 @ NaN 1
19 Example Data (A) (men only)+(q1) q1 q1 NaN 1
20 Example Data (A) (men only)+(q1) q1 @ NaN 1
21 Example Data (A) (men only)+(q1) q1 locality NaN 1
22 Example Data (A) (men only)+(q1) q1 ethnicity NaN 1
23 Example Data (A) (men only)+(q1) q1 gender NaN 1
24 Example Data (A) no_filter q2b @ NaN 1
25 Example Data (A) no_filter q2b gender NaN 1
You can find all combinations defined in the x_y_map in the Stack structure, but Links like Stack['Example Data (A)']['men only']['gender']['gender'] are also included. These special cases arise from the y_on_y setting. Sometimes it is helpful to group a describe-dataframe and create a cross-tabulation of the four Link attributes to get a better overview, e.g. to see how many Links are included for each x-filter combination:
>>> stack.describe('x', 'filter')
filter (men only)+(q1) men only no_filter
x
@ NaN 1.0 NaN
gender NaN 3.0 NaN
q1 5.0 3.0 NaN
q2 NaN 2.0 NaN
q2b NaN NaN 2.0
q6 NaN 1.0 NaN
q6_1 NaN 3.0 NaN
q6_2 NaN 3.0 NaN
q6_3 NaN 3.0 NaN
The computational engine¶
Significance testing¶
View aggregation¶
All of the following examples work with a qp.Stack that was populated from a qp.DataSet including the following qp.Batch definitions:
>>> batch1 = dataset.get_batch('batch1')
>>> batch1.add_y_on_y('y_keys')
>>> print batch1.x_y_map
OrderedDict([('q1', ['@', 'gender', 'q1', 'locality', 'ethnicity']),
('q2', ['locality', 'ethnicity']),
('q6', ['@']),
('@', ['q6']),
(u'q6_1', ['@', 'gender', 'q1']),
(u'q6_2', ['@', 'gender', 'q1']),
(u'q6_3', ['@', 'gender', 'q1'])])
>>> print batch1.x_filter_map
OrderedDict([('q1', {'(men only)+(q1)': (<function _intersection at 0x0000000019AE06D8>, [{'gender': 1}, {'age': [20, 21, 22, 23, 24, 25]}])}),
('q2', {'men only': {'gender': 1}}),
('q6', {'men only': {'gender': 1}}),
('q6_1', {'men only': {'gender': 1}}),
('q6_2', {'men only': {'gender': 1}}),
('q6_3', {'men only': {'gender': 1}})])
>>> print batch1.weights
['weight_a']
>>> batch2 = dataset.get_batch('batch2')
>>> print batch2.x_y_map
OrderedDict([('q2b', ['@', 'gender'])])
>>> print batch2.x_filter_map
OrderedDict([('q2b', 'no_filter')])
>>> print batch2.weights
['weight']
Basic views¶
It is possible to add various qp.View
s to a Link
. This can be performed
by running Stack.add_link()
providing View
objects via the view
parameter.
Alternatively, the qp.Batch
definitions that are stored in the meta data
help to add basic View
s (counts, percentages, bases and sums). By simply
running Stack.aggregate()
we can easily add a large amount of aggregations
in one step.
Note
Stack.aggregate()
can only be used with pre-populated Stack
s!
(see DataSet.populate()).
For instance, we can add column percentages and (unweighted and weighted) base sizes
to all Link
s of batch2
like this:
>>> stack.aggregate(views=['c%', 'cbase'], unweighted_base=True, batches='batch2', verbose=False)
>>> stack.describe()
data filter x y view #
0 Example Data (A) men only q1 q1 NaN 1
1 Example Data (A) men only q1 @ NaN 1
2 Example Data (A) men only q1 gender NaN 1
3 Example Data (A) men only @ q6 NaN 1
4 Example Data (A) men only q2 ethnicity NaN 1
5 Example Data (A) men only q2 locality NaN 1
6 Example Data (A) men only q6_1 q1 NaN 1
7 Example Data (A) men only q6_1 @ NaN 1
8 Example Data (A) men only q6_1 gender NaN 1
9 Example Data (A) men only q6_2 q1 NaN 1
10 Example Data (A) men only q6_2 @ NaN 1
11 Example Data (A) men only q6_2 gender NaN 1
12 Example Data (A) men only q6_3 q1 NaN 1
13 Example Data (A) men only q6_3 @ NaN 1
14 Example Data (A) men only q6_3 gender NaN 1
15 Example Data (A) men only gender q1 NaN 1
16 Example Data (A) men only gender @ NaN 1
17 Example Data (A) men only gender gender NaN 1
18 Example Data (A) men only q6 @ NaN 1
19 Example Data (A) (men only)+(q1) q1 q1 NaN 1
20 Example Data (A) (men only)+(q1) q1 @ NaN 1
21 Example Data (A) (men only)+(q1) q1 locality NaN 1
22 Example Data (A) (men only)+(q1) q1 ethnicity NaN 1
23 Example Data (A) (men only)+(q1) q1 gender NaN 1
24 Example Data (A) no_filter q2b @ x|f|:|y|weight|c% 1
25 Example Data (A) no_filter q2b @ x|f|x:||weight|cbase 1
26 Example Data (A) no_filter q2b @ x|f|x:|||cbase 1
27 Example Data (A) no_filter q2b gender x|f|:|y|weight|c% 1
28 Example Data (A) no_filter q2b gender x|f|x:||weight|cbase 1
29 Example Data (A) no_filter q2b gender x|f|x:|||cbase 1
Obviously View
s are only added to Link
s defined by batch2
and
automatically weighted according to the weight definition of batch2
,
which is evident from the view keys (x|f|:|y|weight|c%
). Combining the information
of the four Link
attributes with a view key, leads to a pd.DataFrame
and its belonging meta information:
>>> link = stack['Example Data (A)']['no_filter']['q2b']['gender']
>>> view_key = 'x|f|:|y|weight|c%'
>>> link[view_key]
Question q2b
Values @
Question Values
q2b 1 11.992144
2 80.802580
3 7.205276
>>> link[view_key].meta()
{
"agg": {
"weights": "weight",
"name": "c%",
"grp_text_map": null,
"text": "",
"fullname": "x|f|:|y|weight|c%",
"is_weighted": true,
"method": "frequency",
"is_block": false
},
"x": {
"is_array": false,
"name": "q2b",
"is_multi": false,
"is_nested": false
},
"shape": [
3,
1
],
"y": {
"is_array": false,
"name": "@",
"is_multi": false,
"is_nested": false
}
}
Now we are adding View
s to all batch1
-defined Link
s as well:
>>> stack.aggregate(views=['c%', 'counts', 'cbase'], unweighted_base=True, batches='batch1', verbose=False)
>>> stack.describe(['x', 'view'], 'y').loc[['@', 'q6'], ['@', 'q6']]
y @ q6
x view
@ x|f|:|y|weight_a|c% NaN 1.0
x|f|:||weight_a|counts NaN 1.0
q6 x|f|:|y|weight_a|c% 1.0 NaN
x|f|:||weight_a|counts 1.0 NaN
Even if unweighted bases are requested, they get skipped for array summaries and transposed arrays.
Since y_on_y
is requested, for a variable used as cross- and downbreak, with an extended filter (in this
example q1
), two Link
s with View
s are created:
>>> stack.describe(['y', 'filter', 'view'], 'x').loc['q1', 'q1']
filter view
(men only)+(q1) x|f|:|y|weight_a|c% 1.0
x|f|:||weight_a|counts 1.0
x|f|x:||weight_a|cbase 1.0
x|f|x:|||cbase 1.0
men only x|f|:|y|weight_a|c% 1.0
x|f|:||weight_a|counts 1.0
x|f|x:||weight_a|cbase 1.0
x|f|x:|||cbase 1.0
The first one is the aggregation defined by the Batch construction plan; the second one shows the y_on_y aggregation using only the main Batch.filter.
Non-categorical variables¶
>>> batch3 = dataset.add_batch('batch3')
>>> batch3.add_x('age')
>>> stack = dataset.populate('batch3')
>>> stack.describe()
data filter x y view #
0 Example Data (A) no_filter age @ NaN 1
Non-categorical variables (int or float) are handled in a special way. There are two options:
1. Treat them like categorical variables: Append them to the parameter categorize; then counts, percentage and sum aggregations can be added alongside the cbase View.

>>> stack.aggregate(views=['c%', 'counts', 'cbase', 'counts_sum', 'c%_sum'], unweighted_base=True, categorize=['age'], batches='batch3', verbose=False)
>>> stack.describe()
               data     filter    x  y                     view  #
0  Example Data (A)  no_filter  age  @           x|f|:|||counts  1
1  Example Data (A)  no_filter  age  @     x|f.c:f|x:|y||c%_sum  1
2  Example Data (A)  no_filter  age  @              x|f|:|y||c%  1
3  Example Data (A)  no_filter  age  @           x|f|x:|||cbase  1
4  Example Data (A)  no_filter  age  @  x|f.c:f|x:|||counts_sum  1

2. Do not categorize the variable: Only cbase is created and additional descriptive statistics Views must be added. The method will raise a warning:

>>> stack.aggregate(views=['c%', 'counts', 'cbase', 'counts_sum', 'c%_sum'], unweighted_base=True, batches='batch3', verbose=True)
Warning: Found 1 non-categorized numeric variable(s): ['age'].
Descriptive statistics must be added!
>>> stack.describe()
               data     filter    x  y            view  #
0  Example Data (A)  no_filter  age  @  x|f|x:|||cbase  1
Descriptive statistics¶
>>> b_name = 'batch4'
>>> batch4 = dataset.add_batch(b_name)
>>> batch4.add_x(['q2b', 'q6', 'age'])
>>> stack = dataset.populate(b_name)
>>> stack.aggregate(views=['counts', 'cbase'], batches=b_name, verbose=False)
>>> stack.describe()
data filter x y view #
0 Example Data (A) no_filter q2b @ x|f|:|||counts 1
1 Example Data (A) no_filter q2b @ x|f|x:|||cbase 1
2 Example Data (A) no_filter q6_1 @ x|f|:|||counts 1
3 Example Data (A) no_filter q6_1 @ x|f|x:|||cbase 1
4 Example Data (A) no_filter q6_2 @ x|f|:|||counts 1
5 Example Data (A) no_filter q6_2 @ x|f|x:|||cbase 1
6 Example Data (A) no_filter q6_3 @ x|f|:|||counts 1
7 Example Data (A) no_filter q6_3 @ x|f|x:|||cbase 1
8 Example Data (A) no_filter age @ x|f|x:|||cbase 1
9 Example Data (A) no_filter q6 @ x|f|:|||counts 1
10 Example Data (A) no_filter q6 @ x|f|x:|||cbase 1
Descriptive statistics Views like mean, stddev, min, max, median, etc. can be added with the method stack.add_stats(). With the parameters other_source, rescale and exclude you can specify the calculation. Again, each combination of the parameters refers to a unique view key. Note that arrays included in on_vars get unrolled, which means all belonging array items also get equipped with the added View:
>>> stack.add_stats(on_vars=['q2b', 'age'], stats='mean', _batches=b_name, verbose=False)
>>> stack.add_stats(on_vars=['q6'], stats='stddev', _batches=b_name, verbose=False)
>>> stack.add_stats(on_vars=['q2b'], stats='mean', rescale={1:100, 2:50, 3:0},
... custom_text='rescale mean', _batches=b_name, verbose=False)
>>> stack.describe('view', 'x')
x age q2b q6 q6_1 q6_2 q6_3
view
x|d.mean|x:|||stat 1.0 1.0 NaN NaN NaN NaN
x|d.mean|x[{100,50,0}]:|||stat NaN 1.0 NaN NaN NaN NaN
x|d.stddev|x:|||stat NaN NaN 1.0 1.0 1.0 1.0
x|f|:|||counts NaN 1.0 1.0 1.0 1.0 1.0
x|f|x:|||cbase 1.0 1.0 1.0 1.0 1.0 1.0
Nets¶
>>> b_name = 'batch5'
>>> batch5 = dataset.add_batch(b_name)
>>> batch5.add_x(['q2b', 'q6'])
>>> stack = dataset.populate(b_name)
>>> stack.aggregate(views=['counts', 'c%', 'cbase'], batches=b_name, verbose=False)
>>> stack.describe('view', 'x')
x q2b q6 q6_1 q6_2 q6_3
view
x|f|:|y||c% 1 1 1 1 1
x|f|:|||counts 1 1 1 1 1
x|f|x:|||cbase 1 1 1 1 1
Net-like View
s can be added with the method Stack.add_nets()
by defining
net_map
s for selected variables. There is a distinction between two different
types of net View
s:
1. Expanded nets: The existing counts or percentage Views are replaced with the new net Views, which will show the net-defining codes after or before the computed net groups (i.e. "overcode" nets).

>>> stack.add_nets('q2b', [{'Top2': [1, 2]}], expand='after', _batches=b_name, verbose=False)
>>> stack.describe('view', 'x')
x                       q2b   q6  q6_1  q6_2  q6_3
view
x|f|:|y||c%             NaN  1.0   1.0   1.0   1.0
x|f|:|||counts          NaN  1.0   1.0   1.0   1.0
x|f|x:|||cbase          1.0  1.0   1.0   1.0   1.0
x|f|x[{1,2}+]*:|y||net  1.0  NaN   NaN   NaN   NaN
x|f|x[{1,2}+]*:|||net   1.0  NaN   NaN   NaN   NaN

2. Not expanded nets: The new net Views, which contain only the computed net groups, are added to the stack.

>>> stack.add_nets('q2b', [{'Top2': [1, 2]}], _batches=b_name, verbose=False)
>>> stack.describe('view', 'x')
x                       q2b   q6  q6_1  q6_2  q6_3
view
x|f|:|y||c%             NaN  1.0   1.0   1.0   1.0
x|f|:|||counts          NaN  1.0   1.0   1.0   1.0
x|f|x:|||cbase          1.0  1.0   1.0   1.0   1.0
x|f|x[{1,2}+]*:|y||net  1.0  NaN   NaN   NaN   NaN
x|f|x[{1,2}+]*:|||net   1.0  NaN   NaN   NaN   NaN
x|f|x[{1,2}]:|y||net    1.0  NaN   NaN   NaN   NaN
x|f|x[{1,2}]:|||net     1.0  NaN   NaN   NaN   NaN
The difference between the two net types is also visible in the view keys: x|f|x[{1,2}+]*:|||net versus x|f|x[{1,2}]:|||net.
Net definitions¶
To create more complex net definitions the method quantipy.net() can be used, which generates a well-formatted instruction dict and appends it to the net_map. It is especially helpful for including various texts with different valid text_keys. The next example shows how to prepare a net for 'q6' (promoters, detractors):
>>> q6_net = qp.net([], [1, 2, 3, 4, 5, 6], 'Promotors', ['en-GB', 'sv-SE'])
>>> q6_net = qp.net(q6_net, [9, 10], {'en-GB': 'Detractors',
... 'sv_SE': 'Detractors',
... 'de-DE': 'Kritiker'})
>>> qp.net(q6_net[0], text='Promoter', text_key='de-DE')
>>> print q6_net
[
{
"1": [1, 2, 3, 4, 5, 6],
"text": {
"en-GB": "Promotors",
"sv-SE": "Promotors",
"de-DE": "Promoter"
}
},
{
"2": [9, 10],
"text": {
"en-GB": "Detractors",
"sv_SE": "Detractors",
"de-DE": "Kritiker"
}
}
]
Calculations¶
Stack.add_nets()
has the parameter calc
, which allows adding View
s
that are calculated out of the defined nets. The method qp.calc()
is a
helper to create a well-formatted instruction dict for the calculation.
For instance, to calculate the NPS (promoters - detractors) for 'q6'
, see the example
above and create the following calculation:
>>> q6_calc = qp.calc((1, '-', 2), 'NPS', ['en-GB', 'sv-SE', 'de-DE'])
>>> print q6_calc
OrderedDict([('calc', ('net_1', <built-in function sub>, 'net_2')),
('calc_only', False),
('text', {'en-GB': 'NPS',
'sv-SE': 'NPS',
'de-DE': 'NPS'})])
>>> stack.add_nets('q6', q6_net, calc=q6_calc, _batches=b_name, verbose=False)
>>> stack.describe('view', 'x')
x q2b q6 q6_1 q6_2 q6_3
view
x|f.c:f|x[{1,2,3,4,5,6}],x[{9,10}],x[{1,2,3,4,5... NaN 1.0 1.0 1.0 1.0
x|f.c:f|x[{1,2,3,4,5,6}],x[{9,10}],x[{1,2,3,4,5... NaN 1.0 1.0 1.0 1.0
x|f|:|y||c% NaN 1.0 1.0 1.0 1.0
x|f|:|||counts NaN 1.0 1.0 1.0 1.0
x|f|x:|||cbase 1.0 1.0 1.0 1.0 1.0
x|f|x[{1,2}+]*:|y||net 1.0 NaN NaN NaN NaN
x|f|x[{1,2}+]*:|||net 1.0 NaN NaN NaN NaN
x|f|x[{1,2}]:|y||net 1.0 NaN NaN NaN NaN
x|f|x[{1,2}]:|||net 1.0 NaN NaN NaN NaN
You can see that nets added on arrays are also applied to all array items.
Cumulative sums¶
Cumulative sum Views can be added to a specified collection of x-keys of the Stack using stack.cumulative_sum(). These Views will always replace the regular counts and percentage Views:
>>> b_name = 'batch6'
>>> batch6 = dataset.add_batch(b_name)
>>> batch6.add_x(['q2b', 'q6'])
>>> stack = dataset.populate(b_name)
>>> stack.aggregate(views=['counts', 'c%', 'cbase'], batches=b_name, verbose=False)
>>> stack.cumulative_sum('q6', verbose=False)
>>> stack.describe('view', 'x')
x q2b q6 q6_1 q6_2 q6_3
view
x|f.c:f|x++:|y||c%_cumsum NaN 1.0 1.0 1.0 1.0
x|f.c:f|x++:|||counts_cumsum NaN 1.0 1.0 1.0 1.0
x|f|:|y||c% 1.0 NaN NaN NaN NaN
x|f|:|||counts 1.0 NaN NaN NaN NaN
x|f|x:|||cbase 1.0 1.0 1.0 1.0 1.0
Significance tests¶
>>> batch2 = dataset.get_batch('batch2')
>>> batch2.set_sigtests([0.05])
>>> batch5 = dataset.get_batch('batch5')
>>> batch5.set_sigtests([0.01, 0.05])
>>> stack = dataset.populate(['batch2', 'batch5'])
>>> stack.aggregate(['counts', 'cbase'], batches=['batch2', 'batch5'], verbose=False)
>>> stack.describe(['view', 'y'], 'x')
x q2b q6 q6_1 q6_2 q6_3
view y
x|f|:||weight|counts @ 1.0 NaN NaN NaN NaN
gender 1.0 NaN NaN NaN NaN
x|f|:|||counts @ 1.0 1.0 1.0 1.0 1.0
x|f|x:||weight|cbase @ 1.0 NaN NaN NaN NaN
gender 1.0 NaN NaN NaN NaN
x|f|x:|||cbase @ 1.0 1.0 1.0 1.0 1.0
gender 1.0 NaN NaN NaN NaN
Significance tests can only be added Batch
-wise, which also means that
significance levels must be defined for each Batch
before running
stack.add_tests()
.
>>> stack.add_tests(['batch2', 'batch5'], verbose=False)
>>> stack.describe(['view', 'y'], 'x')
x q2b q6 q6_1 q6_2 q6_3
view y
x|f|:||weight|counts @ 1.0 NaN NaN NaN NaN
gender 1.0 NaN NaN NaN NaN
x|f|:|||counts @ 1.0 1.0 1.0 1.0 1.0
x|f|x:||weight|cbase @ 1.0 NaN NaN NaN NaN
gender 1.0 NaN NaN NaN NaN
x|f|x:|||cbase @ 1.0 1.0 1.0 1.0 1.0
gender 1.0 NaN NaN NaN NaN
x|t.props.Dim.01|:|||significance @ 1.0 NaN 1.0 1.0 1.0
x|t.props.Dim.05|:||weight|significance @ 1.0 NaN NaN NaN NaN
gender 1.0 NaN NaN NaN NaN
x|t.props.Dim.05|:|||significance @ 1.0 NaN 1.0 1.0 1.0
Builds¶
API references¶
Chain¶
class quantipy.Chain(name=None)¶
Container class that holds ordered Link definitions and associated Views.
The Chain object is a subclassed dict of list where each list contains one or more View aggregations of a Stack. It is an internal class included and used inside the Stack object. Users can interact with the data directly through the Chain or through the related Cluster object.

concat()¶
Concatenates all Views found for the Chain definition along its orientation axis.

copy()¶
Create a copy of self by serializing to/from a bytestring using cPickle.

describe(index=None, columns=None, query=None)¶
Generates a list of all link defining stack keys.

static load(filename)¶
This method loads a pickled Chain object that was saved using save().
Parameters: filename (string) – Specifies the name of the file to be loaded. Example of use: new_stack = Chain.load("./tests/ChainName.chain")

save(path=None)¶
This method saves the current Chain instance (self) to file (.chain) using cPickle.
Parameters: path (string) – Specifies the location of the saved file. NOTE: has to end with '/'. Example: './tests/'
Cluster¶
-
class quantipy.Cluster(name='')¶ Container class in form of an OrderedDict of Chains.
It is possible to interact with individual Chains through the Cluster object. Clusters are mainly used to prepare aggregations for an export/ build, e.g. MS Excel Workbooks.
-
add_chain
(chains=None)¶ Adds chains to a cluster
-
bank_chains
(spec, text_key)¶ Return a banked chain as defined by spec.
This method returns a banked or compound chain where the spec describes how the view results from multiple chains should be banked together into the same set of dataframes in a single chain.
Parameters: - spec (dict) – The banked chain specification object.
- text_key (str, default='values') – Paint the x-axis of the banked chain using the spec provided and this text_key.
Returns: bchain – The banked chain.
Return type:
-
static
load
(path_cluster)¶ Load a Cluster instance from a .cluster file.
Parameters: path_cluster (str) – The full path to the .cluster file that should be loaded, including the extension. Returns: Return type: None
-
merge
()¶ Merges all Chains found in the Cluster into a new pandas.DataFrame.
-
save
(path_cluster)¶ Save the Cluster instance to a .cluster file.
Parameters: path_cluster (str) – The full path to the .cluster file that should be created, including the extension. Returns: Return type: None
-
DataSet¶
-
class quantipy.DataSet(name, dimensions_comp=True)¶ A set of case data (required) and meta data (optional).
DESC.
-
add_filter_var
(name, logic, overwrite=False)¶ Create filter-var, that allows index slicing using
manifest_filter
Parameters: - name (str) – Name and label of the new filter-variable, which also gets listed in DataSet.filters
- logic (complex logic/ str, list of complex logic/ str) – Logic to keep cases.
Complex logic should be provided in form of:
` { 'label': 'any text', 'logic': {var: keys} / intersection/ .... } `
If a str (column-name) is provided, automatically a logic is created that keeps all cases which are not empty for this column. If logic is a list, each included list-item becomes a category of the new filter-variable and all cases are kept that satisfy all conditions (intersection) - overwrite (bool, default False) – Overwrite an already existing filter-variable.
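A minimal sketch (the variable and filter names are illustrative, not part of the API): build a filter keeping one gender code and manifest its index slicer via manifest_filter():
>>> dataset.add_filter_var('women_only',
...                        {'label': 'women', 'logic': {'gender': [2]}})
>>> women_idx = dataset.manifest_filter('women_only')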
-
add_meta
(name, qtype, label, categories=None, items=None, text_key=None, replace=True)¶ Create and insert a well-formed meta object into the existing meta document.
Parameters: - name (str) – The column variable name keyed in
meta['columns']
. - qtype ({'int', 'float', 'single', 'delimited set', 'date', 'string'}) – The structural type of the data the meta describes.
- label (str) – The
text
label information. - categories (list of str, int, or tuples in form of (int, str), default None) – When a list of str is given, the categorical values will simply be
enumerated and mapped to the category labels. If only int are
provided, text labels are assumed to be an empty str (‘’) and a
warning is triggered. Alternatively, codes can be mapped to categorical
labels, e.g.:
[(1, 'Elephant'), (2, 'Mouse'), (999, 'No animal')]
- items (list of str, int, or tuples in form of (int, str), default None) – If provided will automatically create an array type mask.
When a list of str is given, the item number will simply be
enumerated and mapped to the category labels. If only int are
provided, item text labels are assumed to be an empty str (‘’) and
a warning is triggered. Alternatively, numerical values can be
mapped explicitly to item labels, e.g.:
[(1, 'The first item'), (2, 'The second item'), (99, 'Last item')]
- text_key (str, default None) – Text key for text-based label information. Uses the
DataSet.text_key
information if not provided. - replace (bool, default True) – If True, an already existing corresponding
pd.DataFrame
column in the case data component will be overwritten with a new (empty) one.
Returns: DataSet is modified inplace, meta data and _data columns will be added. Return type: None
- name (str) – The column variable name keyed in
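For illustration, a hedged sketch that registers a new single-coded variable and a small array (the names, labels and categories are made up):
>>> dataset.add_meta('pet', 'single', 'Do you own a pet?',
...                  categories=[(1, 'Yes'), (2, 'No'), (99, 'No answer')])
>>> dataset.add_meta('q_rate', 'single', 'Rating battery',
...                  categories=[(1, 'Good'), (2, 'Bad')],
...                  items=[(1, 'Item A'), (2, 'Item B')])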
-
align_order
(vlist, align_against=None, integrate_rc=(['_rc', '_rb'], True), fix=[])¶ Align list to existing order.
Parameters: - vlist (list of str) – The list which should be reordered.
- align_against (str or list of str, default None) – The list of variables to align against. If a string is provided, the depending set list is taken. If None, “data file” set is taken.
- integrate_rc (tuple (list, bool)) – The provided list are the suffixes for recodes, the bool decides whether parent variables should be replaced by their recodes if the parent variable is not in vlist.
- fix (list of str) – Variables which are fixed at the beginning of the reordered list.
-
all
(name, codes)¶ Return a logical has_all() slicer for the passed codes.
Note
When applied to an array mask, the has_all() logic is extended to the item sources, i.e. it must be true for all the items.
Parameters: - name (str, default None) – The column variable name keyed in
_meta['columns']
or_meta['masks']
. - codes (int or list of int) – The codes to build the logical slicer from.
Returns: slicer – The indices fulfilling has_all([codes]).
Return type: pandas.Index
- name (str, default None) – The column variable name keyed in
-
any
(name, codes)¶ Return a logical has_any() slicer for the passed codes.
Note
When applied to an array mask, the has_any() logic is extended to the item sources, i.e. it must be true for at least one of the items.
Parameters: - name (str, default None) – The column variable name keyed in
_meta['columns']
or_meta['masks']
. - codes (int or list of int) – The codes to build the logical slicer from.
Returns: slicer – The indices fulfilling has_any([codes]).
Return type: pandas.Index
- name (str, default None) – The column variable name keyed in
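A short sketch of how the returned pandas.Index can be used to slice the case data component ('q8' is a hypothetical delimited set):
>>> slicer = dataset.any('q8', [1, 2, 3])
>>> dataset._data.loc[slicer, 'q8']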
-
band
(name, bands, new_name=None, label=None, text_key=None)¶ Group numeric data with band definitions treated as group text labels.
Wrapper around
derive()
for quick banding of numeric data.Parameters: - name (str) – The column variable name keyed in
_meta['columns']
that will be banded into summarized categories. - bands (list of int/tuple or dict mapping the former to value texts) – The categorical bands to be used. Bands can be single numeric
values or ranges, e.g.: [0, (1, 10), 11, 12, (13, 20)].
By default, each band will also make up the value text of the
category created in the
_meta
component. To specify custom texts, map each band to a category name e.g.: [{‘A’: 0}, {‘B’: (1, 10)}, {‘C’: 11}, {‘D’: 12}, {‘E’: (13, 20)}] - new_name (str, default None) – The created variable will be named
'<name>_banded'
, unless a desired name is provided explicitly here. - label (str, default None) – The created variable’s text label will be identical to the originating one’s passed in
name
, unless a desired label is provided explicitly here. - text_key (str, default None) – Text key for text-based label information. Uses the
DataSet.text_key
information if not provided.
Returns: DataSet
is modified inplace.Return type: None
- name (str) – The column variable name keyed in
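A sketch using the custom-text mapping form described above (assuming a numeric 'age' column; the band edges and names are illustrative):
>>> bands = [{'18-34': (18, 34)}, {'35-54': (35, 54)}, {'55+': (55, 99)}]
>>> dataset.band('age', bands, new_name='age_banded', label='Age groups')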
-
by_type
(types=None)¶ Get an overview of all the variables ordered by their type.
Parameters: types (str or list of str, default None) – Restrict the overview to these data types. Returns: overview – The variables per data type inside the DataSet
.Return type: pandas.DataFrame
-
categorize
(name, categorized_name=None)¶ Categorize an int/string/text variable to single.
The values object of the categorized variable is populated with the unique values found in the originating variable (ignoring np.NaN / empty row entries). Parameters: - name (str) – The column variable name keyed in
meta['columns']
that will be categorized. - categorized_name (str) – If provided, the categorized variable’s new name will be drawn
from here, otherwise a default name in form of
'name#'
will be used.
Returns: DataSet is modified inplace, adding the categorized variable to it.
Return type: None
- name (str) – The column variable name keyed in
-
clear_factors
(name)¶ Remove all factors set in the variable’s
'values'
object.Parameters: name (str) – The column variable name keyed in _meta['columns']
or_meta['masks']
.Returns: Return type: None
-
clone
()¶ Get a deep copy of the
DataSet
instance.
-
code_count
(name, count_only=None, count_not=None)¶ Get the total number of codes/entries found per row.
Note
Will be 0/1 for type single and range between 0 and the number of possible values for type delimited set.
Parameters: - name (str) – The column variable name keyed in
meta['columns']
ormeta['masks']
. - count_only (int or list of int, default None) – Pass a list of codes to restrict counting to.
- count_not (int or list of int, default None) – Pass a list of codes that should not be counted.
Returns: count – A series with the results as ints.
Return type: pandas.Series
- name (str) – The column variable name keyed in
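For example, counting responses of a hypothetical delimited set 'q8', optionally restricted to a few codes:
>>> mentions = dataset.code_count('q8')
>>> top_mentions = dataset.code_count('q8', count_only=[1, 2, 3])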
-
code_from_label
(name, text_label, text_key=None, exact=True, flat=True)¶ Return the code belonging to the passed
text
label (if present).Parameters: - name (str) – The originating variable name keyed in
meta['columns']
ormeta['masks']
. - text_label (str or list of str) – The value text(s) to search for.
- text_key (str, default None) – The desired
text_key
to search through. Uses theDataSet.text_key
information if not provided. - exact (bool, default True) –
text_label
must exactly match a categorical value’stext
. If False, it is enough that the category contains thetext_label
. - flat (bool, default True) – If a list is passed for text_label, return all found codes as a regular list. If False, return a list of lists matching the order of the text_label list.
Returns: codes – The list of value codes found for the passed label
text
.Return type: list
- name (str) – The originating variable name keyed in
-
codes
(name)¶ Get categorical data’s numerical code values.
Parameters: name (str) – The column variable name keyed in _meta['columns']
.Returns: codes – The list of category codes. Return type: list
-
codes_in_data
(name)¶ Get a list of codes that exist in data.
-
compare
(dataset, variables=None, strict=False, text_key=None)¶ Compares types, codes, values, question labels of two datasets.
Parameters: - dataset (quantipy.DataSet instance) – Test if all variables in the provided
dataset
are also inself
and compare their metadata definitions. - variables (str, list of str) – Check only these variables
- strict (bool, default False) – If True lower/ upper cases and spaces are taken into account.
- text_key (str, list of str) – The textkeys for which texts are compared.
Returns: Return type: None
- dataset (quantipy.DataSet instance) – Test if all variables in the provided
-
compare_filter
(name1, name2)¶ Show if filters result in the same index.
Parameters: - name1 (str) – Name of the first filter variable
- name2 (str/ list of str) – Name(s) of the filter variable(s) to compare with.
-
convert
(name, to)¶ Convert meta and case data between compatible variable types.
Wrapper around the separate
as_TYPE()
conversion methods.Parameters: - name (str) – The column variable name keyed in
meta['columns']
that will be converted. - to ({'int', 'float', 'single', 'delimited set', 'string'}) – The variable type to convert to.
Returns: The DataSet variable is modified inplace.
Return type: None
- name (str) – The column variable name keyed in
-
copy
(name, suffix='rec', copy_data=True, slicer=None, copy_only=None, copy_not=None)¶ Copy meta and case data of the variable definition given per
name
.Parameters: - name (str) – The originating column variable name keyed in
meta['columns']
ormeta['masks']
. - suffix (str, default 'rec') – The new variable name will be constructed by suffixing the original
name
with_suffix
, e.g.'age_rec
. - copy_data (bool, default True) – The new variable assumes the
data
of the original variable. - slicer (dict) – If the data is copied it is possible to filter the data with a complex logic. Example: slicer = {‘q1’: not_any([99])}
- copy_only (int or list of int, default None) – If provided, the copied version of the variable will only contain (data and) meta for the specified codes.
- copy_not (int or list of int, default None) – If provided, the copied version of the variable will contain (data and) meta for the all codes, except of the indicated.
Returns: DataSet is modified inplace, adding a copy to both the data and meta component.
Return type: None
- name (str) – The originating column variable name keyed in
-
copy_array_data
(source, target, source_items=None, target_items=None, slicer=None)¶
-
create_set
(setname='new_set', based_on='data file', included=None, excluded=None, strings='keep', arrays='masks', replace=None, overwrite=False)¶ Create a new set in
dataset._meta['sets']
.Parameters: - setname (str, default 'new_set') – Name of the new set.
- based_on (str, default 'data file') – Name of set that can be reduced or expanded.
- included (str or list/set/tuple of str) – Names of the variables to be included in the new set. If None all
variables in
based_on
are taken. - excluded (str or list/set/tuple of str) – Names of the variables to be excluded in the new set.
- strings ({'keep', 'drop', 'only'}, default 'keep') – Keep, drop or only include string variables.
- arrays ({'masks', 'columns'}, default masks) – For arrays add
masks@varname
orcolumns@varname
. - replace (dict) – Replace a variable in the set with an other.
Example: {‘q1’: ‘q1_rec’}, ‘q1’ and ‘q1_rec’ must be included in
based_on
. ‘q1’ will be removed and ‘q1_rec’ will be moved to this position. - overwrite (bool, default False) – Overwrite if
meta['sets'][name]
already exist.
Returns: The
DataSet
is modified inplace.Return type: None
-
crosstab
(x, y=None, w=None, pct=False, decimals=1, text=True, rules=False, xtotal=False, f=None)¶
-
cut_item_texts
(arrays=None)¶ Remove array text from array item texts.
Parameters: arrays (str, list of str, default None) – Cut texts for items of these arrays. If None, all keys in ._meta['masks']
are taken.
-
data
()¶ Return the
data
component of theDataSet
instance.
-
derive
(name, qtype, label, cond_map, text_key=None)¶ Create meta and recode case data by specifying derived category logics.
Parameters: - name (str) – The column variable name keyed in
meta['columns']
. - qtype ([
int
,float
,single
,delimited set
]) – The structural type of the data the meta describes. - label (str) – The
text
label information. - cond_map (list of tuples) –
Tuples of either two or three elements of following structures:
2 elements, no labels provided: (code, <qp logic expression here>), e.g.:
(1, intersection([{'gender': [1]}, {'age': frange('30-40')}]))
2 elements, no codes provided: (‘text label’, <qp logic expression here>), e.g.:
('Cat 1', intersection([{'gender': [1]}, {'age': frange('30-40')}]))
3 elements, with codes + labels: (code, ‘Label goes here’, <qp logic expression here>), e.g.:
(1, 'Men, 30 to 40', intersection([{'gender': [1]}, {'age': frange('30-40')}]))
- text_key (str, default None) – Text key for text-based label information. Will automatically fall back to the instance’s text_key property information if not provided.
Returns: DataSet
is modified inplace.Return type: None
- name (str) – The column variable name keyed in
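A hedged sketch using the three-element cond_map form from above (assuming 'gender' and 'age' columns exist; intersection and frange are the Quantipy logic helpers used in the cond_map examples):
>>> cond_map = [
...     (1, 'Men, 18-34', intersection([{'gender': [1]}, {'age': frange('18-34')}])),
...     (2, 'Women, 18-34', intersection([{'gender': [2]}, {'age': frange('18-34')}]))]
>>> dataset.derive('gender_age', 'single', 'Gender by age', cond_map)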
-
derotate
(levels, mapper, other=None, unique_key='identity', dropna=True)¶ Derotate data and meta using the given mapper, and appending others.
This function derotates data using the specification defined in mapper, which is a list of dicts of lists, describing how columns from data can be read as a hierarchical structure.
Returns derotated DataSet instance and saves data and meta as json and csv.
Parameters: - levels (dict) – The name and values of a new column variable to identify cases.
- mapper (list of dicts of lists) –
A list of dicts matching where the new column names are keys to lists of source columns. Example:
>>> mapper = [{'q14_1': ['q14_1_1', 'q14_1_2', 'q14_1_3']}, ... {'q14_2': ['q14_2_1', 'q14_2_2', 'q14_2_3']}, ... {'q14_3': ['q14_3_1', 'q14_3_2', 'q14_3_3']}]
- unique_key (str) – Name of column variable that will be copied to new dataset.
- other (list (optional; default=None)) – A list of additional columns from the source data to be appended to the end of the resulting stacked dataframe.
- dropna (boolean (optional; default=True)) – Passed through to the pandas.DataFrame.stack() operation.
Returns: Return type: new
qp.DataSet
instance
-
describe
(var=None, only_type=None, text_key=None, axis_edit=None)¶ Inspect the DataSet’s global or variable level structure.
-
dichotomize
(name, value_texts=None, keep_variable_text=True, ignore=None, replace=False, text_key=None)¶
-
dimensionize
(names=None)¶ Rename the dataset columns for Dimensions compatibility.
-
dimensionizing_mapper
(names=None)¶ Return a renaming dataset mapper for dimensionizing names.
Parameters: None – Returns: mapper – A renaming mapper in the form of a dict of {old: new} that maps non-Dimensions naming conventions to Dimensions naming conventions. Return type: dict
-
drop
(name, ignore_items=False)¶ Drops variables from meta and data components of the
DataSet
.Parameters: - name (str or list of str) – The column variable name keyed in
_meta['columns']
or_meta['masks']
. - ignore_items (bool) – If False source variables for arrays in
_meta['columns']
are dropped, otherwise kept.
Returns: DataSet is modified inplace.
Return type: None
- name (str or list of str) – The column variable name keyed in
-
drop_duplicates
(unique_id='identity', keep='first', sort_by=None)¶ Drop duplicated cases from self._data.
Parameters: - unique_id (str) – Variable name that gets scanned for duplicates.
- keep (str, {'first', 'last'}) – Keep first or last of the duplicates.
- sort_by (str) – Name of a variable to sort the data by, for example “endtime”. It is a helper to specify keep.
-
duplicates
(name='identity')¶ Returns a list with duplicated values for the provided name.
Parameters: name (str, default 'identity') – The column variable name keyed in meta['columns']
.Returns: vals – A list of duplicated values found in the named variable. Return type: list
-
empty
(name, condition=None)¶ Check variables for emptiness (opt. restricted by a condition).
Parameters: - name ((list of) str) – The mask variable name keyed in
_meta['columns']
. - condition (Quantipy logic expression, default None) – A logical condition expressed as Quantipy logic that determines which subset of the case data rows to be considered.
Returns: empty
Return type: bool
- name ((list of) str) – The mask variable name keyed in
-
empty_items
(name, condition=None, by_name=True)¶ Test arrays for item emptiness (opt. restricted by a condition).
Parameters: - name ((list of) str) – The mask variable name keyed in
_meta['masks']
. - condition (Quantipy logic expression, default None) – A logical condition expressed as Quantipy logic that determines which subset of the case data rows to be considered.
- by_name (bool, default True) – Return array items by their name or their index.
Returns: empty – The list of empty items by their source names or positional index (starting from 1!, mapped to their parent mask name if more than one).
Return type: list
- name ((list of) str) – The mask variable name keyed in
-
extend_filter_var
(name, logic, extend_as=None)¶ Extend logic of an existing filter-variable.
Parameters: - name (str) – Name of the existing filter variable.
- logic ((list of) complex logic/ str) – Additional logic to keep cases (intersection with existing logic).
Complex logic should be provided in form of:
` { 'label': 'any text', 'logic': {var: keys} / intersection/ .... } `
- extend_as (str, default None) – Addition to the filter-name to create a new filter. If it is None the existing filter-variable is overwritten.
-
extend_items
(name, ext_items, text_key=None)¶ Extend mask items of an existing array.
Parameters: - name (str) – The originating column variable name keyed in
meta['masks']
. - ext_items (list of str/ list of dict) – The label of the new item. It can be provided as str, then the new column is named by the grid and the item_no, or as dict {‘new_column’: ‘label’}.
- text_key (str/ list of str, default None) – Text key for text-based label information. Will automatically fall back to the instance’s text_key property information if not provided.
- name (str) – The originating column variable name keyed in
-
extend_values
(name, ext_values, text_key=None, safe=True)¶ Add to the ‘values’ object of existing column or mask meta data.
Attempting to add already existing value codes or providing already present value texts will both raise a
ValueError
!Parameters: - name (str) – The column variable name keyed in
_meta['columns']
or_meta['masks']
. - ext_values (list of str or tuples in form of (int, str), default None) – When a list of str is given, the categorical values will simply be enumerated and mapped to the category labels. Alternatively codes can be mapped to categorical labels, e.g.: [(1, ‘Elephant’), (2, ‘Mouse’), (999, ‘No animal’)]
- text_key (str, default None) – Text key for text-based label information. Will automatically fall back to the instance’s text_key property information if not provided.
- safe (bool, default True) – If set to False, duplicate value texts are allowed when extending
the
values
object.
Returns: The
DataSet
is modified inplace.Return type: None
- name (str) – The column variable name keyed in
-
factors
(name)¶ Get categorical data’s stat. factor values.
Parameters: name (str) – The column variable name keyed in _meta['columns']
or_meta['masks']
.Returns: factors – A {value: factor}
mapping.Return type: OrderedDict
-
filter
(alias, condition, inplace=False)¶ Filter the DataSet using a Quantipy logical expression.
-
find
(str_tags=None, suffixed=False)¶ Find variables by searching their names for substrings.
Parameters: - str_tags ((list of) str) – The strings tags to look for in the variable names. If not provided, the modules’ default global list of substrings from VAR_SUFFIXES will be used.
- suffixed (bool, default False) – If set to True, only variable names that end with a given string sequence will qualify.
Returns: found – The list of matching variable names.
Return type: list
-
find_duplicate_texts
(name, text_key=None)¶ Collect values that share the same text information to find duplicates.
Parameters: - name (str) – The column variable name keyed in
_meta['columns']
or_meta['masks']
. - text_key (str, default None) – Text key for text-based label information. Will automatically fall
back to the instance’s
text_key
property information if not provided.
- name (str) – The column variable name keyed in
-
first_responses
(name, n=3, others='others', reduce_values=False)¶ Create n-first mentions from the set of responses of a delimited set.
Parameters: - name (str) – The column variable name of a delimited set keyed in
meta['columns']
. - n (int, default 3) – The number of mentions that will be turned into single-type variables, i.e. 1st mention, 2nd mention, 3rd mention, 4th mention, etc.
- others (None or str, default 'others') – If provided, all remaining values will end up in a new delimited set variable reduced by the responses transferred to the single mention variables.
- reduce_values (bool, default False) – If True, each new variable will only list the categorical value metadata for the codes found in the respective data vector, i.e. not the initial full codeframe.
Returns: DataSet is modified inplace.
Return type: None
- name (str) – The column variable name of a delimited set keyed in
-
flatten
(name, codes, new_name=None, text_key=None)¶ Create a variable that groups array mask item answers to categories.
Parameters: - name (str) – The array variable name keyed in
meta['masks']
that will be converted. - codes (int, list of int) – The answers codes that determine the categorical grouping. Item labels will become the category labels.
- new_name (str, default None) – The name of the new delimited set variable. If None,
name
is suffixed with ‘_rec’. - text_key (str, default None) – Text key for text-based label information. Uses the
DataSet.text_key
information if not provided.
Returns: The DataSet is modified inplace, delimited set variable is added.
Return type: None
- name (str) – The array variable name keyed in
-
force_texts
(copy_to=None, copy_from=None, update_existing=False)¶ Copy info from existing text_key to a new one or update the existing one.
Parameters: - copy_to (str) – {‘en-GB’, ‘da-DK’, ‘fi-FI’, ‘nb-NO’, ‘sv-SE’, ‘de-DE’} None -> _meta[‘lib’][‘default text’] The text key that will be filled.
- copy_from (str / list) – {‘en-GB’, ‘da-DK’, ‘fi-FI’, ‘nb-NO’, ‘sv-SE’, ‘de-DE’} You can also enter a list with text_keys, if the first text_key doesn’t exist, it takes the next one
- update_existing (bool) – True : copy_to will be filled in any case False: copy_to will be filled if it’s empty/not existing
Returns: Return type: None
-
from_batch
(batch_name, include='identity', text_key=[], apply_edits=True, additions='variables')¶ Get a filtered subset of the DataSet using qp.Batch definitions.
Parameters: - batch_name (str) – Name of a Batch included in the DataSet.
- include (str/ list of str) – Name of variables that get included even if they are not in Batch.
- text_key (str/ list of str, default None) – Take over all texts of the included text_key(s), if None is provided all included text_keys are taken.
- apply_edits (bool, default True) – meta_edits and rules are used as/ applied on global meta of the new DataSet instance.
- additions ({'variables', 'filters', 'full', None}) – Extend included variables by the xks, yks and weights of the
additional batches if set to ‘variables’, ‘filters’ will create
new 1/0-coded variables that reflect any filters defined. Selecting
‘full’ will do both,
None
will ignore additional Batches completely.
Returns: b_ds
Return type: quantipy.DataSet
-
from_components
(data_df, meta_dict=None, reset=True, text_key=None)¶ Attach data and meta directly to the
DataSet
instance.Note
Except testing for appropriate object types, this method offers no additional safeguards or consistency/compatibility checks with regard to the passed data and meta documents!
Parameters: - data_df (pandas.DataFrame) – A DataFrame that contains case data entries for the
DataSet
. - meta_dict (dict, default None) – A dict that stores meta data describing the columns of the data_df. It is assumed to be well-formed following the Quantipy meta data structure.
- reset (bool, default True) – Clean the ‘lib’ and
'sets'
metadata collections from non-native entries, e.g. user-defined information or helper metadata. - text_key (str, default None) – The text_key to be used. If not provided, it will be attempted to
use the ‘default text’ from the
meta['lib']
definition.
Returns: Return type: None
- data_df (pandas.DataFrame) – A DataFrame that contains case data entries for the
-
from_excel
(path_xlsx, merge=True, unique_key='identity')¶ Converts excel files to a dataset and/or merges variables.
Parameters: - path_xlsx (str) – Path where the excel file is stored. The file must have exactly one sheet with data.
- merge (bool) – If True the new data from the excel file will be merged on the dataset.
- unique_key (str) – If
merge=True
an hmerge is done on this variable.
Returns: new_dataset – Contains only the data from excel. If
merge=True
dataset is modified inplace.Return type: quantipy.DataSet
-
from_stack
(stack, data_key=None, dk_filter=None, reset=True)¶ Use
quantipy.Stack
data and meta to create aDataSet
instance.Parameters: - stack (quantipy.Stack) – The Stack instance to convert.
- data_key (str) – The reference name where meta and data information are stored.
- dk_filter (string, default None) – Filter name if the stack contains more than one filters. If None ‘no_filter’ will be used.
- reset (bool, default True) – Clean the ‘lib’ and
'sets'
metadata collections from non-native entries, e.g. user-defined information or helper metadata.
Returns: Return type: None
Get all array definitions that contain only hidden items.
Returns: hidden – The list of array mask names. Return type: list
-
get_batch
(name)¶ Get existing Batch instance from DataSet meta information.
Parameters: name (str) – Name of existing Batch instance.
-
get_property
(name, prop_name, text_key=None)¶
-
hide_empty_items
(condition=None, arrays=None)¶ Apply
rules
meta to automatically hide empty array items.Parameters: - name ((list of) str, default None) – The array mask variable names keyed in
_meta['masks']
. If not explicitly provided will test all array mask definitions. - condition (Quantipy logic expression) – A logical condition expressed as Quantipy logic that determines which subset of the case data rows to be considered.
Returns: Return type: None
- name ((list of) str, default None) – The array mask variable names keyed in
-
hiding
(name, hide, axis='y', hide_values=True)¶ Set or update
rules[axis]['dropx']
meta for the named column.Quantipy builds will respect the hidden codes and cut them from results.
Note
This is not equivalent to
DataSet.set_missings()
as missing values are respected also in computations.Parameters: - name (str or list of str) – The column variable(s) name keyed in
_meta['columns']
. - hide (int or list of int) – Values indicated by their
int
codes will be dropped fromQuantipy.View.dataframe
s. - axis ({'x', 'y'}, default 'y') – The axis to drop the values from.
- hide_values (bool, default True) – Only considered if
name
refers to a mask. If True, values are hidden on all mask items. If False, mask items are hidden by position (only for array summaries).
Returns: Return type: None
- name (str or list of str) – The column variable(s) name keyed in
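For instance, dropping hypothetical ‘don’t know’ codes from build results without touching computations:
>>> dataset.hiding('q2b', hide=[98, 99], axis='y')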
-
hmerge
(dataset, on=None, left_on=None, right_on=None, overwrite_text=False, from_set=None, inplace=True, update_existing=None, merge_existing=None, text_properties=None, verbose=True)¶ Merge Quantipy datasets together using an index-wise identifier.
This function merges two Quantipy datasets together, updating variables that exist in the left dataset and appending others. New variables will be appended in the order indicated by the ‘data file’ set if found, otherwise they will be appended in alphanumeric order. This merge happens horizontally (column-wise). Packed kwargs will be passed on to the pandas.DataFrame.merge() method call, but that merge will always happen using how=’left’.
Parameters: - dataset (
quantipy.DataSet
) – The dataset to merge into the currentDataSet
. - on (str, default=None) – The column to use as a join key for both datasets.
- left_on (str, default=None) – The column to use as a join key for the left dataset.
- right_on (str, default=None) – The column to use as a join key for the right dataset.
- overwrite_text (bool, default=False) – If True, text_keys in the left meta that also exist in right meta will be overwritten instead of ignored.
- from_set (str, default=None) – Use a set defined in the right meta to control which columns are merged from the right dataset.
- inplace (bool, default True) – If True, the
DataSet
will be modified inplace with new/updated columns. Will return a newDataSet
instance if False. - update_existing (str/ list of str, default None, {'all', [var_names]}) – Update values for defined delimited sets if it exists in both datasets.
- text_properties (str/ list of str, default=None, {'all', [var_names]}) – Controls the update of the dataset_left properties with properties from the dataset_right. If None, properties from dataset_left will be updated by the ones from the dataset_right. If ‘all’, properties from dataset_left will be kept unchanged. Otherwise, specify the list of properties which will be kept unchanged in the dataset_left; all others will be updated by the properties from dataset_right.
- verbose (bool, default=True) – Echo progress feedback to the output pane.
Returns: None or new_dataset – If the merge is not applied
inplace
, aDataSet
instance is returned.Return type: quantipy.DataSet
- dataset (
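A minimal sketch (other_ds stands for a second, hypothetical quantipy.DataSet sharing the 'identity' key):
>>> merged = dataset.hmerge(other_ds, on='identity', inplace=False)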
-
interlock
(name, label, variables, val_text_sep='/')¶ Build a new category-intersected variable from >=2 incoming variables.
Parameters: - name (str) – The new column variable name keyed in
_meta['columns']
. - label (str) – The new text label for the created variable.
- variables (list of >= 2 str or dict (mapper)) –
The column names of the variables that are feeding into the intersecting recode operation. Or dicts/mapper to create temporary variables for interlock. Can also be a mix of str and dict. Example:
>>> ['gender', ... {'agegrp': [(1, '18-34', {'age': frange('18-34')}), ... (2, '35-54', {'age': frange('35-54')}), ... (3, '55+', {'age': is_ge(55)})]}, ... 'region']
- val_text_sep (str, default '/') – The passed character (or any other str value) will be used to separate the incoming individual value texts to make up the intersected category value texts, e.g.: ‘Female/18-30/London’.
Returns: Return type: None
- name (str) – The new column variable name keyed in
-
is_like_numeric
(name)¶ Test if a
string
-typed variable can be expressed numerically.Parameters: name (str) – The column variable name keyed in _meta['columns']
.Returns: Return type: bool
-
is_nan
(name)¶ Detect empty entries in the
_data
rows.Parameters: name (str) – The column variable name keyed in meta['columns']
.Returns: count – A series with the results as bool. Return type: pandas.Series
-
is_subfilter
(name1, name2)¶ Verify if index of name2 is part of the index of name1.
-
item_no
(name)¶ Return the order/position number of passed array item variable name.
Parameters: name (str) – The column variable name keyed in _meta['columns']
.Returns: no – The positional index of the item (starting from 1). Return type: int
-
item_texts
(name, text_key=None, axis_edit=None)¶ Get the
text
meta data for the items of the passed array mask name.Parameters: - name (str) – The mask variable name keyed in
_meta['masks']
. - text_key (str, default None) – The text_key that should be used when taking labels from the source meta.
- axis_edit ({'x', 'y'}, default None) – If provided the text_key is taken from the x/y edits dict.
Returns: texts – The list of item texts for the array elements.
Return type: list
- name (str) – The mask variable name keyed in
-
items
(name, text_key=None, axis_edit=None)¶ Get the array’s paired item names and texts information from the meta.
Parameters: - name (str) – The column variable name keyed in
_meta['masks']
. - text_key (str, default None) – The text_key that should be used when taking labels from the source meta.
- axis_edit ({'x', 'y'}, default None) – If provided the text_key is taken from the x/y edits dict.
Returns: items – The list of source item names (from
_meta['columns']
) and theirtext
information packed as tuples.Return type: list of tuples
- name (str) – The column variable name keyed in
-
link
(filters=None, x=None, y=None, views=None)¶ Create a Link instance from the DataSet.
-
manifest_filter
(name)¶ Get index slicer from filter-variables.
Parameters: name (str) – Name of the filter_variable.
-
merge_texts
(dataset)¶ Add additional
text
versions from othertext_key
meta.Case data will be ignored during the merging process.
Parameters: dataset ((A list of multiple) quantipy.DataSet
) – One or multiple datasets that provide newtext_key
meta.Returns: Return type: None
-
meta
(name=None, text_key=None, axis_edit=None)¶ Provide a pretty summary for variable meta given as per
name
.Parameters: - name (str, default None) – The variable name keyed in
_meta['columns']
or_meta['masks']
. If None, the entiremeta
component of theDataSet
instance will be returned. - text_key (str, default None) – The text_key that should be used when taking labels from the source meta.
- axis_edit ({'x', 'y'}, default None) – If provided the text_key is taken from the x/y edits dict.
Returns: meta – Either a DataFrame that sums up the meta information on a
mask
orcolumn
or the meta dict as a whole isReturn type: dict or pandas.DataFrame
- name (str, default None) – The variable name keyed in
-
meta_to_json
(key=None, collection=None)¶ Save a meta object as json file.
Parameters: - key (str, default None) – Name of the variable whose metadata is saved, if key is not provided included collection or the whole meta is saved.
- collection (str {'columns', 'masks', 'sets', 'lib'}, default None) – The meta object is taken from this collection.
Returns: Return type: None
-
min_value_count
(name, min=50, weight=None, condition=None, axis='y', verbose=True)¶ Wrapper for self.hiding(), which is hiding low value_counts.
Parameters: - variables (str/ list of str) – Name(s) of the variable(s) whose values are checked against the defined border.
- min (int) – If the amount of counts for a value is below this number, the value is hidden.
- weight (str, default None) – Name of the weight, which is used to calculate the weighted counts.
- condition (complex logic) – The data, which is used to calculate the counts, can be filtered by the included condition.
- axis ({'y', 'x', ['x', 'y']}, default None) – The axis on which the values are hidden.
-
names
(ignore_items=True)¶ Find all weak-duplicate variable names that are different only by case.
Note
Will return self.variables() if no weak-duplicates are found.
Returns: weak_dupes – An overview of case-sensitive spelling differences in otherwise equal variable names. Return type: pd.DataFrame
-
order
(new_order=None, reposition=None, regroup=False)¶ Set the global order of the DataSet variables collection.
The global order of the DataSet is reflected in the data component’s pd.DataFrame.columns order and the variable references in the meta component’s ‘data file’ items.
Parameters: - new_order (list) – A list of all DataSet variables in the desired order.
- reposition ((List of) dict) – Each dict maps one or a list of variables to a reference variable name key. The mapped variables are moved before the reference key.
- regroup (bool, default False) – Attempt to regroup non-native variables (i.e. created either
manually with
add_meta()
,recode()
,derive()
, etc. or automatically by manifestingqp.View
objects) with their originating variables.
Returns: Return type: None
-
parents
(name)¶ Get the
parent
meta information for masks-structured column elements.Parameters: name (str) – The mask variable name keyed in _meta['columns']
.Returns: parents – The list of parents the _meta['columns']
variable is attached to.Return type: list
-
populate
(batches='all', verbose=True)¶ Create a
qp.Stack
based on all availableqp.Batch
definitions.Parameters: batches (str/ list of str) – Name(s) of qp.Batch
instances that are used to populate theqp.Stack
.Returns: Return type: qp.Stack
-
read_ascribe
(path_meta, path_data, text_key)¶ Load Dimensions .xml/.txt files, connecting as data and meta components.
Parameters: - path_meta (str) – The full path (optionally with extension
'.xml'
, otherwise assumed as such) to the meta data defining'.xml'
file. - path_data (str) – The full path (optionally with extension
'.txt'
, otherwise assumed as such) to the case data defining'.txt'
file.
Returns: The
DataSet
is modified inplace, connected to Quantipy data and meta components that have been converted from their Ascribe source files.Return type: None
- path_meta (str) – The full path (optionally with extension
-
read_dimensions
(path_meta, path_data)¶ Load Dimensions .ddf/.mdd files, connecting as data and meta components.
Parameters: - path_meta (str) – The full path (optionally with extension
'.mdd'
, otherwise assumed as such) to the meta data defining'.mdd'
file. - path_data (str) – The full path (optionally with extension
'.ddf'
, otherwise assumed as such) to the case data defining'.ddf'
file.
Returns: The
DataSet
is modified inplace, connected to Quantipy data and meta components that have been converted from their Dimensions source files.Return type: None
- path_meta (str) – The full path (optionally with extension
-
read_quantipy
(path_meta, path_data, reset=True)¶ Load Quantipy .csv/.json files, connecting as data and meta components.
Parameters: - path_meta (str) – The full path (optionally with extension
'.json'
, otherwise assumed as such) to the meta data defining'.json'
file. - path_data (str) – The full path (optionally with extension
'.csv'
, otherwise assumed as such) to the case data defining'.csv'
file. - reset (bool, default True) – Clean the ‘lib’ and
'sets'
metadata collections from non-native entries, e.g. user-defined information or helper metadata.
Returns: The
DataSet
is modified inplace, connected to Quantipy native data and meta components.Return type: None
- path_meta (str) – The full path (optionally with extension
-
read_spss
(path_sav, **kwargs)¶ Load SPSS Statistics .sav files, converting and connecting data/meta.
Parameters: path_sav (str) – The full path (optionally with extension '.sav'
, otherwise assumed as such) to the'.sav'
file.Returns: The DataSet
is modified inplace, connected to Quantipy data and meta components that have been converted from the SPSS source file.Return type: None
-
recode
(target, mapper, default=None, append=False, intersect=None, initialize=None, fillna=None, inplace=True)¶ Create a new or copied series from data, recoded using a mapper.
This function takes a mapper of {key: logic} entries and injects the key into the target column where its paired logic is True. The logic may be arbitrarily complex and may refer to any other variable or variables in data. Where a pre-existing column has been used to start the recode, the injected values can replace or be appended to any data found there to begin with. Note that this function does not edit the target column, it returns a recoded copy of the target column. The recoded data will always comply with the column type indicated for the target column according to the meta.
Parameters: - target (str) – The column variable name keyed in
_meta['columns']
that is the target of the recode. If not found in_meta
this will fail with an error. Iftarget
is not found in data.columns the recode will start from an empty series with the same index as_data
. Iftarget
is found in data.columns the recode will start from a copy of that column. - mapper (dict) – A mapper of {key: logic} entries.
- default (str, default None) – The column name to default to in cases where unattended lists are given in your logic, where an auto-transformation of {key: list} to {key: {default: list}} is provided. Note that lists in logical statements are themselves a form of shorthand and this will ultimately be interpreted as: {key: {default: has_any(list)}}.
- append (bool, default False) – Should the new recoded data be appended to values already found in the series? If False, data from series (where found) will overwrite whatever was found for that item instead.
- intersect (logical statement, default None) – If a logical statement is given here then it will be used as an implied intersection of all logical conditions given in the mapper.
- initialize (str or np.NaN, default None) – If not None, a copy of the data named column will be used to populate the target column before the recode is performed. Alternatively, initialize can be used to populate the target column with np.NaNs (overwriting whatever may be there) prior to the recode.
- fillna (int, default=None) – If not None, the value passed to fillna will be used on the recoded series as per pandas.Series.fillna().
- inplace (bool, default True) – If True, the
DataSet
will be modified inplace with new/updated columns. Will return a new recodedpandas.Series
instance if False.
Returns: Either the
DataSet._data
is modified inplace or a newpandas.Series
is returned.Return type: None or recode_series
- target (str) – The column variable name keyed in
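A hedged sketch: define the target’s meta first with add_meta(), then inject codes via a {key: logic} mapper (column names are illustrative; frange and is_ge are the logic helpers shown in other examples of this reference):
>>> dataset.add_meta('age_group', 'single', 'Age group',
...                  [(1, '18-34'), (2, '35-54'), (3, '55+')])
>>> mapper = {1: {'age': frange('18-34')},
...           2: {'age': frange('35-54')},
...           3: {'age': is_ge(55)}}
>>> dataset.recode('age_group', mapper)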
-
reduce_filter_var
(name, values)¶ Remove values from filter-variables and recalculate the filter.
-
remove_html
()¶ Cycle through all meta
text
objects removing html tags.Currently uses the regular expression ‘<.*?>’ in _remove_html() classmethod.
Returns: Return type: None
-
remove_items
(name, remove)¶ Erase array mask items safely from both meta and case data components.
Parameters: - name (str) – The originating column variable name keyed in
meta['masks']
. - remove (int or list of int) – The items listed by their order number in the
_meta['masks'][name]['items']
object will be droped from themask
definition.
Returns: DataSet is modified inplace.
Return type: None
- name (str) – The originating column variable name keyed in
-
remove_values
(name, remove)¶ Erase value codes safely from both meta and case data components.
Attempting to remove all value codes from the variable’s value object will raise a
ValueError
!Parameters: - name (str) – The originating column variable name keyed in
meta['columns']
ormeta['masks']
. - remove (int or list of int) – The codes to be removed from the
DataSet
variable.
Returns: DataSet is modified inplace.
Return type: None
- name (str) – The originating column variable name keyed in
-
rename
(name, new_name)¶ Change meta and data column name references of the variable defintion.
Parameters: - name (str) – The originating column variable name keyed in
meta['columns']
ormeta['masks']
. - new_name (str) – The new variable name.
Returns: DataSet is modified inplace. The new name reference replaces the original one.
Return type: None
- name (str) – The originating column variable name keyed in
-
rename_from_mapper
(mapper, keep_original=False, ignore_batch_props=False)¶ Rename meta objects and data columns using mapper.
Parameters: mapper (dict) – A renaming mapper in the form of a dict of {old: new} that will be used to rename columns throughout the meta and data. Returns: DataSet is modified inplace. Return type: None
-
reorder_items
(name, new_order)¶ Apply a new order to mask items.
Parameters: - name (str) – The variable name keyed in
_meta['masks']
. - new_order (list of int, default None) – The new order of the mask items. The included ints match up to
the number of the items (
DataSet.item_no('item_name')
).
Returns: DataSet is modified inplace.
Return type: None
- name (str) – The variable name keyed in
-
reorder_values
(name, new_order=None)¶ Apply a new order to the value codes defined by the meta data component.
Parameters: - name (str) – The column variable name keyed in
_meta['columns']
or_meta['masks']
. - new_order (list of int, default None) – The new code order of the DataSet variable. If no order is given,
the
values
object is sorted ascending.
Returns: DataSet is modified inplace.
Return type: None
- name (str) – The column variable name keyed in
-
repair
()¶ Try to fix legacy meta data inconsistencies and badly shaped array / datafile items
'sets'
meta definitions.
-
repair_text_edits
(text_key=None)¶ Cycle through all meta
text
objects repairing axis edits.Parameters: text_key (str / list of str, default None) – {None, ‘en-GB’, ‘da-DK’, ‘fi-FI’, ‘nb-NO’, ‘sv-SE’, ‘de-DE’} The text_keys for which text edits should be included. Returns: Return type: None
-
replace_texts
(replace, text_key=None)¶ Cycle through all meta
text
objects replacing unwanted strings. Parameters: - replace (dict, default None) – A dictionary mapping {unwanted string: replacement string}.
- text_key (str / list of str, default None) – {None, ‘en-GB’, ‘da-DK’, ‘fi-FI’, ‘nb-NO’, ‘sv-SE’, ‘de-DE’} The text_keys for which unwanted strings are replaced.
Returns: Return type: None
-
resolve_name
(name)¶
-
restore_item_texts
(arrays=None)¶ Restore array item texts.
Parameters: arrays (str, list of str, default None) – Restore texts for items of these arrays. If None, all keys in ._meta['masks']
are taken.
-
revert
()¶ Return to a previously saved state of the DataSet.
Note
This method is designed primarily for use in interactive Python environments like iPython/Jupyter and their notebook applications.
-
roll_up
(varlist, ignore_arrays=None)¶ Replace any array items with their parent mask variable definition name.
Parameters: - varlist (list) – A list of meta
'columns'
and/or'masks'
names. - ignore_arrays ((list of) str) – A list of array mask names that should not be rolled up if their
items are found inside
varlist
.
Note
varlist can also contain nesting var1 > var2. The variables which are included in the nesting can also be controlled by keep and both, even if the variables are also included as a “normal” variable.
Returns: rolled_up – The modified varlist
.Return type: list - varlist (list) – A list of meta
-
save
()¶ Save the current state of the DataSet’s data and meta.
The saved file will be temporarily stored inside the cache. Use this to take a snapshot of the DataSet state to easily revert back to at a later stage.
Note
This method is designed primarily for use in interactive Python environments like iPython/Jupyter notebook applications.
-
select_text_keys
(text_key=None)¶ Cycle through all meta
text
objects keep only selected text_key.Parameters: text_key (str / list of str, default None) – {None, ‘en-GB’, ‘da-DK’, ‘fi-FI’, ‘nb-NO’, ‘sv-SE’, ‘de-DE’} The text_keys which should be kept. Returns: Return type: None
-
classmethod
set_encoding
(encoding)¶ Hack sys.setdefaultencoding() to escape ASCII hell.
Parameters: encoding (str) – The name of the encoding to default to.
-
set_factors
(name, factormap, safe=False)¶ Apply numerical factors to (
single
-type categorical) variables.Factors can be read while aggregating descrp. stat.
qp.Views
.Parameters: - name (str) – The column variable name keyed in
_meta['columns']
or_meta['masks']
. - factormap (dict) – A mapping of
{value: factor}
(int
toint
). - safe (bool, default False) – Set to
True
to prevent setting factors to thevalues
meta data of non-single
type variables.
Returns: Return type: None
- name (str) – The column variable name keyed in
-
set_item_texts
(name, renamed_items, text_key=None, axis_edit=None)¶ Rename or add item texts in the
items
objects ofmasks
.Parameters: - name (str) – The column variable name keyed in
_meta['masks']
. - renamed_items (dict) –
A dict mapping with following structure (array mask items are assumed to be passed by their order number):
>>> {1: 'new label for item #1', ... 5: 'new label for item #5'}
- text_key (str, default None) – Text key for text-based label information. Will automatically fall
back to the instance’s
text_key
property information if not provided. - axis_edit ({'x', 'y', ['x', 'y']}, default None) – If the
new_text
of the variable should only be considered temp. for build exports, the axes on that the edited text should appear can be provided.
Returns: The
DataSet
is modified inplace.Return type: None
- name (str) – The column variable name keyed in
-
set_missings
(var, missing_map='default', hide_on_y=True, ignore=None)¶ Flag category definitions for exclusion in aggregations.
Parameters: - var (str or list of str) – Variable(s) to apply the meta flags to.
- missing_map ('default' or list of codes or dict of {'flag': code(s)}, default 'default') – A mapping of codes to flags that can either be ‘exclude’ (globally ignored) or ‘d.exclude’ (only ignored in descriptive statistics). Codes provided in a list are flagged as ‘exclude’. Passing ‘default’ is using a preset list of (TODO: specify) values for exclusion.
- ignore (str or list of str, default None) – A list of variables that should be ignored when applying missing flags via the ‘default’ list method.
Returns: Return type: None
-
set_property
(name, prop_name, prop_value, ignore_items=False)¶ Access and set the value of a meta object’s
properties
collection.Parameters: - name (str) – The originating column variable name keyed in
meta['columns']
ormeta['masks']
. - prop_name (str) – The property key name.
- prop_value (any) – The value to be set for the property. Must be of valid type and have allowed values(s) with regard to the property.
- ignore_items (bool, default False) – When
name
refers to a variable from the'masks'
collection, setting to True will ignore anyitems
and only apply the property to themask
itself.
Returns: Return type: None
- name (str) – The originating column variable name keyed in
-
set_text_key
(text_key)¶ Set the default text_key of the
DataSet
.Note
A lot of the instance methods will fall back to the default text key in
_meta['lib']['default text']
. It is therefore important to use this method with caution, i.e. ensure that the meta containstext
entries for thetext_key
set.Parameters: text_key ({'en-GB', 'da-DK', 'fi-FI', 'nb-NO', 'sv-SE', 'de-DE'}) – The text key that will be set in _meta['lib']['default text']
.Returns: Return type: None
-
set_value_texts
(name, renamed_vals, text_key=None, axis_edit=None)¶ Rename or add value texts in the ‘values’ object.
This method works for array masks and column meta data.
Parameters: - name (str) – The column variable name keyed in
_meta['columns']
or_meta['masks']
. - renamed_vals (dict) – A dict mapping with following structure:
{1: 'new label for code=1', 5: 'new label for code=5'}
Codes will be ignored if they do not exist in the ‘values’ object. - text_key (str, default None) – Text key for text-based label information. Will automatically fall
back to the instance’s
text_key
property information if not provided. - axis_edit ({'x', 'y', ['x', 'y']}, default None) – If
renamed_vals
should only be considered temp. for build exports, the axes on that the edited text should appear can be provided.
Returns: The
DataSet
is modified inplace.Return type: None
- name (str) – The column variable name keyed in
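For example, relabelling two codes of a hypothetical 'q1' (codes not present in the ‘values’ object would simply be ignored):
>>> dataset.set_value_texts('q1', {1: 'Strongly agree', 5: 'Strongly disagree'})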
-
set_variable_text
(name, new_text, text_key=None, axis_edit=None)¶ Apply a new or update a column’s/masks’ meta text object.
Parameters: - name (str) – The originating column variable name keyed in
meta['columns']
ormeta['masks']
. - new_text (str) – The
text
(label) to be set. - text_key (str, default None) – Text key for text-based label information. Will automatically fall back to the instance’s text_key property information if not provided.
- axis_edit ({'x', 'y', ['x', 'y']}, default None) – If the
new_text
of the variable should only be considered temp. for build exports, the axes on that the edited text should appear can be provided.
Returns: The
DataSet
is modified inplace.Return type: None
- name (str) – The originating column variable name keyed in
-
set_verbose_errmsg
(verbose=True)¶
-
set_verbose_infomsg
(verbose=True)¶
-
slicing
(name, slicer, axis='y')¶ Set or update
rules[axis]['slicex']
meta for the named column.Quantipy builds will respect the kept codes and show them exclusively in results.
Note
This is not a replacement for
DataSet.set_missings()
as missing values are respected also in computations.Parameters: - name (str or list of str) – The column variable(s) name keyed in
_meta['columns']
. - slice (int or list of int) – Values indicated by their
int
codes will be shown inQuantipy.View.dataframe
s, respecting the provided order. - axis ({'x', 'y'}, default 'y') – The axis to slice the values on.
Returns: Return type: None
- name (str or list of str) – The column variable(s) name keyed in
-
sorting
(name, on='@', within=False, between=False, fix=None, ascending=False, sort_by_weight='auto')¶ Set or update
rules['x']['sortx']
meta for the named column.Parameters: - name (str or list of str) – The column variable(s) name keyed in
_meta['columns']
. - within (bool, default True) – Applies only to variables that have been aggregated by creating a
an
expand
grouping / overcode-styleView
: If True, will sort frequencies inside each group. - between (bool, default True) – Applies only to variables that have been aggregated by creating a
an
expand
grouping / overcode-styleView
: If True, will sort group and regular code frequencies with regard to each other. - fix (int or list of int, default None) – Values indicated by their
int
codes will be ignored in the sorting operation. - ascending (bool, default False) – By default frequencies are sorted in descending order. Specify
True
to sort ascending.
Returns: Return type: None
- name (str or list of str) – The column variable(s) name keyed in
-
sources
(name)¶ Get the
_meta['columns']
elements for the passed array mask name.Parameters: name (str) – The mask variable name keyed in _meta['masks']
.Returns: sources – The list of source elements from the array definition. Return type: list
-
split
(save=False)¶ Return the
meta
anddata
components of the DataSet instance.Parameters: save (bool, default False) – If True, the meta
anddata
objects will be saved to disk, using the instance’sname
andpath
attributes to determine the file location.Returns: meta, data – The meta dict and the case data DataFrame as separate objects. Return type: dict, pandas.DataFrame
-
static
start_meta
(text_key='main')¶ Starts a new/empty Quantipy meta document.
Parameters: text_key (str, default None) – The default text key to be set into the new meta document. Returns: meta – Quantipy meta object Return type: dict
-
subset
(variables=None, from_set=None, inplace=False)¶ Create a cloned version of self with a reduced collection of variables.
Parameters: - variables (str or list of str, default None) – A list of variable names to include in the new DataSet instance.
- from_set (str) – The name of an already existing set to base the new DataSet on.
Returns: subset_ds – The new reduced version of the DataSet.
Return type: qp.DataSet
-
take
(condition)¶ Create an index slicer to select rows from the DataFrame component.
Parameters: condition (Quantipy logic expression) – A logical condition expressed as Quantipy logic that determines which subset of the case data rows to be kept. Returns: slicer – The indices fulfilling the passed logical condition. Return type: pandas.Index
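A sketch of the intended usage, assuming dataset is an existing qp.DataSet and 'gender'/'q1' are hypothetical variables (the dict form is the common shorthand for a has_any condition):
>>> slicer = dataset.take({'gender': [1]})
>>> dataset._data.loc[slicer, ['gender', 'q1']].head()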
-
text
(name, shorten=True, text_key=None, axis_edit=None)¶ Return the variables text label information.
Parameters: - name (str, default None) – The variable name keyed in
_meta['columns']
or_meta['masks']
. - shorten (bool, default True) – If True,
text
label meta from array items will not report the parent mask’stext
. Setting it to False will show the “full” label. - text_key (str, default None) – The default text key to be set into the new meta document.
- axis_edit ({'x', 'y'}, default None) – If provided the text_key is taken from the x/y edits dict.
Returns: text – The text metadata.
Return type: str
-
to_array
(name, variables, label, safe=True)¶ Combines column variables with the same
values
meta into an array.Parameters: - name (str) – Name of new grid.
- variables (list of str or list of dicts) – Variable names that become items of the array. New item labels can be added as dict. Example: variables = [‘q1_1’, {‘q1_2’: ‘shop 2’}, {‘q1_3’: ‘shop 3’}]
- label (str) – Text label for the mask itself.
- safe (bool, default True) – If True, the method will raise a
ValueError
if the provided variable name is already present in self. SelectFalse
to forcefully overwrite an existing variable with the same name (independent of its type).
Returns: Return type: None
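Following the variables example from the parameter description, a hypothetical grid could be created like this (all names and labels are made up):
>>> dataset.to_array('q1_array', ['q1_1', {'q1_2': 'shop 2'}, {'q1_3': 'shop 3'}], 'Shops visited')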
-
to_delimited_set
(name, label, variables, from_dichotomous=True, codes_from_name=True)¶ Combines multiple single variables into a new delimited set variable.
Parameters: - name (str) – Name of new delimited set
- label (str) – Label text for the new delimited set.
- variables (list of str or list of tuples) – Variables that get combined into the new delimited set. If they are dichotomous (from_dichotomous=True), the labels of the variables are used as category texts, or, if tuples are included, the second items will be used for the category texts. If the variables are categorical (from_dichotomous=False), the values of the variables need to be equal and are taken for the delimited set.
- from_dichotomous (bool, default True) – Define if the input variables are dichotomous or categorical.
- codes_from_name (bool, default True) – If from_dichotomous=True, the codes can be taken from the variable names if they are in the form of ‘q01_1’, ‘q01_3’, … In this case the codes will be 1, 3, ….
Returns: Return type: None
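For dichotomous inputs named like 'q01_1', 'q01_3', … a sketch of the call could look like this (hypothetical names; with codes_from_name=True the resulting codes would be 1 and 3):
>>> dataset.to_delimited_set('q01_set', 'Brands bought', ['q01_1', 'q01_3'])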
-
transpose
(name, new_name=None, ignore_items=None, ignore_values=None, copy_data=True, text_key=None, overwrite=False)¶ Create a new array mask with transposed items / values structure.
This method will automatically create meta and case data additions in the
DataSet
instance.Parameters: - name (str) – The originating mask variable name keyed in
meta['masks']
. - new_name (str, default None) – The name of the new mask. If not provided explicitly, the new_name
will be constructed by suffixing the original
name
with ‘_trans’, e.g. 'Q2Array_trans'
. - ignore_items (int or list of int, default None) – If provided, the items listed by their order number in the
_meta['masks'][name]['items']
object will not be part of the transposed array. This means they will be ignored while creating the new value codes meta. - ignore_values (int or list of int, default None) – If provided, the listed code values will not be part of the transposed array. This means they will not be part of the new item meta.
- text_key (str) – The text key to be used when generating text objects, i.e. item and value labels.
- overwrite (bool, default False) – Overwrite variable if new_name is already included.
Returns: DataSet is modified inplace.
Return type: None
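A minimal sketch, assuming 'Q2Array' is an existing mask: transpose it while skipping its third item, then inspect the label of the auto-named new mask:
>>> dataset.transpose('Q2Array', ignore_items=[3])
>>> dataset.text('Q2Array_trans')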
-
unbind
(name)¶ Remove mask-structure for arrays
-
uncode
(target, mapper, default=None, intersect=None, inplace=True)¶ Create a new or copied series from data, recoded using a mapper.
Parameters: - target (str) – The variable name that is the target of the uncode. If it is keyed
in
_meta['masks']
the uncode is done for all mask items. If not found in_meta
this will fail with an error. - mapper (dict) – A mapper of {key: logic} entries.
- default (str, default None) – The column name to default to in cases where unattended lists are given in your logic, where an auto-transformation of {key: list} to {key: {default: list}} is provided. Note that lists in logical statements are themselves a form of shorthand and this will ultimately be interpreted as: {key: {default: has_any(list)}}.
- intersect (logical statement, default None) – If a logical statement is given here then it will be used as an implied intersection of all logical conditions given in the mapper.
- inplace (bool, default True) – If True, the
DataSet
will be modified inplace with new/updated columns. Will return a new recodedpandas.Series
instance if False.
Returns: Either the
DataSet._data
is modified inplace or a newpandas.Series
is returned.Return type: None or uncode_series
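As a sketch with hypothetical names and codes: erasing an exclusive 'None of these' code 99 from all cases that also hold any of the codes 1-5 (the plain list is the documented has_any shorthand):
>>> dataset.uncode('q3', {99: {'q3': [1, 2, 3, 4, 5]}})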
-
undimensionize
(names=None, mapper_to_meta=False)¶ Rename the dataset columns to remove Dimensions compatibility.
-
undimensionizing_mapper
(names=None)¶ Return a renaming dataset mapper for un-dimensionizing names.
Parameters: None – Returns: mapper – A renaming mapper in the form of a dict of {old: new} that maps Dimensions naming conventions to non-Dimensions naming conventions. Return type: dict
-
unify_values
(name, code_map, slicer=None, exclusive=False)¶ Use a mapping of old to new codes to replace code values in
_data
.Note
Experimental! Check results carefully!
Parameters: - name (str) – The column variable name keyed in
meta['columns']
. - code_map (dict) – A mapping of
{old: new}
;old
andnew
must be the int-type code values from the column meta data. - slicer (Quantipy logic statement, default None) – If provided, the values will only be unified for cases where the condition holds.
- exclusive (bool, default False) – If True, the recoded unified value will replace whatever is already
found in the
_data
column, ignoring
delimited set
typed data to which it would normally get appended.
Returns: Return type: None
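A sketch with hypothetical codes: collapsing the codes 4 and 5 into 3, but only for cases where a (made-up) 'gender' condition holds:
>>> dataset.unify_values('q1', {4: 3, 5: 3}, slicer={'gender': [1]})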
-
unroll
(varlist, keep=None, both=None)¶ Replace masks with their items, optionally excluding/keeping certain ones.
Parameters: - varlist (list) – A list of meta
'columns'
and/or'masks'
names. - keep (str or list, default None) – The names of masks that will not be replaced with their items.
- both ('all', str or list of str, default None) – The names of masks that will be included both as themselves and as collections of their items.
Note
varlist can also contain nesting definitions like var1 > var2. The variables included in the nesting can also be controlled by keep and both, even if they are also included as “normal” variables.
- Example:
>>> ds.unroll(varlist = ['q1', 'q1 > gender'], both='all') ['q1', 'q1_1', 'q1_2', 'q1 > gender', 'q1_1 > gender', 'q1_2 > gender']
Returns: unrolled – The modified varlist
. Return type: list
-
update
(data, on='identity', text_properties=None)¶ Update the
DataSet
with the case data entries found indata
.Parameters: - data (
pandas.DataFrame
) – A dataframe that contains a subset of columns from theDataSet
case data component. - on (str, default 'identity') – The column to use as a join key.
- text_properties (str/ list of str, default=None, {'all', [var_names]}) – Controls the update of the dataset_left properties with properties from the dataset_right. If None, properties from dataset_left will be updated by the ones from the dataset_right. If ‘all’, properties from dataset_left will be kept unchanged. Otherwise, specify the list of properties which will be kept unchanged in the dataset_left; all others will be updated by the properties from dataset_right.
Returns: DataSet is modified inplace.
Return type: None
-
used_text_keys
()¶ Get a list of all used textkeys in the dataset instance.
-
validate
(spss_limits=False, verbose=True)¶ Identify and report inconsistencies in the
DataSet
instance.- name:
- column/mask name and
meta[collection][var]['name']
are not identical - q_label:
- text object is badly formatted or has empty text mapping
- values:
- categorical variable does not contain values, value text is badly formatted or has empty text mapping
- text_keys:
- dataset.text_key is not included or existing text keys are not consistent (also for parents)
- source:
- parents or items do not exist
- codes:
- codes in data component are not included in meta component
- spss limit name:
- length of name is greater than spss limit (64 characters) (only shown if spss_limits=True)
- spss limit q_label:
- length of q_label is greater than spss limit (256 characters) (only shown if spss_limits=True)
- spss limit values:
- length of any value text is greater than spss limit (120 characters) (only shown if spss_limits=True)
-
value_texts
(name, text_key=None, axis_edit=None)¶ Get categorical data’s text information.
Parameters: - name (str) – The column variable name keyed in
_meta['columns']
. - text_key (str, default None) – The text_key that should be used when taking labels from the source meta.
- axis_edit ({'x', 'y'}, default None) – If provided the text_key is taken from the x/y edits dict.
Returns: texts – The list of category texts.
Return type: list
-
values
(name, text_key=None, axis_edit=None)¶ Get categorical data’s paired code and texts information from the meta.
Parameters: - name (str) – The column variable name keyed in
_meta['columns']
or_meta['masks']
. - text_key (str, default None) – The text_key that should be used when taking labels from the source meta.
- axis_edit ({'x', 'y'}, default None) – If provided the text_key is taken from the x/y edits dict.
Returns: values – The list of the numerical category codes and their
texts
packed as tuples.Return type: list of tuples
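Illustrative only (the variable and its categories are made up), the two accessors differ in what they return:
>>> dataset.values('gender')
[(1, 'Male'), (2, 'Female')]
>>> dataset.value_texts('gender')
['Male', 'Female']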
-
variables
(setname='data file', numeric=True, string=True, date=True, boolean=True, blacklist=None)¶ View all DataSet variables listed in their global order.
Parameters: - setname (str, default 'data file') – The name of the variable set to query. Defaults to the main variable collection stored via ‘data file’.
- numeric (bool, default True) – Include
int
andfloat
type variables? - string (bool, default True) – Include
string
type variables? - date (bool, default True) – Include
date
type variables? - boolean (bool, default True) – Include
boolean
type variables? - blacklist (list, default None) – A list of variables names to exclude from the variable listing.
Returns: varlist – The list of variables registered in the queried
set
.Return type: list
-
vmerge
(dataset, on=None, left_on=None, right_on=None, row_id_name=None, left_id=None, right_id=None, row_ids=None, overwrite_text=False, from_set=None, uniquify_key=None, reset_index=True, inplace=True, text_properties=None, verbose=True)¶ Merge Quantipy datasets together by appending rows.
This function merges two Quantipy datasets together, updating variables that exist in the left dataset and appending others. New variables will be appended in the order indicated by the ‘data file’ set if found, otherwise they will be appended in alphanumeric order. This merge happens vertically (row-wise).
Parameters: - dataset ((A list of multiple)
quantipy.DataSet
) – One or multiple datasets to merge into the currentDataSet
. - on (str, default=None) – The column to use to identify unique rows in both datasets.
- left_on (str, default=None) – The column to use to identify unique rows in the left dataset.
- right_on (str, default=None) – The column to use to identify unique rows in the right dataset.
- row_id_name (str, default=None) – The named column will be filled with the ids indicated for each dataset, as per left_id/right_id/row_ids. If meta for the named column doesn’t already exist a new column definition will be added and assigned a reductive-appropriate type.
- left_id (str/int/float, default=None) – Where the row_id_name column is not already populated for the dataset_left, this value will be populated.
- right_id (str/int/float, default=None) – Where the row_id_name column is not already populated for the dataset_right, this value will be populated.
- row_ids (list of str/int/float, default=None) – When a list of datasets has been passed, this list provides the row ids that will be populated in the row_id_name column for each of those datasets, respectively.
- overwrite_text (bool, default=False) – If True, text_keys in the left meta that also exist in right meta will be overwritten instead of ignored.
- from_set (str, default=None) – Use a set defined in the right meta to control which columns are merged from the right dataset.
- uniquify_key (str, default None) – An int-like column name found in all the passed
DataSet
objects that will be protected from having duplicates. The original version of the column will be kept under its name prefixed with ‘original’. - reset_index (bool, default=True) – If True pandas.DataFrame.reindex() will be applied to the merged dataframe.
- inplace (bool, default True) – If True, the
DataSet
will be modified inplace with new/updated rows. Will return a newDataSet
instance if False. - merge_existing (str/ list of str, default None, {'all', [var_names]}) – Merge values for defined delimited sets if it exists in both datasets. (update_existing is prioritized)
- text_properties (str/ list of str, default=None, {'all', [var_names]}) – Controls the update of the dataset_left properties with properties from the dataset_right. If None, properties from dataset_left will be updated by the ones from the dataset_right. If ‘all’, properties from dataset_left will be kept unchanged. Otherwise, specify the list of properties which will be kept unchanged in the dataset_left; all others will be updated by the properties from dataset_right.
- verbose (bool, default=True) – Echo progress feedback to the output pane.
Returns: None or new_dataset – If the merge is not applied
inplace
, aDataSet
instance is returned.Return type: quantipy.DataSet
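A sketch of stacking a hypothetical second wave onto the current file while tagging the origin of each row:
>>> dataset.vmerge(wave2, on='identity', row_id_name='wave', left_id=1, right_id=2)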
-
weight
(weight_scheme, weight_name='weight', unique_key='identity', subset=None, report=True, path_report=None, inplace=True, verbose=True)¶ Weight the
DataSet
according to a well-defined weight scheme.Parameters: - weight_scheme (quantipy.Rim instance) – A rim weights setup with defined targets. Can include multiple weight groups and/or filters.
- weight_name (str, default 'weight') – A name for the float variable that is added to pick up the weight factors.
- unique_key (str, default 'identity') – A variable inside the
DataSet
instance that will be used to map individual case weights to their matching rows. - subset (Quantipy complex logic expression) – A logic to filter the DataSet, weighting only the remaining subset.
- report (bool, default True) – If True, will report a summary of the weight algorithm run and factor outcomes.
- path_report (str, default None) – A file path to save an .xlsx version of the weight report to.
- inplace (bool, default True) – If True, the weight factors are merged back into the
DataSet
instance. Will otherwise return thepandas.DataFrame
that contains the weight factors, theunique_key
and all variables that have been used to compute the weights (filters, target variables, etc.).
Returns: Will either create a new column called
'weight'
in theDataSet
instance or return aDataFrame
that contains the weight factors.Return type: None or
pandas.DataFrame
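A minimal sketch of a weighting run; the scheme name and target figures are made up and the targets follow the column-to-proportion-list format documented for Rim.set_targets() below:
>>> import quantipy as qp
>>> scheme = qp.Rim('gender_rim')
>>> scheme.set_targets({'gender': [49.0, 51.0]})
>>> dataset.weight(scheme, weight_name='weight_a', unique_key='identity')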
-
write_dimensions
(path_mdd=None, path_ddf=None, text_key=None, run=True, clean_up=True, CRLF='CR')¶ Build Dimensions/SPSS Base Professional .ddf/.mdd data pairs.
Note
SPSS Data Collection Base Professional must be installed on the machine. The method creates .mrs and .dms scripts which are executed through the software’s API.
Parameters: - path_mdd (str, default None) – The full path (optionally with extension
'.mdd'
, otherwise assumed as such) for the saved DataSet._meta component. If not provided, the instance’s
name
and path
attributes will be used to determine the file location. - path_ddf (str, default None) – The full path (optionally with extension
'.ddf'
, otherwise assumed as such) for the saved DataSet._data component. If not provided, the instance’s
name
and path
attributes will be used to determine the file location. - text_key (str, default None) – The desired
text_key
for alltext
label information. Uses theDataSet.text_key
information if not provided. - run (bool, default True) – If True, the method will try to run the metadata creating .mrs script and execute a DMSRun for the case data transformation in the .dms file.
- clean_up (bool, default True) – By default, all helper files from the conversion (.dms, .mrs, paired .csv files, etc.) will be deleted after the process has finished.
Returns: Return type: A .ddf/.mdd pair is saved at the provided path location.
-
write_quantipy
(path_meta=None, path_data=None)¶ Write the data and meta components to .csv/.json files.
The resulting files are well-defined native Quantipy source files.
Parameters: - path_meta (str, default None) – The full path (optionally with extension
'.json'
, otherwise assumed as such) for the saved DataSet._meta component. If not provided, the instance’s
name
and path
attributes will be used to determine the file location. - path_data (str, default None) – The full path (optionally with extension
'.csv'
, otherwise assumed as such) for the saved DataSet._data component. If not provided, the instance’s
name
and path
attributes will be used to determine the file location.
Returns: Return type: A .csv/.json pair is saved at the provided path location.
-
write_spss
(path_sav=None, index=True, text_key=None, mrset_tag_style='__', drop_delimited=True, from_set=None, verbose=True)¶ Convert the Quantipy DataSet into a SPSS .sav data file.
Parameters: - path_sav (str, default None) – The full path (optionally with extension
'.sav'
, otherwise assumed as such) for the saved SPSS .sav file. If not provided, the instance’s
name
and path
attributes will be used to determine the file location. - index (bool, default True) – Should the index be inserted into the dataframe before the conversion happens?
- text_key (str, default None) – The text_key that should be used when taking labels from the
source meta. If the given text_key is not found for any
particular text object, the
DataSet.text_key
will be used instead. - mrset_tag_style (str, default '__') – The delimiting character/string to use when naming dichotomous set variables. The mrset_tag_style will appear between the name of the variable and the dichotomous variable’s value name, as taken from the delimited set value that dichotomous variable represents.
- drop_delimited (bool, default True) – Should Quantipy’s delimited set variables be dropped from the export after being converted to dichotomous sets/mrsets?
- from_set (str) – The set name from which the export should be drawn.
Returns: Return type: A SPSS .sav file is saved at the provided path location.
-
quantify.engine¶
-
class
quantipy.
Quantity
(link, weight=None, base_all=False, ignore_flags=False)¶ The Quantity object is the main Quantipy aggregation engine.
Consists of a link’s data matrix representation and sectional definitions of the weight vector (wv), x-codes section (xsect) and y-codes section (ysect). The instance methods handle creation, retrieval and manipulation of the data input matrices and section definitions as well as the majority of statistical calculations.
-
calc
(expression, axis='x', result_only=False)¶ Compute (simple) aggregation level arithmetics.
-
count
(axis=None, raw_sum=False, cum_sum=False, effective=False, margin=True, as_df=True)¶ Count entries over all cells or per axis margin.
Parameters: - axis ({None, 'x', 'y'}, default None) – When axis is None, the frequency of all cells from the uni- or multivariate distribution is presented. If the axis is specified to be either ‘x’ or ‘y’ the margin per axis becomes the resulting aggregation.
- raw_sum (bool, default False) – If True will perform a simple summation over the cells given the axis parameter. This ignores net counting of qualifying answers in favour of summing over all answers given when considering margins.
- cum_sum (bool, default False) – If True a cumulative sum of the elements along the given axis is returned.
- effective (bool, default False) – If True, compute effective counts instead of traditional (weighted) counts.
- margin (bool, default True) – Controls whether the margins of the aggregation result are shown. This also applies to margin aggregations themselves, since they contain a margin (in form of the total number of cases) as well.
- as_df (bool, default True) – Controls whether the aggregation is transformed into a Quantipy- multiindexed (following the Question/Values convention) pandas.DataFrame or will be left in its numpy.array format.
Returns: Passes a pandas.DataFrame or numpy.array of cell or margin counts to the
result
property.Return type: self
-
exclude
(codes, axis='x')¶ Wrapper for _missingfy(…keep_codes=False, …, keep_base=False, …) Excludes specified codes from aggregation.
-
filter
(condition, keep_base=True, inplace=False)¶ Use a Quantipy conditional expression to filter the data matrix entries.
-
group
(groups, axis='x', expand=None, complete=False)¶ Build simple or logical net vectors, optionally keeping originating codes.
Parameters: - groups (list, dict of lists or logic expression) –
The group/net code definition(s) in form of…
- a simple list:
[1, 2, 3]
- a dict of list:
{'grp A': [1, 2, 3], 'grp B': [4, 5, 6]}
- a logical expression:
not_any([1, 2])
- axis ({
'x'
,'y'
}, default'x'
) – The axis to group codes on. - expand ({None,
'before'
,'after'
}, defaultNone
) – If'before'
, the codes that are grouped will be kept and placed before the grouped aggregation; vice versa for'after'
. Ignored on logical expressions found ingroups
. - complete (bool, default False) – If True, codes that define the Link on the given
axis
but are not present in the
groups
definition(s) will be placed in their natural position within the aggregation, respecting the value of
expand
.
Returns: Return type: None
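A sketch of building overcode-style nets on a Link's Quantity (the Link l and the code groups are assumptions):
>>> q = qp.Quantity(l)
>>> q.group({'Top2': [4, 5], 'Bottom2': [1, 2]}, axis='x', expand='after')
>>> q.count()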
-
limit
(codes, axis='x')¶ Wrapper for _missingfy(…keep_codes=True, …, keep_base=True, …) Restrict the data matrix entries to contain the specified codes only.
-
normalize
(on='y', per_cell=False)¶ Convert a raw cell count result to its percentage representation.
Parameters: - on ({'y', 'x', 'counts_sum', str}, default 'y') – Defines the base to normalize the result on.
'y'
will produce column percentages,'x'
will produce row percentages. It is also possible to use another question’s frequencies to compute rebased percentages providing its name instead. - per_cell (bool, default False) – Compute percentages on a cell-per-cell basis, effectively treating
each categorical row as a base figure on its own. Only possible if the
on
argument does not indicate an axis result ('x', 'y', 'counts_sum'), but instead another variable’s name. The related
xdef
codes collection length must be identical for this to work, otherwise a
ValueError
is raised.
Returns: Updates a count-based aggregation in the
result
property. Return type: self
-
rescale
(scaling, drop=False)¶ Modify the object’s
xdef
property reflecting new value defintions.Parameters: - scaling (dict) – Mapping of old_code: new_code, given as of type int or float.
- drop (bool, default False) – If True, codes not included in the scaling dict will be excluded.
Returns: Return type: self
-
summarize
(stat='summary', axis='x', margin=True, as_df=True)¶ Calculate distribution statistics across the given axis.
Parameters: - stat ({'summary', 'mean', 'median', 'var', 'stddev', 'sem', 'varcoeff', 'min', 'lower_q', 'upper_q', 'max'}, default 'summary') – The measure to calculate. Defaults to a summary output of the most important sample statistics.
- axis ({'x', 'y'}, default 'x') – The axis which is reduced in the aggregation, e.g. column vs. row means.
- margin (bool, default True) – Controls whether statistic(s) of the marginal distribution are shown.
- as_df (bool, default True) – Controls whether the aggregation is transformed into a Quantipy- multiindexed (following the Question/Values convention) pandas.DataFrame or will be left in its numpy.array format.
Returns: Passes a pandas.DataFrame or numpy.array of the descriptive (summary) statistic(s) to the
result
property.Return type: self
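For example, a mean on a 0-100 rescaled axis could be requested like this (a sketch; the Link l, the weight name and the scale mapping are assumptions):
>>> q = qp.Quantity(l, weight='weight_a')
>>> q.rescale({1: 0, 2: 25, 3: 50, 4: 75, 5: 100})
>>> q.summarize(stat='mean').result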
-
swap
(var, axis='x', update_axis_def=True, inplace=True)¶ Change the Quantity’s x- or y-axis keeping filter and weight setup.
All edits and aggregation results will be removed during the swap.
Parameters: - var (str) – New variable’s name used in axis swap.
- axis ({‘x’, ‘y’}, default
'x'
) – The axis to swap. - update_axis_def (bool, default True) – If self is of type
'array'
, the name and item definitions (that are e.g. used in theto_df()
method) can be updated to reflect the swapped axis variable or kept to show the original’s ones. - inplace (bool, default True) – Whether to modify the Quantity inplace or return a new instance.
Returns: swapped
Return type: New Quantity instance with exchanged x- or y-axis.
-
unweight
()¶ Remove any weighting by dividing the matrix by itself.
-
weight
()¶ Weight by multiplying the indicator entries with the weight vector.
-
-
class
quantipy.
Test
(link, view_name_notation, test_total=False)¶ The Quantipy Test object is defined by a Link and the view name notation string of a counts or means view. All auxiliary figures needed to arrive at the test results are computed inside the instance of the object.
-
get_se
()¶ Compute the standard error (se) estimate of the tested metric.
The calculation of the se is defined by the parameters of the setup. The main difference is the handling of variances. unpooled implicitly assumes variance inhomogeneity between the column pairing’s samples. pooled treats variances effectively as equal.
-
get_sig
()¶ TODO: implement returning tstats only.
-
get_statistic
()¶ Returns the test statistic of the algorithm.
-
run
()¶ Performs the testing algorithm and creates an output pd.DataFrame.
The output is indexed according to Quantipy’s Questions->Values convention. Significant results between columns are presented as lists of integer y-axis codes, where the column with the higher value holds the codes of the columns with the lower values. NaN indicates that a cell does not hold any sig. higher values compared to the others.
-
set_params
(test_total=False, level='mid', mimic='Dim', testtype='pooled', use_ebase=True, ovlp_correc=True, cwi_filter=False, flag_bases=None)¶ Sets the test algorithm parameters and defines the type of test.
This method sets the test’s global parameters and derives the necessary measures for the computation of the test statistic. The default values correspond to the SPSS Dimensions Column Tests algorithms that control for bias introduced by weighting and overlapping samples in the column pairs of multi-coded questions.
Note
The Dimensions implementation uses variance pooling.
Parameters: - test_total (bool, default False) – If set to True, the test algorithms will also include an existent total (@-) version of the original link and test against the unconditional data distribution.
- level (str or float, default 'mid') – The level of significance given either as per ‘low’ = 0.1, ‘mid’ = 0.05, ‘high’ = 0.01 or as specific float, e.g. 0.15.
- mimic ({'askia', 'Dim'} default='Dim') – Will instruct the mimicking of a software specific test.
- testtype (str, default 'pooled') – Global definition of the tests.
- use_ebase (bool, default True) – If True, will use the effective sample sizes instead of the simple weighted ones when testing a weighted aggregation.
- ovlp_correc (bool, default True) – If True, will consider and correct for respondent overlap when testing between multi-coded column pairs.
- cwi_filter (bool, default False) – If True, will check an incoming count aggregation for cells that fall below a threshold comparison aggregation that assumes counts to be independent.
- flag_bases (list of two int, default None) – If provided, the output dataframe will replace results that have
been calculated on (eff.) bases below the first int with
'**'
and mark results in columns with bases below the second int with'*'
Returns: Return type: self
-
QuantipyViews¶
-
class
quantipy.
QuantipyViews
(views=None, template=None)¶ A collection of extendable MR aggregation and statistic methods.
View methods are used to generate various numerical or categorical data aggregations. Their behaviour is controlled via
kwargs
.-
coltests
(link, name, kwargs)¶ Will test appropriate views from a Stack for stat. sig. differences.
Tests can be performed on frequency aggregations (generated by
frequency
) and means (fromsummarize
) and will compare all unique column pair combinations.Parameters: - link (Quantipy Link object.) –
- name (str) – The shortname applied to the view.
- kwargs (dict) –
- Specific keyword arguments –
- text (str, optional, default None) – Sets an optional label in the meta component of the view that is used when the view is passed into a Quantipy build (e.g. Excel, Powerpoint).
- metric ({'props', 'means'}, default 'props') – Determines whether a proportion or means test algorithm is performed.
- test_total (bool, default False) – If True, each View’s y-axis column will be tested against the unconditional total of its x-axis.
- mimic ({'Dim', 'askia'}, default 'Dim') – It is possible to mimic the test logics used in other statistical software packages by passing them as instructions. The method will then choose the appropriate test parameters.
- level ({'high', 'mid', 'low'} or float) – Sets the level of significance to which the test is carried out.
Given as str the levels correspond to
'high'
= 0.01,'mid'
= 0.05 and'low'
= 0.1. If a float is passed the specified level will be used. - flags (list of two int, default None) – Base thresholds for Dimensions-like tests, e.g. [30, 100]. First int is minimum base for reported results, second int controls small base indication.
Returns: - None – Adds requested View to the Stack, storing it under the full view name notation key.
- Note – Mimicking the askia software (mimic='askia') restricts the values to be one of 'high', 'low', 'mid'. Any other value passed will make the algorithm fall back to 'low'. Mimicking Dimensions (mimic='Dim') can use either the str or float version.
-
default
(link, name, kwargs)¶ Adds a file meta dependent aggregation to a Stack.
Checks the Link definition against the file meta and produces either a numerical or categorical summary tabulation including the marginal results.
Parameters: - link (Quantipy Link object.) –
- name (str) – The shortname applied to the view.
- kwargs (dict) –
Returns: Adds requested View to the Stack, storing it under the full view name notation key.
Return type: None
-
descriptives
(link, name, kwargs)¶ Adds num. distribution statistics of a Link definition to the Stack.
descriptives
views can apply a range of summary statistics. Measures include statistics of centrality, dispersion and mass.Parameters: - link (Quantipy Link object.) –
- name (str) – The shortname applied to the view.
- kwargs (dict) –
- Specific keyword arguments –
- text (str, optional, default None) – Sets an optional label suffix for the meta component of the view which will be appended to the statistic name and used when the view is passed into a Quantipy build (e.g. Excel, Powerpoint).
- stats (str, default 'mean') – The measure to compute.
- exclude (list of int) – Codes that will not be considered when calculating the result.
- rescale (dict) –
A mapping of {old code: new code}, e.g.:
{ 1: 0, 2: 25, 3: 50, 4: 75, 5: 100 }
- drop (bool) – If
rescale
provides a new scale definition,
drop
will remove all codes that are not transformed. Acts as a shorthand for manually passing any remaining codes in
exclude
.
Returns: Adds requested View to the Stack, storing it under the full view name notation key.
Return type: None
-
frequency
(link, name, kwargs)¶ Adds count-based views on a Link definition to the Stack object.
frequency
is able to compute several aggregates that are based on the count of code values in uni- or bivariate Links. This includes bases / sample sizes, raw or normalized cell frequencies and code summaries like simple and complex nets.
- name (str) – The shortname applied to the view.
- kwargs (dict) –
- Specific keyword arguments –
- text (str, optional, default None) – Sets an optional label in the meta component of the view that is used when the view is passed into a Quantipy build (e.g. Excel, Powerpoint).
- logic (list of int, list of dicts or core.tools.view.logic operation) –
If a list is passed this instructs a simple net of the codes given as int. Multiple nets can be generated via a list of dicts that map names to lists of ints. For complex logical statements, expressions are parsed to identify the qualifying rows in the data. For example:
# simple net
'logic': [1, 2, 3]
# multiple nets/code groups
'logic': [{'A': [1, 2]}, {'B': [3, 4]}, {'C': [5, 6]}]
# code logic
'logic': has_all([1, 2, 3])
- calc (TODO) –
- calc_only (TODO) –
Returns: - None – Adds requested View to the Stack, storing it under the full view name notation key.
- Note – Net codes take into account if a variable is multi-coded. The net will therefore consider qualifying cases and not the raw sum of the frequencies per category, i.e. no multiple counting of cases.
-
Rim¶
-
class
quantipy.
Rim
(name, max_iterations=1000, convcrit=0.01, cap=0, dropna=True, impute_method='mean', weight_column_name=None, total=0)¶ -
add_group
(name=None, filter_def=None, targets=None)¶ Set weight groups using flexible filter and target definitions.
Main method to structure and specify complex weight schemes.
Parameters: - name (str) – Name of the weight group.
- filter_def (str, optional) – An optional filter definition given as a boolean expression in string format. Must be a valid input for the pandas DataFrame.query() method.
- targets (dict) – Dictionary mapping of DataFrame columns to target proportion list.
Returns: Return type: None
-
group_targets
(group_targets)¶ Set inter-group target proportions.
This will scale the weight factors per group to match the desired group proportions and thus effectively change each group’s weighted total number of cases.
Parameters: group_targets (dict) – A dictionary mapping of group names to the desired proportions. Returns: Return type: None
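A sketch of a grouped scheme with region-specific targets and fixed inter-group proportions (all names, filter expressions and figures are hypothetical):
>>> scheme = qp.Rim('regional_rim')
>>> scheme.add_group(name='north', filter_def='region == 1', targets={'gender': [49.0, 51.0]})
>>> scheme.add_group(name='south', filter_def='region == 2', targets={'gender': [52.0, 48.0]})
>>> scheme.group_targets({'north': 40.0, 'south': 60.0})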
-
report
(group=None)¶ TODO: Docstring
-
set_targets
(targets, group_name=None)¶ Quickly set simple weight targets, optionally assigning a group name.
Parameters: - targets (dict or list of dict) – Dictionary mapping of DataFrame columns to target proportion list.
- group_name (str, optional) – A name for the simple weight (group) created.
Returns: Return type: None
-
validate
()¶ Summary on scheme target variables to detect and handle missing data.
Returns: df – A summary of missing entries and (rounded) mean/mode/median of value codes per target variable. Return type: pandas.DataFrame
-
Stack¶
-
class
quantipy.
Stack
(name='', add_data=None)¶ Container of quantipy.Link objects holding View objects.
A Stack is nested dictionary that structures the data and variable relationships storing all View aggregations performed.
-
add_data
(data_key, data=None, meta=None)¶ Sets the data_key into the stack, optionally mapping data sources to it.
It is possible to handle the mapping of data sources in different ways:
- no meta or data (for proxy links not connected to source data)
- meta only (for proxy links with supporting meta)
- data only (meta will be inferred if possible)
- data and meta
Parameters: - data_key (str) – The reference name for a data source connected to the Stack.
- data (pandas.DataFrame) – The input (case) data source.
- meta (dict or OrderedDict) – A quantipy compatible metadata source that describes the case data.
Returns: Return type: None
-
add_link
(data_keys=None, filters=['no_filter'], x=None, y=None, views=None, weights=None, variables=None)¶ Add Link and View definitions to the Stack.
The method can be used flexibly: It is possible to pass only Link definitions that might be composed of filter, x and y specifications, only views incl. weight variable selections or arbitrary combinations of the former.
TODO: Remove
variables
from parameter list and method calls.Parameters: - data_keys (str, optional) – The data_key to be added to. If none is given, the method will try to add to all data_keys found in the Stack.
- filters (list of str describing filter definitions, default ['no_filter']) – The string must be a valid input for the pandas.DataFrame.query() method.
- x, y – The x and y variables to construct Links from.
- views (list of view method names.) – Can be any of Quantipy’s preset Views or the names of created view method specifications.
- weights (list, optional) – The names of weight variables to consider in the data aggregation
process. Weight variables must be of type
float
.
Returns: Return type: None
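A minimal sketch of populating a Stack from an existing DataSet and requesting some of the preset views ('cbase', 'counts', 'c%'); the data key and variable names are hypothetical and '@' requests the total column:
>>> import quantipy as qp
>>> stack = qp.Stack(name='example')
>>> stack.add_data('mydata', data=dataset._data, meta=dataset._meta)
>>> stack.add_link(data_keys='mydata', x=['q1'], y=['@', 'gender'], views=['cbase', 'counts', 'c%'])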
-
add_nets
(on_vars, net_map, expand=None, calc=None, rebase=None, text_prefix='Net:', checking_cluster=None, _batches='all', recode='auto', mis_in_rec=False, verbose=True)¶ Add a net-like view to a specified collection of x keys of the stack.
Parameters: - on_vars (list) – The list of x variables to add the view to.
- net_map (list of dicts) –
The listed dicts must map the net/band text label to lists of categorical answer codes to group together, e.g.:
>>> [{'Top3': [1, 2, 3]}, ... {'Bottom3': [4, 5, 6]}] It is also possible to provide enumerated net definition dictionaries that are explicitly setting ``text`` metadata per ``text_key`` entries:
>>> [{1: [1, 2], 'text': {'en-GB': 'UK NET TEXT', ... 'da-DK': 'DK NET TEXT', ... 'de-DE': 'DE NET TEXT'}}]
- expand ({'before', 'after'}, default None) – If provided, the view will list the net-defining codes after or before the computed net groups (i.e. “overcode” nets).
- calc (dict, default None) –
A dictionary that is attaching a text label to a calculation expression using the the net definitions. The nets are referenced as per ‘net_1’, ‘net_2’, ‘net_3’, … . Supported calculation expressions are add, sub, div, mul. Example:
>>> {'calc': ('net_1', add, 'net_2'), 'text': {'en-GB': 'UK CALC LAB', ... 'da-DK': 'DA CALC LAB', ... 'de-DE': 'DE CALC LAB'}}
- rebase (str, default None) – Use another variable’s margin value vector for column percentage computation.
- text_prefix (str, default 'Net:') – By default each code grouping/net will have its
text
label prefixed with ‘Net: ‘. Toggle by passing None (or an empty str, ‘’). - checking_cluster (quantipy.Cluster, default None) – When provided, an automated checking aggregation will be added to the
Cluster
instance. - _batches (str or list of str) – Views are only added to
qp.Links
that are defined in the given
qp.Batch
instance(s). - recode ({‘extend_codes’, ‘drop_codes’, ‘collect_codes’, ‘collect_codes@cat_name’}, default ‘auto’) – Adds a variable with the nets as codes to the DataSet/Stack. If ‘extend_codes’, codes are extended with nets. If ‘drop_codes’, the new variable only contains nets as codes. If ‘collect_codes’ or ‘collect_codes@cat_name’, the variable contains the nets and another category that summarises all codes which are not included in any net. If no cat_name is provided, ‘Other’ is taken as the default.
- mis_in_rec (bool, default False) – Skip or include codes that are defined as missing when recoding from net definition.
Returns: The stack instance is modified inplace.
Return type: None
-
add_stats
(on_vars, stats=['mean'], other_source=None, rescale=None, drop=True, exclude=None, factor_labels=True, custom_text=None, checking_cluster=None, _batches='all', recode=False, verbose=True)¶ Add a descriptives view to a specified collection of xks of the stack.
Valid descriptives views: {‘mean’, ‘stddev’, ‘min’, ‘max’, ‘median’, ‘sem’}
Parameters: - on_vars (list) – The list of x variables to add the view to.
- stats (list of str, default
['mean']
) – The metrics to compute and add as a view. - other_source (str) – If provided the Link’s x-axis variable will be swapped with the (numerical) variable provided. This can be used to attach statistics of a different variable to a Link definition.
- rescale (dict) – A dict that maps old to new codes, e.g. {1: 5, 2: 4, 3: 3, 4: 2, 5: 1}
- drop (bool, default True) – If
rescale
is provided all codes that are not mapped will be ignored in the computation. - exclude (list) – Codes/values to ignore in the computation.
- factor_labels (bool / str, default True) – Writes the (rescaled) factor values next to the category text label. If True, square-brackets are used. If ‘()’, normal brackets are used.
- custom_text (str, default None) – A custom string affix to put at the end of the requested statistics’ names.
- checking_cluster (quantipy.Cluster, default None) – When provided, an automated checking aggregation will be added to the
Cluster
instance. - _batches (str or list of str) – Views are only added to
qp.Links
that are defined in the given
qp.Batch
instance(s). - recode (bool, default False) – Create a new variable that contains only the values which are needed for the stat computation. The values and the included data will be rescaled.
Returns: The stack instance is modified inplace.
Return type: None
-
add_tests
(_batches='all', verbose=True)¶ Apply coltests for selected batches.
Sig. Levels are taken from
qp.Batch
definitions. Parameters: _batches (str or list of str) – Views are only added to
qp.Links
that are defined in the given
qp.Batch
instance(s). Returns: Return type: None
-
aggregate
(views, unweighted_base=True, categorize=[], batches='all', xs=None, bases={}, verbose=True)¶ Add views to all defined
qp.Link
inqp.Stack
.Parameters: - views (str or list of str or qp.ViewMapper) –
views
that are added. - unweighted_base (bool, default True) – If True, unweighted ‘cbase’ is added to all non-arrays. This parameter will be deprecated in future, please use bases instead.
- categorize (str or list of str) – Determines how numerical data is handled: If provided, the
variables will get counts and percentage aggregations
(
'counts'
,'c%'
) alongside the'cbase'
view. If False, only'cbase'
views are generated for non-categorical types. - batches (str/ list of str, default 'all') – Name(s) of
qp.Batch
instance(s) that are used to aggregate theqp.Stack
- xs (list of str) – Names of variables for which views are added.
- bases (dict) – Defines which bases should be aggregated, weighted or unweighted.
Returns: Return type: None, modify
qp.Stack
inplace- views (str or list of str or qp.ViewMapper) –
-
apply_meta_edits
(batch_name, data_key, filter_key=None, freeze=False)¶ Take over meta_edits from Batch definitions.
Parameters: - batch_name (str) – Name of the Batch whose meta_edits are taken.
- data_key (str) – Accessing this metadata:
self[data_key].meta
Batch definitions are taken from here and this metadata is modified. - filter_key (str, default None) – Currently not implemented!
Accessing this metadata:
self[data_key][filter_key].meta
Batch definitions are taken from here and this metadata is modified.
-
cumulative_sum
(on_vars, _batches='all', verbose=True)¶ Add cumulative sum view to a specified collection of xks of the stack.
Parameters: - on_vars (list) – The list of x variables to add the view to.
- _batches (str or list of str) – Views are only added to
qp.Links
that are defined in the given
qp.Batch
instance(s).
Returns: The stack instance is modified inplace.
Return type: None
-
describe
(index=None, columns=None, query=None, split_view_names=False)¶ Generates a structured overview of all Link defining Stack elements.
Parameters: - index, columns (optional) – Controls the output representation by structuring a pivot-style table according to the index and column values.
- query (str) – A query string that is valid for the pandas.DataFrame.query() method.
- split_view_names (bool, default False) – If True, will create an output of unique view name notations split up into their components.
Returns: description – DataFrame summing the Stack’s structure in terms of Links and Views.
Return type: pandas.DataFrame
-
freeze_master_meta
(data_key, filter_key=None)¶ Save
.meta
in.master_meta
for a defined data_key.Parameters: - data_key (str) – Using:
self[data_key]
- filter_key (str, default None) – Currently not implemented!
Using:
self[data_key][filter_key]
- data_key (str) – Using:
-
static
from_sav
(data_key, filename, name=None, path=None, ioLocale='en_US.UTF-8', ioUtf8=True)¶ Creates a new stack instance from a .sav file.
Parameters: - data_key (str) – The data_key for the data and meta in the sav file.
- filename (str) – The name of the sav file.
- name (str) – A name for the sav (stored in the meta).
- path (str) – The path to the sav file.
- ioLocale (str) – The locale used during the sav processing.
- ioUtf8 (bool) – Boolean that indicates the mode in which text communicated to or from the I/O module will be handled.
Returns: stack – A stack instance that has a data_key with data and metadata to run aggregations.
Return type: stack object instance
-
static
load
(path_stack, compression='gzip', load_cache=False)¶ Load Stack instance from .stack file.
Parameters: - path_stack (str) – The full path to the .stack file that should be loaded, including the extension.
- compression ({'gzip'}, default 'gzip') – The compression type that has been used saving the file.
- load_cache (bool, default False) – Loads the MatrixCache into the Stack if a .cache file is found.
Returns: Return type: None
-
static
recode_from_net_def
(dataset, on_vars, net_map, expand, recode='auto', text_prefix='Net:', mis_in_rec=False, verbose=True)¶ Create variables from net definitions.
-
reduce
(data_keys=None, filters=None, x=None, y=None, variables=None, views=None)¶ Remove keys from the matching levels, erasing discrete Stack portions.
Parameters: data_keys, filters, x, y, variables, views – The keys to remove from the matching Stack levels. Returns: Return type: None
-
refresh
(data_key, new_data_key='', new_weight=None, new_data=None, new_meta=None)¶ Re-run all or a portion of Stack’s aggregations for a given data key.
refresh() can be used to re-weight the data using a new case data weight variable or to re-run all aggregations based on a changed source data version (e.g. after cleaning the file/ dropping cases) or a combination of both.
Note
Currently this is only supported for the preset QuantipyViews(), namely:
'cbase'
,'rbase'
,'counts'
,'c%'
,'r%'
,'mean'
,'ebase'
.Parameters: - data_key (str) – The Links’ data key to be modified.
- new_data_key (str, default '') – Controls if the existing data key’s files and aggregations will be overwritten or stored via a new data key.
- new_weight (str) – The name of a new weight variable used to re-aggregate the Links.
- new_data (pandas.DataFrame) – The case data source. If None is given, the original case data found for the data key will be used.
- new_meta (quantipy meta document) – A meta data source associated with the case data. If None is given, the original meta definition found for the data key will be used.
Returns: Return type: None
-
remove_data
(data_keys)¶ Deletes the data_key(s) and associated data specified in the Stack.
Parameters: data_keys (str or list of str) – The data keys to remove. Returns: Return type: None
-
restore_meta
(data_key, filter_key=None)¶ Restore the
.master_meta
for a defined data_key if it exists.Undo self.apply_meta_edits()
Parameters: - data_key (str) – Accessing this metadata:
self[data_key].meta
- filter_key (str, default None) – Currently not implemented!
Accessing this metadata:
self[data_key][filter_key].meta
- data_key (str) – Accessing this metadata:
-
save
(path_stack, compression='gzip', store_cache=True, decode_str=False, dataset=False, describe=False)¶ Save Stack instance to .stack file.
Parameters: - path_stack (str) – The full path to the .stack file that should be created, including the extension.
- compression ({'gzip'}, default 'gzip') – The intended compression type.
- store_cache (bool, default True) – Stores the MatrixCache in a file in the same location.
- decode_str (bool, default=False) – If True the unicoder function will be used to decode all str objects found anywhere in the meta document/s.
- dataset (bool, default=False) – If True a json/csv will be saved parallel to the saved stack for each data key in the stack.
- describe (bool, default=False) – If True the result of stack.describe().to_excel() will be saved parallel to the saved stack.
Returns: Return type: None
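A sketch of a round trip to disk (the path is hypothetical; load() is a static method):
>>> stack.save('./example.stack', store_cache=True)
>>> stack = qp.Stack.load('./example.stack', load_cache=True)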
-
variable_types
(data_key, only_type=None, verbose=True)¶ Group variables by data types found in the meta.
Parameters: - data_key (str) – The reference name of a case data source hold by the Stack instance.
- only_type ({'int', 'float', 'single', 'delimited set', 'string', 'date', 'time', 'array'}, optional) – Will restrict the output to the given data type.
Returns: types – A summary of variable names mapped to their data types, in form of {type_name: [variable names]} or a list of variable names confirming only_type.
Return type: dict or list of str
-
View¶
-
class
quantipy.
View
(link=None, name=None, kwargs=None)¶ -
get_edit_params
()¶ Provides the View’s Link edit kwargs with fallbacks to default values.
Returns: edit_params – A tuple of kwargs controlling the following supported Link data edits: logic, calc, … Return type: tuple
-
get_std_params
()¶ Provides the View’s standard kwargs with fallbacks to default values.
Returns: std_parameters – A tuple of the common kwargs controlling the general View method behaviour: axis, relation, rel_to, weights, text Return type: tuple
-
has_other_source
()¶ Tests if the View is generated with a swapped x-axis.
-
is_base
()¶ Tests if the View is a base size aggregation.
-
is_counts
()¶ Tests if the View is a count representation of a frequency.
-
is_cumulative
()¶ Tests if the View is a cumulative frequency.
-
is_meanstest
()¶ Tests if the View is a statistical test of differences in means.
-
is_net
()¶ Tests if the View is a code group/net aggregation.
-
is_pct
()¶ Tests if the View is a percentage representation of a frequency.
-
is_propstest
()¶ Tests if the View is a statistical test of differences in proportions.
-
is_stat
()¶ Tests if the View is a sample statistic.
-
is_sum
()¶ Tests if the View is a plain sum aggregation.
-
is_weighted
()¶ Tests if the View is performed on weighted data.
-
meta
()¶ Get a summary on a View’s meta information.
Returns: viewmeta – A dictionary that contains global aggregation information. Return type: dict
-
missing
()¶ Returns any excluded value codes.
-
nests
()¶ Slice a nested
View.dataframe
into its innermost column sections.
-
notation
(method, condition)¶ Generate the View’s Stack key notation string.
Parameters: aggname, shortname, relation – Strings for the aggregation name, the method’s shortname and the relation component of the View notation. Returns: notation – The View notation. Return type: str
-
rescaling
()¶ Returns the rescaling specification of value codes.
-
spec_condition
(link, conditionals=None, expand=None)¶ Updates the View notation’s condition component based on agg. details.
Parameters: link (Link) – Returns: relation_string – The relation part of the View name notation. Return type: str
-
weights
()¶ Returns the weight variable name used in the aggregation.
-
ViewMapper¶
-
class
quantipy.
ViewMapper
(views=None, template=None)¶ Applies View computation results to Links based on the view method’s kwargs, handling the coordination and structuring side of the aggregation process.
-
add_method
(name=None, method=None, kwargs={}, template=None)¶ Add a method to the instance of the ViewMapper.
Parameters: - name (str) – The short name of the View.
- method (view method) – The view method that will be used to derive the result.
- kwargs (dict) – The keyword arguments needed by the view method.
- template (dict) – A ViewMapper template that contains information on view method and kwargs values to iterate over.
Returns: Updates the ViewMapper instance with a new method definition.
Return type: None
-
make_template
(method, iterators=None)¶ Generate a view method template that cycles through kwargs values.
Parameters: - method ({'frequency', 'descriptives', 'coltests'}) – The baseline view method to be used.
- iterators (dict) – A dictionary mapping of view method kwargs to lists of values.
Returns: Sets the template inside ViewMapper instance.
Return type: None
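A sketch of how a custom net view could be set up and applied, assuming the generated template supplies the view method for subsequently added methods and that Batch definitions exist for the aggregation step; the view name and kwargs are made up:
>>> vm = qp.ViewMapper()
>>> vm.make_template('frequency')
>>> vm.add_method(name='top2', kwargs={'logic': [4, 5], 'text': 'Top 2 box'})
>>> stack.aggregate(views=vm, batches='all')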
-
subset
(views, strict_selection=True)¶ Copy ViewMapper instance retaining only the View names provided.
Parameters: - views (list of str) – The selection of View names to keep.
- strict_selection (bool, default True) – TODO
Returns: subset
Return type: ViewMapper instance
-
Quantipy: Python survey data toolkit¶
Quantipy is an open-source data processing, analysis and reporting software project that builds on the excellent pandas and numpy libraries. Aimed at social and marketing research survey data, Quantipy offers support for native handling of special data types like multiple choice variables, statistical analysis using case or observation weights, dataset metadata and customizable reporting exports.
Note
We are currently moving and reorganizing our documentation. Sorry for the lack of up-to-date information.
Key features¶
- Reads plain .csv, converts from Dimensions, SPSS, Decipher, or Ascribe
- Open metadata format to describe and manage datasets
- Powerful, metadata-driven cleaning, editing, recoding and transformation of datasets
- Computation and assessment of data weights
- Easy-to-use analysis interface
- Automated data aggregation using
Batch
definitions - Structured analysis and reporting via Chain and Cluster containers
- Export to SPSS, Dimensions ddf/mdd, table spreadsheets and chart decks
- Contributors
Kerstin Müller, Alexander Buchhammer, Alasdair Eaglestone, James Griffiths
Birgir Hrafn Sigurðsson and Geir Freysson of datasmoothie