Weight scheme setup¶

Using the `Rim` class¶

The Rim object’s purpose is to define the required setup of the weighting process, i.e. the weight scheme that should be used to compute the actual factor results per case in the dataset. While its main purpose is to provide a simple interface to structure weight schemes of all complexities, it also offers advanced options that control the underlying weighting algorithm itself and thus might impact the results.

To start working with a Rim object, we only need to think of a name for our scheme:

>>> scheme = qp.Rim('my_first_scheme')

Target distributions¶

A major and (probably the most important) step in specifying a weight scheme is mapping the desired target population proportions to the categories of the related variables inside the data. This is done via a dict mapping.

For example, to equally weight female and male respondents in our sample, we simply define:

>>> gender_targets = {}
>>> gender_targets['gender'] = {1: 50.0, 2: 50.0}
>>> gender_targets
{'gender': {1: 50.0, 2: 50.0}}

Since we are normally dealing with multiple variables at once, we collect them in a list, adding other variables naturally in the same way:

>>> dataset.band('age', [(19, 25), (26-35), (36, 49)])
>>> age_targets = {'age_banded': {1: 45.0, 2: 29.78, 3: 25.22}}
>>> all_targets = [gender_targets, age_targets]

The set_targets() method can now use the all_targets list to apply the target distributions to the Rim weight scheme setup (we are also providing an optional name for our group of variables) .

>>> scheme.set_targets(targets=all_targets, group_name='basic weights')

The Rim instance also allows inspecting these targets from itself now (you can see the group_name parameter reflected here, it would fall back to '_default_name_' if none was provided):

>>> scheme.groups['basic weights']['targets']
[{'gender': {1: 50.0, 2: 50.0}}, {'age_banded': {1: 45.0, 2: 29.78, 3: 25.22}}]

Weight groups and filters¶

For more elaborate weight schemes, we are instead using the add_group() method which is effectively a generalized version of set_targets() that supports addressing subsets of the data by filtering. For example, differing target distributions (or even the scheme defining variables of interest) might be required across several market segments or between survey periods.

We can illustrate this using the variable 'Wave' from the dataset:

>>> dataset.crosstab('Wave', text=True, pct=True)
Question          Wave. Wave
Values                     @
Question   Values
Wave. Wave All         100.0
           Wave 1       19.6
           Wave 2       20.2
           Wave 3       20.5
           Wave 4       19.8
           Wave 5       19.9

Let’s assume we want to use the original targets for the first three waves but the remaining two waves need to reflect some changes in both gender and the age distributions. We first define a new set of targets that should apply only to the waves 4 and 5:

gender_targets_2 = {'gender': {1: 30.0, 2: 70.0}}
age_targets_2 = {'age_banded': {1: 35.4, 2: 60.91, 3: 3.69}}
all_targets_2 = [gender_targets_2, age_targets_2]

We then set the filter expressions for the respective subsets of the data, as per:

>>> filter_wave1 = 'Wave == 1'
>>> filter_wave2 = 'Wave == 2'
>>> filter_wave3 = 'Wave == 3'
>>> filter_wave4 = 'Wave == 4'
>>> filter_wave5 = 'Wave == 5'

And add our weight specifications accordingly:

>>> scheme = qp.Rim('my_complex_scheme')
>>> scheme.add_group(name='wave 1', filter_def=filter_wave1, targets=all_targets)
>>> scheme.add_group(name='wave 2', filter_def=filter_wave2, targets=all_targets)
>>> scheme.add_group(name='wave 3', filter_def=filter_wave3, targets=all_targets)
>>> scheme.add_group(name='wave 4', filter_def=filter_wave4, targets=all_targets_2)
>>> scheme.add_group(name='wave 5', filter_def=filter_wave5, targets=all_targets_2)

Note

For historical reasons, the logic operators currently do not work within the Rim module. This means that all filter definitions need to be valid string expressions suitable for the pandas.DataFrame.query() method. We are planning to abandon this limitation as soon as possible to enable easier and more complex filters that are consistent with the rest of the library.

Setting group targets¶

At this stage it might also be needed to balance out the survey waves themselves in a certain way, e.g. make each wave count exactly the same (as you can see above each wave accounts for roughly 20% of the full sample but not quite exactly).

With Rim.group_targets() we can apply an outer weighting to the between group distribution while keeping the already set inner target proportions within each of them. Again we are using a dict, this time mapping the group names from above to the desired outcome percentages:

>>> group_targets = {'wave 1': 20.0,
...                  'wave 2': 20.0,
...                  'wave 3': 20.0,
...                  'wave 4': 20.0,
...                  'wave 5': 20.0}

>>> scheme.group_targets(group_targets)

To sum it up: Our weight scheme consists of five groups based on 'Wave' that resp. need to match two different sets of target distributions on the 'gender' and 'age_banded' variables with each group coming out as 20% of the full sample.

Integration within `DataSet`¶

The computational core of the weighting algorithm is the quantipy.core.weights.rim.Rake class which can be accessed by working with qp.WeightEngine(), but it is much easier to directly use the DataSet.weight() method. Its full signature looks as follows:

DataSet.weight(weight_scheme,
               weight_name='weight',
               unique_key='identity',
               subset=None,
               report=True,
               path_report=None,
               inplace=True,
               verbose=True)

Weighting and weighted aggregations¶

As can been seen, we can simply provide our weight scheme Rim instance to the method. Since the dataset already contains a variable called 'weight' (and we do not want to overwrite that one) we set weight_name to be 'weights_new'. We also need to set unique_key='unique_id' as that is our identifying key variable (that is needed to map the weight factors back into our dataset):

>>> dataset.weight(scheme, weight_name='weights_new', unique_key='unique_id')

Before we take a look at the report that is printed (because of report=True), we want to manually check our results. For that, we can simply analyze some cross- tabulations, weighted by our new weights! For a start, we check if we arrived at the desired proportions for 'gender' and 'age_banded' per 'Wave':

>>> dataset.crosstab(x='gender', y='Wave', w='weights_new', pct=True)
Question                            Wave. Wave
Values                                     All Wave 1 Wave 2 Wave 3 Wave 4 Wave 5
Question                     Values
gender. What is your gender? All         100.0  100.0  100.0  100.0  100.0  100.0
                             Male         42.0   50.0   50.0   50.0   30.0   30.0
                             Female       58.0   50.0   50.0   50.0   70.0   70.0

>>> dataset.crosstab(x='age_banded', y='Wave', w='weights_new', pct=True,
...                  decimals=2)
Question               Wave. Wave
Values                        All  Wave 1  Wave 2  Wave 3  Wave 4  Wave 5
Question        Values
age_banded. Age All        100.00  100.00  100.00  100.00  100.00  100.00
                19-25       41.16   45.00   45.00   45.00   35.40   35.40
                26-35       42.23   29.78   29.78   29.78   60.91   60.91
                36-49       16.61   25.22   25.22   25.22    3.69    3.69

Both results accurately reflect the desired proportions from our scheme. And we can also verify the weighted distribution of 'Wave', now completely balanced:

>>> dataset.crosstab(x='Wave', w='weights_new', pct=True)
Question          Wave. Wave
Values                     @
Question   Values
Wave. Wave All         100.0
           Wave 1       20.0
           Wave 2       20.0
           Wave 3       20.0
           Wave 4       20.0
           Wave 5       20.0

The isolated weight dataframe¶

By default, the weighting operates inplace, i.e. the weight vector will be placed into the DataSet instance as a regular columns element:

>>> dataset.meta('weights_new')
                                     float
weights_new: my_first_scheme weights   N/A

>>> dataset['weights_new'].head()
   unique_id  weights_new
   402891     0.885593
 27541022     1.941677
   335506     0.984491
 22885610     1.282057
   229122     0.593834

It is also possible to return a new pd.DataFrame that contains all relevant Rim scheme variables incl. the factor vector for external use cases or further analysis:

>>> wdf = dataset.weight(scheme, weight_name='weights_new', unique_key='unique_id',
                         inplace=False)
>>> wdf.head()
   unique_id  gender  age_banded  weights_new  Wave
0     402891       1         1.0     0.885593     4
1   27541022       2         1.0     1.941677     1
2     335506       1         2.0     0.984491     3
3   22885610       1         2.0     1.282057     5
4     229122       1         3.0     0.593834     1