Custom data recoding

The recode() method in detail

This function takes a mapper of {key: logic} entries and injects the key into the target column where its paired logic is True. The logic may be arbitrarily complex and may refer to any other variable or variables in data. Where a pre-existing column has been used to start the recode, the injected values can replace or be appended to any data found there to begin with. Note that this function does not edit the target column, it returns a recoded copy of the target column. The recoded data will always comply with the column type indicated for the target column according to the meta.

method:recode(target, mapper, default=None, append=False, intersect=None, initialize=None, fillna=None, inplace=True)

target

target controls which column meta should be used to control the result of the recode operation. This is important because you cannot recode multiple responses into a ‘single’-typed column.

The target column must already exist in meta.

The recode function is effectively a request to return a copy of the target column, recoded as instructed. recode does not edit the target column in place, it returns a recoded copy of it.

If the target column does not already exist in data then a new series, named accordingly and initialized with np.NaN, will begin the recode.

Return a recoded version of the column radio_stations_xb edited based on the given mapper:

>>> recoded = recode(
...     meta, data,
...     target='radio_stations_xb',
...     mapper=mapper
... )

By default, recoded data resulting from the the mapper will replace any data already sitting in the target column (on a cell-by-cell basis).

mapper

A mapper is a dict of {value: logic} entries where value represents the data that will be injected for cases where the logic is True.

Here’s a simplified example of what a mapper looks like:

>>> mapper = {
...     1: logic_A,
...     2: logic_B,
...     3: logic_C,
... }

1 will be generated where logic_A is True, 2 where logic_B is True and 3 where logic_C is True.

The recode function, by referencing the type indicated by the meta, will manage the complications involved in single vs delimited set data.

>>> mapper = {
...     901: {'radio_stations': frange('1-13')},
...     902: {'radio_stations': frange('14-20')},
...     903: {'radio_stations': frange('21-25')}
... }

This means: inject 901 if the column radio_stations has any of the values 1-13, 902 where radio_stations has any of the values 14-20 and 903 where radio_stations has any of the values 21-25.

default

If you had lots of values to generate from the same reference column (say most/all of them were based on radio_stations) then we can omit the wildcard logic format and use recode’s default parameter.

>>> recoded = recode(
...     meta, data,
...     target='radio_stations_xb',
...     mapper={
...         901: frange('1-13'),
...         902: frange('14-20'),
...         903: frange('21-25')
...     },
...     default='radio_stations'
... )

This means, all unkeyed logic will default to be keyed to radio_stations. In this case the three codes 901, 902 and 903 will be generated based on the data found in radio_stations.

You can combine this with reference to other columns, but you can only provide one default column.

>>> recoded = recode(
...     meta, data,
...     target='radio_stations_xb',
...     mapper={
...         901: frange('1-13'),
...         902: frange('14-20'),
...         903: frange('21-25'),
...         904: {'age': frange('18-34')}
...     },
...     default='radio_stations'
... )

Given that logic can be arbitrarily complicated, mappers can be as well. You’ll see an example of a mapper that recodes a segmentation in Example 4, below.

append

If you want the recoded data to be appended to whatever may already be in the target column (this is only applicable for ‘delimited set’-typed columns), then you should use the append parameter.

>>> recoded = recode(
...     meta, data,
...     target='radio_stations_xb',
...     mapper=mapper,
...     append=True
... )

The precise behaviour of the append parameter can be seen in the following examples.

Given the following data:

>>> df['radio_stations_xb']
1    6;7;9;13;
2          97;
3          97;
4    13;16;18;
5         2;6;
Name: radio_stations_xb, dtype: object

We generate a recoded value of 901 if any of the values 1-13 are found. With the default append=False behaviour we will return the following:

>>> target = 'radio_stations_xb'
>>> recode(meta, data, target, mapper)
1    901;
2     97;
3     97;
4    901;
5    901;
Name: radio_stations_xb, dtype: object

However, if we instead use append=True, we will return the following:

>>> target = 'radio_stations_xb'
>>> recode(meta, data, target, mapper, append=True)
1    6;7;9;13;901;
2              97;
3              97;
4    13;16;18;901;
5         2;6;901;
Name: radio_stations_xb, dtype: object

intersect

One way to help simplify complex logical conditions, especially when they are in some way repetitive, is to use intersect, which accepts any logical statement and forces every condition in the mapper to become the intersection of both it and the intersect condition.

For example, we could limit our recode to males by giving a logical condition to that effect to intersect:

>>> recoded = recode(
...     meta, data,
...     target='radio_stations_xb',
...     mapper={
...         901: frange('1-13'),
...         902: frange('14-20'),
...         903: frange('21-25'),
...         904: {'age': frange('18-34')}
...     },
...     default='radio_stations',
...     intersect={'gender': [1]}
... )

initialize

You may also initialize your copy of the target column as part of your recode operation. You can initalize with either np.NaN (to overwrite anything that may already be there when your recode begins) or by naming another column. When you name another column a copy of the data from that column is used to initialize your recode.

Initialization occurs before your recode.

>>> recoded = recode(
...     meta, data,
...     target='radio_stations_xb',
...     mapper={
...         901: frange('1-13'),
...         902: frange('14-20'),
...         903: frange('21-25'),
...         904: {'age': frange('18-34')}
...     },
...     default='radio_stations',
...     initialize=np.NaN
... )
>>> recoded = recode(
...     meta, data,
...     target='radio_stations_xb',
...     mapper={
...         901: frange('1-13'),
...         902: frange('14-20'),
...         903: frange('21-25'),
...         904: {'age': frange('18-34')}
...     },
...     default='radio_stations',
...     initialize='radio_stations'
... )

fillna

You may also provide a fillna value that will be used as per pd.Series.fillna() after the recode has been performed.

>>> recoded = recode(
...     meta, data,
...     target='radio_stations_xb',
...     mapper={
...         901: frange('1-13'),
...         902: frange('14-20'),
...         903: frange('21-25'),
...         904: {'age': frange('18-34')}
...     },
...     default='radio_stations',
...     initialize=np.NaN,
...     fillna=99
... )

Custom recode examples

Building a net code

Here’s an example of copying an existing question and recoding onto it a net code.

Create the new metadata:

>>> meta['columns']['radio_stations_xb'] = copy.copy(
...     meta['columns']['radio_stations']
... )
>>> meta['columns']['radio_stations_xb']['values'].append(
...     {
...         "value": 901,
...         "text": {"en-GB": "NET: Listened to radio in past 30 days"}
...     }
... )

Initialize the new column. In this case we’re starting with a copy of the radio_stations column:

>>> data['radio_stations_xb'] = data['radio_stations'].copy()

Recode the new column by appending the code 901 to it as indicated by the mapper:

>>> data['radio_stations_xb'] = recode(
...     meta, data,
...     target='radio_stations_xb',
...     mapper={
...         901: {'radio_stations': frange('1-23, 92, 94, 141')}
...     },
...     append=True
... )

Check the result:

>>> data[['radio_stations', 'radio_stations_xb']].head(20)
   radio_stations radio_stations_cb
0              5;            5;901;
1             97;               97;
2             97;               97;
3             97;               97;
4             97;               97;
5              4;            4;901;
6             11;           11;901;
7              4;            4;901;
8             97;               97;
9             97;               97;
10            97;               97;
11            92;           92;901;
12            97;               97;
13       1;13;17;      1;13;17;901;
14             6;            6;901;
15      1;5;6;10;     1;5;6;10;901;
16             6;            6;901;
17        2;4;16;       2;4;16;901;
18          6;10;         6;10;901;
19             6;            6;901;

Create-and-fill

Here’s an example where the value 1 is generated based on some logic and then all remaining cases are given the value 2 using the pandas.Series.fillna() method.

Create the new metadata

>>> meta['columns']['age_xb'] = {
...     'type': 'single',
...     'text': {'en-GB': 'Age'},
...     'values': [
...         {'value': 1, 'text': {'en-GB': '16-25'}},
...         {'value': 2, 'text': {'en-GB': 'Others'}}
...     ]
...     }

Initialize the new column:

>>> data['age_xb'] = np.NaN

Recode the new column:

>>> data['age_xb'] = recode(
...     meta, data,
...     target='age_xb',
...     mapper={
...         1: {'age': frange('16-40')}
...     }
... )

Fill all cases that are still empty with the value 2:

>>> data['age_xb'].fillna(2, inplace=True)

Check the result:

>>> data[['age', 'age_xb']].head(20)
    age  age_grp_rc
0    22           1
1    68           2
2    32           1
3    44           2
4    33           1
5    52           2
6    54           2
7    44           2
8    62           2
9    49           2
10   64           2
11   73           2
12   43           2
13   28           1
14   66           2
15   39           1
16   51           2
17   50           2
18   77           2
19   42           2

Numerical banding

Here’s a typical example of recoding age into custom bands.

In this case we’re using list comprehension to generate the first ten values objects and then concatenate that with a final ‘65+’ value object which doesn’t folow the same label format.

Create the new metadata:

>>> meta['columns']['age_xb_1'] = {
...     'type': 'single',
...     'text': {'en-GB': 'Age'},
...     'values': [
...         {
...             'value': i,
...             'text': {'en-GB': '{}-{}'.format(r[0], r[1])}
...         }
...         for i, r in enumerate(
...             [
...                 [18, 20],
...                 [21, 25], [26, 30],
...                 [31, 35], [36, 40],
...                 [41, 45], [46, 50],
...                 [51, 55], [56, 60],
...                 [61, 65]
...             ],
...             start=1
...         )
...     ] + [
...         {
...             'value': 11,
...             'text': {'en-GB': '65+'}
...         }
...     ]
... }

Initialize the new column:

>>> data['age_xb_1'] = np.NaN

Recode the new column:

>>> data['age_xb_1'] = recode(
...     meta, data,
...     target='age_xb_1',
...     mapper={
...         1: frange('18-20'),
...         2: frange('21-25'),
...         3: frange('26-30'),
...         4: frange('31-35'),
...         5: frange('36-40'),
...         6: frange('41-45'),
...         7: frange('46-50'),
...         8: frange('51-55'),
...         9: frange('56-60'),
...         10: frange('61-65'),
...         11: frange('66-99')
...     },
...     default='age'
... )

Check the result:

>>> data[['age', 'age_xb_1']].head(20)
    age  age_cb
0    22       2
1    68      11
2    32       4
3    44       6
4    33       4
5    52       8
6    54       8
7    44       6
8    62      10
9    49       7
10   64      10
11   73      11
12   43       6
13   28       3
14   66      11
15   39       5
16   51       8
17   50       7
18   77      11
19   42       6

Complicated segmentation

Here’s an example of using a complicated, nested series of logic statements to recode an obscure segmentation.

The segemenation was given with the following definition:

1 - Self-directed:

  • If q1_1 in [1,2] and q1_2 in [1,2] and q1_3 in [3,4,5]

2 - Validators:

  • If q1_1 in [1,2] and q1_2 in [1,2] and q1_3 in [1,2]

3 - Delegators:

  • If (q1_1 in [3,4,5] and q1_2 in [3,4,5] and q1_3 in [1,2])
  • Or (q1_1 in [3,4,5] and q1_2 in [1,2] and q1_3 in [1,2])
  • Or (q1_1 in [1,2] and q1_2 in [3,4,5] and q1_3 in [1,2])

4 - Avoiders:

  • If (q1_1 in [3,4,5] and q1_2 in [3,4,5] and q1_3 in [3,4,5])
  • Or (q1_1 in [3,4,5] and q1_2 in [1,2] and q1_3 in [3,4,5])
  • Or (q1_1 in [1,2] and q1_2 in [3,4,5] and q1_3 in [3,4,5])

5 - Others:

  • Everyone else.

Create the new metadata:

>>> meta['columns']['segments'] = {
...     'type': 'single',
...     'text': {'en-GB': 'Segments'},
...     'values': [
...         {'value': 1, 'text': {'en-GB': 'Self-directed'}},
...         {'value': 2, 'text': {'en-GB': 'Validators'}},
...         {'value': 3, 'text': {'en-GB': 'Delegators'}},
...         {'value': 4, 'text': {'en-GB': 'Avoiders'}},
...         {'value': 5, 'text': {'en-GB': 'Other'}},
...     ]
... }

Initialize the new column?

>>> data['segments'] = np.NaN

Create the mapper separately, since it’s pretty massive!

See the Complex logic section for more information and examples related to the use of union and intersection.

>>> mapper = {
...     1: intersection([
...         {"q1_1": [1, 2]},
...         {"q1_2": [1, 2]},
...         {"q1_3": [3, 4, 5]}
...     ]),
...     2: intersection([
...         {"q1_1": [1, 2]},
...         {"q1_2": [1, 2]},
...         {"q1_3": [1, 2]}
...     ]),
...     3: union([
...         intersection([
...             {"q1_1": [3, 4, 5]},
...             {"q1_2": [3, 4, 5]},
...             {"q1_3": [1, 2]}
...         ]),
...         intersection([
...             {"q1_1": [3, 4, 5]},
...             {"q1_2": [1, 2]},
...             {"q1_3": [1, 2]}
...         ]),
...         intersection([
...             {"q1_1": [1, 2]},
...             {"q1_2": [3, 4, 5]},
...             {"q1_3": [1, 2]}
...         ]),
...     ]),
...     4: union([
...         intersection([
...             {"q1_1": [3, 4, 5]},
...             {"q1_2": [3, 4, 5]},
...             {"q1_3": [3, 4, 5]}
...         ]),
...         intersection([
...             {"q1_1": [3, 4, 5]},
...             {"q1_2": [1, 2]},
...             {"q1_3": [3, 4, 5]}
...         ]),
...         intersection([
...             {"q1_1": [1, 2]},
...             {"q1_2": [3, 4, 5]},
...             {"q1_3": [3, 4, 5]}
...         ])
...     ])
... }

Recode the new column:

>>> data['segments'] = recode(
...     meta, data,
...     target='segments',
...     mapper=mapper
... )

Note

Anything not at the top level of the mapper will not benefit from using the default parameter of the recode function. In this case, for example, saying default='q1_1' would not have helped. Everything in a nested level of the mapper, including anything in a union or intersection list, must use the explicit dict form {"q1_1": [1, 2]}.

Fill all cases that are still empty with the value 5:

>>> data['segments'].fillna(5, inplace=True)

Check the result:

>>> data[['q1_1', 'q1_2', 'q1_3', 'segments']].head(20)
    q1_1  q1_2  q1_3  segments
0      3     3     3         4
1      3     3     3         4
2      1     1     3         1
3      1     1     2         2
4      2     2     2         2
5      1     1     5         1
6      2     3     2         3
7      2     2     3         1
8      1     1     4         1
9      3     3     3         4
10     3     3     4         4
11     2     2     4         1
12     1     1     5         1
13     2     2     4         1
14     1     1     1         2
15     2     2     4         1
16     2     2     3         1
17     1     1     5         1
18     5     5     1         3
19     1     1     4         1

Variable creation

Adding derived variables

Interlocking variables

Condition-based code removal