CountFrequencyEncoder

The CountFrequencyEncoder() replaces each category with either the number of observations in that category (count) or the fraction of observations in that category (frequency).
For example, in the variable colour, if 10 observations are blue, blue will be replaced by 10. Alternatively, if 10% of the observations are blue, blue will be replaced by 0.1.
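
To make the two options concrete, here is a minimal sketch in plain pandas, independent of feature_engine. The toy colour series and the names count_map and freq_map are made up for illustration; this is not how the library is implemented internally.

```python
import pandas as pd

# Hypothetical toy data: 10 observations of a 'colour' variable.
colour = pd.Series(["blue"] * 5 + ["red"] * 3 + ["green"] * 2, name="colour")

# Count encoding: each category maps to its number of observations.
count_map = colour.value_counts().to_dict()

# Frequency encoding: each category maps to its share of observations (0 to 1).
freq_map = colour.value_counts(normalize=True).to_dict()

# Replace the labels with the learned numbers.
count_encoded = colour.map(count_map)
freq_encoded = colour.map(freq_map)

print(count_map)
print(freq_map)
```

Here blue appears in 5 of 10 rows, so it becomes 5 under count encoding and 0.5 under frequency encoding.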

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from feature_engine.encoding import CountFrequencyEncoder
In [2]:
# Load titanic dataset from OpenML

def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['age'] = data['age'].astype('float')
    data['fare'] = data['fare'].astype('float')
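    # assign back instead of calling fillna(inplace=True) on a column selection,
    # which may not modify the original DataFrame in recent pandas versions
    data['embarked'] = data['embarked'].fillna('C')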
    data.drop(labels=['boat', 'body', 'home.dest'], axis=1, inplace=True)
    return data
In [3]:
data = load_titanic()
data.head()
Out[3]:
   pclass  survived  name                                             sex     age      sibsp  parch  ticket  fare      cabin  embarked
0  1       1         Allen, Miss. Elisabeth Walton                    female  29.0000  0      0      24160   211.3375  B      S
1  1       1         Allison, Master. Hudson Trevor                   male    0.9167   1      2      113781  151.5500  C      S
2  1       0         Allison, Miss. Helen Loraine                     female  2.0000   1      2      113781  151.5500  C      S
3  1       0         Allison, Mr. Hudson Joshua Creighton             male    30.0000  1      2      113781  151.5500  C      S
4  1       0         Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female  25.0000  1      2      113781  151.5500  C      S
In [4]:
X = data.drop(['survived', 'name', 'ticket'], axis=1)
y = data.survived
In [5]:
# we will encode the variables below; they have no missing values
X[['cabin', 'pclass', 'embarked']].isnull().sum()
Out[5]:
cabin       0
pclass      0
embarked    0
dtype: int64
In [6]:
''' Make sure that the variables are of type object.
If not, cast them as object; otherwise the transformer will either raise an error
(if we pass them in the variables argument) or not pick them up (if we leave variables=None). '''

X[['cabin', 'pclass', 'embarked']].dtypes
Out[6]:
cabin       object
pclass      object
embarked    object
dtype: object
In [7]:
# let's separate into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train.shape, X_test.shape
Out[7]:
((916, 8), (393, 8))

The CountFrequencyEncoder() replaces each category with the count or frequency of the observations in the train set for that category.

If we select "count" in the encoding_method, then for the variable colour, if there are 10 observations in the train set that show colour blue, blue will be replaced by 10.

Alternatively, if we select "frequency" in the encoding_method, if 10% of the observations in the train set show blue colour, then blue will be replaced by 0.1.

Frequency

Labels are replaced by the fraction of observations that show that label in the train set.

In [8]:
'''
Parameters
----------

encoding_method : str, default='count' 
                Desired method of encoding.

        'count': number of observations per category
        
        'frequency': percentage of observations per category

variables : list
          The list of categorical variables that will be encoded. If None, the 
          encoder will find and transform all object type variables.
'''
count_encoder = CountFrequencyEncoder(encoding_method='frequency',
                                      variables=['cabin', 'pclass', 'embarked'])

count_encoder.fit(X_train)
Out[8]:
CountFrequencyEncoder(encoding_method='frequency',
                      variables=['cabin', 'pclass', 'embarked'])
In [9]:
# we can explore the encoder_dict_ to find out the category replacements.
count_encoder.encoder_dict_
Out[9]:
{'cabin': {'n': 0.7663755458515283,
  'C': 0.07751091703056769,
  'B': 0.04585152838427948,
  'D': 0.034934497816593885,
  'E': 0.034934497816593885,
  'A': 0.018558951965065504,
  'F': 0.016375545851528384,
  'G': 0.004366812227074236,
  'T': 0.001091703056768559},
 'pclass': {3: 0.5436681222707423,
  1: 0.25109170305676853,
  2: 0.2052401746724891},
 'embarked': {'S': 0.7117903930131004,
  'C': 0.19759825327510916,
  'Q': 0.0906113537117904}}
In [10]:
# transform the data: see the change in the head view
train_t = count_encoder.transform(X_train)
test_t = count_encoder.transform(X_test)
test_t.head()
Out[10]:
      pclass    sex     age   sibsp  parch  fare     cabin     embarked
1139  0.543668  male    38.0  0      0      7.8958   0.766376  0.71179
533   0.205240  female  21.0  0      1      21.0000  0.766376  0.71179
459   0.205240  male    42.0  1      0      27.0000  0.766376  0.71179
1150  0.543668  male    NaN   0      0      14.5000  0.766376  0.71179
393   0.205240  male    25.0  0      0      31.5000  0.766376  0.71179
In [11]:
test_t['pclass'].value_counts().plot.bar()
plt.show()
In [12]:
test_orig = count_encoder.inverse_transform(test_t)
test_orig.head()
Out[12]:
      pclass  sex     age   sibsp  parch  fare     cabin  embarked
1139  3       male    38.0  0      0      7.8958   n      S
533   2       female  21.0  0      1      21.0000  n      S
459   2       male    42.0  1      0      27.0000  n      S
1150  3       male    NaN   0      0      14.5000  n      S
393   2       male    25.0  0      0      31.5000  n      S
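
One caveat when inverting the encoding: two categories can share the same encoded value. In the mapping shown in Out[9], cabins D and E both map to 0.034934497816593885. The sketch below inverts a plain Python dictionary to show why such ties cannot be undone uniquely. It is only an illustration; how inverse_transform itself resolves ties is not covered here and should be checked against the library documentation.

```python
# Three of the 'cabin' frequencies from encoder_dict_ (Out[9]); D and E are tied.
cabin_freq = {
    "C": 0.07751091703056769,
    "D": 0.034934497816593885,
    "E": 0.034934497816593885,
}

# Naive inversion: when two categories share a value, the later key overwrites
# the earlier one, so only one of them survives in the inverse mapping.
inverse = {value: category for category, value in cabin_freq.items()}

print(len(cabin_freq), len(inverse))
print(inverse[0.034934497816593885])  # 'E'; 'D' can no longer be recovered
```

If recovering the original labels matters, it is worth checking the learned mapping for duplicated values before relying on the inverse.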

Count

Labels are replaced by the number of observations that show that label in the train set.

In [13]:
# this time we encode only 1 variable

count_enc = CountFrequencyEncoder(encoding_method='count',
                                  variables='cabin')

count_enc.fit(X_train)
Out[13]:
CountFrequencyEncoder(variables=['cabin'])
In [14]:
# we can find the mappings in the encoder_dict_ attribute.

count_enc.encoder_dict_
Out[14]:
{'cabin': {'n': 702,
  'C': 71,
  'B': 42,
  'D': 32,
  'E': 32,
  'A': 17,
  'F': 15,
  'G': 4,
  'T': 1}}
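
The count and frequency mappings are two views of the same information: dividing each count by the number of rows in the train set gives the frequency. A quick check in plain Python, using the cabin counts printed above (the variable names are illustrative only):

```python
# Counts for 'cabin' copied from encoder_dict_ (Out[14]).
cabin_counts = {"n": 702, "C": 71, "B": 42, "D": 32, "E": 32,
                "A": 17, "F": 15, "G": 4, "T": 1}

# The counts add up to the number of rows in the train set.
n_train = sum(cabin_counts.values())

# Frequency = count / number of training rows.
cabin_freq = {category: count / n_train for category, count in cabin_counts.items()}

print(n_train)
print(cabin_freq["n"])
```

The total matches the 916 training rows reported in Out[7], and the value for 'n' agrees with the frequency mapping in Out[9].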
In [15]:
# transform the data: see the change in the head view for Cabin

train_t = count_enc.transform(X_train)
test_t = count_enc.transform(X_test)

test_t.head()
Out[15]:
      pclass  sex     age   sibsp  parch  fare     cabin  embarked
1139  3       male    38.0  0      0      7.8958   702    S
533   2       female  21.0  0      1      21.0000  702    S
459   2       male    42.0  1      0      27.0000  702    S
1150  3       male    NaN   0      0      14.5000  702    S
393   2       male    25.0  0      0      31.5000  702    S

Select categorical variables automatically

If we don't indicate which variables we want to encode, the encoder will find all categorical (object type) variables.

In [16]:
# this time we omit the variables argument
count_enc = CountFrequencyEncoder(encoding_method = 'count')

count_enc.fit(X_train)
Out[16]:
CountFrequencyEncoder(variables=['pclass', 'sex', 'cabin', 'embarked'])
In [17]:
# we can see that the encoder automatically selected all the categorical variables
# (in newer feature_engine releases the learned list may be stored in variables_ instead)

count_enc.variables
Out[17]:
['pclass', 'sex', 'cabin', 'embarked']
In [18]:
# transform the data: see the change in the head view

train_t = count_enc.transform(X_train)
test_t = count_enc.transform(X_test)

test_t.head()
Out[18]:
      pclass  sex  age   sibsp  parch  fare     cabin  embarked
1139  498     581  38.0  0      0      7.8958   702    652
533   188     335  21.0  0      1      21.0000  702    652
459   188     581  42.0  1      0      27.0000  702    652
1150  498     581  NaN   0      0      14.5000  702    652
393   188     581  25.0  0      0      31.5000  702    652
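
Automatic selection depends on the column dtype, which is why pclass was cast to object when loading the data. The sketch below uses pandas' select_dtypes on a made-up DataFrame to show the effect. It is an assumption that this mirrors the encoder's own detection logic, so treat it as an illustration rather than the library's implementation.

```python
import pandas as pd

# Hypothetical toy frame: 'pclass' is an integer-coded category.
df = pd.DataFrame({
    "pclass": [3, 2, 1],
    "sex": pd.Series(["male", "female", "male"], dtype=object),
    "age": [38.0, 21.0, 42.0],
})

# Before casting, only 'sex' has object dtype.
before = df.select_dtypes(include="object").columns.tolist()

# Cast the integer-coded category to object so dtype-based selection finds it.
df["pclass"] = df["pclass"].astype("O")
after = df.select_dtypes(include="object").columns.tolist()

print(before)  # ['sex']
print(after)   # ['pclass', 'sex']
```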

Note

If there are labels in the test set that were not present in the train set, the transformer will introduce NaN and raise a warning.
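
The reason for the NaN is easy to see with a hand-rolled mapping: a label that was never seen during fitting has no entry in the learned dictionary. A minimal pandas sketch, using made-up train and test series (the label 'X' is hypothetical and not part of the Titanic data):

```python
import pandas as pd

train = pd.Series(["S", "S", "C", "Q", "S"], name="embarked")
test = pd.Series(["S", "C", "X"], name="embarked")  # 'X' never appears in train

# Learn the counts from the train set only.
count_map = train.value_counts().to_dict()

# Labels without an entry in the mapping become NaN.
test_encoded = test.map(count_map)

# List the unseen labels so they can be handled before modelling.
unseen = sorted(set(test) - set(count_map))

print(test_encoded)
print(unseen)
```

Checking for unseen labels like this before transforming makes it possible to decide up front how to treat them, for example by grouping rare categories or imputing the resulting NaN.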
