The RareLabelEncoder() groups labels that appear in only a small number of observations into a new category called 'Rare'. Grouping infrequent labels can help reduce overfitting.
The argument tol indicates the minimum fraction of observations (for example, 0.05 for 5%) that a label needs to have in order not to be re-grouped into the 'Rare' label.
The argument n_categories indicates the minimum number of distinct categories that a variable needs to have for any of its labels to be re-grouped into 'Rare'.
If the number of labels is smaller than n_categories, the encoder will not group the labels for that variable.
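The interplay of tol and n_categories described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name `frequent_labels` is hypothetical, and leaving a variable with exactly n_categories labels ungrouped is an assumption, based on the warning this document shows for pclass (3 labels) when n_categories=3.

```python
from collections import Counter


def frequent_labels(values, tol=0.05, n_categories=10):
    """Return the set of labels that would be kept (not grouped into 'Rare')."""
    counts = Counter(values)
    # Too few distinct labels: nothing is grouped.
    # Treating the boundary case (== n_categories) this way is an assumption.
    if len(counts) <= n_categories:
        return set(counts)
    total = len(values)
    # Keep labels whose share of the observations reaches the tolerance.
    return {label for label, n in counts.items() if n / total >= tol}


# Toy variable with 4 distinct labels and 100 observations.
labels = ["a"] * 50 + ["b"] * 45 + ["c"] * 3 + ["d"] * 2

kept_low_threshold = frequent_labels(labels, tol=0.05, n_categories=3)
kept_high_threshold = frequent_labels(labels, tol=0.05, n_categories=10)
print(sorted(kept_low_threshold))   # 'c' and 'd' fall below 5% and would be grouped
print(sorted(kept_high_threshold))  # only 4 labels, so nothing is grouped
```

With n_categories=3 the variable has enough distinct labels for the frequency check to run; with n_categories=10 it does not, so every label is treated as frequent.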
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.encoding import RareLabelEncoder
# Load titanic dataset from OpenML
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    # missing values are coded as '?' in the raw file
    data = data.replace('?', np.nan)
    # keep only the first character of the cabin (missing cabins become 'n', from 'nan')
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['age'] = data['age'].astype('float')
    data['fare'] = data['fare'].astype('float')
    # fill missing ports of embarkation with 'C'
    data['embarked'] = data['embarked'].fillna('C')
    data = data.drop(labels=['boat', 'body', 'home.dest'], axis=1)
    return data

data = load_titanic()
data.head()
| | pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B | S |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C | S |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C | S |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C | S |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C | S |
X = data.drop(['survived', 'name', 'ticket'], axis=1)
y = data.survived
# we will encode the below variables, they have no missing values
X[['cabin', 'pclass', 'embarked']].isnull().sum()
cabin       0
pclass      0
embarked    0
dtype: int64
'''
Make sure that the variables are of type object.
If they are not, cast them as object; otherwise the transformer will either raise an error
(if we pass them in the variables argument) or not pick them up (if we leave variables=None).
'''
X[['cabin', 'pclass', 'embarked']].dtypes
cabin       object
pclass      object
embarked    object
dtype: object
# let's separate into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train.shape, X_test.shape
((916, 8), (393, 8))
The RareLabelEncoder() groups rare / infrequent categories in a new category called "Rare", or any other name entered by the user.
For example, in the variable colour, if the categories magenta, cyan and burgundy each account for less than 5% of the observations, all of them will be replaced by the new label "Rare".
Note that infrequent labels can also be grouped under a user-defined name, for example 'Other'. The replacement name is set with the parameter replace_with.
The encoder will encode only categorical variables (type 'object'). A list of variables can be passed as an argument. If no variables are passed as argument, the encoder will find and encode all categorical variables (object type).
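The colour example can be made concrete with a short plain-Python sketch. It only mimics the replacement step described above and is not the library's code; the function name `group_rare` and the toy counts (including the labels red and blue) are hypothetical.

```python
from collections import Counter


def group_rare(values, tol=0.05, replace_with="Rare"):
    """Replace every label whose share of the observations is below tol."""
    counts = Counter(values)
    total = len(values)
    frequent = {label for label, n in counts.items() if n / total >= tol}
    return [v if v in frequent else replace_with for v in values]


# Hypothetical 'colour' variable: magenta, cyan and burgundy are each below 5%.
colour = (["red"] * 60 + ["blue"] * 30
          + ["magenta"] * 4 + ["cyan"] * 3 + ["burgundy"] * 3)

default_name = Counter(group_rare(colour))
custom_name = Counter(group_rare(colour, replace_with="Other"))
print(default_name)  # the three infrequent colours are pooled under 'Rare'
print(custom_name)   # same grouping, pooled under the user-defined name 'Other'
```

The frequent labels are left untouched; only the name given to the pooled category changes with replace_with.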
## Rare value encoder
'''
Parameters
----------

tol: float, default=0.05
    The minimum frequency a label should have to be considered frequent.
    Categories with frequencies lower than tol will be grouped.

n_categories: int, default=10
    The minimum number of categories a variable should have for the encoder
    to find frequent labels. If the variable contains fewer categories, all
    of them will be considered frequent.

max_n_categories: int, default=None
    The maximum number of categories that should be considered frequent.
    If None, all categories with frequency above the tolerance (tol) will be
    considered.

variables : list, default=None
    The list of categorical variables that will be encoded. If None, the
    encoder will find and select all object type variables.

replace_with : string, default='Rare'
    The category name that will be used to replace infrequent categories.
'''
rare_encoder = RareLabelEncoder(tol=0.05,
                                n_categories=5,
                                variables=['cabin', 'pclass', 'embarked'])
rare_encoder.fit(X_train)
UserWarning: The number of unique categories for variable pclass is less than that indicated in n_categories. Thus, all categories will be considered frequent
UserWarning: The number of unique categories for variable embarked is less than that indicated in n_categories. Thus, all categories will be considered frequent
RareLabelEncoder(n_categories=5, variables=['cabin', 'pclass', 'embarked'])
rare_encoder.encoder_dict_
{'cabin': Index(['n', 'C'], dtype='object'), 'pclass': array([2, 3, 1], dtype=object), 'embarked': array(['S', 'C', 'Q'], dtype=object)}
train_t = rare_encoder.transform(X_train)
test_t = rare_encoder.transform(X_test)
# inspect the transformed training set
train_t.head()
| | pclass | sex | age | sibsp | parch | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|
| 501 | 2 | female | 13.0 | 0 | 1 | 19.5000 | n | S |
| 588 | 2 | female | 4.0 | 1 | 1 | 23.0000 | n | S |
| 402 | 2 | female | 30.0 | 1 | 0 | 13.8583 | n | C |
| 1193 | 3 | male | NaN | 0 | 0 | 7.7250 | n | Q |
| 686 | 3 | female | 22.0 | 0 | 0 | 7.7250 | n | Q |
train_t.cabin.value_counts()
n       702
Rare    143
C        71
Name: cabin, dtype: int64
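As a quick sanity check on these counts, the shares can be worked out by hand. The sketch below uses only the numbers printed above (they add up to 916, the number of training rows) and assumes the tol=0.05 used to fit this encoder.

```python
import math

# Counts copied from the value_counts() output above.
cabin_counts = {"n": 702, "Rare": 143, "C": 71}
tol = 0.05

total = sum(cabin_counts.values())   # number of training rows
min_rows = math.ceil(tol * total)    # fewest rows a label needs to reach tol

shares = {label: count / total for label, count in cabin_counts.items()}
for label, share in shares.items():
    print(f"{label}: {share:.4f}")
print("rows needed to reach tol:", min_rows)
```

Cabin C, with 71 rows, clears the threshold on its own, while the cabins pooled into 'Rare' did not individually reach it.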
## Rare value encoder
rare_encoder = RareLabelEncoder(tol=0.03,
                                replace_with='Other',  # name the grouped category 'Other' instead of 'Rare'
                                variables=['cabin', 'pclass', 'embarked'],
                                n_categories=2)
rare_encoder.fit(X_train)
train_t = rare_encoder.transform(X_train)
test_t = rare_encoder.transform(X_test)
# inspect a random sample of the transformed training set
train_t.sample(5)
| | pclass | sex | age | sibsp | parch | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|
| 1059 | 3 | male | 28.0 | 0 | 0 | 8.0500 | n | S |
| 1227 | 3 | female | 22.0 | 0 | 0 | 9.8375 | n | S |
| 470 | 2 | male | 35.0 | 0 | 0 | 12.3500 | n | Q |
| 66 | 1 | female | 36.0 | 0 | 0 | 262.3750 | B | C |
| 950 | 3 | male | 29.0 | 0 | 0 | 9.4833 | n | S |
rare_encoder.encoder_dict_
{'cabin': Index(['n', 'C', 'B', 'E', 'D'], dtype='object'), 'pclass': Int64Index([3, 1, 2], dtype='int64'), 'embarked': Index(['S', 'C', 'Q'], dtype='object')}
train_t.cabin.value_counts()
n        702
C         71
B         42
Other     37
E         32
D         32
Name: cabin, dtype: int64
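The effect of lowering tol from 0.05 to 0.03 can be checked against these counts. The sketch below is plain arithmetic on the printed numbers, not library code; the helper name `kept_labels` is hypothetical.

```python
# Counts of the individual cabin labels, copied from the output above.
# 'Other' (37 rows) is the pooled remainder, so it is tracked separately.
cabin_counts = {"n": 702, "C": 71, "B": 42, "E": 32, "D": 32}
pooled_other = 37
total = sum(cabin_counts.values()) + pooled_other  # number of training rows


def kept_labels(counts, total, tol):
    """Labels whose share of the rows reaches the tolerance."""
    return [label for label, n in counts.items() if n / total >= tol]


kept_at_005 = kept_labels(cabin_counts, total, tol=0.05)
kept_at_003 = kept_labels(cabin_counts, total, tol=0.03)

# Rows that end up pooled when tol=0.05: the remainder plus B, E and D.
pooled_at_005 = pooled_other + sum(
    n for label, n in cabin_counts.items() if label not in kept_at_005
)

print(kept_at_005)
print(kept_at_003)
print(pooled_at_005)
```

At tol=0.05 only n and C qualify, which matches the first encoder; at tol=0.03 cabins B, E and D also qualify. The rows pooled at the stricter tolerance add up to the 143 'Rare' rows reported earlier.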
## Rare value encoder
rare_encoder = RareLabelEncoder(tol=0.03,
                                variables=['cabin', 'pclass', 'embarked'],
                                n_categories=2,
                                max_n_categories=3)  # keep at most the 3 most frequent categories of each variable
rare_encoder.fit(X_train)
train_t = rare_encoder.transform(X_train)
test_t = rare_encoder.transform(X_test)
# inspect a random sample of the transformed training set
train_t.sample(5)
| | pclass | sex | age | sibsp | parch | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|
| 1222 | 3 | male | 33.0 | 0 | 0 | 8.6625 | n | C |
| 781 | 3 | male | 33.0 | 0 | 0 | 7.8958 | n | C |
| 272 | 1 | female | 23.0 | 1 | 0 | 82.2667 | B | S |
| 1043 | 3 | female | NaN | 1 | 0 | 15.5000 | n | Q |
| 867 | 3 | female | 22.0 | 1 | 1 | 12.2875 | n | S |
rare_encoder.encoder_dict_
{'cabin': Index(['n', 'C', 'B'], dtype='object'), 'pclass': Int64Index([3, 1, 2], dtype='int64'), 'embarked': Index(['S', 'C', 'Q'], dtype='object')}
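The cabin entry of this dictionary is what one would expect if max_n_categories simply truncates the list of frequent labels after ranking them by count. The sketch below illustrates that reading with the cabin counts printed earlier in this document; the ranking rule is an assumption made for illustration, not a statement about the library's internals.

```python
# Individual cabin counts from the earlier value_counts() output (916 training rows).
cabin_counts = {"n": 702, "C": 71, "B": 42, "E": 32, "D": 32}
total = 916
tol = 0.03
max_n_categories = 3

# Rank labels from most to least frequent, keeping those that reach tol.
ranked = sorted(cabin_counts.items(), key=lambda item: item[1], reverse=True)
frequent = [label for label, n in ranked if n / total >= tol]

# Assumed behaviour: keep only the first max_n_categories of the ranked labels.
kept = frequent[:max_n_categories]
print(kept)
```

Under this reading, E and D pass the tolerance but are still grouped because three more frequent labels are ahead of them.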
If no list of variables is passed as an argument, the encoder selects all the categorical variables.
## Rare value encoder
rare_encoder = RareLabelEncoder(tol = 0.03, n_categories=3)
rare_encoder.fit(X_train)
rare_encoder.encoder_dict_
UserWarning: The number of unique categories for variable pclass is less than that indicated in n_categories. Thus, all categories will be considered frequent
UserWarning: The number of unique categories for variable sex is less than that indicated in n_categories. Thus, all categories will be considered frequent
UserWarning: The number of unique categories for variable embarked is less than that indicated in n_categories. Thus, all categories will be considered frequent
{'pclass': array([2, 3, 1], dtype=object), 'sex': array(['female', 'male'], dtype=object), 'cabin': Index(['n', 'C', 'B', 'E', 'D'], dtype='object'), 'embarked': array(['S', 'C', 'Q'], dtype=object)}
train_t = rare_encoder.transform(X_train)
test_t = rare_encoder.transform(X_test)
# inspect a random sample of the transformed training set
train_t.sample(5)
| | pclass | sex | age | sibsp | parch | fare | cabin | embarked |
|---|---|---|---|---|---|---|---|---|
| 385 | 2 | male | 8.0 | 1 | 1 | 36.750 | n | S |
| 154 | 1 | male | 55.0 | 1 | 1 | 93.500 | B | S |
| 323 | 2 | male | 30.0 | 1 | 0 | 24.000 | n | C |
| 572 | 2 | female | 28.0 | 0 | 0 | 12.650 | n | S |
| 809 | 3 | male | 18.0 | 2 | 2 | 34.375 | n | S |