RareLabelEncoder¶

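The RareLabelEncoder() groups labels that appear in only a small share of the observations into a new category called 'Rare'. Grouping infrequent labels helps to reduce overfitting, because labels seen only a handful of times carry little reliable information.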

The argument tol indicates the minimum fraction of observations (for example, 0.05 for 5%) that a label needs to have in order not to be re-grouped into the 'Rare' label.
The argument n_categories indicates the minimum number of distinct categories that a variable needs to have for any of the labels to be re-grouped into 'Rare'.

Note¶

If the number of labels is smaller than n_categories, then the encoder will not group the labels for that variable.
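The interaction between tol and n_categories can be sketched in plain Python. This is only an illustration of the rule described above, not the library's implementation: the helper name find_frequent_labels and the toy data are hypothetical, and the exact boundary (grouping only when the variable has more distinct labels than n_categories) is an assumption based on the warnings shown later in this notebook.

```python
from collections import Counter


def find_frequent_labels(values, tol=0.05, n_categories=10):
    """Return the labels that would be kept (not grouped) for one variable."""
    counts = Counter(values)
    # Assumed boundary: grouping only applies when the variable has more
    # distinct labels than n_categories; otherwise every label is kept.
    if len(counts) <= n_categories:
        return sorted(counts)
    total = len(values)
    # A label is frequent when its share of observations is at least tol.
    return sorted(label for label, n in counts.items() if n / total >= tol)


# Hypothetical toy variable with 4 distinct labels and 100 observations.
values = ["a"] * 60 + ["b"] * 30 + ["c"] * 7 + ["d"] * 3

# "d" covers 3% of the rows, below tol, so it is not in the frequent list.
print(find_frequent_labels(values, tol=0.05, n_categories=3))
# With n_categories=10 the variable has too few labels, so all are kept.
print(find_frequent_labels(values, tol=0.05, n_categories=10))
```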

In [5]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from feature_engine.encoding import RareLabelEncoder

In [6]:

# Load titanic dataset from OpenML

def load_titanic():
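    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    # missing values are coded as '?' in this file
    data = data.replace('?', np.nan)
    # keep only the first letter of the cabin; missing cabins become 'n' (from 'nan')
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['age'] = data['age'].astype('float')
    data['fare'] = data['fare'].astype('float')
    # assign the result instead of calling fillna in place on a column selection
    data['embarked'] = data['embarked'].fillna('C')
    data.drop(labels=['boat', 'body', 'home.dest'], axis=1, inplace=True)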
    return data

In [7]:

data = load_titanic()
data.head()

Out[7]:

	pclass	survived	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked
0	1	1	Allen, Miss. Elisabeth Walton	female	29.0000	0	0	24160	211.3375	B	S
1	1	1	Allison, Master. Hudson Trevor	male	0.9167	1	2	113781	151.5500	C	S
2	1	0	Allison, Miss. Helen Loraine	female	2.0000	1	2	113781	151.5500	C	S
3	1	0	Allison, Mr. Hudson Joshua Creighton	male	30.0000	1	2	113781	151.5500	C	S
4	1	0	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.0000	1	2	113781	151.5500	C	S

In [8]:

X = data.drop(['survived', 'name', 'ticket'], axis=1)
y = data.survived

In [9]:

# we will encode the variables below; first check that they have no missing values
X[['cabin', 'pclass', 'embarked']].isnull().sum()

Out[9]:

cabin       0
pclass      0
embarked    0
dtype: int64

In [10]:

# Make sure that the variables are of type object.
# If they are not, cast them as object; otherwise the transformer will either
# raise an error (if we pass them in the variables argument) or not pick them
# up (if we leave variables=None).

X[['cabin', 'pclass', 'embarked']].dtypes

Out[10]:

cabin       object
pclass      object
embarked    object
dtype: object

In [17]:

# let's separate into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train.shape, X_test.shape

Out[17]:

((916, 8), (393, 8))

The RareLabelEncoder() groups rare / infrequent categories in a new category called "Rare", or any other name entered by the user.

For example, in the variable colour, if the categories magenta, cyan and burgundy each appear in fewer than 5% of the observations, all of those categories will be replaced by the new label "Rare".

Note, infrequent labels can also be grouped under a user defined name, for example 'Other'. The name to replace infrequent categories is defined with the parameter replace_with.
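The colour example can be worked through with a few lines of plain Python. This is a sketch of the idea only: the group_rare helper and the colour counts are made up for illustration, and the real encoder works on pandas dataframes rather than lists.

```python
from collections import Counter


def group_rare(values, tol=0.05, replace_with="Rare"):
    """Replace labels whose relative frequency is below tol."""
    counts = Counter(values)
    total = len(values)
    frequent = {label for label, n in counts.items() if n / total >= tol}
    return [v if v in frequent else replace_with for v in values]


# Hypothetical colour variable with 100 observations.
colour = (["red"] * 50 + ["blue"] * 38
          + ["magenta"] * 4 + ["cyan"] * 4 + ["burgundy"] * 4)

# magenta, cyan and burgundy each cover 4% of the rows, so they are grouped
# under the user-defined name 'Other'.
grouped = group_rare(colour, tol=0.05, replace_with="Other")
print(Counter(grouped))
```

Here the three infrequent colours end up in a single 'Other' category with 12 observations, while red and blue are left unchanged.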

The encoder will encode only categorical variables (type 'object'). A list of variables can be passed as an argument. If no variables are passed as argument, the encoder will find and encode all categorical variables (object type).

In [8]:

## Rare value encoder
'''
Parameters
----------

tol: float, default=0.05
    the minimum frequency a label should have to be considered frequent.
    Categories with frequencies lower than tol will be grouped.

n_categories: int, default=10
    the minimum number of categories a variable should have for the encoder
    to find frequent labels. If the variable contains fewer categories, all
    of them will be considered frequent.

max_n_categories: int, default=None
    the maximum number of categories that should be considered frequent.
    If None, all categories with frequency above the tolerance (tol) will be
    considered.

variables : list, default=None
    The list of categorical variables that will be encoded. If None, the 
    encoder will find and select all object type variables.

replace_with : string, default='Rare'
    The category name that will be used to replace infrequent categories.
'''

rare_encoder = RareLabelEncoder(tol=0.05, 
                                n_categories=5,
                                variables=['cabin', 'pclass', 'embarked'])
rare_encoder.fit(X_train)

c:\users\king_ashok\desktop\feature_engine\feature_engine\encoding\rare_label.py:139: UserWarning: The number of unique categories for variable pclass is less than that indicated in n_categories. Thus, all categories will be considered frequent
  "considered frequent".format(var)
c:\users\king_ashok\desktop\feature_engine\feature_engine\encoding\rare_label.py:139: UserWarning: The number of unique categories for variable embarked is less than that indicated in n_categories. Thus, all categories will be considered frequent
  "considered frequent".format(var)

Out[8]:

RareLabelEncoder(n_categories=5, variables=['cabin', 'pclass', 'embarked'])

In [9]:

rare_encoder.encoder_dict_

Out[9]:

{'cabin': Index(['n', 'C'], dtype='object'),
 'pclass': array([2, 3, 1], dtype=object),
 'embarked': array(['S', 'C', 'Q'], dtype=object)}
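The dictionary above stores, for each variable, the labels that are considered frequent. A rough sketch of how such a dictionary could be applied to new data is shown below. The apply_encoding helper and the new observations are hypothetical, and the treatment of labels never seen during fit (they are replaced as well) is how this sketch behaves, stated here as an assumption about the encoder rather than a confirmed detail.

```python
def apply_encoding(values, frequent_labels, replace_with="Rare"):
    """Keep labels learned as frequent; replace everything else."""
    frequent = set(frequent_labels)
    return [v if v in frequent else replace_with for v in values]


# Frequent cabin labels, as shown in encoder_dict_ above.
learned = {"cabin": ["n", "C"]}

# Hypothetical new observations, including a label ("Z") not seen in training.
new_cabin = ["n", "C", "B", "Z", "n"]

encoded = apply_encoding(new_cabin, learned["cabin"])
print(encoded)
```

Because the frequent labels are learned once and then reused, the same grouping is applied to the training set and to any later data.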

In [16]:

train_t = rare_encoder.transform(X_train)
test_t = rare_encoder.transform(X_test)

train_t.head()

Out[16]:

	pclass	sex	age	sibsp	parch	fare	cabin	embarked
501	2	female	13.0	0	1	19.5000	n	S
588	2	female	4.0	1	1	23.0000	n	S
402	2	female	30.0	1	0	13.8583	n	C
1193	3	male	NaN	0	0	7.7250	n	Q
686	3	female	22.0	0	0	7.7250	n	Q

In [11]:

train_t.cabin.value_counts()

Out[11]:

n       702
Rare    143
C        71
Name: cabin, dtype: int64

The user can change the string from 'Rare' to something else.¶

In [20]:

## Rare value encoder

rare_encoder = RareLabelEncoder(tol=0.03,
                                replace_with='Other',  # replacing 'Rare' with 'Other'
                                variables=['cabin', 'pclass', 'embarked'],
                                n_categories=2)

rare_encoder.fit(X_train)

train_t = rare_encoder.transform(X_train)
test_t = rare_encoder.transform(X_test)

train_t.sample(5)

Out[20]:

	pclass	sex	age	fare	cabin	embarked
1059	3	male	28.0	8.0500	n	S
1227	3	female	22.0	9.8375	n	S
470	2	male	35.0	12.3500	n	Q
66	1	female	36.0	262.3750	B	C
950	3	male	29.0	9.4833	n	S

In [21]:

rare_encoder.encoder_dict_

Out[21]:

{'cabin': Index(['n', 'C', 'B', 'E', 'D'], dtype='object'),
 'pclass': Int64Index([3, 1, 2], dtype='int64'),
 'embarked': Index(['S', 'C', 'Q'], dtype='object')}

In [22]:

train_t.cabin.value_counts()

Out[22]:

n        702
C         71
B         42
Other     37
E         32
D         32
Name: cabin, dtype: int64
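The effect of lowering tol from 0.05 to 0.03 can be checked by hand from the counts above. The snippet below simply redoes that arithmetic in plain Python; the variable names are illustrative and the counts are copied from the output.

```python
# Cabin counts copied from the output above.
counts = {"n": 702, "C": 71, "B": 42, "Other": 37, "E": 32, "D": 32}
total = sum(counts.values())  # 916, the number of rows in X_train

# Share of observations per label.
for label, n in counts.items():
    print(f"{label:>5}: {n / total:.4f}")

# Original labels (excluding the grouped 'Other') that clear each threshold.
kept_at_003 = [l for l, n in counts.items() if l != "Other" and n / total >= 0.03]
kept_at_005 = [l for l, n in counts.items() if l != "Other" and n / total >= 0.05]

# Everything not kept at tol=0.05 falls into the grouped category.
grouped_at_005 = sum(n for l, n in counts.items() if l not in kept_at_005)
print(kept_at_003, kept_at_005, grouped_at_005)
```

B, E and D each cover between 3% and 5% of the rows, so they survive with tol=0.03 but are grouped with tol=0.05. Adding them to the 37 observations already in 'Other' gives the 143 observations labelled 'Rare' in the first example.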

The user can choose to retain only the most popular categories with the argument max_n_categories.¶

In [25]:

## Rare value encoder

rare_encoder = RareLabelEncoder(tol=0.03,
                                variables=['cabin', 'pclass', 'embarked'],
                                n_categories=2,
                                # keeps only the 3 most popular categories in every variable
                                max_n_categories=3)

rare_encoder.fit(X_train)

train_t = rare_encoder.transform(X_train)
test_t = rare_encoder.transform(X_test)

train_t.sample(5)

Out[25]:

	pclass	sex	age	sibsp	parch	fare	cabin	embarked
1222	3	male	33.0	0	0	8.6625	n	C
781	3	male	33.0	0	0	7.8958	n	C
272	1	female	23.0	1	0	82.2667	B	S
1043	3	female	NaN	1	0	15.5000	n	Q
867	3	female	22.0	1	1	12.2875	n	S

In [26]:

rare_encoder.encoder_dict_

Out[26]:

{'cabin': Index(['n', 'C', 'B'], dtype='object'),
 'pclass': Int64Index([3, 1, 2], dtype='int64'),
 'embarked': Index(['S', 'C', 'Q'], dtype='object')}
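One way to picture max_n_categories is as a cut applied after the tol filter: labels are ranked by frequency, those below tol are dropped, and only the first few are retained. The sketch below follows that reading using the cabin counts shown earlier in this notebook. The top_frequent helper is hypothetical, and the ordering of the two steps is an assumption.

```python
def top_frequent(counts, total, tol, max_n_categories=None):
    """Rank labels by count, keep those at or above tol, then cap the list."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    frequent = [label for label, n in ranked if n / total >= tol]
    if max_n_categories is not None:
        frequent = frequent[:max_n_categories]
    return frequent


# Training-set counts for the cabin labels that clear tol=0.03. The remaining
# labels were each below tol and are left out; total still includes them.
cabin_counts = {"n": 702, "C": 71, "B": 42, "E": 32, "D": 32}
total_rows = 916

print(top_frequent(cabin_counts, total_rows, tol=0.03))
print(top_frequent(cabin_counts, total_rows, tol=0.03, max_n_categories=3))
```

With the cap set to 3, only n, C and B remain frequent, which agrees with the dictionary shown above; E and D would then be grouped together with the other infrequent labels.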

Automatically select all categorical variables¶

If no list of variables is passed as an argument, the encoder finds and selects all the categorical variables (object type) in the dataframe.

In [27]:

## Rare value encoder

rare_encoder = RareLabelEncoder(tol = 0.03, n_categories=3)

rare_encoder.fit(X_train)

rare_encoder.encoder_dict_

c:\users\king_ashok\desktop\feature_engine\feature_engine\encoding\rare_label.py:139: UserWarning: The number of unique categories for variable pclass is less than that indicated in n_categories. Thus, all categories will be considered frequent
  "considered frequent".format(var)
c:\users\king_ashok\desktop\feature_engine\feature_engine\encoding\rare_label.py:139: UserWarning: The number of unique categories for variable sex is less than that indicated in n_categories. Thus, all categories will be considered frequent
  "considered frequent".format(var)
c:\users\king_ashok\desktop\feature_engine\feature_engine\encoding\rare_label.py:139: UserWarning: The number of unique categories for variable embarked is less than that indicated in n_categories. Thus, all categories will be considered frequent
  "considered frequent".format(var)

Out[27]:

{'pclass': array([2, 3, 1], dtype=object),
 'sex': array(['female', 'male'], dtype=object),
 'cabin': Index(['n', 'C', 'B', 'E', 'D'], dtype='object'),
 'embarked': array(['S', 'C', 'Q'], dtype=object)}

In [13]:

train_t = rare_encoder.transform(X_train)
test_t = rare_encoder.transform(X_test)

train_t.sample(5)

Out[13]:

	pclass	sex	age	sibsp	parch	fare	cabin	embarked
385	2	male	8.0	1	1	36.750	n	S
154	1	male	55.0	1	1	93.500	B	S
323	2	male	30.0	1	0	24.000	n	C
572	2	female	28.0	0	0	12.650	n	S
809	3	male	18.0	2	2	34.375	n	S
