This notebook contains code and comments from Section 8.4 of the book Ensemble Methods for Machine Learning. Please see the book for additional details on this topic. This notebook and code are released under the MIT license.
We wrap up this chapter by exploring encoding techniques for high-cardinality categorical features. The cardinality of a categorical feature is simply the number of unique categories in it. The number of categories is an important consideration in categorical encoding.
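For example, pandas can report a column's cardinality directly (a small illustrative sketch with a toy job-title column):

import pandas as pd
# A toy job-title column; its cardinality is the number of unique values
titles = pd.Series(['Intern', 'Engineer I', 'Engineer II', 'Intern', 'President and CEO'])
print(titles.nunique())   # 4 unique categories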
Real-world data sets often contain categorical string features, where feature values are strings. For example, consider a categorical feature of job titles at an organization. This feature can contain dozens to hundreds of job titles from ‘Intern’ to ‘President and CEO’, each with their own unique roles and responsibilities.
Such features contain a large number of categories and are inherently high-cardinality. This disqualifies encoding approaches such as one-hot encoding (because it increases feature dimension significantly) or ordinal encoding (because no natural ordering typically exists among job titles). What's more, in real-world data sets, such high-cardinality features are also often 'dirty': they contain many variations of the same class or concept (for example, misspellings or abbreviations).
To address this issue, we will need to determine categories (and how to encode them) by string similarity rather than by exact matching! The intuition behind this approach is to encode similar categories together in a way that a human might, to ensure that the downstream learning algorithm treats them similarly (as it should).
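To make this concrete, here is a minimal sketch of one such string similarity: Jaccard similarity over character 3-grams. This is a simplified stand-in for the n-gram similarity that dirty-cat computes; the helper functions below are illustrative, not part of the package.

def ngrams(s, n=3):
    # Pad with spaces so that word boundaries also contribute n-grams
    s = ' ' + s.lower() + ' '
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def string_similarity(a, b, n=3):
    # Jaccard similarity of the two n-gram sets: |A & B| / |A | B|
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

print(string_similarity('Firefighter/Rescuer II', 'Firefighter/Rescuer III'))  # high: variants of the same role
print(string_similarity('Firefighter/Rescuer II', 'Library Assistant I'))      # low: unrelated roles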
The dirty-cat package provides such functionality off-the-shelf and can be used seamlessly in modeling pipelines. The package provides three specialized encoders to handle so-called "dirty categories", which are essentially noisy and/or high-cardinality string categories:
- SimilarityEncoder, a version of one-hot encoding constructed using string similarities;
- GapEncoder, which encodes categories by considering frequently co-occurring substring combinations; and
- MinHashEncoder, which encodes categories by applying hashing techniques to substrings.

import pandas as pd
# # Pre-process according to the example in the dirty_cat github
# # https://dirty-cat.github.io/stable/auto_examples/01_dirty_categories.html#id2
# from dirty_cat.datasets import fetch_employee_salaries
# employee_salaries = fetch_employee_salaries()
# X = employee_salaries.X
# y = employee_salaries.y
# X['date_first_hired'] = pd.to_datetime(X['date_first_hired'])
# X['year_first_hired'] = X['date_first_hired'].apply(lambda x: x.year)
# # Get mask of rows with missing values in gender
# mask = X.isna()['gender']
# # And remove the lines accordingly
# X.dropna(subset=['gender'], inplace=True)
# y = y[~mask]
# X['salary'] = y
# X = X.drop(['date_first_hired', 'division', 'department'], axis=1)
# X = X.sample(frac=1)
# X.to_csv('./data/ch08/employee_salaries.csv', index=False)
df = pd.read_csv('./data/ch08/employee_salaries.csv')
df.head()
|   | gender | department_name | assignment_category | employee_position_title | underfilled_job_title | year_first_hired | salary |
|---|---|---|---|---|---|---|---|
| 0 | F | Department of Environmental Protection | Fulltime-Regular | Program Specialist II | NaN | 2013 | 75362.93 |
| 1 | F | Department of Recreation | Fulltime-Regular | Recreation Supervisor | NaN | 1997 | 79522.62 |
| 2 | F | Department of Transportation | Fulltime-Regular | Bus Operator | NaN | 2014 | 42053.83 |
| 3 | M | Fire and Rescue Services | Fulltime-Regular | Fire/Rescue Captain | NaN | 1995 | 114587.02 |
| 4 | F | Department of Public Libraries | Fulltime-Regular | Library Assistant I | NaN | 1996 | 55139.67 |
X, y = df.drop('salary', axis=1), df['salary'] # Split the data into features and targets
print(X.shape)
print('Number of categories')
for col in X.columns:
print('{0}: {1} categories'.format(col, df[col].nunique()))
from sklearn.model_selection import train_test_split
Xtrn, Xtst, ytrn, ytst = train_test_split(X, y, test_size=0.2)
(9211, 6)
Number of categories
gender: 2 categories
department_name: 37 categories
assignment_category: 2 categories
employee_position_title: 385 categories
underfilled_job_title: 83 categories
year_first_hired: 51 categories
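The employee_position_title feature, with 385 categories, is the high-cardinality feature we will target with similarity-based encoding. As a quick illustration of what such an encoder produces, we can fit a SimilarityEncoder on a handful of titles. The first three values below come from the data preview above; 'Program Specialist I' is a hypothetical near-duplicate added to show the effect.

from dirty_cat import SimilarityEncoder
titles = [['Program Specialist II'],
          ['Program Specialist I'],   # hypothetical near-duplicate
          ['Bus Operator'],
          ['Library Assistant I']]
print(SimilarityEncoder(similarity='ngram').fit_transform(titles).round(2))
# Each row is the vector of n-gram similarities to every unique category, so the
# two 'Program Specialist' variants receive nearly identical encodings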
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from dirty_cat import SimilarityEncoder, MinHashEncoder, GapEncoder
from xgboost import XGBRegressor
from sklearn.metrics import r2_score
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
lo_card = ['gender', 'department_name', 'assignment_category']
hi_card = ['employee_position_title']
continuous = ['year_first_hired']
encoders = [# OneHotEncoder(sparse=False),
SimilarityEncoder(similarity='ngram'),
MinHashEncoder(n_components=100),
GapEncoder(n_components=100)]
for encoder in encoders:
ensemble = XGBRegressor(objective='reg:squarederror', learning_rate=0.1,
n_estimators=100, max_depth=3)
preprocess = ColumnTransformer(
transformers=[('continuous', MinMaxScaler(), continuous),
('onehot-encode', OneHotEncoder(sparse=False), lo_card),
('sim-encode', encoder, hi_card)],
remainder='drop')
pipe = Pipeline(steps=[('preprocess', preprocess),
('train', ensemble)])
pipe.fit(Xtrn, ytrn)
ypred = pipe.predict(Xtst)
print('{0}: {1}'.format(encoder.__class__.__name__, r2_score(ytst, ypred)))
SimilarityEncoder: 0.8995625658800894
MinHashEncoder: 0.8996750692009536
GapEncoder: 0.8895356402510632
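All three dirty-cat encoders deliver comparable test-set performance here, with R² scores of roughly 0.89 to 0.90: SimilarityEncoder and MinHashEncoder are essentially tied, with GapEncoder slightly behind. Note that because train_test_split above was not seeded with a random_state, the exact scores will vary from run to run.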