This notebook contains code and comments from Section 8.4 of the book Ensemble Methods for Machine Learning. Please see the book for additional details on this topic. This notebook and code are released under the MIT license.
We wrap up this chapter by exploring encoding techniques for high-cardinality categorical features. The cardinality of a categorical feature is simply the number of unique categories in it. The number of categories is an important consideration in categorical encoding.
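For example, pandas can report a column's cardinality directly (a small illustrative sketch with a toy job-title column):

import pandas as pd
# A toy job-title column; its cardinality is the number of unique values
titles = pd.Series(['Intern', 'Engineer I', 'Engineer II', 'Intern', 'President and CEO'])
print(titles.nunique())   # 4 unique categories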
Real-world data sets often contain categorical string features, where feature values are strings. For example, consider a categorical feature of job titles at an organization. This feature can contain dozens to hundreds of job titles from ‘Intern’ to ‘President and CEO’, each with their own unique roles and responsibilities.
Such features contain a large number of categories and are inherently high-cardinality. This disqualifies encoding approaches such as one-hot encoding (because it increases feature dimension significantly) or ordinal encoding (because no natural ordering typically exists among job titles). What's more, in real-world data sets, such high-cardinality features are also often 'dirty': they contain many variations of the same class or concept (for example, misspellings or abbreviations).
To address this issue, we will need to determine categories (and how to encode them) by string similarity rather than by exact matching! The intuition behind this approach is to encode similar categories together in a way that a human might, to ensure that the downstream learning algorithm treats them similarly (as it should).
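To make this concrete, here is a minimal sketch of one such string similarity: Jaccard similarity over character 3-grams. This is a simplified stand-in for the n-gram similarity that dirty-cat computes; the helper functions below are illustrative, not part of the package.

def ngrams(s, n=3):
    # Pad with spaces so that word boundaries also contribute n-grams
    s = ' ' + s.lower() + ' '
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def string_similarity(a, b, n=3):
    # Jaccard similarity of the two n-gram sets: |A & B| / |A | B|
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb)

print(string_similarity('Firefighter/Rescuer II', 'Firefighter/Rescuer III'))  # high: variants of the same role
print(string_similarity('Firefighter/Rescuer II', 'Library Assistant I'))      # low: unrelated roles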
The dirty-cat package provides such functionality off-the-shelf and can be used seamlessly in modeling pipelines. The package provides three specialized encoders to handle so-called "dirty categories", which are essentially noisy and/or high-cardinality string categories:
- SimilarityEncoder, a version of one-hot encoding constructed using string similarities;
- GapEncoder, which encodes categories by considering frequently co-occurring substring combinations; and
- MinHashEncoder, which encodes categories by applying hashing techniques to substrings.

import pandas as pd
# # Pre-process according to the example in the dirty_cat github
# # https://dirty-cat.github.io/stable/auto_examples/01_dirty_categories.html#id2
# from dirty_cat.datasets import fetch_employee_salaries
# employee_salaries = fetch_employee_salaries()
# X = employee_salaries.X
# y = employee_salaries.y
# X['date_first_hired'] = pd.to_datetime(X['date_first_hired'])
# X['year_first_hired'] = X['date_first_hired'].apply(lambda x: x.year)
# # Get mask of rows with missing values in gender
# mask = X.isna()['gender']
# # And remove the lines accordingly
# X.dropna(subset=['gender'], inplace=True)
# y = y[~mask]
# X['salary'] = y
# X = X.drop(['date_first_hired', 'division', 'department'], axis=1)
# X = X.sample(frac=1)
# X.to_csv('./data/ch08/employee_salaries.csv', index=False)
df = pd.read_csv('./data/ch08/employee_salaries.csv')
df.head()
|   | gender | department_name | assignment_category | employee_position_title | underfilled_job_title | year_first_hired | salary |
|---|---|---|---|---|---|---|---|
| 0 | F | Department of Environmental Protection | Fulltime-Regular | Program Specialist II | NaN | 2013 | 75362.93 |
| 1 | F | Department of Recreation | Fulltime-Regular | Recreation Supervisor | NaN | 1997 | 79522.62 |
| 2 | F | Department of Transportation | Fulltime-Regular | Bus Operator | NaN | 2014 | 42053.83 |
| 3 | M | Fire and Rescue Services | Fulltime-Regular | Fire/Rescue Captain | NaN | 1995 | 114587.02 |
| 4 | F | Department of Public Libraries | Fulltime-Regular | Library Assistant I | NaN | 1996 | 55139.67 |
X, y = df.drop('salary', axis=1), df['salary'] # Split the data into features and targets
print(X.shape)
print('Number of categories')
for col in X.columns:
print('{0}: {1} categories'.format(col, df[col].nunique()))
from sklearn.model_selection import train_test_split
Xtrn, Xtst, ytrn, ytst = train_test_split(X, y, test_size=0.2)
(9211, 6)
Number of categories
gender: 2 categories
department_name: 37 categories
assignment_category: 2 categories
employee_position_title: 385 categories
underfilled_job_title: 83 categories
year_first_hired: 51 categories
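The employee_position_title feature, with 385 categories, is the high-cardinality feature we will target with similarity-based encoding. As a quick illustration of what such an encoder produces, we can fit a SimilarityEncoder on a handful of titles. The first three values below come from the data preview above; 'Program Specialist I' is a hypothetical near-duplicate added to show the effect.

from dirty_cat import SimilarityEncoder
titles = [['Program Specialist II'],
          ['Program Specialist I'],   # hypothetical near-duplicate
          ['Bus Operator'],
          ['Library Assistant I']]
print(SimilarityEncoder(similarity='ngram').fit_transform(titles).round(2))
# Each row is the vector of n-gram similarities to every unique category, so the
# two 'Program Specialist' variants receive nearly identical encodings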
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from dirty_cat import SimilarityEncoder, MinHashEncoder, GapEncoder
from xgboost import XGBRegressor
from sklearn.metrics import r2_score
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
lo_card = ['gender', 'department_name', 'assignment_category']
hi_card = ['employee_position_title']
continuous = ['year_first_hired']
encoders = [# OneHotEncoder(sparse=False),
SimilarityEncoder(similarity='ngram'),
MinHashEncoder(n_components=100),
GapEncoder(n_components=100)]
for encoder in encoders:
ensemble = XGBRegressor(objective='reg:squarederror', learning_rate=0.1,
n_estimators=100, max_depth=3)
preprocess = ColumnTransformer(
transformers=[('continuous', MinMaxScaler(), continuous),
('onehot-encode', OneHotEncoder(sparse=False), lo_card),
('sim-encode', encoder, hi_card)],
remainder='drop')
pipe = Pipeline(steps=[('preprocess', preprocess),
('train', ensemble)])
pipe.fit(Xtrn, ytrn)
ypred = pipe.predict(Xtst)
print('{0}: {1}'.format(encoder.__class__.__name__, r2_score(ytst, ypred)))
SimilarityEncoder: 0.8995625658800894
MinHashEncoder: 0.8996750692009536
GapEncoder: 0.8895356402510632
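All three dirty-cat encoders deliver comparable test-set performance here, with R² scores of roughly 0.89 to 0.90: SimilarityEncoder and MinHashEncoder are essentially tied, with GapEncoder slightly behind. Note that because train_test_split above was not seeded with a random_state, the exact scores will vary from run to run.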