OneHotEncoder¶

Performs One Hot Encoding.

The encoder can select how many different labels per variable to encode into binaries. When top_categories is set to None, all the categories will be transformed in binary variables.

However, when top_categories is set to an integer, for example 10, then only the 10 most popular categories will be transformed into binary, and the rest will be discarded.

The encoder has also the possibility to create binary variables from all categories (drop_last = False), or remove the binary for the last category (drop_last = True), for use in linear models.

Finally, the encoder has the option to drop the second dummy variable for binary variables. That is, if a categorical variable has 2 unique values, for example colour = ['black', 'white'], setting the parameter drop_last_binary=True, will automatically create only 1 binary for this variable, for example colour_black.

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from feature_engine.encoding import OneHotEncoder

In [2]:

# Load titanic dataset from OpenML

def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['age'] = data['age'].astype('float')
    data['fare'] = data['fare'].astype('float')
    data['embarked'].fillna('C', inplace=True)
    data.drop(labels=['boat', 'body', 'home.dest'], axis=1, inplace=True)
    return data

In [3]:

data = load_titanic()
data.head()

Out[3]:

	pclass	survived	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked
0	1	1	Allen, Miss. Elisabeth Walton	female	29.0000	0	0	24160	211.3375	B	S
1	1	1	Allison, Master. Hudson Trevor	male	0.9167	1	2	113781	151.5500	C	S
2	1	0	Allison, Miss. Helen Loraine	female	2.0000	1	2	113781	151.5500	C	S
3	1	0	Allison, Mr. Hudson Joshua Creighton	male	30.0000	1	2	113781	151.5500	C	S
4	1	0	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.0000	1	2	113781	151.5500	C	S

In [4]:

X = data.drop(['survived', 'name', 'ticket'], axis=1)
y = data.survived

In [5]:

# we will encode the below variables, they have no missing values
X[['cabin', 'pclass', 'embarked']].isnull().sum()

Out[5]:

cabin       0
pclass      0
embarked    0
dtype: int64

In [6]:

''' Make sure that the variables are type (object).
if not, cast it as object , otherwise the transformer will either send an error (if we pass it as argument) 
or not pick it up (if we leave variables=None). '''

X[['cabin', 'pclass', 'embarked']].dtypes

Out[6]:

cabin       object
pclass      object
embarked    object
dtype: object

In [7]:

# let's separate into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train.shape, X_test.shape

Out[7]:

((916, 8), (393, 8))

One hot encoding consists in replacing the categorical variable by a combination of binary variables which take value 0 or 1, to indicate if a certain category is present in an observation.

Each one of the binary variables are also known as dummy variables. For example, from the categorical variable "Gender" with categories 'female' and 'male', we can generate the boolean variable "female", which takes 1 if the person is female or 0 otherwise. We can also generate the variable male, which takes 1 if the person is "male" and 0 otherwise.

The encoder has the option to generate one dummy variable per category, or to create dummy variables only for the top n most popular categories, that is, the categories that are shown by the majority of the observations.

If dummy variables are created for all the categories of a variable, you have the option to drop one category not to create information redundancy. That is, encoding into k-1 variables, where k is the number if unique categories.

The encoder will encode only categorical variables (type 'object'). A list of variables can be passed as an argument. If no variables are passed as argument, the encoder will find and encode categorical variables (object type).

Note:¶

New categories in the data to transform, that is, those that did not appear in the training set, will be ignored (no binary variable will be created for them).

Create all k dummy variables, top_categories=False¶

In [8]:

'''
Parameters
----------

top_categories: int, default=None
    If None, a dummy variable will be created for each category of the variable.
    Alternatively, top_categories indicates the number of most frequent categories
    to encode. Dummy variables will be created only for those popular categories
    and the rest will be ignored. Note that this is equivalent to grouping all the
    remaining categories in one group.
    
variables : list
    The list of categorical variables that will be encoded. If None, the  
    encoder will find and select all object type variables.
    
drop_last: boolean, default=False
    Only used if top_categories = None. It indicates whether to create dummy
    variables for all the categories (k dummies), or if set to True, it will
    ignore the last variable of the list (k-1 dummies).
'''

ohe_enc = OneHotEncoder(top_categories=None,
                        variables=['pclass', 'cabin', 'embarked'],
                        drop_last=False)
ohe_enc.fit(X_train)

Out[8]:

OneHotEncoder(variables=['pclass', 'cabin', 'embarked'])

In [9]:

ohe_enc.encoder_dict_

Out[9]:

{'pclass': [2, 3, 1],
 'cabin': ['n', 'E', 'C', 'D', 'B', 'A', 'F', 'T', 'G'],
 'embarked': ['S', 'C', 'Q']}

In [10]:

train_t = ohe_enc.transform(X_train)
test_t = ohe_enc.transform(X_train)

test_t.head()

Out[10]:

	sex	age	sibsp	parch	fare	pclass_2	pclass_3	cabin_n	embarked_S	embarked_C	embarked_Q
501	female	13.0	0	1	19.5000	1	0	1	1	0	0
588	female	4.0	1	1	23.0000	1	0	1	1	0	0
402	female	30.0	1	0	13.8583	1	0	1	0	1	0
1193	male	NaN	0	0	7.7250	0	1	1	0	0	1
686	female	22.0	0	0	7.7250	0	1	1	0	0	1

Selecting top_categories to encode¶

In [11]:

ohe_enc = OneHotEncoder(top_categories=2,
                        variables=['pclass', 'cabin', 'embarked'],
                        drop_last=False)
ohe_enc.fit(X_train)

ohe_enc.encoder_dict_

Out[11]:

{'pclass': [3, 1], 'cabin': ['n', 'C'], 'embarked': ['S', 'C']}

In [12]:

train_t = ohe_enc.transform(X_train)
test_t = ohe_enc.transform(X_train)
test_t.head()

Out[12]:

	sex	age	sibsp	parch	fare	pclass_3	cabin_n	embarked_S	embarked_C
501	female	13.0	0	1	19.5000	0	1	1	0
588	female	4.0	1	1	23.0000	0	1	1	0
402	female	30.0	1	0	13.8583	0	1	0	1
1193	male	NaN	0	0	7.7250	1	1	0	0
686	female	22.0	0	0	7.7250	1	1	0	0

Dropping the last category for linear models¶

In [13]:

ohe_enc = OneHotEncoder(top_categories=None,
                        variables=['pclass', 'cabin', 'embarked'],
                        drop_last=True)

ohe_enc.fit(X_train)

ohe_enc.encoder_dict_

Out[13]:

{'pclass': [2, 3],
 'cabin': ['n', 'E', 'C', 'D', 'B', 'A', 'F', 'T'],
 'embarked': ['S', 'C']}

In [14]:

train_t = ohe_enc.transform(X_train)
test_t = ohe_enc.transform(X_train)

test_t.head()

Out[14]:

	sex	age	sibsp	parch	fare	pclass_2	pclass_3	cabin_n	embarked_S	embarked_C
501	female	13.0	0	1	19.5000	1	0	1	1	0
588	female	4.0	1	1	23.0000	1	0	1	1	0
402	female	30.0	1	0	13.8583	1	0	1	0	1
1193	male	NaN	0	0	7.7250	0	1	1	0	0
686	female	22.0	0	0	7.7250	0	1	1	0	0

Automatically select categorical variables¶

This encoder selects all the categorical variables, if None is passed to the variable argument when calling the encoder.

In [15]:

ohe_enc = OneHotEncoder(top_categories=None,
                        drop_last=True)

ohe_enc.fit(X_train)

Out[15]:

OneHotEncoder(drop_last=True)

In [16]:

# the parameter variables is None
ohe_enc.variables

In [17]:

# but the attribute variables_ has the categorical variables 
# that will be encoded

ohe_enc.variables_

Out[17]:

['pclass', 'sex', 'cabin', 'embarked']

In [18]:

# and we can also find which variables from those
# are binary

ohe_enc.variables_binary_

Out[18]:

['sex']

In [19]:

train_t = ohe_enc.transform(X_train)
test_t = ohe_enc.transform(X_train)

test_t.head()

Out[19]:

	age	sibsp	parch	fare	pclass_2	pclass_3	sex_female	cabin_n	embarked_S	embarked_C
501	13.0	0	1	19.5000	1	0	1	1	1	0
588	4.0	1	1	23.0000	1	0	1	1	1	0
402	30.0	1	0	13.8583	1	0	1	1	0	1
1193	NaN	0	0	7.7250	0	1	0	1	0	0
686	22.0	0	0	7.7250	0	1	1	1	0	0

Automatically create 1 dummy from binary variables (sex)¶

We can encode categorical variables that have more than 2 categories into k dummies, and, at the same time, encode categorical variables that have 2 categories only in 1 dummy. The second 1 is completely redundant.

We do so as follows:

In [20]:

ohe_enc = OneHotEncoder(top_categories=None,
                        drop_last=False,
                        drop_last_binary=True,
                        )

ohe_enc.fit(X_train)

Out[20]:

OneHotEncoder(drop_last_binary=True)

In [21]:

# the encoder dictionary
ohe_enc.encoder_dict_

Out[21]:

{'pclass': [2, 3, 1],
 'sex': ['female'],
 'cabin': ['n', 'E', 'C', 'D', 'B', 'A', 'F', 'T', 'G'],
 'embarked': ['S', 'C', 'Q']}

In [22]:

# and we can also find which variables from those
# are binary

ohe_enc.variables_binary_

Out[22]:

['sex']

In [23]:

train_t = ohe_enc.transform(X_train)
test_t = ohe_enc.transform(X_train)

test_t.head()

Out[23]:

	age	sibsp	parch	fare	pclass_2	pclass_3	sex_female	cabin_n	embarked_S	embarked_C	embarked_Q
501	13.0	0	1	19.5000	1	0	1	1	1	0	0
588	4.0	1	1	23.0000	1	0	1	1	1	0	0
402	30.0	1	0	13.8583	1	0	1	1	0	1	0
1193	NaN	0	0	7.7250	0	1	0	1	0	0	1
686	22.0	0	0	7.7250	0	1	1	1	0	0	1

In [ ]: