In this notebook, we present some typical ways of dealing with categorical variables by encoding them, namely ordinal encoding and one-hot encoding.
Let's first load the entire adult dataset containing both numerical and categorical data.
import pandas as pd
adult_census = pd.read_csv("../datasets/adult-census.csv")
# drop the duplicated column `"education-num"` as stated in the first notebook
adult_census = adult_census.drop(columns="education-num")
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name])
As we saw in the previous section, a numerical variable is a quantity represented by a real or integer number. These variables can be naturally handled by machine learning algorithms that are typically composed of a sequence of arithmetic instructions such as additions and multiplications.
In contrast, categorical variables have discrete values, typically represented by string labels (but not only) taken from a finite list of possible choices. For instance, the variable `native-country` in our dataset is a categorical variable because it encodes the data using a finite list of possible countries (along with the `?` symbol when this information is missing):
data["native-country"].value_counts().sort_index()
How can we easily recognize categorical columns in the dataset? Part of the answer lies in the columns' data type:
data.dtypes
If we look at the `"native-country"` column, we observe its data type is `object`, meaning it contains string values.
In the previous notebook, we manually defined the numerical columns. We could take a similar approach here. Instead, we can use the scikit-learn helper function `make_column_selector`, which allows us to select columns based on their data type. We now illustrate how to use this helper.
from sklearn.compose import make_column_selector as selector
categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)
categorical_columns
Here, we created the selector by passing the data type to include; we then passed the input dataset to the selector object, which returned a list of column names that have the requested data type. We can now filter out the unwanted columns:
data_categorical = data[categorical_columns]
data_categorical.head()
print(f"The dataset is composed of {data_categorical.shape[1]} features")
In the remainder of this section, we will present different strategies to encode categorical data into numerical data which can be used by a machine learning algorithm.

The most intuitive strategy is to encode each category with a different number. The `OrdinalEncoder` transforms the data in this way. We start by encoding a single column to understand how the encoding works.
from sklearn.preprocessing import OrdinalEncoder
education_column = data_categorical[["education"]]
encoder = OrdinalEncoder().set_output(transform="pandas")
education_encoded = encoder.fit_transform(education_column)
education_encoded
We see that each category in `"education"` has been replaced by a numeric value. We could check the mapping between the categories and the numerical values by checking the fitted attribute `categories_`.
encoder.categories_
Now, we can check the encoding applied to all categorical features.
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]
print(f"The dataset encoded contains {data_encoded.shape[1]} features")
We see that the categories have been encoded for each feature (column) independently. We also note that the number of features before and after the encoding is the same.
However, be careful when applying this encoding strategy: using this integer representation leads downstream predictive models to assume that the values are ordered (0 < 1 < 2 < 3... for instance).
By default, `OrdinalEncoder` uses a lexicographical strategy to map string category labels to integers. This strategy is arbitrary and often meaningless. For instance, suppose the dataset has a categorical variable named `"size"` with categories such as "S", "M", "L", "XL". We would like the integer representation to respect the meaning of the sizes by mapping them to increasing integers such as `0, 1, 2, 3`.

However, the lexicographical strategy used by default would map the labels "S", "M", "L", "XL" to 2, 1, 0, 3, by following the alphabetical order.
The `OrdinalEncoder` class accepts a `categories` constructor argument to pass categories in the expected ordering explicitly. You can find more information in the scikit-learn documentation if needed.
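To illustrate, here is a minimal sketch on a hypothetical `"size"` column (not part of the adult census dataset), passing the explicit ordering so that the encoded integers follow S < M < L < XL:

# toy example: explicit category ordering for a hypothetical "size" column
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["S", "XL", "M", "L", "S"]})
size_encoder = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])
size_encoder.fit_transform(sizes)  # maps S -> 0, M -> 1, L -> 2, XL -> 3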
If a categorical variable does not carry any meaningful order information, then this encoding might be misleading to downstream statistical models and you might consider using one-hot encoding instead (see below).
`OneHotEncoder` is an alternative encoder that prevents the downstream models from making false assumptions about the ordering of categories. For a given feature, it creates as many new columns as there are possible categories. For a given sample, the value of the column corresponding to its category is set to `1` while all the columns of the other categories are set to `0`.
We can encode a single feature (e.g. `"education"`) to illustrate how the encoding works.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False).set_output(transform="pandas")
education_encoded = encoder.fit_transform(education_column)
education_encoded
Note
`sparse_output=False` is used in the `OneHotEncoder` for didactic purposes, namely easier visualization of the data.
Sparse matrices are efficient data structures when most of your matrix elements are zero. They won't be covered in detail in this course. If you want more details about them, you can look at this.
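For reference, here is a small sketch (not part of the original notebook, reusing the education_column defined above) of what the default sparse output looks like; only the non-zero entries are actually stored:

# with the default sparse_output=True, the encoder returns a SciPy sparse matrix
from sklearn.preprocessing import OneHotEncoder

sparse_encoded = OneHotEncoder().fit_transform(education_column)
print(type(sparse_encoded))           # a SciPy sparse matrix
print(sparse_encoded[:5].toarray())   # densify a few rows for inspection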
We see that encoding a single feature gives a dataframe full of zeros and ones. Each category (unique value) became a column; the encoding returned, for each sample, a 1 to specify which category it belongs to.
Let's apply this encoding on the full dataset.
print(f"The dataset is composed of {data_categorical.shape[1]} features")
data_categorical.head()
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:5]
print(f"The encoded dataset contains {data_encoded.shape[1]} features")
Look at how the `"workclass"` variable of the 3 first records has been encoded and compare this to the original string representation.

The number of features after the encoding is more than 10 times larger than in the original data because some variables such as `occupation` and `native-country` have many possible categories.
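One quick way to verify this (a small sketch, not part of the original notebook) is to count the distinct categories per column; their sum matches the number of one-hot encoded columns:

# number of distinct categories per feature; the sum equals the number of
# one-hot encoded columns
data_categorical.nunique()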
Choosing an encoding strategy depends on the underlying models and the type of categories (i.e. ordinal vs. nominal).
Note
In general `OneHotEncoder` is the encoding strategy used when the downstream models are linear models, while `OrdinalEncoder` is often a good strategy with tree-based models.
Using an `OrdinalEncoder` outputs ordinal categories. This means that there is an order in the resulting categories (e.g. `0 < 1 < 2`). The impact of violating this ordering assumption is really dependent on the downstream models. Linear models would be impacted by misordered categories while tree-based models would not.
You can still use an `OrdinalEncoder` with linear models, but you need to be sure that:

- the original categories (before encoding) have a meaningful ordering;
- the encoded categories follow that same ordering.

The next exercise highlights the issue of misusing `OrdinalEncoder` with a linear model.
One-hot encoding categorical variables with high cardinality can cause computational inefficiency in tree-based models. Because of this, it is not recommended to use `OneHotEncoder` in such cases even if the original categories do not have a given order. We will show this in the final exercise of this sequence.
We can now integrate this encoder inside a machine learning pipeline like we did with numerical data: let's train a linear classifier on the encoded data and check the generalization performance of this machine learning pipeline using cross-validation.
Before we create the pipeline, we have to take a closer look at the `native-country` column. Let's recall some statistics regarding it.
data["native-country"].value_counts()
We see that the `"Holand-Netherlands"` category occurs rarely. This will be a problem during cross-validation: if the sample ends up in the test set during splitting, then the classifier would not have seen the category during training and would not be able to encode it.
In scikit-learn, there are some possible solutions to bypass this issue:

- list all the possible categories and provide them to the encoder via the keyword argument `categories` instead of letting the estimator automatically determine them from the training data when calling fit;
- set the parameter `handle_unknown="ignore"`, i.e. if an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros;
- adjust the `min_frequency` parameter to collapse the rarest categories observed in the training data into a single one-hot encoded feature. If you enable this option, you can also set `handle_unknown="infrequent_if_exist"` to encode the unknown categories (categories only observed at predict time) as ones in that last column.

In this notebook we only explore the second option, namely `OneHotEncoder(handle_unknown="ignore")`. Feel free to evaluate the alternatives on your own, for instance using a sandbox notebook.
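As a small sketch on toy data (not the census dataset), this is what the second option does with a category seen only at transform time:

# toy illustration of handle_unknown="ignore": an unseen category is encoded
# as a row of all zeros instead of raising an error
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_toy = pd.DataFrame({"country": ["France", "Portugal", "France"]})
test_toy = pd.DataFrame({"country": ["Holand-Netherlands"]})

toy_encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
toy_encoder.fit(train_toy)
toy_encoder.transform(test_toy)  # -> [[0., 0.]]: all-zero row for the unseen category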
Tip
Be aware that `OrdinalEncoder` also exposes a parameter named handle_unknown. It can be set to "use_encoded_value". If that option is chosen, you can define a fixed value that is assigned to all unknown categories during transform. For example, `OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)` would set to -1 all categories encountered during transform that were not part of the data seen during the fit call. You are going to use these parameters in the next exercise.
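For completeness, a minimal sketch on toy data (not the census dataset) of the `OrdinalEncoder` behavior described in the tip above:

# toy illustration: unknown categories at transform time are mapped to -1
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy_train = pd.DataFrame({"country": ["France", "Portugal"]})
ordinal_toy = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
ordinal_toy.fit(toy_train)
ordinal_toy.transform(pd.DataFrame({"country": ["Holand-Netherlands"]}))  # -> [[-1.]]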
We can now create our machine learning pipeline.
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
model = make_pipeline(
OneHotEncoder(handle_unknown="ignore"), LogisticRegression(max_iter=500)
)
Note
Here, we need to increase the maximum number of iterations to obtain a fully converged LogisticRegression and silence a ConvergenceWarning. Contrary to the numerical features, the one-hot encoded categorical features are all on the same scale (values are 0 or 1), so they would not benefit from scaling. In this case, increasing max_iter is the right thing to do.
Finally, we can check the model's generalization performance only using the categorical columns.
from sklearn.model_selection import cross_validate
cv_results = cross_validate(model, data_categorical, target)
cv_results
scores = cv_results["test_score"]
print(f"The accuracy is: {scores.mean():.3f} ± {scores.std():.3f}")
As you can see, this representation of the categorical variables is slightly more predictive of the revenue than the numerical variables that we used previously.
In this notebook we have:

- seen two common strategies for encoding categorical features: ordinal encoding and one-hot encoding;
- used a pipeline to apply a one-hot encoder before fitting a logistic regression.