In this notebook, we will learn how to identify categorical variables in a tabular dataset and compare different strategies to encode them into numerical features that machine learning models can work with.
Let's first load the data as we did in the previous notebook.
import pandas as pd
df = pd.read_csv(
"https://www.openml.org/data/get_csv/1595261/adult-census.csv")
# Or use the local copy:
# df = pd.read_csv('../datasets/adult-census.csv')
target_name = "class"
target = df[target_name].to_numpy()
# "fnlwgt" is a census sampling weight rather than an informative feature,
# so we drop it together with the target column
data = df.drop(columns=[target_name, "fnlwgt"])
As we have seen in the previous section, a numerical variable is a continuous quantity represented by a real or integer number. These variables can be naturally handled by machine learning algorithms that are typically composed of a sequence of arithmetic instructions such as additions and multiplications.
In contrast, categorical variables have discrete values, typically represented by string labels taken from a finite list of possible choices. For instance, the variable `native-country` in our dataset is a categorical variable because it encodes the data using a finite list of possible countries (along with the `?` symbol when this information is missing):
data["native-country"].value_counts()
United-States                 43832
Mexico                          951
?                               857
Philippines                     295
Germany                         206
Puerto-Rico                     184
Canada                          182
El-Salvador                     155
India                           151
Cuba                            138
England                         127
China                           122
South                           115
Jamaica                         106
Italy                           105
Dominican-Republic              103
Japan                            92
Guatemala                        88
Poland                           87
Vietnam                          86
Columbia                         85
Haiti                            75
Portugal                         67
Taiwan                           65
Iran                             59
Nicaragua                        49
Greece                           49
Peru                             46
Ecuador                          45
France                           38
Ireland                          37
Hong                             30
Thailand                         30
Cambodia                         28
Trinadad&Tobago                  27
Outlying-US(Guam-USVI-etc)       23
Laos                             23
Yugoslavia                       23
Scotland                         21
Honduras                         20
Hungary                          19
Holand-Netherlands                1
Name: native-country, dtype: int64
In the remainder of this section, we will present different strategies to encode categorical data into numerical data which can be used by a machine learning algorithm. Let's check the data type of each column:
data.dtypes
age                int64
workclass         object
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object
# select the columns whose dtype is neither integer nor floating point,
# i.e. the object columns containing string categories
categorical_columns = [
    c for c in data.columns if data[c].dtype.kind not in ["i", "f"]]
categorical_columns
['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
data_categorical = data[categorical_columns]
data_categorical.head()
| | workclass | education | marital-status | occupation | relationship | race | sex | native-country |
|---|---|---|---|---|---|---|---|---|
| 0 | Private | 11th | Never-married | Machine-op-inspct | Own-child | Black | Male | United-States |
| 1 | Private | HS-grad | Married-civ-spouse | Farming-fishing | Husband | White | Male | United-States |
| 2 | Local-gov | Assoc-acdm | Married-civ-spouse | Protective-serv | Husband | White | Male | United-States |
| 3 | Private | Some-college | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | United-States |
| 4 | ? | Some-college | Never-married | ? | Own-child | White | Female | United-States |
print(
    f"The dataset is composed of {data_categorical.shape[1]} features"
)
The dataset is composed of 8 features
The most intuitive strategy is to encode each category with a different number. The `OrdinalEncoder` will transform the data in such a manner.
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
data_encoded = encoder.fit_transform(data_categorical)
print(
f"The dataset encoded contains {data_encoded.shape[1]} features")
data_encoded[:5]
The dataset encoded contains 8 features
array([[ 4.,  1.,  4.,  7.,  3.,  2.,  1., 39.],
       [ 4., 11.,  2.,  5.,  0.,  4.,  1., 39.],
       [ 2.,  7.,  2., 11.,  0.,  4.,  1., 39.],
       [ 4., 15.,  2.,  7.,  0.,  2.,  1., 39.],
       [ 0., 15.,  4.,  0.,  3.,  4.,  0., 39.]])
We can see that the categories have been encoded for each feature (column) independently. We can also note that the number of features before and after the encoding is the same.
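We can check which integer was assigned to each category through the `categories_` attribute of the fitted encoder: the integer code of a category is its position in the corresponding list. For instance, for the first encoded column:

# one array of categories per encoded column, in the column order of
# data_categorical; the code of a category is its index in this array
encoder.categories_[0]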
However, one has to be careful when using this encoding strategy. Using this integer representation can lead the downstream models to make the assumption that the categories are ordered: 0 is smaller than 1 which is smaller than 2, etc.
By default, `OrdinalEncoder` uses a lexicographical strategy to map string category labels to integers. This strategy is completely arbitrary and often meaningless. For instance, suppose the dataset has a categorical variable named "size" with categories such as "S", "M", "L", "XL". We would like the integer representation to respect the meaning of the sizes by mapping them to increasing integers such as 0, 1, 2, 3. However, the lexicographical strategy used by default would map the labels "S", "M", "L", "XL" to 2, 1, 0, 3, following the alphabetical order.
The `OrdinalEncoder` class accepts a `categories` constructor argument to pass in the correct ordering explicitly.
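As a minimal sketch, with a hypothetical "size" feature (not part of our dataset), we could pass the categories explicitly so that the integer codes follow the intended order S < M < L < XL:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# hypothetical toy feature used only for illustration
sizes = pd.DataFrame({"size": ["S", "XL", "M", "L"]})
# one list of ordered categories per encoded column
size_encoder = OrdinalEncoder(categories=[["S", "M", "L", "XL"]])
size_encoder.fit_transform(sizes)

Each label is now mapped to its rank in the list we provided: 0 for "S", 1 for "M", 2 for "L" and 3 for "XL".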
If a categorical variable does not carry any meaningful order information, then this encoding might be misleading to downstream statistical models and you might consider using one-hot encoding instead (see below).

Note however that the impact of violating this ordering assumption really depends on the downstream models (for instance, linear models are much more sensitive than models built from an ensemble of decision trees).
`OneHotEncoder` is an alternative encoder that prevents the downstream models from making false assumptions about the ordering of categories. For a given feature, it will create as many new columns as there are possible categories. For a given sample, the value of the column corresponding to its category will be set to 1 while all the columns of the other categories will be set to 0.
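Before applying it to our dataset, here is a minimal sketch on a hypothetical toy column (not part of our dataset) to illustrate the shape of the output: one new column per category, with a single 1 per row.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical toy feature used only for illustration
pets = pd.DataFrame({"pet": ["cat", "dog", "cat", "fish"]})
pet_encoder = OneHotEncoder(sparse=False)
print(pet_encoder.fit_transform(pets))
print(pet_encoder.categories_)

The three categories "cat", "dog" and "fish" each get their own column, and each row contains exactly one 1, in the column of its category.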
print(
f"The dataset is composed of {data_categorical.shape[1]} features"
)
data_categorical.head()
The dataset is composed of 8 features
| | workclass | education | marital-status | occupation | relationship | race | sex | native-country |
|---|---|---|---|---|---|---|---|---|
| 0 | Private | 11th | Never-married | Machine-op-inspct | Own-child | Black | Male | United-States |
| 1 | Private | HS-grad | Married-civ-spouse | Farming-fishing | Husband | White | Male | United-States |
| 2 | Local-gov | Assoc-acdm | Married-civ-spouse | Protective-serv | Husband | White | Male | United-States |
| 3 | Private | Some-college | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | United-States |
| 4 | ? | Some-college | Never-married | ? | Own-child | White | Female | United-States |
from sklearn.preprocessing import OneHotEncoder

# sparse=False makes the encoder return a dense numpy array instead of a
# sparse matrix, which is easier to inspect
encoder = OneHotEncoder(sparse=False)
data_encoded = encoder.fit_transform(data_categorical)
print(
f"The dataset encoded contains {data_encoded.shape[1]} features")
data_encoded
The dataset encoded contains 102 features
array([[0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 1., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.]])
Let's wrap this numpy array in a dataframe with informative column names as provided by the encoder object:
columns_encoded = encoder.get_feature_names(data_categorical.columns)
pd.DataFrame(data_encoded, columns=columns_encoded).head()
| | workclass_ ? | workclass_ Federal-gov | workclass_ Local-gov | workclass_ Never-worked | workclass_ Private | workclass_ Self-emp-inc | workclass_ Self-emp-not-inc | workclass_ State-gov | workclass_ Without-pay | education_ 10th | ... | native-country_ Portugal | native-country_ Puerto-Rico | native-country_ Scotland | native-country_ South | native-country_ Taiwan | native-country_ Thailand | native-country_ Trinadad&Tobago | native-country_ United-States | native-country_ Vietnam | native-country_ Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
5 rows × 102 columns
Look at how the "workclass" variable of the first 3 records has been encoded and compare this to the original string representation.
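As a quick check, we can slice the first 9 encoded columns, which correspond to the 9 "workclass" categories, for the first 3 records:

# the first 9 encoded columns correspond to the "workclass" categories
pd.DataFrame(data_encoded[:3, :9], columns=columns_encoded[:9])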
The number of features after the encoding is more than 10 times larger than in the original data because some variables such as `occupation` and `native-country` have many possible categories.
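We can double-check this count with a quick sketch: the total number of one-hot columns is the sum of the number of unique values in each encoded column.

# should match the number of encoded features reported above
sum(data[column].nunique() for column in categorical_columns)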
We can now integrate this encoder inside a machine learning pipeline, as we did with numerical data: let's train a linear classifier on the encoded data and check the performance of this pipeline using cross-validation.
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
model = make_pipeline(
OneHotEncoder(handle_unknown='ignore'),
LogisticRegression(solver='lbfgs', max_iter=1000))
scores = cross_val_score(model, data_categorical, target)
print(f"The different scores obtained are: \n{scores}")
The different scores obtained are:
[0.83453105 0.82961735 0.83519656]
print(f"The accuracy is: {scores.mean():.3f} +/- {scores.std():.3f}")
The accuracy is: 0.833 +/- 0.002
As you can see, this representation of the categorical variables of the data is slightly more predictive of the revenue than the numerical variables that we used previously.
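Note that we passed `handle_unknown='ignore'` to the encoder: during cross-validation, a category may show up in the validation fold without having been seen in the corresponding training fold. A minimal sketch with a hypothetical unseen category (toy values, not taken from our dataset) shows what this option does:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"workclass": ["Private", "Local-gov"]})
test = pd.DataFrame({"workclass": ["Never-worked"]})  # unseen at fit time

tolerant_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
tolerant_encoder.fit(train)
# the unseen category is encoded as a row of all zeros instead of raising
# an error, which would otherwise make cross-validation fail
tolerant_encoder.transform(test)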
Use the dedicated notebook to do this exercise.
In the previous sections, we saw that we need to treat data differently depending on their nature (i.e. numerical or categorical).
Scikit-learn provides a `ColumnTransformer` class which will send specific columns to a specific transformer, making it easy to fit a single predictive model on a dataset that combines both kinds of variables together (heterogeneously typed tabular data).
We can first define the columns depending on their data type:
binary_encoding_columns = ['sex']
one_hot_encoding_columns = [
'workclass', 'education', 'marital-status', 'occupation',
'relationship', 'race', 'native-country']
scaling_columns = [
'age', 'education-num', 'hours-per-week', 'capital-gain',
'capital-loss']
We can now create our `ColumnTransformer` by specifying a list of triplets (preprocessor name, transformer, columns). Finally, we can define a pipeline to stack this "preprocessor" with our classifier (logistic regression).
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
preprocessor = ColumnTransformer([
('binary-encoder', OrdinalEncoder(), binary_encoding_columns),
('one-hot-encoder', OneHotEncoder(handle_unknown='ignore'),
one_hot_encoding_columns),
('standard-scaler', StandardScaler(), scaling_columns)])
model = make_pipeline(
preprocessor, LogisticRegression(solver='lbfgs', max_iter=1000))
The final model is more complex than the previous models but still follows the same API:

- the `fit` method is called to preprocess the data and then train the classifier;
- the `predict` method makes predictions on new data;
- the `score` method is used to predict on the test data and compare the predictions to the expected test labels to compute the accuracy.

from sklearn.model_selection import train_test_split
data_train, data_test, target_train, target_test = train_test_split(
data, target, random_state=42)
model.fit(data_train, target_train)
model.predict(data_test)[:5]
array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' >50K'], dtype=object)
target_test[:5]
array([' <=50K', ' <=50K', ' >50K', ' <=50K', ' <=50K'], dtype=object)
data_test.head()
| | age | workclass | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7762 | 56 | Private | HS-grad | 9 | Divorced | Other-service | Unmarried | White | Female | 0 | 0 | 40 | United-States |
| 23881 | 25 | Private | HS-grad | 9 | Married-civ-spouse | Transport-moving | Own-child | Other | Male | 0 | 0 | 40 | United-States |
| 30507 | 43 | Private | Bachelors | 13 | Divorced | Prof-specialty | Not-in-family | White | Female | 14344 | 0 | 40 | United-States |
| 28911 | 32 | Private | HS-grad | 9 | Married-civ-spouse | Transport-moving | Husband | White | Male | 0 | 0 | 40 | United-States |
| 19484 | 39 | Private | Bachelors | 13 | Married-civ-spouse | Sales | Wife | White | Female | 0 | 0 | 30 | United-States |
model.score(data_test, target_test)
0.8577512079272787
This model can also be cross-validated as usual (instead of using a single train-test split):
scores = cross_val_score(model, data, target, cv=5)
print(f"The different scores obtained are: \n{scores}")
The different scores obtained are:
[0.85116184 0.8498311 0.84756347 0.85268223 0.85513923]
print(f"The accuracy is: {scores.mean():.3f} +- {scores.std():.3f}")
The accuracy is: 0.851 +- 0.003
The compound model has a higher predictive accuracy than the two models that used numerical and categorical variables in isolation.
Linear models are very nice because they are usually very cheap to train, small to deploy, fast to predict and give a good baseline.
However, it is often useful to check whether more complex models such as an ensemble of decision trees can lead to higher predictive performance.
In the following cell we try a scalable implementation of the Gradient Boosting Machine algorithm. For this class of models, and contrary to linear models, it is useless to scale the numerical features. Furthermore, it is both safe and significantly more computationally efficient to use an arbitrary integer encoding for the categorical variables, even if the ordering of the categories is arbitrary. Therefore we adapt the preprocessing pipeline as follows:
# this experimental import is required to enable HistGradientBoostingClassifier
# in the scikit-learn version used here
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
# For each categorical column, extract the list of all possible categories
# in some arbitrary order.
categories = [
    data[column].unique() for column in categorical_columns]
preprocessor = ColumnTransformer([
('categorical', OrdinalEncoder(categories=categories),
categorical_columns)], remainder="passthrough")
model = make_pipeline(preprocessor, HistGradientBoostingClassifier())
model.fit(data_train, target_train)
print(model.score(data_test, target_test))
0.8801080992547703
We can observe that we get significantly higher accuracy with the Gradient Boosting model. This is often the case whenever the dataset has a large number of samples and a limited number of informative features (e.g. fewer than 1000) with a mix of numerical and categorical variables.

This explains why Gradient Boosted Machines are very popular among data science practitioners who work with tabular data.
Exercise:

- Check that scaling the numerical features does not impact the speed or accuracy of `HistGradientBoostingClassifier`;
- Check that one-hot encoding the categorical variables does not improve the accuracy of `HistGradientBoostingClassifier` but slows down the training.

Use the dedicated notebook to do this exercise.