The ArbitraryDiscretiser() divides continuous numerical variables into contiguous intervals whose limits are arbitrarily defined by the user.
The user needs to enter a dictionary with variable names as keys, and a list of the limits of the intervals as values. For example: {'var1': [0, 10, 100, 1000], 'var2': [5, 10, 15, 20]}.
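To make the role of the interval limits concrete, here is a minimal sketch in plain Python. It is not the library's implementation. It assumes right-closed intervals, which matches the `Interval(..., closed='right')` outputs shown later in this notebook, and it assumes that values outside all intervals get no bin. The helper `assign_bin` is hypothetical and introduced only for illustration.

```python
from bisect import bisect_left

def assign_bin(value, limits):
    """Return the index of the right-closed interval containing value, or None."""
    # bisect_left counts the limits strictly smaller than value, so a value
    # equal to a limit falls into the interval that ends at that limit.
    idx = bisect_left(limits, value) - 1
    if idx < 0 or idx >= len(limits) - 1:
        return None  # assumption: values outside all intervals get no bin
    return idx

limits = [0, 10, 100, 1000]  # intervals: (0, 10], (10, 100], (100, 1000]
bins = [assign_bin(v, limits) for v in [5, 10, 10.5, 1000, 0, 2000]]
print(bins)  # [0, 0, 1, 2, None, None]
```

A list of four limits therefore defines three intervals. Using -inf and inf as the outer limits, as done further down, guarantees that every value falls into some interval.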
Note
For this demonstration, we use the Ames House Prices dataset produced by Professor Dean De Cock:
Dean De Cock (2011) Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project, Journal of Statistics Education, Vol.19, No. 3
http://jse.amstat.org/v19n3/decock.pdf
https://www.tandfonline.com/doi/abs/10.1080/10691898.2011.11889627
The version of the dataset used in this notebook can be obtained from Kaggle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.discretisation import ArbitraryDiscretiser
plt.rcParams["figure.figsize"] = [15,5]
data = pd.read_csv('housing.csv')
data.head()
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
# let's separate the data into training and testing sets
X = data.drop(["Id", "SalePrice"], axis=1)
y = data.SalePrice
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=0)
print("X_train :", X_train.shape)
print("X_test :", X_test.shape)
X_train : (1022, 79)
X_test : (438, 79)
# we will discretise two continuous variables
X_train[["LotArea", 'GrLivArea']].hist(bins=50)
plt.show()
The ArbitraryDiscretiser() works only with numerical variables. Before doing any transformation, the discretiser checks that the variables in the dictionary entered by the user are present in the training set and are cast as numerical.
Then it transforms the variables, that is, it sorts the values into the intervals, with the transform() method.
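The two checks described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the library's code: `check_binning_dict` is a hypothetical helper, a plain dict of column name to list of values stands in for the DataFrame, and the exception types are assumptions.

```python
from numbers import Number

def check_binning_dict(binning_dict, columns):
    """Raise if a variable is missing from the data or holds non-numeric values.

    `columns` is a plain dict of column name -> list of values, used here as a
    stand-in for a DataFrame.
    """
    for variable in binning_dict:
        if variable not in columns:
            raise KeyError(f"{variable} is not in the training set")
        # bool is a subclass of int, hence the explicit exclusion
        if not all(isinstance(v, Number) and not isinstance(v, bool)
                   for v in columns[variable]):
            raise TypeError(f"{variable} is not numerical")

train = {"LotArea": [9375, 2887, 7207], "Street": ["Pave", "Pave", "Pave"]}
check_binning_dict({"LotArea": [0, 4000, 8000]}, train)  # passes silently
```

In this sketch, passing `{"Street": [0, 1]}` raises a TypeError because the column holds strings, and passing a key that is not in `train` raises a KeyError.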
'''
Parameters
----------
binning_dict : dict
    The dictionary with the variable : interval limits pairs, provided by the user.
    A valid dictionary looks like this:
    binning_dict = {'var1': [0, 10, 100, 1000], 'var2': [5, 10, 15, 20]}.

return_object : bool, default=False
    Whether the numbers in the discrete variable should be returned as
    numeric or as object. The user decides based on whether they would like
    to continue engineering the variable as if it were numerical or
    categorical.

return_boundaries : bool, default=False
    Whether the output should be the interval boundaries. If True, it returns
    the interval boundaries. If False, it returns integers.
'''
atd = ArbitraryDiscretiser(binning_dict={"LotArea":[-np.inf,4000,8000,12000,16000,20000,np.inf],
"GrLivArea":[-np.inf,500,1000,1500,2000,2500,np.inf]})
atd.fit(X_train)
ArbitraryDiscretiser(binning_dict={'GrLivArea': [-inf, 500, 1000, 1500, 2000, 2500, inf], 'LotArea': [-inf, 4000, 8000, 12000, 16000, 20000, inf]}, return_boundaries=False, return_object=False)
# binner_dict_ contains the boundaries of the different bins
atd.binner_dict_
{'LotArea': [-inf, 4000, 8000, 12000, 16000, 20000, inf], 'GrLivArea': [-inf, 500, 1000, 1500, 2000, 2500, inf]}
train_t = atd.transform(X_train)
test_t = atd.transform(X_test)
# the below are the bins into which the observations were sorted
print(train_t['GrLivArea'].unique())
print(train_t['LotArea'].unique())
[4 2 1 0 3 5]
[2 0 1 3 5 4]
# here I put side by side the original variable and the transformed variable
tmp = pd.concat([X_train[["LotArea", 'GrLivArea']], train_t[["LotArea", 'GrLivArea']]], axis=1)
tmp.columns = ["LotArea", 'GrLivArea',"LotArea_binned", 'GrLivArea_binned']
tmp.head()
| | LotArea | GrLivArea | LotArea_binned | GrLivArea_binned |
---|---|---|---|---|
64 | 9375 | 2034 | 2 | 4 |
682 | 2887 | 1291 | 0 | 2 |
960 | 7207 | 858 | 1 | 1 |
1384 | 9060 | 1258 | 2 | 2 |
1100 | 8400 | 438 | 2 | 0 |
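Each integer code in the table above stands for the interval between two consecutive limits. The following sketch translates the LotArea codes back into their intervals and checks them against the original values. It assumes right-closed intervals, in line with the `closed='right'` outputs further down, and `interval_label` is a hypothetical helper, not part of feature_engine.

```python
import math

def interval_label(code, limits):
    """Return the right-closed interval that a bin code stands for."""
    return f"({limits[code]}, {limits[code + 1]}]"

lot_area_limits = [-math.inf, 4000, 8000, 12000, 16000, 20000, math.inf]
# (LotArea, LotArea_binned) pairs copied from the table above
rows = [(9375, 2), (2887, 0), (7207, 1), (9060, 2), (8400, 2)]
labels = [interval_label(code, lot_area_limits) for _, code in rows]
for (value, code), label in zip(rows, labels):
    print(value, code, label)
# 9375 2 (8000, 12000]
# 2887 0 (-inf, 4000]
# 7207 1 (4000, 8000]
# 9060 2 (8000, 12000]
# 8400 2 (8000, 12000]
```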
plt.subplot(1,2,1)
tmp.groupby('GrLivArea_binned')['GrLivArea'].count().plot.bar()
plt.ylabel('Number of houses')
plt.title('Number of observations per bin')
plt.subplot(1,2,2)
tmp.groupby('LotArea_binned')['LotArea'].count().plot.bar()
plt.ylabel('Number of houses')
plt.title('Number of observations per bin')
plt.show()
atd = ArbitraryDiscretiser(binning_dict={"LotArea": [-np.inf, 4000, 8000, 12000, 16000, 20000, np.inf],
"GrLivArea": [-np.inf, 500, 1000, 1500, 2000, 2500, np.inf]},
# to return the boundary limits
return_boundaries=True)
atd.fit(X_train)
ArbitraryDiscretiser(binning_dict={'GrLivArea': [-inf, 500, 1000, 1500, 2000, 2500, inf], 'LotArea': [-inf, 4000, 8000, 12000, 16000, 20000, inf]}, return_boundaries=True, return_object=False)
train_t = atd.transform(X_train)
test_t = atd.transform(X_test)
# with return_boundaries=True, the values are the intervals
# into which the observations were sorted
np.sort(np.ravel(train_t['GrLivArea'].unique()))
array([Interval(-inf, 500.0, closed='right'), Interval(500.0, 1000.0, closed='right'), Interval(1000.0, 1500.0, closed='right'), Interval(1500.0, 2000.0, closed='right'), Interval(2000.0, 2500.0, closed='right'), Interval(2500.0, inf, closed='right')], dtype=object)
np.sort(np.ravel(test_t['GrLivArea'].unique()))
array([Interval(500.0, 1000.0, closed='right'), Interval(1000.0, 1500.0, closed='right'), Interval(1500.0, 2000.0, closed='right'), Interval(2000.0, 2500.0, closed='right'), Interval(2500.0, inf, closed='right')], dtype=object)
# bar plot to show the intervals returned by the transformer
test_t.LotArea.value_counts(sort=False).plot.bar(figsize=(6,4))
plt.ylabel('Number of houses')
plt.title('Number of houses per interval')
plt.show()