# Winsorizer

The Winsorizer() caps the maximum and/or minimum values of a variable. It derives the capping values from the variable's distribution, using a rule suited to Gaussian or skewed data as indicated, and it can cap the right tail, the left tail or both.

The Winsorizer() works only with numerical variables. A list of variables can be indicated. Alternatively, the Winsorizer() will select all numerical variables in the train set.
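
To get a feel for which columns count as numerical, the sketch below uses pandas' `select_dtypes` on a made-up toy frame. That the Winsorizer() selects variables in exactly this way is an assumption, not something this notebook confirms. Columns cast to object, as `pclass` is later in this example, would be left out under this rule.

```python
import pandas as pd

# Toy frame with made-up values: two numerical and two object columns
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "fare": [7.25, 71.28, 7.92],
    "sex": ["male", "female", "female"],
    "cabin": ["n", "C", "n"],
})

# Keep only the columns with a numeric dtype
numerical = toy.select_dtypes(include="number").columns.tolist()
print(numerical)
```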

The Winsorizer() first calculates the capping values at the ends of the distribution. The values are determined using one of:

• a Gaussian approximation,
• the inter-quartile range (IQR) proximity rule,
• percentiles.

### Example

In [1]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split

from feature_engine.outliers import Winsorizer

In [2]:
# Load titanic dataset from OpenML

def load_titanic():
    data = pd.read_csv(
        'https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['embarked'] = data['embarked'].fillna('C')
    data['fare'] = data['fare'].astype('float')
    data['fare'] = data['fare'].fillna(data['fare'].median())
    data['age'] = data['age'].astype('float')
    data['age'] = data['age'].fillna(data['age'].median())
    data = data.drop(['name', 'ticket'], axis=1)
    return data

# To plot histogram of given numerical feature

def plot_hist(data, col):
    plt.figure(figsize=(8, 5))
    plt.hist(data[col], bins=30)
    plt.title("Distribution of " + col)
    plt.show()

In [3]:
# Loading titanic dataset
data = load_titanic()
data.sample(5)

Out[3]:
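| | pclass | survived | sex | age | sibsp | parch | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 157 | 1 | 0 | male | 28.0 | 0 | 0 | 51.8625 | E | S | NaN | NaN | Brighton, MA |
| 400 | 2 | 1 | female | 34.0 | 1 | 1 | 32.5000 | n | S | 10 | NaN | Greenport, NY |
| 546 | 2 | 1 | female | 28.0 | 0 | 0 | 13.0000 | n | S | 9 | NaN | Spain |
| 618 | 3 | 0 | male | 35.0 | 0 | 0 | 8.0500 | n | S | NaN | NaN | Lower Clapton, Middlesex or Erdington, Birmingham |
| 1208 | 3 | 0 | female | 9.0 | 3 | 2 | 27.9000 | n | S | NaN | NaN | NaN |

In [4]: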
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop('survived', axis=1),
    data['survived'],
    test_size=0.3,
    random_state=0)

print("train data:", X_train.shape)
print("test data:", X_test.shape)

train data: (916, 11)
test data: (393, 11)

In [5]:
# let's find out the maximum Age and maximum Fare in the titanic

print("Max age:", data.age.max())
print("Max fare:", data.fare.max())

Max age: 80.0
Max fare: 512.3292

In [6]:
# Histogram of age feature before capping outliers
plot_hist(data, 'age')

In [7]:
# Histogram of fare feature before capping outliers
plot_hist(data, 'fare')


### Capping: Gaussian

Gaussian limits:

• right tail: mean + 3 * std
• left tail: mean - 3 * std
In [8]:
'''Parameters
----------
capping_method : str, default='gaussian'
    Desired capping method. Can take 'gaussian', 'iqr' or 'quantiles'.

tail : str, default='right'
    Whether to cap outliers on the right, left or both tails of the distribution.
    Can take 'left', 'right' or 'both'.

fold : int or float, default=3
    How far out to place the capping values. The number that will multiply
    the std or IQR to calculate the capping values. Recommended values are 2
    or 3 for the gaussian approximation, or 1.5 or 3 for the IQR proximity
    rule.

variables : list, default=None
    The variables to cap. If None, all numerical variables in the train set
    are selected.

missing_values : str, default='raise'
    Indicates if missing values should be ignored or raised.
'''
# capping at right tail using gaussian capping method
capper = Winsorizer(
    capping_method='gaussian', tail='right', fold=3, variables=['age', 'fare'])

# fitting winsorizer object to training data
capper.fit(X_train)

Out[8]:
Winsorizer(variables=['age', 'fare'])
In [9]:
# here we can find the maximum caps allowed
capper.right_tail_caps_

Out[9]:
{'age': 67.49048447470315, 'fare': 174.78162171790441}
In [10]:
# this dictionary is empty, because we selected only right tail
capper.left_tail_caps_

Out[10]:
{}
In [11]:
# Histogram of age feature after capping outliers
plot_hist(capper.transform(X_train), 'age')

In [12]:
# transforming the training and testing data
train_t = capper.transform(X_train)
test_t = capper.transform(X_test)

# let's check the new maximum Age and maximum Fare in the titanic
train_t.age.max(), train_t.fare.max()

Out[12]:
(67.49048447470315, 174.78162171790441)
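
The cells above learn the caps from X_train only and then reuse them on X_test. A minimal sketch of that pattern with plain pandas follows. The values are made up and `Series.clip` stands in for the transform step, so this illustrates the idea rather than the library's code.

```python
import pandas as pd

# Made-up train and test values; the test set contains an extreme value
train = pd.Series([10.0, 12.0, 11.0, 13.0, 9.0, 14.0, 10.0, 12.0])
test = pd.Series([11.0, 500.0, 8.0])

# Learn the right cap from the training data only: mean + 3 * std
right_cap = train.mean() + 3 * train.std()

# Apply the same cap to both sets
train_capped = train.clip(upper=right_cap)
test_capped = test.clip(upper=right_cap)

print("cap:", right_cap)
print("max of capped test:", test_capped.max())
```

Because the cap comes from the training data alone, the extreme test value is pulled down to the training cap and no information from the test set leaks into the fit.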

### Gaussian approximation capping, both tails

In [13]:
# Capping the outliers at both tails using gaussian capping method

winsor = Winsorizer(capping_method='gaussian',
                    tail='both', fold=2, variables='fare')
winsor.fit(X_train)

Out[13]:
Winsorizer(fold=2, tail='both', variables=['fare'])
In [14]:
print("Minimum caps :", winsor.left_tail_caps_)

print("Maximum caps :", winsor.right_tail_caps_)

Minimum caps : {'fare': -62.30099726608475}
Maximum caps : {'fare': 127.36509792110658}

In [15]:
# Histogram of fare feature after capping outliers
plot_hist(winsor.transform(X_train), 'fare')

In [16]:
# transforming the training and testing data
train_t = winsor.transform(X_train)
test_t = winsor.transform(X_test)

print("Max fare:", train_t.fare.max())
print("Min fare:", train_t.fare.min())

Max fare: 127.36509792110658
Min fare: 0.0
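
The minimum fare stays at 0.0 even though the left cap is about -62.3, because a lower cap that lies below every observed value changes nothing. The sketch below shows this with made-up fares and the caps from the output above, rounded. `Series.clip` is used as a stand-in for the transform.

```python
import pandas as pd

# Made-up fares; none of them is negative
fare = pd.Series([0.0, 7.25, 8.05, 13.0, 26.0, 512.33])

# Caps rounded from the output above
left_cap, right_cap = -62.30, 127.37

capped = fare.clip(lower=left_cap, upper=right_cap)

print("Min fare:", capped.min())
print("Max fare:", capped.max())
```

Only the right tail is affected here. For a strictly non-negative, right-skewed variable such as fare, the Gaussian left cap is effectively inactive.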


### Inter Quartile Range, both tails

IQR limits:

• right tail: 75th percentile + 3 * IQR
• left tail: 25th percentile - 3 * IQR

where IQR is the inter-quartile range: 75th percentile - 25th percentile.

In [17]:
# capping at both tails using iqr capping method
winsor = Winsorizer(capping_method='iqr', tail='both',
                    variables=['age', 'fare'])

winsor.fit(X_train)

Out[17]:
Winsorizer(capping_method='iqr', tail='both', variables=['age', 'fare'])
In [18]:
winsor.left_tail_caps_

Out[18]:
{'age': -13.0, 'fare': -62.24179999999999}
In [19]:
winsor.right_tail_caps_

Out[19]:
{'age': 71.0, 'fare': 101.4126}
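
As a sanity check on the IQR rule, the two caps can be inverted to recover the quartiles they came from. Subtracting the left limit from the right limit gives (1 + 2 * fold) * IQR, so the IQR and both quartiles follow directly. The helper name `quartiles_from_caps` is hypothetical, and fold=3 is taken from the limits listed at the start of this section.

```python
def quartiles_from_caps(left_cap, right_cap, fold=3.0):
    """Invert the IQR proximity rule: return (q1, q3, iqr)."""
    # right_cap - left_cap = (q3 + fold*iqr) - (q1 - fold*iqr) = (1 + 2*fold) * iqr
    iqr = (right_cap - left_cap) / (1 + 2 * fold)
    q1 = left_cap + fold * iqr
    q3 = right_cap - fold * iqr
    return q1, q3, iqr


# Caps copied from the outputs above
caps = {
    "age": (-13.0, 71.0),
    "fare": (-62.24179999999999, 101.4126),
}

for name, (left, right) in caps.items():
    q1, q3, iqr = quartiles_from_caps(left, right)
    print(name, "Q1:", q1, "Q3:", q3, "IQR:", iqr)
```

For age this gives quartiles of 23 and 35 with an IQR of 12, which is consistent with caps of 23 - 36 = -13 and 35 + 36 = 71.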
In [20]:
# transforming the training and testing data

train_t = winsor.transform(X_train)
test_t = winsor.transform(X_test)

print("Max fare:", train_t.fare.max())
print("Min fare", train_t.fare.min())

Max fare: 101.4126
Min fare 0.0


### Percentiles or quantiles

With fold=0.02, as in the cell below, the caps are:

• right tail: 98th percentile
• left tail: 2nd percentile
In [21]:
# capping at both tails using quantiles capping method
winsor = Winsorizer(capping_method='quantiles', tail='both',
                    fold=0.02, variables=['age', 'fare'])

winsor.fit(X_train)

Out[21]:
Winsorizer(capping_method='quantiles', fold=0.02, tail='both',
variables=['age', 'fare'])
In [26]:
print("Minimum caps :", winsor.left_tail_caps_)

print("Maximum caps :", winsor.right_tail_caps_)

Minimum caps : {'age': 2.0, 'fare': 6.44125}
Maximum caps : {'age': 61.69999999999993, 'fare': 211.5}

In [24]:
# transforming the training and testing data
train_t = winsor.transform(X_train)
test_t = winsor.transform(X_test)

print("Max age:", train_t.age.max())
print("Min age", train_t.age.min())

Max age: 61.69999999999993
Min age 2.0

In [25]:
# Histogram of age feature after capping outliers
plot_hist(train_t, 'age')
