The OutlierTrimmer() removes observations with outliers from the dataset.
It works only with numerical variables. A list of variables can be indicated. Alternatively, the OutlierTrimmer() will select all numerical variables.
The OutlierTrimmer() first calculates the maximum and/or minimum values beyond which a value will be considered an outlier, and thus removed.
Limits are determined using one of three methods: the Gaussian approximation (mean ± fold * standard deviation), the inter-quartile range proximity rule (quartiles ± fold * IQR), or the percentiles of the variable.
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from feature_engine.outliers import OutlierTrimmer
# Load titanic dataset from OpenML
def load_titanic():
    """Load the Titanic dataset from OpenML and apply minimal cleaning.

    Cleaning steps: replace '?' placeholders with NaN, reduce cabin to its
    deck letter, cast pclass to object, impute embarked/fare/age, and drop
    the free-text columns 'name' and 'ticket'.

    Returns
    -------
    pandas.DataFrame
        The cleaned dataset, one row per passenger.
    """
    data = pd.read_csv(
        'https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    # keep only the deck letter of the cabin (first character)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    # FIX: assign the result instead of calling Series.fillna(inplace=True)
    # on a column selection — that chained-inplace pattern is deprecated and
    # does not reliably update the parent DataFrame under copy-on-write.
    data['embarked'] = data['embarked'].fillna('C')
    data['fare'] = data['fare'].astype('float')
    data['fare'] = data['fare'].fillna(data['fare'].median())
    data['age'] = data['age'].astype('float')
    data['age'] = data['age'].fillna(data['age'].median())
    data.drop(['name', 'ticket'], axis=1, inplace=True)
    return data
# To plot histogram of given numerical feature
def plot_hist(data, col):
    """Show a 30-bin histogram of the numerical column *col* of *data*."""
    values = data[col]
    plt.figure(figsize=(8, 5))
    plt.hist(values, bins=30)
    plt.title("Distribution of " + col)
    return plt.show()
# Loading titanic dataset (downloads from OpenML, then cleans — see load_titanic)
data = load_titanic()
# peek at five random rows; the table shown below is this cell's output
data.sample(5)
pclass | survived | sex | age | sibsp | parch | fare | cabin | embarked | boat | body | home.dest | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
675 | 3 | 0 | male | 21.0 | 0 | 0 | 7.775 | n | S | NaN | NaN | Brennes, Norway New York |
558 | 2 | 1 | female | 18.0 | 0 | 2 | 13.000 | n | S | 16 | NaN | Finland / Minneapolis, MN |
194 | 1 | 0 | male | 30.0 | 0 | 0 | 26.000 | C | S | NaN | NaN | Brockton, MA |
217 | 1 | 0 | male | 64.0 | 0 | 0 | 26.000 | n | S | NaN | 263 | Isle of Wight, England |
473 | 2 | 0 | male | 28.0 | 0 | 0 | 0.000 | n | S | NaN | NaN | Belfast |
# let's separate into training and testing set
# (outlier limits must be learned on the train set only, to avoid leakage)
predictors = data.drop('survived', axis=1)
target = data['survived']
X_train, X_test, y_train, y_test = train_test_split(
    predictors,
    target,
    test_size=0.3,
    random_state=0,
)
print("train data shape before removing outliers:", X_train.shape)
print("test data shape before removing outliers:", X_test.shape)
train data shape before removing outliers: (916, 11) test data shape before removing outliers: (393, 11)
# inspect the extreme values of age and fare before any trimming
print("Max age:", data['age'].max())
print("Max fare:", data['fare'].max())
print("Min age:", data['age'].min())
print("Min fare:", data['fare'].min())
Max age: 80.0 Max fare: 512.3292 Min age: 0.1667 Min fare: 0.0
# Histogram of age feature before removing outliers
plot_hist(data, 'age')
# Histogram of fare feature before removing outliers (strongly right-skewed)
plot_hist(data, 'fare')
The transformer will find the maximum and / or minimum values to trim the variables using the Gaussian approximation.
'''Parameters
----------
capping_method : str, default=gaussian
Desired capping method. Can take 'gaussian', 'iqr' or 'quantiles'.
tail : str, default=right
Whether to cap outliers on the right, left or both tails of the distribution.
Can take 'left', 'right' or 'both'.
fold: int or float, default=3
How far out to place the capping values. The number that will multiply
the std or IQR to calculate the capping values.
variables : list, default=None
missing_values: string, default='raise'
Indicates if missing values should be ignored or raised.'''
# remove observations beyond the right tail of age and fare,
# using the gaussian method: limit = mean + fold * std
trimmer = OutlierTrimmer(
capping_method='gaussian', tail='right', fold=3, variables=['age', 'fare'])
# fitting trimmer object to training data: fit() only learns the limits
trimmer.fit(X_train)
OutlierTrimmer(variables=['age', 'fare'])
# the learned upper limits per variable, beyond which rows are dropped
trimmer.right_tail_caps_
{'age': 67.49048447470315, 'fare': 174.78162171790441}
# this dictionary is empty, because we selected only right tail
trimmer.left_tail_caps_
{}
# transforming the training and testing data (rows beyond the limits are removed)
train_t = trimmer.transform(X_train)
test_t = trimmer.transform(X_test)
# let's check the new maximum Age and maximum Fare in the titanic
print("Max age:", train_t.age.max())
print("Max fare:", train_t.fare.max())
Max age: 66.0 Max fare: 164.8667
print("train data shape after removing outliers:", train_t.shape)
print(f"{X_train.shape[0] - train_t.shape[0]} observations are removed\n")
print("test data shape after removing outliers:", test_t.shape)
print(f"{X_test.shape[0] - test_t.shape[0]} observations are removed")
train data shape after removing outliers: (887, 11) 29 observations are removed test data shape after removing outliers: (376, 11) 17 observations are removed
# Trimming the outliers at both tails using gaussian method
# (limits at mean +/- 2 * std, since fold=2)
trimmer = OutlierTrimmer(
capping_method='gaussian', tail='both', fold=2, variables=['fare', 'age'])
trimmer.fit(X_train)
OutlierTrimmer(fold=2, tail='both', variables=['fare', 'age'])
# learned lower and upper limits per variable
print("Minimum caps :", trimmer.left_tail_caps_)
print("Maximum caps :", trimmer.right_tail_caps_)
Minimum caps : {'fare': -62.30099726608475, 'age': 4.681562024142586} Maximum caps : {'fare': 127.36509792110658, 'age': 54.92869998459104}
# transforming the training and testing data
train_t = trimmer.transform(X_train)
test_t = trimmer.transform(X_test)
print("train data shape after removing outliers:", train_t.shape)
print(f"{X_train.shape[0] - train_t.shape[0]} observations are removed\n")
print("test data shape after removing outliers:", test_t.shape)
print(f"{X_test.shape[0] - test_t.shape[0]} observations are removed")
train data shape after removing outliers: (803, 11) 113 observations are removed test data shape after removing outliers: (334, 11) 59 observations are removed
The transformer will find the boundaries using the IQR proximity rule. IQR limits: upper limit = 75th quantile + (fold * IQR), lower limit = 25th quantile - (fold * IQR),
where IQR is the inter-quartile range: 75th quantile - 25th quantile.
# trimming at both tails using iqr capping method
# (fold is not specified, so the transformer's default multiplier is used)
trimmer = OutlierTrimmer(
capping_method='iqr', tail='both', variables=['age', 'fare'])
trimmer.fit(X_train)
OutlierTrimmer(capping_method='iqr', tail='both', variables=['age', 'fare'])
# learned lower and upper limits per variable
print("Minimum caps :", trimmer.left_tail_caps_)
print("Maximum caps :", trimmer.right_tail_caps_)
Minimum caps : {'age': -13.0, 'fare': -62.24179999999999} Maximum caps : {'age': 71.0, 'fare': 101.4126}
# transforming the training and testing data
train_t = trimmer.transform(X_train)
test_t = trimmer.transform(X_test)
print("train data shape after removing outliers:", train_t.shape)
print(f"{X_train.shape[0] - train_t.shape[0]} observations are removed\n")
print("test data shape after removing outliers:", test_t.shape)
print(f"{X_test.shape[0] - test_t.shape[0]} observations are removed")
train data shape after removing outliers: (857, 11) 59 observations are removed test data shape after removing outliers: (365, 11) 28 observations are removed
The limits are given directly by the percentiles of each variable: with fold=0.02, values below the 2nd percentile and above the 98th percentile are treated as outliers and removed.
# trimming at both tails using quantiles capping method
# (fold=0.02 sets the limits at the 2nd and 98th percentiles)
trimmer = OutlierTrimmer(capping_method='quantiles',
tail='both', fold=0.02, variables=['age', 'fare'])
trimmer.fit(X_train)
OutlierTrimmer(capping_method='quantiles', fold=0.02, tail='both', variables=['age', 'fare'])
# learned lower and upper percentile limits per variable
print("Minimum caps :", trimmer.left_tail_caps_)
print("Maximum caps :", trimmer.right_tail_caps_)
Minimum caps : {'age': 2.0, 'fare': 6.44125} Maximum caps : {'age': 61.69999999999993, 'fare': 211.5}
# transforming the training and testing data
train_t = trimmer.transform(X_train)
test_t = trimmer.transform(X_test)
print("train data shape after removing outliers:", train_t.shape)
print(f"{X_train.shape[0] - train_t.shape[0]} observations are removed\n")
print("test data shape after removing outliers:", test_t.shape)
print(f"{X_test.shape[0] - test_t.shape[0]} observations are removed")
train data shape after removing outliers: (852, 11) 64 observations are removed test data shape after removing outliers: (358, 11) 35 observations are removed
# Histogram of age feature after removing outliers
plot_hist(train_t, 'age')
# Histogram of fare feature after removing outliers
plot_hist(train_t, 'fare')