Data is from the 2016 Banco Santander competition on Kaggle.
Ciphered customer data is supplied with the goal of predicting customer satisfaction. The target flag is encoded as 1 for an unsatisfied customer and 0 otherwise.
(From the perspective of philosophy of logic, the above encoding reflects our rejection of the Law of Excluded Middle for this binary classification. That is to say, it is possible for a customer to be labelled in class 0 without being unconditionally satisfied. Ultimately though, this is just semantic quibbling in the spirit of Wittgenstein, not Russell.)
Note on HTML jump links: Open in nbviewer to use jump links.
Running the latest builds of:
! pip install -U scikit-learn
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (1.2.2)
! pip install -U xgboost
Requirement already satisfied: xgboost in /usr/local/lib/python3.10/dist-packages (1.7.5)
! pip install -U lightgbm
Requirement already satisfied: lightgbm in /usr/local/lib/python3.10/dist-packages (3.3.5)
! pip install catboost
Successfully installed catboost-1.1.1
! pip install -U imbalanced-learn
Requirement already satisfied: imbalanced-learn in /usr/local/lib/python3.10/dist-packages (0.10.1)
import numpy as np
import pandas as pd
import time
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from xgboost import cv as XGB_CV
from xgboost import DMatrix
#plotting
from matplotlib import pyplot as plt
import seaborn as sns
from IPython.display import clear_output
# cluster analysis
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# models
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import StackingClassifier, VotingClassifier
# processing pipeline
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
sns.set_theme()
We import our data, stored as an Apache parquet file.
bank=pd.read_parquet('santander_train.parquet')
data=bank.copy()
bank.shape
(76020, 371)
We have around 76k records with 371 features.
bank.sample(10,random_state=1)
ID | var3 | var15 | imp_ent_var16_ult1 | imp_op_var39_comer_ult1 | imp_op_var39_comer_ult3 | imp_op_var40_comer_ult1 | imp_op_var40_comer_ult3 | imp_op_var40_efect_ult1 | imp_op_var40_efect_ult3 | ... | saldo_medio_var33_hace2 | saldo_medio_var33_hace3 | saldo_medio_var33_ult1 | saldo_medio_var33_ult3 | saldo_medio_var44_hace2 | saldo_medio_var44_hace3 | saldo_medio_var44_ult1 | saldo_medio_var44_ult3 | var38 | TARGET | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14162 | 28459 | 2 | 25 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 46969.410000 | 1 |
35732 | 71476 | 2 | 33 | 0.0 | 930.21 | 1391.55 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 307194.780000 | 0 |
24191 | 48386 | 2 | 25 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 109659.060000 | 0 |
10440 | 20945 | 2 | 23 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 71302.530000 | 0 |
46585 | 93165 | 2 | 24 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 117310.979016 | 0 |
46064 | 92159 | 2 | 25 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 103667.040000 | 0 |
27661 | 55359 | 2 | 62 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 195088.740000 | 0 |
36671 | 73262 | 2 | 41 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 59630.010000 | 0 |
70885 | 141557 | 158 | 65 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 83606.940000 | 0 |
72468 | 144712 | 2 | 24 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 74988.840000 | 0 |
10 rows × 371 columns
The only recognizable columns are the ID column and the TARGET column. The rest of the attributes are ciphered, according to the source linked above.
bank.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 76020 entries, 0 to 76019 Columns: 371 entries, ID to TARGET dtypes: float64(111), int64(260) memory usage: 215.2 MB
The data frame requires around 215 MB of memory.
# null check
bank.isna().sum().sum()
0
# duplicate row check
bank.duplicated().sum()
0
There are no null entries or duplicated rows.
bank['TARGET'].value_counts(normalize=True)
0 0.960431 1 0.039569 Name: TARGET, dtype: float64
This dataset is imbalanced, with only 4% of records in the positive class. Thankfully, most of our customers are not unsatisfied. On the other hand, this imbalance will make detection and modeling more delicate.
# find constant columns
const_col=[col for col in bank.columns if bank[col].std()==0]
# find duplicate columns (bank.T.duplicated() flags later occurrences)
dup_bool=bank.T.duplicated()
dups=dup_bool[dup_bool].index.tolist()
remove=const_col+dups
print(f'There are {len(remove)} columns to remove.')
There are 96 columns to remove.
We found 96 columns that are either constant or a duplicate of another column. Note that this is an inclusive OR: a constant column can also be flagged as a duplicate of another constant column, so the list may contain repeats and fewer than 96 unique columns may actually be dropped.
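Since the combined list can contain repeats, deduplicating it first makes the count unambiguous; a small sketch (dropping by the deduplicated list removes exactly the same columns):

# count unique columns flagged for removal
remove=sorted(set(const_col)|set(dups))
print(f'There are {len(remove)} unique columns to remove.')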
We separate the data into predictive features and our target. Then we split into training data and validation data. There is no need to reserve data for final evaluation, as we have that data stored in a separate file.
X=bank.drop(['ID','TARGET']+remove,axis=1)
y=bank['TARGET']
# split into training and validation sets
X_train,X_val,y_train,y_val=train_test_split(
X,
y,
test_size=0.3,
stratify=y,
random_state=57
)
We reserve 30% of the data for our validation set.
In order to demonstrate the PCA transformation that follows, we first need to scale our data. Scaling will shortly be incorporated into a preprocessing pipeline, which supersedes this standalone step.
scaler=StandardScaler().set_output(transform='pandas')
X_ts=scaler.fit_transform(X_train)
X_ts.describe().T.head()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
var3 | 53214.0 | 5.608072e-18 | 1.000009 | -27.956373 | 0.035750 | 0.035750 | 0.035750 | 0.042356 |
var15 | 53214.0 | -2.281951e-16 | 1.000009 | -2.172472 | -0.785866 | -0.477731 | 0.446674 | 5.530898 |
imp_ent_var16_ult1 | 53214.0 | -6.008649e-18 | 1.000009 | -0.049542 | -0.049542 | -0.049542 | -0.049542 | 117.774053 |
imp_op_var39_comer_ult1 | 53214.0 | -5.207496e-18 | 1.000009 | -0.219330 | -0.219330 | -0.219330 | -0.219330 | 26.065719 |
imp_op_var39_comer_ult3 | 53214.0 | 1.215082e-17 | 1.000009 | -0.220725 | -0.220725 | -0.220725 | -0.220725 | 38.221859 |
Note that every attribute has mean approximately 0 and standard deviation approximately 1.
Principal Component Analysis is, at heart, an eigendecomposition of the data's covariance matrix. In effect, it is a coordinate transformation where the new axes point along directions of variance in the data. Moreover, these axes, or components, are ordered by decreasing explained variance. Many of the trailing components can thus be discarded, as they generally contribute little to the explained variance. In this way, PCA can be used for dimension reduction.
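As a minimal illustration of these mechanics (on synthetic data, not our training set), the components and their explained-variance ratios fall out of an eigendecomposition of the covariance matrix:

# PCA by hand on synthetic data: eigendecompose the covariance matrix
# and sort the components by explained variance
rng=np.random.default_rng(0)
X_demo=rng.normal(size=(500,3))*np.array([2.0,1.0,0.1]) # three variance scales
X_demo-=X_demo.mean(axis=0)                 # center the data
cov=np.cov(X_demo,rowvar=False)             # sample covariance matrix
eigvals,eigvecs=np.linalg.eigh(cov)         # eigendecomposition (ascending order)
order=np.argsort(eigvals)[::-1]             # reorder by descending variance
explained_ratio=eigvals[order]/eigvals.sum()
print(explained_ratio)

Up to numerical noise, explained_ratio matches what `PCA(n_components=3).fit(X_demo).explained_variance_ratio_` would report.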
print(f'Number of features: {X_ts.shape[1]}.')
Number of features: 306.
Before dimension reduction, we have 306 features in our dataset.
# major reduction test
pca37=PCA(n_components=37)
pca37.fit(X_ts)
PCA(n_components=37)
plt.title('Cumulative variance explained by eigenvectors',fontsize=15)
plt.step(
np.arange(1,38),
np.cumsum(pca37.explained_variance_ratio_),
where='mid'
)
plt.xlabel('Number of Eigenvectors')
plt.ylabel('Cumulative Variance');
We find that 37 components explain about 75% of our variance: not enough. Moreover, we can see graphically that the right side of the curve is still increasing, not levelling off. We'll need more components.
We know that 306 features can explain 100% of the variance in our data. Can we get away with fewer?
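As an aside, scikit-learn can pick the component count for us: passing a float between 0 and 1 as `n_components` keeps the smallest number of components whose cumulative explained variance reaches that fraction. A quick sketch:

# let PCA choose the smallest number of components explaining >=99% of variance
pca_auto=PCA(n_components=0.99)
pca_auto.fit(X_ts)
print(f'{pca_auto.n_components_} components explain '
      f'{pca_auto.explained_variance_ratio_.sum()*100:.2f}% of the variance.')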
# about 1/3 the size
pca123=PCA(n_components=123)
pca123.fit(X_ts)
PCA(n_components=123)
evr=pca123.explained_variance_ratio_
plt.title(f'Cumulative explained variance (reaches {np.round(sum(evr)*100,2)}%)',fontsize=15)
plt.step(
np.arange(1,124),
np.cumsum(evr),
where='mid'
)
plt.xlabel('Number of Eigenvectors')
plt.ylabel('Cumulative Variance');
We can explain nearly 100% of the variance in our data with just a third of the components. This decreases the memory required to store the data and massively reduces computation time during training.
We now incorporate scaling and PCA into a preprocessing pipeline.
# preprocessing pipe
pre=Pipeline(
steps=[
('Scaler',StandardScaler()),
('Dimension_Reduction',PCA(n_components=123))
]
).set_output(transform='pandas')
X_ts=pre.fit_transform(X_train)
X_vs=pre.transform(X_val)
We fit the pipeline on our training data and then use it to transform our validation data. This approach ensures the integrity of our analysis by preventing data leakage.
a=X_ts.memory_usage().sum()/X_train.memory_usage().sum()
print(f'Memory usage reduced to {np.round(a*100,2)}% of original data frame.')
Memory usage reduced to 40.39% of original data frame.
As expected, PCA reduced memory usage by roughly 60%.
y_train.value_counts(normalize=True)
0 0.960424 1 0.039576 Name: TARGET, dtype: float64
Only around 4% of the supplied data belongs to the positive class. We oversample to balance the classes using SMOTE, which synthesizes new minority-class points by interpolating between existing minority samples and their nearest minority neighbors.
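To make the mechanics concrete before calling the library, here is a toy sketch of the SMOTE idea (a hypothetical smote_sketch helper, not imbalanced-learn's API): each synthetic point is an interpolation between a minority sample and one of its k nearest minority neighbors.

from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min,n_new,k=5,seed=1):
    '''Toy SMOTE: interpolate between minority samples and their neighbors.
    X_min is a 2-D array of minority-class feature rows.'''
    rng=np.random.default_rng(seed)
    nn=NearestNeighbors(n_neighbors=k+1).fit(X_min)
    _,idx=nn.kneighbors(X_min)       # idx[:,0] is each point itself
    new=[]
    for _ in range(n_new):
        i=rng.integers(len(X_min))   # pick a random minority sample
        j=rng.choice(idx[i,1:])      # and one of its k minority neighbors
        lam=rng.random()             # interpolation factor in [0,1)
        new.append(X_min[i]+lam*(X_min[j]-X_min[i]))
    return np.array(new)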
smote=SMOTE(
sampling_strategy='not majority',
random_state=1,
k_neighbors=5
)
# oversampled training data
Xt_over,yt_over=smote.fit_resample(X_ts,y_train)
# re-scale data
Xt_over=scaler.fit_transform(Xt_over)
yt_over.value_counts(normalize=True)
0 0.5 1 0.5 Name: TARGET, dtype: float64
Target classes are now balanced.
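One caution worth noting: because we oversample outside of cross validation, the grid searches below will see synthetic points in their validation folds. A sketch of the leak-free alternative, using imbalanced-learn's pipeline (which applies samplers to the training folds only):

from imblearn.pipeline import Pipeline as ImbPipeline

cv_pipe=ImbPipeline(
    steps=[
        ('Scaler',StandardScaler()),
        ('Dimension_Reduction',PCA(n_components=123)),
        ('SMOTE',SMOTE(random_state=1)),
        ('Model',LGBMClassifier(random_state=1))
    ]
)
# cv_pipe can be passed straight to GridSearchCV; resampling then happens
# inside each training fold and never touches the scoring folds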
This first function collects training and validation scores for a given model. It provides an option to output the scores in an easy-to-read Pandas DataFrame.
def get_scores(model,sample=None,output=None):
'''Collect model scores.'''
# define training data
if sample=='over':
X_t=Xt_over
y_t=yt_over
else:
X_t=X_ts
y_t=y_train
# predictions
y_t_hat=model.predict(X_t)
y_v_hat=model.predict(X_vs)
# collect scores
train_scores=[
metrics.recall_score(y_t,y_t_hat),
metrics.fbeta_score(y_t,y_t_hat,beta=2),
metrics.f1_score(y_t,y_t_hat),
metrics.roc_auc_score(y_t,y_t_hat),
metrics.zero_one_loss(y_t,y_t_hat)
]
val_scores=[
metrics.recall_score(y_val,y_v_hat),
metrics.fbeta_score(y_val,y_v_hat,beta=2),
metrics.f1_score(y_val,y_v_hat),
metrics.roc_auc_score(y_val,y_v_hat),
metrics.zero_one_loss(y_val,y_v_hat)
]
# output scores in pandas df
if output=='pandas':
df=pd.DataFrame(
[train_scores,val_scores],
columns=[
'Recall',
'F_beta',
'F1',
'AUC',
'0-1_Loss'
],
index=['train','val']
)
return df
return [train_scores,val_scores]
The next function displays a confusion matrix of model predictions on validation data.
def confusion_heatmap(model,show_scores=True):
'''Heatmap of confusion matrix for
model performance on validation data.'''
actual=y_val
predicted=model.predict(X_vs)
# generate confusion matrix
cm=metrics.confusion_matrix(actual,predicted)
cm=np.flip(cm).T
# heatmap labels
labels=['TP','FP','FN','TN']
cm_labels=np.array(cm).flatten()
cm_percents=np.round((cm_labels/np.sum(cm))*100,3)
annot_labels=[]
for i in range(4):
annot_labels.append(str(labels[i])+'\nCount:'+str(cm_labels[i])+'\n'+str(cm_percents[i])+'%')
annot_labels=np.array(annot_labels).reshape(2,2)
# print figure
plt.figure(figsize=(8,5))
plt.title('Confusion Matrix (Validation Data)',fontsize=20)
sns.heatmap(data=cm,
annot=annot_labels,
annot_kws={'fontsize':'x-large'},
xticklabels=[1,0],
yticklabels=[1,0],
cmap='Greens',
fmt='s')
plt.xlabel('Actual',fontsize=14)
plt.ylabel('Predicted',fontsize=14)
plt.tight_layout();
# scores
if show_scores==True:
scores=['Accuracy','Precision','Recall','F1']
score_list=[metrics.accuracy_score(actual,predicted),
metrics.precision_score(actual,predicted),
metrics.recall_score(actual,predicted),
metrics.f1_score(actual,predicted)]
df=pd.DataFrame(index=scores)
df['Val. Scores']=score_list
return df
return
# alias function name to something shorter
ch=confusion_heatmap
Summary of vanilla model testing: No model performed well with default configurations on the regular data. Performance gains are only observed once we train on oversampled data.
models=[
'RandomForest',
'AdaBoost',
'XGBoost',
'LightGBM',
'CatBoost'
]
datasets=['train','val']
# generate MultiIndex object
mi=pd.MultiIndex.from_product(
iterables=[models,datasets],
names=['model','data']
)
# build comparison table
tab=pd.DataFrame(
columns=[
'Recall',
'F_beta',
'F1',
'AUC',
'0-1_Loss'
],
index=mi
)
The value of $\beta$ in the $F_\beta$ score allows us to bias the score between precision and recall. With $\beta=2$, we give recall twice the importance of precision.
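For reference, $F_\beta=(1+\beta^2)\cdot\frac{\text{precision}\,\cdot\,\text{recall}}{\beta^2\cdot\text{precision}+\text{recall}}$, so larger values of $\beta$ shift the weight toward recall.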
d=DummyClassifier(
strategy='stratified',
random_state=1
)
d.fit(X_ts,y_train)
DummyClassifier(random_state=1, strategy='stratified')
We fit a dummy classifier to set a performance baseline.
ch(d)
Val. Scores | |
---|---|
Accuracy | 0.925809 |
Precision | 0.042824 |
Recall | 0.041020 |
F1 | 0.041903 |
Predictably, this classifier yields high accuracy on our highly imbalanced data set. Its failing, however, is 4% recall.
rf=RandomForestClassifier(
random_state=1,
n_jobs=-1
)
rf.fit(X_ts,y_train)
RandomForestClassifier(n_jobs=-1, random_state=1)
tab.loc['RandomForest']=get_scores(rf)
tab.loc['RandomForest']
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
data | |||||
train | 0.902659 | 0.919868 | 0.946949 | 0.951251 | 0.004003 |
val | 0.042129 | 0.050211 | 0.070501 | 0.517914 | 0.043936 |
Random forest fares better on recall, though this is merely due to overfitting. Validation AUC barely clears the baseline 50%.
ch(rf)
Val. Scores | |
---|---|
Accuracy | 0.956064 |
Precision | 0.215909 |
Recall | 0.042129 |
F1 | 0.070501 |
We can see from the confusion matrix that the number of true positives detected is the same as the dummy classifier. Thus, the only improvement made here is in the classification of true negatives.
abc=AdaBoostClassifier(random_state=1)
abc.fit(X_ts,y_train)
AdaBoostClassifier(random_state=1)
tab.loc['AdaBoost']=get_scores(abc)
tab.loc['AdaBoost']
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
data | |||||
train | 0.006648 | 0.008286 | 0.013146 | 0.503226 | 0.039501 |
val | 0.002217 | 0.002765 | 0.004396 | 0.500972 | 0.039726 |
On the one hand, AdaBoost is not suffering the same overfitting issues as random forest. On the other, its performance is horrid.
ch(abc)
Val. Scores | |
---|---|
Accuracy | 0.960274 |
Precision | 0.250000 |
Recall | 0.002217 |
F1 | 0.004396 |
The confusion matrix shows that AdaBoost is just predicting 0 for essentially every observation, with only eight predicted in the positive class.
xgb=XGBClassifier(
random_state=1
)
xgb.fit(X_ts,y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=None, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=100, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=1, ...)
tab.loc['XGBoost']=get_scores(xgb)
tab.loc['XGBoost']
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
data | |||||
train | 0.405508 | 0.459931 | 0.57586 | 0.702695 | 0.02364 |
val | 0.011086 | 0.01365 | 0.020899 | 0.504516 | 0.041086 |
ch(xgb)
Val. Scores | |
---|---|
Accuracy | 0.958914 |
Precision | 0.181818 |
Recall | 0.011086 |
F1 | 0.020899 |
Overfitting plagues XGBoost too. We have yet to see a validation AUC appreciably climb above 50%.
lg=LGBMClassifier()
lg.fit(X_ts,y_train)
LGBMClassifier()
tab.loc['LightGBM']=get_scores(lg)
tab.loc['LightGBM']
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
data | |||||
train | 0.17189 | 0.205892 | 0.292762 | 0.585896 | 0.032867 |
val | 0.009978 | 0.0124 | 0.019502 | 0.504715 | 0.039683 |
ch(lg)
Val. Scores | |
---|---|
Accuracy | 0.960317 |
Precision | 0.428571 |
Recall | 0.009978 |
F1 | 0.019502 |
Comparable performance can be observed in default LightGBM, with overfitting and poor recall and AUC.
cb=CatBoostClassifier()
cb.fit(X_ts,y_train,verbose=False)
<catboost.core.CatBoostClassifier at 0x7f378bdf7790>
tab.loc['CatBoost']=get_scores(cb)
tab.loc['CatBoost']
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
data | |||||
train | 0.213675 | 0.253464 | 0.3517 | 0.606808 | 0.031176 |
val | 0.008869 | 0.011004 | 0.017223 | 0.504001 | 0.040033 |
ch(cb)
Val. Scores | |
---|---|
Accuracy | 0.959967 |
Precision | 0.296296 |
Recall | 0.008869 |
F1 | 0.017223 |
CatBoost is similarly deficient.
tab
Recall | F_beta | F1 | AUC | 0-1_Loss | ||
---|---|---|---|---|---|---|
model | data | |||||
RandomForest | train | 0.902659 | 0.919868 | 0.946949 | 0.951251 | 0.004003 |
val | 0.042129 | 0.050211 | 0.070501 | 0.517914 | 0.043936 | |
AdaBoost | train | 0.006648 | 0.008286 | 0.013146 | 0.503226 | 0.039501 |
val | 0.002217 | 0.002765 | 0.004396 | 0.500972 | 0.039726 | |
XGBoost | train | 0.405508 | 0.459931 | 0.57586 | 0.702695 | 0.02364 |
val | 0.011086 | 0.01365 | 0.020899 | 0.504516 | 0.041086 | |
LightGBM | train | 0.17189 | 0.205892 | 0.292762 | 0.585896 | 0.032867 |
val | 0.009978 | 0.0124 | 0.019502 | 0.504715 | 0.039683 | |
CatBoost | train | 0.213675 | 0.253464 | 0.3517 | 0.606808 | 0.031176 |
val | 0.008869 | 0.011004 | 0.017223 | 0.504001 | 0.040033 |
Hyperparameter tuning will not garner the performance improvements we need. Let's instead train the models on oversampled data.
Data oversampling using SMOTE (Synthetic Minority Over-sampling Technique).
# build comparison table
tab_over=pd.DataFrame(
columns=[
'Recall',
'F_beta',
'F1',
'AUC',
'0-1_Loss'
],
index=mi
)
rf_over=RandomForestClassifier(
random_state=1,
n_jobs=-1
)
rf_over.fit(Xt_over,yt_over)
RandomForestClassifier(n_jobs=-1, random_state=1)
tab_over.loc['RandomForest']=get_scores(rf_over,sample='over')
tab_over.loc['RandomForest']
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
data | |||||
train | 0.990158 | 0.990224 | 0.990323 | 0.990324 | 0.009676 |
val | 0.296009 | 0.239935 | 0.186844 | 0.60945 | 0.101903 |
ch(rf_over)
Val. Scores | |
---|---|
Accuracy | 0.898097 |
Precision | 0.136503 |
Recall | 0.296009 |
F1 | 0.186844 |
abc_over=AdaBoostClassifier(random_state=1)
abc_over.fit(Xt_over,yt_over)
AdaBoostClassifier(random_state=1)
tab_over.loc['AdaBoost']=get_scores(abc_over,sample='over')
tab_over.loc['AdaBoost']
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
data | |||||
train | 0.775906 | 0.770682 | 0.762978 | 0.758961 | 0.241039 |
val | 0.722838 | 0.32441 | 0.177584 | 0.729274 | 0.264799 |
ch(abc_over)
Val. Scores | |
---|---|
Accuracy | 0.735201 |
Precision | 0.101227 |
Recall | 0.722838 |
F1 | 0.177584 |
xgb_over=XGBClassifier(
tree_method='gpu_hist',
random_state=1
)
xgb_over.fit(Xt_over,yt_over)
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=None, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=100, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=1, ...)
tab_over.loc['XGBoost']=get_scores(xgb_over,sample='over')
tab_over.loc['XGBoost']
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
data | |||||
train | 0.950184 | 0.938345 | 0.921131 | 0.918643 | 0.081357 |
val | 0.51663 | 0.316533 | 0.200215 | 0.683283 | 0.163247 |
ch(xgb_over)
Val. Scores | |
---|---|
Accuracy | 0.836753 |
Precision | 0.124167 |
Recall | 0.516630 |
F1 | 0.200215 |
Notice here that far fewer false positives yield a higher F1 score than the previous AdaBoost model achieved.
lg_over=LGBMClassifier(
n_jobs=-1,
random_state=1
)
lg_over.fit(Xt_over,yt_over)
LGBMClassifier(random_state=1)
tab_over.loc['LightGBM']=get_scores(lg_over,sample='over')
tab_over.loc['LightGBM']
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
data | |||||
train | 0.897511 | 0.885921 | 0.869087 | 0.864806 | 0.135194 |
val | 0.615299 | 0.340825 | 0.204194 | 0.716822 | 0.189687 |
ch(lg_over)
Val. Scores | |
---|---|
Accuracy | 0.810313 |
Precision | 0.122408 |
Recall | 0.615299 |
F1 | 0.204194 |
cb_over=CatBoostClassifier(
task_type='GPU',
gpu_ram_part=0.9,
gpu_cat_features_storage='GpuRam',
random_seed=1
)
cb_over.fit(Xt_over,yt_over,verbose=False)
<catboost.core.CatBoostClassifier at 0x7f169c6741c0>
tab_over.loc['CatBoost']=get_scores(cb_over,sample='over')
tab_over.loc['CatBoost']
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
data | |||||
train | 0.897218 | 0.886961 | 0.872008 | 0.868308 | 0.131692 |
val | 0.599778 | 0.34068 | 0.206725 | 0.713352 | 0.182057 |
ch(cb_over)
Val. Scores | |
---|---|
Accuracy | 0.817943 |
Precision | 0.124885 |
Recall | 0.599778 |
F1 | 0.206725 |
tab_over
Recall | F_beta | F1 | AUC | 0-1_Loss | ||
---|---|---|---|---|---|---|
model | data | |||||
RandomForest | train | 0.990158 | 0.990224 | 0.990323 | 0.990324 | 0.009676 |
val | 0.296009 | 0.239935 | 0.186844 | 0.60945 | 0.101903 | |
AdaBoost | train | 0.775906 | 0.770682 | 0.762978 | 0.758961 | 0.241039 |
val | 0.722838 | 0.32441 | 0.177584 | 0.729274 | 0.264799 | |
XGBoost | train | 0.950184 | 0.938345 | 0.921131 | 0.918643 | 0.081357 |
val | 0.51663 | 0.316533 | 0.200215 | 0.683283 | 0.163247 | |
LightGBM | train | 0.897511 | 0.885921 | 0.869087 | 0.864806 | 0.135194 |
val | 0.615299 | 0.340825 | 0.204194 | 0.716822 | 0.189687 | |
CatBoost | train | 0.897218 | 0.886961 | 0.872008 | 0.868308 | 0.131692 |
val | 0.599778 | 0.34068 | 0.206725 | 0.713352 | 0.182057 |
Random forest is massively overfit.
AdaBoost appears to suffer the least from overfitting. Especially promising are the AUC scores: its validation AUC is higher than that of any other model in the table.
XGBoost is plagued by overfitting, but the validation scores show promise. With tuning, the overfitting might be controllable.
LightGBM and CatBoost show near-identical performance on every metric. Still, these models require much tuning and improvement.
params={
'n_estimators':np.arange(50,251,50),
'learning_rate':[0.5,1.0,2.0]
}
We will vary the number of estimators and the learning rate of our AdaBoost classifier.
abc_tuned1=AdaBoostClassifier(random_state=1)
go1=GridSearchCV(
estimator=abc_tuned1,
param_grid=params,
scoring=['recall','f1','roc_auc'],
refit='roc_auc',
cv=5,
n_jobs=-1,
verbose=1,
return_train_score=True
)
start=time.time()
go1.fit(Xt_over,yt_over)
print(f'Fit completed in {np.round((time.time()-start)/60,2)} minutes.')
Fitting 5 folds for each of 15 candidates, totalling 75 fits Fit completed in 48.8 minutes.
best_abc1=go1.best_params_
best_abc1
{'learning_rate': 1.0, 'n_estimators': 250}
abc_tuned1=AdaBoostClassifier(
random_state=1,
**best_abc1
)
abc_tuned1.fit(Xt_over,yt_over)
AdaBoostClassifier(n_estimators=250, random_state=1)
get_scores(abc_tuned1,sample='over',output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.792440 | 0.791087 | 0.789066 | 0.788164 | 0.211836 |
val | 0.667406 | 0.332010 | 0.189308 | 0.722856 | 0.226081 |
ch(abc_tuned1)
Val. Scores | |
---|---|
Accuracy | 0.773919 |
Precision | 0.110297 |
Recall | 0.667406 |
F1 | 0.189308 |
An attempt was made to train an AdaBoost classifier with XGBoost as the base estimator. After 5h 47m, the training was terminated. Given limited computing resources, this configuration was too expensive to be feasible.
lr=LogisticRegression(
random_state=1,
max_iter=1000
)
lr.fit(Xt_over,yt_over)
LogisticRegression(max_iter=1000, random_state=1)
get_scores(lr,sample='over',output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.758551 | 0.749218 | 0.735643 | 0.727411 | 0.272589 |
val | 0.739468 | 0.307119 | 0.163621 | 0.719442 | 0.299000 |
Performance here is interesting: recall and AUC are good and consistent between training and validation, indicating a reliable fit. However, precision, and consequently F1 and $F_\beta$, suffers massively when we move from training to validation.
I conjecture this is due to training data being oversampled but not validation data. This sampling is of course intentional, as we generally only need the oversampled data for model training. However, in the case of logistic regression, training on oversampled data artificially inflates the regression intercept, meaning performance suffers on non-oversampled data.
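One way to probe this conjecture is the standard prior-correction for case-control sampling: shift the fitted intercept by the log-odds gap between the oversampled positive rate and the true rate. A minimal sketch, assuming that correction applies here:

import copy
tau=y_train.mean()    # true positive rate in the training data (~0.04)
ybar=yt_over.mean()   # positive rate after SMOTE (0.5)
lr_adj=copy.deepcopy(lr)
# remove the log-odds inflation introduced by balancing the classes
lr_adj.intercept_=lr.intercept_-np.log((ybar/(1-ybar))*((1-tau)/tau))
get_scores(lr_adj,output='pandas')  # precision should recover at recall's expense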
# logistic regression on non-oversampled data
logit=LogisticRegression(max_iter=1000)
logit.fit(X_ts,y_train)
LogisticRegression(max_iter=1000)
get_scores(logit,output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.006173 | 0.007679 | 0.012110 | 0.502812 | 0.039858 |
val | 0.017738 | 0.021786 | 0.033126 | 0.507773 | 0.040954 |
Do note that logistic regression still sees massive improvements when trained on oversampled data. A quick look at a model trained on the original scaled data finds that every metric is at rock bottom. For instance, AUC is at 0.5, no better than guessing.
ch(lr)
Val. Scores | |
---|---|
Accuracy | 0.701000 |
Precision | 0.091987 |
Recall | 0.739468 |
F1 | 0.163621 |
While logistic regression suffers massive overprediction of false positives, it does have the advantage of providing stellar recall and AUC in both training and validation. We will try boosting this logistic regression with AdaBoost and optimizing for precision.
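As a cheaper first experiment, the same precision/recall trade can be explored by simply raising the decision threshold on the fitted logistic regression; a quick sketch:

# sweep the decision threshold on validation probabilities
proba=lr.predict_proba(X_vs)[:,1]
for thresh in [0.5,0.6,0.7,0.8]:
    pred=(proba>=thresh).astype(int)
    print(f'threshold={thresh}: '
          f'precision={metrics.precision_score(y_val,pred):.3f}, '
          f'recall={metrics.recall_score(y_val,pred):.3f}')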
abc_tuned3=AdaBoostClassifier(
estimator=LogisticRegression(
random_state=2,
max_iter=1000
),
random_state=1
)
abc_tuned3.fit(Xt_over,yt_over)
AdaBoostClassifier(estimator=LogisticRegression(max_iter=1000, random_state=2), random_state=1)
get_scores(abc_tuned3,sample='over',output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.740373 | 0.732576 | 0.721182 | 0.713763 | 0.286237 |
val | 0.722838 | 0.294410 | 0.155850 | 0.705900 | 0.309699 |
ch(abc_tuned3)
Val. Scores | |
---|---|
Accuracy | 0.690301 |
Precision | 0.087341 |
Recall | 0.722838 |
F1 | 0.155850 |
Before tuning, the AdaBoost results are comparable to the logistic regression trained on the same data. Note that the subsequent boosting rounds did not reduce the high rate of false positives. This casts doubt on the possibility of remedying the poor precision with some hyperparameter tuning.
go3=GridSearchCV(
estimator=abc_tuned3,
param_grid=params,
scoring='precision',
cv=5,
n_jobs=-1,
verbose=1,
return_train_score=True
)
go3.fit(Xt_over,yt_over)
Fitting 5 folds for each of 15 candidates, totalling 75 fits
GridSearchCV(cv=5, estimator=AdaBoostClassifier(estimator=LogisticRegression(max_iter=1000, random_state=2), random_state=1), n_jobs=-1, param_grid={'learning_rate': [0.5, 1.0, 2.0], 'n_estimators': array([ 50, 100, 150, 200, 250])}, return_train_score=True, scoring='precision', verbose=1)
best_abc3=go3.best_params_
best_abc3
{'learning_rate': 2.0, 'n_estimators': 150}
abc_tuned3=AdaBoostClassifier(
estimator=LogisticRegression(
random_state=2,
max_iter=1000
),
random_state=1,
**best_abc3
)
abc_tuned3.fit(Xt_over,yt_over)
AdaBoostClassifier(estimator=LogisticRegression(max_iter=1000, random_state=2), learning_rate=2.0, n_estimators=150, random_state=1)
get_scores(abc_tuned3,sample='over',output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.741078 | 0.733895 | 0.723378 | 0.716610 | 0.283390 |
val | 0.717295 | 0.295218 | 0.156810 | 0.705639 | 0.305095 |
Scores remain effectively unchanged. AdaBoost with the default base estimator still scored best on F1, and overall the tuned performance leaves much to be desired.
Consider, for example, the business ramifications of the AdaBoost model built on top of the logistic regression. With so many false positives, Santander would be investing in pleasing customers who are not unsatisfied. This is not a beneficial allocation of resources.
We will next try to improve our XGBoost Classifier with several rounds of cross-validated tuning.
params={
'eta':np.linspace(0.05,0.3,6),
'max_depth':np.arange(2,5),
'min_child_weight':[1,2],
'subsample':np.linspace(0.5,0.9,4),
'colsample_bytree':np.linspace(0.5,0.9,4)
}
We control the learning rate using `eta`.
To hopefully cut down on overfitting, we limit `max_depth`, which defaults to 6 in the standard XGBoost model.
We experiment with a more conservative algorithm by offering a larger value for `min_child_weight`. Options are 1, the algorithm default, and 2.
To build a more robust model, we limit both `subsample` and `colsample_bytree` to values less than 1. These subsample the rows and the features, respectively, during the tree-building process to combat overfitting.
xgb_tuned=XGBClassifier(
tree_method='gpu_hist',
random_state=1
)
go=GridSearchCV(
estimator=xgb_tuned,
param_grid=params,
scoring=['recall','f1','roc_auc'],
refit='roc_auc',
cv=5,
n_jobs=-1,
verbose=1,
return_train_score=True
)
start=time.time()
go.fit(Xt_over,yt_over)
print(f'Fit completed in {np.round((time.time()-start)/60,2)} minutes.')
Fitting 5 folds for each of 576 candidates, totalling 2880 fits Fit completed in 133.48 minutes.
best_xgb1=go.best_params_
best_xgb1
{'colsample_bytree': 0.9, 'eta': 0.3, 'max_depth': 4, 'min_child_weight': 2, 'subsample': 0.7666666666666666}
We find that the model with the greatest AUC subsampled around 77% of the data for each training instance. This, along with the reduced column sampling per tree, works to prevent overfitting.
xgb_tuned1=XGBClassifier(
tree_method='gpu_hist',
random_state=1,
**best_xgb1
)
xgb_tuned1.fit(Xt_over,yt_over)
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=0.9, early_stopping_rounds=None, enable_categorical=False, eta=0.3, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=4, max_leaves=None, min_child_weight=2, missing=nan, monotone_constraints=None, n_estimators=100, n_jobs=None, num_parallel_tree=None, predictor=None, ...)
ch(xgb_tuned1)
Val. Scores | |
---|---|
Accuracy | 0.813207 |
Precision | 0.124048 |
Recall | 0.614191 |
F1 | 0.206408 |
We already find fewer false positives than with the AdaBoost classifier.
get_scores(xgb_tuned1,sample='over',output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.892561 | 0.881539 | 0.865507 | 0.861304 | 0.138696 |
val | 0.614191 | 0.343077 | 0.206408 | 0.717797 | 0.186793 |
While validation recall is lower than the last AdaBoost model, both F1 and $F_\beta$ are higher. Moreover, the 0-1 loss is lower than all AdaBoost models, meaning this model made fewer mistakes overall.
Let's plot some findings from the cross validation performed during the grid search.
results=go.cv_results_
# figure setup
plt.figure(figsize=(12,8))
plt.title('Mean Recall, ROC-AUC, and F1 for max_depth',fontsize=20)
# for each split/metric pair, average the CV means over all grid candidates
# sharing a max_depth value, then plot (training dashed, validation solid)
depths=np.arange(2,5)
for split,prefix,style in [('train','Train','--'),('test','Val.','-')]:
    for metric,name in [('recall','Recall'),('roc_auc','AUC'),('f1','F1')]:
        a=pd.DataFrame(
            data=np.array([
                results['param_max_depth'],
                results[f'mean_{split}_{metric}']
            ]).T,
            columns=['depth','']
        ).groupby('depth').mean().values
        plt.plot(depths,
                 a,
                 label=f'{prefix} {name}',
                 linestyle=style,
                 drawstyle='steps-mid')
# axes and legend
plt.xlabel('Maximum Tree Depth')
plt.ylabel('Mean Score')
plt.legend(loc='lower right')
plt.show()
Note that recall and F1 trade places as the dominant metric as `max_depth` increases. When `max_depth=2`, we find F1 is greater than recall (both in training and validation). This inverts when we increment `max_depth`.
Notice further that the vertical gaps between training and validation averages increase with `max_depth`. This indicates that the risk of overfitting increases with `max_depth`.
ROC-AUC has the highest scores across the board. It too exhibits the tendency toward overfitting as `max_depth` increases.
plt.figure(figsize=(12,8))
plt.title('Recall, ROC-AUC, and F1 for eta',fontsize=20)
# recall train
sns.lineplot(x=results['param_eta'],
y=results['mean_train_recall'],
errorbar=('ci',0),
label='Train Recall',
linestyle='--')
# AUC train
sns.lineplot(x=results['param_eta'],
y=results['mean_train_roc_auc'],
errorbar=('ci',0),
label='Train AUC',
linestyle='--')
# F1 train
sns.lineplot(x=results['param_eta'],
y=results['mean_train_f1'],
errorbar=('ci',0),
label='Train F1',
linestyle='--')
# recall val
sns.lineplot(x=results['param_eta'],
y=results['mean_test_recall'],
errorbar=('ci',0),
label='Val. Recall')
# AUC val
sns.lineplot(x=results['param_eta'],
y=results['mean_test_roc_auc'],
errorbar=('ci',0),
label='Val. AUC')
# F1 val
sns.lineplot(x=results['param_eta'],
y=results['mean_test_f1'],
errorbar=('ci',0),
label='Val. F1')
plt.xlabel('eta (learning rate)')
plt.ylabel('Mean Score')
plt.legend(loc='lower right')
plt.show()
As the learning rate increases, so too do all scores. Interestingly, recall and F1 start about matched (with `eta=0.05`) and then diverge as the learning rate increases. AUC values are consistently higher than precision and recall, though one should be careful when directly comparing these metrics.
Now we will tune the `gamma` parameter, which sets the minimum loss reduction required to make a further split; larger values yield more conservative trees. We use cross validation to prevent overfitting our model on the validation data.
plt.figure(figsize=(15,15))
for i in range(9):
# collect params and set gamma
w=xgb_tuned1.get_params()
w['gamma']=2*i
# convert to XGB DMatrix format
dmat=DMatrix(
Xt_over,
yt_over,
enable_categorical=True
)
# cv
a=XGB_CV(
params=w,
dtrain=dmat,
num_boost_round=250,
nfold=5,
metrics={'auc'}
)
# subplot
plt.subplot(3,3,i+1)
plt.title(f'gamma={2*i}')
plt.plot(np.arange(250),a['train-auc-mean'],label='train')
plt.plot(np.arange(250),a['test-auc-mean'],label='test')
plt.legend(loc='lower right')
clear_output()
plt.show()
The shape of the plot for `gamma=8` looks good: it scores high and levels off around 100 rounds, which matches the number of estimators set in our boosting model.
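Relatedly, rather than eyeballing where the curves level off, `xgboost.cv` can stop boosting automatically once the held-out metric stalls. A sketch reusing `w` and `dmat` from the loop above (the patience of 25 rounds is an arbitrary choice, not a tuned value):

# let cross validation pick the boosting-round count via early stopping
a=XGB_CV(
    params={**w,'gamma':8},
    dtrain=dmat,
    num_boost_round=500,
    nfold=5,
    metrics={'auc'},
    early_stopping_rounds=25   # stop if test AUC stalls for 25 rounds
)
print(f'Best number of rounds: {len(a)}')  # result is trimmed to the best iteration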
w=xgb_tuned1.get_params()
w['gamma']=8
xgb_tuned2=XGBClassifier(**w)
xgb_tuned2.fit(Xt_over,yt_over)
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=0.9, early_stopping_rounds=None, enable_categorical=False, eta=0.3, eval_metric=None, feature_types=None, gamma=8, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=4, max_leaves=None, min_child_weight=2, missing=nan, monotone_constraints=None, n_estimators=100, n_jobs=None, num_parallel_tree=None, predictor=None, ...)
ch(xgb_tuned2)
Val. Scores | |
---|---|
Accuracy | 0.807594 |
Precision | 0.119930 |
Recall | 0.609756 |
F1 | 0.200437 |
Gamma tuning has not made a perceptible difference.
# gamma-tuned model
get_scores(xgb_tuned2,sample='over',output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.892346 | 0.880714 | 0.863824 | 0.859327 | 0.140673 |
val | 0.609756 | 0.335611 | 0.200437 | 0.712749 | 0.192406 |
# tuned model with default gamma
get_scores(xgb_tuned1,sample='over',output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.892561 | 0.881539 | 0.865507 | 0.861304 | 0.138696 |
val | 0.614191 | 0.343077 | 0.206408 | 0.717797 | 0.186793 |
Comparing the first tuned XGBoost model with the gamma-tuned model, we see no improvement. If anything, scores have marginally slipped.
params={
'max_bin':[100,150,200],
'min_gain_to_split':[0.001,0.005,0.01],
'feature_fraction':np.linspace(0.5,0.9,4),
'max_depth':np.arange(3,10,2)
}
We will experiment first with different values of `max_bin`, which controls the bucketing behavior of the model. The default is `max_bin=255`; lower values can negatively impact training accuracy but do generally cut down on overfitting.
With `min_gain_to_split`, we require a minimum gain to split two nodes. The default value is zero, so this should hopefully produce more meaningful divisions.
We subsample features in training instances using `feature_fraction`. This combats overfitting.
The general behavior of LightGBM is to let trees grow without limit. Restricting `max_depth` will contain tree growth with the goal of preventing overfitting.
lg_tuned=LGBMClassifier(random_state=1)
go=GridSearchCV(
estimator=lg_tuned,
param_grid=params,
scoring=['recall','f1','roc_auc'],
refit='roc_auc',
cv=5,
n_jobs=-1,
verbose=1,
return_train_score=True
)
start=time.time()
go.fit(Xt_over,yt_over)
print(f'Fit completed in {np.round((time.time()-start)/60,2)} minutes.')
Fitting 5 folds for each of 144 candidates, totalling 720 fits Fit completed in 82.99 minutes.
best_lg=go.best_params_
best_lg
{'feature_fraction': 0.7666666666666666, 'max_bin': 200, 'max_depth': 9, 'min_gain_to_split': 0.001}
lg_tuned=LGBMClassifier(
random_state=1,
**best_lg
)
lg_tuned.fit(Xt_over,yt_over)
LGBMClassifier(feature_fraction=0.7666666666666666, max_bin=200, max_depth=9, min_gain_to_split=0.001, random_state=1)
ch(lg_tuned)
Val. Scores | |
---|---|
Accuracy | 0.579409 |
Precision | 0.069717 |
Recall | 0.780488 |
F1 | 0.128000 |
get_scores(lg_tuned,sample='over',output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.893285 | 0.882073 | 0.865774 | 0.861509 | 0.138491 |
val | 0.780488 | 0.256822 | 0.128000 | 0.675808 | 0.420591 |
results=go.cv_results_
# figure setup
plt.figure(figsize=(12,8))
plt.title('Mean Recall, ROC-AUC, and F1 for max_depth',fontsize=20)
# for each split/metric pair, average the CV means over all grid candidates
# sharing a max_depth value, then plot (training dashed, validation solid)
depths=np.arange(3,10,2)
for split,prefix,style in [('train','Train','--'),('test','Val.','-')]:
    for metric,name in [('recall','Recall'),('roc_auc','AUC'),('f1','F1')]:
        a=pd.DataFrame(
            data=np.array([
                results['param_max_depth'],
                results[f'mean_{split}_{metric}']
            ]).T,
            columns=['depth','']
        ).groupby('depth').mean().values
        plt.plot(depths,
                 a,
                 label=f'{prefix} {name}',
                 linestyle=style,
                 drawstyle='steps-mid')
# axes and legend
plt.xlabel('Maximum Tree Depth')
plt.ylabel('Mean Score')
plt.legend(loc='lower right')
plt.show()
To start, we see a jump in all scores as `max_depth` increases from 3 to 4. The gains continue as the parameter increases further, but the magnitude of the improvements tapers off.
The `is_unbalance` Parameter

We now modify a previously untouched LightGBM parameter. The `is_unbalance` flag adjusts the model-building process to account for unbalanced training data, so we can train on the non-oversampled data again.
lg1=LGBMClassifier(
random_state=1,
is_unbalance=True
)
lg1.fit(X_ts,y_train)
LGBMClassifier(is_unbalance=True, random_state=1)
get_scores(lg1,output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.922602 | 0.546308 | 0.338945 | 0.888749 | 0.142425 |
val | 0.587583 | 0.352816 | 0.220604 | 0.716797 | 0.164211 |
Training scores are good and validation scores are some of the best we've seen.
ch(lg1)
Val. Scores | |
---|---|
Accuracy | 0.835789 |
Precision | 0.135793 |
Recall | 0.587583 |
F1 | 0.220604 |
Note that the false-positive count is less than half that of the tuned LightGBM classifier above.
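For completeness, the manual counterpart to `is_unbalance` is `scale_pos_weight`, which weights positive examples by the negative-to-positive ratio; a quick sketch, not evaluated here:

# weight positives by the class ratio instead of setting the is_unbalance flag
ratio=(y_train==0).sum()/(y_train==1).sum()
lg_spw=LGBMClassifier(random_state=1,scale_pos_weight=ratio)
lg_spw.fit(X_ts,y_train)
get_scores(lg_spw,output='pandas')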
params={
'max_bin':[100,150,200,255],
'feature_fraction':np.linspace(0.5,0.9,4),
}
We will once again consider different values of `max_bin` and `feature_fraction`. This time, however, we will include `max_bin=255`, the LightGBM default, as one of the options.
lg1=LGBMClassifier(
random_state=1,
is_unbalance=True
)
go=GridSearchCV(
estimator=lg1,
param_grid=params,
scoring=['recall','f1','roc_auc'],
refit='roc_auc',
cv=5,
n_jobs=-1,
verbose=1,
return_train_score=True
)
start=time.time()
go.fit(X_ts,y_train)
print(f'Fit completed in {np.round((time.time()-start)/60,2)} minutes.')
Fitting 5 folds for each of 16 candidates, totalling 80 fits [LightGBM] [Warning] feature_fraction is set=0.5, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.5 Fit completed in 5.85 minutes.
best_lg1=go.best_params_
best_lg1
{'feature_fraction': 0.5, 'max_bin': 255}
lg1=LGBMClassifier(
random_state=1,
is_unbalance=True,
**best_lg1
)
lg1.fit(X_ts,y_train)
LGBMClassifier(feature_fraction=0.5, is_unbalance=True, max_bin=100, random_state=1)
We fit the classifier with the updated parameters.
ch(lg1)
Val. Scores | |
---|---|
Accuracy | 0.835087 |
Precision | 0.136352 |
Recall | 0.594235 |
F1 | 0.221808 |
get_scores(lg1,output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.924501 | 0.544037 | 0.336386 | 0.888652 | 0.144361 |
val | 0.586475 | 0.352103 | 0.220141 | 0.716197 | 0.164343 |
No perceptible difference from the previous model; tuning has not led to great gains. Let's get more detail with some plots.
results=go.cv_results_
plt.figure(figsize=(12,8))
plt.title('Recall and ROC-AUC for Feature Fraction',fontsize=20)
# recall train
sns.lineplot(x=results['param_feature_fraction'],
y=results['mean_train_recall'],
errorbar=('ci',0),
label='Train Recall',
linestyle='--')
# AUC train
sns.lineplot(x=results['param_feature_fraction'],
y=results['mean_train_roc_auc'],
errorbar=('ci',0),
label='Train AUC',
linestyle='--')
# recall val
sns.lineplot(x=results['param_feature_fraction'],
y=results['mean_test_recall'],
errorbar=('ci',0),
label='Val. Recall')
# AUC val
sns.lineplot(x=results['param_feature_fraction'],
y=results['mean_test_roc_auc'],
errorbar=('ci',0),
label='Val. AUC')
plt.xlabel('Feature Fraction')
plt.ylabel('Mean Score')
plt.legend(loc='best')
plt.show()
We find that recall and AUC are nearly constant across values of `feature_fraction`, contrary to our prediction that a lower feature fraction would bring training and validation scores closer together.
plt.figure(figsize=(12,8))
plt.title('Recall and ROC-AUC for max_bin',fontsize=20)
# recall train
sns.lineplot(x=results['param_max_bin'],
y=results['mean_train_recall'],
errorbar=('ci',0),
label='Train Recall',
linestyle='--',
drawstyle='steps-mid')
# AUC train
sns.lineplot(x=results['param_max_bin'],
y=results['mean_train_roc_auc'],
errorbar=('ci',0),
label='Train AUC',
linestyle='--',
drawstyle='steps-mid')
# recall val
sns.lineplot(x=results['param_max_bin'],
y=results['mean_test_recall'],
errorbar=('ci',0),
label='Val. Recall',
drawstyle='steps-mid')
# AUC val
sns.lineplot(x=results['param_max_bin'],
y=results['mean_test_roc_auc'],
errorbar=('ci',0),
label='Val. AUC',
drawstyle='steps-mid')
plt.xlabel('Max Bin')
plt.ylabel('Mean Score')
plt.legend(loc='best')
plt.show()
We find a similar trend across values of `max_bin`. I suspect we're essentially maxed out on the performance of this classifier architecture.
plt.figure(figsize=(12,8))
plt.title('F1 for max_bin',fontsize=20)
# F1 train
sns.lineplot(x=results['param_max_bin'],
y=results['mean_train_f1'],
errorbar=('ci',0),
label='Train F1',
linestyle='--',
drawstyle='steps-post')
# F1 val
sns.lineplot(x=results['param_max_bin'],
y=results['mean_test_f1'],
errorbar=('ci',0),
label='Val. F1',
drawstyle='steps-post')
plt.xlabel('Max Bin')
plt.ylabel('Mean Score')
plt.legend(loc='best')
plt.show()
The pattern for F1 scores is also essentially constant.
With these data, boosting algorithms worked best.
With such prominent label imbalance in the target feature, models trained on the original data exhibited low-scoring recall, F1, and AUC.
Much better performance was achieved with models trained on oversampled data.
We tuned three boosting algorithms: AdaBoost, XGBoost, and LightGBM. All three offered slight gains at best, while still showing signs of weakness (such as overfitting).
The LightGBM model was also trained again on the non-oversampled data using a training flag for imbalanced data. Performance here was just as good as or better than the others, and training times were appreciably reduced. This model is my favorite.
I think it would be interesting to explore the performance of a neural network trained to classify these data. I need to think more about how training on oversampled data would affect this architecture.
Previously, I implemented both a stacking classifier and a voting classifier with the goal of increasing overall scores. The former overfit and the latter offered no gains, though we can learn from the latter: As the voting classifier did not yield further correct labelling, this indicates that our highest-scoring classifiers are usually predicting the same incorrect label for the same data. Moving forward, it would be interesting to tune several models to maximize precision. Then we could combine those models with some of the high-recall models trained here to hopefully sniff out more correct classifications.
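A sketch of that proposed pairing, with precise_model standing in for a hypothetical precision-tuned classifier we have not yet built:

# combine a high-recall model with a (placeholder) high-precision model
precise_model=XGBClassifier(random_state=1)  # stand-in: tune this for precision
combo=VotingClassifier(
    estimators=[
        ('high_recall',lg1),
        ('high_precision',precise_model)
    ],
    voting='soft'   # average predicted probabilities
)
combo.fit(X_ts,y_train)
get_scores(combo,output='pandas')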