Data is from the 2016 Banco Santander competition on Kaggle.
Ciphered customer data is supplied with the goal of predicting customer satisfaction. The target flag is encoded as 1 for an unsatisfied customer and 0 otherwise.
(From the perspective of philosophy of logic, the above encoding reflects our rejection of the Law of Excluded Middle for this binary classification. That is to say, it is possible for a customer to be labelled in class 0 without being unconditionally satisfied. Ultimately though, this is just semantic quibbling in the spirit of Wittgenstein, not Russell.)
Note on HTML jump links: Open in nbviewer to use jump links.
Running the latest builds of:
! pip install -U scikit-learn
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (1.2.2)
! pip install -U xgboost
Requirement already satisfied: xgboost in /usr/local/lib/python3.10/dist-packages (1.7.5)
! pip install -U lightgbm
Requirement already satisfied: lightgbm in /usr/local/lib/python3.10/dist-packages (3.3.5)
! pip install catboost
Successfully installed catboost-1.1.1
! pip install -U imbalanced-learn
Requirement already satisfied: imbalanced-learn in /usr/local/lib/python3.10/dist-packages (0.10.1)
import numpy as np
import pandas as pd
import time
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from xgboost import cv as XGB_CV
from xgboost import DMatrix
#plotting
from matplotlib import pyplot as plt
import seaborn as sns
from IPython.display import clear_output
# cluster analysis
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# models
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import StackingClassifier, VotingClassifier
# processing pipeline
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
sns.set_theme()
We import our data, stored as an Apache parquet file.
bank=pd.read_parquet('santander_train.parquet')
data=bank.copy()
bank.shape
(76020, 371)
We have around 76k records with 371 features.
bank.sample(10,random_state=1)
ID | var3 | var15 | imp_ent_var16_ult1 | imp_op_var39_comer_ult1 | imp_op_var39_comer_ult3 | imp_op_var40_comer_ult1 | imp_op_var40_comer_ult3 | imp_op_var40_efect_ult1 | imp_op_var40_efect_ult3 | ... | saldo_medio_var33_hace2 | saldo_medio_var33_hace3 | saldo_medio_var33_ult1 | saldo_medio_var33_ult3 | saldo_medio_var44_hace2 | saldo_medio_var44_hace3 | saldo_medio_var44_ult1 | saldo_medio_var44_ult3 | var38 | TARGET | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
14162 | 28459 | 2 | 25 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 46969.410000 | 1 |
35732 | 71476 | 2 | 33 | 0.0 | 930.21 | 1391.55 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 307194.780000 | 0 |
24191 | 48386 | 2 | 25 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 109659.060000 | 0 |
10440 | 20945 | 2 | 23 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 71302.530000 | 0 |
46585 | 93165 | 2 | 24 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 117310.979016 | 0 |
46064 | 92159 | 2 | 25 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 103667.040000 | 0 |
27661 | 55359 | 2 | 62 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 195088.740000 | 0 |
36671 | 73262 | 2 | 41 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 59630.010000 | 0 |
70885 | 141557 | 158 | 65 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 83606.940000 | 0 |
72468 | 144712 | 2 | 24 | 0.0 | 0.00 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 74988.840000 | 0 |
10 rows × 371 columns
The only recognizable columns are the ID column and the TARGET column. The rest of the attributes are ciphered, according to the source linked above.
bank.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 76020 entries, 0 to 76019 Columns: 371 entries, ID to TARGET dtypes: float64(111), int64(260) memory usage: 215.2 MB
The data frame requires around 215 MB of memory.
# null check
bank.isna().sum().sum()
0
# duplicate row check
bank.duplicated().sum()
0
There are no null entries or duplicated rows.
bank['TARGET'].value_counts(normalize=True)
0 0.960431 1 0.039569 Name: TARGET, dtype: float64
This dataset is imbalanced, with only 4% of records in the positive class. Thankfully, most of our customers are not unsatisfied. On the other hand, this imbalance will make detection and modeling more delicate.
# find constant columns
const_col=[col for col in bank.columns if bank[col].std()==0]
# find duplicate columns (bank.T.duplicated() flags later occurrences)
dup_bool=bank.T.duplicated()
dups=dup_bool[dup_bool].index.tolist()
remove=const_col+dups
print(f'There are {len(remove)} columns to remove.')
There are 96 columns to remove.
We found 96 columns that are either constant or a duplicate of another column. Note that this is an inclusive OR: a constant column can also be flagged as a duplicate of another constant column, so the list may contain repeats and fewer than 96 unique columns may actually be dropped.
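Since the combined list can contain repeats, deduplicating it first makes the count unambiguous; a small sketch (dropping by the deduplicated list removes exactly the same columns):

# count unique columns flagged for removal
remove=sorted(set(const_col)|set(dups))
print(f'There are {len(remove)} unique columns to remove.')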
We separate the data into predictive features and our target. Then we split into training data and validation data. There is no need to reserve data for final evaluation, as we have that data stored in a separate file.
X=bank.drop(['ID','TARGET']+remove,axis=1)
y=bank['TARGET']
# split into training and validation sets
X_train,X_val,y_train,y_val=train_test_split(
X,
y,
test_size=0.3,
stratify=y,
random_state=57
)
We reserve 30% of the data for our validation set.
In order to demonstrate the PCA transformation that follows, we first need to scale our data. Scaling will shortly be incorporated into a preprocessing pipeline, which supersedes this standalone step.
scaler=StandardScaler().set_output(transform='pandas')
X_ts=scaler.fit_transform(X_train)
X_ts.describe().T.head()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
var3 | 53214.0 | 5.608072e-18 | 1.000009 | -27.956373 | 0.035750 | 0.035750 | 0.035750 | 0.042356 |
var15 | 53214.0 | -2.281951e-16 | 1.000009 | -2.172472 | -0.785866 | -0.477731 | 0.446674 | 5.530898 |
imp_ent_var16_ult1 | 53214.0 | -6.008649e-18 | 1.000009 | -0.049542 | -0.049542 | -0.049542 | -0.049542 | 117.774053 |
imp_op_var39_comer_ult1 | 53214.0 | -5.207496e-18 | 1.000009 | -0.219330 | -0.219330 | -0.219330 | -0.219330 | 26.065719 |
imp_op_var39_comer_ult3 | 53214.0 | 1.215082e-17 | 1.000009 | -0.220725 | -0.220725 | -0.220725 | -0.220725 | 38.221859 |
Note that every attribute has mean approximately 0 and standard deviation approximately 1.
Principal Component Analysis is, at heart, an eigendecomposition of the data's covariance matrix. In effect, it is a coordinate transformation where the new axes point along directions of variance in the data. Moreover, these axes, or components, are ordered by decreasing explained variance. Many of the trailing components can thus be discarded, as they generally contribute little to the explained variance. In this way, PCA can be used for dimension reduction.
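As a minimal illustration of these mechanics (on synthetic data, not our training set), the components and their explained-variance ratios fall out of an eigendecomposition of the covariance matrix:

# PCA by hand on synthetic data: eigendecompose the covariance matrix
# and sort the components by explained variance
rng=np.random.default_rng(0)
X_demo=rng.normal(size=(500,3))*np.array([2.0,1.0,0.1]) # three variance scales
X_demo-=X_demo.mean(axis=0)                 # center the data
cov=np.cov(X_demo,rowvar=False)             # sample covariance matrix
eigvals,eigvecs=np.linalg.eigh(cov)         # eigendecomposition (ascending order)
order=np.argsort(eigvals)[::-1]             # reorder by descending variance
explained_ratio=eigvals[order]/eigvals.sum()
print(explained_ratio)

Up to numerical noise, explained_ratio matches what `PCA(n_components=3).fit(X_demo).explained_variance_ratio_` would report.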
print(f'Number of features: {X_ts.shape[1]}.')
Number of features: 306.
Before dimension reduction, we have 306 features in our dataset.
# major reduction test
pca37=PCA(n_components=37)
pca37.fit(X_ts)
PCA(n_components=37)
plt.title('Cumulative variance explained by eigenvectors',fontsize=15)
plt.step(
np.arange(1,38),
np.cumsum(pca37.explained_variance_ratio_),
where='mid'
)
plt.xlabel('Number of Eigenvectors')
plt.ylabel('Cumulative Variance');
We find that 37 components explain about 75% of our variance: not enough. Moreover, we can see graphically that the right side of the curve is still increasing, not levelling off. We'll need more components.
We know that 306 features can explain 100% of the variance in our data. Can we get away with fewer?
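As an aside, scikit-learn can pick the component count for us: passing a float between 0 and 1 as `n_components` keeps the smallest number of components whose cumulative explained variance reaches that fraction. A quick sketch:

# let PCA choose the smallest number of components explaining >=99% of variance
pca_auto=PCA(n_components=0.99)
pca_auto.fit(X_ts)
print(f'{pca_auto.n_components_} components explain '
      f'{pca_auto.explained_variance_ratio_.sum()*100:.2f}% of the variance.')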
# about 1/3 the size
pca123=PCA(n_components=123)
pca123.fit(X_ts)
PCA(n_components=123)
evr=pca123.explained_variance_ratio_
plt.title(f'Cumulative explained variance (reaches {np.round(sum(evr)*100,2)}%)',fontsize=15)
plt.step(
np.arange(1,124),
np.cumsum(evr),
where='mid'
)
plt.xlabel('Number of Eigenvectors')
plt.ylabel('Cumulative Variance');
We can explain nearly 100% of the variance in our data with just a third of the components. This decreases the memory required to store the data and massively reduces computation time during training.
We now incorporate scaling and PCA into a preprocessing pipeline.
# preprocessing pipe
pre=Pipeline(
steps=[
('Scaler',StandardScaler()),
('Dimension_Reduction',PCA(n_components=123))
]
).set_output(transform='pandas')
X_ts=pre.fit_transform(X_train)
X_vs=pre.transform(X_val)
We fit the pipeline on our training data and then use it to transform our validation data. This approach ensures the integrity of our analysis by preventing data leakage.
a=X_ts.memory_usage().sum()/X_train.memory_usage().sum()
print(f'Memory usage reduced to {np.round(a*100,2)}% of original data frame.')
Memory usage reduced to 40.39% of original data frame.
As expected, PCA reduced memory usage by roughly 60%.
y_train.value_counts(normalize=True)
0 0.960424 1 0.039576 Name: TARGET, dtype: float64
Only around 4% of the supplied data belongs to the positive class. We oversample to balance the classes using SMOTE, which synthesizes new minority-class points by interpolating between existing minority samples and their nearest minority neighbors.
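To make the mechanics concrete before calling the library, here is a toy sketch of the SMOTE idea (a hypothetical smote_sketch helper, not imbalanced-learn's API): each synthetic point is an interpolation between a minority sample and one of its k nearest minority neighbors.

from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min,n_new,k=5,seed=1):
    '''Toy SMOTE: interpolate between minority samples and their neighbors.
    X_min is a 2-D array of minority-class feature rows.'''
    rng=np.random.default_rng(seed)
    nn=NearestNeighbors(n_neighbors=k+1).fit(X_min)
    _,idx=nn.kneighbors(X_min)       # idx[:,0] is each point itself
    new=[]
    for _ in range(n_new):
        i=rng.integers(len(X_min))   # pick a random minority sample
        j=rng.choice(idx[i,1:])      # and one of its k minority neighbors
        lam=rng.random()             # interpolation factor in [0,1)
        new.append(X_min[i]+lam*(X_min[j]-X_min[i]))
    return np.array(new)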
smote=SMOTE(
sampling_strategy='not majority',
random_state=1,
k_neighbors=5
)
# oversampled training data
Xt_over,yt_over=smote.fit_resample(X_ts,y_train)
# re-scale data
Xt_over=scaler.fit_transform(Xt_over)
yt_over.value_counts(normalize=True)
0 0.5 1 0.5 Name: TARGET, dtype: float64
Target classes are now balanced.
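One caution worth noting: because we oversample outside of cross validation, the grid searches below will see synthetic points in their validation folds. A sketch of the leak-free alternative, using imbalanced-learn's pipeline (which applies samplers to the training folds only):

from imblearn.pipeline import Pipeline as ImbPipeline

cv_pipe=ImbPipeline(
    steps=[
        ('Scaler',StandardScaler()),
        ('Dimension_Reduction',PCA(n_components=123)),
        ('SMOTE',SMOTE(random_state=1)),
        ('Model',LGBMClassifier(random_state=1))
    ]
)
# cv_pipe can be passed straight to GridSearchCV; resampling then happens
# inside each training fold and never touches the scoring folds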
This first function collects training and validation scores for a given model. It provides an option to output the scores in an easy-to-read Pandas DataFrame.
def get_scores(model,sample=None,output=None):
'''Collect model scores.'''
# define training data
if sample=='over':
X_t=Xt_over
y_t=yt_over
else:
X_t=X_ts
y_t=y_train
# predictions
y_t_hat=model.predict(X_t)
y_v_hat=model.predict(X_vs)
# collect scores
train_scores=[
metrics.recall_score(y_t,y_t_hat),
metrics.fbeta_score(y_t,y_t_hat,beta=2),
metrics.f1_score(y_t,y_t_hat),
metrics.roc_auc_score(y_t,y_t_hat),
metrics.zero_one_loss(y_t,y_t_hat)
]
val_scores=[
metrics.recall_score(y_val,y_v_hat),
metrics.fbeta_score(y_val,y_v_hat,beta=2),
metrics.f1_score(y_val,y_v_hat),
metrics.roc_auc_score(y_val,y_v_hat),
metrics.zero_one_loss(y_val,y_v_hat)
]
# output scores in pandas df
if output=='pandas':
df=pd.DataFrame(
[train_scores,val_scores],
columns=[
'Recall',
'F_beta',
'F1',
'AUC',
'0-1_Loss'
],
index=['train','val']
)
return df
return [train_scores,val_scores]
The next function displays a confusion matrix of model predictions on validation data.
def confusion_heatmap(model,show_scores=True):
'''Heatmap of confusion matrix for
model performance on validation data.'''
actual=y_val
predicted=model.predict(X_vs)
# generate confusion matrix
cm=metrics.confusion_matrix(actual,predicted)
cm=np.flip(cm).T
# heatmap labels
labels=['TP','FP','FN','TN']
cm_labels=np.array(cm).flatten()
cm_percents=np.round((cm_labels/np.sum(cm))*100,3)
annot_labels=[]
for i in range(4):
annot_labels.append(str(labels[i])+'\nCount:'+str(cm_labels[i])+'\n'+str(cm_percents[i])+'%')
annot_labels=np.array(annot_labels).reshape(2,2)
# print figure
plt.figure(figsize=(8,5))
plt.title('Confusion Matrix (Validation Data)',fontsize=20)
sns.heatmap(data=cm,
annot=annot_labels,
annot_kws={'fontsize':'x-large'},
xticklabels=[1,0],
yticklabels=[1,0],
cmap='Greens',
fmt='s')
plt.xlabel('Actual',fontsize=14)
plt.ylabel('Predicted',fontsize=14)
plt.tight_layout();
# scores
if show_scores==True:
scores=['Accuracy','Precision','Recall','F1']
score_list=[metrics.accuracy_score(actual,predicted),
metrics.precision_score(actual,predicted),
metrics.recall_score(actual,predicted),
metrics.f1_score(actual,predicted)]
df=pd.DataFrame(index=scores)
df['Val. Scores']=score_list
return df
return
# alias function name to something shorter
ch=confusion_heatmap
Summary of vanilla model testing: No model performed well with default configurations on the regular data. Performance gains are only observed once we train on oversampled data.
models=[
'RandomForest',
'AdaBoost',
'XGBoost',
'LightGBM',
'CatBoost'
]
datasets=['train','val']
# generate MultiIndex object
mi=pd.MultiIndex.from_product(
iterables=[models,datasets],
names=['model','data']
)
# build comparison table
tab=pd.DataFrame(
columns=[
'Recall',
'F_beta',
'F1',
'AUC',
'0-1_Loss'
],
index=mi
)
The value of $\beta$ in the $F_\beta$ score allows us to bias the score between precision and recall. With $\beta=2$, we give recall twice the importance of precision.
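For reference, $F_\beta=(1+\beta^2)\cdot\frac{\text{precision}\,\cdot\,\text{recall}}{\beta^2\cdot\text{precision}+\text{recall}}$, so larger values of $\beta$ shift the weight toward recall.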
d=DummyClassifier(
strategy='stratified',
random_state=1
)
d.fit(X_ts,y_train)
DummyClassifier(random_state=1, strategy='stratified')
We fit a dummy classifier to set a performance baseline.
ch(d)
Val. Scores | |
---|---|
Accuracy | 0.925809 |
Precision | 0.042824 |
Recall | 0.041020 |
F1 | 0.041903 |
Predictably, this classifier yields high accuracy on our highly imbalanced data set. Its failing, however, is 4% recall.
rf=RandomForestClassifier(
random_state=1,
n_jobs=-1
)
rf.fit(X_ts,y_train)
RandomForestClassifier(n_jobs=-1, random_state=1)
tab.loc['RandomForest']=get_scores(rf)
tab.loc['RandomForest']
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
data | |||||
train | 0.902659 | 0.919868 | 0.946949 | 0.951251 | 0.004003 |
val | 0.042129 | 0.050211 | 0.070501 | 0.517914 | 0.043936 |
Random forest fares better on recall, though this is merely due to overfitting. Validation AUC barely clears the baseline 50%.
ch(rf)
Val. Scores | |
---|---|
Accuracy | 0.956064 |
Precision | 0.215909 |
Recall | 0.042129 |
F1 | 0.070501 |
We can see from the confusion matrix that the number of true positives detected is the same as the dummy classifier. Thus, the only improvement made here is in the classification of true negatives.
abc=AdaBoostClassifier(random_state=1)
abc.fit(X_ts,y_train)
AdaBoostClassifier(random_state=1)
tab.loc['AdaBoost']=get_scores(abc)
tab.loc['AdaBoost']
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
data | |||||
train | 0.006648 | 0.008286 | 0.013146 | 0.503226 | 0.039501 |
val | 0.002217 | 0.002765 | 0.004396 | 0.500972 | 0.039726 |
On the one hand, AdaBoost is not suffering the same overfitting issues as random forest. On the other, its performance is horrid.
ch(abc)
Val. Scores | |
---|---|
Accuracy | 0.960274 |
Precision | 0.250000 |
Recall | 0.002217 |
F1 | 0.004396 |
The confusion matrix shows that AdaBoost is just predicting 0 for essentially every observation, with only eight predicted in the positive class.
xgb=XGBClassifier(
random_state=1
)
xgb.fit(X_ts,y_train)
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=None, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=100, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=1, ...)
tab.loc['XGBoost']=get_scores(xgb)
tab.loc['XGBoost']
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
data | |||||
train | 0.405508 | 0.459931 | 0.57586 | 0.702695 | 0.02364 |
val | 0.011086 | 0.01365 | 0.020899 | 0.504516 | 0.041086 |
ch(xgb)
Val. Scores | |
---|---|
Accuracy | 0.958914 |
Precision | 0.181818 |
Recall | 0.011086 |
F1 | 0.020899 |
Overfitting plagues XGBoost too. We have yet to see a validation AUC appreciably climb above 50%.
lg=LGBMClassifier()
lg.fit(X_ts,y_train)
LGBMClassifier()
tab.loc['LightGBM']=get_scores(lg)
tab.loc['LightGBM']
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
data | |||||
train | 0.17189 | 0.205892 | 0.292762 | 0.585896 | 0.032867 |
val | 0.009978 | 0.0124 | 0.019502 | 0.504715 | 0.039683 |
ch(lg)
Val. Scores | |
---|---|
Accuracy | 0.960317 |
Precision | 0.428571 |
Recall | 0.009978 |
F1 | 0.019502 |
Comparable performance can be observed in default LightGBM, with overfitting and poor recall and AUC.
cb=CatBoostClassifier()
cb.fit(X_ts,y_train,verbose=False)
<catboost.core.CatBoostClassifier at 0x7f378bdf7790>
tab.loc['CatBoost']=get_scores(cb)
tab.loc['CatBoost']
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
data | |||||
train | 0.213675 | 0.253464 | 0.3517 | 0.606808 | 0.031176 |
val | 0.008869 | 0.011004 | 0.017223 | 0.504001 | 0.040033 |
ch(cb)
Val. Scores | |
---|---|
Accuracy | 0.959967 |
Precision | 0.296296 |
Recall | 0.008869 |
F1 | 0.017223 |
CatBoost is similarly deficient.
tab
Recall | F_beta | F1 | AUC | 0-1_Loss | ||
---|---|---|---|---|---|---|
model | data | |||||
RandomForest | train | 0.902659 | 0.919868 | 0.946949 | 0.951251 | 0.004003 |
val | 0.042129 | 0.050211 | 0.070501 | 0.517914 | 0.043936 | |
AdaBoost | train | 0.006648 | 0.008286 | 0.013146 | 0.503226 | 0.039501 |
val | 0.002217 | 0.002765 | 0.004396 | 0.500972 | 0.039726 | |
XGBoost | train | 0.405508 | 0.459931 | 0.57586 | 0.702695 | 0.02364 |
val | 0.011086 | 0.01365 | 0.020899 | 0.504516 | 0.041086 | |
LightGBM | train | 0.17189 | 0.205892 | 0.292762 | 0.585896 | 0.032867 |
val | 0.009978 | 0.0124 | 0.019502 | 0.504715 | 0.039683 | |
CatBoost | train | 0.213675 | 0.253464 | 0.3517 | 0.606808 | 0.031176 |
val | 0.008869 | 0.011004 | 0.017223 | 0.504001 | 0.040033 |
Hyperparameter tuning will not garner the performance improvements we need. Let's instead train the models on oversampled data.
Data oversampling using SMOTE (Synthetic Minority Over-sampling Technique).
# build comparison table
tab_over=pd.DataFrame(
columns=[
'Recall',
'F_beta',
'F1',
'AUC',
'0-1_Loss'
],
index=mi
)
rf_over=RandomForestClassifier(
random_state=1,
n_jobs=-1
)
rf_over.fit(Xt_over,yt_over)
RandomForestClassifier(n_jobs=-1, random_state=1)
tab_over.loc['RandomForest']=get_scores(rf_over,sample='over')
tab_over.loc['RandomForest']
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
data | |||||
train | 0.990158 | 0.990224 | 0.990323 | 0.990324 | 0.009676 |
val | 0.296009 | 0.239935 | 0.186844 | 0.60945 | 0.101903 |
ch(rf_over)
Val. Scores | |
---|---|
Accuracy | 0.898097 |
Precision | 0.136503 |
Recall | 0.296009 |
F1 | 0.186844 |
abc_over=AdaBoostClassifier(random_state=1)
abc_over.fit(Xt_over,yt_over)
AdaBoostClassifier(random_state=1)
tab_over.loc['AdaBoost']=get_scores(abc_over,sample='over')
tab_over.loc['AdaBoost']
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
data | |||||
train | 0.775906 | 0.770682 | 0.762978 | 0.758961 | 0.241039 |
val | 0.722838 | 0.32441 | 0.177584 | 0.729274 | 0.264799 |
ch(abc_over)
Val. Scores | |
---|---|
Accuracy | 0.735201 |
Precision | 0.101227 |
Recall | 0.722838 |
F1 | 0.177584 |
xgb_over=XGBClassifier(
tree_method='gpu_hist',
random_state=1
)
xgb_over.fit(Xt_over,yt_over)
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=None, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=100, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=1, ...)
tab_over.loc['XGBoost']=get_scores(xgb_over,sample='over')
tab_over.loc['XGBoost']
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
data | |||||
train | 0.950184 | 0.938345 | 0.921131 | 0.918643 | 0.081357 |
val | 0.51663 | 0.316533 | 0.200215 | 0.683283 | 0.163247 |
ch(xgb_over)
Val. Scores | |
---|---|
Accuracy | 0.836753 |
Precision | 0.124167 |
Recall | 0.516630 |
F1 | 0.200215 |
Notice here that far fewer false positives yield a higher F1 score than the previous AdaBoost model achieved.
lg_over=LGBMClassifier(
n_jobs=-1,
random_state=1
)
lg_over.fit(Xt_over,yt_over)
LGBMClassifier(random_state=1)
tab_over.loc['LightGBM']=get_scores(lg_over,sample='over')
tab_over.loc['LightGBM']
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
data | |||||
train | 0.897511 | 0.885921 | 0.869087 | 0.864806 | 0.135194 |
val | 0.615299 | 0.340825 | 0.204194 | 0.716822 | 0.189687 |
ch(lg_over)
Val. Scores | |
---|---|
Accuracy | 0.810313 |
Precision | 0.122408 |
Recall | 0.615299 |
F1 | 0.204194 |
cb_over=CatBoostClassifier(
task_type='GPU',
gpu_ram_part=0.9,
gpu_cat_features_storage='GpuRam',
random_seed=1
)
cb_over.fit(Xt_over,yt_over,verbose=False)
<catboost.core.CatBoostClassifier at 0x7f169c6741c0>
tab_over.loc['CatBoost']=get_scores(cb_over,sample='over')
tab_over.loc['CatBoost']
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
data | |||||
train | 0.897218 | 0.886961 | 0.872008 | 0.868308 | 0.131692 |
val | 0.599778 | 0.34068 | 0.206725 | 0.713352 | 0.182057 |
ch(cb_over)
Val. Scores | |
---|---|
Accuracy | 0.817943 |
Precision | 0.124885 |
Recall | 0.599778 |
F1 | 0.206725 |
tab_over
Recall | F_beta | F1 | AUC | 0-1_Loss | ||
---|---|---|---|---|---|---|
model | data | |||||
RandomForest | train | 0.990158 | 0.990224 | 0.990323 | 0.990324 | 0.009676 |
val | 0.296009 | 0.239935 | 0.186844 | 0.60945 | 0.101903 | |
AdaBoost | train | 0.775906 | 0.770682 | 0.762978 | 0.758961 | 0.241039 |
val | 0.722838 | 0.32441 | 0.177584 | 0.729274 | 0.264799 | |
XGBoost | train | 0.950184 | 0.938345 | 0.921131 | 0.918643 | 0.081357 |
val | 0.51663 | 0.316533 | 0.200215 | 0.683283 | 0.163247 | |
LightGBM | train | 0.897511 | 0.885921 | 0.869087 | 0.864806 | 0.135194 |
val | 0.615299 | 0.340825 | 0.204194 | 0.716822 | 0.189687 | |
CatBoost | train | 0.897218 | 0.886961 | 0.872008 | 0.868308 | 0.131692 |
val | 0.599778 | 0.34068 | 0.206725 | 0.713352 | 0.182057 |
Random forest is massively overfit.
AdaBoost appears to suffer the least from overfitting. Especially promising are the AUC scores: its validation AUC is higher than that of any other model in the table.
XGBoost is plagued by overfitting, but the validation scores show promise. With tuning, the overfitting might be controllable.
LightGBM and CatBoost show near-identical performance on every metric. Still, these models require much tuning and improvement.
params={
'n_estimators':np.arange(50,251,50),
'learning_rate':[0.5,1.0,2.0]
}
We will vary the number of estimators and the learning rate of our AdaBoost classifier.
abc_tuned1=AdaBoostClassifier(random_state=1)
go1=GridSearchCV(
estimator=abc_tuned1,
param_grid=params,
scoring=['recall','f1','roc_auc'],
refit='roc_auc',
cv=5,
n_jobs=-1,
verbose=1,
return_train_score=True
)
start=time.time()
go1.fit(Xt_over,yt_over)
print(f'Fit completed in {np.round((time.time()-start)/60,2)} minutes.')
Fitting 5 folds for each of 15 candidates, totalling 75 fits Fit completed in 48.8 minutes.
best_abc1=go1.best_params_
best_abc1
{'learning_rate': 1.0, 'n_estimators': 250}
abc_tuned1=AdaBoostClassifier(
random_state=1,
**best_abc1
)
abc_tuned1.fit(Xt_over,yt_over)
AdaBoostClassifier(n_estimators=250, random_state=1)
get_scores(abc_tuned1,sample='over',output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.792440 | 0.791087 | 0.789066 | 0.788164 | 0.211836 |
val | 0.667406 | 0.332010 | 0.189308 | 0.722856 | 0.226081 |
ch(abc_tuned1)
Val. Scores | |
---|---|
Accuracy | 0.773919 |
Precision | 0.110297 |
Recall | 0.667406 |
F1 | 0.189308 |
An attempt was made to train an AdaBoost classifier with XGBoost as the base estimator. After 5h 47m, the training was terminated. Given limited computing resources, this configuration was too expensive to be feasible.
lr=LogisticRegression(
random_state=1,
max_iter=1000
)
lr.fit(Xt_over,yt_over)
LogisticRegression(max_iter=1000, random_state=1)
get_scores(lr,sample='over',output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.758551 | 0.749218 | 0.735643 | 0.727411 | 0.272589 |
val | 0.739468 | 0.307119 | 0.163621 | 0.719442 | 0.299000 |
Performance here is interesting: recall and AUC are good and consistent between training and validation, indicating a reliable fit. However, precision, and consequently F1 and $F_\beta$, suffers massively when we move from training to validation.
I conjecture this is due to training data being oversampled but not validation data. This sampling is of course intentional, as we generally only need the oversampled data for model training. However, in the case of logistic regression, training on oversampled data artificially inflates the regression intercept, meaning performance suffers on non-oversampled data.
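One way to probe this conjecture is the standard prior-correction for case-control sampling: shift the fitted intercept by the log-odds gap between the oversampled positive rate and the true rate. A minimal sketch, assuming that correction applies here:

import copy
tau=y_train.mean()    # true positive rate in the training data (~0.04)
ybar=yt_over.mean()   # positive rate after SMOTE (0.5)
lr_adj=copy.deepcopy(lr)
# remove the log-odds inflation introduced by balancing the classes
lr_adj.intercept_=lr.intercept_-np.log((ybar/(1-ybar))*((1-tau)/tau))
get_scores(lr_adj,output='pandas')  # precision should recover at recall's expense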
# logistic regression on non-oversampled data
logit=LogisticRegression(max_iter=1000)
logit.fit(X_ts,y_train)
LogisticRegression(max_iter=1000)
get_scores(logit,output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.006173 | 0.007679 | 0.012110 | 0.502812 | 0.039858 |
val | 0.017738 | 0.021786 | 0.033126 | 0.507773 | 0.040954 |
Do note that logistic regression still sees massive improvements when trained on oversampled data. A quick look at a model trained on the original scaled data finds that every metric is at rock bottom. For instance, AUC is at 0.5, no better than guessing.
ch(lr)
Val. Scores | |
---|---|
Accuracy | 0.701000 |
Precision | 0.091987 |
Recall | 0.739468 |
F1 | 0.163621 |
While logistic regression suffers massive overprediction of false positives, it does have the advantage of providing stellar recall and AUC in both training and validation. We will try boosting this logistic regression with AdaBoost and optimizing for precision.
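As a cheaper first experiment, the same precision/recall trade can be explored by simply raising the decision threshold on the fitted logistic regression; a quick sketch:

# sweep the decision threshold on validation probabilities
proba=lr.predict_proba(X_vs)[:,1]
for thresh in [0.5,0.6,0.7,0.8]:
    pred=(proba>=thresh).astype(int)
    print(f'threshold={thresh}: '
          f'precision={metrics.precision_score(y_val,pred):.3f}, '
          f'recall={metrics.recall_score(y_val,pred):.3f}')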
abc_tuned3=AdaBoostClassifier(
estimator=LogisticRegression(
random_state=2,
max_iter=1000
),
random_state=1
)
abc_tuned3.fit(Xt_over,yt_over)
AdaBoostClassifier(estimator=LogisticRegression(max_iter=1000, random_state=2), random_state=1)
get_scores(abc_tuned3,sample='over',output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.740373 | 0.732576 | 0.721182 | 0.713763 | 0.286237 |
val | 0.722838 | 0.294410 | 0.155850 | 0.705900 | 0.309699 |
ch(abc_tuned3)
Val. Scores | |
---|---|
Accuracy | 0.690301 |
Precision | 0.087341 |
Recall | 0.722838 |
F1 | 0.155850 |
Before tuning, the AdaBoost results are comparable to the logistic regression trained on the same data. Note that the subsequent boosting rounds did not reduce the high rate of false positives. This casts doubt on the possibility of remedying the poor precision with some hyperparameter tuning.
go3=GridSearchCV(
estimator=abc_tuned3,
param_grid=params,
scoring='precision',
cv=5,
n_jobs=-1,
verbose=1,
return_train_score=True
)
go3.fit(Xt_over,yt_over)
Fitting 5 folds for each of 15 candidates, totalling 75 fits
GridSearchCV(cv=5, estimator=AdaBoostClassifier(estimator=LogisticRegression(max_iter=1000, random_state=2), random_state=1), n_jobs=-1, param_grid={'learning_rate': [0.5, 1.0, 2.0], 'n_estimators': array([ 50, 100, 150, 200, 250])}, return_train_score=True, scoring='precision', verbose=1)
best_abc3=go3.best_params_
best_abc3
{'learning_rate': 2.0, 'n_estimators': 150}
abc_tuned3=AdaBoostClassifier(
estimator=LogisticRegression(
random_state=2,
max_iter=1000
),
random_state=1,
**best_abc3
)
abc_tuned3.fit(Xt_over,yt_over)
AdaBoostClassifier(estimator=LogisticRegression(max_iter=1000, random_state=2), learning_rate=2.0, n_estimators=150, random_state=1)
get_scores(abc_tuned3,sample='over',output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.741078 | 0.733895 | 0.723378 | 0.716610 | 0.283390 |
val | 0.717295 | 0.295218 | 0.156810 | 0.705639 | 0.305095 |
Scores remain effectively unchanged. AdaBoost with the default base estimator still scored best on F1, and overall the tuned performance leaves much to be desired.
Consider, for example, the business ramifications of the AdaBoost model built on top of the logistic regression. With so many false positives, Santander would be investing in pleasing customers who are not unsatisfied. This is not a beneficial allocation of resources.
We will next try to improve our XGBoost Classifier with several rounds of cross-validated tuning.
params={
'eta':np.linspace(0.05,0.3,6),
'max_depth':np.arange(2,5),
'min_child_weight':[1,2],
'subsample':np.linspace(0.5,0.9,4),
'colsample_bytree':np.linspace(0.5,0.9,4)
}
We control the learning rate using `eta`.
To hopefully cut down on overfitting, we limit `max_depth`, which defaults to 6 in the standard XGBoost model.
We experiment with a more conservative algorithm by offering a larger value for `min_child_weight`. Options are 1, the algorithm default, and 2.
To build a more robust model, we limit both `subsample` and `colsample_bytree` to values less than 1. These subsample the rows and the features, respectively, during the tree-building process to combat overfitting.
xgb_tuned=XGBClassifier(
tree_method='gpu_hist',
random_state=1
)
go=GridSearchCV(
estimator=xgb_tuned,
param_grid=params,
scoring=['recall','f1','roc_auc'],
refit='roc_auc',
cv=5,
n_jobs=-1,
verbose=1,
return_train_score=True
)
start=time.time()
go.fit(Xt_over,yt_over)
print(f'Fit completed in {np.round((time.time()-start)/60,2)} minutes.')
Fitting 5 folds for each of 576 candidates, totalling 2880 fits Fit completed in 133.48 minutes.
best_xgb1=go.best_params_
best_xgb1
{'colsample_bytree': 0.9, 'eta': 0.3, 'max_depth': 4, 'min_child_weight': 2, 'subsample': 0.7666666666666666}
We find that the model with the greatest AUC subsampled around 77% of the data for each training instance. This, along with the reduced column sampling per tree, works to prevent overfitting.
xgb_tuned1=XGBClassifier(
tree_method='gpu_hist',
random_state=1,
**best_xgb1
)
xgb_tuned1.fit(Xt_over,yt_over)
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=0.9, early_stopping_rounds=None, enable_categorical=False, eta=0.3, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=4, max_leaves=None, min_child_weight=2, missing=nan, monotone_constraints=None, n_estimators=100, n_jobs=None, num_parallel_tree=None, predictor=None, ...)
ch(xgb_tuned1)
Val. Scores | |
---|---|
Accuracy | 0.813207 |
Precision | 0.124048 |
Recall | 0.614191 |
F1 | 0.206408 |
We already find fewer false positives than with the AdaBoost classifier.
get_scores(xgb_tuned1,sample='over',output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.892561 | 0.881539 | 0.865507 | 0.861304 | 0.138696 |
val | 0.614191 | 0.343077 | 0.206408 | 0.717797 | 0.186793 |
While validation recall is lower than the last AdaBoost model, both F1 and $F_\beta$ are higher. Moreover, the 0-1 loss is lower than all AdaBoost models, meaning this model made fewer mistakes overall.
Let's plot some findings from the cross validation performed during the grid search.
results=go.cv_results_
# figure setup
plt.figure(figsize=(12,8))
plt.title('Mean Recall, ROC-AUC, and F1 for max_depth',fontsize=20)
# for each split/metric pair, average the CV means over all grid candidates
# sharing a max_depth value, then plot (training dashed, validation solid)
depths=np.arange(2,5)
for split,prefix,style in [('train','Train','--'),('test','Val.','-')]:
    for metric,name in [('recall','Recall'),('roc_auc','AUC'),('f1','F1')]:
        a=pd.DataFrame(
            data=np.array([
                results['param_max_depth'],
                results[f'mean_{split}_{metric}']
            ]).T,
            columns=['depth','']
        ).groupby('depth').mean().values
        plt.plot(depths,
                 a,
                 label=f'{prefix} {name}',
                 linestyle=style,
                 drawstyle='steps-mid')
# axes and legend
plt.xlabel('Maximum Tree Depth')
plt.ylabel('Mean Score')
plt.legend(loc='lower right')
plt.show()
Note that recall and F1 trade places as the dominant metric as `max_depth` increases. When `max_depth=2`, we find F1 is greater than recall (both in training and validation). This inverts when we increment `max_depth`.
Notice further that the vertical gaps between training and validation averages increase with `max_depth`. This indicates that the risk of overfitting increases with `max_depth`.
ROC-AUC has the highest scores across the board. It too exhibits the tendency toward overfitting as `max_depth` increases.
plt.figure(figsize=(12,8))
plt.title('Recall, ROC-AUC, and F1 for eta',fontsize=20)
# recall train
sns.lineplot(x=results['param_eta'],
y=results['mean_train_recall'],
errorbar=('ci',0),
label='Train Recall',
linestyle='--')
# AUC train
sns.lineplot(x=results['param_eta'],
y=results['mean_train_roc_auc'],
errorbar=('ci',0),
label='Train AUC',
linestyle='--')
# F1 train
sns.lineplot(x=results['param_eta'],
y=results['mean_train_f1'],
errorbar=('ci',0),
label='Train F1',
linestyle='--')
# recall val
sns.lineplot(x=results['param_eta'],
y=results['mean_test_recall'],
errorbar=('ci',0),
label='Val. Recall')
# AUC val
sns.lineplot(x=results['param_eta'],
y=results['mean_test_roc_auc'],
errorbar=('ci',0),
label='Val. AUC')
# F1 val
sns.lineplot(x=results['param_eta'],
y=results['mean_test_f1'],
errorbar=('ci',0),
label='Val. F1')
plt.xlabel('eta (learning rate)')
plt.ylabel('Mean Score')
plt.legend(loc='lower right')
plt.show()
As the learning rate increases, so too do all scores. Interestingly, recall and F1 start about matched (with `eta=0.05`) and then diverge as the learning rate increases. AUC values are consistently higher than precision and recall, though one should be careful when directly comparing these metrics.
Now we will tune the `gamma` parameter, which sets the minimum loss reduction required to make a further split; larger values yield more conservative trees. We use cross validation to prevent overfitting our model on the validation data.
plt.figure(figsize=(15,15))
for i in range(9):
# collect params and set gamma
w=xgb_tuned1.get_params()
w['gamma']=2*i
# convert to XGB DMatrix format
dmat=DMatrix(
Xt_over,
yt_over,
enable_categorical=True
)
# cv
a=XGB_CV(
params=w,
dtrain=dmat,
num_boost_round=250,
nfold=5,
metrics={'auc'}
)
# subplot
plt.subplot(3,3,i+1)
plt.title(f'gamma={2*i}')
plt.plot(np.arange(250),a['train-auc-mean'],label='train')
plt.plot(np.arange(250),a['test-auc-mean'],label='test')
plt.legend(loc='lower right')
clear_output()
plt.show()
The shape of the plot for `gamma=8` looks good: it scores high and levels off around 100 rounds, which matches the number of estimators set in our boosting model.
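Relatedly, rather than eyeballing where the curves level off, `xgboost.cv` can stop boosting automatically once the held-out metric stalls. A sketch reusing `w` and `dmat` from the loop above (the patience of 25 rounds is an arbitrary choice, not a tuned value):

# let cross validation pick the boosting-round count via early stopping
a=XGB_CV(
    params={**w,'gamma':8},
    dtrain=dmat,
    num_boost_round=500,
    nfold=5,
    metrics={'auc'},
    early_stopping_rounds=25   # stop if test AUC stalls for 25 rounds
)
print(f'Best number of rounds: {len(a)}')  # result is trimmed to the best iteration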
w=xgb_tuned1.get_params()
w['gamma']=8
xgb_tuned2=XGBClassifier(**w)
xgb_tuned2.fit(Xt_over,yt_over)
XGBClassifier(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=0.9, early_stopping_rounds=None, enable_categorical=False, eta=0.3, eval_metric=None, feature_types=None, gamma=8, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=4, max_leaves=None, min_child_weight=2, missing=nan, monotone_constraints=None, n_estimators=100, n_jobs=None, num_parallel_tree=None, predictor=None, ...)
ch(xgb_tuned2)
Val. Scores | |
---|---|
Accuracy | 0.807594 |
Precision | 0.119930 |
Recall | 0.609756 |
F1 | 0.200437 |
Gamma tuning has not made a perceptible difference.
# gamma-tuned model
get_scores(xgb_tuned2,sample='over',output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.892346 | 0.880714 | 0.863824 | 0.859327 | 0.140673 |
val | 0.609756 | 0.335611 | 0.200437 | 0.712749 | 0.192406 |
# tuned model with default gamma
get_scores(xgb_tuned1,sample='over',output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.892561 | 0.881539 | 0.865507 | 0.861304 | 0.138696 |
val | 0.614191 | 0.343077 | 0.206408 | 0.717797 | 0.186793 |
Comparing the first tuned XGBoost model with the gamma-tuned model, we see no improvement. If anything, scores have marginally slipped.
params={
'max_bin':[100,150,200],
'min_gain_to_split':[0.001,0.005,0.01],
'feature_fraction':np.linspace(0.5,0.9,4),
'max_depth':np.arange(3,10,2)
}
We will experiment first with different values of `max_bin`, which controls the bucketing behavior of the model. The default is `max_bin=255`; lower values can negatively impact training accuracy but do generally cut down on overfitting.
With `min_gain_to_split`, we require a minimum gain to split two nodes. The default value is zero, so this should hopefully produce more meaningful divisions.
We subsample features in training instances using `feature_fraction`. This combats overfitting.
The general behavior of LightGBM is to let trees grow without limit. Restricting `max_depth` will contain tree growth with the goal of preventing overfitting.
lg_tuned=LGBMClassifier(random_state=1)
go=GridSearchCV(
estimator=lg_tuned,
param_grid=params,
scoring=['recall','f1','roc_auc'],
refit='roc_auc',
cv=5,
n_jobs=-1,
verbose=1,
return_train_score=True
)
start=time.time()
go.fit(Xt_over,yt_over)
print(f'Fit completed in {np.round((time.time()-start)/60,2)} minutes.')
Fitting 5 folds for each of 144 candidates, totalling 720 fits Fit completed in 82.99 minutes.
best_lg=go.best_params_
best_lg
{'feature_fraction': 0.7666666666666666, 'max_bin': 200, 'max_depth': 9, 'min_gain_to_split': 0.001}
lg_tuned=LGBMClassifier(
random_state=1,
**best_lg
)
lg_tuned.fit(Xt_over,yt_over)
LGBMClassifier(feature_fraction=0.7666666666666666, max_bin=200, max_depth=9, min_gain_to_split=0.001, random_state=1)
ch(lg_tuned)
Val. Scores | |
---|---|
Accuracy | 0.579409 |
Precision | 0.069717 |
Recall | 0.780488 |
F1 | 0.128000 |
get_scores(lg_tuned,sample='over',output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.893285 | 0.882073 | 0.865774 | 0.861509 | 0.138491 |
val | 0.780488 | 0.256822 | 0.128000 | 0.675808 | 0.420591 |
results=go.cv_results_
# figure setup
plt.figure(figsize=(12,8))
plt.title('Mean Recall, ROC-AUC, and F1 for max_depth',fontsize=20)
# for each split/metric pair, average the CV means over all grid candidates
# sharing a max_depth value, then plot (training dashed, validation solid)
depths=np.arange(3,10,2)
for split,prefix,style in [('train','Train','--'),('test','Val.','-')]:
    for metric,name in [('recall','Recall'),('roc_auc','AUC'),('f1','F1')]:
        a=pd.DataFrame(
            data=np.array([
                results['param_max_depth'],
                results[f'mean_{split}_{metric}']
            ]).T,
            columns=['depth','']
        ).groupby('depth').mean().values
        plt.plot(depths,
                 a,
                 label=f'{prefix} {name}',
                 linestyle=style,
                 drawstyle='steps-mid')
# axes and legend
plt.xlabel('Maximum Tree Depth')
plt.ylabel('Mean Score')
plt.legend(loc='lower right')
plt.show()
To start, we see a jump in all scores as `max_depth` increases from 3 to 4. The gains continue as the parameter increases further, but the magnitude of the improvements tapers off.
The `is_unbalance` Parameter

We now modify a previously untouched LightGBM parameter. The `is_unbalance` flag adjusts the model-building process to account for unbalanced training data, so we can train on the non-oversampled data again.
lg1=LGBMClassifier(
random_state=1,
is_unbalance=True
)
lg1.fit(X_ts,y_train)
LGBMClassifier(is_unbalance=True, random_state=1)
get_scores(lg1,output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.922602 | 0.546308 | 0.338945 | 0.888749 | 0.142425 |
val | 0.587583 | 0.352816 | 0.220604 | 0.716797 | 0.164211 |
Training scores are good and validation scores are some of the best we've seen.
ch(lg1)
Val. Scores | |
---|---|
Accuracy | 0.835789 |
Precision | 0.135793 |
Recall | 0.587583 |
F1 | 0.220604 |
Note that the false-positive count is less than half that of the tuned LightGBM classifier above.
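For completeness, the manual counterpart to `is_unbalance` is `scale_pos_weight`, which weights positive examples by the negative-to-positive ratio; a quick sketch, not evaluated here:

# weight positives by the class ratio instead of setting the is_unbalance flag
ratio=(y_train==0).sum()/(y_train==1).sum()
lg_spw=LGBMClassifier(random_state=1,scale_pos_weight=ratio)
lg_spw.fit(X_ts,y_train)
get_scores(lg_spw,output='pandas')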
params={
'max_bin':[100,150,200,255],
'feature_fraction':np.linspace(0.5,0.9,4),
}
We will once again consider different values of `max_bin` and `feature_fraction`. This time, however, we will include `max_bin=255`, the LightGBM default, as one of the options.
lg1=LGBMClassifier(
random_state=1,
is_unbalance=True
)
go=GridSearchCV(
estimator=lg1,
param_grid=params,
scoring=['recall','f1','roc_auc'],
refit='roc_auc',
cv=5,
n_jobs=-1,
verbose=1,
return_train_score=True
)
start=time.time()
go.fit(X_ts,y_train)
print(f'Fit completed in {np.round((time.time()-start)/60,2)} minutes.')
Fitting 5 folds for each of 16 candidates, totalling 80 fits [LightGBM] [Warning] feature_fraction is set=0.5, colsample_bytree=1.0 will be ignored. Current value: feature_fraction=0.5 Fit completed in 5.85 minutes.
best_lg1=go.best_params_
best_lg1
{'feature_fraction': 0.5, 'max_bin': 255}
lg1=LGBMClassifier(
random_state=1,
is_unbalance=True,
**best_lg1
)
lg1.fit(X_ts,y_train)
LGBMClassifier(feature_fraction=0.5, is_unbalance=True, max_bin=100, random_state=1)
We fit the classifier with the updated parameters.
ch(lg1)
Val. Scores | |
---|---|
Accuracy | 0.835087 |
Precision | 0.136352 |
Recall | 0.594235 |
F1 | 0.221808 |
get_scores(lg1,output='pandas')
Recall | F_beta | F1 | AUC | 0-1_Loss | |
---|---|---|---|---|---|
train | 0.924501 | 0.544037 | 0.336386 | 0.888652 | 0.144361 |
val | 0.586475 | 0.352103 | 0.220141 | 0.716197 | 0.164343 |
No perceptible difference from the previous model; tuning has not led to great gains. Let's get more detail with some plots.
results=go.cv_results_
plt.figure(figsize=(12,8))
plt.title('Recall and ROC-AUC for Feature Fraction',fontsize=20)
# recall train
sns.lineplot(x=results['param_feature_fraction'],
y=results['mean_train_recall'],
errorbar=('ci',0),
label='Train Recall',
linestyle='--')
# AUC train
sns.lineplot(x=results['param_feature_fraction'],
y=results['mean_train_roc_auc'],
errorbar=('ci',0),
label='Train AUC',
linestyle='--')
# recall val
sns.lineplot(x=results['param_feature_fraction'],
y=results['mean_test_recall'],
errorbar=('ci',0),
label='Val. Recall')
# AUC val
sns.lineplot(x=results['param_feature_fraction'],
y=results['mean_test_roc_auc'],
errorbar=('ci',0),
label='Val. AUC')
plt.xlabel('Feature Fraction')
plt.ylabel('Mean Score')
plt.legend(loc='best')
plt.show()
We find that recall and AUC are nearly constant across values of `feature_fraction`, contrary to our prediction that a lower feature fraction would bring training and validation scores closer together.
plt.figure(figsize=(12,8))
plt.title('Recall and ROC-AUC for max_bin',fontsize=20)
# recall train
sns.lineplot(x=results['param_max_bin'],
y=results['mean_train_recall'],
errorbar=('ci',0),
label='Train Recall',
linestyle='--',
drawstyle='steps-mid')
# AUC train
sns.lineplot(x=results['param_max_bin'],
y=results['mean_train_roc_auc'],
errorbar=('ci',0),
label='Train AUC',
linestyle='--',
drawstyle='steps-mid')
# recall val
sns.lineplot(x=results['param_max_bin'],
y=results['mean_test_recall'],
errorbar=('ci',0),
label='Val. Recall',
drawstyle='steps-mid')
# AUC val
sns.lineplot(x=results['param_max_bin'],
y=results['mean_test_roc_auc'],
errorbar=('ci',0),
label='Val. AUC',
drawstyle='steps-mid')
plt.xlabel('Max Bin')
plt.ylabel('Mean Score')
plt.legend(loc='best')
plt.show()
We find a similar trend across values of `max_bin`. I suspect we're essentially maxed out on the performance of this classifier architecture.
plt.figure(figsize=(12,8))
plt.title('F1 for max_bin',fontsize=20)
# F1 train
sns.lineplot(x=results['param_max_bin'],
y=results['mean_train_f1'],
errorbar=('ci',0),
label='Train F1',
linestyle='--',
drawstyle='steps-post')
# F1 val
sns.lineplot(x=results['param_max_bin'],
y=results['mean_test_f1'],
errorbar=('ci',0),
label='Val. F1',
drawstyle='steps-post')
plt.xlabel('Max Bin')
plt.ylabel('Mean Score')
plt.legend(loc='best')
plt.show()
The pattern for F1 scores is also essentially constant.
With these data, boosting algorithms worked best.
With such prominent label imbalance in the target feature, models trained on the original data exhibited low-scoring recall, F1, and AUC.
Much better performance was achieved with models trained on oversampled data.
We tuned three boosting algorithms: AdaBoost, XGBoost, and LightGBM. All three offered slight gains at best, while still showing signs of weakness (such as overfitting).
The LightGBM model was also trained again on the non-oversampled data using a training flag for imbalanced data. Performance here was just as good as or better than the others, and training times were appreciably reduced. This model is my favorite.
I think it would be interesting to explore the performance of a neural network trained to classify these data. I need to think more about how training on oversampled data would affect this architecture.
Previously, I implemented both a stacking classifier and a voting classifier with the goal of increasing overall scores. The former overfit and the latter offered no gains, though we can learn from the latter: As the voting classifier did not yield further correct labelling, this indicates that our highest-scoring classifiers are usually predicting the same incorrect label for the same data. Moving forward, it would be interesting to tune several models to maximize precision. Then we could combine those models with some of the high-recall models trained here to hopefully sniff out more correct classifications.
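A sketch of that proposed pairing, with precise_model standing in for a hypothetical precision-tuned classifier we have not yet built:

# combine a high-recall model with a (placeholder) high-precision model
precise_model=XGBClassifier(random_state=1)  # stand-in: tune this for precision
combo=VotingClassifier(
    estimators=[
        ('high_recall',lg1),
        ('high_precision',precise_model)
    ],
    voting='soft'   # average predicted probabilities
)
combo.fit(X_ts,y_train)
get_scores(combo,output='pandas')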