Example of classification of unbalanced datasets. Dataset https://www.kaggle.com/mlg-ulb/creditcardfraud from Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles).
!wget -O creditfraud.zip https://www.dropbox.com/s/tl20yp9bcl56oxt/creditcardfraud.zip?dl=0
--2019-10-19 03:48:26--  https://www.dropbox.com/s/tl20yp9bcl56oxt/creditcardfraud.zip?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.8.1
Connecting to www.dropbox.com (www.dropbox.com)|162.125.8.1|:443... connected.
HTTP request sent, awaiting response... 200 OK (after redirects)
Length: 69155672 (66M) [application/zip]
Saving to: ‘creditfraud.zip’
2019-10-19 03:48:28 (67.3 MB/s) - ‘creditfraud.zip’ saved [69155672/69155672]
!unzip creditfraud.zip
Archive:  creditfraud.zip
  inflating: creditcard.csv
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
dat=pd.read_csv('creditcard.csv')
dat.head()
 | Time | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 | V15 | V16 | V17 | V18 | V19 | V20 | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.0 | -1.359807 | -0.072781 | 2.536347 | 1.378155 | -0.338321 | 0.462388 | 0.239599 | 0.098698 | 0.363787 | 0.090794 | -0.551600 | -0.617801 | -0.991390 | -0.311169 | 1.468177 | -0.470401 | 0.207971 | 0.025791 | 0.403993 | 0.251412 | -0.018307 | 0.277838 | -0.110474 | 0.066928 | 0.128539 | -0.189115 | 0.133558 | -0.021053 | 149.62 | 0 |
1 | 0.0 | 1.191857 | 0.266151 | 0.166480 | 0.448154 | 0.060018 | -0.082361 | -0.078803 | 0.085102 | -0.255425 | -0.166974 | 1.612727 | 1.065235 | 0.489095 | -0.143772 | 0.635558 | 0.463917 | -0.114805 | -0.183361 | -0.145783 | -0.069083 | -0.225775 | -0.638672 | 0.101288 | -0.339846 | 0.167170 | 0.125895 | -0.008983 | 0.014724 | 2.69 | 0 |
2 | 1.0 | -1.358354 | -1.340163 | 1.773209 | 0.379780 | -0.503198 | 1.800499 | 0.791461 | 0.247676 | -1.514654 | 0.207643 | 0.624501 | 0.066084 | 0.717293 | -0.165946 | 2.345865 | -2.890083 | 1.109969 | -0.121359 | -2.261857 | 0.524980 | 0.247998 | 0.771679 | 0.909412 | -0.689281 | -0.327642 | -0.139097 | -0.055353 | -0.059752 | 378.66 | 0 |
3 | 1.0 | -0.966272 | -0.185226 | 1.792993 | -0.863291 | -0.010309 | 1.247203 | 0.237609 | 0.377436 | -1.387024 | -0.054952 | -0.226487 | 0.178228 | 0.507757 | -0.287924 | -0.631418 | -1.059647 | -0.684093 | 1.965775 | -1.232622 | -0.208038 | -0.108300 | 0.005274 | -0.190321 | -1.175575 | 0.647376 | -0.221929 | 0.062723 | 0.061458 | 123.50 | 0 |
4 | 2.0 | -1.158233 | 0.877737 | 1.548718 | 0.403034 | -0.407193 | 0.095921 | 0.592941 | -0.270533 | 0.817739 | 0.753074 | -0.822843 | 0.538196 | 1.345852 | -1.119670 | 0.175121 | -0.451449 | -0.237033 | -0.038195 | 0.803487 | 0.408542 | -0.009431 | 0.798278 | -0.137458 | 0.141267 | -0.206010 | 0.502292 | 0.219422 | 0.215153 | 69.99 | 0 |
#checking for null values
dat.isnull().sum().max()
0
The dataset is highly unbalanced:
dat['Class'].value_counts()/dat['Class'].count()
0    0.998273
1    0.001727
Name: Class, dtype: float64
sns.countplot(x='Class',data=dat)
We won't be using the "Time" variable.
dat = dat.drop(['Time'], axis=1)
dat.dtypes
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class       int64
dtype: object
We split the data in half: with a smaller test set there would be too few fraudulent cases to evaluate on.
X_train, X_test, y_train, y_test = train_test_split(dat.drop('Class', axis=1), dat['Class'], test_size=0.5, random_state=0)
Below we verify that both splits have the same distribution of the target variable as the initial set:
y_test.value_counts()/y_test.count()
0    0.998294
1    0.001706
Name: Class, dtype: float64
y_train.value_counts()/y_train.count()
0    0.998251
1    0.001749
Name: Class, dtype: float64
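The ratios match here by luck of the random shuffle. If you want them to match by construction, `train_test_split` accepts a `stratify` argument; a minimal sketch on synthetic labels with a similar fraud rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: ~0.2% positives, mimicking the fraud data
rng = np.random.RandomState(0)
y = (rng.rand(100_000) < 0.002).astype(int)
X = rng.randn(len(y), 3)

# stratify=y forces both splits to keep the original class ratio
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5,
                                          random_state=0, stratify=y)
print(y_tr.mean(), y_te.mean())  # nearly identical positive rates
```

With rare positives, stratification also guards against the pathological case where one split ends up with almost no fraud examples at all.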
from imblearn.over_sampling import SMOTE
os = SMOTE(random_state=0)
# fit_resample (formerly fit_sample) oversamples the minority class with synthetic points
os_X_train, os_y_train = os.fit_resample(X_train, y_train)
os_X_train = pd.DataFrame(data=os_X_train, columns=X_train.columns)
os_y_train = pd.DataFrame(data=os_y_train, columns=['Class'])
logreg = LogisticRegression(solver='lbfgs', max_iter=1000)  # explicit solver silences the FutureWarning
logreg.fit(os_X_train, os_y_train.values.ravel())  # ravel() converts the column vector to the expected 1-D shape
y_pred_log = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
Accuracy of logistic regression classifier on test set: 0.98
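Accuracy alone is misleading here: since roughly 99.8% of transactions are normal, a degenerate model that never flags fraud scores almost as well as the one above. A sketch with synthetic labels at the same fraud rate:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Synthetic test labels with the dataset's ~0.17% fraud rate
y_true = np.array([0] * 99827 + [1] * 173)

# A "classifier" that always predicts the majority class
y_always_zero = np.zeros_like(y_true)

# Accuracy looks excellent even though no fraud is ever caught
print(accuracy_score(y_true, y_always_zero))  # 0.99827
```

This is why the confusion matrices and F1 scores below matter more than raw accuracy.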
conf_matrix = metrics.confusion_matrix(y_test,y_pred_log)
#print(conf_matrix)
ax=plt.subplot()
sns.heatmap(conf_matrix,annot=True,ax=ax,fmt='g')#annot=True to annotate cells, fmt='g' numbers not scientific form
ax.set_xlabel('Predicted labels'); ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix');
ax.xaxis.set_ticklabels(['Normal', 'Fraud']); ax.yaxis.set_ticklabels(['Normal', 'Fraud']);
conf_matrix = metrics.confusion_matrix(y_test,y_pred_log)
conf_matrix=conf_matrix/y_test.size
#print(conf_matrix)
ax=plt.subplot()
sns.heatmap(conf_matrix,annot=True,ax=ax,fmt='g')#annot=True to annotate cells, fmt='g' numbers not scientific form
ax.set_xlabel('Predicted labels'); ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix');
ax.xaxis.set_ticklabels(['Normal', 'Fraud']); ax.yaxis.set_ticklabels(['Normal', 'Fraud']);
conf_matrix = metrics.confusion_matrix(y_test,y_pred_log)
conf_matrix = conf_matrix / np.array(y_test.value_counts().sort_index())[:, None]  # normalize each row by its true-class count (sort_index keeps rows aligned with labels 0, 1)
#print(conf_matrix)
ax=plt.subplot()
sns.heatmap(conf_matrix,annot=True,ax=ax,fmt='g')#annot=True to annotate cells, fmt='g' numbers not scientific form
ax.set_xlabel('Predicted labels'); ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix');
ax.xaxis.set_ticklabels(['Normal', 'Fraud']); ax.yaxis.set_ticklabels(['Normal', 'Fraud']);
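In newer scikit-learn versions (0.22+, an assumption about the installed version), `confusion_matrix` can do this row normalization itself via `normalize='true'`. A minimal sketch on toy labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels: 4 normal (0) and 2 fraud (1) cases
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])

# normalize='true' divides each row by the true-class count,
# matching the manual division by value_counts above
cm = confusion_matrix(y_true, y_pred, normalize='true')
print(cm)  # [[0.75 0.25], [0.5 0.5]]
```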
ROC curve
y_pred_prob_log = logreg.predict_proba(X_test)
# ROC AUC should be computed from predicted probabilities, not hard 0/1 labels
logit_roc_auc = metrics.roc_auc_score(y_test, y_pred_prob_log[:, 1])
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob_log[:, 1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Unbalanced dataset - Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
F1-score
metrics.f1_score(y_test, y_pred_log, average='macro')
0.575608917341048
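The macro F1 averages the per-class F1 scores; `classification_report` shows the per-class breakdown behind it. A sketch on toy labels (stand-ins for `y_test` / `y_pred_log`):

```python
import numpy as np
from sklearn.metrics import classification_report, f1_score

# Toy imbalanced ground truth and predictions: 95 normal, 5 fraud
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.array([0] * 93 + [1] * 2 + [1] * 4 + [0] * 1)

# Per-class precision, recall and F1 expose minority-class behaviour
print(classification_report(y_true, y_pred, target_names=['Normal', 'Fraud']))
print(f1_score(y_true, y_pred, average='macro'))
```

On imbalanced data the macro average is dominated by the minority class, which is exactly why it is a harsher and more honest score than accuracy here.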
# Import the model we are using
from sklearn.ensemble import RandomForestClassifier
# Instantiate model with 50 decision trees
rf = RandomForestClassifier(n_estimators = 50, random_state = 12,n_jobs= -1)
# Train the model on training data
rf.fit(os_X_train, os_y_train.values.ravel());  # ravel() avoids the column-vector warning
y_pred_rf = rf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred_rf))
Accuracy: 0.999522485323446
conf_matrix = metrics.confusion_matrix(y_test,y_pred_rf)
ax=plt.subplot()
sns.heatmap(conf_matrix,annot=True,ax=ax,fmt='g')#annot=True to annotate cells, fmt='g' numbers not scientific form
ax.set_xlabel('Predicted labels'); ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix');
ax.xaxis.set_ticklabels(['Normal', 'Fraud']); ax.yaxis.set_ticklabels(['Normal', 'Fraud']);
y_pred_prob_rf = rf.predict_proba(X_test)
# Use predicted probabilities rather than hard predictions for the AUC
logit_roc_auc = metrics.roc_auc_score(y_test, y_pred_prob_rf[:, 1])
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob_rf[:, 1])
plt.figure()
plt.plot(fpr, tpr, label='Random Forest (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Unbalanced dataset - Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('RF_ROC')  # separate file name so the logistic ROC plot is not overwritten
plt.show()
metrics.f1_score(y_test, y_pred_rf, average='macro')
0.9262873537735853
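For data this imbalanced, the precision-recall curve and average precision are often more informative than ROC AUC, because they ignore the enormous true-negative count. A sketch on synthetic, well-separated scores:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Synthetic labels (~2% positives) and scores where positives rank higher
rng = np.random.RandomState(0)
y_true = (rng.rand(5000) < 0.02).astype(int)
scores = rng.rand(5000) * 0.5 + y_true * 0.5  # positives shifted upward

precision, recall, thresholds = precision_recall_curve(y_true, scores)
ap = average_precision_score(y_true, scores)
print(ap)
```

Plotting `recall` against `precision` (instead of `fpr` against `tpr`) would give the PR analogue of the ROC figures above.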
# Get numerical feature importances
importances = list(rf.feature_importances_)
# List of (variable, importance) tuples
feature_importances = [(feature, round(importance, 2))
                       for feature, importance in zip(dat.drop('Class', axis=1).columns, importances)]
# Sort the feature importances, most important first
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)
# Print out the features and importances
[print('Variable: {:20} Importance: {}'.format(*pair)) for pair in feature_importances];
Variable: V14                  Importance: 0.24
Variable: V10                  Importance: 0.16
Variable: V4                   Importance: 0.11
Variable: V12                  Importance: 0.08
Variable: V17                  Importance: 0.08
Variable: V3                   Importance: 0.07
Variable: V11                  Importance: 0.04
Variable: V16                  Importance: 0.04
Variable: V2                   Importance: 0.03
Variable: V7                   Importance: 0.02
Variable: V9                   Importance: 0.02
Variable: V1                   Importance: 0.01
Variable: V5                   Importance: 0.01
Variable: V6                   Importance: 0.01
Variable: V8                   Importance: 0.01
Variable: V18                  Importance: 0.01
Variable: V19                  Importance: 0.01
Variable: V21                  Importance: 0.01
Variable: V23                  Importance: 0.01
Variable: V28                  Importance: 0.01
Variable: Amount               Importance: 0.01
Variable: V13                  Importance: 0.0
Variable: V15                  Importance: 0.0
Variable: V20                  Importance: 0.0
Variable: V22                  Importance: 0.0
Variable: V24                  Importance: 0.0
Variable: V25                  Importance: 0.0
Variable: V26                  Importance: 0.0
Variable: V27                  Importance: 0.0
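A pandas Series gives the same sorted listing in one line. A sketch on a synthetic stand-in dataset (`make_classification`, hypothetical column names `f0`..`f5`), since the real columns are V1..V28 and Amount:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in for the fraud features
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=12)
cols = [f'f{i}' for i in range(6)]
rf_demo = RandomForestClassifier(n_estimators=50, random_state=12).fit(X, y)

# Indexing importances by column name makes sorting and display a one-liner
imp = pd.Series(rf_demo.feature_importances_, index=cols).sort_values(ascending=False)
print(imp)
```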
y_pred=np.maximum(y_pred_log,y_pred_rf)
conf_matrix = metrics.confusion_matrix(y_test,y_pred)
ax=plt.subplot()
sns.heatmap(conf_matrix,annot=True,ax=ax,fmt='g')#annot=True to annotate cells, fmt='g' numbers not scientific form
ax.set_xlabel('Predicted labels'); ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix');
ax.xaxis.set_ticklabels(['Normal', 'Fraud']); ax.yaxis.set_ticklabels(['Normal', 'Fraud']);
os_X_train.shape
(284308, 29)
y_pred_prob=(y_pred_prob_log[:,1]+y_pred_prob_rf[:,1])/2
y_pred_prob.shape
y_pred = (y_pred_prob >= 0.5).astype(int)  # threshold the averaged probabilities
pd.DataFrame(data=y_pred,columns=['Class'])['Class'].value_counts()
0    141932
1       472
Name: Class, dtype: int64
conf_matrix = metrics.confusion_matrix(y_test,y_pred)
ax=plt.subplot()
sns.heatmap(conf_matrix,annot=True,ax=ax,fmt='g')#annot=True to annotate cells, fmt='g' numbers not scientific form
ax.set_xlabel('Predicted labels'); ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix');
ax.xaxis.set_ticklabels(['Normal', 'Fraud']); ax.yaxis.set_ticklabels(['Normal', 'Fraud']);
from xgboost import XGBClassifier
# fit model on training data
xgb = XGBClassifier()
xgb.fit(os_X_train, os_y_train.values.ravel())
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=None, n_estimators=100, n_jobs=1, nthread=None, objective='binary:logistic', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None, silent=None, subsample=1, verbosity=1)
y_pred_xgb = xgb.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred_xgb))
Accuracy: 0.9945366703182494
conf_matrix = metrics.confusion_matrix(y_test,y_pred_xgb)
ax=plt.subplot()
sns.heatmap(conf_matrix,annot=True,ax=ax,fmt='g')#annot=True to annotate cells, fmt='g' numbers not scientific form
ax.set_xlabel('Predicted labels'); ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix');
ax.xaxis.set_ticklabels(['Normal', 'Fraud']); ax.yaxis.set_ticklabels(['Normal', 'Fraud']);
# Element-wise max across all three predictions; note that np.maximum(a, b, c)
# would treat c as the out= argument, not a third input
y_pred = np.maximum.reduce([y_pred_log, y_pred_rf, y_pred_xgb])
conf_matrix = metrics.confusion_matrix(y_test,y_pred)
ax=plt.subplot()
sns.heatmap(conf_matrix,annot=True,ax=ax,fmt='g')#annot=True to annotate cells, fmt='g' numbers not scientific form
ax.set_xlabel('Predicted labels'); ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix');
ax.xaxis.set_ticklabels(['Normal', 'Fraud']); ax.yaxis.set_ticklabels(['Normal', 'Fraud']);
y_pred_prob_xgb=xgb.predict_proba(X_test)
y_pred_prob=(y_pred_prob_log[:,1]+y_pred_prob_rf[:,1]+y_pred_prob_xgb[:,1])/3
y_pred_prob.shape
y_pred = (y_pred_prob >= 0.5).astype(int)  # threshold the averaged probabilities
pd.DataFrame(data=y_pred,columns=['Class'])['Class'].value_counts()
0    141787
1       617
Name: Class, dtype: int64
conf_matrix = metrics.confusion_matrix(y_test,y_pred)
ax=plt.subplot()
sns.heatmap(conf_matrix,annot=True,ax=ax,fmt='g')#annot=True to annotate cells, fmt='g' numbers not scientific form
ax.set_xlabel('Predicted labels'); ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix');
ax.xaxis.set_ticklabels(['Normal', 'Fraud']); ax.yaxis.set_ticklabels(['Normal', 'Fraud']);
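The manual probability averaging above is soft voting; scikit-learn's `VotingClassifier` packages the same idea. A sketch on synthetic imbalanced data (not the fraud set itself):

```python
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: ~5% positives
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# voting='soft' averages predict_proba across the base models,
# equivalent to the manual (p_log + p_rf + p_xgb) / 3 ensemble above
ens = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(n_estimators=50, random_state=0))],
    voting='soft')
ens.fit(X_tr, y_tr)
acc = ens.score(X_te, y_te)
print(acc)
```

`VotingClassifier` also accepts a `weights` argument, which corresponds to the weighted average experimented with earlier.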