Credit Card Fraud Detection using Extreme Gradient Boosting

Ekaba Bisong

1. Introduction

Credit card fraud detection is the science and art of detecting unusual activity in credit transactions. Fraud occurs when an individual's credit card information is stolen and used to make unauthorized purchases and/or withdrawals from the original holder's account.

The Data Science Approach to Fraud Detection

Learning from historical data forms the bedrock of the data science approach. Using statistical methods, we fit a classification model, that is, a function that learns to identify when a fraudulent transaction has occurred. The approach is typically divided into two perspectives, the supervised and the unsupervised.

In the supervised approach, the function is trained on previously labelled fraudulent and normal transactions, so that it learns what makes a transaction fraudulent. The trained function is then used to determine whether a new transaction is fraudulent. In the unsupervised approach, techniques such as anomaly detection are used to identify fraudulent transactions as outliers in the feature space. Future transactions that fall outside a particular region are flagged for further investigation as potentially fraudulent.
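
As a rough illustration of the unsupervised idea, the sketch below flags points that fall outside the region occupied by past transactions. It is not part of this project's pipeline: the data is synthetic, and the choice of Mahalanobis distance, the 99% cut-off and all names (`history`, `mahalanobis`, `threshold`) are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(42)
# synthetic stand-in for past (mostly normal) transactions with two features
history = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))

# describe the "normal" region by the mean and covariance of the history
mean = history.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(history, rowvar=False))

def mahalanobis(points):
    """Distance of each row of `points` from the centre of the history."""
    diff = np.atleast_2d(points) - mean
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, inv_cov, diff))

# assumed cut-off: the distance that contains 99% of the historical points
threshold = np.quantile(mahalanobis(history), 0.99)

# one point near the centre, one far outside the region
new_points = np.array([[0.1, -0.2], [6.0, 6.0]])
flags = mahalanobis(new_points) > threshold
print(flags)
```

Points flagged `True` would be passed on for further investigation rather than declared fraudulent outright.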

Several studies have investigated different machine learning approaches for building a computational model that identifies fraud. Techniques from the literature include artificial neural networks, Bayesian belief networks$^1$, and hidden Markov models$^2$.

Challenges to Credit Fraud Detection

A major challenge to credit fraud detection research is the limited availability of real-world data, which is restricted by privacy and legal concerns.

2. The Dataset

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. The dataset is made available on Kaggle.

Dataset Description

Description from Kaggle

The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.
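
To make the imbalance concrete, the short calculation below uses only the two counts quoted in the description. The second figure, the accuracy of a rule that labels every transaction as normal, is added here as an illustration of why plain accuracy is a poor yardstick for this dataset; the variable names are ours.

```python
# counts quoted in the dataset description
n_total = 284807
n_fraud = 492
n_normal = n_total - n_fraud

# share of fraudulent transactions, in percent
fraud_pct = 100.0 * n_fraud / n_total

# accuracy (in percent) of a trivial rule that always predicts "normal"
baseline_accuracy_pct = 100.0 * n_normal / n_total

print("Fraud share: %.4f%%" % fraud_pct)
print("Accuracy of always predicting normal: %.4f%%" % baseline_accuracy_pct)
```

A model therefore has to be judged on how well it separates the two classes, not on how often it is right overall.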

3. Methodology

In this project, we take the supervised approach to identifying fraud. We will train a function that maps the attributes of a transaction to a fraud/normal label. These attributes have already been transformed to their principal components, so little can be done in terms of feature engineering.

In [1]:
# import packages
import numpy as np
from numpy import interp  # scipy.interp is deprecated in newer SciPy releases; numpy.interp is the same function
import matplotlib.pyplot as plt
from itertools import cycle
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve, auc
In [2]:
# load the data
dataframe = pd.read_csv("creditcard.csv")
In [3]:
# get column names
colNames = dataframe.columns.values
array(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19',
       'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28',
       'Amount', 'Class'], dtype=object)
In [4]:
# get dataframe dimensions
print("Dimension of dataset: {}".format(dataframe.shape))
Dimension of dataset: (284807, 31)
In [5]:
# get attribute summaries
print(dataframe.describe())
                Time            V1            V2            V3            V4  \
count  284807.000000  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
mean    94813.859575  3.919560e-15  5.688174e-16 -8.769071e-15  2.782312e-15   
std     47488.145955  1.958696e+00  1.651309e+00  1.516255e+00  1.415869e+00   
min         0.000000 -5.640751e+01 -7.271573e+01 -4.832559e+01 -5.683171e+00   
25%     54201.500000 -9.203734e-01 -5.985499e-01 -8.903648e-01 -8.486401e-01   
50%     84692.000000  1.810880e-02  6.548556e-02  1.798463e-01 -1.984653e-02   
75%    139320.500000  1.315642e+00  8.037239e-01  1.027196e+00  7.433413e-01   
max    172792.000000  2.454930e+00  2.205773e+01  9.382558e+00  1.687534e+01   

                 V5            V6            V7            V8            V9  \
count  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
mean  -1.552563e-15  2.010663e-15 -1.694249e-15 -1.927028e-16 -3.137024e-15   
std    1.380247e+00  1.332271e+00  1.237094e+00  1.194353e+00  1.098632e+00   
min   -1.137433e+02 -2.616051e+01 -4.355724e+01 -7.321672e+01 -1.343407e+01   
25%   -6.915971e-01 -7.682956e-01 -5.540759e-01 -2.086297e-01 -6.430976e-01   
50%   -5.433583e-02 -2.741871e-01  4.010308e-02  2.235804e-02 -5.142873e-02   
75%    6.119264e-01  3.985649e-01  5.704361e-01  3.273459e-01  5.971390e-01   
max    3.480167e+01  7.330163e+01  1.205895e+02  2.000721e+01  1.559499e+01   

           ...                 V21           V22           V23           V24  \
count      ...        2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
mean       ...        1.537294e-16  7.959909e-16  5.367590e-16  4.458112e-15   
std        ...        7.345240e-01  7.257016e-01  6.244603e-01  6.056471e-01   
min        ...       -3.483038e+01 -1.093314e+01 -4.480774e+01 -2.836627e+00   
25%        ...       -2.283949e-01 -5.423504e-01 -1.618463e-01 -3.545861e-01   
50%        ...       -2.945017e-02  6.781943e-03 -1.119293e-02  4.097606e-02   
75%        ...        1.863772e-01  5.285536e-01  1.476421e-01  4.395266e-01   
max        ...        2.720284e+01  1.050309e+01  2.252841e+01  4.584549e+00   

                V25           V26           V27           V28         Amount  \
count  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05  284807.000000   
mean   1.453003e-15  1.699104e-15 -3.660161e-16 -1.206049e-16      88.349619   
std    5.212781e-01  4.822270e-01  4.036325e-01  3.300833e-01     250.120109   
min   -1.029540e+01 -2.604551e+00 -2.256568e+01 -1.543008e+01       0.000000   
25%   -3.171451e-01 -3.269839e-01 -7.083953e-02 -5.295979e-02       5.600000   
50%    1.659350e-02 -5.213911e-02  1.342146e-03  1.124383e-02      22.000000   
75%    3.507156e-01  2.409522e-01  9.104512e-02  7.827995e-02      77.165000   
max    7.519589e+00  3.517346e+00  3.161220e+01  3.384781e+01   25691.160000   

               Class  
count  284807.000000  
mean        0.001727  
std         0.041527  
min         0.000000  
25%         0.000000  
50%         0.000000  
75%         0.000000  
max         1.000000  

[8 rows x 31 columns]
In [6]:
# get class distribution
print("Normal transaction: {}".format(dataframe['Class'][dataframe['Class'] == 0].count()))  # class = 0
print("Fraudulent transaction: {}".format(dataframe['Class'][dataframe['Class'] == 1].count()))  # class = 1
Normal transaction: 284315
Fraudulent transaction: 492
In [7]:
# separate classes into different datasets
class0 = dataframe.query('Class == 0')
class1 = dataframe.query('Class == 1')

# randomize the datasets
class0 = class0.sample(frac=1)
class1 = class1.sample(frac=1)

Undersampling to deal with class imbalance

The examples of the majority class, in this case the normal transactions, drastically outnumber the incidences of fraud in our dataset. One of the strategies employed in the data science community is to delete instances from the over-represented class so that the learning function is not dominated by it. Here, we select 6000 instances of the normal class from the original 284315 records and keep all 492 fraud records.
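
The idea can also be sketched as a small reusable function. This is an illustration on synthetic labels, not the code used in this project: the function name `undersample`, its parameters and the fixed seed are assumptions, and it works on row indices so that it can be applied to any array of labels.

```python
import numpy as np

def undersample(labels, n_majority, majority_label=0, seed=7):
    """Return shuffled row indices that keep every minority row
    and a random subset of n_majority majority rows."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    majority_idx = np.flatnonzero(labels == majority_label)
    minority_idx = np.flatnonzero(labels != majority_label)
    # sample the majority class without replacement
    kept_majority = rng.choice(majority_idx, size=n_majority, replace=False)
    return rng.permutation(np.concatenate([kept_majority, minority_idx]))

# synthetic example: 990 normal rows (0) and 10 fraud rows (1)
labels = np.array([0] * 990 + [1] * 10)
idx = undersample(labels, n_majority=60)

print("rows kept:", len(idx))
print("fraud share after undersampling: %.3f" % labels[idx].mean())
```

Fixing the seed makes the selected subset reproducible from run to run, which a plain shuffle does not.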

In [8]:
# undersample majority class due to class imbalance before training - train
class0train = class0.iloc[0:6000]
class1train = class1

# combine the subsets of both classes into one more balanced dataframe
# (DataFrame.append was removed in pandas 2.0; pd.concat is the replacement)
train = pd.concat([class0train, class1train], ignore_index=True).values
In [9]:
# split data into X and y
X = train[:,0:30].astype(float)
Y = train[:,30]

4. The Learning Algorithm: XGBoost

Extreme Gradient Boosting, also known as XGBoost, is a gradient-boosted decision tree algorithm. We prefer this model for its execution speed and learning performance. It has been the algorithm of choice for various winners of Kaggle and KDD Cup competitions. It was originally written in C++ by Tianqi Chen, and its Python package, xgboost, provides a scikit-learn-compatible wrapper, XGBClassifier, which we use here.
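
For readers new to boosting, the sketch below shows the core idea in its simplest form: each round fits a small tree (here a one-split "stump") to the errors left by the previous rounds and adds a damped version of it to the running prediction. This is plain gradient boosting for squared error on a synthetic one-feature regression problem, not XGBoost itself, which adds regularisation, second-order gradient information and many engineering optimisations. All function names and settings here are assumptions made for the example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that best fits residual in squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so both sides are non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Gradient boosting for squared error with stumps as weak learners."""
    prediction = np.full_like(y, y.mean(), dtype=float)
    stumps, history = [], []
    for _ in range(n_rounds):
        residual = y - prediction  # negative gradient of 0.5 * (y - F)^2
        stump = fit_stump(x, residual)
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
        history.append(float(np.mean((y - prediction) ** 2)))
    return stumps, history

# synthetic data: a noisy sine wave
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

stumps, history = boost(x, y)
print("training MSE after round 1:  %.4f" % history[0])
print("training MSE after round 50: %.4f" % history[-1])
```

The training error shrinks round by round because every new stump is aimed at what the ensemble still gets wrong.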

In [10]:
# XGBoost CV model
model = XGBClassifier()
# recent scikit-learn versions require shuffle=True when random_state is set
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)

# use area under the ROC curve to measure classification performance
scoring = 'roc_auc'
results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
print("AUC: %.3f (%.3f)" % (results.mean(), results.std()))
AUC: 0.979 (0.011)

5. Plot the Result

In [11]:
# change size of Matplotlib plot
old_fig_params = list(plt.rcParams["figure.figsize"])  # keep a copy of the current size
plt.rcParams["figure.figsize"] = [12, 9]  # set new size (width, height in inches)
In [12]:
# plot roc-curve
# code adapted from
mean_tpr = 0.0
mean_fpr = np.linspace(0, 1, 100)

colors = cycle(['cyan', 'indigo', 'seagreen', 'yellow', 'blue', 'darkorange'])
lw = 2

i = 0
for (train_idx, test_idx), color in zip(kfold.split(X, Y), colors):
    # fit on the training folds, then score the held-out fold
    probas_ = model.fit(X[train_idx], Y[train_idx]).predict_proba(X[test_idx])
    # compute ROC curve and area under the curve
    fpr, tpr, thresholds = roc_curve(Y[test_idx], probas_[:, 1])
    mean_tpr += np.interp(mean_fpr, fpr, tpr)
    mean_tpr[0] = 0.0
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=lw, color=color,
             label='ROC fold %d (area = %0.2f)' % (i, roc_auc))

    i += 1
# diagonal: ROC curve of a classifier that guesses at random
plt.plot([0, 1], [0, 1], linestyle='--', lw=lw, color='k')

mean_tpr /= kfold.get_n_splits(X, Y)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
plt.plot(mean_fpr, mean_tpr, color='g', linestyle='--',
         label='Mean ROC (area = %0.2f)' % mean_auc, lw=lw)

plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")

6. Remarks

From the results above, our algorithm achieved a ROC AUC (i.e. area under the receiver operating characteristic curve) score of 0.979, averaged over the 10 cross-validation folds. The ROC curve is relatively insensitive to class imbalance, which is why we use ROC AUC as the evaluation metric for estimating the performance of our learning function.
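
One way to see why ROC AUC does not depend on the class ratio is its ranking interpretation: it equals the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen normal one. The sketch below computes it directly from that definition on a toy example. The function name `pairwise_roc_auc` and the toy scores are ours, and the quadratic loop is meant for illustration rather than for large datasets.

```python
def pairwise_roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half a correct pair."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# toy example: two normal (0) and two fraudulent (1) transactions
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc_balanced = pairwise_roc_auc(labels, scores)

# same scores, but with every normal transaction repeated ten times
labels_imbalanced = [0, 0] * 10 + [1, 1]
scores_imbalanced = [0.1, 0.4] * 10 + [0.35, 0.8]
auc_imbalanced = pairwise_roc_auc(labels_imbalanced, scores_imbalanced)

print(auc_balanced, auc_imbalanced)
```

Both calls return the same value, because repeating the normal transactions multiplies the number of correctly ranked pairs and the total number of pairs by the same factor.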

Further Remarks

Several other techniques can be explored and benchmarked:

  • Visualize transaction trends over time to gain further insight into fraudulent transactions
  • Research deep learning techniques for this problem, such as Recurrent Neural Networks that use the time component for sequence-to-sequence learning
  • Consider other learning options such as anomaly detection or change detection