#!/usr/bin/env python
# coding: utf-8

# # CAP5619 - AI for FinTech
# #### Spring 2023
# #### Final - Research Project
# #### Brandon Doey & Dr. Pieter De Jong

# ## BNPL Credit Worthiness ML application

# ### 1 - Problem Statement
# BNPL is primarily geared toward populations with limited banking access and low credit scores. A conventional credit application can take several days, and the interaction itself can ward off prospective borrowers. Our team wanted to investigate the feasibility of a more expeditious credit check for potential lenders and hypothesized the following:
# H1: Can a machine learning algorithm provide an instantaneous credit check with given user inputs?
# H2: Which of the applicable machine learning models has the highest accuracy?
#
# At the time of the midterm submission, we found strong evidence in support of H1 and established that an XGBoost model (Chen & Guestrin, 2016) with the given features could provide an almost instantaneous credit check for any potential borrower with 89.13% accuracy. Our goal for the final project was to find compelling evidence in support of H2, and we determined that the Keras Sequential model has an even higher accuracy score (98.07%).
#
# Ultimately, we will be able to predict a client's creditworthiness in real time through a Gradio (Gradio, 2023) application. At the final stage, the client will be presented with one of two outcomes based on their inputs: either they will be given a loan, or they will be referred to a web-based financial education application.

# ### 2 - Literature
# Buy Now Pay Later (BNPL) is a form of short-term financing that lets people pay for goods and services over time, typically without interest, by splitting a transaction into short-term installments. In addition to giving customers flexibility and accessibility, BNPL can benefit merchants by making it simpler and less expensive for customers to purchase their goods; this is especially helpful for smaller businesses that want to compete with larger retailers offering similar financing options. BNPL can also be a good option for customers with little or no credit history, because some providers do not require a credit check or a high credit score for approval.
#
# We used a synthetic version of the very popular "All Lending Club loan data" dataset, which is roughly 400MB in size (George, n.d.), to determine creditworthiness with a machine learning application. Since the data contained all the parameters and features necessary to determine creditworthiness, we decided to use the same data from the midterm in the final project. The goal of the project is to provide BNPL services to underbanked populations, i.e., Generation Z and millennials. Over the past few years, digital shopping websites have become increasingly popular with millennials, indicating that these customers are very familiar with digital transactions. Adoption will be limited mainly by the small number of customers who have not yet tried BNPL services. Current BNPL users are more likely to use mobile phone digital wallets to buy now and pay later. Mastercard, Visa, and American Express are just a few of the major credit card issuers investing in installment solutions and developing their own BNPL applications in response.
#
# Even though banks and regulators need to step in to make this growing market segment more sustainable, we are hoping that our BNPL application will positively impact it.
# There have been new developments in this field despite the absence of regulation in the United States. BNPL is becoming more and more popular in established online retail markets. In addition, retailers are forming alliances with established BNPL providers to broaden the scope of their credit options and meet growing customer demand. Virtual lease-to-own models, card-linked installment options, integrated shopping apps, and larger-ticket plays are examples of retail market business models (Garcia Alvarez, 2021). BNPL lending has expanded significantly and is expected to grow further. According to the most recent data from 2022, more borrowers had paid a late fee, and more than 70% of applicants had been approved for credit (Buy Now, Pay Later: Market trends and consumer impacts, 2022). BNPL products may also increase the likelihood of overextension because of their increased reliance on third-party data tracking, and borrowers may obtain multiple loans from the same lender at once. Additionally, sustained use of the BNPL model over months or years may affect customers' ability to meet their non-BNPL obligations (Buy Now, Pay Later: Market trends and consumer impacts, 2022).

# ### 3 - Modeling Type and Technique
#
# Our project focuses on using the Keras Sequential model, a deep learning technique that can accomplish various tasks such as prediction, classification, and regression, to analyze the creditworthiness of potential borrowers. The model follows a linear stack of layers, where each layer carries out a specific function on the input data and passes the results to the following layer. This adaptable architecture allows layers to be added or removed as needed.
#
# To start, we prepped the data to ensure that it was in the appropriate format for input into the model. We then partitioned the data into training and validation sets and used a portion of the data to train the model. During the training phase, the model adapts its parameters to minimize the error between the predicted and actual outputs.
#
# To find the best values for hyperparameters such as the learning rate, number of epochs, and batch size, we used hyperparameter tuning techniques. Additionally, we utilized methods to prevent overfitting, which occurs when the model is too complex and memorizes the training data instead of learning generalizable patterns from it.
#
# After completing the training process, we evaluated the model's performance on the validation set and made the adjustments necessary to enhance its accuracy. Once we were satisfied with the model's performance, we deployed it to predict the creditworthiness of potential borrowers in real time using a Gradio application.
#
# The Sequential model considers various borrower features, including employment status, income, credit history, and other pertinent details, to predict whether a potential borrower is eligible for a BNPL loan. Each layer in the model carries out a unique function to process this information, resulting in a quick, almost instantaneous credit check.
#
# We believe that the Keras Sequential model is a robust deep learning technique that can be used for multiple tasks, including creditworthiness prediction in our project.
# Through our training process and the fine-tuning of hyperparameters and regularization methods, we were able to achieve high accuracy in real-time predictions of borrower eligibility for BNPL loans. Our project highlights the potential of deep learning techniques for credit analysis and decision-making in financial institutions.
#
# #### How does Keras Sequential Modeling work?
# The Keras Sequential model is an implementation of a feedforward neural network (FNN) built on top of TensorFlow, in which each layer has a single input and a single output. Such networks are also known as multilayer perceptrons (MLPs) or deep FNNs.
#
# ##### Feedforward Neural Networks
# Like most neural networks, FNNs consist of layers, each containing multiple neurons. In our Sequential architecture, each successive layer contains fewer neurons than the previous one, down to the final output layer, which has a single neuron corresponding to the model's output. The layers of our Keras Sequential model are as follows:
# - Input Layer
#   - Receives the input features. The number of nodes in this layer corresponds directly to the dimensionality of the input data.
# - Hidden (Dense) Layers
#   - Our model has four hidden layers, which are responsible for extracting features from the input data. The number of neurons decreases successively, allowing the network to learn a hierarchy of features and representations (Keras Team, n.d.).
#   - Each hidden layer uses the Rectified Linear Unit (ReLU) activation function, which helps the network learn non-linear relationships between the input and output.
#   - Keras Sequential can use dense, convolutional, or recurrent layers as its hidden layers. Our approach uses dense layers, since we have no need for image, sound, or language processing.
# - Output Layer
#   - This is the last layer of the model. It contains a single neuron and produces the output through a sigmoid activation function.
#
# A standard FNN is a type of neural network in which the connections between nodes do not form a loop (DeepAI, 2019). Instead, information moves in only one direction: from the input layer, through the hidden layers, to the output layer. This linear flow of information is why they are referred to as feedforward networks.
#
# One of the key challenges our team faced in implementing Keras Sequential is that our implementation was required to be a deep neural network, but our original trimmed dataset was relatively simple. Because of this simplicity, adding layers for deep learning did not increase the accuracy of the model. By adding features back into the dataset, the deep Keras Sequential model was able to learn more complex relationships in the data than our ensemble model could.
#
# ##### Activation Functions
# Our implementation makes use of two activation functions.
# - Rectified Linear Unit (ReLU)
#   - Neurons in the dense layers of our model receive a weighted sum of the input data and apply the ReLU activation function, which passes the input value through unchanged if it is positive and outputs 0 if it is negative.
#   - ReLU introduces non-linearity, which helps the network model complex relationships in the data (Ramachandran et al., 2017).
# - Sigmoid
#   - The sigmoid activation function maps its input to a value between 0 and 1 and is used to produce the final output of the model.
#   - Given that ours is a binary classification problem, sigmoid is well suited for our output.
#
# ##### Loss Function
# Our model uses binary cross-entropy as its loss function. This loss function is commonly used for binary classification problems such as our loan repayment problem (Godoy, 2022). Binary cross-entropy measures the difference between the predicted probability distribution and the true probability distribution of the binary outcome. Through it, we account for the confidence of each prediction rather than simply counting the number of correct predictions.
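# For reference, the activation and loss functions described above can be written out explicitly (standard textbook formulations added here for clarity, not reproduced from our code). For a pre-activation input $x$, a true label $y_i \in \{0, 1\}$, and a predicted probability $\hat{y}_i$ over $N$ samples:
#
# $$\mathrm{ReLU}(x) = \max(0, x), \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$
#
# $$\mathrm{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$$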
# ##### Optimizer
# Our model uses the Adam optimizer, an extension of the stochastic gradient descent (SGD) optimizer. Like SGD, Adam updates the weights of the neural network based on the gradients of the loss function with respect to the weights. However, Adam incorporates adaptive learning rates and momentum, which can improve the speed and stability of the training process compared to traditional SGD.
#
# ##### Conclusion
#
# Keras' Sequential model proved to be a better approach for completing our project than our previous Extreme Gradient Boosting (XGBoost) model. Keras provided the framework that enabled us to construct and train a neural network with higher accuracy, 98.07%, compared to our accuracy score of 89.5% using XGBoost. We attribute the higher accuracy to the Sequential model being more effective at learning complex relationships within the input data: whereas XGBoost is well suited to structured data, our data proved to be complex and non-linear in nature.
#
# One challenge we encountered is that the model took longer to train than XGBoost, especially with large datasets such as ours, which covers loan information across different demographics. This can be mitigated with hardware such as GPUs or TPUs, which none of us had available. Fortunately, we had time on our hands and the ability to fail fast, testing other neural network models to evaluate our theory as well as their effectiveness.
#
# Neural networks in general are considered highly scalable and can handle very large datasets. So if new input data were to arrive in our case, adding more demographics, this would not be a challenge, because we can increase the number of layers in our model, which in turn increases the complexity of the relationships it can capture in the data.
#
# The Keras model makes use of two activation functions: the ReLU function for neurons in the dense layers and the sigmoid function for the output layer. The model uses binary cross-entropy as its loss function, which measures the difference between the predicted probability distribution and the true probability distribution of the binary outcome.
#
# The dataset we used with XGBoost for our midterm was relatively small. Since we took another approach for the final project, we added more data, which produced a small decrease in performance with the previous model. After testing other models, we also concluded that Keras is better suited to unstructured tasks involving images, audio, and text.
#
# As a result, the Keras Sequential model is a deep learning technique that can be used for various tasks, including creditworthiness prediction, as demonstrated in our final project.
# We put an emphasis on the importance of training and fine-tuning the model to achieve the highest level of accuracy in real-time predictions of borrower eligibility for BNPL loans. We believe our project highlights the potential of deep learning techniques for credit analysis and can make a meaningful impact within financial institutions and our everyday lives.

# ### 4 - Implementation
#
# 4.1 Import all packages
#
# 4.2 Read data
#
# 4.3 Create an instance of the classifier
#
# 4.4 Train the classifier
#
# 4.5 Check on the training set and visualize performance
#
# 4.6 Compute the prediction according to the model
#
# 4.7 Test and Validate
#
# 4.8 Check on the test set and visualize performance
#
# 4.9 Compute the evaluation metrics - accuracy, precision, recall, F1-score
#
# 4.10 Compute the confusion matrix - sensitivity & specificity

# ### 4.1 Import all packages

# In[1]:

# import all necessary packages
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score, roc_curve, roc_auc_score)
from sklearn import metrics
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.metrics import Precision, Recall
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# ### 4.2 Read Data

# In[2]:

# read in the cleaned csv file
df = pd.read_csv('/work/loan_data_clean_dnn.csv')

# convert all columns to int type
loan_data = df.astype(int)

# separate the features (X) from the target column loan_paid (y)
X = loan_data.drop('loan_paid', axis=1)
y = loan_data['loan_paid']

# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# split the training set into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# print the shapes of the datasets
print('X_train shape:', X_train.shape)
print('y_train shape:', y_train.shape)
print('X_test shape:', X_test.shape)
print('y_test shape:', y_test.shape)
print('X_val shape:', X_val.shape)
print('y_val shape:', y_val.shape)

# ### 4.3 Create an instance of the classifier

# In[3]:

# scale the features to the [0, 1] range and verify the shapes
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
X_val = scaler.transform(X_val)

print(X_train.shape)
print(X_test.shape)
print(X_val.shape)

# build the Sequential model with successively smaller dense layers
model = Sequential()
model.add(Dense(units=78, activation='relu'))
model.add(Dense(units=39, activation='relu'))
model.add(Dense(units=19, activation='relu'))
model.add(Dense(units=8, activation='relu'))
model.add(Dense(units=4, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy', Precision(name='precision'), Recall(name='recall')])

# ### 4.4 Train the classifier

# In[4]:

# train the classifier
keras_seq = model.fit(x=X_train,
                      y=y_train,
                      epochs=40,
                      batch_size=512,
                      validation_data=(X_val, y_val),
                      verbose=1)
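# Dropout is imported in 4.1 but is not used in the final architecture above. The cell below is a minimal sketch of how dropout layers and early stopping could be added to further guard against overfitting; the layer sizes mirror our model, but the dropout rate and patience values are illustrative assumptions rather than tuned settings, and this sketch is not the model whose results we report.

# In[ ]:

# illustrative sketch only: a regularized variant of the classifier (not used for the reported results)
from tensorflow.keras.callbacks import EarlyStopping

reg_model = Sequential()
reg_model.add(Dense(units=78, activation='relu'))
reg_model.add(Dropout(0.2))   # assumption: randomly drop 20% of activations during training
reg_model.add(Dense(units=39, activation='relu'))
reg_model.add(Dropout(0.2))
reg_model.add(Dense(units=19, activation='relu'))
reg_model.add(Dense(units=1, activation='sigmoid'))
reg_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# stop training once validation loss stops improving and restore the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# example usage (commented out so it does not interfere with the training run above):
# reg_model.fit(X_train, y_train, epochs=40, batch_size=512,
#               validation_data=(X_val, y_val), callbacks=[early_stop], verbose=1)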
# ### 4.5 Check on the training set and visualize performance

# In[5]:

# evaluate the model on the training set
train_metrics = model.evaluate(X_train, y_train, verbose=1)

# extract the individual metrics
train_loss, train_accuracy, train_precision, train_recall = train_metrics

# print the metrics
print("Train loss:", train_loss)
print("Train accuracy:", train_accuracy)
print("Train precision:", train_precision)
print("Train recall:", train_recall)

# In[6]:

# plot the training and validation loss
plt.plot(keras_seq.history['loss'], label='Training loss')
plt.plot(keras_seq.history['val_loss'], label='Validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Training and Validation Loss')
plt.legend()
plt.show()

# In[7]:

# plot the training and validation accuracy
plt.plot(keras_seq.history['accuracy'], label='Training accuracy')
plt.plot(keras_seq.history['val_accuracy'], label='Validation accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.title('Training and Validation Accuracy')
plt.legend()
plt.show()

# ### 4.6 Compute the prediction according to the model

# In[8]:

# compute predictions on the unseen (test) dataset
y_pred = (model.predict(X_test) > 0.5).astype("int32")

print('Test Accuracy: ', round(accuracy_score(y_test, y_pred)*100, 2), '%')
print('Test Precision: ', round(precision_score(y_test, y_pred)*100, 2), '%')
print('Test Recall: ', round(recall_score(y_test, y_pred)*100, 2), '%')
print('Test F1: ', round(f1_score(y_test, y_pred)*100, 2), '%')
print('Test ROC AUC: ', round(roc_auc_score(y_test, y_pred)*100, 2), '%')
print('Test Confusion Matrix: \n', confusion_matrix(y_test, y_pred))

# In[9]:

# visualize the performance on the test set
cm = metrics.confusion_matrix(y_test, y_pred)
plt.figure(figsize=(9, 9))
sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square=True, cmap='Blues_r')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
all_sample_title = 'Accuracy Score: {0}'.format(accuracy_score(y_test, y_pred))
plt.title(all_sample_title, size=15)

# In[10]:

# make predictions on the testing set
y_pred_class = (model.predict(X_test) > 0.5).astype("int32")

# flatten the predictions and format them as a compact string for comparison
pred = y_pred_class.tolist()
pred_1 = [i[0] for i in pred]
pred_2 = ''.join(str(pred_1).split(','))

# print true vs predicted
print('True:', y_test.values[0:25])
print('Pred:', pred_2[0:50])

# ### 4.7 Test and Validate

# In[11]:

# evaluate the model on the test set
test_metrics = model.evaluate(X_test, y_test, verbose=1)

# extract the individual metrics
test_loss, test_accuracy, test_precision, test_recall = test_metrics

# print the metrics
print("Test loss:", test_loss)
print("Test accuracy:", test_accuracy)
print("Test precision:", test_precision)
print("Test recall:", test_recall)

# In[12]:

y_pred_class = (model.predict(X_test) > 0.5).astype("int32")

print('Testing Accuracy score: ', round(accuracy_score(y_test, y_pred_class)*100, 2), '%')
print('Testing Precision score: ', round(precision_score(y_test, y_pred_class)*100, 2), '%')
print('Testing Recall score: ', round(recall_score(y_test, y_pred_class)*100, 2), '%')
print('Testing F1 score: ', round(f1_score(y_test, y_pred_class)*100, 2), '%')
print('Testing ROC AUC score: ', round(roc_auc_score(y_test, y_pred_class)*100, 2), '%')
print('Testing Confusion matrix: \n', confusion_matrix(y_test, y_pred_class))

# ### 4.8 Check on the test set and visualize performance

# In[13]:

# visualize the performance
cm = metrics.confusion_matrix(y_test, y_pred_class)
plt.figure(figsize=(9, 9))
sns.heatmap(cm, annot=True, fmt=".3f", linewidths=.5, square=True, cmap='Blues_r')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
all_sample_title = 'Accuracy Score: {0}'.format(accuracy_score(y_test, y_pred_class))
plt.title(all_sample_title, size=15)
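# Section 4.10 below calls out sensitivity and specificity, but the notebook only visualizes the confusion matrices. As a small illustrative addition (not part of the original analysis), both rates can be derived directly from the test-set confusion matrix computed above; the variable names below are new helpers.

# In[ ]:

# derive sensitivity (true positive rate) and specificity (true negative rate) from the test confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_class).ravel()
sensitivity = tp / (tp + fn)   # share of actual positives (fully paid loans) correctly identified
specificity = tn / (tn + fp)   # share of actual negatives (charged-off loans) correctly identified
print('Test sensitivity:', round(sensitivity * 100, 2), '%')
print('Test specificity:', round(specificity * 100, 2), '%')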
# In[14]:

# plot the test-set ROC curve
fpr, tpr, _ = roc_curve(y_test, y_pred_class)
auc = roc_auc_score(y_test, y_pred_class)

plt.figure()
plt.plot(fpr, tpr, label='ROC curve (AUC = %0.2f)' % auc)
plt.plot([0, 1], [0, 1], linestyle='--', color='r', label='Random classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.title('ROC curve for loan classifier')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()

# ### 4.9 Compute the evaluation metrics - accuracy, precision, recall, F1-score

# In[15]:

y_val_pred = (model.predict(X_val) > 0.5).astype("int32")

print('Validation accuracy score: ', round(accuracy_score(y_val, y_val_pred)*100, 2), '%')
print('Validation precision score: ', round(precision_score(y_val, y_val_pred)*100, 2), '%')
print('Validation recall score: ', round(recall_score(y_val, y_val_pred)*100, 2), '%')
print('Validation F1 score: ', round(f1_score(y_val, y_val_pred)*100, 2), '%')
print('Validation ROC AUC score: ', round(roc_auc_score(y_val, y_val_pred)*100, 2), '%')

# ### 4.10 Compute the confusion matrix - sensitivity & specificity

# In[16]:

# visualize the validation confusion matrix
cm = confusion_matrix(y_val, y_val_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Validation Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# In[17]:

# plot the validation ROC curve
fpr, tpr, thresholds = metrics.roc_curve(y_val, y_val_pred)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve for loan classifier')
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
plt.grid(True)
plt.show()

# ### 5 - Observations
# 5.1 Overview
# 5.2 Plot to show balanced dataset
# 5.3 Feature Importance
# 5.4 Plot Correlation Matrix
# 5.5 Scatterplot Open Accounts and Revolving Balance by Loan Status
# 5.6 Scatterplot loan size and associated late fees
# 5.7 Bar graph
# 5.8 Kernel Density Estimation Plot

# ### 5.1 Overview
# We chose to use a synthetic dataset based on aggregated LendingClub loan data, as we did for the midterm project. The dataset was obtained from Kaggle (George, n.d.). Unfortunately, due to its size, we were unable to work with it easily in any online Jupyter-style notebook. As a result, we downloaded the dataset, uncompressed it, and removed a large number of columns that we deemed unnecessary. We also dropped any record whose "loan_status" was anything other than "Fully Paid" (indicating the loan was paid until the balance was paid off) or "Charged Off" (indicating the borrower had failed to repay the loan). There were significantly more Fully Paid records than Charged Off records, resulting in an extremely imbalanced dataset. Knowing that this would have a negative impact on the accuracy of our model, we randomly dropped Fully Paid records to bring their count closer to the number of Charged Off records (Brownlee, 2018). We were aware of other methods for manually balancing datasets, such as the Synthetic Minority Oversampling Technique (SMOTE), but given the sheer size of the dataset, randomly removing Fully Paid records was a much simpler approach that still left us with plenty of data (over 500,000 records) to train our model on. Following these data engineering actions, a number of categorical columns were One-Hot Encoded and appended to the end of the dataset. The final step was to save the dataframe as a .csv file so that it could be written to disk and uploaded to our Deepnote Jupyter notebook.
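# The data engineering described above was performed offline, before this notebook was run. The cell below is a minimal sketch of those steps under stated assumptions: the raw file name, the dropped columns, and the downsampling details are illustrative placeholders rather than the exact code we used.

# In[ ]:

# illustrative sketch of the offline preprocessing described in 5.1 (paths and column lists are assumptions)
def prepare_loan_data(raw_path='lending_club_loans.csv', out_path='loan_data_clean_dnn.csv'):
    raw = pd.read_csv(raw_path, low_memory=False)

    # drop columns deemed unnecessary (placeholder list; the real selection was larger)
    raw = raw.drop(columns=['id', 'url', 'desc'], errors='ignore')

    # keep only completed loans and encode the target as 1 = Fully Paid, 0 = Charged Off
    raw = raw[raw['loan_status'].isin(['Fully Paid', 'Charged Off'])]
    raw['loan_paid'] = (raw['loan_status'] == 'Fully Paid').astype(int)
    raw = raw.drop(columns=['loan_status'])

    # randomly drop Fully Paid records to roughly balance the two classes
    paid = raw[raw['loan_paid'] == 1]
    charged_off = raw[raw['loan_paid'] == 0]
    paid_downsampled = paid.sample(n=len(charged_off), random_state=42)
    balanced = pd.concat([paid_downsampled, charged_off]).sample(frac=1, random_state=42)

    # one-hot encode the remaining categorical columns and append them to the dataset
    balanced = pd.get_dummies(balanced, drop_first=True)

    balanced.to_csv(out_path, index=False)
    return balanced

# example usage (commented out because the raw Kaggle file is not part of this notebook):
# clean_df = prepare_loan_data()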
# ### 5.2 Plot to show balanced dataset
# The following bar chart shows the balanced distribution of records for our target variable, "loan_paid". Here, a 1 represents a Fully Paid record and a 0 represents a Charged Off record.

# In[18]:

# visualize the target variable to confirm the dataset is balanced
sns.countplot(x='loan_paid', data=df)
plt.show()

# ### 5.3 Feature Importance
# This graph shows the correlation of each column in the dataset with our target variable, loan_paid. A negative correlation means that as one variable increases, the other decreases; a positive correlation means that as one variable increases, the other increases as well. In this example, the "last_fico_range_high" column has the strongest correlation with the "loan_paid" column, suggesting that as FICO credit scores decrease, the chance of the borrower defaulting on their loan increases.

# In[19]:

# check the correlation of each feature with the loan_paid column
plt.figure(figsize=(15, 7))
new_corr = df.corr().iloc[:-1, -1].sort_values()
new_corr.plot.bar(rot=90)
plt.show()

# ### 5.4 Plot Correlation Matrix
# The following correlation matrix (Kumar, 2022) shows a strong negative correlation between the "loan_paid" column and the "collections_12_mths_ex_med" column, suggesting that if a borrower had a high number of collections in the past year (excluding medical collections), there is a much higher chance that they will default on their loan.

# In[20]:

# plot the correlation matrix
plt.figure(figsize=(12, 10))
mask = np.zeros_like(df.corr())
mask[np.triu_indices_from(mask)] = True
sns.heatmap(df.corr(), cmap='viridis', mask=mask, annot=False, square=True)
plt.show()

# ### 5.5 Scatterplot Open Accounts and Revolving Balance by Loan Status
# The scatter plot below shows that borrowers with fewer open accounts tend to have higher revolving balances on those accounts. This could indicate a lower credit score for these users, since their total available credit would be smaller across all accounts. A high revolving balance could also mean these borrowers are overextended and may have trouble paying off their debts. The borrowers who ended up defaulting on their loans are highlighted in red below.

# In[21]:

# create a figure and an axis object
fig, ax = plt.subplots()
colors = {1: "green", 0: "red"}
ax.scatter(df["open_acc"], df["revol_bal"], c=df["loan_paid"].map(colors), alpha=0.1)
ax.set_xlabel("Open Accounts")
ax.set_ylabel("Revolving Balance")
ax.set_title("Relationship between Open Accounts and Revolving Balance by Loan Status")
plt.show()

# ### 5.6 Scatterplot loan size and associated late fees
# The graph below may help shed light on the correlation between loan size and the associated late fees. It is reasonable to assume that late fees would increase in proportion to the size of the loan, but this is not always the case. Indeed, as can be seen below, there is a spike in late fees when the loan amount reaches the $35,000 mark.

# In[22]:

# scatter plot that compares loan_amnt and total_rec_late_fee
sns.scatterplot(x='loan_amnt', y='total_rec_late_fee', data=df)
plt.show()

# ### 5.7 Bar graph
# The data below provides support for the hypothesis that loan_paid is inversely related to collections_12_mths_ex_med.
# More collections in the past 12 months (excluding medical collections) correlate with a lower likelihood of loan repayment. This may be an indication that these borrowers are having more money problems than average or that their credit ratings are poorer.

# In[23]:

# visualize the relationship between loan_paid and collections_12_mths_ex_med, grouped by loan_paid
sns.catplot(x='collections_12_mths_ex_med', y='loan_paid', data=df, kind='bar')
plt.show()

# ### 5.8 Kernel Density Estimation Plot
# Kernel density estimation (KDE) estimates the probability density function of a random variable (Malhotra, 2020). Applied to loan status and FICO score, the KDE plot helps us visualize the distribution of FICO scores for borrowers with different loan outcomes; on the FICO scale, 450 is an extremely low score and 850 an excellent one. The plot shows how the distribution of credit scores shifts across loan statuses. In short, the KDE plots of loan status and FICO score indicate that a higher FICO score is associated with a lower likelihood of loan default and a higher likelihood of paying off the loan completely. This information can be used to assess the creditworthiness of borrowers and make informed lending decisions. For instance, borrowers who defaulted on their loans show a peak in the distribution around a low credit score, indicating that borrowers with lower scores are more likely to default, while borrowers who repaid their loans show a peak around a high credit score, indicating that borrowers with higher scores are more likely to make their payments on time. As a result, by identifying the FICO score range associated with a higher risk of default, we can offer customized BNPL options to borrowers with varying degrees of creditworthiness.

# In[24]:

# Kernel Density Estimation plots of last_fico_range_high by loan outcome, with a legend
plt.figure(figsize=(12, 6))
sns.kdeplot(loan_data[loan_data['loan_paid'] == 1]['last_fico_range_high'], label='Fully Paid', shade=True)
sns.kdeplot(loan_data[loan_data['loan_paid'] == 0]['last_fico_range_high'], label='Charged Off', shade=True)
plt.xlabel('last_fico_range_high')
plt.ylabel('Density')
plt.legend()
plt.show()

# ### 6 - References
#
# Brownlee, J. (2018, April 25). How to Remove Outliers for Machine Learning. Machine Learning Mastery. https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/
#
# Buy Now, Pay Later: Market trends and consumer impacts. (2022, September 15). Consumer Financial Protection Bureau. https://www.consumerfinance.gov/data-research/research-reports/buy-now-pay-later-market-trends-and-consumer-impacts/
#
# DeepAI. (2019, May 17). Feed Forward Neural Network. Retrieved April 22, 2023, from https://deepai.org/machine-learning-glossary-and-terms/feed-forward-neural-network
#
# George, N. (n.d.). All Lending Club loan data. Kaggle. https://www.kaggle.com/datasets/wordsforthewise/lending-club
#
# Godoy, D. (2022, July 10). Understanding binary cross-entropy / log loss: A visual explanation. Towards Data Science (Medium). Retrieved April 19, 2023, from https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a
#
# Gradio. (2023). Build & Share Delightful Machine Learning Apps. Retrieved April 23, 2023, from https://gradio.app/
# Kumar, A. (2022, April). Correlation Concepts, Matrix & Heatmap using Seaborn. Vitalflux. https://vitalflux.com/correlation-heatmap-with-seaborn-pandas/amp/
#
# Malhotra, V. (2020, October 12). ML04: Kernel Density Estimation. Medium (Analytics Vidhya). https://medium.com/analytics-vidhya/ml04-kernel-density-estimation-ee29a1578d0c
#
# Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for activation functions. arXiv preprint arXiv:1710.05941.
#
# Keras Team. (n.d.). Keras: Simple. Flexible. Powerful. Retrieved April 20, 2023, from https://keras.io/