#!/usr/bin/env python
# coding: utf-8
#
#
# # UFC MMA Predictor Workflow
# ## by Jason Chan Jin An
#
# [**Github**](https://www.github.com/jasonchanhku) || [**LinkedIn**](https://www.linkedin.com/in/jason-chan-jin-an-45a76a76/) || [**Email**](mailto:jasonchanhku@gmail.com)
#
# # Introduction
#
# This is the workflow and backend process of the **UFC MMA Predictor Webapp** I built (https://ufcmmapredictor.herokuapp.com/). This Jupyter Notebook covers the following:
#
# * Introduction
# * Background
# * Objective
# * Data Requirements
# * Web Scraping
# * Data Cleansing and Blending
# * Finalized Dataset
# * Libraries Used
# * Exploratory Data Analysis (EDA)
# * Statistical Overview
# * Heatmap and Correlation
# * Statistical Tests (T-test)
# * Distribution Plots
# * Feature Selection
# * Feature Importance
# * Modelling the Data
# * Logistic Regression
# * Random Forest
# * Neural Network
# * Conclusion
# * Improvements
# * Citation
# * Collaboration & Sponsorship
# ## Background
#
# This web app is the outcome of being a full-time Data Scientist and a hardcore UFC fan. As a hardcore UFC fan, it has always been a challenge to predict the winner of a fight. The winner is either the **Favourite** or the **Underdog**. However, there are times when my predictions would go horribly wrong. Being the curious person I am, I found myself asking questions such as:
#
# * How often do favourites triumph over underdogs?
# * Do fighters with better fighting stats always win?
# * What are the most important skills that determine the winner? Is it striking? Wrestling? BJJ?
# * How has the sport of MMA evolved? Do fights go the distance more often?
#
# All these questions led me to build a web app that uses machine learning to predict the winner of a fight. The problem is framed as a **binary classification** task with the label **Favourite or Underdog**. The app then serves as a validation point for my own predictions.
# ## Objective
# The objective of this data science project is to build a model that:
# * Predicts with better than 50% accuracy (better than randomly picking either of the two fighters as the winner)
# * Predicts better than always choosing the favourite (roughly 60% accuracy; see the quick check below)
#
# By satisfying these two objectives, I believe my web app can add some serious value.
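#
# As a quick sanity check of these two baselines, both can be computed directly from the cleansed dataset that is loaded later in this notebook (same URL as in the Finalized Dataset section). This is only a rough sketch:
# In[ ]:
import pandas as pd

# Baseline accuracies the model has to beat
baseline_df = pd.read_csv('https://raw.githubusercontent.com/jasonchanhku/UFC-MMA-Predictor/master/Datasets/Cleansed_Data.csv')
random_baseline = 0.5                                                # coin flip between the two fighters
favourite_baseline = (baseline_df['Label'] == 'Favourite').mean()    # always pick the favourite
print(random_baseline, favourite_baseline)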
# # Data Requirements
#
# For this project, the following two datasets are needed; both were scraped from public sources:
#
# ## UFC Fighters Database
#
# Dataset that contains fight stats of all fighters in the UFC
#
# Source(s):
# * http://www.fightmetric.com/statistics/fighters
#
# Web scraping code(s):
# * https://github.com/jasonchanhku/web_scraping/blob/master/MMA%20Project/MMA%20fighters%20database.R
#
# ### Dataset Preview
# In[2]:
import pandas as pd
fighter_db = pd.read_csv('https://raw.githubusercontent.com/jasonchanhku/UFC-MMA-Predictor/master/Datasets/UFC_Fighters_Database.csv')
fighter_db.head()
# ## UFC Fights History
#
# Dataset that contains the fight history of each fight card together with **fight odds**. Since the odds later prove to be the most important variable, having odds for each fight matters more than covering every UFC fight ever held.
#
# As odds are only available on www.betmma.tips from **UFC 159** onwards, the dataset only contains fights from **UFC 159** to **UFC 211**.
#
# Source(s):
# * http://www.fightmetric.com/statistics/events/completed
# * http://www.betmma.tips/mma_betting_favorites_vs_underdogs.php
#
# Web scraping code(s):
# * https://github.com/jasonchanhku/web_scraping/blob/master/MMA%20Project/MMA%20events%20database.R
# * https://github.com/jasonchanhku/web_scraping/blob/master/MMA%20Project/favourite_vs_underdogs.R
#
# ### Dataset Preview
# In[14]:
fights_db = pd.read_csv('https://raw.githubusercontent.com/jasonchanhku/UFC-MMA-Predictor/master/Datasets/UFC_Fights.csv')
fights_db.head()
# ## Data Cleansing and Blending
#
# The two datasets above were cleansed and blended together using the following process.
#
# ### Feature Mapping
#
# Note that each feature `x` is the difference between the Favourite and the Underdog. Hence, if the feature is positive, the favourite fighter has an advantage over the underdog for that feature.
#
#
#
# $\text{Feature } X_{i} = X_{favourite} - X_{underdog}$
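#
# As an illustration of this mapping, here is a minimal sketch built on hypothetical raw columns (the `_fav` / `_und` names are placeholders, not the actual column names produced by the scraping scripts):
# In[ ]:
# Hypothetical pre-blend rows: one stat column per corner
raw = pd.DataFrame({'SLPM_fav': [4.50, 3.20], 'SLPM_und': [3.90, 4.10]})

# Feature mapping: favourite stat minus underdog stat
raw['SLPM_delta'] = raw['SLPM_fav'] - raw['SLPM_und']
raw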
# # Finalized Dataset
#
# The following are the response variable and the 10 features used in the dataset. Note that each feature carries a **delta** suffix because it underwent the feature mapping stated above.
#
# * Label - The response variable; either the Favourite or the Underdog wins
# * REACH - Fighter's reach (probably the least important feature)
# * SLPM - Significant Strikes Landed per Minute
# * STRA - Significant Striking Accuracy
# * SAPM - Significant Strikes Absorbed per Minute
# * STRD - Significant Strike Defence (the % of opponents' strikes that did not land)
# * TD - Average Takedowns Landed per 15 minutes
# * TDA - Takedown Accuracy
# * TDD - Takedown Defense (the % of opponents' TD attempts that did not land)
# * SUBA - Average Submissions Attempted per 15 minutes
# * Odds - Fighter's decimal odds spread for that specific matchup
# In[19]:
df = pd.read_csv('https://raw.githubusercontent.com/jasonchanhku/UFC-MMA-Predictor/master/Datasets/Cleansed_Data.csv')
df = df.drop('Sum_delta', axis=1)
df.head()
# # Libraries Used
# In[289]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
import scipy.stats as stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score, cross_val_predict
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score, classification_report, make_scorer, accuracy_score
import warnings
import time
warnings.filterwarnings('ignore')
get_ipython().run_line_magic('matplotlib', 'inline')
# Progress bar
def log_progress(sequence, every=None, size=None, name='Items'):
    from ipywidgets import IntProgress, HTML, VBox
    from IPython.display import display

    is_iterator = False
    if size is None:
        try:
            size = len(sequence)
        except TypeError:
            is_iterator = True
    if size is not None:
        if every is None:
            if size <= 200:
                every = 1
            else:
                every = int(size / 200)  # every 0.5%
    else:
        assert every is not None, 'sequence is iterator, set every'

    if is_iterator:
        progress = IntProgress(min=0, max=1, value=1)
        progress.bar_style = 'info'
    else:
        progress = IntProgress(min=0, max=size, value=0)
    label = HTML()
    box = VBox(children=[label, progress])
    display(box)

    index = 0
    try:
        for index, record in enumerate(sequence, 1):
            if index == 1 or index % every == 0:
                if is_iterator:
                    label.value = '{name}: {index} / ?'.format(
                        name=name,
                        index=index
                    )
                else:
                    progress.value = index
                    label.value = u'{name}: {index} / {size}'.format(
                        name=name,
                        index=index,
                        size=size
                    )
            yield record
    except:
        progress.bar_style = 'danger'
        raise
    else:
        progress.bar_style = 'success'
        progress.value = index
        label.value = "{name}: {index}".format(
            name=name,
            index=str(index or '?')
        )
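# `log_progress` is just a convenience wrapper for long-running loops (e.g. scraping or repeated model fits). A minimal, purely illustrative usage sketch:
# In[ ]:
# Wrap any iterable to get a live progress bar in the notebook
for _ in log_progress(range(1000), every=10, name='Rows'):
    pass  # placeholder for per-item work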
# Creating Dummies
def create_dummies(df, column_name):
    """Create Dummy Columns (One Hot Encoding) from a single Column

    Usage
    ------
    train = create_dummies(train, "Age")
    """
    dummies = pd.get_dummies(df[column_name], prefix=column_name)
    df = pd.concat([df, dummies], axis=1)
    return df
# # Exploratory Data Analysis (EDA)
#
# ## Statistical Overview
#
# From the **finalized dataset**, we know that:
#
# * There are 1,315 rows, i.e. 1,315 historical fights in the dataset
# * Roughly 62% of Favourites win over Underdogs
# * On average, Favourites that win hold an advantage in every feature over the underdog. They get hit less and strike more accurately, making Favourite winners more efficient than Underdog winners
# * Meanwhile, Underdog winners historically take more hits and are less efficient on average, yet somehow end up winning. Could this be **luck** from landing a sudden KO or submission?
# In[18]:
# Shape of df
df.shape
# In[17]:
# Data types of each column
df.dtypes
# In[34]:
# What percentage of Favourite fighters win?
df['Label'].value_counts()
# In[38]:
a = df['Label'].value_counts()/len(df)
a
# In[66]:
a.plot(kind='bar', rot=0)
# In[40]:
# Statistical overview of dataset
df.describe()
# In[68]:
# Does the mean of each feature distinguish Favourite winners from Underdog winners?
# Does a specific feature advantage give underdog winners an edge?
df.groupby('Label').mean().plot(kind = 'bar', subplots=True, layout=(5,2), legend=False, figsize=(25,20), fontsize=20, rot=0)
# ## Correlation Matrix and Heatmap
# From the correlation matrix, we know that:
# * Predictors are not strongly correlated with each other, so the risk of multicollinearity is low and regression can be applied without much worry
# * Strikes landed and striking defence are positively correlated with the favourite winning
# In[114]:
# Correlation Matrix: one-hot encode the Label so it can be correlated with the numeric features
df_corr = create_dummies(df, 'Label').drop('Label_Underdog', axis=1)
corr = df_corr.corr()
corr
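# To see which individual feature relates most strongly to a favourite win (referenced again in the t-test section below), the correlations with `Label_Favourite` can be pulled out and sorted. A minimal sketch, assuming the correlation matrix `corr` from the cell above:
# In[ ]:
# Correlation of each delta feature with the favourite winning, strongest first
corr['Label_Favourite'].drop('Label_Favourite').sort_values(ascending=False)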
# In[115]:
plt.figure(figsize=(15, 10))
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
# ## One Sample T-test (Measuring STRD_delta)
#
# A one-sample t-test checks whether a sample mean differs from the population mean. Since STRD_delta has the highest correlation with the dependent variable 'Label_Favourite', let's test to see whether the average STRD_delta of Favourite and Underdog winners differs significantly.
#
# Hypothesis testing: is there a significant difference in the **mean of STRD_delta** between favourite winners and underdog winners?
#
#
#
#
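# A minimal sketch of how this t-test could be run with `scipy.stats` (assuming `df` is the cleansed dataset loaded above and the group labels are 'Favourite' / 'Underdog'):
# In[ ]:
# Split STRD_delta by who actually won the fight
strd_fav = df[df['Label'] == 'Favourite']['STRD_delta']
strd_und = df[df['Label'] == 'Underdog']['STRD_delta']

# One-sample t-tests: does each group's mean differ from the overall population mean?
pop_mean = df['STRD_delta'].mean()
print(stats.ttest_1samp(strd_fav, pop_mean))
print(stats.ttest_1samp(strd_und, pop_mean))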