#!/usr/bin/env python
# coding: utf-8

# # UFC MMA Predictor Workflow
# ## by Jason Chan Jin An
#
# [**Github**](https://www.github.com/jasonchanhku) || [**LinkedIn**](https://www.linkedin.com/in/jason-chan-jin-an-45a76a76/) || [**Email**](mailto:jasonchanhku@gmail.com)

# # Introduction
#
# This is the workflow and backend process of the **UFC MMA Predictor Webapp** I built (https://ufcmmapredictor.herokuapp.com/). This Jupyter Notebook highlights the following:
#
# * Introduction
#     * Background
#     * Objective
# * Data Requirements
#     * Web Scraping
#     * Data Cleansing and Blending
#     * Finalized Dataset
# * Libraries Used
# * Exploratory Data Analysis (EDA)
#     * Statistical Overview
#     * Heatmap and Correlation
#     * Statistical Tests (T-test)
#     * Distribution Plots
# * Feature Selection
#     * Feature Importance
# * Modelling the Data
#     * Logistic Regression
#     * Random Forest
#     * Neural Network
# * Conclusion
# * Improvements
# * Citation
# * Collaboration & Sponsorship

# ## Background
#
# This web app is the outcome of being a full-time Data Scientist and a hardcore UFC fan. As a fan, I have always found it a challenge to predict the winner of a fight. The winner is either the **Favourite** or the **Underdog**. However, there are times when my predictions go horribly wrong. Being the curious person I am, I found myself asking questions such as:
#
# * How often do favourites triumph over underdogs?
# * Do fighters with better fighting stats always win?
# * What are the most important skills that determine the winner? Is it striking? Wrestling? BJJ?
# * How has the MMA sport evolved? Do fights go the distance more?
#
# All these questions led me to build a web app that uses machine learning to predict the winner of a fight. This is essentially a **binary classification** problem with the label **Favourite or Underdog**. The app then serves as a validation point for my own predictions.

# ## Objective
# The objective of this data science project is to build a model that:
# * Predicts with better than 50% accuracy (better than randomly picking either of the two fighters as the winner)
# * Predicts better than always choosing the favourite (roughly 60%)
#
# By satisfying these two objectives, I believe my web app can add some serious value.

# # Data Requirements
#
# For this project, the following two datasets are needed. Both were scraped from public sources:
#
# ## UFC Fighters Database
#
# Dataset that contains the fight stats of all fighters in the UFC.
#
# Source(s):
# * http://www.fightmetric.com/statistics/fighters
#
# Web scraping code(s):
# * https://github.com/jasonchanhku/web_scraping/blob/master/MMA%20Project/MMA%20fighters%20database.R
#
# ### Dataset Preview

# In[2]:


import pandas as pd

fighter_db = pd.read_csv('https://raw.githubusercontent.com/jasonchanhku/UFC-MMA-Predictor/master/Datasets/UFC_Fighters_Database.csv')
fighter_db.head()


# ## UFC Fights History
#
# Dataset that contains the fight history of each fight card with **fight odds**. As fight odds proved to be the most important variable once included, having odds for every fight in the dataset matters more than covering each and every UFC fight.
#
# As odds are only available from www.betmma.tips from **UFC 159** onwards, the dataset only contains fights from **UFC 159** to **UFC 211**.
#
# Source(s):
# * http://www.fightmetric.com/statistics/events/completed
# * http://www.betmma.tips/mma_betting_favorites_vs_underdogs.php
#
# Web scraping code(s):
# * https://github.com/jasonchanhku/web_scraping/blob/master/MMA%20Project/MMA%20events%20database.R
# * https://github.com/jasonchanhku/web_scraping/blob/master/MMA%20Project/favourite_vs_underdogs.R
#
# ### Dataset Preview

# In[14]:


fights_db = pd.read_csv('https://raw.githubusercontent.com/jasonchanhku/UFC-MMA-Predictor/master/Datasets/UFC_Fights.csv')
fights_db.head()


# ## Data Cleansing and Blending
#
# The two datasets above were cleansed and blended together using the following process.
#
# ### Feature Mapping
#
# Each feature `x` is the difference between the Favourite's value and the Underdog's value. Hence, if a feature is positive, the favourite fighter has an advantage over the underdog for that feature.
#
# $Feature \quad X_i = X_{favourite} - X_{underdog}$

# # Finalized Dataset
#
# The following are the response variable and the 10 features used in the dataset. Note that each feature has a suffix of **delta** because it has undergone the feature mapping stated above.
#
# * Label - This is the response variable. Either Favourite or Underdog will win
# * REACH - Fighter's reach. (Probably the least important feature)
# * SLPM - Significant Strikes Landed per Minute
# * STRA - Significant Striking Accuracy
# * SAPM - Significant Strikes Absorbed per Minute
# * STRD - Significant Strike Defence (the % of opponent's strikes that did not land)
# * TD - Average Takedowns Landed per 15 minutes
# * TDA - Takedown Accuracy
# * TDD - Takedown Defense (the % of opponent's TD attempts that did not land)
# * SUBA - Average Submissions Attempted per 15 minutes
# * Odds - Fighter's decimal odds spread for that specific matchup

# In[19]:


df = pd.read_csv('https://raw.githubusercontent.com/jasonchanhku/UFC-MMA-Predictor/master/Datasets/Cleansed_Data.csv')
df = df.drop('Sum_delta', axis=1)
df.head()


# # Libraries Used

# In[289]:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as matplot
import seaborn as sns
import scipy.stats as stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score, cross_val_predict
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score, classification_report, make_scorer, accuracy_score
import warnings
import time
warnings.filterwarnings('ignore')
get_ipython().run_line_magic('matplotlib', 'inline')


# Progress bar
def log_progress(sequence, every=None, size=None, name='Items'):
    from ipywidgets import IntProgress, HTML, VBox
    from IPython.display import display

    # Work out whether the total size is known up front
    is_iterator = False
    if size is None:
        try:
            size = len(sequence)
        except TypeError:
            is_iterator = True
    if size is not None:
        if every is None:
            if size <= 200:
                every = 1
            else:
                every = int(size / 200)  # every 0.5%
    else:
        assert every is not None, 'sequence is iterator, set every'

    if is_iterator:
        progress = IntProgress(min=0, max=1, value=1)
        progress.bar_style = 'info'
    else:
        progress = IntProgress(min=0, max=size, value=0)
    label = HTML()
    box = VBox(children=[label, progress])
    display(box)

    index = 0
    try:
        for index, record in enumerate(sequence, 1):
            if index == 1 or index % every == 0:
                if is_iterator:
                    label.value = '{name}: {index} / ?'.format(
                        name=name,
                        index=index
                    )
                else:
                    progress.value = index
                    label.value = u'{name}: {index} / {size}'.format(
                        name=name,
                        index=index,
                        size=size
                    )
            yield record
    except:
        # Mark the bar as failed, then re-raise the original error
        progress.bar_style = 'danger'
        raise
    else:
        progress.bar_style = 'success'
        progress.value = index
        label.value = "{name}: {index}".format(
            name=name,
            index=str(index or '?')
        )


# Creating Dummies
def create_dummies(df, column_name):
    """Create Dummy Columns (One Hot Encoding) from a single Column

    Usage
    ------

    train = create_dummies(train,"Age")
    """
    dummies = pd.get_dummies(df[column_name], prefix=column_name)
    df = pd.concat([df, dummies], axis=1)
    return df


# # Exploratory Data Analysis (EDA)
#
# ## Statistical Overview
#
# From the **finalized dataset**, we know that:
#
# * There are 1,315 rows, which is the number of historical fights in the dataset
# * Roughly 62% of Favourite fighters win over Underdogs
# * On average, Favourites that win hold an advantage over the underdog in every feature. They get hit less and are more accurate with their striking, making Favourite winners more efficient than Underdog winners
# * Meanwhile, Underdog winners historically take more hits and are less efficient on average, yet still end up winning. Could this be **luck** from landing a sudden KO or submission?

# In[18]:


# Shape of df
df.shape


# In[17]:


# Data types of df
df.dtypes


# In[34]:


# What percentage of Favourite fighters win?
df['Label'].value_counts()


# In[38]:


a = df['Label'].value_counts()/len(df)
a


# In[66]:


a.plot(kind='bar', rot=0)


# In[40]:


# Statistical overview of dataset
df.describe()


# In[68]:


# Does the mean of each feature distinguish whether the Favourite / Underdog wins?
# Does a specific feature advantage give the underdog winners an edge?
df.groupby('Label').mean().plot(kind='bar', subplots=True, layout=(5, 2), legend=False, figsize=(25, 20), fontsize=20, rot=0)


# ## Correlation Matrix and Heatmap
# From the correlation matrix, we know that:
# * Predictors are not too correlated with each other, so the possibility of multicollinearity is low. This is not much of a worry if regression is applied
# * Strikes landed and striking defense correlate positively with the favourite winning

# In[114]:


def create_dummies(df, column_name):
    """Create Dummy Columns (One Hot Encoding) from a single Column

    Usage
    ------

    train = create_dummies(train,"Age")
    """
    dummies = pd.get_dummies(df[column_name], prefix=column_name)
    df = pd.concat([df, dummies], axis=1)
    return df


# Correlation Matrix (Label is one-hot encoded; only Label_Favourite is kept)
df_corr = create_dummies(df, 'Label').drop('Label_Underdog', axis=1)
corr = df_corr.corr()
corr


# In[115]:


plt.figure(figsize=(15, 10))
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)


# ## One Sample T-test (Measuring STRD_delta)
#
# A one-sample t-test checks whether a sample mean differs from the population mean. Since STRD_delta has the highest correlation with the dependent variable 'Label_Favourite', let's test whether the average STRD_delta of Favourite and Underdog winners differs significantly.
#
# Hypothesis Testing: Is there a significant difference in the **means of STRD_delta** between favourite winners and underdog winners?
#
#
# **Null Hypothesis** $H_0$: There is no difference in STRD_delta between Favourite and Underdog
#
# **Alternate Hypothesis** $H_1$: There is a difference in STRD_delta between Favourite and Underdog

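# The decision rule behind this hypothesis test can be sketched on synthetic numbers. The block below is only an illustration under stated assumptions: the data are randomly generated stand-ins for `STRD_delta` (not the notebook's dataset), and `one_sample_t` is a hypothetical helper that recomputes the statistic by hand, $t = (\bar{x} - \mu) / (s / \sqrt{n})$ with $n - 1$ degrees of freedom, so that it can be compared against `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (assumption: NOT the notebook's STRD_delta values)
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.03, scale=0.10, size=600)  # hypothetical "favourite winners"
popmean = 0.0                                        # hypothetical population mean


def one_sample_t(sample, popmean):
    """Hypothetical helper: two-sided one-sample t-test computed by hand."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    std_err = sample.std(ddof=1) / np.sqrt(n)   # standard error of the mean
    t_stat = (sample.mean() - popmean) / std_err
    dof = n - 1                                 # degrees of freedom
    p_value = 2 * stats.t.sf(abs(t_stat), dof)  # two-sided p-value
    return t_stat, p_value, dof


t_manual, p_manual, dof = one_sample_t(sample, popmean)
t_scipy, p_scipy = stats.ttest_1samp(sample, popmean)

# Two-sided critical value at the 5% significance level
critical = stats.t.ppf(0.975, dof)
reject_null = abs(t_manual) > critical

print('manual t = {:.4f}, scipy t = {:.4f}'.format(t_manual, t_scipy))
print('critical value = {:.4f}, reject H0: {}'.format(critical, reject_null))
```

# Comparing `abs(t)` with the critical value and comparing the p-value with 0.05 are two views of the same decision, which is why the notebook can report both.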
# In[123]:


# Comparing both means
print('STRD_delta mean of favourite winners is: ' + '{}'.format(df['STRD_delta'][df['Label'] == 'Favourite'].mean()))
print('STRD_delta mean of underdog winners is: ' + '{}'.format(df['STRD_delta'][df['Label'] == 'Underdog'].mean()))


# However, is the marginal difference of 0.047 significant?

# ### Conducting the T-test **(95% confidence interval)**
#
# **Reject the Null Hypothesis because:**
#
# * The t-statistic lies outside the critical values: 4.96 > 1.96
# * The p-value is lower than 5%

# In[126]:


# T-test
stats.ttest_1samp(a=df[df['Label'] == 'Favourite']['STRD_delta'],  # Sample of Favourite winners
                  popmean=df['STRD_delta'].mean())  # Fighter population mean


# In[127]:


# Critical points (a one-sample t-test has n - 1 degrees of freedom)
degree_freedom = len(df[df['Label'] == 'Favourite']) - 1

LQ = stats.t.ppf(0.025, degree_freedom)  # Left (2.5%) quantile
RQ = stats.t.ppf(0.975, degree_freedom)  # Right (97.5%) quantile

print('The t-distribution left critical value is: ' + str(LQ))
print('The t-distribution right critical value is: ' + str(RQ))


# ## Distribution Plots
#
# * For most of the predictors, the distribution is relatively normal and centered around 0
# * This implies most matches made in the UFC are based on evenly matched skillsets
# * Note that for the Underdog winners, the mean of the predictors tends to be lower than for Favourite winners
# * Since each feature is Favourite minus Underdog, this implies Underdog winners were more skilled in that particular area for the matchup but were somehow still labelled the Underdog by the [**wisdom of the crowd**](http://www.betmma.tips/mma_betting_statistics.php)

# In[164]:


cols = df.drop(['Events', 'Favourite', 'Underdog', 'Label'], axis=1).columns.tolist()


# In[184]:


# Create 10 plots in a 2 by 5 grid of subplots
fig, ax = plt.subplots(2, 5, figsize=(20, 20))

# Loop over the features and draw one violin plot per subplot
for i, col in enumerate(cols):
    x = i // 5
    y = i % 5
    sns.violinplot(x="Label", y=col, data=df, order=["Favourite", "Underdog"], ax=ax[x, y])


# # Feature Selection
#
# Feature selection is the process of selecting a subset of relevant
# predictors for use in model construction. Feature selection is used for:
#
# * simplification of models to make them easier to interpret
# * shorter training times (applicable to very large datasets)
# * avoiding the curse of dimensionality
# * enhanced generalization by reducing overfitting (reduction of variance)
#
# From RFECV, with Feature Importance as validation, we know:
# * The 5 most important features are **SAPM_delta, SLPM_delta, STRD_delta, TD_delta, Odds_delta**

# ## Recursive Feature Elimination with Cross Validation (RFECV)
#
# The features will be selected based on Recursive Feature Elimination with Cross Validation [**(RFECV)**](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html). Recursive Feature Elimination (RFE) works by training the model, evaluating it, removing the least significant features, and repeating.

# In[207]:


# Create a function to select features
# Note that feature names are stored in cols
def select_features(df):
    all_X = df.drop(['Events', 'Favourite', 'Underdog', 'Label'], axis=1)
    all_y = df['Label']

    clf = RandomForestClassifier(random_state=1)
    selector = RFECV(clf)
    selector.fit(all_X, all_y)

    # Keep only the columns that RFECV marked as selected
    best_columns = list(all_X.columns[selector.support_])
    print('Best Columns \n' + '-'*12 + '\n' + '{}'.format(best_columns))

    return best_columns


# In[208]:


best_cols = select_features(df)


# ## Feature Importance
#
# * As expected, **Reach_delta** is of least importance since reach does not really determine a clear winner

# In[210]:


plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = (12, 6)

# Create train and test splits
target_name = 'Label'
X = df.drop(['Events', 'Favourite', 'Underdog', 'Label'], axis=1)

y = df[target_name]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=1, stratify=y)

# Note: despite the variable name, this is a random forest
dtree = RandomForestClassifier(
    #max_depth=3,
    random_state=1,
    class_weight="balanced",
    min_weight_fraction_leaf=0.01
)
dtree = dtree.fit(X_train, y_train)

## plot the importances ##
importances = dtree.feature_importances_
feat_names = df.drop(['Events', 'Favourite', 'Underdog', 'Label'], axis=1).columns

indices = np.argsort(importances)[::-1]
plt.figure(figsize=(12, 6))
plt.title("Feature importances by RandomForestClassifier")
plt.bar(range(len(indices)), importances[indices], color='lightblue', align="center")
plt.step(range(len(indices)), np.cumsum(importances[indices]), where='mid', label='Cumulative')
plt.xticks(range(len(indices)), feat_names[indices], rotation='vertical', fontsize=14)
plt.xlim([-1, len(indices)])
plt.show()


# # Model Selection and Hyperparameter Tuning
#
# * Model selection and hyperparameter tuning were accomplished using GridSearchCV
# * There is no need to apply **train_test_split** in this case because cross validation is embedded in GridSearchCV
# * Among the models considered are:
#     * Logistic Regression
#     * Random Forest Classifier
#     * Neural Network (MLP)

# In[297]:


def select_model(df, features):

    all_X = df[features]
    all_y = df["Label"]

    # Create a list of dicts, each holding a model and its hyperparameter grid
    models = [
        {
            "name": "Logistic Regression",
            "estimator": LogisticRegression(),
            "hyperparameters":
            {
                "solver": ["newton-cg", "lbfgs", "liblinear"]
            }
        },
        {
            "name": "RandomForestClassifier",
            "estimator": RandomForestClassifier(random_state=1),
            "hyperparameters":
            {
                "n_estimators": [4, 6, 9],
                "criterion": ["entropy", "gini"],
                "max_depth": [2, 5, 10],
                "max_features": ["log2", "sqrt"],
                "min_samples_leaf": [1, 5, 8],
                "min_samples_split": [2, 3, 5]
            }
        },
        {
            "name": "Multi Layer Perceptron (MLP)",
            "estimator": MLPClassifier(random_state=1),
            "hyperparameters":
            {
                "hidden_layer_sizes": [(5, 5), (10, 10)],
                "activation": ["relu", "tanh", "logistic"],
                "solver": ['sgd', 'adam'],
                "learning_rate": ["constant", "adaptive"]
            }
        }
    ]

    for model in log_progress(models):
        print(model["name"])
        print("-"*len(model["name"]))

        # 10-fold cross-validated grid search scored on accuracy
        grid = GridSearchCV(model["estimator"], param_grid=model["hyperparameters"], cv=10, scoring='accuracy')
        grid.fit(all_X, all_y)

        # Store the best result of the search on the model's dict
        model["best_params"] = grid.best_params_
        model["best_score"] = grid.best_score_
        model["best_model"] = grid.best_estimator_
        model["scoring"] = grid.scorer_

        print("Best Parameters:\n" + "{}".format(model["best_params"]))
        print("Best Score:\n" + "{}".format(model["best_score"]))
        print("Best Model:\n" + "{}\n".format(model["best_model"]))
        print("Scoring method:\n" + "{}\n".format(model["scoring"]))

    return models


# In[298]:


models = select_model(df, best_cols)


# # Final Model Verdict
#
# * With the **Neural Network (MLP)** giving the highest score, this model is chosen to be deployed on the web app
# * Note that not too many hidden layers were chosen, for efficiency and to limit potential overfitting
# * I did not pick the model based on **AUC, precision, or recall** because this is a **gambling problem** where a false positive and a false negative both simply mean a lost bet, unlike other problems such as predicting employee turnover. Hence **accuracy is all that matters**

# # Conclusion
#
# * The final model selected to predict winners between Favourite and Underdog, the Neural Network (MLP), has an **accuracy of 70.4%**
# * This project satisfied its two crucial objectives: the accuracy beats both the 50% random-pick baseline and the favourite-only baseline (roughly 62% in this dataset)

# # Improvements
#
# * Twitter sentiment scraping on UFC predictor key opinion leaders (KOLs)
# * Include a categorical variable named 'Fight Camp', as some fight camp rankings data is available
# * Better structured odds data and start warehousing data
# * Improve data quality and granularity with a premium data source from [**fightmetric**](http://www.fightmetric.com/)

# # Citation
#
# * Should you use this workflow in any of your work, I would appreciate it if you could cite my full name (Jason Chan Jin An) and the link to this Notebook

# # Collaboration and Sponsorship
#
# * Please do not hesitate to contact me for any sort of collaboration or discussion about this notebook
# * I am also open to sponsorship / investment opportunities to monetize this project