This notebook shows the workflow of my MMA Predictor project.
1. Background/questions
Mixed martial arts is a unique sport. The multitude of styles and attacks allowed in the sport creates a large number of variables, which can make fights quite unpredictable. Therefore, I sought to create a predictor that could make sense of the sport's chaos and answer the following questions:
2. Objectives
The objective of this project is to create a model that predicts fight outcomes with better than 50% accuracy.
1. Collection
The data was gathered from UFC Stats using web scrapers built with Python's Beautiful Soup library. Two types of data were collected, fighter data and fight data, each stored in its own Pandas data frame. The web scraper scripts can be found in the webscraper section of my GitHub repository.
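For illustration, a minimal Beautiful Soup scrape follows this pattern. This is a hypothetical sketch, not the actual scraper; the URL query and the 'b-link' class are assumptions about the UFC Stats markup:
import requests
from bs4 import BeautifulSoup
# Hypothetical sketch: fetch one page of the UFC Stats fighter index
url = 'http://ufcstats.com/statistics/fighters?char=a&page=all'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
# Fighter rows link to detail pages whose URLs contain the fighter id
for link in soup.select('a.b-link')[:5]:
    print(link.get_text(strip=True), '->', link.get('href'))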
2. Preview of raw fighter data
Description of columns (stat definitions follow UFC Stats):

- id: unique fighter ID from UFC Stats
- name: fighter name
- height: height (feet and inches)
- weight: weight (lbs)
- reach: reach (inches)
- dob: year of birth
- ss_min: significant strikes landed per minute
- str_acc: significant striking accuracy (%)
- str_a_min: significant strikes absorbed per minute
- str_def: significant strike defence (%)
- td_avg: average takedowns landed per 15 minutes
- td_acc: takedown accuracy (%)
- td_def: takedown defence (%)
- sub_avg: average submissions attempted per 15 minutes
- wins / losses: fight record
- wl_diff: wins minus losses
- momentum: streak-based momentum score (negative values indicate a losing streak)
import pandas as pd
fighter_data_raw = pd.read_csv('raw-data/2022-08-25-fighter-data-raw.csv')
fighter_data_raw.head(5)
| | id | name | height | weight | reach | dob | ss_min | str_acc | str_a_min | str_def | td_avg | td_acc | td_def | sub_avg | wins | losses | wl_diff | momentum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8738eacd62e82a32 | Matt Andersen | 6'2" | 230.0 | NaN | 1971 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0 | 0 | 0.0 | 0 | 1 | -1 | -1 |
| 1 | 9199e0735b83dd32 | Andy Anderson | 5'6" | 240.0 | NaN | 1974 | 0.00 | 0 | 0.00 | 0 | 0.00 | 0 | 0 | 0.0 | 0 | 1 | -1 | -1 |
| 2 | cf946e03ba2e7666 | Kevin Aguilar | 5'7" | 155.0 | 73.0 | 1988 | 3.96 | 40 | 4.81 | 52 | 0.16 | 16 | 78 | 0.0 | 3 | 4 | -1 | -4 |
| 3 | 26387c19f32dda0f | Javy Ayala | 6'1" | 265.0 | NaN | 1988 | 0.83 | 25 | 5.00 | 33 | 0.00 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0 |
| 4 | d6a3e131ad4ba064 | Angga | 5'7" | 145.0 | 68.0 | 1989 | 3.56 | 38 | 3.77 | 53 | 0.00 | 0 | 33 | 0.0 | 0 | 1 | -1 | -1 |
3. Preview of raw fight data
Description of columns:

- event_id / event_name: unique event ID from UFC Stats and the event's name
- f1_id / f1_name: ID and name of the first fighter
- f2_id / f2_name: ID and name of the second fighter
- winner_id: ID of the winning fighter
fight_data_raw = pd.read_csv('raw-data/2022-08-25-fight-data-raw.csv')
fight_data_raw.head(5)
| | event_id | event_name | f1_id | f1_name | f2_id | f2_name | winner_id |
|---|---|---|---|---|---|---|---|
0 | 4f853e98886283cf | UFC 278: Usman vs. Edwards | f1fac969a1d70b08 | Leon Edwards | f1b2aa7853d1ed6e | Kamaru Usman | f1fac969a1d70b08 |
1 | 4f853e98886283cf | UFC 278: Usman vs. Edwards | 2e5c2aa8e4ab9d82 | Paulo Costa | 00e11b5c8b7bfeeb | Luke Rockhold | 2e5c2aa8e4ab9d82 |
2 | 4f853e98886283cf | UFC 278: Usman vs. Edwards | c03520b5c88ed6b4 | Merab Dvalishvili | d0f3959b4a9747e6 | Jose Aldo | c03520b5c88ed6b4 |
3 | 4f853e98886283cf | UFC 278: Usman vs. Edwards | 3cf66c62d9069f43 | Lucie Pudilova | fdbefee0827e1567 | Wu Yanan | 3cf66c62d9069f43 |
4 | 4f853e98886283cf | UFC 278: Usman vs. Edwards | 49e96ff8ab781589 | Tyson Pedro | f703250471acaf96 | Harry Hunsucker | 49e96ff8ab781589 |
4. Cleaning/processing
The following steps were taken to clean the datasets:
The data cleaning scripts can be found in the data cleaning section of my GitHub repository.
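As one example of the kind of step involved (a hypothetical illustration, not the actual cleaning script), the raw height strings such as 6'2" must be converted to numeric inches before height differences can be computed:
import re

def height_to_inches(height):
    # Convert a height string like 6'2" to total inches; returns None for missing values
    match = re.match(r"(\d+)'(\d+)\"", str(height))
    if match is None:
        return None
    return int(match.group(1)) * 12 + int(match.group(2))

print(height_to_inches('6\'2"'))  # 74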
5. Elo rating formula
Formula for the expected score of fighter A:
$E_{A} = \frac{1}{1 + 10^{(R_{B} - R_{A})/400}}$
A fighter's expected score is their probability of winning; therefore, the higher a fighter's Elo rating, the higher their probability of winning.
Formula for updating fighter A's Elo after a fight, with a K-factor of 60:
$R_{A}' = R_{A} + 60(S_{A} - E_{A})$
where $S_{A}$ is the actual score (1 for a win, 0 for a loss).
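A minimal sketch of these formulas in Python, assuming the K-factor of 60 above and scores of 1 for a win and 0 for a loss:
def expected_score(r_a, r_b):
    # Expected score (win probability) of fighter A against fighter B
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a, r_b, a_won, k=60):
    # Return fighter A's updated rating after the fight
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - expected_score(r_a, r_b))

# Example: a 1100-rated fighter upsets a 1200-rated opponent
print(round(update_elo(1100, 1200, a_won=True), 1))  # 1138.4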
6. Preview of aggregated dataset
The features of the aggregated dataset are shown in the preview below:
all_fights_fighters_df = pd.read_csv('data/2022-08-25-all-fights-fighters-data.csv')
all_fights_fighters_df.sort_values(by=['Unnamed: 0'], ascending=False).head(6)
| | Unnamed: 0 | id | name | id_opp | name_opp | event_id | event_name | elo | elo_opp | height_diff | ... | str_def_diff | td_avg_diff | td_acc_diff | td_def_diff | sub_avg_diff | wins_diff | losses_diff | momentum_diff | wl_diff_diff | outcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
13085 | 13085 | 00e11b5c8b7bfeeb | Luke Rockhold | 2e5c2aa8e4ab9d82 | Paulo Costa | 4f853e98886283cf | UFC 278: Usman vs. Edwards | 1109.535998 | 1103.120353 | 2.0 | ... | 6.0 | 0.17 | -46.0 | -14.0 | 1.0 | 9.0 | 3.0 | -4.0 | 6.0 | L |
13084 | 13084 | 2e5c2aa8e4ab9d82 | Paulo Costa | 00e11b5c8b7bfeeb | Luke Rockhold | 4f853e98886283cf | UFC 278: Usman vs. Edwards | 1103.120353 | 1109.535998 | -2.0 | ... | -6.0 | -0.17 | 46.0 | 14.0 | -1.0 | -9.0 | -3.0 | 4.0 | -6.0 | W |
13083 | 13083 | d0f3959b4a9747e6 | Jose Aldo | c03520b5c88ed6b4 | Merab Dvalishvili | 4f853e98886283cf | UFC 278: Usman vs. Edwards | 1216.774682 | 1159.868563 | 1.0 | ... | 2.0 | -6.05 | 14.0 | 13.0 | -0.2 | 13.0 | 5.0 | -9.0 | 8.0 | L |
13082 | 13082 | c03520b5c88ed6b4 | Merab Dvalishvili | d0f3959b4a9747e6 | Jose Aldo | 4f853e98886283cf | UFC 278: Usman vs. Edwards | 1159.868563 | 1216.774682 | -1.0 | ... | -2.0 | 6.05 | -14.0 | -13.0 | 0.2 | -13.0 | -5.0 | 9.0 | -8.0 | W |
13081 | 13081 | fdbefee0827e1567 | Wu Yanan | 3cf66c62d9069f43 | Lucie Pudilova | 4f853e98886283cf | UFC 278: Usman vs. Edwards | 917.791968 | 915.752514 | 0.0 | ... | -3.0 | 0.19 | 3.0 | 2.0 | -0.5 | -2.0 | 0.0 | -5.0 | -2.0 | L |
13080 | 13080 | 3cf66c62d9069f43 | Lucie Pudilova | fdbefee0827e1567 | Wu Yanan | 4f853e98886283cf | UFC 278: Usman vs. Edwards | 915.752514 | 917.791968 | 0.0 | ... | 3.0 | -0.19 | -3.0 | -2.0 | 0.5 | 2.0 | 0.0 | 5.0 | 2.0 | W |
6 rows × 24 columns
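Each fight appears twice in the aggregated dataset, once from each fighter's perspective, which is why the _diff columns in adjacent rows have flipped signs. As a hypothetical sketch (the column names follow the previews above; the actual aggregation script may differ), each _diff feature is the fighter's stat minus the opponent's stat:
# Hypothetical sketch of building the *_diff features for one bout
stat_cols = ['height', 'reach', 'ss_min', 'str_acc', 'str_a_min', 'str_def',
             'td_avg', 'td_acc', 'td_def', 'sub_avg', 'wins', 'losses',
             'momentum', 'wl_diff']

def diff_features(fighter, opponent):
    # fighter and opponent are rows (Series) from the fighter data frame
    return {col + '_diff': fighter[col] - opponent[col] for col in stat_cols}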
1. Summary statistics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
all_fights_fighters_df.describe()
| | Unnamed: 0 | elo | elo_opp | height_diff | reach_diff | ss_min_diff | str_acc_diff | str_a_min_diff | str_def_diff | td_avg_diff | td_acc_diff | td_def_diff | sub_avg_diff | wins_diff | losses_diff | momentum_diff | wl_diff_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 13086.000000 | 13086.000000 | 13086.000000 | 13086.000000 | 13086.000000 | 13086.000000 | 13086.000000 | 13086.000000 | 13086.000000 | 13086.000000 | 13086.000000 | 13086.000000 | 13086.000000 | 13086.000000 | 13086.000000 | 13086.000000 | 13086.000000 |
mean | 6542.500000 | 1040.896990 | 1040.896990 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
std | 3777.747146 | 67.405262 | 67.405262 | 2.539832 | 3.470319 | 1.562746 | 10.700523 | 1.568358 | 10.287499 | 1.793417 | 28.082514 | 29.508751 | 1.022474 | 7.151091 | 4.275102 | 3.350816 | 5.716938 |
min | 0.000000 | 853.961053 | 853.961053 | -13.000000 | -13.750000 | -8.990000 | -56.000000 | -14.180000 | -56.000000 | -11.770000 | -100.000000 | -100.000000 | -11.900000 | -29.000000 | -18.000000 | -20.000000 | -24.000000 |
25% | 3271.250000 | 1000.000000 | 1000.000000 | -2.000000 | -2.000000 | -1.010000 | -7.000000 | -0.910000 | -6.000000 | -1.050000 | -16.000000 | -18.000000 | -0.500000 | -4.000000 | -2.000000 | -2.000000 | -3.000000 |
50% | 6542.500000 | 1027.730742 | 1027.730742 | 0.000000 | -0.000000 | -0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 9813.750000 | 1075.647304 | 1075.647304 | 2.000000 | 2.000000 | 1.010000 | 7.000000 | 0.910000 | 6.000000 | 1.050000 | 16.000000 | 18.000000 | 0.500000 | 4.000000 | 2.000000 | 2.000000 | 3.000000 |
max | 13085.000000 | 1388.911657 | 1388.911657 | 13.000000 | 13.750000 | 8.990000 | 56.000000 | 14.180000 | 56.000000 | 11.770000 | 100.000000 | 100.000000 | 11.900000 | 29.000000 | 18.000000 | 20.000000 | 24.000000 |
2. Correlation Matrix
# Each fight appears twice, so keep only the winner's copy of each bout
corr = all_fights_fighters_df.drop(all_fights_fighters_df[all_fights_fighters_df.outcome == "L"].index)
# Drop non-numeric columns before computing correlations
corr = corr.drop(['Unnamed: 0', 'id', 'name', 'id_opp', 'name_opp', 'event_id', 'event_name', 'outcome'], axis=1)
corr = corr.corr()
corr
| | elo | elo_opp | height_diff | reach_diff | ss_min_diff | str_acc_diff | str_a_min_diff | str_def_diff | td_avg_diff | td_acc_diff | td_def_diff | sub_avg_diff | wins_diff | losses_diff | momentum_diff | wl_diff_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
elo | 1.000000 | 0.620511 | 0.011821 | 0.100809 | -0.027085 | 0.004674 | -0.009452 | 0.000280 | 0.023665 | -0.006040 | -0.030251 | -0.009478 | 0.092822 | -0.107715 | 0.076475 | 0.213202 |
elo_opp | 0.620511 | 1.000000 | -0.005508 | 0.080546 | -0.067712 | -0.049532 | 0.123864 | -0.120745 | -0.054313 | -0.074618 | -0.084330 | -0.050591 | -0.240371 | -0.160667 | 0.020193 | -0.176775 |
height_diff | 0.011821 | -0.005508 | 1.000000 | 0.545316 | 0.103162 | 0.046144 | -0.039423 | -0.080871 | -0.140536 | 0.015124 | -0.013572 | 0.108591 | 0.013255 | -0.034068 | -0.001525 | 0.046400 |
reach_diff | 0.100809 | 0.080546 | 0.545316 | 1.000000 | 0.001207 | -0.015376 | -0.025878 | -0.131529 | -0.089269 | 0.011795 | -0.059239 | 0.066306 | -0.020701 | -0.115565 | 0.057026 | 0.071631 |
ss_min_diff | -0.027085 | -0.067712 | 0.103162 | 0.001207 | 1.000000 | 0.304210 | 0.233333 | 0.233051 | -0.156098 | 0.077116 | 0.275435 | -0.163988 | 0.146005 | 0.016668 | 0.117234 | 0.176468 |
str_acc_diff | 0.004674 | -0.049532 | 0.046144 | -0.015376 | 0.304210 | 1.000000 | -0.213667 | 0.077398 | 0.141629 | 0.178061 | 0.075265 | 0.075007 | 0.102509 | -0.058535 | 0.159388 | 0.183864 |
str_a_min_diff | -0.009452 | 0.123864 | -0.039423 | -0.025878 | 0.233333 | -0.213667 | 1.000000 | -0.344828 | -0.257949 | -0.196519 | -0.015361 | -0.086586 | -0.128772 | 0.046292 | -0.144337 | -0.207714 |
str_def_diff | 0.000280 | -0.120745 | -0.080871 | -0.131529 | 0.233051 | 0.077398 | -0.344828 | 1.000000 | 0.029987 | 0.127596 | 0.247280 | -0.141760 | 0.179833 | 0.043367 | 0.113975 | 0.197855 |
td_avg_diff | 0.023665 | -0.054313 | -0.140536 | -0.089269 | -0.156098 | 0.141629 | -0.257949 | 0.029987 | 1.000000 | 0.364267 | 0.034195 | 0.146070 | 0.023459 | -0.088412 | 0.121800 | 0.106125 |
td_acc_diff | -0.006040 | -0.074618 | 0.015124 | 0.011795 | 0.077116 | 0.178061 | -0.196519 | 0.127596 | 0.364267 | 1.000000 | 0.235979 | 0.021930 | 0.105539 | 0.037421 | 0.010026 | 0.105896 |
td_def_diff | -0.030251 | -0.084330 | -0.013572 | -0.059239 | 0.275435 | 0.075265 | -0.015361 | 0.247280 | 0.034195 | 0.235979 | 1.000000 | -0.220642 | 0.130412 | 0.090131 | 0.021827 | 0.093380 |
sub_avg_diff | -0.009478 | -0.050591 | 0.108591 | 0.066306 | -0.163988 | 0.075007 | -0.086586 | -0.141760 | 0.146070 | 0.021930 | -0.220642 | 1.000000 | 0.053118 | -0.045770 | 0.119044 | 0.108456 |
wins_diff | 0.092822 | -0.240371 | 0.013255 | -0.020701 | 0.146005 | 0.102509 | -0.128772 | 0.179833 | 0.023459 | 0.105539 | 0.130412 | 0.053118 | 1.000000 | 0.643362 | 0.161619 | 0.756813 |
losses_diff | -0.107715 | -0.160667 | -0.034068 | -0.115565 | 0.016668 | -0.058535 | 0.046292 | 0.043367 | -0.088412 | 0.037421 | 0.090131 | -0.045770 | 0.643362 | 1.000000 | -0.261906 | -0.013491 |
momentum_diff | 0.076475 | 0.020193 | -0.001525 | 0.057026 | 0.117234 | 0.159388 | -0.144337 | 0.113975 | 0.121800 | 0.010026 | 0.021827 | 0.119044 | 0.161619 | -0.261906 | 1.000000 | 0.434706 |
wl_diff_diff | 0.213202 | -0.176775 | 0.046400 | 0.071631 | 0.176468 | 0.183864 | -0.207714 | 0.197855 | 0.106125 | 0.105896 | 0.093380 | 0.108456 | 0.756813 | -0.013491 | 0.434706 | 1.000000 |
3. Correlation Heatmap
# Create correlation heatmap
plt.figure(figsize=(15, 10))
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values).set(title='Correlation Heatmap of Fighter Variables')
[Figure: correlation heatmap of fighter variables]
1. Training models
Using scikit-learn, we train and evaluate logistic regression, decision tree, and random forest classifier models.
# Load training dataset
train = pd.read_csv('data/composite-train.csv')
# Import models from the scikit-learn module:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold  # For k-fold cross-validation
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import metrics

# Generic function for fitting a classification model and assessing its performance:
def classification_model(model, data, predictors, outcome):
    # Fit the model:
    model.fit(data[predictors], data[outcome])

    # Make predictions on the training set:
    predictions = model.predict(data[predictors])

    # Print training accuracy:
    accuracy = metrics.accuracy_score(data[outcome], predictions)
    print("Accuracy : %s" % "{0:.3%}".format(accuracy))

    # Perform k-fold cross-validation with 10 folds:
    kf = KFold(n_splits=10, shuffle=False)
    scores = []
    for train_idx, test_idx in kf.split(data):
        # Select the training fold
        train_predictors = data[predictors].iloc[train_idx, :]
        # The target we're using to train the algorithm
        train_target = data[outcome].iloc[train_idx]
        # Train the algorithm on this fold
        model.fit(train_predictors, train_target)
        # Record the accuracy on the held-out fold
        scores.append(model.score(data[predictors].iloc[test_idx, :], data[outcome].iloc[test_idx]))
    print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(scores)))

    # Fit the model again so that it can be referred to outside the function:
    model.fit(data[predictors], data[outcome])
# Assign predictors
predictors = ['elo','elo_opp','height_diff','reach_diff','ss_min_diff','str_acc_diff',
'str_a_min_diff','str_def_diff','td_avg_diff','td_acc_diff','td_def_diff','sub_avg_diff',
'wins_diff','losses_diff','momentum_diff','wl_diff_diff']
outcome = 'outcome'
# Fit logistic regression model
model_lr = LogisticRegression(max_iter=1000)
classification_model(model_lr, train, predictors, outcome)
Accuracy : 75.706%
Cross-Validation Score : 75.217%
# Fit decision tree classifier model
model_dtc = DecisionTreeClassifier()
classification_model(model_dtc, train, predictors, outcome)
Accuracy : 100.000%
Cross-Validation Score : 67.815%

The perfect training accuracy paired with a much lower cross-validation score shows that the decision tree overfits the training data.
# Fit random forest classifier model
model_rfc = RandomForestClassifier(n_estimators=30)
classification_model(model_rfc, train, predictors, outcome)
Accuracy : 99.900%
Cross-Validation Score : 76.428%
Here we load a test dataset with the outcome column removed, along with a file containing the true outcomes for scoring.
test = pd.read_csv('data/composite-test.csv')
results = pd.read_csv('data/test-results.csv')
# Each fight appears twice (once per fighter), so keep every other row
test = test.iloc[::2]
test = test.reset_index()
results = results.iloc[::2]
results = results.reset_index()
2. Logistic regression
Running our test dataset through the logistic regression model gives a prediction accuracy of 89.93%.
# Logistic Regression Testing
model_lr = LogisticRegression(max_iter = 1000)
model_lr.fit(train[predictors],train[outcome])
test[outcome] = model_lr.predict(test[predictors])
score = 0
total = len(results)
for i in range(0, total):
if results.loc[i, 'outcome'] == test.loc[i, 'outcome']:
score += 1
print('Prediction accuracy is ' + str(round((score*100/total),2)) + '%')
Prediction accuracy is 89.93%
3. Decision tree
Running our test dataset through the decision tree model gives a prediction accuracy of 73.83%.
# Decision Tree Testing
model_dtc = DecisionTreeClassifier()
model_dtc.fit(train[predictors],train[outcome])
test[outcome] = model_dtc.predict(test[predictors])
score = 0
total = len(results)
for i in range(0, total):
if results.loc[i, 'outcome'] == test.loc[i, 'outcome']:
score += 1
print('Prediction accuracy is ' + str(round((score*100/total),2)) + '%')
Prediction accuracy is 73.83%
4. Random forest classification
Running our test dataset through the random forest model gives a prediction accuracy of 85.91%.
# Random Forest Testing
model_rfc = RandomForestClassifier()
model_rfc.fit(train[predictors],train[outcome])
test[outcome] = model_rfc.predict(test[predictors])
score = 0
total = len(results)
for i in range(0, total):
if results.loc[i, 'outcome'] == test.loc[i, 'outcome']:
score += 1
print('Prediction accuracy is ' + str(round((score*100/total),2)) + '%')
Prediction accuracy is 85.91%
Summary of results:

- Logistic regression: 89.93%
- Decision tree: 73.83%
- Random forest classification: 85.91%
Based on these results, the logistic regression model has the highest prediction accuracy at 89.93%, with the random forest classifier close behind at 85.91%. These results comfortably meet our initial objective of predicting outcomes with better than 50% accuracy.
As for next steps, we can improve feature engineering through deeper analysis of the predictor variables and build an interactive dashboard to display real-time predictions.