As the football analyst for Mchezopesa Ltd, I have been tasked with creating a model that predicts the outcome of a football match between national teams.
i. Create a model that predicts whether the home team will win, lose, or draw in a football match.
ii. Create a model that predicts the number of goals that the home team will score.
iii. Create a model that predicts the number of goals that the away team will score in a given match.
The term ‘odds’ is commonly used in betting and it often refers to the probability of an event occurring. In a football match, the bookmaker assigns different odds depending on the true odds of an event occurring ie. win, loss, or draw (relative to the home team) while also factoring in the team’s form, team statistics, historical precedents, expert opinion, team motivation among other factors surrounding each match. In order to make profit, bookmakers will then adjust the probabilities downward before offering the bet to punters. While factors such as expert opinion and team motivation are hard to measure, team statistics such as wins, losses, goals scored, goals, conceded, and team ranks are recorded and can be used to predict the outcome of matches.
FIFA has good data on the different matches, and it also as a ranking system that is used to measure the performance of national teams over time. The FIFA ranking system is updated periodically to ensure that the team rankings are reflective of team performances. The latest review of the ranking system was done in 2018 , replacing a system that was in place since 2006. More information about FIFA ranking system can be found here. Aside from the FIFA website, bookmakers also source team information from team release news and professional contacts within different national teams.
To predict the match outcome, I am tasked with creating a logistic regression model. To predict the match scores, I am tasked with creating a polynomial regression model.
To improve model performance, I will perform feature engineering and parameter tuning.
Two datasets were provided for this project. The first data set contains different football matches played by national teams across different tournaments between 1872 and 2019. This data set includes the home team, away team, match scores, country and city the match was played, along with whether or not the playing ground was neutral. The second data set contains the national team ranks of different countries in the world. The ranking data set is updated monthly depending on the performance of the different teams in their respective matches.
To predict the outcome of a match, it is important to factor in team performance which is reflected in the ranking while also considering previous results which is refelcted in the first data set.
pip install -U pandas-profiling
#Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import pandas_profiling
sb.set_style()
import warnings
warnings.filterwarnings("ignore")
# previewing top of fifa ranking data set
fifa_ranking = pd.read_csv('/home/practitioner/Downloads/fifa_ranking.csv')
fifa_ranking.head()
# preview last five rows
fifa_ranking.tail()
A quick overview of the data indicates that the ranks of the different countries span between August 1993 and June 2016. According to FIFA, the most recent update of the rankings was done in 2018 prior to which, the previous system was in place from 2006 to 2018. For consistency, I will only rely on data that spans 2006 to 2018 to predict the outcome of the 2018 world cup matches.
# previewing top of match results data set
match_results = pd.read_csv('/home/practitioner/Downloads/results.csv')
match_results.head()
# preview last five rows
match_results.tail()
Assuming the results dataset is also ordered by date, the matches that have been recorded span between 1872 and 2019. To synchronize the two datsets, I will only use matches that were played between 2006 and 2018.
# checking the shape of our datasets
print('Fifa ranking dataset shape:', fifa_ranking.shape)
print('Results dataset shape:', match_results.shape)
# checking the data types
fifa_ranking.dtypes
The data types seem appropriate for the different columns aside from the rank date which is stored as an object. In the data preparation, this will be converted to date-time data type
#checking the data types
match_results.dtypes
Aside from the date column which is stored as an object rather than date-time data type, all the other data types are appropriate.
I confirmed the validity of the FIFA world rankings using information from the official FIFA website which can be found here. I also confirmed the accuracy of the match scores recorded for different games through a series of validaion scores across different tournament websites on the internet.
# checking for null values in the results dataset
match_results.isna().sum()
# split date column into year, month, and day
md = match_results['date'].str.split('-',n=2, expand=True)
match_results['year'] = md[0]
match_results['month'] = md[1]
match_results['day'] = md[2]
#match_results = match_results.drop('date', 1)
match_results[['year', 'month', 'day']] = match_results[['year', 'month', 'day']].astype(int)
match_results.dtypes
Splitting the date column will assist in the merging of the home and away teams and their FIFA ranks for the respective years.
# drop columns not needed
match_results = match_results.drop(['city', 'country'], 1)
match_results.head()
# adding a score difference and win/draw/lose column relative to the home team
match_results['score_difference'] = match_results['home_score'] - match_results['away_score']
conditions = [(match_results['score_difference'] > 0), (match_results['score_difference'] == 0), (match_results['score_difference'] < 0)]
values = [2, 1, 0] #where 2 is win, 1 is draw, 0 is loss
match_results['outcome'] = np.select(conditions, values)
match_results.head()
The score difference helps in the evaluation of each team's performance against their opponents. I recoreded wins as 2, draws as 1, and losses as 0.
#selecting matches that took place between January 2006 and June 2018
recent_results = match_results[match_results['date'] >= '2006-01-01']
recent_results = recent_results[recent_results['date'] <= '2018-06-07']
recent_results.shape
As was evident earlier, the match results dataset holds matches from 1872 which are useless in predicting the outcome of matches today. To get a more reflective sample of modern day football, there is need to filter the data up to a specific point in recent history. The choice of 2006 as the lower year bound is based on the fact that the most recent update of FIFA's ranking system before the 2018 world cup was done in 2006. The ranking procedures are revised with each update and this could affect the cosnsitency of ranking as a predictor for team performance acorss different eras. 2006 to 2018 seemed like a viable time duration to work with.
# checking for null values in the results dataset
fifa_ranking.isna().sum()
# splitting date column to year, month, day
new = fifa_ranking['rank_date'].str.split('-',n=2, expand=True)
fifa_ranking['year_rank'] = new[0]
fifa_ranking['month_rank'] = new[1]
fifa_ranking['day_rank'] = new[2]
#fifa_ranking = fifa_ranking.drop('rank_date', 1)
fifa_ranking.tail()
After the split, the year and month columns will be used to merge to the data set containg match results.
# selecting appropriate columns
select_ranking = fifa_ranking[['rank', 'rank_change', 'country_full', 'rank_date', 'year_rank', 'month_rank', 'day_rank']]
select_ranking.head()
Many of the columns in the fifa ranking data set will are not useful to our analysis so we select only select columns that we need which include the country name, rank, rank change, and the year of ranking.
# renaming countries with different names in the two data sets
select_ranking = select_ranking.replace("Côte d'Ivoire", 'Ivory Coast')
select_ranking.head()
Some of the countries have different names in the two data sets so we need to rename them.
#selecting fifa team rankings between 2006 and 2018
recent_ranking = select_ranking[select_ranking['rank_date'] >= '2006-01-01']
recent_ranking.tail()
For consistency, we select team ranks that fall within the same review period.
#drop date columns
recent_rank = recent_ranking.drop('rank_date', 1)
recent_games = recent_results.drop('date', 1)
We drop the date columns since we already have the year, month and days.
# convert year, month, and day to integer for merging
convert_dict = {'year_rank': int,
'month_rank': int,
'day_rank': int
}
recent_rank = recent_rank.astype(convert_dict)
recent_rank.dtypes
To merge the datasets, I used an inner join between the recent matches and recent ranks tables where on the matches data set I used the 'home_team', 'year', and 'month' columns to merge with the 'country_full', 'year_rank', and 'month_rank' columns. After merging, I dropped the 'year_rank', 'month_rank', 'day_rank', and 'country_full' columns since I needed the same columns to repeat the same merge to get the team ranks for the away teams.
# merging tables using inner join for home team rank and rank change
combo1 = pd.merge(recent_games, recent_rank, how='inner', left_on=['home_team','year', 'month'], right_on=['country_full', 'year_rank', 'month_rank'])
combo1 = combo1.drop(['year_rank', 'month_rank', 'day_rank', 'country_full'], 1)
combo1.rename(columns={'rank':'home_team_rank'}, inplace=True)
combo1.rename(columns={'rank_change':'home_rank_change'}, inplace=True)
combo1.head()
# merging tables using inner join for away team rank and rank change
combo2 = pd.merge(combo1, recent_rank, how='inner', left_on=['away_team','year', 'month'], right_on=['country_full', 'year_rank', 'month_rank'])
combo2 = combo2.drop(['year_rank', 'month_rank', 'day_rank', 'country_full'], 1)
combo2.rename(columns={'rank':'away_team_rank'}, inplace=True)
combo2.rename(columns={'rank_change':'away_rank_change'}, inplace=True)
combo2.head()
# reducing cardinality of tournament column
combo2.replace(to_replace=['African Cup of Nations', 'Lunar New Year Cup',
'AFC Asian Cup qualification', 'Cyprus International Tournament',
'Malta International Tournament', 'AFC Challenge Cup',
'COSAFA Cup', 'Kirin Cup',
'Merdeka Tournament',
'CFU Caribbean Cup qualification',
'African Cup of Nations qualification', 'Copa del Pacífico',
'AFF Championship', 'ELF Cup', 'CECAFA Cup',
'UAFA Cup qualification', "King's Cup", 'CFU Caribbean Cup',
'Gulf Cup', 'UNCAF Cup', 'EAFF Championship', 'Copa América',
'Gold Cup', 'WAFF Championship', 'Island Games', 'AFC Asian Cup',
'Nehru Cup', 'South Pacific Games',
'Amílcar Cabral Cup', 'AFC Challenge Cup qualification',
'Baltic Cup', 'SAFF Cup',
'African Nations Championship', 'VFF Cup', 'Confederations Cup',
'Dragon Cup', 'ABCS Tournament', 'Nile Basin Tournament',
'Nations Cup', 'Copa Paz del Chaco', 'Pacific Games',
'Oceania Nations Cup qualification', 'Oceania Nations Cup',
'UAFA Cup', 'OSN Cup', 'Windward Islands Tournament',
'Gold Cup qualification', 'Copa América qualification',
'Intercontinental Cup', 'UEFA Euro qualification', 'UEFA Euro'], value = 'Other competition', inplace=True)
combo2.tournament.unique()
#cheking for duplicates
combo2.duplicated().sum()
# dropping duplictes
combo2 = combo2.drop_duplicates()
#cheking for duplicates
combo2.duplicated().sum()
#checking for outliers
sb.boxplot(x=combo2['score_difference'])
plt.title('Goal Difference')
plt.show()
#checking for outliers
sb.boxplot(x=combo2['home_score'])
plt.title('Home Team Score')
plt.show()
#checking for outliers
sb.boxplot(x=combo2['away_score'])
plt.title('Away Team Score')
plt.show()
#checking table statistics
combo2.describe()
From the above box plots and table description, we can clearly observe outliers in the 'home_score', 'away_score', and 'score_difference' columns. In the 'home_score', the highest number goals scored is 21 yet the 75th percentile is 2 goals scored. This difference is also evident in the 'away_score' and 'score_difference' columns. Since we intend to make predictions using the data, we need to remove the outliers.
# compute iqr score to remove outliers for age and household size
q1_diff, q3_diff = np.percentile(combo2['score_difference'], [25, 75])
iqr_diff = q3_diff - q1_diff
lower_diff = q1_diff - (1.5 * iqr_diff)
upper_diff = q3_diff + (1.5 * iqr_diff)
upper_diff, lower_diff
#removing outliers
print(combo2.shape, ': With outliers')
combo2 = combo2.drop(combo2[combo2['score_difference'] > 5].index)
combo2 = combo2.drop(combo2[combo2['score_difference'] < -5].index)
combo2 = combo2.drop(combo2[combo2['home_score'] > 5.5].index)
combo2 = combo2.drop(combo2[combo2['away_score'] > 5.5].index)
print(combo2.shape, ': No outliers')
#checking for outliers
sb.boxplot(x=combo2['score_difference'])
plt.title('Goal Difference')
plt.show()
#checking for outliers
sb.boxplot(x=combo2['home_score'])
plt.title('Home Team Score')
plt.show()
#checking for outliers
sb.boxplot(x=combo2['away_score'])
plt.title('Away Team Score')
plt.show()
All outliers were effectively removed.
# checking the count for tournament types
combo2.tournament.value_counts()
Before the dependent variables can be used to make predictions, we need to get dummy variables for the categorical columns. Due to the high cardinality of the team names, we will only encode the tournament type.
# encode categorical columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
rresults_dummies = pd.get_dummies(combo2.drop(['away_team', 'home_team'], 1), prefix_sep='_', drop_first=True)
rresults_dummies[['away_team', 'home_team']] = combo2[['away_team', 'home_team']]
rresults_dummies.head()
# previewing last five rows of table with dummies
rresults_dummies.tail()
# creating a profile report for the combined data set
from pandas_profiling import ProfileReport
profile = ProfileReport(combo2, title='FIFA Matches and World Rankings Report')
profile.to_notebook_iframe()