1. Defining the Question

a) Specifying the Question

As the football analyst for Mchezopesa Ltd, I have been tasked with creating a model that predicts the outcome of a football match between national teams.

b) Defining the Metric for Success

i. Create a model that predicts whether the home team will win, lose, or draw in a football match.

ii. Create a model that predicts the number of goals that the home team will score.

iii. Create a model that predicts the number of goals that the away team will score in a given match.

c) Understanding the context

The term ‘odds’ is commonly used in betting, and it expresses how likely an event is to occur. In a football match, the bookmaker assigns odds to each outcome, i.e. win, loss, or draw (relative to the home team), based on the true likelihood of that outcome while also factoring in the team’s form, team statistics, historical precedents, expert opinion, and team motivation, among other factors surrounding each match. To make a profit, bookmakers then shorten the odds, so that the payout is lower than the true probabilities would justify, before offering the bet to punters. While factors such as expert opinion and team motivation are hard to measure, team statistics such as wins, losses, goals scored, goals conceded, and team ranks are recorded and can be used to predict the outcome of matches.
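
To make the bookmaker's margin concrete, the sketch below converts a set of decimal odds into implied probabilities. The odds values and the function name `implied_probabilities` are hypothetical and are not taken from any bookmaker or from the project data; the only point is that the implied probabilities of the three outcomes sum to more than 1, and the excess is the margin.

```python
def implied_probabilities(decimal_odds):
    """Return implied probabilities, the bookmaker margin, and margin-free probabilities.

    decimal_odds: dict mapping an outcome name to its decimal odds (hypothetical values).
    """
    # A decimal price of d implies a probability of 1 / d
    implied = {outcome: 1.0 / price for outcome, price in decimal_odds.items()}
    # The implied probabilities sum to more than 1; the excess is the margin
    total = sum(implied.values())
    margin = total - 1.0
    # Rescaling so the probabilities sum to 1 gives one simple margin-free estimate
    fair = {outcome: p / total for outcome, p in implied.items()}
    return implied, margin, fair


# Illustrative prices only
odds = {"home_win": 2.10, "draw": 3.30, "away_win": 3.60}
implied, margin, fair = implied_probabilities(odds)
print("implied:", implied)
print("margin:", round(margin, 4))
print("fair:", fair)
```

Proportional rescaling is only one of several ways to strip the margin; it is used here because it is the simplest.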

FIFA has good data on the different matches, and it also has a ranking system that is used to measure the performance of national teams over time. The FIFA ranking system is updated periodically to ensure that the team rankings are reflective of team performances. The latest review of the ranking system was done in 2018, replacing a system that had been in place since 2006. More information about the FIFA ranking system can be found here. Aside from the FIFA website, bookmakers also source team information from team news releases and professional contacts within different national teams.

d) Recording the Experimental Design

To predict the match outcome, I am tasked with creating a logistic regression model. To predict the match scores, I am tasked with creating a polynomial regression model.

To improve model performance, I will perform feature engineering and parameter tuning.

e) Data Relevance

Two data sets were provided for this project. The first data set contains football matches played by national teams across different tournaments between 1872 and 2019. It includes the home team, away team, match scores, the country and city where the match was played, and whether or not the playing ground was neutral. The second data set contains the national team ranks of different countries in the world. The ranking data set is updated monthly depending on the performance of the different teams in their respective matches.

To predict the outcome of a match, it is important to factor in team performance, which is reflected in the ranking, while also considering previous results, which are reflected in the first data set.

Importing Libraries

In [ ]:
pip install -U pandas-profiling
In [2]:
#Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import pandas_profiling
sb.set_style()
import warnings
warnings.filterwarnings("ignore")

2. Loading and Checking the data sets

In [3]:
# previewing top of fifa ranking data set
fifa_ranking = pd.read_csv('/home/practitioner/Downloads/fifa_ranking.csv')
fifa_ranking.head()
Out[3]:
rank country_full country_abrv total_points previous_points rank_change cur_year_avg cur_year_avg_weighted last_year_avg last_year_avg_weighted two_year_ago_avg two_year_ago_weighted three_year_ago_avg three_year_ago_weighted confederation rank_date
0 1 Germany GER 0.0 57 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 UEFA 1993-08-08
1 2 Italy ITA 0.0 57 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 UEFA 1993-08-08
2 3 Switzerland SUI 0.0 50 9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 UEFA 1993-08-08
3 4 Sweden SWE 0.0 55 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 UEFA 1993-08-08
4 5 Argentina ARG 0.0 51 5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 CONMEBOL 1993-08-08
In [4]:
# preview last five rows
fifa_ranking.tail()
Out[4]:
rank country_full country_abrv total_points previous_points rank_change cur_year_avg cur_year_avg_weighted last_year_avg last_year_avg_weighted two_year_ago_avg two_year_ago_weighted three_year_ago_avg three_year_ago_weighted confederation rank_date
57788 206 Anguilla AIA 0.0 0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 CONCACAF 2018-06-07
57789 206 Bahamas BAH 0.0 0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 CONCACAF 2018-06-07
57790 206 Eritrea ERI 0.0 0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 CAF 2018-06-07
57791 206 Somalia SOM 0.0 0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 CAF 2018-06-07
57792 206 Tonga TGA 0.0 0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 OFC 2018-06-07

A quick overview of the data indicates that the ranks of the different countries span August 1993 to June 2018. According to FIFA, the ranking system was most recently revised in 2018; the previous system had been in place from 2006 to 2018. For consistency, I will rely only on data that spans 2006 to 2018 to predict the outcome of the 2018 World Cup matches.

In [5]:
# previewing top of match results data set
match_results = pd.read_csv('/home/practitioner/Downloads/results.csv')
match_results.head()
Out[5]:
date home_team away_team home_score away_score tournament city country neutral
0 1872-11-30 Scotland England 0 0 Friendly Glasgow Scotland False
1 1873-03-08 England Scotland 4 2 Friendly London England False
2 1874-03-07 Scotland England 2 1 Friendly Glasgow Scotland False
3 1875-03-06 England Scotland 2 2 Friendly London England False
4 1876-03-04 Scotland England 3 0 Friendly Glasgow Scotland False
In [6]:
# preview last five rows
match_results.tail()
Out[6]:
date home_team away_team home_score away_score tournament city country neutral
40834 2019-07-18 American Samoa Tahiti 8 1 Pacific Games Apia Samoa True
40835 2019-07-18 Fiji Solomon Islands 4 4 Pacific Games Apia Samoa True
40836 2019-07-19 Senegal Algeria 0 1 African Cup of Nations Cairo Egypt True
40837 2019-07-19 Tajikistan North Korea 0 1 Intercontinental Cup Ahmedabad India True
40838 2019-07-20 Papua New Guinea Fiji 1 1 Pacific Games Apia Samoa True

Assuming the results dataset is also ordered by date, the recorded matches span 1872 to 2019. To synchronize the two datasets, I will only use matches that were played between 2006 and 2018.
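
Rather than relying on row order, the date range can be read directly from the column. The following is a minimal sketch on a toy frame; `toy_results` is an illustrative name, its three rows reuse dates and teams that appear in the previews in this notebook, and they are deliberately shuffled.

```python
import pandas as pd

# Toy stand-in for the results table, intentionally out of date order
toy_results = pd.DataFrame({
    "date": ["2019-07-20", "1872-11-30", "2006-01-02"],
    "home_team": ["Papua New Guinea", "Scotland", "Qatar"],
})

# ISO-formatted dates (YYYY-MM-DD) compare correctly even as plain strings
earliest = toy_results["date"].min()
latest = toy_results["date"].max()

# Explicit check of the "ordered by date" assumption
is_sorted = toy_results["date"].is_monotonic_increasing

print(earliest, latest, is_sorted)
```

The same three calls applied to `match_results['date']` would confirm the span without assuming anything about row order.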

In [7]:
# checking the shape of our datasets
print('Fifa ranking dataset shape:', fifa_ranking.shape)
print('Results dataset shape:', match_results.shape)
Fifa ranking dataset shape: (57793, 16)
Results dataset shape: (40839, 9)
In [8]:
# checking the data types
fifa_ranking.dtypes
Out[8]:
rank                         int64
country_full                object
country_abrv                object
total_points               float64
previous_points              int64
rank_change                  int64
cur_year_avg               float64
cur_year_avg_weighted      float64
last_year_avg              float64
last_year_avg_weighted     float64
two_year_ago_avg           float64
two_year_ago_weighted      float64
three_year_ago_avg         float64
three_year_ago_weighted    float64
confederation               object
rank_date                   object
dtype: object

The data types seem appropriate for the different columns, aside from the rank date, which is stored as an object. During data preparation, this column will be split into year, month, and day components.

In [9]:
#checking the data types
match_results.dtypes
Out[9]:
date          object
home_team     object
away_team     object
home_score     int64
away_score     int64
tournament    object
city          object
country       object
neutral         bool
dtype: object

Aside from the date column which is stored as an object rather than date-time data type, all the other data types are appropriate.
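
If a true date-time type is wanted, one option is `pd.to_datetime`, after which the year, month, and day are available through the `.dt` accessor. This is a sketch on a toy frame (`toy` is an illustrative name) and is not the approach taken in the cells that follow, which split the date string instead.

```python
import pandas as pd

# Toy frame with dates stored as strings, as in the raw CSV files
toy = pd.DataFrame({"date": ["1872-11-30", "2018-06-07"]})

# Parse the strings into datetime64 values using the known ISO format
toy["date"] = pd.to_datetime(toy["date"], format="%Y-%m-%d")

# Integer components come straight from the .dt accessor
toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month
toy["day"] = toy["date"].dt.day

print(toy.dtypes)
```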

3. External Data Source Validation

I confirmed the validity of the FIFA world rankings using information from the official FIFA website, which can be found here. I also confirmed the accuracy of the match scores recorded for different games through a series of validation checks against different tournament websites on the internet.

4. Tidying the Dataset

a. Match Results dataset

In [10]:
# checking for null values in the results dataset
match_results.isna().sum()
Out[10]:
date          0
home_team     0
away_team     0
home_score    0
away_score    0
tournament    0
city          0
country       0
neutral       0
dtype: int64
In [11]:
# split date column into year, month, and day
md = match_results['date'].str.split('-',n=2, expand=True)
match_results['year'] = md[0]
match_results['month'] = md[1]
match_results['day'] = md[2]
#match_results = match_results.drop('date', 1)
match_results[['year', 'month', 'day']] = match_results[['year', 'month', 'day']].astype(int)
match_results.dtypes
Out[11]:
date          object
home_team     object
away_team     object
home_score     int64
away_score     int64
tournament    object
city          object
country       object
neutral         bool
year           int64
month          int64
day            int64
dtype: object

Splitting the date column will assist in the merging of the home and away teams and their FIFA ranks for the respective years.

In [12]:
# drop columns not needed
match_results = match_results.drop(columns=['city', 'country'])  # keyword form; a positional axis is rejected by pandas 2.x
match_results.head()
Out[12]:
date home_team away_team home_score away_score tournament neutral year month day
0 1872-11-30 Scotland England 0 0 Friendly False 1872 11 30
1 1873-03-08 England Scotland 4 2 Friendly False 1873 3 8
2 1874-03-07 Scotland England 2 1 Friendly False 1874 3 7
3 1875-03-06 England Scotland 2 2 Friendly False 1875 3 6
4 1876-03-04 Scotland England 3 0 Friendly False 1876 3 4
In [13]:
# adding a score difference and win/draw/lose column relative to the home team
match_results['score_difference'] = match_results['home_score'] - match_results['away_score']
conditions = [(match_results['score_difference'] > 0), (match_results['score_difference'] == 0), (match_results['score_difference'] < 0)]
values = [2, 1, 0]  #where 2 is win, 1 is draw, 0 is loss
match_results['outcome'] = np.select(conditions, values)
match_results.head()
Out[13]:
date home_team away_team home_score away_score tournament neutral year month day score_difference outcome
0 1872-11-30 Scotland England 0 0 Friendly False 1872 11 30 0 1
1 1873-03-08 England Scotland 4 2 Friendly False 1873 3 8 2 2
2 1874-03-07 Scotland England 2 1 Friendly False 1874 3 7 1 2
3 1875-03-06 England Scotland 2 2 Friendly False 1875 3 6 0 1
4 1876-03-04 Scotland England 3 0 Friendly False 1876 3 4 3 2

The score difference helps in the evaluation of each team's performance against their opponents. I recorded wins as 2, draws as 1, and losses as 0.

In [14]:
#selecting matches that took place between January 2006 and June 2018
recent_results = match_results[match_results['date'] >= '2006-01-01']
recent_results = recent_results[recent_results['date'] <= '2018-06-07']
recent_results.shape
Out[14]:
(11801, 12)

As was evident earlier, the match results dataset holds matches from as far back as 1872, which are of little use in predicting the outcome of matches today. To get a sample that better reflects modern-day football, the data needs to be filtered to a specific point in recent history. The choice of 2006 as the lower bound is based on the fact that the most recent update of FIFA's ranking system before the 2018 World Cup was done in 2006. The ranking procedures are revised with each update, and this could affect the consistency of ranking as a predictor of team performance across different eras. 2006 to 2018 therefore seemed like a viable period to work with.

b. Fifa ranking dataset

In [15]:
# checking for null values in the FIFA ranking dataset
fifa_ranking.isna().sum()
Out[15]:
rank                       0
country_full               0
country_abrv               0
total_points               0
previous_points            0
rank_change                0
cur_year_avg               0
cur_year_avg_weighted      0
last_year_avg              0
last_year_avg_weighted     0
two_year_ago_avg           0
two_year_ago_weighted      0
three_year_ago_avg         0
three_year_ago_weighted    0
confederation              0
rank_date                  0
dtype: int64
In [16]:
# splitting date column to year, month, day
new = fifa_ranking['rank_date'].str.split('-',n=2, expand=True)
fifa_ranking['year_rank'] = new[0]
fifa_ranking['month_rank'] = new[1]
fifa_ranking['day_rank'] = new[2]
#fifa_ranking = fifa_ranking.drop('rank_date', 1)
fifa_ranking.tail()
Out[16]:
rank country_full country_abrv total_points previous_points rank_change cur_year_avg cur_year_avg_weighted last_year_avg last_year_avg_weighted two_year_ago_avg two_year_ago_weighted three_year_ago_avg three_year_ago_weighted confederation rank_date year_rank month_rank day_rank
57788 206 Anguilla AIA 0.0 0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 CONCACAF 2018-06-07 2018 06 07
57789 206 Bahamas BAH 0.0 0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 CONCACAF 2018-06-07 2018 06 07
57790 206 Eritrea ERI 0.0 0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 CAF 2018-06-07 2018 06 07
57791 206 Somalia SOM 0.0 0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 CAF 2018-06-07 2018 06 07
57792 206 Tonga TGA 0.0 0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 OFC 2018-06-07 2018 06 07

After the split, the year and month columns will be used to merge with the data set containing the match results.

In [17]:
# selecting appropriate columns
select_ranking = fifa_ranking[['rank', 'rank_change', 'country_full', 'rank_date', 'year_rank', 'month_rank', 'day_rank']]
select_ranking.head()
Out[17]:
rank rank_change country_full rank_date year_rank month_rank day_rank
0 1 0 Germany 1993-08-08 1993 08 08
1 2 0 Italy 1993-08-08 1993 08 08
2 3 9 Switzerland 1993-08-08 1993 08 08
3 4 0 Sweden 1993-08-08 1993 08 08
4 5 5 Argentina 1993-08-08 1993 08 08

Many of the columns in the FIFA ranking data set are not useful to our analysis, so we select only the columns that we need, which include the country name, rank, rank change, and the date of ranking.

In [19]:
# renaming countries with different names in the two data sets
select_ranking = select_ranking.replace("Côte d'Ivoire", 'Ivory Coast')
select_ranking.head()
Out[19]:
rank rank_change country_full rank_date year_rank month_rank day_rank
0 1 0 Germany 1993-08-08 1993 08 08
1 2 0 Italy 1993-08-08 1993 08 08
2 3 9 Switzerland 1993-08-08 1993 08 08
3 4 0 Sweden 1993-08-08 1993 08 08
4 5 5 Argentina 1993-08-08 1993 08 08

Some of the countries have different names in the two data sets so we need to rename them.
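
One way to find such mismatches systematically is to compare the sets of team names in the two tables. The sketch below uses toy frames (`toy_matches` and `toy_ranks` are illustrative names) built around the one mismatch handled above; any other mismatches in the real data would have to be discovered the same way.

```python
import pandas as pd

# Toy stand-ins for the match results and ranking tables
toy_matches = pd.DataFrame({
    "home_team": ["Ivory Coast", "Egypt"],
    "away_team": ["Libya", "Ivory Coast"],
})
toy_ranks = pd.DataFrame({"country_full": ["Côte d'Ivoire", "Egypt", "Libya"]})

# Every team name that appears on either side of a match
match_names = set(toy_matches["home_team"]) | set(toy_matches["away_team"])
rank_names = set(toy_ranks["country_full"])

# Names that would fail to match in a join on the team name
only_in_matches = sorted(match_names - rank_names)
only_in_ranks = sorted(rank_names - match_names)

print("only in matches:", only_in_matches)
print("only in ranks:", only_in_ranks)
```

Names that show up in only one of the two lists are candidates for renaming before the merge.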

In [20]:
#selecting fifa team rankings between 2006 and 2018
recent_ranking = select_ranking[select_ranking['rank_date'] >= '2006-01-01']
recent_ranking.tail()
Out[20]:
rank rank_change country_full rank_date year_rank month_rank day_rank
57788 206 1 Anguilla 2018-06-07 2018 06 07
57789 206 1 Bahamas 2018-06-07 2018 06 07
57790 206 1 Eritrea 2018-06-07 2018 06 07
57791 206 1 Somalia 2018-06-07 2018 06 07
57792 206 1 Tonga 2018-06-07 2018 06 07

For consistency, we select team ranks that fall within the same review period.

In [21]:
#drop date columns
recent_rank = recent_ranking.drop(columns='rank_date')
recent_games = recent_results.drop(columns='date')

We drop the date columns since we already have the year, month and days.

In [22]:
# convert year, month, and day to integer for merging
convert_dict = {'year_rank': int,
                'month_rank': int, 
                'day_rank': int
               } 
  
recent_rank = recent_rank.astype(convert_dict) 
recent_rank.dtypes
Out[22]:
rank             int64
rank_change      int64
country_full    object
year_rank        int64
month_rank       int64
day_rank         int64
dtype: object

c. Merging the data sets

To merge the datasets, I used an inner join between the recent matches and recent ranks tables. On the matches side I used the 'home_team', 'year', and 'month' columns, and on the ranks side the 'country_full', 'year_rank', and 'month_rank' columns. After merging, I dropped the 'year_rank', 'month_rank', 'day_rank', and 'country_full' columns, since the same columns are needed again when the merge is repeated to get the ranks of the away teams.
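
The two-step merge can be sketched on toy data as follows. The frames `games` and `ranks` and the helper `add_rank` are illustrative; the rank values reuse figures from the previews in this notebook, and "Team X" is a made-up team with no ranking row. Note that the sketch uses a left join rather than the inner join described above, purely so that the rows an inner join would discard remain visible.

```python
import pandas as pd

# Toy matches; "Team X" is hypothetical and has no entry in the ranking table
games = pd.DataFrame({
    "home_team": ["Qatar", "Egypt", "Team X"],
    "away_team": ["Libya", "Libya", "Egypt"],
    "year": [2006, 2006, 2006],
    "month": [1, 1, 1],
})
# Toy monthly rankings
ranks = pd.DataFrame({
    "country_full": ["Qatar", "Egypt", "Libya"],
    "year_rank": [2006, 2006, 2006],
    "month_rank": [1, 1, 1],
    "rank": [89, 32, 80],
})


def add_rank(df, team_col, new_name):
    """Attach the monthly rank of the team in `team_col` as column `new_name`."""
    merged = df.merge(
        ranks,
        how="left",  # left join keeps unmatched matches so they can be inspected
        left_on=[team_col, "year", "month"],
        right_on=["country_full", "year_rank", "month_rank"],
    )
    # Drop the join keys so the same merge can be repeated for the other team
    merged = merged.drop(columns=["country_full", "year_rank", "month_rank"])
    return merged.rename(columns={"rank": new_name})


with_ranks = add_rank(games, "home_team", "home_team_rank")
with_ranks = add_rank(with_ranks, "away_team", "away_team_rank")

rank_cols = ["home_team_rank", "away_team_rank"]
# Rows with a missing rank are exactly those an inner join would have dropped
unmatched = with_ranks[with_ranks[rank_cols].isna().any(axis=1)]
matched = with_ranks.dropna(subset=rank_cols)

print(len(matched), "matched;", len(unmatched), "unmatched")
```

Inspecting `unmatched` on the real data would show how many matches are lost to missing rankings or mismatched team names.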

In [23]:
# merging tables using inner join for home team rank and rank change
combo1 = pd.merge(recent_games, recent_rank, how='inner', left_on=['home_team','year', 'month'], right_on=['country_full', 'year_rank', 'month_rank'])
combo1 = combo1.drop(columns=['year_rank', 'month_rank', 'day_rank', 'country_full'])
combo1.rename(columns={'rank':'home_team_rank'}, inplace=True)
combo1.rename(columns={'rank_change':'home_rank_change'}, inplace=True)
combo1.head()
Out[23]:
home_team away_team home_score away_score tournament neutral year month day score_difference outcome home_team_rank home_rank_change
0 Qatar Libya 2 0 Friendly False 2006 1 2 2 2 89 6
1 Egypt Zimbabwe 2 0 Friendly False 2006 1 5 2 2 32 0
2 Egypt South Africa 1 2 Friendly False 2006 1 14 -1 0 32 0
3 Egypt Libya 3 0 African Cup of Nations False 2006 1 20 3 2 32 0
4 Egypt Morocco 0 0 African Cup of Nations False 2006 1 24 0 1 32 0
In [24]:
# merging tables using inner join for away team rank and rank change
combo2 = pd.merge(combo1, recent_rank, how='inner', left_on=['away_team','year', 'month'], right_on=['country_full', 'year_rank', 'month_rank'])
combo2 = combo2.drop(columns=['year_rank', 'month_rank', 'day_rank', 'country_full'])
combo2.rename(columns={'rank':'away_team_rank'}, inplace=True)
combo2.rename(columns={'rank_change':'away_rank_change'}, inplace=True)
combo2.head()
Out[24]:
home_team away_team home_score away_score tournament neutral year month day score_difference outcome home_team_rank home_rank_change away_team_rank away_rank_change
0 Qatar Libya 2 0 Friendly False 2006 1 2 2 2 89 6 80 0
1 Egypt Libya 3 0 African Cup of Nations False 2006 1 20 3 2 32 0 80 0
2 Tunisia Libya 1 0 Friendly False 2006 1 12 1 2 28 0 80 0
3 Egypt Zimbabwe 2 0 Friendly False 2006 1 5 2 2 32 0 53 0
4 Morocco Zimbabwe 1 0 Friendly False 2006 1 14 1 2 35 1 53 0
In [25]:
# reducing cardinality of tournament column
combo2.replace(to_replace=['African Cup of Nations', 'Lunar New Year Cup',
       'AFC Asian Cup qualification', 'Cyprus International Tournament',
       'Malta International Tournament', 'AFC Challenge Cup',
       'COSAFA Cup', 'Kirin Cup',
       'Merdeka Tournament',
       'CFU Caribbean Cup qualification',
       'African Cup of Nations qualification', 'Copa del Pacífico',
       'AFF Championship', 'ELF Cup', 'CECAFA Cup',
       'UAFA Cup qualification', "King's Cup", 'CFU Caribbean Cup',
       'Gulf Cup', 'UNCAF Cup', 'EAFF Championship', 'Copa América',
       'Gold Cup', 'WAFF Championship', 'Island Games', 'AFC Asian Cup',
       'Nehru Cup', 'South Pacific Games',
       'Amílcar Cabral Cup', 'AFC Challenge Cup qualification',
       'Baltic Cup', 'SAFF Cup',
       'African Nations Championship', 'VFF Cup', 'Confederations Cup',
       'Dragon Cup', 'ABCS Tournament', 'Nile Basin Tournament',
       'Nations Cup', 'Copa Paz del Chaco', 'Pacific Games',
       'Oceania Nations Cup qualification', 'Oceania Nations Cup',
       'UAFA Cup', 'OSN Cup', 'Windward Islands Tournament',
       'Gold Cup qualification', 'Copa América qualification',
       'Intercontinental Cup', 'UEFA Euro qualification', 'UEFA Euro'], value = 'Other competition', inplace=True)
combo2.tournament.unique()
Out[25]:
array(['Friendly', 'Other competition', 'FIFA World Cup',
       'FIFA World Cup qualification'], dtype=object)
In [26]:
#checking for duplicates
combo2.duplicated().sum()
Out[26]:
48
In [27]:
# dropping duplicates
combo2 = combo2.drop_duplicates()
#checking for duplicates
combo2.duplicated().sum()
Out[27]:
0
In [28]:
#checking for outliers
sb.boxplot(x=combo2['score_difference'])
plt.title('Goal Difference')
plt.show()
In [29]:
#checking for outliers
sb.boxplot(x=combo2['home_score'])
plt.title('Home Team Score')
plt.show()
In [30]:
#checking for outliers
sb.boxplot(x=combo2['away_score'])
plt.title('Away Team Score')
plt.show()
In [31]:
#checking table statistics
combo2.describe()
Out[31]:
home_score away_score year month day score_difference outcome home_team_rank home_rank_change away_team_rank away_rank_change
count 9437.000000 9437.000000 9437.000000 9437.000000 9437.000000 9437.000000 9437.000000 9437.000000 9437.000000 9437.000000 9437.000000
mean 1.539472 1.055632 2011.721734 6.956978 14.341846 0.483840 1.193494 80.305818 0.690580 82.900922 0.071527
std 1.516916 1.228537 3.500398 3.297470 8.615898 2.104135 0.849075 52.664288 7.788676 53.258717 7.891007
min 0.000000 0.000000 2006.000000 1.000000 1.000000 -15.000000 0.000000 1.000000 -62.000000 1.000000 -62.000000
25% 0.000000 0.000000 2009.000000 4.000000 7.000000 -1.000000 0.000000 35.000000 -2.000000 38.000000 -3.000000
50% 1.000000 1.000000 2012.000000 7.000000 13.000000 0.000000 1.000000 76.000000 0.000000 78.000000 0.000000
75% 2.000000 2.000000 2015.000000 10.000000 22.000000 2.000000 2.000000 119.000000 3.000000 121.000000 2.000000
max 17.000000 15.000000 2018.000000 12.000000 31.000000 17.000000 2.000000 209.000000 73.000000 209.000000 82.000000

From the above box plots and table description, we can clearly observe outliers in the 'home_score', 'away_score', and 'score_difference' columns. In the 'home_score' column, the highest number of goals scored is 17, yet the 75th percentile is 2 goals. A similar gap is evident in the 'away_score' and 'score_difference' columns. Since we intend to make predictions using the data, we need to remove the outliers.

In [33]:
# compute IQR fences to identify outliers in the score difference
q1_diff, q3_diff = np.percentile(combo2['score_difference'], [25, 75])
iqr_diff = q3_diff - q1_diff

lower_diff = q1_diff - (1.5 * iqr_diff)
upper_diff = q3_diff + (1.5 * iqr_diff)
upper_diff, lower_diff
Out[33]:
(6.5, -5.5)
In [34]:
#removing outliers
print(combo2.shape, ': With outliers')

combo2 = combo2.drop(combo2[combo2['score_difference'] > 5].index)
combo2 = combo2.drop(combo2[combo2['score_difference'] < -5].index)
combo2 = combo2.drop(combo2[combo2['home_score'] > 5.5].index)
combo2 = combo2.drop(combo2[combo2['away_score'] > 5.5].index)

print(combo2.shape, ': No outliers')
(9437, 15) : With outliers
(9158, 15) : No outliers
In [35]:
#checking for outliers
sb.boxplot(x=combo2['score_difference'])
plt.title('Goal Difference')
plt.show()
In [36]:
#checking for outliers
sb.boxplot(x=combo2['home_score'])
plt.title('Home Team Score')
plt.show()
In [37]:
#checking for outliers
sb.boxplot(x=combo2['away_score'])
plt.title('Away Team Score')
plt.show()

All outliers were effectively removed.
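
For reference, the 1.5 × IQR rule used above can be wrapped in a small helper. This is a sketch: `iqr_bounds` is a hypothetical name, and the sample values are made up so that their quartiles match those of the 'score_difference' column (-1 and 2). The removal cell above applies cut-offs of 5 and -5 for the score difference, slightly tighter than the computed fences of 6.5 and -5.5.

```python
import numpy as np


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up goal differences with two extreme results
sample = np.array([-15, -1, -1, 0, 0, 1, 2, 2, 17])

lower, upper = iqr_bounds(sample)
# Keep only the values that lie inside the fences
kept = sample[(sample >= lower) & (sample <= upper)]

print(lower, upper)
print(kept)
```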

In [38]:
# checking the count for tournament types
combo2.tournament.value_counts()
Out[38]:
Friendly                        3682
Other competition               3425
FIFA World Cup qualification    1984
FIFA World Cup                    67
Name: tournament, dtype: int64

Before the categorical predictor variables can be used to make predictions, we need to create dummy variables for them. Due to the high cardinality of the team names, we will only encode the tournament type.
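
As a small illustration of what the encoding does, the sketch below applies `pd.get_dummies` with `drop_first=True` to a toy frame holding the four tournament categories. With `drop_first=True`, the first category in sorted order ('FIFA World Cup') gets no column of its own and acts as the baseline. The name `toy` is illustrative.

```python
import pandas as pd

# Toy frame holding the four tournament categories left after grouping
toy = pd.DataFrame({
    "tournament": [
        "Friendly",
        "FIFA World Cup",
        "Other competition",
        "FIFA World Cup qualification",
    ]
})

# drop_first=True removes the first category in sorted order to avoid a redundant column
dummies = pd.get_dummies(toy, prefix_sep="_", drop_first=True)

print(list(dummies.columns))
print(dummies)
```

A row belonging to the baseline category is represented by zeros (or False, depending on the pandas version) in all three remaining columns.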

In [39]:
# encode categorical columns
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
rresults_dummies = pd.get_dummies(combo2.drop(columns=['away_team', 'home_team']), prefix_sep='_', drop_first=True)
rresults_dummies[['away_team', 'home_team']] = combo2[['away_team', 'home_team']]
rresults_dummies.head()
Out[39]:
home_score away_score neutral year month day score_difference outcome home_team_rank home_rank_change away_team_rank away_rank_change tournament_FIFA World Cup qualification tournament_Friendly tournament_Other competition away_team home_team
0 2 0 False 2006 1 2 2 2 89 6 80 0 0 1 0 Libya Qatar
1 3 0 False 2006 1 20 3 2 32 0 80 0 0 0 1 Libya Egypt
2 1 0 False 2006 1 12 1 2 28 0 80 0 0 1 0 Libya Tunisia
3 2 0 False 2006 1 5 2 2 32 0 53 0 0 1 0 Zimbabwe Egypt
4 1 0 False 2006 1 14 1 2 35 1 53 0 0 1 0 Zimbabwe Morocco
In [40]:
# previewing last five rows of table with dummies
rresults_dummies.tail()
Out[40]:
home_score away_score neutral year month day score_difference outcome home_team_rank home_rank_change away_team_rank away_rank_change tournament_FIFA World Cup qualification tournament_Friendly tournament_Other competition away_team home_team
9480 1 1 False 2018 6 5 0 1 126 7 129 10 0 1 0 Latvia Lithuania
9481 1 0 False 2018 6 6 1 2 53 -5 55 0 0 1 0 Panama Norway
9482 1 1 False 2018 6 6 0 1 78 1 51 -2 0 1 0 Hungary Belarus
9483 3 0 False 2018 6 7 3 2 14 3 95 -7 0 1 0 Uzbekistan Uruguay
9484 3 0 False 2018 6 7 3 2 4 0 66 -2 0 1 0 Algeria Portugal

5. Exploratory Data Analysis

In [41]:
# creating a profile report for the combined data set
from pandas_profiling import ProfileReport
profile = ProfileReport(combo2, title='FIFA Matches and World Rankings Report')
In [42]:
profile.to_notebook_iframe()