Notebook first written: 13/12/2020
Notebook last updated: 26/12/2020
Click here to jump straight to the Exploratory Data Analysis section and skip the Task Brief, Data Sources, and Data Engineering sections. Or click here to jump straight to the Conclusion.
This article was written with the aid of StrataData, which is property of Stratagem Technologies. StrataData powers the StrataBet Sports Trading Platform, in addition to StrataBet Premium Recommendations.
This notebook engineers StrataBet data for football matches in a variety of European leagues during the 16/17 and 17/18 seasons, using pandas for data manipulation through DataFrames.
For more information about this notebook and the author, I'm available through all the following channels:
The accompanying GitHub repository for this notebook can be found here and a static version of this notebook can be found here.
This notebook was written using Python 3 and requires the following libraries:
Jupyter notebooks
for this notebook environment with which this project is presented;NumPy
for multidimensional array computing;pandas
for data analysis and manipulation; andmatplotlib
for data visualisations;All packages used for this notebook except for BeautifulSoup can be obtained by downloading and installing the Conda distribution, available on all platforms (Windows, Linux and Mac OSX). Step-by-step guides on how to install Anaconda can be found for Windows here and Mac here, as well as in the Anaconda documentation itself here.
# Python ≥3.5 (ideally)
import platform
import sys, getopt
assert sys.version_info >= (3, 5)
import csv
# Import Dependencies
%matplotlib inline
# Math Operations
import numpy as np
from math import pi
# Datetime
import datetime
from datetime import date
import time
# Data Preprocessing
import pandas as pd # version 1.0.3
import os # used to read the csv filenames
import re
import random
from io import BytesIO
from pathlib import Path
# Reading directories
import glob
import os
# Working with JSON
import json
from pandas.io.json import json_normalize
from ast import literal_eval
# Data Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')
import missingno as msno # visually display missing data
# Display in Jupyter
from IPython.display import Image, YouTubeVideo
from IPython.core.display import HTML
# Ignore Warnings
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")
print('Setup Complete')
Setup Complete
# Python / module versions used here for reference
print('Python: {}'.format(platform.python_version()))
print('NumPy: {}'.format(np.__version__))
print('pandas: {}'.format(pd.__version__))
print('matplotlib: {}'.format(mpl.__version__))
print('Seaborn: {}'.format(sns.__version__))
Python: 3.7.6 NumPy: 1.18.1 pandas: 1.0.1 matplotlib: 3.1.3 Seaborn: 0.10.0
# Set up initial paths to subfolders
base_dir = os.path.join('..', '..', )
data_dir = os.path.join(base_dir, 'data')
data_dir_fbref = os.path.join(base_dir, 'data', 'fbref')
data_dir_stratabet = os.path.join(base_dir, 'data', 'stratabet')
img_dir = os.path.join(base_dir, 'img')
fig_dir = os.path.join(base_dir, 'img', 'fig')
video_dir = os.path.join(base_dir, 'video')
pd.set_option('display.max_columns', None)
This Jupyter notebook explores how to engineer StrataBet Events data using pandas for data manipulation through DataFrames.
The resulting engineered Datarame is exported as a CSV file. This data can be further analysed in Python, joined to other datasets, or explored using Tableau, PowerBI, Microsoft Excel.
This StrataData has been made availble by Stratagem Technologies. StrataData powers the StrataBet Sports Trading Platform, in addition to StrataBet Premium Recommendations.
Before conducting our EDA, the data needs to be imported as a DataFrame in the Data Sources section Section 3 and cleaned in the Data Engineering section Section 4.
We'll be using the pandas library to import our data to this workbook as a DataFrame.
The StrataBet Events dataset has twelve features (columns) with the following definitions and data types:
Feature | Data type |
---|---|
eventId |
int64 |
subEventName |
object |
tags |
object |
playerId |
int64 |
positions |
object |
matchId |
int64 |
eventName |
object |
teamId |
int64 |
matchPeriod |
object |
eventSec |
float64 |
subEventId |
object |
id |
int64 |
# Combine individual csv files to form one DataFrame, df_raw, using glob
lst_files_chances = glob.glob(data_dir_stratabet + '/raw/chances/individual_competitions' + "/*.csv") # Creates a list of all csv files
li = [] # pd.concat takes a list of DataFrames as an argument
for filename in lst_files_chances:
df_raw_temp = pd.read_csv(filename, index_col=None, header=0)
li.append(df_raw_temp)
df_stratabet_chances_raw = pd.concat(li, axis=0, ignore_index=True) # ignore_index=True as we don't want pandas to try an align row indexes
Let's quality of the dataset by looking first and last rows in pandas using the head() and tail() methods.
# Display the first 5 rows of the raw DataFrame, df_stratabet_chances_raw
df_stratabet_chances_raw.head()
index | competition | gsm_id | kickoffDate | kickoffTime | hometeam_team1 | awayteam_team2 | icon | chanceRating | team | type | time | player | location_x | location_y | bodyPart | shotQuality | defPressure | numDefPlayers | numAttPlayers | outcome | primaryPlayer | primaryType | primaryLocation_x | primaryLocation_y | secondaryPlayer | secondaryType | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4684 | GreSL | 2355591 | 10/09/2016 | 13:00:00 | Kerkyra | Platanias | poorchance | poorchance | Kerkyra | Open Play | 24:43:00 | D. Epstein | 81 | 48 | Left | 3 | 5 | 2 | 0 | Saved | - | - | - | - | - | - |
1 | 4685 | GreSL | 2355591 | 10/09/2016 | 13:00:00 | Kerkyra | Platanias | goodchance | goodchance | Kerkyra | Open Play | 45:29:00 | D. Epstein | 27 | 60 | Left | 2 | 2 | 2 | 0 | Defended | Thuram | Open Play Pass | -29 | 82 | - | - |
2 | 4686 | GreSL | 2355591 | 10/09/2016 | 13:00:00 | Kerkyra | Platanias | poorchance | poorchance | Kerkyra | Open Play | 44:34:00 | S. Siontis | 23 | 117 | Right | 2 | 1 | 4 | 1 | Missed | - | - | - | - | - | - |
3 | 4687 | GreSL | 2355591 | 10/09/2016 | 13:00:00 | Kerkyra | Platanias | poorchance | poorchance | Platanias | Open Play | 42:39:00 | O. Gnjatic | -9 | 118 | Left | 1 | 1 | 3 | 1 | Missed | G. Manousos | Open Play Pass | 77 | 92 | - | - |
4 | 4688 | GreSL | 2355591 | 10/09/2016 | 13:00:00 | Kerkyra | Platanias | goodchance | goodchance | Kerkyra | Open Play | 40:46:00 | D. Epstein | 42 | 15 | Left | 2 | 5 | 2 | 0 | Saved | - | - | - | - | - | - |
# Display the last 5 rows of the raw DataFrame, df_stratabet_chances_raw
df_stratabet_chances_raw.tail()
index | competition | gsm_id | kickoffDate | kickoffTime | hometeam_team1 | awayteam_team2 | icon | chanceRating | team | type | time | player | location_x | location_y | bodyPart | shotQuality | defPressure | numDefPlayers | numAttPlayers | outcome | primaryPlayer | primaryType | primaryLocation_x | primaryLocation_y | secondaryPlayer | secondaryType | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
119143 | 33422 | EngPr | 2463091 | 12/03/2018 | 20:00:00 | Stoke City | Manchester City | goodchance | goodchance | Manchester City | Open Play | 41:13:00 | L. Sane | 45 | 42 | Left | 2 | 2 | 2 | 0 | Missed | K. De Bruyne | Cross High | -69 | 48 | Gabriel Jesus | Open Play Pass |
119144 | 33424 | EngPr | 2463091 | 12/03/2018 | 20:00:00 | Stoke City | Manchester City | greatchance | greatchance | Manchester City | Open Play | 56:58:00 | R. Sterling | -18 | 13 | Left | 3 | 4 | 2 | 0 | Saved | David Silva | Open Play Pass | -36 | 86 | L. Sane | Open Play Pass |
119145 | 33418 | EngPr | 2463091 | 12/03/2018 | 20:00:00 | Stoke City | Manchester City | poorchance | poorchance | Manchester City | Open Play | 25:50:00 | Fernandinho | 3 | 97 | Left | 2 | 1 | 3 | 2 | Missed | - | - | - | - | - | - |
119146 | 33426 | EngPr | 2463091 | 12/03/2018 | 20:00:00 | Stoke City | Manchester City | poorchance | poorchance | Manchester City | Open Play | 57:48:00 | R. Sterling | 31 | 92 | Right | 2 | 1 | 4 | 1 | Defended | L. Sane | Open Play Pass | -7 | 86 | K. De Bruyne | Open Play Pass |
119147 | 33427 | EngPr | 2463091 | 12/03/2018 | 20:00:00 | Stoke City | Manchester City | poorchance | poorchance | Manchester City | Open Play | 58:29:00 | L. Sane | -27 | 107 | Left | 2 | 1 | 5 | 3 | Missed | R. Sterling | Open Play Pass | 17 | 88 | David Silva | Open Play Pass |
# Print the shape of the raw DataFrame, df_stratabet_chances_raw
print(df_stratabet_chances_raw.shape)
(119148, 27)
# Print the column names of the raw DataFrame, df_stratabet_chances_raw
print(df_stratabet_chances_raw.columns)
Index(['index', 'competition', 'gsm_id', 'kickoffDate', 'kickoffTime', 'hometeam_team1', 'awayteam_team2', 'icon', 'chanceRating', 'team', 'type', 'time', 'player', 'location_x', 'location_y', 'bodyPart', 'shotQuality', 'defPressure', 'numDefPlayers', 'numAttPlayers', 'outcome', 'primaryPlayer', 'primaryType', 'primaryLocation_x', 'primaryLocation_y', 'secondaryPlayer', 'secondaryType'], dtype='object')
The dataset has six features (columns). Full details of these attributes can be found in the Data Dictionary.
# Data types of the features of the raw DataFrame, df_stratabet_chances_raw
df_stratabet_chances_raw .dtypes
index int64 competition object gsm_id int64 kickoffDate object kickoffTime object hometeam_team1 object awayteam_team2 object icon object chanceRating object team object type object time object player object location_x object location_y object bodyPart object shotQuality object defPressure object numDefPlayers object numAttPlayers object outcome object primaryPlayer object primaryType object primaryLocation_x object primaryLocation_y object secondaryPlayer object secondaryType object dtype: object
All six of the columns have the object data type. Full details of these attributes and their data types can be found in the Data Dictionary.
# Info for the raw DataFrame, df_stratabet_chances_raw
df_stratabet_chances_raw.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 119148 entries, 0 to 119147 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 index 119148 non-null int64 1 competition 119148 non-null object 2 gsm_id 119148 non-null int64 3 kickoffDate 119148 non-null object 4 kickoffTime 119148 non-null object 5 hometeam_team1 119148 non-null object 6 awayteam_team2 119148 non-null object 7 icon 119148 non-null object 8 chanceRating 119148 non-null object 9 team 119148 non-null object 10 type 119148 non-null object 11 time 119148 non-null object 12 player 119148 non-null object 13 location_x 119148 non-null object 14 location_y 119148 non-null object 15 bodyPart 119148 non-null object 16 shotQuality 117618 non-null object 17 defPressure 119148 non-null object 18 numDefPlayers 119148 non-null object 19 numAttPlayers 119148 non-null object 20 outcome 119148 non-null object 21 primaryPlayer 119148 non-null object 22 primaryType 119148 non-null object 23 primaryLocation_x 119148 non-null object 24 primaryLocation_y 119148 non-null object 25 secondaryPlayer 119148 non-null object 26 secondaryType 119148 non-null object dtypes: int64(2), object(25) memory usage: 24.5+ MB
# Description of the raw DataFrame, df_stratabet_raw, showing some summary statistics for each numberical column in the DataFrame
df_stratabet_chances_raw.describe()
index | gsm_id | |
---|---|---|
count | 119148.000000 | 1.191480e+05 |
mean | 9941.664501 | 2.380177e+06 |
std | 17678.010794 | 1.127321e+05 |
min | 0.000000 | 2.237445e+06 |
25% | 1580.000000 | 2.247140e+06 |
50% | 3387.000000 | 2.404032e+06 |
75% | 6534.000000 | 2.467501e+06 |
max | 80263.000000 | 2.701477e+06 |
# Plot visualisation of the missing values for each feature of the raw DataFrame, df_stratabet_chances_raw
msno.matrix(df_stratabet_chances_raw, figsize = (30, 7))
<matplotlib.axes._subplots.AxesSubplot at 0x1a1d73a810>
# Counts of missing values
tm_null_value_stats = df_stratabet_chances_raw.isnull().sum(axis=0)
tm_null_value_stats[tm_null_value_stats != 0]
shotQuality 1530 dtype: int64
The visualisation shows us very quickly that there a few missing values in the shotQuality
column, but otherwise the dataset is complete. This data is now ready for Data Engineering.
df_stratabet_chances_raw.to_csv(data_dir_stratabet + '/raw/chances/' + 'stratabet_chances_all.csv', index=None, header=True)
The StrataBet Events dataset has twelve features (columns) with the following definitions and data types:
Feature | Data type |
---|---|
eventId |
int64 |
subEventName |
object |
tags |
object |
playerId |
int64 |
positions |
object |
matchId |
int64 |
eventName |
object |
teamId |
int64 |
matchPeriod |
object |
eventSec |
float64 |
subEventId |
object |
id |
int64 |
# Combine individual csv files to form one DataFrame, df_raw, using glob
lst_files_key_entries = glob.glob(data_dir_stratabet + '/raw/key_entries/individual_competitions' + "/*.csv") # Creates a list of all csv files
li = [] # pd.concat takes a list of DataFrames as an argument
for filename in lst_files_key_entries:
df_raw_temp = pd.read_csv(filename, index_col=None, header=0)
li.append(df_raw_temp)
df_stratabet_key_entries_raw = pd.concat(li, axis=0, ignore_index=True) # ignore_index=True as we don't want pandas to try an align row indexes
Let's quality of the dataset by looking first and last rows in pandas using the head() and tail() methods.
# Display the first 5 rows of the raw DataFrame, df_stratabet_key_entries_raw
df_stratabet_key_entries_raw.head()
True | index | competition | gsm_id | kickoffDate | kickoffTime | hometeam_team1 | awayteam_team2 | team | keyentryArea | keyentryType | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 153564 | 9570 | ScoPr | 2242424 | 2016-08-06 | 11:30:00 | Rangers | Hamilton Academical | Rangers | Right | Pass |
1 | 153565 | 9571 | ScoPr | 2242424 | 2016-08-06 | 11:30:00 | Rangers | Hamilton Academical | Rangers | Box | Pass |
2 | 153566 | 9572 | ScoPr | 2242424 | 2016-08-06 | 11:30:00 | Rangers | Hamilton Academical | Rangers | Right | Pass |
3 | 153567 | 9573 | ScoPr | 2242424 | 2016-08-06 | 11:30:00 | Rangers | Hamilton Academical | Hamilton Academical | Right | Pass |
4 | 153568 | 9574 | ScoPr | 2242424 | 2016-08-06 | 11:30:00 | Rangers | Hamilton Academical | Rangers | Box | Run |
# Display the last 5 rows of the raw DataFrame, df_stratabet_key_entries_raw
df_stratabet_key_entries_raw.tail()
True | index | competition | gsm_id | kickoffDate | kickoffTime | hometeam_team1 | awayteam_team2 | team | keyentryArea | keyentryType | |
---|---|---|---|---|---|---|---|---|---|---|---|
195682 | 111281 | 7814 | GreSL | 2701477 | 2018-03-11 | 15:15:00 | Lamia | Levadiakos | Levadiakos | Right | Turnover |
195683 | 111282 | 7815 | GreSL | 2701477 | 2018-03-11 | 15:15:00 | Lamia | Levadiakos | Lamia | Left | Turnover |
195684 | 111283 | 7816 | GreSL | 2701477 | 2018-03-11 | 15:15:00 | Lamia | Levadiakos | Lamia | Right | Pass |
195685 | 111284 | 7817 | GreSL | 2701477 | 2018-03-11 | 15:15:00 | Lamia | Levadiakos | Lamia | Box | Pass |
195686 | 111285 | 7818 | GreSL | 2701477 | 2018-03-11 | 15:15:00 | Lamia | Levadiakos | Levadiakos | Box | Run |
# Print the shape of the raw DataFrame, df_stratabet_key_entries_raw
print(df_stratabet_key_entries_raw.shape)
(195687, 11)
# Print the column names of the raw DataFrame, df_stratabet_key_entries_raw
print(df_stratabet_key_entries_raw.columns)
Index(['True', 'index', 'competition', 'gsm_id', 'kickoffDate', 'kickoffTime', 'hometeam_team1', 'awayteam_team2', 'team', 'keyentryArea', 'keyentryType'], dtype='object')
The dataset has six features (columns). Full details of these attributes can be found in the Data Dictionary.
# Info for the raw DataFrame, df_stratabet_key_entries_raw
df_stratabet_key_entries_raw.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 195687 entries, 0 to 195686 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 True 195687 non-null int64 1 index 195687 non-null int64 2 competition 195687 non-null object 3 gsm_id 195687 non-null int64 4 kickoffDate 195687 non-null object 5 kickoffTime 195687 non-null object 6 hometeam_team1 195687 non-null object 7 awayteam_team2 195687 non-null object 8 team 195687 non-null object 9 keyentryArea 195687 non-null object 10 keyentryType 195687 non-null object dtypes: int64(3), object(8) memory usage: 16.4+ MB
# Description of the raw DataFrame, df_stratabet_key_entries_raw, showing some summary statistics for each numberical column in the DataFrame
df_stratabet_key_entries_raw.describe()
True | index | gsm_id | |
---|---|---|---|
count | 195687.000000 | 195687.000000 | 1.956870e+05 |
mean | 97843.000000 | 6453.847414 | 2.354363e+06 |
std | 56490.115401 | 5040.288373 | 1.125099e+05 |
min | 0.000000 | 0.000000 | 2.237445e+06 |
25% | 48921.500000 | 2675.000000 | 2.246823e+06 |
50% | 97843.000000 | 5401.000000 | 2.360808e+06 |
75% | 146764.500000 | 8928.000000 | 2.467223e+06 |
max | 195686.000000 | 25384.000000 | 2.701477e+06 |
# Plot visualisation of the missing values for each feature of the raw DataFrame, df_stratabet_key_entries_raw
msno.matrix(df_stratabet_key_entries_raw, figsize = (30, 7))
<matplotlib.axes._subplots.AxesSubplot at 0x1a1e4a8390>
# Counts of missing values
tm_null_value_stats = df_stratabet_key_entries_raw.isnull().sum(axis=0)
tm_null_value_stats[tm_null_value_stats != 0]
Series([], dtype: int64)
The visualisation shows us very quickly that there a few missing values in the shotQuality
column, but otherwise the dataset is complete. This data is now ready for Data Engineering.
df_stratabet_key_entries_raw.to_csv(data_dir_stratabet + '/raw/key_entries/' + 'stratabet_key_entries_all.csv', index=None, header=True)
The StrataBet Events dataset has twelve features (columns) with the following definitions and data types:
Feature | Data type |
---|---|
eventId |
int64 |
subEventName |
object |
tags |
object |
playerId |
int64 |
positions |
object |
matchId |
int64 |
eventName |
object |
teamId |
int64 |
matchPeriod |
object |
eventSec |
float64 |
subEventId |
object |
id |
int64 |
# Combine individual csv files to form one DataFrame, df_raw, using glob
lst_files_key_entries = glob.glob(data_dir_stratabet + '/raw/key_entries/individual_competitions' + "/*.csv") # Creates a list of all csv files
li = [] # pd.concat takes a list of DataFrames as an argument
for filename in lst_files_key_entries:
df_raw_temp = pd.read_csv(filename, index_col=None, header=0)
li.append(df_raw_temp)
df_stratabet_key_entries_raw = pd.concat(li, axis=0, ignore_index=True) # ignore_index=True as we don't want pandas to try an align row indexes
Let's quality of the dataset by looking first and last rows in pandas using the head() and tail() methods.
# Display the first 5 rows of the raw DataFrame, df_stratabet_key_entries_raw
df_stratabet_key_entries_raw.head()
True | index | competition | gsm_id | kickoffDate | kickoffTime | hometeam_team1 | awayteam_team2 | team | keyentryArea | keyentryType | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 153564 | 9570 | ScoPr | 2242424 | 2016-08-06 | 11:30:00 | Rangers | Hamilton Academical | Rangers | Right | Pass |
1 | 153565 | 9571 | ScoPr | 2242424 | 2016-08-06 | 11:30:00 | Rangers | Hamilton Academical | Rangers | Box | Pass |
2 | 153566 | 9572 | ScoPr | 2242424 | 2016-08-06 | 11:30:00 | Rangers | Hamilton Academical | Rangers | Right | Pass |
3 | 153567 | 9573 | ScoPr | 2242424 | 2016-08-06 | 11:30:00 | Rangers | Hamilton Academical | Hamilton Academical | Right | Pass |
4 | 153568 | 9574 | ScoPr | 2242424 | 2016-08-06 | 11:30:00 | Rangers | Hamilton Academical | Rangers | Box | Run |
# Display the last 5 rows of the raw DataFrame, df_stratabet_key_entries_raw
df_stratabet_key_entries_raw.tail()
True | index | competition | gsm_id | kickoffDate | kickoffTime | hometeam_team1 | awayteam_team2 | team | keyentryArea | keyentryType | |
---|---|---|---|---|---|---|---|---|---|---|---|
195682 | 111281 | 7814 | GreSL | 2701477 | 2018-03-11 | 15:15:00 | Lamia | Levadiakos | Levadiakos | Right | Turnover |
195683 | 111282 | 7815 | GreSL | 2701477 | 2018-03-11 | 15:15:00 | Lamia | Levadiakos | Lamia | Left | Turnover |
195684 | 111283 | 7816 | GreSL | 2701477 | 2018-03-11 | 15:15:00 | Lamia | Levadiakos | Lamia | Right | Pass |
195685 | 111284 | 7817 | GreSL | 2701477 | 2018-03-11 | 15:15:00 | Lamia | Levadiakos | Lamia | Box | Pass |
195686 | 111285 | 7818 | GreSL | 2701477 | 2018-03-11 | 15:15:00 | Lamia | Levadiakos | Levadiakos | Box | Run |
# Print the shape of the raw DataFrame, df_stratabet_key_entries_raw
print(df_stratabet_key_entries_raw.shape)
(195687, 11)
# Print the column names of the raw DataFrame, df_stratabet_key_entries_raw
print(df_stratabet_key_entries_raw.columns)
Index(['True', 'index', 'competition', 'gsm_id', 'kickoffDate', 'kickoffTime', 'hometeam_team1', 'awayteam_team2', 'team', 'keyentryArea', 'keyentryType'], dtype='object')
The dataset has six features (columns). Full details of these attributes can be found in the Data Dictionary.
# Info for the raw DataFrame, df_stratabet_key_entries_raw
df_stratabet_key_entries_raw.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 195687 entries, 0 to 195686 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 True 195687 non-null int64 1 index 195687 non-null int64 2 competition 195687 non-null object 3 gsm_id 195687 non-null int64 4 kickoffDate 195687 non-null object 5 kickoffTime 195687 non-null object 6 hometeam_team1 195687 non-null object 7 awayteam_team2 195687 non-null object 8 team 195687 non-null object 9 keyentryArea 195687 non-null object 10 keyentryType 195687 non-null object dtypes: int64(3), object(8) memory usage: 16.4+ MB
# Description of the raw DataFrame, df_stratabet_key_entries_raw, showing some summary statistics for each numberical column in the DataFrame
df_stratabet_key_entries_raw.describe()
True | index | gsm_id | |
---|---|---|---|
count | 195687.000000 | 195687.000000 | 1.956870e+05 |
mean | 97843.000000 | 6453.847414 | 2.354363e+06 |
std | 56490.115401 | 5040.288373 | 1.125099e+05 |
min | 0.000000 | 0.000000 | 2.237445e+06 |
25% | 48921.500000 | 2675.000000 | 2.246823e+06 |
50% | 97843.000000 | 5401.000000 | 2.360808e+06 |
75% | 146764.500000 | 8928.000000 | 2.467223e+06 |
max | 195686.000000 | 25384.000000 | 2.701477e+06 |
# Plot visualisation of the missing values for each feature of the raw DataFrame, df_stratabet_key_entries_raw
msno.matrix(df_stratabet_key_entries_raw, figsize = (30, 7))
<matplotlib.axes._subplots.AxesSubplot at 0x108796a90>
# Counts of missing values
tm_null_value_stats = df_stratabet_key_entries_raw.isnull().sum(axis=0)
tm_null_value_stats[tm_null_value_stats != 0]
Series([], dtype: int64)
The visualisation shows us very quickly that there a few missing values in the shotQuality
column, but otherwise the dataset is complete. This data is now ready for Data Engineering.
df_stratabet_key_entries_raw.to_csv(data_dir_stratabet + '/raw/key_entries/' + 'stratabet_key_entries_all.csv', index=None, header=True)
The StrataBet Events dataset has twelve features (columns) with the following definitions and data types:
Feature | Data type |
---|---|
eventId |
int64 |
subEventName |
object |
tags |
object |
playerId |
int64 |
positions |
object |
matchId |
int64 |
eventName |
object |
teamId |
int64 |
matchPeriod |
object |
eventSec |
float64 |
subEventId |
object |
id |
int64 |
# Combine individual csv files to form one DataFrame, df_raw, using glob
lst_files_key_entries = glob.glob(data_dir_stratabet + '/raw/key_entries/individual_competitions' + "/*.csv") # Creates a list of all csv files
li = [] # pd.concat takes a list of DataFrames as an argument
for filename in lst_files_key_entries:
df_raw_temp = pd.read_csv(filename, index_col=None, header=0)
li.append(df_raw_temp)
df_stratabet_key_entries_raw = pd.concat(li, axis=0, ignore_index=True) # ignore_index=True as we don't want pandas to try an align row indexes
Let's quality of the dataset by looking first and last rows in pandas using the head() and tail() methods.
# Display the first 5 rows of the raw DataFrame, df_stratabet_key_entries_raw
df_stratabet_key_entries_raw.head()
True | index | competition | gsm_id | kickoffDate | kickoffTime | hometeam_team1 | awayteam_team2 | team | keyentryArea | keyentryType | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 153564 | 9570 | ScoPr | 2242424 | 2016-08-06 | 11:30:00 | Rangers | Hamilton Academical | Rangers | Right | Pass |
1 | 153565 | 9571 | ScoPr | 2242424 | 2016-08-06 | 11:30:00 | Rangers | Hamilton Academical | Rangers | Box | Pass |
2 | 153566 | 9572 | ScoPr | 2242424 | 2016-08-06 | 11:30:00 | Rangers | Hamilton Academical | Rangers | Right | Pass |
3 | 153567 | 9573 | ScoPr | 2242424 | 2016-08-06 | 11:30:00 | Rangers | Hamilton Academical | Hamilton Academical | Right | Pass |
4 | 153568 | 9574 | ScoPr | 2242424 | 2016-08-06 | 11:30:00 | Rangers | Hamilton Academical | Rangers | Box | Run |
# Display the last 5 rows of the raw DataFrame, df_stratabet_key_entries_raw
df_stratabet_key_entries_raw.tail()
True | index | competition | gsm_id | kickoffDate | kickoffTime | hometeam_team1 | awayteam_team2 | team | keyentryArea | keyentryType | |
---|---|---|---|---|---|---|---|---|---|---|---|
195682 | 111281 | 7814 | GreSL | 2701477 | 2018-03-11 | 15:15:00 | Lamia | Levadiakos | Levadiakos | Right | Turnover |
195683 | 111282 | 7815 | GreSL | 2701477 | 2018-03-11 | 15:15:00 | Lamia | Levadiakos | Lamia | Left | Turnover |
195684 | 111283 | 7816 | GreSL | 2701477 | 2018-03-11 | 15:15:00 | Lamia | Levadiakos | Lamia | Right | Pass |
195685 | 111284 | 7817 | GreSL | 2701477 | 2018-03-11 | 15:15:00 | Lamia | Levadiakos | Lamia | Box | Pass |
195686 | 111285 | 7818 | GreSL | 2701477 | 2018-03-11 | 15:15:00 | Lamia | Levadiakos | Levadiakos | Box | Run |
# Print the shape of the raw DataFrame, df_stratabet_key_entries_raw
print(df_stratabet_key_entries_raw.shape)
(195687, 11)
# Print the column names of the raw DataFrame, df_stratabet_key_entries_raw
print(df_stratabet_key_entries_raw.columns)
Index(['True', 'index', 'competition', 'gsm_id', 'kickoffDate', 'kickoffTime', 'hometeam_team1', 'awayteam_team2', 'team', 'keyentryArea', 'keyentryType'], dtype='object')
The dataset has six features (columns). Full details of these attributes can be found in the Data Dictionary.
# Info for the raw DataFrame, df_stratabet_key_entries_raw
df_stratabet_key_entries_raw.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 195687 entries, 0 to 195686 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 True 195687 non-null int64 1 index 195687 non-null int64 2 competition 195687 non-null object 3 gsm_id 195687 non-null int64 4 kickoffDate 195687 non-null object 5 kickoffTime 195687 non-null object 6 hometeam_team1 195687 non-null object 7 awayteam_team2 195687 non-null object 8 team 195687 non-null object 9 keyentryArea 195687 non-null object 10 keyentryType 195687 non-null object dtypes: int64(3), object(8) memory usage: 16.4+ MB
# Description of the raw DataFrame, df_stratabet_key_entries_raw, showing some summary statistics for each numberical column in the DataFrame
df_stratabet_key_entries_raw.describe()
True | index | gsm_id | |
---|---|---|---|
count | 195687.000000 | 195687.000000 | 1.956870e+05 |
mean | 97843.000000 | 6453.847414 | 2.354363e+06 |
std | 56490.115401 | 5040.288373 | 1.125099e+05 |
min | 0.000000 | 0.000000 | 2.237445e+06 |
25% | 48921.500000 | 2675.000000 | 2.246823e+06 |
50% | 97843.000000 | 5401.000000 | 2.360808e+06 |
75% | 146764.500000 | 8928.000000 | 2.467223e+06 |
max | 195686.000000 | 25384.000000 | 2.701477e+06 |
# Plot visualisation of the missing values for each feature of the raw DataFrame, df_stratabet_key_entries_raw
msno.matrix(df_stratabet_key_entries_raw, figsize = (30, 7))
<matplotlib.axes._subplots.AxesSubplot at 0x1a1d9a0fd0>
# Counts of missing values
tm_null_value_stats = df_stratabet_key_entries_raw.isnull().sum(axis=0)
tm_null_value_stats[tm_null_value_stats != 0]
Series([], dtype: int64)
The visualisation shows us very quickly that there a few missing values in the shotQuality
column, but otherwise the dataset is complete. This data is now ready for Data Engineering.
df_stratabet_key_entries_raw.to_csv(data_dir_stratabet + '/raw/key_entries/' + 'stratabet_key_entries_all.csv', index=None, header=True)
# Assign Raw DataFrame to Engineered DataFrame
df_stratabet_chances = df_stratabet_chances_raw
df_stratabet_chances.head()
index | competition | gsm_id | kickoffDate | kickoffTime | hometeam_team1 | awayteam_team2 | icon | chanceRating | team | type | time | player | location_x | location_y | bodyPart | shotQuality | defPressure | numDefPlayers | numAttPlayers | outcome | primaryPlayer | primaryType | primaryLocation_x | primaryLocation_y | secondaryPlayer | secondaryType | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4684 | GreSL | 2355591 | 10/09/2016 | 13:00:00 | Kerkyra | Platanias | poorchance | poorchance | Kerkyra | Open Play | 24:43:00 | D. Epstein | 81 | 48 | Left | 3 | 5 | 2 | 0 | Saved | - | - | - | - | - | - |
1 | 4685 | GreSL | 2355591 | 10/09/2016 | 13:00:00 | Kerkyra | Platanias | goodchance | goodchance | Kerkyra | Open Play | 45:29:00 | D. Epstein | 27 | 60 | Left | 2 | 2 | 2 | 0 | Defended | Thuram | Open Play Pass | -29 | 82 | - | - |
2 | 4686 | GreSL | 2355591 | 10/09/2016 | 13:00:00 | Kerkyra | Platanias | poorchance | poorchance | Kerkyra | Open Play | 44:34:00 | S. Siontis | 23 | 117 | Right | 2 | 1 | 4 | 1 | Missed | - | - | - | - | - | - |
3 | 4687 | GreSL | 2355591 | 10/09/2016 | 13:00:00 | Kerkyra | Platanias | poorchance | poorchance | Platanias | Open Play | 42:39:00 | O. Gnjatic | -9 | 118 | Left | 1 | 1 | 3 | 1 | Missed | G. Manousos | Open Play Pass | 77 | 92 | - | - |
4 | 4688 | GreSL | 2355591 | 10/09/2016 | 13:00:00 | Kerkyra | Platanias | goodchance | goodchance | Kerkyra | Open Play | 40:46:00 | D. Epstein | 42 | 15 | Left | 2 | 5 | 2 | 0 | Saved | - | - | - | - | - | - |
df_stratabet_chances['Full_Fixture_Date'] = df_stratabet_chances['kickoffDate'].astype(str) + ' ' + df_stratabet_chances['hometeam_team1'].astype(str) + ' vs. ' + df_stratabet_chances['awayteam_team2'].astype(str)
df_stratabet_chances['location_x'] = df_stratabet_chances['location_x'].apply(pd.to_numeric, errors='coerce')
df_stratabet_chances['location_y'] = df_stratabet_chances['location_y'].apply(pd.to_numeric, errors='coerce')
df_stratabet_chances['primaryLocation_x'] = df_stratabet_chances['primaryLocation_x'].apply(pd.to_numeric, errors='coerce')
df_stratabet_chances['primaryLocation_y'] = df_stratabet_chances['primaryLocation_y'].apply(pd.to_numeric, errors='coerce')
df_stratabet_chances['shotQuality'] = df_stratabet_chances['shotQuality'].apply(pd.to_numeric, errors='coerce')
df_stratabet_chances['defPressure'] = df_stratabet_chances['defPressure'].apply(pd.to_numeric, errors='coerce')
df_stratabet_chances['numDefPlayers'] = df_stratabet_chances['numDefPlayers'].apply(pd.to_numeric, errors='coerce')
df_stratabet_chances['numAttPlayers'] = df_stratabet_chances['numAttPlayers'].apply(pd.to_numeric, errors='coerce')
From the documenation, the XY coordinates in addition to grid locations, with (0,0) representing the absolute centre of the defended goal line. The pitch length runs from 0 to 420, while the width runs from 136 to -136 (left to right).
Some key reference points of note:
df_stratabet_chances['location_y_120'] = ((df_stratabet_chances['location_y'] / 480) * 120).round(2)
df_stratabet_chances['location_x_80'] = (((df_stratabet_chances['location_x'] + 136) / 272) * 80).round(2)
df_stratabet_chances['primaryLocation_y_120'] = ((df_stratabet_chances['primaryLocation_y'] / 480) * 120).round(2)
df_stratabet_chances['primaryLocation_x_80'] = (((df_stratabet_chances['primaryLocation_x'] + 136) / 272) * 80).round(2)
df_stratabet_chances.head()
index | competition | gsm_id | kickoffDate | kickoffTime | hometeam_team1 | awayteam_team2 | icon | chanceRating | team | type | time | player | location_x | location_y | bodyPart | shotQuality | defPressure | numDefPlayers | numAttPlayers | outcome | primaryPlayer | primaryType | primaryLocation_x | primaryLocation_y | secondaryPlayer | secondaryType | Full_Fixture_Date | location_y_120 | location_x_80 | primaryLocation_y_120 | primaryLocation_x_80 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4684 | GreSL | 2355591 | 10/09/2016 | 13:00:00 | Kerkyra | Platanias | poorchance | poorchance | Kerkyra | Open Play | 24:43:00 | D. Epstein | 81.0 | 48.0 | Left | 3.0 | 5.0 | 2.0 | 0.0 | Saved | - | - | NaN | NaN | - | - | 10/09/2016 Kerkyra vs. Platanias | 12.00 | 63.82 | NaN | NaN |
1 | 4685 | GreSL | 2355591 | 10/09/2016 | 13:00:00 | Kerkyra | Platanias | goodchance | goodchance | Kerkyra | Open Play | 45:29:00 | D. Epstein | 27.0 | 60.0 | Left | 2.0 | 2.0 | 2.0 | 0.0 | Defended | Thuram | Open Play Pass | -29.0 | 82.0 | - | - | 10/09/2016 Kerkyra vs. Platanias | 15.00 | 47.94 | 20.5 | 31.47 |
2 | 4686 | GreSL | 2355591 | 10/09/2016 | 13:00:00 | Kerkyra | Platanias | poorchance | poorchance | Kerkyra | Open Play | 44:34:00 | S. Siontis | 23.0 | 117.0 | Right | 2.0 | 1.0 | 4.0 | 1.0 | Missed | - | - | NaN | NaN | - | - | 10/09/2016 Kerkyra vs. Platanias | 29.25 | 46.76 | NaN | NaN |
3 | 4687 | GreSL | 2355591 | 10/09/2016 | 13:00:00 | Kerkyra | Platanias | poorchance | poorchance | Platanias | Open Play | 42:39:00 | O. Gnjatic | -9.0 | 118.0 | Left | 1.0 | 1.0 | 3.0 | 1.0 | Missed | G. Manousos | Open Play Pass | 77.0 | 92.0 | - | - | 10/09/2016 Kerkyra vs. Platanias | 29.50 | 37.35 | 23.0 | 62.65 |
4 | 4688 | GreSL | 2355591 | 10/09/2016 | 13:00:00 | Kerkyra | Platanias | goodchance | goodchance | Kerkyra | Open Play | 40:46:00 | D. Epstein | 42.0 | 15.0 | Left | 2.0 | 5.0 | 2.0 | 0.0 | Saved | - | - | NaN | NaN | - | - | 10/09/2016 Kerkyra vs. Platanias | 3.75 | 52.35 | NaN | NaN |
df_stratabet_chances['location_y_120_inv'] = 120 - ((df_stratabet_chances['location_y'] / 480) * 120).round(2)
df_stratabet_chances['location_x_80_inv'] = 80 - (((df_stratabet_chances['location_x'] + 136) / 272) * 80).round(2)
df_stratabet_chances['primaryLocation_y_120_inv'] = 120 - ((df_stratabet_chances['primaryLocation_y'] / 480) * 120).round(2)
df_stratabet_chances['primaryLocation_x_80_inv'] = 80 - (((df_stratabet_chances['primaryLocation_x'] + 136) / 272) * 80).round(2)
df_stratabet_chances = df_stratabet_chances.rename(columns = {'index':'id'})
df_stratabet_chances['Season'] = 'TO ADD'
# Write code here to add seasons for each match per 'competition' and 'kickoffDate' - varies per league
# Create DataFrame of Teams and Leagues
## Select columns of interest
cols = ['competition', 'team']
df_stratabet_teams_leagues = df_stratabet_chances[cols]
## Drop duplicates
df_stratabet_teams_leagues = df_stratabet_teams_leagues.drop_duplicates()
## Order columns by league and team
df_stratabet_teams_leagues = df_stratabet_teams_leagues.sort_values(['competition', 'team'], ascending=[True, True])
## Display DataFrame
df_stratabet_teams_leagues.head()
competition | team | |
---|---|---|
77137 | AusAL | Adelaide United |
77053 | AusAL | Brisbane Roar |
77104 | AusAL | Central Coast Mariners |
77066 | AusAL | Melbourne City |
77052 | AusAL | Melbourne Victory |
# Export DataFrame
df_stratabet_teams_leagues.to_csv(data_dir_stratabet + '/reference/teams_leagues.csv', index=None, header=True)
df_stratabet_chances.to_csv(data_dir_stratabet + '/engineered/chances/stratabet_chances_all.csv', index=None, header=True)
df_stratabet_chances.to_csv(data_dir + '/export/stratabet_events_chances.csv', index=None, header=True)
df_stratabet_chances.head()
id | competition | gsm_id | kickoffDate | kickoffTime | hometeam_team1 | awayteam_team2 | icon | chanceRating | team | type | time | player | location_x | location_y | bodyPart | shotQuality | defPressure | numDefPlayers | numAttPlayers | outcome | primaryPlayer | primaryType | primaryLocation_x | primaryLocation_y | secondaryPlayer | secondaryType | Full_Fixture_Date | location_y_120 | location_x_80 | primaryLocation_y_120 | primaryLocation_x_80 | location_y_120_inv | location_x_80_inv | primaryLocation_y_120_inv | primaryLocation_x_80_inv | Season | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 4684 | GreSL | 2355591 | 10/09/2016 | 13:00:00 | Kerkyra | Platanias | poorchance | poorchance | Kerkyra | Open Play | 24:43:00 | D. Epstein | 81.0 | 48.0 | Left | 3.0 | 5.0 | 2.0 | 0.0 | Saved | - | - | NaN | NaN | - | - | 10/09/2016 Kerkyra vs. Platanias | 12.00 | 63.82 | NaN | NaN | 108.00 | 16.18 | NaN | NaN | TO ADD |
1 | 4685 | GreSL | 2355591 | 10/09/2016 | 13:00:00 | Kerkyra | Platanias | goodchance | goodchance | Kerkyra | Open Play | 45:29:00 | D. Epstein | 27.0 | 60.0 | Left | 2.0 | 2.0 | 2.0 | 0.0 | Defended | Thuram | Open Play Pass | -29.0 | 82.0 | - | - | 10/09/2016 Kerkyra vs. Platanias | 15.00 | 47.94 | 20.5 | 31.47 | 105.00 | 32.06 | 99.5 | 48.53 | TO ADD |
2 | 4686 | GreSL | 2355591 | 10/09/2016 | 13:00:00 | Kerkyra | Platanias | poorchance | poorchance | Kerkyra | Open Play | 44:34:00 | S. Siontis | 23.0 | 117.0 | Right | 2.0 | 1.0 | 4.0 | 1.0 | Missed | - | - | NaN | NaN | - | - | 10/09/2016 Kerkyra vs. Platanias | 29.25 | 46.76 | NaN | NaN | 90.75 | 33.24 | NaN | NaN | TO ADD |
3 | 4687 | GreSL | 2355591 | 10/09/2016 | 13:00:00 | Kerkyra | Platanias | poorchance | poorchance | Platanias | Open Play | 42:39:00 | O. Gnjatic | -9.0 | 118.0 | Left | 1.0 | 1.0 | 3.0 | 1.0 | Missed | G. Manousos | Open Play Pass | 77.0 | 92.0 | - | - | 10/09/2016 Kerkyra vs. Platanias | 29.50 | 37.35 | 23.0 | 62.65 | 90.50 | 42.65 | 97.0 | 17.35 | TO ADD |
4 | 4688 | GreSL | 2355591 | 10/09/2016 | 13:00:00 | Kerkyra | Platanias | goodchance | goodchance | Kerkyra | Open Play | 40:46:00 | D. Epstein | 42.0 | 15.0 | Left | 2.0 | 5.0 | 2.0 | 0.0 | Saved | - | - | NaN | NaN | - | - | 10/09/2016 Kerkyra vs. Platanias | 3.75 | 52.35 | NaN | NaN | 116.25 | 27.65 | NaN | NaN | TO ADD |
df_stratabet_chances.shape
(119148, 37)
# All code below here is old and needs to be sorted
# Add code here
# Add code here
# Add code here
# Old code from here
Label
column into seperate Fixture
and Score
columns¶# Break down 'Label' column into constituent parts - Fixtures, Score, Date, Home Goals, Away Goals, etc.
df_stratabet['fixture'] = df_stratabet['label'].str.split(', ').str[0]
df_stratabet['score_home_away'] = df_stratabet['label'].str.split(', ').str[1]
df_stratabet['team_home'] = df_stratabet['fixture'].str.split(' - ').str[0]
df_stratabet['team_away'] = df_stratabet['fixture'].str.split(' - ').str[1]
df_stratabet['goals_home'] = df_stratabet['score_home_away'].str.split(' - ').str[0]
df_stratabet['goals_away'] = df_stratabet['score_home_away'].str.split(' - ').str[1]
date
column into seperate date_isolated
and time_isolated
columns¶df_stratabet['date_isolated'] = df_stratabet['date'].str.split(' at').str[0]
df_stratabet['time_isolated'] = df_stratabet['date'].str.split(' at ').str[1]
df_stratabet['date_time_isolated'] = df_stratabet['date'].str.split(' GMT').str[0].str.replace(' at ', ' ', regex=True)
As this is a large dataset with >3mil rows, we will remove every column that is not required at this stage.
# Display columns
df_stratabet.columns
# df_stratabet = df_stratabet.drop(['tags', 'dateutc', 'wyId_x', 'label', 'date', 'referees', 'wyId_y', 'date_isolated', 'time_isolated', 'date_time_isolated'])
df_stratabet['full_fixture_date'] = df_stratabet['date_date'].astype(str) + ' ' + df_stratabet['team_home'].astype(str) + ' ' + df_stratabet['goals_home'].astype(str) + ' ' + ' v ' + ' ' + df_stratabet['goals_away'].astype(str) + ' ' + df_stratabet['team_away'].astype(str)
df_stratabet['season'] = '17/18'
Correctly order all the rows in the DataFrame by date, time, country, fixture, half, and time in the match. Important when looking at events and the following event e.g. is possession retains? Which player receives the pass, etc.
df_stratabet = df_stratabet.sort_values(['date_date', 'time_time', 'country', 'league_name', 'full_fixture_date', 'matchPeriod', 'eventSec'], ascending=[True, True, True, True, True, True, True])
followingPossession
column¶'teamIdNext' = following 'teamId'
df_stratabet['teamIdNext'] = df_stratabet['teamId'].shift(-1)
df_stratabet['teamNameNext'] = df_stratabet['teamName'].shift(-1)
df_stratabet['fullNameNext'] = df_stratabet['fullName'].shift(-1)
df_stratabet.head()
player2player
column¶df_stratabet['player2player'] = df_stratabet['fullName'] + ' - ' + df_stratabet['fullNameNext']
isPossessionRetained
column¶When teamId
is not followed by the same teamId
in the following row, possession is lost. We want to creat a column that stats this.
df_stratabet['isPossessionRetained'] = np.where(df_stratabet['teamId'] == df_stratabet['teamIdNext'], True, False)
# CODE HERE
lst_results = list(df_stratabet['full_fixture_date'].unique())
for i, g in df_stratabet.groupby('full_fixture_date'):
g.to_csv(data_dir_wyscout + '/engineered/individual_matches/{}.csv'.format(i), header=True, index_label=True)
df_stratabet.to_csv(data_dir_wyscout + '/engineered/combined/wyscout_events_big5_1718.csv', index=None, header=True)
# Select columns of interest
## Define columns
cols = ['season',
'date_time_timestamp',
'fixture',
'team_home',
'team_away',
'teamName',
'goals_home',
'goals_away',
'eventName',
'subEventName'
]
## Streamline DataFrame with columns of interest
df_stratabet_select = df_stratabet[cols]
##
df_stratabet_select['Opponent'] = np.where(df_stratabet_select['team_home'] == df_stratabet_select['teamName'], df_stratabet_select['team_away'], df_stratabet_select['team_home'])
#
## Group DataFrame and Aggregate on 'eventName'
df_stratabet_fixture_grouped = (df_stratabet_select
.groupby(['season', 'date_time_timestamp', 'fixture', 'teamName', 'Opponent', 'goals_home', 'goals_away', 'eventName'])
.agg({'eventName': ['count']})
)
## Drop level
df_stratabet_fixture_grouped.columns = df_stratabet_fixture_grouped.columns.droplevel(level=0)
## Reset index
df_stratabet_fixture_grouped = df_stratabet_fixture_grouped.reset_index()
## Rename columns
df_stratabet_fixture_grouped = df_stratabet_fixture_grouped.rename(columns={'season': 'Season',
'date_time_timestamp': 'Date',
'fixture': 'Fixture',
'teamName': 'Team',
'Opponent': 'Opponent',
'goals_home': 'Goals_Home',
'goals_away': 'Goals_Away',
'eventName': 'Event',
'count': 'Team_Value'
}
)
## Display DataFrame
df_stratabet_fixture_grouped.head()
# Select columns of interest
## Define columns
cols = ['Season',
'Date',
'Fixture',
'Team',
'Opponent',
'Event',
'Team_Value'
]
## Streamline DataFrame with columns of interest
df_stratabet_fixture_grouped_select = df_stratabet_fixture_grouped[cols]
# Join DataFrame to itself on 'Date', 'Fixture', 'Team'/'Opponent', and 'Event', to join Team and Opponent together
df_stratabet_fixture_grouped = pd.merge(df_stratabet_fixture_grouped, df_stratabet_fixture_grouped, how='left', left_on=['Season', 'Date', 'Fixture', 'Opponent', 'Event'], right_on = ['Season', 'Date', 'Fixture', 'Team', 'Event'])
# Clean Data
## Drop columns
df_stratabet_fixture_grouped = df_stratabet_fixture_grouped.drop(columns=['Team_y', 'Opponent_y', 'Goals_Home_y', 'Goals_Away_y'])
## Rename columns
df_stratabet_fixture_grouped = df_stratabet_fixture_grouped.rename(columns={'Season_x': 'Season',
'Team_x': 'Team',
'Opponent_x': 'Opponent',
'Goals_Home_x': 'Goals_Home',
'Goals_Away_x': 'Goals_Away',
'Team_Value_x': 'Team_Value',
'Team_Value_y': 'Opponent_Value',
}
)
## Replace null values with zeros
df_stratabet_fixture_grouped['Team_Value'] = df_stratabet_fixture_grouped['Team_Value'].replace(np.nan, 0)
df_stratabet_fixture_grouped['Opponent_Value'] = df_stratabet_fixture_grouped['Opponent_Value'].replace(np.nan, 0)
## Convert Opponent_Value' from Float64 to Int64 type
df_stratabet_fixture_grouped['Opponent_Value'] = df_stratabet_fixture_grouped['Opponent_Value'].astype('Int64')
## Display DataFrame
df_stratabet_fixture_grouped.head()
#
##
df_fixture_gw = (df_stratabet_fixture_grouped
.groupby(['Date', 'Team'])
.agg({'Team': ['nunique']})
)
##
df_fixture_gw.columns = df_fixture_gw.columns.droplevel(level=0)
##
df_fixture_gw = df_fixture_gw.reset_index()
##
df_fixture_gw = df_fixture_gw.rename(columns={'Date': 'Date',
'nunique': 'Gameweek',
}
)
## Groupby. See: https://stackoverflow.com/questions/18554920/pandas-aggregate-count-distinct
df_fixture_gw = (df_fixture_gw.groupby(['Team', 'Date']).sum()
.groupby(level=0).cumsum().reset_index()
)
## Display DataFrame
df_fixture_gw.head()
# Join DataFrame
df_stratabet_fixture_grouped = pd.merge(df_stratabet_fixture_grouped, df_fixture_gw, how='left', left_on=['Date', 'Team'], right_on = ['Date', 'Team'])
# Display DataFrame
df_stratabet_fixture_grouped.head(50)
df_stratabet_fixture_grouped.to_csv(data_dir_wyscout + '/engineered/combined/wyscout_aggregated_fixtures_big5_1718.csv', index=None, header=True)
# Group DataFrame by Team
##
df_stratabet_team_grouped = (df_stratabet_fixture_grouped
.groupby(['Team', 'Event'])
.agg({'Team_Value': ['sum'],
'Opponent_Value': ['sum']
}
)
)
##
df_stratabet_team_grouped.columns = df_stratabet_team_grouped.columns.droplevel(level=0)
##
df_stratabet_team_grouped = df_stratabet_team_grouped.reset_index()
## Rename columns
df_stratabet_team_grouped.columns = ['Team', 'Event', 'Team_Value', 'Opponent_Value']
## Display columns
df_stratabet_team_grouped.head()
df_stratabet_team_grouped.to_csv(data_dir_wyscout + '/engineered/combined/wyscout_aggregated_team_big5_1718.csv', index=None, header=True)
This notebook scrapes data for player valuations using Beautifulsoup from TransferMarkt using pandas for data maniuplation through DataFrames and Beautifulsoup for webscraping.
This article was written with the aid of StrataData, which is property of Stratagem Technologies. StrataData powers the StrataBet Sports Trading Platform, in addition to StrataBet Premium Recommendations.
*Visit my website EddWebster.com or my GitHub Repository for more projects. If you'd like to get in contact, my Twitter handle is @eddwebster and my email is: edd.j.webster@gmail.com.*