#!/usr/bin/env python # coding: utf-8 # # # Data Engineering of Opta City Analytics Data # ##### Notebook to engineer the aggregated match-by-match performance Excel data and F24 Event data available from Opta as part of the [#mcfcanalytics](https://web.archive.org/web/20120821021930/http://www.mcfc.co.uk/Home/The%20Club/MCFC%20Analytics) initiative # # ### By [Edd Webster](https://www.twitter.com/eddwebster) # Notebook first written: 02/12/2020
# Notebook last updated: 07/12/2020 # ![title](../../img/opta-mcfcanalytics.png) # Click [here](#section5) to jump straight to the Exploratory Data Analysis section and skip the [Task Brief](#section2), [Data Sources](#section3), and [Data Engineering](#section4) sections. Or click [here](#section6) to jump straight to the Conclusion. # ___ # # # ## Introduction # This notebook engineers the aggregated performance Excel data available from Opta as part of the [#mcfcanalytics](https://web.archive.org/web/20120821021930/http://www.mcfc.co.uk/Home/The%20Club/MCFC%20Analytics) initiative for the 11/12 Premier League season using [pandas](http://pandas.pydata.org/) for data manipulation through DataFrames. # # For more information about this notebook and the author, I'm available through all the following channels: # * [eddwebster.com](https://www.eddwebster.com/); # * edd.j.webster@gmail.com; # * [@eddwebster](https://www.twitter.com/eddwebster); # * [linkedin.com/in/eddwebster](https://www.linkedin.com/in/eddwebster/); # * [github/eddwebster](https://github.com/eddwebster/); # * [public.tableau.com/profile/edd.webster](https://public.tableau.com/profile/edd.webster); # * [kaggle.com/eddwebster](https://www.kaggle.com/eddwebster); and # * [hackerrank.com/eddwebster](https://www.hackerrank.com/eddwebster). # # ![title](../../img/fifa21eddwebsterbanner.png) # The accompanying GitHub repository for this notebook can be found [here](https://github.com/eddwebster/fifa-league) and a static version of this notebook can be found [here](https://nbviewer.jupyter.org/github/eddwebster/football_analytics/blob/master/notebooks/B%29%20Data%20Engineering/Opta%20%23mcfcanalytics%20PL%202011-2012.ipynb). # ___ # # # ## Notebook Contents # 1. [Notebook Dependencies](#section1)
# 2. [Project Brief](#section2)
# 3. [Data Sources](#section3)
# 1. [Introduction](#section3.1)
# 2. [Data Dictionary](#section3.2)
# 3. [Creating the DataFrame](#section3.3)
# 4. [Initial Data Handling](#section3.4)
# 5. [Export the Raw DataFrame](#section3.5)
# 4. [Data Engineering](#section4)
# 1. [Introduction](#section4.1)
# 2. [Columns of Interest](#section4.2)
# 3. [String Cleaning](#section4.3)
# 4. [Converting Data Types](#section4.4)
# 5. [Export the Engineered DataFrame](#section4.5)
# 5. [Exploratory Data Analysis (EDA)](#section5)
# 1. [...](#section5.1)
# 2. [...](#section5.2)
# 3. [...](#section5.3)
# 6. [Summary](#section6)
# 7. [Next Steps](#section7)
# 8. [Bibliography](#section8)
# ___ # # # ## 1. Notebook Dependencies # This notebook was written using [Python 3](https://docs.python.org/3.7/) and requires the following libraries: # * [`Jupyter notebooks`](https://jupyter.org/) for this notebook environment with which this project is presented; # * [`NumPy`](http://www.numpy.org/) for multidimensional array computing; # * [`pandas`](http://pandas.pydata.org/) for data analysis and manipulation; and # * [`matplotlib`](https://matplotlib.org/contents.html?v=20200411155018) for data visualisations; # # All packages used for this notebook except for BeautifulSoup can be obtained by downloading and installing the [Conda](https://anaconda.org/anaconda/conda) distribution, available on all platforms (Windows, Linux and Mac OSX). Step-by-step guides on how to install Anaconda can be found for Windows [here](https://medium.com/@GalarnykMichael/install-python-on-windows-anaconda-c63c7c3d1444) and Mac [here](https://medium.com/@GalarnykMichael/install-python-on-mac-anaconda-ccd9f2014072), as well as in the Anaconda documentation itself [here](https://docs.anaconda.com/anaconda/install/). # ### Import Libraries and Modules # In[142]: # Python ≥3.5 (ideally) import platform import sys, getopt assert sys.version_info >= (3, 5) import csv import xml.etree.ElementTree as et # Import Dependencies get_ipython().run_line_magic('matplotlib', 'inline') # Math Operations import numpy as np from math import pi # Datetime import datetime from datetime import date #from datetime import datetime as dt import time # Data Preprocessing import pandas as pd import os import re import random from io import BytesIO from pathlib import Path # Reading directories import glob import os # Working with JSON import json from pandas.io.json import json_normalize from ast import literal_eval # Data Visualisation import matplotlib as mpl import matplotlib.pyplot as plt import seaborn as sns plt.style.use('seaborn-whitegrid') import missingno as msno from tabulate import tabulate # Display in Jupyter from IPython.display import Image, Video, YouTubeVideo from IPython.core.display import HTML # Ignore Warnings import warnings warnings.filterwarnings(action="ignore", message="^internal gelsd") print('Setup Complete') # In[4]: # Python / module versions used here for reference print('Python: {}'.format(platform.python_version())) print('NumPy: {}'.format(np.__version__)) print('pandas: {}'.format(pd.__version__)) print('matplotlib: {}'.format(mpl.__version__)) print('Seaborn: {}'.format(sns.__version__)) # ### Defined Variables # In[5]: # Define today's date today = datetime.datetime.now().strftime('%d/%m/%Y').replace('/', '') # ### Defined Filepaths # In[144]: # Set up initial paths to subfolders base_dir = os.path.join('..', '..', ) data_dir = os.path.join(base_dir, 'data') data_dir_opta = os.path.join(base_dir, 'data', 'opta') data_dir_fixtures = os.path.join(base_dir, 'data', 'fixtures') img_dir = os.path.join(base_dir, 'img') fig_dir = os.path.join(base_dir, 'img', 'fig') video_dir = os.path.join(base_dir, 'video') # ### Notebook Settings # In[7]: pd.set_option('display.max_columns', None) # --- # # # ## 2. Project Brief # This Jupyter notebook explores how to parse publicly available Event data from [Opta](https://www.optasports.com/) using [pandas](http://pandas.pydata.org/) for data manipulation through DataFrames. # # In this analysis, we're looking specifically at [Premier League](https://www.premierleague.com/) for the 11/12 season. # # The engineered event data roduced in this notebook is exported to CSV. This data can be further analysed in Python, joined to other datasets, or explored using Tableau, PowerBI, Microsoft Excel. # --- # # # ## 3. Data Sources # ### 3.1. Introduction # [Opta](https://www.optasports.com/) are a football analytics and data company. # # The data needs to be imported as a DataFrame in the Data Sources section [Section 3](#section3) and cleaned in the Data Engineering section [Section 4](#section4). # # The [Opta](https://www.optasports.com/) data made available for the [#mcfcanalytics](https://web.archive.org/web/20120821021930/http://www.mcfc.co.uk/Home/The%20Club/MCFC%20Analytics) initiative is available in two sources: # - Match-by-Match aggregated performance data for all players during the 11/12 season # - Opta F24 Event data for Manchester City vs. Bolton Wanders on 21/08/2011 [[link](https://www.bbc.co.uk/sport/football/14520274)] # # The data import is divided into these two sections. # # We'll be using the [pandas](http://pandas.pydata.org/) library to import our data to this workbook as a DataFrame. # In[150]: # #mcfcanalytics - Head of Performance Analysis explains concept. YouTubeVideo('ikm52r7RlKc') # ### 3.2. Match-by-Match Data # #### 3.2.1. Data Dictionary # The [Opta](https://www.optasports.com/) dataset has one hundred and fourteen features (columns) with the following definitions and data types: # # | Feature | Data type | # |------|-----| # | `id` | `object` # | `index` | `object` # | `period` | `object` # | `timestamp` | `object` # | `minute` | `object` # | `second` | `object` # | `possession` | `object` # | `duration` | `object` # | `type.id` | `object` # | `type.name` | `object` # | `possession_team.id` | `object` # # # For a full list of definitions, see the official documentation [[link](https://github.com/eddwebster/football_analytics/blob/master/documentation/opta/Event%20Definitions%20-%20MCFC%20analytics.pdf)]. # #### 3.2.2. Import Data # In[9]: # Import XLS file of aggregated performance data as a pandas DataFrame df_opta_mbm_raw = pd.read_excel(data_dir_opta + '/raw/match_by_match/' + 'opta_match_by_match_pl_1112.xls') # #### 3.2.3. Initial Data Handling # Let's quality of the dataset by looking first and last rows in pandas using the [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [tail()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) methods. # In[110]: # Display the first 5 rows of the raw DataFrame, df_opta_mbm_raw df_opta_mbm_raw.head() # In[111]: # Display the last 5 rows of the raw DataFrame, df_opta_mbm_raw df_opta_mbm_raw.tail() # In[112]: # Print the shape of the raw DataFrame, df_opta_mbm_raw print(df_opta_mbm_raw.shape) # In[113]: # Print the column names of the raw DataFrame, df_opta_mbm_raw print(df_opta_mbm_raw.columns) # The dataset has two hundred and ten features (columns). Full details of these attributes can be found in the [Data Dictionary](section3.3.1). # In[114]: # Data types of the features of the raw DataFrame, df_opta_mbm_raw df_opta_mbm_raw.dtypes # All six of the columns have the object data type. Full details of these attributes and their data types can be found in the [Data Dictionary](section3.3.1). # In[115]: # Info for the raw DataFrame, df_opta_mbm_raw df_opta_mbm_raw.info() # In[116]: # Description of the raw DataFrame, df_sb_raw, showing some summary statistics for each numberical column in the DataFrame df_opta_mbm_raw.describe() # In[117]: # Plot visualisation of the missing values for each feature of the raw DataFrame, df_opta_mbm_raw msno.matrix(df_opta_mbm_raw, figsize = (30, 7)) # In[118]: # Counts of missing values opta_null_value_stats = df_opta_mbm_raw.isnull().sum(axis=0) opta_null_value_stats[opta_null_value_stats != 0] # The visualisation shows us that the only field with missing values is the `Player Forename` attribute. This is something that will be fixed in the Data Engineering section. # ### 3.2. F24 Event Data # See: https://fcpython.com/blog/parsing-opta-f24-files-introduction-xml-python and https://github.com/imrankhan17/Parsing-Opta-files/blob/master/Parsing%20Opta%20F24%20files.ipynb # #### 3.2.1. Data Dictionary # The [Opta](https://www.optasports.com/) dataset has one hundred and fourteen features (columns) with the following definitions and data types: # # | id | name | 1 | 0 | # |-----:|:-----------------------------|:--------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------------------------------------------------| # | 1 | `Pass` | a successful pass ie the pass accurately made it from one player to another without any interception | an unsuccessful pass ie the pass is wayward and does not make it to the other player or is intercepted | # | 2 | `Offside Pass` | nan | nan | # | 3 | `Take On` | a player successfully dribbling the ball past the opponent and retaining possession afterwards | a player attempting to dribble past an opposing but being unsuccessful eg tackled or losing possession | # | 4 | `Free kick` | the player who was fouled ie the player who 'won' the foul | the player who committed the foul | # | 5 | `Out` | the team that gains possession by virtue of the opposite team putting the ball out of play | the team & player that gave away possession of the ball by being the last player to touch the ball before it went out of play | # | 6 | `Corner ` | the team that wins the corner by virtue of the opposite team putting the ball out of play | the team & player that gave away possession of the ball by being the last player to touch the ball before it went out of play | # | 7 | `Tackle` | the tackling player both winning the tackle and then retaining possession of the ball afterwards (or the ball goes out of play) | the tackling player winning the tackle but not gaining possession of the ball | # | 8 | `Interception` | nan | nan | # | 9 | `Turnover` | nan | nan | # | 10 | `Save` | a successful save ie the shot was prevented from going into the goal | an unsuccessful save ie the shot led to a goal | # | 11 | `Claim` | a successful catch by the goalkeeper | an unsuccessful catch ie the ball was dropped or fumbled | # | 12 | `Clearance` | a clearance resulting in the team keeping possession of the ball | a clearance where the team lose possession of the ball | # | 13 | `Miss` | nan | nan | # | 14 | `Post` | nan | nan | # | 15 | `Attempt Saved` | nan | nan | # | 16 | `Goal` | nan | nan | # | 17 | `Card` | nan | nan | # | 18 | `Player off` | nan | nan | # | 19 | `Player on` | nan | nan | # | 20 | `Player retired` | nan | nan | # | 21 | `Player returns` | nan | nan | # | 22 | `Player becomes goalkeeper` | nan | nan | # | 23 | `Goalkeeper becomes player` | nan | nan | # | 24 | `Condition change` | nan | nan | # | 25 | `Official change` | nan | nan | # | 26 | `Possession` | nan | nan | # | 27 | `Start delay` | nan | nan | # | 28 | `End delay` | nan | nan | # | 29 | `Temporary stop` | nan | nan | # | 30 | `End` | nan | nan | # | 31 | `Picked an orange` | nan | nan | # | 32 | `Start` | nan | nan | # | 33 | `Start/End canceling` | nan | nan | # | 34 | `Team set up` | nan | nan | # | 35 | `Player changed position` | nan | nan | # | 36 | `Player changed Jersey number` | nan | nan | # | 37 | `Collection End` | nan | nan | # | 38 | `Temp_Goal` | nan | nan | # | 39 | `Temp_Attempt` | nan | nan | # | 40 | `Formation change` | nan | nan | # | 41 | `Punch` | nan | nan | # | 42 | `Good skill` | nan | nan | # | 43 | `Deleted event` | nan | nan | # | 44 | `Aerial` | the player who successfully won the duel ie gained possession from it | the player who lost the duel ie did not win possession of the ball | # | 45 | `Challenge` | nan | nan | # | 46 | `Postponed` | nan | nan | # | 47 | `Rescinded card` | nan | nan | # | 48 | `Provisional lineup` | nan | nan | # | 49 | `Ball recovery` | nan | nan | # | 50 | `Dispossessed` | nan | nan | # | 52 | `Keeper pick-up` | nan | nan | # | 53 | `Cross not claimed` | nan | nan | # | 54 | `Smother` | nan | nan | # | 55 | `Offside provoked` | nan | nan | # | 54 | `51` | Error | nan | nan | # | 58 | `Shot faced` | nan | nan | # | 56 | `Shield ball oop` | nan | nan | # | 57 | `Foul throw in` | nan | nan | # | 59 | `Keeper Sweeper` | nan | nan | # | 62 | `Event placeholder` | nan | nan | # | 60 | `Chance Missed` | the ball simply hit the player unintentionally | player unsuccessfully controlled the ball | # | 61 | `Ball touch` | the ball simply hit the player unintentionally | player unsuccessfully controlled the ball | # | 63 | `Temp_Save` | nan | nan | # | 64 | `Resume` | nan | nan | # # # For a full list of definitions, see the official documentation [[link](https://github.com/eddwebster/football_analytics/blob/master/documentation/opta/Event%20Definitions%20-%20MCFC%20analytics.pdf)]. # #### 3.2.2. Import Data # ##### Method 1: Open CSV prepared in R # R code for managing the F24 dataset [[link](http://profpeppersassistant.blogspot.com/2012/09/r-code-for-managing-f24-dataset.html)]. # In[10]: # Method 1: Open CSV prepared in R #df_opta_events_raw = pd.read_excel(data_dir_opta + '/raw/events/' + 'events.csv') # ##### Method 2: Parse XML file in Python # See: https://fcpython.com/blog/parsing-opta-f24-files-introduction-xml-python # First import the XML file using the imported XML module. # # The following two lines below will take an XML file and parse it into something that we could navigate just like we do with an object by using square brackets: # In[12]: # Import XML file using XML module tree = et.ElementTree(file=data_dir_opta + '/raw/events/' + 'Bolton_ManCityF24.xml') gameFile = tree.getroot() # Next take a look at what information we are given about the game itself with the ‘.attrib’ method from the XML module imported earlier. # # We get loads of information about the match in a dictionary, where the data is laid out with a key, then the value. For example, we can see the team names and also their Opta IDs. # In[14]: gameFile[0].attrib # Print the match details: # In[24]: # Print a string with the two teams, using %s and the attrib to dynamically fill the string print('{} vs {}'.format(gameFile[0].attrib['home_team_name'], gameFile[0].attrib['away_team_name'])) # Tthe structure of the file that the match events lie within the game details tags. The first event is the following: # In[25]: gameFile[0][0].attrib # We can see that there are event keys like `min`, `sec`, `x` and `y` – these are quite easy to understand. # # The values, like `outcome`: 1 and `event_id`: 1, don’t really make much sense by themselves. This is particularly important when it comes to teams, as we only have their `id` and not their name. This is tidied up in the next steps. # # The Opta XML uses lots of IDs rather than names that are detailed in the Opta documentation [link]() # In[27]: # Import XLS file of aggregated performance data as a pandas DataFrame df_opta_event_definitions = pd.read_excel(data_dir_opta + '/raw/' + 'Event Definitions - Excel file.xlsx') # In[35]: # Display DataFrame df_opta_event_definitions.head() # In[33]: # Print DataFrame as Markdown #print(df_opta_event_definitions.to_markdown()) # The events contained qualifiers. Let’s again use square brackets to pull the first one out for the event above: # In[34]: gameFile[0][0][0].attrib # We are going to loop through each of the events that we have identified above and take the details that we want from each. These details will go into different lists for different data categories (player, team, success, etc.). We will then put these lists into a table which is then ready for analysis, plotting or exporting. # # Firstly though, we'll make the team names come through to make this a bit more readible, while than the team ID. The events only carry the team ID, so let’s create a dictionary that will allow us to later swap the ID for the team name: # In[38]: team_dict = {gameFile[0].attrib['home_team_id']: gameFile[0].attrib['home_team_name'], gameFile[0].attrib['away_team_id']: gameFile[0].attrib['away_team_name']} print(team_dict) # For this tutorial, we’re simply going to take the x/y locations of the pass origin and destination, the time of the pass, the team and whether or not it was successful. # # There’s so much more that we could take, such as the players, the pass length or any other details that you spot in the XML. If you’d like to pull those out too, doing so will make a great extension to this tutorial! # # We’re going to start by creating the empty lists for our data: # In[99]: #Create empty lists for the 8 columns we're collecting data for x_origin = [] y_origin = [] x_destination = [] y_destination = [] outcome = [] minute = [] half = [] team = [] event_id = [] qualifier_id = [] type_id = [] ids = [] # The main part of the tutorial sees us going event-by-event and adding our desired details only when the event is a pass. To do this, we will use a for loop on each event, and when the event is a pass (id = 1), we will append the correct attribute to our lists created above. Some of these details are hidden in the qualifiers, so we’ll also iterate over those to get the information needed there. # # This tutorial probably isn’t the best place to go through the intricacies of the feed, so take a look at the docs if you’re interested. # # Follow the code below with comments on each of the above steps: # In[100]: gameFile[0][0].attrib # In[101]: # Iterate through each game in our file - we only have one for game in gameFile: # Iterate through each event for event in game: print(event.attrib.get("type_id")) # In[102]: # Iterate through each game in our file - we only have one for game in gameFile: # Iterate through each event for event in game: # To the correct list, append the correct attribute using attrib.get() ids.append(event.attrib.get("id")) event_id.append(event.attrib.get("event_id")) type_id.append(event.attrib.get("type_id")) x_origin.append(event.attrib.get("x")) y_origin.append(event.attrib.get("y")) outcome.append(event.attrib.get("outcome")) minute.append(event.attrib.get("min")) half.append(event.attrib.get("period_id")) team.append(team_dict[event.attrib.get("team_id")]) # Iterate through each qualifier for qualifier in event: qualifier_id.append(event.attrib.get("qualifier_id")) x_destination.append(qualifier.attrib.get("value")) y_destination.append(qualifier.attrib.get("value")) """ # If the qualifier is relevant, append the information to the x or y destination lists if qualifier.attrib.get("qualifier_id") == "140": x_destination.append(qualifier.attrib.get("value")) if qualifier.attrib.get("qualifier_id") == "141": y_destination.append(qualifier.attrib.get("value")) """ # In[88]: """ #Iterate through each game in our file - we only have one for game in gameFile: #Iterate through each event for event in game: #If the event is a pass (ID = 1) if event.attrib.get("type_id") == '1': #To the correct list, append the correct attribute using attrib.get() x_origin.append(event.attrib.get("x")) y_origin.append(event.attrib.get("y")) outcome.append(event.attrib.get("outcome")) minute.append(event.attrib.get("min")) half.append(event.attrib.get("period_id")) team.append(team_dict[event.attrib.get("team_id")]) #Iterate through each qualifier for qualifier in event: #If the qualifier is relevant, append the information to the x or y destination lists if qualifier.attrib.get("qualifier_id") == "140": x_destination.append(qualifier.attrib.get("value")) if qualifier.attrib.get("qualifier_id") == "141": y_destination.append(qualifier.attrib.get("value")) """ # If this has worked correctly, we should have 8 lists populated. Let’s check out the minutes list: # In[103]: print("The minute list is " + str(len(minute)) + " long and the 43rd entry is " + minute[42]) # In[109]: print("The ID list is " + str(len(ids)) + " long and the 43rd entry is " + minute[42]) # You can check out each list in more detail, but they should work just fine. # # Our final task is to create a table for our data from our lists. To do this, we just need to create a list of our column headers, then assign the list to each one. We’ll then flip our table to make it long, rather than wide – just like you would want to see in a spreadsheet. Let’s take a look: # In[104]: # Create a list of our 8 columns/lists column_titles = ["ids", "event_id", "type_id", "team", "half", "min", "x_origin", "y_origin", "x_destination", "y_destination", "outcome", "qualifier_id"] # Use pd.DataFrame to create our table, assign the data in the order of our columns and give it the column titles above final_table = pd.DataFrame(data=[ids, event_id, type_id, team, half, minute, x_origin, y_origin, x_destination, y_destination, outcome, qualifier_id], index=column_titles) # Transpose, or flip, the table. Otherwise, our table will run from left to right, rather than top to bottom final_table = final_table.T # Show us the top 5 rows of the table final_table.head() # In[108]: final_table.shape # So this is great for passes, and the same logic would apply for shots, fouls or even all events at the same time – just expand on the above with the relevant IDs from the Opta docs. And analysts, if you’re still struggling to get it done, the emergency loan window is always open! # # Now that we’ve taken a complex XML and parsed the passes into a table, there’s a number of things that we can do. We could put the table into a wider dataset, do some analysis of these passes, visualise straight away or just export our new table to a csv: # In[105]: df_opta_events_only = df_opta_event_definitions.drop([1, 0], axis=1) # In[106]: df_opta_events_only # In[107]: final_table # In[122]: df_opta_events_only['id'] = df_opta_events_only['id'].to_string() # In[123]: final_table.dtypes # In[124]: df_opta_events_only.dtypes # In[125]: # Join the Bio and Status DataFrames to form one, unified DataFrame test = pd.merge(final_table, df_opta_events_only, left_on='event_id', right_on='id', how='left') # In[126]: test # In[127]: df_opta_f24_raw = test # #### 3.2.3. Initial Data Handling # Let's quality of the dataset by looking first and last rows in pandas using the [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [tail()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) methods. # In[128]: # Display the first 5 rows of the raw DataFrame, df_opta_f24_raw df_opta_f24_raw.head() # In[129]: # Display the last 5 rows of the raw DataFrame, df_opta_f24_raw df_opta_f24_raw.tail() # In[130]: # Print the shape of the raw DataFrame, df_opta_f24_raw print(df_opta_f24_raw.shape) # In[131]: # Print the column names of the raw DataFrame, df_opta_f24_raw print(df_opta_f24_raw.columns) # The dataset has two hundred and ten features (columns). Full details of these attributes can be found in the [Data Dictionary](section3.3.1). # In[132]: # Data types of the features of the raw DataFrame, df_opta_f24_raw df_opta_f24_raw.dtypes # All six of the columns have the object data type. Full details of these attributes and their data types can be found in the [Data Dictionary](section3.3.1). # In[133]: # Info for the raw DataFrame, df_opta_f24_raw df_opta_f24_raw.info() # In[134]: # Description of the raw DataFrame, df_opta_f24_raw, showing some summary statistics for each numberical column in the DataFrame df_opta_mbm_raw.describe() # In[135]: # Plot visualisation of the missing values for each feature of the raw DataFrame, df_opta_f24_raw msno.matrix(df_opta_f24_raw, figsize = (30, 7)) # In[136]: # Counts of missing values opta_null_value_stats = df_opta_f24_raw.isnull().sum(axis=0) opta_null_value_stats[opta_null_value_stats != 0] # The visualisation shows us that the only field with missing values is the `Player Forename` attribute. This is something that will be fixed in the Data Engineering section. # --- # ## 4. Data Engineering # ### 4.1. Introduction # As per the Data Sources section, the Data Engineering section is divided into two sections for the aggregated match-by-match performance data and F24 Event data for Manchester City vs. Bolton Wanders (21/08/2011). # ### 4.2. Aggregated Match-by-Match Performance Data # #### 4.2.1. Assign Raw DataFrame to Engineered DataFrame # In[194]: # Assign Raw DataFrame to Engineered DataFrame df_opta_mbm = df_opta_mbm_raw # #### 4.2.2. String Cleaning # ##### `Player Forename` and `Player Surname` # Some players such as 'Kolo Touré' have a null value for the `Player Forename` attribute. The foollowing cleans this code and derives a cleaned `Player Surname` and `Player Full Name` attribute. # In[195]: # Clean names ## Forename df_opta_mbm['Player Forename Cleaned'] = np.where(df_opta_mbm['Player Forename'].isnull(), df_opta_mbm['Player Surname'].str.split(' ').str[0], df_opta_mbm['Player Forename']) ## Surname df_opta_mbm['Player Surname Cleaned'] = np.where(df_opta_mbm['Player Forename'].notnull(), df_opta_mbm['Player Surname'], df_opta_mbm['Player Surname'].str.split(' ').str[1]) ## Create Full Name attribute from cleaned Forename and Surname attributes df_opta_mbm['Player Full Name'] = df_opta_mbm['Player Forename Cleaned'] + ' ' + df_opta_mbm['Player Surname Cleaned'] # In[193]: # Test of 'Kolo Touré' - had a blank first name in the database df_opta_mbm[df_opta_mbm['Player Surname'] == 'Kolo Touré'] # #### 4.2.3. Add `Season` attribute # In[172]: # Dictionary of competitions dict_seasons = { 2011: '11/12', 2012: '12/13', 2013: '13/14', 2014: '14/15', 2015: '15/16', 2016: '16/17', 2017: '17/18', 2018: '18/19', 2019: '19/20', 2020: '20/21', } # In[173]: # Map grouped positions to DataFrame df_opta_mbm['Season'] = df_opta_mbm['SeasId'].map(dict_seasons) # #### 4.2.4. Add `Competition` attribute from `CompId` # In[174]: # Dictionary of competitions dict_comps = { 8: 'Premier League', 10: 'English Football League Championship', 14: 'Scottish Premiership', 11: 'English Football League 1', 12: 'English Football League 2', 1: 'FA Cup', 2: 'English League Cup', 7: 'English Football League Trophy', 4: 'FIFA World Cup', 5: 'Champions League', 6: 'UEFA Europa League', 20: 'J-League', 21: 'Serie A', 22: 'Bundesliga', 22: 'La Liga' } # In[175]: # Map grouped positions to DataFrame df_opta_mbm['Competitions'] = df_opta_mbm['CompId'].map(dict_comps) # #### 4.2.5. Add `Position Grouped` attribute from `Position Id` # In[213]: # Dictionary of competitions dict_positions_grouped = {1: 'Goalkeeper', 2: 'Defender', 4: 'Midfielder', 6: 'Forward' } # In[214]: # Map grouped positions to DataFrame df_opta_mbm['Position Grouped'] = df_opta_mbm['Position Id'].map(dict_positions_grouped) # #### 4.2.6. Export Data # In[215]: # Export DataFrame as a CSV file ## Export another copy to the Opta folder called 'latest' (can be overwritten) df_opta_mbm.to_csv(data_dir_opta + '/engineered/' + 'opta_match_by_match_pl_1112.csv', index=None, header=True) ## Export another copy to the Opta folder called 'latest' (can be overwritten) df_opta_mbm.to_csv(data_dir + '/export/opta_match_by_match_pl_1112.csv', index=None, header=True) # #### 4.2.6. Aggregate Data # ##### 4.2.6.1. Season Level # In[201]: df_opta_mbm.head() # In[ ]: # CODE HERE # ##### 4.2.6.2. Team Level # In[205]: df_opta_team = df_opta_mbm # In[206]: # Displays all one hundered and four columns with pd.option_context('display.max_rows', None, 'display.max_columns', None): print(df_opta_team.dtypes) # In[207]: df_opta_team = df_opta_team.drop(['Date', 'Player ID', 'Player Surname', 'Player Forename', 'Opposition', 'Opposition id', 'Venue', 'Position Id', 'Player Forename Cleaned', 'Player Surname Cleaned', 'Player Full Name' ], axis=1) # In[208]: df_opta_team # In[ ]: df_opta_team = df_opta_team.groupby(['Team', 'Team Id' ] ).agg( { 'A': 'sum' } ).reset_index() # reset index to get grouped columns back # Rename columns df_opta_team.columns = ['A', 'B'] # #### 4.2.7. Export DataFrame # Export the engineered Events, Lineup, and Tactics [Opta](https://www.optasports.com/) DataFrames as CSV files. # ### 4.3. F24 Event Data # #### 4.3.1. Assign Raw DataFrame to Engineered DataFrame # In[ ]: # Assign Raw DataFrame to Engineered DataFrame df_opta_f24 = df_opta_f24_raw # In[ ]: # CODE HERE # ### 4.4. Import Fixture Data # https://datahub.io/sports-data/english-premier-league#resource-season-1112 # In[148]: # Import data as a DataFrame called 'df_original'. To use in wrangling process before creating 'df' DataFrame for analysis df_pl_fixtures_1112 = pd.read_csv(data_dir_fixtures + '/Premier League/csv/season-1112.csv') # In[149]: df_pl_fixtures_1112.head() # In[ ]: # CODE HERE # --- # ## 5. Summary # This notebook engineers [Opta](https://www.optasports.com/) data using [pandas](http://pandas.pydata.org/). # --- # ## 6. Next Steps # ... # --- # ## 7. References # # #### Data # * [Opta](https://www.optasports.com/) data # # #### Matches # * Manchester City vs. Bolton Wanders on 21/08/2011 [[link](https://www.bbc.co.uk/sport/football/14520274)] # # #### MCFC Analytics # * http://www.mcfc.co.uk/mcfcanalytics (now not online) # * https://web.archive.org/web/20120821021930/http://www.mcfc.co.uk/Home/The%20Club/MCFC%20Analytics # * https://web.archive.org/web/20120820000747/http://www.mcfc.co.uk/The-Club/MCFC-Analytics/What-is-MCFC-Analytics # * https://web.archive.org/web/20120821051105/http://www.mcfc.co.uk/the-club/mcfc-analytics/who-is-the-data-for # * https://www.theguardian.com/football/blog/2012/aug/16/manchester-city-player-statistics # * https://thevideoanalyst.com/mcfc-analytics/ # * https://analysefootball.com/tag/mcfc-analytics/ # * https://www.soccermetrics.net/tag/mcfcanalytics # * https://danjharrington.wordpress.com/tag/mcfc-analytics/ # * https://blog.bimeanalytics.com/english/bime-and-the-mcfc-analytics-project-with-opta # * https://www.forbes.com/sites/zachslaton/2012/08/16/game-changer-mcfc-analytics-releases-full-season-of-opta-data-for-public-use/ # * https://public.tableau.com/profile/jburnmurdoch#!/vizhome/MCFCAnalyticsTierOne/MCFCpassing # * https://www.sportingintelligence.com/2012/08/16/are-you-the-new-bill-james-mcfc-will-give-you-data-worth-thousands-to-find-out-160802/ # * http://www.soccerstatistically.com/blog/2012/9/17/dealing-with-the-mcfc-analytics-advanced-data-release.html # * http://onfooty.com/2012/08/villarreal-2011-12-breaking-down-a-failed-season.html # * http://analysefootball.com/2012/09/03/mcfc-analytics-summary-of-blog-posts-1-mcfcanalytics/ # * http://analysefootball.com/2012/09/10/mcfc-analytics-summary-of-blog-posts-2/ # * http://analysefootball.com/2012/08/30/passing-in-the-final-third-and-goals-epl-2011-12-mcfcanalytics/ # * http://public.tableausoftware.com/views/20120831_opta_metrics_summary/Dashboard1?:embed=y http://www.tips-for-excel.com/2012/08/how-do-your-football-team-pass-the-ball-mcfc-analytics/ # * http://thepowerofgoals.blogspot.co.uk/2012/09/how-teams-win-from-mcfc-data.html?m=1 # * http://thepowerofgoals.blogspot.co.uk/2012/08/how-fouls-turn-into-cards.html?m=1 # * https://efex.bimeapp.com/players/dashboard/MCFCStats # * http://www.guardian.co.uk/news/datablog/interactive/2012/aug/21/premier-league-2011-12-performance-data-interactive # * http://www.tips-for-excel.com/2012/08/what-englands-euro-2012-starting-11-should-really-have-been-according-to-data/ # * http://blog.analysismarketing.com/2012/09/mcfc-analytics-some-thoughts-about-data.html # * http://rightbackbehindthegoal.blogspot.co.uk/2012/09/the-effectiveness-of-crossing-and.html?m=1 # * http://www.soccermetrics.net/soccer-database-development/mcfc-analytics-soccer-database-development/an-update-on-the-two-mcfc-analytics-projects # # #### Parsing F24 Opta Data # * https://github.com/imrankhan17/Parsing-Opta-files/blob/master/Parsing%20Opta%20F24%20files.ipynb # * http://www.fcrstats.com/fcrstats_package.html # * http://profpeppersassistant.blogspot.com/2012/09/r-code-for-managing-f24-dataset.html # * https://github.com/FCrSTATS/OptaEventData # --- # # ***Visit my website [EddWebster.com](https://www.eddwebster.com) or my [GitHub Repository](https://github.com/eddwebster) for more projects. If you'd like to get in contact, my Twitter handle is [@eddwebster](http://www.twitter.com/eddwebster) and my email is: edd.j.webster@gmail.com.*** # [Back to the top](#top)