#!/usr/bin/env python # coding: utf-8 # # # Second Spectrum Data Engineering # ##### Notebook to engineer [Second Spectrum](https://www.secondspectrum.com/index.html) Tracking data using [pandas](http://pandas.pydata.org/). # # ### By [Edd Webster](https://www.twitter.com/eddwebster) # Notebook first written: 20/01/2022
# Notebook last updated: 01/02/2022 # # ![Watford F.C.](../../img/logos/second_spectrum_logo.jpeg) # # ![Second Spectrum](../../img/club_badges/premier_league/watford_fc_logo_small.png) # # Click [here](#section4) to jump straight into the Data Engineering section and skip the [Notebook Brief](#section2) and [Data Sources](#section3) sections. # ___ # # # ## Introduction # This notebook parses and engineers [Second Spectrum](https://www.secondspectrum.com/index.html) Tracking data for two matches, that have been provided by [Watford F.C](https://www.watfordfc.com/), using [pandas](http://pandas.pydata.org/) for data manipulation through DataFrames. # # For more information about this notebook and the author, I am available through all the following channels: # * [eddwebster.com](https://www.eddwebster.com/); # * edd.j.webster@gmail.com; # * [@eddwebster](https://www.twitter.com/eddwebster); # * [linkedin.com/in/eddwebster](https://www.linkedin.com/in/eddwebster/); # * [github/eddwebster](https://github.com/eddwebster/); and # * [public.tableau.com/profile/edd.webster](https://public.tableau.com/profile/edd.webster). # # A static version of this notebook can be found [here](https://nbviewer.org/github/eddwebster/watford/blob/main/notebooks/2_data_engineering/Second%20Spectrum%20Data%20Engineering.ipynb). This notebook has an accompanying [`watford`](https://github.com/eddwebster/watford) GitHub repository and for my full repository of football analysis, see my [`football_analysis`](https://github.com/eddwebster/football_analytics) GitHub repository. # ___ # # ## Notebook Contents # 1. [Notebook Dependencies](#section1)
# 2. [Notebook Brief](#section2)
# 3. [Data Sources](#section3)
# 1. [Introduction](#section3.1)
# 2. [Data Dicitonary](#section3.2)
# 3. [Import the Data](#section3.3)
# 4. [Initial Data Handling](#section3.4)
# 4. [Data Engineering](#section4)
# 1. [Reset Index](#section4.1)
# 2. [Rename Columns](#section4.2)
# 3. [Rename Player Columns Using Metadata](#section4.3)
# 4. [Separate Home and Away DataFrames](#section4.4)
# 5. [Calculate Player Velocities](#section4.5)
# 6. [Reverse direction of players](#section4.6)
# 5. [Summary](#section5)
# 6. [Next Steps](#section6)
# 7. [References](#section7)
# ___ # # # # ## 1. Notebook Dependencies # # This notebook was written using [Python 3](https://docs.python.org/3.7/) and requires the following libraries: # * [`Jupyter notebooks`](https://jupyter.org/) for this notebook environment with which this project is presented; # * [`NumPy`](http://www.numpy.org/) for multidimensional array computing; and # * [`pandas`](http://pandas.pydata.org/) for data analysis and manipulation. # # All packages used for this notebook can be obtained by downloading and installing the [Conda](https://anaconda.org/anaconda/conda) distribution, available on all platforms (Windows, Linux and Mac OSX). Step-by-step guides on how to install Anaconda can be found for Windows [here](https://medium.com/@GalarnykMichael/install-python-on-windows-anaconda-c63c7c3d1444) and Mac [here](https://medium.com/@GalarnykMichael/install-python-on-mac-anaconda-ccd9f2014072), as well as in the Anaconda documentation itself [here](https://docs.anaconda.com/anaconda/install/). # ### Import Libraries and Modules # In[45]: # Python ≥3.5 (ideally) import platform import sys, getopt assert sys.version_info >= (3, 5) import csv # Import Dependencies get_ipython().run_line_magic('matplotlib', 'inline') # Math Operations import numpy as np import math from math import pi # Datetime import datetime from datetime import date import time # Data Preprocessing import pandas as pd import pandas_profiling as pp import os import re import chardet import random from io import BytesIO from pathlib import Path # Kloppy from kloppy import secondspectrum # Reading Directories import glob import os # Working with JSON import json from pandas import json_normalize # Data Visualisation import matplotlib as mpl import matplotlib.pyplot as plt import matplotlib.animation as animation from matplotlib.colors import LinearSegmentedColormap from matplotlib import patches import seaborn as sns import missingno as msno import moviepy.editor as mpy from moviepy.video.io.bindings import mplfig_to_npimage # Requests and downloads import tqdm import requests # Machine Learning import scipy.signal as signal from scipy.spatial import Voronoi, voronoi_plot_2d, Delaunay # Display in Jupyter from IPython.display import Image, YouTubeVideo from IPython.core.display import HTML # Ignore Warnings import warnings warnings.filterwarnings(action="ignore", message="^internal gelsd") # Print message print('Setup Complete') # In[46]: # Python / module versions used here for reference print('Python: {}'.format(platform.python_version())) print('NumPy: {}'.format(np.__version__)) print('pandas: {}'.format(pd.__version__)) print('matplotlib: {}'.format(mpl.__version__)) # ### Defined Filepaths # In[47]: # Set up initial paths to subfolders base_dir = os.path.join('..', '..') data_dir = os.path.join(base_dir, 'data') data_dir_second_spectrum = os.path.join(base_dir, 'data', 'second_spectrum') data_dir_opta = os.path.join(base_dir, 'data', 'opta') scripts_dir = os.path.join(base_dir, 'scripts') scripts_dir_second_spectrum = os.path.join(base_dir, 'scripts', 'second_spectrum') scripts_dir_metrica_sports = os.path.join(base_dir, 'scripts', 'metrica_sports') img_dir = os.path.join(base_dir, 'img') fig_dir = os.path.join(base_dir, 'img', 'fig') fig_dir_second_spectrum = os.path.join(base_dir, 'img', 'fig', 'second_spectrum') video_dir = os.path.join(base_dir, 'video') video_dir_second_spectrum = os.path.join(base_dir, 'video', 'fig', 'second_spectrum') # ### Custom Libraries for Tracking Data # Custom libraries for working with the [Second Spectrum](https://www.secondspectrum.com/index.html) data, that were initially written by [Laurie Shaw](https://twitter.com/EightyFivePoint), to work with the [Metrica Sports](https://metrica-sports.com/) data. See the following for his original code [[link](https://github.com/Friends-of-Tracking-Data-FoTD/LaurieOnTracking)]. # # The modifications to this data include the ability to create Pitch Control models without Tracking data. # In[48]: # Custom libraries for working with Tracking data ## Define path of scripts sys.path.insert(0, os.path.abspath(scripts_dir_second_spectrum)) ## Second Spectrum scripts - custom scripts derived from Laurie Shaw's Metrica scripts import Second_Spectrum_IO as sio import Second_Spectrum_Viz as sviz import Second_Spectrum_Velocities as svel import Second_Spectrum_PitchControl as spc import Second_Spectrum_EPV as sepv # In[49]: """ ## Laurie Shaw's custom libraries for working with Metrica Sports data import Metrica_IO as mio import Metrica_Viz as mviz import Metrica_Velocities as mvel import Metrica_PitchControl as mpc import Metrica_EPV as mepv """ # ### Custom Functions # In[50]: # Define function to read in Tracking data using Kloppy def read_in_tracking_data(filename): """ Function to read in Second Spectrum Tracking data as a pandas DataFrame using Kloppy. Written by Edd Webster / @eddwebster """ # Read in exported CSV file if exists, if not, read in the raw JSON and XML files using Kloppy ## If not already saved as a CSV file if not os.path.exists(os.path.join(data_dir_second_spectrum, 'raw', 'data', f'g{filename}_SecondSpectrum_Data.csv')): ### Load meta data (XML) and raw data (JSONL) files dataset = secondspectrum.load( meta_data = f'../../data/second_spectrum/raw/metadata/g{filename}_SecondSpectrum_Metadata.xml', raw_data = f'../../data/second_spectrum/raw/data/g{filename}_SecondSpectrum_Data.jsonl', #meta_data = os.path.join(data_dir_second_spectrum, 'raw', 'data', f'{filename}_SecondSpectrum_Metadata.xml') #raw_data = os.path.join(data_dir_second_spectrum, 'raw', 'data', f'{filename}_SecondSpectrum_Data.jsonl') ### Optional arguments additional_meta_data = f'../../data/second_spectrum/raw/metadata/g{filename}_SecondSpectrum_Metadata.json', #additional_meta_data = os.path.join(data_dir_second_spectrum, 'raw', 'data', f'{filename}_SecondSpectrum_Metadata.json') #sample_rate=1/25, #limit=100, coordinates='secondspectrum', #only_alive=True ) ### Convert DataFrame to pandas df_tracking = dataset.to_pandas() ### Export raw DataFrame df_tracking.to_csv(os.path.join(data_dir_second_spectrum, 'raw', 'data', f'g{filename}_SecondSpectrum_Data.csv'), index=None, header=True) ## If already, read in CSV file else: df_tracking = pd.read_csv(os.path.join(data_dir_second_spectrum, 'raw', 'data', f'g{filename}_SecondSpectrum_Data.csv')) ## Return DataFrame return df_tracking # In[51]: # Define function to create separate DataFrames of the Tracking data for Home and Away def create_home_and_away_dfs(df_tracking): """ Function to separate the Tracking data into Home and Away DataFrames. Written by Edd Webster / @eddwebster """ # Define columns that are found in both DataFrames lst_cols = ['Frame' ,'Period', 'Time [s]', 'Ball State', 'Ball Owning Team ID', 'ball_x', 'ball_y'] # Define columns in Home DataFrame ## Create an empty list for home columns lst_cols_home = [] ## Extract Home columns from the DataFrame for col in df_tracking.columns: if 'Home' in col: lst_cols_home.append(col) # Define columns in Away DataFrame ## Create an empty list for home columns lst_cols_away = [] ## Extract Home columns from the DataFrame for col in df_tracking.columns: if 'Away' in col: lst_cols_away.append(col) # Select columns of interest for Home and Away DataFrames df_tracking_home = df_tracking[lst_cols + lst_cols_home] df_tracking_away = df_tracking[lst_cols + lst_cols_away] # Return DataFrame return df_tracking_home, df_tracking_away # In[52]: # Define function to reverse the direction of players def to_single_playing_direction(home, away): ''' Flip coordinates in second half so that each team always shoots in the same direction through the match. ''' for team in [home, away]: second_half_idx = team.Period.idxmax(2) columns = [c for c in team.columns if c[-1].lower() in ['x','y']] team.loc[second_half_idx:,columns] *= -1 return home, away # ### Notebook Settings # In[53]: # Display all columns of displayed pandas DataFrames pd.set_option('display.max_columns', None) #pd.set_option('display.max_rows', None) pd.options.mode.chained_assignment = None # --- # # # # ## 2. Notebook Brief # This notebook parses and engineers [Second Spectrum](https://www.secondspectrum.com/index.html) using [pandas](http://pandas.pydata.org/) and [`Kloppy`](https://kloppy.pysport.org/), for two Premier League matches featuring Crystal Palace during the 21/22 season. The two datasets provided are for the following matches: # * [27/09/2021: Crystal Palace (1) vs. (1) Brighton & Hove Albion](https://www.bbc.co.uk/sport/football/58620544) (g2210324) # * [03/10/2021: Crystal Palace (2) vs. (2) Leicester City](https://www.bbc.co.uk/sport/football/58667896) (g2210334) # # Once engineered, this dataset will be visualised and analysed in congunction with the corresponding Event data, as part of an opposition analysis piece. # # # **Notebook Conventions**:
# * Variables that refer a `DataFrame` object are prefixed with `df_`. # * Variables that refer to a collection of `DataFrame` objects (e.g., a list, a set or a dict) are prefixed with `dfs_`. # --- # # # # ## 3. Data Sources # # # ### 3.1. Introduction # [Second Spectrum](https://www.secondspectrum.com/index.html) are a football analytics, data provider ... # # ![Second Spectrum](../../img/logos/second_spectrum_logo.jpeg) # # The tracking data represents the location of every player on the pitch with a temporal frequency of 25 Hz and the corresponding match time for each tracking frame is specified. # # UPDATE THIS # # # ### 3.2. Data Dictionary # The [Second Spectrum](https://www.secondspectrum.com/index.html) Tracking dataset has fourteen features (columns) with the following definitions and data types: # # | Feature | Data type | Definition | # |-----------------|---------------|----------------| # | `Frame` | object | | # | `Period` | object | | # | `Time [s]` | object | | # | `Home_11_x` | object | | # | `Home_11_y` | object | | # | `Away_8_x` | object | | # | `Away_8_y` | object | | # # UPDATE THIS # # # ### 3.3. Import Data # The following cells take the raw meta data (`XML`) and tracking data (`JSONL`) files and read them in as a unified [pandas](https://pandas.pydata.org/) DataFrame, using the custom `read_in_tracking_data` function that I have written (see the functions section at the top of the notebook), powered by [`Kloppy`](https://kloppy.pysport.org/). # In[54]: # Show files in directory print(glob.glob(os.path.join(data_dir_second_spectrum, 'raw', 'data/*.csv'))) # In[55]: # Read in Tracking data using the custom 'read_in_tracking_data' function, powered by Kloppy df_tracking_cry_bri = read_in_tracking_data(filename='2210324') df_tracking_cry_lei = read_in_tracking_data(filename='2210334') # # # ### 3.4. Initial Data Handling # To avoid duplication commands and to get an idea of what the output of the Tracking data looks like, this section just goes through tone the two Tracking data files, [Crystal Palace vs. Leicester City](https://www.bbc.co.uk/sport/football/58620544) (g2210334). # # First check the quality of the dataset by looking first and last rows in pandas using the [`head()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [`tail()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) methods. # In[56]: # Display the first five rows of the DataFrame, df_tracking_cry_lei df_tracking_cry_lei.head() # In[57]: # Display the last five rows of the DataFrame, df_tracking_cry_lei df_tracking_cry_lei.tail() # In[58]: # Print the shape of the DataFrame, df_tracking_cry_lei print(df_tracking_cry_lei.shape) # The DataFrame has 145,621 rows, which is what we would expected - 25 fps × 60 secs × 90 mins = 135,000 (excluding injury time). # In[59]: # Print the column names of the DataFrame, df_tracking_cry_lei print(df_tracking_cry_lei.columns) # In[60]: # Data types of the features of the raw DataFrame, df_tracking_cry_lei df_tracking_cry_lei.dtypes # Full details of these attributes and their data types is discussed further in the [Data Dictionary](section3.2). # In[61]: # Displays all columns with pd.option_context('display.max_rows', None, 'display.max_columns', None): print(df_tracking_cry_lei.dtypes) # In[62]: # Info for the raw DataFrame, df_tracking_cry_lei df_tracking_cry_lei.info() # In[63]: # Plot visualisation of the missing values for each feature of the raw DataFrame, df_tracking_cry_lei msno.matrix(df_tracking_cry_lei, figsize = (30, 7)) # In[64]: # Counts of missing values null_value_stats = df_tracking_cry_lei.isnull().sum(axis=0) null_value_stats[null_value_stats != 0] # --- # # # # ## 4. Data Engineering # The next step is to wrangle the dataset to into a format that’s suitable for analysis. # # This section is broken down into the following subsections: # # 4.1. [Reset Index](#section4.1)
# 4.2. [Rename Columns](#section4.2)
# 4.3. [Rename Player Columns Using Metadata](#section4.3)
# 4.4. [Separate Home and Away DataFrames](#section4.4)
# 4.5. [Calculate Player Velocities](#section4.5)
# 4.6. [Reverse direction of players](#section4.6)
# # # ### 4.1. Reset Index # Reset index, the existing index will not be dropped and will be used to identify the frame number. # In[65]: # Reset index, the existing index will be used for the frame number df_tracking_cry_bri = df_tracking_cry_bri.reset_index(drop=False) df_tracking_cry_lei = df_tracking_cry_lei.reset_index(drop=False) # # # ### 4.2. Rename Columns # To work with [Laurie Shaw](https://twitter.com/EightyFivePoint)'s Metrica Sports Tracking data libraries, [`LaurieOnTracking`](https://github.com/Friends-of-Tracking-Data-FoTD/LaurieOnTracking), the data needs to be engineered to match the Metrica schema, which is the following: # # | Feature | Data type | Definition | # |-------------------------------------------|---------------|----------------| # | `Frame` | int64 | | # | `Period` | int64 | | # | `Time [s]` | float64 | | # | `Home/Away_No._x` (repeated 14 times) | float64 | | # | `Home/Away_No._y` (repeated 14 times) | float64 | | # | `ball_x` | float64 | | # | `ball_y` | float64 | | # # To learn more about the Metrica Sports schema, see the official documentation [[link](https://github.com/metrica-sports/sample-data/blob/master/documentation/events-definitions.pdf)]. # In[66]: # Rename columns df_tracking_cry_lei = df_tracking_cry_lei.rename(columns={'period_id': 'Period', 'timestamp': 'Time [s]', 'ball_state': 'Ball State', 'ball_owning_team_id': 'Ball Owning Team ID', 'index': 'Frame' } ) # In[67]: # Rename columns df_tracking_cry_bri = df_tracking_cry_bri.rename(columns={'period_id': 'Period', 'timestamp': 'Time [s]', 'ball_state': 'Ball State', 'ball_owning_team_id': 'Ball Owning Team ID', 'index': 'Frame' } ) # In[68]: # Display DataFrame df_tracking_cry_bri.head() # # # ### 4.3. Rename Player Columns Using Metadata # For compatibility with [Laurie Shaw](https://twitter.com/EightyFivePoint)'s Metrica Sports Tracking data library [`LaurieOnTracking`](https://github.com/Friends-of-Tracking-Data-FoTD/LaurieOnTracking), the column of each player is given a `Home_` or `Away_` prefix. # ##### a) Crystal Palace (2) vs. (2) Leicester City (f2210334) # In[69]: # Read in the Meta data for the corresponding Tracking data ## Define the file name file_name = 'g2210334' ## Read in the Meta data df_meta_cry_lei = pd.read_csv(os.path.join(data_dir_second_spectrum, 'raw', 'metadata', f'{file_name}_SecondSpectrum_Metadata_Players.csv')) # In[70]: ## Prepare DataFrame to create dictionary ## Convert data types df_meta_cry_lei['optaId'] = df_meta_cry_lei['optaId'].astype(int) df_meta_cry_lei['name'] = df_meta_cry_lei['name'].astype(str) ## Create a dictionary of Qualifier IDs and Qualifier Names from the reference dataset dict_home_away_players = dict(zip(df_meta_cry_lei['optaId'], df_meta_cry_lei['HA'])) # In[71]: dict_home_away_players # In[72]: # Map the values to the data (current a hard coded way, will replace with a loop shortly) dict_home_away_players_cry_lei = {'86417_x': 'Home_15_x', '54861_x': 'Home_20_x', '105666_x': 'Home_1_x', '66975_x': 'Home_4_x', '108413_x': 'Home_12_x', '443661_x': 'Home_7_x', '244723_x': 'Home_3_x', '80146_x': 'Home_9_x', '209036_x': 'Home_6_x', '40836_x': 'Home_13_x', '50471_x': 'Home_18_x', '174874_x': 'Home_16_x', '55494_x': 'Home_2_x', '55037_x': 'Home_8_x', '49413_x': 'Home_5_x', '232787_x': 'Home_23_x', '82403_x': 'Home_11_x', '199670_x': 'Home_22_x', '58786_x': 'Home_34_x', '57328_x': 'Home_17_x', '86417_y': 'Home_15_y', '54861_y': 'Home_20_y', '105666_y': 'Home_1_y', '66975_y': 'Home_4_y', '108413_y': 'Home_12_y', '443661_y': 'Home_7_y', '244723_y': 'Home_3_y', '80146_y': 'Home_9_y', '209036_y': 'Home_6_y', '40836_y': 'Home_13_y', '50471_y': 'Home_18_y', '174874_y': 'Home_16_y', '55494_y': 'Home_2_y', '55037_y': 'Home_8_y', '49413_y': 'Home_5_y', '232787_y': 'Home_23_y', '82403_y': 'Home_11_y', '199670_y': 'Home_22_y', '58786_y': 'Home_34_y', '57328_y': 'Home_17_y', '86417_d': 'Home_15_d', '54861_d': 'Home_20_d', '105666_d': 'Home_1_d', '66975_d': 'Home_4_d', '108413_d': 'Home_12_d', '443661_d': 'Home_7_d', '244723_d': 'Home_3_d', '80146_d': 'Home_9_d', '209036_d': 'Home_6_d', '40836_d': 'Home_13_d', '50471_d': 'Home_18_d', '174874_d': 'Home_16_d', '55494_d': 'Home_2_d', '55037_d': 'Home_8_d', '49413_d': 'Home_5_d', '232787_d': 'Home_23_d', '82403_d': 'Home_11_d', '199670_d': 'Home_22_d', '58786_d': 'Home_34_d', '57328_d': 'Home_17_d', '86417_s': 'Home_15_s', '54861_s': 'Home_20_s', '105666_s': 'Home_1_s', '66975_s': 'Home_4_s', '108413_s': 'Home_12_s', '443661_s': 'Home_7_s', '244723_s': 'Home_3_s', '80146_s': 'Home_9_s', '209036_s': 'Home_6_s', '40836_s': 'Home_13_s', '50471_s': 'Home_18_s', '174874_s': 'Home_16_s', '55494_s': 'Home_2_s', '55037_s': 'Home_8_s', '49413_s': 'Home_5_s', '232787_s': 'Home_23_s', '82403_s': 'Home_11_s', '199670_s': 'Home_22_s', '58786_s': 'Home_34_s', '57328_s': 'Home_17_s', '245419_x': 'Away_29_x', '166477_x': 'Away_27_x', '219352_x': 'Away_37_x', '166989_x': 'Away_8_x', '173515_x': 'Away_14_x', '215413_x': 'Away_22_x', '111931_x': 'Away_21_x', '218031_x': 'Away_4_x', '168580_x': 'Away_17_x', '51938_x': 'Away_11_x', '95463_x': 'Away_12_x', '101668_x': 'Away_9_x', '155569_x': 'Away_18_x', '172780_x': 'Away_10_x', '17745_x': 'Away_1_x', '197469_x': 'Away_20_x', '93100_x': 'Away_23_x', '201666_x': 'Away_7_x', '225902_x': 'Away_42_x', '40146_x': 'Away_5_x', '245419_y': 'Away_29_y', '166477_y': 'Away_27_y', '219352_y': 'Away_37_y', '166989_y': 'Away_8_y', '173515_y': 'Away_14_y', '215413_y': 'Away_22_y', '111931_y': 'Away_21_y', '218031_y': 'Away_4_y', '168580_y': 'Away_17_y', '51938_y': 'Away_11_y', '95463_y': 'Away_12_y', '101668_y': 'Away_9_y', '155569_y': 'Away_18_y', '172780_y': 'Away_10_y', '17745_y': 'Away_1_y', '197469_y': 'Away_20_y', '93100_y': 'Away_23_y', '201666_y': 'Away_7_y', '225902_y': 'Away_42_y', '40146_y': 'Away_5_y', '245419_d': 'Away_29_d', '166477_d': 'Away_27_d', '219352_d': 'Away_37_d', '166989_d': 'Away_8_d', '173515_d': 'Away_14_d', '215413_d': 'Away_22_d', '111931_d': 'Away_21_d', '218031_d': 'Away_4_d', '168580_d': 'Away_17_d', '51938_d': 'Away_11_d', '95463_d': 'Away_12_d', '101668_d': 'Away_9_d', '155569_d': 'Away_18_d', '172780_d': 'Away_10_d', '17745_d': 'Away_1_d', '197469_d': 'Away_20_d', '93100_d': 'Away_23_d', '201666_d': 'Away_7_d', '225902_d': 'Away_42_d', '40146_d': 'Away_5_d', '245419_s': 'Away_29_s', '166477_s': 'Away_27_s', '219352_s': 'Away_37_s', '166989_s': 'Away_8_s', '173515_s': 'Away_14_s', '215413_s': 'Away_22_s', '111931_s': 'Away_21_s', '218031_s': 'Away_4_s', '168580_s': 'Away_17_s', '51938_s': 'Away_11_s', '95463_s': 'Away_12_s', '101668_s': 'Away_9_s', '155569_s': 'Away_18_s', '172780_s': 'Away_10_s', '17745_s': 'Away_1_s', '197469_s': 'Away_20_s', '93100_s': 'Away_23_s', '201666_s': 'Away_7_s', '225902_s': 'Away_42_s', '40146_s': 'Away_5_s' } # In[73]: ## Map Qualifier Names to Qualifier IDs df_tracking_cry_lei = df_tracking_cry_lei.rename(columns=dict_home_away_players_cry_lei) # In[74]: ## Display columns df_tracking_cry_lei.columns # ##### b) Crystal Palace (1) vs. (1) Brighton & Hove Albion (f2210324) # In[75]: # Read in the Meta data for the corresponding Tracking data ## Define the file name file_name = 'g2210324' ## Read in the Meta data df_meta_cry_bri = pd.read_csv(os.path.join(data_dir_second_spectrum, 'raw', 'metadata', f'{file_name}_SecondSpectrum_Metadata_Players.csv')) # In[76]: ## Prepare DataFrame to create dictionary ## Convert data types df_meta_cry_bri['optaId'] = df_meta_cry_bri['optaId'].astype(int) df_meta_cry_bri['name'] = df_meta_cry_bri['name'].astype(str) ## Create a dictionary of Qualifier IDs and Qualifier Names from the reference dataset dict_home_away_players = dict(zip(df_meta_cry_bri['optaId'], df_meta_cry_bri['HA'])) # In[77]: dict_home_away_players # In[78]: # Map the values to the data (current a hard coded way, will replace with a loop shortly) dict_home_away_players_cry_bri = {'55037_x': 'Home_8_x', '50471_x': 'Home_18_x', '80146_x': 'Home_9_x', '82403_x': 'Home_11_x', '108413_x': 'Home_12_x', '199670_x': 'Home_22_x', '232787_x': 'Home_23_x', '174874_x': 'Home_16_x', '244723_x': 'Home_3_x', '66975_x': 'Home_4_x', '55494_x': 'Home_2_x', '209036_x': 'Home_6_x', '54861_x': 'Home_20_x', '86417_x': 'Home_15_x', '105666_x': 'Home_1_x', '231747_x': 'Home_14_x', '40836_x': 'Home_13_x', '58786_x': 'Home_34_x', '443661_x': 'Home_17_x', '57328_x': 'Home_17_x', '55037_y': 'Home_8_y', '50471_y': 'Home_18_y', '80146_y': 'Home_9_y', '82403_y': 'Home_11_y', '108413_y': 'Home_12_y', '199670_y': 'Home_22_y', '232787_y': 'Home_23_y', '174874_y': 'Home_16_y', '244723_y': 'Home_3_y', '66975_y': 'Home_4_y', '55494_y': 'Home_2_y', '209036_y': 'Home_6_y', '54861_y': 'Home_20_y', '86417_y': 'Home_15_y', '105666_y': 'Home_1_y', '231747_y': 'Home_14_y', '40836_y': 'Home_13_y', '58786_y': 'Home_34_y', '443661_y': 'Home_17_y', '57328_y': 'Home_17_y', '55037_d': 'Home_8_d', '50471_d': 'Home_18_d', '80146_d': 'Home_9_d', '82403_d': 'Home_11_d', '108413_d': 'Home_12_d', '199670_d': 'Home_22_d', '232787_d': 'Home_23_d', '174874_d': 'Home_16_d', '244723_d': 'Home_3_d', '66975_d': 'Home_4_d', '55494_d': 'Home_2_d', '209036_d': 'Home_6_d', '54861_d': 'Home_20_d', '86417_d': 'Home_15_d', '105666_d': 'Home_1_d', '231747_d': 'Home_14_d', '40836_d': 'Home_13_d', '58786_d': 'Home_34_d', '443661_d': 'Home_17_d', '57328_d': 'Home_17_d', '55037_s': 'Home_8_s', '50471_s': 'Home_18_s', '80146_s': 'Home_9_s', '82403_s': 'Home_11_s', '108413_s': 'Home_12_s', '199670_s': 'Home_22_s', '232787_s': 'Home_23_s', '174874_s': 'Home_16_s', '244723_s': 'Home_3_s', '66975_s': 'Home_4_s', '55494_s': 'Home_2_s', '209036_s': 'Home_6_s', '54861_s': 'Home_20_s', '86417_s': 'Home_15_s', '105666_s': 'Home_1_s', '231747_s': 'Home_14_s', '40836_s': 'Home_13_s', '58786_s': 'Home_34_s', '443661_s': 'Home_17_s', '57328_s': 'Home_17_s', '39155_x': 'Away_14_x', '116216_x': 'Away_11_x', '61933_x': 'Away_24_x', '78916_x': 'Away_33_x', '243505_x': 'Away_15_x', '233425_x': 'Away_7_x', '49262_x': 'Away_23_x', '50175_x': 'Away_18_x', '111478_x': 'Away_34_x', '243016_x': 'Away_10_x', '235382_x': 'Away_17_x', '232792_x': 'Away_2_x', '209045_x': 'Away_30_x', '60307_x': 'Away_13_x', '433589_x': 'Away_28_x', '115382_x': 'Away_9_x', '83299_x': 'Away_5_x', '215059_x': 'Away_1_x', '109345_x': 'Away_20_x', '179268_x': 'Away_3_x', '39155_y': 'Away_14_y', '116216_y': 'Away_11_y', '61933_y': 'Away_24_y', '78916_y': 'Away_33_y', '243505_y': 'Away_15_y', '233425_y': 'Away_7_y', '49262_y': 'Away_23_y', '50175_y': 'Away_18_y', '111478_y': 'Away_34_y', '243016_y': 'Away_10_y', '235382_y': 'Away_17_y', '232792_y': 'Away_2_y', '209045_y': 'Away_30_y', '60307_y': 'Away_13_y', '433589_y': 'Away_28_y', '115382_y': 'Away_9_y', '83299_y': 'Away_5_y', '215059_y': 'Away_1_y', '109345_y': 'Away_20_y', '179268_y': 'Away_3_y', '39155_d': 'Away_14_d', '116216_d': 'Away_11_d', '61933_d': 'Away_24_d', '78916_d': 'Away_33_d', '243505_d': 'Away_15_d', '233425_d': 'Away_7_d', '49262_d': 'Away_23_d', '50175_d': 'Away_18_d', '111478_d': 'Away_34_d', '243016_d': 'Away_10_d', '235382_d': 'Away_17_d', '232792_d': 'Away_2_d', '209045_d': 'Away_30_d', '60307_d': 'Away_13_d', '433589_d': 'Away_28_d', '115382_d': 'Away_9_d', '83299_d': 'Away_5_d', '215059_d': 'Away_1_d', '109345_d': 'Away_20_d', '179268_d': 'Away_3_d', '39155_s': 'Away_14_s', '116216_s': 'Away_11_s', '61933_s': 'Away_24_s', '78916_s': 'Away_33_s', '243505_s': 'Away_15_s', '233425_s': 'Away_7_s', '49262_s': 'Away_23_s', '50175_s': 'Away_18_s', '111478_s': 'Away_34_s', '243016_s': 'Away_10_s', '235382_s': 'Away_17_s', '232792_s': 'Away_2_s', '209045_s': 'Away_30_s', '60307_s': 'Away_13_s', '433589_s': 'Away_28_s', '115382_s': 'Away_9_s', '83299_s': 'Away_5_s', '215059_s': 'Away_1_s', '109345_s': 'Away_20_s', '179268_s': 'Away_3_s' } # In[79]: ## Map Qualifier Names to Qualifier IDs df_tracking_cry_bri = df_tracking_cry_bri.rename(columns=dict_home_away_players_cry_bri) # In[80]: ## Display columns df_tracking_cry_bri.columns # # # ### 4.4. Separate Home and Away DataFrames # For compatibility with [Laurie Shaw](https://twitter.com/EightyFivePoint)'s Metrica Sports Tracking data library [`LaurieOnTracking`](https://github.com/Friends-of-Tracking-Data-FoTD/LaurieOnTracking), Home and Away DataFrames are separated. This is done using the custom `create_home_and_away_dfs` that I have written (see the top the top of the notebook) # In[81]: # Separate Home and Away DataFrames df_tracking_home_cry_lei, df_tracking_away_cry_lei = create_home_and_away_dfs(df_tracking_cry_lei) df_tracking_home_cry_bri, df_tracking_away_cry_bri = create_home_and_away_dfs(df_tracking_cry_bri) # In[82]: # Display Home DataFrame df_tracking_home_cry_lei.head() # In[83]: # Display Away DataFrame df_tracking_away_cry_lei.head() # # # ### 4.5. Calculate Player Velocities # Determine the velocity of each player at a given timestep using the `calc_player_velocities` function from the `mvel` library. # In[84]: # Calculate player velocities ## 03/10/2021: Crystal Palace (2) vs. (2) Leicester City (g2210334) ### Home df_tracking_home_cry_lei = svel.calc_player_velocities(df_tracking_home_cry_lei, smoothing=True, filter_='moving_average', maxspeed=12 ) # filter_='Savitzky-Golay' ### Away df_tracking_away_cry_lei = svel.calc_player_velocities(df_tracking_away_cry_lei, smoothing=True, filter_='moving_average', maxspeed=12 ) ## 27/09/2021: Crystal Palace (1) vs. (1) Brighton & Hove Albion (g2210324) ### Home df_tracking_home_cry_bri = svel.calc_player_velocities(df_tracking_home_cry_bri, smoothing=True, filter_='moving_average', maxspeed=12 ) # filter_='Savitzky-Golay' ### Away df_tracking_away_cry_bri = svel.calc_player_velocities(df_tracking_away_cry_bri, smoothing=True, filter_='moving_average', maxspeed=12 ) # In[85]: # Display Home DataFrame df_tracking_home_cry_lei.head() # In[86]: # Display Away DataFrame df_tracking_away_cry_lei.head() # Exploring the `df_tracking_home` and `df_tracking_away` DataFrames, we can see the additional columns for velocity have been added to the DataFrame. These columns are of the following, but for all the players: # * `Away_101668_vx` # * `Away_101668_vy` # * `Away_101668_speed` i.e. total speed # # These new metrics are all measured in meters per second. # # # ### 4.6. Reverse direction of players # Reverse direction of a player so that home team is alway attacking right->left i.e. no switch at half time # In[87]: # Reverse direction of a player so that home team is alway attacking right->left i.e. no switch at half time df_tracking_home_cry_lei, df_tracking_away_cry_lei = to_single_playing_direction(df_tracking_home_cry_lei, df_tracking_away_cry_lei) df_tracking_home_cry_bri, df_tracking_away_cry_bri = to_single_playing_direction(df_tracking_home_cry_bri, df_tracking_away_cry_bri) # --- # # # # ## 5. Export Final DataFrames # In[88]: # Export DataFrame ## 03/10/2021: Crystal Palace (2) vs. (2) Leicester City (g2210334) df_tracking_home_cry_lei.to_csv(os.path.join(data_dir_second_spectrum, 'engineered', 'data', 'g2210334_SecondSpectrum_Trackingdata_Home.csv'), index=None, header=True) df_tracking_away_cry_lei.to_csv(os.path.join(data_dir_second_spectrum, 'engineered', 'data', 'g2210334_SecondSpectrum_Trackingdata_Away.csv'), index=None, header=True) ## 27/09/2021: Crystal Palace (1) vs. (1) Brighton & Hove Albion (g2210324) df_tracking_home_cry_bri.to_csv(os.path.join(data_dir_second_spectrum, 'engineered', 'data', 'g2210324_SecondSpectrum_Trackingdata_Home.csv'), index=None, header=True) df_tracking_away_cry_bri.to_csv(os.path.join(data_dir_second_spectrum, 'engineered', 'data', 'g2210324_SecondSpectrum_Trackingdata_Away.csv'), index=None, header=True) # --- # # # # ## 6. Summary # This notebook parses and engineers [Second Spectrum](https://www.secondspectrum.com/index.html) Tracking data using [pandas](http://pandas.pydata.org/), [`Kloppy`](https://kloppy.pysport.org/), and a custom library of function, based on [Laurie Shaw](https://twitter.com/EightyFivePoint)'s Metrica Sports Tracking data library [`LaurieOnTracking`](https://github.com/Friends-of-Tracking-Data-FoTD/LaurieOnTracking), that has been modified to be compatible with [Second Spectrum](https://www.secondspectrum.com/index.html) Tracking data, which is a different format. # --- # # # # ## 7. Next Steps # The next stage is to visualise and analyse the tracking data. # --- # # # # ## 8. References # * [Second Spectrum](https://www.secondspectrum.com/index.html) data # * [`Kloppy`](https://kloppy.pysport.org/) # * BBC Sports pages for the two Premier League matches featuring Crystal Palace: # + [27/09/2021: Crystal Palace (1) vs. (1) Brighton & Hove Albion](https://www.bbc.co.uk/sport/football/58620544) (g2210324) # + [03/10/2021: Crystal Palace (2) vs. (2) Leicester City](https://www.bbc.co.uk/sport/football/58667896) (g2210324) # --- # # ***Visit my website [eddwebster.com](https://www.eddwebster.com) or my [GitHub Repository](https://github.com/eddwebster) for more projects. If you'd like to get in contact, my Twitter handle is [@eddwebster](http://www.twitter.com/eddwebster) and my email is: edd.j.webster@gmail.com.*** # [Back to the top](#top)