#!/usr/bin/env python
# coding: utf-8
#
# # Second Spectrum Data Engineering
# ##### Notebook to engineer [Second Spectrum](https://www.secondspectrum.com/index.html) Tracking data using [pandas](http://pandas.pydata.org/).
#
# ### By [Edd Webster](https://www.twitter.com/eddwebster)
# Notebook first written: 20/01/2022
# Notebook last updated: 01/02/2022
#
# ![Second Spectrum](../../img/logos/second_spectrum_logo.jpeg)
#
# ![Watford F.C.](../../img/club_badges/premier_league/watford_fc_logo_small.png)
#
# Click [here](#section4) to jump straight into the Data Engineering section and skip the [Notebook Brief](#section2) and [Data Sources](#section3) sections.
# ___
#
#
# ## Introduction
# This notebook parses and engineers [Second Spectrum](https://www.secondspectrum.com/index.html) Tracking data for two matches provided by [Watford F.C](https://www.watfordfc.com/), using [pandas](http://pandas.pydata.org/) for data manipulation through DataFrames.
#
# For more information about this notebook and the author, I am available through all the following channels:
# * [eddwebster.com](https://www.eddwebster.com/);
# * edd.j.webster@gmail.com;
# * [@eddwebster](https://www.twitter.com/eddwebster);
# * [linkedin.com/in/eddwebster](https://www.linkedin.com/in/eddwebster/);
# * [github/eddwebster](https://github.com/eddwebster/); and
# * [public.tableau.com/profile/edd.webster](https://public.tableau.com/profile/edd.webster).
#
# A static version of this notebook can be found [here](https://nbviewer.org/github/eddwebster/watford/blob/main/notebooks/2_data_engineering/Second%20Spectrum%20Data%20Engineering.ipynb). This notebook has an accompanying [`watford`](https://github.com/eddwebster/watford) GitHub repository and for my full repository of football analysis, see my [`football_analysis`](https://github.com/eddwebster/football_analytics) GitHub repository.
# ___
#
# ## Notebook Contents
# 1. [Notebook Dependencies](#section1)
# 2. [Notebook Brief](#section2)
# 3. [Data Sources](#section3)
# 1. [Introduction](#section3.1)
# 2. [Data Dictionary](#section3.2)
# 3. [Import the Data](#section3.3)
# 4. [Initial Data Handling](#section3.4)
# 4. [Data Engineering](#section4)
# 1. [Reset Index](#section4.1)
# 2. [Rename Columns](#section4.2)
# 3. [Rename Player Columns Using Metadata](#section4.3)
# 4. [Separate Home and Away DataFrames](#section4.4)
# 5. [Calculate Player Velocities](#section4.5)
# 6. [Reverse direction of players](#section4.6)
# 5. [Export Final DataFrames](#section5)
# 6. [Summary](#section6)
# 7. [Next Steps](#section7)
# 8. [References](#section8)
# ___
#
#
#
# ## 1. Notebook Dependencies
#
# This notebook was written using [Python 3](https://docs.python.org/3.7/) and requires the following libraries:
# * [`Jupyter notebooks`](https://jupyter.org/) for the notebook environment in which this project is presented;
# * [`NumPy`](http://www.numpy.org/) for multidimensional array computing; and
# * [`pandas`](http://pandas.pydata.org/) for data analysis and manipulation.
#
# All packages used for this notebook can be obtained by downloading and installing the [Conda](https://anaconda.org/anaconda/conda) distribution, available on all platforms (Windows, Linux and Mac OSX). Step-by-step guides on how to install Anaconda can be found for Windows [here](https://medium.com/@GalarnykMichael/install-python-on-windows-anaconda-c63c7c3d1444) and Mac [here](https://medium.com/@GalarnykMichael/install-python-on-mac-anaconda-ccd9f2014072), as well as in the Anaconda documentation itself [here](https://docs.anaconda.com/anaconda/install/).
# ### Import Libraries and Modules
# In[45]:
# Python ≥3.5 (ideally)
import platform
import sys, getopt
assert sys.version_info >= (3, 5)
import csv
# Import Dependencies
get_ipython().run_line_magic('matplotlib', 'inline')
# Math Operations
import numpy as np
import math
from math import pi
# Datetime
import datetime
from datetime import date
import time
# Data Preprocessing
import pandas as pd
import pandas_profiling as pp
import os
import re
import chardet
import random
from io import BytesIO
from pathlib import Path
# Kloppy
from kloppy import secondspectrum
# Reading Directories
import glob
import os
# Working with JSON
import json
from pandas import json_normalize
# Data Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from matplotlib.colors import LinearSegmentedColormap
from matplotlib import patches
import seaborn as sns
import missingno as msno
import moviepy.editor as mpy
from moviepy.video.io.bindings import mplfig_to_npimage
# Requests and downloads
import tqdm
import requests
# Machine Learning
import scipy.signal as signal
from scipy.spatial import Voronoi, voronoi_plot_2d, Delaunay
# Display in Jupyter
from IPython.display import Image, YouTubeVideo
from IPython.core.display import HTML
# Ignore Warnings
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")
# Print message
print('Setup Complete')
# In[46]:
# Python / module versions used here for reference
print('Python: {}'.format(platform.python_version()))
print('NumPy: {}'.format(np.__version__))
print('pandas: {}'.format(pd.__version__))
print('matplotlib: {}'.format(mpl.__version__))
# ### Defined Filepaths
# In[47]:
# Set up initial paths to subfolders
base_dir = os.path.join('..', '..')
data_dir = os.path.join(base_dir, 'data')
data_dir_second_spectrum = os.path.join(base_dir, 'data', 'second_spectrum')
data_dir_opta = os.path.join(base_dir, 'data', 'opta')
scripts_dir = os.path.join(base_dir, 'scripts')
scripts_dir_second_spectrum = os.path.join(base_dir, 'scripts', 'second_spectrum')
scripts_dir_metrica_sports = os.path.join(base_dir, 'scripts', 'metrica_sports')
img_dir = os.path.join(base_dir, 'img')
fig_dir = os.path.join(base_dir, 'img', 'fig')
fig_dir_second_spectrum = os.path.join(base_dir, 'img', 'fig', 'second_spectrum')
video_dir = os.path.join(base_dir, 'video')
video_dir_second_spectrum = os.path.join(base_dir, 'video', 'second_spectrum')
# ### Custom Libraries for Tracking Data
# Custom libraries for working with the [Second Spectrum](https://www.secondspectrum.com/index.html) data, adapted from scripts originally written by [Laurie Shaw](https://twitter.com/EightyFivePoint) for the [Metrica Sports](https://metrica-sports.com/) data. See the following for his original code [[link](https://github.com/Friends-of-Tracking-Data-FoTD/LaurieOnTracking)].
#
# The modifications to these scripts include the ability to create Pitch Control models without Tracking data.
# In[48]:
# Custom libraries for working with Tracking data
## Define path of scripts
sys.path.insert(0, os.path.abspath(scripts_dir_second_spectrum))
## Second Spectrum scripts - custom scripts derived from Laurie Shaw's Metrica scripts
import Second_Spectrum_IO as sio
import Second_Spectrum_Viz as sviz
import Second_Spectrum_Velocities as svel
import Second_Spectrum_PitchControl as spc
import Second_Spectrum_EPV as sepv
# In[49]:
"""
## Laurie Shaw's custom libraries for working with Metrica Sports data
import Metrica_IO as mio
import Metrica_Viz as mviz
import Metrica_Velocities as mvel
import Metrica_PitchControl as mpc
import Metrica_EPV as mepv
"""
# ### Custom Functions
# In[50]:
# Define function to read in Tracking data using Kloppy
def read_in_tracking_data(filename):
    """
    Function to read in Second Spectrum Tracking data as a pandas DataFrame using Kloppy. Written by Edd Webster / @eddwebster
    """
    csv_path = os.path.join(data_dir_second_spectrum, 'raw', 'data', f'g{filename}_SecondSpectrum_Data.csv')
    # Read in the exported CSV file if it exists; if not, read in the raw JSONL and XML files using Kloppy
    ## If not already saved as a CSV file
    if not os.path.exists(csv_path):
        ### Load the metadata (XML) and raw data (JSONL) files
        dataset = secondspectrum.load(
            meta_data=f'../../data/second_spectrum/raw/metadata/g{filename}_SecondSpectrum_Metadata.xml',
            raw_data=f'../../data/second_spectrum/raw/data/g{filename}_SecondSpectrum_Data.jsonl',
            ### Optional arguments
            additional_meta_data=f'../../data/second_spectrum/raw/metadata/g{filename}_SecondSpectrum_Metadata.json',
            #sample_rate=1/25,
            #limit=100,
            coordinates='secondspectrum',
            #only_alive=True
        )
        ### Convert the Kloppy dataset to a pandas DataFrame
        df_tracking = dataset.to_pandas()
        ### Export the raw DataFrame
        df_tracking.to_csv(csv_path, index=None, header=True)
    ## If already saved, read in the CSV file
    else:
        df_tracking = pd.read_csv(csv_path)
    ## Return the DataFrame
    return df_tracking
# In[51]:
# Define function to create separate DataFrames of the Tracking data for Home and Away
def create_home_and_away_dfs(df_tracking):
    """
    Function to separate the Tracking data into Home and Away DataFrames. Written by Edd Webster / @eddwebster
    """
    # Define columns that are found in both DataFrames
    lst_cols = ['Frame', 'Period', 'Time [s]', 'Ball State', 'Ball Owning Team ID', 'ball_x', 'ball_y']
    # Define columns in the Home DataFrame
    ## Create an empty list for Home columns
    lst_cols_home = []
    ## Extract the Home columns from the DataFrame
    for col in df_tracking.columns:
        if 'Home' in col:
            lst_cols_home.append(col)
    # Define columns in the Away DataFrame
    ## Create an empty list for Away columns
    lst_cols_away = []
    ## Extract the Away columns from the DataFrame
    for col in df_tracking.columns:
        if 'Away' in col:
            lst_cols_away.append(col)
    # Select the columns of interest for the Home and Away DataFrames
    df_tracking_home = df_tracking[lst_cols + lst_cols_home]
    df_tracking_away = df_tracking[lst_cols + lst_cols_away]
    # Return the DataFrames
    return df_tracking_home, df_tracking_away
# In[52]:
# Define function to reverse the direction of players
def to_single_playing_direction(home, away):
    '''
    Flip coordinates in the second half so that each team always shoots in the same direction throughout the match.
    '''
    for team in [home, away]:
        second_half_idx = team.Period.idxmax()
        columns = [c for c in team.columns if c[-1].lower() in ['x', 'y']]
        team.loc[second_half_idx:, columns] *= -1
    return home, away
# ### Notebook Settings
# In[53]:
# Display all columns of displayed pandas DataFrames
pd.set_option('display.max_columns', None)
#pd.set_option('display.max_rows', None)
pd.options.mode.chained_assignment = None
# ---
#
#
#
# ## 2. Notebook Brief
# This notebook parses and engineers [Second Spectrum](https://www.secondspectrum.com/index.html) Tracking data using [pandas](http://pandas.pydata.org/) and [`Kloppy`](https://kloppy.pysport.org/), for two Premier League matches featuring Crystal Palace during the 21/22 season. The two datasets provided are for the following matches:
# * [27/09/2021: Crystal Palace (1) vs. (1) Brighton & Hove Albion](https://www.bbc.co.uk/sport/football/58620544) (g2210324)
# * [03/10/2021: Crystal Palace (2) vs. (2) Leicester City](https://www.bbc.co.uk/sport/football/58667896) (g2210334)
#
# Once engineered, this dataset will be visualised and analysed in conjunction with the corresponding Event data, as part of an opposition analysis piece.
#
#
# **Notebook Conventions**:
# * Variables that refer a `DataFrame` object are prefixed with `df_`.
# * Variables that refer to a collection of `DataFrame` objects (e.g., a list, a set or a dict) are prefixed with `dfs_`.
# ---
#
#
#
# ## 3. Data Sources
#
#
# ### 3.1. Introduction
# [Second Spectrum](https://www.secondspectrum.com/index.html) is a football analytics and data provider ...
#
# ![Second Spectrum](../../img/logos/second_spectrum_logo.jpeg)
#
# The Tracking data represents the location of every player on the pitch at a temporal frequency of 25 Hz, with the corresponding match time specified for each tracking frame.
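# At 25 Hz, frame numbers and elapsed match time are interchangeable. A minimal sketch of the conversion (plain arithmetic, not the real data):

```python
# Convert between frame numbers and elapsed match time at a 25 Hz tracking frequency
FPS = 25

def frame_to_seconds(frame):
    """Elapsed time in seconds for a given frame number."""
    return frame / FPS

def seconds_to_frame(seconds):
    """Frame number closest to a given elapsed time."""
    return round(seconds * FPS)

# A full 45-minute half spans 45 * 60 * 25 = 67,500 frames
print(seconds_to_frame(45 * 60))   # 67500
print(frame_to_seconds(67_500))    # 2700.0
```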
#
# UPDATE THIS
#
#
# ### 3.2. Data Dictionary
# The [Second Spectrum](https://www.secondspectrum.com/index.html) Tracking dataset has fourteen features (columns) with the following definitions and data types:
#
# | Feature | Data type | Definition |
# |-----------------|---------------|----------------|
# | `Frame` | object | |
# | `Period` | object | |
# | `Time [s]` | object | |
# | `Home_11_x` | object | |
# | `Home_11_y` | object | |
# | `Away_8_x` | object | |
# | `Away_8_y` | object | |
#
# UPDATE THIS
#
#
# ### 3.3. Import Data
# The following cells take the raw meta data (`XML`) and tracking data (`JSONL`) files and read them in as a unified [pandas](https://pandas.pydata.org/) DataFrame, using the custom `read_in_tracking_data` function that I have written (see the functions section at the top of the notebook), powered by [`Kloppy`](https://kloppy.pysport.org/).
# In[54]:
# Show files in directory
print(glob.glob(os.path.join(data_dir_second_spectrum, 'raw', 'data/*.csv')))
# In[55]:
# Read in Tracking data using the custom 'read_in_tracking_data' function, powered by Kloppy
df_tracking_cry_bri = read_in_tracking_data(filename='2210324')
df_tracking_cry_lei = read_in_tracking_data(filename='2210334')
#
#
# ### 3.4. Initial Data Handling
# To avoid duplicating commands and to get an idea of what the output of the Tracking data looks like, this section goes through just one of the two Tracking data files, [Crystal Palace vs. Leicester City](https://www.bbc.co.uk/sport/football/58667896) (g2210334).
#
# First check the quality of the dataset by looking at the first and last rows in pandas using the [`head()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [`tail()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) methods.
# In[56]:
# Display the first five rows of the DataFrame, df_tracking_cry_lei
df_tracking_cry_lei.head()
# In[57]:
# Display the last five rows of the DataFrame, df_tracking_cry_lei
df_tracking_cry_lei.tail()
# In[58]:
# Print the shape of the DataFrame, df_tracking_cry_lei
print(df_tracking_cry_lei.shape)
# The DataFrame has 145,621 rows, slightly more than the 25 fps × 60 secs × 90 mins = 135,000 frames we would expect for a match with no stoppage time.
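# This back-of-the-envelope check can be reproduced directly; the extra rows correspond to roughly seven minutes of stoppage time (row count taken from the `.shape` output above):

```python
# Sanity-check the tracking DataFrame row count against the nominal match length
fps = 25                        # tracking frequency (Hz)
expected = fps * 60 * 90        # frames in 90 minutes with no stoppages
actual = 145_621                # rows in df_tracking_cry_lei (from .shape above)

extra_minutes = (actual - expected) / fps / 60
print(expected)                 # 135000
print(round(extra_minutes, 1))  # 7.1
```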
# In[59]:
# Print the column names of the DataFrame, df_tracking_cry_lei
print(df_tracking_cry_lei.columns)
# In[60]:
# Data types of the features of the raw DataFrame, df_tracking_cry_lei
df_tracking_cry_lei.dtypes
# Full details of these attributes and their data types are discussed further in the [Data Dictionary](#section3.2).
# In[61]:
# Displays all columns
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
print(df_tracking_cry_lei.dtypes)
# In[62]:
# Info for the raw DataFrame, df_tracking_cry_lei
df_tracking_cry_lei.info()
# In[63]:
# Plot visualisation of the missing values for each feature of the raw DataFrame, df_tracking_cry_lei
msno.matrix(df_tracking_cry_lei, figsize = (30, 7))
# In[64]:
# Counts of missing values
null_value_stats = df_tracking_cry_lei.isnull().sum(axis=0)
null_value_stats[null_value_stats != 0]
# ---
#
#
#
# ## 4. Data Engineering
# The next step is to wrangle the dataset into a format that’s suitable for analysis.
#
# This section is broken down into the following subsections:
#
# 4.1. [Reset Index](#section4.1)
# 4.2. [Rename Columns](#section4.2)
# 4.3. [Rename Player Columns Using Metadata](#section4.3)
# 4.4. [Separate Home and Away DataFrames](#section4.4)
# 4.5. [Calculate Player Velocities](#section4.5)
# 4.6. [Reverse direction of players](#section4.6)
#
#
# ### 4.1. Reset Index
# Reset the index; the existing index is retained as a column and used to identify the frame number.
# In[65]:
# Reset index, the existing index will be used for the frame number
df_tracking_cry_bri = df_tracking_cry_bri.reset_index(drop=False)
df_tracking_cry_lei = df_tracking_cry_lei.reset_index(drop=False)
#
#
# ### 4.2. Rename Columns
# To work with [Laurie Shaw](https://twitter.com/EightyFivePoint)'s Metrica Sports Tracking data libraries, [`LaurieOnTracking`](https://github.com/Friends-of-Tracking-Data-FoTD/LaurieOnTracking), the data needs to be engineered to match the Metrica schema, which is the following:
#
# | Feature | Data type | Definition |
# |-------------------------------------------|---------------|----------------|
# | `Frame` | int64 | |
# | `Period` | int64 | |
# | `Time [s]` | float64 | |
# | `Home/Away_No._x` (repeated 14 times) | float64 | |
# | `Home/Away_No._y` (repeated 14 times) | float64 | |
# | `ball_x` | float64 | |
# | `ball_y` | float64 | |
#
# To learn more about the Metrica Sports schema, see the official documentation [[link](https://github.com/metrica-sports/sample-data/blob/master/documentation/events-definitions.pdf)].
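# Before renaming, it can be useful to check which columns already fit the Metrica-style pattern. A minimal sketch with hypothetical column names (`check_metrica_schema` is not part of any library):

```python
import re

# Metrica-style player columns look like 'Home_11_x' or 'Away_8_y'
PLAYER_COL = re.compile(r'^(Home|Away)_\d+_[xy]$')
FIXED_COLS = {'Frame', 'Period', 'Time [s]', 'ball_x', 'ball_y'}

def check_metrica_schema(columns):
    """Return the columns that fit neither the fixed set nor the player pattern."""
    return [c for c in columns if c not in FIXED_COLS and not PLAYER_COL.match(c)]

print(check_metrica_schema(['Frame', 'Period', 'Home_11_x', 'ball_x']))  # []
print(check_metrica_schema(['timestamp', 'Home_11_x']))                  # ['timestamp']
```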
# In[66]:
# Rename columns
df_tracking_cry_lei = df_tracking_cry_lei.rename(columns={'period_id': 'Period',
'timestamp': 'Time [s]',
'ball_state': 'Ball State',
'ball_owning_team_id': 'Ball Owning Team ID',
'index': 'Frame'
}
)
# In[67]:
# Rename columns
df_tracking_cry_bri = df_tracking_cry_bri.rename(columns={'period_id': 'Period',
'timestamp': 'Time [s]',
'ball_state': 'Ball State',
'ball_owning_team_id': 'Ball Owning Team ID',
'index': 'Frame'
}
)
# In[68]:
# Display DataFrame
df_tracking_cry_bri.head()
#
#
# ### 4.3. Rename Player Columns Using Metadata
# For compatibility with [Laurie Shaw](https://twitter.com/EightyFivePoint)'s Metrica Sports Tracking data library [`LaurieOnTracking`](https://github.com/Friends-of-Tracking-Data-FoTD/LaurieOnTracking), the column of each player is given a `Home_` or `Away_` prefix.
# ##### a) Crystal Palace (2) vs. (2) Leicester City (g2210334)
# In[69]:
# Read in the Meta data for the corresponding Tracking data
## Define the file name
file_name = 'g2210334'
## Read in the Meta data
df_meta_cry_lei = pd.read_csv(os.path.join(data_dir_second_spectrum, 'raw', 'metadata', f'{file_name}_SecondSpectrum_Metadata_Players.csv'))
# In[70]:
## Prepare DataFrame to create dictionary
## Convert data types
df_meta_cry_lei['optaId'] = df_meta_cry_lei['optaId'].astype(int)
df_meta_cry_lei['name'] = df_meta_cry_lei['name'].astype(str)
## Create a dictionary mapping Opta player IDs to Home/Away designation from the metadata
dict_home_away_players = dict(zip(df_meta_cry_lei['optaId'], df_meta_cry_lei['HA']))
# In[71]:
dict_home_away_players
# In[72]:
# Map the values to the data (currently hard-coded; to be replaced with a loop)
dict_home_away_players_cry_lei = {'86417_x': 'Home_15_x',
'54861_x': 'Home_20_x',
'105666_x': 'Home_1_x',
'66975_x': 'Home_4_x',
'108413_x': 'Home_12_x',
'443661_x': 'Home_7_x',
'244723_x': 'Home_3_x',
'80146_x': 'Home_9_x',
'209036_x': 'Home_6_x',
'40836_x': 'Home_13_x',
'50471_x': 'Home_18_x',
'174874_x': 'Home_16_x',
'55494_x': 'Home_2_x',
'55037_x': 'Home_8_x',
'49413_x': 'Home_5_x',
'232787_x': 'Home_23_x',
'82403_x': 'Home_11_x',
'199670_x': 'Home_22_x',
'58786_x': 'Home_34_x',
'57328_x': 'Home_17_x',
'86417_y': 'Home_15_y',
'54861_y': 'Home_20_y',
'105666_y': 'Home_1_y',
'66975_y': 'Home_4_y',
'108413_y': 'Home_12_y',
'443661_y': 'Home_7_y',
'244723_y': 'Home_3_y',
'80146_y': 'Home_9_y',
'209036_y': 'Home_6_y',
'40836_y': 'Home_13_y',
'50471_y': 'Home_18_y',
'174874_y': 'Home_16_y',
'55494_y': 'Home_2_y',
'55037_y': 'Home_8_y',
'49413_y': 'Home_5_y',
'232787_y': 'Home_23_y',
'82403_y': 'Home_11_y',
'199670_y': 'Home_22_y',
'58786_y': 'Home_34_y',
'57328_y': 'Home_17_y',
'86417_d': 'Home_15_d',
'54861_d': 'Home_20_d',
'105666_d': 'Home_1_d',
'66975_d': 'Home_4_d',
'108413_d': 'Home_12_d',
'443661_d': 'Home_7_d',
'244723_d': 'Home_3_d',
'80146_d': 'Home_9_d',
'209036_d': 'Home_6_d',
'40836_d': 'Home_13_d',
'50471_d': 'Home_18_d',
'174874_d': 'Home_16_d',
'55494_d': 'Home_2_d',
'55037_d': 'Home_8_d',
'49413_d': 'Home_5_d',
'232787_d': 'Home_23_d',
'82403_d': 'Home_11_d',
'199670_d': 'Home_22_d',
'58786_d': 'Home_34_d',
'57328_d': 'Home_17_d',
'86417_s': 'Home_15_s',
'54861_s': 'Home_20_s',
'105666_s': 'Home_1_s',
'66975_s': 'Home_4_s',
'108413_s': 'Home_12_s',
'443661_s': 'Home_7_s',
'244723_s': 'Home_3_s',
'80146_s': 'Home_9_s',
'209036_s': 'Home_6_s',
'40836_s': 'Home_13_s',
'50471_s': 'Home_18_s',
'174874_s': 'Home_16_s',
'55494_s': 'Home_2_s',
'55037_s': 'Home_8_s',
'49413_s': 'Home_5_s',
'232787_s': 'Home_23_s',
'82403_s': 'Home_11_s',
'199670_s': 'Home_22_s',
'58786_s': 'Home_34_s',
'57328_s': 'Home_17_s',
'245419_x': 'Away_29_x',
'166477_x': 'Away_27_x',
'219352_x': 'Away_37_x',
'166989_x': 'Away_8_x',
'173515_x': 'Away_14_x',
'215413_x': 'Away_22_x',
'111931_x': 'Away_21_x',
'218031_x': 'Away_4_x',
'168580_x': 'Away_17_x',
'51938_x': 'Away_11_x',
'95463_x': 'Away_12_x',
'101668_x': 'Away_9_x',
'155569_x': 'Away_18_x',
'172780_x': 'Away_10_x',
'17745_x': 'Away_1_x',
'197469_x': 'Away_20_x',
'93100_x': 'Away_23_x',
'201666_x': 'Away_7_x',
'225902_x': 'Away_42_x',
'40146_x': 'Away_5_x',
'245419_y': 'Away_29_y',
'166477_y': 'Away_27_y',
'219352_y': 'Away_37_y',
'166989_y': 'Away_8_y',
'173515_y': 'Away_14_y',
'215413_y': 'Away_22_y',
'111931_y': 'Away_21_y',
'218031_y': 'Away_4_y',
'168580_y': 'Away_17_y',
'51938_y': 'Away_11_y',
'95463_y': 'Away_12_y',
'101668_y': 'Away_9_y',
'155569_y': 'Away_18_y',
'172780_y': 'Away_10_y',
'17745_y': 'Away_1_y',
'197469_y': 'Away_20_y',
'93100_y': 'Away_23_y',
'201666_y': 'Away_7_y',
'225902_y': 'Away_42_y',
'40146_y': 'Away_5_y',
'245419_d': 'Away_29_d',
'166477_d': 'Away_27_d',
'219352_d': 'Away_37_d',
'166989_d': 'Away_8_d',
'173515_d': 'Away_14_d',
'215413_d': 'Away_22_d',
'111931_d': 'Away_21_d',
'218031_d': 'Away_4_d',
'168580_d': 'Away_17_d',
'51938_d': 'Away_11_d',
'95463_d': 'Away_12_d',
'101668_d': 'Away_9_d',
'155569_d': 'Away_18_d',
'172780_d': 'Away_10_d',
'17745_d': 'Away_1_d',
'197469_d': 'Away_20_d',
'93100_d': 'Away_23_d',
'201666_d': 'Away_7_d',
'225902_d': 'Away_42_d',
'40146_d': 'Away_5_d',
'245419_s': 'Away_29_s',
'166477_s': 'Away_27_s',
'219352_s': 'Away_37_s',
'166989_s': 'Away_8_s',
'173515_s': 'Away_14_s',
'215413_s': 'Away_22_s',
'111931_s': 'Away_21_s',
'218031_s': 'Away_4_s',
'168580_s': 'Away_17_s',
'51938_s': 'Away_11_s',
'95463_s': 'Away_12_s',
'101668_s': 'Away_9_s',
'155569_s': 'Away_18_s',
'172780_s': 'Away_10_s',
'17745_s': 'Away_1_s',
'197469_s': 'Away_20_s',
'93100_s': 'Away_23_s',
'201666_s': 'Away_7_s',
'225902_s': 'Away_42_s',
'40146_s': 'Away_5_s'
}
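# Rather than hard-coding the dictionary above, it could be generated from the players metadata. A sketch, assuming the metadata provides 'optaId', 'HA' ('Home'/'Away') and a shirt-number column, here called 'number' (a hypothetical column name):

```python
import pandas as pd

def build_rename_dict(df_meta):
    """Build {'<optaId>_<suffix>': '<HA>_<number>_<suffix>'} for the x, y, distance and speed columns."""
    rename = {}
    for _, row in df_meta.iterrows():
        for suffix in ['x', 'y', 'd', 's']:
            rename[f"{row['optaId']}_{suffix}"] = f"{row['HA']}_{row['number']}_{suffix}"
    return rename

# Tiny example with made-up metadata rows ('number' is an assumed shirt-number column)
df_meta = pd.DataFrame({'optaId': [86417, 245419],
                        'HA': ['Home', 'Away'],
                        'number': [15, 29]})
dict_rename = build_rename_dict(df_meta)
print(dict_rename['86417_x'])    # Home_15_x
print(dict_rename['245419_s'])   # Away_29_s
```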
# In[73]:
## Rename the player ID columns using the mapping dictionary
df_tracking_cry_lei = df_tracking_cry_lei.rename(columns=dict_home_away_players_cry_lei)
# In[74]:
## Display columns
df_tracking_cry_lei.columns
# ##### b) Crystal Palace (1) vs. (1) Brighton & Hove Albion (g2210324)
# In[75]:
# Read in the Meta data for the corresponding Tracking data
## Define the file name
file_name = 'g2210324'
## Read in the Meta data
df_meta_cry_bri = pd.read_csv(os.path.join(data_dir_second_spectrum, 'raw', 'metadata', f'{file_name}_SecondSpectrum_Metadata_Players.csv'))
# In[76]:
## Prepare DataFrame to create dictionary
## Convert data types
df_meta_cry_bri['optaId'] = df_meta_cry_bri['optaId'].astype(int)
df_meta_cry_bri['name'] = df_meta_cry_bri['name'].astype(str)
## Create a dictionary mapping Opta player IDs to Home/Away designation from the metadata
dict_home_away_players = dict(zip(df_meta_cry_bri['optaId'], df_meta_cry_bri['HA']))
# In[77]:
dict_home_away_players
# In[78]:
# Map the values to the data (currently hard-coded; to be replaced with a loop)
dict_home_away_players_cry_bri = {'55037_x': 'Home_8_x',
'50471_x': 'Home_18_x',
'80146_x': 'Home_9_x',
'82403_x': 'Home_11_x',
'108413_x': 'Home_12_x',
'199670_x': 'Home_22_x',
'232787_x': 'Home_23_x',
'174874_x': 'Home_16_x',
'244723_x': 'Home_3_x',
'66975_x': 'Home_4_x',
'55494_x': 'Home_2_x',
'209036_x': 'Home_6_x',
'54861_x': 'Home_20_x',
'86417_x': 'Home_15_x',
'105666_x': 'Home_1_x',
'231747_x': 'Home_14_x',
'40836_x': 'Home_13_x',
'58786_x': 'Home_34_x',
'443661_x': 'Home_7_x',
'57328_x': 'Home_17_x',
'55037_y': 'Home_8_y',
'50471_y': 'Home_18_y',
'80146_y': 'Home_9_y',
'82403_y': 'Home_11_y',
'108413_y': 'Home_12_y',
'199670_y': 'Home_22_y',
'232787_y': 'Home_23_y',
'174874_y': 'Home_16_y',
'244723_y': 'Home_3_y',
'66975_y': 'Home_4_y',
'55494_y': 'Home_2_y',
'209036_y': 'Home_6_y',
'54861_y': 'Home_20_y',
'86417_y': 'Home_15_y',
'105666_y': 'Home_1_y',
'231747_y': 'Home_14_y',
'40836_y': 'Home_13_y',
'58786_y': 'Home_34_y',
'443661_y': 'Home_7_y',
'57328_y': 'Home_17_y',
'55037_d': 'Home_8_d',
'50471_d': 'Home_18_d',
'80146_d': 'Home_9_d',
'82403_d': 'Home_11_d',
'108413_d': 'Home_12_d',
'199670_d': 'Home_22_d',
'232787_d': 'Home_23_d',
'174874_d': 'Home_16_d',
'244723_d': 'Home_3_d',
'66975_d': 'Home_4_d',
'55494_d': 'Home_2_d',
'209036_d': 'Home_6_d',
'54861_d': 'Home_20_d',
'86417_d': 'Home_15_d',
'105666_d': 'Home_1_d',
'231747_d': 'Home_14_d',
'40836_d': 'Home_13_d',
'58786_d': 'Home_34_d',
'443661_d': 'Home_7_d',
'57328_d': 'Home_17_d',
'55037_s': 'Home_8_s',
'50471_s': 'Home_18_s',
'80146_s': 'Home_9_s',
'82403_s': 'Home_11_s',
'108413_s': 'Home_12_s',
'199670_s': 'Home_22_s',
'232787_s': 'Home_23_s',
'174874_s': 'Home_16_s',
'244723_s': 'Home_3_s',
'66975_s': 'Home_4_s',
'55494_s': 'Home_2_s',
'209036_s': 'Home_6_s',
'54861_s': 'Home_20_s',
'86417_s': 'Home_15_s',
'105666_s': 'Home_1_s',
'231747_s': 'Home_14_s',
'40836_s': 'Home_13_s',
'58786_s': 'Home_34_s',
'443661_s': 'Home_7_s',
'57328_s': 'Home_17_s',
'39155_x': 'Away_14_x',
'116216_x': 'Away_11_x',
'61933_x': 'Away_24_x',
'78916_x': 'Away_33_x',
'243505_x': 'Away_15_x',
'233425_x': 'Away_7_x',
'49262_x': 'Away_23_x',
'50175_x': 'Away_18_x',
'111478_x': 'Away_34_x',
'243016_x': 'Away_10_x',
'235382_x': 'Away_17_x',
'232792_x': 'Away_2_x',
'209045_x': 'Away_30_x',
'60307_x': 'Away_13_x',
'433589_x': 'Away_28_x',
'115382_x': 'Away_9_x',
'83299_x': 'Away_5_x',
'215059_x': 'Away_1_x',
'109345_x': 'Away_20_x',
'179268_x': 'Away_3_x',
'39155_y': 'Away_14_y',
'116216_y': 'Away_11_y',
'61933_y': 'Away_24_y',
'78916_y': 'Away_33_y',
'243505_y': 'Away_15_y',
'233425_y': 'Away_7_y',
'49262_y': 'Away_23_y',
'50175_y': 'Away_18_y',
'111478_y': 'Away_34_y',
'243016_y': 'Away_10_y',
'235382_y': 'Away_17_y',
'232792_y': 'Away_2_y',
'209045_y': 'Away_30_y',
'60307_y': 'Away_13_y',
'433589_y': 'Away_28_y',
'115382_y': 'Away_9_y',
'83299_y': 'Away_5_y',
'215059_y': 'Away_1_y',
'109345_y': 'Away_20_y',
'179268_y': 'Away_3_y',
'39155_d': 'Away_14_d',
'116216_d': 'Away_11_d',
'61933_d': 'Away_24_d',
'78916_d': 'Away_33_d',
'243505_d': 'Away_15_d',
'233425_d': 'Away_7_d',
'49262_d': 'Away_23_d',
'50175_d': 'Away_18_d',
'111478_d': 'Away_34_d',
'243016_d': 'Away_10_d',
'235382_d': 'Away_17_d',
'232792_d': 'Away_2_d',
'209045_d': 'Away_30_d',
'60307_d': 'Away_13_d',
'433589_d': 'Away_28_d',
'115382_d': 'Away_9_d',
'83299_d': 'Away_5_d',
'215059_d': 'Away_1_d',
'109345_d': 'Away_20_d',
'179268_d': 'Away_3_d',
'39155_s': 'Away_14_s',
'116216_s': 'Away_11_s',
'61933_s': 'Away_24_s',
'78916_s': 'Away_33_s',
'243505_s': 'Away_15_s',
'233425_s': 'Away_7_s',
'49262_s': 'Away_23_s',
'50175_s': 'Away_18_s',
'111478_s': 'Away_34_s',
'243016_s': 'Away_10_s',
'235382_s': 'Away_17_s',
'232792_s': 'Away_2_s',
'209045_s': 'Away_30_s',
'60307_s': 'Away_13_s',
'433589_s': 'Away_28_s',
'115382_s': 'Away_9_s',
'83299_s': 'Away_5_s',
'215059_s': 'Away_1_s',
'109345_s': 'Away_20_s',
'179268_s': 'Away_3_s'
}
# In[79]:
## Rename the player ID columns using the mapping dictionary
df_tracking_cry_bri = df_tracking_cry_bri.rename(columns=dict_home_away_players_cry_bri)
# In[80]:
## Display columns
df_tracking_cry_bri.columns
#
#
# ### 4.4. Separate Home and Away DataFrames
# For compatibility with [Laurie Shaw](https://twitter.com/EightyFivePoint)'s Metrica Sports Tracking data library [`LaurieOnTracking`](https://github.com/Friends-of-Tracking-Data-FoTD/LaurieOnTracking), Home and Away DataFrames are separated. This is done using the custom `create_home_and_away_dfs` function that I have written (see the functions section at the top of the notebook).
# In[81]:
# Separate Home and Away DataFrames
df_tracking_home_cry_lei, df_tracking_away_cry_lei = create_home_and_away_dfs(df_tracking_cry_lei)
df_tracking_home_cry_bri, df_tracking_away_cry_bri = create_home_and_away_dfs(df_tracking_cry_bri)
# In[82]:
# Display Home DataFrame
df_tracking_home_cry_lei.head()
# In[83]:
# Display Away DataFrame
df_tracking_away_cry_lei.head()
#
#
# ### 4.5. Calculate Player Velocities
# Determine the velocity of each player at a given timestep using the `calc_player_velocities` function from the `svel` library.
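# The general idea behind `calc_player_velocities` — first differences of position over the 25 Hz timestamps, a maximum-speed cut to drop tracking glitches, and optional smoothing — can be sketched on synthetic data as follows. This is an illustration of the technique, not the library function itself:

```python
import numpy as np
import pandas as pd

def simple_velocities(df, player, maxspeed=12, window=7):
    """Estimate vx, vy and speed (m/s) for one player via first differences,
    with a cap on implausible speeds and a centred moving-average smooth."""
    dt = df['Time [s]'].diff()
    vx = df[f'{player}_x'].diff() / dt
    vy = df[f'{player}_y'].diff() / dt
    raw_speed = np.sqrt(vx**2 + vy**2)
    # Discard frames faster than a human sprint (likely tracking glitches)
    vx[raw_speed > maxspeed] = np.nan
    vy[raw_speed > maxspeed] = np.nan
    # Moving-average smoothing, mirroring the 'moving_average' filter option
    vx = vx.rolling(window, min_periods=1, center=True).mean()
    vy = vy.rolling(window, min_periods=1, center=True).mean()
    df[f'{player}_vx'] = vx
    df[f'{player}_vy'] = vy
    df[f'{player}_speed'] = np.sqrt(vx**2 + vy**2)
    return df

# Synthetic player moving 0.2 m per frame (5 m/s) along x at 25 Hz
n = 10
df = pd.DataFrame({'Time [s]': np.arange(n) / 25,
                   'Home_15_x': 0.2 * np.arange(n),
                   'Home_15_y': np.zeros(n)})
df = simple_velocities(df, 'Home_15')
print(round(df['Home_15_speed'].iloc[5], 6))   # 5.0
```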
# In[84]:
# Calculate player velocities
## 03/10/2021: Crystal Palace (2) vs. (2) Leicester City (g2210334)
### Home
df_tracking_home_cry_lei = svel.calc_player_velocities(df_tracking_home_cry_lei,
smoothing=True,
filter_='moving_average',
maxspeed=12
) # filter_='Savitzky-Golay'
### Away
df_tracking_away_cry_lei = svel.calc_player_velocities(df_tracking_away_cry_lei,
smoothing=True,
filter_='moving_average',
maxspeed=12
)
## 27/09/2021: Crystal Palace (1) vs. (1) Brighton & Hove Albion (g2210324)
### Home
df_tracking_home_cry_bri = svel.calc_player_velocities(df_tracking_home_cry_bri,
smoothing=True,
filter_='moving_average',
maxspeed=12
) # filter_='Savitzky-Golay'
### Away
df_tracking_away_cry_bri = svel.calc_player_velocities(df_tracking_away_cry_bri,
smoothing=True,
filter_='moving_average',
maxspeed=12
)
# In[85]:
# Display Home DataFrame
df_tracking_home_cry_lei.head()
# In[86]:
# Display Away DataFrame
df_tracking_away_cry_lei.head()
# Exploring the `df_tracking_home` and `df_tracking_away` DataFrames, we can see that additional velocity columns have been added for every player, of the following form:
# * `Away_9_vx`
# * `Away_9_vy`
# * `Away_9_speed` i.e. total speed
#
# These new metrics are all measured in meters per second.
#
#
# ### 4.6. Reverse direction of players
# Reverse the direction of play in the second half so that the home team is always attacking right->left, i.e. there is no switch of ends at half time.
# In[87]:
# Reverse the direction of play so that the home team is always attacking right->left, i.e. no switch at half time
df_tracking_home_cry_lei, df_tracking_away_cry_lei = to_single_playing_direction(df_tracking_home_cry_lei, df_tracking_away_cry_lei)
df_tracking_home_cry_bri, df_tracking_away_cry_bri = to_single_playing_direction(df_tracking_home_cry_bri, df_tracking_away_cry_bri)
# ---
#
#
#
# ## 5. Export Final DataFrames
# Finally, export the engineered Home and Away Tracking DataFrames as CSV files.
# In[88]:
# Export DataFrame
## 03/10/2021: Crystal Palace (2) vs. (2) Leicester City (g2210334)
df_tracking_home_cry_lei.to_csv(os.path.join(data_dir_second_spectrum, 'engineered', 'data', 'g2210334_SecondSpectrum_Trackingdata_Home.csv'), index=None, header=True)
df_tracking_away_cry_lei.to_csv(os.path.join(data_dir_second_spectrum, 'engineered', 'data', 'g2210334_SecondSpectrum_Trackingdata_Away.csv'), index=None, header=True)
## 27/09/2021: Crystal Palace (1) vs. (1) Brighton & Hove Albion (g2210324)
df_tracking_home_cry_bri.to_csv(os.path.join(data_dir_second_spectrum, 'engineered', 'data', 'g2210324_SecondSpectrum_Trackingdata_Home.csv'), index=None, header=True)
df_tracking_away_cry_bri.to_csv(os.path.join(data_dir_second_spectrum, 'engineered', 'data', 'g2210324_SecondSpectrum_Trackingdata_Away.csv'), index=None, header=True)
# ---
#
#
#
# ## 6. Summary
# This notebook parses and engineers [Second Spectrum](https://www.secondspectrum.com/index.html) Tracking data using [pandas](http://pandas.pydata.org/), [`Kloppy`](https://kloppy.pysport.org/), and a custom library of functions based on [Laurie Shaw](https://twitter.com/EightyFivePoint)'s Metrica Sports Tracking data library [`LaurieOnTracking`](https://github.com/Friends-of-Tracking-Data-FoTD/LaurieOnTracking), modified to be compatible with the Second Spectrum Tracking data format.
# ---
#
#
#
# ## 7. Next Steps
# The next stage is to visualise and analyse the tracking data.
# ---
#
#
#
# ## 8. References
# * [Second Spectrum](https://www.secondspectrum.com/index.html) data
# * [`Kloppy`](https://kloppy.pysport.org/)
# * BBC Sports pages for the two Premier League matches featuring Crystal Palace:
# + [27/09/2021: Crystal Palace (1) vs. (1) Brighton & Hove Albion](https://www.bbc.co.uk/sport/football/58620544) (g2210324)
# + [03/10/2021: Crystal Palace (2) vs. (2) Leicester City](https://www.bbc.co.uk/sport/football/58667896) (g2210334)
# ---
#
# ***Visit my website [eddwebster.com](https://www.eddwebster.com) or my [GitHub Repository](https://github.com/eddwebster) for more projects. If you'd like to get in contact, my Twitter handle is [@eddwebster](http://www.twitter.com/eddwebster) and my email is: edd.j.webster@gmail.com.***
# [Back to the top](#top)