Last updated: 26/04/2020
Click here to jump straight to the Exploratory Data Analysis section and skip the Task Brief, Data Sources, and Data Engineering sections. Or click here to jump straight to the Conclusion.
This notebook is a short Exploratory Data Analysis (EDA) of Football Events data using pandas DataFrames and matplotlib visualisations.
For more information about this notebook and the author, I'm available through all the following channels:
This notebook was written using Python 3 and requires the following libraries:
All packages used for this notebook except for BeautifulSoup can be obtained by downloading and installing the Anaconda distribution, available on all platforms (Windows, Linux and macOS). Step-by-step guides on how to install Anaconda can be found for Windows here and Mac here, as well as in the Anaconda documentation itself here.
# Import modules
# Python ≥3.6 is required (f-strings are used further down)
import platform
import sys
assert sys.version_info >= (3, 6)
# Import Dependencies
%matplotlib inline
# Math Operations
import numpy as np
# Data Preprocessing
import pandas as pd
import os # used to read the csv filenames
import re
import random
# Working with JSON
import json
# json_normalize is called as pd.json_normalize below (available from pandas 1.0)
# Football libraries
## from FCPython import createPitch
# Data Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
## plt.style.use('seaborn-whitegrid')
from matplotlib.patches import Arc, Rectangle, ConnectionPatch
from matplotlib.offsetbox import OffsetImage
import missingno as msno # not included with Conda, 'pip install missingno' in the terminal if you don't have it
import squarify # pip install squarify
from functools import reduce
# Machine Learning
import scipy as sp
# Display in Jupyter
from IPython.display import Image, YouTubeVideo
from IPython.core.display import HTML
# Ignore warnings
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")
print("Setup Complete")
Setup Complete
# Python / module versions used here for reference
print('Python: {}'.format(platform.python_version()))
print('NumPy: {}'.format(np.__version__))
print('pandas: {}'.format(pd.__version__))
print('matplotlib: {}'.format(mpl.__version__))
print('Seaborn: {}'.format(sns.__version__))
print('SciPy: {}'.format(sp.__version__))
Python: 3.7.6 NumPy: 1.18.1 pandas: 1.0.1 matplotlib: 3.1.3 Seaborn: 0.10.0 SciPy: 1.4.1
This workbook looks at how to load and manipulate football data (JSON files) from the StatsBomb repository in a Jupyter notebook.
The data comes from StatsBomb.
This section uses the pandas library to import our data to this workbook as a DataFrame.
The learning goals for this video are:
# Set up initial path to data
dataDir = r'data/'
# Load the Statsbomb competition file
with open(dataDir + 'statsbomb/competitions.json') as f:
    competitions = json.load(f)
competitions
[{'competition_id': 37, 'season_id': 42, 'country_name': 'England', 'competition_name': "FA Women's Super League", 'competition_gender': 'female', 'season_name': '2019/2020', 'match_updated': '2020-03-11T14:09:41.932138', 'match_available': '2020-03-11T14:09:41.932138'}, {'competition_id': 37, 'season_id': 4, 'country_name': 'England', 'competition_name': "FA Women's Super League", 'competition_gender': 'female', 'season_name': '2018/2019', 'match_updated': '2020-02-27T15:59:58.148', 'match_available': '2020-02-27T15:59:58.148'}, {'competition_id': 43, 'season_id': 3, 'country_name': 'International', 'competition_name': 'FIFA World Cup', 'competition_gender': 'male', 'season_name': '2018', 'match_updated': '2019-12-16T23:09:16.168756', 'match_available': '2019-12-16T23:09:16.168756'}, {'competition_id': 11, 'season_id': 4, 'country_name': 'Spain', 'competition_name': 'La Liga', 'competition_gender': 'male', 'season_name': '2018/2019', 'match_updated': '2020-02-27T12:19:39.458017', 'match_available': '2020-02-27T12:19:39.458017'}, {'competition_id': 11, 'season_id': 1, 'country_name': 'Spain', 'competition_name': 'La Liga', 'competition_gender': 'male', 'season_name': '2017/2018', 'match_updated': '2020-02-27T12:19:39.458017', 'match_available': '2020-02-27T12:19:39.458017'}, {'competition_id': 11, 'season_id': 2, 'country_name': 'Spain', 'competition_name': 'La Liga', 'competition_gender': 'male', 'season_name': '2016/2017', 'match_updated': '2020-04-01T14:15:08.846728', 'match_available': '2019-12-16T23:09:16.168756'}, {'competition_id': 11, 'season_id': 27, 'country_name': 'Spain', 'competition_name': 'La Liga', 'competition_gender': 'male', 'season_name': '2015/2016', 'match_updated': '2019-12-16T23:09:16.168756', 'match_available': '2019-12-16T23:09:16.168756'}, {'competition_id': 11, 'season_id': 26, 'country_name': 'Spain', 'competition_name': 'La Liga', 'competition_gender': 'male', 'season_name': '2014/2015', 'match_updated': '2019-12-16T23:09:16.168756', 
'match_available': '2019-12-16T23:09:16.168756'}, {'competition_id': 11, 'season_id': 25, 'country_name': 'Spain', 'competition_name': 'La Liga', 'competition_gender': 'male', 'season_name': '2013/2014', 'match_updated': '2019-12-16T23:09:16.168756', 'match_available': '2019-12-16T23:09:16.168756'}, {'competition_id': 11, 'season_id': 24, 'country_name': 'Spain', 'competition_name': 'La Liga', 'competition_gender': 'male', 'season_name': '2012/2013', 'match_updated': '2019-12-16T23:09:16.168756', 'match_available': '2019-12-16T23:09:16.168756'}, {'competition_id': 11, 'season_id': 23, 'country_name': 'Spain', 'competition_name': 'La Liga', 'competition_gender': 'male', 'season_name': '2011/2012', 'match_updated': '2019-12-16T23:09:16.168756', 'match_available': '2019-12-16T23:09:16.168756'}, {'competition_id': 11, 'season_id': 22, 'country_name': 'Spain', 'competition_name': 'La Liga', 'competition_gender': 'male', 'season_name': '2010/2011', 'match_updated': '2020-04-09T13:13:49.345111', 'match_available': '2020-04-09T13:13:49.345111'}, {'competition_id': 11, 'season_id': 21, 'country_name': 'Spain', 'competition_name': 'La Liga', 'competition_gender': 'male', 'season_name': '2009/2010', 'match_updated': '2019-12-16T23:09:16.168756', 'match_available': '2019-12-16T23:09:16.168756'}, {'competition_id': 11, 'season_id': 41, 'country_name': 'Spain', 'competition_name': 'La Liga', 'competition_gender': 'male', 'season_name': '2008/2009', 'match_updated': '2019-12-16T23:09:16.168756', 'match_available': '2019-12-16T23:09:16.168756'}, {'competition_id': 11, 'season_id': 40, 'country_name': 'Spain', 'competition_name': 'La Liga', 'competition_gender': 'male', 'season_name': '2007/2008', 'match_updated': '2019-12-16T23:09:16.168756', 'match_available': '2019-12-16T23:09:16.168756'}, {'competition_id': 11, 'season_id': 39, 'country_name': 'Spain', 'competition_name': 'La Liga', 'competition_gender': 'male', 'season_name': '2006/2007', 'match_updated': 
'2019-12-16T23:09:16.168756', 'match_available': '2019-12-16T23:09:16.168756'}, {'competition_id': 11, 'season_id': 38, 'country_name': 'Spain', 'competition_name': 'La Liga', 'competition_gender': 'male', 'season_name': '2005/2006', 'match_updated': '2020-02-27T12:19:39.458017', 'match_available': '2020-02-27T12:19:39.458017'}, {'competition_id': 11, 'season_id': 37, 'country_name': 'Spain', 'competition_name': 'La Liga', 'competition_gender': 'male', 'season_name': '2004/2005', 'match_updated': '2019-12-16T23:09:16.168756', 'match_available': '2019-12-16T23:09:16.168756'}, {'competition_id': 49, 'season_id': 3, 'country_name': 'United States of America', 'competition_name': 'NWSL', 'competition_gender': 'female', 'season_name': '2018', 'match_updated': '2020-02-27T15:22:21.167136', 'match_available': '2020-01-26T02:37:05.981617'}, {'competition_id': 72, 'season_id': 30, 'country_name': 'International', 'competition_name': "Women's World Cup", 'competition_gender': 'female', 'season_name': '2019', 'match_updated': '2020-02-27T12:19:39.458017', 'match_available': '2020-02-27T12:19:39.458017'}]
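Since this section is about getting the data into pandas, here is a minimal sketch of how the competition list could be turned into a DataFrame and filtered. The two records are copied from the output above; `sample_competitions`, `competitions_df` and `world_cup` are names I made up for illustration, and in the notebook itself you would pass the full `competitions` list instead.

```python
import pandas as pd

# Two records copied from the competitions output above; in the notebook
# the full `competitions` list loaded from competitions.json would be used.
sample_competitions = [
    {"competition_id": 43, "season_id": 3, "country_name": "International",
     "competition_name": "FIFA World Cup", "competition_gender": "male",
     "season_name": "2018"},
    {"competition_id": 11, "season_id": 4, "country_name": "Spain",
     "competition_name": "La Liga", "competition_gender": "male",
     "season_name": "2018/2019"},
]

competitions_df = pd.DataFrame(sample_competitions)

# Filter to the men's World Cup and read off the two IDs that appear in
# the matches file path used in the next cell.
world_cup = competitions_df[competitions_df["competition_name"] == "FIFA World Cup"]
competition_id = int(world_cup["competition_id"].iloc[0])
season_id = int(world_cup["season_id"].iloc[0])
print(competition_id, season_id)
```

Looking the IDs up this way avoids hard-coding them, which helps if you later want to switch to another competition.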
# Men's World Cup 2018 has competition ID 43
competition_id = 43
# Load the list of matches for this competition
with open(dataDir + 'statsbomb/matches/' + str(competition_id) + '/3.json') as f:  # 3 is the season_id of the 2018 World Cup
    matches = json.load(f)
# View contents of the match JSON file, commented out as it's long
# matches
# See the first match in the dataset - Australia vs. Peru
matches[0]
{'match_id': 7562, 'match_date': '2018-06-26', 'kick_off': '16:00:00.000', 'competition': {'competition_id': 43, 'country_name': 'International', 'competition_name': 'FIFA World Cup'}, 'season': {'season_id': 3, 'season_name': '2018'}, 'home_team': {'home_team_id': 792, 'home_team_name': 'Australia', 'home_team_gender': 'male', 'home_team_group': 'Group C', 'country': {'id': 14, 'name': 'Australia'}, 'managers': [{'id': 630, 'name': 'Bert van Marwijk', 'nickname': None, 'dob': '1952-05-19', 'country': {'id': 160, 'name': 'Netherlands'}}]}, 'away_team': {'away_team_id': 784, 'away_team_name': 'Peru', 'away_team_gender': 'male', 'away_team_group': 'Group C', 'country': {'id': 179, 'name': 'Peru'}, 'managers': [{'id': 629, 'name': 'Ricardo Alberto Gareca Nardi', 'nickname': None, 'dob': '1958-02-10', 'country': {'id': 11, 'name': 'Argentina'}}]}, 'home_score': 0, 'away_score': 2, 'match_status': 'available', 'last_updated': '2019-12-16T23:09:16.168756', 'metadata': {'data_version': '1.0.2'}, 'match_week': 3, 'competition_stage': {'id': 10, 'name': 'Group Stage'}, 'stadium': {'id': 249, 'name': 'Olimpiyskiy Stadion Fisht', 'country': {'id': 188, 'name': 'Russia'}}, 'referee': {'id': 725, 'name': 'S. Karasev'}}
# See the away team for the first match in the dataset
matches[0]['away_team']
{'away_team_id': 784, 'away_team_name': 'Peru', 'away_team_gender': 'male', 'away_team_group': 'Group C', 'country': {'id': 179, 'name': 'Peru'}, 'managers': [{'id': 629, 'name': 'Ricardo Alberto Gareca Nardi', 'nickname': None, 'dob': '1958-02-10', 'country': {'id': 11, 'name': 'Argentina'}}]}
# See the away team name for the first match in the dataset
matches[0]['away_team']['away_team_name']
'Peru'
Print out the result list for the Men's World Cup
# Print all the match results
for match in matches:
    home_team_name = match['home_team']['home_team_name']
    away_team_name = match['away_team']['away_team_name']
    home_score = match['home_score']
    away_score = match['away_score']
    describe_text = f"The match between {home_team_name} and {away_team_name}"
    result_text = f" finished {home_score} : {away_score}"
    print(describe_text + result_text)
The match between Australia and Peru finished 0 : 2 The match between Nigeria and Iceland finished 2 : 0 The match between Serbia and Brazil finished 0 : 2 The match between Croatia and Denmark finished 1 : 1 The match between Iran and Portugal finished 1 : 1 The match between Mexico and Sweden finished 0 : 3 The match between Brazil and Costa Rica finished 2 : 0 The match between Germany and Mexico finished 0 : 1 The match between Portugal and Spain finished 3 : 3 The match between Russia and Egypt finished 3 : 1 The match between Switzerland and Costa Rica finished 2 : 2 The match between Panama and Tunisia finished 1 : 2 The match between England and Belgium finished 0 : 1 The match between France and Belgium finished 1 : 0 The match between Belgium and England finished 2 : 0 The match between Iran and Spain finished 0 : 1 The match between Uruguay and Russia finished 3 : 0 The match between Croatia and Nigeria finished 2 : 0 The match between Brazil and Belgium finished 1 : 2 The match between France and Peru finished 1 : 0 The match between Tunisia and England finished 1 : 2 The match between France and Argentina finished 4 : 3 The match between Argentina and Croatia finished 0 : 3 The match between Belgium and Panama finished 3 : 0 The match between France and Australia finished 2 : 1 The match between Costa Rica and Serbia finished 0 : 1 The match between Belgium and Japan finished 3 : 2 The match between Uruguay and France finished 0 : 2 The match between Sweden and South Korea finished 1 : 0 The match between Brazil and Switzerland finished 1 : 1 The match between Spain and Morocco finished 2 : 2 The match between Russia and Saudi Arabia finished 5 : 0 The match between Peru and Denmark finished 0 : 1 The match between Poland and Colombia finished 0 : 3 The match between Senegal and Colombia finished 0 : 1 The match between Russia and Croatia finished 2 : 2 The match between South Korea and Mexico finished 1 : 2 The match between Saudi Arabia and Egypt 
finished 2 : 1 The match between Morocco and Iran finished 0 : 1 The match between Poland and Senegal finished 1 : 2 The match between Serbia and Switzerland finished 1 : 2 The match between Japan and Senegal finished 2 : 2 The match between Iceland and Croatia finished 1 : 2 The match between Sweden and Switzerland finished 1 : 0 The match between Denmark and France finished 0 : 0 The match between Uruguay and Portugal finished 2 : 1 The match between Egypt and Uruguay finished 0 : 1 The match between Sweden and England finished 0 : 2 The match between Argentina and Iceland finished 1 : 1 The match between Belgium and Tunisia finished 5 : 2 The match between France and Croatia finished 4 : 2 The match between Colombia and England finished 1 : 1 The match between England and Panama finished 6 : 1 The match between Portugal and Morocco finished 1 : 0 The match between Germany and Sweden finished 2 : 1 The match between Nigeria and Argentina finished 1 : 2 The match between South Korea and Germany finished 2 : 0 The match between Uruguay and Saudi Arabia finished 1 : 0 The match between Colombia and Japan finished 1 : 2 The match between Brazil and Mexico finished 2 : 0 The match between Japan and Poland finished 0 : 1 The match between Denmark and Australia finished 1 : 1 The match between Spain and Russia finished 1 : 1 The match between Croatia and England finished 2 : 1
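The same results can also be held in a DataFrame rather than printed one by one. The sketch below assumes only the nested keys visible in `matches[0]` above and uses two trimmed records (IDs and scores copied from this notebook's output); `sample_matches`, `results_df` and `peru_games` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Two trimmed match records in the same nested shape as matches[0] above.
sample_matches = [
    {"match_id": 7562,
     "home_team": {"home_team_name": "Australia"},
     "away_team": {"away_team_name": "Peru"},
     "home_score": 0, "away_score": 2},
    {"match_id": 7546,
     "home_team": {"home_team_name": "France"},
     "away_team": {"away_team_name": "Peru"},
     "home_score": 1, "away_score": 0},
]

# json_normalize flattens nested dicts; nested keys are joined with `sep`.
flat = pd.json_normalize(sample_matches, sep="_")

# Shorten the flattened column names and keep only what we need.
results_df = flat.rename(columns={
    "home_team_home_team_name": "home_team",
    "away_team_away_team_name": "away_team",
})[["match_id", "home_team", "away_team", "home_score", "away_score"]]

# Filtering by team becomes a boolean mask instead of a loop.
peru_games = results_df[(results_df["home_team"] == "Peru") |
                        (results_df["away_team"] == "Peru")]
print(peru_games)
```

With the full `matches` list in place of `sample_matches`, the same mask would pick out all three of Peru's games.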
Show just Peru's results in the World Cup
# Print match results involving Peru
for match in matches:
    home_team_name = match['home_team']['home_team_name']
    away_team_name = match['away_team']['away_team_name']
    if home_team_name == 'Peru' or away_team_name == 'Peru':
        home_score = match['home_score']
        away_score = match['away_score']
        describe_text = 'The match between ' + home_team_name + ' and ' + away_team_name
        result_text = ' finished ' + str(home_score) + ' : ' + str(away_score)
        print(describe_text + result_text)
The match between Australia and Peru finished 0 : 2 The match between France and Peru finished 1 : 0 The match between Peru and Denmark finished 0 : 1
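As a small worked example, the three results above can be summarised as goals for, goals against and points. This is only a sketch: the tuples are typed in from the printed output, `tally` is a helper I made up, and the usual 3/1/0 points scheme is assumed.

```python
# Peru's three results, copied from the printed output above as
# (home_team, away_team, home_score, away_score).
peru_results = [
    ("Australia", "Peru", 0, 2),
    ("France", "Peru", 1, 0),
    ("Peru", "Denmark", 0, 1),
]


def tally(results, team):
    """Return (goals_for, goals_against, points) for `team` (3/1/0 points)."""
    goals_for = goals_against = points = 0
    for home, away, home_score, away_score in results:
        if team == home:
            scored, conceded = home_score, away_score
        elif team == away:
            scored, conceded = away_score, home_score
        else:
            continue  # match does not involve this team
        goals_for += scored
        goals_against += conceded
        if scored > conceded:
            points += 3
        elif scored == conceded:
            points += 1
    return goals_for, goals_against, points


print(tally(peru_results, "Peru"))  # (2, 2, 3)
```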
Find the match ID for the game we are interested in - France vs. Peru
# Now let's find the match we are interested in - France vs. Peru
home_team_required = "France"
away_team_required = "Peru"
# Find ID for the match we are interested in - France vs. Peru
for match in matches:
    home_team_name = match['home_team']['home_team_name']
    away_team_name = match['away_team']['away_team_name']
    if (home_team_name == home_team_required) and (away_team_name == away_team_required):
        match_id_required = match['match_id']
        print(home_team_required + ' vs ' + away_team_required + ' has id: ' + str(match_id_required))
France vs Peru has id: 7546
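If you need to look up several fixtures, the loop above can be wrapped in a small function. `find_match_id` is a hypothetical helper, not something StatsBomb provides, and the two records below are trimmed copies of matches shown earlier in this notebook.

```python
def find_match_id(matches, home_team, away_team):
    """Return the match_id of the first home/away pairing found, else None."""
    for match in matches:
        if (match["home_team"]["home_team_name"] == home_team
                and match["away_team"]["away_team_name"] == away_team):
            return match["match_id"]
    return None  # pairing not present (or home/away the other way round)


# Trimmed records using the same keys as the real matches list.
sample_matches = [
    {"match_id": 7562,
     "home_team": {"home_team_name": "Australia"},
     "away_team": {"away_team_name": "Peru"}},
    {"match_id": 7546,
     "home_team": {"home_team_name": "France"},
     "away_team": {"away_team_name": "Peru"}},
]

print(find_match_id(sample_matches, "France", "Peru"))  # 7546
print(find_match_id(sample_matches, "Peru", "France"))  # None
```

Returning `None` rather than leaving a variable undefined makes it easier to spot a mistyped team name or a swapped home/away order.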
First, let us use Matplotlib to draw a simple football pitch.
def draw_pitch(ax):
    # Size of the pitch is 120 x 80
    # Pitch outline and centre line
    ax.plot([0,0],[0,80], color="black")
    ax.plot([0,120],[80,80], color="black")
    ax.plot([120,120],[80,0], color="black")
    ax.plot([120,0],[0,0], color="black")
    ax.plot([60,60],[0,80], color="black")
    # Left penalty area
    ax.plot([14.6,14.6],[57.8,22.2],color="black")
    ax.plot([0,14.6],[57.8,57.8],color="black")
    ax.plot([0,14.6],[22.2,22.2],color="black")
    # Right penalty area (mirrors the left one)
    ax.plot([120,105.4],[57.8,57.8],color="black")
    ax.plot([105.4,105.4],[57.8,22.2],color="black")
    ax.plot([120,105.4],[22.2,22.2],color="black")
    # Left 6-yard box
    ax.plot([0,4.9],[48,48],color="black")
    ax.plot([4.9,4.9],[48,32],color="black")
    ax.plot([0,4.9],[32,32],color="black")
    # Right 6-yard box
    ax.plot([120,115.1],[48,48],color="black")
    ax.plot([115.1,115.1],[48,32],color="black")
    ax.plot([120,115.1],[32,32],color="black")
    # Prepare circles
    centreCircle = plt.Circle((60,40),8.1,color="black",fill=False)
    centreSpot = plt.Circle((60,40),0.71,color="black")
    leftPenSpot = plt.Circle((9.7,40),0.71,color="black")
    rightPenSpot = plt.Circle((110.3,40),0.71,color="black")
    # Draw circles
    ax.add_patch(centreCircle)
    ax.add_patch(centreSpot)
    ax.add_patch(leftPenSpot)
    ax.add_patch(rightPenSpot)
    # Prepare arcs
    # Arguments for Arc:
    #   x, y coordinates of the centre point of the arc
    #   width, height (diameters), as the arc might be an oval rather than a circle
    #   angle: rotation of the shape in degrees, anti-clockwise
    #   theta1, theta2: start and end angles of the arc in degrees
    leftArc = Arc((9.7,40),height=16.2,width=16.2,angle=0,theta1=310,theta2=50,color="black")
    rightArc = Arc((110.3,40),height=16.2,width=16.2,angle=0,theta1=130,theta2=230,color="black")
    # Draw arcs
    ax.add_patch(leftArc)
    ax.add_patch(rightArc)
That seems like a lot, but let's unpack the draw_pitch() function. It takes an ax argument, which is the Axes object returned by Matplotlib's add_subplot() function. It then adds several objects with pre-defined dimensions to recreate an image of a football pitch: the pitch outline and centre line, the penalty areas, the 6-yard boxes, the centre circle and spots, and the penalty arcs. Once we have defined this function, we call it together with the standard Matplotlib figure functions as follows:
fig=plt.figure()
fig.set_size_inches(7, 5)
ax=fig.add_subplot(1,1,1)
draw_pitch(ax)
plt.axis('off')
plt.show()
This section uses the pandas library to import our data to this workbook as a DataFrame and matplotlib for data visualisation.
The learning goals for this video are:
Event Data is effectively a chronological, event-by-event tabulation of on-ball actions. It's typically collected from broadcast footage by third-party collectors and sold on the open market to clubs, broadcasters, the gambling industry, and even private individuals. The primary companies competing in this space are Opta (now owned by STATS Perform) and StatsBomb, but there are other competitors.
Event data does not include records of the coordinate positions and actions of the other 21 players on the field, only those of the player in possession. For this, we need Tracking Data. Player tracking systems record the coordinate position of every player on the field (and usually the ball), many times per second. State-of-the-art systems collect up to 25 samples per second. Because these systems are expensive to install and operate, and require in-stadium hardware, this data is mostly available to the clubs themselves, but academics frequently get their hands on this data in a highly anonymized format through tediously painful research agreements. There are various competitors in this space, such as ChyronHego, Second Spectrum, STATS Perform, Metrica, Signality, and others.
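To give a feel for the difference in scale, here is a rough back-of-the-envelope calculation for one match of tracking data. It assumes exactly 90 minutes of play (no stoppage time), 22 players plus the ball, and the 25 samples per second quoted above; all variable names are mine.

```python
# Rough size of one match of tracking data under the stated assumptions.
SAMPLES_PER_SECOND = 25      # sampling rate quoted in the text above
MATCH_SECONDS = 90 * 60      # 90 minutes, ignoring stoppage time
TRACKED_ENTITIES = 22 + 1    # 22 players plus the ball

frames = SAMPLES_PER_SECOND * MATCH_SECONDS
positions = frames * TRACKED_ENTITIES

print(f"{frames:,} frames")        # 135,000 frames
print(f"{positions:,} positions")  # 3,105,000 positions
```

So a single match yields millions of coordinate records, which is very different from the event-by-event table we work with in this notebook.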
# France vs. Peru highlights, Match 21
YouTubeVideo('O4odLCih0Os')
The game we'll be analysing is Peru vs. Denmark. Peru had 17 shots in this game, the most by any team that had failed to score at that point in the 2018 World Cup. Unfortunately, after Cueva missed a penalty, Peru lost 1-0 to a Poulsen finish.
Let’s start with a Pass Map
We load the JSON file and do some basic data cleaning in pandas to get a dataset that only contains Passing Events by André Carrillo.
match_id_peru_denmark = 7532
# Load in the json data
file_name = str(match_id_peru_denmark) + '.json'
# Load in all the match events
with open(dataDir + 'statsbomb/events/' + file_name) as data_file:
    data = json.load(data_file)
# Get the nested structure into a DataFrame and tag every row with the match ID
df = pd.json_normalize(data, sep = "_").assign(match_id = match_id_peru_denmark)
# DataFrame of Carrillo's actions on the pitch
carrillo_pass = df[(df['type_name'] == "Pass") & (df['player_name'] == 'André Martín Carrillo Díaz')] # get passing information of Carrillo
carrillo_column = [i for i in df.columns if i.startswith("pass")]
carrillo_pass = carrillo_pass[["id", "period", "timestamp", "location", "pass_end_location", "pass_recipient_name"]]
carrillo_pass.head(60)
index | id | period | timestamp | location | pass_end_location | pass_recipient_name |
---|---|---|---|---|---|---|
104 | 11e27ac5-6e0f-485a-a600-f4f40a7ccefd | 1 | 00:01:35.280 | [97.0, 66.0] | [104.0, 65.0] | NaN |
113 | 04b9366f-595a-4dea-a5fd-1e43d7a9363d | 1 | 00:01:43.200 | [82.0, 58.0] | [75.0, 59.0] | Renato Fabrizio Tapia Cortijo |
140 | 36aae18b-b605-486e-8a4a-3a1e730853be | 1 | 00:02:28.440 | [66.0, 55.0] | [62.0, 55.0] | Renato Fabrizio Tapia Cortijo |
209 | 24dd6505-a3dc-49f1-8eae-6f1c5d862c27 | 1 | 00:04:45.360 | [94.0, 24.0] | [100.0, 21.0] | Christian Alberto Cueva Bravo |
272 | f62464bf-60f4-4367-88e9-ae90cf31da48 | 1 | 00:06:43.080 | [101.0, 63.0] | [89.0, 69.0] | Christian Alberto Cueva Bravo |
413 | fb1be5f8-ad4c-4d75-925b-5cf43f9d34ff | 1 | 00:10:19.800 | [96.0, 50.0] | [97.0, 68.0] | Luis Jan Piers Advíncula Castrillón |
420 | 6112273b-5145-46c9-a511-aa3289129d14 | 1 | 00:10:26.400 | [92.0, 55.0] | [102.0, 50.0] | Christian Alberto Cueva Bravo |
430 | b4c3318a-4640-4ec2-8c3c-589cd1208e8e | 1 | 00:10:46.960 | [82.0, 61.0] | [83.0, 54.0] | Renato Fabrizio Tapia Cortijo |
553 | a352ab6a-48c0-4eb8-825a-ba424bb0f048 | 1 | 00:13:20.360 | [46.0, 68.0] | [48.0, 63.0] | Renato Fabrizio Tapia Cortijo |
615 | 85d67434-ee96-40c8-9fee-8ef3e5bd336e | 1 | 00:15:19.560 | [38.0, 58.0] | [62.0, 53.0] | Christian Alberto Cueva Bravo |
638 | dcc3e6a0-88f9-4fff-9d06-f4236dd5c3e7 | 1 | 00:16:22.400 | [53.0, 28.0] | [54.0, 26.0] | Jefferson Agustín Farfán Guadalupe |
836 | 28eb8162-89e2-4537-a304-11ca7db6f12a | 1 | 00:23:06.120 | [4.0, 4.0] | [12.0, 16.0] | Miguel Ángel Trauco Saavedra |
993 | 0d782ec3-6314-44e5-afbb-5a13ab3f52be | 1 | 00:28:19.640 | [84.0, 60.0] | [108.0, 49.0] | Jefferson Agustín Farfán Guadalupe |
1035 | 713b964b-9cf0-4f04-840a-afc7ce93cc6e | 1 | 00:29:39.000 | [97.0, 5.0] | [97.0, 20.0] | Édison Michael Flores Peralta |
1042 | 1db47b93-4a16-4c6c-b885-07f9942adaa0 | 1 | 00:29:41.920 | [101.0, 16.0] | [91.0, 12.0] | Víctor Yoshimar Yotún Flores |
1051 | f650b96a-c9e9-4688-a6aa-0b698b509724 | 1 | 00:29:48.400 | [101.0, 14.0] | [101.0, 23.0] | Víctor Yoshimar Yotún Flores |
1087 | 4cd3afae-dad6-457b-b763-114cb64a774f | 1 | 00:30:24.720 | [69.0, 6.0] | [60.0, 12.0] | Víctor Yoshimar Yotún Flores |
1250 | 1683fd4d-bd06-45b6-be1b-2e1be0101855 | 1 | 00:34:36.000 | [24.0, 78.0] | [36.0, 74.0] | Christian Alberto Cueva Bravo |
1459 | 177c2680-c86d-4951-8231-947d612d65fd | 1 | 00:42:08.120 | [73.0, 74.0] | [83.0, 78.0] | Renato Fabrizio Tapia Cortijo |
1467 | e0a237ed-9761-4047-aac5-6da9bf8d2723 | 1 | 00:42:12.080 | [67.0, 75.0] | [47.0, 66.0] | Christian Guillermo Martín Ramos Garagay |
1950 | 4e5bddc8-3535-4794-89ec-aa5749dcaf13 | 2 | 00:10:35.680 | [83.0, 34.0] | [84.0, 49.0] | Édison Michael Flores Peralta |
1957 | b98fb5d5-e8b9-460a-a708-070541c30658 | 2 | 00:10:41.840 | [93.0, 63.0] | [90.0, 77.0] | Luis Jan Piers Advíncula Castrillón |
1964 | 912d17cd-3123-4d26-a997-981b7cb1e6b9 | 2 | 00:10:48.360 | [91.0, 63.0] | [68.0, 56.0] | Christian Guillermo Martín Ramos Garagay |
2133 | 16914d9a-55d7-4c4c-a747-f9e4289e8720 | 2 | 00:15:26.880 | [104.0, 72.0] | [116.0, 23.0] | Édison Michael Flores Peralta |
2165 | 2eb3c33d-9809-4dc7-9378-ca4a3ae0b7c6 | 2 | 00:16:19.960 | [109.0, 51.0] | [119.0, 51.0] | NaN |
2188 | 2e66e848-496e-4eb3-b9fa-0f82642f8271 | 2 | 00:18:14.760 | [93.0, 80.0] | [100.0, 65.0] | José Paolo Guerrero González |
2196 | 892fc963-a1bf-4285-a7ed-74a64f42281e | 2 | 00:18:20.240 | [102.0, 78.0] | [91.0, 79.0] | Luis Jan Piers Advíncula Castrillón |
2213 | 33d078d2-4d3c-4407-a885-d0533652d5f2 | 2 | 00:18:36.240 | [104.0, 57.0] | [113.0, 44.0] | José Paolo Guerrero González |
2362 | 9ba0a132-e288-4527-ada5-c814d92b134b | 2 | 00:24:57.800 | [76.0, 70.0] | [77.0, 73.0] | NaN |
2371 | 84eeb4e3-e4c5-424b-86dd-959f2885daea | 2 | 00:25:19.760 | [113.0, 76.0] | [111.0, 43.0] | Jefferson Agustín Farfán Guadalupe |
2383 | 09b82985-d82b-48f8-95a7-a4b9774a698b | 2 | 00:25:28.920 | [119.0, 61.0] | [117.0, 39.0] | José Paolo Guerrero González |
2393 | 5b03f4e4-12b8-4909-9dd5-d69f416276e0 | 2 | 00:26:03.080 | [29.0, 53.0] | [29.0, 17.0] | Miguel Ángel Trauco Saavedra |
2428 | c37450d0-03f8-449f-90ee-cd887e23e261 | 2 | 00:26:58.733 | [92.0, 62.0] | [102.0, 78.0] | Luis Jan Piers Advíncula Castrillón |
2434 | 70871412-bacf-4cb7-b5ee-e61ec195efcc | 2 | 00:27:03.320 | [98.0, 61.0] | [111.0, 62.0] | Luis Jan Piers Advíncula Castrillón |
2502 | c12d9541-2429-4f22-a1f9-ef1e56610269 | 2 | 00:30:27.080 | [69.0, 65.0] | [56.0, 39.0] | Alberto Junior Rodríguez Valdelomar |
2518 | b947018d-1f7a-4ddb-ad73-38a72c66a4b2 | 2 | 00:30:46.440 | [102.0, 60.0] | [97.0, 65.0] | Luis Jan Piers Advíncula Castrillón |
2640 | 3f97eadb-1eb6-400d-b7b0-6e704e573247 | 2 | 00:33:16.680 | [97.0, 64.0] | [107.0, 43.0] | José Paolo Guerrero González |
2743 | 1731d177-b0f3-4915-b8d6-c3f801003595 | 2 | 00:37:35.880 | [4.0, 74.0] | [48.0, 69.0] | Jefferson Agustín Farfán Guadalupe |
2780 | a69ffb11-c38c-442f-b51a-8d662a59771b | 2 | 00:38:19.360 | [115.0, 57.0] | [108.0, 43.0] | Jefferson Agustín Farfán Guadalupe |
2863 | 865399fe-92f8-4c3e-9163-75f9a3af330b | 2 | 00:42:15.040 | [98.0, 63.0] | [103.0, 43.0] | Christian Alberto Cueva Bravo |
The table shows that Carrillo attempted 40 passes in the match, 20 in each half; three of them have no recorded recipient (NaN).
For the purpose of the pass map, we only care about the starting and ending location of a pass.
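Because each location is stored as an `[x, y]` list, it can help to split the coordinates into plain numeric arrays first. The sketch below uses the first three of Carrillo's passes from the table above and derives a pass length in pitch units on the 120 x 80 grid; the `forward` flag assumes the team is attacking towards x = 120. The names `passes`, `pass_length` and `forward` are illustrative only.

```python
import numpy as np
import pandas as pd

# First three of Carrillo's passes, copied from the table above.
passes = pd.DataFrame({
    "location": [[97.0, 66.0], [82.0, 58.0], [66.0, 55.0]],
    "pass_end_location": [[104.0, 65.0], [75.0, 59.0], [62.0, 55.0]],
})

# Convert the list-valued columns into (n, 2) arrays of x and y values.
start = np.array(passes["location"].tolist())
end = np.array(passes["pass_end_location"].tolist())

# Straight-line distance between start and end point of each pass.
passes["pass_length"] = np.hypot(end[:, 0] - start[:, 0],
                                 end[:, 1] - start[:, 1])

# A pass counts as "forward" here if its end point has a larger x value
# (assumption: the team attacks towards x = 120).
passes["forward"] = end[:, 0] > start[:, 0]

print(passes[["pass_length", "forward"]])
```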
The code below allows us to overlay the passes as arrows onto our pitch.
fig, ax = plt.subplots()
fig.set_size_inches(7, 5)
ax.set_xlim([0,120])
ax.set_ylim([0,80])
for i in range(len(carrillo_pass)):
    # differentiate the two halves by color: blue for the first half, red for the second
    color = "blue" if carrillo_pass.iloc[i]['period'] == 1 else "red"
    ax.annotate("", xy = (carrillo_pass.iloc[i]['pass_end_location'][0], carrillo_pass.iloc[i]['pass_end_location'][1]), xycoords = 'data',
                xytext = (carrillo_pass.iloc[i]['location'][0], carrillo_pass.iloc[i]['location'][1]), textcoords = 'data',
                arrowprops=dict(arrowstyle="->",connectionstyle="arc3", color = color),)
plt.show()
Football heatmaps are used by in-club and media analysts to illustrate the area within which a player has been present. They are effectively a smoothed out scatter plot of player locations and could be a good indicator of how effective a player is at different parts of the field. While there may be some debate as to how much they are useful (they don’t tell you if actions/movement are a good or bad thing!), they can often be very aesthetically pleasing and engaging, hence their popularity.
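Before any smoothing, the underlying idea is simply counting how many events fall in each zone of the pitch. The sketch below shows this unsmoothed version with `np.histogram2d`, which is a simpler stand-in for the kernel density estimate and not the same technique. The five locations are sample event locations taken from this notebook's Peru vs. Denmark data, and the 6 x 4 grid of 20 x 20 zones is an arbitrary choice of mine.

```python
import numpy as np

# Five sample event locations [x, y] on the 120 x 80 pitch.
locations = [[47.0, 31.0], [73.0, 77.0], [73.0, 77.0], [97.0, 66.0], [84.0, 62.0]]
x = [loc[0] for loc in locations]
y = [loc[1] for loc in locations]

# Count events in a 6 x 4 grid of 20 x 20 zones covering the whole pitch.
counts, x_edges, y_edges = np.histogram2d(
    x, y, bins=[6, 4], range=[[0, 120], [0, 80]]
)

# counts[i, j] is the number of events with x in zone i and y in zone j.
print(counts)
```

A heat map is essentially a smoothed picture of a table like this, which is why it shows where a player was involved but not whether those involvements were good or bad.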
Let's plot a heat map using Seaborn on top of Matplotlib to visualize André Carrillo's involvement over the 90 minutes of the Peru-Denmark match. The syntax of the code is incredibly simple. We use a kdeplot, which draws a kernel density estimate of the scattered points of Carrillo's locations.
# extract players involvement in the entire game
carrillo_action = df[(df['player_name'] == 'André Martín Carrillo Díaz')][["id", "type_name","period", "timestamp", "location"]]
carrillo_action.head()
index | id | type_name | period | timestamp | location |
---|---|---|---|---|---|
71 | e8bfc033-a4d9-46e4-8617-4605cf0f99a7 | Ball Recovery | 1 | 00:01:02.080 | [47.0, 31.0] |
102 | ef5d9fef-e678-4d93-8f38-6068f68749a4 | Ball Receipt* | 1 | 00:01:31.000 | [73.0, 77.0] |
103 | 1fe02959-3210-46d4-9351-58bddfc914a1 | Carry | 1 | 00:01:31.000 | [73.0, 77.0] |
104 | 11e27ac5-6e0f-485a-a600-f4f40a7ccefd | Pass | 1 | 00:01:35.280 | [97.0, 66.0] |
109 | 1bc0199e-714d-4d1b-994b-705018af7da7 | Ball Receipt* | 1 | 00:01:41.160 | [84.0, 62.0] |
fig, ax = plt.subplots()
fig.set_size_inches(7, 5)
x_coord = [i[0] for i in carrillo_action["location"]]
y_coord = [i[1] for i in carrillo_action["location"]]
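# shade: fill the contours, which gives us the heat map we want
# n_levels: number of contour levels; the larger n is, the blurrier the plot looks
# (seaborn >= 0.11 spells these arguments x=, y=, fill= and levels=)
sns.kdeplot(x_coord, y_coord, shade = True, color = "green", n_levels = 30)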
plt.show()
Wow!!! That looks very… anti-climactic. After all, what is the graph trying to tell you? I see some coordinates, and these contour-like shapes do seem to indicate that Carrillo is more active in the areas with darker color.
Can we do any better than that?
Yes: we can combine (1) the pitch, (2) the pass map and (3) the heat map to get a more comprehensive view of Carrillo's performance during the game.
We now put the heat map and the pass map together, with a nice pitch in the background.
def heat_pass_map(data, player_name):
    pass_data = data[(data['type_name'] == "Pass") & (data['player_name'] == player_name)]
    # drop events without a location so the coordinate lists below are always valid
    action_data = data[(data['player_name'] == player_name)].dropna(subset=['location'])
    fig=plt.figure()
    fig.set_size_inches(7, 5)
    ax=fig.add_subplot(1,1,1)
    draw_pitch(ax)
    plt.axis('off')
    for i in range(len(pass_data)):
        # differentiate the two halves by color: blue for the first half, red for the second
        color = "blue" if pass_data.iloc[i]['period'] == 1 else "red"
        ax.annotate("", xy = (pass_data.iloc[i]['pass_end_location'][0], pass_data.iloc[i]['pass_end_location'][1]), xycoords = 'data',
                    xytext = (pass_data.iloc[i]['location'][0], pass_data.iloc[i]['location'][1]), textcoords = 'data',
                    arrowprops=dict(arrowstyle="->",connectionstyle="arc3", color = color),)
    x_coord = [i[0] for i in action_data["location"]]
    y_coord = [i[1] for i in action_data["location"]]
    sns.kdeplot(x_coord, y_coord, shade = True, color = "green", n_levels = 30)
    plt.ylim(0, 80) # need this, otherwise the kde plot will extend beyond the pitch
    plt.xlim(0, 120)
    plt.show()
heat_pass_map(df, 'André Martín Carrillo Díaz')
# we can see that ...
# ex. Ozil really struggles to play direct attacking ball in the first half, while he was a lot more direct in the second half
Notice that I also color the passes by half: the blue arrows indicate passes made in the first half, and the red arrows passes made in the second half.
Now we can see a more comprehensive picture of André Carrillo's performance during the game. A couple of observations right off the bat:
What I found interesting was the heat-pass map of Timo Werner, who started out as the lone striker for the German team and then paired up with Mario Gómez for much of the second half.
heat_pass_map(df, 'Christian Alberto Cueva Bravo')
He surprisingly spent a lot of his time on the two flanks, while you would expect the centre forward to occupy the space in the 18-yard box a lot more. This partly explains the ineffectiveness of the German attack during the game, as their forwards (Werner, Reus, Goretzka and then Müller, Gómez) crowded the wings but failed to take up space in the penalty area, thus providing very little outlet for playmakers such as Özil and Kroos to direct the ball into the 18-yard box.
heat_pass_map(df, 'José Paolo Guerrero González')
heat_pass_map(df, 'Luis Jan Piers Advíncula Castrillón')
We can use the same approach to visualize all of Peru's shots and see whether the majority of their goals come from inside or outside the box.
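The inside/outside question can also be checked numerically. The sketch below reuses the penalty-area coordinates from draw_pitch() (x from 105.4 to 120, y from 22.2 to 57.8) and assumes shot locations are recorded with the shooting team attacking towards x = 120, as in the plots in this notebook. `in_box` is a hypothetical helper and the three shots are made-up sample rows, not real data.

```python
# Penalty-area bounds taken from draw_pitch() in this notebook.
BOX_X_MIN = 105.4
BOX_Y_MIN, BOX_Y_MAX = 22.2, 57.8


def in_box(location):
    """True if an [x, y, ...] location lies inside the right-hand penalty area."""
    x, y = location[0], location[1]
    return x >= BOX_X_MIN and BOX_Y_MIN <= y <= BOX_Y_MAX


# Made-up sample shots for illustration only (not taken from the data).
sample_shots = [
    {"location": [108.0, 40.0], "shot_outcome_name": "Goal"},
    {"location": [95.0, 30.0], "shot_outcome_name": "Off T"},
    {"location": [112.0, 60.0], "shot_outcome_name": "Saved"},
]

inside = sum(in_box(shot["location"]) for shot in sample_shots)
outside = len(sample_shots) - inside
print(f"inside the box: {inside}, outside the box: {outside}")
```

On the real data, the same function could be applied to the location column of the shots DataFrame built in the cells that follow, for example with `.apply(in_box)`.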
If I just follow the methods shown thus far, this is what I get
I want to plot out the shots from all different angles of the Peru team at the World Cup.
data_id = [7562, 7546, 7532]
# Read each match's JSON in turn and concatenate the results into one DataFrame
frames = []
for match_id in data_id:
    with open(dataDir + 'statsbomb/events/' + str(match_id) + '.json') as data_file:
        data = json.load(data_file)
    frames.append(pd.json_normalize(data, sep = '_'))
all_peru = pd.concat(frames, join = 'outer', sort = False)
shot_data = all_peru[(all_peru['type_name'] == "Shot") & (all_peru['team_name'] == 'Peru')]
fig = plt.figure()
fig.set_size_inches(7, 5)
ax = fig.add_subplot(1,1,1)
draw_pitch(ax)
plt.axis('off')
for i in range(len(shot_data)):
    # color the arrow by outcome: red for a goal, black for anything else
    color = "red" if shot_data.iloc[i]['shot_outcome_name'] == "Goal" else "black"
    ax.annotate("", xy = (shot_data.iloc[i]['shot_end_location'][0], shot_data.iloc[i]['shot_end_location'][1]), xycoords = 'data',
                xytext = (shot_data.iloc[i]['location'][0], shot_data.iloc[i]['location'][1]), textcoords = 'data',
                arrowprops=dict(arrowstyle = "->",connectionstyle="arc3", color = color),)
plt.ylim(0, 80)
plt.xlim(0, 120)
plt.show()
Shots taken by the Peru team during the World Cup campaign
This is fine. But we can do more to make the visualization more engaging and insightful. Specifically, I made two small tweaks:
def draw_half_pitch(ax):
    # focus on only half of the pitch
    #Pitch Outline & Centre Line
    Pitch = Rectangle([60,0], width = 60, height = 80, fill = False)
    #Right Penalty Area
    RightPenalty = Rectangle([105.4,22.3], width = 14.6, height = 35.3, fill = False)
    #Right 6-yard Box
    RightSixYard = Rectangle([115.1,32], width = 4.9, height = 16, fill = False)
    #Prepare Circles
    centreCircle = Arc((60,40),width = 8.1, height = 8.1, angle=0,theta1=270,theta2=90,color="black")
    centreSpot = plt.Circle((60,40),0.71,color="black")
    rightPenSpot = plt.Circle((110.3,40),0.71,color="black")
    rightArc = Arc((110.3,40),height=16.2,width=16.2,angle=0,theta1=130,theta2=230,color="black")
    element = [Pitch, RightPenalty, RightSixYard, centreCircle, centreSpot, rightPenSpot, rightArc]
    for i in element:
        ax.add_patch(i)
fig=plt.figure()
fig.set_size_inches(7, 5)
ax=fig.add_subplot(1,1,1)
draw_half_pitch(ax)
plt.axis('off')
# draw the scatter plot for goals
x_coord_goal = [location[0] for i, location in enumerate(shot_data["location"]) if shot_data.iloc[i]['shot_outcome_name'] == "Goal"]
y_coord_goal = [location[1] for i, location in enumerate(shot_data["location"]) if shot_data.iloc[i]['shot_outcome_name'] == "Goal"]
# shots that end up with no goal
x_coord = [location[0] for i, location in enumerate(shot_data["location"]) if shot_data.iloc[i]['shot_outcome_name'] != "Goal"]
y_coord = [location[1] for i, location in enumerate(shot_data["location"]) if shot_data.iloc[i]['shot_outcome_name'] != "Goal"]
# put the two scatter plots on to the pitch
ax.scatter(x_coord_goal, y_coord_goal, c = 'red', label = 'goal')
ax.scatter(x_coord, y_coord, c = 'blue', label = 'shots')
plt.ylim(0, 80)
plt.xlim(0, 120)
plt.legend(loc = 'upper right')
plt.axis('off')
plt.show()
Now this looks a whole lot better. We can see right away that France attempted about as many shots inside the box as they did outside the penalty area. To a certain extent, this supports the argument that France took more long-range efforts than usual, since we would normally expect a much lower density of shots outside the box. In any case, it is interesting how they seem equally clinical with short- and long-range efforts.
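The inside-versus-outside comparison can also be checked numerically rather than by eye. The sketch below is an illustration under stated assumptions: the helper name `shots_by_zone` is hypothetical, and the box bounds are simply copied from the penalty-area rectangle drawn in `draw_half_pitch` (x from 105.4 to 120, y from 22.3 to 57.6), which may not match the data provider's official pitch specification.

```python
import pandas as pd

# Assumed box bounds, taken from the Rectangle used in draw_half_pitch
BOX_X_MIN = 105.4
BOX_Y_MIN = 22.3
BOX_Y_MAX = 22.3 + 35.3

def shots_by_zone(shots):
    """Count shots and goals inside and outside the (assumed) penalty area."""
    x = shots["location"].str[0].astype(float)
    y = shots["location"].str[1].astype(float)
    inside = (x >= BOX_X_MIN) & y.between(BOX_Y_MIN, BOX_Y_MAX)
    zone = inside.map({True: "inside_box", False: "outside_box"})
    is_goal = shots["shot_outcome_name"] == "Goal"
    table = pd.DataFrame({"zone": zone, "goal": is_goal})
    return table.groupby("zone")["goal"].agg(shots="size", goals="sum")

# Toy shots (made-up values) standing in for shot_data
toy_shots = pd.DataFrame({
    "location": [[110, 40], [112, 30], [90, 40], [100, 70], [115, 45]],
    "shot_outcome_name": ["Goal", "Saved", "Off T", "Goal", "Blocked"],
})
zone_table = shots_by_zone(toy_shots)
print(zone_table)
```

Calling `shots_by_zone(shot_data)` on the notebook's own shot frame would give the real counts. A zone with no shots is simply absent from the result.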
In the following sections, we'll overlay a density plot and add an image to the visualisation. With a couple more lines of code, you can easily produce this visualisation:
# we use a joint plot to see the density of the shot distribution across the 2 axes of the pitch
# x and y are passed by keyword; the stat_func argument was removed from recent seaborn releases
joint_shot_chart = sns.jointplot(x=x_coord, y=y_coord,
                                 kind = 'scatter', space=0, alpha=0.5)
joint_shot_chart.fig.set_size_inches(7,5)
ax = joint_shot_chart.ax_joint
# overlaying the plot with a pitch
draw_half_pitch(ax)
ax.set_xlim(0.5,120.5)
ax.set_ylim(0.5,80.5)
# draw the scatter plot for goals
x_coord_goal = [location[0] for i, location in enumerate(shot_data["location"]) if shot_data.iloc[i]['shot_outcome_name'] == "Goal"]
y_coord_goal = [location[1] for i, location in enumerate(shot_data["location"]) if shot_data.iloc[i]['shot_outcome_name'] == "Goal"]
# shots that end up with no goal
x_coord = [location[0] for i, location in enumerate(shot_data["location"]) if shot_data.iloc[i]['shot_outcome_name'] != "Goal"]
y_coord = [location[1] for i, location in enumerate(shot_data["location"]) if shot_data.iloc[i]['shot_outcome_name'] != "Goal"]
# put the two scatter plots on to the pitch
ax.scatter(x_coord, y_coord, c = 'b', label = 'shots')
ax.scatter(x_coord_goal, y_coord_goal, c = 'r', label = 'goal')
# Get rid of axis labels and tick marks
ax.set_xlabel('')
ax.set_ylabel('')
joint_shot_chart.ax_marg_x.set_axis_off()
ax.set_axis_off()
plt.ylim(-.5, 80)
plt.axis('off')
plt.show()
# I want to include some images into our diagram
peru = plt.imread("./img/farfanicon.png")
plt.imshow(peru)
plt.show()
cmap = plt.cm.YlOrRd_r # import cmap
# x and y are passed by keyword; the stat_func argument was removed from recent seaborn releases
joint_shot_chart = sns.jointplot(x=x_coord, y=y_coord,
                                 kind='reg', space=0, color = cmap(0.1))
joint_shot_chart.fig.set_size_inches(7,5)
ax = joint_shot_chart.ax_joint
draw_half_pitch(ax)
ax.set_xlim(0.5,120.5)
ax.set_ylim(0.5,80.5)
# draw the scatter plot for goals
x_coord_goal = [location[0] for i, location in enumerate(shot_data["location"]) if shot_data.iloc[i]['shot_outcome_name'] == "Goal"]
y_coord_goal = [location[1] for i, location in enumerate(shot_data["location"]) if shot_data.iloc[i]['shot_outcome_name'] == "Goal"]
# shots that end up with no goal
x_coord = [location[0] for i, location in enumerate(shot_data["location"]) if shot_data.iloc[i]['shot_outcome_name'] != "Goal"]
y_coord = [location[1] for i, location in enumerate(shot_data["location"]) if shot_data.iloc[i]['shot_outcome_name'] != "Goal"]
# put the two scatter plots on to the pitch
ax.scatter(x_coord, y_coord, c = 'b', label = 'shots')
ax.scatter(x_coord_goal, y_coord_goal, c = 'r', label = 'goal')
plt.legend(loc='lower right', bbox_to_anchor=(0.975, 0.0125)) # legend location specifically put here
plt.axis('off')
# Get rid of axis labels and tick marks
ax.set_xlabel('')
ax.set_ylabel('')
ax.set_title('Peru 2018 \nWorld Cup',
y=1.2, fontsize=15)
joint_shot_chart.ax_marg_x.set_axis_off()
joint_shot_chart.ax_marg_y.set_axis_off()
img = OffsetImage(peru, zoom=0.873)
img.set_offset((42,15.5)) # play around with the coordinate until I found a good place
# Add image of Farfan
ax.add_artist(img)
ax.set_axis_off()
plt.xlim(0,123)
plt.ylim(-.5, 83)
plt.axis('off')
plt.show()
# Export plot: this must run before plt.show(), otherwise an empty figure is saved
# joint_shot_chart.savefig('./img/fig/peru_all_shots_graphic.png')
data_id = [7562, 7546, 7532]
# sequentially read each match's JSON and concatenate into a pre-defined dataframe
peru_all = pd.DataFrame()
for i in data_id:
    with open(dataDir + 'statsbomb/events/' + str(i)+'.json') as data_file:
        data = json.load(data_file)
    # pd.json_normalize replaces the deprecated pandas.io.json.json_normalize
    df = pd.json_normalize(data, sep = '_')
    if peru_all.empty:
        peru_all = df
    else:
        peru_all = pd.concat([peru_all, df], join = 'outer', sort = False)
peru_all = peru_all[peru_all.team_name == "Peru"]
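Since the same loading loop appears twice in this notebook, it could be wrapped in a small helper. This is only a sketch: `load_team_events` is a hypothetical name, the extra `match_id` column is an addition that is not in the original frames, and the file layout (`<match_id>.json` inside one events directory) is assumed from the paths used above.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

def load_team_events(events_dir, match_ids, team_name):
    """Read one events JSON per match, flatten it, and keep one team's rows."""
    frames = []
    for match_id in match_ids:
        path = Path(events_dir) / f"{match_id}.json"
        with open(path, encoding="utf-8") as data_file:
            events = json.load(data_file)
        frame = pd.json_normalize(events, sep="_")
        frame["match_id"] = match_id  # added column, useful for per-match splits
        frames.append(frame)
    # Concatenate once at the end instead of growing a DataFrame in the loop
    all_events = pd.concat(frames, join="outer", sort=False, ignore_index=True)
    return all_events[all_events["team_name"] == team_name].reset_index(drop=True)

# Tiny self-contained demo with made-up files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for match_id, team in [(1, "Peru"), (2, "Denmark")]:
        sample = [{"type": {"name": "Pass"}, "team": {"name": team}}]
        (Path(tmp) / f"{match_id}.json").write_text(json.dumps(sample), encoding="utf-8")
    demo = load_team_events(tmp, [1, 2], "Peru")
print(demo[["match_id", "type_name", "team_name"]])
```

With the notebook's own variables, the equivalent call would be `load_team_events(dataDir + 'statsbomb/events/', data_id, 'Peru')`.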
Let's say we are interested in the following statistics per player across the tournament
# total number of dribbles attempted
total_dribble = peru_all.groupby('player_name')['type_name'].apply(lambda x:(x=='Dribble').sum()).reset_index(name='total_dribble')
# number of dribbles completed
dribble_complete= peru_all.groupby('player_name')['dribble_outcome_name'].apply(lambda x: (x=='Complete').sum()).reset_index(name='dribble_completed')
# total number of passes
total_pass = peru_all.groupby('player_name')['type_name'].apply(lambda x: (x=='Pass').sum()).reset_index(name='total_pass')
# number of incomplete passes (rows with a non-null pass_outcome_name)
pass_incomplete = peru_all.groupby('player_name')['pass_outcome_name'].count().reset_index(name='incomplete_pass')
# number of times the player was dispossessed
dispossessed = peru_all.groupby('player_name')['type_name'].apply(lambda x: (x=='Dispossessed').sum()).reset_index(name='dispossessed')
df_list = [total_dribble, dribble_complete, total_pass, pass_incomplete, dispossessed]
summary_data = reduce(lambda x, y: pd.merge(x, y, on = 'player_name'), df_list)
summary_data
index | player_name | total_dribble | dribble_completed | total_pass | incomplete_pass | dispossessed |
---|---|---|---|---|---|---|
0 | Alberto Junior Rodríguez Valdelomar | 1 | 0 | 56 | 7 | 0 |
1 | Anderson Santamaría Bardales | 1 | 1 | 74 | 5 | 0 |
2 | André Martín Carrillo Díaz | 18 | 14 | 106 | 28 | 6 |
3 | Christian Alberto Cueva Bravo | 4 | 3 | 138 | 25 | 4 |
4 | Christian Guillermo Martín Ramos Garagay | 2 | 2 | 119 | 20 | 1 |
5 | Christopher Paolo César Hurtado Huertas | 1 | 1 | 19 | 3 | 0 |
6 | Jefferson Agustín Farfán Guadalupe | 2 | 2 | 33 | 5 | 2 |
7 | José Paolo Guerrero González | 3 | 0 | 48 | 18 | 5 |
8 | Luis Jan Piers Advíncula Castrillón | 8 | 7 | 144 | 20 | 2 |
9 | Miguel Ángel Trauco Saavedra | 3 | 2 | 203 | 47 | 4 |
10 | Pedro David Gallese Quiróz | 0 | 0 | 68 | 31 | 0 |
11 | Pedro Jesús Aquino Sánchez | 0 | 0 | 104 | 17 | 1 |
12 | Raúl Mario Ruidíaz Misitich | 0 | 0 | 3 | 0 | 0 |
13 | Renato Fabrizio Tapia Cortijo | 0 | 0 | 47 | 10 | 1 |
14 | Víctor Yoshimar Yotún Flores | 1 | 0 | 144 | 26 | 1 |
15 | Wilder José Cartagena Mendoza | 0 | 0 | 4 | 1 | 0 |
16 | Édison Michael Flores Peralta | 3 | 1 | 104 | 27 | 4 |
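Raw counts favour players who saw more of the ball, so it can help to turn the summary columns into rates. The sketch below is an illustration, not part of the original analysis: `add_rates` and the two new column names are hypothetical, and the pass completion figure inherits the assumption that every non-null `pass_outcome_name` marks an unsuccessful pass.

```python
import pandas as pd

def add_rates(summary):
    """Append dribble success and pass completion rates to a summary table."""
    out = summary.copy()
    # Replace zero denominators with NaN so the resulting rate is NaN, not an error
    dribbles = out["total_dribble"].where(out["total_dribble"] > 0)
    passes = out["total_pass"].where(out["total_pass"] > 0)
    out["dribble_success_rate"] = out["dribble_completed"].div(dribbles)
    out["pass_completion_rate"] = 1 - out["incomplete_pass"].div(passes)
    return out

# Two rows copied from the summary table above
sample = pd.DataFrame({
    "player_name": ["André Martín Carrillo Díaz", "Raúl Mario Ruidíaz Misitich"],
    "total_dribble": [18, 0],
    "dribble_completed": [14, 0],
    "total_pass": [106, 3],
    "incomplete_pass": [28, 0],
    "dispossessed": [6, 0],
})
rates = add_rates(sample)
print(rates[["player_name", "dribble_success_rate", "pass_completion_rate"]])
```

Applied to the full table this would be `add_rates(summary_data)`. Players with no dribbles get a missing value rather than a misleading 0%.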
# New dataframe, containing only players with more than 50 passes
dataPass= summary_data[summary_data["total_pass"]>50]
# Utilise matplotlib to scale our pass totals between the min and max, then assign this scale to our values.
norm = mpl.colors.Normalize(vmin=min(dataPass.total_pass), vmax=max(dataPass.total_pass))
colors = [mpl.cm.Blues(norm(value)) for value in dataPass.total_pass]
# Create our plot and resize it.
fig = plt.gcf()
ax = fig.add_subplot()
fig.set_size_inches(16, 4.5)
# Use squarify to plot our data, label it and add colours. We add an alpha layer to ensure black labels show through
squarify.plot(label=dataPass.player_name,sizes=dataPass.total_pass, color = colors, alpha=.6)
plt.title("Passing data",fontsize=23,fontweight="bold")
# Remove our axes and display the plot
plt.axis('off')
plt.show()
dataDribble= summary_data[summary_data["total_dribble"]>0]
# Utilise matplotlib to scale our dribble totals between the min and max, then assign this scale to our values.
norm = mpl.colors.Normalize(vmin=min(dataDribble.total_dribble), vmax=max(dataDribble.total_dribble))
colors = [mpl.cm.Blues(norm(value)) for value in dataDribble.total_dribble]
# Create our plot and resize it.
fig = plt.gcf()
ax = fig.add_subplot()
fig.set_size_inches(16, 4.5)
# Use squarify to plot our data, label it and add colours. We add an alpha layer to ensure black labels show through
squarify.plot(label=dataDribble.player_name,sizes=dataDribble.total_dribble, color = colors, alpha=.6)
plt.title("Total dribble",fontsize=23,fontweight="bold")
# Remove our axes and display the plot
plt.axis('off')
plt.show()
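The two treemap cells above differ only in the column they plot and the filter threshold, so the shared preparation could be factored out. This is a sketch under assumptions: `treemap_inputs` is a hypothetical helper, and it only prepares the `sizes`, `label`, and `color` arguments rather than drawing anything.

```python
import pandas as pd
from matplotlib import cm, colors

def treemap_inputs(data, value_col, label_col="player_name", min_value=0):
    """Filter rows above min_value and build sizes, labels and scaled colours."""
    subset = data[data[value_col] > min_value]
    # Scale the values to the 0-1 range so they can index the Blues colormap
    norm = colors.Normalize(vmin=subset[value_col].min(), vmax=subset[value_col].max())
    return {
        "sizes": subset[value_col].tolist(),
        "label": subset[label_col].tolist(),
        "color": [cm.Blues(norm(value)) for value in subset[value_col]],
    }

# Toy table (made-up numbers) to show the output structure
toy_summary = pd.DataFrame({
    "player_name": ["A", "B", "C"],
    "total_pass": [10, 60, 110],
})
inputs = treemap_inputs(toy_summary, "total_pass", min_value=50)
print(inputs["label"], inputs["sizes"])
```

The result could then be unpacked into the plotting call, for example `squarify.plot(**treemap_inputs(summary_data, "total_pass", min_value=50), alpha=.6)`.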
This notebook aims to demonstrate how to conduct an EDA on a new dataset, using pandas to create, clean, and wrangle DataFrames, and matplotlib to plot the data.
In this workbook, we have taken a dataset of football event data and, through Exploratory Data Analysis, determined the following:
To conduct our analysis, we have used the following libraries and modules for the following tasks:
We have also demonstrated an array of techniques in Python using the following methods and functions:
*Visit my website EddWebster.com or my GitHub Repository for more projects. If you'd like to get in contact, my email is: edd.j.webster@gmail.com.*