Last updated: 31/08/2020
Click here to jump straight to the Exploratory Data Analysis section and skip the Task Brief, Data Sources, and Data Engineering sections. Or click here to jump straight to the Conclusion.
This notebook scrapes player valuation data from TransferMarkt, using Beautifulsoup for web scraping and pandas for data manipulation through DataFrames.
For more information about this notebook and the author, I'm available through all the following channels:
This notebook was written using Python 3 and requires the following libraries:
- Jupyter notebooks for this notebook environment with which this project is presented;
- NumPy for multidimensional array computing;
- pandas for data analysis and manipulation;
- tqdm for a clean progress bar;
- requests for executing HTTP requests;
- Beautifulsoup for web scraping; and
- matplotlib for data visualisations.

All packages used for this notebook except for BeautifulSoup can be obtained by downloading and installing the Conda distribution, available on all platforms (Windows, Linux and Mac OSX). Step-by-step guides on how to install Anaconda can be found for Windows here and Mac here, as well as in the Anaconda documentation itself here.
# Python ≥3.5 (ideally)
import platform
import sys, getopt
assert sys.version_info >= (3, 5)
import csv
# Import Dependencies
%matplotlib inline
# Math Operations
import numpy as np
from math import pi
# Datetime
import datetime
from datetime import date
import time
# Data Preprocessing
import pandas as pd # version 1.0.3
import os # used to read the csv filenames
import re
import random
from io import BytesIO
from pathlib import Path
# Reading directories
import glob
import os
# Working with JSON
import json
from pandas.io.json import json_normalize
# Web Scraping
import requests
from bs4 import BeautifulSoup
import re
# Fuzzy Matching - Record Linkage
import recordlinkage
import jellyfish
import numexpr as ne
# Data Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')
import missingno as msno # visually display missing data
# Progress Bar
from tqdm import tqdm # a clean progress bar library
# Display in Jupyter
from IPython.display import Image, YouTubeVideo
from IPython.core.display import HTML
# Ignore Warnings
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")
print('Setup Complete')
Setup Complete
# Python / module versions used here for reference
print('Python: {}'.format(platform.python_version()))
print('NumPy: {}'.format(np.__version__))
print('pandas: {}'.format(pd.__version__))
print('matplotlib: {}'.format(mpl.__version__))
print('Seaborn: {}'.format(sns.__version__))
Python: 3.7.6 NumPy: 1.18.1 pandas: 1.0.1 matplotlib: 3.1.3 Seaborn: 0.10.0
# Define today's date
today = datetime.datetime.now().strftime('%d/%m/%Y').replace('/', '')
# Set up initial paths to subfolders
base_dir = os.path.join('..', '..', )
data_dir = os.path.join(base_dir, 'data')
data_dir_fbref = os.path.join(base_dir, 'data', 'fbref')
data_dir_tm = os.path.join(base_dir, 'data', 'tm')
img_dir = os.path.join(base_dir, 'img')
fig_dir = os.path.join(base_dir, 'img', 'fig')
video_dir = os.path.join(base_dir, 'video')
This Jupyter notebook explores how to scrape football data from TransferMarkt, using pandas for data manipulation through DataFrames and Beautifulsoup for web scraping.
The player value data produced in this notebook is exported to CSV. This data can be further analysed in Python, joined to other datasets, or explored using Tableau, PowerBI, or Microsoft Excel.
TransferMarkt is a German-based website owned by Axel Springer and is the leading website for the football transfer market. The website posts football-related data, including: scores and results, football news, transfer rumours, and, most usefully for us, calculated estimates of the market values of teams and individual players.
To read more about how these estimations are made, Beyond crowd judgments: Data-driven estimation of market value in association football by Oliver Müller, Alexander Simons, and Markus Weinmann does an excellent job of explaining the methodology and its level of accuracy.
Before conducting our EDA, the data needs to be imported as a DataFrame in the Data Sources section (Section 3) and cleaned in the Data Engineering section (Section 4).
We'll be using the pandas library to import our data to this workbook as a DataFrame.
The TransferMarkt dataset has six features (columns) with the following definitions and data types:
Feature | Data type |
---|---|
position_number | object |
position_description | object |
name | object |
dob | object |
nationality | object |
value | object |
Before scraping data from TransferMarkt, we need to look at the top five leagues that we wish to scrape.
The web scraper for TransferMarkt is made up of two parts: first, collecting the links to each club's squad page for the chosen leagues; and second, extracting the player information from each of those squad pages.
The information collected for all the players is then converted to a pandas DataFrame, from which we can view and manipulate the data.
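The list-zipping step at the heart of the scraper can be sketched in isolation. The short column lists below are hypothetical stand-ins for the values extracted from a single squad page:

```python
import pandas as pd

# Hypothetical extracted lists, one element per player on a squad page
pn = ['1', '13']
pos = ['Goalkeeper', 'Goalkeeper']
name = ['Thibaut Courtois', 'Andriy Lunin']
dob = ['May 11, 1992 (28)', 'Feb 11, 1999 (21)']
nat = ['Belgium', 'Ukraine']
val = ['£54.00m', '£2.43m']

# zip the parallel lists into rows, then build a DataFrame with named columns
df = pd.DataFrame(
    zip(pn, pos, name, dob, nat, val),
    columns=['position_number', 'position_description', 'name',
             'dob', 'nationality', 'value']
)
print(df.shape)  # (2, 6)
```

Each squad page produces one such DataFrame; the scraper appends them to a list and concatenates them at the end with pd.concat().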
An example webpage for a football league is the following: https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/GB1/plus/?saison_id=2019. As we can see, between the '/wettbewerb/' subdirectory path and '/plus/', there is a short league code. For the Premier League, the code is GB1.
In order to scrape the webpages, the TransferMarkt codes of the top five leagues need to be recorded, which are the following:
League Name on FIFA | Country | Corresponding TransferMarkt League Code |
---|---|---|
LaLiga Santander | Spain | ES1 |
Ligue 1 Conforama | France | FR1 |
Premier League | England | GB1 |
Serie A TIM | Italy | IT1 |
Bundesliga | Germany | L1 |
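Putting these codes together with the URL pattern, each league's overview URL can be built with str.format(), mirroring the template passed to main() later in the notebook:

```python
# URL template with a placeholder for the league code
url = 'https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id=2020'
lst_leagues = ['ES1', 'FR1', 'GB1', 'IT1', 'L1']

# Substitute each league code into the template
league_urls = [url.format(league) for league in lst_leagues]
print(league_urls[2])
# https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/GB1/plus/?saison_id=2020
```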
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'
}
# List of leagues by code for which we want to scrape player data - Big 5 European leagues
lst_leagues = ['ES1', 'FR1', 'GB1', 'IT1', 'L1']
# Assign season by year to season variable e.g. 2014/15 season = 2014
season = '2020' # 2020/21 season
# Run this script to scrape latest version of this data from TransferMarkt
## Start timer
tic = datetime.datetime.now()
## Scrape TransferMarkt data
def main(url):
with requests.Session() as req:
links = []
for league in lst_leagues:
print(f'Fetching Links from {league}')
r = req.get(url.format(league), headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
link = [f"{url[:31]}{item.next_element.get('href')}" for item in soup.findAll(
"td", class_="hauptlink no-border-links hide-for-small hide-for-pad")]
links.extend(link)
print(f'Collected {len(links)} Links')
goals = []
for num, link in enumerate(links):
print(f"Extracting Page# {num +1}")
r = req.get(link, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
target = soup.find("table", class_="items")
pn = [pn.text for pn in target.select("div.rn_nummer")]
pos = [pos.text for pos in target.findAll("td", class_=False)]
name = [name.text for name in target.select("td.hide")]
dob = [date.find_next(
"td").text for date in target.select("td.hide")]
nat = [" / ".join([a.get("alt") for a in nat.find_all_next("td")[1] if a.get("alt")]) for nat in target.findAll(
"td", itemprop="athlete")]
val = [val.get_text(strip=True)
for val in target.select('td.rechts.hauptlink')]
goal = zip(pn, pos, name, dob, nat, val)
df = pd.DataFrame(goal, columns=[
'position_number', 'position_description', 'name', 'dob', 'nationality', 'value'])
goals.append(df)
new = pd.concat(goals)
new.to_csv(data_dir_tm + '/raw/' + f'players_big5_2021_raw_{today}.csv', index=None, header=True)
main("https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id=2020")
#main('https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id=' + season)
## End timer
toc = datetime.datetime.now()
## Calculate time take
total_time = (toc-tic).total_seconds()
print(f'Time taken to scrape data for the Big 5 leagues is: {total_time:0.2f} seconds.')
Fetching Links from ES1 Fetching Links from FR1 Fetching Links from GB1 Fetching Links from IT1 Fetching Links from L1 Collected 98 Links Extracting Page# 1 Extracting Page# 2 Extracting Page# 3 Extracting Page# 4 Extracting Page# 5 Extracting Page# 6 Extracting Page# 7 Extracting Page# 8 Extracting Page# 9 Extracting Page# 10 Extracting Page# 11 Extracting Page# 12 Extracting Page# 13 Extracting Page# 14 Extracting Page# 15 Extracting Page# 16 Extracting Page# 17 Extracting Page# 18 Extracting Page# 19 Extracting Page# 20 Extracting Page# 21 Extracting Page# 22 Extracting Page# 23 Extracting Page# 24 Extracting Page# 25 Extracting Page# 26 Extracting Page# 27 Extracting Page# 28 Extracting Page# 29 Extracting Page# 30 Extracting Page# 31 Extracting Page# 32 Extracting Page# 33 Extracting Page# 34 Extracting Page# 35 Extracting Page# 36 Extracting Page# 37 Extracting Page# 38 Extracting Page# 39 Extracting Page# 40 Extracting Page# 41 Extracting Page# 42 Extracting Page# 43 Extracting Page# 44 Extracting Page# 45 Extracting Page# 46 Extracting Page# 47 Extracting Page# 48 Extracting Page# 49 Extracting Page# 50 Extracting Page# 51 Extracting Page# 52 Extracting Page# 53 Extracting Page# 54 Extracting Page# 55 Extracting Page# 56 Extracting Page# 57 Extracting Page# 58 Extracting Page# 59 Extracting Page# 60 Extracting Page# 61 Extracting Page# 62 Extracting Page# 63 Extracting Page# 64 Extracting Page# 65 Extracting Page# 66 Extracting Page# 67 Extracting Page# 68 Extracting Page# 69 Extracting Page# 70 Extracting Page# 71 Extracting Page# 72 Extracting Page# 73 Extracting Page# 74 Extracting Page# 75 Extracting Page# 76 Extracting Page# 77 Extracting Page# 78 Extracting Page# 79 Extracting Page# 80 Extracting Page# 81 Extracting Page# 82 Extracting Page# 83 Extracting Page# 84 Extracting Page# 85 Extracting Page# 86 Extracting Page# 87 Extracting Page# 88 Extracting Page# 89 Extracting Page# 90 Extracting Page# 91 Extracting Page# 92 Extracting Page# 93 Extracting 
Page# 94 Extracting Page# 95 Extracting Page# 96 Extracting Page# 97 Extracting Page# 98 Time taken to scrape data for the Big 5 leagues is: 161.51 seconds.
# Import data as a pandas DataFrame, df_tm_players_big5_2021_raw
## Look for most recent CSV file
list_of_files = glob.glob(data_dir_tm + '/raw/*') # * means all if need specific format then *.csv
filepath_latest_tm = max(list_of_files, key=os.path.getctime)
## Load in most recently parsed CSV file
df_tm_player_top5_2021_raw = pd.read_csv(filepath_latest_tm)
Let's check the quality of the dataset by looking at the first and last rows in pandas using the head() and tail() methods.
# Display the first 5 rows of the raw DataFrame, df_tm_player_top5_2021_raw
df_tm_player_top5_2021_raw.head()
position_number | position_description | name | dob | nationality | value | |
---|---|---|---|---|---|---|
0 | 1 | Goalkeeper | Thibaut Courtois | May 11, 1992 (28) | Belgium | £54.00m |
1 | 13 | Goalkeeper | Andriy Lunin | Feb 11, 1999 (21) | Ukraine | £2.43m |
2 | 26 | Goalkeeper | Diego Altube | Feb 22, 2000 (20) | Spain | £90Th. |
3 | 5 | Centre-Back | Raphaël Varane | Apr 25, 1993 (27) | France / Martinique | £57.60m |
4 | 3 | Centre-Back | Éder Militão | Jan 18, 1998 (22) | Brazil | £32.40m |
# Display the last 5 rows of the raw DataFrame, df_tm_player_top5_2021_raw
df_tm_player_top5_2021_raw.tail()
position_number | position_description | name | dob | nationality | value | |
---|---|---|---|---|---|---|
2809 | 18 | Centre-Forward | Sergio Córdova | Aug 9, 1997 (23) | Venezuela | £1.44m |
2810 | 9 | Centre-Forward | Fabian Klos | Dec 2, 1987 (32) | Germany | £900Th. |
2811 | 13 | Centre-Forward | Sebastian Müller | Jan 23, 2001 (19) | Germany | £270Th. |
2812 | 36 | Centre-Forward | Sven Schipplock | Nov 8, 1988 (31) | Germany | £270Th. |
2813 | 39 | Centre-Forward | Prince Osei Owusu | Jan 7, 1997 (23) | Germany / Ghana | £225Th. |
# Print the shape of the raw DataFrame, df_tm_player_top5_2021_raw
print(df_tm_player_top5_2021_raw.shape)
(2814, 6)
# Print the column names of the raw DataFrame, df_tm_player_top5_2021_raw
print(df_tm_player_top5_2021_raw.columns)
Index(['position_number', 'position_description', 'name', 'dob', 'nationality', 'value'], dtype='object')
The dataset has six features (columns). Full details of these attributes can be found in the Data Dictionary.
# Data types of the features of the raw DataFrame, df_tm_player_top5_2021_raw
df_tm_player_top5_2021_raw.dtypes
position_number object position_description object name object dob object nationality object value object dtype: object
All six of the columns have the object data type. Full details of these attributes and their data types can be found in the Data Dictionary.
# Info for the raw DataFrame, df_tm_player_top5_2021_raw
df_tm_player_top5_2021_raw.info()
# Description of the raw DataFrame, df_tm_player_top5_2021_raw, showing some summary statistics for each numerical column in the DataFrame
df_tm_player_top5_2021_raw.describe()
position_number | position_description | name | dob | nationality | value | |
---|---|---|---|---|---|---|
count | 2814 | 2814 | 2814 | 2814 | 2814 | 2781 |
unique | 85 | 14 | 2806 | 2209 | 416 | 146 |
top | - | Centre-Back | Danilo | Sep 1, 1993 (26) | Spain | £1.08m |
freq | 320 | 500 | 3 | 5 | 407 | 88 |
# Plot visualisation of the missing values for each feature of the raw DataFrame, df_tm_player_top5_2021_raw
msno.matrix(df_tm_player_top5_2021_raw, figsize = (30, 7))
<matplotlib.axes._subplots.AxesSubplot at 0x1a204ece10>
# Counts of missing values
tm_null_value_stats = df_tm_player_top5_2021_raw.isnull().sum(axis=0)
tm_null_value_stats[tm_null_value_stats != 0]
value 33 dtype: int64
The visualisation shows us very quickly that there are a few missing values in the value column, but otherwise the dataset is complete.
Before we answer the questions in the brief through Exploratory Data Analysis (EDA), we'll first need to clean and wrangle the datasets to a form that meet our needs.
# Assign the raw DataFrame to a new engineered DataFrame (as a copy, so the raw DataFrame is left untouched)
df_tm_player_top5_2021 = df_tm_player_top5_2021_raw.copy()
df_tm_player_top5_2021['name_lower'] = df_tm_player_top5_2021['name'].str.normalize('NFKD')\
.str.encode('ascii', errors='ignore')\
.str.decode('utf-8')\
.str.lower()
# First Name Lower
df_tm_player_top5_2021['firstname_lower'] = df_tm_player_top5_2021['name_lower'].str.split(' ').str[0]
# Last Name Lower
df_tm_player_top5_2021['lastname_lower'] = df_tm_player_top5_2021['name_lower'].str.rsplit(' ', 1).str[-1]
# First Initial Lower
df_tm_player_top5_2021['firstinitial_lower'] = df_tm_player_top5_2021['name_lower'].astype(str).str[0]
The dob column is messy and contains both the date of birth as a string and the age in brackets.
This string cleaning consists of two parts: firstly, splitting the string into its separate components; then, once the age column is created, we will replace it by determining the current age using the Python datetime module.
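The extraction pattern can first be checked against a single sample string. This standalone sketch uses Python's re module on a made-up date-of-birth value in the TransferMarkt format:

```python
import re

dob = 'May 11, 1992 (28)'  # sample string in the TransferMarkt format

# Capture the date text and the bracketed age as separate groups
match = re.match(r'(.+) \((\d+)\)', dob)
birth_date, age = match.groups()
print(birth_date)  # May 11, 1992
print(age)         # 28
```

The same regular expression is what pandas' str.extract() applies row by row in the cell below.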
# DoB string cleaning to create birth_date and age columns
df_tm_player_top5_2021[['birth_date', 'age']] = df_tm_player_top5_2021['dob'].str.extract(r'(.+) \((\d+)\)')
For the nationality, some of the players have dual nationality.
For example, Claudio Pizarro is a Peruvian-born player who has made 85 appearances for Peru, scoring 20 goals. However, his citizenship according to TransferMarkt is 'Peru / Italy'. For our needs, we only want to know the country the player is eligible to play for, not their full heritage, which from observations is always the first part of the string. We'll therefore discard anything after the first ' /' in the string to form a new playing_country column.
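On a plain string, the split logic looks like this; 'Peru / Italy' is the example cited above:

```python
nationality = 'Peru / Italy'

# Keep only the text before the ' /' separator, i.e. the playing country
playing_country = nationality.split(' /')[0]
print(playing_country)  # Peru
```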
# Take the first nationality i.e. text before the first space, ex. 'Peru / Italy'
df_tm_player_top5_2021['playing_country'] = df_tm_player_top5_2021['nationality'].str.split(' /').str[0]
The values of the players have prefixes (£), commas, spaces, and suffixes (m, k, Th.) that need to be cleaned and replaced before converting to a numerical value.
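Before applying the chained str.replace() calls below, the conversion logic can be sanity-checked with a small standalone function. parse_value here is a hypothetical helper for illustration only, not part of the notebook's pipeline:

```python
def parse_value(value: str) -> float:
    """Convert a TransferMarkt value string such as '£54.00m' or
    '£90Th.' to a plain number of pounds (a rough sketch)."""
    value = value.replace('£', '').strip()
    if value.endswith('m'):          # millions, e.g. '54.00m'
        return float(value[:-1]) * 1_000_000
    if value.endswith('Th.'):        # thousands, e.g. '90Th.'
        return float(value[:-3]) * 1_000
    return float(value) if value else float('nan')

print(parse_value('£54.00m'))  # 54000000.0
print(parse_value('£90Th.'))   # 90000.0
```

The string-replacement approach in the cells below achieves the same result entirely within pandas, which avoids a Python-level loop over the rows.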
# Value string cleaning from shortened string value to full numerical value
## Convert 'm' to '0000' (values have two decimal places, so e.g. '54.00m' becomes '54000000' once the '.' is removed)
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('m','0000')
## Convert 'k' to '000'
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('k','000')
## Convert 'Th' to '000'
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('Th','000')
## Remove '.' (as a literal string, not a regex metacharacter)
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('.', '', regex=False)
## Remove '£' sign
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('£','')
## Remove '-'
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('-','')
## Remove '  '
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('  ','')
## Remove gaps
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace(' ','')
First we need to convert the newly created birth_date column from the object data type to datetime64[ns], using the pd.to_datetime() function.
# Convert birth_date from string to datetime64[ns]
df_tm_player_top5_2021['birth_date'] = pd.to_datetime(df_tm_player_top5_2021['birth_date'])
# Date and time manipulation
from datetime import datetime
# Redetermine the age using the newly created birth_date column (after formatted to datetime data type)
## Remove all not numeric values use to_numeric with parameter errors='coerce' - it replaces non numeric to NaNs
df_tm_player_top5_2021['age'] = pd.to_numeric(df_tm_player_top5_2021['age'], errors='coerce')
## Convert floats to integers and leave null values
df_tm_player_top5_2021['age'] = np.nan_to_num(df_tm_player_top5_2021['age']).astype(int)
## Calculate current age
today = datetime.today()
df_tm_player_top5_2021['age'] = df_tm_player_top5_2021['birth_date'].apply(lambda x: today.year - x.year -
((today.month, today.day) < (x.month, x.day))
)
# df_tm_player_top5_2021['age'] = pd.to_numeric(df_tm_player_top5_2021['age'], downcast='signed')
The value column needs to be converted from a string to a numeric value using the pd.to_numeric() method.
# Convert string to integer
df_tm_player_top5_2021['value'] = pd.to_numeric(df_tm_player_top5_2021['value'])
...
sorted(df_tm_player_top5_2021['position_description'].unique())
['Attacking Midfield', 'Central Midfield', 'Centre-Back', 'Centre-Forward', 'Defensive Midfield', 'Goalkeeper', 'Left Midfield', 'Left Winger', 'Left-Back', 'Midfielder', 'Right Midfield', 'Right Winger', 'Right-Back', 'Second Striker']
dict_positions_tm = {
    'Attacking Midfield': 'Midfielder',
    'Central Midfield': 'Midfielder',
    'Centre-Back': 'Defender',
    'Centre-Forward': 'Forward',
    'Defensive Midfield': 'Midfielder',
    'Forward': 'Forward',
    'Goalkeeper': 'Goalkeeper',
    'Left Midfield': 'Midfielder',
    'Left Winger': 'Forward',
    'Left-Back': 'Defender',
    'Midfielder': 'Midfielder',
    'Right Midfield': 'Midfielder',
    'Right Winger': 'Forward',
    'Right-Back': 'Defender',
    'Second Striker': 'Forward'
}
df_tm_player_top5_2021['position_description_cleaned'] = df_tm_player_top5_2021['position_description'].map(dict_positions_tm)
Create new attributes for birth month and birth year.
df_tm_player_top5_2021['birth_year'] = pd.DatetimeIndex(df_tm_player_top5_2021['birth_date']).year
df_tm_player_top5_2021['birth_month'] = pd.DatetimeIndex(df_tm_player_top5_2021['birth_date']).month
We are interested in the following thirteen columns in the TransferMarkt dataset:
name
name_lower
firstinitial_lower
firstname_lower
lastname_lower
position_description
position_description_cleaned
value
birth_date
birth_year
birth_month
age
playing_country
# Select columns of interest
df_tm_player_top5_2021 = df_tm_player_top5_2021[['name', 'name_lower', 'firstinitial_lower', 'firstname_lower', 'lastname_lower', 'position_description', 'position_description_cleaned', 'value', 'birth_date', 'birth_year', 'birth_month', 'age', 'playing_country']]
# Assign df_tm_player_top5_2021 to a new DataFrame, df_tm_player_top5_all_2021, to represent all the players
df_tm_player_top5_all_2021 = df_tm_player_top5_2021
# Filter rows where position_description is not equal to 'Goalkeeper'
df_tm_player_top5_outfield_2021 = df_tm_player_top5_all_2021[df_tm_player_top5_all_2021['position_description'] != 'Goalkeeper']
# Filter rows where position_description is equal to 'Goalkeeper'
df_tm_player_top5_goalkeeper_2021 = df_tm_player_top5_all_2021[df_tm_player_top5_all_2021['position_description'] == 'Goalkeeper']
df_tm_player_top5_all_2021.head()
name | name_lower | firstinitial_lower | firstname_lower | lastname_lower | position_description | position_description_cleaned | value | birth_date | birth_year | birth_month | age | playing_country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Thibaut Courtois | thibaut courtois | t | thibaut | courtois | Goalkeeper | Goalkeeper | 54000000.0 | 1992-05-11 | 1992 | 5 | 28 | Belgium |
1 | Andriy Lunin | andriy lunin | a | andriy | lunin | Goalkeeper | Goalkeeper | 2430000.0 | 1999-02-11 | 1999 | 2 | 21 | Ukraine |
2 | Diego Altube | diego altube | d | diego | altube | Goalkeeper | Goalkeeper | 90000.0 | 2000-02-22 | 2000 | 2 | 20 | Spain |
3 | Raphaël Varane | raphael varane | r | raphael | varane | Centre-Back | Defender | 57600000.0 | 1993-04-25 | 1993 | 4 | 27 | France |
4 | Éder Militão | eder militao | e | eder | militao | Centre-Back | Defender | 32400000.0 | 1998-01-18 | 1998 | 1 | 22 | Brazil |
df_tm_player_top5_outfield_2021.head()
name | name_lower | firstinitial_lower | firstname_lower | lastname_lower | position_description | position_description_cleaned | value | birth_date | birth_year | birth_month | age | playing_country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | Raphaël Varane | raphael varane | r | raphael | varane | Centre-Back | Defender | 57600000.0 | 1993-04-25 | 1993 | 4 | 27 | France |
4 | Éder Militão | eder militao | e | eder | militao | Centre-Back | Defender | 32400000.0 | 1998-01-18 | 1998 | 1 | 22 | Brazil |
5 | Sergio Ramos | sergio ramos | s | sergio | ramos | Centre-Back | Defender | 13050000.0 | 1986-03-30 | 1986 | 3 | 34 | Spain |
6 | Nacho Fernández | nacho fernandez | n | nacho | fernandez | Centre-Back | Defender | 10800000.0 | 1990-01-18 | 1990 | 1 | 30 | Spain |
7 | Ferland Mendy | ferland mendy | f | ferland | mendy | Left-Back | Defender | 36000000.0 | 1995-06-08 | 1995 | 6 | 25 | France |
df_tm_player_top5_goalkeeper_2021.head()
name | name_lower | firstinitial_lower | firstname_lower | lastname_lower | position_description | position_description_cleaned | value | birth_date | birth_year | birth_month | age | playing_country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Thibaut Courtois | thibaut courtois | t | thibaut | courtois | Goalkeeper | Goalkeeper | 54000000.0 | 1992-05-11 | 1992 | 5 | 28 | Belgium |
1 | Andriy Lunin | andriy lunin | a | andriy | lunin | Goalkeeper | Goalkeeper | 2430000.0 | 1999-02-11 | 1999 | 2 | 21 | Ukraine |
2 | Diego Altube | diego altube | d | diego | altube | Goalkeeper | Goalkeeper | 90000.0 | 2000-02-22 | 2000 | 2 | 20 | Spain |
33 | Marc-André ter Stegen | marc-andre ter stegen | m | marc-andre | stegen | Goalkeeper | Goalkeeper | 64800000.0 | 1992-04-30 | 1992 | 4 | 28 | Germany |
34 | Neto | neto | n | neto | neto | Goalkeeper | Goalkeeper | 13050000.0 | 1989-07-19 | 1989 | 7 | 31 | Brazil |
Export the three engineered TransferMarkt DataFrames as CSV files.
# Datetime
import datetime
from datetime import date
import time
# Define today's date
today = datetime.datetime.now().strftime('%d/%m/%Y').replace('/', '')
# Export the three DataFrames
df_tm_player_top5_all_2021.to_csv(data_dir_tm + '/engineered/all/' + f'all_big5_2021_{today}.csv', index=None, header=True)
df_tm_player_top5_outfield_2021.to_csv(data_dir_tm + '/engineered/player/' + f'outfield_big5_2021_{today}.csv', index=None, header=True)
df_tm_player_top5_goalkeeper_2021.to_csv(data_dir_tm + '/engineered/goalkeeper/' + f'goalkeeper_big5_2021_{today}.csv', index=None, header=True)
Now that we have created three pandas DataFrames and wrangled the data to meet our needs, we'll next conduct an Exploratory Data Analysis.
This notebook scraped player valuation data from TransferMarkt, using Beautifulsoup for web scraping and pandas for data manipulation through DataFrames.
To conduct our analysis, we have used the following libraries and modules for the following tasks:
We have also demonstrated an array of techniques in Python using the following methods and functions:
*Visit my website EddWebster.com or my GitHub Repository for more projects. If you'd like to get in contact, my Twitter handle is @eddwebster and my email is: edd.j.webster@gmail.com.*