Webscraping of TransferMarkt Data¶

Notebook to scrape raw data using Beautifulsoup from TransferMarkt.¶

By Edd Webster ¶

Last updated: 31/08/2020

Click here to jump straight to the Exploratory Data Analysis section and skip the Task Brief, Data Sources, and Data Engineering sections. Or click here to jump straight to the Conclusion.

Introduction ¶

This notebook scrapes data for player valuations using Beautifulsoup from TransferMarkt using pandas for data maniuplation through DataFrames, Beautifulsoup for webscraping.

For more information about this notebook and the author, I'm available through all the following channels:

The accompanying GitHub repository for this notebook can be found here and a static version of this notebook can be found here.

1. Notebook Dependencies ¶

This notebook was written using Python 3 and requires the following libraries:

Jupyter notebooks for this notebook environment with which this project is presented;
NumPy for multidimensional array computing;
pandas for data analysis and manipulation;
tqdm for a clean progress bar;
requests for executing HTTP requests;
Beautifulsoup for web scraping; and
matplotlib for data visualisations;

All packages used for this notebook except for BeautifulSoup can be obtained by downloading and installing the Conda distribution, available on all platforms (Windows, Linux and Mac OSX). Step-by-step guides on how to install Anaconda can be found for Windows here and Mac here, as well as in the Anaconda documentation itself here.

Import Libraries and Modules¶

In [1]:

# Python ≥3.5 (ideally)
import platform
import sys, getopt
assert sys.version_info >= (3, 5)
import csv

# Import Dependencies
%matplotlib inline

# Math Operations
import numpy as np
from math import pi

# Datetime
import datetime
from datetime import date
import time

# Data Preprocessing
import pandas as pd    # version 1.0.3
import os    #  used to read the csv filenames
import re
import random
from io import BytesIO
from pathlib import Path

# Reading directories
import glob
import os

# Working with JSON
import json
from pandas.io.json import json_normalize

# Web Scraping
import requests
from bs4 import BeautifulSoup
import re

# Fuzzy Matching - Record Linkage
import recordlinkage
import jellyfish
import numexpr as ne

# Data Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')
import missingno as msno    # visually display missing data

# Progress Bar
from tqdm import tqdm    # a clean progress bar library

# Display in Jupyter
from IPython.display import Image, YouTubeVideo
from IPython.core.display import HTML

# Ignore Warnings
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

print('Setup Complete')

Setup Complete

In [2]:

# Python / module versions used here for reference
print('Python: {}'.format(platform.python_version()))
print('NumPy: {}'.format(np.__version__))
print('pandas: {}'.format(pd.__version__))
print('matplotlib: {}'.format(mpl.__version__))
print('Seaborn: {}'.format(sns.__version__))

Python: 3.7.6
NumPy: 1.18.1
pandas: 1.0.1
matplotlib: 3.1.3
Seaborn: 0.10.0

Defined Variables¶

In [3]:

# Define today's date
today = datetime.datetime.now().strftime('%d/%m/%Y').replace('/', '')

Defined Filepaths¶

In [4]:

# Set up initial paths to subfolders
base_dir = os.path.join('..', '..', )
data_dir = os.path.join(base_dir, 'data')
data_dir_fbref = os.path.join(base_dir, 'data', 'fbref')
data_dir_tm = os.path.join(base_dir, 'data', 'tm')
img_dir = os.path.join(base_dir, 'img')
fig_dir = os.path.join(base_dir, 'img', 'fig')
video_dir = os.path.join(base_dir, 'video')

2. Project Brief ¶

This Jupyter notebook explores how to scrape football data from TransferMarkt, using pandas for data maniuplation through DataFrames and Beautifulsoup for webscraping.

The data of player values produced in this notebook is exported to CSV. This data can be further analysed in Python, joined to other datasets, or explored using Tableau, PowerBI, Microsoft Excel.

3. Data Sources ¶

3.1. Introduction ¶

TransferMarkt is a German-based website owned by Axel Springer and is the leading website for the football transfer market. The website posts football related data, including: scores and results, football news, transfer rumours, and most usefully for us - calculated estimates ofthe market values for teams and individual players.

To read more about how these estimations are made, Beyond crowd judgments: Data-driven estimation of market value in association football by Oliver Müllera, Alexander Simons, and Markus Weinmann does an excellent job of explaining how the estimations are made and their level of accuracy.

Before conducting our EDA, the data needs to be imported as a DataFrame in the Data Sources section Section 3 and Cleaned in the Data Engineering section Section 4.

We'll be using the pandas library to import our data to this workbook as a DataFrame.

3.2. Data Dictionaries ¶

The TransferMarkt dataset has six features (columns) with the following definitions and data types:

Feature	Data type
`position_number`	object
`position_description`	object
`name`	object
`dob`	object
`nationality`	object
`value`	object

3.3. Creating the DataFrame - scraping the data ¶

Before scraping data from TransferMarkt, we need to look at the top five leagues that we wish to scrape.

The web scraper for TransferMarkt is made up of two parts:

In the first part, the scraper takes the webpages for each of the individual leagues e.g. The Premier League, and extract the hyperlinks to the pages of all the individual teams in the league table.
In the second part the script, the webscraper uses the list of invidual teams hyperlinks collected in part 1 to then collect the hyperlinks for each of the players for those teams. From this, the scraper can then extract the information we need for each of these players.

This information collected for all the players is converted to a pandas DataFrame from which we can view and manipulate the data.

An example webpage for a football league is the following: https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/GB1/plus/?saison_id=2019. As we can see, between the subdirectory path of '/wettbewerb/' and the '/plus/', there is a 3 or 4 digit code. For The Premier League, the code is GB1.

In order to scrape the webpages from TransferMarkt, the codes of the top five leagues need to be recorded from TransferMarkt, which are the following:

League Name on FIFA	Country	Corresponding TransferMarkt League Code
LaLiga Santander	Spain	ES1
Ligue 1 Conforama	France	FR1
Premier League	England	GB1
Serie A TIM	Italy	IT1
Bundesliga	Germany	L1

See: https://fcpython.com/blog/scraping-lists-transfermarkt-saving-images

In [21]:

import requests
from bs4 import BeautifulSoup
from os.path  import basename

In [22]:

headers = {'User-Agent': 
           'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

In [23]:

#Process League Table
page = 'https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1'
tree = requests.get(page, headers = headers)
soup = BeautifulSoup(tree.content, 'html.parser')

#Create an empty list to assign these values to
teamLinks = []

#Extract all links with the correct CSS selector
links = soup.select("a.vereinprofil_tooltip")

#We need the location that the link is pointing to, so for each link, take the link location. 
#Additionally, we only need the links in locations 1,3,5,etc. of our list, so loop through those only
for i in range(1,41,2):
    teamLinks.append(links[i].get("href"))
    
#For each location that we have taken, add the website before it - this allows us to call it later
for i in range(len(teamLinks)):
    teamLinks[i] = "https://www.transfermarkt.co.uk"+teamLinks[i]

In [24]:

#Create an empty list for our player links to go into
playerLinks = []

#Run the scraper through each of our 20 team links
for i in range(len(teamLinks)):

    #Download and process the team page
    page = teamLinks[i]
    tree = requests.get(page, headers = headers)
    soup = BeautifulSoup(tree.content, 'html.parser')

    #Extract all links
    links = soup.select("a.spielprofil_tooltip")
    
    #For each link, extract the location that it is pointing to
    for j in range(len(links)):
        playerLinks.append(links[j].get("href"))

    #Add the location to the end of the transfermarkt domain to make it ready to scrape
    for j in range(len(playerLinks)):
        playerLinks[j] = "https://www.transfermarkt.co.uk"+playerLinks[j]

    #The page list the players more than once - let's use list(set(XXX)) to remove the duplicates
    playerLinks = list(set(playerLinks))

In [25]:

len(playerLinks)

Out[25]:

In [28]:

for i in range(len(playerLinks)):

    #Take site and structure html
    page = playerLinks[i]
    tree = requests.get(page, headers=headers)
    soup = BeautifulSoup(tree.content, 'html.parser')


    #Find image and save it with the player's name
    #Find the player's name
    name = soup.find_all("h1")
    
    #Use the name to call the image
    image = soup.find_all("img",{"title":name[0].text})
    
    #Extract the location of the image. We also need to strip the text after '?lm', so let's do that through '.split()'.
    src = image[0].get('src').split("?lm")[0]

    #Save the image under the player's name
    with open(name[0].text+".jpg","wb") as f:
        f.write(requests.get(src).content)

---------------------------------------------------------------------------
gaierror                                  Traceback (most recent call last)
/opt/anaconda3/lib/python3.7/site-packages/urllib3/connection.py in _new_conn(self)
    156             conn = connection.create_connection(
--> 157                 (self._dns_host, self.port), self.timeout, **extra_kw
    158             )

/opt/anaconda3/lib/python3.7/site-packages/urllib3/util/connection.py in create_connection(address, timeout, source_address, socket_options)
     60 
---> 61     for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
     62         af, socktype, proto, canonname, sa = res

/opt/anaconda3/lib/python3.7/socket.py in getaddrinfo(host, port, family, type, proto, flags)
    751     addrlist = []
--> 752     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    753         af, socktype, proto, canonname, sa = res

gaierror: [Errno 8] nodename nor servname provided, or not known

During handling of the above exception, another exception occurred:

NewConnectionError                        Traceback (most recent call last)
/opt/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    671                 headers=headers,
--> 672                 chunked=chunked,
    673             )

/opt/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    375         try:
--> 376             self._validate_conn(conn)
    377         except (SocketTimeout, BaseSSLError) as e:

/opt/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py in _validate_conn(self, conn)
    993         if not getattr(conn, "sock", None):  # AppEngine might not have  `.sock`
--> 994             conn.connect()
    995 

/opt/anaconda3/lib/python3.7/site-packages/urllib3/connection.py in connect(self)
    299         # Add certificate verification
--> 300         conn = self._new_conn()
    301         hostname = self.host

/opt/anaconda3/lib/python3.7/site-packages/urllib3/connection.py in _new_conn(self)
    168             raise NewConnectionError(
--> 169                 self, "Failed to establish a new connection: %s" % e
    170             )

NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x1a1b398f90>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known

During handling of the above exception, another exception occurred:

MaxRetryError                             Traceback (most recent call last)
/opt/anaconda3/lib/python3.7/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    448                     retries=self.max_retries,
--> 449                     timeout=timeout
    450                 )

/opt/anaconda3/lib/python3.7/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    719             retries = retries.increment(
--> 720                 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
    721             )

/opt/anaconda3/lib/python3.7/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    435         if new_retry.is_exhausted():
--> 436             raise MaxRetryError(_pool, url, error or ResponseError(cause))
    437 

MaxRetryError: HTTPSConnectionPool(host='www.transfermarkt.co.ukhttps', port=443): Max retries exceeded with url: //www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.uk/david-de-gea/profil/spieler/59377 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x1a1b398f90>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
<ipython-input-28-c90af715f1a0> in <module>
      3     #Take site and structure html
      4     page = playerLinks[i]
----> 5     tree = requests.get(page, headers=headers)
      6     soup = BeautifulSoup(tree.content, 'html.parser')
      7 

/opt/anaconda3/lib/python3.7/site-packages/requests/api.py in get(url, params, **kwargs)
     73 
     74     kwargs.setdefault('allow_redirects', True)
---> 75     return request('get', url, params=params, **kwargs)
     76 
     77 

/opt/anaconda3/lib/python3.7/site-packages/requests/api.py in request(method, url, **kwargs)
     58     # cases, and look like a memory leak in others.
     59     with sessions.Session() as session:
---> 60         return session.request(method=method, url=url, **kwargs)
     61 
     62 

/opt/anaconda3/lib/python3.7/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    531         }
    532         send_kwargs.update(settings)
--> 533         resp = self.send(prep, **send_kwargs)
    534 
    535         return resp

/opt/anaconda3/lib/python3.7/site-packages/requests/sessions.py in send(self, request, **kwargs)
    644 
    645         # Send the request
--> 646         r = adapter.send(request, **kwargs)
    647 
    648         # Total elapsed time of the request (approximately)

/opt/anaconda3/lib/python3.7/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    514                 raise SSLError(e, request=request)
    515 
--> 516             raise ConnectionError(e, request=request)
    517 
    518         except ClosedPoolError as e:

ConnectionError: HTTPSConnectionPool(host='www.transfermarkt.co.ukhttps', port=443): Max retries exceeded with url: //www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.ukhttps://www.transfermarkt.co.uk/david-de-gea/profil/spieler/59377 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x1a1b398f90>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))

In [ ]:

In [27]:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'
}

In [6]:

# List of leagues by code for which we want to scrape player data - Big 5 European leagues
lst_leagues = ['ES1', 'FR1', 'GB1', 'IT1', 'L1']

In [7]:

# Assign season by year to season variable e.g. 2014/15 season = 2014
season = '2020'    # 2020/21 season

In [8]:

from os.path  import basename

In [17]:

### Create empty list of links
playerLinks = []

url = "https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id=2020"

### For loop to iteratre through each league page to collect the team links
for league in lst_leagues:
    print(f'Fetching Links from {league}')
    r = requests.Session() .get(url.format(league), headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    link = [f"{url[:31]}{item.next_element.get('href')}" for item in soup.findAll(
        "td", class_="hauptlink no-border-links hide-for-small hide-for-pad")]
    playerLinks.extend(link)

### Print statement for the number of team links found
print(f'Collected {len(playerLinks)} Links')

Fetching Links from ES1
Fetching Links from FR1
Fetching Links from GB1
Fetching Links from IT1
Fetching Links from L1
Collected 98 Links

In [18]:

len(playerLinks)

Out[18]:

In [20]:

for i in range(len(playerLinks)):

    #Take site and structure html
    page = playerLinks[i]
    tree = requests.get(page, headers=headers)
    soup = BeautifulSoup(tree.content, 'html.parser')


    #Find image and save it with the player's name
    #Find the player's name
    name = soup.find_all("h1")
    
    #Use the name to call the image
    image = soup.find_all("img",{"title":name[0].text})
    
    #Extract the location of the image. We also need to strip the text after '?lm', so let's do that through '.split()'.
    src = image[0].get('src').split("?lm")[0]

    #Save the image under the player's name
    with open(name[0].text+".jpg","wb") as f:
        f.write(requests.get(src).content)

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-20-c90af715f1a0> in <module>
     15 
     16     #Extract the location of the image. We also need to strip the text after '?lm', so let's do that through '.split()'.
---> 17     src = image[0].get('src').split("?lm")[0]
     18 
     19     #Save the image under the player's name

IndexError: list index out of range

In [ ]:

In [12]:

# Run this script to scrape latest version of this data from TransferMarkt

## Start timer
tic = datetime.datetime.now()


## Scrape TransferMarkt data
def main(url):
    with requests.Session() as req:
        
        ### Create empty list of links
        links = []
        
        ### For loop to iteratre through each league page to collect the team links
        for league in lst_leagues:
            print(f'Fetching Links from {league}')
            r = req.get(url.format(league), headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            link = [f"{url[:31]}{item.next_element.get('href')}" for item in soup.findAll(
                "td", class_="hauptlink no-border-links hide-for-small hide-for-pad")]
            links.extend(link)
        
        ### Print statement for the number of team links found
        print(f'Collected {len(links)} Links')
        
        
        
        """
        ### Create empty list of goals
        goals = []
        
        ### For loop to iteratre through each goal to collect each players information and assign to a DF
        for num, link in enumerate(links):
            print(f"Extracting Page# {num +1}")
            r = req.get(link, headers=headers)
            soup = BeautifulSoup(r.content, 'html.parser')
            target = soup.find("table", class_="items")
            pn = [pn.text for pn in target.select("div.rn_nummer")]
            pos = [pos.text for pos in target.findAll("td", class_=False)]
            name = [name.text for name in target.select("td.hide")]
            dob = [date.find_next(
                "td").text for date in target.select("td.hide")]
            nat = [" / ".join([a.get("alt") for a in nat.find_all_next("td")[1] if a.get("alt")]) for nat in target.findAll(
                "td", itemprop="athlete")]
            val = [val.get_text(strip=True)
                   for val in target.select('td.rechts.hauptlink')]
            goal = zip(pn, pos, name, dob, nat, val)
            df = pd.DataFrame(goal, columns=[
                              'position_number', 'position_description', 'name', 'dob', 'nationality', 'value'])
            goals.append(df)
                   
        ### Concontate the list of goals
        new = pd.concat(goals)
        
        ### Save DataFrame to a CSV
        new.to_csv(data_dir_tm + '/raw/' + f'players_big5_2021_raw_{today}.csv', index=None, header=True)
        """
        
## Call defined function
main("https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id=2020")
#main(f'https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id={season}')


## End timer
toc = datetime.datetime.now()


## Calculate time take
total_time = (toc-tic).total_seconds()
print(f'Time taken to scrape the data of all the players for the Big 5 leagues is: {total_time/60:0.2f} minutes.')

Fetching Links from ES1
Fetching Links from FR1
Fetching Links from GB1
Fetching Links from IT1
Fetching Links from L1
Collected 98 Links
Extracting Page# 1
Extracting Page# 2
Extracting Page# 3
Extracting Page# 4
Extracting Page# 5
Extracting Page# 6
Extracting Page# 7
Extracting Page# 8
Extracting Page# 9
Extracting Page# 10
Extracting Page# 11
Extracting Page# 12
Extracting Page# 13
Extracting Page# 14
Extracting Page# 15
Extracting Page# 16
Extracting Page# 17
Extracting Page# 18
Extracting Page# 19
Extracting Page# 20
Extracting Page# 21
Extracting Page# 22
Extracting Page# 23
Extracting Page# 24
Extracting Page# 25
Extracting Page# 26
Extracting Page# 27
Extracting Page# 28
Extracting Page# 29
Extracting Page# 30
Extracting Page# 31
Extracting Page# 32
Extracting Page# 33
Extracting Page# 34
Extracting Page# 35
Extracting Page# 36
Extracting Page# 37
Extracting Page# 38
Extracting Page# 39
Extracting Page# 40
Extracting Page# 41
Extracting Page# 42
Extracting Page# 43
Extracting Page# 44
Extracting Page# 45
Extracting Page# 46
Extracting Page# 47
Extracting Page# 48
Extracting Page# 49
Extracting Page# 50
Extracting Page# 51
Extracting Page# 52
Extracting Page# 53
Extracting Page# 54
Extracting Page# 55
Extracting Page# 56
Extracting Page# 57
Extracting Page# 58
Extracting Page# 59
Extracting Page# 60
Extracting Page# 61
Extracting Page# 62
Extracting Page# 63
Extracting Page# 64
Extracting Page# 65
Extracting Page# 66
Extracting Page# 67
Extracting Page# 68
Extracting Page# 69
Extracting Page# 70
Extracting Page# 71
Extracting Page# 72
Extracting Page# 73
Extracting Page# 74
Extracting Page# 75
Extracting Page# 76
Extracting Page# 77
Extracting Page# 78
Extracting Page# 79
Extracting Page# 80
Extracting Page# 81
Extracting Page# 82
Extracting Page# 83
Extracting Page# 84
Extracting Page# 85
Extracting Page# 86
Extracting Page# 87
Extracting Page# 88
Extracting Page# 89
Extracting Page# 90
Extracting Page# 91
Extracting Page# 92
Extracting Page# 93
Extracting Page# 94
Extracting Page# 95
Extracting Page# 96
Extracting Page# 97
Extracting Page# 98

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-12-a47161833114> in <module>
     52         return links
     53 
---> 54 main("https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id=2020")
     55 #main(f'https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id={season}')
     56 

<ipython-input-12-a47161833114> in main(url)
     39 
     40         #Use the name to call the image
---> 41         image = soup.find_all("img",{"title":name[0].text})
     42 
     43         #Extract the location of the image. We also need to strip the text after '?lm', so let's do that through '.split()'.

AttributeError: 'str' object has no attribute 'text'

In [11]:

# Import data as a pandas DataFrame, df_tm_players_big5_2021_raw

## Look for most recent CSV file
list_of_files = glob.glob(data_dir_tm + '/raw/*')    # * means all if need specific format then *.csv
filepath_latest_tm = max(list_of_files, key=os.path.getctime)

## Load in most recently parsed CSV file
df_tm_player_top5_2021_raw = pd.read_csv(filepath_latest_tm)

3.4. Preliminary Data Handling ¶

Let's quality of the dataset by looking first and last rows in pandas using the head() and tail() methods.

In [12]:

# Display the first 5 rows of the raw DataFrame, df_tm_player_top5_2021_raw
df_tm_player_top5_2021_raw.head()

Out[12]:

	position_number	position_description	name	dob	nationality	value
0	1	Goalkeeper	Marc-André ter Stegen	Apr 30, 1992 (28)	Germany	£64.80m
1	13	Goalkeeper	Neto	Jul 19, 1989 (31)	Brazil / Italy	£13.05m
2	26	Goalkeeper	Iñaki Peña	Mar 2, 1999 (21)	Spain	£2.07m
3	15	Centre-Back	Clément Lenglet	Jun 17, 1995 (25)	France	£43.20m
4	23	Centre-Back	Samuel Umtiti	Nov 14, 1993 (26)	France / Cameroon	£28.80m

In [13]:

# Display the last 5 rows of the raw DataFrame, df_tm_player_top5_2021_raw
df_tm_player_top5_2021_raw.tail()

Out[13]:

	position_number	position_description	name	dob	nationality	value
2885	18	Centre-Forward	Sergio Córdova	Aug 9, 1997 (23)	Venezuela	£1.44m
2886	9	Centre-Forward	Fabian Klos	Dec 2, 1987 (32)	Germany	£900Th.
2887	13	Centre-Forward	Sebastian Müller	Jan 23, 2001 (19)	Germany	£270Th.
2888	36	Centre-Forward	Sven Schipplock	Nov 8, 1988 (31)	Germany	£270Th.
2889	39	Centre-Forward	Prince Osei Owusu	Jan 7, 1997 (23)	Germany / Ghana	£225Th.

In [14]:

# Print the shape of the raw DataFrame, df_tm_player_top5_2021_raw
print(df_tm_player_top5_2021_raw.shape)

(2890, 6)

In [15]:

# Print the column names of the raw DataFrame, df_tm_player_top5_2021_raw
print(df_tm_player_top5_2021_raw.columns)

Index(['position_number', 'position_description', 'name', 'dob', 'nationality',
       'value'],
      dtype='object')

The dataset has six features (columns). Full details of these attributes can be found in the Data Dictionary.

In [16]:

# Data types of the features of the raw DataFrame, df_tm_player_top5_2021_raw
df_tm_player_top5_2021_raw.dtypes

Out[16]:

position_number         object
position_description    object
name                    object
dob                     object
nationality             object
value                   object
dtype: object

All six of the columns have the object data type. Full details of these attributes and their data types can be found in the Data Dictionary.

In [19]:

# Info for the raw DataFrame, df_tm_player_top5_2021_raw
df_tm_player_top5_2021_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2890 entries, 0 to 2889
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   position_number       2890 non-null   object
 1   position_description  2890 non-null   object
 2   name                  2890 non-null   object
 3   dob                   2890 non-null   object
 4   nationality           2890 non-null   object
 5   value                 2854 non-null   object
dtypes: object(6)
memory usage: 135.6+ KB

In [20]:

# Description of the raw DataFrame, df_tm_player_top5_2021_raw, showing some summary statistics for each numberical column in the DataFrame
df_tm_player_top5_2021_raw.describe()

Out[20]:

	position_number	position_description	name	dob	nationality	value
count	2890	2890	2890	2890	2890	2854
unique	82	14	2883	2258	418	149
top	-	Centre-Back	Danilo	Mar 25, 2000 (20)	Spain	£1.08m
freq	476	511	3	5	404	89

In [21]:

# Plot visualisation of the missing values for each feature of the raw DataFrame, df_tm_player_top5_2021_raw
msno.matrix(df_tm_player_top5_2021_raw, figsize = (30, 7))

Out[21]:

<matplotlib.axes._subplots.AxesSubplot at 0x1a1d713090>

In [22]:

# Counts of missing values
tm_null_value_stats = df_tm_player_top5_2021_raw.isnull().sum(axis=0)
tm_null_value_stats[tm_null_value_stats != 0]

Out[22]:

value    36
dtype: int64

The visualisation shows us very quickly that there a few missing values in the value column, but otherwise the dataset is complete.

4. Data Engineering ¶

4.1. Introduction ¶

Before we answer the questions in the brief through Exploratory Data Analysis (EDA), we'll first need to clean and wrangle the datasets to a form that meet our needs.

4.2. Assign Raw DataFrames to New Engineered DataFrames ¶

In [23]:

# Assign Raw DataFrame to new Engineered DataFrame
df_tm_player_top5_2021 = df_tm_player_top5_2021_raw

4.2. String Cleaning ¶

Name¶

In [24]:

df_tm_player_top5_2021['name_lower'] = df_tm_player_top5_2021['name'].str.normalize('NFKD')\
                                                                     .str.encode('ascii', errors='ignore')\
                                                                     .str.decode('utf-8')\
                                                                     .str.lower()

In [25]:

# First Name Lower
df_tm_player_top5_2021['firstname_lower'] = df_tm_player_top5_2021['name_lower'].str.rsplit(' ', 0).str[0]

# Last Name Lower
df_tm_player_top5_2021['lastname_lower'] = df_tm_player_top5_2021['name_lower'].str.rsplit(' ', 1).str[-1]

# First Initial Lower
df_tm_player_top5_2021['firstinitial_lower'] = df_tm_player_top5_2021['name_lower'].astype(str).str[0]

DoB and Age¶

The dob column is messy and contains both the date of birth as a string and also the age in brackets.

This string cleaning consists of two parts, firstly, to split this apart into their seperate components. However, once the age column is created, we will replaced this by determining the current age using the Python datetime module.

In [26]:

# DoB string cleaning to create birth_date and age columns
df_tm_player_top5_2021[['birth_date', 'age']] = df_tm_player_top5_2021['dob'].str.extract(r'(.+) \((\d+)\)')

Nationality¶

For the nationality, some of the players have duel nationality.

For example, Claudio Pizarro is a Peruvian-born player who has has made 85 appearances for Peru, scoring 20 goals. However, his citizenship according to TransferMarkt is 'Peru / Italy'. For our needs, we only want to know the country the player is eligible to play for, not their full heritage which from observations is always the first part of the string. We'll therefore be discarding anything after the first space in the string to form a new playing_country column.

In [27]:

# Take the first nationality i.e. text before the first space, ex. 'Peru / Italy'
df_tm_player_top5_2021['playing_country'] = df_tm_player_top5_2021['nationality'].str.split(' /').str[0]

Value¶

The values of the players have prefixes (£), commas, spaces, and suffixes (m, k) that need to cleaned and replaced before converting to a numerical value.

In [28]:

# Value string cleaning from shortened string value to full numerical value

## Convert 'm' to '000000'
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('m','0000')

## Convert 'k' to '000'
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('k','000')

## Convert 'Th' to '000'
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('Th','000')

## Remove '.'
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('.','')

## Remove '£' sign
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('£','')

## Remove '-'
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('-','')

## Remove '¬†¬†'
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('¬†¬†','')

## Remove gaps
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace(' ','')

Position¶

In [35]:

# List unique values in the df_tm_player_top5_2021['position_description'] column
df_tm_player_top5_2021['position_description'].unique()

Out[35]:

array(['Goalkeeper', 'Centre-Back', 'Left-Back', 'Right-Back',
       'Defensive Midfield', 'Central Midfield', 'Attacking Midfield',
       'Left Winger', 'Right Winger', 'Second Striker', 'Centre-Forward',
       'Right Midfield', 'Left Midfield', 'Midfielder'], dtype=object)

In [36]:

dict_positions={
    'Goalkeeper': 'Goalkeeper',
    'Centre-Back': 'Defender',
    'Left-Back': 'Defender',
    'Right-Back': 'Defender',
    'Defensive Midfield': 'Midfielder',
    'Central Midfield': 'Midfielder',
    'Attacking Midfield': 'Midfielder',
    'Left Winger': 'Forward',
    'Right Winger': 'Forward',
    'Second Striker': 'Forward',
    'Centre-Forward': 'Forward',
    'Right Midfield': 'Midfielder',
    'Left Midfield': 'Midfielder',
    'Midfielder': 'Midfielder'
}

In [37]:

df_tm_player_top5_2021['position_cleaned'] = df_tm_player_top5_2021['position_description'].map(dict_positions)

4.3. Converting Data Types ¶

DoB¶

First we need to convert the dob column from the object data type to datetime64[ns], again using the .to_datetime() method.

In [44]:

df_tm_player_top5_2021

Out[44]:

	position_number	position_description	name	dob	nationality	value	name_lower	firstname_lower	lastname_lower	firstinitial_lower	birth_date	age	playing_country	position_cleaned
0	1	Goalkeeper	Marc-André ter Stegen	Apr 30, 1992 (28)	Germany	64800000	marc-andre ter stegen	marc-andre	stegen	m	Apr 30, 1992	28	Germany	Goalkeeper
1	13	Goalkeeper	Neto	Jul 19, 1989 (31)	Brazil / Italy	13050000	neto	neto	neto	n	Jul 19, 1989	31	Brazil	Goalkeeper
2	26	Goalkeeper	Iñaki Peña	Mar 2, 1999 (21)	Spain	2070000	inaki pena	inaki	pena	i	Mar 2, 1999	21	Spain	Goalkeeper
3	15	Centre-Back	Clément Lenglet	Jun 17, 1995 (25)	France	43200000	clement lenglet	clement	lenglet	c	Jun 17, 1995	25	France	Defender
4	23	Centre-Back	Samuel Umtiti	Nov 14, 1993 (26)	France / Cameroon	28800000	samuel umtiti	samuel	umtiti	s	Nov 14, 1993	26	France	Defender
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
2885	18	Centre-Forward	Sergio Córdova	Aug 9, 1997 (23)	Venezuela	1440000	sergio cordova	sergio	cordova	s	Aug 9, 1997	23	Venezuela	Forward
2886	9	Centre-Forward	Fabian Klos	Dec 2, 1987 (32)	Germany	900000	fabian klos	fabian	klos	f	Dec 2, 1987	32	Germany	Forward
2887	13	Centre-Forward	Sebastian Müller	Jan 23, 2001 (19)	Germany	270000	sebastian muller	sebastian	muller	s	Jan 23, 2001	19	Germany	Forward
2888	36	Centre-Forward	Sven Schipplock	Nov 8, 1988 (31)	Germany	270000	sven schipplock	sven	schipplock	s	Nov 8, 1988	31	Germany	Forward
2889	39	Centre-Forward	Prince Osei Owusu	Jan 7, 1997 (23)	Germany / Ghana	225000	prince osei owusu	prince	owusu	p	Jan 7, 1997	23	Germany	Forward

2890 rows × 14 columns

In [45]:

df_tm_player_top5_2021.dtypes

Out[45]:

position_number         object
position_description    object
name                    object
dob                     object
nationality             object
value                   object
name_lower              object
firstname_lower         object
lastname_lower          object
firstinitial_lower      object
birth_date              object
age                     object
playing_country         object
position_cleaned        object
dtype: object

In [47]:

df_tm_player_top5_2021.to_csv(data_dir + f'test_tm_{today}.csv', index=None, header=True)

In [50]:

df_tm_player_top5_2021 = df_tm_player_top5_2021.replace('N/A', '', regex=True)

In [51]:

# Convert birth_date from string to datetime64[ns]
df_tm_player_top5_2021['birth_date'] = pd.to_datetime(df_tm_player_top5_2021['birth_date'])

Age¶

The calculated age column needs to be converted from a float to an integer, with all null values ignored, using to astype() method.

In [52]:

# Date and time manipulation
from datetime import datetime

In [53]:

# Redetermine the age using the newly created birth_date column (after formatted to datetime data type)

## Remove all not numeric values use to_numeric with parameter errors='coerce' - it replaces non numeric to NaNs
df_tm_player_top5_2021['age'] = pd.to_numeric(df_tm_player_top5_2021['age'], errors='coerce')

## Convert floats to integers and leave null values
df_tm_player_top5_2021['age'] = np.nan_to_num(df_tm_player_top5_2021['age']).astype(int)

## Calculate current age
today = datetime.today()
df_tm_player_top5_2021['age'] = df_tm_player_top5_2021['birth_date'].apply(lambda x: today.year - x.year - 
                                                                               ((today.month, today.day) < (x.month, x.day)) 
                                                                          )


# df_tm_player_top5_2021['age'] = pd.to_numeric(ddf_tm_player_top5_2021['age'], downcast='signed')

Value¶

The value column needs to be converted from a string to an integer using to to_numeric() method.

In [54]:

# Convert string to integer
df_tm_player_top5_2021['value'] = pd.to_numeric(df_tm_player_top5_2021['value'])

Position¶

...

In [55]:

sorted(df_tm_player_top5_2021['position_description'].unique())

Out[55]:

['Attacking Midfield',
 'Central Midfield',
 'Centre-Back',
 'Centre-Forward',
 'Defensive Midfield',
 'Goalkeeper',
 'Left Midfield',
 'Left Winger',
 'Left-Back',
 'Midfielder',
 'Right Midfield',
 'Right Winger',
 'Right-Back',
 'Second Striker']

In [56]:

dict_positions_tm = {
    'Attacking Midfield': 'Midfielder',
    'Central Midfield': 'Midfielder',
    'Centre-Back': 'Defender',
    'Centre-Forward': 'Forward',
    'Defensive Midfield': 'Midfielder',
    'Forward': 'Forward',
    'Goalkeeper': 'Goalkeeper',
    'Left Midfield': 'Midfielder',
    'Left Winger': 'Forward',
    'Left-Back': 'Defender',
    'Right Midfield': 'Midfielder',
    'Right Winger': 'Forward',
    'Right-Back': 'Defender',
    'Second Striker': 'Forward'
}

In [57]:

df_tm_player_top5_2021['position_description_cleaned'] = df_tm_player_top5_2021['position_description'].map(dict_positions_tm)

4.4. Create New Attributes ¶

Create new attributes for birth month and birth year.

In [58]:

df_tm_player_top5_2021['birth_year'] = pd.DatetimeIndex(df_tm_player_top5_2021['birth_date']).year
df_tm_player_top5_2021['birth_month'] = pd.DatetimeIndex(df_tm_player_top5_2021['birth_date']).month

4.5. Columns of Interest ¶

We are interested in the following thirteen columns in the TransferMarkt dataset:

name
name_lower
firstinitial_lower
firstname_lower
lastname_lower
position_description
position_description_cleaned
value
birth_date
birth_year
birth_month
age
playing_country

In [59]:

# Select columns of interest
df_tm_player_top5_2021 = df_tm_player_top5_2021[['name', 'name_lower', 'firstinitial_lower', 'firstname_lower', 'lastname_lower', 'position_description', 'position_description_cleaned', 'value', 'birth_date', 'birth_year', 'birth_month', 'age', 'playing_country']]

4.6. Split Dataset into Outfielder Players and Goalkeepers ¶

In [60]:

# Assign df_tm as a new DataFrame - df_tm_player_top5_all_2021_all, to represent all the players
df_tm_player_top5_all_2021 = df_tm_player_top5_2021

# Filter rows for position_description is not equal to 'Goalkeeper'
df_tm_player_top5_outfield_2021 = df_tm_player_top5_all_2021[df_tm_player_top5_all_2021['position_description'] != 'Goalkeeper']

# Filter rows for position_description are equal to 'Goalkeeper'
df_tm_player_top5_goalkeeper_2021 = df_tm_player_top5_all_2021[df_tm_player_top5_all_2021['position_description'] == 'Goalkeeper']

In [61]:

df_tm_player_top5_all_2021.head()

Out[61]:

	name	name_lower	firstinitial_lower	firstname_lower	lastname_lower	position_description	position_description_cleaned	value	birth_date	birth_year	birth_month	age	playing_country
0	Marc-André ter Stegen	marc-andre ter stegen	m	marc-andre	stegen	Goalkeeper	Goalkeeper	64800000.0	1992-04-30	1992.0	4.0	28.0	Germany
1	Neto	neto	n	neto	neto	Goalkeeper	Goalkeeper	13050000.0	1989-07-19	1989.0	7.0	31.0	Brazil
2	Iñaki Peña	inaki pena	i	inaki	pena	Goalkeeper	Goalkeeper	2070000.0	1999-03-02	1999.0	3.0	21.0	Spain
3	Clément Lenglet	clement lenglet	c	clement	lenglet	Centre-Back	Defender	43200000.0	1995-06-17	1995.0	6.0	25.0	France
4	Samuel Umtiti	samuel umtiti	s	samuel	umtiti	Centre-Back	Defender	28800000.0	1993-11-14	1993.0	11.0	26.0	France

In [62]:

df_tm_player_top5_outfield_2021.head()

Out[62]:

	name	name_lower	firstinitial_lower	firstname_lower	lastname_lower	position_description	position_description_cleaned	value	birth_date	birth_year	birth_month	age	playing_country
3	Clément Lenglet	clement lenglet	c	clement	lenglet	Centre-Back	Defender	43200000.0	1995-06-17	1995.0	6.0	25.0	France
4	Samuel Umtiti	samuel umtiti	s	samuel	umtiti	Centre-Back	Defender	28800000.0	1993-11-14	1993.0	11.0	26.0	France
5	Gerard Piqué	gerard pique	g	gerard	pique	Centre-Back	Defender	18000000.0	1987-02-02	1987.0	2.0	33.0	Spain
6	Jean-Clair Todibo	jean-clair todibo	j	jean-clair	todibo	Centre-Back	Defender	12600000.0	1999-12-30	1999.0	12.0	20.0	France
7	Ronald Araújo	ronald araujo	r	ronald	araujo	Centre-Back	Defender	9000000.0	1999-03-07	1999.0	3.0	21.0	Uruguay

In [63]:

df_tm_player_top5_goalkeeper_2021.head()

Out[63]:

	name	name_lower	firstinitial_lower	firstname_lower	lastname_lower	position_description	position_description_cleaned	value	birth_date	birth_year	birth_month	age	playing_country
0	Marc-André ter Stegen	marc-andre ter stegen	m	marc-andre	stegen	Goalkeeper	Goalkeeper	64800000.0	1992-04-30	1992.0	4.0	28.0	Germany
1	Neto	neto	n	neto	neto	Goalkeeper	Goalkeeper	13050000.0	1989-07-19	1989.0	7.0	31.0	Brazil
2	Iñaki Peña	inaki pena	i	inaki	pena	Goalkeeper	Goalkeeper	2070000.0	1999-03-02	1999.0	3.0	21.0	Spain
32	Thibaut Courtois	thibaut courtois	t	thibaut	courtois	Goalkeeper	Goalkeeper	54000000.0	1992-05-11	1992.0	5.0	28.0	Belgium
33	Andriy Lunin	andriy lunin	a	andriy	lunin	Goalkeeper	Goalkeeper	2430000.0	1999-02-11	1999.0	2.0	21.0	Ukraine

4.7. Exporting the Engineered DataFrames ¶

Export the three engineered TransferMarkt DataFrames as CSV files.

In [64]:

# Datetime
import datetime
from datetime import date
import time

In [65]:

# Define today's date
today = datetime.datetime.now().strftime('%d/%m/%Y').replace('/', '')

In [66]:

# Export the three DataFrames
df_tm_player_top5_all_2021.to_csv(data_dir_tm + '/engineered/all/' + f'all_big5_1920_{today}.csv', index=None, header=True)
df_tm_player_top5_outfield_2021.to_csv(data_dir_tm + '/engineered/player/' + f'outfield_big5_1920_{today}.csv', index=None, header=True)
df_tm_player_top5_goalkeeper_2021.to_csv(data_dir_tm + '/engineered/goalkeeper/' + f'goalkeeper_big5_1920_{today}.csv', index=None, header=True)

Now we have created three pandas DataFrames and wrangled the data to meet our needs, we'll next conduct and Exploratory Data Analysis .

Data and Web Scraping¶

FBref for the data to scrape
FBref statement for using StatsBomb's data: https://fbref.com/en/statsbomb/
StatsBomb providing the data to FBref
FBref_EPL GitHub repository by chmartin for the original web scraping code
Scrape-FBref-data GitHub repository by parth1902 for the revised web scraping code for the new FBref metrics
Beyond crowd judgments: Data-driven estimation of market value in association football by Oliver Müllera, Alexander Simons, and Markus Weinmann.
06/04/2020: BBC - Premier League squads 'drop £1.6bn in value'.
tyrone_mings GitHub repository by FCrSTATS
Python Package Index (PyPI) tyrone-mings library.

Countries¶

Comparison of alphabetic country codes Wiki

8.2. Python Techniques Observed ¶

To conduct our analysis, we have used the following libraries and modules for the following tasks:

NumPy for multidimensional array computing,
pandas for data manipulation and ingestion, and
Beautifulsoup for scraping data from webpages.

We have also demonstrated an array of techniques in Python using the following methods and functions:

pandas EDA methods:
- head(),
- tail(),
- shape,
- columns,
- dtypes,
- info, and
- describe.
The missingno library to visualise how many missing values we have in the dataset, and
The pandas .to_csv() method to export the DataFrames as csv files.

*Visit my website EddWebster.com or my GitHub Repository for more projects. If you'd like to get in contact, my Twitter handle is @eddwebster and my email is: edd.j.webster@gmail.com.*

Back to the top

Webscraping of TransferMarkt Data¶

Notebook to scrape raw data using Beautifulsoup from TransferMarkt.¶

By Edd Webster¶

Import Libraries and Modules¶

Defined Variables¶

Defined Filepaths¶

Name¶

DoB and Age¶

Nationality¶

Value¶

Position¶

DoB¶

Age¶

Value¶

Position¶

Data and Web Scraping¶

Countries¶

By Edd Webster ¶