The Guardian Player Recorded Transfer Fees Data Engineering¶

Notebook to engineer player recorded transfer data scraped from The Guardian.¶

By Edd Webster ¶

Notebook first written: 22/08/2021
Notebook last updated: 06/09/2021

Introduction ¶

This notebook engineers transfer data data from the Guardian provided by Tom Worville (see Tweet [link]), and manipulates this landed data as DataFrames using pandas and matplotlib for visualisation.

For more information about this notebook and the author, I'm available through all the following channels:

title

The accompanying GitHub repository for this notebook can be found here and a static version of this notebook can be found here.

1. Notebook Dependencies ¶

This notebook was written using Python 3 and requires the following libraries:

Jupyter notebooks for this notebook environment with which this project is presented;
NumPy for multidimensional array computing;
pandas for data analysis and manipulation; and
matplotlib for data visualisations.

All packages used for this notebook except for BeautifulSoup can be obtained by downloading and installing the Conda distribution, available on all platforms (Windows, Linux and Mac OSX). Step-by-step guides on how to install Anaconda can be found for Windows here and Mac here, as well as in the Anaconda documentation itself here.

Import Libraries and Modules¶

In [1]:

# Python ≥3.5 (ideally)
import platform
import sys, getopt
assert sys.version_info >= (3, 5)
import csv

# Import Dependencies
%matplotlib inline

# Math Operations
import numpy as np
import math
from math import pi

# Datetime
import datetime
from datetime import date
import time

# Data Preprocessing
import pandas as pd
import os
import re
import random
from io import BytesIO
from pathlib import Path

# Reading directories
import glob
import os
from os.path import basename

# Flatten lists
from functools import reduce

# Working with JSON
import json
from pandas.io.json import json_normalize

# Web Scraping
import requests
from bs4 import BeautifulSoup
import re

# Currency Convert
from forex_python.converter import CurrencyRates

# APIs
from tyrone_mings import * 

# Fuzzy Matching - Record Linkage
import recordlinkage
import jellyfish
import numexpr as ne

# Data Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')
import missingno as msno

# Progress Bar
from tqdm import tqdm

# Display in Jupyter
from IPython.display import Image, YouTubeVideo
from IPython.core.display import HTML

# Ignore Warnings
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

print('Setup Complete')

Setup Complete

In [2]:

# Python / module versions used here for reference
print('Python: {}'.format(platform.python_version()))
print('NumPy: {}'.format(np.__version__))
print('pandas: {}'.format(pd.__version__))
print('matplotlib: {}'.format(mpl.__version__))

Python: 3.7.6
NumPy: 1.20.3
pandas: 1.3.2
matplotlib: 3.4.2

Defined Variables¶

In [3]:

# Define today's date
today = datetime.datetime.now().strftime('%d/%m/%Y').replace('/', '')

Defined Dictionaries¶

In [4]:

# Defined Dictionaries

## Seasons
dict_seasons = {2016: '2016/2017',
                2017: '2017/2018',
                2018: '2018/2019',
                2019: '2019/2020',
                2020: '2020/2021',
                2021: '2021/2022'
               }

## Positions
dict_positions_grouped = {'GK': 'Goalkeeper',
                          'D': 'Defender',
                          'M': 'Midfielder',
                          'W': 'Forward',    # I think 'W' in the dataset is 'Winger', therefore a forward
                          'F': 'Forward',
                      }

Defined Filepaths¶

In [5]:

# Set up initial paths to subfolders
base_dir = os.path.join('..', '..')
data_dir = os.path.join(base_dir, 'data')
data_dir_guardian = os.path.join(base_dir, 'data', 'guardian')
img_dir = os.path.join(base_dir, 'img')
fig_dir = os.path.join(base_dir, 'img', 'fig')
video_dir = os.path.join(base_dir, 'video')

Notebook Settings¶

In [6]:

pd.set_option('display.max_columns', None)

2. Project Brief ¶

This Jupyter notebook is part of a series of notebooks to scrape, parse, engineer, unify, and the model, culminating in a an Expected Transfer (xTransfer) player performance vs. valuation model. This model aims to determine the under- and over-performing players based on their on-the-pitch output against transfer fee and wages.

This particular notebook is one of several data engineering notebooks, that cleans recorded transfer fee player data from Guardian using pandas.

This notebook, along with the other notebooks in this project workflow are shown in the following diagram:

roadmap

Links to these notebooks in the football_analytics GitHub repository can be found at the following:

3. Data Sources ¶

3.1. Introduction ¶

Before conducting our EDA, the data needs to be imported as a DataFrame in the Data Sources section Section 3 and cleaned in the Data Engineering section Section 4.

We'll be using the pandas library to import our data to this workbook as a DataFrame.

3.2. Data Dictionaries ¶

The Guardian dataset has six features (columns) with the following definitions and data types:

Feature	Data type
`position_number`	object
`position_description`	object
`name`	object
`dob`	object
`nationality`	object
`value`	object

3.3. Read in Data ¶

In [7]:

## Filename
file_name = 'guardian_transfer_data'

In [8]:

# Import DataFrame as a CSV file
df_guardian_transfers_all_raw = pd.read_csv(data_dir_guardian + f'/raw/' + f'{file_name}_latest.csv')

3.3. Preliminary Data Handling ¶

Let's quality of the dataset by looking first and last rows in pandas using the head() and tail() methods.

In [9]:

# Display the first 5 rows of the raw DataFrame, df_guardian_transfers_all_raw
df_guardian_transfers_all_raw.head()

Out[9]:

	transfer_id	transfer_year	window_day	window_open	timestamp	player_name	transfer_status	transfer_type	primary_player_position	price_in_pounds	previous_club	new_club	previous_league	new_league	big_deal	flow_id
0	1	2021	48	2021-06-09	2021-07-27	Zito Luvumbo	done deal	loan	W	NaN	Cagliari	Como	Serie A	Serie C	NaN	100
1	2	2021	48	2021-06-09	2021-07-27	David Juncà	done deal	free	D	NaN	Celta Vigo	Girona	La Liga	Spanish Segunda	NaN	241
2	1	2021	48	2021-06-09	2021-07-27	Giuseppe Pezzella	done deal	loan	D	NaN	Parma	Atalanta	Serie B	Serie A	NaN	106
3	1	2021	48	2021-06-09	2021-07-27	Pablo Rosario	done deal	fee	M	5130000.0	PSV Eindhoven	Nice	Eredivisie	Ligue 1	Yes	8
4	1	2021	48	2021-06-09	2021-07-27	Martín Calderón	done deal	undisclosed	M	NaN	Real Madrid	Cádiz	La Liga	La Liga	NaN	265

In [10]:

# Display the last five rows of the raw DataFrame, df_guardian_transfers_all_raw
df_guardian_transfers_all_raw.tail()

Out[10]:

	transfer_id	transfer_year	window_day	window_open	timestamp	player_name	transfer_status	transfer_type	primary_player_position	price_in_pounds	previous_club	new_club	previous_league	new_league	big_deal	flow_id
7214	1	2017	-38	2017-07-01	2017-05-24	Joey Barton	done deal	released	NaN	0.0	Burnley	Free agent	Premier League	Free agent	No	2
7215	1	2017	-38	2017-07-01	2017-05-24	Matthias Ostrzolek	done deal	fee	NaN	0.0	Hamburg	Hannover 96	1. Bundesliga	1. Bundesliga	No	1
7216	1	2017	-38	2017-07-01	2017-05-24	Benoît Trémoulinas	done deal	released	NaN	0.0	Sevilla	Free agent	La Liga	Free agent	No	1
7217	1	2017	-38	2017-07-01	2017-05-24	Youri Tielemans	done deal	fee	NaN	21600000.0	Anderlecht	Monaco	Belgian Jupiler Pro League	Ligue 1	Yes	1
7218	1	2017	-39	2017-07-01	2017-05-23	Juan Cuadrado	done deal	fee	NaN	17330000.0	Chelsea	Juventus	Premier League	Serie A	Yes	1

In [11]:

# Print the shape of the raw DataFrame, df_guardian_transfers_all_raw
print(df_guardian_transfers_all_raw.shape)

(7219, 16)

In [12]:

# Print the column names of the raw DataFrame, df_guardian_transfers_all_raw
print(df_guardian_transfers_all_raw.columns)

Index(['transfer_id', 'transfer_year', 'window_day', 'window_open',
       'timestamp', 'player_name', 'transfer_status', 'transfer_type',
       'primary_player_position', 'price_in_pounds', 'previous_club',
       'new_club', 'previous_league', 'new_league', 'big_deal', 'flow_id'],
      dtype='object')

The dataset has twelve features (columns).

In [13]:

# Data types of the features of the raw DataFrame, df_guardian_transfers_all_raw
df_guardian_transfers_all_raw.dtypes

Out[13]:

transfer_id                  int64
transfer_year                int64
window_day                   int64
window_open                 object
timestamp                   object
player_name                 object
transfer_status             object
transfer_type               object
primary_player_position     object
price_in_pounds            float64
previous_club               object
new_club                    object
previous_league             object
new_league                  object
big_deal                    object
flow_id                      int64
dtype: object

In [14]:

# Info for the raw DataFrame, df_guardian_transfers_all_raw
df_guardian_transfers_all_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7219 entries, 0 to 7218
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   transfer_id              7219 non-null   int64  
 1   transfer_year            7219 non-null   int64  
 2   window_day               7219 non-null   int64  
 3   window_open              7219 non-null   object 
 4   timestamp                7219 non-null   object 
 5   player_name              7219 non-null   object 
 6   transfer_status          7216 non-null   object 
 7   transfer_type            7219 non-null   object 
 8   primary_player_position  5614 non-null   object 
 9   price_in_pounds          2077 non-null   float64
 10  previous_club            7219 non-null   object 
 11  new_club                 7219 non-null   object 
 12  previous_league          7216 non-null   object 
 13  new_league               7217 non-null   object 
 14  big_deal                 2193 non-null   object 
 15  flow_id                  7219 non-null   int64  
dtypes: float64(1), int64(4), object(11)
memory usage: 902.5+ KB

In [15]:

# Description of the raw DataFrame, df_guardian_transfers_all_raw, showing some summary statistics for each numberical column in the DataFrame
df_guardian_transfers_all_raw.describe()

Out[15]:

	transfer_id	transfer_year	window_day	price_in_pounds	flow_id
count	7219.000000	7219.000000	7219.000000	2.077000e+03	7219.000000
mean	1.033523	2018.758692	45.576673	8.547637e+06	65.337443
std	0.183817	1.287752	33.319495	1.270805e+07	83.022501
min	1.000000	2017.000000	-47.000000	0.000000e+00	1.000000
25%	1.000000	2018.000000	21.000000	1.800000e+06	7.000000
50%	1.000000	2019.000000	45.000000	4.400000e+06	28.000000
75%	1.000000	2020.000000	68.000000	1.052000e+07	95.000000
max	3.000000	2021.000000	216.000000	1.980000e+08	425.000000

In [16]:

# Plot visualisation of the missing values for each feature of the raw DataFrame, df_guardian_transfers_all_raw
msno.matrix(df_guardian_transfers_all_raw, figsize = (30, 7))

Out[16]:

<AxesSubplot:>

In [17]:

# Counts of missing values
null_value_stats = df_guardian_transfers_all_raw.isnull().sum(axis=0)
null_value_stats[null_value_stats != 0]

Out[17]:

transfer_status               3
primary_player_position    1605
price_in_pounds            5142
previous_league               3
new_league                    2
big_deal                   5026
dtype: int64

The visualisation shows us very quickly that there a few missing values in the ... columns.

4. Data Engineering ¶

Before we answer the questions in the brief through Exploratory Data Analysis (EDA), we'll first need to clean and wrangle the datasets to a form that meet our needs.

4.1. Assign Raw DataFrames to New Engineered DataFrames ¶

In [18]:

# Assign Raw DataFrames to new Engineered DataFrames
df_guardian_transfers_all = df_guardian_transfers_all_raw.copy()

In [19]:

df_guardian_transfers_all.head()

Out[19]:

	transfer_id	transfer_year	window_day	window_open	timestamp	player_name	transfer_status	transfer_type	primary_player_position	price_in_pounds	previous_club	new_club	previous_league	new_league	big_deal	flow_id
0	1	2021	48	2021-06-09	2021-07-27	Zito Luvumbo	done deal	loan	W	NaN	Cagliari	Como	Serie A	Serie C	NaN	100
1	2	2021	48	2021-06-09	2021-07-27	David Juncà	done deal	free	D	NaN	Celta Vigo	Girona	La Liga	Spanish Segunda	NaN	241
2	1	2021	48	2021-06-09	2021-07-27	Giuseppe Pezzella	done deal	loan	D	NaN	Parma	Atalanta	Serie B	Serie A	NaN	106
3	1	2021	48	2021-06-09	2021-07-27	Pablo Rosario	done deal	fee	M	5130000.0	PSV Eindhoven	Nice	Eredivisie	Ligue 1	Yes	8
4	1	2021	48	2021-06-09	2021-07-27	Martín Calderón	done deal	undisclosed	M	NaN	Real Madrid	Cádiz	La Liga	La Liga	NaN	265

4.2. String Cleaning ¶

Player¶

In [20]:

# Remove accents and create lowercase name
df_guardian_transfers_all['player_name_lower'] = (df_guardian_transfers_all['player_name']
                                                          .str.normalize('NFKD')
                                                          .str.encode('ascii', errors='ignore')
                                                          .str.decode('utf-8')
                                                          .str.lower()
                                                     )

In [21]:

# First Name Lower
df_guardian_transfers_all['first_name_lower'] = df_guardian_transfers_all['player_name_lower'].str.rsplit(' ', 0).str[0]

# Last Name Lower
df_guardian_transfers_all['last_name_lower'] = df_guardian_transfers_all['player_name_lower'].str.rsplit(' ', 1).str[-1]

# First Initial Lower
df_guardian_transfers_all['first_initial_lower'] = df_guardian_transfers_all['player_name_lower'].astype(str).str[0]

Season¶

In [22]:

# Map season to DataFrame
df_guardian_transfers_all['season'] = df_guardian_transfers_all['transfer_year'].map(dict_seasons)

In [23]:

df_guardian_transfers_all.head()

Out[23]:

	transfer_id	transfer_year	window_day	window_open	timestamp	player_name	transfer_status	transfer_type	primary_player_position	price_in_pounds	previous_club	new_club	previous_league	new_league	big_deal	flow_id	player_name_lower	first_name_lower	last_name_lower	first_initial_lower	season
0	1	2021	48	2021-06-09	2021-07-27	Zito Luvumbo	done deal	loan	W	NaN	Cagliari	Como	Serie A	Serie C	NaN	100	zito luvumbo	zito	luvumbo	z	2021/2022
1	2	2021	48	2021-06-09	2021-07-27	David Juncà	done deal	free	D	NaN	Celta Vigo	Girona	La Liga	Spanish Segunda	NaN	241	david junca	david	junca	d	2021/2022
2	1	2021	48	2021-06-09	2021-07-27	Giuseppe Pezzella	done deal	loan	D	NaN	Parma	Atalanta	Serie B	Serie A	NaN	106	giuseppe pezzella	giuseppe	pezzella	g	2021/2022
3	1	2021	48	2021-06-09	2021-07-27	Pablo Rosario	done deal	fee	M	5130000.0	PSV Eindhoven	Nice	Eredivisie	Ligue 1	Yes	8	pablo rosario	pablo	rosario	p	2021/2022
4	1	2021	48	2021-06-09	2021-07-27	Martín Calderón	done deal	undisclosed	M	NaN	Real Madrid	Cádiz	La Liga	La Liga	NaN	265	martin calderon	martin	calderon	m	2021/2022

Position¶

In [24]:

lst_postions = list(df_guardian_transfers_all['primary_player_position'].unique())
lst_postions

Out[24]:

['W', 'D', 'M', 'F', 'GK', nan]

In [25]:

# Map grouped positions to DataFrame
df_guardian_transfers_all['primary_player_position_grouped'] = df_guardian_transfers_all['primary_player_position'].map(dict_positions_grouped)

Goalkeeper / Outfielder¶

In [26]:

# Separate Goalkeeper and Outfielders
df_guardian_transfers_all['outfielder_goalkeeper'] = np.where(df_guardian_transfers_all['primary_player_position_grouped'] == 'Goalkeeper', 'Goalkeeper', 'Outfielder')

4.3. Filter Players in 'Big 5' European Leagues and MLS ¶

In [27]:

lst_leagues = list(df_guardian_transfers_all['new_league'].unique())
lst_leagues

Out[27]:

['Serie C',
 'Spanish Segunda',
 'Serie A',
 'Ligue 1',
 'La Liga',
 'Serie B',
 'Qatar Stars League',
 'Premier League',
 'Portugal Liga NOS',
 'Serie D',
 '3. Bundesliga',
 'Scottish Premiership',
 'Ligue 2',
 'National',
 'Swiss Challenge League',
 'MLS',
 'National 2',
 '2. Bundesliga',
 'Bayern Regionaliga',
 '1. Bundesliga',
 'Turkish Super Lig',
 'League One',
 'Championship',
 'Segunda B',
 'Danish SuperLiga',
 'Russian Premier League',
 'National League',
 'Belgian Jupiler Pro League',
 'Eredivisie',
 'Saudi Professional League',
 'Swiss Superleague',
 'UAE Gulf League',
 'J League',
 'Austrian Bundesliga',
 'Croatian 1.HNL',
 'Free agent',
 'NL South',
 'Austrian Bundeliga',
 'K-League',
 'Greek Super League',
 'Super League',
 'League Two',
 'Liga MX',
 'Austrian 2. Liga',
 'Polish Ekstraklasa',
 'Dutch Kampioen Divisie',
 'Fortuna Liga',
 'Romanian Liga 1',
 'Serbian Super Liga',
 nan,
 'Brazilian Serie A',
 'Estonian League',
 'Ukrainian Premier League',
 'Colombian Primera',
 'Bulgarian League',
 'Serbian Prva Liga',
 'Scottish Championship',
 'Cyprus',
 'Allsvenskan',
 'Czech First League',
 'A-League',
 'Poland 3. Liga',
 'Dutch Kampioen',
 'Bulgaria',
 'Croatian top flight',
 'Belgium',
 'Greek Superleague 1',
 'Superliga',
 'Spal',
 'Slovenian Prva Liga',
 'Belgian Proximus',
 'Algeria',
 'Slovenian Liga',
 'National League North',
 'Greece Superleague',
 'Pescara',
 'Ereste Divisie',
 'Chinese Super League',
 '1. HNL',
 'Regionalliga',
 'Belgian D2',
 'Danubio',
 'Eliteserien',
 'Eerste Amateur',
 'Slovak Fortuna Liga',
 'Bulgarian Liga',
 'Belgian Second Division',
 'France CFA',
 'Saudi League',
 'Bundesliga 2.',
 'Belgian Proximus League',
 'Retired',
 'Eerste Divisie',
 'Azerbaijan League',
 'Northern Premier League',
 'Bourg-en-Bresse',
 'Championnat National',
 'Poland Ekstraklasa',
 'Hungarian NB I',
 'Argentina Primera Division',
 'Cypriot First Division',
 'Mouscron',
 'Chilean Primera',
 'Belgian Super League',
 '3.Liga',
 'Norwegian Eliteserien',
 'Croatia HNL Liga',
 'QPR',
 'National League South',
 'Ecuador Liga Pro',
 'Greece Super League',
 'Regionalliga West',
 'Portuguese Primeira Liga',
 'Evo-Stik League South',
 'retired',
 'Portuguese Segunda Liga',
 'Kazakh League',
 'Regionalliga Südwest',
 'Spanish Segunda B',
 'Russian Premier Liga',
 'Argentina Primeira',
 'Slovak Super Liga',
 'Qatari Stars League',
 'NI Premiership',
 'Cyprus First Division',
 'Ecuador Serie A',
 'Croatian League',
 'Czech League',
 'Polish 1 Liga',
 'Cypriot League',
 'Cercle Brugge',
 'Turkish Süper Lig',
 'Parva Liga',
 'Belgian B Division',
 'Eredivise',
 'USL',
 'Bostik Football League',
 'Lega Pro',
 '2 Bundesliga',
 'Kazakh Premier Liga',
 'Slovakian Fortuna Liga',
 'Kazakh Premier League',
 'Qatari Stars Leagur',
 'Israeli Premier League',
 'Slovakia Fortuna Liga',
 'Bulgarian First League',
 'Israeli Ligat',
 'League of Ireland',
 'Belgian First Division',
 'Southern League',
 'Uruguayan Primera División',
 '2. Bungesliga',
 'Maltese Premier League',
 'Turkey Super Lig',
 'Liga Pro',
 'Cyprus 1. Division',
 'Ecuador Primera',
 'UAE Arabian Gulf League',
 'Chilean Primera División',
 'Austrian First League',
 'Serie  B',
 'Brasileiro',
 'Southern League Premier Division',
 'Tercera División de México',
 'Premier League (Israel)',
 'Liga Super (Malaysia)',
 'National 1',
 'US Orléans',
 'Belgian second division',
 'Primera División (Mexico)',
 'Premier League (Russia)',
 'Premier League (Kazakhstan)']

In [28]:

# Filter plays in the 'Big 5' European Leagues and MLS

## Define list of countries
lst_big5_mls_leagues = ['Serie A', 'Ligue 1', 'La Liga', 'Premier League', '1. Bundesliga', 'MLS']

In [30]:

df_guardian_transfers_all['previous_league_isbig5'] = np.where(df_guardian_transfers_all['previous_league'].isin(lst_big5_mls_leagues), True, False)
df_guardian_transfers_all['new_league_isbig5'] = np.where(df_guardian_transfers_all['new_league'].isin(lst_big5_mls_leagues), True, False)
df_guardian_transfers_all['isbig5transfer'] = np.where((df_guardian_transfers_all['previous_league_isbig5'] == True) | (df_guardian_transfers_all['new_league_isbig5'] == True), True, False)
df_guardian_transfers_all = df_guardian_transfers_all[df_guardian_transfers_all['isbig5transfer'] == True]
df_guardian_transfers_big5 = df_guardian_transfers_all.drop(['previous_league_isbig5', 'new_league_isbig5', 'isbig5transfer'], axis=1)

In [33]:

df_guardian_transfers_all.shape

Out[33]:

(7214, 26)

In [31]:

df_guardian_transfers_big5.shape

Out[31]:

(7214, 23)

In [32]:

df_guardian_transfers_big5

Out[32]:

	transfer_id	transfer_year	window_day	window_open	timestamp	player_name	transfer_status	transfer_type	primary_player_position	price_in_pounds	previous_club	new_club	previous_league	new_league	big_deal	flow_id	player_name_lower	first_name_lower	last_name_lower	first_initial_lower	season	primary_player_position_grouped	outfielder_goalkeeper
0	1	2021	48	2021-06-09	2021-07-27	Zito Luvumbo	done deal	loan	W	NaN	Cagliari	Como	Serie A	Serie C	NaN	100	zito luvumbo	zito	luvumbo	z	2021/2022	Forward	Outfielder
1	2	2021	48	2021-06-09	2021-07-27	David Juncà	done deal	free	D	NaN	Celta Vigo	Girona	La Liga	Spanish Segunda	NaN	241	david junca	david	junca	d	2021/2022	Defender	Outfielder
2	1	2021	48	2021-06-09	2021-07-27	Giuseppe Pezzella	done deal	loan	D	NaN	Parma	Atalanta	Serie B	Serie A	NaN	106	giuseppe pezzella	giuseppe	pezzella	g	2021/2022	Defender	Outfielder
3	1	2021	48	2021-06-09	2021-07-27	Pablo Rosario	done deal	fee	M	5130000.0	PSV Eindhoven	Nice	Eredivisie	Ligue 1	Yes	8	pablo rosario	pablo	rosario	p	2021/2022	Midfielder	Outfielder
4	1	2021	48	2021-06-09	2021-07-27	Martín Calderón	done deal	undisclosed	M	NaN	Real Madrid	Cádiz	La Liga	La Liga	NaN	265	martin calderon	martin	calderon	m	2021/2022	Midfielder	Outfielder
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7214	1	2017	-38	2017-07-01	2017-05-24	Joey Barton	done deal	released	NaN	0.0	Burnley	Free agent	Premier League	Free agent	No	2	joey barton	joey	barton	j	2017/2018	NaN	Outfielder
7215	1	2017	-38	2017-07-01	2017-05-24	Matthias Ostrzolek	done deal	fee	NaN	0.0	Hamburg	Hannover 96	1. Bundesliga	1. Bundesliga	No	1	matthias ostrzolek	matthias	ostrzolek	m	2017/2018	NaN	Outfielder
7216	1	2017	-38	2017-07-01	2017-05-24	Benoît Trémoulinas	done deal	released	NaN	0.0	Sevilla	Free agent	La Liga	Free agent	No	1	benoit tremoulinas	benoit	tremoulinas	b	2017/2018	NaN	Outfielder
7217	1	2017	-38	2017-07-01	2017-05-24	Youri Tielemans	done deal	fee	NaN	21600000.0	Anderlecht	Monaco	Belgian Jupiler Pro League	Ligue 1	Yes	1	youri tielemans	youri	tielemans	y	2017/2018	NaN	Outfielder
7218	1	2017	-39	2017-07-01	2017-05-23	Juan Cuadrado	done deal	fee	NaN	17330000.0	Chelsea	Juventus	Premier League	Serie A	Yes	1	juan cuadrado	juan	cuadrado	j	2017/2018	NaN	Outfielder

7214 rows × 23 columns

4.4. Filter for 2017/2018 Season Onwards ¶

This step is not required as the data only goes as far back as the 17/18 season.

In [35]:

lst_seasons = list(df_guardian_transfers_all['season'].unique())
lst_seasons

Out[35]:

['2021/2022', '2020/2021', '2019/2020', '2018/2019', '2017/2018']

In [36]:

lst_seasons_1718_2122 = ['2017/2018', '2018/2019', '2019/2020', '2020/2021', '2021/2022']

In [37]:

df_guardian_transfers_all_1718_2122 = df_guardian_transfers_all[df_guardian_transfers_all['season'].isin(lst_seasons_1718_2122)]
df_guardian_transfers_big5_1718_2122 = df_guardian_transfers_all[df_guardian_transfers_all['season'].isin(lst_seasons_1718_2122)]

In [38]:

df_guardian_transfers_all.shape

Out[38]:

(7214, 26)

5. Exporting the Engineered DataFrames ¶

Export the two engineered The Guardian DataFrames as CSV files.

In [ ]:

## Filename
file_name = 'tm_player_joined_big5_mls'

## Full season string
full_season_string = '2016/2017 - 2021/2022'

## Short season string
short_season_string = '1617-2122'

In [ ]:

# Export DataFrame as a CSV file

## Export a copy to the 'archive' subfolder of the TM folder, including the date
df_tm_market_value_bio_status_select.to_csv(data_dir_tm + f'/engineered/{short_season_string}/joined/archive/' + f'{file_name}_{short_season_string}_last_updated_{today}.csv', index=None, header=True)

## Export another copy to the TM folder called 'latest' (can be overwritten)
df_tm_market_value_bio_status_select.to_csv(data_dir_tm + f'/engineered/{short_season_string}/joined/' + f'{file_name}_{short_season_string}_latest.csv', index=None, header=True)

In [ ]:

# Export DataFrame as a CSV file

## All data
df_guardian_transfers_all.to_csv(data_dir_tm + f'/engineered/transfers/archive/' + f'tm_player_transfers_all_last_updated_{today}.csv', index=None, header=True)
df_guardian_transfers_all.to_csv(data_dir_tm + f'/engineered/transfers/' + f'tm_transfers_player_all_latest.csv', index=None, header=True)
df_guardian_transfers_all.to_csv(data_dir + f'/export/' + f'tm_transfers_player_all_latest.csv', index=None, header=True)


## 'Big 5' European Leagues
df_transfers_big5.to_csv(data_dir_tm + f'/engineered/transfers/archive/' + f'tm_player_transfers_big5_last_updated_{today}.csv', index=None, header=True)
df_transfers_big5.to_csv(data_dir_tm + f'/engineered/transfers/' + f'tm_player_transfers_big5_latest.csv', index=None, header=True)
df_transfers_big5.to_csv(data_dir + f'/export/' + f'tm_player_transfers_big5_latest.csv', index=None, header=True)

The Guardian Player Recorded Transfer Fees Data Engineering¶

Notebook to engineer player recorded transfer data scraped from The Guardian.¶

By Edd Webster ¶

Introduction ¶

Notebook Contents ¶

1. Notebook Dependencies ¶

Import Libraries and Modules¶

Defined Variables¶

Defined Dictionaries¶

Defined Filepaths¶

Notebook Settings¶

2. Project Brief ¶

3. Data Sources ¶

3.1. Introduction ¶

3.2. Data Dictionaries ¶

3.3. Read in Data ¶

3.3. Preliminary Data Handling ¶

4. Data Engineering ¶

4.1. Assign Raw DataFrames to New Engineered DataFrames ¶

4.2. String Cleaning ¶

Player¶

Season¶

Position¶

Goalkeeper / Outfielder¶

4.3. Filter Players in 'Big 5' European Leagues and MLS ¶

4.4. Filter for 2017/2018 Season Onwards ¶

5. Exporting the Engineered DataFrames ¶

6. Summary ¶

7. Next Steps ¶

8. References ¶

The Guardian Player Recorded Transfer Fees Data Engineering¶

Notebook to engineer player recorded transfer data scraped from The Guardian.¶

By Edd Webster¶

Import Libraries and Modules¶

Defined Variables¶

Defined Dictionaries¶

Defined Filepaths¶

Notebook Settings¶

Player¶

Season¶

Position¶

Goalkeeper / Outfielder¶

By Edd Webster ¶