Notebook first written: 22/08/2021
Notebook last updated: 06/09/2021
This notebook engineers transfer data data from the Guardian provided by Tom Worville (see Tweet [link]), and manipulates this landed data as DataFrames using pandas and matplotlib for visualisation.
For more information about this notebook and the author, I'm available through all the following channels:
The accompanying GitHub repository for this notebook can be found here and a static version of this notebook can be found here.
This notebook was written using Python 3 and requires the following libraries:
Jupyter notebooks
for this notebook environment with which this project is presented;NumPy
for multidimensional array computing;pandas
for data analysis and manipulation; andmatplotlib
for data visualisations.All packages used for this notebook except for BeautifulSoup can be obtained by downloading and installing the Conda distribution, available on all platforms (Windows, Linux and Mac OSX). Step-by-step guides on how to install Anaconda can be found for Windows here and Mac here, as well as in the Anaconda documentation itself here.
# Python ≥3.5 (ideally)
import platform
import sys, getopt
assert sys.version_info >= (3, 5)
import csv
# Import Dependencies
%matplotlib inline
# Math Operations
import numpy as np
import math
from math import pi
# Datetime
import datetime
from datetime import date
import time
# Data Preprocessing
import pandas as pd
import os
import re
import random
from io import BytesIO
from pathlib import Path
# Reading directories
import glob
import os
from os.path import basename
# Flatten lists
from functools import reduce
# Working with JSON
import json
from pandas.io.json import json_normalize
# Web Scraping
import requests
from bs4 import BeautifulSoup
import re
# Currency Convert
from forex_python.converter import CurrencyRates
# APIs
from tyrone_mings import *
# Fuzzy Matching - Record Linkage
import recordlinkage
import jellyfish
import numexpr as ne
# Data Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')
import missingno as msno
# Progress Bar
from tqdm import tqdm
# Display in Jupyter
from IPython.display import Image, YouTubeVideo
from IPython.core.display import HTML
# Ignore Warnings
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")
print('Setup Complete')
Setup Complete
# Python / module versions used here for reference
print('Python: {}'.format(platform.python_version()))
print('NumPy: {}'.format(np.__version__))
print('pandas: {}'.format(pd.__version__))
print('matplotlib: {}'.format(mpl.__version__))
Python: 3.7.6 NumPy: 1.20.3 pandas: 1.3.2 matplotlib: 3.4.2
# Define today's date
today = datetime.datetime.now().strftime('%d/%m/%Y').replace('/', '')
# Defined Dictionaries
## Seasons
dict_seasons = {2016: '2016/2017',
2017: '2017/2018',
2018: '2018/2019',
2019: '2019/2020',
2020: '2020/2021',
2021: '2021/2022'
}
## Positions
dict_positions_grouped = {'GK': 'Goalkeeper',
'D': 'Defender',
'M': 'Midfielder',
'W': 'Forward', # I think 'W' in the dataset is 'Winger', therefore a forward
'F': 'Forward',
}
# Set up initial paths to subfolders
base_dir = os.path.join('..', '..')
data_dir = os.path.join(base_dir, 'data')
data_dir_guardian = os.path.join(base_dir, 'data', 'guardian')
img_dir = os.path.join(base_dir, 'img')
fig_dir = os.path.join(base_dir, 'img', 'fig')
video_dir = os.path.join(base_dir, 'video')
pd.set_option('display.max_columns', None)
This Jupyter notebook is part of a series of notebooks to scrape, parse, engineer, unify, and the model, culminating in a an Expected Transfer (xTransfer) player performance vs. valuation model. This model aims to determine the under- and over-performing players based on their on-the-pitch output against transfer fee and wages.
This particular notebook is one of several data engineering notebooks, that cleans recorded transfer fee player data from Guardian using pandas.
This notebook, along with the other notebooks in this project workflow are shown in the following diagram:
Links to these notebooks in the football_analytics
GitHub repository can be found at the following:
Before conducting our EDA, the data needs to be imported as a DataFrame in the Data Sources section Section 3 and cleaned in the Data Engineering section Section 4.
We'll be using the pandas library to import our data to this workbook as a DataFrame.
The Guardian dataset has six features (columns) with the following definitions and data types:
Feature | Data type |
---|---|
position_number |
object |
position_description |
object |
name |
object |
dob |
object |
nationality |
object |
value |
object |
## Filename
file_name = 'guardian_transfer_data'
# Import DataFrame as a CSV file
df_guardian_transfers_all_raw = pd.read_csv(data_dir_guardian + f'/raw/' + f'{file_name}_latest.csv')
Let's quality of the dataset by looking first and last rows in pandas using the head() and tail() methods.
# Display the first 5 rows of the raw DataFrame, df_guardian_transfers_all_raw
df_guardian_transfers_all_raw.head()
transfer_id | transfer_year | window_day | window_open | timestamp | player_name | transfer_status | transfer_type | primary_player_position | price_in_pounds | previous_club | new_club | previous_league | new_league | big_deal | flow_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2021 | 48 | 2021-06-09 | 2021-07-27 | Zito Luvumbo | done deal | loan | W | NaN | Cagliari | Como | Serie A | Serie C | NaN | 100 |
1 | 2 | 2021 | 48 | 2021-06-09 | 2021-07-27 | David Juncà | done deal | free | D | NaN | Celta Vigo | Girona | La Liga | Spanish Segunda | NaN | 241 |
2 | 1 | 2021 | 48 | 2021-06-09 | 2021-07-27 | Giuseppe Pezzella | done deal | loan | D | NaN | Parma | Atalanta | Serie B | Serie A | NaN | 106 |
3 | 1 | 2021 | 48 | 2021-06-09 | 2021-07-27 | Pablo Rosario | done deal | fee | M | 5130000.0 | PSV Eindhoven | Nice | Eredivisie | Ligue 1 | Yes | 8 |
4 | 1 | 2021 | 48 | 2021-06-09 | 2021-07-27 | Martín Calderón | done deal | undisclosed | M | NaN | Real Madrid | Cádiz | La Liga | La Liga | NaN | 265 |
# Display the last five rows of the raw DataFrame, df_guardian_transfers_all_raw
df_guardian_transfers_all_raw.tail()
transfer_id | transfer_year | window_day | window_open | timestamp | player_name | transfer_status | transfer_type | primary_player_position | price_in_pounds | previous_club | new_club | previous_league | new_league | big_deal | flow_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7214 | 1 | 2017 | -38 | 2017-07-01 | 2017-05-24 | Joey Barton | done deal | released | NaN | 0.0 | Burnley | Free agent | Premier League | Free agent | No | 2 |
7215 | 1 | 2017 | -38 | 2017-07-01 | 2017-05-24 | Matthias Ostrzolek | done deal | fee | NaN | 0.0 | Hamburg | Hannover 96 | 1. Bundesliga | 1. Bundesliga | No | 1 |
7216 | 1 | 2017 | -38 | 2017-07-01 | 2017-05-24 | Benoît Trémoulinas | done deal | released | NaN | 0.0 | Sevilla | Free agent | La Liga | Free agent | No | 1 |
7217 | 1 | 2017 | -38 | 2017-07-01 | 2017-05-24 | Youri Tielemans | done deal | fee | NaN | 21600000.0 | Anderlecht | Monaco | Belgian Jupiler Pro League | Ligue 1 | Yes | 1 |
7218 | 1 | 2017 | -39 | 2017-07-01 | 2017-05-23 | Juan Cuadrado | done deal | fee | NaN | 17330000.0 | Chelsea | Juventus | Premier League | Serie A | Yes | 1 |
# Print the shape of the raw DataFrame, df_guardian_transfers_all_raw
print(df_guardian_transfers_all_raw.shape)
(7219, 16)
# Print the column names of the raw DataFrame, df_guardian_transfers_all_raw
print(df_guardian_transfers_all_raw.columns)
Index(['transfer_id', 'transfer_year', 'window_day', 'window_open', 'timestamp', 'player_name', 'transfer_status', 'transfer_type', 'primary_player_position', 'price_in_pounds', 'previous_club', 'new_club', 'previous_league', 'new_league', 'big_deal', 'flow_id'], dtype='object')
The dataset has twelve features (columns).
# Data types of the features of the raw DataFrame, df_guardian_transfers_all_raw
df_guardian_transfers_all_raw.dtypes
transfer_id int64 transfer_year int64 window_day int64 window_open object timestamp object player_name object transfer_status object transfer_type object primary_player_position object price_in_pounds float64 previous_club object new_club object previous_league object new_league object big_deal object flow_id int64 dtype: object
# Info for the raw DataFrame, df_guardian_transfers_all_raw
df_guardian_transfers_all_raw.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7219 entries, 0 to 7218 Data columns (total 16 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 transfer_id 7219 non-null int64 1 transfer_year 7219 non-null int64 2 window_day 7219 non-null int64 3 window_open 7219 non-null object 4 timestamp 7219 non-null object 5 player_name 7219 non-null object 6 transfer_status 7216 non-null object 7 transfer_type 7219 non-null object 8 primary_player_position 5614 non-null object 9 price_in_pounds 2077 non-null float64 10 previous_club 7219 non-null object 11 new_club 7219 non-null object 12 previous_league 7216 non-null object 13 new_league 7217 non-null object 14 big_deal 2193 non-null object 15 flow_id 7219 non-null int64 dtypes: float64(1), int64(4), object(11) memory usage: 902.5+ KB
# Description of the raw DataFrame, df_guardian_transfers_all_raw, showing some summary statistics for each numberical column in the DataFrame
df_guardian_transfers_all_raw.describe()
transfer_id | transfer_year | window_day | price_in_pounds | flow_id | |
---|---|---|---|---|---|
count | 7219.000000 | 7219.000000 | 7219.000000 | 2.077000e+03 | 7219.000000 |
mean | 1.033523 | 2018.758692 | 45.576673 | 8.547637e+06 | 65.337443 |
std | 0.183817 | 1.287752 | 33.319495 | 1.270805e+07 | 83.022501 |
min | 1.000000 | 2017.000000 | -47.000000 | 0.000000e+00 | 1.000000 |
25% | 1.000000 | 2018.000000 | 21.000000 | 1.800000e+06 | 7.000000 |
50% | 1.000000 | 2019.000000 | 45.000000 | 4.400000e+06 | 28.000000 |
75% | 1.000000 | 2020.000000 | 68.000000 | 1.052000e+07 | 95.000000 |
max | 3.000000 | 2021.000000 | 216.000000 | 1.980000e+08 | 425.000000 |
# Plot visualisation of the missing values for each feature of the raw DataFrame, df_guardian_transfers_all_raw
msno.matrix(df_guardian_transfers_all_raw, figsize = (30, 7))
<AxesSubplot:>
# Counts of missing values
null_value_stats = df_guardian_transfers_all_raw.isnull().sum(axis=0)
null_value_stats[null_value_stats != 0]
transfer_status 3 primary_player_position 1605 price_in_pounds 5142 previous_league 3 new_league 2 big_deal 5026 dtype: int64
The visualisation shows us very quickly that there a few missing values in the ... columns.
Before we answer the questions in the brief through Exploratory Data Analysis (EDA), we'll first need to clean and wrangle the datasets to a form that meet our needs.
# Assign Raw DataFrames to new Engineered DataFrames
df_guardian_transfers_all = df_guardian_transfers_all_raw.copy()
df_guardian_transfers_all.head()
transfer_id | transfer_year | window_day | window_open | timestamp | player_name | transfer_status | transfer_type | primary_player_position | price_in_pounds | previous_club | new_club | previous_league | new_league | big_deal | flow_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2021 | 48 | 2021-06-09 | 2021-07-27 | Zito Luvumbo | done deal | loan | W | NaN | Cagliari | Como | Serie A | Serie C | NaN | 100 |
1 | 2 | 2021 | 48 | 2021-06-09 | 2021-07-27 | David Juncà | done deal | free | D | NaN | Celta Vigo | Girona | La Liga | Spanish Segunda | NaN | 241 |
2 | 1 | 2021 | 48 | 2021-06-09 | 2021-07-27 | Giuseppe Pezzella | done deal | loan | D | NaN | Parma | Atalanta | Serie B | Serie A | NaN | 106 |
3 | 1 | 2021 | 48 | 2021-06-09 | 2021-07-27 | Pablo Rosario | done deal | fee | M | 5130000.0 | PSV Eindhoven | Nice | Eredivisie | Ligue 1 | Yes | 8 |
4 | 1 | 2021 | 48 | 2021-06-09 | 2021-07-27 | Martín Calderón | done deal | undisclosed | M | NaN | Real Madrid | Cádiz | La Liga | La Liga | NaN | 265 |
# Remove accents and create lowercase name
df_guardian_transfers_all['player_name_lower'] = (df_guardian_transfers_all['player_name']
.str.normalize('NFKD')
.str.encode('ascii', errors='ignore')
.str.decode('utf-8')
.str.lower()
)
# First Name Lower
df_guardian_transfers_all['first_name_lower'] = df_guardian_transfers_all['player_name_lower'].str.rsplit(' ', 0).str[0]
# Last Name Lower
df_guardian_transfers_all['last_name_lower'] = df_guardian_transfers_all['player_name_lower'].str.rsplit(' ', 1).str[-1]
# First Initial Lower
df_guardian_transfers_all['first_initial_lower'] = df_guardian_transfers_all['player_name_lower'].astype(str).str[0]
# Map season to DataFrame
df_guardian_transfers_all['season'] = df_guardian_transfers_all['transfer_year'].map(dict_seasons)
df_guardian_transfers_all.head()
transfer_id | transfer_year | window_day | window_open | timestamp | player_name | transfer_status | transfer_type | primary_player_position | price_in_pounds | previous_club | new_club | previous_league | new_league | big_deal | flow_id | player_name_lower | first_name_lower | last_name_lower | first_initial_lower | season | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2021 | 48 | 2021-06-09 | 2021-07-27 | Zito Luvumbo | done deal | loan | W | NaN | Cagliari | Como | Serie A | Serie C | NaN | 100 | zito luvumbo | zito | luvumbo | z | 2021/2022 |
1 | 2 | 2021 | 48 | 2021-06-09 | 2021-07-27 | David Juncà | done deal | free | D | NaN | Celta Vigo | Girona | La Liga | Spanish Segunda | NaN | 241 | david junca | david | junca | d | 2021/2022 |
2 | 1 | 2021 | 48 | 2021-06-09 | 2021-07-27 | Giuseppe Pezzella | done deal | loan | D | NaN | Parma | Atalanta | Serie B | Serie A | NaN | 106 | giuseppe pezzella | giuseppe | pezzella | g | 2021/2022 |
3 | 1 | 2021 | 48 | 2021-06-09 | 2021-07-27 | Pablo Rosario | done deal | fee | M | 5130000.0 | PSV Eindhoven | Nice | Eredivisie | Ligue 1 | Yes | 8 | pablo rosario | pablo | rosario | p | 2021/2022 |
4 | 1 | 2021 | 48 | 2021-06-09 | 2021-07-27 | Martín Calderón | done deal | undisclosed | M | NaN | Real Madrid | Cádiz | La Liga | La Liga | NaN | 265 | martin calderon | martin | calderon | m | 2021/2022 |
lst_postions = list(df_guardian_transfers_all['primary_player_position'].unique())
lst_postions
['W', 'D', 'M', 'F', 'GK', nan]
# Map grouped positions to DataFrame
df_guardian_transfers_all['primary_player_position_grouped'] = df_guardian_transfers_all['primary_player_position'].map(dict_positions_grouped)
# Separate Goalkeeper and Outfielders
df_guardian_transfers_all['outfielder_goalkeeper'] = np.where(df_guardian_transfers_all['primary_player_position_grouped'] == 'Goalkeeper', 'Goalkeeper', 'Outfielder')
lst_leagues = list(df_guardian_transfers_all['new_league'].unique())
lst_leagues
['Serie C', 'Spanish Segunda', 'Serie A', 'Ligue 1', 'La Liga', 'Serie B', 'Qatar Stars League', 'Premier League', 'Portugal Liga NOS', 'Serie D', '3. Bundesliga', 'Scottish Premiership', 'Ligue 2', 'National', 'Swiss Challenge League', 'MLS', 'National 2', '2. Bundesliga', 'Bayern Regionaliga', '1. Bundesliga', 'Turkish Super Lig', 'League One', 'Championship', 'Segunda B', 'Danish SuperLiga', 'Russian Premier League', 'National League', 'Belgian Jupiler Pro League', 'Eredivisie', 'Saudi Professional League', 'Swiss Superleague', 'UAE Gulf League', 'J League', 'Austrian Bundesliga', 'Croatian 1.HNL', 'Free agent', 'NL South', 'Austrian Bundeliga', 'K-League', 'Greek Super League', 'Super League', 'League Two', 'Liga MX', 'Austrian 2. Liga', 'Polish Ekstraklasa', 'Dutch Kampioen Divisie', 'Fortuna Liga', 'Romanian Liga 1', 'Serbian Super Liga', nan, 'Brazilian Serie A', 'Estonian League', 'Ukrainian Premier League', 'Colombian Primera', 'Bulgarian League', 'Serbian Prva Liga', 'Scottish Championship', 'Cyprus', 'Allsvenskan', 'Czech First League', 'A-League', 'Poland 3. Liga', 'Dutch Kampioen', 'Bulgaria', 'Croatian top flight', 'Belgium', 'Greek Superleague 1', 'Superliga', 'Spal', 'Slovenian Prva Liga', 'Belgian Proximus', 'Algeria', 'Slovenian Liga', 'National League North', 'Greece Superleague', 'Pescara', 'Ereste Divisie', 'Chinese Super League', '1. HNL', 'Regionalliga', 'Belgian D2', 'Danubio', 'Eliteserien', 'Eerste Amateur', 'Slovak Fortuna Liga', 'Bulgarian Liga', 'Belgian Second Division', 'France CFA', 'Saudi League', 'Bundesliga 2.', 'Belgian Proximus League', 'Retired', 'Eerste Divisie', 'Azerbaijan League', 'Northern Premier League', 'Bourg-en-Bresse', 'Championnat National', 'Poland Ekstraklasa', 'Hungarian NB I', 'Argentina Primera Division', 'Cypriot First Division', 'Mouscron', 'Chilean Primera', 'Belgian Super League', '3.Liga', 'Norwegian Eliteserien', 'Croatia HNL Liga', 'QPR', 'National League South', 'Ecuador Liga Pro', 'Greece Super League', 'Regionalliga West', 'Portuguese Primeira Liga', 'Evo-Stik League South', 'retired', 'Portuguese Segunda Liga', 'Kazakh League', 'Regionalliga Südwest', 'Spanish Segunda B', 'Russian Premier Liga', 'Argentina Primeira', 'Slovak Super Liga', 'Qatari Stars League', 'NI Premiership', 'Cyprus First Division', 'Ecuador Serie A', 'Croatian League', 'Czech League', 'Polish 1 Liga', 'Cypriot League', 'Cercle Brugge', 'Turkish Süper Lig', 'Parva Liga', 'Belgian B Division', 'Eredivise', 'USL', 'Bostik Football League', 'Lega Pro', '2 Bundesliga', 'Kazakh Premier Liga', 'Slovakian Fortuna Liga', 'Kazakh Premier League', 'Qatari Stars Leagur', 'Israeli Premier League', 'Slovakia Fortuna Liga', 'Bulgarian First League', 'Israeli Ligat', 'League of Ireland', 'Belgian First Division', 'Southern League', 'Uruguayan Primera División', '2. Bungesliga', 'Maltese Premier League', 'Turkey Super Lig', 'Liga Pro', 'Cyprus 1. Division', 'Ecuador Primera', 'UAE Arabian Gulf League', 'Chilean Primera División', 'Austrian First League', 'Serie B', 'Brasileiro', 'Southern League Premier Division', 'Tercera División de México', 'Premier League (Israel)', 'Liga Super (Malaysia)', 'National 1', 'US Orléans', 'Belgian second division', 'Primera División (Mexico)', 'Premier League (Russia)', 'Premier League (Kazakhstan)']
# Filter plays in the 'Big 5' European Leagues and MLS
## Define list of countries
lst_big5_mls_leagues = ['Serie A', 'Ligue 1', 'La Liga', 'Premier League', '1. Bundesliga', 'MLS']
df_guardian_transfers_all['previous_league_isbig5'] = np.where(df_guardian_transfers_all['previous_league'].isin(lst_big5_mls_leagues), True, False)
df_guardian_transfers_all['new_league_isbig5'] = np.where(df_guardian_transfers_all['new_league'].isin(lst_big5_mls_leagues), True, False)
df_guardian_transfers_all['isbig5transfer'] = np.where((df_guardian_transfers_all['previous_league_isbig5'] == True) | (df_guardian_transfers_all['new_league_isbig5'] == True), True, False)
df_guardian_transfers_all = df_guardian_transfers_all[df_guardian_transfers_all['isbig5transfer'] == True]
df_guardian_transfers_big5 = df_guardian_transfers_all.drop(['previous_league_isbig5', 'new_league_isbig5', 'isbig5transfer'], axis=1)
df_guardian_transfers_all.shape
(7214, 26)
df_guardian_transfers_big5.shape
(7214, 23)
df_guardian_transfers_big5
transfer_id | transfer_year | window_day | window_open | timestamp | player_name | transfer_status | transfer_type | primary_player_position | price_in_pounds | previous_club | new_club | previous_league | new_league | big_deal | flow_id | player_name_lower | first_name_lower | last_name_lower | first_initial_lower | season | primary_player_position_grouped | outfielder_goalkeeper | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2021 | 48 | 2021-06-09 | 2021-07-27 | Zito Luvumbo | done deal | loan | W | NaN | Cagliari | Como | Serie A | Serie C | NaN | 100 | zito luvumbo | zito | luvumbo | z | 2021/2022 | Forward | Outfielder |
1 | 2 | 2021 | 48 | 2021-06-09 | 2021-07-27 | David Juncà | done deal | free | D | NaN | Celta Vigo | Girona | La Liga | Spanish Segunda | NaN | 241 | david junca | david | junca | d | 2021/2022 | Defender | Outfielder |
2 | 1 | 2021 | 48 | 2021-06-09 | 2021-07-27 | Giuseppe Pezzella | done deal | loan | D | NaN | Parma | Atalanta | Serie B | Serie A | NaN | 106 | giuseppe pezzella | giuseppe | pezzella | g | 2021/2022 | Defender | Outfielder |
3 | 1 | 2021 | 48 | 2021-06-09 | 2021-07-27 | Pablo Rosario | done deal | fee | M | 5130000.0 | PSV Eindhoven | Nice | Eredivisie | Ligue 1 | Yes | 8 | pablo rosario | pablo | rosario | p | 2021/2022 | Midfielder | Outfielder |
4 | 1 | 2021 | 48 | 2021-06-09 | 2021-07-27 | Martín Calderón | done deal | undisclosed | M | NaN | Real Madrid | Cádiz | La Liga | La Liga | NaN | 265 | martin calderon | martin | calderon | m | 2021/2022 | Midfielder | Outfielder |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7214 | 1 | 2017 | -38 | 2017-07-01 | 2017-05-24 | Joey Barton | done deal | released | NaN | 0.0 | Burnley | Free agent | Premier League | Free agent | No | 2 | joey barton | joey | barton | j | 2017/2018 | NaN | Outfielder |
7215 | 1 | 2017 | -38 | 2017-07-01 | 2017-05-24 | Matthias Ostrzolek | done deal | fee | NaN | 0.0 | Hamburg | Hannover 96 | 1. Bundesliga | 1. Bundesliga | No | 1 | matthias ostrzolek | matthias | ostrzolek | m | 2017/2018 | NaN | Outfielder |
7216 | 1 | 2017 | -38 | 2017-07-01 | 2017-05-24 | Benoît Trémoulinas | done deal | released | NaN | 0.0 | Sevilla | Free agent | La Liga | Free agent | No | 1 | benoit tremoulinas | benoit | tremoulinas | b | 2017/2018 | NaN | Outfielder |
7217 | 1 | 2017 | -38 | 2017-07-01 | 2017-05-24 | Youri Tielemans | done deal | fee | NaN | 21600000.0 | Anderlecht | Monaco | Belgian Jupiler Pro League | Ligue 1 | Yes | 1 | youri tielemans | youri | tielemans | y | 2017/2018 | NaN | Outfielder |
7218 | 1 | 2017 | -39 | 2017-07-01 | 2017-05-23 | Juan Cuadrado | done deal | fee | NaN | 17330000.0 | Chelsea | Juventus | Premier League | Serie A | Yes | 1 | juan cuadrado | juan | cuadrado | j | 2017/2018 | NaN | Outfielder |
7214 rows × 23 columns
This step is not required as the data only goes as far back as the 17/18 season.
lst_seasons = list(df_guardian_transfers_all['season'].unique())
lst_seasons
['2021/2022', '2020/2021', '2019/2020', '2018/2019', '2017/2018']
lst_seasons_1718_2122 = ['2017/2018', '2018/2019', '2019/2020', '2020/2021', '2021/2022']
df_guardian_transfers_all_1718_2122 = df_guardian_transfers_all[df_guardian_transfers_all['season'].isin(lst_seasons_1718_2122)]
df_guardian_transfers_big5_1718_2122 = df_guardian_transfers_all[df_guardian_transfers_all['season'].isin(lst_seasons_1718_2122)]
df_guardian_transfers_all.shape
(7214, 26)
Export the two engineered The Guardian DataFrames as CSV files.
## Filename
file_name = 'tm_player_joined_big5_mls'
## Full season string
full_season_string = '2016/2017 - 2021/2022'
## Short season string
short_season_string = '1617-2122'
# Export DataFrame as a CSV file
## Export a copy to the 'archive' subfolder of the TM folder, including the date
df_tm_market_value_bio_status_select.to_csv(data_dir_tm + f'/engineered/{short_season_string}/joined/archive/' + f'{file_name}_{short_season_string}_last_updated_{today}.csv', index=None, header=True)
## Export another copy to the TM folder called 'latest' (can be overwritten)
df_tm_market_value_bio_status_select.to_csv(data_dir_tm + f'/engineered/{short_season_string}/joined/' + f'{file_name}_{short_season_string}_latest.csv', index=None, header=True)
# Export DataFrame as a CSV file
## All data
df_guardian_transfers_all.to_csv(data_dir_tm + f'/engineered/transfers/archive/' + f'tm_player_transfers_all_last_updated_{today}.csv', index=None, header=True)
df_guardian_transfers_all.to_csv(data_dir_tm + f'/engineered/transfers/' + f'tm_transfers_player_all_latest.csv', index=None, header=True)
df_guardian_transfers_all.to_csv(data_dir + f'/export/' + f'tm_transfers_player_all_latest.csv', index=None, header=True)
## 'Big 5' European Leagues
df_transfers_big5.to_csv(data_dir_tm + f'/engineered/transfers/archive/' + f'tm_player_transfers_big5_last_updated_{today}.csv', index=None, header=True)
df_transfers_big5.to_csv(data_dir_tm + f'/engineered/transfers/' + f'tm_player_transfers_big5_latest.csv', index=None, header=True)
df_transfers_big5.to_csv(data_dir + f'/export/' + f'tm_player_transfers_big5_latest.csv', index=None, header=True)
This Jupyter notebook engineered scraped football data from The Guardian using pandas for data manipulation through DataFrames.
The step is to use this data to...
*Visit my website eddwebster.com or my GitHub Repository for more projects. If you'd like to get in contact, my Twitter handle is @eddwebster and my email is: edd.j.webster@gmail.com.*