#!/usr/bin/env python
# coding: utf-8
#
# # TransferMarkt Player Bio and Status Data Engineering
# ##### Notebook to engineer data scraped from [TransferMarkt](https://www.transfermarkt.co.uk/) using the [Tyrone Mings web scraper](https://github.com/FCrSTATS/tyrone_mings) by [FCrSTATS](https://twitter.com/FC_rstats).
#
# ### By [Edd Webster](https://www.twitter.com/eddwebster)
# Notebook first written: 13/09/2020
# Notebook last updated: 30/08/2021
#
# ![title](../../img/transfermarkt-logo-banner.png)
# ___
#
#
#
# ## Introduction
# This notebook engineers previously scraped data from [TransferMarkt](https://www.transfermarkt.co.uk/) using the [FCrSTATS](https://twitter.com/FC_rstats) [Tyrone Mings webscraper](https://github.com/FCrSTATS/tyrone_mings) and manipulates this landed data as DataFrames using [pandas](http://pandas.pydata.org/), with [matplotlib](https://matplotlib.org/) for visualisation.
#
# For more information about this notebook and the author, I'm available through all the following channels:
# * [eddwebster.com](https://www.eddwebster.com/);
# * edd.j.webster@gmail.com;
# * [@eddwebster](https://www.twitter.com/eddwebster);
# * [linkedin.com/in/eddwebster](https://www.linkedin.com/in/eddwebster/);
# * [github/eddwebster](https://github.com/eddwebster/);
# * [public.tableau.com/profile/edd.webster](https://public.tableau.com/profile/edd.webster);
# * [kaggle.com/eddwebster](https://www.kaggle.com/eddwebster); and
# * [hackerrank.com/eddwebster](https://www.hackerrank.com/eddwebster).
#
# ![title](../../img/fifa21eddwebsterbanner.png)
#
# The accompanying GitHub repository for this notebook can be found [here](https://github.com/eddwebster/football_analytics) and a static version of this notebook can be found [here](https://nbviewer.jupyter.org/github/eddwebster/football_analytics/blob/master/notebooks/A%29%20Web%20Scraping/TransferMarkt%20Web%20Scraping%20and%20Parsing.ipynb).
# ___
#
#
#
# ## Notebook Contents
# 1. [Notebook Dependencies](#section1)
# 2. [Project Brief](#section2)
# 3. [Data Sources](#section3)
# 1. [Introduction](#section3.1)
# 2. [Data Dictionary](#section3.2)
# 3. [Creating the DataFrame](#section3.3)
# 4. [Initial Data Handling](#section3.4)
# 5. [Export the Raw DataFrame](#section3.5)
# 4. [Data Engineering](#section4)
# 1. [Introduction](#section4.1)
# 2. [Columns of Interest](#section4.2)
# 3. [String Cleaning](#section4.3)
# 4. [Converting Data Types](#section4.4)
# 5. [Export the Engineered DataFrame](#section4.5)
# 5. [Exploratory Data Analysis (EDA)](#section5)
# 1. [...](#section5.1)
# 2. [...](#section5.2)
# 3. [...](#section5.3)
# 6. [Summary](#section6)
# 7. [Next Steps](#section7)
# 8. [Bibliography](#section8)
# ___
#
#
#
# ## 1. Notebook Dependencies
# This notebook was written using [Python 3](https://docs.python.org/3.7/) and requires the following libraries:
# * [`Jupyter notebooks`](https://jupyter.org/) for the notebook environment in which this project is presented;
# * [`NumPy`](http://www.numpy.org/) for multidimensional array computing;
# * [`pandas`](http://pandas.pydata.org/) for data analysis and manipulation;
# * [`BeautifulSoup`](https://pypi.org/project/beautifulsoup4/) for web scraping; and
# * [`matplotlib`](https://matplotlib.org/contents.html?v=20200411155018) for data visualisations.
#
# All packages used for this notebook except for BeautifulSoup can be obtained by downloading and installing the [Conda](https://anaconda.org/anaconda/conda) distribution, available on all platforms (Windows, Linux and Mac OSX). Step-by-step guides on how to install Anaconda can be found for Windows [here](https://medium.com/@GalarnykMichael/install-python-on-windows-anaconda-c63c7c3d1444) and Mac [here](https://medium.com/@GalarnykMichael/install-python-on-mac-anaconda-ccd9f2014072), as well as in the Anaconda documentation itself [here](https://docs.anaconda.com/anaconda/install/).
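# The remaining libraries can be installed with pip from within a notebook cell. A minimal sketch (the exact package list is an assumption based on the imports in the next cell; adjust to your environment):
# In[ ]:
# Install the non-Conda dependencies with pip (assumed PyPI package names; uncomment to run)
#get_ipython().system('pip install beautifulsoup4 tyrone-mings currencyconverter recordlinkage missingno')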
# ### Import Libraries and Modules
# In[72]:
# Python ≥3.5 (ideally)
import platform
import sys, getopt
assert sys.version_info >= (3, 5)
import csv
# Import Dependencies
get_ipython().run_line_magic('matplotlib', 'inline')
# Math Operations
import numpy as np
import math
from math import pi
# Datetime
import datetime
from datetime import date
import time
# Data Preprocessing
import pandas as pd
import os
import re
import random
from io import BytesIO
from pathlib import Path
# Reading directories
import glob
from os.path import basename
# Flatten lists
from functools import reduce
# Working with JSON
import json
from pandas.io.json import json_normalize
# Web Scraping
import requests
from bs4 import BeautifulSoup
# Currency Converter
from currency_converter import CurrencyConverter
# APIs
from tyrone_mings import *
# Fuzzy Matching - Record Linkage
import recordlinkage
import jellyfish
import numexpr as ne
# Data Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')
import missingno as msno
# Progress Bar
from tqdm import tqdm
# Display in Jupyter
from IPython.display import Image, YouTubeVideo
from IPython.core.display import HTML
# Ignore Warnings
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")
print('Setup Complete')
# In[2]:
# Python / module versions used here for reference
print('Python: {}'.format(platform.python_version()))
print('NumPy: {}'.format(np.__version__))
print('pandas: {}'.format(pd.__version__))
print('matplotlib: {}'.format(mpl.__version__))
# ### Defined Filepaths
# In[73]:
# Set up initial paths to subfolders
base_dir = os.path.join('..', '..')
data_dir = os.path.join(base_dir, 'data')
data_dir_tm = os.path.join(base_dir, 'data', 'tm')
img_dir = os.path.join(base_dir, 'img')
fig_dir = os.path.join(base_dir, 'img', 'fig')
video_dir = os.path.join(base_dir, 'video')
# ### Defined Variables
# In[3]:
# Define today's date
# Define today's date as a ddmmyyyy string, e.g. '30082021'
today = datetime.datetime.now().strftime('%d%m%Y')
# ### Defined Lists
# In[61]:
# Defined Lists
## Define list of league codes
df_leagues = pd.read_csv(data_dir_tm + '/reference/tm_leagues_comps.csv')
lst_league_codes = df_leagues['league_code'].tolist()
## Define list of 'Big 5' European Leagues and MLS codes
lst_big5_mls_league_codes = ['GB1', 'FR1', 'L1', 'IT1', 'ES1', 'MLS1']
# ### Notebook Settings
# In[5]:
pd.set_option('display.max_columns', None)
# ---
#
#
#
# ## 2. Project Brief
# This Jupyter notebook is part of a series of notebooks to scrape, parse, engineer, unify, and then model football data, culminating in an Expected Transfer (xTransfer) player performance vs. valuation model. This model aims to identify under- and over-performing players based on their on-the-pitch output relative to their transfer fees and wages.
#
# This particular notebook is one of several data engineering notebooks, and cleans player bio and status (valuation) data from [TransferMarkt](https://www.transfermarkt.co.uk/) using [pandas](http://pandas.pydata.org/).
#
# [TransferMarkt](https://www.transfermarkt.co.uk/) is a German-based website owned by [Axel Springer](https://www.axelspringer.com/en/) and is the leading website for the football transfer market. The website posts football-related data, including scores and results, football news, transfer rumours, and, most usefully for us, calculated estimates of the market values of teams and individual players.
#
# To read more about how these estimations are made, [Beyond crowd judgments: Data-driven estimation of market value in association football](https://www.sciencedirect.com/science/article/pii/S0377221717304332) by Oliver Müller, Alexander Simons, and Markus Weinmann does an excellent job of explaining how the estimations are made and their level of accuracy.
#
# This notebook, along with the other notebooks in this project workflow, is shown in the following diagram:
#
# ![roadmap](../../img/football_analytics_data_roadmap.png)
#
# Links to these notebooks in the [`football_analytics`](https://github.com/eddwebster/football_analytics) GitHub repository can be found at the following:
# * [Webscraping](https://github.com/eddwebster/football_analytics/tree/master/notebooks/1_data_scraping)
# + [FBref Player Stats Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/FBref%20Player%20Stats%20Web%20Scraping.ipynb)
# + [TransferMarkt Player Bio and Status Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/TransferMarkt%20Player%20Bio%20and%20Status%20Web%20Scraping.ipynb)
# + [TransferMarkt Player Valuation Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/TransferMarkt%20Player%20Valuation%20Web%20Scraping.ipynb)
# + [TransferMarkt Player Recorded Transfer Fees Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/TransferMarkt%20Player%20Recorded%20Transfer%20Fees%20Webscraping.ipynb)
# + [Capology Player Salary Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/Capology%20Player%20Salary%20Web%20Scraping.ipynb)
# + [FBref Team Stats Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/FBref%20Team%20Stats%20Web%20Scraping.ipynb)
# * [Data Parsing](https://github.com/eddwebster/football_analytics/tree/master/notebooks/2_data_parsing)
# + [ELO Team Ratings Data Parsing](https://github.com/eddwebster/football_analytics/blob/master/notebooks/2_data_parsing/ELO%20Team%20Ratings%20Data%20Parsing.ipynb)
# * [Data Engineering](https://github.com/eddwebster/football_analytics/tree/master/notebooks/3_data_engineering)
# + [FBref Player Stats Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/FBref%20Player%20Stats%20Data%20Engineering.ipynb)
# + [TransferMarkt Player Bio and Status Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Player%20Bio%20and%20Status%20Data%20Engineering.ipynb)
# + [TransferMarkt Player Valuation Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Player%20Valuation%20Data%20Engineering.ipynb)
# + [TransferMarkt Player Recorded Transfer Fees Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Player%20Recorded%20Transfer%20Fees%20Data%20Engineering.ipynb)
# + [Capology Player Salary Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/Capology%20Player%20Salary%20Data%20Engineering.ipynb)
# + [FBref Team Stats Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/FBref%20Team%20Stats%20Data%20Engineering.ipynb)
# + [ELO Team Ratings Data Parsing](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/ELO%20Team%20Ratings%20Data%20Parsing.ipynb)
# + [TransferMarkt Team Recorded Transfer Fee Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Team%20Recorded%20Transfer%20Fee%20Data%20Engineering.ipynb) (aggregated from [TransferMarkt Player Recorded Transfer Fees notebook](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Player%20Recorded%20Transfer%20Fees%20Data%20Engineering.ipynb))
# + [Capology Team Salary Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/Capology%20Team%20Salary%20Data%20Engineering.ipynb) (aggregated from [Capology Player Salary notebook](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/Capology%20Player%20Salary%20Data%20Engineering.ipynb))
# * [Data Unification](https://github.com/eddwebster/football_analytics/tree/master/notebooks/4_data_unification)
# + [Golden ID for Player Level Datasets](https://github.com/eddwebster/football_analytics/blob/master/notebooks/4_data_unification/Golden%20ID%20for%20Player%20Level%20Datasets.ipynb)
# + [Golden ID for Team Level Datasets](https://github.com/eddwebster/football_analytics/blob/master/notebooks/4_data_unification/Golden%20ID%20for%20Team%20Level%20Datasets.ipynb)
# * [Production Datasets](https://github.com/eddwebster/football_analytics/tree/master/notebooks/5_production_datasets)
# + [Player Performance/Market Value Dataset](https://github.com/eddwebster/football_analytics/tree/master/notebooks/5_production_datasets/Player%20Performance/Market%20Value%20Dataset.ipynb)
# + [Team Performance/Market Value Dataset](https://github.com/eddwebster/football_analytics/tree/master/notebooks/5_production_datasets/Team%20Performance/Market%20Value%20Dataset.ipynb)
# * [Expected Transfer (xTransfer) Modeling](https://github.com/eddwebster/football_analytics/tree/master/notebooks/6_data_analysis_and_projects/expected_transfer_modeling)
# + [Expected Transfer (xTransfer) Modeling](https://github.com/eddwebster/football_analytics/tree/master/notebooks/6_data_analysis_and_projects/expected_transfer_modeling/Expected%20Transfer%20%20Modeling.ipynb)
# ---
#
#
#
# ## 3. Data Sources
#
#
# ### 3.1. Introduction
# Before conducting our EDA, the data first needs to be imported as a DataFrame in this Data Sources section ([Section 3](#section3)) and then cleaned in the Data Engineering section ([Section 4](#section4)).
#
# We'll be using the [pandas](http://pandas.pydata.org/) library to import our data to this workbook as a DataFrame.
# ### 3.2. Bio and Status Dataset
# #### 3.2.1. Data Dictionaries
# The [TransferMarkt](https://www.transfermarkt.co.uk/) dataset has six features (columns) with the following definitions and data types:
#
# | Feature | Data type |
# |------|-----|
# | `position_number` | object |
# | `position_description` | object |
# | `name` | object |
# | `dob` | object |
# | `nationality` | object |
# | `value` | object |
# #### 3.2.2. Import CSV files as pandas DataFrames
# ##### Import Bio data
# In[9]:
# Import DataFrame as a CSV file
df_tm_bio_raw = pd.read_csv(data_dir_tm + f'/raw/bio/tm_player_bio_combined_latest.csv')
# ##### Import Status data
# In[10]:
# Import DataFrame as a CSV file
df_tm_status_raw = pd.read_csv(data_dir_tm + f'/raw/status/tm_player_status_combined_latest.csv')
# #### 3.2.3. Preliminary Data Handling
# Let's check the quality of the dataset by looking at the first and last rows in pandas using the [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [tail()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) methods.
# ##### Bio data
# In[11]:
# Display the first five rows of the raw DataFrame, df_tm_bio_raw
df_tm_bio_raw.head()
# In[12]:
# Display the last five rows of the raw DataFrame, df_tm_bio_raw
df_tm_bio_raw.tail()
# In[13]:
# Print the shape of the raw DataFrame, df_tm_bio_raw
print(df_tm_bio_raw.shape)
# In[14]:
# Print the column names of the raw DataFrame, df_tm_bio_raw
print(df_tm_bio_raw.columns)
# The dataset has thirteen features (columns). Full details of these attributes can be found in the [Data Dictionary](#section3.2.1).
# In[15]:
# Data types of the features of the raw DataFrame, df_tm_bio_raw
df_tm_bio_raw.dtypes
# All thirteen of the columns have the `object` data type. Full details of these attributes and their data types can be found in the [Data Dictionary](#section3.2.1).
# In[16]:
# Info for the raw DataFrame, df_tm_bio_raw
df_tm_bio_raw.info()
# In[17]:
# Description of the raw DataFrame, df_tm_bio_raw, showing some summary statistics for each numerical column in the DataFrame
df_tm_bio_raw.describe()
# In[18]:
# Plot visualisation of the missing values for each feature of the raw DataFrame, df_tm_bio_raw
msno.matrix(df_tm_bio_raw, figsize=(30, 7))
# In[19]:
# Counts of missing values
tm_bio_null_value_stats = df_tm_bio_raw.isnull().sum(axis=0)
tm_bio_null_value_stats[tm_bio_null_value_stats != 0]
# The visualisation quickly shows us that there are a few missing values in most of the columns, but the dataset is otherwise ready for use.
# ##### Status data
# In[20]:
# Display the first five rows of the raw DataFrame, df_tm_status_raw
df_tm_status_raw.head()
# In[21]:
# Display the last five rows of the raw DataFrame, df_tm_status_raw
df_tm_status_raw.tail()
# In[22]:
# Print the shape of the raw DataFrame, df_tm_status_raw
print(df_tm_status_raw.shape)
# In[23]:
# Print the column names of the raw DataFrame, df_tm_status_raw
print(df_tm_status_raw.columns)
# The dataset has thirteen features (columns). Full details of these attributes can be found in the [Data Dictionary](#section3.2.1).
# In[24]:
# Data types of the features of the raw DataFrame, df_tm_status_raw
df_tm_status_raw.dtypes
# All thirteen of the columns have the `object` data type. Full details of these attributes and their data types can be found in the [Data Dictionary](#section3.2.1).
# In[25]:
# Info for the raw DataFrame, df_tm_status_raw
df_tm_status_raw.info()
# In[26]:
# Description of the raw DataFrame, df_tm_status_raw, showing some summary statistics for each numerical column in the DataFrame
df_tm_status_raw.describe()
# In[27]:
# Plot visualisation of the missing values for each feature of the raw DataFrame, df_tm_status_raw
msno.matrix(df_tm_status_raw, figsize=(30, 7))
# In[28]:
# Counts of missing values
tm_status_null_value_stats = df_tm_status_raw.isnull().sum(axis=0)
tm_status_null_value_stats[tm_status_null_value_stats != 0]
# The visualisation quickly shows us that there are a few missing values in most of the columns, but the dataset is otherwise ready for use.
# ## 4. Data Engineering
# Before we answer the questions in the brief through [Exploratory Data Analysis (EDA)](#section5), we'll first need to clean and wrangle the datasets into a form that meets our needs.
# ### 4.1. Bio and Status Datasets
# #### 4.1.1. Assign Raw DataFrames to New Engineered DataFrames
# In[29]:
# Assign Raw DataFrames to new Engineered DataFrames
df_tm_bio = df_tm_bio_raw.copy()
df_tm_status = df_tm_status_raw.copy()
# #### 4.1.2. Dedupe DataFrames
# A copy of each player is saved per season, which causes duplication if left untreated.
# ##### Bio data
# In[30]:
df_tm_bio = (df_tm_bio
.sort_values(['season', 'player_id'], ascending=[False, True])
#.drop(columns=['league_code', 'season'])
.drop_duplicates(subset=['player_id'], keep='first')
)
# In[31]:
print('No. rows in Bio DataFrame BEFORE deduplication: {}'.format(len(df_tm_bio_raw)))
print('No. rows in DataFrame AFTER deduplication: {}\n'.format(len(df_tm_bio)))
print('Variance in rows before and after deduplication: {}\n'.format(len(df_tm_bio_raw) - len(df_tm_bio)))
print('-'*10)
# In[32]:
df_tm_bio['player_id'].nunique()
# ##### Status data
# In[33]:
df_tm_status = (df_tm_status
.sort_values(['season', 'player_id'], ascending=[False, True])
.drop(columns=['league_code', 'season'])
.drop_duplicates(subset=['player_id'], keep='first')
)
# In[34]:
print('No. rows in Status DataFrame BEFORE deduplication: {}'.format(len(df_tm_status_raw)))
print('No. rows in DataFrame AFTER deduplication: {}\n'.format(len(df_tm_status)))
print('Variance in rows before and after deduplication: {}\n'.format(len(df_tm_status_raw) - len(df_tm_status)))
print('-'*10)
# In[35]:
df_tm_status['player_id'].nunique()
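# As a quick sanity check (my addition, not part of the original workflow), both deduplicated DataFrames should now contain exactly one row per player_id:
# In[ ]:
# Assert one row per player in each deduplicated DataFrame
assert df_tm_bio['player_id'].is_unique
assert df_tm_status['player_id'].is_unique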
# #### 4.1.3. Join Bio and Status Datasets
# In[36]:
# Join the Bio and Status DataFrames to form one, unified DataFrame
df_tm_bio_status = pd.merge(df_tm_bio, df_tm_status, left_on='player_id', right_on='player_id', how='left')
# In[37]:
print('No. rows in Bio DataFrame BEFORE join to Status data: {}'.format(len(df_tm_bio)))
print('No. rows in DataFrame AFTER join: {}\n'.format(len(df_tm_bio_status)))
print('Variance in rows before and after join: {}\n'.format(len(df_tm_bio_status) - len(df_tm_bio)))
print('-'*10)
# In[38]:
df_tm_bio_status.shape
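# pandas can also validate the join cardinality directly; a hedged sketch (my addition, not part of the original workflow) that raises a MergeError if player_id is not unique on both sides:
# In[ ]:
# Optional check: validate that the Bio-to-Status join is one-to-one on player_id
_ = pd.merge(df_tm_bio, df_tm_status, on='player_id', how='left', validate='one_to_one')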
# #### 4.1.4. String Cleaning
# ##### Name
# In[39]:
# Remove accents and create lowercase name
df_tm_bio_status['name_lower'] = (df_tm_bio_status['player_name']
.str.normalize('NFKD')
.str.encode('ascii', errors='ignore')
.str.decode('utf-8')
.str.lower()
)
# First Name Lower
df_tm_bio_status['firstname_lower'] = df_tm_bio_status['name_lower'].str.split(' ').str[0]
# Last Name Lower
df_tm_bio_status['lastname_lower'] = df_tm_bio_status['name_lower'].str.rsplit(' ', 1).str[-1]
# First Initial Lower
df_tm_bio_status['firstinitial_lower'] = df_tm_bio_status['name_lower'].astype(str).str[0]
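# As a quick check, the normalisation chain above strips accents and lowercases names. A minimal sketch on sample values:
# In[ ]:
# e.g. 'Müller' -> 'muller', 'Mbappé' -> 'mbappe'
sample_names = pd.Series(['Müller', 'Mbappé', "N'Golo Kanté"])
(sample_names
    .str.normalize('NFKD')
    .str.encode('ascii', errors='ignore')
    .str.decode('utf-8')
    .str.lower()
)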
# ##### League Country lower
# In[40]:
# Remove accents and create lowercase name
df_tm_bio_status['league_country_lower'] = (df_tm_bio_status['current_club_country']
.str.normalize('NFKD')
.str.encode('ascii', errors='ignore')
.str.decode('utf-8')
.str.lower()
)
# #### 4.1.5. Converting Data Types
# ##### Date Time
# First we need to convert the `dob`, `joined`, and `contract_expires` columns from the `object` data type to `datetime64[ns]`, using the [.to_datetime()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) method.
# In[41]:
# Convert dob, joined, and contract_expires from strings to datetime64[ns]
df_tm_bio_status['dob'] = pd.to_datetime(df_tm_bio_status['dob'])
df_tm_bio_status['joined'] = pd.to_datetime(df_tm_bio_status['joined'])
df_tm_bio_status['contract_expires'] = pd.to_datetime(df_tm_bio_status['contract_expires'])
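# If any of these columns contain unparseable date strings, pd.to_datetime() will raise by default. A more defensive variant (an assumption, not the original approach) coerces bad values to NaT:
# In[ ]:
# Defensive variant: unparseable date strings become NaT rather than raising an error
df_tm_bio_status['contract_expires'] = pd.to_datetime(df_tm_bio_status['contract_expires'], errors='coerce')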
# ##### Numeric
# The `value` column needs to be converted from a string to a numeric type using the [to_numeric()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html) method.
# In[42]:
# Convert market_value from string to numeric
df_tm_bio_status['market_value'] = pd.to_numeric(df_tm_bio_status['market_value'])
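# pd.to_numeric() raises on non-numeric strings by default; with errors='coerce' they become NaN instead. A minimal sketch on sample values:
# In[ ]:
# Non-numeric placeholders (e.g. '-') are coerced to NaN
pd.to_numeric(pd.Series(['1500000', '225000', '-', None]), errors='coerce')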
# #### 4.1.6. Create New Attributes
# ##### Position Codes
# In[43]:
df_tm_bio_status['position'].unique()
# In[44]:
dict_positions_codes = {'Goalkeeper': 'GK',
'Defender': 'DEF',
'Defender - Centre-Back': 'CB',
'Defender - Left-Back': 'LB',
'Defender - Right-Back': 'RB',
'midfield': 'MF',
'midfield - Central Midfield': 'CM',
'midfield - Defensive Midfield': 'CDM',
'midfield - Left Midfield': 'LM',
'midfield - Right Midfield': 'RM',
'midfield - Attacking Midfield': 'CAM',
'attack': 'FW',
'attack - Left Winger': 'LW',
'attack - Right Winger': 'RW',
'attack - Second Striker': 'SS',
'attack - Centre-Forward': 'ST'
}
# In[45]:
# Map position codes to DataFrame
df_tm_bio_status['position_code'] = df_tm_bio_status['position'].map(dict_positions_codes)
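# A quick sanity check (my addition): any position values not covered by the mapping dictionary will map to NaN, so listing them catches typos in the dictionary keys:
# In[ ]:
# Show any positions left unmapped by dict_positions_codes
df_tm_bio_status.loc[df_tm_bio_status['position_code'].isnull(), 'position'].unique()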
# ##### Position Grouped
# In[46]:
dict_positions_grouped = {'Goalkeeper': 'Goalkeeper',
'Defender': 'Defender',
'Defender - Centre-Back': 'Defender',
'Defender - Left-Back': 'Defender',
'Defender - Right-Back': 'Defender',
'midfield': 'Midfielder',
'midfield - Central Midfield': 'Midfielder',
'midfield - Defensive Midfield': 'Midfielder',
'midfield - Left Midfield': 'Midfielder',
'midfield - Right Midfield': 'Midfielder',
'midfield - Attacking Midfield': 'Midfielder',
'attack': 'Forward',
'attack - Left Winger': 'Forward',
'attack - Right Winger': 'Forward',
'attack - Second Striker': 'Forward',
'attack - Centre-Forward': 'Forward'
}
# In[47]:
# Map grouped positions to DataFrame
df_tm_bio_status['position_grouped'] = df_tm_bio_status['position'].map(dict_positions_grouped)
# ##### Goalkeeper / Outfielder
# In[48]:
# Separate Goalkeeper and Outfielders
df_tm_bio_status['outfielder_goalkeeper'] = np.where(df_tm_bio_status['position'].isnull(), np.nan, (np.where(df_tm_bio_status['position'] == 'Goalkeeper', 'Goalkeeper', 'Outfielder')))
# ##### Current Age
# In[49]:
now = pd.Timestamp('now')
df_tm_bio_status['dob'] = pd.to_datetime(df_tm_bio_status['dob'], format='%m%d%y')
# Fix two-digit years parsed into the future (e.g. '68' -> 2068 rather than 1968) by subtracting 100 years from any future date of birth
df_tm_bio_status['dob'] = df_tm_bio_status['dob'].where(df_tm_bio_status['dob'] < now, df_tm_bio_status['dob'] - np.timedelta64(100, 'Y'))
# Calculate current age in whole years ('<m8[Y]' casts the timedelta to year units)
df_tm_bio_status['age'] = (now - df_tm_bio_status['dob']).astype('<m8[Y]')
# #### 4.1.7. Convert Euros to Pounds
# In[56]:
# Get EUR to GBP exchange rate
## Get latest currency rates
c = CurrencyConverter()
## Get conversion rate from EUR to GBP
rate_eur_gbp = (c.convert(1, 'EUR', 'GBP'))
rate_eur_gbp
# In[57]:
# Overwrite with a fixed EUR to GBP rate for consistent, reproducible valuations
rate_eur_gbp = 0.90
rate_eur_gbp
# In[58]:
# Convert TM.com Euro value to Pounds
df_tm_bio_status['market_value_gbp'] = df_tm_bio_status['market_value'] * rate_eur_gbp
# #### 4.1.8. Rename Columns
# In[59]:
df_tm_bio_status = df_tm_bio_status.rename(columns={'player_id': 'tm_id',
'season_x': 'season',
'market_value': 'market_value_eur',
'day_of_birth': 'birth_day',
'month_of_birth': 'birth_month',
'year_of_birth': 'birth_year'
}
)
# #### 4.1.9. Filter Players in the 'Big 5' European Leagues and MLS
# In[62]:
# Filter players in the 'Big 5' European Leagues and MLS
df_tm_bio_status_big5_mls = df_tm_bio_status[df_tm_bio_status['league_code'].isin(lst_big5_mls_league_codes)]
# In[64]:
print('No. rows in DataFrame BEFORE filtration: {}'.format(len(df_tm_bio_status)))
print('No. rows in DataFrame AFTER filtration: {}\n'.format(len(df_tm_bio_status_big5_mls)))
print('Variance in rows before and after filtration: {}\n'.format(len(df_tm_bio_status_big5_mls) - len(df_tm_bio_status)))
print('-'*10)
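# As a hedged check (my addition), confirm which of the target league codes actually appear in the unified DataFrame before relying on the filtered subset:
# In[ ]:
# League codes from the target list that are present in the data
sorted(set(lst_big5_mls_league_codes) & set(df_tm_bio_status['league_code'].dropna().unique()))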
# #### 4.1.10. Exporting the Engineered DataFrames
# Export the two engineered [TransferMarkt](https://www.transfermarkt.co.uk/) DataFrames as CSV files.
# ##### All players
# In[65]:
## Filename
file_name = 'tm_player_bio_status_all'
## Full season string
full_season_string = '2016/2017 - 2021/2022'
## Short season string
short_season_string = '1617-2122'
# In[67]:
# Export DataFrame as a CSV file
## Export a copy to the 'archive' subfolder of the TM folder, including the date
df_tm_bio_status.to_csv(data_dir_tm + f'/engineered/bio-status/archive/{file_name}_{short_season_string}_last_updated_{today}.csv', index=None, header=True)
## Export another copy to the TM folder called 'latest' (can be overwritten)
df_tm_bio_status.to_csv(data_dir_tm + f'/engineered/bio-status/{file_name}_{short_season_string}_latest.csv', index=None, header=True)
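# An optional read-back check (my addition, not part of the original workflow) that the 'latest' export loads cleanly:
# In[ ]:
# Re-import the exported CSV and confirm its shape
df_check = pd.read_csv(data_dir_tm + f'/engineered/bio-status/{file_name}_{short_season_string}_latest.csv')
df_check.shape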
# ##### 'Big 5' European and MLS Players
# In[68]:
## Filename
file_name = 'tm_player_bio_status_big5_mls'
## Full season string
full_season_string = '2016/2017-2021/2022'
## Short season string
short_season_string = '1617-2122'
# In[70]:
## 'Big 5' European Leagues and MLS
### Export a copy to the 'archive' subfolder of the TM folder, including the date
df_tm_bio_status_big5_mls.to_csv(data_dir_tm + f'/engineered/bio-status/archive/{file_name}_{short_season_string}_last_updated_{today}.csv', index=None, header=True)
### Export another copy to the TM folder called 'latest' (can be overwritten)
df_tm_bio_status_big5_mls.to_csv(data_dir_tm + f'/engineered/bio-status/{file_name}_{short_season_string}_latest.csv', index=None, header=True)
# Now that we have created our pandas DataFrames and wrangled the data to meet our needs, we'll next conduct an [Exploratory Data Analysis](#section5).
# ## 5. Exploratory Data Analysis
# In[71]:
df_tm_bio_status_big5_mls.head()
# ## 6. Summary
# This Jupyter notebook engineered scraped football data from [TransferMarkt](https://www.transfermarkt.co.uk/) using [pandas](http://pandas.pydata.org/) for data manipulation through DataFrames.
# ## 7. Next Steps
# The next step is to use this data and match it to FBref, 21st Club, Opta, FIFA, and other TransferMarkt datasets, such as the Transfer History dataset.
# ## 8. References
#
# #### Data and Web Scraping
# * [tyrone_mings GitHub repository](https://github.com/FCrSTATS/tyrone_mings) by [FCrSTATS](https://github.com/FCrSTATS)
# * [Python Package Index (PyPI) tyrone-mings library](https://pypi.org/project/tyrone-mings/)
# * [Beyond crowd judgments: Data-driven estimation of market value in association football](https://www.sciencedirect.com/science/article/pii/S0377221717304332) by Oliver Müller, Alexander Simons, and Markus Weinmann.
# * [06/04/2020: BBC - Premier League squads 'drop £1.6bn in value'](https://www.bbc.co.uk/sport/football/52221463).
# ---
#
# ***Visit my website [eddwebster.com](https://www.eddwebster.com) or my [GitHub Repository](https://github.com/eddwebster) for more projects. If you'd like to get in contact, my Twitter handle is [@eddwebster](http://www.twitter.com/eddwebster) and my email is: edd.j.webster@gmail.com.***
# [Back to the top](#top)