#!/usr/bin/env python
# coding: utf-8
#
# # TransferMarkt Player Bio and Status Data Engineering
# ##### Notebook to engineer data scraped from [TransferMarkt](https://www.transfermarkt.co.uk/) using the [Tyrone Mings web scraper](https://github.com/FCrSTATS/tyrone_mings) by [FCrSTATS](https://twitter.com/FC_rstats).
#
# ### By [Edd Webster](https://www.twitter.com/eddwebster)
# Notebook first written: 13/09/2020
# Notebook last updated: 30/08/2021
#
# ![title](../../img/transfermarkt-logo-banner.png)
# ___
#
#
#
# ## Introduction
# This notebook engineers previously scraped data from [TransferMarkt](https://www.transfermarkt.co.uk/) using the [FCrSTATS](https://twitter.com/FC_rstats) [Tyrone Mings webscraper](https://github.com/FCrSTATS/tyrone_mings) and manipulates this landed data as DataFrames using [pandas](http://pandas.pydata.org/), with [matplotlib](https://matplotlib.org/) for visualisation.
#
# For more information about this notebook and the author, I'm available through all the following channels:
# * [eddwebster.com](https://www.eddwebster.com/);
# * edd.j.webster@gmail.com;
# * [@eddwebster](https://www.twitter.com/eddwebster);
# * [linkedin.com/in/eddwebster](https://www.linkedin.com/in/eddwebster/);
# * [github/eddwebster](https://github.com/eddwebster/);
# * [public.tableau.com/profile/edd.webster](https://public.tableau.com/profile/edd.webster);
# * [kaggle.com/eddwebster](https://www.kaggle.com/eddwebster); and
# * [hackerrank.com/eddwebster](https://www.hackerrank.com/eddwebster).
#
# ![title](../../img/fifa21eddwebsterbanner.png)
#
# The accompanying GitHub repository for this notebook can be found [here](https://github.com/eddwebster/football_analytics) and a static version of this notebook can be found [here](https://nbviewer.jupyter.org/github/eddwebster/football_analytics/blob/master/notebooks/A%29%20Web%20Scraping/TransferMarkt%20Web%20Scraping%20and%20Parsing.ipynb).
# ___
#
#
#
# ## Notebook Contents
# 1. [Notebook Dependencies](#section1)
# 2. [Project Brief](#section2)
# 3. [Data Sources](#section3)
# 1. [Introduction](#section3.1)
# 2. [Data Dictionary](#section3.2)
# 3. [Creating the DataFrame](#section3.3)
# 4. [Initial Data Handling](#section3.4)
# 5. [Export the Raw DataFrame](#section3.5)
# 4. [Data Engineering](#section4)
# 1. [Introduction](#section4.1)
# 2. [Columns of Interest](#section4.2)
# 3. [String Cleaning](#section4.3)
# 4. [Converting Data Types](#section4.4)
# 5. [Export the Engineered DataFrame](#section4.5)
# 5. [Exploratory Data Analysis (EDA)](#section5)
# 1. [...](#section5.1)
# 2. [...](#section5.2)
# 3. [...](#section5.3)
# 6. [Summary](#section6)
# 7. [Next Steps](#section7)
# 8. [Bibliography](#section8)
# ___
#
#
#
# ## 1. Notebook Dependencies
# This notebook was written using [Python 3](https://docs.python.org/3.7/) and requires the following libraries:
# * [`Jupyter notebooks`](https://jupyter.org/) for the notebook environment in which this project is presented;
# * [`NumPy`](http://www.numpy.org/) for multidimensional array computing;
# * [`pandas`](http://pandas.pydata.org/) for data analysis and manipulation;
# * [`BeautifulSoup`](https://pypi.org/project/beautifulsoup4/) for web scraping; and
# * [`matplotlib`](https://matplotlib.org/contents.html?v=20200411155018) for data visualisations.
#
# All packages used for this notebook except for BeautifulSoup can be obtained by downloading and installing the [Conda](https://anaconda.org/anaconda/conda) distribution, available on all platforms (Windows, Linux and Mac OSX). Step-by-step guides on how to install Anaconda can be found for Windows [here](https://medium.com/@GalarnykMichael/install-python-on-windows-anaconda-c63c7c3d1444) and Mac [here](https://medium.com/@GalarnykMichael/install-python-on-mac-anaconda-ccd9f2014072), as well as in the Anaconda documentation itself [here](https://docs.anaconda.com/anaconda/install/).
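# The remaining libraries can be installed with pip from within a notebook cell. A minimal sketch (the exact package list is an assumption based on the imports in the next cell; adjust to your environment):
# In[ ]:
# Install the non-Conda dependencies with pip (assumed PyPI package names; uncomment to run)
#get_ipython().system('pip install beautifulsoup4 tyrone-mings currencyconverter recordlinkage missingno')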
# ### Import Libraries and Modules
# In[72]:
# Python ≥3.5 (ideally)
import platform
import sys, getopt
assert sys.version_info >= (3, 5)
import csv
# Import Dependencies
get_ipython().run_line_magic('matplotlib', 'inline')
# Math Operations
import numpy as np
import math
from math import pi
# Datetime
import datetime
from datetime import date
import time
# Data Preprocessing
import pandas as pd
import os
import re
import random
from io import BytesIO
from pathlib import Path
# Reading directories
import glob
from os.path import basename
# Flatten lists
from functools import reduce
# Working with JSON
import json
from pandas.io.json import json_normalize
# Web Scraping
import requests
from bs4 import BeautifulSoup
# Currency Converter
from currency_converter import CurrencyConverter
# APIs
from tyrone_mings import *
# Fuzzy Matching - Record Linkage
import recordlinkage
import jellyfish
import numexpr as ne
# Data Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')
import missingno as msno
# Progress Bar
from tqdm import tqdm
# Display in Jupyter
from IPython.display import Image, YouTubeVideo
from IPython.core.display import HTML
# Ignore Warnings
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")
print('Setup Complete')
# In[2]:
# Python / module versions used here for reference
print('Python: {}'.format(platform.python_version()))
print('NumPy: {}'.format(np.__version__))
print('pandas: {}'.format(pd.__version__))
print('matplotlib: {}'.format(mpl.__version__))
# ### Defined Filepaths
# In[73]:
# Set up initial paths to subfolders
base_dir = os.path.join('..', '..')
data_dir = os.path.join(base_dir, 'data')
data_dir_tm = os.path.join(base_dir, 'data', 'tm')
img_dir = os.path.join(base_dir, 'img')
fig_dir = os.path.join(base_dir, 'img', 'fig')
video_dir = os.path.join(base_dir, 'video')
# ### Defined Variables
# In[3]:
# Define today's date
# Define today's date as a ddmmyyyy string, e.g. '30082021'
today = datetime.datetime.now().strftime('%d%m%Y')
# ### Defined Lists
# In[61]:
# Defined Lists
## Define list of league codes
df_leagues = pd.read_csv(data_dir_tm + '/reference/tm_leagues_comps.csv')
lst_league_codes = df_leagues['league_code'].tolist()
## Define list of 'Big 5' European Leagues and MLS codes
lst_big5_mls_league_codes = ['GB1', 'FR1', 'L1', 'IT1', 'ES1', 'MLS1']
# ### Notebook Settings
# In[5]:
pd.set_option('display.max_columns', None)
# ---
#
#
#
# ## 2. Project Brief
# This Jupyter notebook is part of a series of notebooks to scrape, parse, engineer, unify, and then model football data, culminating in an Expected Transfer (xTransfer) player performance vs. valuation model. This model aims to identify under- and over-performing players based on their on-the-pitch output relative to their transfer fees and wages.
#
# This particular notebook is one of several data engineering notebooks, and cleans player bio and status (valuation) data from [TransferMarkt](https://www.transfermarkt.co.uk/) using [pandas](http://pandas.pydata.org/).
#
# [TransferMarkt](https://www.transfermarkt.co.uk/) is a German-based website owned by [Axel Springer](https://www.axelspringer.com/en/) and is the leading website for the football transfer market. The website posts football-related data, including scores and results, football news, transfer rumours, and, most usefully for us, calculated estimates of the market values of teams and individual players.
#
# To read more about how these estimations are made, [Beyond crowd judgments: Data-driven estimation of market value in association football](https://www.sciencedirect.com/science/article/pii/S0377221717304332) by Oliver Müller, Alexander Simons, and Markus Weinmann does an excellent job of explaining how the estimations are made and their level of accuracy.
#
# This notebook, along with the other notebooks in this project workflow, is shown in the following diagram:
#
# ![roadmap](../../img/football_analytics_data_roadmap.png)
#
# Links to these notebooks in the [`football_analytics`](https://github.com/eddwebster/football_analytics) GitHub repository can be found at the following:
# * [Webscraping](https://github.com/eddwebster/football_analytics/tree/master/notebooks/1_data_scraping)
# + [FBref Player Stats Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/FBref%20Player%20Stats%20Web%20Scraping.ipynb)
# + [TransferMarkt Player Bio and Status Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/TransferMarkt%20Player%20Bio%20and%20Status%20Web%20Scraping.ipynb)
# + [TransferMarkt Player Valuation Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/TransferMarkt%20Player%20Valuation%20Web%20Scraping.ipynb)
# + [TransferMarkt Player Recorded Transfer Fees Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/TransferMarkt%20Player%20Recorded%20Transfer%20Fees%20Webscraping.ipynb)
# + [Capology Player Salary Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/Capology%20Player%20Salary%20Web%20Scraping.ipynb)
# + [FBref Team Stats Webscraping](https://github.com/eddwebster/football_analytics/blob/master/notebooks/1_data_scraping/FBref%20Team%20Stats%20Web%20Scraping.ipynb)
# * [Data Parsing](https://github.com/eddwebster/football_analytics/tree/master/notebooks/2_data_parsing)
# + [ELO Team Ratings Data Parsing](https://github.com/eddwebster/football_analytics/blob/master/notebooks/2_data_parsing/ELO%20Team%20Ratings%20Data%20Parsing.ipynb)
# * [Data Engineering](https://github.com/eddwebster/football_analytics/tree/master/notebooks/3_data_engineering)
# + [FBref Player Stats Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/FBref%20Player%20Stats%20Data%20Engineering.ipynb)
# + [TransferMarkt Player Bio and Status Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Player%20Bio%20and%20Status%20Data%20Engineering.ipynb)
# + [TransferMarkt Player Valuation Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Player%20Valuation%20Data%20Engineering.ipynb)
# + [TransferMarkt Player Recorded Transfer Fees Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Player%20Recorded%20Transfer%20Fees%20Data%20Engineering.ipynb)
# + [Capology Player Salary Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/Capology%20Player%20Salary%20Data%20Engineering.ipynb)
# + [FBref Team Stats Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/FBref%20Team%20Stats%20Data%20Engineering.ipynb)
# + [ELO Team Ratings Data Parsing](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/ELO%20Team%20Ratings%20Data%20Parsing.ipynb)
# + [TransferMarkt Team Recorded Transfer Fee Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Team%20Recorded%20Transfer%20Fee%20Data%20Engineering.ipynb) (aggregated from [TransferMarkt Player Recorded Transfer Fees notebook](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/TransferMarkt%20Player%20Recorded%20Transfer%20Fees%20Data%20Engineering.ipynb))
# + [Capology Team Salary Data Engineering](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/Capology%20Team%20Salary%20Data%20Engineering.ipynb) (aggregated from [Capology Player Salary notebook](https://github.com/eddwebster/football_analytics/blob/master/notebooks/3_data_engineering/Capology%20Player%20Salary%20Data%20Engineering.ipynb))
# * [Data Unification](https://github.com/eddwebster/football_analytics/tree/master/notebooks/4_data_unification)
# + [Golden ID for Player Level Datasets](https://github.com/eddwebster/football_analytics/blob/master/notebooks/4_data_unification/Golden%20ID%20for%20Player%20Level%20Datasets.ipynb)
# + [Golden ID for Team Level Datasets](https://github.com/eddwebster/football_analytics/blob/master/notebooks/4_data_unification/Golden%20ID%20for%20Team%20Level%20Datasets.ipynb)
# * [Production Datasets](https://github.com/eddwebster/football_analytics/tree/master/notebooks/5_production_datasets)
# + [Player Performance/Market Value Dataset](https://github.com/eddwebster/football_analytics/tree/master/notebooks/5_production_datasets/Player%20Performance/Market%20Value%20Dataset.ipynb)
# + [Team Performance/Market Value Dataset](https://github.com/eddwebster/football_analytics/tree/master/notebooks/5_production_datasets/Team%20Performance/Market%20Value%20Dataset.ipynb)
# * [Expected Transfer (xTransfer) Modeling](https://github.com/eddwebster/football_analytics/tree/master/notebooks/6_data_analysis_and_projects/expected_transfer_modeling)
# + [Expected Transfer (xTransfer) Modeling](https://github.com/eddwebster/football_analytics/tree/master/notebooks/6_data_analysis_and_projects/expected_transfer_modeling/Expected%20Transfer%20%20Modeling.ipynb)
# ---
#
#
#
# ## 3. Data Sources
#
#
# ### 3.1. Introduction
# Before conducting our EDA, the data first needs to be imported as a DataFrame in this Data Sources section ([Section 3](#section3)) and then cleaned in the Data Engineering section ([Section 4](#section4)).
#
# We'll be using the [pandas](http://pandas.pydata.org/) library to import our data to this workbook as a DataFrame.
# ### 3.2. Bio and Status Dataset
# #### 3.2.1. Data Dictionaries
# The [TransferMarkt](https://www.transfermarkt.co.uk/) dataset has six features (columns) with the following definitions and data types:
#
# | Feature | Data type |
# |------|-----|
# | `position_number` | object |
# | `position_description` | object |
# | `name` | object |
# | `dob` | object |
# | `nationality` | object |
# | `value` | object |
# #### 3.2.2. Import CSV files as pandas DataFrames
# ##### Import Bio data
# In[9]:
# Import DataFrame as a CSV file
df_tm_bio_raw = pd.read_csv(data_dir_tm + f'/raw/bio/tm_player_bio_combined_latest.csv')
# ##### Import Status data
# In[10]:
# Import DataFrame as a CSV file
df_tm_status_raw = pd.read_csv(data_dir_tm + f'/raw/status/tm_player_status_combined_latest.csv')
# #### 3.2.3. Preliminary Data Handling
# Let's check the quality of the dataset by looking at the first and last rows in pandas using the [head()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) and [tail()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) methods.
# ##### Bio data
# In[11]:
# Display the first five rows of the raw DataFrame, df_tm_bio_raw
df_tm_bio_raw.head()
# In[12]:
# Display the last five rows of the raw DataFrame, df_tm_bio_raw
df_tm_bio_raw.tail()
# In[13]:
# Print the shape of the raw DataFrame, df_tm_bio_raw
print(df_tm_bio_raw.shape)
# In[14]:
# Print the column names of the raw DataFrame, df_tm_bio_raw
print(df_tm_bio_raw.columns)
# The dataset has thirteen features (columns). Full details of these attributes can be found in the [Data Dictionary](#section3.2.1).
# In[15]:
# Data types of the features of the raw DataFrame, df_tm_bio_raw
df_tm_bio_raw.dtypes
# All thirteen of the columns have the `object` data type. Full details of these attributes and their data types can be found in the [Data Dictionary](#section3.2.1).
# In[16]:
# Info for the raw DataFrame, df_tm_bio_raw
df_tm_bio_raw.info()
# In[17]:
# Description of the raw DataFrame, df_tm_bio_raw, showing some summary statistics for each numerical column in the DataFrame
df_tm_bio_raw.describe()
# In[18]:
# Plot visualisation of the missing values for each feature of the raw DataFrame, df_tm_bio_raw
msno.matrix(df_tm_bio_raw, figsize=(30, 7))
# In[19]:
# Counts of missing values
tm_bio_null_value_stats = df_tm_bio_raw.isnull().sum(axis=0)
tm_bio_null_value_stats[tm_bio_null_value_stats != 0]
# The visualisation quickly shows us that there are a few missing values in most of the columns, but the dataset is otherwise ready for use.
# ##### Status data
# In[20]:
# Display the first five rows of the raw DataFrame, df_tm_status_raw
df_tm_status_raw.head()
# In[21]:
# Display the last five rows of the raw DataFrame, df_tm_status_raw
df_tm_status_raw.tail()
# In[22]:
# Print the shape of the raw DataFrame, df_tm_status_raw
print(df_tm_status_raw.shape)
# In[23]:
# Print the column names of the raw DataFrame, df_tm_status_raw
print(df_tm_status_raw.columns)
# The dataset has thirteen features (columns). Full details of these attributes can be found in the [Data Dictionary](#section3.2.1).
# In[24]:
# Data types of the features of the raw DataFrame, df_tm_status_raw
df_tm_status_raw.dtypes
# All thirteen of the columns have the `object` data type. Full details of these attributes and their data types can be found in the [Data Dictionary](#section3.2.1).
# In[25]:
# Info for the raw DataFrame, df_tm_status_raw
df_tm_status_raw.info()
# In[26]:
# Description of the raw DataFrame, df_tm_status_raw, showing some summary statistics for each numerical column in the DataFrame
df_tm_status_raw.describe()
# In[27]:
# Plot visualisation of the missing values for each feature of the raw DataFrame, df_tm_status_raw
msno.matrix(df_tm_status_raw, figsize=(30, 7))
# In[28]:
# Counts of missing values
tm_status_null_value_stats = df_tm_status_raw.isnull().sum(axis=0)
tm_status_null_value_stats[tm_status_null_value_stats != 0]
# The visualisation quickly shows us that there are a few missing values in most of the columns, but the dataset is otherwise ready for use.
# ## 4. Data Engineering
# Before we answer the questions in the brief through [Exploratory Data Analysis (EDA)](#section5), we'll first need to clean and wrangle the datasets into a form that meets our needs.
# ### 4.1. Bio and Status Datasets
# #### 4.1.1. Assign Raw DataFrames to New Engineered DataFrames
# In[29]:
# Assign Raw DataFrames to new Engineered DataFrames
df_tm_bio = df_tm_bio_raw.copy()
df_tm_status = df_tm_status_raw.copy()
# #### 4.1.2. Dedupe DataFrames
# A copy of each player is saved per season, which causes duplication if left untreated.
# ##### Bio data
# In[30]:
df_tm_bio = (df_tm_bio
.sort_values(['season', 'player_id'], ascending=[False, True])
#.drop(columns=['league_code', 'season'])
.drop_duplicates(subset=['player_id'], keep='first')
)
# In[31]:
print('No. rows in Bio DataFrame BEFORE deduplication: {}'.format(len(df_tm_bio_raw)))
print('No. rows in DataFrame AFTER deduplication: {}\n'.format(len(df_tm_bio)))
print('Variance in rows before and after deduplication: {}\n'.format(len(df_tm_bio_raw) - len(df_tm_bio)))
print('-'*10)
# In[32]:
df_tm_bio['player_id'].nunique()
# ##### Status data
# In[33]:
df_tm_status = (df_tm_status
.sort_values(['season', 'player_id'], ascending=[False, True])
.drop(columns=['league_code', 'season'])
.drop_duplicates(subset=['player_id'], keep='first')
)
# In[34]:
print('No. rows in Status DataFrame BEFORE deduplication: {}'.format(len(df_tm_status_raw)))
print('No. rows in DataFrame AFTER deduplication: {}\n'.format(len(df_tm_status)))
print('Variance in rows before and after deduplication: {}\n'.format(len(df_tm_status_raw) - len(df_tm_status)))
print('-'*10)
# In[35]:
df_tm_status['player_id'].nunique()
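# As a quick sanity check (my addition, not part of the original workflow), both deduplicated DataFrames should now contain exactly one row per player_id:
# In[ ]:
# Assert one row per player in each deduplicated DataFrame
assert df_tm_bio['player_id'].is_unique
assert df_tm_status['player_id'].is_unique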
# #### 4.1.3. Join Bio and Status Datasets
# In[36]:
# Join the Bio and Status DataFrames to form one, unified DataFrame
df_tm_bio_status = pd.merge(df_tm_bio, df_tm_status, left_on='player_id', right_on='player_id', how='left')
# In[37]:
print('No. rows in Bio DataFrame BEFORE join to Status data: {}'.format(len(df_tm_bio)))
print('No. rows in DataFrame AFTER join: {}\n'.format(len(df_tm_bio_status)))
print('Variance in rows before and after join: {}\n'.format(len(df_tm_bio_status) - len(df_tm_bio)))
print('-'*10)
# In[38]:
df_tm_bio_status.shape
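# pandas can also validate the join cardinality directly; a hedged sketch (my addition, not part of the original workflow) that raises a MergeError if player_id is not unique on both sides:
# In[ ]:
# Optional check: validate that the Bio-to-Status join is one-to-one on player_id
_ = pd.merge(df_tm_bio, df_tm_status, on='player_id', how='left', validate='one_to_one')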
# #### 4.1.4. String Cleaning
# ##### Name
# In[39]:
# Remove accents and create lowercase name
df_tm_bio_status['name_lower'] = (df_tm_bio_status['player_name']
.str.normalize('NFKD')
.str.encode('ascii', errors='ignore')
.str.decode('utf-8')
.str.lower()
)
# First Name Lower
df_tm_bio_status['firstname_lower'] = df_tm_bio_status['name_lower'].str.split(' ').str[0]
# Last Name Lower
df_tm_bio_status['lastname_lower'] = df_tm_bio_status['name_lower'].str.rsplit(' ', 1).str[-1]
# First Initial Lower
df_tm_bio_status['firstinitial_lower'] = df_tm_bio_status['name_lower'].astype(str).str[0]
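# As a quick check, the normalisation chain above strips accents and lowercases names. A minimal sketch on sample values:
# In[ ]:
# e.g. 'Müller' -> 'muller', 'Mbappé' -> 'mbappe'
sample_names = pd.Series(['Müller', 'Mbappé', "N'Golo Kanté"])
(sample_names
    .str.normalize('NFKD')
    .str.encode('ascii', errors='ignore')
    .str.decode('utf-8')
    .str.lower()
)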
# ##### League Country lower
# In[40]:
# Remove accents and create lowercase name
df_tm_bio_status['league_country_lower'] = (df_tm_bio_status['current_club_country']
.str.normalize('NFKD')
.str.encode('ascii', errors='ignore')
.str.decode('utf-8')
.str.lower()
)
# #### 4.1.5. Converting Data Types
# ##### Date Time
# First we need to convert the `dob`, `joined`, and `contract_expires` columns from the `object` data type to `datetime64[ns]`, using the [.to_datetime()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) method.
# In[41]:
# Convert dob, joined, and contract_expires from strings to datetime64[ns]
df_tm_bio_status['dob'] = pd.to_datetime(df_tm_bio_status['dob'])
df_tm_bio_status['joined'] = pd.to_datetime(df_tm_bio_status['joined'])
df_tm_bio_status['contract_expires'] = pd.to_datetime(df_tm_bio_status['contract_expires'])
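# If any of these columns contain unparseable date strings, pd.to_datetime() will raise by default. A more defensive variant (an assumption, not the original approach) coerces bad values to NaT:
# In[ ]:
# Defensive variant: unparseable date strings become NaT rather than raising an error
df_tm_bio_status['contract_expires'] = pd.to_datetime(df_tm_bio_status['contract_expires'], errors='coerce')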
# ##### Numeric
# The `value` column needs to be converted from a string to a numeric type using the [to_numeric()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_numeric.html) method.
# In[42]:
# Convert market_value from string to numeric
df_tm_bio_status['market_value'] = pd.to_numeric(df_tm_bio_status['market_value'])
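# pd.to_numeric() raises on non-numeric strings by default; with errors='coerce' they become NaN instead. A minimal sketch on sample values:
# In[ ]:
# Non-numeric placeholders (e.g. '-') are coerced to NaN
pd.to_numeric(pd.Series(['1500000', '225000', '-', None]), errors='coerce')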
# #### 4.1.6. Create New Attributes
# ##### Position Codes
# In[43]:
df_tm_bio_status['position'].unique()
# In[44]:
dict_positions_codes = {'Goalkeeper': 'GK',
'Defender': 'DEF',
'Defender - Centre-Back': 'CB',
'Defender - Left-Back': 'LB',
'Defender - Right-Back': 'RB',
'midfield': 'MF',
'midfield - Central Midfield': 'CM',
'midfield - Defensive Midfield': 'CDM',
'midfield - Left Midfield': 'LM',
'midfield - Right Midfield': 'RM',
'midfield - Attacking Midfield': 'CAM',
'attack': 'FW',
'attack - Left Winger': 'LW',
'attack - Right Winger': 'RW',
'attack - Second Striker': 'SS',
'attack - Centre-Forward': 'ST'
}
# In[45]:
# Map position codes to DataFrame
df_tm_bio_status['position_code'] = df_tm_bio_status['position'].map(dict_positions_codes)
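# A quick sanity check (my addition): any position values not covered by the mapping dictionary will map to NaN, so listing them catches typos in the dictionary keys:
# In[ ]:
# Show any positions left unmapped by dict_positions_codes
df_tm_bio_status.loc[df_tm_bio_status['position_code'].isnull(), 'position'].unique()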
# ##### Position Grouped
# In[46]:
dict_positions_grouped = {'Goalkeeper': 'Goalkeeper',
'Defender': 'Defender',
'Defender - Centre-Back': 'Defender',
'Defender - Left-Back': 'Defender',
'Defender - Right-Back': 'Defender',
'midfield': 'Midfielder',
'midfield - Central Midfield': 'Midfielder',
'midfield - Defensive Midfield': 'Midfielder',
'midfield - Left Midfield': 'Midfielder',
'midfield - Right Midfield': 'Midfielder',
'midfield - Attacking Midfield': 'Midfielder',
'attack': 'Forward',
'attack - Left Winger': 'Forward',
'attack - Right Winger': 'Forward',
'attack - Second Striker': 'Forward',
'attack - Centre-Forward': 'Forward'
}
# In[47]:
# Map grouped positions to DataFrame
df_tm_bio_status['position_grouped'] = df_tm_bio_status['position'].map(dict_positions_grouped)
# ##### Goalkeeper / Outfielder
# In[48]:
# Separate Goalkeeper and Outfielders
df_tm_bio_status['outfielder_goalkeeper'] = np.where(df_tm_bio_status['position'].isnull(), np.nan, (np.where(df_tm_bio_status['position'] == 'Goalkeeper', 'Goalkeeper', 'Outfielder')))
# ##### Current Age
# In[49]:
now = pd.Timestamp('now')
df_tm_bio_status['dob'] = pd.to_datetime(df_tm_bio_status['dob'], format='%m%d%y')
# Fix two-digit years parsed into the future (e.g. '68' -> 2068 rather than 1968) by subtracting 100 years from any future date of birth
df_tm_bio_status['dob'] = df_tm_bio_status['dob'].where(df_tm_bio_status['dob'] < now, df_tm_bio_status['dob'] - np.timedelta64(100, 'Y'))
# Calculate current age in whole years ('<m8[Y]' casts the timedelta to year units)
df_tm_bio_status['age'] = (now - df_tm_bio_status['dob']).astype('<m8[Y]')
# #### 4.1.7. Convert Euros to Pounds
# In[56]:
# Get EUR to GBP exchange rate
## Get latest currency rates
c = CurrencyConverter()
## Get conversion rate from EUR to GBP
rate_eur_gbp = (c.convert(1, 'EUR', 'GBP'))
rate_eur_gbp
# In[57]:
# Overwrite with a fixed EUR to GBP rate for consistent, reproducible valuations
rate_eur_gbp = 0.90
rate_eur_gbp
# In[58]:
# Convert TM.com Euro value to Pounds
df_tm_bio_status['market_value_gbp'] = df_tm_bio_status['market_value'] * rate_eur_gbp
# #### 4.1.8. Rename Columns
# In[59]:
df_tm_bio_status = df_tm_bio_status.rename(columns={'player_id': 'tm_id',
'season_x': 'season',
'market_value': 'market_value_eur',
'day_of_birth': 'birth_day',
'month_of_birth': 'birth_month',
'year_of_birth': 'birth_year'
}
)
# #### 4.1.9. Filter Players in the 'Big 5' European Leagues and MLS
# In[62]:
# Filter players in the 'Big 5' European Leagues and MLS
df_tm_bio_status_big5_mls = df_tm_bio_status[df_tm_bio_status['league_code'].isin(lst_big5_mls_league_codes)]
# In[64]:
print('No. rows in DataFrame BEFORE filtration: {}'.format(len(df_tm_bio_status)))
print('No. rows in DataFrame AFTER filtration: {}\n'.format(len(df_tm_bio_status_big5_mls)))
print('Variance in rows before and after filtration: {}\n'.format(len(df_tm_bio_status_big5_mls) - len(df_tm_bio_status)))
print('-'*10)
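# As a hedged check (my addition), confirm which of the target league codes actually appear in the unified DataFrame before relying on the filtered subset:
# In[ ]:
# League codes from the target list that are present in the data
sorted(set(lst_big5_mls_league_codes) & set(df_tm_bio_status['league_code'].dropna().unique()))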
# #### 4.1.10. Exporting the Engineered DataFrames
# Export the two engineered [TransferMarkt](https://www.transfermarkt.co.uk/) DataFrames as CSV files.
# ##### All players
# In[65]:
## Filename
file_name = 'tm_player_bio_status_all'
## Full season string
full_season_string = '2016/2017 - 2021/2022'
## Short season string
short_season_string = '1617-2122'
# In[67]:
# Export DataFrame as a CSV file
## Export a copy to the 'archive' subfolder of the TM folder, including the date
df_tm_bio_status.to_csv(data_dir_tm + f'/engineered/bio-status/archive/{file_name}_{short_season_string}_last_updated_{today}.csv', index=None, header=True)
## Export another copy to the TM folder called 'latest' (can be overwritten)
df_tm_bio_status.to_csv(data_dir_tm + f'/engineered/bio-status/{file_name}_{short_season_string}_latest.csv', index=None, header=True)
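# An optional read-back check (my addition, not part of the original workflow) that the 'latest' export loads cleanly:
# In[ ]:
# Re-import the exported CSV and confirm its shape
df_check = pd.read_csv(data_dir_tm + f'/engineered/bio-status/{file_name}_{short_season_string}_latest.csv')
df_check.shape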
# ##### 'Big 5' European and MLS Players
# In[68]:
## Filename
file_name = 'tm_player_bio_status_big5_mls'
## Full season string
full_season_string = '2016/2017-2021/2022'
## Short season string
short_season_string = '1617-2122'
# In[70]:
## 'Big 5' European Leagues and MLS
### Export a copy to the 'archive' subfolder of the TM folder, including the date
df_tm_bio_status_big5_mls.to_csv(data_dir_tm + f'/engineered/bio-status/archive/{file_name}_{short_season_string}_last_updated_{today}.csv', index=None, header=True)
### Export another copy to the TM folder called 'latest' (can be overwritten)
df_tm_bio_status_big5_mls.to_csv(data_dir_tm + f'/engineered/bio-status/{file_name}_{short_season_string}_latest.csv', index=None, header=True)
# Now that we have created our pandas DataFrames and wrangled the data to meet our needs, we'll next conduct an [Exploratory Data Analysis](#section5).
# ## 5. Exploratory Data Analysis
# In[71]:
df_tm_bio_status_big5_mls.head()
# ## 6. Summary
# This Jupyter notebook engineered scraped football data from [TransferMarkt](https://www.transfermarkt.co.uk/) using [pandas](http://pandas.pydata.org/) for data manipulation through DataFrames.
# ## 7. Next Steps
# The next step is to use this data and match it to FBref, 21st Club, Opta, FIFA, and other TransferMarkt datasets, such as the Transfer History dataset.
# ## 8. References
#
# #### Data and Web Scraping
# * [tyrone_mings GitHub repository](https://github.com/FCrSTATS/tyrone_mings) by [FCrSTATS](https://github.com/FCrSTATS)
# * [Python Package Index (PyPI) tyrone-mings library](https://pypi.org/project/tyrone-mings/)
# * [Beyond crowd judgments: Data-driven estimation of market value in association football](https://www.sciencedirect.com/science/article/pii/S0377221717304332) by Oliver Müller, Alexander Simons, and Markus Weinmann.
# * [06/04/2020: BBC - Premier League squads 'drop £1.6bn in value'](https://www.bbc.co.uk/sport/football/52221463).
# ---
#
# ***Visit my website [eddwebster.com](https://www.eddwebster.com) or my [GitHub Repository](https://github.com/eddwebster) for more projects. If you'd like to get in contact, my Twitter handle is [@eddwebster](http://www.twitter.com/eddwebster) and my email is: edd.j.webster@gmail.com.***
# [Back to the top](#top)