Last updated: 31/08/2020
Click here to jump straight to the Exploratory Data Analysis section and skip the Task Brief, Data Sources, and Data Engineering sections. Or click here to jump straight to the Conclusion.
This notebook scrapes player valuation data from TransferMarkt, using Beautifulsoup for web scraping and pandas for data manipulation through DataFrames.
For more information about this notebook and the author, I'm available through all the following channels:
This notebook was written using Python 3 and requires the following libraries:
- Jupyter notebooks for this notebook environment with which this project is presented;
- NumPy for multidimensional array computing;
- pandas for data analysis and manipulation;
- tqdm for a clean progress bar;
- requests for executing HTTP requests;
- Beautifulsoup for web scraping; and
- matplotlib for data visualisations.

All packages used for this notebook except for BeautifulSoup can be obtained by downloading and installing the Conda distribution, available on all platforms (Windows, Linux and Mac OSX). Step-by-step guides on how to install Anaconda can be found for Windows here and Mac here, as well as in the Anaconda documentation itself here.
# Python ≥3.5 (ideally)
import platform
import sys, getopt
assert sys.version_info >= (3, 5)
import csv
# Import Dependencies
%matplotlib inline
# Math Operations
import numpy as np
from math import pi
# Datetime
import datetime
from datetime import date
import time
# Data Preprocessing
import pandas as pd # version 1.0.3
import os # used to read the csv filenames
import re
import random
from io import BytesIO
from pathlib import Path
# Reading directories
import glob
import os
# Working with JSON
import json
from pandas.io.json import json_normalize
# Web Scraping
import requests
from bs4 import BeautifulSoup
import re
# Fuzzy Matching - Record Linkage
import recordlinkage
import jellyfish
import numexpr as ne
# Data Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('seaborn-whitegrid')
import missingno as msno # visually display missing data
# Progress Bar
from tqdm import tqdm # a clean progress bar library
# Display in Jupyter
from IPython.display import Image, YouTubeVideo
from IPython.core.display import HTML
# Ignore Warnings
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")
print('Setup Complete')
Setup Complete
# Python / module versions used here for reference
print('Python: {}'.format(platform.python_version()))
print('NumPy: {}'.format(np.__version__))
print('pandas: {}'.format(pd.__version__))
print('matplotlib: {}'.format(mpl.__version__))
print('Seaborn: {}'.format(sns.__version__))
Python: 3.7.6 NumPy: 1.18.1 pandas: 1.0.1 matplotlib: 3.1.3 Seaborn: 0.10.0
# Define today's date
today = datetime.datetime.now().strftime('%d/%m/%Y').replace('/', '')
# Set up initial paths to subfolders
base_dir = os.path.join('..', '..', )
data_dir = os.path.join(base_dir, 'data')
data_dir_fbref = os.path.join(base_dir, 'data', 'fbref')
data_dir_tm = os.path.join(base_dir, 'data', 'tm')
img_dir = os.path.join(base_dir, 'img')
fig_dir = os.path.join(base_dir, 'img', 'fig')
video_dir = os.path.join(base_dir, 'video')
This Jupyter notebook explores how to scrape football data from TransferMarkt, using pandas for data manipulation through DataFrames and Beautifulsoup for web scraping.
The player value data produced in this notebook is exported to CSV. This data can be further analysed in Python, joined to other datasets, or explored using Tableau, PowerBI, or Microsoft Excel.
TransferMarkt is a German-based website owned by Axel Springer and is the leading website for the football transfer market. The website posts football-related data, including: scores and results, football news, transfer rumours, and, most usefully for us, calculated estimates of the market values of teams and individual players.
To read more about how these estimations are made, Beyond crowd judgments: Data-driven estimation of market value in association football by Oliver Müller, Alexander Simons, and Markus Weinmann does an excellent job of explaining the methodology and its level of accuracy.
Before conducting our EDA, the data needs to be imported as a DataFrame in the Data Sources section (Section 3) and cleaned in the Data Engineering section (Section 4).
We'll be using the pandas library to import our data to this workbook as a DataFrame.
The TransferMarkt dataset has six features (columns) with the following definitions and data types:
Feature | Data type |
---|---|
position_number | object |
position_description | object |
name | object |
dob | object |
nationality | object |
value | object |
Before scraping data from TransferMarkt, we need to look at the top five leagues that we wish to scrape.
The web scraper for TransferMarkt is made up of two parts: first, collecting the links to each club's squad page for the chosen leagues; and second, extracting the player information from each of those squad pages.
The information collected for all the players is then converted to a pandas DataFrame, from which we can view and manipulate the data.
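The list-zipping step at the heart of the scraper can be sketched in isolation. The short column lists below are hypothetical stand-ins for the values extracted from a single squad page:

```python
import pandas as pd

# Hypothetical extracted lists, one element per player on a squad page
pn = ['1', '13']
pos = ['Goalkeeper', 'Goalkeeper']
name = ['Thibaut Courtois', 'Andriy Lunin']
dob = ['May 11, 1992 (28)', 'Feb 11, 1999 (21)']
nat = ['Belgium', 'Ukraine']
val = ['£54.00m', '£2.43m']

# zip the parallel lists into rows, then build a DataFrame with named columns
df = pd.DataFrame(
    zip(pn, pos, name, dob, nat, val),
    columns=['position_number', 'position_description', 'name',
             'dob', 'nationality', 'value']
)
print(df.shape)  # (2, 6)
```

Each squad page produces one such DataFrame; the scraper appends them to a list and concatenates them at the end with pd.concat().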
An example webpage for a football league is the following: https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/GB1/plus/?saison_id=2019. As we can see, between the '/wettbewerb/' subdirectory path and '/plus/', there is a short league code. For the Premier League, the code is GB1.
In order to scrape the webpages, the TransferMarkt codes of the top five leagues need to be recorded, which are the following:
League Name on FIFA | Country | Corresponding TransferMarkt League Code |
---|---|---|
LaLiga Santander | Spain | ES1 |
Ligue 1 Conforama | France | FR1 |
Premier League | England | GB1 |
Serie A TIM | Italy | IT1 |
Bundesliga | Germany | L1 |
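Putting these codes together with the URL pattern, each league's overview URL can be built with str.format(), mirroring the template passed to main() later in the notebook:

```python
# URL template with a placeholder for the league code
url = 'https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id=2020'
lst_leagues = ['ES1', 'FR1', 'GB1', 'IT1', 'L1']

# Substitute each league code into the template
league_urls = [url.format(league) for league in lst_leagues]
print(league_urls[2])
# https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/GB1/plus/?saison_id=2020
```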
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'
}
# List of leagues by code for which we want to scrape player data - Big 5 European leagues
lst_leagues = ['ES1', 'FR1', 'GB1', 'IT1', 'L1']
# Assign season by year to season variable e.g. 2014/15 season = 2014
season = '2020' # 2020/21 season
# Run this script to scrape latest version of this data from TransferMarkt
## Start timer
tic = datetime.datetime.now()
## Scrape TransferMarkt data
def main(url):
with requests.Session() as req:
links = []
for league in lst_leagues:
print(f'Fetching Links from {league}')
r = req.get(url.format(league), headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
link = [f"{url[:31]}{item.next_element.get('href')}" for item in soup.findAll(
"td", class_="hauptlink no-border-links hide-for-small hide-for-pad")]
links.extend(link)
print(f'Collected {len(links)} Links')
goals = []
for num, link in enumerate(links):
print(f"Extracting Page# {num +1}")
r = req.get(link, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
target = soup.find("table", class_="items")
pn = [pn.text for pn in target.select("div.rn_nummer")]
pos = [pos.text for pos in target.findAll("td", class_=False)]
name = [name.text for name in target.select("td.hide")]
dob = [date.find_next(
"td").text for date in target.select("td.hide")]
nat = [" / ".join([a.get("alt") for a in nat.find_all_next("td")[1] if a.get("alt")]) for nat in target.findAll(
"td", itemprop="athlete")]
val = [val.get_text(strip=True)
for val in target.select('td.rechts.hauptlink')]
goal = zip(pn, pos, name, dob, nat, val)
df = pd.DataFrame(goal, columns=[
'position_number', 'position_description', 'name', 'dob', 'nationality', 'value'])
goals.append(df)
new = pd.concat(goals)
new.to_csv(data_dir_tm + '/raw/' + f'players_big5_2021_raw_{today}.csv', index=None, header=True)
main("https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id=2020")
#main('https://www.transfermarkt.co.uk/jumplist/startseite/wettbewerb/{}/plus/?saison_id=' + season)
## End timer
toc = datetime.datetime.now()
## Calculate time take
total_time = (toc-tic).total_seconds()
print(f'Time taken to scrape data for the Big 5 leagues is: {total_time:0.2f} seconds.')
Fetching Links from ES1 Fetching Links from FR1 Fetching Links from GB1 Fetching Links from IT1 Fetching Links from L1 Collected 98 Links Extracting Page# 1 Extracting Page# 2 Extracting Page# 3 Extracting Page# 4 Extracting Page# 5 Extracting Page# 6 Extracting Page# 7 Extracting Page# 8 Extracting Page# 9 Extracting Page# 10 Extracting Page# 11 Extracting Page# 12 Extracting Page# 13 Extracting Page# 14 Extracting Page# 15 Extracting Page# 16 Extracting Page# 17 Extracting Page# 18 Extracting Page# 19 Extracting Page# 20 Extracting Page# 21 Extracting Page# 22 Extracting Page# 23 Extracting Page# 24 Extracting Page# 25 Extracting Page# 26 Extracting Page# 27 Extracting Page# 28 Extracting Page# 29 Extracting Page# 30 Extracting Page# 31 Extracting Page# 32 Extracting Page# 33 Extracting Page# 34 Extracting Page# 35 Extracting Page# 36 Extracting Page# 37 Extracting Page# 38 Extracting Page# 39 Extracting Page# 40 Extracting Page# 41 Extracting Page# 42 Extracting Page# 43 Extracting Page# 44 Extracting Page# 45 Extracting Page# 46 Extracting Page# 47 Extracting Page# 48 Extracting Page# 49 Extracting Page# 50 Extracting Page# 51 Extracting Page# 52 Extracting Page# 53 Extracting Page# 54 Extracting Page# 55 Extracting Page# 56 Extracting Page# 57 Extracting Page# 58 Extracting Page# 59 Extracting Page# 60 Extracting Page# 61 Extracting Page# 62 Extracting Page# 63 Extracting Page# 64 Extracting Page# 65 Extracting Page# 66 Extracting Page# 67 Extracting Page# 68 Extracting Page# 69 Extracting Page# 70 Extracting Page# 71 Extracting Page# 72 Extracting Page# 73 Extracting Page# 74 Extracting Page# 75 Extracting Page# 76 Extracting Page# 77 Extracting Page# 78 Extracting Page# 79 Extracting Page# 80 Extracting Page# 81 Extracting Page# 82 Extracting Page# 83 Extracting Page# 84 Extracting Page# 85 Extracting Page# 86 Extracting Page# 87 Extracting Page# 88 Extracting Page# 89 Extracting Page# 90 Extracting Page# 91 Extracting Page# 92 Extracting Page# 93 Extracting 
Page# 94 Extracting Page# 95 Extracting Page# 96 Extracting Page# 97 Extracting Page# 98 Time taken to scrape data for the Big 5 leagues is: 161.51 seconds.
# Import data as a pandas DataFrame, df_tm_players_big5_2021_raw
## Look for most recent CSV file
list_of_files = glob.glob(data_dir_tm + '/raw/*') # * means all if need specific format then *.csv
filepath_latest_tm = max(list_of_files, key=os.path.getctime)
## Load in most recently parsed CSV file
df_tm_player_top5_2021_raw = pd.read_csv(filepath_latest_tm)
Let's check the quality of the dataset by looking at the first and last rows in pandas using the head() and tail() methods.
# Display the first 5 rows of the raw DataFrame, df_tm_player_top5_2021_raw
df_tm_player_top5_2021_raw.head()
position_number | position_description | name | dob | nationality | value | |
---|---|---|---|---|---|---|
0 | 1 | Goalkeeper | Thibaut Courtois | May 11, 1992 (28) | Belgium | £54.00m |
1 | 13 | Goalkeeper | Andriy Lunin | Feb 11, 1999 (21) | Ukraine | £2.43m |
2 | 26 | Goalkeeper | Diego Altube | Feb 22, 2000 (20) | Spain | £90Th. |
3 | 5 | Centre-Back | Raphaël Varane | Apr 25, 1993 (27) | France / Martinique | £57.60m |
4 | 3 | Centre-Back | Éder Militão | Jan 18, 1998 (22) | Brazil | £32.40m |
# Display the last 5 rows of the raw DataFrame, df_tm_player_top5_2021_raw
df_tm_player_top5_2021_raw.tail()
position_number | position_description | name | dob | nationality | value | |
---|---|---|---|---|---|---|
2809 | 18 | Centre-Forward | Sergio Córdova | Aug 9, 1997 (23) | Venezuela | £1.44m |
2810 | 9 | Centre-Forward | Fabian Klos | Dec 2, 1987 (32) | Germany | £900Th. |
2811 | 13 | Centre-Forward | Sebastian Müller | Jan 23, 2001 (19) | Germany | £270Th. |
2812 | 36 | Centre-Forward | Sven Schipplock | Nov 8, 1988 (31) | Germany | £270Th. |
2813 | 39 | Centre-Forward | Prince Osei Owusu | Jan 7, 1997 (23) | Germany / Ghana | £225Th. |
# Print the shape of the raw DataFrame, df_tm_player_top5_2021_raw
print(df_tm_player_top5_2021_raw.shape)
(2814, 6)
# Print the column names of the raw DataFrame, df_tm_player_top5_2021_raw
print(df_tm_player_top5_2021_raw.columns)
Index(['position_number', 'position_description', 'name', 'dob', 'nationality', 'value'], dtype='object')
The dataset has six features (columns). Full details of these attributes can be found in the Data Dictionary.
# Data types of the features of the raw DataFrame, df_tm_player_top5_2021_raw
df_tm_player_top5_2021_raw.dtypes
position_number object position_description object name object dob object nationality object value object dtype: object
All six of the columns have the object data type. Full details of these attributes and their data types can be found in the Data Dictionary.
# Info for the raw DataFrame, df_tm_player_top5_2021_raw
df_tm_player_top5_2021_raw.info()
# Description of the raw DataFrame, df_tm_player_top5_2021_raw, showing some summary statistics for each numerical column in the DataFrame
df_tm_player_top5_2021_raw.describe()
position_number | position_description | name | dob | nationality | value | |
---|---|---|---|---|---|---|
count | 2814 | 2814 | 2814 | 2814 | 2814 | 2781 |
unique | 85 | 14 | 2806 | 2209 | 416 | 146 |
top | - | Centre-Back | Danilo | Sep 1, 1993 (26) | Spain | £1.08m |
freq | 320 | 500 | 3 | 5 | 407 | 88 |
# Plot visualisation of the missing values for each feature of the raw DataFrame, df_tm_player_top5_2021_raw
msno.matrix(df_tm_player_top5_2021_raw, figsize = (30, 7))
<matplotlib.axes._subplots.AxesSubplot at 0x1a204ece10>
# Counts of missing values
tm_null_value_stats = df_tm_player_top5_2021_raw.isnull().sum(axis=0)
tm_null_value_stats[tm_null_value_stats != 0]
value 33 dtype: int64
The visualisation shows us very quickly that there are a few missing values in the value column, but otherwise the dataset is complete.
Before we answer the questions in the brief through Exploratory Data Analysis (EDA), we'll first need to clean and wrangle the datasets to a form that meet our needs.
# Assign the raw DataFrame to a new engineered DataFrame (as a copy, so the raw DataFrame is left untouched)
df_tm_player_top5_2021 = df_tm_player_top5_2021_raw.copy()
df_tm_player_top5_2021['name_lower'] = df_tm_player_top5_2021['name'].str.normalize('NFKD')\
.str.encode('ascii', errors='ignore')\
.str.decode('utf-8')\
.str.lower()
# First Name Lower
df_tm_player_top5_2021['firstname_lower'] = df_tm_player_top5_2021['name_lower'].str.split(' ').str[0]
# Last Name Lower
df_tm_player_top5_2021['lastname_lower'] = df_tm_player_top5_2021['name_lower'].str.rsplit(' ', 1).str[-1]
# First Initial Lower
df_tm_player_top5_2021['firstinitial_lower'] = df_tm_player_top5_2021['name_lower'].astype(str).str[0]
The dob column is messy and contains both the date of birth as a string and the age in brackets.
This string cleaning consists of two parts: firstly, splitting the string into its separate components; then, once the age column is created, we will replace it by determining the current age using the Python datetime module.
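The extraction pattern can first be checked against a single sample string. This standalone sketch uses Python's re module on a made-up date-of-birth value in the TransferMarkt format:

```python
import re

dob = 'May 11, 1992 (28)'  # sample string in the TransferMarkt format

# Capture the date text and the bracketed age as separate groups
match = re.match(r'(.+) \((\d+)\)', dob)
birth_date, age = match.groups()
print(birth_date)  # May 11, 1992
print(age)         # 28
```

The same regular expression is what pandas' str.extract() applies row by row in the cell below.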
# DoB string cleaning to create birth_date and age columns
df_tm_player_top5_2021[['birth_date', 'age']] = df_tm_player_top5_2021['dob'].str.extract(r'(.+) \((\d+)\)')
For the nationality, some of the players have dual nationality.
For example, Claudio Pizarro is a Peruvian-born player who has made 85 appearances for Peru, scoring 20 goals. However, his citizenship according to TransferMarkt is 'Peru / Italy'. For our needs, we only want to know the country the player is eligible to play for, not their full heritage, which from observations is always the first part of the string. We'll therefore discard anything after the first ' /' in the string to form a new playing_country column.
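On a plain string, the split logic looks like this; 'Peru / Italy' is the example cited above:

```python
nationality = 'Peru / Italy'

# Keep only the text before the ' /' separator, i.e. the playing country
playing_country = nationality.split(' /')[0]
print(playing_country)  # Peru
```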
# Take the first nationality i.e. text before the first space, ex. 'Peru / Italy'
df_tm_player_top5_2021['playing_country'] = df_tm_player_top5_2021['nationality'].str.split(' /').str[0]
The values of the players have prefixes (£), commas, spaces, and suffixes (m, k, Th.) that need to be cleaned and replaced before converting to a numerical value.
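Before applying the chained str.replace() calls below, the conversion logic can be sanity-checked with a small standalone function. parse_value here is a hypothetical helper for illustration only, not part of the notebook's pipeline:

```python
def parse_value(value: str) -> float:
    """Convert a TransferMarkt value string such as '£54.00m' or
    '£90Th.' to a plain number of pounds (a rough sketch)."""
    value = value.replace('£', '').strip()
    if value.endswith('m'):          # millions, e.g. '54.00m'
        return float(value[:-1]) * 1_000_000
    if value.endswith('Th.'):        # thousands, e.g. '90Th.'
        return float(value[:-3]) * 1_000
    return float(value) if value else float('nan')

print(parse_value('£54.00m'))  # 54000000.0
print(parse_value('£90Th.'))   # 90000.0
```

The string-replacement approach in the cells below achieves the same result entirely within pandas, which avoids a Python-level loop over the rows.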
# Value string cleaning from shortened string value to full numerical value
## Convert 'm' to '0000' (values have two decimal places, so e.g. '54.00m' becomes '54000000' once the '.' is removed)
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('m','0000')
## Convert 'k' to '000'
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('k','000')
## Convert 'Th' to '000'
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('Th','000')
## Remove '.' (as a literal string, not a regex metacharacter)
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('.', '', regex=False)
## Remove '£' sign
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('£','')
## Remove '-'
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('-','')
## Remove '  '
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace('  ','')
## Remove gaps
df_tm_player_top5_2021['value'] = df_tm_player_top5_2021['value'].str.replace(' ','')
First we need to convert the newly created birth_date column from the object data type to datetime64[ns], using the pd.to_datetime() function.
# Convert birth_date from string to datetime64[ns]
df_tm_player_top5_2021['birth_date'] = pd.to_datetime(df_tm_player_top5_2021['birth_date'])
# Date and time manipulation
from datetime import datetime
# Redetermine the age using the newly created birth_date column (after formatted to datetime data type)
## Remove all not numeric values use to_numeric with parameter errors='coerce' - it replaces non numeric to NaNs
df_tm_player_top5_2021['age'] = pd.to_numeric(df_tm_player_top5_2021['age'], errors='coerce')
## Convert floats to integers and leave null values
df_tm_player_top5_2021['age'] = np.nan_to_num(df_tm_player_top5_2021['age']).astype(int)
## Calculate current age
today = datetime.today()
df_tm_player_top5_2021['age'] = df_tm_player_top5_2021['birth_date'].apply(lambda x: today.year - x.year -
((today.month, today.day) < (x.month, x.day))
)
# df_tm_player_top5_2021['age'] = pd.to_numeric(df_tm_player_top5_2021['age'], downcast='signed')
The value column needs to be converted from a string to a numeric value using the pd.to_numeric() method.
# Convert string to integer
df_tm_player_top5_2021['value'] = pd.to_numeric(df_tm_player_top5_2021['value'])
...
sorted(df_tm_player_top5_2021['position_description'].unique())
['Attacking Midfield', 'Central Midfield', 'Centre-Back', 'Centre-Forward', 'Defensive Midfield', 'Goalkeeper', 'Left Midfield', 'Left Winger', 'Left-Back', 'Midfielder', 'Right Midfield', 'Right Winger', 'Right-Back', 'Second Striker']
dict_positions_tm = {
    'Attacking Midfield': 'Midfielder',
    'Central Midfield': 'Midfielder',
    'Centre-Back': 'Defender',
    'Centre-Forward': 'Forward',
    'Defensive Midfield': 'Midfielder',
    'Forward': 'Forward',
    'Goalkeeper': 'Goalkeeper',
    'Left Midfield': 'Midfielder',
    'Left Winger': 'Forward',
    'Left-Back': 'Defender',
    'Midfielder': 'Midfielder',
    'Right Midfield': 'Midfielder',
    'Right Winger': 'Forward',
    'Right-Back': 'Defender',
    'Second Striker': 'Forward'
}
df_tm_player_top5_2021['position_description_cleaned'] = df_tm_player_top5_2021['position_description'].map(dict_positions_tm)
Create new attributes for birth month and birth year.
df_tm_player_top5_2021['birth_year'] = pd.DatetimeIndex(df_tm_player_top5_2021['birth_date']).year
df_tm_player_top5_2021['birth_month'] = pd.DatetimeIndex(df_tm_player_top5_2021['birth_date']).month
We are interested in the following thirteen columns in the TransferMarkt dataset:
name
name_lower
firstinitial_lower
firstname_lower
lastname_lower
position_description
position_description_cleaned
value
birth_date
birth_year
birth_month
age
playing_country
# Select columns of interest
df_tm_player_top5_2021 = df_tm_player_top5_2021[['name', 'name_lower', 'firstinitial_lower', 'firstname_lower', 'lastname_lower', 'position_description', 'position_description_cleaned', 'value', 'birth_date', 'birth_year', 'birth_month', 'age', 'playing_country']]
# Assign df_tm_player_top5_2021 to a new DataFrame, df_tm_player_top5_all_2021, to represent all the players
df_tm_player_top5_all_2021 = df_tm_player_top5_2021
# Filter rows where position_description is not equal to 'Goalkeeper'
df_tm_player_top5_outfield_2021 = df_tm_player_top5_all_2021[df_tm_player_top5_all_2021['position_description'] != 'Goalkeeper']
# Filter rows where position_description is equal to 'Goalkeeper'
df_tm_player_top5_goalkeeper_2021 = df_tm_player_top5_all_2021[df_tm_player_top5_all_2021['position_description'] == 'Goalkeeper']
df_tm_player_top5_all_2021.head()
name | name_lower | firstinitial_lower | firstname_lower | lastname_lower | position_description | position_description_cleaned | value | birth_date | birth_year | birth_month | age | playing_country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Thibaut Courtois | thibaut courtois | t | thibaut | courtois | Goalkeeper | Goalkeeper | 54000000.0 | 1992-05-11 | 1992 | 5 | 28 | Belgium |
1 | Andriy Lunin | andriy lunin | a | andriy | lunin | Goalkeeper | Goalkeeper | 2430000.0 | 1999-02-11 | 1999 | 2 | 21 | Ukraine |
2 | Diego Altube | diego altube | d | diego | altube | Goalkeeper | Goalkeeper | 90000.0 | 2000-02-22 | 2000 | 2 | 20 | Spain |
3 | Raphaël Varane | raphael varane | r | raphael | varane | Centre-Back | Defender | 57600000.0 | 1993-04-25 | 1993 | 4 | 27 | France |
4 | Éder Militão | eder militao | e | eder | militao | Centre-Back | Defender | 32400000.0 | 1998-01-18 | 1998 | 1 | 22 | Brazil |
df_tm_player_top5_outfield_2021.head()
name | name_lower | firstinitial_lower | firstname_lower | lastname_lower | position_description | position_description_cleaned | value | birth_date | birth_year | birth_month | age | playing_country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | Raphaël Varane | raphael varane | r | raphael | varane | Centre-Back | Defender | 57600000.0 | 1993-04-25 | 1993 | 4 | 27 | France |
4 | Éder Militão | eder militao | e | eder | militao | Centre-Back | Defender | 32400000.0 | 1998-01-18 | 1998 | 1 | 22 | Brazil |
5 | Sergio Ramos | sergio ramos | s | sergio | ramos | Centre-Back | Defender | 13050000.0 | 1986-03-30 | 1986 | 3 | 34 | Spain |
6 | Nacho Fernández | nacho fernandez | n | nacho | fernandez | Centre-Back | Defender | 10800000.0 | 1990-01-18 | 1990 | 1 | 30 | Spain |
7 | Ferland Mendy | ferland mendy | f | ferland | mendy | Left-Back | Defender | 36000000.0 | 1995-06-08 | 1995 | 6 | 25 | France |
df_tm_player_top5_goalkeeper_2021.head()
name | name_lower | firstinitial_lower | firstname_lower | lastname_lower | position_description | position_description_cleaned | value | birth_date | birth_year | birth_month | age | playing_country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Thibaut Courtois | thibaut courtois | t | thibaut | courtois | Goalkeeper | Goalkeeper | 54000000.0 | 1992-05-11 | 1992 | 5 | 28 | Belgium |
1 | Andriy Lunin | andriy lunin | a | andriy | lunin | Goalkeeper | Goalkeeper | 2430000.0 | 1999-02-11 | 1999 | 2 | 21 | Ukraine |
2 | Diego Altube | diego altube | d | diego | altube | Goalkeeper | Goalkeeper | 90000.0 | 2000-02-22 | 2000 | 2 | 20 | Spain |
33 | Marc-André ter Stegen | marc-andre ter stegen | m | marc-andre | stegen | Goalkeeper | Goalkeeper | 64800000.0 | 1992-04-30 | 1992 | 4 | 28 | Germany |
34 | Neto | neto | n | neto | neto | Goalkeeper | Goalkeeper | 13050000.0 | 1989-07-19 | 1989 | 7 | 31 | Brazil |
Export the three engineered TransferMarkt DataFrames as CSV files.
# Datetime
import datetime
from datetime import date
import time
# Define today's date
today = datetime.datetime.now().strftime('%d/%m/%Y').replace('/', '')
# Export the three DataFrames
df_tm_player_top5_all_2021.to_csv(data_dir_tm + '/engineered/all/' + f'all_big5_2021_{today}.csv', index=None, header=True)
df_tm_player_top5_outfield_2021.to_csv(data_dir_tm + '/engineered/player/' + f'outfield_big5_2021_{today}.csv', index=None, header=True)
df_tm_player_top5_goalkeeper_2021.to_csv(data_dir_tm + '/engineered/goalkeeper/' + f'goalkeeper_big5_2021_{today}.csv', index=None, header=True)
Now that we have created three pandas DataFrames and wrangled the data to meet our needs, we'll next conduct an Exploratory Data Analysis.
This notebook scraped player valuation data from TransferMarkt, using Beautifulsoup for web scraping and pandas for data manipulation through DataFrames.
To conduct our analysis, we have used the following libraries and modules for the following tasks:
We have also demonstrated an array of techniques in Python using the following methods and functions:
*Visit my website EddWebster.com or my GitHub Repository for more projects. If you'd like to get in contact, my Twitter handle is @eddwebster and my email is: edd.j.webster@gmail.com.*