#!/usr/bin/env python # coding: utf-8 # # Comparing NBA and Euroleague Basketball # # Part 2: Data Cleaning # ## Introduction # Basketball is one of the most popular sports in the world and it's easy to understand why: it involves a lot of scoring and its fast-paced tempo, paired with a rather straightforward set of rules, makes it accessible to a casual audience and exciting to watch. # # When talking about basketball, most people think about the NBA, the top league in the USA, which can count on some of the best players in the world. However, it is important to acknowledge that basketball is played professionally all around the world. Especially in Europe, this sport has deep solid roots and can count on a [long tradition](https://www.sports-fitness.co.uk/blog/growth-basketball-europe). In fact, many believe the level of competition there is comparable to the one of the NBA and, in [exhibition games](https://en.wikipedia.org/wiki/NBA_versus_EuroLeague_games#2010s) held during the off season, it's not unusual to see a European team beat an NBA squad. # # The most important European basketball league is (very intuitively) called Euroleague. Unlike the NBA, where all franchises (except one) are American, teams from many different European countries participate in the Euroleague. Moreover, compared to the same 30 teams that are part of the NBA, only 16 to 18 different clubs compete in the Euroleague within a semi-open system; this means that, next to a slate of teams which return every year thanks to a special license, there are a few open slots assigned on merit according to the standings of each country's national league. Besides their format, the NBA and Euroleague also differ in some of their rules, from the size of the court to the actual in-game calls. The reader can refer to [this page](https://www.fiba.basketball/rule-differences) for an overview of these rule differences. # --- # In this notebook we clean the data we scraped in the first part of the project. The goal is to aggregate and uniformize the NBA and EL tables; in doing this, we will have to take into account the different playing time of games in the NBA and in the EL (this is one of the rules inconsistencies we mentioned above). At the end, we will have two DataFrames: one consisting of the players' data for both leagues and the other consisting of the teams' data for both leagues. # ## Cleaning Players' Data # ### A First Look at the Data # Let's first work with the DataFrames containing the players' stats. # In[1]: # Allow to run all code in a cell from IPython.core.interactiveshell import InteractiveShell InteractiveShell.ast_node_interactivity = "all" # In[2]: import pandas as pd # Read the CSV files containing the NBA and EL players' stats respectively NBA_players = pd.read_csv("final_DataFrames/NBA/NBA_player_stats.csv") EL_players = pd.read_csv("final_DataFrames/EL/EL_player_stats.csv") # In[3]: # Allow display of all columns of a DataFrame pd.set_option("display.max_columns", None) # Take a look at both tables NBA_players.head() print("The shape of the NBA DataFrame is: {}".format(NBA_players.shape)) EL_players.head() print("The shape of the NBA DataFrame is: {}".format(EL_players.shape)) # Since the NBA has more teams (hence more players) than the EL, the NBA table contains more data points. The first two columns of both DataFrames are meaningless so we can drop them. # In[4]: # Drop first two columns of each DataFrame NBA_players = NBA_players.iloc[:, 2:] EL_players = EL_players.iloc[:, 2:] # In[5]: # Take a look at the new tables NBA_players.head() print("The shape of the NBA DataFrame is: {}".format(NBA_players.shape)) EL_players.head() print("The shape of the NBA DataFrame is: {}".format(EL_players.shape)) # ### Uniformizing Columns' Names # Among the remaining columns, a few of them are rather self-explanatory. In particular, the first column contains the players' names and the last two specify to what league and year each row belongs to. Perhaps, for those with little sports knowledge, it is harder to grasp the abbreviations used for the players' stats. Moreover, some of these stats only appear in the NBA DataFrame (which has seven extra columns) or they appear in both DataFrames but under different names. Therefore, for sake of clarity, let's explicitly go over the meaning of each column. We will list them as they are in the NBA table and then we will specify if they can also be found in the EL table (and under what name). # # `Pos`: player's playing position (it does not appear in the EL table)
# `Tm`: player's team (it appears in the EL table as `Team`)
# `G`: number of games played (it appears in the EL table as `GP`)
# `GS`: number of games started (it does not appear in the EL table)
# `MP`: minutes played (it appears in the EL table as `MPG`)
# `FG`, `FGA`, `FG%`: field goals made, attempted and field goals shooting percentage respectively (all these columns appear in the EL table but `FGM` is used rather than `FG`)
# `3P`, `3PA`,`3P%`: same as above but with 3-point shots instead
# `2P`, `2PA`,`2P%`: same as above but with 2-point shots instead (these columns do not appear in the EL table)
# `FT`, `FTA`,`FT%`: same as above but with free throw shots instead
# `eFG%`: effective field goal percentage (check [here](https://en.wikipedia.org/wiki/Effective_field_goal_percentage) for more on this stat; it does not appear in the EL table)
# `ORB`, `DRB`,`TRB`: offensive, defensive and total rebounds respectively (these columns appear in the EL table but `RPG` is used rather than `TRB`)
# `AST`: assists (it appears in the EL table as `APG`)
# `STL`: steals (it appears in the EL table as `SPG`)
# `BLK`: blockshots (it appears in the EL table as `BPG`)
# `TOV`: turnovers (it appears in the EL table with the same name)
# `PF`: personal fouls (it appears in the EL table with the same name)
# `PTS`: points (it appears in the EL table as `PPG`) # It is important to mention that all the stats above are averages per game. # # We will be able to obtain some of the information missing in the EL table using the data we have available. Yet, the `GS` column (representing the number of games started by a player) doesn't add much to our analysis so we get rid of it. # In[6]: # Drop the `GS` column NBA_players.drop(columns = "GS", inplace = True) # The first step to make the data more uniform is giving the columns the same names. This is what we do next using a combination of acronyms from both tables. # In[7]: # Uniformize the NBA_players columns' names NBA_rename_dict = {"Tm" : "Team", "G" : "GP", "FG" : "FGM", "3P" : "3PM", "2P" : "2PM", "FT" : "FTM"} NBA_players = NBA_players.rename(columns = NBA_rename_dict) # In[8]: # Uniformize the EL_players columns' names EL_rename_dict = {"MPG" : "MP", "RPG" : "TRB", "APG" : "AST", "SPG" : "STL", "BPG" : "BLK", "PPG" : "PTS"} EL_players = EL_players.rename(columns = EL_rename_dict) # In[9]: # Take a look at the new data NBA_players.head() EL_players.head() # ### Uniformizing Teams' Names # Looking at the windows above, we notice that the teams' names are formatted in different ways depending on the table: the NBA DataFrame contains the teams' three letters nicknames while for the EL we have the teams' full names. As we did above, we want to uniformize this data. We do this by converting the NBA teams' nicknames with the help of a supporting table. # In[10]: # Read the CSV file containing the NBA teams' names and nicknames NBA_nicknames = pd.read_csv("auxiliary_DataFrames/NBA_teams_nicknames.csv") # In[11]: # Take a look at the data NBA_nicknames # In[12]: # Turn the DataFrame above into a dictionary with keys the team nicknames and value the team names NBA_nicknames_dict = NBA_nicknames.set_index("Nickname")["Team"].to_dict() # Rename the NBA teams in the players DataFrame NBA_players["Team"] = NBA_players["Team"].replace(NBA_nicknames_dict) # In[13]: # Look at the result NBA_players.head() # For the EL teams, we just replace the hyphens with spaces. # In[14]: # Replace hyphens in EL teams' names EL_players["Team"] = EL_players["Team"].str.replace("-", " ") # In[15]: # Take a look at the final result EL_players.head() # ### Dealing with Duplicates # It is important to realize that not all of the data we have available is relevant for the goal we have in mind. In fact, when we start to analyze and visualize the data in Part 3 of the project, we want to make sure to exclude marginal players who are not good representative of their league. One way to separate these marginal players is to set a minimum threshold involving games played and average minutes per game. # # Before being able to implement this strategy, however, there is a different issue we need to take care of: looking at the window of NBA players data above, we notice that some of the players appear more than once even within the same year. Let's expose these duplicates. # In[16]: # Check NBA duplicates NBA_players[NBA_players.duplicated(["Player", "Year"], keep = False)].head(6) # It seems like the duplicates correspond to players who played for multiple teams within the same season. In particular, for these players we have separate rows for each one of the teams they played for, plus an extra row containing their total stats for the season. We recognize the latter rows because they all show the acronym `TOT` in the `Team` column. # For these players, we will keep their total stats and we will consider their team the one they played the most games for during the season. # In[17]: # Create a mask to get the players who played for multiple teams within the same season mask = NBA_players.duplicated(["Player", "Year"], keep = False) # For each of these players look for the team they played most games for and assign the value to the Team column NBA_players.loc[mask, "Team"] = NBA_players[mask].apply(lambda x : NBA_players[(NBA_players["Player"] == x["Player"]) & (NBA_players["Year"] == x["Year"])].sort_values( ["GP", "MP"], ascending = False).iloc[1]["Team"], axis = 1).copy() # In[18]: # Drop the duplicates keeping only the first observation (the total stats row always apperead first) NBA_players = NBA_players.drop_duplicates(["Player", "Year"], ignore_index = True) # In[19]: # Make sure everything went right NBA_players[NBA_players.duplicated(["Player", "Year"], keep = False)] NBA_players[NBA_players["Team"] == "TOT"] # In[20]: # Take a look at the new data NBA_players.head() # Let's now check how the same source of duplicates is treated for the EL players. # In[21]: # Look for EL duplicates EL_duplicates = EL_players[EL_players.duplicated(["Player", "Year"], keep = False)] EL_duplicates # This time we don't have any extra row whith the players' total stats for the season; we'll have to create this rows ourselves. # In[22]: def clean_duplicates(df): """ Calculates the total stats of a player who played for multiple teams within the same season. The player's team will become the team he played most games for. :param df: the DataFrame containing the player's stats from each team he played for :type df: DataFrame :return: a Series containing the player's total average stats and updated team """ # Initialize a total stats row total_stats = df.iloc[0].copy() # Update the Team column main_team = df.sort_values(["GP", "MP"], ascending = False).iloc[0]["Team"] total_stats["Team"] = main_team # Update the GP column total_games = df["GP"].sum() total_stats["GP"] = total_games # Calculate the player's cumulative stats for the season cumulative_stats = df.iloc[:, 3:-2].multiply(df["GP"], axis = "index").sum() # Calculate the player's average stats for the season average_stats = cumulative_stats / total_games # Update the stats columns (rounding to 3 decimal digits) total_stats.iloc[3:-2] = round(average_stats, 3) return total_stats # In[23]: # Apply the function to the duplicates DataFrame duplicates_tot_stats = EL_duplicates.groupby(["Player", "Year"]).apply(clean_duplicates) # Take a look at the result duplicates_tot_stats # Now that we have created these rows with the total stats, we can drop the duplicates in the EL DataFrame and replace them wth the clean data. To keep things consistent, we also make sure to round all the stats columns which do not refer to shooting percentages to 1 decimal digit. # In[24]: # Gather the columns to round stats_cols = EL_players.columns[3:-2] cols_to_round = [col for col in stats_cols if col not in ["FG%", "3P%", "FT%"]] # Round the stats columns duplicates_tot_stats[cols_to_round] = duplicates_tot_stats[cols_to_round].apply(lambda x : round(x, 1)) # In[25]: # Drop the duplicates for the EL players to_drop = EL_duplicates.index EL_players = EL_players.drop(index = to_drop) # In[26]: # Add the total stats for the players we just dropped EL_players = EL_players.append(duplicates_tot_stats, ignore_index = True) # In[27]: # Make sure everything went right EL_players[EL_players.duplicated(["Player", "Year"], keep = False)] EL_players.tail(7) # ### Filtering Out Irrelevant Data # We can now act on our plan of getting rid of marginal players with little to none playing time. It is tricky to establish what's the optimal treshold to separate these marginal players. After some thought, however, we decide to only keep players who have played at least 10 games and who have averaged more than 10 minutes per game. # In[28]: # Keep only relevant data NBA_players = NBA_players[(NBA_players["GP"] >=10) & (NBA_players["MP"] > 10)].reset_index(drop = True) EL_players = EL_players[(EL_players["GP"] >=10) & (EL_players["MP"] > 10)].reset_index(drop = True) # In[29]: # Check how much data is left print("The shape of the remaining NBA data is: {}".format(NBA_players.shape)) print("The shape of the remaining EL data is: {}".format(EL_players.shape)) # We still have a good amount of data left. # ### Adding EL Players' Shooting Columns # As we have already observed, the NBA DataFrame has a few more columns than the EL one. Among these columns, four of them refer to specific shooting stats (`2P`, `2PA`, `2P%`, `eFG%`). Luckily, we can obtain this same information for the EL players using the data we have available. # In[30]: # Create the missing shooting columns for EL players matching the location in the NBA DataFrame EL_players.insert(10, "2PM", EL_players["FGM"] - EL_players["3PM"]) EL_players.insert(11, "2PA", EL_players["FGA"] - EL_players["3PA"]) EL_players.insert(12, "2P%", EL_players["2PM"] / EL_players["2PA"]) EL_players.insert(13, "eFG%", (EL_players["FGM"] + 0.5*EL_players["3PM"]) / EL_players["FGA"]) # In[31]: # Take a look at the result EL_players # In[32]: # Round the new shooting percentages to 3 decimal places EL_players[["2P%", "eFG%"]] = EL_players[["2P%", "eFG%"]].round(3) # In[33]: # Take a look at the two DataFrames NBA_players.head(3) print("The shape of the NBA DataFrame is: {}".format(NBA_players.shape)) EL_players.head(3) print("The shape of the EL DataFrame is: {}".format(EL_players.shape)) # ### Adding Players' Biodata # The remaining extra columns in the NBA DataFrame are `Pos` and `Age`. The corresponding information for the EL players can be found in the rosters tables. # In[34]: # Read the CSV files containing the EL rosters EL_rosters = pd.read_csv("final_DataFrames/EL/EL_rosters.csv") EL_rosters.info() EL_rosters.head() # In[35]: # Read the CSV files containing the NBA rosters NBA_rosters = pd.read_csv("final_DataFrames/NBA/NBA_rosters.csv") NBA_rosters.info() NBA_rosters.head() # Besides the playing positions and ages of the EL players, we also have other interesting biographical data. In particular, we can use the players' height, weight and nationality. Before we add this information to the players' DataFrames, however, there are a couple of things we need to take care of. First of all let's uniformize the names of the columns we will keep. # In[36]: # Rename the relevant columns in the NBA rosters table (the EL roster table is fine) NBA_rosters.rename(columns = {"Ht": "Height", "Wt": "Weight", "Unnamed: 6" : "Nationality"}, inplace = True) # The players' heights in both rosters' tables are saved as strings; in fact, these heights are in American format and therefore an hyphen is used to separate feet and inches. Nervertheless, the `Age` and `Weight` columns of the EL rosters DataFrame are also of unexpectedly not numeric. Let's try to understand why that's the case. # In[37]: # Inspect unique values of `Weight` and `Age` columns for EL table EL_rosters["Age"].sort_values().unique() EL_rosters["Weight"].sort_values().unique() # It seems like these columns have some `-` values which force all the other numbers to be stored as strings. These hyphens most likely denote missing values. Since these values might belong to irrelevant players who don't appear in the `EL_players` DataFrame anymore, we first do the merge operation to add the biodata and, in case it is necessary, we deal with any surviving missing value afterwards. # In[38]: # Add players' biodata to the `NBA_players` DataFrame NBA_biodata_cols = ["Player", "Height", "Weight", "Nationality", "Year"] NBA_players = NBA_players.merge(NBA_rosters[NBA_biodata_cols], how = "inner", on = ["Player", "Year"]) # In[39]: # Rearrange columns so that biodata comes first cols = list(NBA_players.columns) new_order_cols = [cols[0]]+cols[-3:]+cols[1:-3] NBA_players = NBA_players[new_order_cols] # In[40]: # Add players' biodata to the `EL_players` DataFrame EL_biodata_cols = ["Player", "Pos", "Height", "Weight", "Age", "Nationality", "Year"] EL_players = EL_players.merge(EL_rosters[EL_biodata_cols], how = "inner", on = ["Player", "Year"]) # In[41]: # Uniformize the order of the columns EL_players = EL_players[NBA_players.columns] # In[42]: # Take a look at the final result NBA_players.head() EL_players.head() # We can now check if any of the `-` missing values survived the merge operation for the EL players table. # In[43]: # Check for surviving missing values for col in ["Weight", "Age"]: EL_players[EL_players[col] == "-"].shape[0] # All good: we can then go ahead and make the `Weight` and `Age` columns numeric. # In[44]: # Change Dtype of `Weight` and `Age` columns EL_players[["Weight", "Age"]] = EL_players[["Weight", "Age"]].astype("int64") # To make heights comparisons easier, let's also convert the `Height` column of both DataFrames to a numeric type by changing the units: from feet-inches to cm. # In[45]: # Set up the correct conversion factors fts_to_cm_factor = 30.48 inchs_to_cm_factor = 2.54 # Convert the `Height` column unit to cm for df in [EL_players, NBA_players]: df["Height"] = df["Height"].apply(lambda x : round((int(x.split("-")[0]) * fts_to_cm_factor + int(x.split("-")[1]) * inchs_to_cm_factor))).copy() # Finally, in order to be consistent with the European metric system, let's also convert the weights unit: from pounds to kilos. # In[46]: # Set up the correct conversion factors lbs_to_kg_factor = 0.453 # Convert the `Weight` column unit to kg for df in [NBA_players, EL_players]: df["Weight"] = df["Weight"].apply(lambda x : round(x * lbs_to_kg_factor)) # In[47]: # Take a look at the final result NBA_players.head(3) EL_players.head(3) # ### Treating Missing Values # Let's see if the players' DataFrames have any missing values. We start with the NBA data. # In[48]: # Count null values in each column NBA_players.isnull().sum() # The only null values appear in the `3P%` and `FT%` columns. Let's take a closer look. # In[49]: # Inspect NBA null values NBA_players[NBA_players["3P%"].isnull()].head(3) NBA_players[NBA_players["FT%"].isnull()].head(3) # It looks like null values in both the `3P%` and `FT%` columns are assigned when a player hasn't attempted any 3-point shot or any free throw respectively. This makes sense as the shooting percentage calculation for these players involves a division by zero. Let's explicitly confirm this observation. # In[50]: # Further inspect null values NBA_players.loc[NBA_players["3P%"].isnull(), "3PA"].sum() NBA_players.loc[NBA_players["FT%"].isnull(), "FTA"].sum() # Yes, we were right. Depending on the role, it is not unusual for a basketball player not to shoot any 3-point shot or free throw; hence, dropping these rows would be too extreme. Instead, we replace the null values with zeros. In the future, in case we want to analyze the distribution of these columns, we will make sure to first filter out for players with at least one attempt. # In[51]: # Replace null values with 0 NBA_players[["3P%", "FT%"]] = NBA_players[["3P%", "FT%"]].fillna(0) # In[52]: # Make sure there's no null value left NBA_players.isnull().sum() # Let's now look for any null values in the EL table. # In[53]: # Count null values in each column EL_players.isnull().sum() # No missing values at all, that's great. However, considering what we have observed in the previous section, we should also make sure there's no missing value marked with an hyphen. # In[54]: # Check for `-` missing values null_values_dict = dict() for col in EL_players.columns: null_values_dict[col] = EL_players[EL_players[col] == "-"].shape[0] null_values_dict # Finally, let's check how the shooting percentages for players with no shot attempts are treated in the EL DataFrame. # In[55]: # Check EL players with no shot attempts EL_players[EL_players["FGA"] == 0].head(3) EL_players[EL_players["3PA"] == 0].head(3) EL_players[EL_players["2PA"] == 0].head(3) EL_players[EL_players["FTA"] == 0].head(3) # Regardless of the shot type, we observe 0 values for the shooting percentages. This is consistent with how we handled null values in the NBA table so we can keep going. # In[56]: # Quick look at the current status of both DataFrames NBA_players.head(3) EL_players.head(3) # ### Uniformzing the Geographical Data # Looking at both windows above, we notice that the `Nationality` column has different formats across the two DataFrames: in the NBA table nationalities are reported using the countries' [ISO 3166-1 alpha 2 code](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2) while the EL table contains the countries' full names. # # But that's not the only issue: inspecting the EL table more carefully, we notice that in some case players appear to have multiple nationalities. # In[57]: # Check EL nationality values. Notice how some values include multiple countries. EL_players["Nationality"].unique() # We can see how these double nationalities appear repeatadly and come in many different combinations. Seemingly, they are used to describe players with multiple international passports. Leaving this data as it is would make our analysis too fine and hard to read. Therefore, for players with multiple nationalities, we decide to only keep the first country that appears. # In[58]: import re # Use a regex to insert a comma after the first country in case of double nationality EL_players["Nationality"] = EL_players["Nationality"].apply(lambda x : re.sub(r"(([A-Za-z]+\s*[A-Za-z]+)+(?=[A-Z]))", r"\1,", x)) # Split double nationalities and only keep the first one EL_players["Nationality"] = EL_players["Nationality"].apply(lambda x : x.split(",")[0]) # In[59]: # Take a look at the newly formatted nations EL_players["Nationality"].unique() # We still need to convert the two letter codes that identify nationalities in the NBA table to the countries full names. To help with this task, we use a very handy DataFrame we found on [Kaggle](https://www.kaggle.com/statchaitya/country-to-continent). # In[60]: # We read the CSV file containing the countries codes and names countries = pd.read_csv("auxiliary_DataFrames/country_codes.csv", encoding = "ISO-8859-1") # In[61]: # Take a look at the data countries # Besides the countries' codes and names, we also have the continent each country belongs to. This information can be helpful so it's worth keeping. All the other columns are not important so we get rid of them. # In[62]: # Drop the columns we don't need countries = countries.iloc[:, [0, 1, 5]] # Rename the column with countries' codes countries = countries.rename(columns = {"code_2" : "code"}) # We can now use this DataFrame to properly format the countries' names in the NBA table. # In[63]: # Create a dictionary to replace the countries' codes codes_dict = countries.set_index("code")["country"].to_dict() # Replace the countries' codes with the countries' names NBA_players["Nationality"] = NBA_players["Nationality"].str.upper().replace(codes_dict) # We also add a `Continent` column to both DataFrames. # In[64]: # Create a dictionary to assign a continent to each country continent_dict = countries.set_index("country")["continent"].to_dict() # Create a `Continent` column in both DataFrames for df in [NBA_players, EL_players]: continent_column = df["Nationality"].map(continent_dict) df.insert(4, "Continent", continent_column) # In[65]: # Make sure everything went right NBA_players["Continent"].value_counts(dropna = False) EL_players["Continent"].value_counts(dropna = False) # In[66]: # Final look at both DataFrames NBA_players.head(3) EL_players.head(3) # ### Normalizing the NBA and EL players' Data # We have succesfully cleaned and uniformized the tables containing the players' information for both leagues. Before combining the two DataFrames, however, we need to take into account one last crucial detail: in the NBA games last 48 minutes while in the EL only 40. Therefore, in order to be able to compare the data, we first need to normalize the stats from the two leagues. # # There are a few standard ways of normalizing basketball stats in order to allow comparisons among players. A popular one suggests to consider [per-minute ratings](https://www.nbastuffer.com/analytics101/per-minute-ratings/#:~:text=Per%2Dminute%20ratings%20are%20calculated,he'd%20play%2036%20minutes.). This method is easy to understand but it has its own limitations since it tends to inflate the stats of players playing small number of minutes. However, we have already filtered out these irrelevant players hence we go ahead and normalize our stats using the per 40-minutes system (we chose 40 minutes because that's how long games in the EL last). # In[67]: # Collect the columns to normalize cols_to_normalize = ["FGM", "FGA", "3PM", "3PA", "2PM", "2PA", "FTM", "FTA", "ORB", "DRB", "TRB", "AST", "STL", "BLK","TOV", "PF", "PTS"] # Normalize the players stats to per 40-minutes NBA_players[cols_to_normalize] = NBA_players[cols_to_normalize].divide(NBA_players["MP"], axis = "index").multiply(40) EL_players[cols_to_normalize] = EL_players[cols_to_normalize].divide(EL_players["MP"], axis = "index").multiply(40) # Round the normalized stats to 1 decimal digit NBA_players[cols_to_normalize] = NBA_players[cols_to_normalize].apply(lambda x : round(x, 1)) EL_players[cols_to_normalize] = EL_players[cols_to_normalize].apply(lambda x : round(x, 1)) # In[68]: # Recalculate the `eFG%` column NBA_players["eFG%"] = (NBA_players["FGM"] + 0.5*NBA_players["3PM"]) / NBA_players["FGA"] EL_players["eFG%"] = (EL_players["FGM"] + 0.5*EL_players["3PM"]) / EL_players["FGA"] # Round this column to 3 decimal digits NBA_players["eFG%"] = round(NBA_players["eFG%"], 3) EL_players["eFG%"] = round(EL_players["eFG%"], 3) # In[69]: # Take a look at the result NBA_players.head(3) EL_players.head(3) # ### Combining the NBA and EL Players' DataFrames # We are now finally ready to combine the two tables and create a single DataFrame containing the information for players from both the NBA and the EL. # In[70]: players = pd.concat([NBA_players, EL_players], ignore_index = True) # In[71]: # Look at the final result players.info() players # In[72]: # Change "United States" to "USA" for ease of read players = players.replace("United States", "USA") # In[73]: # Take a new look players # In[74]: # Save the DataFrame to a CSV file players.to_csv("final_DataFrames/player_stats_clean.csv") # ## Cleaning Teams' Data # ### A First Look at the Data # Let's now take care of the DataFrames containing the teams' data. # In[75]: # Read the CSV files containing the NBA and EL teams' stats respectively NBA_teams = pd.read_csv("final_DataFrames/NBA/NBA_team_stats.csv") EL_teams = pd.read_csv("final_DataFrames/EL/EL_team_stats.csv") # In[76]: # Take a look at the data NBA_teams.head() EL_teams.head() # The columns of both DataFrames are analogous to the ones that appear in the corresponding players' tables. Hence, we can just repeat the same cleaning operations using the same variables we used before. This time, however, we will not keep the `MP` column since this statistic applied to teams is not really meaningful. # In[77]: # Rename columns the same way we did for the players DataFrames NBA_teams.rename(columns = NBA_rename_dict, inplace = True) EL_teams.rename(columns = EL_rename_dict, inplace = True) # In[78]: # Create the missing shooting columns in EL_teams matching the location in NBA_teams EL_teams.insert(11, "2PM", EL_teams["FGM"] - EL_teams["3PM"]) EL_teams.insert(12, "2PA", EL_teams["FGA"] - EL_teams["3PA"]) EL_teams.insert(13, "2P%", EL_teams["2PM"] / EL_teams["2PA"]) # In[79]: # Round the new percentage column we added to 3 decimal digits EL_teams["2P%"] = round(EL_teams["2P%"], 3) # In[80]: # Rearrange columns matching the stats columns in the players DataFrames (except MP and "eFG%") cols = ["Team", "GP"] + list(players.columns)[10:] cols.remove("eFG%") NBA_teams = NBA_teams[cols] EL_teams = EL_teams[cols] # In[81]: # Take a look at the result NBA_teams.head(3) EL_teams.head(3) # Lastly, we uniformize the teams' names to match the ones in the `players` DataFrame. In particular, we remove the asterisks appearing next to some of the NBA teams' names and we also get rid of any eventual hyphen in the EL teams' names. # In[82]: # Remove * from NBA team names NBA_teams["Team"] = NBA_teams["Team"].str.replace("*", "", regex = False) # Remove - from EL team names EL_teams["Team"] = EL_teams["Team"].str.replace("-", " ", regex = False) # ### Adding Teams' Winning Percentages # The last piece of information we want to add to the teams DataFrames is the winning percentages. This information is contained in two other tables we scraped from the web. # In[83]: # Read the CSV files containing the NBA and EL teams' records respectively NBA_records = pd.read_csv("final_DataFrames/NBA/NBA_team_records.csv") EL_records = pd.read_csv("final_DataFrames/EL/EL_team_records.csv") # In[84]: # Have a look at the data NBA_records.head(3) EL_records.head(3) # Once again we need to make sure the teams' names match the ones in the other tables. Moreover, we also uniformize the name of the column containing the winning percentages. # In[85]: # Remove the asterisk from the NBA team names NBA_records["Team"] = NBA_records["Team"].str.replace("*", "", regex = False) # Remove the hyphen from the EL team names EL_records["Team"] = EL_records["Team"].str.replace("-", " ", regex = False) # In[86]: # Uniformize the name of the column containing the teams' winning percentages EL_records = EL_records.rename(columns = {"PCT" : "W/L%"}) # Now we are ready to merge the data. # In[87]: # Add the winning percentages to the teams tables NBA_teams_clean = NBA_teams.merge(NBA_records[["Team", "W/L%", "Year"]], on = ["Team", "Year"], how = "left") EL_teams_clean = EL_teams.merge(EL_records[["Team", "W/L%", "Year"]], on = ["Team", "Year"], how = "left") # In[88]: # Look at the final result NBA_teams_clean.head(3) EL_teams_clean.head(3) # ### Normalizing the NBA and EL Teams' Data # Before combining the two DataFrames into a single table, we need to normalize the stats using the same per 40-minutes system we adopted for the players' data. Since 40 minutes is also the duration of a game in the EL, we only need to operate on the NBA table. # In[89]: # Collect the columns to normalize stat_cols = list(NBA_teams.columns)[2:-2] cols_to_normalize = [col for col in stat_cols if col not in ["FG%", "3P%", "2P%", "FT%"]] # In[90]: # Initialize the normalizing factor NBA_to_EL_norm_factor = 40/48 # Normalize NBA team stats to per 40-minutes NBA_teams_clean[cols_to_normalize] = NBA_teams_clean[cols_to_normalize] * NBA_to_EL_norm_factor # Round the normalized stats to 1 decimal digit NBA_teams_clean[cols_to_normalize] = NBA_teams_clean[cols_to_normalize].apply(lambda x : round(x, 1)) # In[91]: # Take a look at the result NBA_teams_clean.head(3) # ### Combining the NBA and EL Teams' DataFrames # Now we can finally create a single DataFrame containing the information for all teams from both leagues. # In[92]: # Combine the two teams DataFrame into a single table teams = pd.concat([NBA_teams_clean, EL_teams_clean], ignore_index = True) # In[93]: # Look at the final result teams.info() teams # In[94]: # Save the DataFrame to a CSV file teams.to_csv("final_DataFrames/team_stats_clean.csv")