Evolution of NBA Player Salary Determinants Over the Last 30 Years

As a longtime NBA fan, I have been curious for a while about how NBA player contracts are determined and which factors matter most for the size of the contract a player gets. In the modern NBA in particular, 3-point shooting has become much more common, and most of the current highest-paid players are excellent 3-point shooters. For this project, I decided to look at how the determinants of a player's contract have evolved over the last 30 years, with a focus on 3-point shooting.

In [1]:
# import packages and set themes
! pip install chart-studio

import numpy as np
import pandas as pd
import qeds
import requests

import plotly as pt
import plotly.express as px
from chart_studio.plotly import plot, iplot as py
import plotly.graph_objects as go
from plotly.offline import iplot, init_notebook_mode

import seaborn as sns
import matplotlib.colors as mplc
import matplotlib.pyplot as plt

from sklearn import (
    linear_model, metrics, neural_network, pipeline, model_selection
)

%matplotlib inline
# activate plot theme
qeds.themes.mpl_style();
colors = qeds.themes.COLOR_CYCLE
Requirement already satisfied: chart-studio in /opt/conda/lib/python3.8/site-packages (1.1.0)
Requirement already satisfied: requests in /opt/conda/lib/python3.8/site-packages (from chart-studio) (2.25.0)
Requirement already satisfied: plotly in /opt/conda/lib/python3.8/site-packages (from chart-studio) (4.14.1)
Requirement already satisfied: retrying>=1.3.3 in /opt/conda/lib/python3.8/site-packages (from chart-studio) (1.3.3)
Requirement already satisfied: six in /opt/conda/lib/python3.8/site-packages (from chart-studio) (1.15.0)
Requirement already satisfied: certifi>=2017.4.17 in /opt/conda/lib/python3.8/site-packages (from requests->chart-studio) (2020.12.5)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /opt/conda/lib/python3.8/site-packages (from requests->chart-studio) (1.25.11)
Requirement already satisfied: idna<3,>=2.5 in /opt/conda/lib/python3.8/site-packages (from requests->chart-studio) (2.10)
Requirement already satisfied: chardet<4,>=3.0.2 in /opt/conda/lib/python3.8/site-packages (from requests->chart-studio) (3.0.4)

The first dataset contains season-by-season statistics, mostly per-game averages, for every player in each season.

In [2]:
df = pd.read_csv("nba_player_data.csv")

df
Out[2]:
seas_id season player_id player birth_year hof pos age experience lg ... ft_percent orb_per_game drb_per_game trb_per_game ast_per_game stl_per_game blk_per_game tov_per_game pf_per_game pts_per_game
0 28943 2021 4219 Aaron Gordon NaN False PF 25.0 7 NBA ... 0.646 1.5 4.5 6.0 3.7 0.7 0.8 2.2 1.9 13.6
1 28944 2021 4219 Aaron Gordon NaN False PF 25.0 7 NBA ... 0.629 1.6 5.1 6.6 4.2 0.6 0.8 2.7 2.0 14.6
2 28945 2021 4219 Aaron Gordon NaN False PF 25.0 7 NBA ... 0.727 1.3 3.3 4.5 2.5 0.8 0.6 1.1 1.6 11.5
3 28946 2021 4582 Aaron Holiday NaN False PG 24.0 3 NBA ... 0.783 0.2 1.1 1.2 1.6 0.6 0.2 0.9 1.6 7.3
4 28947 2021 4805 Aaron Nesmith NaN False SF 21.0 1 NBA ... 0.688 0.5 1.6 2.1 0.3 0.3 0.2 0.5 1.4 3.3
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
29607 200 1947 157 Walt Miller NaN False F 31.0 1 BAA ... 0.500 NaN NaN NaN 0.5 NaN NaN NaN 1.3 1.9
29608 201 1947 158 Warren Fenley NaN False F 24.0 1 BAA ... 0.511 NaN NaN NaN 0.5 NaN NaN NaN 1.8 2.6
29609 202 1947 159 Wilbert Kautz NaN False G-F 31.0 1 BAA ... 0.534 NaN NaN NaN 0.7 NaN NaN NaN 2.3 5.1
29610 203 1947 160 Woody Grimshaw NaN False G 27.0 1 BAA ... 0.477 NaN NaN NaN 0.0 NaN NaN NaN 1.2 2.9
29611 204 1947 161 Wyndol Gray NaN False G-F 24.0 1 BAA ... 0.581 NaN NaN NaN 0.9 NaN NaN NaN 1.9 6.4

29612 rows × 36 columns

This doesn't include any player salaries or salary cap information, so I will need to combine a few more datasets to add in that information.

In [3]:
salary_cap_df = pd.read_csv("nba_historical_salary_cap.csv")
salary_cap_df.head()
Out[3]:
Year Salary Cap Adjusted
0 1985 $3,600,000 $8,557,797
1 1986 $4,233,000 $9,873,144
2 1987 $4,945,000 $11,128,422
3 1988 $6,164,000 $13,325,270
4 1989 $7,232,000 $14,916,364
In [4]:
big_salary_data = pd.read_csv('player_salary_history.csv')
big_salary_data
Out[4]:
YearEnd Team Player Salary BelowMin Unnamed: 5
0 2017 Atlanta Hawks Dwight Howard 23,180,275 NaN 1
1 2017 Atlanta Hawks Paul Millsap 20,072,033 NaN 2
2 2017 Atlanta Hawks Kent Bazemore 15,730,338 NaN 0
3 2017 Atlanta Hawks Tiago Splitter 8,550,000 NaN 0
4 2017 Atlanta Hawks Kyle Korver 5,239,437 NaN 0
... ... ... ... ... ... ...
12710 1991 Washington Bullets Harvey Grant 475,000 NaN 0
12711 1991 Washington Bullets Byron Irvin 375,000 NaN 0
12712 1991 Washington Bullets A.J. English 275,000 NaN 0
12713 1991 Washington Bullets Greg Foster 275,000 NaN 0
12714 1991 Washington Bullets Haywoode Workman 120,000 NaN 0

12715 rows × 6 columns

In [5]:
salary2018 = pd.read_csv('salary2018.csv')
salary2019 = pd.read_csv('salary2019.csv')
salary2020 = pd.read_csv('salary2020.csv')

It looks like we have a lot of data cleaning and merging ahead of us to combine these datasets, so let's get started.

In [6]:
# Data cleaning #

clean_cap = salary_cap_df.drop('Adjusted', axis = 1)
clean_cap['Salary Cap'] = clean_cap['Salary Cap'].str.replace(',', '')
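# regex=False treats '$' as a literal character rather than the regex end-of-string anchor
clean_cap['Salary Cap'] = clean_cap['Salary Cap'].str.replace('$', '', regex=False)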
clean_cap['Year'] = clean_cap['Year'].str.replace("'", '')

clean_cap.dtypes
Out[6]:
Year          object
Salary Cap    object
dtype: object
In [7]:
clean_cap['Year'] = pd.to_numeric(clean_cap['Year'])
clean_cap['Salary Cap'] = pd.to_numeric(clean_cap['Salary Cap'])
clean_cap.head(10)
Out[7]:
Year Salary Cap
0 1985 3600000
1 1986 4233000
2 1987 4945000
3 1988 6164000
4 1989 7232000
5 1990 9802000
6 1991 11871000
7 1992 12500000
8 1993 14000000
9 1994 15175000
In [8]:
clean_bigsalary = big_salary_data.drop(['Team', 'BelowMin', 'Unnamed: 5'], axis=1)
cleaned = clean_bigsalary.rename(columns = {'YearEnd': 'Year', ' Salary ': 'Salary'})
cleaned['Salary'] = cleaned['Salary'].str.replace(',', '')
cleaned
Out[8]:
Year Player Salary
0 2017 Dwight Howard 23180275
1 2017 Paul Millsap 20072033
2 2017 Kent Bazemore 15730338
3 2017 Tiago Splitter 8550000
4 2017 Kyle Korver 5239437
... ... ... ...
12710 1991 Harvey Grant 475000
12711 1991 Byron Irvin 375000
12712 1991 A.J. English 275000
12713 1991 Greg Foster 275000
12714 1991 Haywoode Workman 120000

12715 rows × 3 columns

In [9]:
cleaned['Salary'] = cleaned['Salary'].str.replace('Unknown', '0')
cleaned['Salary'] = pd.to_numeric(cleaned['Salary'])
cleaned.dtypes
Out[9]:
Year       int64
Player    object
Salary     int64
dtype: object
In [10]:
clean_salary2018 = salary2018.drop('Unnamed: 0', axis=1)
clean_salary2018['Salary'] = clean_salary2018['Salary'].str.replace(',', '')
clean_salary2018['Salary'] = clean_salary2018['Salary'].str.replace('$', '', regex=False)
clean_salary2018_v2 = clean_salary2018.rename(columns = {'Season': 'Year'})
clean_salary2018_v2['Salary'] = pd.to_numeric(clean_salary2018_v2['Salary'])
clean_salary2018_v2
Out[10]:
Player Salary Year
0 Stephen Curry 34682550 2018
1 LeBron James 33285709 2018
2 Paul Millsap 30769231 2018
3 Gordon Hayward 29727900 2018
4 Blake Griffin 29512900 2018
... ... ... ...
581 Andre Ingram 46079 2018
582 Trey McKinney-Jones 46079 2018
583 Aaron Jackson 46079 2018
584 Jameel Warney 46079 2018
585 Marcus Thornton II 46079 2018

586 rows × 3 columns

In [11]:
clean_salary2019 = salary2019.drop('Unnamed: 0', axis=1)
clean_salary2019['Salary'] = clean_salary2019['Salary'].str.replace(',', '')
clean_salary2019['Salary'] = clean_salary2019['Salary'].str.replace('$', '', regex=False)
clean_salary2019['Salary'] = pd.to_numeric(clean_salary2019['Salary'])
In [12]:
clean_salary2020 = salary2020.drop('Unnamed: 0', axis=1)
clean_salary2020['Salary'] = clean_salary2020['Salary'].str.replace(',', '')
clean_salary2020['Salary'] = clean_salary2020['Salary'].str.replace('$', '', regex=False)
clean_salary2020['Salary'] = pd.to_numeric(clean_salary2020['Salary'])
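
The cleaning cells above repeat the same three steps (strip commas, strip dollar signs, convert to numbers). As a minimal sketch, those steps could be wrapped in one helper; the name `parse_currency` is hypothetical and not part of the notebook, and the two sample strings are taken from the raw tables shown earlier.

```python
import pandas as pd


def parse_currency(series: pd.Series) -> pd.Series:
    """Convert strings such as '$3,600,000' or '23,180,275' to numbers."""
    stripped = (
        series.astype(str)
        .str.replace(",", "", regex=False)  # drop thousands separators
        .str.replace("$", "", regex=False)  # drop the literal dollar sign
        .str.strip()                        # drop stray surrounding spaces
    )
    return pd.to_numeric(stripped)


# Sample values copied from the salary cap and salary history tables above
example = pd.Series(["$3,600,000", " 23,180,275 "])
parsed = parse_currency(example)
print(parsed.tolist())
```

A helper like this would be applied once per table, for example `parse_currency(salary2019['Salary'])`, assuming the column holds strings in one of these formats.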

Now that the four salary data frames and the salary cap table are cleaned and share the same column names, I can combine them.

In [13]:
# Combine all salary and salary cap dataframes

# Stack the yearly salary tables on top of the 1991-2017 salary history
merge1 = pd.concat([clean_salary2018_v2, cleaned], axis=0)
merge2 = pd.concat([clean_salary2019, merge1], axis=0)
merge3 = pd.concat([clean_salary2020, merge2], axis=0)

# Attach each season's salary cap (inner join on Year)
merge4 = pd.merge(clean_cap, merge3, on='Year')

print(merge4.dtypes)
merge4
Year           int64
Salary Cap     int64
Player        object
Salary         int64
dtype: object
Out[13]:
Year Salary Cap Player Salary
0 1991 11871000 Moses Malone 2406000
1 1991 11871000 Dominique Wilkins 2065000
2 1991 11871000 Jon Koncak 1550000
3 1991 11871000 Doc Rivers 895000
4 1991 11871000 Rumeal Robinson 800000
... ... ... ... ...
14384 2020 109140000 Jeremiah Martin 79568
14385 2020 109140000 Tremont Waters 79568
14386 2020 109140000 Tacko Fall 79568
14387 2020 109140000 Charlie Brown 79568
14388 2020 109140000 Malik Newman 65978

14389 rows × 4 columns
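
Because `pd.merge` defaults to an inner join, any salary row whose year is missing from the cap table is dropped without warning. A minimal sketch of how that could be checked with a left join and `indicator=True`; the 1991 and 2020 values come from the tables above, while the 2021 row is a made-up placeholder added only to show an unmatched case.

```python
import pandas as pd

cap = pd.DataFrame({"Year": [1991, 2020], "Salary Cap": [11871000, 109140000]})
salaries = pd.DataFrame({
    "Year": [1991, 2020, 2021],
    "Player": ["Moses Malone", "Tacko Fall", "Hypothetical Player"],  # last row is invented
    "Salary": [2406000, 79568, 1000000],
})

# Inner join: the 2021 row disappears because the cap table has no 2021 entry
inner = pd.merge(cap, salaries, on="Year")

# Left join with an indicator column shows which salary rows found no cap
checked = pd.merge(salaries, cap, on="Year", how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(len(inner), unmatched["Player"].tolist())
```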

For our purposes, most of the columns in the player statistics data won't be used in our calculations and can be dropped. Additionally, a lot of the data from 1947-1979 is fragmented and includes players from the ABA and BAA, and the salary data above begins in 1991, so I will limit the data to seasons from 1991 onward.

In [14]:
# Drop unused columns and combine player and salary data
clean_data = df.drop(['seas_id', 'hof', 'lg', 'pos', 'birth_year', 'gs', 'fg_per_game', 'fga_per_game', 'fg_percent'], axis=1)
clean_data2 = clean_data.drop(['x2p_per_game', 'x2pa_per_game', 'x2p_percent', 'x3p_percent', 'e_fg_percent', 'ft_per_game', 'fta_per_game', 'ft_percent', 'orb_per_game', 'drb_per_game'], axis=1)
capitalized = clean_data2.rename(columns = {'season' : 'Year', 'player' : 'Player'})
modern_data = capitalized.loc[capitalized["Year"] > 1990]

merge5 = pd.merge(modern_data, merge4, on=['Year', 'Player'])
merge5
Out[14]:
Year player_id Player age experience tm g mp_per_game x3p_per_game x3pa_per_game trb_per_game ast_per_game stl_per_game blk_per_game tov_per_game pf_per_game pts_per_game Salary Cap Salary
0 2020 4219 Aaron Gordon 24.0 6 ORL 62.0 32.5 1.2 3.8 7.7 3.7 0.8 0.6 1.6 2.0 14.4 109140000 19863636
1 2020 4582 Aaron Holiday 23.0 2 IND 66.0 24.5 1.3 3.3 2.4 3.4 0.8 0.2 1.3 1.8 9.5 109140000 2239200
2 2020 4463 Abdel Nader 26.0 3 OKC 55.0 15.8 0.9 2.3 1.8 0.7 0.4 0.4 0.8 1.4 6.3 109140000 1618520
3 2020 4687 Adam Mokoka 21.0 1 CHI 11.0 10.2 0.5 1.4 0.9 0.4 0.4 0.0 0.2 1.5 2.9 109140000 79568
4 2020 4688 Admiral Schofield 22.0 1 WAS 33.0 11.2 0.6 1.8 1.4 0.5 0.2 0.1 0.2 1.5 3.0 109140000 1000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
15902 1991 2558 Winston Bennett 25.0 2 CLE 27.0 12.4 0.0 0.0 2.4 1.0 0.3 0.1 0.7 1.9 4.3 11871000 525000
15903 1991 2401 Winston Garland 26.0 4 LAC 69.0 24.7 0.1 0.4 2.9 4.6 1.4 0.1 1.7 2.7 8.2 11871000 450000
15904 1991 2278 Xavier McDaniel 27.0 6 TOT 81.0 32.5 0.0 0.1 6.9 2.3 0.9 0.6 2.3 3.3 17.0 11871000 1400000
15905 1991 2278 Xavier McDaniel 27.0 6 SEA 15.0 35.3 0.0 0.2 5.4 2.5 1.7 0.3 2.7 3.3 21.8 11871000 1400000
15906 1991 2278 Xavier McDaniel 27.0 6 PHO 66.0 31.9 0.0 0.1 7.2 2.3 0.8 0.6 2.2 3.2 15.8 11871000 1400000

15907 rows × 19 columns

It looks like a few players have multiple entries for the same season because they changed teams partway through it. To avoid counting a traded player several times, I will combine those rows into a single entry per player per season. I will also add a column with the player's salary as a percentage of the team salary cap, and I will remove players who played 10 or fewer games or 5 or fewer minutes per game, as well as players whose salary is not between 2% and 60% of the cap; the lower bound is meant to screen out players who did not sign full-year contracts. This way, the data is not influenced by players who spent only a brief time in the NBA or who played a very short season due to injury.
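
As a hand check of the thresholds used in the next cell (more than 10 games, more than 5 minutes per game, salary between 2% and 60% of the cap), here is a minimal sketch on the five 2020 rows shown at the top of the merged table above. The frame name `sample` is just for illustration.

```python
import pandas as pd

# Five 2020 rows copied from the merged table above
sample = pd.DataFrame({
    "Player": ["Aaron Gordon", "Aaron Holiday", "Abdel Nader",
               "Adam Mokoka", "Admiral Schofield"],
    "g": [62.0, 66.0, 55.0, 11.0, 33.0],
    "mp_per_game": [32.5, 24.5, 15.8, 10.2, 11.2],
    "Salary": [19863636, 2239200, 1618520, 79568, 1000000],
    "Salary Cap": [109140000] * 5,
})

# Salary as a share of that season's cap
sample["Salary Percent"] = sample["Salary"] / sample["Salary Cap"]

# Same four conditions as the filtering cell, combined into one mask
keep = (
    (sample["g"] > 10)
    & (sample["mp_per_game"] > 5)
    & (sample["Salary Percent"] > 0.02)
    & (sample["Salary Percent"] < 0.6)
)
kept = sample.loc[keep, "Player"].tolist()
print(kept)
```

In this small sample all five players pass the games and minutes filters, so it is the 2% salary floor that removes the three lowest-paid players.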

In [15]:
# Combine rows of players who played for multiple teams in one season
# numeric_only=True skips the text columns ('Player', 'tm'), which cannot be averaged

grouped = merge5.groupby(['Year', 'player_id']).mean(numeric_only=True)
trade_df = grouped.reset_index()

# Add column with each player's salary as a share of the salary cap
new_df = trade_df.assign(**{'Salary Percent': trade_df['Salary'] / trade_df['Salary Cap']})

# Filter out unnecessary rows and rename columns
trade_df_games = new_df.loc[new_df['g'] > 10]
trade_df_mins = trade_df_games.loc[trade_df_games['mp_per_game'] > 5]
trade_df_adjusted = trade_df_mins[trade_df_mins['Salary Percent'] > 0.02]
clean_df1 = trade_df_adjusted[trade_df_adjusted['Salary Percent'] < 0.6]
clean_df = clean_df1.rename(columns = {'player_id': 'Player', 'age': 'Age', 'experience': 'Experience', 
                                       'g': 'Games', 'mp_per_game': 'Minutes','x3p_per_game' : '3P Makes', 
                                       'x3pa_per_game': '3P Attempts', 'trb_per_game': 'Rebounds', 
                                       'ast_per_game': 'Assists', 'stl_per_game': 'Steals', 
                                       'blk_per_game': 'Blocks', 'tov_per_game': 'Turnovers', 
                                       'pf_per_game': 'Fouls', 'pts_per_game': 'Points'})
# Check for non-number columns
clean_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8585 entries, 0 to 12312
Data columns (total 18 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Year            8585 non-null   int64  
 1   Player          8585 non-null   int64  
 2   Age             8585 non-null   float64
 3   Experience      8585 non-null   float64
 4   Games           8585 non-null   float64
 5   Minutes         8585 non-null   float64
 6   3P Makes        8585 non-null   float64
 7   3P Attempts     8585 non-null   float64
 8   Rebounds        8585 non-null   float64
 9   Assists         8585 non-null   float64
 10  Steals          8585 non-null   float64
 11  Blocks          8585 non-null   float64
 12  Turnovers       8585 non-null   float64
 13  Fouls           8585 non-null   float64
 14  Points          8585 non-null   float64
 15  Salary Cap      8585 non-null   float64
 16  Salary          8585 non-null   float64
 17  Salary Percent  8585 non-null   float64
dtypes: float64(16), int64(2)
memory usage: 1.2 MB
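
One caveat about the averaging step: in the merged table a traded player's season appears as a 'TOT' row (the season total) plus one row per team, so a plain mean blends the total with the partial stints. The sketch below uses the three Xavier McDaniel rows from 1991 shown earlier to compare the mean with one possible alternative, keeping the 'TOT' row when it exists. That alternative is an assumption of this sketch, not what the notebook does.

```python
import pandas as pd

# Xavier McDaniel's three 1991 rows, copied from the merged table above
rows = pd.DataFrame({
    "Year": [1991, 1991, 1991],
    "player_id": [2278, 2278, 2278],
    "tm": ["TOT", "SEA", "PHO"],
    "g": [81.0, 15.0, 66.0],
    "pts_per_game": [17.0, 21.8, 15.8],
})

# What the notebook's groupby-mean produces for this player
averaged = (
    rows.groupby(["Year", "player_id"]).mean(numeric_only=True).reset_index()
)

# Alternative: put any 'TOT' row first, then keep one row per player-season
season_rows = (
    rows.assign(is_tot=rows["tm"].eq("TOT"))
    .sort_values("is_tot", ascending=False, kind="mergesort")  # stable sort
    .drop_duplicates(["Year", "player_id"])
    .drop(columns="is_tot")
)
print(averaged[["g", "pts_per_game"]])
print(season_rows[["g", "pts_per_game"]])
```

For this player the mean reports 54 games, while the season-total row reports 81, so the choice affects the games filter as well as the per-game statistics.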

We can finally get to some visualizations. First, let's take a look at how average 3-point attempts and makes have changed over the past 30 years in the NBA.

In [16]:
# Generate yearly averages
seasonal = clean_df.groupby('Year').mean()
averages = seasonal.reset_index()
averages['3P Misses'] = averages['3P Attempts'] - averages['3P Makes']

# Create plot
fig = go.Figure([
        go.Scatter(x = averages['Year'], y = averages['3P Attempts'],line=dict(color='orange', width=2),mode='lines+markers', name = "3P Attempts"),
        go.Scatter(x = averages['Year'], y = averages['3P Misses'],line=dict(color='red', width=2),mode='lines+markers', name = "3P Misses"),
        go.Scatter(x = averages['Year'], y = averages['3P Makes'],line=dict(color='green', width=2),mode='lines+markers', name = "3P Makes")])
fig.update_layout(
    height = 750,
    title={
        'text': "Average 3-point attempts and makes of NBA players by season",
        'y':0.9,
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Year",
    yaxis_title="3P"
)
fig.show()
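
The gap between the attempts and makes lines can also be summarized as a single make rate per season. A minimal sketch, assuming a frame with the same '3P Makes' and '3P Attempts' columns as `averages`; the helper name `add_make_rate` is hypothetical and the two rows of numbers are made-up placeholders, not values from the data.

```python
import pandas as pd


def add_make_rate(averages: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with makes divided by attempts for each season."""
    out = averages.copy()
    # Ratio of the seasonal averages, not an average of player percentages
    out["3P Make Rate"] = out["3P Makes"] / out["3P Attempts"]
    return out


# Placeholder numbers for illustration only
toy = pd.DataFrame({
    "Year": [1991, 2020],
    "3P Makes": [0.5, 1.2],
    "3P Attempts": [1.5, 3.0],
})
result = add_make_rate(toy)
print(result)
```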