title: An Improved Analysis of NBA Foul Calls with Python
tags: NBA, Bayesian Statistics, PyMC3
Last April, I wrote a post that used Bayesian item-response theory models to analyze NBA foul call data. Last November, I spoke about a greatly improved version of these models at PyData NYC. This post is a write-up of the models from that talk.
Since late in the 2014-2015 season, the NBA has issued last two minute reports. These reports give the league's assessment of the correctness of foul calls and non-calls in the last two minutes of any game where the score difference was three or fewer points at any point in the last two minutes.
These reports are notably different from play-by-play logs, in that they include information on non-calls for notable on-court interactions. This non-call information presents a unique opportunity to study the factors that impact foul calls. There is a level of subjectivity inherent in the NBA's definition of notable on-court interactions, which we attempt to mitigate later using season-specific factors.
Russell Goldenberg of The Pudding has been scraping the PDFs that the NBA publishes and transforming them into a CSV for some time. I am grateful for his work, which has enabled this analysis.
We download the data locally to be kind to GitHub.
%matplotlib inline
import datetime
from itertools import product
import logging
import pickle
from matplotlib import pyplot as plt
from matplotlib.offsetbox import AnchoredText
from matplotlib.ticker import FuncFormatter, StrMethodFormatter
import numpy as np
import pandas as pd
import scipy as sp
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from theano import tensor as tt
pct_formatter = StrMethodFormatter('{x:.1%}')
sns.set()
blue, green, *_ = sns.color_palette()
plt.rc('figure', figsize=(8, 6))
LABELSIZE = 14
plt.rc('axes', labelsize=LABELSIZE)
plt.rc('axes', titlesize=LABELSIZE)
plt.rc('figure', titlesize=LABELSIZE)
plt.rc('legend', fontsize=LABELSIZE)
plt.rc('xtick', labelsize=LABELSIZE)
plt.rc('ytick', labelsize=LABELSIZE)
SEED = 207183 # from random.org, for reproducibility
# keep theano from complaining about compile locks for small models
(logging.getLogger('theano.gof.compilelock')
.setLevel(logging.CRITICAL))
%%bash
DATA_URI=https://raw.githubusercontent.com/polygraph-cool/last-two-minute-report/32f1c43dfa06c2e7652cc51ea65758007f2a1a01/output/all_games.csv
DATA_DEST=/tmp/all_games.csv
if [[ ! -e $DATA_DEST ]];
then
wget -q -O $DATA_DEST $DATA_URI
fi
We use only a subset of the columns in the source data set.
USECOLS = [
'period',
'seconds_left',
'call_type',
'committing_player',
'disadvantaged_player',
'review_decision',
'play_id',
'away',
'home',
'date',
'score_away',
'score_home',
'disadvantaged_team',
'committing_team'
]
orig_df = pd.read_csv(
'/tmp/all_games.csv',
usecols=USECOLS,
index_col='play_id',
parse_dates=['date']
)
The data set contains more than 16,000 plays.
orig_df.shape[0]
16300
Each row of the DataFrame represents a play and each column describes an attribute of the play:

- `period` is the period of the game,
- `seconds_left` is the number of seconds remaining in the game,
- `call_type` is the type of call,
- `committing_player` and `disadvantaged_player` are the names of the players involved in the play,
- `review_decision` is the opinion of the league reviewer on whether or not the play was called correctly:
  - `review_decision = "INC"` means the call was an incorrect non-call,
  - `review_decision = "CNC"` means the call was a correct non-call,
  - `review_decision = "IC"` means the call was an incorrect call, and
  - `review_decision = "CC"` means the call was a correct call,
- `away` and `home` are the abbreviations of the teams involved in the game,
- `date` is the date on which the game was played,
- `score_away` and `score_home` are the scores of the `away` and `home` team during the play, respectively, and
- `disadvantaged_team` and `committing_team` indicate how each team is involved in the play.

orig_df.head(n=2).T
play_id | 20150301CLEHOU-0 | 20150301CLEHOU-1 |
---|---|---|
period | Q4 | Q4 |
seconds_left | 112 | 103 |
call_type | Foul: Shooting | Foul: Shooting |
committing_player | Josh Smith | J.R. Smith |
disadvantaged_player | Kevin Love | James Harden |
review_decision | CNC | CC |
away | CLE | CLE |
home | HOU | HOU |
date | 2015-03-01 00:00:00 | 2015-03-01 00:00:00 |
score_away | 103 | 103 |
score_home | 105 | 105 |
disadvantaged_team | CLE | HOU |
committing_team | HOU | CLE |
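For reference, the four `review_decision` codes can be gathered into a small lookup table. This is a hypothetical helper included only for readability; it is not used in the analysis below.

# hypothetical lookup table for the review_decision codes
REVIEW_DECISIONS = {
    "CC": "correct call",
    "CNC": "correct non-call",
    "IC": "incorrect call",
    "INC": "incorrect non-call"
}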
In this post, we answer two questions:

1. How do non-skill factors (such as game situation and intentional fouls) affect foul calls?
2. Which players are the most skilled at committing and drawing fouls?
The previous post focused on the second question, and gave the first question only a cursory treatment. This post enhances our treatment of the first question, in order to control for non-skill factors influencing foul calls (namely intentional fouls). Controlling for these factors makes our estimates of player skill more realistic.
First we examine the types of calls present in the data set.
(orig_df['call_type']
.value_counts()
.head(n=15))
Foul: Personal                     4736
Foul: Shooting                     4201
Foul: Offensive                    2846
Foul: Loose Ball                   1316
Turnover: Traveling                 779
Instant Replay: Support Ruling      607
Foul: Defense 3 Second              277
Instant Replay: Overturn Ruling     191
Foul: Personal Take                 172
Turnover: 3 Second Violation        139
Turnover: 24 Second Violation       126
Turnover: 5 Second Inbound           99
Stoppage: Out-of-Bounds              96
Violation: Lane                      84
Foul: Away from Play                 82
Name: call_type, dtype: int64
The portion of `call_type` before the colon is the general category of the call. We count the occurrence of these categories below.
(orig_df['call_type']
.str.split(':', expand=True)
.iloc[:, 0]
.value_counts()
.plot(
kind='bar',
color=blue, logy=True,
title="Call types"
)
.set_ylabel("Frequency"));
We restrict our attention to foul calls, though other call types would be interesting to study in the future.
foul_df = orig_df[
orig_df['call_type']
.fillna("UNKNOWN")
.str.startswith("Foul")
]
We count the foul call types below.
(foul_df['call_type']
.str.split(': ', expand=True)
.iloc[:, 1]
.value_counts()
.plot(
kind='bar',
color=blue, logy=True,
title="Foul Types"
)
.set_ylabel("Frequency"));
We restrict our attention to the five foul types below, which generally involve two players. This subset of fouls allows us to pursue our second research question in the most direct manner.
FOULS = [
f"Foul: {foul_type}"
for foul_type in [
"Personal",
"Shooting",
"Offensive",
"Loose Ball",
"Away from Play"
]
]
There are a number of misspelled team names in the data, which we correct.
TEAM_MAP = {
"NKY": "NYK",
"COS": "BOS",
"SAT": "SAS",
"CHi": "CHI",
"LA)": "LAC",
"AT)": "ATL",
"ARL": "ATL"
}
def correct_team_name(col):
def _correct_team_name(df):
return df[col].apply(lambda team_name: TEAM_MAP.get(team_name, team_name))
return _correct_team_name
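Since `correct_team_name` returns a closure, it can be passed directly to `DataFrame.assign`, which calls it with the DataFrame. A quick check on a made-up toy frame:

# assign calls the closure with the DataFrame itself
demo_df = pd.DataFrame({'home': ["NKY", "CHi", "BOS"]})
demo_df.assign(home=correct_team_name('home'))
# the first two rows become "NYK" and "CHI"; "BOS" is unchanged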
We also convert each game date to an NBA season.
def date_to_season(date):
if date >= datetime.datetime(2017, 10, 17):
return '2017-2018'
elif date >= datetime.datetime(2016, 10, 25):
return '2016-2017'
elif date >= datetime.datetime(2015, 10, 27):
return '2015-2016'
else:
return '2014-2015'
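For example, a game played on Christmas 2016 falls in the 2016-2017 season:

date_to_season(datetime.datetime(2016, 12, 25))
# '2016-2017'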
We clean the data by

- restricting to plays that occurred in the fourth quarter,
- restricting to plays from the 2015-2016 and 2016-2017 seasons,
- imputing incorrect non-calls (`"INC"`) when `review_decision` is missing,
- correcting the misspelled team names,
- converting dates to seasons,
- restricting to the five foul types above, and
- dropping rows that still have missing data.

clean_df = (foul_df.where(lambda df: df['period'] == "Q4")
.where(lambda df: (df['date'].between(datetime.datetime(2016, 10, 25),
datetime.datetime(2017, 4, 12))
| df['date'].between(datetime.datetime(2015, 10, 27),
datetime.datetime(2016, 5, 30)))
)
.assign(
review_decision=lambda df: df['review_decision'].fillna("INC"),
committing_team=correct_team_name('committing_team'),
disadvantaged_team=correct_team_name('disadvantaged_team'),
away=correct_team_name('away'),
home=correct_team_name('home'),
season=lambda df: df['date'].apply(date_to_season)
)
.where(lambda df: df['call_type'].isin(FOULS))
.dropna()
.drop('period', axis=1)
.assign(call_type=lambda df: (df['call_type']
.str.split(': ', expand=True)
.iloc[:, 1])))
About 55% of the rows in the original data set remain.
clean_df.shape[0] / orig_df.shape[0]
0.5516564417177914
clean_df.head(n=2).T
play_id | 20151028INDTOR-1 | 20151028INDTOR-2 |
---|---|---|
seconds_left | 89 | 73 |
call_type | Shooting | Shooting |
committing_player | Ian Mahinmi | Bismack Biyombo |
disadvantaged_player | DeMar DeRozan | Paul George |
review_decision | CC | IC |
away | IND | IND |
home | TOR | TOR |
date | 2015-10-28 00:00:00 | 2015-10-28 00:00:00 |
score_away | 99 | 99 |
score_home | 106 | 106 |
disadvantaged_team | TOR | IND |
committing_team | IND | TOR |
season | 2015-2016 | 2015-2016 |
We use `scikit-learn`'s `LabelEncoder` to transform categorical features (call type, player, and season) to integers.
call_type_enc = LabelEncoder().fit(
clean_df['call_type']
)
n_call_type = call_type_enc.classes_.size
player_enc = LabelEncoder().fit(
np.concatenate((
clean_df['committing_player'],
clean_df['disadvantaged_player']
))
)
n_player = player_enc.classes_.size
season_enc = LabelEncoder().fit(
clean_df['season']
)
n_season = season_enc.classes_.size
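Each encoder can map between labels and integer indices in both directions. A quick round trip, using a player that appears in the cleaned data above:

# round trip through the player encoder
idx = player_enc.transform(["DeMar DeRozan"])
player_enc.inverse_transform(idx)
# array(['DeMar DeRozan'], dtype=object)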
We transform the data by

- rounding `seconds_left` to the nearest second (purely for convenience),
- adding a column `foul_called` equal to one or zero depending on whether or not a foul was called, and
- adding columns `score_committing` and `score_disadvantaged` containing the scores of the committing and disadvantaged teams, respectively.

df = (clean_df[['seconds_left']]
.round(0)
.assign(
call_type=call_type_enc.transform(clean_df['call_type']),
foul_called=1. * clean_df['review_decision'].isin(['CC', 'INC']),
player_committing=player_enc.transform(clean_df['committing_player']),
player_disadvantaged=player_enc.transform(clean_df['disadvantaged_player']),
score_committing=clean_df['score_home'].where(
clean_df['committing_team'] == clean_df['home'],
clean_df['score_away']
),
score_disadvantaged=clean_df['score_home'].where(
clean_df['disadvantaged_team'] == clean_df['home'],
clean_df['score_away']
),
season=season_enc.transform(clean_df['season'])
))
The resulting data is ready for analysis.
df.head(n=2).T
play_id | 20151028INDTOR-1 | 20151028INDTOR-2 |
---|---|---|
seconds_left | 89.0 | 73.0 |
call_type | 4.0 | 4.0 |
foul_called | 1.0 | 0.0 |
player_committing | 162.0 | 36.0 |
player_disadvantaged | 98.0 | 358.0 |
score_committing | 99.0 | 106.0 |
score_disadvantaged | 106.0 | 99.0 |
season | 0.0 | 0.0 |
We follow George Box's modeling workflow, as summarized by Dustin Tran:

1. Build a model of the science.
2. Infer the model given data.
3. Criticize the model given data.
def make_foul_rate_yaxis(ax, label="Observed foul call rate"):
ax.yaxis.set_major_formatter(pct_formatter)
ax.set_ylabel(label)
return ax
make_foul_rate_yaxis(
df.pivot_table('foul_called', 'season')
.rename(index=season_enc.inverse_transform)
.rename_axis("Season")
.plot(kind='bar', rot=0, legend=False)
);
There is a pronounced difference between the foul call rate in the 2015-2016 and 2016-2017 NBA seasons; our first model accounts for this difference.
We use `pymc3` to specify our models. Our first model is given by

$$
\begin{align*}
\beta^{\textrm{season}}_s
    & \sim N(0, 5^2) \\
p_s
    & = \operatorname{logit}^{-1}\left(\beta^{\textrm{season}}_s\right),
\end{align*}
$$

a logistic regression model with a different factor for each season.
import pymc3 as pm
with pm.Model() as base_model:
β_season = pm.Normal('β_season', 0., 5., shape=n_season)
p = pm.Deterministic('p', pm.math.sigmoid(β_season))
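The $N(0, 5^2)$ prior on the log-odds scale is quite diffuse on the probability scale. A throwaway check (not part of the model):

from scipy.special import expit

# one prior standard deviation in either direction nearly spans (0, 1)
expit([-5., 0., 5.])
# array([ 0.00669285,  0.5       ,  0.99330715])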
Foul calls are Bernoulli trials, $y_k \sim \textrm{Bernoulli}\left(p_{\textrm{season}(k)}\right).$
season = df['season'].values
with base_model:
y = pm.Bernoulli(
'y', p[season],
observed=df['foul_called']
)
We now sample from the model's posterior distribution.
NJOBS = 3
SAMPLE_KWARGS = {
'draws': 1000,
'njobs': NJOBS,
'random_seed': [
SEED + i for i in range(NJOBS)
],
'nuts_kwargs': {
'target_accept': 0.9
}
}
with base_model:
base_trace = pm.sample(**SAMPLE_KWARGS)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [β_season]
100%|██████████| 1500/1500 [00:07<00:00, 198.65it/s]
We rely on three diagnostics to ensure that our samples have converged to the posterior distribution:

- energy plots, which compare the marginal energy distribution to the energy transition distribution (significant differences indicate sampling problems),
- the Bayesian fraction of missing information (BFMI), which quantifies the mismatch between those two distributions, and
- the Gelman-Rubin statistic, which compares between-chain and within-chain variance (values near one indicate convergence).

For more information on energy plots and BFMI, consult Robust Statistical Workflow with PyStan.
bfmi = pm.bfmi(base_trace)
max_gr = max(
np.max(gr_stats) for gr_stats in pm.gelman_rubin(base_trace).values()
)
CONVERGENCE_TITLE = lambda: f"BFMI = {bfmi:.2f}\nGelman-Rubin = {max_gr:.3f}"
(pm.energyplot(base_trace, legend=False, figsize=(6, 4))
.set_title(CONVERGENCE_TITLE()));
base_trace['p']
array([[ 0.4052151 ,  0.30696232],
       [ 0.3937377 ,  0.30995026],
       [ 0.39881138,  0.29866616],
       ...,
       [ 0.40279887,  0.31166828],
       [ 0.4077945 ,  0.30299785],
       [ 0.40207901,  0.29991789]])
resid_df = (df.assign(p_hat=base_trace['p'][:, df['season']].mean(axis=0))
.assign(resid=lambda df: df['foul_called'] - df['p_hat']))
resid_df[['foul_called', 'p_hat', 'resid']].head()
foul_called | p_hat | resid | |
---|---|---|---|
play_id | |||
20151028INDTOR-1 | 1.0 | 0.403875 | 0.596125 |
20151028INDTOR-2 | 0.0 | 0.403875 | -0.403875 |
20151028INDTOR-3 | 1.0 | 0.403875 | 0.596125 |
20151028INDTOR-4 | 0.0 | 0.403875 | -0.403875 |
20151028INDTOR-6 | 0.0 | 0.403875 | -0.403875 |
The per-season residuals are quite small, which is to be expected.
(resid_df.pivot_table('resid', 'season')
.rename(index=season_enc.inverse_transform))
resid | |
---|---|
season | |
2015-2016 | -0.000162 |
2016-2017 | -0.000219 |
Anyone who has watched a close basketball game will realize that we have neglected an important factor in late game foul calls — intentional fouls. Near the end of the game, intentional fouls are used by the losing team when they are on defense to end the leading team's possession as quickly as possible.
The influence of intentional fouls is visible in the plot below as the rapid increase in the residuals as the number of seconds left in the game decreases.
def make_time_axes(ax,
xlabel="Seconds remaining in game",
ylabel="Observed foul call rate"):
ax.invert_xaxis()
ax.set_xlabel(xlabel)
return make_foul_rate_yaxis(ax, label=ylabel)
make_time_axes(
resid_df.pivot_table('resid', 'seconds_left')
.reset_index()
.plot('seconds_left', 'resid', kind='scatter'),
ylabel="Residual"
);
# indicator for whether the committing team is trailing on the scoreboard
df['trailing_committing'] = (df['score_committing']
                             .lt(df['score_disadvantaged'])
                             .mul(1.)
                             .astype(np.int64))
make_time_axes(
df.pivot_table(
'foul_called',
'seconds_left',
'trailing_committing'
)
.rolling(20)
.mean()
.rename(columns={
0: "No", 1: "Yes"
})
.rename_axis(
"Committing team is trailing",
axis=1
)
.plot()
);
Intentional fouls are only useful when the trailing (and committing) team is on defense. The plot below reflects this fact: shooting and personal fouls, which are almost always called against the defensive player, are called at a much higher rate than offensive fouls.
ax = (df.pivot_table('foul_called', 'call_type')
.rename(index=call_type_enc.inverse_transform)
.rename_axis("Call type", axis=0)
.plot(kind='barh', legend=False))
ax.xaxis.set_major_formatter(pct_formatter);
ax.set_xlabel("Observed foul call rate");
We continue to model the difference in foul call rates between seasons.
with pm.Model() as poss_model:
β_season = pm.Normal('β_season', 0., 5., shape=2)
Throughout this post, we will use hierarchical distributions to model the variation of foul call rates. For much more information on hierarchical models, consult Data Analysis Using Regression and Multilevel/Hierarchical Models. We use the priors
$$
\begin{align*}
\sigma_{\textrm{call}}
    & \sim \operatorname{HalfNormal}(5) \\
\beta^{\textrm{call}}_{c}
    & \sim \operatorname{Hierarchical-Normal}(0, \sigma_{\textrm{call}}^2).
\end{align*}
$$

For sampling efficiency, we use a [non-centered parametrization](http://twiecki.github.io/blog/2017/02/08/bayesian-hierchical-non-centered/#The-Funnel-of-Hell-(and-how-to-escape-it%29) of the hierarchical normal distribution.
def hierarchical_normal(name, shape, σ_shape=1):
Δ = pm.Normal(f'Δ_{name}', 0., 1., shape=shape)
σ = pm.HalfNormal(f'σ_{name}', 5., shape=σ_shape)
return pm.Deterministic(name, Δ * σ)
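For contrast, a centered parametrization of the same prior would place the Normal distribution on the coefficients directly. This sketch is mathematically equivalent but often harder for NUTS to sample:

def hierarchical_normal_centered(name, shape, σ_shape=1):
    # equivalent to hierarchical_normal above, but prone to the
    # "funnel" pathology during sampling
    σ = pm.HalfNormal(f'σ_{name}', 5., shape=σ_shape)
    return pm.Normal(name, 0., σ, shape=shape)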
Each call type has a different foul call rate.
with poss_model:
β_call = hierarchical_normal('β_call', n_call_type)
We add the score difference and the number of possessions by which the committing team is trailing to the DataFrame.
df['score_diff'] = (df['score_disadvantaged']
.sub(df['score_committing']))
df['trailing_poss'] = (df['score_diff']
.div(3)
.apply(np.ceil))
trailing_poss_enc = LabelEncoder().fit(df['trailing_poss'])
trailing_poss = trailing_poss_enc.transform(df['trailing_poss'])
n_trailing_poss = trailing_poss_enc.classes_.size
The plot below shows that the foul call rate (over time) varies based on the score difference (quantized into possessions) between the disadvantaged team and the committing team. We assume that at most three points can be scored in a single possession (while this is not quite correct, four-point plays are rare enough that we do not account for them in our analysis).
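A quick throwaway illustration of this quantization (not part of the analysis):

# point differences of 1-3 map to one possession, 4-6 to two, and so on
print({points: np.ceil(points / 3) for points in (1, 3, 4, 6, 7)})
# {1: 1.0, 3: 1.0, 4: 2.0, 6: 2.0, 7: 3.0}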
make_time_axes(
df.pivot_table(
'foul_called',
'seconds_left',
'trailing_poss'
)
.loc[:, 1:3]
.rolling(20).mean()
.rename_axis(
"Trailing possessions\n(committing team)",
axis=1
)
.plot()
);
The plot below reflects the fact that intentional fouls are disproportionately personal fouls; the rate at which personal fouls are called increases drastically as the game nears its end.
make_time_axes(
df.pivot_table('foul_called', 'seconds_left', 'call_type')
.rolling(20).mean()
.rename(columns=call_type_enc.inverse_transform)
.rename_axis(None, axis=1)
.plot()
);
Due to the NBA's shot clock, the natural timescale of a basketball game is possessions, not seconds, remaining.
# a possession can last at most 24 seconds (the shot clock), so we bin the
# seconds remaining into approximate possessions remaining
df['remaining_poss'] = (df['seconds_left']
                        .floordiv(25)
                        .add(1))
remaining_poss_enc = LabelEncoder().fit(df['remaining_poss'])
remaining_poss = remaining_poss_enc.transform(df['remaining_poss'])
n_remaining_poss = remaining_poss_enc.classes_.size
Below we plot the foul call rate across trailing possession/remaining possession pairs. Note that we always calculate trailing possessions (`trailing_poss`) from the perspective of the committing team. For instance, `trailing_poss = 1` indicates that the committing team is trailing by 1-3 points, whereas `trailing_poss = -1` indicates that the committing team is leading by 1-3 points.
ax = sns.heatmap(
df.pivot_table(