title: An Improved Analysis of NBA Foul Calls with Python
tags: NBA, Bayesian Statistics, PyMC3
Last April, I wrote a post that used Bayesian item-response theory models to analyze NBA foul call data. Last November, I spoke about a greatly improved version of these models at PyData NYC. This post is a write-up of the models from that talk.
Since late in the 2014-2015 season, the NBA has issued Last Two Minute Reports. These reports give the league's assessment of the correctness of foul calls and non-calls in the last two minutes of any game where the score difference was three or fewer points at any point in the last two minutes.
These reports are notably different from play-by-play logs, in that they include information on non-calls for notable on-court interactions. This non-call information presents a unique opportunity to study the factors that impact foul calls. There is a level of subjectivity inherent in the NBA's definition of notable on-court interactions, which we attempt to mitigate later using season-specific factors.
Russell Goldenberg of The Pudding has been scraping the PDFs that the NBA publishes and transforming them into a CSV for some time. I am grateful for his work, which has enabled this analysis.
We download the data locally to be kind to GitHub.
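A minimal sketch of the download step (the URL, local path, and read options are assumptions based on The Pudding's public repository layout):

```python
from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd

# Assumed location of Russell Goldenberg's scraped CSV; adjust if
# The Pudding's repository layout has changed.
DATA_URL = ('https://raw.githubusercontent.com/polygraph-cool/'
            'last-two-minute-report/master/output/all_games.csv')
DATA_PATH = Path('./data/all_games.csv')

# Download once and cache locally.
if not DATA_PATH.exists():
    DATA_PATH.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(DATA_URL, DATA_PATH)

orig_df = pd.read_csv(DATA_PATH, index_col='play_id', parse_dates=['date'])
```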
The data set contains more than 16,000 plays.
orig_df.head(n=2).T
play_id | 20150301CLEHOU-0 | 20150301CLEHOU-1 |
---|---|---|
period | Q4 | Q4 |
seconds_left | 112 | 103 |
call_type | Foul: Shooting | Foul: Shooting |
committing_player | Josh Smith | J.R. Smith |
disadvantaged_player | Kevin Love | James Harden |
review_decision | CNC | CC |
away | CLE | CLE |
home | HOU | HOU |
date | 2015-03-01 00:00:00 | 2015-03-01 00:00:00 |
score_away | 103 | 103 |
score_home | 105 | 105 |
disadvantaged_team | CLE | HOU |
committing_team | HOU | CLE |
In this post, we answer two questions:

1. How do game conditions impact foul calls?
2. Is committing and/or drawing fouls a measurable player skill?
The previous post focused on the second question, and gave the first question only a cursory treatment. This post enhances our treatment of the first question, in order to control for non-skill factors influencing foul calls (namely intentional fouls). Controlling for these factors makes our estimates of player skill more realistic.
First we examine the types of calls present in the data set.
(orig_df['call_type']
.str.split(':', expand=True)
.iloc[:, 0]
.value_counts()
.plot(
kind='bar',
color=blue, logy=True,
title="Call types"
)
.set_ylabel("Frequency"));
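The next plot breaks foul calls down by type. It relies on a foul_df restricted to foul plays; a minimal sketch, with the filtering logic an assumption:

```python
# Keep only plays whose call type is a foul, e.g. "Foul: Shooting".
foul_df = orig_df[
    orig_df['call_type']
        .fillna("")
        .str.startswith("Foul")
]
```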
(foul_df['call_type']
.str.split(': ', expand=True)
.iloc[:, 1]
.value_counts()
.plot(
kind='bar',
color=blue, logy=True,
title="Foul Types"
)
.set_ylabel("Frequency"));
There are a number of misspelled team names in the data, which we correct.
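A sketch of the kind of correction involved (the specific typos shown are hypothetical examples); the subsequent cleaning and integer encoding of teams, players, seasons, and call types, which produces the df shown below, is elided here:

```python
# Hypothetical examples of misspelled team abbreviations; the real
# mapping is built by inspecting the unique values in the data.
TEAM_MAP = {
    'NKY': 'NYK',
    'COS': 'BOS',
    'SAT': 'SAS',
}

orig_df['away'] = orig_df['away'].replace(TEAM_MAP)
orig_df['home'] = orig_df['home'].replace(TEAM_MAP)
```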
df.head(n=2).T
play_id | 20151028INDTOR-1 | 20151028INDTOR-2 |
---|---|---|
seconds_left | 89.0 | 73.0 |
call_type | 4.0 | 4.0 |
foul_called | 1.0 | 0.0 |
player_committing | 162.0 | 36.0 |
player_disadvantaged | 98.0 | 358.0 |
score_committing | 99.0 | 106.0 |
score_disadvantaged | 106.0 | 99.0 |
season | 0.0 | 0.0 |
We follow George Box's modeling workflow, as summarized by Dustin Tran:

1. Build a model of the science.
2. Infer the model's parameters from data.
3. Criticize the model's fit and, if necessary, return to step one.
def make_foul_rate_yaxis(ax, label="Observed foul call rate"):
ax.yaxis.set_major_formatter(pct_formatter)
ax.set_ylabel(label)
return ax
make_foul_rate_yaxis(
df.pivot_table('foul_called', 'season')
.rename(index=season_enc.inverse_transform)
.rename_axis("Season")
.plot(kind='bar', rot=0, legend=False)
);
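The base model gives each season its own foul call rate: a normal prior on a per-season log-odds intercept, passed through the sigmoid function. The PyMC3 code below transcribes this directly.

$$
\begin{align*}
\beta^{\textrm{season}}_s & \sim N(0, 5^2) \\
p_s & = \operatorname{sigmoid}\left(\beta^{\textrm{season}}_s\right) \\
y_k & \sim \operatorname{Bernoulli}\left(p_{s(k)}\right)
\end{align*}
$$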
import pymc3 as pm
with pm.Model() as base_model:
β_season = pm.Normal('β_season', 0., 5., shape=n_season)
p = pm.Deterministic('p', pm.math.sigmoid(β_season))
with base_model:
y = pm.Bernoulli(
'y', p[season],
observed=df['foul_called']
)
We now sample from the model's posterior distribution.
with base_model:
base_trace = pm.sample(**SAMPLE_KWARGS)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [β_season]
100%|██████████| 1500/1500 [00:07<00:00, 198.65it/s]
We rely on three diagnostics to ensure that our samples have converged to the posterior distribution:

- energy plots and the Bayesian fraction of missing information (BFMI),
- the Gelman-Rubin statistic, $\hat{R}$, and
- effective sample size.
For more information on energy plots and BFMI consult Robust Statistical Workflow with PyStan.
(pm.energyplot(base_trace, legend=False, figsize=(6, 4))
.set_title(CONVERGENCE_TITLE()));
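A quick check of the other two diagnostics, using PyMC3's built-in helpers (the 1.1 threshold and the preference for at least a few hundred effective samples are conventional rules of thumb, not from the original talk):

```python
import numpy as np

# Gelman-Rubin statistics near one indicate the chains have mixed well;
# values above roughly 1.1 signal convergence problems.
max_rhat = max(np.max(rhat) for rhat in pm.gelman_rubin(base_trace).values())
print(f"Maximum Gelman-Rubin statistic: {max_rhat:.4f}")

# Effective sample sizes should be large enough for stable estimates.
min_n_eff = min(np.min(n) for n in pm.effective_n(base_trace).values())
print(f"Minimum effective sample size: {min_n_eff:.0f}")
```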
resid_df = (df.assign(p_hat=base_trace['p'][:, df['season']].mean(axis=0))
.assign(resid=lambda df: df['foul_called'] - df['p_hat']))
resid_df[['foul_called', 'p_hat', 'resid']].head()
play_id | foul_called | p_hat | resid
---|---|---|---
20151028INDTOR-1 | 1.0 | 0.403875 | 0.596125
20151028INDTOR-2 | 0.0 | 0.403875 | -0.403875
20151028INDTOR-3 | 1.0 | 0.403875 | 0.596125
20151028INDTOR-4 | 0.0 | 0.403875 | -0.403875
20151028INDTOR-6 | 0.0 | 0.403875 | -0.403875
(resid_df.pivot_table('resid', 'season')
.rename(index=season_enc.inverse_transform))
season | resid
---|---
2015-2016 | -0.000162
2016-2017 | -0.000219
Anyone who has watched a close basketball game will realize that we have neglected an important factor in late-game foul calls: intentional fouls. Near the end of the game, the trailing team, when on defense, fouls intentionally in order to end the leading team's possession as quickly as possible.
The influence of intentional fouls appears in the plot below as a rapid increase in the residuals as the number of seconds left in the game decreases.
def make_time_axes(ax,
xlabel="Seconds remaining in game",
ylabel="Observed foul call rate"):
ax.invert_xaxis()
ax.set_xlabel(xlabel)
return make_foul_rate_yaxis(ax, label=ylabel)
make_time_axes(
resid_df.pivot_table('resid', 'seconds_left')
.reset_index()
.plot('seconds_left', 'resid', kind='scatter'),
ylabel="Residual"
);
make_time_axes(
df.pivot_table(
'foul_called',
'seconds_left',
'trailing_committing'
)
.rolling(20)
.mean()
.rename(columns={
0: "No", 1: "Yes"
})
.rename_axis(
"Committing team is trailing",
axis=1
)
.plot()
);
ax = (df.pivot_table('foul_called', 'call_type')
.rename(index=call_type_enc.inverse_transform)
.rename_axis("Call type", axis=0)
.plot(kind='barh', legend=False))
ax.xaxis.set_major_formatter(pct_formatter);
ax.set_xlabel("Observed foul call rate");
Each call type has a different foul call rate.
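The possession model, poss_model, extends the base model; its construction is elided in this excerpt. A sketch follows, including the hierarchical_normal helper, a non-centered parameterization consistent with the Δ_* and σ_* variables that appear in the sampling logs below:

```python
def hierarchical_normal(name, shape, σ_shape=1):
    # Non-centered parameterization: a standard normal offset scaled by
    # a half-normal group-level standard deviation.
    Δ = pm.Normal(f'Δ_{name}', 0., 1., shape=shape)
    σ = pm.HalfNormal(f'σ_{name}', 5., shape=σ_shape)
    return pm.Deterministic(name, Δ * σ)

poss_model = pm.Model()

with poss_model:
    # Per-season intercepts, as in the base model.
    β_season = pm.Normal('β_season', 0., 5., shape=n_season)
```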
with poss_model:
β_call = hierarchical_normal('β_call', n_call_type)
make_time_axes(
df.pivot_table(
'foul_called',
'seconds_left',
'trailing_poss'
)
.loc[:, 1:3]
.rolling(20).mean()
.rename_axis(
"Trailing possessions\n(committing team)",
axis=1
)
.plot()
);
make_time_axes(
df.pivot_table('foul_called', 'seconds_left', 'call_type')
.rolling(20).mean()
.rename(columns=call_type_enc.inverse_transform)
.rename_axis(None, axis=1)
.plot()
);
Due to the NBA's shot clock, the natural timescale of a basketball game is possessions remaining, not seconds remaining.
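A sketch of how the possession columns already used above might be derived (the 24-second shot clock and the three-points-per-possession convention are assumptions; the talk's exact encoding may differ):

```python
import numpy as np

# Possessions remaining, assuming each possession can consume a full
# 24-second shot clock.
df['remaining_poss'] = df['seconds_left'].floordiv(24).add(1)

# Possessions by which the committing team trails, at up to three points
# per possession; negative values mean the committing team leads.
df['trailing_poss'] = np.ceil(
    (df['score_disadvantaged'] - df['score_committing']) / 3
).astype(np.int64)
```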
ax = sns.heatmap(
df.pivot_table(
'foul_called',
'trailing_poss',
'remaining_poss'
)
.rename_axis(
"Trailing possessions\n(committing team)",
axis=0
)
.rename_axis("Remaining possessions", axis=1),
cmap='seismic',
cbar_kws={'format': pct_formatter}
)
ax.invert_yaxis();
ax.set_title("Observed foul call rate");
def plot_foul_diff_heatmap(*_, data=None, **kwargs):
ax = plt.gca()
sns.heatmap(
data.pivot_table(
'diff',
'trailing_poss',
'remaining_poss'
),
cmap='seismic', robust=True,
cbar_kws={'format': pct_formatter}
)
ax.invert_yaxis()
ax.set_title("Observed foul call rate")
(sns.FacetGrid(diff_df, col='call_type', col_wrap=3, aspect=1.5)
.map_dataframe(plot_foul_diff_heatmap)
.set_axis_labels(
"Remaining possessions",
"Trailing possessions\n(committing team)"
)
.set_titles("{col_name}"));
with poss_model:
β_poss = hierarchical_normal(
'β_poss',
(n_trailing_poss, n_remaining_poss, n_call_type),
σ_shape=(1, 1, n_call_type)
)
The foul call rate is a combination of season, call type, and possession factors.
with poss_model:
η_game = β_season[season] \
+ β_call[call_type] \
+ β_poss[
trailing_poss,
remaining_poss,
call_type
]
Again, we sample from the model's posterior distribution.
with poss_model:
poss_trace = pm.sample(**SAMPLE_KWARGS)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [σ_β_poss_log__, Δ_β_poss, σ_β_call_log__, Δ_β_call, β_season]
100%|██████████| 1500/1500 [07:56<00:00, 3.15it/s]
There were 5 divergences after tuning. Increase `target_accept` or reparameterize.
There were 10 divergences after tuning. Increase `target_accept` or reparameterize.
There were 9 divergences after tuning. Increase `target_accept` or reparameterize.
The number of effective samples is smaller than 25% for some parameters.
Again, we calculate residuals.
ax = sns.heatmap(
resid_df.pivot_table(
'resid',
'trailing_poss',
'remaining_poss'
)
.rename_axis(
"Trailing possessions\n(committing team)",
axis=0
)
.rename_axis(
"Remaining possessions",
axis=1
)
.loc[-3:3],
cmap='seismic',
cbar_kws={'format': pct_formatter}
)
ax.invert_yaxis();
ax.set_title("Observed foul call rate");
ax = (resid_df.groupby(bins[bin_ix])
.resid.mean()
.rename_axis('p_hat', axis=0)
.reset_index()
.plot('p_hat', 'resid', kind='scatter'))
ax.xaxis.set_major_formatter(pct_formatter);
ax.set_xlabel(r"Binned $\hat{p}$");
make_foul_rate_yaxis(ax, label="Residual");
ax = (resid_df.groupby('seconds_left')
.resid.mean()
.reset_index()
.plot('seconds_left', 'resid', kind='scatter'))
make_time_axes(ax, ylabel="Residual");
Now that we have two models, we can engage in model selection. We use the widely applicable information criterion (WAIC), also known as the Watanabe-Akaike information criterion, to compare the models.
comp_df = (pm.compare(
(base_trace, poss_trace),
(base_model, poss_model)
)
.rename(index=MODEL_NAME_MAP)
.loc[MODEL_NAME_MAP.values()])
comp_df
Model | WAIC | pWAIC | dWAIC | weight | SE | dSE | var_warn
---|---|---|---|---|---|---|---
Base | 11610.1 | 2.11 | 1541.98 | 0 | 56.9 | 73.43 | 0
Possession | 10068.1 | 82.93 | 0 | 1 | 88.05 | 0 | 0
fig, ax = plt.subplots()
ax.errorbar(
np.arange(len(MODEL_NAME_MAP)),
comp_df.WAIC,
yerr=comp_df.SE, fmt='o'
);
ax.set_xticks(np.arange(len(MODEL_NAME_MAP)));
ax.set_xticklabels(comp_df.index);
ax.set_xlabel("Model");
ax.set_ylabel("WAIC");
We now turn to the question of whether committing and/or drawing fouls is a measurable skill. We use an item-response theory (IRT) model to study this question. For more information on Bayesian item-response models, consult Fox's *Bayesian Item Response Modeling: Theory and Applications* or Bafumi et al.'s *Practical Issues in Implementing and Understanding Bayesian Ideal Point Estimation*.
The item-response theory model includes the season, call type, and possession terms of the previous models.
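The construction of irt_model is elided in this excerpt; a sketch of that shared game-level structure, reusing the same priors as the possession model:

```python
with pm.Model() as irt_model:
    # Game-level terms, identical in form to the possession model.
    β_season = pm.Normal('β_season', 0., 5., shape=n_season)
    β_call = hierarchical_normal('β_call', n_call_type)
    β_poss = hierarchical_normal(
        'β_poss',
        (n_trailing_poss, n_remaining_poss, n_call_type),
        σ_shape=(1, 1, n_call_type)
    )

    η_game = (β_season[season]
              + β_call[call_type]
              + β_poss[trailing_poss, remaining_poss, call_type])
```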
Each disadvantaged player has an ideal point (per season).
$$
\begin{align*}
\sigma_{\theta} & \sim \operatorname{HalfNormal}(5) \\
\theta^{\textrm{player}}_{i, s} & \sim \operatorname{Hierarchical-Normal}(0, \sigma_{\theta}^2)
\end{align*}
$$

with irt_model:
θ_player = hierarchical_normal(
'θ_player', (n_player, n_season)
)
θ = θ_player[player_disadvantaged, season]
Each committing player has an ideal point (per season).
player_committing = df['player_committing'].values
with irt_model:
b_player = hierarchical_normal(
'b_player', (n_player, n_season)
)
b = b_player[player_committing, season]
Players affect the foul call rate through the difference in their ideal points.
with irt_model:
η_player = θ - b
The sum of the game and player effects determines the foul call probability.
with irt_model:
η = η_game + η_player
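As in the earlier models, the log-odds are mapped to a probability with the sigmoid function, and the observed calls are Bernoulli distributed; a sketch mirroring the base model's likelihood:

```python
with irt_model:
    # Foul call probability for each play.
    p = pm.Deterministic('p', pm.math.sigmoid(η))

    # Observed foul calls/non-calls.
    y = pm.Bernoulli('y', p, observed=df['foul_called'])
```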
Again, we sample from the model's posterior distribution.
with irt_model:
irt_trace = pm.sample(**SAMPLE_KWARGS)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [σ_b_player_log__, Δ_b_player, σ_θ_player_log__, Δ_θ_player, σ_β_poss_log__, Δ_β_poss, σ_β_call_log__, Δ_β_call, β_season]
100%|██████████| 1500/1500 [13:55<00:00, 1.80it/s]
There were 3 divergences after tuning. Increase `target_accept` or reparameterize.
There were 1 divergences after tuning. Increase `target_accept` or reparameterize.
There were 4 divergences after tuning. Increase `target_accept` or reparameterize.
The estimated number of effective samples is smaller than 200 for some parameters.
None of the sampling diagnostics indicate problems with convergence.
ax = (resid_df.groupby(bins[bin_ix])
.resid.mean()
.rename_axis('p_hat', axis=0)
.reset_index()
.plot('p_hat', 'resid', kind='scatter'))
ax.xaxis.set_major_formatter(pct_formatter);
ax.set_xlabel(r"Binned $\hat{p}$");
make_foul_rate_yaxis(ax, label="Residual");
The IRT model is a marginal improvement over the possession model in terms of WAIC.
fig, ax = plt.subplots()
ax.errorbar(
np.arange(len(MODEL_NAME_MAP)), comp_df.WAIC,
yerr=comp_df.SE, fmt='o'
);
ax.set_xticks(np.arange(len(MODEL_NAME_MAP)));
ax.set_xticklabels(comp_df.index);
ax.set_xlabel("Model");
ax.set_ylabel("WAIC");
We now produce two DataFrames containing the estimated player ideal points per season.
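One plausible construction of the per-player table (player_enc, a label encoder over player names analogous to season_enc, is an assumption; top_bot_irt_df, used below, would collect the ten highest- and lowest-rated players for each parameter):

```python
import pandas as pd

# Posterior means of each player's per-season disadvantaged (θ) and
# committing (b) parameters.
player_irt_df = pd.concat(
    {
        'θ': pd.DataFrame(
            irt_trace['θ_player'].mean(axis=0),
            index=player_enc.classes_,
            columns=season_enc.classes_
        ),
        'b': pd.DataFrame(
            irt_trace['b_player'].mean(axis=0),
            index=player_enc.classes_,
            columns=season_enc.classes_
        )
    },
    axis=1
)
```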
fig, ax = plot_latent_params(
top_bot_irt_df[top_bot_irt_df['param'] == 'θ']
.sort_values('mean')
)
ax.set_xlabel(r"$\hat{\theta}$");
ax.set_title("Top and bottom ten");
fig, ax = plot_latent_params(
top_bot_irt_df[top_bot_irt_df['param'] == 'b']
.sort_values('mean')
)
ax.set_xlabel(r"$\hat{b}$");
ax.set_title("Top and bottom ten");
style_grid(
sns.PairGrid(
player_irt_df.loc[player_all_season],
size=1.75
)
.map_upper(plt.scatter, alpha=0.5)
.map_diag(plt.hist)
.map_lower(plot_corr)
);
Since we can only reasonably estimate the skills of players for which we have sufficient foul call data, we plot the correlations below for players that appeared in at least ten plays in each season.
As expected, the season-over-season latent skill correlations are higher for this subset of players.
From this figure, it seems that committing skill ($\hat{b}$) exists and is measurable, but is fairly small. It also seems that disadvantaged skill ($\hat{\theta}$), if it exists, is difficult to measure from this data set. Ideally, the NBA would release a foul report for the entirety of every game, but that seems quite unlikely.
In the future it would be useful to include a correction for the probability that a given game appears in the data set. This correction would help with the sample bias introduced by the fact that only games that are close in the last two minutes are included in the NBA's reports.
This post is available as a Jupyter notebook here.