title: An Improved Analysis of NBA Foul Calls with Python
tags: NBA, Bayesian Statistics, PyMC3
Last April, I wrote a post that used Bayesian item-response theory models to analyze NBA foul call data. Last November, I spoke about a greatly improved version of these models at PyData NYC. This post is a write-up of the models from that talk.
Since late in the 2014-2015 season, the NBA has issued Last Two Minute Reports. These reports give the league's assessment of the correctness of foul calls and non-calls in the last two minutes of any game where the score difference was three or fewer points at any point in the last two minutes.
These reports are notably different from play-by-play logs, in that they include information on non-calls for notable on-court interactions. This non-call information presents a unique opportunity to study the factors that impact foul calls. There is a level of subjectivity inherent in the NBA's definition of notable on-court interactions, which we attempt to mitigate later using season-specific factors.
Russell Goldenberg of The Pudding has been scraping the PDFs that the NBA publishes and transforming them into a CSV for some time. I am grateful for his work, which has enabled this analysis.
We download the data locally to be kind to GitHub.
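A minimal sketch of the download step (the URL, local path, and read options are assumptions based on The Pudding's public repository layout):

```python
from pathlib import Path
from urllib.request import urlretrieve

import pandas as pd

# Assumed location of Russell Goldenberg's scraped CSV; adjust if
# The Pudding's repository layout has changed.
DATA_URL = ('https://raw.githubusercontent.com/polygraph-cool/'
            'last-two-minute-report/master/output/all_games.csv')
DATA_PATH = Path('./data/all_games.csv')

# Download once and cache locally.
if not DATA_PATH.exists():
    DATA_PATH.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(DATA_URL, DATA_PATH)

orig_df = pd.read_csv(DATA_PATH, index_col='play_id', parse_dates=['date'])
```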
The data set contains more than 16,000 plays.
orig_df.head(n=2).T
play_id | 20150301CLEHOU-0 | 20150301CLEHOU-1 |
---|---|---|
period | Q4 | Q4 |
seconds_left | 112 | 103 |
call_type | Foul: Shooting | Foul: Shooting |
committing_player | Josh Smith | J.R. Smith |
disadvantaged_player | Kevin Love | James Harden |
review_decision | CNC | CC |
away | CLE | CLE |
home | HOU | HOU |
date | 2015-03-01 00:00:00 | 2015-03-01 00:00:00 |
score_away | 103 | 103 |
score_home | 105 | 105 |
disadvantaged_team | CLE | HOU |
committing_team | HOU | CLE |
In this post, we answer two questions:

1. How do game conditions impact foul calls?
2. Is committing and/or drawing fouls a measurable player skill?
The previous post focused on the second question, and gave the first question only a cursory treatment. This post enhances our treatment of the first question, in order to control for non-skill factors influencing foul calls (namely intentional fouls). Controlling for these factors makes our estimates of player skill more realistic.
First we examine the types of calls present in the data set.
(orig_df['call_type']
.str.split(':', expand=True)
.iloc[:, 0]
.value_counts()
.plot(
kind='bar',
color=blue, logy=True,
title="Call types"
)
.set_ylabel("Frequency"));
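The next plot breaks foul calls down by type. It relies on a foul_df restricted to foul plays; a minimal sketch, with the filtering logic an assumption:

```python
# Keep only plays whose call type is a foul, e.g. "Foul: Shooting".
foul_df = orig_df[
    orig_df['call_type']
        .fillna("")
        .str.startswith("Foul")
]
```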
(foul_df['call_type']
.str.split(': ', expand=True)
.iloc[:, 1]
.value_counts()
.plot(
kind='bar',
color=blue, logy=True,
title="Foul Types"
)
.set_ylabel("Frequency"));
There are a number of misspelled team names in the data, which we correct.
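A sketch of the kind of correction involved (the specific typos shown are hypothetical examples); the subsequent cleaning and integer encoding of teams, players, seasons, and call types, which produces the df shown below, is elided here:

```python
# Hypothetical examples of misspelled team abbreviations; the real
# mapping is built by inspecting the unique values in the data.
TEAM_MAP = {
    'NKY': 'NYK',
    'COS': 'BOS',
    'SAT': 'SAS',
}

orig_df['away'] = orig_df['away'].replace(TEAM_MAP)
orig_df['home'] = orig_df['home'].replace(TEAM_MAP)
```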
df.head(n=2).T
play_id | 20151028INDTOR-1 | 20151028INDTOR-2 |
---|---|---|
seconds_left | 89.0 | 73.0 |
call_type | 4.0 | 4.0 |
foul_called | 1.0 | 0.0 |
player_committing | 162.0 | 36.0 |
player_disadvantaged | 98.0 | 358.0 |
score_committing | 99.0 | 106.0 |
score_disadvantaged | 106.0 | 99.0 |
season | 0.0 | 0.0 |
We follow George Box's modeling workflow, as summarized by Dustin Tran:

1. Build a model of the science.
2. Infer the model's parameters from data.
3. Criticize the model's fit and, if necessary, return to step one.
def make_foul_rate_yaxis(ax, label="Observed foul call rate"):
ax.yaxis.set_major_formatter(pct_formatter)
ax.set_ylabel(label)
return ax
make_foul_rate_yaxis(
df.pivot_table('foul_called', 'season')
.rename(index=season_enc.inverse_transform)
.rename_axis("Season")
.plot(kind='bar', rot=0, legend=False)
);
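The base model gives each season its own foul call rate: a normal prior on a per-season log-odds intercept, passed through the sigmoid function. The PyMC3 code below transcribes this directly.

$$
\begin{align*}
\beta^{\textrm{season}}_s & \sim N(0, 5^2) \\
p_s & = \operatorname{sigmoid}\left(\beta^{\textrm{season}}_s\right) \\
y_k & \sim \operatorname{Bernoulli}\left(p_{s(k)}\right)
\end{align*}
$$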
import pymc3 as pm
with pm.Model() as base_model:
β_season = pm.Normal('β_season', 0., 5., shape=n_season)
p = pm.Deterministic('p', pm.math.sigmoid(β_season))
with base_model:
y = pm.Bernoulli(
'y', p[season],
observed=df['foul_called']
)
We now sample from the model's posterior distribution.
with base_model:
base_trace = pm.sample(**SAMPLE_KWARGS)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [β_season]
100%|██████████| 1500/1500 [00:07<00:00, 198.65it/s]
We rely on three diagnostics to ensure that our samples have converged to the posterior distribution:

- energy plots and the Bayesian fraction of missing information (BFMI),
- the Gelman-Rubin statistic, $\hat{R}$, and
- effective sample size.
For more information on energy plots and BFMI consult Robust Statistical Workflow with PyStan.
(pm.energyplot(base_trace, legend=False, figsize=(6, 4))
.set_title(CONVERGENCE_TITLE()));
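A quick check of the other two diagnostics, using PyMC3's built-in helpers (the 1.1 threshold and the preference for at least a few hundred effective samples are conventional rules of thumb, not from the original talk):

```python
import numpy as np

# Gelman-Rubin statistics near one indicate the chains have mixed well;
# values above roughly 1.1 signal convergence problems.
max_rhat = max(np.max(rhat) for rhat in pm.gelman_rubin(base_trace).values())
print(f"Maximum Gelman-Rubin statistic: {max_rhat:.4f}")

# Effective sample sizes should be large enough for stable estimates.
min_n_eff = min(np.min(n) for n in pm.effective_n(base_trace).values())
print(f"Minimum effective sample size: {min_n_eff:.0f}")
```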
resid_df = (df.assign(p_hat=base_trace['p'][:, df['season']].mean(axis=0))
.assign(resid=lambda df: df['foul_called'] - df['p_hat']))
resid_df[['foul_called', 'p_hat', 'resid']].head()
play_id | foul_called | p_hat | resid
---|---|---|---
20151028INDTOR-1 | 1.0 | 0.403875 | 0.596125
20151028INDTOR-2 | 0.0 | 0.403875 | -0.403875
20151028INDTOR-3 | 1.0 | 0.403875 | 0.596125
20151028INDTOR-4 | 0.0 | 0.403875 | -0.403875
20151028INDTOR-6 | 0.0 | 0.403875 | -0.403875
(resid_df.pivot_table('resid', 'season')
.rename(index=season_enc.inverse_transform))
season | resid
---|---
2015-2016 | -0.000162
2016-2017 | -0.000219
Anyone who has watched a close basketball game will realize that we have neglected an important factor in late-game foul calls: intentional fouls. Near the end of the game, the trailing team, when on defense, fouls intentionally in order to end the leading team's possession as quickly as possible.
The influence of intentional fouls appears in the plot below as a rapid increase in the residuals as the number of seconds left in the game decreases.
def make_time_axes(ax,
xlabel="Seconds remaining in game",
ylabel="Observed foul call rate"):
ax.invert_xaxis()
ax.set_xlabel(xlabel)
return make_foul_rate_yaxis(ax, label=ylabel)
make_time_axes(
resid_df.pivot_table('resid', 'seconds_left')
.reset_index()
.plot('seconds_left', 'resid', kind='scatter'),
ylabel="Residual"
);
make_time_axes(
df.pivot_table(
'foul_called',
'seconds_left',
'trailing_committing'
)
.rolling(20)
.mean()
.rename(columns={
0: "No", 1: "Yes"
})
.rename_axis(
"Committing team is trailing",
axis=1
)
.plot()
);
ax = (df.pivot_table('foul_called', 'call_type')
.rename(index=call_type_enc.inverse_transform)
.rename_axis("Call type", axis=0)
.plot(kind='barh', legend=False))
ax.xaxis.set_major_formatter(pct_formatter);
ax.set_xlabel("Observed foul call rate");
Each call type has a different foul call rate.
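The possession model, poss_model, extends the base model; its construction is elided in this excerpt. A sketch follows, including the hierarchical_normal helper, a non-centered parameterization consistent with the Δ_* and σ_* variables that appear in the sampling logs below:

```python
def hierarchical_normal(name, shape, σ_shape=1):
    # Non-centered parameterization: a standard normal offset scaled by
    # a half-normal group-level standard deviation.
    Δ = pm.Normal(f'Δ_{name}', 0., 1., shape=shape)
    σ = pm.HalfNormal(f'σ_{name}', 5., shape=σ_shape)
    return pm.Deterministic(name, Δ * σ)

poss_model = pm.Model()

with poss_model:
    # Per-season intercepts, as in the base model.
    β_season = pm.Normal('β_season', 0., 5., shape=n_season)
```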
with poss_model:
β_call = hierarchical_normal('β_call', n_call_type)
make_time_axes(
df.pivot_table(
'foul_called',
'seconds_left',
'trailing_poss'
)
.loc[:, 1:3]
.rolling(20).mean()
.rename_axis(
"Trailing possessions\n(committing team)",
axis=1
)
.plot()
);
make_time_axes(
df.pivot_table('foul_called', 'seconds_left', 'call_type')
.rolling(20).mean()
.rename(columns=call_type_enc.inverse_transform)
.rename_axis(None, axis=1)
.plot()
);
Due to the NBA's shot clock, the natural timescale of a basketball game is possessions remaining, not seconds remaining.
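A sketch of how the possession columns already used above might be derived (the 24-second shot clock and the three-points-per-possession convention are assumptions; the talk's exact encoding may differ):

```python
import numpy as np

# Possessions remaining, assuming each possession can consume a full
# 24-second shot clock.
df['remaining_poss'] = df['seconds_left'].floordiv(24).add(1)

# Possessions by which the committing team trails, at up to three points
# per possession; negative values mean the committing team leads.
df['trailing_poss'] = np.ceil(
    (df['score_disadvantaged'] - df['score_committing']) / 3
).astype(np.int64)
```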
ax = sns.heatmap(
df.pivot_table(
'foul_called',
'trailing_poss',
'remaining_poss'
)
.rename_axis(
"Trailing possessions\n(committing team)",
axis=0
)
.rename_axis("Remaining possessions", axis=1),
cmap='seismic',
cbar_kws={'format': pct_formatter}
)
ax.invert_yaxis();
ax.set_title("Observed foul call rate");
def plot_foul_diff_heatmap(*_, data=None, **kwargs):
ax = plt.gca()
sns.heatmap(
data.pivot_table(
'diff',
'trailing_poss',
'remaining_poss'
),
cmap='seismic', robust=True,
cbar_kws={'format': pct_formatter}
)
ax.invert_yaxis()
ax.set_title("Observed foul call rate")
(sns.FacetGrid(diff_df, col='call_type', col_wrap=3, aspect=1.5)
.map_dataframe(plot_foul_diff_heatmap)
.set_axis_labels(
"Remaining possessions",
"Trailing possessions\n(committing team)"
)
.set_titles("{col_name}"));
with poss_model:
β_poss = hierarchical_normal(
'β_poss',
(n_trailing_poss, n_remaining_poss, n_call_type),
σ_shape=(1, 1, n_call_type)
)
The foul call rate is a combination of season, call type, and possession factors.
with poss_model:
η_game = β_season[season] \
+ β_call[call_type] \
+ β_poss[
trailing_poss,
remaining_poss,
call_type
]
Again, we sample from the model's posterior distribution.
with poss_model:
poss_trace = pm.sample(**SAMPLE_KWARGS)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [σ_β_poss_log__, Δ_β_poss, σ_β_call_log__, Δ_β_call, β_season]
100%|██████████| 1500/1500 [07:56<00:00, 3.15it/s]
There were 5 divergences after tuning. Increase `target_accept` or reparameterize.
There were 10 divergences after tuning. Increase `target_accept` or reparameterize.
There were 9 divergences after tuning. Increase `target_accept` or reparameterize.
The number of effective samples is smaller than 25% for some parameters.
Again, we calculate residuals.
ax = sns.heatmap(
resid_df.pivot_table(
'resid',
'trailing_poss',
'remaining_poss'
)
.rename_axis(
"Trailing possessions\n(committing team)",
axis=0
)
.rename_axis(
"Remaining possessions",
axis=1
)
.loc[-3:3],
cmap='seismic',
cbar_kws={'format': pct_formatter}
)
ax.invert_yaxis();
ax.set_title("Observed foul call rate");
ax = (resid_df.groupby(bins[bin_ix])
.resid.mean()
.rename_axis('p_hat', axis=0)
.reset_index()
.plot('p_hat', 'resid', kind='scatter'))
ax.xaxis.set_major_formatter(pct_formatter);
ax.set_xlabel(r"Binned $\hat{p}$");
make_foul_rate_yaxis(ax, label="Residual");
ax = (resid_df.groupby('seconds_left')
.resid.mean()
.reset_index()
.plot('seconds_left', 'resid', kind='scatter'))
make_time_axes(ax, ylabel="Residual");
Now that we have two models, we can engage in model selection. We use the widely applicable information criterion (WAIC), also known as the Watanabe-Akaike information criterion, to compare the models.
comp_df = (pm.compare(
(base_trace, poss_trace),
(base_model, poss_model)
)
.rename(index=MODEL_NAME_MAP)
.loc[MODEL_NAME_MAP.values()])
comp_df
Model | WAIC | pWAIC | dWAIC | weight | SE | dSE | var_warn
---|---|---|---|---|---|---|---
Base | 11610.1 | 2.11 | 1541.98 | 0 | 56.9 | 73.43 | 0
Possession | 10068.1 | 82.93 | 0 | 1 | 88.05 | 0 | 0
fig, ax = plt.subplots()
ax.errorbar(
np.arange(len(MODEL_NAME_MAP)),
comp_df.WAIC,
yerr=comp_df.SE, fmt='o'
);
ax.set_xticks(np.arange(len(MODEL_NAME_MAP)));
ax.set_xticklabels(comp_df.index);
ax.set_xlabel("Model");
ax.set_ylabel("WAIC");
We now turn to the question of whether committing and/or drawing fouls is a measurable skill. We use an item-response theory (IRT) model to study this question. For more information on Bayesian item-response models, consult Fox's *Bayesian Item Response Modeling: Theory and Applications* or Bafumi et al.'s *Practical Issues in Implementing and Understanding Bayesian Ideal Point Estimation*.
The item-response theory model includes the season, call type, and possession terms of the previous models.
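The construction of irt_model is elided in this excerpt; a sketch of that shared game-level structure, reusing the same priors as the possession model:

```python
with pm.Model() as irt_model:
    # Game-level terms, identical in form to the possession model.
    β_season = pm.Normal('β_season', 0., 5., shape=n_season)
    β_call = hierarchical_normal('β_call', n_call_type)
    β_poss = hierarchical_normal(
        'β_poss',
        (n_trailing_poss, n_remaining_poss, n_call_type),
        σ_shape=(1, 1, n_call_type)
    )

    η_game = (β_season[season]
              + β_call[call_type]
              + β_poss[trailing_poss, remaining_poss, call_type])
```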
Each disadvantaged player has an ideal point (per season).
$$
\begin{align*}
\sigma_{\theta} & \sim \operatorname{HalfNormal}(5) \\
\theta^{\textrm{player}}_{i, s} & \sim \operatorname{Hierarchical-Normal}(0, \sigma_{\theta}^2)
\end{align*}
$$

with irt_model:
θ_player = hierarchical_normal(
'θ_player', (n_player, n_season)
)
θ = θ_player[player_disadvantaged, season]
Each committing player has an ideal point (per season).
player_committing = df['player_committing'].values
with irt_model:
b_player = hierarchical_normal(
'b_player', (n_player, n_season)
)
b = b_player[player_committing, season]
Players affect the foul call rate through the difference in their ideal points.
with irt_model:
η_player = θ - b
The sum of the game and player effects determines the foul call probability.
with irt_model:
η = η_game + η_player
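As in the earlier models, the log-odds are mapped to a probability with the sigmoid function, and the observed calls are Bernoulli distributed; a sketch mirroring the base model's likelihood:

```python
with irt_model:
    # Foul call probability for each play.
    p = pm.Deterministic('p', pm.math.sigmoid(η))

    # Observed foul calls/non-calls.
    y = pm.Bernoulli('y', p, observed=df['foul_called'])
```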
Again, we sample from the model's posterior distribution.
with irt_model:
irt_trace = pm.sample(**SAMPLE_KWARGS)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [σ_b_player_log__, Δ_b_player, σ_θ_player_log__, Δ_θ_player, σ_β_poss_log__, Δ_β_poss, σ_β_call_log__, Δ_β_call, β_season]
100%|██████████| 1500/1500 [13:55<00:00, 1.80it/s]
There were 3 divergences after tuning. Increase `target_accept` or reparameterize.
There were 1 divergences after tuning. Increase `target_accept` or reparameterize.
There were 4 divergences after tuning. Increase `target_accept` or reparameterize.
The estimated number of effective samples is smaller than 200 for some parameters.
None of the sampling diagnostics indicate problems with convergence.
ax = (resid_df.groupby(bins[bin_ix])
.resid.mean()
.rename_axis('p_hat', axis=0)
.reset_index()
.plot('p_hat', 'resid', kind='scatter'))
ax.xaxis.set_major_formatter(pct_formatter);
ax.set_xlabel(r"Binned $\hat{p}$");
make_foul_rate_yaxis(ax, label="Residual");
The IRT model is a marginal improvement over the possession model in terms of WAIC.
fig, ax = plt.subplots()
ax.errorbar(
np.arange(len(MODEL_NAME_MAP)), comp_df.WAIC,
yerr=comp_df.SE, fmt='o'
);
ax.set_xticks(np.arange(len(MODEL_NAME_MAP)));
ax.set_xticklabels(comp_df.index);
ax.set_xlabel("Model");
ax.set_ylabel("WAIC");
We now produce two DataFrames containing the estimated player ideal points per season.
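One plausible construction of the per-player table (player_enc, a label encoder over player names analogous to season_enc, is an assumption; top_bot_irt_df, used below, would collect the ten highest- and lowest-rated players for each parameter):

```python
import pandas as pd

# Posterior means of each player's per-season disadvantaged (θ) and
# committing (b) parameters.
player_irt_df = pd.concat(
    {
        'θ': pd.DataFrame(
            irt_trace['θ_player'].mean(axis=0),
            index=player_enc.classes_,
            columns=season_enc.classes_
        ),
        'b': pd.DataFrame(
            irt_trace['b_player'].mean(axis=0),
            index=player_enc.classes_,
            columns=season_enc.classes_
        )
    },
    axis=1
)
```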
fig, ax = plot_latent_params(
top_bot_irt_df[top_bot_irt_df['param'] == 'θ']
.sort_values('mean')
)
ax.set_xlabel(r"$\hat{\theta}$");
ax.set_title("Top and bottom ten");
fig, ax = plot_latent_params(
top_bot_irt_df[top_bot_irt_df['param'] == 'b']
.sort_values('mean')
)
ax.set_xlabel(r"$\hat{b}$");
ax.set_title("Top and bottom ten");
style_grid(
sns.PairGrid(
player_irt_df.loc[player_all_season],
size=1.75
)
.map_upper(plt.scatter, alpha=0.5)
.map_diag(plt.hist)
.map_lower(plot_corr)
);
Since we can only reasonably estimate the skills of players for which we have sufficient foul call data, we plot the correlations below for players that appeared in at least ten plays in each season.
As expected, the season-over-season latent skill correlations are higher for this subset of players.
From this figure, it seems that committing skill ($\hat{b}$) exists and is measurable, but is fairly small. It also seems that disadvantaged skill ($\hat{\theta}$), if it exists, is difficult to measure from this data set. Ideally, the NBA would release a foul report for the entirety of every game, but that seems quite unlikely.
In the future it would be useful to include a correction for the probability that a given game appears in the data set. This correction would help with the sample bias introduced by the fact that only games that are close in the last two minutes are included in the NBA's reports.
This post is available as a Jupyter notebook here.