Practical Bayesian Modeling with PyMC
Abstract
In this tutorial, we explore Bayesian regression using PyMC -- the primary library for Bayesian sampling in Python -- focusing on survey data and other datasets with categorical outcomes. Starting with logistic regression, we’ll build up to categorical and ordered logistic regression, showcasing how Bayesian approaches provide versatile tools for developing and evaluating complex models. Participants will leave with practical skills for implementing Bayesian regression models in PyMC, along with a deeper appreciation for the power of Bayesian inference in real-world data analysis. Participants should be familiar with Python, the SciPy ecosystem, and basic statistics, but no experience with Bayesian methods is required.
Description
In this hands-on tutorial, we will dive into the world of Bayesian regression using PyMC, with a particular focus on datasets that feature categorical outcomes. PyMC provides versatile tools for iterative development of powerful models. This tutorial guides participants through the fundamentals of Bayesian regression, starting from logistic regression and extending to categorical and ordered logistic regression. It also introduces the PyMC package, its syntax and capabilities.
Outline:

1. Logistic Regression with PyMC
2. Categorical Regression
3. Ordered Logistic Regression
4. Conclusion and Q&A
Who Should Attend: This tutorial is aimed at data scientists who are comfortable with Python, the SciPy ecosystem, and basic statistics but are new to Bayesian statistics and/or PyMC. By the end of the session, participants will have hands-on experience with Bayesian regression models and a solid understanding of how to apply these techniques to real-world data analysis.
Notes: The material for the tutorial is in this repository: https://github.com/AllenDowney/SurveyDataPyMC/tree/main/notebooks
The dataset from the General Social Survey (GSS) can be found here: https://gssdataexplorer.norc.org/
[The slides for the tutorial are here](COMING SOON).
The plan:

- Logistic regression
  - Example: happiness data?
  - Start with the DIY implementation
  - Show the version using LogisticRegression
- Categorical regression
  - Example: political party?
  - Start with the version using Categorical
  - Develop the DIY version with Multinomial
- Ordered logistic regression
  - Start with OrderedLogistic (see the sketch below)
  - Maybe develop the DIY version (or leave it in the notebook for future exploration)
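As a preview of where the ordered logistic section is headed, here is a minimal sketch of `pm.OrderedLogistic` on simulated data. The variable names and the made-up data are placeholders, not the tutorial's actual example; the ordered transform keeps the cutpoints increasing.

```python
# Minimal sketch of ordered logistic regression in PyMC (simulated data;
# the tutorial itself uses GSS variables instead).
import numpy as np
import pymc as pm

rng = np.random.default_rng(17)
x = rng.normal(size=200)          # hypothetical centered predictor
y = rng.integers(0, 3, size=200)  # hypothetical ordinal outcome with 3 levels

with pm.Model() as ordered_model:
    beta = pm.Normal("beta", 0, 1)
    # Cutpoints must be increasing; the ordered transform enforces that
    cutpoints = pm.Normal(
        "cutpoints",
        mu=0,
        sigma=2,
        shape=2,
        transform=pm.distributions.transforms.ordered,
        initval=np.array([-1.0, 1.0]),
    )
    pm.OrderedLogistic("y_obs", eta=beta * x, cutpoints=cutpoints, observed=y)
```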
%load_ext autoreload
%autoreload 2
%load_ext jupyter_black
# Get utils.py
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)

download("https://github.com/AllenDowney/SurveyDataPyMC/raw/main/notebooks/utils.py")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
from utils import value_counts, decorate
# Make the figures smaller to save some screen real estate
plt.rcParams["figure.dpi"] = 75
plt.rcParams["figure.figsize"] = [6, 3.5]
plt.rcParams["axes.titlelocation"] = "left"
# This dataset is prepared in GssExtract/notebooks/02_make_extract-2022_4.ipynb
# It has been resampled to correct for stratified sampling
DATA_PATH = "https://github.com/AllenDowney/GssExtract/raw/main/data/interim/"
filename = "gss_extract_2022_4.hdf"
download(DATA_PATH + filename)
gss = pd.read_hdf(filename, "gss")
gss.columns
Index(['age', 'attend', 'ballot', 'class', 'clmtcaus', 'clmtusa', 'clmtwrld', 'cohort', 'commun', 'conarmy', 'conbus', 'conclerg', 'coneduc', 'confed', 'confinan', 'coninc', 'conjudge', 'conlabor', 'conlegis', 'conmedic', 'conpress', 'conrinc', 'consci', 'contv', 'degree', 'educ', 'eqwlth', 'fair', 'fear', 'fund', 'goodlife', 'hapmar', 'happy', 'health', 'helpful', 'id', 'life', 'partyid', 'polviews', 'pres16', 'pres20', 'race', 'realinc', 'realrinc', 'reg16', 'region', 'relig', 'relig16', 'reliten', 'res16', 'rincome', 'satfin', 'satjob', 'sex', 'sexbirth', 'sexnow', 'srcbelt', 'trust', 'wtssall', 'year'], dtype='object')
Generally speaking, would you say that most people can be trusted or that you can't be too careful in dealing with people?
1 Most people can be trusted
2 Can't be too careful
3 Depends
value_counts(gss["trust"])
| values | counts |
| --- | --- |
| 1.0 | 15783 |
| 2.0 | 24890 |
| 3.0 | 1966 |
| NaN | 29751 |
gss["y"] = gss["trust"].replace({1: 1, 2: 0, 3: np.nan})
value_counts(gss["y"])
| values | counts |
| --- | --- |
| 0.0 | 24890 |
| 1.0 | 15783 |
| NaN | 31717 |
time_series = gss.groupby("year")["y"].mean() * 100
time_series.plot(style="o")
decorate(ylabel="percent", title="Can people be trusted?")
Generally speaking, do you usually think of yourself as a Republican, Democrat, Independent, or what?
0 Strong democrat
1 Not very strong democrat
2 Independent, close to democrat
3 Independent (neither, no response)
4 Independent, close to republican
5 Not very strong republican
6 Strong republican
7 Other party
party_map = {
    0: 0,
    1: 0,
    2: 1,
    3: 1,
    4: 1,
    5: 2,
    6: 2,
    7: np.nan,
}
gss["partyid3"] = gss["partyid"].replace(party_map)
value_counts(gss["partyid3"])
| values | counts |
| --- | --- |
| 0.0 | 25439 |
| 1.0 | 26705 |
| 2.0 | 18338 |
| NaN | 1908 |
gss
[DataFrame display elided: 72390 rows × 62 columns, with the derived y and partyid3 columns added at the end]
party_names = {
    0: "Democrat",
    1: "Independent",
    2: "Republican",
}
table = (
    gss.pivot_table(index="year", columns="partyid3", values="y", aggfunc="mean") * 100
).rename(columns=party_names)
table.iloc[:10]
| year | Democrat | Independent | Republican |
| --- | --- | --- | --- |
| 1972 | 42.363878 | 52.741514 | 61.198738 |
| 1973 | 44.194107 | 50.308008 | 60.179641 |
| 1975 | 35.205993 | 42.603550 | 46.368715 |
| 1976 | 42.763158 | 45.730550 | 57.792208 |
| 1978 | 36.744966 | 46.628131 | 43.450479 |
| 1980 | 46.401515 | 45.904762 | 51.142857 |
| 1983 | 31.629393 | 42.293907 | 47.435897 |
| 1984 | 43.619048 | 49.705305 | 53.168044 |
| 1986 | 33.213645 | 39.387309 | 49.113924 |
| 1987 | 34.039900 | 40.227704 | 50.364964 |
colors = ["C0", "C4", "C3"]
table.plot(color=colors)
decorate(ylabel="percent", title="Percent saying people can be trusted")
data = gss.dropna(subset=["y", "year"])
data.shape
(40673, 62)
y = data["y"].values
year = data["year"].values
Shift `year` so it's centered at its mean.
year_shift = data["year"].mean()
year = year - year_shift
Here's the model.
with pm.Model() as logistic_model:
    y_pt = pm.Data("y", y)
    year_pt = pm.Data("year", year)

    # Priors for intercept and slope
    sigma_prior = 1
    mu_prior = 0
    alpha = pm.Normal("alpha", mu_prior, sigma_prior)
    beta = pm.Normal("beta", mu_prior, sigma_prior)

    # Linear predictor (log-odds)
    logits = pm.Deterministic("logits", alpha + beta * year_pt)

    # Bernoulli likelihood with logit link
    y_obs = pm.Bernoulli("y_obs", logit_p=logits, observed=y_pt)
pm.model_to_graphviz(logistic_model)
with logistic_model:
    idata = pm.sample(
        draws=500, tune=500, chains=3, cores=3, var_names=["alpha", "beta"]
    )
idata
[InferenceData display elided: the posterior group holds alpha and beta (3 chains × 500 draws); sample_stats, observed_data (y_obs), and constant_data (y and year) groups are also attached.]
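The groups in an `InferenceData` object are ordinary xarray Datasets, so we can inspect them directly. For example:

```python
# Pull the posterior samples of alpha and summarize them directly
alpha_post = idata.posterior["alpha"]
print(alpha_post.mean().item(), alpha_post.std().item())
```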
import arviz as az
az.plot_posterior(idata)
decorate()
# Sample from the prior
with logistic_model:
    idata_prior = pm.sample_prior_predictive(draws=1000, var_names=["alpha", "beta"])

# Add the prior to the InferenceData we already have
idata["prior"] = idata_prior.prior
az.plot_dist_comparison(idata, var_names=["alpha", "beta"])
decorate()
az.plot_posterior(idata, ref_val=0)
decorate()
Looking at the posterior distributions of `alpha` and `beta`, we can see that they fall comfortably within the prior distributions of these parameters, which means that the priors didn't have much effect on the results.

As an experiment, try increasing or decreasing `sigma_prior` and see what effect it has on the posterior distributions.
After sampling, we want to check that the process worked well, meaning:

- All chains have effectively explored the posterior distribution, and
- Successive draws within each chain are only weakly correlated (each draw should provide mostly new information).

We can check these properties visually using `plot_trace`:

- On the left, the distribution from each chain should look similar; this is evidence that the chains all converged to the same region of the posterior.
- On the right, the sequence of draws within each chain should look like uncorrelated noise (sometimes called "hairy caterpillars"). It shouldn't look like a random walk, and there shouldn't be flat spots, which would suggest the chain got stuck.
az.plot_trace(idata)
decorate()
To check these properties numerically, we can look at two key summary statistics: ESS and r-hat.

- Effective sample size (ESS) tells us how much independent information the chains actually contribute. If successive values within a chain are highly correlated, the chain isn't exploring the posterior efficiently, and ESS will be smaller than the total number of draws.
- r-hat quantifies the difference between chains. If all chains converged to the same posterior distribution, r-hat should be close to 1. If the chains are exploring different regions, r-hat will be larger than 1, indicating failure to converge.
az.summary(idata)
| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| alpha | -0.460 | 0.010 | -0.478 | -0.442 | 0.0 | 0.0 | 1539.0 | 1017.0 | 1.0 |
| beta | -0.014 | 0.001 | -0.016 | -0.013 | 0.0 | 0.0 | 1673.0 | 1271.0 | 1.0 |
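The plan above mentions comparing with scikit-learn's `LogisticRegression`. As a quick sanity check, here is a sketch (assuming scikit-learn is installed): with weak priors, the maximum-likelihood estimates should land close to the posterior means in the table.

```python
# Sanity check: the MLE should be close to the posterior means above
# (alpha ~ -0.46, beta ~ -0.014), since the priors are weak.
from sklearn.linear_model import LogisticRegression

X = year.reshape(-1, 1)  # centered year, as in the PyMC model
lr = LogisticRegression(penalty=None)  # use penalty="none" on scikit-learn < 1.2
lr.fit(X, y)
print(lr.intercept_[0], lr.coef_[0, 0])
```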
In Bayesian statistics, there are two core objects we care about when analyzing our "fits".

NOTE 1.1: The posterior is accessible to us not as an equation, but as samples from the distribution of interest. This is the "magic" of MCMC, and in our case the canonical result of a call to `pm.sample()`.

NOTE 1.2: The posterior lives in parameter space.

NOTE 2.1: The posterior predictive lives in data space.

Below we show how to get samples from the posterior predictive manually, by extracting our posterior samples and constructing our regression model in `numpy`. Later we will show how to use PyMC directly to produce samples from the posterior predictive.
def get_sample(idata: az.InferenceData, var_names: list[str] | str) -> list[np.ndarray]:
    """Extract posterior samples from idata.

    Args:
        idata (az.InferenceData): InferenceData object from ArviZ that holds our posterior samples as xarray DataArrays
        var_names (list[str] | str): Name(s) of the variable(s) to extract from the posterior

    Returns:
        list[np.ndarray]: One array per variable, with posterior samples stacked across chains and draws
    """
    if isinstance(var_names, str):
        var_names = [var_names]
    return [
        idata.posterior[var_name].stack(sample=["chain", "draw"]).values
        for var_name in var_names
    ]
# Collect posterior draws of alpha and beta
alphas, betas = get_sample(idata, var_names=["alpha", "beta"])
betas.shape
(1500,)
# Sample 100 from the posterior
idx = np.random.choice(len(betas), size=100, replace=False)
alpha_samples = alphas[idx]
beta_samples = betas[idx]
beta_samples.shape
(100,)
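As an aside, recent versions of ArviZ bundle this workflow into `az.extract`, which stacks chains and draws and can subsample in one call. A sketch equivalent to the steps above:

```python
# Equivalent shortcut: stack chains/draws and subsample 100 posterior draws
sample = az.extract(idata, var_names=["alpha", "beta"], num_samples=100)
alpha_samples = sample["alpha"].values
beta_samples = sample["beta"].values
```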
# Range of years to predict
year_range = np.arange(1970, 2031)
year_centered = year_range - year_shift
from scipy.special import expit
for alpha, beta in zip(alpha_samples, beta_samples):
    logits = alpha + beta * year_centered
    probs = expit(logits) * 100
    plt.plot(year_range, probs, color="C0", alpha=0.02)
time_series.plot(style="o", label="observed")
decorate(ylabel="percent", title="Percent saying people can be trusted")
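As promised, PyMC can also generate posterior predictive samples directly, without the `numpy` loop. A sketch of the idea: point the model's data containers at the years we want predictions for, then call `pm.sample_posterior_predictive`. Resizing the observed variable this way relies on the mutable `pm.Data` containers defined in the model.

```python
# Posterior predictive sampling with PyMC instead of the manual numpy loop.
# The placeholder zeros just set the shape of the predictions.
with logistic_model:
    pm.set_data({"year": year_centered, "y": np.zeros(len(year_centered))})
    ppc = pm.sample_posterior_predictive(idata, var_names=["y_obs"])

# Average over chains and draws to get predicted probabilities
pred = ppc.posterior_predictive["y_obs"].mean(("chain", "draw"))
```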
Because we centered `year`, the sampled slopes and intercepts are uncorrelated.
pm.plot_pair(idata, var_names=["alpha", "beta"])
decorate()
As an experiment, try:

- Set `year_shift=1970` and run the model again. You might get some warnings from the sampler, and `plot_pair` will show that the alphas and betas are correlated. The not-quite-centered predictor stretches the shape of the joint posterior distribution and makes it harder to sample efficiently.
- Set `year_shift=0` and run the model again. With the predictor completely uncentered, the joint posterior distribution is so stretched that the sampler diverges frequently -- and basically fails.

Without centering, the intercept represents the log-odds of the outcome when `year=0`, which is far outside the range of the data. This forces the intercept and slope to compensate for each other: if the slope increases, the intercept must decrease to hit the same observed points.
As a general rule, it's a good idea to center continuous predictors in Bayesian regression models.
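A small helper makes the advice concrete (hypothetical; the notebook just subtracts the mean inline):

```python
# Center a continuous predictor and keep the shift for later predictions
def center(series):
    shift = series.mean()
    return (series - shift).values, shift

year, year_shift = center(data["year"])
```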
value_counts(gss["partyid3"])
| values | counts |
| --- | --- |
| 0.0 | 25439 |
| 1.0 | 26705 |
| 2.0 | 18338 |
| NaN | 1908 |
data = gss.dropna(subset=["y", "year", "partyid3"])
data.shape
(39697, 62)
y = data["y"].values
year = data["year"].values
year_shift = data["year"].mean()
year = year - year_shift
partyid3 = data["partyid3"].astype(int).values
with pm.Model(
    coords={"party_affiliation": ["Democrat", "Independent", "Republican"]}
) as logistic_model2:
    y_pt = pm.Data("y", y)
    year_pt = pm.Data("year", year)
    party_idx_pt = pm.Data("party_idx", partyid3)

    # Party-specific intercepts (one for each party)
    sigma_prior = 1
    mu_prior = 0
    alpha = pm.Normal("alpha", mu_prior, sigma_prior, dims="party_affiliation")

    # Shared slope for year (assuming the effect of year is the same across parties)
    beta = pm.Normal("beta", mu_prior, sigma_prior)

    # Linear predictor (log-odds)
    logits = pm.Deterministic("logits", alpha[party_idx_pt] + beta * year_pt)

    # Bernoulli likelihood with logit link
    y_obs = pm.Bernoulli("y_obs", logit_p=logits, observed=y_pt)

    # Add other quantities of interest
    probabilities = pm.Deterministic("probabilities", pm.math.sigmoid(logits))
pm.model_to_graphviz(logistic_model2)
with logistic_model2:
    idata2 = pm.sample(
        draws=500,
        tune=500,
        chains=3,
        cores=3,
        var_names=["alpha", "beta"],
        # initvals=logistic_model2.initial_point(),
    )
import arviz as az
az.plot_posterior(idata2)
decorate()
az.plot_trace(idata2, compact=False)
decorate()
pm.summary(idata2)
| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| alpha[Democrat] | -0.544 | 0.017 | -0.579 | -0.515 | 0.0 | 0.0 | 2340.0 | 1350.0 | 1.0 |
| alpha[Independent] | -0.552 | 0.017 | -0.583 | -0.519 | 0.0 | 0.0 | 2164.0 | 1374.0 | 1.0 |
| alpha[Republican] | -0.214 | 0.020 | -0.252 | -0.179 | 0.0 | 0.0 | 2831.0 | 1498.0 | 1.0 |
| beta | -0.015 | 0.001 | -0.016 | -0.013 | 0.0 | 0.0 | 2667.0 | 1331.0 | 1.0 |
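The table suggests the Republican intercept is noticeably higher than the other two. Because the posterior samples are labeled by the `party_affiliation` coordinate, we can compute a full posterior distribution for that contrast; a sketch:

```python
# Posterior distribution of the Republican-vs-Democrat intercept difference
alpha_post = idata2.posterior["alpha"]
contrast = alpha_post.sel(party_affiliation="Republican") - alpha_post.sel(
    party_affiliation="Democrat"
)
az.plot_posterior(contrast.rename("alpha_diff"), ref_val=0)
```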
# Compute predicted probabilities over the prediction years, for each party
probabilities = {}
with logistic_model2:
    for key_, party_name_ in party_names.items():
        # Assign covariates for which we want to compute the posterior predictive
        pm.set_data(
            {
                "party_idx": np.tile(key_, len(year_centered)),
                "year": year_centered,
            }
        )
        # Compute the posterior predictive
        probabilities[party_name_] = pm.compute_deterministics(
            idata2.posterior, var_names=["probabilities"]
        )
# Base plot per party affiliation
table.plot(color=colors)

# Plot posterior draws for each party affiliation
for key_, party_name_ in party_names.items():
    probabilities_tmp = (
        probabilities[party_name_].probabilities.sel(chain=np.random.choice(3)).values
        * 100
    ).transpose()
    plt.plot(
        year_range,
        probabilities_tmp,
        color=colors[key_],
        alpha=0.002,
    )
decorate(ylabel="percent", title="Percent saying people can be trusted")
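One caveat about the loop above: `pm.set_data` mutates the model's data containers in place, so `logistic_model2` now holds the last party's prediction grid rather than the observed data. If you plan to reuse the model, for example to draw more posterior samples, restore the original data first:

```python
# Restore the observed data after the prediction loop
with logistic_model2:
    pm.set_data({"y": y, "year": year, "party_idx": partyid3})
```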