Practical Bayesian Modeling with PyMC
Abstract
In this tutorial, we explore Bayesian regression using PyMC -- the primary library for Bayesian sampling in Python -- focusing on survey data and other datasets with categorical outcomes. Starting with logistic regression, we’ll build up to categorical and ordered logistic regression, showcasing how Bayesian approaches provide versatile tools for developing and evaluating complex models. Participants will leave with practical skills for implementing Bayesian regression models in PyMC, along with a deeper appreciation for the power of Bayesian inference in real-world data analysis. Participants should be familiar with Python, the SciPy ecosystem, and basic statistics, but no experience with Bayesian methods is required.
Description
In this hands-on tutorial, we will dive into the world of Bayesian regression using PyMC, with a particular focus on datasets that feature categorical outcomes. PyMC provides versatile tools for iterative development of powerful models. This tutorial guides participants through the fundamentals of Bayesian regression, starting from logistic regression and extending to categorical and ordered logistic regression. It also introduces the PyMC package, its syntax and capabilities.
Outline:

1. Logistic Regression with PyMC
2. Categorical Regression
3. Ordered Logistic Regression
4. Conclusion and Q&A
Who Should Attend: This tutorial is aimed at data scientists who are comfortable with Python, the SciPy ecosystem, and basic statistics but are new to Bayesian statistics and/or PyMC. By the end of the session, participants will have hands-on experience with Bayesian regression models and a solid understanding of how to apply these techniques to real-world data analysis.
Notes: The material for the tutorial is in this repository: https://github.com/AllenDowney/SurveyDataPyMC/tree/main/notebooks
The dataset from the General Social Survey (GSS) can be found here: https://gssdataexplorer.norc.org/
[The slides for the tutorial are here](COMING SOON).
The plan:

- Logistic regression
  - Example: happiness data?
  - Start with the DIY implementation
  - Show the version using LogisticRegression
- Categorical regression
  - Example: political party?
  - Start with the version using Categorical
  - Develop the DIY version with Multinomial
- Ordered logistic regression
  - Start with OrderedLogistic (see the sketch below)
  - Maybe develop the DIY version (or leave it in the notebook for future exploration)
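As a preview of where the ordered logistic section is headed, here is a minimal sketch of `pm.OrderedLogistic` on simulated data. The variable names and the made-up data are placeholders, not the tutorial's actual example; the ordered transform keeps the cutpoints increasing.

```python
# Minimal sketch of ordered logistic regression in PyMC (simulated data;
# the tutorial itself uses GSS variables instead).
import numpy as np
import pymc as pm

rng = np.random.default_rng(17)
x = rng.normal(size=200)          # hypothetical centered predictor
y = rng.integers(0, 3, size=200)  # hypothetical ordinal outcome with 3 levels

with pm.Model() as ordered_model:
    beta = pm.Normal("beta", 0, 1)
    # Cutpoints must be increasing; the ordered transform enforces that
    cutpoints = pm.Normal(
        "cutpoints",
        mu=0,
        sigma=2,
        shape=2,
        transform=pm.distributions.transforms.ordered,
        initval=np.array([-1.0, 1.0]),
    )
    pm.OrderedLogistic("y_obs", eta=beta * x, cutpoints=cutpoints, observed=y)
```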
%load_ext autoreload
%autoreload 2
%load_ext jupyter_black
# Get utils.py
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print("Downloaded " + local)

download("https://github.com/AllenDowney/SurveyDataPyMC/raw/main/notebooks/utils.py")
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
from utils import value_counts, decorate
# Make the figures smaller to save some screen real estate
plt.rcParams["figure.dpi"] = 75
plt.rcParams["figure.figsize"] = [6, 3.5]
plt.rcParams["axes.titlelocation"] = "left"
# This dataset is prepared in GssExtract/notebooks/02_make_extract-2022_4.ipynb
# It has been resampled to correct for stratified sampling
DATA_PATH = "https://github.com/AllenDowney/GssExtract/raw/main/data/interim/"
filename = "gss_extract_2022_4.hdf"
download(DATA_PATH + filename)
gss = pd.read_hdf(filename, "gss")
gss.columns
Index(['age', 'attend', 'ballot', 'class', 'clmtcaus', 'clmtusa', 'clmtwrld', 'cohort', 'commun', 'conarmy', 'conbus', 'conclerg', 'coneduc', 'confed', 'confinan', 'coninc', 'conjudge', 'conlabor', 'conlegis', 'conmedic', 'conpress', 'conrinc', 'consci', 'contv', 'degree', 'educ', 'eqwlth', 'fair', 'fear', 'fund', 'goodlife', 'hapmar', 'happy', 'health', 'helpful', 'id', 'life', 'partyid', 'polviews', 'pres16', 'pres20', 'race', 'realinc', 'realrinc', 'reg16', 'region', 'relig', 'relig16', 'reliten', 'res16', 'rincome', 'satfin', 'satjob', 'sex', 'sexbirth', 'sexnow', 'srcbelt', 'trust', 'wtssall', 'year'], dtype='object')
Generally speaking, would you say that most people can be trusted or that you can't be too careful in dealing with people?
1 Most people can be trusted
2 Can't be too careful
3 Depends
value_counts(gss["trust"])
| values | counts |
| --- | --- |
| 1.0 | 15783 |
| 2.0 | 24890 |
| 3.0 | 1966 |
| NaN | 29751 |
gss["y"] = gss["trust"].replace({1: 1, 2: 0, 3: np.nan})
value_counts(gss["y"])
| values | counts |
| --- | --- |
| 0.0 | 24890 |
| 1.0 | 15783 |
| NaN | 31717 |
time_series = gss.groupby("year")["y"].mean() * 100
time_series.plot(style="o")
decorate(ylabel="percent", title="Can people be trusted?")
Generally speaking, do you usually think of yourself as a Republican, Democrat, Independent, or what?
0 Strong democrat
1 Not very strong democrat
2 Independent, close to democrat
3 Independent (neither, no response)
4 Independent, close to republican
5 Not very strong republican
6 Strong republican
7 Other party
party_map = {
    0: 0,
    1: 0,
    2: 1,
    3: 1,
    4: 1,
    5: 2,
    6: 2,
    7: np.nan,
}
gss["partyid3"] = gss["partyid"].replace(party_map)
value_counts(gss["partyid3"])
| values | counts |
| --- | --- |
| 0.0 | 25439 |
| 1.0 | 26705 |
| 2.0 | 18338 |
| NaN | 1908 |
gss
[DataFrame display elided: 72390 rows × 62 columns, with the derived y and partyid3 columns added at the end]
party_names = {
    0: "Democrat",
    1: "Independent",
    2: "Republican",
}
table = (
    gss.pivot_table(index="year", columns="partyid3", values="y", aggfunc="mean") * 100
).rename(columns=party_names)
table.iloc[:10]
| year | Democrat | Independent | Republican |
| --- | --- | --- | --- |
| 1972 | 42.363878 | 52.741514 | 61.198738 |
| 1973 | 44.194107 | 50.308008 | 60.179641 |
| 1975 | 35.205993 | 42.603550 | 46.368715 |
| 1976 | 42.763158 | 45.730550 | 57.792208 |
| 1978 | 36.744966 | 46.628131 | 43.450479 |
| 1980 | 46.401515 | 45.904762 | 51.142857 |
| 1983 | 31.629393 | 42.293907 | 47.435897 |
| 1984 | 43.619048 | 49.705305 | 53.168044 |
| 1986 | 33.213645 | 39.387309 | 49.113924 |
| 1987 | 34.039900 | 40.227704 | 50.364964 |
colors = ["C0", "C4", "C3"]
table.plot(color=colors)
decorate(ylabel="percent", title="Percent saying people can be trusted")
data = gss.dropna(subset=["y", "year"])
data.shape
(40673, 62)
y = data["y"].values
year = data["year"].values
Shift `year` so it's centered at its mean.
year_shift = data["year"].mean()
year = year - year_shift
Here's the model.
with pm.Model() as logistic_model:
    y_pt = pm.Data("y", y)
    year_pt = pm.Data("year", year)

    # Priors for intercept and slope
    sigma_prior = 1
    mu_prior = 0
    alpha = pm.Normal("alpha", mu_prior, sigma_prior)
    beta = pm.Normal("beta", mu_prior, sigma_prior)

    # Linear predictor (log-odds)
    logits = pm.Deterministic("logits", alpha + beta * year_pt)

    # Bernoulli likelihood with logit link
    y_obs = pm.Bernoulli("y_obs", logit_p=logits, observed=y_pt)
pm.model_to_graphviz(logistic_model)
with logistic_model:
    idata = pm.sample(
        draws=500, tune=500, chains=3, cores=3, var_names=["alpha", "beta"]
    )
idata
[InferenceData display elided: the posterior group holds alpha and beta (3 chains × 500 draws); sample_stats, observed_data (y_obs), and constant_data (y and year) groups are also attached.]
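The groups in an `InferenceData` object are ordinary xarray Datasets, so we can inspect them directly. For example:

```python
# Pull the posterior samples of alpha and summarize them directly
alpha_post = idata.posterior["alpha"]
print(alpha_post.mean().item(), alpha_post.std().item())
```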
import arviz as az
az.plot_posterior(idata)
decorate()
# Sample from the prior
with logistic_model:
    idata_prior = pm.sample_prior_predictive(draws=1000, var_names=["alpha", "beta"])

# Add the prior to the InferenceData we already have
idata["prior"] = idata_prior.prior
az.plot_dist_comparison(idata, var_names=["alpha", "beta"])
decorate()
az.plot_posterior(idata, ref_val=0)
decorate()
Looking at the posterior distributions of `alpha` and `beta`, we can see that they fall comfortably within the prior distributions of these parameters, which means that the priors didn't have much effect on the results.

As an experiment, try increasing or decreasing `sigma_prior` and see what effect it has on the posterior distributions.
After sampling, we want to check that the process worked well, meaning:

- All chains have effectively explored the posterior distribution, and
- Successive draws within each chain are only weakly correlated (each draw should provide mostly new information).

We can check these properties visually using `plot_trace`:

- On the left, the distribution from each chain should look similar; this is evidence that the chains all converged to the same region of the posterior.
- On the right, the sequence of draws within each chain should look like uncorrelated noise (sometimes called "hairy caterpillars"). It shouldn't look like a random walk, and there shouldn't be flat spots, which would suggest the chain got stuck.
az.plot_trace(idata)
decorate()
To check these properties numerically, we can look at two key summary statistics: ESS and r-hat.

- Effective sample size (ESS) tells us how much independent information the chains actually contribute. If successive values within a chain are highly correlated, the chain isn't exploring the posterior efficiently, and ESS will be smaller than the total number of draws.
- r-hat quantifies the difference between chains. If all chains converged to the same posterior distribution, r-hat should be close to 1. If the chains are exploring different regions, r-hat will be larger than 1, indicating failure to converge.
az.summary(idata)
| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| alpha | -0.460 | 0.010 | -0.478 | -0.442 | 0.0 | 0.0 | 1539.0 | 1017.0 | 1.0 |
| beta | -0.014 | 0.001 | -0.016 | -0.013 | 0.0 | 0.0 | 1673.0 | 1271.0 | 1.0 |
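The plan above mentions comparing with scikit-learn's `LogisticRegression`. As a quick sanity check, here is a sketch (assuming scikit-learn is installed): with weak priors, the maximum-likelihood estimates should land close to the posterior means in the table.

```python
# Sanity check: the MLE should be close to the posterior means above
# (alpha ~ -0.46, beta ~ -0.014), since the priors are weak.
from sklearn.linear_model import LogisticRegression

X = year.reshape(-1, 1)  # centered year, as in the PyMC model
lr = LogisticRegression(penalty=None)  # use penalty="none" on scikit-learn < 1.2
lr.fit(X, y)
print(lr.intercept_[0], lr.coef_[0, 0])
```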
In Bayesian statistics, there are two core objects we care about when analyzing our "fits".

NOTE 1.1: The posterior is accessible to us not as an equation, but as samples from the distribution of interest. This is the "magic" of MCMC, and in our case the canonical result of a call to `pm.sample()`.

NOTE 1.2: The posterior lives in parameter space.

NOTE 2.1: The posterior predictive lives in data space.

Below we show how to get samples from the posterior predictive manually, by extracting our posterior samples and constructing our regression model in `numpy`. Later we will show how to use PyMC directly to produce samples from the posterior predictive.
def get_sample(idata: az.InferenceData, var_names: list[str] | str) -> list[np.ndarray]:
    """Extract posterior samples from idata.

    Args:
        idata (az.InferenceData): InferenceData object from ArviZ that holds our posterior samples as xarray DataArrays
        var_names (list[str] | str): Name(s) of the variable(s) to extract from the posterior

    Returns:
        list[np.ndarray]: One array per variable, with posterior samples stacked across chains and draws
    """
    if isinstance(var_names, str):
        var_names = [var_names]
    return [
        idata.posterior[var_name].stack(sample=["chain", "draw"]).values
        for var_name in var_names
    ]
# Collect posterior draws of alpha and beta
alphas, betas = get_sample(idata, var_names=["alpha", "beta"])
betas.shape
(1500,)
# Sample 100 from the posterior
idx = np.random.choice(len(betas), size=100, replace=False)
alpha_samples = alphas[idx]
beta_samples = betas[idx]
beta_samples.shape
(100,)
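As an aside, recent versions of ArviZ bundle this workflow into `az.extract`, which stacks chains and draws and can subsample in one call. A sketch equivalent to the steps above:

```python
# Equivalent shortcut: stack chains/draws and subsample 100 posterior draws
sample = az.extract(idata, var_names=["alpha", "beta"], num_samples=100)
alpha_samples = sample["alpha"].values
beta_samples = sample["beta"].values
```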
# Range of years to predict
year_range = np.arange(1970, 2031)
year_centered = year_range - year_shift
from scipy.special import expit
for alpha, beta in zip(alpha_samples, beta_samples):
    logits = alpha + beta * year_centered
    probs = expit(logits) * 100
    plt.plot(year_range, probs, color="C0", alpha=0.02)
time_series.plot(style="o", label="observed")
decorate(ylabel="percent", title="Percent saying people can be trusted")
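As promised, PyMC can also generate posterior predictive samples directly, without the `numpy` loop. A sketch of the idea: point the model's data containers at the years we want predictions for, then call `pm.sample_posterior_predictive`. Resizing the observed variable this way relies on the mutable `pm.Data` containers defined in the model.

```python
# Posterior predictive sampling with PyMC instead of the manual numpy loop.
# The placeholder zeros just set the shape of the predictions.
with logistic_model:
    pm.set_data({"year": year_centered, "y": np.zeros(len(year_centered))})
    ppc = pm.sample_posterior_predictive(idata, var_names=["y_obs"])

# Average over chains and draws to get predicted probabilities
pred = ppc.posterior_predictive["y_obs"].mean(("chain", "draw"))
```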
Because we centered `year`, the sampled slopes and intercepts are uncorrelated.
pm.plot_pair(idata, var_names=["alpha", "beta"])
decorate()
As an experiment, try:

- Set `year_shift=1970` and run the model again. You might get some warnings from the sampler, and `plot_pair` will show that the alphas and betas are correlated. The not-quite-centered predictor stretches the shape of the joint posterior distribution and makes it harder to sample efficiently.
- Set `year_shift=0` and run the model again. With the predictor completely uncentered, the joint posterior distribution is so stretched that the sampler diverges frequently -- and basically fails.

Without centering, the intercept represents the log-odds of the outcome when `year=0`, which is far outside the range of the data. This forces the intercept and slope to compensate for each other: if the slope increases, the intercept must decrease to hit the same observed points.
As a general rule, it's a good idea to center continuous predictors in Bayesian regression models.
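A small helper makes the advice concrete (hypothetical; the notebook just subtracts the mean inline):

```python
# Center a continuous predictor and keep the shift for later predictions
def center(series):
    shift = series.mean()
    return (series - shift).values, shift

year, year_shift = center(data["year"])
```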
value_counts(gss["partyid3"])
| values | counts |
| --- | --- |
| 0.0 | 25439 |
| 1.0 | 26705 |
| 2.0 | 18338 |
| NaN | 1908 |
data = gss.dropna(subset=["y", "year", "partyid3"])
data.shape
(39697, 62)
y = data["y"].values
year = data["year"].values
year_shift = data["year"].mean()
year = year - year_shift
partyid3 = data["partyid3"].astype(int).values
with pm.Model(
    coords={"party_affiliation": ["Democrat", "Independent", "Republican"]}
) as logistic_model2:
    y_pt = pm.Data("y", y)
    year_pt = pm.Data("year", year)
    party_idx_pt = pm.Data("party_idx", partyid3)

    # Party-specific intercepts (one for each party)
    sigma_prior = 1
    mu_prior = 0
    alpha = pm.Normal("alpha", mu_prior, sigma_prior, dims="party_affiliation")

    # Shared slope for year (assuming the effect of year is the same across parties)
    beta = pm.Normal("beta", mu_prior, sigma_prior)

    # Linear predictor (log-odds)
    logits = pm.Deterministic("logits", alpha[party_idx_pt] + beta * year_pt)

    # Bernoulli likelihood with logit link
    y_obs = pm.Bernoulli("y_obs", logit_p=logits, observed=y_pt)

    # Add other quantities of interest
    probabilities = pm.Deterministic("probabilities", pm.math.sigmoid(logits))
pm.model_to_graphviz(logistic_model2)
with logistic_model2:
    idata2 = pm.sample(
        draws=500,
        tune=500,
        chains=3,
        cores=3,
        var_names=["alpha", "beta"],
        # initvals=logistic_model2.initial_point(),
    )
import arviz as az
az.plot_posterior(idata2)
decorate()
az.plot_trace(idata2, compact=False)
decorate()
pm.summary(idata2)
| | mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| alpha[Democrat] | -0.544 | 0.017 | -0.579 | -0.515 | 0.0 | 0.0 | 2340.0 | 1350.0 | 1.0 |
| alpha[Independent] | -0.552 | 0.017 | -0.583 | -0.519 | 0.0 | 0.0 | 2164.0 | 1374.0 | 1.0 |
| alpha[Republican] | -0.214 | 0.020 | -0.252 | -0.179 | 0.0 | 0.0 | 2831.0 | 1498.0 | 1.0 |
| beta | -0.015 | 0.001 | -0.016 | -0.013 | 0.0 | 0.0 | 2667.0 | 1331.0 | 1.0 |
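The table suggests the Republican intercept is noticeably higher than the other two. Because the posterior samples are labeled by the `party_affiliation` coordinate, we can compute a full posterior distribution for that contrast; a sketch:

```python
# Posterior distribution of the Republican-vs-Democrat intercept difference
alpha_post = idata2.posterior["alpha"]
contrast = alpha_post.sel(party_affiliation="Republican") - alpha_post.sel(
    party_affiliation="Democrat"
)
az.plot_posterior(contrast.rename("alpha_diff"), ref_val=0)
```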
# Compute predicted probabilities over the prediction years, for each party
probabilities = {}
with logistic_model2:
    for key_, party_name_ in party_names.items():
        # Assign covariates for which we want to compute the posterior predictive
        pm.set_data(
            {
                "party_idx": np.tile(key_, len(year_centered)),
                "year": year_centered,
            }
        )
        # Compute the posterior predictive
        probabilities[party_name_] = pm.compute_deterministics(
            idata2.posterior, var_names=["probabilities"]
        )
# Base plot per party affiliation
table.plot(color=colors)

# Plot posterior draws for each party affiliation
for key_, party_name_ in party_names.items():
    probabilities_tmp = (
        probabilities[party_name_].probabilities.sel(chain=np.random.choice(3)).values
        * 100
    ).transpose()
    plt.plot(
        year_range,
        probabilities_tmp,
        color=colors[key_],
        alpha=0.002,
    )
decorate(ylabel="percent", title="Percent saying people can be trusted")
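One caveat about the loop above: `pm.set_data` mutates the model's data containers in place, so `logistic_model2` now holds the last party's prediction grid rather than the observed data. If you plan to reuse the model, for example to draw more posterior samples, restore the original data first:

```python
# Restore the observed data after the prediction loop
with logistic_model2:
    pm.set_data({"y": y, "year": year, "party_idx": partyid3})
```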