Joint Probability

This notebook is part of Bite Size Bayes, an introduction to probability and Bayesian statistics using Python.

Copyright 2020 Allen B. Downey

License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

The following cell downloads utils.py, which contains some utility functions we'll need.

In [1]:
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

download('https://github.com/AllenDowney/BiteSizeBayes/raw/master/utils.py')

If everything we need is installed, the following cell should run with no error messages.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Review

So far we have been working with distributions of only one variable. In this notebook we'll take a step toward multivariate distributions, starting with two variables.

We'll use cross-tabulation to compute a joint distribution, then use the joint distribution to compute conditional distributions and marginal distributions.

We will re-use pmf_from_seq, which I introduced in a previous notebook.

In [3]:
def pmf_from_seq(seq):
    """Make a PMF from a sequence of values.
    
    seq: sequence
    
    returns: Series representing a PMF
    """
    pmf = pd.Series(seq).value_counts(sort=False).sort_index()
    pmf /= pmf.sum()
    return pmf

Cross tabulation

To understand joint distributions, I'll start with cross tabulation. And to demonstrate cross tabulation, I'll generate a dataset of colors and fruits.

Here are the possible values.

In [4]:
colors = ['red', 'yellow', 'green']
fruits = ['apple', 'banana', 'grape']

And here's a random sample of 100 fruits.

In [5]:
np.random.seed(2)
fruit_sample = np.random.choice(fruits, 100, replace=True)

We can use pmf_from_seq to compute the distribution of fruits.

In [6]:
pmf_fruit = pmf_from_seq(fruit_sample)
pmf_fruit

And here's what it looks like.

In [7]:
pmf_fruit.plot.bar(color='C0')

plt.ylabel('Probability')
plt.title('Distribution of fruit');

Similarly, here's a random sample of colors.

In [8]:
color_sample = np.random.choice(colors, 100, replace=True)

Here's the distribution of colors.

In [9]:
pmf_color = pmf_from_seq(color_sample)
pmf_color

And here's what it looks like.

In [10]:
pmf_color.plot.bar(color='C1')

plt.ylabel('Probability')
plt.title('Distribution of colors');

Looking at these distributions, we know the proportion of each fruit, ignoring color, and we know the proportion of each color, ignoring fruit type.

But if we only have the distributions and not the original data, we don't know how many apples are green, for example, or how many yellow fruits are bananas.

We can compute that information using the Pandas function crosstab, which counts the number of cases for each combination of fruit type and color.

In [11]:
xtab = pd.crosstab(color_sample, fruit_sample, 
                   rownames=['color'], colnames=['fruit'])
xtab

The result is a DataFrame with colors along the rows and fruits along the columns.
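
As a quick check (this cell is an addition, not part of the original notebook), we can confirm which variable labels the rows and which labels the columns:

In [ ]:
# Confirm the orientation of the cross tabulation:
# the index holds the colors and the columns hold the fruits.
print(xtab.index.name, list(xtab.index))
print(xtab.columns.name, list(xtab.columns))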

Heatmap

The following function plots a cross tabulation using a pseudo-color plot, also known as a heatmap. It represents each element of the cross tabulation with a colored square, where the color corresponds to the magnitude of the element.

It uses the Matplotlib function pcolormesh:

In [12]:
def plot_heatmap(xtab):
    """Make a heatmap to represent a cross tabulation.
    
    xtab: DataFrame containing a cross tabulation
    """

    plt.pcolormesh(xtab)

    # label the y axis
    ys = xtab.index
    plt.ylabel(ys.name)
    locs = np.arange(len(ys)) + 0.5
    plt.yticks(locs, ys)

    # label the x axis
    xs = xtab.columns
    plt.xlabel(xs.name)
    locs = np.arange(len(xs)) + 0.5
    plt.xticks(locs, xs)
    
    plt.colorbar()
    plt.gca().invert_yaxis()
In [13]:
plot_heatmap(xtab)

Joint Distribution

A cross tabulation represents the "joint distribution" of two variables, which is a complete description of the two variables; from it we can compute all of the conditional and marginal distributions.

If we normalize xtab so the sum of the elements is 1, the result is a joint PMF:

In [14]:
joint = xtab / xtab.to_numpy().sum()
joint
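
Each element of joint is the fraction of the sample with a particular combination of color and fruit. For example, the following cell (added here as an illustration) looks up the joint probability of a red apple:

In [ ]:
# The element in row 'red', column 'apple' is the fraction of the
# sample that is both red and an apple.
joint.loc['red', 'apple']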

Each column in the joint PMF contains the probabilities for one fruit type.

For example, we can select a column like this:

In [15]:
col = joint['apple']
col

If we normalize it, we get the conditional distribution of color for a given fruit.

In [16]:
col / col.sum()

Similarly, each row of the cross tabulation contains the counts for one color.

If we select a row and normalize it, like this:

In [17]:
row = xtab.loc['red']
row / row.sum()

The result is the conditional distribution of fruit type for a given color.
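
Notice that the previous cell normalizes a row of xtab rather than joint; either works, because dividing xtab by its total scales every row by the same constant, and that constant cancels when we normalize. Here's a quick check (added here, not part of the original notebook):

In [ ]:
# Normalizing a row of the counts and the corresponding row of the
# joint PMF gives the same conditional distribution, up to
# floating-point error.
row_counts = xtab.loc['red'] / xtab.loc['red'].sum()
row_joint = joint.loc['red'] / joint.loc['red'].sum()
np.allclose(row_counts, row_joint)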

Conditional distributions

The following function takes a joint PMF and computes conditional distributions:

In [18]:
def conditional(joint, name, value):
    """Compute a conditional distribution.
    
    joint: DataFrame representing a joint PMF
    name: string name of an axis
    value: value to condition on
    
    returns: Series representing a conditional PMF
    """
    if joint.columns.name == name:
        cond = joint[value]
    elif joint.index.name == name:
        cond = joint.loc[value]
    else:
        raise ValueError(f'No axis named {name!r}')
    return cond / cond.sum()

The second argument is a string that identifies which axis we want to select; in this example, 'fruit' means we are selecting a column, like this:

In [19]:
conditional(joint, 'fruit', 'apple')

And 'color' means we are selecting a row, like this:

In [20]:
conditional(joint, 'color', 'red')

Exercise: Compute the conditional distribution of color for bananas. What is the probability that a banana is yellow?

In [21]:
# Solution goes here
In [22]:
# Solution goes here

Marginal distributions

Given a joint distribution, we can compute the unconditioned distribution of either variable.

If we sum along the rows, which is axis 0, we get the distribution of fruit type, regardless of color.

In [23]:
joint.sum(axis=0)

If we sum along the columns, which is axis 1, we get the distribution of color, regardless of fruit type.

In [24]:
joint.sum(axis=1)

These distributions are called "marginal" because of the way they are often displayed. We'll see an example later.
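
In the meantime, here is one way to see where the name comes from. This cell is an addition to the original notebook; it uses the margins and normalize options of pd.crosstab to put the row and column totals, which are the marginal distributions, in the margins of the table:

In [ ]:
# With margins=True, crosstab adds a row and a column of totals;
# with normalize='all', every cell is a probability, so the totals
# in the 'All' row and column are the marginal distributions.
pd.crosstab(color_sample, fruit_sample,
            rownames=['color'], colnames=['fruit'],
            margins=True, normalize='all')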

As we did with conditional distributions, we can write a function that takes a joint distribution and computes the marginal distribution of a given variable:

In [25]:
def marginal(joint, name):
    """Compute a marginal distribution.
    
    joint: DataFrame representing a joint PMF
    name: string name of an axis
    
    returns: Series representing a marginal PMF
    """
    if joint.columns.name == name:
        return joint.sum(axis=0)
    elif joint.index.name == name:
        return joint.sum(axis=1)
    else:
        raise ValueError(f'No axis named {name!r}')

Here's the marginal distribution of fruit.

In [26]:
pmf_fruit = marginal(joint, 'fruit')
pmf_fruit

And the marginal distribution of color:

In [27]:
pmf_color = marginal(joint, 'color')
pmf_color

The sum of the marginal PMF is the same as the sum of the joint PMF, so if the joint PMF was normalized, the marginal PMF should be, too.

In [28]:
joint.to_numpy().sum()
In [29]:
pmf_color.sum()

However, due to floating point error, the total might not be exactly 1.

In [30]:
pmf_fruit.sum()
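
If you want to confirm that a total is close to 1 within floating-point tolerance, np.allclose is one option (this check is an addition, not part of the original notebook):

In [ ]:
# Both marginal PMFs should sum to 1, within floating-point tolerance.
np.allclose(pmf_fruit.sum(), 1), np.allclose(pmf_color.sum(), 1)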

Exercise: The following cells load the data from the General Social Survey that we used in Notebooks 1 and 2.

In [31]:
# Load the data file, using the download function defined above
download('https://github.com/AllenDowney/BiteSizeBayes/raw/master/gss_bayes.csv')
In [32]:
gss = pd.read_csv('gss_bayes.csv', index_col=0)

As an exercise, you can use this data to explore the joint distribution of two variables:

  • partyid encodes each respondent's political affiliation, that is, the party they belong to.

  • polviews encodes their political alignment on a spectrum from liberal to conservative.

The values for partyid are

0   Strong democrat
1   Not str democrat
2   Ind,near dem
3   Independent
4   Ind,near rep
5   Not str republican
6   Strong republican
7   Other party

The values for polviews are:

1   Extremely liberal
2   Liberal
3   Slightly liberal
4   Moderate
5   Slightly conservative
6   Conservative
7   Extremely conservative
  1. Make a cross tabulation of gss['partyid'] and gss['polviews'] and normalize it to make a joint PMF.

  2. Use plot_heatmap to display a heatmap of the joint distribution. What patterns do you notice?

  3. Use marginal to compute the marginal distributions of partyid and polviews, and plot the results.

  4. Use conditional to compute the conditional distribution of partyid for people who identify themselves as "Extremely conservative" (polviews==7). What fraction of them are "Strong republican" (partyid==6)?

  5. Use conditional to compute the conditional distribution of polviews for people who identify themselves as "Strong democrat" (partyid==0). What fraction of them are "Extremely liberal" (polviews==1)?

In [33]:
# Solution goes here
In [34]:
# Solution goes here
In [35]:
# Solution goes here
In [36]:
# Solution goes here
In [37]:
# Solution goes here
In [38]:
# Solution goes here

Review

In this notebook we started with a cross tabulation, which we normalized to create a joint distribution that describes the distribution of two (or more) variables along with all of their conditional distributions.

We used heatmaps to visualize cross tabulations and joint distributions.

Then we defined conditional and marginal functions that take a joint distribution and compute conditional and marginal distributions for each variable.

As an exercise, you had a chance to apply the same methods to explore the relationship between political alignment and party affiliation using data from the General Social Survey.

You might have noticed that we did not use Bayes's Theorem in this notebook. In the next notebook we'll take the ideas from this notebook and apply them to Bayesian inference.

In [ ]: