This notebook is part of Bite Size Bayes, an introduction to probability and Bayesian statistics using Python.

Copyright 2020 Allen B. Downey

License: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

The following cell downloads `utils.py`, which contains some utility functions we'll need.

In [1]:

```
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

download('https://github.com/AllenDowney/BiteSizeBayes/raw/master/utils.py')
```

If everything we need is installed, the following cell should run with no error messages.

In [2]:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
```

So far we have been working with distributions of only one variable. In this notebook we'll take a step toward multivariate distributions, starting with two variables.

We'll use cross-tabulation to compute a **joint distribution**, then use the joint distribution to compute **conditional distributions** and **marginal distributions**.

We will re-use `pmf_from_seq`, which I introduced in a previous notebook.

In [3]:

```
def pmf_from_seq(seq):
    """Make a PMF from a sequence of values.

    seq: sequence

    returns: Series representing a PMF
    """
    pmf = pd.Series(seq).value_counts(sort=False).sort_index()
    pmf /= pmf.sum()
    return pmf
```

To understand joint distributions, I'll start with cross tabulation. And to demonstrate cross tabulation, I'll generate a dataset of colors and fruits.

Here are the possible values.

In [4]:

```
colors = ['red', 'yellow', 'green']
fruits = ['apple', 'banana', 'grape']
```

And here's a random sample of 100 fruits.

In [5]:

```
np.random.seed(2)
fruit_sample = np.random.choice(fruits, 100, replace=True)
```

We can use `pmf_from_seq` to compute the distribution of fruits.

In [6]:

```
pmf_fruit = pmf_from_seq(fruit_sample)
pmf_fruit
```

And here's what it looks like.

In [7]:

```
pmf_fruit.plot.bar(color='C0')
plt.ylabel('Probability')
plt.title('Distribution of fruit');
```

Similarly, here's a random sample of colors.

In [8]:

```
color_sample = np.random.choice(colors, 100, replace=True)
```

Here's the distribution of colors.

In [9]:

```
pmf_color = pmf_from_seq(color_sample)
pmf_color
```

And here's what it looks like.

In [10]:

```
pmf_color.plot.bar(color='C1')
plt.ylabel('Probability')
plt.title('Distribution of colors');
```

Looking at these distributions, we know the proportion of each fruit, ignoring color, and we know the proportion of each color, ignoring fruit type.

But if we only have the distributions and not the original data, we don't know how many apples are green, for example, or how many yellow fruits are bananas.
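To see why the one-variable distributions are not enough, here's a sketch (with made-up counts, not the notebook's random sample) of two different joint tables that have exactly the same row and column totals:

```python
import pandas as pd

# Two hypothetical joint tables with the same marginals (made-up counts)
joint1 = pd.DataFrame([[30, 20], [20, 30]],
                      index=['red', 'green'], columns=['apple', 'grape'])
joint2 = pd.DataFrame([[25, 25], [25, 25]],
                      index=['red', 'green'], columns=['apple', 'grape'])

# Same column totals and same row totals...
print(joint1.sum(axis=0).equals(joint2.sum(axis=0)))  # True
print(joint1.sum(axis=1).equals(joint2.sum(axis=1)))  # True

# ...but different numbers of red apples
print(joint1.loc['red', 'apple'], joint2.loc['red', 'apple'])  # 30 25
```

So knowing the distribution of colors and the distribution of fruits separately does not pin down the joint counts.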

We can compute that information using `crosstab`, which computes the number of cases for each combination of fruit type and color.

In [11]:

```
xtab = pd.crosstab(color_sample, fruit_sample,
                   rownames=['color'], colnames=['fruit'])
xtab
```

The result is a DataFrame with colors along the rows and fruits along the columns.
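Because the result is a DataFrame, you can look up a single cell with `.loc`, indexed by row label and column label. Here's a minimal self-contained sketch using a tiny made-up sample (the counts come from this toy data, not the notebook's random sample):

```python
import pandas as pd

# A tiny made-up sample, just for illustration
color_sample = ['red', 'red', 'green', 'yellow', 'red']
fruit_sample = ['apple', 'grape', 'apple', 'banana', 'apple']

xtab = pd.crosstab(color_sample, fruit_sample,
                   rownames=['color'], colnames=['fruit'])

# Number of red apples: row 'red', column 'apple'
print(xtab.loc['red', 'apple'])  # 2 in this toy sample
```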

The following function plots a cross tabulation as a pseudo-color plot, also known as a heatmap, using the Matplotlib function `pcolormesh`. It represents each element of the cross tabulation with a colored square, where the color corresponds to the magnitude of the element.

In [12]:

```
def plot_heatmap(xtab):
    """Make a heatmap to represent a cross tabulation.

    xtab: DataFrame containing a cross tabulation
    """
    plt.pcolormesh(xtab)

    # label the y axis
    ys = xtab.index
    plt.ylabel(ys.name)
    locs = np.arange(len(ys)) + 0.5
    plt.yticks(locs, ys)

    # label the x axis
    xs = xtab.columns
    plt.xlabel(xs.name)
    locs = np.arange(len(xs)) + 0.5
    plt.xticks(locs, xs)

    plt.colorbar()
    plt.gca().invert_yaxis()

In [13]:

```
plot_heatmap(xtab)
```

A cross tabulation represents the "joint distribution" of two variables, which is a complete description of two distributions, including all of the conditional distributions.

If we normalize `xtab` so the sum of the elements is 1, the result is a joint PMF:

In [14]:

```
joint = xtab / xtab.to_numpy().sum()
joint
```

Each column in the joint PMF is proportional to the conditional distribution of color for the corresponding fruit.

For example, we can select a column like this:

In [15]:

```
col = joint['apple']
col
```

If we normalize it, we get the conditional distribution of color for a given fruit.

In [16]:

```
col / col.sum()
```

Each row of the cross tabulation represents the conditional distribution of fruit for each color.

If we select a row and normalize it, like this:

In [17]:

```
row = xtab.loc['red']
row / row.sum()
```

The result is the conditional distribution of fruit type for a given color.

The following function takes a joint PMF and computes conditional distributions:

In [18]:

```
def conditional(joint, name, value):
    """Compute a conditional distribution.

    joint: DataFrame representing a joint PMF
    name: string name of an axis
    value: value to condition on

    returns: Series representing a conditional PMF
    """
    if joint.columns.name == name:
        cond = joint[value]
    elif joint.index.name == name:
        cond = joint.loc[value]
    return cond / cond.sum()
```

The second argument is a string that identifies which axis we want to select; in this example, `'fruit'` means we are selecting a column, like this:

In [19]:

```
conditional(joint, 'fruit', 'apple')
```

And `'color'` means we are selecting a row, like this:

In [20]:

```
conditional(joint, 'color', 'red')
```

**Exercise:** Compute the conditional distribution of color for bananas. What is the probability that a banana is yellow?

In [21]:

```
# Solution goes here
```

In [22]:

```
# Solution goes here
```

Given a joint distribution, we can compute the unconditioned distribution of either variable.

If we sum along the rows, which is axis 0, we get the distribution of fruit type, regardless of color.

In [23]:

```
joint.sum(axis=0)
```

If we sum along the columns, which is axis 1, we get the distribution of color, regardless of fruit type.

In [24]:

```
joint.sum(axis=1)
```

These distributions are called "marginal" because of the way they are often displayed. We'll see an example later.
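The name comes from the practice of writing the row and column totals in the margins of the table. `pd.crosstab` can do this for you with the `margins` option; here's a sketch using a made-up sample (not the notebook's random data):

```python
import pandas as pd

# Made-up sample, just for illustration
color_sample = ['red', 'yellow', 'red', 'green', 'yellow', 'red']
fruit_sample = ['apple', 'banana', 'apple', 'grape', 'banana', 'grape']

# margins=True adds an 'All' row and column holding the totals
xtab = pd.crosstab(color_sample, fruit_sample,
                   rownames=['color'], colnames=['fruit'],
                   margins=True)
print(xtab)
```

Passing `normalize='all'` as well turns the counts into probabilities, so the `'All'` row and column then hold the marginal distributions.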

As we did with conditional distributions, we can write a function that takes a joint distribution and computes the marginal distribution of a given variable:

In [25]:

```
def marginal(joint, name):
    """Compute a marginal distribution.

    joint: DataFrame representing a joint PMF
    name: string name of an axis

    returns: Series representing a marginal PMF
    """
    if joint.columns.name == name:
        return joint.sum(axis=0)
    elif joint.index.name == name:
        return joint.sum(axis=1)
```

Here's the marginal distribution of fruit.

In [26]:

```
pmf_fruit = marginal(joint, 'fruit')
pmf_fruit
```

And the marginal distribution of color:

In [27]:

```
pmf_color = marginal(joint, 'color')
pmf_color
```

The sum of the marginal PMF is the same as the sum of the joint PMF, so if the joint PMF was normalized, the marginal PMF should be, too.

In [28]:

```
joint.to_numpy().sum()
```

In [29]:

```
pmf_color.sum()
```

However, due to floating point error, the total might not be exactly 1.

In [30]:

```
pmf_fruit.sum()
```
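Because of that, it's safer to test a PMF for approximate normalization rather than exact equality. Here's a sketch using NumPy's `isclose`, with a made-up PMF:

```python
import numpy as np
import pandas as pd

# A made-up PMF, just for illustration
pmf = pd.Series([1, 2, 4], index=['apple', 'banana', 'grape'], dtype=float)
pmf /= pmf.sum()

# Exact comparison with 1 can fail due to floating-point rounding;
# np.isclose checks equality within a small tolerance
print(np.isclose(pmf.sum(), 1))  # True
```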

**Exercise:** The following cells load the data from the General Social Survey that we used in Notebooks 1 and 2.

In [31]:

```
# Load the data file
import os

if not os.path.exists('gss_bayes.csv'):
    !wget https://github.com/AllenDowney/BiteSizeBayes/raw/master/gss_bayes.csv
```

In [32]:

```
gss = pd.read_csv('gss_bayes.csv', index_col=0)
```

As an exercise, you can use this data to explore the joint distribution of two variables:

- `partyid` encodes each respondent's political affiliation, that is, the party they belong to.
- `polviews` encodes their political alignment on a spectrum from liberal to conservative.

The values for `partyid` are:

```
0 Strong democrat
1 Not str democrat
2 Ind,near dem
3 Independent
4 Ind,near rep
5 Not str republican
6 Strong republican
7 Other party
```

The values for `polviews` are:

```
1 Extremely liberal
2 Liberal
3 Slightly liberal
4 Moderate
5 Slightly conservative
6 Conservative
7 Extremely conservative
```

1. Make a cross tabulation of `gss['partyid']` and `gss['polviews']` and normalize it to make a joint PMF.

2. Use `plot_heatmap` to display a heatmap of the joint distribution. What patterns do you notice?

3. Use `marginal` to compute the marginal distributions of `partyid` and `polviews`, and plot the results.

4. Use `conditional` to compute the conditional distribution of `partyid` for people who identify themselves as "Extremely conservative" (`polviews==7`). How many of them are "strong Republicans" (`partyid==6`)?

5. Use `conditional` to compute the conditional distribution of `polviews` for people who identify themselves as "Strong Democrat" (`partyid==0`). How many of them are "Extremely liberal" (`polviews==1`)?

In [33]:

```
# Solution goes here
```

In [34]:

```
# Solution goes here
```

In [35]:

```
# Solution goes here
```

In [36]:

```
# Solution goes here
```

In [37]:

```
# Solution goes here
```

In [38]:

```
# Solution goes here
```

In this notebook we started with cross tabulation, which we normalized to create a joint distribution, which describes the distribution of two (or more) variables and all of their conditional distributions.

We used heatmaps to visualize cross tabulations and joint distributions.

Then we defined `conditional` and `marginal` functions that take a joint distribution and compute conditional and marginal distributions for each variable.

As an exercise, you had a chance to apply the same methods to explore the relationship between political alignment and party affiliation using data from the General Social Survey.

You might have noticed that we did not use Bayes's Theorem in this notebook. In the next notebook we'll take the ideas from this notebook and apply them to Bayesian inference.
