#!/usr/bin/env python
# coding: utf-8

# # Joint Probability

# This notebook is part of [Bite Size Bayes](https://allendowney.github.io/BiteSizeBayes/), an introduction to probability and Bayesian statistics using Python.
#
# Copyright 2020 Allen B. Downey
#
# License: [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/)

# The following cell downloads `utils.py`, which contains some utility functions we'll need.

# In[1]:


from os.path import basename, exists

def download(url):
    # Download the file only if a local copy does not already exist
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

download('https://github.com/AllenDowney/BiteSizeBayes/raw/master/utils.py')


# If everything we need is installed, the following cell should run with no error messages.

# In[2]:


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


# ## Review
#
# So far we have been working with distributions of only one variable. In this notebook we'll take a step toward multivariate distributions, starting with two variables.
#
# We'll use cross tabulation to compute a **joint distribution**, then use the joint distribution to compute **conditional distributions** and **marginal distributions**.
#
# We will re-use `pmf_from_seq`, which I introduced in a previous notebook.

# In[3]:


def pmf_from_seq(seq):
    """Make a PMF from a sequence of values.

    seq: sequence

    returns: Series representing a PMF
    """
    pmf = pd.Series(seq).value_counts(sort=False).sort_index()
    pmf /= pmf.sum()
    return pmf


# ## Cross tabulation
#
# To understand joint distributions, I'll start with cross tabulation. And to demonstrate cross tabulation, I'll generate a dataset of colors and fruits.
#
# Here are the possible values.

# In[4]:


colors = ['red', 'yellow', 'green']
fruits = ['apple', 'banana', 'grape']


# And here's a random sample of 100 fruits.
# In[5]:


np.random.seed(2)
fruit_sample = np.random.choice(fruits, 100, replace=True)


# We can use `pmf_from_seq` to compute the distribution of fruits.

# In[6]:


pmf_fruit = pmf_from_seq(fruit_sample)
pmf_fruit


# And here's what it looks like.

# In[7]:


pmf_fruit.plot.bar(color='C0')

plt.ylabel('Probability')
plt.title('Distribution of fruit');


# Similarly, here's a random sample of colors.

# In[8]:


color_sample = np.random.choice(colors, 100, replace=True)


# Here's the distribution of colors.

# In[9]:


pmf_color = pmf_from_seq(color_sample)
pmf_color


# And here's what it looks like.

# In[10]:


pmf_color.plot.bar(color='C1')

plt.ylabel('Probability')
plt.title('Distribution of colors');


# Looking at these distributions, we know the proportion of each fruit, ignoring color, and we know the proportion of each color, ignoring fruit type.
#
# But if we only have the distributions and not the original data, we don't know how many apples are green, for example, or how many yellow fruits are bananas.
#
# We can compute that information using `crosstab`, which computes the number of cases for each combination of fruit type and color.

# In[11]:


xtab = pd.crosstab(color_sample, fruit_sample,
                   rownames=['color'], colnames=['fruit'])
xtab


# The result is a DataFrame with colors along the rows and fruits along the columns.

# ## Heatmap
#
# The following function plots a cross tabulation using a pseudo-color plot, also known as a heatmap.
#
# It represents each element of the cross tabulation with a colored square, where the color corresponds to the magnitude of the element.
#
# The following function generates a heatmap using the Matplotlib function `pcolormesh`:

# In[12]:


def plot_heatmap(xtab):
    """Make a heatmap to represent a cross tabulation.
    xtab: DataFrame containing a cross tabulation
    """
    plt.pcolormesh(xtab)

    # label the y axis
    ys = xtab.index
    plt.ylabel(ys.name)
    locs = np.arange(len(ys)) + 0.5
    plt.yticks(locs, ys)

    # label the x axis
    xs = xtab.columns
    plt.xlabel(xs.name)
    locs = np.arange(len(xs)) + 0.5
    plt.xticks(locs, xs)

    plt.colorbar()
    plt.gca().invert_yaxis()


# In[13]:


plot_heatmap(xtab)


# ## Joint distribution
#
# A cross tabulation represents the "joint distribution" of two variables, which is a complete description of two distributions, including all of the conditional distributions.
#
# If we normalize `xtab` so the sum of the elements is 1, the result is a joint PMF:

# In[14]:


joint = xtab / xtab.to_numpy().sum()
joint


# Each column in the joint PMF is proportional to the conditional distribution of color for a given fruit.
#
# For example, we can select a column like this:

# In[15]:


col = joint['apple']
col


# If we normalize it, we get the conditional distribution of color, given that the fruit is an apple.

# In[16]:


col / col.sum()


# Similarly, each row of the cross tabulation is proportional to the conditional distribution of fruit for a given color.
#
# If we select a row and normalize it, like this:

# In[17]:


row = xtab.loc['red']
row / row.sum()


# The result is the conditional distribution of fruit type, given that the color is red.

# ## Conditional distributions
#
# The following function takes a joint PMF and computes conditional distributions:

# In[18]:


def conditional(joint, name, value):
    """Compute a conditional distribution.
    joint: DataFrame representing a joint PMF
    name: string name of an axis
    value: value to condition on

    returns: Series representing a conditional PMF
    """
    if joint.columns.name == name:
        cond = joint[value]
    elif joint.index.name == name:
        cond = joint.loc[value]
    else:
        # without this branch, an unknown name would leave `cond` undefined
        raise ValueError(f"{name!r} is not the name of either axis")
    return cond / cond.sum()


# The second argument is a string that identifies which axis we want to select; in this example, `'fruit'` means we are selecting a column, like this:

# In[19]:


conditional(joint, 'fruit', 'apple')


# And `'color'` means we are selecting a row, like this:

# In[20]:


conditional(joint, 'color', 'red')


# **Exercise:** Compute the conditional distribution of color for bananas. What is the probability that a banana is yellow?

# In[21]:


# Solution goes here


# In[22]:


# Solution goes here


# ## Marginal distributions
#
# Given a joint distribution, we can compute the unconditioned distribution of either variable.
#
# If we sum down the rows (`axis=0`), adding up the elements in each column, we get the distribution of fruit type, regardless of color.

# In[23]:


joint.sum(axis=0)


# If we sum across the columns (`axis=1`), adding up the elements in each row, we get the distribution of color, regardless of fruit type.

# In[24]:


joint.sum(axis=1)


# These distributions are called "[marginal](https://en.wikipedia.org/wiki/Marginal_distribution#Multivariate_distributions)" because of the way they are often displayed. We'll see an example later.
#
# As we did with conditional distributions, we can write a function that takes a joint distribution and computes the marginal distribution of a given variable:

# In[25]:


def marginal(joint, name):
    """Compute a marginal distribution.

    joint: DataFrame representing a joint PMF
    name: string name of an axis

    returns: Series representing a marginal PMF
    """
    if joint.columns.name == name:
        return joint.sum(axis=0)
    elif joint.index.name == name:
        return joint.sum(axis=1)
    else:
        # without this branch, an unknown name would silently return None
        raise ValueError(f"{name!r} is not the name of either axis")


# Here's the marginal distribution of fruit.
# In[26]:


pmf_fruit = marginal(joint, 'fruit')
pmf_fruit


# And the marginal distribution of color:

# In[27]:


pmf_color = marginal(joint, 'color')
pmf_color


# The sum of the marginal PMF is the same as the sum of the joint PMF, so if the joint PMF was normalized, the marginal PMF should be, too.

# In[28]:


joint.to_numpy().sum()


# In[29]:


pmf_color.sum()


# However, due to floating point error, the total might not be exactly 1.

# In[30]:


pmf_fruit.sum()


# **Exercise:** The following cells load the data from the General Social Survey that we used in Notebooks 1 and 2.

# In[31]:


# Load the data file, reusing the `download` function defined at the top of the notebook

download('https://github.com/AllenDowney/BiteSizeBayes/raw/master/gss_bayes.csv')


# In[32]:


gss = pd.read_csv('gss_bayes.csv', index_col=0)


# As an exercise, you can use this data to explore the joint distribution of two variables:
#
# * `partyid` encodes each respondent's political affiliation, that is, the party they belong to. [Here's the description](https://gssdataexplorer.norc.org/variables/141/vshow).
#
# * `polviews` encodes their political alignment on a spectrum from liberal to conservative. [Here's the description](https://gssdataexplorer.norc.org/variables/178/vshow).

# The values for `partyid` are:
#
# ```
# 0 Strong democrat
# 1 Not str democrat
# 2 Ind,near dem
# 3 Independent
# 4 Ind,near rep
# 5 Not str republican
# 6 Strong republican
# 7 Other party
# ```

# The values for `polviews` are:
#
# ```
# 1 Extremely liberal
# 2 Liberal
# 3 Slightly liberal
# 4 Moderate
# 5 Slightly conservative
# 6 Conservative
# 7 Extremely conservative
# ```

# 1. Make a cross tabulation of `gss['partyid']` and `gss['polviews']` and normalize it to make a joint PMF.
#
# 2. Use `plot_heatmap` to display a heatmap of the joint distribution. What patterns do you notice?
#
# 3. Use `marginal` to compute the marginal distributions of `partyid` and `polviews`, and plot the results.
#
# 4. Use `conditional` to compute the conditional distribution of `partyid` for people who identify themselves as "Extremely conservative" (`polviews==7`). What fraction of them are "Strong republican" (`partyid==6`)?
#
# 5. Use `conditional` to compute the conditional distribution of `polviews` for people who identify themselves as "Strong democrat" (`partyid==0`). What fraction of them are "Extremely liberal" (`polviews==1`)?

# In[33]:


# Solution goes here


# In[34]:


# Solution goes here


# In[35]:


# Solution goes here


# In[36]:


# Solution goes here


# In[37]:


# Solution goes here


# In[38]:


# Solution goes here


# ## Review
#
# In this notebook we started with cross tabulation, which we normalized to create a joint distribution. A joint distribution describes the distribution of two (or more) variables and all of their conditional distributions.
#
# We used heatmaps to visualize cross tabulations and joint distributions.
#
# Then we defined `conditional` and `marginal` functions that take a joint distribution and compute conditional and marginal distributions for each variable.
#
# As an exercise, you had a chance to apply the same methods to explore the relationship between political alignment and party affiliation using data from the General Social Survey.
#
# You might have noticed that we did not use Bayes's Theorem in this notebook. [In the next notebook](https://colab.research.google.com/github/AllenDowney/BiteSizeBayes/blob/master/11_faceoff.ipynb) we'll take the ideas from this notebook and apply them to Bayesian inference.

# In[ ]:
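# As a quick check on how the pieces in this notebook fit together, here is a minimal, self-contained sketch showing that a joint PMF can be rebuilt from a marginal distribution and the matching conditional distributions. It is an illustration, not part of the original notebook: the variable names (`p_fruit`, `cond_color`, and so on) are hypothetical, and the sample is drawn with `np.random.default_rng`, so the counts differ from the ones shown earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, similar in spirit to the fruit/color example
rng = np.random.default_rng(2)
colors = ['red', 'yellow', 'green']
fruits = ['apple', 'banana', 'grape']
color_sample = rng.choice(colors, 100)
fruit_sample = rng.choice(fruits, 100)

# Cross tabulation, normalized to make a joint PMF
xtab = pd.crosstab(color_sample, fruit_sample,
                   rownames=['color'], colnames=['fruit'])
joint = xtab / xtab.to_numpy().sum()

# Marginal distributions: collapse one axis at a time
p_fruit = joint.sum(axis=0)   # one value per column (fruit)
p_color = joint.sum(axis=1)   # one value per row (color)

# Conditional distributions for every value at once:
# dividing each column by its marginal gives P(color | fruit),
# dividing each row by its marginal gives P(fruit | color)
cond_color = joint.div(p_fruit, axis=1)
cond_fruit = joint.div(p_color, axis=0)

# Multiplying back recovers the joint PMF:
# P(color, fruit) = P(color | fruit) P(fruit) = P(fruit | color) P(color)
rebuilt_from_fruit = cond_color.mul(p_fruit, axis=1)
rebuilt_from_color = cond_fruit.mul(p_color, axis=0)

print(np.allclose(rebuilt_from_fruit.to_numpy(), joint.to_numpy()))
print(np.allclose(rebuilt_from_color.to_numpy(), joint.to_numpy()))
```

# Each column of `cond_color` should match what `conditional(joint, 'fruit', value)` returns for the corresponding fruit, assuming the same `joint`; computing them all in one step is just a convenience.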