(correlation)=
```{admonition} See also
:class: seealso

- {cite}`freedman2007statistics`, Chapters 8 and 9
```
The relationship between two quantitative variables can be explored in a scatter plot. You need paired data, like the father and son heights below. Each $(x, y)$ pair is plotted, and you might notice a trend in the shape of the scatter of points.
```{list-table}
:header-rows: 1
:widths: auto

* - Father's Height (inches)
  - Son's Height (inches)
* - 65.0
  - 59.8
* - 63.3
  - 63.2
* - 65.0
  - 63.3
```
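If you want to draw a scatter plot like the one below yourself, here is a minimal sketch using just the three pairs from the table (the full father-son data set isn't reproduced here); it assumes Matplotlib is available.

```python
import matplotlib.pyplot as plt

# The three father-son pairs from the table above
father = [65.0, 63.3, 65.0]
son = [59.8, 63.2, 63.3]

plt.scatter(father, son, color="orange")
plt.xlabel("Father's Height (inches)")
plt.ylabel("Son's Height (inches)")
plt.show()
```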
The relationship is called an association. When the scatter of points slopes upward, that shows a positive association. If the scatter slopes downward, that shows a negative association. A strong association indicates that one variable can be used to predict the other.
{figure}
:width: 90%
:name: fathersonscatter
There is a positive association between a father's height and his son's height. The orange points are for the three pairs shown in the table above.
The variable on the $x$-axis is thought of as the independent variable and the $y$-axis variable is the dependent variable. This language suggests that the $x$ variable is the one influencing $y$: either (1) $y$ is predicted from $x$, or (2) a direction of causality is being hypothesized. In this class, we have learned to be cautious in claiming causality. Go ahead and choose one variable as the independent variable and the other as the dependent variable without too much angst or hesitation. Just be ready to be humbled if you ultimately find that there is no causal relationship or that you got the direction backwards.
For each of the following pairs, which variable would you choose as the independent variable?
{dropdown}
Sleep should be the independent variable, especially given the temporal precedence.
{dropdown}
Either choice is justified. First, let's consider public assistance as the independent variable. You don't have to take a stance on there being a positive or negative relationship. Statistician Udny Yule considered spending on public assistance as the independent variable and poverty as the dependent variable in his "investigation into the causes of changes in pauperism in England," finding a positive relationship ({cite}`yule1899investigation`). This is the association you would expect if you believe that assistance promotes dependency and increases poverty. Public assistance might also break cycles of poverty, leading to a negative relationship. That view can still be compatible with Yule's data once you consider the other direction of the relationship.
Societies that experience more poverty might spend more on public assistance. Given that spending responds to poverty levels, it can also be appropriate to consider poverty as the independent variable.
{dropdown}
Either choice is justified. {cite}`monk2021beholding` considers attractiveness as the independent variable and earnings as the dependent variable in examining the "returns to physical attractiveness." However, as the figure below suggests, the effect can go both ways.
```{figure} images/TomBradyIsHandsome.png
:width: 61%
```
The correlation coefficient, denoted $r$, is a units-free measure of linear association, or clustering around a line. The coefficient falls between $-1$ and $1$, and it doesn't matter which of the two variables is treated as the independent or dependent variable. The correlation coefficient shouldn't be confused with the slope of a trend line; it measures the tightness of the scatter around some line. The correlation will be $1$ whether $x$ and $y$ are perfectly linearly related with a slope of 0.2 or a slope of 200. A correlation coefficient of $1$ or $-1$ means that one of the two variables perfectly predicts the other in the data. More moderate correlation coefficients indicate a less predictive relationship.
{figure}
:width: 89%
:name: slopecorrcomparison
The trend line on the left has a greater slope but the correlation coefficient is lower because $y$ is less predictable given $x$.
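As a quick check on these properties, here is a minimal sketch (with made-up height data) using NumPy's `np.corrcoef`: swapping the roles of $x$ and $y$ leaves $r$ unchanged, and a perfect line gives $r = 1$ whether the slope is 0.2 or 200.

```python
import numpy as np

# Made-up paired data for illustration
x = np.array([61.0, 63.5, 65.0, 67.2, 70.1, 72.4])
y = np.array([62.3, 63.0, 64.8, 66.0, 69.5, 70.2])

print(np.corrcoef(x, y)[0, 1])   # r with x as the independent variable
print(np.corrcoef(y, x)[0, 1])   # the same r with the roles swapped

# A perfect linear relationship gives r = 1 regardless of the slope
print(np.corrcoef(x, 0.2 * x + 3)[0, 1])
print(np.corrcoef(x, 200 * x + 3)[0, 1])
```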
You can drag the data points around and delete individual points in the plot below to see how the correlation coefficient responds, and notice some of its limitations.
```{code-cell} ipython3
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, PointDrawTool, CustomJS, Div, HoverTool
from bokeh.layouts import column
from IPython.display import display, HTML
# Display plot within a notebook
output_notebook(hide_banner = True)
# Sample data
data = {'x': [1, 2, 3, 4, 5], 'y': [1, 2, 3, 4, 5]}
# Create ColumnDataSource
source = ColumnDataSource(data)
# Create a figure
width = 450
height = width #int((9/16)*width)
p = figure(x_range=(0, 6), y_range=(0, 6), tools="save", height=height, width=width)
# Add scatter glyph for the data points
c1 = p.scatter('x', 'y', source=source, size=20, color="navy", alpha=0.5)
# Display for correlation
div = Div(text="Correlation coefficient: 1", width=200, height=30)
# CustomJS for updating the correlation coefficient dynamically
callback = CustomJS(args=dict(source=source, div=div), code="""
const data = source.data;
const x = data['x'];
const y = data['y'];
const n = x.length;
let mean_x = 0;
let mean_y = 0;
for (let i = 0; i < n; i++) {
mean_x += x[i];
mean_y += y[i];
}
mean_x /= n;
mean_y /= n;
let num = 0;
let den_x = 0;
let den_y = 0;
for (let i = 0; i < n; i++) {
num += (x[i] - mean_x) * (y[i] - mean_y);
den_x += (x[i] - mean_x) ** 2;
den_y += (y[i] - mean_y) ** 2;
}
const corr = num / Math.sqrt(den_x * den_y);
div.text = "Correlation coefficient: " + corr.toFixed(4);
""")
# Add PointDrawTool to the figure
tool = PointDrawTool(renderers=[c1], add=True)
p.add_tools(tool)
p.toolbar.active_tap = tool # Activate the draw tool
# Add HoverTool to the figure
hover = HoverTool(renderers=[c1], tooltips=[("X", "@x"), ("Y", "@y")])
p.add_tools(hover)
# Attach the callback to the source
source.js_on_change('data', callback)
# Define CSS to center the output
style = """
<style>
.output {
display: flex;
align-items: center;
justify-content: center;
}
</style>
"""
# Apply the style
display(HTML(style))
# Show the plot and the correlation coefficient
show(column(div, p))
```
Below, you can adjust a random noise parameter and a slope parameter and observe how the correlation coefficient changes. The $y$-value is calculated as $\text{slope} \times (x + \text{noise})$.
Notice a few things.
```{code-cell} ipython3
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource, Slider, CustomJS, Div
from bokeh.layouts import column, row
import numpy as np
np.random.seed(1)
output_notebook(hide_banner = True)
# Generate 30 x values
x = np.linspace(-15, 16, 31)
y = x # Initial y values, assuming initial slope = 1 and no noise
data={'x': x, 'y': y}
noise_end = 100
noise_step = 1
noises = dict()
for i in range(0, noise_end + 1):
    noises[i] = np.random.normal(0, i, size=len(x))
    data[f'noise{i}'] = noises[i]
source = ColumnDataSource(data = data)
# Create the figure
p = figure(x_range=(-17, 17), y_range=(-50, 50), width=450, height=450) #, title="Interactive Scatter Plot")
p.scatter('x', 'y', source=source, size=10, color="teal", alpha=0.99)
# Slider for slope
slope_slider = Slider(start=-10, end=10, value=1, step=0.1, title="Slope")
# Slider for noise
noise_slider = Slider(start=0, end=noise_end, value=0, step=noise_step, title="Noise")
# Div for displaying the correlation coefficient
div = Div(text="Correlation coefficient: 1.00", width=400, height=30)
# CustomJS callback to update the plot according to the sliders
callback = CustomJS(args=dict(source=source, slope=slope_slider, noise=noise_slider, div=div), code="""
const data = source.data;
const x = data['x'];
const m = slope.value;
const noiseKey = 'noise' + noise.value; // Correct key for noise data
const noiseArray = data[noiseKey]; // Access the correct noise array
const y = data['y'];
// Apply slope and noise to y values
for (let i = 0; i < x.length; i++) {
y[i] = m * (x[i] + noiseArray[i]);
}
// Calculate correlation coefficient
let mean_x = 0;
let mean_y = 0;
let num = 0;
let den_x = 0;
let den_y = 0;
for (let i = 0; i < x.length; i++) {
mean_x += x[i];
mean_y += y[i];
}
mean_x /= x.length;
mean_y /= x.length;
for (let i = 0; i < x.length; i++) {
num += (x[i] - mean_x) * (y[i] - mean_y);
den_x += (x[i] - mean_x) ** 2;
den_y += (y[i] - mean_y) ** 2;
}
const r = num / Math.sqrt(den_x * den_y);
div.text = "Correlation coefficient: " + r.toFixed(4);
source.change.emit();
""")
slope_slider.js_on_change('value', callback)
noise_slider.js_on_change('value', callback)
# Layout
layout = column(column(noise_slider, slope_slider, div), p)
# Show
show(layout)
```
The coefficient, $r$, is calculated as the average of the products of the standardized values,
$$r = \frac{1}{n} \sum_{i=1}^{n} \frac{x_i - \text{average}_x}{\text{SD}_x} \times \frac{y_i - \text{average}_y}{\text{SD}_y}.$$

The product of two standardized values is positive when the data move together: if $x$ is above its average when $y$ is above its average, the product increases the correlation. If the two variables move in opposite directions, with $x$ above its average when $y$ is below its average, the product is negative and decreases the correlation coefficient.
{figure}
:width: 89%
:name: correlationQuadrants
Points in the positive quadrants, where both $x$ and $y$ are above or below the average, push the correlation coefficient up. The data on the left is negatively correlated. The data on the right is positively correlated.
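As a sanity check on the formula, here is a sketch (with made-up numbers) that computes $r$ by averaging the products of standardized values and compares it to NumPy's built-in `np.corrcoef`. Note that `np.std` divides by $n$ by default, which matches the SD used above.

```python
import numpy as np

# Made-up paired data for illustration
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 2.0, 6.0, 8.0])

# Standardize each variable: subtract the average, divide by the SD
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

# r is the average of the products of the standardized values
r = np.mean(zx * zy)

print(round(r, 4))
print(round(np.corrcoef(x, y)[0, 1], 4))  # same value from NumPy
```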
Consider the eight $x$-$y$ pairs in the table below.
{figure}
:width: 44%
:name: correlationExample
First, we must standardize the values. In this case, the data is already standardized because the average is zero and the SD is one.
$$\text{SD}_x = \text{SD}_y = \sqrt{ \frac{1}{8}\left( (-2-0)^2 + 0 + \cdots + 0 + (2-0)^2 \right) } = 1$$

Then, the correlation coefficient is calculated by averaging the products of the paired standardized values,

$$r = \frac{1}{8} \dfrac{ (-2-0)(-2-0) + (0-0)(2-0) + \cdots + (2-0)(0-0) }{1\times 1}$$

$$r = \frac{1}{2}.$$

What if we considered the alternate data set below?
{figure}
:width: 44%
:name: correlationExampleDouble
{dropdown}
Now, $\text{SD}_x$ is 2 and $\text{SD}_y$ is unchanged.
$$\text{SD}_x = \sqrt{ \frac{1}{8}\left( (-4-0)^2 + (4-0)^2 \right) } = \sqrt{4} = 2$$
$$r = \frac{1}{8} \dfrac{ (-4-0)(-2-0) + (0-0)(2-0) + \cdots + (4-0)(0-0) }{2\times 1}$$
$$ r = \frac{1}{2}$$
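The same point can be checked numerically: multiplying one variable by a constant rescales its SD by the same factor, so $r$ is unchanged. A minimal sketch with illustrative data (not the data set in the table):

```python
import numpy as np

# Illustrative pairs (not the data set pictured above)
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([-2.0, 2.0, 0.0, -1.0, 1.0])

print(np.std(x), np.std(2 * x))       # the SD doubles along with the data
print(np.corrcoef(x, y)[0, 1])        # r for the original pairs
print(np.corrcoef(2 * x, y)[0, 1])    # r is unchanged after doubling x
```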
Per {cite}`freedman2007statistics`, the correlation coefficient is useful for "football-shaped scatter diagrams."
{figure}
:width: 22%
:name: AmericanFootball
A football-shaped football.
This is meant to rule out anything obviously non-linear. A surprisingly precise football shape arises when $x$ and $y$ both follow a normal curve, like in {numref}`footballScatter`.
{figure}
:width: 75%
:name: footballScatter
A football-shaped scatter diagram. The football shape arises from the bell-shaped distributions for both variables. The tapering comes from there being fewer values at the extreme $x$ and $y$ values.
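If you'd like to generate a football-shaped scatter yourself, here is a rough sketch that draws correlated pairs from a bivariate normal distribution; the means, SDs, and correlation of 0.6 are arbitrary choices, and Matplotlib is assumed for a quick static plot.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Correlated pairs from a bivariate normal distribution
# (means 0, SDs 1, correlation 0.6 -- illustrative values)
cov = [[1.0, 0.6], [0.6, 1.0]]
points = rng.multivariate_normal(mean=[0, 0], cov=cov, size=500)

plt.scatter(points[:, 0], points[:, 1], alpha=0.4)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
```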
Anscombe's Quartet is a set of four data sets that demonstrate the limitations of a correlation coefficient.
{figure}
:width: 70%
:name: AnscombeScatter
The top right panel shows an obvious pattern, and one stronger than the pattern in the top left, but the correlation coefficient rates the association at 0.82 in either case. This is because the correlation coefficient only measures the linear association.
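To see this numerically, here is a short sketch using the published values for Anscombe's first two data sets; both correlations come out near 0.82 even though the second scatter is clearly curved.

```python
import numpy as np

# Anscombe's quartet, data sets I and II (the x values are shared)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

print(round(np.corrcoef(x, y1)[0, 1], 2))  # about 0.82, roughly linear scatter
print(round(np.corrcoef(x, y2)[0, 1], 2))  # about 0.82, clearly curved scatter
```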
The phrase "correlation does not imply causation" is often uttered in recognition that a correlation between two variables doesn't mean one causes the other. Correlation only quantifies if there is a predictable linear relationship between two variables.
When two variables are highly correlated without there being a causal relationship, that is often called a spurious correlation. Spurious correlations arise in observational data when there is a confounding third variable. Michael Luca gives a few examples of misinterpreted correlations in his Harvard Business Review article, "Leaders: Stop Confusing Correlation with Causation."
One of Luca's examples is a classic case of spurious correlation.
> [Economists at eBay] analyzed natural experiments and conducted a new randomized controlled trial, and found that these ads were largely a waste, despite what the marketing team previously believed. The advertisements were targeting people who were already likely to shop on eBay. The targeted customers’ pre-existing purchase intentions were responsible for both the advertisements being shown and the purchase decisions. eBay’s marketing team made the mistake of underappreciating this factor, and instead assuming that the observed correlation was a result of advertisements causing purchases.
{figure}
:width: 80%
:name: ebayDag
Ads don't lead to higher sales if the people who receive targeted ads were targeted for their interest in the product and would have spent more regardless.
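To make the eBay example concrete, here is a small simulation with entirely made-up numbers: a shopper's pre-existing purchase intent drives both whether they are shown an ad and how much they spend, while the ad itself has zero causal effect. Ad exposure still ends up correlated with spending.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Made-up confounder: each shopper's pre-existing purchase intent
intent = rng.normal(size=n)

# Targeting: shoppers with higher intent are more likely to be shown an ad
saw_ad = (intent + rng.normal(size=n) > 0).astype(float)

# Spending depends on intent plus noise; the ad has zero causal effect
spending = 50 + 20 * intent + rng.normal(scale=10, size=n)

# Yet ad exposure and spending are clearly correlated
print(np.corrcoef(saw_ad, spending)[0, 1])
```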
So, we know that correlation does not imply causation. Less appreciated sometimes is that no correlation need not imply no causation. Luca gives the example of police spending and crime, highlighting a chicken and egg problem in any observational study.
> A 2020 Washington Post article examined the correlation between police spending and crime. It concluded that, "A review of spending on state and local police over the past 60 years ... shows no correlation nationally between spending and crime rates." This correlation is misleading. An important driver of police spending is the current level of crime, which creates a chicken and egg scenario. Causal research has, in fact, shown that more police lead to a reduction in crime.
For a similar example, consider the relationship between public assistance and poverty studied by Yule in the 1800s. Yule found a positive correlation between poverty and public assistance. A naive read of the data might suggest that increasing public assistance will cause poverty to go up, but there's nothing in the correlation by itself that should suggest this because of the same chicken and egg problem.
{figure}
:width: 95%
:name: yulePolice
In both examples, the correlation, whether positive, negative, or zero, can be misleading about the causal effect.
{exercise-start}
:label: corrPairing
A movie theater clerk wants to know the correlation between daily ticket sales for Matt Damon and Mark Wahlberg movies. He records the following data over three days. Find the correlation.
Day of Week | Matt Damon | Mark Wahlberg |
---|---|---|
Monday | 0 | 3 |
Tuesday | 0 | 0 |
Wednesday | 6 | 3 |
The clerk finds he mixed up Matt Damon and Wahlberg. This is the actual data, below. Does the correlation change?
Day of Week | Mark Wahlberg | Matt Damon |
---|---|---|
Monday | 0 | 3 |
Tuesday | 0 | 0 |
Wednesday | 6 | 3 |
Oops! The clerk messed up again. It turns out he had already caught the mix-up by Wednesday, so Wednesday's numbers were recorded correctly the first time. This is the actual data, below. Does the correlation coefficient change?
Day of Week | Mark Wahlberg | Matt Damon |
---|---|---|
Monday | 0 | 3 |
Tuesday | 0 | 0 |
Wednesday | 3 | 6 |
{exercise-end}
{exercise-start}
:label: corrMental
A journalist is investigating administrative bloat as colleges hire more and more non-faculty employees. In particular, the journalist focuses on mental health professionals. They study many campuses and find no correlation between money spent on mental health and a trusted measure of student mental health. The journalist writes an article accusing schools of wasting money and receives an award for data-driven journalism. Is the award merited? Why or why not?
{exercise-end}