Correlation and covariance from scratch

In this post we take a closer look at covariance and correlation.

We will use them to examine the relationship between Ethereum transaction value and gas price.

As before, we mostly break the steps down into standard Python data types and operations, and use numpy mainly to verify our results.

We pull the data from Google's public datasets with BigQuery, use pandas and numpy to manipulate it, and altair to plot their relationship.

In [1]:
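import os

from google.cloud import bigquery

# Assumes Google Cloud credentials are already configured in the environment.
client = bigquery.Client()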

import altair as alt
alt.data_transformers.disable_max_rows()

import numpy as np
import pandas as pd
In [2]:
query ="""
SELECT
EXTRACT(DATE FROM block_timestamp) AS date,
AVG(value) AS value,
AVG(gas_price) AS gas_price,
FROM bigquery-public-data.ethereum_blockchain.transactions
WHERE
EXTRACT(YEAR FROM block_timestamp) = 2019
GROUP BY date
ORDER BY date
"""
In [3]:
transactions = client.query(query).to_dataframe(dtypes={'value': float, 'gas_price': float}, date_as_object=False)
Out[3]:
date value gas_price
0 2019-01-01 3.719103e+18 1.431514e+10
1 2019-01-02 4.649915e+18 1.349952e+10
2 2019-01-03 4.188781e+18 1.269504e+10
3 2019-01-04 6.958368e+18 1.418197e+10
4 2019-01-05 8.167590e+18 2.410475e+10

On a few days gas prices were unusually high, so we remove the days whose average gas price lies three or more standard deviations above the mean.

Outliers

In [5]:
# Selections that track the point under the mouse along each axis
labelx = alt.selection_single(
    encodings=['x'],
    on='mouseover',
    empty='none'
)

labely = alt.selection_single(
    encodings=['y'],
    on='mouseover',
    empty='none'
)

ruler = alt.Chart().mark_rule(color='darkgray')

chart = alt.Chart().mark_point().encode(
    alt.X('value', axis=alt.Axis(format=(',.2e'))),
    alt.Y('gas_price', axis=alt.Axis(format=(',.2e'))),
    alt.Tooltip(['value', 'gas_price', 'date'])
).add_selection(labelx, labely)  # register the selections the rulers filter on

alt.layer(
    chart,
    ruler.encode(x='value:Q').transform_filter(labelx),
    ruler.encode(y='gas_price:Q').transform_filter(labely),
    data=transactions
).interactive()
Out[5]:
In [4]:
transactions = transactions[~(transactions['gas_price'] >= transactions['gas_price'].mean() + 3 * transactions['gas_price'].std())]
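In keeping with the from-scratch theme, the same filter can be sketched without pandas. This is only an illustration, not the notebook's method: `drop_high_outliers`, the `Row` layout of `(value, gas_price)` tuples, and the sample rows are all hypothetical.

```python
import math
from typing import List, Tuple

Row = Tuple[float, float]  # (value, gas_price); hypothetical row layout

def drop_high_outliers(rows: List[Row], k: float = 3.0) -> List[Row]:
    """Keep rows whose gas_price is below mean + k sample standard deviations."""
    prices = [gas_price for _, gas_price in rows]
    n = len(prices)
    price_mean = sum(prices) / n
    # Sample standard deviation (divides by n - 1), like pandas' default ddof=1
    price_std = math.sqrt(sum((p - price_mean) ** 2 for p in prices) / (n - 1))
    cutoff = price_mean + k * price_std
    return [row for row in rows if row[1] < cutoff]

# Made-up data: twenty ordinary days and one extreme day
rows = [(1.0, 10.0)] * 20 + [(1.0, 1000.0)]
filtered = drop_high_outliers(rows)
print(len(rows), len(filtered))
```

Note that the filter is one-sided: only unusually high gas prices are dropped, which mirrors the pandas expression above.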
In [6]:
values = transactions['value']
gas_prices = transactions['gas_price']

As we emphasize standard operations, we use a few helper functions in the steps leading to covariance and correlation.

Helper functions

In [7]:
from typing import Union, List

Vector = List[float]
In [8]:
def dot(vector1: Vector, vector2: Vector) -> float:
    assert len(vector1) == len(vector2)

    return sum(v1 * v2 for v1, v2 in zip(vector1, vector2))

assert dot([1, 2, 3], [4, 5, 6]) == 32

def mean(x: Vector) -> float:
    return sum(x) / len(x)

assert mean([1, 2, 3, 4]) == 2.5

def de_mean(xs: Vector) -> Vector:
    x_mean = mean(xs)
    return [x - x_mean for x in xs]

assert de_mean([4, 5, 6, 7, 8]) == [-2, -1, 0, 1, 2]

def sum_of_squares(xs: Vector) -> float:
    return dot(xs, xs)

assert sum_of_squares([1, 2, 3]) == 14

def variance(xs: Vector) -> float:
    # Sample variance: divide by n - 1 rather than n
    return sum_of_squares(de_mean(xs)) / (len(xs) - 1)

assert variance([1, 2, 3]) == 1

import math as m

def standard_deviation(xs: Vector) -> float:
    return m.sqrt(variance(xs))

assert standard_deviation([4, 5, 6]) == 1

Covariance measures the degree to which two variables 'move together'.

First, for each observation, it multiplies the two variables' deviations from their respective means. The products are large in magnitude for observations where both variables deviate a lot. They are positive when the two variables deviate in the same direction and negative otherwise.

Then, it averages these products. However, because we are calculating the sample covariance, we divide their sum by $n - 1$ rather than $n$ (where $n$ is the number of observations):

$\text{Cov} = \frac { \sum_{i=1}^n (x_i-\bar{x}) (y_i-\bar{y})} {n - 1}$
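To make the sign behaviour concrete, here is a small worked example on made-up numbers (the lists `xs` and `ys` are hypothetical and chosen so that the arithmetic is exact). It prints the per-observation products of deviations before averaging them.

```python
# Made-up observations; both lists have a mean of 3.0
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 4.0, 3.0, 2.0, 5.0]

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# One product per observation: positive when x and y deviate in the same direction
products = [(x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)]
print(products)    # [4.0, -1.0, 0.0, -1.0, 4.0]

# Sample covariance: sum of the products divided by n - 1
sample_cov = sum(products) / (len(xs) - 1)
print(sample_cov)  # 1.5
```

The two negative products come from the observations where one variable is above its mean while the other is below; they pull the covariance down, but the large positive products at the extremes dominate.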

Covariance

In [9]:
def covariance(xs: Vector, ys: Vector) -> float:
    assert len(xs) == len(ys)

    return dot(de_mean(xs), de_mean(ys)) / (len(xs) - 1)

assert covariance([1, 2, 3], [4, 5, 6]) == 1

There is also an alternative way to calculate covariance, using the variables' expected values (which here are the means):

$\text{Cov} = E[\vec{x}\vec{y}] - E[\vec{x}]E[\vec{y}]$

This is a much simpler version. However, as we are again dealing with sample data, we need to adjust for that:

$\text{Cov}_s = \frac {n} {n - 1} (E[\vec{x}\vec{y}] - E[\vec{x}]E[\vec{y}])$
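A short derivation, sketched here for completeness, shows where the factor comes from. It assumes $E[\cdot]$ denotes the plain average over the $n$ observations, so that $E[\vec{x}] = \bar{x}$ and $E[\vec{y}] = \bar{y}$. Expanding the sum of deviation products:

$\sum_{i=1}^n (x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^n x_i y_i - n\bar{x}\bar{y} = n (E[\vec{x}\vec{y}] - E[\vec{x}]E[\vec{y}])$

Dividing both sides by $n - 1$ gives the sample covariance on the left and the $\frac{n}{n - 1}$ adjustment on the right.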

In [10]:
covariance(values, gas_prices)
Out[10]:
1.5856518696847875e+26
In [11]:
def covariance_2(xs: Vector, ys: Vector) -> float:
    xsys = [x * y for x, y in zip(xs, ys)]
    return (mean(xsys) - mean(xs) * mean(ys)) * len(xs) / (len(xs) - 1)

assert np.isclose(covariance_2([1, 2, 3], [4, 5, 6]), 1)

We also verify our method with numpy.

In [13]:
np.cov(values, gas_prices)[0, 1]
Out[13]:
1.5856518696847882e+26
In [12]:
covariance_2(values, gas_prices)
Out[12]:
1.585651869684888e+26
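The three results agree only to about 13 significant digits. Small differences like this are expected in floating-point arithmetic, and the expected-value formula is the more fragile of the two because it subtracts two large, nearly equal numbers. The sketch below exaggerates the effect on made-up data with a tiny spread on top of a huge offset; the function names `two_pass_cov` and `expectation_cov` are hypothetical stand-ins for the two approaches.

```python
from typing import List

def two_pass_cov(xs: List[float], ys: List[float]) -> float:
    """Sample covariance from products of deviations (means computed first)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    return sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / (n - 1)

def expectation_cov(xs: List[float], ys: List[float]) -> float:
    """Sample covariance via E[xy] - E[x]E[y], scaled by n / (n - 1)."""
    n = len(xs)
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    return (mean_xy - (sum(xs) / n) * (sum(ys) / n)) * n / (n - 1)

# Made-up data: the exact sample covariance is 1
xs = [1e9 + 1, 1e9 + 2, 1e9 + 3]
ys = [1e9 + 1, 1e9 + 2, 1e9 + 3]

stable = two_pass_cov(xs, ys)
unstable = expectation_cov(xs, ys)
print(stable, unstable)
```

With these inputs the deviation-based version recovers the exact answer, while the expected-value version loses it to rounding, since the products are around $10^{18}$ and doubles cannot resolve differences of order 1 at that magnitude.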

Because the value of covariance depends on the units of the variables, it is often hard to interpret and to compare with other covariances.

This is why correlation is often preferred: it divides the covariance by the product of the variables' standard deviations. As a result, the value is bounded to the interval $[-1, 1]$, which makes it comparable with other correlation values.

$\text{Corr(x, y)} = \frac { \text{Cov(x, y)} } {\text{Std(x)} \text{Std(y)}}$
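The effect of units can be illustrated with a small self-contained sketch. The data are made up, and `sample_cov` and `sample_corr` are hypothetical compact versions of the formulas above; the second x series is the same data expressed in a unit 1000 times smaller.

```python
import math
from typing import List

def sample_cov(xs: List[float], ys: List[float]) -> float:
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    return sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / (n - 1)

def sample_corr(xs: List[float], ys: List[float]) -> float:
    """Covariance divided by the product of the standard deviations."""
    std_x = math.sqrt(sample_cov(xs, xs))  # variance is the covariance with itself
    std_y = math.sqrt(sample_cov(ys, ys))
    return sample_cov(xs, ys) / (std_x * std_y)

# Made-up measurements
x_units = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]
x_milli = [x * 1000 for x in x_units]  # same data, different unit

cov_units, cov_milli = sample_cov(x_units, y), sample_cov(x_milli, y)
corr_units, corr_milli = sample_corr(x_units, y), sample_corr(x_milli, y)
print(cov_units, cov_milli)    # covariance scales with the unit
print(corr_units, corr_milli)  # correlation does not
```

Rescaling one variable multiplies the covariance by the same factor, whereas the correlation stays at roughly 0.6 in both cases because the standard deviation in the denominator is rescaled as well.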

Correlation

In [14]:
def correlation(xs: Vector, ys: Vector) -> float:
    return covariance(xs, ys) / (standard_deviation(xs) * standard_deviation(ys))

assert np.isclose(correlation([.1, .2, .3], [400, 500, 600]), 1)
In [16]:
correlation(values, gas_prices)
Out[16]:
0.035069533929694634

Finally, we verify the result with pandas.

In [17]:
values.corr(gas_prices)
Out[17]:
0.03506953392969465