import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
from scipy import stats
plt.style.use("fivethirtyeight")
import pandas as pd
Before making any inferences from a dataset, it is crucial that we understand its features by 'describing' them, either by plotting or by computing numerical summaries.
In this chapter we will refresh the most common descriptive statistics; if you work with data on a daily basis, none of the concepts here should sound strange to you.
Furthermore, we will not use complex programming techniques such as OOP or sophisticated data structures. All the programmes should be self-explanatory to most audiences.
Strictly speaking, a frequency distribution and a histogram are different descriptive tools, although they deliver largely the same information. Frequency distributions are usually presented in a table; for instance, rolling a die $1000$ times might give us a frequency distribution table like the following:
Sides | Frequency |
---|---|
1 | 172 |
2 | 158 |
3 | 170 |
4 | 158 |
5 | 187 |
6 | 155 |
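Such a table can also be produced directly in code; here is a small sketch using `np.unique` (the counts will differ from the table above because each simulation is random):
rolls = np.random.randint(1, 7, 1000)  # simulate 1000 die rolls
sides, counts = np.unique(rolls, return_counts=True)
print(pd.DataFrame({"Sides": sides, "Frequency": counts}))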
Or we can draw a frequency histogram, which simply converts the information in the table into a graph.
rollings = np.random.randint(1, 7, 1000)
fig, ax = plt.subplots(figsize=(9, 9))
n, bins, patches = ax.hist(rollings, bins=6)
ax.set_title("Frequency Histogram of 1000 Rolls of a Die", size=19)
ax.set_xlim(0, 7)
ax.set_ylim(0, 400)
plt.show()
Next we generate an array from a standard normal distribution $x\sim N(0, 1)$, then plot the histogram. Note that we add the parameter `density=True` in the `hist` method; this turns it into a relative frequency histogram, because the $y$-axis now represents relative frequencies (densities) rather than raw counts.
x = np.random.randn(1000)
fig, ax = plt.subplots(figsize=(9, 9))
n, bins, patches = ax.hist(x, bins=50, density=True)
Careful readers might have noticed that, while plotting the histogram, we also get three variables back: `n`, `bins` and `patches`. `n` holds the (relative) frequency counts, `bins` holds the bin edges, and `patches` is the list of patch objects drawn in the plot.
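Since we set `density=True`, the bar heights times the bin widths sum to one; a quick check (not in the original notes):
print(np.sum(n * np.diff(bins)))  # total area of the bars, ~1.0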
The first chart below shows the same information as the histogram, but with a line plot rather than bars. The second one is the empirical cumulative distribution; it is called empirical because it is constructed from the data rather than plotted theoretically from a cumulative distribution function (CDF).
fig, ax = plt.subplots(nrows=2, ncols=1, figsize=(9, 9))
ax[0].plot(bins[:-1], n)  # frequency polygon over the left bin edges
ax[1].plot(bins[1:], np.cumsum(n * np.diff(bins)))  # empirical CDF: cumulative bar areas rise to 1
plt.show()
The most important mean is the arithmetic mean, which is prevalent in all kinds of statistical techniques. $$ \mu = \frac{1}{N}\sum_{i=1}^Nx_i\\ \bar{x} = \frac{1}{n}\sum_{i=1}^nx_i $$ The former is the population mean, the latter the sample mean. The formulae look the same, but the notation differs: $N$ is the population size, imagine the number of all human beings on earth, whereas $n$ is the sample size, for instance a sample of $1000$ persons from the UK.
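As a minimal illustration (the small array below is made up purely for demonstration), the sample mean is just the sum of the observations divided by the sample size:
data = np.array([1.0, 2.0, 3.0, 4.0])
print(data.sum() / len(data), np.mean(data))  # both give 2.5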
One trickier mean, commonly used in finance, is the geometric mean, which is not so intuitive at first sight. $$ g = \bigg(\prod_{i=1}^nx_i\bigg)^{1/n} $$ If it does not click immediately, keep in mind that the geometric mean is commonly used when calculating compound growth rates, such as portfolio returns. For instance, a portfolio manager has annual returns recorded as below:
Year | Return |
---|---|
2015 | 36% |
2016 | 23% |
2017 | -48% |
2018 | -30% |
2019 | 15% |
2020 | 31% |
Arithmetic Mean | 4.5% |
Geometric Mean | -1.4% |
The arithmetic mean of the returns is $4.5\%$.
portfolio_return = np.array([0.36, 0.23, -0.48, -0.3, 0.15, 0.31])
np.mean(portfolio_return)
0.045000000000000005
However, the geometric mean of the returns, computed with `sp.stats.mstats.gmean`, is $-1.43\%$. The geometric mean is the more accurate measurement when the compound effect matters, i.e. when the data are related to each other by a growth rate.
sp.stats.mstats.gmean(portfolio_return + 1) - 1
-0.014282599080668423
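A quick check of the compounding interpretation (a small sketch, not in the original notes): growing at the geometric mean rate every year reproduces the portfolio's total cumulative growth factor.
g = sp.stats.mstats.gmean(portfolio_return + 1) - 1
total_growth = np.prod(portfolio_return + 1)  # cumulative growth factor over the six years
print(np.isclose((1 + g) ** len(portfolio_return), total_growth))  # True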
The first measure of variability is the range, which measures the distance from the lowest value to the highest. We'll generate an array from a standard normal distribution for demonstration.
x = np.random.randn(50)
rangeLS = x.max() - x.min()
rangeLS
4.725332255653198
The percentile is another common statistical concept, best explained by an example. The GRE test result, for instance, shows a percentile besides your absolute score; if you have a percentile of $96\%$, it means your score is higher than $96\%$ of candidates'. The special percentiles of $75\%$ and $25\%$ are called the third quartile and the first quartile, and their difference is the interquartile range (IQR).
q75, q25 = np.percentile(x, [75, 25]) # IQR
q75 - q25
1.1190705852500262
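A related fact worth remembering (illustrative check only): the median is simply the 50th percentile.
print(np.percentile(x, 50), np.median(x))  # identical values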
Before moving any further, we must clarify two statistical concepts: population and sample. Suppose we want to know the variance of the height of all human adults; we would have to measure every adult on earth (the population) to calculate it. Since that mission is impossible, we instead measure a smaller group of adults (a sample) and use it to make inferential statements about the population.
Population and sample variance differ in their degrees of freedom, where $N$ is the population size, $n$ the sample size, $\mu$ the population mean and $\bar{x}$ the sample mean. The formulae of the variances are
$$ \sigma^2 = \frac{\sum(x_i - \mu)^2}{N}\\ s^2 = \frac{\sum(x_i - \bar{x})^2}{n-1} $$
The latter, the sample variance, is an unbiased estimator of the population variance.
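As a sanity check on the formula (a small sketch with a throwaway sample), computing the sample variance by hand matches `np.var` with `ddof=1`:
sample = np.random.randn(20)
manual = np.sum((sample - sample.mean()) ** 2) / (len(sample) - 1)  # divide by n-1
print(np.isclose(manual, np.var(sample, ddof=1)))  # True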
To illustrate the idea, let's pretend there are only $N = 1000$ people on earth; we can generate an array to represent the population height, purely for demonstrative purposes. We generate the population by setting `loc=170` and `scale=10` in `np.random.normal`, i.e. $X\sim N(170, 10)$ where the scale is the standard deviation, then calculate the population variance with `np.var()`.
population_height = np.random.normal(170, 10, 1000)
np.var(population_height)
108.33360191813833
Now suppose we know nothing about the population, but we can get a sample of $100$ persons. By setting `ddof=1` in the function, we specify that the degrees of freedom are $n-1$, which is what sample estimators use.
sample_height = np.random.choice(population_height, size=100)
np.var(sample_height, ddof=1)
98.1270651211727
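For comparison (illustrative only), the biased estimate with the default `ddof=0` divides by $n$ instead of $n-1$ and is therefore slightly smaller on the same sample:
print(np.var(sample_height), np.var(sample_height, ddof=1))  # biased vs unbiased estimate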
Theoretically, we could draw a tremendous number of samples, say $10000$ of them. Yes, I mean $10000$ samples, not a sample size of $10000$. This is just a thought experiment that could never be carried out in the real world; again, purely for demonstrative purposes.
What we do next: generate $10000$ samples, calculate the sample variances, and plot the histogram. The vertical line is the mean of the sampling distribution of the variance estimates. We set a standard deviation of $\sigma = 10$, i.e. a theoretical variance of $\sigma^2=100$, and the estimates centre close to the population variance we computed above, so the point estimator is doing a fairly good job.
sample_height_array = []
for i in range(10000):
    # Draw a sample of 100 and record its unbiased variance estimate
    sample_height = np.random.choice(population_height, size=100)
    sample_height_array.append(np.var(sample_height, ddof=1))

fig, ax = plt.subplots(figsize=(9, 9))
n, bins, patches = ax.hist(sample_height_array, bins=50)
ax.axvline(x=np.mean(sample_height_array), color="tomato")
ax.text(
    np.mean(sample_height_array) + 1,
    np.max(n),
    r"$\mu_\sigma = {:.2f}$".format(np.mean(sample_height_array)),
    size=16,
)
ax.set_title("Sampling Distribution of Variance Estimates", size=19)
plt.show()
I guess you have noticed the power of estimators now: if you can draw many samples, an unbiased sample estimator gives you an accurate estimate of the population parameter. Even if you only have one sample, you can still assess how likely your estimate is to be close to the truth based on the sampling distribution.
Keep in mind that the standard deviation is the most popular measure of variability; the population/sample standard deviation is simply the square root of the corresponding variance. The NumPy function works just like the variance function; here we apply it to the array of variance estimates from above.
np.std(sample_height_array, ddof=1)  # standard deviation of the 10000 variance estimates
15.50977277891788
The $z$-score is defined below; it measures how many standard deviations an observation lies away from the mean.
\begin{equation} z_i = \frac{x_i-\bar{x}}{s} \end{equation}
Note that $z$ carries the subscript $i$, which means each observation has its own $z$-score.
Actually, measuring how many standard deviations away from the mean (or from a hypothesis) something lies is the fundamental philosophy of frequentist statistics. It asks one important question: how far away from the mean is far away? If it is far enough, then very likely the mean we are looking at right now is not the 'real' mean of the random mechanism that generated the observation.
Here is an example of calculating the $z$-scores of a randomly generated array. You can see that, for a standard normal distribution, it is fairly rare for an observation to stray more than $2$ standard deviations from the mean.
x = np.random.randn(10)
z = (x - np.mean(x)) / np.std(x)
np.round(z, 2)
array([ 0.09, -0.59, 0.6 , -0.91, 1.06, -1.44, 0.78, -1.58, 0.69, 1.3 ])
Chebyshev's Theorem is the last resort of deduction when we have absolutely no knowledge of a sample and its distribution. It guarantees a minimum proportion of the data that must lie within $z$ standard deviations of the mean, where $z$ is the $z$-score. The downside of the theorem is that the bound only applies to intervals symmetric about the mean and is usually quite conservative.
\begin{equation} p \geq 1-\frac{1}{z^2} \end{equation}
For instance, the average height of people in Helsinki is $174$ cm, with a standard deviation of $4$ cm, so the question is: what percentage of people are between $166$ cm and $182$ cm? Note that this range lies symmetrically $2$ standard deviations away from the mean. Thus, according to Chebyshev, we know that
z_l = (166 - 174) / 4 # lower z-score
z_u = (182 - 174) / 4 # upper z-score
Because the range is symmetric, the $z$-scores on both sides are equal in magnitude; we can calculate the probability as below and print the conclusion.
p = 1 - 1 / z_l**2
print("At least {0}% of people are within 166cm and 182cm in Helsinki.".format(p * 100))
At least 75.0% of people are within 166cm and 182cm in Helsinki.
If you think of Chebyshev's bound as a function, the only variable is the $z$-score. Below we define a Chebyshev function and plot it against the $z$-score.
Because the $z$-score measures how many standard deviations away from the mean we are, we can conclude from the graph that, no matter what kind of distribution we are investigating, a range of $\pm 2.5$ standard deviations is guaranteed to cover at least $84\%$ of the data.
def chebyshev(z):
    # Chebyshev's lower bound on the proportion within z standard deviations
    return 1 - 1 / z**2

chebyshev_array = []
for z in np.arange(1, 21, 0.5):
    chebyshev_array.append(chebyshev(z))

fig, ax = plt.subplots(figsize=(9, 9))
ax.plot(np.arange(1, 21, 0.5), chebyshev_array)
ax.scatter(2.5, chebyshev(2.5), s=100, color="red", zorder=3)
ax.text(2.5 + 0.5, chebyshev(2.5), r"(2.5, {}%)".format(chebyshev(2.5) * 100))
ax.set_title("Chebyshev's Theorem")
ax.set_xlabel("z-score")
ax.set_ylabel("Probability")
plt.show()
However, in practice we are more likely to deal with data that follow a (roughly) bell-shaped distribution; these are regular and easier to reason about, so the Empirical Rule applies.
The empirical numbers come from the normal distribution: roughly $68\%$, $95\%$ and $99.7\%$ of the data fall within $1$, $2$ and $3$ standard deviations of the mean, respectively.
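We can verify these proportions on a simulated normal sample (a quick sketch; the exact numbers vary slightly from run to run):
draws = np.random.normal(0, 1, 100000)
for k in (1, 2, 3):
    within = np.mean(np.abs(draws - draws.mean()) <= k * draws.std())  # proportion within k sd
    print("within {} sd: {:.3f}".format(k, within))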
With similar notation, the population and sample covariance are defined as: $$ \sigma_{xy} = \frac{\sum(x_i-\mu_x)(y_i-\mu_y)}{N}\\ s_{xy} = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{n-1} $$
Generate two random arrays, then evaluate the covariance matrix. The values on the diagonal are the variances; the off-diagonal entries are the covariances.
x = np.random.randn(300)
y = np.random.rand(300)
np.cov(x, y)
array([[ 0.88955858, -0.00649224],
       [-0.00649224,  0.07885574]])
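As a sanity check (not part of the original notes), the off-diagonal entry matches the sample covariance formula applied directly to `x` and `y`:
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)  # sample covariance formula
print(s_xy)  # equals np.cov(x, y)[0, 1]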
The correlation coefficient is a normalised version of the covariance, just as the standard deviation is a normalised measure of spread derived from the variance; some statistics textbooks call it the Pearson product-moment correlation coefficient.
$$ \rho = \frac{\sigma_{xy}}{\sigma_x\sigma_y}\\ r_{xy}=\frac{s_{xy}}{s_xs_y} $$
For the sake of everyone's sanity, and to avoid cluttering the lecture notes with repetitive code, some of it has been moved into the `plot_material.py` module. Take a look inside if you are curious.
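Before turning to the regression plots, here is a minimal sketch of the formula in code, using the `x` and `y` from the covariance example above: dividing the covariance by the product of the standard deviations reproduces `np.corrcoef`.
C = np.cov(x, y)
r = C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])  # normalise the covariance
print(r, np.corrcoef(x, y)[0, 1])  # identical values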
What we do below is basically generate eight linear regression plots $Y = \beta_1+\beta_2X+u$ with different parameters. The correlation coefficient $\rho$ is displayed in each graph, so you can observe that the correlation coefficient is affected by the size of the slope parameter $\beta_2$ and the scale of the error term.
The reason is that, for this simple linear model with $X$ and $u$ independent, $\rho = \frac{\beta_2\sigma_X}{\sqrt{\beta_2^2\sigma_X^2 + \sigma_u^2}}$: a larger $|\beta_2|$ pushes $\rho$ towards $\pm 1$, while a larger error variance pulls it towards $0$.
import plot_material
plot_material.reg_corr_plot()
We will dive much deeper into linear regression in our econometrics training sessions.
The correlation coefficient above is the Pearson correlation, which assumes a linear relationship between two variables. However, there are other correlation coefficients that capture monotonic, possibly nonlinear, relationships, such as the Spearman correlation and Kendall's $\tau$.
X = np.linspace(-10, 10, 200)
Y = 1 / (1 + np.exp(-X))
df_dict = {"X": X, "Y": Y}
df = pd.DataFrame(df_dict)
df.plot(x="X", y="Y", kind="scatter", figsize=(16, 7))
plt.show()
df.corr(method="pearson")
 | X | Y |
---|---|---|
X | 1.000000 | 0.936137 |
Y | 0.936137 | 1.000000 |
print("Pearson coeffcient: {}".format(sp.stats.stats.pearsonr(df["X"], df["Y"])[0]))
Pearson coeffcient: 0.9361365508325603
print("Pearson coeffcient: {}".format(sp.stats.stats.spearmanr(df["X"], df["Y"])[0]))
Pearson coeffcient: 1.0
sp.stats.stats.kendalltau(X, Y)
print("Pearson coeffcient: {}".format(sp.stats.stats.kendalltau(df["X"], df["Y"])[0]))
Pearson coeffcient: 1.0
Notice the difference: Spearman and Kendall produce a perfect correlation coefficient while Pearson does not, because the former two are rank-based measures and the relationship above, although nonlinear, is perfectly monotonic.
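To see why (a small illustrative sketch, not part of the original notes): the Spearman coefficient is just the Pearson coefficient computed on the ranks of the data, and the logistic curve is strictly monotonic, so the ranks line up perfectly.
rank_x = sp.stats.rankdata(df["X"])
rank_y = sp.stats.rankdata(df["Y"])
print(sp.stats.pearsonr(rank_x, rank_y)[0])  # 1.0 for a perfectly monotonic relationship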