This is the only random variable that we are going to study in depth. If you take a course specifically on probability, you will study tens of random variables, how they relate, and their properties. In addition, you will learn how to mathematically derive their distributions and summary statistics instead of approximating them, as we do in this course.
Why is the normal random variable so special? Well, let's first discuss some properties of the normal random variable.
To start off, let's look at a normal r.v. distribution:
%matplotlib inline
from numpy.random import normal
import seaborn as sns
sns.distplot(normal(size=1000))
Question: is this what a normal distribution looks like?
Almost! It is important to remember that this is only approximately what a normal distribution looks like. We would only see the exact distribution if we took infinitely many samples.
Most people say that the normal distribution looks like a bell, and thus call it a bell curve. For your information, a normal distribution has no skewness; it is a symmetric distribution.
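We can sanity-check that symmetry claim numerically. A minimal sketch (assuming `scipy` is available): the sample skewness of a large normal sample should be close to 0.

```python
import numpy as np
from scipy.stats import skew

# Skewness measures asymmetry; a symmetric distribution has skewness 0.
samples = np.random.normal(size=100000)
print(skew(samples))  # close to 0 for a symmetric distribution
```

With 100,000 samples the estimate will wobble a little around zero, but it should be far from the clearly nonzero skewness you would see for, say, an exponential distribution.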
Normal distributions can have any mean and any standard deviation. The normal random variable below has a mean of 3.5 and a standard deviation of 0.25:
import numpy as np
sns.distplot(normal(loc=3.5, scale=0.25, size=1000))
print(np.mean(normal(loc=3.5, scale=0.25, size=1000)))
print(np.std(normal(loc=3.5, scale=0.25, size=1000)))
3.4914305021
0.244714943362
So everything I have said above is rather normal (pun intended) for a random variable: they all have distributions, and they all have means. So what gives?
Well it all has to do with the central limit theorem (arguably the most important theorem in all of statistics). This theorem says:
The sum of independent and identically distributed (iid) random variables with finite mean and variance is approximately normally distributed (that is, if we take samples of the sum of iid random variables, those samples are approximately normally distributed).
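A rough numeric sketch of what the theorem buys us: even for a heavily skewed distribution like the exponential, sums of iid draws have a predictable mean and spread (the sum of 50 iid exponentials with mean 1 has mean 50 and standard deviation sqrt(50)). The parameter choices here are illustrative, not from the text above.

```python
import numpy as np

# 10000 samples, each the sum of 50 iid exponential(mean=1) draws.
sums = np.random.exponential(scale=1.0, size=(10000, 50)).sum(axis=1)
print(np.mean(sums))  # close to 50
print(np.std(sums))   # close to sqrt(50), about 7.07
```

Plotting `sums` with `sns.distplot` would show the same bell shape we see below for uniforms, even though a single exponential looks nothing like a bell.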
This is incredibly important!
Why?
Well, because even if we don't know what the individual random variables are (what their distributions look like), we know that their sum is approximately normally distributed! So we know something about them. Let's see it in action.
We start off with the simplest of distributions, a uniform one:
sns.distplot(np.random.uniform(size=10000))
This is not a normal distribution, but if we make a random variable that is the sum of uniform random variables, how will it be distributed?
def almost_normal():
    return sum(np.random.uniform(size=50))
dist = np.array([almost_normal() for _ in range(10000)])
sns.distplot(dist)
Pretty normal right ;)
Now if we reduce the number of uniforms that we sum:
def almost_normal():
    return sum(np.random.uniform(size=2))
dist = np.array([almost_normal() for _ in range(10000)])
sns.distplot(dist)
This becomes less normal.
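Beyond eyeballing the histogram, we can check the sum of 50 uniforms against theory: a Uniform(0, 1) draw has mean 1/2 and variance 1/12, so a sum of 50 of them should have mean 50/2 = 25 and standard deviation sqrt(50/12), about 2.04. This is a quick sketch along those lines, not code from the original notebook.

```python
import numpy as np

n = 50
# 10000 samples, each the sum of n iid Uniform(0, 1) draws.
sums = np.random.uniform(size=(10000, n)).sum(axis=1)
print(np.mean(sums))  # theory: n/2 = 25
print(np.std(sums))   # theory: sqrt(n/12), about 2.04
```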
Well, by making a very simple assumption (that each x_i is an iid sample from a random variable), we were able to say something very specific about our samples. But there are probably two things bothering you now:
Both of these are really good questions, and for that reason we will be doing a video on each of them.