The learning objectives today are quite simple. We will learn what IID (independent and identically distributed) means and how it applies to samples.
So we come to the second most important acronym in statistics, iid, which is also the most important assumption. It matters not only for bootstrapping (up next) but throughout the class.
This assumption has two parts: the samples are independent, and the samples are identically distributed.
The second part is pretty easy, so we will start there.
Identically distributed means that the samples come from the same r.v. More specifically, we sample from the same function each time. Below are some identically distributed samples:
def rv1():
    return 1
identically_distributed_samples = [rv1() for _ in range(10)]
identically_distributed_samples
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
In this case the samples are all the same (remember that a r.v. does not have to be random).
And below is an example of samples that are not identically distributed:
def rv2():
    return 2
not_identically_distributed_samples = [rv1() for _ in range(5)] + [rv2() for _ in range(5)]
not_identically_distributed_samples
[1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
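The constant functions above make the idea easy to see, but the same test applies when the r.v. is actually random. The sketch below is only an illustration: `fair_die` and `loaded_die` are hypothetical names, the loaded probabilities are made up, and the seed is fixed just so the run is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def fair_die():
    # hypothetical r.v.: each face 1..6 is equally likely
    return int(rng.integers(1, 7))

def loaded_die():
    # hypothetical r.v.: same faces, but a six comes up half the time
    return int(rng.choice([1, 2, 3, 4, 5, 6], p=[0.1, 0.1, 0.1, 0.1, 0.1, 0.5]))

# Identically distributed: every sample comes from the same function.
same_rv = [fair_die() for _ in range(10_000)]

# Not identically distributed: the function changes halfway through.
mixed_rv = [fair_die() for _ in range(5_000)] + [loaded_die() for _ in range(5_000)]

first_half_mean = np.mean(mixed_rv[:5_000])
second_half_mean = np.mean(mixed_rv[5_000:])
print(np.mean(same_rv[:5_000]), np.mean(same_rv[5_000:]))  # the two halves look alike
print(first_half_mean, second_half_mean)  # the second half is noticeably higher
```

Comparing the two halves is just a quick sanity check here. It works because we know where the function changed; with real data you usually have to reason about how the samples were collected.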
Now there is something subtle about this. Say we are measuring the height of pigs at a farm. We randomly pick a pig and measure its height. This is a real-life r.v. Now suppose we accidentally pick up a chicken, measure it, and put it in the data set. Then the samples are no longer identically distributed!
But if instead you are measuring the height of farm animals, then picking a pig and then a chicken is perfectly fine. So notice that the answer depends greatly on how your problem is framed.
The most important thing to remember here is to always know what you are sampling from.
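The pig and chicken framing can be sketched in code. Everything here is an assumption for illustration: the height distributions are made-up numbers (in cm), and `pig_height`, `chicken_height`, and `farm_animal_height` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is repeatable

# Made-up height distributions in cm (assumptions for illustration only).
def pig_height():
    return rng.normal(loc=75.0, scale=5.0)

def chicken_height():
    return rng.normal(loc=40.0, scale=4.0)

def farm_animal_height():
    # One r.v. for the "farm animal" framing: pick a species at random,
    # then measure it. Every call uses this same two-step recipe.
    return pig_height() if rng.random() < 0.5 else chicken_height()

# Framed as "height of a farm animal": identically distributed,
# even though individual draws come from pigs or chickens.
farm_samples = [farm_animal_height() for _ in range(10_000)]

# Framed as "height of a pig": one chicken slipped in, so the last
# sample does not come from the same r.v. as the others.
pig_samples = [pig_height() for _ in range(9)] + [chicken_height()]

print(round(float(np.mean(farm_samples)), 1))
print([round(float(h), 1) for h in pig_samples])
```

In `farm_samples` we call the same function every time, so the samples are identically distributed. In `pig_samples` the final call uses a different function, which is exactly the accidental chicken.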
The next property that we will discuss is much harder to wrap your mind around: independence. Two r.v. are independent if they do not depend on each other. In function speak, this means that neither r.v. takes the other (or any r.v. that depends on the other) as a parameter. Okay, confusing, right? Well, let's look at an example.
These r.v. are independent:
def ind1():
    return 4
def ind2():
    return 5
ind1(), ind2()
(4, 5)
These are not:
x = 3
def not_ind():
    global x
    x += 1
    return x
not_ind_samples = [not_ind() for _ in range(5)]
not_ind_samples
[4, 5, 6, 7, 8]
Notice that each sample depends on the outcome of the one before it. Another way of thinking about it is: if I know the outcome of one r.v., does that give me information about the outcome of another?
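The "does knowing one tell me about the other?" question can be checked on samples: if A and B are independent, then P(B | A) should equal P(B). Here is a minimal sketch with simulated coin flips. The names `coin_a`, `coin_b`, `coin_c`, and `prob_heads_given_heads` are hypothetical, and the 90% copy rate is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed so the sketch is repeatable
n = 100_000

# Two separate fair coin flips (1 = heads): neither looks at the other.
coin_a = rng.integers(0, 2, size=n)
coin_b = rng.integers(0, 2, size=n)

# A dependent pair: coin_c copies coin_a 90% of the time.
copy_mask = rng.random(n) < 0.9
coin_c = np.where(copy_mask, coin_a, 1 - coin_a)

def prob_heads_given_heads(first, second):
    # Estimate P(second is heads | first is heads) from the samples.
    return second[first == 1].mean()

p_b = coin_b.mean()
p_b_given_a = prob_heads_given_heads(coin_a, coin_b)
p_c = coin_c.mean()
p_c_given_a = prob_heads_given_heads(coin_a, coin_c)

print(p_b, p_b_given_a)  # roughly equal: knowing coin_a tells us nothing
print(p_c, p_c_given_a)  # very different: knowing coin_a tells us a lot
```

Note that `coin_c` on its own still looks like a fair coin. The dependence only shows up when you look at it together with `coin_a`.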
Okay, let's try a more complicated one:
from numpy.random import rand
x = 3
def not_ind_adv():
    global x
    if rand() < .01:
        x += 1
    return x
not_ind_samples = [not_ind_adv() for _ in range(5)]
not_ind_samples
[3, 3, 3, 3, 3]
Are these samples independent of each other?
Actually not! One way of thinking about it is that each sample is always either equal to the previous sample or exactly one more than it, so knowing one sample tells you a lot about the next. Therefore they are dependent on each other.
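With a probability of .01 the increments are rare, so five samples will usually hide the dependence. The sketch below uses a hypothetical variant, `make_counter`, with the same logic but a larger probability (0.3, an arbitrary choice) and the state kept in a closure instead of a global, so that the pattern shows up.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed so the sketch is repeatable

def make_counter(start=3, p=0.3):
    # Hypothetical variant of not_ind_adv: same logic, but the state lives
    # in a closure and p is larger so that increments actually show up.
    state = {"x": start}
    def sample():
        if rng.random() < p:
            state["x"] += 1
        return state["x"]
    return sample

counter = make_counter()
dependent = np.array([counter() for _ in range(1_000)])

# For contrast: each draw ignores all earlier draws.
independent = rng.integers(3, 9, size=1_000)

# Each dependent sample is the previous one or the previous one plus 1.
steps = np.diff(dependent)
print(sorted(set(steps.tolist())))

# Correlation between each sample and the one before it.
def lag1_corr(samples):
    return np.corrcoef(samples[:-1], samples[1:])[0, 1]

print(lag1_corr(dependent))    # close to 1
print(lag1_corr(independent))  # close to 0
```

A correlation near 0 does not prove independence by itself, but a correlation near 1 between neighbouring samples is a clear sign that they are not independent.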
This is one of the most confusing parts of statistics, and rightly so. We would skip it if we could, but this assumption is so vitally important that there is no way around it.
This is probably one of the most important lessons that you will learn here, because it is one of the ones most often messed up. Knowing what distribution you are sampling from and whether two things are independent are crucial assumptions that you will often make in statistics, and getting them wrong can ruin entire projects (think of the Truman election or the recent Facebook CL scandal)!