In this tutorial we are going to see how to take samples from our data and compute and plot confidence intervals. In addition, you guys are going to demonstrate a few concepts and statements introduced in the lectures.
sat.dat<-read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/psych/sat.act.csv")
sat.dat$gender<-as.factor(sat.dat$gender)
sat.dat$education<-as.factor(sat.dat$education)
The R built-in function sample
allows you to get samples from a given vector of values. For example, let's take a sample of 100 observations from the ACT scores.
sample(sat.dat$ACT, size=100)
This, however, is restricted to vectors, but what if we want to sample a data frame? We can do it with the function sample_n
from tidyverse.
library(tidyverse)
sample_n(sat.dat, size=10)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ── ✔ ggplot2 3.4.0 ✔ purrr 1.0.1 ✔ tibble 3.1.8 ✔ dplyr 1.0.10 ✔ tidyr 1.3.0 ✔ stringr 1.5.0 ✔ readr 2.1.3 ✔ forcats 0.5.2 ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag()
X | gender | education | age | ACT | SATV | SATQ |
---|---|---|---|---|---|---|
<int> | <fct> | <fct> | <int> | <int> | <int> | <int> |
35910 | 2 | 0 | 16 | 24 | 800 | 800 |
39772 | 2 | 3 | 20 | 36 | 760 | 760 |
33352 | 2 | 3 | 28 | 27 | 445 | 525 |
35380 | 2 | 3 | 27 | 17 | 300 | 300 |
38404 | 2 | 3 | 20 | 34 | 700 | 670 |
39238 | 1 | 5 | 43 | 32 | 400 | 500 |
38894 | 2 | 5 | 30 | 26 | 570 | 430 |
29518 | 1 | 5 | 26 | 28 | 640 | 640 |
31080 | 2 | 4 | 33 | 36 | 500 | 500 |
38198 | 1 | 3 | 19 | 33 | 780 | 800 |
# Here using the pipe operator
sat.dat %>% sample_n(size=10)
X | gender | education | age | ACT | SATV | SATQ |
---|---|---|---|---|---|---|
<int> | <fct> | <fct> | <int> | <int> | <int> | <int> |
34812 | 2 | 3 | 22 | 32 | 700 | 690 |
30129 | 2 | 5 | 30 | 36 | 660 | 610 |
35574 | 2 | 3 | 32 | 33 | 650 | 635 |
34628 | 2 | 3 | 21 | 20 | 300 | 300 |
32429 | 1 | 3 | 24 | 29 | 650 | 610 |
32811 | 1 | 4 | 36 | 22 | 450 | 460 |
35977 | 2 | 3 | 21 | 25 | 580 | 660 |
39242 | 2 | 3 | 21 | 27 | 610 | NA |
37229 | 2 | 5 | 23 | 25 | 530 | NA |
34183 | 2 | 4 | 45 | 30 | 750 | 420 |
As we saw in the lectures, our estimations of the mean will never be accurate, so it is a common practice to build confidence intervals around this measure, so that we have some degree of confidence about the observed value.
As noted in the lectures, the confidence intervals are usually given in terms of percentage, (1-$\alpha$)$\cdot$100%, which tells you that if you are to repeat the sampling many times, the given percentage of samples would contain the population mean.
To calculate these confidence intervals, one basically has to take the estimated mean and build a symmetrical distribution around it. Then, one just needs to calculate the cut-off values in this distribution such that it contains the (1-$\alpha$)$\cdot$100 of the points. This can be very easily accomplished using the quantile function. (N.B. This is one way of calculating confidence intervals that works well for the mean. However, you can always build confidence intervals around ANY statistics, so for some cases, you may need to use a different method, e.g. using a procedure called "bootstrapping". This is beyond the scope of this course.)
We also showed that one has to use different quantile functions depending on the situation encountered. Let's see this for the calculation of the 95% confidence intervals of the mean (say equal to 10), from a sample of size 30.
alpha<-0.05 # because of 95% = 100*(1-0.05)
n<-30 # sample size
qnorm
.z.alpha052<-qnorm(alpha/2)
z.alpha052
For this then we would build our 95% confidence intervals for a given sample like this:
< X > $\pm$ 1.95996398454005 * $\frac{\sigma}{\sqrt{n}}$
Here we would use the quantiles from the Student's t-distribution. In R this is calculated with the function qt
.
t.29.alpha052<-qt(alpha/2, df = n-1)
t.29.alpha052
And the 95% confidence intervals would be expressed in this case like:
< X > $\pm$ 2.0452296421327 * $\frac{\hat{\sigma}}{\sqrt{n}}$
Here, since we don't know the standard deviation from the population, we will estimate it from the sample. That's why we have a hat, i.e. $\hat{\sigma}$. This is calculated with the R function sd
.
As you can see, this result is way different from the previous one, because the t-distribution has heavier tails than the gaussian distribution. However, as the sample sizes increase, the t-distribution converges to a gaussian distribution, so results should not differ that much in such scenarios. Let's see this.
sample.size<-c(30, 50, 100, 500)
print(z.alpha052)
for (sample.size in sample.size)
{
print(qt(alpha/2, df = sample.size-1))
}
[1] -1.959964 [1] -2.04523 [1] -2.009575 [1] -1.984217 [1] -1.964729
Let's demonstrate this for a population that follows a Gaussian distribution with mean=100 and standard deviation=15. Consider 500 experiments. In each experiment, sample 100 observations from a population with the aforementioned parameters (mean=100, standard deviation=15), calculate the sample mean and standard deviation. These will be the estimated parameters for our population. Then, calculate the 95% confidence interval. Finally, check - i.e you should obtain a logical value (i.e. TRUE or FALSE) - whether this confidence interval contains the population mean (100). For each experiment, store the final logic value into a vector. With this vector of 500 logical values (one for each experiment), calculate the proportion of TRUES in the vector. It should give approximately 95%.
Some times, you may want to plot the confidence intervals. We are going to see now how to do this using the libraries ggplot and dplyr.
library(tidyverse)
act.summary<-sat.dat %>%
group_by(education) %>%
summarize(mu=mean(ACT),
std=sd(ACT),
N=n(),
se=std/sqrt(N),
tn1.alpha2=qt(0.025, df = N-1))
act.summary
education | mu | std | N | se | tn1.alpha2 |
---|---|---|---|---|---|
<fct> | <dbl> | <dbl> | <int> | <dbl> | <dbl> |
0 | 27.47368 | 5.206813 | 57 | 0.6896592 | -2.003241 |
1 | 27.48889 | 6.055134 | 45 | 0.9026461 | -2.015368 |
2 | 26.97727 | 5.808929 | 44 | 0.8757290 | -2.016692 |
3 | 28.29455 | 4.846227 | 275 | 0.2922385 | -1.968660 |
4 | 29.26087 | 4.345153 | 138 | 0.3698840 | -1.977431 |
5 | 29.60284 | 3.954887 | 141 | 0.3330616 | -1.977054 |
ggplot(data = act.summary, aes(x=education, y=mu)) +
geom_errorbar(aes(ymin=mu-1.96*se, ymax=mu+1.96*se), width=.1) +
geom_point(size=5)
ggplot(data = act.summary, aes(x=education, y=mu)) +
geom_errorbar(aes(ymin=mu+tn1.alpha2*se, ymax=mu-tn1.alpha2*se), width=.1) +
geom_point(size=5)
# This is the same as above
ggplot(data = act.summary, aes(x=education, y=mu)) +
geom_errorbar(aes(ymin=mu-abs(qt(0.025, df = N-1))*se, ymax=mu+abs(qt(0.025, df = N-1))*se), width=.1) +
geom_point(size=5)
As you can see, the confidence intervals for education > 2 are similar to the gaussian case; and that's because for those categories, the sample sizes were large.
Let's demonstrate this for a population whose weight follows a gaussian distribution with mean = 70kg and standard deviation=10kg.
Consider the following sequence of samples sizes: (5, 10, 50, 100, 200, 500, 1000, 5000, 10000). For each of these sample sizes, generate a random sample and take the mean. Plot these against the sample sizes.
To demonstrate this, consider sampling from the population of IQ's, that is, from a normal distribution with mean=100 and standard deviation=15. Now, using rnorm
, extract 10,000 samples with 2 observation, 10,000 more with 3 observations, and so on up to a sample size of 10. In each of these samples, calculate the mean, the standard deviation using the built-in R function sd
(which divides by N-1), and the standard deviation divided by N (you will need to implement it yourself).
Then, plot the average of the sample means and the average of the standard deviations versus the sample size. You should see that, on average, the sample means are always around 100, regardless of the sample size. However, the estimated standard deviations calculated dividing by N turn out to be systematically too small, especially for small sample sizes. Dividing by N-1 tries to circumvent this.