This document provides a brief introduction to a few of the core constructs of the R
language and environment for statistical computing. These will be necessary for the first laboratory assignment and will be staples that you will rely on throughout the course.
options(repr.plot.width = 4, repr.plot.height = 3) ## resizing plots
options(scipen = 999) ## has scientific notation ever annoyed you?
# packages for project management
library(usethis) # commonly used package components
library(here) # set paths relative to a .Rproj file
Let's take a moment to see why we'd use something like `here`: it builds file paths relative to the project root (located via the `.Rproj` file), so code doesn't break when the working directory changes. While `usethis` is not a necessity (unlike `here`), it does make life a lot easier.
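As a quick sketch of the idea using base R only (`here::here` itself requires the here package, so it appears only in a comment): a plain relative path is resolved against whatever `getwd()` happens to be, which is exactly the fragility `here` avoids.

```r
# a plain relative path is only meaningful relative to the current working
# directory, so the same code breaks when run from a different directory
fragile_path <- file.path("data", "raw.csv")
print(fragile_path)

# by contrast, here::here("data", "raw.csv") (not run here; requires the
# here package) always resolves from the project root found via .Rproj
```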
# some more data-oriented packages we'll use
library(tidyverse) ## core tools for "tidy" data science
library(skimr) ## take a deeper look at your data
A big part of what we statisticians do is based on simulating data. You just came up with a method that, in theory, is brilliant/elegant/the-best-thing-since-sliced-bread, but you need to illustrate that the method actually works well in practice. For example, your theoretical results may be based on the assumption of a large sample size and you want to investigate how well the method performs when the sample size is modestly small. Providing such information can help convince readers that your method is in fact a good idea. Simulations can also be helpful for showing when a method breaks down, which is something reviewers also like to see.
To accomplish these tasks, we need to create a data-generating experiment where we know the truth, so that we can benchmark our method against it. Let's start with a very simple example. Our simulated experiment enrolls $n = 100$ subjects and measures an outcome $Y$ on each subject. Suppose that we want to know how our method performs when $Y \sim N(0,1)$.
# we're simulating -- it's important to set the seed for the PRNG
set.seed(3724591) ## note something like '5' is a BAD seed, it might be worth reading about PRNGs
# simulate 100 draws from a normal distribution.
Y <- rnorm(n = 100, mean = 0, sd = 1)
# we'll use skimr::skim to take a look at y
skim(Y)
type | stat | level | value | formatted | variable |
---|---|---|---|---|---|
numeric | missing | .all | 0.00000000 | 0 | Y |
numeric | complete | .all | 100.00000000 | 100 | Y |
numeric | n | .all | 100.00000000 | 100 | Y |
numeric | mean | .all | -0.13230679 | -0.13 | Y |
numeric | sd | .all | 1.02838232 | 1.03 | Y |
numeric | p0 | .all | -2.27469359 | -2.27 | Y |
numeric | p25 | .all | -0.78641356 | -0.79 | Y |
numeric | median | .all | -0.07902102 | -0.079 | Y |
numeric | p75 | .all | 0.62299759 | 0.62 | Y |
numeric | p100 | .all | 1.97201608 | 1.97 | Y |
numeric | hist | .all | NA | ▃▂▅▇▇▇▃▂ | Y |
The above code chunk illustrates several important R functions:

- `rnorm` simulates `n` observations from a Normal distribution, here the standard Normal with `mean = 0` and standard deviation `sd = 1`.
- `skim`, from the `skimr` package, provides a catch-all way to take a look at some important properties of a data set (in this case, just a vector). We can see that we received information about the quantiles (just like `summary` would provide), in addition to missingness/completeness, and even a histogram!

Now, let's move on to visualizing our data. We'll do this with `ggplot2`, a modern graphics engine that implements the Grammar of Graphics.
# let's take a closer look at the histogram of Y
p1 <- gather(as_tibble(Y)) %>%
  ggplot(., aes(x = value)) +
  geom_histogram(aes(y = ..density..))
p1
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Ok, so there's a lot going on above. Let's take a moment to dissect it. The `ggplot2` engine requires that we pass in a table of "tidy" data. What's tidy data? To tidy up our data, we can use the `gather` function from the `tidyr` package.

Some questions:

- What's a `tibble`? It's just a cleaner implementation of a `data.frame`. There are a few re-implementations of this core object because of inefficiencies; another is the very useful `data.table`.
- What's this "`%>%`" thing all about? It's a "pipe" operator, introduced initially by the `magrittr` package. We'll come back to this.

How does `ggplot2` work?

- `ggplot` initializes the plot with the `tibble` passed in by the pipe. With `aes`, we specify that the x-axis of our plot is going to be the variable `value` (n.b., this is created implicitly by `tidyr::gather`).
- We add a histogram layer with `geom_histogram`. By specifying `aes` here, we note that the y-axis is going to be a density (we could have also used `aes(y = ..count..)`).
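To preview the pipe before we return to it: `x %>% f(y)` is simply another way of writing `f(x, y)`. A minimal sketch using base R's native pipe `|>` (available since R 4.1), which behaves the same way for simple calls:

```r
x <- c(1.2, 2.7, 3.9)

# without a pipe: nested calls read inside-out
res_nested <- round(mean(x), digits = 1)

# with a pipe: the left-hand side becomes the first argument of the call
res_piped <- x |> mean() |> round(digits = 1)

res_nested  # the two forms give identical results
```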
.p2 <- p1 +
stat_function(fun = dnorm, args = list(mean = 0, sd = 1),
linetype = "dashed") + ## let's add a normal distribution to compare our histogram to
xlim(c(-3, 3)) + ## we'll set hard limits to the x-axis to make the visualization clearer
xlab("observed values of y") + ## labeling our axes never hurts
ggtitle("Density Histogram of R.V. Y") + ## titles are good too...
theme_bw() ## let's get rid of that annoying gray background
p2
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Much nicer! Note that we created the new plot (`p2`) by adding layers to the first plot (`p1`) -- both objects are preserved! Alright, now that we've taken a look at `ggplot2` and some other core tools of the Tidyverse, let's resume talking about simulations:
There are other distributions we might want to simulate from as well.
# simulate 100 draws from a uniform(0,1) distribution
Y_unif <- runif(n = 100, min = 0, max = 1)
# simulate 100 draws from a bernoulli(0.5) distribution
# recall that a bernoulli(p) distribution is just a binomial(n=1, p)
Y_bern <- rbinom(n = 100, size = 1, prob = 0.5)
# simulate 100 draws from a gamma(1,2) distribution
Y_gamma <- rgamma(n = 100, shape = 1, scale = 2)
# Let's skim these
y_rv <- as_tibble(list(unif = Y_unif, bern = Y_bern, gamma = Y_gamma))
skim(y_rv)
variable | type | stat | level | value | formatted |
---|---|---|---|---|---|
unif | numeric | missing | .all | 0.00000000 | 0 |
unif | numeric | complete | .all | 100.00000000 | 100 |
unif | numeric | n | .all | 100.00000000 | 100 |
unif | numeric | mean | .all | 0.49401894 | 0.49 |
unif | numeric | sd | .all | 0.25992231 | 0.26 |
unif | numeric | p0 | .all | 0.02287295 | 0.023 |
unif | numeric | p25 | .all | 0.30916221 | 0.31 |
unif | numeric | median | .all | 0.47026868 | 0.47 |
unif | numeric | p75 | .all | 0.70735798 | 0.71 |
unif | numeric | p100 | .all | 0.99970650 | 1 |
unif | numeric | hist | .all | NA | ▆▃▇▇▆▅▅▅ |
bern | integer | missing | .all | 0.00000000 | 0 |
bern | integer | complete | .all | 100.00000000 | 100 |
bern | integer | n | .all | 100.00000000 | 100 |
bern | integer | mean | .all | 0.44000000 | 0.44 |
bern | integer | sd | .all | 0.49888765 | 0.5 |
bern | integer | p0 | .all | 0.00000000 | 0 |
bern | integer | p25 | .all | 0.00000000 | 0 |
bern | integer | median | .all | 0.00000000 | 0 |
bern | integer | p75 | .all | 1.00000000 | 1 |
bern | integer | p100 | .all | 1.00000000 | 1 |
bern | integer | hist | .all | NA | ▇▁▁▁▁▁▁▆ |
gamma | numeric | missing | .all | 0.00000000 | 0 |
gamma | numeric | complete | .all | 100.00000000 | 100 |
gamma | numeric | n | .all | 100.00000000 | 100 |
gamma | numeric | mean | .all | 1.55780404 | 1.56 |
gamma | numeric | sd | .all | 1.40703401 | 1.41 |
gamma | numeric | p0 | .all | 0.01638454 | 0.016 |
gamma | numeric | p25 | .all | 0.48768429 | 0.49 |
gamma | numeric | median | .all | 1.08718046 | 1.09 |
gamma | numeric | p75 | .all | 2.11952945 | 2.12 |
gamma | numeric | p100 | .all | 5.57925427 | 5.58 |
gamma | numeric | hist | .all | NA | ▇▆▃▂▂▁▁▁ |
R has several other functions for simulating data from various distributions. Additionally, many packages have been built for simulating from still other distributions; Googling "how to simulate from XYZ distribution" will likely be sufficient to figure out which package/functions you need.
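If no `r*` function or package exists for your distribution, one generic fallback (a sketch, not the only approach) is inverse-transform sampling: if $U \sim \text{Uniform}(0,1)$, then $F^{-1}(U)$ has CDF $F$. Here we check the idea against a distribution R already handles, the Exponential:

```r
set.seed(1)
# draw Exponential(rate = 2) by pushing uniform draws through the
# quantile function qexp (the inverse CDF)
u <- runif(1e5)
y_inv <- qexp(u, rate = 2)
# the sample mean should be close to the true mean, 1 / rate = 0.5,
# up to Monte Carlo error
mean(y_inv)
```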
Often, our experiment measures more than a single variable on each unit, so we might want to write a function that returns `n` copies of $O = (W,A,Y)$. Suppose that $O$ is distributed as follows:

$$ W \sim \text{Uniform}(0, 1) \ , \quad A \mid W \sim \text{Bernoulli}(\text{expit}(-1 + W)) \ , \quad Y \mid A, W \sim N(W + A, 1/2) \ . $$

We can now write a function that takes as input `n` and returns a `data.frame` object with rows corresponding to observations $O$ and columns corresponding to $W, A, Y$.
# define a function that takes as input n (a numeric)
make_data_O <- function(n) {
W <- runif(n, min = 0, max = 1)
# plogis is the CDF of the logistic distribution and
# plogis(x) = expit(x) = exp(x)/(1+exp(x))
A <- rbinom(n, size = 1, prob = plogis(-1 + W))
Y <- rnorm(n, mean = W + A, sd = sqrt(1/2))
# return a data.frame object with named columns W, A, Y
return(as_tibble(list(W = W, A = A, Y = Y)))
}
# make a data set with 100 observations
data_O_obs <- make_data_O(n = 100)
# confirm that make_data_O() returned a tibble object
class(data_O_obs)
# let's skim over the observed data
skim(data_O_obs)
variable | type | stat | level | value | formatted |
---|---|---|---|---|---|
W | numeric | missing | .all | 0.000000000 | 0 |
W | numeric | complete | .all | 100.000000000 | 100 |
W | numeric | n | .all | 100.000000000 | 100 |
W | numeric | mean | .all | 0.508036909 | 0.51 |
W | numeric | sd | .all | 0.287817470 | 0.29 |
W | numeric | p0 | .all | 0.008145839 | 0.0081 |
W | numeric | p25 | .all | 0.262854320 | 0.26 |
W | numeric | median | .all | 0.519560866 | 0.52 |
W | numeric | p75 | .all | 0.736241402 | 0.74 |
W | numeric | p100 | .all | 0.997624422 | 1 |
W | numeric | hist | .all | NA | ▆▇▅▇▆▇▆▆ |
A | integer | missing | .all | 0.000000000 | 0 |
A | integer | complete | .all | 100.000000000 | 100 |
A | integer | n | .all | 100.000000000 | 100 |
A | integer | mean | .all | 0.350000000 | 0.35 |
A | integer | sd | .all | 0.479372485 | 0.48 |
A | integer | p0 | .all | 0.000000000 | 0 |
A | integer | p25 | .all | 0.000000000 | 0 |
A | integer | median | .all | 0.000000000 | 0 |
A | integer | p75 | .all | 1.000000000 | 1 |
A | integer | p100 | .all | 1.000000000 | 1 |
A | integer | hist | .all | NA | ▇▁▁▁▁▁▁▅ |
Y | numeric | missing | .all | 0.000000000 | 0 |
Y | numeric | complete | .all | 100.000000000 | 100 |
Y | numeric | n | .all | 100.000000000 | 100 |
Y | numeric | mean | .all | 0.959493484 | 0.96 |
Y | numeric | sd | .all | 0.914968502 | 0.91 |
Y | numeric | p0 | .all | -1.223206222 | -1.22 |
Y | numeric | p25 | .all | 0.295534856 | 0.3 |
Y | numeric | median | .all | 0.923989197 | 0.92 |
Y | numeric | p75 | .all | 1.618579997 | 1.62 |
Y | numeric | p100 | .all | 3.165030308 | 3.17 |
Y | numeric | hist | .all | NA | ▁▃▅▇▅▆▂▁ |
Notice that in our function, we specified the vector `W` as above. However, when we generated `A`, rather than giving the `prob` argument of the `rbinom` function a single numeric, we gave it a vector, `plogis(-1 + W)`. The function is smart enough to know that when we give it a scalar argument for `size` but a vector argument for `prob`, we want all `n` draws to come from binomial distributions with `size = 1` but whose success probabilities vary according to the vector `plogis(-1 + W)`.
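This recycling behavior is easy to verify directly; a small sketch in base R:

```r
set.seed(42)
# 2000 success probabilities alternating between 0.05 and 0.95;
# size = 1 is a scalar and gets recycled across all 2000 draws
p_vec <- rep(c(0.05, 0.95), times = 1000)
draws <- rbinom(n = 2000, size = 1, prob = p_vec)
# each group of draws tracks its own success probability
mean(draws[p_vec == 0.05])  # near 0.05
mean(draws[p_vec == 0.95])  # near 0.95
```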
Our function returned a `tibble` object. These objects can be useful for interfacing with certain R functions like `glm`, which we'll meet in a minute (just like the canonical `data.frame`). These objects consist of named vectors of the same length that are stored as columns. The `$` operator accesses a named column of the object. In addition, `tibble`s form the core object of the Tidyverse, a useful set of tools with which you should strive to be familiar.
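A quick sketch of column access with `$`, shown on a base `data.frame` (a `tibble` behaves the same way for this purpose):

```r
# a data.frame is a set of equal-length named vectors stored as columns
df <- data.frame(W = c(0.1, 0.5, 0.9), A = c(0L, 1L, 1L))
# `$` extracts a single column by name, as a plain vector
df$W
# equivalent bracket forms also work
df[["A"]]
```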
We conclude this section with a quick statement of a fact that may be obvious but bears mention. We have not idiot-proofed the function `make_data_O`. That is, the function expects a numeric input `n`, and if we give it something crazy, the results will be unexpected. We can see this like so:
make_data_O(n = "something crazy")
Warning message in runif(n, min = 0, max = 1): “NAs introduced by coercion”
Error in runif(n, min = 0, max = 1): invalid arguments Traceback: 1. make_data_O(n = "something crazy") 2. runif(n, min = 0, max = 1) # at line 3 of file <text>
Above, the function tried to pass the character object `n` into `runif`, which threw an error. If only you will ever use your function, it may not be worth your time to include sanity checks; however, if others will use your function, it's (usually) worth it to build in some checks of function inputs and throw errors if input is unexpected. Here's an example:
# define a function that takes as input n (a numeric)
make_sane_data_O <- function(n){
# check whether n is numeric (is.numeric also covers integer input)
if (!is.numeric(n)) {
stop("n is not numeric. Can't simulate data.")
}
W <- runif(n, min = 0, max = 1)
# plogis is the CDF of the logistic distribution and
# plogis(x) = expit(x) = exp(x)/(1+exp(x))
A <- rbinom(n, size = 1, prob = plogis(-1 + W))
Y <- rnorm(n, mean = W + A, sd = sqrt(1/2))
# return a data.frame object with named columns W, A, Y
return(as_tibble(list(W = W, A = A, Y = Y)))
}
make_sane_data_O(n = "something crazy")
Error in make_sane_data_O(n = "something crazy"): n is not numeric. Can't simulate data. Traceback: 1. make_sane_data_O(n = "something crazy") 2. stop("n is not numeric. Can't simulate data.") # at line 5 of file <text>
The first lab assignment asks you to fit several generalized linear models. A generalized linear model for an observation $O=(W,A,Y)$ assumes that the conditional distribution of $Y$ given $A, W$ has some parametric distribution (typically one from the Exponential Family) and that the conditional mean of $Y$ given $A,W$ can be written as

$$ g[E(Y \mid A, W)] = \beta_0 + \beta_1 A + \beta_2 W \ . $$

There is typically a choice of the link function $g(\cdot)$ associated with a particular parametric family; this is known as the canonical link function. For example, when $Y \in \{0,1\}$ we assume that $Y$ is Bernoulli distributed with $g(x) = \log[x/(1-x)]$ -- also known as the logit link function. R has so-called `family` objects built in that encode these relationships.
# for logistic regression
binomial()
# for linear regression
gaussian()
# for poisson regression
poisson()
# for Gamma regression (make sure G is capital!)
Gamma()
Family: binomial Link function: logit
Family: gaussian Link function: identity
Family: poisson Link function: log
Family: Gamma Link function: inverse
The unknown parameters of a GLM are typically estimated via maximum likelihood, which can be done in R with the `glm` function. Below, we illustrate a couple of different ways to call `glm`.
# we'll use our tibble object data_O_obs from before to fit the
# linear regression Y = \beta_0 + \beta_1 A + \beta_2 W
# first we can specify a formula and a data.frame object
mod_1 <- data_O_obs %>%
glm(formula = "Y ~ A + W", data = ., family = gaussian())
# check the class
class(mod_1)
# summarize the results
summary(mod_1)
Call: glm(formula = "Y ~ A + W", family = gaussian(), data = .) Deviance Residuals: Min 1Q Median 3Q Max -1.51390 -0.44471 -0.03176 0.39016 1.61826 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.1817 0.1343 1.353 0.1792 A 1.2030 0.1395 8.626 0.000000000000123 W 0.7022 0.2323 3.023 0.0032 (Dispersion parameter for gaussian family taken to be 0.4217676) Null deviance: 82.880 on 99 degrees of freedom Residual deviance: 40.911 on 97 degrees of freedom AIC: 202.41 Number of Fisher Scoring iterations: 2
How about logistic regression this time? We'll need new data though...
# define a function that takes as input n (a numeric)
# and returns n copies of O=(W, A, Y) where Y \in \{0,1\}
make_binary_O <- function(n) {
W <- runif(n, min = 0, max = 1)
# plogis is the CDF of the logistic distribution and
# plogis(x) = expit(x) = exp(x)/(1+exp(x))
A <- rbinom(n, size = 1, prob = plogis(-1 + W))
# now Y is binary
Y <- rbinom(n, size = 1, prob = plogis(-1 + W + A))
# return a data.frame object with named columns W, A, Y
return(as_tibble(list(W = W, A = A, Y = Y)))
}
# generate the data and fit logistic regression
binary_O <- make_binary_O(n = 100)
# remember, we need family = binomial for logistic regression
mod_2 <- binary_O %>%
glm("Y ~ A + W", data = ., family = binomial())
# summarize the model output
summary(mod_2)
Call: glm(formula = "Y ~ A + W", family = binomial(), data = .) Deviance Residuals: Min 1Q Median 3Q Max -1.3282 -0.8670 -0.7749 1.1608 1.7023 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.2185 0.4602 -2.648 0.0081 A 0.9839 0.4310 2.282 0.0225 W 0.6116 0.7424 0.824 0.4101 (Dispersion parameter for binomial family taken to be 1) Null deviance: 132.81 on 99 degrees of freedom Residual deviance: 126.20 on 97 degrees of freedom AIC: 132.2 Number of Fisher Scoring iterations: 4
The method `summary.glm` returns Wald-style tests of whether each estimated regression parameter is equal to zero. To access these values, we can look inside the model object like so:
objects(mod_2)
## looks like we might want "coefficients"
mod_2$coefficients
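The full Wald table (estimates, standard errors, test statistics, and p-values) lives in the `coefficients` element of the summary object. A self-contained sketch on freshly simulated toy data (so it doesn't depend on `mod_2` above):

```r
set.seed(22)
# a toy logistic regression fit using base R only
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(x))
fm <- glm(y ~ x, family = binomial())
# point estimates alone
coef(fm)
# the full Wald table: Estimate, Std. Error, z value, Pr(>|z|)
summary(fm)$coefficients
# a single p-value, indexed by term (row) and column name
summary(fm)$coefficients["x", "Pr(>|z|)"]
```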
We now have all the components we need to execute a basic simulation. In frequentist statistics, we care a lot about what happens to a given summary measure of data (i.e., a statistic) over repeated experiments. Whereas in real life, we typically only get data from one or two repeated experiments where we don't know the real answer, in simulations we can generate many, many experiments where we do know the answer.
Suppose we are interested in the power of a statistical test: if the null hypothesis is not true, what is the probability (i.e., the proportion of times over repeated experiments) that we correctly reject it? Below, we define a couple of functions that we will combine into a function that executes one simulation (i.e., one analysis of one data set).
# a function that takes as input 'fm', a glm object, and returns whether
# or not the null hypothesis for the coefficient associated with A is
# rejected at the 0.05 level
is_null_rejected <- function(fm) {
sumFm <- summary(fm)
p_val <- sumFm$coefficients["A","Pr(>|z|)"]
return(p_val < 0.05)
}
# a function that takes as input 'n', a scalar, simulates a data set
# using make_binary_O, estimates a main-terms glm, and determines whether
# the null for the coefficient associated with A is rejected at the 0.05 level
run_one_sim <- function(n) {
# make data set
dat <- make_binary_O(n = n)
# fit glm
fm <- glm("Y ~ A + W", data = dat, family = binomial())
# get hypothesis test result
out <- is_null_rejected(fm)
# return
return(out)
}
# try it out
run_one_sim(n = 100)
Now we need to do this a large number of times. There are many ways to do so; a couple are illustrated below. First, we start with a basic `for` loop.
# number of simulations; in general, more is better, to decrease the
# Monte Carlo error in your answer (i.e., error due to randomness in the draws)
n_sim <- 100
# let's set a seed so the results are reproducible
set.seed(1234)
## get results using a for() loop
# create empty vector of length n_sim that results will go into
results_vector <- rep(NA, n_sim)
for(i in seq_len(n_sim)){
results_vector[i] <- run_one_sim(n = 100)
}
# check out first and last results
head(results_vector)
tail(results_vector)
# power is the proportion of times you reject the null
cat("The power of the test is", mean(results_vector))
The power of the test is 0.58
Next we use R's built-in function `replicate`. This function repeatedly evaluates `expr` a total of `n` times and stores the output. It's exactly like a `for` loop, but with shorter syntax.
# again set a seed so reproducible
set.seed(1234)
# replicate
results_rep <- replicate(n = n_sim, expr = run_one_sim(n = 100))
cat("The power of the test is", mean(results_rep))
The power of the test is 0.58
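For completeness, the same repetition can be written with base R's `vapply`, which (unlike `replicate` or `sapply`) checks that every iteration returns the declared type. The sketch below uses a trivial stand-in for `run_one_sim` so it runs on its own:

```r
set.seed(1234)
# stand-in for run_one_sim: returns TRUE/FALSE for one simulated "test"
one_sim <- function(i) runif(1) < 0.58
# FUN.VALUE declares the type/shape of each result (one logical), so a
# simulation that accidentally returns something else errors loudly
results_vap <- vapply(seq_len(100), one_sim, FUN.VALUE = logical(1))
mean(results_vap)
```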