This document provides a brief introduction to a few of the core constructs of the `R` language and environment for statistical computing. These will be necessary for the first laboratory assignment and will be staples that you will rely on throughout the course.

In [52]:

```
options(repr.plot.width = 4, repr.plot.height = 3) ## resizing plots
options(scipen = 999) ## has scientific notation ever annoyed you?
```

In [54]:

```
# packages for project management
library(usethis) # commonly used package components
library(here) # set paths relative to a .Rproj file
```

Let's take a moment to see why we'd use something like `here`. While `usethis` is not a necessity (unlike `here`), it does make life *a lot* easier.
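The appeal of `here` is that paths are resolved against the project root (located via the `.Rproj` file), so scripts keep working when the project moves between machines. A minimal sketch of the idea; the `here()` calls are shown as comments since they assume a `.Rproj` file exists, while base R's `file.path()` illustrates composing paths from components:

```r
# hard-coded absolute paths break on any other machine:
# read.csv("C:/Users/me/project/data/raw.csv")   # not portable
# with here, the same file is resolved against the project root:
# here("data", "raw.csv")
# base R's file.path() shows the underlying idea: build paths from
# components rather than hard-coding separators
file.path("data", "raw.csv")
#> [1] "data/raw.csv"
```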

In [1]:

```
# some more data-oriented packages we'll use
library(tidyverse) ## core tools for "tidy" data science
library(skimr) ## take a deeper look at your data
```

A big part of what we statisticians do is based on simulating data. You just came up with a method that, in theory, is brilliant/elegant/the-best-thing-since-sliced-bread, but you need to illustrate that the method actually works well in practice. For example, your theoretical results may be based on the assumption of a large sample size and you want to investigate how well the method performs when the sample size is modestly small. Providing such information can help convince readers that your method is in fact a good idea. Simulations can also be helpful for showing when a method breaks down, which is something reviewers also like to see.

To accomplish these tasks, we need to create a data generating experiment where we know the truth, so that we can benchmark our method against it. Let's start with a very simple example. Our simulated experiment enrolls `n = 100` subjects and measures an outcome $Y$ on each subject. Suppose that we want to know how our method performs when $Y \sim N(0,1)$.

In [3]:

```
# we're simulating -- it's important to set the seed for the PRNG
set.seed(3724591) ## note something like '5' is a BAD seed, it might be worth reading about PRNGs
# simulate 100 draws from a normal distribution.
Y <- rnorm(n = 100, mean = 0 , sd = 1)
# we'll use skimr::skim to take a look at y
skim(Y)
```

The above code chunk illustrates several important `R` functions.

- First is `rnorm`, which simulates `n` observations from a standard Normal distribution with `mean = 0` and standard deviation `sd = 1`.
- We also illustrated several functions that are useful for ensuring that the data were created properly.
- The function `skim` from the `skimr` package provides a catch-all way to take a look at some important properties of a data set (in this case, just a vector). We can see that we received information about the quantiles (just like `summary` would provide), in addition to missingness/completeness, and even a histogram!

Now, let's move on to visualizing our data. We'll do this with `ggplot2`, a modern graphics engine that implements the Grammar of Graphics.

In [14]:

```
# let's take a closer look at the histogram of Y
p1 <- gather(as_tibble(Y)) %>%
  ggplot(., aes(x = value)) +
  geom_histogram(aes(y = ..density..))
p1
```

Ok, so there's a lot going on above. Let's take a moment to dissect it. The `ggplot2` engine requires that we pass in a table of "tidy" data.

What's tidy data?

- each observation in a single row
- each variable in a single column
- each value in a cell.

To tidy up our data, we can use the `gather` function from the `tidyr` package.
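The same wide-to-long idea exists in base R: `stack()` turns the named columns of a data frame into a two-column (values, ind) layout, which gives a flavor of what `gather` does for tibbles. A minimal sketch with a made-up toy data frame:

```r
# a small "wide" data frame with two named columns
wide <- data.frame(a = 1:2, b = 3:4)
# stack() reshapes it to long format: one row per value, plus a
# column recording which original column each value came from
long <- stack(wide)
long
#>   values ind
#> 1      1   a
#> 2      2   a
#> 3      3   b
#> 4      4   b
```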

Some questions:

- Really, are you kidding? What's a `tibble`? It's just a cleaner implementation of a `data.frame`. There are a few re-implementations of this core object because of its inefficiencies. Another is the very useful `data.table`.
- What's this "`%>%`" thing all about? It's a "pipe" operator, introduced initially by the `magrittr` package. We'll come back to this.
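As a quick preview of the pipe: it passes the value on its left as an argument to the call on its right, so nested calls unspool into a left-to-right chain. A minimal sketch using base R's native `|>` pipe (available in R >= 4.1); `magrittr`'s `%>%` behaves analogously:

```r
# without a pipe, the computation reads inside-out
round(sqrt(sum(c(1, 4, 9))), 2)
# with a pipe, the same steps read left to right
c(1, 4, 9) |> sum() |> sqrt() |> round(2)
#> [1] 3.74
```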

How does `ggplot2` work?

- We're composing a plot in *layers*.
- First, we add a core data layer: this is the `tibble` that is passed in by the pipe. With `aes`, we specify that the x-axis of our plot is going to be the variable `value` (n.b., this is created implicitly by `tidyr::gather`).
- Next, we specify the type of plot -- we're making a histogram, hence the use of the aptly named `geom_histogram`. By specifying `aes` here, we note that the y-axis is going to be a density (we could have also used `aes(y = ..count..)`).
- We could just stop here, but there's more we can do to improve our plot.

In [18]:

```
p2 <- p1 +
stat_function(fun = dnorm, args = list(mean = 0, sd = 1),
linetype = "dashed") + ## let's add a normal distribution to compare our histogram to
xlim(c(-3, 3)) + ## we'll set hard limits to the x-axis to make the visualization clearer
xlab("observed values of y") + ## labeling our axes never hurts
ggtitle("Density Histogram of R.V. Y") + ## titles are good too...
theme_bw() ## let's get rid of that annoying gray background
p2
```

Much nicer than the plot we saw before!

- Consult the comments in the above block to see what each step of the layering does.
- Note that we simply constructed a new plot (`p2`) by adding layers to the first plot (`p1`) -- both are preserved!

Alright, now that we've taken a look at `ggplot2` and some other core tools of the Tidyverse, let's resume talking about simulations:

There are other distributions we might want to simulate from as well.

In [24]:

```
# simulate 100 draws from a uniform(0,1) distribution
Y_unif <- runif(n = 100, min = 0, max = 1)
# simulate 100 draws from a bernoulli(0.5) distribution
# recall that a bernoulli(p) distribution is just a binomial(n=1, p)
Y_bern <- rbinom(n = 100, size = 1, prob = 0.5)
# simulate 100 draws from a gamma(1,2) distribution
Y_gamma <- rgamma(n = 100, shape = 1, scale = 2)
# Let's skim these
y_rv <- as_tibble(list(unif = Y_unif, bern = Y_bern, gamma = Y_gamma))
skim(y_rv)
```

`R` has several other functions for simulating data from various distributions. Additionally, many packages have been written for simulating data from still other distributions. Googling "how to simulate from XYZ distribution" will likely be sufficient to figure out what package/functions you will need.

Often, our experiment measures more than a single variable on each unit, so we might want to write a function that returns `n` copies of $O = (W, A, Y)$. Suppose that $O$ is distributed as follows:

$$ W \sim \text{Uniform}(0, 1) \ , \quad A \mid W \sim \text{Bernoulli}\{\text{expit}(-1 + W)\} \ , \quad Y \mid A, W \sim N(W + A, 1/2) \ . $$

We can now write a function that takes as input `n` and returns a `data.frame` object with rows corresponding to observations $O$ and columns corresponding to $W, A, Y$.

In [25]:

```
# define a function that takes as input n (a numeric)
make_data_O <- function(n) {
  W <- runif(n, min = 0, max = 1)
  # plogis is the cdf of the logistic distribution and
  # plogis(x) = expit(x) = exp(x)/(1+exp(x))
  A <- rbinom(n, size = 1, prob = plogis(-1 + W))
  Y <- rnorm(n, mean = W + A, sd = sqrt(1/2))
  # return a tibble with named columns W, A, Y
  return(as_tibble(list(W = W, A = A, Y = Y)))
}
# make a data set with 100 observations
data_O_obs <- make_data_O(n = 100)
# confirm that make_data_O() returned a tibble object
class(data_O_obs)
```

In [26]:

```
# let's skim over the observed data
skim(data_O_obs)
```

Notice that in our function, we specified the vector `W` as above. However, when we generated `A`, rather than giving the `prob` argument of the `rbinom` function a single numeric, we gave it the vector `plogis(-1 + W)`. The function is smart enough to know that when we give it a scalar argument for `size` but a vector argument for `prob`, we want all `n` observations to be random draws from a binomial distribution with `size = 1`, but with probabilities that vary according to the vector `plogis(-1 + W)`.
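This recycling behavior is easy to verify directly. In the sketch below the probabilities are 0 and 1, so the "random" draws are in fact deterministic:

```r
# a vector passed to prob yields one draw per probability; with
# prob = 0 the draw is always 0 and with prob = 1 always 1
rbinom(n = 3, size = 1, prob = c(0, 0, 1))
#> [1] 0 0 1
```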

Our function returned a `tibble` object. These objects consist of named vectors of the same length that are stored as columns; using the `$` operator accesses a named column of the object. Like the canonical `data.frame`, they can be useful for interfacing with certain functions, such as `glm`, which we'll meet in a minute. In addition, `tibble`s form the core object of the Tidyverse, a useful set of tools with which you should strive to be familiar.
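A quick sketch of the `$` accessor on a plain `data.frame` (a tibble behaves the same way for this purpose); the toy values here are made up purely for illustration:

```r
df <- data.frame(W = c(0.1, 0.9), A = c(0L, 1L), Y = c(1.2, 2.3))
df$Y         # extracts the column named Y as an ordinary vector
#> [1] 1.2 2.3
mean(df$Y)   # extracted columns work with any vector function
#> [1] 1.75
```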

We conclude this section with a quick statement of a fact that may be obvious but bears mention. We have not idiot-proofed the function `make_data_O`. That is, the function expects a numeric input `n`, and if we give it something crazy, the results will be unexpected. We can see this like so:

In [27]:

```
make_data_O(n = "something crazy")
```

Above, the function tried to pass the character object `n` into `runif`, which threw an error. If only you will ever use your function, it may not be worth your time to include sanity checks; however, if others will use your function, it's (usually) worth it to build in some checks of function inputs and throw errors if the input is unexpected. Here's an example:

In [29]:

```
# define a function that takes as input n (a numeric)
make_sane_data_O <- function(n) {
  # check whether n is numeric; is.numeric() is more robust than
  # comparing class(n) to "numeric"
  if (!is.numeric(n)) {
    stop("n is not numeric. Can't simulate data.")
  }
  W <- runif(n, min = 0, max = 1)
  # plogis is the cdf of the logistic distribution and
  # plogis(x) = expit(x) = exp(x)/(1+exp(x))
  A <- rbinom(n, size = 1, prob = plogis(-1 + W))
  Y <- rnorm(n, mean = W + A, sd = sqrt(1/2))
  # return a tibble with named columns W, A, Y
  return(as_tibble(list(W = W, A = A, Y = Y)))
}
```

In [30]:

```
make_sane_data_O(n = "something crazy")
```

- *R for Reproducible Scientific Analysis* (Software Carpentry)
- Tools for Reproducible Research (Karl Broman, UW-Madison)
- *Version Control with git* (Software Carpentry)
- *Happy git and GitHub for the useR* (Jenny Bryan)
- git for Humans (Alice Bartlett)
- Introduction to git (Berkeley Stat 259, KJ Millman)
- Interactive git Tutorial (GitHub)

The first lab assignment asks you to fit several generalized linear models. A generalized linear model for an observation $O = (W, A, Y)$ assumes that the conditional distribution of $Y$ given $A, W$ has some parametric distribution (typically one from the Exponential Family) and that the conditional mean of $Y$ given $A, W$ can be written as

$$ g[E(Y | A, W)] = \beta_0 + \beta_1 A + \beta_2 W \ . $$

There is typically a choice of the link function $g()$ that is associated with a particular parametric family; this is known as the canonical link function. For example, when $Y \in \{0,1\}$ we assume that $Y$ is Bernoulli distributed with $g(x) = \log[x/(1-x)]$ -- also known as the logit link function. `R` has so-called `family` objects built in that can identify these relationships.

In [35]:

```
# for logistic regression
binomial()
# for linear regression
gaussian()
# for poisson regression
poisson()
# for Gamma regression (make sure G is capital!)
Gamma()
```
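Each `family` object also carries its canonical link and inverse link as functions, which you can inspect directly; a quick sketch:

```r
fam <- binomial()
fam$link          # the name of the canonical link
#> [1] "logit"
fam$linkfun(0.5)  # logit(0.5) = log(0.5/0.5) = 0
#> [1] 0
fam$linkinv(0)    # inverse logit (expit) of 0 is 0.5
#> [1] 0.5
```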

The unknown parameters of a GLM are typically estimated via maximum likelihood, which can be achieved in `R` through the use of the `glm` function. Below, we illustrate a couple of different ways to call `glm`.

In [37]:

```
# we'll use our tibble data_O_obs from before to fit the
# linear regression Y = \beta_0 + \beta_1 A + \beta_2 W
# first we can specify a formula and a data.frame object
mod_1 <- data_O_obs %>%
  glm(formula = "Y ~ A + W", data = ., family = gaussian())
# check the class
class(mod_1)
# summarize the results
summary(mod_1)
```

How about logistic regression this time? We'll need new data though...

In [38]:

```
# define a function that takes as input n (a numeric)
# and returns n copies of O=(W, A, Y) where Y \in \{0,1\}
make_binary_O <- function(n) {
  W <- runif(n, min = 0, max = 1)
  # plogis is the cdf of the logistic distribution and
  # plogis(x) = expit(x) = exp(x)/(1+exp(x))
  A <- rbinom(n, size = 1, prob = plogis(-1 + W))
  # now Y is binary
  Y <- rbinom(n, size = 1, prob = plogis(-1 + W + A))
  # return a tibble with named columns W, A, Y
  return(as_tibble(list(W = W, A = A, Y = Y)))
}
```

In [42]:

```
# generate the data and fit logistic regression
binary_O <- make_binary_O(n = 100)
# remember, we need family = binomial for logistic regression
mod_2 <- binary_O %>%
  glm("Y ~ A + W", data = ., family = binomial())
# summarize the model output
summary(mod_2)
```

The method `summary.glm` returns Wald-style tests that the estimated regression parameters are equal to zero. To access these values, we can look inside the model object like so:

In [43]:

```
objects(mod_2)
```

In [44]:

```
## looks like we might want "coefficients"
mod_2$coefficients
```

We now have all the components we need to execute a basic simulation. In frequentist statistics, we care a lot about what happens to a given summary measure of data (i.e., a statistic) over repeated experiments. Whereas in real life we typically only get data from one or two experiments and don't know the real answer, in simulations we can generate many, many experiments where we do know the answer.

Suppose we are interested in the power of a statistical test. That is, if the null hypothesis is not true, what is the probability (i.e., proportion of times over repeated experiments) that we correctly reject the null hypothesis? Below, we define a couple of functions that we will combine into a function that executes one simulation (i.e., one analysis for one data set).

In [46]:

```
# a function that takes as input 'fm', a glm object, and returns whether
# or not the null hypothesis that the coefficient associated with A is
# zero is rejected at the 0.05 level
is_null_rejected <- function(fm) {
  sum_fm <- summary(fm)
  p_val <- sum_fm$coefficients["A", "Pr(>|z|)"]
  return(p_val < 0.05)
}
# a function that takes as input 'n', a scalar, simulates a data set
# using make_binary_O, estimates a main-terms glm, and determines whether
# the null for the coefficient associated with A is rejected at the 0.05 level
run_one_sim <- function(n) {
  # make data set
  dat <- make_binary_O(n = n)
  # fit glm
  fm <- glm("Y ~ A + W", data = dat, family = binomial())
  # get hypothesis test result
  out <- is_null_rejected(fm)
  # return
  return(out)
}
# try it out
run_one_sim(n = 100)
# try it out
run_one_sim(n = 100)
```

Now we need to do this a large number of times. There are many ways to accomplish this; a couple are illustrated below. First, we start with a basic `for` loop.

In [49]:

```
# number of simulations, in general more is better to decrease
# the Monte Carlo error in your answer (i.e., error due to random seed used)
n_sim <- 100
# let's set a seed so the results are reproducible
set.seed(1234)
## get results using a for() loop
# create an empty vector of length n_sim that results will go into
results_vector <- rep(NA, n_sim)
for (i in seq_len(n_sim)) {
  results_vector[i] <- run_one_sim(n = 100)
}
# check out first and last results
head(results_vector)
tail(results_vector)
# power is the proportion of times you reject the null
cat("The power of the test is", mean(results_vector))
```

Next we use `R`'s built-in function `replicate`. This function repeatedly evaluates an expression `expr` a total of `n` times and stores the output. It's exactly like a `for` loop, but with shorter syntax.

In [51]:

```
# again set a seed so reproducible
set.seed(1234)
# replicate
results_rep <- replicate(n = n_sim, expr = run_one_sim(n = 100))
cat("The power of the test is " , mean(results_rep))
```
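A third option is `vapply`, which works like `sapply` over a vector of indices but checks that every iteration returns the declared type, a useful safeguard in long simulations. A self-contained sketch using `toy_sim`, a hypothetical deterministic stand-in for `run_one_sim`:

```r
# hypothetical stand-in for run_one_sim: "reject" on even indices
toy_sim <- function(i) i %% 2 == 0
# FUN.VALUE declares that each result must be a single logical value
results_vap <- vapply(seq_len(10), toy_sim, FUN.VALUE = logical(1))
mean(results_vap)  # proportion of rejections, analogous to power
#> [1] 0.5
```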