In this brief tutorial, we are going to see how to make use of the different probability distributions in R.

There is one simple rule here: no matter which distribution we're talking about, there will ALWAYS be a d function, a p function, a q function and an r function, each representing the following:

d: probability density. For discrete distributions, this is the probability of obtaining a particular outcome under that distribution; for continuous distributions, it is the value of the density curve at that outcome.
p: cumulative distribution function. You specify a particular value (a quantile), and it returns the probability of obtaining an outcome smaller than or equal to that value.
r: random number generation. It generates n random outcomes from the distribution.
q: quantile function. You specify a probability, and it returns the value of the variable for which the probability of obtaining an outcome smaller than or equal to that value equals the specified probability. It is the inverse of the p function.
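As a quick illustration of the naming rule, here is a minimal sketch using the standard normal distribution (suffix norm, with its default mean = 0 and sd = 1). The variable names dens, cum, quan and draws are only illustrative.

```r
# Naming pattern: prefix (d, p, q, r) + distribution suffix, here "norm"
dens <- dnorm(0)       # height of the density curve at x = 0
cum  <- pnorm(1.96)    # P(X <= 1.96)
quan <- qnorm(cum)     # quantile for that probability: recovers 1.96
set.seed(1)            # fix the seed so the draws are reproducible
draws <- rnorm(5)      # five random outcomes

dens
cum
quan
draws
```

Note how qnorm undoes pnorm: feeding the cumulative probability back into the quantile function returns the original value.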

We'll understand all these better now with several examples, which you guys will complete in class.

Binomial Distribution

The binomial distribution models the number of successes obtained in a fixed number of independent trials, each with the same probability of success.

In R, for this distribution we will have:

dbinom
pbinom
rbinom
qbinom

In [1]:

args(dbinom)
args(pbinom)
args(rbinom)
args(qbinom)

function (x, size, prob, log = FALSE) 
NULL

function (q, size, prob, lower.tail = TRUE, log.p = FALSE) 
NULL

function (n, size, prob) 
NULL

function (p, size, prob, lower.tail = TRUE, log.p = FALSE) 
NULL

Here, size (the number of trials) and prob (the probability of success in each trial) are the parameters of the binomial distribution. Let's look at the documentation to see the meaning of the rest of the arguments.

In [2]:

?dbinom
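To get a feel for two of the remaining arguments, lower.tail and log, here is a small sketch. The setting (10 trials with success probability 0.3) and the variable names are arbitrary choices for illustration.

```r
# lower.tail = FALSE returns the upper tail, P(X > q), instead of P(X <= q)
upper       <- pbinom(3, size = 10, prob = 0.3, lower.tail = FALSE)
upper_check <- 1 - pbinom(3, size = 10, prob = 0.3)

# log = TRUE returns the natural logarithm of the probability
log_p <- dbinom(3, size = 10, prob = 0.3, log = TRUE)
plain <- dbinom(3, size = 10, prob = 0.3)

all.equal(upper, upper_check)   # both ways give the same upper tail
all.equal(exp(log_p), plain)    # exponentiating recovers the probability
```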

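For example, say that we have a fair coin that we toss three times. There are three sequences with exactly two heads (HHT, HTH and THH), each with probability 0.5 * 0.5 * 0.5. Then, the probability of getting exactly two heads in these tosses would be: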

In [3]:

(0.5 * 0.5 * 0.5) + (0.5 * 0.5 * 0.5) + (0.5 * 0.5 * 0.5)

0.375

This is the same as using dbinom as follows:

In [4]:

# i.e. 3 * (0.5 * 0.5 * 0.5), as expected
dbinom(2, size=3, prob=0.5)

0.375
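The same number can be obtained from the binomial formula, choose(n, k) * p^k * (1 - p)^(n - k). A minimal sketch, where n, k, p and manual are illustrative names:

```r
n <- 3     # number of tosses
k <- 2     # number of heads
p <- 0.5   # probability of heads in each toss

# choose(n, k) counts the sequences with exactly k heads
manual <- choose(n, k) * p^k * (1 - p)^(n - k)
manual                                            # 0.375
all.equal(manual, dbinom(k, size = n, prob = p))  # TRUE
```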

Now say we want to calculate the probability of getting two heads or fewer. For this, in addition to the previous probability, we need to add the probabilities of getting no heads and one head.

In [5]:

p.0 <- (0.5 * 0.5 * 0.5) # probability of no heads
p.1 <- (0.5 * 0.5 * 0.5) + (0.5 * 0.5 * 0.5) + (0.5 * 0.5 * 0.5) # probability of one head
p.2 <- (0.5 * 0.5 * 0.5) + (0.5 * 0.5 * 0.5) + (0.5 * 0.5 * 0.5) # probability of two heads
p.0 + p.1 + p.2

0.875

But the above is the same as computing the cumulative probability, so we should be able to get it with pbinom.

In [6]:

pbinom(2, size=3, prob=0.5)

0.875
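Since dbinom is vectorized over its first argument, the manual sum can also be written in one line. A small sketch (cum_manual is an illustrative name):

```r
# Sum the probabilities of 0, 1 and 2 heads in 3 tosses of a fair coin
cum_manual <- sum(dbinom(0:2, size = 3, prob = 0.5))
cum_manual                                             # 0.875
all.equal(cum_manual, pbinom(2, size = 3, prob = 0.5)) # TRUE
```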

In the following, we are going to put all this into practice using a die instead of a coin, and later a Gaussian distribution, but the logic behind the use of the above functions will always be the same.
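For completeness, the q and r functions work the same way for the coin. A brief sketch, where the probability 0.8, the seed and the variable names are arbitrary choices for illustration:

```r
# Quantile: the smallest number of heads k such that P(X <= k) >= 0.8.
# P(X <= 1) = 0.5 and P(X <= 2) = 0.875, so the answer is 2.
q_heads <- qbinom(0.8, size = 3, prob = 0.5)
q_heads                                     # 2

# Random generation: repeat the 3-toss experiment 10 times
set.seed(42)                                # reproducible draws
tosses <- rbinom(10, size = 3, prob = 0.5)  # heads in each experiment
tosses
```

Each element of tosses is the number of heads in one experiment of three tosses, so it is always between 0 and 3.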

Practice: If you roll a die 20 times, compute the probability of getting exactly x = 4 sixes.

Practice: If you roll a die 20 times, compute the probability of getting x = 4 sixes or fewer.

Practice: With the same setting, which number of sixes corresponds to a cumulative probability of 0.6? That is, find the 0.6 quantile.

Practice: Simulate 100 random numbers with the same setting as above, i.e. 20 rolls per experiment.

Practice: See what happens when the different parameters in the binomial distribution change. For example, generate 1000 random numbers, with the same probability as before, but now for sizes 20, 50 and 100. In each case, plot the histogram.

Practice: Now, do the same but varying the probabilities of success (e.g. consider 1/6, 3/6 and 5/6). Keep the number of trials, i.e. the parameter size, fixed to 20, for example.

Gaussian Distribution

dnorm
pnorm
rnorm
qnorm

In [7]:

args(dnorm)
args(pnorm)
args(rnorm)
args(qnorm)

function (x, mean = 0, sd = 1, log = FALSE) 
NULL

function (q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) 
NULL

function (n, mean = 0, sd = 1) 
NULL

function (p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE) 
NULL

Practice: In week 2, on descriptive statistics, we mentioned that, for Gaussian distributions, about 68% of the points lie within 1 standard deviation of the mean, 95% within 2 standard deviations, and 99.7% within 3 standard deviations. Let's demonstrate this using pnorm. Since this function gives the cumulative probability up to a certain value (the area under the curve to the left of that value), we can just use a difference of pnorms to get the areas that we desire. Let's suppose the mean is zero and the standard deviation is 5.

In [9]:

m<-0 # mean
s<-5 # standard deviation

In [10]:

# Just to visualize what we are saying
hist(rnorm(1000, mean = m, sd = s))
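One possible way to carry out the check, sketched under the values above (mean 0 and standard deviation 5, redefined here so the snippet runs on its own). The helper name within is illustrative.

```r
m <- 0   # mean
s <- 5   # standard deviation

# Area between m - k*s and m + k*s, for k = 1, 2, 3 standard deviations
within <- sapply(1:3, function(k) {
  pnorm(m + k * s, mean = m, sd = s) - pnorm(m - k * s, mean = m, sd = s)
})
round(within, 4)   # 0.6827 0.9545 0.9973
```

The result does not depend on the particular values of m and s, because the interval limits scale with the standard deviation.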

Practice: Now, generate 1000 random numbers from a Gaussian distribution with mean = 1 and sd = 0.1. Then plot the distribution of these numbers using ggplot, with geom_histogram and an overlaid density curve from the function geom_density (you may visit this page to see how to do this). Do you guys see anything weird in your plot?

Finally, we said in the lectures that the Student's t-distribution looks like a Gaussian distribution but with heavier tails. Let's see this by generating 1000 random numbers under both distributions and plotting their density curves together, using the geom_density function we've just learned.

Practice: Generate 1000 random numbers from a Gaussian distribution with mean = 0 and sd = 1. Then generate 1000 random numbers from a t-distribution using the function rt with df = 15.
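Independently of the simulation, the heavier tails can also be checked directly with the p functions. A minimal sketch, where the cutoff of -3 and the variable names are arbitrary choices for illustration:

```r
# Probability of an outcome at or below -3 under each distribution
tail_norm <- pnorm(-3)         # standard Gaussian
tail_t    <- pt(-3, df = 15)   # t-distribution with 15 degrees of freedom

tail_norm
tail_t
tail_t > tail_norm             # TRUE: the t-distribution puts more mass in the tail
```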