# This chapter uses the package ggformula, which takes a few moments to download on Google Colab.
# Run this first so it's ready by the time you need it
install.packages("ggformula")
library(ggformula)
We can learn a lot by examining distributions and summaries of a dataset. We can describe the current state of some situation for a clearer picture of the reality around us. But in data analysis, our interest usually goes beyond these data alone. We are generally looking at data because we are trying to find insights into the broader population it came from. What will a new data point look like? What will the future look like, so we are prepared for it? This is the goal of using statistics for prediction.
We may also want to find out why our data are the way they are; to infer the ways in which the world works. What made the data come out this way? As we discussed earlier, statistics alone can't determine cause and effect, but it can give us important clues about the mechanisms of the human mind and society.
This is where we move from descriptive statistics (describing features of a dataset) to inferential statistics (using that dataset to understand the population it came from). We are going to switch from asking questions like "what do the data look like" to trying to understand what could have produced the specific pattern of variation we see in our data. In this chapter, let's explore various reasons why a dataset may end up looking the way it does, given a particular population. This will help us build intuitions for later, when we try to infer the population when all we have is a sample of data.
First, let's define what we mean by the word population above. The population of some data is the distribution of all possible values those data could take. In psychology, that often means some quality about people - all people who exist and who will ever exist. We have been using a dataset of student demographics like thumb length, but that dataset does not contain the only thumb values out there. Many more people have thumbs and heights than are included in our dataset. In contrast, a sample is a subset of the population. Our dataset is a sample, because it includes some but not all of the people who have thumbs.
Of course, people are not the only population one might be interested in. If you're in astronomy, you could be interested in the population of all stars in the universe. If you're a biologist, maybe you're interested in all bacteria of a certain strain. Populations don't have to be physical entities, either; as an engineer building a bridge, you may be concerned with the population of all possible force vectors that your bridge could experience.
Even if you are studying people, your population might not consist of all people. Maybe we don't care about understanding the typical thumb length of all people, but rather college students specifically. Or perhaps we're only concerned with students in our own class - in that case our sample is our population of interest, and we don't have to do any inferential statistics at all!
Usually though, we are interested in more general populations in the world. In those cases, it is impractical or even impossible to measure every datapoint from the population. A sample is all we have. Thus, it is very important to have a good idea of what population we are trying to understand. Only then can we use our sample for inference.
Now, what makes a good sample? If we want to use a sample to make inferences about a population, we want that sample to be representative of that population. In other words, the sample distribution should closely resemble the population distribution.
Political polling is a good example of why it's important to have representative samples. Sometimes, polls can be incredibly accurate at predicting the outcomes of elections. The best known example comes from the 2008 and 2012 US Presidential elections, when the pollster Nate Silver correctly predicted electoral outcomes for 49/50 states in 2008 and for all 50 states in 2012. Silver did this by combining data from 21 different polls, each of which included data from about 1000 likely voters. If we consider the population of interest during an election to be all votes cast, then in this case the intentions of those roughly 1000-person samples very closely matched the overall voting behavior of the entire electorate. Yet the 2016 US Presidential election was noteworthy for how poorly polls predicted the outcome (Clinton was strongly favored to win by many polling samples, but the electorate behaved differently). This is because the people who came out to vote for Donald Trump were not well represented in the poll samples. Bad predictions were made because of bad sampling.
Do you think our studentdata dataset is a good representation of what that broader population of thumb lengths is like? What about the population of thumb lengths from college students specifically? What if our sample came only from Scripps, given what we've seen about sex and thumb length variance? Before you do research and publish your results, make sure you are able to collect a sample that speaks to the population you are interested in.
How do data make it from the population to being in a sample? The selection of certain data points over others is called the sampling process, and will determine how representative the sample is.
Many samples are based on practicality, where the sample consists of the datapoints that were easiest to obtain. This is called convenience sampling. The samples are chosen in a way that is convenient to the researcher. In real life, most studies are convenience samples of one form or another. If you do a thesis on human cognition but collect data from other students, this is inherently a convenience sample because it is much easier for you to access other students than people from the opposite side of the world.
Convenience sampling sometimes does a good job of creating a representative sample, but it can instead create a biased sample if there's some reason why the sample is easier to collect that also affects the variables being studied. E.g., can you think of any ways college students might differ from other people? Would those differences matter for understanding cognitive processes?
Over time, statisticians have figured out that the best way to pick a sample that most closely resembles the shape, center, and variability of the target population is to draw a random sample. Random in this case means every member of the population has an equal chance of getting picked for the sample. This is usually impractical, though. For a psychologist, it's not as easy to contact, recruit, or even know about some people in the population compared to others (indeed, published psychology results are often based on an over-representation of US psychology students, since universities require their psych majors to participate in studies!). For an ecologist, it is much easier to access animals that live in one site nearby, versus the same species that lives in a remote canyon. You can probably think of examples in other fields of study that make it hard to sample every member of the population with equal likelihood.
In addition, it is important to know the difference between sampling with replacement and sampling without replacement. With replacement means that, after picking a data point out of the entire population for inclusion in the sample, you record its value and then put it back "into the bag" of the population, so that it has a chance of being selected again. Without replacement is the opposite - once a data point is recorded, it can't be selected again. In real world data collection, you almost always use sampling without replacement because you want to collect as much unique information in your data as possible. I.e., once someone has participated in your study, you don't let them participate again. However, for some theoretical or simulation procedures that we will do later in this chapter, sampling with replacement is an underlying assumption.
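To make this distinction concrete, here is a small sketch using R's sample() function (which we'll introduce properly later in this chapter); the thumb-length values below are made up purely for illustration:
#a tiny "population" of five thumb lengths (made-up values, just for illustration)
tiny_population <- c(55, 60, 62, 65, 70)
#sampling without replacement: each value can be picked at most once
sample(x = tiny_population, size = 3, replace = FALSE)
#sampling with replacement: the same value can show up more than once
sample(x = tiny_population, size = 3, replace = TRUE)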
Interestingly, even though random sampling is the most representative sampling process, that doesn't always mean the sample you end up with will look exactly like the population, or even other samples. For example, the picture below shows a population of thumb lengths from which several N=100 random samples were drawn:
Despite all coming from the same population, and all being generated with the same random sampling process, these samples are somewhat different from each other. Some are more skewed or peaked than others, and they all have different means, which are themselves different from the population mean. If random sampling is representative, why doesn't it produce samples that are simply mini versions of the population distribution?
To understand this, we need to take a detour into learning about probability.
Let's consider a much easier population to understand - rolls on a six-sided die. There are only six options. The set of these options is known as the sample space in probability theory, and would be written like {1,2,3,4,5,6}. For a fair die, all of these options have the same chance of occurring. Because of this, the distribution of the population die rolls looks like a uniform distribution:
Now let's draw a very small sample from this population - just one die roll. This is known as an elementary event in probability theory, and may be written as X = E (where X is a die roll variable that takes on the value of some event E for one specific roll). Obviously with a sample like this, we can't get a sample distribution that looks like above. We can only get one value (e.g., X = 2). If we draw another one-roll event, this value might be different (e.g., X = 4).
With one roll, we can't get a sample that looks like the population. We also can't predict what the exact value of any one roll will be. However, if we were to roll the die many times, the proportion of times the die came up with a specific value would be about 1/6, since there are six possible options in the sample space.
The proportion of times an elementary event will equal the value of one outcome in the sample space is the probability of that outcome. While you can't know for sure if the next roll of the die will equal 6, the probability that it does is 1/6 (~0.167). Written mathematically:
$$P(X = 6) = 0.167$$

If you guessed that the next roll of the die would be 6, you will be right about 16.7% of the time.
Incidentally, take a look again at the die roll population distribution above, and notice the density value of each roll option. It equals 0.167, the same as the probability of each option. Because of this, population distributions are also known as probability distributions. The proportion of the population that equals a particular value is the probability of getting that value if we randomly drew one event from the population. Since probabilities are proportions of data, there are some basic axioms about what probability an event can have:
1. Probability cannot be negative: P(X = Ei) ≥ 0 for every elementary event Ei.
2. The total probability of all outcomes in the sample space is 1: P(X = E1) + P(X = E2) + ... + P(X = EN) = 1. This is interpreted as saying "Take all of the N elementary events, which we have labeled from 1 to N, and add up their probabilities. These must sum to one." Another way of saying this, which will come in handy later in the course, is that the probability of NOT Ei is 1 - P(X = Ei).
3. The probability of any individual event cannot be greater than one: P(X = Ei) ≤ 1. This is implied by the previous points; since the probabilities must sum to one, and they can't be negative, then any particular probability cannot exceed one.
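As a quick sanity check, here is a minimal sketch verifying these axioms in R for a fair six-sided die, where each outcome has probability 1/6:
#probabilities of the six outcomes on a fair die
die_probs <- rep(1/6, 6)
#axiom 1: no probability is negative
all(die_probs >= 0)
#axiom 2: the probabilities sum to one
sum(die_probs)
#and the complement: the probability of NOT rolling a 6
1 - die_probs[6]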
For any population you may want to study, every value in that population has some probability of ending up in your sample. The math of all inferential statistics is thus built on top of probability theory. That is why traditionally, before computers were common, statistics students learned the axioms and equations derived in probability theory in order to hand-calculate analyses. If you were taking this class in the math department, we would teach you those next.
However, now that we are in a world where we all have a device in our pocket with more compute power than the first mission to the moon, another approach is available that doesn't involve memorizing equations. We can simulate the sampling process to understand why data samples turn out the way they do.
"Simulation" may sound complicated, based on popular ideas of super computers and simulated realities. But simulating a sample of data like the thumb samples in section 8.2 is easy to do in R. We just need to know the population/probability distribution to sample from, and then we will use the function sample()
.
For example, let's simulate one roll of a six-sided die:
#possible values of the die
sample_space <- c(1,2,3,4,5,6)
#probabilities of each value
probs <- c(0.167,0.167,0.167,0.167,0.167,0.167)
#sample() takes a sample space vector, sample size, replacement flag, and probability vector
simulated_roll <- sample(x = sample_space, size = 1, replace = TRUE, prob = probs)
simulated_roll
In the code above, you just generated a random roll of a die. Let's dive into what each argument in sample() does:

- x: this is where you define what the sample space is for your population
- size: how many events you want to sample from the population (i.e., sample size)
- replace: a boolean value, where you tell the function whether you want to sample with replacement or not. It doesn't matter much for a sample size of 1, but this will come in handy later if we want to be able to redraw a value from the sample space or not.
- prob: the corresponding probabilities of each option in the sample space. Note that this argument isn't required; if we left it out, R would default to all options having the same probability. Use this if you want to set different probabilities.

When you ran the code above, what value did your die roll get? Remember that we can never perfectly predict the value of one event - we only know its probability across many events. Thus, you likely got a different value than another student did. In fact, if you execute the above code multiple times, you will get different die roll results even though you didn't change any code. Try it!
This is because sample(), as the name implies, performs random sampling. However, if you ever want to make a random process in R spit out the same value every time you run it, include the function set.seed() on the line before. Put any number you want as the argument to this function. Each different number operates as a separate "seed" from which your random outcome deterministically "grows." Code run with the same seed will always produce the same sample. This is useful if you want to check your work or someone else needs to reproduce it.
#setting the seed to make outcome always the same
set.seed(47)
simulated_roll <- sample(x = sample_space, size = 1, replace = TRUE)
#this value should be 3
simulated_roll
Now that we understand the basics of probability and simulation, we can finally return to the question posed in section 8.2: why don't random samples always look like the population they came from?
We can see this play out in our own simulations. Instead of rolling a die just once, let's draw a slightly larger sample of 10 rolls. Make a change in the code below to create a sample of 10 rolls instead of just one.
#make a change here
roll_sample <- sample(x = sample_space, size = 1, replace = TRUE, prob = probs)
#put the sample into a dataframe for plotting
rolls_df <- data.frame(roll_sample)
#plot all the die rolls
gf_dhistogram(~ roll_sample, data = rolls_df, color= "black", fill = "turquoise", bins = 6)
We now have a distribution of data instead of just one value, but it does not resemble the uniform distribution we saw for the population of all rolls. Some values came up more often than others, even though they all have the same probability of occurring. Further, try running your code multiple times to get many different samples and compare them to each other. In some samples one outcome happens frequently, in other samples it doesn't happen at all.
However, if our sample size is larger:
The sample distribution increasingly resembles the population distribution. This phenomenon is known as the law of large numbers: as the sample size grows, the proportion of each outcome in the sample gets closer and closer to its true probability. Even when you know the underlying probability distribution (since we defined the probabilities of all die rolls to be the same), there's still randomness at play. With a small sample size, this randomness can mean some possible rolls are more numerous than others. But the bigger your sample size, the less this randomness matters - eventually the proportions of the die values will approach their expected proportions. This is one reason larger sample sizes are better to use in research than small sample sizes.
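You can watch the law of large numbers in action by bumping up the sample size in your own simulation. Here's a sketch with 100,000 rolls (it assumes the sample_space and probs vectors defined earlier are still in memory):
#draw a much larger sample of rolls
big_roll_sample <- sample(x = sample_space, size = 100000, replace = TRUE, prob = probs)
#put the sample into a dataframe for plotting
big_rolls_df <- data.frame(big_roll_sample)
#this histogram should look much closer to the flat population distribution
gf_dhistogram(~ big_roll_sample, data = big_rolls_df, color = "black", fill = "turquoise", bins = 6)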
One reason a dataset looks the way it does is the sampling process used to obtain it. Another reason is the specific shape of the population distribution it came from. As you might imagine, the probability distributions underlying variables in the real world could be a great many shapes. They could look rectangle shaped, like a symmetric bell, skewed one way or another, etc. However, over the years statisticians have figured out that some of these shapes are more common than others. This has led to naming these special shapes and defining idealized versions of them mathematically. These special shapes are called theoretical probability distributions, and we'll introduce you to working with them now.
First, let's look at our example above of a fair six-sided die. There are 6 possible outcomes, {1,2,3,4,5,6}. We know the probability of rolling any one of these outcomes is the same, or uniform. Since the sum of all possible probabilities is 1 (from probability axiom 2), we therefore know that the probability of rolling any one value is 1/6 = 0.167, or P(X = Ei) = 0.167. Written as a set, that would be {0.167,0.167,0.167,0.167,0.167,0.167}.
We can use simulation to make a large sample to approximate this distribution, as we did above, and we see that it forms what looks mostly like a rectangle. The theoretically ideal version of this is a perfect rectangle, called the uniform distribution.
We can also use some different R functions to access this perfectly shaped distribution without having to generate it ourselves: runif() and dunif(). The "unif" part of the function name is short for the uniform distribution. In the first one, the "r" refers to generating a random sample from the uniform distribution, and in the second one, the "d" refers to calculating the density (or probability) of a particular value.
#sampling 1 million items from the uniform distribution, between values of 0 and 6
outcomes <- runif(n = 1000000, min = 0, max = 6)
#putting sample into a dataframe
outcome_df <- data.frame(outcomes)
#look at first few values sampled from uniform distribution
head(outcome_df)
#plot the histogram of sample
gf_dhistogram(~ outcomes, data = outcome_df, bins = 500)
The idealized, theoretical shape of the uniform distribution is a smooth line, so we don't have to only sample integer values from it. Any decimal number has the exact same probability of being drawn as any other. Thus, sampling with runif() generates continuous data, meaning any decimal value between the min and max arguments.
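For instance, just a handful of draws makes the decimal values obvious:
#five continuous draws between 0 and 6 - note the decimals
runif(n = 5, min = 0, max = 6)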
If we want to generate whole-number values like those on a die, we need to round every generated number to an integer. R gives a few functions for this: round() (round values to the nearest specified decimal place), ceiling() (round up to the nearest integer), and floor() (round down to the nearest integer). We'll use ceiling() in our case to round up so all our values are only in {1,2,3,4,5,6}.
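To see how these three rounding functions differ, here's a quick comparison on a single arbitrary value:
round(2.4)   #rounds to the nearest integer: 2
ceiling(2.4) #always rounds up: 3
floor(2.4)   #always rounds down: 2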
outcome_df$rounded_outcomes <- ceiling(outcome_df$outcomes)
head(outcome_df)
gf_dhistogram(~ rounded_outcomes, data = outcome_df, bins = 6)
Where runif() lets us generate a random sample from the uniform distribution, dunif() tells us the probability (density) of a specific value. Try using it to calculate the probability of rolling a 4 on a die, given a uniform probability distribution:
# first argument is the value you want the probability of; second and third arguments are the min and max of the
# (continuous) distribution
dunif(4, min=0, max=6)
When we talked about data distributions in chapter 5, we saw that a distribution where both sides of the curve are symmetrical around the center is called normal. In fact, the normal distribution is another theoretical probability distribution.
The shape of the theoretical normal distribution is unimodal and symmetrical - it's a curve that looks like a bell. But its central tendency and spread can vary. In fact, the normal distribution is actually a family of theoretical distributions, each having a different shape. To specify a particular normal distribution to use, we need to provide two values - the mean and standard deviation. The figure below can give you a sense of how normal distributions can vary from each other, depending on their means and standard deviations. Note that three of the four distributions pictured have the same mean, which is 0, but quite different shapes. The fourth distribution has a mean that is below the other three.
Note that you can't specify a normal distribution with a min and a max - even if it doesn't look like it in the picture, technically all values are possible with a normal distribution. The sample space is every value from -∞ to ∞.
In R, we'll use rnorm() and dnorm() for doing the same sorts of commands we did previously with the uniform distribution.
#sampling from the normal distribution
outcomes <- rnorm(n = 1000000, mean = 0, sd = 1)
#putting sample into a dataframe for plotting
outcome_df <- data.frame(outcomes)
#plot all the sampled values
gf_dhistogram(~ outcomes, data = outcome_df)
Now use dnorm() to figure out how likely it is that you get a value of 2 from this normal distribution:
dnorm(2, mean = 0, sd = 1)
Perhaps the population you want to draw from has more variability, so the standard deviation of your distribution is actually 3. Plot a new normal distribution with this new sd of 3, and recalculate the probability of drawing a value of 2:
# Write your code below
There are other distributions you are likely to encounter if you continue in your statistics journey: Poisson, binomial, gamma, etc. They also have corresponding R functions to help you draw from them (rpois(), rbinom(), rgamma(), etc.). These are outside the scope of this intro course, because in many situations their shapes aren't too different from the normal distribution, so the normal is often a fine approximation. But feel free to look them up and learn about what sort of populations have these distributions!
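If you're curious, here's a quick peek at drawing from one of them; this sketch uses rbinom(), whose arguments are the number of draws, the number of trials per draw, and the probability of success on each trial (the specific values are arbitrary):
#ten draws, each counting the number of "successes" out of 20 trials
#with a 0.5 probability of success per trial (like counting heads in 20 coin flips)
rbinom(n = 10, size = 20, prob = 0.5)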
You might be wondering, why are some populations shaped differently than others? Why are some values more likely than others? Why are the mean and variation what they are? Why might skew happen?
To answer these questions is to understand the data generation process. Not just how a sample was generated from a population, but what processes created that population in the first place.
Let's consider again a six-sided die, but this time you're not interested in the outcome of just one roll. This time you're playing Dungeons and Dragons, and you're rolling two dice to determine how much damage your character will deal in a battle. This value is calculated by adding up the values of both dice after rolling.
When we roll one fair six-sided die, every discrete outcome has the same probability. Is that also true for every sum score from two dice?
We can use simulation to figure this out. But before moving on, think programmatically for a moment about each step your code would need to do to calculate this.
(Seriously, stop and think of a solution for yourself before reading the next paragraph!)
One answer is that we can use a three-step process: 1) simulate two different samples of rolls (for two different dice); then 2) add the values of the ith roll in both samples; lastly 3) make a distribution of that new variable. Here's the code for that process.
sample_space <- c(1,2,3,4,5,6)
#roll 1 die 1,000,000 times
die1 <- sample(x = sample_space, size = 1000000, replace = TRUE)
#repeat for the other die
die2 <- sample(x = sample_space, size = 1000000, replace = TRUE)
#make a new variable that adds up the values of the dice rolls
total_damage <- die1 + die2
outcome_df <- data.frame(total_damage)
gf_dhistogram(~ total_damage, data = outcome_df, bins = 11)
Interesting! Even though the chance of any value on one die is the same, there are differences in probability among the different possible combination scores. The chance of getting a 7 for a character's total damage is about 6 times as high as getting a 12.
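You can check that claim against the simulation itself (this assumes the total_damage vector from the code chunk above is still in memory):
#proportion of simulated games where total damage was 7 vs. 12
prop_7 <- mean(total_damage == 7)
prop_12 <- mean(total_damage == 12)
#this ratio should come out close to 6
prop_7 / prop_12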
As with many things in statistics and coding more generally, there isn't always just one correct way to do something. That is true here as well. Maybe you thought of a different way to simulate this data. Here is another approach:
sample_space <- c(1,2,3,4,5,6) #sample space
damage_vals <- vector(length=1000000) #an empty vector to store damage values
for (i in 1:1000000) { #making a for loop of 1,000,000 tosses of two dice
twodice <- sample(x = sample_space, size = 2, replace = TRUE) #sample two dice at once
damage_vals[i] <- sum(twodice)
}
outcome_df <- data.frame(damage_vals)
gf_dhistogram(~ damage_vals, data = outcome_df, bins = 11)
Did that come out as approximately the same distribution as the first simulation approach?
Technically this way is a little bit slower for the computer to do (you may have noticed it took a few seconds longer to run). It's not as time-efficient, since sample() samples two sets of one million datapoints really fast compared to sampling 2 datapoints, one million times. If you get into building simulations a lot, this time optimization can become something really important to you. But ultimately, they both gave us the correct answer.
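If you'd like to check this on your own machine, base R's system.time() reports how long an expression takes to run. Here's a rough sketch comparing the two approaches (exact timings will vary from computer to computer):
#time the vectorized approach: two samples of one million rolls each
system.time({
  die1 <- sample(x = sample_space, size = 1000000, replace = TRUE)
  die2 <- sample(x = sample_space, size = 1000000, replace = TRUE)
  total_damage <- die1 + die2
})
#time the for-loop approach: one million samples of two rolls each
system.time({
  damage_vals <- vector(length = 1000000)
  for (i in 1:1000000) {
    damage_vals[i] <- sum(sample(x = sample_space, size = 2, replace = TRUE))
  }
})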
Let's now see what happens if your character has a good ability that lets you sum up four dice for your damage:
die1 <- sample(x = sample_space, size = 1000000, replace = TRUE)
die2 <- sample(x = sample_space, size = 1000000, replace = TRUE)
die3 <- sample(x = sample_space, size = 1000000, replace = TRUE)
die4 <- sample(x = sample_space, size = 1000000, replace = TRUE)
total_damage <- die1 + die2 + die3 + die4
outcome_df <- data.frame(total_damage)
gf_dhistogram(~ total_damage, data = outcome_df, bins = 21)
This histogram is starting to look a lot like a normal distribution.
In this DnD example, the amount of damage your character does is not a value that comes out of thin air. It comes from a combination of more basic events. The data generation process of your damage is a sum of multiple independent dice rolls.
This is where the normal distribution comes from - combinations of more basic uniform events. Many things in life (indeed, probably most things we care about) are also outcomes of multiple other events and processes. For example, you aren't your height just by happenstance - you can think of it as a combination score being generated by many other variables such as your genes, nutrition, any diseases, etc. The other less common probability distributions are the products of data generation processes as well, just different ones (e.g., combination with multiplication instead of addition, data that can't be less than zero, etc.).
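We can see this data generation process directly by summing several continuous uniform draws per observation; this is just a sketch, and the number of draws and the range below are arbitrary choices for illustration:
#for each of 100,000 observations, sum ten independent uniform draws between 0 and 1
n_obs <- 100000
unif_draws <- matrix(runif(n_obs * 10, min = 0, max = 1), ncol = 10)
summed_unifs <- rowSums(unif_draws)
#the sums pile up into a bell shape even though each draw came from a flat distribution
sum_df <- data.frame(summed_unifs)
gf_dhistogram(~ summed_unifs, data = sum_df, bins = 50)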
If your population distribution isn't uniform, there are other things causing it. Later in the course we will learn how to investigate what this data generation process might be.
After practicing with probability distributions and simulating data samples from them, hopefully you're developing a sense of how to find the probability of drawing a certain data value from a particular population. As the last section of this chapter, we want to introduce you to more complicated probability events you can calculate. These events correspond to certain kinds of data you may deal with.
Traditionally in stats and math classes, students are taught about these through mathematical equations (and professors will go to great lengths to make students memorize them). For us, now that you appreciate the shapes of probability distributions and how to simulate events, you won't have to just memorize these equations - you'll have a deeper sense for why they are the way they are.
First, here's a quick recap of the axioms of probabilities we introduced earlier.
1. Probability cannot be negative
2. The total probability of all outcomes in the sample space is 1
3. The probability of any individual event cannot be greater than one
Based on these axioms and some algebra, we can build additional mathematical rules that hold so long as the axioms are true:
One such rule is the complement rule: the probability of an event NOT occurring is one minus the probability that it occurs (in R, the ! symbol means "not," so we'll use it here too): P(!Ei) = 1 - P(X = Ei). To see this visually, consider the following uniform distribution:
The probability of all events not equal to 3 is the area of the distribution (turquoise), minus the area of the bin corresponding to 3 (orange). You can also verify this by using dunif():
prob_3 <- dunif(3, min = 0, max = 6)
prob_not_3 <- dunif(1, min = 0, max = 6) + dunif(2, min = 0, max = 6) +
dunif(4, min = 0, max = 6) + dunif(5, min = 0, max = 6) +
dunif(6, min = 0, max = 6)
# does prob_not_3 have the same value as 1 - prob_3?
prob_not_3
1 - prob_3
This doesn't apply only to uniform distributions. It is also true of normal distributions, with their different probabilities for different values.
prob_1 <- dnorm(1, mean = 0, sd = 1)
#probability of drawing a 1 from a normal distribution
prob_1
#probability of not 1
1 - prob_1
Visually:
Another useful quantity is the cumulative probability: the probability that an event takes on any value up to (or past) some threshold, found by adding up the probabilities of all outcomes that satisfy the condition. For example, the probability of rolling any number less than or equal to 3 is the sum of the areas of the options that satisfy this condition.
We could use dunif() to add up the probabilities of 1-3 to verify this, but there's also a punif() (and pnorm()) function that makes this faster and returns the cumulative probability of a distribution up to a certain value:
punif(3, # The max (or min) of the cumulative
min = 0, # Lower limit of the distribution
max = 6, # Upper limit of the distribution
lower.tail = TRUE) # If TRUE, calculate cumulative below 3.
# If FALSE, calculate cumulative above 3.
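pnorm() works the same way for normal distributions. For example, the cumulative probability of drawing a value of 2 or less from a standard normal distribution (mean 0, sd 1):
pnorm(2,                 # The value to accumulate up to
      mean = 0,          # Mean of the normal distribution
      sd = 1,            # Standard deviation of the distribution
      lower.tail = TRUE) # If TRUE, calculate cumulative below 2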
For the record, even if you didn't know the equation for cumulative probability or the function to calculate it directly, you could use a simulation to approximate it:
sample_space <- c(1,2,3,4,5,6)
#roll 1 die 1,000,000 times
die <- sample(x = sample_space, size = 1000000, replace = TRUE)
#make a boolean variable for whether die was less than or equal to 3
up_to_3 <- die <= 3
#compute cumulative probability (proportion of new variable that is true)
# can count the number of true values in a boolean just by summing,
# since TRUE also equals 1 and FALSE equals 0
# then divide by the number of total rolls
sum(up_to_3) / length(up_to_3)
Cumulative probabilities pertain to the kind of variables where you only care about values past (or up to) a threshold - e.g., a cutoff score for an autism diagnosis, pass/fail qualifications, levee breaches during flooding, etc.
Next up are joint probabilities: the probability that two independent events both occur, which is the product of their individual probabilities, P(A and B) = P(A) x P(B). Let's visualize this by making two uniform distributions, and turning one on its side. Then overlap them so we make a sort of matrix. The joint probability of getting two 6s on two different dice is the area where the 6 column of each distribution overlaps.
There's not a quick R function for calculating joint probabilities between two distributions, but we can simulate it:
sample_space <- c(1,2,3,4,5,6)
#roll 1 die 1,000,000 times
die1 <- sample(x = sample_space, size = 1000000, replace = TRUE)
#repeat for the other die
die2 <- sample(x = sample_space, size = 1000000, replace = TRUE)
#make a new boolean variable for whether both dice equal 6 for each index
both6 <- (die1 == 6) & (die2 == 6)
sum(both6) / length(both6)
#verify it's about the same as the equation P(E6)*P(F6)
(1/6)*(1/6)
Finally, there are union probabilities: the probability that at least one of two events occurs, P(A or B) = P(A) + P(B) - P(A and B). In this rule, we need to remember to subtract the probability of A and B occurring together, because otherwise we're counting the outcome of A & B twice, since it's covered both by P(A) and P(B).

Also remember that this is NOT the same thing as the probability that X equals either one of two values. That would be a cumulative probability of one event having a value within a subset of outcomes. Unions refer to separate events considered together.

A union event visualized is the area covering both columns that have a 6 somewhere in them:
And we can simulate it, this time using a | logical condition instead of a & condition:
sample_space <- c(1,2,3,4,5,6)
#roll 1 die 1,000,000 times
die1 <- sample(x = sample_space, size = 1000000, replace = TRUE)
#repeat for the other die
die2 <- sample(x = sample_space, size = 1000000, replace = TRUE)
#make a new boolean variable for whether either die equals 6 at each index
either6 <- (die1 == 6) | (die2 == 6)
sum(either6) / length(either6)
#verify it's about the same as the rule P(E6) + P(F6) - P(E6)*P(F6)
1/6 + 1/6 - (1/6)*(1/6)
Joint and union probabilities pertain to the chance of drawing data with certain values across multiple variables.
After reading this chapter, you should be able to:

- Use sample(), runif(), and rnorm() to simulate data from probability distributions
- Use dunif() and dnorm() to find the probability (density) of a particular value in a distribution