

Chapter 3 - What are Data¶

Although we can apply statistical techniques to any bunch of numbers, and sometimes we do just make up numbers to analyze (more about that later), we generally want to analyze numbers that represent something about the world. These numbers we refer to as data.

The first important point about data is that data are – meaning that the word “data” is plural. You might also wonder how to pronounce “data” – some people say “DAY-tah”, some say “DAH-tah.” Some people even use both interchangeably, within the same conversation!

3.1 Representing data in R¶

We’ve used R objects so far to store a single number or message. But in statistics we are dealing with more than one — and sometimes many — numbers. An R object can also store a whole set of numbers. This set together is called a vector. You can think of a vector as a list of numbers (or values).

The R function c() can be used to combine a list of individual values into a vector. You could think of the “c” as standing for “combine.” In the following code we have created two vectors (we just named them my_vector and my_vector2) and put a list of values into each vector.

In [ ]:

# Here is the code to create two vectors my_vector and my_vector2. We just made up those names.

# Run the code and see what happens
my_vector <- c(1,2,3,4,5)
my_vector2 <- c(10,10,10,10,10)

# Now write some code to return these two vectors in the R console. Run the code and see what happens.

Many functions will take in a vector as the input. For example, try using max() to find the largest number in my_vector.

In [ ]:

my_vector <- c(14,22,31,24,15)

# Use max() to find the largest value in my_vector

There may be times when you just want to know a subset of the values in a vector, not all of the values. We can index a position in the vector by using brackets with a number in it like this: [1]. So if we wanted to print out just the first item in my_vector, we could write my_vector[1].

In [ ]:

# Write code to get the 4th value in my_vector

We're not limited to indexing only one position at a time, either. We can also pass a vector of index positions inside the brackets to return only those positions:

In [ ]:

# This code returns both the 2nd and 4th value in my_vector
my_vector[c(2,4)]

And if we want to return a whole chunk of the vector (say, the 1st through the 4th position), we don't have to type out c(1,2,3,4) within the brackets. The : operator works like the word "through", such that the code 1:4 means "every value from the 1st to the 4th, inclusive".

In [ ]:

# This code returns the 1st through the 4th values in my_vector
my_vector[1:4]

Finally, what if we want to return every item in a vector except a particular one? Then we get to use what's called negative indexing! Simply put a negative sign in front of an index position, and that will remove the item at that position from the output.

In [ ]:

# This code returns my_vector, but without the 3rd item
my_vector[-3]

# Try using both the colon method and the negative indexing method as different ways to return 
# the vector (2,3,4,5) using my_vector
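
The indexing forms described above can be compared side by side. This is only a sketch using a made-up vector called snack_counts (not one of the chapter's own objects), so the specific values are just an illustration.

```r
# Hypothetical vector, made up for this illustration
snack_counts <- c(7, 3, 9, 4, 6)

single_item <- snack_counts[2]         # one position: the 2nd value
picked_items <- snack_counts[c(1, 5)]  # several positions: the 1st and 5th values
first_three <- snack_counts[1:3]       # a range: the 1st through 3rd values
without_last <- snack_counts[-5]       # negative index: everything except the 5th value

single_item
picked_items
first_three
without_last
```

Each line saves its result in a new object, so snack_counts itself is never changed by indexing.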

for loops become very useful when we want to iterate through a vector (do something with each vector item). For instance, last chapter we created a vector of numbers, and then printed each item:

In [ ]:

my_sequence <- c(2,4,6,8,10)         

for (item in my_sequence) {     # for loop over the specified items
    print(item)                 # Print the value of each item
}

When the command in the set of parentheses looks like (X in SOME_VECTOR), this is saying that X will take on the value of a specific item in SOME_VECTOR for that iteration of the loop. However, you can also lean on the function length() to loop over the position of each item instead:

In [ ]:

my_sequence <- c(2,4,6,8,10)         

for (position in 1:length(my_sequence)) {     # for loop over the position number of each item,
    print(position)                           # from the first item up to the max length of my_sequence
}

Instead of the object position taking on the value of each item in my_sequence, it is now taking a numeric value in the sequence 1 through the length of my_sequence.

In [ ]:

# Using indexing, how would you make this code print out (2,4,6,8,10) while still using positions 
# in the for loop?

my_sequence <- c(2,4,6,8,10)         

for (position in 1:length(my_sequence)) {     
    print(position)                           # Make a change to this line to print out the value of my_sequence 
}                                             # at this position using indexing
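
As a side note, base R also has a function called seq_along() that produces the same position numbers as 1:length(). The chapter does not use it, so treat the sketch below as an optional alternative. The vector names are made up for illustration. The one place the two approaches differ is with an empty vector.

```r
# Hypothetical vectors, made up for this illustration
my_steps <- c(2, 4, 6, 8, 10)
empty_vector <- c()

positions_colon <- 1:length(my_steps)    # positions 1 through 5
positions_along <- seq_along(my_steps)   # the same positions, 1 through 5

# With an empty vector, 1:length(empty_vector) is 1:0, which counts DOWN
# and gives the two values 1 and 0. seq_along() gives no positions at all.
bad_positions <- 1:length(empty_vector)
safe_positions <- seq_along(empty_vector)

for (position in seq_along(my_steps)) {
    print(position)    # prints each position number in my_steps
}
```

For the vectors in this chapter both approaches behave the same way; seq_along() only matters when a vector might have no items in it.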

3.2 Data types¶

Qualitative data¶

Data are the measurements that make up variables. Variables are sets of measurements that, across many values, will vary in some way. For example, human intelligence is a variable because, across many people, we would likely measure different intelligence scores. The values of intelligence would vary from person to person. In contrast, people's species is not a variable, since by definition all humans are Homo sapiens. There's no variation from datapoint to datapoint.

There are different types of variables. Some are qualitative, meaning that they describe a quality rather than a numeric quantity. For example, say we ran a survey in class that asked “What is your favorite food?” Some of the answers may be: blueberries, chocolate, tamales, pasta, and pizza. Those data are not intrinsically numerical; there's nothing to indicate quantity or numerical relations among them.

In R, there are also different types of objects we can rely on to represent different data types. We can represent qualitative data like words or letters with the character value type. Remember at the beginning of Chapter 1 when you printed out "hello world!"? By putting those words within a set of quotation marks, we created a character-type object. This is also known in other programming languages as a "string". If we forget the quotes, R will think that a word is the name of an object instead of a character value. Note that numbers can also be turned into character values: for example, when 20 is in quotation marks, like "20", it will be treated as a character value, even though it includes a number. R doesn't care whether you use single quotes, 'like this', or double quotes, "like that".

In [ ]:

#example of a character value
my_character <- "I am a character."
my_character

You can also put character objects into a vector, to represent a set of qualitative data.

In [ ]:

many_hellos <- c("shalom", "hello", "hola", "bonjour", "ni hao", "merhaba")

# Write code to print out the 5th way of saying hello in this vector

Numeric data¶

In statistics we will also work with quantitative data. This means data that are numerical. It is often easier to work with quantitative than qualitative data, because this lets us do more mathematical computations, like finding the mean. For example, this table shows the results from the same class favorite food survey, but presented in a different way:

Food choice	Number of students
blueberries	2
chocolate	8
tamales	5
pasta	6
pizza	10

The students’ answers were qualitative, but we generated a quantitative summary of them by counting how many students gave each response.

In R, when the program sees typed numbers, it will usually assume you're talking about a numeric value type. However, seeing numbers in some data doesn't guarantee that the variable is quantitative. A researcher might choose to use numbers as a sort of code or label, in place of the full qualitative description. For example, if they recorded which hand every student in a class used to write with, they could save the variable as:

handedness_word <- c("left","right","right","right","right","left","right","right","right")

That's a lot of typing though. Instead, they could choose to assign "right" the value of 1, and "left" the value of 2. This would create a vector like:

handedness_label <- c(2,1,1,1,1,2,1,1,1)

Much shorter to write!

Boolean data¶

Boolean values (also called "logical values") are either TRUE or FALSE - they represent the truth of a state of reality, like when we were using if/else last chapter. Maybe we have a statement such as: "10 is greater than 5". We can ask R to evaluate this and return the answer TRUE or FALSE. We can do that by using logical operators like >, <, >=, <=, and ==. The double equal sign, ==, checks if two values are equal. There is even a logical operator to check whether values are not equal: !=. For example, 5 != 3 is a true statement, so R returns TRUE. Essentially, a boolean represents the answer to a true/false statement. Below is a table of the logical operators in R that you can use for making logical statements.

Symbol	Example use	Meaning
<	A < B	A is less than B
>	A > B	A is greater than B
==	A == B	A is equal to B
<=	A <= B	A is less than or equal to B
>=	A >= B	A is greater than or equal to B
!=	A != B	A is not equal to B
%in%	A %in% c(B1, B2, B3)	A is in the set [B1, B2, B3]
&	A == B & C == D	A equals B AND C equals D (both statements are true)
|	A == B | C == D	A equals B OR C equals D (either statement is true)

Note: "%in%" is NOT the same as the "in" statement used in for loops. That one helps R go through each item in a sequence; %in% is used in conditional statements that check whether a value matches ANY item in the sequence.
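
A few of the operators from the table can be sketched with made-up values. The object names and values below are hypothetical and only there to show what each operator returns.

```r
# Hypothetical values, made up for this illustration
favorite_food <- "pasta"
class_foods <- c("blueberries", "chocolate", "tamales", "pasta", "pizza")
age <- 20

in_class_list <- favorite_food %in% class_foods   # TRUE: "pasta" matches one item in the vector
both_true <- age >= 18 & age < 65                 # TRUE: both comparisons are true
either_true <- age < 18 | age > 65                # FALSE: neither comparison is true
not_equal <- favorite_food != "pizza"             # TRUE: the two values differ

in_class_list
both_true
either_true
not_equal
```

Each result is a single boolean value that can be stored in an object like any other value.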

In this course, we will avoid using the single equal sign, =, to assign or compare values. If you want to know whether A is equal to B, use the double equal sign, ==. R does accept the single equal sign in place of the assignment operator, <-, and = has a few other specific uses that you'll learn later. For now, use the arrow <- to assign values to an R object, and == to ask whether two values are equal.

In [ ]:

# Read this code and predict what value will come out of the R console. Then run the code and see if you were right.

A <- 1
B <- 5

comparison <- A > B

comparison

Lots of things can be turned into boolean data if you think about it. For example, the researcher above could choose to represent handedness as a boolean, by conceptualizing the variable as "this person is right-handed."

handedness_bool <- c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)

Before creating an object, it's good to think about what type the value should be. I.e., think about what operations you plan to do on the object, and thus what type the value needs to be for that operation to make sense. However, sometimes you forget what type an object was that you created a while ago, or you will receive some data from someone else with pre-defined objects. If you ever need to find out the type of an object, you can simply use the command str([OBJECT_NAME]). Str is short for "structure." It will print out the type of the object as well as its value.

In [ ]:

#Use str() to print out the type of each object below
object1 <- "What am I?"
object2 <- TRUE
object3 <- 30

#Write your code below this line

Note: There are technically subtypes of values under the umbrella of numeric (double, integer, etc.) or qualitative (character or factor) that reflect how the value is stored in computer memory. You don't need to know the difference for this course; just remember that when R returns one of these as a value type, it's a kind of number or qualitative value.
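
If you are curious what these subtypes look like, the base R functions typeof() and class() report them. The chapter itself only uses str(), so this is an optional sketch, and the object names are made up for illustration.

```r
# Hypothetical objects, made up for this illustration
whole_number <- 30L       # the L suffix asks R to store an integer
decimal_number <- 30      # without the suffix, R stores a double
a_word <- "thirty"
a_category <- factor(c("left", "right", "right"))

typeof(whole_number)      # "integer"
typeof(decimal_number)    # "double"
class(a_word)             # "character"
class(a_category)         # "factor"

# Both number subtypes still count as numeric
is.numeric(whole_number)
is.numeric(decimal_number)
```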

3.3 Scales of measurement¶

Among quantitative data, the numbers a variable can take can be related to each other in different ways. What exactly the numbers mean determines the variable's scale of measurement.

A categorical variable (also called a nominal variable) is one like the handedness example above, where we assigned a numeric label to each hand. In a nominal variable, each value of the variable represents a different category of something. These categories are distinct, and they're not inherently sortable. You can use character or numeric data types to represent a categorical variable. As another example, we might ask people for their political party affiliation, and then code those as numbers: 1 = “Republican”, 2 = “Democrat”, 3 = “Libertarian”, and so on. The different numbers do not have any ordered relationship with one another - they're just another way of writing a categorical label.

An ordinal variable has values where the magnitude matters, but only for ordering the values, not for determining how far apart any of the levels are from each other. For example, we might ask a person with chronic pain to complete a form assessing how bad their pain is, using a 1-7 numeric scale. While the person is presumably feeling more pain on a day when they report a 6 versus a day when they report a 3, it wouldn’t make sense to say that their pain is twice as bad on the former versus the latter day - there isn't a unit of measurement available, only a sense of one option being stronger than another. The ordering gives us information about relative magnitude, but the differences between values are not necessarily equal in magnitude.

An interval variable has all of the features of an ordinal scale, but in addition the distance between units on the measurement scale is measurable and consistent. A standard example is physical temperature measured in Celsius or Fahrenheit; the physical difference between 10 and 20 degrees is the same as the physical difference between 90 and 100 degrees.
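
One way these scales might show up in R is sketched below. It uses factor() with the ordered argument, which this chapter does not cover (factor() and named arguments are introduced in the next section), and the data are made up for illustration.

```r
# Hypothetical pain ratings on a three-level scale, made up for this illustration
pain_words <- c("mild", "severe", "moderate", "mild")
pain_ordinal <- factor(pain_words,
                       levels = c("mild", "moderate", "severe"),
                       ordered = TRUE)

# Ordinal: ordering comparisons are meaningful (severe is more than mild)
pain_ordinal[2] > pain_ordinal[1]

# Interval: equal differences mean the same thing anywhere on the scale
temps_celsius <- c(10, 20, 90, 100)
low_gap <- temps_celsius[2] - temps_celsius[1]
high_gap <- temps_celsius[4] - temps_celsius[3]
low_gap == high_gap
```

The ordered factor supports "more than" and "less than" comparisons, but subtracting one pain level from another is not defined, which matches the idea that ordinal values have an order but no unit of distance.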

3.4 Converting data¶

Sometimes you will have data in one format that you would prefer to have in another - e.g., you have a variable for someone's zip code, but R is treating those data as quantitative instead of categorical. That doesn't make sense - Pomona's zip code 91711 isn't quantitatively comparable to Harvard's 02138, it's just using numbers to denote a category (Pomona is of course comparable and better than Harvard in other ways!). How, though, would you convince R to treat these numbers as a categorical variable?

Luckily, you don't need to rewrite the whole zipcode data vector with quotations manually. You can use a function to classify the vector of numbers as a categorical variable. The function that converts numerical data to categorical is as.character(). Try it out below.

In [ ]:

zipcodes <- c(01267, 01002, 19081, 91711, 02481, 04011, 21402, 91711, 55057, 05753)
str(zipcodes)

# Use the as.character() function on zipcodes, assign it to zipcodes_char, and then return the type 
# of that new object
zipcodes_char <- #replace this comment with your code
str(zipcodes_char)
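
One wrinkle with zip codes in particular: when a number is typed without quotation marks, R drops any leading zeros, so 01267 is stored as 1267 and stays that way after conversion to character. The sketch below shows one possible way to restore the zeros with the base R function sprintf(), under the assumption that every zip code should be 5 digits long. The vector names are made up for illustration.

```r
# Hypothetical example vector; typed as numbers, so the leading zero is lost
zips_typed <- c(01267, 91711)
zips_as_text <- as.character(zips_typed)     # "1267" "91711"

# One possible fix, assuming every zip code should be 5 digits long:
# "%05d" pads whole numbers with zeros on the left up to 5 characters
zips_padded <- sprintf("%05d", zips_typed)   # "01267" "91711"

zips_as_text
zips_padded
```

Typing the zip codes inside quotation marks from the start would avoid the problem entirely, at the cost of more typing.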

We can also turn a nominal variable back into a numeric variable by using the as.numeric() function. This is useful in the case where a dataset has numbers that accidentally get saved as character types.

In [ ]:

zipcodes_numeric <- as.numeric(zipcodes_char)
str(zipcodes_numeric)

When we code the values of a variable using numbers, it is always important to keep in mind what the numbers mean. The value 2 has a very different meaning if it represents the handedness of a person (e.g., left) than if it represents their height in inches (very short!). When we use statistical software to analyze data, the software processes the numbers. But the software doesn’t know what the numbers actually mean. Only you know that.
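
To make this concrete, the sketch below asks R for the mean of the handedness labels used earlier in the chapter, alongside some made-up heights. R computes both without complaint; only the reader can tell which result means something.

```r
# Handedness coded as labels (1 = right, 2 = left), as earlier in the chapter
handedness_label <- c(2, 1, 1, 1, 1, 2, 1, 1, 1)

# R returns about 1.22, but an "average hand" does not describe anything real
mean(handedness_label)

# Hypothetical heights in inches, made up for comparison
height_inches <- c(64, 70, 67)
mean(height_inches)    # 67: an average height is a meaningful quantity
```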

We can also convert the actual values of data within variables. One might do this if we decided the numbers in the zipcode variable were too ambiguous, and we instead wanted to label them with the names of their towns. The factor() function has some arguments you can add to relabel the values in a variable that you want to convert to qualitative:

In [ ]:

# 4 places, coded by zipcode
places_num <- c(01267, 01002, 19081, 91711, 91711)
places_factor <- factor(places_num, levels = c(01267, 01002, 19081, 91711), 
                        labels = c("williamstown", "amherst", "swarthmore", "claremont"))
# if your command is too wide, hit return after a comma to make it wrap around on a new line, indented

places_factor

This looks complicated - why are there now = signs and other words in our function? At this point we'll introduce you to the concept of named arguments. A function like factor() can be run with just one argument, a vector to turn into nominal values, or it can take additional arguments. It's flexible like that. But if you're giving it additional arguments, you have to tell the function what each argument is supposed to represent. That's where you get to using the argument's name (levels or labels in this example), and using the = sign to tell the function what the value of that specific argument should be.

Here you can think of the input to the factor() function as having three parts: the variable to recode, the existing values (levels), and the labels to recode the values as. It may seem like we have to write out a new label for every data point in the vector, but we only need one label per unique value. Notice that the zipcode 91711 appears twice in places_num but only once in the levels argument. A repeated value only has to be included once in levels, because factor() will assign the matching label to every data point with that value.
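
The same pattern can be sketched with the handedness labels from earlier in the chapter, where there are many data points but only two unique values. The name handedness_factor is made up for this illustration, and table() is a base R function, not yet covered here, that counts how often each label occurs.

```r
# Handedness labels from earlier in the chapter: 1 = right, 2 = left
handedness_label <- c(2, 1, 1, 1, 1, 2, 1, 1, 1)

# Each existing value is listed once in levels, in the same order as its new label
handedness_factor <- factor(handedness_label,
                            levels = c(1, 2),
                            labels = c("right", "left"))

handedness_factor
table(handedness_factor)    # counts how many data points received each label
```

Nine data points were relabeled using only two levels and two labels.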

You can also change the values of numeric variables. For example, you have water depth data measured in inches, but you'd prefer to see it in feet. In this case, you will need to transform each value in the vector holding that data.

You don't have to manually retype the vector to do this either. Vectors in R work like vectors in linear algebra. So, if you have a constant value you want to modify a vector by, you can simply use the mathematical operation between that value and the vector object. E.g., in the example above, you could write water_depth / 12 to convert every water depth measurement from inches to feet. When you use an operation between a vector and a single value, R will apply that operation to every item in the vector individually.
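
As a minimal sketch of the water depth example, assuming some made-up measurements (the values in water_depth are hypothetical):

```r
# Hypothetical water depth measurements in inches, made up for this illustration
water_depth <- c(24, 36, 6, 60)

# Dividing the vector by 12 divides every item by 12
water_depth_feet <- water_depth / 12
water_depth_feet    # 2.0 3.0 0.5 5.0
```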

In [ ]:

my_vector <- c(1,2,3,4,5)

# write code to multiply each number in my_vector by 100

Notice that when you do a calculation with a vector, you’ll get a vector of numbers as the answer, not just a single number.

After you multiply my_vector by 100, what will happen if you return my_vector? Will you get the original vector (1,2,3,4,5), or one that has the hundreds (100,200,300,400,500)? Try running this code to see what happens.

In [ ]:

# Run the code below to see what happens
my_vector <- c(1,2,3,4,5)
my_vector * 100

# This will return my_vector
my_vector

Remember, R will do the calculations, but if you want something saved, you have to assign it somewhere. Try writing some code to compute my_vector * 100 and then assign the result back into my_vector. If you do this, it will replace the old contents of my_vector with the new contents (i.e., the product of my_vector and 100). (Hint: you can use objects in the same command where you redefine their values).

In [ ]:

# This creates `my_vector` and stores 1, 2, 3, 4, 5 in it
my_vector <- c(1,2,3,4,5)

# Now write code to save `my_vector * 100` back into `my_vector`
my_vector <- #replace this comment with your code

We can also create Boolean vectors by subjecting a whole vector to a logical statement. Since a logical statement is an operation, R will apply it to every item in the vector, such as the handedness vector below.

In [ ]:

handedness <- c(2,1,1,1,1,2,1,1,1)
handedness_bool <- handedness == 1

handedness_bool

In [ ]:

# What do you expect from this code? Make a guess first, then run the code to see what happens. 

handedness <- c(2,1,1,1,1,2,1,1,1)
footedness <- c(2,1,1,1,1,1,1,2,1)

hand_foot_alignment <- handedness == footedness
hand_foot_alignment

Sometimes you will need to do something more complicated. Instead of modifying each value in a vector by the same amount, you want to make a different modification. E.g., say you want to know how much each person improved their SAT scores with practice, but you only have their first and last scores saved as data. Everyone improves at different rates, so you can't add the same score to each person. Instead, you could do end_score - beginning_score. If both end_score and beginning_score are vectors of the same length, R will apply the operation to each item pairwise: the first item of beginning_score will be subtracted from the first item of end_score, the second from the second, third from the third, and so on.

In [ ]:

end_score <- c(1400, 1450, 1500, 1200, 1150, 1600)
beginning_score <- c(1350, 1400, 1400, 1200, 1100, 1400)

#write code that will subtract beginning_score from end_score, and return the resulting vector
change_score <- #write your code here
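
Pairwise operations are not limited to subtraction. As a sketch with made-up data (the vectors below are hypothetical), multiplying two vectors of the same length multiplies the first items together, then the second items, and so on.

```r
# Hypothetical data, made up for this illustration
hours_worked <- c(10, 20, 15)
hourly_wage <- c(15, 18, 20)

# Checking that the lengths match before combining the vectors
length(hours_worked) == length(hourly_wage)

# Pairwise multiplication: 10*15, 20*18, 15*20
weekly_pay <- hours_worked * hourly_wage
weekly_pay    # 150 360 300
```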

3.5 What makes good data?¶

Data are meant to be a clear, quantitative representation of some quality in the world. However, in many fields such as psychology, the thing that we are measuring is not a physical feature, but instead an unobservable theoretical concept, which we usually refer to as a construct. For example, let’s say that we want to test how well you understand the distinction between categorical, ordinal, and interval variables. The professor could give you a pop quiz that would ask you several questions about these concepts and count how many you got right. This would be one way to try and measure the construct that is your knowledge. However, constructs are difficult to measure well. There are several things that can get in the way of good measurement:

Measurement error¶

It is usually impossible to measure a construct without some amount of error. In the example above, you might know the answer, but you might mis-read the question and get it wrong. In other cases, there is error intrinsic to the thing being measured, such as when we measure how long it takes a person to respond on a simple reaction time test, which will vary from trial to trial for many reasons. This is called measurement error - when the values we record are different from what the true construct value should be because of an imperfect measurement system. We generally want our measurement error to be as low as possible, which we can achieve either by improving the quality of the measurement (for example, using a better timer to measure reaction time), or by averaging over a large number of individual measurements.
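
The benefit of averaging can be sketched with a small simulation. Everything here is an assumption made for illustration: the "true" reaction time of 300 milliseconds, the size of the random error, and the use of the base R functions set.seed(), rnorm(), and replicate(), which this chapter has not introduced.

```r
set.seed(1)    # arbitrary seed so the simulation is repeatable

# Assumed values for this simulation, not taken from real data
true_reaction_time <- 300    # the "true" value, in milliseconds
noise_sd <- 50               # size of the random measurement error

# 1000 single noisy measurements
single_measurements <- rnorm(1000, mean = true_reaction_time, sd = noise_sd)

# 1000 averages, each based on 25 noisy measurements
averaged_measurements <- replicate(
    1000,
    mean(rnorm(25, mean = true_reaction_time, sd = noise_sd))
)

# Typical distance from the true value for each approach
error_single <- mean(abs(single_measurements - true_reaction_time))
error_averaged <- mean(abs(averaged_measurements - true_reaction_time))

error_single
error_averaged
```

In this simulated setup the averaged measurements land noticeably closer to the true value than single measurements do, because random errors in opposite directions tend to cancel out.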

Reliability¶

Reliability refers to the consistency of our measurements. One common form of reliability, known as test-retest reliability, measures how well the measurements agree if the same measurement is performed twice. For example, someone might give you a questionnaire about your attitude towards statistics today, repeat this same questionnaire tomorrow, and compare your answers on the two days; if the questionnaire is a good measurement of your true attitude, we would hope that they would be very similar to one another, unless something happened in between the two tests that should have changed your view of statistics (like taking this class!).
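
One common way to put a number on "very similar" is the correlation between the two sets of answers, computed in R with cor(). Correlation has not been covered at this point, so the sketch below is only a preview. The data are simulated under assumed values rather than taken from a real questionnaire.

```r
set.seed(2)    # arbitrary seed so the simulation is repeatable

# Assumed setup: 200 people with a stable "true" attitude score,
# measured on two days with independent random error each day
true_attitude <- rnorm(200, mean = 50, sd = 10)
day1_score <- true_attitude + rnorm(200, mean = 0, sd = 3)
day2_score <- true_attitude + rnorm(200, mean = 0, sd = 3)

# Correlation between the two administrations;
# values near 1 suggest the questionnaire gives consistent answers
test_retest <- cor(day1_score, day2_score)
test_retest
```

Making the daily error larger (a bigger sd in the second and third rnorm() calls) would pull this correlation down, which is what lower test-retest reliability looks like.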

Another way to assess reliability comes in cases where the data include subjective judgments. For example, let’s say that a researcher wants to determine whether a treatment changes how well an autistic child interacts with other children, which is measured by having experts watch the child and rate their interactions with the other children. In this case we would like to make sure that the answers don’t depend on the individual rater — that is, we would like for there to be high inter-rater reliability. This can be assessed by having more than one rater perform the rating, and then comparing their ratings to make sure that they agree well with one another.

Reliability is important if we want to compare one measurement to another, because the relationship between two different variables can’t be any stronger than the relationship between either of the variables and itself (i.e., its reliability). This means that an unreliable measure can never have a strong statistical relationship with any other measure. For this reason, researchers developing a new measurement (such as a new survey) will often go to great lengths to establish and improve its reliability.

Validity¶

Reliability is important, but on its own it's not enough: after all, I could create a perfectly reliable measurement on a personality test by re-coding every answer using the same number, regardless of how the person actually answers. We want our measurements to also be valid; that is, we want to make sure that we are actually measuring the construct that we think we are measuring. There are many different types of validity that are commonly discussed; we will focus on two of them.

Face validity - Does the measurement make sense on its face? If I were to tell you that I was going to measure a person’s blood pressure by looking at the color of their tongue, you would probably think that this was not a valid measure on its face. On the other hand, using a blood pressure cuff would have face validity. This is usually a first reality check before we dive into more complicated aspects of validity. If you read about a construct measurement that doesn't seem face valid to you (e.g., "what your shirt color says about you!!"), you should be immediately suspicious of it until there is good evidence of other validity.

Construct validity - Is the measurement related to other measurements in an appropriate way? This is often subdivided into two aspects. Convergent validity means that the measurement should be closely related to other measures that are thought to reflect the same construct. Let’s say that I am interested in measuring how extroverted a person is using either a questionnaire or an interview. Convergent validity would be demonstrated if both of these different measurements are closely related to one another. On the other hand, measurements thought to reflect different constructs should be unrelated, known as divergent validity. If my theory of personality says that extraversion and conscientiousness are two distinct constructs, then I should also see that my measurements of extraversion are unrelated to measurements of conscientiousness.
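
Convergent and divergent validity can also be sketched with simulated data. This is an illustration under assumed values, not real personality data: the two constructs are generated independently of each other on purpose, and cor() is used as a preview of a measure of relatedness covered later.

```r
set.seed(3)    # arbitrary seed so the simulation is repeatable

# Assumed setup: two distinct constructs, generated independently
n_people <- 500
true_extraversion <- rnorm(n_people)
true_conscientiousness <- rnorm(n_people)

# Two noisy measurements of extraversion, one of conscientiousness
questionnaire_extra <- true_extraversion + rnorm(n_people, sd = 0.5)
interview_extra <- true_extraversion + rnorm(n_people, sd = 0.5)
questionnaire_consc <- true_conscientiousness + rnorm(n_people, sd = 0.5)

# Convergent: two measures of the same construct should be closely related
convergent_r <- cor(questionnaire_extra, interview_extra)

# Divergent: measures of different constructs should be close to unrelated
divergent_r <- cor(questionnaire_extra, questionnaire_consc)

convergent_r
divergent_r
```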

A figure demonstrating the distinction between reliability and validity, using shots at a bullseye. The consistency of location of shots is a metaphor for reliability, and the accuracy of the shots with respect to the center of the bullseye is a metaphor for validity.

Chapter summary¶

After reading this chapter, you should be able to:

Define a variable
Make a vector in R
Describe the difference between qualitative and quantitative data
Recode vectors in R as numeric or categorical
Explain the difference between categorical, ordinal, and interval data
Do basic math operations on vectors
Talk about in what ways data can be more or less reliable
Define face validity and construct validity (including convergent and divergent validity)
