Once your data is clean, you can begin manipulating it. Data manipulation includes grouping data, combining data sets, summarizing data, and creating new variables. For this, we are going to use the dplyr package.
library(dplyr)
For this module, let's use the data we cleaned in the last module. I refactored the Month column to make sure its levels are in the correct order.
clean_data<-read.table("clean_data.txt")
clean_data$Month<-factor(clean_data$Month, levels=unique(clean_data$Month))
clean_data
The filter function allows you to subset rows from your data frame. For example, let's say we only want to look at rainfall in the month of January. We can do that using the filter function.
filter(clean_data, Month=="Jan")
Since there are only two observations for each month, this is a pretty small data frame. However, this function is quite useful for large datasets. In the second argument, you supply a logical condition built from an operator that tells the function how to filter the column. Since we wanted only values for the month of January, we used the "==" operator. There are many different operators you can use to filter your data. For example, the code below keeps only rainfall measurements greater than 1 mm.
filter(clean_data, Rainfall_mm>1)
Here are some more examples of operators.
== (Equal to)
!= (Not equal to)
< (Less than)
<= (Less than or equal to)
> (Greater than)
>= (Greater than or equal to)
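These operators can also be combined in a single filter call. The toy data frame below is invented for illustration (it is not the module's rainfall data), but the pattern carries over directly: conditions separated by commas are combined with AND, and you can use | for OR and ! for NOT.

```r
library(dplyr)

# Toy data frame, invented for illustration
toy <- data.frame(
  Month = c("Jan", "Feb", "Mar", "Apr"),
  Rainfall_mm = c(0.5, 2.1, 3.4, 0.9)
)

# Conditions separated by commas are combined with AND:
# keeps only the Feb row (2.1)
filter(toy, Rainfall_mm > 1, Rainfall_mm < 3)

# | means OR: keeps the Jan and Feb rows
filter(toy, Month == "Jan" | Month == "Feb")

# ! negates a condition: keeps every row except Jan
filter(toy, !(Month == "Jan"))
```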
The select function is used to choose specific columns from your data frame. It also accepts helper functions that select columns based on certain properties. For example, contains() (with the name in double quotes) selects every column whose name contains a given string.
select(clean_data, contains("Lake"))
The "-" operator can be used to denote which columns you want to exclude.
select(clean_data, -Period)
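contains() is only one of several selection helpers that dplyr provides. As a sketch, here are a few others, applied to a toy data frame invented for illustration:

```r
library(dplyr)

# Toy data frame, invented for illustration
toy <- data.frame(
  Month = c("Jan", "Feb"),
  Rainfall_mm = c(10, 20),
  Rainfall_inches = c(0.39, 0.79)
)

# starts_with()/ends_with() match a prefix or suffix of the column name
select(toy, starts_with("Rainfall"))
select(toy, ends_with("mm"))

# You can also select by position, or rename a column while selecting
select(toy, 1:2)
select(toy, Precip_mm = Rainfall_mm)
```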
It may be unproductive to write one line of code for each data cleaning and manipulation step. Sometimes you will see programmers use a pipe, which looks like this: %>%. Piping can make your code easier to read and more concise. For example, everything we just did above can be done in one line of code using a couple of pipes.
new_data <- clean_data %>% select(-Period) %>% filter(Rainfall_mm > 1) %>% filter(Lake == "Victoria")
print(new_data)
You can see how piping made our code more concise. In one line of code we were able to create a new data set that removed the Period column and kept only the Lake Victoria values greater than 1 mm. This is very useful for large datasets.
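To see why pipes help readability, it can be useful to compare a piped chain with the equivalent nested function calls, which have to be read from the inside out. This sketch uses an invented toy data frame rather than the module's rainfall data:

```r
library(dplyr)

# Toy data frame, invented for illustration
toy <- data.frame(
  Lake = c("Victoria", "Victoria", "Tanganyika"),
  Period = c("P1", "P2", "P1"),
  Rainfall_mm = c(2.5, 0.4, 3.1)
)

# Without pipes the calls nest, and you read them from the inside out
no_pipe <- filter(select(toy, -Period), Rainfall_mm > 1, Lake == "Victoria")

# With pipes the same steps read top to bottom
with_pipe <- toy %>%
  select(-Period) %>%
  filter(Rainfall_mm > 1) %>%
  filter(Lake == "Victoria")

identical(no_pipe, with_pipe)  # TRUE
```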
The dplyr package also allows you to summarize your data using different summary functions inside summarise.
summarise(clean_data, avg=mean(Rainfall_mm))
You can add more than one summary function in each argument.
summarise(clean_data, avg=mean(Rainfall_mm), n=n(), sd=sd(Rainfall_mm), var=var(Rainfall_mm), median=median(Rainfall_mm), min=min(Rainfall_mm), max=max(Rainfall_mm))
It may be more useful to look at the average rainfall for each lake. We can find the means of separate groups by using the group_by function and piping its result to the summarise function. The code would look like this:
clean_data %>% group_by(Lake) %>% summarise(avg=mean(Rainfall_mm), sd=sd(Rainfall_mm), min=min(Rainfall_mm), max=max(Rainfall_mm))
Dplyr also allows you to create new variables within your data set using the mutate function, which computes a new variable from one or more existing variables. The basic form of the call is:
mutate(data, new_column = expression)
For example, let's say we want to transform the Rainfall_mm to measure rainfall in inches.
clean_data_2<-mutate(clean_data, Rainfall_inches=Rainfall_mm*0.039)
clean_data_2
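mutate can also create several new variables in one call, and later expressions may refer to columns defined earlier in the same call. A minimal sketch with invented values:

```r
library(dplyr)

# Toy data frame, invented for illustration
toy <- data.frame(Rainfall_mm = c(10, 25.4, 50.8))

# Later expressions can use columns created earlier in the same mutate
converted <- mutate(toy,
                    Rainfall_cm = Rainfall_mm / 10,
                    Rainfall_inches = Rainfall_cm / 2.54)
converted
```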
Dplyr also has the ability to combine data sets into one. Let's read in some more lake data and combine it with our current data set!
lake_data<-read.table("lake_data.txt")
print(lake_data)
Now we have two more lakes: Lake SeaHawk and Lake Randall along with their monthly rainfall in mm. Let's clean and tidy our data to make it look like our clean_data data frame.
library(tidyr)
lake_data <- lake_data %>% gather(key="Lake", value="Rainfall_mm", 3:4) %>% mutate(Rainfall_inches=Rainfall_mm*0.039)
lake_data
Let's combine our two data sets!
All_lakes<-bind_rows(clean_data_2,lake_data)
All_lakes$Month<-as.factor(All_lakes$Month)
All_lakes$Lake<-as.factor(All_lakes$Lake)
print(All_lakes)
The bind_rows function adds the rows from your second argument to the dataframe in your first argument. You can also use the bind_cols function to add the columns from one dataset to the other. For this example, let's get our data back into wide format.
clean_data_wide<-clean_data%>%spread(key="Lake",value="Rainfall_mm")
clean_data_wide
lake_data_wide<-read.table("lake_data.txt")
All_lakes_wide<-bind_cols(clean_data_wide,lake_data_wide)
All_lakes_wide
You'll notice that this created two new columns, "Month1" and "Period1", because those columns share the same names in both datasets. To account for this you can use the full_join function. With the "by=" argument, you specify the shared columns to join on, matching rows that have the same values.
full_join(clean_data_wide,lake_data_wide, by=c("Month","Period"))
Always read any warning messages that R outputs. You'll notice that because the Month columns have different factor levels, R automatically converts them to a character vector. In cases like this, make sure you refactor your data if necessary.
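As a sketch of that refactoring step (using an invented character column standing in for the joined Month column), you can rebuild the factor with the level order you want, here the order of appearance:

```r
# Stand-in for a joined data frame whose Month column came back as character
joined <- data.frame(Month = c("Jan", "Jan", "Feb", "Feb"),
                     stringsAsFactors = FALSE)

# Rebuild the factor using the order in which the months appear
joined$Month <- factor(joined$Month, levels = unique(joined$Month))
levels(joined$Month)  # "Jan" "Feb"
```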
There are many other functions that dplyr offers. I would recommend exploring different functions to learn how to properly manipulate your data. In the next module we will learn all about t-tests!