title: "Assignment 9: Data Wrangling, in Python"
author: "Student Name"

OVERVIEW¶

This exercise accompanies the lessons in Environmental Data Analytics (ENV872L) on data wrangling in Python.
(This assignment mimics Assignment 4: Data Wrangling)

Helpful materials: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf and https://duke-nsoe.github.io/ENV872/Python-R.html

Directions¶

Change "Student Name" on line 2 (above) with your name.
Use the lesson as a guide. It contains code that can be modified to complete the assignment.
Work through the steps, creating code and output that fulfill each instruction.
Be sure to answer the questions in this assignment document.
Create a new cell in the Jupyter notebook and set its type to Markdown. Type your answer there.
When you have completed the assignment, save your file as an HTML document via the File>Download as menu.
After convertng to HTML, please submit the completed exercise to the dropbox in Sakai. Please add your last name into the file name (e.g., "Fay_A07_DataWranglingInPython.html") prior to submission.

The completed exercise is due on Tuesday March 5 before class begins.

Set up your session¶

1a. Import pandas

In [ ]:

1b. Read in the following CSV files into dataframes named as follows:

EPAair_PM25_NC2017_raw.csv → pm_2017
EPAair_PM25_NC2018_raw.csv → pm_2018
EPAair_O3_NC2017_raw.csv → o3_2017
EPAair_O3_NC2018_raw.csv → o3_2017

In [ ]:

pm_2017 = 
pm_2018 = 
o3_2017 = 
o3_2018 = 

Generate a few lines of code to get to know your pm_2017 dataset (basic data summaries, etc.).

In [ ]:

#Show the structure of the dataframe

In [ ]:

#Show a summary of the data in the dataframe

In [ ]:

#Show the "head" or the first 5 rows of data

In [ ]:

#Show what data type the "Date" field is

Wrangle individual datasets to create processed files.¶

Change Date field's data type to date

In [ ]:

#Convert the date field to a datetime object

Select the following columns: Date, DAILY_AQI_VALUE, Site.Name, AQS_PARAMETER_DESC, COUNTY, SITE_LATITUDE, SITE_LONGITUDE

In [ ]:

#Select the subset of fields

For the PM2.5 datasets, create a new column called AQS_PARAMETER_DESC filled with the value "PM2.5" (all cells in this column should be identical).

In [ ]:

Save all four processed datasets in the Processed folder.

In [ ]:

Combine datasets¶

Merge (concat) the four datasets into one. Make sure your column names are identical prior to running this code.

In [ ]:

#Concatenate the 4 data frames into one
allData = 
allData.head()

7a. What are the dimensions of the concatenated data frames?

In [ ]:

#Report the shape of the dataframe

Subsetting and splitting¶

8. Wrangle your new dataset...¶

Subset only records in Blackstone, Bryson City, Triple Oak
1. Create a Boolean mask for each site name (query won't work because "Site Name" has a space in it!)
2. Apply these Boolean masks using the appropriate logical operators ("&", "|", "~") to select records in any of these sites into a new dataframe.
3. Make a "deep" copy of this dataframe (to avoid warnings about working on a subset of a dataframe...)
4. Display the dimensions of the resulting dataframe

In [ ]:

# Create 3 Boolean masks, one each for Blackstone, Bryson City, Triple Oak
blackstoneMask = 
brysonMask = 
tripleoakMask = 

In [ ]:

#Apply the mask to the dataframe
allDataSelect = 

In [ ]:

#Create a deep copy of the above dataframe
allDataSelect = 

In [ ]:

#Display the dimensions of the dataframe

Add columns for "Month" and "Year" by parsing your "Date" column (hint: see pd.DatetimeIndex...)

In [ ]:

#C. Add columns for Month and Year

9. Spread (pivot_table) your datasets such that AQI values for ozone and PM2.5 are in separate columns. Each location on a specific date should now occupy only one row.

In [ ]:

allPivot =

Call up the dimensions of your new tidy dataset.

In [ ]:

Save your processed dataset with the following file name: "EPAair_O3_PM25_NC1718_Processed.csv"

In [ ]:

Generate summary tables¶

Use the split-apply-combine strategy to generate two new data frames:
1. A summary table of mean AQI values for O3 and PM2.5 by month
2. A summary table of the mean, minimum, and maximum AQI values of O3 and PM2.5 for each site

In [ ]:

#12a. A summary table of mean AQI values for O3 and PM2.5 by month
group = allDataSelect.groupby(['Month','AQS_PARAMETER_DESC'])
aggData = group['DAILY_AQI_VALUE'].mean().reset_index()
aggData.pivot_table(index='Month',columns='AQS_PARAMETER_DESC',values='DAILY_AQI_VALUE')

In [ ]:

#12b. A summary table of the mean, minimum, and maximum AQI values of O3 and PM2.5 for each site