title: "Assignment 9: Data Wrangling, in Python"
author: "Student Name"
This exercise accompanies the lessons in Environmental Data Analytics (ENV872L) on data wrangling in Python.
(This assignment mimics Assignment 4: Data Wrangling)
Helpful materials: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf and https://duke-nsoe.github.io/ENV872/Python-R.html
Change "Student Name" on line 2 (above) with your name.
Use the lesson as a guide. It contains code that can be modified to complete the assignment.
Work through the steps, creating code and output that fulfill each instruction.
Be sure to answer the questions in this assignment document.
Create a new cell in the Jupyter notebook and set its type to Markdown
. Type your answer there.
When you have completed the assignment, save your file as an HTML document via the File
>Download as
menu.
After convertng to HTML, please submit the completed exercise to the dropbox in Sakai. Please add your last name into the file name (e.g., "Fay_A07_DataWranglingInPython.html") prior to submission.
The completed exercise is due on Tuesday March 5 before class begins.
1a. Import pandas
1b. Read in the following CSV files into dataframes named as follows:
EPAair_PM25_NC2017_raw.csv
→ pm_2017
EPAair_PM25_NC2018_raw.csv
→ pm_2018
EPAair_O3_NC2017_raw.csv
→ o3_2017
EPAair_O3_NC2018_raw.csv
→ o3_2017
pm_2017 =
pm_2018 =
o3_2017 =
o3_2018 =
pm_2017
dataset (basic data summaries, etc.).#Show the structure of the dataframe
#Show a summary of the data in the dataframe
#Show the "head" or the first 5 rows of data
#Show what data type the "Date" field is
Date
field's data type to date#Convert the date field to a datetime object
Date
, DAILY_AQI_VALUE
, Site.Name
, AQS_PARAMETER_DESC
, COUNTY
, SITE_LATITUDE
, SITE_LONGITUDE
#Select the subset of fields
AQS_PARAMETER_DESC
filled with the value "PM2.5" (all cells in this column should be identical).
concat
) the four datasets into one. Make sure your column names are identical prior to running this code.#Concatenate the 4 data frames into one
allData =
allData.head()
7a. What are the dimensions of the concatenated data frames?
#Report the shape of the dataframe
query
won't work because "Site Name" has a space in it!)&
", "|
", "~
") to select records in any of these sites into a new dataframe.# Create 3 Boolean masks, one each for Blackstone, Bryson City, Triple Oak
blackstoneMask =
brysonMask =
tripleoakMask =
#Apply the mask to the dataframe
allDataSelect =
#Create a deep copy of the above dataframe
allDataSelect =
#Display the dimensions of the dataframe
pd.DatetimeIndex
...)#C. Add columns for Month and Year
9. Spread (pivot_table
) your datasets such that AQI values for ozone and PM2.5 are in separate columns. Each location on a specific date should now occupy only one row.
allPivot =
#12a. A summary table of mean AQI values for O3 and PM2.5 by month
group = allDataSelect.groupby(['Month','AQS_PARAMETER_DESC'])
aggData = group['DAILY_AQI_VALUE'].mean().reset_index()
aggData.pivot_table(index='Month',columns='AQS_PARAMETER_DESC',values='DAILY_AQI_VALUE')
#12b. A summary table of the mean, minimum, and maximum AQI values of O3 and PM2.5 for each site