#!/usr/bin/env python
# coding: utf-8
# ```
# title: "Assignment 9: Data Wrangling, in Python"
# author: "Student Name"
# ```
# ## OVERVIEW
#
# This exercise accompanies the lessons in Environmental Data Analytics (ENV872L) on data wrangling in Python.
# (_This assignment mimics Assignment 4: Data Wrangling._)
#
# Helpful materials: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf and https://duke-nsoe.github.io/ENV872/Python-R.html
# ## Directions
# 1. Replace "Student Name" on line 2 (above) with your name.
#
#
# 2. Use the lesson as a guide. It contains code that can be modified to complete the assignment.
#
#
# 3. Work through the steps, **creating code and output** that fulfill each instruction.
#
#
# 4. Be sure to **answer the questions** in this assignment document.
# Create a new cell in the Jupyter notebook and set its type to `Markdown`. Type your answer there.
#
#
# 5. When you have completed the assignment, save your file as an HTML document via the `File`>`Download as` menu.
#
#
# 6. After converting to HTML, please submit the completed exercise to the dropbox in Sakai. Please add your last name to the file name (e.g., "Fay_A09_DataWranglingInPython.html") prior to submission.
#
# The completed exercise is due on **Tuesday March 5** before class begins.
#
# ---
# ## Set up your session
# 1a. Import pandas
# In[ ]:
# 1b. Read in the following CSV files into dataframes named as follows:
# * `EPAair_PM25_NC2017_raw.csv` → `pm_2017`
# * `EPAair_PM25_NC2018_raw.csv` → `pm_2018`
# * `EPAair_O3_NC2017_raw.csv` → `o3_2017`
# * `EPAair_O3_NC2018_raw.csv` → `o3_2018`
# In[ ]:
pm_2017 =
pm_2018 =
o3_2017 =
o3_2018 =
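# A minimal sketch of this step. The real call would take a file path such as
# `"Data/Raw/EPAair_PM25_NC2017_raw.csv"` (the folder layout is an assumption);
# a tiny inline CSV is used here so the sketch runs without the actual data files:

```python
import io
import pandas as pd

# Hypothetical two-line CSV standing in for EPAair_PM25_NC2017_raw.csv
csv_text = "Date,DAILY_AQI_VALUE,Site Name\n01/01/2017,21,Blackstone\n"

# With the real data: pm_2017 = pd.read_csv("Data/Raw/EPAair_PM25_NC2017_raw.csv")
pm_2017 = pd.read_csv(io.StringIO(csv_text))
```

# The other three dataframes (`pm_2018`, `o3_2017`, `o3_2018`) follow the same pattern with their respective file names.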
# 2. Generate a few lines of code to get to know your `pm_2017` dataset (basic data summaries, etc.).
# In[ ]:
#Show the structure of the dataframe
# In[ ]:
#Show a summary of the data in the dataframe
# In[ ]:
#Show the "head" or the first 5 rows of data
# In[ ]:
#Show what data type the "Date" field is
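# A sketch of the exploratory calls above, run on a hypothetical two-row stand-in for `pm_2017`:

```python
import pandas as pd

# Hypothetical miniature dataframe standing in for pm_2017
pm_2017 = pd.DataFrame({
    "Date": ["01/01/2017", "01/02/2017"],
    "DAILY_AQI_VALUE": [21, 34],
})

pm_2017.dtypes           # structure: column names and their data types
pm_2017.describe()       # summary statistics for numeric columns
pm_2017.head()           # first 5 rows
pm_2017["Date"].dtype    # the Date field is "object" (strings) before conversion
```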
# ## Wrangle individual datasets to create processed files.
# 3. Change `Date` field's data type to date
# In[ ]:
#Convert the date field to a datetime object
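# A sketch of the conversion with `pd.to_datetime`. The `format` string assumes the raw EPA dates look like `"01/01/2017"`:

```python
import pandas as pd

pm_2017 = pd.DataFrame({"Date": ["01/01/2017", "01/02/2017"]})

# Convert string dates to pandas datetime objects
pm_2017["Date"] = pd.to_datetime(pm_2017["Date"], format="%m/%d/%Y")
```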
# 4. Select the following columns: `Date`, `DAILY_AQI_VALUE`, `Site Name`, `AQS_PARAMETER_DESC`, `COUNTY`, `SITE_LATITUDE`, `SITE_LONGITUDE`
# In[ ]:
#Select the subset of fields
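# One way to sketch the column selection: pass a list of column names to the indexing operator. The one-row dataframe below is hypothetical (an extra `UNITS` column is included just to show it being dropped):

```python
import pandas as pd

# Hypothetical single-row frame with the raw columns
o3_2017 = pd.DataFrame([{
    "Date": "01/01/2017", "DAILY_AQI_VALUE": 21, "Site Name": "Blackstone",
    "AQS_PARAMETER_DESC": "Ozone", "COUNTY": "Durham",
    "SITE_LATITUDE": 35.9, "SITE_LONGITUDE": -78.9, "UNITS": "ppm"}])

# Keep only the columns the assignment asks for
keep = ["Date", "DAILY_AQI_VALUE", "Site Name", "AQS_PARAMETER_DESC",
        "COUNTY", "SITE_LATITUDE", "SITE_LONGITUDE"]
o3_2017 = o3_2017[keep]
```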
# 5. For the PM2.5 datasets, create a new column called `AQS_PARAMETER_DESC` filled with the value "PM2.5" (all cells in this column should be identical).
# In[ ]:
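# Assigning a scalar to a new column name broadcasts it to every row, which is all this step needs. A sketch on a hypothetical frame:

```python
import pandas as pd

pm_2017 = pd.DataFrame({"Date": ["01/01/2017", "01/02/2017"],
                        "DAILY_AQI_VALUE": [21, 34]})

# Every cell in the new column gets the same value
pm_2017["AQS_PARAMETER_DESC"] = "PM2.5"
```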
# 6. Save all four processed datasets in the Processed folder.
# In[ ]:
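# A sketch of saving with `to_csv`. In the assignment the path would be something like `"Data/Processed/EPAair_PM25_NC2017_processed.csv"` (folder and file names are assumptions); a temporary directory is used here so the sketch runs anywhere. `index=False` keeps the row index out of the file:

```python
import os
import tempfile
import pandas as pd

pm_2017 = pd.DataFrame({"Date": ["01/01/2017"], "DAILY_AQI_VALUE": [21]})

# Stand-in for "Data/Processed/EPAair_PM25_NC2017_processed.csv"
out_path = os.path.join(tempfile.mkdtemp(), "EPAair_PM25_NC2017_processed.csv")
pm_2017.to_csv(out_path, index=False)
```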
# ## Combine datasets
# 7. Combine (`concat`) the four datasets into one. Make sure your column names are identical prior to running this code.
# In[ ]:
#Concatenate the 4 data frames into one
allData =
allData.head()
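# A sketch of `pd.concat` on two hypothetical one-row frames (the real call would pass all four dataframes). `ignore_index=True` renumbers the rows of the combined frame:

```python
import pandas as pd

pm = pd.DataFrame({"Date": ["01/01/2017"], "DAILY_AQI_VALUE": [21],
                   "AQS_PARAMETER_DESC": ["PM2.5"]})
o3 = pd.DataFrame({"Date": ["01/01/2017"], "DAILY_AQI_VALUE": [30],
                   "AQS_PARAMETER_DESC": ["Ozone"]})

# Stack the frames row-wise; columns must match for this to line up cleanly
allData = pd.concat([pm, o3], ignore_index=True)
```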
# 7a. What are the dimensions of the concatenated data frames?
# In[ ]:
#Report the shape of the dataframe
# ### Subsetting and splitting
#
# #### 8. Wrangle your new dataset...
# 1. Subset only records from the Blackstone, Bryson City, and Triple Oak sites
# 1. Create a Boolean mask for each site name (`query` won't work because "Site Name" has a space in it!)
# 2. Apply these Boolean masks using the appropriate logical operators ("`&`", "`|`", "`~`") to select records in any of these sites into a new dataframe.
# 3. Make a "deep" copy of this dataframe (to avoid warnings about working on a subset of a dataframe...)
# 4. Display the dimensions of the resulting dataframe
# In[ ]:
# Create 3 Boolean masks, one each for Blackstone, Bryson City, Triple Oak
blackstoneMask =
brysonMask =
tripleoakMask =
# In[ ]:
#Apply the mask to the dataframe
allDataSelect =
# In[ ]:
#Create a deep copy of the above dataframe
allDataSelect =
# In[ ]:
#Display the dimensions of the dataframe
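# A sketch of the mask-and-copy sequence above, on a hypothetical four-row frame. The `|` operator keeps rows matching any of the three sites, and `.copy(deep=True)` avoids the `SettingWithCopyWarning` when the subset is modified later:

```python
import pandas as pd

# Hypothetical frame standing in for allData
allData = pd.DataFrame({
    "Site Name": ["Blackstone", "Bryson City", "Durham", "Triple Oak"],
    "DAILY_AQI_VALUE": [21, 30, 15, 42]})

# One Boolean mask per site ("Site Name" has a space, so query() is awkward here)
blackstoneMask = allData["Site Name"] == "Blackstone"
brysonMask = allData["Site Name"] == "Bryson City"
tripleoakMask = allData["Site Name"] == "Triple Oak"

# "|" (or) selects rows in any of the three sites
allDataSelect = allData[blackstoneMask | brysonMask | tripleoakMask]

# Deep copy so later edits don't touch a view of allData
allDataSelect = allDataSelect.copy(deep=True)

allDataSelect.shape
```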
# 2. Add columns for "Month" and "Year" by parsing your "Date" column (hint: see [`pd.DatetimeIndex`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html)...)
# In[ ]:
#C. Add columns for Month and Year
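# A sketch of parsing month and year out of a datetime column via `pd.DatetimeIndex`, on two hypothetical dates:

```python
import pandas as pd

allDataSelect = pd.DataFrame({"Date": pd.to_datetime(["2017-01-15", "2018-06-01"])})

# DatetimeIndex exposes .month and .year accessors
allDataSelect["Month"] = pd.DatetimeIndex(allDataSelect["Date"]).month
allDataSelect["Year"] = pd.DatetimeIndex(allDataSelect["Date"]).year
```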
# **9. Spread (`pivot_table`) your datasets** such that AQI values for ozone and PM2.5 are in separate columns. Each location on a specific date should now occupy only one row.
# In[ ]:
allPivot =
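# A sketch of the spread with `pivot_table`: date and site go in the row index, `AQS_PARAMETER_DESC` values become columns, and `reset_index()` flattens the result back to ordinary columns. The two-row input frame is hypothetical:

```python
import pandas as pd

allDataSelect = pd.DataFrame({
    "Date": pd.to_datetime(["2017-01-01", "2017-01-01"]),
    "Site Name": ["Blackstone", "Blackstone"],
    "AQS_PARAMETER_DESC": ["Ozone", "PM2.5"],
    "DAILY_AQI_VALUE": [30, 21]})

# Each (Date, Site Name) pair becomes one row with Ozone and PM2.5 columns
allPivot = allDataSelect.pivot_table(index=["Date", "Site Name"],
                                     columns="AQS_PARAMETER_DESC",
                                     values="DAILY_AQI_VALUE").reset_index()
```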
# 10. Display the dimensions of your new tidy dataset.
# In[ ]:
# 11. Save your processed dataset with the following file name: "EPAair_O3_PM25_NC1718_Processed.csv"
# In[ ]:
# ## Generate summary tables
# 12. Use the split-apply-combine strategy to generate two new data frames:
# 1. A summary table of mean AQI values for O3 and PM2.5 by month
# 2. A summary table of the mean, minimum, and maximum AQI values of O3 and PM2.5 for each site
# In[ ]:
#12a. A summary table of mean AQI values for O3 and PM2.5 by month
group = allDataSelect.groupby(['Month','AQS_PARAMETER_DESC'])
aggData = group['DAILY_AQI_VALUE'].mean().reset_index()
aggData.pivot_table(index='Month',columns='AQS_PARAMETER_DESC',values='DAILY_AQI_VALUE')
# In[ ]:
#12b. A summary table of the mean, minimum, and maximum AQI values of O3 and PM2.5 for each site
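# For 12b, one way to sketch split-apply-combine with several statistics at once is `.agg()` with a list of function names, grouping by site and parameter (the three-row frame is hypothetical):

```python
import pandas as pd

allDataSelect = pd.DataFrame({
    "Site Name": ["Blackstone", "Blackstone", "Triple Oak"],
    "AQS_PARAMETER_DESC": ["Ozone", "Ozone", "PM2.5"],
    "DAILY_AQI_VALUE": [30, 40, 21]})

# Split by site and parameter, apply three aggregations, combine into one table
siteSummary = (allDataSelect
               .groupby(["Site Name", "AQS_PARAMETER_DESC"])["DAILY_AQI_VALUE"]
               .agg(["mean", "min", "max"])
               .reset_index())
```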