#!/usr/bin/env python
# coding: utf-8

# ```
# title: "Assignment 9: Data Wrangling, in Python"
# author: "Student Name"
# ```

# ## OVERVIEW
#
# This exercise accompanies the lessons in Environmental Data Analytics (ENV872L) on data wrangling in Python.
# (_This assignment mimics Assignment 4: Data Wrangling._)
#
# Helpful materials: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf and https://duke-nsoe.github.io/ENV872/Python-R.html

# ## Directions
# 1. Replace "Student Name" on line 2 (above) with your name.
#
# 2. Use the lesson as a guide. It contains code that can be modified to complete the assignment.
#
# 3. Work through the steps, **creating code and output** that fulfill each instruction.
#
# 4. Be sure to **answer the questions** in this assignment document.
# Create a new cell in the Jupyter notebook and set its type to `Markdown`. Type your answer there.
#
# 5. When you have completed the assignment, save your file as an HTML document via the `File` > `Download as` menu.
#
# 6. After converting to HTML, please submit the completed exercise to the dropbox in Sakai. Please add your last name to the file name (e.g., "Fay_A09_DataWranglingInPython.html") prior to submission.
#
# The completed exercise is due on **Tuesday, March 5** before class begins.
#
# ---
# ## Set up your session

# 1a. Import pandas

# In[ ]:




# 1b. Read the following CSV files into dataframes named as follows (example sketches for steps 1b through 7a appear after step 7a below):
# * `EPAair_PM25_NC2017_raw.csv` → `pm_2017`
# * `EPAair_PM25_NC2018_raw.csv` → `pm_2018`
# * `EPAair_O3_NC2017_raw.csv` → `o3_2017`
# * `EPAair_O3_NC2018_raw.csv` → `o3_2018`

# In[ ]:


pm_2017 = 
pm_2018 = 
o3_2017 = 
o3_2018 = 


# 2. Generate a few lines of code to get to know your `pm_2017` dataset (basic data summaries, etc.).

# In[ ]:


# Show the structure of the dataframe


# In[ ]:


# Show a summary of the data in the dataframe


# In[ ]:


# Show the "head", or first 5 rows, of the data


# In[ ]:


# Show what data type the "Date" field is


# ## Wrangle individual datasets to create processed files.

# 3. Change the `Date` field's data type to a date

# In[ ]:


# Convert the Date field to a datetime object


# 4. Select the following columns: `Date`, `DAILY_AQI_VALUE`, `Site Name`, `AQS_PARAMETER_DESC`, `COUNTY`, `SITE_LATITUDE`, `SITE_LONGITUDE`

# In[ ]:


# Select the subset of fields


# 5. For the PM2.5 datasets, fill the `AQS_PARAMETER_DESC` column with the value "PM2.5" (all cells in this column should be identical).

# In[ ]:




# 6. Save all four processed datasets in the Processed folder.

# In[ ]:




# ## Combine datasets

# 7. Merge (`concat`) the four datasets into one. Make sure your column names are identical prior to running this code.

# In[ ]:


# Concatenate the 4 dataframes into one
allData = 
allData.head()


# 7a. What are the dimensions of the concatenated dataframe?

# In[ ]:


# Report the shape of the dataframe
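# ---
# The next three cells are not part of the assignment template; they are minimal sketches of one way to complete steps 1b–7a. The `./Data/Raw/` and `./Data/Processed/` folder paths and the processed file name are assumptions — adjust them to match your own project structure.
#
# First, a sketch of steps 1b–2: read the four raw files and get to know `pm_2017`.

# In[ ]:


# Sketch for steps 1b-2 (the ./Data/Raw/ folder path is an assumption)
import pandas as pd

pm_2017 = pd.read_csv('./Data/Raw/EPAair_PM25_NC2017_raw.csv')
pm_2018 = pd.read_csv('./Data/Raw/EPAair_PM25_NC2018_raw.csv')
o3_2017 = pd.read_csv('./Data/Raw/EPAair_O3_NC2017_raw.csv')
o3_2018 = pd.read_csv('./Data/Raw/EPAair_O3_NC2018_raw.csv')

pm_2017.info()                 # structure: columns, dtypes, non-null counts
print(pm_2017.describe())      # summary statistics for the numeric columns
print(pm_2017.head())          # first 5 rows
print(pm_2017['Date'].dtype)   # Date reads in as a plain object (string)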
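# Next, a sketch of steps 3–6 applied to `pm_2017` only; the other three dataframes would be wrangled the same way. The output file name follows the raw-file naming pattern and is an assumption.

# In[ ]:


# Sketch for steps 3-6
cols = ['Date', 'DAILY_AQI_VALUE', 'Site Name', 'AQS_PARAMETER_DESC',
        'COUNTY', 'SITE_LATITUDE', 'SITE_LONGITUDE']

pm_2017['Date'] = pd.to_datetime(pm_2017['Date'])  # step 3: string -> datetime
pm_2017 = pm_2017[cols]                            # step 4: subset the columns
pm_2017['AQS_PARAMETER_DESC'] = 'PM2.5'            # step 5: one uniform label
pm_2017.to_csv('./Data/Processed/EPAair_PM25_NC2017_processed.csv',
               index=False)                        # step 6 (file name assumed)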
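# Finally, a sketch of steps 7–7a: stack the four wrangled frames row-wise and check the shape of the result.

# In[ ]:


# Sketch for steps 7-7a
allData = pd.concat([pm_2017, pm_2018, o3_2017, o3_2018], ignore_index=True)
print(allData.shape)   # (number of rows, number of columns)
# ---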
# ### Subsetting and splitting
#
# #### 8. Wrangle your new dataset...
# (Example sketches for steps 8–12 appear after step 12b, at the end of this notebook.)
# 1. Subset only the records in Blackstone, Bryson City, and Triple Oak:
#    1. Create a Boolean mask for each site name (`query` won't work because "Site Name" has a space in it!)
#    2. Apply these Boolean masks using the appropriate logical operators ("`&`", "`|`", "`~`") to select records in any of these sites into a new dataframe.
#    3. Make a "deep" copy of this dataframe (to avoid warnings about working on a subset of a dataframe...)
#    4. Display the dimensions of the resulting dataframe

# In[ ]:


# Create 3 Boolean masks, one each for Blackstone, Bryson City, Triple Oak
blackstoneMask = 
brysonMask = 
tripleoakMask = 


# In[ ]:


# Apply the masks to the dataframe
allDataSelect = 


# In[ ]:


# Create a deep copy of the above dataframe
allDataSelect = 


# In[ ]:


# Display the dimensions of the dataframe


# 2. Add columns for "Month" and "Year" by parsing your "Date" column (hint: see [`pd.DatetimeIndex`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html)...)

# In[ ]:


# Add columns for Month and Year


# **9. Spread (`pivot_table`) your datasets** such that AQI values for ozone and PM2.5 are in separate columns. Each location on a specific date should now occupy only one row.

# In[ ]:


allPivot = 


# 10. Call up the dimensions of your new tidy dataset.

# In[ ]:




# 11. Save your processed dataset with the following file name: "EPAair_O3_PM25_NC1718_Processed.csv"

# In[ ]:




# ## Generate summary tables

# 12. Use the split-apply-combine strategy to generate two new dataframes:
#    1. A summary table of mean AQI values for O3 and PM2.5 by month
#    2. A summary table of the mean, minimum, and maximum AQI values of O3 and PM2.5 for each site

# In[ ]:


# 12a. A summary table of mean AQI values for O3 and PM2.5 by month
# Split by month and parameter, apply the mean, then spread into a table
group = allDataSelect.groupby(['Month', 'AQS_PARAMETER_DESC'])
aggData = group['DAILY_AQI_VALUE'].mean().reset_index()
aggData.pivot_table(index='Month', columns='AQS_PARAMETER_DESC',
                    values='DAILY_AQI_VALUE')


# In[ ]:


# 12b. A summary table of the mean, minimum, and maximum AQI values of O3
# and PM2.5 for each site
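# ---
# As with the earlier sketches, the next three cells are not part of the assignment template; they show one possible way to complete steps 8–12. Column names follow the raw EPA files; file paths and index choices are assumptions.
#
# First, a sketch of step 8: Boolean masks combined with "|", a deep copy, and Month/Year columns parsed from the `Date` column as the hint suggests.

# In[ ]:


# Sketch for step 8
blackstoneMask = allData['Site Name'] == 'Blackstone'
brysonMask = allData['Site Name'] == 'Bryson City'
tripleoakMask = allData['Site Name'] == 'Triple Oak'

# "|" keeps a record if it matches any one of the three sites
allDataSelect = allData[blackstoneMask | brysonMask | tripleoakMask]

# Deep copy, so edits below don't raise SettingWithCopyWarning
allDataSelect = allDataSelect.copy(deep=True)
print(allDataSelect.shape)

# Parse Month and Year from the datetime Date column via pd.DatetimeIndex
allDataSelect['Month'] = pd.DatetimeIndex(allDataSelect['Date']).month
allDataSelect['Year'] = pd.DatetimeIndex(allDataSelect['Date']).year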
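# Next, a sketch of steps 9–11: pivot so each site/date pair occupies one row, with ozone and PM2.5 AQI values in separate columns. The choice of index columns and the output folder are assumptions.

# In[ ]:


# Sketch for steps 9-11
allPivot = allDataSelect.pivot_table(index=['Site Name', 'Date', 'Month', 'Year'],
                                     columns='AQS_PARAMETER_DESC',
                                     values='DAILY_AQI_VALUE').reset_index()
print(allPivot.shape)

allPivot.to_csv('./Data/Processed/EPAair_O3_PM25_NC1718_Processed.csv',
                index=False)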
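# Finally, a sketch of step 12b: group by site and parameter, then apply mean, min, and max in a single `agg` call.

# In[ ]:


# Sketch for step 12b
group = allDataSelect.groupby(['Site Name', 'AQS_PARAMETER_DESC'])
siteSummary = group['DAILY_AQI_VALUE'].agg(['mean', 'min', 'max']).reset_index()
siteSummary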