This notebook downloads water use data for a user specified state and year from the National Water Information System server, and translates these data into physical supply and use table (PSUT), using a preformatted PSUT template and value mapping table.
Original data are from the USGS state water programs. Below is a sample URL for Louisiana:
https://waterdata.usgs.gov/la/nwis/water_use?format=rdb&rdb_compression=value&wu_area=County&wu_year=2000%2C2005%2C2010&wu_county=ALL&wu_category=ALL&wu_county_nms=--ALL%2BCounties--&wu_category_nms=--ALL%2BCategories--
The following may require installation before running this notebook. These can each be installed using pip
------>pip install pandas
---->pip install openpyxl
Created September 2017
John.Fay@duke.edu
#Specify the state and year to process
state = 'la' #Louisiana
year = 2010
#Get the input filenames
templateFilename = '../Data/Templates/StatePSUT_Template.xlsx'
remapFilename = '../Data/RemapTables/StatePSUTLookup.csv'
#Import built-in modules
import sys, os, urllib
from shutil import copyfile
#Import pandas, install if necessary...
try:
import pandas as pd
except:
import pip
pip.main(['install','pandas'])
import pandas as pd
#Import openpyxl, install if necessary...
try:
from openpyxl import load_workbook
except:
import pip
pip.main(['install','openpyxl'])
from openpyxl import load_workbook
la_2010.csv
) and will be saved in the StateData subfolder of the Data directory.¶This folder will be created, if it does not exist already. And then the PSUT template will be copied to this location using the output filename.
#Create the StateData folder, if not already present
outFolder = '..\\Data\\StateData'
if os.path.exists(outFolder) == False:
os.mkdir(outFolder)
#Set the output filename, putting it in the output data folder
outFilename = outFolder + os.sep + '{0}_{1}.xlsx'.format(state,year)
print("Output will be written to " + outFilename)
Output will be written to ..\Data\StateData\la_2010.xlsx
#Copy the template to the output filename
copyfile(src=templateFilename,dst=outFilename)
#Get the remap table and load as a dataframe
dfRemap = pd.read_csv(remapFilename,dtype='str',index_col="Index")
dfRaw
.¶#Set the data URL path and parameters and construct the url
path = 'https://waterdata.usgs.gov/{}/nwis/water_use?'.format(state)
values = {'format':'rdb',
'rdb_compression':'value',
'wu_area':'County',
'wu_year': year,
'wu_county':'ALL',
'wu_county_nms':'--ALL+Counties--',
'wu_category_nms':'--ALL+Categories--'
}
url = path + urllib.urlencode(values)
#Pull data in using the URL and remove the 2nd row of headers
dfRaw = pd.read_table(url,comment='#',header=[0,1],na_values='-')
dfRaw.columns = dfRaw.columns.droplevel(level=1)
#CHECK: Display a sample of the retrieved data
dfRaw.head()
state_cd | state_name | county_cd | county_nm | year | Total Population total population of area, in thousands | Public Supply population served by groundwater, in thousands | Public Supply population served by surface water, in thousands | Public Supply total population served, in thousands | Public Supply self-supplied groundwater withdrawals, fresh, in Mgal/d | ... | Hydroelectric Power total offstream surface-water withdrawals in Mgal/d | Hydroelectric Power power generated by instream use, in gigawatt-hours | Hydroelectric Power power generated by offstream use, in gigawatt-hours | Hydroelectric Power total power generated, in gigawatt-hours | Hydroelectric Power number of instream facilities | Hydroelectric Power number of offstream facilities | Hydroelectric Power total number of facilities | Wastewater Treatment returns by public wastewater facilities, in Mgal/d | Wastewater Treatment number of public wastewater facilities | Wastewater Treatment reclaimed wastewater released by wastewater facilities, in Mgal/d | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 22 | Louisiana | 1 | Acadia Parish | 2010 | 61.773 | 44.666 | 0.000 | 44.666 | 5.82 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 22 | Louisiana | 3 | Allen Parish | 2010 | 25.764 | 22.573 | 0.000 | 22.573 | 4.27 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 22 | Louisiana | 5 | Ascension Parish | 2010 | 107.215 | 29.510 | 27.094 | 56.604 | 3.02 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 22 | Louisiana | 7 | Assumption Parish | 2010 | 23.421 | 0.000 | 23.041 | 23.041 | 0.00 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 22 | Louisiana | 9 | Avoyelles Parish | 2010 | 42.073 | 39.756 | 0.000 | 39.756 | 3.85 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 281 columns
#Tidy the data: transform so data in each usage column become row values with a new column listing the usage type
rowHeadings = ['county_cd', 'county_nm', 'state_cd', 'state_name', 'year']
dfTidy = pd.melt(dfRaw,id_vars=rowHeadings,value_name='MGal',var_name='Group')
print("Data transformed from {0} rows/columns to {1} rows/columns".format(dfRaw.shape, dfTidy.shape))
Data transformed from (64, 281) rows/columns to (17664, 7) rows/columns
#Remove rows that don't have volume data (i.e. keep only columns with 'Mgal' in the name)
dfTidy = dfTidy[dfTidy['Group'].str.contains('Mgal')].copy(deep=True)
dfTidy.shape
(15040, 7)
#Change the type of the MGal column to float
dfTidy['MGal'] = dfTidy.MGal.astype("float")
#CHECK: Show the structure of the 'tidied' data frame
dfTidy.head()
county_cd | county_nm | state_cd | state_name | year | Group | MGal | |
---|---|---|---|---|---|---|---|
256 | 1 | Acadia Parish | 22 | Louisiana | 2010 | Public Supply self-supplied groundwater withdr... | 5.82 |
257 | 3 | Allen Parish | 22 | Louisiana | 2010 | Public Supply self-supplied groundwater withdr... | 4.27 |
258 | 5 | Ascension Parish | 22 | Louisiana | 2010 | Public Supply self-supplied groundwater withdr... | 3.02 |
259 | 7 | Assumption Parish | 22 | Louisiana | 2010 | Public Supply self-supplied groundwater withdr... | 0.00 |
260 | 9 | Avoyelles Parish | 22 | Louisiana | 2010 | Public Supply self-supplied groundwater withdr... | 3.85 |
#Join the remap table
dfAll = pd.merge(dfTidy,dfRemap,how='inner',left_on="Group",right_on="Group")
dfAll.head()
county_cd | county_nm | state_cd | state_name | year | Group | MGal | Column1 | Row1 | Column2 | Row2 | Column3 | Row3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Acadia Parish | 22 | Louisiana | 2010 | Public Supply self-supplied groundwater withdr... | 5.82 | N | 31 | NaN | NaN | NaN | NaN |
1 | 3 | Allen Parish | 22 | Louisiana | 2010 | Public Supply self-supplied groundwater withdr... | 4.27 | N | 31 | NaN | NaN | NaN | NaN |
2 | 5 | Ascension Parish | 22 | Louisiana | 2010 | Public Supply self-supplied groundwater withdr... | 3.02 | N | 31 | NaN | NaN | NaN | NaN |
3 | 7 | Assumption Parish | 22 | Louisiana | 2010 | Public Supply self-supplied groundwater withdr... | 0.00 | N | 31 | NaN | NaN | NaN | NaN |
4 | 9 | Avoyelles Parish | 22 | Louisiana | 2010 | Public Supply self-supplied groundwater withdr... | 3.85 | N | 31 | NaN | NaN | NaN | NaN |
os.path.exists(outFilename)
True
#Open the spreadsheet template
wb = load_workbook(filename=outFilename)
ws = wb['template']
ws.title = str(year)
As some water usage categories are inserted into more than one cell - some in as many as three - in the PSUT, the value mapping file contains three sets of cell coordinates (i.e. column/row pairs). Thus, we repeat the process of reading the dataframe and mapping values to the PSUT three times.
In cases where two or more usage categories map to the same PSUT cell, the values are summed. This is done in the dfAll.groupby
statement.
Where no data exist for a particular usage category, the value n/a
is inserted.
#Loop through the first set of row/columns and insert values into the Excel spreadsheet
dfRound1 = dfAll.groupby(['Column1','Row1'])['MGal'].sum()
dfRound1.fillna(value="n/a",inplace=True)
for (row,column), value in dfRound1.iteritems():
#Set the value in the workbook
rv = str(row)+ str(column)
ws[rv] = value
#Save the workbook
wb.save(outFilename)
#Loop through the second set of row/columns and insert values into the Excel spreadsheet
dfRound2 = dfAll.groupby(['Column2','Row2'])['MGal'].sum()
dfRound2.fillna(value="n/a",inplace=True)
for (row,column), value in dfRound2.iteritems():
#Set the value in the workbook
rv = str(row)+ str(column)
ws[rv] = value
#Save the workbook
wb.save(outFilename)
#Loop through the third set of row/columns and insert values into the Excel spreadsheet
dfRound3 = dfAll.groupby(['Column3','Row3'])['MGal'].sum()
dfRound3.fillna(value="n/a",inplace=True)
for (row,column), value in dfRound3.iteritems():
#print row, column, value
#Set the value in the workbook
rv = str(row)+ str(column)
ws[rv] = value
#Save the workbook
wb.save(outFilename)
os.startfile(outFilename)