Here we download the raw usage data for years 2000, 2005, and 2010 from the USGS water-use site and synthesize it into a single tidy table listing, for each year and county FIPS code, total annual withdrawals (MGal/day) tagged by use sector, source class (surface or groundwater), and water type (fresh or saline). This table is formatted as follows:
STATE | FIPS | YEAR | UseClass | SrcClass | SrcType | Amount | |
---|---|---|---|---|---|---|---|
AL | 01001 | 2000 | PublicSupply | Surface | Fresh | 0.00 | |
AL | 01001 | 2000 | Industrial | Ground | Saline | 0.00 |
This table can then easily be summarized and joined, by YEAR and FIPS code, to other accounting data tables and summarized by state.
#Import modules required for analysis
import os
import pandas as pd
import numpy as np
#Prefix remapping dictionary: translate the 2000/2005 column-name prefixes
#to the naming scheme used by the 2010 dataset so all years line up
remapDict = {
    'IT-': 'IR-',     # Irrigation (total)
    'LA-': 'AQ-',     # Aquaculture
    'LS-': 'LI-',     # Livestock
    'PE-': 'PC-',     # Closed-loop thermoelectric
    'Wtotl': 'WTotl', # Capitalization mismatch
}
#Create a function for importing a table and standardizing all field names
def importTable(year, remapDict):
    """Download the USGS county-level water-use table for `year` and
    standardize its column names to the 2010 naming scheme.

    Parameters:
        year      -- survey year (2000, 2005, or 2010); used to build the URL
        remapDict -- dict mapping old column-name substrings to 2010-style ones

    Returns a pandas DataFrame with a 5-character string FIPS column, a YEAR
    column, and without the STATEFIPS/COUNTYFIPS or trailing 'Unnamed' columns.
    """
    #Set the download URL using the year (the year appears twice in the path)
    theURL = 'http://water.usgs.gov/watuse/data/{0}/usco{0}.txt'.format(year)
    #Download the tab-delimited dataset, keeping FIPS as a string so
    #leading zeros are not lost (read_csv replaces deprecated read_table)
    df = pd.read_csv(theURL, sep='\t', dtype={'FIPS': str})
    #Ensure the FIPS column maintains 5 characters (restore leading zeros);
    #vectorized str.zfill also leaves any missing values as NaN instead of '00nan'
    df['FIPS'] = df['FIPS'].str.zfill(5)
    #Remove the STATEFIPS and COUNTYFIPS columns (redundant with FIPS)
    df.drop(['STATEFIPS', 'COUNTYFIPS'], axis=1, inplace=True)
    #Use the remap dictionary to rename columns.
    #NOTE: this is a substring replacement, not prefix-only; it is safe for
    #the current USGS column names but would misfire on a name containing a
    #key mid-string — confirm if new survey years are added.
    colNames = df.columns.values.tolist()
    for inFld, outFld in remapDict.items():
        colNames = [name.replace(inFld, outFld) for name in colNames]
    #Update the column names in the data frame
    df.columns = colNames
    #Add year field, if not present (2000 data lacks it)
    if 'YEAR' not in df.columns.values:
        df.insert(1, 'YEAR', year)
    #Remove the trailing unnamed column, which oddly appears in the
    #2005 and 2010 datasets (an artifact of a trailing tab in the source file)
    if 'Unnamed' in df.columns.values[-1]:
        df.drop(df.columns.values[-1], axis=1, inplace=True)
    #Status (print() with a single argument works in both Python 2 and 3,
    #unlike the original Python-2-only print statement)
    print('{} records and {} attributes returned for year {}'.format(
        df.shape[0], df.shape[1], year))
    #Return the data frame
    return df
#Get the tables for each survey year (prints a status line per year)
df2000, df2005, df2010 = [importTable(yr, remapDict) for yr in (2000, 2005, 2010)]
3222 records and 69 attributes returned for year 2000 3224 records and 106 attributes returned for year 2005 3224 records and 115 attributes returned for year 2010
#CHECK: example output with a column for each supply/source/type grouping.
df2000.head()
STATE | YEAR | FIPS | TP-TotPop | PS-TOPop | PS-WGWFr | PS-WSWFr | PS-WFrTo | DO-SSPop | DO-WGWFr | ... | PC-WTotl | TO-WGWFr | TO-WGWSa | TO-WGWTo | TO-WSWFr | TO-WSWSa | TO-WSWTo | TO-WFrTo | TO-WSaTo | TO-WTotl | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AL | 2000 | 01001 | 43.67 | 39.30 | 5.70 | 0.00 | 5.70 | 4.37 | 2.95 | ... | 0.0 | 10.65 | 0.0 | 10.65 | 26.60 | 0.0 | 26.60 | 37.25 | 0.0 | 37.25 |
1 | AL | 2000 | 01003 | 140.42 | 107.61 | 19.70 | 0.00 | 19.70 | 32.81 | 2.46 | ... | 0.0 | 27.04 | 0.0 | 27.04 | 7.59 | 0.0 | 7.59 | 34.63 | 0.0 | 34.63 |
2 | AL | 2000 | 01005 | 29.04 | 22.27 | 4.85 | 0.00 | 4.85 | 6.77 | 0.51 | ... | 0.0 | 6.36 | 0.0 | 6.36 | 1.06 | 0.0 | 1.06 | 7.42 | 0.0 | 7.42 |
3 | AL | 2000 | 01007 | 20.83 | 17.93 | 3.78 | 0.00 | 3.78 | 2.90 | 0.22 | ... | 0.0 | 4.00 | 0.0 | 4.00 | 0.00 | 0.0 | 0.00 | 4.00 | 0.0 | 4.00 |
4 | AL | 2000 | 01009 | 51.02 | 45.90 | 2.55 | 29.26 | 31.81 | 5.12 | 0.38 | ... | 0.0 | 2.93 | 0.0 | 2.93 | 29.33 | 0.0 | 29.33 | 32.26 | 0.0 | 32.26 |
5 rows × 69 columns
#"Melt" data so each use/class/type becomes a new row
#All three years share the same id columns and melt settings
meltArgs = dict(id_vars=['STATE', 'FIPS', 'YEAR'], var_name='Class', value_name='Amount')
df2000m = pd.melt(df2000, **meltArgs)
df2005m = pd.melt(df2005, **meltArgs)
df2010m = pd.melt(df2010, **meltArgs)
#CHECK: Observe how the data has been 'melted'
df2000m.head()
STATE | FIPS | YEAR | Class | Amount | |
---|---|---|---|---|---|
0 | AL | 01001 | 2000 | TP-TotPop | 43.67 |
1 | AL | 01003 | 2000 | TP-TotPop | 140.42 |
2 | AL | 01005 | 2000 | TP-TotPop | 29.04 |
3 | AL | 01007 | 2000 | TP-TotPop | 20.83 |
4 | AL | 01009 | 2000 | TP-TotPop | 51.02 |
#Stack the three melted year tables into one long table
yearFrames = [df2000m, df2005m, df2010m]
dfUse = pd.concat(yearFrames, ignore_index=True)
#CHECK: dfUse should have as many rows as the 2000, 2005, and 2010 data combined
dfUse.shape[0]
905812
#Create remap dictionaries. These enable extraction of specific sub-attributes
# contained in the Class field. For example, PS-WGWFr refers to Fresh groundwater
# withdrawn for Public Supply.
#Use-class codes appear as the two-letter prefix of each Class value
useClassMap = {
    'TP': 'Population',
    'PS': 'PublicSupply',
    'DO': 'Domestic',
    'IN': 'Industrial',
    'IR': 'Irrigation',
    'IC': 'Irrigation_Crop',
    'IG': 'Irrigation_Golf',
    'LI': 'Livestock',
    'AQ': 'Aquaculture',
    'MI': 'Mining',
    'PT': 'ThermoElec',
    'PO': 'ThermoElec_OnceThru',
    'PC': 'ThermoElec_Recirc',
    'TO': 'Total',
}
#Source-class codes appear mid-string (groundwater vs. surface water)
srcClassMap = {
    '-WGW': 'Groundwater',
    '-WSW': 'Surface',
}
#Source-type codes appear as the suffix (fresh vs. saline)
srcTypeMap = {
    'Fr': 'Fresh',
    'Sa': 'Saline',
}
#Decode the Class field into three new columns using the remap dictionaries.
#Use classes (PS, DO, LI, etc.) are matched as prefixes
for prefix, label in useClassMap.items():
    dfUse.loc[dfUse.Class.str.startswith(prefix), 'UseClass'] = label
#Source classes (surface vs. groundwater) are matched mid-string
for token, label in srcClassMap.items():
    dfUse.loc[dfUse.Class.str.contains(token), 'SrcClass'] = label
#Source types (fresh vs. saline) are matched as suffixes
for suffix, label in srcTypeMap.items():
    dfUse.loc[dfUse.Class.str.endswith(suffix), 'SrcType'] = label
#CHECK: List classes with null UseClass remaps
dfUse.loc[dfUse['UseClass'].isnull(), 'Class'].unique()
array(['COUNTY'], dtype=object)
#CHECK: List the Class codes left with a null SrcClass after remapping
dfUse.loc[dfUse['SrcClass'].isnull(), 'Class'].unique()
array(['TP-TotPop', 'PS-TOPop', 'PS-WFrTo', 'DO-SSPop', 'DO-WFrTo', 'IN-WFrTo', 'IN-WSaTo', 'IN-WTotl', 'IR-IrSpr', 'IR-IrMic', 'IR-IrSur', 'IR-IrTot', 'IR-WFrTo', 'AQ-WFrTo', 'LI-WFrTo', 'MI-WFrTo', 'MI-WSaTo', 'MI-WTotl', 'PT-WFrTo', 'PT-WSaTo', 'PT-WTotl', 'PC-WFrTo', 'PC-WSaTo', 'PC-WTotl', 'TO-WFrTo', 'TO-WSaTo', 'TO-WTotl', 'PS-GWPop', 'PS-SWPop', 'PS-WSaTo', 'PS-WTotl', 'DO-PSDel', 'DO-TOTAL ', 'IC-WFrTo', 'IC-IrSpr', 'IC-IrMic', 'IC-IrSur', 'IC-IrTot', 'IG-WFrTo', 'IG-IrSpr', 'IG-IrMic', 'IG-IrSur', 'IG-IrTot', 'PT-Power', 'PO-WFrTo', 'PO-WSaTo', 'PO-WTotl', 'PO-Power', 'PC-Power', 'COUNTY', 'DO-SSPCp', 'DO-PSPCp', 'AQ-WSaTo', 'AQ-WTotl'], dtype=object)
#CHECK: List classes with null SrcType remaps
dfUse.loc[dfUse['SrcType'].isnull(), 'Class'].unique()
array(['TP-TotPop', 'PS-TOPop', 'PS-WFrTo', 'DO-SSPop', 'DO-WFrTo', 'IN-WGWTo', 'IN-WSWTo', 'IN-WFrTo', 'IN-WSaTo', 'IN-WTotl', 'IR-IrSpr', 'IR-IrMic', 'IR-IrSur', 'IR-IrTot', 'IR-WFrTo', 'AQ-WFrTo', 'LI-WFrTo', 'MI-WGWTo', 'MI-WSWTo', 'MI-WFrTo', 'MI-WSaTo', 'MI-WTotl', 'PT-WSWTo', 'PT-WFrTo', 'PT-WSaTo', 'PT-WTotl', 'PO-WSWTo', 'PC-WSWTo', 'PC-WFrTo', 'PC-WSaTo', 'PC-WTotl', 'TO-WGWTo', 'TO-WSWTo', 'TO-WFrTo', 'TO-WSaTo', 'TO-WTotl', 'PS-GWPop', 'PS-SWPop', 'PS-WGWTo', 'PS-WSWTo', 'PS-WSaTo', 'PS-WTotl', 'DO-PSDel', 'DO-TOTAL ', 'IC-WFrTo', 'IC-IrSpr', 'IC-IrMic', 'IC-IrSur', 'IC-IrTot', 'IG-WFrTo', 'IG-IrSpr', 'IG-IrMic', 'IG-IrSur', 'IG-IrTot', 'PT-WGWTo', 'PT-Power', 'PO-WGWTo', 'PO-WFrTo', 'PO-WSaTo', 'PO-WTotl', 'PO-Power', 'PC-WGWTo', 'PC-Power', 'COUNTY', 'DO-SSPCp', 'DO-PSPCp', 'AQ-WGWTo', 'AQ-WSWTo', 'AQ-WSaTo', 'AQ-WTotl'], dtype=object)
#Keep only fully-decoded rows: a null in any column means the Class code was
#a population, total, or other field that is not a sector/source withdrawal
dfOutput = dfUse.dropna(axis='index', how='any')
#CHECK: see that new columns were added to decode the Class
dfOutput.head()
STATE | FIPS | YEAR | Class | Amount | UseClass | SrcClass | SrcType | |
---|---|---|---|---|---|---|---|---|
6444 | AL | 01001 | 2000 | PS-WGWFr | 5.7 | PublicSupply | Groundwater | Fresh |
6445 | AL | 01003 | 2000 | PS-WGWFr | 19.7 | PublicSupply | Groundwater | Fresh |
6446 | AL | 01005 | 2000 | PS-WGWFr | 4.85 | PublicSupply | Groundwater | Fresh |
6447 | AL | 01007 | 2000 | PS-WGWFr | 3.78 | PublicSupply | Groundwater | Fresh |
6448 | AL | 01009 | 2000 | PS-WGWFr | 2.55 | PublicSupply | Groundwater | Fresh |
#Set the output location and filename
dataDir = '../../Data'
outFN = 'UsageDataTidy.csv'
#Write the tidy table; os.path.join builds the path portably instead of
#manual string concatenation with os.sep. NOTE(review): assumes dataDir
#already exists — to_csv will raise IOError otherwise.
dfOutput.to_csv(os.path.join(dataDir, outFN), index=False)