Using Flow Duration Curves to Determine Basin Characteristics and Estimate Flow
This notebook uses the Scientific Python (scipy) stack tools to generate flow duration curves from current USGS NWIS data.
Using recipes from this notebook, you can make:
Import necessary packages. ulmo is not a standard package and will have to be loaded into your local python repository for some of these functions to work. Use the following command in your cmd prompt to install ulmo
:
pip install ulmo
You might also be missing the pandas and scipy packages, which is a shame, because pandas
is awesome
pip install pandas
pip install scipy
Check out this for some great pandas
applications: http://earthpy.org/time_series_analysis_with_pandas_part_2.html
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as dates
import scipy.stats as sp
import scipy.optimize as op
import statsmodels.api as sm
import ulmo
from pandas.stats.api import ols
from datetime import datetime, date, timedelta
from matplotlib.backends.backend_pdf import PdfPages
from matplotlib.pyplot import cm
from pylab import rcParams
rcParams['figure.figsize'] = 10, 8
import logging
logging.basicConfig()
logger = logging.getLogger('ulmo.usgs.nwis.core')
logger.setLevel(logging.WARNING)
rpy2
allows one to run R
files within IPython Notebook. You may need to configure the environment variable settings for this to work properly. If you get errors, use the Google.
import rpy2
Set a directory to store output figures:
pwd
u'/home/magik/INTERNET/EARTHPY/earthpy.org/content/notebooks'
rootname = '/home/magik/INTERNET/EARTHPY/data'
dataroot = '/home/magik/INTERNET/EARTHPY/data'
# rootname = 'U:\\GWP\\Groundwater\\UMSS_Manti\\Data\\Hydrology\\plots\\'
# dataroot = 'U:\\GWP\\Groundwater\\UMSS_Manti\\Data\\Hydrology\\'
The function importusgssite
uses the ulmo package to retrieve U.S. Geological Survey surface water site data from the National Water Information System (NWIS) website. The function also puts the data from the website into a usable format. getusgssiteinfo
gets site information from NWIS, which includes site name, watershed size, and coordinates.
def importusgssite(siteno):
sitename = {}
sitename = ulmo.usgs.nwis.get_site_data(siteno, service="daily", period="all")
sitename = pd.DataFrame(sitename['00060:00003']['values'])
sitename['dates'] = pd.to_datetime(pd.Series(sitename['datetime']))
sitename.set_index(['dates'],inplace=True)
sitename[siteno] = sitename['value'].astype(float)
sitename[str(siteno)+'qual'] = sitename['qualifiers']
sitename = sitename.drop(['datetime','qualifiers','value'],axis=1)
sitename = sitename.replace('-999999',np.NAN)
sitename = sitename.dropna()
#sitename['mon']=sitename.index.month
return sitename
def getusgssiteinfo(siteno):
siteinfo = ulmo.usgs.nwis.get_site_data(siteno, service="daily", period="all")
siteinfo = pd.DataFrame(siteinfo['00060:00003']['site'])
siteinfo['latitude'] = siteinfo.loc['latitude','location']
siteinfo['longitude'] = siteinfo.loc['longitude','location']
siteinfo['latitude'] = siteinfo['latitude'].astype(float)
siteinfo['longitude'] = siteinfo['longitude'].astype(float)
siteinfo = siteinfo.drop(['default_tz','dst_tz','srs','uses_dst','longitude'],axis=0)
siteinfo = siteinfo.drop(['agency','timezone_info','location','state_code','network'],axis=1)
return siteinfo
fdc
generates a flow duration curve for hydrologic data. A flow duration curve is a percent point function (ppf), displaying discharge as a function of probability of that discharge occuring. The ppf is the inverse of the better known cumulative distribution function (cdf). See this USGS publication for more information.
def fdc(df,site,begyear,endyear):
'''
Generate flow duration curve for hydrologic time series data
df = pandas dataframe containing data
site = column within dataframe that contains the flow values
begyear = start year for analysis
endyear = end year for analysis
'''
data = df[(df.index.to_datetime() > pd.datetime(begyear,1,1))&(df.index.to_datetime() < pd.datetime(endyear,1,1))]
data = data[site].dropna().values
data = np.sort(data)
ranks = sp.rankdata(data, method='average')
ranks = ranks[::-1]
prob = [100*(ranks[i]/(len(data)+1)) for i in range(len(data)) ]
plt.figure()
plt.scatter(prob,data,label=site)
plt.yscale('log')
plt.grid(which = 'both')
plt.xlabel('% of time that indicated discharge was exceeded or equaled')
plt.ylabel('discharge (cfs)')
plt.xticks(range(0,100,5))
plt.title('Flow duration curve for ' + site)
def fdc_simple(df, site, begyear=1900, endyear=2015, normalizer=1):
'''
Generate flow duration curve for hydrologic time series data
PARAMETERS:
df = pandas dataframe of interest; must have a date or date-time as the index
site = pandas column containing discharge data; must be within df
begyear = beginning year of analysis; defaults to 1900
endyear = end year of analysis; defaults to 2015
normalizer = value to use to normalize discharge; defaults to 1 (no normalization)
RETURNS:
matplotlib plot displaying the flow duration curve of the data
REQUIRES:
numpy as np
pandas as pd
matplotlib.pyplot as plt
scipy.stats as sp
'''
# limit dataframe to only the site
df = df[[site]]
# filter dataframe to only include dates of interest
data = df[(df.index.to_datetime() > pd.datetime(begyear,1,1))&(df.index.to_datetime() < pd.datetime(endyear,1,1))]
# remove na values from dataframe
data = data.dropna()
# take average of each day of year (from 1 to 366) over the selected period of record
data['doy']=data.index.dayofyear
dailyavg = data[site].groupby(data['doy']).mean()
data = np.sort(dailyavg)
## uncomment the following to use normalized discharge instead of discharge
#mean = np.mean(data)
#std = np.std(data)
#data = [(data[i]-np.mean(data))/np.std(data) for i in range(len(data))]
data = [(data[i])/normalizer for i in range(len(data))]
# ranks data from smallest to largest
ranks = sp.rankdata(data, method='average')
# reverses rank order
ranks = ranks[::-1]
# calculate probability of each rank
prob = [(ranks[i]/(len(data)+1)) for i in range(len(data)) ]
# plot data via matplotlib
plt.plot(prob,data,label=site+' '+str(begyear)+'-'+str(endyear))
The fdcmatch
uses Python's optimization capabilities to fit natural logarithim and exponential functions the flow duration curves.
def fdcmatch(df, site, begyear=1900, endyear=2015, normalizer=1, fun=1):
'''
* This function creates a flow duration curve (or its inverse) and then matches a natural logrithmic function (or its inverse - exp)
to the flow duration curve
* The flow duration curve will be framed for averaged daily data for the duration of one year (366 days)
PARAMETERS:
df = pandas dataframe of interest; must have a date or date-time as the index
site = pandas column containing discharge data; must be within df
begyear = beginning year of analysis; defaults to 1900
endyear = end year of analysis; defaults to 2015
normalizer = value to use to normalize discharge; defaults to 1 (no normalization)
fun = 1 for probability as a function of discharge; 0 for discharge as a function of probability; default=1
* 1 will choose:
prob = a*ln(discharge*b+c)+d
* 0 will choose:
discharge = a*exp(prob*b+c)+d
RETURNS:
para, parb, parc, pard, r_squared_value, stderr
par = modifying variables for functions = a,b,c,d
r_squared_value = r squared value for model
stderr = standard error of the estimate
REQUIREMENTS:
pandas, scipy, numpy
'''
df = df[[site]]
# filter dataframe to only include dates of interest
data = df[(df.index.to_datetime() > pd.datetime(begyear,1,1))&(df.index.to_datetime() < pd.datetime(endyear,1,1))]
# remove na values from dataframe
data = data.dropna()
# take average of each day of year (from 1 to 366) over the selected period of record
data['doy']=data.index.dayofyear
dailyavg = data[site].groupby(data['doy']).mean()
data = np.sort(dailyavg)
## uncomment the following to use normalized discharge instead of discharge
#mean = np.mean(data)
#std = np.std(data)
#data = [(data[i]-np.mean(data))/np.std(data) for i in range(len(data))]
data = [(data[i])/normalizer for i in range(len(data))]
# ranks data from smallest to largest
ranks = sp.rankdata(data, method='average')
# reverses rank order
ranks = ranks[::-1]
# calculate probability of each rank
prob = [(ranks[i]/(len(data)+1)) for i in range(len(data)) ]
# choose which function to use
try:
if fun==1:
# function to determine probability as a function of discharge
def func(x,a,b,c,d):
return a*np.log(x*b+c)+d
# matches func to data
par, cov = op.curve_fit(func, data, prob)
# checks fit of curve match
slope, interecept, r_value, p_value, stderr = \
sp.linregress(prob, [par[0]*np.log(data[i]*par[1]+par[2])+par[3] for i in range(len(data))])
else:
# function to determine discharge as a function of probability
def func(x,a,b,c,d):
return a*np.exp(x*b+c)+d
# matches func to data
par, cov = op.curve_fit(func, prob, data)
# checks fit of curve match
slope, interecept, r_value, p_value, stderr = \
sp.linregress(data,[par[0]*np.exp(prob[i]*par[1]+par[2])+par[3] for i in range(len(prob))])
# return parameters (a,b,c,d), r-squared of model fit, and standard error of model fit
return par[0], par[1], par[2], par[3], round(r_value**2,2), round(stderr,5)
except (RuntimeError,TypeError):
return np.nan, np.nan, np.nan, np.nan, np.nan, np.nan
The list sites
designates the USGS surface sites you are interested in analyzing.
sites = ['09309600','09309800',
'09310000','09310500','09310700','09311500','09312700','09313000',
'09317000','09317919','09317920','09317997','09318000','09318500','09319000',
'09323000','09324000','09324200','09324500','09325000','09325100','09326500','09327500','09327550',
'09330500','09331900','09331950','09332100',
'10148500',
'10205030','10206000','10206001','10208500',
'10210000','10211000','10215700','10215900','10216210','10216400','10217000']
sitelab = ['Q'+site for site in sites]
print(len(sites))
40
Now we can run through all of the sites and import thier data!
#d = {site: importusgssite(site) for site in sites}
d = {}
for site in sites:
#print site #use this line to error check site numbers
d[site] = importusgssite(site);
Lets merge all of the site data together in one dataframe, so that all of the daily data are aligned and together in one place. We will call that one place USGS_Site_Data
! While we are at it, we will add month, year, and day of year columns to make summarizing the data easier.
f = {}
f[sites[0]] = d[sites[0]]
for i in range(len(sites)-1):
f[sites[i+1]] = pd.merge(d[sites[i+1]], f[sites[i]], left_index=True, right_index=True, how='outer')
USGS_Site_Data = f[sites[-1]]
USGS_Site_Data['mon']=USGS_Site_Data.index.month
USGS_Site_Data['yr']=USGS_Site_Data.index.year
USGS_Site_Data['dy']=USGS_Site_Data.index.dayofyear
USGS_Site_Data.to_csv(dataroot+'USGS_Site_Data.csv')
Now we should import the information on each site.
z = {}
for site in sites:
z[site] = getusgssiteinfo(site);
Like we did with the data, we will combine all of the site information together in one table, so that it is easy to read.
USGS_Site_Info = pd.concat(z)
USGS_Site_Info = USGS_Site_Info.reset_index()
USGS_Site_Info = USGS_Site_Info.drop(['level_1'],axis=1)
USGS_Site_Info = USGS_Site_Info.set_index(['level_0'])
USGS_Site_Info = USGS_Site_Info.drop(['code'],axis=1)
Lets extract the measurement start and end dates from the station data, as well as some basic summary statistics. We can tack this information onto the site information table we created above.
q = {}
m = {}
for site in sites:
q[site] = USGS_Site_Data[site].first_valid_index()
m[site] = USGS_Site_Data[site].last_valid_index()
USGS_start_date = pd.DataFrame(data=q, index=[0])
USGS_finish_date = pd.DataFrame(data=m, index=[0])
USGS_start_date = USGS_start_date.transpose()
USGS_start_date['start_date'] = USGS_start_date[0]
USGS_start_date = USGS_start_date.drop([0],axis=1)
USGS_finish_date = USGS_finish_date.transpose()
USGS_finish_date['fin_date'] = USGS_finish_date[0]
USGS_finish_date = USGS_finish_date.drop([0],axis=1)
USGS_start_fin = pd.merge(USGS_finish_date,USGS_start_date, left_index=True, right_index=True, how='outer' )
USGS_Site_Info = pd.merge(USGS_start_fin,USGS_Site_Info, left_index=True, right_index=True, how='outer' )
USGS_sum_stats = USGS_Site_Data[sites].describe()
USGS_sum_stats = USGS_sum_stats.transpose()
USGS_Site_Info = pd.merge(USGS_sum_stats,USGS_Site_Info, left_index=True, right_index=True, how='outer' )
The next line allows us to save the site information table to a clipboard, which can be pasted into a spreadsheet, for those who like visualizing information in Excel and similar products.
USGS_Site_Info.to_clipboard()
We can use the site information to generate a Gantt chart, showing the duration that each station measured.
# designate variables
x2 = USGS_Site_Info['fin_date'].astype(datetime).values
x1 = USGS_Site_Info['start_date'].astype(datetime).values
y = USGS_Site_Info.index.astype(np.int)
names = USGS_Site_Info['name'].values
labs, tickloc, col = [], [], []
# create color iterator for multi-color lines in gantt chart
color=iter(cm.Dark2(np.linspace(0,1,len(y))))
plt.figure(figsize=[8,10])
fig, ax = plt.subplots()
# generate a line and line properties for each station
for i in range(len(y)):
c=next(color)
plt.hlines(i+1, x1[i], x2[i], label=y[i], color=c, linewidth=2)
labs.append(names[i].title()+" ("+str(y[i])+")")
tickloc.append(i+1)
col.append(c)
plt.ylim(0,len(y)+1)
plt.yticks(tickloc, labs)
# create custom x labels
plt.xticks(np.arange(datetime(np.min(x1).year,1,1),np.max(x2)+timedelta(days=365.25),timedelta(days=365.25*5)),rotation=45)
plt.xlim(datetime(np.min(x1).year,1,1),np.max(x2)+timedelta(days=365.25))
plt.xlabel('Date')
plt.ylabel('USGS Official Station Name and Station Id')
plt.grid()
plt.title('USGS Station Measurement Duration')
# color y labels to match lines
gytl = plt.gca().get_yticklabels()
for i in range(len(gytl)):
gytl[i].set_color(col[i])
plt.tight_layout()
plt.savefig(rootname+'gantt.pdf');
<matplotlib.figure.Figure at 0xae3c97ec>
The following script will generate a series of box and whisker plots and save them in a pdf. It makes a box plot for each station, breaking the data into monthly intervals. Make sure to change the directory name in the script so it ends up in a recognizable place on your computer.
# create dictionary of integers and their month equivalent
months = {'1':'Jan.', '2':'Feb.', '3':'Mar.', '4':'Apr.', '5':'May', '6':'Jun.',
'7':'Jul.', '8':'Aug.', '9':'Sep.', '10':'Oct.', '11':'Nov.', '12':'Dec.', 'Total':'Total'}
# create empty dictionary to hold pandas Dataframes
j = {}
with PdfPages(rootname + 'station_boxplots.pdf') as pdfs:
ymax = 10000
ymin = 0.01
for i in range(len(sites)):
# make a dataframe containing summary statistics and store it in the j dictionary
j[sites[i]] = USGS_Site_Data.groupby('mon')[sites[i]].agg({'min':np.min, 'mean':np.mean,
'median':np.median, 'max':np.max, 'std':np.std,
'cnt':(lambda x: np.count_nonzero(~np.isnan(x)))}).reset_index()
# make a list of the custom lables you will use for your boxplot; this one will show the number of samples used to make the plot
labs = [months[(str(j[sites[i]]['mon'][b]))] + " (n=" + str(int(j[sites[i]]['cnt'][b])) + ")" for b in range(len(j[sites[i]]))]
# designate the location of each custom label
tickloc = [b+1 for b in range(len(j[sites[i]]['mon']))]
plt.figure()
USGS_Site_Data.boxplot(column=sites[i],by='mon', rot=70)
strtdt = str(USGS_Site_Info.ix[sites[i],'start_date'])[0:10]
findt = str(USGS_Site_Info.ix[sites[i],'fin_date'])[0:10]
siteName = USGS_Site_Info.ix[sites[i],'name'].title()
plt.title( siteName + ' (' + sites[i] + ') ' + strtdt + ' to ' + findt )
plt.suptitle('')
plt.yscale('log')
plt.ylabel('Discharge (cfs)')
plt.ylim((ymin,ymax))
plt.xlabel('Month')
# here is where your lists for the custom label come into play
plt.xticks(tickloc, labs)
pdfs.savefig()
plt.close()
# Save metadata of the pdf so you can find it later
d = pdfs.infodict()
d['Title'] = 'Monthly Station USGS Boxplots UMSS'
d['Author'] = u'Paul C. Inkenbrandt\xe4nen'
d['Subject'] = 'Boxplots of several USGS Surface Stations'
d['Keywords'] = 'USGS Surface NWIS Boxplot'
d['CreationDate'] = datetime.today()
d['ModDate'] = datetime.today()
/home/magik/miniconda/envs/earthpy/lib/python2.7/site-packages/matplotlib/pyplot.py:424: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). max_open_warning, RuntimeWarning)
<matplotlib.figure.Figure at 0xae8ca12c>
<matplotlib.figure.Figure at 0xae22188c>
<matplotlib.figure.Figure at 0xae65d72c>
<matplotlib.figure.Figure at 0xae40470c>
<matplotlib.figure.Figure at 0xae810fcc>
<matplotlib.figure.Figure at 0xaddc0f0c>
<matplotlib.figure.Figure at 0xae6c54ac>
<matplotlib.figure.Figure at 0xa58b02cc>
<matplotlib.figure.Figure at 0xa56f9cec>
<matplotlib.figure.Figure at 0xa52fd8cc>
<matplotlib.figure.Figure at 0xa531fd8c>
<matplotlib.figure.Figure at 0xa50f8acc>
<matplotlib.figure.Figure at 0xa4fa618c>
<matplotlib.figure.Figure at 0xa4de4e6c>
<matplotlib.figure.Figure at 0xa4a3bdcc>
<matplotlib.figure.Figure at 0xa4ab496c>
<matplotlib.figure.Figure at 0xa4868eac>
<matplotlib.figure.Figure at 0xa4692fec>
<matplotlib.figure.Figure at 0xa449458c>
<matplotlib.figure.Figure at 0xa45a77ac>
<matplotlib.figure.Figure at 0xa40c672c>
<matplotlib.figure.Figure at 0xa41b8b6c>
<matplotlib.figure.Figure at 0xa3c39fcc>
<matplotlib.figure.Figure at 0xa3b40aec>
<matplotlib.figure.Figure at 0xa39e652c>
<matplotlib.figure.Figure at 0xa3b4076c>
<matplotlib.figure.Figure at 0xa363ecac>
<matplotlib.figure.Figure at 0xa345aeec>
<matplotlib.figure.Figure at 0xa327ddac>
<matplotlib.figure.Figure at 0xa3088aac>
<matplotlib.figure.Figure at 0xa3453bec>
<matplotlib.figure.Figure at 0xa2da3bcc>
<matplotlib.figure.Figure at 0xa2b58f4c>
<matplotlib.figure.Figure at 0xa2980b2c>
<matplotlib.figure.Figure at 0xa26c984c>
<matplotlib.figure.Figure at 0xa26254cc>
<matplotlib.figure.Figure at 0xa2427a4c>
<matplotlib.figure.Figure at 0xa225296c>
<matplotlib.figure.Figure at 0xa208032c>
<matplotlib.figure.Figure at 0xa1ee280c>
Let's plot a few of the boxplots so you can see what they look like.
for i in range(1,3):
j[sites[i]] = USGS_Site_Data.groupby('mon')[sites[i]].agg([np.min, np.mean, np.median, np.max, np.std, np.size]).reset_index()
# make a list of the custom lables you will use for your boxplot; this one will show the number of samples used to make the plot
labs = [months[(str(j[sites[i]]['mon'][b]))] + " (n=" + str(int(j[sites[i]]['size'][b])) + ")" for b in range(len(j[sites[i]]))]
# designate the location of each custom label
tickloc = [b+1 for b in range(len(j[sites[i]]['mon']))]
plt.figure()
USGS_Site_Data.boxplot(column=sites[i],by='mon', rot=70)
plt.title(USGS_Site_Info.ix[sites[i],'name'].title() + ' (' + sites[i] + ')')
plt.suptitle('')
plt.yscale('log')
plt.ylabel('Discharge (cfs)')
plt.ylim((ymin,ymax))
plt.xlabel('Month')
# here is where your lists for the custom label come into play
plt.xticks(tickloc, labs)
plt.show()
plt.close()
<matplotlib.figure.Figure at 0xae74c60c>
<matplotlib.figure.Figure at 0xae8f762c>
This script will generate boxplots showing all of the station data.
# This script summarizes discharge for all sites and limits the number of box plots on one graph to the n variable
j=0
with PdfPages(rootname + 'sum_boxplots.pdf') as pdf:
while j < len(sites):
ymax = 10000
ymin = 0.01
n=10
# if statement allows for uneven number of sites on last page
if j+n >= len(sites):
plt.figure()
USGS_Site_Data[sites[j:-1]].plot(kind='box')
plt.title('Sites '+sites[j]+' to '+sites[-1] )
plt.yscale('log')
plt.xlabel('USGS Site')
plt.xticks(rotation=45)
plt.ylabel('discharge (cfs)')
plt.ylim((ymin,ymax))
pdf.savefig()
plt.show()
plt.close()
j = j+n
else:
plt.figure()
USGS_Site_Data[sites[j:j+n]].plot(kind='box')
plt.title('Sites '+sites[j]+' to '+sites[j+n] )
plt.yscale('log')
plt.xlabel('USGS Site')
plt.xticks(rotation=45)
plt.ylabel('discharge (cfs)')
plt.ylim((ymin,ymax))
pdf.savefig()
plt.show()
plt.close()
j = j+n
# Save metadata of the pdf so you can find it later
d = pdf.infodict()
d['Title'] = 'Summary USGS Boxplots UMSS'
d['Author'] = u'Paul C. Inkenbrandt\xe4nen'
d['Subject'] = 'Boxplots of several USGS Surface Stations'
d['Keywords'] = 'USGS Surface NWIS Boxplot'
d['CreationDate'] = datetime.today()
d['ModDate'] = datetime.today()
<matplotlib.figure.Figure at 0xa5171f2c>
<matplotlib.figure.Figure at 0xa1184bec>
<matplotlib.figure.Figure at 0xa155ba8c>
<matplotlib.figure.Figure at 0xa1dec7cc>
We should also produce hydrographs of each station.
xmax = USGS_Site_Data.index.astype(datetime).values[-1]
xmin = USGS_Site_Data.index.astype(datetime).values[0]
pdfs = PdfPages(rootname + 'station_hydrographs.pdf')
ymax = 10000
ymin = 0.1
for i in range(len(sites)):
x = USGS_Site_Data.index.values
y = USGS_Site_Data[sites[i]].values
plt.figure()
plt.plot(x,y)
strtdt = str(USGS_Site_Info.ix[sites[i],'start_date'])[0:10]
findt = str(USGS_Site_Info.ix[sites[i],'fin_date'])[0:10]
siteName = USGS_Site_Info.ix[sites[i],'name'].title()
plt.title( siteName + ' (' + sites[i] + ') ' + strtdt + ' to ' + findt )
plt.suptitle('')
plt.yscale('log')
plt.ylabel('Discharge (cfs)')
plt.ylim((ymin,ymax))
plt.xlabel('Year')
plt.xticks(np.arange(datetime(1905,1,1),xmax+timedelta(days=365.25),timedelta(days=365.25*5)),rotation=45)
plt.xlim(xmin,xmax)
pdfs.savefig()
plt.close()
# Save metadata of the pdf so you can find it later
d = pdfs.infodict()
d['Title'] = 'Monthly Station USGS Hydrographs UMSS'
d['Author'] = u'Paul C. Inkenbrandt\xe4nen'
d['Subject'] = 'Hydrograph of several USGS Surface Stations'
d['Keywords'] = 'USGS Surface NWIS Hydrograph'
d['CreationDate'] = datetime.today()
d['ModDate'] = datetime.today()
pd.date_range(start=xmin, end=xmax, freq='5AS').year
array([1908, 1913, 1918, 1923, 1928, 1933, 1938, 1943, 1948, 1953, 1958, 1963, 1968, 1973, 1978, 1983, 1988, 1993, 1998, 2003, 2008, 2013])
xmax = USGS_Site_Data.index[-1]
xmin = USGS_Site_Data.index[0]
plt.figure()
ticks = pd.date_range(start=xmin, end=xmax, freq='4AS')
USGS_Site_Data[sites[0:3]].plot(subplots=True,sharex=True,figsize=(10,8),logy=True, rot=90)
plt.xlim(xmin,xmax)
labs = pd.date_range(start=xmin, end=xmax, freq='4AS').year
plt.xticks(ticks,labs)
plt.show()
plt.close()
<matplotlib.figure.Figure at 0xa18d434c>
def lumped_hydro(i1,i2):
pdfs = PdfPages(rootname + 'station_hydrographs_lumped.pdf')
plt.figure()
ticks = pd.date_range(start=xmin, end=xmax, freq='4AS')
USGS_Site_Data[sites[i1:i2]].plot(subplots=True, sharex=True, figsize=(10,24),logy=True, rot=90)
plt.xlim(xmin,xmax)
labs = pd.date_range(start=xmin, end=xmax, freq='4AS').year
plt.xticks(ticks,labs)
pdfs.savefig()
plt.close()
lumped_hydro(0,10)
lumped_hydro(10,20)
<matplotlib.figure.Figure at 0xa522e04c>
<matplotlib.figure.Figure at 0xa0e4a0ac>
lumped_hydro(20,30)
lumped_hydro(30,-1)
<matplotlib.figure.Figure at 0xa50b8dac>
<matplotlib.figure.Figure at 0x9ade90ac>
This script will iteratively produce Flow Duration Curves for each of the stations. A flow duration curve is a percent point function (ppf), displaying discharge as a function of probability of that discharge occuring. The ppf is the inverse of the better known cumulative distribution function (cdf). See this USGS publication for more information.
with PdfPages(rootname+'station_fdc.pdf') as pdf:
ymax = 10000
ymin = 0.01
for i in range(len(sites)):
plt.figure()
fdc_simple(USGS_Site_Data,sites[i],1900,2015)
fdc_simple(USGS_Site_Data,sites[i],1900,1970)
fdc_simple(USGS_Site_Data,sites[i],1970,2015)
plt.ylim(0.01,10000)
plt.xlim(-.05,1.05)
plt.grid(which = 'both')
plt.legend()
plt.xlabel('probability that discharge was exceeded or equaled')
plt.title('Flow duration curve for ' + str(USGS_Site_Info['name'][i]).title() + ' ('+ sites[i] +')'+'\n'+
'Record: ' + str(USGS_Site_Info['start_date'][i])[0:10] + ' to ' + str(USGS_Site_Info['fin_date'][i])[0:10])
plt.yscale('log')
plt.ylabel('discharge (cfs)')
plt.xticks(np.arange(0,1.05,0.05))
pdf.savefig()
plt.close()
# Save metadata of the pdf so you can find it later
d = pdf.infodict()
d['Title'] = 'Flow Duration Curves USGS'
d['Author'] = u'Paul C. Inkenbrandt\xe4nen'
d['Subject'] = 'Flow Duration Curves of several USGS Surface Stations'
d['Keywords'] = 'USGS Surface NWIS FDC Flow Duration'
d['CreationDate'] = datetime.today()
d['ModDate'] = datetime.today()
For brevity, here is an example plot from the output of the script above.
plt.figure()
fdc_simple(USGS_Site_Data,sites[38],1900,2015)
fdc_simple(USGS_Site_Data,sites[38],1900,1970)
fdc_simple(USGS_Site_Data,sites[38],1970,2015)
plt.yscale('log')
plt.grid(which = 'both')
plt.xlabel('% of time that indicated discharge was exceeded or equaled')
plt.ylabel('discharge (cfs)')
plt.xticks(np.arange(0.0,1.05,0.05))
plt.title('Flow duration curve for ' + str(USGS_Site_Info['name'][38]) + ' ('+ sites[i] +')'+'\n'+
'Record: ' + str(USGS_Site_Info['start_date'][i])[0:10] + ' to ' + str(USGS_Site_Info['fin_date'][i])[0:10])
plt.legend()
<matplotlib.legend.Legend at 0x946451ac>
Below are some sad attempts to model the flow duration curves:
site = sites[28]
df = USGS_Site_Data[[site]]
begyear = 1900
endyear = 2015
data = df[(df.index.to_datetime() > pd.datetime(begyear,1,1))&(df.index.to_datetime() < pd.datetime(endyear,1,1))]
data = data.dropna()
data['doy']=data.index.dayofyear
dailyavg = data[site].groupby(data['doy']).mean()
data = np.sort(dailyavg)
mean = np.mean(data)
std = np.std(data)
f = [(data[i]) for i in range(len(data))]
#f = [(data[i])/mean for i in range(len(data))]
#f = [(data[i]-std)/mean for i in range(len(data))]
ranks = sp.rankdata(f, method='average')
ranks = ranks[::-1]
prob = [(ranks[i]/(len(f)+1)) for i in range(len(f)) ]
x = prob
y = f
plt.figure()
plt.scatter(y,x,label=site,color='blue')
#plt.yscale('log')
#plt.xlim(-.05,1.05)
plt.grid(which = 'both')
plt.ylabel('probability that discharge was exceeded or equaled')
plt.xlabel('normalized discharge')
#plt.xticks(np.arange(0,1.00,0.05))
plt.title('Flow duration curve for ' + site + ' averaged from ' + str(begyear) + ' to ' + str(endyear))
def func(x,a,b,c,d):
return a*np.log(x*b+c)+d
par, cov = op.curve_fit(func,y,x,p0=[-0.16528617, 1.54535185, -24.70440088, 0.9])
plt.plot(y, [par[0]*np.log(y[i]*par[1]+par[2])+par[3] for i in range(len(y))], color='red')
print 'curve fit', sp.linregress(x,[par[0]*np.log(y[i]*par[1]+par[2])+par[3] for i in range(len(y))])
plt.figure()
plt.scatter(x,y,label=site,color='blue')
#plt.yscale('log')
#plt.xlim(-.05,1.05)
plt.grid(which = 'both')
plt.xlabel('probability that discharge was exceeded or equaled')
plt.ylabel('normalized discharge')
#plt.xticks(np.arange(0,1.00,0.05))
plt.title('Flow duration curve for ' + site + ' averaged from ' + str(begyear) + ' to ' + str(endyear))
def func2(x,a,b,c,d):
return a*np.exp(x*b+c)+d
parm, covm = op.curve_fit(func2,x,y,p0=[-0.16528617, 0.02, 0.70440088, 0.9])
plt.plot(x, [parm[0]*np.exp(x[i]*parm[1]+parm[2])+parm[3] for i in range(len(x))], color='red')
print 'curve fit', sp.linregress(y,[parm[0]*np.exp(x[i]*parm[1]+parm[2])+parm[3] for i in range(len(x))])
curve fit LinregressResult(slope=0.96211619415216332, intercept=0.01894190307030913, rvalue=0.9808752178330773, pvalue=8.090748669377643e-261, stderr=0.010006676932384276) curve fit LinregressResult(slope=0.99170704524510944, intercept=0.73530300510799407, rvalue=0.99584544907142547, pvalue=0.0, stderr=0.004752983003883238)
def dic2df(dic,head):
df = pd.DataFrame(data=dic)
df = df.transpose()
df.columns = [str(head)+'_var1',str(head)+'_var2',str(head)+'_var3',str(head)+'_var4',str(head)+'_r2',str(head)+'_err']
return df
q = {}; m = {}; n = {}; k = {}; j = {}; p = {}
for site in sites:
q[site] = fdcmatch(USGS_Site_Data,site,1900,2015,1,1)
m[site] = fdcmatch(USGS_Site_Data,site,1900,1970,1,1)
n[site] = fdcmatch(USGS_Site_Data,site,1970,2015,1,1)
k[site] = fdcmatch(USGS_Site_Data,site,1900,2015,1,0)
j[site] = fdcmatch(USGS_Site_Data,site,1900,1970,1,0)
p[site] = fdcmatch(USGS_Site_Data,site,1970,2015,1,0)
dics = [q,m,n,k,j,p]
heads = ['all','to70','fm70','allin','to70in','fm70in']
USGS_q = dic2df(q,'all')
USGS_m = dic2df(m,'to70')
USGS_parms = pd.merge(USGS_q,USGS_m, left_index=True, right_index=True, how='outer' )
for i in range(2,6,1):
x = dic2df(dics[i],heads[i])
USGS_parms = pd.merge(USGS_parms,x, left_index=True, right_index=True, how='outer' )
USGS_parms.to_clipboard()
#USGS_finish_date = USGS_finish_date.drop([0],axis=1)
#USGS_start_fin = pd.merge(USGS_finish_date,USGS_start_date, left_index=True, right_index=True, how='outer' )
The equations from the modeled fdc plots can be used to estimate the discharge for a site based on data from a similar site. However, the results are mediocre.
n1 = 12
n2 = 11
USGS_Site_Data[sites[n1]+'p'] = [USGS_parms['all_var1'][n1]*np.log(USGS_Site_Data[sites[n1]][i]*USGS_parms['all_var2'][n1]+\
USGS_parms['all_var3'][n1])+USGS_parms['all_var4'][n1] for i in \
range(len(USGS_Site_Data[sites[n1]]))]
USGS_Data = USGS_Site_Data[USGS_Site_Data[sites[n1]+'p']>0]
USGS_Data[sites[n2]+'d'] = [USGS_parms['allin_var1'][n2]*np.exp(USGS_Data[sites[n1]+'p'][i]*USGS_parms['allin_var2'][n2]+\
USGS_parms['allin_var3'][n2])+USGS_parms['allin_var4'][n2] for i in \
range(len(USGS_Data[sites[n1]+'p']))]
y1 = USGS_Data[sites[n2]+'d'].values
y2 = USGS_Site_Data[sites[n2]].values
x1 = USGS_Data.index.to_datetime()
x2 = USGS_Site_Data.index.to_datetime()
y3 = USGS_Data[sites[n1]]
plt.figure()
plt.plot(x1,y1, label="modeled discharge")
plt.plot(x2,y2, label="actual discharge")
plt.plot(x1,y3, label="reference discharge")
plt.yscale('log')
plt.legend()
#plt.xlim(USGS_Site_Info['start_date'][n2],USGS_Site_Info['fin_date'][n2])
plt.xlim('1/1/1970','1/1/1974')
#plt.ylim(1,100)
/home/magik/miniconda/envs/earthpy/lib/python2.7/site-packages/ipykernel/__main__.py:8: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
(719163.0, 720624.0)
Run the following script if you want to see a map of your stations. This assumes that you have the Basemap package installed.
from mpl_toolkits.basemap import Basemap
X = USGS_Site_Info['longitude'].astype(float).values.tolist()
Y = USGS_Site_Info['latitude'].astype(float).values.tolist()
n = 0.05
m = Basemap(llcrnrlon=min(X)+n,llcrnrlat=min(Y)+n,urcrnrlon=max(X)+n,urcrnrlat=max(Y)+n,
resolution='h',projection='cyl',lon_0=np.mean(X),lat_0=np.mean(Y))
m.drawrivers(color='blue',linewidth=0.5)
m.drawcounties(color='red',linewidth=0.5)
m.arcgisimage()
#m.etopo(scale=0.5)
lons = X
lats = Y
x,y = m(lons,lats)
m.plot(x,y,'ro', markersize=8)
#m.drawmapscale(lon=-114, lat=43.5, length=100, lon0=-114, lat0=39, barstyle='simple', units='km')
[<matplotlib.lines.Line2D at 0x944fa14c>]
col = USGS_Site_Data.columns
import rpy2.robjects as robj
import rpy2.rlike.container as rlc
def pandas_data_frame_to_rpy2_data_frame(pDataframe):
orderedDict = rlc.OrdDict()
for columnName in pDataframe:
columnValues = pDataframe[columnName].values
filteredValues = [value if pd.notnull(value) else robj.NA_Real
for value in columnValues]
try:
orderedDict[columnName] = robj.FloatVector(filteredValues)
except ValueError:
orderedDict[columnName] = robj.StrVector(filteredValues)
rDataFrame = robj.DataFrame(orderedDict)
rDataFrame.rownames = robj.StrVector(pDataframe.index)
return rDataFrame
USD = pandas_data_frame_to_rpy2_data_frame(USGS_Site_Data)