Pandas
¶While once much more common, a number of agencies still provide access to data as raw text files served up on a web server. Some examples are USGS Water Use Data, and NWIS Stream Flow data. If you've ever used these data in your projects, you know it's fairly easy to manage: just download the link to the file (if a download link is provided), or justcopy and paste.
But if you have a lot of files to download, or if, like the stream gage data, you need to download it in real-time for your reseach, you can write a script to do this. Turns out it's note even that hard, thanks to the pandas
package.
This notebook demonstrates how pandas
can read a text file from the web just as easily as a text file located on our local machine. We just substitute the data's web address (i.e., its URL) where we'd put the filename in using the read_csv
function. It's that easy, but it only works if the data being served remotely is a raw text file...
Here we examine the process for grabbing 2010 water use data for North Carolina hosted on a USGS server...
First, for argument's sake, let's see how this data set would be nabbed manually:
North Carolina
from the dropdownState Data
--All Years--
County
, --All Counties--
Industrial
Submit
Tab-separated data
and Display in Browser
# Import the pandas module
import pandas as pd
# Set the url as a variable; this is the URL we generated above
theURL = 'https://waterdata.usgs.gov/nc/nwis/water_use?format=rdb&rdb_compression=value&wu_area=County&wu_year=ALL&wu_county=ALL&wu_category=IN&wu_county_nms=--ALL%2BCounties--&wu_category_nms=Industrial'
# Read in the data as a pandas data frame and display the first 5 rows
# -Note we need to specify that it's a tab delimited file and uses '#' to indicated commments
dfNWIS = pd.read_csv(theURL, delimiter='\t', comment='#')
dfNWIS.head()
So there's one catch: the second line of the dataframe is not the data we want, but rather a listing of the field type and width. Pandas offers two ways around this. First, we could just drop the first row...
#Drop the first row, and again show the first 5 rows of data...
dfNWIS.drop(0,axis='rows',inplace=True)
dfNWIS.head()
Another way around this is to invoke the skip rows option when reading the CSV. If you look at the file we are importing, you see that the first 49 rows are comments, then comes our header row, and then the field type row that we don't want. So we want to skip rows 1 thru 49 and also line 51. If we create a list of these row numbers, we can pass that to the skip_rows parameter...
#Create a list of numbers 0 thru 49, recalling Python lists are zero-indexed...
rowsToSkip = list(range(49))
#Append '51' to the list
rowsToSkip.append(50)
#Use the read_csv function as before, but skip the rows we want to skip
dfNWIS = pd.read_csv(theURL, delimiter='\t', skiprows=rowsToSkip)
dfNWIS.head()
Now that we have this as a pandas data frame, we can analyze it here, or we can simply save a copy to our local machine. For the latter, pandas' to_csv() function works quite easily.
dfNWIS.to_csv("Data/NCWaterData.csv",index=False)
Teaser: We'll look more and Pandas soon, but here we see we can make a quick plot of the data. Here we plot an annual time series of NC Industrial groundwater withdrawals:
%matplotlib inline
col = 'Industrial self-supplied groundwater withdrawals, fresh, in Mgal/d'
dfNWIS.groupby('year')[col].mean().plot();
► See if you can import NWIS discharge data located at this web address:
http://waterdata.usgs.gov/nwis/uv?cb_00060=on&cb_00065=on&format=rdb&period=1&begin_date=&end_date=&site_no=02085070
and save it to a file named NWISDischarge.csv
.