Unlike MATLAB, Python does not have built-in support for multidimensional NetCDF datasets. Instead, we have to import a few specific packages into our script and take a few extra steps to manipulate the data. The notebook below demonstrates the steps needed to import netCDF data, link the variables representing its different dimensions, and run simple analyses.
In the example here, we'll use monthly precipitation predictions downscaled to a 1/16-degree resolution spanning the years 1990 to 2005. Thus, our data spans three coordinate dimensions: latitude, longitude, and time, with precipitation as the variable recorded across them. (The dataset actually lists a fourth dimension, crs, but it contains only a single value describing the geographic coordinate reference system, so we'll ignore it.)
Read the .nc data file into our Python script as a netCDF4 dataset object

The Python netCDF4 package allows us to read our .nc file into a netCDF4 dataset object that we can manipulate programmatically. Documentation on the netCDF4 package is here: http://unidata.github.io/netcdf4-python/, and it describes the various properties and methods of the dataset object.
#Import package to read netCDF file
import netCDF4
#Read the file into a netCDF dataset object
fileName = 'macav2livneh_pr_bcc-csm1-1_r1i1p1_historical_1990_2005_CONUS_monthly.nc'
dataset = netCDF4.Dataset(fileName)
#Confirm that the `dataset` variable points to a netCDF4 dataset object
type(dataset)
#Show some documentation on the `dataset` object
?dataset
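#Show the format of the netCDF file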
dataset.file_format
With our dataset object created, we can explore a bit about the data within it and its structure. Specifically, we'll examine what dimensions, variables, and attributes are contained within the dataset.
The dimensions in our dataset are accessed as a Python dictionary, which is a collection of values referenced by specific keys. Here, each dimension is listed as a key, and to get information about that dimension, we "look up" its definition in the dictionary:
#Show the full dimensions dictionary
dataset.dimensions
#Show just the keys, or dimensions, in the dataset
dataset.dimensions.keys()
#Show the values associated with the `time` dimension
dataset.dimensions['time']
► Now you try it: Show the values associated with the "lat" dimension. How many items are in this dimension? How many in the 'lon' and 'crs' dimensions?
dataset.dimensions['lat']
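Since dimensions behave like sized containers, one quick way to answer the "how many items" question is Python's len() function (each dimension object also exposes a size property):

#Count the items in each dimension
print(len(dataset.dimensions['lat']))  #444
print(len(dataset.dimensions['lon']))  #922
print(len(dataset.dimensions['crs']))  #1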
Summing up, we see our dataset has 4 dimensions:
- lat, with 444 values
- lon, with 922 values
- time, with 192 values, and
- crs, with one value

Like dimensions, the dataset's variables are accessed as dictionary objects. First let's list all the variables contained in our dataset by exposing the "keys" included in the dictionary. Our dataset has 5 variables:
#Show the 'key' or names of the variables
dataset.variables.keys()
We see many of the same names as the dimensions, but also a precipitation variable:
- lat
- lon
- time
- precipitation
- crs
#Show attributes of the precipitation variable
dataset.variables['precipitation']
NetCDF datasets have both global attributes and attributes associated with each variable. Here's how we explore what attributes are included in each and how to access information on each attribute.
Listing the dataset's global attributes is done a bit differently than the process for dimensions and variables. Here, the ncattrs() function returns a list of the dataset's global attributes. Then we can display more information on any of these attributes using the getattr() function.
#List the dataset's attributes
dataset.ncattrs()
#Display information stored in the `summary` attribute
dataset.summary
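The getattr() function mentioned above lets us fetch any attribute by name. A quick sketch that loops through all the global attributes and prints each one (the output is lengthy):

#Print the name and value of every global attribute
for attrName in dataset.ncattrs():
    print(attrName, ':', getattr(dataset, attrName))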
Now we'll focus on the attributes of a single variable in our dataset. We'll choose the precipitation variable.
#List the precipitation variable's attributes
dataset.variables['precipitation'].ncattrs()
#Reveal the information associated with the "cell_methods" attribute
dataset.variables['precipitation'].cell_methods
There's a lot going on in the above statements. They work, but they can be hard to read for the newbie. One of the advantages of Python code is its readability, but this only works if coders write the code to be readable. So, let's rewrite the above statement so that it's more readable. It appears less "efficient", but sometimes readability is better than terseness.
#Pull the precipitation variable into a Python variable called "precip"
precip = dataset.variables['precipitation']
#Now list its attributes
precip.ncattrs()
With the precip Python variable established, we can use it to display attribute contents directly:
precip._FillValue
With some familiarity of what's in the dataset, we can now start manipulating and visualizing the data...
Break the dataset into netCDF variable objects

Let's now break our dataset into its component variables so that we can work with each more easily. Here, we assign Python variables to the four dataset variables: time, latitude, longitude, and precipitation. (We'll ignore the crs variable, as it contains only one value, a metadata entry listing the coordinate system used.) Then, we'll quickly examine some properties of these variable objects.
#Read the variables in NETCDF file
ncTime = dataset.variables['time']
ncLon = dataset.variables['lon']
ncLat = dataset.variables['lat']
ncPrecip = dataset.variables['precipitation']
#Confirm that these objects are netCDF4 variables
type(ncPrecip)
#Display what we can do with a variable object
?ncPrecip
#What if we just display everything about the variable?
ncPrecip
► Now you try it: Recalling how we revealed the attributes of the precip variable above, what are the units of the time variable? The lat and lon variables?
ncTime.units, ncLat.units, ncLon.units
Variable shape: These four variables are all arrays, i.e., a series of values arranged across one or more dimensions. We can examine the size and dimensions of each variable via its shape property:
ncTime.shape
ncLat.shape
ncLon.shape
ncPrecip.shape
Note that the sizes of the three dimensions of the precip variable correspond to the sizes of the time, lat, and lon variables, respectively. This gives a more tangible sense of how these data are structured and how we can manipulate them.
More specifically, we see that the data in the precip variable are 3-dimensional. The first dimension is time, and the other two dimensions are x-y coordinates in space. So we can envision our data as a stack of precipitation maps, with each layer in the stack holding the precipitation values for a single time snapshot. This helps us in subsetting our data...
What may not be clear here is: what then are the time, lat, and lon variables, if everything is held in the one precip variable? This has everything to do with the fact that the actual data (i.e., the amount of precipitation) are referenced by their index position in the array, as we'll see in a moment. It'd be confusing to explain this in more detail right now; instead, let's move a bit further with the data and then come back to this.
The precip variable has 3 axes: time, lat, and lon, with sizes of 192, 444, and 922, respectively. And the value at a given time/geographic coordinate is the predicted rainfall at that specific time and location. We can extract a specific precipitation value by specifying a time, lat, and lon value:
print(ncPrecip[0,250,300])
The interpolated precipitation at that time/location is 37 mm.
But what time and location did we actually specify?? What is time=0?? Likewise, a latitude of "250" and a longitude of "300" are not really geographic coordinates.
These values are actually pointers to positions in the precipitation array; that is, the 1st time slice, the 251st latitude row, and the 301st longitude column, if you think of our precipitation array as a stack of lat/lon tables, with a layer for each time. (Also note that Python indices are zero-based, meaning values start at zero, not one.)
So how, then, do we extract data for a known time and/or location? Well, those data are contained in the other arrays, with positions corresponding to the axes in the precipitation array:
print(ncTime[0])
print(ncLat[250])
print(ncLon[300])
So, we see that the precipitation value extracted above is associated with the location (40.71824°N, 254.15625°E; the longitudes here are stored in degrees east, so this is equivalent to 105.84°W) and the time "32864"??!?
That time value is actually the number of days since 1900-01-01 (found in the dataset's metadata):
ncTime.units
And we have tools to convert this into a more readable format:
#Create a new array, converting days since 1900 to datetime values
ncTime2 = netCDF4.num2date(ncTime[:],ncTime.units)
ncTime2[0]
And we see the first time slice is Dec 12, 1989.
So now what we have to do is somehow cross-reference the values in the ncTime2, lat, and lon arrays with the axes in the precipitation array. In doing so, we can more easily extract the values we want and run various analyses on the data.
The process involves converting each of our netCDF variables into numeric arrays, which then allows us to use Python's NumPy package to do exactly what we need to do...
#Import the numpy package, calling it "np" to save typing
import numpy as np
#Convert the NetCDF variables to numpy arrays
arrTime = ncTime2[:].astype(np.datetime64)
arrLat = ncLat[:]
arrLon = ncLon[:]
arrPrecip = ncPrecip[:]
#Show the type of object created
type(arrPrecip)
The netCDF variables were converted to NumPy masked arrays. This means that, in addition to the n-dimensional numeric array, we get an associated n-dimensional Boolean array indicating whether each value should be used in calculations or not.
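We can quantify that mask directly. As a small sketch, NumPy's ma module can count how many cells are masked (e.g., cells falling outside the downscaled land area):

#Count the masked (excluded) cells in the precipitation array
np.ma.count_masked(arrPrecip)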
#Show the shape of the precip array
arrPrecip.shape
Numpy arrays work much like the netCDF variables in terms of querying values using their position...
arrPrecip[0,250,300]
arrTime[0]
arrLat[250]
arrLon[300]
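These coordinate arrays also let us go the other direction: given a real latitude and longitude, find the nearest array positions. A minimal sketch, where the target coordinates are arbitrary example values:

#Find the array positions closest to a chosen location (example coordinates)
targetLat, targetLon = 40.72, 254.16
iLat = np.abs(arrLat - targetLat).argmin()
iLon = np.abs(arrLon - targetLon).argmin()
print(iLat, iLon)  #Should land at (or very near) positions 250 and 300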
With our variables stored as NumPy arrays, we can use NumPy methods to slice our data across time and/or space. We'll begin simply by just [blindly] using raw position values (i.e., not actual times or lat/lon coordinates) to make first a time-series plot and then a simple map for a single time slice.
To extract precipitation across all times for one location (here we'll choose the lat at position 250 and the lon at position 300), we use a colon to say "grab all values" along the first axis (time), and then give specific values for the lat and lon axes.
#Grab all the precip data at location lat=250, lon=300
arrPrecipLocX = arrPrecip[:,250,300]
#What is the shape of the result: should be 192
arrPrecipLocX.shape
#Show the first 10 values in the resulting array
arrPrecipLocX[0:10].data
We use another library for plotting. Python has a few, but as you are likely familiar with MATLAB, we'll use matplotlib, which was designed after MATLAB's plotting procedures.
#Import the pyplot subpackage and enable inline plotting
from matplotlib import pyplot as plt
%matplotlib inline
#Plot as a time series
fig = plt.figure(figsize=(15,3))
plt.plot(arrPrecipLocX);
#Plot time series for three points along a longitudinal transect
fig = plt.figure(figsize=(15,3))
plt.plot(arrPrecip[:,250,300])
plt.plot(arrPrecip[:,300,300])
plt.plot(arrPrecip[:,350,300])
plt.show();
#Pull precip data for all lat/lon dimensions for the most recent record
arrPrecipRecent = arrPrecip[-1,:,:]
arrPrecipRecent.shape
#Or, compute the mean across all time values (time = axis zero)
arrPrecipMean = arrPrecip.mean(axis=0)
arrPrecipMean.shape
#Or, compute the mean across all time values (time = axis zero) for a spatial subset
arrPrecipMeanSubset = arrPrecip[:,250:300,300:400].mean(axis=0)
arrPrecipMeanSubset.shape
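We can likewise collapse the two spatial axes instead of time, yielding one domain-averaged value per time step. Averaging over lon (axis 2) and then lat (axis 1) keeps things simple with masked arrays:

#Compute the spatial mean for each time step: average over lon, then lat
arrPrecipDomainMean = arrPrecip.mean(axis=2).mean(axis=1)
arrPrecipDomainMean.shape  #Should be (192,)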
Matplotlib's imshow function allows us to plot images, i.e., data stored with X/Y coordinates, as we have here.
fig = plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
#origin='lower' puts the first row (the southernmost latitude) at the bottom of the plot
plt.imshow(arrPrecipMean, origin='lower', cmap="YlGnBu")
plt.subplot(1,2,2)
plt.imshow(arrPrecipMeanSubset, origin='lower', cmap="YlGnBu")
plt.colorbar();
Numpy works great for computations on n-dimensional arrays, but the axes are still a bit befuddling as they can only be accessed via position, not actual time stamps or actual lat-long coordinates. You can see that easily in the map we just created: the coordinates are image coordinates, not geographic ones (e.g. degrees north and degrees east).
Numpy has a companion Python package called Pandas (a contraction of Panel Data) that overcomes this issue by allowing us to store and manipulate our data in data frames. The key difference [for us] between Numpy's ndarrays and Pandas' dataframes is the ability to name and label columns and rows so we can move beyond just integer indices.
The drawback, however, is that dataframes require data to be stored in two dimensions, i.e., in rows and columns. (There are ways around that, using something called a hierarchical multiindex, but more on that later...).
Anyway, let's convert our Numpy arrays to Pandas dataframes. Of course, before we can do that, we need to *reduce* our data to two or fewer dimensions. We've done that above, creating the time series dataset (by reducing the lat and lon dimensions to a single location), and by creating the mapped dataset (by reducing the time dimension to a single point in time or by averaging all time values into one value.)
#Import pandas
import pandas as pd
Let's review the whole process...
#Pull the precip variable from the netCDF dataset
precip = dataset.variables['precipitation']
#Convert the netCDF variable to a 3-d numpy array
arrPrecip = precip[:]
arrPrecip.shape
#Reduce the 3d array to a 1d array
arrTimeSeries = arrPrecip[:,250,300]
arrTimeSeries.shape
#Reduce the 3d array to a 2d array, collapsing the time dimension into average values
arrMeanPrecip = arrPrecip.mean(axis=0)
arrMeanPrecip.shape
#Convert the 1d time series array to a dataframe
dfTimeSeries=pd.DataFrame(arrTimeSeries)
dfTimeSeries.shape
#Display the first 5 records
dfTimeSeries.head()
#Convert the 2d mean values to a dataframe
dfMeanPrecip=pd.DataFrame(arrMeanPrecip)
dfMeanPrecip.shape
#Display the first 5 records of the mean precipitation dataframe
dfMeanPrecip.head()
*♦ Note that the rows and columns remain as simple integers, not actual times or lat/long coordinates*.
By adding row and column labels to our dataset, using the time, lat, and lon variable values as those row and column names, we can then manipulate our data using those values (as compared to just the integer positions in NumPy arrays).
Note: row labels are called indices in Pandas
#Convert the time, lat, and lon variables to Pandas series objects (series b/c 1 dimension)
ncTime = dataset.variables['time']
seriesTime = pd.Series(netCDF4.num2date(ncTime[:],ncTime.units))
seriesLat = pd.Series(dataset.variables['lat'][:])
seriesLon = pd.Series(dataset.variables['lon'][:])
print(seriesTime.shape,seriesLat.shape,seriesLon.shape)
Now we can assign those values to the dataframe's column names and row index...
#Set the index of the time series dataframe to the actual times
dfTimeSeries.index = seriesTime
dfTimeSeries.head(3)
#Add column names and an index to the dfMeanPrecip dataframe
dfMeanPrecip.columns = seriesLon
dfMeanPrecip.index = seriesLat
dfMeanPrecip.head()
#Subset the data for the years 2000 through 2003
df_subset = dfTimeSeries[(dfTimeSeries.index >= '2000') &
(dfTimeSeries.index < '2004')]
#Plot the subset
fig = plt.figure(figsize=(15,3))
plt.plot(df_subset);
♦ AND we get the actual dates on the axis!
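A datetime index also unlocks pandas' time-series tools. For instance, assuming the index converted to a proper DatetimeIndex, we can resample the monthly values into annual means:

#Resample monthly precipitation to annual means
#('A' = year-end frequency; newer pandas versions use 'YE')
dfAnnual = dfTimeSeries.resample('A').mean()
dfAnnual.head()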
Here we create Boolean masks to filter our dataset on the lat and lon axes. First, for reference, here's what the dataframe looks like with its index reset and then melted into long format:

dfX = dfMeanPrecip.reset_index()
dfX.head()

#Melt the dataframe into long format: one row per lat/lon combination
dfX = dfMeanPrecip.reset_index().melt(var_name="lon", value_name="precip", id_vars='index')
dfX.columns = ['lat','lon','precip']
dfX.head()

#Build Boolean masks from the lat (row index) and lon (column) values;
#the coordinate bounds below are arbitrary example values
latFilter = (dfMeanPrecip.index > 35) & (dfMeanPrecip.index < 40)
lonFilter = (dfMeanPrecip.columns > 250) & (dfMeanPrecip.columns < 260)
#Apply the masks to pull out the spatial subset
dfMeanPrecipSubset = dfMeanPrecip.loc[latFilter, lonFilter]
dfMeanPrecipSubset.shape
What if we collapse time and then plot the data across space? We already have a 2-d dataframe for that: dfMeanPrecip, which averages all time steps into one layer. Let's map it.

fig = plt.figure(figsize=(15,5))
#origin='lower' puts the first row (the southernmost latitude) at the bottom
plt.imshow(dfMeanPrecip, origin='lower', cmap="YlGnBu")
plt.colorbar();
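Circling back to the earlier complaint that imshow labels the axes in image coordinates: we can pass the geographic bounds via imshow's extent argument so the axes display actual longitudes and latitudes. A sketch, reusing the seriesLon and seriesLat values created above:

#Map the time-averaged precipitation with geographic coordinates on the axes
fig = plt.figure(figsize=(15,5))
plt.imshow(dfMeanPrecip, origin='lower', cmap="YlGnBu", aspect='auto',
           extent=[seriesLon.min(), seriesLon.max(), seriesLat.min(), seriesLat.max()])
plt.xlabel('Longitude (degrees east)')
plt.ylabel('Latitude (degrees north)')
plt.colorbar();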