#!/usr/bin/env python # coding: utf-8 # # Python and NetCDF climate data # Unlike MatLab, Python does not have native support for multidimensional NetCDF datasets. Instead, we have to import a few specific packages into our script and do a few extra steps to manipulate the data. The notebook below demonstrates the steps needed to import, link variables representing the different dimensions, and run simple analyses on netCDF data. # # In the example here, we'll use daily precipitation predictions downscaled to a 1/16-degree resolution spanning the years 1990 to 2005. Thus, our data has 4 dimensions: `latitude`, `longitude`, `time`, and `precipitation`. *(Actually, the dataset has a 5th "dimension" as well, but this only includes data on the geographic coordinate reference system and has only one value, so we'll ignore it)*. # # Resources: # * http://www.ceda.ac.uk/static/media/uploads/ncas-reading-2015/10_read_netcdf_python.pdf # ## 1. Import the `.nc` data file into our Python script as netCDF4 `dataset` object # The Python *NetCDF4* package allows us to read in our `.nc` file into a NetCDF4 `dataset` object that we can manipulate programmatically. Documentation on the NetCDF4 package is here: http://unidata.github.io/netcdf4-python/, and it displays the various *properties* and *methods* of the `dataset` object. # In[ ]: #Import package to read netCDF file import netCDF4 # In[ ]: #Read the file into a netCDF dataset object fileName = 'macav2livneh_pr_bcc-csm1-1_r1i1p1_historical_1990_2005_CONUS_monthly.nc' dataset = netCDF4.Dataset(fileName) # In[ ]: #Confirm that the `dataset` variable points to a netCDF4 dataset object type(dataset) # In[ ]: #Show some documentation on the `dataset` object get_ipython().run_line_magic('pinfo', 'dataset') # In[ ]: dataset.file_format # ## 2. Explore out netCDF dataset # With our dataset object created, we can explore a bit about the data within it and its structure. Specifically, we'll examine what dimensions, variables, and attributes are contained within the dataset # ### 2.1 Dimensions # The dimensions in our dataset are accessed as a Python **dictionary**, which is a collection of values referenced by specific keys. Here, each dimension is listed as a *key*, and to get information about that dimension, we "look up" its definition in the dictionary: # In[ ]: #Show the full dimensions dictionary dataset.dimensions # In[ ]: #Show just the keys, or dimensions, in the dataset dataset.dimensions.keys() # In[ ]: #Show the values associated with the `time` dimension dataset.dimensions['time'] # ► *Now you try it: Show the values associated with the "lat" dimension. How many items are in this dimension? How many in the 'lon' and 'crs' dimensions?* # In[ ]: dataset.dimensions['lat'] # Summing up, we see our dataset has 4 dimensions: # * `lat` with 444 values # * `lon` with 922 values # * `time` with 192 values, and # * a coordinate reference system (`crs`) with one value # ### 2.2 Variables # Like dimensions, the dataset's variables are accessed as dictionary objects. First let's list all the variables contained in our datset by exposing the "keys" included in the dictionary. # Our dataset has 5 variables: # In[ ]: #Show the 'key' or names of the variables dataset.variables.keys() # We see many of the same names as the dimensions, but also a *precipitation* variable. # * `lat` # * `lon` # * `time` # * `precipitation` # * `crs` # In[ ]: #Show attributes of the precipitation variable dataset.variables['precipitation'] # ### 2.3 Attributes # NetCDF datasets have both global attributes and attributes associated with each variable. Here's how we explore what attributes are included in each and how to access information on each attribute # #### 2.3.1 Global attributes # Listing the datset's **global** attributes is done bit differently than the process for dimensions and variables. Here, the `ncattrs()` function returns a list of the dataset's global attributes. Then we can display more information on any of these attributes using the `getattr()` function. # In[ ]: #List the dataset's attributes dataset.ncattrs() # In[ ]: #Display information stored in the `summary` attribute dataset.summary # #### 2.3.2 Variable attributes # Now we'll focus on the attributes of a single variable in our datatset. We'll choose the `precipitation` variable. # In[ ]: #List the precipitation variable's attributes dataset.variables['precipitation'].ncattrs() # In[ ]: #Reveal the information associated with the "cell_methods" attribute dataset.variables['precipitation'].cell_methods # *There's a lot going on in the above statements. It works, but it can be hard to read for the newbie. One of the advantanges of Python code is its readability, but this only works of coders write the code to be readable. So, let's rewrite the above statement so that it's more readable. It appears less "efficient", but sometimes readability is better than terseness.* # In[ ]: #Pull the precipitation variable into a Python variable called "precip" precip = dataset.variables['precipitation'] #Now list its attributes precip.ncattrs() # With the `precip` Python variable established, we can use it to display attribute contents directly # In[ ]: precip._FillValue # --- # ## 3. Working with the data # With some familiarity of what's in the dataset, we can now start manipulating and visualizing the data... # ### 3.1 Import the variables into netCDF4 `variable` objects # Let's now break our dataset into its component variables so that we can work with each more easily. Here, we assign Python variables to the four dataset variables: `time`, `latitute`, `longitude`, and `precipitation`. (We'll ignore the `crs` variable as that contains only one value, a metadata value listing the coordinate system used.) Then, we'll quickly examine some properties of these variable objects. # In[ ]: #Read the variables in NETCDF file ncTime = dataset.variables['time'] ncLon = dataset.variables['lon'] ncLat = dataset.variables['lat'] ncPrecip = dataset.variables['precipitation'] # In[ ]: #Confirm that these objects are netCDF4 variables type(ncPrecip) # In[ ]: #Display what we can do with a variable object get_ipython().run_line_magic('pinfo', 'ncPrecip') # In[ ]: #What if we just display everything about the variable? ncPrecip # ► *Now you try it: Recalling how we revealed the attributes of the `precip` variable above, what are the units of the `time` variable? The `lat` and `lon` variables?* # In[ ]: ncTime.units, ncLat.units, ncLon.units # **Variable shape**: These four variables are all arrays, i.e. a series of values set across one or or more dimensions. We can examine the size and dimensions of each variable via its `shape` property: # In[ ]: ncTime.shape # In[ ]: ncLat.shape # In[ ]: ncLon.shape # In[ ]: ncPrecip.shape # **Note** that the size of the three dimensions of the `precip` variable corresponds to the size of the `time`, `lat`, and `lon` variables, respectively. *This gives a more tangible sense of how these data are structured and how we can manipulate our data.* # # More specifically, we see that the data in the `precip` variable is 3 dimensional. The first dimension is time, and the other two dimensions are x-y coordinates in space. So we can envision our data as a stack of precipitation maps, with each layer in the stack as precipitation values for a single time snapshot. This helps us in subsetting our data... # # What may not be clear here is: *what then are the `time`, `lat`, and `lon` variables, if everything is held in the one `precip` variable?* This has everything to do with the fact that the actual data (i.e., the amount of precipitation) are referenced by their *index position* in the array, as we'll see in a moment. It'd be confusing to explain this in more detail at the moment. Instead, let's move a bit further with the data and then we'll come back to this. # # --- # ### 3.2 Extracting data from our variables # #### 3.2.1 Extract a single value (i.e. single location-time value) # The `precip` variable has 3 axes: `time`,`lat`, and `lon`, with sizes of 192, 444, and 922, respectively. And the value at a given time/geographic coordinate is the predicted rainfall at that specific time/location. We can extract a specific precipitation value by specifying a `time`, `lat`, and `lon` value: # In[ ]: print(ncPrecip[0,250,300]) # The interpolated precipitation at that time/location is `37 mm`. # # But what time and location did we actually specify?? *What is time=0?? Likewise, a latitude of "250" and a longitude of "300" are not really geographic coordinates.* # # These values are actually pointers to the **positions** in the precipitation array, that is, the 1st time slice, the 251st "latitude" column, and the 301st "longitude" column - if you think of our precipitation array as a stack of lat/long tables, with a layer for each time. (Also note that Python indices are zero-based, meaning the values start at zero, not 1.) # # **So how, then, do we extract data for a known time and/or location?** Well, those data are contained in the other arrays, with positions corresponding to the axes in the precipitation array: # In[ ]: print(ncTime[0]) print(ncLat[250]) print(ncLon[300]) # So, we see that the precipitation value extracted above is associated with the location (`40.71824°N`,`254.15625°W`) and the time "`32864`"??!? # # That time value is actually days since 01-01-1990 (found in the datasets metadata): # In[ ]: ncTime.units # And we have tools to conver this into a more readable format: # In[ ]: #Create a new array, converting days since 1900 to a date time value ncTime2 = netCDF4.num2date(ncTime[:],ncTime.units) ncTime2[0] # And we see the first time slice is Dec 12, 1989. # So now what we have to do is somehow cross-reference the *values* in the `time2`, `lat`, and `lon` arrays with the *axes* in the precipitation array. In doing so, we can more easily extract values we want and run various analyses on the data. # ## 4. Python's NumPy (numeric Python) package # ### 4.1 Converting the netCDF4 variable into NumPy arrays # The process to do this involves converting each of our NetCDF variabls into numeric arrays which then allows us to use Python's **NumPy** package to do exactly what we need to do... # In[ ]: #Import the numpy package, calling it "np" to save typing import numpy as np # In[ ]: #Convert the NetCDF variables to numpy arrays arrTime = ncTime2[:].astype(np.datetime64) arrLat = ncLat[:] arrLon = ncLon[:] arrPrecip = ncPrecip[:] # In[ ]: #Show the type of object created type(arrPrecip) # ### 4.2 Working with NumPy arrays # The netCDF variables were converted to Numpy *masked arrays*. This means that, in addition to the N-dimensional numeric array, we have an associated n-dimensional array, but one with boolean values that indicate whether the data should be used in calculations or not. # In[ ]: #Show the shape of the precip array arrPrecip.shape # Numpy arrays work much like the netCDF variables in terms of querying values using their position... # In[ ]: arrPrecip[0,250,300] # In[ ]: arrTime[0] # In[ ]: arrLat[250] # In[ ]: arrLon[300] # ### 4.3 Subsetting our data # With our variable stored as NumPy arrays, we can use NumPy methods to slice our data across time and/or space. We'll begin simply by just [blindly] using raw position values (i.e. not actual times or lat/long coordinates) to do first a time series plot and then a simple map for a single time slice. # #### 4.3.1 Time series for one location # To extract precipitation across all time for one location (here we'll choose the `lat` at position `250` and the lon at postition `300`), we use a colon to say "grab all values" at the first axis (`time`), then then specific values for the `lat` and `lon` axes. # In[ ]: #Grab all the precip data at location lat=200 and lon=300 arrPrecipLocX = arrPrecip[:,250,300] # In[ ]: #What is the shape of the result: should be 192 arrPrecipLocX.shape # In[ ]: #Show the first 10 values in the resulting array arrPrecipLocX[0:10].data # #### 4.3.2 Plotting - with NumPy & MatPlotLib # We use another library for plotting. Python has a few, but as you are likely familiar with Matlab, we'll use `matplotlib` which was desinged after Matlab's plotting procedures. # In[ ]: #Import the pyplot subpackage and enable inline plotting from matplotlib import pyplot as plt get_ipython().run_line_magic('matplotlib', 'inline') # In[ ]: #Plot as a time series fig = plt.figure(figsize=(15,3)) plt.plot(arrPrecipLocX); # In[ ]: #Plot time series for three points long a longitudinal transect fig = plt.figure(figsize=(15,3)) plt.plot(arrPrecip[:,250,300]) plt.plot(arrPrecip[:,300,300]) plt.plot(arrPrecip[:,350,300]) plt.show(); # #### 4.3.3 Creating a location matrix for a single point in time, or average across all times # In[ ]: #Pull precip data for all lat/lon dimensions for the most recent record arrPrecipRecent = arrPrecip[-1,:,:] arrPrecipRecent.shape # In[ ]: #Or, compute the mean across all time values (time = axis zero) arrPrecipMean = arrPrecip.mean(axis=0) arrPrecipMean.shape # In[ ]: #Or, compute the mean across all time values (time = axis zero) for a spatial subset arrPrecipMeanSubset = arrPrecip[:,250:300,300:400].mean(axis=0) arrPrecipMeanSubset.shape # #### 4.3.4 Plotting in two dimensions # Matplotlib's `imshow` function allows us to plot images, i.e., data stored with X/Y coordinates, as we have. # In[ ]: fig = plt.figure(figsize=(15,5)) plt.subplot(1,2,1) plt.imshow(arrPrecipMean,origin=(0,0),cmap="YlGnBu") plt.subplot(1,2,2) plt.imshow(arrPrecipMeanSubset,origin=(0,0),cmap="YlGnBu") plt.colorbar(); # --- # ## 5. Pandas # **Numpy** works great for computations on n-dimensional arrays, but the axes are still a bit befuddling as they can only be accessed via position, not actual time stamps or actual lat-long coordinates. You can see that easily in the map we just created: the coordinates are *image* coordinates, not geographic ones (e.g. degrees north and degrees east). # # Numpy has a companion Python package called **Pandas** (a contraction of *Pan*el *Da*ta) that overcomes this issue by allowing us to store and manipulate our data in **data frames**. The key difference [for us] between *Numpy's ndarrays* and *Pandas' dataframes* is the ability to name and label columns and rows so we can move beyond just integer indices. # # The drawback, however, is that dataframes require data to be stored in two dimensions, i.e., in rows and columns. (There are ways around that, using something called a hierarchical multiindex, but more on that later...). # ### 5.1 Reducing and converting our NumPy arrays to Pandas dataframes # Anyway, let's convert our Numpy arrays to Pandas dataframes. Of course, before we can do that, we need to ***reduce*** our data to two or fewer dimensions. We've done that above, creating the time series dataset (by reducing the lat and lon dimensions to a single location), and by creating the mapped dataset (by reducing the time dimension to a single point in time or by averaging all time values into one value.) # In[ ]: #Import pandas import pandas as pd # #### 5.1.1 Full process from netCDF4 variable to 2d Pandas dataframes # Let's review the whole process... # In[ ]: #Pull the precip variable from the netCDF dataset precip = dataset.variables['precipitation'] # In[ ]: #Convert the netCDF variable to a 3-d numpy array arrPrecip = precip[:] arrPrecip.shape # In[ ]: #Reduce the 3d array to a 1d array arrTimeSeries = arrPrecip[:,250,300] arrTimeSeries.shape # In[ ]: #Reduce the 3d array to a 2d array, collapsing the time dimension into average values arrMeanPrecip = arrPrecip.mean(axis=0) arrMeanPrecip.shape # In[ ]: #Convert the 1d time series array to a dataframe dfTimeSeries=pd.DataFrame(arrTimeSeries) dfTimeSeries.shape # In[ ]: #Display the first 5 records dfTimeSeries.head() # In[ ]: #Convert the 2d mean values to a dataframe dfMeanPrecip=pd.DataFrame(arrMeanPrecip) dfMeanPrecip.shape # In[ ]: #Display the first 5 records of the dfMeanPrecip.head() # #### 5.1.2 Adding column names and indices to our dataframes # ***♦ Note that the rows and columns remain as simple integers, not actual times or lat/long coordinates***. # # By adding row and column labels to our dataset - using the `time`, `lat`, and `lon` variable values to assign those row & column names, we can then manipulate our data using those values (as compared to just integers in numpy arrays). # # **Note**: row labels are called indices in Pandas # * First, we import the variable values in to Pandas *series* objects. Series are just 1d arrays. # In[ ]: #Convert the time, lat, and lon variables to Pandas series objects (series b/c 1 dimension) ncTime = dataset.variables['time'] seriesTime = pd.Series(netCDF4.num2date(ncTime[:],ncTime.units)) seriesLat = pd.Series(dataset.variables['lat'][:]) seriesLon = pd.Series(dataset.variables['lon'][:]) print(seriesTime.shape,seriesLat.shape,seriesLon.shape) # Now we can assign those values to the dataFrame column name and row indices... # In[ ]: #Set the index of the time series dataframe to the actual times dfTimeSeries.index = seriesTime dfTimeSeries.head(3) # In[ ]: #Add column names and an index to the dfMeanPrecip dataframe dfMeanPrecip.columns = seriesLon dfMeanPrecip.index = seriesLat dfMeanPrecip.head() # ## 5.2 Subsetting data in Pandas dataframes # ### 5.2.1 Selecting a *slice* of time # With our data in Pandas dataframe and the actual times set as the index, we can select, subset, or **slice** our data on *actual times*. # In[ ]: #Subset the data for the years 2000 to 2004 df_subset = dfTimeSeries[(dfTimeSeries.index >= '2000') & (dfTimeSeries.index < '2004')] # In[ ]: #Plot the subset fig = plt.figure(figsize=(15,3)) plt.plot(df_subset); # ♦ **AND** we get the actual dates on the axis! # --- # ### 5.2.2 Spatial slices # Here we create Boolean masks to filter our datasets on the lat and lon axes. # In[ ]: dfX = dfMeanPrecip.reset_index() dfX.head() # In[ ]: dfX=dfMeanPrecip.reset_index().melt(var_name="lon",value_name="precip",id_vars='index') dfX.columns = ['lat','lon','precip'] dfX.head() # In[ ]: latFilter = (dfMeanPrecip.index > 2000) & (dfMeanPrecip.index < 2005) lonFilter = # #### 4.2.2 Creating a map for one time slice # What if we fix time and then plot data across space? Well create a 2d array for one time snapshot and then map it. # In[ ]: import matplotlib.image as mpimg # In[ ]: fig = plt.figure(figsize=(15,5)) plt.imshow(dfMeanPrecip,origin=(0,0),cmap="YlGnBu") plt.colorbar(); # In[ ]: