Pythonic Data Analysis¶

One of the nice aspects of Python is that it contains many powerful tools built into the language itself:

Iteration and unpacking
dictionaries
first-class functions
filepath manipulation (standard library)

In this section we'll look at how to use these concepts to write simple, yet powerful data analysis scripts.

Basic¶

Let's start by opening a time series from the GFS to play with.

In [1]:

from netCDF4 import Dataset, num2date
data = Dataset('data/model-gfs.nc', 'r')

# Convert the array of time numbers to datetimes
time_var = data['time']
time = num2date(time_var[:], time_var.units).squeeze()

Now let's make a basic time series plot with matplotlib.

In [2]:

import matplotlib.pyplot as plt
%matplotlib inline

# Just a single panel of time vs. Temperature_isobaric, which was taken from 1000mb
fig, ax = plt.subplots(1, 1, figsize=(12, 6))
ax.plot(time, data.variables['Temperature_isobaric'][:].squeeze(), 'r-')
ax.grid()

Let's work a bit harder and add axis labels, set up some better date ticking, and format the grid lines to go with the new ticking so that we can more easily denote the days.

In [3]:

fig, ax = plt.subplots(1, 1, figsize=(12, 6))
ax.plot(time, data.variables['Temperature_isobaric'][:].squeeze(), 'r-', linewidth=2)

# Add x/y axis labels with a bigger font
label_font = dict(size=16)
ax.set_xlabel('Forecast time (UTC)', fontdict=label_font)
ax.set_ylabel('Temperature', fontdict=label_font)

# Set the x-axis to do major ticks on the days and label them like 'Jul 20'
from matplotlib.dates import DateFormatter, DayLocator, HourLocator
ax.xaxis.set_major_locator(DayLocator())
ax.xaxis.set_major_formatter(DateFormatter('%h %d'))

# Set up minor ticks with the hours 6, 12, 18 written as '18Z'
ax.xaxis.set_minor_locator(HourLocator(range(6, 24, 6)))
ax.xaxis.set_minor_formatter(DateFormatter('%HZ'))

# Highlight the major x-axis grid lines with a thicker, dashed line
ax.grid(axis='x', linestyle='--', color='#666699', linewidth=1.0)
ax.grid(which='minor', axis='x')
ax.grid(axis='y')

Containers and iteration¶

Python's built-in lists and tuples facilitate easy ways of looping over groups of data. One useful utility is zip, which iterates over multiple lists at the same time and gives tuples of the elements.

In [4]:

vals = [1, 2, 3, 4]
names = ['one', 'two', 'three', 'four']
z = list(zip(names, vals))

Coincidentally, zip can also be used to unzip, using the ability of Python to pass a sequence as a set of positional arguments:

In [5]:

list(zip(*z))

Out[5]:

[('one', 'two', 'three', 'four'), (1, 2, 3, 4)]

Another useful part of the containers is unpacking, which allows assignment directly from a container

In [6]:

one, two, three, four = vals
print(three)

This same functionality enables multiple return values from functions

In [7]:

import math
def polar_to_cartesian(r, th):
    x = r * math.cos(th)
    y = r * math.sin(th)
    return x, y

X,Y = polar_to_cartesian(2, math.pi / 3)
print(X, Y)

1.0000000000000002 1.7320508075688772

You can put those together and iterate over multiple lists, and unpack those items into individual variables:

In [8]:

for n, v in zip(names, vals):
    print('%s == %d' % (n, v))

one == 1
two == 2
three == 3
four == 4

Now let's add Relative_humidity_isobaric to our plot on a new subplot. But we need to do this in a loop without duplicating code. To do so we need:

A list of variables to plot
A way to convert the variable from the file to an axis label
A list of plot formats

In [9]:

# This is just to keep the different cells in the notebook from rehashing this
def set_defaults(ax):
    # Set the x-axis to do major ticks on the days and label them like 'Jul 20'
    from matplotlib.dates import DateFormatter, DayLocator, HourLocator
    ax.xaxis.set_major_locator(DayLocator())
    ax.xaxis.set_major_formatter(DateFormatter('%h %d'))

    # Set up minor ticks with the hours 6, 12, 18 written as '18Z'
    ax.xaxis.set_minor_locator(HourLocator(range(6, 24, 6)))
    ax.xaxis.set_minor_formatter(DateFormatter('%HZ'))

    # Highlight the major x-axis grid lines with a thicker, dashed line
    ax.grid(axis='x', linestyle='--', color='#666699', linewidth=1.0)
    ax.grid(which='minor', axis='x')
    ax.grid(axis='y')
    
    # Normal x-axis label
    ax.set_xlabel('Forecast time (UTC)', fontdict=dict(size=16))

In [ ]:

# This creates a figure and 2 subplots (axes is a list 2 Axes objects)
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# What should we loop over?

Dictionaries¶

Dictionaries are another powerful language feature that allow you to create arbitrary mappings between two sets of things (key -> value). They can be abused, certainly, but they give programmers the ability to create simple data structures on the fly.

In [11]:

# From before, we can now create a way to map a integer value to its
# english representation
num_map = dict(zip(vals, names))
print(num_map)

{1: 'one', 2: 'two', 3: 'three', 4: 'four'}

Of course, the values can themselves be dictionaries:

In [12]:

states = dict(Colorado={'abbreviation': 'CO', 'capitol': 'Denver', 'notes': 'Home!'},
              Oklahoma={'abbreviation': 'OK', 'capitol': 'Oklahoma City', 'flat': True},
              Kansas={'abbreviation': 'KS', 'capitol': 'Topeka', 'flat': True})

print(states['Oklahoma']['abbreviation'])

OK

Dictionaries can also be used to pass a set of keyword arguments:

In [13]:

def print_states(Colorado, **the_rest):
    print(Colorado['abbreviation'], 'is', Colorado['notes'])
    for state, info in the_rest.items():
        print('The capitol of %s is %s' % (state, info['capitol']))

In [14]:

print_states(**states)

CO is Home!
The capitol of Kansas is Topeka
The capitol of Oklahoma is Oklahoma City

Now let's add a third panel with the Temperature_surface. This should have the same plotting style as Temperature, but without duplicating the style information. It might also be nice to set the title of the axes with the variable name for clarity.

Hint: We already have name, that might make a nice key

Hint #2: plot takes a bunch of keyword arguments

In [ ]:

# Should probably specify plotting styles here

fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Now need to loop over the subplots

First class functions¶

Functions in Python can be passed around and used, just like any other object. This makes it very easy to pull out bits of functionality and re-use them.

In [16]:

# Function using our dictionary from earlier
def to_english(i):
    return num_map[i]

to_english(2)

Out[16]:

'two'

So let's sort our list of numbers using their english representation

In [17]:

vals

Out[17]:

[1, 2, 3, 4]

In [18]:

vals.sort(key=to_english)
vals

Out[18]:

[4, 1, 3, 2]

Let's convert the temperature on our plots from Kelvin to Farenheit. Since functions are first-class objects, we can stuff them into dictionaries as well....

In [ ]:

# Conversion functions, and some structure to use them...

fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Loop over the subplots

A bit more advanced¶

Let's now go ahead and put both temperature variables on the same plot. We'll need to change to loop over a list of varnames for each plot. We'll also need to label each plot with the level surface vs. isobaric, and add a legend. This will also require separating out the plot styles across the type and level.

In [ ]:

# Need some styles here

# Now only two subplots
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# Looping over each subplot

Filepath manipulation¶

The Python standard library has a lot of utilties for manipulating paths. One such utility is the glob module, which makes it easy to find files matching a UNIX shell-style file wildcard:

In [21]:

import glob
glob.glob('data/*.nc')

Out[21]:

['data/model-gfs.nc',
 'data/model-nam.nc',
 'data/new.nc',
 'data/new2.nc',
 'data/prmsl.2000.nc',
 'data/prmsl.2001.nc',
 'data/prmsl.2002.nc',
 'data/prmsl.2003.nc',
 'data/prmsl.2004.nc',
 'data/prmsl.2005.nc',
 'data/prmsl.2006.nc',
 'data/prmsl.2007.nc',
 'data/prmsl.2008.nc',
 'data/prmsl.2009.nc',
 'data/prmsl.2010.nc',
 'data/prmsl.2011.nc',
 'data/rtofs_glo_3dz_f006_6hrly_reg3.nc',
 'data/testrh.nc']

The os.path module provides functions that make it easy to manipulate paths, making munging such information trivial. For instance, to get the base filename from those paths above:

In [22]:

import os.path
filenames = [os.path.basename(fname) for fname in glob.glob('data/*.nc')]
filenames

Out[22]:

['model-gfs.nc',
 'model-nam.nc',
 'new.nc',
 'new2.nc',
 'prmsl.2000.nc',
 'prmsl.2001.nc',
 'prmsl.2002.nc',
 'prmsl.2003.nc',
 'prmsl.2004.nc',
 'prmsl.2005.nc',
 'prmsl.2006.nc',
 'prmsl.2007.nc',
 'prmsl.2008.nc',
 'prmsl.2009.nc',
 'prmsl.2010.nc',
 'prmsl.2011.nc',
 'rtofs_glo_3dz_f006_6hrly_reg3.nc',
 'testrh.nc']

And even remove the file extensions:

In [23]:

[os.path.splitext(name)[0] for name in filenames]

Out[23]:

['model-gfs',
 'model-nam',
 'new',
 'new2',
 'prmsl.2000',
 'prmsl.2001',
 'prmsl.2002',
 'prmsl.2003',
 'prmsl.2004',
 'prmsl.2005',
 'prmsl.2006',
 'prmsl.2007',
 'prmsl.2008',
 'prmsl.2009',
 'prmsl.2010',
 'prmsl.2011',
 'rtofs_glo_3dz_f006_6hrly_reg3',
 'testrh']

Can we use this to process all files beginning with model- in the data/ directory? Let's start by just collecting the NetCDF4 datasets into a dictionary with the model name (the part after model-, but before .nc) capitalized as the key.

In [ ]:

# Create empty dict where we'll store stuff
model_data = dict()

# Glob for file and loop

Now we need to update the plotting to loop over the model datasets. Note that we need to do the time conversion step from the beginning inside this loop, so that each dataset has it done. The subplots should have their title set based on the model. It might also be nice to use consistent y-axis limits for the variables.

In [ ]:

# ?