This lesson assumes some programming background, but not necessarily with Python. It moves at a faster pace than a novice lesson, however.
We will learn how to do the following: read tabular data into pandas, select and summarize subsets of it, plot it with matplotlib, use lists and loops to process many files at once, and fit a simple statistical model with statsmodels.
This lesson is based heavily on the Software Carpentry python-intermediate-mosquitoes
lesson, and uses its datasets. It also includes some mixed-in components from the python-novice-inflammation
lesson.
We are interested in understanding the relationship between the weather and the number of mosquitos occurring in a particular year so that we can plan mosquito control measures accordingly. Since we want to apply these mosquito control measures at a number of different sites we need to understand both the relationship at a particular site and whether or not it is consistent across sites. The data we have to address this problem come from the local government and are stored in tables in comma-separated values (CSV) files. Each file holds the data for a single location, each row holds the information for a single year at that location, and the columns hold the data on both mosquito numbers and the average temperature and rainfall from the beginning of mosquito breeding season. The first few rows of our first file look like:
%cat A1_mosquito_data.csv | head -n 5
year,temperature,rainfall,mosquitos
2001,80,157,150
2002,85,252,217
2003,86,154,153
2004,87,159,158
And we have five files to work with:
%ls *.csv
A1_mosquito_data.csv A3_mosquito_data.csv B2_mosquito_data.csv A2_mosquito_data.csv B1_mosquito_data.csv
Note: commands preceded with a % are known as "magics". These are specific to the notebook environment; they are not Python commands.
import pandas as pd
Importing a library in this way allows us to use the components defined inside of it. First, we'll read in a single dataset using the pandas.read_csv
function:
pd.read_csv('A1_mosquito_data.csv')
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
0 | 2001 | 80 | 157 | 150 |
1 | 2002 | 85 | 252 | 217 |
2 | 2003 | 86 | 154 | 153 |
3 | 2004 | 87 | 159 | 158 |
4 | 2005 | 74 | 292 | 243 |
5 | 2006 | 75 | 283 | 237 |
6 | 2007 | 80 | 214 | 190 |
7 | 2008 | 85 | 197 | 181 |
8 | 2009 | 74 | 231 | 200 |
9 | 2010 | 74 | 207 | 184 |
This reads the CSV from disk and deserializes it into a pandas.DataFrame. But unless we attach a name to the DataFrame the function returns, we cannot keep working with the object in memory.
data = pd.read_csv('A1_mosquito_data.csv')
data
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
0 | 2001 | 80 | 157 | 150 |
1 | 2002 | 85 | 252 | 217 |
2 | 2003 | 86 | 154 | 153 |
3 | 2004 | 87 | 159 | 158 |
4 | 2005 | 74 | 292 | 243 |
5 | 2006 | 75 | 283 | 237 |
6 | 2007 | 80 | 214 | 190 |
7 | 2008 | 85 | 197 | 181 |
8 | 2009 | 74 | 231 | 200 |
9 | 2010 | 74 | 207 | 184 |
Now we can refer to the DataFrame
directly, without having to read it from disk every time.
A bit about names. If I attach the name weight_kg to the value 55:
weight_kg = 55
I can then use this value to calculate the same weight in pounds, weight_lb:
weight_lb = 2.2 * weight_kg
weight_lb
121.00000000000001
If I change my weight_kg to 66, what is the current value of weight_lb?
weight_kg = 66
weight_lb
121.00000000000001
There's no change. This is because the name weight_lb
is attached to a value, in this case a floating point number. The value has no concept of how it came to be.
weight_lb = 2.2 * weight_kg
weight_lb
145.20000000000002
Back to our DataFrame. A DataFrame is an object. We can see what type of object it is with the Python builtin, type:
type(data)
pandas.core.frame.DataFrame
It turns out that in Python, everything is an object. We'll see what this means as we go, but the most important aspect of this is that in Python we have names, and we assign these to objects. Any name can point to any object, and more than one name can point to a single object.
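For instance (a quick illustration, not one of the lesson's original cells), attaching a second name to our DataFrame does not copy it; both names refer to the same object:

other = data
other is data    # True: both names point to the same DataFrame object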
Anyway, a DataFrame
allows us to get at individual components of our tabular data. We can get single columns like:
data['year']
0    2001
1    2002
2    2003
3    2004
4    2005
5    2006
6    2007
7    2008
8    2009
9    2010
Name: year, dtype: int64
Or multiple columns with:
data[['rainfall', 'temperature']]
| | rainfall | temperature |
---|---|---|
0 | 157 | 80 |
1 | 252 | 85 |
2 | 154 | 86 |
3 | 159 | 87 |
4 | 292 | 74 |
5 | 283 | 75 |
6 | 214 | 80 |
7 | 197 | 85 |
8 | 231 | 74 |
9 | 207 | 74 |
Slicing can be used to get back subsets of rows:
data[0:2]
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
0 | 2001 | 80 | 157 | 150 |
1 | 2002 | 85 | 252 | 217 |
Python indices are 0-based, meaning counting goes as 0, 1, 2, 3...; this means that the first row is row 0, the second row is row 1, etc. It's best to refer to row 0 as the "zeroth row" to avoid confusion.
This slice should be read as "get the 0th element up to and not including the 2nd element". The "not including" is important, and the cause of much initial frustration. It does take some getting used to.
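If the "up to and not including" rule feels odd at first, a quick sketch with a plain Python list (made-up values, just for illustration) shows the same behavior:

numbers = [10, 20, 30, 40, 50]
numbers[0:2]    # [10, 20] -- elements 0 and 1, stopping before element 2
numbers[1:4]    # [20, 30, 40]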
What if we want a single row?
data[1]
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) /usr/lib/python3.5/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance) 1875 try: -> 1876 return self._engine.get_loc(key) 1877 except KeyError: pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4027)() pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3891)() pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12408)() pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12359)() KeyError: 1 During handling of the above exception, another exception occurred: KeyError Traceback (most recent call last) <ipython-input-18-c805864c0d75> in <module>() ----> 1 data[1] /usr/lib/python3.5/site-packages/pandas/core/frame.py in __getitem__(self, key) 1990 return self._getitem_multilevel(key) 1991 else: -> 1992 return self._getitem_column(key) 1993 1994 def _getitem_column(self, key): /usr/lib/python3.5/site-packages/pandas/core/frame.py in _getitem_column(self, key) 1997 # get column 1998 if self.columns.is_unique: -> 1999 return self._get_item_cache(key) 2000 2001 # duplicate columns & possible reduce dimensionality /usr/lib/python3.5/site-packages/pandas/core/generic.py in _get_item_cache(self, item) 1343 res = cache.get(item) 1344 if res is None: -> 1345 values = self._data.get(item) 1346 res = self._box_item_values(item, values) 1347 cache[item] = res /usr/lib/python3.5/site-packages/pandas/core/internals.py in get(self, item, fastpath) 3223 3224 if not isnull(item): -> 3225 loc = self.items.get_loc(item) 3226 else: 3227 indexer = np.arange(len(self.items))[isnull(self.items)] /usr/lib/python3.5/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance) 1876 return self._engine.get_loc(key) 1877 except KeyError: -> 1878 return self._engine.get_loc(self._maybe_cast_indexer(key)) 1879 1880 indexer = self.get_indexer([key], method=method, tolerance=tolerance) pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4027)() pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3891)() pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12408)() pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12359)() KeyError: 1
For a DataFrame, this is ambiguous, since a single value is interpreted as a column name. We can only get at rows by slicing at the top level:
data[1:2]
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
1 | 2002 | 85 | 252 | 217 |
Or we could use .iloc:
data.iloc[1]
year           2002
temperature      85
rainfall        252
mosquitos       217
Name: 1, dtype: int64
Getting a single row in this way returns a Series:
type(data.iloc[1])
pandas.core.series.Series
A Series is a 1-D column of values, all having the same datatype. Since each of our columns holds integers, we got a Series with dtype int64 this time. If we had columns with, e.g., strings, then we'd get back dtype object, which is a catchall for pandas.
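As a quick aside (a made-up example, not one of our datasets), a column of strings comes back as dtype object:

pd.Series(['high', 'low', 'low'])    # dtype: object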
We can also get the data in our Series
as a raw numpy
array:
type(data.iloc[1].values)
numpy.ndarray
Pandas is a relatively young library, but it's built on top of the venerable numpy array, which makes it possible to do fast numerical work in Python. A Series is basically a 1-D numpy array with the added ability to select by labeled indices.
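As a minimal sketch of label-based selection (reusing the row we pulled out above; not an original lesson cell):

row = data.iloc[1]
row['rainfall']     # select by label, much like a dictionary
row.values[2]       # the raw numpy array underneath only knows positions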
More useful than simple slicing is boolean indexing, which we can use to subselect our data. Say we want only the data for years after 2005:
data[data['year'] > 2005]
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
5 | 2006 | 75 | 283 | 237 |
6 | 2007 | 80 | 214 | 190 |
7 | 2008 | 85 | 197 | 181 |
8 | 2009 | 74 | 231 | 200 |
9 | 2010 | 74 | 207 | 184 |
There's no magic here; we get a boolean index directly from a comparison:
gt_2005 = data['year'] > 2005
gt_2005
0    False
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
9     True
Name: year, dtype: bool
And using this Series of bools will then give only the rows for which the Series had True:
data[gt_2005]
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
5 | 2006 | 75 | 283 | 237 |
6 | 2007 | 80 | 214 | 190 |
7 | 2008 | 85 | 197 | 181 |
8 | 2009 | 74 | 231 | 200 |
9 | 2010 | 74 | 207 | 184 |
This is the same behavior as numpy arrays: most binary operators, such as +, *, >, and &, work element-wise. With a single value on one side (such as 2005), we get the result of the operation for each element.
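For example (an extra illustration, not from the original cells), two comparisons can be combined element-wise with & and the result used as a single boolean index:

wet_and_warm = (data['rainfall'] > 200) & (data['temperature'] > 80)
data[wet_and_warm]    # only the rows where both conditions hold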
A DataFrame is an object, and objects have methods. These are functions that are part of the object itself, often doing operations on the object's data. One of these is DataFrame.mean:
data.mean()
year           2005.5
temperature      80.0
rainfall        214.6
mosquitos       191.3
dtype: float64
We get back the mean value of each column as a single Series. There are more methods like this:
data.max()
year           2010
temperature      87
rainfall        292
mosquitos       243
dtype: int64
There's also DataFrame.describe, which gives common descriptive statistics of the whole DataFrame:
data.describe()
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
count | 10.00000 | 10.000000 | 10.000000 | 10.00000 |
mean | 2005.50000 | 80.000000 | 214.600000 | 191.30000 |
std | 3.02765 | 5.456902 | 50.317216 | 33.23335 |
min | 2001.00000 | 74.000000 | 154.000000 | 150.00000 |
25% | 2003.25000 | 74.250000 | 168.500000 | 163.75000 |
50% | 2005.50000 | 80.000000 | 210.500000 | 187.00000 |
75% | 2007.75000 | 85.000000 | 246.750000 | 212.75000 |
max | 2010.00000 | 87.000000 | 292.000000 | 243.00000 |
This is, itself, a DataFrame:
data.describe()['temperature']
count    10.000000
mean     80.000000
std       5.456902
min      74.000000
25%      74.250000
50%      80.000000
75%      85.000000
max      87.000000
Name: temperature, dtype: float64
Documentation is a key part of Python's design. In the notebook, you can get a quick look at the docs for a given Python function or method with:
data.describe?
Or more generally (built-in Python behavior):
help(data.describe)
Help on method describe in module pandas.core.generic: describe(percentiles=None, include=None, exclude=None) method of pandas.core.frame.DataFrame instance Generate various summary statistics, excluding NaN values. Parameters ---------- percentiles : array-like, optional The percentiles to include in the output. Should all be in the interval [0, 1]. By default `percentiles` is [.25, .5, .75], returning the 25th, 50th, and 75th percentiles. include, exclude : list-like, 'all', or None (default) Specify the form of the returned result. Either: - None to both (default). The result will include only numeric-typed columns or, if none are, only categorical columns. - A list of dtypes or strings to be included/excluded. To select all numeric types use numpy numpy.number. To select categorical objects use type object. See also the select_dtypes documentation. eg. df.describe(include=['O']) - If include is the string 'all', the output column-set will match the input one. Returns ------- summary: NDFrame of summary statistics Notes ----- The output DataFrame index depends on the requested dtypes: For numeric dtypes, it will include: count, mean, std, min, max, and lower, 50, and upper percentiles. For object dtypes (e.g. timestamps or strings), the index will include the count, unique, most common, and frequency of the most common. Timestamps also include the first and last items. For mixed dtypes, the index will be the union of the corresponding output types. Non-applicable entries will be filled with NaN. Note that mixed-dtype outputs can only be returned from mixed-dtype inputs and appropriate use of the include/exclude arguments. If multiple values have the highest count, then the `count` and `most common` pair will be arbitrarily chosen from among those with the highest count. The include, exclude arguments are ignored for Series. See Also -------- DataFrame.select_dtypes
Suppose we want the standard deviation of the mosquito counts, but only for years with more than 200 units of rainfall. One way we could do this is to first grab the "mosquitos" column, then use a fancy index obtained from comparing the "rainfall" column to 200. We can then call the std method of the resulting Series:
data['mosquitos'][data['rainfall'] > 200].std()
24.587937421969063
Note that this is a key part of the power of pandas
objects: operations for subsetting and calculating descriptive statistics can often be stacked to great effect.
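As one more (hypothetical) example of this stacking, the mean mosquito count for years after 2005 is a single chained expression:

data[data['year'] > 2005]['mosquitos'].mean()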
What if we know our temperatures are in Fahrenheit, but we want them in Celsius? We can convert them.
(data['temperature'] - 32) * 5 / 9
0    26.666667
1    29.444444
2    30.000000
3    30.555556
4    23.333333
5    23.888889
6    26.666667
7    29.444444
8    23.333333
9    23.333333
Name: temperature, dtype: float64
This gives us back a new Series object. If we want to change the values in our existing DataFrame to be these, we can just set the column to this new Series:
data['temperature'] = (data['temperature'] - 32) * 5 / 9
data['temperature']
0    26.666667
1    29.444444
2    30.000000
3    30.555556
4    23.333333
5    23.888889
6    26.666667
7    29.444444
8    23.333333
9    23.333333
Name: temperature, dtype: float64
Similarly, it's also possible to add new columns. We could add one giving us, e.g. the ratio of rainfall to mosquitos:
data['rainfall / mosquitos'] = data['rainfall'] / data['mosquitos']
data
| | year | temperature | rainfall | mosquitos | rainfall / mosquitos |
---|---|---|---|---|---|
0 | 2001 | 26.666667 | 157 | 150 | 1.046667 |
1 | 2002 | 29.444444 | 252 | 217 | 1.161290 |
2 | 2003 | 30.000000 | 154 | 153 | 1.006536 |
3 | 2004 | 30.555556 | 159 | 158 | 1.006329 |
4 | 2005 | 23.333333 | 292 | 243 | 1.201646 |
5 | 2006 | 23.888889 | 283 | 237 | 1.194093 |
6 | 2007 | 26.666667 | 214 | 190 | 1.126316 |
7 | 2008 | 29.444444 | 197 | 181 | 1.088398 |
8 | 2009 | 23.333333 | 231 | 200 | 1.155000 |
9 | 2010 | 23.333333 | 207 | 184 | 1.125000 |
It's probably a bit silly to have such a column. So, let's get rid of it:
del data['rainfall / mosquitos']
data
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
0 | 2001 | 26.666667 | 157 | 150 |
1 | 2002 | 29.444444 | 252 | 217 |
2 | 2003 | 30.000000 | 154 | 153 |
3 | 2004 | 30.555556 | 159 | 158 |
4 | 2005 | 23.333333 | 292 | 243 |
5 | 2006 | 23.888889 | 283 | 237 |
6 | 2007 | 26.666667 | 214 | 190 |
7 | 2008 | 29.444444 | 197 | 181 |
8 | 2009 | 23.333333 | 231 | 200 |
9 | 2010 | 23.333333 | 207 | 184 |
Don't like the order of your columns? We can make a DataFrame
with a different column order by selecting them out in the order we want:
data[['rainfall', 'year', 'mosquitos', 'temperature']]
| | rainfall | year | mosquitos | temperature |
---|---|---|---|---|
0 | 157 | 2001 | 150 | 26.666667 |
1 | 252 | 2002 | 217 | 29.444444 |
2 | 154 | 2003 | 153 | 30.000000 |
3 | 159 | 2004 | 158 | 30.555556 |
4 | 292 | 2005 | 243 | 23.333333 |
5 | 283 | 2006 | 237 | 23.888889 |
6 | 214 | 2007 | 190 | 26.666667 |
7 | 197 | 2008 | 181 | 29.444444 |
8 | 231 | 2009 | 200 | 23.333333 |
9 | 207 | 2010 | 184 | 23.333333 |
We can even have duplicates:
data[['rainfall', 'year', 'mosquitos', 'temperature', 'year']]
| | rainfall | year | mosquitos | temperature | year |
---|---|---|---|---|---|
0 | 157 | 2001 | 150 | 26.666667 | 2001 |
1 | 252 | 2002 | 217 | 29.444444 | 2002 |
2 | 154 | 2003 | 153 | 30.000000 | 2003 |
3 | 159 | 2004 | 158 | 30.555556 | 2004 |
4 | 292 | 2005 | 243 | 23.333333 | 2005 |
5 | 283 | 2006 | 237 | 23.888889 | 2006 |
6 | 214 | 2007 | 190 | 26.666667 | 2007 |
7 | 197 | 2008 | 181 | 29.444444 | 2008 |
8 | 231 | 2009 | 200 | 23.333333 | 2009 |
9 | 207 | 2010 | 184 | 23.333333 | 2010 |
Remember: this returns a copy. It does not change our existing DataFrame:
data
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
0 | 2001 | 26.666667 | 157 | 150 |
1 | 2002 | 29.444444 | 252 | 217 |
2 | 2003 | 30.000000 | 154 | 153 |
3 | 2004 | 30.555556 | 159 | 158 |
4 | 2005 | 23.333333 | 292 | 243 |
5 | 2006 | 23.888889 | 283 | 237 |
6 | 2007 | 26.666667 | 214 | 190 |
7 | 2008 | 29.444444 | 197 | 181 |
8 | 2009 | 23.333333 | 231 | 200 |
9 | 2010 | 23.333333 | 207 | 184 |
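If we did want to keep a particular ordering around, one option (a small sketch using a hypothetical name) is to assign the reordered copy to a name of its own:

data_reordered = data[['rainfall', 'year', 'mosquitos', 'temperature']]
data_reordered.columns    # the copy keeps the new order; data is unchanged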
Just as pandas
is the de-facto library for working with tabular data, matplotlib
is the de-facto library for producing high-quality plots from numerical data. As the saying goes, matplotlib makes easy things easy and hard things possible.
import matplotlib.pyplot as plt
We need to tell the notebook to render plots in the notebook itself:
%matplotlib inline
Before we use matplotlib
explicitly, pandas
objects feature convenience methods for common plotting operations. These use matplotlib
internally, so everything we learn later about matplotlib
objects applies to these. For example, we can plot both "temperature"
and "mosquitos"
as a function of "year":
data.plot(x='year', y=['temperature', 'mosquitos'])
<matplotlib.axes._subplots.AxesSubplot at 0x7f339711ce80>
Let's load a larger dataset:
data = pd.read_csv('A2_mosquito_data.csv')
This dataset has 51 rows, instead of just 10:
len(data)
51
data.plot(x='year', y=['temperature', 'mosquitos'])
<matplotlib.axes._subplots.AxesSubplot at 0x7f33970c72e8>
There are other convenience methods for different ways of plotting the data. For example, we can get a kernel-density estimate:
data['mosquitos'].plot.kde()
<matplotlib.axes._subplots.AxesSubplot at 0x7f3397030160>
These convenience methods are great, but for more complex plots we can, and should, use matplotlib
directly. We can make a multi-paneled figure giving both "temperature" and "rainfall" for each "year."
fig = plt.figure(figsize=(4, 6))
ax1 = fig.add_subplot(2, 1, 1)
ax1.plot(data['year'], data['temperature'], 'ro-')
ax1.set_xlabel('Year')
ax1.set_ylabel('Temperature')
ax2 = fig.add_subplot(2, 1, 2)
ax2.plot(data['year'], data['rainfall'], 'bs-')
ax2.set_xlabel('Year')
ax2.set_ylabel('Rainfall')
<matplotlib.text.Text at 0x7f3390d64ef0>
We can plot the mosquito counts against temperature and against rainfall in a similar way.
fig = plt.figure(figsize=(4, 6))
ax1 = fig.add_subplot(2, 1, 1)
ax1.plot(data['temperature'], data['mosquitos'], 'ro')
ax1.set_xlabel('Temperature')
ax1.set_ylabel('Mosquitos')
ax2 = fig.add_subplot(2, 1, 2)
ax2.plot(data['rainfall'], data['mosquitos'], 'bs')
ax2.set_xlabel('Rainfall')
ax2.set_ylabel('Mosquitos')
<matplotlib.text.Text at 0x7f3390d15978>
Note that leaving the line-style character out of the format string (e.g. 'ro' instead of 'ro-') turns off the interpolation lines, since we just want to plot the points and don't care about their order.
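The format string packs a color, a marker, and an optional line style into one argument. A few combinations, as a sketch (not cells from the original lesson):

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
# 'ro-'  red circles joined by a solid line
# 'bs'   blue squares, no joining line
# 'g^--' green triangles joined by a dashed line
ax.plot(data['rainfall'], data['mosquitos'], 'g^--')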
From this one dataset we see what looks like a linear relationship between mosquitos and rainfall, but not really any relationship between mosquitos and temperature. We'd like to quantify this further by applying a statistical model to the data, but we also want to do this same treatment to all our datasets. We need to learn a few more things, including loops, before continuing further.
First, an aside with strings. Most programming languages generally deal in a few basic data types, including integers, floating-point numbers, and characters. A string is a sequence of characters, and we can manipulate them easily with python. For example:
word = 'bird'
We now have a string 'bird' that we've assigned the name word. We can examine the string's type, and indeed that of any object in Python, with the type builtin.
type(word)
str
We can also access its characters (it's a sequence!) with indexing:
word[0]
'b'
word[1]
'i'
And slicing:
word[1:]
'ird'
If I want to output each letter of word
on a new line, I could do:
print(word[0])
print(word[1])
print(word[2])
print(word[3])
b
i
r
d
But this doesn't scale to words of arbitrary length. Instead, let's let the computer do the boring work of iterating with a for
loop:
for letter in word:
    print(letter)
b
i
r
d
This same bit of code will work for any word, including ridiculously long ones. Like this gem:
word = 'supercallifragilisticexpialidocious'
print(word)
supercallifragilisticexpialidocious
for letter in word:
    print(letter)
s
u
p
e
r
c
a
l
l
i
f
r
a
g
i
l
i
s
t
i
c
e
x
p
i
a
l
i
d
o
c
i
o
u
s
We can do more with loops than iterate through the elements of a sequence. A common use-case is building a sum. For example, getting the number of letters in word:
counter = 0
for letter in word:
    counter = counter + 1
print(counter)
35
For this very particular use, there's a Python built-in, which is guaranteed to be more efficient. Use it instead:
len(word)
35
It works on any sequence! For a pandas.DataFrame, it gives the number of rows:
len(data)
51
Loops work best when they have something to iterate over. In Python, the list is a built-in data structure that serves as a container for a sequence of other objects. We can make an empty list with:
stuff = list()
print(stuff)
[]
And we can append things to the end of the list, in this case a bunch of random strings:
stuff.append('marklar')
stuff.append('chewbacca')
stuff.append('chickenfingers')
stuff
['marklar', 'chewbacca', 'chickenfingers']
Just like with the word
example, we can iterate through the elements of the list:
for item in stuff:
    print(item)
marklar
chewbacca
chickenfingers
And slicing works too! Slicing with negative numbers counts backwards from the end of the list. The slice stuff[-2:] could be read as "get the 2nd-to-last element of stuff, through the last element":
for item in stuff[-2:]:
    print(item)
chewbacca
chickenfingers
So indexing with -1
gives the last element:
stuff[-1]
'chickenfingers'
And consistently, but uselessly, slicing from and to the same index yields nothing. Remember, a slice n:m reads as "get all elements starting with element n and up to but not including element m."
stuff[0:0]
[]
Lists are mutable: that is, their elements can be changed. For example, we can replace 'chewbacca' with 'hansolo':
stuff[1]
'chewbacca'
stuff[1] = 'hansolo'
stuff[1]
'hansolo'
...or replace 'hansolo' with a DataFrame:
stuff[1] = data
stuff[1]
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
0 | 1960 | 82 | 200 | 180 |
1 | 1961 | 70 | 227 | 194 |
2 | 1962 | 89 | 231 | 207 |
3 | 1963 | 74 | 114 | 121 |
4 | 1964 | 78 | 147 | 140 |
5 | 1965 | 85 | 151 | 148 |
6 | 1966 | 86 | 172 | 162 |
7 | 1967 | 75 | 106 | 112 |
8 | 1968 | 70 | 276 | 230 |
9 | 1969 | 86 | 165 | 162 |
10 | 1970 | 83 | 222 | 198 |
11 | 1971 | 78 | 297 | 247 |
12 | 1972 | 87 | 288 | 248 |
13 | 1973 | 76 | 286 | 239 |
14 | 1974 | 86 | 231 | 202 |
15 | 1975 | 90 | 284 | 243 |
16 | 1976 | 76 | 190 | 175 |
17 | 1977 | 87 | 257 | 225 |
18 | 1978 | 88 | 128 | 133 |
19 | 1979 | 87 | 218 | 199 |
20 | 1980 | 81 | 206 | 184 |
21 | 1981 | 74 | 175 | 160 |
22 | 1982 | 85 | 202 | 187 |
23 | 1983 | 71 | 130 | 126 |
24 | 1984 | 80 | 225 | 200 |
25 | 1985 | 72 | 196 | 173 |
26 | 1986 | 76 | 261 | 222 |
27 | 1987 | 85 | 111 | 121 |
28 | 1988 | 83 | 247 | 210 |
29 | 1989 | 86 | 137 | 142 |
30 | 1990 | 82 | 159 | 152 |
31 | 1991 | 77 | 172 | 160 |
32 | 1992 | 74 | 280 | 231 |
33 | 1993 | 70 | 291 | 238 |
34 | 1994 | 77 | 126 | 125 |
35 | 1995 | 89 | 191 | 178 |
36 | 1996 | 83 | 298 | 248 |
37 | 1997 | 80 | 282 | 237 |
38 | 1998 | 86 | 219 | 195 |
39 | 1999 | 72 | 143 | 134 |
40 | 2000 | 79 | 262 | 221 |
41 | 2001 | 85 | 189 | 175 |
42 | 2002 | 86 | 205 | 186 |
43 | 2003 | 72 | 195 | 173 |
44 | 2004 | 78 | 148 | 146 |
45 | 2005 | 71 | 262 | 219 |
46 | 2006 | 88 | 255 | 226 |
47 | 2007 | 79 | 262 | 221 |
48 | 2008 | 73 | 198 | 176 |
49 | 2009 | 86 | 215 | 187 |
50 | 2010 | 87 | 127 | 129 |
print(stuff)
['marklar', year temperature rainfall mosquitos 0 1960 82 200 180 1 1961 70 227 194 2 1962 89 231 207 3 1963 74 114 121 4 1964 78 147 140 5 1965 85 151 148 6 1966 86 172 162 7 1967 75 106 112 8 1968 70 276 230 9 1969 86 165 162 10 1970 83 222 198 11 1971 78 297 247 12 1972 87 288 248 13 1973 76 286 239 14 1974 86 231 202 15 1975 90 284 243 16 1976 76 190 175 17 1977 87 257 225 18 1978 88 128 133 19 1979 87 218 199 20 1980 81 206 184 21 1981 74 175 160 22 1982 85 202 187 23 1983 71 130 126 24 1984 80 225 200 25 1985 72 196 173 26 1986 76 261 222 27 1987 85 111 121 28 1988 83 247 210 29 1989 86 137 142 30 1990 82 159 152 31 1991 77 172 160 32 1992 74 280 231 33 1993 70 291 238 34 1994 77 126 125 35 1995 89 191 178 36 1996 83 298 248 37 1997 80 282 237 38 1998 86 219 195 39 1999 72 143 134 40 2000 79 262 221 41 2001 85 189 175 42 2002 86 205 186 43 2003 72 195 173 44 2004 78 148 146 45 2005 71 262 219 46 2006 88 255 226 47 2007 79 262 221 48 2008 73 198 176 49 2009 86 215 187 50 2010 87 127 129, 'chickenfingers']
We can even add a list as an element of itself:
stuff.append(stuff)
So now stuff[-1]
points to the same list the name stuff
does:
stuff[-1] is stuff
True
It's important to recognize that these are exactly the same list. Modifications to stuff[-1] will be seen in stuff. These are just names, and in this case they refer to the same object in memory.
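A small, self-contained illustration of this aliasing (made-up names, nothing to do with our data):

a = ['x', 'y']
b = a               # b is just another name for the same list
b.append('z')
a                   # ['x', 'y', 'z'] -- the change shows up under both names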
Lists can contain lists can contain lists can contain lists...and of course any number of other python objects. Even functions can be included as elements:
stuff.append(len)
stuff
['marklar', year temperature rainfall mosquitos 0 1960 82 200 180 1 1961 70 227 194 2 1962 89 231 207 3 1963 74 114 121 4 1964 78 147 140 5 1965 85 151 148 6 1966 86 172 162 7 1967 75 106 112 8 1968 70 276 230 9 1969 86 165 162 10 1970 83 222 198 11 1971 78 297 247 12 1972 87 288 248 13 1973 76 286 239 14 1974 86 231 202 15 1975 90 284 243 16 1976 76 190 175 17 1977 87 257 225 18 1978 88 128 133 19 1979 87 218 199 20 1980 81 206 184 21 1981 74 175 160 22 1982 85 202 187 23 1983 71 130 126 24 1984 80 225 200 25 1985 72 196 173 26 1986 76 261 222 27 1987 85 111 121 28 1988 83 247 210 29 1989 86 137 142 30 1990 82 159 152 31 1991 77 172 160 32 1992 74 280 231 33 1993 70 291 238 34 1994 77 126 125 35 1995 89 191 178 36 1996 83 298 248 37 1997 80 282 237 38 1998 86 219 195 39 1999 72 143 134 40 2000 79 262 221 41 2001 85 189 175 42 2002 86 205 186 43 2003 72 195 173 44 2004 78 148 146 45 2005 71 262 219 46 2006 88 255 226 47 2007 79 262 221 48 2008 73 198 176 49 2009 86 215 187 50 2010 87 127 129, 'chickenfingers', [...], <function len>]
Lists also have their own methods, which usually alter the list itself:
stuff.reverse()
stuff
[<function len>, [...], 'chickenfingers', year temperature rainfall mosquitos 0 1960 82 200 180 1 1961 70 227 194 2 1962 89 231 207 3 1963 74 114 121 4 1964 78 147 140 5 1965 85 151 148 6 1966 86 172 162 7 1967 75 106 112 8 1968 70 276 230 9 1969 86 165 162 10 1970 83 222 198 11 1971 78 297 247 12 1972 87 288 248 13 1973 76 286 239 14 1974 86 231 202 15 1975 90 284 243 16 1976 76 190 175 17 1977 87 257 225 18 1978 88 128 133 19 1979 87 218 199 20 1980 81 206 184 21 1981 74 175 160 22 1982 85 202 187 23 1983 71 130 126 24 1984 80 225 200 25 1985 72 196 173 26 1986 76 261 222 27 1987 85 111 121 28 1988 83 247 210 29 1989 86 137 142 30 1990 82 159 152 31 1991 77 172 160 32 1992 74 280 231 33 1993 70 291 238 34 1994 77 126 125 35 1995 89 191 178 36 1996 83 298 248 37 1997 80 282 237 38 1998 86 219 195 39 1999 72 143 134 40 2000 79 262 221 41 2001 85 189 175 42 2002 86 205 186 43 2003 72 195 173 44 2004 78 148 146 45 2005 71 262 219 46 2006 88 255 226 47 2007 79 262 221 48 2008 73 198 176 49 2009 86 215 187 50 2010 87 127 129, 'marklar']
list.pop
removes the element at the given index:
stuff.pop(-2)
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
0 | 1960 | 82 | 200 | 180 |
1 | 1961 | 70 | 227 | 194 |
2 | 1962 | 89 | 231 | 207 |
3 | 1963 | 74 | 114 | 121 |
4 | 1964 | 78 | 147 | 140 |
5 | 1965 | 85 | 151 | 148 |
6 | 1966 | 86 | 172 | 162 |
7 | 1967 | 75 | 106 | 112 |
8 | 1968 | 70 | 276 | 230 |
9 | 1969 | 86 | 165 | 162 |
10 | 1970 | 83 | 222 | 198 |
11 | 1971 | 78 | 297 | 247 |
12 | 1972 | 87 | 288 | 248 |
13 | 1973 | 76 | 286 | 239 |
14 | 1974 | 86 | 231 | 202 |
15 | 1975 | 90 | 284 | 243 |
16 | 1976 | 76 | 190 | 175 |
17 | 1977 | 87 | 257 | 225 |
18 | 1978 | 88 | 128 | 133 |
19 | 1979 | 87 | 218 | 199 |
20 | 1980 | 81 | 206 | 184 |
21 | 1981 | 74 | 175 | 160 |
22 | 1982 | 85 | 202 | 187 |
23 | 1983 | 71 | 130 | 126 |
24 | 1984 | 80 | 225 | 200 |
25 | 1985 | 72 | 196 | 173 |
26 | 1986 | 76 | 261 | 222 |
27 | 1987 | 85 | 111 | 121 |
28 | 1988 | 83 | 247 | 210 |
29 | 1989 | 86 | 137 | 142 |
30 | 1990 | 82 | 159 | 152 |
31 | 1991 | 77 | 172 | 160 |
32 | 1992 | 74 | 280 | 231 |
33 | 1993 | 70 | 291 | 238 |
34 | 1994 | 77 | 126 | 125 |
35 | 1995 | 89 | 191 | 178 |
36 | 1996 | 83 | 298 | 248 |
37 | 1997 | 80 | 282 | 237 |
38 | 1998 | 86 | 219 | 195 |
39 | 1999 | 72 | 143 | 134 |
40 | 2000 | 79 | 262 | 221 |
41 | 2001 | 85 | 189 | 175 |
42 | 2002 | 86 | 205 | 186 |
43 | 2003 | 72 | 195 | 173 |
44 | 2004 | 78 | 148 | 146 |
45 | 2005 | 71 | 262 | 219 |
46 | 2006 | 88 | 255 | 226 |
47 | 2007 | 79 | 262 | 221 |
48 | 2008 | 73 | 198 | 176 |
49 | 2009 | 86 | 215 | 187 |
50 | 2010 | 87 | 127 | 129 |
...while list.remove
removes the first instance of the given element in the list:
stuff.remove('chickenfingers')
So now our list looks like:
stuff
[<function len>, [...], 'marklar']
Now that we know how to do loops and use lists in Python, we'll use these to make a plot for each dataset. First, we'll import a module called glob:
import glob
We can use this to grab all the filenames matching a "globbing" pattern:
glob.glob("*.csv")
['A1_mosquito_data.csv', 'A3_mosquito_data.csv', 'B2_mosquito_data.csv', 'A2_mosquito_data.csv', 'B1_mosquito_data.csv']
"Globbing" is the name of the shell-completion patterns common for shells like sh
and bash
.
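The pattern can be as narrow as we like; for example, to grab only the 'A' sites (assuming the same filenames as listed above):

glob.glob('A*_mosquito_data.csv')    # just the three A-site files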
We can grab the block of code we wrote previously, and loop through it for each file obtained by globbing:
import glob
files = glob.glob('*.csv')
for file in files:
    print(file)
    data = pd.read_csv(file)
    fig = plt.figure(figsize=(4, 6))
    ax1 = fig.add_subplot(2, 1, 1)
    ax1.plot(data['temperature'], data['mosquitos'], 'ro')
    ax1.set_xlabel('Temperature')
    ax1.set_ylabel('Mosquitos')
    ax2 = fig.add_subplot(2, 1, 2)
    ax2.plot(data['rainfall'], data['mosquitos'], 'bs')
    ax2.set_xlabel('Rainfall')
    ax2.set_ylabel('Mosquitos')
    # tell matplotlib to render the plot *now*
    plt.show()
A1_mosquito_data.csv
A3_mosquito_data.csv
B2_mosquito_data.csv
A2_mosquito_data.csv
B1_mosquito_data.csv
It looks like there's a similar pattern for each area we have data for. We'd like to quantify more closely the relationship between mosquito population and each of these variables.
There are many statistics packages in Python, but one of the most well-developed and widely-used is statsmodels:
import statsmodels.api as sm
We'll use our second dataset for the initial stab at fitting a statistical model to our data:
data = pd.read_csv('A2_mosquito_data.csv')
And let's convert the temperatures to Celsius, for good measure:
data['temperature'] = (data['temperature'] - 32) * 5 / 9
We can build an ordinary least-squares fit to the data using statsmodels.OLS. Remember we can check the docs with:
sm.OLS?
We want to fit the relationship between mosquito population and, for a start, rainfall:
regr_results = sm.OLS(data['mosquitos'], data['rainfall']).fit()
regr_results.summary()
Dep. Variable: | mosquitos | R-squared: | 0.996 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.996 |
Method: | Least Squares | F-statistic: | 1.386e+04 |
Date: | Mon, 09 May 2016 | Prob (F-statistic): | 8.69e-63 |
Time: | 23:02:14 | Log-Likelihood: | -196.25 |
No. Observations: | 51 | AIC: | 394.5 |
Df Residuals: | 50 | BIC: | 396.4 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
---|---|---|---|---|---|
rainfall | 0.8811 | 0.007 | 117.744 | 0.000 | 0.866 0.896 |
Omnibus: | 4.422 | Durbin-Watson: | 1.878 |
---|---|---|---|
Prob(Omnibus): | 0.110 | Jarque-Bera (JB): | 1.959 |
Skew: | -0.085 | Prob(JB): | 0.376 |
Kurtosis: | 2.055 | Cond. No. | 1.00 |
So, this gives us a lot of information. But is this what we want?
We actually want to quantify which variable correlates more with the mosquito population. One way we can build a multivariate model with statsmodels is to use a formula:
regr_results = sm.OLS.from_formula('mosquitos ~ temperature + rainfall', data).fit()
regr_results.summary()
Dep. Variable: | mosquitos | R-squared: | 0.997 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.997 |
Method: | Least Squares | F-statistic: | 7889. |
Date: | Mon, 09 May 2016 | Prob (F-statistic): | 3.68e-61 |
Time: | 23:02:15 | Log-Likelihood: | -111.54 |
No. Observations: | 51 | AIC: | 229.1 |
Df Residuals: | 48 | BIC: | 234.9 |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
---|---|---|---|---|---|
Intercept | 17.5457 | 2.767 | 6.341 | 0.000 | 11.983 23.109 |
temperature | 0.8719 | 0.092 | 9.457 | 0.000 | 0.687 1.057 |
rainfall | 0.6967 | 0.006 | 125.385 | 0.000 | 0.686 0.708 |
Omnibus: | 1.651 | Durbin-Watson: | 1.872 |
---|---|---|---|
Prob(Omnibus): | 0.438 | Jarque-Bera (JB): | 0.906 |
Skew: | -0.278 | Prob(JB): | 0.636 |
Kurtosis: | 3.343 | Cond. No. | 1.92e+03 |
Each of these results can be accessed as attributes of the Results
object:
print(regr_results.params)
Intercept      17.545739
temperature     0.871943
rainfall        0.696717
dtype: float64
print(regr_results.rsquared)
0.996966873691
Our model "explains" most of the data. Nice! Let's plot the predicted population of mosquitos from our fitted model against the measured population to see how well we did:
import matplotlib.pyplot as plt
figure = plt.figure()
ax = figure.add_subplot(1,1,1)
parameters = regr_results.params
predicted = parameters['Intercept'] + parameters['temperature'] * data['temperature'] + parameters['rainfall'] * data['rainfall']
ax.plot(predicted, data['mosquitos'], 'ro')
ax.set_xlabel('predicted mosquito population')
ax.set_ylabel('measured mosquito population')
<matplotlib.text.Text at 0x7f338d98c908>
The more linear this plot is, the better our model fits the data. Also, as you might have guessed from the plots we made previously, rainfall is a better predictor than average temperature of how many mosquitos we expect to see. The t-statistics of the fit tell the same story, with rainfall's being by far the largest:
regr_results.tvalues
Intercept       6.341405
temperature     9.456614
rainfall      125.385116
dtype: float64
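As an aside (not an original lesson cell), the fitted Results object can also hand back its predictions directly, which should match the predicted values we computed by hand above:

regr_results.fittedvalues    # the model's prediction for each observation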
We can take our block of code we worked on previously, and make a three-panel plot instead of just a two-panel one:
filenames = glob.glob('*.csv')
for filename in filenames:
    print(filename)
    data = pd.read_csv(filename)
    # convert temperatures to celsius from fahrenheit
    data['temperature'] = (data['temperature'] - 32) * 5/9
    # perform fit
    regr_results = sm.OLS.from_formula('mosquitos ~ temperature + rainfall', data).fit()
    print(regr_results.tvalues)
    fig = plt.figure(figsize=(6, 9))
    # plot predicted vs. measured mosquito populations from fitted model
    parameters = regr_results.params
    predicted = parameters['Intercept'] + parameters['temperature'] * data['temperature'] + parameters['rainfall'] * data['rainfall']
    ax0 = fig.add_subplot(3, 1, 1)
    ax0.plot(predicted, data['mosquitos'], 'gd')
    ax0.set_xlabel('predicted mosquito population')
    ax0.set_ylabel('measured mosquito population')
    # plot population vs. temperature
    ax1 = fig.add_subplot(3, 1, 2)
    ax1.plot(data['temperature'], data['mosquitos'], 'ro')
    ax1.set_xlabel('Temperature')
    ax1.set_ylabel('Mosquitos')
    # plot population vs. rainfall
    ax2 = fig.add_subplot(3, 1, 3)
    ax2.plot(data['rainfall'], data['mosquitos'], 'bs')
    ax2.set_xlabel('Rainfall')
    ax2.set_ylabel('Mosquitos')
    plt.show()
A1_mosquito_data.csv
Intercept       3.589733
temperature     3.456159
rainfall       56.825601
dtype: float64