This lesson assumes some programming background, but not necessarily with Python. It moves at a faster pace than a novice lesson, however.
We will learn how to do the following: read tabular data into pandas, select and summarize subsets of it, plot it with matplotlib, use lists and loops to process many files at once, and fit a simple statistical model with statsmodels.
This lesson is based heavily on the Software Carpentry python-intermediate-mosquitoes
lesson, and uses its datasets. It also includes some mixed-in components from the python-novice-inflammation
lesson.
We are interested in understanding the relationship between the weather and the number of mosquitos occurring in a particular year so that we can plan mosquito control measures accordingly. Since we want to apply these mosquito control measures at a number of different sites we need to understand both the relationship at a particular site and whether or not it is consistent across sites. The data we have to address this problem come from the local government and are stored in tables in comma-separated values (CSV) files. Each file holds the data for a single location, each row holds the information for a single year at that location, and the columns hold the data on both mosquito numbers and the average temperature and rainfall from the beginning of mosquito breeding season. The first few rows of our first file look like:
%cat A1_mosquito_data.csv | head -n 5
year,temperature,rainfall,mosquitos
2001,80,157,150
2002,85,252,217
2003,86,154,153
2004,87,159,158
And we have five files to work with:
%ls *.csv
A1_mosquito_data.csv A3_mosquito_data.csv B2_mosquito_data.csv A2_mosquito_data.csv B1_mosquito_data.csv
Note: commands preceded with a % are known as "magics". These are specific to the notebook environment; they are not Python commands.
import pandas as pd
Importing a library in this way allows us to use the components defined inside of it. First, we'll read in a single dataset using the pandas.read_csv
function:
pd.read_csv('A1_mosquito_data.csv')
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
0 | 2001 | 80 | 157 | 150 |
1 | 2002 | 85 | 252 | 217 |
2 | 2003 | 86 | 154 | 153 |
3 | 2004 | 87 | 159 | 158 |
4 | 2005 | 74 | 292 | 243 |
5 | 2006 | 75 | 283 | 237 |
6 | 2007 | 80 | 214 | 190 |
7 | 2008 | 85 | 197 | 181 |
8 | 2009 | 74 | 231 | 200 |
9 | 2010 | 74 | 207 | 184 |
This reads the CSV from disk and deserializes it into a pandas.DataFrame. But unless we attach a name to the DataFrame the function returns, we cannot keep working with the object in memory.
data = pd.read_csv('A1_mosquito_data.csv')
data
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
0 | 2001 | 80 | 157 | 150 |
1 | 2002 | 85 | 252 | 217 |
2 | 2003 | 86 | 154 | 153 |
3 | 2004 | 87 | 159 | 158 |
4 | 2005 | 74 | 292 | 243 |
5 | 2006 | 75 | 283 | 237 |
6 | 2007 | 80 | 214 | 190 |
7 | 2008 | 85 | 197 | 181 |
8 | 2009 | 74 | 231 | 200 |
9 | 2010 | 74 | 207 | 184 |
Now we can refer to the DataFrame
directly, without having to read it from disk every time.
A bit about names. If I attach the name weight_kg to the value 55:
weight_kg = 55
I can then use this value to calculate the same weight in pounds, weight_lb:
weight_lb = 2.2 * weight_kg
weight_lb
121.00000000000001
If I change my weight_kg to 66, what is the current value of weight_lb?
weight_kg = 66
weight_lb
121.00000000000001
There's no change. This is because the name weight_lb
is attached to a value, in this case a floating point number. The value has no concept of how it came to be.
weight_lb = 2.2 * weight_kg
weight_lb
145.20000000000002
Back to our DataFrame. A DataFrame is an object. We can see what type of object it is with the Python builtin, type:
type(data)
pandas.core.frame.DataFrame
It turns out that in Python, everything is an object. We'll see what this means as we go, but the most important aspect of this is that in Python we have names, and we assign these to objects. Any name can point to any object, and more than one name can point to a single object.
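For instance (a quick illustration, not one of the lesson's original cells), attaching a second name to our DataFrame does not copy it; both names refer to the same object:

other = data
other is data    # True: both names point to the same DataFrame object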
Anyway, a DataFrame
allows us to get at individual components of our tabular data. We can get single columns like:
data['year']
0    2001
1    2002
2    2003
3    2004
4    2005
5    2006
6    2007
7    2008
8    2009
9    2010
Name: year, dtype: int64
Or multiple columns with:
data[['rainfall', 'temperature']]
| | rainfall | temperature |
---|---|---|
0 | 157 | 80 |
1 | 252 | 85 |
2 | 154 | 86 |
3 | 159 | 87 |
4 | 292 | 74 |
5 | 283 | 75 |
6 | 214 | 80 |
7 | 197 | 85 |
8 | 231 | 74 |
9 | 207 | 74 |
Slicing can be used to get back subsets of rows:
data[0:2]
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
0 | 2001 | 80 | 157 | 150 |
1 | 2002 | 85 | 252 | 217 |
Python indices are 0-based, meaning counting goes as 0, 1, 2, 3...; this means that the first row is row 0, the second row is row 1, etc. It's best to refer to row 0 as the "zeroth row" to avoid confusion.
This slice should be read as "get the 0th element up to and not including the 2nd element". The "not including" is important, and the cause of much initial frustration. It does take some getting used to.
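If the "up to and not including" rule feels odd at first, a quick sketch with a plain Python list (made-up values, just for illustration) shows the same behavior:

numbers = [10, 20, 30, 40, 50]
numbers[0:2]    # [10, 20] -- elements 0 and 1, stopping before element 2
numbers[1:4]    # [20, 30, 40]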
What if we want a single row?
data[1]
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) /usr/lib/python3.5/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance) 1875 try: -> 1876 return self._engine.get_loc(key) 1877 except KeyError: pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4027)() pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3891)() pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12408)() pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12359)() KeyError: 1 During handling of the above exception, another exception occurred: KeyError Traceback (most recent call last) <ipython-input-18-c805864c0d75> in <module>() ----> 1 data[1] /usr/lib/python3.5/site-packages/pandas/core/frame.py in __getitem__(self, key) 1990 return self._getitem_multilevel(key) 1991 else: -> 1992 return self._getitem_column(key) 1993 1994 def _getitem_column(self, key): /usr/lib/python3.5/site-packages/pandas/core/frame.py in _getitem_column(self, key) 1997 # get column 1998 if self.columns.is_unique: -> 1999 return self._get_item_cache(key) 2000 2001 # duplicate columns & possible reduce dimensionality /usr/lib/python3.5/site-packages/pandas/core/generic.py in _get_item_cache(self, item) 1343 res = cache.get(item) 1344 if res is None: -> 1345 values = self._data.get(item) 1346 res = self._box_item_values(item, values) 1347 cache[item] = res /usr/lib/python3.5/site-packages/pandas/core/internals.py in get(self, item, fastpath) 3223 3224 if not isnull(item): -> 3225 loc = self.items.get_loc(item) 3226 else: 3227 indexer = np.arange(len(self.items))[isnull(self.items)] /usr/lib/python3.5/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance) 1876 return self._engine.get_loc(key) 1877 except KeyError: -> 1878 return self._engine.get_loc(self._maybe_cast_indexer(key)) 1879 1880 indexer = self.get_indexer([key], method=method, tolerance=tolerance) pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4027)() pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3891)() pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12408)() pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12359)() KeyError: 1
For a DataFrame, this is ambiguous, since a single value is interpreted as a column name. We can only get at rows by slicing at the top level:
data[1:2]
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
1 | 2002 | 85 | 252 | 217 |
Or we could use .iloc:
data.iloc[1]
year           2002
temperature      85
rainfall        252
mosquitos       217
Name: 1, dtype: int64
Getting a single row in this way returns a Series:
type(data.iloc[1])
pandas.core.series.Series
A Series is a 1-D column of values, all having the same datatype. Since each of our columns holds integers, we got a Series with dtype int64 this time. If we had columns with, e.g., strings, then we'd get back dtype object, which is a catchall for pandas.
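As a quick aside (a made-up example, not one of our datasets), a column of strings comes back as dtype object:

pd.Series(['high', 'low', 'low'])    # dtype: object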
We can also get the data in our Series
as a raw numpy
array:
type(data.iloc[1].values)
numpy.ndarray
Pandas is a relatively young library, but it's built on top of the venerable numpy array, which makes it possible to do fast numerical work in Python. A Series is basically a 1-D numpy array with the added ability to select by labeled indices.
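As a minimal sketch of label-based selection (reusing the row we pulled out above; not an original lesson cell):

row = data.iloc[1]
row['rainfall']     # select by label, much like a dictionary
row.values[2]       # the raw numpy array underneath only knows positions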
More useful than simple slicing is boolean indexing, which we can use to subselect our data. Say we want only the data for years after 2005:
data[data['year'] > 2005]
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
5 | 2006 | 75 | 283 | 237 |
6 | 2007 | 80 | 214 | 190 |
7 | 2008 | 85 | 197 | 181 |
8 | 2009 | 74 | 231 | 200 |
9 | 2010 | 74 | 207 | 184 |
There's no magic here; we get a boolean index directly from a comparison:
gt_2005 = data['year'] > 2005
gt_2005
0    False
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
9     True
Name: year, dtype: bool
And using this Series of bools will then give only the rows for which the Series had True:
data[gt_2005]
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
5 | 2006 | 75 | 283 | 237 |
6 | 2007 | 80 | 214 | 190 |
7 | 2008 | 85 | 197 | 181 |
8 | 2009 | 74 | 231 | 200 |
9 | 2010 | 74 | 207 | 184 |
This is the same behavior as numpy arrays: most binary operators, such as +, *, >, and &, work element-wise. With a single value on one side (such as 2005), we get the result of the operation for each element.
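For example (an extra illustration, not from the original cells), two comparisons can be combined element-wise with & and the result used as a single boolean index:

wet_and_warm = (data['rainfall'] > 200) & (data['temperature'] > 80)
data[wet_and_warm]    # only the rows where both conditions hold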
A DataFrame is an object, and objects have methods. These are functions that are part of the object itself, often doing operations on the object's data. One of these is DataFrame.mean:
data.mean()
year           2005.5
temperature      80.0
rainfall        214.6
mosquitos       191.3
dtype: float64
We get back the mean value of each column as a single Series. There are more methods like this:
data.max()
year           2010
temperature      87
rainfall        292
mosquitos       243
dtype: int64
There's also DataFrame.describe, which gives common descriptive statistics of the whole DataFrame:
data.describe()
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
count | 10.00000 | 10.000000 | 10.000000 | 10.00000 |
mean | 2005.50000 | 80.000000 | 214.600000 | 191.30000 |
std | 3.02765 | 5.456902 | 50.317216 | 33.23335 |
min | 2001.00000 | 74.000000 | 154.000000 | 150.00000 |
25% | 2003.25000 | 74.250000 | 168.500000 | 163.75000 |
50% | 2005.50000 | 80.000000 | 210.500000 | 187.00000 |
75% | 2007.75000 | 85.000000 | 246.750000 | 212.75000 |
max | 2010.00000 | 87.000000 | 292.000000 | 243.00000 |
This is, itself, a DataFrame:
data.describe()['temperature']
count    10.000000
mean     80.000000
std       5.456902
min      74.000000
25%      74.250000
50%      80.000000
75%      85.000000
max      87.000000
Name: temperature, dtype: float64
Documentation is a key part of Python's design. In the notebook, you can get a quick look at the docs for a given Python function or method with:
data.describe?
Or more generally (built-in Python behavior):
help(data.describe)
Help on method describe in module pandas.core.generic: describe(percentiles=None, include=None, exclude=None) method of pandas.core.frame.DataFrame instance Generate various summary statistics, excluding NaN values. Parameters ---------- percentiles : array-like, optional The percentiles to include in the output. Should all be in the interval [0, 1]. By default `percentiles` is [.25, .5, .75], returning the 25th, 50th, and 75th percentiles. include, exclude : list-like, 'all', or None (default) Specify the form of the returned result. Either: - None to both (default). The result will include only numeric-typed columns or, if none are, only categorical columns. - A list of dtypes or strings to be included/excluded. To select all numeric types use numpy numpy.number. To select categorical objects use type object. See also the select_dtypes documentation. eg. df.describe(include=['O']) - If include is the string 'all', the output column-set will match the input one. Returns ------- summary: NDFrame of summary statistics Notes ----- The output DataFrame index depends on the requested dtypes: For numeric dtypes, it will include: count, mean, std, min, max, and lower, 50, and upper percentiles. For object dtypes (e.g. timestamps or strings), the index will include the count, unique, most common, and frequency of the most common. Timestamps also include the first and last items. For mixed dtypes, the index will be the union of the corresponding output types. Non-applicable entries will be filled with NaN. Note that mixed-dtype outputs can only be returned from mixed-dtype inputs and appropriate use of the include/exclude arguments. If multiple values have the highest count, then the `count` and `most common` pair will be arbitrarily chosen from among those with the highest count. The include, exclude arguments are ignored for Series. See Also -------- DataFrame.select_dtypes
Suppose we want the standard deviation of the mosquito counts, but only for years with more than 200 units of rainfall. One way we could do this is to first grab the "mosquitos" column, then use a fancy index obtained from comparing the "rainfall" column to 200. We can then call the std method of the resulting Series:
data['mosquitos'][data['rainfall'] > 200].std()
24.587937421969063
Note that this is a key part of the power of pandas
objects: operations for subsetting and calculating descriptive statistics can often be stacked to great effect.
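As one more (hypothetical) example of this stacking, the mean mosquito count for years after 2005 is a single chained expression:

data[data['year'] > 2005]['mosquitos'].mean()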
What if we know our temperatures are in Fahrenheit, but we want them in Celsius? We can convert them.
(data['temperature'] - 32) * 5 / 9
0    26.666667
1    29.444444
2    30.000000
3    30.555556
4    23.333333
5    23.888889
6    26.666667
7    29.444444
8    23.333333
9    23.333333
Name: temperature, dtype: float64
This gives us back a new Series object. If we want to change the values in our existing DataFrame to be these, we can just set the column to this new Series:
data['temperature'] = (data['temperature'] - 32) * 5 / 9
data['temperature']
0    26.666667
1    29.444444
2    30.000000
3    30.555556
4    23.333333
5    23.888889
6    26.666667
7    29.444444
8    23.333333
9    23.333333
Name: temperature, dtype: float64
Similarly, it's also possible to add new columns. We could add one giving us, e.g. the ratio of rainfall to mosquitos:
data['rainfall / mosquitos'] = data['rainfall'] / data['mosquitos']
data
| | year | temperature | rainfall | mosquitos | rainfall / mosquitos |
---|---|---|---|---|---|
0 | 2001 | 26.666667 | 157 | 150 | 1.046667 |
1 | 2002 | 29.444444 | 252 | 217 | 1.161290 |
2 | 2003 | 30.000000 | 154 | 153 | 1.006536 |
3 | 2004 | 30.555556 | 159 | 158 | 1.006329 |
4 | 2005 | 23.333333 | 292 | 243 | 1.201646 |
5 | 2006 | 23.888889 | 283 | 237 | 1.194093 |
6 | 2007 | 26.666667 | 214 | 190 | 1.126316 |
7 | 2008 | 29.444444 | 197 | 181 | 1.088398 |
8 | 2009 | 23.333333 | 231 | 200 | 1.155000 |
9 | 2010 | 23.333333 | 207 | 184 | 1.125000 |
It's probably a bit silly to have such a column. So, let's get rid of it:
del data['rainfall / mosquitos']
data
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
0 | 2001 | 26.666667 | 157 | 150 |
1 | 2002 | 29.444444 | 252 | 217 |
2 | 2003 | 30.000000 | 154 | 153 |
3 | 2004 | 30.555556 | 159 | 158 |
4 | 2005 | 23.333333 | 292 | 243 |
5 | 2006 | 23.888889 | 283 | 237 |
6 | 2007 | 26.666667 | 214 | 190 |
7 | 2008 | 29.444444 | 197 | 181 |
8 | 2009 | 23.333333 | 231 | 200 |
9 | 2010 | 23.333333 | 207 | 184 |
Don't like the order of your columns? We can make a DataFrame
with a different column order by selecting them out in the order we want:
data[['rainfall', 'year', 'mosquitos', 'temperature']]
| | rainfall | year | mosquitos | temperature |
---|---|---|---|---|
0 | 157 | 2001 | 150 | 26.666667 |
1 | 252 | 2002 | 217 | 29.444444 |
2 | 154 | 2003 | 153 | 30.000000 |
3 | 159 | 2004 | 158 | 30.555556 |
4 | 292 | 2005 | 243 | 23.333333 |
5 | 283 | 2006 | 237 | 23.888889 |
6 | 214 | 2007 | 190 | 26.666667 |
7 | 197 | 2008 | 181 | 29.444444 |
8 | 231 | 2009 | 200 | 23.333333 |
9 | 207 | 2010 | 184 | 23.333333 |
We can even have duplicates:
data[['rainfall', 'year', 'mosquitos', 'temperature', 'year']]
| | rainfall | year | mosquitos | temperature | year |
---|---|---|---|---|---|
0 | 157 | 2001 | 150 | 26.666667 | 2001 |
1 | 252 | 2002 | 217 | 29.444444 | 2002 |
2 | 154 | 2003 | 153 | 30.000000 | 2003 |
3 | 159 | 2004 | 158 | 30.555556 | 2004 |
4 | 292 | 2005 | 243 | 23.333333 | 2005 |
5 | 283 | 2006 | 237 | 23.888889 | 2006 |
6 | 214 | 2007 | 190 | 26.666667 | 2007 |
7 | 197 | 2008 | 181 | 29.444444 | 2008 |
8 | 231 | 2009 | 200 | 23.333333 | 2009 |
9 | 207 | 2010 | 184 | 23.333333 | 2010 |
Remember: this returns a copy. It does not change our existing DataFrame:
data
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
0 | 2001 | 26.666667 | 157 | 150 |
1 | 2002 | 29.444444 | 252 | 217 |
2 | 2003 | 30.000000 | 154 | 153 |
3 | 2004 | 30.555556 | 159 | 158 |
4 | 2005 | 23.333333 | 292 | 243 |
5 | 2006 | 23.888889 | 283 | 237 |
6 | 2007 | 26.666667 | 214 | 190 |
7 | 2008 | 29.444444 | 197 | 181 |
8 | 2009 | 23.333333 | 231 | 200 |
9 | 2010 | 23.333333 | 207 | 184 |
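If we did want to keep a particular ordering around, one option (a small sketch using a hypothetical name) is to assign the reordered copy to a name of its own:

data_reordered = data[['rainfall', 'year', 'mosquitos', 'temperature']]
data_reordered.columns    # the copy keeps the new order; data is unchanged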
Just as pandas
is the de-facto library for working with tabular data, matplotlib
is the de-facto library for producing high-quality plots from numerical data. As the saying goes, matplotlib makes easy things easy and hard things possible.
import matplotlib.pyplot as plt
We need to tell the notebook to render plots in the notebook itself:
%matplotlib inline
Before we use matplotlib
explicitly, pandas
objects feature convenience methods for common plotting operations. These use matplotlib
internally, so everything we learn later about matplotlib
objects applies to these. For example, we can plot both "temperature"
and "mosquitos"
as a function of "year":
data.plot(x='year', y=['temperature', 'mosquitos'])
<matplotlib.axes._subplots.AxesSubplot at 0x7f339711ce80>
Let's load a larger dataset:
data = pd.read_csv('A2_mosquito_data.csv')
This dataset has 51 rows, instead of just 10:
len(data)
51
data.plot(x='year', y=['temperature', 'mosquitos'])
<matplotlib.axes._subplots.AxesSubplot at 0x7f33970c72e8>
There are other convenience methods for different ways of plotting the data. For example, we can get a kernel-density estimate:
data['mosquitos'].plot.kde()
<matplotlib.axes._subplots.AxesSubplot at 0x7f3397030160>
These convenience methods are great, but for more complex plots we can, and should, use matplotlib
directly. We can make a multi-paneled figure giving both "temperature" and "rainfall" for each "year."
fig = plt.figure(figsize=(4, 6))
ax1 = fig.add_subplot(2, 1, 1)
ax1.plot(data['year'], data['temperature'], 'ro-')
ax1.set_xlabel('Year')
ax1.set_ylabel('Temperature')
ax2 = fig.add_subplot(2, 1, 2)
ax2.plot(data['year'], data['rainfall'], 'bs-')
ax2.set_xlabel('Year')
ax2.set_ylabel('Rainfall')
<matplotlib.text.Text at 0x7f3390d64ef0>
We can plot the mosquito counts against temperature and against rainfall in a similar way.
fig = plt.figure(figsize=(4, 6))
ax1 = fig.add_subplot(2, 1, 1)
ax1.plot(data['temperature'], data['mosquitos'], 'ro')
ax1.set_xlabel('Temperature')
ax1.set_ylabel('Mosquitos')
ax2 = fig.add_subplot(2, 1, 2)
ax2.plot(data['rainfall'], data['mosquitos'], 'bs')
ax2.set_xlabel('Rainfall')
ax2.set_ylabel('Mosquitos')
<matplotlib.text.Text at 0x7f3390d15978>
Note that leaving the line-style character out of the format string (e.g. 'ro' instead of 'ro-') turns off the interpolation lines, since we just want to plot the points and don't care about their order.
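The format string packs a color, a marker, and an optional line style into one argument. A few combinations, as a sketch (not cells from the original lesson):

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
# 'ro-'  red circles joined by a solid line
# 'bs'   blue squares, no joining line
# 'g^--' green triangles joined by a dashed line
ax.plot(data['rainfall'], data['mosquitos'], 'g^--')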
From this one dataset we see what looks like a linear relationship between mosquitos and rainfall, but not really any relationship between mosquitos and temperature. We'd like to quantify this further by applying a statistical model to the data, but we also want to do this same treatment to all our datasets. We need to learn a few more things, including loops, before continuing further.
First, an aside with strings. Most programming languages generally deal in a few basic data types, including integers, floating-point numbers, and characters. A string is a sequence of characters, and we can manipulate them easily with python. For example:
word = 'bird'
We now have a string 'bird' that we've assigned the name word. We can examine the string's type, and indeed that of any object in Python, with the type builtin.
type(word)
str
We can also access its characters (it's a sequence!) with indexing:
word[0]
'b'
word[1]
'i'
And slicing:
word[1:]
'ird'
If I want to output each letter of word
on a new line, I could do:
print(word[0])
print(word[1])
print(word[2])
print(word[3])
b
i
r
d
But this doesn't scale to words of arbitrary length. Instead, let's let the computer do the boring work of iterating with a for
loop:
for letter in word:
    print(letter)
b
i
r
d
This same bit of code will work for any word, including ridiculously long ones. Like this gem:
word = 'supercallifragilisticexpialidocious'
print(word)
supercallifragilisticexpialidocious
for letter in word:
    print(letter)
s
u
p
e
r
c
a
l
l
i
f
r
a
g
i
l
i
s
t
i
c
e
x
p
i
a
l
i
d
o
c
i
o
u
s
We can do more with loops than iterate through the elements of a sequence. A common use-case is building a sum. For example, getting the number of letters in word:
counter = 0
for letter in word:
    counter = counter + 1
print(counter)
35
For this very particular use, there's a Python built-in, which is guaranteed to be more efficient. Use it instead:
len(word)
35
It works on any sequence! For a pandas.DataFrame, it gives the number of rows:
len(data)
51
Loops work best when they have something to iterate over. In Python, the list is a built-in data structure that serves as a container for a sequence of other objects. We can make an empty list with:
stuff = list()
print(stuff)
[]
And we can append things to the end of the list, in this case a bunch of random strings:
stuff.append('marklar')
stuff.append('chewbacca')
stuff.append('chickenfingers')
stuff
['marklar', 'chewbacca', 'chickenfingers']
Just like with the word
example, we can iterate through the elements of the list:
for item in stuff:
    print(item)
marklar
chewbacca
chickenfingers
And slicing works too! Slicing with negative numbers counts backwards from the end of the list. The slice stuff[-2:] could be read as "get the 2nd-to-last element of stuff, through the last element":
for item in stuff[-2:]:
    print(item)
chewbacca
chickenfingers
So indexing with -1
gives the last element:
stuff[-1]
'chickenfingers'
And consistently, but uselessly, slicing from and to the same index yields nothing. Remember, a slice n:m reads as "get all elements starting with element n and up to but not including element m."
stuff[0:0]
[]
Lists are mutable: that is, their elements can be changed. For example, we can replace 'chewbacca' with 'hansolo':
stuff[1]
'chewbacca'
stuff[1] = 'hansolo'
stuff[1]
'hansolo'
...or replace 'hansolo' with a DataFrame:
stuff[1] = data
stuff[1]
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
0 | 1960 | 82 | 200 | 180 |
1 | 1961 | 70 | 227 | 194 |
2 | 1962 | 89 | 231 | 207 |
3 | 1963 | 74 | 114 | 121 |
4 | 1964 | 78 | 147 | 140 |
5 | 1965 | 85 | 151 | 148 |
6 | 1966 | 86 | 172 | 162 |
7 | 1967 | 75 | 106 | 112 |
8 | 1968 | 70 | 276 | 230 |
9 | 1969 | 86 | 165 | 162 |
10 | 1970 | 83 | 222 | 198 |
11 | 1971 | 78 | 297 | 247 |
12 | 1972 | 87 | 288 | 248 |
13 | 1973 | 76 | 286 | 239 |
14 | 1974 | 86 | 231 | 202 |
15 | 1975 | 90 | 284 | 243 |
16 | 1976 | 76 | 190 | 175 |
17 | 1977 | 87 | 257 | 225 |
18 | 1978 | 88 | 128 | 133 |
19 | 1979 | 87 | 218 | 199 |
20 | 1980 | 81 | 206 | 184 |
21 | 1981 | 74 | 175 | 160 |
22 | 1982 | 85 | 202 | 187 |
23 | 1983 | 71 | 130 | 126 |
24 | 1984 | 80 | 225 | 200 |
25 | 1985 | 72 | 196 | 173 |
26 | 1986 | 76 | 261 | 222 |
27 | 1987 | 85 | 111 | 121 |
28 | 1988 | 83 | 247 | 210 |
29 | 1989 | 86 | 137 | 142 |
30 | 1990 | 82 | 159 | 152 |
31 | 1991 | 77 | 172 | 160 |
32 | 1992 | 74 | 280 | 231 |
33 | 1993 | 70 | 291 | 238 |
34 | 1994 | 77 | 126 | 125 |
35 | 1995 | 89 | 191 | 178 |
36 | 1996 | 83 | 298 | 248 |
37 | 1997 | 80 | 282 | 237 |
38 | 1998 | 86 | 219 | 195 |
39 | 1999 | 72 | 143 | 134 |
40 | 2000 | 79 | 262 | 221 |
41 | 2001 | 85 | 189 | 175 |
42 | 2002 | 86 | 205 | 186 |
43 | 2003 | 72 | 195 | 173 |
44 | 2004 | 78 | 148 | 146 |
45 | 2005 | 71 | 262 | 219 |
46 | 2006 | 88 | 255 | 226 |
47 | 2007 | 79 | 262 | 221 |
48 | 2008 | 73 | 198 | 176 |
49 | 2009 | 86 | 215 | 187 |
50 | 2010 | 87 | 127 | 129 |
print(stuff)
['marklar', year temperature rainfall mosquitos 0 1960 82 200 180 1 1961 70 227 194 2 1962 89 231 207 3 1963 74 114 121 4 1964 78 147 140 5 1965 85 151 148 6 1966 86 172 162 7 1967 75 106 112 8 1968 70 276 230 9 1969 86 165 162 10 1970 83 222 198 11 1971 78 297 247 12 1972 87 288 248 13 1973 76 286 239 14 1974 86 231 202 15 1975 90 284 243 16 1976 76 190 175 17 1977 87 257 225 18 1978 88 128 133 19 1979 87 218 199 20 1980 81 206 184 21 1981 74 175 160 22 1982 85 202 187 23 1983 71 130 126 24 1984 80 225 200 25 1985 72 196 173 26 1986 76 261 222 27 1987 85 111 121 28 1988 83 247 210 29 1989 86 137 142 30 1990 82 159 152 31 1991 77 172 160 32 1992 74 280 231 33 1993 70 291 238 34 1994 77 126 125 35 1995 89 191 178 36 1996 83 298 248 37 1997 80 282 237 38 1998 86 219 195 39 1999 72 143 134 40 2000 79 262 221 41 2001 85 189 175 42 2002 86 205 186 43 2003 72 195 173 44 2004 78 148 146 45 2005 71 262 219 46 2006 88 255 226 47 2007 79 262 221 48 2008 73 198 176 49 2009 86 215 187 50 2010 87 127 129, 'chickenfingers']
We can even add a list as an element of itself:
stuff.append(stuff)
So now stuff[-1]
points to the same list the name stuff
does:
stuff[-1] is stuff
True
It's important to recognize that these are exactly the same list. Modifications to stuff[-1] will be seen in stuff. These are just names, and in this case they refer to the same object in memory.
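A small, self-contained illustration of this aliasing (made-up names, nothing to do with our data):

a = ['x', 'y']
b = a               # b is just another name for the same list
b.append('z')
a                   # ['x', 'y', 'z'] -- the change shows up under both names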
Lists can contain lists can contain lists can contain lists...and of course any number of other python objects. Even functions can be included as elements:
stuff.append(len)
stuff
['marklar', year temperature rainfall mosquitos 0 1960 82 200 180 1 1961 70 227 194 2 1962 89 231 207 3 1963 74 114 121 4 1964 78 147 140 5 1965 85 151 148 6 1966 86 172 162 7 1967 75 106 112 8 1968 70 276 230 9 1969 86 165 162 10 1970 83 222 198 11 1971 78 297 247 12 1972 87 288 248 13 1973 76 286 239 14 1974 86 231 202 15 1975 90 284 243 16 1976 76 190 175 17 1977 87 257 225 18 1978 88 128 133 19 1979 87 218 199 20 1980 81 206 184 21 1981 74 175 160 22 1982 85 202 187 23 1983 71 130 126 24 1984 80 225 200 25 1985 72 196 173 26 1986 76 261 222 27 1987 85 111 121 28 1988 83 247 210 29 1989 86 137 142 30 1990 82 159 152 31 1991 77 172 160 32 1992 74 280 231 33 1993 70 291 238 34 1994 77 126 125 35 1995 89 191 178 36 1996 83 298 248 37 1997 80 282 237 38 1998 86 219 195 39 1999 72 143 134 40 2000 79 262 221 41 2001 85 189 175 42 2002 86 205 186 43 2003 72 195 173 44 2004 78 148 146 45 2005 71 262 219 46 2006 88 255 226 47 2007 79 262 221 48 2008 73 198 176 49 2009 86 215 187 50 2010 87 127 129, 'chickenfingers', [...], <function len>]
Lists also have their own methods, which usually alter the list itself:
stuff.reverse()
stuff
[<function len>, [...], 'chickenfingers', year temperature rainfall mosquitos 0 1960 82 200 180 1 1961 70 227 194 2 1962 89 231 207 3 1963 74 114 121 4 1964 78 147 140 5 1965 85 151 148 6 1966 86 172 162 7 1967 75 106 112 8 1968 70 276 230 9 1969 86 165 162 10 1970 83 222 198 11 1971 78 297 247 12 1972 87 288 248 13 1973 76 286 239 14 1974 86 231 202 15 1975 90 284 243 16 1976 76 190 175 17 1977 87 257 225 18 1978 88 128 133 19 1979 87 218 199 20 1980 81 206 184 21 1981 74 175 160 22 1982 85 202 187 23 1983 71 130 126 24 1984 80 225 200 25 1985 72 196 173 26 1986 76 261 222 27 1987 85 111 121 28 1988 83 247 210 29 1989 86 137 142 30 1990 82 159 152 31 1991 77 172 160 32 1992 74 280 231 33 1993 70 291 238 34 1994 77 126 125 35 1995 89 191 178 36 1996 83 298 248 37 1997 80 282 237 38 1998 86 219 195 39 1999 72 143 134 40 2000 79 262 221 41 2001 85 189 175 42 2002 86 205 186 43 2003 72 195 173 44 2004 78 148 146 45 2005 71 262 219 46 2006 88 255 226 47 2007 79 262 221 48 2008 73 198 176 49 2009 86 215 187 50 2010 87 127 129, 'marklar']
list.pop
removes the element at the given index:
stuff.pop(-2)
| | year | temperature | rainfall | mosquitos |
---|---|---|---|---|
0 | 1960 | 82 | 200 | 180 |
1 | 1961 | 70 | 227 | 194 |
2 | 1962 | 89 | 231 | 207 |
3 | 1963 | 74 | 114 | 121 |
4 | 1964 | 78 | 147 | 140 |
5 | 1965 | 85 | 151 | 148 |
6 | 1966 | 86 | 172 | 162 |
7 | 1967 | 75 | 106 | 112 |
8 | 1968 | 70 | 276 | 230 |
9 | 1969 | 86 | 165 | 162 |
10 | 1970 | 83 | 222 | 198 |
11 | 1971 | 78 | 297 | 247 |
12 | 1972 | 87 | 288 | 248 |
13 | 1973 | 76 | 286 | 239 |
14 | 1974 | 86 | 231 | 202 |
15 | 1975 | 90 | 284 | 243 |
16 | 1976 | 76 | 190 | 175 |
17 | 1977 | 87 | 257 | 225 |
18 | 1978 | 88 | 128 | 133 |
19 | 1979 | 87 | 218 | 199 |
20 | 1980 | 81 | 206 | 184 |
21 | 1981 | 74 | 175 | 160 |
22 | 1982 | 85 | 202 | 187 |
23 | 1983 | 71 | 130 | 126 |
24 | 1984 | 80 | 225 | 200 |
25 | 1985 | 72 | 196 | 173 |
26 | 1986 | 76 | 261 | 222 |
27 | 1987 | 85 | 111 | 121 |
28 | 1988 | 83 | 247 | 210 |
29 | 1989 | 86 | 137 | 142 |
30 | 1990 | 82 | 159 | 152 |
31 | 1991 | 77 | 172 | 160 |
32 | 1992 | 74 | 280 | 231 |
33 | 1993 | 70 | 291 | 238 |
34 | 1994 | 77 | 126 | 125 |
35 | 1995 | 89 | 191 | 178 |
36 | 1996 | 83 | 298 | 248 |
37 | 1997 | 80 | 282 | 237 |
38 | 1998 | 86 | 219 | 195 |
39 | 1999 | 72 | 143 | 134 |
40 | 2000 | 79 | 262 | 221 |
41 | 2001 | 85 | 189 | 175 |
42 | 2002 | 86 | 205 | 186 |
43 | 2003 | 72 | 195 | 173 |
44 | 2004 | 78 | 148 | 146 |
45 | 2005 | 71 | 262 | 219 |
46 | 2006 | 88 | 255 | 226 |
47 | 2007 | 79 | 262 | 221 |
48 | 2008 | 73 | 198 | 176 |
49 | 2009 | 86 | 215 | 187 |
50 | 2010 | 87 | 127 | 129 |
...while list.remove
removes the first instance of the given element in the list:
stuff.remove('chickenfingers')
So now our list looks like:
stuff
[<function len>, [...], 'marklar']
Now that we know how to do loops and use lists in Python, we'll use these to make a plot for each dataset. First, we'll import a module called glob:
import glob
We can use this to grab all the filenames matching a "globbing" pattern:
glob.glob("*.csv")
['A1_mosquito_data.csv', 'A3_mosquito_data.csv', 'B2_mosquito_data.csv', 'A2_mosquito_data.csv', 'B1_mosquito_data.csv']
"Globbing" is the name of the shell-completion patterns common for shells like sh
and bash
.
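The pattern can be as narrow as we like; for example, to grab only the 'A' sites (assuming the same filenames as listed above):

glob.glob('A*_mosquito_data.csv')    # just the three A-site files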
We can grab the block of code we wrote previously, and loop through it for each file obtained by globbing:
import glob
files = glob.glob('*.csv')
for file in files:
    print(file)
    data = pd.read_csv(file)
    fig = plt.figure(figsize=(4, 6))
    ax1 = fig.add_subplot(2, 1, 1)
    ax1.plot(data['temperature'], data['mosquitos'], 'ro')
    ax1.set_xlabel('Temperature')
    ax1.set_ylabel('Mosquitos')
    ax2 = fig.add_subplot(2, 1, 2)
    ax2.plot(data['rainfall'], data['mosquitos'], 'bs')
    ax2.set_xlabel('Rainfall')
    ax2.set_ylabel('Mosquitos')
    # tell matplotlib to render the plot *now*
    plt.show()
A1_mosquito_data.csv
A3_mosquito_data.csv
B2_mosquito_data.csv
A2_mosquito_data.csv
B1_mosquito_data.csv
It looks like there's a similar pattern for each area we have data for. We'd like to quantify more closely the relationship between mosquito population and each of these variables.
There are many statistics packages in Python, but one of the most well-developed and widely-used is statsmodels:
import statsmodels.api as sm
We'll use our second dataset for the initial stab at fitting a statistical model to our data:
data = pd.read_csv('A2_mosquito_data.csv')
And let's convert the temperatures to Celsius, for good measure:
data['temperature'] = (data['temperature'] - 32) * 5 / 9
We can build an ordinary least-squares fit to the data using statsmodels.OLS. Remember we can check the docs with:
sm.OLS?
We want to fit the relationship between mosquito population and, for a start, rainfall:
regr_results = sm.OLS(data['mosquitos'], data['rainfall']).fit()
regr_results.summary()
Dep. Variable: | mosquitos | R-squared: | 0.996 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.996 |
Method: | Least Squares | F-statistic: | 1.386e+04 |
Date: | Mon, 09 May 2016 | Prob (F-statistic): | 8.69e-63 |
Time: | 23:02:14 | Log-Likelihood: | -196.25 |
No. Observations: | 51 | AIC: | 394.5 |
Df Residuals: | 50 | BIC: | 396.4 |
Df Model: | 1 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
---|---|---|---|---|---|
rainfall | 0.8811 | 0.007 | 117.744 | 0.000 | 0.866 0.896 |
Omnibus: | 4.422 | Durbin-Watson: | 1.878 |
---|---|---|---|
Prob(Omnibus): | 0.110 | Jarque-Bera (JB): | 1.959 |
Skew: | -0.085 | Prob(JB): | 0.376 |
Kurtosis: | 2.055 | Cond. No. | 1.00 |
So, this gives us a lot of information. But is this what we want?
We actually want to quantify which variable correlates more with the mosquito population. One way we can build a multivariate model with statsmodels is to use a formula:
regr_results = sm.OLS.from_formula('mosquitos ~ temperature + rainfall', data).fit()
regr_results.summary()
Dep. Variable: | mosquitos | R-squared: | 0.997 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.997 |
Method: | Least Squares | F-statistic: | 7889. |
Date: | Mon, 09 May 2016 | Prob (F-statistic): | 3.68e-61 |
Time: | 23:02:15 | Log-Likelihood: | -111.54 |
No. Observations: | 51 | AIC: | 229.1 |
Df Residuals: | 48 | BIC: | 234.9 |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
| | coef | std err | t | P>\|t\| | [95.0% Conf. Int.] |
---|---|---|---|---|---|
Intercept | 17.5457 | 2.767 | 6.341 | 0.000 | 11.983 23.109 |
temperature | 0.8719 | 0.092 | 9.457 | 0.000 | 0.687 1.057 |
rainfall | 0.6967 | 0.006 | 125.385 | 0.000 | 0.686 0.708 |
Omnibus: | 1.651 | Durbin-Watson: | 1.872 |
---|---|---|---|
Prob(Omnibus): | 0.438 | Jarque-Bera (JB): | 0.906 |
Skew: | -0.278 | Prob(JB): | 0.636 |
Kurtosis: | 3.343 | Cond. No. | 1.92e+03 |
Each of these results can be accessed as attributes of the Results
object:
print(regr_results.params)
Intercept      17.545739
temperature     0.871943
rainfall        0.696717
dtype: float64
print(regr_results.rsquared)
0.996966873691
Our model "explains" most of the data. Nice! Let's plot the predicted population of mosquitos from our fitted model against the measured population to see how well we did:
import matplotlib.pyplot as plt
figure = plt.figure()
ax = figure.add_subplot(1,1,1)
parameters = regr_results.params
predicted = parameters['Intercept'] + parameters['temperature'] * data['temperature'] + parameters['rainfall'] * data['rainfall']
ax.plot(predicted, data['mosquitos'], 'ro')
ax.set_xlabel('predicted mosquito population')
ax.set_ylabel('measured mosquito population')
<matplotlib.text.Text at 0x7f338d98c908>
The more linear this plot is, the better our model fits the data. Also, as you might have guessed from the plots we made previously, rainfall is a better predictor than average temperature of how many mosquitos we expect to see. The t-statistics of the fit tell the same story, with rainfall's being by far the largest:
regr_results.tvalues
Intercept       6.341405
temperature     9.456614
rainfall      125.385116
dtype: float64
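As an aside (not an original lesson cell), the fitted Results object can also hand back its predictions directly, which should match the predicted values we computed by hand above:

regr_results.fittedvalues    # the model's prediction for each observation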
We can take our block of code we worked on previously, and make a three-panel plot instead of just a two-panel one:
filenames = glob.glob('*.csv')
for filename in filenames:
    print(filename)
    data = pd.read_csv(filename)
    # convert temperatures to celsius from fahrenheit
    data['temperature'] = (data['temperature'] - 32) * 5/9
    # perform fit
    regr_results = sm.OLS.from_formula('mosquitos ~ temperature + rainfall', data).fit()
    print(regr_results.tvalues)
    fig = plt.figure(figsize=(6, 9))
    # plot predicted vs. measured mosquito populations from fitted model
    parameters = regr_results.params
    predicted = parameters['Intercept'] + parameters['temperature'] * data['temperature'] + parameters['rainfall'] * data['rainfall']
    ax0 = fig.add_subplot(3, 1, 1)
    ax0.plot(predicted, data['mosquitos'], 'gd')
    ax0.set_xlabel('predicted mosquito population')
    ax0.set_ylabel('measured mosquito population')
    # plot population vs. temperature
    ax1 = fig.add_subplot(3, 1, 2)
    ax1.plot(data['temperature'], data['mosquitos'], 'ro')
    ax1.set_xlabel('Temperature')
    ax1.set_ylabel('Mosquitos')
    # plot population vs. rainfall
    ax2 = fig.add_subplot(3, 1, 3)
    ax2.plot(data['rainfall'], data['mosquitos'], 'bs')
    ax2.set_xlabel('Rainfall')
    ax2.set_ylabel('Mosquitos')
    plt.show()
A1_mosquito_data.csv
Intercept       3.589733
temperature     3.456159
rainfall       56.825601
dtype: float64