To pull in comma-separated datasets like those for our inflammation data, we make use of a package called numpy. This is the package to use for anything numerical, especially when working with whole arrays of data.
import numpy
numpy.loadtxt("inflammation-01.csv", delimiter=',')
array([[ 0.,  0.,  1., ...,  3.,  0.,  0.],
       [ 0.,  1.,  2., ...,  1.,  0.,  1.],
       [ 0.,  1.,  1., ...,  2.,  1.,  1.],
       ...,
       [ 0.,  1.,  1., ...,  1.,  1.,  1.],
       [ 0.,  0.,  0., ...,  0.,  2.,  0.],
       [ 0.,  0.,  1., ...,  1.,  1.,  0.]])
We can load one of the inflammation data sets using the numpy.loadtxt function. This function reads the data from disk and displays it, but without attaching a name to the result, we have no way to refer to it again. A bit about names...
The equal sign is known as the "assignment" operator. For example, given some weight in kg, we can attach a name to that weight like:
weight_kg = 55
print weight_kg
55
The value 55 has the name weight_kg attached to it. To get the value, we need only use the name. We can use this in other expressions that may assign to other names:
weight_lb = 2.2 * weight_kg
weight_lb
121.00000000000001
weight_lb is calculated from weight_kg. What happens to weight_lb if we change the value attached to weight_kg?
weight_kg = 66
weight_lb
121.00000000000001
Nothing! That's because weight_lb only remembers the value assigned to it, not how that value was obtained. To update its value, it must be recalculated:
weight_lb = 2.2 * weight_kg
weight_lb
145.20000000000002
We can assign a name to our data array, which quantifies the inflammation observed in a set of patients (rows) over succeeding days (columns), so we can begin working with it in memory:
data = numpy.loadtxt('inflammation-01.csv', delimiter=',')
print data
[[ 0.  0.  1. ...,  3.  0.  0.]
 [ 0.  1.  2. ...,  1.  0.  1.]
 [ 0.  1.  1. ...,  2.  1.  1.]
 ...,
 [ 0.  1.  1. ...,  1.  1.  1.]
 [ 0.  0.  0. ...,  0.  2.  0.]
 [ 0.  0.  1. ...,  1.  1.  0.]]
Arrays are rectangular datasets in which every element has the same data type (in this case, floating-point numbers). They have attributes such as:
data.shape
(60, 40)
and methods to calculate quantities from their elements such as:
print "mean for all values:", data.mean()
mean for all values: 6.14875
Many of their methods can be fed parameters to change their behavior. For example, instead of the mean of all elements in the array, perhaps we want the mean of each row (that is, for each patient, calculated across all days):
print "mean for each patient:\n",
print data.mean(axis=1)
mean for each patient: [ 5.45 5.425 6.1 5.9 5.55 6.225 5.975 6.65 6.625 6.525 6.775 5.8 6.225 5.75 5.225 6.3 6.55 5.7 5.85 6.55 5.775 5.825 6.175 6.1 5.8 6.425 6.05 6.025 6.175 6.55 6.175 6.35 6.725 6.125 7.075 5.725 5.925 6.15 6.075 5.75 5.975 5.725 6.3 5.9 6.75 5.925 7.225 6.15 5.95 6.275 5.7 6.1 6.825 5.975 6.725 5.7 6.25 6.4 7.05 5.9 ]
Notice the shape of the output array:
data.mean(axis=1).shape
(60,)
If you want to know what other parameters a function or method takes, put a question mark at the end and run the cell. The documentation will pop up in the notebook:
# displays the docstring
data.mean?
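The question mark is notebook-specific, but the text it displays comes from the method's docstring, which is reachable in any Python session. A minimal sketch, assuming only the standard __doc__ attribute and the help builtin:

```python
import numpy

# the text that data.mean? displays lives on the method's docstring;
# help(numpy.ndarray.mean) would print this same text in a plain Python session
doc = numpy.ndarray.mean.__doc__

print(doc.splitlines()[0])
```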
We can also get the mean across all patients for each day:
print "mean for each day:\n",
print data.mean(axis=0)
mean for each day: [ 0. 0.45 1.11666667 1.75 2.43333333 3.15 3.8 3.88333333 5.23333333 5.51666667 5.95 5.9 8.35 7.73333333 8.36666667 9.5 9.58333333 10.63333333 11.56666667 12.35 13.25 11.96666667 11.03333333 10.16666667 10. 8.66666667 9.15 7.25 7.33333333 6.58333333 6.06666667 5.95 5.11666667 3.6 3.3 3.56666667 2.48333333 1.5 1.13333333 0.56666667]
Note: python indices are 0-based, meaning counting goes as 0, 1, 2, 3...; this means that the first axis (rows) is axis 0, the second axis (columns) is axis 1.
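To keep the two axes straight, it helps to try both on a small throwaway array (the numbers below are made up purely for illustration):

```python
import numpy

# 2 rows (axis 0) by 3 columns (axis 1)
small = numpy.array([[1., 2., 3.],
                     [4., 5., 6.]])

print(small.mean(axis=0))   # collapses the rows: one mean per column -> [2.5, 3.5, 4.5]
print(small.mean(axis=1))   # collapses the columns: one mean per row -> [2., 5.]
```

Note the shapes: mean(axis=0) gives a length-3 result (one per column), while mean(axis=1) gives a length-2 result (one per row).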
How do we get at individual elements in an array? We can use indices to grab them. To get the inflammation value for the fifth patient on the 10th day, we can use:
data[4, 9]
4.0
Remember that python uses 0-based indices! The data for the first patient on the first day is:
data[0, 0]
0.0
We can select subsets of our data with slicing. To select only the first 5 rows (patients) of the array, we can do:
data[0:5]
array([[ 0., 0., 1., 3., 1., 2., 4., 7., 8., 3., 3., 3., 10., 5., 7., 4., 7., 7., 12., 18., 6., 13., 11., 11., 7., 7., 4., 6., 8., 8., 4., 4., 5., 7., 3., 4., 2., 3., 0., 0.],
       [ 0., 1., 2., 1., 2., 1., 3., 2., 2., 6., 10., 11., 5., 9., 4., 4., 7., 16., 8., 6., 18., 4., 12., 5., 12., 7., 11., 5., 11., 3., 3., 5., 4., 4., 5., 5., 1., 1., 0., 1.],
       [ 0., 1., 1., 3., 3., 2., 6., 2., 5., 9., 5., 7., 4., 5., 4., 15., 5., 11., 9., 10., 19., 14., 12., 17., 7., 12., 11., 7., 4., 2., 10., 5., 4., 2., 2., 3., 2., 2., 1., 1.],
       [ 0., 0., 2., 0., 4., 2., 2., 1., 6., 7., 10., 7., 9., 13., 8., 8., 15., 10., 10., 7., 17., 4., 4., 7., 6., 15., 6., 4., 9., 11., 3., 5., 6., 3., 3., 4., 2., 3., 2., 1.],
       [ 0., 1., 1., 3., 3., 1., 3., 5., 2., 4., 4., 7., 6., 5., 3., 10., 8., 10., 6., 17., 9., 14., 9., 7., 13., 9., 12., 6., 7., 7., 9., 6., 3., 2., 2., 4., 2., 0., 1., 1.]])
data[0:5].shape
(5, 40)
This slice should be read as "get the 0th element up to and not including the 5th element". The "not including" is important, and the cause of much initial frustration. It does take some getting used to.
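A handy consequence of the half-open convention: a slice n:m always contains m - n elements, and adjacent slices tile a sequence with no overlaps or gaps. A quick sketch with a small made-up array:

```python
import numpy

days = numpy.arange(10)        # [0 1 2 3 4 5 6 7 8 9]

print(days[0:5])               # elements 0 through 4; element 5 is excluded
print(len(days[2:7]))          # 7 - 2 = 5 elements

# adjacent slices reassemble the original exactly
print((days == numpy.concatenate([days[0:5], days[5:10]])).all())
```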
We can also slice the columns; to get only the data for the first five rows (first five patients) and columns 9 through 12 (10th through 13th day), we can use:
data[0:5, 9:13]
array([[  3.,   3.,   3.,  10.],
       [  6.,  10.,  11.,   5.],
       [  9.,   5.,   7.,   4.],
       [  7.,  10.,   7.,   9.],
       [  4.,   4.,   7.,   6.]])
Since the output of a sliced array is another array, we can run methods on these just like before. For example, we can get the standard deviation across days for each patient for this subset:
data[0:5, 9:13].std(axis=1)
array([ 3.03108891, 2.54950976, 1.92028644, 1.29903811, 1.29903811])
We can treat arrays as if they are single numbers in arithmetic operations. For example, multiplying the array by 2 doubles each element of the array:
double = data * 2
data
array([[ 0.,  0.,  1., ...,  3.,  0.,  0.],
       [ 0.,  1.,  2., ...,  1.,  0.,  1.],
       [ 0.,  1.,  1., ...,  2.,  1.,  1.],
       ...,
       [ 0.,  1.,  1., ...,  1.,  1.,  1.],
       [ 0.,  0.,  0., ...,  0.,  2.,  0.],
       [ 0.,  0.,  1., ...,  1.,  1.,  0.]])
double
array([[ 0.,  0.,  2., ...,  6.,  0.,  0.],
       [ 0.,  2.,  4., ...,  2.,  0.,  2.],
       [ 0.,  2.,  2., ...,  4.,  2.,  2.],
       ...,
       [ 0.,  2.,  2., ...,  2.,  2.,  2.],
       [ 0.,  0.,  0., ...,  0.,  4.,  0.],
       [ 0.,  0.,  2., ...,  2.,  2.,  0.]])
Arithmetic operations can also be applied between arrays of the same shape; the operation is then performed on corresponding elements (this holds for multiplication and division just as for addition and subtraction).
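As a minimal sketch with two small hand-made arrays (unrelated to the inflammation data), each element pairs with the element at the same position in the other array:

```python
import numpy

a = numpy.array([1., 2., 3.])
b = numpy.array([10., 20., 30.])

print(a + b)    # element-by-element sum:     [11. 22. 33.]
print(a * b)    # element-by-element product: [10. 40. 90.]
```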
triple = double + data
triple
array([[ 0.,  0.,  3., ...,  9.,  0.,  0.],
       [ 0.,  3.,  6., ...,  3.,  0.,  3.],
       [ 0.,  3.,  3., ...,  6.,  3.,  3.],
       ...,
       [ 0.,  3.,  3., ...,  3.,  3.,  3.],
       [ 0.,  0.,  0., ...,  0.,  6.,  0.],
       [ 0.,  0.,  3., ...,  3.,  3.,  0.]])
This amounts to $3x_{i,j}^2$, for each element $x_{i,j}$ of the array:
triple * data
array([[  0.,   0.,   3., ...,  27.,   0.,   0.],
       [  0.,   3.,  12., ...,   3.,   0.,   3.],
       [  0.,   3.,   3., ...,  12.,   3.,   3.],
       ...,
       [  0.,   3.,   3., ...,   3.,   3.,   3.],
       [  0.,   0.,   0., ...,   0.,  12.,   0.],
       [  0.,   0.,   3., ...,   3.,   3.,   0.]])
Interested in performing matrix operations between arrays? There are methods built into numpy for that. For example, the inner product:
numpy.inner(triple, data)
array([[ 5388.,  4266.,  5145., ...,  5802.,  5841.,  5220.],
       [ 4266.,  5775.,  5463., ...,  5307.,  5895.,  5019.],
       [ 5145.,  5463.,  7182., ...,  6444.,  6930.,  5931.],
       ...,
       [ 5802.,  5307.,  6444., ...,  7320.,  7122.,  6423.],
       [ 5841.,  5895.,  6930., ...,  7122.,  8628.,  6771.],
       [ 5220.,  5019.,  5931., ...,  6423.,  6771.,  6654.]])
Let's try viewing this array graphically. Just as numpy is the de-facto library for numerically-intensive work, matplotlib is the de-facto library for producing high-quality plots from numerical data. A saying often goes that matplotlib makes easy things easy and hard things possible when it comes to making plots.
import matplotlib.pyplot as plt
%matplotlib inline
We can throw our data into the matplotlib.pyplot.imshow function to view it as a heatmap:
plt.imshow(data)
<matplotlib.image.AxesImage at 0x7fa6ab0c1590>
We can view the mean inflammation value across patients for each day directly using the matplotlib.pyplot.plot function.
plt.plot(data.mean(axis=0))
[<matplotlib.lines.Line2D at 0x7fa6aafce310>]
Huh...that looks a bit orderly. Almost completely linear with a positive slope to precisely day 20, then back down with about the same slope to day 40...
We can try plotting the max value across all patients for each day:
plt.plot(data.max(axis=0))
[<matplotlib.lines.Line2D at 0x7fa6a969c550>]
That's fishy...
The min across all patients for each day?
plt.plot(data.min(axis=0))
[<matplotlib.lines.Line2D at 0x7fa6a9650350>]
A step function, with steps that appear to go in increments of 4 or so across the whole domain of days. Weird...
Matplotlib can make figures that include multiple sub-figures, referred to as axes. We can put the three plots above side-by-side with something like:
# make a figure object
fig = plt.figure(figsize=(10,3))
# make an axis object for the figure, each with a different position
# on a grid that has 1 row and 3 columns of axes
ax1 = fig.add_subplot(1, 3, 1)
ax2 = fig.add_subplot(1, 3, 2)
ax3 = fig.add_subplot(1, 3, 3)
# plot the means
ax1.set_ylabel('average')
ax1.plot(data.mean(axis=0))
# plot the maxes
ax2.set_ylabel('max')
ax2.plot(data.max(axis=0))
# plot the mins
ax3.set_ylabel('min')
ax3.plot(data.min(axis=0))
fig.tight_layout()
It looks like the books were cooked for this data set. What about the other 11? Instead of checking them manually, we can let the computer do what it's good at: doing the same thing repetitively. We need a loop.
First, an aside with strings. Most programming languages generally deal in a few basic data types, including integers, floating-point numbers, and characters. A string is a sequence of characters, and we can manipulate them easily with python. For example:
word = 'lead'
We have a string 'lead' that we've assigned the name word. We can examine the string's type, and indeed that of any object in python, with the type builtin:
print type(word)
<type 'str'>
We can also access its characters (it's a sequence!) with indexing:
word[0]
'l'
word[1]
'e'
And slicing:
word[1:]
'ead'
If I want to output each letter of word on a new line, I could do:
print word[0]
print word[1]
print word[2]
print word[3]
l
e
a
d
But this doesn't scale to words of arbitrary length. Instead, let's let the computer do the boring work of iterating with a for loop:
for letter in word:
    print letter
l
e
a
d
This same bit of code will work for any word, including ridiculously long ones. Like this gem:
word = 'supercallifragilisticexpialidocious'
print word
supercallifragilisticexpialidocious
for letter in word:
    print letter
s
u
p
e
r
c
a
l
l
i
f
r
a
g
i
l
i
s
t
i
c
e
x
p
i
a
l
i
d
o
c
i
o
u
s
We can do more with loops than iterate through the elements of a sequence. A common use-case is building up a count or a sum. For example, counting the number of letters in word:
counter = 0
for letter in word:
    counter = counter + 1
print counter
35
For this very particular use, there's a python built-in, which is guaranteed to be more efficient. Use it instead:
len(word)
35
It works on any sequence! For arrays it gives the length of the first axis, in this case the number of rows:
len(data)
60
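The relationship between len and shape is easy to check on a small made-up array: len reports only the first axis, while shape reports all of them:

```python
import numpy

# a hypothetical 3-row, 4-column array of zeros standing in for a dataset
grid = numpy.zeros((3, 4))

print(len(grid))     # 3 -- just the number of rows
print(grid.shape)    # (3, 4) -- rows and columns; len(grid) == grid.shape[0]
```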
Loops work best when they have something to iterate over. In python, the list is a built-in data structure that serves as a container for a sequence of other objects. We can make an empty list with:
stuff = list()
print stuff
[]
And we can append things to the end of the list, in this case a bunch of random strings:
stuff.append('marklar')
stuff.append('chewbacca')
stuff.append('chickenfingers')
stuff
['marklar', 'chewbacca', 'chickenfingers']
Just like with the word example, we can iterate through the elements of the list:
for item in stuff:
    print item
marklar
chewbacca
chickenfingers
And slicing works too! Slicing with negative numbers counts backwards from the end of the list. The slice stuff[-2:] could be read as "get the 2nd-to-last element of stuff, through the last element":
for item in stuff[-2:]:
    print item
chewbacca
chickenfingers
So indexing with -1 gives the last element:
stuff[-1]
'chickenfingers'
And consistently, but uselessly, slicing from and to the same index yields nothing. Remember, a slice n:m reads as "get all elements starting with element n and up to but not including element m."
stuff[0:0]
[]
Lists are mutable: that is, their elements can be changed. For example, we can replace 'chewbacca' with 'hansolo':
stuff[1]
'chewbacca'
stuff[1] = 'hansolo'
stuff[1]
'hansolo'
...or replace 'hansolo' with our data array:
stuff[1] = data
stuff[1]
array([[ 0.,  0.,  1., ...,  3.,  0.,  0.],
       [ 0.,  1.,  2., ...,  1.,  0.,  1.],
       [ 0.,  1.,  1., ...,  2.,  1.,  1.],
       ...,
       [ 0.,  1.,  1., ...,  1.,  1.,  1.],
       [ 0.,  0.,  0., ...,  0.,  2.,  0.],
       [ 0.,  0.,  1., ...,  1.,  1.,  0.]])
print stuff
['marklar', array([[ 0., 0., 1., ..., 3., 0., 0.], [ 0., 1., 2., ..., 1., 0., 1.], [ 0., 1., 1., ..., 2., 1., 1.], ..., [ 0., 1., 1., ..., 1., 1., 1.], [ 0., 0., 0., ..., 0., 2., 0.], [ 0., 0., 1., ..., 1., 1., 0.]]), 'chickenfingers']
We can make another list, including stuff as an element:
morestuff = ['logs', stuff]
morestuff
['logs', ['marklar', array([[ 0., 0., 1., ..., 3., 0., 0.], [ 0., 1., 2., ..., 1., 0., 1.], [ 0., 1., 1., ..., 2., 1., 1.], ..., [ 0., 1., 1., ..., 1., 1., 1.], [ 0., 0., 0., ..., 0., 2., 0.], [ 0., 0., 1., ..., 1., 1., 0.]]), 'chickenfingers']]
Lists can contain lists can contain lists can contain lists...and of course any number of other python objects. Even functions can be included as elements:
morestuff.append(len)
morestuff
['logs', ['marklar', array([[ 0., 0., 1., ..., 3., 0., 0.], [ 0., 1., 2., ..., 1., 0., 1.], [ 0., 1., 1., ..., 2., 1., 1.], ..., [ 0., 1., 1., ..., 1., 1., 1.], [ 0., 0., 0., ..., 0., 2., 0.], [ 0., 0., 1., ..., 1., 1., 0.]]), 'chickenfingers'], <function len>]
Want the third element of the second element of morestuff?
morestuff[1][2]
'chickenfingers'
Lists also have their own methods, which usually alter the list itself:
stuff
['marklar', array([[ 0., 0., 1., ..., 3., 0., 0.], [ 0., 1., 2., ..., 1., 0., 1.], [ 0., 1., 1., ..., 2., 1., 1.], ..., [ 0., 1., 1., ..., 1., 1., 1.], [ 0., 0., 0., ..., 0., 2., 0.], [ 0., 0., 1., ..., 1., 1., 0.]]), 'chickenfingers']
stuff.reverse()
stuff
['chickenfingers', array([[ 0., 0., 1., ..., 3., 0., 0.], [ 0., 1., 2., ..., 1., 0., 1.], [ 0., 1., 1., ..., 2., 1., 1.], ..., [ 0., 1., 1., ..., 1., 1., 1.], [ 0., 0., 0., ..., 0., 2., 0.], [ 0., 0., 1., ..., 1., 1., 0.]]), 'marklar']
list.pop removes the element at the given index, returning it:
stuff.pop(1)
array([[ 0.,  0.,  1., ...,  3.,  0.,  0.],
       [ 0.,  1.,  2., ...,  1.,  0.,  1.],
       [ 0.,  1.,  1., ...,  2.,  1.,  1.],
       ...,
       [ 0.,  1.,  1., ...,  1.,  1.,  1.],
       [ 0.,  0.,  0., ...,  0.,  2.,  0.],
       [ 0.,  0.,  1., ...,  1.,  1.,  0.]])
...while list.remove removes the first instance of the given element in the list:
stuff.remove('chickenfingers')
So now our list looks like:
stuff
['marklar']
We noticed what looked like fraudulent data in the first dataset we analyzed. To look at more datasets without manually running our block of plotting code on each, we'll take advantage of loops and lists.
To grab all the filenames for our inflammation data sets, we'll use the glob library. It includes a function (glob.glob) that returns a list of filenames matching a "glob" pattern, the same kind of pattern we used for matching files in the bash shell.
import glob
# this will get us all the inflammation data files; we don't care about order
print glob.glob('inflammation*.csv')
['inflammation-07.csv', 'inflammation-04.csv', 'inflammation-01.csv', 'inflammation-05.csv', 'inflammation-08.csv', 'inflammation-10.csv', 'inflammation-09.csv', 'inflammation-06.csv', 'inflammation-12.csv', 'inflammation-03.csv', 'inflammation-02.csv', 'inflammation-11.csv']
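Glob patterns can be experimented with even without touching the filesystem: the standard-library fnmatch module implements the same matching rules. A sketch with made-up filenames, showing which names glob.glob would keep:

```python
import fnmatch

names = ['inflammation-01.csv', 'inflammation-02.csv',
         'small-01.csv', 'notes.txt']

# keep only the names matching the pattern, as glob.glob does for real files
matches = fnmatch.filter(names, 'inflammation*.csv')
print(matches)    # ['inflammation-01.csv', 'inflammation-02.csv']
```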
Stealing our code block for the plots above and putting it in a loop over filenames, we can look at a whole series of figures. We can slice the list of filenames to limit how many we plot at one time.
# get filenames
filenames = glob.glob('inflammation*.csv')

# iterate through each filename, perhaps only a slice of them
for filename in filenames[0:7]:
    # so we know what we're looking at
    print filename

    # load data
    data = numpy.loadtxt(fname=filename, delimiter=',')

    # make a figure object, with 3 axes in 1 row, 3 columns
    fig = plt.figure(figsize=(10,3))
    ax1 = fig.add_subplot(1, 3, 1)
    ax2 = fig.add_subplot(1, 3, 2)
    ax3 = fig.add_subplot(1, 3, 3)

    # plot means
    ax1.set_ylabel('average')
    ax1.plot(data.mean(axis=0))

    # plot maxes
    ax2.set_ylabel('max')
    ax2.plot(data.max(axis=0))

    # plot mins
    ax3.set_ylabel('min')
    ax3.plot(data.min(axis=0))

    # this is useful for making the plot elements not overlap
    fig.tight_layout()

    # this will tell matplotlib to draw the figure now, instead
    # of letting the notebook wait to draw them all at once when the cell
    # finishes executing
    plt.show()

print "done"
inflammation-07.csv
inflammation-04.csv
inflammation-01.csv
inflammation-05.csv
inflammation-08.csv
inflammation-10.csv
inflammation-09.csv
done
At least one new pattern emerged: all mins were zero for one dataset! It only gets weirder...
We don't have time to cover pandas in this workshop, but it is quickly becoming the de-facto tool for working with tabular data interactively, and it adds great power to numpy, on top of which it is built. Among its features are clean handling of datasets with missing values (non-rectangular data), joining datasets along non-integer indices, grouping datasets by column values, and many convenience methods for quickly generating useful visualizations. There's plenty more, but you'll have to explore it on your own for now.
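As a tiny taste of two of those features (missing values and grouping), here's a made-up table; the column names and values are invented purely for the example:

```python
import numpy
import pandas as pd

df_demo = pd.DataFrame({'patient': ['a', 'a', 'b', 'b'],
                        'day':     [1, 2, 1, 2],
                        'value':   [3.0, numpy.nan, 5.0, 7.0]})

# groupby collects rows sharing a 'patient' value, and missing
# values (NaN) are skipped by default when aggregating
means = df_demo.groupby('patient')['value'].mean()
print(means['a'])    # 3.0 -- the NaN is ignored
print(means['b'])    # 6.0 -- (5.0 + 7.0) / 2
```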
Here's a quick demo: say we want to examine how the distribution of inflammation values across patients changes with each passing day. We can do this easily by putting the data in a pandas.DataFrame, then plotting directly:
import pandas as pd
df = pd.DataFrame(data)
We can look at what the top of our data looks like as a DataFrame:
df.head()
|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
|---|---|---|---|---|---|---|---|---|---|-----|----|----|----|----|----|----|----|----|----|----|
| 0 | 0 | 0 | 0 | 2 | 4 | 5 | 5 | 4 | 4 | 6 | ... | 3 | 5 | 3 | 1 | 5 | 4 | 4 | 3 | 2 | 1 |
| 1 | 0 | 1 | 0 | 1 | 3 | 1 | 5 | 3 | 8 | 5 | ... | 8 | 5 | 5 | 4 | 5 | 2 | 2 | 3 | 2 | 1 |
| 2 | 0 | 1 | 2 | 2 | 4 | 1 | 4 | 2 | 7 | 5 | ... | 2 | 4 | 3 | 7 | 4 | 5 | 3 | 0 | 2 | 1 |
| 3 | 0 | 1 | 1 | 2 | 2 | 1 | 5 | 3 | 5 | 6 | ... | 3 | 8 | 2 | 4 | 6 | 2 | 2 | 3 | 0 | 1 |
| 4 | 0 | 1 | 2 | 3 | 3 | 5 | 3 | 4 | 4 | 6 | ... | 9 | 3 | 7 | 1 | 6 | 1 | 3 | 0 | 1 | 1 |
5 rows × 40 columns
We'll quickly plot a histogram for each day from 1 to 20 (the rise) to see how the distributions shift with day. Cyan distributions are closer to day 1; magenta distributions are closer to day 20.
df.iloc[:, 1:21].plot(kind='hist', legend=False, xlim=(0, 20), bins=range(0, 21, 1), colormap='cool', alpha=.2, histtype='stepfilled', figsize=(13,4))
<matplotlib.axes._subplots.AxesSubplot at 0x7fa6a9038d50>
We can also plot day 21 through day 39 (the fall) to see how the distributions shift with day. Cyan distributions are closer to day 21; magenta distributions are closer to day 39.
df.iloc[:, 21:40].plot(kind='hist', legend=False, xlim=(0, 20), bins=range(0, 21, 1), colormap='cool', alpha=.2, histtype='stepfilled', figsize=(13,4))
<matplotlib.axes._subplots.AxesSubplot at 0x7fa6a8b3a0d0>