Numerical Analysis with NumPy

Materials created by:

Aron Ahmadia, Continuum Analytics

Peter Wang, Continuum Analytics

Ben Zaitlen, Continuum Analytics

Randy Olson, Michigan State Universty

Learning Objectives

  1. Understand NumPy
  2. Create simple arrays of data in NumPy
  3. Access and modify array elements
  4. Compute on arrays

The basics

In [1]:
sales = [0.75, 1.5, 2.0, 4.0, 5.0, 6.0, 10.0]
In [2]:
sales*2
Out[2]:
[0.75, 1.5, 2.0, 4.0, 5.0, 6.0, 10.0, 0.75, 1.5, 2.0, 4.0, 5.0, 6.0, 10.0]
In [3]:
sales+2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-45b722299797> in <module>()
----> 1 sales+2

TypeError: can only concatenate list (not "int") to list
In [4]:
import numpy as np
In [5]:
sales = np.array(sales)
print sales
[  0.75   1.5    2.     4.     5.     6.    10.  ]

Basic Math

In [6]:
sales+1
Out[6]:
array([  1.75,   2.5 ,   3.  ,   5.  ,   6.  ,   7.  ,  11.  ])
In [7]:
sales*2
Out[7]:
array([  1.5,   3. ,   4. ,   8. ,  10. ,  12. ,  20. ])
In [8]:
np.power(sales, 2)
Out[8]:
array([   0.5625,    2.25  ,    4.    ,   16.    ,   25.    ,   36.    ,
        100.    ])

What is NumPy?

Python library that provides multi-dimensional arrays, tables, and matrices for Python

  • Contiguous or strided arrays
  • Homogeneous (but types can be algebraic)
  • Arrays of records and nested records
  • Fast routines for array operations (C, ATLAS, MKL)

NumPy's Many Uses

  • Image and signal processing
  • Linear algebra
  • Data transformation and query
  • Time series analysis
  • Statistical analysis
  • Many more!

Create Arrays of Data in NumPy

Lots of methods of array initialization

  • python list
  • zeros
  • ones
  • diagonal
  • eye
  • random
In [9]:
np.zeros(5)
Out[9]:
array([ 0.,  0.,  0.,  0.,  0.])
In [10]:
np.ones(5)
Out[10]:
array([ 1.,  1.,  1.,  1.,  1.])
In [11]:
np.random.random(5)
Out[11]:
array([ 0.66648901,  0.89045666,  0.24164698,  0.3969619 ,  0.34864545])
In [12]:
np.diag([1,1,1,1,1])
Out[12]:
array([[1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1]])

NumPy also has methods for generating linearly/log spaced values

In [13]:
np.arange(0,10)
Out[13]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [14]:
np.arange(10,30,2)
Out[14]:
array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])
In [15]:
np.linspace(0,2,10)
Out[15]:
array([ 0.        ,  0.22222222,  0.44444444,  0.66666667,  0.88888889,
        1.11111111,  1.33333333,  1.55555556,  1.77777778,  2.        ])

NumPy has a type and a shape. The above array are either floats or ints, but these are infered. We could be explicit

In [16]:
np.ones(5)
Out[16]:
array([ 1.,  1.,  1.,  1.,  1.])
In [17]:
np.ones(5,dtype='int64')
Out[17]:
array([1, 1, 1, 1, 1], dtype=int64)

We also can define the shape of the array

In [18]:
np.ones(shape=(3,3),dtype='float64')
Out[18]:
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])
In [19]:
np.random.uniform(0,1,size=(5,5))
Out[19]:
array([[ 0.00304427,  0.84160095,  0.3167405 ,  0.56702975,  0.24500155],
       [ 0.42505285,  0.911683  ,  0.33022088,  0.52441459,  0.8219738 ],
       [ 0.44855388,  0.40749562,  0.5910397 ,  0.48225375,  0.16479927],
       [ 0.39055229,  0.32811335,  0.23763238,  0.16384718,  0.51039829],
       [ 0.88561498,  0.47630962,  0.15369293,  0.57210622,  0.00708663]])

Multidimensional properties

In [20]:
data = np.zeros(9)
print data
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.]
In [21]:
data.reshape((3,3,))
Out[21]:
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
In [22]:
data = data.reshape((3,3))
data[0,0] = 12
data[1,0] = 13
data[2,0] = 14
print data
[[ 12.   0.   0.]
 [ 13.   0.   0.]
 [ 14.   0.   0.]]

Data access is arr[row, column]

reshaping is just another view of the same data. no

In [23]:
#row selection
print data[0,:]
#column selection
print data[:,0]
[ 12.   0.   0.]
[ 12.  13.  14.]
In [24]:
data[:,1] = 0.2
print data
[[ 12.    0.2   0. ]
 [ 13.    0.2   0. ]
 [ 14.    0.2   0. ]]
In [25]:
data[:,2] = [.3,.4,9.12]
print data
[[ 12.     0.2    0.3 ]
 [ 13.     0.2    0.4 ]
 [ 14.     0.2    9.12]]

In last cell, shape of assignment must match shape of accessed data

Exercise 1

  1. Create a 4x4 array with numbers from a normal distribution with mean = 5 and standard deviation = 2.7.

  2. Sum the values of each column and store the result in a new array.

  3. Sum the values of each row and store the result in a new array.

  4. Sum all of the values and store the result in a new array.

In [26]:
 

Let's get our feet wet with some real data.

Import Modules

In [27]:
import numpy as np
import zipfile
import os
import urllib2

Download Data

We've gone ahead and bundled up a small bucket of this data for our analysis today at http://figshare.com/articles/GSOD_Weather_NOAA/743699

We can grab the data from the following permanent URL: http://files.figshare.com/1116528/weather.zip

In [28]:
dirname = 'weather'
if os.path.exists(dirname):
    print 'weather directory exists, skipping download/unzip'
else:
    os.mkdir(dirname)

    url = 'http://files.figshare.com/1116528/weather.zip'
    response = urllib2.urlopen(url)

    fname = 'weather.zip'
    with open(fname,'wb') as f:
        f.write(response.read())

    zfile = zipfile.ZipFile('weather.zip')
    
    for name in zfile.namelist()[1:]:
        print name
        with open(name,"w") as f:
            f.write(zfile.read(name))
weather/CALOSANG.txt
weather/CASANFRA.txt
weather/INSOBEND.txt
weather/KYLEXING.txt
weather/MDWASHDC.txt
weather/NYNEWYOR.txt
weather/WASEATTL.txt

What does the text file look like?

In [29]:
with open('weather/CALOSANG.txt','r') as f:
    print '\n'.join(f.readlines()[:3])
 1             1             1995         56.4

 1             2             1995         55.1

 1             3             1995         54.3

This text file is

  • Month
  • Day
  • Year
  • Temp

Load data file into numpy array and print first row

In [30]:
w_data = np.loadtxt('weather/CALOSANG.txt')
w_data[0] 
Out[30]:
array([  1.00000000e+00,   1.00000000e+00,   1.99500000e+03,
         5.64000000e+01])

We can do a bit better if we define a dtype. That is, the type for each column: string, int, float

In [31]:
dt = np.dtype([('Month', 'int8'), ('Day', 'int8'), ('Year', 'int16'), ('Temp', 'float64')])
w_data = np.loadtxt('weather/CALOSANG.txt',dtype=dt)
w_data[:5] 
Out[31]:
array([(1, 1, 1995, 56.4), (1, 2, 1995, 55.1), (1, 3, 1995, 54.3),
       (1, 4, 1995, 53.6), (1, 5, 1995, 56.6)], 
      dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])

Access and Modify Array Elements

In [32]:
dt = np.dtype([('Month', 'int8'), ('Day', 'int8'), ('Year', 'int16'), ('Temp', 'float64')])
data = np.loadtxt('weather/CALOSANG.txt',dtype=dt)

Slicing

  • first 5 rows
  • rows 30 35
  • every other row

Row Selection

In [33]:
data[0]
Out[33]:
(1, 1, 1995, 56.4)
In [34]:
data[0:2]
Out[34]:
array([(1, 1, 1995, 56.4), (1, 2, 1995, 55.1)], 
      dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])
In [35]:
data[30:35]
Out[35]:
array([(1, 31, 1995, 64.1), (2, 1, 1995, 63.9), (2, 2, 1995, 65.8),
       (2, 3, 1995, 66.1), (2, 4, 1995, 65.4)], 
      dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])

every other row

In [36]:
data[0:10:2] 
Out[36]:
array([(1, 1, 1995, 56.4), (1, 3, 1995, 54.3), (1, 5, 1995, 56.6),
       (1, 7, 1995, 53.5), (1, 9, 1995, 59.7)], 
      dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])

Last 5 rows

In [37]:
data[-5:]
Out[37]:
array([(6, 19, 2013, 66.8), (6, 20, 2013, 68.3), (6, 21, 2013, 68.4),
       (6, 22, 2013, 68.4), (6, 23, 2013, 66.1)], 
      dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])

Reverse entire array

In [38]:
data[::-1]
Out[38]:
array([(6, 23, 2013, 66.1), (6, 22, 2013, 68.4), (6, 21, 2013, 68.4), ...,
       (1, 3, 1995, 54.3), (1, 2, 1995, 55.1), (1, 1, 1995, 56.4)], 
      dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])

slicing

start:end:stride

Column selection

In [39]:
data['Temp']
Out[39]:
array([ 56.4,  55.1,  54.3, ...,  68.4,  68.4,  66.1])
In [40]:
print data['Year']
[1995 1995 1995 ..., 2013 2013 2013]

Exercise 2

  1. List the values of Temp for every Saturday.

  2. List the value of Temp for April 1st, 2001.

In [40]:
 

Plotting

In [41]:
%pylab inline
Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.
In [42]:
plt.figure(figsize=(6,4))
plt.plot(w_data['Temp'])
Out[42]:
[<matplotlib.lines.Line2D at 0x85fef28>]

Question: What is the max,min,standard deviation,average, and variance of temperature in 1995 compare to 2012?

NumPy has vectorized conditionals which result in boolean arrays

In [43]:
w_data['Year'] == 1995
Out[43]:
array([ True,  True,  True, ..., False, False, False], dtype=bool)

These arrays can be used an input into the original array selecting all the rows where the conditional is True

In [44]:
cond = w_data['Year'] == 1995
w_data[cond][:5]
Out[44]:
array([(1, 1, 1995, 56.4), (1, 2, 1995, 55.1), (1, 3, 1995, 54.3),
       (1, 4, 1995, 53.6), (1, 5, 1995, 56.6)], 
      dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])
In [45]:
#save resulting array in variable year1995
year1995 = w_data[cond] #or w_data[w_data['Year']==1995]

Compute on Arrays

NumPy has lots of builtin mathematics

  • avg/mean
  • std
  • min
  • max
In [46]:
year1995['Temp'].max(),year1995['Temp'].min(),year1995['Temp'].mean(),year1995['Temp'].std(),year1995['Temp'].var()
Out[46]:
(76.599999999999994,
 51.200000000000003,
 62.272876712328781,
 4.8976263413331891,
 23.986743779320719)

Do the same for 2012

In [47]:
year2012 = w_data[w_data['Year'] == 2012]
year2012['Temp'].max(),year2012['Temp'].min(),year2012['Temp'].mean(),year2012['Temp'].std(),year2012['Temp'].var()
Out[47]:
(81.5,
 49.700000000000003,
 62.709836065573725,
 6.0234842299293421,
 36.282362268207478)

What day was the hottest in 1995 compared to 2012?

  • argmax returns the index where the maximum occurs
  • argmin returns the index where the minimum occurs
In [48]:
np.argmax(year1995['Temp'])
Out[48]:
248
In [49]:
max_idx = np.argmax(year1995['Temp'])
year1995[max_idx]
Out[49]:
(9, 6, 1995, 76.6)
In [50]:
max_idx = np.argmax(year2012['Temp'])
year2012[max_idx]
Out[50]:
(9, 15, 2012, 81.5)

Exercise 3

Plot a visual comparison of the two years.

In [50]:
 

It may also be interesting to look at how how temperature changes from day to day.

In [51]:
arr1 = year1995['Temp'][:-1]
arr2 = year1995['Temp'][1:]
len(arr1),len(arr2)
Out[51]:
(364, 364)
In [52]:
deltaT = arr2-arr1
plt.plot(deltaT, '.');
In [53]:
plt.plot(year2012['Temp'],label='2012')
plt.plot(year1995['Temp'],label='1995')
plt.legend()
Out[53]:
<matplotlib.legend.Legend at 0x8753ef0>

Interesting spike around day 50

Where does that occur? We could look for the maximum value in year 1995 between 0 and 100

In [54]:
max_val = np.argmax(year1995['Temp'][:100])
year1995[max_val], max_val #Ha!  Looks like it occur exactly on day 50 of the year
Out[54]:
((2, 20, 1995, 73.6), 50)

Let's also return to the previous plot of all the years

In [55]:
plt.figure(figsize=(9,9))
plt.plot(w_data['Temp'])
Out[55]:
[<matplotlib.lines.Line2D at 0x8610b38>]

How should we handle those spikes?

  • fill with zeros
  • interporlate (cubic, linear)
  • forward/backward fill
  • fill with average

How do we find spikes?

In [56]:
min_idx = np.argmin(w_data['Temp']) #only returns one value
w_data['Temp'][min_idx], min_idx
Out[56]:
(-99.0, 1453)

np.where returns the indices where condition is met

In [57]:
np.where(w_data['Temp'] == -99.0)
Out[57]:
(array([1453, 1454, 1459, 1460, 1470, 2725, 2726, 2727, 2728, 2934, 2981,
       4622, 5015, 5212], dtype=int64),)
In [58]:
na_vals = np.where(w_data['Temp'] == -99.0)
In [59]:
w_data['Temp'][na_vals] = [32 for i in range(len(na_vals))]
In [60]:
plt.figure(figsize=(6,4))
plt.plot(w_data['Temp']);

Perhaps this is a good stopping point in our analysis and we want to save the state of the array/matrix

In [61]:
#saving data
np.savez('weather-data.npz',w_data)
In [62]:
#loading data
dataz = np.load('weather-data.npz')
In [63]:
dataz['arr_0']
Out[63]:
array([(1, 1, 1995, 56.4), (1, 2, 1995, 55.1), (1, 3, 1995, 54.3), ...,
       (6, 21, 2013, 68.4), (6, 22, 2013, 68.4), (6, 23, 2013, 66.1)], 
      dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])
In [ ]: