Numerical Analysis with NumPy¶

Materials created by:

Aron Ahmadia, Continuum Analytics

Peter Wang, Continuum Analytics

Ben Zaitlen, Continuum Analytics

Randy Olson, Michigan State Universty

Learning Objectives¶

Understand NumPy
Create simple arrays of data in NumPy
Access and modify array elements
Compute on arrays

The basics¶

In [1]:

sales = [0.75, 1.5, 2.0, 4.0, 5.0, 6.0, 10.0]

In [2]:

sales*2

Out[2]:

[0.75, 1.5, 2.0, 4.0, 5.0, 6.0, 10.0, 0.75, 1.5, 2.0, 4.0, 5.0, 6.0, 10.0]

In [3]:

sales+2

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-45b722299797> in <module>()
----> 1 sales+2

TypeError: can only concatenate list (not "int") to list

In [4]:

import numpy as np

In [5]:

sales = np.array(sales)
print sales

[  0.75   1.5    2.     4.     5.     6.    10.  ]

Basic Math¶

In [6]:

sales+1

Out[6]:

array([  1.75,   2.5 ,   3.  ,   5.  ,   6.  ,   7.  ,  11.  ])

In [7]:

sales*2

Out[7]:

array([  1.5,   3. ,   4. ,   8. ,  10. ,  12. ,  20. ])

In [8]:

np.power(sales, 2)

Out[8]:

array([   0.5625,    2.25  ,    4.    ,   16.    ,   25.    ,   36.    ,
        100.    ])

What is NumPy?¶

Python library that provides multi-dimensional arrays, tables, and matrices for Python

Contiguous or strided arrays
Homogeneous (but types can be algebraic)
Arrays of records and nested records
Fast routines for array operations (C, ATLAS, MKL)

NumPy's Many Uses¶

Image and signal processing
Linear algebra
Data transformation and query
Time series analysis
Statistical analysis
Many more!

Create Arrays of Data in NumPy¶

Lots of methods of array initialization

python list
zeros
ones
diagonal
eye
random

In [9]:

np.zeros(5)

Out[9]:

array([ 0.,  0.,  0.,  0.,  0.])

In [10]:

np.ones(5)

Out[10]:

array([ 1.,  1.,  1.,  1.,  1.])

In [11]:

np.random.random(5)

Out[11]:

array([ 0.66648901,  0.89045666,  0.24164698,  0.3969619 ,  0.34864545])

In [12]:

np.diag([1,1,1,1,1])

Out[12]:

array([[1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1]])

NumPy also has methods for generating linearly/log spaced values

In [13]:

np.arange(0,10)

Out[13]:

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [14]:

np.arange(10,30,2)

Out[14]:

array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])

In [15]:

np.linspace(0,2,10)

Out[15]:

array([ 0.        ,  0.22222222,  0.44444444,  0.66666667,  0.88888889,
        1.11111111,  1.33333333,  1.55555556,  1.77777778,  2.        ])

NumPy has a type and a shape. The above array are either floats or ints, but these are infered. We could be explicit

In [16]:

np.ones(5)

Out[16]:

array([ 1.,  1.,  1.,  1.,  1.])

In [17]:

np.ones(5,dtype='int64')

Out[17]:

array([1, 1, 1, 1, 1], dtype=int64)

We also can define the shape of the array

In [18]:

np.ones(shape=(3,3),dtype='float64')

Out[18]:

array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

In [19]:

np.random.uniform(0,1,size=(5,5))

Out[19]:

array([[ 0.00304427,  0.84160095,  0.3167405 ,  0.56702975,  0.24500155],
       [ 0.42505285,  0.911683  ,  0.33022088,  0.52441459,  0.8219738 ],
       [ 0.44855388,  0.40749562,  0.5910397 ,  0.48225375,  0.16479927],
       [ 0.39055229,  0.32811335,  0.23763238,  0.16384718,  0.51039829],
       [ 0.88561498,  0.47630962,  0.15369293,  0.57210622,  0.00708663]])

Multidimensional properties¶

In [20]:

data = np.zeros(9)
print data

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.]

In [21]:

data.reshape((3,3,))

Out[21]:

array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [22]:

data = data.reshape((3,3))
data[0,0] = 12
data[1,0] = 13
data[2,0] = 14
print data

[[ 12.   0.   0.]
 [ 13.   0.   0.]
 [ 14.   0.   0.]]

Data access is arr[row, column]

reshaping is just another view of the same data. no

In [23]:

#row selection
print data[0,:]
#column selection
print data[:,0]

[ 12.   0.   0.]
[ 12.  13.  14.]

In [24]:

data[:,1] = 0.2
print data

[[ 12.    0.2   0. ]
 [ 13.    0.2   0. ]
 [ 14.    0.2   0. ]]

In [25]:

data[:,2] = [.3,.4,9.12]
print data

[[ 12.     0.2    0.3 ]
 [ 13.     0.2    0.4 ]
 [ 14.     0.2    9.12]]

In last cell, shape of assignment must match shape of accessed data

Exercise 1¶

Create a 4x4 array with numbers from a normal distribution with mean = 5 and standard deviation = 2.7.
Sum the values of each column and store the result in a new array.
Sum the values of each row and store the result in a new array.
Sum all of the values and store the result in a new array.

In [26]:

Let's get our feet wet with some real data.

Import Modules¶

In [27]:

import numpy as np
import zipfile
import os
import urllib2

Download Data¶

We've gone ahead and bundled up a small bucket of this data for our analysis today at http://figshare.com/articles/GSOD_Weather_NOAA/743699

We can grab the data from the following permanent URL: http://files.figshare.com/1116528/weather.zip

In [28]:

dirname = 'weather'
if os.path.exists(dirname):
    print 'weather directory exists, skipping download/unzip'
else:
    os.mkdir(dirname)

    url = 'http://files.figshare.com/1116528/weather.zip'
    response = urllib2.urlopen(url)

    fname = 'weather.zip'
    with open(fname,'wb') as f:
        f.write(response.read())

    zfile = zipfile.ZipFile('weather.zip')
    
    for name in zfile.namelist()[1:]:
        print name
        with open(name,"w") as f:
            f.write(zfile.read(name))

weather/CALOSANG.txt
weather/CASANFRA.txt
weather/INSOBEND.txt
weather/KYLEXING.txt
weather/MDWASHDC.txt
weather/NYNEWYOR.txt
weather/WASEATTL.txt

What does the text file look like?¶

In [29]:

with open('weather/CALOSANG.txt','r') as f:
    print '\n'.join(f.readlines()[:3])

 1             1             1995         56.4

 1             2             1995         55.1

 1             3             1995         54.3

This text file is

Month
Day
Year
Temp

Load data file into numpy array and print first row¶

In [30]:

w_data = np.loadtxt('weather/CALOSANG.txt')
w_data[0] 

Out[30]:

array([  1.00000000e+00,   1.00000000e+00,   1.99500000e+03,
         5.64000000e+01])

We can do a bit better if we define a `dtype`. That is, the type for each column: string, int, float¶

In [31]:

dt = np.dtype([('Month', 'int8'), ('Day', 'int8'), ('Year', 'int16'), ('Temp', 'float64')])
w_data = np.loadtxt('weather/CALOSANG.txt',dtype=dt)
w_data[:5] 

Out[31]:

array([(1, 1, 1995, 56.4), (1, 2, 1995, 55.1), (1, 3, 1995, 54.3),
       (1, 4, 1995, 53.6), (1, 5, 1995, 56.6)], 
      dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])

Access and Modify Array Elements¶

In [32]:

dt = np.dtype([('Month', 'int8'), ('Day', 'int8'), ('Year', 'int16'), ('Temp', 'float64')])
data = np.loadtxt('weather/CALOSANG.txt',dtype=dt)

Slicing¶

first 5 rows
rows 30 35
every other row

Row Selection¶

In [33]:

data[0]

Out[33]:

(1, 1, 1995, 56.4)

In [34]:

data[0:2]

Out[34]:

array([(1, 1, 1995, 56.4), (1, 2, 1995, 55.1)], 
      dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])

In [35]:

data[30:35]

Out[35]:

array([(1, 31, 1995, 64.1), (2, 1, 1995, 63.9), (2, 2, 1995, 65.8),
       (2, 3, 1995, 66.1), (2, 4, 1995, 65.4)], 
      dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])

every other row

In [36]:

data[0:10:2] 

Out[36]:

array([(1, 1, 1995, 56.4), (1, 3, 1995, 54.3), (1, 5, 1995, 56.6),
       (1, 7, 1995, 53.5), (1, 9, 1995, 59.7)], 
      dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])

Last 5 rows

In [37]:

data[-5:]

Out[37]:

array([(6, 19, 2013, 66.8), (6, 20, 2013, 68.3), (6, 21, 2013, 68.4),
       (6, 22, 2013, 68.4), (6, 23, 2013, 66.1)], 
      dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])

Reverse entire array

In [38]:

data[::-1]

Out[38]:

array([(6, 23, 2013, 66.1), (6, 22, 2013, 68.4), (6, 21, 2013, 68.4), ...,
       (1, 3, 1995, 54.3), (1, 2, 1995, 55.1), (1, 1, 1995, 56.4)], 
      dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])

slicing¶

start:end:stride¶

Column selection¶

In [39]:

data['Temp']

Out[39]:

array([ 56.4,  55.1,  54.3, ...,  68.4,  68.4,  66.1])

In [40]:

print data['Year']

[1995 1995 1995 ..., 2013 2013 2013]

Exercise 2¶

List the values of Temp for every Saturday.
List the value of Temp for April 1st, 2001.

In [40]:

Plotting¶

In [41]:

%pylab inline

Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].
For more information, type 'help(pylab)'.

In [42]:

plt.figure(figsize=(6,4))
plt.plot(w_data['Temp'])

Out[42]:

[<matplotlib.lines.Line2D at 0x85fef28>]

Question: What is the max,min,standard deviation,average, and variance of temperature in 1995 compare to 2012?

NumPy has vectorized conditionals which result in boolean arrays

In [43]:

w_data['Year'] == 1995

Out[43]:

array([ True,  True,  True, ..., False, False, False], dtype=bool)

These arrays can be used an input into the original array selecting all the rows where the conditional is True

In [44]:

cond = w_data['Year'] == 1995
w_data[cond][:5]

Out[44]:

array([(1, 1, 1995, 56.4), (1, 2, 1995, 55.1), (1, 3, 1995, 54.3),
       (1, 4, 1995, 53.6), (1, 5, 1995, 56.6)], 
      dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])

In [45]:

#save resulting array in variable year1995
year1995 = w_data[cond] #or w_data[w_data['Year']==1995]

Compute on Arrays¶

NumPy has lots of builtin mathematics

avg/mean
std
min
max

In [46]:

year1995['Temp'].max(),year1995['Temp'].min(),year1995['Temp'].mean(),year1995['Temp'].std(),year1995['Temp'].var()

Out[46]:

(76.599999999999994,
 51.200000000000003,
 62.272876712328781,
 4.8976263413331891,
 23.986743779320719)

Do the same for 2012¶

In [47]:

year2012 = w_data[w_data['Year'] == 2012]
year2012['Temp'].max(),year2012['Temp'].min(),year2012['Temp'].mean(),year2012['Temp'].std(),year2012['Temp'].var()

Out[47]:

(81.5,
 49.700000000000003,
 62.709836065573725,
 6.0234842299293421,
 36.282362268207478)

What day was the hottest in 1995 compared to 2012?

argmax returns the index where the maximum occurs
argmin returns the index where the minimum occurs

In [48]:

np.argmax(year1995['Temp'])

Out[48]:

In [49]:

max_idx = np.argmax(year1995['Temp'])
year1995[max_idx]

Out[49]:

(9, 6, 1995, 76.6)

In [50]:

max_idx = np.argmax(year2012['Temp'])
year2012[max_idx]

Out[50]:

(9, 15, 2012, 81.5)

Exercise 3¶

Plot a visual comparison of the two years.

In [50]:

It may also be interesting to look at how how temperature changes from day to day.

In [51]:

arr1 = year1995['Temp'][:-1]
arr2 = year1995['Temp'][1:]
len(arr1),len(arr2)

Out[51]:

(364, 364)

In [52]:

deltaT = arr2-arr1
plt.plot(deltaT, '.');

In [53]:

plt.plot(year2012['Temp'],label='2012')
plt.plot(year1995['Temp'],label='1995')
plt.legend()

Out[53]:

<matplotlib.legend.Legend at 0x8753ef0>

Interesting spike around day 50¶

Where does that occur? We could look for the maximum value in year 1995 between 0 and 100

In [54]:

max_val = np.argmax(year1995['Temp'][:100])
year1995[max_val], max_val #Ha!  Looks like it occur exactly on day 50 of the year

Out[54]:

((2, 20, 1995, 73.6), 50)

Let's also return to the previous plot of all the years¶

In [55]:

plt.figure(figsize=(9,9))
plt.plot(w_data['Temp'])

Out[55]:

[<matplotlib.lines.Line2D at 0x8610b38>]

How should we handle those spikes?¶

fill with zeros
interporlate (cubic, linear)
forward/backward fill
fill with average

How do we find spikes?

In [56]:

min_idx = np.argmin(w_data['Temp']) #only returns one value
w_data['Temp'][min_idx], min_idx

Out[56]:

(-99.0, 1453)

np.where returns the indices where condition is met

In [57]:

np.where(w_data['Temp'] == -99.0)

Out[57]:

(array([1453, 1454, 1459, 1460, 1470, 2725, 2726, 2727, 2728, 2934, 2981,
       4622, 5015, 5212], dtype=int64),)

In [58]:

na_vals = np.where(w_data['Temp'] == -99.0)

In [59]:

w_data['Temp'][na_vals] = [32 for i in range(len(na_vals))]

In [60]:

plt.figure(figsize=(6,4))
plt.plot(w_data['Temp']);

Perhaps this is a good stopping point in our analysis and we want to save the state of the array/matrix¶

In [61]:

#saving data
np.savez('weather-data.npz',w_data)

In [62]:

#loading data
dataz = np.load('weather-data.npz')

In [63]:

dataz['arr_0']

Out[63]:

array([(1, 1, 1995, 56.4), (1, 2, 1995, 55.1), (1, 3, 1995, 54.3), ...,
       (6, 21, 2013, 68.4), (6, 22, 2013, 68.4), (6, 23, 2013, 66.1)], 
      dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])

In [ ]: