# Numerical Analysis with NumPy¶

Materials created by:

Peter Wang, Continuum Analytics

Ben Zaitlen, Continuum Analytics

Randy Olson, Michigan State Universty

## Learning Objectives¶

1. Understand NumPy
2. Create simple arrays of data in NumPy
3. Access and modify array elements
4. Compute on arrays

### The basics¶

In [1]:
sales = [0.75, 1.5, 2.0, 4.0, 5.0, 6.0, 10.0]

In [2]:
sales*2

Out[2]:
[0.75, 1.5, 2.0, 4.0, 5.0, 6.0, 10.0, 0.75, 1.5, 2.0, 4.0, 5.0, 6.0, 10.0]
In [3]:
sales+2

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-45b722299797> in <module>()
----> 1 sales+2

TypeError: can only concatenate list (not "int") to list
In [4]:
import numpy as np

In [5]:
sales = np.array(sales)
print sales

[  0.75   1.5    2.     4.     5.     6.    10.  ]


### Basic Math¶

In [6]:
sales+1

Out[6]:
array([  1.75,   2.5 ,   3.  ,   5.  ,   6.  ,   7.  ,  11.  ])
In [7]:
sales*2

Out[7]:
array([  1.5,   3. ,   4. ,   8. ,  10. ,  12. ,  20. ])
In [8]:
np.power(sales, 2)

Out[8]:
array([   0.5625,    2.25  ,    4.    ,   16.    ,   25.    ,   36.    ,
100.    ])

### What is NumPy?¶

Python library that provides multi-dimensional arrays, tables, and matrices for Python

• Contiguous or strided arrays
• Homogeneous (but types can be algebraic)
• Arrays of records and nested records
• Fast routines for array operations (C, ATLAS, MKL)

### NumPy's Many Uses¶

• Image and signal processing
• Linear algebra
• Data transformation and query
• Time series analysis
• Statistical analysis
• Many more!

## Create Arrays of Data in NumPy¶

Lots of methods of array initialization

• python list
• zeros
• ones
• diagonal
• eye
• random
In [9]:
np.zeros(5)

Out[9]:
array([ 0.,  0.,  0.,  0.,  0.])
In [10]:
np.ones(5)

Out[10]:
array([ 1.,  1.,  1.,  1.,  1.])
In [11]:
np.random.random(5)

Out[11]:
array([ 0.66648901,  0.89045666,  0.24164698,  0.3969619 ,  0.34864545])
In [12]:
np.diag([1,1,1,1,1])

Out[12]:
array([[1, 0, 0, 0, 0],
[0, 1, 0, 0, 0],
[0, 0, 1, 0, 0],
[0, 0, 0, 1, 0],
[0, 0, 0, 0, 1]])

NumPy also has methods for generating linearly/log spaced values

In [13]:
np.arange(0,10)

Out[13]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [14]:
np.arange(10,30,2)

Out[14]:
array([10, 12, 14, 16, 18, 20, 22, 24, 26, 28])
In [15]:
np.linspace(0,2,10)

Out[15]:
array([ 0.        ,  0.22222222,  0.44444444,  0.66666667,  0.88888889,
1.11111111,  1.33333333,  1.55555556,  1.77777778,  2.        ])

NumPy has a type and a shape. The above array are either floats or ints, but these are infered. We could be explicit

In [16]:
np.ones(5)

Out[16]:
array([ 1.,  1.,  1.,  1.,  1.])
In [17]:
np.ones(5,dtype='int64')

Out[17]:
array([1, 1, 1, 1, 1], dtype=int64)

We also can define the shape of the array

In [18]:
np.ones(shape=(3,3),dtype='float64')

Out[18]:
array([[ 1.,  1.,  1.],
[ 1.,  1.,  1.],
[ 1.,  1.,  1.]])
In [19]:
np.random.uniform(0,1,size=(5,5))

Out[19]:
array([[ 0.00304427,  0.84160095,  0.3167405 ,  0.56702975,  0.24500155],
[ 0.42505285,  0.911683  ,  0.33022088,  0.52441459,  0.8219738 ],
[ 0.44855388,  0.40749562,  0.5910397 ,  0.48225375,  0.16479927],
[ 0.39055229,  0.32811335,  0.23763238,  0.16384718,  0.51039829],
[ 0.88561498,  0.47630962,  0.15369293,  0.57210622,  0.00708663]])

### Multidimensional properties¶

In [20]:
data = np.zeros(9)
print data

[ 0.  0.  0.  0.  0.  0.  0.  0.  0.]

In [21]:
data.reshape((3,3,))

Out[21]:
array([[ 0.,  0.,  0.],
[ 0.,  0.,  0.],
[ 0.,  0.,  0.]])
In [22]:
data = data.reshape((3,3))
data[0,0] = 12
data[1,0] = 13
data[2,0] = 14
print data

[[ 12.   0.   0.]
[ 13.   0.   0.]
[ 14.   0.   0.]]


Data access is arr[row, column]

reshaping is just another view of the same data. no

In [23]:
#row selection
print data[0,:]
#column selection
print data[:,0]

[ 12.   0.   0.]
[ 12.  13.  14.]

In [24]:
data[:,1] = 0.2
print data

[[ 12.    0.2   0. ]
[ 13.    0.2   0. ]
[ 14.    0.2   0. ]]

In [25]:
data[:,2] = [.3,.4,9.12]
print data

[[ 12.     0.2    0.3 ]
[ 13.     0.2    0.4 ]
[ 14.     0.2    9.12]]


In last cell, shape of assignment must match shape of accessed data

### Exercise 1¶

1. Create a 4x4 array with numbers from a normal distribution with mean = 5 and standard deviation = 2.7.

2. Sum the values of each column and store the result in a new array.

3. Sum the values of each row and store the result in a new array.

4. Sum all of the values and store the result in a new array.

In [26]:



Let's get our feet wet with some real data.

#### Import Modules¶

In [27]:
import numpy as np
import zipfile
import os
import urllib2


We've gone ahead and bundled up a small bucket of this data for our analysis today at http://figshare.com/articles/GSOD_Weather_NOAA/743699

We can grab the data from the following permanent URL: http://files.figshare.com/1116528/weather.zip

In [28]:
dirname = 'weather'
if os.path.exists(dirname):
else:
os.mkdir(dirname)

url = 'http://files.figshare.com/1116528/weather.zip'
response = urllib2.urlopen(url)

fname = 'weather.zip'
with open(fname,'wb') as f:

zfile = zipfile.ZipFile('weather.zip')

for name in zfile.namelist()[1:]:
print name
with open(name,"w") as f:

weather/CALOSANG.txt
weather/CASANFRA.txt
weather/INSOBEND.txt
weather/KYLEXING.txt
weather/MDWASHDC.txt
weather/NYNEWYOR.txt
weather/WASEATTL.txt


#### What does the text file look like?¶

In [29]:
with open('weather/CALOSANG.txt','r') as f:

 1             1             1995         56.4

1             2             1995         55.1

1             3             1995         54.3



This text file is

• Month
• Day
• Year
• Temp

#### Load data file into numpy array and print first row¶

In [30]:
w_data = np.loadtxt('weather/CALOSANG.txt')
w_data[0]

Out[30]:
array([  1.00000000e+00,   1.00000000e+00,   1.99500000e+03,
5.64000000e+01])

#### We can do a bit better if we define a dtype. That is, the type for each column: string, int, float¶

In [31]:
dt = np.dtype([('Month', 'int8'), ('Day', 'int8'), ('Year', 'int16'), ('Temp', 'float64')])
w_data[:5]

Out[31]:
array([(1, 1, 1995, 56.4), (1, 2, 1995, 55.1), (1, 3, 1995, 54.3),
(1, 4, 1995, 53.6), (1, 5, 1995, 56.6)],
dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])

## Access and Modify Array Elements¶

In [32]:
dt = np.dtype([('Month', 'int8'), ('Day', 'int8'), ('Year', 'int16'), ('Temp', 'float64')])


### Slicing¶

• first 5 rows
• rows 30 35
• every other row

#### Row Selection¶

In [33]:
data[0]

Out[33]:
(1, 1, 1995, 56.4)
In [34]:
data[0:2]

Out[34]:
array([(1, 1, 1995, 56.4), (1, 2, 1995, 55.1)],
dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])
In [35]:
data[30:35]

Out[35]:
array([(1, 31, 1995, 64.1), (2, 1, 1995, 63.9), (2, 2, 1995, 65.8),
(2, 3, 1995, 66.1), (2, 4, 1995, 65.4)],
dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])

every other row

In [36]:
data[0:10:2]

Out[36]:
array([(1, 1, 1995, 56.4), (1, 3, 1995, 54.3), (1, 5, 1995, 56.6),
(1, 7, 1995, 53.5), (1, 9, 1995, 59.7)],
dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])

Last 5 rows

In [37]:
data[-5:]

Out[37]:
array([(6, 19, 2013, 66.8), (6, 20, 2013, 68.3), (6, 21, 2013, 68.4),
(6, 22, 2013, 68.4), (6, 23, 2013, 66.1)],
dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])

Reverse entire array

In [38]:
data[::-1]

Out[38]:
array([(6, 23, 2013, 66.1), (6, 22, 2013, 68.4), (6, 21, 2013, 68.4), ...,
(1, 3, 1995, 54.3), (1, 2, 1995, 55.1), (1, 1, 1995, 56.4)],
dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])

### Column selection¶

In [39]:
data['Temp']

Out[39]:
array([ 56.4,  55.1,  54.3, ...,  68.4,  68.4,  66.1])
In [40]:
print data['Year']

[1995 1995 1995 ..., 2013 2013 2013]


### Exercise 2¶

1. List the values of Temp for every Saturday.

2. List the value of Temp for April 1st, 2001.

In [40]:



### Plotting¶

In [41]:
%pylab inline

Welcome to pylab, a matplotlib-based Python environment [backend: module://IPython.zmq.pylab.backend_inline].

In [42]:
plt.figure(figsize=(6,4))
plt.plot(w_data['Temp'])

Out[42]:
[<matplotlib.lines.Line2D at 0x85fef28>]

Question: What is the max,min,standard deviation,average, and variance of temperature in 1995 compare to 2012?

NumPy has vectorized conditionals which result in boolean arrays

In [43]:
w_data['Year'] == 1995

Out[43]:
array([ True,  True,  True, ..., False, False, False], dtype=bool)

These arrays can be used an input into the original array selecting all the rows where the conditional is True

In [44]:
cond = w_data['Year'] == 1995
w_data[cond][:5]

Out[44]:
array([(1, 1, 1995, 56.4), (1, 2, 1995, 55.1), (1, 3, 1995, 54.3),
(1, 4, 1995, 53.6), (1, 5, 1995, 56.6)],
dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])
In [45]:
#save resulting array in variable year1995
year1995 = w_data[cond] #or w_data[w_data['Year']==1995]


## Compute on Arrays¶

NumPy has lots of builtin mathematics

• avg/mean
• std
• min
• max
In [46]:
year1995['Temp'].max(),year1995['Temp'].min(),year1995['Temp'].mean(),year1995['Temp'].std(),year1995['Temp'].var()

Out[46]:
(76.599999999999994,
51.200000000000003,
62.272876712328781,
4.8976263413331891,
23.986743779320719)

### Do the same for 2012¶

In [47]:
year2012 = w_data[w_data['Year'] == 2012]
year2012['Temp'].max(),year2012['Temp'].min(),year2012['Temp'].mean(),year2012['Temp'].std(),year2012['Temp'].var()

Out[47]:
(81.5,
49.700000000000003,
62.709836065573725,
6.0234842299293421,
36.282362268207478)

What day was the hottest in 1995 compared to 2012?

• argmax returns the index where the maximum occurs
• argmin returns the index where the minimum occurs
In [48]:
np.argmax(year1995['Temp'])

Out[48]:
248
In [49]:
max_idx = np.argmax(year1995['Temp'])
year1995[max_idx]

Out[49]:
(9, 6, 1995, 76.6)
In [50]:
max_idx = np.argmax(year2012['Temp'])
year2012[max_idx]

Out[50]:
(9, 15, 2012, 81.5)

### Exercise 3¶

Plot a visual comparison of the two years.

In [50]:



It may also be interesting to look at how how temperature changes from day to day.

In [51]:
arr1 = year1995['Temp'][:-1]
arr2 = year1995['Temp'][1:]
len(arr1),len(arr2)

Out[51]:
(364, 364)
In [52]:
deltaT = arr2-arr1
plt.plot(deltaT, '.');

In [53]:
plt.plot(year2012['Temp'],label='2012')
plt.plot(year1995['Temp'],label='1995')
plt.legend()

Out[53]:
<matplotlib.legend.Legend at 0x8753ef0>

### Interesting spike around day 50¶

Where does that occur? We could look for the maximum value in year 1995 between 0 and 100

In [54]:
max_val = np.argmax(year1995['Temp'][:100])
year1995[max_val], max_val #Ha!  Looks like it occur exactly on day 50 of the year

Out[54]:
((2, 20, 1995, 73.6), 50)

### Let's also return to the previous plot of all the years¶

In [55]:
plt.figure(figsize=(9,9))
plt.plot(w_data['Temp'])

Out[55]:
[<matplotlib.lines.Line2D at 0x8610b38>]

### How should we handle those spikes?¶

• fill with zeros
• interporlate (cubic, linear)
• forward/backward fill
• fill with average

How do we find spikes?

In [56]:
min_idx = np.argmin(w_data['Temp']) #only returns one value
w_data['Temp'][min_idx], min_idx

Out[56]:
(-99.0, 1453)

np.where returns the indices where condition is met

In [57]:
np.where(w_data['Temp'] == -99.0)

Out[57]:
(array([1453, 1454, 1459, 1460, 1470, 2725, 2726, 2727, 2728, 2934, 2981,
4622, 5015, 5212], dtype=int64),)
In [58]:
na_vals = np.where(w_data['Temp'] == -99.0)

In [59]:
w_data['Temp'][na_vals] = [32 for i in range(len(na_vals))]

In [60]:
plt.figure(figsize=(6,4))
plt.plot(w_data['Temp']);


### Perhaps this is a good stopping point in our analysis and we want to save the state of the array/matrix¶

In [61]:
#saving data
np.savez('weather-data.npz',w_data)

In [62]:
#loading data

dataz['arr_0']

array([(1, 1, 1995, 56.4), (1, 2, 1995, 55.1), (1, 3, 1995, 54.3), ...,
dtype=[('Month', 'i1'), ('Day', 'i1'), ('Year', '<i2'), ('Temp', '<f8')])