The cell below loads the visual style of the notebook when run.¶

In [0]:

from IPython.core.display import HTML
css_file = '../../styles/styles.css'
HTML(open(css_file, "r").read())

Introducing Numpy¶

Learning Objectives

How to import libraries, and where to find useful libraries.

Explain why numpy is more useful than Python lists

Use numpy for simple operations on arrays.

Learn how to slice numpy arrays

Read tabular data from a file into a program.

Prerequisites

Before starting this lecture, it is assumed you are familiar with material in part 2 of the Python Bootcamp

Numpy¶

As a reminder, most of Python's functionatility comes from the use of libraries. numpy is one such library that is designed to work with arrays of data, which are useful to represent vectors and matrices. numpy also provides routines to perform operations on the whole array in one line of code. We often import numpy using an alias to save typing:

In [0]:

import numpy as np

The core of numpy is the array object. We'll start by looking at how numpy's arrays differ from a Python list. We can create a numpy array from a Python list like so:

In [0]:

a_list = [1,2,3,4] # a python list
an_array = np.array( a_list ) # one way to make an array
an_array = np.array( [1,2,3,4] ) # another way

print(an_array)

We saw in the bootcamp how to access elements of a numpy array or a Python list:

In [0]:

print (a_list[0])
print (an_array[0])

In [0]:

print (a_list[1:3])
print (an_array[1:3])

Differences between arrays and lists¶

What's the difference? To start with, Python's lists are very flexible and can contain different types of objects:

In [0]:

a_list[-1] = 'a string inside a list'
print(a_list)

whereas numpy arrays can only contain a single type!

In [0]:

an_array[-1] = 'a string inside an array'

Information about the type of object in a numpy array is stored in the dtype member

In [0]:

an_array.dtype

Once the array is created, the dtype is fixed, and an array can only hold that type. For example, if we try and put a floating point number in our array, it gets converted to an integer:

In [0]:

an_array[-1] = 1.3141
print(an_array)

Why use arrays?¶

Right now, numpy arrays look less useful than Python lists!

But numpy arrays are fast, and smart. Numpy has methods which allow fast computations on all elements of the array at the same time, and some other tricks up it's sleeve that we'll look at later.

As a taster, let's compare two ways of calculating the sum of all the elements in a list or array. Make sure each function makes some sense to you before running the code.

In [0]:

def sum_python(a_list):
    '''Add up all the values in a list or array using plain Python'''
    sum = 0.0
    for thing in a_list:
        sum += thing
    return sum

def sum_numpy(a_list):
    '''Add up all the values in a list or array using numpy'''
    return np.sum(a_list)

In [0]:

my_list = np.random.randn(1000) # use numpy to create an array of 1000 random numbers
%timeit sum_python(my_list)

In [0]:

%timeit sum_numpy(my_list)

We can see that using numpy allows us to avoid for loops, and so leads to simpler code, but also that the numpy code is around 40 times faster!

Numpy vs List

Write two functions to calculate the mean of a list/array of numbers; one should use pure Python and the other should use numpy. Compare your answer to your neighbour. Are your answers similar? Which did you find easier to write?

Hint: the mean of a list of numbers is the sum of all the numbers, divided by the size of the list, so use the functions above as a template.

In [0]:

# Write your own code here

Creating arrays¶

We have seen that we can create arrays from lists. Let's look at a few more ways of creating arrays. We can create arrays filled with zeros or ones:

In [0]:

zeros = np.zeros(5)
ones  = np.ones(3)

print(zeros)
print(ones)

Numpy also offers the arange function which returns an array of evenly spaced values:

In [0]:

np.arange(0, 5, 1) # min, max(exclusive), step

and the very useful linspace function to create a linearly-spaced grid, with a fixed number of points and including both ends of the specified interval:

In [0]:

print ("A linear grid of 5 points between 0 and 2:")
print (np.linspace(0, 2, 5)) #min, max (inclusive), num_points

Arrays with more than one dimension¶

We've used 1D numpy arrays so far, but all the methods discussed work with arrays with 2 or more dimensions, e.g:

In [0]:

np.zeros( (2,2) )

In [0]:

x = np.random.normal(loc=5, scale=2, size=(5,4))
print(x)

You can create a 2D array from a list of lists:

In [0]:

list_of_lists = [ [1,2,3], [4,5,6], [7,8,9] ] # each list is a row
np.array(list_of_lists)

Arrays can even have their shape changed, as long as we don't change the total number of elements in the array:

In [0]:

x.reshape((10,2))

Views, not copies

Reshaping an array, like most numpy operations, provides a view of the same bit of computer memory. Without running it, what do you expect the code below to print?

arr  = np.arange(8)
arr2 = arr.reshape(2, 4)

arr[0] = 1000
print (arr)
print (arr2)

Now run the code below. Did you get what you expected?

In [0]:

arr  = np.arange(8)
arr2 = arr.reshape(2, 4)

arr[0] = 1000
print (arr)
print (arr2)

This lack of copying might be confusing at times, but it makes numpy very efficient and fast.

Array methods and properties¶

One of the great things about numpy arrays is the many methods that allow us to quickly get info about an array, or perform some operation on it. Here are some useful ones:

In [0]:

print ('Data type                :', arr2.dtype)
print ('Total number of elements :', arr2.size)
print ('Number of dimensions     :', arr2.ndim)
print ('Shape (dimensionality)   :', arr2.shape)

In [0]:

print ('Minimum and maximum             :', arr.min(), arr.max() )
print ('Mean and standard deviation     :', arr.mean(), arr.std() )

Array methods

Use the Tab completion feature of the Jupyter notebook to browse through all the methods belonging to the arr2 object. Use the argmax method to find the index and value of the largest value in arr2

In [0]:

# Write your code here

Array Operations¶

Arrays support all the normal math operators (+, -, / etc). The numpy library also contains a complete collection of basic mathematical functions that operate on arrays. In general, the math operators act on every element of an array in turn. Again, this means you don't have to use for loops to add all the numbers in two arrays together. For example:

In [0]:

arr1 = np.arange(4)
arr2 = np.arange(10, 14)
print (arr1, '+', arr2, '=', arr1+arr2)

In particular, multiplication is not matrix multiplication, but multiplies each element of the arrays together in turn:

In [0]:

print (arr1, '*', arr2, '=', arr1*arr2)

We can also multiply arrays by scalars:

In [0]:

print (arr1 * 3)

This is a first example of broadcasting.

In principle, to add two arrays their shapes must match. However, numpy will broadcast the shapes to match if possible. Broadcasting is a bit beyond the level of this course, but if you find yourself needing to understand it, it is described in the companion notebook 01a-broadcasting.ipynb.

As we mentioned before, numpy ships with many mathematical functions that work on entire arrays, including logarithms, exponentials, and trigonometric functions. For example, calculating the sine function at 100 points between $0$ and $2\pi$ is as simple as:

In [0]:

x = np.linspace(0.0,2.0*np.pi,100)
y = np.sin(x)

Using Numpy to do calculations on lots of values at once is a great way to speed up your code and make it simpler to understand. If you ever find yourself looping over an array to do some calculation with each element in turn, stop, and see if Numpy can do the hard work for you.

Linear Algebra in numpy¶

numpy has a dot method which gives the dot product between two vectors (one-dimensional arrays), or matrix multiplication when one or both of its arguments are matrices (two-dimensional arrays):

In [0]:

v1 = np.array([2, 3, 4])
v2 = np.array([1, 0, 1])
result = np.dot(v1,v2)

print (v1, '.', v2, '=', result)

Here is a regular matrix-vector multiplication, note that the array v1 should be viewed as a column vector in traditional linear algebra notation

In [0]:

A = np.arange(6).reshape(2, 3)
result = np.dot(A, v1)

print (A, 'x', v1, '=', result)

We can calculate transposes of arrays with arr.T, and np.linalg module contains methods to find eigenvectors, determinants and much more.

Numpy Maths Challenge

A system of linear equations can be written in matrix form.
$$3x + 6y - 5z = 12$$$$x - 3y + 2z = -2$$$$5x - y + 4z = 10$$
$$\left[\begin{matrix} 3 & 6 & -5 \\ 1 & -3 & 2 \\ 5 & -1 & 4 \end{matrix}\right]

\left[\begin{matrix} x \\ y \\ z \end{matrix}\right] = \left[\begin{matrix} 12 \\ -2\\ 10 \end{matrix}\right]$$

Now, representing the matrix system as $\mathbf{AX} = \mathbf{B}$, we can see the solution is given by $\mathbf{X} = \mathbf{A^{-1}B}$.

Write some numpy code to find the solution of these equations.

Hint: np.linalg.inv can be used to find a matrices inverse...

In [0]:

# Write your code here

Numpy slicing¶

Slicing is the term used to refer to extracting part of an array. We already saw some basic slicing in the bootcamp. As a quick reminder, we use the syntax arr[lower:upper:step] like so:

In [0]:

A = np.array([1,2,3,4,5,6,7,8])
A[1:6:2]

We can omit any part of the [lower:upper:step] notation:

In [0]:

print (A[2:]) #from index 2 to the end, assumed step=1
print (A[:3]) # from index 0 to 3
print (A[::2]) # from 0 until the end, in steps of 2

And slicing works the same way for arrays with more dimensions:

In [0]:

B = np.arange(36).reshape((6,6))
print (B)

In [0]:

print (B[1:3]) # just rows 1-3

In [0]:

print (B[1:3, 1:3]) # columns 1-3 and rows 1-3

In [0]:

print (B[1, :])  # row 1, all columns

In [0]:

print (B[:, 0])  # column 0 (the first column), and all rows

Fancy slicing¶

We can use numpy arrays or python lists to slice numpy arrays as well! This is an extremely powerful technique.

In [0]:

row_indices = [1, 3, 5]
print (B[row_indices])

Why is this so useful? We can also use arrays of True and False values, known as masks as an index. If the mask is True, the element will be selected. For example, we can calculate a mask to see if each element of B is odd:

In [0]:

is_odd = (B % 2 != 0) # does each element of B divide by two with no remainder?
print (is_odd)

And use that mask to select only the odd values:

In [0]:

print( B[is_odd] )

You can negate a mask (switch True and False) with ~, so to find the even numbers we use:

In [0]:

print( B[~is_odd] )

Fancy slicing

Use np.arange to make a 1D array of values from 0 to 100 (inclusive) and calculate the sum of only the even values using fancy slicing. Compare your answer to your neighbour. Do you get the same answer? Is your code the same?

In [0]:

# Write your code here.

Reading and writing data to files¶

It is perfectly possible to read and write files without using numpy. Files are opened using the open function. This returns a file object which is used to read or write the file. It is most often used with two arguments

file_handle = open(filename,mode)

The second argument is a string containing a few characters describing the way in which the file will be used. mode can be r when the file will only be read, w for only writing (an existing file with the same name will be erased), and a opens the file for appending; any data written to the file is automatically added to the end. The mode argument is optional, if it is missing, r is assumed.

For example, I have a file with some basic data on temperature in Stockholm. Below I show how you would open a file and read in the lines in pure Python. Notice how the open function gets what we call a relative path for the file; this is how to get to the file from the location of the notebook on the filesystem. The symbol ".." means "go up one directory".

In [0]:

file_handle = open('../../data/Session1/td_stockholm.dat','r') # open file for reading
lines = file_handle.readlines() # read in all the lines into a list
for line in lines[0:5]: # iterate over the first few lines
    print(line) # print out line

file_handle.close()

You can see this is a space separated text file. If we wanted to convert this data to a numpy array we can imagine writing the code to do so. It would:

read in the lines into a list
for each line in the list
skip header lines if necessary
split lines into fields at the whitespace
put the fields into a numpy array

This code is obviously going to be very similar for each file and it would make sense to write a function you could re-use. But the first lesson of Python club is to see if someone else has done it for you! In this case they have, and we can look at np.loadtxt:

In [0]:

np.loadtxt?

Looking at the help for this function, we can see that we can read in the file above with:

In [0]:

data = np.loadtxt('../../data/Session1/td_stockholm.dat', skiprows=1)
print(data.shape)

We've got our 7 columns and 77431 rows. We can use slicing to access individual columns:

In [0]:

year = data[:, 0] # first column
month = data[:, 1]
day = data[:, 2]
temperature = data[:, 5]
# calculate a 'decimal year' for each observation
# NOTICE HOW NUMPY ALLOWS US TO CALCULATE EVERY ROW IN ONE LINE.
obs_time = year + month/12 + day/31

And plot the data (plotting is discussed in another lecture):

In [0]:

%matplotlib inline
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(14,4))
ax.plot(obs_time,temperature)
ax.axis('tight')
ax.set_title('tempeatures in Stockholm')
ax.set_xlabel('year')
ax.set_ylabel('tempature (C)');

Writing files¶

Suppose we want to save our data to a file? np.savetxt will take a 2D numpy array and save it to a simple text file. We can stack our 1D columns into a 2D array with np.column_stack:

In [0]:

data_out = np.column_stack([obs_time, temperature])

And we can save this 2D array to a comma-seperated value file with np.savetxt:

In [0]:

np.savetxt('output.csv', data_out, delimiter=',')

In [0]:

!head output.csv

File IO

Read the help file for np.loadtxt and try and read only the minimum and maximum temperature from the file. Place the data directly in two numpy arrays. Your call should start like
temp_min, temp_max = np.loadtxt(...
Save the min and max temp to a new file in a format of your choice using np.savetxt

In [0]:

#Write your code here

Homework #1

Advanced File IO and Slicing

In the data directory is a file named 'sdss_wds.csv'. This is a comma-seperated values file of data about white dwarfs discovered in the Sloan Digital Sky Survey. Your job is to investigate how the median mass of the white dwarfs depends on temperature.

The code below prints out the first few lines of the file. As you can see, there is one header row which tells us what the columns are.

In [0]:

lines = open('../../data/Session1/sdss_wds.csv', 'r').readlines()
for line in lines[0:4]:
    print (line)

Question 1 (1 points)

Look at the documentation for np.loadtxt and work out how to read in the file above.

Use np.loadtxt to load the white dwarf file into an array called data.

If you are successful, your code will pass the tests below each question. You can also use these tests to see if your code gives the right answer.

In [0]:

# WRITE YOUR OWN CODE HERE

In [0]:

from nose.tools import assert_tuple_equal, assert_false, assert_true  
assert_tuple_equal(data.shape, (13673, 6))

Question 2 (4 points)

np.loadtxt has read in our text files, and we have placed the contents into a 2D array called "data". Complete the function outline below to create a function which will take this 2D array and return a user-specified column. The function should return a 1D array of the column values.

Get into the habit of reading the documentation at the start of the function to understand how it should work, and what the function arguments should be.

Use defensive programming and assert statements to make sure that the function only accepts valid column numbers.

Again, try and get your code to pass the tests that I have written in the cells below.

In [0]:

def get_column(data, col_num):
    """Filter a numpy array and return the requested column.

    Given a 2D numpy array, and a column number, this function returns the 
    requested column number as a 1D numpy array.

    Args:
        data (numpy.ndarray): the array to filter
        col_num (int): the column number to return (e.g `0` for the first column in the array).
    """
    # here's an example of using defensive programming to make sure data is a 2D array
    # notice how we run these checks before doing anything else
    assert(len(data.shape) == 2), "data must be 2D array"
    # ADD YOUR CODE HERE

In [0]:

from nose.tools import assert_equal, assert_almost_equal
teff = get_column(data, 4)
mass = get_column(data, 5)
assert_equal(len(teff), 13673)
assert_equal(len(mass), 13673)
assert_almost_equal(teff.mean(), 17103.082278943904)

In [0]:

from nose.tools import assert_raises
assert_raises(AssertionError, get_column, data, -1 )
assert_raises(AssertionError, get_column, data, 6 )

Question 3 (1 points)

What is the median mass of the white dwarfs found in the Sloan Digital Sky Survey? Store your answer in a variable named median_mass. Store the mean mass in a variable named mean_mass

In [0]:

# YOUR CODE HERE

In [0]:

from nose.tools import assert_almost_equal
assert_almost_equal(median_mass, 0.5939999999999999)
assert_almost_equal(mean_mass, 0.62164258026768093)

Question 4 (2 points)

Why do the mean and median masses differ? What does this tell you about the shape of the mass distribution of white dwarfs discovered in the Sloan Digital Sky Survey?

YOUR ANSWER HERE

Question 5 (2 points)

Use fancy slicing to find the mean mass of white dwarfs cooler than 12,000K. Store your answer in the variable mean_mass_coolwd.

In [0]:

# YOUR CODE HERE

In [0]:

from nose.tools import assert_almost_equal
assert_almost_equal(mean_mass_coolwd, 0.67178046054102403)

Extra Credit (2 points)

Extra credit questions allow you to make up for marks dropped in this and other homeworks. You can't score more than 100% overall, but if you get 2 extra credit points this week, and lose 2 points next week, you'd still be on course for 100% marks. I don't expect you to answer extra credit questions, unless you want to.

For extra credit this week, use the techniques you've learned above and your own statistical knowledge to justify whether the cool white dwarfs are significantly different in mass to the hotter white dwarfs. Use the code cell to make any calculations you need, and the markdown cell to justify your answer.

In [0]:

# YOUR CODE HERE

The cell below loads the visual style of the notebook when run.¶

Introducing Numpy¶

Learning Objectives

Prerequisites

Numpy¶

Differences between arrays and lists¶

Why use arrays?¶

Numpy vs List

Creating arrays¶

Arrays with more than one dimension¶

Views, not copies

Array methods and properties¶

Array methods

Array Operations¶

Linear Algebra in numpy¶

Numpy Maths Challenge

Numpy slicing¶

Fancy slicing¶

Fancy slicing

Reading and writing data to files¶

Writing files¶

File IO

Homework #1

Advanced File IO and Slicing

Question 1 (1 points)

Question 2 (4 points)

Question 3 (1 points)

Question 4 (2 points)

Question 5 (2 points)

Extra Credit (2 points)

Write your answer here¶