from IPython.core.display import HTML
css_file = '../../styles/styles.css'
HTML(open(css_file, "r").read())
- How to import libraries, and where to find useful libraries.
- Explain why numpy is more useful than Python lists
- Use numpy for simple operations on arrays.
- Learn how to slice numpy arrays
- Read tabular data from a file into a program.
Before starting this lecture, it is assumed you are familiar with material in part 2 of the Python Bootcamp
As a reminder, most of Python's functionatility comes from the use of libraries. numpy
is one such library that is designed to work with arrays of data, which are useful to represent vectors and matrices. numpy
also provides routines to perform operations on the whole array in one line of code. We often import numpy using an alias to save typing:
import numpy as np
The core of numpy is the array
object. We'll start by looking at how numpy's arrays differ from a Python list
. We can create a numpy array from a Python list like so:
a_list = [1,2,3,4] # a python list
an_array = np.array( a_list ) # one way to make an array
an_array = np.array( [1,2,3,4] ) # another way
print(an_array)
We saw in the bootcamp how to access elements of a numpy array or a Python list:
print (a_list[0])
print (an_array[0])
print (a_list[1:3])
print (an_array[1:3])
a_list[-1] = 'a string inside a list'
print(a_list)
whereas numpy arrays can only contain a single type!
an_array[-1] = 'a string inside an array'
Information about the type of object in a numpy array is stored in the dtype
member
an_array.dtype
Once the array is created, the dtype
is fixed, and an array can only hold that type. For example, if we try and put a floating point number in our array, it gets converted to an integer:
an_array[-1] = 1.3141
print(an_array)
Right now, numpy arrays look less useful than Python lists!
But numpy arrays are fast, and smart. Numpy has methods which allow fast computations on all elements of the array at the same time, and some other tricks up it's sleeve that we'll look at later.
As a taster, let's compare two ways of calculating the sum of all the elements in a list or array. Make sure each function makes some sense to you before running the code.
def sum_python(a_list):
'''Add up all the values in a list or array using plain Python'''
sum = 0.0
for thing in a_list:
sum += thing
return sum
def sum_numpy(a_list):
'''Add up all the values in a list or array using numpy'''
return np.sum(a_list)
my_list = np.random.randn(1000) # use numpy to create an array of 1000 random numbers
%timeit sum_python(my_list)
%timeit sum_numpy(my_list)
We can see that using numpy allows us to avoid for
loops, and so leads to simpler code, but also that the numpy code is around 40 times faster!
Write two functions to calculate the mean of a list/array of numbers; one should use pure Python and the other should use numpy. Compare your answer to your neighbour. Are your answers similar? Which did you find easier to write?
Hint: the mean of a list of numbers is the sum of all the numbers, divided by the size of the list, so use the functions above as a template.
# Write your own code here
We have seen that we can create arrays from lists. Let's look at a few more ways of creating arrays. We can create arrays filled with zeros or ones:
zeros = np.zeros(5)
ones = np.ones(3)
print(zeros)
print(ones)
Numpy also offers the arange
function which returns an array of evenly spaced values:
np.arange(0, 5, 1) # min, max(exclusive), step
and the very useful linspace
function to create a linearly-spaced grid, with a fixed number of points and including both ends of the specified interval:
print ("A linear grid of 5 points between 0 and 2:")
print (np.linspace(0, 2, 5)) #min, max (inclusive), num_points
We've used 1D numpy arrays so far, but all the methods discussed work with arrays with 2 or more dimensions, e.g:
np.zeros( (2,2) )
x = np.random.normal(loc=5, scale=2, size=(5,4))
print(x)
You can create a 2D array from a list of lists:
list_of_lists = [ [1,2,3], [4,5,6], [7,8,9] ] # each list is a row
np.array(list_of_lists)
Arrays can even have their shape changed, as long as we don't change the total number of elements in the array:
x.reshape((10,2))
Reshaping an array, like most numpy operations, provides a view of the same bit of computer memory. Without running it, what do you expect the code below to print?
arr = np.arange(8) arr2 = arr.reshape(2, 4) arr[0] = 1000 print (arr) print (arr2)
Now run the code below. Did you get what you expected?
arr = np.arange(8)
arr2 = arr.reshape(2, 4)
arr[0] = 1000
print (arr)
print (arr2)
This lack of copying might be confusing at times, but it makes numpy very efficient and fast.
print ('Data type :', arr2.dtype)
print ('Total number of elements :', arr2.size)
print ('Number of dimensions :', arr2.ndim)
print ('Shape (dimensionality) :', arr2.shape)
print ('Minimum and maximum :', arr.min(), arr.max() )
print ('Mean and standard deviation :', arr.mean(), arr.std() )
Use the Tab completion feature of the Jupyter notebook to browse through all the methods belonging to the
arr2
object. Use theargmax
method to find the index and value of the largest value inarr2
# Write your code here
Arrays support all the normal math operators (+, -, / etc). The numpy library also contains a complete collection of basic mathematical functions that operate on arrays. In general, the math operators act on every element of an array in turn. Again, this means you don't have to use for loops to add all the numbers in two arrays together. For example:
arr1 = np.arange(4)
arr2 = np.arange(10, 14)
print (arr1, '+', arr2, '=', arr1+arr2)
In particular, multiplication is not matrix multiplication, but multiplies each element of the arrays together in turn:
print (arr1, '*', arr2, '=', arr1*arr2)
We can also multiply arrays by scalars:
print (arr1 * 3)
This is a first example of broadcasting.
In principle, to add two arrays their shapes must match. However, numpy will broadcast the shapes to match if possible. Broadcasting is a bit beyond the level of this course, but if you find yourself needing to understand it, it is described in the companion notebook 01a-broadcasting.ipynb
.
As we mentioned before, numpy ships with many mathematical functions that work on entire arrays, including logarithms, exponentials, and trigonometric functions. For example, calculating the sine function at 100 points between $0$ and $2\pi$ is as simple as:
x = np.linspace(0.0,2.0*np.pi,100)
y = np.sin(x)
Using Numpy to do calculations on lots of values at once is a great way to speed up your code and make it simpler to understand. If you ever find yourself looping over an array to do some calculation with each element in turn, stop, and see if Numpy can do the hard work for you.
numpy has a dot
method which gives the dot product between two vectors (one-dimensional arrays), or matrix multiplication when one or both of its arguments are matrices (two-dimensional arrays):
v1 = np.array([2, 3, 4])
v2 = np.array([1, 0, 1])
result = np.dot(v1,v2)
print (v1, '.', v2, '=', result)
Here is a regular matrix-vector multiplication, note that the array v1
should be viewed as a column vector in traditional linear algebra notation
A = np.arange(6).reshape(2, 3)
result = np.dot(A, v1)
print (A, 'x', v1, '=', result)
We can calculate transposes of arrays with arr.T
, and np.linalg
module contains methods to find eigenvectors, determinants and much more.
A system of linear equations can be written in matrix form.
$$3x + 6y - 5z = 12$$$$x - 3y + 2z = -2$$$$5x - y + 4z = 10$$$$\left[\begin{matrix} 3 & 6 & -5 \\ 1 & -3 & 2 \\ 5 & -1 & 4 \end{matrix}\right]
\left[\begin{matrix} x \\ y \\ z \end{matrix}\right] = \left[\begin{matrix} 12 \\ -2\\ 10 \end{matrix}\right]$$
Now, representing the matrix system as $\mathbf{AX} = \mathbf{B}$, we can see the solution is given by $\mathbf{X} = \mathbf{A^{-1}B}$.
Write some numpy code to find the solution of these equations.
Hint:
np.linalg.inv
can be used to find a matrices inverse...
# Write your code here
A = np.array([1,2,3,4,5,6,7,8])
A[1:6:2]
We can omit any part of the [lower:upper:step]
notation:
print (A[2:]) #from index 2 to the end, assumed step=1
print (A[:3]) # from index 0 to 3
print (A[::2]) # from 0 until the end, in steps of 2
And slicing works the same way for arrays with more dimensions:
B = np.arange(36).reshape((6,6))
print (B)
print (B[1:3]) # just rows 1-3
print (B[1:3, 1:3]) # columns 1-3 and rows 1-3
print (B[1, :]) # row 1, all columns
print (B[:, 0]) # column 0 (the first column), and all rows
We can use numpy arrays or python lists to slice numpy arrays as well! This is an extremely powerful technique.
row_indices = [1, 3, 5]
print (B[row_indices])
Why is this so useful? We can also use arrays of True
and False
values, known as masks as an index. If the mask is True
, the element will be selected. For example, we can calculate a mask to see if each element of B is odd:
is_odd = (B % 2 != 0) # does each element of B divide by two with no remainder?
print (is_odd)
And use that mask to select only the odd values:
print( B[is_odd] )
You can negate a mask (switch True and False) with ~
, so to find the even numbers we use:
print( B[~is_odd] )
Use
np.arange
to make a 1D array of values from 0 to 100 (inclusive) and calculate the sum of only the even values using fancy slicing. Compare your answer to your neighbour. Do you get the same answer? Is your code the same?
# Write your code here.
It is perfectly possible to read and write files without using numpy. Files are opened using the open
function. This returns a file object which is used to read or write the file. It is most often used with two arguments
file_handle = open(filename,mode)
The second argument is a string containing a few characters describing the way in which the file will be used. mode can be r
when the file will only be read, w
for only writing (an existing file with the same name will be erased), and a
opens the file for appending; any data written to the file is automatically added to the end. The mode argument is optional, if it is missing, r
is assumed.
For example, I have a file with some basic data on temperature in Stockholm. Below I show how you would open a file and read in the lines in pure Python. Notice how the open
function gets what we call a relative path for the file; this is how to get to the file from the location of the notebook on the filesystem. The symbol ".." means "go up one directory".
file_handle = open('../../data/Session1/td_stockholm.dat','r') # open file for reading
lines = file_handle.readlines() # read in all the lines into a list
for line in lines[0:5]: # iterate over the first few lines
print(line) # print out line
file_handle.close()
You can see this is a space separated text file. If we wanted to convert this data to a numpy array we can imagine writing the code to do so. It would:
This code is obviously going to be very similar for each file and it would make sense to write a function you could re-use. But the first lesson of Python club is to see if someone else has done it for you! In this case they have, and we can look at np.loadtxt
:
np.loadtxt?
Looking at the help for this function, we can see that we can read in the file above with:
data = np.loadtxt('../../data/Session1/td_stockholm.dat', skiprows=1)
print(data.shape)
We've got our 7 columns and 77431 rows. We can use slicing to access individual columns:
year = data[:, 0] # first column
month = data[:, 1]
day = data[:, 2]
temperature = data[:, 5]
# calculate a 'decimal year' for each observation
# NOTICE HOW NUMPY ALLOWS US TO CALCULATE EVERY ROW IN ONE LINE.
obs_time = year + month/12 + day/31
And plot the data (plotting is discussed in another lecture):
%matplotlib inline
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(14,4))
ax.plot(obs_time,temperature)
ax.axis('tight')
ax.set_title('tempeatures in Stockholm')
ax.set_xlabel('year')
ax.set_ylabel('tempature (C)');
Suppose we want to save our data to a file? np.savetxt
will take a 2D numpy array and save it to a simple text file. We can stack our 1D columns into a 2D array with np.column_stack
:
data_out = np.column_stack([obs_time, temperature])
And we can save this 2D array to a comma-seperated value file with np.savetxt
:
np.savetxt('output.csv', data_out, delimiter=',')
!head output.csv
Read the help file for
np.loadtxt
and try and read only the minimum and maximum temperature from the file. Place the data directly in two numpy arrays. Your call should start liketemp_min, temp_max = np.loadtxt(...Save the min and max temp to a new file in a format of your choice using
np.savetxt
#Write your code here
In the data directory is a file named 'sdss_wds.csv'. This is a comma-seperated values file of data about white dwarfs discovered in the Sloan Digital Sky Survey. Your job is to investigate how the median mass of the white dwarfs depends on temperature.
The code below prints out the first few lines of the file. As you can see, there is one header row which tells us what the columns are.
lines = open('../../data/Session1/sdss_wds.csv', 'r').readlines()
for line in lines[0:4]:
print (line)
Look at the documentation for
np.loadtxt
and work out how to read in the file above.
Use
np.loadtxt
to load the white dwarf file into an array calleddata
.
If you are successful, your code will pass the tests below each question. You can also use these tests to see if your code gives the right answer.
# WRITE YOUR OWN CODE HERE
from nose.tools import assert_tuple_equal, assert_false, assert_true
assert_tuple_equal(data.shape, (13673, 6))
np.loadtxt
has read in our text files, and we have placed the contents into a 2D array called "data". Complete the function outline below to create a function which will take this 2D array and return a user-specified column. The function should return a 1D array of the column values.
Get into the habit of reading the documentation at the start of the function to understand how it should work, and what the function arguments should be.
Use defensive programming and assert statements to make sure that the function only accepts valid column numbers.
Again, try and get your code to pass the tests that I have written in the cells below.
def get_column(data, col_num):
"""Filter a numpy array and return the requested column.
Given a 2D numpy array, and a column number, this function returns the
requested column number as a 1D numpy array.
Args:
data (numpy.ndarray): the array to filter
col_num (int): the column number to return (e.g `0` for the first column in the array).
"""
# here's an example of using defensive programming to make sure data is a 2D array
# notice how we run these checks before doing anything else
assert(len(data.shape) == 2), "data must be 2D array"
# ADD YOUR CODE HERE
from nose.tools import assert_equal, assert_almost_equal
teff = get_column(data, 4)
mass = get_column(data, 5)
assert_equal(len(teff), 13673)
assert_equal(len(mass), 13673)
assert_almost_equal(teff.mean(), 17103.082278943904)
from nose.tools import assert_raises
assert_raises(AssertionError, get_column, data, -1 )
assert_raises(AssertionError, get_column, data, 6 )
What is the median mass of the white dwarfs found in the Sloan Digital Sky Survey? Store your answer in a variable named
median_mass
. Store the mean mass in a variable namedmean_mass
# YOUR CODE HERE
from nose.tools import assert_almost_equal
assert_almost_equal(median_mass, 0.5939999999999999)
assert_almost_equal(mean_mass, 0.62164258026768093)
Why do the mean and median masses differ? What does this tell you about the shape of the mass distribution of white dwarfs discovered in the Sloan Digital Sky Survey?
YOUR ANSWER HERE
Use fancy slicing to find the mean mass of white dwarfs cooler than 12,000K. Store your answer in the variable
mean_mass_coolwd
.
# YOUR CODE HERE
from nose.tools import assert_almost_equal
assert_almost_equal(mean_mass_coolwd, 0.67178046054102403)
Extra credit questions allow you to make up for marks dropped in this and other homeworks. You can't score more than 100% overall, but if you get 2 extra credit points this week, and lose 2 points next week, you'd still be on course for 100% marks. I don't expect you to answer extra credit questions, unless you want to.
For extra credit this week, use the techniques you've learned above and your own statistical knowledge to justify whether the cool white dwarfs are significantly different in mass to the hotter white dwarfs. Use the code cell to make any calculations you need, and the markdown cell to justify your answer.
# YOUR CODE HERE