So far we have talked about “data” and “results”, but what are the pieces of information that we want to manipulate? Typically numbers (or groups of numbers) or text (or lists of text). Different kinds of data can be used for different things: numbers can be combined in mathematical expressions; text can be printed, searched, or reorganised; numbers can be arranged by magnitude; names can be arranged in alphabetical order. In Python, these differences are represented by different data types.
We will discuss two kinds of numeric types: integers and floating point numbers. Python has other built-in numeric data types, including complex numbers, which are useful in specialised cases.
Whole numbers, without decimal points, are integers, e.g. 1, 6, 2331.
Numbers with decimal points are floating point numbers or “floats”, e.g. 1.0, 232.141, 1.3e5.
That last example, 1.3e5, uses scientific notation and is shorthand for 130000.0.
Very large and very small numbers can be written using scientific notation. For example, instead of 0.0000241, we would normally write $2.41 \times 10^{-5}$. In Python this would be written as 2.41e-5:
2.41e-5 == 0.0000241
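As a quick check of the distinction between integers and floats, the built-in type function reports which kind of number a value is. This is a minimal sketch; the variable names are chosen here purely for illustration.

```python
# Integers have no decimal point; floats do.
whole = 2331
decimal = 232.141
print(type(whole))    # <class 'int'>
print(type(decimal))  # <class 'float'>

# A number written in scientific notation is always a float,
# even when its value happens to be a whole number.
large = 1.3e5
print(type(large))        # <class 'float'>
print(large == 130000.0)  # True
```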
A string is any sequence of text characters. We indicate that a sequence of text is a string, and not a Python command, by enclosing it in single or double quotes. Being able to use either quote type allows strings that themselves contain quotes.
'this is a string using single quotes'
"this is a string using double quotes"
'this string has "nested quotes"'
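The quoting rules can be tried out directly. The sketch below uses made-up variable names and shows that the choice of quote type does not change the string itself, only which quote characters can appear inside it.

```python
# Single and double quotes create the same kind of object.
single = 'peach'
double = "peach"
print(single == double)  # True

# Use one quote type on the outside so the other can appear inside.
quoted = 'this string has "nested quotes"'
apostrophe = "it's a string"
print(quoted)
print(apostrophe)

# Putting quotes around a number makes it text, not a number.
print(type('42'))  # <class 'str'>
```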
Python also contains built-in data types for collections of things. For data analysis we often deal with sets of numbers. These can be collected in lists.
A list is denoted by a series of items separated by commas, and enclosed in square brackets:
my_list = [ 1, 2, 3, 4 ]
my_list
although lists can contain any set of Python objects, even other lists:
my_other_list = [ 4, 1.5, 'peach' ]
my_other_list
both_lists = [ my_list, my_other_list ]
both_lists
To refer to one element in a list, use the index of that element. Index numbering counts the number of jumps along the sequence, so starts at zero.
# 1st element (zero jumps along the sequence)
print( my_other_list[0] )
# 2nd element (one jump along the sequence)
print( my_other_list[1] )
# 3rd element (two jumps along the sequence)
print( my_other_list[2] )
Using an index outside the range of elements in the list will produce an error. For example, my_other_list has three elements, but my_other_list[3] tries to return the 4th element (which does not exist):
print( my_other_list[3] )
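The indexing rules above can be tried on the same three-element list. The try/except wrapper is an addition made here so that the out-of-range lookup reports its error instead of stopping the example; the helper names (first, third, nested, out_of_range) are illustrative only.

```python
my_list = [1, 2, 3, 4]
my_other_list = [4, 1.5, 'peach']
both_lists = [my_list, my_other_list]

first = my_other_list[0]   # zero jumps along the sequence
third = my_other_list[2]   # two jumps along the sequence
print(first, third)        # 4 peach

# Indexing a list of lists: pick the inner list, then the element.
nested = both_lists[1][2]
print(nested)              # peach

# Index 3 would be a 4th element, which does not exist.
try:
    my_other_list[3]
    out_of_range = False
except IndexError as error:
    out_of_range = True
    print('IndexError:', error)
```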
You can also refer to a sequence of elements by giving a range as the index:
# run this cell to create the list `alphabet`
alphabet = [ 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z' ]
For example, alphabet[3:9] starts from three jumps and finishes at eight jumps (the end index, 9, is not included), i.e. elements 4 to 9.
Negative numbers count backwards from the end of the sequence.
For example, alphabet[-9:-3] runs from the 9th from the end up to the 4th from the end.
And leaving out one of the numbers in the range will include all elements up to the start or end of the sequence.
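The three kinds of range described above can be collected in one runnable sketch. Building the alphabet from string.ascii_lowercase is a shortcut taken here to keep the example short; the specific index values are chosen for illustration.

```python
import string

# The same 26 letters as the list above, built with the standard library.
alphabet = list(string.ascii_lowercase)

middle = alphabet[3:9]      # three jumps up to (not including) nine jumps
print(middle)               # ['d', 'e', 'f', 'g', 'h', 'i']

from_end = alphabet[-9:-3]  # 9th from the end up to the 4th from the end
print(from_end)             # ['r', 's', 't', 'u', 'v', 'w']

start = alphabet[:3]        # no start index: begin at the first element
end = alphabet[23:]         # no end index: continue to the last element
print(start, end)           # ['a', 'b', 'c'] ['x', 'y', 'z']
```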
Although lists can be very useful for handling ordered collections of things, for data manipulation we usually deal with ordered lists of only numbers. The flexibility of lists means using them is (relatively) computationally slow. This is not an issue for small data sets, but can be prohibitive for large data sets, with perhaps millions or more entries.
An alternative data type, specifically designed for manipulating (large) numerical data sets, is the numpy array.
numpy is a module for numerical scientific computing with Python, and is conventionally imported via
import numpy as np
This is similar to the import math we saw above, but uses the as keyword to give numpy a shorter name, np, which makes it more convenient to work with.
import math
math.sqrt(4)
With numpy imported (as np) we can store lists of numbers as numpy arrays:
import numpy as np
a = np.array( [ 1, 2, 3, 4 ] )
a
You can think of a 1-dimensional numpy array as a vector, and we can use very compact code to perform vector mathematical operations on the entire array.
a + 1
a ** 2
** is the power operator. This code calculates $a^2$ for every number stored in a.
In both these cases, the mathematical operation (add one; square) is applied to every element in the array, and a new array with all the results is returned.
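To see why this matters, it helps to compare the same operations on a plain list. The sketch below assumes numpy is installed; the list comprehension is just one way of getting the element-wise result without it.

```python
import numpy as np

numbers = [1, 2, 3, 4]
a = np.array(numbers)

# With an array, the operation is applied to every element.
plus_one = a + 1   # array([2, 3, 4, 5])
squared = a ** 2   # array([ 1,  4,  9, 16])

# With a list, the same result needs an explicit loop or comprehension.
plus_one_list = [x + 1 for x in numbers]

# Note that * on a list repeats it rather than multiplying the elements.
repeated = numbers * 2   # [1, 2, 3, 4, 1, 2, 3, 4]
doubled = a * 2          # array([2, 4, 6, 8])
print(plus_one, squared, repeated, doubled)
```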
If the mathematical expression contains two (or more) arrays, then an element-by-element operation is performed:
e.g. vector addition:
b = np.array( [ 5, 6, 7, 8 ] )
a + b
a * b
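Element-by-element means that each number in the first array is paired with the number at the same position in the second. A minimal sketch, with the pairing also written out by hand for comparison (the name product_by_hand is illustrative):

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])

total = a + b     # array([ 6,  8, 10, 12])
product = a * b   # array([ 5, 12, 21, 32])

# The same products computed pair by pair on plain Python numbers.
product_by_hand = [x * y for x, y in zip(a.tolist(), b.tolist())]
print(total, product, product_by_hand)
```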
Let us try to calculate the square root of all the numbers in a:
from math import sqrt
sqrt(a)
This gives an error. Because numpy is not part of the standard Python library, the sqrt function provided by the math module does not know how to treat a numpy array of numbers. To do what we want we can use the sqrt function in numpy:
import numpy as np
a = np.array( [ 1, 2, 3, 4 ] )
np.sqrt(a)
# This cell tests your answers from the three previous code cells.
# You do not need to edit it
assert _ == math.sqrt(1)
assert _ == math.sqrt(2)
assert _ == math.sqrt(3)
assert _ == math.sqrt(4)
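The difference between the two sqrt functions can be seen side by side. In this sketch the try/except wrapper is an addition so that the failing call reports its error and the example carries on; the names math_worked, roots and matches are illustrative.

```python
import math
import numpy as np

a = np.array([1, 2, 3, 4])

# math.sqrt expects a single number, so a multi-element array is rejected.
try:
    math.sqrt(a)
    math_worked = True
except TypeError as error:
    math_worked = False
    print('TypeError:', error)

# np.sqrt is applied to each element in turn and returns a new array.
roots = np.sqrt(a)
print(roots)

# Each entry matches math.sqrt applied to the corresponding single number.
matches = [math.isclose(r, math.sqrt(n)) for r, n in zip(roots, a)]
print(all(matches))  # True
```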
numpy contains a great many functions for performing mathematical operations on arrays of numbers, which are all listed in the numpy documentation.
To limit the number of decimal places in our result we can use np.round:
np.round( np.sqrt(a), 2 ) # round the result to 2 decimal places
Notice that here the first argument is np.sqrt(a), which is itself a call to a numpy function. This is analogous to a function of a function in mathematics: $f(g(x))$. Nesting functions like this helps you write compact code without storing the intermediate results. Nesting several functions can make your code confusing to read, however, and your primary goal should be to write clear, understandable code.
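The trade-off between nesting and clarity can be seen by writing the same calculation both ways. This is only a sketch; the intermediate names roots and rounded are chosen for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

# Nested: the result of np.sqrt(a) is passed straight to np.round.
nested = np.round(np.sqrt(a), 2)

# Step by step: the same calculation with a named intermediate result.
roots = np.sqrt(a)
rounded = np.round(roots, 2)

# Both versions hold the values 1.0, 1.41, 1.73 and 2.0.
print(nested)
print(rounded)
```

For a single level of nesting the compact form is usually easy to read; with three or four levels, named intermediate steps tend to be clearer.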
Often, we will want to use numpy arrays to store experimental data. Other times we might just want a list of numbers, e.g. from 1 to 20. We could write these out to create the array:
one_to_twenty = np.array( [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 ] )
To save typing (and make your code easier to read), numpy contains a function, arange, for creating sequences of numbers:
n = np.arange(1,21)
n
arange gives us numbers starting from 1, up to, but not including, 21.
We can generate sequences of numbers with different spacings by providing a step size (which has a default value of 1):
m = np.arange(2,21,2)
m
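A short sketch that checks the two arange calls above: the end value is excluded, and the optional third argument sets the step size. The comparison against the hand-typed array is added here for illustration.

```python
import numpy as np

n = np.arange(1, 21)        # 1, 2, ..., 20 (21 is not included)
m = np.arange(2, 21, 2)     # 2, 4, ..., 20 (step size of 2)

print(len(n), n[0], n[-1])  # 20 1 20
print(len(m), m[0], m[-1])  # 10 2 20

# arange produces the same values as the array typed out by hand.
one_to_twenty = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                          11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
same = np.array_equal(n, one_to_twenty)
print(same)                 # True
```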
Another way to generate an evenly spaced sequence of numbers is to use linspace:
p = np.linspace(0,10,50)
p
linspace() takes three arguments: the starting number, the end number, and the total number of values in the sequence.
linspace() is particularly useful for generating evenly spaced points that are not integers.
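One difference from arange is worth checking: with its default settings, linspace includes the end number. The sketch below confirms this for the array above and adds a shorter example (quarters, a name chosen here) with non-integer values.

```python
import numpy as np

p = np.linspace(0, 10, 50)
print(len(p), p[0], p[-1])           # 50 0.0 10.0

# The end value is included, so 50 points leave 49 equal gaps.
spacing = p[1] - p[0]
print(np.isclose(spacing, 10 / 49))  # True

# A shorter example with non-integer values: 0, 0.25, 0.5, 0.75, 1.
quarters = np.linspace(0, 1, 5)
print(quarters)
```

Use arange when you know the step size you want, and linspace when you know how many points you want.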