(last updated 2020-05-18)
The Python Data Stack comprises a set of packages that makes Python a powerful data science language. These include
These were the original stack that was meant to replace Matlab. However, these were meant to tackle purely numerical data, and the kinds of heterogeneous data we regularly face needed more tools. These were added more recently.
This is the basic stack of packages we will be using in this workshop. Additionally we will use a few packages that add some functionality to the data science process. These include
seaborn: Better statistical graphs
plotly: Interactive graphics
biopython: Python for bioinformatics
We may also introduce the package rpy2
which allows one to run R from within Python. This can be useful since many bioinformatic pipelines are already implemented in R.
The PyData stack also includes
sympy
, a symbolic mathematics package emulating Maple
We start by importing the Numpy package into Python using the alias np
.
import numpy as np
Numpy provides both arrays (vectors, matrices, higher dimensional arrays) and vectorized functions which are very fast. Let's see how this works.
z = [1,2,3,4,5,6,7,8,9.3,10.6] # This is a list
z_array = np.array(z)
z_array
Now, we have already seen functions in Python earlier. In Numpy, there are functions that are optimized for arrays, that can be accessed directly from the array objects. This is an example of object-oriented programming in Python, where functions are provided for particular classes of objects, and which can be directly accessed from the objects. We will use several such functions over the course of this workshop, but we won't actually talk about how to do this program development here.
Numpy functions are often very fast, and are vectorized, i.e., they are written to work on vectors of numbers rather than single numbers. This is an advantage in data science since we often want to do the same operation to all elements of a column of data, which is essentially a vector
We apply the functions sum
, min
(minimum value) and max
(maximum value) to z_array
.
z_array.sum()
z_array.min()
z_array.max()
The versions of these functions in Numpy are optimized for arrays and are quite a bit faster than the corresponding functions available in base Python. When doing data work, these are the preferred functions.
These functions can also be used in the usual function manner:
np.max(z_array)
Calling np.max
ensures that we are using the max
function from numpy, and not the one in base Python.
Numpy arrays are homogeneous in type.
np.array(['a','b','c'])
np.array([1,2,3,6,8,29])
But, what if we provide a heterogeneous list?
y = [1,3,'a']
np.array(y)
So what's going on here? Upon conversion from a heterogeneous list, numpy converted the numbers into strings. This is necessary since, by definition, numpy arrays can hold data of a single type. When one of the elements is a string, numpy casts all the other entities into strings as well. Think about what would happen if the opposite rule was used. The string 'a' doesn't have a corresponding number, while both numbers 1 and 3 have corresponding string representations, so going from string to numeric would create all sorts of problems.
The advantage of numpy arrays is that the data is stored in a contiguous section of memory, and you can be very efficient with homogeneous arrays in terms of manipulating them, applying functions, etc. However,
numpy
does provide a "catch-all"dtype
calledobject
, which can be any Python object. Thisdtype
essentially is an array of pointers to actual data stored in different parts of the memory. You can get to the actual objects by extracting them. So one could do
np.array([1,3,'a'], dtype='object')
which would basically be a valid
numpy
array, but would go back to the actual objects when used, much like a list. We can see this later if we want to transform a heterogeneouspandas
DataFrame
into anumpy
array. It's not particularly useful as is, but it prevents errors from popping up during transformations frompandas
tonumpy
.
We had seen earlier how we could generate a sequence of numbers in a list using range
. In numpy, you can generate a sequence of numbers in an array using arange
(which actually creates the array rather than provide an iterator like range
).
np.arange(10)
You can also generate regularly spaced sequences of numbers between particular values
np.linspace(start=0, stop=1, num=11) # or np.linspace(0, 1, 11)
You can also do this with real numbers rather than integers.
np.linspace(start = 0, stop = 2*np.pi, num = 10)
More generally, you can transform lists into numpy
arrays. We saw this above for vectors. For matrices, you can provide a list of lists. Note the double [
in front and back.
np.array([[1,3,5,6],[4,3,9,7]])
You can generate an array of 0's
np.zeros(10)
This can easily be extended to a two-dimensional array (a matrix), by specifying the dimension of the matrix as a tuple.
np.zeros((10,10))
You can also generate a matrix of 1s in a similar manner.
np.ones((3,4))
In matrix algebra, the identity matrix is important. It is a square matrix with 1's on the diagonal and 0's everywhere else.
np.eye(4)
You can also create numpy vectors directly from lists, as long as lists are made up of atomic elements of the same type. This means a list of numbers or a list of strings. The elements can't be more composite structures, generally. One exception is a list of lists, where all the lists contain the same type of atomic data, which, as we will see, can be used to create a matrix or 2-dimensional array.
a = [1,2,3,4,5,6,7,8]
b = ['a','b','c','d','3']
np.array(a)
np.array(b)
Generating random numbers is quite useful in many areas of data science. All computers don't produce truly random numbers but generate pseudo-random sequences. These are completely deterministic sequences defined algorithmically that emulate the properties of random numbers. Since these are deterministic, we can set a seed or starting value for the sequence, so that we can exactly reproduce this sequence to help debug our code. To actually see how things behave in simulations we will often run several sequences of random numbers starting at different seed values.
The seed is set by the RandomState
function within the random
submodule of numpy. Note that all Python names are case-sensitive.
rng = np.random.RandomState(35) # set seed
rng.randint(0, 10, (3,4))
We have created a 3x4 matrix of random integers between 0 and 10 (in line with slicing rules, this includes 0 but not 10).
We can also create a random sample of numbers between 0 and 1.
rng.random_sample((5,2))
We'll see later how to generate random numbers from particular probability distributions.
Numpy generates arrays, which can be of arbitrary dimension. However the most useful are vectors (1-d arrays) and matrices (2-d arrays).
In these examples, we will generate samples from the Normal (Gaussian) distribution, with mean 0 and variance 1.
A = rng.normal(0,1,(4,5))
We can compute some characteristics of this matrix's dimensions. The number of rows and columns are given by shape
.
A.shape
The total number of elements are given by size
.
A.size
If we want to create a matrix of 0's with the same dimensions as A
, we don't actually have to compute its dimensions. We can use the zeros_like
function to figure it out.
np.zeros_like(A)
We can also create vectors by only providing the number of rows to the random sampling function. The number of columns will be assumed to be 1.
B = rng.normal(0, 1, (4,))
B
The syntax for extracting elements from arrays is almost exactly the same as for lists, with the same rules for slices.
Exercise: State what elements of B are extracted by each of the following statements
B[:3]
B[:-1]
B[[0,2,4]]
B[[0,2,5]]
For matrices, we have two dimensions, so you can slice by rows, or columns or both.
A
We can extract the first column by specifying :
(meaning everything) for the rows, and the index for the column (reminder, Python starts counting at 0)
A[:,0]
Similarly the 4th row can be extracted by putting the row index, and :
for the column index.
A[3,:]
All slicing operations work for rows and columns
A[:2,:2]
We can do a variety of vector and matrix operations in numpy
.
First, all usual arithmetic operations work on arrays, like adding or multiplying an array with a scalar.
A = rng.randn(3,5)
A
A + 10
We can also add and multiply arrays element-wise as long as they are the same shape.
B = rng.randint(0,10, (3,5))
B
A + B
A * B
You can also do matrix multiplication. Recall what this is.
If you have a matrix $A_{m x n}$ and another matrix $B_{n x p}$, as long as the number of columns of $A$ and rows of $B$ are the same, you can multiply them ($C_{m x p} = A_{m x n}B_{n x p}$), with the (i,j)-th element of C being
$$ c_{ij} = \sum_{k=1}^n a_{ik}b_{kj}, i= 1, \dots, m; j = 1, \dots, p$$In numpy
the operant for matrix multiplication is @
.
In the above examples, A
and B
cannot be multiplied since they have incompatible dimensions. However, we can take the transpose of B
, i.e. flip the rows and columns, to make it compatible with A
for matrix multiplication.
A @ np.transpose(B)
np.transpose(A) @ B
More generally, you can reshape a numpy
array into a new shape, provided it is compatible with the number of elements in the original array.
D = rng.randint(0,5, (4,4))
D
D.reshape(8,2)
D.reshape(1,16)
This can also be used to cast a vector into a matrix.
e = np.arange(20)
E = e.reshape(5,4)
E
One thing to note in all the reshaping operations above is that the new array takes elements of the old array by row. See the examples above to convince yourself of that.
You can sum all the elements of a matrix using sum
. You can also sum along rows or along columns by adding an argument to the sum
function.
A = rng.normal(0, 1, (4,2))
A
A.sum()
You can sum along rows (i.e., down columns) with the option axis = 0
A.sum(axis=0)
You can sum along columns (i.e., across rows) with axis = 1
.
A.sum(axis=1)
Of course, you can use the usual function calls:
np.sum(A, axis = 1)
We can also find the minimum and maximum values.
A.min(axis = 0)
A.max(axis = 0)
We can also find the position where the minimum and maximum values occur.
A.argmin(axis=0)
A.argmax(axis=0)
We can sort arrays and also find the indices which will result in the sorted array. I'll demonstrate this for a vector, where it is more relevant
a = rng.randint(0,10, 8)
a
np.sort(a)
np.argsort(a)
a[np.argsort(a)]
np.argsort
can also help you find the 2nd smallest or 3rd largest value in an array, too.
ind_2nd_smallest = np.argsort(a)[1]
a[ind_2nd_smallest]
ind_3rd_largest = np.argsort(a)[-3]
a[ind_3rd_largest]
You can also sort strings in this way.
m = np.array(['Aram','Raymond','Elizabeth','Donald','Harold'])
np.sort(m)
If you want to sort arrays in place, you can use the sort
function in a different way.
m.sort()
m
We can put arrays together by row or column, provided the corresponding axes have compatible lengths.
A = rng.randint(0,5, (3,5))
B = rng.randint(0,5, (3,5))
print('A = ', A)
print('B = ', B)
np.hstack((A,B))
np.vstack((A,B))
Note that both hstack
and vstack
take a tuple of arrays as input.
You can query a matrix to see which elements meet some criterion. In this example, we'll see which elements are negative.
A < 0
This is called masking, and is useful in many contexts.
We can extract all the negative elements of A using
A[A<0]
This forms a 1-d array. You can also count the number of elements that meet the criterion
np.sum(A<0)
Since the entity A<0
is a matrix as well, we can do row-wise and column-wise operations as well.
One has to be a bit careful with copying objects in Python. By default, if you just assign one object to a new name, it does a shallow copy, which means that both names point to the same memory. So if you change something in the original, it also changes in the new copy.
A[0,:]
A1 = A
A1[0,0] = 4
A[0,0]
To actually create a copy that is not linked back to the original, you have to make a deep copy, which creates a new space in memory and a new pointer, and copies the original object to the new memory location
A1 = A.copy()
A1[0,0] = 6
A[0,0]
You can also replace sub-matrices of a matrix with new data, provided that the dimensions are compatible. (Make sure that the sub-matrix we are replacing below truly has 2 rows and 2 columns, which is what np.eye(2)
will produce)
A[:2,:2] = np.eye(2)
A
Sometimes the output of some operation ends up being a matrix of one column or one row. We can reduce it to become a vector. There are two functions that can do that, flatten
and ravel
.
A = rng.randint(0,5, (5,1))
A
A.flatten()
A.ravel()
So why two functions? I'm not sure, but they do different things behind the scenes. flatten
creates a copy, i.e. a new array disconnected from A
. ravel
creates a view, so a representation of the original array. If you then changed a value after a ravel
operation, you would also change it in the original array; if you did this after a flatten
operation, you would not.
Python deals with arrays in an interesting way, in terms of matching up dimensions of arrays for arithmetic operations. There are 3 rules:
A = rng.normal(0,1,(4,5))
B = rng.normal(0,1,5)
A.shape
B.shape
A - B
B is 1-d, A is 2-d, so B's shape is made into (1,5) (added to the left). Then it is repeated into 4 rows to make it's shape (4,5), then the operation is performed. This means that we subtract the first element of B from the first column of A, the second element of B from the second column of A, and so on.
You can be explicit about adding dimensions for broadcasting by using np.newaxis
.
B[np.newaxis,:].shape
B[:,np.newaxis].shape
This can be very useful, since these operations are faster than for loops. For example:
d = rng.random_sample((10,2))
d
We want to find the Euclidean distance (the sum of squared differences) between the points defined by the rows. This should result in a 10x10 distance matrix
d.shape
d[np.newaxis,:,:]
creates a 3-d array with the first dimension being of length 1
d[np.newaxis,:,:].shape
d[:, np.newaxis,:]
creates a 3-d array with the 2nd dimension being of length 1
d[:,np.newaxis,:].shape
Now for the trick, using broadcasting of arrays. These two arrays are incompatible without broadcasting, but with broadcasting, the right things get repeated to make things compatible
dist_sq = np.sum((d[:,np.newaxis,:] - d[np.newaxis,:,:]) ** 2)
dist_sq.shape
dist_sq
Whoops! we wanted a 10x10 matrix, not a scalar.
(d[:,np.newaxis,:] - d[np.newaxis,:,:]).shape
What we really want is the 10x10 distance matrix.
dist_sq = np.sum((d[:,np.newaxis,:] - d[np.newaxis,:,:]) ** 2, axis=2)
You can verify what is happening by creating D = d[:,np.newaxis,:]-d[np.newaxis,:,:]
and then looking at D[:,:,0]
and D[:,:,1]
. These are the difference between each combination in the first and second columns of d, respectively. Squaring and summing along the 3rd axis then gives the sum of squared differences.
dist_sq
dist_sq.shape
dist_sq.diagonal()
It's important to understand numpy and arrays, since most data sets we encounter are rectangular. The notations and operations we saw in numpy will translate to data, except for the fact that data is typically heterogeneous, i.e., of different types. The problem with using numpy for modern data analysis is that if you have mixed data types, it will all be coerced to strings, and then you can't actually do any data analysis.
The solution to this issue (which is also present in Matlab) came about with the pandas
package, which is the main workhorse of data science in Python