__author__ = "Christopher Potts, Will Monroe, and Lucy Li"
__version__ = "CS224u, Stanford, Spring 2020"
Why should we care about NumPy? It lets us express operations over whole vectors and matrices compactly and efficiently, and it is the foundation for most numerical and machine learning code in Python (including the files prefixed np_ in your cs224u directory). In Jupyter notebooks, NumPy documentation is two clicks away: Help -> NumPy Reference.
import numpy as np
np.zeros(5)
np.ones(5)
# convert list to numpy array
np.array([1,2,3,4,5])
# convert numpy array to list
np.ones(5).tolist()
# one float => all floats
np.array([1.0,2,3,4,5])
# same as above
np.array([1,2,3,4,5], dtype='float')
# spaced values in interval
np.array([x for x in range(20) if x % 2 == 0])
# same as above
np.arange(0,20,2)
# random floats in [0, 1)
np.random.random(10)
# random integers
np.random.randint(5, 15, size=10)
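Once you have built an array, its dtype and shape attributes tell you what you actually got; a small sketch:

```python
import numpy as np

a = np.arange(0, 20, 2)   # even integers 0..18
print(a.dtype)            # platform integer dtype, e.g. int64
print(a.shape)            # (10,)

b = a.astype('float')     # same values, converted to a float dtype
print(b.dtype)            # float64
```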
x = np.array([10,20,30,40,50])
x[0]
# slice
x[0:2]
# slicing past the end is safe; it just stops at the last element
x[0:1000]
# last value
x[-1]
# last value as array
x[[-1]]
# last 3 values
x[-3:]
# pick indices
x[[0,2,4]]
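Besides lists of indices, NumPy also accepts boolean masks, which select the elements where the mask is True; a small sketch:

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50])

mask = x > 25           # elementwise comparison yields a boolean array
print(mask)             # [False False  True  True  True]
print(x[mask])          # [30 40 50]
print(x[x % 20 == 0])   # masks can be built inline: [20 40]
```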
Be careful when assigning arrays to new variables!
#x2 = x # try this line instead
x2 = x.copy()
x2[0] = 10
x2
x2[[1,2]] = 10
x2
x2[[3,4]] = [0, 1]
x2
# check if the original vector changed
x
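To see why .copy() matters, here is a small sketch contrasting plain assignment (which aliases the same underlying data) with an explicit copy:

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50])

alias = x          # no copy: both names refer to the same data
alias[0] = 99
print(x[0])        # 99 -- the "original" changed too

x = np.array([10, 20, 30, 40, 50])
safe = x.copy()    # independent copy of the data
safe[0] = 99
print(x[0])        # 10 -- original untouched
```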
x.sum()
x.mean()
x.max()
x.argmax()
np.log(x)
np.exp(x)
x + x # Try also with *, -, /, etc.
x + 1
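These operations are all elementwise, and scalars broadcast across the whole array; a small sketch:

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50])

print(x + x)    # [ 20  40  60  80 100]
print(x * 2)    # same result via scalar broadcasting
print(x / 10)   # [1. 2. 3. 4. 5.] -- division yields floats
print(x ** 2)   # elementwise power
```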
Vectorizing your mathematical expressions can lead to huge performance gains. The following example is meant to give you a sense of the difference: it compares applying np.log to each of 10 million values in a list, one at a time, with the same operation applied to a whole vector at once.
# log every value as list, one by one
def listlog(vals):
    return [np.log(y) for y in vals]
# get random vector
samp = np.random.random_sample(int(1e7))+1
samp
%time _ = np.log(samp)
%time _ = listlog(samp)
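%time is a Jupyter magic; outside a notebook you can make the same comparison with the standard-library timeit module. A sketch (with a smaller sample so it runs quickly):

```python
import timeit
import numpy as np

def listlog(vals):
    # log every value one by one, in pure Python
    return [np.log(y) for y in vals]

samp = np.random.random_sample(int(1e5)) + 1

t_vec = timeit.timeit(lambda: np.log(samp), number=10)
t_list = timeit.timeit(lambda: listlog(samp), number=10)
print(f"vectorized: {t_vec:.4f}s, list comprehension: {t_list:.4f}s")
```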
The matrix is the core object of machine learning implementations.
np.array([[1,2,3], [4,5,6]])
np.array([[1,2,3], [4,5,6]], dtype='float')
np.zeros((3,5))
np.ones((3,5))
np.identity(3)
np.diag([1,2,3])
X = np.array([[1,2,3], [4,5,6]])
X
X[0]
X[0,0]
# get row
X[0, : ]
# get column
X[ : , 0]
# get multiple columns
X[ : , [0,2]]
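Row slices, column lists, and boolean masks all compose on matrices too; a small sketch:

```python
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6]])

print(X[:, [0, 2]])   # columns 0 and 2, shape (2, 2)
print(X[[0], :])      # first row, kept as a 2-D array of shape (1, 3)
print(X[X > 3])       # boolean mask flattens to 1-D: [4 5 6]
```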
# X2 = X # try this line instead
X2 = X.copy()
X2
X2[0,0] = 20
X2
X2[0] = 3
X2
X2[: , -1] = [5, 6]
X2
# check if original matrix changed
X
z = np.arange(1, 7)
z
z.shape
Z = z.reshape(2,3)
Z
Z.shape
Z.reshape(6)
# same as above
Z.flatten()
# transpose
Z.T
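reshape can infer one dimension for you if you pass -1, which is handy when you only know the number of rows (or columns) you want; a small sketch:

```python
import numpy as np

z = np.arange(1, 7)
Z = z.reshape(2, -1)          # NumPy infers the 3 columns

print(Z.shape)                # (2, 3)
print(Z.reshape(-1).shape)    # (6,) -- same effect as flatten()
print(Z.T.shape)              # (3, 2)
```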
A = np.array(range(1,7), dtype='float').reshape(2,3)
A
B = np.array([1, 2, 3])
# not the same as A.dot(B)
A * B
A + B
A / B
# matrix multiplication
A.dot(B)
B.dot(A.T)
A.dot(A.T)
# outer product
# multiplying each element of first vector by each element of the second
np.outer(B, B)
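A sketch of the shape bookkeeping behind these products; for 2-D arrays, the @ operator is equivalent to .dot:

```python
import numpy as np

A = np.array(range(1, 7), dtype='float').reshape(2, 3)
B = np.array([1, 2, 3])

print((A @ B).shape)          # (2,3) @ (3,)  -> (2,)
print((A @ A.T).shape)        # (2,3) @ (3,2) -> (2,2)
print(np.outer(B, B).shape)   # (3,) outer (3,) -> (3,3)
```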
The following is a practical example of numerical operations on NumPy matrices.
In our class, we have a shallow neural network implemented in np_shallow_neural_network.py
. See how the forward and backward passes use no for loops, and instead take advantage of NumPy's ability to vectorize manipulations of data.
def forward_propagation(self, x):
    h = self.hidden_activation(x.dot(self.W_xh) + self.b_xh)
    y = softmax(h.dot(self.W_hy) + self.b_hy)
    return h, y

def backward_propagation(self, h, predictions, x, labels):
    y_err = predictions.copy()
    y_err[np.argmax(labels)] -= 1
    d_b_hy = y_err
    h_err = y_err.dot(self.W_hy.T) * self.d_hidden_activation(h)
    d_W_hy = np.outer(h, y_err)
    d_W_xh = np.outer(x, h_err)
    d_b_xh = h_err
    return d_W_hy, d_b_hy, d_W_xh, d_b_xh
The forward pass essentially computes the following:
$$h = f(xW_{xh} + b_{xh})$$
$$y = \text{softmax}(hW_{hy} + b_{hy}),$$
where $f$ is self.hidden_activation.
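To make the shapes concrete, here is a standalone sketch of the forward pass with made-up layer sizes. The weight names mirror the class attributes; np.tanh stands in for self.hidden_activation, and a stable softmax helper is written inline:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 5, 3    # hypothetical layer sizes

x = rng.random(n_in)
W_xh = rng.random((n_in, n_hidden)); b_xh = np.zeros(n_hidden)
W_hy = rng.random((n_hidden, n_out)); b_hy = np.zeros(n_out)

def softmax(v):
    e = np.exp(v - v.max())        # subtract max for numerical stability
    return e / e.sum()

h = np.tanh(x.dot(W_xh) + b_xh)    # h = f(x W_xh + b_xh)
y = softmax(h.dot(W_hy) + b_hy)    # y = softmax(h W_hy + b_hy)
print(y.sum())                     # probabilities sum to 1
```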
The backward pass propagates error by computing local gradients and chaining them. Feel free to learn more about backprop here, though it is not necessary for our class. Also look at this neural networks case study to see another example of how NumPy can be used to implement forward and backward passes of a simple neural network.
These are examples of how NumPy can be used with other Python packages.
We can convert numpy matrices to Pandas dataframes. In the following example, this is useful because it allows us to label each row. You may have noticed this being done in our first unit on distributed representations.
import pandas as pd
count_df = pd.DataFrame(
    np.array([
        [1,0,1,0,0,0],
        [0,1,0,1,0,0],
        [1,1,1,1,0,0],
        [0,0,0,0,1,1],
        [0,0,0,0,0,1]], dtype='float64'),
    index=['gnarly', 'wicked', 'awesome', 'lame', 'terrible'])
count_df
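Going the other direction is just as easy: .to_numpy() recovers the underlying matrix from a DataFrame. A small sketch with a toy labeled matrix:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.eye(3, dtype='float64'),
                  index=['a', 'b', 'c'], columns=['x', 'y', 'z'])

M = df.to_numpy()              # back to a plain NumPy array
print(type(M), M.shape)
print(df.loc['b'].to_numpy())  # a labeled row as a NumPy vector
```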
In sklearn, NumPy matrices are the most common input and output, and are thus key to how the library's numerous methods can work together. Many of the cs224u models built by Chris operate just like sklearn ones, such as the classifiers we used for our sentiment analysis unit.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
print(type(X))
print("Dimensions of X:", X.shape)
print(type(y))
print("Dimensions of y:", y.shape)
# split data into train/test
X_iris_train, X_iris_test, y_iris_train, y_iris_test = train_test_split(
    X, y, train_size=0.7, test_size=0.3)
print("X_iris_train:", type(X_iris_train))
print("y_iris_train:", type(y_iris_train))
print()
# start up model
maxent = LogisticRegression(
    fit_intercept=True,
    solver='liblinear',
    multi_class='auto')
# train on train set
maxent.fit(X_iris_train, y_iris_train)
# predict on test set
iris_predictions = maxent.predict(X_iris_test)
fnames_iris = iris['feature_names']
tnames_iris = iris['target_names']
# how well did our model do?
print(classification_report(y_iris_test, iris_predictions, target_names=tnames_iris))
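The fitted model's parameters come back as NumPy arrays as well; a small sketch on the same iris data (fitting on the full dataset just for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
clf = LogisticRegression(solver='liblinear')
clf.fit(iris.data, iris.target)

print(type(clf.coef_))                    # numpy.ndarray
print(clf.coef_.shape)                    # (n_classes, n_features) = (3, 4)
print(clf.predict(iris.data[:2]).shape)   # predictions are arrays too
```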
SciPy contains what may seem like an endless treasure trove of operations for linear algebra, optimization, and more. It is built so that everything can work with NumPy arrays.
from scipy.spatial.distance import cosine
from scipy.stats import pearsonr
from scipy import linalg
# cosine distance
a = np.random.random(10)
b = np.random.random(10)
cosine(a, b)
# pearson correlation (coeff, p-value)
pearsonr(a, b)
# inverse of matrix
A = np.array([[1,3,5],[2,5,1],[2,3,8]])
linalg.inv(A)
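A quick sanity check that the two libraries interoperate: multiplying A by its SciPy-computed inverse (using NumPy's @) should recover the identity up to floating-point error:

```python
import numpy as np
from scipy import linalg

A = np.array([[1, 3, 5], [2, 5, 1], [2, 3, 8]])
A_inv = linalg.inv(A)

print(np.allclose(A @ A_inv, np.identity(3)))  # True
```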
To learn more about how NumPy can be combined with SciPy and Scikit-learn for machine learning, check out this notebook tutorial by Chris Potts and Will Monroe. (You may notice that over half of this current notebook is modified from theirs.) Their tutorial also has some interesting exercises in it!
import matplotlib.pyplot as plt
a = np.sort(np.random.random(30))
b = a**2
c = np.log(a)
plt.plot(a, b, label='y = x^2')
plt.plot(a, c, label='y = log(x)')
plt.legend()
plt.title("Some functions")
plt.show()