np.sort
and np.argsort
¶import numpy as np
x = np.array([2, 1, 4, 3, 5])
np.sort(x) # return a sorted version of the array without modifying the input
array([1, 2, 3, 4, 5])
x
array([2, 1, 4, 3, 5])
x.sort() # sort the array in-place
x
array([1, 2, 3, 4, 5])
argsort
: returns the indices of the sorted elements
x = np.array([2, 1, 4, 3, 5])
i = np.argsort(x)
i
array([1, 0, 3, 2, 4])
The first element of this result gives the index of the smallest element, the second value gives the index of the second smallest, and so on. These indices can then be used (via fancy indexing) to construct the sorted array if desired:
x[i]
array([1, 2, 3, 4, 5])
rand = np.random.RandomState(42)
X = rand.randint(0, 10, (4, 6))
X
array([[6, 3, 7, 4, 6, 9], [2, 6, 7, 4, 3, 7], [7, 2, 5, 4, 1, 7], [5, 1, 4, 0, 9, 5]])
# sort each column of X
np.sort(X, axis=0)
array([[2, 1, 4, 0, 1, 5], [5, 2, 5, 4, 3, 7], [6, 3, 7, 4, 6, 7], [7, 6, 7, 4, 9, 9]])
# sort each row of X
np.sort(X, axis=1)
array([[3, 4, 6, 6, 7, 9], [2, 3, 4, 6, 7, 7], [1, 2, 4, 5, 7, 7], [0, 1, 4, 5, 5, 9]])
Find the k smallest values in the array.
np.partition
: takes an array and a number K; the result is a new array with the smallest K values to the left of the partition, and the remaining values to the right, in arbitrary order
x = np.array([7, 2, 3, 1, 6, 5, 4])
x
array([7, 2, 3, 1, 6, 5, 4])
np.partition(x, 3) # the first three values in the resulting array are the three smallest in the array
array([2, 1, 3, 4, 6, 5, 7])
np.partition(x, 4)
array([2, 1, 3, 4, 5, 6, 7])
X
array([[6, 3, 7, 4, 6, 9], [2, 6, 7, 4, 3, 7], [7, 2, 5, 4, 1, 7], [5, 1, 4, 0, 9, 5]])
np.partition(X, 2, axis=1)
array([[3, 4, 6, 7, 6, 9], [2, 3, 4, 7, 6, 7], [1, 2, 4, 5, 7, 7], [0, 1, 4, 5, 9, 5]])
X = rand.randint(10, size=(5, 2))
X
array([[2, 0], [2, 4], [2, 0], [4, 9], [6, 6]])
X.shape
(5, 2)
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()
plt.scatter(X[:, 0], X[:, 1], s=100)
<matplotlib.collections.PathCollection at 0x1a2fea7ba8>
dist_sq = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis=-1)
Break the code above down into its component steps:
# for each pair of points, compute differences in their coordinates
differences = X[:, np.newaxis, :] - X[np.newaxis, :, :]
differences.shape
(5, 5, 2)
# square the coordinate differences
sq_differences = differences ** 2
sq_differences.shape
(5, 5, 2)
# sum the coordinate differences to get the squared distance
dist_sq = sq_differences.sum(-1)
dist_sq.shape
(5, 5)
# double check: whether the diagonal of this matrix is all zero
dist_sq.diagonal()
array([0, 0, 0, 0, 0])
dist_sq
array([[ 0, 16, 0, 85, 52], [16, 0, 16, 29, 20], [ 0, 16, 0, 85, 52], [85, 29, 85, 0, 13], [52, 20, 52, 13, 0]])
np.argsort
to sort along each row¶The leftmost columns will then give the indices of the nearest neighbors
nearest = np.argsort(dist_sq, axis=1)
nearest
array([[0, 2, 1, 4, 3], [1, 0, 2, 4, 3], [0, 2, 1, 4, 3], [3, 4, 1, 0, 2], [4, 3, 1, 0, 2]])
The first column gives the numbers 0 through 9 in order: this is due to the fact that each point's closest neighbor is itself.
If we're simply interested in the nearest $k$ neighbors, all we need is to partition each row so that the smallest $k+1$ squared distances come first, with larger distances filling the remaining positions of the array.
K = 2
nearest_partition = np.argpartition(dist_sq, K + 1, axis=1)
nearest_partition
array([[2, 0, 1, 4, 3], [1, 2, 0, 4, 3], [2, 0, 1, 4, 3], [3, 4, 1, 0, 2], [3, 4, 1, 0, 2]])
Plot the points along with lines representing the connections from each point to its two nearest neighbors
plt.scatter(X[:, 0], X[:, 1], s=100)
# draw lines from each point to its two nearest neighbors
K = 2
for i in range(X.shape[0]):
for j in nearest_partition[i, :K+1]:
# plot a line from X[i] to X[j]
# use zip magic
plt.plot(*zip(X[j], X[i]), color="black")