import os
import pandas as pd
import numpy as np
# libraries for scoring/clustering
from sklearn.manifold import trustworthiness
# GPU UMAP
import cudf
from cuml.manifold import UMAP as cumlUMAP
# plotting
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(style='white', rc={'figure.figsize':(25, 12.5)})
# hide warnings
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
We are going to work with the Fashion MNIST dataset, which consists of 70,000 28x28 grayscale images of clothing. It should already be in the `data/fashion` folder, but let's do a sanity check!
if not os.path.exists('data/fashion'):
    print("Error: data is missing!")
Now let's make sure we have a RAPIDS-compliant GPU. It must be Pascal architecture or newer! You can also use this to decide which GPU RAPIDS should use (an advanced feature not covered here).
!nvidia-smi
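The note above mentions choosing which GPU RAPIDS uses. One common way to do this (an assumption here, not something this notebook requires) is the `CUDA_VISIBLE_DEVICES` environment variable, which must be set before any CUDA context is created, i.e. before importing `cudf` or `cuml`:

```python
import os

# Pin CUDA (and therefore RAPIDS) to device 0. This must run before
# importing cudf/cuml, because the CUDA context is created at import time.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

print(os.environ["CUDA_VISIBLE_DEVICES"])
```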
# https://github.com/zalandoresearch/fashion-mnist/blob/master/utils/mnist_reader.py
def load_mnist(path, kind='train'):
    """Load MNIST data from `path`"""
    import os
    import gzip
    import numpy as np

    labels_path = os.path.join(path, '%s-labels-idx1-ubyte.gz' % kind)
    images_path = os.path.join(path, '%s-images-idx3-ubyte.gz' % kind)

    with gzip.open(labels_path, 'rb') as lbpath:
        labels = np.frombuffer(lbpath.read(), dtype=np.uint8, offset=8)

    with gzip.open(images_path, 'rb') as imgpath:
        images = np.frombuffer(imgpath.read(), dtype=np.uint8,
                               offset=16).reshape(len(labels), 784)

    return images, labels
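As a side note, the `offset=8` and `offset=16` arguments skip the IDX file headers (a magic number plus dimension sizes stored as big-endian uint32 values). A minimal self-contained sketch of the same parsing logic, using a fake in-memory images file rather than the real download:

```python
import gzip
import io
import struct

import numpy as np

# Build a tiny fake IDX images file in memory: a 16-byte header
# (magic, count, rows, cols as big-endian uint32) followed by raw uint8 pixels.
n, rows, cols = 2, 28, 28
header = struct.pack('>IIII', 2051, n, rows, cols)
pixels = (np.arange(n * rows * cols) % 256).astype(np.uint8).tobytes()

buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as f:
    f.write(header + pixels)

# Parse it the same way load_mnist does: skip the 16-byte header.
raw = gzip.decompress(buf.getvalue())
images = np.frombuffer(raw, dtype=np.uint8, offset=16).reshape(n, rows * cols)
print(images.shape)  # (2, 784)
```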
train, train_labels = load_mnist('data/fashion', kind='train')
test, test_labels = load_mnist('data/fashion', kind='t10k')
data = np.array(np.vstack([train, test]), dtype=np.float64) / 255.0
target = np.array(np.hstack([train_labels, test_labels]))
There are 60,000 training images and 10,000 test images.
f"Train shape: {train.shape} and Test Shape: {test.shape}"
train[0].shape
As mentioned previously, each row in the `train` matrix is a flattened image.
# display a Nike? sneaker
pixels = train[0].reshape((28, 28))
plt.imshow(pixels, cmap='gray')
There is a cost to moving data between host memory and device memory (GPU memory), and we will include that cost when comparing speeds.
%%time
record_data = {'fea%d' % i: data[:, i] for i in range(data.shape[1])}
gdf = cudf.DataFrame(record_data)
gdf
`gdf` is a GPU-backed dataframe: all of the data is stored in the device memory of the GPU. With the data converted, we can give `cumlUMAP` the same inputs as we do for the standard UMAP. Additionally, it should be noted that within cuML, [FAISS](https://github.com/facebookresearch/faiss) is used for extremely fast kNN, and it is limited to single precision; `cumlUMAP` will automatically downcast to `float32` when needed.
%%timeit
g_embedding = cumlUMAP(n_neighbors=5, init="spectral").fit_transform(gdf)
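Because the FAISS-backed kNN is single precision, one option (an assumption on my part, not a step this notebook requires) is to cast the data to `float32` up front, which avoids the implicit downcast and halves the host-to-device transfer size:

```python
import numpy as np

# Stand-in for the (70000, 784) float64 image matrix built earlier.
data64 = np.random.rand(100, 784)

# Casting to float32 before building the cudf DataFrame avoids cuML's
# implicit downcast and halves the amount of data copied to the GPU.
data32 = data64.astype(np.float32)

print(data32.dtype, data32.nbytes < data64.nbytes)
```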
OK, now let's plot the output of the embedding so that we can see the separation of the neighborhoods. Let's start by creating the classes.
classes = [
'T-shirt/top',
'Trouser',
'Pullover',
'Dress',
'Coat',
'Sandal',
'Shirt',
'Sneaker',
'Bag',
'Ankle boot']
# Needs to be rerun because %%timeit does not persist the g_embedding variable
g_embedding = cumlUMAP(n_neighbors=5, init="spectral").fit_transform(gdf)
Just as the original author of UMAP, Leland McInnes, states in the UMAP docs, we can plot the results and show the separation between the various classes defined above.
g_embedding_numpy = g_embedding.to_pandas().values  # convert to a NumPy array for plotting
fig, ax = plt.subplots(1, figsize=(14, 10))
plt.scatter(g_embedding_numpy[:,1], g_embedding_numpy[:,0], s=0.3, c=target, cmap='Spectral', alpha=1.0)
plt.setp(ax, xticks=[], yticks=[])
cbar = plt.colorbar(boundaries=np.arange(11)-0.5)
cbar.set_ticks(np.arange(10))
cbar.set_ticklabels(classes)
plt.title('Fashion MNIST Embedded via cumlUMAP');
Additionally, we can quantitatively compare the performance of `cumlUMAP` (GPU UMAP) to the reference/original implementation (CPU UMAP) using the trustworthiness score. From the docstring:
Trustworthiness expresses to what extent the local structure is retained. The trustworthiness is within [0, 1].
Like t-SNE, UMAP tries to capture both global and local structure, and thus we can apply the trustworthiness score to the `g_embedding` data against the original input. A higher score demonstrates that the algorithm does a better job of retaining local structure. As Corey Nolet notes:
Algorithms like UMAP aim to preserve local neighborhood structure and so measuring this property (trustworthiness) measures the algorithm's performance.
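The GPU embedding itself requires a RAPIDS GPU, but the scoring step can be sketched on its own. Here random stand-in arrays replace the real images and embedding (so the score itself is meaningless; only the call pattern matters):

```python
import numpy as np
from sklearn.manifold import trustworthiness

rng = np.random.RandomState(42)
X = rng.rand(200, 784)    # stand-in for the flattened images
emb = rng.rand(200, 2)    # stand-in for the 2-D UMAP embedding

# Trustworthiness compares each point's nearest neighbors in the embedding
# against its neighbors in the original space; 1.0 means local structure
# is perfectly retained.
score = trustworthiness(X, emb, n_neighbors=5)
print(0.0 <= score <= 1.0)  # True
```

In the notebook, `X` would be `data` and `emb` would be `g_embedding_numpy`.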
Scoring ~97% shows that the GPU implementation is comparable to the original CPU implementation, while training was ~9.5x faster.