Nearest Neighbors finds the k nearest neighbors of each query point within a set of input samples.
The model accepts array-like inputs, either on the host (NumPy arrays) or on the device (Numba arrays or other cuda_array_interface-compliant objects), as well as cuDF DataFrames.
For information on converting your dataset to cuDF format, refer to the cuDF documentation: https://docs.rapids.ai/api/cudf/stable
For additional information on cuML's Nearest Neighbors implementation, see: https://rapidsai.github.io/projects/cuml/en/stable/api.html#nearest-neighbors
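For intuition about what a k-nearest-neighbors query returns, the following is a minimal host-side sketch using only NumPy. It is not cuML's or scikit-learn's implementation: the helper name `brute_force_knn` and the small random dataset are hypothetical and serve only to illustrate the exhaustive (brute-force) search with Euclidean distance.

```python
import numpy as np


def brute_force_knn(data, queries, k):
    """Hypothetical helper: exact k-NN by exhaustive Euclidean distance."""
    # Pairwise squared distances via ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2
    sq = (
        (queries ** 2).sum(axis=1)[:, None]
        - 2.0 * queries @ data.T
        + (data ** 2).sum(axis=1)[None, :]
    )
    # Clamp tiny negative values that can appear through round-off.
    sq = np.maximum(sq, 0.0)
    # Indices of the k smallest distances per query row, in ascending order.
    idx = np.argsort(sq, axis=1)[:, :k]
    dist = np.sqrt(np.take_along_axis(sq, idx, axis=1))
    return dist, idx


rng = np.random.default_rng(0)
points = rng.normal(size=(200, 5))

# Query the first 10 points against the full set, as the notebook does
# with the first n_query rows of its dataset.
dist, idx = brute_force_knn(points, points[:10], k=4)
print(dist.shape, idx.shape)
```

Because the query points are taken from the fitted data itself, each point's first neighbor is the point itself at a distance of (approximately) zero. The same holds for the cuML and scikit-learn queries in this notebook.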
import cudf
import numpy as np
from cuml.datasets import make_blobs
from cuml.neighbors import NearestNeighbors as cuNearestNeighbors
from sklearn.neighbors import NearestNeighbors as skNearestNeighbors
n_samples = 2**17
n_features = 40
n_query = 2**13
n_neighbors = 4
random_state = 0
%%time
device_data, _ = make_blobs(n_samples=n_samples,
                            n_features=n_features,
                            centers=5,
                            random_state=random_state)
device_data = cudf.DataFrame.from_gpu_matrix(device_data)
# Copy dataset from GPU memory to host memory.
# This is done to later compare CPU and GPU results.
host_data = device_data.to_pandas()
%%time
knn_sk = skNearestNeighbors(algorithm="brute",
                            n_jobs=-1)
knn_sk.fit(host_data)
%%time
D_sk, I_sk = knn_sk.kneighbors(host_data[:n_query], n_neighbors)
%%time
knn_cuml = cuNearestNeighbors()
knn_cuml.fit(device_data)
%%time
D_cuml, I_cuml = knn_cuml.kneighbors(device_data[:n_query], n_neighbors)
cuML currently uses FAISS for exact nearest neighbors search, which limits inputs to single precision. Round-off errors can therefore occur when floats of very different magnitudes are added, so the cuML results will very likely not match scikit-learn's nearest neighbors exactly. You can read more in the FAISS wiki.
passed = np.allclose(D_sk, D_cuml.as_gpu_matrix(), atol=1e-3)
print('compare knn: cuml vs sklearn distances %s' % ('equal' if passed else 'NOT equal'))
sk_sorted = np.sort(I_sk, axis=1)
cuml_sorted = np.sort(I_cuml.as_gpu_matrix(), axis=1)
diff = sk_sorted - cuml_sorted
# Fraction of neighbor indices that differ; diff holds n_query * n_neighbors entries.
passed = (np.count_nonzero(diff) / diff.size) < 1e-9
print('compare knn: cuml vs sklearn indexes %s' % ('equal' if passed else 'NOT equal'))
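The single-precision round-off described above can be reproduced on the host with NumPy alone. The sketch below is only an illustration of the floating-point effect, not of FAISS or cuML internals, and the values `big` and `small` are chosen arbitrarily for the demonstration.

```python
import numpy as np

big = np.float32(1e8)
small = np.float32(1.0)

# float32 has a 24-bit significand, so representable values near 1e8
# are 8 apart and adding 1.0 is lost entirely.
print(np.spacing(big))  # 8.0

# Same arithmetic in single and double precision.
sum32 = (big + small) - big
sum64 = (np.float64(big) + np.float64(small)) - np.float64(big)

print(sum32)  # 0.0, the small addend vanished
print(sum64)  # 1.0, double precision keeps it
```

Distance computations accumulate many such additions, which is why the comparison of distances uses a tolerance (`atol=1e-3`) rather than exact equality.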