Another package we are going to explore is cuML. Similar to scikit-learn, you can use cuml to train machine learning models on your data to make predictions. As with the other packages in the RAPIDS suite of tools, the API of cuml is the same as scikit-learn's, but the underlying code has been implemented to run on the GPU.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py
import cudf
Let's look at training a K Nearest Neighbours model to predict whether someone has diabetes based on some other attributes such as their blood pressure, glucose levels, BMI, etc.
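Before we train on real data, here is a minimal pure-Python sketch of the idea behind K Nearest Neighbours: to classify a new point, find the k closest training points and take a majority vote of their labels. This is just an illustration, not the cuML implementation.

```python
from collections import Counter
import math

def knn_predict(X_train, y_train, x, k=3):
    # Distance from x to every training point (Euclidean).
    dists = sorted(
        (math.dist(p, x), label) for p, label in zip(X_train, y_train)
    )
    # Majority vote among the k nearest neighbours.
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy data: two clusters labelled 0 and 1.
X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
y = [0, 0, 0, 1, 1, 1]
print(knn_predict(X, y, (0.5, 0.5)))  # near the first cluster -> 0
print(knn_predict(X, y, (5.5, 5.5)))  # near the second cluster -> 1
```

cuML's `KNeighborsClassifier` does the same thing conceptually, but on the GPU and at scale.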
We start by loading in our data to a GPU dataframe with cudf.
df = cudf.read_csv("https://github.com/jacobtomlinson/gpu-python-tutorial/raw/main/data/diabetes.csv")
df.head()
Next we need to create two separate tables: one containing the attributes of the patient except the diabetes column, and one with just the diabetes column.
X = df.drop(columns=["Outcome"])
X.head()
y = df["Outcome"].values
y[0:5]
Next we need to use the train_test_split method from cuml to split our data into two sets.
The first, larger set will be used to train our model. We will take 80% of the data from each table and call them X_train and y_train. When the model is trained it will be able to see both sets of data in order to perform its classification.
The other 20% of the data will be called X_test and y_test. Once our model is trained we will feed our X_test data through our model to predict whether those people have diabetes. We can then compare those predictions with the actual y_test data to see how accurate our model is.
We also set random_state to 1 to make the random selection consistent, just for the purposes of this tutorial. We also set stratify, which means that if 75% of the people in our initial data have diabetes then 75% of the people in our training set will be guaranteed to have diabetes.
from cuml.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
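To see what the stratify option guarantees, here is a small pure-Python sketch of a stratified split: shuffle the indices of each class separately, then take the same test fraction from each class, so the class proportions are preserved in both halves. This is illustrative only, not cuML's implementation.

```python
from collections import Counter
import random

random.seed(1)

# Toy labels: 75% class 1, 25% class 0.
y = [1] * 75 + [0] * 25

def stratified_split(y, test_size=0.2):
    # Group indices by class label.
    by_class = {}
    for i, label in enumerate(y):
        by_class.setdefault(label, []).append(i)
    # Take test_size of EACH class, so proportions are preserved.
    train, test = [], []
    for idxs in by_class.values():
        random.shuffle(idxs)
        n_test = int(len(idxs) * test_size)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return train, test

train_idx, test_idx = stratified_split(y)
print(Counter(y[i] for i in train_idx))  # Counter({1: 60, 0: 20}) -> still 75% / 25%
print(Counter(y[i] for i in test_idx))   # Counter({1: 15, 0: 5})  -> still 75% / 25%
```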
Now that we have our training data we can import our KNeighborsClassifier from cuml and fit our model.
from cuml.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
Fitting our model happened on our GPU and now we can make some predictions. Let's predict the first five people from our test set.
knn.predict(X_test)[0:5]
We can see here that our new model thinks that the first patient has diabetes but the rest do not.
Let's run the whole test set through the scoring function along with the actual answers and see how well our model performs.
knn.score(X_test, y_test)
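For a classifier, the score reported here is plain accuracy: the fraction of predictions that match the true labels. A minimal illustration with made-up values (not our real model output):

```python
# Accuracy as computed by .score(): fraction of predictions matching the truth.
predictions = [1, 0, 0, 1, 1]
actual      = [1, 0, 1, 1, 0]

accuracy = sum(p == a for p, a in zip(predictions, actual)) / len(actual)
print(accuracy)  # 3 of 5 match -> 0.6
```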
Congratulations, you just trained a machine learning model on the GPU in Python and achieved a score of 69% accuracy. There are a bunch of things we could do here to improve this score, but that is beyond the scope of this tutorial.