Linear Regression

LinearRegression is a simple machine learning model where the response y is modelled by a linear combination of the predictors in X.
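One common way to write this model down is shown below. The symbols are assumptions introduced here for illustration and do not appear in the original notebook: \beta is the coefficient vector, \beta_0 the intercept, \mathbf{1} a vector of ones, and \varepsilon the residual. The fit is taken to be ordinary least squares.

```latex
y = X\beta + \beta_0 \mathbf{1} + \varepsilon,
\qquad
(\hat\beta, \hat\beta_0) = \arg\min_{\beta,\,\beta_0} \lVert y - X\beta - \beta_0 \mathbf{1} \rVert_2^2
```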

The linear regression model implemented in the cuML library lets the user set the fit_intercept, normalize and algorithm parameters. cuML’s LinearRegression expects either a cuDF DataFrame or a NumPy matrix and provides two algorithms to fit a linear model: SVD and Eig. SVD is more stable, but Eig (the default) is much faster.

LinearRegression accepts the following parameters:

algorithm: ‘eig’ or ‘svd’ (default = ‘eig’). Eig uses an eigendecomposition of the covariance matrix and is much faster. SVD is slower but is guaranteed to be stable.
fit_intercept: boolean (default = True). If True, LinearRegression tries to correct for the global mean of y. If False, the model expects that you have centered the data.
normalize: boolean (default = False). If True, the predictors in X will be normalized by dividing by their L2 norm. If False, no scaling will be done.
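To make the difference between the two algorithm choices concrete, here is a minimal NumPy sketch of both approaches on a small synthetic problem. It is an illustration of the underlying linear algebra only, not cuML's actual implementation; the helper names fit_eig and fit_svd and the synthetic data are hypothetical. The data is centered first, which mirrors what fit_intercept=True is described as doing.

```python
import numpy as np

def fit_eig(X, y):
    """Least squares via eigendecomposition of the Gram matrix X^T X."""
    gram = X.T @ X
    evals, evecs = np.linalg.eigh(gram)  # gram is symmetric, so eigh applies
    # (X^T X)^{-1} X^T y = V diag(1 / lambda) V^T X^T y
    return evecs @ ((evecs.T @ (X.T @ y)) / evals)

def fit_svd(X, y):
    """Least squares via the thin SVD X = U diag(s) V^T."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ ((U.T @ y) / s)

# Hypothetical synthetic data: 200 samples, 3 predictors, small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_coef = np.array([1.5, -2.0, 0.5])
y = X @ true_coef + 0.01 * rng.normal(size=200)

# Center X and y, fit on the centered data, then recover the intercept
X_mean, y_mean = X.mean(axis=0), y.mean()
coef_eig = fit_eig(X - X_mean, y - y_mean)
coef_svd = fit_svd(X - X_mean, y - y_mean)
intercept = y_mean - X_mean @ coef_eig

print("eig coefficients:", coef_eig)
print("svd coefficients:", coef_svd)
print("intercept:", intercept)
```

On well-conditioned data like this the two routes agree closely. Forming X^T X squares the condition number of the problem, which is one reason the SVD route is usually regarded as the more stable of the two.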

The methods available on LinearRegression are:

fit: Fits the model with X and y.
get_params: scikit-learn-style getter that returns the current parameter state.
predict: Predicts y for X.
set_params: scikit-learn-style setter that updates the parameter state from a dictionary of params.

To convert your dataset to cuDF format, please read the cuDF documentation at https://rapidsai.github.io/projects/cudf/en/latest/. For additional information on the linear regression model, please refer to the cuML documentation at https://rapidsai.github.io/projects/cuml/en/latest/index.html

In [ ]:

import numpy as np
import pandas as pd
import cudf
import os
from cuml import LinearRegression as cuLinearRegression
from sklearn.linear_model import LinearRegression as skLinearRegression
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error

Helper Functions

In [ ]:

# check if the mortgage dataset is present and then extract the data from it, else just create a random dataset for regression 
import gzip
def load_data(nrows, ncols, cached = 'data/mortgage.npy.gz'):
    #split the dataset in a 80:20 split
    train_rows = int(nrows*0.8)
    if os.path.exists(cached):
        print('use mortgage data')

        with gzip.open(cached) as f:
            X = np.load(f)
        # column index 4 is 'adj_remaining_months_to_maturity';
        # extract it as the label before dropping it from the features
        y = X[:,4:5]
        X = X[:,[i for i in range(X.shape[1]) if i!=4]]
        # sample row indices with replacement; the upper bound of randint is exclusive
        rindices = np.random.randint(0,X.shape[0],nrows)
        X = X[rindices,:ncols]
        y = y[rindices]
        df_y_train = pd.DataFrame({'fea%d'%i:y[0:train_rows,i] for i in range(y.shape[1])})
        df_y_test = pd.DataFrame({'fea%d'%i:y[train_rows:,i] for i in range(y.shape[1])})
    else:
        print('use random data')
        X,y = make_regression(n_samples=nrows,n_features=ncols,n_informative=ncols, random_state=0)
        df_y_train = pd.DataFrame({'fea0':y[0:train_rows,]})
        df_y_test = pd.DataFrame({'fea0':y[train_rows:,]})

    df_X_train = pd.DataFrame({'fea%d'%i:X[0:train_rows,i] for i in range(X.shape[1])})
    df_X_test = pd.DataFrame({'fea%d'%i:X[train_rows:,i] for i in range(X.shape[1])})

    return df_X_train, df_X_test, df_y_train, df_y_test

Run tests

In [ ]:

%%time
# nrows = number of samples
# ncols = number of features of each sample 

nrows = 2**20
ncols = 399

#split the dataset into training and testing sets, in the ratio of 80:20 respectively
X_train, X_test, y_train, y_test = load_data(nrows,ncols)
print('training data',X_train.shape)
print('training label',y_train.shape)
print('testing data',X_test.shape)
print('testing label',y_test.shape)

In [ ]:

%%time
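# use the sklearn linear regression model to fit the training dataset
# note: the `normalize` argument was removed in scikit-learn 1.2, so this cell assumes an older release
skols = skLinearRegression(fit_intercept=True,
                           normalize=True)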
skols.fit(X_train, y_train)

In [ ]:

%%time
# calculate the mean squared error of the sklearn linear regression model on the testing dataset
sk_predict = skols.predict(X_test)
error_sk = mean_squared_error(y_test,sk_predict)

In [ ]:

%%time
# convert the pandas dataframe to cudf format
X_cudf = cudf.DataFrame.from_pandas(X_train)
X_cudf_test = cudf.DataFrame.from_pandas(X_test)
y_cudf = y_train.values
y_cudf = y_cudf[:,0]
y_cudf = cudf.Series(y_cudf)

In [ ]:

%%time
# run the cuml linear regression model to fit the training dataset 
cuols = cuLinearRegression(fit_intercept=True,
                  normalize=True,
                  algorithm='eig')
cuols.fit(X_cudf, y_cudf)

In [ ]:

%%time
# calculate the mean squared error of the testing dataset using the cuml linear regression model
cu_predict = cuols.predict(X_cudf_test).to_array()
error_cu = mean_squared_error(y_test,cu_predict)

In [ ]:

# print the mean squared error of the sklearn and cuml model to compare the two
print("SKL MSE(y):")
print(error_sk)
print("CUML MSE(y):")
print(error_cu)
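Printing the two errors side by side leaves the comparison to the eye. As a rough sketch, the check could also be done numerically. The helpers mse and relative_gap and the small arrays below are hypothetical stand-ins for y_test, sk_predict and cu_predict, not part of the original notebook.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error; both inputs are flattened so that an (n, 1) label
    array and an (n,) prediction array do not broadcast to (n, n)."""
    y_true = np.asarray(y_true, dtype=np.float64).ravel()
    y_pred = np.asarray(y_pred, dtype=np.float64).ravel()
    return float(np.mean((y_true - y_pred) ** 2))

def relative_gap(a, b):
    """Relative difference between two non-negative error values."""
    return abs(a - b) / max(abs(a), abs(b), np.finfo(np.float64).tiny)

# Hypothetical stand-ins for the labels and the two sets of predictions
y_true = np.array([[1.0], [2.0], [3.0], [4.0]])   # shape (4, 1), like a one-column DataFrame
pred_a = np.array([1.1, 1.9, 3.0, 4.2])           # e.g. CPU predictions
pred_b = pred_a + 1e-7                            # e.g. GPU predictions with tiny numeric drift

error_a = mse(y_true, pred_a)
error_b = mse(y_true, pred_b)
gap = relative_gap(error_a, error_b)
print("relative gap between the two MSE values:", gap)
```

A small relative gap suggests the two implementations agree up to floating-point noise; the tolerance that counts as "small" depends on the data and the precision used.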