___

Copyright by Pierian Data Inc. For more information, visit us at www.pieriandata.com

Random Forest - Regression

Plus: An Additional Analysis of Various Regression Methods!

The Data

We just got hired by a tunnel boring company that uses X-rays to estimate rock density. Ideally, this will allow them to switch out the boring heads on their equipment before having to mine through the rock!

They have given us some lab test results of the signal strength returned to their sensors, in nHz, for the various rock density types tested. You will notice an almost sine-wave-like relationship, where signal strength oscillates with density. The researchers are unsure why this is, but

In [178]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [179]:

df = pd.read_csv("../DATA/rock_density_xray.csv")

In [180]:

df.head()

Out[180]:

	Rebound Signal Strength nHz	Rock Density kg/m3
0	72.945124	2.456548
1	14.229877	2.601719
2	36.597334	1.967004
3	9.578899	2.300439
4	21.765897	2.452374

In [181]:

df.columns=['Signal',"Density"]

In [182]:

plt.figure(figsize=(12,8),dpi=200)
sns.scatterplot(x='Signal',y='Density',data=df)

Out[182]:

<AxesSubplot:xlabel='Signal', ylabel='Density'>

Splitting the Data

Let's split the data so that we have a test set for evaluating performance metrics.

In [198]:

X = df['Signal'].values.reshape(-1,1)  
y = df['Density']

In [199]:

from sklearn.model_selection import train_test_split

In [200]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=101)

Linear Regression

In [201]:

from sklearn.linear_model import LinearRegression

In [202]:

lr_model = LinearRegression()

In [203]:

lr_model.fit(X_train,y_train)

Out[203]:

LinearRegression()

In [204]:

lr_preds = lr_model.predict(X_test)

In [205]:

from sklearn.metrics import mean_squared_error

In [206]:

np.sqrt(mean_squared_error(y_test,lr_preds))

Out[206]:

0.2570051996584629
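RMSE is expressed in the same units as the target, so the value above is an error in density units. As a minimal sketch of what the cell computes, the example below uses made-up numbers (not the rock density data) to show that `np.sqrt(mean_squared_error(...))` is the square root of the average squared residual. The arrays `y_true` and `y_pred` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true values and predictions, for illustration only
y_true = np.array([2.0, 2.5, 3.0, 1.5])
y_pred = np.array([2.5, 2.5, 2.0, 2.0])

# RMSE by hand: square the residuals, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# The same quantity via scikit-learn, mirroring the cell above
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Because the residuals are squared before averaging, a few large misses raise RMSE more than many small ones.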

What does the fit look like?

In [208]:

signal_range = np.arange(0,100)

In [210]:

lr_output = lr_model.predict(signal_range.reshape(-1,1))

In [214]:

plt.figure(figsize=(12,8),dpi=200)
sns.scatterplot(x='Signal',y='Density',data=df,color='black')
plt.plot(signal_range,lr_output)

Out[214]:

[<matplotlib.lines.Line2D at 0x216f1c9c490>]

Polynomial Regression

Attempting with a Polynomial Regression Model

Let's explore why our standard regression approach of a polynomial could be difficult to fit here. Keep in mind that we're in a fortunate situation where we can easily visualize the results of y vs. x.

Function to Help Run Models

In [296]:

from sklearn.linear_model import LinearRegression
model = LinearRegression()

In [298]:

def run_model(model,X_train,y_train,X_test,y_test):
    
    # Fit Model
    model.fit(X_train,y_train)
    
    # Get Metrics
    
    preds = model.predict(X_test)
    
    rmse = np.sqrt(mean_squared_error(y_test,preds))
    print(f'RMSE : {rmse}')
    
    # Plot results
    signal_range = np.arange(0,100)
    output = model.predict(signal_range.reshape(-1,1))
    
    
    plt.figure(figsize=(12,6),dpi=150)
    sns.scatterplot(x='Signal',y='Density',data=df,color='black')
    plt.plot(signal_range,output)

In [299]:

run_model(model,X_train,y_train,X_test,y_test)

RMSE : 0.2570051996584629

Pipeline for Poly Orders

In [268]:

from sklearn.pipeline import make_pipeline

In [269]:

from sklearn.preprocessing import PolynomialFeatures

In [270]:

pipe = make_pipeline(PolynomialFeatures(2),LinearRegression())

In [271]:

run_model(pipe,X_train,y_train,X_test,y_test)

RMSE : 0.2817309563725596

Comparing Various Polynomial Orders

In [272]:

pipe = make_pipeline(PolynomialFeatures(10),LinearRegression())
run_model(pipe,X_train,y_train,X_test,y_test)

RMSE : 0.1417947898442399
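To compare several orders side by side instead of one at a time, a loop over degrees can record both train and test RMSE. The sketch below is only an illustration: it assumes a synthetic, sine-like stand-in for the data (the real CSV is not used), and the degree list and the `results` dictionary are hypothetical choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the rock density data (NOT the real CSV)
rng = np.random.default_rng(101)
signal = rng.uniform(0, 100, size=300)
density = 2.2 + 0.4 * np.sin(signal / 8) + rng.normal(0, 0.1, size=300)

X_syn = signal.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, density, test_size=0.1, random_state=101
)

# Map each polynomial degree to its (train RMSE, test RMSE)
results = {}
for degree in [1, 2, 4, 6, 10]:
    pipe = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    pipe.fit(X_tr, y_tr)
    train_rmse = np.sqrt(mean_squared_error(y_tr, pipe.predict(X_tr)))
    test_rmse = np.sqrt(mean_squared_error(y_te, pipe.predict(X_te)))
    results[degree] = (train_rmse, test_rmse)
    print(f'degree={degree:2d}  train RMSE={train_rmse:.4f}  test RMSE={test_rmse:.4f}')
```

Watching the gap between train and test RMSE as the degree grows is one way to judge when extra polynomial terms stop helping.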

KNN Regression

In [273]:

from sklearn.neighbors import KNeighborsRegressor

In [274]:

k_values = [1,5,10]
for n in k_values:
    # Fit and plot one KNN model per value of k
    model = KNeighborsRegressor(n_neighbors=n)
    run_model(model,X_train,y_train,X_test,y_test)

RMSE : 0.15234870286353372
RMSE : 0.13730685016923655
RMSE : 0.13277855732740926
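With its default uniform weights, a KNN regressor predicts the plain average of the targets of the k closest training points. A minimal sketch on a tiny made-up data set (the arrays `X_small` and `y_small` and the query point are hypothetical) reproduces the prediction by hand:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Tiny made-up training set, for illustration only
X_small = np.array([[1.0], [2.0], [4.0], [7.0], [11.0]])
y_small = np.array([10.0, 20.0, 40.0, 70.0, 110.0])

query = np.array([[3.2]])
k = 3

# By hand: find the k closest training points and average their targets
distances = np.abs(X_small[:, 0] - query[0, 0])
nearest_idx = np.argsort(distances)[:k]
manual_pred = y_small[nearest_idx].mean()

# Same prediction from scikit-learn (default weights='uniform')
knn = KNeighborsRegressor(n_neighbors=k)
knn.fit(X_small, y_small)
sklearn_pred = knn.predict(query)[0]

print(manual_pred, sklearn_pred)
```

With k=1 the model simply returns the target of the single nearest point, which tends to give a jagged curve. Larger k averages over more neighbors and smooths the fit.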

Decision Tree Regression

In [275]:

from sklearn.tree import DecisionTreeRegressor

In [276]:

model = DecisionTreeRegressor()

run_model(model,X_train,y_train,X_test,y_test)

RMSE : 0.15234870286353372
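Notice that this RMSE is identical to the k=1 KNN result above. That is consistent with an unrestricted tree memorizing the training points: with default settings, the tree keeps splitting until each leaf is pure. The sketch below illustrates this on a synthetic sine-like stand-in (not the real data) and contrasts it with a tree capped by `max_leaf_nodes`. The cap of 8 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the rock density data (NOT the real CSV)
rng = np.random.default_rng(0)
signal = np.linspace(0, 100, 200)
density = 2.2 + 0.4 * np.sin(signal / 8) + rng.normal(0, 0.1, size=200)
X_syn = signal.reshape(-1, 1)

# Default tree: grows until every leaf is pure
full_tree = DecisionTreeRegressor(random_state=0).fit(X_syn, density)

# Restricted tree: at most 8 leaves (arbitrary illustrative cap)
small_tree = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X_syn, density)

print('samples          :', len(X_syn))
print('leaves (default) :', full_tree.get_n_leaves())
print('leaves (capped)  :', small_tree.get_n_leaves())
```

When every training point has a distinct feature value and a distinct target, the default tree typically ends up with one leaf per sample and reproduces the training targets exactly.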

In [277]:

model.get_n_leaves()

Out[277]:

Support Vector Regression

In [282]:

from sklearn.svm import SVR

In [283]:

from sklearn.model_selection import GridSearchCV

In [284]:

param_grid = {'C':[0.01,0.1,1,5,10,100,1000],'gamma':['auto','scale']}
svr = SVR()

In [285]:

grid = GridSearchCV(svr,param_grid)

In [286]:

run_model(grid,X_train,y_train,X_test,y_test)

RMSE : 0.12634668775105407

In [287]:

grid.best_estimator_

Out[287]:

SVR(C=1000)
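The printed estimator lists only non-default parameters, so `SVR(C=1000)` means the search kept the default `gamma='scale'`. Since no `scoring` was passed, `GridSearchCV` ranked candidates by the estimator's own `score` method (R² for regressors) using 5-fold cross-validation. If you would rather select on RMSE, one option is the built-in `'neg_root_mean_squared_error'` scorer. The sketch below assumes a synthetic stand-in for the data and a smaller, hypothetical grid to keep it quick:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in for the rock density data (NOT the real CSV)
rng = np.random.default_rng(101)
signal = rng.uniform(0, 100, size=150)
density = 2.2 + 0.4 * np.sin(signal / 8) + rng.normal(0, 0.1, size=150)
X_syn = signal.reshape(-1, 1)

# Smaller illustrative grid than the one used above
param_grid = {'C': [0.1, 1, 10], 'gamma': ['auto', 'scale']}

# Scores are negated RMSE, so "higher is better" still holds
grid = GridSearchCV(SVR(), param_grid, cv=5,
                    scoring='neg_root_mean_squared_error')
grid.fit(X_syn, density)

print('best params :', grid.best_params_)
print('best CV RMSE:', -grid.best_score_)
```

Note that the best value of C in the notebook sits at the edge of the grid, which can be a hint that extending the grid is worth trying.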

Random Forest Regression

In [288]:

from sklearn.ensemble import RandomForestRegressor

In [289]:

# help(RandomForestRegressor)

In [290]:

trees = [10,50,100]
for n in trees:
    
    model = RandomForestRegressor(n_estimators=n)
    
    run_model(model,X_train,y_train,X_test,y_test)

RMSE : 0.1417613358931285
RMSE : 0.133281449397454
RMSE : 0.13699094997283662
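The loop above does not set `random_state`, so these RMSE values will shift slightly from run to run, which may explain why 100 trees scored a little worse than 50 here. Under the hood, a random forest regressor predicts the average of its individual trees' predictions. A minimal sketch on a synthetic stand-in (not the real data; the query points in `X_new` are hypothetical) confirms this:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the rock density data (NOT the real CSV)
rng = np.random.default_rng(101)
signal = rng.uniform(0, 100, size=300)
density = 2.2 + 0.4 * np.sin(signal / 8) + rng.normal(0, 0.1, size=300)
X_syn = signal.reshape(-1, 1)

# Fixing random_state makes the fitted forest reproducible
forest = RandomForestRegressor(n_estimators=10, random_state=101)
forest.fit(X_syn, density)

# Hypothetical new signal values to predict
X_new = np.array([[10.0], [50.0], [90.0]])

# One row of predictions per tree, then average across trees
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
manual_avg = per_tree.mean(axis=0)

print(manual_avg)
print(forest.predict(X_new))
```

Averaging many trees, each trained on a bootstrap sample, is what smooths out the jagged fit of a single unrestricted tree.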

Gradient Boosting

We will cover this in more detail in the next section.

In [291]:

from sklearn.ensemble import GradientBoostingRegressor

In [292]:

# help(GradientBoostingRegressor)

In [293]:

   
model = GradientBoostingRegressor()

run_model(model,X_train,y_train,X_test,y_test)

RMSE : 0.13294148649584667

AdaBoost

In [294]:

from sklearn.ensemble import AdaBoostRegressor

In [295]:

model = AdaBoostRegressor()

run_model(model,X_train,y_train,X_test,y_test)
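To line up the methods from this notebook in one place, the test RMSE of each model can be collected into a small table. The sketch below is one possible way to do that. It assumes a synthetic sine-like stand-in for the data (not the real CSV), and the model settings and the `summary` DataFrame are illustrative choices, so the numbers it prints are not the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the rock density data (NOT the real CSV)
rng = np.random.default_rng(101)
signal = rng.uniform(0, 100, size=300)
density = 2.2 + 0.4 * np.sin(signal / 8) + rng.normal(0, 0.1, size=300)
X_syn = signal.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, density, test_size=0.1, random_state=101
)

# Illustrative model settings; random_state fixed where available
models = {
    'Linear': LinearRegression(),
    'KNN (k=10)': KNeighborsRegressor(n_neighbors=10),
    'Decision Tree': DecisionTreeRegressor(random_state=101),
    'Random Forest': RandomForestRegressor(n_estimators=50, random_state=101),
    'Gradient Boosting': GradientBoostingRegressor(random_state=101),
    'AdaBoost': AdaBoostRegressor(random_state=101),
}

rows = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    rows.append({'model': name, 'rmse': rmse})

# Sort so the lowest test RMSE appears first
summary = pd.DataFrame(rows).sort_values('rmse').reset_index(drop=True)
print(summary)
```

A single 10% test split is small, so close scores between models should be read with caution. Cross-validation would give a steadier comparison.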