For each independent individual $i$ in each data set, we have a features vector $x_i=(x_{i_1},x_{i_2},x_{i_3})^T \in \mathbb{R}^3$ corresponding to the value of the wind speed, the wind direction, and the temperature. For each $i$, we know the label $y_i \in \mathbb{R}$ corresponding to the electrical power.
Problem : Given a features vector $x$ we want to predict its label $y$.
As $y \in \mathbb{R}$, this is a regression problem.
# Maths tools
import numpy as np
import scipy as sp
# Data Visualization
import pandas as pd # Dataframes
import matplotlib.pyplot as plt # plot (suitable for numerical data)
import seaborn as sns # plot (suitable for pandas dataframe)
# Files
import os # open files
# ML Preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline, Pipeline
# ML Models
from sklearn.linear_model import LinearRegression, Ridge, Lasso, SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, AdaBoostRegressor, GradientBoostingRegressor
# ML Selection
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import validation_curve
# ML Metrics
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
The two datasets contain the following features:
P : electrical power in kW.
V : wind speed in $m.s^{-1}$.
Dir : wind direction in degrees °.
T : temperature in °C.
Each .txt
file (eol1.txt and eol2.txt) stores data for a wind turbine. We will read these files and store the data in dataframes (one dataframe per turbine)
# Repertory where files are stored
repertory = './'
# The name of the files
filename1 = 'eol1.txt'
filename2 = 'eol2.txt'
files = [filename1,filename2]
datas = []
for file in files :
path = os.path.join(repertory, file)
try :
datas.append(
pd.read_csv(path, sep=' ')
)
except OSError as err:
print(f"OS error: {err}")
raise
data1_full, data2_full = datas
DataFrame = type(data1_full)
original_columns = data1_full.columns
X_original_cols = ['V','Dir','T']
y_original_col = ['P']
For each wind turbine, we are given a dataset where the labels are known.
In a real situation, our model should predict labels on a dataset that has never been seen, and where the labels are unknown.
We will simulate this real situation, by splitting each dataset in two. The test datasets will play the role of these "never seen" datasets.
# 80-20% split
test_size = 0.2
data1_X, data1_X_test, data1_y, data1_y_test =\
train_test_split(data1_full[X_original_cols],
data1_full[y_original_col],
test_size=test_size,
stratify=None)
data2_X, data2_X_test, data2_y, data2_y_test =\
train_test_split(data2_full[X_original_cols],
data2_full[y_original_col],
test_size=test_size,
stratify=None)
data1 = data1_X.join(data1_y).sort_index().reset_index(drop=True)
data2 = data2_X.join(data2_y).sort_index().reset_index(drop=True)
data1_test = data1_X_test.join(data1_y_test).sort_index().reset_index(drop=True)
data2_test = data2_X_test.join(data2_y_test).sort_index().reset_index(drop=True)
data1_train_original = data1.copy()
data2_train_original = data2.copy()
data1_test_original = data1_test.copy()
data2_test_original = data2_test.copy()
# don't keep copy
data1_X, data1_y = None,None
data2_X, data2_y = None,None
data1_full = None
data2_full = None
📎 From now on, the test data set is assumed to be unknown, and not in our possession.
We rename the column for our convenience.
original_columns = data1.columns
def rename_cols(data) :
data = data.copy()
new_columns = { 'P' : 'Power' , 'V' : 'Speed', 'Dir' : 'Direction', 'T' : 'Temperature' }
data = data.rename(columns = new_columns)
return data
data1 = rename_cols(data1)
data2 = rename_cols(data2)
X_cols = ['Speed','Direction','Temperature']
Y_col = ['Power']
Let's look at the shape of the data
print("Dataset 1:", data1.shape, "\tDataset 2:",data2.shape)
Dataset 1: (6892, 4) Dataset 2: (7008, 4)
We visualize a sample of the data with the sample()
method.
def displaying_two_prototype(display_fun,fun,name, data1, data2, **args) :
print(f"Dataset 1 ~ {name}")
display_fun(fun(data1,**args))
print(f"\n\n\nDataset 2 ~ {name}")
display_fun(fun(data2,**args))
return
def display_two(fun,name,data1,data2,**args) :
displaying_two_prototype(display,fun,name,data1,data2,**args)
return
def print_two(fun,name,data1,data2,**args) :
displaying_two_prototype(print,fun,name,data1,data2,**args)
return
display_two(DataFrame.sample,"Sample",data1,data2,n=5, random_state=15)
Dataset 1 ~ Sample
Speed | Direction | Temperature | Power | |
---|---|---|---|---|
5675 | 14.790000 | 209.073331 | 14.550000 | 1459.453328 |
6798 | 11.566667 | 184.533333 | 6.198333 | 1735.303345 |
3875 | 2.125000 | 70.116666 | 22.015000 | 5.495000 |
3273 | 7.625000 | 195.326668 | 13.920000 | 833.023336 |
3537 | 5.111667 | 238.054998 | 19.815000 | 191.511665 |
Dataset 2 ~ Sample
Speed | Direction | Temperature | Power | |
---|---|---|---|---|
2579 | 5.911667 | 275.165001 | 14.740000 | 324.186665 |
1574 | 6.266667 | 47.945000 | 11.525000 | 531.404999 |
1443 | 10.705000 | 34.066667 | 9.481667 | 1692.981649 |
1389 | 5.501667 | 137.639997 | 7.496667 | 253.661669 |
9 | NaN | 296.458333 | -0.301667 | -4.281667 |
The sample above shows us that the features are all real numbers.
We use the describe()
method to display descriptive statistics values on our data
display_two(DataFrame.describe,"Descriptive statistics",data1,data2)
Dataset 1 ~ Descriptive statistics
Speed | Direction | Temperature | Power | |
---|---|---|---|---|
count | 6816.000000 | 6815.000000 | 6816.000000 | 6816.000000 |
mean | 6.103965 | 194.705678 | 11.577526 | 478.179989 |
std | 2.627439 | 96.541765 | 7.449784 | 487.721729 |
min | 0.000000 | 1.513336 | -7.975000 | -8.103333 |
25% | 4.702917 | 124.429167 | 5.949167 | 96.532916 |
50% | 5.980000 | 203.586667 | 11.164167 | 312.683332 |
75% | 7.418750 | 271.608332 | 16.920417 | 718.196663 |
max | 19.870000 | 358.838338 | 34.550001 | 2049.641663 |
Dataset 2 ~ Descriptive statistics
Speed | Direction | Temperature | Power | |
---|---|---|---|---|
count | 6884.000000 | 6933.000000 | 6934.000000 | 6934.000000 |
mean | 5.252358 | 171.708384 | 12.238588 | 354.414159 |
std | 2.509134 | 86.256922 | 7.219262 | 450.278093 |
min | 0.000000 | 2.571667 | -5.761667 | -8.623333 |
25% | 3.644583 | 95.706668 | 6.789167 | 23.349167 |
50% | 5.134167 | 189.628334 | 12.683333 | 173.379164 |
75% | 6.498333 | 233.430000 | 17.482917 | 511.912507 |
max | 19.923333 | 358.385005 | 34.856667 | 2042.544963 |
We can now note some interesting things: there are negative electric power values. (see line min
)
We look at the distribution of each of the variables and their relationships using a pairplot
.
data1["Turbine"] = 1
data2["Turbine"] = 2
full_data = pd.concat([data1,data2])
sns.pairplot(full_data, hue='Turbine',
plot_kws = {'alpha': 0.3, 's': 20, 'edgecolor': 'k'},
height = 3, palette=['blue','orange'])
plt.suptitle("Pair plot of the two wind turbines", size=15, y=1.05)
plt.show()
data1.drop('Turbine', axis=1, inplace=True)
data2.drop('Turbine', axis=1, inplace=True)
Visually, it is expected that there is a strong dependency between the wind speed and the electrical power produced. The other variables seem to be less correlated (values are scattered).
We can indeed verify these results in the correlation matrix below.
corr_mat1 = data1.corr()
corr_mat2 = data2.corr()
fig, ax = plt.subplots(figsize=(20,8))
ax1 = plt.subplot(1,2,1)
ax1 = sns.heatmap(corr_mat1, annot=True, cmap="YlGnBu")
ax1.set_title("Wind turbine 1")
ax2 = plt.subplot(1,2,2)
ax2 = sns.heatmap(corr_mat2, annot=True, cmap="YlGnBu")
ax2.set_title("Wind turbine 2")
plt.show()
We check how the data are typed.
f = lambda x : x.dtypes
print_two(f,"Types",data1,data2)
Dataset 1 ~ Types Speed float64 Direction float64 Temperature float64 Power float64 dtype: object Dataset 2 ~ Types Speed float64 Direction float64 Temperature float64 Power float64 dtype: object
The default dtypes is relevant for real numbers. We will keep this dtypes
.
First, we check the lines where all values are missing
f = lambda x : x[x.isna().all(axis=1)].head(1)
display_two(f,"Example of line where all values are missing",data1,data2)
Dataset 1 ~ Example of line where all values are missing
Speed | Direction | Temperature | Power | |
---|---|---|---|---|
780 | NaN | NaN | NaN | NaN |
Dataset 2 ~ Example of line where all values are missing
Speed | Direction | Temperature | Power | |
---|---|---|---|---|
902 | NaN | NaN | NaN | NaN |
We remove these kind of lines.
def remove_empty_lines(data) :
data = data.dropna(how='all')
data.reset_index(inplace=True, drop=True)
return data
data1 = remove_empty_lines(data1)
data2 = remove_empty_lines(data2)
print("Dataset 1:", data1.shape, "\tDataset 2:",data2.shape)
Dataset 1: (6816, 4) Dataset 2: (6934, 4)
Now we check the lines where there is at least one missing value.
f = lambda x : x.isna().sum(axis=0)/x.count()
print_two(f, "Proportion of missing values (by line)", data1, data2)
Dataset 1 ~ Proportion of missing values (by line) Speed 0.000000 Direction 0.000147 Temperature 0.000000 Power 0.000000 dtype: float64 Dataset 2 ~ Proportion of missing values (by line) Speed 0.007263 Direction 0.000144 Temperature 0.000000 Power 0.000000 dtype: float64
For each column, the rows with at least one missing value correspond to less than 0.8% of the data. We will remove these lines (because it will not really change the distribution of our data).
def remove_missing(data) :
data = data.dropna(how='any')
data.reset_index(inplace=True, drop=True)
return data
f = lambda x : x.shape
print_two(f ,"Shape before removing missing values:",data1,data2)
data1 = remove_missing(data1)
data2 = remove_missing(data2)
print("\n\n\n")
print_two(f,"Shape after removing missing values:",data1,data2)
Dataset 1 ~ Shape before removing missing values: (6816, 4) Dataset 2 ~ Shape before removing missing values: (6934, 4) Dataset 1 ~ Shape after removing missing values: (6815, 4) Dataset 2 ~ Shape after removing missing values: (6883, 4)
We could see that there were data with negative electrical powers. On peut afficher leur proportion :
f = lambda x : ( x[x.Power < 0].count()/x.count() ).Power
print_two(f,"Proportion of negative electrical power", data1, data2)
Dataset 1 ~ Proportion of negative electrical power 0.10975788701393983 Dataset 2 ~ Proportion of negative electrical power 0.16998401859654222
This corresponds to about 10 to 20% of the data. This number is not negligible. We can ask ourselves if these negative values make sense. After research and reflection, we have concluded that they do. Indeed, it is possible that a wind turbine consumes more electricity than it produces (in very low wind conditions).
So we will not apply any particular pre-treatment for the lines where the power is negative. Especially since the power is the label to predict. If it is negative, it must be taken into account for the training of the model.
Now we standardize our features in order to put each feature on the same scale. We will center-reduce each value with the following formula :
$z_{i_f}=\dfrac{x_{i_f}-\mu_f}{\sigma_f}$
with :
For an instance of StandardScaler
, the method fit
compute the mean and the standard deviation to use and the method transform()
will apply the above formula on each value.
# Separing X and y
def separe_Xy(data) :
data_X = data[ X_cols ]
try :
data_y = data[ Y_col ]
except :
data_y = None
return data_X, data_y
data1_X, data1_y = separe_Xy(data1)
data2_X, data2_y = separe_Xy(data2)
# Prevent copy problems
data1 = None
data2 = None
# Standardize features X
def standardize(data_X, standard_scaler, fit) :
if fit : data_X = standard_scaler.fit_transform(data_X) # standard_scaler.fit(...) ; standard_scaler.tranfsorm(...)
else : data_X = standard_scaler.transform(data_X)
data_X = pd.DataFrame(data_X)
data_X.columns = X_cols
return data_X
def destandardize(data_X, standard_scaler) :
return data_X * np.sqrt(standard_scaler.var_) + standard_scaler.mean_
standard_scaler1 = StandardScaler()
data1_X = standardize(data1_X,standard_scaler1,fit=True)
standard_scaler2 = StandardScaler()
data2_X = standardize(data2_X,standard_scaler2,fit=True)
Now that we have pre-processed our training data, it is ready to be used to train our model. We now define a function preprocess
that performs all these preprocessing steps. The reasons is as follows.
After we have train our model on the training dataset, we want to predict labels for a supposed unknown similar dataset.
This supposed unknown dataset is not already preprocessed. We have to:
1) Rename our columns (only for our convenience)
2) Clean our data (removing missing values)
3) Standardize our data
⚠️ Note that standardization use parameters estimated from our training dataset (mean, standard deviation). The unknown similar dataset is supposed to have the same distribution. We don't have to re-estimated these parameters (especially since the training data set is representative). In other terms, we dont't have to apply fit
method of the StandardScaler
but only the transform
method
def preprocess_no_scale(data, verbose=False) :
data = rename_cols(data)
if verbose :
print("Rename")
display(data)
data = remove_empty_lines(data)
if verbose :
print("Remove empty")
display(data)
data = remove_missing(data)
if verbose :
print("Remove missing")
display(data)
data_X, data_y = separe_Xy(data)
if verbose :
print("Separe X Y")
display(data_X)
display(data_y)
return data_X, data_y
def preprocess_scale(data_X, data_y, standard_scaler, fit, verbose=False) :
data_X = standardize(data_X,standard_scaler,fit)
if verbose :
print("Standardize X")
display(data_X)
display(data_y)
return data_X, data_y
def preprocess(data, standard_scaler, fit, verbose=False) :
data_X, data_y = preprocess_no_scale(data, verbose)
return preprocess_scale(data_X, data_y, standard_scaler, fit, verbose)
def create_and_fit_model(data_X, data_y, model_name, **hyperparameters) :
regressor = model_name(**hyperparameters)
regressor.fit(data_X, data_y.values.ravel())
return regressor
regressor1 = create_and_fit_model(data1_X, data1_y , RandomForestRegressor)
regressor2 = create_and_fit_model(data2_X, data2_y , RandomForestRegressor)
Now it's time to test our models. For this we will use the test data that we have ignored so far. We do not know this data. The only thing we assume is that the test data set for turbine 1 (respectively, for turbine 2) follows the same probability law as the training data for turbine 1 (respectively, for turbine 2). These are the laws that we have tried to estimate by our models. This is what will allow us to make predictions on the unknown test data.
Our test data has not been pre-processed. However, this data must be in the same format as the training data that fed our model. In other words, our test data must not contain any missing values, and the feature values must be standardized using the training parameters (the means and the standard deviations of the training data, assumed to be representative of any data from the wind turbine in question)
# Preprocess our test data (no fit !)
data1_X_test, data1_y_test = preprocess(data1_test, standard_scaler1, fit=False)
data2_X_test, data2_y_test = preprocess(data2_test, standard_scaler2, fit=False)
We can now make our predictions.
# Make predictions
data1_y_test_pred = regressor1.predict(data1_X_test)
data2_y_test_pred = regressor2.predict(data2_X_test)
In order to evaluate the success of our predictions, we need to define performance metrics. Since we are dealing with a regression problem, our metrics will measure a distance between the prediction and the true value of the label. For $n$ individuals, with $y_i$ a true label of the individual $i$, $\hat{y}_i$ the predicted label for $i$ and $\bar{y}$ the mean of values, we will choose the following metrics:
$R_2 = 1 - \dfrac{\sum^{n}_{i=1}{(y_i - \hat{y}_i)^2}}{\sum^{n}_{i=1}{(y_i - \bar{y})^2}}$
$MAE = \dfrac{\sum^{n}_{i=1}{|y_i - \hat{y}_i|}}{n} $
$RMSE = \displaystyle\sqrt{\dfrac{\sum^{n}_{i=1}{(y_i - \hat{y}_i)^2}}{n}}$
We will now calculate the scores of our predictions for each of the models.
scores_keys = ['R2 Score [-1,1] ','Mean Absolute Error (kW) ', 'Root Mean Squared Error (kW) ']
def compute_scores(y_pred,y_true) :
return {
scores_keys[0] : r2_score(y_pred,y_true),
scores_keys[1] : mean_absolute_error(y_pred,y_true),
scores_keys[2] : mean_squared_error(y_pred,y_true,squared=False)
}
def scores_tostr(y_pred,y_true) :
s = ''
scores = compute_scores(y_pred,y_true)
for idx,score in scores.items() :
s = s+str(idx)+': \t'+str(score)+'\n'
return s
sco = lambda x : scores_tostr(x[0],x[1])
print_two(sco,'Scores', (data1_y_test,data1_y_test_pred) , (data2_y_test,data2_y_test_pred))
Dataset 1 ~ Scores R2 Score [-1,1] : 0.958806564723879 Mean Absolute Error (kW) : 40.729163940482145 Root Mean Squared Error (kW) : 98.46943075204126 Dataset 2 ~ Scores R2 Score [-1,1] : 0.9861313673199962 Mean Absolute Error (kW) : 26.844404390773516 Root Mean Squared Error (kW) : 52.913120698991825
Finally, we will plot the actual labels of the test data and the labels predicted by our model as a function of the "speed" feature. We choose speed because it is the feature most correlated to the label value.
def plot_predictions(scores1, data1_X_test,data1_y_test,data1_y_test_pred,standard_scaler1, model_name1,
scores2, data2_X_test,data2_y_test,data2_y_test_pred,standard_scaler2, model_name2,
feature="Speed") :
fig, (ax1,ax2) = plt.subplots(1,2, figsize=(14,8))
if (standard_scaler1 != None) :
df1 = destandardize(data1_X_test,standard_scaler1)
else :
df1 = data1_X_test
if (standard_scaler2 != None) :
df2 = destandardize(data2_X_test,standard_scaler2)
else :
df2 = data2_X_test
s = 5
#for col in df1.columns :
ax1.set_title(f"Dataset 1")
ax1.scatter(df1[feature],data1_y_test, c="blue", alpha=0.7, label="Real values", s=s)
ax1.scatter(df1[feature],data1_y_test_pred, c="green", alpha=0.7, label="Predicted values", s=s)
ax1.set_ylabel("Power")
ax1.set_xlabel("Speed")
ax1.text(13,200,f"R2 Score : { round(scores1[scores_keys[0]],3) }")
ax1.text(13,100,f"MAE : { round(scores1[scores_keys[1]],3) } kW")
ax1.text(13,0,f"RMSE : { round(scores1[scores_keys[2]],3) } kW")
ax2.set_title(f"Dataset 2")
ax2.scatter(df2[feature],data2_y_test, c="red",alpha=0.7, s=s, label="Real values")
ax2.scatter(df2[feature],data2_y_test_pred, c="green", alpha=0.7, s=s)
ax2.set_xlabel("Speed")
ax2.set_ylabel("Power")
ax2.text(13,200,f"R2 Score : { round(scores2[scores_keys[0]],3) }")
ax2.text(13,100,f"MAE : { round(scores2[scores_keys[1]],3) } kW")
ax2.text(13,0,f"RMSE : { round(scores2[scores_keys[2]],3) } kW")
fig.suptitle(f"{model_name} predictions")
fig.legend()
fig.show()
model_name = 'Random Forest'
plot_predictions(compute_scores(data1_y_test_pred,data1_y_test), data1_X_test,data1_y_test,data1_y_test_pred,standard_scaler1, model_name,
compute_scores(data2_y_test_pred,data2_y_test), data2_X_test,data2_y_test,data2_y_test_pred,standard_scaler2, model_name )
Measured performance may be biased. Indeed, the separation of the training data and the test data was done in a random way. Maybe the test data and the training data were by chance well chosen in order to obtain such a score. To correct this, we will use the K-folds technique.
First, the data set is randomly separated into k folds. The model will be tested on 1 folds, while the remaining k-1 folds will serve as training data. We reiterate by changing each time the fold that will be used as test data and by noting each time the scores obtained.
⚠️ Important remark: we do not preprocess the data before separating them into $k$ folds. Indeed, if we preprocess our data before the separation, the standardization would estimate parameters $\mu$ and $\sigma$ from the data set. Once this is done, we would separate the data into $k$ folds. One of the folds would serve as a test fold. However, the data of the test fold (supposedly unknown) were taken into account in the calculation of $\mu$ and $\sigma$. There would then be a data leak!
Thus, in our KFolds_validation
function we use non-preprocessed data that we separate into $k$ folds. Within each iteration, we preprocess the training folds: they are standardized. There is no information leakage because the standardization is done on the training folds. After that, we preprocess the test fold by standardizing from the parameters of the training folds (in order to have the test data in the same format), then we make our predictions.
# Create k-Fold (with k the number of splits)
def KFolds_validation(data, k, model_name, **hyperparameters) :
folds = KFold(n_splits = k, shuffle=True, random_state=10)
folds_scores = []
for train_index, valid_index in folds.split(data) :
# Split data
data_train = data.iloc[train_index]
data_valid = data.iloc[valid_index]
# Preprocessing
standard_scaler = StandardScaler()
data_X_train, data_y_train = preprocess(data_train, standard_scaler, fit=True)
# Build model
regressor = create_and_fit_model(data_X_train, data_y_train, model_name, **hyperparameters)
# Preprocess our valid data
data_X_valid, data_y_valid = preprocess(data_valid, standard_scaler, fit=False)
# Predict valid data
data_y_valid_pred = regressor.predict(data_X_valid)
# Compute scores
folds_scores += [ compute_scores(data_y_valid_pred,data_y_valid) ]
return folds_scores
data1 = data1_train_original.copy() # no preprocessing
data2 = data2_train_original.copy() # no preprocessing
folds_scores1 = KFolds_validation(data1,k=10, model_name=RandomForestRegressor)
folds_scores2 = KFolds_validation(data2,k=10, model_name=RandomForestRegressor)
display_two(lambda x : pd.DataFrame(x), "Folds scores", folds_scores1, folds_scores2 )
Dataset 1 ~ Folds scores
R2 Score [-1,1] | Mean Absolute Error (kW) | Root Mean Squared Error (kW) | |
---|---|---|---|
0 | 0.973109 | 37.787083 | 79.061367 |
1 | 0.966797 | 37.826243 | 83.939494 |
2 | 0.983642 | 36.376880 | 65.322201 |
3 | 0.956185 | 40.322963 | 103.928996 |
4 | 0.982418 | 35.093999 | 62.158823 |
5 | 0.984050 | 34.918770 | 59.843334 |
6 | 0.980036 | 36.861035 | 64.642455 |
7 | 0.971975 | 38.199674 | 81.174180 |
8 | 0.947531 | 43.326411 | 113.016364 |
9 | 0.975386 | 37.385591 | 77.660349 |
Dataset 2 ~ Folds scores
R2 Score [-1,1] | Mean Absolute Error (kW) | Root Mean Squared Error (kW) | |
---|---|---|---|
0 | 0.991098 | 25.068677 | 39.943213 |
1 | 0.982489 | 27.740841 | 58.115367 |
2 | 0.989417 | 28.069918 | 48.628165 |
3 | 0.989730 | 26.954498 | 45.965698 |
4 | 0.989585 | 26.733925 | 44.472214 |
5 | 0.989070 | 27.080457 | 46.109574 |
6 | 0.989610 | 25.116301 | 49.431577 |
7 | 0.989936 | 26.661346 | 46.394564 |
8 | 0.991621 | 23.650542 | 39.088449 |
9 | 0.983227 | 26.895971 | 57.465282 |
r2_scores1 = [ fold_scores[scores_keys[0]] for fold_scores in folds_scores1 ]
mae1 = [ fold_scores[scores_keys[1]] for fold_scores in folds_scores1 ]
rmse1 = [ fold_scores[scores_keys[2]] for fold_scores in folds_scores1 ]
r2_scores2 = [ fold_scores[scores_keys[0]] for fold_scores in folds_scores2 ]
mae2 = [ fold_scores[scores_keys[1]] for fold_scores in folds_scores2 ]
rmse2 = [ fold_scores[scores_keys[2]] for fold_scores in folds_scores2 ]
r2_scores1
folds1 = list(map( lambda x : "F"+str(x) , np.arange(1,len(folds_scores1)+1) ))
folds2 = list(map( lambda x : "F"+str(x) , np.arange(1,len(folds_scores2)+1) ))
fig, axes = plt.subplots(3,2, figsize=(14,14))
r2_min = min( min(r2_scores1), min(r2_scores2))-0.01
mae_min = min( min(mae1), min(mae2))-1
rmse_min = min( min(rmse1), min(rmse2))-1
r2_max = max( max(r2_scores1), max(r2_scores2))+0.01
mae_max = max( max(mae1), max(mae2))+1
rmse_max = max( max(rmse1), max(rmse2))+1
#
axes[0,0].bar(folds1,r2_scores1,
linewidth=1, edgecolor='black', color='blue')
axes[0,0].set_title("Dataset 1")
axes[0,0].set_ylabel("R2 Score")
axes[0,0].set_ylim(r2_min, r2_max)
#
axes[0,1].bar(folds2,r2_scores2,
linewidth=1, edgecolor='black', color='red')
axes[0,1].set_title("Dataset 2")
axes[0,1].set_ylim(r2_min, r2_max)
#
axes[1,0].bar(folds1,mae1,
linewidth=1, edgecolor='black', color='blue')
axes[1,0].set_ylabel("Mean Absolute Error")
axes[1,0].set_ylim(mae_min, mae_max)
#
axes[1,1].bar(folds2,mae2,
linewidth=1, edgecolor='black', color='red')
axes[1,1].set_ylim(mae_min, mae_max)
#
axes[2,0].bar(folds1,rmse1,
linewidth=1, edgecolor='black', color='blue')
axes[2,0].set_ylabel("Root Mean Squared Error")
axes[2,0].set_xlabel("Fold")
axes[2,0].set_ylim(rmse_min, rmse_max)
#
axes[2,1].bar(folds2,rmse2,
linewidth=1, edgecolor='black', color='red')
axes[2,1].set_xlabel("Fold")
axes[2,1].set_ylim(rmse_min, rmse_max)
(38.08844877828422, 114.01636377294118)
Finally, we compute the average scores of the obtained K-folds. We can see below that they are close to the scores obtained without the K-folds technique. So it was not a lucky guess.
def mean_cross(folds_scores) :
scores_list = []
for key in scores_keys :
scores_list.append(list(map(lambda x : x[key] , folds_scores)))
scores_means = list(map( lambda x: np.mean(x) , scores_list) )
return scores_means
def mean_cross_str(scores_means) :
s = {}
for i in range(0,len(scores_means)) :
s[scores_keys[i]] = scores_means[i]
return s
def dict_to_str(dic) :
s = ""
for key,val in dic.items() :
s+=key+': '+str(val)+'\n'
return s
print_two(lambda x : dict_to_str(mean_cross_str(mean_cross(x))),
"Mean score (K-folds)", folds_scores1, folds_scores2)
Dataset 1 ~ Mean score (K-folds) R2 Score [-1,1] : 0.9721128620588779 Mean Absolute Error (kW) : 37.809864898507925 Root Mean Squared Error (kW) : 79.07475629772696 Dataset 2 ~ Mean score (K-folds) R2 Score [-1,1] : 0.9885782794763742 Mean Absolute Error (kW) : 26.39724761963786 Root Mean Squared Error (kW) : 47.561410251932855
Until now, we have used the RandomForestRegressor
model with its default hyper-parameters.
Our goal is now to study the influence of hyperparameters on the performance of our model.
The validation_curve
function, will use the K-folds technique to compute test scores based on the value of the hyperparameter to study.
⚠️ Important note: in our KFolds_validation
function we use non-preprocessed data.
Here, the validation_curve
function will use the cross-validation K-Fold technique to separate the input data.
Problem: We want to standardize the data inside the iterations to avoid data leakage (see Section 6.1.3.4 The K-Folds technique)
To solve this problem, we will create a pipeline. The pipeline will be composed of a StandardScaler
and a RandomForestRegressor
model. This pipeline will be given as a parameter to the function (instead of just the model). Thus, the function will apply the pipeline at each iteration of the K-Fold cross-validation.
The first hyperparameter we will focus on is the min_samples_leaf
hyperparameter which allows to perform a tree pruning on the trees of our random forest.
def compute_validation_curve(data1_train,data2_train, model, param_name, param_range) :
data1_X_train, data1_y_train = preprocess_no_scale(data1_train)
data2_X_train, data2_y_train = preprocess_no_scale(data2_train)
ppl1 = Pipeline(
[
('scl', StandardScaler() ) ,
('model', model() )
]
)
ppl2 = Pipeline(
[
('scl', StandardScaler() ) ,
('model', model() )
]
)
train_scores1, valid_scores1 = validation_curve(ppl1,
data1_X_train, data1_y_train.values.ravel(),
param_name="model__"+param_name,
param_range=param_range,
cv=5, n_jobs=-1)
train_scores2, valid_scores2 = validation_curve(ppl2,
data2_X_train, data2_y_train.values.ravel(),
param_name="model__"+param_name,
param_range=param_range,
cv=5, n_jobs=-1)
mean_valid_scores1 = list(map(np.mean,valid_scores1))
mean_train_scores1 = list(map(np.mean,train_scores1))
mean_valid_scores2 = list(map(np.mean,valid_scores2))
mean_train_scores2 = list(map(np.mean,train_scores2))
return mean_valid_scores1,mean_train_scores1,mean_valid_scores2,mean_train_scores2
def plot_validation_curve(mean_valid_scores1,mean_train_scores1,mean_valid_scores2,mean_train_scores2,param_name,param_range) :
fig, axs = plt.subplots(1,2, figsize=(14,4))
axs[0].plot(param_range,mean_train_scores1, c="blue", label='Train set')
axs[0].plot(param_range,mean_valid_scores1, c="green", label='Test set')
axs[0].set_title("Dataset 1")
axs[0].set_xlabel(param_name)
axs[0].set_ylabel("R2 Score")
x_max = param_range[np.argmax(mean_valid_scores1)]
axs[0].axvline(x=x_max, color='black', linestyle=(0,(1,5)), label='Best fit' )
axs[0].legend()
axs[1].plot(param_range,mean_train_scores2, c="red", label='Train set')
axs[1].plot(param_range,mean_valid_scores2, c="green", label='Test set')
axs[1].set_title("Dataset 2")
axs[1].set_xlabel(param_name)
axs[1].set_ylabel("R2 Score")
x_max = param_range[np.argmax(mean_valid_scores2)]
axs[1].axvline(x=x_max, color='black', linestyle=(0,(1,5)), label='Best fit' )
axs[1].legend()
fig.suptitle("Hyperparameter influence (validation curve)")
fig.show()
return
%%time
data1_train = data1_train_original.copy()
data2_train = data2_train_original.copy()
min_samples_leaf = range(1,25,1)
mean_valid_scores1,mean_train_scores1,\
mean_valid_scores2,mean_train_scores2 = compute_validation_curve(data1_train,data2_train,
RandomForestRegressor,
'min_samples_leaf',
min_samples_leaf)
CPU times: user 811 ms, sys: 176 ms, total: 986 ms Wall time: 2min 28s
plot_validation_curve(mean_valid_scores1,
mean_train_scores1,
mean_valid_scores2,
mean_train_scores2,
'Min samples leaf',
min_samples_leaf)
We find that the RandomForestRegressor
with full trees (min_samples_leaf
close to 0) does overfitting (very high score for train, but lower for test). By performing tree pruning on the trees of the random forest, the score on the test increases, then decreases again if the trees are too pruned (underfitting).
The second hyperparameter we will focus on is the n_estimators
hyperparameter which compute the number of trees used for the RandomForestRegressor
model.
%%time
data1_train = data1_train_original.copy()
data2_train = data2_train_original.copy()
n_estimators_range = range(1,50,1)
mean_valid_scores1,mean_train_scores1,\
mean_valid_scores2,mean_train_scores2 = compute_validation_curve(data1_train,
data2_train,
RandomForestRegressor,
'n_estimators',
n_estimators_range)
CPU times: user 1.27 s, sys: 86.7 ms, total: 1.36 s Wall time: 1min 53s
plot_validation_curve(mean_valid_scores1,
mean_train_scores1,
mean_valid_scores2,
mean_train_scores2,
'Number of trees',
n_estimators_range)
As we can see on the graph, we do not increase the overfitting by adding more trees to the RandomForestRegressor
model. The score stabilizes for a large enough number of trees
We will now implement an automatic search procedure for the best hyperparameters.
rdf_params = {
'RdForest__n_estimators': sp.stats.randint(1,50),
'RdForest__max_depth' : np.arange(10,15,1),
'RdForest__min_samples_split' : np.arange(30,150),
'RdForest__min_samples_leaf' : np.arange(1,20),
#'RdForest__min_weight_fraction_leaf' : np.arange(0.0,0.1,0.01),
'RdForest__max_features' : ['auto', 'sqrt', 'log2'],
'RdForest__max_leaf_nodes' : np.arange(20,100,1),
#'RdForest__min_impurity_decrease' : np.arange(0,100,1),
'RdForest__ccp_alpha' : np.arange(0,100,1),
}
def automatic_search(data1_train, data2_train, model,model_name,model_params, randomized=True) :
data1_X_train, data1_y_train = preprocess_no_scale(data1_train)
data2_X_train, data2_y_train = preprocess_no_scale(data2_train)
# Pipelines
ppl1 = Pipeline( [ ('scl', StandardScaler() ) , (model_name, model() ) ] )
ppl2 = Pipeline( [ ('scl', StandardScaler() ) , (model_name, model() ) ] )
# KFold
kfold = KFold(n_splits = 5, shuffle=True)
# Model
searchCV = RandomizedSearchCV
if not randomized : searchCV = GridSearchCV
rds1 = searchCV(
ppl1, # model
model_params, # param dictionnary
cv = kfold, # number of folds in cross-val
refit = True, # after finding the best params, refit model on all dataset
verbose = 0, # 3 print details, False/0 pas print
n_jobs=-1, # if -1 use all processors
#random_state = 75
)
rds2 = searchCV(
ppl2, # model
model_params, # param dictionnary
cv = kfold, # number of folds in cross-val
refit = True, # after finding the best params, refit model on all dataset
verbose = 0, # 3 print details, False/0 pas print
n_jobs=-1, # if -1 use all processors
#random_state = 75
)
rds1.fit(data1_X_train, data1_y_train.values.ravel())
rds2.fit(data2_X_train, data2_y_train.values.ravel())
# return best parameter
return rds1, rds2
%%time
data1_train = data1_train_original.copy()
data2_train = data2_train_original.copy()
model_name = 'RdForest'
rf_rds1,rf_rds2 = automatic_search(data1_train,data2_train,RandomForestRegressor,model_name,rdf_params)
CPU times: user 1.55 s, sys: 28.4 ms, total: 1.57 s Wall time: 10.9 s
print_two(lambda x : dict_to_str(x),"Best hyperparameters" , rf_rds1.best_params_, rf_rds2.best_params_)
Dataset 1 ~ Best hyperparameters RdForest__ccp_alpha: 8 RdForest__max_depth: 13 RdForest__max_features: auto RdForest__max_leaf_nodes: 90 RdForest__min_samples_leaf: 8 RdForest__min_samples_split: 110 RdForest__n_estimators: 43 Dataset 2 ~ Best hyperparameters RdForest__ccp_alpha: 23 RdForest__max_depth: 14 RdForest__max_features: auto RdForest__max_leaf_nodes: 93 RdForest__min_samples_leaf: 8 RdForest__min_samples_split: 125 RdForest__n_estimators: 40
data1_test = data1_test_original.copy()
data1_test = data1_test_original.copy()
data1_X_test, data1_y_test = preprocess_no_scale(data1_test)
data2_X_test, data2_y_test = preprocess_no_scale(data2_test)
data1_y_test_pred = rf_rds1.predict(data1_X_test)
data2_y_test_pred = rf_rds2.predict(data2_X_test)
plot_predictions(compute_scores(data1_y_test_pred,data1_y_test), data1_X_test,data1_y_test,data1_y_test_pred,None, model_name,
compute_scores(data2_y_test_pred,data2_y_test), data2_X_test,data2_y_test,data2_y_test_pred,None, model_name)
Now that the typical steps of solving a regression problem have been discussed, we are equipped to compare different models in terms of their performance.
If the possibilities of hyperparameters to test are few, one can do an exhaustive search of the best hyperparameters with GridSearchCV
. Otherwise, it is preferable to make a randomized search with RandomizedSearchCV
and in this case, we will plot the influence of the hyperparameters on the performance of the model with a validation curve as previous (cf: Pruning and Number of trees for Random Forest).
def full_auto_search_print_plot(model,model_name,params,randomized) :
data1_train = data1_train_original.copy()
data2_train = data2_train_original.copy()
rds1,rds2 = automatic_search(data1_train,data2_train,model,model_name,params,
randomized=randomized)
print_two(lambda x : dict_to_str(x),"Best hyperparameters" , rds1.best_params_, rds2.best_params_)
# Make predictions
data1_test = data1_test_original.copy()
data1_test = data1_test_original.copy()
data1_X_test, data1_y_test = preprocess_no_scale(data1_test)
data2_X_test, data2_y_test = preprocess_no_scale(data2_test)
data1_y_test_pred = rds1.predict(data1_X_test)
data2_y_test_pred = rds2.predict(data2_X_test)
plot_predictions(compute_scores(data1_y_test_pred,data1_y_test), data1_X_test,data1_y_test,data1_y_test_pred,None, model_name,
compute_scores(data2_y_test_pred,data2_y_test), data2_X_test,data2_y_test,data2_y_test_pred,None, model_name)
return rds1, rds2
def full_validation_curve(model,param_name,param_range) :
data1_train = data1_train_original.copy()
data2_train = data2_train_original.copy()
mean_valid_scores1,mean_train_scores1,\
mean_valid_scores2,mean_train_scores2 = compute_validation_curve(data1_train,
data2_train,
model,
param_name,
param_range)
plot_validation_curve(mean_valid_scores1,
mean_train_scores1,
mean_valid_scores2,
mean_train_scores2,
param_name,
param_range)
return
estimators1 = []
estimators2 = []
estimators_names = []
estimators1.append(rf_rds1)
estimators2.append(rf_rds2)
estimators_names.append(model_name)
# Find best params
lin_reg_params = {
'Linear__fit_intercept' : [True,False] ,
}
model_name = 'Linear'
lin_rds1,lin_rds2 = full_auto_search_print_plot(LinearRegression,
model_name,
lin_reg_params,
randomized=False)
estimators1.append(lin_rds1)
estimators2.append(lin_rds2)
estimators_names.append(model_name)
Dataset 1 ~ Best hyperparameters Linear__fit_intercept: True Dataset 2 ~ Best hyperparameters Linear__fit_intercept: True
full_validation_curve(Ridge,"alpha",np.arange(0,1000,10))
# Find best params
ridge_params = {
'Ridge__fit_intercept' : [True,False] ,
'Ridge__alpha' : np.arange(0,300) ,
}
model_name = 'Ridge'
ridge_rds1,ridge_rds2 = full_auto_search_print_plot(Ridge,
model_name,
ridge_params,
randomized=True)
estimators1.append(ridge_rds1)
estimators2.append(ridge_rds2)
estimators_names.append(model_name)
Dataset 1 ~ Best hyperparameters Ridge__fit_intercept: True Ridge__alpha: 10 Dataset 2 ~ Best hyperparameters Ridge__fit_intercept: True Ridge__alpha: 67
full_validation_curve(Lasso,"alpha",np.arange(0,100,10))
# Find best params
lasso_params = {
'Lasso__fit_intercept' : [True,False] ,
'Lasso__alpha' : np.arange(0,30,0.01) ,
'Lasso__selection' : [True,False] ,
'Lasso__max_iter' : np.arange(100,10000,100) ,
'Lasso__tol' : np.arange(0.00001,0.1,0.001) ,
'Lasso__selection' : ['cyclic', 'random']
}
model_name = 'Lasso'
lasso_rds1,lasso_rds2 = full_auto_search_print_plot(Lasso,
model_name,
lasso_params,
randomized=True)
estimators1.append(lasso_rds1)
estimators2.append(lasso_rds2)
estimators_names.append(model_name)
Dataset 1 ~ Best hyperparameters Lasso__tol: 0.09501 Lasso__selection: random Lasso__max_iter: 8700 Lasso__fit_intercept: True Lasso__alpha: 14.65 Dataset 2 ~ Best hyperparameters Lasso__tol: 0.01001 Lasso__selection: cyclic Lasso__max_iter: 3800 Lasso__fit_intercept: True Lasso__alpha: 6.28
full_validation_curve(SGDRegressor,"alpha",np.arange(0.001,10,0.1))
full_validation_curve(SGDRegressor,"l1_ratio",np.arange(0.001,1,0.1))
full_validation_curve(SGDRegressor,"epsilon",np.arange(1,100,2))
full_validation_curve(SGDRegressor,"eta0",np.arange(0.001,1,0.1))
full_validation_curve(SGDRegressor,"power_t",np.arange(0.001,1,0.1))
SGD_reg_params = {
'SGD__loss' : ['squared_loss', 'huber', 'epsilon_insensitive', 'squared_epsilon_insensitive'],
'SGD__penalty' : ['l2', 'l1', 'elasticnet'] ,
'SGD__alpha' : np.arange(0.001,1,0.1),
'SGD__l1_ratio' : np.arange(0.001,1,0.1),
'SGD__epsilon' : np.arange(0.001,1,0.1),
'SGD__power_t' : np.arange(0.001,1,0.1),
'SGD__learning_rate' : ['constant','optimal','invscaling','adaptive'],
}
model_name = 'SGD'
sgd_rds1,sgd_rds2 = full_auto_search_print_plot(SGDRegressor,
model_name,
SGD_reg_params,
randomized=True)
estimators1.append(sgd_rds1)
estimators2.append(sgd_rds2)
estimators_names.append(model_name)
Dataset 1 ~ Best hyperparameters SGD__power_t: 0.30100000000000005 SGD__penalty: l1 SGD__loss: squared_loss SGD__learning_rate: invscaling SGD__l1_ratio: 0.001 SGD__epsilon: 0.501 SGD__alpha: 0.30100000000000005 Dataset 2 ~ Best hyperparameters SGD__power_t: 0.001 SGD__penalty: l1 SGD__loss: squared_loss SGD__learning_rate: optimal SGD__l1_ratio: 0.101 SGD__epsilon: 0.30100000000000005 SGD__alpha: 0.901
First, we define some kernel functions that are not available in the default sklearn library for KNeighborsRegressor
.
# kernel for KNN
def gaussian_kernel(distance) :
return np.exp(-distance**2)
def cauchy_kernel(distance) :
dim = 3 # number of features X
return 1/(1+distance**(dim+1))
full_validation_curve(KNeighborsRegressor,"n_neighbors",np.arange(1,100,1))
KNN_reg_params = {
'KNN__n_neighbors' : np.arange(1,50) ,
'KNN__weights' : ['uniform','distance', gaussian_kernel, cauchy_kernel] ,
'KNN__leaf_size' : np.arange(1,50) ,
'KNN__algorithm' : ['auto', 'ball_tree', 'kd_tree', 'brute'],
'KNN__p' : [1,2,3] ,
}
model_name = 'KNN'
knn_rds1,knn_rds2 = full_auto_search_print_plot(KNeighborsRegressor,
model_name,
KNN_reg_params,
randomized=True)
estimators1.append(knn_rds1)
estimators2.append(knn_rds2)
estimators_names.append(model_name)
Dataset 1 ~ Best hyperparameters KNN__weights: distance KNN__p: 1 KNN__n_neighbors: 18 KNN__leaf_size: 10 KNN__algorithm: brute Dataset 2 ~ Best hyperparameters KNN__weights: <function gaussian_kernel at 0x7fd038b55040> KNN__p: 1 KNN__n_neighbors: 13 KNN__leaf_size: 20 KNN__algorithm: ball_tree
full_validation_curve(DecisionTreeRegressor,"max_depth",np.arange(1,15,1))
full_validation_curve(DecisionTreeRegressor,"min_samples_split",np.arange(2,400,10))
full_validation_curve(DecisionTreeRegressor,"min_samples_leaf",np.arange(1,100,1))
#full_validation_curve(DecisionTreeRegressor,"min_weight_fraction_leaf",np.arange(0.0,0.5,0.01))
full_validation_curve(DecisionTreeRegressor,"max_leaf_nodes",np.arange(2,100,1))
#full_validation_curve(DecisionTreeRegressor,"min_impurity_decrease",np.arange(0,100,10))
full_validation_curve(DecisionTreeRegressor,"ccp_alpha",np.arange(0.0,100,10))
DT_reg_params = {
'DTree__splitter' : ['best','random'] ,
'DTree__max_depth' : np.arange(10,15,1),
'DTree__max_features' : ['auto', 'sqrt', 'log2'],
'DTree__min_samples_split' : np.arange(30,150),
'DTree__min_samples_leaf' : np.arange(1,20),
#'DTree__min_weight_fraction_leaf':np.arange(0.0,0.1,0.01),
'DTree__max_leaf_nodes':np.arange(20,100,1),
#'DTree__min_impurity_decrease':np.arange(0,100,1),
'DTree__ccp_alpha':np.arange(0,100,1),
}
model_name = 'DTree'
dt_rds1,dt_rds2 = full_auto_search_print_plot(DecisionTreeRegressor,
model_name,
DT_reg_params,
randomized=True)
estimators1.append(dt_rds1)
estimators2.append(dt_rds2)
estimators_names.append(model_name)
Dataset 1 ~ Best hyperparameters DTree__splitter: best DTree__min_samples_split: 149 DTree__min_samples_leaf: 3 DTree__max_leaf_nodes: 29 DTree__max_features: auto DTree__max_depth: 14 DTree__ccp_alpha: 50 Dataset 2 ~ Best hyperparameters DTree__splitter: best DTree__min_samples_split: 143 DTree__min_samples_leaf: 19 DTree__max_leaf_nodes: 84 DTree__max_features: auto DTree__max_depth: 10 DTree__ccp_alpha: 2
full_validation_curve(ExtraTreeRegressor,"max_depth",np.arange(1,15,1))
full_validation_curve(ExtraTreeRegressor,"min_samples_split",np.arange(2,400,10))
full_validation_curve(ExtraTreeRegressor,"min_samples_leaf",np.arange(1,100,1))
#full_validation_curve(ExtraTreeRegressor,"min_weight_fraction_leaf",np.arange(0.0,0.5,0.01))
full_validation_curve(ExtraTreeRegressor,"max_leaf_nodes",np.arange(2,100,1))
#full_validation_curve(ExtraTreeRegressor,"min_impurity_decrease",np.arange(0,100,10))
full_validation_curve(ExtraTreeRegressor,"ccp_alpha",np.arange(0.0,100,10))
ET_reg_params = {
'ExtraTree__splitter' : ['best','random'] ,
'ExtraTree__max_depth' : np.arange(10,15,1),
'ExtraTree__max_features' : ['auto', 'sqrt', 'log2'],
'ExtraTree__min_samples_split' : np.arange(30,150),
'ExtraTree__min_samples_leaf' : np.arange(1,20),
#'ExtraTree__min_weight_fraction_leaf':np.arange(0.0,0.1,0.01),
'ExtraTree__max_leaf_nodes':np.arange(20,100,1),
#'ExtraTree__min_impurity_decrease':np.arange(0,100,1),
'ExtraTree__ccp_alpha':np.arange(0,100,1),
}
model_name = 'ExtraTree'
ext_rds1,ext_rds2 = full_auto_search_print_plot(ExtraTreeRegressor,
model_name,
ET_reg_params,
randomized=True)
estimators1.append(ext_rds1)
estimators2.append(ext_rds2)
estimators_names.append(model_name)
Dataset 1 ~ Best hyperparameters ExtraTree__splitter: best ExtraTree__min_samples_split: 46 ExtraTree__min_samples_leaf: 13 ExtraTree__max_leaf_nodes: 43 ExtraTree__max_features: auto ExtraTree__max_depth: 13 ExtraTree__ccp_alpha: 29 Dataset 2 ~ Best hyperparameters ExtraTree__splitter: best ExtraTree__min_samples_split: 119 ExtraTree__min_samples_leaf: 5 ExtraTree__max_leaf_nodes: 31 ExtraTree__max_features: auto ExtraTree__max_depth: 10 ExtraTree__ccp_alpha: 10
full_validation_curve(AdaBoostRegressor,"n_estimators",np.arange(1,100))
full_validation_curve(AdaBoostRegressor,"learning_rate",np.arange(0.01,1,0.1))
AdaB_reg_params = {
'AdaBoost__n_estimators' : np.arange(1,100) ,
'AdaBoost__learning_rate' : np.arange(0.01,1,0.01) ,
}
model_name = 'AdaBoost'
ada_rds1,ada_rds2 = full_auto_search_print_plot(AdaBoostRegressor,
model_name,
AdaB_reg_params,
randomized=True)
estimators1.append(ada_rds1)
estimators2.append(ada_rds2)
estimators_names.append(model_name)
Dataset 1 ~ Best hyperparameters AdaBoost__n_estimators: 42 AdaBoost__learning_rate: 0.86 Dataset 2 ~ Best hyperparameters AdaBoost__n_estimators: 18 AdaBoost__learning_rate: 0.06999999999999999
full_validation_curve(GradientBoostingRegressor,"n_estimators",np.arange(1,100,10))
full_validation_curve(GradientBoostingRegressor,"max_depth",np.arange(1,20,1))
full_validation_curve(GradientBoostingRegressor,"learning_rate",np.arange(0.01,1,0.1))
full_validation_curve(GradientBoostingRegressor,"subsample",np.arange(0.01,1,0.1))
GradB_reg_params = {
'GradBoost__n_estimators' : np.arange(1,100) ,
'GradBoost__learning_rate' : np.arange(0.01,1,0.01) ,
'GradBoost__max_depth' : np.arange(1,20,1) ,
'GradBoost__subsample' : np.arange(0.01,1,0.01) ,
}
model_name = 'GradBoost'
grb_rds1,grb_rds2 = full_auto_search_print_plot(GradientBoostingRegressor,
model_name,
GradB_reg_params,
randomized=True)
estimators1.append(grb_rds1)
estimators2.append(grb_rds2)
estimators_names.append(model_name)
Dataset 1 ~ Best hyperparameters GradBoost__subsample: 0.52 GradBoost__n_estimators: 92 GradBoost__max_depth: 7 GradBoost__learning_rate: 0.060000000000000005 Dataset 2 ~ Best hyperparameters GradBoost__subsample: 0.93 GradBoost__n_estimators: 75 GradBoost__max_depth: 4 GradBoost__learning_rate: 0.73
One of the first difficulties was to define a good metric to evaluate the performance of our model. After reflection we chose to define three metrics:
Another difficulty was to choose the ranges for our hyperparameters to test. We finally decided to first make validation curves in order to have an idea of how the score varies according to the value of a hyperparameter.
best_scores1 = list(map(lambda x : x.best_score_ , estimators1))
best_scores2 = list(map(lambda x : x.best_score_ , estimators2))
plt.figure( figsize=(10,4))
plt.scatter(estimators_names,best_scores1, label='Turbine 1',
s=150, marker='^', linewidth=1, edgecolor='black', color='blue')
plt.scatter(estimators_names,best_scores2, label='Turbine 2',
s=150, marker='^', linewidth=1, edgecolor='black', color='red')
plt.axhline(y=0.95, color='r', linestyle='--', label='R2 Score = 95%')
plt.legend(fancybox=True, bbox_to_anchor=(1.5, 1.1), ncol=1, framealpha=1, shadow=True, borderpad=1)
plt.title("Comparison of estimators")
plt.ylabel("R2 Score")
plt.show()
As expected, linear models perform less well because the data values are not linear. On the other hand, they have a rather correct score (R2 score > 80%) for a very low cost. These models allow us to make a more or less correct estimation at a lower cost and the predicted values are easily interpretable (linearity between the wind speed and the electric power produced).
Concerning the non-linear models, we have much better performances. We can see that the GradientBoosting and KNN models get the best scores.
There is no model that disproportionately dominates the others.
The data for each wind turbine has its own behavior. However, we could see that they are similar in many points:
However, they have their own characteristics: turbine 1 has two power regimes at very high speed (visible by the shape of the points forming two horizontal lines) while the power produced by turbine 2 follows its "natural" curve. Therefore, the models have better performances on the data set of turbine 2 than on that of turbine 1 (as can be seen on the previous graph)
However, there is no model that would be more adapted to one or more adapted to the other. The best performing models on one are also the best performing models on the other.
In the previous section we could see different curve validations. What emerges is that for linear models, the smaller the alpha hyperparameter the better the scores, however, if it is too small it can overfit the training data. We find the same phenomenon for the KNN
model with the hyperparameter n_neighbors
(which corresponds to the number of neighbors), the parameter min_samples_leaf
,min_samples_split
,ccp_alpha
for DecisionTree
, ExtraTree
, RandomForest
, and the max_depth
for GradientBoosting.
The other hyperparameters do not seem to cause overffiting (the score evolves in a similar way between the training and the test data)
Homework 1
To be done in groups of 2 students.
Predicting wind energy
The two datasets contain the following features.
P : electrical power in kW.
V : wind speed in $m.s^{-1}$.
Dir : wind direction in degrees °.
T : temperature in °C.
Your job
We want to see the following things
Hand in your assignment before 18/10/21 23h55: