In this study, we'll cover how to:
The UCI website provides everything you need to get started:
from pathlib import Path
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
#dataquest solution
#https://github.com/dataquestio/solutions/blob/master/Mission155Solutions.ipynb
#about the dataset
#https://archive.ics.uci.edu/ml/datasets/automobile
#https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names
#download the dataset
#https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
attributes=["symboling",
"normalized_losses",
"make",
"fuel_type",
"aspiration",
"doors",
"body_style",
"drive_wheels",
"engine_location",
"wheel_base",
"length",
"width",
"height",
"curb_weight",
"engine_type",
"num_cylinders",
"engine_size",
"fuel_system",
"bore",
"stroke",
"compression_ratio",
"horsepower",
"peak_rpm",
"city_mpg",
"highway_mpg",
"price"]
cars = pd.read_csv('data/imports-85.data',
header=None,
names=attributes)
Let's take a quick look at the raw data.
def highlight_missing_value(val):
"""
return the css property `'background-color: orange'` for "?" value, white otherwise.
"""
color = 'orange' if val=="?" else 'white'
return 'background-color: %s' % color
cars.head().style.applymap(highlight_missing_value)
symboling | normalized_losses | make | fuel_type | aspiration | doors | body_style | drive_wheels | engine_location | wheel_base | length | width | height | curb_weight | engine_type | num_cylinders | engine_size | fuel_system | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9 | 111 | 5000 | 21 | 27 | 13495 |
1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | dohc | four | 130 | mpfi | 3.47 | 2.68 | 9 | 111 | 5000 | 21 | 27 | 16500 |
2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | ohcv | six | 152 | mpfi | 2.68 | 3.47 | 9 | 154 | 5000 | 19 | 26 | 16500 |
3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | ohc | four | 109 | mpfi | 3.19 | 3.40 | 10 | 102 | 5500 | 24 | 30 | 13950 |
4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | ohc | five | 136 | mpfi | 3.19 | 3.40 | 8 | 115 | 5500 | 18 | 22 | 17450 |
The dataset contains categorical and numerical variables. We will need to take this into account when preprocessing the dataset thereafter. Since we plan to perform a linear regression, it's imperative to work with numerical values only. The dataset has also missing values we need to fix.
According to the Data Set Description, the missing values are denoted by an interrogation point on the following attributes:
Unfortunately, this causes this attributes have been cast to Pandas objects even when they contain continuous values.
So we will do the following:
np.nan
values.def remove_attribute(attribute, df):
"""
Remove column from a dataframe
attribute (string): the column name to remove
df (DataFrame): the data
"""
df.drop(attribute,axis=1,inplace=True)
def clean_continuous_attributes(mv_str, cols_index, df):
"""
Replace missing vaalues by np.nan value and change object columns to float data type
mv_str (string): the string denoting the missing value
cols_index (list of integers) : the columns index we want to clean
df (DataFrame) : the data
"""
for idx, col in enumerate(df.columns):
if idx in cols_index:
df[col]=df[col].replace(mv_str, np.nan)
df[col]=df[col].astype(np.float64)
return df
def remove_rows(target, df):
"""
Remove rows with np.nan values from a DataFrame and reset the index
target (array-like of strings): the column name which we want to remove np.nan values
df (DataFrame) : the data
"""
df.dropna(subset=target,inplace=True)
df.reset_index(inplace=True)
df.drop("index",axis=1,inplace=True)
def impute_missing_values(cols_index, df):
"""
Replace np.nan values by the mean of the attribute
cols_index (list of integers) : the columns index we want to replace np.nan values by the mean
df (DataFrame) : the data
"""
for idx, col in enumerate(df.columns):
if idx in cols_index:
m = df[col].mean()
df[col] = df[col].fillna(m)
return df
remove_attribute("normalized_losses", cars)
missing_values_col_index=[17,18,20,21,24]
cars = clean_continuous_attributes("?", missing_values_col_index, cars)
remove_rows(["price"], cars)
cars = impute_missing_values(missing_values_col_index, cars)
cars.head()
symboling | make | fuel_type | aspiration | doors | body_style | drive_wheels | engine_location | wheel_base | length | ... | engine_size | fuel_system | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 13495.0 |
1 | 3 | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | 168.8 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500.0 |
2 | 1 | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | 171.2 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500.0 |
3 | 2 | audi | gas | std | four | sedan | fwd | front | 99.8 | 176.6 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950.0 |
4 | 2 | audi | gas | std | four | sedan | 4wd | front | 99.4 | 176.6 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450.0 |
5 rows × 25 columns
#safety check
assert cars.isnull().sum().sum()==0
cars.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 201 entries, 0 to 200 Data columns (total 25 columns): symboling 201 non-null int64 make 201 non-null object fuel_type 201 non-null object aspiration 201 non-null object doors 201 non-null object body_style 201 non-null object drive_wheels 201 non-null object engine_location 201 non-null object wheel_base 201 non-null float64 length 201 non-null float64 width 201 non-null float64 height 201 non-null float64 curb_weight 201 non-null int64 engine_type 201 non-null object num_cylinders 201 non-null object engine_size 201 non-null int64 fuel_system 201 non-null object bore 201 non-null float64 stroke 201 non-null float64 compression_ratio 201 non-null float64 horsepower 201 non-null float64 peak_rpm 201 non-null float64 city_mpg 201 non-null int64 highway_mpg 201 non-null int64 price 201 non-null float64 dtypes: float64(10), int64(5), object(10) memory usage: 39.4+ KB
We need also to replace the two missing values for the doors column by "unknown" and then to one-hot encode (i.e., generate a column by category) the categorical columns (including the symboling attribute). Categorical variables cannot be included in a linear model if not coded as integers first and we want to avoid categorical features to be treated as ordered values. Since the symboling feature has ordered integer values, we will also one-hot encode this numerical attribute.
from sklearn.preprocessing import OneHotEncoder
#replace the two missing values for the doors column by "unknown"
cars["doors"]=cars["doors"].replace("?", "unknown")
#replace symboling attribute ordered values by letters
symboling_dict = { 3:"G", 2:"F", 1:"E", 0:"D", -1:"C", -2:"B", -3:"A" }
cars["symboling"] = cars["symboling"].apply(lambda x: symboling_dict[x])
#select columns object
cars_object = cars.select_dtypes('object')
#one-hot encoding
enc=OneHotEncoder().fit(cars_object)
#pass to array
encoded = enc.transform(cars_object).toarray()
#create a dataframe with the binary features
one_hot_encoded_features = pd.DataFrame(index=cars.index)
#put binary features names in a readable fashion
binary_col_names = enc.get_feature_names(cars_object.columns)
for idx, feature in enumerate(binary_col_names):
one_hot_encoded_features[feature]=encoded[:,idx]
one_hot_encoded_features.head()
symboling_B | symboling_C | symboling_D | symboling_E | symboling_F | symboling_G | make_alfa-romero | make_audi | make_bmw | make_chevrolet | ... | num_cylinders_twelve | num_cylinders_two | fuel_system_1bbl | fuel_system_2bbl | fuel_system_4bbl | fuel_system_idi | fuel_system_mfi | fuel_system_mpfi | fuel_system_spdi | fuel_system_spfi | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
2 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
5 rows × 66 columns
The mean value of binary features shows us the prevalence, meaning what proportion of the described object has that characteristic. The plot below shows that a dozen characteristics are common across cars, but most of them concern less than 20% of the dataset.
one_hot_encoded_features.apply(pd.value_counts)
sorted_one_hot_encoded_features = one_hot_encoded_features.iloc[:, 1:].mean().sort_values()
ax = sorted_one_hot_encoded_features.plot(kind='barh',
stacked=True,
figsize=(5, 12),
title='Prevalence of Binary Features')
ax.set_xlabel('Proportion of cars');
As mentioned by the scikit-learn documentation:
If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
That's why we need to transform the data so that each attribute will be like standard normally distributed data (Gaussian) with zero mean and unit variance.
#remove all object features since we constructed binary features from them
cars=cars.drop(cars.select_dtypes('object').columns,axis=1)
#join binary features and continuous features
all_features = one_hot_encoded_features.join(cars)
#normalize the numerical values
standardized_cars = (all_features - all_features.mean())/(all_features.std())
#reintroduce the non normalized target column
standardized_cars["price"] = all_features["price"]
#safety check if mean = 0 and variance = 1 for each column
assert np.isclose(standardized_cars.std(),1.0).sum()==len(standardized_cars.columns)-1
assert np.isclose(standardized_cars.mean(),0).sum()==len(standardized_cars.columns)-1
Let’s get some insights by looking at the variables' distributions and at the pairwise relationships between them. Only main continuous variables will be used. The following plot shows their Pearson standard correlation coefficients.
plt.style.use('fivethirtyeight')
plt.figure(figsize=(10,10))
continuous_idx = len(one_hot_encoded_features.columns)
ticks = range(len(standardized_cars.columns)-continuous_idx)
dt = standardized_cars.iloc[:,continuous_idx:]
plt.grid(False)
plt.imshow(dt.corr())
plt.xticks(ticks,dt.columns,rotation=90)
plt.yticks(ticks,dt.columns)
plt.colorbar()
plt.show()
We can see that some variables are strongly linearly correlated, exhibiting clusters (length, width, wheelbase, or between city-mpg and highway-mpg). Risk of redundancy here and that may decrease the performance of the model since correlated features are known to induce instabilities. Regarding price target, the strongest correlation occurs with the engine size attribute, so we can expect this feature to be a good predictor.
Now we have removed the discrete variables, we are ready to perform a linear regression based on the k-nearest neighbors (KNN) algorithm using the root of mean squared error (RMSE) as error metric during test/train validation. First, we will fit a univariate model, useful to know which features are the best predictors.
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
def knn_train_test(col_names, target, df, k = 5):
"""
Trains a KNN model and returns the RMSE
col_names (string): the training features
target (string): the column target
df (DataFrame): the data
k (int) : the number of neighbors
output (float) : RMSE
"""
#initiate a seed for reproductibility
np.random.seed(123)
#randomization
p = np.random.permutation(len(df))
df = df.loc[p]
#using 50/50 split for training and testing data
l = len(df)//2
#train only col_name
train_feature = df.iloc[0:l][col_names]#.values.reshape(l,1)
test_feature = df.iloc[l:][col_names]#.values.reshape(len(df)-l,1)
#target
train_target = df.iloc[0:l][target]
test_target = df.iloc[l:][target]
#instantiate KNeighborsRegressor class with default parameters
knn = KNeighborsRegressor(n_neighbors = k)
# Pass everything into the fit method.
knn.fit(train_feature, train_target)
#making predictions on the testing set using the training data
predictions = knn.predict(test_feature)
#calculating mean squared error (MSE) and root mean squared error (RMSE)
predictions_mse = mean_squared_error(test_target, predictions)
predictions_rmse = predictions_mse ** (1/2)
return predictions_rmse
target = "price"
rmse=[]
for col in standardized_cars.columns[:-1]:
rmse.append(knn_train_test([col], target, standardized_cars))
best_attributes = pd.Series(data=rmse, index=standardized_cars.columns[:-1]).sort_values()
best_attributes
horsepower 4367.420884 engine_size 4622.893168 curb_weight 4741.616768 highway_mpg 5077.530212 city_mpg 5746.501341 ... fuel_system_spdi 9783.758088 make_plymouth 9787.573337 make_isuzu 9807.865944 make_mitsubishi 9808.004146 make_chevrolet 9918.293043 Length: 79, dtype: float64
Since the car prices are ranging from 5118 to 45400, the best error score of 4367 seems pretty bad. Let's create a new function, namely knn_predictions
, that will return the predictions and the testing set. Then we will plot the results with another function, plot_pred
.
We also create another error score function returning the Mean Absolute Percentage Error (MAPE), fitting better with past results (Kibler et al., 1989) for comparison purposes: 11,84% for their instance-based technique (IBL) algorithm derived from KNN algorithm, and 14,12 % for the resulting linear regression equation, according to UCI website.
def knn_predictions(col_names, target, df, k = 5):
"""
Trains a KNN model, returns the predictions & testing set
col_names (string): the training features
target (string): the column target
df (DataFrame): the data
k (int) : the number of neighbors
output:
test_target (Pandas series): the testing set
predictions (array): model predictions
"""
#initiate a seed for reproductibility
np.random.seed(123)
#randomization
p = np.random.permutation(len(df))
df = df.loc[p]
#using 50/50 split for training and testing data
l = len(df)//2
#train only col_name
train_feature = df.iloc[0:l][col_names]#.values.reshape(l,1)
test_feature = df.iloc[l:][col_names]#.values.reshape(len(df)-l,1)
#target
train_target = df.iloc[0:l][target]
test_target = df.iloc[l:][target]
#instantiate KNeighborsRegressor class with default parameters
knn = KNeighborsRegressor(n_neighbors = k)
# Pass everything into the fit method.
knn.fit(train_feature, train_target)
#making predictions on the testing set using the training data
predictions = knn.predict(test_feature)
return test_target, predictions
def mean_absolute_percentage_error(target, predictions):
"""
Calculate MAPE error metric
target (string): the attribute we want to predict
predictions
output (float): MAPE score
"""
return np.mean(np.abs((target - predictions) / target))
def plot_pred(target, predictions, title=""):
"""
Scatter plot of model predictions (y-axis) vs target (x-axis)
target (1d array): the true values of the attribute we want to predict
predictions (1d array): the predictions made by the model
title (string, optional): the title of the chart
"""
fig, ax = plt.subplots(figsize=(8, 8))
mape = mean_absolute_percentage_error(target, predictions)
string_score = f'MAPE on testing set: {mape:.2%}'
ax.scatter(target, predictions)
ax.plot([0, 1], [0, 1], transform=ax.transAxes,ls="--", c="red")
plt.text(10000, 27000, string_score)
plt.title(title)
plt.ylabel('Model predictions')
plt.xlabel('Truths')
plt.xlim(5000,46000)
plt.ylim(5000,46000)
plt.show()
tt, pred = knn_predictions(["horsepower"], "price", standardized_cars)
plot_pred(tt, pred, "Univariate model (horsepower)")
The figure confirms our first intuition: the best univariate model is not very predictive. Good predictions should lie on the red line.
Since the KNN algorithm is based on finding the $k$-neighbors of a data point, it may be possible to improve the model by tuning the $k$ parameter (namely n_neighbors
parameter). As pointed in the scikit-learn documentation:
The optimal choice of the [k] value is highly data-dependent: in general a larger suppresses the effects of noise, but makes the classification boundaries less distinct.
In other words, a smaller $k$ may lead to overfitting, while a larger value may come at the cost of predictive power (underfitting).
Shortcut:
So, this is about the famous bias-variance tradeoff. How to minimize bias without increasing variance, and vice versa?
#tuning k parameter and save the results into a DataFrame
univariate_model_df = pd.DataFrame(index=standardized_cars.columns[:-1])
for k in [1,3,5,7,9]:
rmse=[]
col_name="k"+str(k)
for col in standardized_cars.columns[:-1]:
rmse.append(knn_train_test([col], target, standardized_cars, k))
univariate_model_df[col_name]=rmse
#label the best RMSE scores by k-neighbors values
univariate_model_df.idxmin(axis=0)
k1 engine_size k3 engine_size k5 horsepower k7 engine_size k9 engine_size dtype: object
Hyperparameter optimization doesn't look to upset dramatically the features importance, though the horsepower leadership is now challenged by the engine size attribute. Let's take a look.
plt.figure(figsize=(10,5))
x=[1,3,5,7,9]
plt.plot(x,univariate_model_df.loc["horsepower"],label="horsepower")
plt.plot(x,univariate_model_df.loc["engine_size"],label="engine size")
plt.title("Univariate Model, tuning $k$")
plt.legend(loc='lower right')
plt.ylabel("RMSE")
plt.xlabel("k-neighbors")
plt.xticks(x)
plt.show()
Indeed, the univariate model based on the horsepower attribute didn't improve its initial error score of 4367 (with $k$ = 5, the default value). Remember that the engine size attribute has the best Pearson score (highly correlated to price variable). The figure confirms this fact since this attribute improves notably the univariate model by reaching the 4200 levels when $1\leq k \leq3$. But, as said before, low $k$ values present a non-negligible risk of overfitting, so it is preferable to be cautious with this result.
Let's now plot again the prediction and calculate the MAPE score.
tt, pred = knn_predictions(["engine_size"], "price", standardized_cars, k = 3)
plot_pred(tt, pred, "Univariate model (engine size, $k$ = 3)")
Surprisingly, the MAPE score is worst than expected! It means there is no linear relationship between RMSE and MAPE scores, at least for the engine size attribute. It's still unclear if the engine size attribute is more informative than the horsepower one. Finally, despite hyperparameter optimization, the best univariate models are still poorly predictive.
Since the univariate models are getting poor results, we will now build several multivariate models and work on feature selection.
Possible approaches consist of testing all attributes together or only the continuous ones.
However, by selecting the best attributes obtained during the hyperparameter optimization step (saved into univariate_model_df
DataFrame), we can build several other multivariate models (with $k$ = 5, the default value) and save again the results.
#useful variable
total_features = len(standardized_cars.columns)-1
#full feature set model
target = "price"
full_feature_set = standardized_cars.columns[:-1].tolist()
print(knn_train_test(full_feature_set, target, standardized_cars))
5748.238867240251
#only continuous features
target = "price"
continuous_features = standardized_cars.iloc[:,66:-1].columns.tolist()
print(knn_train_test(continuous_features, target, standardized_cars))
4935.325727330618
At first glance, the binary features are not adding value to the model, at least when used indiscriminately. Indeed, the results above shows that the full feature set (i.e. absence of feature selection) has a larger score error than the model based on continuous features only.
#print the top 10 obtained from previous hyperparameter optimization step
sorted_univariates = univariate_model_df.mean(axis=1).sort_values(ascending=False)
ax = sorted_univariates[-10:].plot(kind='barh',
stacked=True,
figsize=(5, 5),
title='Univariate model')
ax.set_xlabel('Average RMSE, tuning $k$');
ax.set_ylabel('Attribute');
We will now build new models based on univariate_model_df
sorted values, building feature sets to test by selecting the top-$X$ best features where $2\leq X \leq 77$ (total number of attributes).
#select the top-X index from univariate_model_df and build a model
rmse=[]
for i in range(2,total_features):
col_names = univariate_model_df.mean(axis=1).sort_values().head(i).index
rmse.append(knn_train_test(col_names, target, standardized_cars))
plt.figure(figsize=(10,5))
plt.plot(rmse)
plt.title("Multivariate model, $k$ = 5")
plt.xlabel("Number of attributes")
plt.ylabel("RMSE")
plt.show()
By using this hierarchical approach, we still don't see any significant improvement. The best univariate model (horsepower with default values) is still leading the competition in terms of RMSE (4367). As for the univariate models, we will now tune the $k$ parameter and save the results into a DataFrame.
#tuning k parameter and save the results into a DataFrame
multivariate_model_df = pd.DataFrame(index=range(2,total_features))
for k in range(1,20):
rmse=[]
col_name="k"+str(k)
for i in range(2,total_features):
col_names = univariate_model_df.mean(axis=1).sort_values().head(i).index
rmse.append(knn_train_test(col_names, target, standardized_cars, k))
multivariate_model_df[col_name]=rmse
#print the top 10 obtained from previous hyperparameter optimization step
sorted_multivariates = multivariate_model_df.mean(axis=1).sort_values(ascending=False)
ax = sorted_multivariates[-10:].plot(kind='barh',
stacked=True,
figsize=(5, 5),
title='Multivariate model, tuning $k$')
ax.set_xlabel('Average RMSE');
ax.set_ylabel('Number of attributes');
x = multivariate_model_df.idxmin(axis=1).sort_values().apply(lambda x: x[1])
y = multivariate_model_df.min(axis=1).sort_values()
plt.scatter(x,y)
plt.xlabel("k-neighbors")
plt.ylabel("Minimum RMSE")
plt.title("Multivariate model, best $k$ values")
plt.show()
The model with the 15th best attributes has the best average score over all $k$ values, and this time we managed to obtain better RMSEs overall, but it looks suspicious since most of the minimal values occur when $k=1$.
plt.figure(figsize=(10,5))
x=range(1,20)
line1 = int(best_attributes[0])
plt.plot(x,multivariate_model_df.min(axis=0))
plt.xticks(x)
plt.xlabel("k-neighbors")
plt.ylabel("Minimum RMSE")
plt.hlines(line1, 1, 19, linestyle='dashed',linewidth=2)
plt.text(14, 4400, "horsepower, $k$ = 5")
plt.title("Multivariate model, best scores, tuning $k$")
plt.show()
Discarding the risk of overfitting at $k <3$, we think there is maybe a good window at $3\leq k \leq6$ where the best scores are under the horsepower score dashed line, our best univariate model with default parameters. Let's find and display the set of attributes candidates. Note that we start now to store the most relevant feature sets into a Python set
object (advantage compared to list
, tuple
, or dictionary
built-in data types: sets do not allow duplicate values), namely feature_sets
, for future usage.
#initiate feature_sets
feature_sets = set()
#add the best attribute
feature_sets.add(tuple(["horsepower"]))
#add the 2nd best attribute
feature_sets.add(tuple(["engine_size"]))
#add the full feature set
feature_sets.add(tuple(full_feature_set))
#add a feature set containing only continuous values
feature_sets.add(tuple(continuous_features))
#store best features for k-neighbors values between 3 and 6
for idx, i in enumerate(multivariate_model_df.idxmin(axis=0)[2:6]):
features = list(univariate_model_df.mean(axis=1).sort_values().head(i).index)
feature_sets.add(tuple(features))
print(f'Best feature set for n_neighbors = {idx+3}: {features}')
print()
Best feature set for n_neighbors = 3: ['engine_size', 'horsepower'] Best feature set for n_neighbors = 4: ['engine_size', 'horsepower', 'curb_weight'] Best feature set for n_neighbors = 5: ['engine_size', 'horsepower', 'curb_weight', 'highway_mpg'] Best feature set for n_neighbors = 6: ['engine_size', 'horsepower', 'curb_weight', 'highway_mpg', 'length', 'city_mpg', 'width', 'compression_ratio', 'bore', 'wheel_base', 'num_cylinders_four', 'fuel_system_2bbl', 'peak_rpm', 'drive_wheels_rwd', 'make_mercedes-benz']
Interesting to notice, the fourth set (15 attributes) contains binary features - four cylinders, 2bbl fuel system, rear-wheel drive and Mercedes-Benz. Binary features are finally showing some predictive power.
Now, let's take an ultimate look at some of the multivariate models we've just built.
plt.figure(figsize=(10,5))
for idx in range(0,10):
plt.plot(range(1,20),multivariate_model_df.iloc[idx])
plt.title("Multivariate models, tuning $k$")
plt.xticks(range(1,20))
plt.xlabel("k-neighbors")
plt.ylabel("RMSE")
Text(0, 0.5, 'RMSE')
In conclusion, hierarchical feature selection and train/test validation are a bit deceptive, leading to poor results compared to the best univariate model with default parameters. Even after tuning the $k$ parameter, the small improvements noted seem to be negligible and let us a bit septic. We need to improve the validation strategy.
As mentioned by the scikit-learn documentation, the cross-validation technique splits the training set into $k$ smaller sets and $k$-folds follow the same procedure:
a model is trained using $k-1$ of the folds as training data;
the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
from sklearn.model_selection import KFold, cross_val_score, cross_validate
def cross_validation(col_names, target, df, k=5, n_splits=5):
"""
Trains a KNN model following the cross validation procedure and returns the RMSE
col_name (string): the training features
target (string): the column target
df (DataFrame): the data
k (int, optional): the n_neighbors parameter of KNN model we want to optimize
n_splits (int, optional, min = 2): the number of folds
output (float) : the average RMSE over n_splits
"""
#instantiate KNeighborsRegressor class with default parameters
knn = KNeighborsRegressor(n_neighbors = k)
#initiate KFold, shuffle and set a seed for reproductibility
kf = KFold(n_splits=n_splits,shuffle=True,random_state=123)
# Pass everything into the fit method.
mses = cross_val_score(knn,df[col_names],df[target],
scoring ="neg_mean_squared_error", cv = kf)
rmses = [np.sqrt(np.abs(mse)) for mse in mses]
avg_rmse = np.mean(rmses)
return avg_rmse
#applying cross validation on each stored feature set and print the results
for feature_set in feature_sets:
col_names=list(feature_set)
score = cross_validation(col_names, target, standardized_cars, k=5, n_splits=5)
print(f'Cross validation with {feature_set}: {score:.0f}')
print()
Cross validation with ('wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_size', 'bore', 'stroke', 'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg'): 3535 Cross validation with ('symboling_B', 'symboling_C', 'symboling_D', 'symboling_E', 'symboling_F', 'symboling_G', 'make_alfa-romero', 'make_audi', 'make_bmw', 'make_chevrolet', 'make_dodge', 'make_honda', 'make_isuzu', 'make_jaguar', 'make_mazda', 'make_mercedes-benz', 'make_mercury', 'make_mitsubishi', 'make_nissan', 'make_peugot', 'make_plymouth', 'make_porsche', 'make_renault', 'make_saab', 'make_subaru', 'make_toyota', 'make_volkswagen', 'make_volvo', 'fuel_type_diesel', 'fuel_type_gas', 'aspiration_std', 'aspiration_turbo', 'doors_four', 'doors_two', 'doors_unknown', 'body_style_convertible', 'body_style_hardtop', 'body_style_hatchback', 'body_style_sedan', 'body_style_wagon', 'drive_wheels_4wd', 'drive_wheels_fwd', 'drive_wheels_rwd', 'engine_location_front', 'engine_location_rear', 'engine_type_dohc', 'engine_type_l', 'engine_type_ohc', 'engine_type_ohcf', 'engine_type_ohcv', 'engine_type_rotor', 'num_cylinders_eight', 'num_cylinders_five', 'num_cylinders_four', 'num_cylinders_six', 'num_cylinders_three', 'num_cylinders_twelve', 'num_cylinders_two', 'fuel_system_1bbl', 'fuel_system_2bbl', 'fuel_system_4bbl', 'fuel_system_idi', 'fuel_system_mfi', 'fuel_system_mpfi', 'fuel_system_spdi', 'fuel_system_spfi', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_size', 'bore', 'stroke', 'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg'): 4029 Cross validation with ('engine_size', 'horsepower'): 2977 Cross validation with ('engine_size',): 3304 Cross validation with ('engine_size', 'horsepower', 'curb_weight'): 3045 Cross validation with ('horsepower',): 3896 Cross validation with ('engine_size', 'horsepower', 'curb_weight', 'highway_mpg', 'length', 'city_mpg', 'width', 'compression_ratio', 'bore', 'wheel_base', 'num_cylinders_four', 'fuel_system_2bbl', 'peak_rpm', 'drive_wheels_rwd', 'make_mercedes-benz'): 3295 Cross validation with ('engine_size', 'horsepower', 'curb_weight', 'highway_mpg'): 3022
The first results show pretty interesting improvements in model performance. The worst value is now under 3900!
Like in the previous section, by selecting the best features from univariate_model_df
sorted values, 77 feature sets of increasing size go to the oven and the scores after tuning the number of folds are stored into a new DataFrame named cross_validation_df
.
After analysis, the best sets of attributes will be added to feature_sets
. The score obtained previously with engine size and horsepower features is now temporarily our new benchmark (RMSE = 2977).
#cross validation with top-𝑋 feature selection from univariate_model_df sorted values
cross_validation_df = pd.DataFrame(index=range(2,total_features))
for k in range(2,20):
rmse=[]
col_name="k"+str(k)
for i in range(2,total_features):
col_names = univariate_model_df.mean(axis=1).sort_values().head(i).index
rmse.append(cross_validation(col_names, target, standardized_cars, n_splits=k))
cross_validation_df[col_name]=rmse
labels = ["2 features", "3 features", "4 features", "5 features", "6 features"]
plt.figure(figsize=(10,5))
for idx in range(0,5):
plt.plot(range(2,20),cross_validation_df.iloc[idx],label=labels[idx])
plt.hlines(2977, 1, 19, linestyle='dashed',linewidth=2)
plt.text(14, 3010, "2 features, folds = 5")
plt.xticks(range(2,20))
plt.title("Multivariate models, cross-validation")
plt.xlabel("Number of folds")
plt.ylabel("Average RMSE")
plt.legend(loc="upper right")
<matplotlib.legend.Legend at 0xd528148>
The figure shows that the model is now starting to be much more predictive. We seat now around the 3000 levels. Since all the curves are showing the same decreasing convexity, one may be tempted to use a large number of folds since larger values visibly converge toward lower error scores. Once again, it's about finding the right balance between bias and variance. We will address this question infra.
#adding best sets to feature_sets
for idx, i in enumerate(cross_validation_df.idxmin(axis=0)):
features = list(univariate_model_df.mean(axis=1).sort_values().head(i).index)
feature_sets.add(tuple(features))
print(f'Best feature set for {idx+2} folds: {features}')
print()
print('------------------------------------------')
print(f'Actual number of stored feature sets: {len(feature_sets)}')
Best feature set for 2 folds: ['engine_size', 'horsepower', 'curb_weight', 'highway_mpg', 'length', 'city_mpg', 'width', 'compression_ratio', 'bore', 'wheel_base', 'num_cylinders_four', 'fuel_system_2bbl', 'peak_rpm'] Best feature set for 3 folds: ['engine_size', 'horsepower', 'curb_weight'] Best feature set for 4 folds: ['engine_size', 'horsepower'] Best feature set for 5 folds: ['engine_size', 'horsepower'] Best feature set for 6 folds: ['engine_size', 'horsepower'] Best feature set for 7 folds: ['engine_size', 'horsepower', 'curb_weight', 'highway_mpg'] Best feature set for 8 folds: ['engine_size', 'horsepower'] Best feature set for 9 folds: ['engine_size', 'horsepower'] Best feature set for 10 folds: ['engine_size', 'horsepower'] Best feature set for 11 folds: ['engine_size', 'horsepower', 'curb_weight', 'highway_mpg'] Best feature set for 12 folds: ['engine_size', 'horsepower'] Best feature set for 13 folds: ['engine_size', 'horsepower'] Best feature set for 14 folds: ['engine_size', 'horsepower', 'curb_weight', 'highway_mpg'] Best feature set for 15 folds: ['engine_size', 'horsepower', 'curb_weight', 'highway_mpg'] Best feature set for 16 folds: ['engine_size', 'horsepower'] Best feature set for 17 folds: ['engine_size', 'horsepower', 'curb_weight', 'highway_mpg'] Best feature set for 18 folds: ['engine_size', 'horsepower', 'curb_weight', 'highway_mpg'] Best feature set for 19 folds: ['engine_size', 'horsepower'] ------------------------------------------ Actual number of stored feature sets: 9
We introduce two additional algorithms selecting features iteratively.
forward_selection
: starts with an empty set and adds progressively new features following the RMSE criterion.backward_elimination
: starts with a full feature set and removes at each iteration the least useful feature, whose absence improves the performance following the same RMSE criterion.def forward_selection(features, target, df, min_rmse):
"""
Trains a KNN model following the train/test validation procedure selecting features iteratively
Starts with empty set and adds progressively new features following the RMSE criterion.
features (list of string): the training features
target (string): the column target
df (DataFrame): the data
min_rmse (float): the minimum RSME value we want to minimize
outputs :
RMSE (float)
featurees (list of string): the best feature set found by the algorithm
"""
all_features = df.drop(target,axis=1, inplace = False).columns
l=len(all_features)-len(features)
updated_features = [feature for feature in all_features if feature not in features]
selected_features = []
for i in range(0,l):
features.append(updated_features[i])
avg_rmse = knn_train_test(features, target, df)
if avg_rmse < min_rmse:
min_rmse = avg_rmse
selected_features.append(updated_features[i])
features.remove(updated_features[i])
else:
features.remove(updated_features[i])
if len(selected_features)>0:
features.append(selected_features[-1])
forward_selection(features, target, df, min_rmse)
return(knn_train_test(features, target, df), features)
def backward_elimination(features, target, df, min_rmse):
"""
Trains a KNN model following the train/test validation procedure selecting features iteratively
Starts with full feature set and removes at each iteration the least useful feature
following the RMSE criterion.
features (list of string): the training features
target (string): the column target
df (DataFrame): the data
min_rmse (float): the minimum RSME value we want to minimize
outputs :
RMSE (float)
featurees (list of string): the best feature set found by the algorithm
"""
l=len(features)
eliminated_features = []
for i in range(0,l):
f = [feature for feature in features if feature!=features[i]]
avg_rmse = knn_train_test(f, target, df)
if avg_rmse < min_rmse:
min_rmse = avg_rmse
eliminated_features.append(features[i])
if len(eliminated_features)>0:
features.remove(eliminated_features[-1])
backward_elimination(features, target, df, min_rmse)
return(knn_train_test(features, target, df), features)
#train/test validation with default parameters using forward feature selection
#store the features set obtained
min_rmse = knn_train_test(standardized_cars.columns[:-1].tolist(), "price", standardized_cars)
fs_best_rmse, fs_f = forward_selection(features=[],
target="price",
df=standardized_cars,
min_rmse=min_rmse)
print(f'Forward Selection: {fs_f}')
print()
print(f'RMSE: {fs_best_rmse}')
#store the feature set
feature_sets.add(tuple(fs_f))
Forward Selection: ['horsepower', 'highway_mpg', 'make_mercedes-benz', 'drive_wheels_rwd', 'make_peugot', 'make_volkswagen', 'make_audi', 'make_porsche', 'city_mpg', 'make_alfa-romero', 'engine_type_rotor', 'fuel_system_2bbl', 'make_saab', 'make_isuzu', 'make_chevrolet'] RMSE: 2815.095801475529
#train/test validation with default parameters using backward elimination
#store the feature set obtained
features = standardized_cars.columns[:-1].tolist()
min_rmse = knn_train_test(features, "price", standardized_cars)
#min_rmse = standardized_cars["price"].max()
#print(min_rmse)
be_best_rmse, be_f = backward_elimination(features=features,
target="price",
df=standardized_cars,
min_rmse=min_rmse)
print(f'Backward Elimination: {be_f}')
print()
print(f'RMSE: {be_best_rmse}')
#store the feature set
feature_sets.add(tuple(be_f))
Backward Elimination: ['symboling_B', 'symboling_C', 'symboling_F', 'make_alfa-romero', 'make_audi', 'make_honda', 'make_isuzu', 'make_mazda', 'make_mercury', 'make_nissan', 'make_renault', 'make_saab', 'make_toyota', 'make_volkswagen', 'doors_two', 'body_style_convertible', 'body_style_hatchback', 'body_style_sedan', 'drive_wheels_4wd', 'drive_wheels_fwd', 'drive_wheels_rwd', 'engine_location_front', 'engine_location_rear', 'engine_type_dohc', 'engine_type_l', 'engine_type_rotor', 'num_cylinders_five', 'num_cylinders_four', 'num_cylinders_three', 'num_cylinders_two', 'fuel_system_1bbl', 'fuel_system_4bbl', 'fuel_system_mpfi', 'fuel_system_spdi', 'fuel_system_spfi', 'length', 'height', 'curb_weight', 'engine_size', 'bore', 'stroke', 'horsepower', 'highway_mpg'] RMSE: 2348.328023370358
Composed of 43 attributes, the feature set obtained by backward elimination is doing the best RMSE score we made until now. This is very promising, showing that feature selection is a very important step in machine learning.
Let's now build another DataFrame tuning the number of folds into cross_validation
function and using each set of attributes stored in feature_sets
as parameter.
Then, we will use this DataFrame to find the optimal number of folds for each series of RMSEs values and plot the best result in MAPE/RMSE terms after writing a new cross-validation function.
#construct a dataframe with the scores of the stored feature sets
final_df = pd.DataFrame()
#keep into memory the features names
features_list = list(feature_sets)
for k in range(2,20):
rmse=[]
col_name="k"+str(k)
for feature_set in features_list:
features = list(feature_set)
score = cross_validation(features, target, standardized_cars, k=5, n_splits=k)
rmse.append(score)
final_df[col_name]=rmse
plt.figure(figsize=(10,5))
for idx in range(0,len(feature_sets)):
plt.plot(range(2,20),final_df.iloc[idx,:])
plt.title("Selected feature sets, cross-validation")
plt.xlabel("Number of folds")
plt.xticks(range(2,20))
plt.ylabel("Average RMSE")
#plt.legend(loc="upper right")
Text(0, 0.5, 'Average RMSE')
The chart above shows clearly there is a substantial gap between two series and the others ones regardless the cross-validation strategy. The full feature set is one of them. Enough to say that feature selection is not optional here!
Note that the second worst series is the best univariate model (horsepower), showing now its intrinsic limits.
#display the 2 worst scores series
for idx in final_df.idxmax().unique():
feature_set = features_list[idx]
print(f'Number of features: {len(feature_set)}')
print(feature_set)
print()
Number of features: 79 ('symboling_B', 'symboling_C', 'symboling_D', 'symboling_E', 'symboling_F', 'symboling_G', 'make_alfa-romero', 'make_audi', 'make_bmw', 'make_chevrolet', 'make_dodge', 'make_honda', 'make_isuzu', 'make_jaguar', 'make_mazda', 'make_mercedes-benz', 'make_mercury', 'make_mitsubishi', 'make_nissan', 'make_peugot', 'make_plymouth', 'make_porsche', 'make_renault', 'make_saab', 'make_subaru', 'make_toyota', 'make_volkswagen', 'make_volvo', 'fuel_type_diesel', 'fuel_type_gas', 'aspiration_std', 'aspiration_turbo', 'doors_four', 'doors_two', 'doors_unknown', 'body_style_convertible', 'body_style_hardtop', 'body_style_hatchback', 'body_style_sedan', 'body_style_wagon', 'drive_wheels_4wd', 'drive_wheels_fwd', 'drive_wheels_rwd', 'engine_location_front', 'engine_location_rear', 'engine_type_dohc', 'engine_type_l', 'engine_type_ohc', 'engine_type_ohcf', 'engine_type_ohcv', 'engine_type_rotor', 'num_cylinders_eight', 'num_cylinders_five', 'num_cylinders_four', 'num_cylinders_six', 'num_cylinders_three', 'num_cylinders_twelve', 'num_cylinders_two', 'fuel_system_1bbl', 'fuel_system_2bbl', 'fuel_system_4bbl', 'fuel_system_idi', 'fuel_system_mfi', 'fuel_system_mpfi', 'fuel_system_spdi', 'fuel_system_spfi', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_size', 'bore', 'stroke', 'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg') Number of features: 1 ('horsepower',)
It's possible to find the point of maximum curvature (elbow point) in a system with the Kneedle algorithm. A Python package, kneed, has been created by Kevin Arvai. For more information about parameters and how it works, see this Kaggle notebook. Applied to the RMSE curves, the elbow point will be equal to the appropriate number of folds after which information gain becomes negligible. Indeed, a large(r) number of folds will increase the variance of the estimate, and that's not desirable. As pointed by Sebastian Raschka:
The reason for the increasing variance of the estimate is that the overlap between training sets increases with an increasing size of [folds].
Example of elbow point detection with the first feature set stored in final_df
:
from kneed import KneeLocator
kl = KneeLocator(
range(2,20),
final_df.iloc[0,:],
curve="convex",
direction="decreasing",
S=0.0,
interp_method="polynomial"
)
kl.plot_knee()
print(f'Feature set: {features_list[0]}')
print()
print(f'Elbow point: {kl.elbow}')
Feature set: ('symboling_B', 'symboling_C', 'symboling_F', 'make_alfa-romero', 'make_audi', 'make_honda', 'make_isuzu', 'make_mazda', 'make_mercury', 'make_nissan', 'make_renault', 'make_saab', 'make_toyota', 'make_volkswagen', 'doors_two', 'body_style_convertible', 'body_style_hatchback', 'body_style_sedan', 'drive_wheels_4wd', 'drive_wheels_fwd', 'drive_wheels_rwd', 'engine_location_front', 'engine_location_rear', 'engine_type_dohc', 'engine_type_l', 'engine_type_rotor', 'num_cylinders_five', 'num_cylinders_four', 'num_cylinders_three', 'num_cylinders_two', 'fuel_system_1bbl', 'fuel_system_4bbl', 'fuel_system_mpfi', 'fuel_system_spdi', 'fuel_system_spfi', 'length', 'height', 'curb_weight', 'engine_size', 'bore', 'stroke', 'horsepower', 'highway_mpg') Elbow point: 5
Here, the first significant elbow point, or the first point of maximum curvature from a polynomial interpolation point of view, is marked by the dashed line. The error is still decreasing after the elbow point, but at the cost of an increasing variance that can lead to worst performances when the model will be applied to a new dataset.
Now, we will write a new cross-validation function, cross_validation_predictions
, that uses cross_val_predict
scikit-learn function instead of cross_val_score
. We will use the predictions output to calculate the MAPE score in conjunction with the RMSE. Remember that we want in the last instance to compare our results with Kibler et al. (1989).
To end this study, the last charts below show the best feature sets found minimizing respectively the MAPE criterion and the RMSE criterion.
from sklearn.model_selection import cross_val_predict
def cross_validation_predictions(col_names, target, df, k = 5, n_splits = 5):
"""
Trains a KNN model following the cross validation procedure
and returns the predictions & testing set
col_names (string): the training features
target (string): the column target
df (DataFrame): the data
k (int): the n_neighbors parameter of KNN model we want to optimize
n_splits (int, min = 2): the number of folds
output:
predictions (array): model predictions
"""
knn = KNeighborsRegressor(n_neighbors = k)
#initiate KFold, shuffle and set a seed for reproductibility
kf = KFold(n_splits=n_splits,shuffle=True,random_state=123)
predictions = cross_val_predict(knn,df[col_names],df[target],cv = kf)
return predictions
def plot_pred(target, predictions, title=""):
"""
Scatter plot of model predictions (y-axis) vs target (x-axis)
target (1d array): the true values of the attribute we want to predict
predictions (1d array): the predictions made by the model
title (string, optional): the title of the chart
"""
fig, ax = plt.subplots(figsize=(8, 8))
mape = mean_absolute_percentage_error(target, predictions)
rmse = mean_squared_error(target, predictions) ** (1/2)
string_mape = f'MAPE on dataset: {mape:.2%}'
string_rmse = f'RMSE on dataset: {rmse:.0f}'
ax.scatter(target, predictions)
ax.plot([0, 1], [0, 1], transform=ax.transAxes,ls="--", c="red")
plt.text(30000, 10100, string_mape)
plt.text(30000, 11500, string_rmse)
plt.title(title)
plt.ylabel('Model predictions')
plt.xlabel('Truths')
plt.xlim(5000,46000)
plt.ylim(5000,46000)
plt.show()
def elbow(x, y, curve="convex", direction="decreasing", s = 0.0, interp_method='polynomial'):
"""
Calculate the elbow point of a curve.
Optional parameters: see https://www.kaggle.com/kevinarvai/knee-elbow-point-detection
x (1d array): the x-coordinates of the curve we want to calculate the elbow point
y (1d array): the y-coordinates of the curve we want to calculate the elbow point
output: the elbow point (int)
"""
kl = KneeLocator(
x,
y,
curve=curve,
direction=direction,
S=s,
interp_method=interp_method
)
return kl.elbow
elbows=[]
mape = np.zeros(shape=(len(feature_sets),))
rmse = np.zeros(shape=(len(feature_sets),))
predictions_df = pd.DataFrame()
for idx in range(0,len(feature_sets)):
elbows.append(elbow(range(2,20),final_df.iloc[idx,:]))
for idx in range(0,len(feature_sets)):
col_name="feature_sets"+str(idx)
pred = cross_validation_predictions(list(features_list[idx]),
"price",
standardized_cars,
k = 5,
n_splits = elbows[idx])
mape[idx]+=mean_absolute_percentage_error(standardized_cars["price"], pred)
rmse[idx]+=mean_squared_error(standardized_cars["price"], pred) ** (1/2)
predictions_df[col_name] = pred
for i in range(0,2):
if i == 0:
idx_min = np.argmin(mape) #first chart = best MAPE
title = "Best model (MAPE criterion)"
else:
idx_min = np.argmin(rmse) #2nd chart = best RMSE
title = "Best model (RMSE criterion)"
print(f'Folds: {elbows[idx_min]}')
print(f'Feature set: {list(features_list[idx_min])}')
plot_pred(standardized_cars["price"], predictions_df.iloc[:,idx_min], title)
if i == 0:
print('------------------------------------------------------------------')
print()
Folds: 6 Feature set: ['engine_size', 'horsepower', 'curb_weight', 'highway_mpg']
------------------------------------------------------------------ Folds: 5 Feature set: ['symboling_B', 'symboling_C', 'symboling_F', 'make_alfa-romero', 'make_audi', 'make_honda', 'make_isuzu', 'make_mazda', 'make_mercury', 'make_nissan', 'make_renault', 'make_saab', 'make_toyota', 'make_volkswagen', 'doors_two', 'body_style_convertible', 'body_style_hatchback', 'body_style_sedan', 'drive_wheels_4wd', 'drive_wheels_fwd', 'drive_wheels_rwd', 'engine_location_front', 'engine_location_rear', 'engine_type_dohc', 'engine_type_l', 'engine_type_rotor', 'num_cylinders_five', 'num_cylinders_four', 'num_cylinders_three', 'num_cylinders_two', 'fuel_system_1bbl', 'fuel_system_4bbl', 'fuel_system_mpfi', 'fuel_system_spdi', 'fuel_system_spfi', 'length', 'height', 'curb_weight', 'engine_size', 'bore', 'stroke', 'horsepower', 'highway_mpg']
After a long process, we obtained finally an error score very closed to Kibler et al. (1989) best past result (IBL algorithm, 11,84%).
It's interesting to note that when minimizing the MAPE score, the best result is obtained with continuous features only (engine size, horsepower, curb weight, highway mpg), found using univariate feature selection.
But when minimizing the RMSE, the binary features are playing a more important role (in particular, the make of the car), and this time this is the backward elimination technique that worked better for us.
An interesting fact we have discovered during the process: MAPE and RMSE scores don't have clear linear relationship. In other words, scoring parameter is a non trivial task, and the choice of the scoring function we want to optimize need to be thought carefully since the very start of the project.
Also, since the target (car prices) is skewed to the right, the biggest prices are "high variance", being the most challenging values to predict. It would be interesting, for further investigations, to model the logarithm of the price rather than the price in monetary unit.
Some changes could improve the model. Between several ideas:
from IPython.core.display import HTML
def css_styling():
styles = open("css/custom.css", "r").read() #or edit path to custom.css
return HTML(styles)
css_styling()