Price prediction with the k-nearest neighbors algorithm

In this study, we'll cover how to:

  • load the data
  • impute missing values
  • encode categorical features
  • standardize numerical values
  • train a k-nearest neighbors model and plot the predictions
  • tune the hyperparameters of an estimator
  • evaluate a model using cross-validation and error minimization
  • improve the model with custom feature selection
  • find the elbow point of a curve

Loading the data

The UCI website provides everything you need to get started:

  • Training features, consisting of three types of entities: a) the specification of an auto in terms of various characteristics; b) its assigned insurance risk rating; c) its normalized losses in use as compared to other cars.
  • Target feature, i.e., the variable we want to predict: the price of the car, ranging from 5118 to 45400.
In [1]:
from pathlib import Path

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

plt.style.use('fivethirtyeight')
In [2]:
#dataquest solution
#https://github.com/dataquestio/solutions/blob/master/Mission155Solutions.ipynb
#about the dataset
#https://archive.ics.uci.edu/ml/datasets/automobile
#https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.names
#download the dataset
#https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
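#note (illustrative sketch, assuming network access): the raw file could also be read
#directly from the UCI URL above, once `attributes` (defined in the next cell) is available:
#pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
#            header=None, names=attributes)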
In [3]:
attributes=["symboling",
            "normalized_losses",
            "make",
            "fuel_type",
            "aspiration",
            "doors",
            "body_style",
            "drive_wheels",
            "engine_location",
            "wheel_base",
            "length",
            "width",
            "height",
            "curb_weight",
            "engine_type",
            "num_cylinders",
            "engine_size",
            "fuel_system",
            "bore",
            "stroke",
            "compression_ratio",
            "horsepower",
            "peak_rpm",
            "city_mpg",
            "highway_mpg",
            "price"]
In [4]:
cars = pd.read_csv('data/imports-85.data',
                   header=None,
                   names=attributes)

Let's take a quick look at the raw data.

In [5]:
def highlight_missing_value(val):
    """    
    Return the CSS property 'background-color: orange' for a "?" value, 'background-color: white' otherwise.
    """
    color = 'orange' if val=="?" else 'white'
    return 'background-color: %s' % color
In [6]:
cars.head().style.applymap(highlight_missing_value)
Out[6]:
symboling normalized_losses make fuel_type aspiration doors body_style drive_wheels engine_location wheel_base length width height curb_weight engine_type num_cylinders engine_size fuel_system bore stroke compression_ratio horsepower peak_rpm city_mpg highway_mpg price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9 111 5000 21 27 16500
2 1 ? alfa-romero gas std two hatchback rwd front 94.5 171.2 65.5 52.4 2823 ohcv six 152 mpfi 2.68 3.47 9 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 176.6 66.2 54.3 2337 ohc four 109 mpfi 3.19 3.40 10 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 176.6 66.4 54.3 2824 ohc five 136 mpfi 3.19 3.40 8 115 5500 18 22 17450

The dataset contains categorical and numerical variables. We will need to take this into account when preprocessing the dataset. Since we plan to fit a regression model, it's imperative to work with numerical values only. The dataset also has missing values that we need to fix.

Imputation of missing values

According to the Data Set Description, the missing values are denoted by a question mark ("?") in the following attributes:

  • normalized losses (41 missing values)
  • doors (2)
  • bore (4)
  • stroke (4)
  • horsepower (2)
  • peak-rpm (2)
  • price (4)

Unfortunately, this causes these attributes to be cast to the pandas object data type, even when they contain continuous values.

So we will do the following:

  1. Remove the normalized_losses column, since about 20% of its values are missing.
  2. Replace the missing-value markers in the continuous columns with np.nan.
  3. Change the data type to float.
  4. Remove the 4 observations with an unknown price, since this attribute is our target.
  5. Replace the remaining missing values with the mean of the attribute.
In [7]:
def remove_attribute(attribute, df):
    """
    Remove column from a dataframe
    
    attribute (string): the column name to remove
    df (DataFrame): the data
    
    """
    
    df.drop(attribute,axis=1,inplace=True)
    

def clean_continuous_attributes(mv_str, cols_index, df):
    """
    Replace missing values with np.nan and change object columns to the float data type
    
    mv_str (string): the string denoting the missing value
    cols_index (list of integers) : the columns index we want to clean
    df (DataFrame) : the data  
        
    """
   
    for idx, col in enumerate(df.columns):
        if idx in cols_index:
            df[col]=df[col].replace(mv_str, np.nan)
            df[col]=df[col].astype(np.float64)
    return df

def remove_rows(target, df):
    """
    Remove rows with np.nan values from a DataFrame and reset the index
    
    target (array-like of strings): the column(s) in which we want to drop rows containing np.nan values
    df (DataFrame) : the data  
        
    """
    df.dropna(subset=target,inplace=True)
    df.reset_index(inplace=True)
    df.drop("index",axis=1,inplace=True)
    

def impute_missing_values(cols_index, df):
    """
    Replace np.nan values by the mean of the attribute    
    
    cols_index (list of integers) : the columns index we want to replace np.nan values by the mean
    df (DataFrame) : the data  
        
    """
    for idx, col in enumerate(df.columns):
        if idx in cols_index:        
            m = df[col].mean()
            df[col] = df[col].fillna(m)    
    return df
In [8]:
remove_attribute("normalized_losses", cars)
missing_values_col_index=[17,18,20,21,24]
cars = clean_continuous_attributes("?", missing_values_col_index, cars)
remove_rows(["price"], cars)
cars = impute_missing_values(missing_values_col_index, cars)
cars.head()
Out[8]:
symboling make fuel_type aspiration doors body_style drive_wheels engine_location wheel_base length ... engine_size fuel_system bore stroke compression_ratio horsepower peak_rpm city_mpg highway_mpg price
0 3 alfa-romero gas std two convertible rwd front 88.6 168.8 ... 130 mpfi 3.47 2.68 9.0 111.0 5000.0 21 27 13495.0
1 3 alfa-romero gas std two convertible rwd front 88.6 168.8 ... 130 mpfi 3.47 2.68 9.0 111.0 5000.0 21 27 16500.0
2 1 alfa-romero gas std two hatchback rwd front 94.5 171.2 ... 152 mpfi 2.68 3.47 9.0 154.0 5000.0 19 26 16500.0
3 2 audi gas std four sedan fwd front 99.8 176.6 ... 109 mpfi 3.19 3.40 10.0 102.0 5500.0 24 30 13950.0
4 2 audi gas std four sedan 4wd front 99.4 176.6 ... 136 mpfi 3.19 3.40 8.0 115.0 5500.0 18 22 17450.0

5 rows × 25 columns

In [9]:
#safety check
assert cars.isnull().sum().sum()==0
In [10]:
cars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 201 entries, 0 to 200
Data columns (total 25 columns):
symboling            201 non-null int64
make                 201 non-null object
fuel_type            201 non-null object
aspiration           201 non-null object
doors                201 non-null object
body_style           201 non-null object
drive_wheels         201 non-null object
engine_location      201 non-null object
wheel_base           201 non-null float64
length               201 non-null float64
width                201 non-null float64
height               201 non-null float64
curb_weight          201 non-null int64
engine_type          201 non-null object
num_cylinders        201 non-null object
engine_size          201 non-null int64
fuel_system          201 non-null object
bore                 201 non-null float64
stroke               201 non-null float64
compression_ratio    201 non-null float64
horsepower           201 non-null float64
peak_rpm             201 non-null float64
city_mpg             201 non-null int64
highway_mpg          201 non-null int64
price                201 non-null float64
dtypes: float64(10), int64(5), object(10)
memory usage: 39.4+ KB

Feature construction with one-hot encoding

We also need to replace the two missing values in the doors column with "unknown", and then one-hot encode (i.e., generate one column per category) the categorical columns, including the symboling attribute. Categorical variables cannot be included in the model unless they are encoded as numbers first, and we want to avoid categorical features being treated as ordered values. Since the symboling feature takes ordered integer values, we will one-hot encode this numerical attribute as well.
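
As a side note, pandas offers a shortcut for this step; a minimal sketch for comparison only (we stick with scikit-learn's OneHotEncoder below; also note that at this stage the doors column still contains the "?" marker):

#sketch only: one-hot encode all object columns in a single call
pd.get_dummies(cars.select_dtypes('object')).head()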

In [11]:
from sklearn.preprocessing import OneHotEncoder
In [12]:
#replace the two missing values for the doors column by "unknown"
cars["doors"]=cars["doors"].replace("?", "unknown")
In [13]:
#replace symboling attribute ordered values by letters
symboling_dict = { 3:"G", 2:"F", 1:"E", 0:"D", -1:"C", -2:"B", -3:"A" }
cars["symboling"] = cars["symboling"].apply(lambda x: symboling_dict[x])
In [14]:
#select columns object
cars_object = cars.select_dtypes('object')
#one-hot encoding
enc=OneHotEncoder().fit(cars_object)
#pass to array
encoded = enc.transform(cars_object).toarray()
#create a dataframe with the binary features
one_hot_encoded_features = pd.DataFrame(index=cars.index)
#put binary features names in a readable fashion
binary_col_names = enc.get_feature_names(cars_object.columns)

for idx, feature in enumerate(binary_col_names):
    one_hot_encoded_features[feature]=encoded[:,idx]
one_hot_encoded_features.head()
Out[14]:
symboling_B symboling_C symboling_D symboling_E symboling_F symboling_G make_alfa-romero make_audi make_bmw make_chevrolet ... num_cylinders_twelve num_cylinders_two fuel_system_1bbl fuel_system_2bbl fuel_system_4bbl fuel_system_idi fuel_system_mfi fuel_system_mpfi fuel_system_spdi fuel_system_spfi
0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
3 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
4 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

5 rows × 66 columns

The mean value of a binary feature gives its prevalence, i.e., the proportion of cars that have that characteristic. The plot below shows that a dozen characteristics are common across cars, but most of them concern less than 20% of the dataset.
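
For instance, a binary column with values $[1, 0, 0, 1, 0]$ has a mean of $0.4$, meaning 40% of those cars share the characteristic.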

In [15]:
one_hot_encoded_features.apply(pd.value_counts)
sorted_one_hot_encoded_features = one_hot_encoded_features.iloc[:, 1:].mean().sort_values()
ax = sorted_one_hot_encoded_features.plot(kind='barh',
                                 stacked=True,
                                 figsize=(5, 12),
                                 title='Prevalence of Binary Features')
ax.set_xlabel('Proportion of cars');

Standardization of numerical values

As mentioned in the scikit-learn documentation:

If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

That's why we need to transform the data so that each attribute looks like standard normally distributed data (Gaussian with zero mean and unit variance).
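
Concretely, each feature $x$ is rescaled as $x' = \frac{x - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the sample mean and standard deviation of that feature; this is what the one-line pandas expression in cell In [17] computes element-wise, before the non-normalized price target is put back.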

In [16]:
#remove all object features since we constructed binary features from them
cars=cars.drop(cars.select_dtypes('object').columns,axis=1)
#join binary features and continuous features
all_features = one_hot_encoded_features.join(cars)
In [17]:
#normalize the numerical values
standardized_cars = (all_features - all_features.mean())/(all_features.std())
#reintroduce the non normalized target column
standardized_cars["price"] = all_features["price"]
#safety check if mean = 0 and variance = 1 for each column
assert np.isclose(standardized_cars.std(),1.0).sum()==len(standardized_cars.columns)-1
assert np.isclose(standardized_cars.mean(),0).sum()==len(standardized_cars.columns)-1

Let's get some insights by looking at the pairwise relationships between the variables. Only the main continuous variables will be used. The following plot shows their Pearson correlation coefficients.
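
For two variables $x$ and $y$, the Pearson coefficient is $r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\sqrt{\sum_i (y_i - \bar{y})^2}}$, ranging from $-1$ (perfect negative linear relationship) to $+1$ (perfect positive linear relationship); it is what dt.corr() computes pairwise in the next cell.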

In [18]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(10,10))
continuous_idx = len(one_hot_encoded_features.columns)
ticks = range(len(standardized_cars.columns)-continuous_idx)
dt = standardized_cars.iloc[:,continuous_idx:]
plt.grid(False)
plt.imshow(dt.corr())
plt.xticks(ticks,dt.columns,rotation=90)
plt.yticks(ticks,dt.columns)
plt.colorbar()
plt.show()

We can see that some variables are strongly linearly correlated and form clusters (length, width, and wheel_base, or city_mpg and highway_mpg). There is a risk of redundancy here that may decrease the performance of the model, since correlated features are known to induce instabilities. Regarding the price target, the strongest correlation occurs with the engine_size attribute, so we can expect this feature to be a good predictor.

Build the model

Now that we have encoded the categorical variables, we are ready to perform a regression based on the k-nearest neighbors (KNN) algorithm, using the root mean squared error (RMSE) as the error metric during train/test validation. First, we will fit univariate models, useful to find out which features are the best predictors.
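
For reference, $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, where $y_i$ are the true prices and $\hat{y}_i$ the predictions; the knn_train_test function below computes it as the square root of scikit-learn's mean_squared_error.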

Univariate Model

In [19]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
In [20]:
def knn_train_test(col_names, target, df, k = 5):
    """
    Trains a KNN model and returns the RMSE
    
    col_names (string): the training features
    target (string): the column target
    df (DataFrame): the data
    k (int) : the number of neighbors
    
    output (float) : RMSE
    """
    #initiate a seed for reproducibility
    np.random.seed(123)
    #randomization
    p = np.random.permutation(len(df))
    df = df.loc[p]
    #using 50/50 split for training and testing data
    l = len(df)//2
    #train only col_name
    train_feature = df.iloc[0:l][col_names]
    test_feature = df.iloc[l:][col_names]
    #target
    train_target = df.iloc[0:l][target]
    test_target = df.iloc[l:][target]
    #instantiate KNeighborsRegressor class with default parameters
    knn = KNeighborsRegressor(n_neighbors = k)
    # Pass everything into the fit method.
    knn.fit(train_feature, train_target)
    #making predictions on the testing set using the training data 
    predictions = knn.predict(test_feature)
    #calculating mean squared error (MSE) and root mean squared error (RMSE)
    predictions_mse = mean_squared_error(test_target, predictions)
    predictions_rmse = predictions_mse ** (1/2)
    return predictions_rmse
In [21]:
target = "price"
rmse=[]
for col in standardized_cars.columns[:-1]:    
    rmse.append(knn_train_test([col], target, standardized_cars))
best_attributes = pd.Series(data=rmse, index=standardized_cars.columns[:-1]).sort_values()
best_attributes 
Out[21]:
horsepower          4367.420884
engine_size         4622.893168
curb_weight         4741.616768
highway_mpg         5077.530212
city_mpg            5746.501341
                       ...     
fuel_system_spdi    9783.758088
make_plymouth       9787.573337
make_isuzu          9807.865944
make_mitsubishi     9808.004146
make_chevrolet      9918.293043
Length: 79, dtype: float64

Plotting the predictions

Since car prices range from 5118 to 45400, the best error score of 4367 seems pretty bad. Let's create a new function, knn_predictions, that will return the predictions and the testing set. Then we will plot the results with another function, plot_pred.

We also create another error metric function returning the Mean Absolute Percentage Error (MAPE), which makes comparison with past results (Kibler et al., 1989) easier: 11.84% for their instance-based learning (IBL) algorithm derived from the KNN algorithm, and 14.12% for the resulting linear regression equation, according to the UCI website.
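
For reference, $\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$; the function below returns it as a fraction, which the plots then format as a percentage.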

In [22]:
def knn_predictions(col_names, target, df, k = 5):
    """
    Trains a KNN model, returns the predictions & testing set
    
    col_names (string): the training features
    target (string): the column target
    df (DataFrame): the data
    k (int) : the number of neighbors
    
    output: 
      test_target (Pandas series): the testing set
      predictions (array): model predictions
    
    """
    
    #initiate a seed for reproducibility
    np.random.seed(123)
    #randomization
    p = np.random.permutation(len(df))
    df = df.loc[p]
    #using 50/50 split for training and testing data
    l = len(df)//2
    #train only col_name
    train_feature = df.iloc[0:l][col_names]
    test_feature = df.iloc[l:][col_names]
    #target
    train_target = df.iloc[0:l][target]
    test_target = df.iloc[l:][target]
    #instantiate KNeighborsRegressor class with default parameters
    knn = KNeighborsRegressor(n_neighbors = k)
    # Pass everything into the fit method.
    knn.fit(train_feature, train_target)
    #making predictions on the testing set using the training data 
    predictions = knn.predict(test_feature)    
    return test_target, predictions

def mean_absolute_percentage_error(target, predictions):
    """
    Calculate MAPE error metric
    
    target (1d array): the true values of the attribute we want to predict
    predictions (1d array): the predictions made by the model
   
    output (float): MAPE score
    """    
    return np.mean(np.abs((target - predictions) / target))

def plot_pred(target, predictions, title=""):
    """
    Scatter plot of model predictions (y-axis) vs target (x-axis)
    
    target (1d array): the true values of the attribute we want to predict
    predictions (1d array): the predictions made by the model
    title (string, optional): the title of the chart
    
    """
    
    fig, ax = plt.subplots(figsize=(8, 8))
    mape = mean_absolute_percentage_error(target, predictions)
    string_score = f'MAPE on testing set: {mape:.2%}'
    ax.scatter(target, predictions)
    ax.plot([0, 1], [0, 1], transform=ax.transAxes,ls="--", c="red")
    plt.text(10000, 27000, string_score)
    plt.title(title)
    plt.ylabel('Model predictions')
    plt.xlabel('Truths')
    plt.xlim(5000,46000)
    plt.ylim(5000,46000)
    plt.show()
In [23]:
tt, pred = knn_predictions(["horsepower"], "price", standardized_cars)
plot_pred(tt, pred, "Univariate model (horsepower)")

The figure confirms our first intuition: the best univariate model is not very predictive. Good predictions should lie on the red line.

Hyperparameter optimization

Since the KNN algorithm is based on finding the $k$ nearest neighbors of a data point, it may be possible to improve the model by tuning the $k$ parameter (namely the n_neighbors parameter). As pointed out in the scikit-learn documentation:

The optimal choice of the [k] value is highly data-dependent: in general a larger [k] suppresses the effects of noise, but makes the classification boundaries less distinct.

In other words, a smaller $k$ may lead to overfitting, while a larger value may come at the cost of predictive power (underfitting).

Shortcut:

  • small $k$ value = low bias + high variance
  • large $k$ value = high bias + low variance

So, this is the famous bias-variance tradeoff: how can we minimize bias without increasing variance, and vice versa?
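
As a side note (not used in the cells below, which keep the manual loop so we can inspect every attribute), scikit-learn's GridSearchCV can automate this kind of search over n_neighbors; a minimal sketch, which relies on cross-validation rather than our 50/50 split:

from sklearn.model_selection import GridSearchCV

#sketch only: exhaustive search over k for the horsepower univariate model,
#scored by (negative) MSE over 5 cross-validation folds
grid = GridSearchCV(KNeighborsRegressor(),
                    param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
                    scoring="neg_mean_squared_error",
                    cv=5)
grid.fit(standardized_cars[["horsepower"]], standardized_cars["price"])
print(grid.best_params_, (-grid.best_score_) ** 0.5)  #best k and the RMSE implied by the mean MSE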

In [24]:
#tuning k parameter and save the results into a DataFrame
univariate_model_df = pd.DataFrame(index=standardized_cars.columns[:-1])
for k in [1,3,5,7,9]:
    rmse=[]
    col_name="k"+str(k)
    for col in standardized_cars.columns[:-1]:       
        rmse.append(knn_train_test([col], target, standardized_cars, k))
    univariate_model_df[col_name]=rmse
In [25]:
#label the best RMSE scores by k-neighbors values
univariate_model_df.idxmin(axis=0)
Out[25]:
k1    engine_size
k3    engine_size
k5     horsepower
k7    engine_size
k9    engine_size
dtype: object

Hyperparameter optimization doesn't seem to upset the feature ranking dramatically, though horsepower's leadership is now challenged by the engine_size attribute. Let's take a look.

In [26]:
plt.figure(figsize=(10,5))
x=[1,3,5,7,9]
plt.plot(x,univariate_model_df.loc["horsepower"],label="horsepower")
plt.plot(x,univariate_model_df.loc["engine_size"],label="engine size")
plt.title("Univariate Model, tuning $k$")
plt.legend(loc='lower right')
plt.ylabel("RMSE")
plt.xlabel("k-neighbors")
plt.xticks(x)
plt.show()

Indeed, the univariate model based on the horsepower attribute didn't improve on its initial error score of 4367 (obtained with $k$ = 5, the default value). Remember that the engine_size attribute has the best Pearson score (it is highly correlated with the price variable). The figure confirms this fact, since this attribute notably improves the univariate model, reaching the 4200 level when $1\leq k \leq3$. But, as said before, low $k$ values present a non-negligible risk of overfitting, so it is preferable to be cautious with this result.

Let's now plot again the prediction and calculate the MAPE score.

In [27]:
tt, pred = knn_predictions(["engine_size"], "price", standardized_cars, k = 3)
plot_pred(tt, pred, "Univariate model (engine size, $k$ = 3)") 

Surprisingly, the MAPE score is worse than expected! It means the RMSE and MAPE scores do not necessarily move together, at least for the engine_size attribute. It's still unclear whether the engine_size attribute is more informative than the horsepower one. Finally, despite hyperparameter optimization, the best univariate models are still poorly predictive.

Multivariate Model

Since the univariate models give poor results, we will now build several multivariate models and work on feature selection. Possible approaches consist of testing all attributes together, or only the continuous ones. Additionally, by selecting the best attributes obtained during the hyperparameter optimization step (saved in the univariate_model_df DataFrame), we can build several other multivariate models (with $k$ = 5, the default value) and save the results again.

In [28]:
#useful variable
total_features = len(standardized_cars.columns)-1

#full feature set model
target = "price"
full_feature_set = standardized_cars.columns[:-1].tolist()
print(knn_train_test(full_feature_set, target, standardized_cars))
5748.238867240251
In [29]:
#only continuous features
target = "price"
continuous_features = standardized_cars.iloc[:,66:-1].columns.tolist()
print(knn_train_test(continuous_features, target, standardized_cars))
4935.325727330618

At first glance, the binary features are not adding value to the model, at least when used indiscriminately. Indeed, the results above show that the full feature set (i.e., no feature selection) has a larger error score than the model based on continuous features only.

In [30]:
#print the top 10 obtained from previous hyperparameter optimization step
sorted_univariates = univariate_model_df.mean(axis=1).sort_values(ascending=False)
ax = sorted_univariates[-10:].plot(kind='barh',
                                 stacked=True,
                                 figsize=(5, 5),
                                 title='Univariate model')
ax.set_xlabel('Average RMSE, tuning $k$');
ax.set_ylabel('Attribute');

We will now build new models based on the univariate_model_df sorted values, constructing the feature sets to test by selecting the top-$X$ best features, where $2\leq X \leq 78$ (the dataset counts 79 attributes in total).

In [31]:
#select the top-X index from univariate_model_df and build a model
rmse=[]
for i in range(2,total_features):
    col_names = univariate_model_df.mean(axis=1).sort_values().head(i).index
    rmse.append(knn_train_test(col_names, target, standardized_cars))
In [32]:
plt.figure(figsize=(10,5))
plt.plot(rmse)
plt.title("Multivariate model, $k$ = 5")
plt.xlabel("Number of attributes")
plt.ylabel("RMSE")
plt.show()

Using this hierarchical approach, we still don't see any significant improvement. The best univariate model (horsepower with default values) is still leading the competition in terms of RMSE (4367). As with the univariate models, we will now tune the $k$ parameter and save the results into a DataFrame.

In [33]:
#tuning k parameter and save the results into a DataFrame
multivariate_model_df = pd.DataFrame(index=range(2,total_features))
for k in range(1,20):
    rmse=[]
    col_name="k"+str(k)
    for i in range(2,total_features):
        col_names = univariate_model_df.mean(axis=1).sort_values().head(i).index  
        rmse.append(knn_train_test(col_names, target, standardized_cars, k))
    multivariate_model_df[col_name]=rmse
In [34]:
#print the top 10 obtained from previous hyperparameter optimization step
sorted_multivariates = multivariate_model_df.mean(axis=1).sort_values(ascending=False)
ax = sorted_multivariates[-10:].plot(kind='barh',
                                 stacked=True,
                                 figsize=(5, 5),
                                 title='Multivariate model, tuning $k$')
ax.set_xlabel('Average RMSE');
ax.set_ylabel('Number of attributes');
In [35]:
x = multivariate_model_df.idxmin(axis=1).sort_values().apply(lambda s: int(s[1:]))
y = multivariate_model_df.min(axis=1).sort_values()
plt.scatter(x,y)
plt.xlabel("k-neighbors")
plt.ylabel("Minimum RMSE")
plt.title("Multivariate model, best $k$ values")
plt.show()

The model with the top 15 attributes has the best average score over all $k$ values, and this time we managed to obtain better RMSEs overall, but it looks suspicious since most of the minimal values occur when $k=1$.

In [36]:
plt.figure(figsize=(10,5))
x=range(1,20)
line1 = int(best_attributes[0])
plt.plot(x,multivariate_model_df.min(axis=0))
plt.xticks(x)
plt.xlabel("k-neighbors")
plt.ylabel("Minimum RMSE")
plt.hlines(line1, 1, 19, linestyle='dashed',linewidth=2)
plt.text(14, 4400, "horsepower, $k$ = 5")
plt.title("Multivariate model, best scores, tuning $k$")
plt.show()

Setting aside the risk of overfitting at $k <3$, there may be a good window at $3\leq k \leq6$ where the best scores fall under the dashed line of the horsepower score, our best univariate model with default parameters. Let's find and display the candidate attribute sets. Note that from now on we store the most relevant feature sets in a Python set object (its advantage over the list, tuple, or dictionary built-in data types: sets do not allow duplicate values), namely feature_sets, for future use.

In [37]:
#initiate feature_sets
feature_sets = set()
#add the best attribute
feature_sets.add(tuple(["horsepower"]))
#add the 2nd best attribute
feature_sets.add(tuple(["engine_size"]))
#add the full feature set
feature_sets.add(tuple(full_feature_set))
#add a feature set containing only continuous values
feature_sets.add(tuple(continuous_features))

#store best features for k-neighbors values between 3 and 6
for idx, i in enumerate(multivariate_model_df.idxmin(axis=0)[2:6]):    
    features = list(univariate_model_df.mean(axis=1).sort_values().head(i).index)
    feature_sets.add(tuple(features))
    print(f'Best feature set for n_neighbors = {idx+3}: {features}')
    print()
Best feature set for n_neighbors = 3: ['engine_size', 'horsepower']

Best feature set for n_neighbors = 4: ['engine_size', 'horsepower', 'curb_weight']

Best feature set for n_neighbors = 5: ['engine_size', 'horsepower', 'curb_weight', 'highway_mpg']

Best feature set for n_neighbors = 6: ['engine_size', 'horsepower', 'curb_weight', 'highway_mpg', 'length', 'city_mpg', 'width', 'compression_ratio', 'bore', 'wheel_base', 'num_cylinders_four', 'fuel_system_2bbl', 'peak_rpm', 'drive_wheels_rwd', 'make_mercedes-benz']

Interestingly, the fourth set (15 attributes) contains binary features: four cylinders, 2bbl fuel system, rear-wheel drive, and Mercedes-Benz. Binary features are finally showing some predictive power.

Now, let's take a final look at some of the multivariate models we've just built.

In [38]:
plt.figure(figsize=(10,5))

for idx in range(0,10):   
    plt.plot(range(1,20),multivariate_model_df.iloc[idx])
plt.title("Multivariate models, tuning $k$")
plt.xticks(range(1,20))
plt.xlabel("k-neighbors")
plt.ylabel("RMSE")
Out[38]:
Text(0, 0.5, 'RMSE')

In conclusion, hierarchical feature selection combined with train/test validation is a bit disappointing, leading to poor results compared to the best univariate model with default parameters. Even after tuning the $k$ parameter, the small improvements noted seem negligible and leave us a bit skeptical. We need to improve the validation strategy.

Improving the model with K-fold cross-validation

As mentioned in the scikit-learn documentation, the cross-validation technique splits the training set into $k$ smaller sets ("folds"), and the following procedure is applied to each of the $k$ folds (a toy illustration of the fold mechanics follows the list):

  • a model is trained using $k-1$ of the folds as training data;

  • the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
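
As a quick illustration of the mechanics (a toy example on ten fake observations, not part of the analysis), KFold simply yields the train/validation index arrays for each fold:

from sklearn.model_selection import KFold
import numpy as np

#toy example: 10 observations split into 5 folds of 2 validation points each
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=123).split(np.arange(10)):
    print("train:", train_idx, "validation:", val_idx)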

In [39]:
from sklearn.model_selection import KFold, cross_val_score, cross_validate
In [40]:
def cross_validation(col_names, target, df, k=5, n_splits=5):
    """
    Trains a KNN model following the cross validation procedure and returns the RMSE
    
    col_name (string): the training features
    target (string): the column target
    df (DataFrame): the data
    k (int, optional): the n_neighbors parameter of KNN model we want to optimize
    n_splits (int, optional, min = 2): the number of folds
    
    output (float) : the average RMSE over n_splits
    """
    #instantiate KNeighborsRegressor class with default parameters
    knn = KNeighborsRegressor(n_neighbors = k)
    #initiate KFold, shuffle and set a seed for reproducibility
    kf = KFold(n_splits=n_splits,shuffle=True,random_state=123) 
    # Pass everything into the fit method.
    mses = cross_val_score(knn,df[col_names],df[target],
                      scoring ="neg_mean_squared_error", cv = kf)
    rmses = [np.sqrt(np.abs(mse)) for mse in mses]
    avg_rmse = np.mean(rmses)    
    return avg_rmse
In [41]:
#applying cross validation on each stored feature set and print the results
for feature_set in feature_sets:
    col_names=list(feature_set)
    score = cross_validation(col_names, target, standardized_cars, k=5, n_splits=5)
    print(f'Cross validation with {feature_set}: {score:.0f}')
    print()    
Cross validation with ('wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_size', 'bore', 'stroke', 'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg'): 3535

Cross validation with ('symboling_B', 'symboling_C', 'symboling_D', 'symboling_E', 'symboling_F', 'symboling_G', 'make_alfa-romero', 'make_audi', 'make_bmw', 'make_chevrolet', 'make_dodge', 'make_honda', 'make_isuzu', 'make_jaguar', 'make_mazda', 'make_mercedes-benz', 'make_mercury', 'make_mitsubishi', 'make_nissan', 'make_peugot', 'make_plymouth', 'make_porsche', 'make_renault', 'make_saab', 'make_subaru', 'make_toyota', 'make_volkswagen', 'make_volvo', 'fuel_type_diesel', 'fuel_type_gas', 'aspiration_std', 'aspiration_turbo', 'doors_four', 'doors_two', 'doors_unknown', 'body_style_convertible', 'body_style_hardtop', 'body_style_hatchback', 'body_style_sedan', 'body_style_wagon', 'drive_wheels_4wd', 'drive_wheels_fwd', 'drive_wheels_rwd', 'engine_location_front', 'engine_location_rear', 'engine_type_dohc', 'engine_type_l', 'engine_type_ohc', 'engine_type_ohcf', 'engine_type_ohcv', 'engine_type_rotor', 'num_cylinders_eight', 'num_cylinders_five', 'num_cylinders_four', 'num_cylinders_six', 'num_cylinders_three', 'num_cylinders_twelve', 'num_cylinders_two', 'fuel_system_1bbl', 'fuel_system_2bbl', 'fuel_system_4bbl', 'fuel_system_idi', 'fuel_system_mfi', 'fuel_system_mpfi', 'fuel_system_spdi', 'fuel_system_spfi', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_size', 'bore', 'stroke', 'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg'): 4029

Cross validation with ('engine_size', 'horsepower'): 2977

Cross validation with ('engine_size',): 3304

Cross validation with ('engine_size', 'horsepower', 'curb_weight'): 3045

Cross validation with ('horsepower',): 3896

Cross validation with ('engine_size', 'horsepower', 'curb_weight', 'highway_mpg', 'length', 'city_mpg', 'width', 'compression_ratio', 'bore', 'wheel_base', 'num_cylinders_four', 'fuel_system_2bbl', 'peak_rpm', 'drive_wheels_rwd', 'make_mercedes-benz'): 3295

Cross validation with ('engine_size', 'horsepower', 'curb_weight', 'highway_mpg'): 3022

The first results show pretty interesting improvements in model performance: apart from the full feature set (4029), the worst value is now under 3900, and the best set (engine_size and horsepower) falls under 3000.

As in the previous section, by selecting the best features from the univariate_model_df sorted values, 77 feature sets of increasing size go into the oven, and the scores obtained while tuning the number of folds are stored in a new DataFrame named cross_validation_df.

After analysis, the best sets of attributes will be added to feature_sets. The score obtained previously with the engine_size and horsepower features is now, temporarily, our new benchmark (RMSE = 2977).

In [42]:
#cross validation with top-𝑋 feature selection from univariate_model_df sorted values
cross_validation_df = pd.DataFrame(index=range(2,total_features))
for k in range(2,20):
    rmse=[]
    col_name="k"+str(k)
    for i in range(2,total_features):
        col_names = univariate_model_df.mean(axis=1).sort_values().head(i).index  
        rmse.append(cross_validation(col_names, target, standardized_cars, n_splits=k))
    cross_validation_df[col_name]=rmse
In [43]:
labels = ["2 features", "3 features", "4 features", "5 features", "6 features"]

plt.figure(figsize=(10,5))

for idx in range(0,5):   
    plt.plot(range(2,20),cross_validation_df.iloc[idx],label=labels[idx])
plt.hlines(2977, 1, 19, linestyle='dashed',linewidth=2)
plt.text(14, 3010, "2 features, folds = 5")
plt.xticks(range(2,20))
plt.title("Multivariate models, cross-validation")
plt.xlabel("Number of folds")
plt.ylabel("Average RMSE") 
plt.legend(loc="upper right")
Out[43]:
<matplotlib.legend.Legend at 0xd528148>

The figure shows that the model is now starting to be much more predictive: we now sit around the 3000 level. Since all the curves show the same convex, decreasing shape, one may be tempted to use a large number of folds, since larger values visibly converge toward lower error scores. Once again, it's about finding the right balance between bias and variance. We will address this question below.

In [44]:
#adding best sets to feature_sets
for idx, i in enumerate(cross_validation_df.idxmin(axis=0)):    
    features = list(univariate_model_df.mean(axis=1).sort_values().head(i).index)
    feature_sets.add(tuple(features))
    print(f'Best feature set for {idx+2} folds: {features}')
    print()
    
print('------------------------------------------')
print(f'Actual number of stored feature sets: {len(feature_sets)}')
Best feature set for 2 folds: ['engine_size', 'horsepower', 'curb_weight', 'highway_mpg', 'length', 'city_mpg', 'width', 'compression_ratio', 'bore', 'wheel_base', 'num_cylinders_four', 'fuel_system_2bbl', 'peak_rpm']

Best feature set for 3 folds: ['engine_size', 'horsepower', 'curb_weight']

Best feature set for 4 folds: ['engine_size', 'horsepower']

Best feature set for 5 folds: ['engine_size', 'horsepower']

Best feature set for 6 folds: ['engine_size', 'horsepower']

Best feature set for 7 folds: ['engine_size', 'horsepower', 'curb_weight', 'highway_mpg']

Best feature set for 8 folds: ['engine_size', 'horsepower']

Best feature set for 9 folds: ['engine_size', 'horsepower']

Best feature set for 10 folds: ['engine_size', 'horsepower']

Best feature set for 11 folds: ['engine_size', 'horsepower', 'curb_weight', 'highway_mpg']

Best feature set for 12 folds: ['engine_size', 'horsepower']

Best feature set for 13 folds: ['engine_size', 'horsepower']

Best feature set for 14 folds: ['engine_size', 'horsepower', 'curb_weight', 'highway_mpg']

Best feature set for 15 folds: ['engine_size', 'horsepower', 'curb_weight', 'highway_mpg']

Best feature set for 16 folds: ['engine_size', 'horsepower']

Best feature set for 17 folds: ['engine_size', 'horsepower', 'curb_weight', 'highway_mpg']

Best feature set for 18 folds: ['engine_size', 'horsepower', 'curb_weight', 'highway_mpg']

Best feature set for 19 folds: ['engine_size', 'horsepower']

------------------------------------------
Actual number of stored feature sets: 9

Forward selection & backward elimination

We introduce two additional algorithms that select features iteratively (a scikit-learn alternative is sketched after the list).

  • forward_selection: starts with an empty set and adds progressively new features following the RMSE criterion.
  • backward_elimination: starts with a full feature set and removes at each iteration the least useful feature, whose absence improves the performance following the same RMSE criterion.
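
As a side note, recent versions of scikit-learn (0.24 and later) ship a comparable tool, SequentialFeatureSelector; a minimal sketch under that version assumption, not used in the cells below:

from sklearn.feature_selection import SequentialFeatureSelector

#sketch only: greedy forward selection of 5 features with 5-fold cross-validation
X = standardized_cars.drop("price", axis=1)
y = standardized_cars["price"]
sfs = SequentialFeatureSelector(KNeighborsRegressor(n_neighbors=5),
                                n_features_to_select=5,
                                direction="forward",
                                scoring="neg_mean_squared_error",
                                cv=5)
sfs.fit(X, y)
print(list(X.columns[sfs.get_support()]))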
In [45]:
def forward_selection(features, target, df, min_rmse):
    """
    Trains a KNN model following the train/test validation procedure selecting features iteratively
    Starts with empty set and adds progressively new features following the RMSE criterion.
    
    features (list of string): the training features
    target (string): the column target
    df (DataFrame): the data
    min_rmse (float): the minimum RMSE value we want to minimize
    
    outputs:
        RMSE (float)
        features (list of string): the best feature set found by the algorithm
    """    
    all_features = df.drop(target,axis=1, inplace = False).columns
    l=len(all_features)-len(features)
    updated_features = [feature for feature in all_features if feature not in features]
    selected_features = []
    for i in range(0,l): 
        features.append(updated_features[i])        
        avg_rmse = knn_train_test(features, target, df)
        if avg_rmse < min_rmse:
            min_rmse = avg_rmse            
            selected_features.append(updated_features[i])
            features.remove(updated_features[i])
        else:
            features.remove(updated_features[i])
    if len(selected_features)>0:
        features.append(selected_features[-1])       
        forward_selection(features, target, df, min_rmse)
       
    return(knn_train_test(features, target, df), features)

def backward_elimination(features, target, df, min_rmse):
    """
    Trains a KNN model following the train/test validation procedure selecting features iteratively
    Starts with full feature set and removes at each iteration the least useful feature
     following the RMSE criterion.
     
    features (list of string): the training features
    target (string): the column target
    df (DataFrame): the data
    min_rmse (float): the minimum RMSE value we want to minimize
    
    outputs:
        RMSE (float)
        features (list of string): the best feature set found by the algorithm
    """ 
    l=len(features)    
    eliminated_features = [] 
    for i in range(0,l): 
        f = [feature for feature in features if feature!=features[i]]        
        avg_rmse = knn_train_test(f, target, df)
        
        if avg_rmse < min_rmse:
            min_rmse = avg_rmse            
            eliminated_features.append(features[i])            
            
    if len(eliminated_features)>0:      
        features.remove(eliminated_features[-1])        
        backward_elimination(features, target, df, min_rmse)
    
    return(knn_train_test(features, target, df), features)  
In [46]:
#train/test validation with default parameters using forward feature selection
#store the features set obtained
min_rmse = knn_train_test(standardized_cars.columns[:-1].tolist(), "price", standardized_cars)
fs_best_rmse, fs_f = forward_selection(features=[],
                                       target="price",
                                       df=standardized_cars,
                                       min_rmse=min_rmse)
print(f'Forward Selection: {fs_f}')
print()
print(f'RMSE: {fs_best_rmse}')
#store the feature set
feature_sets.add(tuple(fs_f))
Forward Selection: ['horsepower', 'highway_mpg', 'make_mercedes-benz', 'drive_wheels_rwd', 'make_peugot', 'make_volkswagen', 'make_audi', 'make_porsche', 'city_mpg', 'make_alfa-romero', 'engine_type_rotor', 'fuel_system_2bbl', 'make_saab', 'make_isuzu', 'make_chevrolet']

RMSE: 2815.095801475529
In [47]:
#train/test validation with default parameters using backward elimination
#store the feature set obtained
features = standardized_cars.columns[:-1].tolist()
min_rmse = knn_train_test(features, "price", standardized_cars)
be_best_rmse, be_f = backward_elimination(features=features,
                                          target="price",
                                          df=standardized_cars,
                                          min_rmse=min_rmse)
print(f'Backward Elimination: {be_f}')
print()
print(f'RMSE: {be_best_rmse}')
#store the feature set
feature_sets.add(tuple(be_f))
Backward Elimination: ['symboling_B', 'symboling_C', 'symboling_F', 'make_alfa-romero', 'make_audi', 'make_honda', 'make_isuzu', 'make_mazda', 'make_mercury', 'make_nissan', 'make_renault', 'make_saab', 'make_toyota', 'make_volkswagen', 'doors_two', 'body_style_convertible', 'body_style_hatchback', 'body_style_sedan', 'drive_wheels_4wd', 'drive_wheels_fwd', 'drive_wheels_rwd', 'engine_location_front', 'engine_location_rear', 'engine_type_dohc', 'engine_type_l', 'engine_type_rotor', 'num_cylinders_five', 'num_cylinders_four', 'num_cylinders_three', 'num_cylinders_two', 'fuel_system_1bbl', 'fuel_system_4bbl', 'fuel_system_mpfi', 'fuel_system_spdi', 'fuel_system_spfi', 'length', 'height', 'curb_weight', 'engine_size', 'bore', 'stroke', 'horsepower', 'highway_mpg']

RMSE: 2348.328023370358

Composed of 43 attributes, the feature set obtained by backward elimination achieves the best RMSE score we have obtained so far. This is very promising, and it shows that feature selection is a very important step in machine learning.

Let's now build another DataFrame by tuning the number of folds in the cross_validation function, using each set of attributes stored in feature_sets as a parameter.

Then, we will use this DataFrame to find the optimal number of folds for each series of RMSE values, and plot the best results in MAPE/RMSE terms after writing a new cross-validation function.

In [48]:
#construct a dataframe with the scores of the stored feature sets
final_df = pd.DataFrame()
#keep into memory the features names
features_list = list(feature_sets)
for k in range(2,20):
    rmse=[]
    col_name="k"+str(k)
    for feature_set in features_list:
        features = list(feature_set)
        score = cross_validation(features, target, standardized_cars, k=5, n_splits=k)
        rmse.append(score)    
    final_df[col_name]=rmse
In [49]:
plt.figure(figsize=(10,5))
for idx in range(0,len(feature_sets)):
    plt.plot(range(2,20),final_df.iloc[idx,:])
plt.title("Selected feature sets, cross-validation")
plt.xlabel("Number of folds")
plt.xticks(range(2,20))
plt.ylabel("Average RMSE")
#plt.legend(loc="upper right")
Out[49]:
Text(0, 0.5, 'Average RMSE')

The chart above clearly shows that there is a substantial gap between two series and the other ones, regardless of the cross-validation strategy. The full feature set is one of them. Enough to say that feature selection is not optional here!

Note that the second-worst series is the best univariate model (horsepower), which is now showing its intrinsic limits.

In [50]:
#display the 2 worst scores series
for idx in final_df.idxmax().unique():
    feature_set  = features_list[idx]
    print(f'Number of features: {len(feature_set)}')
    print(feature_set)
    print()
Number of features: 79
('symboling_B', 'symboling_C', 'symboling_D', 'symboling_E', 'symboling_F', 'symboling_G', 'make_alfa-romero', 'make_audi', 'make_bmw', 'make_chevrolet', 'make_dodge', 'make_honda', 'make_isuzu', 'make_jaguar', 'make_mazda', 'make_mercedes-benz', 'make_mercury', 'make_mitsubishi', 'make_nissan', 'make_peugot', 'make_plymouth', 'make_porsche', 'make_renault', 'make_saab', 'make_subaru', 'make_toyota', 'make_volkswagen', 'make_volvo', 'fuel_type_diesel', 'fuel_type_gas', 'aspiration_std', 'aspiration_turbo', 'doors_four', 'doors_two', 'doors_unknown', 'body_style_convertible', 'body_style_hardtop', 'body_style_hatchback', 'body_style_sedan', 'body_style_wagon', 'drive_wheels_4wd', 'drive_wheels_fwd', 'drive_wheels_rwd', 'engine_location_front', 'engine_location_rear', 'engine_type_dohc', 'engine_type_l', 'engine_type_ohc', 'engine_type_ohcf', 'engine_type_ohcv', 'engine_type_rotor', 'num_cylinders_eight', 'num_cylinders_five', 'num_cylinders_four', 'num_cylinders_six', 'num_cylinders_three', 'num_cylinders_twelve', 'num_cylinders_two', 'fuel_system_1bbl', 'fuel_system_2bbl', 'fuel_system_4bbl', 'fuel_system_idi', 'fuel_system_mfi', 'fuel_system_mpfi', 'fuel_system_spdi', 'fuel_system_spfi', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_size', 'bore', 'stroke', 'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg')

Number of features: 1
('horsepower',)

Elbow point detection

It's possible to find the point of maximum curvature (elbow point) of a curve with the Kneedle algorithm, implemented in the Python package kneed created by Kevin Arvai. For more information about the parameters and how it works, see this Kaggle notebook. Applied to the RMSE curves, the elbow point will correspond to the number of folds after which the information gain becomes negligible. Indeed, a large(r) number of folds will increase the variance of the estimate, and that's not desirable. As pointed out by Sebastian Raschka:

The reason for the increasing variance of the estimate is that the overlap between training sets increases with an increasing size of [folds].

Example of elbow point detection with the first feature set stored in final_df:

In [51]:
from kneed import KneeLocator
In [52]:
kl = KneeLocator(
    range(2,20),
    final_df.iloc[0,:],
    curve="convex",
    direction="decreasing",
    S=0.0,
    interp_method="polynomial"
)
kl.plot_knee()

print(f'Feature set: {features_list[0]}')
print()
print(f'Elbow point: {kl.elbow}')
Feature set: ('symboling_B', 'symboling_C', 'symboling_F', 'make_alfa-romero', 'make_audi', 'make_honda', 'make_isuzu', 'make_mazda', 'make_mercury', 'make_nissan', 'make_renault', 'make_saab', 'make_toyota', 'make_volkswagen', 'doors_two', 'body_style_convertible', 'body_style_hatchback', 'body_style_sedan', 'drive_wheels_4wd', 'drive_wheels_fwd', 'drive_wheels_rwd', 'engine_location_front', 'engine_location_rear', 'engine_type_dohc', 'engine_type_l', 'engine_type_rotor', 'num_cylinders_five', 'num_cylinders_four', 'num_cylinders_three', 'num_cylinders_two', 'fuel_system_1bbl', 'fuel_system_4bbl', 'fuel_system_mpfi', 'fuel_system_spdi', 'fuel_system_spfi', 'length', 'height', 'curb_weight', 'engine_size', 'bore', 'stroke', 'horsepower', 'highway_mpg')

Elbow point: 5

Here, the first significant elbow point, or the first point of maximum curvature from a polynomial interpolation point of view, is marked by the dashed line. The error keeps decreasing after the elbow point, but at the cost of an increasing variance that can lead to worse performance when the model is applied to a new dataset.

Now, we will write a new cross-validation function, cross_validation_predictions, that uses the scikit-learn cross_val_predict function instead of cross_val_score. We will use the predictions it outputs to calculate the MAPE score in conjunction with the RMSE. Remember that we ultimately want to compare our results with Kibler et al. (1989).

To conclude this study, the last charts below show the best feature sets found by minimizing the MAPE criterion and the RMSE criterion, respectively.

In [53]:
from sklearn.model_selection import cross_val_predict
In [54]:
def cross_validation_predictions(col_names, target, df, k = 5, n_splits = 5):
    """
    Trains a KNN model following the cross validation procedure
    and returns the predictions & testing set
    
    col_names (string): the training features
    target (string): the column target
    df (DataFrame): the data
    k (int): the n_neighbors parameter of KNN model we want to optimize
    n_splits (int, min = 2): the number of folds    
    
    output:      
      predictions (array): model predictions
    
    """
    knn = KNeighborsRegressor(n_neighbors = k)
    #initiate KFold, shuffle and set a seed for reproducibility
    kf = KFold(n_splits=n_splits,shuffle=True,random_state=123) 
    predictions = cross_val_predict(knn,df[col_names],df[target],cv = kf)
    return predictions

def plot_pred(target, predictions, title=""):
    """
    Scatter plot of model predictions (y-axis) vs target (x-axis)
    
    target (1d array): the true values of the attribute we want to predict
    predictions (1d array): the predictions made by the model   
    title (string, optional): the title of the chart
    
    """
    
    fig, ax = plt.subplots(figsize=(8, 8))
    
    mape = mean_absolute_percentage_error(target, predictions)
    rmse = mean_squared_error(target, predictions) ** (1/2)
    string_mape = f'MAPE on dataset: {mape:.2%}'
    string_rmse = f'RMSE on dataset: {rmse:.0f}'
    ax.scatter(target, predictions)
    ax.plot([0, 1], [0, 1], transform=ax.transAxes,ls="--", c="red")
    plt.text(30000, 10100,  string_mape)
    plt.text(30000, 11500,  string_rmse)
    plt.title(title)
    plt.ylabel('Model predictions')
    plt.xlabel('Truths')
    plt.xlim(5000,46000)
    plt.ylim(5000,46000)
    plt.show()

def elbow(x, y, curve="convex", direction="decreasing", s = 0.0, interp_method='polynomial'):
    """
    Calculate the elbow point of a curve.
    Optional parameters: see https://www.kaggle.com/kevinarvai/knee-elbow-point-detection
    
    x (1d array): the x-coordinates of the curve we want to calculate the elbow point
    y (1d array): the y-coordinates of the curve we want to calculate the elbow point  
    
    output: the elbow point (int)
    
    """  
       
    kl = KneeLocator(
        x,
        y,
        curve=curve,
        direction=direction,
        S=s,
        interp_method=interp_method
    )
    
    return kl.elbow
In [55]:
elbows=[]

mape = np.zeros(shape=(len(feature_sets),))
rmse = np.zeros(shape=(len(feature_sets),))
predictions_df = pd.DataFrame()
for idx in range(0,len(feature_sets)):
    elbows.append(elbow(range(2,20),final_df.iloc[idx,:]))
for idx in range(0,len(feature_sets)):
    col_name="feature_sets"+str(idx)    
    pred = cross_validation_predictions(list(features_list[idx]),
                                        "price",
                                        standardized_cars,
                                        k = 5,
                                        n_splits = elbows[idx])
    mape[idx]+=mean_absolute_percentage_error(standardized_cars["price"], pred)
    rmse[idx]+=mean_squared_error(standardized_cars["price"], pred) ** (1/2)
    predictions_df[col_name] = pred

for i in range(0,2):
    if i == 0:        
        idx_min = np.argmin(mape) #first chart = best MAPE
        title = "Best model (MAPE criterion)"
    else:
        idx_min = np.argmin(rmse) #2nd chart = best RMSE
        title = "Best model (RMSE criterion)"
    print(f'Folds: {elbows[idx_min]}')
    print(f'Feature set: {list(features_list[idx_min])}') 
    plot_pred(standardized_cars["price"], predictions_df.iloc[:,idx_min], title)
    if i == 0:
        print('------------------------------------------------------------------') 
        print() 
Folds: 6
Feature set: ['engine_size', 'horsepower', 'curb_weight', 'highway_mpg']