Guided Project: Predictiong Car Prices

In this project, we will use the machine learning algorithm, K-Nearest Neighbors, to perform regressions. More precisely, we will apply the algorithm to predict the price of cars using The Automobile Data Set. Before we dive further into the project, let's take a closer look at the algorithm we will be using.

The K-Nearest Neighbors (KNN) is a fundamental machine learning algorithm that can be used for classification and regression problems based on feature similarity.

Maybe you are asking how the algortheme work? the answer is for a given data point a prediction is made by looking at the k nearest neighbors data points, it depind in the the problem and the nature of data the mesure of similary is choisen.

Once we select the k neighbors we are almost donne in case of classification the data piont will be assigned to the class most common among its k nearest neighbors. In the case of regression like in this project where we want to estimate to price of car, the K-nearest neighbors will compute the averge price of k similar to predic the price.


In this project, we will follow these steps:

  • Introduction to the data set
  • Data Cleaning
  • Univariate Model
  • Multivariate Model
  • Hyperparameter Tuning
  • K-fold cross validation
  • Conclution

Introduction to the data set

In [36]:
import pandas as pd
import numpy as np



cars = pd.read_csv('imports-85.data')
pd.set_option('display.max_columns', None) # to display all the columns
cars.head()                                # print the first five rows
    
Out[36]:
3 ? alfa-romero gas std two convertible rwd front 88.60 168.80 64.10 48.80 2548 dohc four 130 mpfi 3.47 2.68 9.00 111 5000 21 27 13495
0 3 ? alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
1 1 ? alfa-romero gas std two hatchback rwd front 94.5 171.2 65.5 52.4 2823 ohcv six 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
2 2 164 audi gas std four sedan fwd front 99.8 176.6 66.2 54.3 2337 ohc four 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
3 2 164 audi gas std four sedan 4wd front 99.4 176.6 66.4 54.3 2824 ohc five 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450
4 2 ? audi gas std two sedan fwd front 99.8 177.3 66.3 53.1 2507 ohc five 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250

As we can notice that the data has no header, we will use the documentation on the data source website to generate a list of column names and pass it as an argument to the read_csv method. We can also note the use of ? to represent missing values, we can also pass ? to the na_values argument to replace it with np.nan. See pd.read_csv.

In [37]:
columns_name = ['symboling','normalized-losses','make','fuel-type','aspiration','num-of-doors','body-style','drive-wheels','engine-location','wheel-base','lenght','width','height','curb-weight','engine-type','num-of-cylinders','engine-size','fuel-system','bore','stroke','compression-ratio','horsepower','peak-rpm','city-mpg','highway-mpg','price']
cars = pd.read_csv('imports-85.data',header=None,names=columns_name,na_values='?')
pd.set_option('display.max_columns', None)
cars.head()
Out[37]:
symboling normalized-losses make fuel-type aspiration num-of-doors body-style drive-wheels engine-location wheel-base lenght width height curb-weight engine-type num-of-cylinders engine-size fuel-system bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 NaN alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111.0 5000.0 21 27 13495.0
1 3 NaN alfa-romero gas std two convertible rwd front 88.6 168.8 64.1 48.8 2548 dohc four 130 mpfi 3.47 2.68 9.0 111.0 5000.0 21 27 16500.0
2 1 NaN alfa-romero gas std two hatchback rwd front 94.5 171.2 65.5 52.4 2823 ohcv six 152 mpfi 2.68 3.47 9.0 154.0 5000.0 19 26 16500.0
3 2 164.0 audi gas std four sedan fwd front 99.8 176.6 66.2 54.3 2337 ohc four 109 mpfi 3.19 3.40 10.0 102.0 5500.0 24 30 13950.0
4 2 164.0 audi gas std four sedan 4wd front 99.4 176.6 66.4 54.3 2824 ohc five 136 mpfi 3.19 3.40 8.0 115.0 5500.0 18 22 17450.0

Data Cleaning

Let's took a close look at the data.

In [38]:
cars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  164 non-null    float64
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num-of-doors       203 non-null    object 
 6   body-style         205 non-null    object 
 7   drive-wheels       205 non-null    object 
 8   engine-location    205 non-null    object 
 9   wheel-base         205 non-null    float64
 10  lenght             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64  
 14  engine-type        205 non-null    object 
 15  num-of-cylinders   205 non-null    object 
 16  engine-size        205 non-null    int64  
 17  fuel-system        205 non-null    object 
 18  bore               201 non-null    float64
 19  stroke             201 non-null    float64
 20  compression-ratio  205 non-null    float64
 21  horsepower         203 non-null    float64
 22  peak-rpm           203 non-null    float64
 23  city-mpg           205 non-null    int64  
 24  highway-mpg        205 non-null    int64  
 25  price              201 non-null    float64
dtypes: float64(11), int64(5), object(10)
memory usage: 41.8+ KB
In [39]:
cars.describe()
Out[39]:
symboling normalized-losses wheel-base lenght width height curb-weight engine-size bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
count 205.000000 164.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 201.000000 201.000000 205.000000 203.000000 203.000000 205.000000 205.000000 201.000000
mean 0.834146 122.000000 98.756585 174.049268 65.907805 53.724878 2555.565854 126.907317 3.329751 3.255423 10.142537 104.256158 5125.369458 25.219512 30.751220 13207.129353
std 1.245307 35.442168 6.021776 12.337289 2.145204 2.443522 520.680204 41.642693 0.273539 0.316717 3.972040 39.714369 479.334560 6.542142 6.886443 7947.066342
min -2.000000 65.000000 86.600000 141.100000 60.300000 47.800000 1488.000000 61.000000 2.540000 2.070000 7.000000 48.000000 4150.000000 13.000000 16.000000 5118.000000
25% 0.000000 94.000000 94.500000 166.300000 64.100000 52.000000 2145.000000 97.000000 3.150000 3.110000 8.600000 70.000000 4800.000000 19.000000 25.000000 7775.000000
50% 1.000000 115.000000 97.000000 173.200000 65.500000 54.100000 2414.000000 120.000000 3.310000 3.290000 9.000000 95.000000 5200.000000 24.000000 30.000000 10295.000000
75% 2.000000 150.000000 102.400000 183.100000 66.900000 55.500000 2935.000000 141.000000 3.590000 3.410000 9.400000 116.000000 5500.000000 30.000000 34.000000 16500.000000
max 3.000000 256.000000 120.900000 208.100000 72.300000 59.800000 4066.000000 326.000000 3.940000 4.170000 23.000000 288.000000 6600.000000 49.000000 54.000000 45400.000000

20% of the values in the column normalized-losses are missing, if we delete all the rows with missing values, we lose a lot of data, the alternative is to replace this value with the average of the columns or to delete the column entirely. For now we will juste replace the missing values using the average.

In [40]:
cars['normalized-losses'] = cars['normalized-losses'].fillna(cars['normalized-losses'].mean())
In [41]:
cars['num-of-doors'].value_counts(dropna=False)
Out[41]:
four    114
two      89
NaN       2
Name: num-of-doors, dtype: int64
In [42]:
cars['num-of-cylinders'].value_counts(dropna=False)
Out[42]:
four      159
six        24
five       11
eight       5
two         4
three       1
twelve      1
Name: num-of-cylinders, dtype: int64

the database contine 15 continuous attributes out of 26 and one integer attribute symboling , and some nominal attributes can be transformed to numerica like num-of-doors and num-of-cylinders

lets statr by tronsforming the num-of-doors and num-of-cylinders to numeric columns

In [43]:
cars['num-of-cylinders'] = cars['num-of-cylinders'].replace(to_replace={'four':4,'six':6,'five':5,'eight':8, 'two':2,'twelve':11,'three':3})
In [44]:
cars['num-of-doors'] = cars['num-of-doors'].replace(to_replace={'four':4, 'two':2})
In [45]:
cars['num-of-doors'].value_counts(dropna=False)
Out[45]:
4.0    114
2.0     89
NaN      2
Name: num-of-doors, dtype: int64
In [46]:
cars['num-of-cylinders'].value_counts(dropna=False)
Out[46]:
4     159
6      24
5      11
8       5
2       4
11      1
3       1
Name: num-of-cylinders, dtype: int64
In [47]:
cars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  205 non-null    float64
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num-of-doors       203 non-null    float64
 6   body-style         205 non-null    object 
 7   drive-wheels       205 non-null    object 
 8   engine-location    205 non-null    object 
 9   wheel-base         205 non-null    float64
 10  lenght             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64  
 14  engine-type        205 non-null    object 
 15  num-of-cylinders   205 non-null    int64  
 16  engine-size        205 non-null    int64  
 17  fuel-system        205 non-null    object 
 18  bore               201 non-null    float64
 19  stroke             201 non-null    float64
 20  compression-ratio  205 non-null    float64
 21  horsepower         203 non-null    float64
 22  peak-rpm           203 non-null    float64
 23  city-mpg           205 non-null    int64  
 24  highway-mpg        205 non-null    int64  
 25  price              201 non-null    float64
dtypes: float64(12), int64(6), object(8)
memory usage: 41.8+ KB

we still have some columns with missing data, we will just drop the rows

In [48]:
cars.dropna(inplace=True)
In [49]:
cars.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 193 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          193 non-null    int64  
 1   normalized-losses  193 non-null    float64
 2   make               193 non-null    object 
 3   fuel-type          193 non-null    object 
 4   aspiration         193 non-null    object 
 5   num-of-doors       193 non-null    float64
 6   body-style         193 non-null    object 
 7   drive-wheels       193 non-null    object 
 8   engine-location    193 non-null    object 
 9   wheel-base         193 non-null    float64
 10  lenght             193 non-null    float64
 11  width              193 non-null    float64
 12  height             193 non-null    float64
 13  curb-weight        193 non-null    int64  
 14  engine-type        193 non-null    object 
 15  num-of-cylinders   193 non-null    int64  
 16  engine-size        193 non-null    int64  
 17  fuel-system        193 non-null    object 
 18  bore               193 non-null    float64
 19  stroke             193 non-null    float64
 20  compression-ratio  193 non-null    float64
 21  horsepower         193 non-null    float64
 22  peak-rpm           193 non-null    float64
 23  city-mpg           193 non-null    int64  
 24  highway-mpg        193 non-null    int64  
 25  price              193 non-null    float64
dtypes: float64(12), int64(6), object(8)
memory usage: 40.7+ KB

the list of numeric columns to keep

In [50]:
numeric_columns = ['symboling','normalized-losses','num-of-doors','wheel-base','lenght','width','height',
                   'curb-weight','num-of-cylinders','engine-size','bore','stroke','compression-ratio','horsepower','peak-rpm','city-mpg','highway-mpg','price']
In [51]:
len(numeric_columns)
Out[51]:
18
In [52]:
numeric_cars = cars[numeric_columns]
numeric_cars.head()
Out[52]:
symboling normalized-losses num-of-doors wheel-base lenght width height curb-weight num-of-cylinders engine-size bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 3 122.0 2.0 88.6 168.8 64.1 48.8 2548 4 130 3.47 2.68 9.0 111.0 5000.0 21 27 13495.0
1 3 122.0 2.0 88.6 168.8 64.1 48.8 2548 4 130 3.47 2.68 9.0 111.0 5000.0 21 27 16500.0
2 1 122.0 2.0 94.5 171.2 65.5 52.4 2823 6 152 2.68 3.47 9.0 154.0 5000.0 19 26 16500.0
3 2 164.0 4.0 99.8 176.6 66.2 54.3 2337 4 109 3.19 3.40 10.0 102.0 5500.0 24 30 13950.0
4 2 164.0 4.0 99.4 176.6 66.4 54.3 2824 5 136 3.19 3.40 8.0 115.0 5500.0 18 22 17450.0

Feature scaling

Depending on the algorithm you are using, you may or may not need to standardize the data, since we assume that all features have the same considerations and we are computing distance we should apply standardization. To normalize all columns in a range of 0 to 1, we will use Min-Max normalization.

Min-Max Normalization:

$$ X = \frac{X - X.min}{X.max - X.min}$$
In [53]:
price_col = numeric_cars['price']
numeric_cars = (numeric_cars - numeric_cars.min())/(numeric_cars.max() - numeric_cars.min())
numeric_cars['price'] = price_col
In [54]:
numeric_cars.head()
Out[54]:
symboling normalized-losses num-of-doors wheel-base lenght width height curb-weight num-of-cylinders engine-size bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg price
0 1.0 0.298429 0.0 0.058309 0.413433 0.324786 0.083333 0.411171 0.125 0.260377 0.664286 0.290476 0.1250 0.294393 0.346939 0.222222 0.289474 13495.0
1 1.0 0.298429 0.0 0.058309 0.413433 0.324786 0.083333 0.411171 0.125 0.260377 0.664286 0.290476 0.1250 0.294393 0.346939 0.222222 0.289474 16500.0
2 0.6 0.298429 0.0 0.230321 0.449254 0.444444 0.383333 0.517843 0.375 0.343396 0.100000 0.666667 0.1250 0.495327 0.346939 0.166667 0.263158 16500.0
3 0.8 0.518325 1.0 0.384840 0.529851 0.504274 0.541667 0.329325 0.125 0.181132 0.464286 0.633333 0.1875 0.252336 0.551020 0.305556 0.368421 13950.0
4 0.8 0.518325 1.0 0.373178 0.529851 0.521368 0.541667 0.518231 0.250 0.283019 0.464286 0.633333 0.0625 0.313084 0.551020 0.138889 0.157895 17450.0

Univariate Model

In [55]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def knn_train_test(df,features='',target='price',k=5):
    
    #first split the data set train/test
    nbr_rows = len(df)
    np.random.seed(1)
    indexs = np.random.permutation(nbr_rows) #shuffle the data
    train = df.iloc[indexs[0:round(nbr_rows * .75)]] # 75% for the training
    test = df.iloc[indexs[round(nbr_rows * .75):]]
    
    #train th model
    model = KNeighborsRegressor(n_neighbors=k)
    model.fit(train[features],train[target])
    predictions = model.predict(test[features])
    mse = mean_squared_error(test[target],predictions)
    rmse = mse ** (1/2)
    return rmse
    
In [56]:
numeric_columns.remove('price')
In [57]:
len(numeric_columns)
Out[57]:
17

For each numeric column, we will create, train, and test a univariate model using default k value from scikit-learn.

In [58]:
rmse_values = list()
for feature in numeric_columns:
    rmse = knn_train_test(numeric_cars,[feature])
    rmse_values.append(rmse)
In [59]:
rmse_values
Out[59]:
[7391.45857809314,
 6691.161998798614,
 7355.852500673643,
 5771.642749902318,
 5260.713472920443,
 3709.0194335340625,
 7982.664949230092,
 3084.2734194350105,
 4086.3104265951206,
 3171.8674012060046,
 5995.115904425868,
 6234.768973733778,
 5096.504099053259,
 4550.010748522103,
 6514.248899080129,
 3675.0418801468554,
 4131.277200818168]
In [60]:
import  matplotlib.pyplot as plt
from matplotlib import style


style.use('bmh')

%matplotlib inline 
plt.scatter(numeric_columns,rmse_values,)
plt.xticks(rotation=90)
plt.title('The  Root Mean Square Error  For Univariate Model')
plt.ylabel('RMSE')
plt.xlabel('Numeric attribute')
plt.show()

For each numeric column, we will create, train, and test a univariate model using the following k values (1, 3, 5, 7, and 9). Visualize the results using a scatter plot.

In [61]:
neighbors = [1,3,5,7,9]
rmse_k = {}
for feature in numeric_columns:
    rmse_values = list()
    for k in neighbors:      
        rmse = knn_train_test(numeric_cars,[feature],k=k)
        rmse_values.append(rmse)
    rmse_k[feature] = rmse_values 
In [62]:
results = pd.DataFrame(rmse_k, index=neighbors)
In [63]:
results
Out[63]:
symboling normalized-losses num-of-doors wheel-base lenght width height curb-weight num-of-cylinders engine-size bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg
1 9936.839864 6581.090924 9473.439976 4665.530244 4571.303106 2062.156330 9660.569786 5103.185798 6469.261154 2569.223914 6024.216180 3697.468140 4765.611197 3349.458156 6178.049318 4805.300574 4528.756243
3 7639.444338 6207.707240 8001.398262 5542.296773 5127.738422 3130.699015 8324.506117 3554.133091 3929.319132 2768.704046 5976.832174 5377.241925 4589.625354 4001.242160 5700.254306 3891.116610 4491.859959
5 7391.458578 6691.161999 7355.852501 5771.642750 5260.713473 3709.019434 7982.664949 3084.273419 4086.310427 3171.867401 5995.115904 6234.768974 5096.504099 4550.010749 6514.248899 3675.041880 4131.277201
7 6923.249562 7090.636344 7691.424289 5427.876232 5562.721854 3509.697064 7865.607051 2928.318411 4447.210949 3194.923231 6116.079189 6440.383994 5660.042521 4849.468468 6427.913607 3623.745989 3756.331451
9 7017.689447 7195.266961 7521.324168 5325.721120 5577.753263 3313.365517 7625.072018 2744.663489 4905.965071 2927.365390 5973.632081 6542.294028 6254.368059 5010.198400 6980.497340 4055.921880 3349.029400
In [64]:
fig = plt.figure(figsize=(12, 3*6))
<Figure size 864x1296 with 0 Axes>
In [65]:
for sp in range(0,6):
    ax = fig.add_subplot(6,3,sp*3+1)
    ax.scatter(x=neighbors,y=results[numeric_columns[sp]])
    ax.spines["right"].set_visible(False)    
    ax.spines["left"].set_visible(False)
    ax.spines["top"].set_visible(False)    
    ax.spines["bottom"].set_visible(False)
    ax.set_xlim(0, 10)
    ax.set_ylim(0,10000)
    #ax.set_yticks([0,50,100])
    #ax.axhline(50, c=(171/255, 171/255, 171/255), alpha=0.3)
    ax.set_title(numeric_columns[sp])
    ax.tick_params(bottom="off", top="off", left="off", right="off",labelbottom='off')
    #if sp == 5:
    #   ax.tick_params(labelbottom='on')
fig
Out[65]:
In [66]:
for sp in range(0,5):
    ax = fig.add_subplot(6,3,sp*3+2)
    ax.scatter(x=neighbors,y=results[numeric_columns[sp+6]])
    ax.spines["right"].set_visible(False)    
    ax.spines["left"].set_visible(False)
    ax.spines["top"].set_visible(False)    
    ax.spines["bottom"].set_visible(False)
    ax.set_xlim(0, 10)
    ax.set_ylim(0,10000)
    ax.set_title(numeric_columns[sp+6])
    ax.tick_params(bottom="off", top="off", left="off", right="off",labelbottom='off')
    #if sp == 5:
    #   ax.tick_params(labelbottom='on')
fig
Out[66]:
In [67]:
for sp in range(0,6):
    ax = fig.add_subplot(6,3,sp*3+3)
    ax.scatter(x=neighbors,y=results[numeric_columns[sp+11]])
    ax.spines["right"].set_visible(False)    
    ax.spines["left"].set_visible(False)
    ax.spines["top"].set_visible(False)    
    ax.spines["bottom"].set_visible(False)
    ax.set_xlim(0, 10)
    ax.set_ylim(0,10000)
    ax.set_title(numeric_columns[sp+11])
    ax.tick_params(bottom="off", top="off", left="off", right="off",labelbottom='off')
    #if sp == 5:
    #   ax.tick_params(labelbottom='on')
fig
Out[67]:

Multivariate Model

We will sort the features based in result from the previous step.

In [68]:
# Compute average RMSE across different `k` values for each feature.
results.loc['avg'] = results.apply(np.mean)
In [69]:
results
Out[69]:
symboling normalized-losses num-of-doors wheel-base lenght width height curb-weight num-of-cylinders engine-size bore stroke compression-ratio horsepower peak-rpm city-mpg highway-mpg
1 9936.839864 6581.090924 9473.439976 4665.530244 4571.303106 2062.156330 9660.569786 5103.185798 6469.261154 2569.223914 6024.216180 3697.468140 4765.611197 3349.458156 6178.049318 4805.300574 4528.756243
3 7639.444338 6207.707240 8001.398262 5542.296773 5127.738422 3130.699015 8324.506117 3554.133091 3929.319132 2768.704046 5976.832174 5377.241925 4589.625354 4001.242160 5700.254306 3891.116610 4491.859959
5 7391.458578 6691.161999 7355.852501 5771.642750 5260.713473 3709.019434 7982.664949 3084.273419 4086.310427 3171.867401 5995.115904 6234.768974 5096.504099 4550.010749 6514.248899 3675.041880 4131.277201
7 6923.249562 7090.636344 7691.424289 5427.876232 5562.721854 3509.697064 7865.607051 2928.318411 4447.210949 3194.923231 6116.079189 6440.383994 5660.042521 4849.468468 6427.913607 3623.745989 3756.331451
9 7017.689447 7195.266961 7521.324168 5325.721120 5577.753263 3313.365517 7625.072018 2744.663489 4905.965071 2927.365390 5973.632081 6542.294028 6254.368059 5010.198400 6980.497340 4055.921880 3349.029400
avg 7781.736358 6753.172694 8008.687839 5346.613424 5220.046023 3144.987472 8291.683984 3482.914842 4767.613347 2926.416796 6017.175106 5658.431412 5273.230246 4352.075586 6360.192694 4010.225386 4051.450851
In [70]:
best_features = results.loc['avg'].sort_values().index
In [71]:
best_features 
Out[71]:
Index(['engine-size', 'width', 'curb-weight', 'city-mpg', 'highway-mpg',
       'horsepower', 'num-of-cylinders', 'lenght', 'compression-ratio',
       'wheel-base', 'stroke', 'bore', 'peak-rpm', 'normalized-losses',
       'symboling', 'num-of-doors', 'height'],
      dtype='object')
In [72]:
best_features
Out[72]:
Index(['engine-size', 'width', 'curb-weight', 'city-mpg', 'highway-mpg',
       'horsepower', 'num-of-cylinders', 'lenght', 'compression-ratio',
       'wheel-base', 'stroke', 'bore', 'peak-rpm', 'normalized-losses',
       'symboling', 'num-of-doors', 'height'],
      dtype='object')
In [73]:
#the best 2 features from the previous step
best_features[:2]
Out[73]:
Index(['engine-size', 'width'], dtype='object')
In [74]:
nbr_features = [2,3,4,5,6,7]
rmse_values ={}
for nbr_feature in nbr_features:
        rmse = knn_train_test(numeric_cars,best_features[:nbr_feature])
        rmse_values[nbr_feature] = rmse
In [75]:
rmse_values
Out[75]:
{2: 2244.634766133086,
 3: 2199.4665137255447,
 4: 2346.6186852930896,
 5: 2350.786971420422,
 6: 2537.1719108093566,
 7: 2402.188727147252}

Hyperparameter Tuning

A good a chose of k can imporve accuracy, we will use searsh grid to covere a range of value between 1 and 25. For the features we will select the three top models from the last step.

In [76]:
top3 = [2,3,4,5]
rmse_k = {}
for nbr_feature in top3:
        rmse_values = list()
        for k in range(1,25):
            rmse = knn_train_test(numeric_cars,best_features[:nbr_feature],k=k)
            rmse_values.append(rmse)
        rmse_k[nbr_feature] = rmse_values
In [77]:
rmse_k
Out[77]:
{2: [2384.8323010573863,
  2030.8784608923959,
  1964.0343194227464,
  2201.423501053258,
  2244.634766133086,
  2467.51531056453,
  2579.4367959108126,
  2503.2739671653003,
  2487.9080667005983,
  2559.959226359865,
  2689.18357091232,
  2716.31906933967,
  2765.3286698631077,
  2834.7934028985865,
  2911.3369994315767,
  2909.5235766556757,
  2926.8478955706305,
  2936.1936513744995,
  2977.316330053173,
  3008.711678185186,
  3087.4809134656584,
  3181.9239792949256,
  3243.0021632731014,
  3303.6151531104692],
 3: [1959.2384734465923,
  1902.5755131750575,
  1746.8690921188113,
  1767.3586418750488,
  2199.4665137255447,
  2208.9384608545784,
  2214.8778506460153,
  2407.3242039485804,
  2436.9999119266404,
  2223.685969192443,
  2190.0789485687924,
  2300.616854176771,
  2306.17457684166,
  2398.464915856294,
  2486.1731734773657,
  2558.965994025218,
  2667.128930620863,
  2783.3294599200517,
  2817.6408636433353,
  2886.588850570788,
  2979.7812577769514,
  3015.7859197633666,
  3094.5277518437797,
  3101.193287848042],
 4: [2131.390679618982,
  2206.456985096998,
  2271.0984987894526,
  2145.8483795993525,
  2346.6186852930896,
  2407.1171822433002,
  2458.6987468244365,
  2505.775662199398,
  2540.68372219656,
  2499.9748365400264,
  2452.8085856456073,
  2407.9226779476408,
  2353.7413061113307,
  2443.2431227676734,
  2479.9547049141697,
  2615.546035247707,
  2726.81420267943,
  2798.017348142094,
  2909.634763853679,
  2997.496469130184,
  3049.461809572687,
  3095.515781923601,
  3147.9935104031965,
  3187.1392490403423],
 5: [1933.0795810485058,
  2155.764972018827,
  2205.8025865715917,
  2209.161862067264,
  2350.786971420422,
  2397.6095645126616,
  2532.9836826822816,
  2537.307925625131,
  2500.662772434606,
  2295.518860737154,
  2305.323067136258,
  2299.7159552489065,
  2417.0951768720624,
  2549.317541292039,
  2572.7566502467525,
  2683.8710127117415,
  2768.240263311874,
  2867.431478622222,
  2950.5906760713688,
  3047.6063083898325,
  3129.0189045129937,
  3181.3859005391196,
  3189.2883411802713,
  3254.3498733317087]}
In [78]:
final_result = pd.DataFrame(rmse_k,index=range(1,25))
In [79]:
final_result
Out[79]:
2 3 4 5
1 2384.832301 1959.238473 2131.390680 1933.079581
2 2030.878461 1902.575513 2206.456985 2155.764972
3 1964.034319 1746.869092 2271.098499 2205.802587
4 2201.423501 1767.358642 2145.848380 2209.161862
5 2244.634766 2199.466514 2346.618685 2350.786971
6 2467.515311 2208.938461 2407.117182 2397.609565
7 2579.436796 2214.877851 2458.698747 2532.983683
8 2503.273967 2407.324204 2505.775662 2537.307926
9 2487.908067 2436.999912 2540.683722 2500.662772
10 2559.959226 2223.685969 2499.974837 2295.518861
11 2689.183571 2190.078949 2452.808586 2305.323067
12 2716.319069 2300.616854 2407.922678 2299.715955
13 2765.328670 2306.174577 2353.741306 2417.095177
14 2834.793403 2398.464916 2443.243123 2549.317541
15 2911.336999 2486.173173 2479.954705 2572.756650
16 2909.523577 2558.965994 2615.546035 2683.871013
17 2926.847896 2667.128931 2726.814203 2768.240263
18 2936.193651 2783.329460 2798.017348 2867.431479
19 2977.316330 2817.640864 2909.634764 2950.590676
20 3008.711678 2886.588851 2997.496469 3047.606308
21 3087.480913 2979.781258 3049.461810 3129.018905
22 3181.923979 3015.785920 3095.515782 3181.385901
23 3243.002163 3094.527752 3147.993510 3189.288341
24 3303.615153 3101.193288 3187.139249 3254.349873
In [80]:
final_result.loc['top_k'] = final_result.apply(lambda col: col.sort_values().index[0])

the best value of k for each groupe of features is:

In [81]:
final_result.loc['top_k']
Out[81]:
2    3.0
3    3.0
4    1.0
5    1.0
Name: top_k, dtype: float64

K-fold cross validation

We will update the knn_train_test function to add cross validation

In [82]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, KFold

def knn_train_test2(df,features='',target='price',k=5,fold=5):
    
    #first split the data set train/test
    kf = KFold(fold,shuffle=True, random_state=1)
    
    #train th model
    model = KNeighborsRegressor(n_neighbors=k)
    mses= cross_val_score(model, df[features], df[target], scoring='neg_mean_squared_error',cv=kf)
    rmses = np.sqrt(np.absolute(mses))
    avg_rmse = np.mean(rmses)
    return avg_rmse
In [82]:
 
In [83]:
rmse_values = list()
for feature in numeric_columns:
    rmse = knn_train_test2(numeric_cars,[feature])
    rmse_values.append(rmse)
In [84]:
style.use('bmh')

%matplotlib inline 
plt.scatter(numeric_columns,rmse_values)
plt.xticks(rotation=90)
plt.title('The  Root Mean Square Error  For Univariate Model With Cross Validation')
plt.ylabel('RMSE')
plt.xlabel('Numeric attribute')
plt.show()