Predicting Car Prices

Introduction

This project aims to predict cars' market prices based on their attributes, such as curb weight, horsepower, and miles per gallon, among others.

The automobile dataset used is from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/automobile.

About the dataset

This dataset consists of three entity types:

(a) Specification - the various characteristics of the auto

(b) Assigned insurance risk rating - the degree to which the auto is more risky than its price indicates

(c) Normalized losses in use - the relative average loss payment per insured vehicle year, as compared to other cars

Reading the dataframe
In [1]:
import pandas as pd
# NB: imports-85.data has no header row, so this read silently
# promotes the first record to column names (see the note below)
cars = pd.read_csv('imports-85.data')

# First few rows of the data
cars.head(3)
Out[1]:
3 ? alfa-romero gas std two convertible rwd front 88.60 ... 130 mpfi 3.47 2.68 9.00 111 5000 21 27 13495
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
1 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
2 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950

3 rows × 26 columns

The dataset has 26 columns, each giving some information about the autos.

However, the raw file has no header row, so pandas promoted the first record to column names; that is why the header above shows data values and the frame holds 204 of the dataset's 205 records. The columns therefore need descriptive names.
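A minimal sketch of a re-read that would keep all 205 records, assuming the new_cols list defined in the next cell:

# Hypothetical fix: header=None stops pandas from consuming the first
# record, and names= applies descriptive column names in one step
cars = pd.read_csv('imports-85.data', header=None, names=new_cols)

To keep the outputs below reproducible, we proceed with the 204-record frame as read above.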

Data Cleaning
In [2]:
#Renaming the columns
new_cols = ['Symbol', 'Normalized_loss', 'Make', 'Fuel_type', 'Aspiration', 'No_of_doors', 'Body_style', 'Drive_wheels',
           'Engine_loc', 'Wheel_base', 'Length', 'Width', 'Height', 'Curb_weight', 'Engine_type', 'No_of_cylinders',
           'Engine_size', 'Fuel_system', 'Bore', 'Stroke', 'Compression_ratio', 'Horse_power', 'Peak_rpm', 'City_mpg',
           'Highway_mpg', 'Price']
cars.columns = new_cols
cars.head(2)
Out[2]:
Symbol Normalized_loss Make Fuel_type Aspiration No_of_doors Body_style Drive_wheels Engine_loc Wheel_base ... Engine_size Fuel_system Bore Stroke Compression_ratio Horse_power Peak_rpm City_mpg Highway_mpg Price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
1 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500

2 rows × 26 columns

In [3]:
print('The dataset has {} records and {} columns'.format(cars.shape[0], cars.shape[1]))
print(' ')
print('Info on number of non-null values and the datatype of each column: ')
print(' ')
cars.info()
The dataset has 204 records and 26 columns
 
Info on number of non-null values and the datatype of each column: 
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 26 columns):
Symbol               204 non-null int64
Normalized_loss      204 non-null object
Make                 204 non-null object
Fuel_type            204 non-null object
Aspiration           204 non-null object
No_of_doors          204 non-null object
Body_style           204 non-null object
Drive_wheels         204 non-null object
Engine_loc           204 non-null object
Wheel_base           204 non-null float64
Length               204 non-null float64
Width                204 non-null float64
Height               204 non-null float64
Curb_weight          204 non-null int64
Engine_type          204 non-null object
No_of_cylinders      204 non-null object
Engine_size          204 non-null int64
Fuel_system          204 non-null object
Bore                 204 non-null object
Stroke               204 non-null object
Compression_ratio    204 non-null float64
Horse_power          204 non-null object
Peak_rpm             204 non-null object
City_mpg             204 non-null int64
Highway_mpg          204 non-null int64
Price                204 non-null object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.5+ KB

Several columns are stored as the object datatype even though their values are, or should be, either int or float:

  • Normalized_loss
  • Bore
  • Stroke
  • Horse_power
  • Peak_rpm
  • Price

Such columns need to be cleaned before continuing with the analysis; the dataset marks missing values with '?', which forces these columns to the object datatype.

In [4]:
import numpy as np
In [5]:
# list of columns that need some cleaning
numeric_cols = ['Normalized_loss', 'Bore', 'Stroke', 'Horse_power', 'Peak_rpm', 'Price']

# Strip stray whitespace and replace the '?' placeholders with NaN
def strip_cols(df):
    for col in numeric_cols:
        df[col] = df[col].str.strip().replace('?', np.nan)

    return df

cars = strip_cols(cars)
cars[numeric_cols] = cars[numeric_cols].apply(pd.to_numeric)
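An equivalent, more compact route is to let to_numeric do the coercion itself; a sketch (errors='coerce' turns any unparseable string, including the '?' placeholder, straight into NaN):

# Alternative: coerce the placeholders to NaN in a single step per column
for col in numeric_cols:
    cars[col] = pd.to_numeric(cars[col], errors='coerce')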
In [6]:
cars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 26 columns):
Symbol               204 non-null int64
Normalized_loss      164 non-null float64
Make                 204 non-null object
Fuel_type            204 non-null object
Aspiration           204 non-null object
No_of_doors          204 non-null object
Body_style           204 non-null object
Drive_wheels         204 non-null object
Engine_loc           204 non-null object
Wheel_base           204 non-null float64
Length               204 non-null float64
Width                204 non-null float64
Height               204 non-null float64
Curb_weight          204 non-null int64
Engine_type          204 non-null object
No_of_cylinders      204 non-null object
Engine_size          204 non-null int64
Fuel_system          204 non-null object
Bore                 200 non-null float64
Stroke               200 non-null float64
Compression_ratio    204 non-null float64
Horse_power          202 non-null float64
Peak_rpm             202 non-null float64
City_mpg             204 non-null int64
Highway_mpg          204 non-null int64
Price                200 non-null float64
dtypes: float64(11), int64(5), object(10)
memory usage: 41.5+ KB

In this mini-project, we'll use only the numeric columns for prediction and ignore the string-valued ones. In a fuller treatment, some of the object columns could be encoded for better results, as sketched below.
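A minimal sketch of such an encoding (pd.get_dummies one-hot encodes each category into its own 0/1 column; the column choice here is just for illustration, and the result is not used further):

# Sketch: one-hot encode two of the categorical columns
pd.get_dummies(cars[['Fuel_type', 'Body_style']]).head(2)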

In [7]:
cars = cars.select_dtypes(exclude = ['object']).copy()
cars.head()
Out[7]:
Symbol Normalized_loss Wheel_base Length Width Height Curb_weight Engine_size Bore Stroke Compression_ratio Horse_power Peak_rpm City_mpg Highway_mpg Price
0 3 NaN 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 9.0 111.0 5000.0 21 27 16500.0
1 1 NaN 94.5 171.2 65.5 52.4 2823 152 2.68 3.47 9.0 154.0 5000.0 19 26 16500.0
2 2 164.0 99.8 176.6 66.2 54.3 2337 109 3.19 3.40 10.0 102.0 5500.0 24 30 13950.0
3 2 164.0 99.4 176.6 66.4 54.3 2824 136 3.19 3.40 8.0 115.0 5500.0 18 22 17450.0
4 2 NaN 99.8 177.3 66.3 53.1 2507 136 3.19 3.40 8.5 110.0 5500.0 19 25 15250.0
Handling missing values
In [8]:
missing_vals = cars.columns[cars.isna().any()]
cars[missing_vals].isna().sum()
Out[8]:
Normalized_loss    40
Bore                4
Stroke              4
Horse_power         2
Peak_rpm            2
Price               4
dtype: int64
In [9]:
# An overview of the rows with missing values
null_data = cars[cars.isnull().any(axis=1)]
null_data
Out[9]:
Symbol Normalized_loss Wheel_base Length Width Height Curb_weight Engine_size Bore Stroke Compression_ratio Horse_power Peak_rpm City_mpg Highway_mpg Price
0 3 NaN 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 9.0 111.0 5000.0 21 27 16500.0
1 1 NaN 94.5 171.2 65.5 52.4 2823 152 2.68 3.47 9.0 154.0 5000.0 19 26 16500.0
4 2 NaN 99.8 177.3 66.3 53.1 2507 136 3.19 3.40 8.5 110.0 5500.0 19 25 15250.0
6 1 NaN 105.8 192.7 71.4 55.7 2954 136 3.19 3.40 8.5 110.0 5500.0 19 25 18920.0
8 0 NaN 99.5 178.2 67.9 52.0 3053 131 3.13 3.40 7.0 160.0 5500.0 16 22 NaN
13 1 NaN 103.5 189.0 66.9 55.7 3055 164 3.31 3.19 9.0 121.0 4250.0 20 25 24565.0
14 0 NaN 103.5 189.0 66.9 55.7 3230 209 3.62 3.39 8.0 182.0 5400.0 16 22 30760.0
15 0 NaN 103.5 193.8 67.9 53.7 3380 209 3.62 3.39 8.0 182.0 5400.0 16 22 41315.0
16 0 NaN 110.0 197.0 70.9 56.3 3505 209 3.62 3.39 8.0 182.0 5400.0 15 20 36880.0
42 0 NaN 94.3 170.7 61.8 53.5 2337 111 3.31 3.23 8.5 78.0 4800.0 24 29 6785.0
43 1 NaN 94.5 155.9 63.6 52.0 1874 90 3.03 3.11 9.6 70.0 5400.0 38 43 NaN
44 0 NaN 94.5 155.9 63.6 52.0 1909 90 3.03 3.11 9.6 70.0 5400.0 38 43 NaN
45 2 NaN 96.0 172.6 65.2 51.4 2734 119 3.43 3.23 9.2 90.0 5000.0 24 29 11048.0
47 0 NaN 113.0 199.6 69.6 52.8 4066 258 3.63 4.17 8.1 176.0 4750.0 15 19 35550.0
48 0 NaN 102.0 191.7 70.6 47.8 3950 326 3.54 2.76 11.5 262.0 5000.0 13 17 36000.0
54 3 150.0 95.3 169.0 65.7 49.6 2380 70 NaN NaN 9.4 101.0 6000.0 17 23 10945.0
55 3 150.0 95.3 169.0 65.7 49.6 2380 70 NaN NaN 9.4 101.0 6000.0 17 23 11845.0
56 3 150.0 95.3 169.0 65.7 49.6 2385 70 NaN NaN 9.4 101.0 6000.0 17 23 13645.0
57 3 150.0 95.3 169.0 65.7 49.6 2500 80 NaN NaN 9.4 135.0 6000.0 16 23 15645.0
62 0 NaN 98.8 177.8 66.5 55.5 2443 122 3.39 3.39 22.7 64.0 4650.0 36 42 10795.0
65 0 NaN 104.9 175.0 66.1 54.4 2700 134 3.43 3.64 22.0 72.0 4200.0 31 39 18344.0
70 -1 NaN 115.6 202.6 71.7 56.5 3740 234 3.46 3.10 8.3 155.0 4750.0 16 18 34184.0
72 0 NaN 120.9 208.1 71.7 56.7 3900 308 3.80 3.35 8.0 184.0 4500.0 14 16 40960.0
73 1 NaN 112.0 199.2 72.0 55.4 3715 304 3.80 3.35 8.0 184.0 4500.0 14 16 45400.0
74 1 NaN 102.7 178.4 68.0 54.8 2910 140 3.78 3.12 8.0 175.0 5000.0 19 24 16503.0
81 3 NaN 95.9 173.2 66.3 50.2 2833 156 3.58 3.86 7.0 145.0 5000.0 19 24 12629.0
82 3 NaN 95.9 173.2 66.3 50.2 2921 156 3.59 3.86 7.0 145.0 5000.0 19 24 14869.0
83 3 NaN 95.9 173.2 66.3 50.2 2926 156 3.59 3.86 7.0 145.0 5000.0 19 24 14489.0
108 0 NaN 114.2 198.9 68.4 58.7 3230 120 3.46 3.19 8.4 97.0 5000.0 19 24 12440.0
109 0 NaN 114.2 198.9 68.4 58.7 3430 152 3.70 3.52 21.0 95.0 4150.0 25 25 13860.0
112 0 NaN 114.2 198.9 68.4 56.7 3285 120 3.46 2.19 8.4 95.0 5000.0 19 24 16695.0
113 0 NaN 114.2 198.9 68.4 58.7 3485 152 3.70 3.52 21.0 95.0 4150.0 25 25 17075.0
123 3 NaN 95.9 173.2 66.3 50.2 2818 156 3.59 3.86 7.0 145.0 5000.0 19 24 12764.0
125 3 NaN 89.5 168.9 65.0 51.6 2756 194 3.74 2.90 9.5 207.0 5900.0 17 25 32528.0
126 3 NaN 89.5 168.9 65.0 51.6 2756 194 3.74 2.90 9.5 207.0 5900.0 17 25 34028.0
127 3 NaN 89.5 168.9 65.0 51.6 2800 194 3.74 2.90 9.5 207.0 5900.0 17 25 37028.0
128 1 NaN 98.4 175.7 72.3 50.5 3366 203 3.94 3.11 10.0 288.0 5750.0 17 28 NaN
129 0 NaN 96.1 181.5 66.5 55.2 2579 132 3.46 3.90 8.7 NaN NaN 23 31 9295.0
130 2 NaN 96.1 176.8 66.6 50.5 2460 132 3.46 3.90 8.7 NaN NaN 23 31 9895.0
180 -1 NaN 104.5 187.8 66.5 54.1 3151 161 3.27 3.35 9.2 156.0 5200.0 19 24 15750.0
188 3 NaN 94.5 159.3 64.2 55.6 2254 109 3.19 3.40 8.5 90.0 5500.0 24 29 11595.0
190 0 NaN 100.4 180.2 66.9 55.1 2661 136 3.19 3.40 8.5 110.0 5500.0 19 24 13295.0
191 0 NaN 100.4 180.2 66.9 55.1 2579 97 3.01 3.40 23.0 68.0 4500.0 33 38 13845.0
192 0 NaN 100.4 183.1 66.9 55.1 2563 109 3.19 3.40 9.0 88.0 5500.0 25 31 12290.0

Here's how the missing values will be handled:

  • Rows with missing values in Bore, Stroke, Horse_power, Peak_rpm or Price will be dropped, since these rows tend to have missing values in other columns as well
  • Missing values in the Normalized_loss column will be replaced with that column's mean
In [10]:
cars = cars.dropna(subset = ['Bore', 'Stroke', 'Horse_power', 'Peak_rpm', 'Price'])
avg_loss = cars['Normalized_loss'].mean()
cars['Normalized_loss'] = cars['Normalized_loss'].fillna(value = avg_loss)
cars.isna().sum()
Out[10]:
Symbol               0
Normalized_loss      0
Wheel_base           0
Length               0
Width                0
Height               0
Curb_weight          0
Engine_size          0
Bore                 0
Stroke               0
Compression_ratio    0
Horse_power          0
Peak_rpm             0
City_mpg             0
Highway_mpg          0
Price                0
dtype: int64
In [11]:
# Alternative: impute the missing values with KNNImputer instead of dropping rows
# from sklearn.impute import KNNImputer

# imputer = KNNImputer(n_neighbors=5)
# cars = pd.DataFrame(imputer.fit_transform(cars),columns = cars.columns)
# cars.isna().sum()

Prediction

1. Univariate Model

We'll use holdout validation and, later, k-fold cross-validation to build and evaluate the predictive model.
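For orientation, here is a minimal sketch of the two schemes using scikit-learn helpers (train_test_split appears here only for illustration; the functions below perform the 50/50 split by hand instead):

from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.neighbors import KNeighborsRegressor
import numpy as np

# Holdout: a single split; train on one half, score on the other
train, test = train_test_split(cars, test_size=0.5, random_state=0)

# K-fold: every row is used for both training and testing across folds
kf = KFold(n_splits=5, shuffle=True, random_state=1)
mses = cross_val_score(KNeighborsRegressor(), cars[['Engine_size']], cars['Price'],
                       scoring='neg_mean_squared_error', cv=kf)
print(np.sqrt(np.abs(mses)).mean())  # average RMSE across the folds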

In [12]:
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, KFold
import numpy as np
# import math

def knn_train_test_univariate(df, feature_col, target_col):
    
    # train and test sets
    np.random.seed(0)
    shuffled_index = np.random.permutation(df.index)
    df = df.reindex(index = shuffled_index)
    split_loc = int(0.5*len(df))

    train_set = df.iloc[:split_loc].copy()
    test_set  = df.iloc[split_loc:].copy()
    
    # model building
    model = KNeighborsRegressor()
    model.fit(train_set[[feature_col]], train_set[target_col])
    predictions = model.predict(test_set[[feature_col]])
    rmse = np.sqrt(mean_squared_error(test_set[target_col], predictions))

    return rmse
                
In [13]:
all_features = cars.columns.tolist()
all_features.remove('Price')

rmse_dict = {}
for col in all_features:
    rmse_dict[col] = knn_train_test_univariate(cars, col, 'Price')
    
rmse_dict = sorted(rmse_dict.items(), key=lambda x: x[1])
print('The following are the rmse values for each feature column:')
print(' ')
rmse_dict
The following are the rmse values for each feature column:
 
Out[13]:
[('Engine_size', 2768.7024577193997),
 ('City_mpg', 3801.9522491944367),
 ('Horse_power', 3840.8349432699265),
 ('Highway_mpg', 3961.0130844072405),
 ('Curb_weight', 4017.1563713857277),
 ('Width', 4540.626474420263),
 ('Length', 5332.7291227377045),
 ('Wheel_base', 5930.3202334562175),
 ('Compression_ratio', 6406.467140561386),
 ('Bore', 6773.123772768155),
 ('Stroke', 6827.417884190247),
 ('Peak_rpm', 7014.6680399077395),
 ('Symbol', 7354.647765392282),
 ('Height', 7610.692804694108),
 ('Normalized_loss', 7987.618494882803)]

Using the default number of neighbors (k = 5), Engine_size gave the best prediction of car prices.

Modifying the function to use various values of k:

In [14]:
def knn_train_test_univariate(df, feature_col, target_col, k_values):
    
    # train and test sets
    np.random.seed(0)
    shuffled_index = np.random.permutation(df.index)
    df = df.reindex(index = shuffled_index)
    split_loc = int(0.5*len(df))

    train_set = df.iloc[:split_loc].copy()
    test_set  = df.iloc[split_loc:].copy()
    
    #model building
    k_rmse = {}
    for k in k_values:
        
        model = KNeighborsRegressor(n_neighbors = k)
        model.fit(train_set[[feature_col]], train_set[target_col])
        predictions = model.predict(test_set[[feature_col]])
        k_rmse[k] = np.sqrt(mean_squared_error(test_set[target_col], predictions))

    return k_rmse
In [15]:
# which k gives the best model for each feature column?
hyper_params = [1,3,5,7,9]

rmses_dict = {}
for col in all_features:
    rmses_dict[col] = knn_train_test_univariate(cars, col, 'Price', hyper_params)

print('Features and their error metrics for each k-value: ')
# rmses_dict = sorted(rmses_dict.items(), key=lambda x: x[1])
rmses_dict
Features and their error metrics for each k-value: 
Out[15]:
{'Bore': {1: 6525.949489711673,
  3: 6541.776528330883,
  5: 6773.123772768155,
  7: 6729.9829252536565,
  9: 6357.036118920397},
 'City_mpg': {1: 4664.349620412614,
  3: 3720.345109563052,
  5: 3801.9522491944367,
  7: 4180.392991380315,
  9: 4296.643528372117},
 'Compression_ratio': {1: 8429.72226860252,
  3: 7020.903704767295,
  5: 6406.467140561386,
  7: 6184.377890549715,
  9: 6411.835343433226},
 'Curb_weight': {1: 5496.564537175945,
  3: 4695.644900063398,
  5: 4017.1563713857277,
  7: 3992.3539935803296,
  9: 4117.954136455427},
 'Engine_size': {1: 3135.512141860174,
  3: 2765.713844538874,
  5: 2768.7024577193997,
  7: 3129.136821023901,
  9: 3428.5532010324764},
 'Height': {1: 9213.420979437002,
  3: 8154.015954739659,
  5: 7610.692804694108,
  7: 7434.132286129642,
  9: 7225.565525315033},
 'Highway_mpg': {1: 4912.499136171588,
  3: 4013.776142072605,
  5: 3961.0130844072405,
  7: 4132.843966396334,
  9: 4141.612716576377},
 'Horse_power': {1: 4025.957210890042,
  3: 4071.865575599259,
  5: 3840.8349432699265,
  7: 3759.8430295951794,
  9: 3855.726260320768},
 'Length': {1: 4729.385459522438,
  3: 4641.256909859858,
  5: 5332.7291227377045,
  7: 5699.182523221176,
  9: 5906.604384918108},
 'Normalized_loss': {1: 6575.726293624134,
  3: 7150.762204024888,
  5: 7987.618494882803,
  7: 7959.284364487943,
  9: 7799.910037272096},
 'Peak_rpm': {1: 8427.859651731176,
  3: 7314.371808190612,
  5: 7014.6680399077395,
  7: 7115.832016066374,
  9: 7396.66892661851},
 'Stroke': {1: 6192.3391013729015,
  3: 6581.129120229773,
  5: 6827.417884190247,
  7: 6995.072405390737,
  9: 7340.819986586927},
 'Symbol': {1: 6760.109693796761,
  3: 7667.881427369129,
  5: 7354.647765392282,
  7: 7272.019699866985,
  9: 7201.320970947845},
 'Wheel_base': {1: 4315.016501981323,
  3: 5749.194776628336,
  5: 5930.3202334562175,
  7: 6209.168849787841,
  9: 6301.256464775368},
 'Width': {1: 3336.9678748735573,
  3: 4697.335149820219,
  5: 4540.626474420263,
  7: 4829.883257740291,
  9: 4944.217366473074}}
In [16]:
# Creating a dataframe to hold each feature and its error metrics for each value of k
data = pd.DataFrame.from_dict(rmses_dict)
# data.insert(loc = 0, column = 'N_neighbors', value = [1,3,5,7,9])
data
Out[16]:
Bore City_mpg Compression_ratio Curb_weight Engine_size Height Highway_mpg Horse_power Length Normalized_loss Peak_rpm Stroke Symbol Wheel_base Width
1 6525.949490 4664.349620 8429.722269 5496.564537 3135.512142 9213.420979 4912.499136 4025.957211 4729.385460 6575.726294 8427.859652 6192.339101 6760.109694 4315.016502 3336.967875
3 6541.776528 3720.345110 7020.903705 4695.644900 2765.713845 8154.015955 4013.776142 4071.865576 4641.256910 7150.762204 7314.371808 6581.129120 7667.881427 5749.194777 4697.335150
5 6773.123773 3801.952249 6406.467141 4017.156371 2768.702458 7610.692805 3961.013084 3840.834943 5332.729123 7987.618495 7014.668040 6827.417884 7354.647765 5930.320233 4540.626474
7 6729.982925 4180.392991 6184.377891 3992.353994 3129.136821 7434.132286 4132.843966 3759.843030 5699.182523 7959.284364 7115.832016 6995.072405 7272.019700 6209.168850 4829.883258
9 6357.036119 4296.643528 6411.835343 4117.954136 3428.553201 7225.565525 4141.612717 3855.726260 5906.604385 7799.910037 7396.668927 7340.819987 7201.320971 6301.256465 4944.217366
In [17]:
# Visualizing on line chart
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

#plotting
for col in list(data.columns):
    data[col].plot(figsize = (13,8))

plt.title('Error metric (RMSE) for each k value')
plt.xlabel('K value')
plt.ylabel('RMSE values')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

For 'Bore', the model performs best at k = 9, which gives its lowest RMSE.

All the other features' best k values can be read off the same line chart.
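Rather than reading each curve off the chart, the best k per feature can also be pulled out of the data frame built above (a sketch; best_k is a hypothetical name):

# For each feature: the k with the lowest RMSE, and that RMSE
best_k = pd.DataFrame({'best_k': data.idxmin(), 'rmse': data.min()})
print(best_k.sort_values('rmse'))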

2. Multivariate Model

In [18]:
# Modifying the function to accept list of feature columns
def knn_train_test_multivariate(df, feature_cols, target_col):
    
    # train and test sets
    np.random.seed(0)
    shuffled_index = np.random.permutation(df.index)
    df = df.reindex(index = shuffled_index)
    split_loc = int(0.5*len(df))

    train_set = df.iloc[:split_loc].copy()
    test_set  = df.iloc[split_loc:].copy()
    
    #model building
    model = KNeighborsRegressor(n_neighbors = 5)
    model.fit(train_set[feature_cols], train_set[target_col])
    predictions = model.predict(test_set[feature_cols])
    rmse = np.sqrt(mean_squared_error(test_set[target_col], predictions))

    return rmse
In [19]:
# Rank the features by their average RMSE across the k values tried
mean_rmses = data.mean().sort_values()
best_features = list(mean_rmses.index)
In [20]:
two_features = best_features[:2]
three_features = best_features[:3]
four_features = best_features[:4]
five_features = best_features[:5]

select_lst = [two_features, three_features, four_features,five_features]

select_output = []
for item in select_lst:
    select_output.append(knn_train_test_multivariate(cars, item, 'Price'))
    
select_dict = {'two_features': select_output[0],
              'three_features': select_output[1],
              'four_features': select_output[2],
              'five_features': select_output[3]
              }
select_dict
Out[20]:
{'five_features': 3948.5887004549895,
 'four_features': 3083.271094949073,
 'three_features': 3078.7980786456533,
 'two_features': 2996.8873214940795}

At the default k = 5, the best model is the one with two independent variables, Engine_size and Horse_power, with an RMSE of about 2997; the three- and four-feature sets are close behind.

Hyperparameter Optimization

In [21]:
def knn_train_test_multivariate(df, feature_cols, target_col, k_values):
    
    # train and test sets
    np.random.seed(0)
    shuffled_index = np.random.permutation(df.index)
    df = df.reindex(index = shuffled_index)
    split_loc = int(0.5*len(df))

    train_set = df.iloc[:split_loc].copy()
    test_set  = df.iloc[split_loc:].copy()
    
    #model building
    k_rmse = {}
    for k in k_values:
        
        model = KNeighborsRegressor(n_neighbors = k)
        model.fit(train_set[feature_cols], train_set[target_col])
        predictions = model.predict(test_set[feature_cols])
        k_rmse[k] = np.sqrt(mean_squared_error(test_set[target_col], predictions))

    return k_rmse
In [22]:
hyp_params = list(range(1, 26))
top_three = [two_features, three_features, four_features]

selected = []
for item in top_three:
    selected.append(knn_train_test_multivariate(cars, item, 'Price', hyp_params))

print('Features and their error metrics for each k-value: ')
best_dict = {'two_features': selected[0],
              'three_features': selected[1],
              'four_features': selected[2],
              }
best_dict
Features and their error metrics for each k-value: 
Out[22]:
{'four_features': {1: 2610.6627234612483,
  2: 2674.6854266330647,
  3: 2495.415808033366,
  4: 2737.609650630784,
  5: 3083.271094949073,
  6: 3130.173688680205,
  7: 3384.5351687709153,
  8: 3560.6258772476385,
  9: 3489.3617381363333,
  10: 3574.2210022302074,
  11: 3642.6804175077923,
  12: 3675.8881501828423,
  13: 3841.6091062898613,
  14: 3925.294206504302,
  15: 3882.231349399912,
  16: 3946.6363421326932,
  17: 3878.929278856831,
  18: 3893.0936284329778,
  19: 3911.6103208045697,
  20: 3961.8887642849104,
  21: 3999.9386784900307,
  22: 4070.518779726736,
  23: 4162.771545655721,
  24: 4222.27321782126,
  25: 4282.307729786075},
 'three_features': {1: 2534.588538965579,
  2: 2680.5878404005143,
  3: 2464.771925999155,
  4: 2729.525458407154,
  5: 3078.7980786456533,
  6: 3148.583470886329,
  7: 3383.9058919272175,
  8: 3577.165165323855,
  9: 3494.9357692498993,
  10: 3597.8367148899783,
  11: 3678.316396079907,
  12: 3711.0517621578206,
  13: 3860.557996264031,
  14: 3930.9075907490123,
  15: 3882.5919982742203,
  16: 3944.00229722676,
  17: 3905.7462793347754,
  18: 3905.309340629744,
  19: 3916.592517113872,
  20: 3942.3319701492446,
  21: 3995.3507403242816,
  22: 4059.7959116530124,
  23: 4156.537589932155,
  24: 4237.941348741846,
  25: 4287.16475174757},
 'two_features': {1: 2807.9143358595848,
  2: 2759.9046982276345,
  3: 2354.72887298986,
  4: 2587.626322555174,
  5: 2996.8873214940795,
  6: 3095.941706834793,
  7: 3351.095825242845,
  8: 3557.2028865114526,
  9: 3524.1670038672614,
  10: 3599.2079940642616,
  11: 3682.8374655370367,
  12: 3733.954887479565,
  13: 3818.342821734111,
  14: 3810.6160085975616,
  15: 3805.6453609757978,
  16: 3869.710323645115,
  17: 3850.6126908392766,
  18: 3891.1983452584423,
  19: 3926.9249455122713,
  20: 3963.2914529408017,
  21: 4004.4869239008185,
  22: 4079.0805793509444,
  23: 4175.556938053969,
  24: 4240.634363607186,
  25: 4278.664903559106}}
In [23]:
best_df = pd.DataFrame.from_dict(best_dict)
best_df
Out[23]:
four_features three_features two_features
1 2610.662723 2534.588539 2807.914336
2 2674.685427 2680.587840 2759.904698
3 2495.415808 2464.771926 2354.728873
4 2737.609651 2729.525458 2587.626323
5 3083.271095 3078.798079 2996.887321
6 3130.173689 3148.583471 3095.941707
7 3384.535169 3383.905892 3351.095825
8 3560.625877 3577.165165 3557.202887
9 3489.361738 3494.935769 3524.167004
10 3574.221002 3597.836715 3599.207994
11 3642.680418 3678.316396 3682.837466
12 3675.888150 3711.051762 3733.954887
13 3841.609106 3860.557996 3818.342822
14 3925.294207 3930.907591 3810.616009
15 3882.231349 3882.591998 3805.645361
16 3946.636342 3944.002297 3869.710324
17 3878.929279 3905.746279 3850.612691
18 3893.093628 3905.309341 3891.198345
19 3911.610321 3916.592517 3926.924946
20 3961.888764 3942.331970 3963.291453
21 3999.938678 3995.350740 4004.486924
22 4070.518780 4059.795912 4079.080579
23 4162.771546 4156.537590 4175.556938
24 4222.273218 4237.941349 4240.634364
25 4282.307730 4287.164752 4278.664904
In [24]:
#plotting
for col in list(best_df.columns):
    best_df[col].plot(figsize = (13,8))

plt.title('Error metric (RMSE) for each k value')
plt.xlabel('K value')
plt.ylabel('RMSE values')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
In [25]:
# minimum rmse for each feature set
best_df.min()
Out[25]:
four_features     2495.415808
three_features    2464.771926
two_features      2354.728873
dtype: float64

Additional: Using the k-fold cross-validation technique for the four candidate feature sets

In [26]:
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, KFold
import numpy as np

def knn_train_test_kfold(df, feature_cols, target_col, n_folds):
    '''
    n_folds: a list of fold counts to try. Note that only the RMSE for
    the last fold count in the list is returned.
    '''
    # keep the same shuffled 50/50 split as the holdout experiments,
    # and cross-validate on the training half
    np.random.seed(0)
    shuffled_index = np.random.permutation(df.index)
    df = df.reindex(index = shuffled_index)
    split_loc = int(0.5*len(df))

    train_set = df.iloc[:split_loc].copy()
    test_set  = df.iloc[split_loc:].copy()

    # splitting the training half into folds
    for fold in n_folds:
        kf = KFold(fold, shuffle=True, random_state=1)

        # model building
        model = KNeighborsRegressor()
        mses = cross_val_score(model, train_set[feature_cols], train_set[target_col],
                               scoring="neg_mean_squared_error", cv=kf)

        mean_mse = np.mean(np.abs(mses))
        kfold_rmse = np.sqrt(mean_mse)

    return kfold_rmse
               
In [27]:
n_folds = [5,9,13,15,19]
kfold_rmses = []
for lst in select_lst:
    kfold_rmses.append(knn_train_test_kfold(cars, lst, 'Price', n_folds))

kfold_dict = {'Two best features': kfold_rmses[0], 'Three best features': kfold_rmses[1], 
              'Four best features': kfold_rmses[2],'Five best features': kfold_rmses[3]}
kfold_dict
Out[27]:
{'Five best features': 4403.893418376636,
 'Four best features': 3410.8686574207804,
 'Three best features': 3556.772181805177,
 'Two best features': 3645.2146005798754}
Summary of the models' performance

1. Univariate model

In [28]:
print('The minimum rmse for each predictor variable:')
print(' ')
print(data.min())
The minimum rmse for each predictor variable:
 
Bore                 6357.036119
City_mpg             3720.345110
Compression_ratio    6184.377891
Curb_weight          3992.353994
Engine_size          2765.713845
Height               7225.565525
Highway_mpg          3961.013084
Horse_power          3759.843030
Length               4641.256910
Normalized_loss      6575.726294
Peak_rpm             7014.668040
Stroke               6192.339101
Symbol               6760.109694
Wheel_base           4315.016502
Width                3336.967875
dtype: float64

Engine_size is the best single predictor, with an RMSE of about 2766 (at k = 3).

2. Multivariate model

In [29]:
print('The minimum rmse for each set of predictor variables:')
print(' ')
print(best_df.min())
The minimum rmse for each set of predictor variables:
 
four_features     2495.415808
three_features    2464.771926
two_features      2354.728873
dtype: float64

The best multivariate model (the two-feature set at k = 3) has an RMSE of about 2355.

3. Kfold Validation Technique

In [30]:
kfold_dict
Out[30]:
{'Five best features': 4403.893418376636,
 'Four best features': 3410.8686574207804,
 'Three best features': 3556.772181805177,
 'Two best features': 3645.2146005798754}

The best k-fold result comes from the four-feature set, with an RMSE of about 3411.

Conclusion

The best performing model among the three options is the multivariate model with two independent variables and k = 3 neighbours.

This implies that Engine_size and Horse_power are the best predictors of car prices. We'll therefore use these two features to generate predicted prices.

Prediction

In [31]:
np.random.seed(0)
shuffled_index = np.random.permutation(cars.index)
cars = cars.reindex(index = shuffled_index)
split_loc = int(0.5*len(cars))

train_set = cars.iloc[:split_loc].copy()
test_set  = cars.iloc[split_loc:].copy()

knn_model = KNeighborsRegressor(n_neighbors = 3)
knn_model.fit(train_set[two_features], train_set['Price'])
test_set['Predicted_price'] = knn_model.predict(test_set[two_features])
test_set.head()
Out[31]:
Symbol Normalized_loss Wheel_base Length Width Height Curb_weight Engine_size Bore Stroke Compression_ratio Horse_power Peak_rpm City_mpg Highway_mpg Price Predicted_price
101 0 108.0 100.4 184.6 66.5 56.1 3296 181 3.43 3.27 9.0 152.0 5200.0 17 22 14399.0 16365.666667
199 -1 95.0 109.1 188.8 68.9 55.5 2952 141 3.78 3.15 9.5 114.0 5400.0 23 28 16845.0 17518.333333
102 0 108.0 100.4 184.6 66.5 55.1 3060 181 3.43 3.27 9.0 152.0 5200.0 19 25 13499.0 16365.666667
71 3 142.0 96.6 180.3 70.5 50.8 3685 234 3.46 3.10 8.3 155.0 4750.0 16 18 35056.0 33994.666667
147 0 85.0 96.9 173.6 65.4 54.9 2420 108 3.62 2.64 9.0 82.0 4800.0 23 29 8013.0 7911.000000
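With the model fitted, pricing a new car only takes its two feature values. A sketch with hypothetical figures (an engine size of 130 and 111 horsepower; not a row from the held-out set):

# Hypothetical car: Engine_size = 130, Horse_power = 111.0
new_car = pd.DataFrame({'Engine_size': [130], 'Horse_power': [111.0]})
print(knn_model.predict(new_car[two_features]))  # estimated price in dollars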