Predicting Car Prices

Introduction

This project aims to predict cars' market prices based on their attributes, such as curb weight, horsepower, and miles per gallon, among others.

The automobile dataset used is from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/automobile.

About the dataset

This dataset consists of three entity types:

(a) Specification - the various characteristics of the auto

(b) Assigned insurance risk rating - the degree to which the auto is more risky than its price indicates

(c) Normalized losses in use - the relative average loss payment per insured vehicle year, as compared to other cars

Reading the dataframe
In [1]:
import pandas as pd
# NB: imports-85.data has no header row, so this read silently
# promotes the first record to column names (see the note below)
cars = pd.read_csv('imports-85.data')

# First few rows of the data
cars.head(3)
Out[1]:
3 ? alfa-romero gas std two convertible rwd front 88.60 ... 130 mpfi 3.47 2.68 9.00 111 5000 21 27 13495
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
1 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
2 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950

3 rows × 26 columns

The dataset has 26 columns, each giving some information about the autos.

However, the raw file has no header row, so pandas promoted the first record to column names; that is why the header above shows data values and the frame holds 204 of the dataset's 205 records. The columns therefore need descriptive names.
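A minimal sketch of a re-read that would keep all 205 records, assuming the new_cols list defined in the next cell:

# Hypothetical fix: header=None stops pandas from consuming the first
# record, and names= applies descriptive column names in one step
cars = pd.read_csv('imports-85.data', header=None, names=new_cols)

To keep the outputs below reproducible, we proceed with the 204-record frame as read above.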

Data Cleaning
In [2]:
#Renaming the columns
new_cols = ['Symbol', 'Normalized_loss', 'Make', 'Fuel_type', 'Aspiration', 'No_of_doors', 'Body_style', 'Drive_wheels',
           'Engine_loc', 'Wheel_base', 'Length', 'Width', 'Height', 'Curb_weight', 'Engine_type', 'No_of_cylinders',
           'Engine_size', 'Fuel_system', 'Bore', 'Stroke', 'Compression_ratio', 'Horse_power', 'Peak_rpm', 'City_mpg',
           'Highway_mpg', 'Price']
cars.columns = new_cols
cars.head(2)
Out[2]:
Symbol Normalized_loss Make Fuel_type Aspiration No_of_doors Body_style Drive_wheels Engine_loc Wheel_base ... Engine_size Fuel_system Bore Stroke Compression_ratio Horse_power Peak_rpm City_mpg Highway_mpg Price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
1 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500

2 rows × 26 columns

In [3]:
print('The dataset has {} records and {} columns'.format(cars.shape[0], cars.shape[1]))
print(' ')
print('Info on number of non-null values and the datatype of each column: ')
print(' ')
cars.info()
The dataset has 204 records and 26 columns
 
Info on number of non-null values and the datatype of each column: 
 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 26 columns):
Symbol               204 non-null int64
Normalized_loss      204 non-null object
Make                 204 non-null object
Fuel_type            204 non-null object
Aspiration           204 non-null object
No_of_doors          204 non-null object
Body_style           204 non-null object
Drive_wheels         204 non-null object
Engine_loc           204 non-null object
Wheel_base           204 non-null float64
Length               204 non-null float64
Width                204 non-null float64
Height               204 non-null float64
Curb_weight          204 non-null int64
Engine_type          204 non-null object
No_of_cylinders      204 non-null object
Engine_size          204 non-null int64
Fuel_system          204 non-null object
Bore                 204 non-null object
Stroke               204 non-null object
Compression_ratio    204 non-null float64
Horse_power          204 non-null object
Peak_rpm             204 non-null object
City_mpg             204 non-null int64
Highway_mpg          204 non-null int64
Price                204 non-null object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.5+ KB

Several columns are stored as the object datatype even though their values are, or should be, either int or float:

  • Normalized_loss
  • Bore
  • Stroke
  • Horse_power
  • Peak_rpm
  • Price

Such columns need to be cleaned before continuing with the analysis; the dataset marks missing values with '?', which forces these columns to the object datatype.

In [4]:
import numpy as np
In [5]:
# list of columns that need some cleaning
numeric_cols = ['Normalized_loss', 'Bore', 'Stroke', 'Horse_power', 'Peak_rpm', 'Price']

# Strip stray whitespace and replace the '?' placeholders with NaN
def strip_cols(df):
    for col in numeric_cols:
        df[col] = df[col].str.strip().replace('?', np.nan)

    return df

cars = strip_cols(cars)
cars[numeric_cols] = cars[numeric_cols].apply(pd.to_numeric)
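An equivalent, more compact route is to let to_numeric do the coercion itself; a sketch (errors='coerce' turns any unparseable string, including the '?' placeholder, straight into NaN):

# Alternative: coerce the placeholders to NaN in a single step per column
for col in numeric_cols:
    cars[col] = pd.to_numeric(cars[col], errors='coerce')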
In [6]:
cars.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 204 entries, 0 to 203
Data columns (total 26 columns):
Symbol               204 non-null int64
Normalized_loss      164 non-null float64
Make                 204 non-null object
Fuel_type            204 non-null object
Aspiration           204 non-null object
No_of_doors          204 non-null object
Body_style           204 non-null object
Drive_wheels         204 non-null object
Engine_loc           204 non-null object
Wheel_base           204 non-null float64
Length               204 non-null float64
Width                204 non-null float64
Height               204 non-null float64
Curb_weight          204 non-null int64
Engine_type          204 non-null object
No_of_cylinders      204 non-null object
Engine_size          204 non-null int64
Fuel_system          204 non-null object
Bore                 200 non-null float64
Stroke               200 non-null float64
Compression_ratio    204 non-null float64
Horse_power          202 non-null float64
Peak_rpm             202 non-null float64
City_mpg             204 non-null int64
Highway_mpg          204 non-null int64
Price                200 non-null float64
dtypes: float64(11), int64(5), object(10)
memory usage: 41.5+ KB

In this mini-project, we'll use only the numeric columns for prediction and ignore the string-valued ones. In a fuller treatment, some of the object columns could be encoded for better results, as sketched below.
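A minimal sketch of such an encoding (pd.get_dummies one-hot encodes each category into its own 0/1 column; the column choice here is just for illustration, and the result is not used further):

# Sketch: one-hot encode two of the categorical columns
pd.get_dummies(cars[['Fuel_type', 'Body_style']]).head(2)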

In [7]:
cars = cars.select_dtypes(exclude = ['object']).copy()
cars.head()
Out[7]:
Symbol Normalized_loss Wheel_base Length Width Height Curb_weight Engine_size Bore Stroke Compression_ratio Horse_power Peak_rpm City_mpg Highway_mpg Price
0 3 NaN 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 9.0 111.0 5000.0 21 27 16500.0
1 1 NaN 94.5 171.2 65.5 52.4 2823 152 2.68 3.47 9.0 154.0 5000.0 19 26 16500.0
2 2 164.0 99.8 176.6 66.2 54.3 2337 109 3.19 3.40 10.0 102.0 5500.0 24 30 13950.0
3 2 164.0 99.4 176.6 66.4 54.3 2824 136 3.19 3.40 8.0 115.0 5500.0 18 22 17450.0
4 2 NaN 99.8 177.3 66.3 53.1 2507 136 3.19 3.40 8.5 110.0 5500.0 19 25 15250.0
Handling missing values
In [8]:
missing_vals = cars.columns[cars.isna().any()]
cars[missing_vals].isna().sum()
Out[8]:
Normalized_loss    40
Bore                4
Stroke              4
Horse_power         2
Peak_rpm            2
Price               4
dtype: int64
In [9]:
# An overview of the rows with missing values
null_data = cars[cars.isnull().any(axis=1)]
null_data
Out[9]:
Symbol Normalized_loss Wheel_base Length Width Height Curb_weight Engine_size Bore Stroke Compression_ratio Horse_power Peak_rpm City_mpg Highway_mpg Price
0 3 NaN 88.6 168.8 64.1 48.8 2548 130 3.47 2.68 9.0 111.0 5000.0 21 27 16500.0
1 1 NaN 94.5 171.2 65.5 52.4 2823 152 2.68 3.47 9.0 154.0 5000.0 19 26 16500.0
4 2 NaN 99.8 177.3 66.3 53.1 2507 136 3.19 3.40 8.5 110.0 5500.0 19 25 15250.0
6 1 NaN 105.8 192.7 71.4 55.7 2954 136 3.19 3.40 8.5 110.0 5500.0 19 25 18920.0
8 0 NaN 99.5 178.2 67.9 52.0 3053 131 3.13 3.40 7.0 160.0 5500.0 16 22 NaN
13 1 NaN 103.5 189.0 66.9 55.7 3055 164 3.31 3.19 9.0 121.0 4250.0 20 25 24565.0
14 0 NaN 103.5 189.0 66.9 55.7 3230 209 3.62 3.39 8.0 182.0 5400.0 16 22 30760.0
15 0 NaN 103.5 193.8 67.9 53.7 3380 209 3.62 3.39 8.0 182.0 5400.0 16 22 41315.0
16 0 NaN 110.0 197.0 70.9 56.3 3505 209 3.62 3.39 8.0 182.0 5400.0 15 20 36880.0
42 0 NaN 94.3 170.7 61.8 53.5 2337 111 3.31 3.23 8.5 78.0 4800.0 24 29 6785.0
43 1 NaN 94.5 155.9 63.6 52.0 1874 90 3.03 3.11 9.6 70.0 5400.0 38 43 NaN
44 0 NaN 94.5 155.9 63.6 52.0 1909 90 3.03 3.11 9.6 70.0 5400.0 38 43 NaN
45 2 NaN 96.0 172.6 65.2 51.4 2734 119 3.43 3.23 9.2 90.0 5000.0 24 29 11048.0
47 0 NaN 113.0 199.6 69.6 52.8 4066 258 3.63 4.17 8.1 176.0 4750.0 15 19 35550.0
48 0 NaN 102.0 191.7 70.6 47.8 3950 326 3.54 2.76 11.5 262.0 5000.0 13 17 36000.0
54 3 150.0 95.3 169.0 65.7 49.6 2380 70 NaN NaN 9.4 101.0 6000.0 17 23 10945.0
55 3 150.0 95.3 169.0 65.7 49.6 2380 70 NaN NaN 9.4 101.0 6000.0 17 23 11845.0
56 3 150.0 95.3 169.0 65.7 49.6 2385 70 NaN NaN 9.4 101.0 6000.0 17 23 13645.0
57 3 150.0 95.3 169.0 65.7 49.6 2500 80 NaN NaN 9.4 135.0 6000.0 16 23 15645.0
62 0 NaN 98.8 177.8 66.5 55.5 2443 122 3.39 3.39 22.7 64.0 4650.0 36 42 10795.0
65 0 NaN 104.9 175.0 66.1 54.4 2700 134 3.43 3.64 22.0 72.0 4200.0 31 39 18344.0
70 -1 NaN 115.6 202.6 71.7 56.5 3740 234 3.46 3.10 8.3 155.0 4750.0 16 18 34184.0
72 0 NaN 120.9 208.1 71.7 56.7 3900 308 3.80 3.35 8.0 184.0 4500.0 14 16 40960.0
73 1 NaN 112.0 199.2 72.0 55.4 3715 304 3.80 3.35 8.0 184.0 4500.0 14 16 45400.0
74 1 NaN 102.7 178.4 68.0 54.8 2910 140 3.78 3.12 8.0 175.0 5000.0 19 24 16503.0
81 3 NaN 95.9 173.2 66.3 50.2 2833 156 3.58 3.86 7.0 145.0 5000.0 19 24 12629.0
82 3 NaN 95.9 173.2 66.3 50.2 2921 156 3.59 3.86 7.0 145.0 5000.0 19 24 14869.0
83 3 NaN 95.9 173.2 66.3 50.2 2926 156 3.59 3.86 7.0 145.0 5000.0 19 24 14489.0
108 0 NaN 114.2 198.9 68.4 58.7 3230 120 3.46 3.19 8.4 97.0 5000.0 19 24 12440.0
109 0 NaN 114.2 198.9 68.4 58.7 3430 152 3.70 3.52 21.0 95.0 4150.0 25 25 13860.0
112 0 NaN 114.2 198.9 68.4 56.7 3285 120 3.46 2.19 8.4 95.0 5000.0 19 24 16695.0
113 0 NaN 114.2 198.9 68.4 58.7 3485 152 3.70 3.52 21.0 95.0 4150.0 25 25 17075.0
123 3 NaN 95.9 173.2 66.3 50.2 2818 156 3.59 3.86 7.0 145.0 5000.0 19 24 12764.0
125 3 NaN 89.5 168.9 65.0 51.6 2756 194 3.74 2.90 9.5 207.0 5900.0 17 25 32528.0
126 3 NaN 89.5 168.9 65.0 51.6 2756 194 3.74 2.90 9.5 207.0 5900.0 17 25 34028.0
127 3 NaN 89.5 168.9 65.0 51.6 2800 194 3.74 2.90 9.5 207.0 5900.0 17 25 37028.0
128 1 NaN 98.4 175.7 72.3 50.5 3366 203 3.94 3.11 10.0 288.0 5750.0 17 28 NaN
129 0 NaN 96.1 181.5 66.5 55.2 2579 132 3.46 3.90 8.7 NaN NaN 23 31 9295.0
130 2 NaN 96.1 176.8 66.6 50.5 2460 132 3.46 3.90 8.7 NaN NaN 23 31 9895.0
180 -1 NaN 104.5 187.8 66.5 54.1 3151 161 3.27 3.35 9.2 156.0 5200.0 19 24 15750.0
188 3 NaN 94.5 159.3 64.2 55.6 2254 109 3.19 3.40 8.5 90.0 5500.0 24 29 11595.0
190 0 NaN 100.4 180.2 66.9 55.1 2661 136 3.19 3.40 8.5 110.0 5500.0 19 24 13295.0
191 0 NaN 100.4 180.2 66.9 55.1 2579 97 3.01 3.40 23.0 68.0 4500.0 33 38 13845.0
192 0 NaN 100.4 183.1 66.9 55.1 2563 109 3.19 3.40 9.0 88.0 5500.0 25 31 12290.0

Here's how the missing values will be handled:

  • Rows with missing values in Bore, Stroke, Horse_power, Peak_rpm or Price will be dropped, since these rows tend to have missing values in other columns as well
  • Missing values in the Normalized_loss column will be replaced with that column's mean
In [10]:
cars = cars.dropna(subset = ['Bore', 'Stroke', 'Horse_power', 'Peak_rpm', 'Price'])
avg_loss = cars['Normalized_loss'].mean()
cars['Normalized_loss'] = cars['Normalized_loss'].fillna(value = avg_loss)
cars.isna().sum()
Out[10]:
Symbol               0
Normalized_loss      0
Wheel_base           0
Length               0
Width                0
Height               0
Curb_weight          0
Engine_size          0
Bore                 0
Stroke               0
Compression_ratio    0
Horse_power          0
Peak_rpm             0
City_mpg             0
Highway_mpg          0
Price                0
dtype: int64
In [11]:
# Alternative: impute the missing values with KNNImputer instead of dropping rows
# from sklearn.impute import KNNImputer

# imputer = KNNImputer(n_neighbors=5)
# cars = pd.DataFrame(imputer.fit_transform(cars),columns = cars.columns)
# cars.isna().sum()

Prediction

1. Univariate Model

We'll use holdout validation and, later, k-fold cross-validation to build and evaluate the predictive model.
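For orientation, here is a minimal sketch of the two schemes using scikit-learn helpers (train_test_split appears here only for illustration; the functions below perform the 50/50 split by hand instead):

from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.neighbors import KNeighborsRegressor
import numpy as np

# Holdout: a single split; train on one half, score on the other
train, test = train_test_split(cars, test_size=0.5, random_state=0)

# K-fold: every row is used for both training and testing across folds
kf = KFold(n_splits=5, shuffle=True, random_state=1)
mses = cross_val_score(KNeighborsRegressor(), cars[['Engine_size']], cars['Price'],
                       scoring='neg_mean_squared_error', cv=kf)
print(np.sqrt(np.abs(mses)).mean())  # average RMSE across the folds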

In [12]:
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, KFold
import numpy as np
# import math

def knn_train_test_univariate(df, feature_col, target_col):
    
    # train and test sets
    np.random.seed(0)
    shuffled_index = np.random.permutation(df.index)
    df = df.reindex(index = shuffled_index)
    split_loc = int(0.5*len(df))

    train_set = df.iloc[:split_loc].copy()
    test_set  = df.iloc[split_loc:].copy()
    
    # model building
    model = KNeighborsRegressor()
    model.fit(train_set[[feature_col]], train_set[target_col])
    predictions = model.predict(test_set[[feature_col]])
    rmse = np.sqrt(mean_squared_error(test_set[target_col], predictions))

    return rmse
                
In [13]:
all_features = cars.columns.tolist()
all_features.remove('Price')

rmse_dict = {}
for col in all_features:
    rmse_dict[col] = knn_train_test_univariate(cars, col, 'Price')
    
rmse_dict = sorted(rmse_dict.items(), key=lambda x: x[1])
print('The following are the rmse values for each feature column:')
print(' ')
rmse_dict
The following are the rmse values for each feature column:
 
Out[13]:
[('Engine_size', 2768.7024577193997),
 ('City_mpg', 3801.9522491944367),
 ('Horse_power', 3840.8349432699265),
 ('Highway_mpg', 3961.0130844072405),
 ('Curb_weight', 4017.1563713857277),
 ('Width', 4540.626474420263),
 ('Length', 5332.7291227377045),
 ('Wheel_base', 5930.3202334562175),
 ('Compression_ratio', 6406.467140561386),
 ('Bore', 6773.123772768155),
 ('Stroke', 6827.417884190247),
 ('Peak_rpm', 7014.6680399077395),
 ('Symbol', 7354.647765392282),
 ('Height', 7610.692804694108),
 ('Normalized_loss', 7987.618494882803)]

Using the default number of neighbors (k = 5), Engine_size gave the best prediction of car prices.

Modifying the function to use various values of k:

In [14]:
def knn_train_test_univariate(df, feature_col, target_col, k_values):
    
    # train and test sets
    np.random.seed(0)
    shuffled_index = np.random.permutation(df.index)
    df = df.reindex(index = shuffled_index)
    split_loc = int(0.5*len(df))

    train_set = df.iloc[:split_loc].copy()
    test_set  = df.iloc[split_loc:].copy()
    
    #model building
    k_rmse = {}
    for k in k_values:
        
        model = KNeighborsRegressor(n_neighbors = k)
        model.fit(train_set[[feature_col]], train_set[target_col])
        predictions = model.predict(test_set[[feature_col]])
        k_rmse[k] = np.sqrt(mean_squared_error(test_set[target_col], predictions))

    return k_rmse
In [15]:
# which k gives the best model for each feature column?
hyper_params = [1,3,5,7,9]

rmses_dict = {}
for col in all_features:
    rmses_dict[col] = knn_train_test_univariate(cars, col, 'Price', hyper_params)

print('Features and their error metrics for each k-value: ')
# rmses_dict = sorted(rmses_dict.items(), key=lambda x: x[1])
rmses_dict
Features and their error metrics for each k-value: 
Out[15]:
{'Bore': {1: 6525.949489711673,
  3: 6541.776528330883,
  5: 6773.123772768155,
  7: 6729.9829252536565,
  9: 6357.036118920397},
 'City_mpg': {1: 4664.349620412614,
  3: 3720.345109563052,
  5: 3801.9522491944367,
  7: 4180.392991380315,
  9: 4296.643528372117},
 'Compression_ratio': {1: 8429.72226860252,
  3: 7020.903704767295,
  5: 6406.467140561386,
  7: 6184.377890549715,
  9: 6411.835343433226},
 'Curb_weight': {1: 5496.564537175945,
  3: 4695.644900063398,
  5: 4017.1563713857277,
  7: 3992.3539935803296,
  9: 4117.954136455427},
 'Engine_size': {1: 3135.512141860174,
  3: 2765.713844538874,
  5: 2768.7024577193997,
  7: 3129.136821023901,
  9: 3428.5532010324764},
 'Height': {1: 9213.420979437002,
  3: 8154.015954739659,
  5: 7610.692804694108,
  7: 7434.132286129642,
  9: 7225.565525315033},
 'Highway_mpg': {1: 4912.499136171588,
  3: 4013.776142072605,
  5: 3961.0130844072405,
  7: 4132.843966396334,
  9: 4141.612716576377},
 'Horse_power': {1: 4025.957210890042,
  3: 4071.865575599259,
  5: 3840.8349432699265,
  7: 3759.8430295951794,
  9: 3855.726260320768},
 'Length': {1: 4729.385459522438,
  3: 4641.256909859858,
  5: 5332.7291227377045,
  7: 5699.182523221176,
  9: 5906.604384918108},
 'Normalized_loss': {1: 6575.726293624134,
  3: 7150.762204024888,
  5: 7987.618494882803,
  7: 7959.284364487943,
  9: 7799.910037272096},
 'Peak_rpm': {1: 8427.859651731176,
  3: 7314.371808190612,
  5: 7014.6680399077395,
  7: 7115.832016066374,
  9: 7396.66892661851},
 'Stroke': {1: 6192.3391013729015,
  3: 6581.129120229773,
  5: 6827.417884190247,
  7: 6995.072405390737,
  9: 7340.819986586927},
 'Symbol': {1: 6760.109693796761,
  3: 7667.881427369129,
  5: 7354.647765392282,
  7: 7272.019699866985,
  9: 7201.320970947845},
 'Wheel_base': {1: 4315.016501981323,
  3: 5749.194776628336,
  5: 5930.3202334562175,
  7: 6209.168849787841,
  9: 6301.256464775368},
 'Width': {1: 3336.9678748735573,
  3: 4697.335149820219,
  5: 4540.626474420263,
  7: 4829.883257740291,
  9: 4944.217366473074}}
In [16]:
# Creating a dataframe to hold each feature and its error metrics for each value of k
data = pd.DataFrame.from_dict(rmses_dict)
# data.insert(loc = 0, column = 'N_neighbors', value = [1,3,5,7,9])
data
Out[16]:
Bore City_mpg Compression_ratio Curb_weight Engine_size Height Highway_mpg Horse_power Length Normalized_loss Peak_rpm Stroke Symbol Wheel_base Width
1 6525.949490 4664.349620 8429.722269 5496.564537 3135.512142 9213.420979 4912.499136 4025.957211 4729.385460 6575.726294 8427.859652 6192.339101 6760.109694 4315.016502 3336.967875
3 6541.776528 3720.345110 7020.903705 4695.644900 2765.713845 8154.015955 4013.776142 4071.865576 4641.256910 7150.762204 7314.371808 6581.129120 7667.881427 5749.194777 4697.335150
5 6773.123773 3801.952249 6406.467141 4017.156371 2768.702458 7610.692805 3961.013084 3840.834943 5332.729123 7987.618495 7014.668040 6827.417884 7354.647765 5930.320233 4540.626474
7 6729.982925 4180.392991 6184.377891 3992.353994 3129.136821 7434.132286 4132.843966 3759.843030 5699.182523 7959.284364 7115.832016 6995.072405 7272.019700 6209.168850 4829.883258
9 6357.036119 4296.643528 6411.835343 4117.954136 3428.553201 7225.565525 4141.612717 3855.726260 5906.604385 7799.910037 7396.668927 7340.819987 7201.320971 6301.256465 4944.217366
In [17]:
# Visualizing on line chart
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline 

#plotting
for col in list(data.columns):
    data[col].plot(figsize = (13,8))

plt.title('Error metric (RMSE) for each k value')
plt.xlabel('K value')
plt.ylabel('RMSE values')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()

For 'Bore', the model performs best at k = 9, which gives its lowest RMSE.

All the other features' best k values can be read off the same line chart.
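Rather than reading each curve off the chart, the best k per feature can also be pulled out of the data frame built above (a sketch; best_k is a hypothetical name):

# For each feature: the k with the lowest RMSE, and that RMSE
best_k = pd.DataFrame({'best_k': data.idxmin(), 'rmse': data.min()})
print(best_k.sort_values('rmse'))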

2. Multivariate Model

In [18]:
# Modifying the function to accept list of feature columns
def knn_train_test_multivariate(df, feature_cols, target_col):
    
    # train and test sets
    np.random.seed(0)
    shuffled_index = np.random.permutation(df.index)
    df = df.reindex(index = shuffled_index)
    split_loc = int(0.5*len(df))

    train_set = df.iloc[:split_loc].copy()
    test_set  = df.iloc[split_loc:].copy()
    
    #model building
    model = KNeighborsRegressor(n_neighbors = 5)
    model.fit(train_set[feature_cols], train_set[target_col])
    predictions = model.predict(test_set[feature_cols])
    rmse = np.sqrt(mean_squared_error(test_set[target_col], predictions))

    return rmse
In [19]:
# Rank the features by their average RMSE across the k values tried
mean_rmses = data.mean().sort_values()
best_features = list(mean_rmses.index)
In [20]:
two_features = best_features[:2]
three_features = best_features[:3]
four_features = best_features[:4]
five_features = best_features[:5]

select_lst = [two_features, three_features, four_features,five_features]

select_output = []
for item in select_lst:
    select_output.append(knn_train_test_multivariate(cars, item, 'Price'))
    
select_dict = {'two_features': select_output[0],
              'three_features': select_output[1],
              'four_features': select_output[2],
              'five_features': select_output[3]
              }
select_dict
Out[20]:
{'five_features': 3948.5887004549895,
 'four_features': 3083.271094949073,
 'three_features': 3078.7980786456533,
 'two_features': 2996.8873214940795}

At the default k = 5, the best model is the one with two independent variables, Engine_size and Horse_power, with an RMSE of about 2997; the three- and four-feature sets are close behind.

Hyperparameter Optimization

In [21]:
def knn_train_test_multivariate(df, feature_cols, target_col, k_values):
    
    # train and test sets
    np.random.seed(0)
    shuffled_index = np.random.permutation(df.index)
    df = df.reindex(index = shuffled_index)
    split_loc = int(0.5*len(df))

    train_set = df.iloc[:split_loc].copy()
    test_set  = df.iloc[split_loc:].copy()
    
    #model building
    k_rmse = {}
    for k in k_values:
        
        model = KNeighborsRegressor(n_neighbors = k)
        model.fit(train_set[feature_cols], train_set[target_col])
        predictions = model.predict(test_set[feature_cols])
        k_rmse[k] = np.sqrt(mean_squared_error(test_set[target_col], predictions))

    return k_rmse
In [22]:
hyp_params = list(range(1, 26))
top_three = [two_features, three_features, four_features]

selected = []
for item in top_three:
    selected.append(knn_train_test_multivariate(cars, item, 'Price', hyp_params))

print('Features and their error metrics for each k-value: ')
best_dict = {'two_features': selected[0],
              'three_features': selected[1],
              'four_features': selected[2],
              }
best_dict
Features and their error metrics for each k-value: 
Out[22]:
{'four_features': {1: 2610.6627234612483,
  2: 2674.6854266330647,
  3: 2495.415808033366,
  4: 2737.609650630784,
  5: 3083.271094949073,
  6: 3130.173688680205,
  7: 3384.5351687709153,
  8: 3560.6258772476385,
  9: 3489.3617381363333,
  10: 3574.2210022302074,
  11: 3642.6804175077923,
  12: 3675.8881501828423,
  13: 3841.6091062898613,
  14: 3925.294206504302,
  15: 3882.231349399912,
  16: 3946.6363421326932,
  17: 3878.929278856831,
  18: 3893.0936284329778,
  19: 3911.6103208045697,
  20: 3961.8887642849104,
  21: 3999.9386784900307,
  22: 4070.518779726736,
  23: 4162.771545655721,
  24: 4222.27321782126,
  25: 4282.307729786075},
 'three_features': {1: 2534.588538965579,
  2: 2680.5878404005143,
  3: 2464.771925999155,
  4: 2729.525458407154,
  5: 3078.7980786456533,
  6: 3148.583470886329,
  7: 3383.9058919272175,
  8: 3577.165165323855,
  9: 3494.9357692498993,
  10: 3597.8367148899783,
  11: 3678.316396079907,
  12: 3711.0517621578206,
  13: 3860.557996264031,
  14: 3930.9075907490123,
  15: 3882.5919982742203,
  16: 3944.00229722676,
  17: 3905.7462793347754,
  18: 3905.309340629744,
  19: 3916.592517113872,
  20: 3942.3319701492446,
  21: 3995.3507403242816,
  22: 4059.7959116530124,
  23: 4156.537589932155,
  24: 4237.941348741846,
  25: 4287.16475174757},
 'two_features': {1: 2807.9143358595848,
  2: 2759.9046982276345,
  3: 2354.72887298986,
  4: 2587.626322555174,
  5: 2996.8873214940795,
  6: 3095.941706834793,
  7: 3351.095825242845,
  8: 3557.2028865114526,
  9: 3524.1670038672614,
  10: 3599.2079940642616,
  11: 3682.8374655370367,
  12: 3733.954887479565,
  13: 3818.342821734111,
  14: 3810.6160085975616,
  15: 3805.6453609757978,
  16: 3869.710323645115,
  17: 3850.6126908392766,
  18: 3891.1983452584423,
  19: 3926.9249455122713,
  20: 3963.2914529408017,
  21: 4004.4869239008185,
  22: 4079.0805793509444,
  23: 4175.556938053969,
  24: 4240.634363607186,
  25: 4278.664903559106}}
In [23]:
best_df = pd.DataFrame.from_dict(best_dict)
best_df
Out[23]:
four_features three_features two_features
1 2610.662723 2534.588539 2807.914336
2 2674.685427 2680.587840 2759.904698
3 2495.415808 2464.771926 2354.728873
4 2737.609651 2729.525458 2587.626323
5 3083.271095 3078.798079 2996.887321
6 3130.173689 3148.583471 3095.941707
7 3384.535169 3383.905892 3351.095825
8 3560.625877 3577.165165 3557.202887
9 3489.361738 3494.935769 3524.167004
10 3574.221002 3597.836715 3599.207994
11 3642.680418 3678.316396 3682.837466
12 3675.888150 3711.051762 3733.954887
13 3841.609106 3860.557996 3818.342822
14 3925.294207 3930.907591 3810.616009
15 3882.231349 3882.591998 3805.645361
16 3946.636342 3944.002297 3869.710324
17 3878.929279 3905.746279 3850.612691
18 3893.093628 3905.309341 3891.198345
19 3911.610321 3916.592517 3926.924946
20 3961.888764 3942.331970 3963.291453
21 3999.938678 3995.350740 4004.486924
22 4070.518780 4059.795912 4079.080579
23 4162.771546 4156.537590 4175.556938
24 4222.273218 4237.941349 4240.634364
25 4282.307730 4287.164752 4278.664904
In [24]:
#plotting
for col in list(best_df.columns):
    best_df[col].plot(figsize = (13,8))

plt.title('Error metric (RMSE) for each k value')
plt.xlabel('K value')
plt.ylabel('RMSE values')
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.show()
In [25]:
# minimum rmse for each feature set
best_df.min()
Out[25]:
four_features     2495.415808
three_features    2464.771926
two_features      2354.728873
dtype: float64

Additional: Using the k-fold cross-validation technique for the four candidate feature sets

In [26]:
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, KFold
import numpy as np

def knn_train_test_kfold(df, feature_cols, target_col, n_folds):
    '''
    n_folds: a list of fold counts to try. Note that only the RMSE for
    the last fold count in the list is returned.
    '''
    # keep the same shuffled 50/50 split as the holdout experiments,
    # and cross-validate on the training half
    np.random.seed(0)
    shuffled_index = np.random.permutation(df.index)
    df = df.reindex(index = shuffled_index)
    split_loc = int(0.5*len(df))

    train_set = df.iloc[:split_loc].copy()
    test_set  = df.iloc[split_loc:].copy()

    # splitting the training half into folds
    for fold in n_folds:
        kf = KFold(fold, shuffle=True, random_state=1)

        # model building
        model = KNeighborsRegressor()
        mses = cross_val_score(model, train_set[feature_cols], train_set[target_col],
                               scoring="neg_mean_squared_error", cv=kf)

        mean_mse = np.mean(np.abs(mses))
        kfold_rmse = np.sqrt(mean_mse)

    return kfold_rmse
               
In [27]:
n_folds = [5,9,13,15,19]
kfold_rmses = []
for lst in select_lst:
    kfold_rmses.append(knn_train_test_kfold(cars, lst, 'Price', n_folds))

kfold_dict = {'Two best features': kfold_rmses[0], 'Three best features': kfold_rmses[1], 
              'Four best features': kfold_rmses[2],'Five best features': kfold_rmses[3]}
kfold_dict
Out[27]:
{'Five best features': 4403.893418376636,
 'Four best features': 3410.8686574207804,
 'Three best features': 3556.772181805177,
 'Two best features': 3645.2146005798754}
Summary of the models' performance

1. Univariate model

In [28]:
print('The minimum rmse for each predictor variable:')
print(' ')
print(data.min())
The minimum rmse for each predictor variable:
 
Bore                 6357.036119
City_mpg             3720.345110
Compression_ratio    6184.377891
Curb_weight          3992.353994
Engine_size          2765.713845
Height               7225.565525
Highway_mpg          3961.013084
Horse_power          3759.843030
Length               4641.256910
Normalized_loss      6575.726294
Peak_rpm             7014.668040
Stroke               6192.339101
Symbol               6760.109694
Wheel_base           4315.016502
Width                3336.967875
dtype: float64

Engine_size is the best single predictor, with an RMSE of about 2766 (at k = 3).

2. Multivariate model

In [29]:
print('The minimum rmse for each set of predictor variables:')
print(' ')
print(best_df.min())
The minimum rmse for each set of predictor variables:
 
four_features     2495.415808
three_features    2464.771926
two_features      2354.728873
dtype: float64

The best multivariate model (the two-feature set at k = 3) has an RMSE of about 2355.

3. Kfold Validation Technique

In [30]:
kfold_dict
Out[30]:
{'Five best features': 4403.893418376636,
 'Four best features': 3410.8686574207804,
 'Three best features': 3556.772181805177,
 'Two best features': 3645.2146005798754}

The best k-fold result comes from the four-feature set, with an RMSE of about 3411.

Conclusion

The best performing model among the three options is the multivariate model with two independent variables and k = 3 neighbours.

This implies that Engine_size and Horse_power are the best predictors of car prices. We'll therefore use these two features to generate predicted prices.

Prediction

In [31]:
np.random.seed(0)
shuffled_index = np.random.permutation(cars.index)
cars = cars.reindex(index = shuffled_index)
split_loc = int(0.5*len(cars))

train_set = cars.iloc[:split_loc].copy()
test_set  = cars.iloc[split_loc:].copy()

knn_model = KNeighborsRegressor(n_neighbors = 3)
knn_model.fit(train_set[two_features], train_set['Price'])
test_set['Predicted_price'] = knn_model.predict(test_set[two_features])
test_set.head()
Out[31]:
Symbol Normalized_loss Wheel_base Length Width Height Curb_weight Engine_size Bore Stroke Compression_ratio Horse_power Peak_rpm City_mpg Highway_mpg Price Predicted_price
101 0 108.0 100.4 184.6 66.5 56.1 3296 181 3.43 3.27 9.0 152.0 5200.0 17 22 14399.0 16365.666667
199 -1 95.0 109.1 188.8 68.9 55.5 2952 141 3.78 3.15 9.5 114.0 5400.0 23 28 16845.0 17518.333333
102 0 108.0 100.4 184.6 66.5 55.1 3060 181 3.43 3.27 9.0 152.0 5200.0 19 25 13499.0 16365.666667
71 3 142.0 96.6 180.3 70.5 50.8 3685 234 3.46 3.10 8.3 155.0 4750.0 16 18 35056.0 33994.666667
147 0 85.0 96.9 173.6 65.4 54.9 2420 108 3.62 2.64 9.0 82.0 4800.0 23 29 8013.0 7911.000000
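With the model fitted, pricing a new car only takes its two feature values. A sketch with hypothetical figures (an engine size of 130 and 111 horsepower; not a row from the held-out set):

# Hypothetical car: Engine_size = 130, Horse_power = 111.0
new_car = pd.DataFrame({'Engine_size': [130], 'Horse_power': [111.0]})
print(knn_model.predict(new_car[two_features]))  # estimated price in dollars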