The aim of this project is to develop a model that uses a number of features of various cars to predict the price of a car of interest. The cars considered in this project were available on the US import market in 1985.
This project forms part of an ongoing Data Science course and is also aimed at gaining experience with the K-Nearest Neighbours algorithm, as implemented in the scikit-learn Python library.
The dataset used is an extract from the 1985 Ward's Automotive Yearbook and is available as a CSV file. Information about the dataset can be found here: https://archive.ics.uci.edu/ml/datasets/automobile
# set up our working environment
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, KFold
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
imports = pd.read_csv('imports-85.data', header=None)
imports.columns=['symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration', 'num_of_doors', 'body_style', 'drive_wheels', 'engine_location', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_type', 'num_of_cylinder', 'engine_size', 'fuel_system', 'bore', 'stroke', 'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg', 'price']
imports.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
symboling            205 non-null int64
normalized_losses    205 non-null object
make                 205 non-null object
fuel_type            205 non-null object
aspiration           205 non-null object
num_of_doors         205 non-null object
body_style           205 non-null object
drive_wheels         205 non-null object
engine_location      205 non-null object
wheel_base           205 non-null float64
length               205 non-null float64
width                205 non-null float64
height               205 non-null float64
curb_weight          205 non-null int64
engine_type          205 non-null object
num_of_cylinder      205 non-null object
engine_size          205 non-null int64
fuel_system          205 non-null object
bore                 205 non-null object
stroke               205 non-null object
compression_ratio    205 non-null float64
horsepower           205 non-null object
peak_rpm             205 non-null object
city_mpg             205 non-null int64
highway_mpg          205 non-null int64
price                205 non-null object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB
imports.head()
 | symboling | normalized_losses | make | fuel_type | aspiration | num_of_doors | body_style | drive_wheels | engine_location | wheel_base | ... | engine_size | fuel_system | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
5 rows × 26 columns
This dataset consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in use as compared with other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price; then, if a car is more risky (or less), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". A value of +3 indicates that the auto is risky, -3 that it is probably quite safe.
The third factor is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagon, sports/speciality, etc.) and represents the average loss per car per year.
The makes covered by the dataset are: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo.
The columns available for use as features, which must be continuous numeric columns, are selected below.
The reason that some of these columns are not currently showing as numeric types is that they contain some null values, which are recorded as '?' in this dataset. These will need to be replaced with np.nan.
The target column for this analysis will be the 'price' column.
# Select only the columns with continuous values
continuous_values = ['normalized_losses', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_size', 'bore', 'stroke', 'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg', 'price']
numeric_imports = imports[continuous_values]
numeric_imports
 | normalized_losses | wheel_base | length | width | height | curb_weight | engine_size | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | ? | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
1 | ? | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
2 | ? | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 152 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
3 | 164 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 109 | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
4 | 164 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
200 | 95 | 109.1 | 188.8 | 68.9 | 55.5 | 2952 | 141 | 3.78 | 3.15 | 9.5 | 114 | 5400 | 23 | 28 | 16845 |
201 | 95 | 109.1 | 188.8 | 68.8 | 55.5 | 3049 | 141 | 3.78 | 3.15 | 8.7 | 160 | 5300 | 19 | 25 | 19045 |
202 | 95 | 109.1 | 188.8 | 68.9 | 55.5 | 3012 | 173 | 3.58 | 2.87 | 8.8 | 134 | 5500 | 18 | 23 | 21485 |
203 | 95 | 109.1 | 188.8 | 68.9 | 55.5 | 3217 | 145 | 3.01 | 3.40 | 23.0 | 106 | 4800 | 26 | 27 | 22470 |
204 | 95 | 109.1 | 188.8 | 68.9 | 55.5 | 3062 | 141 | 3.78 | 3.15 | 9.5 | 114 | 5400 | 19 | 25 | 22625 |
205 rows × 15 columns
We need to convert the '?' placeholders used in the dataset to numpy.nan values so that analysis is possible. Also, with 'price' being the target column, we need a value in every row and will therefore delete any rows missing a 'price' value.
numeric_imports = numeric_imports.replace('?', np.nan)
numeric_imports.head(5)
 | normalized_losses | wheel_base | length | width | height | curb_weight | engine_size | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | NaN | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
1 | NaN | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
2 | NaN | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 152 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
3 | 164 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 109 | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
4 | 164 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
numeric_imports = numeric_imports.astype('float')
numeric_imports.isnull().sum()
normalized_losses    41
wheel_base            0
length                0
width                 0
height                0
curb_weight           0
engine_size           0
bore                  4
stroke                4
compression_ratio     0
horsepower            2
peak_rpm              2
city_mpg              0
highway_mpg           0
price                 4
dtype: int64
# Dealing with empty 'price' values
numeric_imports = numeric_imports.dropna(subset=['price'])
numeric_imports.isnull().sum()
normalized_losses    37
wheel_base            0
length                0
width                 0
height                0
curb_weight           0
engine_size           0
bore                  4
stroke                4
compression_ratio     0
horsepower            2
peak_rpm              2
city_mpg              0
highway_mpg           0
price                 0
dtype: int64
With 37 NaNs in this column, accounting for nearly 20% of all rows, imputing the mean value would significantly distort any results based on this column. We will therefore drop this column from the dataset.
numeric_imports = numeric_imports.drop('normalized_losses', axis = 1)
numeric_imports.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 201 entries, 0 to 204
Data columns (total 14 columns):
wheel_base           201 non-null float64
length               201 non-null float64
width                201 non-null float64
height               201 non-null float64
curb_weight          201 non-null float64
engine_size          201 non-null float64
bore                 197 non-null float64
stroke               197 non-null float64
compression_ratio    201 non-null float64
horsepower           199 non-null float64
peak_rpm             199 non-null float64
city_mpg             201 non-null float64
highway_mpg          201 non-null float64
price                201 non-null float64
dtypes: float64(14)
memory usage: 23.6 KB
For the remaining entries we shall replace any null values with the mean value for the relevant column.
# Replace all remaining missing values in other columns using column means.
numeric_imports = numeric_imports.fillna(numeric_imports.mean())
numeric_imports.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 201 entries, 0 to 204
Data columns (total 14 columns):
wheel_base           201 non-null float64
length               201 non-null float64
width                201 non-null float64
height               201 non-null float64
curb_weight          201 non-null float64
engine_size          201 non-null float64
bore                 201 non-null float64
stroke               201 non-null float64
compression_ratio    201 non-null float64
horsepower           201 non-null float64
peak_rpm             201 non-null float64
city_mpg             201 non-null float64
highway_mpg          201 non-null float64
price                201 non-null float64
dtypes: float64(14)
memory usage: 23.6 KB
Having cleaned all of the potential features of interest, we need to normalise these columns so that no single feature, simply because of its scale, has an overly large influence on the distance calculations.
An excellent article, that summarises the various normalization techniques, can be found here: https://analystanswers.com/data-normalization-techniques-easy-to-advanced-the-best/
The method we shall use for this analysis is linear normalization, also known as 'min-max' scaling: each value is rescaled as (x - min) / (max - min), which maps every column onto the range 0 to 1.
# normalize all columns
normalized_imports = (numeric_imports - numeric_imports.min())/(numeric_imports.max() - numeric_imports.min())
# and now return the 'price' column to its original figures
normalized_imports['price'] = numeric_imports['price']
# and view the results
normalized_imports.head()
 | wheel_base | length | width | height | curb_weight | engine_size | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.058309 | 0.413433 | 0.324786 | 0.083333 | 0.411171 | 0.260377 | 0.664286 | 0.290476 | 0.1250 | 0.294393 | 0.346939 | 0.222222 | 0.289474 | 13495.0 |
1 | 0.058309 | 0.413433 | 0.324786 | 0.083333 | 0.411171 | 0.260377 | 0.664286 | 0.290476 | 0.1250 | 0.294393 | 0.346939 | 0.222222 | 0.289474 | 16500.0 |
2 | 0.230321 | 0.449254 | 0.444444 | 0.383333 | 0.517843 | 0.343396 | 0.100000 | 0.666667 | 0.1250 | 0.495327 | 0.346939 | 0.166667 | 0.263158 | 16500.0 |
3 | 0.384840 | 0.529851 | 0.504274 | 0.541667 | 0.329325 | 0.181132 | 0.464286 | 0.633333 | 0.1875 | 0.252336 | 0.551020 | 0.305556 | 0.368421 | 13950.0 |
4 | 0.373178 | 0.529851 | 0.521368 | 0.541667 | 0.518231 | 0.283019 | 0.464286 | 0.633333 | 0.0625 | 0.313084 | 0.551020 | 0.138889 | 0.157895 | 17450.0 |
We are now ready to start our analysis of the various features to determine which give best performance in any model.
We start with some univariate k-nearest neighbors models. Starting with simple models before moving to more complex models helps to structure the code workflow and understand the features better.
Let us first look at the Pearson correlation between all of the features, to see if there are any likely dependencies between them, and then try to identify likely candidates for our model based on their correlations with the 'price' target.
plt.figure(figsize=(16, 6))
# define the mask to set the values in the upper triangle to True
mask = np.triu(np.ones_like(normalized_imports.corr(), dtype=bool))
heatmap = sns.heatmap(normalized_imports.corr(), mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Triangle Correlation Heatmap', fontdict={'fontsize':18}, pad=16);
We can see some obvious connections between certain features in the dataset. For example, 'city_mpg' and 'highway_mpg' are very strongly correlated, as are the size-related features such as 'length', 'width', 'wheel_base' and 'curb_weight'.
If we included a number of these related features in our model, that interrelationship could bias the model towards these features, so we do need to be mindful of this.
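To make these relationships explicit, we can list the most strongly related feature pairs directly. This is a quick sketch; the 0.8 cut-off is an arbitrary choice for 'strongly related', not a threshold used elsewhere in this analysis.
# List feature pairs (excluding 'price') whose absolute Pearson
# correlation exceeds 0.8 - an arbitrary 'strongly related' threshold.
corr = normalized_imports.drop('price', axis=1).corr().abs()
upper = corr.where(np.triu(np.ones_like(corr, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
print(pairs[pairs > 0.8])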
Now let's look at any correlations between all of the features and the 'price' target.
plt.figure(figsize=(8, 12))
heatmap = sns.heatmap(normalized_imports.corr()[['price']].sort_values(by='price', ascending=False), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title("All Features and their Correlation with 'price'", fontdict={'fontsize':18}, pad=16);
The graph above shows there are relatively strong positive correlations between 'price' and the size and power features, notably 'engine_size', 'curb_weight', 'horsepower' and 'width'.
Whereas there are relatively strong negative correlations between 'price' and the fuel-economy factors, 'city_mpg' and 'highway_mpg'.
To further explore the possible candidates for inclusion in the model, we shall use scikit-learn's KNeighborsRegressor algorithm to make an initial assessment of the results obtained for each feature. Initially, we shall use the algorithm's default parameter settings, apart from stipulating that the 'brute force' method is used, which is suitable for small samples such as the dataset we are working with here. We will simply split the dataset into two halves: 50% as the training set and 50% as the test set. We shall use the Root Mean Squared Error (RMSE) obtained for each feature as the performance score.
# define a function that will run the algorithm on each feature
def knn_train_test(train_cols, target_col, df):
    # Here `train_cols` is a single column name; [[train_cols]] selects
    # it as a one-column DataFrame.
    knn = KNeighborsRegressor(algorithm='brute')
    np.random.seed(1)
    # Randomize the order of rows in the data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    # Select the first half as the training set - rows 0:101 -
    # and the second half as the test set - rows 101:201.
    train_df = rand_df.iloc[0:101]
    test_df = rand_df.iloc[101:]
    # Fit a KNN model using the default k value (n_neighbors=5).
    knn.fit(train_df[[train_cols]], train_df[target_col])
    # Make predictions using this model.
    predictions = knn.predict(test_df[[train_cols]])
    # Calculate and return the RMSE.
    mse = mean_squared_error(test_df[target_col], predictions)
    rmse = np.sqrt(mse)
    return rmse
rmse_results = {}
train_cols = normalized_imports.columns.drop('price')
# For each column (other than the target - `price`), train a model, return RMSE value
# and add to the dictionary `rmse_results`.
for col in train_cols:
    rmse_val = knn_train_test(col, 'price', normalized_imports)
    rmse_results[col] = rmse_val
# Create a Series object from the dictionary so
# we can easily view the results, sort, etc
rmse_results_series = pd.Series(rmse_results)
rmse_results_series.sort_values()
engine_size          3193.852584
horsepower           4052.345145
curb_weight          4433.584275
width                4627.838618
highway_mpg          4631.877511
city_mpg             4706.744376
wheel_base           5495.090042
length               5545.624068
compression_ratio    6165.689197
bore                 7207.045838
peak_rpm             7354.365343
height               7572.273382
stroke               7791.413037
dtype: float64
'engine_size' (RMSE 3194) performs significantly better than the other features, with 'horsepower' (RMSE 4052) also fairly good, and the following four features, down to 'city_mpg', giving similar results (RMSEs of 4434 to 4707).
We shall now run the same exercise, but this time we shall vary the k-value to see what the optimum k-value should be. We shall use k-values between 1 and 9, and then visualise the results using a line graph.
def knn_train_test_k_value(train_cols, target_col, df):
    np.random.seed(1)
    # Randomize the order of rows in the data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    # Select the first half as the training set - rows 0:101 -
    # and the second half as the test set - rows 101:201.
    train_df = rand_df.iloc[0:101]
    test_df = rand_df.iloc[101:]
    # Set the range of k-values that we want to test the model with.
    k_values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    k_rmses = {}
    for k in k_values:
        # Instantiate the model with the current k-value.
        knn = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
        # Fit the model to the training data.
        knn.fit(train_df[[train_cols]], train_df[target_col])
        # Predict prices for the test data.
        predictions = knn.predict(test_df[[train_cols]])
        # Calculate the RMSE for the predictions, comparing the predicted
        # prices (y_pred) against the actual prices (y_true).
        mse = mean_squared_error(test_df[target_col], predictions)
        rmse = np.sqrt(mse)
        k_rmses[k] = rmse
    return k_rmses
k_rmse_results = {}
train_cols = normalized_imports.columns.drop('price')
# For each column (other than the target, `price`), train a model,
# return the RMSE values for each k-value and add them to the
# dictionary `k_rmse_results`.
for col in train_cols:
    rmse_val = knn_train_test_k_value(col, 'price', normalized_imports)
    k_rmse_results[col] = rmse_val
k_rmse_results
{'wheel_base': {1: 4713.644728869583, 2: 4387.5338571343245, 3: 5131.693127029324, 4: 5429.055461242775, 5: 5495.090042028429, 6: 5480.758948382767, 7: 5564.821885995737, 8: 5700.79793285554, 9: 5822.087833425691},
 'length': {1: 5054.374735810553, 2: 4625.24083399989, 3: 4713.563898415343, 4: 5294.113068895488, 5: 5545.624068434498, 6: 5458.817557020323, 7: 5409.7898250131975, 8: 5469.892779199378, 9: 5455.421592709176},
 'width': {1: 4364.32041559737, 2: 4527.255895959494, 3: 4638.59313896885, 4: 4663.834600814012, 5: 4627.838618015974, 6: 4669.805592551174, 7: 4596.540134066234, 8: 4629.540517325937, 9: 4736.614890443424},
 'height': {1: 8944.532711662472, 2: 7780.9471229407545, 3: 8024.138973815889, 4: 8001.845759495118, 5: 7572.273382412972, 6: 7711.3531514146935, 7: 7713.990888704784, 8: 7750.25070475546, 9: 7848.478733707367},
 'curb_weight': {1: 5522.701924782832, 2: 5527.791988669617, 3: 5078.909050070585, 4: 4777.993767720402, 5: 4433.584274511989, 6: 4403.110922467836, 7: 4347.409273137654, 8: 4510.694730148977, 9: 4657.356427891229},
 'engine_size': {1: 3489.4088639768197, 2: 2958.256321720618, 3: 2795.7233277911378, 4: 3064.4840856578126, 5: 3193.8525843877014, 6: 3483.9419960648656, 7: 3596.625739580398, 8: 3723.6998260718465, 9: 3797.500337063843},
 'bore': {1: 7592.603240654684, 2: 6913.3824046193195, 3: 6904.370272033021, 4: 6854.296165271092, 5: 7207.045838011577, 6: 7299.1996315615925, 7: 7181.916504794297, 8: 7071.36068771642, 9: 6988.202236521101},
 'stroke': {1: 7281.147086139656, 2: 7411.264440869722, 3: 7284.803720912611, 4: 7684.51690027584, 5: 7791.413036542216, 6: 8072.547602901523, 7: 7993.206756173568, 8: 7717.599308925347, 9: 7924.069829884698},
 'compression_ratio': {1: 7947.701145614372, 2: 7074.238078584859, 3: 6349.29920271171, 4: 6213.487303036838, 5: 6165.689196642984, 6: 6164.329766644906, 7: 6505.079663130458, 8: 6526.982410887228, 9: 6766.621257191185},
 'horsepower': {1: 3559.5289477682295, 2: 3653.868350118816, 3: 4000.7772510128857, 4: 3870.3047220147923, 5: 4052.3451454682386, 6: 4201.77301088481, 7: 4158.2072794876, 8: 4249.7304450179545, 9: 4500.872781822226},
 'peak_rpm': {1: 8005.071514983486, 2: 7591.886470601362, 3: 7807.556279940892, 4: 7694.867207390586, 5: 7354.365342515967, 6: 7178.748420685879, 7: 7327.85932885378, 8: 7305.882502690896, 9: 7416.989346255606},
 'city_mpg': {1: 5021.755671077596, 2: 4959.9657763436235, 3: 4688.484536962166, 4: 4606.867389221769, 5: 4706.7443759779435, 6: 4746.89375732524, 7: 4729.946871902517, 8: 4670.230474271318, 9: 4760.349807602456},
 'highway_mpg': {1: 6320.21090233546, 2: 5192.497320654099, 3: 4706.327077574519, 4: 4595.994466720451, 5: 4631.877510902032, 6: 4912.407591695181, 7: 5182.017937984376, 8: 5178.896149790272, 9: 5272.494903204838}}
for k, v in k_rmse_results.items():
    x = list(v.keys())
    y = list(v.values())
    plt.scatter(x, y)
    plt.plot(x, y)
plt.xlabel('k-value')
plt.ylabel('RMSE')
plt.xticks([1, 3, 5, 7, 9], fontsize=14)
plt.yticks(fontsize=14)
plt.legend(k_rmse_results, frameon=False, bbox_to_anchor=(1.5, 1))
By averaging the RMSE values obtained for each feature, across all of the k-values calculated above, and then selecting the best performing features from a sorted list of the results, we will hopefully be looking at those features whose performance is not overly dependent on the algorithm's parameters.
# Computing the average RMSE across the various k-values and sorting the features by performance.
feature_avg_rmse = {}
for feature, v in k_rmse_results.items():
    avg_rmse = np.mean(list(v.values()))
    feature_avg_rmse[feature] = avg_rmse
series_avg_rmse = pd.Series(feature_avg_rmse)
sorted_series_avg_rmse = series_avg_rmse.sort_values()
print(sorted_series_avg_rmse)
sorted_features = sorted_series_avg_rmse.index
engine_size          3344.832565
horsepower           4027.489770
width                4606.038200
city_mpg             4765.693185
curb_weight          4806.616929
highway_mpg          5110.302651
length               5225.204262
wheel_base           5302.831535
compression_ratio    6634.825336
bore                 7112.486331
peak_rpm             7520.358490
stroke               7684.507631
height               7927.534603
dtype: float64
Having sorted the RMSE results for all features, we will now run five models, consisting of: the single best feature, the best two features, the best three features, the best four features and the best five features.
We shall limit it to the best five features because, beyond number five ('curb_weight'), we move into the range where too many interrelated features would be used in the model, which could negatively impact the model's overall suitability.
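As a quick illustration of that concern, we can inspect the correlations among the best-performing features (a sketch using the sorted_features index computed above; looking at the top eight is an arbitrary choice). The features ranked sixth onwards ('highway_mpg', 'length', 'wheel_base') are strongly related to features already inside the top five.
# Correlations among the eight best-performing features (sketch):
# the features ranked sixth onwards are strongly related to the top five.
print(normalized_imports[sorted_features[:8]].corr().round(2))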
We shall also continue our exploration of the correct k-value that we should use in the final model, by again running each of these models with various k-values - 1 through 8.
def knn_train_test(train_cols, target_col, df):
    np.random.seed(1)
    # Randomize the order of rows in the data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    # Divide the number of rows in half and round.
    last_train_row = int(len(rand_df) / 2)
    # Select the first half as the training set
    # and the second half as the test set.
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]
    k_values = [1, 2, 3, 4, 5, 6, 7, 8]
    k_rmses = {}
    for k in k_values:
        # Fit a model using k nearest neighbors.
        knn = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
        knn.fit(train_df[train_cols], train_df[target_col])
        # Make predictions using the model.
        predicted_labels = knn.predict(test_df[train_cols])
        # Calculate the RMSE.
        mse = mean_squared_error(test_df[target_col], predicted_labels)
        rmse = np.sqrt(mse)
        k_rmses[k] = rmse
    return k_rmses
models_results = {}
for nr_best_feats in range(1, 6):
    models_results['{} best features'.format(nr_best_feats)] = knn_train_test(
        sorted_features[:nr_best_feats],
        'price',
        normalized_imports
    )
models_results
{'1 best features': {1: 3475.2810813025476, 2: 2943.936340300614, 3: 2790.45642413665, 4: 3188.427394694385, 5: 3216.8365762064755, 6: 3521.046600098613, 7: 3631.9860373729207, 8: 3718.9009453931467},
 '2 best features': {1: 2778.412353831013, 2: 2687.9334815039783, 3: 2789.39413024789, 4: 2844.017557427423, 5: 2950.255443479798, 6: 3126.834022873721, 7: 3200.8528818656673, 8: 3424.405007309176},
 '3 best features': {1: 3538.458459177772, 2: 3479.958271913246, 3: 3343.2671970925558, 4: 3343.612859314053, 5: 3573.310158703976, 6: 3733.1236592832756, 7: 3665.506151479582, 8: 3746.1970732261298},
 '4 best features': {1: 3546.7043271406787, 2: 3289.8038725434653, 3: 3267.98652530774, 4: 3255.907469109187, 5: 3394.6046092393894, 6: 3624.223483273441, 7: 3674.722141239556, 8: 3759.3676154425716},
 '5 best features': {1: 2838.5257695773093, 2: 2879.0159566617326, 3: 2992.6702414093975, 4: 3133.9563398915648, 5: 3360.832156381264, 6: 3563.4746528087057, 7: 3568.1754419934873, 8: 3811.14991399243}}
for k, v in models_results.items():
    x = list(v.keys())
    y = list(v.values())
    plt.scatter(x, y)
    plt.plot(x, y)
plt.xlabel('k-value')
plt.ylabel('RMSE')
plt.xticks([1, 2, 3, 4, 5, 6, 7, 8], fontsize=14)
plt.yticks(fontsize=14)
plt.legend(['1 Feature', '2 Features', '3 Features', '4 Features', '5 Features'],
           frameon=False, bbox_to_anchor=(1.5, 1))
This analysis indicates that a model containing the two features 'engine_size' and 'horsepower' gives the best results across the board.
If we use a k-value of 2, this two-feature model achieves an RMSE of USD 2,688.
Our model is therefore able to predict the 'price' of cars in the test set with a typical error of USD 2,688. Actual prices lie between USD 5,118 and USD 45,400, a range of roughly USD 40,000, which means the error in our predictions amounts to approximately 7% of the overall price range.
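As a quick arithmetic check of that figure (a sketch; the RMSE of 2,688 is taken from the run above):
# Express the RMSE as a fraction of the overall price range.
price_range = numeric_imports['price'].max() - numeric_imports['price'].min()
print(price_range)         # ~40,282
print(2688 / price_range)  # ~0.067, i.e. roughly 7% of the range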
Having identified the features and the optimum k-value, we shall now examine whether our stipulation that the 'brute force' KNN algorithm be used in our modeling was the correct decision. We shall remove this stipulation from the following model, instead allowing KNeighborsRegressor to default to 'auto'.
np.random.seed(1)
# Randomize order of rows in data frame.
shuffled_index = np.random.permutation(normalized_imports.index)
rand_df = normalized_imports.reindex(shuffled_index)
# Divide number of rows in half and round.
last_train_row = int(len(rand_df) / 2)
# Select the first half and set as training set.
train_df = rand_df.iloc[0:last_train_row]
# Select the second half and set as test set.
test_df = rand_df.iloc[last_train_row:]
# Fit model using k nearest neighbors.
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(train_df[['engine_size', 'horsepower']], train_df['price'])
# Make predictions using model.
predicted_labels = knn.predict(test_df[['engine_size', 'horsepower']])
# Calculate and return RMSE.
mse = mean_squared_error(test_df['price'], predicted_labels)
rmse = np.sqrt(mse)
rmse
2657.7963807419765
By allowing KNeighborsRegressor to choose its own algorithm, we have further improved the performance by roughly USD 30.
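Since cross_val_score and KFold were imported at the start, a quick k-fold check of the final model is a natural sanity test of the single 50/50 split used throughout. This is a sketch, not part of the original analysis; the choice of 5 folds and random_state=1 is arbitrary.
# 5-fold cross-validation of the final two-feature model, to check that
# the 50/50 split above is not unusually favourable.
kf = KFold(n_splits=5, shuffle=True, random_state=1)
knn = KNeighborsRegressor(n_neighbors=2)
mses = cross_val_score(knn,
                       normalized_imports[['engine_size', 'horsepower']],
                       normalized_imports['price'],
                       scoring='neg_mean_squared_error',
                       cv=kf)
rmses = np.sqrt(np.abs(mses))
print(rmses.mean(), rmses.std())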
This project has developed a model for predicting the price of imported cars based on a number of features relevant to each car, as available from a 1985 dataset of car imports into the USA.
The best performing prediction model for 'price' consists of: the two features 'engine_size' and 'horsepower', a k-value of 2, and the KNeighborsRegressor algorithm parameter left at its default of 'auto'.
The Root Mean Squared Error (RMSE) achieved with this model is USD 2,658, which implies that our predicted prices will be off, on average, by USD 2,658. Across a range of prices in our dataset of approximately USD 40,000, this amounts to a typical error of about 7% of the price range.
The use of this model is restricted to predicting the prices of cars imported into the USA in 1985. The project has not looked at the effects of extrapolating this model outside of this market.