The aim of this project is to develop a model that uses a number of features of various cars to predict the price of a car of interest. The cars considered in this project were available on the US import market in 1985.
This project forms part of an ongoing Data Science course and is also aimed at gaining experience with the K-Nearest Neighbours algorithm, as implemented in the scikit-learn Python library.
The dataset used is an extract from the 1985 Ward's Automotive Yearbook and is available as a CSV file. Information about the dataset can be found here: https://archive.ics.uci.edu/ml/datasets/automobile
# set up our working environment
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, KFold
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
imports = pd.read_csv('imports-85.data', header=None)
imports.columns=['symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration', 'num_of_doors', 'body_style', 'drive_wheels', 'engine_location', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_type', 'num_of_cylinder', 'engine_size', 'fuel_system', 'bore', 'stroke', 'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg', 'price']
imports.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
symboling            205 non-null int64
normalized_losses    205 non-null object
make                 205 non-null object
fuel_type            205 non-null object
aspiration           205 non-null object
num_of_doors         205 non-null object
body_style           205 non-null object
drive_wheels         205 non-null object
engine_location      205 non-null object
wheel_base           205 non-null float64
length               205 non-null float64
width                205 non-null float64
height               205 non-null float64
curb_weight          205 non-null int64
engine_type          205 non-null object
num_of_cylinder      205 non-null object
engine_size          205 non-null int64
fuel_system          205 non-null object
bore                 205 non-null object
stroke               205 non-null object
compression_ratio    205 non-null float64
horsepower           205 non-null object
peak_rpm             205 non-null object
city_mpg             205 non-null int64
highway_mpg          205 non-null int64
price                205 non-null object
dtypes: float64(5), int64(5), object(16)
memory usage: 41.8+ KB
imports.head()
 | symboling | normalized_losses | make | fuel_type | aspiration | num_of_doors | body_style | drive_wheels | engine_location | wheel_base | ... | engine_size | fuel_system | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
5 rows × 26 columns
This dataset consists of three types of entities: (a) the specification of an auto in terms of various characteristics, (b) its assigned insurance risk rating, and (c) its normalized losses in use as compared with other cars. The second rating corresponds to the degree to which the auto is more risky than its price indicates. Cars are initially assigned a risk factor symbol associated with their price; then, if a car is more risky (or less), this symbol is adjusted by moving it up (or down) the scale. Actuaries call this process "symboling". A value of +3 indicates that the auto is risky, -3 that it is probably quite safe.
The third factor is the relative average loss payment per insured vehicle year. This value is normalized for all autos within a particular size classification (two-door small, station wagon, sports/speciality, etc.) and represents the average loss per car per year.
The makes covered by the dataset are: alfa-romero, audi, bmw, chevrolet, dodge, honda, isuzu, jaguar, mazda, mercedes-benz, mercury, mitsubishi, nissan, peugot, plymouth, porsche, renault, saab, subaru, toyota, volkswagen, volvo.
The columns available for use as features, which must be continuous numeric columns, are selected below.
The reason that some of these columns are not currently showing as numeric types is that they contain some null values, which are recorded as '?' in this dataset. These will need to be replaced with np.nan.
The target column for this analysis will be the 'price' column.
# Select only the columns with continuous values
continuous_values = ['normalized_losses', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_size', 'bore', 'stroke', 'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg', 'price']
numeric_imports = imports[continuous_values]
numeric_imports
 | normalized_losses | wheel_base | length | width | height | curb_weight | engine_size | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | ? | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
1 | ? | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
2 | ? | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 152 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
3 | 164 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 109 | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
4 | 164 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
200 | 95 | 109.1 | 188.8 | 68.9 | 55.5 | 2952 | 141 | 3.78 | 3.15 | 9.5 | 114 | 5400 | 23 | 28 | 16845 |
201 | 95 | 109.1 | 188.8 | 68.8 | 55.5 | 3049 | 141 | 3.78 | 3.15 | 8.7 | 160 | 5300 | 19 | 25 | 19045 |
202 | 95 | 109.1 | 188.8 | 68.9 | 55.5 | 3012 | 173 | 3.58 | 2.87 | 8.8 | 134 | 5500 | 18 | 23 | 21485 |
203 | 95 | 109.1 | 188.8 | 68.9 | 55.5 | 3217 | 145 | 3.01 | 3.40 | 23.0 | 106 | 4800 | 26 | 27 | 22470 |
204 | 95 | 109.1 | 188.8 | 68.9 | 55.5 | 3062 | 141 | 3.78 | 3.15 | 9.5 | 114 | 5400 | 19 | 25 | 22625 |
205 rows × 15 columns
We need to convert the '?' placeholders used in the dataset to numpy.nan values so that analysis is possible. Also, with 'price' being the target column, we need a value in every row and will therefore delete any rows missing a 'price' value.
numeric_imports = numeric_imports.replace('?', np.nan)
numeric_imports.head(5)
 | normalized_losses | wheel_base | length | width | height | curb_weight | engine_size | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | NaN | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
1 | NaN | 88.6 | 168.8 | 64.1 | 48.8 | 2548 | 130 | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
2 | NaN | 94.5 | 171.2 | 65.5 | 52.4 | 2823 | 152 | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
3 | 164 | 99.8 | 176.6 | 66.2 | 54.3 | 2337 | 109 | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
4 | 164 | 99.4 | 176.6 | 66.4 | 54.3 | 2824 | 136 | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
numeric_imports = numeric_imports.astype('float')
numeric_imports.isnull().sum()
normalized_losses    41
wheel_base            0
length                0
width                 0
height                0
curb_weight           0
engine_size           0
bore                  4
stroke                4
compression_ratio     0
horsepower            2
peak_rpm              2
city_mpg              0
highway_mpg           0
price                 4
dtype: int64
# Dealing with empty 'price' values
numeric_imports = numeric_imports.dropna(subset=['price'])
numeric_imports.isnull().sum()
normalized_losses    37
wheel_base            0
length                0
width                 0
height                0
curb_weight           0
engine_size           0
bore                  4
stroke                4
compression_ratio     0
horsepower            2
peak_rpm              2
city_mpg              0
highway_mpg           0
price                 0
dtype: int64
With 37 NaNs in this column, accounting for nearly 20% of all rows, imputing the mean value would significantly distort any results based on this column. We will therefore drop this column from the dataset.
numeric_imports = numeric_imports.drop('normalized_losses', axis = 1)
numeric_imports.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 201 entries, 0 to 204
Data columns (total 14 columns):
wheel_base           201 non-null float64
length               201 non-null float64
width                201 non-null float64
height               201 non-null float64
curb_weight          201 non-null float64
engine_size          201 non-null float64
bore                 197 non-null float64
stroke               197 non-null float64
compression_ratio    201 non-null float64
horsepower           199 non-null float64
peak_rpm             199 non-null float64
city_mpg             201 non-null float64
highway_mpg          201 non-null float64
price                201 non-null float64
dtypes: float64(14)
memory usage: 23.6 KB
For the remaining entries we shall replace any null values with the mean value for the relevant column.
# Replace all remaining missing values in other columns using column means.
numeric_imports = numeric_imports.fillna(numeric_imports.mean())
numeric_imports.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 201 entries, 0 to 204
Data columns (total 14 columns):
wheel_base           201 non-null float64
length               201 non-null float64
width                201 non-null float64
height               201 non-null float64
curb_weight          201 non-null float64
engine_size          201 non-null float64
bore                 201 non-null float64
stroke               201 non-null float64
compression_ratio    201 non-null float64
horsepower           201 non-null float64
peak_rpm             201 non-null float64
city_mpg             201 non-null float64
highway_mpg          201 non-null float64
price                201 non-null float64
dtypes: float64(14)
memory usage: 23.6 KB
Having cleaned all of the potential features of interest, we need to normalise these columns so that no single feature, simply because of its scale, has an overly large influence on the distance calculations.
An excellent article, that summarises the various normalization techniques, can be found here: https://analystanswers.com/data-normalization-techniques-easy-to-advanced-the-best/
The method we shall use for this analysis is linear normalization, also known as 'min-max' scaling: each value is rescaled as (x - min) / (max - min), which maps every column onto the range 0 to 1.
# normalize all columns
normalized_imports = (numeric_imports - numeric_imports.min())/(numeric_imports.max() - numeric_imports.min())
# and now return the 'price' column to its original figures
normalized_imports['price'] = numeric_imports['price']
# and view the results
normalized_imports.head()
 | wheel_base | length | width | height | curb_weight | engine_size | bore | stroke | compression_ratio | horsepower | peak_rpm | city_mpg | highway_mpg | price |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 0.058309 | 0.413433 | 0.324786 | 0.083333 | 0.411171 | 0.260377 | 0.664286 | 0.290476 | 0.1250 | 0.294393 | 0.346939 | 0.222222 | 0.289474 | 13495.0 |
1 | 0.058309 | 0.413433 | 0.324786 | 0.083333 | 0.411171 | 0.260377 | 0.664286 | 0.290476 | 0.1250 | 0.294393 | 0.346939 | 0.222222 | 0.289474 | 16500.0 |
2 | 0.230321 | 0.449254 | 0.444444 | 0.383333 | 0.517843 | 0.343396 | 0.100000 | 0.666667 | 0.1250 | 0.495327 | 0.346939 | 0.166667 | 0.263158 | 16500.0 |
3 | 0.384840 | 0.529851 | 0.504274 | 0.541667 | 0.329325 | 0.181132 | 0.464286 | 0.633333 | 0.1875 | 0.252336 | 0.551020 | 0.305556 | 0.368421 | 13950.0 |
4 | 0.373178 | 0.529851 | 0.521368 | 0.541667 | 0.518231 | 0.283019 | 0.464286 | 0.633333 | 0.0625 | 0.313084 | 0.551020 | 0.138889 | 0.157895 | 17450.0 |
We are now ready to start our analysis of the various features to determine which give best performance in any model.
We start with some univariate k-nearest neighbors models. Starting with simple models before moving to more complex models helps to structure the code workflow and understand the features better.
Let us first look at the Pearson correlation between all of the features, to see if there are any likely dependencies between them, and then try to identify likely candidates for our model based on their correlations with the 'price' target.
plt.figure(figsize=(16, 6))
# define the mask to set the values in the upper triangle to True
mask = np.triu(np.ones_like(normalized_imports.corr(), dtype=bool))
heatmap = sns.heatmap(normalized_imports.corr(), mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Triangle Correlation Heatmap', fontdict={'fontsize':18}, pad=16);
We can see some obvious connections between certain features in the dataset. For example, 'city_mpg' and 'highway_mpg' are very strongly correlated, as are the size-related features such as 'length', 'width', 'wheel_base' and 'curb_weight'.
If we included a number of these related features in our model, that interrelationship could bias the model towards these features, so we do need to be mindful of this.
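To make these relationships explicit, we can list the most strongly related feature pairs directly. This is a quick sketch; the 0.8 cut-off is an arbitrary choice for 'strongly related', not a threshold used elsewhere in this analysis.
# List feature pairs (excluding 'price') whose absolute Pearson
# correlation exceeds 0.8 - an arbitrary 'strongly related' threshold.
corr = normalized_imports.drop('price', axis=1).corr().abs()
upper = corr.where(np.triu(np.ones_like(corr, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
print(pairs[pairs > 0.8])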
Now let's look at any correlations between all of the features and the 'price' target.
plt.figure(figsize=(8, 12))
heatmap = sns.heatmap(normalized_imports.corr()[['price']].sort_values(by='price', ascending=False), vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title("All Features and their Correlation with 'price'", fontdict={'fontsize':18}, pad=16);
The graph above shows there are relatively strong positive correlations between 'price' and the size and power features, notably 'engine_size', 'curb_weight', 'horsepower' and 'width'.
Whereas there are relatively strong negative correlations between 'price' and the fuel-economy factors, 'city_mpg' and 'highway_mpg'.
To further explore the possible candidates for inclusion in the model, we shall use scikit-learn's KNeighborsRegressor algorithm to make an initial assessment of the results obtained for each feature. Initially, we shall use the algorithm's default parameter settings, apart from stipulating that the 'brute force' method is used, which is suitable for small samples such as the dataset we are working with here. We will simply split the dataset into two halves: 50% as the training set and 50% as the test set. We shall use the Root Mean Squared Error (RMSE) obtained for each feature as the performance score.
# define a function that will run the algorithm on each feature
def knn_train_test(train_cols, target_col, df):
    # Here `train_cols` is a single column name; [[train_cols]] selects
    # it as a one-column DataFrame.
    knn = KNeighborsRegressor(algorithm='brute')
    np.random.seed(1)
    # Randomize the order of rows in the data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    # Select the first half as the training set - rows 0:101 -
    # and the second half as the test set - rows 101:201.
    train_df = rand_df.iloc[0:101]
    test_df = rand_df.iloc[101:]
    # Fit a KNN model using the default k value (n_neighbors=5).
    knn.fit(train_df[[train_cols]], train_df[target_col])
    # Make predictions using this model.
    predictions = knn.predict(test_df[[train_cols]])
    # Calculate and return the RMSE.
    mse = mean_squared_error(test_df[target_col], predictions)
    rmse = np.sqrt(mse)
    return rmse
rmse_results = {}
train_cols = normalized_imports.columns.drop('price')
# For each column (other than the target - `price`), train a model, return RMSE value
# and add to the dictionary `rmse_results`.
for col in train_cols:
    rmse_val = knn_train_test(col, 'price', normalized_imports)
    rmse_results[col] = rmse_val
# Create a Series object from the dictionary so
# we can easily view the results, sort, etc
rmse_results_series = pd.Series(rmse_results)
rmse_results_series.sort_values()
engine_size          3193.852584
horsepower           4052.345145
curb_weight          4433.584275
width                4627.838618
highway_mpg          4631.877511
city_mpg             4706.744376
wheel_base           5495.090042
length               5545.624068
compression_ratio    6165.689197
bore                 7207.045838
peak_rpm             7354.365343
height               7572.273382
stroke               7791.413037
dtype: float64
'engine_size' (RMSE 3194) performs significantly better than the other features, with 'horsepower' (RMSE 4052) also fairly good, and the following four features, down to 'city_mpg', giving similar results (RMSEs of 4434 to 4707).
We shall now run the same exercise, but this time we shall vary the k-value to see what the optimum k-value should be. We shall use k-values between 1 and 9, and then visualise the results using a line graph.
def knn_train_test_k_value(train_cols, target_col, df):
    np.random.seed(1)
    # Randomize the order of rows in the data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    # Select the first half as the training set - rows 0:101 -
    # and the second half as the test set - rows 101:201.
    train_df = rand_df.iloc[0:101]
    test_df = rand_df.iloc[101:]
    # Set the range of k-values that we want to test the model with.
    k_values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    k_rmses = {}
    for k in k_values:
        # Instantiate the model with the current k-value.
        knn = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
        # Fit the model to the training data.
        knn.fit(train_df[[train_cols]], train_df[target_col])
        # Predict prices for the test data.
        predictions = knn.predict(test_df[[train_cols]])
        # Calculate the RMSE for the predictions, comparing the predicted
        # prices (y_pred) against the actual prices (y_true).
        mse = mean_squared_error(test_df[target_col], predictions)
        rmse = np.sqrt(mse)
        k_rmses[k] = rmse
    return k_rmses
k_rmse_results = {}
train_cols = normalized_imports.columns.drop('price')
# For each column (other than the target, `price`), train a model,
# return the RMSE values for each k-value and add them to the
# dictionary `k_rmse_results`.
for col in train_cols:
    rmse_val = knn_train_test_k_value(col, 'price', normalized_imports)
    k_rmse_results[col] = rmse_val
k_rmse_results
{'wheel_base': {1: 4713.644728869583, 2: 4387.5338571343245, 3: 5131.693127029324, 4: 5429.055461242775, 5: 5495.090042028429, 6: 5480.758948382767, 7: 5564.821885995737, 8: 5700.79793285554, 9: 5822.087833425691},
 'length': {1: 5054.374735810553, 2: 4625.24083399989, 3: 4713.563898415343, 4: 5294.113068895488, 5: 5545.624068434498, 6: 5458.817557020323, 7: 5409.7898250131975, 8: 5469.892779199378, 9: 5455.421592709176},
 'width': {1: 4364.32041559737, 2: 4527.255895959494, 3: 4638.59313896885, 4: 4663.834600814012, 5: 4627.838618015974, 6: 4669.805592551174, 7: 4596.540134066234, 8: 4629.540517325937, 9: 4736.614890443424},
 'height': {1: 8944.532711662472, 2: 7780.9471229407545, 3: 8024.138973815889, 4: 8001.845759495118, 5: 7572.273382412972, 6: 7711.3531514146935, 7: 7713.990888704784, 8: 7750.25070475546, 9: 7848.478733707367},
 'curb_weight': {1: 5522.701924782832, 2: 5527.791988669617, 3: 5078.909050070585, 4: 4777.993767720402, 5: 4433.584274511989, 6: 4403.110922467836, 7: 4347.409273137654, 8: 4510.694730148977, 9: 4657.356427891229},
 'engine_size': {1: 3489.4088639768197, 2: 2958.256321720618, 3: 2795.7233277911378, 4: 3064.4840856578126, 5: 3193.8525843877014, 6: 3483.9419960648656, 7: 3596.625739580398, 8: 3723.6998260718465, 9: 3797.500337063843},
 'bore': {1: 7592.603240654684, 2: 6913.3824046193195, 3: 6904.370272033021, 4: 6854.296165271092, 5: 7207.045838011577, 6: 7299.1996315615925, 7: 7181.916504794297, 8: 7071.36068771642, 9: 6988.202236521101},
 'stroke': {1: 7281.147086139656, 2: 7411.264440869722, 3: 7284.803720912611, 4: 7684.51690027584, 5: 7791.413036542216, 6: 8072.547602901523, 7: 7993.206756173568, 8: 7717.599308925347, 9: 7924.069829884698},
 'compression_ratio': {1: 7947.701145614372, 2: 7074.238078584859, 3: 6349.29920271171, 4: 6213.487303036838, 5: 6165.689196642984, 6: 6164.329766644906, 7: 6505.079663130458, 8: 6526.982410887228, 9: 6766.621257191185},
 'horsepower': {1: 3559.5289477682295, 2: 3653.868350118816, 3: 4000.7772510128857, 4: 3870.3047220147923, 5: 4052.3451454682386, 6: 4201.77301088481, 7: 4158.2072794876, 8: 4249.7304450179545, 9: 4500.872781822226},
 'peak_rpm': {1: 8005.071514983486, 2: 7591.886470601362, 3: 7807.556279940892, 4: 7694.867207390586, 5: 7354.365342515967, 6: 7178.748420685879, 7: 7327.85932885378, 8: 7305.882502690896, 9: 7416.989346255606},
 'city_mpg': {1: 5021.755671077596, 2: 4959.9657763436235, 3: 4688.484536962166, 4: 4606.867389221769, 5: 4706.7443759779435, 6: 4746.89375732524, 7: 4729.946871902517, 8: 4670.230474271318, 9: 4760.349807602456},
 'highway_mpg': {1: 6320.21090233546, 2: 5192.497320654099, 3: 4706.327077574519, 4: 4595.994466720451, 5: 4631.877510902032, 6: 4912.407591695181, 7: 5182.017937984376, 8: 5178.896149790272, 9: 5272.494903204838}}
for k, v in k_rmse_results.items():
    x = list(v.keys())
    y = list(v.values())
    plt.scatter(x, y)
    plt.plot(x, y)
plt.xlabel('k-value')
plt.ylabel('RMSE')
plt.xticks([1, 3, 5, 7, 9], fontsize=14)
plt.yticks(fontsize=14)
plt.legend(k_rmse_results, frameon=False, bbox_to_anchor=(1.5, 1))
By averaging the RMSE values obtained for each feature, across all of the k-values calculated above, and then selecting the best performing features from a sorted list of the results, we will hopefully be looking at those features whose performance is not overly dependent on the algorithm's parameters.
# Computing the average RMSE across the various k-values and sorting the features by performance.
feature_avg_rmse = {}
for feature, v in k_rmse_results.items():
    avg_rmse = np.mean(list(v.values()))
    feature_avg_rmse[feature] = avg_rmse
series_avg_rmse = pd.Series(feature_avg_rmse)
sorted_series_avg_rmse = series_avg_rmse.sort_values()
print(sorted_series_avg_rmse)
sorted_features = sorted_series_avg_rmse.index
engine_size          3344.832565
horsepower           4027.489770
width                4606.038200
city_mpg             4765.693185
curb_weight          4806.616929
highway_mpg          5110.302651
length               5225.204262
wheel_base           5302.831535
compression_ratio    6634.825336
bore                 7112.486331
peak_rpm             7520.358490
stroke               7684.507631
height               7927.534603
dtype: float64
Having sorted the RMSE results for all features, we will now run five models, consisting of: the single best feature, the best two features, the best three features, the best four features and the best five features.
We shall limit it to the best five features because, beyond number five ('curb_weight'), we move into the range where too many interrelated features would be used in the model, which could negatively impact the model's overall suitability.
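As a quick illustration of that concern, we can inspect the correlations among the best-performing features (a sketch using the sorted_features index computed above; looking at the top eight is an arbitrary choice). The features ranked sixth onwards ('highway_mpg', 'length', 'wheel_base') are strongly related to features already inside the top five.
# Correlations among the eight best-performing features (sketch):
# the features ranked sixth onwards are strongly related to the top five.
print(normalized_imports[sorted_features[:8]].corr().round(2))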
We shall also continue our exploration of the correct k-value that we should use in the final model, by again running each of these models with various k-values - 1 through 8.
def knn_train_test(train_cols, target_col, df):
    np.random.seed(1)
    # Randomize the order of rows in the data frame.
    shuffled_index = np.random.permutation(df.index)
    rand_df = df.reindex(shuffled_index)
    # Divide the number of rows in half and round.
    last_train_row = int(len(rand_df) / 2)
    # Select the first half as the training set
    # and the second half as the test set.
    train_df = rand_df.iloc[0:last_train_row]
    test_df = rand_df.iloc[last_train_row:]
    k_values = [1, 2, 3, 4, 5, 6, 7, 8]
    k_rmses = {}
    for k in k_values:
        # Fit a model using k nearest neighbors.
        knn = KNeighborsRegressor(n_neighbors=k, algorithm='brute')
        knn.fit(train_df[train_cols], train_df[target_col])
        # Make predictions using the model.
        predicted_labels = knn.predict(test_df[train_cols])
        # Calculate the RMSE.
        mse = mean_squared_error(test_df[target_col], predicted_labels)
        rmse = np.sqrt(mse)
        k_rmses[k] = rmse
    return k_rmses
models_results = {}
for nr_best_feats in range(1, 6):
    models_results['{} best features'.format(nr_best_feats)] = knn_train_test(
        sorted_features[:nr_best_feats],
        'price',
        normalized_imports
    )
models_results
{'1 best features': {1: 3475.2810813025476, 2: 2943.936340300614, 3: 2790.45642413665, 4: 3188.427394694385, 5: 3216.8365762064755, 6: 3521.046600098613, 7: 3631.9860373729207, 8: 3718.9009453931467},
 '2 best features': {1: 2778.412353831013, 2: 2687.9334815039783, 3: 2789.39413024789, 4: 2844.017557427423, 5: 2950.255443479798, 6: 3126.834022873721, 7: 3200.8528818656673, 8: 3424.405007309176},
 '3 best features': {1: 3538.458459177772, 2: 3479.958271913246, 3: 3343.2671970925558, 4: 3343.612859314053, 5: 3573.310158703976, 6: 3733.1236592832756, 7: 3665.506151479582, 8: 3746.1970732261298},
 '4 best features': {1: 3546.7043271406787, 2: 3289.8038725434653, 3: 3267.98652530774, 4: 3255.907469109187, 5: 3394.6046092393894, 6: 3624.223483273441, 7: 3674.722141239556, 8: 3759.3676154425716},
 '5 best features': {1: 2838.5257695773093, 2: 2879.0159566617326, 3: 2992.6702414093975, 4: 3133.9563398915648, 5: 3360.832156381264, 6: 3563.4746528087057, 7: 3568.1754419934873, 8: 3811.14991399243}}
for k, v in models_results.items():
    x = list(v.keys())
    y = list(v.values())
    plt.scatter(x, y)
    plt.plot(x, y)
plt.xlabel('k-value')
plt.ylabel('RMSE')
plt.xticks([1, 2, 3, 4, 5, 6, 7, 8], fontsize=14)
plt.yticks(fontsize=14)
plt.legend(['1 Feature', '2 Features', '3 Features', '4 Features', '5 Features'],
           frameon=False, bbox_to_anchor=(1.5, 1))
This analysis indicates that a model containing the two features 'engine_size' and 'horsepower' gives the best results across the board.
If we use a k-value of 2, this two-feature model achieves an RMSE of USD 2,688.
Our model is therefore able to predict the 'price' of cars in the test set with a typical error of USD 2,688. Actual prices lie between USD 5,118 and USD 45,400, a range of roughly USD 40,000, which means the error in our predictions amounts to approximately 7% of the overall price range.
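As a quick arithmetic check of that figure (a sketch; the RMSE of 2,688 is taken from the run above):
# Express the RMSE as a fraction of the overall price range.
price_range = numeric_imports['price'].max() - numeric_imports['price'].min()
print(price_range)         # ~40,282
print(2688 / price_range)  # ~0.067, i.e. roughly 7% of the range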
Having identified the features and the optimum k-value, we shall now examine whether our stipulation that the 'brute force' KNN algorithm be used in our modeling was the correct decision. We shall remove this stipulation from the following model, instead allowing KNeighborsRegressor to default to 'auto'.
np.random.seed(1)
# Randomize order of rows in data frame.
shuffled_index = np.random.permutation(normalized_imports.index)
rand_df = normalized_imports.reindex(shuffled_index)
# Divide number of rows in half and round.
last_train_row = int(len(rand_df) / 2)
# Select the first half and set as training set.
train_df = rand_df.iloc[0:last_train_row]
# Select the second half and set as test set.
test_df = rand_df.iloc[last_train_row:]
# Fit model using k nearest neighbors.
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(train_df[['engine_size', 'horsepower']], train_df['price'])
# Make predictions using model.
predicted_labels = knn.predict(test_df[['engine_size', 'horsepower']])
# Calculate and return RMSE.
mse = mean_squared_error(test_df['price'], predicted_labels)
rmse = np.sqrt(mse)
rmse
2657.7963807419765
By allowing KNeighborsRegressor to choose its own algorithm, we have further improved the performance by roughly USD 30.
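Since cross_val_score and KFold were imported at the start, a quick k-fold check of the final model is a natural sanity test of the single 50/50 split used throughout. This is a sketch, not part of the original analysis; the choice of 5 folds and random_state=1 is arbitrary.
# 5-fold cross-validation of the final two-feature model, to check that
# the 50/50 split above is not unusually favourable.
kf = KFold(n_splits=5, shuffle=True, random_state=1)
knn = KNeighborsRegressor(n_neighbors=2)
mses = cross_val_score(knn,
                       normalized_imports[['engine_size', 'horsepower']],
                       normalized_imports['price'],
                       scoring='neg_mean_squared_error',
                       cv=kf)
rmses = np.sqrt(np.abs(mses))
print(rmses.mean(), rmses.std())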
This project has developed a model for predicting the price of imported cars based on a number of features relevant to each car, as available from a 1985 dataset of car imports into the USA.
The best performing prediction model for 'price' consists of: the two features 'engine_size' and 'horsepower', a k-value of 2, and the KNeighborsRegressor algorithm parameter left at its default of 'auto'.
The Root Mean Squared Error (RMSE) achieved with this model is USD 2,658, which implies that our predicted prices will be off, on average, by USD 2,658. Across a range of prices in our dataset of approximately USD 40,000, this amounts to a typical error of about 7% of the price range.
The use of this model is restricted to predicting the prices of cars imported into the USA in 1985. The project has not looked at the effects of extrapolating this model outside of this market.