This is my first attempt at a machine learning model. The following notebook consists of developing a simple ML model and tuning it with different columns, k-values etc. Being me I couldn't obediently follow all the lessons steps. No, instead I've decided to fiddle with a few more knobs and combinations:
Links: Dataset My Github My LinkedIn
Imports:
import pandas as pd
import numpy as np
import random
import seaborn as sns
%matplotlib inline
import matplotlib.pyplot as plt
from numpy import arange
from matplotlib.patches import ConnectionPatch
from matplotlib.patches import Rectangle
from collections import Counter
from itertools import combinations
import matplotlib.image as mpimg
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score, KFold
cols = ['symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration', 'num_doors', 'body_style',
'drive_wheels', 'engine_location', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_type',
'num_cylinders', 'engine_size', 'fuel_system', 'bore', 'stroke', 'compression_rate', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg', 'price']
df = pd.read_csv('imports-85.data', names=cols)
df.head()
symboling | normalized_losses | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | wheel_base | ... | engine_size | fuel_system | bore | stroke | compression_rate | horsepower | peak_rpm | city_mpg | highway_mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
5 rows × 26 columns
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 205 entries, 0 to 204 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 symboling 205 non-null int64 1 normalized_losses 205 non-null object 2 make 205 non-null object 3 fuel_type 205 non-null object 4 aspiration 205 non-null object 5 num_doors 205 non-null object 6 body_style 205 non-null object 7 drive_wheels 205 non-null object 8 engine_location 205 non-null object 9 wheel_base 205 non-null float64 10 length 205 non-null float64 11 width 205 non-null float64 12 height 205 non-null float64 13 curb_weight 205 non-null int64 14 engine_type 205 non-null object 15 num_cylinders 205 non-null object 16 engine_size 205 non-null int64 17 fuel_system 205 non-null object 18 bore 205 non-null object 19 stroke 205 non-null object 20 compression_rate 205 non-null float64 21 horsepower 205 non-null object 22 peak_rpm 205 non-null object 23 city_mpg 205 non-null int64 24 highway_mpg 205 non-null int64 25 price 205 non-null object dtypes: float64(5), int64(5), object(16) memory usage: 41.8+ KB
df.describe(include='all')
symboling | normalized_losses | make | fuel_type | aspiration | num_doors | body_style | drive_wheels | engine_location | wheel_base | ... | engine_size | fuel_system | bore | stroke | compression_rate | horsepower | peak_rpm | city_mpg | highway_mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 205.000000 | 205 | 205 | 205 | 205 | 205 | 205 | 205 | 205 | 205.000000 | ... | 205.000000 | 205 | 205 | 205 | 205.000000 | 205 | 205 | 205.000000 | 205.000000 | 205 |
unique | NaN | 52 | 22 | 2 | 2 | 3 | 5 | 3 | 2 | NaN | ... | NaN | 8 | 39 | 37 | NaN | 60 | 24 | NaN | NaN | 187 |
top | NaN | ? | toyota | gas | std | four | sedan | fwd | front | NaN | ... | NaN | mpfi | 3.62 | 3.40 | NaN | 68 | 5500 | NaN | NaN | ? |
freq | NaN | 41 | 32 | 185 | 168 | 114 | 96 | 120 | 202 | NaN | ... | NaN | 94 | 23 | 20 | NaN | 19 | 37 | NaN | NaN | 4 |
mean | 0.834146 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 98.756585 | ... | 126.907317 | NaN | NaN | NaN | 10.142537 | NaN | NaN | 25.219512 | 30.751220 | NaN |
std | 1.245307 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 6.021776 | ... | 41.642693 | NaN | NaN | NaN | 3.972040 | NaN | NaN | 6.542142 | 6.886443 | NaN |
min | -2.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 86.600000 | ... | 61.000000 | NaN | NaN | NaN | 7.000000 | NaN | NaN | 13.000000 | 16.000000 | NaN |
25% | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 94.500000 | ... | 97.000000 | NaN | NaN | NaN | 8.600000 | NaN | NaN | 19.000000 | 25.000000 | NaN |
50% | 1.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 97.000000 | ... | 120.000000 | NaN | NaN | NaN | 9.000000 | NaN | NaN | 24.000000 | 30.000000 | NaN |
75% | 2.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 102.400000 | ... | 141.000000 | NaN | NaN | NaN | 9.400000 | NaN | NaN | 30.000000 | 34.000000 | NaN |
max | 3.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 120.900000 | ... | 326.000000 | NaN | NaN | NaN | 23.000000 | NaN | NaN | 49.000000 | 54.000000 | NaN |
11 rows × 26 columns
We have some missing values - instead of applying 1 method to a single dataframe, we'll create 6 different dataframes:
We'll also create normalized versions of the above dataframes: numeric_a_n, numeric_b_n, numeric_c_n
numeric_cols = ['normalized_losses', 'price', 'bore', 'stroke', 'horsepower', 'peak_rpm',]
df[numeric_cols] = np.where(df[numeric_cols] == '?', '', df[numeric_cols])
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric)
df[df.columns[df.isna().any()]].isna().sum()
normalized_losses 41 bore 4 stroke 4 horsepower 2 peak_rpm 2 price 4 dtype: int64
numeric = df.loc[:, df.dtypes != object].copy()
numeric_a = numeric.dropna(subset = ['price']).copy()
numeric_a = numeric_a.fillna(numeric.mean())
numeric_b = numeric.dropna(subset = [ 'bore', 'stroke', 'horsepower', 'peak_rpm', 'price']).copy()
numeric_b = numeric_b.fillna(numeric_b.mean())
corr_df = numeric_a.corr()['price'].to_frame().merge(numeric_b.corr()['price'],
left_index=True, right_index=True, suffixes=['_A_corr', '_B_corr'])
corr_df.sort_values('price_A_corr')
price_A_corr | price_B_corr | |
---|---|---|
highway_mpg | -0.704692 | -0.715590 |
city_mpg | -0.686571 | -0.702685 |
peak_rpm | -0.101616 | -0.104333 |
symboling | -0.082391 | -0.084118 |
compression_rate | 0.071107 | 0.069500 |
stroke | 0.082269 | 0.093746 |
normalized_losses | 0.133999 | 0.131736 |
height | 0.135486 | 0.138291 |
bore | 0.543155 | 0.546873 |
wheel_base | 0.584642 | 0.585793 |
length | 0.690628 | 0.695331 |
width | 0.751265 | 0.754273 |
horsepower | 0.809575 | 0.811027 |
curb_weight | 0.834415 | 0.835729 |
engine_size | 0.872335 | 0.888942 |
price | 1.000000 | 1.000000 |
How to fill in the missing price values for numeric_c?
In numeric a and b dataframes we've filled in the missing values, using the mean value for the column. Very simplistic aproach that we're going to modify and upgrade for the dataframe numeric_c. Our plan assumed being very thorough in filling in the missing price values in the last dataframe, that's why we've created a corelation dataframe for the 'price' column.
numeric_c = numeric.dropna(subset = [ 'bore', 'stroke', 'horsepower', 'peak_rpm']).copy()
numeric_c[numeric_c['price'].isnull()]
symboling | normalized_losses | wheel_base | length | width | height | curb_weight | engine_size | bore | stroke | compression_rate | horsepower | peak_rpm | city_mpg | highway_mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9 | 0 | NaN | 99.5 | 178.2 | 67.9 | 52.0 | 3053 | 131 | 3.13 | 3.40 | 7.0 | 160.0 | 5500.0 | 16 | 22 | NaN |
44 | 1 | NaN | 94.5 | 155.9 | 63.6 | 52.0 | 1874 | 90 | 3.03 | 3.11 | 9.6 | 70.0 | 5400.0 | 38 | 43 | NaN |
45 | 0 | NaN | 94.5 | 155.9 | 63.6 | 52.0 | 1909 | 90 | 3.03 | 3.11 | 9.6 | 70.0 | 5400.0 | 38 | 43 | NaN |
129 | 1 | NaN | 98.4 | 175.7 | 72.3 | 50.5 | 3366 | 203 | 3.94 | 3.11 | 10.0 | 288.0 | 5750.0 | 17 | 28 | NaN |
bins = numeric_c['engine_size'].value_counts(bins=20).index
numeric_c.groupby(pd.cut(numeric_c.engine_size, bins=bins))['price'].mean()
engine_size (87.5, 100.75] 7265.948276 (100.75, 114.0] 9754.628571 (114.0, 127.25] 11619.034483 (140.5, 153.75] 15516.047619 (127.25, 140.5] 17510.142857 (180.25, 193.5] 21027.000000 (153.75, 167.0] 16678.333333 (193.5, 206.75] 34528.000000 (167.0, 180.25] 17432.750000 (206.75, 220.0] 36318.333333 (233.25, 246.5] 34620.000000 (299.5, 312.75] 43180.000000 (246.5, 259.75] 33900.000000 (60.734, 74.25] 5151.000000 (74.25, 87.5] 5399.000000 (312.75, 326.0] 36000.000000 (220.0, 233.25] NaN (259.75, 273.0] NaN (273.0, 286.25] NaN (286.25, 299.5] NaN Name: price, dtype: float64
numeric_c[numeric_c['price'].isnull()]
symboling | normalized_losses | wheel_base | length | width | height | curb_weight | engine_size | bore | stroke | compression_rate | horsepower | peak_rpm | city_mpg | highway_mpg | price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9 | 0 | NaN | 99.5 | 178.2 | 67.9 | 52.0 | 3053 | 131 | 3.13 | 3.40 | 7.0 | 160.0 | 5500.0 | 16 | 22 | NaN |
44 | 1 | NaN | 94.5 | 155.9 | 63.6 | 52.0 | 1874 | 90 | 3.03 | 3.11 | 9.6 | 70.0 | 5400.0 | 38 | 43 | NaN |
45 | 0 | NaN | 94.5 | 155.9 | 63.6 | 52.0 | 1909 | 90 | 3.03 | 3.11 | 9.6 | 70.0 | 5400.0 | 38 | 43 | NaN |
129 | 1 | NaN | 98.4 | 175.7 | 72.3 | 50.5 | 3366 | 203 | 3.94 | 3.11 | 10.0 | 288.0 | 5750.0 | 17 | 28 | NaN |
numeric_c.loc[9,'price'] = 16520.750000
numeric_c.loc[44,'price'] = 7265.948276
numeric_c.loc[45,'price'] = 7265.948276
numeric_c.loc[129,'price'] = 34528.000000
numeric_c[numeric_c.columns[numeric_c.isna().any()]].isna().sum()
normalized_losses 39 dtype: int64
numeric_c = numeric_c.fillna(numeric_c.mean())
Now that we have 3 different datasets, lets create normalized versions of them. We'll add "_n" at the end of their names to for 'normalized'
price_col = numeric_a['price']
numeric_a_n = (numeric_a - numeric_a.min())/(numeric_a.max() - numeric_a.min()).copy()
numeric_a_n['price'] = price_col
price_col = numeric_b['price']
numeric_b_n = (numeric_b - numeric_b.min())/(numeric_b.max() - numeric_b.min()).copy()
numeric_b_n['price'] = price_col
price_col = numeric_c['price']
numeric_c_n = (numeric_c - numeric_c.min())/(numeric_c.max() - numeric_c.min()).copy()
numeric_c_n['price'] = price_col
In the following steps will create a model using only 1 column from the dataframe to predict the price value. We'll test out different dataframes and columns and check their performance.
def knn_train_test(df, feature_col, target_col, r=1):
# randomize
np.random.seed(r)
shuffled_index = np.random.permutation(df.index)
df = df.reindex(index = shuffled_index)
split_loc = int(0.5*len(df))
# split
train_set = df.iloc[:split_loc].copy()
test_set = df.iloc[split_loc:].copy()
# train and predict
knn = KNeighborsRegressor()
knn.fit(train_set[[feature_col]], train_set[target_col])
predictions = knn.predict(test_set[[feature_col]])
rmse = np.sqrt(mean_squared_error(test_set[target_col], predictions))
return rmse
all_features = numeric.columns.tolist()
all_features.remove('price')
df_list = [numeric_a, numeric_b, numeric_c]
df_n_list = [numeric_a_n, numeric_b_n, numeric_c_n]
def fill_dict(model, dlist=df_list, r=1):
rmse_dict_a = {}
rmse_dict_b = {}
rmse_dict_c = {}
for col in all_features:
rmse_dict_a[col] = model(dlist[0], col, 'price', r)
for col in all_features:
rmse_dict_b[col] = model(dlist[1], col, 'price', r)
for col in all_features:
rmse_dict_c[col] = model(dlist[2], col, 'price', r)
return rmse_dict_a, rmse_dict_b, rmse_dict_c
def give_results_df(dict1, dict2, dict3):
results_a = pd.DataFrame.from_dict(dict1, orient='index')
results_b = pd.DataFrame.from_dict(dict2, orient='index')
results_c = pd.DataFrame.from_dict(dict3, orient='index')
results_frame_ab = results_a.merge(results_b, left_index=True, right_index=True, suffixes=['_numeric_A', '_numeric_B'])
results_frame = results_frame_ab.merge(results_c, left_index=True, right_index=True)
results_frame = results_frame.rename(columns={results_frame.columns[0]: 'numeric_a', results_frame.columns[1]: 'numeric_b',
results_frame.columns[2]: 'numeric_c'
})
results_frame = results_frame.sort_values(results_frame.loc[results_frame.idxmin(axis=0)[0],:].idxmin(axis=0))
return results_frame
rmse_dict_a, rmse_dict_b, rmse_dict_c = fill_dict(knn_train_test)
results_frame = give_results_df(rmse_dict_a, rmse_dict_b, rmse_dict_c )
results_frame
numeric_a | numeric_b | numeric_c | |
---|---|---|---|
engine_size | 3271.449938 | 4206.269209 | 3014.293169 |
curb_weight | 4445.141920 | 4594.629028 | 3667.924168 |
highway_mpg | 4643.046278 | 4368.230464 | 4139.949639 |
width | 4772.459149 | 5038.643838 | 4290.488430 |
horsepower | 4064.464572 | 4543.317619 | 4441.723870 |
city_mpg | 4760.449842 | 4440.237523 | 4717.122297 |
wheel_base | 5460.851458 | 5692.794469 | 6121.214509 |
length | 5428.810101 | 5212.513611 | 6186.265681 |
compression_rate | 6610.812153 | 7020.882648 | 6210.718457 |
peak_rpm | 7649.170564 | 7309.974616 | 7043.426558 |
normalized_losses | 7339.675755 | 7083.841259 | 7217.866305 |
stroke | 7954.301034 | 7217.395544 | 7269.245529 |
bore | 6778.637424 | 7225.667155 | 7590.625100 |
symboling | 7979.434920 | 7368.869758 | 8050.034443 |
height | 8073.119759 | 7574.428009 | 8304.016043 |
# lets create some functions we're going to use constantly on plots:
def spines(ax,yl='RMSE',xl=''):
x1 = ax.spines['right'].set_visible(False)
x2 = ax.spines['top'].set_visible(False)
x3 = ax.spines['left'].set_linewidth(2)
x4 = ax.spines['bottom'].set_linewidth(2)
x5 = ax.set_ylabel(yl,size=14)
x6 = ax.set_xlabel(xl,size=14)
return x1, x2, x3, x4, x5, x6
def spines2():
x1 = ax.spines['right'].set_visible(False)
x2 = ax.spines['top'].set_visible(False)
x3 = ax.spines['left'].set_linewidth(3)
x4 = ax.spines['bottom'].set_linewidth(3)
return x1, x2, x3, x4
labels = ['numeric_a','numeric_b' ,'numeric_c']
y_list = [results_frame['numeric_a'], results_frame['numeric_b'], results_frame['numeric_c']]
def scatter_3(x, y_list, labels, size=10, alpha = 1):
p1 = plt.scatter( x=x, y=y_list[0] ,label=labels[0], s=size)
p2 = plt.scatter( x=x, y=y_list[1] ,label=labels[1], s=size)
p3 = plt.scatter( x=x, y=y_list[2] ,label=labels[2], s=size)
return p1, p2, p3
def scatter_3v2(x, y_list, labels, size=1, alpha = 1, z=True):
p1 = sns.stripplot(x=x, y=y_list[0], label=labels[0], s=size, alpha=alpha, color='#1f77b4', jitter=z)
p2 = sns.stripplot(x=x, y=y_list[1], label=labels[1], s=size, alpha=alpha, color='#ff7f0e' , jitter=z)
p3 = sns.stripplot(x=x, y=y_list[2], label=labels[2], s=size, alpha=alpha, color='green', jitter=z)
return p1, p2, p3
fig, ax = plt.subplots(figsize=(16,8))
scatter_3(results_frame.index, y_list, labels, 50)
spines(ax,xl='column')
ax.tick_params(axis = 'x',labelsize=12, rotation=65)
plt.title('RMSE results for each column and dataframe')
plt.legend()
plt.show()
# create a function with a random seed input
def knn_train_test_v2(df, feature_col, target_col, r):
# randomize
np.random.seed(r)
shuffled_index = np.random.permutation(df.index)
df = df.reindex(index = shuffled_index)
split_loc = int(0.5*len(df))
# split
train_set = df.iloc[:split_loc].copy()
test_set = df.iloc[split_loc:].copy()
# train and test
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(train_set[[feature_col]], train_set[target_col])
predictions = knn.predict(test_set[[feature_col]])
rmse = np.sqrt(mean_squared_error(test_set[target_col], predictions))
return rmse
df_list = [numeric_a, numeric_b, numeric_c]
results = []
def df_seed_check(df_list, col_name,n):
best_results = []
for df in df_list:
seed_dictionary = {}
for i in list(range(0,n)):
seed_dictionary[i] = knn_train_test_v2(df,col_name, 'price', i)
results.append(seed_dictionary[i])
best_results.append(min(seed_dictionary.items(), key=lambda x: x[1]))
return best_results
results = []
labels = ['numeric_a','numeric_b' ,'numeric_c']
df_seed_check(df_list,'engine_size',1000)
num_a_results = results[:1000].copy()
num_b_results = results[1000:2000].copy()
num_c_results = results[2000:].copy()
fig, ax = plt.subplots(figsize=(16,12))
fig.suptitle('Why random seed is important?')
grid = plt.GridSpec(8, 1, wspace=0, hspace=19.3)
ax1 = plt.subplot(grid[0:6, 0])
y_list = [num_a_results, num_b_results, num_c_results]
scatter_3(list(range(1,1001)), y_list, labels, 10)
plt.legend()
plt.title(' engine_size column RMSE results')
plt.xlabel('Random seed', size=16)
plt.ylabel('RMSE', size=16)
ax1.add_patch(Rectangle((0, 2300), 1000, 400, alpha=0.3, facecolor = 'grey'))
plt.legend(loc=2)
plt.ylim(2000,6500)
plt.arrow(2,-3000, 10,500)
ax2 = plt.subplot(grid[6:, 0])
scatter_3(list(range(1,1001)), y_list, labels, 40)
plt.legend()
plt.title('engine_size columns results ZOOMED IN')
spines(ax1,xl='Random seed')
ax.tick_params(labelsize=12)
plt.ylim(2300,2700)
plt.legend(loc=3)
con = ConnectionPatch(
xyA=(0.1, 0.1), coordsA=ax1.transAxes,
xyB=(0.1, 2759.7), coordsB=ax2.get_yaxis_transform(),
arrowstyle="->", linewidth=1.5)
ax2.add_artist(con)
plt.show()
The above scatter plot presents 1000 rmse results for each of our 3 dataframes, but only the engine size column. For each result we've changed the random seed number. This clearly shows how different the results can be if we change the order of our dataframes index and split.
At first using various random seeds looks very attractive - we can improve our results in a very easy way. But we're only improving them in this single dataset example... and our end goal is to create a model that can be 'released into the wild' and work on various datasets isn't it? Thus we should pursue not the single best result but mean of many results. All of this comes at a price: computing models with multiple random seeds takes time and computational power. We must try to answer the question: how much is enough?
nums = []
x = 1
for n in list(range(1,7)):
x *= 4
nums.append(x)
nums
[4, 16, 64, 256, 1024, 4096]
mean_list = []
std_list = []
best_list = []
for n in nums:
results = []
df_list = [numeric_c_n]
x = df_seed_check(df_list,'engine_size',n)
mean_list.append(np.std(results))
std_list.append(np.mean(results))
best_list.append(x)
seed_frame = pd.DataFrame([mean_list,std_list,best_list], index=('std','mean','best result'), columns=[str(col) for col in nums])
seed_frame = seed_frame.transpose()
seed_frame['best result'] = seed_frame['best result'].str[0].str[1]
seed_frame['mean change'] =(seed_frame['mean'] - seed_frame['mean'].shift()) / seed_frame['mean'].shift() * 100
seed_frame['best result change'] =(seed_frame['best result'] - seed_frame['best result'].shift()) / seed_frame['best result'].shift() * 100
seed_frame.index.name = 'number of seeds'
seed_frame
std | mean | best result | mean change | best result change | |
---|---|---|---|---|---|
number of seeds | |||||
4 | 169.773707 | 3220.754961 | 3014.293169 | NaN | NaN |
16 | 531.362191 | 3388.531008 | 2636.971126 | 5.209215 | -12.517762 |
64 | 420.287777 | 3438.397706 | 2636.971126 | 1.471632 | 0.000000 |
256 | 447.763487 | 3397.101321 | 2636.971126 | -1.201036 | 0.000000 |
1024 | 453.550087 | 3385.32257 | 2353.206318 | -0.346729 | -10.761013 |
4096 | 439.748785 | 3372.912832 | 2306.099998 | -0.366575 | -2.001793 |
df_list = [numeric_c]
%timeit df_seed_check(df_list,'engine_size',4)
21.8 ms ± 731 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df_seed_check(df_list,'engine_size',16)
84.8 ms ± 1.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df_seed_check(df_list,'engine_size',256)
1.43 s ± 79.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df_seed_check(df_list,'engine_size',1024)
5.59 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
df_n_list = [numeric_a_n, numeric_b_n, numeric_c_n]
results = []
labels = ['numeric_a_n','numeric_b_n' ,'numeric_c_n']
df_seed_check(df_n_list,'engine_size',100)
num_a_results = results[:100].copy()
num_b_results = results[100:200].copy()
num_c_results = results[200:].copy()
fig, ax = plt.subplots(figsize=(16,8))
ax1 = plt.subplot()
y_list = [num_a_results, num_b_results, num_c_results]
scatter_3(list(range(1,101)), y_list, labels=labels, size=50)
plt.title('Engine_size column RMSE results - 100 randoms seeds',size=18)
plt.legend(loc=1)
spines(ax1,xl='Random seed')
plt.show()
# create dataframes with average results for every column:
# non normalized results:
df_list = [numeric_a, numeric_b, numeric_c]
for col in results_frame.columns:
results_frame[col].values[:] = 0
for n in list(range(0,100)):
rmse_dict_a, rmse_dict_b, rmse_dict_c = fill_dict(knn_train_test,dlist=df_list ,r=n)
results_df = give_results_df(rmse_dict_a, rmse_dict_b, rmse_dict_c )
results_frame += results_df
results_frame_non = results_frame / len(list(range(0,100)))
# normalized results:
for col in results_frame.columns:
results_frame[col].values[:] = 0
for n in list(range(0,100)):
rmse_dict_a, rmse_dict_b, rmse_dict_c = fill_dict(knn_train_test,dlist=df_n_list ,r=n)
results_df = give_results_df(rmse_dict_a, rmse_dict_b, rmse_dict_c )
results_frame += results_df
results_frame_norm = results_frame / len(list(range(0,100)))
results_frame_norm.columns = ['numeric_a_n','numeric_b_n','numeric_c_n']
y_list1 = [results_frame_non['numeric_a'], results_frame_non['numeric_b'], results_frame_non['numeric_c']]
y_list2 = [results_frame_norm['numeric_a_n'], results_frame_norm['numeric_b_n'], results_frame_norm['numeric_c_n']]
fig, ax = plt.subplots(figsize=(16,8))
scatter_3v2(results_frame_non.index, y_list1, labels, 13, alpha=0.6)
scatter_3v2(results_frame_norm.index, y_list2, labels, 13)
spines(ax,xl='column')
ax.tick_params(axis = 'x',labelsize=12, rotation=65)
handles, labels = ax.get_legend_handles_labels()
l = plt.legend(handles[::15], ['numeric_a','numeric_b','numeric_c','numeric_a_n','numeric_b_n','numeric_c_n'], bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title('RMSE average results for each column and dataframe - 100 random seeds')
plt.show()
Observations:
# lets check our average results BUT only for the first 6 columns (the idea is -
# we know we won't use all of the columns for multi-column models)
pd.concat([results_frame_norm[:6].mean(), results_frame_non[:6].mean()])
numeric_a_n 4358.715630 numeric_b_n 4396.028186 numeric_c_n 4344.353694 numeric_a 4390.045094 numeric_b 4429.656350 numeric_c 4383.198739 dtype: float64
Knowing that in the future we'll try to develop multiple-column models it's worth trying to guess which columns we're going to use in those models. Engine_size column is always going to be included, but what other columns are worthy of our attention?
results_frame_norm[:6]
numeric_a_n | numeric_b_n | numeric_c_n | |
---|---|---|---|
engine_size | 3646.149961 | 3503.280847 | 3411.582309 |
curb_weight | 4333.835797 | 4479.708402 | 4381.759039 |
highway_mpg | 4521.929554 | 4609.537973 | 4783.851207 |
width | 4647.999496 | 4747.806227 | 4679.850873 |
horsepower | 4334.431229 | 4596.600987 | 4367.352011 |
city_mpg | 4667.947744 | 4439.234678 | 4441.726729 |
we'll use the index of the above dataframe for column selection:
Lets check average rmse results for every dataframe for 2 single columns:
results_frame_norm[:6][:2].mean()
numeric_a_n 3989.992879 numeric_b_n 3991.494625 numeric_c_n 3896.670674 dtype: float64
Lets check average rmse results for every dataframe for 4 columns:
results_frame_norm[:6][:4].mean()
numeric_a_n 4287.478702 numeric_b_n 4335.083362 numeric_c_n 4314.260857 dtype: float64
Lets check average rmse results for every dataframe for 6 columns:
results_frame_norm[:6][:6].mean()
numeric_a_n 4358.715630 numeric_b_n 4396.028186 numeric_c_n 4344.353694 dtype: float64
Lets include multiple columns in our model, we'll use the index of the dataframe with best columns(step 2.4.) as an indicator of which columns should we use.
# lets change our previous train/ test function so that it includes multiple columns:
def knn_train_test_v3(df, train_cols, target_col, seed, k=5):
# randomize
np.random.seed(seed)
shuffled_index = np.random.permutation(df.index)
df = df.reindex(index = shuffled_index)
split_loc = int(0.5*len(df))
# split
train_set = df.iloc[:split_loc].copy()
test_set = df.iloc[split_loc:].copy()
rmse_dict = {}
# futureproof: we're not going to use multiple k values yet, but we will in the future
knn = KNeighborsRegressor(n_neighbors=k)
#model:
knn.fit(train_set[train_cols], train_set[target_col])
predictions = knn.predict(test_set[train_cols])
rmse = np.sqrt(mean_squared_error(test_set[target_col], predictions))
return rmse
def fill_dict_v2(model,col_list, dlist=df_n_list, r=1, k=5, ):
rmse_dict_a = {}
rmse_dict_b = {}
rmse_dict_c = {}
rmse_dict_a[str(col_list)] = model(dlist[0], col_list, 'price', r, k)
rmse_dict_b[str(col_list)] = model(dlist[1], col_list, 'price', r, k)
rmse_dict_c[str(col_list)] = model(dlist[2], col_list, 'price', r, k)
return rmse_dict_a, rmse_dict_b, rmse_dict_c
# loop of columns:
df_results_list = []
for n in list(range(2,8)):
col_list = results_frame_norm.index[:n].tolist()
dict_a, dict_b, dict_c = fill_dict_v2(knn_train_test_v3, col_list)
results_frame = give_results_df(dict_a, dict_b, dict_c)
df_results_list.append(results_frame)
mutli_cols_df = pd.concat(df_results_list)
mutli_cols_df.columns = ['numeric_a_n', 'numeric_b_n', 'numeric_c_n']
mutli_cols_df.sort_values('numeric_c_n')
numeric_a_n | numeric_b_n | numeric_c_n | |
---|---|---|---|
['engine_size', 'curb_weight', 'highway_mpg', 'width'] | 3596.417368 | 3655.264952 | 2532.332056 |
['engine_size', 'curb_weight', 'highway_mpg', 'width', 'horsepower', 'city_mpg'] | 3416.042892 | 3628.898274 | 2644.465049 |
['engine_size', 'curb_weight', 'highway_mpg', 'width', 'horsepower'] | 3410.217013 | 3789.590624 | 2725.092022 |
['engine_size', 'curb_weight'] | 3330.869800 | 3657.896664 | 2791.664988 |
['engine_size', 'curb_weight', 'highway_mpg'] | 3413.544315 | 3848.872270 | 2964.095155 |
['engine_size', 'curb_weight', 'highway_mpg', 'width', 'horsepower', 'city_mpg', 'wheel_base'] | 3659.516895 | 3999.202836 | 3119.720636 |
That's a significant improvement, lets not forget though: it's a single test on 1 random seed.
for col in mutli_cols_df.columns:
mutli_cols_df[col].values[:] = 0
cols_frame = mutli_cols_df[:0].copy()
# loop of random seeds:
for i in list(range(0,100)):
# loop of columns:
df_results_list = []
for n in list(range(2,8)):
col_list = results_frame_norm.index[:n].tolist()
dict_a, dict_b, dict_c = fill_dict_v2(knn_train_test_v3, col_list, r=i)
results_frame = give_results_df(dict_a, dict_b, dict_c)
df_results_list.append(results_frame)
to_add = pd.concat(df_results_list)
to_add.columns = ['numeric_a_n', 'numeric_b_n', 'numeric_c_n']
cols_frame = cols_frame.append(to_add)
cols_frame[:5]
numeric_a_n | numeric_b_n | numeric_c_n | |
---|---|---|---|
['engine_size', 'curb_weight'] | 3228.359525 | 4256.826853 | 3477.979752 |
['engine_size', 'curb_weight', 'highway_mpg'] | 3125.160344 | 4255.565360 | 3621.460394 |
['engine_size', 'curb_weight', 'highway_mpg', 'width'] | 3273.331358 | 4694.873426 | 3388.959768 |
['engine_size', 'curb_weight', 'highway_mpg', 'width', 'horsepower'] | 3112.224563 | 4329.296422 | 3637.414232 |
['engine_size', 'curb_weight', 'highway_mpg', 'width', 'horsepower', 'city_mpg'] | 3241.827913 | 4052.052106 | 3645.852617 |
Lets group the results by their input columns and display the mean value for every column combination:
mean_frame = cols_frame.groupby(level=0).mean()
mean_frame['min value'] = mean_frame.min(axis=1)
mean_frame['best df'] = mean_frame.loc[:,:'numeric_c_n'].idxmin(axis=1)
mean_frame.sort_values('min value')[:5]
numeric_a_n | numeric_b_n | numeric_c_n | min value | best df | |
---|---|---|---|---|---|
['engine_size', 'curb_weight'] | 3682.784458 | 3753.830141 | 3595.801513 | 3595.801513 | numeric_c_n |
['engine_size', 'curb_weight', 'highway_mpg', 'width', 'horsepower', 'city_mpg'] | 3629.039749 | 3733.883104 | 3676.055245 | 3629.039749 | numeric_a_n |
['engine_size', 'curb_weight', 'highway_mpg', 'width', 'horsepower'] | 3661.143393 | 3762.183618 | 3713.839165 | 3661.143393 | numeric_a_n |
['engine_size', 'curb_weight', 'highway_mpg'] | 3744.533452 | 3836.744851 | 3722.420131 | 3722.420131 | numeric_c_n |
['engine_size', 'curb_weight', 'highway_mpg', 'width'] | 3762.442996 | 3843.108108 | 3792.881284 | 3762.442996 | numeric_a_n |
Mean observations:
Lets group the results by their input columns and display the minimum value for every column combination:
min_frame = cols_frame.groupby(level=0).min()
min_frame['min value'] = min_frame.min(axis=1)
min_frame['best df'] = min_frame.loc[:,:'numeric_c_n'].idxmin(axis=1)
min_frame.sort_values('min value')
numeric_a_n | numeric_b_n | numeric_c_n | min value | best df | |
---|---|---|---|---|---|
['engine_size', 'curb_weight', 'highway_mpg', 'width', 'horsepower', 'city_mpg'] | 2318.076629 | 2324.881102 | 2249.517922 | 2249.517922 | numeric_c_n |
['engine_size', 'curb_weight', 'highway_mpg', 'width', 'horsepower'] | 2276.782559 | 2387.439304 | 2383.434823 | 2276.782559 | numeric_a_n |
['engine_size', 'curb_weight', 'highway_mpg'] | 2510.426401 | 2423.967138 | 2333.405493 | 2333.405493 | numeric_c_n |
['engine_size', 'curb_weight', 'highway_mpg', 'width'] | 2574.248405 | 2513.923029 | 2396.464561 | 2396.464561 | numeric_c_n |
['engine_size', 'curb_weight'] | 2513.273993 | 2469.453935 | 2568.686643 | 2469.453935 | numeric_b_n |
['engine_size', 'curb_weight', 'highway_mpg', 'width', 'horsepower', 'city_mpg', 'wheel_base'] | 2494.004238 | 2506.046462 | 2504.532566 | 2494.004238 | numeric_a_n |
cols_frame.groupby(level=0).min().min()
numeric_a_n 2276.782559 numeric_b_n 2324.881102 numeric_c_n 2249.517922 dtype: float64
Min value observations:
std_frame = cols_frame.groupby(level=0).std()
std_frame['min value'] = std_frame.min(axis=1)
std_frame['best df'] = std_frame.loc[:,:'numeric_c_n'].idxmin(axis=1)
std_frame.sort_values('min value')
numeric_a_n | numeric_b_n | numeric_c_n | min value | best df | |
---|---|---|---|---|---|
['engine_size', 'curb_weight', 'highway_mpg'] | 654.838537 | 738.078229 | 563.156199 | 563.156199 | numeric_c_n |
['engine_size', 'curb_weight'] | 647.510903 | 730.505719 | 573.764612 | 573.764612 | numeric_c_n |
['engine_size', 'curb_weight', 'highway_mpg', 'width'] | 693.940031 | 778.047962 | 638.636391 | 638.636391 | numeric_c_n |
['engine_size', 'curb_weight', 'highway_mpg', 'width', 'horsepower', 'city_mpg', 'wheel_base'] | 731.347721 | 794.591040 | 685.112527 | 685.112527 | numeric_c_n |
['engine_size', 'curb_weight', 'highway_mpg', 'width', 'horsepower'] | 715.482269 | 773.335242 | 685.414110 | 685.414110 | numeric_c_n |
['engine_size', 'curb_weight', 'highway_mpg', 'width', 'horsepower', 'city_mpg'] | 713.580811 | 776.581351 | 689.375058 | 689.375058 | numeric_c_n |
cols_frame.groupby(level=0).std().min()
numeric_a_n 647.510903 numeric_b_n 730.505719 numeric_c_n 563.156199 dtype: float64
Std value observations:
Having tested multivariate model, we've improved our results. But lets not forget how we choose the columns for that model:
That is probably not the best way of selecting columns for multivariate model. After all the univariate model results are based on single column results, we shouldn't treat that as a certain indicator of those columns multivariate performance. Instead: lets try more column combinations
best_cols_list = results_frame_norm.index[:7].tolist()
top6cols = best_cols_list[1:7]
best_cols_list
for col in mutli_cols_df.columns:
mutli_cols_df[col].values[:] = 0
cols_frame = mutli_cols_df[:0].copy()
# random seed loop:
for i in list(range(0,100)):
# column combinations loop:
for n in list(range(1,7)):
cc = list(combinations(top6cols,n))
cols_list = []
for el in cc:
cols_list.append(list(el))
for el in cols_list:
el.append(best_cols_list[0])
# cols list ready now
df_results_list = []
for el in cols_list:
dict_a, dict_b, dict_c = fill_dict_v2(knn_train_test_v3, el, r=i)
results_frame = give_results_df(dict_a, dict_b, dict_c)
df_results_list.append(results_frame)
to_add = pd.concat(df_results_list)
to_add.columns = ['numeric_a_n', 'numeric_b_n', 'numeric_c_n']
cols_frame = cols_frame.append(to_add)
cols_frame[:10]
numeric_a_n | numeric_b_n | numeric_c_n | |
---|---|---|---|
['curb_weight', 'engine_size'] | 3228.359525 | 4256.826853 | 3477.979752 |
['highway_mpg', 'engine_size'] | 3062.032054 | 3396.052505 | 3950.420641 |
['width', 'engine_size'] | 3532.581550 | 4505.704490 | 3662.583547 |
['horsepower', 'engine_size'] | 3544.814633 | 3804.656531 | 3061.882330 |
['city_mpg', 'engine_size'] | 3435.218515 | 3756.841251 | 2770.972665 |
['wheel_base', 'engine_size'] | 3824.627926 | 4871.690719 | 4588.332954 |
['curb_weight', 'highway_mpg', 'engine_size'] | 3125.160344 | 4255.565360 | 3621.460394 |
['curb_weight', 'width', 'engine_size'] | 3371.375995 | 4892.124757 | 3339.438030 |
['curb_weight', 'horsepower', 'engine_size'] | 3649.193881 | 4230.180385 | 4282.134186 |
['curb_weight', 'city_mpg', 'engine_size'] | 3155.800703 | 4372.510420 | 3557.484647 |
That's a few rows to process and analyse, we'll start with a simple visualization, to help us understand the next steps.
cols_frame_plot = cols_frame.copy()
index_list = []
for el in cols_frame_plot.index:
index_list.append((str(el.count(',')+1)+' cols'))
cols_frame_plot.index = index_list
cols_frame_plot = cols_frame_plot.sort_index()
mean_frame = cols_frame.groupby(level=0).mean()
index_list = []
for el in mean_frame.index:
index_list.append((str(el.count(',')+1)+' cols'))
mean_frame_plot = mean_frame.copy()
mean_frame_plot.index = index_list
mean_frame_plot = mean_frame_plot.sort_index()
mean_frame_plot2 = mean_frame_plot.groupby(level=0).mean()
fig, ax = plt.subplots(figsize=(16,23))
fig.subplots_adjust(hspace=0.2, wspace=0.2)
ax0=plt.subplot(311)
y_list = [cols_frame_plot['numeric_a_n'], cols_frame_plot['numeric_b_n'], cols_frame_plot['numeric_c_n']]
scatter_3v2(cols_frame_plot.index,y_list,labels, size=1.2, alpha = 1, z=0.3)
plt.title('Plot 1: Results (63 column combinations, 100 random seeds)',size=14)
handles, labels = ax0.get_legend_handles_labels()
l = plt.legend(handles[::6], ['numeric_a_n','numeric_b_n','numeric_c_n'], loc=1, borderaxespad=0.)
# legend points to small? solution:
for el in l.legendHandles:
el._sizes = [40]
ax0.set_facecolor('black')
ax1=plt.subplot(312)
y_list = [mean_frame_plot['numeric_a_n'], mean_frame_plot['numeric_b_n'], mean_frame_plot['numeric_c_n']]
scatter_3v2(mean_frame_plot.index,y_list,labels, size=8, alpha = 1)
handles, labels = ax1.get_legend_handles_labels()
l = plt.legend(handles[::6], ['numeric_a_n','numeric_b_n','numeric_c_n'], loc=1, borderaxespad=0.)
plt.title('Plot 2: Average results for every column combination',size=14)
ax2=plt.subplot(313)
y_list = [mean_frame_plot2['numeric_a_n'], mean_frame_plot2['numeric_b_n'], mean_frame_plot2['numeric_c_n']]
scatter_3v2(mean_frame_plot2.index, y_list, labels, size=12, alpha = 1)
handles, labels = ax2.get_legend_handles_labels()
l = plt.legend(handles[::6], ['numeric_a_n','numeric_b_n','numeric_c_n'], loc=2, borderaxespad=0.)
plt.title('Plot 3: Average results for column count',size=14)
axes = [ax0, ax1, ax2]
for ax in axes:
spines(ax, xl='Column count')
# ax.set_ylabel('Rmse')
# ax.set_xlabel('Column count')
plt.show()
The three plots may look a bit overwhelming at first but it's shouldn't be hard to understand:
Now instead of trying to guess which dot is lower, lets look at the numbers:
# lets group all the '100 random seed runs' into 1 row:
mean_frame = cols_frame.groupby(level=0).mean()
# analyze results:
mean_frame['min value'] = mean_frame.min(axis=1)
mean_frame['best df'] = mean_frame.loc[:,:'numeric_c_n'].idxmin(axis=1)
mean_frame.sort_values('min value')[:5]
numeric_a_n | numeric_b_n | numeric_c_n | min value | best df | |
---|---|---|---|---|---|
['horsepower', 'city_mpg', 'engine_size'] | 3486.106650 | 3601.674566 | 3472.221711 | 3472.221711 | numeric_c_n |
['highway_mpg', 'engine_size'] | 3479.492569 | 3663.626929 | 3548.028924 | 3479.492569 | numeric_a_n |
['horsepower', 'engine_size'] | 3592.997303 | 3634.020534 | 3501.877159 | 3501.877159 | numeric_c_n |
['highway_mpg', 'horsepower', 'engine_size'] | 3503.590497 | 3688.249982 | 3611.423671 | 3503.590497 | numeric_a_n |
['highway_mpg', 'horsepower', 'city_mpg', 'engine_size'] | 3504.043639 | 3692.797647 | 3569.490424 | 3504.043639 | numeric_a_n |
(sidenote: remember which columns and dataframe were considered the best in step 2.4? are they different to the columns in the above table?)
Quick reminder: every row is an individual column combination that had gone trough our model 100 times with a different random seed each time. Then we've calculated the mean of those 100 results for every combination and now we can see the best (lowest) 5 results above.
Observations:
# lets group all the '100 random seed runs' into 1 row:
std_frame = cols_frame.groupby(level=0).std()
# analyze results:
std_frame['min value'] = std_frame.min(axis=1)
std_frame['best df'] = std_frame.loc[:,:'numeric_c_n'].idxmin(axis=1)
std_frame.sort_values('min value')[:5]
numeric_a_n | numeric_b_n | numeric_c_n | min value | best df | |
---|---|---|---|---|---|
['highway_mpg', 'city_mpg', 'engine_size'] | 510.436332 | 651.061747 | 530.402643 | 510.436332 | numeric_a_n |
['highway_mpg', 'engine_size'] | 585.324504 | 596.409823 | 511.322197 | 511.322197 | numeric_c_n |
['curb_weight', 'highway_mpg', 'engine_size'] | 654.924366 | 738.105358 | 563.106099 | 563.106099 | numeric_c_n |
['curb_weight', 'engine_size'] | 647.274010 | 730.527653 | 573.741237 | 573.741237 | numeric_c_n |
['city_mpg', 'engine_size'] | 577.116755 | 717.079346 | 575.689823 | 575.689823 | numeric_c_n |
Observations:
# lets group all the '100 random seed runs' into 1 row:
min_frame = cols_frame.groupby(level=0).min()
# analyze results:
min_frame['min value'] = min_frame.min(axis=1)
min_frame['best df'] = min_frame.loc[:,:'numeric_c_n'].idxmin(axis=1)
min_frame.sort_values('min value')[:5]
numeric_a_n | numeric_b_n | numeric_c_n | min value | best df | |
---|---|---|---|---|---|
['curb_weight', 'highway_mpg', 'horsepower', 'city_mpg', 'engine_size'] | 2232.838670 | 2586.170058 | 2556.594386 | 2232.838670 | numeric_a_n |
['curb_weight', 'horsepower', 'city_mpg', 'engine_size'] | 2247.562752 | 2449.312953 | 2661.953333 | 2247.562752 | numeric_a_n |
['curb_weight', 'highway_mpg', 'width', 'horsepower', 'city_mpg', 'engine_size'] | 2318.076629 | 2324.881102 | 2249.083822 | 2249.083822 | numeric_c_n |
['curb_weight', 'width', 'horsepower', 'engine_size'] | 2292.644801 | 2253.987320 | 2452.391139 | 2253.987320 | numeric_b_n |
['curb_weight', 'highway_mpg', 'horsepower', 'engine_size'] | 2256.006738 | 2540.914745 | 2625.906921 | 2256.006738 | numeric_a_n |
Having looked at the mean of the results and their std, it's worth taking a peak at what is the single best result and what combination achieved it ('curb_weight', 'highway_mpg', 'horsepower', 'city_mpg', 'engine_size' columns on a numeric_a_n dataframe)
std_frame.loc["['curb_weight', 'highway_mpg', 'horsepower', 'city_mpg', 'engine_size']"]
numeric_a_n 679.981298 numeric_b_n 782.460179 numeric_c_n 647.11019 min value 647.11019 best df numeric_c_n Name: ['curb_weight', 'highway_mpg', 'horsepower', 'city_mpg', 'engine_size'], dtype: object
mean_frame.loc["['curb_weight', 'highway_mpg', 'horsepower', 'city_mpg', 'engine_size']"]
numeric_a_n 3678.997546 numeric_b_n 3829.775032 numeric_c_n 3787.792919 min value 3678.997546 best df numeric_a_n Name: ['curb_weight', 'highway_mpg', 'horsepower', 'city_mpg', 'engine_size'], dtype: object
Now lets recreate all the steps from previous chapter and multiply them by 25 'k' values... Woah that's like 3 dataframes, 63 column combinations, 100 random seeds and now 25 different k values? Remember we used to test more than 1000 random seeds?
best_cols_list = results_frame_norm.index[:7].tolist()
top6cols = best_cols_list[1:7]
best_cols_list
for col in mutli_cols_df.columns:
mutli_cols_df[col].values[:] = 0
cols_framek = mutli_cols_df[:0].copy()
# k value loop (26)
for z in list(range(1,26)):
# random seed loop - 100:
for i in list(range(0,100)):
# column combinations loop(7):
for n in list(range(1,7)):
cc = list(combinations(top6cols,n))
cols_list = []
for el in cc:
cols_list.append(list(el))
for el in cols_list:
el.append(best_cols_list[0])
# cols list ready now
df_results_list = []
for el in cols_list:
dict_a, dict_b, dict_c = fill_dict_v2(knn_train_test_v3, el, r=i, k=z)
results_frame = give_results_df(dict_a, dict_b, dict_c)
results_frame = results_frame.rename(index={results_frame.index[-1]: results_frame.index[-1]+' k: '+str(z) })
df_results_list.append(results_frame)
to_add = pd.concat(df_results_list)
to_add.columns = ['numeric_a_n', 'numeric_b_n', 'numeric_c_n']
cols_framek = cols_framek.append(to_add)
cols_framek[:10]
numeric_a_n | numeric_b_n | numeric_c_n | |
---|---|---|---|
['curb_weight', 'engine_size'] k: 1 | 3572.476984 | 3928.853331 | 2911.558220 |
['highway_mpg', 'engine_size'] k: 1 | 2888.327174 | 3151.107410 | 3940.517316 |
['width', 'engine_size'] k: 1 | 2658.439165 | 3607.760014 | 3425.534883 |
['horsepower', 'engine_size'] k: 1 | 2728.447665 | 3882.533376 | 3168.868620 |
['city_mpg', 'engine_size'] k: 1 | 2749.070667 | 3877.164534 | 3298.299729 |
['wheel_base', 'engine_size'] k: 1 | 3224.296336 | 4480.431807 | 3539.407157 |
['curb_weight', 'highway_mpg', 'engine_size'] k: 1 | 2920.017525 | 4748.365574 | 2924.460119 |
['curb_weight', 'width', 'engine_size'] k: 1 | 2773.575040 | 3507.198458 | 3220.266475 |
['curb_weight', 'horsepower', 'engine_size'] k: 1 | 3016.944873 | 3774.918623 | 2823.616823 |
['curb_weight', 'city_mpg', 'engine_size'] k: 1 | 2886.008335 | 3628.351108 | 3357.315306 |
# lets group all the '100 random seed runs' into 1 row:
mean_frame = cols_framek.groupby(level=0).mean()
# analyze results:
mean_frame['min value'] = mean_frame.min(axis=1)
mean_frame['best df'] = mean_frame.loc[:,:'numeric_c_n'].idxmin(axis=1)
mean_frame.sort_values('min value')[:5]
numeric_a_n | numeric_b_n | numeric_c_n | min value | best df | |
---|---|---|---|---|---|
['horsepower', 'city_mpg', 'engine_size'] k: 2 | 3111.109682 | 3221.824832 | 3086.584667 | 3086.584667 | numeric_c_n |
['horsepower', 'engine_size'] k: 2 | 3209.018693 | 3246.876181 | 3107.182528 | 3107.182528 | numeric_c_n |
['highway_mpg', 'engine_size'] k: 2 | 3163.525840 | 3337.194600 | 3155.707789 | 3155.707789 | numeric_c_n |
['horsepower', 'city_mpg', 'engine_size'] k: 3 | 3174.210462 | 3296.011471 | 3156.049960 | 3156.049960 | numeric_c_n |
['curb_weight', 'horsepower', 'city_mpg', 'engine_size'] k: 1 | 3170.428320 | 3252.506225 | 3276.417479 | 3170.428320 | numeric_a_n |
min_frame = cols_framek.groupby(level=0).min()
# analyze results:
min_frame['min value'] = min_frame.min(axis=1)
min_frame['best df'] = min_frame.loc[:,:'numeric_c_n'].idxmin(axis=1)
min_frame.sort_values('min value')[:5]
numeric_a_n | numeric_b_n | numeric_c_n | min value | best df | |
---|---|---|---|---|---|
['curb_weight', 'width', 'horsepower', 'city_mpg', 'engine_size'] k: 1 | 2202.498604 | 1875.722236 | 2251.406929 | 1875.722236 | numeric_b_n |
['curb_weight', 'highway_mpg', 'width', 'horsepower', 'city_mpg', 'engine_size'] k: 1 | 2029.547118 | 1929.328304 | 2221.236631 | 1929.328304 | numeric_b_n |
['curb_weight', 'highway_mpg', 'width', 'horsepower', 'engine_size'] k: 2 | 1948.147033 | 2163.093561 | 2027.308791 | 1948.147033 | numeric_a_n |
['curb_weight', 'highway_mpg', 'width', 'engine_size'] k: 1 | 2119.891409 | 1957.382290 | 2120.227003 | 1957.382290 | numeric_b_n |
['highway_mpg', 'width', 'city_mpg', 'engine_size'] k: 1 | 1972.105775 | 2337.443067 | 2066.061022 | 1972.105775 | numeric_a_n |
# lets group the dataframe rows by number of columns and k value:
cols_framek_plot = cols_framek.groupby(level=0).mean()
index_list = []
for el in cols_framek_plot.index:
index_list.append((str(el.count(',')+1)+' cols' + el[el.find(' k:'):]))
cols_framek_plot.index = index_list
cols_ks_frame = cols_framek_plot.groupby(level=0).mean()
cols_ks_frame['cols'] = cols_ks_frame.index.str[:6]
index_list = []
for el in cols_ks_frame.index:
index_list.append(int(el[el.find('k:')+2:]))
cols_ks_frame.index = index_list
cols_ks_frame.sort_index
n = 0
fig, ax = plt.subplots(figsize=(16,16))
for el in cols_ks_frame['cols'].value_counts().index.tolist()[:-2]:
n += 0.25
frame1 = cols_ks_frame[cols_ks_frame['cols']==el].sort_index()
y_list = [frame1['numeric_a_n'], frame1['numeric_b_n'], frame1['numeric_c_n']]
scatter_3v2(frame1.index,y_list,labels, size=10, alpha = 1.25-n, z=0.3)
spines(ax,xl='k value')
# legend labels:
labels_list = []
col_label_list = cols_ks_frame['cols'].value_counts().index.tolist()[:-2]
df_label_list = ['numeric_a_n', 'numeric_b_n', 'numeric_c_n']
for ele in col_label_list:
for el in df_label_list:
labels_list.append(el+' '+ele)
handles, labels = ax.get_legend_handles_labels()
l = plt.legend(handles[::25],labels_list, loc=2, borderaxespad=0.)
plt.title('Average rmse results per k value, dataframe and number of columns')
plt.show()
mean_frame_plot0 = cols_framek.groupby(level=0).mean().copy()
k_index = []
for el in mean_frame_plot0.index:
k_index.append(int(el[el.find('k:')+2:]))
mean_frame_plot0.index = k_index
index_list = []
mean_frame_plot = cols_framek.groupby(level=0).mean()
for el in mean_frame_plot.index:
index_list.append((str(el.count(',')+1)+' cols'))
mean_frame_plot.index = index_list
fig, ax = plt.subplots(figsize=(16,12))
ax1 = plt.subplot(211)
y_list = [mean_frame_plot0['numeric_a_n'], mean_frame_plot0['numeric_b_n'], mean_frame_plot0['numeric_c_n']]
scatter_3v2(mean_frame_plot0.index,y_list,labels, size=3, alpha = 1, z=0.4)
spines(ax1, xl='K value')
handles, labels = ax0.get_legend_handles_labels()
l = plt.legend(handles[::6], ['numeric_a_n','numeric_b_n','numeric_c_n'], loc=2, borderaxespad=0.)
# legend points to small? solution:
for el in l.legendHandles:
el._sizes = [40]
ax2 = plt.subplot(212)
y_list = [mean_frame_plot['numeric_a_n'], mean_frame_plot['numeric_b_n'], mean_frame_plot['numeric_c_n']]
scatter_3v2(mean_frame_plot.index,y_list,labels, size=3, alpha = 1,z=0.4)
spines(ax2, xl='Number of columns')
handles, labels = ax0.get_legend_handles_labels()
l = plt.legend(handles[::6], ['numeric_a_n', 'numeric_b_n', 'numeric_c_n'], loc=1, borderaxespad=0.)
# legend points to small? solution:
for el in l.legendHandles:
el._sizes = [40]
plt.show()
Observations:
In our last step we've generated a dataframe consisting of 157500 rows, now we want to include cross validation in our model which in turn will again drastically increase the amount of generated data. We'll try to limit our processing time:
# update the training model:
def knn_train_test_v4(df, train_cols, target_col, r, k, fold):
rmse_dict = {}
# train and test
kf = KFold(fold, shuffle=True, random_state=r)
knn = KNeighborsRegressor(n_neighbors=k)
mses = cross_val_score(knn, df[train_cols], df[target_col], scoring='neg_mean_squared_error', cv=kf)
rmses = np.sqrt(np.absolute(mses))
avg_rmse = np.mean(rmses)
return avg_rmse
# update the function that fills the dictionaries:
def fill_dict_3(model,col_list, dlist=df_n_list, r=1, k=5, fold=5):
rmse_dict_a = {}
rmse_dict_b = {}
rmse_dict_c = {}
rmse_dict_a[str(col_list)] = model(dlist[0], col_list, 'price', r, k, fold)
rmse_dict_b[str(col_list)] = model(dlist[1], col_list, 'price', r, k, fold)
rmse_dict_c[str(col_list)] = model(dlist[2], col_list, 'price', r, k, fold)
return rmse_dict_a, rmse_dict_b, rmse_dict_c
best_cols_list = results_frame_norm.index[:6].tolist()
top6cols = best_cols_list[1:6]
best_cols_list
for col in mutli_cols_df.columns:
mutli_cols_df[col].values[:] = 0
cols_framef = mutli_cols_df[:0].copy()
for fold in list(range(2,11)):
# k value loop (4)
for z in list(range(1,4)):
# random seed loop - 100:
for i in list(range(0,100)):
# column combinations loop(6):
for n in list(range(1,6)):
cc = list(combinations(top6cols,n))
cols_list = []
for el in cc:
cols_list.append(list(el))
for el in cols_list:
el.append(best_cols_list[0])
# cols list ready now
df_results_list = []
for el in cols_list:
dict_a, dict_b, dict_c = fill_dict_3(knn_train_test_v4, el, r=i, k=z, fold=fold)
results_frame = give_results_df(dict_a, dict_b, dict_c)
results_frame = results_frame.rename(index={results_frame.index[-1]: results_frame.index[-1]+' k: '+str(z) +' fold: '+str(fold) })
df_results_list.append(results_frame)
to_add = pd.concat(df_results_list)
to_add.columns = ['numeric_a_n', 'numeric_b_n', 'numeric_c_n']
cols_framef = cols_framef.append(to_add)
cols_framef[:10]
numeric_a_n | numeric_b_n | numeric_c_n | |
---|---|---|---|
['curb_weight', 'engine_size'] k: 1 fold: 2 | 3864.977042 | 4191.490892 | 3378.947940 |
['highway_mpg', 'engine_size'] k: 1 fold: 2 | 3137.849210 | 3560.588731 | 3663.709798 |
['width', 'engine_size'] k: 1 fold: 2 | 3348.052461 | 3534.975122 | 3645.008165 |
['horsepower', 'engine_size'] k: 1 fold: 2 | 3216.880400 | 3786.130937 | 3073.667271 |
['city_mpg', 'engine_size'] k: 1 fold: 2 | 3890.797544 | 3867.354891 | 3313.619897 |
['curb_weight', 'highway_mpg', 'engine_size'] k: 1 fold: 2 | 3709.768489 | 4498.710613 | 3815.367610 |
['curb_weight', 'width', 'engine_size'] k: 1 fold: 2 | 3254.811895 | 3555.318797 | 3328.763930 |
['curb_weight', 'horsepower', 'engine_size'] k: 1 fold: 2 | 3453.089035 | 4214.285753 | 2765.497703 |
['curb_weight', 'city_mpg', 'engine_size'] k: 1 fold: 2 | 3826.076727 | 4018.740122 | 4036.617481 |
['highway_mpg', 'width', 'engine_size'] k: 1 fold: 2 | 3694.172947 | 4246.443323 | 4230.294438 |
# lets group all the '100 random seed runs' into 1 row:
mean_frame = cols_framef.groupby(level=0).mean()
# analyze results:
mean_frame['min avg value'] = mean_frame.min(axis=1)
mean_frame['best df'] = mean_frame.loc[:,:'numeric_c_n'].idxmin(axis=1)
mean_frame.sort_values('min avg value')[:5]
numeric_a_n | numeric_b_n | numeric_c_n | min avg value | best df | |
---|---|---|---|---|---|
['curb_weight', 'highway_mpg', 'width', 'engine_size'] k: 1 fold: 10 | 2381.810484 | 2388.642524 | 2400.914894 | 2381.810484 | numeric_a_n |
['curb_weight', 'highway_mpg', 'width', 'engine_size'] k: 1 fold: 9 | 2406.535764 | 2423.289327 | 2445.994091 | 2406.535764 | numeric_a_n |
['curb_weight', 'highway_mpg', 'horsepower', 'engine_size'] k: 1 fold: 10 | 2418.459021 | 2408.982903 | 2582.765987 | 2408.982903 | numeric_b_n |
['curb_weight', 'highway_mpg', 'width', 'city_mpg', 'engine_size'] k: 1 fold: 10 | 2422.996694 | 2426.654309 | 2431.095057 | 2422.996694 | numeric_a_n |
['curb_weight', 'width', 'city_mpg', 'engine_size'] k: 1 fold: 10 | 2427.348265 | 2437.362250 | 2432.892226 | 2427.348265 | numeric_a_n |
An rmse average of 100 random seeds dipping below 2400 is a significant improvement
labels2 = ['numeric_a','numeric_b' ,'numeric_c']
mean_frame = cols_framef.groupby(level=0).mean()
k_index = []
for el in mean_frame.index:
k_index.append(int(el[el.find('fold:')+5:]))
mean_frame_plot0 = mean_frame.copy()
mean_frame_plot0.index = k_index
mean_frame_plot0 = mean_frame_plot0.sort_index()
fig, ax = plt.subplots(figsize=(16,8))
y_list = [mean_frame_plot0['numeric_a_n'], mean_frame_plot0['numeric_b_n'], mean_frame_plot0['numeric_c_n']]
scatter_3v2(mean_frame_plot0.index,y_list,labels, size=3, alpha = 1, z=0.2)
spines(ax, xl='Number of folds')
handles, labels = ax0.get_legend_handles_labels()
l = plt.legend(handles[::6], labels2, loc=1, borderaxespad=0.)
# legend points to small? solution:
for el in l.legendHandles:
el._sizes = [40]
plt.show()
(remember that k value of 2 delivered best results? also 2 or 3 column models were the best in the previous step?)
Observations:
index_list = []
mean_frame = cols_framef.groupby(level=0).mean()
mean_frame_plot_k = mean_frame
for el in mean_frame_plot_k.index:
index_list.append((int(el[el.find('k:')+3:el.find('fold:')-1])))
mean_frame_plot_k.index = index_list
mean_frame_plot_k
fig, ax = plt.subplots(figsize=(16,8))
y_list = [mean_frame_plot_k['numeric_a_n'], mean_frame_plot_k['numeric_b_n'], mean_frame_plot_k['numeric_c_n']]
scatter_3v2(mean_frame_plot_k.index,y_list,labels, size=5, alpha = 1,z=0.4)
spines(ax,xl='K value')
plt.title('RMSE results per k value')
handles, labels = ax0.get_legend_handles_labels()
l = plt.legend(handles[::6],labels2, loc=1, borderaxespad=0.)
# legend points to small? solution:
for el in l.legendHandles:
el._sizes = [40]
plt.show()
index_list = []
mean_frame = cols_framef.groupby(level=0).mean()
mean_frame_plotf = mean_frame
for el in mean_frame_plotf.index:
index_list.append((str(el.count(',')+1)+' cols'))
mean_frame_plotf.index = index_list
# mean_frame_plotf = mean_frame_plotf.groupby(level=0).mean()
fig, ax = plt.subplots(figsize=(16,8))
y_list = [mean_frame_plotf['numeric_a_n'], mean_frame_plotf['numeric_b_n'], mean_frame_plotf['numeric_c_n']]
scatter_3v2(mean_frame_plotf.index,y_list,labels, size=5, alpha = 1,z=0.4)
spines(ax, xl='Number of columns')
handles, labels = ax0.get_legend_handles_labels()
l = plt.legend(handles[::6], labels2, loc=1, borderaxespad=0.)
# legend points to small? solution:
for el in l.legendHandles:
el._sizes = [40]
plt.show()
# lets group all the '100 random seed runs' into 1 row:
min_frame = cols_framef.groupby(level=0).min()
# analyze results:
min_frame['min value'] = min_frame.min(axis=1)
min_frame['best df'] = min_frame.loc[:,:'numeric_c_n'].idxmin(axis=1)
min_frame.sort_values('min value')[:5]
numeric_a_n | numeric_b_n | numeric_c_n | min value | best df | |
---|---|---|---|---|---|
['curb_weight', 'highway_mpg', 'horsepower', 'engine_size'] k: 1 fold: 8 | 2182.862984 | 2105.436876 | 2277.258464 | 2105.436876 | numeric_b_n |
['curb_weight', 'highway_mpg', 'horsepower', 'engine_size'] k: 1 fold: 10 | 2117.717840 | 2180.396321 | 2218.114785 | 2117.717840 | numeric_a_n |
['curb_weight', 'highway_mpg', 'horsepower', 'engine_size'] k: 1 fold: 5 | 2133.734624 | 2294.420473 | 2358.409024 | 2133.734624 | numeric_a_n |
['curb_weight', 'highway_mpg', 'horsepower', 'engine_size'] k: 1 fold: 9 | 2152.383885 | 2143.715946 | 2262.474523 | 2143.715946 | numeric_b_n |
['curb_weight', 'highway_mpg', 'horsepower', 'engine_size'] k: 1 fold: 7 | 2205.451044 | 2152.065816 | 2211.526053 | 2152.065816 | numeric_b_n |
Looking at the table of single best result numbers we can notice that those numbers are not as good as before - another reason to use cross validation. It's harder to achieve a single best result, which in theory is closer to real world aplications.
Removing majority of k-values and 7 column combinations was risky, we know that every change introduced to the model can pottentialy make previous observations obsolete. Having that knowledge I did run the cross validation loop on all 25 k values in a different notebook. Just to be sure that we're not missing any good results, here are the images of result plots (grouped by columns):
The rmse results are increasing together with k value, on top of that the 7 column combinations don't deliver any lower results than combinations with a lower number of columns. Intuition paid of, the results are what we expected them to be and we saved some computing in the current notebook. (the dataframe used for these visualizations had 1.4 million rows)
Having tested multiple univariate and multivariate models we can draw a few simple conlusions for our future projects: