Predicting import vehicle prices

Predicting the prices of cars and trucks imported in 1985

The aim of this project is to analyze the various characteristics of a vehicle and eventually predict its price from those characteristics. For a new manufacturer, this can be helpful in understanding the market: which features drive the price, and what a vehicle could be priced at given its features. The dataset can be found at this Link
The dataset contains information on 205 vehicles imported in 1985 from various manufacturers. It is the Automobile dataset from the UCI repository. The column description is as follows:-

  1. symboling: Insurance risk rating associated with the price; -3 : very safe, +3 : very risky.
  2. normalized-losses: Relative average payment per insured vehicle year.
  3. make: The make of the vehicle.
  4. fuel-type: Fuel type, diesel or gas.
  5. aspiration: std, turbo.
  6. num-of-doors: four, two.
  7. body-style: hardtop, wagon, sedan, hatchback, convertible.
  8. drive-wheels: 4wd, fwd, rwd.
  9. engine-location: front, rear.
  10. wheel-base: Distance between centers of front and rear axles.
  11. length: Length of the vehicle.
  12. width: Width of the vehicle.
  13. height: Height of the vehicle.
  14. curb-weight: Total mass of vehicle.
  15. engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
  16. num-of-cylinders: eight, five, four, six, three, twelve, two.
  17. engine-size: Size of the engine.
  18. fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
  19. bore: Diameter of piston cylinder.
  20. stroke: Stroke length of the piston cylinder.
  21. compression-ratio: Ratio of volume of cylinder and combustion chamber.
  22. horsepower: Horsepower of the vehicle.
  23. peak-rpm: Max achievable RPM.
  24. city-mpg: City driving fuel economy (mpg); typically the lower rating.
  25. highway-mpg: Highway driving fuel economy (mpg); typically the higher rating.
  26. price: Price of the car.

There are a total of 25 characteristics related to a vehicle that have to be analyzed in order to find the best predictors for its price.

In [350]:
%matplotlib inline
import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error,r2_score,mean_absolute_error
from sklearn.linear_model import LinearRegression, Lasso,Ridge,ElasticNet
from sklearn.feature_selection import RFE
from statsmodels.regression.linear_model import WLS
In [2]:
df = pd.read_csv('imports-85.data',header=None)
df.head(10)
Out[2]:
0 1 2 3 4 5 6 7 8 9 ... 16 17 18 19 20 21 22 23 24 25
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450
5 2 ? audi gas std two sedan fwd front 99.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250
6 1 158 audi gas std four sedan fwd front 105.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 17710
7 1 ? audi gas std four wagon fwd front 105.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 18920
8 1 158 audi gas turbo four sedan fwd front 105.8 ... 131 mpfi 3.13 3.40 8.3 140 5500 17 20 23875
9 0 ? audi gas turbo two hatchback 4wd front 99.5 ... 131 mpfi 3.13 3.40 7.0 160 5500 16 22 ?

10 rows × 26 columns

The dataset does not inherently include column names; the column names have been picked up from the link specified above.

In [3]:
cols = [
    'symboling', 'normalized_losses', 'make', 'fuel_type',
    'aspiration', 'num_of_doors', 'body_style', 
    'drive_wheels', 'engine_location', 'wheel_base',
    'length', 'width', 'height', 'curb_weight', 'engine_type', 
    'num_of_cylinders', 'engine_size', 'fuel_system', 'bore',
    'stroke', 'compression_rate', 'horsepower', 
    'peak_rpm', 'city_mpg', 'highway_mpg', 'price'
       ]

df.columns = cols
df.head(10)
Out[3]:
symboling normalized_losses make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base ... engine_size fuel_system bore stroke compression_rate horsepower peak_rpm city_mpg highway_mpg price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450
5 2 ? audi gas std two sedan fwd front 99.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250
6 1 158 audi gas std four sedan fwd front 105.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 17710
7 1 ? audi gas std four wagon fwd front 105.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 18920
8 1 158 audi gas turbo four sedan fwd front 105.8 ... 131 mpfi 3.13 3.40 8.3 140 5500 17 20 23875
9 0 ? audi gas turbo two hatchback 4wd front 99.5 ... 131 mpfi 3.13 3.40 7.0 160 5500 16 22 ?

10 rows × 26 columns

In [4]:
df.isna().sum()
Out[4]:
symboling            0
normalized_losses    0
make                 0
fuel_type            0
aspiration           0
num_of_doors         0
body_style           0
drive_wheels         0
engine_location      0
wheel_base           0
length               0
width                0
height               0
curb_weight          0
engine_type          0
num_of_cylinders     0
engine_size          0
fuel_system          0
bore                 0
stroke               0
compression_rate     0
horsepower           0
peak_rpm             0
city_mpg             0
highway_mpg          0
price                0
dtype: int64

The dataset as such does not contain any NaN values, but there are '?' values, which here represent missing data. The preliminary step is to re-read the dataset with the na_values parameter set to '?', so that every '?' encountered is read in as np.nan.

In [5]:
df = pd.read_csv('imports-85.data',header=None, na_values='?')
df.columns = cols
df.head(5)
Out[5]:
symboling normalized_losses make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base ... engine_size fuel_system bore stroke compression_rate horsepower peak_rpm city_mpg highway_mpg price
0 3 NaN alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111.0 5000.0 21 27 13495.0
1 3 NaN alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111.0 5000.0 21 27 16500.0
2 1 NaN alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154.0 5000.0 19 26 16500.0
3 2 164.0 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102.0 5500.0 24 30 13950.0
4 2 164.0 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115.0 5500.0 18 22 17450.0

5 rows × 26 columns

In [6]:
df.isna().sum()
Out[6]:
symboling             0
normalized_losses    41
make                  0
fuel_type             0
aspiration            0
num_of_doors          2
body_style            0
drive_wheels          0
engine_location       0
wheel_base            0
length                0
width                 0
height                0
curb_weight           0
engine_type           0
num_of_cylinders      0
engine_size           0
fuel_system           0
bore                  4
stroke                4
compression_rate      0
horsepower            2
peak_rpm              2
city_mpg              0
highway_mpg           0
price                 4
dtype: int64

The dataset is mostly clean, except for a few columns containing null values. These must be handled before the analysis can begin.
The price column is the target variable. Since the aim is to eventually predict the price, imputing it is not an option; it is wiser to drop the 4 rows, and there isn't much loss of information either.

In [7]:
df = df[~df.price.isna()]

The normalized_losses column contains the most NaN values in the dataset. Dropping 41 rows would greatly shrink the dataset and lose valuable information in the other columns.
These NaN values have to be imputed. The distribution of the normalized_losses column can give a sense of how this can be achieved.

In [9]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(10,8))
df['normalized_losses'].plot.hist()
plt.axvline(df['normalized_losses'].mean(),color='black')
Out[9]:
<matplotlib.lines.Line2D at 0x7f775fb7c610>

The distribution shows that most of the normalized_losses values are close to the mean. Thus imputing the 41 rows of missing data with the mean will not change the general distribution of the column.

In [10]:
df['normalized_losses'].fillna(
    df['normalized_losses'].mean(),inplace=True
)

The bore and stroke columns have 4 missing values each. The bore defines the diameter of the piston cylinder and the stroke defines the travel length of the piston. These values are usually specific to a vehicle, and even vehicles from the same manufacturer do not share the same bore and stroke. Imputing them would require a lot of other parameters, such as horsepower, engine capacity, pressure values etc.

In [12]:
df[df.bore.isna()]
Out[12]:
symboling normalized_losses make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base ... engine_size fuel_system bore stroke compression_rate horsepower peak_rpm city_mpg highway_mpg price
55 3 150.0 mazda gas std two hatchback rwd front 95.3 ... 70 4bbl NaN NaN 9.4 101.0 6000.0 17 23 10945.0
56 3 150.0 mazda gas std two hatchback rwd front 95.3 ... 70 4bbl NaN NaN 9.4 101.0 6000.0 17 23 11845.0
57 3 150.0 mazda gas std two hatchback rwd front 95.3 ... 70 4bbl NaN NaN 9.4 101.0 6000.0 17 23 13645.0
58 3 150.0 mazda gas std two hatchback rwd front 95.3 ... 80 mpfi NaN NaN 9.4 135.0 6000.0 16 23 15645.0

4 rows × 26 columns

Analyzing the missing values, the rows with missing bore values also have missing stroke values. This makes it impossible to derive either bore or stroke for these rows. Since only 4 rows are in question, they can be dropped without meaningful loss of information.

In [13]:
df.dropna(subset=['bore','stroke'],inplace=True)

The horsepower and peak_rpm columns each contain 2 missing values. A closer look shows that the same two vehicles are missing both horsepower and peak_rpm. For lack of data, these values cannot readily be calculated, hence these two rows are also dropped from the dataset.

In [14]:
df[df.horsepower.isna()]
Out[14]:
symboling normalized_losses make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base ... engine_size fuel_system bore stroke compression_rate horsepower peak_rpm city_mpg highway_mpg price
130 0 122.0 renault gas std four wagon fwd front 96.1 ... 132 mpfi 3.46 3.9 8.7 NaN NaN 23 31 9295.0
131 2 122.0 renault gas std two hatchback fwd front 96.1 ... 132 mpfi 3.46 3.9 8.7 NaN NaN 23 31 9895.0

2 rows × 26 columns

In [15]:
df.dropna(subset=['horsepower','peak_rpm'],inplace=True)
In [17]:
df.body_style.value_counts()
Out[17]:
sedan          94
hatchback      63
wagon          24
hardtop         8
convertible     6
Name: body_style, dtype: int64

The num_of_doors column is the last one left to clean, with 2 missing rows. A closer look reveals that both the vehicles listed are sedans. The body_style of a vehicle identifies whether it is a sedan, hatchback, convertible, hardtop or wagon. Vehicles of the same style usually share body characteristics such as the number of doors; for example, a convertible is usually a 2-door vehicle. Needless to say there are exceptions, but the general trend holds, and by that trend a sedan usually has 4 doors. Thus the two missing values are imputed with the mode of num_of_doors among sedans.

In [18]:
df[df['num_of_doors'].isna()]
Out[18]:
symboling normalized_losses make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base ... engine_size fuel_system bore stroke compression_rate horsepower peak_rpm city_mpg highway_mpg price
27 1 148.0 dodge gas turbo NaN sedan fwd front 93.7 ... 98 mpfi 3.03 3.39 7.6 102.0 5500.0 24 30 8558.0
63 0 122.0 mazda diesel std NaN sedan fwd front 98.8 ... 122 idi 3.39 3.39 22.7 64.0 4650.0 36 42 10795.0

2 rows × 26 columns

In [300]:
df[df['body_style'] == 'sedan']['num_of_doors'].value_counts()
Out[300]:
four    79
two     15
Name: num_of_doors, dtype: int64
In [19]:
df['num_of_doors'].fillna(
    df[df['body_style'] == 'sedan']['num_of_doors'].mode()[0],
    inplace=True
)

The dataset is now clean of missing values. The next step is to analyze the various characteristics. The price column is the target of the predictive modeling and the basis for the analysis. Looking into the distribution of price via a box plot reveals a significant number of outliers.

In [20]:
plt.figure(figsize=(10,8))
sns.boxplot(df.price)
plt.xlabel('Price range')
sns.despine(left=True)

The outliers lie beyond the 30k mark. This data could be faulty, so the vehicles listed with a price greater than 30k deserve a detailed look.

In [21]:
df[df.price > 30000]
Out[21]:
symboling normalized_losses make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base ... engine_size fuel_system bore stroke compression_rate horsepower peak_rpm city_mpg highway_mpg price
15 0 122.0 bmw gas std four sedan rwd front 103.5 ... 209 mpfi 3.62 3.39 8.0 182.0 5400.0 16 22 30760.0
16 0 122.0 bmw gas std two sedan rwd front 103.5 ... 209 mpfi 3.62 3.39 8.0 182.0 5400.0 16 22 41315.0
17 0 122.0 bmw gas std four sedan rwd front 110.0 ... 209 mpfi 3.62 3.39 8.0 182.0 5400.0 15 20 36880.0
47 0 145.0 jaguar gas std four sedan rwd front 113.0 ... 258 mpfi 3.63 4.17 8.1 176.0 4750.0 15 19 32250.0
48 0 122.0 jaguar gas std four sedan rwd front 113.0 ... 258 mpfi 3.63 4.17 8.1 176.0 4750.0 15 19 35550.0
49 0 122.0 jaguar gas std two sedan rwd front 102.0 ... 326 mpfi 3.54 2.76 11.5 262.0 5000.0 13 17 36000.0
70 -1 93.0 mercedes-benz diesel turbo four sedan rwd front 115.6 ... 183 idi 3.58 3.64 21.5 123.0 4350.0 22 25 31600.0
71 -1 122.0 mercedes-benz gas std four sedan rwd front 115.6 ... 234 mpfi 3.46 3.10 8.3 155.0 4750.0 16 18 34184.0
72 3 142.0 mercedes-benz gas std two convertible rwd front 96.6 ... 234 mpfi 3.46 3.10 8.3 155.0 4750.0 16 18 35056.0
73 0 122.0 mercedes-benz gas std four sedan rwd front 120.9 ... 308 mpfi 3.80 3.35 8.0 184.0 4500.0 14 16 40960.0
74 1 122.0 mercedes-benz gas std two hardtop rwd front 112.0 ... 304 mpfi 3.80 3.35 8.0 184.0 4500.0 14 16 45400.0
126 3 122.0 porsche gas std two hardtop rwd rear 89.5 ... 194 mpfi 3.74 2.90 9.5 207.0 5900.0 17 25 32528.0
127 3 122.0 porsche gas std two hardtop rwd rear 89.5 ... 194 mpfi 3.74 2.90 9.5 207.0 5900.0 17 25 34028.0
128 3 122.0 porsche gas std two convertible rwd rear 89.5 ... 194 mpfi 3.74 2.90 9.5 207.0 5900.0 17 25 37028.0

14 rows × 26 columns

The outliers do not look faulty; rather, the listed vehicles are high-end models from world-class manufacturers - Jaguar, Mercedes-Benz, Porsche and BMW. The high prices result from these being luxury/sports cars.

The city_mpg column describes the average miles per gallon (mileage) the car delivers when driven in the city, with occasional acceleration and braking. Similarly, highway_mpg describes the average miles per gallon the car delivers under continuous acceleration. The two metrics are considered very important for a vehicle.

In [27]:
plt.style.use('fivethirtyeight')
plt.subplots(figsize=(16,6))
plt.subplot(1,2,1)
sns.scatterplot(y='price',x='city_mpg',data=df)
plt.xlabel('city mileage (mpg)')
plt.subplot(1,2,2)
sns.scatterplot(y='price',x='highway_mpg',data=df)
plt.xlabel('highway mileage (mpg)')
Out[27]:
Text(0.5, 0, 'highway mileage (mpg)')

From the two scatter plots, the following conclusions are drawn :-

  • Both city mileage and highway mileage (in mpg) are negatively correlated with the price of the vehicle.
  • High-end luxury or sports vehicles have lower city and highway mileage (in mpg).

The city_mpg and highway_mpg define individual characteristics of the vehicle, each measured separately under the assumption that the car is driven only in that one setting, i.e. city_mpg is measured assuming the vehicle is driven only in the city.
In reality, a vehicle is not exposed to just one setting; driving is a combination of city and highway, and this makes a huge difference to the actual mileage (in mpg) the vehicle achieves. Thus a fuel economy metric is derived, taken as 60% of city_mpg plus 40% of highway_mpg. Intuitively, the fuel economy gives the average miles per gallon (mpg) for a vehicle driven 60% of the time in the city and the rest on the highway. This assumption seems fair and realistic.

In [28]:
df['fuel_economy'] = (df.city_mpg * 0.6) + (df.highway_mpg * 0.4)
df.head(10)
Out[28]:
symboling normalized_losses make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base ... fuel_system bore stroke compression_rate horsepower peak_rpm city_mpg highway_mpg price fuel_economy
0 3 122.0 alfa-romero gas std two convertible rwd front 88.6 ... mpfi 3.47 2.68 9.0 111.0 5000.0 21 27 13495.0 23.4
1 3 122.0 alfa-romero gas std two convertible rwd front 88.6 ... mpfi 3.47 2.68 9.0 111.0 5000.0 21 27 16500.0 23.4
2 1 122.0 alfa-romero gas std two hatchback rwd front 94.5 ... mpfi 2.68 3.47 9.0 154.0 5000.0 19 26 16500.0 21.8
3 2 164.0 audi gas std four sedan fwd front 99.8 ... mpfi 3.19 3.40 10.0 102.0 5500.0 24 30 13950.0 26.4
4 2 164.0 audi gas std four sedan 4wd front 99.4 ... mpfi 3.19 3.40 8.0 115.0 5500.0 18 22 17450.0 19.6
5 2 122.0 audi gas std two sedan fwd front 99.8 ... mpfi 3.19 3.40 8.5 110.0 5500.0 19 25 15250.0 21.4
6 1 158.0 audi gas std four sedan fwd front 105.8 ... mpfi 3.19 3.40 8.5 110.0 5500.0 19 25 17710.0 21.4
7 1 122.0 audi gas std four wagon fwd front 105.8 ... mpfi 3.19 3.40 8.5 110.0 5500.0 19 25 18920.0 21.4
8 1 158.0 audi gas turbo four sedan fwd front 105.8 ... mpfi 3.13 3.40 8.3 140.0 5500.0 17 20 23875.0 18.2
10 2 192.0 bmw gas std two sedan rwd front 101.2 ... mpfi 3.50 2.80 8.8 101.0 5800.0 23 29 16430.0 25.4

10 rows × 27 columns

In [29]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
sns.scatterplot(x='fuel_economy',y='price',data=df)
plt.xlabel('Fuel economy (in mpg)')
Out[29]:
Text(0.5, 0, 'Fuel economy (in mpg)')

Since fuel_economy is a representation of city_mpg and highway_mpg combined, it mimics the trend seen above.

The following columns are numeric in nature and mostly describe physical specifications of the vehicle :-

  • wheel_base
  • length
  • width
  • height
  • curb_weight
  • engine_size
  • bore
  • stroke
  • compression_rate
  • horsepower
  • peak_rpm

The next set of plots analyzes these values against the price of the vehicle in regression plots. A regplot fits a regressor to the data and draws the best-fitting line across it, giving insight into the relationship between two variables. This gives an idea of which characteristics are correlated with the price.

In [55]:
cols = [
    'wheel_base',
    'length',
    'width',
    'height',
    'curb_weight',
    'engine_size',
    'bore',
    'stroke',
    'compression_rate',
    'horsepower',
    'peak_rpm'
]

plt.style.use('seaborn-white')
plt.subplots(figsize=(16,42))
i=1
for col in cols:
    plt.subplot(6,2,i)
    sns.regplot(x=col,y='price',data=df)
    plt.title('Price vs '+col)
    i+=1
    

The following conclusions are drawn from the regression plots :-

  • The wheel base of a vehicle shows a slight positive correlation with the price, but for a given wheel base there is a wide band within which vehicle prices lie.
  • The length and width of the vehicle share a linear relationship with the price and are highly positively correlated. The length and width of a vehicle can therefore be determining characteristics for its price.
  • The height of the car has a slight positive correlation, but not enough to conclude a linear relationship; height does not dictate the price of the vehicle.
  • The curb_weight has a linear relationship with the price. High-end vehicles have higher curb weights, i.e. they are more heavily built.
  • The engine_size and horsepower share a linear relationship with price. High-end vehicles pack a punch when it comes to power.
  • The bore shows a high positive correlation with the price, whereas the stroke does not show a clear linear relationship.
  • Compression rate and peak RPM show no real relationship with the price.

From these conclusions, it can be said that the following columns can be definite predictors for the price of the vehicle :-

  • length
  • width
  • curb_weight
  • engine_size
  • horsepower
  • bore
  • wheel_base

The normalized_losses column describes the average loss the car can incur per year. The scatter plot below views these losses against the price.

In [56]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
sns.scatterplot(x='normalized_losses',y='price',data=df)
Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7750f8f450>

normalized_losses shares no relationship with the price. The losses for most high-end vehicles sit at the mean, largely an artifact of the mean imputation performed earlier.

Analyzing all the numeric columns leads to the conclusion that most of them share a linear relationship with the price of the vehicle, as identified above; these characteristics can be good predictors for the target price. The dataset also contains categorical variables, which have to be compared against the price to look for relations.

The symboling column describes the risk factor involved with the price of a car: a value of +3 means very risky and -2 means pretty safe. The symbols are first mapped to labels as given below, then compared with the price.

  • -2 : Very safe
  • -1 : Safe
  • 0 : Neutral
  • 1 : Maybe risky
  • 2 : Risky
  • 3 : Very risky
In [92]:
def decode(row):
    if row == -2:
        return 'Very safe'
    elif row == -1:
        return 'Safe'
    elif row == 0:
        return 'Neutral'
    elif row == 1:
        return 'Maybe risky'
    elif row == 2:
        return 'Risky'
    else:
        return 'Very risky'
    
df['symboling_labels'] = df.symboling.apply(decode).astype(
    CategoricalDtype(
        categories=['Very safe','Safe','Neutral','Maybe risky','Risky','Very risky'],
        ordered=True
    )
)

plt.style.use('fivethirtyeight')
plt.subplots(figsize=(16,6))
plt.subplot(1,2,1)
sns.countplot(df.symboling_labels)
plt.subplot(1,2,2)
sns.boxplot(x=df.symboling_labels, y=df.price)
Out[92]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7748e856d0>

The symboling plots suggest :-

  • Vehicles with a symboling rating of 0 (Neutral) are sold the most.
  • Vehicles with a symboling rating of -1 (Safe) have higher prices.

The make of a vehicle identifies its manufacturer, and body_style identifies the type of vehicle, as discussed previously. The plots below show the number of vehicles sold per maker and per body type, the average prices for each, and a box plot of price by body style.

In [130]:
makers = df.make.value_counts().sort_values(ascending=False)
styles = df.body_style.value_counts().sort_values(ascending=False)
makers_price = df.pivot_table(values='price',index='make').sort_values('price',ascending=False)
styles_price = df.pivot_table(values='price',index='body_style').sort_values('price',ascending=False)


plt.style.use('fivethirtyeight')
plt.subplots(figsize=(16,24))
plt.subplot(3,2,1)
sns.barplot(x=makers.index[:10],y=makers.values[:10])
plt.xlabel('Manufacturer')
plt.ylabel('Number of vehicles')
plt.subplot(3,2,2)
sns.barplot(x=styles.index,y=styles.values)
plt.ylabel('Number of vehicles')
plt.xlabel('Type of vehicle')
plt.subplot(3,2,3)
sns.barplot(makers_price.index[:5],makers_price.price[:5])
plt.ylabel('Average price')
plt.xlabel('Manufacturer')
plt.subplot(3,2,4)
sns.barplot(styles_price.index,styles_price.price)
plt.ylabel('Average price')
plt.xlabel('Type of vehicle')
plt.subplot(3,2,5)
sns.boxplot(x=df.body_style,y=df.price)
Out[130]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7745cefb10>

Inferences :-

  • Toyota is the most sold make in the dataset; other manufacturers have sold relatively fewer vehicles.
  • The sedan is the most sold body type, with the hatchback second. The other categories sell far less.
  • Manufacturers sell cars across different price ranges, i.e. they usually offer cars in the lower, mid and higher ranges, so the manufacturer alone cannot be used to predict the price of a car.
  • Hardtops and convertibles have the highest average prices.
  • Every body style spans a range of prices: for example, hardtops and convertibles sit among the high-end vehicles, sedans and wagons in the moderate range, and hatchbacks in the lowest range.

The columns fuel_type, aspiration, engine_location and num_of_doors are two-category variables describing latent features of the vehicle; drive_wheels, with its three categories, is examined alongside them.

In [105]:
cols = [
    'fuel_type',
    'aspiration',
    'engine_location',
    'num_of_doors',
    'drive_wheels'
]

plt.style.use('fivethirtyeight')
plt.subplots(figsize=(16,32))
i=1
for col in cols:
    plt.subplot(5,2,i)
    sns.countplot(df[col])
    i+=1
    plt.subplot(5,2,i)
    sns.boxplot(x=df[col],y=df.price)
    i+=1

The conclusions drawn :-

  • Four-door vehicles are the most sold of the lot. The number of doors does not add much to the price.
  • All but a very few vehicles have engines in the front. The few with rear engines have very high price points; rear engines can be said to mark a few of the high-end vehicles.
  • Turbo-engine vehicles are priced higher than standard-engine vehicles, but standard-engine vehicles sell the most.
  • Diesel vehicles are priced higher than gas vehicles, but gas vehicles sell the most.
  • Front-wheel-drive vehicles are the most sold, while rear-wheel-drive vehicles sit at higher price points.

Out of the above characteristics, aspiration, drive_wheels and fuel_type look like good predictors for the price of the vehicle. The engine_location, though it shows a distinct price gap, has too few representatives in the rear category.
The engine_type describes the build of the engine from the manufacturer.

In [115]:
engines = df[['engine_type','price']].groupby('engine_type').mean()
engines.sort_values('price',ascending=False,inplace=True)
print(engines)

plt.subplots(figsize=(16,6))
plt.subplot(1,2,1)
sns.countplot(df.engine_type)
plt.ylabel('number of vehicles')
plt.subplot(1,2,2)
sns.barplot(x=engines.index,y=engines.price)
plt.ylabel('average price')
                    price
engine_type              
ohcv         25098.384615
dohc         18116.416667
l            14627.583333
ohcf         13738.600000
ohc          11594.944056
Out[115]:
Text(0, 0.5, 'average price')

The conclusions drawn :-

  • Vehicles with the ohc engine type are the most sold, but they have the lowest average price; ohc vehicles occupy the lower price points and sell the most.
  • Vehicles with the ohcv engine type are priced higher than the rest.

The plots conclude that the engine type can determine the average price range of a vehicle, making it an important characteristic for predicting the price.

The final categorical column, num_of_cylinders, is an important parameter of an engine and determines its power. The plots below visualize the number of vehicles sold per number of cylinders and the average price of the vehicles sold per number of cylinders.

In [118]:
cylinders = df[['num_of_cylinders','price']].groupby('num_of_cylinders').mean()
cylinders.sort_values('price',ascending=False,inplace=True)
print(cylinders)

plt.subplots(figsize=(16,6))
plt.subplot(1,2,1)
sns.countplot(df.num_of_cylinders)
plt.ylabel('number of vehicles')
plt.subplot(1,2,2)
sns.barplot(x=cylinders.index,y=cylinders.price)
plt.ylabel('average price')
                         price
num_of_cylinders              
eight             38900.000000
twelve            36000.000000
six               23671.833333
five              22007.600000
four              10312.335484
three              5151.000000
Out[118]:
Text(0, 0.5, 'average price')
  • Vehicles with a four-cylinder engine are sold the most and sit in a moderate price range.
  • High-end vehicles boast twelve- or eight-cylinder engines, but are the least sold.

The analysis completes here. Based on it, the characteristics estimated to be good predictors are :-

  • length
  • width
  • curb_weight
  • engine_size
  • horsepower
  • bore
  • wheel_base
  • aspiration
  • drive_wheels
  • fuel_type
  • fuel_economy
  • engine_type
  • num_of_cylinders
  • body_style

Removing the other columns from the dataset.

In [132]:
cols = [
    'length',
    'width',
    'curb_weight',
    'engine_size',
    'horsepower',
    'bore',
    'wheel_base',
    'aspiration',
    'drive_wheels',
    'fuel_type',
    'fuel_economy',
    'engine_type',
    'num_of_cylinders',
    'price',
    'body_style'
]

df = df[cols]
df.head(5)
Out[132]:
length width curb_weight engine_size horsepower bore wheel_base aspiration drive_wheels fuel_type fuel_economy engine_type num_of_cylinders price body_style
0 168.8 64.1 2548 130 111.0 3.47 88.6 std rwd gas 23.4 dohc four 13495.0 convertible
1 168.8 64.1 2548 130 111.0 3.47 88.6 std rwd gas 23.4 dohc four 16500.0 convertible
2 171.2 65.5 2823 152 154.0 2.68 94.5 std rwd gas 21.8 ohcv six 16500.0 hatchback
3 176.6 66.2 2337 109 102.0 3.19 99.8 std fwd gas 26.4 ohc four 13950.0 sedan
4 176.6 66.4 2824 136 115.0 3.19 99.4 std 4wd gas 19.6 ohc five 17450.0 sedan

The categorical variables are converted to binary dummy variables using pd.get_dummies. Once all columns are in numeric form, they are standardized with StandardScaler from the sklearn.preprocessing module, because the columns are on different scales.

In [395]:
X = df.drop('price',axis=1)
y = df.price

X = pd.get_dummies(X)
scaler = StandardScaler()
X_res = pd.DataFrame(scaler.fit_transform(X))
X_res.head(5)
Out[395]:
0 1 2 3 4 5 6 7 8 9 ... 21 22 23 24 25 26 27 28 29 30
0 -0.438504 -0.839749 -0.021018 0.049883 0.204599 0.518555 -1.683439 -0.639027 0.475831 -0.475831 ... -0.232495 0.508001 -0.374634 -0.071796 -0.071796 5.612486 -0.206835 -0.690849 -0.964724 -0.374634
1 -0.438504 -0.839749 -0.021018 0.049883 0.204599 0.518555 -1.683439 -0.639027 0.475831 -0.475831 ... -0.232495 0.508001 -0.374634 -0.071796 -0.071796 5.612486 -0.206835 -0.690849 -0.964724 -0.374634
2 -0.245646 -0.181548 0.504425 0.582216 1.342993 -2.394771 -0.718803 -0.884745 0.475831 -0.475831 ... -0.232495 -1.968502 2.669270 -0.071796 -0.071796 -0.178174 -0.206835 1.447494 -0.964724 -0.374634
3 0.188283 0.147553 -0.424175 -0.458253 -0.033670 -0.514016 0.147735 -0.178304 0.475831 -0.475831 ... -0.232495 0.508001 -0.374634 -0.071796 -0.071796 -0.178174 -0.206835 -0.690849 1.036566 -0.374634
4 0.188283 0.241582 0.506335 0.195065 0.310496 -0.514016 0.082336 -1.222609 0.475831 -0.475831 ... 4.301163 -1.968502 -0.374634 -0.071796 -0.071796 -0.178174 -0.206835 -0.690849 1.036566 -0.374634

5 rows × 31 columns

The feature set is broken down into training and testing sets for building the model.

In [396]:
X_train, X_test, y_train, y_test = train_test_split(X_res,y,random_state=1)

The models tested here are the KNeighborsRegressor and the LinearRegression.

Usually, manufacturers produce vehicles at different price ranges for every segment of the market. Say there are two manufacturers - Renault and Subaru. Both will have a car for the lower range, a car for the mid range and a car at the high end. The cars in the lower range usually have many characteristics in the same neighbourhood, such as engine_size, horsepower, drive_wheels, aspiration and fuel economy. A KNeighborsRegressor identifies such neighbourhoods, and for every prediction the price is the mean of the cluster of cars with similar characteristics. This model can struggle here, since the number of vehicles is small and gives an incomplete picture of the vehicles each manufacturer sells.

All throughout the analysis it was clear that many variables share a linear relationship with the price variable. All numerical columns chosen were either positively or negatively correlated with the price, as confirmed in the regplots. The categorical variables also showed distinct average price differences, leading to the conclusion that a given category can indicate the price range of a vehicle. The LinearRegression model is used with all of this in mind.

The above reasoning explains why the two models have been chosen for building the final model.

In [397]:
knn = KNeighborsRegressor(n_neighbors=1)
knn.fit(X_train,y_train)

pred = knn.predict(X_train)
print("Train score: ",r2_score(pred,y_train))

pred = knn.predict(X_test)
print("Test_score: ",r2_score(pred,y_test))

print("RMSE: ",np.sqrt(mean_squared_error(pred,y_test)))
print("MAE: ",mean_absolute_error(pred,y_test))
Train score:  0.9982318952770243
Test_score:  0.8396070972956275
RMSE:  3737.26141413081
MAE:  2438.5510204081634

The KNeighborsRegressor achieves a good r2_score of about 83.96% on the test set. The RMSE indicates the average error in a prediction made by the model, and the value achieved here, 3737.26, is good. The model seems to have worked well with the n_neighbors parameter set to 1.
Comparing this model with LinearRegression is necessary, as it gives insight into how well this model works.
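For reference, the two error metrics reported throughout are the standard definitions (writing pred_i for a prediction and n for the number of test samples):

RMSE = sqrt( (1/n) * Σ (y_i - pred_i)² )
MAE  = (1/n) * Σ |y_i - pred_i|

Both are in the same units as the price, which is what makes them easy to read as an average error in currency terms.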

In [398]:
lm = LinearRegression()
lm.fit(X_train,y_train)

pred = lm.predict(X_train)
print("Train score: ",r2_score(pred,y_train))

pred = lm.predict(X_test)
print("Test_score: ",r2_score(pred,y_test))

print("RMSE: ",np.sqrt(mean_squared_error(pred,y_test)))
print("MAE: ",mean_absolute_error(pred,y_test))
Train score:  0.8914832054451866
Test_score:  0.855910012633967
RMSE:  3181.0322780707743
MAE:  2212.806213059051

LinearRegression gives roughly the same result, with a test-set r2_score of 85.59%. The RMSE has decreased for this model, which now makes an average prediction error of about 3181.03. Taking a closer look at the coefficients:

In [400]:
print("intercept : ",lm.intercept_)
for col,coef in zip(X.columns,lm.coef_):
    print(col, " : ",coef)
intercept :  13240.364283181842
length  :  69.05001567827414
width  :  -312.64657177503926
curb_weight  :  3112.6422247921687
engine_size  :  -168.76840324099993
horsepower  :  3358.0603396880524
bore  :  -334.5993021207858
wheel_base  :  525.3453228799971
fuel_economy  :  51.35374451761106
aspiration_std  :  152.29063205157854
aspiration_turbo  :  -152.29063205158377
drive_wheels_4wd  :  -192.65794100608596
drive_wheels_fwd  :  -102.3344355733428
drive_wheels_rwd  :  183.82258381712714
fuel_type_diesel  :  282.1211482052628
fuel_type_gas  :  -282.1211482052585
engine_type_dohc  :  -531.9238114834744
engine_type_l  :  30.272123574678087
engine_type_ohc  :  727.6095684879377
engine_type_ohcf  :  778.0690284738686
engine_type_ohcv  :  -1637.7955847905962
num_of_cylinders_eight  :  1815.84020805974
num_of_cylinders_five  :  302.83823993731977
num_of_cylinders_four  :  -1531.9516188248065
num_of_cylinders_six  :  774.735747928345
num_of_cylinders_three  :  124.62799296438376
num_of_cylinders_twelve  :  433.80838661581413
body_style_convertible  :  804.2111420484559
body_style_hardtop  :  -94.99575353368817
body_style_hatchback  :  -268.559084028999
body_style_sedan  :  289.3694910702973
body_style_wagon  :  -423.2174528939979

Some of the coefficients of the independent variables are quite high. Large coefficients tend to dominate the model and can be a sign of overfitting, so regularization is in order. Ridge offers L2 regularization, which shrinks the coefficients according to the alpha parameter. Lasso offers L1 regularization as well as a form of feature selection: among correlated independent variables, it tends to choose one to represent the group and act as the predictor.
A trade-off between the two is the ElasticNet, which applies both L1 and L2 regularization. Instead of choosing just one variable from a group of correlated variables like Lasso, it can keep the whole group as predictors while still reducing the feature set, all while also applying L2 shrinkage like Ridge.

These are not guaranteed to give better results, but it is helpful to try them out.
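For reference, the objectives the three estimators minimize are roughly the following (per the scikit-learn documentation, up to its exact scaling conventions), where |w| denotes the L1 norm (sum of absolute coefficients) and ||w||² the squared L2 norm:

Ridge      : ||y - Xw||² + alpha * ||w||²
Lasso      : (1 / 2n) * ||y - Xw||² + alpha * |w|
ElasticNet : (1 / 2n) * ||y - Xw||² + alpha * l1_ratio * |w| + 0.5 * alpha * (1 - l1_ratio) * ||w||²

Setting l1_ratio=1 therefore reduces the ElasticNet penalty to pure L1, and l1_ratio=0 to pure L2.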

In [401]:
lm = Ridge(alpha=2)
lm.fit(X_train,y_train)

pred = lm.predict(X_train)
print("Train score: ",r2_score(pred,y_train))

pred = lm.predict(X_test)
print("Test_score: ",r2_score(pred,y_test))

print("RMSE: ",np.sqrt(mean_squared_error(pred,y_test)))
print("MAE: ",mean_absolute_error(pred,y_test))
Train score:  0.8889706653948796
Test_score:  0.8671449813077978
RMSE:  3076.4526038725635
MAE:  2168.362602263705
In [402]:
print("intercept : ",lm.intercept_)
for col,coef in zip(X.columns,lm.coef_):
    print(col, " : ",coef)
intercept :  13234.855514836217
length  :  346.307921997421
width  :  4.915165173945227
curb_weight  :  1910.0157142792557
engine_size  :  813.3823496623019
horsepower  :  2772.6679621879707
bore  :  -404.8681909574322
wheel_base  :  440.3772592669129
fuel_economy  :  -292.31922668139333
aspiration_std  :  35.115413618676634
aspiration_turbo  :  -35.1154136186765
drive_wheels_4wd  :  -254.01188086251113
drive_wheels_fwd  :  -178.49002931878687
drive_wheels_rwd  :  286.810728508113
fuel_type_diesel  :  286.93362467877745
fuel_type_gas  :  -286.93362467874874
engine_type_dohc  :  -481.2576295129912
engine_type_l  :  11.056378445288852
engine_type_ohc  :  639.6807356834718
engine_type_ohcf  :  782.1510033830963
engine_type_ohcv  :  -1516.5745120285826
num_of_cylinders_eight  :  1613.3496452200689
num_of_cylinders_five  :  256.4995246812394
num_of_cylinders_four  :  -1410.7354760799105
num_of_cylinders_six  :  745.6584665876092
num_of_cylinders_three  :  207.39371779678564
num_of_cylinders_twelve  :  344.45524070745677
body_style_convertible  :  814.4234724996103
body_style_hardtop  :  -120.71384227618105
body_style_hatchback  :  -291.8558609829045
body_style_sedan  :  255.2233661602942
body_style_wagon  :  -327.960236915297

As observed, Ridge regularizes the independent variables: the coefficients have shrunk, with alpha set to 2. The model performs well, with a test-set r2_score of 86.71% and an RMSE of 3076.45.

In [403]:
lm = Lasso(alpha=4)
lm.fit(X_train,y_train)

pred = lm.predict(X_train)
print("Train score: ",r2_score(pred,y_train))

pred = lm.predict(X_test)
print("Test_score: ",r2_score(pred,y_test))

print("RMSE: ",np.sqrt(mean_squared_error(pred,y_test)))
print("MAE: ",mean_absolute_error(pred,y_test))
Train score:  0.8911175779094181
Test_score:  0.8586311970968112
RMSE:  3151.2074765843927
MAE:  2194.366681789046
In [404]:
print("intercept : ",lm.intercept_)
for col,coef in zip(X.columns,lm.coef_):
    print(col, " : ",coef)
intercept :  13240.997691614599
length  :  66.82300613817876
width  :  -213.42009123479866
curb_weight  :  2911.939004544829
engine_size  :  0.0
horsepower  :  3270.8875333165747
bore  :  -322.629650764538
wheel_base  :  483.1736098883539
fuel_economy  :  0.0
aspiration_std  :  271.5340440205393
aspiration_turbo  :  -9.425828954811123e-14
drive_wheels_4wd  :  -146.8886801446939
drive_wheels_fwd  :  -0.0
drive_wheels_rwd  :  305.5112144339379
fuel_type_diesel  :  562.1328948193295
fuel_type_gas  :  -1.243996881938771e-14
engine_type_dohc  :  -562.1374708518558
engine_type_l  :  -4.059435551121502
engine_type_ohc  :  628.4566219561877
engine_type_ohcf  :  707.3385785146162
engine_type_ohcv  :  -1670.9542579204797
num_of_cylinders_eight  :  1590.912455827576
num_of_cylinders_five  :  -0.0
num_of_cylinders_four  :  -2061.0296798989107
num_of_cylinders_six  :  324.50744348722424
num_of_cylinders_three  :  33.052553912359194
num_of_cylinders_twelve  :  314.3517367813086
body_style_convertible  :  887.2593651816261
body_style_hardtop  :  0.0
body_style_hatchback  :  -28.998282742807678
body_style_sedan  :  551.2430475319044
body_style_wagon  :  -222.0989799008459

As intended, Lasso performed L1 regularization and a degree of feature selection. Some variables received a coefficient of exactly 0, such as engine_size, fuel_economy and several dummy levels (drive_wheels_fwd, num_of_cylinders_five, body_style_hardtop); these were not significant to the prediction.
Lasso has performed very well, with a test-set r2_score of 85.86% and an RMSE of 3151.20.

In [405]:
lm = ElasticNet(alpha=0.5,l1_ratio=1)
lm.fit(X_train,y_train)

pred = lm.predict(X_train)
print("Train score: ",r2_score(pred,y_train))

pred = lm.predict(X_test)
print("Test_score: ",r2_score(pred,y_test))

print("RMSE: ",np.sqrt(mean_squared_error(pred,y_test)))
print("MAE: ",mean_absolute_error(pred,y_test))
Train score:  0.8914400985686434
Test_score:  0.8568438026529581
RMSE:  3172.528610109757
MAE:  2208.337849138228
In [406]:
print("intercept : ",lm.intercept_)
for col,coef in zip(X.columns,lm.coef_):
    print(col, " : ",coef)
intercept :  13240.853030458207
length  :  73.6943214704772
width  :  -297.41912923007646
curb_weight  :  3063.0613347090143
engine_size  :  -116.25473036514512
horsepower  :  3339.089235082762
bore  :  -339.76124421255525
wheel_base  :  520.1333566235419
fuel_economy  :  38.912984786323115
aspiration_std  :  296.5004550472449
aspiration_turbo  :  -0.0
drive_wheels_4wd  :  -184.7269083328328
drive_wheels_fwd  :  -82.88720032552314
drive_wheels_rwd  :  209.03211353526098
fuel_type_diesel  :  564.2449608700294
fuel_type_gas  :  -0.0
engine_type_dohc  :  -561.5399806086097
engine_type_l  :  -0.0
engine_type_ohc  :  665.1613549382881
engine_type_ohcf  :  741.9968854766879
engine_type_ohcv  :  -1668.331895097389
num_of_cylinders_eight  :  1612.6300278784174
num_of_cylinders_five  :  -0.0
num_of_cylinders_four  :  -2077.11590986883
num_of_cylinders_six  :  320.7869987883807
num_of_cylinders_three  :  29.636584106001454
num_of_cylinders_twelve  :  328.08773434105836
body_style_convertible  :  887.8340994479023
body_style_hardtop  :  0.0
body_style_hatchback  :  -40.31189105339501
body_style_sedan  :  534.4152811609569
body_style_wagon  :  -255.9352928197938

The ElasticNet model blends L1 and L2 regularization via the l1_ratio parameter; here l1_ratio=1 makes the penalty effectively pure L1, i.e. Lasso-like. It has shrunk the coefficients as well as performed feature selection, zeroing out several dummy levels (e.g. fuel_type_gas, engine_type_l, body_style_hardtop). The model does not outperform Ridge but lands in the same neighbourhood with reliable predictions: a test-set r2_score of 85.68% and an RMSE of 3172.52.
For this model, the residual plot is given below. A residual plot plots the predicted prices against the error in each prediction, showing the variance in the errors of the predictions.

In [393]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
sns.scatterplot(x=pred,y=(pred-y_test))
plt.hlines(0,xmax=35000,xmin=5000,linestyles='dotted',colors='grey')
plt.xlabel('predicted price')
plt.ylabel('residual error')
plt.title('Residual plot')
Out[393]:
Text(0.5, 1.0, 'Residual plot')
In [394]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
sns.distplot((pred-y_test))
plt.xlabel('residual')
plt.title('residual histogram')
Out[394]:
Text(0.5, 1.0, 'residual histogram')

The residual plot shows a sort of funnel shape. Given that the dataset has very few data points, the shape is distorted, but its presence indicates heteroskedasticity: the variance of the errors is unequal along the price range. Simply put, the errors are small in the lower price range and grow as the price increases.
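A formal check, not part of the original analysis, is the Breusch-Pagan test. A minimal sketch, reusing the ElasticNet predictions pred from above and taking the fitted values as the single explanatory variable (an assumption made here to keep the test well-posed on a small test set):

import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan regresses the squared residuals on explanatory variables;
# a small p-value rejects the null hypothesis of constant error variance.
residuals = pred - y_test
exog = sm.add_constant(pred)  # fitted values plus an intercept
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(residuals, exog)
print("Breusch-Pagan LM p-value: ", lm_pvalue)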
To tackle heteroskedasticity, weighted regression can be performed. Every data point is assigned a weight based on the variance of its fitted values: small weights are given to observations associated with higher variances, shrinking their contribution to the squared error (residual). The method used is Weighted Least Squares (WLS).
Assume the weights to be

w_i = 1 / variance_i,  for i = 1, ..., n (the number of samples)

Basically, a sample with high error variance is assigned a smaller weight, and a sample with low error variance a higher weight. The weights form a 1D array of length equal to the number of samples. (Note that, as a rough proxy, the code below uses the per-row variance of the scaled feature values directly as the weight.)

In [373]:
# per-row variance of the scaled feature values, used as a rough
# proxy for the per-sample weights described above
weights = X_train.apply(lambda x: np.var(x),axis=1).values.ravel('K')
weights
Out[373]:
array([0.36355981, 0.47233522, 0.18755933, 0.31203838, 0.98163972,
       0.81534256, 0.62195724, 1.6757309 , 1.35736675, 0.49892083,
       0.34901347, 8.41810332, 0.912873  , 2.26177483, 1.82572548,
       0.4180836 , 0.22310492, 0.16671625, 1.39365981, 0.30495931,
       1.27667874, 1.42504386, 0.62438679, 1.30750983, 0.96532475,
       0.43848524, 0.4033464 , 1.14371272, 0.98912255, 0.27245959,
       0.70855455, 0.18930625, 0.96516178, 2.14974895, 0.34910303,
       0.24425545, 1.17314343, 1.13087275, 0.35892001, 0.18775988,
       0.17009998, 1.38009349, 1.4161696 , 1.36738696, 0.60747512,
       0.44929128, 1.08657162, 0.83772165, 1.44283185, 0.42825359,
       0.30881805, 0.26212765, 1.21761944, 1.83552253, 1.93926673,
       0.28856914, 0.61643319, 3.83083844, 0.17434137, 1.00724101,
       0.3474051 , 0.51767138, 8.30806004, 2.79601051, 0.21643825,
       0.21730675, 0.53512903, 0.42406894, 1.27706428, 0.30726831,
       1.67420031, 1.00695405, 0.85921725, 0.25849183, 0.32570049,
       0.47408818, 0.31999923, 0.1881886 , 1.18985051, 0.23798268,
       0.98372156, 0.2283615 , 0.43848524, 1.1883302 , 0.25849183,
       0.17769585, 0.99025023, 0.30859349, 1.71266991, 0.51872736,
       0.40130099, 3.2579952 , 1.07131656, 0.18111196, 0.31751569,
       1.83552253, 0.4495501 , 0.22310492, 0.55907076, 0.61601956,
       0.2894162 , 0.38636617, 0.17539945, 0.22551749, 0.41745866,
       1.00521707, 0.73033078, 0.40010183, 0.65105897, 2.51457907,
       0.32474976, 1.5326066 , 1.58136919, 2.1773225 , 0.91518317,
       1.93926673, 0.22186174, 0.61280786, 2.42449889, 2.51746712,
       1.43161602, 0.86606247, 0.40680487, 0.52325915, 1.19379554,
       0.82313527, 0.33598884, 0.40922222, 1.00716962, 0.90570362,
       1.94074386, 0.26195825, 1.82855621, 0.52407242, 0.17335698,
       0.38636617, 1.46599076, 0.42176788, 0.85405145, 1.30664409,
       0.1655501 , 0.67633649, 0.89166814, 0.61655546, 0.45114961,
       0.24147346])
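The WLS class imported at the top is never used below; the weights are instead passed to Ridge via sample_weight. For completeness, a minimal sketch of how statsmodels WLS could consume this weights array (statsmodels treats the weights as proportional to the inverse of the observation variances):

import statsmodels.api as sm

# WLS multiplies each squared residual by its weight, so observations
# with small weights (high assumed variance) influence the fit less.
wls = WLS(y_train, sm.add_constant(X_train), weights=weights).fit()
wls_pred = wls.predict(sm.add_constant(X_test))
print("WLS test R^2: ", r2_score(y_test, wls_pred))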

Of all the models above, Ridge performed the best, with the highest r2_score on the test set. Thus Ridge's fit() is called with the sample_weight parameter to assign these weights.

In [408]:
lm = Ridge(alpha=2)
lm.fit(X_train,y_train,sample_weight=weights)

pred = lm.predict(X_train)
print("Train score: ",r2_score(pred,y_train))

pred = lm.predict(X_test)
print("Test_score: ",r2_score(pred,y_test))

print("RMSE: ",np.sqrt(mean_squared_error(pred,y_test)))
print("MAE: ",mean_absolute_error(pred,y_test))
Train score:  0.8912144419880048
Test_score:  0.8821630474839759
RMSE:  3125.032593950767
MAE:  2347.13836664688
In [409]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
sns.scatterplot(x=pred,y=(pred-y_test))
plt.hlines(0,xmax=35000,xmin=5000,linestyles='dotted',colors='grey')
plt.xlabel('predicted price')
plt.ylabel('residual error')
plt.title('Residual plot')
Out[409]:
Text(0.5, 1.0, 'Residual plot')