The aim of this project is to analyze the various characteristics of a vehicle and eventually predict the price of the vehicle given its various characteristics. This can be helpful when trying to understand the market, which features are pricing points and based on the features of the vehicle, what could be its price, for a new manufacturer.
The dataset can be found at this Link

The dataset contains information on 205 vehicles imported in the year of 1985 from various manufacturers. It is a dataset from the UCI repository of automobiles. The column description is as follows:-

- symboling: Insurance Risk factor associated with the price, -3 : very safe, 3 : very risky.
- normalized-losses: Relative average payment per insured vehicle year.
- make: The make of the vehicle.
- fuel-type: Fuel type, disel or gas.
- aspiration: std, turbo.
- num-of-doors: four, two.
- body-style: hardtop, wagon, sedan, hatchback, convertible.
- drive-wheels: 4wd, fwd, rwd.
- engine-location: front, rear.
- wheel-base: Distance between centers of front and rear axles.
- length: Length of the vehicle.
- width: Width of the vehicle.
- height: Height of the vehicle.
- curb-weight: Total mass of vehicle.
- engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
- num-of-cylinders: eight, five, four, six, three, twelve, two.
- engine-size: Size of the engine
- fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
- bore: Diameter of piston cylinder.
- stroke: Stroke length of the piston cylinder.
- compression-ratio: Ratio of volume of cylinder and combustion chamber.
- horsepower: Horsepower of the vehicle.
- peak-rpm: Max achievable RPM.
- city-mpg: Lowest mpg rating for the vehicle.
- highway-mpg: Highest mpg rating for the vehicle.
- price: Price of the car

There are a total of 25 characteristics related to a vehicle that have to be analyzed in order to find the best predictors for its price.

In [350]:

```
%matplotlib inline
import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error,r2_score,mean_absolute_error
from sklearn.linear_model import LinearRegression, Lasso,Ridge,ElasticNet
from sklearn.feature_selection import RFE
from statsmodels.regression.linear_model import WLS
```

In [2]:

```
df = pd.read_csv('imports-85.data',header=None)
df.head(10)
```

Out[2]:

The dataset inherently doesnot have column names. The column names have been picked up from the above specified link.

In [3]:

```
cols = [
'symboling', 'normalized_losses', 'make', 'fuel_type',
'aspiration', 'num_of_doors', 'body_style',
'drive_wheels', 'engine_location', 'wheel_base',
'length', 'width', 'height', 'curb_weight', 'engine_type',
'num_of_cylinders', 'engine_size', 'fuel_system', 'bore',
'stroke', 'compression_rate', 'horsepower',
'peak_rpm', 'city_mpg', 'highway_mpg', 'price'
]
df.columns = cols
df.head(10)
```

Out[3]:

In [4]:

```
df.isna().sum()
```

Out[4]:

The dataset as such doesnot contain any NaN values, but there are '?' as values, which in this case are reprsenting NaN values. The preliminary step would be to re-read the dataset with the `na_values`

parameter set to '?'. This will read in the dataset and at every encounter of '?' will be replaced with `np.NaN`

.

In [5]:

```
df = pd.read_csv('imports-85.data',header=None, na_values='?')
df.columns = cols
df.head(5)
```

Out[5]:

In [6]:

```
df.isna().sum()
```

Out[6]:

The dataset is pretty much clean except a few columns containing Null values. These Null values have to be cleaned before starting analysis on it.

The `price`

column is the target variable. Since the aim is to eventually predict the price, at this point imputing is not possible. Thus it would be wise to drop the 4 rows, there isnt much loss of information as well.

In [7]:

```
df = df[~df.price.isna()]
```

The *normalized_losses* column contains the maximum `NaN`

values the dataset. Dropping 41 rows will highly reduce the dataset and lose out on valuable information of other columns.

These `NaN`

values have to be imputed. The distribution of *normalized_losses* column can give a sense of how this can be achieved.

In [9]:

```
plt.style.use('fivethirtyeight')
plt.figure(figsize=(10,8))
df['normalized_losses'].plot.hist()
plt.axvline(df['normalized_losses'].mean(),color='black')
```

Out[9]:

The distribution shows that most of the *normalized_losses* values are close to the mean. Thus imputing the 41 rows missing data with the mean will not change the general distribution of the column.

In [10]:

```
df['normalized_losses'].fillna(
df['normalized_losses'].mean(),inplace=True
)
```

The *bore* and *stroke* columns have 4 missing values. The *bore* defines the diameter of the piston cylinder and *stroke* defines the length of the piston cylinder. These values are usually specific to a vehicle and even vehicles from the same manufacturer do not have the same values for *bore* and *stroke*. Imputing these values requires alot of other parameters such as *horsepower*, *engine capacity*, *pressure values*, *stroke length* etc.

In [12]:

```
df[df.bore.isna()]
```

Out[12]:

Analyzing the missing values, the rows containing the missing *bore* values also have missing values for *stroke*. This makes it impossible to derive either *bore* or *stroke* values for these rows. Since the question is of 4 rows, these can be ignored without possible loss of information loss.

In [13]:

```
df.dropna(subset=['bore','stroke'],inplace=True)
```

The *horsepower* and the *peak_rpm* columns both contain 2 missing values. A closer look into the missing values shows that the same two vehicles have missing *horsepower* and *peak_rpm*. Due to lack of data, these values cannot be calculated readily. Hence these two rows are also dropped from the dataset.

In [14]:

```
df[df.horsepower.isna()]
```

Out[14]:

In [15]:

```
df.dropna(subset=['horsepower','peak_rpm'],inplace=True)
```

In [17]:

```
df.body_style.value_counts()
```

Out[17]:

The *num_of_doors* is the last column left to clean. This column also contains 2 missing rows. A closer look into these rows reveals that both the vehicles listed were *sedans*. The *body_style* of the vehicle identifies whether the vehicle is a *sedan*, *hatchback*, *convertible*, *hardtop* or a *wagon*. Usually the same style of vehicles have similar body charateristics like number of doors, for example, a *convertible* usually is a 2-door vehicle. Needless to say that there are exceptions, but the general trend is followed. Going by this trend, a *sedan* usually has 4 doors. Thus the two missing values are imputed with the mode of the column for *sedans*.

In [18]:

```
df[df['num_of_doors'].isna()]
```

Out[18]:

In [300]:

```
df[df['body_style'] == 'sedan']['num_of_doors'].value_counts()
```

Out[300]:

In [19]:

```
df['num_of_doors'].fillna(
df[df['body_style'] == 'sedan']['num_of_doors'].mode()[0],
inplace=True
)
```

The dataset is clean from any missing values. The next step is to analyze these various charateristics. The *price* column is the target column for the predictive modeling and the base for the analysis. Looking into the distribution of price via a box plot reveals a significant number of outliers.

In [20]:

```
plt.figure(figsize=(10,8))
sns.boxplot(df.price)
plt.xlabel('Price range')
sns.despine(left=True)
```

The outliers exist beyond the 30k mark. This data could be faulty and hence requires a detailed analysis for those vehicles listed with a price greater than 30k.

In [21]:

```
df[df.price > 30000]
```

Out[21]:

The outliers do not look faulty, rather the vehicles listed are high end vehicles from world class manufacturers - *jaguar*, *mercedes benz*, *porsche* and *bmw*. These prices are a result of these vehicles being luxury/sports car from these vehicles.

The *city_mpg* column describes the average miles per gallon (mileage) the car delivers when driven in the city with occasional accelerations and brakes. Similarly the *highway_mpg* describes the average miles per gallon (mileage) the car delivers under continuos acceleration. The two metrics are considered very important for a vehicle.

In [27]:

```
plt.style.use('fivethirtyeight')
plt.subplots(figsize=(16,6))
plt.subplot(1,2,1)
sns.scatterplot(y='price',x='city_mpg',data=df)
plt.xlabel('city mileage (mpg)')
plt.subplot(1,2,2)
sns.scatterplot(y='price',x='highway_mpg',data=df)
plt.xlabel('highway mileage (mpg)')
```

Out[27]:

From the two scatter plots, the following conclusions are drawn :-

- Both city mileage and highway mileage (in mpg) are negatively correlated to the price of the vehicle
- High end luxury or sports vehicles have lesser city and highway mileage (in mpg).

The *city_mpg* and *highway_mpg* define individual charateristics of the vehicle, both measured separately. These values for a vehicle are measured under the assumption that the car is driven only in particular setting i.e. *city_mpg* for a vehicle is measured under the assumption that the vehicle is only driven in the city.

In reality the vehicle isnt always only exposed to either one. There is a combination of both city driving and highway driving, this makes a huge difference in the actual mileage (in mpg) the vehicle can achieve. Thus *fuel economy* metric is dervied. The *fuel economy* is assumed to be 60% of *city_mpg* and 40% of *highway_mpg*. Intuitively this means that the *fuel economy* gives the average miles per gallon (mpg) for a vehicle that is driven 60% of the times in a city and the rest on a highway. This assumption seems fair and realistic.

In [28]:

```
df['fuel_economy'] = (df.city_mpg * 0.6) + (df.highway_mpg * 0.4)
df.head(10)
```

Out[28]:

In [29]:

```
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
sns.scatterplot(x='fuel_economy',y='price',data=df)
plt.xlabel('Fuel economy (in mpg)')
```

Out[29]:

Since the *fuel_economy* is a representation of the *city_mpg* and *highway_mpg* combined, it mimics the trend as seen above.

The following columns are numeric in nature and mostly describe physical specifications of the vehicle :-

- wheel_base
- length
- width
- height
- curb_weight
- engine_size
- bore
- stroke
- compression_rate
- horsepower
- peak_rpm

The next set of plots analyze these values against the price of the vehicle in a regression plot. The `regplot`

fits a regressor to the data and draws the best fitting line across the data. The `regplot`

gives insight into the relationship between two variables. This shall give an idea as to which charateristics are correlated to the price.

In [55]:

```
cols = [
'wheel_base',
'length',
'width',
'height',
'curb_weight',
'engine_size',
'bore',
'stroke',
'compression_rate',
'horsepower',
'peak_rpm'
]
plt.style.use('seaborn-white')
plt.subplots(figsize=(16,42))
i=1
for col in cols:
plt.subplot(6,2,i)
sns.regplot(x=col,y='price',data=df)
plt.title('Price vs '+col)
i+=1
```

The following conclusions are drawn from the scatter plot :-

- The wheel base of a vehicle shows slight positive correlation with the price. For a certain wheel-base there is a huge band within which the prices of the vehicles lie.
- The length and width of the vehicle share a linear relationship with the price and are highly positively correlated. This means, the length and width of the vehicle can be a determining charateristic for its price.
- The height of the car has a slight positive correlation, but not enough to conclude a linear relationship. The height doesnot dictate the price of the vehicle.
- The curb_weight has a linear relationship with price of vehicle. High end vehicles have higher curb weights i.e they are heavier packed vehicles.
- The engine_size and horsepower share a linear relationship with price. High end vehicles pack a punch when it comes to power of the vehicle.
- The bore shows high positive correlation with the price whereas the stroke doesnot show a clear linear relationship.
- Compression rate and Peak RPM have no relations with the price as such.

From these conclusions, it can be said that the following columns can be definte predictors for the price of the vehicle :-

- length
- width
- curb_weight
- engine_size
- horsepower
- bore
- wheel_base

The *normalized_losses* column describes the average loss the car can incur per year. Viewing these losses against the price via a scatterplot.

In [56]:

```
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
sns.scatterplot(x='normalized_losses',y='price',data=df)
```

Out[56]:

The *normalized_losses* shares no relationship with the price. Mostly the losses for high end vehicles is at the mean.

Analyzing all the numeric columns leads to the conclusion that - Most of the numeric columns share a linear relationship with the price of the vehicle as identified. These charateristics can be good predictors for the target price. The dataset also contains categorical variables. These variables have to be compared against the price to find some relations.

The *symboling* column descibes the risk factor invovled with the price of a car. A value of +3 means very risky and -2 means pretty safe. the symbols are first changed to labels as given below then compared with price.

- Very safe - -2
- Safe - -1
- Neutral - 0
- Maybe risky - 1
- Risky - 2
- Very risky - 3

In [92]:

```
def decode(row):
if row == -2:
return 'Very safe'
elif row == -1:
return 'Safe'
elif row == 0:
return 'Neutral'
elif row == 1:
return 'Maybe risky'
elif row == 2:
return 'Risky'
else:
return 'Very risky'
df['symboling_labels'] = df.symboling.apply(decode).astype(
CategoricalDtype(
categories=['Very safe','Safe','Neutral','Maybe risky','Risky','Very risky'],
ordered=True
)
)
plt.style.use('fivethirtyeight')
plt.subplots(figsize=(16,6))
plt.subplot(1,2,1)
sns.countplot(df.symboling_labels)
plt.subplot(1,2,2)
sns.boxplot(x=df.symboling_labels, y=df.price)
```

Out[92]:

The symboling plots suggest :-

- The vehicles with a symbol rating of 0 - Neutral, are maximum sold.
- Vehicles with a symbol rating of -1 - Safe have higher prices.

The *make* of a vehicle identifies the manufacturer of the vehicle and *body_style* identifies the type of vehicle as discussed previously. The two bar plots below show the number of vehicles sold per maker and per body type.

In [130]:

```
makers = df.make.value_counts().sort_values(ascending=False)
styles = df.body_style.value_counts().sort_values(ascending=False)
makers_price = df.pivot_table(values='price',index='make').sort_values('price',ascending=False)
styles_price = df.pivot_table(values='price',index='body_style').sort_values('price',ascending=False)
plt.style.use('fivethirtyeight')
plt.subplots(figsize=(16,24))
plt.subplot(3,2,1)
sns.barplot(x=makers.index[:10],y=makers.values[:10])
plt.xlabel('Manufacturer')
plt.ylabel('Number of vehicles')
plt.subplot(3,2,2)
sns.barplot(x=styles.index,y=styles.values)
plt.ylabel('Number of vehicles')
plt.xlabel('Type of vehicle')
plt.subplot(3,2,3)
sns.barplot(makers_price.index[:5],makers_price.price[:5])
plt.ylabel('Average price')
plt.xlabel('Manufacturer')
plt.subplot(3,2,4)
sns.barplot(styles_price.index,styles_price.price)
plt.ylabel('Average price')
plt.xlabel('Type of vehicle')
plt.subplot(3,2,5)
sns.boxplot(x=df.body_style,y=df.price)
```

Out[130]:

Inferences :-

- Toyota is the most sold vehicle in the dataset. Other manufacturers have relatively lesser vehicles sold.
- Sedan body type is the most sold vehicle type. Hatchback is the second most sold. The other categories have very less sold.
- Manufacturers sell cars for different price ranges i.e. usually they have cars for lower range, mediocre range and higher range, thus manufacturer cannot be used to predict the price of a car.
- Hardtop and convertibles have the highest average price ranges.
- Every body style has vehicles of different price ranges for example, hardtops and convertibles are in high end vehicles whereas sedans and wagons are in moderate price ranges. Hatchbacks are in the lowest price range.

The columns - *fuel_type*, *aspiration*, *engine_location* and *num_of_doors* are two-category variables describing the latent features of the vehicle.

In [105]:

```
cols = [
'fuel_type',
'aspiration',
'engine_location',
'num_of_doors',
'drive_wheels'
]
plt.style.use('fivethirtyeight')
plt.subplots(figsize=(16,32))
i=1
for col in cols:
plt.subplot(5,2,i)
sns.countplot(df[col])
i+=1
plt.subplot(5,2,i)
sns.boxplot(x=df[col],y=df.price)
i+=1
```

The conclusions drawn :-

- Four door vehicles are most sold among the lot. The number of doors do not add much to the price.
- Most vehicles have engines in the front except a very few. The very few that have engines at the rear have very high price points. A few of the high end vehicles can be said to have rear engines.
- Turbo engine vehicles are higher priced than standard engine vehicles, but standard engine vehicles are most sold.
- Disel vehicles are priced higher than gas vehicles, but gas vehicles are most sold.
- Forward wheel drive vehicles are most sold. The rear wheel drive vehicles are ranged at higher price points.

Out of the above characteristics, *aspiration*, *drive_wheels* and *fuel_type* seem like good predictors for the price of the vehicle. The *engine_location* though showing distinct price gap, has too few representatives for *rear* category.

The *engine_type* describes the build of the engine from the manufacturer.

In [115]:

```
engines = df[['engine_type','price']].groupby('engine_type').mean()
engines.sort_values('price',ascending=False,inplace=True)
print(engines)
plt.subplots(figsize=(16,6))
plt.subplot(1,2,1)
sns.countplot(df.engine_type)
plt.ylabel('number of vehicles')
plt.subplot(1,2,2)
sns.barplot(x=engines.index,y=engines.price)
plt.ylabel('average price')
```

Out[115]:

The conclusions drawn :-

- Vehicles with ohc engine type are most sold, but they have the least average price. Thus vehicles with ohc engine type have lesser price points and are most sold.
- Vehicles with the ohcv engine type are priced costlier than the rest.

The plots conclude that engine type can determine the average price range of the vehicle and hence makes it an important charateristic for the prediction of the price.

The final categorical column, *num_of_cylinders* is an important parameter for an engine and determeines the power of the engine. The plots below visualize the number of vehicles sold per number of cylinders and the average price range of the vehicles sold per number of cylinders.

In [118]:

```
cylinders = df[['num_of_cylinders','price']].groupby('num_of_cylinders').mean()
cylinders.sort_values('price',ascending=False,inplace=True)
print(cylinders)
plt.subplots(figsize=(16,6))
plt.subplot(1,2,1)
sns.countplot(df.num_of_cylinders)
plt.ylabel('number of vehicles')
plt.subplot(1,2,2)
sns.barplot(x=cylinders.index,y=cylinders.price)
plt.ylabel('average price')
```

Out[118]:

- Vechiles with a four cylinder engine were sold the most and have a moderate price range.
- High end vehicles boast a twelve or eight cylinder engine, but are the less sold.

The analysis completes here. Based on the analysis, the charateristics that are estimated to be good predictors are :-

- length
- width
- curb_weight
- engine_size
- horsepower
- bore
- wheel_base
- aspiration
- drive_wheels
- fuel_type
- fuel_economy
- engine_type
- num_of_cylinders

Removing the other columns from the dataset.

In [132]:

```
cols = [
'length',
'width',
'curb_weight',
'engine_size',
'horsepower',
'bore',
'wheel_base',
'aspiration',
'drive_wheels',
'fuel_type',
'fuel_economy',
'engine_type',
'num_of_cylinders',
'price',
'body_style'
]
df = df[cols]
df.head(5)
```

Out[132]:

The categorical variables are converted to binary dummy variables using `pd.get_dummies`

. Once all column are in numeric form, using the `StandardScaler`

from `sklearn.preprocessing`

module, normalization is performed. This is done because, the columns are in on different scales.

In [395]:

```
X = df.drop('price',axis=1)
y = df.price
X = pd.get_dummies(X)
scaler = StandardScaler()
X_res = pd.DataFrame(scaler.fit_transform(X))
X_res.head(5)
```

Out[395]:

The feature set is broken down into training and testing sets for building the model.

In [396]:

```
X_train, X_test, y_train, y_test = train_test_split(X_res,y,random_state=1)
```

The models tested here are the `KNeighborsRegressor`

and the `LinearRegression`

.

Usually different manufacturers produce vehicles at different price ranges for every segment of the economy. Say there are two manufacturers - Renault and Subaru. Both these manufacturers will have a car for the lower range, a car for the mediocre range and a car at high end range. The cars at lower range usually have a lot of charateristics in the same neighbourhood such as engine_size, horsepower, drive_wheels, aspiration, fuel economy etc. Using `KNeighborsRegressor`

, such clusters will be identified and for every prediction, the price would be the mean of this cluster of cars with similar charateristics. This model can have issues since the number of vehicles are too less thus not giving a complete idea of the vehicles a manufacturer sells.

All throughout the analysis, it was clear that many variables share a linear relationship with the price variable. All numrical columns chosen were either positively or negatively correlated to the price. This was confirmed in the `regplot`

s. The categorical varibles also showed distinct average price differences, leading to the conclusion that a certain category can give an idea about the price of the vehicle. The `LinearRegression`

model is used keeping all of this in mind.

The above reasoning explains why the two models have been chosen for building the final model.

In [397]:

```
knn = KNeighborsRegressor(n_neighbors=1)
knn.fit(X_train,y_train)
pred = knn.predict(X_train)
print("Train score: ",r2_score(pred,y_train))
pred = knn.predict(X_test)
print("Test_score: ",r2_score(pred,y_test))
print("RMSE: ",np.sqrt(mean_squared_error(pred,y_test)))
print("MAE: ",mean_absolute_error(pred,y_test))
```

The `KNeighborsRegressor`

achieves a good `r2_score`

of about 83.96% on the test set. The RMSE values indicate the average error in a prediction made by the model. The model achieves a good RMSE value - 3737.26.
The models seems to have worked well, with the n_neighbors paramter set to 2.

Comapring the model with `LinearRegression`

is neccessary as it can give insights to how well this model works.

In [398]:

```
lm = LinearRegression()
lm.fit(X_train,y_train)
pred = lm.predict(X_train)
print("Train score: ",r2_score(pred,y_train))
pred = lm.predict(X_test)
print("Test_score: ",r2_score(pred,y_test))
print("RMSE: ",np.sqrt(mean_squared_error(pred,y_test)))
print("MAE: ",mean_absolute_error(pred,y_test))
```

`LinearRegression`

gives round about the same result, with a test set `r2_score`

of 85.59%, the RMSE value has decreased for this model, which is now said to have an average error of about 3181.03 in predictions. Taking a closer look at the coefficients

In [400]:

```
print("intercept : ",lm.intercept_)
for col,coef in zip(X.columns,lm.coef_):
print(col, " : ",coef)
```

It is observed that the coefficients of the independent variables are quite high for some of these. Higher coefficients tend to dictate the model more developing bias. Thus regularization is necessary. `RidgeRegression`

offers L2 regualrization, which reduces the coefficients using `alpha`

. The `LassoRegression`

offers L1 regualrization as well as feature selection, since out of independent variables that are correlated, it choses one of them to represent the group and become a predictor.

A trade off between the two would be the `ElasticNet`

which considers both L1 and L2 regularization. Instead of chosing just one of the group of correlated variables like Lasso, it considers the entire group as predictors as well as reduces the feature set. All this while also performing L2 regularization as Ridge.

It is not guaranteed to give better results, but it would be helpful to try them out.

In [401]:

```
lm = Ridge(alpha=2)
lm.fit(X_train,y_train)
pred = lm.predict(X_train)
print("Train score: ",r2_score(pred,y_train))
pred = lm.predict(X_test)
print("Test_score: ",r2_score(pred,y_test))
print("RMSE: ",np.sqrt(mean_squared_error(pred,y_test)))
print("MAE: ",mean_absolute_error(pred,y_test))
```

In [402]:

```
print("intercept : ",lm.intercept_)
for col,coef in zip(X.columns,lm.coef_):
print(col, " : ",coef)
```

As observed, `RidgeRegression`

regularizes the independent variables, i.e. the coefficents have reduced, with an alpha setting of 2. The model performs well with a test set `r2_score`

of 86.71% and an RMSE value of 3076.45.

In [403]:

```
lm = Lasso(alpha=4)
lm.fit(X_train,y_train)
pred = lm.predict(X_train)
print("Train score: ",r2_score(pred,y_train))
pred = lm.predict(X_test)
print("Test_score: ",r2_score(pred,y_test))
print("RMSE: ",np.sqrt(mean_squared_error(pred,y_test)))
print("MAE: ",mean_absolute_error(pred,y_test))
```

In [404]:

```
print("intercept : ",lm.intercept_)
for col,coef in zip(X.columns,lm.coef_):
print(col, " : ",coef)
```

As said, `LassoRegression`

performed L1 regualrization and feature selection of sorts. Some variables have a coefficient of 0 such as the *engine_type*, *aspiration* and *engine_size*. These variables were not significant to the prediction.

`LassoRegression`

has performed very well, with a test set `r2_score`

of 85.86% and an RMSE value of 3151.20.

In [405]:

```
lm = ElasticNet(alpha=0.5,l1_ratio=1)
lm.fit(X_train,y_train)
pred = lm.predict(X_train)
print("Train score: ",r2_score(pred,y_train))
pred = lm.predict(X_test)
print("Test_score: ",r2_score(pred,y_test))
print("RMSE: ",np.sqrt(mean_squared_error(pred,y_test)))
print("MAE: ",mean_absolute_error(pred,y_test))
```

In [406]:

```
print("intercept : ",lm.intercept_)
for col,coef in zip(X.columns,lm.coef_):
print(col, " : ",coef)
```

The `ElasticNet`

model performs both L1 and L2 regularization. It has reduced the coefficents as well as performed feature selection (neglected fuel_type). The model does not out perform `RidgeRegression`

but is in the same neighbourhood with reliable predictions. The model achieves a test set `r2_score`

of 85.68% and an RMSE value of 3172.52.

For this model, the residual plot is given below. Residual plot is the plot of predicted prices vs the error in each prediction. These plots show the variance in errors of the predictions.

In [393]:

```
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
sns.scatterplot(x=pred,y=(pred-y_test))
plt.hlines(0,xmax=35000,xmin=5000,linestyles='dotted',colors='grey')
plt.xlabel('predicted price')
plt.ylabel('residual error')
plt.title('Residual plot')
```

Out[393]:

In [394]:

```
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
sns.distplot((pred-y_test))
plt.xlabel('residual')
plt.title('residual histogram')
```

Out[394]:

The residual plot, shows a sort of funnel shape. Given the dataset has very less data points, this shape seems distorted. But the presence of this shape indicated Heteroskedasticity. This means that the variance of errors is unequal along the price range. Simply put it can be observed that errors are smaller around smaller price range, but errors also increase as the price increases.

To tackle heteroskedasticity, weighed regression can be performed. In this, every data point is assigned a weight based on variance of its fitted values. Small weights are given to observations associated with higher variances, to shrink the squared error (residual). The method used is the Weighted Least Squares (WLS).

Assume the weights to be

1/variance_{i}

where i is from 1 to n (number of samples) Basically, a sample with high error variance is assigned a smaller weight and a sample with low variance in error is assigned higher weight. The weights are a 1D array, the length of the number of samples.

In [373]:

```
weights = X_train.apply(lambda x: np.var(x),axis=1).values.ravel('K')
weights
```

Out[373]:

From all the above models, `RidgeRegression`

performed the best with the highest `r2_score`

on the test set. Thus using `RidgeRegression`

's `fit()`

function with the parameter `sample_weight`

to assign these weights.

In [408]:

```
lm = Ridge(alpha=2)
lm.fit(X_train,y_train,sample_weight=weights)
pred = lm.predict(X_train)
print("Train score: ",r2_score(pred,y_train))
pred = lm.predict(X_test)
print("Test_score: ",r2_score(pred,y_test))
print("RMSE: ",np.sqrt(mean_squared_error(pred,y_test)))
print("MAE: ",mean_absolute_error(pred,y_test))
```

In [409]:

```
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
sns.scatterplot(x=pred,y=(pred-y_test))
plt.hlines(0,xmax=35000,xmin=5000,linestyles='dotted',colors='grey')
plt.xlabel('predicted price')
plt.ylabel('residual error')
plt.title('Residual plot')
```

Out[409]: