Predicting import vehicle prices

Predicting the prices of cars and trucks imported in 1985

The aim of this project is to analyze the characteristics of a vehicle and then predict its price from those characteristics. This can help a new manufacturer understand the market: which features drive the price, and what a vehicle with a given set of features could cost. The dataset can be found at this Link
The dataset contains information on 205 vehicles imported in 1985 from various manufacturers. It is the Automobile dataset from the UCI Machine Learning Repository. The column description is as follows:-

  1. symboling: Insurance Risk factor associated with the price, -3 : very safe, 3 : very risky.
  2. normalized-losses: Relative average payment per insured vehicle year.
  3. make: The make of the vehicle.
  4. fuel-type: Fuel type, diesel or gas.
  5. aspiration: std, turbo.
  6. num-of-doors: four, two.
  7. body-style: hardtop, wagon, sedan, hatchback, convertible.
  8. drive-wheels: 4wd, fwd, rwd.
  9. engine-location: front, rear.
  10. wheel-base: Distance between centers of front and rear axles.
  11. length: Length of the vehicle.
  12. width: Width of the vehicle.
  13. height: Height of the vehicle.
  14. curb-weight: Total mass of vehicle.
  15. engine-type: dohc, dohcv, l, ohc, ohcf, ohcv, rotor.
  16. num-of-cylinders: eight, five, four, six, three, twelve, two.
  17. engine-size: Size of the engine
  18. fuel-system: 1bbl, 2bbl, 4bbl, idi, mfi, mpfi, spdi, spfi.
  19. bore: Diameter of piston cylinder.
  20. stroke: Stroke length of the piston cylinder.
  21. compression-ratio: Ratio of volume of cylinder and combustion chamber.
  22. horsepower: Horsepower of the vehicle.
  23. peak-rpm: Max achievable RPM.
  24. city-mpg: Lowest mpg rating for the vehicle.
  25. highway-mpg: Highest mpg rating for the vehicle.
  26. price: Price of the car

There are a total of 25 characteristics related to a vehicle that have to be analyzed in order to find the best predictors for its price.

In [1]:
%matplotlib inline
import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error,r2_score,mean_absolute_error
from sklearn.linear_model import LinearRegression, Lasso,Ridge,ElasticNet
from sklearn.feature_selection import RFE
from statsmodels.regression.linear_model import WLS
In [2]:
df = pd.read_csv('imports-85.data',header=None)
df.head(10)
Out[2]:
0 1 2 3 4 5 6 7 8 9 ... 16 17 18 19 20 21 22 23 24 25
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450
5 2 ? audi gas std two sedan fwd front 99.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250
6 1 158 audi gas std four sedan fwd front 105.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 17710
7 1 ? audi gas std four wagon fwd front 105.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 18920
8 1 158 audi gas turbo four sedan fwd front 105.8 ... 131 mpfi 3.13 3.40 8.3 140 5500 17 20 23875
9 0 ? audi gas turbo two hatchback 4wd front 99.5 ... 131 mpfi 3.13 3.40 7.0 160 5500 16 22 ?

10 rows × 26 columns

The dataset itself does not have column names. The column names are taken from the dataset description at the link mentioned above.

In [3]:
cols = [
    'symboling', 'normalized_losses', 'make', 'fuel_type',
    'aspiration', 'num_of_doors', 'body_style', 
    'drive_wheels', 'engine_location', 'wheel_base',
    'length', 'width', 'height', 'curb_weight', 'engine_type', 
    'num_of_cylinders', 'engine_size', 'fuel_system', 'bore',
    'stroke', 'compression_rate', 'horsepower', 
    'peak_rpm', 'city_mpg', 'highway_mpg', 'price'
       ]

df.columns = cols
df.head(10)
Out[3]:
symboling normalized_losses make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base ... engine_size fuel_system bore stroke compression_rate horsepower peak_rpm city_mpg highway_mpg price
0 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495
1 3 ? alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500
2 1 ? alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500
3 2 164 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950
4 2 164 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450
5 2 ? audi gas std two sedan fwd front 99.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 15250
6 1 158 audi gas std four sedan fwd front 105.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 17710
7 1 ? audi gas std four wagon fwd front 105.8 ... 136 mpfi 3.19 3.40 8.5 110 5500 19 25 18920
8 1 158 audi gas turbo four sedan fwd front 105.8 ... 131 mpfi 3.13 3.40 8.3 140 5500 17 20 23875
9 0 ? audi gas turbo two hatchback 4wd front 99.5 ... 131 mpfi 3.13 3.40 7.0 160 5500 16 22 ?

10 rows × 26 columns

In [4]:
df.isna().sum()
Out[4]:
symboling            0
normalized_losses    0
make                 0
fuel_type            0
aspiration           0
num_of_doors         0
body_style           0
drive_wheels         0
engine_location      0
wheel_base           0
length               0
width                0
height               0
curb_weight          0
engine_type          0
num_of_cylinders     0
engine_size          0
fuel_system          0
bore                 0
stroke               0
compression_rate     0
horsepower           0
peak_rpm             0
city_mpg             0
highway_mpg          0
price                0
dtype: int64

The dataset does not contain any NaN values as read, but it does contain '?' entries, which here represent missing values. The preliminary step is to re-read the dataset with the na_values parameter set to '?', so that every '?' encountered is replaced with np.nan.
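The effect of na_values can be checked on a small in-memory sample before re-reading the real file. The sketch below uses two made-up rows in the same header-less, comma-separated layout; the values and the three-column shape are assumptions for illustration only.

```python
import io
import pandas as pd

# Two made-up rows in the same comma-separated, header-less layout;
# '?' marks a missing value.
raw = io.StringIO("3,?,13495\n2,164,?\n")

as_text = pd.read_csv(raw, header=None)
raw.seek(0)  # rewind the buffer so it can be read a second time
as_nan = pd.read_csv(raw, header=None, na_values='?')

text_missing = int(as_text.isna().sum().sum())  # '?' is kept as a string
nan_missing = int(as_nan.isna().sum().sum())    # '?' is parsed as NaN
print(text_missing, nan_missing)
print(as_nan.dtypes)
```

With na_values set, the affected columns are also parsed as floats rather than strings, which is what later numeric operations such as mean() rely on.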

In [5]:
df = pd.read_csv('imports-85.data',header=None, na_values='?')
df.columns = cols
df.head(5)
Out[5]:
symboling normalized_losses make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base ... engine_size fuel_system bore stroke compression_rate horsepower peak_rpm city_mpg highway_mpg price
0 3 NaN alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111.0 5000.0 21 27 13495.0
1 3 NaN alfa-romero gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111.0 5000.0 21 27 16500.0
2 1 NaN alfa-romero gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154.0 5000.0 19 26 16500.0
3 2 164.0 audi gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102.0 5500.0 24 30 13950.0
4 2 164.0 audi gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115.0 5500.0 18 22 17450.0

5 rows × 26 columns

In [6]:
df.isna().sum()
Out[6]:
symboling             0
normalized_losses    41
make                  0
fuel_type             0
aspiration            0
num_of_doors          2
body_style            0
drive_wheels          0
engine_location       0
wheel_base            0
length                0
width                 0
height                0
curb_weight           0
engine_type           0
num_of_cylinders      0
engine_size           0
fuel_system           0
bore                  4
stroke                4
compression_rate      0
horsepower            2
peak_rpm              2
city_mpg              0
highway_mpg           0
price                 4
dtype: int64

The dataset is mostly clean, apart from a few columns containing null values. These have to be handled before the analysis can start.
The price column is the target variable. Since the aim is to predict the price, imputing it would mean training on made-up targets, so the 4 rows without a price are dropped. This loses very little information.

In [7]:
df = df[~df.price.isna()]

The normalized_losses column contains the most NaN values in the dataset. Dropping 41 rows would shrink the dataset considerably and discard valuable information in the other columns.
These NaN values therefore have to be imputed. The distribution of the normalized_losses column gives a sense of how this can be done.

In [8]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(10,8))
df['normalized_losses'].plot.hist()
plt.axvline(df['normalized_losses'].mean(),color='black')
plt.title('Distribution of Normalized losses')
Out[8]:
Text(0.5, 1.0, 'Distribution of Normalized losses')

The distribution shows that most of the normalized_losses values are close to the mean. Imputing the missing values with the mean should therefore not change the general shape of the distribution much.
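One caveat is worth checking: mean imputation leaves the column mean unchanged but narrows the spread, because every filled value sits exactly at the centre. A minimal sketch, using made-up loss values rather than the dataset itself:

```python
import numpy as np
import pandas as pd

# Made-up loss values with two gaps; not taken from the dataset.
losses = pd.Series([100.0, 120.0, np.nan, 140.0, np.nan, 160.0])

filled = losses.fillna(losses.mean())

mean_before, mean_after = losses.mean(), filled.mean()
std_before, std_after = losses.std(), filled.std()
print(mean_before, mean_after)  # the mean is preserved
print(std_before, std_after)    # the standard deviation shrinks
```

The more values are imputed, the stronger this narrowing becomes, which is worth keeping in mind when the column is later compared against the price.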

In [9]:
df['normalized_losses'] = df['normalized_losses'].fillna(
    df['normalized_losses'].mean()
)

The bore and stroke columns each have 4 missing values. The bore is the diameter of the piston cylinder and the stroke is the distance the piston travels inside it. These values are usually specific to a vehicle, and even vehicles from the same manufacturer do not share the same bore and stroke. Imputing them would require many other parameters, such as horsepower, engine capacity and pressure values.

In [10]:
df[df.bore.isna()]
Out[10]:
symboling normalized_losses make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base ... engine_size fuel_system bore stroke compression_rate horsepower peak_rpm city_mpg highway_mpg price
55 3 150.0 mazda gas std two hatchback rwd front 95.3 ... 70 4bbl NaN NaN 9.4 101.0 6000.0 17 23 10945.0
56 3 150.0 mazda gas std two hatchback rwd front 95.3 ... 70 4bbl NaN NaN 9.4 101.0 6000.0 17 23 11845.0
57 3 150.0 mazda gas std two hatchback rwd front 95.3 ... 70 4bbl NaN NaN 9.4 101.0 6000.0 17 23 13645.0
58 3 150.0 mazda gas std two hatchback rwd front 95.3 ... 80 mpfi NaN NaN 9.4 135.0 6000.0 16 23 15645.0

4 rows × 26 columns

The rows with missing bore values are also missing stroke, so neither value can be derived from the other. Since only 4 rows are affected, they can be dropped with little loss of information.
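Whether two columns are missing on exactly the same rows can also be checked programmatically rather than by eye. The sketch below uses a toy frame with illustrative values; only the column names are taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame: rows 1 and 3 lack both measurements (values are illustrative).
toy = pd.DataFrame({
    'bore':   [3.47, np.nan, 3.19, np.nan],
    'stroke': [2.68, np.nan, 3.40, np.nan],
})

# True when the two columns are missing on exactly the same rows.
same_rows = bool((toy['bore'].isna() == toy['stroke'].isna()).all())

# Cross-tabulate the missing-value flags for a quick overview.
overlap = pd.crosstab(toy['bore'].isna(), toy['stroke'].isna())
print(same_rows)
print(overlap)
```

The same check applies to any pair of columns that are suspected to be missing together, such as horsepower and peak_rpm.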

In [11]:
df.dropna(subset=['bore','stroke'],inplace=True)

The horsepower and the peak_rpm columns both contain 2 missing values. A closer look into the missing values shows that the same two vehicles have missing horsepower and peak_rpm. Due to lack of data, these values cannot be calculated readily. Hence these two rows are also dropped from the dataset.

In [12]:
df[df.horsepower.isna()]
Out[12]:
symboling normalized_losses make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base ... engine_size fuel_system bore stroke compression_rate horsepower peak_rpm city_mpg highway_mpg price
130 0 122.0 renault gas std four wagon fwd front 96.1 ... 132 mpfi 3.46 3.9 8.7 NaN NaN 23 31 9295.0
131 2 122.0 renault gas std two hatchback fwd front 96.1 ... 132 mpfi 3.46 3.9 8.7 NaN NaN 23 31 9895.0

2 rows × 26 columns

In [13]:
df.dropna(subset=['horsepower','peak_rpm'],inplace=True)
In [14]:
df.body_style.value_counts()
Out[14]:
sedan          94
hatchback      63
wagon          24
hardtop         8
convertible     6
Name: body_style, dtype: int64

The num_of_doors column is the last one left to clean. It also has 2 missing values. A closer look at these rows reveals that both vehicles are sedans. The body_style of the vehicle identifies whether it is a sedan, hatchback, convertible, hardtop or wagon. Vehicles of the same style usually share body characteristics such as the number of doors; for example, a convertible is usually a 2-door vehicle. There are exceptions, but the general trend holds. Going by this trend, a sedan usually has 4 doors. Thus the two missing values are imputed with the mode of the column for sedans.

In [15]:
df[df['num_of_doors'].isna()]
Out[15]:
symboling normalized_losses make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base ... engine_size fuel_system bore stroke compression_rate horsepower peak_rpm city_mpg highway_mpg price
27 1 148.0 dodge gas turbo NaN sedan fwd front 93.7 ... 98 mpfi 3.03 3.39 7.6 102.0 5500.0 24 30 8558.0
63 0 122.0 mazda diesel std NaN sedan fwd front 98.8 ... 122 idi 3.39 3.39 22.7 64.0 4650.0 36 42 10795.0

2 rows × 26 columns

In [16]:
df[df['body_style'] == 'sedan']['num_of_doors'].value_counts()
Out[16]:
four    78
two     14
Name: num_of_doors, dtype: int64
In [17]:
df['num_of_doors'] = df['num_of_doors'].fillna(
    df[df['body_style'] == 'sedan']['num_of_doors'].mode()[0]
)
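The cell above fills every missing door count with the sedan mode, which works here because both affected rows are sedans. If the gaps were spread over several body styles, each row could instead receive the mode of its own style. A minimal sketch of that variant on toy data (the rows are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only): one sedan and one convertible lack a door count.
toy = pd.DataFrame({
    'body_style': ['sedan', 'sedan', 'sedan', 'sedan',
                   'convertible', 'convertible', 'convertible'],
    'num_of_doors': ['four', 'four', 'two', np.nan,
                     'two', 'two', np.nan],
})

# Most frequent door count within each body style (NaN is ignored by mode()).
mode_by_style = toy.groupby('body_style')['num_of_doors'].agg(
    lambda s: s.mode().iloc[0]
)

# Look up the style-specific mode for each row and use it to fill the gaps.
toy['num_of_doors'] = toy['num_of_doors'].fillna(
    toy['body_style'].map(mode_by_style)
)
print(toy)
```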

The dataset is now free of missing values. The next step is to analyze the various characteristics. The price column is the target for the predictive modeling and the basis of the analysis. A box plot of the price distribution reveals a significant number of outliers.

In [18]:
plt.figure(figsize=(10,8))
sns.boxplot(x=df.price)
plt.xlabel('Price range')
sns.despine(left=True)
plt.title('Price distribution')
Out[18]:
Text(0.5, 1.0, 'Price distribution')

The outliers exist beyond the 30k mark. This data could be faulty and hence requires a detailed analysis for those vehicles listed with a price greater than 30k.

In [19]:
df[df.price > 30000]
Out[19]:
symboling normalized_losses make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base ... engine_size fuel_system bore stroke compression_rate horsepower peak_rpm city_mpg highway_mpg price
15 0 122.0 bmw gas std four sedan rwd front 103.5 ... 209 mpfi 3.62 3.39 8.0 182.0 5400.0 16 22 30760.0
16 0 122.0 bmw gas std two sedan rwd front 103.5 ... 209 mpfi 3.62 3.39 8.0 182.0 5400.0 16 22 41315.0
17 0 122.0 bmw gas std four sedan rwd front 110.0 ... 209 mpfi 3.62 3.39 8.0 182.0 5400.0 15 20 36880.0
47 0 145.0 jaguar gas std four sedan rwd front 113.0 ... 258 mpfi 3.63 4.17 8.1 176.0 4750.0 15 19 32250.0
48 0 122.0 jaguar gas std four sedan rwd front 113.0 ... 258 mpfi 3.63 4.17 8.1 176.0 4750.0 15 19 35550.0
49 0 122.0 jaguar gas std two sedan rwd front 102.0 ... 326 mpfi 3.54 2.76 11.5 262.0 5000.0 13 17 36000.0
70 -1 93.0 mercedes-benz diesel turbo four sedan rwd front 115.6 ... 183 idi 3.58 3.64 21.5 123.0 4350.0 22 25 31600.0
71 -1 122.0 mercedes-benz gas std four sedan rwd front 115.6 ... 234 mpfi 3.46 3.10 8.3 155.0 4750.0 16 18 34184.0
72 3 142.0 mercedes-benz gas std two convertible rwd front 96.6 ... 234 mpfi 3.46 3.10 8.3 155.0 4750.0 16 18 35056.0
73 0 122.0 mercedes-benz gas std four sedan rwd front 120.9 ... 308 mpfi 3.80 3.35 8.0 184.0 4500.0 14 16 40960.0
74 1 122.0 mercedes-benz gas std two hardtop rwd front 112.0 ... 304 mpfi 3.80 3.35 8.0 184.0 4500.0 14 16 45400.0
126 3 122.0 porsche gas std two hardtop rwd rear 89.5 ... 194 mpfi 3.74 2.90 9.5 207.0 5900.0 17 25 32528.0
127 3 122.0 porsche gas std two hardtop rwd rear 89.5 ... 194 mpfi 3.74 2.90 9.5 207.0 5900.0 17 25 34028.0
128 3 122.0 porsche gas std two convertible rwd rear 89.5 ... 194 mpfi 3.74 2.90 9.5 207.0 5900.0 17 25 37028.0

14 rows × 26 columns

The outliers do not look faulty. Rather, the vehicles listed are high-end models from well-known manufacturers: Jaguar, Mercedes-Benz, Porsche and BMW. The prices reflect that these are luxury or sports cars.
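For reference, a box plot with default settings marks a point as an outlier when it lies more than 1.5 times the interquartile range beyond the quartiles. The sketch below computes that upper fence by hand; the prices are made up so that the arithmetic is easy to follow.

```python
import pandas as pd

# Made-up prices: eight ordinary vehicles and one luxury model.
prices = pd.Series([8000, 9000, 10000, 11000, 12000,
                    13000, 14000, 15000, 45000], dtype=float)

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr  # default whisker rule of a box plot

outliers = prices[prices > upper_fence]
print(upper_fence)
print(outliers.tolist())
```

Applied to df.price, the same calculation gives the exact threshold behind the points drawn beyond the upper whisker, instead of reading an approximate cut-off from the plot.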

The city_mpg column describes the average miles per gallon (mileage) the car delivers when driven in the city, with its acceleration and braking. Similarly, highway_mpg describes the average miles per gallon the car delivers during sustained highway driving. The two metrics are considered very important for a vehicle.

In [20]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(16,6))
plt.subplot(1,2,1)
sns.scatterplot(y='price',x='city_mpg',data=df)
plt.xlabel('city mileage (mpg)')
plt.title('Price vs city mileage')
plt.subplot(1,2,2)
sns.scatterplot(y='price',x='highway_mpg',data=df)
plt.xlabel('highway mileage (mpg)')
plt.title('Price vs highway mileage')
Out[20]:
Text(0.5, 1.0, 'Price vs highway mileage')

From the two scatter plots, the following conclusions are drawn :-

  • Both city mileage and highway mileage (in mpg) are negatively correlated with the price of the vehicle.
  • High-end luxury or sports vehicles have lower city and highway mileage (in mpg).
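The visual impression of a negative relationship can be backed by a correlation coefficient. The sketch below uses only the first nine priced rows shown in the earlier output, so the number it prints is an illustration of the method and not an estimate for the full dataset.

```python
import pandas as pd

# First nine priced rows from the earlier output; far too few for a real estimate.
sample = pd.DataFrame({
    'city_mpg': [21, 21, 19, 24, 18, 19, 19, 19, 17],
    'price': [13495, 16500, 16500, 13950, 17450, 15250, 17710, 18920, 23875],
})

# Pearson correlation coefficient; a negative value means price falls as mpg rises.
r = sample['city_mpg'].corr(sample['price'])
print(round(r, 2))
```

On the full frame the same call would be df['city_mpg'].corr(df['price']), and likewise for highway_mpg.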

The city_mpg and highway_mpg columns describe separate characteristics of the vehicle, each measured on its own. Each value assumes that the car is driven in only one setting, i.e. city_mpg is measured under the assumption that the vehicle is driven only in the city.
In reality a vehicle is rarely exposed to just one of the two. It sees a combination of city and highway driving, which makes a large difference to the mileage (in mpg) it actually achieves. A fuel economy metric is therefore derived. The fuel economy is taken to be 60% of city_mpg plus 40% of highway_mpg. Intuitively, it gives the average miles per gallon (mpg) for a vehicle that is driven 60% of the time in a city and the rest on a highway. This assumption seems fair and realistic.
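As a worked example of the weighting, the first vehicle in the dataset is rated at 21 mpg in the city and 27 mpg on the highway. The helper name combined_mpg and its city_share parameter are hypothetical, introduced here only to make the 60/40 assumption explicit and easy to vary.

```python
def combined_mpg(city_mpg, highway_mpg, city_share=0.6):
    """Weighted average of the two ratings; city_share is the assumed
    fraction of driving done in the city (0.6 in the text)."""
    return city_share * city_mpg + (1 - city_share) * highway_mpg

# Ratings of the first vehicle in the dataset.
first_vehicle = combined_mpg(21, 27)
print(round(first_vehicle, 1))

# A different assumed split, to see how sensitive the metric is.
even_split = combined_mpg(21, 27, city_share=0.5)
print(round(even_split, 1))
```

With the 60/40 split this gives 0.6 × 21 + 0.4 × 27 = 12.6 + 10.8 = 23.4 mpg, which matches the fuel_economy value the notebook reports for that row.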

In [21]:
df['fuel_economy'] = (df.city_mpg * 0.6) + (df.highway_mpg * 0.4)
df.head(10)
Out[21]:
symboling normalized_losses make fuel_type aspiration num_of_doors body_style drive_wheels engine_location wheel_base ... fuel_system bore stroke compression_rate horsepower peak_rpm city_mpg highway_mpg price fuel_economy
0 3 122.0 alfa-romero gas std two convertible rwd front 88.6 ... mpfi 3.47 2.68 9.0 111.0 5000.0 21 27 13495.0 23.4
1 3 122.0 alfa-romero gas std two convertible rwd front 88.6 ... mpfi 3.47 2.68 9.0 111.0 5000.0 21 27 16500.0 23.4
2 1 122.0 alfa-romero gas std two hatchback rwd front 94.5 ... mpfi 2.68 3.47 9.0 154.0 5000.0 19 26 16500.0 21.8
3 2 164.0 audi gas std four sedan fwd front 99.8 ... mpfi 3.19 3.40 10.0 102.0 5500.0 24 30 13950.0 26.4
4 2 164.0 audi gas std four sedan 4wd front 99.4 ... mpfi 3.19 3.40 8.0 115.0 5500.0 18 22 17450.0 19.6
5 2 122.0 audi gas std two sedan fwd front 99.8 ... mpfi 3.19 3.40 8.5 110.0 5500.0 19 25 15250.0 21.4
6 1 158.0 audi gas std four sedan fwd front 105.8 ... mpfi 3.19 3.40 8.5 110.0 5500.0 19 25 17710.0 21.4
7 1 122.0 audi gas std four wagon fwd front 105.8 ... mpfi 3.19 3.40 8.5 110.0 5500.0 19 25 18920.0 21.4
8 1 158.0 audi gas turbo four sedan fwd front 105.8 ... mpfi 3.13 3.40 8.3 140.0 5500.0 17 20 23875.0 18.2
10 2 192.0 bmw gas std two sedan rwd front 101.2 ... mpfi 3.50 2.80 8.8 101.0 5800.0 23 29 16430.0 25.4

10 rows × 27 columns

In [22]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
sns.scatterplot(x='fuel_economy',y='price',data=df)
plt.xlabel('Fuel economy (in mpg)')
plt.title('Price vs Fuel economy')
Out[22]:
Text(0.5, 1.0, 'Price vs Fuel economy')

Since the fuel_economy is a representation of the city_mpg and highway_mpg combined, it mimics the trend as seen above.

The following columns are numeric in nature and mostly describe physical specifications of the vehicle :-

  • wheel_base
  • length
  • width
  • height
  • curb_weight
  • engine_size
  • bore
  • stroke
  • compression_rate
  • horsepower
  • peak_rpm

The next set of plots shows these values against the price of the vehicle in regression plots. The regplot function fits a regression to the data and draws the best-fitting line through it, which gives insight into the relationship between the two variables and an idea of which characteristics are correlated with the price.
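With its default settings, the line regplot draws is an ordinary least-squares fit of degree one. The sketch below reproduces such a fit with NumPy on made-up points that lie exactly on a known line, so the recovered slope and intercept can be checked by eye; the numbers are not from the dataset.

```python
import numpy as np

# Made-up engine sizes and prices lying exactly on price = 100 * size + 2000.
engine_size = np.array([100.0, 120.0, 150.0, 200.0])
price = 100.0 * engine_size + 2000.0

# Degree-1 least-squares fit: returns the slope first, then the intercept.
slope, intercept = np.polyfit(engine_size, price, 1)
print(slope, intercept)
```

On real data the points scatter around the line, and the steepness and tightness of that scatter are what the conclusions drawn from the plots rest on.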

In [23]:
cols = [
    'wheel_base',
    'length',
    'width',
    'height',
    'curb_weight',
    'engine_size',
    'bore',
    'stroke',
    'compression_rate',
    'horsepower',
    'peak_rpm'
]

plt.style.use('seaborn-white')
plt.figure(figsize=(16,42))
i=1
for col in cols:
    plt.subplot(6,2,i)
    sns.regplot(x=col,y='price',data=df)
    plt.title('Price vs '+col)
    i+=1
    

The following conclusions are drawn from the regression plots :-

  • The wheel base of a vehicle shows a slight positive correlation with the price. For a given wheel base, vehicle prices spread over a wide band.
  • The length and width of the vehicle share a linear relationship with the price and are highly positively correlated with it. This means the length and width of the vehicle can be determining characteristics for its price.
  • The height of the car has a slight positive correlation, but not enough to conclude a linear relationship. The height does not dictate the price of the vehicle.
  • The curb_weight has a linear relationship with the price of the vehicle. High-end vehicles have higher curb weights, i.e. they are heavier.
  • The engine_size and horsepower share a linear relationship with the price. High-end vehicles are considerably more powerful.
  • The bore shows a high positive correlation with the price, whereas the stroke does not show a clear linear relationship.
  • Compression rate and peak RPM show no clear relationship with the price.

From these conclusions, the following columns can be considered definite predictors for the price of the vehicle :-

  • length
  • width
  • curb_weight
  • engine_size
  • horsepower
  • bore
  • wheel_base

The normalized_losses column describes the relative average loss payment per insured vehicle per year. The scatter plot below shows these losses against the price.

In [24]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(12,8))
sns.scatterplot(x='normalized_losses',y='price',data=df)
plt.title('Price vs Normalized losses')
Out[24]:
Text(0.5, 1.0, 'Price vs Normalized losses')

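The normalized_losses column shows no clear relationship with the price. For most high-end vehicles the value sits at the mean (122.0). This is also the value used to fill the missing entries, so part of this pattern likely reflects the imputation rather than the underlying data.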

Analyzing the numeric columns leads to the conclusion that most of them share a linear relationship with the price of the vehicle, as identified above. These characteristics can be good predictors for the target price. The dataset also contains categorical variables, which have to be compared against the price to find further relationships.

The symboling column describes the insurance risk associated with a car relative to its price. A value of +3 means very risky and -2 means very safe. The ratings are first converted to the labels given below and then compared with the price.

  • Very safe - -2
  • Safe - -1
  • Neutral - 0
  • Maybe risky - 1
  • Risky - 2
  • Very risky - 3
In [25]:
def decode(row):
    if row == -2:
        return 'Very safe'
    elif row == -1:
        return 'Safe'
    elif row == 0:
        return 'Neutral'
    elif row == 1:
        return 'Maybe risky'
    elif row == 2:
        return 'Risky'
    else:
        return 'Very risky'
    
df['symboling_labels'] = df.symboling.apply(decode).astype(
    CategoricalDtype(
        categories=['Very safe','Safe','Neutral','Maybe risky','Risky','Very risky'],
        ordered=True
    )
)

plt.style.use('fivethirtyeight')
plt.figure(figsize=(16,6))
plt.subplot(1,2,1)
sns.countplot(x=df.symboling_labels)
plt.ylabel('Number of vehicles')
plt.xlabel('Symboling labels')
plt.title('Number of vehicles sold per symboling label')
plt.subplot(1,2,2)
sns.boxplot(x=df.symboling_labels, y=df.price)
plt.ylabel('Price distribution')
plt.xlabel('Symboling labels')
plt.title('Price distribution of vehicles sold per symboling label')
Out[25]:
Text(0.5, 1.0, 'Price distribution of vehicles sold per symboling label')
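The same decoding can be written as a lookup table with Series.map, which keeps the rating-to-label pairs in one place. This is an alternative sketch, not the notebook's code, and it behaves differently in one respect: a rating that is not in the table becomes NaN instead of falling through to 'Very risky'.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Same labels as in the notebook, written as a lookup table.
labels = {
    -2: 'Very safe', -1: 'Safe', 0: 'Neutral',
    1: 'Maybe risky', 2: 'Risky', 3: 'Very risky',
}
# The dict preserves insertion order, so the categories run from safe to risky.
risk_order = CategoricalDtype(categories=list(labels.values()), ordered=True)

symboling = pd.Series([3, 0, -1, 2, -2])  # illustrative ratings
decoded = symboling.map(labels).astype(risk_order)
print(decoded.tolist())
```

Because the categories are ordered, plots keep the labels in risk order and comparisons such as decoded.min() are meaningful.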

The symboling plots suggest :-

  • Vehicles with a symboling rating of 0 (Neutral) account for the largest number of vehicles.
  • Vehicles with a symboling rating of -1 (Safe) tend to have higher prices.

The make of a vehicle identifies its manufacturer, and body_style identifies the type of vehicle, as discussed previously. The plots below show the number of vehicles and the average price per manufacturer and per body type, followed by the price distribution per body type.

In [26]:
makers = df.make.value_counts().sort_values(ascending=False)
styles = df.body_style.value_counts().sort_values(ascending=False)
makers_price = df.pivot_table(values='price',index='make').sort_values('price',ascending=False)
styles_price = df.pivot_table(values='price',index='body_style').sort_values('price',ascending=False)


plt.style.use('fivethirtyeight')
plt.figure(figsize=(16,24))
plt.subplot(3,2,1)
sns.barplot(x=makers.index[:6],y=makers.values[:6])
plt.xlabel('Manufacturer')
plt.ylabel('Number of vehicles')
plt.title('Number of vehicles sold per manufacturer')
plt.subplot(3,2,2)
sns.barplot(x=styles.index,y=styles.values)
plt.ylabel('Number of vehicles')
plt.xlabel('Type of vehicle')
plt.title('Number of vehicles sold per vehicle type')
plt.subplot(3,2,3)
sns.barplot(x=makers_price.index[:5],y=makers_price.price[:5])
plt.ylabel('Average price')
plt.xlabel('Manufacturer')
plt.title('Average price of vehicles sold per manufacturer')
plt.subplot(3,2,4)
sns.barplot(x=styles_price.index,y=styles_price.price)
plt.ylabel('Average price')
plt.xlabel('Type of vehicle')
plt.title('Average price of vehicles sold per vehicle type')
plt.subplot(3,2,5)
sns.boxplot(x=df.body_style,y=df.price)
plt.xlabel('vehicle style')
plt.ylabel('price distribution')
plt.title('Price distribution per vehicle type')
Out[26]:
Text(0.5, 1.0, 'Price distribution per vehicle type')

Inferences :-

  • Toyota has the most vehicles in the dataset. Other manufacturers have relatively fewer.
  • Sedan is the most common body type and hatchback the second most common. The other categories have very few vehicles.
  • Manufacturers usually sell cars across low, middle and high price ranges, so the manufacturer cannot be used on its own to predict the price of a car.
  • Hardtops and convertibles have the highest average prices.
  • Each body style covers a different price range: hardtops and convertibles are among the high-end vehicles, sedans and wagons are in the moderate range, and hatchbacks are in the lowest range.
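The average prices behind these inferences come from pivot_table, which averages by default. The sketch below shows on toy data that this matches an explicit groupby mean, and how the ranking of body styles follows from it; the prices are invented for illustration.

```python
import pandas as pd

# Toy prices (illustrative only).
toy = pd.DataFrame({
    'body_style': ['sedan', 'sedan', 'hatchback', 'hatchback', 'convertible'],
    'price': [12000.0, 18000.0, 8000.0, 10000.0, 30000.0],
})

# pivot_table averages by default, so it matches an explicit groupby mean.
by_pivot = toy.pivot_table(values='price', index='body_style')['price']
by_group = toy.groupby('body_style')['price'].mean()

# Body styles ordered from the highest to the lowest average price.
ranking = by_group.sort_values(ascending=False)
print(ranking)
```

A mean can be pulled up by a few expensive models, so comparing it with the median (for example via .median() on the same groupby) is a useful cross-check against the box plot.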

The columns fuel_type, aspiration, engine_location and num_of_doors are two-category variables describing basic features of the vehicle. The drive_wheels column, which has three categories (4wd, fwd, rwd), is plotted alongside them.

In [27]:
cols = [
    'fuel_type',
    'aspiration',
    'engine_location',
    'num_of_doors',
    'drive_wheels'
]

plt.style.use('fivethirtyeight')
plt.figure(figsize=(16,36))
i=1
for col in cols:
    plt.subplot(5,2,i)
    sns.countplot(x=df[col])
    plt.ylabel('number of vehicles')
    plt.title('number of vehicles sold per '+col)
    i+=1
    plt.subplot(5,2,i)
    sns.boxplot(x=df[col],y=df.price)
    plt.ylabel('price distribution')
    plt.title('price distribution of vehicles sold per '+col)
    i+=1