Exploring eBay Car Sales Data

  • In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
  • You can find the dataset and its description here.
  • We sampled 50,000 data points from the full dataset, to ensure our code runs quickly in our hosted environment.
  • The aim of this project is to clean the data and analyze the included used car listings and try to do the following analysis:
    • Find the most common brand/model combinations
    • Split the odometer_km into groups, and use aggregation to see if average prices follows any patterns based on the mileage.
    • How much cheaper are cars with damage than their non-damaged counterparts?

Reading the dataset into a dataframe

In [295]:
import pandas as pd #importing pandas library
import numpy as np  #importing numpy library
autos = pd.read_csv('autos.csv',encoding = 'Latin-1')
In [296]:
autos
Out[296]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon privat Angebot $24,900 control limousine 2011 automatik 239 q5 100,000km 1 diesel audi nein 2016-03-27 00:00:00 0 82131 2016-04-01 13:47:40
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot $1,980 control cabrio 1996 manuell 75 astra 150,000km 5 benzin opel nein 2016-03-28 00:00:00 0 44807 2016-04-02 14:18:02
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot $13,200 test cabrio 2014 automatik 69 500 5,000km 11 benzin fiat nein 2016-04-02 00:00:00 0 73430 2016-04-04 11:47:27
49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition privat Angebot $22,900 control kombi 2013 manuell 150 a3 40,000km 11 diesel audi nein 2016-03-08 00:00:00 0 35683 2016-04-05 16:45:07
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V privat Angebot $1,250 control limousine 1996 manuell 101 vectra 150,000km 1 benzin opel nein 2016-03-13 00:00:00 0 45897 2016-04-06 21:18:48

50000 rows × 20 columns

In [297]:
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
Out[297]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

As is clear from the output of the code cell above, the dataset contains 50,000 rows and 20 columns. Most of the columns don't have missing values except for the 'vehicletype', 'gearbox', 'model', 'fueltype' and 'notRepairedDamage' columns. 12 columns are strings, 5 are integers while 3 columns have datetime values.

Renaming the columns

In [298]:
autos.columns
Out[298]:
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

We can see in the output of the code cell above that most of the column names are in camelcase and some of them are too long which makes it difficult to work with. So now, we will change the names of these columns from camelcase to snakecase

In [299]:
autos = autos.rename({'dateCrawled':'date_crawled',
                      'yearOfRegistration':'registration_year',
                      'monthOfRegistration':'registration_month',
                      'notRepairedDamage':'unrepaired_damage',
                      'dateCreated':'ad_created','offerType':'offer_type',
                      'abtest':'ab_test','vehicleType':'vehicle_type',
                      'gearBox':'gear_box','powerPS':'power_ps',
                      'fuelType':'fuel_type','nrOfPictures':'nr_of_pictures',
                      'postalCode':'postal_code',
                      'lastSeen':'last_seen'},axis = 1)
In [300]:
autos.head()
Out[300]:
date_crawled name seller offer_type price ab_test vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

Analysing the columns

In [301]:
autos.describe(include='all')
Out[301]:
date_crawled name seller offer_type price ab_test vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-19 17:36:18 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN
  • Few columns appear to have almost a single value like 'seller', 'offer_type' and 'nr_of_pictures'. We shall drop these from our analysis.
  • Some columns have numeric data which are stored as strings - 'price' and 'odometer' for example. We shall clean these and convert into numeric data.
  • The maximum and minimum values in the 'registration_year' column are 9999 and 1000 and minimum values in the 'registration_month' and 'power_ps' columns is 0 which is not possible. We need more investigation for these columns. For now, we will clean the 'price' and 'odometer' columns

Cleaning the price and odometer columns

In [302]:
autos = autos.drop(columns = ['seller','offer_type','nr_of_pictures'])
In [303]:
autos['price'] = autos['price'].str.replace('$','')
autos['price'] = autos['price'].str.replace(',','').astype(int)
autos['odometer'] = autos['odometer'].str.replace('km','')
autos['odometer'] = autos['odometer'].str.replace(',','').astype(int)
autos = autos.rename({'odometer':'odometer_km'},axis = 1)
autos
Out[303]:
date_crawled name price ab_test vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 5000 control bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 2016-03-26 00:00:00 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 8500 control limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 2016-04-04 00:00:00 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 8990 test limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 2016-03-26 00:00:00 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4350 control kleinwagen 2007 automatik 71 fortwo 70000 6 benzin smart nein 2016-03-12 00:00:00 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 1350 test kombi 2003 manuell 0 focus 150000 7 benzin ford nein 2016-04-01 00:00:00 39218 2016-04-01 14:38:50
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon 24900 control limousine 2011 automatik 239 q5 100000 1 diesel audi nein 2016-03-27 00:00:00 82131 2016-04-01 13:47:40
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... 1980 control cabrio 1996 manuell 75 astra 150000 5 benzin opel nein 2016-03-28 00:00:00 44807 2016-04-02 14:18:02
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge 13200 test cabrio 2014 automatik 69 500 5000 11 benzin fiat nein 2016-04-02 00:00:00 73430 2016-04-04 11:47:27
49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition 22900 control kombi 2013 manuell 150 a3 40000 11 diesel audi nein 2016-03-08 00:00:00 35683 2016-04-05 16:45:07
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V 1250 control limousine 1996 manuell 101 vectra 150000 1 benzin opel nein 2016-03-13 00:00:00 45897 2016-04-06 21:18:48

50000 rows × 17 columns

In [304]:
print(autos['price'].unique().shape)
autos['price'].describe()
(2357,)
Out[304]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
In [305]:
autos['price'].value_counts().sort_index()
Out[305]:
0           1421
1            156
2              3
3              1
5              2
            ... 
10000000       1
11111111       2
12345678       3
27322222       1
99999999       1
Name: price, Length: 2357, dtype: int64
  • Some price values are abnormally low and high. Some are as low as 0 and some even reach 100 million. We will remove these outliers before we investigate further. We set the minimum price to be 1000 and maximum as 100000 dollars.
In [306]:
autos.loc[autos['price'].between(0,1000),'price'] = np.nan
autos.loc[autos['price'].between(100000,100000000),'price'] = np.nan
autos['price'].value_counts().sort_index()
Out[306]:
1039.0      1
1040.0      1
1049.0      6
1050.0     95
1059.0      1
           ..
93911.0     1
94999.0     1
98500.0     1
99000.0     2
99900.0     2
Name: price, Length: 2051, dtype: int64
In [307]:
print(autos['price'].value_counts().shape)
autos['price'].describe()
(2051,)
Out[307]:
count    37948.000000
mean      7191.577896
std       7920.021774
min       1039.000000
25%       2300.000000
50%       4499.000000
75%       8999.000000
max      99900.000000
Name: price, dtype: float64

Now that we have removed the outliers, we are left with around 38000 price entries. Standard deviation has also reduced from about 480000 to almost 8000 which is a direct result of removing the exceedingly large and small price values.

In [308]:
autos['odometer_km'].value_counts(dropna=False)
Out[308]:
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64
In [309]:
autos['odometer_km'].describe()
Out[309]:
count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

The data in this column seems fine and it does not have any outliers that could affect our analysis.

Exploring the date columns

We are only interested in number of sales or registrations on a single day, so we will not consider the time part for all the three date columns.

In [310]:
date_crawled = autos['date_crawled'].str[:10].value_counts(normalize=True,dropna=False).sort_index()*100
In [311]:
date_crawled
Out[311]:
2016-03-05    2.538
2016-03-06    1.394
2016-03-07    3.596
2016-03-08    3.330
2016-03-09    3.322
2016-03-10    3.212
2016-03-11    3.248
2016-03-12    3.678
2016-03-13    1.556
2016-03-14    3.662
2016-03-15    3.398
2016-03-16    2.950
2016-03-17    3.152
2016-03-18    1.306
2016-03-19    3.490
2016-03-20    3.782
2016-03-21    3.752
2016-03-22    3.294
2016-03-23    3.238
2016-03-24    2.910
2016-03-25    3.174
2016-03-26    3.248
2016-03-27    3.104
2016-03-28    3.484
2016-03-29    3.418
2016-03-30    3.362
2016-03-31    3.192
2016-04-01    3.380
2016-04-02    3.540
2016-04-03    3.868
2016-04-04    3.652
2016-04-05    1.310
2016-04-06    0.318
2016-04-07    0.142
Name: date_crawled, dtype: float64
In [312]:
date_crawled.describe()
Out[312]:
count    34.000000
mean      2.941176
std       0.976476
min       0.142000
25%       2.988500
50%       3.271000
75%       3.488500
max       3.868000
Name: date_crawled, dtype: float64
In [313]:
autos['date_crawled'].unique().shape
Out[313]:
(48213,)

Clearly, their are 48213 unique values in the date_crawled column all of which are of April and March months of the year 2016 with each value occupying almost 3% of the total values.

In [314]:
autos['ad_created'].str[:10].describe()
Out[314]:
count          50000
unique            76
top       2016-04-03
freq            1946
Name: ad_created, dtype: object
In [315]:
ad_created = (autos['ad_created'].str[:10].value_counts(normalize=True).sort_index())*100
In [316]:
ad_created
Out[316]:
2015-06-11    0.002
2015-08-10    0.002
2015-09-09    0.002
2015-11-10    0.002
2015-12-05    0.002
              ...  
2016-04-03    3.892
2016-04-04    3.688
2016-04-05    1.184
2016-04-06    0.326
2016-04-07    0.128
Name: ad_created, Length: 76, dtype: float64
In [317]:
ad_created.describe()
Out[317]:
count    76.000000
mean      1.315789
std       1.595835
min       0.002000
25%       0.002000
50%       0.014000
75%       3.227500
max       3.892000
Name: ad_created, dtype: float64
In [318]:
autos['ad_created'].unique().shape
Out[318]:
(76,)

The ad_created column contains 76 unique values with each unique value occupying about 1.3% by average of the total values which means about 650 ads were created on a single day by average. Almost half of the values are of the months of March and April of the year 2016, about 5-10 values are from months of 2015.

In [319]:
autos['last_seen'].str[:10].value_counts()
Out[319]:
2016-04-06    11050
2016-04-07     6546
2016-04-05     6214
2016-03-17     1396
2016-04-03     1268
2016-04-02     1245
2016-03-30     1242
2016-04-04     1231
2016-03-31     1192
2016-03-12     1191
2016-04-01     1155
2016-03-29     1117
2016-03-22     1079
2016-03-28     1043
2016-03-21     1037
2016-03-20     1035
2016-03-24      978
2016-03-25      960
2016-03-23      929
2016-03-26      848
2016-03-16      822
2016-03-27      801
2016-03-15      794
2016-03-19      787
2016-03-14      640
2016-03-11      626
2016-03-10      538
2016-03-09      493
2016-03-13      449
2016-03-08      380
2016-03-18      371
2016-03-07      268
2016-03-06      221
2016-03-05       54
Name: last_seen, dtype: int64
In [320]:
autos['last_seen'].str[:10].describe()
Out[320]:
count          50000
unique            34
top       2016-04-06
freq           11050
Name: last_seen, dtype: object
In [321]:
last_seen = (autos['last_seen'].str[:10].value_counts(normalize=True).sort_index())*100
In [322]:
last_seen.mean()
Out[322]:
2.941176470588235
In [323]:
autos['last_seen'].str[:10].unique().shape
Out[323]:
(34,)

Clearly, there are 34 unique date values in the last_seen column all of them are of the year 2016 with each date occupying almost 3%, i.e., 1500 sales by average of all the values. Most sales are of the April month and most of these are 3 consecutive days- 5th, 6th and 7th of April - almost 4-7 times the average sales number.

In [324]:
autos['registration_year'].describe()
Out[324]:
count    50000.000000
mean      2005.073280
std        105.712813
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The registration_year column appears to have a number of invalid entries. The maximum and minimum values of the column are 9999 and 1000 which is impossible. We will need to fix these errors.

In [325]:
autos['registration_year'].unique()
Out[325]:
array([2004, 1997, 2009, 2007, 2003, 2006, 1995, 1998, 2000, 2017, 2010,
       1999, 1982, 1990, 2015, 2014, 1996, 1992, 2005, 2002, 2012, 2011,
       2008, 1985, 2016, 1994, 1986, 2001, 2018, 2013, 1972, 1993, 1988,
       1989, 1967, 1973, 1956, 1976, 4500, 1987, 1991, 1983, 1960, 1969,
       1950, 1978, 1980, 1984, 1963, 1977, 1961, 1968, 1934, 1965, 1971,
       1966, 1979, 1981, 1970, 1974, 1910, 1975, 5000, 4100, 2019, 1959,
       9996, 9999, 6200, 1964, 1958, 1800, 1948, 1931, 1943, 9000, 1941,
       1962, 1927, 1937, 1929, 1000, 1957, 1952, 1111, 1955, 1939, 8888,
       1954, 1938, 2800, 5911, 1500, 1953, 1951, 4800, 1001])

We will remove the years earlier than 1910 (this is the earliest plausible year after the invention of the first car). We shall also remove the years later than 2016 since a car cannot be first registered after it was last seen on the website (we saw earlier that all the values in the last_seen column are of the year 2016). We will not remove the entire row since we may lose valuable data, so we will assign the corresponding value as NaN.

In [326]:
autos.loc[autos['registration_year'] < 1910,'registration_year'] = np.nan
In [327]:
autos.loc[autos['registration_year'] > 2016, 'registration_year'] = np.nan
In [328]:
autos['registration_year'].describe()
Out[328]:
count    48028.00000
mean      2002.80351
std          7.31085
min       1910.00000
25%       1999.00000
50%       2003.00000
75%       2008.00000
max       2016.00000
Name: registration_year, dtype: float64
In [329]:
(autos['registration_year'].value_counts(normalize=True).sort_index().tail(50))*100
Out[329]:
1967.0    0.056217
1968.0    0.054135
1969.0    0.039560
1970.0    0.093695
1971.0    0.056217
1972.0    0.072874
1973.0    0.054135
1974.0    0.049971
1975.0    0.039560
1976.0    0.056217
1977.0    0.045807
1978.0    0.097860
1979.0    0.072874
1980.0    0.201966
1981.0    0.064546
1982.0    0.089531
1983.0    0.110352
1984.0    0.110352
1985.0    0.218622
1986.0    0.158241
1987.0    0.156159
1988.0    0.295661
1989.0    0.376863
1990.0    0.822437
1991.0    0.741234
1992.0    0.814108
1993.0    0.926543
1994.0    1.374198
1995.0    2.733822
1996.0    3.006579
1997.0    4.222537
1998.0    5.107437
1999.0    6.246356
2000.0    6.983426
2001.0    5.627967
2002.0    5.274007
2003.0    5.677938
2004.0    5.698759
2005.0    6.277588
2006.0    5.638378
2007.0    4.797202
2008.0    4.645207
2009.0    4.368285
2010.0    3.325144
2011.0    3.402182
2012.0    2.754643
2013.0    1.678188
2014.0    1.386691
2015.0    0.830765
2016.0    2.740068
Name: registration_year, dtype: float64
In [330]:
((autos['registration_year'].value_counts(normalize=True))*100).mean()
Out[330]:
1.2820512820512817

Each year makes approximately 1.3% of the entire registration_year column, i.e., about 620 cars/year by average. About 20-30 cars/year were registered before the 80s and the registrations start to increase after that. The maximum registrations were recorded in the year 2005 - around 3000, then we again see a steady decrease untill the year 2016 where the numbers jump from about 400 (in the year 2015) to almost 1300.

Aggregating brand, price and odometer data

In [331]:
autos['brand'].describe()
Out[331]:
count          50000
unique            40
top       volkswagen
freq           10687
Name: brand, dtype: object
In [332]:
(autos['brand'].value_counts())
Out[332]:
volkswagen        10687
opel               5461
bmw                5429
mercedes_benz      4734
audi               4283
ford               3479
renault            2404
peugeot            1456
fiat               1308
seat                941
skoda               786
mazda               757
nissan              754
citroen             701
smart               701
toyota              617
sonstige_autos      546
hyundai             488
volvo               457
mini                424
mitsubishi          406
honda               399
kia                 356
alfa_romeo          329
porsche             294
suzuki              293
chevrolet           283
chrysler            181
dacia               129
daihatsu            128
jeep                110
subaru              109
land_rover           99
saab                 80
daewoo               79
trabant              78
jaguar               77
rover                69
lancia               57
lada                 31
Name: brand, dtype: int64

There are 40 unique brands in the brand column with volkswagen being the most frequent with more than 10000 sales. We will leave aggregation of those brands whose frequencies are less than 1% of the total values in the brand column.

In [333]:
brand_sel = (autos['brand'].value_counts(normalize=True))*100 > 1
In [334]:
brands = ((autos['brand'].value_counts(normalize=True))*100)[brand_sel]
In [335]:
brands
Out[335]:
volkswagen        21.374
opel              10.922
bmw               10.858
mercedes_benz      9.468
audi               8.566
ford               6.958
renault            4.808
peugeot            2.912
fiat               2.616
seat               1.882
skoda              1.572
mazda              1.514
nissan             1.508
citroen            1.402
smart              1.402
toyota             1.234
sonstige_autos     1.092
Name: brand, dtype: float64
In [336]:
sel_brands = brands.index
In [337]:
price_by_brand = {}

for b in sel_brands:
    mean_price = autos.loc[autos['brand'] == b, 'price'].mean()
    price_by_brand[b] = mean_price
In [338]:
price_by_brand
Out[338]:
{'volkswagen': 6652.138861386138,
 'opel': 4259.6706021251475,
 'bmw': 8990.066779374472,
 'mercedes_benz': 9202.106543822349,
 'audi': 10242.50607615447,
 'ford': 5351.12505663797,
 'renault': 3625.2169274537696,
 'peugeot': 4006.038684719536,
 'fiat': 4033.7395061728394,
 'seat': 5577.062682215744,
 'skoda': 6894.382889200561,
 'mazda': 5300.20955882353,
 'nissan': 6403.148362235067,
 'citroen': 4676.230188679246,
 'smart': 3760.979623824451,
 'toyota': 5597.41981981982,
 'sonstige_autos': 12567.545691906005}
  • Analysing the results shows that top three most expenseive brands are 'Audi' which has the highest average selling price of 10242 dollars, 2nd highest is 'Mercedes Benz' - 9202 dollars and the 3rd highest average price is of 'BMW' - nearly 8990 dollars.
  • Next three high-priced brands are 'skoda', 'volkswagen' and 'nissan' with average prices of 6894, 6652 and 6403 dollars respectively.
  • 'sonstige_autos' which means 'other cars' in English is showing the highest values but these may include newer or obsolete brands so we can ignore these in the results.
  • And the cheapest brands are 'renault', 'smart' and 'peugeot' in increasing order of prices.

Next important parameter is the mileage of the cars. We have only the odometer_km column with us so we'd have to make do with it.

In [339]:
mileage = {}
for b in sel_brands:
    mean = autos.loc[autos['brand'] == b, 'odometer_km'].mean()
    mileage[b] = mean
In [340]:
mean_mileage_km = pd.Series(mileage)
mean_mileage_km = mean_mileage_km.sort_values(ascending = False)
mean_mileage_km
Out[340]:
bmw               132521.643028
mercedes_benz     130886.142797
audi              129643.941163
opel              129298.663248
volkswagen        128955.272761
renault           128223.793677
peugeot           127352.335165
mazda             125132.100396
ford              124131.934464
seat              122061.636557
citroen           119764.621969
nissan            118978.779841
fiat              117037.461774
toyota            115988.654781
skoda             110947.837150
smart             100756.062767
sonstige_autos     87188.644689
dtype: float64
In [341]:
mean_price_dollars = pd.Series(price_by_brand)
mean_price_dollars = mean_price_dollars.sort_values(ascending = False)
mean_price_dollars 
Out[341]:
sonstige_autos    12567.545692
audi              10242.506076
mercedes_benz      9202.106544
bmw                8990.066779
skoda              6894.382889
volkswagen         6652.138861
nissan             6403.148362
toyota             5597.419820
seat               5577.062682
ford               5351.125057
mazda              5300.209559
citroen            4676.230189
opel               4259.670602
fiat               4033.739506
peugeot            4006.038685
smart              3760.979624
renault            3625.216927
dtype: float64
In [342]:
df = pd.DataFrame(mean_price_dollars,columns = ['mean_price_dollars'])
df
Out[342]:
mean_price_dollars
sonstige_autos 12567.545692
audi 10242.506076
mercedes_benz 9202.106544
bmw 8990.066779
skoda 6894.382889
volkswagen 6652.138861
nissan 6403.148362
toyota 5597.419820
seat 5577.062682
ford 5351.125057
mazda 5300.209559
citroen 4676.230189
opel 4259.670602
fiat 4033.739506
peugeot 4006.038685
smart 3760.979624
renault 3625.216927
In [343]:
df['mean_mileage_km'] = mean_mileage_km
df = df.iloc[1:,:]
df['mean_price_dollars'] = df['mean_price_dollars'].astype(int)
df['mean_mileage_km'] = df['mean_mileage_km'].astype(int)
<ipython-input-343-82216bbf9c11>:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['mean_price_dollars'] = df['mean_price_dollars'].astype(int)
<ipython-input-343-82216bbf9c11>:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['mean_mileage_km'] = df['mean_mileage_km'].astype(int)
In [344]:
df
Out[344]:
mean_price_dollars mean_mileage_km
audi 10242 129643
mercedes_benz 9202 130886
bmw 8990 132521
skoda 6894 110947
volkswagen 6652 128955
nissan 6403 118978
toyota 5597 115988
seat 5577 122061
ford 5351 124131
mazda 5300 125132
citroen 4676 119764
opel 4259 129298
fiat 4033 117037
peugeot 4006 127352
smart 3760 100756
renault 3625 128223

Exploring most common brand/model combinations

In [345]:
autos.loc[autos['brand']=='bmw','model'].describe().top
Out[345]:
'3er'
In [346]:
d = {}
for b in autos['brand'].unique():
    d[b] = autos.loc[autos['brand']==b,'model'].describe().top
d
    
    
Out[346]:
{'peugeot': '2_reihe',
 'bmw': '3er',
 'volkswagen': 'golf',
 'smart': 'fortwo',
 'ford': 'focus',
 'chrysler': 'andere',
 'seat': 'ibiza',
 'renault': 'twingo',
 'mercedes_benz': 'c_klasse',
 'audi': 'a4',
 'sonstige_autos': nan,
 'opel': 'corsa',
 'mazda': '3_reihe',
 'porsche': '911',
 'mini': 'cooper',
 'toyota': 'yaris',
 'dacia': 'sandero',
 'nissan': 'micra',
 'jeep': 'grand',
 'saab': 'andere',
 'volvo': 'v40',
 'mitsubishi': 'colt',
 'jaguar': 'andere',
 'fiat': 'punto',
 'skoda': 'octavia',
 'subaru': 'legacy',
 'kia': 'andere',
 'citroen': 'andere',
 'chevrolet': 'andere',
 'hyundai': 'i_reihe',
 'honda': 'civic',
 'daewoo': 'matiz',
 'suzuki': 'andere',
 'trabant': '601',
 'land_rover': 'freelander',
 'alfa_romeo': '156',
 'lada': 'niva',
 'rover': 'andere',
 'daihatsu': 'cuore',
 'lancia': 'ypsilon'}
In [347]:
df['most_common_model'] = pd.Series(d)
<ipython-input-347-68a91684a4fa>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['most_common_model'] = pd.Series(d)
In [348]:
df
Out[348]:
mean_price_dollars mean_mileage_km most_common_model
audi 10242 129643 a4
mercedes_benz 9202 130886 c_klasse
bmw 8990 132521 3er
skoda 6894 110947 octavia
volkswagen 6652 128955 golf
nissan 6403 118978 micra
toyota 5597 115988 yaris
seat 5577 122061 ibiza
ford 5351 124131 focus
mazda 5300 125132 3_reihe
citroen 4676 119764 andere
opel 4259 129298 corsa
fiat 4033 117037 punto
peugeot 4006 127352 2_reihe
smart 3760 100756 fortwo
renault 3625 128223 twingo

Exploring prices of damaged and un-damaged cars

Let us first rename the entries in unrepaired_damage column which are in German

In [349]:
autos.loc[autos['unrepaired_damage']=='nein','unrepaired_damage'] = 'no'
In [350]:
autos.loc[autos['unrepaired_damage']=='ja','unrepaired_damage'] ='yes'
In [351]:
autos.loc[(autos['brand']=='bmw') & (autos['unrepaired_damage']=='no'), 'price'].describe()
Out[351]:
count     3823.000000
mean      9733.702851
std       8495.453418
min       1050.000000
25%       3500.000000
50%       7490.000000
75%      12999.000000
max      72900.000000
Name: price, dtype: float64
In [352]:
d1 = {}
for b in autos['brand'].unique():
    d1[b] = int(autos.loc[(autos['brand']==b) & (autos['unrepaired_damage']=='no'), 'price'].mean())
In [353]:
df['mean_price_without_damage'] = pd.Series(d1)
<ipython-input-353-adb9091db602>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['mean_price_without_damage'] = pd.Series(d1)
In [354]:
df
Out[354]:
mean_price_dollars mean_mileage_km most_common_model mean_price_without_damage
audi 10242 129643 a4 11374
mercedes_benz 9202 130886 c_klasse 10066
bmw 8990 132521 3er 9733
skoda 6894 110947 octavia 7291
volkswagen 6652 128955 golf 7324
nissan 6403 118978 micra 6922
toyota 5597 115988 yaris 5931
seat 5577 122061 ibiza 6019
ford 5351 124131 focus 5768
mazda 5300 125132 3_reihe 5857
citroen 4676 119764 andere 4792
opel 4259 129298 corsa 4592
fiat 4033 117037 punto 4356
peugeot 4006 127352 2_reihe 4286
smart 3760 100756 fortwo 3956
renault 3625 128223 twingo 3937
In [355]:
d2 = {}
for b in autos['brand'].unique():
    d2[b] = (autos.loc[(autos['brand']==b) & (autos['unrepaired_damage']=='yes'), 'price'].mean())
In [356]:
df['mean_price_with_damage'] = pd.Series(d2)
<ipython-input-356-0cc0768a2817>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['mean_price_with_damage'] = pd.Series(d2)
In [357]:
autos.loc[(autos['unrepaired_damage'] != 'yes') & (autos['unrepaired_damage']!='no'),'unrepaired_damage'] = 'unknown'
In [358]:
autos['unrepaired_damage'].value_counts(dropna=False)
Out[358]:
no         35232
unknown     9829
yes         4939
Name: unrepaired_damage, dtype: int64
In [359]:
df['mean_price_with_damage'] = df['mean_price_with_damage'].astype(int)
<ipython-input-359-26b4b8f8b3d4>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['mean_price_with_damage'] = df['mean_price_with_damage'].astype(int)
In [360]:
df['non_damaged_vs_damaged_price_ratio'] = df['mean_price_without_damage']/df['mean_price_with_damage']
<ipython-input-360-280aa0ccb949>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['non_damaged_vs_damaged_price_ratio'] = df['mean_price_without_damage']/df['mean_price_with_damage']
In [361]:
df
Out[361]:
mean_price_dollars mean_mileage_km most_common_model mean_price_without_damage mean_price_with_damage non_damaged_vs_damaged_price_ratio
audi 10242 129643 a4 11374 4769 2.384986
mercedes_benz 9202 130886 c_klasse 10066 5182 1.942493
bmw 8990 132521 3er 9733 5113 1.903579
skoda 6894 110947 octavia 7291 5269 1.383754
volkswagen 6652 128955 golf 7324 4006 1.828258
nissan 6403 118978 micra 6922 3878 1.784941
toyota 5597 115988 yaris 5931 4256 1.393562
seat 5577 122061 ibiza 6019 3079 1.954855
ford 5351 124131 focus 5768 3067 1.880665
mazda 5300 125132 3_reihe 5857 2475 2.366465
citroen 4676 119764 andere 4792 3323 1.442070
opel 4259 129298 corsa 4592 2762 1.662563
fiat 4033 117037 punto 4356 2826 1.541401
peugeot 4006 127352 2_reihe 4286 2716 1.578056
smart 3760 100756 fortwo 3956 2027 1.951653
renault 3625 128223 twingo 3937 2563 1.536091
In [362]:
df['non_damaged_vs_damaged_price_ratio'].mean()
Out[362]:
1.7834620365276217

Conclusion

  • Top three most expensive brands are 'Audi', 'Mercedes Benz' and 'BMW'. 'Ford' ,'Nissan', 'Volkswagen' and 'Toyota' are relatively less expensive and among the cheapest ones are 'Renault', 'Smart' and 'Fiat'.

  • There seems to be no direct correlation between mileage and average prices.

  • Average price of cars without damage is about 1.78 times the average price of non-damaged cars which is not surprising. For most cases this ratio is even more than 2.