Exploring eBay Car Sales Data¶

In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
You can find the dataset and its description here.
We sampled 50,000 data points from the full dataset, to ensure our code runs quickly in our hosted environment.
The aim of this project is to clean the data and analyze the included used car listings and try to do the following analysis:
- Find the most common brand/model combinations
- Split the odometer_km into groups, and use aggregation to see if average prices follows any patterns based on the mileage.
- How much cheaper are cars with damage than their non-damaged counterparts?

Reading the dataset into a dataframe¶

In [295]:

import pandas as pd #importing pandas library
import numpy as np  #importing numpy library
autos = pd.read_csv('autos.csv',encoding = 'Latin-1')

In [296]:

autos

Out[296]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	nrOfPictures	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	0	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	0	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	0	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	0	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	0	39218	2016-04-01 14:38:50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
49995	2016-03-27 14:38:19	Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon	privat	Angebot	$24,900	control	limousine	2011	automatik	239	q5	100,000km	1	diesel	audi	nein	2016-03-27 00:00:00	0	82131	2016-04-01 13:47:40
49996	2016-03-28 10:50:25	Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...	privat	Angebot	$1,980	control	cabrio	1996	manuell	75	astra	150,000km	5	benzin	opel	nein	2016-03-28 00:00:00	0	44807	2016-04-02 14:18:02
49997	2016-04-02 14:44:48	Fiat_500_C_1.2_Dualogic_Lounge	privat	Angebot	$13,200	test	cabrio	2014	automatik	69	500	5,000km	11	benzin	fiat	nein	2016-04-02 00:00:00	0	73430	2016-04-04 11:47:27
49998	2016-03-08 19:25:42	Audi_A3_2.0_TDI_Sportback_Ambition	privat	Angebot	$22,900	control	kombi	2013	manuell	150	a3	40,000km	11	diesel	audi	nein	2016-03-08 00:00:00	0	35683	2016-04-05 16:45:07
49999	2016-03-14 00:42:12	Opel_Vectra_1.6_16V	privat	Angebot	$1,250	control	limousine	1996	manuell	101	vectra	150,000km	1	benzin	opel	nein	2016-03-13 00:00:00	0	45897	2016-04-06 21:18:48

50000 rows × 20 columns

In [297]:

autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

Out[297]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

As is clear from the output of the code cell above, the dataset contains 50,000 rows and 20 columns. Most of the columns don't have missing values except for the 'vehicletype', 'gearbox', 'model', 'fueltype' and 'notRepairedDamage' columns. 12 columns are strings, 5 are integers while 3 columns have datetime values.

Renaming the columns¶

In [298]:

autos.columns

Out[298]:

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

We can see in the output of the code cell above that most of the column names are in camelcase and some of them are too long which makes it difficult to work with. So now, we will change the names of these columns from camelcase to snakecase

In [299]:

autos = autos.rename({'dateCrawled':'date_crawled',
                      'yearOfRegistration':'registration_year',
                      'monthOfRegistration':'registration_month',
                      'notRepairedDamage':'unrepaired_damage',
                      'dateCreated':'ad_created','offerType':'offer_type',
                      'abtest':'ab_test','vehicleType':'vehicle_type',
                      'gearBox':'gear_box','powerPS':'power_ps',
                      'fuelType':'fuel_type','nrOfPictures':'nr_of_pictures',
                      'postalCode':'postal_code',
                      'lastSeen':'last_seen'},axis = 1)

In [300]:

autos.head()

Out[300]:

	date_crawled	name	seller	offer_type	price	ab_test	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

Analysing the columns¶

In [301]:

autos.describe(include='all')

Out[301]:

	date_crawled	name	seller	offer_type	price	ab_test	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	nr_of_pictures	postal_code	last_seen
count	50000	50000	50000	50000	50000	50000	44905	50000.000000	47320	50000.000000	47242	50000	50000.000000	45518	50000	40171	50000	50000.0	50000.000000	50000
unique	48213	38754	2	2	2357	2	8	NaN	2	NaN	245	13	NaN	7	40	2	76	NaN	NaN	39481
top	2016-03-19 17:36:18	Ford_Fiesta	privat	Angebot	$0	test	limousine	NaN	manuell	NaN	golf	150,000km	NaN	benzin	volkswagen	nein	2016-04-03 00:00:00	NaN	NaN	2016-04-07 06:17:27
freq	3	78	49999	49999	1421	25756	12859	NaN	36993	NaN	4024	32424	NaN	30107	10687	35232	1946	NaN	NaN	8
mean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2005.073280	NaN	116.355920	NaN	NaN	5.723360	NaN	NaN	NaN	NaN	0.0	50813.627300	NaN
std	NaN	NaN	NaN	NaN	NaN	NaN	NaN	105.712813	NaN	209.216627	NaN	NaN	3.711984	NaN	NaN	NaN	NaN	0.0	25779.747957	NaN
min	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1000.000000	NaN	0.000000	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	0.0	1067.000000	NaN
25%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1999.000000	NaN	70.000000	NaN	NaN	3.000000	NaN	NaN	NaN	NaN	0.0	30451.000000	NaN
50%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2003.000000	NaN	105.000000	NaN	NaN	6.000000	NaN	NaN	NaN	NaN	0.0	49577.000000	NaN
75%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2008.000000	NaN	150.000000	NaN	NaN	9.000000	NaN	NaN	NaN	NaN	0.0	71540.000000	NaN
max	NaN	NaN	NaN	NaN	NaN	NaN	NaN	9999.000000	NaN	17700.000000	NaN	NaN	12.000000	NaN	NaN	NaN	NaN	0.0	99998.000000	NaN

Few columns appear to have almost a single value like 'seller', 'offer_type' and 'nr_of_pictures'. We shall drop these from our analysis.
Some columns have numeric data which are stored as strings - 'price' and 'odometer' for example. We shall clean these and convert into numeric data.
The maximum and minimum values in the 'registration_year' column are 9999 and 1000 and minimum values in the 'registration_month' and 'power_ps' columns is 0 which is not possible. We need more investigation for these columns. For now, we will clean the 'price' and 'odometer' columns

Cleaning the price and odometer columns¶

In [302]:

autos = autos.drop(columns = ['seller','offer_type','nr_of_pictures'])

In [303]:

autos['price'] = autos['price'].str.replace('$','')
autos['price'] = autos['price'].str.replace(',','').astype(int)
autos['odometer'] = autos['odometer'].str.replace('km','')
autos['odometer'] = autos['odometer'].str.replace(',','').astype(int)
autos = autos.rename({'odometer':'odometer_km'},axis = 1)
autos

Out[303]:

	date_crawled	name	price	ab_test	vehicle_type	registration_year	gearbox	power_ps	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	5000	control	bus	2004	manuell	158	andere	150000	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	8500	control	limousine	1997	automatik	286	7er	150000	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	8990	test	limousine	2009	manuell	102	golf	70000	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	4350	control	kleinwagen	2007	automatik	71	fortwo	70000	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	1350	test	kombi	2003	manuell	0	focus	150000	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
49995	2016-03-27 14:38:19	Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon	24900	control	limousine	2011	automatik	239	q5	100000	1	diesel	audi	nein	2016-03-27 00:00:00	82131	2016-04-01 13:47:40
49996	2016-03-28 10:50:25	Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...	1980	control	cabrio	1996	manuell	75	astra	150000	5	benzin	opel	nein	2016-03-28 00:00:00	44807	2016-04-02 14:18:02
49997	2016-04-02 14:44:48	Fiat_500_C_1.2_Dualogic_Lounge	13200	test	cabrio	2014	automatik	69	500	5000	11	benzin	fiat	nein	2016-04-02 00:00:00	73430	2016-04-04 11:47:27
49998	2016-03-08 19:25:42	Audi_A3_2.0_TDI_Sportback_Ambition	22900	control	kombi	2013	manuell	150	a3	40000	11	diesel	audi	nein	2016-03-08 00:00:00	35683	2016-04-05 16:45:07
49999	2016-03-14 00:42:12	Opel_Vectra_1.6_16V	1250	control	limousine	1996	manuell	101	vectra	150000	1	benzin	opel	nein	2016-03-13 00:00:00	45897	2016-04-06 21:18:48

50000 rows × 17 columns

In [304]:

print(autos['price'].unique().shape)
autos['price'].describe()

(2357,)

Out[304]:

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [305]:

autos['price'].value_counts().sort_index()

Out[305]:

0           1421
1            156
2              3
3              1
5              2
            ... 
10000000       1
11111111       2
12345678       3
27322222       1
99999999       1
Name: price, Length: 2357, dtype: int64

Some price values are abnormally low and high. Some are as low as 0 and some even reach 100 million. We will remove these outliers before we investigate further. We set the minimum price to be 1000 and maximum as 100000 dollars.

In [306]:

autos.loc[autos['price'].between(0,1000),'price'] = np.nan
autos.loc[autos['price'].between(100000,100000000),'price'] = np.nan
autos['price'].value_counts().sort_index()

Out[306]:

1039.0      1
1040.0      1
1049.0      6
1050.0     95
1059.0      1
           ..
93911.0     1
94999.0     1
98500.0     1
99000.0     2
99900.0     2
Name: price, Length: 2051, dtype: int64

In [307]:

print(autos['price'].value_counts().shape)
autos['price'].describe()

(2051,)

Out[307]:

count    37948.000000
mean      7191.577896
std       7920.021774
min       1039.000000
25%       2300.000000
50%       4499.000000
75%       8999.000000
max      99900.000000
Name: price, dtype: float64

Now that we have removed the outliers, we are left with around 38000 price entries. Standard deviation has also reduced from about 480000 to almost 8000 which is a direct result of removing the exceedingly large and small price values.

In [308]:

autos['odometer_km'].value_counts(dropna=False)

Out[308]:

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64

In [309]:

autos['odometer_km'].describe()

Out[309]:

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

The data in this column seems fine and it does not have any outliers that could affect our analysis.

Exploring the date columns¶

We are only interested in number of sales or registrations on a single day, so we will not consider the time part for all the three date columns.

In [310]:

date_crawled = autos['date_crawled'].str[:10].value_counts(normalize=True,dropna=False).sort_index()*100

In [311]:

date_crawled

Out[311]:

2016-03-05    2.538
2016-03-06    1.394
2016-03-07    3.596
2016-03-08    3.330
2016-03-09    3.322
2016-03-10    3.212
2016-03-11    3.248
2016-03-12    3.678
2016-03-13    1.556
2016-03-14    3.662
2016-03-15    3.398
2016-03-16    2.950
2016-03-17    3.152
2016-03-18    1.306
2016-03-19    3.490
2016-03-20    3.782
2016-03-21    3.752
2016-03-22    3.294
2016-03-23    3.238
2016-03-24    2.910
2016-03-25    3.174
2016-03-26    3.248
2016-03-27    3.104
2016-03-28    3.484
2016-03-29    3.418
2016-03-30    3.362
2016-03-31    3.192
2016-04-01    3.380
2016-04-02    3.540
2016-04-03    3.868
2016-04-04    3.652
2016-04-05    1.310
2016-04-06    0.318
2016-04-07    0.142
Name: date_crawled, dtype: float64

In [312]:

date_crawled.describe()

Out[312]:

count    34.000000
mean      2.941176
std       0.976476
min       0.142000
25%       2.988500
50%       3.271000
75%       3.488500
max       3.868000
Name: date_crawled, dtype: float64

In [313]:

autos['date_crawled'].unique().shape

Out[313]:

(48213,)

Clearly, their are 48213 unique values in the date_crawled column all of which are of April and March months of the year 2016 with each value occupying almost 3% of the total values.

In [314]:

autos['ad_created'].str[:10].describe()

Out[314]:

count          50000
unique            76
top       2016-04-03
freq            1946
Name: ad_created, dtype: object

In [315]:

ad_created = (autos['ad_created'].str[:10].value_counts(normalize=True).sort_index())*100

In [316]:

ad_created

Out[316]:

2015-06-11    0.002
2015-08-10    0.002
2015-09-09    0.002
2015-11-10    0.002
2015-12-05    0.002
              ...  
2016-04-03    3.892
2016-04-04    3.688
2016-04-05    1.184
2016-04-06    0.326
2016-04-07    0.128
Name: ad_created, Length: 76, dtype: float64

In [317]:

ad_created.describe()

Out[317]:

count    76.000000
mean      1.315789
std       1.595835
min       0.002000
25%       0.002000
50%       0.014000
75%       3.227500
max       3.892000
Name: ad_created, dtype: float64

In [318]:

autos['ad_created'].unique().shape

Out[318]:

(76,)

The ad_created column contains 76 unique values with each unique value occupying about 1.3% by average of the total values which means about 650 ads were created on a single day by average. Almost half of the values are of the months of March and April of the year 2016, about 5-10 values are from months of 2015.

In [319]:

autos['last_seen'].str[:10].value_counts()

Out[319]:

2016-04-06    11050
2016-04-07     6546
2016-04-05     6214
2016-03-17     1396
2016-04-03     1268
2016-04-02     1245
2016-03-30     1242
2016-04-04     1231
2016-03-31     1192
2016-03-12     1191
2016-04-01     1155
2016-03-29     1117
2016-03-22     1079
2016-03-28     1043
2016-03-21     1037
2016-03-20     1035
2016-03-24      978
2016-03-25      960
2016-03-23      929
2016-03-26      848
2016-03-16      822
2016-03-27      801
2016-03-15      794
2016-03-19      787
2016-03-14      640
2016-03-11      626
2016-03-10      538
2016-03-09      493
2016-03-13      449
2016-03-08      380
2016-03-18      371
2016-03-07      268
2016-03-06      221
2016-03-05       54
Name: last_seen, dtype: int64

In [320]:

autos['last_seen'].str[:10].describe()

Out[320]:

count          50000
unique            34
top       2016-04-06
freq           11050
Name: last_seen, dtype: object

In [321]:

last_seen = (autos['last_seen'].str[:10].value_counts(normalize=True).sort_index())*100

In [322]:

last_seen.mean()

Out[322]:

2.941176470588235

In [323]:

autos['last_seen'].str[:10].unique().shape

Out[323]:

(34,)

Clearly, there are 34 unique date values in the last_seen column all of them are of the year 2016 with each date occupying almost 3%, i.e., 1500 sales by average of all the values. Most sales are of the April month and most of these are 3 consecutive days- 5th, 6th and 7th of April - almost 4-7 times the average sales number.

In [324]:

autos['registration_year'].describe()

Out[324]:

count    50000.000000
mean      2005.073280
std        105.712813
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The registration_year column appears to have a number of invalid entries. The maximum and minimum values of the column are 9999 and 1000 which is impossible. We will need to fix these errors.

In [325]:

autos['registration_year'].unique()

Out[325]:

array([2004, 1997, 2009, 2007, 2003, 2006, 1995, 1998, 2000, 2017, 2010,
       1999, 1982, 1990, 2015, 2014, 1996, 1992, 2005, 2002, 2012, 2011,
       2008, 1985, 2016, 1994, 1986, 2001, 2018, 2013, 1972, 1993, 1988,
       1989, 1967, 1973, 1956, 1976, 4500, 1987, 1991, 1983, 1960, 1969,
       1950, 1978, 1980, 1984, 1963, 1977, 1961, 1968, 1934, 1965, 1971,
       1966, 1979, 1981, 1970, 1974, 1910, 1975, 5000, 4100, 2019, 1959,
       9996, 9999, 6200, 1964, 1958, 1800, 1948, 1931, 1943, 9000, 1941,
       1962, 1927, 1937, 1929, 1000, 1957, 1952, 1111, 1955, 1939, 8888,
       1954, 1938, 2800, 5911, 1500, 1953, 1951, 4800, 1001])

We will remove the years earlier than 1910 (this is the earliest plausible year after the invention of the first car). We shall also remove the years later than 2016 since a car cannot be first registered after it was last seen on the website (we saw earlier that all the values in the last_seen column are of the year 2016). We will not remove the entire row since we may lose valuable data, so we will assign the corresponding value as NaN.

In [326]:

autos.loc[autos['registration_year'] < 1910,'registration_year'] = np.nan

In [327]:

autos.loc[autos['registration_year'] > 2016, 'registration_year'] = np.nan

In [328]:

autos['registration_year'].describe()

Out[328]:

count    48028.00000
mean      2002.80351
std          7.31085
min       1910.00000
25%       1999.00000
50%       2003.00000
75%       2008.00000
max       2016.00000
Name: registration_year, dtype: float64

In [329]:

(autos['registration_year'].value_counts(normalize=True).sort_index().tail(50))*100

Out[329]:

1967.0    0.056217
1968.0    0.054135
1969.0    0.039560
1970.0    0.093695
1971.0    0.056217
1972.0    0.072874
1973.0    0.054135
1974.0    0.049971
1975.0    0.039560
1976.0    0.056217
1977.0    0.045807
1978.0    0.097860
1979.0    0.072874
1980.0    0.201966
1981.0    0.064546
1982.0    0.089531
1983.0    0.110352
1984.0    0.110352
1985.0    0.218622
1986.0    0.158241
1987.0    0.156159
1988.0    0.295661
1989.0    0.376863
1990.0    0.822437
1991.0    0.741234
1992.0    0.814108
1993.0    0.926543
1994.0    1.374198
1995.0    2.733822
1996.0    3.006579
1997.0    4.222537
1998.0    5.107437
1999.0    6.246356
2000.0    6.983426
2001.0    5.627967
2002.0    5.274007
2003.0    5.677938
2004.0    5.698759
2005.0    6.277588
2006.0    5.638378
2007.0    4.797202
2008.0    4.645207
2009.0    4.368285
2010.0    3.325144
2011.0    3.402182
2012.0    2.754643
2013.0    1.678188
2014.0    1.386691
2015.0    0.830765
2016.0    2.740068
Name: registration_year, dtype: float64

In [330]:

((autos['registration_year'].value_counts(normalize=True))*100).mean()

Out[330]:

1.2820512820512817

Each year makes approximately 1.3% of the entire registration_year column, i.e., about 620 cars/year by average. About 20-30 cars/year were registered before the 80s and the registrations start to increase after that. The maximum registrations were recorded in the year 2005 - around 3000, then we again see a steady decrease untill the year 2016 where the numbers jump from about 400 (in the year 2015) to almost 1300.

Aggregating brand, price and odometer data¶

In [331]:

autos['brand'].describe()

Out[331]:

count          50000
unique            40
top       volkswagen
freq           10687
Name: brand, dtype: object

In [332]:

(autos['brand'].value_counts())

Out[332]:

volkswagen        10687
opel               5461
bmw                5429
mercedes_benz      4734
audi               4283
ford               3479
renault            2404
peugeot            1456
fiat               1308
seat                941
skoda               786
mazda               757
nissan              754
citroen             701
smart               701
toyota              617
sonstige_autos      546
hyundai             488
volvo               457
mini                424
mitsubishi          406
honda               399
kia                 356
alfa_romeo          329
porsche             294
suzuki              293
chevrolet           283
chrysler            181
dacia               129
daihatsu            128
jeep                110
subaru              109
land_rover           99
saab                 80
daewoo               79
trabant              78
jaguar               77
rover                69
lancia               57
lada                 31
Name: brand, dtype: int64

There are 40 unique brands in the brand column with volkswagen being the most frequent with more than 10000 sales. We will leave aggregation of those brands whose frequencies are less than 1% of the total values in the brand column.

In [333]:

brand_sel = (autos['brand'].value_counts(normalize=True))*100 > 1

In [334]:

brands = ((autos['brand'].value_counts(normalize=True))*100)[brand_sel]

In [335]:

brands

Out[335]:

volkswagen        21.374
opel              10.922
bmw               10.858
mercedes_benz      9.468
audi               8.566
ford               6.958
renault            4.808
peugeot            2.912
fiat               2.616
seat               1.882
skoda              1.572
mazda              1.514
nissan             1.508
citroen            1.402
smart              1.402
toyota             1.234
sonstige_autos     1.092
Name: brand, dtype: float64

In [336]:

sel_brands = brands.index

In [337]:

price_by_brand = {}

for b in sel_brands:
    mean_price = autos.loc[autos['brand'] == b, 'price'].mean()
    price_by_brand[b] = mean_price

In [338]:

price_by_brand

Out[338]:

{'volkswagen': 6652.138861386138,
 'opel': 4259.6706021251475,
 'bmw': 8990.066779374472,
 'mercedes_benz': 9202.106543822349,
 'audi': 10242.50607615447,
 'ford': 5351.12505663797,
 'renault': 3625.2169274537696,
 'peugeot': 4006.038684719536,
 'fiat': 4033.7395061728394,
 'seat': 5577.062682215744,
 'skoda': 6894.382889200561,
 'mazda': 5300.20955882353,
 'nissan': 6403.148362235067,
 'citroen': 4676.230188679246,
 'smart': 3760.979623824451,
 'toyota': 5597.41981981982,
 'sonstige_autos': 12567.545691906005}

Analysing the results shows that top three most expenseive brands are 'Audi' which has the highest average selling price of 10242 dollars, 2nd highest is 'Mercedes Benz' - 9202 dollars and the 3rd highest average price is of 'BMW' - nearly 8990 dollars.
Next three high-priced brands are 'skoda', 'volkswagen' and 'nissan' with average prices of 6894, 6652 and 6403 dollars respectively.
'sonstige_autos' which means 'other cars' in English is showing the highest values but these may include newer or obsolete brands so we can ignore these in the results.
And the cheapest brands are 'renault', 'smart' and 'peugeot' in increasing order of prices.

Next important parameter is the mileage of the cars. We have only the odometer_km column with us so we'd have to make do with it.

In [339]:

mileage = {}
for b in sel_brands:
    mean = autos.loc[autos['brand'] == b, 'odometer_km'].mean()
    mileage[b] = mean

In [340]:

mean_mileage_km = pd.Series(mileage)
mean_mileage_km = mean_mileage_km.sort_values(ascending = False)
mean_mileage_km

Out[340]:

bmw               132521.643028
mercedes_benz     130886.142797
audi              129643.941163
opel              129298.663248
volkswagen        128955.272761
renault           128223.793677
peugeot           127352.335165
mazda             125132.100396
ford              124131.934464
seat              122061.636557
citroen           119764.621969
nissan            118978.779841
fiat              117037.461774
toyota            115988.654781
skoda             110947.837150
smart             100756.062767
sonstige_autos     87188.644689
dtype: float64

In [341]:

mean_price_dollars = pd.Series(price_by_brand)
mean_price_dollars = mean_price_dollars.sort_values(ascending = False)
mean_price_dollars 

Out[341]:

sonstige_autos    12567.545692
audi              10242.506076
mercedes_benz      9202.106544
bmw                8990.066779
skoda              6894.382889
volkswagen         6652.138861
nissan             6403.148362
toyota             5597.419820
seat               5577.062682
ford               5351.125057
mazda              5300.209559
citroen            4676.230189
opel               4259.670602
fiat               4033.739506
peugeot            4006.038685
smart              3760.979624
renault            3625.216927
dtype: float64

In [342]:

df = pd.DataFrame(mean_price_dollars,columns = ['mean_price_dollars'])
df

Out[342]:

	mean_price_dollars
sonstige_autos	12567.545692
audi	10242.506076
mercedes_benz	9202.106544
bmw	8990.066779
skoda	6894.382889
volkswagen	6652.138861
nissan	6403.148362
toyota	5597.419820
seat	5577.062682
ford	5351.125057
mazda	5300.209559
citroen	4676.230189
opel	4259.670602
fiat	4033.739506
peugeot	4006.038685
smart	3760.979624
renault	3625.216927

In [343]:

df['mean_mileage_km'] = mean_mileage_km
df = df.iloc[1:,:]
df['mean_price_dollars'] = df['mean_price_dollars'].astype(int)
df['mean_mileage_km'] = df['mean_mileage_km'].astype(int)

<ipython-input-343-82216bbf9c11>:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['mean_price_dollars'] = df['mean_price_dollars'].astype(int)
<ipython-input-343-82216bbf9c11>:4: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['mean_mileage_km'] = df['mean_mileage_km'].astype(int)

In [344]:

df

Out[344]:

	mean_price_dollars	mean_mileage_km
audi	10242	129643
mercedes_benz	9202	130886
bmw	8990	132521
skoda	6894	110947
volkswagen	6652	128955
nissan	6403	118978
toyota	5597	115988
seat	5577	122061
ford	5351	124131
mazda	5300	125132
citroen	4676	119764
opel	4259	129298
fiat	4033	117037
peugeot	4006	127352
smart	3760	100756
renault	3625	128223

Exploring most common brand/model combinations¶

In [345]:

autos.loc[autos['brand']=='bmw','model'].describe().top

Out[345]:

'3er'

In [346]:

d = {}
for b in autos['brand'].unique():
    d[b] = autos.loc[autos['brand']==b,'model'].describe().top
d
    
    

Out[346]:

{'peugeot': '2_reihe',
 'bmw': '3er',
 'volkswagen': 'golf',
 'smart': 'fortwo',
 'ford': 'focus',
 'chrysler': 'andere',
 'seat': 'ibiza',
 'renault': 'twingo',
 'mercedes_benz': 'c_klasse',
 'audi': 'a4',
 'sonstige_autos': nan,
 'opel': 'corsa',
 'mazda': '3_reihe',
 'porsche': '911',
 'mini': 'cooper',
 'toyota': 'yaris',
 'dacia': 'sandero',
 'nissan': 'micra',
 'jeep': 'grand',
 'saab': 'andere',
 'volvo': 'v40',
 'mitsubishi': 'colt',
 'jaguar': 'andere',
 'fiat': 'punto',
 'skoda': 'octavia',
 'subaru': 'legacy',
 'kia': 'andere',
 'citroen': 'andere',
 'chevrolet': 'andere',
 'hyundai': 'i_reihe',
 'honda': 'civic',
 'daewoo': 'matiz',
 'suzuki': 'andere',
 'trabant': '601',
 'land_rover': 'freelander',
 'alfa_romeo': '156',
 'lada': 'niva',
 'rover': 'andere',
 'daihatsu': 'cuore',
 'lancia': 'ypsilon'}

In [347]:

df['most_common_model'] = pd.Series(d)

<ipython-input-347-68a91684a4fa>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['most_common_model'] = pd.Series(d)

In [348]:

df

Out[348]:

	mean_price_dollars	mean_mileage_km	most_common_model
audi	10242	129643	a4
mercedes_benz	9202	130886	c_klasse
bmw	8990	132521	3er
skoda	6894	110947	octavia
volkswagen	6652	128955	golf
nissan	6403	118978	micra
toyota	5597	115988	yaris
seat	5577	122061	ibiza
ford	5351	124131	focus
mazda	5300	125132	3_reihe
citroen	4676	119764	andere
opel	4259	129298	corsa
fiat	4033	117037	punto
peugeot	4006	127352	2_reihe
smart	3760	100756	fortwo
renault	3625	128223	twingo

Exploring prices of damaged and un-damaged cars¶

Let us first rename the entries in unrepaired_damage column which are in German

In [349]:

autos.loc[autos['unrepaired_damage']=='nein','unrepaired_damage'] = 'no'

In [350]:

autos.loc[autos['unrepaired_damage']=='ja','unrepaired_damage'] ='yes'

In [351]:

autos.loc[(autos['brand']=='bmw') & (autos['unrepaired_damage']=='no'), 'price'].describe()

Out[351]:

count     3823.000000
mean      9733.702851
std       8495.453418
min       1050.000000
25%       3500.000000
50%       7490.000000
75%      12999.000000
max      72900.000000
Name: price, dtype: float64

In [352]:

d1 = {}
for b in autos['brand'].unique():
    d1[b] = int(autos.loc[(autos['brand']==b) & (autos['unrepaired_damage']=='no'), 'price'].mean())

In [353]:

df['mean_price_without_damage'] = pd.Series(d1)

<ipython-input-353-adb9091db602>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['mean_price_without_damage'] = pd.Series(d1)

In [354]:

df

Out[354]:

	mean_price_dollars	mean_mileage_km	most_common_model	mean_price_without_damage
audi	10242	129643	a4	11374
mercedes_benz	9202	130886	c_klasse	10066
bmw	8990	132521	3er	9733
skoda	6894	110947	octavia	7291
volkswagen	6652	128955	golf	7324
nissan	6403	118978	micra	6922
toyota	5597	115988	yaris	5931
seat	5577	122061	ibiza	6019
ford	5351	124131	focus	5768
mazda	5300	125132	3_reihe	5857
citroen	4676	119764	andere	4792
opel	4259	129298	corsa	4592
fiat	4033	117037	punto	4356
peugeot	4006	127352	2_reihe	4286
smart	3760	100756	fortwo	3956
renault	3625	128223	twingo	3937

In [355]:

d2 = {}
for b in autos['brand'].unique():
    d2[b] = (autos.loc[(autos['brand']==b) & (autos['unrepaired_damage']=='yes'), 'price'].mean())

In [356]:

df['mean_price_with_damage'] = pd.Series(d2)

<ipython-input-356-0cc0768a2817>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['mean_price_with_damage'] = pd.Series(d2)

In [357]:

autos.loc[(autos['unrepaired_damage'] != 'yes') & (autos['unrepaired_damage']!='no'),'unrepaired_damage'] = 'unknown'

In [358]:

autos['unrepaired_damage'].value_counts(dropna=False)

Out[358]:

no         35232
unknown     9829
yes         4939
Name: unrepaired_damage, dtype: int64

In [359]:

df['mean_price_with_damage'] = df['mean_price_with_damage'].astype(int)

<ipython-input-359-26b4b8f8b3d4>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['mean_price_with_damage'] = df['mean_price_with_damage'].astype(int)

In [360]:

df['non_damaged_vs_damaged_price_ratio'] = df['mean_price_without_damage']/df['mean_price_with_damage']

<ipython-input-360-280aa0ccb949>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['non_damaged_vs_damaged_price_ratio'] = df['mean_price_without_damage']/df['mean_price_with_damage']

In [361]:

df

Out[361]:

	mean_price_dollars	mean_mileage_km	most_common_model	mean_price_without_damage	mean_price_with_damage	non_damaged_vs_damaged_price_ratio
audi	10242	129643	a4	11374	4769	2.384986
mercedes_benz	9202	130886	c_klasse	10066	5182	1.942493
bmw	8990	132521	3er	9733	5113	1.903579
skoda	6894	110947	octavia	7291	5269	1.383754
volkswagen	6652	128955	golf	7324	4006	1.828258
nissan	6403	118978	micra	6922	3878	1.784941
toyota	5597	115988	yaris	5931	4256	1.393562
seat	5577	122061	ibiza	6019	3079	1.954855
ford	5351	124131	focus	5768	3067	1.880665
mazda	5300	125132	3_reihe	5857	2475	2.366465
citroen	4676	119764	andere	4792	3323	1.442070
opel	4259	129298	corsa	4592	2762	1.662563
fiat	4033	117037	punto	4356	2826	1.541401
peugeot	4006	127352	2_reihe	4286	2716	1.578056
smart	3760	100756	fortwo	3956	2027	1.951653
renault	3625	128223	twingo	3937	2563	1.536091

In [362]:

df['non_damaged_vs_damaged_price_ratio'].mean()

Out[362]:

1.7834620365276217

Conclusion¶

Top three most expensive brands are 'Audi', 'Mercedes Benz' and 'BMW'. 'Ford' ,'Nissan', 'Volkswagen' and 'Toyota' are relatively less expensive and among the cheapest ones are 'Renault', 'Smart' and 'Fiat'.
There seems to be no direct correlation between mileage and average prices.
Average price of cars without damage is about 1.78 times the average price of non-damaged cars which is not surprising. For most cases this ratio is even more than 2.