INTRODUCTION¶

Using a dataset of eBay Kleinanzeigen (the classified section of the German eBay website) consisting of 50K data points available on kaggle, we want to (1) CLEAN THE DATA & (2) ANALYZE THE DATA.

This dataset contains the following information:

Variable	Description
dateCrawled	When the ad was 1st crawled
name	Name of car
seller	Private or Dealer as seller
offerType	Type of listing
price	Price on the car ad
abtest	Whether the listing included A/B test
vehicleType	Type of vehicle
yearOfRegistration	Year when car was 1st registered
gearbox	The transmission type
powerPS	Horsepower of the car
model	Model of the car
kilometer	# of kilometers driven
monthOfRegistration	Month in which car was 1st registered
fuelType	Type of fuel the car uses
brand	Brand of the car
notRepairedDamage	If the repairs were done on damaged car
dateCreated	Date when eBay listing was created
nrOfPictures	# of pics of car in ad
postalCode	Postal code where the car is at
lastSeenOnline	When the crawler saw this ad last online

In [1]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

autos = pd.read_csv("autos.csv", encoding = "Latin-1")
autos.head()

Out[1]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

In [2]:

autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

There are several things that need to change:

(1) need to lower case everything

(2) deal with missing values in "vehicleType", "gearbox", "model", "fuelType", "notRepairedDamage"

(3) turn things into snake_case

(4) turn "price" and "odometer" into float

Clean Column Name¶

In [3]:

autos.rename(str.lower, axis = 1, inplace = True)
autos.rename({"yearofregistration":"registration_year", 
              "monthofregistration":"registration_month",
              "notrepaireddamage":"unrepaired_damage",
              "datecreated":"ad_created",
              "datecrawled":"date_crawled",
              "offertype":"offer_type",
              "vehicletype":"vehicle_type",
              "gearbox":"gear_box",
              "fueltype":"fuel_type",
              "nrofpictures":"num_pictures",
              "postalcode":"postal_code",
              "lastseen":"last_seen"}, axis = 1, inplace = True)

autos.head(5)

Out[3]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gear_box	powerps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

Initial Exploration + Cleaning¶

In [4]:

autos.describe(include = "all")

Out[4]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gear_box	powerps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	num_pictures	postal_code	last_seen
count	50000	50000	50000	50000	50000	50000	44905	50000.000000	47320	50000.000000	47242	50000	50000.000000	45518	50000	40171	50000	50000.0	50000.000000	50000
unique	48213	38754	2	2	2357	2	8	NaN	2	NaN	245	13	NaN	7	40	2	76	NaN	NaN	39481
top	2016-03-16 21:50:53	Ford_Fiesta	privat	Angebot	$0	test	limousine	NaN	manuell	NaN	golf	150,000km	NaN	benzin	volkswagen	nein	2016-04-03 00:00:00	NaN	NaN	2016-04-07 06:17:27
freq	3	78	49999	49999	1421	25756	12859	NaN	36993	NaN	4024	32424	NaN	30107	10687	35232	1946	NaN	NaN	8
mean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2005.073280	NaN	116.355920	NaN	NaN	5.723360	NaN	NaN	NaN	NaN	0.0	50813.627300	NaN
std	NaN	NaN	NaN	NaN	NaN	NaN	NaN	105.712813	NaN	209.216627	NaN	NaN	3.711984	NaN	NaN	NaN	NaN	0.0	25779.747957	NaN
min	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1000.000000	NaN	0.000000	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	0.0	1067.000000	NaN
25%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1999.000000	NaN	70.000000	NaN	NaN	3.000000	NaN	NaN	NaN	NaN	0.0	30451.000000	NaN
50%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2003.000000	NaN	105.000000	NaN	NaN	6.000000	NaN	NaN	NaN	NaN	0.0	49577.000000	NaN
75%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2008.000000	NaN	150.000000	NaN	NaN	9.000000	NaN	NaN	NaN	NaN	0.0	71540.000000	NaN
max	NaN	NaN	NaN	NaN	NaN	NaN	NaN	9999.000000	NaN	17700.000000	NaN	NaN	12.000000	NaN	NaN	NaN	NaN	0.0	99998.000000	NaN

In [5]:

print(autos["offer_type"].value_counts())
print("\n")
print(autos['seller'].value_counts())
print("\n")
print(autos["num_pictures"].value_counts())
print("\n")
print(autos['gear_box'].value_counts())
print("\n")
print(autos['abtest'].value_counts())

Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64


privat        49999
gewerblich        1
Name: seller, dtype: int64


0    50000
Name: num_pictures, dtype: int64


manuell      36993
automatik    10327
Name: gear_box, dtype: int64


test       25756
control    24244
Name: abtest, dtype: int64

Remove "seller", "num_pictures" and "seller" from dataframe
"powerps" has an unrealistic value (> 1000 HP). Need to further analyze
Similarly, 'price' also need to be examined as there is no way to have a $2 million+ vehicle
"registration_year" has an unrealistic range, particularly at the extremes. Need to remove them.
Odometer + price needs to return to floats

In [6]:

# Fix price

autos['price'].value_counts()
# There's "$" and ",". Plus it's an object and not an integer/float 

autos['price'] = autos['price'].str.replace("$","").str.replace(",","")

In [7]:

autos['price'] = autos['price'].astype(float)
autos['price'].describe()

Out[7]:

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [8]:

# fix odometer 
autos['odometer'].value_counts()
autos['odometer'].unique()
# Need to deal with "," and "km" + turn it to float/int
autos['odometer'] = autos['odometer'].str.replace(",", "")
autos['odometer'] = autos['odometer'].str.replace("km", "")
autos['odometer'] = autos['odometer'].astype(int)

In [9]:

autos.rename({"odometer": "odometer_km"}, axis = 1, inplace = True)

In [10]:

autos.columns
# removing unnecessary columns 

Out[10]:

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gear_box', 'powerps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

In [11]:

cols = ['date_crawled', 'name', 'price', 
        'abtest', 'vehicle_type', 'registration_year',
        'gear_box', 'powerps', 'model',
        'odometer_km', 'registration_month', 'fuel_type',
        'brand', 'unrepaired_damage', 'ad_created',
        'postal_code', 'last_seen']
autos = autos[cols]

In [12]:

autos.head()
autos.describe(include = 'all')

Out[12]:

	date_crawled	name	price	abtest	vehicle_type	registration_year	gear_box	powerps	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
count	50000	50000	5.000000e+04	50000	44905	50000.000000	47320	50000.000000	47242	50000.000000	50000.000000	45518	50000	40171	50000	50000.000000	50000
unique	48213	38754	NaN	2	8	NaN	2	NaN	245	NaN	NaN	7	40	2	76	NaN	39481
top	2016-03-16 21:50:53	Ford_Fiesta	NaN	test	limousine	NaN	manuell	NaN	golf	NaN	NaN	benzin	volkswagen	nein	2016-04-03 00:00:00	NaN	2016-04-07 06:17:27
freq	3	78	NaN	25756	12859	NaN	36993	NaN	4024	NaN	NaN	30107	10687	35232	1946	NaN	8
mean	NaN	NaN	9.840044e+03	NaN	NaN	2005.073280	NaN	116.355920	NaN	125732.700000	5.723360	NaN	NaN	NaN	NaN	50813.627300	NaN
std	NaN	NaN	4.811044e+05	NaN	NaN	105.712813	NaN	209.216627	NaN	40042.211706	3.711984	NaN	NaN	NaN	NaN	25779.747957	NaN
min	NaN	NaN	0.000000e+00	NaN	NaN	1000.000000	NaN	0.000000	NaN	5000.000000	0.000000	NaN	NaN	NaN	NaN	1067.000000	NaN
25%	NaN	NaN	1.100000e+03	NaN	NaN	1999.000000	NaN	70.000000	NaN	125000.000000	3.000000	NaN	NaN	NaN	NaN	30451.000000	NaN
50%	NaN	NaN	2.950000e+03	NaN	NaN	2003.000000	NaN	105.000000	NaN	150000.000000	6.000000	NaN	NaN	NaN	NaN	49577.000000	NaN
75%	NaN	NaN	7.200000e+03	NaN	NaN	2008.000000	NaN	150.000000	NaN	150000.000000	9.000000	NaN	NaN	NaN	NaN	71540.000000	NaN
max	NaN	NaN	1.000000e+08	NaN	NaN	9999.000000	NaN	17700.000000	NaN	150000.000000	12.000000	NaN	NaN	NaN	NaN	99998.000000	NaN

Examining powerps¶

There is no way that horse power can be 17700. So we need to find out what's up.

In [13]:

autos['powerps'].value_counts().sort_index(ascending = False).head(50)

Out[13]:

17700    1
16312    1
16011    1
15016    1
15001    1
14009    1
9011     1
8404     1
7511     1
6512     1
6226     1
6045     1
5867     1
4400     1
3750     1
3500     1
2729     1
2018     1
1998     2
1988     1
1986     1
1800     3
1796     1
1793     1
1781     1
1780     1
1779     1
1771     1
1753     1
1704     1
1405     1
1401     1
1400     3
1398     1
1367     1
1300     1
1202     1
1103     1
1090     1
1082     1
1056     1
1055     1
1016     1
1011     1
1003     1
1001     3
1000     1
999      1
952      1
950      1
Name: powerps, dtype: int64

In [14]:

autos['powerps'].value_counts().sort_index(ascending = True).head(50)

Out[14]:

0     5500
1        5
2        2
3        3
4        4
5       13
6        3
8        2
9        1
10       3
11       4
12       1
14       1
15       5
16       1
18       6
19       3
20       4
21       1
23       4
24       1
25       2
26      39
27       5
29       5
30       3
32       2
33       9
34      27
35       2
37       7
38       2
39      18
40      42
41      57
42       7
43      20
44      52
45     397
46       9
47       8
48      16
49       2
50     604
51      14
52      40
53      28
54     759
55     275
56      40
Name: powerps, dtype: int64

As the Bugatti Cheron was the highest HP production car at approx. 1600, really anything above that is not valid. So we will use this as the top-end cap. Conversely the lowest HP production car is the Renault Twitzy at about 17 hp. Thus we will use this as the bottom-end cap.

In [15]:

autos =  autos[autos['powerps'].between(16, 1600)]
autos.describe()

Out[15]:

	price	registration_year	powerps	odometer_km	registration_month	postal_code
count	4.442000e+04	44420.000000	44420.000000	44420.000000	44420.000000	44420.000000
mean	1.045555e+04	2004.047726	126.743134	125794.236830	5.930752	51368.900698
std	5.076550e+05	53.063904	67.302685	39267.227509	3.602153	25754.846000
min	0.000000e+00	1927.000000	16.000000	5000.000000	0.000000	1067.000000
25%	1.299000e+03	1999.000000	80.000000	100000.000000	3.000000	31028.000000
50%	3.299000e+03	2004.000000	116.000000	150000.000000	6.000000	50354.000000
75%	7.800000e+03	2008.000000	150.000000	150000.000000	9.000000	72250.000000
max	1.000000e+08	9000.000000	1405.000000	150000.000000	12.000000	99998.000000

In [16]:

autos['powerps'].value_counts().sort_index(ascending = False)

Out[16]:

1405      1
1401      1
1400      3
1398      1
1367      1
1300      1
1202      1
1103      1
1090      1
1082      1
1056      1
1055      1
1016      1
1011      1
1003      1
1001      3
1000      1
999       1
952       1
950       1
923       1
900       3
850       1
800       1
799       1
754       1
740       1
696       1
682       1
678       1
       ... 
50      604
49        2
48       16
47        8
46        9
45      397
44       52
43       20
42        7
41       57
40       42
39       18
38        2
37        7
35        2
34       27
33        9
32        2
30        3
29        5
27        5
26       39
25        2
24        1
23        4
21        1
20        4
19        3
18        6
16        1
Name: powerps, Length: 404, dtype: int64

Examining Price¶

In [17]:

autos['price'].shape[0]

Out[17]:

In [18]:

autos['price'].describe()

Out[18]:

count    4.442000e+04
mean     1.045555e+04
std      5.076550e+05
min      0.000000e+00
25%      1.299000e+03
50%      3.299000e+03
75%      7.800000e+03
max      1.000000e+08
Name: price, dtype: float64

In [19]:

autos['price'].value_counts().sort_index(ascending = False).head(10)

Out[19]:

99999999.0    1
27322222.0    1
12345678.0    3
11111111.0    1
10000000.0    1
3890000.0     1
1234566.0     1
999999.0      2
999990.0      1
350000.0      1
Name: price, dtype: int64

In [20]:

autos['price'].value_counts().sort_index(ascending = False).tail(10)

Out[20]:

30.0      1
25.0      1
20.0      1
13.0      1
11.0      1
10.0      2
5.0       1
3.0       1
1.0      89
0.0     907
Name: price, dtype: int64

In [21]:

print(len(autos[autos['price'] < 100]))
print(len(autos[autos['price'] > 500000]))
print(len(autos))

1072
12
44420

Of the 50K datapoints in this dataset, 1762 datapoints list price of car as less than 100 dollars whilst 11 list the price of the car as greater than 500,000 dollars. Considering both the unlikelihood of purchasing a car within these extremes and its make up of the total data point to be less than 5% (i.e. can remove without affecting the integrity of the data), we will use this as our liimts.

In [22]:

autos =  autos[autos['price'].between(100, 500000)]
autos['price'].value_counts().sort_index()

Out[22]:

100.0        77
110.0         1
111.0         1
115.0         1
117.0         1
120.0        18
125.0         4
130.0         8
135.0         1
139.0         1
140.0         7
145.0         2
149.0         6
150.0       148
156.0         1
160.0         7
170.0         2
173.0         1
175.0         9
179.0         1
180.0        24
185.0         1
188.0         1
190.0        13
193.0         1
195.0         2
199.0        27
200.0       192
210.0         1
215.0         1
           ... 
115000.0      1
115991.0      1
116000.0      1
119500.0      1
119900.0      1
120000.0      2
128000.0      1
129000.0      1
135000.0      1
137999.0      1
139997.0      1
145000.0      1
151990.0      1
155000.0      1
163500.0      1
163991.0      1
169000.0      1
169999.0      1
175000.0      1
180000.0      1
190000.0      1
197000.0      1
198000.0      1
220000.0      1
259000.0      1
265000.0      1
295000.0      1
299000.0      1
345000.0      1
350000.0      1
Name: price, Length: 2260, dtype: int64

Examining Odometer_km¶

In [23]:

autos['odometer_km'].unique()

Out[23]:

array([150000,  70000,  50000,  80000,  10000,  30000, 125000,  90000,
        20000,  60000, 100000,   5000,  40000])

In [24]:

autos['odometer_km'].describe()

Out[24]:

count     43336.000000
mean     125625.230755
std       39265.241462
min        5000.000000
25%      100000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [25]:

autos['odometer_km'].value_counts().sort_index(ascending = False)

Out[25]:

150000    27696
125000     4657
100000     1922
90000      1594
80000      1325
70000      1145
60000      1101
50000       964
40000       772
30000       731
20000       698
10000       215
5000        516
Name: odometer_km, dtype: int64

Considering the nature of the items as used cars, which are likely to have a very high mileage, it would be expected that there would be a large number of vehicles with extraordinarily high mileage and few vehicles with low mileage. As such, we will not be modifying the vehicle in terms of odometer.

Exploring Date Columns¶

In [26]:

# date_crawled + ad_created + last_seen

autos[['date_crawled','ad_created','last_seen']].head()
# the format is %YYYY-%mm-%dd (essentially 10 strings)

Out[26]:

	date_crawled	ad_created	last_seen
0	2016-03-26 17:47:46	2016-03-26 00:00:00	2016-04-06 06:45:54
1	2016-04-04 13:38:56	2016-04-04 00:00:00	2016-04-06 14:45:08
2	2016-03-26 18:57:24	2016-03-26 00:00:00	2016-04-06 20:15:37
3	2016-03-12 16:58:10	2016-03-12 00:00:00	2016-03-15 03:16:28
5	2016-03-21 13:47:45	2016-03-21 00:00:00	2016-04-06 09:45:21

In [27]:

autos['date_crawled'] = autos['date_crawled'].str[:10]

In [28]:

autos['ad_created'] = autos['ad_created'].str[:10]
autos['last_seen'] = autos['last_seen'].str[:10]

In [29]:

# Counting date_crawled

autos['date_crawled'].value_counts(normalize = True, dropna =  False).sort_index(ascending = False).round(2)

Out[29]:

2016-04-07    0.00
2016-04-06    0.00
2016-04-05    0.01
2016-04-04    0.04
2016-04-03    0.04
2016-04-02    0.04
2016-04-01    0.03
2016-03-31    0.03
2016-03-30    0.03
2016-03-29    0.03
2016-03-28    0.04
2016-03-27    0.03
2016-03-26    0.03
2016-03-25    0.03
2016-03-24    0.03
2016-03-23    0.03
2016-03-22    0.03
2016-03-21    0.04
2016-03-20    0.04
2016-03-19    0.03
2016-03-18    0.01
2016-03-17    0.03
2016-03-16    0.03
2016-03-15    0.03
2016-03-14    0.04
2016-03-13    0.02
2016-03-12    0.04
2016-03-11    0.03
2016-03-10    0.03
2016-03-09    0.03
2016-03-08    0.03
2016-03-07    0.04
2016-03-06    0.01
2016-03-05    0.03
Name: date_crawled, dtype: float64

Examining the distribution of date_crawled, which ranged from March 5th, 2016 - April 7th, 2016, there is a rounghly equivalent distribution at approx 3% across the dates in this range when the crawler accessed the ads.

In [30]:

# Counting date_crawled

autos['last_seen'].value_counts(normalize = True, dropna =  False).sort_index(ascending = False).round(2)

Out[30]:

2016-04-07    0.13
2016-04-06    0.23
2016-04-05    0.13
2016-04-04    0.02
2016-04-03    0.02
2016-04-02    0.03
2016-04-01    0.02
2016-03-31    0.02
2016-03-30    0.02
2016-03-29    0.02
2016-03-28    0.02
2016-03-27    0.02
2016-03-26    0.02
2016-03-25    0.02
2016-03-24    0.02
2016-03-23    0.02
2016-03-22    0.02
2016-03-21    0.02
2016-03-20    0.02
2016-03-19    0.02
2016-03-18    0.01
2016-03-17    0.03
2016-03-16    0.02
2016-03-15    0.02
2016-03-14    0.01
2016-03-13    0.01
2016-03-12    0.02
2016-03-11    0.01
2016-03-10    0.01
2016-03-09    0.01
2016-03-08    0.01
2016-03-07    0.01
2016-03-06    0.00
2016-03-05    0.00
Name: last_seen, dtype: float64

In terms of "last_seen" that looked within the range of March 5th, 2016 to April 7th, 2016, there was a noted uptake in when the majority of the ads were last seen (i.e. near April 7th), which could be explained by: (1) the corresponding to the collection period when the dataset was being compiled or (2) this being the prime period to which car sales occur since it's the end of the 1st quarter.

In [31]:

autos['ad_created'].value_counts(normalize = True, dropna =  False).sort_index(ascending = False).round(2)

Out[31]:

2016-04-07    0.00
2016-04-06    0.00
2016-04-05    0.01
2016-04-04    0.04
2016-04-03    0.04
2016-04-02    0.04
2016-04-01    0.03
2016-03-31    0.03
2016-03-30    0.03
2016-03-29    0.03
2016-03-28    0.04
2016-03-27    0.03
2016-03-26    0.03
2016-03-25    0.03
2016-03-24    0.03
2016-03-23    0.03
2016-03-22    0.03
2016-03-21    0.04
2016-03-20    0.04
2016-03-19    0.03
2016-03-18    0.01
2016-03-17    0.03
2016-03-16    0.03
2016-03-15    0.03
2016-03-14    0.04
2016-03-13    0.02
2016-03-12    0.04
2016-03-11    0.03
2016-03-10    0.03
2016-03-09    0.03
              ... 
2016-02-23    0.00
2016-02-22    0.00
2016-02-21    0.00
2016-02-20    0.00
2016-02-19    0.00
2016-02-18    0.00
2016-02-17    0.00
2016-02-16    0.00
2016-02-14    0.00
2016-02-12    0.00
2016-02-09    0.00
2016-02-08    0.00
2016-02-07    0.00
2016-02-05    0.00
2016-02-02    0.00
2016-02-01    0.00
2016-01-29    0.00
2016-01-27    0.00
2016-01-22    0.00
2016-01-16    0.00
2016-01-14    0.00
2016-01-13    0.00
2016-01-10    0.00
2016-01-07    0.00
2016-01-03    0.00
2015-12-30    0.00
2015-12-05    0.00
2015-11-10    0.00
2015-09-09    0.00
2015-08-10    0.00
Name: ad_created, Length: 74, dtype: float64

As for "ad_created", looking at the range from June 11th, 2015 to April 7th, 2016, the majority of the ads were created in March 2016.

Dealing with Incorrect Registration Year¶

In [32]:

autos['registration_year'].value_counts().sort_index()

Out[32]:

1927       1
1931       1
1937       3
1941       2
1950       1
1951       2
1954       2
1955       1
1956       3
1957       2
1958       2
1959       6
1960       8
1961       4
1962       3
1963       7
1964      10
1965      13
1966      13
1967      18
1968      23
1969      14
1970      19
1971      20
1972      29
1973      16
1974      20
1975      14
1976      16
1977      19
        ... 
1994     550
1995     999
1996    1187
1997    1721
1998    2084
1999    2555
2000    2530
2001    2405
2002    2266
2003    2484
2004    2487
2005    2579
2006    2533
2007    2145
2008    2102
2009    1992
2010    1540
2011    1585
2012    1270
2013     784
2014     642
2015     364
2016     864
2017    1034
2018     349
2019       2
2800       1
5000       1
5911       1
9000       1
Name: registration_year, Length: 76, dtype: int64

Seeing as the earliest commercial car was made in 1917, we can just use this as the set minimum. Furthermore, as this dataset only took in data up until April 2017, it doesn't make sense to have registration years after 2017. So the range in registration year we are lookin for is b/t 1917 - 2017.

In [33]:

min_bool = autos['registration_year'] > 1917
max_bool = autos['registration_year'] < 2018
combined = min_bool & max_bool

autos = autos[combined]
autos.describe()

Out[33]:

	price	registration_year	powerps	odometer_km	registration_month	postal_code
count	42981.000000	42981.000000	42981.000000	42981.000000	42981.000000	42981.000000
mean	6264.865964	2003.512180	127.107001	125600.381564	5.967776	51538.376492
std	9231.082718	7.060427	67.377291	39274.090885	3.583941	25732.257434
min	100.000000	1927.000000	16.000000	5000.000000	0.000000	1067.000000
25%	1399.000000	1999.000000	80.000000	100000.000000	3.000000	31177.000000
50%	3495.000000	2004.000000	116.000000	150000.000000	6.000000	50737.000000
75%	7950.000000	2008.000000	150.000000	150000.000000	9.000000	72406.000000
max	350000.000000	2017.000000	1405.000000	150000.000000	12.000000	99998.000000

Exploring Price by Brand¶

In [34]:

autos['brand'].unique()

Out[34]:

array(['peugeot', 'bmw', 'volkswagen', 'smart', 'chrysler', 'renault',
       'audi', 'sonstige_autos', 'opel', 'mazda', 'porsche', 'mini',
       'mercedes_benz', 'ford', 'seat', 'toyota', 'dacia', 'nissan',
       'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda',
       'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda',
       'daewoo', 'suzuki', 'land_rover', 'alfa_romeo', 'rover',
       'daihatsu', 'trabant', 'lancia', 'lada'], dtype=object)

In [35]:

top_10_brands = autos['brand'].value_counts()[:10]

In [36]:

bottom_10_brands = autos['brand'].value_counts()[-10:]

In [37]:

top_10_brands

Out[37]:

volkswagen       9194
bmw              4862
opel             4567
mercedes_benz    4126
audi             3842
ford             2936
renault          1929
peugeot          1264
fiat             1062
seat              809
Name: brand, dtype: int64

In [38]:

bottom_10_brands

Out[38]:

daihatsu      96
subaru        92
land_rover    91
saab          73
jaguar        64
daewoo        55
rover         52
lancia        46
trabant       41
lada          21
Name: brand, dtype: int64

In [39]:

print(10157+5201+5155+4586+4118+3331+2272+1413+1232+888)
print(len(autos))
print((10157+5201+5155+4586+4118+3331+2272+1413+1232+888) / len(autos))

38353
42981
0.8923245154835858

Looking at the data, it seems that the majority of the used cars are German brands (i.e. Volkswagen, BMW, Audi, Mercedes Benz) whilst the major foreign brand being Ford. Nevertheless, the most common used car brands are domestic. However, as it pertains to the 10 least popular brands, they are mainly foreign brands such as Subaru, Land Rover, Daewoo, etc.

The decision to take the Top 10 brands is that these make up 89.3% of the entire dataset, which is a pretty good representation of the marketplace.

In [40]:

volkswagen_bool = autos['brand'] == "volkswagen"
bmw_bool = autos['brand'] == 'bmw'
opel_bool = autos['brand'] == "opel"
mercedes_bool = autos['brand'] == 'mercedes_benz'
audi_bool = autos['brand'] == "audi"
ford_bool = autos['brand'] =='ford'
renault_bool = autos['brand'] == "renault"
peugeot_bool = autos['brand'] == 'peugeot'
fiat_bool = autos['brand'] == 'fiat'
seat_bool = autos['brand'] == 'seat'

In [41]:

volkswagen_only = autos[volkswagen_bool]
bmw_only = autos[bmw_bool]
opel_only = autos[opel_bool]
mercedes_only = autos[mercedes_bool]
audi_only = autos[audi_bool]
ford_only = autos[ford_bool]
renault_only = autos[renault_bool]
peugeot_only = autos[peugeot_bool]
fiat_only = autos[fiat_bool]
seat_only = autos[seat_bool]

In [42]:

top_brands_mean_price = {}

top_10 = autos['brand'].value_counts(ascending = False)[:10].index

for maker in top_10:
    selected_rows = autos[autos['brand'] == maker]
    mean_price = selected_rows['price'].mean()
    mean_price = mean_price.round(2)
    top_brands_mean_price[maker] = mean_price

top_brands_mean_price

Out[42]:

{'audi': 9659.37,
 'bmw': 8577.61,
 'fiat': 2968.12,
 'ford': 3932.04,
 'mercedes_benz': 8936.97,
 'opel': 3151.16,
 'peugeot': 3249.6,
 'renault': 2635.15,
 'seat': 4675.29,
 'volkswagen': 5660.48}

The avg. price for each of the top 10 car brands are:

Brand	Avg. Price (in dollars)
Audi	9659.37
BMW	8577.61
Fiat	2968.12
Ford	3933.09
Mercedes Benz	8936.97
Opel	3151.16
Peugeot	3249.60
Renault	2635.15
Seat	4675.29
Volkswagen	5660.48

Looking at the data, it seems that the more luxurious brands (i.e. Audi, BMW, Mercedes Benz) have the highest price point whilst the foreign brands (i.e. Ford, Peugeot, Fiat) are on the lower end of the price point spectrum. As for the most populous brand (Volkswagen), its pricepoint is moderate.

Storing Aggregate Data¶

Working with the top 10 brands, we want to see if mean mileage (i.e. odometer_km) and mean price are related to one another.

In [43]:

top_brands_mean_mileage = {}

top_10 = autos['brand'].value_counts(ascending = False)[:10].index

for maker in top_10:
    selected_rows = autos[autos['brand'] == maker]
    mean_km= selected_rows['odometer_km'].mean().round(2)
    top_brands_mean_mileage[maker] = mean_km

top_brands_mean_mileage

Out[43]:

{'audi': 128854.76,
 'bmw': 132670.71,
 'fiat': 116864.41,
 'ford': 124105.93,
 'mercedes_benz': 130822.83,
 'opel': 129162.47,
 'peugeot': 126475.47,
 'renault': 127392.43,
 'seat': 121161.93,
 'volkswagen': 128454.97}

Looking at the average mileage of the Top 10 brands of cars, it seems that the range lies b/t 116000km to 133000km. There luxurious brands (BMW and Mercedes) appear to have a higher mileage as compared to the foreign brands.

In [44]:

top_10_price = pd.Series(top_brands_mean_price)
top_10_mileage = pd.Series(top_brands_mean_mileage)

top_10_makers = pd.DataFrame(top_10_price, columns = ['mean_price'])
top_10_makers['mean_mileage'] = top_10_mileage

In [45]:

top_10_makers

Out[45]:

	mean_price	mean_mileage
audi	9659.37	128854.76
bmw	8577.61	132670.71
fiat	2968.12	116864.41
ford	3932.04	124105.93
mercedes_benz	8936.97	130822.83
opel	3151.16	129162.47
peugeot	3249.60	126475.47
renault	2635.15	127392.43
seat	4675.29	121161.93
volkswagen	5660.48	128454.97

It appears as though mileage doesn't seem to be as strongly related to price despite the more luxurious and priciest brands having a higher mean mileage as compared to the more economic and cheapest brands with a lower mean mileage.

EXTRA ANALYSIS: Common Brand + Model¶

In [46]:

col = ['brand', 'model']
Brand_Model = autos[col]
Brand_Model = Brand_Model.dropna(axis = 0)

In [47]:

Brand_Model['brand'].value_counts().head()

Out[47]:

volkswagen       8878
bmw              4685
opel             4411
mercedes_benz    4031
audi             3747
Name: brand, dtype: int64

In [48]:

Brand_Model['model'].value_counts().head()

Out[48]:

golf      3536
andere    2997
3er       2467
polo      1455
corsa     1426
Name: model, dtype: int64

In [49]:

top_model_by_brand = {}

brands = Brand_Model['brand'].unique()

for b in brands:
    selected_rows = Brand_Model[Brand_Model['brand'] == b]
    sorted_rows = selected_rows.mode()
    top_row = selected_rows.iloc[0]
    top_model = top_row['model']
    top_model_by_brand[b] = top_model

top_model_by_brand

Out[49]:

{'alfa_romeo': '156',
 'audi': 'a3',
 'bmw': '7er',
 'chevrolet': 'andere',
 'chrysler': 'voyager',
 'citroen': 'andere',
 'dacia': 'sandero',
 'daewoo': 'kalos',
 'daihatsu': 'terios',
 'fiat': 'punto',
 'ford': 'mondeo',
 'honda': 'civic',
 'hyundai': 'i_reihe',
 'jaguar': 'andere',
 'jeep': 'wrangler',
 'kia': 'andere',
 'lada': 'niva',
 'lancia': 'lybra',
 'land_rover': 'freelander',
 'mazda': 'andere',
 'mercedes_benz': 'e_klasse',
 'mini': 'cooper',
 'mitsubishi': 'colt',
 'nissan': 'primera',
 'opel': 'vectra',
 'peugeot': 'andere',
 'porsche': '911',
 'renault': 'megane',
 'rover': 'andere',
 'saab': 'andere',
 'seat': 'altea',
 'skoda': 'octavia',
 'smart': 'fortwo',
 'subaru': 'andere',
 'suzuki': 'grand',
 'toyota': 'andere',
 'trabant': '601',
 'volkswagen': 'golf',
 'volvo': 'andere'}

EXTRA ANALYSIS: Split Odometer x Price¶

In [50]:

autos['odometer_km'].value_counts().sort_index()

Out[50]:

5000        505
10000       213
20000       697
30000       726
40000       769
50000       962
60000      1096
70000      1138
80000      1313
90000      1579
100000     1906
125000     4621
150000    27456
Name: odometer_km, dtype: int64

In [51]:

len(autos)

Out[51]:

Normally, the best approach should be a split in terms of equally distributed groups. However, considering the aggregation of vehicles with mileage greater than or equal to 150K km, we will break it down to the following distribution:

0 - 30000
30001 - 60000
60001 - 90000
90001 - 125000
125000 +

In [52]:

odometer_km_splits = []

for km in autos['odometer_km']:
    if km < 30001:
        odometer_km_splits.append("0 - 30,000")
    elif km < 60001: 
        odometer_km_splits.append("30,001 - 60,000")
    elif km < 90001: 
        odometer_km_splits.append("60,001 - 90,000")
    elif km < 125001: 
        odometer_km_splits.append("90,001 - 125,000")
    else:
        odometer_km_splits.append("125,000+")

In [53]:

autos["odometer_km_splits"] = odometer_km_splits

In [54]:

autos['odometer_km_splits'].value_counts()

Out[54]:

125,000+            27456
90,001 - 125,000     6527
60,001 - 90,000      4030
30,001 - 60,000      2827
0 - 30,000           2141
Name: odometer_km_splits, dtype: int64

In [55]:

price_by_mileage = {}

odometer_splits = autos["odometer_km_splits"].unique()

for km in odometer_splits:
    selected_rows = autos[autos["odometer_km_splits"] == km]
    avg_price = selected_rows['price'].mean().round(2)
    price_by_mileage[km] =  avg_price

price_by_mileage

Out[55]:

{'0 - 30,000': 17004.72,
 '125,000+': 3953.13,
 '30,001 - 60,000': 13819.04,
 '60,001 - 90,000': 9826.49,
 '90,001 - 125,000': 6995.37}

Looking at the breakdown of the odometer, there seems to be a negative relationship between mean price points of a vehicle and mileage where the greater the mileage = the lower the price of the vehicle.

Extra Analysis: Price x Damage¶

In [58]:

col = ['price', 'unrepaired_damage']
price_by_damage = autos[col].dropna(axis = 0)
price_by_damage

Out[58]:

	price	unrepaired_damage
0	5000.0	nein
1	8500.0	nein
2	8990.0	nein
3	4350.0	nein
7	1990.0	nein
9	590.0	nein
10	999.0	nein
12	5299.0	nein
13	1350.0	nein
15	18900.0	nein
16	350.0	nein
17	5500.0	nein
19	4150.0	nein
21	41500.0	nein
22	25450.0	nein
23	7999.0	nein
24	48500.0	nein
26	777.0	nein
28	5250.0	ja
29	4999.0	nein
31	2850.0	nein
33	4800.0	nein
36	7500.0	nein
37	13800.0	nein
38	2850.0	nein
39	1450.0	nein
40	3500.0	nein
41	3200.0	nein
43	11500.0	nein
44	900.0	nein
...	...	...
49953	14750.0	nein
49956	4450.0	nein
49959	4200.0	nein
49962	2200.0	ja
49963	7600.0	nein
49965	6700.0	nein
49966	1490.0	ja
49968	2100.0	nein
49969	4500.0	nein
49970	15800.0	nein
49971	950.0	nein
49972	3300.0	ja
49973	6000.0	nein
49975	9700.0	nein
49976	5900.0	nein
49977	5500.0	nein
49978	900.0	ja
49979	11000.0	nein
49981	2000.0	nein
49986	15900.0	nein
49987	21990.0	nein
49988	9550.0	nein
49990	17500.0	nein
49992	4800.0	nein
49994	5000.0	nein
49995	24900.0	nein
49996	1980.0	nein
49997	13200.0	nein
49998	22900.0	nein
49999	1250.0	nein

36657 rows × 2 columns

In [60]:

price_by_status = {}

status = price_by_damage['unrepaired_damage'].unique()

for s in status:
    rows = price_by_damage[price_by_damage['unrepaired_damage'] == s]
    avg_price = rows['price'].mean().round(2)
    price_by_status[s] = avg_price

price_by_status

Out[60]:

{'ja': 2354.32, 'nein': 7282.34}

Looking at whether or not there was any unrepaired damage to a vehicle, a greater average price point was found for vehicles without any unrepaired damage as compared to those with it.