INTRODUCTION

Using a dataset of eBay Kleinanzeigen (the classified section of the German eBay website) consisting of 50K data points available on kaggle, we want to (1) CLEAN THE DATA & (2) ANALYZE THE DATA.

This dataset contains the following information:

Variable Description
dateCrawled When the ad was 1st crawled
name Name of car
seller Private or Dealer as seller
offerType Type of listing
price Price on the car ad
abtest Whether the listing included A/B test
vehicleType Type of vehicle
yearOfRegistration Year when car was 1st registered
gearbox The transmission type
powerPS Horsepower of the car
model Model of the car
kilometer # of kilometers driven
monthOfRegistration Month in which car was 1st registered
fuelType Type of fuel the car uses
brand Brand of the car
notRepairedDamage If the repairs were done on damaged car
dateCreated Date when eBay listing was created
nrOfPictures # of pics of car in ad
postalCode Postal code where the car is at
lastSeenOnline When the crawler saw this ad last online
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

autos = pd.read_csv("autos.csv", encoding = "Latin-1")
autos.head()
Out[1]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
In [2]:
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

There are several things that need to change:

(1) need to lower case everything

(2) deal with missing values in "vehicleType", "gearbox", "model", "fuelType", "notRepairedDamage"

(3) turn things into snake_case

(4) turn "price" and "odometer" into float

Clean Column Name

In [3]:
autos.rename(str.lower, axis = 1, inplace = True)
autos.rename({"yearofregistration":"registration_year", 
              "monthofregistration":"registration_month",
              "notrepaireddamage":"unrepaired_damage",
              "datecreated":"ad_created",
              "datecrawled":"date_crawled",
              "offertype":"offer_type",
              "vehicletype":"vehicle_type",
              "gearbox":"gear_box",
              "fueltype":"fuel_type",
              "nrofpictures":"num_pictures",
              "postalcode":"postal_code",
              "lastseen":"last_seen"}, axis = 1, inplace = True)

autos.head(5)
Out[3]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gear_box powerps model odometer registration_month fuel_type brand unrepaired_damage ad_created num_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

Initial Exploration + Cleaning

In [4]:
autos.describe(include = "all")
Out[4]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gear_box powerps model odometer registration_month fuel_type brand unrepaired_damage ad_created num_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-16 21:50:53 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN
In [5]:
print(autos["offer_type"].value_counts())
print("\n")
print(autos['seller'].value_counts())
print("\n")
print(autos["num_pictures"].value_counts())
print("\n")
print(autos['gear_box'].value_counts())
print("\n")
print(autos['abtest'].value_counts())
Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64


privat        49999
gewerblich        1
Name: seller, dtype: int64


0    50000
Name: num_pictures, dtype: int64


manuell      36993
automatik    10327
Name: gear_box, dtype: int64


test       25756
control    24244
Name: abtest, dtype: int64

1) Remove "seller", "num_pictures" and "seller" from dataframe

2) "powerps" has an unrealistic value (> 1000 HP). Need to further analyze

3) Similarly, 'price' also need to be examined as there is no way to have a $2 million+ vehicle

4) "registration_year" has an unrealistic range, particularly at the extremes. Need to remove them.

5) Odometer + price needs to return to floats

In [6]:
# Fix price

autos['price'].value_counts()
# There's "$" and ",". Plus it's an object and not an integer/float 

autos['price'] = autos['price'].str.replace("$","").str.replace(",","")
In [7]:
autos['price'] = autos['price'].astype(float)
autos['price'].describe()
Out[7]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
In [8]:
# fix odometer 
autos['odometer'].value_counts()
autos['odometer'].unique()
# Need to deal with "," and "km" + turn it to float/int
autos['odometer'] = autos['odometer'].str.replace(",", "")
autos['odometer'] = autos['odometer'].str.replace("km", "")
autos['odometer'] = autos['odometer'].astype(int)
In [9]:
autos.rename({"odometer": "odometer_km"}, axis = 1, inplace = True)
In [10]:
autos.columns
# removing unnecessary columns 
Out[10]:
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gear_box', 'powerps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
In [11]:
cols = ['date_crawled', 'name', 'price', 
        'abtest', 'vehicle_type', 'registration_year',
        'gear_box', 'powerps', 'model',
        'odometer_km', 'registration_month', 'fuel_type',
        'brand', 'unrepaired_damage', 'ad_created',
        'postal_code', 'last_seen']
autos = autos[cols]
In [12]:
autos.head()
autos.describe(include = 'all')
Out[12]:
date_crawled name price abtest vehicle_type registration_year gear_box powerps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
count 50000 50000 5.000000e+04 50000 44905 50000.000000 47320 50000.000000 47242 50000.000000 50000.000000 45518 50000 40171 50000 50000.000000 50000
unique 48213 38754 NaN 2 8 NaN 2 NaN 245 NaN NaN 7 40 2 76 NaN 39481
top 2016-03-16 21:50:53 Ford_Fiesta NaN test limousine NaN manuell NaN golf NaN NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN 2016-04-07 06:17:27
freq 3 78 NaN 25756 12859 NaN 36993 NaN 4024 NaN NaN 30107 10687 35232 1946 NaN 8
mean NaN NaN 9.840044e+03 NaN NaN 2005.073280 NaN 116.355920 NaN 125732.700000 5.723360 NaN NaN NaN NaN 50813.627300 NaN
std NaN NaN 4.811044e+05 NaN NaN 105.712813 NaN 209.216627 NaN 40042.211706 3.711984 NaN NaN NaN NaN 25779.747957 NaN
min NaN NaN 0.000000e+00 NaN NaN 1000.000000 NaN 0.000000 NaN 5000.000000 0.000000 NaN NaN NaN NaN 1067.000000 NaN
25% NaN NaN 1.100000e+03 NaN NaN 1999.000000 NaN 70.000000 NaN 125000.000000 3.000000 NaN NaN NaN NaN 30451.000000 NaN
50% NaN NaN 2.950000e+03 NaN NaN 2003.000000 NaN 105.000000 NaN 150000.000000 6.000000 NaN NaN NaN NaN 49577.000000 NaN
75% NaN NaN 7.200000e+03 NaN NaN 2008.000000 NaN 150.000000 NaN 150000.000000 9.000000 NaN NaN NaN NaN 71540.000000 NaN
max NaN NaN 1.000000e+08 NaN NaN 9999.000000 NaN 17700.000000 NaN 150000.000000 12.000000 NaN NaN NaN NaN 99998.000000 NaN

Examining powerps

There is no way that horse power can be 17700. So we need to find out what's up.

In [13]:
autos['powerps'].value_counts().sort_index(ascending = False).head(50)
Out[13]:
17700    1
16312    1
16011    1
15016    1
15001    1
14009    1
9011     1
8404     1
7511     1
6512     1
6226     1
6045     1
5867     1
4400     1
3750     1
3500     1
2729     1
2018     1
1998     2
1988     1
1986     1
1800     3
1796     1
1793     1
1781     1
1780     1
1779     1
1771     1
1753     1
1704     1
1405     1
1401     1
1400     3
1398     1
1367     1
1300     1
1202     1
1103     1
1090     1
1082     1
1056     1
1055     1
1016     1
1011     1
1003     1
1001     3
1000     1
999      1
952      1
950      1
Name: powerps, dtype: int64
In [14]:
autos['powerps'].value_counts().sort_index(ascending = True).head(50)
Out[14]:
0     5500
1        5
2        2
3        3
4        4
5       13
6        3
8        2
9        1
10       3
11       4
12       1
14       1
15       5
16       1
18       6
19       3
20       4
21       1
23       4
24       1
25       2
26      39
27       5
29       5
30       3
32       2
33       9
34      27
35       2
37       7
38       2
39      18
40      42
41      57
42       7
43      20
44      52
45     397
46       9
47       8
48      16
49       2
50     604
51      14
52      40
53      28
54     759
55     275
56      40
Name: powerps, dtype: int64

As the Bugatti Cheron was the highest HP production car at approx. 1600, really anything above that is not valid. So we will use this as the top-end cap. Conversely the lowest HP production car is the Renault Twitzy at about 17 hp. Thus we will use this as the bottom-end cap.

In [15]:
autos =  autos[autos['powerps'].between(16, 1600)]
autos.describe()
Out[15]:
price registration_year powerps odometer_km registration_month postal_code
count 4.442000e+04 44420.000000 44420.000000 44420.000000 44420.000000 44420.000000
mean 1.045555e+04 2004.047726 126.743134 125794.236830 5.930752 51368.900698
std 5.076550e+05 53.063904 67.302685 39267.227509 3.602153 25754.846000
min 0.000000e+00 1927.000000 16.000000 5000.000000 0.000000 1067.000000
25% 1.299000e+03 1999.000000 80.000000 100000.000000 3.000000 31028.000000
50% 3.299000e+03 2004.000000 116.000000 150000.000000 6.000000 50354.000000
75% 7.800000e+03 2008.000000 150.000000 150000.000000 9.000000 72250.000000
max 1.000000e+08 9000.000000 1405.000000 150000.000000 12.000000 99998.000000
In [16]:
autos['powerps'].value_counts().sort_index(ascending = False)
Out[16]:
1405      1
1401      1
1400      3
1398      1
1367      1
1300      1
1202      1
1103      1
1090      1
1082      1
1056      1
1055      1
1016      1
1011      1
1003      1
1001      3
1000      1
999       1
952       1
950       1
923       1
900       3
850       1
800       1
799       1
754       1
740       1
696       1
682       1
678       1
       ... 
50      604
49        2
48       16
47        8
46        9
45      397
44       52
43       20
42        7
41       57
40       42
39       18
38        2
37        7
35        2
34       27
33        9
32        2
30        3
29        5
27        5
26       39
25        2
24        1
23        4
21        1
20        4
19        3
18        6
16        1
Name: powerps, Length: 404, dtype: int64

Examining Price

In [17]:
autos['price'].shape[0]
Out[17]:
44420
In [18]:
autos['price'].describe()
Out[18]:
count    4.442000e+04
mean     1.045555e+04
std      5.076550e+05
min      0.000000e+00
25%      1.299000e+03
50%      3.299000e+03
75%      7.800000e+03
max      1.000000e+08
Name: price, dtype: float64
In [19]:
autos['price'].value_counts().sort_index(ascending = False).head(10)
Out[19]:
99999999.0    1
27322222.0    1
12345678.0    3
11111111.0    1
10000000.0    1
3890000.0     1
1234566.0     1
999999.0      2
999990.0      1
350000.0      1
Name: price, dtype: int64
In [20]:
autos['price'].value_counts().sort_index(ascending = False).tail(10)
Out[20]:
30.0      1
25.0      1
20.0      1
13.0      1
11.0      1
10.0      2
5.0       1
3.0       1
1.0      89
0.0     907
Name: price, dtype: int64
In [21]:
print(len(autos[autos['price'] < 100]))
print(len(autos[autos['price'] > 500000]))
print(len(autos))
1072
12
44420

Of the 50K datapoints in this dataset, 1762 datapoints list price of car as less than 100 dollars whilst 11 list the price of the car as greater than 500,000 dollars. Considering both the unlikelihood of purchasing a car within these extremes and its make up of the total data point to be less than 5% (i.e. can remove without affecting the integrity of the data), we will use this as our liimts.

In [22]:
autos =  autos[autos['price'].between(100, 500000)]
autos['price'].value_counts().sort_index()
Out[22]:
100.0        77
110.0         1
111.0         1
115.0         1
117.0         1
120.0        18
125.0         4
130.0         8
135.0         1
139.0         1
140.0         7
145.0         2
149.0         6
150.0       148
156.0         1
160.0         7
170.0         2
173.0         1
175.0         9
179.0         1
180.0        24
185.0         1
188.0         1
190.0        13
193.0         1
195.0         2
199.0        27
200.0       192
210.0         1
215.0         1
           ... 
115000.0      1
115991.0      1
116000.0      1
119500.0      1
119900.0      1
120000.0      2
128000.0      1
129000.0      1
135000.0      1
137999.0      1
139997.0      1
145000.0      1
151990.0      1
155000.0      1
163500.0      1
163991.0      1
169000.0      1
169999.0      1
175000.0      1
180000.0      1
190000.0      1
197000.0      1
198000.0      1
220000.0      1
259000.0      1
265000.0      1
295000.0      1
299000.0      1
345000.0      1
350000.0      1
Name: price, Length: 2260, dtype: int64

Examining Odometer_km

In [23]:
autos['odometer_km'].unique()
Out[23]:
array([150000,  70000,  50000,  80000,  10000,  30000, 125000,  90000,
        20000,  60000, 100000,   5000,  40000])
In [24]:
autos['odometer_km'].describe()
Out[24]:
count     43336.000000
mean     125625.230755
std       39265.241462
min        5000.000000
25%      100000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
In [25]:
autos['odometer_km'].value_counts().sort_index(ascending = False)
Out[25]:
150000    27696
125000     4657
100000     1922
90000      1594
80000      1325
70000      1145
60000      1101
50000       964
40000       772
30000       731
20000       698
10000       215
5000        516
Name: odometer_km, dtype: int64

Considering the nature of the items as used cars, which are likely to have a very high mileage, it would be expected that there would be a large number of vehicles with extraordinarily high mileage and few vehicles with low mileage. As such, we will not be modifying the vehicle in terms of odometer.

Exploring Date Columns

In [26]:
# date_crawled + ad_created + last_seen

autos[['date_crawled','ad_created','last_seen']].head()
# the format is %YYYY-%mm-%dd (essentially 10 strings)
Out[26]:
date_crawled ad_created last_seen
0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54
1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08
2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37
3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28
5 2016-03-21 13:47:45 2016-03-21 00:00:00 2016-04-06 09:45:21
In [27]:
autos['date_crawled'] = autos['date_crawled'].str[:10]
In [28]:
autos['ad_created'] = autos['ad_created'].str[:10]
autos['last_seen'] = autos['last_seen'].str[:10]
In [29]:
# Counting date_crawled

autos['date_crawled'].value_counts(normalize = True, dropna =  False).sort_index(ascending = False).round(2)
Out[29]:
2016-04-07    0.00
2016-04-06    0.00
2016-04-05    0.01
2016-04-04    0.04
2016-04-03    0.04
2016-04-02    0.04
2016-04-01    0.03
2016-03-31    0.03
2016-03-30    0.03
2016-03-29    0.03
2016-03-28    0.04
2016-03-27    0.03
2016-03-26    0.03
2016-03-25    0.03
2016-03-24    0.03
2016-03-23    0.03
2016-03-22    0.03
2016-03-21    0.04
2016-03-20    0.04
2016-03-19    0.03
2016-03-18    0.01
2016-03-17    0.03
2016-03-16    0.03
2016-03-15    0.03
2016-03-14    0.04
2016-03-13    0.02
2016-03-12    0.04
2016-03-11    0.03
2016-03-10    0.03
2016-03-09    0.03
2016-03-08    0.03
2016-03-07    0.04
2016-03-06    0.01
2016-03-05    0.03
Name: date_crawled, dtype: float64

Examining the distribution of date_crawled, which ranged from March 5th, 2016 - April 7th, 2016, there is a rounghly equivalent distribution at approx 3% across the dates in this range when the crawler accessed the ads.

In [30]:
# Counting date_crawled

autos['last_seen'].value_counts(normalize = True, dropna =  False).sort_index(ascending = False).round(2)
Out[30]:
2016-04-07    0.13
2016-04-06    0.23
2016-04-05    0.13
2016-04-04    0.02
2016-04-03    0.02
2016-04-02    0.03
2016-04-01    0.02
2016-03-31    0.02
2016-03-30    0.02
2016-03-29    0.02
2016-03-28    0.02
2016-03-27    0.02
2016-03-26    0.02
2016-03-25    0.02
2016-03-24    0.02
2016-03-23    0.02
2016-03-22    0.02
2016-03-21    0.02
2016-03-20    0.02
2016-03-19    0.02
2016-03-18    0.01
2016-03-17    0.03
2016-03-16    0.02
2016-03-15    0.02
2016-03-14    0.01
2016-03-13    0.01
2016-03-12    0.02
2016-03-11    0.01
2016-03-10    0.01
2016-03-09    0.01
2016-03-08    0.01
2016-03-07    0.01
2016-03-06    0.00
2016-03-05    0.00
Name: last_seen, dtype: float64

In terms of "last_seen" that looked within the range of March 5th, 2016 to April 7th, 2016, there was a noted uptake in when the majority of the ads were last seen (i.e. near April 7th), which could be explained by: (1) the corresponding to the collection period when the dataset was being compiled or (2) this being the prime period to which car sales occur since it's the end of the 1st quarter.

In [31]:
autos['ad_created'].value_counts(normalize = True, dropna =  False).sort_index(ascending = False).round(2)
Out[31]:
2016-04-07    0.00
2016-04-06    0.00
2016-04-05    0.01
2016-04-04    0.04
2016-04-03    0.04
2016-04-02    0.04
2016-04-01    0.03
2016-03-31    0.03
2016-03-30    0.03
2016-03-29    0.03
2016-03-28    0.04
2016-03-27    0.03
2016-03-26    0.03
2016-03-25    0.03
2016-03-24    0.03
2016-03-23    0.03
2016-03-22    0.03
2016-03-21    0.04
2016-03-20    0.04
2016-03-19    0.03
2016-03-18    0.01
2016-03-17    0.03
2016-03-16    0.03
2016-03-15    0.03
2016-03-14    0.04
2016-03-13    0.02
2016-03-12    0.04
2016-03-11    0.03
2016-03-10    0.03
2016-03-09    0.03
              ... 
2016-02-23    0.00
2016-02-22    0.00
2016-02-21    0.00
2016-02-20    0.00
2016-02-19    0.00
2016-02-18    0.00
2016-02-17    0.00
2016-02-16    0.00
2016-02-14    0.00
2016-02-12    0.00
2016-02-09    0.00
2016-02-08    0.00
2016-02-07    0.00
2016-02-05    0.00
2016-02-02    0.00
2016-02-01    0.00
2016-01-29    0.00
2016-01-27    0.00
2016-01-22    0.00
2016-01-16    0.00
2016-01-14    0.00
2016-01-13    0.00
2016-01-10    0.00
2016-01-07    0.00
2016-01-03    0.00
2015-12-30    0.00
2015-12-05    0.00
2015-11-10    0.00
2015-09-09    0.00
2015-08-10    0.00
Name: ad_created, Length: 74, dtype: float64

As for "ad_created", looking at the range from June 11th, 2015 to April 7th, 2016, the majority of the ads were created in March 2016.

Dealing with Incorrect Registration Year

In [32]:
autos['registration_year'].value_counts().sort_index()
Out[32]:
1927       1
1931       1
1937       3
1941       2
1950       1
1951       2
1954       2
1955       1
1956       3
1957       2
1958       2
1959       6
1960       8
1961       4
1962       3
1963       7
1964      10
1965      13
1966      13
1967      18
1968      23
1969      14
1970      19
1971      20
1972      29
1973      16
1974      20
1975      14
1976      16
1977      19
        ... 
1994     550
1995     999
1996    1187
1997    1721
1998    2084
1999    2555
2000    2530
2001    2405
2002    2266
2003    2484
2004    2487
2005    2579
2006    2533
2007    2145
2008    2102
2009    1992
2010    1540
2011    1585
2012    1270
2013     784
2014     642
2015     364
2016     864
2017    1034
2018     349
2019       2
2800       1
5000       1
5911       1
9000       1
Name: registration_year, Length: 76, dtype: int64

Seeing as the earliest commercial car was made in 1917, we can just use this as the set minimum. Furthermore, as this dataset only took in data up until April 2017, it doesn't make sense to have registration years after 2017. So the range in registration year we are lookin for is b/t 1917 - 2017.

In [33]:
min_bool = autos['registration_year'] > 1917
max_bool = autos['registration_year'] < 2018
combined = min_bool & max_bool

autos = autos[combined]
autos.describe()
Out[33]:
price registration_year powerps odometer_km registration_month postal_code
count 42981.000000 42981.000000 42981.000000 42981.000000 42981.000000 42981.000000
mean 6264.865964 2003.512180 127.107001 125600.381564 5.967776 51538.376492
std 9231.082718 7.060427 67.377291 39274.090885 3.583941 25732.257434
min 100.000000 1927.000000 16.000000 5000.000000 0.000000 1067.000000
25% 1399.000000 1999.000000 80.000000 100000.000000 3.000000 31177.000000
50% 3495.000000 2004.000000 116.000000 150000.000000 6.000000 50737.000000
75% 7950.000000 2008.000000 150.000000 150000.000000 9.000000 72406.000000
max 350000.000000 2017.000000 1405.000000 150000.000000 12.000000 99998.000000

Exploring Price by Brand

In [34]:
autos['brand'].unique()
Out[34]:
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'chrysler', 'renault',
       'audi', 'sonstige_autos', 'opel', 'mazda', 'porsche', 'mini',
       'mercedes_benz', 'ford', 'seat', 'toyota', 'dacia', 'nissan',
       'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda',
       'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda',
       'daewoo', 'suzuki', 'land_rover', 'alfa_romeo', 'rover',
       'daihatsu', 'trabant', 'lancia', 'lada'], dtype=object)
In [35]:
top_10_brands = autos['brand'].value_counts()[:10]
In [36]:
bottom_10_brands = autos['brand'].value_counts()[-10:]
In [37]:
top_10_brands
Out[37]:
volkswagen       9194
bmw              4862
opel             4567
mercedes_benz    4126
audi             3842
ford             2936
renault          1929
peugeot          1264
fiat             1062
seat              809
Name: brand, dtype: int64
In [38]:
bottom_10_brands
Out[38]:
daihatsu      96
subaru        92
land_rover    91
saab          73
jaguar        64
daewoo        55
rover         52
lancia        46
trabant       41
lada          21
Name: brand, dtype: int64
In [39]:
print(10157+5201+5155+4586+4118+3331+2272+1413+1232+888)
print(len(autos))
print((10157+5201+5155+4586+4118+3331+2272+1413+1232+888) / len(autos))
38353
42981
0.8923245154835858

Looking at the data, it seems that the majority of the used cars are German brands (i.e. Volkswagen, BMW, Audi, Mercedes Benz) whilst the major foreign brand being Ford. Nevertheless, the most common used car brands are domestic. However, as it pertains to the 10 least popular brands, they are mainly foreign brands such as Subaru, Land Rover, Daewoo, etc.

The decision to take the Top 10 brands is that these make up 89.3% of the entire dataset, which is a pretty good representation of the marketplace.

In [40]:
volkswagen_bool = autos['brand'] == "volkswagen"
bmw_bool = autos['brand'] == 'bmw'
opel_bool = autos['brand'] == "opel"
mercedes_bool = autos['brand'] == 'mercedes_benz'
audi_bool = autos['brand'] == "audi"
ford_bool = autos['brand'] =='ford'
renault_bool = autos['brand'] == "renault"
peugeot_bool = autos['brand'] == 'peugeot'
fiat_bool = autos['brand'] == 'fiat'
seat_bool = autos['brand'] == 'seat'
In [41]:
volkswagen_only = autos[volkswagen_bool]
bmw_only = autos[bmw_bool]
opel_only = autos[opel_bool]
mercedes_only = autos[mercedes_bool]
audi_only = autos[audi_bool]
ford_only = autos[ford_bool]
renault_only = autos[renault_bool]
peugeot_only = autos[peugeot_bool]
fiat_only = autos[fiat_bool]
seat_only = autos[seat_bool]
In [42]:
top_brands_mean_price = {}

top_10 = autos['brand'].value_counts(ascending = False)[:10].index

for maker in top_10:
    selected_rows = autos[autos['brand'] == maker]
    mean_price = selected_rows['price'].mean()
    mean_price = mean_price.round(2)
    top_brands_mean_price[maker] = mean_price

top_brands_mean_price
Out[42]:
{'audi': 9659.37,
 'bmw': 8577.61,
 'fiat': 2968.12,
 'ford': 3932.04,
 'mercedes_benz': 8936.97,
 'opel': 3151.16,
 'peugeot': 3249.6,
 'renault': 2635.15,
 'seat': 4675.29,
 'volkswagen': 5660.48}

The avg. price for each of the top 10 car brands are:

Brand Avg. Price (in dollars)
Audi 9659.37
BMW 8577.61
Fiat 2968.12
Ford 3933.09
Mercedes Benz 8936.97
Opel 3151.16
Peugeot 3249.60
Renault 2635.15
Seat 4675.29
Volkswagen 5660.48

Looking at the data, it seems that the more luxurious brands (i.e. Audi, BMW, Mercedes Benz) have the highest price point whilst the foreign brands (i.e. Ford, Peugeot, Fiat) are on the lower end of the price point spectrum. As for the most populous brand (Volkswagen), its pricepoint is moderate.

Storing Aggregate Data

Working with the top 10 brands, we want to see if mean mileage (i.e. odometer_km) and mean price are related to one another.

In [43]:
top_brands_mean_mileage = {}

top_10 = autos['brand'].value_counts(ascending = False)[:10].index

for maker in top_10:
    selected_rows = autos[autos['brand'] == maker]
    mean_km= selected_rows['odometer_km'].mean().round(2)
    top_brands_mean_mileage[maker] = mean_km

top_brands_mean_mileage
Out[43]:
{'audi': 128854.76,
 'bmw': 132670.71,
 'fiat': 116864.41,
 'ford': 124105.93,
 'mercedes_benz': 130822.83,
 'opel': 129162.47,
 'peugeot': 126475.47,
 'renault': 127392.43,
 'seat': 121161.93,
 'volkswagen': 128454.97}

Looking at the average mileage of the Top 10 brands of cars, it seems that the range lies b/t 116000km to 133000km. There luxurious brands (BMW and Mercedes) appear to have a higher mileage as compared to the foreign brands.

In [44]:
top_10_price = pd.Series(top_brands_mean_price)
top_10_mileage = pd.Series(top_brands_mean_mileage)

top_10_makers = pd.DataFrame(top_10_price, columns = ['mean_price'])
top_10_makers['mean_mileage'] = top_10_mileage
In [45]:
top_10_makers
Out[45]:
mean_price mean_mileage
audi 9659.37 128854.76
bmw 8577.61 132670.71
fiat 2968.12 116864.41
ford 3932.04 124105.93
mercedes_benz 8936.97 130822.83
opel 3151.16 129162.47
peugeot 3249.60 126475.47
renault 2635.15 127392.43
seat 4675.29 121161.93
volkswagen 5660.48 128454.97

It appears as though mileage doesn't seem to be as strongly related to price despite the more luxurious and priciest brands having a higher mean mileage as compared to the more economic and cheapest brands with a lower mean mileage.

EXTRA ANALYSIS: Common Brand + Model

In [46]:
col = ['brand', 'model']
Brand_Model = autos[col]
Brand_Model = Brand_Model.dropna(axis = 0)
In [47]:
Brand_Model['brand'].value_counts().head()
Out[47]:
volkswagen       8878
bmw              4685
opel             4411
mercedes_benz    4031
audi             3747
Name: brand, dtype: int64
In [48]:
Brand_Model['model'].value_counts().head()
Out[48]:
golf      3536
andere    2997
3er       2467
polo      1455
corsa     1426
Name: model, dtype: int64
In [49]:
top_model_by_brand = {}

brands = Brand_Model['brand'].unique()

for b in brands:
    selected_rows = Brand_Model[Brand_Model['brand'] == b]
    sorted_rows = selected_rows.mode()
    top_row = selected_rows.iloc[0]
    top_model = top_row['model']
    top_model_by_brand[b] = top_model

top_model_by_brand
Out[49]:
{'alfa_romeo': '156',
 'audi': 'a3',
 'bmw': '7er',
 'chevrolet': 'andere',
 'chrysler': 'voyager',
 'citroen': 'andere',
 'dacia': 'sandero',
 'daewoo': 'kalos',
 'daihatsu': 'terios',
 'fiat': 'punto',
 'ford': 'mondeo',
 'honda': 'civic',
 'hyundai': 'i_reihe',
 'jaguar': 'andere',
 'jeep': 'wrangler',
 'kia': 'andere',
 'lada': 'niva',
 'lancia': 'lybra',
 'land_rover': 'freelander',
 'mazda': 'andere',
 'mercedes_benz': 'e_klasse',
 'mini': 'cooper',
 'mitsubishi': 'colt',
 'nissan': 'primera',
 'opel': 'vectra',
 'peugeot': 'andere',
 'porsche': '911',
 'renault': 'megane',
 'rover': 'andere',
 'saab': 'andere',
 'seat': 'altea',
 'skoda': 'octavia',
 'smart': 'fortwo',
 'subaru': 'andere',
 'suzuki': 'grand',
 'toyota': 'andere',
 'trabant': '601',
 'volkswagen': 'golf',
 'volvo': 'andere'}

EXTRA ANALYSIS: Split Odometer x Price

In [50]:
autos['odometer_km'].value_counts().sort_index()
Out[50]:
5000        505
10000       213
20000       697
30000       726
40000       769
50000       962
60000      1096
70000      1138
80000      1313
90000      1579
100000     1906
125000     4621
150000    27456
Name: odometer_km, dtype: int64
In [51]:
len(autos)
Out[51]:
42981

Normally, the best approach should be a split in terms of equally distributed groups. However, considering the aggregation of vehicles with mileage greater than or equal to 150K km, we will break it down to the following distribution:

1) 0 - 30000 2) 30001 - 60000 3) 60001 - 90000 4) 90001 - 125000 5) 125000 +

In [52]:
odometer_km_splits = []

for km in autos['odometer_km']:
    if km < 30001:
        odometer_km_splits.append("0 - 30,000")
    elif km < 60001: 
        odometer_km_splits.append("30,001 - 60,000")
    elif km < 90001: 
        odometer_km_splits.append("60,001 - 90,000")
    elif km < 125001: 
        odometer_km_splits.append("90,001 - 125,000")
    else:
        odometer_km_splits.append("125,000+")
In [53]:
autos["odometer_km_splits"] = odometer_km_splits
In [54]:
autos['odometer_km_splits'].value_counts()
Out[54]:
125,000+            27456
90,001 - 125,000     6527
60,001 - 90,000      4030
30,001 - 60,000      2827
0 - 30,000           2141
Name: odometer_km_splits, dtype: int64
In [55]:
price_by_mileage = {}

odometer_splits = autos["odometer_km_splits"].unique()

for km in odometer_splits:
    selected_rows = autos[autos["odometer_km_splits"] == km]
    avg_price = selected_rows['price'].mean().round(2)
    price_by_mileage[km] =  avg_price

price_by_mileage
Out[55]:
{'0 - 30,000': 17004.72,
 '125,000+': 3953.13,
 '30,001 - 60,000': 13819.04,
 '60,001 - 90,000': 9826.49,
 '90,001 - 125,000': 6995.37}

Looking at the breakdown of the odometer, there seems to be a negative relationship between mean price points of a vehicle and mileage where the greater the mileage = the lower the price of the vehicle.

Extra Analysis: Price x Damage

In [58]:
col = ['price', 'unrepaired_damage']
price_by_damage = autos[col].dropna(axis = 0)
price_by_damage
Out[58]:
price unrepaired_damage
0 5000.0 nein
1 8500.0 nein
2 8990.0 nein
3 4350.0 nein
7 1990.0 nein
9 590.0 nein
10 999.0 nein
12 5299.0 nein
13 1350.0 nein
15 18900.0 nein
16 350.0 nein
17 5500.0 nein
19 4150.0 nein
21 41500.0 nein
22 25450.0 nein
23 7999.0 nein
24 48500.0 nein
26 777.0 nein
28 5250.0 ja
29 4999.0 nein
31 2850.0 nein
33 4800.0 nein
36 7500.0 nein
37 13800.0 nein
38 2850.0 nein
39 1450.0 nein
40 3500.0 nein
41 3200.0 nein
43 11500.0 nein
44 900.0 nein
... ... ...
49953 14750.0 nein
49956 4450.0 nein
49959 4200.0 nein
49962 2200.0 ja
49963 7600.0 nein
49965 6700.0 nein
49966 1490.0 ja
49968 2100.0 nein
49969 4500.0 nein
49970 15800.0 nein
49971 950.0 nein
49972 3300.0 ja
49973 6000.0 nein
49975 9700.0 nein
49976 5900.0 nein
49977 5500.0 nein
49978 900.0 ja
49979 11000.0 nein
49981 2000.0 nein
49986 15900.0 nein
49987 21990.0 nein
49988 9550.0 nein
49990 17500.0 nein
49992 4800.0 nein
49994 5000.0 nein
49995 24900.0 nein
49996 1980.0 nein
49997 13200.0 nein
49998 22900.0 nein
49999 1250.0 nein

36657 rows × 2 columns

In [60]:
price_by_status = {}

status = price_by_damage['unrepaired_damage'].unique()

for s in status:
    rows = price_by_damage[price_by_damage['unrepaired_damage'] == s]
    avg_price = rows['price'].mean().round(2)
    price_by_status[s] = avg_price

price_by_status
Out[60]:
{'ja': 2354.32, 'nein': 7282.34}

Looking at whether or not there was any unrepaired damage to a vehicle, a greater average price point was found for vehicles without any unrepaired damage as compared to those with it.