EXPLORING EBAY KLEINANZEIGEN CAR SALES DATA

The main goal of this project is to clean and analyze used car listings that were scraped from a classifieds section of the German Ebay website and upload to kaggle. However, the dataset we are going to use was modified by Dataquest team since this project is part of one of Dataquest requirements for accomplishing Data Scientist certificate.

So in this sence, our project is structured into 11 steps: first, we are going to import the proper libraries for data analysis, which are Pandas and Numpy, and open the data file. Next, a deep analysis in each column of the dataset will take place aiming to find any patterns. Moving foward, we will need to drop some inaccurate data and some outliers to make our analysis easier and more clear. After a quick cleaning process, we are going move onto data analysis and make some observations about the dataset.

Step 1: Importing Pandas and opening the dataset

As mentioned above, it is necessary to import two important libraries for data analysis: Pandas and Numpy. Next, we are going to read the dataset and print it.

In [1]:
import pandas as pd
import numpy as np
In [2]:
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
In [3]:
# Verifying the dataset

autos
Out[3]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon privat Angebot $24,900 control limousine 2011 automatik 239 q5 100,000km 1 diesel audi nein 2016-03-27 00:00:00 0 82131 2016-04-01 13:47:40
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot $1,980 control cabrio 1996 manuell 75 astra 150,000km 5 benzin opel nein 2016-03-28 00:00:00 0 44807 2016-04-02 14:18:02
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot $13,200 test cabrio 2014 automatik 69 500 5,000km 11 benzin fiat nein 2016-04-02 00:00:00 0 73430 2016-04-04 11:47:27
49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition privat Angebot $22,900 control kombi 2013 manuell 150 a3 40,000km 11 diesel audi nein 2016-03-08 00:00:00 0 35683 2016-04-05 16:45:07
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V privat Angebot $1,250 control limousine 1996 manuell 101 vectra 150,000km 1 benzin opel nein 2016-03-13 00:00:00 0 45897 2016-04-06 21:18:48

50000 rows × 20 columns

Step 2: Verifying the dataset

In [4]:
# Checking information about each column

autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
In [5]:
# printing the first few rows of the dataset

autos.head()
Out[5]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

According to the information above, there are null values in 5 different columns (but none with more than 20% null values): vehicleType, gearbox, model, fuelType and notRepairedDamage. Besides that, the values of the price and odometer columns are classified as objects. Therefore, it will be necessary to convert these values to int64 in order to make data manipulation easier.

Another point that can be observed is that the columns titles are in camelcase, which are not the Python's preferred notation. That is why we are going to start working with this issue.

Step 3: Converting the column names from camelcase to snakecase

In [6]:
# checking the present column names

autos.columns
Out[6]:
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')
In [7]:
# changing the column names

autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']
In [8]:
# checking the modified dataset

autos.head()
Out[8]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

Step 4: Descriptive statistics of all columns

In this step, we are going to make a quick view on descriptive statistics of each column to identify any odd data.

In [9]:
autos.describe(include = 'all')
Out[9]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-30 17:37:35 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN
In [10]:
# checking on seller, offer_type and nr_of_pictures columns
In [11]:
autos['seller'].value_counts()
Out[11]:
privat        49999
gewerblich        1
Name: seller, dtype: int64
In [12]:
autos['offer_type'].value_counts()
Out[12]:
Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64
In [13]:
autos['nr_of_pictures'].value_counts()
Out[13]:
0    50000
Name: nr_of_pictures, dtype: int64

Observations

It is possible to notice in the description above that there are two columns that have mostly one value (49999 of the same value, to be exact): seller and offer_type. Also, it is possible to see thath the column nr_of_pictures has only 0 values. To make our analysis easier, we are going to drop these three columns since they will not make any difference for us.

Step 5: Dropping seller, offer_type and nr_of_pictures columns.

In [14]:
autos.drop(labels = ['seller', 'offer_type', 'nr_of_pictures'], axis = 1, inplace = True)
In [15]:
# checking the modified dataset
In [16]:
autos.shape
Out[16]:
(50000, 17)
In [17]:
autos.head()
Out[17]:
date_crawled name price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 39218 2016-04-01 14:38:50

Step 6: Removing non-numeric characters and converting strings to integers.

As it was said before, we have an issue with the price and odometer columns which are classified as object but should contain numeric values instead. Therefore, we are now going to remove any non-numeric characters and convert the strings in these two columns to integers.

For price column

  • Removing non-numeric character ("$" and ",")
In [18]:
autos['price'].unique()
Out[18]:
array(['$5,000', '$8,500', '$8,990', ..., '$385', '$22,200', '$16,995'],
      dtype=object)
In [19]:
autos['price'] = autos['price'].str.replace('$', '').str.replace(',', '')
In [20]:
# checking the new values

autos['price'].unique()
Out[20]:
array(['5000', '8500', '8990', ..., '385', '22200', '16995'], dtype=object)
  • Converting strings to integers
In [21]:
autos['price'] = autos['price'].astype(int)
In [22]:
# checking the cleaned data

autos['price'].dtype
Out[22]:
dtype('int32')
In [23]:
autos['price'].head()
Out[23]:
0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int32

For odometer column

  • Removing non-numeric character ("," and "km")
In [24]:
autos['odometer'].unique()
Out[24]:
array(['150,000km', '70,000km', '50,000km', '80,000km', '10,000km',
       '30,000km', '125,000km', '90,000km', '20,000km', '60,000km',
       '5,000km', '100,000km', '40,000km'], dtype=object)
In [25]:
autos['odometer'] = autos['odometer'].str.replace(',','').str.replace('km', '')
In [26]:
# checking the new values

autos['odometer'].unique()
Out[26]:
array(['150000', '70000', '50000', '80000', '10000', '30000', '125000',
       '90000', '20000', '60000', '5000', '100000', '40000'], dtype=object)
  • Converting strings to integers
In [27]:
autos['odometer'] = autos['odometer'].astype(int)
In [28]:
# checking the cleaned data

autos['odometer'].head()
Out[28]:
0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer, dtype: int32

To finish this first cleaning process, we are going to change price and odometer labels to price_dollar and odometer_km, for better comprehension.

In [29]:
autos.rename({'price': 'price_dollar', 'odometer': 'odometer_km'}, axis = 1, inplace = True)
In [30]:
# checking our modified dataset

autos.head()
Out[30]:
date_crawled name price_dollar abtest vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 5000 control bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 2016-03-26 00:00:00 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 8500 control limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 2016-04-04 00:00:00 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 8990 test limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 2016-03-26 00:00:00 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4350 control kleinwagen 2007 automatik 71 fortwo 70000 6 benzin smart nein 2016-03-12 00:00:00 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 1350 test kombi 2003 manuell 0 focus 150000 7 benzin ford nein 2016-04-01 00:00:00 39218 2016-04-01 14:38:50
In [31]:
autos[['price_dollar', 'odometer_km']].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   price_dollar  50000 non-null  int32
 1   odometer_km   50000 non-null  int32
dtypes: int32(2)
memory usage: 390.8 KB

Step 7: Looking for outliers in price_dollar and odometer_km columns.

In this section, we are going to look for outliers in price_dollar and odometer_km columns aiming to drop them to make our analysis easier. To do so, we are going to use to use Series.unique(), Series.describe() and Series.value_counts() methods, for each column.

For price_dollar column

In [32]:
autos['price_dollar'].unique().shape
Out[32]:
(2357,)

According to the code line above, our dataset has 2357 unique values.

In [33]:
autos[['price_dollar']].describe()
Out[33]:
price_dollar
count 5.000000e+04
mean 9.840044e+03
std 4.811044e+05
min 0.000000e+00
25% 1.100000e+03
50% 2.950000e+03
75% 7.200000e+03
max 1.000000e+08

We can see in the code line above that the minimum value of the cars in our dataset is 0 and the maximum is \$100,000,000, which is technically impossible. After a quick research on [eBay website](https://www.ebay-kleinanzeigen.de/), we found out that there are no significative amount of cars that costs more than 1 million dollars and that mosts of the cars that were announced as less than $ 50 are actually donation ads or "give an offer" ads.

In [34]:
autos['price_dollar'].value_counts().sort_index(ascending = True).head(50)
Out[34]:
0      1421
1       156
2         3
3         1
5         2
8         1
9         1
10        7
11        2
12        3
13        2
14        1
15        2
17        3
18        1
20        4
25        5
29        1
30        7
35        1
40        6
45        4
47        1
49        4
50       49
55        2
59        1
60        9
65        5
66        1
70       10
75        5
79        1
80       15
89        1
90        5
99       19
100     134
110       3
111       2
115       2
117       1
120      39
122       1
125       8
129       1
130      15
135       1
139       1
140       9
Name: price_dollar, dtype: int64

After this quick analysis, we are now going to remove the outliers (lines that has cars' prices less than 50 and more than \$1,000,000.

In [35]:
# before removal

autos['price_dollar'].shape
Out[35]:
(50000,)
In [36]:
autos[['price_dollar']].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 1 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   price_dollar  50000 non-null  int32
dtypes: int32(1)
memory usage: 195.4 KB
In [37]:
autos = autos[autos['price_dollar'].between(50, 1000000)]  # code to remove the outliers
In [38]:
# after removal

autos['price_dollar'].shape
Out[38]:
(48350,)

For odometer_km column

In [39]:
autos['odometer_km'].unique()
Out[39]:
array([150000,  70000,  50000,  80000,  10000,  30000, 125000,  90000,
        20000,  60000,   5000, 100000,  40000])
In [40]:
autos['odometer_km'].describe()
Out[40]:
count     48350.000000
mean     125876.732161
std       39617.575951
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
In [41]:
autos['odometer_km'].value_counts()
Out[41]:
150000    31290
125000     5046
100000     2105
90000      1733
80000      1414
70000      1215
60000      1153
50000      1011
40000       815
5000        782
30000       778
20000       759
10000       249
Name: odometer_km, dtype: int64

As we can see from the code line above, there are not apparent outliers in the odometer_km column, since it is totally possible to find cars that have driven 150,000 km. Therefore, we are going to let the column as it is.

Step 8: Exploring the date columns (date_crawled, last_seen, ad_created, registration_month and registration_year)

In [42]:
autos[['date_crawled', 'last_seen', 'ad_created', 'registration_month', 'registration_year']].head()
Out[42]:
date_crawled last_seen ad_created registration_month registration_year
0 2016-03-26 17:47:46 2016-04-06 06:45:54 2016-03-26 00:00:00 3 2004
1 2016-04-04 13:38:56 2016-04-06 14:45:08 2016-04-04 00:00:00 6 1997
2 2016-03-26 18:57:24 2016-04-06 20:15:37 2016-03-26 00:00:00 7 2009
3 2016-03-12 16:58:10 2016-03-15 03:16:28 2016-03-12 00:00:00 6 2007
4 2016-04-01 14:38:50 2016-04-01 14:38:50 2016-04-01 00:00:00 7 2003

It is possible to see that registration_month and registration_year columns are represented as numerical data, whereas data_crawled, last_seen and ad_created are represented as strings. Because of that, we need to convert these three columns into numerical data. First of all, we are going to isolate the date values (which are represented by the first 10 characters) to understand the date range, for each of the three columns.

For date_crawled column

In [44]:
autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)
Out[44]:
2016-03-05    0.025357
2016-03-06    0.014064
2016-03-07    0.036029
2016-03-08    0.033257
2016-03-09    0.033051
2016-03-10    0.032223
2016-03-11    0.032554
2016-03-12    0.036980
2016-03-13    0.015657
2016-03-14    0.036670
2016-03-15    0.034292
2016-03-16    0.029493
2016-03-17    0.031603
2016-03-18    0.012906
2016-03-19    0.034747
2016-03-20    0.037787
2016-03-21    0.037311
2016-03-22    0.032947
2016-03-23    0.032265
2016-03-24    0.029431
2016-03-25    0.031499
2016-03-26    0.032265
2016-03-27    0.031086
2016-03-28    0.034891
2016-03-29    0.034126
2016-03-30    0.033713
2016-03-31    0.031830
2016-04-01    0.033733
2016-04-02    0.035533
2016-04-03    0.038573
2016-04-04    0.036525
2016-04-05    0.013051
2016-04-06    0.003164
2016-04-07    0.001386
Name: date_crawled, dtype: float64

For ad_created column

In [46]:
autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)
Out[46]:
2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
                ...   
2016-04-03    0.038821
2016-04-04    0.036898
2016-04-05    0.011768
2016-04-06    0.003247
2016-04-07    0.001241
Name: ad_created, Length: 76, dtype: float64

For last_seen column

In [48]:
autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = True)
Out[48]:
2016-03-05    0.001075
2016-03-06    0.004323
2016-03-07    0.005419
2016-03-08    0.007342
2016-03-09    0.009638
2016-03-10    0.010610
2016-03-11    0.012389
2016-03-12    0.023744
2016-03-13    0.008873
2016-03-14    0.012637
2016-03-15    0.015884
2016-03-16    0.016463
2016-03-17    0.028087
2016-03-18    0.007322
2016-03-19    0.015801
2016-03-20    0.020620
2016-03-21    0.020600
2016-03-22    0.021406
2016-03-23    0.018594
2016-03-24    0.019772
2016-03-25    0.019173
2016-03-26    0.016774
2016-03-27    0.015574
2016-03-28    0.020869
2016-03-29    0.022337
2016-03-30    0.024736
2016-03-31    0.023826
2016-04-01    0.022875
2016-04-02    0.024881
2016-04-03    0.025171
2016-04-04    0.024509
2016-04-05    0.124964
2016-04-06    0.221634
2016-04-07    0.132079
Name: last_seen, dtype: float64

After checking simultaneously the 3 columns, it is possible to conclude that ads created before March 2016 are considered as outliers, since the date_crawled and last_seen columns only registered values after March 2016.

Dealing with incorrect data in registration_year column

In [49]:
autos['registration_year'].value_counts().sort_index(ascending = True)
Out[49]:
1000    1
1001    1
1111    1
1800    2
1910    2
       ..
5911    1
6200    1
8888    1
9000    1
9999    3
Name: registration_year, Length: 95, dtype: int64

From the code cell above, it is possible to see that we have some outliers in the registration_year column. For example, we can see that there are some cars that were registered before 1800 (which is impossilbe since the car was not invented yet) and others were registered after 2016, which are definitely inaccurate since a car can't be registered after the listing was last seen. Therefore, we need to drop these outliers, considering that the first car registration happened in the first few decades of 1900s.

In [50]:
# before removal

autos['registration_year'].shape
Out[50]:
(48350,)
In [51]:
autos = autos[autos['registration_year'].between(1900, 2016)]
In [52]:
# after removal

autos['registration_year'].shape
Out[52]:
(46476,)

Now we are going to calculate the distribution of the remaining values for the registration_year.

In [53]:
autos['registration_year'].value_counts(normalize = True).sort_index(ascending = True)
Out[53]:
1910    0.000043
1927    0.000022
1929    0.000022
1931    0.000022
1934    0.000043
          ...   
2012    0.028165
2013    0.017235
2014    0.014244
2015    0.008262
2016    0.025970
Name: registration_year, Length: 78, dtype: float64

According to the output from the code cell above, most of the cars were registered after 1995.

Step 9: Identifying the top brands in the dataset and calculating the mean price

In order to identify the top brands of brand column, we are going to use the aggregation method, looping over the brand column. As we can see bellow, there are a total of 40 different brands in the dataset.

In [54]:
autos['brand'].unique().shape[0]
Out[54]:
40
In [55]:
autos['brand'].value_counts(normalize =  True)
Out[55]:
volkswagen        0.211356
bmw               0.109949
opel              0.107303
mercedes_benz     0.096501
audi              0.086604
ford              0.069993
renault           0.047186
peugeot           0.029908
fiat              0.025712
seat              0.018246
skoda             0.016417
nissan            0.015320
mazda             0.015255
smart             0.014179
citroen           0.014050
toyota            0.012759
hyundai           0.010048
sonstige_autos    0.009618
volvo             0.009101
mini              0.008779
mitsubishi        0.008176
honda             0.007854
kia               0.007100
alfa_romeo        0.006649
porsche           0.006046
suzuki            0.005939
chevrolet         0.005702
chrysler          0.003529
dacia             0.002647
daihatsu          0.002517
jeep              0.002281
subaru            0.002109
land_rover        0.002109
saab              0.001657
jaguar            0.001571
daewoo            0.001485
trabant           0.001377
rover             0.001334
lancia            0.001054
lada              0.000581
Name: brand, dtype: float64

Now we are going to loop over the brand column to add the brands that respond for more than 5% of the total of cars in the dataset into a dictionary. The main purpose of this method will be to calculate the mean price of each of the top brands.

In [56]:
top_brands = autos['brand'].value_counts(normalize = True, ascending = False)
common_brands = top_brands[top_brands > 0.05].index

brand_mean_price = {}

for b in common_brands:
    mean_price = autos[autos['brand'] == b]['price_dollar'].mean()
    brand_mean_price[b] = int(mean_price)      
In [57]:
brand_mean_price
Out[57]:
{'volkswagen': 5627,
 'bmw': 8376,
 'opel': 2996,
 'mercedes_benz': 8663,
 'audi': 9373,
 'ford': 4068}

Step 10: Calculating the mean milage for the top brands

In order to compare the mean price to the average milage of each brand, we are going to create another dictionary for Mean Milage just like we did for the mean price. Next, it will be necessary to transform both dictonaries into data series and then into a dataframe, using the series and dataframe constructors.

First, we are going to transform our previous dictionary(brand_mean_price) into a series.

Transforming Mean Price dictionary into a series

In [59]:
bmp_series = pd.Series(data=brand_mean_price)
bmp_series
Out[59]:
volkswagen       5627
bmw              8376
opel             2996
mercedes_benz    8663
audi             9373
ford             4068
dtype: int64

Creating Mean Milage dictionary and then series

Now, we can work on the Mean Milage, looping over the milage of the top brands (which we have called previously as common_brands)

In [60]:
brand_mean_milage = {}

for b in common_brands:
    mean_milage = autos[autos['brand'] == b]['odometer_km'].mean()
    brand_mean_milage[b] = int(mean_milage)
    
brand_mean_milage
    
Out[60]:
{'volkswagen': 128775,
 'bmw': 132651,
 'opel': 129440,
 'mercedes_benz': 130923,
 'audi': 129254,
 'ford': 124296}

Moving foward! It is possible now to transform the brand_mean_milage dictionary into a series, such as bmp_series.

In [61]:
bmm_series = pd.Series(data=brand_mean_milage)
bmm_series
Out[61]:
volkswagen       128775
bmw              132651
opel             129440
mercedes_benz    130923
audi             129254
ford             124296
dtype: int64

Step 11: Creating a DataFrame

In this final step, we are going to create a DataFrame containing both the brand mean price and the brand mean milage series. The main goal of this step is to make the comparison between both series easier and to identify if there is any link with each other.

In [62]:
mean_price_milage = pd.DataFrame(data = bmp_series, columns = ['mean_price_dollar'])
mean_price_milage
Out[62]:
mean_price_dollar
volkswagen 5627
bmw 8376
opel 2996
mercedes_benz 8663
audi 9373
ford 4068

Now, we must add the bmm_series as a new column of our dataframe.

In [63]:
mean_price_milage['mean_milage_km'] = bmm_series
mean_price_milage.sort_values('mean_price_dollar', ascending = False)
Out[63]:
mean_price_dollar mean_milage_km
audi 9373 129254
mercedes_benz 8663 130923
bmw 8376 132651
volkswagen 5627 128775
ford 4068 124296
opel 2996 129440

Apparently, there is not a direct relation between the price and the milage of the cars when we consider just the top 6 brands, since the mean milage doesn't varies significatively with the price.