Exploring Ebay Car Sales Data

The aim of this project is to clean the data and analyze the included used car listings. In this project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

The data dictionary provided with data is as follows:

column Header Description
dateCrawled When this ad was first crawled. All field-values are taken from this date
name Name of the car
seller Whether the seller is private or a dealer
offerType The type of listing
price The price on the ad to sell the car
abtest Whether the listing is included in an A/B test
vehicleType The vehicle Type
yearOfRegistration The year in which the car was first registered
gearbox The transmission type
powerPS The power of the car in PS
model The car model name
kilometer How many kilometers the car has driven
monthOfRegistration The month in which the car was first registered
fuelType What type of fuel the car uses
brand The brand of the car
notRepairedDamage If the car has a damage which is not yet repaired
dateCreated The date on which the eBay listing was created
nrOfPictures The number of pictures in the ad
postalCode The postal code for the location of the vehicle
lastSeenOnline When the crawler saw this ad last online
In [187]:
import pandas as pd
import numpy as np

autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
In [188]:
autos
Out[188]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
5 2016-03-21 13:47:45 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... privat Angebot $7,900 test bus 2006 automatik 150 voyager 150,000km 4 diesel chrysler NaN 2016-03-21 00:00:00 0 22962 2016-04-06 09:45:21
6 2016-03-20 17:55:21 VW_Golf_III_GT_Special_Electronic_Green_Metall... privat Angebot $300 test limousine 1995 manuell 90 golf 150,000km 8 benzin volkswagen NaN 2016-03-20 00:00:00 0 31535 2016-03-23 02:48:59
7 2016-03-16 18:55:19 Golf_IV_1.9_TDI_90PS privat Angebot $1,990 control limousine 1998 manuell 90 golf 150,000km 12 diesel volkswagen nein 2016-03-16 00:00:00 0 53474 2016-04-07 03:17:32
8 2016-03-22 16:51:34 Seat_Arosa privat Angebot $250 test NaN 2000 manuell 0 arosa 150,000km 10 NaN seat nein 2016-03-22 00:00:00 0 7426 2016-03-26 18:18:10
9 2016-03-16 13:47:02 Renault_Megane_Scenic_1.6e_RT_Klimaanlage privat Angebot $590 control bus 1997 manuell 90 megane 150,000km 7 benzin renault nein 2016-03-16 00:00:00 0 15749 2016-04-06 10:46:35
10 2016-03-15 01:41:36 VW_Golf_Tuning_in_siber/grau privat Angebot $999 test NaN 2017 manuell 90 NaN 150,000km 4 benzin volkswagen nein 2016-03-14 00:00:00 0 86157 2016-04-07 03:16:21
11 2016-03-16 18:45:34 Mercedes_A140_Motorschaden privat Angebot $350 control NaN 2000 NaN 0 NaN 150,000km 0 benzin mercedes_benz NaN 2016-03-16 00:00:00 0 17498 2016-03-16 18:45:34
12 2016-03-31 19:48:22 Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... privat Angebot $5,299 control kleinwagen 2010 automatik 71 fortwo 50,000km 9 benzin smart nein 2016-03-31 00:00:00 0 34590 2016-04-06 14:17:52
13 2016-03-23 10:48:32 Audi_A3_1.6_tuning privat Angebot $1,350 control limousine 1999 manuell 101 a3 150,000km 11 benzin audi nein 2016-03-23 00:00:00 0 12043 2016-04-01 14:17:13
14 2016-03-23 11:50:46 Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... privat Angebot $3,999 test kleinwagen 2007 manuell 75 clio 150,000km 9 benzin renault NaN 2016-03-23 00:00:00 0 81737 2016-04-01 15:46:47
15 2016-04-01 12:06:20 Corvette_C3_Coupe_T_Top_Crossfire_Injection privat Angebot $18,900 test coupe 1982 automatik 203 NaN 80,000km 6 benzin sonstige_autos nein 2016-04-01 00:00:00 0 61276 2016-04-02 21:10:48
16 2016-03-16 14:59:02 Opel_Vectra_B_Kombi privat Angebot $350 test kombi 1999 manuell 101 vectra 150,000km 5 benzin opel nein 2016-03-16 00:00:00 0 57299 2016-03-18 05:29:37
17 2016-03-29 11:46:22 Volkswagen_Scirocco_2_G60 privat Angebot $5,500 test coupe 1990 manuell 205 scirocco 150,000km 6 benzin volkswagen nein 2016-03-29 00:00:00 0 74821 2016-04-05 20:46:26
18 2016-03-26 19:57:44 Verkaufen_mein_bmw_e36_320_i_touring privat Angebot $300 control bus 1995 manuell 150 3er 150,000km 0 benzin bmw NaN 2016-03-26 00:00:00 0 54329 2016-04-02 12:16:41
19 2016-03-17 13:36:21 mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 privat Angebot $4,150 control suv 2004 manuell 124 andere 150,000km 2 lpg mazda nein 2016-03-17 00:00:00 0 40878 2016-03-17 14:45:58
20 2016-03-05 19:57:31 Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... privat Angebot $3,500 test kombi 2003 manuell 131 a4 150,000km 5 diesel audi NaN 2016-03-05 00:00:00 0 53913 2016-03-07 05:46:46
21 2016-03-06 19:07:10 Porsche_911_Carrera_4S_Cabrio privat Angebot $41,500 test cabrio 2004 manuell 320 911 150,000km 4 benzin porsche nein 2016-03-06 00:00:00 0 65428 2016-04-05 23:46:19
22 2016-03-28 20:50:54 MINI_Cooper_S_Cabrio privat Angebot $25,450 control cabrio 2015 manuell 184 cooper 10,000km 1 benzin mini nein 2016-03-28 00:00:00 0 44789 2016-04-01 06:45:30
23 2016-03-10 19:55:34 Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima privat Angebot $7,999 control bus 2010 manuell 120 NaN 150,000km 2 diesel peugeot nein 2016-03-10 00:00:00 0 30900 2016-03-17 08:45:17
24 2016-04-03 11:57:02 BMW_535i_xDrive_Sport_Aut. privat Angebot $48,500 control limousine 2014 automatik 306 5er 30,000km 12 benzin bmw nein 2016-04-03 00:00:00 0 22547 2016-04-07 13:16:50
25 2016-03-21 21:56:18 Ford_escort_kombi_an_bastler_mit_ghia_ausstattung privat Angebot $90 control kombi 1996 manuell 116 NaN 150,000km 4 benzin ford ja 2016-03-21 00:00:00 0 27574 2016-04-01 05:16:49
26 2016-04-03 22:46:28 Volkswagen_Polo_Fox privat Angebot $777 control kleinwagen 1992 manuell 54 polo 125,000km 2 benzin volkswagen nein 2016-04-03 00:00:00 0 38110 2016-04-05 23:46:48
27 2016-03-27 18:45:01 Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE privat Angebot $0 control NaN 2005 NaN 0 NaN 150,000km 0 NaN ford NaN 2016-03-27 00:00:00 0 66701 2016-03-27 18:45:01
28 2016-03-19 21:56:19 MINI_Cooper_D privat Angebot $5,250 control kleinwagen 2007 manuell 110 cooper 150,000km 7 diesel mini ja 2016-03-19 00:00:00 0 15745 2016-04-07 14:58:48
29 2016-04-02 12:45:44 Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... privat Angebot $4,999 test kombi 2004 automatik 204 e_klasse 150,000km 10 diesel mercedes_benz nein 2016-04-02 00:00:00 0 47638 2016-04-02 12:45:44
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49970 2016-03-21 22:47:37 c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... privat Angebot $15,800 control bus 2010 automatik 136 c4 60,000km 4 diesel citroen nein 2016-03-21 00:00:00 0 14947 2016-04-07 04:17:34
49971 2016-03-29 14:54:12 W.Lupo_1.0 privat Angebot $950 test kleinwagen 2001 manuell 50 lupo 150,000km 4 benzin volkswagen nein 2016-03-29 00:00:00 0 65197 2016-03-29 20:41:51
49972 2016-03-26 22:25:23 Mercedes_Benz_Vito_115_CDI_Extralang_Aut. privat Angebot $3,300 control bus 2004 automatik 150 vito 150,000km 10 diesel mercedes_benz ja 2016-03-26 00:00:00 0 65326 2016-03-28 11:28:18
49973 2016-03-27 05:32:39 Mercedes_Benz_SLK_200_Kompressor privat Angebot $6,000 control cabrio 2004 manuell 163 slk 150,000km 11 benzin mercedes_benz nein 2016-03-27 00:00:00 0 53567 2016-03-27 08:25:24
49974 2016-03-20 10:52:31 Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... privat Angebot $0 control cabrio 1983 manuell 70 golf 150,000km 2 benzin volkswagen nein 2016-03-20 00:00:00 0 8209 2016-03-27 19:48:16
49975 2016-03-27 20:51:39 Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort privat Angebot $9,700 control kleinwagen 2012 automatik 88 jazz 100,000km 11 hybrid honda nein 2016-03-27 00:00:00 0 84385 2016-04-05 19:45:34
49976 2016-03-19 18:56:05 Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... privat Angebot $5,900 test kombi 1992 automatik 150 80 150,000km 12 benzin audi nein 2016-03-19 00:00:00 0 36100 2016-04-07 06:16:44
49977 2016-03-31 18:37:18 Mercedes_Benz_C200_Cdi_W203 privat Angebot $5,500 control limousine 2003 manuell 116 c_klasse 150,000km 2 diesel mercedes_benz nein 2016-03-31 00:00:00 0 33739 2016-04-06 12:16:11
49978 2016-04-04 10:37:14 Mercedes_Benz_E_200_Classic privat Angebot $900 control limousine 1996 automatik 136 e_klasse 150,000km 9 benzin mercedes_benz ja 2016-04-04 00:00:00 0 24405 2016-04-06 12:44:20
49979 2016-03-20 18:38:40 Volkswagen_Polo_1.6_TDI_Style privat Angebot $11,000 test kleinwagen 2011 manuell 90 polo 70,000km 11 diesel volkswagen nein 2016-03-20 00:00:00 0 48455 2016-04-07 01:45:12
49980 2016-03-12 10:55:54 Ford_Escort_Turnier_16V privat Angebot $400 control kombi 1995 manuell 105 escort 125,000km 3 benzin ford NaN 2016-03-12 00:00:00 0 56218 2016-04-06 17:16:49
49981 2016-03-15 09:38:21 Opel_Astra_Kombi_mit_Anhaengerkupplung privat Angebot $2,000 control kombi 1998 manuell 115 astra 150,000km 12 benzin opel nein 2016-03-15 00:00:00 0 86859 2016-04-05 17:21:46
49982 2016-03-29 18:51:08 Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm privat Angebot $1,950 control kleinwagen 2004 manuell 0 fabia 90,000km 7 benzin skoda NaN 2016-03-29 00:00:00 0 45884 2016-03-29 18:51:08
49983 2016-03-06 12:43:04 Ford_focus_99 privat Angebot $600 test kleinwagen 1999 manuell 101 focus 150,000km 4 benzin ford NaN 2016-03-06 00:00:00 0 52477 2016-03-09 06:16:08
49984 2016-03-31 22:48:48 Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... privat Angebot $0 test NaN 2000 NaN 0 NaN 150,000km 0 NaN sonstige_autos NaN 2016-03-31 00:00:00 0 12103 2016-04-02 19:44:53
49985 2016-04-02 16:38:23 Verkaufe_meinen_vw_vento! privat Angebot $1,000 control NaN 1995 automatik 0 NaN 150,000km 0 benzin volkswagen NaN 2016-04-02 00:00:00 0 30900 2016-04-06 15:17:52
49986 2016-04-04 20:46:02 Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... privat Angebot $15,900 control limousine 2010 automatik 218 300c 125,000km 11 diesel chrysler nein 2016-04-04 00:00:00 0 73527 2016-04-06 23:16:00
49987 2016-03-22 20:47:27 Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... privat Angebot $21,990 control limousine 2013 manuell 150 a3 50,000km 11 diesel audi nein 2016-03-22 00:00:00 0 94362 2016-03-26 22:46:06
49988 2016-03-28 19:49:51 BMW_330_Ci privat Angebot $9,550 control coupe 2001 manuell 231 3er 150,000km 10 benzin bmw nein 2016-03-28 00:00:00 0 83646 2016-04-07 02:17:40
49989 2016-03-11 19:50:37 VW_Polo_zum_Ausschlachten_oder_Wiederaufbau privat Angebot $150 test kleinwagen 1997 manuell 0 polo 150,000km 5 benzin volkswagen ja 2016-03-11 00:00:00 0 21244 2016-03-12 10:17:55
49990 2016-03-21 19:54:19 Mercedes_Benz_A_200__BlueEFFICIENCY__Urban privat Angebot $17,500 test limousine 2012 manuell 156 a_klasse 30,000km 12 benzin mercedes_benz nein 2016-03-21 00:00:00 0 58239 2016-04-06 22:46:57
49991 2016-03-06 15:25:19 Kleinwagen privat Angebot $500 control NaN 2016 manuell 0 twingo 150,000km 0 benzin renault NaN 2016-03-06 00:00:00 0 61350 2016-03-06 18:24:19
49992 2016-03-10 19:37:38 Fiat_Grande_Punto_1.4_T_Jet_16V_Sport privat Angebot $4,800 control kleinwagen 2009 manuell 120 andere 125,000km 9 lpg fiat nein 2016-03-10 00:00:00 0 68642 2016-03-13 01:44:51
49993 2016-03-15 18:47:35 Audi_A3__1_8l__Silber;_schoenes_Fahrzeug privat Angebot $1,650 control kleinwagen 1997 manuell 0 NaN 150,000km 7 benzin audi NaN 2016-03-15 00:00:00 0 65203 2016-04-06 19:46:53
49994 2016-03-22 17:36:42 Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... privat Angebot $5,000 control kombi 2001 automatik 299 a6 150,000km 1 benzin audi nein 2016-03-22 00:00:00 0 46537 2016-04-06 08:16:39
49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon privat Angebot $24,900 control limousine 2011 automatik 239 q5 100,000km 1 diesel audi nein 2016-03-27 00:00:00 0 82131 2016-04-01 13:47:40
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot $1,980 control cabrio 1996 manuell 75 astra 150,000km 5 benzin opel nein 2016-03-28 00:00:00 0 44807 2016-04-02 14:18:02
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot $13,200 test cabrio 2014 automatik 69 500 5,000km 11 benzin fiat nein 2016-04-02 00:00:00 0 73430 2016-04-04 11:47:27
49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition privat Angebot $22,900 control kombi 2013 manuell 150 a3 40,000km 11 diesel audi nein 2016-03-08 00:00:00 0 35683 2016-04-05 16:45:07
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V privat Angebot $1,250 control limousine 1996 manuell 101 vectra 150,000km 1 benzin opel nein 2016-03-13 00:00:00 0 45897 2016-04-06 21:18:48

50000 rows × 20 columns

In [189]:
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
Out[189]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
  • The data set consist of 50,000 records under 20 data fields. Data in 15 out of 20 fields are stored as strings while 5 fields are stored as integer.
  • Some columns have null values, but none have more than ~20% null values.
  • The column names use camelcase instead of Python's preferred snakecase

Cleaning Column Names

In [190]:
autos.columns
Out[190]:
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')
In [191]:
autos.rename(columns={"yearOfRegistration" : "registration_year"}, inplace=True)
autos.rename(columns={"monthOfRegistration" : "registration_month"}, inplace=True)
autos.rename(columns={"notRepairedDamage" : "notRepairedDamage"}, inplace=True)
autos.rename(columns={"dateCreated" : "ad_created"}, inplace=True)

i = 0
col_old = ['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'gearbox', 'powerPS', 'model', 'odometer', 'fuelType', 'brand', 'nrOfPictures', 'postalCode',
       'lastSeen']
col_new = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'gearbox', 'power_ps', 'model',
       'odometer', 'fuel_type', 'brand','nr_of_pictures', 'postal_code',
       'last_seen']
while i < 16:
    autos.rename(columns={col_old[i] : col_new[i]}, inplace=True)
    i +=1
In [192]:
autos.head(2)
Out[192]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand notRepairedDamage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08

Initially all column names of autos data set were in camelcase and in here we converted those column names to Python's preferred snakecase. In addition to that four column names were edited to give unifomrmity to all column names.

Initial Exploration and Cleaning

In [193]:
autos.describe(include = 'all')
Out[193]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand notRepairedDamage ad_created nr_of_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-12 16:06:22 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN
In [194]:
autos["seller"].value_counts()
Out[194]:
privat        49999
gewerblich        1
Name: seller, dtype: int64
In [195]:
autos["offer_type"].value_counts()
Out[195]:
Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64
In [196]:
autos["nr_of_pictures"].value_counts()
Out[196]:
0    50000
Name: nr_of_pictures, dtype: int64

In initial exploration, observations are as follow:

  • seller , offer_type and nr_of_pictures columns have mostly one value and these columns can be dropped.
  • The price and odometer columns have numeric values stored as text.
In [197]:
autos["price"] = (autos["price"]
                  .str.replace("$" , "")
                  .str.replace("," , "")
                  .astype(int)
                 )
autos["odometer"] = (autos["odometer"]
                     .str.replace("km" , "")
                     .str.replace("," , "")
                     .astype(int)
                    )
In [198]:
autos.rename({"odometer" : "odometer_km"} , axis=1, inplace=True)

Exploring the Odometer and Price Columns

In [199]:
autos["price"].unique().shape
Out[199]:
(2357,)
In [200]:
autos["price"].value_counts().sort_index(ascending = True).head(40)
Out[200]:
0      1421
1       156
2         3
3         1
5         2
8         1
9         1
10        7
11        2
12        3
13        2
14        1
15        2
17        3
18        1
20        4
25        5
29        1
30        7
35        1
40        6
45        4
47        1
49        4
50       49
55        2
59        1
60        9
65        5
66        1
70       10
75        5
79        1
80       15
89        1
90        5
99       19
100     134
110       3
111       2
Name: price, dtype: int64
In [201]:
autos["price"].value_counts().sort_index(ascending = False).head(50)
Out[201]:
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
194000      1
190000      1
180000      1
175000      1
169999      1
169000      1
163991      1
163500      1
155000      1
151990      1
145000      1
139997      1
137999      1
135000      1
130000      1
129000      1
128000      1
120000      2
119900      1
119500      1
116000      1
115991      1
115000      1
114400      1
109999      1
105000      2
104900      1
99900       2
99000       2
98500       1
Name: price, dtype: int64
In [202]:
autos["price"].describe()
Out[202]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
  • it can be observed that there are 2357 unique values in price column and these values varies between 0 and 100 millions.
  • There are 1781 car records with price below 100 USD and it is almost 4% of the total data set.
  • Number of records with price below 100,000 USD are 53 and it is almost 0.1% of the total data set.
  • Practically it is very rare case that the price of a used vehicle is higher than 100,000 USD or lower than 100 USD. Therefore in this analysis I will consider prices below 100 USD and above 100,000 USD are as outliers.
In [203]:
autos["odometer_km"].unique()
Out[203]:
array([150000,  70000,  50000,  80000,  10000,  30000, 125000,  90000,
        20000,  60000,   5000, 100000,  40000])
In [204]:
autos["odometer_km"].value_counts()
Out[204]:
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64
In [205]:
autos["odometer_km"].describe()
Out[205]:
count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
  • When exploring odometer_km column, it can be seen that the values are rounded off and this idicates that the seller had to chose the value for this field from pre-set options.
  • It can't be seen outliers in odometer_km column and can retain all values.
In [206]:
autos = autos[autos["price"].between(100,100000)]
autos["price"].describe()
Out[206]:
count    48185.000000
mean      5796.099741
std       7525.532405
min        100.000000
25%       1250.000000
50%       3000.000000
75%       7499.000000
max      99900.000000
Name: price, dtype: float64

Exploring the date columns

In the data set, below 5 columns represent date values.

Column Name Data Stored As
date_crawled Strings
last_seen Strings
ad_created Strings
registration_month Intiger
registration_year Intiger

Columns with data stored as strings will be explored to learn more about the data.

In [207]:
autos[['date_crawled','ad_created','last_seen']][0:5]
Out[207]:
date_crawled ad_created last_seen
0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54
1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08
2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37
3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28
4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50
In [208]:
(autos["date_crawled"]
        .str[:10]
        .value_counts(normalize = True, dropna = False)
        .sort_index()
)
Out[208]:
2016-03-05    0.025340
2016-03-06    0.014050
2016-03-07    0.036090
2016-03-08    0.033164
2016-03-09    0.033019
2016-03-10    0.032313
2016-03-11    0.032624
2016-03-12    0.036920
2016-03-13    0.015690
2016-03-14    0.036692
2016-03-15    0.034305
2016-03-16    0.029470
2016-03-17    0.031504
2016-03-18    0.012867
2016-03-19    0.034762
2016-03-20    0.037813
2016-03-21    0.037190
2016-03-22    0.032811
2016-03-23    0.032292
2016-03-24    0.029449
2016-03-25    0.031504
2016-03-26    0.032292
2016-03-27    0.031109
2016-03-28    0.034949
2016-03-29    0.034139
2016-03-30    0.033703
2016-03-31    0.031856
2016-04-01    0.033662
2016-04-02    0.035633
2016-04-03    0.038601
2016-04-04    0.036567
2016-04-05    0.013054
2016-04-06    0.003175
2016-04-07    0.001390
Name: date_crawled, dtype: float64

It can be observed that site was crawled daily over approximately one month period between March 2016 and April 2016. it shows uniform distribution of listing on each days except few days.

In [209]:
(autos["ad_created"]
        .str[:10]
        .value_counts(normalize = True, dropna = False)
        .sort_index()
)
Out[209]:
2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000042
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-22    0.000021
2016-01-27    0.000062
2016-01-29    0.000021
2016-02-01    0.000021
2016-02-02    0.000042
2016-02-05    0.000042
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000042
2016-02-14    0.000042
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000042
2016-02-19    0.000062
2016-02-20    0.000042
2016-02-21    0.000062
                ...   
2016-03-09    0.033122
2016-03-10    0.032022
2016-03-11    0.032936
2016-03-12    0.036754
2016-03-13    0.017059
2016-03-14    0.035322
2016-03-15    0.034035
2016-03-16    0.029968
2016-03-17    0.031151
2016-03-18    0.013573
2016-03-19    0.033641
2016-03-20    0.037875
2016-03-21    0.037418
2016-03-22    0.032624
2016-03-23    0.032126
2016-03-24    0.029387
2016-03-25    0.031628
2016-03-26    0.032354
2016-03-27    0.031026
2016-03-28    0.035052
2016-03-29    0.034077
2016-03-30    0.033537
2016-03-31    0.031898
2016-04-01    0.033641
2016-04-02    0.035322
2016-04-03    0.038850
2016-04-04    0.036920
2016-04-05    0.011788
2016-04-06    0.003258
2016-04-07    0.001245
Name: ad_created, Length: 76, dtype: float64

Ad created dates spread out roughly over 10 months period. But majority of the ads have been created in the one month period from March 2016 to April 2016

In [210]:
(autos["last_seen"]
        .str[:10]
        .value_counts(normalize = True, dropna = False)
        .sort_index()
)
Out[210]:
2016-03-05    0.001079
2016-03-06    0.004317
2016-03-07    0.005437
2016-03-08    0.007326
2016-03-09    0.009567
2016-03-10    0.010646
2016-03-11    0.012411
2016-03-12    0.023804
2016-03-13    0.008882
2016-03-14    0.012639
2016-03-15    0.015876
2016-03-16    0.016437
2016-03-17    0.028121
2016-03-18    0.007305
2016-03-19    0.015773
2016-03-20    0.020650
2016-03-21    0.020546
2016-03-22    0.021376
2016-03-23    0.018574
2016-03-24    0.019736
2016-03-25    0.019114
2016-03-26    0.016623
2016-03-27    0.015544
2016-03-28    0.020836
2016-03-29    0.022310
2016-03-30    0.024717
2016-03-31    0.023846
2016-04-01    0.022870
2016-04-02    0.024883
2016-04-03    0.025132
2016-04-04    0.024551
2016-04-05    0.124935
2016-04-06    0.221999
2016-04-07    0.132137
Name: last_seen, dtype: float64

The last seen date allows us to determine on what day a listing was removed, presumably because the car was sold.

The last three days show a unusual amount of 'last seen' values and those are more than 6 times of normal daily value.

In [211]:
autos["registration_year"].describe()
Out[211]:
count    48185.000000
mean      2004.730456
std         87.932039
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

Min and Max values are not practical values as Min value is even before cars were invented and the Max values is many years into the future. This clearly shows that there are some odd values in registration_year column.

Dealing with Incorrect Registration Year Data

Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.

Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.

In [212]:
(autos.shape[0]- autos["registration_year"].between(1900,2016).sum())/autos.shape[0]
Out[212]:
0.038850264605167585

If we remove the records of cars that fall outside 1900 - 2016, we will lose only about 4% of the data from total data set and it won't make big impact on our analysis. Therefore we can remove those records from our data set.

In [213]:
autos = autos[autos["registration_year"].between(1900,2016)]
autos["registration_year"].value_counts(normalize=True).sort_index()
Out[213]:
1910    0.000043
1927    0.000022
1929    0.000022
1931    0.000022
1934    0.000043
1937    0.000086
1938    0.000022
1939    0.000022
1941    0.000043
1943    0.000022
1948    0.000022
1950    0.000022
1951    0.000022
1952    0.000022
1953    0.000022
1954    0.000043
1955    0.000043
1956    0.000086
1957    0.000043
1958    0.000086
1959    0.000130
1960    0.000475
1961    0.000130
1962    0.000086
1963    0.000173
1964    0.000259
1965    0.000367
1966    0.000475
1967    0.000561
1968    0.000540
          ...   
1987    0.001555
1988    0.002850
1989    0.003692
1990    0.007169
1991    0.007298
1992    0.007903
1993    0.009069
1994    0.013517
1995    0.025738
1996    0.029257
1997    0.041543
1998    0.050461
1999    0.062164
2000    0.067001
2001    0.056744
2002    0.053484
2003    0.058148
2004    0.058277
2005    0.062855
2006    0.057608
2007    0.049058
2008    0.047676
2009    0.044912
2010    0.034180
2011    0.034915
2012    0.028178
2013    0.017209
2014    0.014251
2015    0.008119
2016    0.025824
Name: registration_year, Length: 78, dtype: float64

It can be observed that majority of the cars in the data set are registed after the 1999.

Exploring Price by Brand

In [214]:
autos["brand"].value_counts(normalize = True)
Out[214]:
volkswagen        0.211582
bmw               0.110207
opel              0.107335
mercedes_benz     0.096668
audi              0.086822
ford              0.069872
renault           0.047114
peugeot           0.029884
fiat              0.025630
seat              0.018267
skoda             0.016432
nissan            0.015352
mazda             0.015244
smart             0.014208
citroen           0.014057
toyota            0.012804
hyundai           0.010019
sonstige_autos    0.009436
volvo             0.009134
mini              0.008810
mitsubishi        0.008183
honda             0.007881
kia               0.007082
alfa_romeo        0.006672
suzuki            0.005938
chevrolet         0.005679
porsche           0.005463
chrysler          0.003520
dacia             0.002656
daihatsu          0.002505
jeep              0.002289
land_rover        0.002116
subaru            0.002116
saab              0.001663
jaguar            0.001533
daewoo            0.001490
trabant           0.001360
rover             0.001339
lancia            0.001058
lada              0.000583
Name: brand, dtype: float64

German manufacturers represent four out of the top five brands, almost 50% of the overall listings. Volkswagen is by far the most popular brand, with approximately double the cars for sale of the next two brands combined.

There are lots of brands that don't have a significant percentage of listings, so we will limit our analysis to brands representing more than 5% of total listings.

In [215]:
brand_percentage = autos["brand"].value_counts(normalize = True)
top_brands = brand_percentage[brand_percentage > 0.05].index
print(top_brands)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
In [216]:
top_brands_mean_price = {}
for tb in top_brands:
    bp = autos[autos["brand"] == tb]["price"]
    mp = bp.mean()
    top_brands_mean_price[tb] = int(mp)
top_brands_mean_price
Out[216]:
{'audi': 9339,
 'bmw': 8249,
 'ford': 3740,
 'mercedes_benz': 8573,
 'opel': 3005,
 'volkswagen': 5436}

It can be observed that a distinct price gap in the top 06 brads. Based on the mean price these top brands can be categorize into 03 classes as below:

Price Class Brands
Heigh Audi, Mercedes Benz, BMW
Middle volkswagen
Low Ford, Opel

Reason for the high popularity of volkswagen brand may be middle range price.

Storing Aggregate Data in a DataFrame

In [228]:
top_brand_mean_mileage = {}
for tb in top_brands:
    bo = autos[autos["brand"] == tb]["odometer_km"]
    mm = bo.mean()
    top_brand_mean_mileage[tb] = int(mm)
top_brand_mean_mileage
Out[228]:
{'audi': 129276,
 'bmw': 132756,
 'ford': 124300,
 'mercedes_benz': 131088,
 'opel': 129384,
 'volkswagen': 128799}
In [240]:
tbmp_series = pd.Series(top_brands_mean_price).sort_values(ascending = True)
tbmm_series = pd.Series(top_brand_mean_mileage).sort_values(ascending = True)
In [241]:
top_brand_info = pd.DataFrame(tbmp_series, columns=['mean_price'])
top_brand_info
Out[241]:
mean_price
opel 3005
ford 3740
volkswagen 5436
bmw 8249
mercedes_benz 8573
audi 9339
In [242]:
top_brand_mileage_info = pd.DataFrame(tbmm_series, columns=['mean_mileage'])
top_brand_mileage_info 
Out[242]:
mean_mileage
ford 124300
volkswagen 128799
audi 129276
opel 129384
mercedes_benz 131088
bmw 132756
In [243]:
top_brand_info["mean_mileage"] = tbmm_series
top_brand_info
Out[243]:
mean_price mean_mileage
opel 3005 129384
ford 3740 124300
volkswagen 5436 128799
bmw 8249 132756
mercedes_benz 8573 131088
audi 9339 129276
In [245]:
percentage_gap_mean_mileage = ((top_brand_info["mean_mileage"].max()
                              - top_brand_info["mean_mileage"].min())
                              / top_brand_info["mean_mileage"].max()
                              )
percentage_gap_mean_mileage 
Out[245]:
0.0636958028262376

There is no significant gap in mean mileages of top 6 brands. Mileage variation of top brands are falling within roughly 7%.