Exploring Ebday Car Sales Data

We will be working with the dataset of used cars from eBay Kleinanzeigen - a classifieds section of the German eBay website.

The aim of this project is to clean the data and analyze the included used car listings.

The data dictionary provided with data is as follows:

dateCrawled - When this ad was first crawled. All field-values are taken from this date.

name - Name of the car.

seller - Whether the seller is private or a dealer.

offerType - The type of listing

price - The price on the ad to sell the car.

abtest - Whether the listing is included in an A/B test.

vehicleType - The vehicle Type.

yearOfRegistration - The year in which the car was first registered.

gearbox - The transmission type.

powerPS - The power of the car in PS.

model - The car model name.

kilometer - How many kilometers the car has driven.

monthOfRegistration - The month in which the car was first registered.

fuelType - What type of fuel the car uses.

brand - The brand of the car.

notRepairedDamage - If the car has a damage which is not yet repaired.

dateCreated - The date on which the eBay listing was created.

nrOfPictures - The number of pictures in the ad.

postalCode - The postal code for the location of the vehicle.

lastSeenOnline - When the crawler saw this ad last online.

If we are looking at cars, what place better to look at than the home of the modern car? Hence, let us start off our project by importing the libraries and read the dataset which we will work on using pandas.

On that note, let's go!

Or as a typical german would say, 'Los geht's!'

In [1]:
import pandas as pd
import numpy as nm

autos = pd.read_csv('autos.csv',encoding = "Latin-1")
In [2]:
autos #Jupyter notebook has a neat feature to render the first and last few fvalues of any pandas object!
Out[2]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
5 2016-03-21 13:47:45 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... privat Angebot $7,900 test bus 2006 automatik 150 voyager 150,000km 4 diesel chrysler NaN 2016-03-21 00:00:00 0 22962 2016-04-06 09:45:21
6 2016-03-20 17:55:21 VW_Golf_III_GT_Special_Electronic_Green_Metall... privat Angebot $300 test limousine 1995 manuell 90 golf 150,000km 8 benzin volkswagen NaN 2016-03-20 00:00:00 0 31535 2016-03-23 02:48:59
7 2016-03-16 18:55:19 Golf_IV_1.9_TDI_90PS privat Angebot $1,990 control limousine 1998 manuell 90 golf 150,000km 12 diesel volkswagen nein 2016-03-16 00:00:00 0 53474 2016-04-07 03:17:32
8 2016-03-22 16:51:34 Seat_Arosa privat Angebot $250 test NaN 2000 manuell 0 arosa 150,000km 10 NaN seat nein 2016-03-22 00:00:00 0 7426 2016-03-26 18:18:10
9 2016-03-16 13:47:02 Renault_Megane_Scenic_1.6e_RT_Klimaanlage privat Angebot $590 control bus 1997 manuell 90 megane 150,000km 7 benzin renault nein 2016-03-16 00:00:00 0 15749 2016-04-06 10:46:35
10 2016-03-15 01:41:36 VW_Golf_Tuning_in_siber/grau privat Angebot $999 test NaN 2017 manuell 90 NaN 150,000km 4 benzin volkswagen nein 2016-03-14 00:00:00 0 86157 2016-04-07 03:16:21
11 2016-03-16 18:45:34 Mercedes_A140_Motorschaden privat Angebot $350 control NaN 2000 NaN 0 NaN 150,000km 0 benzin mercedes_benz NaN 2016-03-16 00:00:00 0 17498 2016-03-16 18:45:34
12 2016-03-31 19:48:22 Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... privat Angebot $5,299 control kleinwagen 2010 automatik 71 fortwo 50,000km 9 benzin smart nein 2016-03-31 00:00:00 0 34590 2016-04-06 14:17:52
13 2016-03-23 10:48:32 Audi_A3_1.6_tuning privat Angebot $1,350 control limousine 1999 manuell 101 a3 150,000km 11 benzin audi nein 2016-03-23 00:00:00 0 12043 2016-04-01 14:17:13
14 2016-03-23 11:50:46 Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... privat Angebot $3,999 test kleinwagen 2007 manuell 75 clio 150,000km 9 benzin renault NaN 2016-03-23 00:00:00 0 81737 2016-04-01 15:46:47
15 2016-04-01 12:06:20 Corvette_C3_Coupe_T_Top_Crossfire_Injection privat Angebot $18,900 test coupe 1982 automatik 203 NaN 80,000km 6 benzin sonstige_autos nein 2016-04-01 00:00:00 0 61276 2016-04-02 21:10:48
16 2016-03-16 14:59:02 Opel_Vectra_B_Kombi privat Angebot $350 test kombi 1999 manuell 101 vectra 150,000km 5 benzin opel nein 2016-03-16 00:00:00 0 57299 2016-03-18 05:29:37
17 2016-03-29 11:46:22 Volkswagen_Scirocco_2_G60 privat Angebot $5,500 test coupe 1990 manuell 205 scirocco 150,000km 6 benzin volkswagen nein 2016-03-29 00:00:00 0 74821 2016-04-05 20:46:26
18 2016-03-26 19:57:44 Verkaufen_mein_bmw_e36_320_i_touring privat Angebot $300 control bus 1995 manuell 150 3er 150,000km 0 benzin bmw NaN 2016-03-26 00:00:00 0 54329 2016-04-02 12:16:41
19 2016-03-17 13:36:21 mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 privat Angebot $4,150 control suv 2004 manuell 124 andere 150,000km 2 lpg mazda nein 2016-03-17 00:00:00 0 40878 2016-03-17 14:45:58
20 2016-03-05 19:57:31 Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... privat Angebot $3,500 test kombi 2003 manuell 131 a4 150,000km 5 diesel audi NaN 2016-03-05 00:00:00 0 53913 2016-03-07 05:46:46
21 2016-03-06 19:07:10 Porsche_911_Carrera_4S_Cabrio privat Angebot $41,500 test cabrio 2004 manuell 320 911 150,000km 4 benzin porsche nein 2016-03-06 00:00:00 0 65428 2016-04-05 23:46:19
22 2016-03-28 20:50:54 MINI_Cooper_S_Cabrio privat Angebot $25,450 control cabrio 2015 manuell 184 cooper 10,000km 1 benzin mini nein 2016-03-28 00:00:00 0 44789 2016-04-01 06:45:30
23 2016-03-10 19:55:34 Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima privat Angebot $7,999 control bus 2010 manuell 120 NaN 150,000km 2 diesel peugeot nein 2016-03-10 00:00:00 0 30900 2016-03-17 08:45:17
24 2016-04-03 11:57:02 BMW_535i_xDrive_Sport_Aut. privat Angebot $48,500 control limousine 2014 automatik 306 5er 30,000km 12 benzin bmw nein 2016-04-03 00:00:00 0 22547 2016-04-07 13:16:50
25 2016-03-21 21:56:18 Ford_escort_kombi_an_bastler_mit_ghia_ausstattung privat Angebot $90 control kombi 1996 manuell 116 NaN 150,000km 4 benzin ford ja 2016-03-21 00:00:00 0 27574 2016-04-01 05:16:49
26 2016-04-03 22:46:28 Volkswagen_Polo_Fox privat Angebot $777 control kleinwagen 1992 manuell 54 polo 125,000km 2 benzin volkswagen nein 2016-04-03 00:00:00 0 38110 2016-04-05 23:46:48
27 2016-03-27 18:45:01 Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE privat Angebot $0 control NaN 2005 NaN 0 NaN 150,000km 0 NaN ford NaN 2016-03-27 00:00:00 0 66701 2016-03-27 18:45:01
28 2016-03-19 21:56:19 MINI_Cooper_D privat Angebot $5,250 control kleinwagen 2007 manuell 110 cooper 150,000km 7 diesel mini ja 2016-03-19 00:00:00 0 15745 2016-04-07 14:58:48
29 2016-04-02 12:45:44 Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... privat Angebot $4,999 test kombi 2004 automatik 204 e_klasse 150,000km 10 diesel mercedes_benz nein 2016-04-02 00:00:00 0 47638 2016-04-02 12:45:44
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49970 2016-03-21 22:47:37 c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... privat Angebot $15,800 control bus 2010 automatik 136 c4 60,000km 4 diesel citroen nein 2016-03-21 00:00:00 0 14947 2016-04-07 04:17:34
49971 2016-03-29 14:54:12 W.Lupo_1.0 privat Angebot $950 test kleinwagen 2001 manuell 50 lupo 150,000km 4 benzin volkswagen nein 2016-03-29 00:00:00 0 65197 2016-03-29 20:41:51
49972 2016-03-26 22:25:23 Mercedes_Benz_Vito_115_CDI_Extralang_Aut. privat Angebot $3,300 control bus 2004 automatik 150 vito 150,000km 10 diesel mercedes_benz ja 2016-03-26 00:00:00 0 65326 2016-03-28 11:28:18
49973 2016-03-27 05:32:39 Mercedes_Benz_SLK_200_Kompressor privat Angebot $6,000 control cabrio 2004 manuell 163 slk 150,000km 11 benzin mercedes_benz nein 2016-03-27 00:00:00 0 53567 2016-03-27 08:25:24
49974 2016-03-20 10:52:31 Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... privat Angebot $0 control cabrio 1983 manuell 70 golf 150,000km 2 benzin volkswagen nein 2016-03-20 00:00:00 0 8209 2016-03-27 19:48:16
49975 2016-03-27 20:51:39 Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort privat Angebot $9,700 control kleinwagen 2012 automatik 88 jazz 100,000km 11 hybrid honda nein 2016-03-27 00:00:00 0 84385 2016-04-05 19:45:34
49976 2016-03-19 18:56:05 Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... privat Angebot $5,900 test kombi 1992 automatik 150 80 150,000km 12 benzin audi nein 2016-03-19 00:00:00 0 36100 2016-04-07 06:16:44
49977 2016-03-31 18:37:18 Mercedes_Benz_C200_Cdi_W203 privat Angebot $5,500 control limousine 2003 manuell 116 c_klasse 150,000km 2 diesel mercedes_benz nein 2016-03-31 00:00:00 0 33739 2016-04-06 12:16:11
49978 2016-04-04 10:37:14 Mercedes_Benz_E_200_Classic privat Angebot $900 control limousine 1996 automatik 136 e_klasse 150,000km 9 benzin mercedes_benz ja 2016-04-04 00:00:00 0 24405 2016-04-06 12:44:20
49979 2016-03-20 18:38:40 Volkswagen_Polo_1.6_TDI_Style privat Angebot $11,000 test kleinwagen 2011 manuell 90 polo 70,000km 11 diesel volkswagen nein 2016-03-20 00:00:00 0 48455 2016-04-07 01:45:12
49980 2016-03-12 10:55:54 Ford_Escort_Turnier_16V privat Angebot $400 control kombi 1995 manuell 105 escort 125,000km 3 benzin ford NaN 2016-03-12 00:00:00 0 56218 2016-04-06 17:16:49
49981 2016-03-15 09:38:21 Opel_Astra_Kombi_mit_Anhaengerkupplung privat Angebot $2,000 control kombi 1998 manuell 115 astra 150,000km 12 benzin opel nein 2016-03-15 00:00:00 0 86859 2016-04-05 17:21:46
49982 2016-03-29 18:51:08 Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm privat Angebot $1,950 control kleinwagen 2004 manuell 0 fabia 90,000km 7 benzin skoda NaN 2016-03-29 00:00:00 0 45884 2016-03-29 18:51:08
49983 2016-03-06 12:43:04 Ford_focus_99 privat Angebot $600 test kleinwagen 1999 manuell 101 focus 150,000km 4 benzin ford NaN 2016-03-06 00:00:00 0 52477 2016-03-09 06:16:08
49984 2016-03-31 22:48:48 Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... privat Angebot $0 test NaN 2000 NaN 0 NaN 150,000km 0 NaN sonstige_autos NaN 2016-03-31 00:00:00 0 12103 2016-04-02 19:44:53
49985 2016-04-02 16:38:23 Verkaufe_meinen_vw_vento! privat Angebot $1,000 control NaN 1995 automatik 0 NaN 150,000km 0 benzin volkswagen NaN 2016-04-02 00:00:00 0 30900 2016-04-06 15:17:52
49986 2016-04-04 20:46:02 Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... privat Angebot $15,900 control limousine 2010 automatik 218 300c 125,000km 11 diesel chrysler nein 2016-04-04 00:00:00 0 73527 2016-04-06 23:16:00
49987 2016-03-22 20:47:27 Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... privat Angebot $21,990 control limousine 2013 manuell 150 a3 50,000km 11 diesel audi nein 2016-03-22 00:00:00 0 94362 2016-03-26 22:46:06
49988 2016-03-28 19:49:51 BMW_330_Ci privat Angebot $9,550 control coupe 2001 manuell 231 3er 150,000km 10 benzin bmw nein 2016-03-28 00:00:00 0 83646 2016-04-07 02:17:40
49989 2016-03-11 19:50:37 VW_Polo_zum_Ausschlachten_oder_Wiederaufbau privat Angebot $150 test kleinwagen 1997 manuell 0 polo 150,000km 5 benzin volkswagen ja 2016-03-11 00:00:00 0 21244 2016-03-12 10:17:55
49990 2016-03-21 19:54:19 Mercedes_Benz_A_200__BlueEFFICIENCY__Urban privat Angebot $17,500 test limousine 2012 manuell 156 a_klasse 30,000km 12 benzin mercedes_benz nein 2016-03-21 00:00:00 0 58239 2016-04-06 22:46:57
49991 2016-03-06 15:25:19 Kleinwagen privat Angebot $500 control NaN 2016 manuell 0 twingo 150,000km 0 benzin renault NaN 2016-03-06 00:00:00 0 61350 2016-03-06 18:24:19
49992 2016-03-10 19:37:38 Fiat_Grande_Punto_1.4_T_Jet_16V_Sport privat Angebot $4,800 control kleinwagen 2009 manuell 120 andere 125,000km 9 lpg fiat nein 2016-03-10 00:00:00 0 68642 2016-03-13 01:44:51
49993 2016-03-15 18:47:35 Audi_A3__1_8l__Silber;_schoenes_Fahrzeug privat Angebot $1,650 control kleinwagen 1997 manuell 0 NaN 150,000km 7 benzin audi NaN 2016-03-15 00:00:00 0 65203 2016-04-06 19:46:53
49994 2016-03-22 17:36:42 Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... privat Angebot $5,000 control kombi 2001 automatik 299 a6 150,000km 1 benzin audi nein 2016-03-22 00:00:00 0 46537 2016-04-06 08:16:39
49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon privat Angebot $24,900 control limousine 2011 automatik 239 q5 100,000km 1 diesel audi nein 2016-03-27 00:00:00 0 82131 2016-04-01 13:47:40
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot $1,980 control cabrio 1996 manuell 75 astra 150,000km 5 benzin opel nein 2016-03-28 00:00:00 0 44807 2016-04-02 14:18:02
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot $13,200 test cabrio 2014 automatik 69 500 5,000km 11 benzin fiat nein 2016-04-02 00:00:00 0 73430 2016-04-04 11:47:27
49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition privat Angebot $22,900 control kombi 2013 manuell 150 a3 40,000km 11 diesel audi nein 2016-03-08 00:00:00 0 35683 2016-04-05 16:45:07
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V privat Angebot $1,250 control limousine 1996 manuell 101 vectra 150,000km 1 benzin opel nein 2016-03-13 00:00:00 0 45897 2016-04-06 21:18:48

50000 rows × 20 columns

Ach so! Das ist interessant!

If we look at the data, we can see that the columns are in english (Phew!) however you will also come across words like Angebot (Offer), kleinwagen (small car) wtc which is not very easily decipherable for eyes not accustomed to German.

Before we get into this, a clarification: Sonstige that is mentioned in the brand actually means other or miscellaneous in German. Would be good to keep that in mind as it will be used quite extensively through the course of this project

We will need to work on that! Well, atleast there is no mention of Kraftfahrzeughaftpflichtversicherung (automobile liability insurance ;) ) essentially making our lives easier.

Let's print the information out for our dataframe and have a look at their data types to understand a bit more of what we are dealing with.

In [3]:
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
Out[3]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

Data Exploration and Cleaning

The data set has 20 columns, 15 of which are object type and 5 are of int 64 type.

Looking at the data, at first glance we can identify that the columns of vehicleType, gearbox, model, fuelType and notRepairedDamage are the only ones which have null values existing within them.

Moreover, there are some object type data like price and odometer which we would need to clean and convert into int64 or float64 types in order to analyze our data more accurately.

The column names use camelcase (new word is capitalized but not seperated) instead of Python's preferred snakecase (different words joined by an underscore), which means we can't just replace spaces with underscores.

In order to decide whether to keep the row which have null values or overall exclude them would be subject to some more analysis on our part. We would only be able to establish that once we dig a wee bit deeper into the data.

Let us start off by converting camelcase to snakecase for better nomenclature.

In [4]:
autos.columns #to printout all existing columns
Out[4]:
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')
In [5]:
autos.rename({"dateCrawled":"date_crawled","offerType":"offer_type","yearOfRegistration":"registration_year","vehicleType":"vehicle_type","powerPS":"power_ps","monthOfRegistration":"registration_month","fuelType":"fuel_type","notRepairedDamage":"unrepaired_damage","dateCreated":"ad_created","nrOfPictures":"nr_pictures","postalCode":"postal_code","abtest":"ab_test","lastSeen":"last_seen"},axis=1, inplace=True)

In this previous set we remaned all the columns and brought them to snakecase. Let us check how the columns look now

In [6]:
autos.head()
Out[6]:
date_crawled name seller offer_type price ab_test vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for:

  1. Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis.
  2. Examples of numeric data stored as text which can be cleaned and converted.

The following methods are helpful for exploring the data: - DataFrame.describe() (with include='all' to get both categorical and numeric columns) - Series.value_counts() and Series.head() if any columns need a closer look.

In [7]:
autos.describe(include='all')
Out[7]:
date_crawled name seller offer_type price ab_test vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-27 22:55:05 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

A few odd observations:

  1. registration_year : min : 1000 (not possible unless eBay dates back to the Holy Roman Empire and Otto the great was a tech and used-car fanatic) max : 9999 which also an absurd value

  2. power_ps : PS stands for pferdstärke which literally means horsepower and neither 0 nor 177000 are acceptable values. Needs further investigation

  3. nr_pictures - This essentially shows that there are no present values and that this column would be of no use for us as it doesnt paint a picture of any sort.

  4. price and odometer need to be cleaned and then checked for patterns and discrepencies.

  5. seller and offer have 49999 of the same values out of 50000 which essentially means that this column does not have any deviations or variations and can be deemed as contants and essentially useless for us.

In [8]:
autos["price"] = autos["price"].str.replace('$',"")
autos["price"] = autos["price"].str.replace(',',"").astype(int)

autos["odometer"] = autos["odometer"].str.replace('km',"")
autos["odometer"] = autos["odometer"].str.replace(',',"").astype(int)
In [9]:
autos.rename({"odometer":"odometer_km"},axis=1, inplace=True)
autos.describe()
Out[9]:
price registration_year power_ps odometer_km registration_month nr_pictures postal_code
count 5.000000e+04 50000.000000 50000.000000 50000.000000 50000.000000 50000.0 50000.000000
mean 9.840044e+03 2005.073280 116.355920 125732.700000 5.723360 0.0 50813.627300
std 4.811044e+05 105.712813 209.216627 40042.211706 3.711984 0.0 25779.747957
min 0.000000e+00 1000.000000 0.000000 5000.000000 0.000000 0.0 1067.000000
25% 1.100000e+03 1999.000000 70.000000 125000.000000 3.000000 0.0 30451.000000
50% 2.950000e+03 2003.000000 105.000000 150000.000000 6.000000 0.0 49577.000000
75% 7.200000e+03 2008.000000 150.000000 150000.000000 9.000000 0.0 71540.000000
max 1.000000e+08 9999.000000 17700.000000 150000.000000 12.000000 0.0 99998.000000

The max value for the cars does seem a little dicey unless because it is unlikely that there would be a car sold for 100 million dollars. So does the min price value of 0.

Let us dig deeper into these columns and see where it takes us

In [10]:
autos["price"].unique().shape #to see the unique values
Out[10]:
(2357,)
In [11]:
autos["price"].describe() #in-depth value checking
Out[11]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
In [12]:
autos["price"].value_counts().sort_index(ascending = False).head(10) #check for extraordinarily high values
Out[12]:
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
Name: price, dtype: int64
In [13]:
auto2 = autos[autos["price"] > 1000000]

auto2[["name","price"]]
Out[13]:
name price
2897 Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000 11111111
7814 Ferrari_F40 1300000
11137 suche_maserati_3200_gt_Zustand_unwichtig_laufe... 10000000
22947 Bmw_530d_zum_ausschlachten 1234566
24384 Schlachte_Golf_3_gt_tdi 11111111
27371 Fiat_Punto 12345678
39377 Tausche_volvo_v40_gegen_van 12345678
39705 Tausch_gegen_gleichwertiges 99999999
42221 Leasinguebernahme 27322222
47598 Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar... 12345678
47634 Ferrari_FXX 3890000
In [14]:
auto2 = autos[autos["price"] < 250]

auto2[["name","price"]].sort_index(ascending=False).head()
Out[14]:
name price
49989 VW_Polo_zum_Ausschlachten_oder_Wiederaufbau 150
49984 Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... 0
49974 Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... 0
49960 Ford_KA_zu_verschenken_***Reserviert*** 0
49943 Opel_astra 0

Here we would generally have to consider that any price below 250 ia too low and anything above 100000 seems improbable too (considering no one would want to really sell their sports car on eBay)

Let's only keep the values between 250 and 100000 into our dataset

In [15]:
autos = autos[autos["price"].between(250,100000)]
autos.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 47249 entries, 0 to 49999
Data columns (total 20 columns):
date_crawled          47249 non-null object
name                  47249 non-null object
seller                47249 non-null object
offer_type            47249 non-null object
price                 47249 non-null int64
ab_test               47249 non-null object
vehicle_type          43092 non-null object
registration_year     47249 non-null int64
gearbox               45198 non-null object
power_ps              47249 non-null int64
model                 44960 non-null object
odometer_km           47249 non-null int64
registration_month    47249 non-null int64
fuel_type             43611 non-null object
brand                 47249 non-null object
unrepaired_damage     38799 non-null object
ad_created            47249 non-null object
nr_pictures           47249 non-null int64
postal_code           47249 non-null int64
last_seen             47249 non-null object
dtypes: int64(7), object(13)
memory usage: 7.6+ MB

The odometer_km on the other hand looks okay so we will let it be

Working with dates

Let's now move on to the date columns and understand the date range the data covers.

There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary

  • date_crawled: added by the crawler
  • last_seen: added by the crawler
  • ad_created: from the website
  • registration_month: from the website
  • registration_year: from the website

Right now, the date_crawled, last_seen, and ad_created columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively.

Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values, like so:

In [ ]:
 
In [16]:
autos[['date_crawled','ad_created','last_seen']][0:5]
Out[16]:
date_crawled ad_created last_seen
0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54
1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08
2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37
3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28
4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50
In [17]:
print(autos['date_crawled'].str[:10].head(15)) #To select the first 10 characters (the date) in each column
0     2016-03-26
1     2016-04-04
2     2016-03-26
3     2016-03-12
4     2016-04-01
5     2016-03-21
6     2016-03-20
7     2016-03-16
8     2016-03-22
9     2016-03-16
10    2016-03-15
11    2016-03-16
12    2016-03-31
13    2016-03-23
14    2016-03-23
Name: date_crawled, dtype: object
In [18]:
#date_drawled
print(autos['date_crawled'].value_counts(normalize=True, dropna=False).sort_index().head()) #Using head here to avoid printing walls of data

print(autos['date_crawled'].value_counts(normalize=True, dropna=False).sort_index(ascending=False).head())
2016-03-05 14:06:30    0.000021
2016-03-05 14:06:40    0.000021
2016-03-05 14:07:08    0.000021
2016-03-05 14:07:21    0.000021
2016-03-05 14:07:26    0.000021
Name: date_crawled, dtype: float64
2016-04-07 14:36:56    0.000021
2016-04-07 14:36:55    0.000021
2016-04-07 14:36:44    0.000021
2016-04-07 14:30:26    0.000021
2016-04-07 14:30:09    0.000021
Name: date_crawled, dtype: float64

Observation:

The date_crawled section is set between the dates March 05 2016 and April 07 2016.

In [19]:
#ad_created
print(autos['ad_created'].value_counts(normalize=True, dropna=False).sort_index().head()) #Using head here to avoid printing walls of data

print(autos['ad_created'].value_counts(normalize=True, dropna=False).sort_index(ascending=False).head())
2015-06-11 00:00:00    0.000021
2015-08-10 00:00:00    0.000021
2015-09-09 00:00:00    0.000021
2015-11-10 00:00:00    0.000021
2015-12-05 00:00:00    0.000021
Name: ad_created, dtype: float64
2016-04-07 00:00:00    0.001206
2016-04-06 00:00:00    0.003259
2016-04-05 00:00:00    0.011852
2016-04-04 00:00:00    0.036953
2016-04-03 00:00:00    0.039048
Name: ad_created, dtype: float64

Observation:

  1. The ad_created section is set between the dates June 11 2015 and April 03 2016. Which is essentially a bigger bracket than date_crawled
  2. On first glance, April 03,04,05 seem to have had a significant numbers of new ads created in comparison with the other days.
In [20]:
#last_seen. This parameter is when the ad was last seen, it could be presumed that it is the date when the car was sold.
print(autos['last_seen'].value_counts(normalize=True, dropna=False).sort_index().head()) #Using head here to avoid printing walls of data

print(autos['last_seen'].value_counts(normalize=True, dropna=False).sort_index(ascending=False).head())
2016-03-05 14:45:46    0.000021
2016-03-05 14:46:02    0.000021
2016-03-05 14:49:34    0.000021
2016-03-05 15:16:11    0.000021
2016-03-05 15:16:47    0.000021
Name: last_seen, dtype: float64
2016-04-07 14:58:50    0.000063
2016-04-07 14:58:48    0.000063
2016-04-07 14:58:46    0.000021
2016-04-07 14:58:45    0.000021
2016-04-07 14:58:44    0.000063
Name: last_seen, dtype: float64

Obersavation:

A similar date time frame as date_crawled. This could be the presumed date where the car was sold or the ad was taken down altogether.

Let us look at the registration_year series and analyze where we could make some changes, or if any changes are necessary at all.

In [21]:
autos['registration_year'].describe()
Out[21]:
count    47249.000000
mean      2004.844547
std         88.789003
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

Here ofcourse, we need to exclude the year 1000 and 9999. Moreover, any year prior to 1950 seems unlikely and later than 2017 is impossible. So let us clean the data accordingly using the between function like we used in price

In [22]:
autos = autos[autos['registration_year'].between(1950,2017)]
autos['registration_year'].value_counts().sort_values().tail()
Out[22]:
2003    2691
2004    2694
1999    2814
2005    2898
2000    2974
Name: registration_year, dtype: int64

With this we can see that most of the cars being sold on eBay at the moment have been registered between 1999-2005 (around 29%)

Around the time, the most used and sold car in Germany (upto this date too in a way) is the VW Golf. If we have to start with a hypothesis it would be that VW Golf should be among the top listed cars considering the demographics and the years of registration. (could also be VW beetle because it picked up around 1999-2003 but highly unlikely because it did go out of style unlike VW Golf)

Brandwise Distribution of Cars

It woudl defintiely be of value for us to see which brands and which models have been listed the most on eBay.

It could help us identify the brand choices of people and how it would have evolved over years.

Here, we will use aggregation to understand the brand column.

In [34]:
a = autos['brand'].value_counts(normalize=True).sort_values(ascending=False)[0:20] #We only need the top 20 manufacturers for now
autos['brand'].value_counts(normalize=True).sort_values(ascending=False)[0:20]
Out[34]:
volkswagen        0.212938
bmw               0.110619
opel              0.105827
mercedes_benz     0.097720
audi              0.087708
ford              0.067899
renault           0.046699
peugeot           0.029692
fiat              0.025200
seat              0.018590
skoda             0.016386
nissan            0.015274
mazda             0.015253
smart             0.014632
citroen           0.014226
toyota            0.013006
hyundai           0.010097
volvo             0.009220
sonstige_autos    0.009156
mini              0.008878
Name: brand, dtype: float64

As assumed, VW is the most sold brand not only for new cars in Germany but also for used cars on eBay. (Only logical as usual inflow and outflow with market circulation is statistically and directly proportional) It is followed by the other German bigwigs like bmw, opel, mercedies and the VW descendent audi.

Let us see which of the brands have the highest average price and average odometer count based on this existing data that we have gathered.

In [35]:
autos_mean = {}
autos_drive = {}

for c in a.index:
    selected_rows = autos[autos['brand'] == c]
    
    autos_mean[c] = selected_rows['price'].mean()
    autos_drive[c] = selected_rows['odometer_km'].mean()
    
print(autos_mean)
{'mini': 10616.96626506024, 'sonstige_autos': 11026.56308411215, 'seat': 4481.210586881473, 'hyundai': 5447.218220338983, 'toyota': 5160.1578947368425, 'audi': 9283.809024390244, 'mercedes_benz': 8496.2896234676, 'ford': 3882.059861373661, 'skoda': 6430.8616187989555, 'nissan': 4809.389355742297, 'bmw': 8241.758267259718, 'volvo': 4939.691415313225, 'fiat': 2934.799660441426, 'smart': 3550.3055555555557, 'renault': 2557.268437929455, 'volkswagen': 5486.6349206349205, 'peugeot': 3143.4531700288185, 'citroen': 3823.912781954887, 'opel': 3101.2429755407315, 'mazda': 4176.294530154278}


{'mini': 89469.8795180723, 'sonstige_autos': 92406.54205607477, 'seat': 122094.36133486766, 'hyundai': 106822.03389830509, 'toyota': 116496.71052631579, 'audi': 129557.31707317074, 'mercedes_benz': 131417.469352014, 'ford': 124163.51606805294, 'skoda': 111174.93472584856, 'nissan': 118375.35014005602, 'bmw': 132925.9330883775, 'volvo': 138735.4988399072, 'fiat': 117016.12903225806, 'smart': 100562.86549707603, 'renault': 127830.96655978012, 'volkswagen': 128906.46976090015, 'peugeot': 126884.00576368877, 'citroen': 119992.48120300751, 'opel': 129167.17202344855, 'mazda': 124340.81346423563}

Observation There is a gap between the top 20 prices

  1. volksvagen has mid-range prices which makes it the most popular brand

  2. sonstige_autos, mini and audi have the highest prices

  3. opel and renault have the lowest prices

In [37]:
#Extracting Observations based on odometer reading
print(autos_drive)
{'mini': 89469.8795180723, 'sonstige_autos': 92406.54205607477, 'seat': 122094.36133486766, 'hyundai': 106822.03389830509, 'toyota': 116496.71052631579, 'audi': 129557.31707317074, 'mercedes_benz': 131417.469352014, 'ford': 124163.51606805294, 'skoda': 111174.93472584856, 'nissan': 118375.35014005602, 'bmw': 132925.9330883775, 'volvo': 138735.4988399072, 'fiat': 117016.12903225806, 'smart': 100562.86549707603, 'renault': 127830.96655978012, 'volkswagen': 128906.46976090015, 'peugeot': 126884.00576368877, 'citroen': 119992.48120300751, 'opel': 129167.17202344855, 'mazda': 124340.81346423563}

Here is where things get interesting:

  1. The top 2 cars with the highest price average mini and sonstige (Miscellaneous) have the least mileage readings on the odometer.
  2. volvo is the brand with the most mileage on the list with bmw, audi and opel closely behind.
In [38]:
# create a dataframe of the mean price and mean mileage to make it easy to compare

mean_price = pd.Series(autos_mean)
mean_mileage = pd.Series(autos_drive)

# create a dataframe of both series
brand_df = pd.DataFrame({'mean_price': mean_price, 'mean_mileage': mean_mileage})

brand_df
Out[38]:
mean_mileage mean_price
audi 129557.317073 9283.809024
bmw 132925.933088 8241.758267
citroen 119992.481203 3823.912782
fiat 117016.129032 2934.799660
ford 124163.516068 3882.059861
hyundai 106822.033898 5447.218220
mazda 124340.813464 4176.294530
mercedes_benz 131417.469352 8496.289623
mini 89469.879518 10616.966265
nissan 118375.350140 4809.389356
opel 129167.172023 3101.242976
peugeot 126884.005764 3143.453170
renault 127830.966560 2557.268438
seat 122094.361335 4481.210587
skoda 111174.934726 6430.861619
smart 100562.865497 3550.305556
sonstige_autos 92406.542056 11026.563084
toyota 116496.710526 5160.157895
volkswagen 128906.469761 5486.634921
volvo 138735.498840 4939.691415

Mini and sonstige (other) have the least odometer readings and the highest prices. Which could help us levitate towards the possibility of higher prices for used cars which have a lesser odometer reading (Which would imply that they have travalled lesser and have relagively healthier engine)

However, from a very high level, the point which more clearly justifies the price is actually (and more importantly, obviously) the brand.

Opel, Renault and Fiat are the reasonable cars, while BMW, Audi and Mercedes are the costlier ones.

VW on the other hand is placed very well in the middle strata of the market and has the highest sales.

The most commonly sold model

We proposed an assumption towards the beginnning that as golf should be one of the best sellers on eBay too as it is one of the best sellers in Germany overall.

Also, let us take a look into the more interesting brands and what models are most popular for brands like VW, Audi, BMW, Opel and Mercedes Benz

To do this we will look into the column labelled model

Let's start off with the models which are popular overall

In [41]:
autos['model'].value_counts().sort_values(ascending=False).head(6)
Out[41]:
golf      3764
andere    3330
3er       2636
polo      1555
corsa     1522
passat    1367
Name: model, dtype: int64

Prima! Wir haben recht! We were right!

Golf is the most sold model overall!

If you have not heard of andere, there would be no reason to beat yourself up as it just means 'other' in German. So the next identifiable model would be 3er by BMW!

This is followed by Polo by VW, Corsa by Opel and Passat by VW again!

This essentially means that VW leads by a mile with BMW and Opel in close seconds.

Now what about the most preferred models brandwise?

In [45]:
brands = ['volkswagen','bmw','audi','mercedes_benz','opel']



for a in brands:
    models_car = autos[autos['brand']==a]
    b = models_car['model'].value_counts().sort_values(ascending=False).head(1)
    print('{0} brand : {1}'.format(a,b))
    print('\n')
    
volkswagen brand : golf    3764
Name: model, dtype: int64


bmw brand : 3er    2636
Name: model, dtype: int64


audi brand : a4    1241
Name: model, dtype: int64


mercedes_benz brand : c_klasse    1153
Name: model, dtype: int64


opel brand : corsa    1522
Name: model, dtype: int64


There we have it!

The Most popular models by brand!

  1. VW - Golf
  2. BMW - 3er
  3. Opel - Corsa
  4. Audi - A4
  5. Mercedes Benz - C Class

Effect of Damage/Repairs on Price

We have a section which can help us identify if any damage as happened to the car with 'nein' for a no and 'ja' for a yes.

Let us analyze the series a bit more.

In [49]:
autos['unrepaired_damage'].describe()
Out[49]:
count     38503
unique        2
top        nein
freq      34215
Name: unrepaired_damage, dtype: object

There are two main values ja and nein. Now let us tear these two apart, calculate the mean of their prices and compare them to see if damages affect the price of the cars in any way.

In [51]:
ohne_repair_mean = autos.loc[autos['unrepaired_damage']=="nein",'price'].mean()
mit_repair_mean = autos.loc[autos['unrepaired_damage']=="ja",'price'].mean()

print(ohne_repair_mean)
print(mit_repair_mean)
6978.685956451849
2384.205223880597

Quite an evident differnce between the price of cars without repair and the ones with repair.

The ones without repair have a clearly higher price (3 times!)

While we havent looked way to deep into the data, it does validate our thoughts and general knowledge of mint condition items being more expensive than refurbished items!

Conclusion

The inferences from this project are unique as well as sometimes directly aligning to everything with what we ideally consider to be common knowledge.

Unique observations which one might not ideally know:

  1. VW is the most sold brand for new as well as used cars with Gold being in the lead.
  2. This is followed by Opel, another Deutschland classic!
  3. Opel and Renault have the lowest prices and BMW, Audi and Mercedes are amongst the highest.
  4. a significant chunk (~30%) of second hand cars which were tried to be sold in the used car market from 2015-2016 were in reality registered between 1999-2004

Some observations which seem obvious but we did confirm them using this project:

  1. The higher the odometer readings, the lower the prices. (example of mini and sonstige)
  2. If a car has been damaged and then repaired, its price is usually lesser than cars which have had no damage whatsoever.
  3. Overall, German is as fun a language as it is difficult!