Contents index

Dataset of used cars from eBay Kleinanzeigen

The aim of this project is to clean the data and analyze the included used car listings.

The data dictionary provided with data is as follows:

  • dateCrawled - When this ad was first crawled. All field-values are taken from this date.

  • name - Name of the car.

  • seller - Whether the seller is private or a dealer.

  • offerType - The type of listing

  • price - The price on the ad to sell the car.

  • abtest - Whether the listing is included in an A/B test.

  • vehicleType - The vehicle Type.

  • yearOfRegistration - The year in which the car was first registered.

  • gearbox - The transmission type.

  • powerPS - The power of the car in PS.

  • model - The car model name.

  • kilometer - How many kilometers the car has driven.

  • monthOfRegistration - The month in which the car was first registered.

  • fuelType - What type of fuel the car uses.

  • brand - The brand of the car.

  • notRepairedDamage - If the car has a damage which is not yet repaired.

  • dateCreated - The date on which the eBay listing was created.

  • nrOfPictures - The number of pictures in the ad.

  • postalCode - The postal code for the location of the vehicle.

  • lastSeenOnline - When the crawler saw this ad last online.

In [1]:
import numpy as np
import pandas as pd
In [2]:
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
In [3]:
autos
Out[3]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
5 2016-03-21 13:47:45 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... privat Angebot $7,900 test bus 2006 automatik 150 voyager 150,000km 4 diesel chrysler NaN 2016-03-21 00:00:00 0 22962 2016-04-06 09:45:21
6 2016-03-20 17:55:21 VW_Golf_III_GT_Special_Electronic_Green_Metall... privat Angebot $300 test limousine 1995 manuell 90 golf 150,000km 8 benzin volkswagen NaN 2016-03-20 00:00:00 0 31535 2016-03-23 02:48:59
7 2016-03-16 18:55:19 Golf_IV_1.9_TDI_90PS privat Angebot $1,990 control limousine 1998 manuell 90 golf 150,000km 12 diesel volkswagen nein 2016-03-16 00:00:00 0 53474 2016-04-07 03:17:32
8 2016-03-22 16:51:34 Seat_Arosa privat Angebot $250 test NaN 2000 manuell 0 arosa 150,000km 10 NaN seat nein 2016-03-22 00:00:00 0 7426 2016-03-26 18:18:10
9 2016-03-16 13:47:02 Renault_Megane_Scenic_1.6e_RT_Klimaanlage privat Angebot $590 control bus 1997 manuell 90 megane 150,000km 7 benzin renault nein 2016-03-16 00:00:00 0 15749 2016-04-06 10:46:35
10 2016-03-15 01:41:36 VW_Golf_Tuning_in_siber/grau privat Angebot $999 test NaN 2017 manuell 90 NaN 150,000km 4 benzin volkswagen nein 2016-03-14 00:00:00 0 86157 2016-04-07 03:16:21
11 2016-03-16 18:45:34 Mercedes_A140_Motorschaden privat Angebot $350 control NaN 2000 NaN 0 NaN 150,000km 0 benzin mercedes_benz NaN 2016-03-16 00:00:00 0 17498 2016-03-16 18:45:34
12 2016-03-31 19:48:22 Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... privat Angebot $5,299 control kleinwagen 2010 automatik 71 fortwo 50,000km 9 benzin smart nein 2016-03-31 00:00:00 0 34590 2016-04-06 14:17:52
13 2016-03-23 10:48:32 Audi_A3_1.6_tuning privat Angebot $1,350 control limousine 1999 manuell 101 a3 150,000km 11 benzin audi nein 2016-03-23 00:00:00 0 12043 2016-04-01 14:17:13
14 2016-03-23 11:50:46 Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... privat Angebot $3,999 test kleinwagen 2007 manuell 75 clio 150,000km 9 benzin renault NaN 2016-03-23 00:00:00 0 81737 2016-04-01 15:46:47
15 2016-04-01 12:06:20 Corvette_C3_Coupe_T_Top_Crossfire_Injection privat Angebot $18,900 test coupe 1982 automatik 203 NaN 80,000km 6 benzin sonstige_autos nein 2016-04-01 00:00:00 0 61276 2016-04-02 21:10:48
16 2016-03-16 14:59:02 Opel_Vectra_B_Kombi privat Angebot $350 test kombi 1999 manuell 101 vectra 150,000km 5 benzin opel nein 2016-03-16 00:00:00 0 57299 2016-03-18 05:29:37
17 2016-03-29 11:46:22 Volkswagen_Scirocco_2_G60 privat Angebot $5,500 test coupe 1990 manuell 205 scirocco 150,000km 6 benzin volkswagen nein 2016-03-29 00:00:00 0 74821 2016-04-05 20:46:26
18 2016-03-26 19:57:44 Verkaufen_mein_bmw_e36_320_i_touring privat Angebot $300 control bus 1995 manuell 150 3er 150,000km 0 benzin bmw NaN 2016-03-26 00:00:00 0 54329 2016-04-02 12:16:41
19 2016-03-17 13:36:21 mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 privat Angebot $4,150 control suv 2004 manuell 124 andere 150,000km 2 lpg mazda nein 2016-03-17 00:00:00 0 40878 2016-03-17 14:45:58
20 2016-03-05 19:57:31 Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... privat Angebot $3,500 test kombi 2003 manuell 131 a4 150,000km 5 diesel audi NaN 2016-03-05 00:00:00 0 53913 2016-03-07 05:46:46
21 2016-03-06 19:07:10 Porsche_911_Carrera_4S_Cabrio privat Angebot $41,500 test cabrio 2004 manuell 320 911 150,000km 4 benzin porsche nein 2016-03-06 00:00:00 0 65428 2016-04-05 23:46:19
22 2016-03-28 20:50:54 MINI_Cooper_S_Cabrio privat Angebot $25,450 control cabrio 2015 manuell 184 cooper 10,000km 1 benzin mini nein 2016-03-28 00:00:00 0 44789 2016-04-01 06:45:30
23 2016-03-10 19:55:34 Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima privat Angebot $7,999 control bus 2010 manuell 120 NaN 150,000km 2 diesel peugeot nein 2016-03-10 00:00:00 0 30900 2016-03-17 08:45:17
24 2016-04-03 11:57:02 BMW_535i_xDrive_Sport_Aut. privat Angebot $48,500 control limousine 2014 automatik 306 5er 30,000km 12 benzin bmw nein 2016-04-03 00:00:00 0 22547 2016-04-07 13:16:50
25 2016-03-21 21:56:18 Ford_escort_kombi_an_bastler_mit_ghia_ausstattung privat Angebot $90 control kombi 1996 manuell 116 NaN 150,000km 4 benzin ford ja 2016-03-21 00:00:00 0 27574 2016-04-01 05:16:49
26 2016-04-03 22:46:28 Volkswagen_Polo_Fox privat Angebot $777 control kleinwagen 1992 manuell 54 polo 125,000km 2 benzin volkswagen nein 2016-04-03 00:00:00 0 38110 2016-04-05 23:46:48
27 2016-03-27 18:45:01 Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE privat Angebot $0 control NaN 2005 NaN 0 NaN 150,000km 0 NaN ford NaN 2016-03-27 00:00:00 0 66701 2016-03-27 18:45:01
28 2016-03-19 21:56:19 MINI_Cooper_D privat Angebot $5,250 control kleinwagen 2007 manuell 110 cooper 150,000km 7 diesel mini ja 2016-03-19 00:00:00 0 15745 2016-04-07 14:58:48
29 2016-04-02 12:45:44 Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... privat Angebot $4,999 test kombi 2004 automatik 204 e_klasse 150,000km 10 diesel mercedes_benz nein 2016-04-02 00:00:00 0 47638 2016-04-02 12:45:44
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49970 2016-03-21 22:47:37 c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... privat Angebot $15,800 control bus 2010 automatik 136 c4 60,000km 4 diesel citroen nein 2016-03-21 00:00:00 0 14947 2016-04-07 04:17:34
49971 2016-03-29 14:54:12 W.Lupo_1.0 privat Angebot $950 test kleinwagen 2001 manuell 50 lupo 150,000km 4 benzin volkswagen nein 2016-03-29 00:00:00 0 65197 2016-03-29 20:41:51
49972 2016-03-26 22:25:23 Mercedes_Benz_Vito_115_CDI_Extralang_Aut. privat Angebot $3,300 control bus 2004 automatik 150 vito 150,000km 10 diesel mercedes_benz ja 2016-03-26 00:00:00 0 65326 2016-03-28 11:28:18
49973 2016-03-27 05:32:39 Mercedes_Benz_SLK_200_Kompressor privat Angebot $6,000 control cabrio 2004 manuell 163 slk 150,000km 11 benzin mercedes_benz nein 2016-03-27 00:00:00 0 53567 2016-03-27 08:25:24
49974 2016-03-20 10:52:31 Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... privat Angebot $0 control cabrio 1983 manuell 70 golf 150,000km 2 benzin volkswagen nein 2016-03-20 00:00:00 0 8209 2016-03-27 19:48:16
49975 2016-03-27 20:51:39 Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort privat Angebot $9,700 control kleinwagen 2012 automatik 88 jazz 100,000km 11 hybrid honda nein 2016-03-27 00:00:00 0 84385 2016-04-05 19:45:34
49976 2016-03-19 18:56:05 Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... privat Angebot $5,900 test kombi 1992 automatik 150 80 150,000km 12 benzin audi nein 2016-03-19 00:00:00 0 36100 2016-04-07 06:16:44
49977 2016-03-31 18:37:18 Mercedes_Benz_C200_Cdi_W203 privat Angebot $5,500 control limousine 2003 manuell 116 c_klasse 150,000km 2 diesel mercedes_benz nein 2016-03-31 00:00:00 0 33739 2016-04-06 12:16:11
49978 2016-04-04 10:37:14 Mercedes_Benz_E_200_Classic privat Angebot $900 control limousine 1996 automatik 136 e_klasse 150,000km 9 benzin mercedes_benz ja 2016-04-04 00:00:00 0 24405 2016-04-06 12:44:20
49979 2016-03-20 18:38:40 Volkswagen_Polo_1.6_TDI_Style privat Angebot $11,000 test kleinwagen 2011 manuell 90 polo 70,000km 11 diesel volkswagen nein 2016-03-20 00:00:00 0 48455 2016-04-07 01:45:12
49980 2016-03-12 10:55:54 Ford_Escort_Turnier_16V privat Angebot $400 control kombi 1995 manuell 105 escort 125,000km 3 benzin ford NaN 2016-03-12 00:00:00 0 56218 2016-04-06 17:16:49
49981 2016-03-15 09:38:21 Opel_Astra_Kombi_mit_Anhaengerkupplung privat Angebot $2,000 control kombi 1998 manuell 115 astra 150,000km 12 benzin opel nein 2016-03-15 00:00:00 0 86859 2016-04-05 17:21:46
49982 2016-03-29 18:51:08 Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm privat Angebot $1,950 control kleinwagen 2004 manuell 0 fabia 90,000km 7 benzin skoda NaN 2016-03-29 00:00:00 0 45884 2016-03-29 18:51:08
49983 2016-03-06 12:43:04 Ford_focus_99 privat Angebot $600 test kleinwagen 1999 manuell 101 focus 150,000km 4 benzin ford NaN 2016-03-06 00:00:00 0 52477 2016-03-09 06:16:08
49984 2016-03-31 22:48:48 Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... privat Angebot $0 test NaN 2000 NaN 0 NaN 150,000km 0 NaN sonstige_autos NaN 2016-03-31 00:00:00 0 12103 2016-04-02 19:44:53
49985 2016-04-02 16:38:23 Verkaufe_meinen_vw_vento! privat Angebot $1,000 control NaN 1995 automatik 0 NaN 150,000km 0 benzin volkswagen NaN 2016-04-02 00:00:00 0 30900 2016-04-06 15:17:52
49986 2016-04-04 20:46:02 Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... privat Angebot $15,900 control limousine 2010 automatik 218 300c 125,000km 11 diesel chrysler nein 2016-04-04 00:00:00 0 73527 2016-04-06 23:16:00
49987 2016-03-22 20:47:27 Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... privat Angebot $21,990 control limousine 2013 manuell 150 a3 50,000km 11 diesel audi nein 2016-03-22 00:00:00 0 94362 2016-03-26 22:46:06
49988 2016-03-28 19:49:51 BMW_330_Ci privat Angebot $9,550 control coupe 2001 manuell 231 3er 150,000km 10 benzin bmw nein 2016-03-28 00:00:00 0 83646 2016-04-07 02:17:40
49989 2016-03-11 19:50:37 VW_Polo_zum_Ausschlachten_oder_Wiederaufbau privat Angebot $150 test kleinwagen 1997 manuell 0 polo 150,000km 5 benzin volkswagen ja 2016-03-11 00:00:00 0 21244 2016-03-12 10:17:55
49990 2016-03-21 19:54:19 Mercedes_Benz_A_200__BlueEFFICIENCY__Urban privat Angebot $17,500 test limousine 2012 manuell 156 a_klasse 30,000km 12 benzin mercedes_benz nein 2016-03-21 00:00:00 0 58239 2016-04-06 22:46:57
49991 2016-03-06 15:25:19 Kleinwagen privat Angebot $500 control NaN 2016 manuell 0 twingo 150,000km 0 benzin renault NaN 2016-03-06 00:00:00 0 61350 2016-03-06 18:24:19
49992 2016-03-10 19:37:38 Fiat_Grande_Punto_1.4_T_Jet_16V_Sport privat Angebot $4,800 control kleinwagen 2009 manuell 120 andere 125,000km 9 lpg fiat nein 2016-03-10 00:00:00 0 68642 2016-03-13 01:44:51
49993 2016-03-15 18:47:35 Audi_A3__1_8l__Silber;_schoenes_Fahrzeug privat Angebot $1,650 control kleinwagen 1997 manuell 0 NaN 150,000km 7 benzin audi NaN 2016-03-15 00:00:00 0 65203 2016-04-06 19:46:53
49994 2016-03-22 17:36:42 Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... privat Angebot $5,000 control kombi 2001 automatik 299 a6 150,000km 1 benzin audi nein 2016-03-22 00:00:00 0 46537 2016-04-06 08:16:39
49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon privat Angebot $24,900 control limousine 2011 automatik 239 q5 100,000km 1 diesel audi nein 2016-03-27 00:00:00 0 82131 2016-04-01 13:47:40
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot $1,980 control cabrio 1996 manuell 75 astra 150,000km 5 benzin opel nein 2016-03-28 00:00:00 0 44807 2016-04-02 14:18:02
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot $13,200 test cabrio 2014 automatik 69 500 5,000km 11 benzin fiat nein 2016-04-02 00:00:00 0 73430 2016-04-04 11:47:27
49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition privat Angebot $22,900 control kombi 2013 manuell 150 a3 40,000km 11 diesel audi nein 2016-03-08 00:00:00 0 35683 2016-04-05 16:45:07
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V privat Angebot $1,250 control limousine 1996 manuell 101 vectra 150,000km 1 benzin opel nein 2016-03-13 00:00:00 0 45897 2016-04-06 21:18:48

50000 rows × 20 columns

a concise summary of a DataFrame

In [4]:
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
In [5]:
autos.head()
Out[5]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

We observed a couple of things:

  • Missing values (null) in this columns.
Null Data value Percentage of data set
vehicleType 44905 10,19%
gearbox 47320 5,36%
model 47242 5,5%
fuelType 45518 8,96%
notRepairedDamage 40171 19,658%
  • The need to change the type of data we currently have in the dataset to a numerical data type.
actual type data change of data
price object int64
powerPS object int64
odometer object int64
dateCreated object int64
In [6]:
autos.columns
Out[6]:
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')
In [7]:
autos.rename(columns = {"yearOfRegistration":"registration_year",
                        "monthOfRegistration":"registration_month",
                        "notRepairedDamage":"unrepaired_damage",
                        "dateCreated":"ad_created",
                        "dateCrawled":"date_crawled",
                        "offerType":"offer_type",
                        "vehicleType":"vehicle_type",
                        "powerPS":"power_ps",
                        "nrOfPictures":"nr_ofpictures",
                        "postalCode":"postal_code",
                        "fuelType":"fuel_type",
                        "lastSeen":"last_seen"}, inplace = True)

This step has already been implicitly performed in the previous step with the parameter:

  • inplace = True.
In [8]:
autos.head(2)
Out[8]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_ofpictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08

Renaming the columns serves for two purposes:

  • It is a solution to particular problems that arise when two words or segments are combined and in doing so we meet a standard making us more intelligible.
  • To indicate which units are used in that field, since when we change the data type in the columns we have previously determined the units that represent those quantities should be indicated in the column.

...what other cleaning tasks need to be done?

  • Remove text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis.
  • Convert Numeric data stored as text which must be cleaned and .
In [9]:
autos.describe(include='all').round() # get both categorical and numeric columns
Out[9]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_ofpictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.0 47320 50000.0 47242 50000 50000.0 45518 50000 40171 50000 50000.0 50000.0 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-04-04 16:40:33 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.0 NaN 116.0 NaN NaN 6.0 NaN NaN NaN NaN 0.0 50814.0 NaN
std NaN NaN NaN NaN NaN NaN NaN 106.0 NaN 209.0 NaN NaN 4.0 NaN NaN NaN NaN 0.0 25780.0 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.0 NaN 0.0 NaN NaN 0.0 NaN NaN NaN NaN 0.0 1067.0 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.0 NaN 70.0 NaN NaN 3.0 NaN NaN NaN NaN 0.0 30451.0 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.0 NaN 105.0 NaN NaN 6.0 NaN NaN NaN NaN 0.0 49577.0 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.0 NaN 150.0 NaN NaN 9.0 NaN NaN NaN NaN 0.0 71540.0 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.0 NaN 17700.0 NaN NaN 12.0 NaN NaN NaN NaN 0.0 99998.0 NaN

Columns candidates to be dropped:

  • nr_ofpictures
  • abtest: 'control', 'test'

Columns that need more investigation, for further analysis:

  • seller : 'privat', 'gewerblich'
  • offer_type: 'angebot' , 'Gesuch'

With this data we could see differences and obtain a deeper knowledge by segmenting the market by this sectors, however, this is not the objective of this study, so we will proceed to remove them.

Any examples of numeric data stored as text that needs to be cleaned.

  • price as int64
  • odometer as int64

These Text columns where all or almost all values are the same, can often be dropped as they don't have useful information for analysis

  • nr_ofpictures
  • abtest: 'control', 'test'
  • seller : 'privat', 'gewerblich'
  • offer_type: 'angebot' , 'Gesuch'
In [10]:
autos.drop(['nr_ofpictures', 'abtest','seller','offer_type','abtest'], axis = 1, inplace = True ) 
In [11]:
autos["price"] = autos["price"].str.replace('$','')
autos["price"] = autos["price"].str.replace(',','')
In [12]:
autos["odometer"] = autos["odometer"].str.replace('km','')
autos["odometer"] = autos["odometer"].str.replace(',','')
In [13]:
autos["price"] = autos["price"].astype(int)
In [14]:
autos["price"].describe().round(2)
Out[14]:
count       50000.00
mean         9840.04
std        481104.38
min             0.00
25%          1100.00
50%          2950.00
75%          7200.00
max      99999999.00
Name: price, dtype: float64
In [15]:
autos["odometer"] = autos["odometer"].astype(int)
In [16]:
autos["odometer"].describe().round(2)
Out[16]:
count     50000.00
mean     125732.70
std       40042.21
min        5000.00
25%      125000.00
50%      150000.00
75%      150000.00
max      150000.00
Name: odometer, dtype: float64

Seems there's some kind of outlayer on price columm, both at the top and at the bottom

In [17]:
autos.rename(columns = {"odometer":"odometer_km"}, inplace = True)
In [18]:
autos.rename(columns = {"price":"price_Dollars"}, inplace = True)
In [19]:
autos.head(2) # Just to check the columns.
Out[19]:
date_crawled name price_Dollars vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 5000 bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 2016-03-26 00:00:00 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 8500 limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 2016-04-04 00:00:00 71034 2016-04-06 14:45:08

unique()

In [20]:
autos["odometer_km"].unique().shape
Out[20]:
(13,)
In [21]:
autos["price_Dollars"].unique().shape
Out[21]:
(2357,)

value_counts()

In [22]:
autos["odometer_km"].value_counts()
Out[22]:
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64
In [23]:
autos["price_Dollars"].value_counts().sort_index(ascending = True).head()
Out[23]:
0    1421
1     156
2       3
3       1
5       2
Name: price_Dollars, dtype: int64

describe()

In [24]:
autos["odometer_km"].describe()
Out[24]:
count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

Seems there´s a lot of kilometers on these cars.

The average number of kilometres travelled is very high and this should be taken into account.

In [25]:
autos["price_Dollars"].describe().round()
Out[25]:
count       50000.0
mean         9840.0
std        481104.0
min             0.0
25%          1100.0
50%          2950.0
75%          7200.0
max      99999999.0
Name: price_Dollars, dtype: float64

As we already know that the maximum value is 99999999.0 and the minimum value is 0 It is clear that they are values that cannot be true, because one is a very large amount and the other would mean that the vehicle is given as a gift and we are in a buying and selling website.

For that reason let's look at what happens to the values at the extreme ends of our list.

The way I will do it will be to divide the list in the upper part and another one in the lower part.

In the upper part I will choose a high value and check if it makes sense, that is to say if there is a vehicle with that price and from there I will filter it.

In [26]:
autos["price_Dollars"].value_counts().sort_index(ascending = False).head(15)
Out[26]:
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
Name: price_Dollars, dtype: int64

Let's see to which vehicle the value of 265000 corresponds.

In the event that this price is consistent with the vehicle, we can take it as a reference.

In [27]:
autos[autos["price_Dollars"] == 265000]
Out[27]:
date_crawled name price_Dollars vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
12682 2016-03-28 22:48:01 Porsche_GT3_RS__PCCB__Lift___grosser_Exklusiv_... 265000 coupe 2016 automatik 500 911 5000 3 benzin porsche nein 2016-03-28 00:00:00 70193 2016-04-05 03:44:51

It makes sense, so we will use our comparison from this value upwards and see what we find.

https://www.carindigo.com/used-cars/porsche-911-gt3-rs

As there are not too many values to check, it seems sensible to start from this model and see what we can eliminate and what we cannot.

In [28]:
top_List = autos[(autos["price_Dollars"] >= 265000)].sort_index(ascending = False)
top_List
Out[28]:
date_crawled name price_Dollars vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
47634 2016-04-04 21:25:21 Ferrari_FXX 3890000 coupe 2006 NaN 799 NaN 5000 7 NaN sonstige_autos nein 2016-04-04 00:00:00 60313 2016-04-05 12:07:37
47598 2016-03-31 18:56:54 Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar... 12345678 limousine 2001 manuell 101 vectra 150000 3 benzin opel nein 2016-03-31 00:00:00 4356 2016-03-31 18:56:54
43049 2016-03-21 19:53:52 2_VW_Busse_T3 999999 bus 1981 manuell 70 transporter 150000 1 benzin volkswagen NaN 2016-03-21 00:00:00 99880 2016-03-28 17:18:28
42221 2016-03-08 20:39:05 Leasinguebernahme 27322222 limousine 2014 manuell 163 c4 40000 2 diesel citroen NaN 2016-03-08 00:00:00 76532 2016-03-08 20:39:05
39705 2016-03-22 14:58:27 Tausch_gegen_gleichwertiges 99999999 limousine 1999 automatik 224 s_klasse 150000 9 benzin mercedes_benz NaN 2016-03-22 00:00:00 73525 2016-04-06 05:15:30
39377 2016-03-08 23:53:51 Tausche_volvo_v40_gegen_van 12345678 NaN 2018 manuell 95 v40 150000 6 NaN volvo nein 2016-03-08 00:00:00 14542 2016-04-06 23:17:31
37585 2016-03-29 11:38:54 Volkswagen_Jetta_GT 999990 limousine 1985 manuell 111 jetta 150000 12 benzin volkswagen ja 2016-03-29 00:00:00 50997 2016-03-29 11:38:54
36818 2016-03-27 18:37:37 Porsche_991 350000 coupe 2016 manuell 500 911 5000 3 benzin porsche nein 2016-03-27 00:00:00 70499 2016-03-27 18:37:37
35923 2016-04-03 07:56:23 Porsche_911_Targa_Exclusive_Edition__1_von_15_... 295000 cabrio 2015 automatik 400 911 5000 6 benzin porsche nein 2016-04-03 00:00:00 74078 2016-04-03 08:56:20
34723 2016-03-23 16:37:29 Porsche_Porsche_911/930_Turbo_3.0__deutsche_Au... 299000 coupe 1977 manuell 260 911 100000 7 benzin porsche nein 2016-03-23 00:00:00 61462 2016-04-06 16:44:50
27371 2016-03-09 15:45:47 Fiat_Punto 12345678 NaN 2017 NaN 95 punto 150000 0 NaN fiat NaN 2016-03-09 00:00:00 96110 2016-03-09 15:45:47
24384 2016-03-21 13:57:51 Schlachte_Golf_3_gt_tdi 11111111 NaN 1995 NaN 0 NaN 150000 0 NaN volkswagen NaN 2016-03-21 00:00:00 18519 2016-03-21 14:40:18
22947 2016-03-22 12:54:19 Bmw_530d_zum_ausschlachten 1234566 kombi 1999 automatik 190 NaN 150000 2 diesel bmw NaN 2016-03-22 00:00:00 17454 2016-04-02 03:17:32
14715 2016-03-30 08:37:24 Rolls_Royce_Phantom_Drophead_Coupe 345000 cabrio 2012 automatik 460 NaN 20000 8 benzin sonstige_autos nein 2016-03-30 00:00:00 73525 2016-04-07 00:16:26
12682 2016-03-28 22:48:01 Porsche_GT3_RS__PCCB__Lift___grosser_Exklusiv_... 265000 coupe 2016 automatik 500 911 5000 3 benzin porsche nein 2016-03-28 00:00:00 70193 2016-04-05 03:44:51
11137 2016-03-29 23:52:57 suche_maserati_3200_gt_Zustand_unwichtig_laufe... 10000000 coupe 1960 manuell 368 NaN 100000 1 benzin sonstige_autos nein 2016-03-29 00:00:00 73033 2016-04-06 21:18:11
7814 2016-04-04 11:53:31 Ferrari_F40 1300000 coupe 1992 NaN 0 NaN 50000 12 NaN sonstige_autos nein 2016-04-04 00:00:00 60598 2016-04-05 11:34:11
2897 2016-03-12 21:50:57 Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000 11111111 limousine 1973 manuell 48 escort 50000 3 benzin ford nein 2016-03-12 00:00:00 94469 2016-03-12 22:45:27
514 2016-03-17 09:53:08 Ford_Focus_Turnier_1.6_16V_Style 999999 kombi 2009 manuell 101 focus 125000 4 benzin ford nein 2016-03-17 00:00:00 12205 2016-04-06 07:17:35

There is a lot of data and we cannot compare correctly, so let's pick the data that matters, price and all possible information we can extract in relation to the vehicle.

And see if it helps us discern between the price and the class of vehicle to which it has been assigned.

In [29]:
top_List[["price_Dollars","brand","model","vehicle_type"]].sort_index(ascending = False)
Out[29]:
price_Dollars brand model vehicle_type
47634 3890000 sonstige_autos NaN coupe
47598 12345678 opel vectra limousine
43049 999999 volkswagen transporter bus
42221 27322222 citroen c4 limousine
39705 99999999 mercedes_benz s_klasse limousine
39377 12345678 volvo v40 NaN
37585 999990 volkswagen jetta limousine
36818 350000 porsche 911 coupe
35923 295000 porsche 911 cabrio
34723 299000 porsche 911 coupe
27371 12345678 fiat punto NaN
24384 11111111 volkswagen NaN NaN
22947 1234566 bmw NaN kombi
14715 345000 sonstige_autos NaN cabrio
12682 265000 porsche 911 coupe
11137 10000000 sonstige_autos NaN coupe
7814 1300000 sonstige_autos NaN coupe
2897 11111111 ford escort limousine
514 999999 ford focus kombi

It seems to be clearer this way, we see that there are vehicles that make sense to be on this list and that others clearly do not, for example:

Strange values of 1234566, corresponding to a bmw or 11111111 to a volkswagen even a C4 that is worth 27322222.

Clearly, these cases could be eliminated as error.

However, there are some kind of vehicles called 'sonstige autos' (other cars) that we are not aware of and because they have an important weight in the list it is important to clarify whether or not we should eliminate them.

In [30]:
top_List[top_List["brand"] == "sonstige_autos"]
Out[30]:
date_crawled name price_Dollars vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
47634 2016-04-04 21:25:21 Ferrari_FXX 3890000 coupe 2006 NaN 799 NaN 5000 7 NaN sonstige_autos nein 2016-04-04 00:00:00 60313 2016-04-05 12:07:37
14715 2016-03-30 08:37:24 Rolls_Royce_Phantom_Drophead_Coupe 345000 cabrio 2012 automatik 460 NaN 20000 8 benzin sonstige_autos nein 2016-03-30 00:00:00 73525 2016-04-07 00:16:26
11137 2016-03-29 23:52:57 suche_maserati_3200_gt_Zustand_unwichtig_laufe... 10000000 coupe 1960 manuell 368 NaN 100000 1 benzin sonstige_autos nein 2016-03-29 00:00:00 73033 2016-04-06 21:18:11
7814 2016-04-04 11:53:31 Ferrari_F40 1300000 coupe 1992 NaN 0 NaN 50000 12 NaN sonstige_autos nein 2016-04-04 00:00:00 60598 2016-04-05 11:34:11

High-end vehicles with a very high market value are appearing, will have to be reviewed.

name price in Dollars registration_year source
Ferrari_FXX 4,000,000 2006 https://www.autoblog.com/2006/06/14/for-sale-2006-ferrari-fxx-slightly-used/?guccounter=1
Rolls_Royce_Phantom_Drophead_Coupe From 450,000 2012 https://www.cars.com/shopping/rolls_royce-phantom_drophead_coupe-2012/
Ferrari_F40 1,959,900 1992 https://www.dupontregistry.com/autos/listing/1992/ferrari/f40/2418434
maserati 3200 gt Note

Note: In relation to the Maserati, there are several things to consider.

  • The 3200 Gt model was in production during the period 1998 to 2001 so the registration date we have from 1960 does not match, however, during the period 1957 - 1964 only one model of this brand was produced, the 3500 Gt, only 2,222 units were produced between the Coupe and the Spider version.

  • If we also take into account the description referred to in the cell of the name:

    " Zustand_unwichtig_laufe... // Condition_unimportant_running..."

  • The person who placed the advert was looking for a masserati in any condition (just running or not...), this gives us a clue that the model we are talking about is not the newest one, so probably refers to the 60's model.

  • As if this were not enough the price we have is 10 000 000 dollars and surely this is wrong because doing a search of what is the value of the car I have found that this ranges between 863.170,50 USD / 151.290,57 USD

So seeing as it is an offer from a private individual who does not care about the condition a price of 100000 would be a bargain for the vehicle like this.

Change the price of the Masserati = $100000.

In [31]:
autos.iloc[11137,2] = 100000

We finally see that the prices we were looking make sense as they all correspond to high-end vehicles.

What do we have to remove?

Everything other than 'Sonstige_autos' and 'Porsche'

In [32]:
remove_bad_price_bool = ((top_List["brand"] != 'sonstige_autos') & (top_List["brand"] != 'porsche'))
remove_bad_price_bool.head(3) # the boolean list to be use.
Out[32]:
47634    False
47598     True
43049     True
Name: brand, dtype: bool
In [33]:
#check_price[remove_bad_price_bool] 
In [34]:
bad_cars = top_List[remove_bad_price_bool] # type dataframe! This is the dataframe to remove from check_price
In [35]:
bad_cars
Out[35]:
date_crawled name price_Dollars vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
47598 2016-03-31 18:56:54 Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar... 12345678 limousine 2001 manuell 101 vectra 150000 3 benzin opel nein 2016-03-31 00:00:00 4356 2016-03-31 18:56:54
43049 2016-03-21 19:53:52 2_VW_Busse_T3 999999 bus 1981 manuell 70 transporter 150000 1 benzin volkswagen NaN 2016-03-21 00:00:00 99880 2016-03-28 17:18:28
42221 2016-03-08 20:39:05 Leasinguebernahme 27322222 limousine 2014 manuell 163 c4 40000 2 diesel citroen NaN 2016-03-08 00:00:00 76532 2016-03-08 20:39:05
39705 2016-03-22 14:58:27 Tausch_gegen_gleichwertiges 99999999 limousine 1999 automatik 224 s_klasse 150000 9 benzin mercedes_benz NaN 2016-03-22 00:00:00 73525 2016-04-06 05:15:30
39377 2016-03-08 23:53:51 Tausche_volvo_v40_gegen_van 12345678 NaN 2018 manuell 95 v40 150000 6 NaN volvo nein 2016-03-08 00:00:00 14542 2016-04-06 23:17:31
37585 2016-03-29 11:38:54 Volkswagen_Jetta_GT 999990 limousine 1985 manuell 111 jetta 150000 12 benzin volkswagen ja 2016-03-29 00:00:00 50997 2016-03-29 11:38:54
27371 2016-03-09 15:45:47 Fiat_Punto 12345678 NaN 2017 NaN 95 punto 150000 0 NaN fiat NaN 2016-03-09 00:00:00 96110 2016-03-09 15:45:47
24384 2016-03-21 13:57:51 Schlachte_Golf_3_gt_tdi 11111111 NaN 1995 NaN 0 NaN 150000 0 NaN volkswagen NaN 2016-03-21 00:00:00 18519 2016-03-21 14:40:18
22947 2016-03-22 12:54:19 Bmw_530d_zum_ausschlachten 1234566 kombi 1999 automatik 190 NaN 150000 2 diesel bmw NaN 2016-03-22 00:00:00 17454 2016-04-02 03:17:32
2897 2016-03-12 21:50:57 Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000 11111111 limousine 1973 manuell 48 escort 50000 3 benzin ford nein 2016-03-12 00:00:00 94469 2016-03-12 22:45:27
514 2016-03-17 09:53:08 Ford_Focus_Turnier_1.6_16V_Style 999999 kombi 2009 manuell 101 focus 125000 4 benzin ford nein 2016-03-17 00:00:00 12205 2016-04-06 07:17:35

We extract the indices of the rows we want to remove from the main dataframe.

In [36]:
index_to_remove = bad_cars.index
In [37]:
index_to_remove  # from the mail dataframe autos.
Out[37]:
Int64Index([47598, 43049, 42221, 39705, 39377, 37585, 27371, 24384, 22947,
            2897, 514],
           dtype='int64')
In [38]:
for indice in index_to_remove:
    autos.drop([indice], inplace = True)
In [39]:
autos.iloc[27371] # comprobacion de que el fiat panda lo habia eliminado
Out[39]:
date_crawled                            2016-03-08 08:53:55
name                  Renault_clio_1.2_ripcul_Bilder_folgen
price_Dollars                                          3300
vehicle_type                                     kleinwagen
registration_year                                      2007
gearbox                                                 NaN
power_ps                                                100
model                                                  clio
odometer_km                                          125000
registration_month                                        0
fuel_type                                            benzin
brand                                               renault
unrepaired_damage                                       NaN
ad_created                              2016-03-08 00:00:00
postal_code                                           27239
last_seen                               2016-03-10 00:17:14
Name: 27376, dtype: object
In [40]:
autos["price_Dollars"].value_counts().sort_index(ascending= True)
Out[40]:
0          1421
1           156
2             3
3             1
5             2
8             1
9             1
10            7
11            2
12            3
13            2
14            1
15            2
17            3
18            1
20            4
25            5
29            1
30            7
35            1
40            6
45            4
47            1
49            4
50           49
55            2
59            1
60            9
65            5
66            1
           ... 
120000        2
128000        1
129000        1
130000        1
135000        1
137999        1
139997        1
145000        1
151990        1
155000        1
163500        1
163991        1
169000        1
169999        1
175000        1
180000        1
190000        1
194000        1
197000        1
198000        1
220000        1
250000        1
259000        1
265000        1
295000        1
299000        1
345000        1
350000        1
1300000       1
3890000       1
Name: price_Dollars, Length: 2350, dtype: int64
In [41]:
autos[autos["price_Dollars"] == 1]
Out[41]:
date_crawled name price_Dollars vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
55 2016-03-07 02:47:54 Mercedes_E320_AMG_zu_Tauschen! 1 NaN 2017 automatik 224 e_klasse 125000 7 benzin mercedes_benz nein 2016-03-06 00:00:00 22111 2016-03-08 05:45:44
344 2016-03-22 11:25:18 Verkaufe_hier_mein_mein_schoener_honda_crx_in_... 1 NaN 2000 manuell 125 andere 5000 0 NaN honda NaN 2016-03-22 00:00:00 65391 2016-04-06 18:15:57
791 2016-04-05 09:52:11 Opel_Vectra_in_teilen 1 kombi 1999 manuell 101 vectra 150000 1 benzin opel NaN 2016-04-05 00:00:00 59394 2016-04-07 13:16:37
981 2016-03-10 00:56:12 Verkaufen_oder_tauschen 1 kombi 2003 manuell 150 3er 150000 9 diesel bmw nein 2016-03-10 00:00:00 58840 2016-04-06 01:16:59
1115 2016-03-24 22:36:58 Suche_Passat_VR6_Variant 1 NaN 1995 manuell 0 passat 150000 0 benzin volkswagen NaN 2016-03-24 00:00:00 38556 2016-04-06 03:45:42
1422 2016-03-16 18:43:16 zerlege_Astra_f_teile_guenstig 1 NaN 1995 NaN 75 astra 5000 0 benzin opel NaN 2016-03-16 00:00:00 1728 2016-03-19 16:45:08
1561 2016-03-15 13:52:33 Suche_Skoda_Oktavia_Kombi_Diesel_ab_2005 1 NaN 2005 NaN 0 octavia 150000 0 diesel skoda NaN 2016-03-15 00:00:00 88099 2016-04-06 03:16:18
1877 2016-03-29 09:50:25 Tausche_Peugeot_306_gegen_Motorroller/Handy/No... 1 NaN 2017 manuell 75 3_reihe 150000 0 NaN peugeot nein 2016-03-29 00:00:00 82362 2016-03-29 10:45:19
1987 2016-03-16 21:55:11 Audi_A4_B5_Ersatzteile 1 limousine 1997 NaN 0 a4 5000 0 benzin audi NaN 2016-03-16 00:00:00 64839 2016-04-07 10:16:37
2442 2016-03-13 00:59:30 Opel_astra_g_cc_1.8_16v_Schlachtfest!!! 1 NaN 2000 NaN 0 astra 150000 0 NaN opel ja 2016-03-12 00:00:00 48301 2016-03-19 11:45:51
2865 2016-03-08 16:57:11 Opel_Astra_f_Cabrio 1 cabrio 1995 manuell 0 astra 150000 0 benzin opel NaN 2016-03-08 00:00:00 26388 2016-04-07 03:16:24
2892 2016-03-08 20:46:19 Suche_Opel_Omega_gerne_DTI_ab_BJ_2000 1 NaN 2000 NaN 0 omega 5000 0 NaN opel NaN 2016-03-08 00:00:00 29456 2016-03-12 22:47:34
3062 2016-03-07 11:56:53 VW_Golf_3_zum_herrichten_oder_ausschlachten 1 NaN 2016 manuell 75 golf 150000 5 NaN volkswagen NaN 2016-03-07 00:00:00 74081 2016-03-15 05:17:02
3397 2016-03-14 16:53:37 Suchen_dringend_Automechaniker_in_Rheda 1 NaN 2000 NaN 0 golf 125000 0 NaN volkswagen NaN 2016-03-14 00:00:00 33330 2016-03-16 07:19:07
3611 2016-04-07 03:36:33 Suche_corsa_c 1 kleinwagen 2000 NaN 0 corsa 10000 5 NaN opel nein 2016-04-07 00:00:00 39112 2016-04-07 03:36:33
3679 2016-04-04 00:36:17 Suche_Auto 1 NaN 1910 NaN 0 NaN 5000 0 NaN sonstige_autos NaN 2016-04-04 00:00:00 40239 2016-04-04 07:49:15
4194 2016-03-23 21:51:14 Skoda_octavia_FSI_Tuning_Fox_Auspuff_Spoiler 1 kombi 2005 NaN 0 octavia 150000 0 NaN skoda NaN 2016-03-23 00:00:00 13088 2016-04-06 04:15:24
4253 2016-03-15 17:56:58 Ich_schlachte_meinen_Mitsubishi_carisma. 1 limousine 2000 manuell 122 carisma 150000 6 NaN mitsubishi NaN 2016-03-15 00:00:00 13189 2016-04-06 18:48:10
4323 2016-03-11 07:55:11 Vw_polo_6n1_zum_ausschlachten 1 kleinwagen 1997 manuell 60 polo 150000 0 NaN volkswagen nein 2016-03-11 00:00:00 97318 2016-03-11 09:44:49
4496 2016-03-25 10:53:36 RENAULT__MEGAN__GRANTOUR____bj_2001____TAUSCHE... 1 kombi 2001 manuell 0 megane 5000 12 benzin renault nein 2016-03-25 00:00:00 14199 2016-04-06 10:45:11
4632 2016-03-18 05:06:26 Opel_Astra_F_Schlachtfest 1 limousine 1999 automatik 0 astra 150000 10 benzin opel NaN 2016-03-18 00:00:00 65933 2016-04-07 03:45:38
4787 2016-03-16 23:37:56 Sehr_gepflegt_und_nicht_rauche_Auto 1 limousine 1998 manuell 150 passat 150000 10 benzin volkswagen nein 2016-03-16 00:00:00 70469 2016-03-29 16:17:31
4928 2016-03-31 07:36:19 Opel_b_corsa_5_Tuerer 1 kleinwagen 1998 manuell 65 corsa 150000 7 benzin opel ja 2016-03-31 00:00:00 18556 2016-04-02 04:15:46
5118 2016-03-15 10:39:56 ECHTER__!__VW_Polo_GTI___Unverbastelt!___Gegen... 1 kleinwagen 1998 manuell 125 polo 150000 11 benzin volkswagen nein 2016-03-15 00:00:00 41372 2016-03-15 10:39:56
5246 2016-03-12 12:56:03 Autoaufbereitung 1 NaN 2015 NaN 0 6er 5000 12 NaN bmw NaN 2016-03-12 00:00:00 29640 2016-03-12 12:56:03
5308 2016-03-31 15:54:03 Nissan_Almera_1.6 1 NaN 2000 NaN 0 almera 90000 0 NaN nissan NaN 2016-03-31 00:00:00 66706 2016-04-06 08:46:24
5354 2016-03-29 18:25:27 Schlachtfest 1 NaN 2000 NaN 0 golf 150000 0 NaN volkswagen ja 2016-03-29 00:00:00 35232 2016-04-06 06:45:01
5594 2016-03-26 00:59:02 Schlachte_polo_6n_alles_vorhanden_und_billig_a... 1 NaN 2000 manuell 60 NaN 150000 3 NaN volkswagen NaN 2016-03-25 00:00:00 28757 2016-03-29 09:46:50
5883 2016-03-28 12:56:23 Suche_ein_Auto 1 NaN 2005 NaN 0 NaN 5000 0 NaN volkswagen NaN 2016-03-28 00:00:00 61118 2016-04-02 16:48:00
6013 2016-03-25 13:58:03 Auto_zum_ausschlachten 1 kleinwagen 1997 manuell 116 golf 150000 1 benzin volkswagen nein 2016-03-25 00:00:00 55234 2016-04-06 17:17:18
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
38064 2016-03-29 13:51:18 Schlachte_e39_touring_aspensilber_339 1 kombi 1996 manuell 193 5er 150000 12 benzin bmw NaN 2016-03-29 00:00:00 26532 2016-04-05 23:17:27
39180 2016-03-19 06:36:17 Vw_Polo_steilheck/tausch 1 kleinwagen 1993 manuell 45 polo 125000 0 benzin volkswagen nein 2016-03-19 00:00:00 93047 2016-04-04 04:18:02
39331 2016-03-06 15:43:41 Schlachtfest_e36_320 1 cabrio 1998 manuell 150 NaN 150000 0 NaN bmw NaN 2016-03-06 00:00:00 79211 2016-04-06 20:45:54
39354 2016-03-17 11:52:40 Ford_Focus_gegen_Handy_! 1 kleinwagen 1999 manuell 115 focus 150000 0 benzin ford NaN 2016-03-17 00:00:00 4603 2016-03-17 12:43:09
39507 2016-04-02 12:49:46 Tausche_gegen_anderen_Pkw 1 NaN 2016 manuell 75 golf 150000 5 NaN volkswagen nein 2016-04-02 00:00:00 4229 2016-04-06 10:46:56
39912 2016-03-30 12:44:39 Verkaufe_mein_Opel_Astra_G_mit_147_PS_Leder_/K... 1 coupe 2000 manuell 147 astra 150000 12 benzin opel NaN 2016-03-30 00:00:00 26160 2016-03-30 12:44:39
40143 2016-03-09 20:53:57 SEAT_IBIZA_MUSS_WEG_! 1 kleinwagen 1996 manuell 60 NaN 150000 0 benzin seat ja 2016-03-09 00:00:00 12437 2016-03-17 03:46:58
40283 2016-04-02 18:55:20 Fiat_Fiorino_Kastenwagen_Kombi_AHK_Anhaengerku... 1 kombi 1996 manuell 0 andere 150000 5 benzin fiat NaN 2016-04-02 00:00:00 58640 2016-04-06 18:47:52
40681 2016-03-31 23:43:27 Schlachtfest 1 limousine 1996 manuell 0 NaN 150000 0 benzin bmw NaN 2016-03-31 00:00:00 89426 2016-04-06 19:45:49
41577 2016-04-03 21:49:37 Alfa_Romeo_156_Schlachtfest_2.5_V6_24V_428/B_B... 1 limousine 1998 manuell 190 156 5000 5 benzin alfa_romeo NaN 2016-04-03 00:00:00 87743 2016-04-05 22:47:44
42445 2016-03-18 20:52:00 Mercedes_Benz_a_Klasse_Schlachtung_schlachtfes... 1 kleinwagen 2000 NaN 0 a_klasse 150000 12 benzin mercedes_benz NaN 2016-03-18 00:00:00 40822 2016-04-02 00:46:06
43413 2016-03-14 10:58:08 Tausche_Verkaufe__Audi_A8_3.3_TDI_Quattro_Xeno... 1 NaN 2016 automatik 225 a8 150000 1 diesel audi NaN 2016-03-14 00:00:00 24791 2016-03-21 01:16:46
43623 2016-03-05 16:53:31 1.6__60_kW_90ps 1 NaN 2016 manuell 0 carisma 5000 1 benzin mitsubishi ja 2016-03-05 00:00:00 86438 2016-04-07 12:17:40
44816 2016-03-31 14:52:00 Renault_Twingo_Baujahr_1998!_Bastler!_Export!_... 1 kleinwagen 1990 manuell 58 twingo 150000 7 benzin renault nein 2016-03-31 00:00:00 37154 2016-04-02 10:16:11
45216 2016-03-05 16:37:42 Brauche_dringend_Hilfe__opel_vectra_a_1.6_i 1 NaN 1995 NaN 0 vectra 150000 0 benzin opel NaN 2016-03-05 00:00:00 18246 2016-03-20 00:15:49
45546 2016-03-22 07:00:38 ICH_SUCHE_EINEN_BMW_E36_AB_2_LITER 1 NaN 1995 manuell 0 3er 125000 0 NaN bmw nein 2016-03-22 00:00:00 41849 2016-03-27 21:18:00
45754 2016-03-20 13:46:58 Audi_80_Cabrio_zum_Schlachten_!!! 1 cabrio 1996 manuell 0 andere 150000 1 benzin audi NaN 2016-03-20 00:00:00 26789 2016-04-06 14:47:18
46595 2016-03-26 19:47:39 BMW_318tI_nur_in_Ersatzteile_143Ps 1 kleinwagen 1998 manuell 102 3er 150000 6 benzin bmw NaN 2016-03-26 00:00:00 21717 2016-04-06 09:46:38
46659 2016-03-16 13:56:04 Fuer_Foto_u_Video_Aufnahmen_gesucht 1 NaN 1980 NaN 50 kaefer 100000 3 NaN volkswagen NaN 2016-03-16 00:00:00 46047 2016-04-06 12:45:45
46990 2016-03-16 14:46:18 xr_3_cabrio_mk4_in_Teilen 1 cabrio 1989 manuell 90 escort 150000 4 benzin ford NaN 2016-03-16 00:00:00 66773 2016-03-17 12:44:56
47087 2016-03-17 22:42:37 W124_SCHLACHTFEST_|_Kombi_|_1995_|_Aktiv_Bass 1 kombi 1996 manuell 0 e_klasse 5000 0 benzin mercedes_benz NaN 2016-03-17 00:00:00 50667 2016-04-07 13:17:31
47265 2016-03-09 09:50:40 Exportauto_zu_verkaufen 1 NaN 2000 NaN 0 focus 150000 0 NaN ford NaN 2016-03-09 00:00:00 15562 2016-03-10 04:15:56
48018 2016-04-04 11:41:12 Mercedes_Benz_W116_280_SE_Oldtimer_frisch_lack... 1 limousine 1978 automatik 185 andere 150000 10 benzin mercedes_benz nein 2016-04-04 00:00:00 89077 2016-04-06 12:17:44
48233 2016-03-21 11:53:11 Bmw_5er_E39_Schlachtfest 1 limousine 1999 manuell 0 5er 150000 4 NaN bmw NaN 2016-03-21 00:00:00 45772 2016-04-06 06:46:31
48786 2016-03-17 09:38:31 Auto_gesucht 1 NaN 2000 NaN 0 NaN 150000 0 NaN volkswagen NaN 2016-03-17 00:00:00 97659 2016-03-17 10:42:07
49240 2016-03-10 09:54:48 Suche_7_Sitzer_Espace_oder?_Biete_s.Beschreibung 1 kombi 1999 manuell 0 NaN 150000 4 benzin bmw NaN 2016-03-10 00:00:00 41199 2016-03-28 10:15:50
49377 2016-03-20 00:58:36 Tausche_BMW_318i_Facelift_Model_143_ps 1 limousine 2002 manuell 143 3er 150000 8 benzin bmw NaN 2016-03-19 00:00:00 86356 2016-03-20 04:56:34
49528 2016-03-28 19:57:41 BMW_E36__316i_Compact_M_Paket_"Schlachtfest" 1 NaN 2016 manuell 102 3er 150000 12 benzin bmw NaN 2016-03-28 00:00:00 49696 2016-04-07 02:47:04
49728 2016-03-25 16:00:11 Ford_Focus_1.4_16V_Bastlerauto 1 NaN 2016 manuell 0 focus 150000 0 benzin ford ja 2016-03-25 00:00:00 34587 2016-03-25 16:40:21
49845 2016-03-21 00:50:15 Schlachte_VW_Sharan_vr6_Automatik___no_GTI_16V... 1 NaN 2000 automatik 174 sharan 5000 6 benzin volkswagen nein 2016-03-20 00:00:00 14621 2016-04-05 21:16:16

156 rows × 16 columns

We see that the values do not resemble the same case that we have applied to the top of the list so we will have to follow a different strategy.

if we look at the ads with value equal to 1 we realize that this value is simply an indicator of some particularity that has been noted in the name box:

  • Mercedes_E320_AMG_for_exchange!
  • Sell_or_exchange 1 station wagon 2003 manual 150
  • Search_Skoda_Oktavia_station_station wagon_diesel_ab_2005
  • Exchange_Peugeot_306_for_scooter/mobile/no...
  • Audi_A4_B5_spare_parts 1 sedan 1997
  • VW_Golf_3_to_refurbish_or_break_out

We have an idea of the purpose of this ad section, so we are going to remove it, taking into account the maximum price we have in our list.

In [42]:
autos["price_Dollars"].max()
Out[42]:
3890000
In [43]:
autos = autos[autos["price_Dollars"].between(2,3890000)]
In [44]:
autos["price_Dollars"].describe().round()
Out[44]:
count      48412.0
mean        6017.0
std        20703.0
min            2.0
25%         1250.0
50%         3000.0
75%         7498.0
max      3890000.0
Name: price_Dollars, dtype: float64
price_dollars Before cleaning After cleaning % diff
count 50000.00 48412.0 - 3,176%
mean 9840.04 6017.0 - 38,85%
std 481104.38 20703.0 - 95,7%
25% 1100.00 1250.0 + 13,63
50% 2950.00 3000.0 + 1,69
75% 7200.00 7498.0 + 4,13
max 99999999.00 3890000.0 no sense

As we can see in the cleaning process we have only eliminated - 3,176% of the total number of samples and the mean prices have decreased - 38,85%, the interquartile values have increased from 4% to 13% and standart deviation has also improved.

If we see the column of the name the descriptions with the price (for example) of $ 1 we realize that we can not eliminate these vehicles just like that because it is not a mistake, are ads that people who intend to make changes or simply gives away their vehicle altruistically, so I will not bias these values so low.

Date columns and understand the date range

There are several columns that represent date values:

  • date_crawled: When the ad was first crawled. Added by the crawler.
  • last_seen: When the crawler saw the ad last online. Added by the crawler
  • ad_created: The date on which the eBay listing was created. Created by the website.
  • registration_month: The month in which the car was first registered. Created by the website.
  • registration_year: The year in which the car was first registered. Created by the website.
In [45]:
autos[['date_crawled','ad_created','last_seen']][0:5]
Out[45]:
date_crawled ad_created last_seen
0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54
1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08
2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37
3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28
4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50
In [46]:
autos["date_crawled"].describe()
Out[46]:
count                   48412
unique                  46742
top       2016-03-21 20:37:19
freq                        3
Name: date_crawled, dtype: object

The way we will extract the date from this object type data will be by making a slice of the text we will access the text by str in Series.value_counts().

copying our dataframe its the way to achieve

In [47]:
autos["date_crawled"] = autos["date_crawled"].str[:10].copy()
/dataquest/system/env/python3/lib/python3.4/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
In [49]:
autos['date_crawled']
Out[49]:
0        2016-03-26
1        2016-04-04
2        2016-03-26
3        2016-03-12
4        2016-04-01
5        2016-03-21
6        2016-03-20
7        2016-03-16
8        2016-03-22
9        2016-03-16
10       2016-03-15
11       2016-03-16
12       2016-03-31
13       2016-03-23
14       2016-03-23
15       2016-04-01
16       2016-03-16
17       2016-03-29
18       2016-03-26
19       2016-03-17
20       2016-03-05
21       2016-03-06
22       2016-03-28
23       2016-03-10
24       2016-04-03
25       2016-03-21
26       2016-04-03
28       2016-03-19
29       2016-04-02
30       2016-03-14
            ...    
49968    2016-04-01
49969    2016-03-17
49970    2016-03-21
49971    2016-03-29
49972    2016-03-26
49973    2016-03-27
49975    2016-03-27
49976    2016-03-19
49977    2016-03-31
49978    2016-04-04
49979    2016-03-20
49980    2016-03-12
49981    2016-03-15
49982    2016-03-29
49983    2016-03-06
49985    2016-04-02
49986    2016-04-04
49987    2016-03-22
49988    2016-03-28
49989    2016-03-11
49990    2016-03-21
49991    2016-03-06
49992    2016-03-10
49993    2016-03-15
49994    2016-03-22
49995    2016-03-27
49996    2016-03-28
49997    2016-04-02
49998    2016-03-08
49999    2016-03-14
Name: date_crawled, Length: 48412, dtype: object
In [ ]:
index_date_crawled = autos['date_crawled'].str[:10].index
In [ ]:
index_date_crawled
In [ ]:
for indice in index_to_remove:
    autos.drop([indice], inplace = True)
In [ ]:
# calculate the distribution of values in percentages instead of counts

autos['date_crawled']...value_counts(normalize=True, dropna=False).describe()

ad_created

In [ ]:
autos['ad_created']...describe()
In [ ]:
# extract the date from the string

autos['ad_created']=autos['ad_created'].str[:10]
In [ ]:
# calculate the distribution of values in percentages 

autos['ad_created'].value_counts(normalize=True, dropna=False).sort_index(ascending=True)
In [ ]:
autos['ad_created'].shape
In [ ]:
ad_created_bool = autos['ad_created'].value_counts(normalize=True, dropna=False).max()
In [ ]:
autos['ad_created'].value_counts(normalize=True, dropna=False).head(1) == ad_created_bool

6.3.3 last_seen

In [ ]:
autos['last_seen'].describe()

Date and time: summary

Column Range of dates Most frequent day
date_crawled 2016-03-05 to 2016-04-07 2016-04-03 (3.86%)
ad_created 2015-06-11 to 2016-04-07 2016-04-03 (3.88%)
last_seen 2016-03-05 to 2016-04-07 2016-04-06 (2.21%)

Main conclusions:

  • dateCrawled displays the date the ads were crawled. Ad crawling started in March of 2016 and ended in April of the same year.
  • Ads created as early as June 2015 are included in this dataset, as we can see in the ad_created column.
  • The last_seen column contains the date when the crawler last saw the ad online. The day when most ads were last seen online is 2016-04-06 (2.21%).

6. Dealing with Incorrect Registration Year Data

In [ ]:
autos['registration_year'].describe()

As we can see above, the 'registration_year' column contains some errors. The maximum value is 9999 and the minimum is 1000, both clearly incorrect.

Given that the 'registration_year' corresponds to the year in which the car was first registered, we known that cannot be after 2015, the year the first ads were created. As for the earliest acceptable year for car registration, we will accept any year after 1900.

In [ ]:
# remove the rows where the 'registation_year' values are outside the range we have defined

autos = autos[autos['registration_year'].between(1900,2015)]
In [ ]:
autos['registration_year'].value_counts(normalize=True)
In [ ]:
autos['registration_year'].value_counts(normalize=True, dropna=False)[0:15]
In [ ]:
sum(autos['registration_year'].value_counts(normalize=True, dropna=False)[0:15])

In addition, almost 80% of all car registrations occured between 1997 and 2011. We can conclude from this analysis that, at the time of crawling, the majority of cars on sale in this platform were at least more than 5 years old.

7. Exploring Price by Brand

In [ ]:
top_brands = autos['brand'].value_counts(normalize=True, dropna=False).head(20)
top_brands

Write a short paragraph describing the brand data, and explaining which brands you've chosen to aggregate on.

Create an empty dictionary to hold your aggregate data. Loop over your selected brands, and assign the mean price to the dictionary, with the brand name as the key. Print your dictionary of aggregate data, and write a paragraph analyzing the results.

In [ ]:
autos['brand'].unique()
In [ ]:
autos['price_Dollars']

8. Storing Aggregate Data in a DataFrame

In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 

9. Next Steps