The aim of this project is to clean a dataset of used cars from eBay Kleinanzeigen (a classifieds section of the German eBay website) and analyze the included used car listings.
The dataset was originally scraped and uploaded to Kaggle. In this project we are using its reduced version (50,000 data points sampled from the full dataset) with more dirty data. The original dataset, as well as the data dictionary, can be found here.
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding='Latin-1')
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | VW_Golf_Tuning_in_siber/grau | privat | Angebot | $999 | test | NaN | 2017 | manuell | 90 | NaN | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-14 00:00:00 | 0 | 86157 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | $350 | control | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | $5,299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50,000km | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | $1,350 | control | limousine | 1999 | manuell | 101 | a3 | 150,000km | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | $3,999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150,000km | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | $18,900 | test | coupe | 1982 | automatik | 203 | NaN | 80,000km | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | $350 | test | kombi | 1999 | manuell | 101 | vectra | 150,000km | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | $5,500 | test | coupe | 1990 | manuell | 205 | scirocco | 150,000km | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | $300 | control | bus | 1995 | manuell | 150 | 3er | 150,000km | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | $4,150 | control | suv | 2004 | manuell | 124 | andere | 150,000km | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | $3,500 | test | kombi | 2003 | manuell | 131 | a4 | 150,000km | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | $41,500 | test | cabrio | 2004 | manuell | 320 | 911 | 150,000km | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | $25,450 | control | cabrio | 2015 | manuell | 184 | cooper | 10,000km | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | $7,999 | control | bus | 2010 | manuell | 120 | NaN | 150,000km | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | $48,500 | control | limousine | 2014 | automatik | 306 | 5er | 30,000km | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
25 | 2016-03-21 21:56:18 | Ford_escort_kombi_an_bastler_mit_ghia_ausstattung | privat | Angebot | $90 | control | kombi | 1996 | manuell | 116 | NaN | 150,000km | 4 | benzin | ford | ja | 2016-03-21 00:00:00 | 0 | 27574 | 2016-04-01 05:16:49 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | $777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125,000km | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
27 | 2016-03-27 18:45:01 | Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE | privat | Angebot | $0 | control | NaN | 2005 | NaN | 0 | NaN | 150,000km | 0 | NaN | ford | NaN | 2016-03-27 00:00:00 | 0 | 66701 | 2016-03-27 18:45:01 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | $5,250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150,000km | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | $4,999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150,000km | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | $15,800 | control | bus | 2010 | automatik | 136 | c4 | 60,000km | 4 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0 | 14947 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | $950 | test | kleinwagen | 2001 | manuell | 50 | lupo | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 65197 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | $3,300 | control | bus | 2004 | automatik | 150 | vito | 150,000km | 10 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0 | 65326 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | $6,000 | control | cabrio | 2004 | manuell | 163 | slk | 150,000km | 11 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 53567 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | $0 | control | cabrio | 1983 | manuell | 70 | golf | 150,000km | 2 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 8209 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | $9,700 | control | kleinwagen | 2012 | automatik | 88 | jazz | 100,000km | 11 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0 | 84385 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | $5,900 | test | kombi | 1992 | automatik | 150 | 80 | 150,000km | 12 | benzin | audi | nein | 2016-03-19 00:00:00 | 0 | 36100 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | $5,500 | control | limousine | 2003 | manuell | 116 | c_klasse | 150,000km | 2 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0 | 33739 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | $900 | control | limousine | 1996 | automatik | 136 | e_klasse | 150,000km | 9 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0 | 24405 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | $11,000 | test | kleinwagen | 2011 | manuell | 90 | polo | 70,000km | 11 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 48455 | 2016-04-07 01:45:12 |
49980 | 2016-03-12 10:55:54 | Ford_Escort_Turnier_16V | privat | Angebot | $400 | control | kombi | 1995 | manuell | 105 | escort | 125,000km | 3 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 56218 | 2016-04-06 17:16:49 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | $2,000 | control | kombi | 1998 | manuell | 115 | astra | 150,000km | 12 | benzin | opel | nein | 2016-03-15 00:00:00 | 0 | 86859 | 2016-04-05 17:21:46 |
49982 | 2016-03-29 18:51:08 | Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm | privat | Angebot | $1,950 | control | kleinwagen | 2004 | manuell | 0 | fabia | 90,000km | 7 | benzin | skoda | NaN | 2016-03-29 00:00:00 | 0 | 45884 | 2016-03-29 18:51:08 |
49983 | 2016-03-06 12:43:04 | Ford_focus_99 | privat | Angebot | $600 | test | kleinwagen | 1999 | manuell | 101 | focus | 150,000km | 4 | benzin | ford | NaN | 2016-03-06 00:00:00 | 0 | 52477 | 2016-03-09 06:16:08 |
49984 | 2016-03-31 22:48:48 | Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... | privat | Angebot | $0 | test | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | NaN | sonstige_autos | NaN | 2016-03-31 00:00:00 | 0 | 12103 | 2016-04-02 19:44:53 |
49985 | 2016-04-02 16:38:23 | Verkaufe_meinen_vw_vento! | privat | Angebot | $1,000 | control | NaN | 1995 | automatik | 0 | NaN | 150,000km | 0 | benzin | volkswagen | NaN | 2016-04-02 00:00:00 | 0 | 30900 | 2016-04-06 15:17:52 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | $15,900 | control | limousine | 2010 | automatik | 218 | 300c | 125,000km | 11 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0 | 73527 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | $21,990 | control | limousine | 2013 | manuell | 150 | a3 | 50,000km | 11 | diesel | audi | nein | 2016-03-22 00:00:00 | 0 | 94362 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | $9,550 | control | coupe | 2001 | manuell | 231 | 3er | 150,000km | 10 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0 | 83646 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | $150 | test | kleinwagen | 1997 | manuell | 0 | polo | 150,000km | 5 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0 | 21244 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | $17,500 | test | limousine | 2012 | manuell | 156 | a_klasse | 30,000km | 12 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0 | 58239 | 2016-04-06 22:46:57 |
49991 | 2016-03-06 15:25:19 | Kleinwagen | privat | Angebot | $500 | control | NaN | 2016 | manuell | 0 | twingo | 150,000km | 0 | benzin | renault | NaN | 2016-03-06 00:00:00 | 0 | 61350 | 2016-03-06 18:24:19 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | $4,800 | control | kleinwagen | 2009 | manuell | 120 | andere | 125,000km | 9 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0 | 68642 | 2016-03-13 01:44:51 |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | privat | Angebot | $1,650 | control | kleinwagen | 1997 | manuell | 0 | NaN | 150,000km | 7 | benzin | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | $5,000 | control | kombi | 2001 | automatik | 299 | a6 | 150,000km | 1 | benzin | audi | nein | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
Some observations can be made here.
'vehicleType'
, 'gearbox'
, 'model'
, 'fuelType'
, 'notRepairedDamage'
) have null values, for 'notRepairedDamage'
up to 20% of all values.'price'
and "odometer'
, which will probably have to be fixed to avoid issues).Let's investigate now the first 10 rows of the dataset.
autos.head(10)
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
'seller'
, 'offerType'
, 'gearbox'
, 'notRepairedDamage'
) values are written in German.'price'
and 'odometer'
show the same units for each value.'seller'
, 'offerType'
, 'notRepairedDamage'
, 'nrOfPictures'
seem to have a dominant value each (with 'notRepairedDamage'
also having a lot of null values) and probably will not represent much interest for further data analysis.At this stage we convert the column names from camelcase to snakecase and reword some of the column names based on the data dictionary to be more descriptive.
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
For converting the column names from camelcase to snakecase, we will use the following function:
def camel_to_snake(str):
res = [str[0].lower()]
for c in str[1:]:
if c in ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
res.append('_')
res.append(c.lower())
else:
res.append(c)
return ''.join(res)
# Converting the column names
new_columns = []
for c in autos.columns:
column_snake = camel_to_snake(c)
new_columns.append(column_snake)
autos.columns = new_columns
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'year_of_registration', 'gearbox', 'power_p_s', 'model', 'odometer', 'month_of_registration', 'fuel_type', 'brand', 'not_repaired_damage', 'date_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
In this way the column names look more Python-style. However, for making them more descriptive, it's worth doing some further renaming here.
# Renaming columns
autos.rename({'abtest': 'ab_test', 'year_of_registration': 'registration_year',
'power_p_s': 'power_ps', 'month_of_registration': 'registration_month',
'not_repaired_damage': 'unrepaired_damage', 'date_created': 'ad_created',
'nr_of_pictures': 'number_pictures'}, axis=1, inplace=True)
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'number_pictures', 'postal_code', 'last_seen'], dtype='object')
autos.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | number_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Now all the column names are in line with the Python's naming convention and represent clear and informative description of their contents.
Let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for:
# Checking the dataset statistics including string columns
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | number_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-23 19:38:20 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
From these dataset statistics we can draw many insights.
'seller'
, 'offer_type'
and 'number_pictures'
have practically one value each, which is exactly what we supposed also earlier, analyzing the first 10 rows of the dataset. These columns have to be dropped.'price'
and 'odometer'
columns are numeric data stored as text (again, as we noticed earlier). They have to be cleaned and converted to a numeric data type.'registration_year'
, 'registration_month'
and 'power_ps'
have obvious outliers. The incorrect data has to be removed or corrected.'price'
column: 1,421 out of 50,000 entries have a price of 0$, meaning that around 3% of the cars of our dataset are actually free of charge.Let's drop the non-informative columns 'seller'
, 'offer_type'
, and 'number_pictures'
.
# Dropping columns
autos.drop(['seller', 'offer_type', 'number_pictures'], axis=1, inplace=True)
There is one more interesting column to have a look at: 'fuel_type'
. Out of 7 different fuel types represented in the dataset benzine is the most common one (60% of all the entries, including null values). What are the other 6?
autos['fuel_type'].value_counts()
benzin 30107 diesel 14567 lpg 691 cng 75 hybrid 37 andere 22 elektro 19 Name: fuel_type, dtype: int64
We see that the second common fuel type is diesel (quite predictable), while the other types, all together less than 2%, are liquified/compressed natural gas, hybrid, electro and "other" ("andere" in German).
Next, we are going to identify categorical data that uses German words, translate them and map to their English counterparts. After deleting the non-informative columns, we have the following ones containg German words: 'gearbox'
, 'fuel_type'
and 'unrepaired_damage'
. Let's investigate their vocabulary.
print(autos['gearbox'].unique())
print(autos['fuel_type'].unique())
print(autos['unrepaired_damage'].unique())
['manuell' 'automatik' nan] ['lpg' 'benzin' 'diesel' nan 'cng' 'hybrid' 'elektro' 'andere'] ['nein' nan 'ja']
Let's create a unique "German dictionary" for these words and map them into English meanings.
# Creating a dictionary
german_dict = {
'manuell': 'manual',
'automatik': 'automatic',
'benzin': 'benzine',
'elektro': 'electro',
'andere': 'other',
'nein': 'no',
'ja': 'yes',
# Next words remain the same, but have to be mapped to avoid NaN
'lpg': 'lpg',
'diesel': 'diesel',
'cng': 'cng',
'hybrid': 'hybrid'
}
# Mapping German words in English
autos['gearbox'] = autos['gearbox'].map(german_dict)
autos['fuel_type'] = autos['fuel_type'].map(german_dict)
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(german_dict)
# Checking updated unique values
print(autos['gearbox'].unique())
print(autos['fuel_type'].unique())
print(autos['unrepaired_damage'].unique())
['manual' 'automatic' nan] ['lpg' 'benzine' 'diesel' nan 'cng' 'hybrid' 'electro' 'other'] ['no' nan 'yes']
Odometer
and Price
Columns: Looking for Outliers¶Let's return to the 'price'
and 'odometer'
columns: we have to remove any non-numeric characters ('$' and 'km' correspondingly and the symbol of comma from both), convert the columns to a numeric data type and rename the columns for not to lose the information about units.
# Removing non-numeric characters and converting to a numeric dtype
autos['price'] = autos['price'].str.replace('$','').str.replace(',','').astype(int)
autos['odometer'] = autos['odometer'].str.replace('km','').str.replace(',','').astype(int)
# Renaming columns
autos.rename({'price': 'price_dollars', 'odometer': 'odometer_km'}, axis=1, inplace=True)
autos.head()
date_crawled | name | price_dollars | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manual | 158 | andere | 150000 | 3 | lpg | peugeot | no | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatic | 286 | 7er | 150000 | 6 | benzine | bmw | no | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manual | 102 | golf | 70000 | 7 | benzine | volkswagen | no | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | kleinwagen | 2007 | automatic | 71 | fortwo | 70000 | 6 | benzine | smart | no | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | kombi | 2003 | manual | 0 | focus | 150000 | 7 | benzine | ford | no | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
Now the columns 'price_dollars'
and 'odometer_km'
are in the right format to be analyzed.
autos['price_dollars'].unique().shape
(2357,)
autos['odometer_km'].unique().shape
(13,)
The column 'price_dollars'
has 2357 unique values, the column 'odometer_km'
13 ones.
Let's look at the statistics of prices.
autos['price_dollars'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price_dollars, dtype: float64
It seems that there is an outlier in prices: the maximum value is 5 orders bigger than the value of the third quartile (75%), and in general 100,000,000$ doesn't look a realistic price for a used car on sale. As for the minimum value equals 0, it looks ok and represents the cars offered free of charge.
We will investigate in more detail the maximum value now.
# Finding the 5 biggest prices in descending order
five_biggest_prices = autos['price_dollars'].value_counts().sort_index(ascending=False).head()
five_biggest_prices
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 Name: price_dollars, dtype: int64
Hence, the maximum value is clearly wrong, probably it was intended to represent an undefined value. We have to remove it from the dataset.
# Removing the outlier
autos = autos.loc[autos['price_dollars'].between(0, 99999998), :]
# Double-checking the 5 biggest prices
five_biggest_prices = autos['price_dollars'].value_counts().sort_index(ascending=False).head()
five_biggest_prices
27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 Name: price_dollars, dtype: int64
autos['price_dollars'].describe()
count 4.999900e+04 mean 7.840201e+03 std 1.774639e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 2.732222e+07 Name: price_dollars, dtype: float64
Now the values of the 'price_dollars'
column are cleaned and meaningful, with the maximum price of 27,322,220$, the average of 7,840$ and the minimum free of charge.
Let's conduct a similar analysis for the 'odometer_km'
column.
autos['odometer_km'].describe()
count 49999.000000 mean 125732.214644 std 40042.465064 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
odometer_values = autos['odometer_km'].value_counts().sort_index().head()
odometer_values.head()
5000 967 10000 264 20000 784 30000 789 40000 819 Name: odometer_km, dtype: int64
odometer_values.tail()
5000 967 10000 264 20000 784 30000 789 40000 819 Name: odometer_km, dtype: int64
In this column we don't have any outliers. The value range is from 5,000 to 150,000 km, with the average of 125,732 km. These statistics are heavily influenced by the frequency of the maximum value: we can deduce from the previous cell that about 65% of cars in our dataset have driven 150,000 km.
Let's now understand the date range the data covers.
There are 5 columns representing date values. Some of these columns were created by the crawler, some came from the website itself.
'date_crawled'
: added by the crawler'last_seen'
: added by the crawler'ad_created'
: from the website'registration_month'
: from the website'registration_year'
: from the websiteThe first 3 of these columns are identified as string values by pandas so we need to convert them into a numerical representation for further quantitative analysis. The other 2 columns are already numeric.
Let's first understand how the values in the 3 string columns are formatted.
autos[['date_crawled','ad_created','last_seen']].head()
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
These columns all represent full timestamp values, with the first 10 characters representing the day and the last 5 - the time. For our purposes we are interested only in days.
We are going to calculate the distribution of values of dates in the 'date_crawled'
, 'ad_created'
, and 'last_seen'
columns as percentages, including missing values and ranking by date from earliest to latest.
# Extracting days, counting values in % including missing values, and sorting from earliest to latest
data_crawled_dates = autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
data_crawled_dates
2016-03-05 0.025381 2016-03-06 0.013940 2016-03-07 0.035961 2016-03-08 0.033301 2016-03-09 0.033221 2016-03-10 0.032121 2016-03-11 0.032481 2016-03-12 0.036781 2016-03-13 0.015560 2016-03-14 0.036621 2016-03-15 0.033981 2016-03-16 0.029501 2016-03-17 0.031521 2016-03-18 0.013060 2016-03-19 0.034901 2016-03-20 0.037821 2016-03-21 0.037521 2016-03-22 0.032921 2016-03-23 0.032381 2016-03-24 0.029101 2016-03-25 0.031741 2016-03-26 0.032481 2016-03-27 0.031041 2016-03-28 0.034841 2016-03-29 0.034181 2016-03-30 0.033621 2016-03-31 0.031921 2016-04-01 0.033801 2016-04-02 0.035401 2016-04-03 0.038681 2016-04-04 0.036521 2016-04-05 0.013100 2016-04-06 0.003180 2016-04-07 0.001420 Name: date_crawled, dtype: float64
We can observe that the range of dates when each ad in our dataset was first crawled is from the 5th of March 2016 till the 7th of April 2016. The distribution of frequencies looks quite consistent.
# Extracting days, counting values in % including missing values, and sorting from earliest to latest
ad_created_dates = autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
ad_created_dates
2015-06-11 0.000020 2015-08-10 0.000020 2015-09-09 0.000020 2015-11-10 0.000020 2015-12-05 0.000020 2015-12-30 0.000020 2016-01-03 0.000020 2016-01-07 0.000020 2016-01-10 0.000040 2016-01-13 0.000020 2016-01-14 0.000020 2016-01-16 0.000020 2016-01-22 0.000020 2016-01-27 0.000060 2016-01-29 0.000020 2016-02-01 0.000020 2016-02-02 0.000040 2016-02-05 0.000040 2016-02-07 0.000020 2016-02-08 0.000020 2016-02-09 0.000040 2016-02-11 0.000020 2016-02-12 0.000060 2016-02-14 0.000040 2016-02-16 0.000020 2016-02-17 0.000020 2016-02-18 0.000040 2016-02-19 0.000060 2016-02-20 0.000040 2016-02-21 0.000060 ... 2016-03-09 0.033241 2016-03-10 0.031861 2016-03-11 0.032781 2016-03-12 0.036621 2016-03-13 0.016920 2016-03-14 0.035221 2016-03-15 0.033741 2016-03-16 0.030001 2016-03-17 0.031201 2016-03-18 0.013720 2016-03-19 0.033841 2016-03-20 0.037861 2016-03-21 0.037721 2016-03-22 0.032781 2016-03-23 0.032181 2016-03-24 0.029081 2016-03-25 0.031881 2016-03-26 0.032561 2016-03-27 0.030901 2016-03-28 0.034961 2016-03-29 0.034141 2016-03-30 0.033441 2016-03-31 0.031921 2016-04-01 0.033801 2016-04-02 0.035081 2016-04-03 0.038921 2016-04-04 0.036881 2016-04-05 0.011840 2016-04-06 0.003260 2016-04-07 0.001280 Name: ad_created, Length: 76, dtype: float64
In our dataset the first eBay listing was created the 11th of June 2015 and the last - the 7th of April 2016. As for the distribution, the frequencies look very low at the begining and then grow significantly. Let's try to trace where this transition takes place.
ad_created_dates[30:]
2016-02-22 0.000020 2016-02-23 0.000080 2016-02-24 0.000040 2016-02-25 0.000060 2016-02-26 0.000040 2016-02-27 0.000120 2016-02-28 0.000200 2016-02-29 0.000160 2016-03-01 0.000100 2016-03-02 0.000100 2016-03-03 0.000860 2016-03-04 0.001440 2016-03-05 0.023040 2016-03-06 0.015120 2016-03-07 0.034741 2016-03-08 0.033341 2016-03-09 0.033241 2016-03-10 0.031861 2016-03-11 0.032781 2016-03-12 0.036621 2016-03-13 0.016920 2016-03-14 0.035221 2016-03-15 0.033741 2016-03-16 0.030001 2016-03-17 0.031201 2016-03-18 0.013720 2016-03-19 0.033841 2016-03-20 0.037861 2016-03-21 0.037721 2016-03-22 0.032781 2016-03-23 0.032181 2016-03-24 0.029081 2016-03-25 0.031881 2016-03-26 0.032561 2016-03-27 0.030901 2016-03-28 0.034961 2016-03-29 0.034141 2016-03-30 0.033441 2016-03-31 0.031921 2016-04-01 0.033801 2016-04-02 0.035081 2016-04-03 0.038921 2016-04-04 0.036881 2016-04-05 0.011840 2016-04-06 0.003260 2016-04-07 0.001280 Name: ad_created, dtype: float64
An evident transition is observed on the 5th of March, exactly the day when the first adds were crawled from the website.
# Extracting days, counting values in % including missing values, and sorting from earliest to latest
last_seen_dates = autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
last_seen_dates
2016-03-05 0.001080 2016-03-06 0.004420 2016-03-07 0.005360 2016-03-08 0.007600 2016-03-09 0.009860 2016-03-10 0.010760 2016-03-11 0.012520 2016-03-12 0.023820 2016-03-13 0.008980 2016-03-14 0.012800 2016-03-15 0.015880 2016-03-16 0.016440 2016-03-17 0.027921 2016-03-18 0.007420 2016-03-19 0.015740 2016-03-20 0.020700 2016-03-21 0.020740 2016-03-22 0.021580 2016-03-23 0.018580 2016-03-24 0.019560 2016-03-25 0.019200 2016-03-26 0.016960 2016-03-27 0.016020 2016-03-28 0.020860 2016-03-29 0.022340 2016-03-30 0.024840 2016-03-31 0.023840 2016-04-01 0.023100 2016-04-02 0.024900 2016-04-03 0.025361 2016-04-04 0.024620 2016-04-05 0.124282 2016-04-06 0.220984 2016-04-07 0.130923 Name: last_seen, dtype: float64
The range of dates the crawler saw each ad last online is also (like for 'date_crawled'
) from the 5th of March 2016 till the 7th of April 2016. The distribution of frequencies mostly looks consistent, increasing only the last 3 days.
Let's take a look now at the statistics of the 'registration_year'
column, which is already numeric.
autos['registration_year'].describe()
count 49999.000000 mean 2005.073401 std 105.713866 min 1000.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
We clearly see 2 issues here: a wrong minimum value of 1000 year and a wrong maximum one of 9999 year. These values certainly bias also the other statistics (mean and standard deviation).
Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. As for the earliest valid year, realistically, it could be somewhere in the first few decades of the 1900s.
Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.
autos['registration_year'].value_counts().sort_index()
1000 1 1001 1 1111 1 1500 1 1800 2 1910 9 1927 1 1929 1 1931 1 1934 2 1937 4 1938 1 1939 1 1941 2 1943 1 1948 1 1950 3 1951 2 1952 1 1953 1 1954 2 1955 2 1956 5 1957 2 1958 4 1959 7 1960 34 1961 6 1962 4 1963 9 ... 2001 2703 2002 2533 2003 2727 2004 2737 2005 3015 2006 2708 2007 2304 2008 2231 2009 2098 2010 1597 2011 1634 2012 1323 2013 806 2014 666 2015 399 2016 1316 2017 1453 2018 492 2019 3 2800 1 4100 1 4500 1 4800 1 5000 4 5911 1 6200 1 8888 1 9000 2 9996 1 9999 4 Name: registration_year, Length: 97, dtype: int64
The earliest plausible value of a registration year is 1910. It's worth looking at the 9 entries with that year in more detail.
autos[autos['registration_year'] == 1910]
date_crawled | name | price_dollars | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3679 | 2016-04-04 00:36:17 | Suche_Auto | 1 | test | NaN | 1910 | NaN | 0 | NaN | 5000 | 0 | NaN | sonstige_autos | NaN | 2016-04-04 00:00:00 | 40239 | 2016-04-04 07:49:15 |
15898 | 2016-03-08 10:50:05 | Tausch_alles_aus_meinen_Anzeigen_gegen_Auto | 0 | test | NaN | 1910 | NaN | 0 | NaN | 5000 | 0 | NaN | sonstige_autos | NaN | 2016-03-08 00:00:00 | 6108 | 2016-03-08 17:47:19 |
22659 | 2016-03-14 08:51:18 | Opel_Corsa_B | 500 | test | NaN | 1910 | NaN | 0 | corsa | 150000 | 0 | NaN | opel | NaN | 2016-03-14 00:00:00 | 52393 | 2016-04-03 07:53:55 |
28693 | 2016-03-22 17:48:41 | Renault_Twingo | 599 | control | kleinwagen | 1910 | manual | 0 | NaN | 5000 | 0 | benzine | renault | NaN | 2016-03-22 00:00:00 | 70376 | 2016-04-06 09:16:59 |
30781 | 2016-03-25 13:47:46 | Opel_Calibra_V6_DTM_Bausatz_1:24 | 30 | test | NaN | 1910 | NaN | 0 | calibra | 100000 | 0 | NaN | opel | NaN | 2016-03-25 00:00:00 | 47638 | 2016-03-26 23:46:29 |
33295 | 2016-03-12 10:49:42 | Ich_verkaufe/suche_fuer_sie_ihr_Auto!! | 0 | control | NaN | 1910 | NaN | 0 | NaN | 5000 | 0 | NaN | sonstige_autos | NaN | 2016-03-12 00:00:00 | 37075 | 2016-03-19 07:15:34 |
42181 | 2016-03-27 19:50:53 | SAMSUNG_55_3D_Tv_und_Soundbar_gegen_Auto | 0 | test | NaN | 1910 | NaN | 0 | NaN | 5000 | 0 | NaN | sonstige_autos | NaN | 2016-03-27 00:00:00 | 57080 | 2016-04-06 01:15:30 |
45157 | 2016-03-11 22:37:01 | Motorhaube | 15 | control | NaN | 1910 | NaN | 0 | NaN | 5000 | 0 | NaN | trabant | NaN | 2016-03-11 00:00:00 | 90491 | 2016-03-25 11:18:57 |
46213 | 2016-04-02 13:47:16 | Bellier_Vario | 0 | test | kleinwagen | 1910 | NaN | 0 | NaN | 5000 | 1 | other | sonstige_autos | NaN | 2016-04-02 00:00:00 | 93105 | 2016-04-04 11:16:30 |
The majority of columns for these entries have missing values. The 'odometer_km'
column looks realistic, with low values for almost all the cars, which is in line with the technology level of more than 100 years ago. The prices are low (up to 0) enough for just old cars but unexpectedly low for something considered a rarity. In any case, a significant proportion of missing values suggests that these rows don't represent particular interest for further analysis and can be removed.
Let's investigate 3 next early years from our value counts exploration: 1927, 1929, and 1931.
autos[autos['registration_year'] == 1927]
date_crawled | name | price_dollars | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
21416 | 2016-03-12 08:36:21 | Essex_super_six__Ford_A | 16500 | control | cabrio | 1927 | manual | 40 | andere | 5000 | 5 | benzine | ford | NaN | 2016-03-12 00:00:00 | 74821 | 2016-03-15 12:45:12 |
autos[autos['registration_year'] == 1929]
date_crawled | name | price_dollars | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
22101 | 2016-03-09 16:51:17 | BMW_Andere | 11500 | test | cabrio | 1929 | manual | 15 | andere | 5000 | 1 | NaN | bmw | yes | 2016-03-09 00:00:00 | 70569 | 2016-04-07 06:17:11 |
autos[autos['registration_year'] == 1931]
date_crawled | name | price_dollars | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11246 | 2016-03-26 19:49:59 | Ford_Model_A_Roadster_Deluxe_1931 | 27500 | control | cabrio | 1931 | manual | 39 | andere | 10000 | 7 | benzine | ford | no | 2016-03-26 00:00:00 | 9322 | 2016-04-06 09:46:59 |
In these 3 rows there are almost no missing values, and all the other values look plausible: high prices for such rarities, cabrio type of vehicle, manual gearbox, low horsepower, low odometer values.
Hence, it seems reasonable to use the year of 1927 as the lowest acceptable value and 2016 (discussed earlier) as the highest one. All the values outside this interval have to be removed.
# Removing years outside the interval
autos = autos.loc[autos['registration_year'].between(1927, 2016)]
Let's calculate the distribution of frequencies of the remaining values.
autos['registration_year'].value_counts(normalize=True).sort_index()
1927 0.000021 1929 0.000021 1931 0.000021 1934 0.000042 1937 0.000083 1938 0.000021 1939 0.000021 1941 0.000042 1943 0.000021 1948 0.000021 1950 0.000062 1951 0.000042 1952 0.000021 1953 0.000021 1954 0.000042 1955 0.000042 1956 0.000104 1957 0.000042 1958 0.000083 1959 0.000146 1960 0.000708 1961 0.000125 1962 0.000083 1963 0.000187 1964 0.000250 1965 0.000354 1966 0.000458 1967 0.000562 1968 0.000541 1969 0.000396 ... 1987 0.001562 1988 0.002957 1989 0.003769 1990 0.008226 1991 0.007414 1992 0.008143 1993 0.009267 1994 0.013745 1995 0.027344 1996 0.030072 1997 0.042234 1998 0.051085 1999 0.062456 2000 0.069849 2001 0.056291 2002 0.052751 2003 0.056791 2004 0.056999 2005 0.062789 2006 0.056396 2007 0.047982 2008 0.046462 2009 0.043692 2010 0.033258 2011 0.034029 2012 0.027552 2013 0.016785 2014 0.013870 2015 0.008309 2016 0.027406 Name: registration_year, Length: 77, dtype: float64
The frequencies for the older years are, naturally, very low. For more recent years the values are much higher, showing, however, some fluctuations from one year to another.
Our next step is to explore average price variations across different car brands. To start, let's take a look at the unique brands available in our dataset.
autos['brand'].unique()
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos', 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia'], dtype=object)
Now let's have a look at number of cars (in % from the total number) by brand.
# Brands by percentage of the total number of cars
brands = autos['brand'].value_counts(normalize=True)
brands
volkswagen 0.212170 bmw 0.110042 opel 0.108147 mercedes_benz 0.095360 audi 0.086405 ford 0.069807 renault 0.047336 peugeot 0.029531 fiat 0.025865 seat 0.018181 skoda 0.016036 mazda 0.015140 nissan 0.015099 citroen 0.013932 smart 0.013911 toyota 0.012474 sonstige_autos 0.010850 hyundai 0.009850 volvo 0.009247 mini 0.008643 mitsubishi 0.008143 honda 0.007851 kia 0.007102 alfa_romeo 0.006623 porsche 0.006102 suzuki 0.005914 chevrolet 0.005706 chrysler 0.003665 dacia 0.002562 daihatsu 0.002562 jeep 0.002249 subaru 0.002187 land_rover 0.002041 saab 0.001604 jaguar 0.001583 trabant 0.001541 daewoo 0.001499 rover 0.001354 lancia 0.001083 lada 0.000604 Name: brand, dtype: float64
# Finding brands with 5% and more of the total number of cars
top_brands = brands[brands > 0.05].index
top_brands
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
A few brands have over 5% of the total number of cars each. Let's calculate average price for them.
# Creating an empty dictionary
top_brands_average_prices = {}
# Looping over the top brands and assigning the average price to the dictionary
for brand in top_brands:
top_brands_average_prices[brand] = autos[autos['brand'] == brand]['price_dollars'].mean()
top_brands_average_prices
{'audi': 9093.65003615329, 'bmw': 8334.645155185466, 'ford': 7263.015811455847, 'mercedes_benz': 8485.239571958942, 'opel': 5254.537454265357, 'volkswagen': 6516.457597173145}
Here we can notice that Volkswagen, being the most common brand in our dataset with a big gap from the rest, has relatively low average price among the other top brands. Audi, being the 4th common brand, has the highest average price among the top ones, while Opel the lowest, with a distinct difference between these two of 3,839$.
For the top 6 brands, we will analyze average mileage and if there is any visible link with average price.
# Converting the average price dictionary into a series object
prices_series = pd.Series(top_brands_average_prices)
prices_series
audi 9093.650036 bmw 8334.645155 ford 7263.015811 mercedes_benz 8485.239572 opel 5254.537454 volkswagen 6516.457597 dtype: float64
# Creating an empty dictionary for average mileage values
top_brands_average_mileage = {}
# Looping over the top brands and assigning the average mileage to the dictionary
for brand in top_brands:
top_brands_average_mileage[brand] = autos[autos['brand'] == brand]['odometer_km'].mean()
top_brands_average_mileage
{'audi': 129287.78018799711, 'bmw': 132434.70855412565, 'ford': 124046.83770883054, 'mercedes_benz': 130856.0821139987, 'opel': 129228.76949740035, 'volkswagen': 128730.36906164115}
# Converting the average mileage dictionary into a series object
mileage_series = pd.Series(top_brands_average_mileage)
mileage_series
audi 129287.780188 bmw 132434.708554 ford 124046.837709 mercedes_benz 130856.082114 opel 129228.769497 volkswagen 128730.369062 dtype: float64
# Creating a single dataframe from both series objects
prices_mileage = pd.DataFrame(prices_series, columns=['average_price'])
prices_mileage['average_mileage'] = mileage_series
prices_mileage
average_price | average_mileage | |
---|---|---|
audi | 9093.650036 | 129287.780188 |
bmw | 8334.645155 | 132434.708554 |
ford | 7263.015811 | 124046.837709 |
mercedes_benz | 8485.239572 | 130856.082114 |
opel | 5254.537454 | 129228.769497 |
volkswagen | 6516.457597 | 128730.369062 |
Logically, it would be expectable to find an inverse relation between average prices and average mileage. However, we observe just the opposite: the 3 brands with the highest average prices (Audi, Mercedes-Benz and BMW) have also the biggest average mileage values. On the other hand, the variation of mileage here is not significant (7% between extreme values), so this factor is not of much use in our case.
Let's investigate the most common combinations of brands and models in our dataset. We can try to find, say, all the models represented by 1,500 and more cars.
# Finding combinations brand/model represented by 1,500 and more cars each
brand_model = autos.groupby('brand').model.value_counts().loc[lambda x : x>1500]
print(brand_model)
brand model bmw 3er 2688 opel corsa 1644 volkswagen golf 3815 polo 1677 Name: model, dtype: int64
The most common brand / model combinations are: Volkswagen Golf, BMW 3er, Volkswagen Polo, and Opel Corsa. Not surprising that Volkswagen is leading here: from the previous analysis we remember that the brand itself is the most common one, counting 21% of all the cars in the dataset.
Now we will investigate how mileage influences an average car price. For this purpose, it is convenient to split the 'odometer_km'
values into groups, but first we have to check these values.
autos['odometer_km'].unique()
array([150000, 70000, 50000, 80000, 10000, 30000, 125000, 90000, 20000, 60000, 5000, 40000, 100000])
We can define the following 7 groups of values (in thousands km): 5-10, 20-30, 40-50, 60-70, 80-90, 100-125, and 150. Let's calculate average prices for them.
# Grouping by mileage and calculating average price for each group
odometer_5_10 = autos[autos['odometer_km'] <= 100000 ]['price_dollars'].mean()
odometer_20_30 = autos[(autos['odometer_km'] >= 20000) & (autos['odometer_km'] <= 30000)]['price_dollars'].mean()
odometer_40_50 = autos[(autos['odometer_km'] >= 40000) & (autos['odometer_km'] <= 50000)]['price_dollars'].mean()
odometer_60_70 = autos[(autos['odometer_km'] >= 60000) & (autos['odometer_km'] <= 70000)]['price_dollars'].mean()
odometer_80_90 = autos[(autos['odometer_km'] >= 80000) & (autos['odometer_km'] <= 90000)]['price_dollars'].mean()
odometer_100_125 = autos[(autos['odometer_km'] >= 100000) & (autos['odometer_km'] <= 125000)]['price_dollars'].mean()
odometer_150 = autos[autos['odometer_km'] >= 150000]['price_dollars'].mean()
print(odometer_5_10)
print(odometer_20_30)
print(odometer_40_50)
print(odometer_60_70)
print(odometer_80_90)
print(odometer_100_125)
print(odometer_150)
15720.130828692545 17174.599216710183 36374.77998894417 11532.898544520547 8903.513906856404 8193.318008204838 4513.776098237019
We see that average prices follow some clear patterns. The highest average price (36,375$) is related to the used cars with medium values of mileage (40,000 - 50,000 km). Such cars can have been in use approximately about 2-3 years. The cars with smaller values of mileage, even though, obviously, being less used and, theoretically, better preserved, are twice cheaper. It's not surprising: a car with small mileage (used for a short period, or even those on sale soon after being bought) can look less attractive for potential buyers, because it can have some hidden flaws or problems that are not easy to detect from the beginning, so the current owner can prefer to get rid of it. For the values of mileage higher than 40,000 - 50,000 km (that is, for the cars having been more in use and, hence, more worn), average prices decrease again, and the higher the mileage the lower the average price, up to 4,514$ for the cars that ran 150,000 km.
Our last step will be to find out how much cheaper are cars with damage than their non-damaged counterparts. First, let's see how many cars have unrepaired damages.
autos['unrepaired_damage'].value_counts(dropna=False)
no 34255 NaN 8978 yes 4785 Name: unrepaired_damage, dtype: int64
We can colclude that at least 9.5% of all the cars (since we don't know the status of the numerous ones with null values) have unrepaired damages. Now we will calculate average prices of both categories.
# Calculating average prices for damaged and non-damaged cars
unrepaired = autos.groupby('unrepaired_damage')['price_dollars'].mean()
print(unrepaired)
unrepaired_damage no 8233.387360 yes 2335.379937 Name: price_dollars, dtype: float64
The difference of prices is significant: the damaged cars cost on average 3.5 times less than the non-damaged ones. For curiosity, let's have a look at the most expensive car among the damaged ones.
# Finding the most expensive car among the damaged ones
autos.loc[autos[autos['unrepaired_damage'] == 'yes']['price_dollars'].idxmax()]
date_crawled 2016-03-29 11:38:54 name Volkswagen_Jetta_GT price_dollars 999990 ab_test test vehicle_type limousine registration_year 1985 gearbox manual power_ps 111 model jetta odometer_km 150000 registration_month 12 fuel_type benzine brand volkswagen unrepaired_damage yes ad_created 2016-03-29 00:00:00 postal_code 50997 last_seen 2016-03-29 11:38:54 Name: 37585, dtype: object
We see that classy cars are still quite expensive also among the damaged ones, even though, evidently, much cheaper than their non-damaged counterparts.
To summarize, in this project we have cleaned and explored a dataset of used cars from the German eBay website. Cleaning tasks included renaming columns according to the Python’s convention and giving them descriptive names, deleting non-informative columns, mapping German words to English, converting strings to numeric data types, removing outliers, correcting year range. Exploring the date columns, we found out that the ads were added on the website from June 2015 till April 2016 and crawled from March 2016 till April 2016. We identified the most common brands in our dataset, with Volkswagen at the first place with a big gap from all the others. Only 6 brands count more than 5% of the total number of cars each. For these top brands average prices and mileage values were calculated. We discovered that about 3% of cars in the dataset are free of charge, the most common brand / model combination is Volkswagen Golf, and that 65% of cars have the highest mileage (150,000 km). The mileage-based price patterns were analyzed. We found out that the highest average prices are related to the cars with average mileage of 40,000 – 50,000 km. For lower values of mileage, and especially for higher ones, average prices decrease. Lastly, we explored the difference of average prices between non-damaged and damaged cars and observed that the latter are on average 3.5 times cheaper.