The aim of this project is to clean data taken from eBay Kleinanzeigen, the section of the German eBay website related used cars. The goal of which will be a detailed analysis of the used car listings in the data set. The data is provided in a CSV file, so the first order of business will be to read the file into the analysis software.
import numpy as np
import pandas as pd
autos = pd.read_csv('autos.csv', encoding='Latin-1')
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | VW_Golf_Tuning_in_siber/grau | privat | Angebot | $999 | test | NaN | 2017 | manuell | 90 | NaN | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-14 00:00:00 | 0 | 86157 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | $350 | control | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | $5,299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50,000km | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | $1,350 | control | limousine | 1999 | manuell | 101 | a3 | 150,000km | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | $3,999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150,000km | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | $18,900 | test | coupe | 1982 | automatik | 203 | NaN | 80,000km | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | $350 | test | kombi | 1999 | manuell | 101 | vectra | 150,000km | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | $5,500 | test | coupe | 1990 | manuell | 205 | scirocco | 150,000km | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | $300 | control | bus | 1995 | manuell | 150 | 3er | 150,000km | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | $4,150 | control | suv | 2004 | manuell | 124 | andere | 150,000km | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | $3,500 | test | kombi | 2003 | manuell | 131 | a4 | 150,000km | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | $41,500 | test | cabrio | 2004 | manuell | 320 | 911 | 150,000km | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | $25,450 | control | cabrio | 2015 | manuell | 184 | cooper | 10,000km | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | $7,999 | control | bus | 2010 | manuell | 120 | NaN | 150,000km | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | $48,500 | control | limousine | 2014 | automatik | 306 | 5er | 30,000km | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
25 | 2016-03-21 21:56:18 | Ford_escort_kombi_an_bastler_mit_ghia_ausstattung | privat | Angebot | $90 | control | kombi | 1996 | manuell | 116 | NaN | 150,000km | 4 | benzin | ford | ja | 2016-03-21 00:00:00 | 0 | 27574 | 2016-04-01 05:16:49 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | $777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125,000km | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
27 | 2016-03-27 18:45:01 | Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE | privat | Angebot | $0 | control | NaN | 2005 | NaN | 0 | NaN | 150,000km | 0 | NaN | ford | NaN | 2016-03-27 00:00:00 | 0 | 66701 | 2016-03-27 18:45:01 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | $5,250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150,000km | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | $4,999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150,000km | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | $15,800 | control | bus | 2010 | automatik | 136 | c4 | 60,000km | 4 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0 | 14947 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | $950 | test | kleinwagen | 2001 | manuell | 50 | lupo | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 65197 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | $3,300 | control | bus | 2004 | automatik | 150 | vito | 150,000km | 10 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0 | 65326 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | $6,000 | control | cabrio | 2004 | manuell | 163 | slk | 150,000km | 11 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 53567 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | $0 | control | cabrio | 1983 | manuell | 70 | golf | 150,000km | 2 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 8209 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | $9,700 | control | kleinwagen | 2012 | automatik | 88 | jazz | 100,000km | 11 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0 | 84385 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | $5,900 | test | kombi | 1992 | automatik | 150 | 80 | 150,000km | 12 | benzin | audi | nein | 2016-03-19 00:00:00 | 0 | 36100 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | $5,500 | control | limousine | 2003 | manuell | 116 | c_klasse | 150,000km | 2 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0 | 33739 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | $900 | control | limousine | 1996 | automatik | 136 | e_klasse | 150,000km | 9 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0 | 24405 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | $11,000 | test | kleinwagen | 2011 | manuell | 90 | polo | 70,000km | 11 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 48455 | 2016-04-07 01:45:12 |
49980 | 2016-03-12 10:55:54 | Ford_Escort_Turnier_16V | privat | Angebot | $400 | control | kombi | 1995 | manuell | 105 | escort | 125,000km | 3 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 56218 | 2016-04-06 17:16:49 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | $2,000 | control | kombi | 1998 | manuell | 115 | astra | 150,000km | 12 | benzin | opel | nein | 2016-03-15 00:00:00 | 0 | 86859 | 2016-04-05 17:21:46 |
49982 | 2016-03-29 18:51:08 | Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm | privat | Angebot | $1,950 | control | kleinwagen | 2004 | manuell | 0 | fabia | 90,000km | 7 | benzin | skoda | NaN | 2016-03-29 00:00:00 | 0 | 45884 | 2016-03-29 18:51:08 |
49983 | 2016-03-06 12:43:04 | Ford_focus_99 | privat | Angebot | $600 | test | kleinwagen | 1999 | manuell | 101 | focus | 150,000km | 4 | benzin | ford | NaN | 2016-03-06 00:00:00 | 0 | 52477 | 2016-03-09 06:16:08 |
49984 | 2016-03-31 22:48:48 | Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... | privat | Angebot | $0 | test | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | NaN | sonstige_autos | NaN | 2016-03-31 00:00:00 | 0 | 12103 | 2016-04-02 19:44:53 |
49985 | 2016-04-02 16:38:23 | Verkaufe_meinen_vw_vento! | privat | Angebot | $1,000 | control | NaN | 1995 | automatik | 0 | NaN | 150,000km | 0 | benzin | volkswagen | NaN | 2016-04-02 00:00:00 | 0 | 30900 | 2016-04-06 15:17:52 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | $15,900 | control | limousine | 2010 | automatik | 218 | 300c | 125,000km | 11 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0 | 73527 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | $21,990 | control | limousine | 2013 | manuell | 150 | a3 | 50,000km | 11 | diesel | audi | nein | 2016-03-22 00:00:00 | 0 | 94362 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | $9,550 | control | coupe | 2001 | manuell | 231 | 3er | 150,000km | 10 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0 | 83646 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | $150 | test | kleinwagen | 1997 | manuell | 0 | polo | 150,000km | 5 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0 | 21244 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | $17,500 | test | limousine | 2012 | manuell | 156 | a_klasse | 30,000km | 12 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0 | 58239 | 2016-04-06 22:46:57 |
49991 | 2016-03-06 15:25:19 | Kleinwagen | privat | Angebot | $500 | control | NaN | 2016 | manuell | 0 | twingo | 150,000km | 0 | benzin | renault | NaN | 2016-03-06 00:00:00 | 0 | 61350 | 2016-03-06 18:24:19 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | $4,800 | control | kleinwagen | 2009 | manuell | 120 | andere | 125,000km | 9 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0 | 68642 | 2016-03-13 01:44:51 |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | privat | Angebot | $1,650 | control | kleinwagen | 1997 | manuell | 0 | NaN | 150,000km | 7 | benzin | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | $5,000 | control | kombi | 2001 | automatik | 299 | a6 | 150,000km | 1 | benzin | audi | nein | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
From a quick scan of the data, there are 20 columns, mostly strings. It can be seen that the column headings are formatted to camelcase instead of the preferred snakecase. The other key note is that the headers are in English while the data is in German, it was expected the entire data set including headers would be in German. There is also some numerical row data that is stored as an object, but could be converted to integer or float data with the removal of special characters.
The first step will be to clean up the column headings.
autos.columns #displays current column headings
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
columns = autos.columns
mapping_dict = {
'dateCrawled': 'date_crawled',
'name': 'name',
'seller': 'seller',
'offerType': 'offer_type',
'price': 'price',
'abtest': 'ab_test',
'vehicleType': 'vehicle_type',
'yearOfRegistration': 'registration_year',
'gearbox': 'gearbox',
'powerPS': 'power_ps',
'model': 'model',
'odometer': 'odometer',
'monthOfRegistration': 'registration_month',
'fuelType': 'fuel_type',
'brand': 'brand',
'notRepairedDamage': 'unrepaired_damage',
'dateCreated': 'ad_created',
'nrOfPictures': 'nr_of_pictures',
'postalCode': 'postal_code',
'lastSeen': 'last_seen'
}
autos.columns = pd.Series(columns).map(mapping_dict)
autos.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
The code above was utilized to change the column headers to the snakecase format and make certain header edits. This will make further coding easier, and provides column headers that are in a more logical presentation.
The next step will be to explore the data further to see if any columns can be excluded due to all values being the same, or for numeric data stored as strings which need converting.
autos.describe(include='all') #provide descriptive statistics for the DataFrame
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-21 16:37:21 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
There are multiple columns that contain just two unique inputs. The majority of these inputs will affect the price of the vehicle and should, therefore, be used in the analysis. There are two, however, that will not really change the analysis at all. The two columns that should be considered to drop would be "seller" and "offer_type".
Furthermore, "price" and "odometer" are numerical inputs stored as strings. These will be cleaned below.
autos['price'] = autos['price'].str.replace('$', '').str.replace(',', '').astype(int) #removing special characters to convert to integers
autos['odometer'] = autos['odometer'].str.replace('km', '').str.replace(',', '').astype(int) #see note above
autos.rename(columns={'odometer': 'odometer_km'}, inplace=True) # rename 'odometer' to convey unit of measure
# checking for outliers in 'price'
autos['price'].unique().shape # number of unique values
(2357,)
autos['price'].describe() # min/max/mean/median values
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
highest_prices = autos['price'].value_counts().sort_index(ascending=True).head()
lowest_prices = autos['price'].value_counts().sort_index(ascending=False).head()
print(highest_prices)
print(lowest_prices)
0 1421 1 156 2 3 3 1 5 2 Name: price, dtype: int64 99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 Name: price, dtype: int64
From the above it can be seen that there are several cars with a price below '100'. These cars would be worth more if sold for scrap in these cases. For the purposes of this analysis these cars should not be counted in car sales, so these will be excluded. Conversely, there are a few cars listed well above '1,000,000'. While it is possible that there are cars worth this much, anything above this level will be excluded as well.
autos = autos.loc[autos['price'].between(100,1000000), :] # exclude highest and lowest prices
print(autos)
date_crawled name \ 0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 5 2016-03-21 13:47:45 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... 6 2016-03-20 17:55:21 VW_Golf_III_GT_Special_Electronic_Green_Metall... 7 2016-03-16 18:55:19 Golf_IV_1.9_TDI_90PS 8 2016-03-22 16:51:34 Seat_Arosa 9 2016-03-16 13:47:02 Renault_Megane_Scenic_1.6e_RT_Klimaanlage 10 2016-03-15 01:41:36 VW_Golf_Tuning_in_siber/grau 11 2016-03-16 18:45:34 Mercedes_A140_Motorschaden 12 2016-03-31 19:48:22 Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... 13 2016-03-23 10:48:32 Audi_A3_1.6_tuning 14 2016-03-23 11:50:46 Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... 15 2016-04-01 12:06:20 Corvette_C3_Coupe_T_Top_Crossfire_Injection 16 2016-03-16 14:59:02 Opel_Vectra_B_Kombi 17 2016-03-29 11:46:22 Volkswagen_Scirocco_2_G60 18 2016-03-26 19:57:44 Verkaufen_mein_bmw_e36_320_i_touring 19 2016-03-17 13:36:21 mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 20 2016-03-05 19:57:31 Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... 21 2016-03-06 19:07:10 Porsche_911_Carrera_4S_Cabrio 22 2016-03-28 20:50:54 MINI_Cooper_S_Cabrio 23 2016-03-10 19:55:34 Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima 24 2016-04-03 11:57:02 BMW_535i_xDrive_Sport_Aut. 26 2016-04-03 22:46:28 Volkswagen_Polo_Fox 28 2016-03-19 21:56:19 MINI_Cooper_D 29 2016-04-02 12:45:44 Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... 31 2016-03-14 16:53:09 Noch_gut_erhaltenen_C_320 32 2016-03-20 05:03:03 Corsa_mit_TÜV_5.2016 ... ... ... 49968 2016-04-01 17:49:15 Mercedes_Benz_190_D_2.5_Automatik 49969 2016-03-17 18:49:02 Nissan_X_Trail_2.2_dCi_4x4_Sport_m.AHZ 49970 2016-03-21 22:47:37 c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... 49971 2016-03-29 14:54:12 W.Lupo_1.0 49972 2016-03-26 22:25:23 Mercedes_Benz_Vito_115_CDI_Extralang_Aut. 49973 2016-03-27 05:32:39 Mercedes_Benz_SLK_200_Kompressor 49975 2016-03-27 20:51:39 Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort 49976 2016-03-19 18:56:05 Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... 49977 2016-03-31 18:37:18 Mercedes_Benz_C200_Cdi_W203 49978 2016-04-04 10:37:14 Mercedes_Benz_E_200_Classic 49979 2016-03-20 18:38:40 Volkswagen_Polo_1.6_TDI_Style 49980 2016-03-12 10:55:54 Ford_Escort_Turnier_16V 49981 2016-03-15 09:38:21 Opel_Astra_Kombi_mit_Anhaengerkupplung 49982 2016-03-29 18:51:08 Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm 49983 2016-03-06 12:43:04 Ford_focus_99 49985 2016-04-02 16:38:23 Verkaufe_meinen_vw_vento! 49986 2016-04-04 20:46:02 Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... 49987 2016-03-22 20:47:27 Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... 49988 2016-03-28 19:49:51 BMW_330_Ci 49989 2016-03-11 19:50:37 VW_Polo_zum_Ausschlachten_oder_Wiederaufbau 49990 2016-03-21 19:54:19 Mercedes_Benz_A_200__BlueEFFICIENCY__Urban 49991 2016-03-06 15:25:19 Kleinwagen 49992 2016-03-10 19:37:38 Fiat_Grande_Punto_1.4_T_Jet_16V_Sport 49993 2016-03-15 18:47:35 Audi_A3__1_8l__Silber;_schoenes_Fahrzeug 49994 2016-03-22 17:36:42 Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... 49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon 49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... 49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge 49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition 49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V seller offer_type price ab_test vehicle_type registration_year \ 0 privat Angebot 5000 control bus 2004 1 privat Angebot 8500 control limousine 1997 2 privat Angebot 8990 test limousine 2009 3 privat Angebot 4350 control kleinwagen 2007 4 privat Angebot 1350 test kombi 2003 5 privat Angebot 7900 test bus 2006 6 privat Angebot 300 test limousine 1995 7 privat Angebot 1990 control limousine 1998 8 privat Angebot 250 test NaN 2000 9 privat Angebot 590 control bus 1997 10 privat Angebot 999 test NaN 2017 11 privat Angebot 350 control NaN 2000 12 privat Angebot 5299 control kleinwagen 2010 13 privat Angebot 1350 control limousine 1999 14 privat Angebot 3999 test kleinwagen 2007 15 privat Angebot 18900 test coupe 1982 16 privat Angebot 350 test kombi 1999 17 privat Angebot 5500 test coupe 1990 18 privat Angebot 300 control bus 1995 19 privat Angebot 4150 control suv 2004 20 privat Angebot 3500 test kombi 2003 21 privat Angebot 41500 test cabrio 2004 22 privat Angebot 25450 control cabrio 2015 23 privat Angebot 7999 control bus 2010 24 privat Angebot 48500 control limousine 2014 26 privat Angebot 777 control kleinwagen 1992 28 privat Angebot 5250 control kleinwagen 2007 29 privat Angebot 4999 test kombi 2004 31 privat Angebot 2850 test kombi 2002 32 privat Angebot 350 control kleinwagen 1999 ... ... ... ... ... ... ... 49968 privat Angebot 2100 test limousine 1986 49969 privat Angebot 4500 control suv 2005 49970 privat Angebot 15800 control bus 2010 49971 privat Angebot 950 test kleinwagen 2001 49972 privat Angebot 3300 control bus 2004 49973 privat Angebot 6000 control cabrio 2004 49975 privat Angebot 9700 control kleinwagen 2012 49976 privat Angebot 5900 test kombi 1992 49977 privat Angebot 5500 control limousine 2003 49978 privat Angebot 900 control limousine 1996 49979 privat Angebot 11000 test kleinwagen 2011 49980 privat Angebot 400 control kombi 1995 49981 privat Angebot 2000 control kombi 1998 49982 privat Angebot 1950 control kleinwagen 2004 49983 privat Angebot 600 test kleinwagen 1999 49985 privat Angebot 1000 control NaN 1995 49986 privat Angebot 15900 control limousine 2010 49987 privat Angebot 21990 control limousine 2013 49988 privat Angebot 9550 control coupe 2001 49989 privat Angebot 150 test kleinwagen 1997 49990 privat Angebot 17500 test limousine 2012 49991 privat Angebot 500 control NaN 2016 49992 privat Angebot 4800 control kleinwagen 2009 49993 privat Angebot 1650 control kleinwagen 1997 49994 privat Angebot 5000 control kombi 2001 49995 privat Angebot 24900 control limousine 2011 49996 privat Angebot 1980 control cabrio 1996 49997 privat Angebot 13200 test cabrio 2014 49998 privat Angebot 22900 control kombi 2013 49999 privat Angebot 1250 control limousine 1996 gearbox power_ps model odometer_km registration_month \ 0 manuell 158 andere 150000 3 1 automatik 286 7er 150000 6 2 manuell 102 golf 70000 7 3 automatik 71 fortwo 70000 6 4 manuell 0 focus 150000 7 5 automatik 150 voyager 150000 4 6 manuell 90 golf 150000 8 7 manuell 90 golf 150000 12 8 manuell 0 arosa 150000 10 9 manuell 90 megane 150000 7 10 manuell 90 NaN 150000 4 11 NaN 0 NaN 150000 0 12 automatik 71 fortwo 50000 9 13 manuell 101 a3 150000 11 14 manuell 75 clio 150000 9 15 automatik 203 NaN 80000 6 16 manuell 101 vectra 150000 5 17 manuell 205 scirocco 150000 6 18 manuell 150 3er 150000 0 19 manuell 124 andere 150000 2 20 manuell 131 a4 150000 5 21 manuell 320 911 150000 4 22 manuell 184 cooper 10000 1 23 manuell 120 NaN 150000 2 24 automatik 306 5er 30000 12 26 manuell 54 polo 125000 2 28 manuell 110 cooper 150000 7 29 automatik 204 e_klasse 150000 10 31 automatik 218 c_klasse 150000 7 32 manuell 0 corsa 150000 7 ... ... ... ... ... ... 49968 automatik 90 andere 150000 9 49969 manuell 136 x_trail 150000 5 49970 automatik 136 c4 60000 4 49971 manuell 50 lupo 150000 4 49972 automatik 150 vito 150000 10 49973 manuell 163 slk 150000 11 49975 automatik 88 jazz 100000 11 49976 automatik 150 80 150000 12 49977 manuell 116 c_klasse 150000 2 49978 automatik 136 e_klasse 150000 9 49979 manuell 90 polo 70000 11 49980 manuell 105 escort 125000 3 49981 manuell 115 astra 150000 12 49982 manuell 0 fabia 90000 7 49983 manuell 101 focus 150000 4 49985 automatik 0 NaN 150000 0 49986 automatik 218 300c 125000 11 49987 manuell 150 a3 50000 11 49988 manuell 231 3er 150000 10 49989 manuell 0 polo 150000 5 49990 manuell 156 a_klasse 30000 12 49991 manuell 0 twingo 150000 0 49992 manuell 120 andere 125000 9 49993 manuell 0 NaN 150000 7 49994 automatik 299 a6 150000 1 49995 automatik 239 q5 100000 1 49996 manuell 75 astra 150000 5 49997 automatik 69 500 5000 11 49998 manuell 150 a3 40000 11 49999 manuell 101 vectra 150000 1 fuel_type brand unrepaired_damage ad_created \ 0 lpg peugeot nein 2016-03-26 00:00:00 1 benzin bmw nein 2016-04-04 00:00:00 2 benzin volkswagen nein 2016-03-26 00:00:00 3 benzin smart nein 2016-03-12 00:00:00 4 benzin ford nein 2016-04-01 00:00:00 5 diesel chrysler NaN 2016-03-21 00:00:00 6 benzin volkswagen NaN 2016-03-20 00:00:00 7 diesel volkswagen nein 2016-03-16 00:00:00 8 NaN seat nein 2016-03-22 00:00:00 9 benzin renault nein 2016-03-16 00:00:00 10 benzin volkswagen nein 2016-03-14 00:00:00 11 benzin mercedes_benz NaN 2016-03-16 00:00:00 12 benzin smart nein 2016-03-31 00:00:00 13 benzin audi nein 2016-03-23 00:00:00 14 benzin renault NaN 2016-03-23 00:00:00 15 benzin sonstige_autos nein 2016-04-01 00:00:00 16 benzin opel nein 2016-03-16 00:00:00 17 benzin volkswagen nein 2016-03-29 00:00:00 18 benzin bmw NaN 2016-03-26 00:00:00 19 lpg mazda nein 2016-03-17 00:00:00 20 diesel audi NaN 2016-03-05 00:00:00 21 benzin porsche nein 2016-03-06 00:00:00 22 benzin mini nein 2016-03-28 00:00:00 23 diesel peugeot nein 2016-03-10 00:00:00 24 benzin bmw nein 2016-04-03 00:00:00 26 benzin volkswagen nein 2016-04-03 00:00:00 28 diesel mini ja 2016-03-19 00:00:00 29 diesel mercedes_benz nein 2016-04-02 00:00:00 31 benzin mercedes_benz nein 2016-03-14 00:00:00 32 benzin opel NaN 2016-03-20 00:00:00 ... ... ... ... ... 49968 diesel mercedes_benz nein 2016-04-01 00:00:00 49969 diesel nissan nein 2016-03-17 00:00:00 49970 diesel citroen nein 2016-03-21 00:00:00 49971 benzin volkswagen nein 2016-03-29 00:00:00 49972 diesel mercedes_benz ja 2016-03-26 00:00:00 49973 benzin mercedes_benz nein 2016-03-27 00:00:00 49975 hybrid honda nein 2016-03-27 00:00:00 49976 benzin audi nein 2016-03-19 00:00:00 49977 diesel mercedes_benz nein 2016-03-31 00:00:00 49978 benzin mercedes_benz ja 2016-04-04 00:00:00 49979 diesel volkswagen nein 2016-03-20 00:00:00 49980 benzin ford NaN 2016-03-12 00:00:00 49981 benzin opel nein 2016-03-15 00:00:00 49982 benzin skoda NaN 2016-03-29 00:00:00 49983 benzin ford NaN 2016-03-06 00:00:00 49985 benzin volkswagen NaN 2016-04-02 00:00:00 49986 diesel chrysler nein 2016-04-04 00:00:00 49987 diesel audi nein 2016-03-22 00:00:00 49988 benzin bmw nein 2016-03-28 00:00:00 49989 benzin volkswagen ja 2016-03-11 00:00:00 49990 benzin mercedes_benz nein 2016-03-21 00:00:00 49991 benzin renault NaN 2016-03-06 00:00:00 49992 lpg fiat nein 2016-03-10 00:00:00 49993 benzin audi NaN 2016-03-15 00:00:00 49994 benzin audi nein 2016-03-22 00:00:00 49995 diesel audi nein 2016-03-27 00:00:00 49996 benzin opel nein 2016-03-28 00:00:00 49997 benzin fiat nein 2016-04-02 00:00:00 49998 diesel audi nein 2016-03-08 00:00:00 49999 benzin opel nein 2016-03-13 00:00:00 nr_of_pictures postal_code last_seen 0 0 79588 2016-04-06 06:45:54 1 0 71034 2016-04-06 14:45:08 2 0 35394 2016-04-06 20:15:37 3 0 33729 2016-03-15 03:16:28 4 0 39218 2016-04-01 14:38:50 5 0 22962 2016-04-06 09:45:21 6 0 31535 2016-03-23 02:48:59 7 0 53474 2016-04-07 03:17:32 8 0 7426 2016-03-26 18:18:10 9 0 15749 2016-04-06 10:46:35 10 0 86157 2016-04-07 03:16:21 11 0 17498 2016-03-16 18:45:34 12 0 34590 2016-04-06 14:17:52 13 0 12043 2016-04-01 14:17:13 14 0 81737 2016-04-01 15:46:47 15 0 61276 2016-04-02 21:10:48 16 0 57299 2016-03-18 05:29:37 17 0 74821 2016-04-05 20:46:26 18 0 54329 2016-04-02 12:16:41 19 0 40878 2016-03-17 14:45:58 20 0 53913 2016-03-07 05:46:46 21 0 65428 2016-04-05 23:46:19 22 0 44789 2016-04-01 06:45:30 23 0 30900 2016-03-17 08:45:17 24 0 22547 2016-04-07 13:16:50 26 0 38110 2016-04-05 23:46:48 28 0 15745 2016-04-07 14:58:48 29 0 47638 2016-04-02 12:45:44 31 0 41065 2016-03-16 07:19:04 32 0 27619 2016-04-06 03:15:20 ... ... ... ... 49968 0 40227 2016-04-05 13:16:35 49969 0 17379 2016-03-25 23:18:15 49970 0 14947 2016-04-07 04:17:34 49971 0 65197 2016-03-29 20:41:51 49972 0 65326 2016-03-28 11:28:18 49973 0 53567 2016-03-27 08:25:24 49975 0 84385 2016-04-05 19:45:34 49976 0 36100 2016-04-07 06:16:44 49977 0 33739 2016-04-06 12:16:11 49978 0 24405 2016-04-06 12:44:20 49979 0 48455 2016-04-07 01:45:12 49980 0 56218 2016-04-06 17:16:49 49981 0 86859 2016-04-05 17:21:46 49982 0 45884 2016-03-29 18:51:08 49983 0 52477 2016-03-09 06:16:08 49985 0 30900 2016-04-06 15:17:52 49986 0 73527 2016-04-06 23:16:00 49987 0 94362 2016-03-26 22:46:06 49988 0 83646 2016-04-07 02:17:40 49989 0 21244 2016-03-12 10:17:55 49990 0 58239 2016-04-06 22:46:57 49991 0 61350 2016-03-06 18:24:19 49992 0 68642 2016-03-13 01:44:51 49993 0 65203 2016-04-06 19:46:53 49994 0 46537 2016-04-06 08:16:39 49995 0 82131 2016-04-01 13:47:40 49996 0 44807 2016-04-02 14:18:02 49997 0 73430 2016-04-04 11:47:27 49998 0 35683 2016-04-05 16:45:07 49999 0 45897 2016-04-06 21:18:48 [48227 rows x 20 columns]
Now the same analysis will be performed for the 'odometer_km'.
# checking for outliers in 'odometer_km'
autos['odometer_km'].unique().shape # number of unique values
(13,)
autos['odometer_km'].describe() # min/max/mean/median values
count 48227.000000 mean 125920.127729 std 39542.413981 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
highest_odometer = autos['odometer_km'].value_counts().sort_index(ascending=True).head()
lowest_odometer = autos['odometer_km'].value_counts().sort_index(ascending=False).head()
print(highest_odometer)
print(lowest_odometer)
5000 760 10000 245 20000 757 30000 777 40000 814 Name: odometer_km, dtype: int64 150000 31214 125000 5038 100000 2101 90000 1733 80000 1412 Name: odometer_km, dtype: int64
The odomoeter readings in the data set are not as diverse and all seem reasonable, therefore no values will be removed.
The next step will be to get a breakdown of the values in 'date_crawled', 'ad-created', and 'last_seen. Followed by an understanding of 'registration_year'.
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.025359 2016-03-06 0.014038 2016-03-07 0.036059 2016-03-08 0.033176 2016-03-09 0.033011 2016-03-10 0.032285 2016-03-11 0.032596 2016-03-12 0.036909 2016-03-13 0.015676 2016-03-14 0.036660 2016-03-15 0.034317 2016-03-16 0.029465 2016-03-17 0.031518 2016-03-18 0.012897 2016-03-19 0.034732 2016-03-20 0.037800 2016-03-21 0.037220 2016-03-22 0.032886 2016-03-23 0.032285 2016-03-24 0.029444 2016-03-25 0.031497 2016-03-26 0.032306 2016-03-27 0.031124 2016-03-28 0.034960 2016-03-29 0.034130 2016-03-30 0.033736 2016-03-31 0.031849 2016-04-01 0.033695 2016-04-02 0.035602 2016-04-03 0.038609 2016-04-04 0.036536 2016-04-05 0.013063 2016-04-06 0.003172 2016-04-07 0.001389 Name: date_crawled, dtype: float64
The 'date_crawled' data shows that everything has been crawled in the last ~30 days, with an equal distribution amongst all dates.
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000041 2016-01-13 0.000021 2016-01-14 0.000021 2016-01-16 0.000021 2016-01-22 0.000021 2016-01-27 0.000062 2016-01-29 0.000021 2016-02-01 0.000021 2016-02-02 0.000041 2016-02-05 0.000041 2016-02-07 0.000021 2016-02-08 0.000021 2016-02-09 0.000021 2016-02-11 0.000021 2016-02-12 0.000041 2016-02-14 0.000041 2016-02-16 0.000021 2016-02-17 0.000021 2016-02-18 0.000041 2016-02-19 0.000062 2016-02-20 0.000041 2016-02-21 0.000062 ... 2016-03-09 0.033093 2016-03-10 0.031995 2016-03-11 0.032907 2016-03-12 0.036743 2016-03-13 0.017044 2016-03-14 0.035291 2016-03-15 0.034047 2016-03-16 0.029962 2016-03-17 0.031186 2016-03-18 0.013582 2016-03-19 0.033612 2016-03-20 0.037863 2016-03-21 0.037448 2016-03-22 0.032700 2016-03-23 0.032119 2016-03-24 0.029382 2016-03-25 0.031621 2016-03-26 0.032368 2016-03-27 0.031041 2016-03-28 0.035063 2016-03-29 0.034089 2016-03-30 0.033550 2016-03-31 0.031891 2016-04-01 0.033674 2016-04-02 0.035291 2016-04-03 0.038858 2016-04-04 0.036888 2016-04-05 0.011798 2016-04-06 0.003255 2016-04-07 0.001244 Name: ad_created, Length: 76, dtype: float64
The 'ad_created' column shows the same distribution amongst the 'web_crawled' dates, but there are several entries dated before this. This are probably erroneous or old data.
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.001078 2016-03-06 0.004313 2016-03-07 0.005433 2016-03-08 0.007320 2016-03-09 0.009580 2016-03-10 0.010637 2016-03-11 0.012400 2016-03-12 0.023783 2016-03-13 0.008875 2016-03-14 0.012628 2016-03-15 0.015862 2016-03-16 0.016443 2016-03-17 0.028096 2016-03-18 0.007320 2016-03-19 0.015759 2016-03-20 0.020652 2016-03-21 0.020549 2016-03-22 0.021357 2016-03-23 0.018579 2016-03-24 0.019761 2016-03-25 0.019097 2016-03-26 0.016671 2016-03-27 0.015551 2016-03-28 0.020860 2016-03-29 0.022311 2016-03-30 0.024696 2016-03-31 0.023825 2016-04-01 0.022850 2016-04-02 0.024882 2016-04-03 0.025131 2016-04-04 0.024530 2016-04-05 0.125054 2016-04-06 0.221971 2016-04-07 0.132146 Name: last_seen, dtype: float64
The 'last_seen' column trends as expected. The older the post the less likely it is to be seen more recently. This is most likely due to being 'bumped' down the site by newer ads and postings.
autos['registration_year'].describe()
count 48227.000000 mean 2004.730151 std 87.894768 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The mean of the 'registration_year' column shows a fairly current year (16 years old at time of report), but a scan of the min (1000) and max(9999) show that there is some erroneous data to be excluded here as well.
Removing the bad data from the 'registration_year' is the next logical step. Clearly, the max can not be above 2016, which is the year this data was aggregated. Given that the first mass produced vehicles did not begin production until roughly the 1920's, a minimum of 1920 will be used.
autos = autos[autos["registration_year"].between(1920, 2016)]
autos['registration_year'].value_counts(normalize=True).sort_index(ascending=True)
1927 0.000022 1929 0.000022 1931 0.000022 1934 0.000043 1937 0.000086 1938 0.000022 1939 0.000022 1941 0.000043 1943 0.000022 1948 0.000022 1950 0.000022 1951 0.000043 1952 0.000022 1953 0.000022 1954 0.000043 1955 0.000043 1956 0.000086 1957 0.000043 1958 0.000086 1959 0.000129 1960 0.000475 1961 0.000129 1962 0.000086 1963 0.000173 1964 0.000259 1965 0.000367 1966 0.000475 1967 0.000561 1968 0.000561 1969 0.000410 ... 1987 0.001553 1988 0.002869 1989 0.003689 1990 0.007162 1991 0.007292 1992 0.007918 1993 0.009061 1994 0.013505 1995 0.025737 1996 0.029232 1997 0.041529 1998 0.050482 1999 0.062110 2000 0.066964 2001 0.056717 2002 0.053438 2003 0.058098 2004 0.058227 2005 0.062801 2006 0.057558 2007 0.049037 2008 0.047678 2009 0.044895 2010 0.034237 2011 0.034906 2012 0.028218 2013 0.017280 2014 0.014282 2015 0.008198 2016 0.025931 Name: registration_year, Length: 77, dtype: float64
The results of this data cull trend as expected, because as registration year increases the percentage of registered cars increases. There is also a noticeable dip in the five most recent years given that most car owners keep a car for about that long.
Now it should be determined what the most popular cars are. The top ten vehicle brands and mean price will be aggregated.
top_brand_mean_price = {}
brands = autos["brand"].value_counts().head(10).index
for b in brands:
selected_rows = autos[autos["brand"] == b]
b_mean = selected_rows["price"].mean()
top_brand_mean_price[b] = b_mean
print(top_brand_mean_price)
{'audi': 9380.718547986076, 'mercedes_benz': 8672.654241071428, 'ford': 4086.93421865349, 'bmw': 8381.677305658899, 'seat': 4433.419621749409, 'opel': 3006.0002012072437, 'volkswagen': 5639.900316294255, 'fiat': 2836.8736310025274, 'peugeot': 3113.860549132948, 'renault': 2496.940394314535}
Looking at the top 10 brands, it isn't a surprise that the top two are both German car companies. It is surprising that number three is Ford from the USA. With used cars, the expectation would be that price would play a key part in popularity. This does not really corelate though, as three of the top four (also all German manufacturers) brands are nearly double the mean price of the other seven brands. It appears that Germans just prefer to purchase German vehicles.
top_brand_mean_odometer = {}
brands = autos["brand"].value_counts().head(10).index
for o in brands:
selected_rows = autos[autos["brand"] == o]
o_mean = selected_rows["odometer_km"].mean()
top_brand_mean_odometer[o] = o_mean
print(top_brand_mean_odometer)
{'audi': 129245.40029835903, 'mercedes_benz': 131025.66964285714, 'ford': 124277.33168622607, 'bmw': 132695.32014881534, 'seat': 121536.64302600473, 'opel': 129380.28169014085, 'volkswagen': 128804.2036526885, 'fiat': 116950.29486099411, 'peugeot': 127127.8901734104, 'renault': 128337.91838606144}
bmp_series = pd.Series(top_brand_mean_price)
bmm_series = pd.Series(top_brand_mean_odometer)
df = pd.DataFrame(bmp_series, columns=['mean_price'])
df.insert(1, 'mean_odometer', bmm_series)
print(df.to_string())
mean_price mean_odometer audi 9380.718548 129245.400298 bmw 8381.677306 132695.320149 fiat 2836.873631 116950.294861 ford 4086.934219 124277.331686 mercedes_benz 8672.654241 131025.669643 opel 3006.000201 129380.281690 peugeot 3113.860549 127127.890173 renault 2496.940394 128337.918386 seat 4433.419622 121536.643026 volkswagen 5639.900316 128804.203653