For this project, we're using the dataset of used cars from eBay, uploaded to kaggle. The columns include: dateCrawled - When this ad was first crawled. All field-values are taken from this date.
The aim of this project is to clean the data, and analyze the included used car listings
#importing pandas and NumPy libraries
import pandas as pd
import numpy as np
#reading the csv file into pandas
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | VW_Golf_Tuning_in_siber/grau | privat | Angebot | $999 | test | NaN | 2017 | manuell | 90 | NaN | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-14 00:00:00 | 0 | 86157 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | $350 | control | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | $5,299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50,000km | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | $1,350 | control | limousine | 1999 | manuell | 101 | a3 | 150,000km | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | $3,999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150,000km | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | $18,900 | test | coupe | 1982 | automatik | 203 | NaN | 80,000km | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | $350 | test | kombi | 1999 | manuell | 101 | vectra | 150,000km | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | $5,500 | test | coupe | 1990 | manuell | 205 | scirocco | 150,000km | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | $300 | control | bus | 1995 | manuell | 150 | 3er | 150,000km | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | $4,150 | control | suv | 2004 | manuell | 124 | andere | 150,000km | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | $3,500 | test | kombi | 2003 | manuell | 131 | a4 | 150,000km | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | $41,500 | test | cabrio | 2004 | manuell | 320 | 911 | 150,000km | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | $25,450 | control | cabrio | 2015 | manuell | 184 | cooper | 10,000km | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | $7,999 | control | bus | 2010 | manuell | 120 | NaN | 150,000km | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | $48,500 | control | limousine | 2014 | automatik | 306 | 5er | 30,000km | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
25 | 2016-03-21 21:56:18 | Ford_escort_kombi_an_bastler_mit_ghia_ausstattung | privat | Angebot | $90 | control | kombi | 1996 | manuell | 116 | NaN | 150,000km | 4 | benzin | ford | ja | 2016-03-21 00:00:00 | 0 | 27574 | 2016-04-01 05:16:49 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | $777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125,000km | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
27 | 2016-03-27 18:45:01 | Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE | privat | Angebot | $0 | control | NaN | 2005 | NaN | 0 | NaN | 150,000km | 0 | NaN | ford | NaN | 2016-03-27 00:00:00 | 0 | 66701 | 2016-03-27 18:45:01 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | $5,250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150,000km | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | $4,999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150,000km | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | $15,800 | control | bus | 2010 | automatik | 136 | c4 | 60,000km | 4 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0 | 14947 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | $950 | test | kleinwagen | 2001 | manuell | 50 | lupo | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 65197 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | $3,300 | control | bus | 2004 | automatik | 150 | vito | 150,000km | 10 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0 | 65326 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | $6,000 | control | cabrio | 2004 | manuell | 163 | slk | 150,000km | 11 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 53567 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | $0 | control | cabrio | 1983 | manuell | 70 | golf | 150,000km | 2 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 8209 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | $9,700 | control | kleinwagen | 2012 | automatik | 88 | jazz | 100,000km | 11 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0 | 84385 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | $5,900 | test | kombi | 1992 | automatik | 150 | 80 | 150,000km | 12 | benzin | audi | nein | 2016-03-19 00:00:00 | 0 | 36100 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | $5,500 | control | limousine | 2003 | manuell | 116 | c_klasse | 150,000km | 2 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0 | 33739 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | $900 | control | limousine | 1996 | automatik | 136 | e_klasse | 150,000km | 9 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0 | 24405 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | $11,000 | test | kleinwagen | 2011 | manuell | 90 | polo | 70,000km | 11 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 48455 | 2016-04-07 01:45:12 |
49980 | 2016-03-12 10:55:54 | Ford_Escort_Turnier_16V | privat | Angebot | $400 | control | kombi | 1995 | manuell | 105 | escort | 125,000km | 3 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 56218 | 2016-04-06 17:16:49 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | $2,000 | control | kombi | 1998 | manuell | 115 | astra | 150,000km | 12 | benzin | opel | nein | 2016-03-15 00:00:00 | 0 | 86859 | 2016-04-05 17:21:46 |
49982 | 2016-03-29 18:51:08 | Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm | privat | Angebot | $1,950 | control | kleinwagen | 2004 | manuell | 0 | fabia | 90,000km | 7 | benzin | skoda | NaN | 2016-03-29 00:00:00 | 0 | 45884 | 2016-03-29 18:51:08 |
49983 | 2016-03-06 12:43:04 | Ford_focus_99 | privat | Angebot | $600 | test | kleinwagen | 1999 | manuell | 101 | focus | 150,000km | 4 | benzin | ford | NaN | 2016-03-06 00:00:00 | 0 | 52477 | 2016-03-09 06:16:08 |
49984 | 2016-03-31 22:48:48 | Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... | privat | Angebot | $0 | test | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | NaN | sonstige_autos | NaN | 2016-03-31 00:00:00 | 0 | 12103 | 2016-04-02 19:44:53 |
49985 | 2016-04-02 16:38:23 | Verkaufe_meinen_vw_vento! | privat | Angebot | $1,000 | control | NaN | 1995 | automatik | 0 | NaN | 150,000km | 0 | benzin | volkswagen | NaN | 2016-04-02 00:00:00 | 0 | 30900 | 2016-04-06 15:17:52 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | $15,900 | control | limousine | 2010 | automatik | 218 | 300c | 125,000km | 11 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0 | 73527 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | $21,990 | control | limousine | 2013 | manuell | 150 | a3 | 50,000km | 11 | diesel | audi | nein | 2016-03-22 00:00:00 | 0 | 94362 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | $9,550 | control | coupe | 2001 | manuell | 231 | 3er | 150,000km | 10 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0 | 83646 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | $150 | test | kleinwagen | 1997 | manuell | 0 | polo | 150,000km | 5 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0 | 21244 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | $17,500 | test | limousine | 2012 | manuell | 156 | a_klasse | 30,000km | 12 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0 | 58239 | 2016-04-06 22:46:57 |
49991 | 2016-03-06 15:25:19 | Kleinwagen | privat | Angebot | $500 | control | NaN | 2016 | manuell | 0 | twingo | 150,000km | 0 | benzin | renault | NaN | 2016-03-06 00:00:00 | 0 | 61350 | 2016-03-06 18:24:19 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | $4,800 | control | kleinwagen | 2009 | manuell | 120 | andere | 125,000km | 9 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0 | 68642 | 2016-03-13 01:44:51 |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | privat | Angebot | $1,650 | control | kleinwagen | 1997 | manuell | 0 | NaN | 150,000km | 7 | benzin | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | $5,000 | control | kombi | 2001 | automatik | 299 | a6 | 150,000km | 1 | benzin | audi | nein | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Now, converting the case of column names to snakecase, and renaming some of the columns:
#copying the dataframe
autos_copy = autos.copy()
autos_columns = autos_copy.columns
#creating a dictionary to map new names to old names
mapping_dict = {
'dateCrawled':'date_crawled',
'name':'name',
'seller':'seller',
'offerType':'offer_type',
'price':'price',
'abtest':'abtest',
'vehicleType':'vehicle_type',
'yearOfRegistration':'registration_year',
'gearbox':'gearbox',
'powerPS':'power_ps',
'model':'model',
'odometer':'odometer',
'monthOfRegistration':'registration_month',
'fuelType':'fuel_type',
'brand':'brand',
'notRepairedDamage':'unrepaired_damage',
'dateCreated':'ad_created',
'nrOfPictures':'nr_of_pictures',
'postalCode':'postal_code',
'lastSeen':'last_seen'}
# Updating the column names.
autos_copy.columns = pd.Series(autos_columns).map(mapping_dict)
autos_copy.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
#alternative code for changing the column names
autos.rename(mapper = mapping_dict, axis = 1, inplace = True)
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
We've made the following changes to the dataframe:
In this step, we are looking for:
The following methods are helpful for exploring the data:
# Looking at descriptive statistics for all the columns
autos.describe(include = 'all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-27 22:55:05 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
From the description, the columns 'price' and 'odometer' are numeric values but are stored as texts. The columns gearbox and fuel_type has some german words like, manuell and benzin, respectively.
We'll therefore explore the other german words in these values and translate them to english words:
#removing non-numeric characters from price and odometer columns
autos["price"] = autos["price"].str.replace("$", "").str.replace(",", "").astype(int)
autos["price"].head()
autos["odometer"] = autos["odometer"].str.replace("km","").str.replace(",", "").astype(int)
#renaming odometer column
autos.rename({"odometer": "odometer_km"}, inplace = True, axis = 1)
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
print(autos['gearbox'].unique())
print(autos['fuel_type'].unique())
print(autos['unrepaired_damage'].unique())
['manuell' 'automatik' nan] ['lpg' 'benzin' 'diesel' nan 'cng' 'hybrid' 'elektro' 'andere'] ['nein' nan 'ja']
#creating a dictionary to map the german words to english words:
german_to_english = {
'manuell': 'manual',
'automatik': 'automatic',
'benzin': 'benzine',
'elektro': 'electro',
'andere': 'other',
'lpg': 'lpg',
'diesel': 'diesel',
'cng': 'cng',
'hybrid': 'hybrid',
'nein': 'no',
'ja': 'yes'
}
#mapping the words to the columns:
autos['fuel_type'] = autos['fuel_type'].map(german_to_english)
autos['gearbox'] = autos['gearbox'].map(german_to_english)
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(german_to_english)
#updated values
print(autos['gearbox'].unique())
print(autos['fuel_type'].unique())
print(autos['unrepaired_damage'].unique())
['manual' 'automatic' nan] ['lpg' 'benzine' 'diesel' nan 'cng' 'hybrid' 'electro' 'other'] ['no' nan 'yes']
Here, we'll:
#Total unique values for price and odometer_km columns
price_total_unique = autos["price"].unique().shape
odometer_total_unique = autos["odometer_km"].unique().shape
#print(price_total_unique, odometer_total_unique)
#Descriptive statistics for the columns
price_describe = autos["price"].describe()
odometer_describe = autos["odometer_km"].describe()
#print(price_describe)
#print(odometer_describe)
#Value counts
price_counts_asc = autos["price"].value_counts().sort_index(ascending = True).head()
print(price_counts_asc)
price_counts_desc = autos["price"].value_counts().sort_index(ascending = False).head()
print(price_counts_desc)
odometer_counts_asc = autos["odometer_km"].value_counts().sort_index(ascending = True).head()
print(odometer_counts_asc)
odometer_counts_desc = autos["odometer_km"].value_counts().sort_index(ascending = False).head()
print(odometer_counts_desc)
0 1421 1 156 2 3 3 1 5 2 Name: price, dtype: int64 99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 Name: price, dtype: int64 5000 967 10000 264 20000 784 30000 789 40000 819 Name: odometer_km, dtype: int64 150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 Name: odometer_km, dtype: int64
autos = autos[autos["price"].between(1,30000)]
autos = autos[autos["odometer_km"].between(5000,10000)]
print(autos.info())
<class 'pandas.core.frame.DataFrame'> Int64Index: 997 entries, 22 to 49997 Data columns (total 20 columns): date_crawled 997 non-null object name 997 non-null object seller 997 non-null object offer_type 997 non-null object price 997 non-null int64 abtest 997 non-null object vehicle_type 726 non-null object registration_year 997 non-null int64 gearbox 767 non-null object power_ps 997 non-null int64 model 842 non-null object odometer_km 997 non-null int64 registration_month 997 non-null int64 fuel_type 725 non-null object brand 997 non-null object unrepaired_damage 583 non-null object ad_created 997 non-null object nr_of_pictures 997 non-null int64 postal_code 997 non-null int64 last_seen 997 non-null object dtypes: int64(7), object(13) memory usage: 163.6+ KB None
#percentage distribution of values in date_crawled column
dates_percentage = autos['date_crawled'].str[:10].value_counts(
normalize = True, dropna = False).sort_index(ascending = False)
print(dates_percentage)
2016-04-07 0.002006 2016-04-06 0.003009 2016-04-05 0.010030 2016-04-04 0.033099 2016-04-03 0.041123 2016-04-02 0.035105 2016-04-01 0.027081 2016-03-31 0.035105 2016-03-30 0.033099 2016-03-29 0.035105 2016-03-28 0.039117 2016-03-27 0.031093 2016-03-26 0.033099 2016-03-25 0.029087 2016-03-24 0.033099 2016-03-23 0.028084 2016-03-22 0.038114 2016-03-21 0.030090 2016-03-20 0.037111 2016-03-19 0.037111 2016-03-18 0.015045 2016-03-17 0.041123 2016-03-16 0.041123 2016-03-15 0.037111 2016-03-14 0.039117 2016-03-13 0.017051 2016-03-12 0.027081 2016-03-11 0.034102 2016-03-10 0.032096 2016-03-09 0.029087 2016-03-08 0.021063 2016-03-07 0.036108 2016-03-06 0.015045 2016-03-05 0.024072 Name: date_crawled, dtype: float64
#percentage distribution of values in ad_created column
ad_percentage = autos['ad_created'].str[:10].value_counts(
normalize = True, dropna = False).sort_index(ascending = False)
print(ad_percentage)
2016-04-07 0.002006 2016-04-06 0.003009 2016-04-05 0.009027 2016-04-04 0.034102 2016-04-03 0.040120 2016-04-02 0.035105 2016-04-01 0.028084 2016-03-31 0.034102 2016-03-30 0.034102 2016-03-29 0.034102 2016-03-28 0.040120 2016-03-27 0.031093 2016-03-26 0.032096 2016-03-25 0.029087 2016-03-24 0.034102 2016-03-23 0.028084 2016-03-22 0.036108 2016-03-21 0.031093 2016-03-20 0.037111 2016-03-19 0.036108 2016-03-18 0.016048 2016-03-17 0.040120 2016-03-16 0.041123 2016-03-15 0.036108 2016-03-14 0.040120 2016-03-13 0.016048 2016-03-12 0.027081 2016-03-11 0.036108 2016-03-10 0.029087 2016-03-09 0.031093 2016-03-08 0.022066 2016-03-07 0.035105 2016-03-06 0.016048 2016-03-05 0.021063 2016-03-04 0.002006 2016-03-03 0.001003 2016-02-23 0.001003 Name: ad_created, dtype: float64
#percentage distribution of values in last_seen column
lastseen_percentage = autos['last_seen'].str[:10].value_counts(
normalize = True, dropna = False).sort_index(ascending = False).head(3)
print(lastseen_percentage)
2016-04-07 0.139418 2016-04-06 0.223671 2016-04-05 0.136409 Name: last_seen, dtype: float64
It is also important to convert the dates to be uniform numeric data, so "2016-03-21" becomes the integer 20160321
#converting dates to be uniform numerical data
autos['date_crawled'] = autos['date_crawled'].str[:10].str.replace("-","").astype(int)
autos['ad_created'] = autos['ad_created'].str[:10].str.replace("-","").astype(int)
autos['last_seen'] = autos['last_seen'].str[:10].str.replace("-","").astype(int)
autos.head(2)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
22 | 20160328 | MINI_Cooper_S_Cabrio | privat | Angebot | 25450 | control | cabrio | 2015 | manual | 184 | cooper | 10000 | 1 | benzine | mini | no | 20160328 | 0 | 44789 | 20160401 |
52 | 20160325 | Senator_A_3.0E_Karosserie_restauriert_m._viele... | privat | Angebot | 3500 | test | limousine | 1985 | NaN | 0 | andere | 5000 | 0 | benzine | opel | no | 20160325 | 0 | 63500 | 20160407 |
#distribution of the registration year column
autos["registration_year"].describe()
count 997.000000 mean 2060.701103 std 599.739740 min 1000.000000 25% 1997.000000 50% 2003.000000 75% 2015.000000 max 9999.000000 Name: registration_year, dtype: float64
The registration year column has some odd values, minimum value of 1000 and maximum 0f 9999, which are unrealistic considering the years. We'll count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic
autos = autos[autos['registration_year'].between(1900,2016)]
autos["registration_year"].value_counts(normalize = True)
#autos[autos["price"].between(1,30000)]
2015 0.151872 2000 0.083422 2016 0.064171 2014 0.062032 1995 0.044920 2005 0.044920 1998 0.039572 1999 0.036364 2001 0.036364 2002 0.035294 1997 0.032086 1990 0.025668 2013 0.023529 2004 0.023529 1996 0.021390 1994 0.019251 2007 0.018182 2006 0.017112 2003 0.016043 2011 0.014973 2009 0.013904 2010 0.013904 1970 0.011765 1980 0.011765 1985 0.009626 2008 0.009626 2012 0.009626 1992 0.008556 1988 0.007487 1978 0.007487 ... 1972 0.005348 1971 0.004278 1960 0.004278 1968 0.004278 1983 0.004278 1975 0.004278 1991 0.003209 1964 0.003209 1966 0.003209 1910 0.003209 1984 0.003209 1993 0.003209 1987 0.003209 1989 0.003209 1981 0.003209 1937 0.002139 1959 0.002139 1950 0.002139 1979 0.001070 1927 0.001070 1929 0.001070 1931 0.001070 1934 0.001070 1956 0.001070 1938 0.001070 1939 0.001070 1961 0.001070 1962 0.001070 1965 0.001070 1969 0.001070 Name: registration_year, Length: 62, dtype: float64
The percentage distribution of car sales, per the registration year increased as the years progressed
Here, we'll explore variations across different car brands using aggregation.
#exploring unique values of brand column
unique_brands = autos['brand'].unique()
brand_pctg = autos['brand'].value_counts(normalize = True)
print(unique_brands)
print(brand_pctg)
['mini' 'opel' 'ford' 'audi' 'honda' 'volkswagen' 'hyundai' 'jaguar' 'toyota' 'seat' 'sonstige_autos' 'renault' 'bmw' 'trabant' 'skoda' 'mercedes_benz' 'porsche' 'chrysler' 'peugeot' 'smart' 'fiat' 'volvo' 'subaru' 'chevrolet' 'lancia' 'alfa_romeo' 'mitsubishi' 'daewoo' 'mazda' 'kia' 'dacia' 'citroen' 'lada' 'nissan' 'suzuki' 'daihatsu' 'land_rover'] volkswagen 0.205348 opel 0.131551 bmw 0.087701 ford 0.070588 sonstige_autos 0.064171 mercedes_benz 0.059893 renault 0.050267 audi 0.050267 fiat 0.036364 skoda 0.024599 trabant 0.017112 seat 0.016043 peugeot 0.016043 nissan 0.016043 mazda 0.013904 mini 0.012834 chevrolet 0.012834 toyota 0.011765 hyundai 0.011765 citroen 0.010695 mitsubishi 0.009626 alfa_romeo 0.009626 porsche 0.008556 honda 0.007487 smart 0.006417 chrysler 0.006417 suzuki 0.005348 daihatsu 0.005348 volvo 0.004278 dacia 0.004278 kia 0.004278 lada 0.002139 daewoo 0.002139 lancia 0.001070 jaguar 0.001070 land_rover 0.001070 subaru 0.001070 Name: brand, dtype: float64
Aggregating brands that have above 2% of the total values:
filtered_brands = brand_pctg[brand_pctg >= 0.02].index
print(filtered_brands)
#creating dictionary to hold the aggregate data:
brand_meanprice = {}
for brand in filtered_brands:
selected_brand = autos[autos['brand'] == brand]
mean_price = selected_brand['price'].mean()
brand_meanprice[brand] = int(mean_price)
print(brand_meanprice)
Index(['volkswagen', 'opel', 'bmw', 'ford', 'sonstige_autos', 'mercedes_benz', 'renault', 'audi', 'fiat', 'skoda'], dtype='object') {'ford': 7725, 'mercedes_benz': 6064, 'opel': 5250, 'renault': 5025, 'bmw': 6372, 'sonstige_autos': 5440, 'skoda': 9728, 'audi': 10880, 'volkswagen': 6438, 'fiat': 6038}
For the selected brands, we'll use aggregation to understand the average mileage and if there's any link with their mean prices. However, displaying both aggregated series objects and visually comparing them has the following limitations:
So, we can combine the data from both series objects into a single dataframe (with a shared index) and display the dataframe directly, using these methods:
#calculating mean mileage for the selected brands
brand_meanmileage = {}
for brand in filtered_brands:
chosen_brand = autos[autos['brand'] == brand]
mean_mileage = chosen_brand['odometer_km'].mean()
brand_meanmileage[brand] = int(mean_mileage)
print(brand_meanmileage)
print(brand_meanprice)
#converting the dictionaries to series o bjects
meanmileage_series = pd.Series(brand_meanmileage)
meanprice_series = pd. Series(brand_meanprice)
#creating a dataframe from the series objects
price_and_mileage = pd.DataFrame(meanprice_series, columns = ['mean_price'])
price_and_mileage.insert(1, 'mean_mileage',meanmileage_series)
print(price_and_mileage)
price_and_mileage.sort_values(by = ['mean_price'],ascending = False)
{'ford': 6136, 'mercedes_benz': 5803, 'opel': 5975, 'renault': 5744, 'bmw': 5548, 'sonstige_autos': 6416, 'skoda': 6956, 'audi': 7021, 'volkswagen': 6041, 'fiat': 6029} {'ford': 7725, 'mercedes_benz': 6064, 'opel': 5250, 'renault': 5025, 'bmw': 6372, 'sonstige_autos': 5440, 'skoda': 9728, 'audi': 10880, 'volkswagen': 6438, 'fiat': 6038} mean_price mean_mileage audi 10880 7021 bmw 6372 5548 fiat 6038 6029 ford 7725 6136 mercedes_benz 6064 5803 opel 5250 5975 renault 5025 5744 skoda 9728 6956 sonstige_autos 5440 6416 volkswagen 6438 6041
mean_price | mean_mileage | |
---|---|---|
audi | 10880 | 7021 |
skoda | 9728 | 6956 |
ford | 7725 | 6136 |
volkswagen | 6438 | 6041 |
bmw | 6372 | 5548 |
mercedes_benz | 6064 | 5803 |
fiat | 6038 | 6029 |
sonstige_autos | 5440 | 6416 |
opel | 5250 | 5975 |
renault | 5025 | 5744 |
From the outputs, it is clear that the higher the mean price, the higher the mean mileage.
autos.head(3)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
22 | 20160328 | MINI_Cooper_S_Cabrio | privat | Angebot | 25450 | control | cabrio | 2015 | manual | 184 | cooper | 10000 | 1 | benzine | mini | no | 20160328 | 0 | 44789 | 20160401 |
52 | 20160325 | Senator_A_3.0E_Karosserie_restauriert_m._viele... | privat | Angebot | 3500 | test | limousine | 1985 | NaN | 0 | andere | 5000 | 0 | benzine | opel | no | 20160325 | 0 | 63500 | 20160407 |
102 | 20160322 | Ford_Ka_dunkel_blau | privat | Angebot | 320 | control | kleinwagen | 2004 | manual | 0 | ka | 5000 | 6 | benzine | ford | yes | 20160322 | 0 | 24109 | 20160402 |
Finally, we need to do the following analyses:
#finding the most common brand/model combinations
common_brand_model = autos.groupby('brand').model.value_counts().sort_values(ascending = False).head()
common_brand_model
brand model volkswagen golf 62 bmw 3er 40 opel corsa 33 volkswagen polo 32 opel andere 27 Name: model, dtype: int64
The most common brand/model combination from the dataframe is golf by volkswagen.
#counting values for each of the categories:
autos['unrepaired_damage'].value_counts()
no 467 yes 93 Name: unrepaired_damage, dtype: int64
Those without damages are almost 5 times more than those with damages. Next, we'll calculate their average prices:
autos.groupby('unrepaired_damage')['price'].mean()
unrepaired_damage no 10359.284797 yes 1812.860215 Name: price, dtype: float64
It is evident that cars without damages have the highest prices on average
In this project, we(with the help of dataquest instructions) have cleaned and explored data on car sales. We have removed outliers from the data, converted some of the columns to their correct data types, removing unnecessary values. We have also conducted several analyses including analysing the brands by average price, comparing it to the mileage.