The aim of this project is to clean a large dataset and analyze the information using pandas
and numpy
libraries of python.
The dataset comes from *eBay Kleinanzeigen* classified section on ebay, you can download the full version in the this link.
The data dictionary provided with data is as follows:
dataCrawled
- When this ad was first crawled. All field-values are taken from this date.name
- Name of the car.seller
- Whether the seller is private or a dealer.offerType
- The type of listingprice
- The price on the ad to sell the car.abtest
- Whether the listing is included in an A/B test.vehicleType
- The vehicle Type.yearOfRegistration
- The year in which the car was first registered.gearbox
- The transmission type.powerPS
- The power of the car in PS.model
- he car model name.kilometer
- How many kilometers the car has driven.monthOfRegistration
- The month in which the car was first registered.fuelType
- What type of fuel the car uses.brand
- The brand of the car.notRepairedDamage
- If the car has a damage which is not yet repaired.dateCreated
- The date on which the eBay listing was created.nrOfPictures
- The number of pictures in the ad.postalCode
- The postal code for the location of the vehicle.lastSeenOnline
- When the crawler saw this ad last online.import pandas as pd
import numpy as np
# To read the csv file we use the .read_csv method
autos = pd.read_csv('autos(1).csv', encoding = 'Latin-1')
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | VW_Golf_Tuning_in_siber/grau | privat | Angebot | $999 | test | NaN | 2017 | manuell | 90 | NaN | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-14 00:00:00 | 0 | 86157 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | $350 | control | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | $5,299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50,000km | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | $1,350 | control | limousine | 1999 | manuell | 101 | a3 | 150,000km | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | $3,999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150,000km | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | $18,900 | test | coupe | 1982 | automatik | 203 | NaN | 80,000km | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | $350 | test | kombi | 1999 | manuell | 101 | vectra | 150,000km | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | $5,500 | test | coupe | 1990 | manuell | 205 | scirocco | 150,000km | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | $300 | control | bus | 1995 | manuell | 150 | 3er | 150,000km | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | $4,150 | control | suv | 2004 | manuell | 124 | andere | 150,000km | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | $3,500 | test | kombi | 2003 | manuell | 131 | a4 | 150,000km | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | $41,500 | test | cabrio | 2004 | manuell | 320 | 911 | 150,000km | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | $25,450 | control | cabrio | 2015 | manuell | 184 | cooper | 10,000km | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | $7,999 | control | bus | 2010 | manuell | 120 | NaN | 150,000km | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | $48,500 | control | limousine | 2014 | automatik | 306 | 5er | 30,000km | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
25 | 2016-03-21 21:56:18 | Ford_escort_kombi_an_bastler_mit_ghia_ausstattung | privat | Angebot | $90 | control | kombi | 1996 | manuell | 116 | NaN | 150,000km | 4 | benzin | ford | ja | 2016-03-21 00:00:00 | 0 | 27574 | 2016-04-01 05:16:49 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | $777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125,000km | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
27 | 2016-03-27 18:45:01 | Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE | privat | Angebot | $0 | control | NaN | 2005 | NaN | 0 | NaN | 150,000km | 0 | NaN | ford | NaN | 2016-03-27 00:00:00 | 0 | 66701 | 2016-03-27 18:45:01 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | $5,250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150,000km | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | $4,999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150,000km | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | $15,800 | control | bus | 2010 | automatik | 136 | c4 | 60,000km | 4 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0 | 14947 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | $950 | test | kleinwagen | 2001 | manuell | 50 | lupo | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 65197 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | $3,300 | control | bus | 2004 | automatik | 150 | vito | 150,000km | 10 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0 | 65326 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | $6,000 | control | cabrio | 2004 | manuell | 163 | slk | 150,000km | 11 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 53567 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | $0 | control | cabrio | 1983 | manuell | 70 | golf | 150,000km | 2 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 8209 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | $9,700 | control | kleinwagen | 2012 | automatik | 88 | jazz | 100,000km | 11 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0 | 84385 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | $5,900 | test | kombi | 1992 | automatik | 150 | 80 | 150,000km | 12 | benzin | audi | nein | 2016-03-19 00:00:00 | 0 | 36100 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | $5,500 | control | limousine | 2003 | manuell | 116 | c_klasse | 150,000km | 2 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0 | 33739 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | $900 | control | limousine | 1996 | automatik | 136 | e_klasse | 150,000km | 9 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0 | 24405 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | $11,000 | test | kleinwagen | 2011 | manuell | 90 | polo | 70,000km | 11 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 48455 | 2016-04-07 01:45:12 |
49980 | 2016-03-12 10:55:54 | Ford_Escort_Turnier_16V | privat | Angebot | $400 | control | kombi | 1995 | manuell | 105 | escort | 125,000km | 3 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 56218 | 2016-04-06 17:16:49 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | $2,000 | control | kombi | 1998 | manuell | 115 | astra | 150,000km | 12 | benzin | opel | nein | 2016-03-15 00:00:00 | 0 | 86859 | 2016-04-05 17:21:46 |
49982 | 2016-03-29 18:51:08 | Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm | privat | Angebot | $1,950 | control | kleinwagen | 2004 | manuell | 0 | fabia | 90,000km | 7 | benzin | skoda | NaN | 2016-03-29 00:00:00 | 0 | 45884 | 2016-03-29 18:51:08 |
49983 | 2016-03-06 12:43:04 | Ford_focus_99 | privat | Angebot | $600 | test | kleinwagen | 1999 | manuell | 101 | focus | 150,000km | 4 | benzin | ford | NaN | 2016-03-06 00:00:00 | 0 | 52477 | 2016-03-09 06:16:08 |
49984 | 2016-03-31 22:48:48 | Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... | privat | Angebot | $0 | test | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | NaN | sonstige_autos | NaN | 2016-03-31 00:00:00 | 0 | 12103 | 2016-04-02 19:44:53 |
49985 | 2016-04-02 16:38:23 | Verkaufe_meinen_vw_vento! | privat | Angebot | $1,000 | control | NaN | 1995 | automatik | 0 | NaN | 150,000km | 0 | benzin | volkswagen | NaN | 2016-04-02 00:00:00 | 0 | 30900 | 2016-04-06 15:17:52 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | $15,900 | control | limousine | 2010 | automatik | 218 | 300c | 125,000km | 11 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0 | 73527 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | $21,990 | control | limousine | 2013 | manuell | 150 | a3 | 50,000km | 11 | diesel | audi | nein | 2016-03-22 00:00:00 | 0 | 94362 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | $9,550 | control | coupe | 2001 | manuell | 231 | 3er | 150,000km | 10 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0 | 83646 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | $150 | test | kleinwagen | 1997 | manuell | 0 | polo | 150,000km | 5 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0 | 21244 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | $17,500 | test | limousine | 2012 | manuell | 156 | a_klasse | 30,000km | 12 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0 | 58239 | 2016-04-06 22:46:57 |
49991 | 2016-03-06 15:25:19 | Kleinwagen | privat | Angebot | $500 | control | NaN | 2016 | manuell | 0 | twingo | 150,000km | 0 | benzin | renault | NaN | 2016-03-06 00:00:00 | 0 | 61350 | 2016-03-06 18:24:19 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | $4,800 | control | kleinwagen | 2009 | manuell | 120 | andere | 125,000km | 9 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0 | 68642 | 2016-03-13 01:44:51 |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | privat | Angebot | $1,650 | control | kleinwagen | 1997 | manuell | 0 | NaN | 150,000km | 7 | benzin | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | $5,000 | control | kombi | 2001 | automatik | 299 | a6 | 150,000km | 1 | benzin | audi | nein | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
At first glance we have the following information of the dataset using the .info method:
fueltype
and nonreparied
damage have null valuesautos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
We are going to convert the names case so we can have a more workable case for the dataset.
columns = autos.columns
columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
new_names= ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer','registration_month','fuel_type','brand',
'unreparied_damage','ad_created', 'nr_of_pictures', 'postal_code',
'last_seen']
autos.columns = new_names
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unreparied_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
Now we can see the changes in the columns looking the results with the .head
method
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unreparied_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
In this step we are going to explore the data in detail, first using .describe
method to see some statistical results and look what kind of objects we have in the data.
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unreparied_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-27 22:55:05 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
nr_of_pictures
have mostly 0 value, so we can remove the column without afecting the dataregistration_year
have a min value of 1000, it's really difficult to find a car of the year 1000 now a days, so we need to check that.registration_month
have a min value of 0, it's not a valid month so we need to check that toopower_ps
column have min 0 value, is really dificult to find a car with 0 power, so we need to check that.price
and odometer
are registered as strings, we need to convert to number so we can use the data.seller
, offer_type
and nr_of_pictures
have unique values, so we can drop this columns without affecting the data.So we need to drop the Columns that have the same values, we are not going to use the columns at this moment.
autos = autos.drop(["seller","offer_type","nr_of_pictures"], axis=1)
Now we need to clean the data in the 'price'
column, and also convert this data in integers for further calculus, the steps are:
astype
method..head()
auto_desc = autos["price"].describe()
auto_desc
count 50000 unique 2357 top $0 freq 1421 Name: price, dtype: object
# We need to replace the $ symbol and , separator with str.repalce
autos["price"] = autos["price"].str.replace("$","").str.replace(",","")
# Once we replace we can convert to int with .astype method
autos["price"] = autos["price"].astype(int)
autos["price"].head()
0 5000 1 8500 2 8990 3 4350 4 1350 Name: price, dtype: int64
The steps are similar with prices, but in this case we are going to use .rename()
method at the to convert the name from odometer
to odometer_km
, so the person that will use the dataset could know which units use the odometer.
autos["odometer"].value_counts()
150,000km 32424 125,000km 5170 100,000km 2169 90,000km 1757 80,000km 1436 70,000km 1230 60,000km 1164 50,000km 1027 5,000km 967 40,000km 819 30,000km 789 20,000km 784 10,000km 264 Name: odometer, dtype: int64
# We need to replace the $ symbol and , separator with str.repalce
autos["odometer"] = autos["odometer"].str.replace("km","").str.replace(",","")
# Once we replace we can convert to int with .astype method
autos["odometer"] = autos["odometer"].astype(int)
autos["odometer"].head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: odometer, dtype: int64
#With the rename method we need to indicate the previous name ant the new name
autos.rename(columns={"odometer":"odometer_km"}, inplace=True)
autos.describe(include = 'all')
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unreparied_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 5.000000e+04 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000.000000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.000000 | 50000 |
unique | 48213 | 38754 | NaN | 2 | 8 | NaN | 2 | NaN | 245 | NaN | NaN | 7 | 40 | 2 | 76 | NaN | 39481 |
top | 2016-03-27 22:55:05 | Ford_Fiesta | NaN | test | limousine | NaN | manuell | NaN | golf | NaN | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | NaN | 25756 | 12859 | NaN | 36993 | NaN | 4024 | NaN | NaN | 30107 | 10687 | 35232 | 1946 | NaN | 8 |
mean | NaN | NaN | 9.840044e+03 | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | 125732.700000 | 5.723360 | NaN | NaN | NaN | NaN | 50813.627300 | NaN |
std | NaN | NaN | 4.811044e+05 | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | 40042.211706 | 3.711984 | NaN | NaN | NaN | NaN | 25779.747957 | NaN |
min | NaN | NaN | 0.000000e+00 | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | 5000.000000 | 0.000000 | NaN | NaN | NaN | NaN | 1067.000000 | NaN |
25% | NaN | NaN | 1.100000e+03 | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | 125000.000000 | 3.000000 | NaN | NaN | NaN | NaN | 30451.000000 | NaN |
50% | NaN | NaN | 2.950000e+03 | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | 150000.000000 | 6.000000 | NaN | NaN | NaN | NaN | 49577.000000 | NaN |
75% | NaN | NaN | 7.200000e+03 | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | 150000.000000 | 9.000000 | NaN | NaN | NaN | NaN | 71540.000000 | NaN |
max | NaN | NaN | 1.000000e+08 | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | 150000.000000 | 12.000000 | NaN | NaN | NaN | NaN | 99998.000000 | NaN |
Now that we have the data converted to int
type, we need to explore the outliers, some data that have no sense or are not adjusted to our goal. The steps are the same for odometer_km
and price
.
.unique().shape
methodvalue_counts().sort_index()
method..between()
method to indicate the range of values that we are going to useautos["odometer_km"].unique().shape
(13,)
autos["odometer_km"].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos["odometer_km"].value_counts().sort_index(ascending=False)
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 40000 819 30000 789 20000 784 10000 264 5000 967 Name: odometer_km, dtype: int64
# With this we obtain the number of unique values
autos["price"].unique().shape
(2357,)
autos["price"].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
# First we need to see the extreme superior values, prices above 1 million USD must be cleaned
autos["price"].value_counts().sort_index(ascending=False).head(20)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 Name: price, dtype: int64
# Using `sort_index()` we can sort using the unique values, now we see the low prices, 0-100USD are a very low price
autos["price"].value_counts().sort_index(ascending=True).head(20)
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 Name: price, dtype: int64
#Using only .value_counts() we can see the values by frequency
autos["price"].value_counts().head(50)
0 1421 500 781 1500 734 2500 643 1000 639 1200 639 600 531 800 498 3500 498 2000 460 999 434 750 433 900 420 650 419 850 410 700 395 4500 394 300 384 2200 382 950 379 1100 376 1300 371 3000 365 550 356 1800 355 5500 340 1250 335 350 335 1600 327 1999 322 400 321 6500 300 1400 292 250 291 2800 291 2300 290 1350 276 1700 268 200 266 450 265 3800 264 3200 261 2900 256 1450 252 4000 246 2999 242 2600 241 5000 239 1900 239 1650 238 Name: price, dtype: int64
With this info is better to start from 100USD than from 1USD, is difficult to asses if below this values is realistic to purchase a car, we need more info about how this market works.
#With between method we remove the outliers and reduce the dataset.
autos = autos[autos["price"].between(100,351000)]
autos["price"].describe()
count 48224.000000 mean 5930.371433 std 9078.372762 min 100.000000 25% 1250.000000 50% 3000.000000 75% 7499.000000 max 350000.000000 Name: price, dtype: float64
We have 5 types of data from dates:
date_crawled
: added by the crawlerlast_seen
: added by the crawlerad_created
: from the websiteregistration_month
: from the websiteregistration_year
: from the websiteWe need to convert the dates that are identified as strings to date format.
#First we see how is written the dates
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
# We need to remove the hour and leave only the dates using .str method
autos['date_crawled'].str[:10].head(10)
0 2016-03-26 1 2016-04-04 2 2016-03-26 3 2016-03-12 4 2016-04-01 5 2016-03-21 6 2016-03-20 7 2016-03-16 8 2016-03-22 9 2016-03-16 Name: date_crawled, dtype: object
#Now we use value_counts() to see the most frequent dates
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).head(10)
2016-04-03 0.038611 2016-03-20 0.037803 2016-03-21 0.037201 2016-03-12 0.036911 2016-03-14 0.036662 2016-04-04 0.036538 2016-03-07 0.036061 2016-04-02 0.035605 2016-03-28 0.034962 2016-03-19 0.034734 Name: date_crawled, dtype: float64
# We need to identify the first dates of crawling using .sort_index()
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index().head(10)
2016-03-05 0.025361 2016-03-06 0.014039 2016-03-07 0.036061 2016-03-08 0.033179 2016-03-09 0.033013 2016-03-10 0.032287 2016-03-11 0.032598 2016-03-12 0.036911 2016-03-13 0.015677 2016-03-14 0.036662 Name: date_crawled, dtype: float64
#Then we identify the last dates using the same method
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index().tail(10)
2016-03-29 0.034112 2016-03-30 0.033738 2016-03-31 0.031851 2016-04-01 0.033697 2016-04-02 0.035605 2016-04-03 0.038611 2016-04-04 0.036538 2016-04-05 0.013064 2016-04-06 0.003173 2016-04-07 0.001389 Name: date_crawled, dtype: float64
#The same previous process we will use with the other dates
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).head(10)
2016-04-03 0.038860 2016-03-20 0.037865 2016-03-21 0.037429 2016-04-04 0.036890 2016-03-12 0.036745 2016-04-02 0.035294 2016-03-14 0.035294 2016-03-28 0.035066 2016-03-07 0.034796 2016-03-29 0.034070 Name: ad_created, dtype: float64
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index().head(10)
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000041 2016-01-13 0.000021 Name: ad_created, dtype: float64
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index().tail(10)
2016-03-29 0.034070 2016-03-30 0.033552 2016-03-31 0.031893 2016-04-01 0.033676 2016-04-02 0.035294 2016-04-03 0.038860 2016-04-04 0.036890 2016-04-05 0.011799 2016-04-06 0.003256 2016-04-07 0.001244 Name: ad_created, dtype: float64
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).head(10)
2016-04-06 0.221964 2016-04-07 0.132154 2016-04-05 0.125062 2016-03-17 0.028098 2016-04-03 0.025133 2016-04-02 0.024884 2016-03-30 0.024697 2016-04-04 0.024531 2016-03-31 0.023826 2016-03-12 0.023785 Name: last_seen, dtype: float64
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index().head(10)
2016-03-05 0.001078 2016-03-06 0.004313 2016-03-07 0.005433 2016-03-08 0.007320 2016-03-09 0.009580 2016-03-10 0.010638 2016-03-11 0.012400 2016-03-12 0.023785 2016-03-13 0.008875 2016-03-14 0.012629 Name: last_seen, dtype: float64
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index().tail(10)
2016-03-29 0.022292 2016-03-30 0.024697 2016-03-31 0.023826 2016-04-01 0.022852 2016-04-02 0.024884 2016-04-03 0.025133 2016-04-04 0.024531 2016-04-05 0.125062 2016-04-06 0.221964 2016-04-07 0.132154 Name: last_seen, dtype: float64
From the previous exercise we identified:
With this info is clear that we need to review the registration year of the cars. Similar to previous step with the price, we need to fix a range and exclude some data.
autos["registration_year"].describe()
count 48224.000000 mean 2004.730964 std 87.897388 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
Looking the data we find that we have cars with registered years before 1900 (when we don't even have a clue of what is a Mercedes Benz) and years before 2016, the last year when we retrieved the data. So We need to explore the data set and removes this years.
autos["registration_year"].value_counts(normalize=True)
2000 0.064366 2005 0.060364 1999 0.059701 2004 0.055968 2003 0.055844 2006 0.055325 2001 0.054516 2002 0.051364 1998 0.048524 2007 0.047134 2008 0.045828 2009 0.043132 1997 0.039918 2011 0.033552 2010 0.032909 2017 0.028679 1996 0.028098 2012 0.027123 2016 0.024925 1995 0.024739 2013 0.016610 2014 0.013728 1994 0.012981 2018 0.009705 1993 0.008709 2015 0.007880 1992 0.007610 1991 0.007009 1990 0.006885 1989 0.003546 ... 1934 0.000041 1951 0.000041 1941 0.000041 1954 0.000041 1955 0.000041 2019 0.000041 1910 0.000041 1957 0.000041 1800 0.000041 8888 0.000021 1950 0.000021 4500 0.000021 1929 0.000021 1953 0.000021 1111 0.000021 1001 0.000021 4800 0.000021 1931 0.000021 5911 0.000021 1948 0.000021 1927 0.000021 1943 0.000021 1939 0.000021 1938 0.000021 1000 0.000021 2800 0.000021 4100 0.000021 6200 0.000021 9000 0.000021 1952 0.000021 Name: registration_year, Length: 95, dtype: float64
autos["registration_year"].value_counts().sort_index().head(20)
1000 1 1001 1 1111 1 1800 2 1910 2 1927 1 1929 1 1931 1 1934 2 1937 4 1938 1 1939 1 1941 2 1943 1 1948 1 1950 1 1951 2 1952 1 1953 1 1954 2 Name: registration_year, dtype: int64
autos["registration_year"].value_counts().sort_index().tail(20)
2010 1587 2011 1618 2012 1308 2013 801 2014 662 2015 380 2016 1202 2017 1383 2018 468 2019 2 2800 1 4100 1 4500 1 4800 1 5000 3 5911 1 6200 1 8888 1 9000 1 9999 3 Name: registration_year, dtype: int64
autos = autos[autos["registration_year"].between(1900,2019)]
autos["registration_year"].describe()
count 48205.000000 mean 2003.490053 std 7.511679 min 1910.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 2019.000000 Name: registration_year, dtype: float64
autos["registration_year"].value_counts(normalize=True)
2000 0.064392 2005 0.060388 1999 0.059724 2004 0.055990 2003 0.055866 2006 0.055347 2001 0.054538 2002 0.051385 1998 0.048543 2007 0.047153 2008 0.045846 2009 0.043149 1997 0.039934 2011 0.033565 2010 0.032922 2017 0.028690 1996 0.028109 2012 0.027134 2016 0.024935 1995 0.024748 2013 0.016617 2014 0.013733 1994 0.012986 2018 0.009709 1993 0.008713 2015 0.007883 1992 0.007613 1991 0.007012 1990 0.006887 1989 0.003547 ... 1976 0.000436 1969 0.000394 1975 0.000373 1965 0.000353 1964 0.000249 1963 0.000166 1959 0.000124 1961 0.000124 1937 0.000083 1962 0.000083 1956 0.000083 1958 0.000083 1910 0.000041 1955 0.000041 2019 0.000041 1934 0.000041 1957 0.000041 1951 0.000041 1941 0.000041 1954 0.000041 1953 0.000021 1950 0.000021 1948 0.000021 1927 0.000021 1943 0.000021 1929 0.000021 1939 0.000021 1938 0.000021 1931 0.000021 1952 0.000021 Name: registration_year, Length: 81, dtype: float64
We prefer to mantain until 2019: could a mistake of typo not a deliberate error.
Now we need to explore which are the most popular brands in the used cars, and explore only this data.
top_brand_index
autos["brand"].value_counts(normalize=True).head(10)
volkswagen 0.212986 bmw 0.108765 opel 0.108288 mercedes_benz 0.095965 audi 0.086028 ford 0.069619 renault 0.047837 peugeot 0.029458 fiat 0.025972 seat 0.018919 Name: brand, dtype: float64
brand_counts = autos["brand"].value_counts(normalize=True).head(10)
# Series.index attribute to access the labels
top_brand = brand_counts[brand_counts > 0.05]
print(top_brand)
print('\n')
top_brand_index = brand_counts[brand_counts > 0.05].index
print(top_brand_index)
volkswagen 0.212986 bmw 0.108765 opel 0.108288 mercedes_benz 0.095965 audi 0.086028 ford 0.069619 Name: brand, dtype: float64 Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
Once we identified the top brands, we are going to use the following information:
mean()
method the mean value for brand..Series()
method we are going to convert the dictionary into a single dataframe.dataframe()
method.mileage
and mean_price
we are going to do a single dataframe, combining the tables.brand_prices_dict = {}
for b in top_brand_index:
selected_rows = autos[autos["brand"] == b]
# Calculate the mean average price for just those rows
mean = selected_rows["price"].mean()
# Assign the mean value to the dictionary, using the
# brand name as the key
brand_prices_dict[b] = mean
brand_prices_dict
{'volkswagen': 5366.253433330086, 'bmw': 8310.138470341408, 'opel': 2971.6082375478927, 'mercedes_benz': 8570.76869865975, 'audi': 9259.510248372317, 'ford': 3756.9919547079858}
# pd. Series convert our dictionary into a single dataframe
brand_prices = pd.Series(brand_prices_dict)
brand_prices
volkswagen 5366.253433 bmw 8310.138470 opel 2971.608238 mercedes_benz 8570.768699 audi 9259.510248 ford 3756.991955 dtype: float64
# We can convert the series to a single Dataframe with column
brand_prices = pd.DataFrame(brand_prices, columns=['mean_price'])
brand_prices
mean_price | |
---|---|
volkswagen | 5366.253433 |
bmw | 8310.138470 |
opel | 2971.608238 |
mercedes_benz | 8570.768699 |
audi | 9259.510248 |
ford | 3756.991955 |
mileage_dict = {}
for b in top_brand_index:
selected_rows = autos[autos["brand"] == b]
# Calculate the mean average price for just those rows
mean = selected_rows["odometer_km"].mean()
# Assign the mean value to the dictionary, using the
# brand name as the key
mileage_dict[b] = mean
mileage_dict
{'volkswagen': 129056.19947404305, 'bmw': 132824.71867251574, 'opel': 129512.45210727969, 'mercedes_benz': 131079.76653696498, 'audi': 129604.53339763684, 'ford': 124360.8462455304}
# pd. Series convert our dictionary into a single dataframe
mileage = pd.Series(mileage_dict)
mileage
volkswagen 129056.199474 bmw 132824.718673 opel 129512.452107 mercedes_benz 131079.766537 audi 129604.533398 ford 124360.846246 dtype: float64
mileage = pd.DataFrame(mileage, columns = ['mileage'])
mileage
mileage | |
---|---|
volkswagen | 129056.199474 |
bmw | 132824.718673 |
opel | 129512.452107 |
mercedes_benz | 131079.766537 |
audi | 129604.533398 |
ford | 124360.846246 |
mileage["mean_prices"] = brand_prices
mileage
mileage | mean_prices | |
---|---|---|
volkswagen | 129056.199474 | 5366.253433 |
bmw | 132824.718673 | 8310.138470 |
opel | 129512.452107 | 2971.608238 |
mercedes_benz | 131079.766537 | 8570.768699 |
audi | 129604.533398 | 9259.510248 |
ford | 124360.846246 | 3756.991955 |
mileage["mileage"] = mileage["mileage"].astype(int)
mileage["mean_prices"] = mileage["mean_prices"].astype(int)
mileage
mileage | mean_prices | |
---|---|---|
volkswagen | 129056 | 5366 |
bmw | 132824 | 8310 |
opel | 129512 | 2971 |
mercedes_benz | 131079 | 8570 |
audi | 129604 | 9259 |
ford | 124360 | 3756 |
The principal findings here are:
Several columns still use the same German words, so we can find them and translate to english.
The steps are the same in previous methods:
str.replace()
method to replace the words.autos.head()
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unreparied_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
autos["vehicle_type"].unique()
array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv', 'cabrio', 'andere'], dtype=object)
autos["fuel_type"].unique()
array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro', 'andere'], dtype=object)
autos["vehicle_type"] = autos["vehicle_type"].str.replace("kleinwagen","small_car").str.replace("kombi","station_wagon").str.replace("cabrio","convertible").str.replace("andere","other")
autos["vehicle_type"].unique()
array(['bus', 'limousine', 'small_car', 'station_wagon', nan, 'coupe', 'suv', 'convertible', 'other'], dtype=object)
autos["fuel_type"] = autos["fuel_type"].str.replace("benzin","petrol").str.replace("elektro","electro")
autos["fuel_type"].unique()
array(['lpg', 'petrol', 'diesel', nan, 'cng', 'hybrid', 'electro', 'andere'], dtype=object)
In this part we are going to convert a string to int.
str.split
to select only the first part of the string.astype
to convert the string to integer.autos["date_crawled"] = (autos["date_crawled"]
.str.split()
.str[0]
)
autos["date_crawled"].head()
0 2016-03-26 1 2016-04-04 2 2016-03-26 3 2016-03-12 4 2016-04-01 Name: date_crawled, dtype: object
autos["date_crawled"] = autos["date_crawled"].str.replace("-","")
# Once we replace we can convert to int with .astype method
autos["date_crawled"] = autos["date_crawled"].astype(int)
autos["date_crawled"].head()
0 20160326 1 20160404 2 20160326 3 20160312 4 20160401 Name: date_crawled, dtype: int64
In this part we are going to identify the most offered models of each brand.
These are the steps:
split
method to select the second word of the name, that word is the modelautos["name_model"] = autos["name"].str.replace("_"," ")
autos["name_model"] = (autos["name_model"]
.str.split()
.str[1]
)
autos["name_model"].head()
0 807 1 740i 2 Golf 3 smart 4 Focus Name: name_model, dtype: object
autos["name_model"].head(10)
0 807 1 740i 2 Golf 3 smart 4 Focus 5 Grand 6 Golf 7 IV 8 Arosa 9 Megane Name: name_model, dtype: object
model_counts = autos["name_model"].value_counts(normalize=True).head(10)
# Series.index attribute to access the labels
top_model = model_counts[model_counts > 0.01]
top_model.head(20)
Benz 0.067089 Golf 0.053163 Corsa 0.026269 Polo 0.024333 Astra 0.023397 Passat 0.021815 A4 0.021711 A3 0.014654 A6 0.014613 Focus 0.014155 Name: name_model, dtype: float64
We are now tray to find how much change the mean price of a used cars if the mileage increase.
For this we are going to do the following:
.mean()
methododometer_to_50k = autos[autos["odometer_km"] <= 50000]
odometer_to_100k = autos[(autos["odometer_km"] > 50000) & (autos["odometer_km"] <= 100000)]
odometer_to_150k = autos[(autos["odometer_km"] > 100000) & (autos["odometer_km"] <= 150000)]
Price1 = round(odometer_to_50k["price"].mean(),0)
Price2 = round (odometer_to_100k["price"].mean(),0)
Price3 = round (odometer_to_150k["price"].mean(),0)
print("Average price with odometer from 0 to 50k km:", Price1)
print("Average price with odometer from 50 to 100k km:", Price2)
print("Average price with odometer from 100 to 150k km:", Price3)
Average price with odometer from 0 to 50k km: 15025.0 Average price with odometer from 50 to 100k km: 9515.0 Average price with odometer from 100 to 150k km: 4087.0
Finally we are going to find the difference on price between a car with no damage and with damage.
"unrepaired_damage"
autos["unreparied_damage"].unique()
array(['nein', nan, 'ja'], dtype=object)
no_damage = autos[autos["unreparied_damage"] == 'nein']
damage = autos[autos["unreparied_damage"] == 'ja']
Price_1 = round(no_damage["price"].mean())
Price_2 = round(damage["price"].mean())
print("Average no damage:", Price_1)
print("Average with damage:", Price_2)
Average no damage: 7099 Average with damage: 2250
ratio_damage = 7099/2250
ratio_damage
3.155111111111111
We resume the main findings of our data here: