Used cars listing on ebay: how to clean up the data and analyze the results.

The aim of this project is to clean a large dataset and analyze the information using pandasand numpy libraries of python.

The dataset comes from eBay Kleinanzeigen classified section on ebay, you can download the full version in the this link.

The data dictionary provided with data is as follows:

  • dataCrawled - When this ad was first crawled. All field-values are taken from this date.
  • name - Name of the car.
  • seller - Whether the seller is private or a dealer.
  • offerType - The type of listing
  • price - The price on the ad to sell the car.
  • abtest - Whether the listing is included in an A/B test.
  • vehicleType - The vehicle Type.
  • yearOfRegistration - The year in which the car was first registered.
  • gearbox - The transmission type.
  • powerPS - The power of the car in PS.
  • model - he car model name.
  • kilometer - How many kilometers the car has driven.
  • monthOfRegistration - The month in which the car was first registered.
  • fuelType - What type of fuel the car uses.
  • brand - The brand of the car.
  • notRepairedDamage - If the car has a damage which is not yet repaired.
  • dateCreated - The date on which the eBay listing was created.
  • nrOfPictures - The number of pictures in the ad.
  • postalCode - The postal code for the location of the vehicle.
  • lastSeenOnline - When the crawler saw this ad last online.
In [1]:
import pandas as pd
import numpy as np

# To read the csv file we use the .read_csv method
autos = pd.read_csv('autos(1).csv', encoding = 'Latin-1')
In [2]:
autos
Out[2]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
5 2016-03-21 13:47:45 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... privat Angebot $7,900 test bus 2006 automatik 150 voyager 150,000km 4 diesel chrysler NaN 2016-03-21 00:00:00 0 22962 2016-04-06 09:45:21
6 2016-03-20 17:55:21 VW_Golf_III_GT_Special_Electronic_Green_Metall... privat Angebot $300 test limousine 1995 manuell 90 golf 150,000km 8 benzin volkswagen NaN 2016-03-20 00:00:00 0 31535 2016-03-23 02:48:59
7 2016-03-16 18:55:19 Golf_IV_1.9_TDI_90PS privat Angebot $1,990 control limousine 1998 manuell 90 golf 150,000km 12 diesel volkswagen nein 2016-03-16 00:00:00 0 53474 2016-04-07 03:17:32
8 2016-03-22 16:51:34 Seat_Arosa privat Angebot $250 test NaN 2000 manuell 0 arosa 150,000km 10 NaN seat nein 2016-03-22 00:00:00 0 7426 2016-03-26 18:18:10
9 2016-03-16 13:47:02 Renault_Megane_Scenic_1.6e_RT_Klimaanlage privat Angebot $590 control bus 1997 manuell 90 megane 150,000km 7 benzin renault nein 2016-03-16 00:00:00 0 15749 2016-04-06 10:46:35
10 2016-03-15 01:41:36 VW_Golf_Tuning_in_siber/grau privat Angebot $999 test NaN 2017 manuell 90 NaN 150,000km 4 benzin volkswagen nein 2016-03-14 00:00:00 0 86157 2016-04-07 03:16:21
11 2016-03-16 18:45:34 Mercedes_A140_Motorschaden privat Angebot $350 control NaN 2000 NaN 0 NaN 150,000km 0 benzin mercedes_benz NaN 2016-03-16 00:00:00 0 17498 2016-03-16 18:45:34
12 2016-03-31 19:48:22 Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... privat Angebot $5,299 control kleinwagen 2010 automatik 71 fortwo 50,000km 9 benzin smart nein 2016-03-31 00:00:00 0 34590 2016-04-06 14:17:52
13 2016-03-23 10:48:32 Audi_A3_1.6_tuning privat Angebot $1,350 control limousine 1999 manuell 101 a3 150,000km 11 benzin audi nein 2016-03-23 00:00:00 0 12043 2016-04-01 14:17:13
14 2016-03-23 11:50:46 Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... privat Angebot $3,999 test kleinwagen 2007 manuell 75 clio 150,000km 9 benzin renault NaN 2016-03-23 00:00:00 0 81737 2016-04-01 15:46:47
15 2016-04-01 12:06:20 Corvette_C3_Coupe_T_Top_Crossfire_Injection privat Angebot $18,900 test coupe 1982 automatik 203 NaN 80,000km 6 benzin sonstige_autos nein 2016-04-01 00:00:00 0 61276 2016-04-02 21:10:48
16 2016-03-16 14:59:02 Opel_Vectra_B_Kombi privat Angebot $350 test kombi 1999 manuell 101 vectra 150,000km 5 benzin opel nein 2016-03-16 00:00:00 0 57299 2016-03-18 05:29:37
17 2016-03-29 11:46:22 Volkswagen_Scirocco_2_G60 privat Angebot $5,500 test coupe 1990 manuell 205 scirocco 150,000km 6 benzin volkswagen nein 2016-03-29 00:00:00 0 74821 2016-04-05 20:46:26
18 2016-03-26 19:57:44 Verkaufen_mein_bmw_e36_320_i_touring privat Angebot $300 control bus 1995 manuell 150 3er 150,000km 0 benzin bmw NaN 2016-03-26 00:00:00 0 54329 2016-04-02 12:16:41
19 2016-03-17 13:36:21 mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 privat Angebot $4,150 control suv 2004 manuell 124 andere 150,000km 2 lpg mazda nein 2016-03-17 00:00:00 0 40878 2016-03-17 14:45:58
20 2016-03-05 19:57:31 Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... privat Angebot $3,500 test kombi 2003 manuell 131 a4 150,000km 5 diesel audi NaN 2016-03-05 00:00:00 0 53913 2016-03-07 05:46:46
21 2016-03-06 19:07:10 Porsche_911_Carrera_4S_Cabrio privat Angebot $41,500 test cabrio 2004 manuell 320 911 150,000km 4 benzin porsche nein 2016-03-06 00:00:00 0 65428 2016-04-05 23:46:19
22 2016-03-28 20:50:54 MINI_Cooper_S_Cabrio privat Angebot $25,450 control cabrio 2015 manuell 184 cooper 10,000km 1 benzin mini nein 2016-03-28 00:00:00 0 44789 2016-04-01 06:45:30
23 2016-03-10 19:55:34 Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima privat Angebot $7,999 control bus 2010 manuell 120 NaN 150,000km 2 diesel peugeot nein 2016-03-10 00:00:00 0 30900 2016-03-17 08:45:17
24 2016-04-03 11:57:02 BMW_535i_xDrive_Sport_Aut. privat Angebot $48,500 control limousine 2014 automatik 306 5er 30,000km 12 benzin bmw nein 2016-04-03 00:00:00 0 22547 2016-04-07 13:16:50
25 2016-03-21 21:56:18 Ford_escort_kombi_an_bastler_mit_ghia_ausstattung privat Angebot $90 control kombi 1996 manuell 116 NaN 150,000km 4 benzin ford ja 2016-03-21 00:00:00 0 27574 2016-04-01 05:16:49
26 2016-04-03 22:46:28 Volkswagen_Polo_Fox privat Angebot $777 control kleinwagen 1992 manuell 54 polo 125,000km 2 benzin volkswagen nein 2016-04-03 00:00:00 0 38110 2016-04-05 23:46:48
27 2016-03-27 18:45:01 Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE privat Angebot $0 control NaN 2005 NaN 0 NaN 150,000km 0 NaN ford NaN 2016-03-27 00:00:00 0 66701 2016-03-27 18:45:01
28 2016-03-19 21:56:19 MINI_Cooper_D privat Angebot $5,250 control kleinwagen 2007 manuell 110 cooper 150,000km 7 diesel mini ja 2016-03-19 00:00:00 0 15745 2016-04-07 14:58:48
29 2016-04-02 12:45:44 Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... privat Angebot $4,999 test kombi 2004 automatik 204 e_klasse 150,000km 10 diesel mercedes_benz nein 2016-04-02 00:00:00 0 47638 2016-04-02 12:45:44
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49970 2016-03-21 22:47:37 c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... privat Angebot $15,800 control bus 2010 automatik 136 c4 60,000km 4 diesel citroen nein 2016-03-21 00:00:00 0 14947 2016-04-07 04:17:34
49971 2016-03-29 14:54:12 W.Lupo_1.0 privat Angebot $950 test kleinwagen 2001 manuell 50 lupo 150,000km 4 benzin volkswagen nein 2016-03-29 00:00:00 0 65197 2016-03-29 20:41:51
49972 2016-03-26 22:25:23 Mercedes_Benz_Vito_115_CDI_Extralang_Aut. privat Angebot $3,300 control bus 2004 automatik 150 vito 150,000km 10 diesel mercedes_benz ja 2016-03-26 00:00:00 0 65326 2016-03-28 11:28:18
49973 2016-03-27 05:32:39 Mercedes_Benz_SLK_200_Kompressor privat Angebot $6,000 control cabrio 2004 manuell 163 slk 150,000km 11 benzin mercedes_benz nein 2016-03-27 00:00:00 0 53567 2016-03-27 08:25:24
49974 2016-03-20 10:52:31 Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... privat Angebot $0 control cabrio 1983 manuell 70 golf 150,000km 2 benzin volkswagen nein 2016-03-20 00:00:00 0 8209 2016-03-27 19:48:16
49975 2016-03-27 20:51:39 Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort privat Angebot $9,700 control kleinwagen 2012 automatik 88 jazz 100,000km 11 hybrid honda nein 2016-03-27 00:00:00 0 84385 2016-04-05 19:45:34
49976 2016-03-19 18:56:05 Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... privat Angebot $5,900 test kombi 1992 automatik 150 80 150,000km 12 benzin audi nein 2016-03-19 00:00:00 0 36100 2016-04-07 06:16:44
49977 2016-03-31 18:37:18 Mercedes_Benz_C200_Cdi_W203 privat Angebot $5,500 control limousine 2003 manuell 116 c_klasse 150,000km 2 diesel mercedes_benz nein 2016-03-31 00:00:00 0 33739 2016-04-06 12:16:11
49978 2016-04-04 10:37:14 Mercedes_Benz_E_200_Classic privat Angebot $900 control limousine 1996 automatik 136 e_klasse 150,000km 9 benzin mercedes_benz ja 2016-04-04 00:00:00 0 24405 2016-04-06 12:44:20
49979 2016-03-20 18:38:40 Volkswagen_Polo_1.6_TDI_Style privat Angebot $11,000 test kleinwagen 2011 manuell 90 polo 70,000km 11 diesel volkswagen nein 2016-03-20 00:00:00 0 48455 2016-04-07 01:45:12
49980 2016-03-12 10:55:54 Ford_Escort_Turnier_16V privat Angebot $400 control kombi 1995 manuell 105 escort 125,000km 3 benzin ford NaN 2016-03-12 00:00:00 0 56218 2016-04-06 17:16:49
49981 2016-03-15 09:38:21 Opel_Astra_Kombi_mit_Anhaengerkupplung privat Angebot $2,000 control kombi 1998 manuell 115 astra 150,000km 12 benzin opel nein 2016-03-15 00:00:00 0 86859 2016-04-05 17:21:46
49982 2016-03-29 18:51:08 Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm privat Angebot $1,950 control kleinwagen 2004 manuell 0 fabia 90,000km 7 benzin skoda NaN 2016-03-29 00:00:00 0 45884 2016-03-29 18:51:08
49983 2016-03-06 12:43:04 Ford_focus_99 privat Angebot $600 test kleinwagen 1999 manuell 101 focus 150,000km 4 benzin ford NaN 2016-03-06 00:00:00 0 52477 2016-03-09 06:16:08
49984 2016-03-31 22:48:48 Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... privat Angebot $0 test NaN 2000 NaN 0 NaN 150,000km 0 NaN sonstige_autos NaN 2016-03-31 00:00:00 0 12103 2016-04-02 19:44:53
49985 2016-04-02 16:38:23 Verkaufe_meinen_vw_vento! privat Angebot $1,000 control NaN 1995 automatik 0 NaN 150,000km 0 benzin volkswagen NaN 2016-04-02 00:00:00 0 30900 2016-04-06 15:17:52
49986 2016-04-04 20:46:02 Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... privat Angebot $15,900 control limousine 2010 automatik 218 300c 125,000km 11 diesel chrysler nein 2016-04-04 00:00:00 0 73527 2016-04-06 23:16:00
49987 2016-03-22 20:47:27 Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... privat Angebot $21,990 control limousine 2013 manuell 150 a3 50,000km 11 diesel audi nein 2016-03-22 00:00:00 0 94362 2016-03-26 22:46:06
49988 2016-03-28 19:49:51 BMW_330_Ci privat Angebot $9,550 control coupe 2001 manuell 231 3er 150,000km 10 benzin bmw nein 2016-03-28 00:00:00 0 83646 2016-04-07 02:17:40
49989 2016-03-11 19:50:37 VW_Polo_zum_Ausschlachten_oder_Wiederaufbau privat Angebot $150 test kleinwagen 1997 manuell 0 polo 150,000km 5 benzin volkswagen ja 2016-03-11 00:00:00 0 21244 2016-03-12 10:17:55
49990 2016-03-21 19:54:19 Mercedes_Benz_A_200__BlueEFFICIENCY__Urban privat Angebot $17,500 test limousine 2012 manuell 156 a_klasse 30,000km 12 benzin mercedes_benz nein 2016-03-21 00:00:00 0 58239 2016-04-06 22:46:57
49991 2016-03-06 15:25:19 Kleinwagen privat Angebot $500 control NaN 2016 manuell 0 twingo 150,000km 0 benzin renault NaN 2016-03-06 00:00:00 0 61350 2016-03-06 18:24:19
49992 2016-03-10 19:37:38 Fiat_Grande_Punto_1.4_T_Jet_16V_Sport privat Angebot $4,800 control kleinwagen 2009 manuell 120 andere 125,000km 9 lpg fiat nein 2016-03-10 00:00:00 0 68642 2016-03-13 01:44:51
49993 2016-03-15 18:47:35 Audi_A3__1_8l__Silber;_schoenes_Fahrzeug privat Angebot $1,650 control kleinwagen 1997 manuell 0 NaN 150,000km 7 benzin audi NaN 2016-03-15 00:00:00 0 65203 2016-04-06 19:46:53
49994 2016-03-22 17:36:42 Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... privat Angebot $5,000 control kombi 2001 automatik 299 a6 150,000km 1 benzin audi nein 2016-03-22 00:00:00 0 46537 2016-04-06 08:16:39
49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon privat Angebot $24,900 control limousine 2011 automatik 239 q5 100,000km 1 diesel audi nein 2016-03-27 00:00:00 0 82131 2016-04-01 13:47:40
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot $1,980 control cabrio 1996 manuell 75 astra 150,000km 5 benzin opel nein 2016-03-28 00:00:00 0 44807 2016-04-02 14:18:02
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot $13,200 test cabrio 2014 automatik 69 500 5,000km 11 benzin fiat nein 2016-04-02 00:00:00 0 73430 2016-04-04 11:47:27
49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition privat Angebot $22,900 control kombi 2013 manuell 150 a3 40,000km 11 diesel audi nein 2016-03-08 00:00:00 0 35683 2016-04-05 16:45:07
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V privat Angebot $1,250 control limousine 1996 manuell 101 vectra 150,000km 1 benzin opel nein 2016-03-13 00:00:00 0 45897 2016-04-06 21:18:48

50000 rows × 20 columns

Information of the dataset and first rows

At first glance we have the following information of the dataset using the .info method:

  • We have 20 columns, most of the strings
  • the columns vehicle type, gear box, model, fueltype and nonreparied damage have null values
In [3]:
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

Another important observation is that the column names use capitalizacion style named camelcase, python prefers to work with snakecase. We can see this using .head() method to see the first results.

In [4]:
autos.head()
Out[4]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

Converting the column names to snakecase

We are going to convert the names case so we can have a more workable case for the dataset.

In [5]:
columns = autos.columns
columns
Out[5]:
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

Manually I made the change of the names, following this criteria:

  • From camelcase to snakecase
  • yearOfRegistration to registration_year
  • monthOfRegistration to registration_month
  • notRepairedDamage to unrepaired_damage
  • dateCreated to ad_created
In [6]:
new_names= ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer','registration_month','fuel_type','brand',
       'unreparied_damage','ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']

autos.columns = new_names
autos.columns
Out[6]:
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unreparied_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

Now we can see the changes in the columns looking the results with the .head method

In [7]:
autos.head()
Out[7]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unreparied_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

Data exploration using special methods

In this step we are going to explore the data in detail, first using .describe method to see some statistical results and look what kind of objects we have in the data.

In [8]:
autos.describe(include='all')
Out[8]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unreparied_damage ad_created nr_of_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-27 22:55:05 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN
  • The column nr_of_pictures have mostly 0 value, so we can remove the column without afecting the data
  • The registration_year have a min value of 1000, it's really difficult to find a car of the year 1000 now a days, so we need to check that.
  • The registration_month have a min value of 0, it's not a valid month so we need to check that too
  • The power_ps column have min 0 value, is really dificult to find a car with 0 power, so we need to check that.
  • price and odometer are registered as strings, we need to convert to number so we can use the data.
  • The columns seller, offer_type and nr_of_pictures have unique values, so we can drop this columns without affecting the data.

So we need to drop the Columns that have the same values, we are not going to use the columns at this moment.

In [9]:
autos = autos.drop(["seller","offer_type","nr_of_pictures"], axis=1)

Converting Price column to integers

Now we need to clean the data in the 'price' column, and also convert this data in integers for further calculus, the steps are:

  1. Identify which kind of object we have in the column
  2. Identify which symbols we need to replace
  3. Convert to int using astype method.
  4. Verify that the results are according the expected using .head()
In [10]:
auto_desc = autos["price"].describe()
auto_desc
Out[10]:
count     50000
unique     2357
top          $0
freq       1421
Name: price, dtype: object
In [11]:
# We need to replace the $ symbol and , separator with str.repalce
autos["price"] = autos["price"].str.replace("$","").str.replace(",","")
# Once we replace we can convert to int with .astype method
autos["price"] = autos["price"].astype(int)
In [12]:
autos["price"].head()
Out[12]:
0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64

Converting Odometer to integer type

The steps are similar with prices, but in this case we are going to use .rename()method at the to convert the name from odometer to odometer_km, so the person that will use the dataset could know which units use the odometer.

In [13]:
autos["odometer"].value_counts()
Out[13]:
150,000km    32424
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer, dtype: int64
In [14]:
# We need to replace the $ symbol and , separator with str.repalce
autos["odometer"] = autos["odometer"].str.replace("km","").str.replace(",","")
# Once we replace we can convert to int with .astype method
autos["odometer"] = autos["odometer"].astype(int)
In [15]:
autos["odometer"].head()
Out[15]:
0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer, dtype: int64
In [16]:
#With the rename method we need to indicate the previous name ant the new name
autos.rename(columns={"odometer":"odometer_km"}, inplace=True)
autos.describe(include = 'all')
Out[16]:
date_crawled name price abtest vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unreparied_damage ad_created postal_code last_seen
count 50000 50000 5.000000e+04 50000 44905 50000.000000 47320 50000.000000 47242 50000.000000 50000.000000 45518 50000 40171 50000 50000.000000 50000
unique 48213 38754 NaN 2 8 NaN 2 NaN 245 NaN NaN 7 40 2 76 NaN 39481
top 2016-03-27 22:55:05 Ford_Fiesta NaN test limousine NaN manuell NaN golf NaN NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN 2016-04-07 06:17:27
freq 3 78 NaN 25756 12859 NaN 36993 NaN 4024 NaN NaN 30107 10687 35232 1946 NaN 8
mean NaN NaN 9.840044e+03 NaN NaN 2005.073280 NaN 116.355920 NaN 125732.700000 5.723360 NaN NaN NaN NaN 50813.627300 NaN
std NaN NaN 4.811044e+05 NaN NaN 105.712813 NaN 209.216627 NaN 40042.211706 3.711984 NaN NaN NaN NaN 25779.747957 NaN
min NaN NaN 0.000000e+00 NaN NaN 1000.000000 NaN 0.000000 NaN 5000.000000 0.000000 NaN NaN NaN NaN 1067.000000 NaN
25% NaN NaN 1.100000e+03 NaN NaN 1999.000000 NaN 70.000000 NaN 125000.000000 3.000000 NaN NaN NaN NaN 30451.000000 NaN
50% NaN NaN 2.950000e+03 NaN NaN 2003.000000 NaN 105.000000 NaN 150000.000000 6.000000 NaN NaN NaN NaN 49577.000000 NaN
75% NaN NaN 7.200000e+03 NaN NaN 2008.000000 NaN 150.000000 NaN 150000.000000 9.000000 NaN NaN NaN NaN 71540.000000 NaN
max NaN NaN 1.000000e+08 NaN NaN 9999.000000 NaN 17700.000000 NaN 150000.000000 12.000000 NaN NaN NaN NaN 99998.000000 NaN

Exploring the data on detail: outliers in the data

Now that we have the data converted to int type, we need to explore the outliers, some data that have no sense or are not adjusted to our goal. The steps are the same for odometer_km and price.

  1. Verify the number of unique values using .unique().shape method
  2. If we have outliers, identify the total number using value_counts().sort_index() method.
  3. Use the .between() method to indicate the range of values that we are going to use
In [17]:
autos["odometer_km"].unique().shape
Out[17]:
(13,)
In [18]:
autos["odometer_km"].describe()
Out[18]:
count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
In [19]:
autos["odometer_km"].value_counts().sort_index(ascending=False)
Out[19]:
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
40000       819
30000       789
20000       784
10000       264
5000        967
Name: odometer_km, dtype: int64
In [20]:
# With this we obtain the number of unique values
autos["price"].unique().shape
Out[20]:
(2357,)
In [21]:
autos["price"].describe()
Out[21]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
In [22]:
# First we need to see the extreme superior values, prices above 1 million USD must be cleaned
autos["price"].value_counts().sort_index(ascending=False).head(20)
Out[22]:
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64
In [23]:
# Using `sort_index()` we can sort using the unique values, now we see the low prices, 0-100USD are a very low price
autos["price"].value_counts().sort_index(ascending=True).head(20)
Out[23]:
0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price, dtype: int64
In [24]:
#Using only .value_counts() we can see the values by frequency
autos["price"].value_counts().head(50)
Out[24]:
0       1421
500      781
1500     734
2500     643
1000     639
1200     639
600      531
800      498
3500     498
2000     460
999      434
750      433
900      420
650      419
850      410
700      395
4500     394
300      384
2200     382
950      379
1100     376
1300     371
3000     365
550      356
1800     355
5500     340
1250     335
350      335
1600     327
1999     322
400      321
6500     300
1400     292
250      291
2800     291
2300     290
1350     276
1700     268
200      266
450      265
3800     264
3200     261
2900     256
1450     252
4000     246
2999     242
2600     241
5000     239
1900     239
1650     238
Name: price, dtype: int64

With this info is better to start from 100USD than from 1USD, is difficult to asses if below this values is realistic to purchase a car, we need more info about how this market works.

In [25]:
#With between method we remove the outliers and reduce the dataset.
autos = autos[autos["price"].between(100,351000)]
autos["price"].describe()
Out[25]:
count     48224.000000
mean       5930.371433
std        9078.372762
min         100.000000
25%        1250.000000
50%        3000.000000
75%        7499.000000
max      350000.000000
Name: price, dtype: float64

Working with dates

We have 5 types of data from dates:

  • date_crawled: added by the crawler
  • last_seen: added by the crawler
  • ad_created: from the website
  • registration_month: from the website
  • registration_year: from the website

We need to convert the dates that are identified as strings to date format.

In [26]:
#First we see how is written the dates
autos[['date_crawled','ad_created','last_seen']][0:5]
Out[26]:
date_crawled ad_created last_seen
0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54
1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08
2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37
3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28
4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50
In [27]:
# We need to remove the hour and leave only the dates using .str method
autos['date_crawled'].str[:10].head(10)
Out[27]:
0    2016-03-26
1    2016-04-04
2    2016-03-26
3    2016-03-12
4    2016-04-01
5    2016-03-21
6    2016-03-20
7    2016-03-16
8    2016-03-22
9    2016-03-16
Name: date_crawled, dtype: object
In [28]:
#Now we use value_counts() to see the most frequent dates
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).head(10)
Out[28]:
2016-04-03    0.038611
2016-03-20    0.037803
2016-03-21    0.037201
2016-03-12    0.036911
2016-03-14    0.036662
2016-04-04    0.036538
2016-03-07    0.036061
2016-04-02    0.035605
2016-03-28    0.034962
2016-03-19    0.034734
Name: date_crawled, dtype: float64
In [29]:
# We need to identify the first dates of crawling using .sort_index()
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index().head(10)
Out[29]:
2016-03-05    0.025361
2016-03-06    0.014039
2016-03-07    0.036061
2016-03-08    0.033179
2016-03-09    0.033013
2016-03-10    0.032287
2016-03-11    0.032598
2016-03-12    0.036911
2016-03-13    0.015677
2016-03-14    0.036662
Name: date_crawled, dtype: float64
In [30]:
#Then we identify the last dates using the same method
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index().tail(10)
Out[30]:
2016-03-29    0.034112
2016-03-30    0.033738
2016-03-31    0.031851
2016-04-01    0.033697
2016-04-02    0.035605
2016-04-03    0.038611
2016-04-04    0.036538
2016-04-05    0.013064
2016-04-06    0.003173
2016-04-07    0.001389
Name: date_crawled, dtype: float64
In [31]:
#The same previous process we will use with the other dates
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).head(10)
Out[31]:
2016-04-03    0.038860
2016-03-20    0.037865
2016-03-21    0.037429
2016-04-04    0.036890
2016-03-12    0.036745
2016-04-02    0.035294
2016-03-14    0.035294
2016-03-28    0.035066
2016-03-07    0.034796
2016-03-29    0.034070
Name: ad_created, dtype: float64
In [32]:
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index().head(10)
Out[32]:
2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
Name: ad_created, dtype: float64
In [33]:
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index().tail(10)
Out[33]:
2016-03-29    0.034070
2016-03-30    0.033552
2016-03-31    0.031893
2016-04-01    0.033676
2016-04-02    0.035294
2016-04-03    0.038860
2016-04-04    0.036890
2016-04-05    0.011799
2016-04-06    0.003256
2016-04-07    0.001244
Name: ad_created, dtype: float64
In [34]:
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).head(10)
Out[34]:
2016-04-06    0.221964
2016-04-07    0.132154
2016-04-05    0.125062
2016-03-17    0.028098
2016-04-03    0.025133
2016-04-02    0.024884
2016-03-30    0.024697
2016-04-04    0.024531
2016-03-31    0.023826
2016-03-12    0.023785
Name: last_seen, dtype: float64
In [35]:
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index().head(10)
Out[35]:
2016-03-05    0.001078
2016-03-06    0.004313
2016-03-07    0.005433
2016-03-08    0.007320
2016-03-09    0.009580
2016-03-10    0.010638
2016-03-11    0.012400
2016-03-12    0.023785
2016-03-13    0.008875
2016-03-14    0.012629
Name: last_seen, dtype: float64
In [36]:
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index().tail(10)
Out[36]:
2016-03-29    0.022292
2016-03-30    0.024697
2016-03-31    0.023826
2016-04-01    0.022852
2016-04-02    0.024884
2016-04-03    0.025133
2016-04-04    0.024531
2016-04-05    0.125062
2016-04-06    0.221964
2016-04-07    0.132154
Name: last_seen, dtype: float64

Registration year cleaning

From the previous exercise we identified:

  • The dataset was retrieved from march until april of 2016
  • The ads were created between june 2015 and april 2016

With this info is clear that we need to review the registration year of the cars. Similar to previous step with the price, we need to fix a range and exclude some data.

In [37]:
autos["registration_year"].describe()
Out[37]:
count    48224.000000
mean      2004.730964
std         87.897388
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The range of years

Looking the data we find that we have cars with registered years before 1900 (when we don't even have a clue of what is a Mercedes Benz) and years before 2016, the last year when we retrieved the data. So We need to explore the data set and removes this years.

In [38]:
autos["registration_year"].value_counts(normalize=True)
Out[38]:
2000    0.064366
2005    0.060364
1999    0.059701
2004    0.055968
2003    0.055844
2006    0.055325
2001    0.054516
2002    0.051364
1998    0.048524
2007    0.047134
2008    0.045828
2009    0.043132
1997    0.039918
2011    0.033552
2010    0.032909
2017    0.028679
1996    0.028098
2012    0.027123
2016    0.024925
1995    0.024739
2013    0.016610
2014    0.013728
1994    0.012981
2018    0.009705
1993    0.008709
2015    0.007880
1992    0.007610
1991    0.007009
1990    0.006885
1989    0.003546
          ...   
1934    0.000041
1951    0.000041
1941    0.000041
1954    0.000041
1955    0.000041
2019    0.000041
1910    0.000041
1957    0.000041
1800    0.000041
8888    0.000021
1950    0.000021
4500    0.000021
1929    0.000021
1953    0.000021
1111    0.000021
1001    0.000021
4800    0.000021
1931    0.000021
5911    0.000021
1948    0.000021
1927    0.000021
1943    0.000021
1939    0.000021
1938    0.000021
1000    0.000021
2800    0.000021
4100    0.000021
6200    0.000021
9000    0.000021
1952    0.000021
Name: registration_year, Length: 95, dtype: float64
In [39]:
autos["registration_year"].value_counts().sort_index().head(20)
Out[39]:
1000    1
1001    1
1111    1
1800    2
1910    2
1927    1
1929    1
1931    1
1934    2
1937    4
1938    1
1939    1
1941    2
1943    1
1948    1
1950    1
1951    2
1952    1
1953    1
1954    2
Name: registration_year, dtype: int64
In [40]:
autos["registration_year"].value_counts().sort_index().tail(20)
Out[40]:
2010    1587
2011    1618
2012    1308
2013     801
2014     662
2015     380
2016    1202
2017    1383
2018     468
2019       2
2800       1
4100       1
4500       1
4800       1
5000       3
5911       1
6200       1
8888       1
9000       1
9999       3
Name: registration_year, dtype: int64
In [41]:
autos = autos[autos["registration_year"].between(1900,2019)]
autos["registration_year"].describe()
Out[41]:
count    48205.000000
mean      2003.490053
std          7.511679
min       1910.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       2019.000000
Name: registration_year, dtype: float64
In [42]:
autos["registration_year"].value_counts(normalize=True)
Out[42]:
2000    0.064392
2005    0.060388
1999    0.059724
2004    0.055990
2003    0.055866
2006    0.055347
2001    0.054538
2002    0.051385
1998    0.048543
2007    0.047153
2008    0.045846
2009    0.043149
1997    0.039934
2011    0.033565
2010    0.032922
2017    0.028690
1996    0.028109
2012    0.027134
2016    0.024935
1995    0.024748
2013    0.016617
2014    0.013733
1994    0.012986
2018    0.009709
1993    0.008713
2015    0.007883
1992    0.007613
1991    0.007012
1990    0.006887
1989    0.003547
          ...   
1976    0.000436
1969    0.000394
1975    0.000373
1965    0.000353
1964    0.000249
1963    0.000166
1959    0.000124
1961    0.000124
1937    0.000083
1962    0.000083
1956    0.000083
1958    0.000083
1910    0.000041
1955    0.000041
2019    0.000041
1934    0.000041
1957    0.000041
1951    0.000041
1941    0.000041
1954    0.000041
1953    0.000021
1950    0.000021
1948    0.000021
1927    0.000021
1943    0.000021
1929    0.000021
1939    0.000021
1938    0.000021
1931    0.000021
1952    0.000021
Name: registration_year, Length: 81, dtype: float64

We prefer to mantain until 2019: could a mistake of typo not a deliberate error.

Aggregation with brands: understanding the most offered brands

Now we need to explore which are the most popular brands in the used cars, and explore only this data.

  1. We need to now which are the top brands by share of the total
  2. We are going to store this data in a new variable called top_brand_index
In [43]:
autos["brand"].value_counts(normalize=True).head(10)
Out[43]:
volkswagen       0.212986
bmw              0.108765
opel             0.108288
mercedes_benz    0.095965
audi             0.086028
ford             0.069619
renault          0.047837
peugeot          0.029458
fiat             0.025972
seat             0.018919
Name: brand, dtype: float64
In [44]:
brand_counts = autos["brand"].value_counts(normalize=True).head(10)
# Series.index attribute to access the labels
top_brand = brand_counts[brand_counts > 0.05]
print(top_brand)
print('\n')
top_brand_index = brand_counts[brand_counts > 0.05].index
print(top_brand_index)
volkswagen       0.212986
bmw              0.108765
opel             0.108288
mercedes_benz    0.095965
audi             0.086028
ford             0.069619
Name: brand, dtype: float64


Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')

Brand average prices an mileage

Once we identified the top brands, we are going to use the following information:

  1. We first create a dictionary that stores the top brand values and calculates using mean() method the mean value for brand.
  2. Using .Series() method we are going to convert the dictionary into a single dataframe
  3. Finally we are going to convert the dataframe into a single dataframe with column using .dataframe() method.
  4. Using the created dataframes for mileage and mean_pricewe are going to do a single dataframe, combining the tables.
In [45]:
brand_prices_dict = {}

for b in top_brand_index:
    selected_rows = autos[autos["brand"] == b]
    # Calculate the mean average price for just those rows
    mean = selected_rows["price"].mean()
    # Assign the mean value to the dictionary, using the
    # brand name as the key
    brand_prices_dict[b] = mean

brand_prices_dict
Out[45]:
{'volkswagen': 5366.253433330086,
 'bmw': 8310.138470341408,
 'opel': 2971.6082375478927,
 'mercedes_benz': 8570.76869865975,
 'audi': 9259.510248372317,
 'ford': 3756.9919547079858}
In [46]:
# pd. Series convert our dictionary into a single dataframe 
brand_prices = pd.Series(brand_prices_dict)
brand_prices
Out[46]:
volkswagen       5366.253433
bmw              8310.138470
opel             2971.608238
mercedes_benz    8570.768699
audi             9259.510248
ford             3756.991955
dtype: float64
In [47]:
# We can convert the series to a single Dataframe with column
brand_prices = pd.DataFrame(brand_prices, columns=['mean_price'])
brand_prices
Out[47]:
mean_price
volkswagen 5366.253433
bmw 8310.138470
opel 2971.608238
mercedes_benz 8570.768699
audi 9259.510248
ford 3756.991955
In [48]:
mileage_dict = {}

for b in top_brand_index:
    selected_rows = autos[autos["brand"] == b]
    # Calculate the mean average price for just those rows
    mean = selected_rows["odometer_km"].mean()
    # Assign the mean value to the dictionary, using the
    # brand name as the key
    mileage_dict[b] = mean

mileage_dict
Out[48]:
{'volkswagen': 129056.19947404305,
 'bmw': 132824.71867251574,
 'opel': 129512.45210727969,
 'mercedes_benz': 131079.76653696498,
 'audi': 129604.53339763684,
 'ford': 124360.8462455304}
In [49]:
# pd. Series convert our dictionary into a single dataframe 
mileage = pd.Series(mileage_dict)
mileage
Out[49]:
volkswagen       129056.199474
bmw              132824.718673
opel             129512.452107
mercedes_benz    131079.766537
audi             129604.533398
ford             124360.846246
dtype: float64
In [50]:
mileage = pd.DataFrame(mileage, columns = ['mileage'])
mileage
Out[50]:
mileage
volkswagen 129056.199474
bmw 132824.718673
opel 129512.452107
mercedes_benz 131079.766537
audi 129604.533398
ford 124360.846246
In [51]:
mileage["mean_prices"] = brand_prices
mileage
Out[51]:
mileage mean_prices
volkswagen 129056.199474 5366.253433
bmw 132824.718673 8310.138470
opel 129512.452107 2971.608238
mercedes_benz 131079.766537 8570.768699
audi 129604.533398 9259.510248
ford 124360.846246 3756.991955
In [52]:
mileage["mileage"] = mileage["mileage"].astype(int)
mileage["mean_prices"] = mileage["mean_prices"].astype(int)
mileage
Out[52]:
mileage mean_prices
volkswagen 129056 5366
bmw 132824 8310
opel 129512 2971
mercedes_benz 131079 8570
audi 129604 9259
ford 124360 3756

Findings on brands and price

The principal findings here are:

  • We can see that the Ford cars are selling with less mileage. This could be a red flag, maybe the car consumes more combustible or is difficult to find spare parts.
  • The most affordable model is opel.
  • Audi is the more expensive car, but the mileage is lower than mercedes, so we maybe need to check what other variables could affect the price in this cases.

Finding German words in the data and translating

Several columns still use the same German words, so we can find them and translate to english.

The steps are the same in previous methods:

  1. Review the columns with german words.
  2. Identify the unique words in this columns
  3. Use str.replace() method to replace the words.
In [53]:
autos.head()
Out[53]:
date_crawled name price abtest vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unreparied_damage ad_created postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 5000 control bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 2016-03-26 00:00:00 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 8500 control limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 2016-04-04 00:00:00 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 8990 test limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 2016-03-26 00:00:00 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4350 control kleinwagen 2007 automatik 71 fortwo 70000 6 benzin smart nein 2016-03-12 00:00:00 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 1350 test kombi 2003 manuell 0 focus 150000 7 benzin ford nein 2016-04-01 00:00:00 39218 2016-04-01 14:38:50
In [54]:
autos["vehicle_type"].unique()
Out[54]:
array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv',
       'cabrio', 'andere'], dtype=object)
In [55]:
autos["fuel_type"].unique()
Out[55]:
array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro',
       'andere'], dtype=object)
In [56]:
autos["vehicle_type"] = autos["vehicle_type"].str.replace("kleinwagen","small_car").str.replace("kombi","station_wagon").str.replace("cabrio","convertible").str.replace("andere","other")
In [57]:
autos["vehicle_type"].unique()
Out[57]:
array(['bus', 'limousine', 'small_car', 'station_wagon', nan, 'coupe',
       'suv', 'convertible', 'other'], dtype=object)
In [58]:
autos["fuel_type"] = autos["fuel_type"].str.replace("benzin","petrol").str.replace("elektro","electro")
In [59]:
autos["fuel_type"].unique()
Out[59]:
array(['lpg', 'petrol', 'diesel', nan, 'cng', 'hybrid', 'electro',
       'andere'], dtype=object)

Converting the dates to integer

In this part we are going to convert a string to int.

  • In this exercise we are going to use str.split to select only the first part of the string.
  • We are going to use astype to convert the string to integer.
In [60]:
autos["date_crawled"] = (autos["date_crawled"]
                                       .str.split()
                                       .str[0]
                              )
autos["date_crawled"].head()
Out[60]:
0    2016-03-26
1    2016-04-04
2    2016-03-26
3    2016-03-12
4    2016-04-01
Name: date_crawled, dtype: object
In [61]:
autos["date_crawled"] = autos["date_crawled"].str.replace("-","")
# Once we replace we can convert to int with .astype method
autos["date_crawled"] = autos["date_crawled"].astype(int)
autos["date_crawled"].head()
Out[61]:
0    20160326
1    20160404
2    20160326
3    20160312
4    20160401
Name: date_crawled, dtype: int64

Extracting model

In this part we are going to identify the most offered models of each brand.

These are the steps:

  1. Replace parts of the string for spaces to extract latter the words
  2. Use split method to select the second word of the name, that word is the model
  3. Identify the models that have more than 1% of the total of the offered models
In [62]:
autos["name_model"] = autos["name"].str.replace("_"," ")
In [63]:
autos["name_model"] = (autos["name_model"]
                              .str.split()
                              .str[1]
                              )
autos["name_model"].head()
Out[63]:
0      807
1     740i
2     Golf
3    smart
4    Focus
Name: name_model, dtype: object
In [64]:
autos["name_model"].head(10)
Out[64]:
0       807
1      740i
2      Golf
3     smart
4     Focus
5     Grand
6      Golf
7        IV
8     Arosa
9    Megane
Name: name_model, dtype: object
In [65]:
model_counts = autos["name_model"].value_counts(normalize=True).head(10)
# Series.index attribute to access the labels
top_model = model_counts[model_counts > 0.01]
top_model.head(20)
Out[65]:
Benz      0.067089
Golf      0.053163
Corsa     0.026269
Polo      0.024333
Astra     0.023397
Passat    0.021815
A4        0.021711
A3        0.014654
A6        0.014613
Focus     0.014155
Name: name_model, dtype: float64

Findings on models

  • Mercedes Benz is the most popular used car offered in the list.
  • Golf, Corsa, Polo and Astra also are popular models in the data set.
  • Audi A3, A4 and A6 are also popular models, even when Audi is the most expensive brand. So a lot of people is trying to sell used expensive models to purchase a new one.

Odometer vs price

We are now tray to find how much change the mean price of a used cars if the mileage increase.

For this we are going to do the following:

  1. We are going to split on 3 main groups the odometer values, (minus of 50k km, between 50k km and 100k km, most than 100k km)
  2. We are going to find the mean price of each group using .mean() method
In [66]:
odometer_to_50k = autos[autos["odometer_km"] <= 50000]
odometer_to_100k = autos[(autos["odometer_km"] > 50000) & (autos["odometer_km"] <= 100000)]
odometer_to_150k = autos[(autos["odometer_km"] > 100000) & (autos["odometer_km"] <= 150000)]

Price1 = round(odometer_to_50k["price"].mean(),0)
Price2 = round (odometer_to_100k["price"].mean(),0)
Price3 = round (odometer_to_150k["price"].mean(),0)

print("Average price with odometer from 0 to 50k km:", Price1)
print("Average price with odometer from 50 to 100k km:", Price2)
print("Average price with odometer from 100 to 150k km:", Price3)
Average price with odometer from 0 to 50k km: 15025.0
Average price with odometer from 50 to 100k km: 9515.0
Average price with odometer from 100 to 150k km: 4087.0

Findings on odometers

  • The average price of an used car decrease with more mileage. We need to dive deep for brands to see if any of this change with certain models.

Damaged cars

Finally we are going to find the difference on price between a car with no damage and with damage.

  1. We are going to find the unique values in "unrepaired_damage"
  2. Once we have the unique values, following the previous steps we find the mean price.
In [67]:
autos["unreparied_damage"].unique()
Out[67]:
array(['nein', nan, 'ja'], dtype=object)
In [68]:
no_damage = autos[autos["unreparied_damage"] == 'nein']
damage = autos[autos["unreparied_damage"] == 'ja']

Price_1 = round(no_damage["price"].mean())
Price_2 = round(damage["price"].mean())

print("Average no damage:", Price_1)
print("Average with damage:", Price_2)
Average no damage: 7099
Average with damage: 2250
In [69]:
ratio_damage = 7099/2250
ratio_damage
Out[69]:
3.155111111111111

Findings damage cars

  • A car without damage could cost 3 times the price of a car with a damage.

Conclusions

We resume the main findings of our data here:

  • We can see that the Ford cars are selling with less mileage. This could be a red flag, maybe the car consumes more combustible or is difficult to find spare parts.
  • The most affordable model is opel.
  • Audi is the more expensive car, but the mileage is lower than mercedes, so we maybe need to check what other variables could affect the price in this cases.
  • A car without damage could cost 3 times the price of a car with a damage.
  • The average price of an used car decrease with more mileage. We need to dive deep for brands to see if any of this change with certain models.
  • Mercedes Benz is the most popular used car offered in the list.
  • Golf, Corsa, Polo and Astra also are popular models in the data set.
  • Audi A3, A4 and A6 are also popular models, even when Audi is the most expensive brand. So a lot of people is trying to sell used expensive models to purchase a new one.