GUIDED PROJECT: EXPLORING EBay CAR SALES

For this project, we're using the dataset of used cars from eBay, uploaded to kaggle. The columns include: dateCrawled - When this ad was first crawled. All field-values are taken from this date.

  • name - Name of the car.
  • seller - Whether the seller is private or a dealer.
  • offerType - The type of listing
  • price - The price on the ad to sell the car.
  • abtest - Whether the listing is included in an A/B test.
  • vehicleType - The vehicle Type.
  • yearOfRegistration - The year in which the car was first registered.
  • gearbox - The transmission type.
  • powerPS - The power of the car in PS.
  • dateCreated - The date on which the eBay listing was created.
  • nrOfPictures - The number of pictures in the ad.
  • postalCode - The postal code for the location of the vehicle.
  • lastSeenOnline - When the crawler saw this ad last online.

The aim of this project is to clean the data, and analyze the included used car listings

In [1]:
#importing pandas and NumPy libraries
import pandas as pd
import numpy as np

#reading the csv file into pandas
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
In [2]:
autos
Out[2]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
5 2016-03-21 13:47:45 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... privat Angebot $7,900 test bus 2006 automatik 150 voyager 150,000km 4 diesel chrysler NaN 2016-03-21 00:00:00 0 22962 2016-04-06 09:45:21
6 2016-03-20 17:55:21 VW_Golf_III_GT_Special_Electronic_Green_Metall... privat Angebot $300 test limousine 1995 manuell 90 golf 150,000km 8 benzin volkswagen NaN 2016-03-20 00:00:00 0 31535 2016-03-23 02:48:59
7 2016-03-16 18:55:19 Golf_IV_1.9_TDI_90PS privat Angebot $1,990 control limousine 1998 manuell 90 golf 150,000km 12 diesel volkswagen nein 2016-03-16 00:00:00 0 53474 2016-04-07 03:17:32
8 2016-03-22 16:51:34 Seat_Arosa privat Angebot $250 test NaN 2000 manuell 0 arosa 150,000km 10 NaN seat nein 2016-03-22 00:00:00 0 7426 2016-03-26 18:18:10
9 2016-03-16 13:47:02 Renault_Megane_Scenic_1.6e_RT_Klimaanlage privat Angebot $590 control bus 1997 manuell 90 megane 150,000km 7 benzin renault nein 2016-03-16 00:00:00 0 15749 2016-04-06 10:46:35
10 2016-03-15 01:41:36 VW_Golf_Tuning_in_siber/grau privat Angebot $999 test NaN 2017 manuell 90 NaN 150,000km 4 benzin volkswagen nein 2016-03-14 00:00:00 0 86157 2016-04-07 03:16:21
11 2016-03-16 18:45:34 Mercedes_A140_Motorschaden privat Angebot $350 control NaN 2000 NaN 0 NaN 150,000km 0 benzin mercedes_benz NaN 2016-03-16 00:00:00 0 17498 2016-03-16 18:45:34
12 2016-03-31 19:48:22 Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... privat Angebot $5,299 control kleinwagen 2010 automatik 71 fortwo 50,000km 9 benzin smart nein 2016-03-31 00:00:00 0 34590 2016-04-06 14:17:52
13 2016-03-23 10:48:32 Audi_A3_1.6_tuning privat Angebot $1,350 control limousine 1999 manuell 101 a3 150,000km 11 benzin audi nein 2016-03-23 00:00:00 0 12043 2016-04-01 14:17:13
14 2016-03-23 11:50:46 Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... privat Angebot $3,999 test kleinwagen 2007 manuell 75 clio 150,000km 9 benzin renault NaN 2016-03-23 00:00:00 0 81737 2016-04-01 15:46:47
15 2016-04-01 12:06:20 Corvette_C3_Coupe_T_Top_Crossfire_Injection privat Angebot $18,900 test coupe 1982 automatik 203 NaN 80,000km 6 benzin sonstige_autos nein 2016-04-01 00:00:00 0 61276 2016-04-02 21:10:48
16 2016-03-16 14:59:02 Opel_Vectra_B_Kombi privat Angebot $350 test kombi 1999 manuell 101 vectra 150,000km 5 benzin opel nein 2016-03-16 00:00:00 0 57299 2016-03-18 05:29:37
17 2016-03-29 11:46:22 Volkswagen_Scirocco_2_G60 privat Angebot $5,500 test coupe 1990 manuell 205 scirocco 150,000km 6 benzin volkswagen nein 2016-03-29 00:00:00 0 74821 2016-04-05 20:46:26
18 2016-03-26 19:57:44 Verkaufen_mein_bmw_e36_320_i_touring privat Angebot $300 control bus 1995 manuell 150 3er 150,000km 0 benzin bmw NaN 2016-03-26 00:00:00 0 54329 2016-04-02 12:16:41
19 2016-03-17 13:36:21 mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 privat Angebot $4,150 control suv 2004 manuell 124 andere 150,000km 2 lpg mazda nein 2016-03-17 00:00:00 0 40878 2016-03-17 14:45:58
20 2016-03-05 19:57:31 Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... privat Angebot $3,500 test kombi 2003 manuell 131 a4 150,000km 5 diesel audi NaN 2016-03-05 00:00:00 0 53913 2016-03-07 05:46:46
21 2016-03-06 19:07:10 Porsche_911_Carrera_4S_Cabrio privat Angebot $41,500 test cabrio 2004 manuell 320 911 150,000km 4 benzin porsche nein 2016-03-06 00:00:00 0 65428 2016-04-05 23:46:19
22 2016-03-28 20:50:54 MINI_Cooper_S_Cabrio privat Angebot $25,450 control cabrio 2015 manuell 184 cooper 10,000km 1 benzin mini nein 2016-03-28 00:00:00 0 44789 2016-04-01 06:45:30
23 2016-03-10 19:55:34 Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima privat Angebot $7,999 control bus 2010 manuell 120 NaN 150,000km 2 diesel peugeot nein 2016-03-10 00:00:00 0 30900 2016-03-17 08:45:17
24 2016-04-03 11:57:02 BMW_535i_xDrive_Sport_Aut. privat Angebot $48,500 control limousine 2014 automatik 306 5er 30,000km 12 benzin bmw nein 2016-04-03 00:00:00 0 22547 2016-04-07 13:16:50
25 2016-03-21 21:56:18 Ford_escort_kombi_an_bastler_mit_ghia_ausstattung privat Angebot $90 control kombi 1996 manuell 116 NaN 150,000km 4 benzin ford ja 2016-03-21 00:00:00 0 27574 2016-04-01 05:16:49
26 2016-04-03 22:46:28 Volkswagen_Polo_Fox privat Angebot $777 control kleinwagen 1992 manuell 54 polo 125,000km 2 benzin volkswagen nein 2016-04-03 00:00:00 0 38110 2016-04-05 23:46:48
27 2016-03-27 18:45:01 Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE privat Angebot $0 control NaN 2005 NaN 0 NaN 150,000km 0 NaN ford NaN 2016-03-27 00:00:00 0 66701 2016-03-27 18:45:01
28 2016-03-19 21:56:19 MINI_Cooper_D privat Angebot $5,250 control kleinwagen 2007 manuell 110 cooper 150,000km 7 diesel mini ja 2016-03-19 00:00:00 0 15745 2016-04-07 14:58:48
29 2016-04-02 12:45:44 Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... privat Angebot $4,999 test kombi 2004 automatik 204 e_klasse 150,000km 10 diesel mercedes_benz nein 2016-04-02 00:00:00 0 47638 2016-04-02 12:45:44
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49970 2016-03-21 22:47:37 c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... privat Angebot $15,800 control bus 2010 automatik 136 c4 60,000km 4 diesel citroen nein 2016-03-21 00:00:00 0 14947 2016-04-07 04:17:34
49971 2016-03-29 14:54:12 W.Lupo_1.0 privat Angebot $950 test kleinwagen 2001 manuell 50 lupo 150,000km 4 benzin volkswagen nein 2016-03-29 00:00:00 0 65197 2016-03-29 20:41:51
49972 2016-03-26 22:25:23 Mercedes_Benz_Vito_115_CDI_Extralang_Aut. privat Angebot $3,300 control bus 2004 automatik 150 vito 150,000km 10 diesel mercedes_benz ja 2016-03-26 00:00:00 0 65326 2016-03-28 11:28:18
49973 2016-03-27 05:32:39 Mercedes_Benz_SLK_200_Kompressor privat Angebot $6,000 control cabrio 2004 manuell 163 slk 150,000km 11 benzin mercedes_benz nein 2016-03-27 00:00:00 0 53567 2016-03-27 08:25:24
49974 2016-03-20 10:52:31 Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... privat Angebot $0 control cabrio 1983 manuell 70 golf 150,000km 2 benzin volkswagen nein 2016-03-20 00:00:00 0 8209 2016-03-27 19:48:16
49975 2016-03-27 20:51:39 Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort privat Angebot $9,700 control kleinwagen 2012 automatik 88 jazz 100,000km 11 hybrid honda nein 2016-03-27 00:00:00 0 84385 2016-04-05 19:45:34
49976 2016-03-19 18:56:05 Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... privat Angebot $5,900 test kombi 1992 automatik 150 80 150,000km 12 benzin audi nein 2016-03-19 00:00:00 0 36100 2016-04-07 06:16:44
49977 2016-03-31 18:37:18 Mercedes_Benz_C200_Cdi_W203 privat Angebot $5,500 control limousine 2003 manuell 116 c_klasse 150,000km 2 diesel mercedes_benz nein 2016-03-31 00:00:00 0 33739 2016-04-06 12:16:11
49978 2016-04-04 10:37:14 Mercedes_Benz_E_200_Classic privat Angebot $900 control limousine 1996 automatik 136 e_klasse 150,000km 9 benzin mercedes_benz ja 2016-04-04 00:00:00 0 24405 2016-04-06 12:44:20
49979 2016-03-20 18:38:40 Volkswagen_Polo_1.6_TDI_Style privat Angebot $11,000 test kleinwagen 2011 manuell 90 polo 70,000km 11 diesel volkswagen nein 2016-03-20 00:00:00 0 48455 2016-04-07 01:45:12
49980 2016-03-12 10:55:54 Ford_Escort_Turnier_16V privat Angebot $400 control kombi 1995 manuell 105 escort 125,000km 3 benzin ford NaN 2016-03-12 00:00:00 0 56218 2016-04-06 17:16:49
49981 2016-03-15 09:38:21 Opel_Astra_Kombi_mit_Anhaengerkupplung privat Angebot $2,000 control kombi 1998 manuell 115 astra 150,000km 12 benzin opel nein 2016-03-15 00:00:00 0 86859 2016-04-05 17:21:46
49982 2016-03-29 18:51:08 Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm privat Angebot $1,950 control kleinwagen 2004 manuell 0 fabia 90,000km 7 benzin skoda NaN 2016-03-29 00:00:00 0 45884 2016-03-29 18:51:08
49983 2016-03-06 12:43:04 Ford_focus_99 privat Angebot $600 test kleinwagen 1999 manuell 101 focus 150,000km 4 benzin ford NaN 2016-03-06 00:00:00 0 52477 2016-03-09 06:16:08
49984 2016-03-31 22:48:48 Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... privat Angebot $0 test NaN 2000 NaN 0 NaN 150,000km 0 NaN sonstige_autos NaN 2016-03-31 00:00:00 0 12103 2016-04-02 19:44:53
49985 2016-04-02 16:38:23 Verkaufe_meinen_vw_vento! privat Angebot $1,000 control NaN 1995 automatik 0 NaN 150,000km 0 benzin volkswagen NaN 2016-04-02 00:00:00 0 30900 2016-04-06 15:17:52
49986 2016-04-04 20:46:02 Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... privat Angebot $15,900 control limousine 2010 automatik 218 300c 125,000km 11 diesel chrysler nein 2016-04-04 00:00:00 0 73527 2016-04-06 23:16:00
49987 2016-03-22 20:47:27 Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... privat Angebot $21,990 control limousine 2013 manuell 150 a3 50,000km 11 diesel audi nein 2016-03-22 00:00:00 0 94362 2016-03-26 22:46:06
49988 2016-03-28 19:49:51 BMW_330_Ci privat Angebot $9,550 control coupe 2001 manuell 231 3er 150,000km 10 benzin bmw nein 2016-03-28 00:00:00 0 83646 2016-04-07 02:17:40
49989 2016-03-11 19:50:37 VW_Polo_zum_Ausschlachten_oder_Wiederaufbau privat Angebot $150 test kleinwagen 1997 manuell 0 polo 150,000km 5 benzin volkswagen ja 2016-03-11 00:00:00 0 21244 2016-03-12 10:17:55
49990 2016-03-21 19:54:19 Mercedes_Benz_A_200__BlueEFFICIENCY__Urban privat Angebot $17,500 test limousine 2012 manuell 156 a_klasse 30,000km 12 benzin mercedes_benz nein 2016-03-21 00:00:00 0 58239 2016-04-06 22:46:57
49991 2016-03-06 15:25:19 Kleinwagen privat Angebot $500 control NaN 2016 manuell 0 twingo 150,000km 0 benzin renault NaN 2016-03-06 00:00:00 0 61350 2016-03-06 18:24:19
49992 2016-03-10 19:37:38 Fiat_Grande_Punto_1.4_T_Jet_16V_Sport privat Angebot $4,800 control kleinwagen 2009 manuell 120 andere 125,000km 9 lpg fiat nein 2016-03-10 00:00:00 0 68642 2016-03-13 01:44:51
49993 2016-03-15 18:47:35 Audi_A3__1_8l__Silber;_schoenes_Fahrzeug privat Angebot $1,650 control kleinwagen 1997 manuell 0 NaN 150,000km 7 benzin audi NaN 2016-03-15 00:00:00 0 65203 2016-04-06 19:46:53
49994 2016-03-22 17:36:42 Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... privat Angebot $5,000 control kombi 2001 automatik 299 a6 150,000km 1 benzin audi nein 2016-03-22 00:00:00 0 46537 2016-04-06 08:16:39
49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon privat Angebot $24,900 control limousine 2011 automatik 239 q5 100,000km 1 diesel audi nein 2016-03-27 00:00:00 0 82131 2016-04-01 13:47:40
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot $1,980 control cabrio 1996 manuell 75 astra 150,000km 5 benzin opel nein 2016-03-28 00:00:00 0 44807 2016-04-02 14:18:02
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot $13,200 test cabrio 2014 automatik 69 500 5,000km 11 benzin fiat nein 2016-04-02 00:00:00 0 73430 2016-04-04 11:47:27
49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition privat Angebot $22,900 control kombi 2013 manuell 150 a3 40,000km 11 diesel audi nein 2016-03-08 00:00:00 0 35683 2016-04-05 16:45:07
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V privat Angebot $1,250 control limousine 1996 manuell 101 vectra 150,000km 1 benzin opel nein 2016-03-13 00:00:00 0 45897 2016-04-06 21:18:48

50000 rows × 20 columns

In [3]:
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
Out[3]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
  • The 'autos' dataframe has 20 columns of integer and object datatypes
  • The column names are in camel case

Cleaning Column names

Now, converting the case of column names to snakecase, and renaming some of the columns:

In [4]:
#copying the dataframe
autos_copy = autos.copy()
autos_columns = autos_copy.columns

#creating a dictionary to map new names to old names
mapping_dict = {
'dateCrawled':'date_crawled',
'name':'name',
'seller':'seller',
'offerType':'offer_type',
'price':'price',
'abtest':'abtest',
'vehicleType':'vehicle_type',
'yearOfRegistration':'registration_year',
'gearbox':'gearbox',
'powerPS':'power_ps',
'model':'model',
'odometer':'odometer',
'monthOfRegistration':'registration_month',
'fuelType':'fuel_type',
'brand':'brand',
'notRepairedDamage':'unrepaired_damage',
'dateCreated':'ad_created',
'nrOfPictures':'nr_of_pictures',
'postalCode':'postal_code',
'lastSeen':'last_seen'}

# Updating the column names.
autos_copy.columns = pd.Series(autos_columns).map(mapping_dict)
autos_copy.head()
Out[4]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
In [5]:
#alternative code for changing the column names
autos.rename(mapper = mapping_dict, axis = 1, inplace = True)
autos.head()
Out[5]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

We've made the following changes to the dataframe:

  • editing all the column names from camel case to snake case, which is the preferred case in python
  • renaming some of the column names to be more descriptive

Initial exploration and Cleaning

In this step, we are looking for:

  • text columns where all or almost all values are the same, which can be dropped
  • numerical data stored as text, which can be cleaned and converted.

The following methods are helpful for exploring the data:

  • DataFrame.describe() (with include='all' to get both categorical and numeric columns)
  • Series.value_counts()Series.head() if any columns need a closer look.
In [6]:
# Looking at descriptive statistics for all the columns
autos.describe(include = 'all')
Out[6]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-27 22:55:05 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

From the description, the columns 'price' and 'odometer' are numeric values but are stored as texts. The columns gearbox and fuel_type has some german words like, manuell and benzin, respectively.

We'll therefore explore the other german words in these values and translate them to english words:

In [7]:
#removing non-numeric characters from price and odometer columns
autos["price"] = autos["price"].str.replace("$", "").str.replace(",", "").astype(int)
autos["price"].head()
autos["odometer"] = autos["odometer"].str.replace("km","").str.replace(",", "").astype(int)

#renaming odometer column
autos.rename({"odometer": "odometer_km"}, inplace = True, axis = 1)
autos.head()
Out[7]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000 control bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500 control limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot 8990 test limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot 4350 control kleinwagen 2007 automatik 71 fortwo 70000 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot 1350 test kombi 2003 manuell 0 focus 150000 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
In [8]:
print(autos['gearbox'].unique())
print(autos['fuel_type'].unique())
print(autos['unrepaired_damage'].unique())
['manuell' 'automatik' nan]
['lpg' 'benzin' 'diesel' nan 'cng' 'hybrid' 'elektro' 'andere']
['nein' nan 'ja']
In [9]:
#creating a dictionary to map the german words to english words:
german_to_english = {
    'manuell': 'manual',
    'automatik': 'automatic',
    'benzin': 'benzine',
    'elektro': 'electro',
    'andere': 'other',
    'lpg': 'lpg',
    'diesel': 'diesel',
    'cng': 'cng',
    'hybrid': 'hybrid',
    'nein': 'no',
    'ja': 'yes'
}

#mapping the words to the columns:
autos['fuel_type'] = autos['fuel_type'].map(german_to_english)
autos['gearbox'] = autos['gearbox'].map(german_to_english)
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(german_to_english)
#updated values
print(autos['gearbox'].unique())
print(autos['fuel_type'].unique())
print(autos['unrepaired_damage'].unique())
['manual' 'automatic' nan]
['lpg' 'benzine' 'diesel' nan 'cng' 'hybrid' 'electro' 'other']
['no' nan 'yes']

Exploring the odometer and price columns

Here, we'll:

  • Analyze the columns using minimum and maximum values and look for any values that look unrealistically high or low (outliers) that we might want to remove.
  • We'll use:
    • Series.unique().shape to see how many unique values
    • Series.describe() to view min/max/median/mean etc
    • Series.value_counts(), with some variations:
      • chained to .head() if there are lots of values.
      • Because Series.value_counts() returns a series, we can use Series.sort_index() with ascending= True or False to view the highest and lowest values with their counts (can also chain to head() here).
    • When removing outliers, we can do df[(df["col"] > x ) & (df["col"] < y )], but it's more readable to use df[df["col"].between(x,y)]
In [10]:
#Total unique values for price and odometer_km columns
price_total_unique = autos["price"].unique().shape
odometer_total_unique = autos["odometer_km"].unique().shape
#print(price_total_unique, odometer_total_unique)

#Descriptive statistics for the columns
price_describe = autos["price"].describe()
odometer_describe = autos["odometer_km"].describe()
#print(price_describe)
#print(odometer_describe)

#Value counts
price_counts_asc = autos["price"].value_counts().sort_index(ascending = True).head()
print(price_counts_asc)
price_counts_desc = autos["price"].value_counts().sort_index(ascending = False).head()
print(price_counts_desc)

odometer_counts_asc = autos["odometer_km"].value_counts().sort_index(ascending = True).head()
print(odometer_counts_asc)
odometer_counts_desc = autos["odometer_km"].value_counts().sort_index(ascending = False).head()
print(odometer_counts_desc)
0    1421
1     156
2       3
3       1
5       2
Name: price, dtype: int64
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
Name: price, dtype: int64
5000     967
10000    264
20000    784
30000    789
40000    819
Name: odometer_km, dtype: int64
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
Name: odometer_km, dtype: int64
In [11]:
autos = autos[autos["price"].between(1,30000)]
autos = autos[autos["odometer_km"].between(5000,10000)]
print(autos.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 997 entries, 22 to 49997
Data columns (total 20 columns):
date_crawled          997 non-null object
name                  997 non-null object
seller                997 non-null object
offer_type            997 non-null object
price                 997 non-null int64
abtest                997 non-null object
vehicle_type          726 non-null object
registration_year     997 non-null int64
gearbox               767 non-null object
power_ps              997 non-null int64
model                 842 non-null object
odometer_km           997 non-null int64
registration_month    997 non-null int64
fuel_type             725 non-null object
brand                 997 non-null object
unrepaired_damage     583 non-null object
ad_created            997 non-null object
nr_of_pictures        997 non-null int64
postal_code           997 non-null int64
last_seen             997 non-null object
dtypes: int64(7), object(13)
memory usage: 163.6+ KB
None

Exploring the date columns

In [12]:
#percentage distribution of values in date_crawled column
dates_percentage = autos['date_crawled'].str[:10].value_counts(
    normalize = True, dropna = False).sort_index(ascending = False)
print(dates_percentage)
2016-04-07    0.002006
2016-04-06    0.003009
2016-04-05    0.010030
2016-04-04    0.033099
2016-04-03    0.041123
2016-04-02    0.035105
2016-04-01    0.027081
2016-03-31    0.035105
2016-03-30    0.033099
2016-03-29    0.035105
2016-03-28    0.039117
2016-03-27    0.031093
2016-03-26    0.033099
2016-03-25    0.029087
2016-03-24    0.033099
2016-03-23    0.028084
2016-03-22    0.038114
2016-03-21    0.030090
2016-03-20    0.037111
2016-03-19    0.037111
2016-03-18    0.015045
2016-03-17    0.041123
2016-03-16    0.041123
2016-03-15    0.037111
2016-03-14    0.039117
2016-03-13    0.017051
2016-03-12    0.027081
2016-03-11    0.034102
2016-03-10    0.032096
2016-03-09    0.029087
2016-03-08    0.021063
2016-03-07    0.036108
2016-03-06    0.015045
2016-03-05    0.024072
Name: date_crawled, dtype: float64
In [13]:
#percentage distribution of values in ad_created column
ad_percentage = autos['ad_created'].str[:10].value_counts(
    normalize = True, dropna = False).sort_index(ascending = False)
print(ad_percentage)
2016-04-07    0.002006
2016-04-06    0.003009
2016-04-05    0.009027
2016-04-04    0.034102
2016-04-03    0.040120
2016-04-02    0.035105
2016-04-01    0.028084
2016-03-31    0.034102
2016-03-30    0.034102
2016-03-29    0.034102
2016-03-28    0.040120
2016-03-27    0.031093
2016-03-26    0.032096
2016-03-25    0.029087
2016-03-24    0.034102
2016-03-23    0.028084
2016-03-22    0.036108
2016-03-21    0.031093
2016-03-20    0.037111
2016-03-19    0.036108
2016-03-18    0.016048
2016-03-17    0.040120
2016-03-16    0.041123
2016-03-15    0.036108
2016-03-14    0.040120
2016-03-13    0.016048
2016-03-12    0.027081
2016-03-11    0.036108
2016-03-10    0.029087
2016-03-09    0.031093
2016-03-08    0.022066
2016-03-07    0.035105
2016-03-06    0.016048
2016-03-05    0.021063
2016-03-04    0.002006
2016-03-03    0.001003
2016-02-23    0.001003
Name: ad_created, dtype: float64
In [14]:
#percentage distribution of values in last_seen column
lastseen_percentage = autos['last_seen'].str[:10].value_counts(
    normalize = True, dropna = False).sort_index(ascending = False).head(3)
print(lastseen_percentage)
2016-04-07    0.139418
2016-04-06    0.223671
2016-04-05    0.136409
Name: last_seen, dtype: float64

It is also important to convert the dates to be uniform numeric data, so "2016-03-21" becomes the integer 20160321

In [15]:
#converting dates to be uniform numerical data
autos['date_crawled'] = autos['date_crawled'].str[:10].str.replace("-","").astype(int)
autos['ad_created'] = autos['ad_created'].str[:10].str.replace("-","").astype(int)
autos['last_seen'] = autos['last_seen'].str[:10].str.replace("-","").astype(int)

autos.head(2)
Out[15]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
22 20160328 MINI_Cooper_S_Cabrio privat Angebot 25450 control cabrio 2015 manual 184 cooper 10000 1 benzine mini no 20160328 0 44789 20160401
52 20160325 Senator_A_3.0E_Karosserie_restauriert_m._viele... privat Angebot 3500 test limousine 1985 NaN 0 andere 5000 0 benzine opel no 20160325 0 63500 20160407
In [16]:
#distribution of the registration year column
autos["registration_year"].describe()
Out[16]:
count     997.000000
mean     2060.701103
std       599.739740
min      1000.000000
25%      1997.000000
50%      2003.000000
75%      2015.000000
max      9999.000000
Name: registration_year, dtype: float64

The registration year column has some odd values, minimum value of 1000 and maximum 0f 9999, which are unrealistic considering the years. We'll count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic

In [17]:
autos = autos[autos['registration_year'].between(1900,2016)]
autos["registration_year"].value_counts(normalize = True)
#autos[autos["price"].between(1,30000)]
Out[17]:
2015    0.151872
2000    0.083422
2016    0.064171
2014    0.062032
1995    0.044920
2005    0.044920
1998    0.039572
1999    0.036364
2001    0.036364
2002    0.035294
1997    0.032086
1990    0.025668
2013    0.023529
2004    0.023529
1996    0.021390
1994    0.019251
2007    0.018182
2006    0.017112
2003    0.016043
2011    0.014973
2009    0.013904
2010    0.013904
1970    0.011765
1980    0.011765
1985    0.009626
2008    0.009626
2012    0.009626
1992    0.008556
1988    0.007487
1978    0.007487
          ...   
1972    0.005348
1971    0.004278
1960    0.004278
1968    0.004278
1983    0.004278
1975    0.004278
1991    0.003209
1964    0.003209
1966    0.003209
1910    0.003209
1984    0.003209
1993    0.003209
1987    0.003209
1989    0.003209
1981    0.003209
1937    0.002139
1959    0.002139
1950    0.002139
1979    0.001070
1927    0.001070
1929    0.001070
1931    0.001070
1934    0.001070
1956    0.001070
1938    0.001070
1939    0.001070
1961    0.001070
1962    0.001070
1965    0.001070
1969    0.001070
Name: registration_year, Length: 62, dtype: float64

The percentage distribution of car sales, per the registration year increased as the years progressed

Exploring price by brand

Here, we'll explore variations across different car brands using aggregation.

In [18]:
#exploring unique values of brand column
unique_brands = autos['brand'].unique()
brand_pctg = autos['brand'].value_counts(normalize = True)
print(unique_brands)
print(brand_pctg)
['mini' 'opel' 'ford' 'audi' 'honda' 'volkswagen' 'hyundai' 'jaguar'
 'toyota' 'seat' 'sonstige_autos' 'renault' 'bmw' 'trabant' 'skoda'
 'mercedes_benz' 'porsche' 'chrysler' 'peugeot' 'smart' 'fiat' 'volvo'
 'subaru' 'chevrolet' 'lancia' 'alfa_romeo' 'mitsubishi' 'daewoo' 'mazda'
 'kia' 'dacia' 'citroen' 'lada' 'nissan' 'suzuki' 'daihatsu' 'land_rover']
volkswagen        0.205348
opel              0.131551
bmw               0.087701
ford              0.070588
sonstige_autos    0.064171
mercedes_benz     0.059893
renault           0.050267
audi              0.050267
fiat              0.036364
skoda             0.024599
trabant           0.017112
seat              0.016043
peugeot           0.016043
nissan            0.016043
mazda             0.013904
mini              0.012834
chevrolet         0.012834
toyota            0.011765
hyundai           0.011765
citroen           0.010695
mitsubishi        0.009626
alfa_romeo        0.009626
porsche           0.008556
honda             0.007487
smart             0.006417
chrysler          0.006417
suzuki            0.005348
daihatsu          0.005348
volvo             0.004278
dacia             0.004278
kia               0.004278
lada              0.002139
daewoo            0.002139
lancia            0.001070
jaguar            0.001070
land_rover        0.001070
subaru            0.001070
Name: brand, dtype: float64

Aggregating brands that have above 2% of the total values:

In [19]:
filtered_brands = brand_pctg[brand_pctg >= 0.02].index
print(filtered_brands)
#creating dictionary to hold the aggregate data:
brand_meanprice = {}

for brand in filtered_brands:
    selected_brand = autos[autos['brand'] == brand]
    mean_price = selected_brand['price'].mean()
    brand_meanprice[brand] = int(mean_price)

print(brand_meanprice)
Index(['volkswagen', 'opel', 'bmw', 'ford', 'sonstige_autos', 'mercedes_benz',
       'renault', 'audi', 'fiat', 'skoda'],
      dtype='object')
{'ford': 7725, 'mercedes_benz': 6064, 'opel': 5250, 'renault': 5025, 'bmw': 6372, 'sonstige_autos': 5440, 'skoda': 9728, 'audi': 10880, 'volkswagen': 6438, 'fiat': 6038}

Storing aggregate data in a dataframe

For the selected brands, we'll use aggregation to understand the average mileage and if there's any link with their mean prices. However, displaying both aggregated series objects and visually comparing them has the following limitations:

  • it's difficult to compare more than two aggregate series objects if we want to extend to more columns
  • we can't compare more than a few rows from each series object
  • we can only sort by the index (brand name) of both series objects so we can easily make visual comparisons

So, we can combine the data from both series objects into a single dataframe (with a shared index) and display the dataframe directly, using these methods:

In [20]:
#calculating mean mileage for the selected brands
brand_meanmileage = {}

for brand in filtered_brands:
    chosen_brand = autos[autos['brand'] == brand]
    mean_mileage = chosen_brand['odometer_km'].mean()    
    brand_meanmileage[brand] = int(mean_mileage)
    
print(brand_meanmileage)
print(brand_meanprice)

#converting the dictionaries to series o bjects
meanmileage_series = pd.Series(brand_meanmileage)
meanprice_series = pd. Series(brand_meanprice)

#creating a dataframe from the series objects
price_and_mileage = pd.DataFrame(meanprice_series, columns = ['mean_price'])
price_and_mileage.insert(1, 'mean_mileage',meanmileage_series) 
print(price_and_mileage)
price_and_mileage.sort_values(by = ['mean_price'],ascending = False)
{'ford': 6136, 'mercedes_benz': 5803, 'opel': 5975, 'renault': 5744, 'bmw': 5548, 'sonstige_autos': 6416, 'skoda': 6956, 'audi': 7021, 'volkswagen': 6041, 'fiat': 6029}
{'ford': 7725, 'mercedes_benz': 6064, 'opel': 5250, 'renault': 5025, 'bmw': 6372, 'sonstige_autos': 5440, 'skoda': 9728, 'audi': 10880, 'volkswagen': 6438, 'fiat': 6038}
                mean_price  mean_mileage
audi                 10880          7021
bmw                   6372          5548
fiat                  6038          6029
ford                  7725          6136
mercedes_benz         6064          5803
opel                  5250          5975
renault               5025          5744
skoda                 9728          6956
sonstige_autos        5440          6416
volkswagen            6438          6041
Out[20]:
mean_price mean_mileage
audi 10880 7021
skoda 9728 6956
ford 7725 6136
volkswagen 6438 6041
bmw 6372 5548
mercedes_benz 6064 5803
fiat 6038 6029
sonstige_autos 5440 6416
opel 5250 5975
renault 5025 5744

From the outputs, it is clear that the higher the mean price, the higher the mean mileage.

In [21]:
autos.head(3)
Out[21]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
22 20160328 MINI_Cooper_S_Cabrio privat Angebot 25450 control cabrio 2015 manual 184 cooper 10000 1 benzine mini no 20160328 0 44789 20160401
52 20160325 Senator_A_3.0E_Karosserie_restauriert_m._viele... privat Angebot 3500 test limousine 1985 NaN 0 andere 5000 0 benzine opel no 20160325 0 63500 20160407
102 20160322 Ford_Ka_dunkel_blau privat Angebot 320 control kleinwagen 2004 manual 0 ka 5000 6 benzine ford yes 20160322 0 24109 20160402

Finally, we need to do the following analyses:

  • Find the most common brand/model combinations
  • Find how much cheaper are cars with damage than their non-damaged counterparts
In [22]:
#finding the most common brand/model combinations
common_brand_model = autos.groupby('brand').model.value_counts().sort_values(ascending = False).head()
common_brand_model
Out[22]:
brand       model 
volkswagen  golf      62
bmw         3er       40
opel        corsa     33
volkswagen  polo      32
opel        andere    27
Name: model, dtype: int64

The most common brand/model combination from the dataframe is golf by volkswagen.

Price of cars with damage vs those without

In [23]:
#counting values for each of the categories:
autos['unrepaired_damage'].value_counts()
Out[23]:
no     467
yes     93
Name: unrepaired_damage, dtype: int64

Those without damages are almost 5 times more than those with damages. Next, we'll calculate their average prices:

In [24]:
autos.groupby('unrepaired_damage')['price'].mean()
Out[24]:
unrepaired_damage
no     10359.284797
yes     1812.860215
Name: price, dtype: float64

It is evident that cars without damages have the highest prices on average

Conclusion

In this project, we(with the help of dataquest instructions) have cleaned and explored data on car sales. We have removed outliers from the data, converted some of the columns to their correct data types, removing unnecessary values. We have also conducted several analyses including analysing the brands by average price, comparing it to the mileage.