Used cars listing on ebay: how to clean up the data and analyze the results.¶

The aim of this project is to clean a large dataset and analyze the information using pandasand numpy libraries of python.

The dataset comes from *eBay Kleinanzeigen* classified section on ebay, you can download the full version in the this link.

The data dictionary provided with data is as follows:

dataCrawled - When this ad was first crawled. All field-values are taken from this date.
name - Name of the car.
seller - Whether the seller is private or a dealer.
offerType - The type of listing
price - The price on the ad to sell the car.
abtest - Whether the listing is included in an A/B test.
vehicleType - The vehicle Type.
yearOfRegistration - The year in which the car was first registered.
gearbox - The transmission type.
powerPS - The power of the car in PS.
model - he car model name.
kilometer - How many kilometers the car has driven.
monthOfRegistration - The month in which the car was first registered.
fuelType - What type of fuel the car uses.
brand - The brand of the car.
notRepairedDamage - If the car has a damage which is not yet repaired.
dateCreated - The date on which the eBay listing was created.
nrOfPictures - The number of pictures in the ad.
postalCode - The postal code for the location of the vehicle.
lastSeenOnline - When the crawler saw this ad last online.

In [1]:

import pandas as pd
import numpy as np

# To read the csv file we use the .read_csv method
autos = pd.read_csv('autos(1).csv', encoding = 'Latin-1')

In [2]:

autos

Out[2]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	nrOfPictures	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	0	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	0	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	0	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	0	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	0	39218	2016-04-01 14:38:50
5	2016-03-21 13:47:45	Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...	privat	Angebot	$7,900	test	bus	2006	automatik	150	voyager	150,000km	4	diesel	chrysler	NaN	2016-03-21 00:00:00	0	22962	2016-04-06 09:45:21
6	2016-03-20 17:55:21	VW_Golf_III_GT_Special_Electronic_Green_Metall...	privat	Angebot	$300	test	limousine	1995	manuell	90	golf	150,000km	8	benzin	volkswagen	NaN	2016-03-20 00:00:00	0	31535	2016-03-23 02:48:59
7	2016-03-16 18:55:19	Golf_IV_1.9_TDI_90PS	privat	Angebot	$1,990	control	limousine	1998	manuell	90	golf	150,000km	12	diesel	volkswagen	nein	2016-03-16 00:00:00	0	53474	2016-04-07 03:17:32
8	2016-03-22 16:51:34	Seat_Arosa	privat	Angebot	$250	test	NaN	2000	manuell	0	arosa	150,000km	10	NaN	seat	nein	2016-03-22 00:00:00	0	7426	2016-03-26 18:18:10
9	2016-03-16 13:47:02	Renault_Megane_Scenic_1.6e_RT_Klimaanlage	privat	Angebot	$590	control	bus	1997	manuell	90	megane	150,000km	7	benzin	renault	nein	2016-03-16 00:00:00	0	15749	2016-04-06 10:46:35
10	2016-03-15 01:41:36	VW_Golf_Tuning_in_siber/grau	privat	Angebot	$999	test	NaN	2017	manuell	90	NaN	150,000km	4	benzin	volkswagen	nein	2016-03-14 00:00:00	0	86157	2016-04-07 03:16:21
11	2016-03-16 18:45:34	Mercedes_A140_Motorschaden	privat	Angebot	$350	control	NaN	2000	NaN	0	NaN	150,000km	0	benzin	mercedes_benz	NaN	2016-03-16 00:00:00	0	17498	2016-03-16 18:45:34
12	2016-03-31 19:48:22	Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan...	privat	Angebot	$5,299	control	kleinwagen	2010	automatik	71	fortwo	50,000km	9	benzin	smart	nein	2016-03-31 00:00:00	0	34590	2016-04-06 14:17:52
13	2016-03-23 10:48:32	Audi_A3_1.6_tuning	privat	Angebot	$1,350	control	limousine	1999	manuell	101	a3	150,000km	11	benzin	audi	nein	2016-03-23 00:00:00	0	12043	2016-04-01 14:17:13
14	2016-03-23 11:50:46	Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver...	privat	Angebot	$3,999	test	kleinwagen	2007	manuell	75	clio	150,000km	9	benzin	renault	NaN	2016-03-23 00:00:00	0	81737	2016-04-01 15:46:47
15	2016-04-01 12:06:20	Corvette_C3_Coupe_T_Top_Crossfire_Injection	privat	Angebot	$18,900	test	coupe	1982	automatik	203	NaN	80,000km	6	benzin	sonstige_autos	nein	2016-04-01 00:00:00	0	61276	2016-04-02 21:10:48
16	2016-03-16 14:59:02	Opel_Vectra_B_Kombi	privat	Angebot	$350	test	kombi	1999	manuell	101	vectra	150,000km	5	benzin	opel	nein	2016-03-16 00:00:00	0	57299	2016-03-18 05:29:37
17	2016-03-29 11:46:22	Volkswagen_Scirocco_2_G60	privat	Angebot	$5,500	test	coupe	1990	manuell	205	scirocco	150,000km	6	benzin	volkswagen	nein	2016-03-29 00:00:00	0	74821	2016-04-05 20:46:26
18	2016-03-26 19:57:44	Verkaufen_mein_bmw_e36_320_i_touring	privat	Angebot	$300	control	bus	1995	manuell	150	3er	150,000km	0	benzin	bmw	NaN	2016-03-26 00:00:00	0	54329	2016-04-02 12:16:41
19	2016-03-17 13:36:21	mazda_tribute_2.0_mit_gas_und_tuev_neu_2018	privat	Angebot	$4,150	control	suv	2004	manuell	124	andere	150,000km	2	lpg	mazda	nein	2016-03-17 00:00:00	0	40878	2016-03-17 14:45:58
20	2016-03-05 19:57:31	Audi_A4_Avant_1.9_TDI_6_GangAHKKlimatronik...	privat	Angebot	$3,500	test	kombi	2003	manuell	131	a4	150,000km	5	diesel	audi	NaN	2016-03-05 00:00:00	0	53913	2016-03-07 05:46:46
21	2016-03-06 19:07:10	Porsche_911_Carrera_4S_Cabrio	privat	Angebot	$41,500	test	cabrio	2004	manuell	320	911	150,000km	4	benzin	porsche	nein	2016-03-06 00:00:00	0	65428	2016-04-05 23:46:19
22	2016-03-28 20:50:54	MINI_Cooper_S_Cabrio	privat	Angebot	$25,450	control	cabrio	2015	manuell	184	cooper	10,000km	1	benzin	mini	nein	2016-03-28 00:00:00	0	44789	2016-04-01 06:45:30
23	2016-03-10 19:55:34	Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima	privat	Angebot	$7,999	control	bus	2010	manuell	120	NaN	150,000km	2	diesel	peugeot	nein	2016-03-10 00:00:00	0	30900	2016-03-17 08:45:17
24	2016-04-03 11:57:02	BMW_535i_xDrive_Sport_Aut.	privat	Angebot	$48,500	control	limousine	2014	automatik	306	5er	30,000km	12	benzin	bmw	nein	2016-04-03 00:00:00	0	22547	2016-04-07 13:16:50
25	2016-03-21 21:56:18	Ford_escort_kombi_an_bastler_mit_ghia_ausstattung	privat	Angebot	$90	control	kombi	1996	manuell	116	NaN	150,000km	4	benzin	ford	ja	2016-03-21 00:00:00	0	27574	2016-04-01 05:16:49
26	2016-04-03 22:46:28	Volkswagen_Polo_Fox	privat	Angebot	$777	control	kleinwagen	1992	manuell	54	polo	125,000km	2	benzin	volkswagen	nein	2016-04-03 00:00:00	0	38110	2016-04-05 23:46:48
27	2016-03-27 18:45:01	Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE	privat	Angebot	$0	control	NaN	2005	NaN	0	NaN	150,000km	0	NaN	ford	NaN	2016-03-27 00:00:00	0	66701	2016-03-27 18:45:01
28	2016-03-19 21:56:19	MINI_Cooper_D	privat	Angebot	$5,250	control	kleinwagen	2007	manuell	110	cooper	150,000km	7	diesel	mini	ja	2016-03-19 00:00:00	0	15745	2016-04-07 14:58:48
29	2016-04-02 12:45:44	Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz...	privat	Angebot	$4,999	test	kombi	2004	automatik	204	e_klasse	150,000km	10	diesel	mercedes_benz	nein	2016-04-02 00:00:00	0	47638	2016-04-02 12:45:44
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
49970	2016-03-21 22:47:37	c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp...	privat	Angebot	$15,800	control	bus	2010	automatik	136	c4	60,000km	4	diesel	citroen	nein	2016-03-21 00:00:00	0	14947	2016-04-07 04:17:34
49971	2016-03-29 14:54:12	W.Lupo_1.0	privat	Angebot	$950	test	kleinwagen	2001	manuell	50	lupo	150,000km	4	benzin	volkswagen	nein	2016-03-29 00:00:00	0	65197	2016-03-29 20:41:51
49972	2016-03-26 22:25:23	Mercedes_Benz_Vito_115_CDI_Extralang_Aut.	privat	Angebot	$3,300	control	bus	2004	automatik	150	vito	150,000km	10	diesel	mercedes_benz	ja	2016-03-26 00:00:00	0	65326	2016-03-28 11:28:18
49973	2016-03-27 05:32:39	Mercedes_Benz_SLK_200_Kompressor	privat	Angebot	$6,000	control	cabrio	2004	manuell	163	slk	150,000km	11	benzin	mercedes_benz	nein	2016-03-27 00:00:00	0	53567	2016-03-27 08:25:24
49974	2016-03-20 10:52:31	Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing...	privat	Angebot	$0	control	cabrio	1983	manuell	70	golf	150,000km	2	benzin	volkswagen	nein	2016-03-20 00:00:00	0	8209	2016-03-27 19:48:16
49975	2016-03-27 20:51:39	Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort	privat	Angebot	$9,700	control	kleinwagen	2012	automatik	88	jazz	100,000km	11	hybrid	honda	nein	2016-03-27 00:00:00	0	84385	2016-04-05 19:45:34
49976	2016-03-19 18:56:05	Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga...	privat	Angebot	$5,900	test	kombi	1992	automatik	150	80	150,000km	12	benzin	audi	nein	2016-03-19 00:00:00	0	36100	2016-04-07 06:16:44
49977	2016-03-31 18:37:18	Mercedes_Benz_C200_Cdi_W203	privat	Angebot	$5,500	control	limousine	2003	manuell	116	c_klasse	150,000km	2	diesel	mercedes_benz	nein	2016-03-31 00:00:00	0	33739	2016-04-06 12:16:11
49978	2016-04-04 10:37:14	Mercedes_Benz_E_200_Classic	privat	Angebot	$900	control	limousine	1996	automatik	136	e_klasse	150,000km	9	benzin	mercedes_benz	ja	2016-04-04 00:00:00	0	24405	2016-04-06 12:44:20
49979	2016-03-20 18:38:40	Volkswagen_Polo_1.6_TDI_Style	privat	Angebot	$11,000	test	kleinwagen	2011	manuell	90	polo	70,000km	11	diesel	volkswagen	nein	2016-03-20 00:00:00	0	48455	2016-04-07 01:45:12
49980	2016-03-12 10:55:54	Ford_Escort_Turnier_16V	privat	Angebot	$400	control	kombi	1995	manuell	105	escort	125,000km	3	benzin	ford	NaN	2016-03-12 00:00:00	0	56218	2016-04-06 17:16:49
49981	2016-03-15 09:38:21	Opel_Astra_Kombi_mit_Anhaengerkupplung	privat	Angebot	$2,000	control	kombi	1998	manuell	115	astra	150,000km	12	benzin	opel	nein	2016-03-15 00:00:00	0	86859	2016-04-05 17:21:46
49982	2016-03-29 18:51:08	Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm	privat	Angebot	$1,950	control	kleinwagen	2004	manuell	0	fabia	90,000km	7	benzin	skoda	NaN	2016-03-29 00:00:00	0	45884	2016-03-29 18:51:08
49983	2016-03-06 12:43:04	Ford_focus_99	privat	Angebot	$600	test	kleinwagen	1999	manuell	101	focus	150,000km	4	benzin	ford	NaN	2016-03-06 00:00:00	0	52477	2016-03-09 06:16:08
49984	2016-03-31 22:48:48	Student_sucht_ein__Anfaengerauto___ab_2000_BJ_...	privat	Angebot	$0	test	NaN	2000	NaN	0	NaN	150,000km	0	NaN	sonstige_autos	NaN	2016-03-31 00:00:00	0	12103	2016-04-02 19:44:53
49985	2016-04-02 16:38:23	Verkaufe_meinen_vw_vento!	privat	Angebot	$1,000	control	NaN	1995	automatik	0	NaN	150,000km	0	benzin	volkswagen	NaN	2016-04-02 00:00:00	0	30900	2016-04-06 15:17:52
49986	2016-04-04 20:46:02	Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst...	privat	Angebot	$15,900	control	limousine	2010	automatik	218	300c	125,000km	11	diesel	chrysler	nein	2016-04-04 00:00:00	0	73527	2016-04-06 23:16:00
49987	2016-03-22 20:47:27	Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__...	privat	Angebot	$21,990	control	limousine	2013	manuell	150	a3	50,000km	11	diesel	audi	nein	2016-03-22 00:00:00	0	94362	2016-03-26 22:46:06
49988	2016-03-28 19:49:51	BMW_330_Ci	privat	Angebot	$9,550	control	coupe	2001	manuell	231	3er	150,000km	10	benzin	bmw	nein	2016-03-28 00:00:00	0	83646	2016-04-07 02:17:40
49989	2016-03-11 19:50:37	VW_Polo_zum_Ausschlachten_oder_Wiederaufbau	privat	Angebot	$150	test	kleinwagen	1997	manuell	0	polo	150,000km	5	benzin	volkswagen	ja	2016-03-11 00:00:00	0	21244	2016-03-12 10:17:55
49990	2016-03-21 19:54:19	Mercedes_Benz_A_200__BlueEFFICIENCY__Urban	privat	Angebot	$17,500	test	limousine	2012	manuell	156	a_klasse	30,000km	12	benzin	mercedes_benz	nein	2016-03-21 00:00:00	0	58239	2016-04-06 22:46:57
49991	2016-03-06 15:25:19	Kleinwagen	privat	Angebot	$500	control	NaN	2016	manuell	0	twingo	150,000km	0	benzin	renault	NaN	2016-03-06 00:00:00	0	61350	2016-03-06 18:24:19
49992	2016-03-10 19:37:38	Fiat_Grande_Punto_1.4_T_Jet_16V_Sport	privat	Angebot	$4,800	control	kleinwagen	2009	manuell	120	andere	125,000km	9	lpg	fiat	nein	2016-03-10 00:00:00	0	68642	2016-03-13 01:44:51
49993	2016-03-15 18:47:35	Audi_A3__1_8l__Silber;_schoenes_Fahrzeug	privat	Angebot	$1,650	control	kleinwagen	1997	manuell	0	NaN	150,000km	7	benzin	audi	NaN	2016-03-15 00:00:00	0	65203	2016-04-06 19:46:53
49994	2016-03-22 17:36:42	Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc...	privat	Angebot	$5,000	control	kombi	2001	automatik	299	a6	150,000km	1	benzin	audi	nein	2016-03-22 00:00:00	0	46537	2016-04-06 08:16:39
49995	2016-03-27 14:38:19	Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon	privat	Angebot	$24,900	control	limousine	2011	automatik	239	q5	100,000km	1	diesel	audi	nein	2016-03-27 00:00:00	0	82131	2016-04-01 13:47:40
49996	2016-03-28 10:50:25	Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...	privat	Angebot	$1,980	control	cabrio	1996	manuell	75	astra	150,000km	5	benzin	opel	nein	2016-03-28 00:00:00	0	44807	2016-04-02 14:18:02
49997	2016-04-02 14:44:48	Fiat_500_C_1.2_Dualogic_Lounge	privat	Angebot	$13,200	test	cabrio	2014	automatik	69	500	5,000km	11	benzin	fiat	nein	2016-04-02 00:00:00	0	73430	2016-04-04 11:47:27
49998	2016-03-08 19:25:42	Audi_A3_2.0_TDI_Sportback_Ambition	privat	Angebot	$22,900	control	kombi	2013	manuell	150	a3	40,000km	11	diesel	audi	nein	2016-03-08 00:00:00	0	35683	2016-04-05 16:45:07
49999	2016-03-14 00:42:12	Opel_Vectra_1.6_16V	privat	Angebot	$1,250	control	limousine	1996	manuell	101	vectra	150,000km	1	benzin	opel	nein	2016-03-13 00:00:00	0	45897	2016-04-06 21:18:48

50000 rows × 20 columns

Information of the dataset and first rows¶

At first glance we have the following information of the dataset using the .info method:

We have 20 columns, most of the strings
the columns vehicle type, gear box, model, fueltype and nonreparied damage have null values

In [3]:

autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

Another important observation is that the column names use capitalizacion style named camelcase, python prefers to work with snakecase. We can see this using .head() method to see the first results.

In [4]:

autos.head()

Out[4]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

Converting the column names to snakecase¶

We are going to convert the names case so we can have a more workable case for the dataset.

In [5]:

columns = autos.columns
columns

Out[5]:

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

Manually I made the change of the names, following this criteria:¶

From camelcase to snakecase
yearOfRegistration to registration_year
monthOfRegistration to registration_month
notRepairedDamage to unrepaired_damage
dateCreated to ad_created

In [6]:

new_names= ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer','registration_month','fuel_type','brand',
       'unreparied_damage','ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']

autos.columns = new_names
autos.columns

Out[6]:

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unreparied_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

Now we can see the changes in the columns looking the results with the .head method

In [7]:

autos.head()

Out[7]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unreparied_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

Data exploration using special methods¶

In this step we are going to explore the data in detail, first using .describe method to see some statistical results and look what kind of objects we have in the data.

In [8]:

autos.describe(include='all')

Out[8]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unreparied_damage	ad_created	nr_of_pictures	postal_code	last_seen
count	50000	50000	50000	50000	50000	50000	44905	50000.000000	47320	50000.000000	47242	50000	50000.000000	45518	50000	40171	50000	50000.0	50000.000000	50000
unique	48213	38754	2	2	2357	2	8	NaN	2	NaN	245	13	NaN	7	40	2	76	NaN	NaN	39481
top	2016-03-27 22:55:05	Ford_Fiesta	privat	Angebot	$0	test	limousine	NaN	manuell	NaN	golf	150,000km	NaN	benzin	volkswagen	nein	2016-04-03 00:00:00	NaN	NaN	2016-04-07 06:17:27
freq	3	78	49999	49999	1421	25756	12859	NaN	36993	NaN	4024	32424	NaN	30107	10687	35232	1946	NaN	NaN	8
mean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2005.073280	NaN	116.355920	NaN	NaN	5.723360	NaN	NaN	NaN	NaN	0.0	50813.627300	NaN
std	NaN	NaN	NaN	NaN	NaN	NaN	NaN	105.712813	NaN	209.216627	NaN	NaN	3.711984	NaN	NaN	NaN	NaN	0.0	25779.747957	NaN
min	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1000.000000	NaN	0.000000	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	0.0	1067.000000	NaN
25%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1999.000000	NaN	70.000000	NaN	NaN	3.000000	NaN	NaN	NaN	NaN	0.0	30451.000000	NaN
50%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2003.000000	NaN	105.000000	NaN	NaN	6.000000	NaN	NaN	NaN	NaN	0.0	49577.000000	NaN
75%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2008.000000	NaN	150.000000	NaN	NaN	9.000000	NaN	NaN	NaN	NaN	0.0	71540.000000	NaN
max	NaN	NaN	NaN	NaN	NaN	NaN	NaN	9999.000000	NaN	17700.000000	NaN	NaN	12.000000	NaN	NaN	NaN	NaN	0.0	99998.000000	NaN

The column nr_of_pictures have mostly 0 value, so we can remove the column without afecting the data
The registration_year have a min value of 1000, it's really difficult to find a car of the year 1000 now a days, so we need to check that.
The registration_month have a min value of 0, it's not a valid month so we need to check that too
The power_ps column have min 0 value, is really dificult to find a car with 0 power, so we need to check that.
price and odometer are registered as strings, we need to convert to number so we can use the data.
The columns seller, offer_type and nr_of_pictures have unique values, so we can drop this columns without affecting the data.

So we need to drop the Columns that have the same values, we are not going to use the columns at this moment.

In [9]:

autos = autos.drop(["seller","offer_type","nr_of_pictures"], axis=1)

Converting Price column to integers¶

Now we need to clean the data in the 'price' column, and also convert this data in integers for further calculus, the steps are:

Identify which kind of object we have in the column
Identify which symbols we need to replace
Convert to int using astype method.
Verify that the results are according the expected using .head()

In [10]:

auto_desc = autos["price"].describe()
auto_desc

Out[10]:

count     50000
unique     2357
top          $0
freq       1421
Name: price, dtype: object

In [11]:

# We need to replace the $ symbol and , separator with str.repalce
autos["price"] = autos["price"].str.replace("$","").str.replace(",","")
# Once we replace we can convert to int with .astype method
autos["price"] = autos["price"].astype(int)

In [12]:

autos["price"].head()

Out[12]:

0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64

Converting Odometer to integer type¶

The steps are similar with prices, but in this case we are going to use .rename()method at the to convert the name from odometer to odometer_km, so the person that will use the dataset could know which units use the odometer.

In [13]:

autos["odometer"].value_counts()

Out[13]:

150,000km    32424
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer, dtype: int64

In [14]:

# We need to replace the $ symbol and , separator with str.repalce
autos["odometer"] = autos["odometer"].str.replace("km","").str.replace(",","")
# Once we replace we can convert to int with .astype method
autos["odometer"] = autos["odometer"].astype(int)

In [15]:

autos["odometer"].head()

Out[15]:

0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer, dtype: int64

In [16]:

#With the rename method we need to indicate the previous name ant the new name
autos.rename(columns={"odometer":"odometer_km"}, inplace=True)
autos.describe(include = 'all')

Out[16]:

	date_crawled	name	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer_km	registration_month	fuel_type	brand	unreparied_damage	ad_created	postal_code	last_seen
count	50000	50000	5.000000e+04	50000	44905	50000.000000	47320	50000.000000	47242	50000.000000	50000.000000	45518	50000	40171	50000	50000.000000	50000
unique	48213	38754	NaN	2	8	NaN	2	NaN	245	NaN	NaN	7	40	2	76	NaN	39481
top	2016-03-27 22:55:05	Ford_Fiesta	NaN	test	limousine	NaN	manuell	NaN	golf	NaN	NaN	benzin	volkswagen	nein	2016-04-03 00:00:00	NaN	2016-04-07 06:17:27
freq	3	78	NaN	25756	12859	NaN	36993	NaN	4024	NaN	NaN	30107	10687	35232	1946	NaN	8
mean	NaN	NaN	9.840044e+03	NaN	NaN	2005.073280	NaN	116.355920	NaN	125732.700000	5.723360	NaN	NaN	NaN	NaN	50813.627300	NaN
std	NaN	NaN	4.811044e+05	NaN	NaN	105.712813	NaN	209.216627	NaN	40042.211706	3.711984	NaN	NaN	NaN	NaN	25779.747957	NaN
min	NaN	NaN	0.000000e+00	NaN	NaN	1000.000000	NaN	0.000000	NaN	5000.000000	0.000000	NaN	NaN	NaN	NaN	1067.000000	NaN
25%	NaN	NaN	1.100000e+03	NaN	NaN	1999.000000	NaN	70.000000	NaN	125000.000000	3.000000	NaN	NaN	NaN	NaN	30451.000000	NaN
50%	NaN	NaN	2.950000e+03	NaN	NaN	2003.000000	NaN	105.000000	NaN	150000.000000	6.000000	NaN	NaN	NaN	NaN	49577.000000	NaN
75%	NaN	NaN	7.200000e+03	NaN	NaN	2008.000000	NaN	150.000000	NaN	150000.000000	9.000000	NaN	NaN	NaN	NaN	71540.000000	NaN
max	NaN	NaN	1.000000e+08	NaN	NaN	9999.000000	NaN	17700.000000	NaN	150000.000000	12.000000	NaN	NaN	NaN	NaN	99998.000000	NaN

Exploring the data on detail: outliers in the data¶

Now that we have the data converted to int type, we need to explore the outliers, some data that have no sense or are not adjusted to our goal. The steps are the same for odometer_km and price.

Verify the number of unique values using .unique().shape method
If we have outliers, identify the total number using value_counts().sort_index() method.
Use the .between() method to indicate the range of values that we are going to use

In [17]:

autos["odometer_km"].unique().shape

Out[17]:

(13,)

In [18]:

autos["odometer_km"].describe()

Out[18]:

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [19]:

autos["odometer_km"].value_counts().sort_index(ascending=False)

Out[19]:

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
40000       819
30000       789
20000       784
10000       264
5000        967
Name: odometer_km, dtype: int64

In [20]:

# With this we obtain the number of unique values
autos["price"].unique().shape

Out[20]:

(2357,)

In [21]:

autos["price"].describe()

Out[21]:

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [22]:

# First we need to see the extreme superior values, prices above 1 million USD must be cleaned
autos["price"].value_counts().sort_index(ascending=False).head(20)

Out[22]:

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64

In [23]:

# Using `sort_index()` we can sort using the unique values, now we see the low prices, 0-100USD are a very low price
autos["price"].value_counts().sort_index(ascending=True).head(20)

Out[23]:

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price, dtype: int64

In [24]:

#Using only .value_counts() we can see the values by frequency
autos["price"].value_counts().head(50)

Out[24]:

0       1421
500      781
1500     734
2500     643
1000     639
1200     639
600      531
800      498
3500     498
2000     460
999      434
750      433
900      420
650      419
850      410
700      395
4500     394
300      384
2200     382
950      379
1100     376
1300     371
3000     365
550      356
1800     355
5500     340
1250     335
350      335
1600     327
1999     322
400      321
6500     300
1400     292
250      291
2800     291
2300     290
1350     276
1700     268
200      266
450      265
3800     264
3200     261
2900     256
1450     252
4000     246
2999     242
2600     241
5000     239
1900     239
1650     238
Name: price, dtype: int64

With this info is better to start from 100USD than from 1USD, is difficult to asses if below this values is realistic to purchase a car, we need more info about how this market works.

In [25]:

#With between method we remove the outliers and reduce the dataset.
autos = autos[autos["price"].between(100,351000)]
autos["price"].describe()

Out[25]:

count     48224.000000
mean       5930.371433
std        9078.372762
min         100.000000
25%        1250.000000
50%        3000.000000
75%        7499.000000
max      350000.000000
Name: price, dtype: float64

Working with dates¶

We have 5 types of data from dates:

date_crawled: added by the crawler
last_seen: added by the crawler
ad_created: from the website
registration_month: from the website
registration_year: from the website

We need to convert the dates that are identified as strings to date format.

In [26]:

#First we see how is written the dates
autos[['date_crawled','ad_created','last_seen']][0:5]

Out[26]:

	date_crawled	ad_created	last_seen
0	2016-03-26 17:47:46	2016-03-26 00:00:00	2016-04-06 06:45:54
1	2016-04-04 13:38:56	2016-04-04 00:00:00	2016-04-06 14:45:08
2	2016-03-26 18:57:24	2016-03-26 00:00:00	2016-04-06 20:15:37
3	2016-03-12 16:58:10	2016-03-12 00:00:00	2016-03-15 03:16:28
4	2016-04-01 14:38:50	2016-04-01 00:00:00	2016-04-01 14:38:50

In [27]:

# We need to remove the hour and leave only the dates using .str method
autos['date_crawled'].str[:10].head(10)

Out[27]:

0    2016-03-26
1    2016-04-04
2    2016-03-26
3    2016-03-12
4    2016-04-01
5    2016-03-21
6    2016-03-20
7    2016-03-16
8    2016-03-22
9    2016-03-16
Name: date_crawled, dtype: object

In [28]:

#Now we use value_counts() to see the most frequent dates
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).head(10)

Out[28]:

2016-04-03    0.038611
2016-03-20    0.037803
2016-03-21    0.037201
2016-03-12    0.036911
2016-03-14    0.036662
2016-04-04    0.036538
2016-03-07    0.036061
2016-04-02    0.035605
2016-03-28    0.034962
2016-03-19    0.034734
Name: date_crawled, dtype: float64

In [29]:

# We need to identify the first dates of crawling using .sort_index()
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index().head(10)

Out[29]:

2016-03-05    0.025361
2016-03-06    0.014039
2016-03-07    0.036061
2016-03-08    0.033179
2016-03-09    0.033013
2016-03-10    0.032287
2016-03-11    0.032598
2016-03-12    0.036911
2016-03-13    0.015677
2016-03-14    0.036662
Name: date_crawled, dtype: float64

In [30]:

#Then we identify the last dates using the same method
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index().tail(10)

Out[30]:

2016-03-29    0.034112
2016-03-30    0.033738
2016-03-31    0.031851
2016-04-01    0.033697
2016-04-02    0.035605
2016-04-03    0.038611
2016-04-04    0.036538
2016-04-05    0.013064
2016-04-06    0.003173
2016-04-07    0.001389
Name: date_crawled, dtype: float64

In [31]:

#The same previous process we will use with the other dates
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).head(10)

Out[31]:

2016-04-03    0.038860
2016-03-20    0.037865
2016-03-21    0.037429
2016-04-04    0.036890
2016-03-12    0.036745
2016-04-02    0.035294
2016-03-14    0.035294
2016-03-28    0.035066
2016-03-07    0.034796
2016-03-29    0.034070
Name: ad_created, dtype: float64

In [32]:

autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index().head(10)

Out[32]:

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
Name: ad_created, dtype: float64

In [33]:

autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index().tail(10)

Out[33]:

2016-03-29    0.034070
2016-03-30    0.033552
2016-03-31    0.031893
2016-04-01    0.033676
2016-04-02    0.035294
2016-04-03    0.038860
2016-04-04    0.036890
2016-04-05    0.011799
2016-04-06    0.003256
2016-04-07    0.001244
Name: ad_created, dtype: float64

In [34]:

autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).head(10)

Out[34]:

2016-04-06    0.221964
2016-04-07    0.132154
2016-04-05    0.125062
2016-03-17    0.028098
2016-04-03    0.025133
2016-04-02    0.024884
2016-03-30    0.024697
2016-04-04    0.024531
2016-03-31    0.023826
2016-03-12    0.023785
Name: last_seen, dtype: float64

In [35]:

autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index().head(10)

Out[35]:

2016-03-05    0.001078
2016-03-06    0.004313
2016-03-07    0.005433
2016-03-08    0.007320
2016-03-09    0.009580
2016-03-10    0.010638
2016-03-11    0.012400
2016-03-12    0.023785
2016-03-13    0.008875
2016-03-14    0.012629
Name: last_seen, dtype: float64

In [36]:

autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index().tail(10)

Out[36]:

2016-03-29    0.022292
2016-03-30    0.024697
2016-03-31    0.023826
2016-04-01    0.022852
2016-04-02    0.024884
2016-04-03    0.025133
2016-04-04    0.024531
2016-04-05    0.125062
2016-04-06    0.221964
2016-04-07    0.132154
Name: last_seen, dtype: float64

Registration year cleaning¶

From the previous exercise we identified:

The dataset was retrieved from march until april of 2016
The ads were created between june 2015 and april 2016

With this info is clear that we need to review the registration year of the cars. Similar to previous step with the price, we need to fix a range and exclude some data.

In [37]:

autos["registration_year"].describe()

Out[37]:

count    48224.000000
mean      2004.730964
std         87.897388
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The range of years¶

Looking the data we find that we have cars with registered years before 1900 (when we don't even have a clue of what is a Mercedes Benz) and years before 2016, the last year when we retrieved the data. So We need to explore the data set and removes this years.

In [38]:

autos["registration_year"].value_counts(normalize=True)

Out[38]:

2000    0.064366
2005    0.060364
1999    0.059701
2004    0.055968
2003    0.055844
2006    0.055325
2001    0.054516
2002    0.051364
1998    0.048524
2007    0.047134
2008    0.045828
2009    0.043132
1997    0.039918
2011    0.033552
2010    0.032909
2017    0.028679
1996    0.028098
2012    0.027123
2016    0.024925
1995    0.024739
2013    0.016610
2014    0.013728
1994    0.012981
2018    0.009705
1993    0.008709
2015    0.007880
1992    0.007610
1991    0.007009
1990    0.006885
1989    0.003546
          ...   
1934    0.000041
1951    0.000041
1941    0.000041
1954    0.000041
1955    0.000041
2019    0.000041
1910    0.000041
1957    0.000041
1800    0.000041
8888    0.000021
1950    0.000021
4500    0.000021
1929    0.000021
1953    0.000021
1111    0.000021
1001    0.000021
4800    0.000021
1931    0.000021
5911    0.000021
1948    0.000021
1927    0.000021
1943    0.000021
1939    0.000021
1938    0.000021
1000    0.000021
2800    0.000021
4100    0.000021
6200    0.000021
9000    0.000021
1952    0.000021
Name: registration_year, Length: 95, dtype: float64

In [39]:

autos["registration_year"].value_counts().sort_index().head(20)

Out[39]:

1000    1
1001    1
1111    1
1800    2
1910    2
1927    1
1929    1
1931    1
1934    2
1937    4
1938    1
1939    1
1941    2
1943    1
1948    1
1950    1
1951    2
1952    1
1953    1
1954    2
Name: registration_year, dtype: int64

In [40]:

autos["registration_year"].value_counts().sort_index().tail(20)

Out[40]:

2010    1587
2011    1618
2012    1308
2013     801
2014     662
2015     380
2016    1202
2017    1383
2018     468
2019       2
2800       1
4100       1
4500       1
4800       1
5000       3
5911       1
6200       1
8888       1
9000       1
9999       3
Name: registration_year, dtype: int64

In [41]:

autos = autos[autos["registration_year"].between(1900,2019)]
autos["registration_year"].describe()

Out[41]:

count    48205.000000
mean      2003.490053
std          7.511679
min       1910.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       2019.000000
Name: registration_year, dtype: float64

In [42]:

autos["registration_year"].value_counts(normalize=True)

Out[42]:

2000    0.064392
2005    0.060388
1999    0.059724
2004    0.055990
2003    0.055866
2006    0.055347
2001    0.054538
2002    0.051385
1998    0.048543
2007    0.047153
2008    0.045846
2009    0.043149
1997    0.039934
2011    0.033565
2010    0.032922
2017    0.028690
1996    0.028109
2012    0.027134
2016    0.024935
1995    0.024748
2013    0.016617
2014    0.013733
1994    0.012986
2018    0.009709
1993    0.008713
2015    0.007883
1992    0.007613
1991    0.007012
1990    0.006887
1989    0.003547
          ...   
1976    0.000436
1969    0.000394
1975    0.000373
1965    0.000353
1964    0.000249
1963    0.000166
1959    0.000124
1961    0.000124
1937    0.000083
1962    0.000083
1956    0.000083
1958    0.000083
1910    0.000041
1955    0.000041
2019    0.000041
1934    0.000041
1957    0.000041
1951    0.000041
1941    0.000041
1954    0.000041
1953    0.000021
1950    0.000021
1948    0.000021
1927    0.000021
1943    0.000021
1929    0.000021
1939    0.000021
1938    0.000021
1931    0.000021
1952    0.000021
Name: registration_year, Length: 81, dtype: float64

We prefer to mantain until 2019: could a mistake of typo not a deliberate error.

Aggregation with brands: understanding the most offered brands¶

Now we need to explore which are the most popular brands in the used cars, and explore only this data.

We need to now which are the top brands by share of the total
We are going to store this data in a new variable called top_brand_index

In [43]:

autos["brand"].value_counts(normalize=True).head(10)

Out[43]:

volkswagen       0.212986
bmw              0.108765
opel             0.108288
mercedes_benz    0.095965
audi             0.086028
ford             0.069619
renault          0.047837
peugeot          0.029458
fiat             0.025972
seat             0.018919
Name: brand, dtype: float64

In [44]:

brand_counts = autos["brand"].value_counts(normalize=True).head(10)
# Series.index attribute to access the labels
top_brand = brand_counts[brand_counts > 0.05]
print(top_brand)
print('\n')
top_brand_index = brand_counts[brand_counts > 0.05].index
print(top_brand_index)

volkswagen       0.212986
bmw              0.108765
opel             0.108288
mercedes_benz    0.095965
audi             0.086028
ford             0.069619
Name: brand, dtype: float64


Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')

Brand average prices an mileage¶

Once we identified the top brands, we are going to use the following information:

We first create a dictionary that stores the top brand values and calculates using mean() method the mean value for brand.
Using .Series() method we are going to convert the dictionary into a single dataframe
Finally we are going to convert the dataframe into a single dataframe with column using .dataframe() method.
Using the created dataframes for mileage and mean_pricewe are going to do a single dataframe, combining the tables.

In [45]:

brand_prices_dict = {}

for b in top_brand_index:
    selected_rows = autos[autos["brand"] == b]
    # Calculate the mean average price for just those rows
    mean = selected_rows["price"].mean()
    # Assign the mean value to the dictionary, using the
    # brand name as the key
    brand_prices_dict[b] = mean

brand_prices_dict

Out[45]:

{'volkswagen': 5366.253433330086,
 'bmw': 8310.138470341408,
 'opel': 2971.6082375478927,
 'mercedes_benz': 8570.76869865975,
 'audi': 9259.510248372317,
 'ford': 3756.9919547079858}

In [46]:

# pd. Series convert our dictionary into a single dataframe 
brand_prices = pd.Series(brand_prices_dict)
brand_prices

Out[46]:

volkswagen       5366.253433
bmw              8310.138470
opel             2971.608238
mercedes_benz    8570.768699
audi             9259.510248
ford             3756.991955
dtype: float64

In [47]:

# We can convert the series to a single Dataframe with column
brand_prices = pd.DataFrame(brand_prices, columns=['mean_price'])
brand_prices

Out[47]:

	mean_price
volkswagen	5366.253433
bmw	8310.138470
opel	2971.608238
mercedes_benz	8570.768699
audi	9259.510248
ford	3756.991955

In [48]:

mileage_dict = {}

for b in top_brand_index:
    selected_rows = autos[autos["brand"] == b]
    # Calculate the mean average price for just those rows
    mean = selected_rows["odometer_km"].mean()
    # Assign the mean value to the dictionary, using the
    # brand name as the key
    mileage_dict[b] = mean

mileage_dict

Out[48]:

{'volkswagen': 129056.19947404305,
 'bmw': 132824.71867251574,
 'opel': 129512.45210727969,
 'mercedes_benz': 131079.76653696498,
 'audi': 129604.53339763684,
 'ford': 124360.8462455304}

In [49]:

# pd. Series convert our dictionary into a single dataframe 
mileage = pd.Series(mileage_dict)
mileage

Out[49]:

volkswagen       129056.199474
bmw              132824.718673
opel             129512.452107
mercedes_benz    131079.766537
audi             129604.533398
ford             124360.846246
dtype: float64

In [50]:

mileage = pd.DataFrame(mileage, columns = ['mileage'])
mileage

Out[50]:

	mileage
volkswagen	129056.199474
bmw	132824.718673
opel	129512.452107
mercedes_benz	131079.766537
audi	129604.533398
ford	124360.846246

In [51]:

mileage["mean_prices"] = brand_prices
mileage

Out[51]:

	mileage	mean_prices
volkswagen	129056.199474	5366.253433
bmw	132824.718673	8310.138470
opel	129512.452107	2971.608238
mercedes_benz	131079.766537	8570.768699
audi	129604.533398	9259.510248
ford	124360.846246	3756.991955

In [52]:

mileage["mileage"] = mileage["mileage"].astype(int)
mileage["mean_prices"] = mileage["mean_prices"].astype(int)
mileage

Out[52]:

	mileage	mean_prices
volkswagen	129056	5366
bmw	132824	8310
opel	129512	2971
mercedes_benz	131079	8570
audi	129604	9259
ford	124360	3756

Findings on brands and price¶

The principal findings here are:

We can see that the Ford cars are selling with less mileage. This could be a red flag, maybe the car consumes more combustible or is difficult to find spare parts.
The most affordable model is opel.
Audi is the more expensive car, but the mileage is lower than mercedes, so we maybe need to check what other variables could affect the price in this cases.

Finding German words in the data and translating¶

Several columns still use the same German words, so we can find them and translate to english.

The steps are the same in previous methods:

Review the columns with german words.
Identify the unique words in this columns
Use str.replace() method to replace the words.

In [53]:

autos.head()

Out[53]:

	date_crawled	name	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer_km	registration_month	fuel_type	brand	unreparied_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	5000	control	bus	2004	manuell	158	andere	150000	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	8500	control	limousine	1997	automatik	286	7er	150000	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	8990	test	limousine	2009	manuell	102	golf	70000	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	4350	control	kleinwagen	2007	automatik	71	fortwo	70000	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	1350	test	kombi	2003	manuell	0	focus	150000	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

In [54]:

autos["vehicle_type"].unique()

Out[54]:

array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv',
       'cabrio', 'andere'], dtype=object)

In [55]:

autos["fuel_type"].unique()

Out[55]:

array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro',
       'andere'], dtype=object)

In [56]:

autos["vehicle_type"] = autos["vehicle_type"].str.replace("kleinwagen","small_car").str.replace("kombi","station_wagon").str.replace("cabrio","convertible").str.replace("andere","other")

In [57]:

autos["vehicle_type"].unique()

Out[57]:

array(['bus', 'limousine', 'small_car', 'station_wagon', nan, 'coupe',
       'suv', 'convertible', 'other'], dtype=object)

In [58]:

autos["fuel_type"] = autos["fuel_type"].str.replace("benzin","petrol").str.replace("elektro","electro")

In [59]:

autos["fuel_type"].unique()

Out[59]:

array(['lpg', 'petrol', 'diesel', nan, 'cng', 'hybrid', 'electro',
       'andere'], dtype=object)

Converting the dates to integer¶

In this part we are going to convert a string to int.

In this exercise we are going to use str.split to select only the first part of the string.
We are going to use astype to convert the string to integer.

In [60]:

autos["date_crawled"] = (autos["date_crawled"]
                                       .str.split()
                                       .str[0]
                              )
autos["date_crawled"].head()

Out[60]:

0    2016-03-26
1    2016-04-04
2    2016-03-26
3    2016-03-12
4    2016-04-01
Name: date_crawled, dtype: object

In [61]:

autos["date_crawled"] = autos["date_crawled"].str.replace("-","")
# Once we replace we can convert to int with .astype method
autos["date_crawled"] = autos["date_crawled"].astype(int)
autos["date_crawled"].head()

Out[61]:

0    20160326
1    20160404
2    20160326
3    20160312
4    20160401
Name: date_crawled, dtype: int64

Extracting model¶

In this part we are going to identify the most offered models of each brand.

These are the steps:

Replace parts of the string for spaces to extract latter the words
Use split method to select the second word of the name, that word is the model
Identify the models that have more than 1% of the total of the offered models

In [62]:

autos["name_model"] = autos["name"].str.replace("_"," ")

In [63]:

autos["name_model"] = (autos["name_model"]
                              .str.split()
                              .str[1]
                              )
autos["name_model"].head()

Out[63]:

0      807
1     740i
2     Golf
3    smart
4    Focus
Name: name_model, dtype: object

In [64]:

autos["name_model"].head(10)

Out[64]:

0       807
1      740i
2      Golf
3     smart
4     Focus
5     Grand
6      Golf
7        IV
8     Arosa
9    Megane
Name: name_model, dtype: object

In [65]:

model_counts = autos["name_model"].value_counts(normalize=True).head(10)
# Series.index attribute to access the labels
top_model = model_counts[model_counts > 0.01]
top_model.head(20)

Out[65]:

Benz      0.067089
Golf      0.053163
Corsa     0.026269
Polo      0.024333
Astra     0.023397
Passat    0.021815
A4        0.021711
A3        0.014654
A6        0.014613
Focus     0.014155
Name: name_model, dtype: float64

Findings on models¶

Mercedes Benz is the most popular used car offered in the list.
Golf, Corsa, Polo and Astra also are popular models in the data set.
Audi A3, A4 and A6 are also popular models, even when Audi is the most expensive brand. So a lot of people is trying to sell used expensive models to purchase a new one.

Odometer vs price¶

We are now tray to find how much change the mean price of a used cars if the mileage increase.

For this we are going to do the following:

We are going to split on 3 main groups the odometer values, (minus of 50k km, between 50k km and 100k km, most than 100k km)
We are going to find the mean price of each group using .mean() method

In [66]:

odometer_to_50k = autos[autos["odometer_km"] <= 50000]
odometer_to_100k = autos[(autos["odometer_km"] > 50000) & (autos["odometer_km"] <= 100000)]
odometer_to_150k = autos[(autos["odometer_km"] > 100000) & (autos["odometer_km"] <= 150000)]

Price1 = round(odometer_to_50k["price"].mean(),0)
Price2 = round (odometer_to_100k["price"].mean(),0)
Price3 = round (odometer_to_150k["price"].mean(),0)

print("Average price with odometer from 0 to 50k km:", Price1)
print("Average price with odometer from 50 to 100k km:", Price2)
print("Average price with odometer from 100 to 150k km:", Price3)

Average price with odometer from 0 to 50k km: 15025.0
Average price with odometer from 50 to 100k km: 9515.0
Average price with odometer from 100 to 150k km: 4087.0

Findings on odometers¶

The average price of an used car decrease with more mileage. We need to dive deep for brands to see if any of this change with certain models.

Damaged cars¶

Finally we are going to find the difference on price between a car with no damage and with damage.

We are going to find the unique values in "unrepaired_damage"
Once we have the unique values, following the previous steps we find the mean price.

In [67]:

autos["unreparied_damage"].unique()

Out[67]:

array(['nein', nan, 'ja'], dtype=object)

In [68]:

no_damage = autos[autos["unreparied_damage"] == 'nein']
damage = autos[autos["unreparied_damage"] == 'ja']

Price_1 = round(no_damage["price"].mean())
Price_2 = round(damage["price"].mean())

print("Average no damage:", Price_1)
print("Average with damage:", Price_2)

Average no damage: 7099
Average with damage: 2250

In [69]:

ratio_damage = 7099/2250
ratio_damage

Out[69]:

3.155111111111111

Findings damage cars¶

A car without damage could cost 3 times the price of a car with a damage.

Conclusions¶

We resume the main findings of our data here:

We can see that the Ford cars are selling with less mileage. This could be a red flag, maybe the car consumes more combustible or is difficult to find spare parts.
The most affordable model is opel.
Audi is the more expensive car, but the mileage is lower than mercedes, so we maybe need to check what other variables could affect the price in this cases.
A car without damage could cost 3 times the price of a car with a damage.
The average price of an used car decrease with more mileage. We need to dive deep for brands to see if any of this change with certain models.
Mercedes Benz is the most popular used car offered in the list.
Golf, Corsa, Polo and Astra also are popular models in the data set.
Audi A3, A4 and A6 are also popular models, even when Audi is the most expensive brand. So a lot of people is trying to sell used expensive models to purchase a new one.