Exploring Ebay Car Sales Data¶

The aim of this project is to clean a dataset of used cars from eBay Kleinanzeigen (a classifieds section of the German eBay website) and analyze the included used car listings.

The dataset was originally scraped and uploaded to Kaggle. In this project we are using its reduced version (50,000 data points sampled from the full dataset) with more dirty data. The original dataset, as well as the data dictionary, can be found here.

Data Downloading and Understanding¶

In [1]:

import pandas as pd
import numpy as np

In [2]:

autos = pd.read_csv('autos.csv', encoding='Latin-1')

In [3]:

autos

Out[3]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	nrOfPictures	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	0	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	0	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	0	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	0	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	0	39218	2016-04-01 14:38:50
5	2016-03-21 13:47:45	Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...	privat	Angebot	$7,900	test	bus	2006	automatik	150	voyager	150,000km	4	diesel	chrysler	NaN	2016-03-21 00:00:00	0	22962	2016-04-06 09:45:21
6	2016-03-20 17:55:21	VW_Golf_III_GT_Special_Electronic_Green_Metall...	privat	Angebot	$300	test	limousine	1995	manuell	90	golf	150,000km	8	benzin	volkswagen	NaN	2016-03-20 00:00:00	0	31535	2016-03-23 02:48:59
7	2016-03-16 18:55:19	Golf_IV_1.9_TDI_90PS	privat	Angebot	$1,990	control	limousine	1998	manuell	90	golf	150,000km	12	diesel	volkswagen	nein	2016-03-16 00:00:00	0	53474	2016-04-07 03:17:32
8	2016-03-22 16:51:34	Seat_Arosa	privat	Angebot	$250	test	NaN	2000	manuell	0	arosa	150,000km	10	NaN	seat	nein	2016-03-22 00:00:00	0	7426	2016-03-26 18:18:10
9	2016-03-16 13:47:02	Renault_Megane_Scenic_1.6e_RT_Klimaanlage	privat	Angebot	$590	control	bus	1997	manuell	90	megane	150,000km	7	benzin	renault	nein	2016-03-16 00:00:00	0	15749	2016-04-06 10:46:35
10	2016-03-15 01:41:36	VW_Golf_Tuning_in_siber/grau	privat	Angebot	$999	test	NaN	2017	manuell	90	NaN	150,000km	4	benzin	volkswagen	nein	2016-03-14 00:00:00	0	86157	2016-04-07 03:16:21
11	2016-03-16 18:45:34	Mercedes_A140_Motorschaden	privat	Angebot	$350	control	NaN	2000	NaN	0	NaN	150,000km	0	benzin	mercedes_benz	NaN	2016-03-16 00:00:00	0	17498	2016-03-16 18:45:34
12	2016-03-31 19:48:22	Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan...	privat	Angebot	$5,299	control	kleinwagen	2010	automatik	71	fortwo	50,000km	9	benzin	smart	nein	2016-03-31 00:00:00	0	34590	2016-04-06 14:17:52
13	2016-03-23 10:48:32	Audi_A3_1.6_tuning	privat	Angebot	$1,350	control	limousine	1999	manuell	101	a3	150,000km	11	benzin	audi	nein	2016-03-23 00:00:00	0	12043	2016-04-01 14:17:13
14	2016-03-23 11:50:46	Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver...	privat	Angebot	$3,999	test	kleinwagen	2007	manuell	75	clio	150,000km	9	benzin	renault	NaN	2016-03-23 00:00:00	0	81737	2016-04-01 15:46:47
15	2016-04-01 12:06:20	Corvette_C3_Coupe_T_Top_Crossfire_Injection	privat	Angebot	$18,900	test	coupe	1982	automatik	203	NaN	80,000km	6	benzin	sonstige_autos	nein	2016-04-01 00:00:00	0	61276	2016-04-02 21:10:48
16	2016-03-16 14:59:02	Opel_Vectra_B_Kombi	privat	Angebot	$350	test	kombi	1999	manuell	101	vectra	150,000km	5	benzin	opel	nein	2016-03-16 00:00:00	0	57299	2016-03-18 05:29:37
17	2016-03-29 11:46:22	Volkswagen_Scirocco_2_G60	privat	Angebot	$5,500	test	coupe	1990	manuell	205	scirocco	150,000km	6	benzin	volkswagen	nein	2016-03-29 00:00:00	0	74821	2016-04-05 20:46:26
18	2016-03-26 19:57:44	Verkaufen_mein_bmw_e36_320_i_touring	privat	Angebot	$300	control	bus	1995	manuell	150	3er	150,000km	0	benzin	bmw	NaN	2016-03-26 00:00:00	0	54329	2016-04-02 12:16:41
19	2016-03-17 13:36:21	mazda_tribute_2.0_mit_gas_und_tuev_neu_2018	privat	Angebot	$4,150	control	suv	2004	manuell	124	andere	150,000km	2	lpg	mazda	nein	2016-03-17 00:00:00	0	40878	2016-03-17 14:45:58
20	2016-03-05 19:57:31	Audi_A4_Avant_1.9_TDI_6_GangAHKKlimatronik...	privat	Angebot	$3,500	test	kombi	2003	manuell	131	a4	150,000km	5	diesel	audi	NaN	2016-03-05 00:00:00	0	53913	2016-03-07 05:46:46
21	2016-03-06 19:07:10	Porsche_911_Carrera_4S_Cabrio	privat	Angebot	$41,500	test	cabrio	2004	manuell	320	911	150,000km	4	benzin	porsche	nein	2016-03-06 00:00:00	0	65428	2016-04-05 23:46:19
22	2016-03-28 20:50:54	MINI_Cooper_S_Cabrio	privat	Angebot	$25,450	control	cabrio	2015	manuell	184	cooper	10,000km	1	benzin	mini	nein	2016-03-28 00:00:00	0	44789	2016-04-01 06:45:30
23	2016-03-10 19:55:34	Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima	privat	Angebot	$7,999	control	bus	2010	manuell	120	NaN	150,000km	2	diesel	peugeot	nein	2016-03-10 00:00:00	0	30900	2016-03-17 08:45:17
24	2016-04-03 11:57:02	BMW_535i_xDrive_Sport_Aut.	privat	Angebot	$48,500	control	limousine	2014	automatik	306	5er	30,000km	12	benzin	bmw	nein	2016-04-03 00:00:00	0	22547	2016-04-07 13:16:50
25	2016-03-21 21:56:18	Ford_escort_kombi_an_bastler_mit_ghia_ausstattung	privat	Angebot	$90	control	kombi	1996	manuell	116	NaN	150,000km	4	benzin	ford	ja	2016-03-21 00:00:00	0	27574	2016-04-01 05:16:49
26	2016-04-03 22:46:28	Volkswagen_Polo_Fox	privat	Angebot	$777	control	kleinwagen	1992	manuell	54	polo	125,000km	2	benzin	volkswagen	nein	2016-04-03 00:00:00	0	38110	2016-04-05 23:46:48
27	2016-03-27 18:45:01	Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE	privat	Angebot	$0	control	NaN	2005	NaN	0	NaN	150,000km	0	NaN	ford	NaN	2016-03-27 00:00:00	0	66701	2016-03-27 18:45:01
28	2016-03-19 21:56:19	MINI_Cooper_D	privat	Angebot	$5,250	control	kleinwagen	2007	manuell	110	cooper	150,000km	7	diesel	mini	ja	2016-03-19 00:00:00	0	15745	2016-04-07 14:58:48
29	2016-04-02 12:45:44	Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz...	privat	Angebot	$4,999	test	kombi	2004	automatik	204	e_klasse	150,000km	10	diesel	mercedes_benz	nein	2016-04-02 00:00:00	0	47638	2016-04-02 12:45:44
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
49970	2016-03-21 22:47:37	c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp...	privat	Angebot	$15,800	control	bus	2010	automatik	136	c4	60,000km	4	diesel	citroen	nein	2016-03-21 00:00:00	0	14947	2016-04-07 04:17:34
49971	2016-03-29 14:54:12	W.Lupo_1.0	privat	Angebot	$950	test	kleinwagen	2001	manuell	50	lupo	150,000km	4	benzin	volkswagen	nein	2016-03-29 00:00:00	0	65197	2016-03-29 20:41:51
49972	2016-03-26 22:25:23	Mercedes_Benz_Vito_115_CDI_Extralang_Aut.	privat	Angebot	$3,300	control	bus	2004	automatik	150	vito	150,000km	10	diesel	mercedes_benz	ja	2016-03-26 00:00:00	0	65326	2016-03-28 11:28:18
49973	2016-03-27 05:32:39	Mercedes_Benz_SLK_200_Kompressor	privat	Angebot	$6,000	control	cabrio	2004	manuell	163	slk	150,000km	11	benzin	mercedes_benz	nein	2016-03-27 00:00:00	0	53567	2016-03-27 08:25:24
49974	2016-03-20 10:52:31	Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing...	privat	Angebot	$0	control	cabrio	1983	manuell	70	golf	150,000km	2	benzin	volkswagen	nein	2016-03-20 00:00:00	0	8209	2016-03-27 19:48:16
49975	2016-03-27 20:51:39	Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort	privat	Angebot	$9,700	control	kleinwagen	2012	automatik	88	jazz	100,000km	11	hybrid	honda	nein	2016-03-27 00:00:00	0	84385	2016-04-05 19:45:34
49976	2016-03-19 18:56:05	Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga...	privat	Angebot	$5,900	test	kombi	1992	automatik	150	80	150,000km	12	benzin	audi	nein	2016-03-19 00:00:00	0	36100	2016-04-07 06:16:44
49977	2016-03-31 18:37:18	Mercedes_Benz_C200_Cdi_W203	privat	Angebot	$5,500	control	limousine	2003	manuell	116	c_klasse	150,000km	2	diesel	mercedes_benz	nein	2016-03-31 00:00:00	0	33739	2016-04-06 12:16:11
49978	2016-04-04 10:37:14	Mercedes_Benz_E_200_Classic	privat	Angebot	$900	control	limousine	1996	automatik	136	e_klasse	150,000km	9	benzin	mercedes_benz	ja	2016-04-04 00:00:00	0	24405	2016-04-06 12:44:20
49979	2016-03-20 18:38:40	Volkswagen_Polo_1.6_TDI_Style	privat	Angebot	$11,000	test	kleinwagen	2011	manuell	90	polo	70,000km	11	diesel	volkswagen	nein	2016-03-20 00:00:00	0	48455	2016-04-07 01:45:12
49980	2016-03-12 10:55:54	Ford_Escort_Turnier_16V	privat	Angebot	$400	control	kombi	1995	manuell	105	escort	125,000km	3	benzin	ford	NaN	2016-03-12 00:00:00	0	56218	2016-04-06 17:16:49
49981	2016-03-15 09:38:21	Opel_Astra_Kombi_mit_Anhaengerkupplung	privat	Angebot	$2,000	control	kombi	1998	manuell	115	astra	150,000km	12	benzin	opel	nein	2016-03-15 00:00:00	0	86859	2016-04-05 17:21:46
49982	2016-03-29 18:51:08	Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm	privat	Angebot	$1,950	control	kleinwagen	2004	manuell	0	fabia	90,000km	7	benzin	skoda	NaN	2016-03-29 00:00:00	0	45884	2016-03-29 18:51:08
49983	2016-03-06 12:43:04	Ford_focus_99	privat	Angebot	$600	test	kleinwagen	1999	manuell	101	focus	150,000km	4	benzin	ford	NaN	2016-03-06 00:00:00	0	52477	2016-03-09 06:16:08
49984	2016-03-31 22:48:48	Student_sucht_ein__Anfaengerauto___ab_2000_BJ_...	privat	Angebot	$0	test	NaN	2000	NaN	0	NaN	150,000km	0	NaN	sonstige_autos	NaN	2016-03-31 00:00:00	0	12103	2016-04-02 19:44:53
49985	2016-04-02 16:38:23	Verkaufe_meinen_vw_vento!	privat	Angebot	$1,000	control	NaN	1995	automatik	0	NaN	150,000km	0	benzin	volkswagen	NaN	2016-04-02 00:00:00	0	30900	2016-04-06 15:17:52
49986	2016-04-04 20:46:02	Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst...	privat	Angebot	$15,900	control	limousine	2010	automatik	218	300c	125,000km	11	diesel	chrysler	nein	2016-04-04 00:00:00	0	73527	2016-04-06 23:16:00
49987	2016-03-22 20:47:27	Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__...	privat	Angebot	$21,990	control	limousine	2013	manuell	150	a3	50,000km	11	diesel	audi	nein	2016-03-22 00:00:00	0	94362	2016-03-26 22:46:06
49988	2016-03-28 19:49:51	BMW_330_Ci	privat	Angebot	$9,550	control	coupe	2001	manuell	231	3er	150,000km	10	benzin	bmw	nein	2016-03-28 00:00:00	0	83646	2016-04-07 02:17:40
49989	2016-03-11 19:50:37	VW_Polo_zum_Ausschlachten_oder_Wiederaufbau	privat	Angebot	$150	test	kleinwagen	1997	manuell	0	polo	150,000km	5	benzin	volkswagen	ja	2016-03-11 00:00:00	0	21244	2016-03-12 10:17:55
49990	2016-03-21 19:54:19	Mercedes_Benz_A_200__BlueEFFICIENCY__Urban	privat	Angebot	$17,500	test	limousine	2012	manuell	156	a_klasse	30,000km	12	benzin	mercedes_benz	nein	2016-03-21 00:00:00	0	58239	2016-04-06 22:46:57
49991	2016-03-06 15:25:19	Kleinwagen	privat	Angebot	$500	control	NaN	2016	manuell	0	twingo	150,000km	0	benzin	renault	NaN	2016-03-06 00:00:00	0	61350	2016-03-06 18:24:19
49992	2016-03-10 19:37:38	Fiat_Grande_Punto_1.4_T_Jet_16V_Sport	privat	Angebot	$4,800	control	kleinwagen	2009	manuell	120	andere	125,000km	9	lpg	fiat	nein	2016-03-10 00:00:00	0	68642	2016-03-13 01:44:51
49993	2016-03-15 18:47:35	Audi_A3__1_8l__Silber;_schoenes_Fahrzeug	privat	Angebot	$1,650	control	kleinwagen	1997	manuell	0	NaN	150,000km	7	benzin	audi	NaN	2016-03-15 00:00:00	0	65203	2016-04-06 19:46:53
49994	2016-03-22 17:36:42	Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc...	privat	Angebot	$5,000	control	kombi	2001	automatik	299	a6	150,000km	1	benzin	audi	nein	2016-03-22 00:00:00	0	46537	2016-04-06 08:16:39
49995	2016-03-27 14:38:19	Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon	privat	Angebot	$24,900	control	limousine	2011	automatik	239	q5	100,000km	1	diesel	audi	nein	2016-03-27 00:00:00	0	82131	2016-04-01 13:47:40
49996	2016-03-28 10:50:25	Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...	privat	Angebot	$1,980	control	cabrio	1996	manuell	75	astra	150,000km	5	benzin	opel	nein	2016-03-28 00:00:00	0	44807	2016-04-02 14:18:02
49997	2016-04-02 14:44:48	Fiat_500_C_1.2_Dualogic_Lounge	privat	Angebot	$13,200	test	cabrio	2014	automatik	69	500	5,000km	11	benzin	fiat	nein	2016-04-02 00:00:00	0	73430	2016-04-04 11:47:27
49998	2016-03-08 19:25:42	Audi_A3_2.0_TDI_Sportback_Ambition	privat	Angebot	$22,900	control	kombi	2013	manuell	150	a3	40,000km	11	diesel	audi	nein	2016-03-08 00:00:00	0	35683	2016-04-05 16:45:07
49999	2016-03-14 00:42:12	Opel_Vectra_1.6_16V	privat	Angebot	$1,250	control	limousine	1996	manuell	101	vectra	150,000km	1	benzin	opel	nein	2016-03-13 00:00:00	0	45897	2016-04-06 21:18:48

50000 rows × 20 columns

In [4]:

autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

Some observations can be made here.

Our dataset has 50,000 entries and 20 columns.
A few columns ('vehicleType', 'gearbox', 'model', 'fuelType', 'notRepairedDamage') have null values, for 'notRepairedDamage' up to 20% of all values.
The majority of columns have values in string data type (including 'price' and "odometer', which will probably have to be fixed to avoid issues).
The column names use JavaScript's preferred camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores. Given that some of the names are long and compound, contaning even prepositions, it makes it more difficult to catch their meaning.

Let's investigate now the first 10 rows of the dataset.

In [5]:

autos.head(10)

Out[5]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50
5	2016-03-21 13:47:45	Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...	privat	Angebot	$7,900	test	bus	2006	automatik	150	voyager	150,000km	4	diesel	chrysler	NaN	2016-03-21 00:00:00	22962	2016-04-06 09:45:21
6	2016-03-20 17:55:21	VW_Golf_III_GT_Special_Electronic_Green_Metall...	privat	Angebot	$300	test	limousine	1995	manuell	90	golf	150,000km	8	benzin	volkswagen	NaN	2016-03-20 00:00:00	31535	2016-03-23 02:48:59
7	2016-03-16 18:55:19	Golf_IV_1.9_TDI_90PS	privat	Angebot	$1,990	control	limousine	1998	manuell	90	golf	150,000km	12	diesel	volkswagen	nein	2016-03-16 00:00:00	53474	2016-04-07 03:17:32
8	2016-03-22 16:51:34	Seat_Arosa	privat	Angebot	$250	test	NaN	2000	manuell	0	arosa	150,000km	10	NaN	seat	nein	2016-03-22 00:00:00	7426	2016-03-26 18:18:10
9	2016-03-16 13:47:02	Renault_Megane_Scenic_1.6e_RT_Klimaanlage	privat	Angebot	$590	control	bus	1997	manuell	90	megane	150,000km	7	benzin	renault	nein	2016-03-16 00:00:00	15749	2016-04-06 10:46:35

At a first glance, it seems that the columns containing dates and times are of quite a consistent format.
In some columns ('seller', 'offerType', 'gearbox', 'notRepairedDamage') values are written in German.
The columns 'price' and 'odometer' show the same units for each value.
The columns 'seller', 'offerType', 'notRepairedDamage', 'nrOfPictures' seem to have a dominant value each (with 'notRepairedDamage' also having a lot of null values) and probably will not represent much interest for further data analysis.

Cleaning Column Names¶

At this stage we convert the column names from camelcase to snakecase and reword some of the column names based on the data dictionary to be more descriptive.

In [6]:

autos.columns

Out[6]:

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

For converting the column names from camelcase to snakecase, we will use the following function:

In [7]:

def camel_to_snake(str): 
    res = [str[0].lower()] 
    for c in str[1:]: 
        if c in ('ABCDEFGHIJKLMNOPQRSTUVWXYZ'): 
            res.append('_') 
            res.append(c.lower()) 
        else: 
            res.append(c) 
      
    return ''.join(res) 

# Converting the column names
new_columns = []
for c in autos.columns:
    column_snake = camel_to_snake(c)
    new_columns.append(column_snake)
    

autos.columns = new_columns
autos.columns

Out[7]:

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'year_of_registration', 'gearbox', 'power_p_s', 'model',
       'odometer', 'month_of_registration', 'fuel_type', 'brand',
       'not_repaired_damage', 'date_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

In this way the column names look more Python-style. However, for making them more descriptive, it's worth doing some further renaming here.

In [8]:

# Renaming columns
autos.rename({'abtest': 'ab_test', 'year_of_registration': 'registration_year', 
              'power_p_s': 'power_ps', 'month_of_registration': 'registration_month',
              'not_repaired_damage': 'unrepaired_damage', 'date_created': 'ad_created', 
              'nr_of_pictures': 'number_pictures'}, axis=1, inplace=True)

autos.columns

Out[8]:

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'number_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

In [9]:

autos.head()

Out[9]:

	date_crawled	name	seller	offer_type	price	ab_test	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

Now all the column names are in line with the Python's naming convention and represent clear and informative description of their contents.

Initial Data Exploration and Cleaning¶

Let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for:

Text columns where all or almost all values are the same. These can be dropped as they don't have useful information for analysis.
Examples of numeric data stored as text which can be cleaned and converted.
Categorical data that uses German words that can be translated and mapped to their English counterparts.

In [10]:

# Checking the dataset statistics including string columns
autos.describe(include='all')

Out[10]:

	date_crawled	name	seller	offer_type	price	ab_test	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	number_pictures	postal_code	last_seen
count	50000	50000	50000	50000	50000	50000	44905	50000.000000	47320	50000.000000	47242	50000	50000.000000	45518	50000	40171	50000	50000.0	50000.000000	50000
unique	48213	38754	2	2	2357	2	8	NaN	2	NaN	245	13	NaN	7	40	2	76	NaN	NaN	39481
top	2016-03-23 19:38:20	Ford_Fiesta	privat	Angebot	$0	test	limousine	NaN	manuell	NaN	golf	150,000km	NaN	benzin	volkswagen	nein	2016-04-03 00:00:00	NaN	NaN	2016-04-07 06:17:27
freq	3	78	49999	49999	1421	25756	12859	NaN	36993	NaN	4024	32424	NaN	30107	10687	35232	1946	NaN	NaN	8
mean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2005.073280	NaN	116.355920	NaN	NaN	5.723360	NaN	NaN	NaN	NaN	0.0	50813.627300	NaN
std	NaN	NaN	NaN	NaN	NaN	NaN	NaN	105.712813	NaN	209.216627	NaN	NaN	3.711984	NaN	NaN	NaN	NaN	0.0	25779.747957	NaN
min	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1000.000000	NaN	0.000000	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	0.0	1067.000000	NaN
25%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1999.000000	NaN	70.000000	NaN	NaN	3.000000	NaN	NaN	NaN	NaN	0.0	30451.000000	NaN
50%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2003.000000	NaN	105.000000	NaN	NaN	6.000000	NaN	NaN	NaN	NaN	0.0	49577.000000	NaN
75%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2008.000000	NaN	150.000000	NaN	NaN	9.000000	NaN	NaN	NaN	NaN	0.0	71540.000000	NaN
max	NaN	NaN	NaN	NaN	NaN	NaN	NaN	9999.000000	NaN	17700.000000	NaN	NaN	12.000000	NaN	NaN	NaN	NaN	0.0	99998.000000	NaN

From these dataset statistics we can draw many insights.

The columns 'seller', 'offer_type' and 'number_pictures' have practically one value each, which is exactly what we supposed also earlier, analyzing the first 10 rows of the dataset. These columns have to be dropped.
The 'price' and 'odometer' columns are numeric data stored as text (again, as we noticed earlier). They have to be cleaned and converted to a numeric data type.
Judging by minimum and maximum values, the columns 'registration_year', 'registration_month' and 'power_ps' have obvious outliers. The incorrect data has to be removed or corrected.
One curious observation about the 'price' column: 1,421 out of 50,000 entries have a price of 0$, meaning that around 3% of the cars of our dataset are actually free of charge.
More than 9.5% of all the cars have unrepaired damages.

Let's drop the non-informative columns 'seller', 'offer_type', and 'number_pictures'.

In [11]:

# Dropping columns
autos.drop(['seller', 'offer_type', 'number_pictures'], axis=1, inplace=True)

There is one more interesting column to have a look at: 'fuel_type'. Out of 7 different fuel types represented in the dataset benzine is the most common one (60% of all the entries, including null values). What are the other 6?

In [12]:

autos['fuel_type'].value_counts()

Out[12]:

benzin     30107
diesel     14567
lpg          691
cng           75
hybrid        37
andere        22
elektro       19
Name: fuel_type, dtype: int64

We see that the second common fuel type is diesel (quite predictable), while the other types, all together less than 2%, are liquified/compressed natural gas, hybrid, electro and "other" ("andere" in German).

Next, we are going to identify categorical data that uses German words, translate them and map to their English counterparts. After deleting the non-informative columns, we have the following ones containg German words: 'gearbox', 'fuel_type' and 'unrepaired_damage'. Let's investigate their vocabulary.

In [13]:

print(autos['gearbox'].unique())
print(autos['fuel_type'].unique())
print(autos['unrepaired_damage'].unique())

['manuell' 'automatik' nan]
['lpg' 'benzin' 'diesel' nan 'cng' 'hybrid' 'elektro' 'andere']
['nein' nan 'ja']

Let's create a unique "German dictionary" for these words and map them into English meanings.

In [14]:

# Creating a dictionary
german_dict = {
    'manuell': 'manual',
    'automatik': 'automatic',
    'benzin': 'benzine',
    'elektro': 'electro',
    'andere': 'other',
    'nein': 'no',
    'ja': 'yes',
# Next words remain the same, but have to be mapped to avoid NaN 
    'lpg': 'lpg',
    'diesel': 'diesel',
    'cng': 'cng',
    'hybrid': 'hybrid'    
}

# Mapping German words in English
autos['gearbox'] = autos['gearbox'].map(german_dict)
autos['fuel_type'] = autos['fuel_type'].map(german_dict)
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(german_dict)

# Checking updated unique values
print(autos['gearbox'].unique())
print(autos['fuel_type'].unique())
print(autos['unrepaired_damage'].unique())

['manual' 'automatic' nan]
['lpg' 'benzine' 'diesel' nan 'cng' 'hybrid' 'electro' 'other']
['no' nan 'yes']

Exploring the `Odometer` and `Price` Columns: Looking for Outliers¶

Let's return to the 'price' and 'odometer' columns: we have to remove any non-numeric characters ('$' and 'km' correspondingly and the symbol of comma from both), convert the columns to a numeric data type and rename the columns for not to lose the information about units.

In [15]:

# Removing non-numeric characters and converting to a numeric dtype
autos['price'] = autos['price'].str.replace('$','').str.replace(',','').astype(int)
autos['odometer'] = autos['odometer'].str.replace('km','').str.replace(',','').astype(int)

# Renaming columns
autos.rename({'price': 'price_dollars', 'odometer': 'odometer_km'}, axis=1, inplace=True)

In [16]:

autos.head()

Out[16]:

	date_crawled	name	price_dollars	ab_test	vehicle_type	registration_year	gearbox	power_ps	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	5000	control	bus	2004	manual	158	andere	150000	3	lpg	peugeot	no	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	8500	control	limousine	1997	automatic	286	7er	150000	6	benzine	bmw	no	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	8990	test	limousine	2009	manual	102	golf	70000	7	benzine	volkswagen	no	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	4350	control	kleinwagen	2007	automatic	71	fortwo	70000	6	benzine	smart	no	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	1350	test	kombi	2003	manual	0	focus	150000	7	benzine	ford	no	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

Now the columns 'price_dollars' and 'odometer_km' are in the right format to be analyzed.

In [17]:

autos['price_dollars'].unique().shape

Out[17]:

(2357,)

In [18]:

autos['odometer_km'].unique().shape

Out[18]:

(13,)

The column 'price_dollars' has 2357 unique values, the column 'odometer_km' 13 ones.

Let's look at the statistics of prices.

In [19]:

autos['price_dollars'].describe()

Out[19]:

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_dollars, dtype: float64

It seems that there is an outlier in prices: the maximum value is 5 orders bigger than the value of the third quartile (75%), and in general 100,000,000$ doesn't look a realistic price for a used car on sale. As for the minimum value equals 0, it looks ok and represents the cars offered free of charge.

We will investigate in more detail the maximum value now.

In [20]:

# Finding the 5 biggest prices in descending order
five_biggest_prices = autos['price_dollars'].value_counts().sort_index(ascending=False).head()
five_biggest_prices

Out[20]:

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
Name: price_dollars, dtype: int64

Hence, the maximum value is clearly wrong, probably it was intended to represent an undefined value. We have to remove it from the dataset.

In [21]:

# Removing the outlier
autos = autos.loc[autos['price_dollars'].between(0, 99999998), :]

In [22]:

# Double-checking the 5 biggest prices
five_biggest_prices = autos['price_dollars'].value_counts().sort_index(ascending=False).head()
five_biggest_prices

Out[22]:

27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
Name: price_dollars, dtype: int64

In [23]:

autos['price_dollars'].describe()

Out[23]:

count    4.999900e+04
mean     7.840201e+03
std      1.774639e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      2.732222e+07
Name: price_dollars, dtype: float64

Now the values of the 'price_dollars' column are cleaned and meaningful, with the maximum price of 27,322,220$, the average of 7,840$ and the minimum free of charge.

Let's conduct a similar analysis for the 'odometer_km' column.

In [24]:

autos['odometer_km'].describe()

Out[24]:

count     49999.000000
mean     125732.214644
std       40042.465064
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [25]:

odometer_values = autos['odometer_km'].value_counts().sort_index().head()
odometer_values.head()

Out[25]:

5000     967
10000    264
20000    784
30000    789
40000    819
Name: odometer_km, dtype: int64

In [26]:

odometer_values.tail()

Out[26]:

5000     967
10000    264
20000    784
30000    789
40000    819
Name: odometer_km, dtype: int64

In this column we don't have any outliers. The value range is from 5,000 to 150,000 km, with the average of 125,732 km. These statistics are heavily influenced by the frequency of the maximum value: we can deduce from the previous cell that about 65% of cars in our dataset have driven 150,000 km.

Exploring the Date Columns¶

Let's now understand the date range the data covers.

There are 5 columns representing date values. Some of these columns were created by the crawler, some came from the website itself.

'date_crawled': added by the crawler
'last_seen': added by the crawler
'ad_created': from the website
'registration_month': from the website
'registration_year': from the website

The first 3 of these columns are identified as string values by pandas so we need to convert them into a numerical representation for further quantitative analysis. The other 2 columns are already numeric.

Let's first understand how the values in the 3 string columns are formatted.

In [27]:

autos[['date_crawled','ad_created','last_seen']].head()

Out[27]:

	date_crawled	ad_created	last_seen
0	2016-03-26 17:47:46	2016-03-26 00:00:00	2016-04-06 06:45:54
1	2016-04-04 13:38:56	2016-04-04 00:00:00	2016-04-06 14:45:08
2	2016-03-26 18:57:24	2016-03-26 00:00:00	2016-04-06 20:15:37
3	2016-03-12 16:58:10	2016-03-12 00:00:00	2016-03-15 03:16:28
4	2016-04-01 14:38:50	2016-04-01 00:00:00	2016-04-01 14:38:50

These columns all represent full timestamp values, with the first 10 characters representing the day and the last 5 - the time. For our purposes we are interested only in days.

We are going to calculate the distribution of values of dates in the 'date_crawled', 'ad_created', and 'last_seen' columns as percentages, including missing values and ranking by date from earliest to latest.

In [28]:

# Extracting days, counting values in % including missing values, and sorting from earliest to latest
data_crawled_dates = autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
data_crawled_dates

Out[28]:

2016-03-05    0.025381
2016-03-06    0.013940
2016-03-07    0.035961
2016-03-08    0.033301
2016-03-09    0.033221
2016-03-10    0.032121
2016-03-11    0.032481
2016-03-12    0.036781
2016-03-13    0.015560
2016-03-14    0.036621
2016-03-15    0.033981
2016-03-16    0.029501
2016-03-17    0.031521
2016-03-18    0.013060
2016-03-19    0.034901
2016-03-20    0.037821
2016-03-21    0.037521
2016-03-22    0.032921
2016-03-23    0.032381
2016-03-24    0.029101
2016-03-25    0.031741
2016-03-26    0.032481
2016-03-27    0.031041
2016-03-28    0.034841
2016-03-29    0.034181
2016-03-30    0.033621
2016-03-31    0.031921
2016-04-01    0.033801
2016-04-02    0.035401
2016-04-03    0.038681
2016-04-04    0.036521
2016-04-05    0.013100
2016-04-06    0.003180
2016-04-07    0.001420
Name: date_crawled, dtype: float64

We can observe that the range of dates when each ad in our dataset was first crawled is from the 5th of March 2016 till the 7th of April 2016. The distribution of frequencies looks quite consistent.

In [29]:

# Extracting days, counting values in % including missing values, and sorting from earliest to latest
ad_created_dates = autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
ad_created_dates

Out[29]:

2015-06-11    0.000020
2015-08-10    0.000020
2015-09-09    0.000020
2015-11-10    0.000020
2015-12-05    0.000020
2015-12-30    0.000020
2016-01-03    0.000020
2016-01-07    0.000020
2016-01-10    0.000040
2016-01-13    0.000020
2016-01-14    0.000020
2016-01-16    0.000020
2016-01-22    0.000020
2016-01-27    0.000060
2016-01-29    0.000020
2016-02-01    0.000020
2016-02-02    0.000040
2016-02-05    0.000040
2016-02-07    0.000020
2016-02-08    0.000020
2016-02-09    0.000040
2016-02-11    0.000020
2016-02-12    0.000060
2016-02-14    0.000040
2016-02-16    0.000020
2016-02-17    0.000020
2016-02-18    0.000040
2016-02-19    0.000060
2016-02-20    0.000040
2016-02-21    0.000060
                ...   
2016-03-09    0.033241
2016-03-10    0.031861
2016-03-11    0.032781
2016-03-12    0.036621
2016-03-13    0.016920
2016-03-14    0.035221
2016-03-15    0.033741
2016-03-16    0.030001
2016-03-17    0.031201
2016-03-18    0.013720
2016-03-19    0.033841
2016-03-20    0.037861
2016-03-21    0.037721
2016-03-22    0.032781
2016-03-23    0.032181
2016-03-24    0.029081
2016-03-25    0.031881
2016-03-26    0.032561
2016-03-27    0.030901
2016-03-28    0.034961
2016-03-29    0.034141
2016-03-30    0.033441
2016-03-31    0.031921
2016-04-01    0.033801
2016-04-02    0.035081
2016-04-03    0.038921
2016-04-04    0.036881
2016-04-05    0.011840
2016-04-06    0.003260
2016-04-07    0.001280
Name: ad_created, Length: 76, dtype: float64

In our dataset the first eBay listing was created the 11th of June 2015 and the last - the 7th of April 2016. As for the distribution, the frequencies look very low at the begining and then grow significantly. Let's try to trace where this transition takes place.

In [30]:

ad_created_dates[30:]

Out[30]:

2016-02-22    0.000020
2016-02-23    0.000080
2016-02-24    0.000040
2016-02-25    0.000060
2016-02-26    0.000040
2016-02-27    0.000120
2016-02-28    0.000200
2016-02-29    0.000160
2016-03-01    0.000100
2016-03-02    0.000100
2016-03-03    0.000860
2016-03-04    0.001440
2016-03-05    0.023040
2016-03-06    0.015120
2016-03-07    0.034741
2016-03-08    0.033341
2016-03-09    0.033241
2016-03-10    0.031861
2016-03-11    0.032781
2016-03-12    0.036621
2016-03-13    0.016920
2016-03-14    0.035221
2016-03-15    0.033741
2016-03-16    0.030001
2016-03-17    0.031201
2016-03-18    0.013720
2016-03-19    0.033841
2016-03-20    0.037861
2016-03-21    0.037721
2016-03-22    0.032781
2016-03-23    0.032181
2016-03-24    0.029081
2016-03-25    0.031881
2016-03-26    0.032561
2016-03-27    0.030901
2016-03-28    0.034961
2016-03-29    0.034141
2016-03-30    0.033441
2016-03-31    0.031921
2016-04-01    0.033801
2016-04-02    0.035081
2016-04-03    0.038921
2016-04-04    0.036881
2016-04-05    0.011840
2016-04-06    0.003260
2016-04-07    0.001280
Name: ad_created, dtype: float64

An evident transition is observed on the 5th of March, exactly the day when the first adds were crawled from the website.

In [31]:

# Extracting days, counting values in % including missing values, and sorting from earliest to latest
last_seen_dates = autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
last_seen_dates

Out[31]:

2016-03-05    0.001080
2016-03-06    0.004420
2016-03-07    0.005360
2016-03-08    0.007600
2016-03-09    0.009860
2016-03-10    0.010760
2016-03-11    0.012520
2016-03-12    0.023820
2016-03-13    0.008980
2016-03-14    0.012800
2016-03-15    0.015880
2016-03-16    0.016440
2016-03-17    0.027921
2016-03-18    0.007420
2016-03-19    0.015740
2016-03-20    0.020700
2016-03-21    0.020740
2016-03-22    0.021580
2016-03-23    0.018580
2016-03-24    0.019560
2016-03-25    0.019200
2016-03-26    0.016960
2016-03-27    0.016020
2016-03-28    0.020860
2016-03-29    0.022340
2016-03-30    0.024840
2016-03-31    0.023840
2016-04-01    0.023100
2016-04-02    0.024900
2016-04-03    0.025361
2016-04-04    0.024620
2016-04-05    0.124282
2016-04-06    0.220984
2016-04-07    0.130923
Name: last_seen, dtype: float64

The range of dates the crawler saw each ad last online is also (like for 'date_crawled') from the 5th of March 2016 till the 7th of April 2016. The distribution of frequencies mostly looks consistent, increasing only the last 3 days.

Let's take a look now at the statistics of the 'registration_year' column, which is already numeric.

In [32]:

autos['registration_year'].describe()

Out[32]:

count    49999.000000
mean      2005.073401
std        105.713866
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

We clearly see 2 issues here: a wrong minimum value of 1000 year and a wrong maximum one of 9999 year. These values certainly bias also the other statistics (mean and standard deviation).

Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. As for the earliest valid year, realistically, it could be somewhere in the first few decades of the 1900s.

Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.

In [33]:

autos['registration_year'].value_counts().sort_index()

Out[33]:

1000       1
1001       1
1111       1
1500       1
1800       2
1910       9
1927       1
1929       1
1931       1
1934       2
1937       4
1938       1
1939       1
1941       2
1943       1
1948       1
1950       3
1951       2
1952       1
1953       1
1954       2
1955       2
1956       5
1957       2
1958       4
1959       7
1960      34
1961       6
1962       4
1963       9
        ... 
2001    2703
2002    2533
2003    2727
2004    2737
2005    3015
2006    2708
2007    2304
2008    2231
2009    2098
2010    1597
2011    1634
2012    1323
2013     806
2014     666
2015     399
2016    1316
2017    1453
2018     492
2019       3
2800       1
4100       1
4500       1
4800       1
5000       4
5911       1
6200       1
8888       1
9000       2
9996       1
9999       4
Name: registration_year, Length: 97, dtype: int64

The earliest plausible value of a registration year is 1910. It's worth looking at the 9 entries with that year in more detail.

In [34]:

autos[autos['registration_year'] == 1910]

Out[34]:

	date_crawled	name	price_dollars	ab_test	vehicle_type	registration_year	gearbox	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
3679	2016-04-04 00:36:17	Suche_Auto	1	test	NaN	1910	NaN	NaN	5000	0	NaN	sonstige_autos	NaN	2016-04-04 00:00:00	40239	2016-04-04 07:49:15
15898	2016-03-08 10:50:05	Tausch_alles_aus_meinen_Anzeigen_gegen_Auto	0	test	NaN	1910	NaN	NaN	5000	0	NaN	sonstige_autos	NaN	2016-03-08 00:00:00	6108	2016-03-08 17:47:19
22659	2016-03-14 08:51:18	Opel_Corsa_B	500	test	NaN	1910	NaN	corsa	150000	0	NaN	opel	NaN	2016-03-14 00:00:00	52393	2016-04-03 07:53:55
28693	2016-03-22 17:48:41	Renault_Twingo	599	control	kleinwagen	1910	manual	NaN	5000	0	benzine	renault	NaN	2016-03-22 00:00:00	70376	2016-04-06 09:16:59
30781	2016-03-25 13:47:46	Opel_Calibra_V6_DTM_Bausatz_1:24	30	test	NaN	1910	NaN	calibra	100000	0	NaN	opel	NaN	2016-03-25 00:00:00	47638	2016-03-26 23:46:29
33295	2016-03-12 10:49:42	Ich_verkaufe/suche_fuer_sie_ihr_Auto!!	0	control	NaN	1910	NaN	NaN	5000	0	NaN	sonstige_autos	NaN	2016-03-12 00:00:00	37075	2016-03-19 07:15:34
42181	2016-03-27 19:50:53	SAMSUNG_55_3D_Tv_und_Soundbar_gegen_Auto	0	test	NaN	1910	NaN	NaN	5000	0	NaN	sonstige_autos	NaN	2016-03-27 00:00:00	57080	2016-04-06 01:15:30
45157	2016-03-11 22:37:01	Motorhaube	15	control	NaN	1910	NaN	NaN	5000	0	NaN	trabant	NaN	2016-03-11 00:00:00	90491	2016-03-25 11:18:57
46213	2016-04-02 13:47:16	Bellier_Vario	0	test	kleinwagen	1910	NaN	NaN	5000	1	other	sonstige_autos	NaN	2016-04-02 00:00:00	93105	2016-04-04 11:16:30

The majority of columns for these entries have missing values. The 'odometer_km' column looks realistic, with low values for almost all the cars, which is in line with the technology level of more than 100 years ago. The prices are low (up to 0) enough for just old cars but unexpectedly low for something considered a rarity. In any case, a significant proportion of missing values suggests that these rows don't represent particular interest for further analysis and can be removed.

Let's investigate 3 next early years from our value counts exploration: 1927, 1929, and 1931.

In [35]:

autos[autos['registration_year'] == 1927]

Out[35]:

	date_crawled	name	price_dollars	ab_test	vehicle_type	registration_year	gearbox	power_ps	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
21416	2016-03-12 08:36:21	Essex_super_six__Ford_A	16500	control	cabrio	1927	manual	40	andere	5000	5	benzine	ford	NaN	2016-03-12 00:00:00	74821	2016-03-15 12:45:12

In [36]:

autos[autos['registration_year'] == 1929]

Out[36]:

	date_crawled	name	price_dollars	ab_test	vehicle_type	registration_year	gearbox	power_ps	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
22101	2016-03-09 16:51:17	BMW_Andere	11500	test	cabrio	1929	manual	15	andere	5000	1	NaN	bmw	yes	2016-03-09 00:00:00	70569	2016-04-07 06:17:11

In [37]:

autos[autos['registration_year'] == 1931]

Out[37]:

	date_crawled	name	price_dollars	ab_test	vehicle_type	registration_year	gearbox	power_ps	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
11246	2016-03-26 19:49:59	Ford_Model_A_Roadster_Deluxe_1931	27500	control	cabrio	1931	manual	39	andere	10000	7	benzine	ford	no	2016-03-26 00:00:00	9322	2016-04-06 09:46:59

In these 3 rows there are almost no missing values, and all the other values look plausible: high prices for such rarities, cabrio type of vehicle, manual gearbox, low horsepower, low odometer values.

Hence, it seems reasonable to use the year of 1927 as the lowest acceptable value and 2016 (discussed earlier) as the highest one. All the values outside this interval have to be removed.

In [38]:

# Removing years outside the interval
autos = autos.loc[autos['registration_year'].between(1927, 2016)]

Let's calculate the distribution of frequencies of the remaining values.

In [39]:

autos['registration_year'].value_counts(normalize=True).sort_index()

Out[39]:

1927    0.000021
1929    0.000021
1931    0.000021
1934    0.000042
1937    0.000083
1938    0.000021
1939    0.000021
1941    0.000042
1943    0.000021
1948    0.000021
1950    0.000062
1951    0.000042
1952    0.000021
1953    0.000021
1954    0.000042
1955    0.000042
1956    0.000104
1957    0.000042
1958    0.000083
1959    0.000146
1960    0.000708
1961    0.000125
1962    0.000083
1963    0.000187
1964    0.000250
1965    0.000354
1966    0.000458
1967    0.000562
1968    0.000541
1969    0.000396
          ...   
1987    0.001562
1988    0.002957
1989    0.003769
1990    0.008226
1991    0.007414
1992    0.008143
1993    0.009267
1994    0.013745
1995    0.027344
1996    0.030072
1997    0.042234
1998    0.051085
1999    0.062456
2000    0.069849
2001    0.056291
2002    0.052751
2003    0.056791
2004    0.056999
2005    0.062789
2006    0.056396
2007    0.047982
2008    0.046462
2009    0.043692
2010    0.033258
2011    0.034029
2012    0.027552
2013    0.016785
2014    0.013870
2015    0.008309
2016    0.027406
Name: registration_year, Length: 77, dtype: float64

The frequencies for the older years are, naturally, very low. For more recent years the values are much higher, showing, however, some fluctuations from one year to another.

Exploring Top Brand Prices¶

Our next step is to explore average price variations across different car brands. To start, let's take a look at the unique brands available in our dataset.

In [40]:

autos['brand'].unique()

Out[40]:

array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler',
       'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos',
       'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan',
       'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda',
       'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda',
       'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada',
       'rover', 'daihatsu', 'lancia'], dtype=object)

Now let's have a look at number of cars (in % from the total number) by brand.

In [41]:

# Brands by percentage of the total number of cars
brands = autos['brand'].value_counts(normalize=True)
brands

Out[41]:

volkswagen        0.212170
bmw               0.110042
opel              0.108147
mercedes_benz     0.095360
audi              0.086405
ford              0.069807
renault           0.047336
peugeot           0.029531
fiat              0.025865
seat              0.018181
skoda             0.016036
mazda             0.015140
nissan            0.015099
citroen           0.013932
smart             0.013911
toyota            0.012474
sonstige_autos    0.010850
hyundai           0.009850
volvo             0.009247
mini              0.008643
mitsubishi        0.008143
honda             0.007851
kia               0.007102
alfa_romeo        0.006623
porsche           0.006102
suzuki            0.005914
chevrolet         0.005706
chrysler          0.003665
dacia             0.002562
daihatsu          0.002562
jeep              0.002249
subaru            0.002187
land_rover        0.002041
saab              0.001604
jaguar            0.001583
trabant           0.001541
daewoo            0.001499
rover             0.001354
lancia            0.001083
lada              0.000604
Name: brand, dtype: float64

In [42]:

# Finding brands with 5% and more of the total number of cars
top_brands = brands[brands > 0.05].index
top_brands

Out[42]:

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')

A few brands have over 5% of the total number of cars each. Let's calculate average price for them.

In [43]:

# Creating an empty dictionary
top_brands_average_prices = {}

# Looping over the top brands and assigning the average price to the dictionary
for brand in top_brands:
    top_brands_average_prices[brand] = autos[autos['brand'] == brand]['price_dollars'].mean()
    
top_brands_average_prices

Out[43]:

{'audi': 9093.65003615329,
 'bmw': 8334.645155185466,
 'ford': 7263.015811455847,
 'mercedes_benz': 8485.239571958942,
 'opel': 5254.537454265357,
 'volkswagen': 6516.457597173145}

Here we can notice that Volkswagen, being the most common brand in our dataset with a big gap from the rest, has relatively low average price among the other top brands. Audi, being the 4th common brand, has the highest average price among the top ones, while Opel the lowest, with a distinct difference between these two of 3,839$.

For the top 6 brands, we will analyze average mileage and if there is any visible link with average price.

In [44]:

# Converting the average price dictionary into a series object
prices_series = pd.Series(top_brands_average_prices)
prices_series

Out[44]:

audi             9093.650036
bmw              8334.645155
ford             7263.015811
mercedes_benz    8485.239572
opel             5254.537454
volkswagen       6516.457597
dtype: float64

In [45]:

# Creating an empty dictionary for average mileage values
top_brands_average_mileage = {}

# Looping over the top brands and assigning the average mileage to the dictionary
for brand in top_brands:
    top_brands_average_mileage[brand] = autos[autos['brand'] == brand]['odometer_km'].mean()
    
top_brands_average_mileage

Out[45]:

{'audi': 129287.78018799711,
 'bmw': 132434.70855412565,
 'ford': 124046.83770883054,
 'mercedes_benz': 130856.0821139987,
 'opel': 129228.76949740035,
 'volkswagen': 128730.36906164115}

In [46]:

# Converting the average mileage dictionary into a series object
mileage_series = pd.Series(top_brands_average_mileage)
mileage_series

Out[46]:

audi             129287.780188
bmw              132434.708554
ford             124046.837709
mercedes_benz    130856.082114
opel             129228.769497
volkswagen       128730.369062
dtype: float64

In [47]:

# Creating a single dataframe from both series objects
prices_mileage = pd.DataFrame(prices_series, columns=['average_price'])
prices_mileage['average_mileage'] = mileage_series
prices_mileage

Out[47]:

	average_price	average_mileage
audi	9093.650036	129287.780188
bmw	8334.645155	132434.708554
ford	7263.015811	124046.837709
mercedes_benz	8485.239572	130856.082114
opel	5254.537454	129228.769497
volkswagen	6516.457597	128730.369062

Logically, it would be expectable to find an inverse relation between average prices and average mileage. However, we observe just the opposite: the 3 brands with the highest average prices (Audi, Mercedes-Benz and BMW) have also the biggest average mileage values. On the other hand, the variation of mileage here is not significant (7% between extreme values), so this factor is not of much use in our case.

Finding the Most Common Brand/Model Combinations¶

Let's investigate the most common combinations of brands and models in our dataset. We can try to find, say, all the models represented by 1,500 and more cars.

In [48]:

# Finding combinations brand/model represented by 1,500 and more cars each
brand_model = autos.groupby('brand').model.value_counts().loc[lambda x : x>1500]
print(brand_model)                                                           

brand       model
bmw         3er      2688
opel        corsa    1644
volkswagen  golf     3815
            polo     1677
Name: model, dtype: int64

The most common brand / model combinations are: Volkswagen Golf, BMW 3er, Volkswagen Polo, and Opel Corsa. Not surprising that Volkswagen is leading here: from the previous analysis we remember that the brand itself is the most common one, counting 21% of all the cars in the dataset.

Analyzing Price Patterns Based on Mileage¶

Now we will investigate how mileage influences an average car price. For this purpose, it is convenient to split the 'odometer_km' values into groups, but first we have to check these values.

In [49]:

autos['odometer_km'].unique()

Out[49]:

array([150000,  70000,  50000,  80000,  10000,  30000, 125000,  90000,
        20000,  60000,   5000,  40000, 100000])

We can define the following 7 groups of values (in thousands km): 5-10, 20-30, 40-50, 60-70, 80-90, 100-125, and 150. Let's calculate average prices for them.

In [50]:

# Grouping by mileage and calculating average price for each group
odometer_5_10 = autos[autos['odometer_km'] <= 100000 ]['price_dollars'].mean()
odometer_20_30 = autos[(autos['odometer_km'] >= 20000) & (autos['odometer_km'] <= 30000)]['price_dollars'].mean()
odometer_40_50 = autos[(autos['odometer_km'] >= 40000) & (autos['odometer_km'] <= 50000)]['price_dollars'].mean()
odometer_60_70 = autos[(autos['odometer_km'] >= 60000) & (autos['odometer_km'] <= 70000)]['price_dollars'].mean()
odometer_80_90 = autos[(autos['odometer_km'] >= 80000) & (autos['odometer_km'] <= 90000)]['price_dollars'].mean()
odometer_100_125 = autos[(autos['odometer_km'] >= 100000) & (autos['odometer_km'] <= 125000)]['price_dollars'].mean()
odometer_150 = autos[autos['odometer_km'] >= 150000]['price_dollars'].mean()

print(odometer_5_10)
print(odometer_20_30)
print(odometer_40_50)
print(odometer_60_70)
print(odometer_80_90)
print(odometer_100_125)
print(odometer_150)

15720.130828692545
17174.599216710183
36374.77998894417
11532.898544520547
8903.513906856404
8193.318008204838
4513.776098237019

We see that average prices follow some clear patterns. The highest average price (36,375$) is related to the used cars with medium values of mileage (40,000 - 50,000 km). Such cars can have been in use approximately about 2-3 years. The cars with smaller values of mileage, even though, obviously, being less used and, theoretically, better preserved, are twice cheaper. It's not surprising: a car with small mileage (used for a short period, or even those on sale soon after being bought) can look less attractive for potential buyers, because it can have some hidden flaws or problems that are not easy to detect from the beginning, so the current owner can prefer to get rid of it. For the values of mileage higher than 40,000 - 50,000 km (that is, for the cars having been more in use and, hence, more worn), average prices decrease again, and the higher the mileage the lower the average price, up to 4,514$ for the cars that ran 150,000 km.

Comparing Prices of Damaged vs. Non-Damaged Cars¶

Our last step will be to find out how much cheaper are cars with damage than their non-damaged counterparts. First, let's see how many cars have unrepaired damages.

In [51]:

autos['unrepaired_damage'].value_counts(dropna=False)

Out[51]:

no     34255
NaN     8978
yes     4785
Name: unrepaired_damage, dtype: int64

We can colclude that at least 9.5% of all the cars (since we don't know the status of the numerous ones with null values) have unrepaired damages. Now we will calculate average prices of both categories.

In [52]:

# Calculating average prices for damaged and non-damaged cars
unrepaired = autos.groupby('unrepaired_damage')['price_dollars'].mean()
print(unrepaired)    

unrepaired_damage
no     8233.387360
yes    2335.379937
Name: price_dollars, dtype: float64

The difference of prices is significant: the damaged cars cost on average 3.5 times less than the non-damaged ones. For curiosity, let's have a look at the most expensive car among the damaged ones.

In [53]:

# Finding the most expensive car among the damaged ones
autos.loc[autos[autos['unrepaired_damage'] == 'yes']['price_dollars'].idxmax()]

Out[53]:

date_crawled          2016-03-29 11:38:54
name                  Volkswagen_Jetta_GT
price_dollars                      999990
ab_test                              test
vehicle_type                    limousine
registration_year                    1985
gearbox                            manual
power_ps                              111
model                               jetta
odometer_km                        150000
registration_month                     12
fuel_type                         benzine
brand                          volkswagen
unrepaired_damage                     yes
ad_created            2016-03-29 00:00:00
postal_code                         50997
last_seen             2016-03-29 11:38:54
Name: 37585, dtype: object

We see that classy cars are still quite expensive also among the damaged ones, even though, evidently, much cheaper than their non-damaged counterparts.

Conclusion¶

To summarize, in this project we have cleaned and explored a dataset of used cars from the German eBay website. Cleaning tasks included renaming columns according to the Python’s convention and giving them descriptive names, deleting non-informative columns, mapping German words to English, converting strings to numeric data types, removing outliers, correcting year range. Exploring the date columns, we found out that the ads were added on the website from June 2015 till April 2016 and crawled from March 2016 till April 2016. We identified the most common brands in our dataset, with Volkswagen at the first place with a big gap from all the others. Only 6 brands count more than 5% of the total number of cars each. For these top brands average prices and mileage values were calculated. We discovered that about 3% of cars in the dataset are free of charge, the most common brand / model combination is Volkswagen Golf, and that 65% of cars have the highest mileage (150,000 km). The mileage-based price patterns were analyzed. We found out that the highest average prices are related to the cars with average mileage of 40,000 – 50,000 km. For lower values of mileage, and especially for higher ones, average prices decrease. Lastly, we explored the difference of average prices between non-damaged and damaged cars and observed that the latter are on average 3.5 times cheaper.