Our personal goal is to familiarize ourselves with various data cleaning tasks and workflows using numpy and pandas.
As for the project itself, we will be trying to clean the data set and do some preliminary analysis. The data we will be using for this project are ads for used cars from the classifieds section of the German eBay website.
The data dictionary for our data is as follows:
Variable Name | Description |
---|---|
dateCrawled | when this ad was first crawled (all field values are taken from this date) |
name | name of the car |
seller | whether the seller is private or a dealer |
offerType | the type of listing |
price | the price on the ad to sell the car |
abtest | whether the listing is included in an A/B test |
vehicleType | the vehicle type |
yearOfRegistration | the year in which the car was first registered |
gearbox | the transmission type |
powerPS | the power of the car in PS |
model | the car model name |
kilometer | how many kilometers the car has driven |
monthOfRegistration | the month in which the car was first registered |
fuelType | what type of fuel the car uses |
brand | the brand of the car |
notRepairedDamage | if the car has damage which is not yet repaired |
dateCreated | the date on which the eBay listing was created |
nrOfPictures | the number of pictures in the ad |
postalCode | the postal code for the location of the vehicle |
lastSeenOnline | when the crawler saw this ad last online |
We will start by importing the pandas and numpy libraries, as well as loading the data set.
# importing libraries
import pandas as pd
import numpy as np
# loading the file
autos = pd.read_csv('autos.csv', encoding='Latin-1')
# Checking what our data set looks like
# This cell also shows some features of Jupyter Notebook in displaying dataframes.
autos
Let's also have a quick look at the shape and data types in our data set using the info() and head() methods.
autos.info()
autos.head(3)
Since the column names are currently in camelCase, let's rename them to follow the snake_case convention. Dataquest also suggests renaming some specific columns, such as notRepairedDamage to unrepaired_damage, among others.
We'll do this manually since we're not yet experts in regular expressions and advanced string manipulation.
# Checking the column names
print(autos.columns)
# Renaming the columns
new_columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price',
'ab_test', 'vehicle_type', 'registration_year',
'gear_box', 'power_ps', 'model', 'odometer',
'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_of_pictures',
'postal_code', 'last_seen'
]
autos.columns = new_columns
# Checking the new column names by printing the first few rows of our dataframe
autos.head(2)
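As an aside, a regex-based rename is possible once we're more comfortable with regular expressions. Below is a minimal sketch of a hypothetical helper that converts camelCase to snake_case; note it would not reproduce the custom names we chose above (such as odometer or ad_created).
# hypothetical helper, not used in this project
import re

def to_snake_case(name):
    # insert an underscore before each capital letter (except at the start), then lowercase
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

# example: to_snake_case('yearOfRegistration') returns 'year_of_registration'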
Let's look at the characteristics of our data set so we can start figuring out what data cleaning tasks we need to do. We use the describe() method and specify the parameter include='all' so that non-numeric columns are included.
# Checking column descriptions/summaries
autos.describe(include='all')
Some quick observations here:
- almost all values in the seller column are tagged as 'privat'
- almost all values in the offer_type column are tagged as 'Angebot'
- all values in the nr_of_pictures column appear to be zero (we can check this by using the value_counts() method later)

We'll consider dropping these three columns, but let's check the nr_of_pictures column more closely first.
# Checking value counts of the three columns being considered for deletion.
# We're most interested in the the nr_of_pictures column
print(autos['nr_of_pictures'].value_counts())
print('\n')
print(autos['seller'].value_counts())
print('\n')
print(autos['offer_type'].value_counts())
The previous cell confirms that all values in the nr_of_pictures column are zero. Also, there isn't really any significant variation in the values contained in the seller and offer_type columns, with only one observation in each having a different value. We'll drop those two as well.
# Dropping the three columns
autos.drop(['seller', 'offer_type', 'nr_of_pictures'], axis=1, inplace=True)
# Checking if they are really dropped
print(autos.info())
Some other observations from checking the info of our dataframe:
- date_crawled, odometer, price, ad_created, and last_seen are stored as strings (we may want to convert these to dates and integers, as appropriate)
- registration_year and registration_month are stored as integers (we may want to convert these to datetime format)

Hence, these are our tasks for further cleaning the data:
1. Convert odometer and price to numeric format.
2. Convert the date_crawled, ad_created, last_seen, registration_year, and registration_month columns to datetime format.

Since item no. 1 should be so much easier compared to item no. 2, let's start with that.
# Checking what the values in price and odometer columns look like
print(autos['price'].unique())
print(autos['odometer'].unique())
It appears that for the price column, we need to remove the leading $ sign and the commas. For the odometer column, we need to remove the trailing km indicator and also the commas.
# Converting the two columns to numeric
autos['price'] = (autos['price'].str.replace(',','')
.str.replace('$','')
.astype(int)
)
autos['odometer'] = (autos['odometer'].str.replace(',','')
.str.replace('km','')
.astype(int)
)
# Renaming odometer column
autos.rename({'odometer':'odometer_km'}, axis=1, inplace=True)
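One note on str.replace: depending on the pandas version, the pattern may be interpreted as a regular expression by default, and the $ sign is a regex special character. If that ever causes trouble, passing regex=False (where supported) makes the literal-match intent explicit. A commented-out sketch of the equivalent call:
# equivalent conversion with the literal-match intent made explicit (sketch only)
# autos['price'] = (autos['price'].str.replace('$', '', regex=False)
#                                 .str.replace(',', '', regex=False)
#                                 .astype(int))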
After converting price and odometer to numeric, let's begin exploring these two columns.
Dataquest recommends the following steps:
- Series.unique().shape to see how many unique values there are
- Series.describe() to view min/max/median/mean, etc.
- Series.value_counts() to see frequency distributions

Since those are a lot of steps, let's define a function that will do all those things for us.
# Defining a function to make our examination of the series faster
def explore(series):
    unique_values = series.unique().shape[0]
    print('The series "{s}" has {u} unique values'
          .format(s=series.name, u=unique_values))
    print('\n')
    print('Description')
    print(series.describe())
    print('\n')
    print('Value counts top')
    print(series.value_counts()
          .sort_index(ascending=False)
          .head(10))
    print('\n')
    print('Value counts bottom')
    print(series.value_counts()
          .sort_index(ascending=True)
          .head(10))
# Using the explore function to check price and odometer_km columns
explore(autos['price'])
print('\n')
explore(autos['odometer_km'])
For the price column, we remove observations with zero prices (lower outliers) and also those that cost more than $900,000 (upper outliers).
Note that the cars priced above $900,000 may still be valid data points. However, it is likely that these very high prices are typographical errors or items that are just being advertised but not really for sale. In any case, these extremely high values (and there are only a few of them) might disproportionately affect our observations, so we will drop them for now.
# Removing zero prices and overpriced observations
autos = autos[autos['price'].between(1,900000)]
#autos = autos.drop(autos[(autos['price'] == 0)
# | (autos['price'] > 900000)].index)
explore(autos['price'])
After dropping the outliers, the number of unique prices has dropped from 2,357 to 2,346. Our total number of observations also dropped from 50,000 to 48,565. We only lost a small number of observations, so this is acceptable.
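As a quick sanity check on those numbers, we can print the remaining row count directly:
# sanity check: rows remaining after the price filter
print(autos.shape[0])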
For the odometer_km column, there are no obvious outliers since the minimum and maximum values appear reasonable enough. We won't be dropping anything for this column.
Recall that we have some columns that give us datetime-related information. These are the date_crawled, registration_year, registration_month, ad_created, and last_seen columns. Some of these columns are stored as strings. Let's check these columns again, clean them as necessary, and try out some preliminary analytical tasks.
date_columns = ['date_crawled', 'registration_year', 'registration_month', 'ad_created', 'last_seen']
print(autos[date_columns].info())
autos[date_columns][:4]
The columns registration_year and registration_month are in int64 format. For the purpose of our analysis, there is no need to convert these to datetime format. We may consider, however, combining the two into a single variable that contains month and year information if we want to study trends over several years.
For the other time-related columns, let's convert them to datetime format. While we're not interested in the exact time values contained in the columns, there may be a use for those in later analysis. What we'll do, instead, is create a new dataframe with columns that are of datetime format but contain only date (and not time) information. We create a new dataframe because adding new columns to our autos dataframe appears to give us a SettingWithCopyWarning.
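As an aside, that warning typically appears because autos is itself a filtered slice of the original dataframe. A common alternative, shown here only as a commented-out sketch and not the approach we take below, is to take an explicit copy before adding columns:
# sketch of an alternative approach: copy first, then add columns directly
# autos = autos.copy()
# autos['date_crawled_dt'] = pd.to_datetime(autos['date_crawled'].str[:10])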
# importing datetime library
import datetime as dt
# creating function for conversion
date_format = '%Y-%m-%d'
def convert_dt(series):
    # keep only the date part (the first 10 characters, 'YYYY-MM-DD')
    date = series.str[:10]
    date_dt = pd.to_datetime(date, format=date_format)
    return date_dt
# Creating an empty dataframe
autos_copy = pd.DataFrame()
# Adding columns to our new dataframe
autos_copy['date_crawled_dt'] = convert_dt(autos['date_crawled'])
autos_copy['ad_created_dt'] = convert_dt(autos['ad_created'])
autos_copy['last_seen_dt'] = convert_dt(autos['last_seen'])
Now we check the distributions of the three date columns, sorted by earliest date to latest date.
# Date crawled
print('date_crawled')
print(autos_copy['date_crawled_dt'].
value_counts(normalize=True, dropna=False).
sort_index()
)
print('\n')
# Ad created
print('ad_created')
print(autos_copy['ad_created_dt'].
value_counts(normalize=True, dropna=False).
sort_index()
)
print('\n')
# Last seen
print('last_seen')
print(autos_copy['last_seen_dt'].
value_counts(normalize=True, dropna=False).
sort_index()
)
In terms of frequency distribution, there is no pattern that really pops out. Looking at the date_crawled_dt column, we learn that the crawler collected this data over a span of around a month in March/April of 2016.
The oldest ad seen by the crawler, as shown by the ad_created_dt column, was from June 2015, while the latest one (April 4, 2016) coincides with the last day the crawler collected data.
Not much can be said about the last_seen column, other than that over 40% of the ads were still seen by the crawler during its last three days (April 5-7) of crawling.
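If we want to verify that figure, one quick check (a sketch) is to sum the normalized counts for the last three crawl dates:
# rough check: share of ads last seen on the final three crawl dates
last_three = (autos_copy['last_seen_dt']
              .value_counts(normalize=True)
              .sort_index()
              .tail(3))
print(last_three.sum())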
It's now time to look at registration_year and registration_month. Again, we have not created a new series that combines these two. Let's explore them separately for now.
# Checking the registration year and month information
print(autos[['registration_year', 'registration_month']].describe())
print('\n')
# Checking frequency distributions
print(autos['registration_year'].value_counts(dropna=False).head(5))
print(autos['registration_year'].value_counts(dropna=False).tail(5))
print('\n')
# Checking frequency distribution of registration month
print(autos['registration_month'].value_counts(dropna=False))
Discussion of the observations from the above results is in the next section.
Upon checking the results from using the describe method for registration_year, we see some outliers that don't make sense. The minimum is 1,000 while the maximum is 9,999. The lower outlier doesn't make sense because the year is far too early, while the higher outliers are years that haven't occurred yet.
We may want to remove these using the following criteria:
- drop registration years earlier than 1900 (lower bound)
- drop registration years later than 2016 (upper bound)

The reason for the lower-bound criterion is that mass production of vehicles only really started around the 1900s. The reason for the upper-bound criterion is that the crawler couldn't possibly have detected vehicles that were registered in the future (the crawler's last day in our data set is in April 2016).
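Before dropping anything, we can check how many rows fall outside that range to gauge the impact (a small sketch):
# counting rows with registration years outside the 1900-2016 range
outside_range = ~autos['registration_year'].between(1900, 2016)
print(outside_range.sum())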
As for registration_month, we see that March is the most frequent month of registration for the vehicles in our data set. However, we do see that there are vehicles identified as being registered in month zero. We have several options to deal with this: (1) drop the rows where the registration month is zero, (2) convert the zero months to 1 (January), or (3) convert the zero months to missing (NaN).
In doing the steps above, we need to clean the year data first, because it's likely that the zero-month rows coincide with the outlier years where the data is questionable. For the rows that remain after dropping the outlier years, we will then convert the zero months to missing.
# Removing outlier registration years
autos = autos[autos['registration_year'].between(1900,2016)]
# Checking data and months
print('registration year')
print(autos['registration_year'].describe())
print('\n')
print('registration month')
print(autos['registration_month'].value_counts())
We dropped around 2,000 observations after removing the outliers. The registration years look okay now, with the earliest observation tagged as 1910.
Upon checking the registration month, there are still months listed as zero. We'll assume that these months are 'unknown' and just tag them as missing. This way our frequency distributions for registration months can still be analyzed, while the zeros are removed so they don't affect any numerical computations or aggregations.
Note that if we want to create a new datetime column, we may want to go with the January conversion instead, although this will inflate the numbers for January registrations, hence making any trend analysis tenuous.
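If we ever do want that combined year/month column, a minimal sketch (not used in this project, and assuming the January fallback for zero or missing months) could look like this:
# sketch: assemble a registration date from year and month,
# falling back to January where the month is zero or missing
month_filled = autos['registration_month'].replace(0, 1).fillna(1).astype(int)
reg_parts = pd.DataFrame({'year': autos['registration_year'],
                          'month': month_filled,
                          'day': 1})
registration_date = pd.to_datetime(reg_parts)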
# replacing zero months with NaN (missing)
zero_month = (autos['registration_month'] == 0)
autos.loc[zero_month, 'registration_month'] = np.nan
# re-checking the registration month column
print(autos['registration_month'].value_counts())
Let's now check how price patterns vary across brands. Let's start by looking at the unique brands in our data set.
print('Unique brands: ', len(autos['brand'].unique()))
autos['brand'].value_counts(normalize=True)
We have over 40 brands of automobiles in our data set, with Volkswagen being the most common brand. That's a lot of brands, so we'll perform some aggregation for all of them, since it is fairly simple to do using loops, but we'll only delve deeper into the most common brands (those that account for at least 5% of listings).
# Creating a dictionary of the average price per brand
mean_price_brand = {}
brands = autos['brand'].unique()
for b in brands:
    brand = autos[autos['brand'] == b]
    mean_price = brand['price'].mean()
    mean_price_brand[b] = int(mean_price)
mean_price_brand
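For reference, pandas can do the same per-brand aggregation without an explicit loop; a one-line groupby sketch that should give equivalent results:
# sketch: the same mean price per brand via groupby (truncated to int like the loop above)
mean_price_by_brand = autos.groupby('brand')['price'].mean().astype(int)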
It turns out that analyzing the dictionary with all the brands is quite difficult for us at our current skill level. Let's do the aggregation again, this time only for the brands that account for at least 5% of the total number of observations.
# Getting the top brands
brand_share = autos['brand'].value_counts(normalize=True)
top_brands = brand_share[brand_share > 0.05].index
mean_price_top_brands = {}
for b in top_brands:
    brand = autos[autos['brand'] == b]
    mean_price = brand['price'].mean()
    mean_price_top_brands[b] = int(mean_price)
mean_price_top_brands
We can convert dictionaries to series and then to dataframes using some pandas functionality, which will help us analyze the aggregated data much more easily.
# Converting a couple of dictionaries we've already created
mean_price_all = pd.Series(mean_price_brand)
mean_price_top = pd.Series(mean_price_top_brands)
# Converting the series into dataframes
mean_price_all_df = pd.DataFrame(mean_price_all, columns=['mean_price'])
mean_price_top_df = pd.DataFrame(mean_price_top, columns=['mean_price'])
mean_price_all_df.sort_values('mean_price', ascending=False)
mean_price_top_df.sort_values('mean_price', ascending=False)
When accounting for all the brands, we see that the most expensive brand, on the average, is Porsche with Land Rover a distant second. The cheapest brand, on the average, is Daewoo.
When we consider only the top brands (those which account for >5% of total observations), Audi cars are the most expensive with a mean price of 9,366 while Opel is the cheapest with a mean price of 2,975.
We see from the top brands that Audi, Mercedes Benz, and BMW are relatively more expensive (average prices greater than 8,000) while Ford and Opel are very cheap by comparison (average prices less than 4,000).
It may be the brands themselves creating this price differential. However, it's also possible that the price differentials are the result of different average mileage or age of the vehicles. It may be the case that certain brands have, on the average, more mileage or are older, hence the lower average price.
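One rough way to probe this idea before aggregating (a sketch, not part of the guided steps) is to look at the pairwise correlations between price, mileage, and registration year:
# rough check: pairwise correlations between price, mileage, and registration year
print(autos[['price', 'odometer_km', 'registration_year']].corr())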
Let's check whether our guess is correct by aggregating mileage (odometer_km) and registration years (registration_year) by brand, then appending them to a single dataframe along with the mean prices we've already generated. This time, we limit our analysis to the top brands.
# Creating a copy dataframe for our aggregations
aggregate_df = mean_price_top_df.copy()
# For mileage
mean_mileage_brands = {}
for b in top_brands:
    brand = autos[autos['brand'] == b]
    mean_mileage = brand['odometer_km'].mean()
    mean_mileage_brands[b] = int(mean_mileage)
mean_mileage_series = pd.Series(mean_mileage_brands)
aggregate_df['mean_mileage'] = mean_mileage_series
# For registration year
mean_reg_year_brands = {}
for b in top_brands:
    brand = autos[autos['brand'] == b]
    mean_reg_year = brand['registration_year'].mean()
    mean_reg_year_brands[b] = int(mean_reg_year)
mean_reg_year_series = pd.Series(mean_reg_year_brands)
aggregate_df['mean_reg_year'] = mean_reg_year_series
# Displaying our new dataframe
aggregate_df.sort_values('mean_price', ascending=False)
From the table above, we see that our initial guess as to the relationship between price and mileage for the top brands may be incorrect. In fact, Audi cars have more mileage (129,157 km), on the average, than Ford cars (124,266 km), and only slightly less than the much cheaper Opel cars (129,310 km).
However, we also see that the Audis may be newer cars, with an average registration year of 2004 compared to Ford and Opel's 2002. We cannot make any definitive conclusions for now, especially since Mercedes Benz cars are, on the average, just as old as the cheaper brands but are still more expensive. We could expand this analysis to include all the brands in our data set, but we'll settle for this for now.
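If we do extend the comparison to every brand later, a named-aggregation sketch (groupby/agg, available in newer pandas versions) could build the whole table in one step:
# sketch: the same aggregate table for all brands via groupby named aggregation
agg_all_brands = (autos.groupby('brand')
                       .agg(mean_price=('price', 'mean'),
                            mean_mileage=('odometer_km', 'mean'),
                            mean_reg_year=('registration_year', 'mean'))
                       .sort_values('mean_price', ascending=False))
agg_all_brands.head(10)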
For this project, we were able to practice a lot of skills:
- renaming and reorganizing dataframe columns
- converting string columns to numeric types
- identifying and removing outliers
- working with date and datetime information
- aggregating data across categories using loops, dictionaries, series, and dataframes
The skills we practiced for this project will prove to be useful in future data science tasks.