In this project, our aim is to analyse a dataset of used cars from eBay Kleinanzeigen (a classifieds section of the German eBay website).
We will first step through the data cleansing process, as the dataset itself needs extensive cleansing and this is a significant part of the project.
Finally, after cleansing, we will identify the top brands, compute their average prices, and compare those against average mileage to draw inferences.
For the purposes of this project, the dataset has been limited to 50,000 rows.
The full dataset can be downloaded from here.
Below we can find the metadata about this dataset:
- `dateCrawled` - When this ad was first crawled. All field values are taken from this date.
- `name` - Name of the car.
- `seller` - Whether the seller is private or a dealer.
- `offerType` - The type of listing.
- `price` - The price on the ad to sell the car.
- `abtest` - Whether the listing is included in an A/B test.
- `vehicleType` - The vehicle type.
- `yearOfRegistration` - The year in which the car was first registered.
- `gearbox` - The transmission type.
- `powerPS` - The power of the car in PS (metric horsepower).
- `model` - The car model name.
- `kilometer` - How many kilometers the car has driven.
- `monthOfRegistration` - The month in which the car was first registered.
- `fuelType` - What type of fuel the car uses.
- `brand` - The brand of the car.
- `notRepairedDamage` - Whether the car has damage which is not yet repaired.
- `dateCreated` - The date on which the eBay listing was created.
- `nrOfPictures` - The number of pictures in the ad.
- `postalCode` - The postal code for the location of the vehicle.
- `lastSeenOnline` - When the crawler last saw this ad online.
We will start by importing the pandas and NumPy libraries and reading our CSV data.
import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv", encoding="Latin-1")
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
From the initial exploration above, we can observe the following:

- The `seller`, `nrOfPictures`, and `offerType` columns have a single value for 99.99% of the rows and can safely be ignored.
- `yearOfRegistration` has some odd values, suggested by the min being 1000 and the max being 9999; this needs further investigation.
- `monthOfRegistration` also has invalid values, suggested by the min being 0; this needs further investigation.
- The `powerPS` column is 0 for about 11% of the rows, which is invalid for that column.
- The `fuelType` column is NaN for about 10% of the rows, which we will want to exclude from any analysis based on that column.
- The `notRepairedDamage` column is NaN for about 20% of the rows, which we will want to exclude from any analysis based on that column.
- `price` and `odometer` should ideally be numeric for our analysis.
- `dateCrawled`, `dateCreated`, and `lastSeen` should be date types.

Additionally, the column names use camelCase instead of the preferred snake_case, so we will rename them.
new_columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'pictures_count', 'postal_code',
'last_seen']
autos.columns = new_columns
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | pictures_count | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
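As an aside, the camelCase-to-snake_case conversion could also be done with a small regex helper instead of hard-coding the new names. A minimal sketch (note that acronym-heavy names such as `powerPS` would come out as `power_p_s`, which is one reason a hand-written list can be preferable):

```python
import re

def camel_to_snake(name):
    # Insert "_" before every capital letter (except at the start), then lowercase
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

print(camel_to_snake("yearOfRegistration"))  # year_of_registration
print(camel_to_snake("dateCrawled"))         # date_crawled
```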
Next, we will describe the statistics of the data across all columns so that we can:

- drop any text columns where almost all values are the same
- convert any columns whose numeric data is stored as text
autos.describe(include="all")
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | pictures_count | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-16 21:50:53 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
From the above table, we can infer the following:

- `seller`, `offer_type`, and `pictures_count` hold the same value for almost the entire dataset and hence can safely be ignored.
- `price` and `odometer` are numeric values stored as text along with their units.

Let's now convert the `price` and `odometer` columns from text to numeric.
print("="*5,"Odometer","="*5)
print(autos["odometer"].value_counts())
print('\n')
print("="*5,"Price","="*5)
print(autos["price"].value_counts())
===== Odometer ===== 150,000km 32424 125,000km 5170 100,000km 2169 90,000km 1757 80,000km 1436 70,000km 1230 60,000km 1164 50,000km 1027 5,000km 967 40,000km 819 30,000km 789 20,000km 784 10,000km 264 Name: odometer, dtype: int64 ===== Price ===== $0 1421 $500 781 $1,500 734 $2,500 643 $1,000 639 ... $23,790 1 $9,655 1 $29,445 1 $29,699 1 $35,800 1 Name: price, Length: 2357, dtype: int64
First, we need to remove the non-numeric characters as seen above on both of these columns and then we can convert to a numeric dtype.
# Remove the non-numeric characters, then convert to a numeric dtype
# (regex=False treats "$" and "," as literal characters rather than regex patterns)
autos["odometer"] = (autos["odometer"]
                     .str.replace(",", "", regex=False)
                     .str.replace("km", "", regex=False)
                     .astype(int))
autos["price"] = (autos["price"]
                  .str.replace("$", "", regex=False)
                  .str.replace(",", "", regex=False)
                  .astype(int))
# Rename the column so that the metric is evident
autos.rename(columns = {"odometer":"odometer_km"}, inplace = True)
autos.describe()
price | registration_year | power_ps | odometer_km | registration_month | pictures_count | postal_code | |
---|---|---|---|---|---|---|---|
count | 5.000000e+04 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.0 | 50000.000000 |
mean | 9.840044e+03 | 2005.073280 | 116.355920 | 125732.700000 | 5.723360 | 0.0 | 50813.627300 |
std | 4.811044e+05 | 105.712813 | 209.216627 | 40042.211706 | 3.711984 | 0.0 | 25779.747957 |
min | 0.000000e+00 | 1000.000000 | 0.000000 | 5000.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1.100000e+03 | 1999.000000 | 70.000000 | 125000.000000 | 3.000000 | 0.0 | 30451.000000 |
50% | 2.950000e+03 | 2003.000000 | 105.000000 | 150000.000000 | 6.000000 | 0.0 | 49577.000000 |
75% | 7.200000e+03 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 0.0 | 71540.000000 |
max | 1.000000e+08 | 9999.000000 | 17700.000000 | 150000.000000 | 12.000000 | 0.0 | 99998.000000 |
We have ensured that the `odometer` and `price` columns are numeric.
Now we will look deeper at the values in these columns to see whether any outliers skew the statistics, and remove them.
print("="*5,"Top 20 - High price listing","="*5)
print(autos["price"].value_counts().sort_index(ascending = False).head(20))
print('\n')
print("="*5,"Bottom 20 - Low price listing","="*5)
print(autos["price"].value_counts().sort_index(ascending = True).head(20))
print('\n')
print("="*5,"Statistics of low prices","="*5)
print(autos.loc[autos["price"] < 100, "price"].describe())
print('\n')
print("="*5,"Distribution % of high prices","="*5)
print(autos.loc[autos["price"].between(350000,99999999),"price"].value_counts(normalize=True, dropna=False))
print('\n')
print("="*5,"Listing 350,000 and above","="*5)
autos.loc[autos["price"] >= 350000,["name","brand","vehicle_type","price"]]
===== Top 20 - High price listing ===== 99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 Name: price, dtype: int64 ===== Bottom 20 - Low price listing ===== 0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 Name: price, dtype: int64 ===== Statistics of low prices ===== count 1762.000000 mean 5.632804 std 18.374263 min 0.000000 25% 0.000000 50% 0.000000 75% 0.000000 max 99.000000 Name: price, dtype: float64 ===== Distribution % of high prices ===== 12345678 0.200000 999999 0.133333 11111111 0.133333 999990 0.066667 3890000 0.066667 27322222 0.066667 1234566 0.066667 350000 0.066667 99999999 0.066667 10000000 0.066667 1300000 0.066667 Name: price, dtype: float64 ===== Listing 350,000 and above =====
name | brand | vehicle_type | price | |
---|---|---|---|---|
514 | Ford_Focus_Turnier_1.6_16V_Style | ford | kombi | 999999 |
2897 | Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000 | ford | limousine | 11111111 |
7814 | Ferrari_F40 | sonstige_autos | coupe | 1300000 |
11137 | suche_maserati_3200_gt_Zustand_unwichtig_laufe... | sonstige_autos | coupe | 10000000 |
22947 | Bmw_530d_zum_ausschlachten | bmw | kombi | 1234566 |
24384 | Schlachte_Golf_3_gt_tdi | volkswagen | NaN | 11111111 |
27371 | Fiat_Punto | fiat | NaN | 12345678 |
36818 | Porsche_991 | porsche | coupe | 350000 |
37585 | Volkswagen_Jetta_GT | volkswagen | limousine | 999990 |
39377 | Tausche_volvo_v40_gegen_van | volvo | NaN | 12345678 |
39705 | Tausch_gegen_gleichwertiges | mercedes_benz | limousine | 99999999 |
42221 | Leasinguebernahme | citroen | limousine | 27322222 |
43049 | 2_VW_Busse_T3 | volkswagen | bus | 999999 |
47598 | Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar... | opel | limousine | 12345678 |
47634 | Ferrari_FXX | sonstige_autos | coupe | 3890000 |
Based on the display of the `price` column above, we see a few astronomically high prices and a lot of listings with prices that are minuscule, both of which would skew our results.
On closer inspection of the high-priced listings, a few are true vintage or high-performance cars (such as the Ferrari F40) that can command such prices, while others are clearly data-entry mistakes (such as the Fiat Punto, Escort MK1, or Maserati 3200 GT).
Similarly, the minuscule prices (between 0 and 100) likely represent scrap value rather than a usable car.
Since a large part of both groups are mistakes, we will remove any listing priced above 350,000 and, similarly, any listing priced below 100.
Let's now look at the odometer values and see if there are any outliers.
print("Number of unique odometer values: ",autos["odometer_km"].unique().shape[0])
autos["odometer_km"].value_counts().sort_index(ascending = False)
Number of unique odometer values: 13
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 40000 819 30000 789 20000 784 10000 264 5000 967 Name: odometer_km, dtype: int64
From the display of `odometer_km` data above, we can infer two things:

- the values are rounded to a fixed set of 13 buckets, suggesting sellers chose from preset options rather than entering exact readings
- roughly two-thirds of the listings sit at 150,000 km, so the field is likely capped there and most cars are high-mileage

These values look plausible rather than erroneous, so we will keep them all, remove the outliers based only on price, and then display the statistics of our cleaned dataset.
autos = autos.loc[autos["price"].between(100,350000)]
autos.describe()
price | registration_year | power_ps | odometer_km | registration_month | pictures_count | postal_code | |
---|---|---|---|---|---|---|---|
count | 48224.000000 | 48224.000000 | 48224.000000 | 48224.000000 | 48224.000000 | 48224.0 | 48224.000000 |
mean | 5930.371433 | 2004.730964 | 117.677609 | 125919.148142 | 5.801634 | 0.0 | 50987.919729 |
std | 9078.372762 | 87.897388 | 201.206304 | 39543.339640 | 3.676976 | 0.0 | 25737.119986 |
min | 100.000000 | 1000.000000 | 0.000000 | 5000.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1250.000000 | 1999.000000 | 73.000000 | 125000.000000 | 3.000000 | 0.0 | 30823.000000 |
50% | 3000.000000 | 2004.000000 | 107.000000 | 150000.000000 | 6.000000 | 0.0 | 49716.000000 |
75% | 7499.000000 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 0.0 | 71666.750000 |
max | 350000.000000 | 9999.000000 | 17700.000000 | 150000.000000 | 12.000000 | 0.0 | 99998.000000 |
Our dataset has 5 columns that represent date values. Some of these columns were created by the crawler, and some came from the website itself. We can differentiate them by referring to the data dictionary:

- `date_crawled`: added by the crawler
- `last_seen`: added by the crawler
- `ad_created`: from the website
- `registration_month`: from the website
- `registration_year`: from the website

We see that the `date_crawled`, `last_seen`, and `ad_created` columns are identified by pandas as string values, so we will first look at the string format in these columns.
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
We see that in these 3 columns the first 10 characters represent the date (in the format Year-Month-Day).
Let's use this date portion to understand the distribution of data across these columns.
print("="*5,"Distribution of date_crawled column","="*5)
print(autos["date_crawled"].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending = True))
===== Distribution of date_crawled column ===== 2016-03-05 0.025361 2016-03-06 0.014039 2016-03-07 0.036061 2016-03-08 0.033179 2016-03-09 0.033013 2016-03-10 0.032287 2016-03-11 0.032598 2016-03-12 0.036911 2016-03-13 0.015677 2016-03-14 0.036662 2016-03-15 0.034319 2016-03-16 0.029467 2016-03-17 0.031499 2016-03-18 0.012898 2016-03-19 0.034734 2016-03-20 0.037803 2016-03-21 0.037201 2016-03-22 0.032888 2016-03-23 0.032287 2016-03-24 0.029446 2016-03-25 0.031499 2016-03-26 0.032308 2016-03-27 0.031126 2016-03-28 0.034962 2016-03-29 0.034112 2016-03-30 0.033738 2016-03-31 0.031851 2016-04-01 0.033697 2016-04-02 0.035605 2016-04-03 0.038611 2016-04-04 0.036538 2016-04-05 0.013064 2016-04-06 0.003173 2016-04-07 0.001389 Name: date_crawled, dtype: float64
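Incidentally, these string columns could also be converted to proper datetimes with `pd.to_datetime`, after which the date part is available via the `.dt` accessor instead of string slicing. A sketch on sample values in the same format:

```python
import pandas as pd

# Sample values matching the format of date_crawled / ad_created / last_seen
s = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56", "2016-03-26 00:00:00"])
dates = pd.to_datetime(s)

# .dt.date exposes the date part directly, no str[:10] needed
print(dates.dt.date.value_counts())
```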
print("="*5,"Distribution of Top 5 most recent ad_created","="*5)
print(autos["ad_created"].str[:10].value_counts(normalize=True).sort_index(ascending=False).head(5))
print("\n")
print("="*5,"Distribution of Bottom 5 most old ad_created","="*5)
print(autos["ad_created"].str[:10].value_counts(normalize=True).sort_index(ascending=True).head(5))
===== Distribution of Top 5 most recent ad_created ===== 2016-04-07 0.001244 2016-04-06 0.003256 2016-04-05 0.011799 2016-04-04 0.036890 2016-04-03 0.038860 Name: ad_created, dtype: float64 ===== Distribution of Bottom 5 most old ad_created ===== 2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 Name: ad_created, dtype: float64
From the exploration of the `ad_created` column, we observe the following:

- as with the `date_crawled` column, most of the data falls between 9th March and 5th April 2016
- a few ads were created up to a year before the `date_crawled` range

print("="*5,"Distribution of last_seen column","="*5)
autos["last_seen"].str[:10].value_counts(normalize=True, ascending=False)
===== Distribution of last_seen column =====
2016-04-06 0.221964 2016-04-07 0.132154 2016-04-05 0.125062 2016-03-17 0.028098 2016-04-03 0.025133 2016-04-02 0.024884 2016-03-30 0.024697 2016-04-04 0.024531 2016-03-31 0.023826 2016-03-12 0.023785 2016-04-01 0.022852 2016-03-29 0.022292 2016-03-22 0.021359 2016-03-28 0.020840 2016-03-20 0.020654 2016-03-21 0.020550 2016-03-24 0.019762 2016-03-25 0.019098 2016-03-23 0.018580 2016-03-26 0.016672 2016-03-16 0.016444 2016-03-15 0.015863 2016-03-19 0.015760 2016-03-27 0.015552 2016-03-14 0.012629 2016-03-11 0.012400 2016-03-10 0.010638 2016-03-09 0.009580 2016-03-13 0.008875 2016-03-18 0.007320 2016-03-08 0.007320 2016-03-07 0.005433 2016-03-06 0.004313 2016-03-05 0.001078 Name: last_seen, dtype: float64
From the distribution of the `last_seen` column:

- as with the `date_crawled` column, the dates fall between 5th March and 7th April 2016

Let's now look at the `registration_year` column. Here we are dealing only with the year part of a date.
It's important to understand the data and any outliers in this column, as it indicates the registration year of the listed car, which in turn is useful for any analysis based on a car's age.
# To understand the statistics of the data behind the series
print(autos["registration_year"].describe(),'\n')
# To understand the data distribution percentage on the column
print(autos["registration_year"].value_counts(normalize=True).sort_index(ascending=True).head(5))
print(autos["registration_year"].value_counts(normalize=True).sort_index(ascending=True).tail(5))
count 48224.000000 mean 2004.730964 std 87.897388 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64 1000 0.000021 1001 0.000021 1111 0.000021 1800 0.000041 1910 0.000041 Name: registration_year, dtype: float64 5911 0.000021 6200 0.000021 8888 0.000021 9000 0.000021 9999 0.000062 Name: registration_year, dtype: float64
From the statistics on `registration_year`, it is clear that some of the values are not valid years, as evident from the min (1000) and max (9999).
The distribution also shows only a handful of listings with years before 1900 or after 2016, each accounting for a tiny fraction of the data.
We will assume that some genuine vintage cars are present in our listings, so we will treat the range 1900 to 2016 as valid and anything outside it as incorrect.
Note: we treat the two listings registered in 1800 as invalid, since automobiles did not exist in 1800.
Let's look at the number of listings inside and outside this assumed valid registration year range, and then remove the data outside it.
# To see how many listings we have within our assumed valid registration year range.
print("Listings within valid year range = ",autos["registration_year"].between(1900,2016).sum(),"\n")
print("Listings outside valid year range = ",(~autos["registration_year"].between(1900,2016)).sum(),"\n")
# To remove the values outside of our assumed valid registration year range
autos = autos[ autos["registration_year"].between(1900,2016) ]
# To see the data distribution now in our valid dataset
print(autos["registration_year"].value_counts(normalize=True).sort_index(ascending=True).head(5))
print(autos["registration_year"].value_counts(normalize=True).sort_index(ascending=True).tail(5))
Listings within valid year range = 46352 Listings outside valid year range = 1872 1910 0.000043 1927 0.000022 1929 0.000022 1931 0.000022 1934 0.000043 Name: registration_year, dtype: float64 2012 0.028219 2013 0.017281 2014 0.014282 2015 0.008198 2016 0.025932 Name: registration_year, dtype: float64
print('Number of car brands in the listing: ',autos["brand"].unique().shape[0])
Number of car brands in the listing: 40
We see that there are 40 unique car brands in our cleaned dataset.
For the purposes of this project, we will consider only the top 10 car brands by listing count.
# Select only top 10 brands based on the distribution in the car listing
car_brands = autos["brand"].value_counts(normalize = True, dropna = False).head(10).index
brand_mean_price = {}
brand_mean_mileage = {}
# Loop through top 10 brands to calculate mean price of cars
for brand in car_brands:
mean_price = autos.loc[autos["brand"] == brand, "price"].mean()
mean_mileage = autos.loc[autos["brand"] == brand, "odometer_km"].mean()
brand_mean_price[brand] = round(mean_price,2)
brand_mean_mileage[brand] = round(mean_mileage,2)
# Convert the brand aggregate data into pandas Series to sort by values
brand_agg_mean_price = pd.Series(brand_mean_price)
brand_agg_mean_mileage = pd.Series(brand_mean_mileage)
print(brand_agg_mean_price.sort_values(ascending = False))
audi 9380.72 mercedes_benz 8672.65 bmw 8381.68 volkswagen 5436.95 seat 4433.42 ford 3779.27 peugeot 3113.86 opel 3005.50 fiat 2836.87 renault 2496.07 dtype: float64
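For reference, the same per-brand aggregates can be computed without an explicit loop using `groupby`. A sketch on a small toy DataFrame (the real CSV isn't re-read here; column names match the notebook):

```python
import pandas as pd

# Toy stand-in for the cleaned listings
autos = pd.DataFrame({
    "brand":       ["bmw", "bmw", "audi", "audi", "ford"],
    "price":       [8000, 9000, 9500, 9200, 3800],
    "odometer_km": [130000, 135000, 129000, 130000, 125000],
})

# Mean price and mileage per brand, sorted by price descending
brand_agg = (autos.groupby("brand")[["price", "odometer_km"]]
                  .mean()
                  .round(2)
                  .sort_values("price", ascending=False))
print(brand_agg)
```

This produces the side-by-side price/mileage comparison in one step, which scales to the full dataset unchanged.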
From the above analysis of the top 10 car brands by listing share:

- Audi, Mercedes-Benz, and BMW are clearly the most expensive brands on average
- Volkswagen and Seat sit in the middle of the range
- Ford, Peugeot, Opel, Fiat, and Renault are the least expensive

We will also compare the average price of the top 10 brands against their average mileage.
For this, we add the mean mileage as well and build a dataframe so that we can compare the two side by side.
# Create a dataframe with a single column from a series
brand_agg = pd.DataFrame(brand_agg_mean_price, columns=["mean_price"])
# Add a column to the dataframe with another series sharing same label
brand_agg["mean_mileage"] = brand_agg_mean_mileage
brand_agg
mean_price | mean_mileage | |
---|---|---|
volkswagen | 5436.95 | 128799.88 |
bmw | 8381.68 | 132695.32 |
opel | 3005.50 | 129384.43 |
mercedes_benz | 8672.65 | 131025.67 |
audi | 9380.72 | 129245.40 |
ford | 3779.27 | 124277.11 |
renault | 2496.07 | 128281.39 |
peugeot | 3113.86 | 127127.89 |
fiat | 2836.87 | 116950.29 |
seat | 4433.42 | 121536.64 |
Comparing the average price alongside the average mileage, we see that the average mileage is broadly similar across the top brands (roughly 117,000 to 133,000 km), so the price differences are not explained by mileage: Audi, Mercedes-Benz, and BMW simply command higher prices.
We started with data cleansing as a big part of this project, in which we:

- renamed the columns from camelCase to snake_case
- converted the `price` and `odometer` columns from text to numeric, renaming `odometer` to `odometer_km`
- removed listings with unrealistic prices (below 100 or above 350,000)
- removed listings with registration years outside 1900 to 2016

After cleansing the data, we analysed the top 10 car brands by listing count along with their average prices and mileage, and concluded that while mileage varies little between brands, the premium German brands (Audi, Mercedes-Benz, and BMW) command noticeably higher average prices than the mid-range and economy brands.