Exploring Ebay Car Sales Data¶

In this project, we'll work with a we'll work with a modified dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The original dataset can be found here.

The data dictionary provided with the data is as follows.

dateCrawled - When this ad was first crawled. All field-values are taken from this date.
name - Name of the car.
seller - Whether the seller is private or a dealer.
offerType - The type of listing
price - The price on the ad to sell the car.
abtest - Whether the listing is included in an A/B test.
vehicleType - The vehicle Type.
yearOfRegistration - The year in which the car was first registered.
gearbox - The transmission type.
powerPS - The power of the car in PS.
model - The car model name.
kilometer - How many kilometers the car has driven.
monthOfRegistration - The month in which the car was first registered.
fuelType - What type of fuel the car uses.
brand - The brand of the car.
notRepairedDamage - If the car has a damage which is not yet repaired.
dateCreated - The date on which the eBay listing was created.
nrOfPictures - The number of pictures in the ad.
postalCode - The postal code for the location of the vehicle.
lastSeenOnline - When the crawler saw this ad last online.

Summary of Results¶

In this project we found:

The vast majority of (97.93%) of vehicles were registerd within the past 30 years with nearly 30% registered within the past decade and nearly 10% within the past 5 years.
The top brands by listing were as follows:
- Volkswagen (21.21%)
- BMW (11.00%)
- Opel (10.82%)
- Mercedes Benz (9.54%)
- Audi (8.64%)
- Ford (6.98%)
Mercedes Benz vehicles are by far the most expensive out our top brands, on average costing three times more than the second most expensive brand, Audi.
We also see that the three most expensive top brands (Mercedes Benz, Audi, BMW) make up nearly 30% of all listings, which could indicate there is not a strong correlation between price and popularity
There is no apparent relationship between the mean price and mean mileage amongst vehicles of the top brands.

Project¶

Let's start by importing our libraries and reading in the data.

In [1]:

# Import Libraries
import numpy as np
import pandas as pd
from pprint import pprint

# Read in data
autos = pd.read_csv('autos.csv', encoding='Latin-1')

In [2]:

# Quick exploration of the data
autos

Out[2]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	nrOfPictures	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	0	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	0	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	0	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	0	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	0	39218	2016-04-01 14:38:50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
49995	2016-03-27 14:38:19	Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon	privat	Angebot	$24,900	control	limousine	2011	automatik	239	q5	100,000km	1	diesel	audi	nein	2016-03-27 00:00:00	0	82131	2016-04-01 13:47:40
49996	2016-03-28 10:50:25	Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...	privat	Angebot	$1,980	control	cabrio	1996	manuell	75	astra	150,000km	5	benzin	opel	nein	2016-03-28 00:00:00	0	44807	2016-04-02 14:18:02
49997	2016-04-02 14:44:48	Fiat_500_C_1.2_Dualogic_Lounge	privat	Angebot	$13,200	test	cabrio	2014	automatik	69	500	5,000km	11	benzin	fiat	nein	2016-04-02 00:00:00	0	73430	2016-04-04 11:47:27
49998	2016-03-08 19:25:42	Audi_A3_2.0_TDI_Sportback_Ambition	privat	Angebot	$22,900	control	kombi	2013	manuell	150	a3	40,000km	11	diesel	audi	nein	2016-03-08 00:00:00	0	35683	2016-04-05 16:45:07
49999	2016-03-14 00:42:12	Opel_Vectra_1.6_16V	privat	Angebot	$1,250	control	limousine	1996	manuell	101	vectra	150,000km	1	benzin	opel	nein	2016-03-13 00:00:00	0	45897	2016-04-06 21:18:48

50000 rows × 20 columns

In [3]:

# Further exploration
autos.info()
print('\n')
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

Out[3]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

Initial Observations¶

From the above code cells we can see the following:

Our dataset contains 20 columns.
Our dataset contains a total of 50,000 entries.
Our data is made up of two data types: int64 and object.
The column names use camelcase and will need to be changed to snakecase.

The columns vehicleType, gearbox, model, fuelType and notRepairedDamage all contain null values.

Editing Column Names¶

In [4]:

# Print out existing column names
pprint(autos.columns)

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [5]:

# Rename columns
autos.rename({
    'dateCrawled': 'date_crawled',  
    'offerType': 'offer_type',  
    'abtest': 'ab_test',
    'vehicleType': 'vehicle_type', 
    'yearOfRegistration': 'registration_year',  
    'powerPS': 'power_ps', 
    'monthOfRegistration': 'registration_month', 
    'fuelType': 'fuel_type', 
    'notRepairedDamage': 'unrepaired_damage', 
    'dateCreated': 'ad_created', 
    'nrOfPictures': 'no_of_pics', 
    'postalCode': 'postal_code',
    'lastSeen': 'last_seen'
    },
    axis=1,
    inplace=True
)

# Display new column headings
autos.head()

Out[5]:

	date_crawled	name	seller	offer_type	price	ab_test	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

In the code cell above we converted the column names from camelcase to snakecase and reworded some of the column names based on the data dictionary to be more descriptive.

Determining what cleaning tasks need to be performed¶

In [6]:

# Get descriptive statistics for all columns.
autos.describe(include='all')

Out[6]:

	date_crawled	name	seller	offer_type	price	ab_test	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	no_of_pics	postal_code	last_seen
count	50000	50000	50000	50000	50000	50000	44905	50000.000000	47320	50000.000000	47242	50000	50000.000000	45518	50000	40171	50000	50000.0	50000.000000	50000
unique	48213	38754	2	2	2357	2	8	NaN	2	NaN	245	13	NaN	7	40	2	76	NaN	NaN	39481
top	2016-03-11 22:38:16	Ford_Fiesta	privat	Angebot	$0	test	limousine	NaN	manuell	NaN	golf	150,000km	NaN	benzin	volkswagen	nein	2016-04-03 00:00:00	NaN	NaN	2016-04-07 06:17:27
freq	3	78	49999	49999	1421	25756	12859	NaN	36993	NaN	4024	32424	NaN	30107	10687	35232	1946	NaN	NaN	8
mean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2005.073280	NaN	116.355920	NaN	NaN	5.723360	NaN	NaN	NaN	NaN	0.0	50813.627300	NaN
std	NaN	NaN	NaN	NaN	NaN	NaN	NaN	105.712813	NaN	209.216627	NaN	NaN	3.711984	NaN	NaN	NaN	NaN	0.0	25779.747957	NaN
min	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1000.000000	NaN	0.000000	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	0.0	1067.000000	NaN
25%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1999.000000	NaN	70.000000	NaN	NaN	3.000000	NaN	NaN	NaN	NaN	0.0	30451.000000	NaN
50%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2003.000000	NaN	105.000000	NaN	NaN	6.000000	NaN	NaN	NaN	NaN	0.0	49577.000000	NaN
75%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2008.000000	NaN	150.000000	NaN	NaN	9.000000	NaN	NaN	NaN	NaN	0.0	71540.000000	NaN
max	NaN	NaN	NaN	NaN	NaN	NaN	NaN	9999.000000	NaN	17700.000000	NaN	NaN	12.000000	NaN	NaN	NaN	NaN	0.0	99998.000000	NaN

Here we can see the following:

The columns seller and offer_type both contain mostly 1 value and will not be useful to our analysis.
The column no_of_pics only has 1 value and will not be useful to our analysis.
The price and odometer columns are numeric values stored as text.

Based on this we will remove the seller, offer_type and no_of_pics columns, and we will convert the price and odometer columns to a numeric data type.

In [7]:

# Remove unecessary columns
autos.drop(columns=['seller', 'offer_type', 'no_of_pics'], inplace=True)

# Clean and convert price column.
autos['price'] = (autos['price']
                  .str.replace('$','')
                  .str.replace(',','')
                  .astype(float)
                 )

# Clean and convert price column.
autos['odometer'] = (autos['odometer'].str.replace('km','')
                     .str.replace(',','')
                     .astype(int)
                    )

# Rename price and odometer columns to make the more descriptive
autos.rename({'price':'price_usd', 'odometer':'odometer_km'}
             ,axis=1
             ,inplace=True
            )

# Output updated data
autos.head()

Out[7]:

	date_crawled	name	price_usd	ab_test	vehicle_type	registration_year	gearbox	power_ps	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	5000.0	control	bus	2004	manuell	158	andere	150000	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	8500.0	control	limousine	1997	automatik	286	7er	150000	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	8990.0	test	limousine	2009	manuell	102	golf	70000	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	4350.0	control	kleinwagen	2007	automatik	71	fortwo	70000	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	1350.0	test	kombi	2003	manuell	0	focus	150000	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

Exploring the `odometer_km` column¶

In [8]:

odometer = autos['odometer_km']

# Display discriptive statistics of odometer_km column
odometer.describe()

Out[8]:

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

From this initial exploration it does not looks as though there are any unrealistic odemeter readings.

In [9]:

# Further exploration
print(odometer.unique())
print('\n')
print(odometer.value_counts().sort_index(ascending=False))

[150000  70000  50000  80000  10000  30000 125000  90000  20000  60000
   5000 100000  40000]


150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
40000       819
30000       789
20000       784
10000       264
5000        967
Name: odometer_km, dtype: int64

Further exploration confirms that the data in this column does not require further cleaning.

Exploring the `price` column¶

In [10]:

prices = autos['price_usd']

# Display discriptive statistics of price_usd column
prices.describe()

Out[10]:

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_usd, dtype: float64

From this initial exploration we can see that there are clearly errors within the price_usd column data. For example, the max price of 100,000,000 USD is too high and the min price of 0 USD is too low.

Let's do some further exploraion.

In [11]:

# Determine top 20 prices
top_20_prices = prices.value_counts().sort_index(ascending=False).head(20)
print(f"Top 20 prices:\n{top_20_prices}") 

Top 20 prices:
99999999.0    1
27322222.0    1
12345678.0    3
11111111.0    2
10000000.0    1
3890000.0     1
1300000.0     1
1234566.0     1
999999.0      2
999990.0      1
350000.0      1
345000.0      1
299000.0      1
295000.0      1
265000.0      1
259000.0      1
250000.0      1
220000.0      1
198000.0      1
197000.0      1
Name: price_usd, dtype: int64

Pricing above 350,000 appears to be random key entries or erroneous data and will thus be considered as outliers.

In [12]:

# Determine bottom 20 prices
bottom_20_prices = prices.value_counts().sort_index(ascending=True).head(20)
print(f"Bottom 20 prices:\n{bottom_20_prices}") 

Bottom 20 prices:
0.0     1421
1.0      156
2.0        3
3.0        1
5.0        2
8.0        1
9.0        1
10.0       7
11.0       2
12.0       3
13.0       2
14.0       1
15.0       2
17.0       3
18.0       1
20.0       4
25.0       5
29.0       1
30.0       7
35.0       1
Name: price_usd, dtype: int64

It seems unlikely that there would be 1421 random '0' key entries, however we will treat vehicles with zero cost as erroneous data and will thus remove them from our data.

In [13]:

# Updating autos by removing erroneus data.
autos[prices.between(1, 350000)]

Out[13]:

	date_crawled	name	price_usd	ab_test	vehicle_type	registration_year	gearbox	power_ps	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	5000.0	control	bus	2004	manuell	158	andere	150000	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	8500.0	control	limousine	1997	automatik	286	7er	150000	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	8990.0	test	limousine	2009	manuell	102	golf	70000	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	4350.0	control	kleinwagen	2007	automatik	71	fortwo	70000	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	1350.0	test	kombi	2003	manuell	0	focus	150000	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
49995	2016-03-27 14:38:19	Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon	24900.0	control	limousine	2011	automatik	239	q5	100000	1	diesel	audi	nein	2016-03-27 00:00:00	82131	2016-04-01 13:47:40
49996	2016-03-28 10:50:25	Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...	1980.0	control	cabrio	1996	manuell	75	astra	150000	5	benzin	opel	nein	2016-03-28 00:00:00	44807	2016-04-02 14:18:02
49997	2016-04-02 14:44:48	Fiat_500_C_1.2_Dualogic_Lounge	13200.0	test	cabrio	2014	automatik	69	500	5000	11	benzin	fiat	nein	2016-04-02 00:00:00	73430	2016-04-04 11:47:27
49998	2016-03-08 19:25:42	Audi_A3_2.0_TDI_Sportback_Ambition	22900.0	control	kombi	2013	manuell	150	a3	40000	11	diesel	audi	nein	2016-03-08 00:00:00	35683	2016-04-05 16:45:07
49999	2016-03-14 00:42:12	Opel_Vectra_1.6_16V	1250.0	control	limousine	1996	manuell	101	vectra	150000	1	benzin	opel	nein	2016-03-13 00:00:00	45897	2016-04-06 21:18:48

48565 rows × 17 columns

Distribution of Date Values¶

Below we calculate the distrbution of values in the date_crawled, ad_created, and last_seen columns.

In [14]:

# Calculating date_crawled distribution.
date_crawled = autos['date_crawled'].str[:10]
print(date_crawled.value_counts(normalize=True, dropna=False).sort_index())
print("\n\n")

# Sort according to distribution.
print(date_crawled.value_counts(sort=True, dropna=False))

2016-03-05    0.02538
2016-03-06    0.01394
2016-03-07    0.03596
2016-03-08    0.03330
2016-03-09    0.03322
2016-03-10    0.03212
2016-03-11    0.03248
2016-03-12    0.03678
2016-03-13    0.01556
2016-03-14    0.03662
2016-03-15    0.03398
2016-03-16    0.02950
2016-03-17    0.03152
2016-03-18    0.01306
2016-03-19    0.03490
2016-03-20    0.03782
2016-03-21    0.03752
2016-03-22    0.03294
2016-03-23    0.03238
2016-03-24    0.02910
2016-03-25    0.03174
2016-03-26    0.03248
2016-03-27    0.03104
2016-03-28    0.03484
2016-03-29    0.03418
2016-03-30    0.03362
2016-03-31    0.03192
2016-04-01    0.03380
2016-04-02    0.03540
2016-04-03    0.03868
2016-04-04    0.03652
2016-04-05    0.01310
2016-04-06    0.00318
2016-04-07    0.00142
Name: date_crawled, dtype: float64



2016-04-03    1934
2016-03-20    1891
2016-03-21    1876
2016-03-12    1839
2016-03-14    1831
2016-04-04    1826
2016-03-07    1798
2016-04-02    1770
2016-03-19    1745
2016-03-28    1742
2016-03-29    1709
2016-03-15    1699
2016-04-01    1690
2016-03-30    1681
2016-03-08    1665
2016-03-09    1661
2016-03-22    1647
2016-03-11    1624
2016-03-26    1624
2016-03-23    1619
2016-03-10    1606
2016-03-31    1596
2016-03-25    1587
2016-03-17    1576
2016-03-27    1552
2016-03-16    1475
2016-03-24    1455
2016-03-05    1269
2016-03-13     778
2016-03-06     697
2016-04-05     655
2016-03-18     653
2016-04-06     159
2016-04-07      71
Name: date_crawled, dtype: int64

We can see that all ads were crawled between 2016-03-05 and 2016-04-07. We can also see that 2016-04-03 was the day with the highest activity with a totl of 1934 ads crawled.

In [15]:

# Calculating ad_created distribution.
ad_created = autos['ad_created'].str[:10]
print(ad_created.value_counts(normalize=True, dropna=False).sort_index())
print("\n\n")

# Sort according to distribution.
print(ad_created.value_counts(sort=True, dropna=False))

2015-06-11    0.00002
2015-08-10    0.00002
2015-09-09    0.00002
2015-11-10    0.00002
2015-12-05    0.00002
               ...   
2016-04-03    0.03892
2016-04-04    0.03688
2016-04-05    0.01184
2016-04-06    0.00326
2016-04-07    0.00128
Name: ad_created, Length: 76, dtype: float64



2016-04-03    1946
2016-03-20    1893
2016-03-21    1886
2016-04-04    1844
2016-03-12    1831
              ... 
2016-01-22       1
2016-01-07       1
2016-01-14       1
2016-01-29       1
2016-02-08       1
Name: ad_created, Length: 76, dtype: int64

All ads were created between 2015-06-11 and 2016-04-07. We see that 2016-04-03 was the day with the highest number of ads created which is the same date on which the most ads were crawled.

In [16]:

# Calculating last_seen distribution.
last_seen = autos['last_seen'].str[:10]
print(last_seen.value_counts(normalize=True, dropna=False).sort_index())
print("\n\n")

# Sort according to distribution.
print(last_seen.value_counts(sort=True, dropna=False))

2016-03-05    0.00108
2016-03-06    0.00442
2016-03-07    0.00536
2016-03-08    0.00760
2016-03-09    0.00986
2016-03-10    0.01076
2016-03-11    0.01252
2016-03-12    0.02382
2016-03-13    0.00898
2016-03-14    0.01280
2016-03-15    0.01588
2016-03-16    0.01644
2016-03-17    0.02792
2016-03-18    0.00742
2016-03-19    0.01574
2016-03-20    0.02070
2016-03-21    0.02074
2016-03-22    0.02158
2016-03-23    0.01858
2016-03-24    0.01956
2016-03-25    0.01920
2016-03-26    0.01696
2016-03-27    0.01602
2016-03-28    0.02086
2016-03-29    0.02234
2016-03-30    0.02484
2016-03-31    0.02384
2016-04-01    0.02310
2016-04-02    0.02490
2016-04-03    0.02536
2016-04-04    0.02462
2016-04-05    0.12428
2016-04-06    0.22100
2016-04-07    0.13092
Name: last_seen, dtype: float64



2016-04-06    11050
2016-04-07     6546
2016-04-05     6214
2016-03-17     1396
2016-04-03     1268
2016-04-02     1245
2016-03-30     1242
2016-04-04     1231
2016-03-31     1192
2016-03-12     1191
2016-04-01     1155
2016-03-29     1117
2016-03-22     1079
2016-03-28     1043
2016-03-21     1037
2016-03-20     1035
2016-03-24      978
2016-03-25      960
2016-03-23      929
2016-03-26      848
2016-03-16      822
2016-03-27      801
2016-03-15      794
2016-03-19      787
2016-03-14      640
2016-03-11      626
2016-03-10      538
2016-03-09      493
2016-03-13      449
2016-03-08      380
2016-03-18      371
2016-03-07      268
2016-03-06      221
2016-03-05       54
Name: last_seen, dtype: int64

We see that the date with the highest frequency in th last_seen column was 2016-04-06 which could possibly indicated this was the date when the most sales occurred. However, this is speculative as there is not enough data to make a conclusive statement.

In [17]:

# Display descriptive statistics on registration_year column
autos['registration_year'].describe()

Out[17]:

count    50000.000000
mean      2005.073280
std        105.712813
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The data above makes it clear that there is erroneous data within this column as it is not possible for a car to have been registered in 1000 or 9999.

We will explore this column further in order to decide how best to clean it.

In [18]:

# Display 30 earliest registration years
print(autos['registration_year'].value_counts().sort_index().head(30))

# Display 30 latest registration years
print('\n\n')
print(autos['registration_year'].value_counts().sort_index().tail(30))

1000     1
1001     1
1111     1
1500     1
1800     2
1910     9
1927     1
1929     1
1931     1
1934     2
1937     4
1938     1
1939     1
1941     2
1943     1
1948     1
1950     3
1951     2
1952     1
1953     1
1954     2
1955     2
1956     5
1957     2
1958     4
1959     7
1960    34
1961     6
1962     4
1963     9
Name: registration_year, dtype: int64



2001    2703
2002    2533
2003    2727
2004    2737
2005    3015
2006    2708
2007    2304
2008    2231
2009    2098
2010    1597
2011    1634
2012    1323
2013     806
2014     666
2015     399
2016    1316
2017    1453
2018     492
2019       3
2800       1
4100       1
4500       1
4800       1
5000       4
5911       1
6200       1
8888       1
9000       2
9996       1
9999       4
Name: registration_year, dtype: int64

We will remove entries with vehicle registration years earlier than 1910 as the amount of vehicles within this range are negligible and the registration years themselves are unrealistic or impossible.

We will also remove vehicles with a registration year later than 2016 as it is impossible for a car to have been registered after the listing had alreeady been seen.

In [19]:

# Removing erroneous entries
autos = autos[(autos['registration_year'] >= 1910) & (autos['registration_year'] <= 2016)]
autos['registration_year'].describe()

Out[19]:

count    48028.00000
mean      2002.80351
std          7.31085
min       1910.00000
25%       1999.00000
50%       2003.00000
75%       2008.00000
max       2016.00000
Name: registration_year, dtype: float64

In [20]:

# Display updated distribution
autos['registration_year'].value_counts(normalize=True).head(30)

Out[20]:

2000    0.069834
2005    0.062776
1999    0.062464
2004    0.056988
2003    0.056779
2006    0.056384
2001    0.056280
2002    0.052740
1998    0.051074
2007    0.047972
2008    0.046452
2009    0.043683
1997    0.042225
2011    0.034022
2010    0.033251
1996    0.030066
2012    0.027546
2016    0.027401
1995    0.027338
2013    0.016782
2014    0.013867
1994    0.013742
1993    0.009265
2015    0.008308
1990    0.008224
1992    0.008141
1991    0.007412
1989    0.003769
1988    0.002957
1985    0.002186
Name: registration_year, dtype: float64

In [21]:

# Display distribution of vehicles reistered within the past 30 years.
print((autos['registration_year']
 .value_counts(normalize=True)
 .sort_index(ascending=False)
 .head(30)
 .sum()
))

# Display distribution of vehicles reistered within the past 30 years.
print((autos['registration_year']
 .value_counts(normalize=True)
 .sort_index(ascending=False)
 .head(10)
 .sum()
))

# Display distribution of vehicles reistered within the past 30 years.
print((autos['registration_year']
 .value_counts(normalize=True)
 .sort_index(ascending=False)
 .head(5)
 .sum()
))

0.9793037394853003
0.29928375114516537
0.093903556258849

The vast majority of (97.93%) of vehicles were registerd within the past 30 years with nearly 30% registered within the past decade and nearly 10% within the past 5 years.

Exploration of `brand` column¶

In [22]:

# Display descriptive statistics for brand column
print(autos['brand'].describe())

# Display brand value counts
autos['brand'].value_counts(normalize=True)

count          48028
unique            40
top       volkswagen
freq           10188
Name: brand, dtype: object

Out[22]:

volkswagen        0.212126
bmw               0.110019
opel              0.108166
mercedes_benz     0.095361
audi              0.086387
ford              0.069793
renault           0.047347
peugeot           0.029524
fiat              0.025860
seat              0.018177
skoda             0.016032
mazda             0.015137
nissan            0.015095
citroen           0.013929
smart             0.013909
toyota            0.012472
sonstige_autos    0.010952
hyundai           0.009848
volvo             0.009245
mini              0.008641
mitsubishi        0.008141
honda             0.007850
kia               0.007100
alfa_romeo        0.006621
porsche           0.006101
suzuki            0.005913
chevrolet         0.005705
chrysler          0.003665
dacia             0.002561
daihatsu          0.002561
jeep              0.002249
subaru            0.002186
land_rover        0.002040
saab              0.001603
jaguar            0.001582
trabant           0.001562
daewoo            0.001499
rover             0.001353
lancia            0.001083
lada              0.000604
Name: brand, dtype: float64

As we can see there are 40 unique brands in our dataset, with Volkswagen being the brand with the highest frequency.

Top Brands¶

Next we will select the brands that constitute more than 5% of the total values, and then we will aggregate their price data.

In [23]:

# Create list to store top brands
brand_counts = autos['brand'].value_counts(normalize=True)
top_brands = brand_counts[brand_counts > .05].index
        
# Display top_brands
print(f"Top Brands: \n{list(top_brands)}")
    

Top Brands: 
['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford']

The brands which meet our ciriteris are:

Volkswagen (21.21%)
BMW (11.00%)
Opel (10.82%)
Mercedes Benz (9.54%)
Audi (8.64%)
Ford (6.98%)

We can see that Volkswagen is the top brand, making up over a fifth of the listed vehicles, and having nearly double the second most listed brand BMW.

Next we will aggregate their price data.

Aggregating Price Data¶

In [24]:

# Create dictionary to store aggregate data.
top_brands_mean_price = {}

# Selecting data.
for brand in top_brands:
    selected_rows = autos[autos['brand'] == brand]
    avg_price = selected_rows['price_usd'].mean()
    top_brands_mean_price[brand] = avg_price

# Sort brands by average price in descending order.
top_brands_sorted = sorted(top_brands_mean_price.items()
                       ,key=lambda x: x[1]
                       ,reverse=True
                      )

# Display average prices.
for brand, price in top_brands_sorted:
    print(f"{brand}: ${round(price, 2)}")

mercedes_benz: $30317.45
audi: $9093.65
bmw: $8334.65
ford: $7263.02
volkswagen: $6516.46
opel: $5252.62

We can see that Mercedes Benz vehicles are by far the most expensive out our top brands, on average costing three times more than the second most expensive brand Audi. However, it still makes up nearly 10% of all vehicles in our dataset which could indicate one or both of the following:

Many people are tying to sell their Mercedes Benz vehicles,
There is a high demand for Mercedes Benz vehicles

More research would need to be done inorder to determine which of these two options most closely represents reality.

We also see that the three most expensive top brands (Mercedes Benz, Audi, BMW) make up nearly 30% of all listings, which could indicate there is not a strong correlation between price and popularity. However, more research would be required to determine whetether or not this is true.

Next we will aggregate our mileage data for each of the top brands to see if there is any visible link between it and our mean price data.

Aggregating Mean Mileage Data for Comparison to Mean Price Data¶

In [25]:

# Create dictionary to store aggregate data.
top_brands_mean_mileage = {}

# Selecting data.
for brand in top_brands:
    selected_rows = autos[autos['brand'] == brand]
    mean_mileage = selected_rows['odometer_km'].mean()
    top_brands_mean_mileage[brand] = mean_mileage

In order to easily compare mean mileage and mean price for each of the top brands we will create a new DataFrame object containing the relevant data.

Comparing Mean Mileage and Mean Price for Top Brands¶

In [46]:

# Convert top_brands_mean_price dictionary into a Series
tbmp_series = pd.Series(top_brands_mean_price)
tbmp_series = tbmp_series.round(2)

# Convert top_brands_mean_mileage dictionary into a Series
tbmm_series = pd.Series(top_brands_mean_mileage)
tbmm_series = tbmm_series.round(2)

# Create new DataFrame to combine two series objects
tb_mean_price_mileage = pd.DataFrame(tbmp_series, columns=['mean_price'])
tb_mean_price_mileage['mean_mileage'] = tbmm_series   

# Display tb_mean_price_mileage sorted by price.
tb_mean_price_mileage.sort_values('mean_price', ascending=False)

Out[46]:

	mean_price	mean_mileage
mercedes_benz	30317.45	130860.26
audi	9093.65	129287.78
bmw	8334.65	132434.71
ford	7263.02	124046.84
volkswagen	6516.46	128730.37
opel	5252.62	129227.14

The table above makes it clear that while there is a wide range of mean prices amongst the top brands (\$5252.62 - \$30,317.45) there is a very low range for the mean mileages (124,046.84km - 132,434.71km).

This indicates that there is no meaningful relationship between the mean prices and mean mileages amongst our top brands.

Conclusion¶

Findings¶

The vast majority of (97.93%) of vehicles were registerd within the past 30 years with nearly 30% registered within the past decade and nearly 10% within the past 5 years.
The top brands by listings:
- Volkswagen (21.21%)
- BMW (11.00%)
- Opel (10.82%)
- Mercedes Benz (9.54%)
- Audi (8.64%)
- Ford (6.98%)
The Germans clearly love local automakers, with German brands taking the top 5 spots and making up over 61.21% of all listings.
We can see that Mercedes Benz vehicles are by far the most expensive amongst the top brands, on average costing three times more than the second most expensive brand Audi. However, it still makes up nearly 10% of all vehicles in the dataset which indicates many people are tying to sell their Mercedes Benz vehicles. More research would need to be done in order to state whether or not the demand for Mercedes Benz vehicles is greater or less than the supply.
We also see that the three most expensive top brands (Mercedes Benz, Audi, BMW) make up nearly 30% of all listings, which could indicate there is not a strong correlation between price and popularity. However, more research would be required to determine whetether or not this is true.
There is no apparent relationship between the mean price and mean mileage amongst vehicles of the top brands.

Possible Improvements¶

Find out how the demand for vehicles compares to their supply.
See which brand's vehicles have the longest and shortest listing times on average.
Find the most common brand/model combination.
Translate German words into English
Split the odometer_km into groups, and use aggregation to see if average prices follows any patterns based on the mileage.
Find out how much cheaper cars with damage are than their non-damaged counterparts.