In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
The dataset was originally scraped and uploaded to Kaggle; a few modifications have been made to that original version.
In this project I will start by cleaning the data, then analyse the different brands for price and popularity, and finally look for any connections between price, mileage and damage. Throughout this project I will use the pandas Python library, visualizing some of the data with matplotlib.
The data dictionary provided with the data is as follows:
Now let's begin by importing the necessary modules and reading in the file.
# Import modules
import pandas as pd
import numpy as np
#Read dataset into Pandas
autos = pd.read_csv('autos.csv',encoding='Latin-1', engine='python')
Great, now that we have the dataset as a dataframe, let's take a look at the first 5 rows and then a summary of the dataset to see what we can learn.
#First 5 rows
autos.head()
# Info about each column
autos.info()
A few observations from the dataset:
The dataset contains 20 columns, most of which are strings.
Some columns have null values, but none have more than ~20% null values.
The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores.
The price and odometer columns are both object type, which means that we need to change them to integer types in order to access the numeric values.
We need to do some cleaning:
Firstly I will change the column names from camelcase to snakecase, and also change a couple of the column names so that they better match the descriptions in the data dictionary.
# Change column names
new_columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year',
'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage',
'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen']
autos.columns = new_columns
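As an aside, the camelcase-to-snakecase step could also be partly automated with a small regex helper. This is a hypothetical sketch rather than part of the project, and it would still need manual fixes (e.g. it turns 'powerPS' into 'power_p_s', and 'notRepairedDamage' was renamed to 'unrepaired_damage' by hand):

```python
import re

def camel_to_snake(name):
    # Insert an underscore before each uppercase letter (except at the start),
    # then lowercase the whole string
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

print([camel_to_snake(c) for c in ['dateCrawled', 'offerType', 'vehicleType']])
# → ['date_crawled', 'offer_type', 'vehicle_type']
```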
Now I will convert the price and odometer columns to integer types. This involves removing the non-numeric characters ('$', ',' and 'km') from the values and recording the units in the column names instead.
# Rename price and odometer columns to show units
autos = autos.rename(columns={'price': 'price($)', 'odometer':'odometer(km)'})
# Strip non-numeric characters from price and odometer columns and convert to integer dtype
autos['price($)'] = autos['price($)'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(int)
autos['odometer(km)'] = autos['odometer(km)'].str.replace('km', '', regex=False).str.replace(',', '', regex=False).astype(int)
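A slightly more defensive alternative (a sketch on synthetic values, not the project's code) is to strip every non-digit character with a single regex and let `pd.to_numeric` handle the conversion:

```python
import pandas as pd

prices = pd.Series(['$5,000', '$18,300', '$0'])
# Drop everything that is not a digit, then convert to numbers
cleaned = pd.to_numeric(prices.str.replace(r'[^\d]', '', regex=True))
print(cleaned.tolist())
# → [5000, 18300, 0]
```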
Let's take a look at the first 5 rows of the dataset to check our new column names and to check the values in the price and odometer columns.
# First 5 rows
autos.head()
The columns have been cleaned successfully so let's move on.
Let's take a look at the summary statistics of the data to see if there is anything else that needs cleaning up or could be removed.
# Show data description
autos.describe(include='all')
From the description we can see that there are 2 columns with only 2 unique values, and that in each case one of the values accounts for 49,999 out of 50,000 occurrences. We can therefore remove the seller and offer_type columns as they are effectively redundant.
# Remove unneeded columns
del autos['seller']
del autos['offer_type']
Now let's look for values in the data that could be incorrect because they are unrealistically high or low. I will focus on the columns with numeric values, where unrealistic outliers can affect the reliability of our findings.
I will therefore take a look at the following columns, starting with price($):
# Show description of price column
autos['price($)'].describe()
The max value is 100,000,000 dollars; I don't think anyone is buying a car on eBay for that much. Let's have a look at the 20 highest values in the price column.
# Show most expensive cars in dataset
value_counts = autos['price($)'].value_counts()
value_counts.sort_index().tail(20)
There are 14 cars listed at 999,990 dollars and above. Below that there is a big gap all the way down to 350,000 dollars, which seems about right for the most expensive vehicle on eBay. I will therefore remove all rows with a price above 350,000.
# Remove unrealistic prices
autos = autos[autos['price($)'].between(0, 350000)]
# Show description of cleaned price column
autos['price($)'].describe()
Now for the 20 lowest values:
# Show least expensive cars
autos['price($)'].value_counts().sort_index().head(20)
There aren't any outliers at this end. Yes, selling for nothing seems strange, but there are 1421 listings at this price. With nearly 3 percent of listings priced at 0, these aren't outliers; a starting price of 0 is clearly a legitimate practice on eBay. I therefore won't be removing any rows for having a price that is too low.
Now let's take a look at the odometer column:
# Show description of odometer column
autos['odometer(km)'].describe()
# Show value counts in order of index
autos['odometer(km)'].value_counts().sort_index()
Only 13 unique values. The odometer values have clearly been grouped together and there are no outliers here. We can therefore move on.
Now let's take a look at the columns with dates. There are 5 in total. I will begin by looking at the 3 string columns:
# Show 3 columns with string dates
autos[['date_crawled', 'ad_created', 'last_seen']][0:5]
# Show dates of date crawled column
autos['date_crawled'].str[:10].value_counts(dropna=False)
All of the listings were crawled between March 6th and April 7th 2016.
# Show dates of ad created column
autos['ad_created'].str[:10].value_counts(dropna=False).sort_index()
Ads were created between June 11th 2015 and April 7th 2016.
# Show dates of last seen column
autos['last_seen'].str[:10].value_counts(dropna=False).sort_index().tail(20)
All listings were last seen between March 19th and April 7th 2016.
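Slicing with str[:10] works because these three columns store timestamps as 'YYYY-MM-DD HH:MM:SS' strings. An alternative sketch is to parse them properly with pd.to_datetime, which also unlocks date arithmetic:

```python
import pandas as pd

last_seen = pd.Series(['2016-03-19 17:36:18', '2016-04-07 03:16:57'])
parsed = pd.to_datetime(last_seen)
# Reduce each timestamp to its calendar date
print(parsed.dt.strftime('%Y-%m-%d').tolist())
# → ['2016-03-19', '2016-04-07']
```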
For these 3 columns nothing stands out as being out of the ordinary. Now let's check the registration year column to see if it matches what we would expect from the above data. There should be no registration years later than 2016 as that is when the last ads were created and the data was crawled.
# Show description of registration year column
autos['registration_year'].describe()
Hmm, well that is strange. We have cars registered as early as the year 1000 and as late as 9999. Incredible! We need to take a closer look at the earliest and latest values in the registration year column.
# Show earliest dates
autos['registration_year'].value_counts().sort_index().head(20)
# Show latest dates
autos['registration_year'].value_counts().sort_index().tail(20)
There are dates here which are not possible.
We need to remove the rows with impossible registration dates. The Ford Model T, the first mass-produced car, entered production in 1908. I will use that year as the cut-off and remove everything from before it.
As we saw from the ad_created column, these cars were all listed in 2015 and 2016. We therefore cannot have any cars with a registration date later than 2016 and must remove those rows as well.
# Remove impossible dates
autos = autos[autos['registration_year'].between(1908,2016)]
Let's take a look at the remaining values in a histogram.
# Import graph plotting module
from matplotlib import pyplot as plt
%matplotlib inline
# Plot histogram
ax = autos['registration_year'].plot(kind='hist', bins=30, range=(1908, 2016))
ax.set_xlabel('Date')
With the unrealistic dates gone we now have the distribution you would expect: lots of cars from the past 20 years and progressively fewer as we go further back in time.
Our dataset now only contains accurate dates.
The next step is to translate from German to English.
autos.head()
Looking through the columns we can see that 'vehicle_type', 'gearbox', 'fuel_type' and 'unrepaired_damage' all contain strings that may be in German (other columns also contain strings, but they are names, so we don't need to worry about translating them). I am going to first check the unique values in each column and then replace the German words with English ones. This will require some help from Google Translate.
# Check for any German in vehicle type column
autos['vehicle_type'].unique()
# Translate German to English
autos['vehicle_type'] = autos['vehicle_type'].str.replace('kleinwagen', 'mini').str.replace('kombi', 'combination')
autos['vehicle_type'] = autos['vehicle_type'].str.replace('cabrio', 'convertible').str.replace('andere', 'other')
# Check to see new values in vehicle type column
autos['vehicle_type'].unique()
# Check for any German in gearbox column
autos['gearbox'].unique()
# Translate German to English
autos['gearbox'] = autos['gearbox'].str.replace('manuell', 'manual').str.replace('automatik', 'automatic')
# Check to see new values in Gearbox column
autos['gearbox'].unique()
# Check for any German in Fuel type column
autos['fuel_type'].unique()
# Translate German to English
autos['fuel_type'] = autos['fuel_type'].str.replace('benzin', 'petrol').str.replace('elektro', 'electric').str.replace('andere', 'other')
# Check to see new values in fuel type column
autos['fuel_type'].unique()
# Check for any German in unrepaired damage column
autos['unrepaired_damage'].unique()
# Translate German to English
autos['unrepaired_damage'] = autos['unrepaired_damage'].str.replace('nein', 'no').str.replace('ja', 'yes')
# Check to see new values in unrepaired damage column
autos['unrepaired_damage'].unique()
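The chained str.replace calls work here because none of the German words is a substring of another value in its column. For whole-value translation in general, a safer pattern (sketched on synthetic data) is Series.replace with a mapping dict, which leaves unmapped values, including missing ones, untouched:

```python
import pandas as pd

gearbox = pd.Series(['manuell', 'automatik', None, 'manuell'])
translations = {'manuell': 'manual', 'automatik': 'automatic'}
# Replace whole values according to the mapping
print(gearbox.replace(translations).tolist())
# → ['manual', 'automatic', None, 'manual']
```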
Great, our dataset now contains English instead of German. Everything seems to be ready to do some data analysis!
Now that the data is cleaned we can start to analyse it.
First, I am going to look at the different car brands in the dataset: which brands are the most popular, and then narrow the data down to just those popular brands. I will count as popular any brand that makes up more than 2% of the dataset; 2% of 50,000 is 1,000 listings, which I think is a reasonable cut-off point.
# Loop through brand column and build a frequency table
brand_frequency = {}
for value in autos['brand']:
    if value in brand_frequency:
        brand_frequency[value] += 1
    else:
        brand_frequency[value] = 1
# Find % of each car brand
total = sum(brand_frequency.values())
percentages = {}
for key, value in brand_frequency.items():
    percentages[key] = value * 100 / total
# Narrow down to brands above 2%
brands_above_2_percent = {}
for key, value in percentages.items():
    if value > 2:
        brands_above_2_percent[key] = value
# Show dictionary
brands_above_2_percent
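The loops above reimplement what value_counts can do in one pass. A vectorized sketch of the same logic on a small synthetic Series (a 20% cutoff stands in for the real 2%):

```python
import pandas as pd

brands = pd.Series(['vw', 'bmw', 'vw', 'opel', 'vw', 'bmw', 'ford', 'vw'])
# Share of each brand as a percentage of all listings
shares = brands.value_counts(normalize=True) * 100
# Keep only brands above the cutoff
print(shares[shares > 20].to_dict())
# → {'vw': 50.0, 'bmw': 25.0}
```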
I'm going to make this information a little easier to view by putting it into a bar chart:
# Plot Bar chart
plt.bar(brands_above_2_percent.keys(), brands_above_2_percent.values())
plt.ylabel('% of cars sold')
plt.xlabel('Brands')
plt.xticks(rotation=90)
plt.title('Most popular car brands')
plt.show()
Here we can see that Volkswagen is by far the most popular car brand on German eBay. It is followed by 4 more German brands: BMW, Mercedes, Audi and Opel. The most popular non-German brand is Ford.
Now let's have a look at the average price of these popular car brands:
# Find average price of each brand
mean_brands = {}
brand_names = autos['brand'].unique()
for brand in brand_names:
    selected_rows = autos[autos['brand'] == brand]
    mean_brands[brand] = selected_rows['price($)'].mean()
# Filter out brands with less than 2 percent of listings
popular_brands = brands_above_2_percent.keys()
mean_price_popular_brands = {}
for key, value in mean_brands.items():
    if key in popular_brands:
        mean_price_popular_brands[key] = value
#Show dictionary
mean_price_popular_brands
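The same per-brand averages can be computed without an explicit loop using groupby; a sketch on a synthetic frame:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['vw', 'bmw', 'vw', 'bmw'],
    'price($)': [4000, 8000, 6000, 10000],
})
# Mean price per brand, returned as a Series indexed by brand
mean_prices = df.groupby('brand')['price($)'].mean()
print(mean_prices.to_dict())
# → {'bmw': 9000.0, 'vw': 5000.0}
```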
Again I will put this information into a bar chart to make it more readable.
#Plot Bar Chart
plt.bar(mean_price_popular_brands.keys(), mean_price_popular_brands.values())
plt.title('Mean price of Popular car brands')
plt.xticks(rotation=90)
plt.ylabel('Price ($)')
plt.xlabel('brands')
plt.show()
We can see that the three luxury German brands are the three most expensive, with a fourth German brand, Volkswagen, coming next. After that come Ford and Peugeot, followed by Renault, Opel and Fiat. So BMW, Mercedes and Audi are not only among the most popular brands but are also the three most expensive.
Next, let's take a look at the average mileage of each of these brands. As with the price, I will plot the results in a bar chart.
# Find average mileage of car brands
mean_mileage_brands = {}
brand_names = autos['brand'].unique()
for brand in brand_names:
    selected_rows = autos[autos['brand'] == brand]
    mean_mileage_brands[brand] = selected_rows['odometer(km)'].mean()
# Filter out brands with less than 2 percent of listings
popular_brands = brands_above_2_percent.keys()
mean_mileage_popular_brands = {}
for key, value in mean_mileage_brands.items():
    if key in popular_brands:
        mean_mileage_popular_brands[key] = value
mean_mileage_popular_brands
This data is easier to visualize in a bar chart:
# Plot bar chart
plt.bar(mean_mileage_popular_brands.keys(), mean_mileage_popular_brands.values())
plt.ylim(110000, 140000)
plt.xticks(rotation=90)
plt.ylabel('Average Mileage(km)')
plt.xlabel('brands')
plt.title('Average mileage of most popular brands')
plt.show()
We can see here that on average each brand has a similar mileage on the clock. BMW has the highest average and Fiat the lowest, but on the whole there is little difference between brands, with the majority averaging between 125,000 and 135,000 km.
I now want to see whether there is a relationship between the average price and the average mileage of these popular brands.
# Create pandas Series for mean price and mean mileage
mean_price_series = pd.Series(mean_price_popular_brands)
mean_mileage_series = pd.Series(mean_mileage_popular_brands)
# Create dataframe using these 2 series
price_mileage_dataframe = pd.DataFrame(mean_price_series, columns=['mean_price'])
price_mileage_dataframe['mean_mileage'] = mean_mileage_series
# Show mean price and mileage side by side
price_mileage_dataframe
This time, for a visualization of the data, I will plot a scatter chart with the mean mileage on the x axis and the mean price on the y axis.
# Plot scatter chart
ax = price_mileage_dataframe.plot(x='mean_mileage', y='mean_price', kind='scatter', title='Mileage x price')
We would expect the cars with higher mileage to cost less. However, in the top right-hand corner we can see that the three highest-priced brands (BMW, Mercedes, Audi) also have the highest average mileage. This is largely a result of the average mileage being very similar across brands. All we can conclude here is that the price differences between brands are not explained by mileage but by other factors.
In order to see the difference that mileage makes to price, we need to look at all of the cars in the dataset individually and see if there is a pattern between mileage and average price.
# Create dictionary with mileage as the key and average price as the value
price_mileage = {}
for miles in autos['odometer(km)'].unique():
    selected_rows = autos[autos['odometer(km)'] == miles]
    price_mileage[str(miles)] = selected_rows['price($)'].mean()
# Build a new dictionary with the keys in ascending numerical order
sorted_dict = {key: price_mileage[key] for key in sorted(price_mileage, key=int)}
# Show dictionary
sorted_dict
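The dictionary building and sorting above could also be collapsed into a single groupby; since groupby sorts its group keys by default, the numeric mileage index comes back already in ascending order. A sketch on synthetic data:

```python
import pandas as pd

df = pd.DataFrame({
    'odometer(km)': [150000, 5000, 150000, 90000],
    'price($)': [3000, 12000, 5000, 7000],
})
# Mean price per mileage grouping, with group keys in ascending order
mean_by_mileage = df.groupby('odometer(km)')['price($)'].mean()
print(mean_by_mileage.to_dict())
# → {5000: 12000.0, 90000: 7000.0, 150000: 4000.0}
```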
Now to put the dictionary into a bar chart for a better visualization.
# Plot bar chart showing average price and average mileage
plt.bar(sorted_dict.keys(), sorted_dict.values())
plt.xticks(rotation=90)
plt.ylabel('Average Price($)')
plt.xlabel('Mileage (km)')
plt.title('Average price of different mileage groupings')
plt.show()
The results here are clear: the more kilometres on the clock, the less a car will cost. The one outlier is cars with very few kilometres on the clock (around 5,000 km).
There could be numerous reasons why the cars with the lowest mileage fall outside this trend. Perhaps people only sell cars with low mileage when there is something wrong with them. Let's finish by having a look at the relationship between average price and whether a car has unrepaired damage:
# Split cars between damaged and undamaged
damaged = autos[autos['unrepaired_damage'] == 'yes']
undamaged = autos[autos['unrepaired_damage'] == 'no']
# Find mean price for damaged and undamaged cars
damaged_mean_price = damaged['price($)'].mean()
undamaged_mean_price = undamaged['price($)'].mean()
# Plot bar chart to visualize data
plt.bar(['damaged', 'undamaged'], [damaged_mean_price, undamaged_mean_price])
plt.title('Average price of damaged and undamaged cars')
plt.ylabel('Price($)')
plt.show()
The results here are very clear. A damaged car is on average 5000 dollars cheaper than an undamaged car.
3 Main Takeaways
BMW, Mercedes and Audi are the kings of the German auto eBay market: they are among the most popular brands and, on average, the three most expensive. With Volkswagen the most popular brand of all, it is clear that Germans like to buy German.
Mileage has a significant effect on the price of vehicles: the more kilometres on the clock, the less expensive the car tends to be.
The same is true of cars with unrepaired damage: on average, a car with unrepaired damage costs 5,000 dollars less than one without.