In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The dataset was originally scraped and uploaded to Kaggle, and a few modifications have since been made to that original version.
The aim of this project is to clean the data and analyze the included used car listings.
import numpy as np
import pandas as pd
#Let's read in the dataset
autos = pd.read_csv('autos.csv', encoding='Latin-1')
autos.info()
autos.head()
In the cells above we read the autos dataset into a dataframe.
Let's convert the column names from camelCase to snake_case and reword some of them, based on the data dictionary, to be more descriptive.
#Let's get the names of the columns as they are:
autos.columns
#We edit the column names to snake_case:
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'n_pictures', 'postal_code',
'last_seen']
autos.head()
In the cells above we renamed the columns of the dataframe to snake_case, which is easier to work with.
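For reference, most of the camelCase-to-snake_case conversion could also be done programmatically. Below is a minimal sketch (the camel_to_snake helper is hypothetical; columns like 'nrOfPictures' would still need the manual, data-dictionary-based renames used above):
import re

def camel_to_snake(name):
    #Insert an underscore before each uppercase letter, then lowercase the result
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

camel_to_snake('vehicleType')  #returns 'vehicle_type'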
Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for:
- text columns where all or almost all values are the same, since these can often be dropped
- numeric data stored as text, which needs to be cleaned and converted
autos.describe(include='all')
autos['seller'].value_counts()
#We see that there is only one entry which isn't from a private person.
autos['offer_type'].value_counts()
#Only one entry differs.
autos['price'].value_counts()
#This column needs additional clean up.
autos['odometer'].value_counts()
#This column needs additional clean up.
autos['n_pictures'].value_counts()
#We see that all ads are without pictures. Maybe we can drop that column.
From the above it looks like the 'seller', 'offer_type' and 'n_pictures' columns can be dropped. It also seems like the 'odometer' and 'price' columns need additional clean up - they hold numeric values stored as text. Let's clean them up:
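Before that, we can drop the all-zero n_pictures column right away. This is just a sketch of that cleanup; we keep seller and offer_type for now because their German values are translated to English later in this project.
#Every listing has n_pictures equal to 0, so the column carries no information
autos = autos.drop('n_pictures', axis=1)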
#First we will clean the price column:
#regex=False ensures '$' is treated as a literal character rather than a regex anchor
autos['price'] = autos['price'].str.replace('$', '', regex=False)
autos['price'] = autos['price'].str.replace(',', '', regex=False)
autos['price'] = autos['price'].astype(int)
autos['price'].head()
#Next we will clean the odometer column:
autos['odometer'] = autos['odometer'].str.replace('km', '')
autos['odometer'] = autos['odometer'].str.replace(',', '')
autos['odometer'] = autos['odometer'].astype(float)
autos['odometer'].head()
#Let's rename the odometer column, so that it's clear that it is in km:
autos = autos.rename({'odometer':'odometer_km'}, axis=1)
Now that we've cleaned up the price and odometer columns, let's explore them further.
#Let's see how many unique values there are:
autos['price'].unique().shape
#Let's view min/max/median/mean etc
autos['price'].describe()
autos['price'].value_counts().sort_index(ascending=True).head(40)
#We see the 40 lowest prices and how many cars are priced at them.
autos['price'].value_counts().sort_index(ascending=False).head(20)
#We see the 20 highest prices and how many cars are priced at them.
Looking at the highest prices, it seems there's a huge jump from $350,000 to $999,990. In fact, all prices above $350,000 look as if they are incorrect, so let's drop those rows. Additionally, we see that there are 1421 cars priced at $0. Let's drop these as well.
#Keep only listings priced between $1 and $350,000
autos = autos[autos['price'].between(1, 350000)]
autos['price'].describe()
#Let's see how many unique values there are:
autos['odometer_km'].unique().shape
#Let's view min/max/median/mean etc
autos['odometer_km'].describe()
autos['odometer_km'].value_counts(ascending=True).head(10)
autos['odometer_km'].value_counts(ascending=False).head()
The odometer values appear rounded, which might mean that sellers had to choose from a preset list rather than enter the exact mileage of their cars.
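A quick check consistent with that hypothesis (a preset list implies only a handful of distinct values):
#Count the distinct odometer readings
autos['odometer_km'].nunique()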
There are 5 columns that should represent date values. Some of these columns were created by the crawler, and some came from the website itself. Right now, the date_crawled, last_seen, and ad_created columns are identified by pandas as strings, so we need to convert them into a numerical representation before we can understand them quantitatively. The other two columns, registration_year and registration_month, are already numeric.
Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values, like so:
autos[['date_crawled','ad_created','last_seen']][0:5]
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
We see that the ads were crawled between March 5th, 2016 and April 7th, 2016. It seems that the ads were crawled almost daily.
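One way to sanity-check the 'almost daily' claim is to count the distinct crawl dates within that window:
#Number of distinct dates on which listings were crawled
autos['date_crawled'].str[:10].nunique()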
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
We see that the oldest ads were created on June 11th, 2015 and the newest on April 7th, 2016.
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
The crawler records when it last saw each listing, which lets us estimate when a car was sold. Keep in mind that the last few days of the crawl will naturally account for a disproportionate share of last_seen values, simply because the crawler stopped then, not because sales spiked.
autos['registration_year'].describe()
autos['registration_year'].value_counts(normalize=True).sort_index()
There are values in the registration_year column that seem incorrect: some years predate the invention of the automobile, while others lie in the future.
Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.
autos['registration_year'].shape[0] - autos['registration_year'].between(1900, 2016).sum()
We see that 1884 out of the 48565 listings fall outside of the 1900-2016 registration year interval. Given that this accounts for less than 4% of the data, it is safe to remove those rows.
autos = autos[autos['registration_year'].between(1900, 2016)]
autos['registration_year'].value_counts(normalize=True).head()
It looks like the majority of the cars were registered between 1999 and 2005.
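We can back that claim up by summing the normalized counts for those years, as in the sketch below:
#Share of the remaining listings registered between 1999 and 2005
reg_years = autos['registration_year'].value_counts(normalize=True).sort_index()
reg_years.loc[1999:2005].sum()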
(autos['brand'].value_counts(normalize=True)*100)
Unsurprisingly, German brands make up about 60% of all listings. Volkswagen is the top brand, with roughly as many listings as the next two brands, BMW and Opel, combined. Many brands are barely represented, so for the sake of the analysis we will focus on those that make up more than 5% of the listings.
#Let's isolate the top brands
brand_counts = (autos['brand'].value_counts(normalize=True)*100)
top_brands = brand_counts[brand_counts > 5].index
top_brands
brand_mean_prices = {}
for brand in top_brands:
    #Average listing price for this brand
    by_brand = autos[autos['brand'] == brand]
    mean_price = by_brand['price'].mean()
    brand_mean_prices[brand] = int(mean_price)
print(brand_mean_prices)
Based on the average prices we see that Audi, BMW and Mercedes-Benz are the more expensive top brands, Opel and Ford are noticeably cheaper, and Volkswagen sits in between.
km_per_brand = {}
for brand in top_brands:
    #Average mileage of the listings for this brand
    by_brand = autos[autos['brand'] == brand]
    mean_km = by_brand['odometer_km'].mean()
    km_per_brand[brand] = int(mean_km)
km_per_brand
Above we created a dictionary in which we see the average mileage per brand. Now let's build a dataframe so that we can easily compare the average price and average mileage per brand and see if we can find any correlation.
mean_prices = pd.Series(brand_mean_prices)
mean_mileage = pd.Series(km_per_brand)
brand_info = pd.DataFrame(mean_prices, columns=['mean_prices'])
brand_info['mean_mileage'] = mean_mileage
brand_info
It is interesting to see that even though BMW and Mercedes cars have a higher average mileage, their prices are still higher than those of the other brands.
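To put a rough number on this relationship, we can compute the correlation between the two columns (a sketch; with only a handful of brands the figure is indicative at best):
#Pearson correlation between average price and average mileage across top brands
brand_info['mean_prices'].corr(brand_info['mean_mileage'])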
In this guided project we explored used car listings scraped from eBay Kleinanzeigen, the classifieds section of the German eBay website. We cleaned up the data and looked for a correlation between the average price and average mileage of the top brands.
Data cleaning next steps:
- identify categorical data that uses German words and translate it to English
- convert the dates to be uniform numeric data

Analysis next steps:
- find the most common brand/model combinations
- see if average prices follow any pattern based on mileage
- determine how much cheaper cars with damage are than their non-damaged counterparts
Let's identify columns which have data entered in German and translate it into English. We already know that the 'seller' and 'offer_type' columns contain German words, so let's start with them.
autos['seller'] = autos['seller'].str.replace('privat', 'private').str.replace('gewerblich', 'commercial')
autos['seller'].value_counts()
autos['offer_type'] = autos['offer_type'].str.replace('Angebot', 'offer')
autos['offer_type'].value_counts()
autos['vehicle_type'].value_counts()
autos['vehicle_type'] = (autos['vehicle_type']
    .str.replace('kleinwagen', 'small car')
    .str.replace('kombi', 'combi')
    .str.replace('cabrio', 'convertible')
    .str.replace('andere', 'other'))
autos['vehicle_type'].value_counts()
autos['gearbox'].value_counts()
autos['gearbox'] = autos['gearbox'].str.replace('manuell', 'manual').str.replace('automatik', 'automatic')
autos['gearbox'].value_counts()
autos['fuel_type'].value_counts()
autos['fuel_type'] = autos['fuel_type'].str.replace('benzin', 'petrol').str.replace('elektro', 'electric').str.replace('andere', 'other')
autos['fuel_type'].value_counts()
autos['unrepaired_damage'].value_counts()
autos['unrepaired_damage'] = autos['unrepaired_damage'].str.replace('ja', 'yes').str.replace('nein', 'no')
autos['unrepaired_damage'].value_counts()
Now, let's convert the dates to be uniform numeric data, so "2016-03-21" becomes the integer 20160321. The columns containing dates are: 'date_crawled', 'ad_created' and 'last_seen'
autos['date_crawled'] = autos['date_crawled'].str[:10]
autos['date_crawled'] = autos['date_crawled'].str.replace('-', '')
#Cast to int (not float) so '2016-03-21' becomes the integer 20160321
autos['date_crawled'] = autos['date_crawled'].astype(int)
autos['date_crawled'].value_counts().head()
autos['ad_created'] = autos['ad_created'].str[:10]
autos['ad_created'] = autos['ad_created'].str.replace('-', '')
autos['ad_created'] = autos['ad_created'].astype(int)
autos['ad_created'].value_counts().head()
autos['last_seen'] = autos['last_seen'].str[:10]
autos['last_seen'] = autos['last_seen'].str.replace('-', '')
autos['last_seen'] = autos['last_seen'].astype(int)
autos['last_seen'].value_counts().head()
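The three conversions above repeat the same steps; a small helper (hypothetical, named date_str_to_int here) could have consolidated them:
def date_str_to_int(col):
    #Keep the 'YYYY-MM-DD' prefix, strip the dashes and cast to int,
    #so '2016-03-21 00:00:00' becomes 20160321
    return col.str[:10].str.replace('-', '').astype(int)
#e.g. autos['ad_created'] = date_str_to_int(autos['ad_created'])
#(shown for reference only; the three columns are already converted above)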
Let's try to find the most common brand/model combination for the top brands:
brand_model = {}
for brand in top_brands:
    #describe() on an object column reports the most frequent value as 'top'
    by_brand = autos[autos['brand'] == brand]
    top_model = by_brand['model'].describe()['top']
    brand_model[brand] = top_model
brand_model
#Below we see the most common model per brand
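For reference, the same mapping can be produced with a groupby aggregation instead of an explicit loop (a sketch):
#Most frequent model per top brand
(autos[autos['brand'].isin(top_brands)]
     .groupby('brand')['model']
     .agg(lambda models: models.value_counts().idxmax()))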
Now let's see if the average prices follow any pattern based on mileage.
ranges = autos['odometer_km'].value_counts(bins=5).index
ranges
ranges_tuple = ranges.to_tuples()
km_price = {}
for r in ranges_tuple:
    #Average price of the listings whose mileage falls within this range
    by_km = autos[autos['odometer_km'].between(r[0], r[1])]
    price = by_km['price'].mean()
    km_price[r] = int(price)
km_price
In the above two cells we did the following:
- split the odometer_km values into 5 equal-width ranges
- calculated the average listing price for each range
We can conclude that the lower the mileage of the car, the higher the price.
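For reference, the same comparison can be written more idiomatically with pd.cut and groupby (a sketch producing the mean price per mileage bin):
#Bin the mileage into 5 equal-width ranges and average the price per bin
mileage_bins = pd.cut(autos['odometer_km'], bins=5)
autos.groupby(mileage_bins)['price'].mean()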
How much cheaper are cars with damage than their non-damaged counterparts?
#dropna() excludes listings where unrepaired_damage is missing
damage = autos['unrepaired_damage'].dropna().unique()
damaged_price = {}
for d in damage:
    #Average price for damaged ('yes') and undamaged ('no') cars
    by_d = autos[autos['unrepaired_damage'] == d]
    price = by_d['price'].mean()
    damaged_price[d] = price
damaged_price
On average, cars with unrepaired damage are 30% cheaper than undamaged cars.
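That percentage can be checked directly from the dictionary built above (a sketch of the arithmetic; the keys are 'yes' and 'no' after the earlier translation):
#Relative discount of damaged cars versus undamaged ones
discount = 1 - damaged_price['yes'] / damaged_price['no']
print('{:.0%}'.format(discount))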