This analysis will make use of Pandas.
The dataset contains used car listings scraped from eBay Kleinanzeigen, a classifieds section of the German eBay site.
# Mount Google Drive.
from google.colab import drive
drive.mount("/content/gdrive", force_remount=True)
#drive.flush_and_unmount()
# Find out where the data is.
!ls "gdrive/MyDrive/Colab Notebooks/gp3_autos/"
import pandas as pd
import numpy as np
autos = pd.read_csv('gdrive/MyDrive/Colab Notebooks/gp3_autos/autos.csv', encoding="Latin-1")
autos.info()
autos.head()
autos["name"].value_counts()
autos["offerType"].value_counts()
NaN values in the string columns are problematic.
Datetime columns are stored as plain strings rather than datetime values.
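A quick way to quantify the missing values before cleaning, for example:
# Count NaN values in each column to see which string columns are affected.
print(autos.isnull().sum().sort_values(ascending=False))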
autos.dtypes
autos.describe(include='all')
autos["seller"].value_counts()
autos["offerType"].value_counts()
autos["nrOfPictures"].value_counts()
autos["vehicleType"].iloc[0:22]
autos['gearbox'].value_counts()
autos['brand'].value_counts()
The following steps will be done in the cleaning process:
Some column names use camelCase and are generally hard to read, so they will be renamed to snake_case.
# Keep a copy of the old column names for reference if needed.
columns_old = autos.columns.copy()
print(columns_old)
columns_new = ['date_crawled',        # to snake_case
               'name',                # no change
               'seller',              # no change
               'offer_type',          # to snake_case
               'price',               # no change
               'ab_test',             # to snake_case
               'vehicle_type',        # to snake_case
               'registration_year',   # clarity
               'gearbox',             # no change
               'power_ps',            # to snake_case
               'model',               # no change
               'odometer_km',         # more accurate
               'registration_month',  # clarity
               'fuel_type',           # to snake_case
               'brand',               # no change
               'unrepaired_damage',   # clarity
               'ad_created',          # to snake_case
               'num_pictures',        # clarity and to snake_case
               'postcode',            # common usage
               'last_seen']           # to snake_case
# Assign a copy of the new names (keeping columns_new intact for reference).
autos.columns = columns_new.copy()
autos.columns
"num_pictures", "seller" and "offer_type" should be dropped.
autos = autos.drop(columns=["num_pictures", "seller", "offer_type"], axis=1)
autos.columns
autos["registration_year"].describe()
Confirm there are years well out of range.
autos["registration_year"].value_counts().head()
Confirm our desired values are the dominant values.
# initial_year_distribution = autos["registration_year"].value_counts().sort_index()
# initial_year_distribution.plot()
Plot the initial distribution, which we will crop to an approximately normal range.
range_bool = autos["registration_year"].between(1980,2020)
final_distribution = autos[range_bool]
final_distribution["registration_year"].value_counts().sort_index().plot()
Cars registered between 1980 and 2020 create a reasonable bell-curve-shaped distribution with a few interesting features.
autos["registration_year"] = final_distribution["registration_year"]
Finally, write the filtered values back to the main dataset; rows with out-of-range years become NaN.
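Because the filtered frame is only a subset, we can count how many rows fell outside the accepted range, for example:
# Rows with out-of-range registration years were not carried over and are now NaN.
print(autos["registration_year"].isnull().sum())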
autos["price"].describe()
Confirm max and min values are outliers.
autos["price"].value_counts().sort_index().plot()
Plot the initial price distribution, which we will crop to an approximately normal range.
price_bool = autos["price"].between(1,45000)
final_price_distribution = autos[price_bool]
final_price_distribution["price"].value_counts().sort_index().plot()
Not sure why this isn't charting properly; the heavy skew towards low prices may be the cause.
autos["price"] = final_price_distribution["price"]
Write the new slice back to the main dataset anyway.
autos["price"].describe()
Confirm we have a good distribution even if we can't chart it properly.
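If the value_counts line plot is hard to read, a histogram of the cropped prices may chart more clearly; a quick sketch:
# Histogram of the cropped price range; increase bins for finer detail.
final_price_distribution["price"].plot.hist(bins=50)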
There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary:
date_crawled : added by the crawler : string
last_seen : added by the crawler : string
ad_created : from the website : string
registration_month : from the website : int64
registration_year : from the website : float
print(autos["date_crawled"].dtypes)
print(autos["last_seen"].dtypes)
print(autos["ad_created"].dtypes)
print(autos["registration_month"].dtypes)
print(autos["registration_year"].dtypes)
autos[['date_crawled','ad_created','last_seen']][0:5]
print(autos['date_crawled'].str[:10].value_counts(dropna=False).sort_index())
It is clear the crawl covered quite a tight date range of roughly one month.
This column should be cast as datetime.
autos['date_crawled'] = pd.to_datetime(autos['date_crawled'])
print(autos["date_crawled"].dtype)
print(autos['ad_created'].str[:10].value_counts(dropna=False).sort_index())
"ad_created" has values outside the range of "date_crawled".
Investigate if these rows are valid.
bool_2015 = autos["ad_created"].str[:4] == "2015"
slice_2015 = autos[bool_2015]
slice_2015.head()
These appear to be adverts that were created long before the crawl but were still live, and the rows are otherwise valid.
Cast as datetime.
autos['ad_created'] = pd.to_datetime(autos['ad_created'])
print(autos["ad_created"].dtype)
print(autos['last_seen'].str[:10].value_counts(dropna=False).sort_index())
"last_seen" has exactly the same date range as "date_crawled" suggesting this was done as part of the crawl criteria.
Cast as datetime.
autos['last_seen'] = pd.to_datetime(autos['last_seen'])
print(autos["last_seen"].dtype)
autos["registration_month"].describe()
A month value of "0" is not very helpful.
If this is the month of registration of the vehicle, it's hard to imagine any analysis would use this.
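A quick count of the zero placeholder, for example:
# Number of listings where the registration month is recorded as 0 (unknown).
print((autos["registration_month"] == 0).sum())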
Drop "registration_month"
autos = autos.drop(columns=["registration_month"], axis=1)
autos.columns
One fruitful area of analysis will be to look at brands and see whether brand correlates with sale price, listing frequency, and time spent listed on eBay.
To start, it's necessary to explore this series in more detail.
autos["brand"].value_counts(dropna=False, normalize=True).head(16)
The series is very clean, with no null values and no duplicate brands from spelling mistakes; the values were probably selected from a drop-down rather than typed by the user.
Many of the brands are rarely sold, making up less than 1% of listings.
"sonstige_autos" means "other cars" so we can eliminate that because it's vague.
Therefore, we will only aggregate the top 16 most frequently listed brands.
mean_price_agg = {}
top_16_index = autos["brand"].value_counts(dropna=False, normalize=True).head(16)
for brand in top_16_index.index:
    brand_slice = autos[autos["brand"] == brand]
    agg_mean = brand_slice["price"].mean()
    mean_price_agg[brand] = round(agg_mean)
print(mean_price_agg)
mean_price_series = pd.Series(mean_price_agg)
mean_price_df = pd.DataFrame(mean_price_series, columns=["mean_price"])
mean_price_df = mean_price_df.sort_values(by=["mean_price"], ascending=False)
mean_price_df
mean_price_df.sort_values(by=["mean_price"], ascending=True).plot.barh()
This aggregation shows what you would expect for vehicle depreciation.
German cars cost more, and hold their value much better than French cars.
Further aggregation analysis could look at the relationship between price and mileage:
Audi, BMW, and Mercedes are luxury brands, so a high price is expected, but Skoda, Volkswagen and Toyota are more affordable brands whose prices nonetheless remain high.
This is especially true for Skoda, which appears to be in the wrong grouping.
It's possible that the prestige cars are higher mileage, or that Skodas hold their value only when their mileage is low.
To prove this hypothesis, we can construct a dataframe that allows us to visualise price and odometer values together.
We can create a new aggregation dictionary, then construct and join dataframes.
First, we should investigate this series, since we haven't examined it yet.
autos["odometer_km"].describe()
autos["odometer_km"].value_counts()
"odometer_km" seems nearly categorical it's so perfect.
mean_odo_agg = {}
top_16_index = autos["brand"].value_counts(dropna=False, normalize=True).head(16)
for brand in top_16_index.index:
    brand_slice = autos[autos["brand"] == brand]
    mean_odo = brand_slice["odometer_km"].mean()
    # Divide by 10 so the odometer values sit on a scale comparable to the
    # mean prices when both are plotted together below.
    mean_odo_agg[brand] = round(mean_odo / 10)
print(mean_odo_agg)
mean_odo_series = pd.Series(mean_odo_agg)
mean_odo_df = pd.DataFrame(mean_odo_series, columns=["mean_odometer"])
mean_odo_df.sort_index()
mean_price_df.sort_index()
join_them = [mean_odo_df, mean_price_df]
mean_odo_price_df = pd.concat(join_them, axis=1)
mean_odo_price_df.sort_values(by="mean_price", ascending=False)
import matplotlib.pyplot as plt
mean_odo_price_df.sort_values(by="mean_price", ascending=False).head(6).plot(kind='bar', stacked=False)
As expected, the cheaper brands in this group tend to have lower mean "odometer_km" values, which supports the idea that affordable brands such as Skoda and Toyota only command high prices at lower mileage. In other words, high values for "odometer_km" suppress "price", so there is an inverse relationship between mileage and price at the level of individual cars.
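To check the direction of the relationship across individual listings rather than brand means, one could compute the correlation directly; a sketch using the cleaned columns:
# Pearson correlation between price and mileage across listings (NaN rows are ignored).
print(autos[["price", "odometer_km"]].corr())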