Exploring Ebay Car Sales Data

In this project, we'll work with a we'll work with a modified dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The original dataset can be found here.

The data dictionary provided with the data is as follows.

  • dateCrawled - When this ad was first crawled. All field-values are taken from this date.
  • name - Name of the car.
  • seller - Whether the seller is private or a dealer.
  • offerType - The type of listing
  • price - The price on the ad to sell the car.
  • abtest - Whether the listing is included in an A/B test.
  • vehicleType - The vehicle Type.
  • yearOfRegistration - The year in which the car was first registered.
  • gearbox - The transmission type.
  • powerPS - The power of the car in PS.
  • model - The car model name.
  • kilometer - How many kilometers the car has driven.
  • monthOfRegistration - The month in which the car was first registered.
  • fuelType - What type of fuel the car uses.
  • brand - The brand of the car.
  • notRepairedDamage - If the car has a damage which is not yet repaired.
  • dateCreated - The date on which the eBay listing was created.
  • nrOfPictures - The number of pictures in the ad.
  • postalCode - The postal code for the location of the vehicle.
  • lastSeenOnline - When the crawler saw this ad last online.

Summary of Results

In this project we found:

  • The vast majority of (97.93%) of vehicles were registerd within the past 30 years with nearly 30% registered within the past decade and nearly 10% within the past 5 years.
  • The top brands by listing were as follows:
    • Volkswagen (21.21%)
    • BMW (11.00%)
    • Opel (10.82%)
    • Mercedes Benz (9.54%)
    • Audi (8.64%)
    • Ford (6.98%)
  • Mercedes Benz vehicles are by far the most expensive out our top brands, on average costing three times more than the second most expensive brand, Audi.
  • We also see that the three most expensive top brands (Mercedes Benz, Audi, BMW) make up nearly 30% of all listings, which could indicate there is not a strong correlation between price and popularity
  • There is no apparent relationship between the mean price and mean mileage amongst vehicles of the top brands.

Project

Let's start by importing our libraries and reading in the data.

In [1]:
# Import Libraries
import numpy as np
import pandas as pd
from pprint import pprint

# Read in data
autos = pd.read_csv('autos.csv', encoding='Latin-1')
In [2]:
# Quick exploration of the data
autos
Out[2]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon privat Angebot $24,900 control limousine 2011 automatik 239 q5 100,000km 1 diesel audi nein 2016-03-27 00:00:00 0 82131 2016-04-01 13:47:40
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot $1,980 control cabrio 1996 manuell 75 astra 150,000km 5 benzin opel nein 2016-03-28 00:00:00 0 44807 2016-04-02 14:18:02
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot $13,200 test cabrio 2014 automatik 69 500 5,000km 11 benzin fiat nein 2016-04-02 00:00:00 0 73430 2016-04-04 11:47:27
49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition privat Angebot $22,900 control kombi 2013 manuell 150 a3 40,000km 11 diesel audi nein 2016-03-08 00:00:00 0 35683 2016-04-05 16:45:07
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V privat Angebot $1,250 control limousine 1996 manuell 101 vectra 150,000km 1 benzin opel nein 2016-03-13 00:00:00 0 45897 2016-04-06 21:18:48

50000 rows × 20 columns

In [3]:
# Further exploration
autos.info()
print('\n')
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB


Out[3]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

Initial Observations

From the above code cells we can see the following:

  • Our dataset contains 20 columns.
  • Our dataset contains a total of 50,000 entries.
  • Our data is made up of two data types: int64 and object.
  • The column names use camelcase and will need to be changed to snakecase.

The columns vehicleType, gearbox, model, fuelType and notRepairedDamage all contain null values.

Editing Column Names

In [4]:
# Print out existing column names
pprint(autos.columns)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')
In [5]:
# Rename columns
autos.rename({
    'dateCrawled': 'date_crawled',  
    'offerType': 'offer_type',  
    'abtest': 'ab_test',
    'vehicleType': 'vehicle_type', 
    'yearOfRegistration': 'registration_year',  
    'powerPS': 'power_ps', 
    'monthOfRegistration': 'registration_month', 
    'fuelType': 'fuel_type', 
    'notRepairedDamage': 'unrepaired_damage', 
    'dateCreated': 'ad_created', 
    'nrOfPictures': 'no_of_pics', 
    'postalCode': 'postal_code',
    'lastSeen': 'last_seen'
    },
    axis=1,
    inplace=True
)

# Display new column headings
autos.head()
Out[5]:
date_crawled name seller offer_type price ab_test vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created no_of_pics postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

In the code cell above we converted the column names from camelcase to snakecase and reworded some of the column names based on the data dictionary to be more descriptive.

Determining what cleaning tasks need to be performed

In [6]:
# Get descriptive statistics for all columns.
autos.describe(include='all')
Out[6]:
date_crawled name seller offer_type price ab_test vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created no_of_pics postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-11 22:38:16 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

Here we can see the following:

  • The columns seller and offer_type both contain mostly 1 value and will not be useful to our analysis.
  • The column no_of_pics only has 1 value and will not be useful to our analysis.
  • The price and odometer columns are numeric values stored as text.

Based on this we will remove the seller, offer_type and no_of_pics columns, and we will convert the price and odometer columns to a numeric data type.

In [7]:
# Remove unecessary columns
autos.drop(columns=['seller', 'offer_type', 'no_of_pics'], inplace=True)

# Clean and convert price column.
autos['price'] = (autos['price']
                  .str.replace('$','')
                  .str.replace(',','')
                  .astype(float)
                 )

# Clean and convert price column.
autos['odometer'] = (autos['odometer'].str.replace('km','')
                     .str.replace(',','')
                     .astype(int)
                    )

# Rename price and odometer columns to make the more descriptive
autos.rename({'price':'price_usd', 'odometer':'odometer_km'}
             ,axis=1
             ,inplace=True
            )

# Output updated data
autos.head()
Out[7]:
date_crawled name price_usd ab_test vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 5000.0 control bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 2016-03-26 00:00:00 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 8500.0 control limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 2016-04-04 00:00:00 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 8990.0 test limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 2016-03-26 00:00:00 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4350.0 control kleinwagen 2007 automatik 71 fortwo 70000 6 benzin smart nein 2016-03-12 00:00:00 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 1350.0 test kombi 2003 manuell 0 focus 150000 7 benzin ford nein 2016-04-01 00:00:00 39218 2016-04-01 14:38:50

Exploring the odometer_km column

In [8]:
odometer = autos['odometer_km']

# Display discriptive statistics of odometer_km column
odometer.describe()
Out[8]:
count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

From this initial exploration it does not looks as though there are any unrealistic odemeter readings.

In [9]:
# Further exploration
print(odometer.unique())
print('\n')
print(odometer.value_counts().sort_index(ascending=False))
[150000  70000  50000  80000  10000  30000 125000  90000  20000  60000
   5000 100000  40000]


150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
40000       819
30000       789
20000       784
10000       264
5000        967
Name: odometer_km, dtype: int64

Further exploration confirms that the data in this column does not require further cleaning.

Exploring the price column

In [10]:
prices = autos['price_usd']

# Display discriptive statistics of price_usd column
prices.describe()
Out[10]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_usd, dtype: float64

From this initial exploration we can see that there are clearly errors within the price_usd column data. For example, the max price of 100,000,000 USD is too high and the min price of 0 USD is too low.

Let's do some further exploraion.

In [11]:
# Determine top 20 prices
top_20_prices = prices.value_counts().sort_index(ascending=False).head(20)
print(f"Top 20 prices:\n{top_20_prices}") 
Top 20 prices:
99999999.0    1
27322222.0    1
12345678.0    3
11111111.0    2
10000000.0    1
3890000.0     1
1300000.0     1
1234566.0     1
999999.0      2
999990.0      1
350000.0      1
345000.0      1
299000.0      1
295000.0      1
265000.0      1
259000.0      1
250000.0      1
220000.0      1
198000.0      1
197000.0      1
Name: price_usd, dtype: int64

Pricing above 350,000 appears to be random key entries or erroneous data and will thus be considered as outliers.

In [12]:
# Determine bottom 20 prices
bottom_20_prices = prices.value_counts().sort_index(ascending=True).head(20)
print(f"Bottom 20 prices:\n{bottom_20_prices}") 
Bottom 20 prices:
0.0     1421
1.0      156
2.0        3
3.0        1
5.0        2
8.0        1
9.0        1
10.0       7
11.0       2
12.0       3
13.0       2
14.0       1
15.0       2
17.0       3
18.0       1
20.0       4
25.0       5
29.0       1
30.0       7
35.0       1
Name: price_usd, dtype: int64

It seems unlikely that there would be 1421 random '0' key entries, however we will treat vehicles with zero cost as erroneous data and will thus remove them from our data.

In [13]:
# Updating autos by removing erroneus data.
autos[prices.between(1, 350000)]
Out[13]:
date_crawled name price_usd ab_test vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 5000.0 control bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 2016-03-26 00:00:00 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 8500.0 control limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 2016-04-04 00:00:00 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 8990.0 test limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 2016-03-26 00:00:00 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4350.0 control kleinwagen 2007 automatik 71 fortwo 70000 6 benzin smart nein 2016-03-12 00:00:00 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 1350.0 test kombi 2003 manuell 0 focus 150000 7 benzin ford nein 2016-04-01 00:00:00 39218 2016-04-01 14:38:50
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon 24900.0 control limousine 2011 automatik 239 q5 100000 1 diesel audi nein 2016-03-27 00:00:00 82131 2016-04-01 13:47:40
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... 1980.0 control cabrio 1996 manuell 75 astra 150000 5 benzin opel nein 2016-03-28 00:00:00 44807 2016-04-02 14:18:02
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge 13200.0 test cabrio 2014 automatik 69 500 5000 11 benzin fiat nein 2016-04-02 00:00:00 73430 2016-04-04 11:47:27
49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition 22900.0 control kombi 2013 manuell 150 a3 40000 11 diesel audi nein 2016-03-08 00:00:00 35683 2016-04-05 16:45:07
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V 1250.0 control limousine 1996 manuell 101 vectra 150000 1 benzin opel nein 2016-03-13 00:00:00 45897 2016-04-06 21:18:48

48565 rows × 17 columns

Distribution of Date Values

Below we calculate the distrbution of values in the date_crawled, ad_created, and last_seen columns.

In [14]:
# Calculating date_crawled distribution.
date_crawled = autos['date_crawled'].str[:10]
print(date_crawled.value_counts(normalize=True, dropna=False).sort_index())
print("\n\n")

# Sort according to distribution.
print(date_crawled.value_counts(sort=True, dropna=False))
2016-03-05    0.02538
2016-03-06    0.01394
2016-03-07    0.03596
2016-03-08    0.03330
2016-03-09    0.03322
2016-03-10    0.03212
2016-03-11    0.03248
2016-03-12    0.03678
2016-03-13    0.01556
2016-03-14    0.03662
2016-03-15    0.03398
2016-03-16    0.02950
2016-03-17    0.03152
2016-03-18    0.01306
2016-03-19    0.03490
2016-03-20    0.03782
2016-03-21    0.03752
2016-03-22    0.03294
2016-03-23    0.03238
2016-03-24    0.02910
2016-03-25    0.03174
2016-03-26    0.03248
2016-03-27    0.03104
2016-03-28    0.03484
2016-03-29    0.03418
2016-03-30    0.03362
2016-03-31    0.03192
2016-04-01    0.03380
2016-04-02    0.03540
2016-04-03    0.03868
2016-04-04    0.03652
2016-04-05    0.01310
2016-04-06    0.00318
2016-04-07    0.00142
Name: date_crawled, dtype: float64



2016-04-03    1934
2016-03-20    1891
2016-03-21    1876
2016-03-12    1839
2016-03-14    1831
2016-04-04    1826
2016-03-07    1798
2016-04-02    1770
2016-03-19    1745
2016-03-28    1742
2016-03-29    1709
2016-03-15    1699
2016-04-01    1690
2016-03-30    1681
2016-03-08    1665
2016-03-09    1661
2016-03-22    1647
2016-03-11    1624
2016-03-26    1624
2016-03-23    1619
2016-03-10    1606
2016-03-31    1596
2016-03-25    1587
2016-03-17    1576
2016-03-27    1552
2016-03-16    1475
2016-03-24    1455
2016-03-05    1269
2016-03-13     778
2016-03-06     697
2016-04-05     655
2016-03-18     653
2016-04-06     159
2016-04-07      71
Name: date_crawled, dtype: int64

We can see that all ads were crawled between 2016-03-05 and 2016-04-07. We can also see that 2016-04-03 was the day with the highest activity with a totl of 1934 ads crawled.

In [15]:
# Calculating ad_created distribution.
ad_created = autos['ad_created'].str[:10]
print(ad_created.value_counts(normalize=True, dropna=False).sort_index())
print("\n\n")

# Sort according to distribution.
print(ad_created.value_counts(sort=True, dropna=False))
2015-06-11    0.00002
2015-08-10    0.00002
2015-09-09    0.00002
2015-11-10    0.00002
2015-12-05    0.00002
               ...   
2016-04-03    0.03892
2016-04-04    0.03688
2016-04-05    0.01184
2016-04-06    0.00326
2016-04-07    0.00128
Name: ad_created, Length: 76, dtype: float64



2016-04-03    1946
2016-03-20    1893
2016-03-21    1886
2016-04-04    1844
2016-03-12    1831
              ... 
2016-01-22       1
2016-01-07       1
2016-01-14       1
2016-01-29       1
2016-02-08       1
Name: ad_created, Length: 76, dtype: int64

All ads were created between 2015-06-11 and 2016-04-07. We see that 2016-04-03 was the day with the highest number of ads created which is the same date on which the most ads were crawled.

In [16]:
# Calculating last_seen distribution.
last_seen = autos['last_seen'].str[:10]
print(last_seen.value_counts(normalize=True, dropna=False).sort_index())
print("\n\n")

# Sort according to distribution.
print(last_seen.value_counts(sort=True, dropna=False))
2016-03-05    0.00108
2016-03-06    0.00442
2016-03-07    0.00536
2016-03-08    0.00760
2016-03-09    0.00986
2016-03-10    0.01076
2016-03-11    0.01252
2016-03-12    0.02382
2016-03-13    0.00898
2016-03-14    0.01280
2016-03-15    0.01588
2016-03-16    0.01644
2016-03-17    0.02792
2016-03-18    0.00742
2016-03-19    0.01574
2016-03-20    0.02070
2016-03-21    0.02074
2016-03-22    0.02158
2016-03-23    0.01858
2016-03-24    0.01956
2016-03-25    0.01920
2016-03-26    0.01696
2016-03-27    0.01602
2016-03-28    0.02086
2016-03-29    0.02234
2016-03-30    0.02484
2016-03-31    0.02384
2016-04-01    0.02310
2016-04-02    0.02490
2016-04-03    0.02536
2016-04-04    0.02462
2016-04-05    0.12428
2016-04-06    0.22100
2016-04-07    0.13092
Name: last_seen, dtype: float64



2016-04-06    11050
2016-04-07     6546
2016-04-05     6214
2016-03-17     1396
2016-04-03     1268
2016-04-02     1245
2016-03-30     1242
2016-04-04     1231
2016-03-31     1192
2016-03-12     1191
2016-04-01     1155
2016-03-29     1117
2016-03-22     1079
2016-03-28     1043
2016-03-21     1037
2016-03-20     1035
2016-03-24      978
2016-03-25      960
2016-03-23      929
2016-03-26      848
2016-03-16      822
2016-03-27      801
2016-03-15      794
2016-03-19      787
2016-03-14      640
2016-03-11      626
2016-03-10      538
2016-03-09      493
2016-03-13      449
2016-03-08      380
2016-03-18      371
2016-03-07      268
2016-03-06      221
2016-03-05       54
Name: last_seen, dtype: int64

We see that the date with the highest frequency in th last_seen column was 2016-04-06 which could possibly indicated this was the date when the most sales occurred. However, this is speculative as there is not enough data to make a conclusive statement.

In [17]:
# Display descriptive statistics on registration_year column
autos['registration_year'].describe()
Out[17]:
count    50000.000000
mean      2005.073280
std        105.712813
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The data above makes it clear that there is erroneous data within this column as it is not possible for a car to have been registered in 1000 or 9999.

We will explore this column further in order to decide how best to clean it.

In [18]:
# Display 30 earliest registration years
print(autos['registration_year'].value_counts().sort_index().head(30))

# Display 30 latest registration years
print('\n\n')
print(autos['registration_year'].value_counts().sort_index().tail(30))
1000     1
1001     1
1111     1
1500     1
1800     2
1910     9
1927     1
1929     1
1931     1
1934     2
1937     4
1938     1
1939     1
1941     2
1943     1
1948     1
1950     3
1951     2
1952     1
1953     1
1954     2
1955     2
1956     5
1957     2
1958     4
1959     7
1960    34
1961     6
1962     4
1963     9
Name: registration_year, dtype: int64



2001    2703
2002    2533
2003    2727
2004    2737
2005    3015
2006    2708
2007    2304
2008    2231
2009    2098
2010    1597
2011    1634
2012    1323
2013     806
2014     666
2015     399
2016    1316
2017    1453
2018     492
2019       3
2800       1
4100       1
4500       1
4800       1
5000       4
5911       1
6200       1
8888       1
9000       2
9996       1
9999       4
Name: registration_year, dtype: int64

We will remove entries with vehicle registration years earlier than 1910 as the amount of vehicles within this range are negligible and the registration years themselves are unrealistic or impossible.

We will also remove vehicles with a registration year later than 2016 as it is impossible for a car to have been registered after the listing had alreeady been seen.

In [19]:
# Removing erroneous entries
autos = autos[(autos['registration_year'] >= 1910) & (autos['registration_year'] <= 2016)]
autos['registration_year'].describe()
Out[19]:
count    48028.00000
mean      2002.80351
std          7.31085
min       1910.00000
25%       1999.00000
50%       2003.00000
75%       2008.00000
max       2016.00000
Name: registration_year, dtype: float64
In [20]:
# Display updated distribution
autos['registration_year'].value_counts(normalize=True).head(30)
Out[20]:
2000    0.069834
2005    0.062776
1999    0.062464
2004    0.056988
2003    0.056779
2006    0.056384
2001    0.056280
2002    0.052740
1998    0.051074
2007    0.047972
2008    0.046452
2009    0.043683
1997    0.042225
2011    0.034022
2010    0.033251
1996    0.030066
2012    0.027546
2016    0.027401
1995    0.027338
2013    0.016782
2014    0.013867
1994    0.013742
1993    0.009265
2015    0.008308
1990    0.008224
1992    0.008141
1991    0.007412
1989    0.003769
1988    0.002957
1985    0.002186
Name: registration_year, dtype: float64
In [21]:
# Display distribution of vehicles reistered within the past 30 years.
print((autos['registration_year']
 .value_counts(normalize=True)
 .sort_index(ascending=False)
 .head(30)
 .sum()
))

# Display distribution of vehicles reistered within the past 30 years.
print((autos['registration_year']
 .value_counts(normalize=True)
 .sort_index(ascending=False)
 .head(10)
 .sum()
))

# Display distribution of vehicles reistered within the past 30 years.
print((autos['registration_year']
 .value_counts(normalize=True)
 .sort_index(ascending=False)
 .head(5)
 .sum()
))
0.9793037394853003
0.29928375114516537
0.093903556258849

The vast majority of (97.93%) of vehicles were registerd within the past 30 years with nearly 30% registered within the past decade and nearly 10% within the past 5 years.

Exploration of brand column

In [22]:
# Display descriptive statistics for brand column
print(autos['brand'].describe())

# Display brand value counts
autos['brand'].value_counts(normalize=True)
count          48028
unique            40
top       volkswagen
freq           10188
Name: brand, dtype: object
Out[22]:
volkswagen        0.212126
bmw               0.110019
opel              0.108166
mercedes_benz     0.095361
audi              0.086387
ford              0.069793
renault           0.047347
peugeot           0.029524
fiat              0.025860
seat              0.018177
skoda             0.016032
mazda             0.015137
nissan            0.015095
citroen           0.013929
smart             0.013909
toyota            0.012472
sonstige_autos    0.010952
hyundai           0.009848
volvo             0.009245
mini              0.008641
mitsubishi        0.008141
honda             0.007850
kia               0.007100
alfa_romeo        0.006621
porsche           0.006101
suzuki            0.005913
chevrolet         0.005705
chrysler          0.003665
dacia             0.002561
daihatsu          0.002561
jeep              0.002249
subaru            0.002186
land_rover        0.002040
saab              0.001603
jaguar            0.001582
trabant           0.001562
daewoo            0.001499
rover             0.001353
lancia            0.001083
lada              0.000604
Name: brand, dtype: float64

As we can see there are 40 unique brands in our dataset, with Volkswagen being the brand with the highest frequency.

Top Brands

Next we will select the brands that constitute more than 5% of the total values, and then we will aggregate their price data.

In [23]:
# Create list to store top brands
brand_counts = autos['brand'].value_counts(normalize=True)
top_brands = brand_counts[brand_counts > .05].index
        
# Display top_brands
print(f"Top Brands: \n{list(top_brands)}")
    
Top Brands: 
['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford']

The brands which meet our ciriteris are:

  • Volkswagen (21.21%)
  • BMW (11.00%)
  • Opel (10.82%)
  • Mercedes Benz (9.54%)
  • Audi (8.64%)
  • Ford (6.98%)

We can see that Volkswagen is the top brand, making up over a fifth of the listed vehicles, and having nearly double the second most listed brand BMW.

Next we will aggregate their price data.

Aggregating Price Data

In [24]:
# Create dictionary to store aggregate data.
top_brands_mean_price = {}

# Selecting data.
for brand in top_brands:
    selected_rows = autos[autos['brand'] == brand]
    avg_price = selected_rows['price_usd'].mean()
    top_brands_mean_price[brand] = avg_price

# Sort brands by average price in descending order.
top_brands_sorted = sorted(top_brands_mean_price.items()
                       ,key=lambda x: x[1]
                       ,reverse=True
                      )

# Display average prices.
for brand, price in top_brands_sorted:
    print(f"{brand}: ${round(price, 2)}")
mercedes_benz: $30317.45
audi: $9093.65
bmw: $8334.65
ford: $7263.02
volkswagen: $6516.46
opel: $5252.62

We can see that Mercedes Benz vehicles are by far the most expensive out our top brands, on average costing three times more than the second most expensive brand Audi. However, it still makes up nearly 10% of all vehicles in our dataset which could indicate one or both of the following:

  • Many people are tying to sell their Mercedes Benz vehicles,
  • There is a high demand for Mercedes Benz vehicles

More research would need to be done inorder to determine which of these two options most closely represents reality.

We also see that the three most expensive top brands (Mercedes Benz, Audi, BMW) make up nearly 30% of all listings, which could indicate there is not a strong correlation between price and popularity. However, more research would be required to determine whetether or not this is true.

Next we will aggregate our mileage data for each of the top brands to see if there is any visible link between it and our mean price data.

Aggregating Mean Mileage Data for Comparison to Mean Price Data

In [25]:
# Create dictionary to store aggregate data.
top_brands_mean_mileage = {}

# Selecting data.
for brand in top_brands:
    selected_rows = autos[autos['brand'] == brand]
    mean_mileage = selected_rows['odometer_km'].mean()
    top_brands_mean_mileage[brand] = mean_mileage

In order to easily compare mean mileage and mean price for each of the top brands we will create a new DataFrame object containing the relevant data.

Comparing Mean Mileage and Mean Price for Top Brands

In [46]:
# Convert top_brands_mean_price dictionary into a Series
tbmp_series = pd.Series(top_brands_mean_price)
tbmp_series = tbmp_series.round(2)

# Convert top_brands_mean_mileage dictionary into a Series
tbmm_series = pd.Series(top_brands_mean_mileage)
tbmm_series = tbmm_series.round(2)

# Create new DataFrame to combine two series objects
tb_mean_price_mileage = pd.DataFrame(tbmp_series, columns=['mean_price'])
tb_mean_price_mileage['mean_mileage'] = tbmm_series   

# Display tb_mean_price_mileage sorted by price.
tb_mean_price_mileage.sort_values('mean_price', ascending=False)
Out[46]:
mean_price mean_mileage
mercedes_benz 30317.45 130860.26
audi 9093.65 129287.78
bmw 8334.65 132434.71
ford 7263.02 124046.84
volkswagen 6516.46 128730.37
opel 5252.62 129227.14

The table above makes it clear that while there is a wide range of mean prices amongst the top brands (\$5252.62 - \\$30,317.45) there is a very low range for the mean mileages (124,046.84km - 132,434.71km).

This indicates that there is no meaningful relationship between the mean prices and mean mileages amongst our top brands.

Conclusion

Findings

  • The vast majority of (97.93%) of vehicles were registerd within the past 30 years with nearly 30% registered within the past decade and nearly 10% within the past 5 years.
  • The top brands by listings:
    • Volkswagen (21.21%)
    • BMW (11.00%)
    • Opel (10.82%)
    • Mercedes Benz (9.54%)
    • Audi (8.64%)
    • Ford (6.98%)
  • The Germans clearly love local automakers, with German brands taking the top 5 spots and making up over 61.21% of all listings.
  • We can see that Mercedes Benz vehicles are by far the most expensive amongst the top brands, on average costing three times more than the second most expensive brand Audi. However, it still makes up nearly 10% of all vehicles in the dataset which indicates many people are tying to sell their Mercedes Benz vehicles. More research would need to be done in order to state whether or not the demand for Mercedes Benz vehicles is greater or less than the supply.
  • We also see that the three most expensive top brands (Mercedes Benz, Audi, BMW) make up nearly 30% of all listings, which could indicate there is not a strong correlation between price and popularity. However, more research would be required to determine whetether or not this is true.
  • There is no apparent relationship between the mean price and mean mileage amongst vehicles of the top brands.

Possible Improvements

  • Find out how the demand for vehicles compares to their supply.
  • See which brand's vehicles have the longest and shortest listing times on average.
  • Find the most common brand/model combination.
  • Translate German words into English
  • Split the odometer_km into groups, and use aggregation to see if average prices follows any patterns based on the mileage.
  • Find out how much cheaper cars with damage are than their non-damaged counterparts.