In this project, we'll work with a we'll work with a modified dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The original dataset can be found here.
The data dictionary provided with the data is as follows.
In this project we found:
The vast majority of (97.93%) of vehicles were registerd within the past 30 years with nearly 30% registered within the past decade and nearly 10% within the past 5 years.
The top brands by listing were as follows:
Mercedes Benz vehicles are by far the most expensive out our top brands, on average costing three times more than the second most expensive brand, Audi.
We also see that the three most expensive top brands (Mercedes Benz, Audi, BMW) make up nearly 30% of all listings, which could indicate there is not a strong correlation between price and popularity
There is no apparent relationship between the mean price and mean mileage amongst vehicles of the top brands.
Let's start by importing our libraries and reading in the data.
# Import Libraries
import numpy as np
import pandas as pd
from pprint import pprint
# Read in data
autos = pd.read_csv('autos.csv', encoding='Latin-1')
# Quick exploration of the data
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
# Further exploration
autos.info()
print('\n')
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
From the above code cells we can see the following:
The columns vehicleType
, gearbox
, model
, fuelType
and notRepairedDamage
all contain null values.
# Print out existing column names
pprint(autos.columns)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# Rename columns
autos.rename({
'dateCrawled': 'date_crawled',
'offerType': 'offer_type',
'abtest': 'ab_test',
'vehicleType': 'vehicle_type',
'yearOfRegistration': 'registration_year',
'powerPS': 'power_ps',
'monthOfRegistration': 'registration_month',
'fuelType': 'fuel_type',
'notRepairedDamage': 'unrepaired_damage',
'dateCreated': 'ad_created',
'nrOfPictures': 'no_of_pics',
'postalCode': 'postal_code',
'lastSeen': 'last_seen'
},
axis=1,
inplace=True
)
# Display new column headings
autos.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | no_of_pics | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
In the code cell above we converted the column names from camelcase to snakecase and reworded some of the column names based on the data dictionary to be more descriptive.
# Get descriptive statistics for all columns.
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | no_of_pics | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-11 22:38:16 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Here we can see the following:
seller
and offer_type
both contain mostly 1 value and will not be useful to our analysis.no_of_pics
only has 1 value and will not be useful to our analysis.price
and odometer
columns are numeric values stored as text.Based on this we will remove the seller
, offer_type
and no_of_pics
columns, and we will convert the price
and odometer
columns to a numeric data type.
# Remove unecessary columns
autos.drop(columns=['seller', 'offer_type', 'no_of_pics'], inplace=True)
# Clean and convert price column.
autos['price'] = (autos['price']
.str.replace('$','')
.str.replace(',','')
.astype(float)
)
# Clean and convert price column.
autos['odometer'] = (autos['odometer'].str.replace('km','')
.str.replace(',','')
.astype(int)
)
# Rename price and odometer columns to make the more descriptive
autos.rename({'price':'price_usd', 'odometer':'odometer_km'}
,axis=1
,inplace=True
)
# Output updated data
autos.head()
date_crawled | name | price_usd | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000.0 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500.0 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990.0 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350.0 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350.0 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
odometer_km
column¶odometer = autos['odometer_km']
# Display discriptive statistics of odometer_km column
odometer.describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
From this initial exploration it does not looks as though there are any unrealistic odemeter readings.
# Further exploration
print(odometer.unique())
print('\n')
print(odometer.value_counts().sort_index(ascending=False))
[150000 70000 50000 80000 10000 30000 125000 90000 20000 60000 5000 100000 40000] 150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 40000 819 30000 789 20000 784 10000 264 5000 967 Name: odometer_km, dtype: int64
Further exploration confirms that the data in this column does not require further cleaning.
price
column¶prices = autos['price_usd']
# Display discriptive statistics of price_usd column
prices.describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price_usd, dtype: float64
From this initial exploration we can see that there are clearly errors within the price_usd
column data. For example, the max price of 100,000,000 USD is too high and the min price of 0 USD is too low.
Let's do some further exploraion.
# Determine top 20 prices
top_20_prices = prices.value_counts().sort_index(ascending=False).head(20)
print(f"Top 20 prices:\n{top_20_prices}")
Top 20 prices: 99999999.0 1 27322222.0 1 12345678.0 3 11111111.0 2 10000000.0 1 3890000.0 1 1300000.0 1 1234566.0 1 999999.0 2 999990.0 1 350000.0 1 345000.0 1 299000.0 1 295000.0 1 265000.0 1 259000.0 1 250000.0 1 220000.0 1 198000.0 1 197000.0 1 Name: price_usd, dtype: int64
Pricing above 350,000 appears to be random key entries or erroneous data and will thus be considered as outliers.
# Determine bottom 20 prices
bottom_20_prices = prices.value_counts().sort_index(ascending=True).head(20)
print(f"Bottom 20 prices:\n{bottom_20_prices}")
Bottom 20 prices: 0.0 1421 1.0 156 2.0 3 3.0 1 5.0 2 8.0 1 9.0 1 10.0 7 11.0 2 12.0 3 13.0 2 14.0 1 15.0 2 17.0 3 18.0 1 20.0 4 25.0 5 29.0 1 30.0 7 35.0 1 Name: price_usd, dtype: int64
It seems unlikely that there would be 1421 random '0' key entries, however we will treat vehicles with zero cost as erroneous data and will thus remove them from our data.
# Updating autos by removing erroneus data.
autos[prices.between(1, 350000)]
date_crawled | name | price_usd | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000.0 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500.0 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990.0 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350.0 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350.0 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | 24900.0 | control | limousine | 2011 | automatik | 239 | q5 | 100000 | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | 1980.0 | control | cabrio | 1996 | manuell | 75 | astra | 150000 | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | 13200.0 | test | cabrio | 2014 | automatik | 69 | 500 | 5000 | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | 22900.0 | control | kombi | 2013 | manuell | 150 | a3 | 40000 | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | 1250.0 | control | limousine | 1996 | manuell | 101 | vectra | 150000 | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 45897 | 2016-04-06 21:18:48 |
48565 rows × 17 columns
Below we calculate the distrbution of values in the date_crawled
, ad_created
, and last_seen
columns.
# Calculating date_crawled distribution.
date_crawled = autos['date_crawled'].str[:10]
print(date_crawled.value_counts(normalize=True, dropna=False).sort_index())
print("\n\n")
# Sort according to distribution.
print(date_crawled.value_counts(sort=True, dropna=False))
2016-03-05 0.02538 2016-03-06 0.01394 2016-03-07 0.03596 2016-03-08 0.03330 2016-03-09 0.03322 2016-03-10 0.03212 2016-03-11 0.03248 2016-03-12 0.03678 2016-03-13 0.01556 2016-03-14 0.03662 2016-03-15 0.03398 2016-03-16 0.02950 2016-03-17 0.03152 2016-03-18 0.01306 2016-03-19 0.03490 2016-03-20 0.03782 2016-03-21 0.03752 2016-03-22 0.03294 2016-03-23 0.03238 2016-03-24 0.02910 2016-03-25 0.03174 2016-03-26 0.03248 2016-03-27 0.03104 2016-03-28 0.03484 2016-03-29 0.03418 2016-03-30 0.03362 2016-03-31 0.03192 2016-04-01 0.03380 2016-04-02 0.03540 2016-04-03 0.03868 2016-04-04 0.03652 2016-04-05 0.01310 2016-04-06 0.00318 2016-04-07 0.00142 Name: date_crawled, dtype: float64 2016-04-03 1934 2016-03-20 1891 2016-03-21 1876 2016-03-12 1839 2016-03-14 1831 2016-04-04 1826 2016-03-07 1798 2016-04-02 1770 2016-03-19 1745 2016-03-28 1742 2016-03-29 1709 2016-03-15 1699 2016-04-01 1690 2016-03-30 1681 2016-03-08 1665 2016-03-09 1661 2016-03-22 1647 2016-03-11 1624 2016-03-26 1624 2016-03-23 1619 2016-03-10 1606 2016-03-31 1596 2016-03-25 1587 2016-03-17 1576 2016-03-27 1552 2016-03-16 1475 2016-03-24 1455 2016-03-05 1269 2016-03-13 778 2016-03-06 697 2016-04-05 655 2016-03-18 653 2016-04-06 159 2016-04-07 71 Name: date_crawled, dtype: int64
We can see that all ads were crawled between 2016-03-05 and 2016-04-07. We can also see that 2016-04-03 was the day with the highest activity with a totl of 1934 ads crawled.
# Calculating ad_created distribution.
ad_created = autos['ad_created'].str[:10]
print(ad_created.value_counts(normalize=True, dropna=False).sort_index())
print("\n\n")
# Sort according to distribution.
print(ad_created.value_counts(sort=True, dropna=False))
2015-06-11 0.00002 2015-08-10 0.00002 2015-09-09 0.00002 2015-11-10 0.00002 2015-12-05 0.00002 ... 2016-04-03 0.03892 2016-04-04 0.03688 2016-04-05 0.01184 2016-04-06 0.00326 2016-04-07 0.00128 Name: ad_created, Length: 76, dtype: float64 2016-04-03 1946 2016-03-20 1893 2016-03-21 1886 2016-04-04 1844 2016-03-12 1831 ... 2016-01-22 1 2016-01-07 1 2016-01-14 1 2016-01-29 1 2016-02-08 1 Name: ad_created, Length: 76, dtype: int64
All ads were created between 2015-06-11 and 2016-04-07. We see that 2016-04-03 was the day with the highest number of ads created which is the same date on which the most ads were crawled.
# Calculating last_seen distribution.
last_seen = autos['last_seen'].str[:10]
print(last_seen.value_counts(normalize=True, dropna=False).sort_index())
print("\n\n")
# Sort according to distribution.
print(last_seen.value_counts(sort=True, dropna=False))
2016-03-05 0.00108 2016-03-06 0.00442 2016-03-07 0.00536 2016-03-08 0.00760 2016-03-09 0.00986 2016-03-10 0.01076 2016-03-11 0.01252 2016-03-12 0.02382 2016-03-13 0.00898 2016-03-14 0.01280 2016-03-15 0.01588 2016-03-16 0.01644 2016-03-17 0.02792 2016-03-18 0.00742 2016-03-19 0.01574 2016-03-20 0.02070 2016-03-21 0.02074 2016-03-22 0.02158 2016-03-23 0.01858 2016-03-24 0.01956 2016-03-25 0.01920 2016-03-26 0.01696 2016-03-27 0.01602 2016-03-28 0.02086 2016-03-29 0.02234 2016-03-30 0.02484 2016-03-31 0.02384 2016-04-01 0.02310 2016-04-02 0.02490 2016-04-03 0.02536 2016-04-04 0.02462 2016-04-05 0.12428 2016-04-06 0.22100 2016-04-07 0.13092 Name: last_seen, dtype: float64 2016-04-06 11050 2016-04-07 6546 2016-04-05 6214 2016-03-17 1396 2016-04-03 1268 2016-04-02 1245 2016-03-30 1242 2016-04-04 1231 2016-03-31 1192 2016-03-12 1191 2016-04-01 1155 2016-03-29 1117 2016-03-22 1079 2016-03-28 1043 2016-03-21 1037 2016-03-20 1035 2016-03-24 978 2016-03-25 960 2016-03-23 929 2016-03-26 848 2016-03-16 822 2016-03-27 801 2016-03-15 794 2016-03-19 787 2016-03-14 640 2016-03-11 626 2016-03-10 538 2016-03-09 493 2016-03-13 449 2016-03-08 380 2016-03-18 371 2016-03-07 268 2016-03-06 221 2016-03-05 54 Name: last_seen, dtype: int64
We see that the date with the highest frequency in th last_seen
column was 2016-04-06 which could possibly indicated this was the date when the most sales occurred. However, this is speculative as there is not enough data to make a conclusive statement.
# Display descriptive statistics on registration_year column
autos['registration_year'].describe()
count 50000.000000 mean 2005.073280 std 105.712813 min 1000.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The data above makes it clear that there is erroneous data within this column as it is not possible for a car to have been registered in 1000 or 9999.
We will explore this column further in order to decide how best to clean it.
# Display 30 earliest registration years
print(autos['registration_year'].value_counts().sort_index().head(30))
# Display 30 latest registration years
print('\n\n')
print(autos['registration_year'].value_counts().sort_index().tail(30))
1000 1 1001 1 1111 1 1500 1 1800 2 1910 9 1927 1 1929 1 1931 1 1934 2 1937 4 1938 1 1939 1 1941 2 1943 1 1948 1 1950 3 1951 2 1952 1 1953 1 1954 2 1955 2 1956 5 1957 2 1958 4 1959 7 1960 34 1961 6 1962 4 1963 9 Name: registration_year, dtype: int64 2001 2703 2002 2533 2003 2727 2004 2737 2005 3015 2006 2708 2007 2304 2008 2231 2009 2098 2010 1597 2011 1634 2012 1323 2013 806 2014 666 2015 399 2016 1316 2017 1453 2018 492 2019 3 2800 1 4100 1 4500 1 4800 1 5000 4 5911 1 6200 1 8888 1 9000 2 9996 1 9999 4 Name: registration_year, dtype: int64
We will remove entries with vehicle registration years earlier than 1910 as the amount of vehicles within this range are negligible and the registration years themselves are unrealistic or impossible.
We will also remove vehicles with a registration year later than 2016 as it is impossible for a car to have been registered after the listing had alreeady been seen.
# Removing erroneous entries
autos = autos[(autos['registration_year'] >= 1910) & (autos['registration_year'] <= 2016)]
autos['registration_year'].describe()
count 48028.00000 mean 2002.80351 std 7.31085 min 1910.00000 25% 1999.00000 50% 2003.00000 75% 2008.00000 max 2016.00000 Name: registration_year, dtype: float64
# Display updated distribution
autos['registration_year'].value_counts(normalize=True).head(30)
2000 0.069834 2005 0.062776 1999 0.062464 2004 0.056988 2003 0.056779 2006 0.056384 2001 0.056280 2002 0.052740 1998 0.051074 2007 0.047972 2008 0.046452 2009 0.043683 1997 0.042225 2011 0.034022 2010 0.033251 1996 0.030066 2012 0.027546 2016 0.027401 1995 0.027338 2013 0.016782 2014 0.013867 1994 0.013742 1993 0.009265 2015 0.008308 1990 0.008224 1992 0.008141 1991 0.007412 1989 0.003769 1988 0.002957 1985 0.002186 Name: registration_year, dtype: float64
# Display distribution of vehicles reistered within the past 30 years.
print((autos['registration_year']
.value_counts(normalize=True)
.sort_index(ascending=False)
.head(30)
.sum()
))
# Display distribution of vehicles reistered within the past 30 years.
print((autos['registration_year']
.value_counts(normalize=True)
.sort_index(ascending=False)
.head(10)
.sum()
))
# Display distribution of vehicles reistered within the past 30 years.
print((autos['registration_year']
.value_counts(normalize=True)
.sort_index(ascending=False)
.head(5)
.sum()
))
0.9793037394853003 0.29928375114516537 0.093903556258849
The vast majority of (97.93%) of vehicles were registerd within the past 30 years with nearly 30% registered within the past decade and nearly 10% within the past 5 years.
brand
column¶# Display descriptive statistics for brand column
print(autos['brand'].describe())
# Display brand value counts
autos['brand'].value_counts(normalize=True)
count 48028 unique 40 top volkswagen freq 10188 Name: brand, dtype: object
volkswagen 0.212126 bmw 0.110019 opel 0.108166 mercedes_benz 0.095361 audi 0.086387 ford 0.069793 renault 0.047347 peugeot 0.029524 fiat 0.025860 seat 0.018177 skoda 0.016032 mazda 0.015137 nissan 0.015095 citroen 0.013929 smart 0.013909 toyota 0.012472 sonstige_autos 0.010952 hyundai 0.009848 volvo 0.009245 mini 0.008641 mitsubishi 0.008141 honda 0.007850 kia 0.007100 alfa_romeo 0.006621 porsche 0.006101 suzuki 0.005913 chevrolet 0.005705 chrysler 0.003665 dacia 0.002561 daihatsu 0.002561 jeep 0.002249 subaru 0.002186 land_rover 0.002040 saab 0.001603 jaguar 0.001582 trabant 0.001562 daewoo 0.001499 rover 0.001353 lancia 0.001083 lada 0.000604 Name: brand, dtype: float64
As we can see there are 40 unique brands in our dataset, with Volkswagen being the brand with the highest frequency.
Next we will select the brands that constitute more than 5% of the total values, and then we will aggregate their price data.
# Create list to store top brands
brand_counts = autos['brand'].value_counts(normalize=True)
top_brands = brand_counts[brand_counts > .05].index
# Display top_brands
print(f"Top Brands: \n{list(top_brands)}")
Top Brands: ['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford']
The brands which meet our ciriteris are:
We can see that Volkswagen is the top brand, making up over a fifth of the listed vehicles, and having nearly double the second most listed brand BMW.
Next we will aggregate their price data.
# Create dictionary to store aggregate data.
top_brands_mean_price = {}
# Selecting data.
for brand in top_brands:
selected_rows = autos[autos['brand'] == brand]
avg_price = selected_rows['price_usd'].mean()
top_brands_mean_price[brand] = avg_price
# Sort brands by average price in descending order.
top_brands_sorted = sorted(top_brands_mean_price.items()
,key=lambda x: x[1]
,reverse=True
)
# Display average prices.
for brand, price in top_brands_sorted:
print(f"{brand}: ${round(price, 2)}")
mercedes_benz: $30317.45 audi: $9093.65 bmw: $8334.65 ford: $7263.02 volkswagen: $6516.46 opel: $5252.62
We can see that Mercedes Benz vehicles are by far the most expensive out our top brands, on average costing three times more than the second most expensive brand Audi. However, it still makes up nearly 10% of all vehicles in our dataset which could indicate one or both of the following:
More research would need to be done inorder to determine which of these two options most closely represents reality.
We also see that the three most expensive top brands (Mercedes Benz, Audi, BMW) make up nearly 30% of all listings, which could indicate there is not a strong correlation between price and popularity. However, more research would be required to determine whetether or not this is true.
Next we will aggregate our mileage data for each of the top brands to see if there is any visible link between it and our mean price data.
# Create dictionary to store aggregate data.
top_brands_mean_mileage = {}
# Selecting data.
for brand in top_brands:
selected_rows = autos[autos['brand'] == brand]
mean_mileage = selected_rows['odometer_km'].mean()
top_brands_mean_mileage[brand] = mean_mileage
In order to easily compare mean mileage and mean price for each of the top brands we will create a new DataFrame object containing the relevant data.
# Convert top_brands_mean_price dictionary into a Series
tbmp_series = pd.Series(top_brands_mean_price)
tbmp_series = tbmp_series.round(2)
# Convert top_brands_mean_mileage dictionary into a Series
tbmm_series = pd.Series(top_brands_mean_mileage)
tbmm_series = tbmm_series.round(2)
# Create new DataFrame to combine two series objects
tb_mean_price_mileage = pd.DataFrame(tbmp_series, columns=['mean_price'])
tb_mean_price_mileage['mean_mileage'] = tbmm_series
# Display tb_mean_price_mileage sorted by price.
tb_mean_price_mileage.sort_values('mean_price', ascending=False)
mean_price | mean_mileage | |
---|---|---|
mercedes_benz | 30317.45 | 130860.26 |
audi | 9093.65 | 129287.78 |
bmw | 8334.65 | 132434.71 |
ford | 7263.02 | 124046.84 |
volkswagen | 6516.46 | 128730.37 |
opel | 5252.62 | 129227.14 |
The table above makes it clear that while there is a wide range of mean prices amongst the top brands (\$5252.62 - \$30,317.45) there is a very low range for the mean mileages (124,046.84km - 132,434.71km).
This indicates that there is no meaningful relationship between the mean prices and mean mileages amongst our top brands.
The vast majority of (97.93%) of vehicles were registerd within the past 30 years with nearly 30% registered within the past decade and nearly 10% within the past 5 years.
The top brands by listings:
The Germans clearly love local automakers, with German brands taking the top 5 spots and making up over 61.21% of all listings.
We can see that Mercedes Benz vehicles are by far the most expensive amongst the top brands, on average costing three times more than the second most expensive brand Audi. However, it still makes up nearly 10% of all vehicles in the dataset which indicates many people are tying to sell their Mercedes Benz vehicles. More research would need to be done in order to state whether or not the demand for Mercedes Benz vehicles is greater or less than the supply.
We also see that the three most expensive top brands (Mercedes Benz, Audi, BMW) make up nearly 30% of all listings, which could indicate there is not a strong correlation between price and popularity. However, more research would be required to determine whetether or not this is true.
There is no apparent relationship between the mean price and mean mileage amongst vehicles of the top brands.