We're working with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
The dataset was originally scraped and uploaded to Kaggle by user orgesleka. The original dataset isn't available on Kaggle anymore, but you can find it here.
A few modifications were made to the original dataset:
Let's start by importing the libraries we need and reading the dataset into pandas.
import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv", encoding='Latin-1')
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
From the above output we can see that our dataset contains 50,000 rows and 20 columns.
# Inspecting further with info() and head()
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
Most of our columns are objects, while yearOfRegistration, powerPS, monthOfRegistration, nrOfPictures and postalCode are int64.
We can make the following observations:
Let's convert the column names from camelCase to snake_case and reword some of the column names, based on the data dictionary, to be more descriptive.
print(autos.columns)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
autos.columns = ['datecrawled', 'name', 'seller', 'offertype', 'price', 'abtest',
'vehicletype', 'registration_year', 'gearbox', 'powerps', 'model',
'odometer', 'registration_month', 'fueltype', 'brand',
'unrepaired_damage', 'ad_created', 'nrofpictures', 'postalcode',
'lastseen']
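The rename above uses an explicit list. As a hypothetical alternative (not part of the original notebook), the camelCase-to-snake_case conversion can also be done programmatically with a regular expression:

```python
import re

def camel_to_snake(name):
    # Insert an underscore before each interior uppercase letter, then lowercase
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

print(camel_to_snake('dateCrawled'))          # date_crawled
print(camel_to_snake('yearOfRegistration'))   # year_of_registration
```

Note this produces fully underscored names, whereas the notebook simply lowercases most of them; either convention works as long as it is applied consistently.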
autos.head()
datecrawled | name | seller | offertype | price | abtest | vehicletype | registration_year | gearbox | powerps | model | odometer | registration_month | fueltype | brand | unrepaired_damage | ad_created | nrofpictures | postalcode | lastseen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for:
autos.describe(include='all')
datecrawled | name | seller | offertype | price | abtest | vehicletype | registration_year | gearbox | powerps | model | odometer | registration_month | fueltype | brand | unrepaired_damage | ad_created | nrofpictures | postalcode | lastseen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-11 22:38:16 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
What can be seen is that the columns 'seller' and 'offertype' consist almost entirely of a single value. These columns are therefore not useful for further analysis and will be dropped from the dataframe.
# Taking a quick look at individual columns
autos['odometer'].value_counts(dropna = False)
150,000km 32424 125,000km 5170 100,000km 2169 90,000km 1757 80,000km 1436 70,000km 1230 60,000km 1164 50,000km 1027 5,000km 967 40,000km 819 30,000km 789 20,000km 784 10,000km 264 Name: odometer, dtype: int64
# Remove columns that don't provide useful data
autos.drop(['seller','offertype'],axis=1,inplace=True)
For each column we will:
- Remove all non-numeric characters
- Convert the column to a numeric type
- Rename the 'odometer' column to 'odometer_km'
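The steps above can be wrapped in a small reusable helper (a sketch on a toy Series; `strip_to_numeric` is a hypothetical name, and the notebook applies the same logic inline below):

```python
import pandas as pd

def strip_to_numeric(series, chars=("$", ",", "km")):
    """Remove the given non-numeric substrings and cast to float."""
    out = series.astype(str)
    for ch in chars:
        out = out.str.replace(ch, "", regex=False)  # literal match, not regex
    return out.astype(float)

print(strip_to_numeric(pd.Series(["$5,000", "$8,500"])).tolist())   # [5000.0, 8500.0]
print(strip_to_numeric(pd.Series(["150,000km"])).tolist())          # [150000.0]
```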
#price
autos['price'] = (autos['price']
                  .str.replace("$", "", regex=False)  # "$" is a regex metacharacter, so match it literally
                  .str.replace(",", "", regex=False)
                  .astype(float)
                  )
#odometer
autos['odometer'] = (autos['odometer']
                     .str.replace("km", "", regex=False)
                     .str.replace(",", "", regex=False)
                     .astype(int)
                     )
autos.rename({'odometer': 'odometer_km'},axis=1,inplace=True)
autos['price'].head()
0 5000.0 1 8500.0 2 8990.0 3 4350.0 4 1350.0 Name: price, dtype: float64
autos['odometer_km'].value_counts()
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: odometer_km, dtype: int64
The columns 'price' and 'odometer_km' have now successfully been converted to numeric values. They will now be analysed further to check for outliers that might need to be removed.
Let's continue exploring the data, specifically looking for data that doesn't look right. We'll start by analyzing the odometer_km and price columns
autos['price'].nunique()
2357
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
Here we can see that the price column has a positive (right) skew: its maximum value (100,000,000) is roughly four orders of magnitude greater than the 75th percentile (7,200).
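The skew can be confirmed numerically by comparing the maximum against the upper quartile, or with `Series.skew()` (a sketch on a toy right-skewed series, not the real data):

```python
import pandas as pd

s = pd.Series([100, 200, 300, 400, 100_000_000])  # one extreme outlier at the top

ratio = s.max() / s.quantile(0.75)
print(ratio)      # the max dwarfs the 75th percentile
print(s.skew())   # positive value -> right (positive) skew
```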
autos['price'].unique().shape
(2357,)
autos['price'].value_counts()
0.0 1421 500.0 781 1500.0 734 2500.0 643 1200.0 639 ... 6202.0 1 18310.0 1 898.0 1 11240.0 1 789.0 1 Name: price, Length: 2357, dtype: int64
autos['odometer_km'].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
Here we can see that the 'odometer_km' column has a negative (left) skew: its minimum value (5,000 km) is roughly 25 times smaller than the 25th percentile (125,000 km).
autos['odometer_km'].shape
(50000,)
autos['odometer_km'].unique().shape
(13,)
# Removing outliers in the Price column and checking the new statistical distribution
autos['price'] = autos.loc[autos['price'].between(0,30000), 'price']
autos['price'].describe()
count 49206.000000 mean 5025.773483 std 5679.154441 min 0.000000 25% 1100.000000 50% 2850.000000 75% 6900.000000 max 30000.000000 Name: price, dtype: float64
The lowest price is zero, which occurs 1,421 times in the dataset. Price represents the starting value of an auction. Since it is only the starting price, it does not mean that a car will actually sell for that amount. Therefore anything below 500 euros (which seems like a realistic lower bound for used car prices) is not removed from the dataset.
There are also outliers at the top. When investigating the unique values with their respective counts, the starting price increases gradually up to 350,000; beyond that it jumps rapidly. Listings with a starting price higher than 350,000 will be investigated further.
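One way to spot where that jump happens is to sort the unique prices in descending order and look at everything above the threshold (a sketch on synthetic prices; with the real data one would sort `autos['price'].unique()`):

```python
import pandas as pd

prices = pd.Series([500, 1200, 2500, 350_000, 999_999, 10_000_000])

# Sort unique values descending to see where the top of the range breaks away
top = pd.Series(prices.unique()).sort_values(ascending=False)
suspicious = top[top > 350_000]
print(suspicious.tolist())   # [10000000, 999999]
```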
autos['price'].nunique()
1985
# Removing outliers in the odometer_km column and checking its statistical distribution
autos['odometer_km'] = autos.loc[autos['odometer_km'].between(10000,150000), 'odometer_km']
autos['odometer_km'].nunique()
12
# The new Dataframe
autos.describe()
price | registration_year | powerps | odometer_km | registration_month | nrofpictures | postalcode | |
---|---|---|---|---|---|---|---|
count | 49206.000000 | 50000.000000 | 50000.000000 | 49033.000000 | 50000.000000 | 50000.0 | 50000.000000 |
mean | 5025.773483 | 2005.073280 | 116.355920 | 128113.719332 | 5.723360 | 0.0 | 50813.627300 |
std | 5679.154441 | 105.712813 | 209.216627 | 36631.381785 | 3.711984 | 0.0 | 25779.747957 |
min | 0.000000 | 1000.000000 | 0.000000 | 10000.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1100.000000 | 1999.000000 | 70.000000 | 125000.000000 | 3.000000 | 0.0 | 30451.000000 |
50% | 2850.000000 | 2003.000000 | 105.000000 | 150000.000000 | 6.000000 | 0.0 | 49577.000000 |
75% | 6900.000000 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 0.0 | 71540.000000 |
max | 30000.000000 | 9999.000000 | 17700.000000 | 150000.000000 | 12.000000 | 0.0 | 99998.000000 |
Because these three date columns ('datecrawled', 'ad_created' and 'lastseen') are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively.
The other two time columns, 'registration_month' and 'registration_year', are already numeric, so their distribution can be inspected directly with the Series.describe() method.
autos.columns
Index(['datecrawled', 'name', 'price', 'abtest', 'vehicletype', 'registration_year', 'gearbox', 'powerps', 'model', 'odometer_km', 'registration_month', 'fueltype', 'brand', 'unrepaired_damage', 'ad_created', 'nrofpictures', 'postalcode', 'lastseen'], dtype='object')
autos.loc[0:5, ['datecrawled', 'ad_created', 'lastseen']]
datecrawled | ad_created | lastseen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | 2016-03-21 00:00:00 | 2016-04-06 09:45:21 |
Let's extract just the date portion and calculate the distribution of values in these three columns.
# INVESTIGATING THE DATE_CRAWLED COLUMN
date_crawled = autos['datecrawled'].str[:10].value_counts(normalize= True, dropna = False).mul(100).round(1).astype(str) + '%'
date_crawled.sort_index()
2016-03-05 2.5% 2016-03-06 1.4% 2016-03-07 3.6% 2016-03-08 3.3% 2016-03-09 3.3% 2016-03-10 3.2% 2016-03-11 3.2% 2016-03-12 3.7% 2016-03-13 1.6% 2016-03-14 3.7% 2016-03-15 3.4% 2016-03-16 2.9% 2016-03-17 3.2% 2016-03-18 1.3% 2016-03-19 3.5% 2016-03-20 3.8% 2016-03-21 3.8% 2016-03-22 3.3% 2016-03-23 3.2% 2016-03-24 2.9% 2016-03-25 3.2% 2016-03-26 3.2% 2016-03-27 3.1% 2016-03-28 3.5% 2016-03-29 3.4% 2016-03-30 3.4% 2016-03-31 3.2% 2016-04-01 3.4% 2016-04-02 3.5% 2016-04-03 3.9% 2016-04-04 3.7% 2016-04-05 1.3% 2016-04-06 0.3% 2016-04-07 0.1% Name: datecrawled, dtype: object
Looking at the data above, the period over which the data was crawled covers roughly one month (March-April 2016). The distribution is more or less uniform.
# INVESTIGATING THE ad_created COLUMN
ad_created = autos['ad_created'].str[:10].value_counts(normalize= True, dropna = False).mul(100).round(1).astype(str) + '%'
ad_created.sort_index()
2015-06-11 0.0% 2015-08-10 0.0% 2015-09-09 0.0% 2015-11-10 0.0% 2015-12-05 0.0% ... 2016-04-03 3.9% 2016-04-04 3.7% 2016-04-05 1.2% 2016-04-06 0.3% 2016-04-07 0.1% Name: ad_created, Length: 76, dtype: object
The dates ads were created range from June 2015 until April 2016. The vast majority (~97%) of ads in the dataset were created after the date on which the data was first crawled.
This makes sense, as most listings are only live for a short period of time.
# INVESTIGATING THE lastseen COLUMN
last_seen = autos['lastseen'].str[:10].value_counts(normalize= True, dropna = False).mul(100).round(1).astype(str) + '%'
last_seen.sort_index()
2016-03-05 0.1% 2016-03-06 0.4% 2016-03-07 0.5% 2016-03-08 0.8% 2016-03-09 1.0% 2016-03-10 1.1% 2016-03-11 1.3% 2016-03-12 2.4% 2016-03-13 0.9% 2016-03-14 1.3% 2016-03-15 1.6% 2016-03-16 1.6% 2016-03-17 2.8% 2016-03-18 0.7% 2016-03-19 1.6% 2016-03-20 2.1% 2016-03-21 2.1% 2016-03-22 2.2% 2016-03-23 1.9% 2016-03-24 2.0% 2016-03-25 1.9% 2016-03-26 1.7% 2016-03-27 1.6% 2016-03-28 2.1% 2016-03-29 2.2% 2016-03-30 2.5% 2016-03-31 2.4% 2016-04-01 2.3% 2016-04-02 2.5% 2016-04-03 2.5% 2016-04-04 2.5% 2016-04-05 12.4% 2016-04-06 22.1% 2016-04-07 13.1% Name: lastseen, dtype: object
The last seen dates look fairly uniform, but the last three days account for a disproportionate share of 'last seen' values. Given that these are 6-10x the values of the preceding days, it's unlikely there was a massive spike in sales; more likely these values reflect the crawling period ending, and they don't indicate car sales.
registration_year = autos["registration_year"].describe()
registration_year
count 50000.000000 mean 2005.073280 std 105.712813 min 1000.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
autos["registration_year"].head(10)
0 2004 1 1997 2 2009 3 2007 4 2003 5 2006 6 1995 7 1998 8 2000 9 1997 Name: registration_year, dtype: int64
The minimum and maximum registration years look wrong.
The lowest registration year is 1000, which must be incorrect, since cars only started appearing in the late 1800s; the maximum year of 9999 lies in the future. Because of this, all listings with a registration year before 1885 (the first patented practical automobile) will be removed.
All registration years after 2016 must also be incorrect, as the ads were created in 2015 and 2016; these will be removed from the dataset as well.
A car can't be registered after its listing was last seen, so any vehicle with a registration year above 2016 can be deduced to be inaccurate: the ad_created column, which represents the date of listing, only contains dates up to 2016.
Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s. To validate this, we can count the number of listings that fall outside the 1800-2016 interval used below and see if it is safe to remove those rows entirely or if they require custom logic.
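Counting rows outside the interval amounts to inverting `Series.between` (a sketch on a toy series; on the real data the same mask would be applied to `autos['registration_year']`):

```python
import pandas as pd

years = pd.Series([1000, 1985, 2004, 2016, 5000, 9999])

in_range = years.between(1800, 2016)   # inclusive bounds
n_outside = (~in_range).sum()
share = n_outside / len(years)
print(n_outside, round(share, 2))      # 3 0.5
```

If the share of out-of-range rows is small, dropping them outright is usually safe; a large share would call for custom handling instead.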
autos[autos['registration_year'].between(1800,2016)].describe()
price | registration_year | powerps | odometer_km | registration_month | nrofpictures | postalcode | |
---|---|---|---|---|---|---|---|
count | 47245.000000 | 48030.000000 | 48030.000000 | 47117.000000 | 48030.000000 | 48030.0 | 48030.000000 |
mean | 5094.483014 | 2002.795066 | 117.140496 | 127874.864699 | 5.767604 | 0.0 | 50936.383094 |
std | 5729.183673 | 7.426905 | 195.449149 | 36786.714694 | 3.696805 | 0.0 | 25791.666655 |
min | 0.000000 | 1800.000000 | 0.000000 | 10000.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1100.000000 | 1999.000000 | 71.000000 | 125000.000000 | 3.000000 | 0.0 | 30459.000000 |
50% | 2900.000000 | 2003.000000 | 107.000000 | 150000.000000 | 6.000000 | 0.0 | 49696.000000 |
75% | 6999.000000 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 0.0 | 71665.000000 |
max | 30000.000000 | 2016.000000 | 17700.000000 | 150000.000000 | 12.000000 | 0.0 | 99998.000000 |
There is still a large variety in the registration years of the cars. The mean of approximately 2003, with a small standard deviation (~7 years), indicates that most cars are roughly between 7 and 21 years old.
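Since the data was crawled in 2016, car age can be derived directly from the registration year (a sketch; using 2016 as the reference year is an assumption taken from the crawl dates):

```python
import pandas as pd

reg_years = pd.Series([1999, 2003, 2008, 2016])

age = 2016 - reg_years   # age at the time of crawling
print(age.tolist())      # [17, 13, 8, 0]
```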
autos['brand'].value_counts()
volkswagen 10687 opel 5461 bmw 5429 mercedes_benz 4734 audi 4283 ford 3479 renault 2404 peugeot 1456 fiat 1308 seat 941 skoda 786 mazda 757 nissan 754 smart 701 citroen 701 toyota 617 sonstige_autos 546 hyundai 488 volvo 457 mini 424 mitsubishi 406 honda 399 kia 356 alfa_romeo 329 porsche 294 suzuki 293 chevrolet 283 chrysler 181 dacia 129 daihatsu 128 jeep 110 subaru 109 land_rover 99 saab 80 daewoo 79 trabant 78 jaguar 77 rover 69 lancia 57 lada 31 Name: brand, dtype: int64
Volkswagen is the most common brand, accounting for over 20% of all listings, followed by opel, bmw, mercedes_benz, audi, ford, renault, peugeot and fiat.
autos['brand'].nunique()
40
autos['brand'].value_counts(normalize = True).mul(100).round(1).astype(str) + '%'
volkswagen 21.4% opel 10.9% bmw 10.9% mercedes_benz 9.5% audi 8.6% ford 7.0% renault 4.8% peugeot 2.9% fiat 2.6% seat 1.9% skoda 1.6% mazda 1.5% nissan 1.5% smart 1.4% citroen 1.4% toyota 1.2% sonstige_autos 1.1% hyundai 1.0% volvo 0.9% mini 0.8% mitsubishi 0.8% honda 0.8% kia 0.7% alfa_romeo 0.7% porsche 0.6% suzuki 0.6% chevrolet 0.6% chrysler 0.4% dacia 0.3% daihatsu 0.3% jeep 0.2% subaru 0.2% land_rover 0.2% saab 0.2% daewoo 0.2% trabant 0.2% jaguar 0.2% rover 0.1% lancia 0.1% lada 0.1% Name: brand, dtype: object
When working with data on cars, it's good to explore variations across different car brands. Aggregation can be used to understand the brand column.
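The per-brand aggregation built with a dictionary loop below can equivalently be expressed with `groupby` (a sketch on toy data, not the notebook's code):

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["bmw", "bmw", "opel", "opel"],
    "price": [8000, 6000, 3000, 2000],
})

# Mean price per brand, sorted highest first
mean_prices = df.groupby("brand")["price"].mean().sort_values(ascending=False)
print(mean_prices.to_dict())   # {'bmw': 7000.0, 'opel': 2500.0}
```

The loop approach is more explicit and easier to follow step by step; `groupby` is the idiomatic one-liner for the same result.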
#selecting the top 20 brands
auto_brands_20 = autos['brand'].value_counts().head(20).index
auto_brands_20
Index(['volkswagen', 'opel', 'bmw', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat', 'skoda', 'mazda', 'nissan', 'smart', 'citroen', 'toyota', 'sonstige_autos', 'hyundai', 'volvo', 'mini'], dtype='object')
brand_mean_prices = {}
for brand in auto_brands_20:
mean_price = round(autos.loc[autos['brand'] == brand, 'price'].mean())
# Creating a new key_value pair with this mean price
brand_mean_prices[brand] = mean_price
brand_mean_prices
{'volkswagen': 4945.0, 'opel': 2834.0, 'bmw': 7142.0, 'mercedes_benz': 7209.0, 'audi': 7640.0, 'ford': 3421.0, 'renault': 2314.0, 'peugeot': 3011.0, 'fiat': 2698.0, 'seat': 4188.0, 'skoda': 6271.0, 'mazda': 3773.0, 'nissan': 4511.0, 'smart': 3483.0, 'citroen': 3645.0, 'toyota': 4984.0, 'sonstige_autos': 6692.0, 'hyundai': 5317.0, 'volvo': 4686.0, 'mini': 10281.0}
# Sorting the brand dictionary by mean price, highest first
brand_mean_prices = dict(sorted(brand_mean_prices.items(), key = lambda data: data[1], reverse = True ))
brand_mean_prices
{'mini': 10281.0, 'audi': 7640.0, 'mercedes_benz': 7209.0, 'bmw': 7142.0, 'sonstige_autos': 6692.0, 'skoda': 6271.0, 'hyundai': 5317.0, 'toyota': 4984.0, 'volkswagen': 4945.0, 'volvo': 4686.0, 'nissan': 4511.0, 'seat': 4188.0, 'mazda': 3773.0, 'citroen': 3645.0, 'smart': 3483.0, 'ford': 3421.0, 'peugeot': 3011.0, 'opel': 2834.0, 'fiat': 2698.0, 'renault': 2314.0}
We aggregated across brands to understand mean price, and observed a distinct price gap within the top brands:
- Audi, BMW and Mercedes-Benz are more expensive
- Ford and Opel are less expensive
- Volkswagen sits in between, at about 4,945
From this brief statistical aggregation, the brand mini has the highest mean price among the top 20 brands in our dataset, followed by audi, mercedes_benz, bmw and sonstige_autos.
Using these two aggregated series objects, we can then combine them into a single subset dataframe (with a shared index).
brand_mean_mileage = {}
for brand in auto_brands_20:
mean_mileage = round(autos.loc[autos['brand'] == brand, 'odometer_km'].mean())
# Creating a new key-value pair with this mean mileage
brand_mean_mileage[brand] = mean_mileage
brand_mean_mileage
{'volkswagen': 131187.0, 'opel': 132091.0, 'bmw': 134817.0, 'mercedes_benz': 132721.0, 'audi': 130967.0, 'ford': 126436.0, 'renault': 130734.0, 'peugeot': 128540.0, 'fiat': 120208.0, 'seat': 123958.0, 'skoda': 113009.0, 'mazda': 127067.0, 'nissan': 121135.0, 'smart': 101444.0, 'citroen': 121257.0, 'toyota': 117262.0, 'sonstige_autos': 101298.0, 'hyundai': 107836.0, 'volvo': 139812.0, 'mini': 91205.0}
# Sorting the brand dictionary by mean mileage, highest first
brand_mean_mileage = dict(sorted(brand_mean_mileage.items(), key = lambda data: data[1], reverse = True ))
brand_mean_mileage
{'volvo': 139812.0, 'bmw': 134817.0, 'mercedes_benz': 132721.0, 'opel': 132091.0, 'volkswagen': 131187.0, 'audi': 130967.0, 'renault': 130734.0, 'peugeot': 128540.0, 'mazda': 127067.0, 'ford': 126436.0, 'seat': 123958.0, 'citroen': 121257.0, 'nissan': 121135.0, 'fiat': 120208.0, 'toyota': 117262.0, 'skoda': 113009.0, 'hyundai': 107836.0, 'smart': 101444.0, 'sonstige_autos': 101298.0, 'mini': 91205.0}
# Converting both dictionaries to Series objects, using the Series constructor
mean_prices_top20 = pd.Series(brand_mean_prices)
mean_milieage_top20 = pd.Series(brand_mean_mileage)
# Creating a dataframe from the mean prices series
top_20_brands = pd.DataFrame(mean_prices_top20, columns=['mean_price'])
top_20_brands
mean_price | |
---|---|
mini | 10281.0 |
audi | 7640.0 |
mercedes_benz | 7209.0 |
bmw | 7142.0 |
sonstige_autos | 6692.0 |
skoda | 6271.0 |
hyundai | 5317.0 |
toyota | 4984.0 |
volkswagen | 4945.0 |
volvo | 4686.0 |
nissan | 4511.0 |
seat | 4188.0 |
mazda | 3773.0 |
citroen | 3645.0 |
smart | 3483.0 |
ford | 3421.0 |
peugeot | 3011.0 |
opel | 2834.0 |
fiat | 2698.0 |
renault | 2314.0 |
# Appending the mean_mileage data in the other series as a column of this new dataframe to see if there is any possible link
top_20_brands['mean_milieage'] = mean_milieage_top20
top_20_brands.sort_values('mean_milieage', ascending = False )
mean_price | mean_milieage | |
---|---|---|
volvo | 4686.0 | 139812.0 |
bmw | 7142.0 | 134817.0 |
mercedes_benz | 7209.0 | 132721.0 |
opel | 2834.0 | 132091.0 |
volkswagen | 4945.0 | 131187.0 |
audi | 7640.0 | 130967.0 |
renault | 2314.0 | 130734.0 |
peugeot | 3011.0 | 128540.0 |
mazda | 3773.0 | 127067.0 |
ford | 3421.0 | 126436.0 |
seat | 4188.0 | 123958.0 |
citroen | 3645.0 | 121257.0 |
nissan | 4511.0 | 121135.0 |
fiat | 2698.0 | 120208.0 |
toyota | 4984.0 | 117262.0 |
skoda | 6271.0 | 113009.0 |
hyundai | 5317.0 | 107836.0 |
smart | 3483.0 | 101444.0 |
sonstige_autos | 6692.0 | 101298.0 |
mini | 10281.0 | 91205.0 |
The table shows both mean price and mean mileage, and brands with higher mean mileage tend to have lower mean prices. Even so, it is impossible to conclude from this alone whether higher mileage is driving down the price: within a brand there are many other variables affecting price (such as car type, engine type, registration year, etc.). To confirm whether mileage affects the price, we would need a slice of the dataset in which all those other variables are held as constant as possible.
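One way to hold the other variables roughly constant is to slice a single brand/model combination and compare mean price across mileage levels (a sketch on toy data; with the real dataset one might slice, e.g., volkswagen golf):

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["volkswagen"] * 4,
    "model": ["golf"] * 4,
    "odometer_km": [50_000, 50_000, 150_000, 150_000],
    "price": [9000, 8500, 3000, 3500],
})

# Fix brand and model, then compare price across mileage levels
one_model = df[(df["brand"] == "volkswagen") & (df["model"] == "golf")]
by_mileage = one_model.groupby("odometer_km")["price"].mean()
print(by_mileage.to_dict())   # {50000: 8750.0, 150000: 3250.0}
```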
autos.head(10)
datecrawled | name | price | abtest | vehicletype | registration_year | gearbox | powerps | model | odometer_km | registration_month | fueltype | brand | unrepaired_damage | ad_created | nrofpictures | postalcode | lastseen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000.0 | control | bus | 2004 | manuell | 158 | andere | 150000.0 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500.0 | control | limousine | 1997 | automatik | 286 | 7er | 150000.0 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990.0 | test | limousine | 2009 | manuell | 102 | golf | 70000.0 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350.0 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000.0 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350.0 | test | kombi | 2003 | manuell | 0 | focus | 150000.0 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | 7900.0 | test | bus | 2006 | automatik | 150 | voyager | 150000.0 | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | 300.0 | test | limousine | 1995 | manuell | 90 | golf | 150000.0 | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | 1990.0 | control | limousine | 1998 | manuell | 90 | golf | 150000.0 | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | 250.0 | test | NaN | 2000 | manuell | 0 | arosa | 150000.0 | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | 590.0 | control | bus | 1997 | manuell | 90 | megane | 150000.0 | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
vehicletype, gearbox and fueltype are three columns with values in German (as is unrepaired_damage, with its 'ja'/'nein' values).
autos["vehicletype"].unique()
array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv', 'cabrio', 'andere'], dtype=object)
autos["gearbox"].unique()
array(['manuell', 'automatik', nan], dtype=object)
autos["fueltype"].unique()
array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro', 'andere'], dtype=object)
# Identify categorical data that uses German words, translate them, and map the values to their English counterparts
translate_map = { 'kleinwagen' : "small car",
'kombi' : "estate car" ,
'cabrio' : "convertible",
'andere' : "other",
'elektro' : "electric",
'benzin' : 'petrol',
'manuell' : "manual",
'automatik' : "automatic",
"bus" : "bus",
"limousine" : "limousine",
"coupe":"coupe",
"suv" : "suv",
"lpg" : "lpg",
"diesel" : "diesel",
"cng" : "cng",
"hybrid": "hybrid",
"Unknown" : "Unknown",
"nein" : "no",
"ja" : "yes", }
categorical = ["fueltype", "gearbox", "vehicletype",
"unrepaired_damage"]
for word in categorical:
autos[word] = autos[word].map(translate_map)
# Replace NaN values with "Unknown"
for c in categorical:
autos.loc[ autos[c].isnull() , c ] = "Unknown"
autos["gearbox"].unique()
array(['manual', 'automatic', 'Unknown'], dtype=object)
autos["vehicletype"].unique()
array(['bus', 'limousine', 'small car', 'estate car', 'Unknown', 'coupe', 'suv', 'convertible', 'other'], dtype=object)
autos["fueltype"].unique()
array(['lpg', 'petrol', 'diesel', 'Unknown', 'cng', 'hybrid', 'electric', 'other'], dtype=object)
autos["unrepaired_damage"].unique()
array(['no', 'Unknown', 'yes'], dtype=object)
#Clean the date columns: convert each date to a uniform integer, so "2016-03-21" becomes 20160321
date_cols = ["datecrawled", "ad_created", "lastseen"]
for c in date_cols:
temp = autos[c].str[:10].str.replace("-","").astype(int)
autos[c] = temp
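The conversion above can be illustrated on a toy timestamp column in the same "YYYY-MM-DD HH:MM:SS" format as datecrawled (toy values, not the real data):

```python
import pandas as pd

# Toy timestamps in the same format as the datecrawled column
dates = pd.Series(["2016-03-21 17:47:46", "2016-04-06 06:45:54"])

# Keep the first 10 characters (the date part), drop the dashes, cast to int
as_int = dates.str[:10].str.replace("-", "").astype(int)
print(as_int.tolist())  # [20160321, 20160406]
```

Storing dates as integers like this keeps them sortable and comparable without parsing them into full datetime objects.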
autos.head(20)
datecrawled | name | price | abtest | vehicletype | registration_year | gearbox | powerps | model | odometer_km | registration_month | fueltype | brand | unrepaired_damage | ad_created | nrofpictures | postalcode | lastseen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20160326 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000.0 | control | bus | 2004 | manual | 158 | andere | 150000.0 | 3 | lpg | peugeot | no | 20160326 | 0 | 79588 | 20160406 |
1 | 20160404 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500.0 | control | limousine | 1997 | automatic | 286 | 7er | 150000.0 | 6 | petrol | bmw | no | 20160404 | 0 | 71034 | 20160406 |
2 | 20160326 | Volkswagen_Golf_1.6_United | 8990.0 | test | limousine | 2009 | manual | 102 | golf | 70000.0 | 7 | petrol | volkswagen | no | 20160326 | 0 | 35394 | 20160406 |
3 | 20160312 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350.0 | control | small car | 2007 | automatic | 71 | fortwo | 70000.0 | 6 | petrol | smart | no | 20160312 | 0 | 33729 | 20160315 |
4 | 20160401 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350.0 | test | estate car | 2003 | manual | 0 | focus | 150000.0 | 7 | petrol | ford | no | 20160401 | 0 | 39218 | 20160401 |
5 | 20160321 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | 7900.0 | test | bus | 2006 | automatic | 150 | voyager | 150000.0 | 4 | diesel | chrysler | Unknown | 20160321 | 0 | 22962 | 20160406 |
6 | 20160320 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | 300.0 | test | limousine | 1995 | manual | 90 | golf | 150000.0 | 8 | petrol | volkswagen | Unknown | 20160320 | 0 | 31535 | 20160323 |
7 | 20160316 | Golf_IV_1.9_TDI_90PS | 1990.0 | control | limousine | 1998 | manual | 90 | golf | 150000.0 | 12 | diesel | volkswagen | no | 20160316 | 0 | 53474 | 20160407 |
8 | 20160322 | Seat_Arosa | 250.0 | test | Unknown | 2000 | manual | 0 | arosa | 150000.0 | 10 | Unknown | seat | no | 20160322 | 0 | 7426 | 20160326 |
9 | 20160316 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | 590.0 | control | bus | 1997 | manual | 90 | megane | 150000.0 | 7 | petrol | renault | no | 20160316 | 0 | 15749 | 20160406 |
10 | 20160315 | VW_Golf_Tuning_in_siber/grau | 999.0 | test | Unknown | 2017 | manual | 90 | NaN | 150000.0 | 4 | petrol | volkswagen | no | 20160314 | 0 | 86157 | 20160407 |
11 | 20160316 | Mercedes_A140_Motorschaden | 350.0 | control | Unknown | 2000 | Unknown | 0 | NaN | 150000.0 | 0 | petrol | mercedes_benz | Unknown | 20160316 | 0 | 17498 | 20160316 |
12 | 20160331 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | 5299.0 | control | small car | 2010 | automatic | 71 | fortwo | 50000.0 | 9 | petrol | smart | no | 20160331 | 0 | 34590 | 20160406 |
13 | 20160323 | Audi_A3_1.6_tuning | 1350.0 | control | limousine | 1999 | manual | 101 | a3 | 150000.0 | 11 | petrol | audi | no | 20160323 | 0 | 12043 | 20160401 |
14 | 20160323 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | 3999.0 | test | small car | 2007 | manual | 75 | clio | 150000.0 | 9 | petrol | renault | Unknown | 20160323 | 0 | 81737 | 20160401 |
15 | 20160401 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | 18900.0 | test | coupe | 1982 | automatic | 203 | NaN | 80000.0 | 6 | petrol | sonstige_autos | no | 20160401 | 0 | 61276 | 20160402 |
16 | 20160316 | Opel_Vectra_B_Kombi | 350.0 | test | estate car | 1999 | manual | 101 | vectra | 150000.0 | 5 | petrol | opel | no | 20160316 | 0 | 57299 | 20160318 |
17 | 20160329 | Volkswagen_Scirocco_2_G60 | 5500.0 | test | coupe | 1990 | manual | 205 | scirocco | 150000.0 | 6 | petrol | volkswagen | no | 20160329 | 0 | 74821 | 20160405 |
18 | 20160326 | Verkaufen_mein_bmw_e36_320_i_touring | 300.0 | control | bus | 1995 | manual | 150 | 3er | 150000.0 | 0 | petrol | bmw | Unknown | 20160326 | 0 | 54329 | 20160402 |
19 | 20160317 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | 4150.0 | control | suv | 2004 | manual | 124 | andere | 150000.0 | 2 | lpg | mazda | no | 20160317 | 0 | 40878 | 20160317 |
Finding the most common model for each brand; this can be done with aggregation.
unique_brands = autos['brand'].unique()
unique_brands
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos', 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia'], dtype=object)
dict_model_brand = {}
for ub in unique_brands:
    model_counts = autos.loc[autos["brand"] == ub, "model"].value_counts()
    # idxmax() returns the most frequent model; guard against brands with no model data
    dict_model_brand[ub] = model_counts.idxmax() if not model_counts.empty else np.nan
dict_model_brand
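The per-brand loop can also be written as a single groupby aggregation. A minimal sketch on toy brand/model pairs (illustrative only, not the real listings):

```python
import pandas as pd

# Toy brand/model pairs (illustrative only)
df = pd.DataFrame({
    "brand": ["vw", "vw", "vw", "bmw", "bmw", "bmw"],
    "model": ["golf", "golf", "up", "3er", "3er", "5er"],
})

# For each brand, count the models and keep the most frequent one
most_common = df.groupby("brand")["model"].agg(lambda s: s.value_counts().idxmax())
print(most_common.to_dict())  # {'bmw': '3er', 'vw': 'golf'}
```

groupby avoids repeatedly filtering the full DataFrame once per brand, which matters more as the number of groups grows.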
Using aggregation to see whether the average price follows any pattern based on mileage.
groups_odometer = sorted(autos["odometer_km"].dropna().unique())
groups_odometer
[10000.0, 20000.0, 30000.0, 40000.0, 50000.0, 60000.0, 70000.0, 80000.0, 90000.0, 100000.0, 125000.0, 150000.0]
dict_avg_price_mil = {}
for kilom in groups_odometer:
    temp = autos.loc[autos["odometer_km"] == kilom, "price"].mean()
    dict_avg_price_mil[kilom] = round(temp, 2)
dict_avg_price_mil
{10000.0: 12652.79, 20000.0: 12008.03, 30000.0: 12416.99, 40000.0: 12365.93, 50000.0: 11109.58, 60000.0: 10098.24, 70000.0: 9450.56, 80000.0: 8523.17, 90000.0: 7448.37, 100000.0: 6919.56, 125000.0: 5637.48, 150000.0: 3519.07}
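The same mileage aggregation can be done in one step with groupby, which also skips NaN mileage values automatically. A sketch on toy rows (illustrative only):

```python
import pandas as pd

# Toy mileage/price rows (illustrative only)
df = pd.DataFrame({
    "odometer_km": [50000.0, 50000.0, 150000.0, 150000.0],
    "price": [12000.0, 10000.0, 4000.0, 3000.0],
})

# Mean price per mileage band, rounded to two decimals;
# rows with NaN odometer_km are excluded from the groups automatically
avg_price = df.groupby("odometer_km")["price"].mean().round(2)
print(avg_price.to_dict())  # {50000.0: 11000.0, 150000.0: 3500.0}
```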
The average car price decreases as mileage increases. Unrepaired damage could be part of the reason, but further analysis is needed.
Checking how much cheaper damaged cars are than their non-damaged counterparts.
autos['unrepaired_damage'].unique()
array(['no', 'Unknown', 'yes'], dtype=object)
autos['unrepaired_damage'].value_counts()
no         35232
Unknown     9829
yes         4939
Name: unrepaired_damage, dtype: int64
damage_not_damaged = ['no', 'Unknown', 'yes']
damage_not_damaged_counterparts = {}
for up in damage_not_damaged:
val = autos.loc[autos["unrepaired_damage"] == up, "price" ].mean()
damage_not_damaged_counterparts[up] = val
damage_not_damaged_counterparts
{'no': 6091.86273770872, 'Unknown': 2768.6617932726717, 'yes': 2043.6258876039765}
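This per-category loop, too, collapses into a single groupby. A sketch on toy damage/price rows (illustrative only, not the real listings):

```python
import pandas as pd

# Toy damage/price rows (illustrative only)
df = pd.DataFrame({
    "unrepaired_damage": ["no", "yes", "no", "Unknown"],
    "price": [6000.0, 2000.0, 8000.0, 3000.0],
})

# Mean price per damage status in one pass over the data
avg_by_damage = df.groupby("unrepaired_damage")["price"].mean()
print(avg_by_damage.to_dict())  # {'Unknown': 3000.0, 'no': 7000.0, 'yes': 2000.0}
```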
diff_in_price = damage_not_damaged_counterparts["no"] - damage_not_damaged_counterparts["yes"]
diff_in_price
4048.236850104743
There is a difference of about 4,000 dollars between the average prices of damaged and non-damaged cars: non-damaged cars are, on average, roughly 4,000 dollars more expensive than cars with unrepaired damage.