Using a dataset of eBay Kleinanzeigen (the classified section of the German eBay website) consisting of 50K data points available on kaggle, we want to (1) CLEAN THE DATA & (2) ANALYZE THE DATA.
This dataset contains the following information:
Variable | Description |
---|---|
dateCrawled | When the ad was 1st crawled |
name | Name of car |
seller | Private or Dealer as seller |
offerType | Type of listing |
price | Price on the car ad |
abtest | Whether the listing included A/B test |
vehicleType | Type of vehicle |
yearOfRegistration | Year when car was 1st registered |
gearbox | The transmission type |
powerPS | Horsepower of the car |
model | Model of the car |
kilometer | # of kilometers driven |
monthOfRegistration | Month in which car was 1st registered |
fuelType | Type of fuel the car uses |
brand | Brand of the car |
notRepairedDamage | If the repairs were done on damaged car |
dateCreated | Date when eBay listing was created |
nrOfPictures | # of pics of car in ad |
postalCode | Postal code where the car is at |
lastSeenOnline | When the crawler saw this ad last online |
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
There are several things that need to change:
(1) need to lower case everything
(2) deal with missing values in "vehicleType", "gearbox", "model", "fuelType", "notRepairedDamage"
(3) turn things into snake_case
(4) turn "price" and "odometer" into float
autos.rename(str.lower, axis = 1, inplace = True)
autos.rename({"yearofregistration":"registration_year",
"monthofregistration":"registration_month",
"notrepaireddamage":"unrepaired_damage",
"datecreated":"ad_created",
"datecrawled":"date_crawled",
"offertype":"offer_type",
"vehicletype":"vehicle_type",
"gearbox":"gear_box",
"fueltype":"fuel_type",
"nrofpictures":"num_pictures",
"postalcode":"postal_code",
"lastseen":"last_seen"}, axis = 1, inplace = True)
autos.head(5)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gear_box | powerps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.describe(include = "all")
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gear_box | powerps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-16 21:50:53 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
print(autos["offer_type"].value_counts())
print("\n")
print(autos['seller'].value_counts())
print("\n")
print(autos["num_pictures"].value_counts())
print("\n")
print(autos['gear_box'].value_counts())
print("\n")
print(autos['abtest'].value_counts())
Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64 privat 49999 gewerblich 1 Name: seller, dtype: int64 0 50000 Name: num_pictures, dtype: int64 manuell 36993 automatik 10327 Name: gear_box, dtype: int64 test 25756 control 24244 Name: abtest, dtype: int64
Remove "seller", "num_pictures" and "seller" from dataframe
"powerps" has an unrealistic value (> 1000 HP). Need to further analyze
Similarly, 'price' also need to be examined as there is no way to have a $2 million+ vehicle
"registration_year" has an unrealistic range, particularly at the extremes. Need to remove them.
Odometer + price needs to return to floats
# Fix price
autos['price'].value_counts()
# There's "$" and ",". Plus it's an object and not an integer/float
autos['price'] = autos['price'].str.replace("$","").str.replace(",","")
autos['price'] = autos['price'].astype(float)
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
# fix odometer
autos['odometer'].value_counts()
autos['odometer'].unique()
# Need to deal with "," and "km" + turn it to float/int
autos['odometer'] = autos['odometer'].str.replace(",", "")
autos['odometer'] = autos['odometer'].str.replace("km", "")
autos['odometer'] = autos['odometer'].astype(int)
autos.rename({"odometer": "odometer_km"}, axis = 1, inplace = True)
autos.columns
# removing unnecessary columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gear_box', 'powerps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'num_pictures', 'postal_code', 'last_seen'], dtype='object')
cols = ['date_crawled', 'name', 'price',
'abtest', 'vehicle_type', 'registration_year',
'gear_box', 'powerps', 'model',
'odometer_km', 'registration_month', 'fuel_type',
'brand', 'unrepaired_damage', 'ad_created',
'postal_code', 'last_seen']
autos = autos[cols]
autos.head()
autos.describe(include = 'all')
date_crawled | name | price | abtest | vehicle_type | registration_year | gear_box | powerps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 5.000000e+04 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000.000000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.000000 | 50000 |
unique | 48213 | 38754 | NaN | 2 | 8 | NaN | 2 | NaN | 245 | NaN | NaN | 7 | 40 | 2 | 76 | NaN | 39481 |
top | 2016-03-16 21:50:53 | Ford_Fiesta | NaN | test | limousine | NaN | manuell | NaN | golf | NaN | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | NaN | 25756 | 12859 | NaN | 36993 | NaN | 4024 | NaN | NaN | 30107 | 10687 | 35232 | 1946 | NaN | 8 |
mean | NaN | NaN | 9.840044e+03 | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | 125732.700000 | 5.723360 | NaN | NaN | NaN | NaN | 50813.627300 | NaN |
std | NaN | NaN | 4.811044e+05 | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | 40042.211706 | 3.711984 | NaN | NaN | NaN | NaN | 25779.747957 | NaN |
min | NaN | NaN | 0.000000e+00 | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | 5000.000000 | 0.000000 | NaN | NaN | NaN | NaN | 1067.000000 | NaN |
25% | NaN | NaN | 1.100000e+03 | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | 125000.000000 | 3.000000 | NaN | NaN | NaN | NaN | 30451.000000 | NaN |
50% | NaN | NaN | 2.950000e+03 | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | 150000.000000 | 6.000000 | NaN | NaN | NaN | NaN | 49577.000000 | NaN |
75% | NaN | NaN | 7.200000e+03 | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | 150000.000000 | 9.000000 | NaN | NaN | NaN | NaN | 71540.000000 | NaN |
max | NaN | NaN | 1.000000e+08 | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | 150000.000000 | 12.000000 | NaN | NaN | NaN | NaN | 99998.000000 | NaN |
There is no way that horse power can be 17700. So we need to find out what's up.
autos['powerps'].value_counts().sort_index(ascending = False).head(50)
17700 1 16312 1 16011 1 15016 1 15001 1 14009 1 9011 1 8404 1 7511 1 6512 1 6226 1 6045 1 5867 1 4400 1 3750 1 3500 1 2729 1 2018 1 1998 2 1988 1 1986 1 1800 3 1796 1 1793 1 1781 1 1780 1 1779 1 1771 1 1753 1 1704 1 1405 1 1401 1 1400 3 1398 1 1367 1 1300 1 1202 1 1103 1 1090 1 1082 1 1056 1 1055 1 1016 1 1011 1 1003 1 1001 3 1000 1 999 1 952 1 950 1 Name: powerps, dtype: int64
autos['powerps'].value_counts().sort_index(ascending = True).head(50)
0 5500 1 5 2 2 3 3 4 4 5 13 6 3 8 2 9 1 10 3 11 4 12 1 14 1 15 5 16 1 18 6 19 3 20 4 21 1 23 4 24 1 25 2 26 39 27 5 29 5 30 3 32 2 33 9 34 27 35 2 37 7 38 2 39 18 40 42 41 57 42 7 43 20 44 52 45 397 46 9 47 8 48 16 49 2 50 604 51 14 52 40 53 28 54 759 55 275 56 40 Name: powerps, dtype: int64
As the Bugatti Cheron was the highest HP production car at approx. 1600, really anything above that is not valid. So we will use this as the top-end cap. Conversely the lowest HP production car is the Renault Twitzy at about 17 hp. Thus we will use this as the bottom-end cap.
autos = autos[autos['powerps'].between(16, 1600)]
autos.describe()
price | registration_year | powerps | odometer_km | registration_month | postal_code | |
---|---|---|---|---|---|---|
count | 4.442000e+04 | 44420.000000 | 44420.000000 | 44420.000000 | 44420.000000 | 44420.000000 |
mean | 1.045555e+04 | 2004.047726 | 126.743134 | 125794.236830 | 5.930752 | 51368.900698 |
std | 5.076550e+05 | 53.063904 | 67.302685 | 39267.227509 | 3.602153 | 25754.846000 |
min | 0.000000e+00 | 1927.000000 | 16.000000 | 5000.000000 | 0.000000 | 1067.000000 |
25% | 1.299000e+03 | 1999.000000 | 80.000000 | 100000.000000 | 3.000000 | 31028.000000 |
50% | 3.299000e+03 | 2004.000000 | 116.000000 | 150000.000000 | 6.000000 | 50354.000000 |
75% | 7.800000e+03 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 72250.000000 |
max | 1.000000e+08 | 9000.000000 | 1405.000000 | 150000.000000 | 12.000000 | 99998.000000 |
autos['powerps'].value_counts().sort_index(ascending = False)
1405 1 1401 1 1400 3 1398 1 1367 1 1300 1 1202 1 1103 1 1090 1 1082 1 1056 1 1055 1 1016 1 1011 1 1003 1 1001 3 1000 1 999 1 952 1 950 1 923 1 900 3 850 1 800 1 799 1 754 1 740 1 696 1 682 1 678 1 ... 50 604 49 2 48 16 47 8 46 9 45 397 44 52 43 20 42 7 41 57 40 42 39 18 38 2 37 7 35 2 34 27 33 9 32 2 30 3 29 5 27 5 26 39 25 2 24 1 23 4 21 1 20 4 19 3 18 6 16 1 Name: powerps, Length: 404, dtype: int64
autos['price'].shape[0]
44420
autos['price'].describe()
count 4.442000e+04 mean 1.045555e+04 std 5.076550e+05 min 0.000000e+00 25% 1.299000e+03 50% 3.299000e+03 75% 7.800000e+03 max 1.000000e+08 Name: price, dtype: float64
autos['price'].value_counts().sort_index(ascending = False).head(10)
99999999.0 1 27322222.0 1 12345678.0 3 11111111.0 1 10000000.0 1 3890000.0 1 1234566.0 1 999999.0 2 999990.0 1 350000.0 1 Name: price, dtype: int64
autos['price'].value_counts().sort_index(ascending = False).tail(10)
30.0 1 25.0 1 20.0 1 13.0 1 11.0 1 10.0 2 5.0 1 3.0 1 1.0 89 0.0 907 Name: price, dtype: int64
print(len(autos[autos['price'] < 100]))
print(len(autos[autos['price'] > 500000]))
print(len(autos))
1072 12 44420
Of the 50K datapoints in this dataset, 1762 datapoints list price of car as less than 100 dollars whilst 11 list the price of the car as greater than 500,000 dollars. Considering both the unlikelihood of purchasing a car within these extremes and its make up of the total data point to be less than 5% (i.e. can remove without affecting the integrity of the data), we will use this as our liimts.
autos = autos[autos['price'].between(100, 500000)]
autos['price'].value_counts().sort_index()
100.0 77 110.0 1 111.0 1 115.0 1 117.0 1 120.0 18 125.0 4 130.0 8 135.0 1 139.0 1 140.0 7 145.0 2 149.0 6 150.0 148 156.0 1 160.0 7 170.0 2 173.0 1 175.0 9 179.0 1 180.0 24 185.0 1 188.0 1 190.0 13 193.0 1 195.0 2 199.0 27 200.0 192 210.0 1 215.0 1 ... 115000.0 1 115991.0 1 116000.0 1 119500.0 1 119900.0 1 120000.0 2 128000.0 1 129000.0 1 135000.0 1 137999.0 1 139997.0 1 145000.0 1 151990.0 1 155000.0 1 163500.0 1 163991.0 1 169000.0 1 169999.0 1 175000.0 1 180000.0 1 190000.0 1 197000.0 1 198000.0 1 220000.0 1 259000.0 1 265000.0 1 295000.0 1 299000.0 1 345000.0 1 350000.0 1 Name: price, Length: 2260, dtype: int64
autos['odometer_km'].unique()
array([150000, 70000, 50000, 80000, 10000, 30000, 125000, 90000, 20000, 60000, 100000, 5000, 40000])
autos['odometer_km'].describe()
count 43336.000000 mean 125625.230755 std 39265.241462 min 5000.000000 25% 100000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos['odometer_km'].value_counts().sort_index(ascending = False)
150000 27696 125000 4657 100000 1922 90000 1594 80000 1325 70000 1145 60000 1101 50000 964 40000 772 30000 731 20000 698 10000 215 5000 516 Name: odometer_km, dtype: int64
Considering the nature of the items as used cars, which are likely to have a very high mileage, it would be expected that there would be a large number of vehicles with extraordinarily high mileage and few vehicles with low mileage. As such, we will not be modifying the vehicle in terms of odometer.
# date_crawled + ad_created + last_seen
autos[['date_crawled','ad_created','last_seen']].head()
# the format is %YYYY-%mm-%dd (essentially 10 strings)
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
5 | 2016-03-21 13:47:45 | 2016-03-21 00:00:00 | 2016-04-06 09:45:21 |
autos['date_crawled'] = autos['date_crawled'].str[:10]
autos['ad_created'] = autos['ad_created'].str[:10]
autos['last_seen'] = autos['last_seen'].str[:10]
# Counting date_crawled
autos['date_crawled'].value_counts(normalize = True, dropna = False).sort_index(ascending = False).round(2)
2016-04-07 0.00 2016-04-06 0.00 2016-04-05 0.01 2016-04-04 0.04 2016-04-03 0.04 2016-04-02 0.04 2016-04-01 0.03 2016-03-31 0.03 2016-03-30 0.03 2016-03-29 0.03 2016-03-28 0.04 2016-03-27 0.03 2016-03-26 0.03 2016-03-25 0.03 2016-03-24 0.03 2016-03-23 0.03 2016-03-22 0.03 2016-03-21 0.04 2016-03-20 0.04 2016-03-19 0.03 2016-03-18 0.01 2016-03-17 0.03 2016-03-16 0.03 2016-03-15 0.03 2016-03-14 0.04 2016-03-13 0.02 2016-03-12 0.04 2016-03-11 0.03 2016-03-10 0.03 2016-03-09 0.03 2016-03-08 0.03 2016-03-07 0.04 2016-03-06 0.01 2016-03-05 0.03 Name: date_crawled, dtype: float64
Examining the distribution of date_crawled, which ranged from March 5th, 2016 - April 7th, 2016, there is a rounghly equivalent distribution at approx 3% across the dates in this range when the crawler accessed the ads.
# Counting date_crawled
autos['last_seen'].value_counts(normalize = True, dropna = False).sort_index(ascending = False).round(2)
2016-04-07 0.13 2016-04-06 0.23 2016-04-05 0.13 2016-04-04 0.02 2016-04-03 0.02 2016-04-02 0.03 2016-04-01 0.02 2016-03-31 0.02 2016-03-30 0.02 2016-03-29 0.02 2016-03-28 0.02 2016-03-27 0.02 2016-03-26 0.02 2016-03-25 0.02 2016-03-24 0.02 2016-03-23 0.02 2016-03-22 0.02 2016-03-21 0.02 2016-03-20 0.02 2016-03-19 0.02 2016-03-18 0.01 2016-03-17 0.03 2016-03-16 0.02 2016-03-15 0.02 2016-03-14 0.01 2016-03-13 0.01 2016-03-12 0.02 2016-03-11 0.01 2016-03-10 0.01 2016-03-09 0.01 2016-03-08 0.01 2016-03-07 0.01 2016-03-06 0.00 2016-03-05 0.00 Name: last_seen, dtype: float64
In terms of "last_seen" that looked within the range of March 5th, 2016 to April 7th, 2016, there was a noted uptake in when the majority of the ads were last seen (i.e. near April 7th), which could be explained by: (1) the corresponding to the collection period when the dataset was being compiled or (2) this being the prime period to which car sales occur since it's the end of the 1st quarter.
autos['ad_created'].value_counts(normalize = True, dropna = False).sort_index(ascending = False).round(2)
2016-04-07 0.00 2016-04-06 0.00 2016-04-05 0.01 2016-04-04 0.04 2016-04-03 0.04 2016-04-02 0.04 2016-04-01 0.03 2016-03-31 0.03 2016-03-30 0.03 2016-03-29 0.03 2016-03-28 0.04 2016-03-27 0.03 2016-03-26 0.03 2016-03-25 0.03 2016-03-24 0.03 2016-03-23 0.03 2016-03-22 0.03 2016-03-21 0.04 2016-03-20 0.04 2016-03-19 0.03 2016-03-18 0.01 2016-03-17 0.03 2016-03-16 0.03 2016-03-15 0.03 2016-03-14 0.04 2016-03-13 0.02 2016-03-12 0.04 2016-03-11 0.03 2016-03-10 0.03 2016-03-09 0.03 ... 2016-02-23 0.00 2016-02-22 0.00 2016-02-21 0.00 2016-02-20 0.00 2016-02-19 0.00 2016-02-18 0.00 2016-02-17 0.00 2016-02-16 0.00 2016-02-14 0.00 2016-02-12 0.00 2016-02-09 0.00 2016-02-08 0.00 2016-02-07 0.00 2016-02-05 0.00 2016-02-02 0.00 2016-02-01 0.00 2016-01-29 0.00 2016-01-27 0.00 2016-01-22 0.00 2016-01-16 0.00 2016-01-14 0.00 2016-01-13 0.00 2016-01-10 0.00 2016-01-07 0.00 2016-01-03 0.00 2015-12-30 0.00 2015-12-05 0.00 2015-11-10 0.00 2015-09-09 0.00 2015-08-10 0.00 Name: ad_created, Length: 74, dtype: float64
As for "ad_created", looking at the range from June 11th, 2015 to April 7th, 2016, the majority of the ads were created in March 2016.
autos['registration_year'].value_counts().sort_index()
1927 1 1931 1 1937 3 1941 2 1950 1 1951 2 1954 2 1955 1 1956 3 1957 2 1958 2 1959 6 1960 8 1961 4 1962 3 1963 7 1964 10 1965 13 1966 13 1967 18 1968 23 1969 14 1970 19 1971 20 1972 29 1973 16 1974 20 1975 14 1976 16 1977 19 ... 1994 550 1995 999 1996 1187 1997 1721 1998 2084 1999 2555 2000 2530 2001 2405 2002 2266 2003 2484 2004 2487 2005 2579 2006 2533 2007 2145 2008 2102 2009 1992 2010 1540 2011 1585 2012 1270 2013 784 2014 642 2015 364 2016 864 2017 1034 2018 349 2019 2 2800 1 5000 1 5911 1 9000 1 Name: registration_year, Length: 76, dtype: int64
Seeing as the earliest commercial car was made in 1917, we can just use this as the set minimum. Furthermore, as this dataset only took in data up until April 2017, it doesn't make sense to have registration years after 2017. So the range in registration year we are lookin for is b/t 1917 - 2017.
min_bool = autos['registration_year'] > 1917
max_bool = autos['registration_year'] < 2018
combined = min_bool & max_bool
autos = autos[combined]
autos.describe()
price | registration_year | powerps | odometer_km | registration_month | postal_code | |
---|---|---|---|---|---|---|
count | 42981.000000 | 42981.000000 | 42981.000000 | 42981.000000 | 42981.000000 | 42981.000000 |
mean | 6264.865964 | 2003.512180 | 127.107001 | 125600.381564 | 5.967776 | 51538.376492 |
std | 9231.082718 | 7.060427 | 67.377291 | 39274.090885 | 3.583941 | 25732.257434 |
min | 100.000000 | 1927.000000 | 16.000000 | 5000.000000 | 0.000000 | 1067.000000 |
25% | 1399.000000 | 1999.000000 | 80.000000 | 100000.000000 | 3.000000 | 31177.000000 |
50% | 3495.000000 | 2004.000000 | 116.000000 | 150000.000000 | 6.000000 | 50737.000000 |
75% | 7950.000000 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 72406.000000 |
max | 350000.000000 | 2017.000000 | 1405.000000 | 150000.000000 | 12.000000 | 99998.000000 |
autos['brand'].unique()
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'chrysler', 'renault', 'audi', 'sonstige_autos', 'opel', 'mazda', 'porsche', 'mini', 'mercedes_benz', 'ford', 'seat', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'land_rover', 'alfa_romeo', 'rover', 'daihatsu', 'trabant', 'lancia', 'lada'], dtype=object)
top_10_brands = autos['brand'].value_counts()[:10]
bottom_10_brands = autos['brand'].value_counts()[-10:]
top_10_brands
volkswagen 9194 bmw 4862 opel 4567 mercedes_benz 4126 audi 3842 ford 2936 renault 1929 peugeot 1264 fiat 1062 seat 809 Name: brand, dtype: int64
bottom_10_brands
daihatsu 96 subaru 92 land_rover 91 saab 73 jaguar 64 daewoo 55 rover 52 lancia 46 trabant 41 lada 21 Name: brand, dtype: int64
print(10157+5201+5155+4586+4118+3331+2272+1413+1232+888)
print(len(autos))
print((10157+5201+5155+4586+4118+3331+2272+1413+1232+888) / len(autos))
38353 42981 0.8923245154835858
Looking at the data, it seems that the majority of the used cars are German brands (i.e. Volkswagen, BMW, Audi, Mercedes Benz) whilst the major foreign brand being Ford. Nevertheless, the most common used car brands are domestic. However, as it pertains to the 10 least popular brands, they are mainly foreign brands such as Subaru, Land Rover, Daewoo, etc.
The decision to take the Top 10 brands is that these make up 89.3% of the entire dataset, which is a pretty good representation of the marketplace.
volkswagen_bool = autos['brand'] == "volkswagen"
bmw_bool = autos['brand'] == 'bmw'
opel_bool = autos['brand'] == "opel"
mercedes_bool = autos['brand'] == 'mercedes_benz'
audi_bool = autos['brand'] == "audi"
ford_bool = autos['brand'] =='ford'
renault_bool = autos['brand'] == "renault"
peugeot_bool = autos['brand'] == 'peugeot'
fiat_bool = autos['brand'] == 'fiat'
seat_bool = autos['brand'] == 'seat'
volkswagen_only = autos[volkswagen_bool]
bmw_only = autos[bmw_bool]
opel_only = autos[opel_bool]
mercedes_only = autos[mercedes_bool]
audi_only = autos[audi_bool]
ford_only = autos[ford_bool]
renault_only = autos[renault_bool]
peugeot_only = autos[peugeot_bool]
fiat_only = autos[fiat_bool]
seat_only = autos[seat_bool]
top_brands_mean_price = {}
top_10 = autos['brand'].value_counts(ascending = False)[:10].index
for maker in top_10:
selected_rows = autos[autos['brand'] == maker]
mean_price = selected_rows['price'].mean()
mean_price = mean_price.round(2)
top_brands_mean_price[maker] = mean_price
top_brands_mean_price
{'audi': 9659.37, 'bmw': 8577.61, 'fiat': 2968.12, 'ford': 3932.04, 'mercedes_benz': 8936.97, 'opel': 3151.16, 'peugeot': 3249.6, 'renault': 2635.15, 'seat': 4675.29, 'volkswagen': 5660.48}
The avg. price for each of the top 10 car brands are:
Brand | Avg. Price (in dollars) |
---|---|
Audi | 9659.37 |
BMW | 8577.61 |
Fiat | 2968.12 |
Ford | 3933.09 |
Mercedes Benz | 8936.97 |
Opel | 3151.16 |
Peugeot | 3249.60 |
Renault | 2635.15 |
Seat | 4675.29 |
Volkswagen | 5660.48 |
Looking at the data, it seems that the more luxurious brands (i.e. Audi, BMW, Mercedes Benz) have the highest price point whilst the foreign brands (i.e. Ford, Peugeot, Fiat) are on the lower end of the price point spectrum. As for the most populous brand (Volkswagen), its pricepoint is moderate.
Working with the top 10 brands, we want to see if mean mileage (i.e. odometer_km) and mean price are related to one another.
top_brands_mean_mileage = {}
top_10 = autos['brand'].value_counts(ascending = False)[:10].index
for maker in top_10:
selected_rows = autos[autos['brand'] == maker]
mean_km= selected_rows['odometer_km'].mean().round(2)
top_brands_mean_mileage[maker] = mean_km
top_brands_mean_mileage
{'audi': 128854.76, 'bmw': 132670.71, 'fiat': 116864.41, 'ford': 124105.93, 'mercedes_benz': 130822.83, 'opel': 129162.47, 'peugeot': 126475.47, 'renault': 127392.43, 'seat': 121161.93, 'volkswagen': 128454.97}
Looking at the average mileage of the Top 10 brands of cars, it seems that the range lies b/t 116000km to 133000km. There luxurious brands (BMW and Mercedes) appear to have a higher mileage as compared to the foreign brands.
top_10_price = pd.Series(top_brands_mean_price)
top_10_mileage = pd.Series(top_brands_mean_mileage)
top_10_makers = pd.DataFrame(top_10_price, columns = ['mean_price'])
top_10_makers['mean_mileage'] = top_10_mileage
top_10_makers
mean_price | mean_mileage | |
---|---|---|
audi | 9659.37 | 128854.76 |
bmw | 8577.61 | 132670.71 |
fiat | 2968.12 | 116864.41 |
ford | 3932.04 | 124105.93 |
mercedes_benz | 8936.97 | 130822.83 |
opel | 3151.16 | 129162.47 |
peugeot | 3249.60 | 126475.47 |
renault | 2635.15 | 127392.43 |
seat | 4675.29 | 121161.93 |
volkswagen | 5660.48 | 128454.97 |
It appears as though mileage doesn't seem to be as strongly related to price despite the more luxurious and priciest brands having a higher mean mileage as compared to the more economic and cheapest brands with a lower mean mileage.
col = ['brand', 'model']
Brand_Model = autos[col]
Brand_Model = Brand_Model.dropna(axis = 0)
Brand_Model['brand'].value_counts().head()
volkswagen 8878 bmw 4685 opel 4411 mercedes_benz 4031 audi 3747 Name: brand, dtype: int64
Brand_Model['model'].value_counts().head()
golf 3536 andere 2997 3er 2467 polo 1455 corsa 1426 Name: model, dtype: int64
top_model_by_brand = {}
brands = Brand_Model['brand'].unique()
for b in brands:
selected_rows = Brand_Model[Brand_Model['brand'] == b]
sorted_rows = selected_rows.mode()
top_row = selected_rows.iloc[0]
top_model = top_row['model']
top_model_by_brand[b] = top_model
top_model_by_brand
{'alfa_romeo': '156', 'audi': 'a3', 'bmw': '7er', 'chevrolet': 'andere', 'chrysler': 'voyager', 'citroen': 'andere', 'dacia': 'sandero', 'daewoo': 'kalos', 'daihatsu': 'terios', 'fiat': 'punto', 'ford': 'mondeo', 'honda': 'civic', 'hyundai': 'i_reihe', 'jaguar': 'andere', 'jeep': 'wrangler', 'kia': 'andere', 'lada': 'niva', 'lancia': 'lybra', 'land_rover': 'freelander', 'mazda': 'andere', 'mercedes_benz': 'e_klasse', 'mini': 'cooper', 'mitsubishi': 'colt', 'nissan': 'primera', 'opel': 'vectra', 'peugeot': 'andere', 'porsche': '911', 'renault': 'megane', 'rover': 'andere', 'saab': 'andere', 'seat': 'altea', 'skoda': 'octavia', 'smart': 'fortwo', 'subaru': 'andere', 'suzuki': 'grand', 'toyota': 'andere', 'trabant': '601', 'volkswagen': 'golf', 'volvo': 'andere'}
autos['odometer_km'].value_counts().sort_index()
5000 505 10000 213 20000 697 30000 726 40000 769 50000 962 60000 1096 70000 1138 80000 1313 90000 1579 100000 1906 125000 4621 150000 27456 Name: odometer_km, dtype: int64
len(autos)
42981
Normally, the best approach should be a split in terms of equally distributed groups. However, considering the aggregation of vehicles with mileage greater than or equal to 150K km, we will break it down to the following distribution:
odometer_km_splits = []
for km in autos['odometer_km']:
if km < 30001:
odometer_km_splits.append("0 - 30,000")
elif km < 60001:
odometer_km_splits.append("30,001 - 60,000")
elif km < 90001:
odometer_km_splits.append("60,001 - 90,000")
elif km < 125001:
odometer_km_splits.append("90,001 - 125,000")
else:
odometer_km_splits.append("125,000+")
autos["odometer_km_splits"] = odometer_km_splits
autos['odometer_km_splits'].value_counts()
125,000+ 27456 90,001 - 125,000 6527 60,001 - 90,000 4030 30,001 - 60,000 2827 0 - 30,000 2141 Name: odometer_km_splits, dtype: int64
price_by_mileage = {}
odometer_splits = autos["odometer_km_splits"].unique()
for km in odometer_splits:
selected_rows = autos[autos["odometer_km_splits"] == km]
avg_price = selected_rows['price'].mean().round(2)
price_by_mileage[km] = avg_price
price_by_mileage
{'0 - 30,000': 17004.72, '125,000+': 3953.13, '30,001 - 60,000': 13819.04, '60,001 - 90,000': 9826.49, '90,001 - 125,000': 6995.37}
Looking at the breakdown of the odometer, there seems to be a negative relationship between mean price points of a vehicle and mileage where the greater the mileage = the lower the price of the vehicle.
col = ['price', 'unrepaired_damage']
price_by_damage = autos[col].dropna(axis = 0)
price_by_damage
price | unrepaired_damage | |
---|---|---|
0 | 5000.0 | nein |
1 | 8500.0 | nein |
2 | 8990.0 | nein |
3 | 4350.0 | nein |
7 | 1990.0 | nein |
9 | 590.0 | nein |
10 | 999.0 | nein |
12 | 5299.0 | nein |
13 | 1350.0 | nein |
15 | 18900.0 | nein |
16 | 350.0 | nein |
17 | 5500.0 | nein |
19 | 4150.0 | nein |
21 | 41500.0 | nein |
22 | 25450.0 | nein |
23 | 7999.0 | nein |
24 | 48500.0 | nein |
26 | 777.0 | nein |
28 | 5250.0 | ja |
29 | 4999.0 | nein |
31 | 2850.0 | nein |
33 | 4800.0 | nein |
36 | 7500.0 | nein |
37 | 13800.0 | nein |
38 | 2850.0 | nein |
39 | 1450.0 | nein |
40 | 3500.0 | nein |
41 | 3200.0 | nein |
43 | 11500.0 | nein |
44 | 900.0 | nein |
... | ... | ... |
49953 | 14750.0 | nein |
49956 | 4450.0 | nein |
49959 | 4200.0 | nein |
49962 | 2200.0 | ja |
49963 | 7600.0 | nein |
49965 | 6700.0 | nein |
49966 | 1490.0 | ja |
49968 | 2100.0 | nein |
49969 | 4500.0 | nein |
49970 | 15800.0 | nein |
49971 | 950.0 | nein |
49972 | 3300.0 | ja |
49973 | 6000.0 | nein |
49975 | 9700.0 | nein |
49976 | 5900.0 | nein |
49977 | 5500.0 | nein |
49978 | 900.0 | ja |
49979 | 11000.0 | nein |
49981 | 2000.0 | nein |
49986 | 15900.0 | nein |
49987 | 21990.0 | nein |
49988 | 9550.0 | nein |
49990 | 17500.0 | nein |
49992 | 4800.0 | nein |
49994 | 5000.0 | nein |
49995 | 24900.0 | nein |
49996 | 1980.0 | nein |
49997 | 13200.0 | nein |
49998 | 22900.0 | nein |
49999 | 1250.0 | nein |
36657 rows × 2 columns
price_by_status = {}
status = price_by_damage['unrepaired_damage'].unique()
for s in status:
rows = price_by_damage[price_by_damage['unrepaired_damage'] == s]
avg_price = rows['price'].mean().round(2)
price_by_status[s] = avg_price
price_by_status
{'ja': 2354.32, 'nein': 7282.34}
Looking at whether or not there was any unrepaired damage to a vehicle, a greater average price point was found for vehicles without any unrepaired damage as compared to those with it.