In this project, our aim is to analyse a dataset of used cars from eBay Kleinanzeigen (a classifieds section of the German eBay website).
We will first step through the data cleansing process, as the dataset itself needs extensive cleansing and this is a significant part of the project.
Finally, after cleansing, we will identify the top brands, compute their average prices, and compare those against average mileage to draw inferences.
For the purposes of this project, the dataset has been limited to 50,000 rows.
The full dataset can be downloaded from here.
Below we can find the metadata about this dataset:
- `dateCrawled` - When this ad was first crawled. All field values are taken from this date.
- `name` - Name of the car.
- `seller` - Whether the seller is private or a dealer.
- `offerType` - The type of listing.
- `price` - The price on the ad to sell the car.
- `abtest` - Whether the listing is included in an A/B test.
- `vehicleType` - The vehicle type.
- `yearOfRegistration` - The year in which the car was first registered.
- `gearbox` - The transmission type.
- `powerPS` - The power of the car in PS (metric horsepower).
- `model` - The car model name.
- `kilometer` - How many kilometers the car has driven.
- `monthOfRegistration` - The month in which the car was first registered.
- `fuelType` - What type of fuel the car uses.
- `brand` - The brand of the car.
- `notRepairedDamage` - Whether the car has damage which is not yet repaired.
- `dateCreated` - The date on which the eBay listing was created.
- `nrOfPictures` - The number of pictures in the ad.
- `postalCode` - The postal code for the location of the vehicle.
- `lastSeenOnline` - When the crawler last saw this ad online.
We will start by importing the pandas and NumPy libraries and reading our CSV data.
import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv", encoding="Latin-1")
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
From the initial exploration above, we can observe the following:

- The `seller`, `nrOfPictures`, and `offerType` columns have a single value for 99.99% of the rows and can safely be ignored.
- `yearOfRegistration` has some odd values, suggested by the min being 1000 and the max being 9999; this needs further investigation.
- `monthOfRegistration` also has invalid values, suggested by the min being 0; this needs further investigation.
- The `powerPS` column is 0 for about 11% of the rows, which is invalid for that column.
- The `fuelType` column is NaN for about 10% of the rows, which we will want to exclude from any analysis based on that column.
- The `notRepairedDamage` column is NaN for about 20% of the rows, which we will want to exclude from any analysis based on that column.
- `price` and `odometer` should ideally be numeric for our analysis.
- `dateCrawled`, `dateCreated`, and `lastSeen` should be date types.

Additionally, the column names use camelCase instead of the preferred snake_case, so we will rename them.
new_columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'pictures_count', 'postal_code',
'last_seen']
autos.columns = new_columns
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | pictures_count | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
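As an aside, the camelCase-to-snake_case conversion could also be done with a small regex helper instead of hard-coding the new names. A minimal sketch (note that acronym-heavy names such as `powerPS` would come out as `power_p_s`, which is one reason a hand-written list can be preferable):

```python
import re

def camel_to_snake(name):
    # Insert "_" before every capital letter (except at the start), then lowercase
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

print(camel_to_snake("yearOfRegistration"))  # year_of_registration
print(camel_to_snake("dateCrawled"))         # date_crawled
```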
Next, we will describe the statistics of the data across all columns so that we can:

- drop any text columns where almost all values are the same
- convert any columns whose numeric data is stored as text
autos.describe(include="all")
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | pictures_count | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-16 21:50:53 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
From the above table, we can infer the following:

- `seller`, `offer_type`, and `pictures_count` hold the same value for almost the entire dataset and hence can safely be ignored.
- `price` and `odometer` are numeric values stored as text along with their units.

Let's now convert the `price` and `odometer` columns from text to numeric.
print("="*5,"Odometer","="*5)
print(autos["odometer"].value_counts())
print('\n')
print("="*5,"Price","="*5)
print(autos["price"].value_counts())
===== Odometer ===== 150,000km 32424 125,000km 5170 100,000km 2169 90,000km 1757 80,000km 1436 70,000km 1230 60,000km 1164 50,000km 1027 5,000km 967 40,000km 819 30,000km 789 20,000km 784 10,000km 264 Name: odometer, dtype: int64 ===== Price ===== $0 1421 $500 781 $1,500 734 $2,500 643 $1,000 639 ... $23,790 1 $9,655 1 $29,445 1 $29,699 1 $35,800 1 Name: price, Length: 2357, dtype: int64
First, we need to remove the non-numeric characters as seen above on both of these columns and then we can convert to a numeric dtype.
# Remove the non-numeric characters, then convert to a numeric dtype
# (regex=False treats "$" and "," as literal characters rather than regex patterns)
autos["odometer"] = (autos["odometer"]
                     .str.replace(",", "", regex=False)
                     .str.replace("km", "", regex=False)
                     .astype(int))
autos["price"] = (autos["price"]
                  .str.replace("$", "", regex=False)
                  .str.replace(",", "", regex=False)
                  .astype(int))
# Rename the column so that the metric is evident
autos.rename(columns = {"odometer":"odometer_km"}, inplace = True)
autos.describe()
price | registration_year | power_ps | odometer_km | registration_month | pictures_count | postal_code | |
---|---|---|---|---|---|---|---|
count | 5.000000e+04 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.0 | 50000.000000 |
mean | 9.840044e+03 | 2005.073280 | 116.355920 | 125732.700000 | 5.723360 | 0.0 | 50813.627300 |
std | 4.811044e+05 | 105.712813 | 209.216627 | 40042.211706 | 3.711984 | 0.0 | 25779.747957 |
min | 0.000000e+00 | 1000.000000 | 0.000000 | 5000.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1.100000e+03 | 1999.000000 | 70.000000 | 125000.000000 | 3.000000 | 0.0 | 30451.000000 |
50% | 2.950000e+03 | 2003.000000 | 105.000000 | 150000.000000 | 6.000000 | 0.0 | 49577.000000 |
75% | 7.200000e+03 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 0.0 | 71540.000000 |
max | 1.000000e+08 | 9999.000000 | 17700.000000 | 150000.000000 | 12.000000 | 0.0 | 99998.000000 |
We have ensured that the `odometer` and `price` columns are numeric.
Now we will look deeper at the values in these columns to see whether any outliers skew the statistics, and remove them.
print("="*5,"Top 20 - High price listing","="*5)
print(autos["price"].value_counts().sort_index(ascending = False).head(20))
print('\n')
print("="*5,"Bottom 20 - Low price listing","="*5)
print(autos["price"].value_counts().sort_index(ascending = True).head(20))
print('\n')
print("="*5,"Statistics of low prices","="*5)
print(autos.loc[autos["price"] < 100, "price"].describe())
print('\n')
print("="*5,"Distribution % of high prices","="*5)
print(autos.loc[autos["price"].between(350000,99999999),"price"].value_counts(normalize=True, dropna=False))
print('\n')
print("="*5,"Listing 350,000 and above","="*5)
autos.loc[autos["price"] >= 350000,["name","brand","vehicle_type","price"]]
===== Top 20 - High price listing ===== 99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 Name: price, dtype: int64 ===== Bottom 20 - Low price listing ===== 0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 Name: price, dtype: int64 ===== Statistics of low prices ===== count 1762.000000 mean 5.632804 std 18.374263 min 0.000000 25% 0.000000 50% 0.000000 75% 0.000000 max 99.000000 Name: price, dtype: float64 ===== Distribution % of high prices ===== 12345678 0.200000 999999 0.133333 11111111 0.133333 999990 0.066667 3890000 0.066667 27322222 0.066667 1234566 0.066667 350000 0.066667 99999999 0.066667 10000000 0.066667 1300000 0.066667 Name: price, dtype: float64 ===== Listing 350,000 and above =====
name | brand | vehicle_type | price | |
---|---|---|---|---|
514 | Ford_Focus_Turnier_1.6_16V_Style | ford | kombi | 999999 |
2897 | Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000 | ford | limousine | 11111111 |
7814 | Ferrari_F40 | sonstige_autos | coupe | 1300000 |
11137 | suche_maserati_3200_gt_Zustand_unwichtig_laufe... | sonstige_autos | coupe | 10000000 |
22947 | Bmw_530d_zum_ausschlachten | bmw | kombi | 1234566 |
24384 | Schlachte_Golf_3_gt_tdi | volkswagen | NaN | 11111111 |
27371 | Fiat_Punto | fiat | NaN | 12345678 |
36818 | Porsche_991 | porsche | coupe | 350000 |
37585 | Volkswagen_Jetta_GT | volkswagen | limousine | 999990 |
39377 | Tausche_volvo_v40_gegen_van | volvo | NaN | 12345678 |
39705 | Tausch_gegen_gleichwertiges | mercedes_benz | limousine | 99999999 |
42221 | Leasinguebernahme | citroen | limousine | 27322222 |
43049 | 2_VW_Busse_T3 | volkswagen | bus | 999999 |
47598 | Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar... | opel | limousine | 12345678 |
47634 | Ferrari_FXX | sonstige_autos | coupe | 3890000 |
Based on the display of the `price` column above, we see a few astronomically high prices and a lot of listings with prices that are minuscule, both of which would skew our results.
On closer inspection of the high-priced listings, a few are true vintage or high-performance cars (such as the Ferrari F40) that can command such prices, while others are clearly data-entry mistakes (such as the Fiat Punto, Escort MK1, or Maserati 3200 GT).
Similarly, the minuscule prices (between 0 and 100) likely represent scrap value rather than a usable car.
Since a large part of both groups are mistakes, we will remove any listing priced above 350,000 and, similarly, any listing priced below 100.
Let's now look at the odometer values and see if there are any outliers.
print("Number of unique odometer values: ",autos["odometer_km"].unique().shape[0])
autos["odometer_km"].value_counts().sort_index(ascending = False)
Number of unique odometer values: 13
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 40000 819 30000 789 20000 784 10000 264 5000 967 Name: odometer_km, dtype: int64
From the display of `odometer_km` data above, we can infer two things:

- the values are rounded to a fixed set of 13 buckets, suggesting sellers chose from preset options rather than entering exact readings
- roughly two-thirds of the listings sit at 150,000 km, so the field is likely capped there and most cars are high-mileage

These values look plausible rather than erroneous, so we will keep them all, remove the outliers based only on price, and then display the statistics of our cleaned dataset.
autos = autos.loc[autos["price"].between(100,350000)]
autos.describe()
price | registration_year | power_ps | odometer_km | registration_month | pictures_count | postal_code | |
---|---|---|---|---|---|---|---|
count | 48224.000000 | 48224.000000 | 48224.000000 | 48224.000000 | 48224.000000 | 48224.0 | 48224.000000 |
mean | 5930.371433 | 2004.730964 | 117.677609 | 125919.148142 | 5.801634 | 0.0 | 50987.919729 |
std | 9078.372762 | 87.897388 | 201.206304 | 39543.339640 | 3.676976 | 0.0 | 25737.119986 |
min | 100.000000 | 1000.000000 | 0.000000 | 5000.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1250.000000 | 1999.000000 | 73.000000 | 125000.000000 | 3.000000 | 0.0 | 30823.000000 |
50% | 3000.000000 | 2004.000000 | 107.000000 | 150000.000000 | 6.000000 | 0.0 | 49716.000000 |
75% | 7499.000000 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 0.0 | 71666.750000 |
max | 350000.000000 | 9999.000000 | 17700.000000 | 150000.000000 | 12.000000 | 0.0 | 99998.000000 |
Our dataset has 5 columns that represent date values. Some of these columns were created by the crawler, and some came from the website itself. We can differentiate them by referring to the data dictionary:

- `date_crawled`: added by the crawler
- `last_seen`: added by the crawler
- `ad_created`: from the website
- `registration_month`: from the website
- `registration_year`: from the website

We see that the `date_crawled`, `last_seen`, and `ad_created` columns are identified by pandas as string values, so we will first look at the string format in these columns.
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
We see that in these 3 columns the first 10 characters represent the date (in the format Year-Month-Day).
Let's use this date portion to understand the distribution of data across these columns.
print("="*5,"Distribution of date_crawled column","="*5)
print(autos["date_crawled"].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending = True))
===== Distribution of date_crawled column ===== 2016-03-05 0.025361 2016-03-06 0.014039 2016-03-07 0.036061 2016-03-08 0.033179 2016-03-09 0.033013 2016-03-10 0.032287 2016-03-11 0.032598 2016-03-12 0.036911 2016-03-13 0.015677 2016-03-14 0.036662 2016-03-15 0.034319 2016-03-16 0.029467 2016-03-17 0.031499 2016-03-18 0.012898 2016-03-19 0.034734 2016-03-20 0.037803 2016-03-21 0.037201 2016-03-22 0.032888 2016-03-23 0.032287 2016-03-24 0.029446 2016-03-25 0.031499 2016-03-26 0.032308 2016-03-27 0.031126 2016-03-28 0.034962 2016-03-29 0.034112 2016-03-30 0.033738 2016-03-31 0.031851 2016-04-01 0.033697 2016-04-02 0.035605 2016-04-03 0.038611 2016-04-04 0.036538 2016-04-05 0.013064 2016-04-06 0.003173 2016-04-07 0.001389 Name: date_crawled, dtype: float64
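Incidentally, these string columns could also be converted to proper datetimes with `pd.to_datetime`, after which the date part is available via the `.dt` accessor instead of string slicing. A sketch on sample values in the same format:

```python
import pandas as pd

# Sample values matching the format of date_crawled / ad_created / last_seen
s = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56", "2016-03-26 00:00:00"])
dates = pd.to_datetime(s)

# .dt.date exposes the date part directly, no str[:10] needed
print(dates.dt.date.value_counts())
```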
print("="*5,"Distribution of Top 5 most recent ad_created","="*5)
print(autos["ad_created"].str[:10].value_counts(normalize=True).sort_index(ascending=False).head(5))
print("\n")
print("="*5,"Distribution of Bottom 5 most old ad_created","="*5)
print(autos["ad_created"].str[:10].value_counts(normalize=True).sort_index(ascending=True).head(5))
===== Distribution of Top 5 most recent ad_created ===== 2016-04-07 0.001244 2016-04-06 0.003256 2016-04-05 0.011799 2016-04-04 0.036890 2016-04-03 0.038860 Name: ad_created, dtype: float64 ===== Distribution of Bottom 5 most old ad_created ===== 2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 Name: ad_created, dtype: float64
From the exploration of the `ad_created` column, we observe the following:

- as with the `date_crawled` column, most of the data falls between 9th March and 5th April 2016
- a few ads were created up to a year before the `date_crawled` range

print("="*5,"Distribution of last_seen column","="*5)
autos["last_seen"].str[:10].value_counts(normalize=True, ascending=False)
===== Distribution of last_seen column =====
2016-04-06 0.221964 2016-04-07 0.132154 2016-04-05 0.125062 2016-03-17 0.028098 2016-04-03 0.025133 2016-04-02 0.024884 2016-03-30 0.024697 2016-04-04 0.024531 2016-03-31 0.023826 2016-03-12 0.023785 2016-04-01 0.022852 2016-03-29 0.022292 2016-03-22 0.021359 2016-03-28 0.020840 2016-03-20 0.020654 2016-03-21 0.020550 2016-03-24 0.019762 2016-03-25 0.019098 2016-03-23 0.018580 2016-03-26 0.016672 2016-03-16 0.016444 2016-03-15 0.015863 2016-03-19 0.015760 2016-03-27 0.015552 2016-03-14 0.012629 2016-03-11 0.012400 2016-03-10 0.010638 2016-03-09 0.009580 2016-03-13 0.008875 2016-03-18 0.007320 2016-03-08 0.007320 2016-03-07 0.005433 2016-03-06 0.004313 2016-03-05 0.001078 Name: last_seen, dtype: float64
From the distribution of the `last_seen` column:

- as with the `date_crawled` column, the dates fall between 5th March and 7th April 2016

Let's now look at the `registration_year` column. Here we are dealing only with the year part of a date.
It's important to understand the data and any outliers in this column, as it indicates the registration year of the listed car, which in turn is useful for any analysis based on a car's age.
# To understand the statistics of the data behind the series
print(autos["registration_year"].describe(),'\n')
# To understand the data distribution percentage on the column
print(autos["registration_year"].value_counts(normalize=True).sort_index(ascending=True).head(5))
print(autos["registration_year"].value_counts(normalize=True).sort_index(ascending=True).tail(5))
count 48224.000000 mean 2004.730964 std 87.897388 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64 1000 0.000021 1001 0.000021 1111 0.000021 1800 0.000041 1910 0.000041 Name: registration_year, dtype: float64 5911 0.000021 6200 0.000021 8888 0.000021 9000 0.000021 9999 0.000062 Name: registration_year, dtype: float64
From the statistics on `registration_year`, it is clear that some of the values are not valid years, as evident from the min (1000) and max (9999).
The distribution also shows only a handful of listings with years before 1900 or after 2016, each accounting for a tiny fraction of the data.
We will assume that some genuine vintage cars are present in our listings, so we will treat the range 1900 to 2016 as valid and anything outside it as incorrect.
Note: we treat the two listings registered in 1800 as invalid, since automobiles did not exist in 1800.
Let's look at the number of listings inside and outside this assumed valid registration year range, and then remove the data outside it.
# To see how many listings we have within our assumed valid registration year range.
print("Listings within valid year range = ",autos["registration_year"].between(1900,2016).sum(),"\n")
print("Listings outside valid year range = ",(~autos["registration_year"].between(1900,2016)).sum(),"\n")
# To remove the values outside of our assumed valid registration year range
autos = autos[ autos["registration_year"].between(1900,2016) ]
# To see the data distribution now in our valid dataset
print(autos["registration_year"].value_counts(normalize=True).sort_index(ascending=True).head(5))
print(autos["registration_year"].value_counts(normalize=True).sort_index(ascending=True).tail(5))
Listings within valid year range = 46352 Listings outside valid year range = 1872 1910 0.000043 1927 0.000022 1929 0.000022 1931 0.000022 1934 0.000043 Name: registration_year, dtype: float64 2012 0.028219 2013 0.017281 2014 0.014282 2015 0.008198 2016 0.025932 Name: registration_year, dtype: float64
print('Number of car brands in the listing: ',autos["brand"].unique().shape[0])
Number of car brands in the listing: 40
We see that there are 40 unique car brands in our cleaned dataset.
For the purposes of this project, we will consider only the top 10 car brands by listing count.
# Select only top 10 brands based on the distribution in the car listing
car_brands = autos["brand"].value_counts(normalize = True, dropna = False).head(10).index
brand_mean_price = {}
brand_mean_mileage = {}
# Loop through top 10 brands to calculate mean price of cars
for brand in car_brands:
mean_price = autos.loc[autos["brand"] == brand, "price"].mean()
mean_mileage = autos.loc[autos["brand"] == brand, "odometer_km"].mean()
brand_mean_price[brand] = round(mean_price,2)
brand_mean_mileage[brand] = round(mean_mileage,2)
# Convert the brand aggregate data into pandas Series to sort by values
brand_agg_mean_price = pd.Series(brand_mean_price)
brand_agg_mean_mileage = pd.Series(brand_mean_mileage)
print(brand_agg_mean_price.sort_values(ascending = False))
audi 9380.72 mercedes_benz 8672.65 bmw 8381.68 volkswagen 5436.95 seat 4433.42 ford 3779.27 peugeot 3113.86 opel 3005.50 fiat 2836.87 renault 2496.07 dtype: float64
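For reference, the same per-brand aggregates can be computed without an explicit loop using `groupby`. A sketch on a small toy DataFrame (the real CSV isn't re-read here; column names match the notebook):

```python
import pandas as pd

# Toy stand-in for the cleaned listings
autos = pd.DataFrame({
    "brand":       ["bmw", "bmw", "audi", "audi", "ford"],
    "price":       [8000, 9000, 9500, 9200, 3800],
    "odometer_km": [130000, 135000, 129000, 130000, 125000],
})

# Mean price and mileage per brand, sorted by price descending
brand_agg = (autos.groupby("brand")[["price", "odometer_km"]]
                  .mean()
                  .round(2)
                  .sort_values("price", ascending=False))
print(brand_agg)
```

This produces the side-by-side price/mileage comparison in one step, which scales to the full dataset unchanged.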
From the above analysis of the top 10 car brands by listing share:

- Audi, Mercedes-Benz, and BMW are clearly the most expensive brands on average
- Volkswagen and Seat sit in the middle of the range
- Ford, Peugeot, Opel, Fiat, and Renault are the least expensive

We will also compare the average price of the top 10 brands against their average mileage.
For this, we add the mean mileage as well and build a dataframe so that we can compare the two side by side.
# Create a dataframe with a single column from a series
brand_agg = pd.DataFrame(brand_agg_mean_price, columns=["mean_price"])
# Add a column to the dataframe with another series sharing same label
brand_agg["mean_mileage"] = brand_agg_mean_mileage
brand_agg
mean_price | mean_mileage | |
---|---|---|
volkswagen | 5436.95 | 128799.88 |
bmw | 8381.68 | 132695.32 |
opel | 3005.50 | 129384.43 |
mercedes_benz | 8672.65 | 131025.67 |
audi | 9380.72 | 129245.40 |
ford | 3779.27 | 124277.11 |
renault | 2496.07 | 128281.39 |
peugeot | 3113.86 | 127127.89 |
fiat | 2836.87 | 116950.29 |
seat | 4433.42 | 121536.64 |
Comparing the average price alongside the average mileage, we see that the average mileage is broadly similar across the top brands (roughly 117,000 to 133,000 km), so the price differences are not explained by mileage: Audi, Mercedes-Benz, and BMW simply command higher prices.
We started with data cleansing as a big part of this project, in which we:

- renamed the columns from camelCase to snake_case
- converted the `price` and `odometer` columns from text to numeric, renaming `odometer` to `odometer_km`
- removed listings with unrealistic prices (below 100 or above 350,000)
- removed listings with registration years outside 1900 to 2016

After cleansing the data, we analysed the top 10 car brands by listing count along with their average prices and mileage, and concluded that while mileage varies little between brands, the premium German brands (Audi, Mercedes-Benz, and BMW) command noticeably higher average prices than the mid-range and economy brands.