In this project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
The dataset was originally scraped and uploaded to Kaggle. The version we are working with is a sample of 50,000 data points prepared by Dataquest, which also dirtied the data slightly to simulate a less-cleaned version of the original.
The data dictionary provided with the data is as follows:

- dateCrawled - When this ad was first crawled. All field values are taken from this date.
- name - Name of the car.
- seller - Whether the seller is private or a dealer.
- offerType - The type of listing.
- price - The price on the ad to sell the car.
- abtest - Whether the listing is included in an A/B test.
- vehicleType - The vehicle type.
- yearOfRegistration - The year in which the car was first registered.
- gearbox - The transmission type.
- powerPS - The power of the car in PS.
- model - The car model name.
- odometer - How many kilometers the car has driven.
- monthOfRegistration - The month in which the car was first registered.
- fuelType - What type of fuel the car uses.
- brand - The brand of the car.
- notRepairedDamage - If the car has damage which is not yet repaired.
- dateCreated - The date on which the eBay listing was created.
- nrOfPictures - The number of pictures in the ad.
- postalCode - The postal code for the location of the vehicle.
- lastSeen - When the crawler saw this ad last online.

The aim of this project is to clean the data and analyze the included used car listings.
import pandas as pd
import numpy as np
# The file isn't UTF-8 encoded; Latin-1 reads it without decoding errors.
autos = pd.read_csv('autos.csv', encoding='Latin-1')
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
This output shows that the dataset contains 20 columns, most of which are stored as strings. Several columns contain null values, but none has more than ~20% nulls.
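To double-check that claim, here is a quick sketch (assuming `autos` is the DataFrame loaded above) that computes the percentage of nulls per column:

# Sketch: share of null values per column, largest first.
null_pct = autos.isnull().sum() / len(autos) * 100
print(null_pct[null_pct > 0].sort_values(ascending=False))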
print(autos.columns)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
new_col = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
'last_seen']
autos.columns = new_col
autos.head(3)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
We have converted the column names from camelCase to snake_case and reworded some of them, based on the data dictionary, to be more descriptive.
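As an aside, the camelCase-to-snake_case conversion could also be done programmatically with a regular expression. This is only a sketch of the idea: it cannot produce the reworded names (registration_year, unrepaired_damage, ...), and it mangles acronyms such as powerPS, so the explicit list above remains the better choice here.

import re

def camel_to_snake(name):
    # Insert an underscore before each uppercase letter, then lowercase.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

# 'dateCrawled' -> 'date_crawled', but 'powerPS' -> 'power_p_s'.
print([camel_to_snake(c) for c in ['dateCrawled', 'offerType', 'powerPS']])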
Now, we present some basic data exploration to determine what other cleaning tasks need to be done.
autos.describe(include = 'all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-23 18:39:34 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
These results show that the seller, offer_type, and nr_of_pictures columns contain almost exclusively a single value, so they are candidates for dropping.
The gearbox and unrepaired_damage columns need more investigation: each takes only two distinct values, one of which dominates the dataset, and both contain missing values (see the sketch below).
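As a sketch of that investigation, we can print the value distributions directly, counting missing values explicitly:

# Sketch: distributions of the near-constant and two-value columns,
# with NaN counted as its own category.
for col in ['seller', 'offer_type', 'nr_of_pictures',
            'gearbox', 'unrepaired_damage']:
    print(autos[col].value_counts(dropna=False), '\n')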
On the other hand, the price and odometer columns hold numeric values stored as text; we will convert them to numeric types.
First, let's drop the seller, offer_type, and nr_of_pictures columns, since each contains essentially one value.
# Reassign the result: .drop() returns a new DataFrame unless inplace=True.
autos = autos.drop(["seller", "offer_type", "nr_of_pictures"], axis=1)
autos.head(5)
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
Let's clean the price and odometer columns and convert them to numeric types.
autos["price"] = autos["price"].str.replace('$','')
autos["price"] = autos["price"].str.replace(',','')
autos["price"] = autos["price"].astype(int)
autos["odometer"] = autos["odometer"].str.replace('km','')
autos["odometer"] = autos["odometer"].str.replace(',','')
autos["odometer"] = autos["odometer"].astype(int)
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
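As a quick sanity check (a sketch, not part of the original analysis), we can confirm that both columns are now integers:

# Both columns should now have an integer dtype.
print(autos[['price', 'odometer_km']].dtypes)
print(autos[['price', 'odometer_km']].head())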
We continue exploring the data, specifically looking for data that doesn't look right. We'll start by analyzing the odometer_km and price columns using their minimum and maximum values, looking for anything unrealistically high or low (outliers) that we might want to remove.
print("Shape odometer_km", autos["odometer_km"].unique().shape)
print("Describe the odometer_km:")
print(autos["odometer_km"].describe())
autos["odometer_km"].value_counts()
Shape odometer_km (13,)
Describe the odometer_km:
count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64
print("Shape price", autos["price"].unique().shape)
print("Describe the price:")
print(autos["price"].describe())
autos_price_sorted = autos["price"].value_counts().sort_index(ascending = True)
print(autos_price_sorted.head(20))
print(autos_price_sorted.tail(20))
Shape price (2357,)
Describe the price:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price, dtype: int64

197000      1
198000      1
220000      1
250000      1
259000      1
265000      1
295000      1
299000      1
345000      1
350000      1
999990      1
999999      2
1234566     1
1300000     1
3890000     1
10000000    1
11111111    2
12345678    3
27322222    1
99999999    1
Name: price, dtype: int64
The odometer_km values look plausible. However, we will remove the rows with a price of 0 as well as those priced above 350,000, since the price values jump from 350,000 straight to 999,999 and then to obviously artificial figures.
autos = autos[autos['price'].between(1,350000)]
autos.describe()
price | registration_year | power_ps | odometer_km | registration_month | nr_of_pictures | postal_code | |
---|---|---|---|---|---|---|---|
count | 48565.000000 | 48565.000000 | 48565.000000 | 48565.000000 | 48565.000000 | 48565.0 | 48565.000000 |
mean | 5888.935591 | 2004.755421 | 117.197158 | 125770.101925 | 5.782251 | 0.0 | 50975.745207 |
std | 9059.854754 | 88.643887 | 200.649618 | 39788.636804 | 3.685595 | 0.0 | 25746.968398 |
min | 1.000000 | 1000.000000 | 0.000000 | 5000.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1200.000000 | 1999.000000 | 71.000000 | 125000.000000 | 3.000000 | 0.0 | 30657.000000 |
50% | 3000.000000 | 2004.000000 | 107.000000 | 150000.000000 | 6.000000 | 0.0 | 49716.000000 |
75% | 7490.000000 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 0.0 | 71665.000000 |
max | 350000.000000 | 9999.000000 | 17700.000000 | 150000.000000 | 12.000000 | 0.0 | 99998.000000 |
The new statistical description shows that the average price is almost $5,900, with the middle 50% of values lying between $1,200 and $7,490.
Next, let's look at the date columns. First, we need to understand how the values in the three string columns are formatted; they all represent full timestamp values:
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
To understand the date range, we can extract just the first 10 characters, which represent the date portion. Then we calculate the distribution of values in these columns.
date_crawled = autos['date_crawled'].str[:10]
date_crawled.value_counts(normalize=True, dropna=False).sort_index(ascending = True)
2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64
The results show that the distribution of values in the date_crawled column is fairly uniform across most dates. On the last three dates, however, the number of ads crawled drops relative to the rest, falling below 1.5% per day. Note that March also contains three exceptional dates where the percentage dips below 1.5%.
ad_created = autos['ad_created'].str[:10]
value_counts_ad_created = ad_created.value_counts(normalize=True, dropna=False).sort_index(ascending = True)
print(value_counts_ad_created.head(10))
print(value_counts_ad_created.tail(35))
value_counts_ad_created.describe()
2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
Name: ad_created, dtype: float64

2016-03-04    0.001483
2016-03-05    0.022897
2016-03-06    0.015320
2016-03-07    0.034737
2016-03-08    0.033316
2016-03-09    0.033151
2016-03-10    0.031895
2016-03-11    0.032904
2016-03-12    0.036755
2016-03-13    0.017008
2016-03-14    0.035190
2016-03-15    0.034016
2016-03-16    0.030125
2016-03-17    0.031278
2016-03-18    0.013590
2016-03-19    0.033687
2016-03-20    0.037949
2016-03-21    0.037579
2016-03-22    0.032801
2016-03-23    0.032060
2016-03-24    0.029280
2016-03-25    0.031751
2016-03-26    0.032266
2016-03-27    0.030989
2016-03-28    0.034984
2016-03-29    0.034037
2016-03-30    0.033501
2016-03-31    0.031875
2016-04-01    0.033687
2016-04-02    0.035149
2016-04-03    0.038855
2016-04-04    0.036858
2016-04-05    0.011819
2016-04-06    0.003253
2016-04-07    0.001256
Name: ad_created, dtype: float64

count    76.000000
mean      0.013158
std       0.015956
min       0.000021
25%       0.000021
50%       0.000144
75%       0.032112
max       0.038855
Name: ad_created, dtype: float64
The earliest ads in the dataset were created in mid-to-late 2015, but each of those dates accounts for well under 1% of listings, often a single ad. From the beginning of March 2016 onward the volume increases sharply, holding at roughly 3% per day until early April.
last_seen = autos['last_seen'].str[:10]
value_counts_last_seen = last_seen.value_counts(normalize=True, dropna=False).sort_index(ascending = True)
print(value_counts_last_seen)
value_counts_last_seen.describe()
2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64

count    34.000000
mean      0.029412
std       0.043718
min       0.001071
25%       0.012432
50%       0.019489
75%       0.023783
max       0.221806
Name: last_seen, dtype: float64
The results show that the distribution of last_seen values is fairly balanced across most dates, at roughly 1-2% per day. The last three dates, however, each account for more than 12% of listings. This spike is most likely an artifact of the crawl period ending rather than a sudden surge of cars being sold.
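One way to probe this further (a sketch, not part of the original analysis) is to convert the string timestamps to proper datetimes and look at how long listings stay online:

# Sketch: listing lifetime = last_seen - date_crawled.
crawled = pd.to_datetime(autos['date_crawled'])
seen = pd.to_datetime(autos['last_seen'])
print((seen - crawled).describe())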
autos["registration_year"].describe()
count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
This description shows that the registration_year column contains odd values: the minimum is 1000 (long before cars existed) and the maximum is 9999 (far in the future).
Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.
autos_year_1900 = autos[autos['registration_year'] < 1900]
print("Number of autos with a year registration less than 1900:",autos_year_1900.shape[0])
autos_year_2016 = autos[autos['registration_year'] > 2016]
print("Number of autos with a year registration higuer than 2016:",autos_year_2016.shape[0])
print("\nValues count of the registration year less than 1940:")
print(autos.loc[(autos["registration_year"] < 1940),"registration_year"].value_counts().sort_index(ascending = True).head(10))
Number of autos with a registration year less than 1900: 5
Number of autos with a registration year higher than 2016: 1879

Values count of the registration year less than 1940:
1000    1
1001    1
1111    1
1800    2
1910    5
1927    1
1929    1
1931    1
1934    2
1937    4
Name: registration_year, dtype: int64
Only 5 listings have a registration year before 1900; these rows will be removed, since they clearly contain invalid registration years.
Now, let's analyze the listings with registration year 1910, the next year present in the data after 1900.
autos[autos["registration_year"] == 1910]
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3679 | 2016-04-04 00:36:17 | Suche_Auto | privat | Angebot | 1 | test | NaN | 1910 | NaN | 0 | NaN | 5000 | 0 | NaN | sonstige_autos | NaN | 2016-04-04 00:00:00 | 0 | 40239 | 2016-04-04 07:49:15 |
22659 | 2016-03-14 08:51:18 | Opel_Corsa_B | privat | Angebot | 500 | test | NaN | 1910 | NaN | 0 | corsa | 150000 | 0 | NaN | opel | NaN | 2016-03-14 00:00:00 | 0 | 52393 | 2016-04-03 07:53:55 |
28693 | 2016-03-22 17:48:41 | Renault_Twingo | privat | Angebot | 599 | control | kleinwagen | 1910 | manuell | 0 | NaN | 5000 | 0 | benzin | renault | NaN | 2016-03-22 00:00:00 | 0 | 70376 | 2016-04-06 09:16:59 |
30781 | 2016-03-25 13:47:46 | Opel_Calibra_V6_DTM_Bausatz_1:24 | privat | Angebot | 30 | test | NaN | 1910 | NaN | 0 | calibra | 100000 | 0 | NaN | opel | NaN | 2016-03-25 00:00:00 | 0 | 47638 | 2016-03-26 23:46:29 |
45157 | 2016-03-11 22:37:01 | Motorhaube | privat | Angebot | 15 | control | NaN | 1910 | NaN | 0 | NaN | 5000 | 0 | NaN | trabant | NaN | 2016-03-11 00:00:00 | 0 | 90491 | 2016-03-25 11:18:57 |
There are five rows with 1910 in the registration_year column. Most have missing values in the vehicle_type, gearbox, fuel_type, unrepaired_damage, and model columns. Moreover, registration_month is 0 for all of them, which is not a valid month. The prices also look suspect, ranging from 1 to 599. Finally, it is very unlikely that cars registered in 1910 are genuinely for sale on eBay. For these reasons, the rows will be removed.
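To quantify the missingness, here is a short sketch counting the missing values in each of these rows:

# Sketch: number of missing values per row for the 1910 listings.
rows_1910 = autos[autos['registration_year'] == 1910]
print(rows_1910.isnull().sum(axis=1))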
Let's analyze the autos with a registration year between 1927 and 1940, since it is plausible that cars from these years are genuinely listed on eBay.
autos_1927_1940 = autos[autos["registration_year"].between(1927,1940)]
percent_missing = autos_1927_1940.isnull().sum() * 100 / len(autos_1927_1940)
print("Percent missing for the autos with registration year between 1927-1940: \n", percent_missing)
autos_1927_1940.describe()
Percent missing for the autos with registration year between 1927-1940:
date_crawled           0.000000
name                   0.000000
seller                 0.000000
offer_type             0.000000
price                  0.000000
abtest                 0.000000
vehicle_type           0.000000
registration_year      0.000000
gearbox               27.272727
power_ps               0.000000
model                 45.454545
odometer_km            0.000000
registration_month     0.000000
fuel_type             18.181818
brand                  0.000000
unrepaired_damage     27.272727
ad_created             0.000000
nr_of_pictures         0.000000
postal_code            0.000000
last_seen              0.000000
dtype: float64
price | registration_year | power_ps | odometer_km | registration_month | nr_of_pictures | postal_code | |
---|---|---|---|---|---|---|---|
count | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.000000 | 11.0 | 11.000000 |
mean | 13021.818182 | 1934.545455 | 24.272727 | 30454.545455 | 4.272727 | 0.0 | 45506.000000 |
std | 9479.701280 | 3.958880 | 28.344632 | 48138.059031 | 3.133398 | 0.0 | 29825.147376 |
min | 3000.000000 | 1927.000000 | 0.000000 | 5000.000000 | 0.000000 | 0.0 | 3042.000000 |
25% | 5900.000000 | 1932.500000 | 0.000000 | 5000.000000 | 1.000000 | 0.0 | 16629.000000 |
50% | 8990.000000 | 1937.000000 | 15.000000 | 5000.000000 | 5.000000 | 0.0 | 53804.000000 |
75% | 21000.000000 | 1937.000000 | 39.500000 | 30000.000000 | 7.000000 | 0.0 | 65843.500000 |
max | 27500.000000 | 1939.000000 | 85.000000 | 150000.000000 | 8.000000 | 0.0 | 90489.000000 |
The results show that the prices in these rows are plausible, and most columns have relatively few missing values (model, missing in ~45% of these rows, is the main exception).
We will therefore take 1927 as the lowest acceptable registration year. The highest will be 2016, because a car can't be first registered after its listing was seen: any vehicle with a registration year above 2016 is definitely inaccurate.
autos = autos[autos['registration_year'].between(1927,2016)]
value_counts_autos_reg_year = autos['registration_year'].value_counts(normalize=True).sort_index(ascending = True)
print(value_counts_autos_reg_year.head(30))
print(value_counts_autos_reg_year.tail(20))
value_counts_autos_reg_year.describe()
1927    0.000021
1929    0.000021
1931    0.000021
1934    0.000043
1937    0.000086
1938    0.000021
1939    0.000021
1941    0.000043
1943    0.000021
1948    0.000021
1950    0.000064
1951    0.000043
1952    0.000021
1953    0.000021
1954    0.000043
1955    0.000043
1956    0.000086
1957    0.000043
1958    0.000086
1959    0.000129
1960    0.000493
1961    0.000129
1962    0.000086
1963    0.000171
1964    0.000257
1965    0.000364
1966    0.000471
1967    0.000557
1968    0.000557
1969    0.000407
Name: registration_year, dtype: float64

1997    0.041799
1998    0.050626
1999    0.062066
2000    0.067615
2001    0.056474
2002    0.053261
2003    0.057824
2004    0.057910
2005    0.062902
2006    0.057203
2007    0.048783
2008    0.047455
2009    0.044670
2010    0.034043
2011    0.034772
2012    0.028066
2013    0.017204
2014    0.014204
2015    0.008398
2016    0.026138
Name: registration_year, dtype: float64

count    77.000000
mean      0.012987
std       0.020765
min       0.000021
25%       0.000086
50%       0.000728
75%       0.017204
max       0.067615
Name: registration_year, dtype: float64
We can see that only a handful of cars were registered in the early decades of the 1900s, and the share rises steadily after the 1960s. The highest percentages (roughly 3-6% per year) fall between the late 1990s and 2011.
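To make this trend easier to read, here is a sketch that aggregates the normalized shares by decade (reusing value_counts_autos_reg_year from the cell above):

# Sketch: sum the per-year shares within each decade.
decades = value_counts_autos_reg_year.index // 10 * 10
print(value_counts_autos_reg_year.groupby(decades).sum())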
When working with data on cars, it's natural to explore variations across different car brands. We will explore the unique values in the brand column, and decide on which brands we want to aggregate by.
value_counts_brand = autos['brand'].value_counts(normalize=True)
print(value_counts_brand.head(10))
autos['brand'].describe()
volkswagen       0.211286
bmw              0.110057
opel             0.107550
mercedes_benz    0.096474
audi             0.086576
ford             0.069907
renault          0.047133
peugeot          0.029844
fiat             0.025645
seat             0.018275
Name: brand, dtype: float64

count          46676
unique            40
top       volkswagen
freq            9862
Name: brand, dtype: object
The description of the brand column shows 40 unique values, with volkswagen the most common. For further analysis, we select the brands that account for more than 5% of total listings.
brands_selected = value_counts_brand[value_counts_brand > 0.05]

brand_mean_prices = {}
brand_mean_mileage = {}
for b in brands_selected.index:
    # Select the listings for this brand and average its price and mileage.
    prices_brand = autos.loc[autos["brand"] == b, "price"]
    mileage_brand = autos.loc[autos["brand"] == b, "odometer_km"]
    brand_mean_prices[b] = round(prices_brand.mean(), 2)
    brand_mean_mileage[b] = round(mileage_brand.mean(), 2)

# Combine both dictionaries into a single DataFrame for easy comparison.
bm_price_series = pd.Series(brand_mean_prices)
bm_mileage_series = pd.Series(brand_mean_mileage)
df_brands = pd.DataFrame(bm_price_series, columns=['mean_price'])
df_brands['mean_mileage_km'] = bm_mileage_series
df_brands
mean_price | mean_mileage_km | |
---|---|---|
audi | 9336.69 | 129157.39 |
bmw | 8332.82 | 132572.51 |
ford | 3749.47 | 124266.01 |
mercedes_benz | 8628.45 | 130788.36 |
opel | 2976.32 | 129311.75 |
volkswagen | 5402.41 | 128707.16 |
The above results show that the most expensive brands are Audi, BMW, and Mercedes Benz, while Ford and Opel are the cheapest; Volkswagen sits in between. Average mileage is quite similar across brands, although BMW and Mercedes Benz show the highest values.
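As a closing sketch, sorting the aggregate table by mean price makes the brand comparison explicit:

# Sketch: brands ordered from most to least expensive on average.
print(df_brands.sort_values('mean_price', ascending=False))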