In this project, i will be working on a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.The dataset was originally scraped and uploaded to Kaggle. The version of the dataset we are working with is a sample of 50,000 data points that was prepared by Dataquest including simulating a less-cleaned version of the data.
The data dictionary provided with data is as follows:
Column label | What is represents |
---|---|
dateCrawled | When this ad was first crawled |
name | Name of the car |
seller | Whether the seller is private or a dealer |
offerType | The type of listing |
price | The price on the ad to sell the car |
abtest | Whether the listing is included in an A/B test |
vehicleType | The vehicle type |
yearOfRegistration | The year in which the car was first registered |
gearbox | The transmission type |
powerPS | The power of the car in PS |
model | The car model name |
kilometer | How many kilometers the car has driven |
monthOfRegistration | The month in which the car was first registered |
fueltype | What type of fuel the car uses |
brand | The brand of the car |
notRepairedDamage | If the car has a damage which is not yet repaired |
dateCreated | The date on which the ebay listing was created |
nrOfPictures | The number of pictures in the ad |
postalCode | The postal code for the location of the vehicle |
lastSeenOnline | When the crawler saw this ad last online |
The aim of this project is to clean the data and analyze the included used car listings.
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv',encoding = 'Latin-1')
autos.info()
print('\n')
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Of the 20 autos dataset columns 5 have null values.
All the columns are of strings type except in vehicleType, gearbox, model, fueltype, noRepairedDamage columns.
print(autos.columns)
print('\n')
new_columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'powerps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'no_of_pics', 'postal_code',
'last_seen']
autos.columns = new_columns
print(autos.head())
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object') date_crawled name \ 0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... seller offer_type price abtest vehicle_type registration_year \ 0 privat Angebot $5,000 control bus 2004 1 privat Angebot $8,500 control limousine 1997 2 privat Angebot $8,990 test limousine 2009 3 privat Angebot $4,350 control kleinwagen 2007 4 privat Angebot $1,350 test kombi 2003 gearbox powerps model odometer registration_month fuel_type \ 0 manuell 158 andere 150,000km 3 lpg 1 automatik 286 7er 150,000km 6 benzin 2 manuell 102 golf 70,000km 7 benzin 3 automatik 71 fortwo 70,000km 6 benzin 4 manuell 0 focus 150,000km 7 benzin brand unrepaired_damage ad_created no_of_pics postal_code \ 0 peugeot nein 2016-03-26 00:00:00 0 79588 1 bmw nein 2016-04-04 00:00:00 0 71034 2 volkswagen nein 2016-03-26 00:00:00 0 35394 3 smart nein 2016-03-12 00:00:00 0 33729 4 ford nein 2016-04-01 00:00:00 0 39218 last_seen 0 2016-04-06 06:45:54 1 2016-04-06 14:45:08 2 2016-04-06 20:15:37 3 2016-03-15 03:16:28 4 2016-04-01 14:38:50
I have changed the column labels from camelcase to snakecase to make it easier to deal with.
autos.describe(include = 'all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | powerps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | no_of_pics | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-22 09:51:06 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
The seller, no_of_pics, offer_type columns have or almost all values are the same.
I drop them below.
autos = autos.drop(['seller','no_of_pics','offer_type'], axis = 1)
The price, odometer columns are numeric values stored as text.
I removed the non-numeric characters and converted it to numeric dtype.
I rename the column to capture the details ommitted after cleaning.
autos['price'] = autos['price'].str.replace('$','').str.replace(',','').astype(int)
autos['odometer'] = autos['odometer'].str.replace(',','').str.replace('km','').astype(int)
print(autos['price'].head(3))
print('\n')
print(autos['odometer'].head(3))
0 5000 1 8500 2 8990 Name: price, dtype: int64 0 150000 1 150000 2 70000 Name: odometer, dtype: int64
autos.rename({'odometer':'odometer_km'},axis = 1, inplace = True)
autos.rename({'price':'price_usd'},axis = 1, inplace = True)
print(autos.columns)
Index(['date_crawled', 'name', 'price_usd', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'powerps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'postal_code', 'last_seen'], dtype='object')
I continue exploring the data, looking for data that doesn't look right.
Analyze the columns using minimum and maximum values and look for any values that look unrealistically high or low that we might want to remove.
print(autos['odometer_km'].unique().shape)
print(autos['odometer_km'].describe())
print(autos['odometer_km'].value_counts().head(10))
(13,) count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64 150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 Name: odometer_km, dtype: int64
We can see that the values in this field are rounded, which might indicate that sellers had to choose from pre-set options for this field.
Additionally, there are more high mileage than low mileage vehicles.
print(autos['price_usd'].unique().shape)
print(autos['price_usd'].describe())
print(autos['price_usd'].value_counts().head(10))
(2357,) count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price_usd, dtype: float64 0 1421 500 781 1500 734 2500 643 1000 639 1200 639 600 531 800 498 3500 498 2000 460 Name: price_usd, dtype: int64
the prices in this column seem rounded, however given there are 2357 unique values in the column, that may just be people's tendency to round prices on the site.
here are 1,421 cars listed with $0 price - given that this is only 2% of the of the cars, we might consider removing these rows. The maximum price is one hundred million dollars, which seems a lot, let's look at the highest prices further.
print(autos['price_usd'].value_counts().sort_index(ascending = False).head(10))
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 Name: price_usd, dtype: int64
print(autos['price_usd'].value_counts().sort_index(ascending = True).head(20))
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 Name: price_usd, dtype: int64
There are a number of listings with prices below $30, including about 1,500 at $0. There are also a small number of listings with very high values, including 14 at around or over $1 million.
Given that eBay is an auction site, there could legitimately be items where the opening bid is $1. We will keep the $1 items, but remove anything above $350,000, since it seems that prices increase steadily to that number and then jump up to less realistic numbers.
autos = autos[autos['price_usd'].between(1,351000)]
There are five columns with date information:
I willl explore all of this columns to learn more about the listing.
print(autos[['date_crawled','ad_created','last_seen']][0:5])
date_crawled ad_created last_seen 0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54 1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08 2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37 3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28 4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50
autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False).sort_index()
2016-03-05 0.025327 2016-03-06 0.014043 2016-03-07 0.036014 2016-03-08 0.033296 2016-03-09 0.033090 2016-03-10 0.032184 2016-03-11 0.032575 2016-03-12 0.036920 2016-03-13 0.015670 2016-03-14 0.036549 2016-03-15 0.034284 2016-03-16 0.029610 2016-03-17 0.031628 2016-03-18 0.012911 2016-03-19 0.034778 2016-03-20 0.037887 2016-03-21 0.037373 2016-03-22 0.032987 2016-03-23 0.032225 2016-03-24 0.029342 2016-03-25 0.031607 2016-03-26 0.032204 2016-03-27 0.031092 2016-03-28 0.034860 2016-03-29 0.034099 2016-03-30 0.033687 2016-03-31 0.031834 2016-04-01 0.033687 2016-04-02 0.035478 2016-04-03 0.038608 2016-04-04 0.036487 2016-04-05 0.013096 2016-04-06 0.003171 2016-04-07 0.001400 Name: date_crawled, dtype: float64
Looks like the site was crawled daily over roughly a one month period in March and April 2016. The distribution of listings crawled on each day is roughly uniform.
autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).sort_index()
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000041 2016-01-13 0.000021 2016-01-14 0.000021 2016-01-16 0.000021 2016-01-22 0.000021 2016-01-27 0.000062 2016-01-29 0.000021 2016-02-01 0.000021 2016-02-02 0.000041 2016-02-05 0.000041 2016-02-07 0.000021 2016-02-08 0.000021 2016-02-09 0.000021 2016-02-11 0.000021 2016-02-12 0.000041 2016-02-14 0.000041 2016-02-16 0.000021 2016-02-17 0.000021 2016-02-18 0.000041 2016-02-19 0.000062 2016-02-20 0.000041 2016-02-21 0.000062 ... 2016-03-09 0.033151 2016-03-10 0.031895 2016-03-11 0.032904 2016-03-12 0.036755 2016-03-13 0.017008 2016-03-14 0.035190 2016-03-15 0.034016 2016-03-16 0.030125 2016-03-17 0.031278 2016-03-18 0.013590 2016-03-19 0.033687 2016-03-20 0.037949 2016-03-21 0.037579 2016-03-22 0.032801 2016-03-23 0.032060 2016-03-24 0.029280 2016-03-25 0.031751 2016-03-26 0.032266 2016-03-27 0.030989 2016-03-28 0.034984 2016-03-29 0.034037 2016-03-30 0.033501 2016-03-31 0.031875 2016-04-01 0.033687 2016-04-02 0.035149 2016-04-03 0.038855 2016-04-04 0.036858 2016-04-05 0.011819 2016-04-06 0.003253 2016-04-07 0.001256 Name: ad_created, Length: 76, dtype: float64
There is a large variety of ad created dates. Most fall within 1-2 months of the listing date, but a few are quite old, with the oldest at around 9 months.
autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).sort_index()
2016-03-05 0.001071 2016-03-06 0.004324 2016-03-07 0.005395 2016-03-08 0.007413 2016-03-09 0.009595 2016-03-10 0.010666 2016-03-11 0.012375 2016-03-12 0.023783 2016-03-13 0.008895 2016-03-14 0.012602 2016-03-15 0.015876 2016-03-16 0.016452 2016-03-17 0.028086 2016-03-18 0.007351 2016-03-19 0.015834 2016-03-20 0.020653 2016-03-21 0.020632 2016-03-22 0.021373 2016-03-23 0.018532 2016-03-24 0.019767 2016-03-25 0.019211 2016-03-26 0.016802 2016-03-27 0.015649 2016-03-28 0.020859 2016-03-29 0.022341 2016-03-30 0.024771 2016-03-31 0.023783 2016-04-01 0.022794 2016-04-02 0.024915 2016-04-03 0.025203 2016-04-04 0.024483 2016-04-05 0.124761 2016-04-06 0.221806 2016-04-07 0.131947 Name: last_seen, dtype: float64
The crawler recorded the date it last saw any listing.The last three days contain a disproportionate amount of 'last seen' values. Given that these are 6-10x the values from the previous days, it's unlikely that there was a massive spike in sales, and more likely that these values are to do with the crawling period ending and don't indicate car sales.
autos['registration_year'].describe()
count 48565.000000 mean 2004.755421 std 88.643887 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
Looking at this column, there exists some odd values. The minimum value represents a year motorvehicles had no yet been invented, and the maximum is many years into the future.
Since it is not possible to register a car before the listing is seen, any vehicle with a registration year above 2016 is definitely inaccurate.
Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.
(~autos['registration_year'].between(1900,2016)).sum() / autos.shape[0]
0.038793369710697
The number of listings with cars that fall outside the 1900 - 2016is below 4% hece we drop them below.
autos = autos[autos['registration_year'].between(1900,2016)]
autos['registration_year'].value_counts(normalize =True).head(15)
2000 0.067608 2005 0.062895 1999 0.062060 2004 0.057904 2003 0.057818 2006 0.057197 2001 0.056468 2002 0.053255 1998 0.050620 2007 0.048778 2008 0.047450 2009 0.044665 1997 0.041794 2011 0.034768 2010 0.034040 Name: registration_year, dtype: float64
It seems like most of the vehicles were first registered in the past 20 years.
autos['brand'].value_counts(normalize = True)
volkswagen 0.211264 bmw 0.110045 opel 0.107581 mercedes_benz 0.096463 audi 0.086566 ford 0.069900 renault 0.047150 peugeot 0.029841 fiat 0.025642 seat 0.018273 skoda 0.016409 nissan 0.015274 mazda 0.015188 smart 0.014160 citroen 0.014010 toyota 0.012703 hyundai 0.010025 sonstige_autos 0.009811 volvo 0.009147 mini 0.008762 mitsubishi 0.008226 honda 0.007840 kia 0.007069 alfa_romeo 0.006641 porsche 0.006127 suzuki 0.005934 chevrolet 0.005698 chrysler 0.003513 dacia 0.002635 daihatsu 0.002506 jeep 0.002271 subaru 0.002142 land_rover 0.002099 saab 0.001649 jaguar 0.001564 daewoo 0.001500 trabant 0.001392 rover 0.001328 lancia 0.001071 lada 0.000578 Name: brand, dtype: float64
German manufacturers represent four out of the top five brands, almost 50% of the overall listings. Volkswagen is by far the most popular brand, with approximately double the cars for sale of the next two brands combined.
brand_names = autos['brand'].value_counts(normalize = True)
pop_brand_names = brand_names[brand_names > 0.05].index
print(pop_brand_names)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
There are lots of brands that don't have a significant percentage of listings, so we will limit our analysis to brands representing more than 5% of total listings.
brand_mean_prices = {}
for brand in pop_brand_names:
brand_sel = autos[autos["brand"] == brand]
mean_price = brand_sel["price_usd"].mean()
brand_mean_prices[brand] = int(mean_price)
brand_mean_prices
{'audi': 9336, 'bmw': 8332, 'ford': 3749, 'mercedes_benz': 8628, 'opel': 2975, 'volkswagen': 5402}
Of the top 5 brands, there is a distinct price gap:
brand_mean_milage = {}
for brand in pop_brand_names:
brand_sel = autos[autos['brand'] == brand]
mean = brand_sel['odometer_km'].mean()
brand_mean_milage[brand] = int(mean)
brand_mean_milage
{'audi': 129157, 'bmw': 132572, 'ford': 124266, 'mercedes_benz': 130788, 'opel': 129310, 'volkswagen': 128707}
bmp_series = pd.Series(brand_mean_prices).sort_values(ascending = False)
print(bmp_series)
audi 9336 mercedes_benz 8628 bmw 8332 volkswagen 5402 ford 3749 opel 2975 dtype: int64
bmm_series = pd.Series(brand_mean_milage).sort_values(ascending = False)
print(bmm_series)
bmw 132572 mercedes_benz 130788 opel 129310 audi 129157 volkswagen 128707 ford 124266 dtype: int64
df = pd.DataFrame(bmp_series, columns = ['mean_price'])
df
mean_price | |
---|---|
audi | 9336 |
mercedes_benz | 8628 |
bmw | 8332 |
volkswagen | 5402 |
ford | 3749 |
opel | 2975 |
df_2 = pd.DataFrame(bmm_series, columns = ['mean_milage'])
df_2
mean_milage | |
---|---|
bmw | 132572 |
mercedes_benz | 130788 |
opel | 129310 |
audi | 129157 |
volkswagen | 128707 |
ford | 124266 |
df_2['mean_price'] = bmp_series
df_2
mean_milage | mean_price | |
---|---|---|
bmw | 132572 | 8332 |
mercedes_benz | 130788 | 8628 |
opel | 129310 | 2975 |
audi | 129157 | 9336 |
volkswagen | 128707 | 5402 |
ford | 124266 | 3749 |
There is a slight trend to the more expensive vehicles having higher mileage, with the less expensive vehicles having lower mileage.