In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The dataset was originally scraped and uploaded to Kaggle; the version we use here has a few modifications from that original upload, among them being sampled down to 50,000 listings.
The aim of this project is to clean the data and analyze the included used car listings.
import numpy as np
import pandas as pd
#Let's read in the dataset
autos = pd.read_csv('autos.csv', encoding='Latin-1')
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
In the above cells we read the autos dataset into a dataframe.
Let's convert the column names from camelCase to snake_case and reword some of them, based on the data dictionary, to be more descriptive.
#Let's get the names of the columns as they are:
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
#We rename the columns to snake_case:
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'n_pictures', 'postal_code',
'last_seen']
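The list above also rewords several names (e.g. yearOfRegistration to registration_year), which no automatic conversion can do. For the pure camelCase-to-snake_case part, though, a regular expression would work; here's a sketch (camel_to_snake is a hypothetical helper, not part of the project):

```python
import re

def camel_to_snake(name):
    # Insert an underscore before every uppercase letter, then lowercase
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

print(camel_to_snake('dateCrawled'))  # date_crawled
print(camel_to_snake('offerType'))    # offer_type
```

Note that runs of capitals such as powerPS come out as power_p_s, so a manual pass is still needed for those.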
autos.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | n_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
In the above cells we renamed the dataframe's columns to snake_case, which is easier to work with.
Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for text columns where almost all values are the same (these can often be dropped) and for numeric data stored as text that needs to be cleaned.
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | n_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-27 22:55:05 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
autos['seller'].value_counts()
#We see that there is only one entry which isn't from a private person.
privat 49999 gewerblich 1 Name: seller, dtype: int64
autos['offer_type'].value_counts()
#Only one entry differs.
Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64
autos['price'].value_counts()
#This column needs additional clean up.
$0 1421 $500 781 $1,500 734 $2,500 643 $1,000 639 $1,200 639 $600 531 $3,500 498 $800 498 $2,000 460 $999 434 $750 433 $900 420 $650 419 $850 410 $700 395 $4,500 394 $300 384 $2,200 382 $950 379 $1,100 376 $1,300 371 $3,000 365 $550 356 $1,800 355 $5,500 340 $1,250 335 $350 335 $1,600 327 $1,999 322 ... $2,004 1 $459 1 $11,899 1 $197,000 1 $42,999 1 $69,999 1 $14,337 1 $4,855 1 $27,322,222 1 $33,650 1 $4,005 1 $20,589 1 $1,698 1 $5,248 1 $26,150 1 $7,795 1 $28,399 1 $17,695 1 $135,000 1 $1,414 1 $356 1 $5,913 1 $25,450 1 $14,888 1 $33,777 1 $13,383 1 $169,000 1 $25,490 1 $12,690 1 $29,600 1 Name: price, Length: 2357, dtype: int64
autos['odometer'].value_counts()
#This column needs additional clean up.
150,000km 32424 125,000km 5170 100,000km 2169 90,000km 1757 80,000km 1436 70,000km 1230 60,000km 1164 50,000km 1027 5,000km 967 40,000km 819 30,000km 789 20,000km 784 10,000km 264 Name: odometer, dtype: int64
autos['n_pictures'].value_counts()
#We see that all ads are without pictures. Maybe we can drop that column.
0 50000 Name: n_pictures, dtype: int64
From the above it looks like the 'seller', 'offer_type' and 'n_pictures' columns can be dropped. It also seems like the 'odometer' and 'price' columns need additional clean up - they hold numeric values stored as text. Let's clean them up:
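We don't actually drop 'seller', 'offer_type' and 'n_pictures' in this notebook (a later section still translates the first two), but the drop itself is a one-liner; a minimal sketch on toy data:

```python
import pandas as pd

# Toy frame mirroring the (nearly) constant columns in autos
toy = pd.DataFrame({'seller': ['privat', 'privat'],
                    'offer_type': ['Angebot', 'Angebot'],
                    'n_pictures': [0, 0],
                    'price': ['$5,000', '$8,500']})

# Drop columns whose values are (almost) all identical
toy = toy.drop(['seller', 'offer_type', 'n_pictures'], axis=1)
print(list(toy.columns))  # ['price']
```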
#First we will clean the price column:
autos['price'] = (autos['price']
                  .str.replace('$', '', regex=False)
                  .str.replace(',', '', regex=False)
                  .astype(int))
autos['price'].head()
0 5000 1 8500 2 8990 3 4350 4 1350 Name: price, dtype: int64
#Next we will clean the odometer column:
autos['odometer'] = (autos['odometer']
                     .str.replace('km', '', regex=False)
                     .str.replace(',', '', regex=False)
                     .astype(float))
autos['odometer'].head()
0 150000.0 1 150000.0 2 70000.0 3 70000.0 4 150000.0 Name: odometer, dtype: float64
#Let's rename the odometer column, so that it's clear that it is in km:
autos = autos.rename({'odometer':'odometer_km'}, axis=1)
Now that we've cleaned up the price and odometer columns, let's explore them further.
#Let's see how many unique values there are:
autos['price'].unique().shape
(2357,)
#Let's view min/max/median/mean etc
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
autos['price'].value_counts().sort_index(ascending=True).head(40)
#We see the 40 lowest prices and how many cars are priced at them.
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 40 6 45 4 47 1 49 4 50 49 55 2 59 1 60 9 65 5 66 1 70 10 75 5 79 1 80 15 89 1 90 5 99 19 100 134 110 3 111 2 Name: price, dtype: int64
autos['price'].value_counts().sort_index(ascending=False).head(20)
#We see the 20 highest prices and how many cars are priced at them.
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 Name: price, dtype: int64
Looking at the highest prices, there's a huge jump from 350,000 to 999,990, and everything from 999,990 upward (999999, 11111111, 99999999 and the like) looks like placeholder input rather than a real price. Let's keep only prices below 999,990; since no listing falls between 350,000 and 999,990, the effective maximum becomes 350,000. Additionally, 1,421 cars are listed at $0. Let's drop those rows as well.
autos = autos[(autos['price'] >= 1) & (autos['price'] < 999990)]
autos['price'].describe()
count 48565.000000 mean 5888.935591 std 9059.854754 min 1.000000 25% 1200.000000 50% 3000.000000 75% 7490.000000 max 350000.000000 Name: price, dtype: float64
#Let's see how many rows remain after filtering
autos['odometer_km'].shape
(48565,)
#Let's view min/max/median/mean etc
autos['odometer_km'].describe()
count 48565.000000 mean 125770.101925 std 39788.636804 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos['odometer_km'].value_counts(ascending=True).head(10)
10000.0 253 20000.0 762 30000.0 780 40000.0 815 5000.0 836 50000.0 1012 60000.0 1155 70000.0 1217 80000.0 1415 90000.0 1734 Name: odometer_km, dtype: int64
autos['odometer_km'].value_counts(ascending=False).head()
150000.0 31414 125000.0 5057 100000.0 2115 90000.0 1734 80000.0 1415 Name: odometer_km, dtype: int64
The odometer values are all round numbers, which suggests sellers chose a mileage from a predefined list rather than entering the exact reading.
There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. Right now, the date_crawled, last_seen, and ad_created columns are identified by pandas as string values, so we'll need to convert them to a numerical representation to analyze them quantitatively. The other two (registration_year and registration_month) are already stored as numeric values.
Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values, like so:
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.025327 2016-03-06 0.014043 2016-03-07 0.036014 2016-03-08 0.033296 2016-03-09 0.033090 2016-03-10 0.032184 2016-03-11 0.032575 2016-03-12 0.036920 2016-03-13 0.015670 2016-03-14 0.036549 2016-03-15 0.034284 2016-03-16 0.029610 2016-03-17 0.031628 2016-03-18 0.012911 2016-03-19 0.034778 2016-03-20 0.037887 2016-03-21 0.037373 2016-03-22 0.032987 2016-03-23 0.032225 2016-03-24 0.029342 2016-03-25 0.031607 2016-03-26 0.032204 2016-03-27 0.031092 2016-03-28 0.034860 2016-03-29 0.034099 2016-03-30 0.033687 2016-03-31 0.031834 2016-04-01 0.033687 2016-04-02 0.035478 2016-04-03 0.038608 2016-04-04 0.036487 2016-04-05 0.013096 2016-04-06 0.003171 2016-04-07 0.001400 Name: date_crawled, dtype: float64
We see that the ads were crawled between March 5th, 2016 and April 7th, 2016. It seems that the ads were crawled almost daily.
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000041 2016-01-13 0.000021 2016-01-14 0.000021 2016-01-16 0.000021 2016-01-22 0.000021 2016-01-27 0.000062 2016-01-29 0.000021 2016-02-01 0.000021 2016-02-02 0.000041 2016-02-05 0.000041 2016-02-07 0.000021 2016-02-08 0.000021 2016-02-09 0.000021 2016-02-11 0.000021 2016-02-12 0.000041 2016-02-14 0.000041 2016-02-16 0.000021 2016-02-17 0.000021 2016-02-18 0.000041 2016-02-19 0.000062 2016-02-20 0.000041 2016-02-21 0.000062 ... 2016-03-09 0.033151 2016-03-10 0.031895 2016-03-11 0.032904 2016-03-12 0.036755 2016-03-13 0.017008 2016-03-14 0.035190 2016-03-15 0.034016 2016-03-16 0.030125 2016-03-17 0.031278 2016-03-18 0.013590 2016-03-19 0.033687 2016-03-20 0.037949 2016-03-21 0.037579 2016-03-22 0.032801 2016-03-23 0.032060 2016-03-24 0.029280 2016-03-25 0.031751 2016-03-26 0.032266 2016-03-27 0.030989 2016-03-28 0.034984 2016-03-29 0.034037 2016-03-30 0.033501 2016-03-31 0.031875 2016-04-01 0.033687 2016-04-02 0.035149 2016-04-03 0.038855 2016-04-04 0.036858 2016-04-05 0.011819 2016-04-06 0.003253 2016-04-07 0.001256 Name: ad_created, Length: 76, dtype: float64
We see that the oldest ads are from June 11th, 2015 and the newest ones - from April 7th, 2016.
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.001071 2016-03-06 0.004324 2016-03-07 0.005395 2016-03-08 0.007413 2016-03-09 0.009595 2016-03-10 0.010666 2016-03-11 0.012375 2016-03-12 0.023783 2016-03-13 0.008895 2016-03-14 0.012602 2016-03-15 0.015876 2016-03-16 0.016452 2016-03-17 0.028086 2016-03-18 0.007351 2016-03-19 0.015834 2016-03-20 0.020653 2016-03-21 0.020632 2016-03-22 0.021373 2016-03-23 0.018532 2016-03-24 0.019767 2016-03-25 0.019211 2016-03-26 0.016802 2016-03-27 0.015649 2016-03-28 0.020859 2016-03-29 0.022341 2016-03-30 0.024771 2016-03-31 0.023783 2016-04-01 0.022794 2016-04-02 0.024915 2016-04-03 0.025203 2016-04-04 0.024483 2016-04-05 0.124761 2016-04-06 0.221806 2016-04-07 0.131947 Name: last_seen, dtype: float64
The crawler records when it last saw a listing, so last_seen can serve as a rough proxy for when a listing was removed (possibly because the car was sold). The disproportionately large share in the last three days, however, mostly reflects the end of the crawling period rather than a surge in sales.
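A sketch of how these columns could be used once parsed: with pd.to_datetime we can compute how long each listing stayed online (toy timestamps below, in the same format as the real columns):

```python
import pandas as pd

toy = pd.DataFrame({
    'date_crawled': ['2016-03-26 17:47:46', '2016-04-04 13:38:56'],
    'last_seen':    ['2016-04-06 06:45:54', '2016-04-06 14:45:08'],
})

crawled = pd.to_datetime(toy['date_crawled'])
seen = pd.to_datetime(toy['last_seen'])

# Whole days between first crawl and last sighting of each listing
days_online = (seen - crawled).dt.days
print(days_online.tolist())  # [10, 2]
```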
autos['registration_year'].describe()
count 48565.000000 mean 2004.755421 std 88.643887 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
autos['registration_year'].value_counts(normalize=True).sort_index()
1000 0.000021 1001 0.000021 1111 0.000021 1800 0.000041 1910 0.000103 1927 0.000021 1929 0.000021 1931 0.000021 1934 0.000041 1937 0.000082 1938 0.000021 1939 0.000021 1941 0.000041 1943 0.000021 1948 0.000021 1950 0.000062 1951 0.000041 1952 0.000021 1953 0.000021 1954 0.000041 1955 0.000041 1956 0.000082 1957 0.000041 1958 0.000082 1959 0.000124 1960 0.000474 1961 0.000124 1962 0.000082 1963 0.000165 1964 0.000247 ... 2000 0.064985 2001 0.054278 2002 0.051189 2003 0.055575 2004 0.055657 2005 0.060455 2006 0.054978 2007 0.046886 2008 0.045609 2009 0.042932 2010 0.032719 2011 0.033419 2012 0.026974 2013 0.016535 2014 0.013652 2015 0.008072 2016 0.025121 2017 0.028663 2018 0.009678 2019 0.000041 2800 0.000021 4100 0.000021 4500 0.000021 4800 0.000021 5000 0.000082 5911 0.000021 6200 0.000021 8888 0.000021 9000 0.000021 9999 0.000062 Name: registration_year, Length: 95, dtype: float64
There are values in the registration_year column that are clearly incorrect: the minimum is 1000, long before cars existed, and the maximum is 9999, far in the future.
Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.
autos['registration_year'].shape[0]-autos['registration_year'].between(1900,2016).sum()
1884
We see that 1,884 of the 48,565 listings fall outside the 1900-2016 registration year interval. Since this accounts for less than 4% of the data, it is safe to remove those rows.
autos = autos[autos['registration_year'].between(1900, 2016)]
autos['registration_year'].value_counts(normalize=True).head()
2000 0.067608 2005 0.062895 1999 0.062060 2004 0.057904 2003 0.057818 Name: registration_year, dtype: float64
It looks like the majority of the cars were registered between 1999 and 2005.
(autos['brand'].value_counts(normalize=True)*100)
volkswagen 21.126368 bmw 11.004477 opel 10.758124 mercedes_benz 9.646323 audi 8.656627 ford 6.989996 renault 4.714980 peugeot 2.984083 fiat 2.564212 seat 1.827296 skoda 1.640925 nissan 1.527388 mazda 1.518819 smart 1.415994 citroen 1.400998 toyota 1.270324 hyundai 1.002549 sonstige_autos 0.981127 volvo 0.914719 mini 0.876159 mitsubishi 0.822604 honda 0.784045 kia 0.706926 alfa_romeo 0.664082 porsche 0.612669 suzuki 0.593389 chevrolet 0.569825 chrysler 0.351321 dacia 0.263490 daihatsu 0.250637 jeep 0.227073 subaru 0.214220 land_rover 0.209936 saab 0.164949 jaguar 0.156381 daewoo 0.149954 trabant 0.139243 rover 0.132816 lancia 0.107110 lada 0.057839 Name: brand, dtype: float64
Unsurprisingly, German brands make up about 60% of all listings. Volkswagen is the top brand, and its share (about 21%) roughly equals that of the next two brands, BMW and Opel, combined. Many brands have only a small share of listings, so for this analysis we will focus on brands that make up more than 5% of the listings.
#Let's isolate the top brands
brand_counts = (autos['brand'].value_counts(normalize=True)*100)
top_brands = brand_counts[brand_counts > 5].index
top_brands
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
brand_mean_prices = {}
for brand in top_brands:
by_brand = autos[autos['brand']==brand]
mean_price = by_brand['price'].mean()
brand_mean_prices[brand] = int(mean_price)
print(brand_mean_prices)
{'opel': 2975, 'audi': 9336, 'mercedes_benz': 8628, 'bmw': 8332, 'ford': 3749, 'volkswagen': 5402}
Based on the average prices we see that Audi (9,336 USD), Mercedes-Benz (8,628 USD) and BMW (8,332 USD) are the most expensive of the top brands; Ford (3,749 USD) and Opel (2,975 USD) are much cheaper; and Volkswagen (5,402 USD) sits in between, which may partly explain its popularity.
km_per_brand = {}
for brand in top_brands:
by_brand = autos[autos['brand'] == brand]
mean_km = by_brand['odometer_km'].mean()
km_per_brand[brand] = int(mean_km)
km_per_brand
{'audi': 129157, 'bmw': 132572, 'ford': 124266, 'mercedes_benz': 130788, 'opel': 129310, 'volkswagen': 128707}
Above we created a dictionary holding the average mileage per brand. Now let's build a dataframe so that we can easily compare the average price and average mileage per brand and see if there is any correlation.
mean_prices = pd.Series(brand_mean_prices)
mean_mileage = pd.Series(km_per_brand)
brand_info = pd.DataFrame(mean_prices, columns=['mean_prices'])
brand_info['mean_mileage'] = mean_mileage
brand_info
mean_prices | mean_mileage | |
---|---|---|
audi | 9336 | 129157 |
bmw | 8332 | 132572 |
ford | 3749 | 124266 |
mercedes_benz | 8628 | 130788 |
opel | 2975 | 129310 |
volkswagen | 5402 | 128707 |
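As a design note, the loop-plus-dictionary approach above works, but the same comparison table can be built in one step with groupby/agg; a sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({'brand': ['bmw', 'bmw', 'ford', 'ford'],
                    'price': [9000, 7000, 4000, 3500],
                    'odometer_km': [150000, 125000, 125000, 100000]})

# Mean price and mean mileage per brand in a single aggregation
brand_info = (toy.groupby('brand')
                 .agg(mean_price=('price', 'mean'),
                      mean_mileage=('odometer_km', 'mean')))
print(brand_info)
```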
It is interesting that BMW and Mercedes cars command higher prices than the other top brands even though their average mileage is also higher.
In this guided project we explored data scraped from eBay Kleinanzeigen, the classifieds section of the German eBay website, containing used car listings. We cleaned up the data and looked for any correlation between the average price and mileage of the top brands.
Data cleaning next steps:
- Identify categorical data that uses German words and translate it to English.
- Convert the dates to uniform numeric data.

Analysis next steps:
- Find the most common brand/model combinations.
- Split the odometer_km values into groups and see if average prices follow any pattern.
- Check how much cheaper cars with unrepaired damage are than their non-damaged counterparts.
Let's identify columns with data entered in German and translate it into English. We already know that the 'seller' and 'offer_type' columns contain German words, so let's start with them.
autos['seller'] = autos['seller'].str.replace('privat', 'private').str.replace('gewerblich', 'commercial')
autos['seller'].value_counts()
private 46680 commercial 1 Name: seller, dtype: int64
autos['offer_type'] = autos['offer_type'].str.replace('Angebot', 'offer')
autos['offer_type'].value_counts()
offer 46681 Name: offer_type, dtype: int64
autos['vehicle_type'].value_counts()
limousine 12598 kleinwagen 10585 kombi 8930 bus 4031 cabrio 3016 coupe 2462 suv 1965 andere 390 Name: vehicle_type, dtype: int64
autos['vehicle_type'] = autos['vehicle_type'].str.replace(
'kleinwagen', 'small car').str.replace('kombi', 'combi').str.replace('cabrio', 'convertible').str.replace('andere', 'other')
autos['vehicle_type'].value_counts()
limousine 12598 small car 10585 combi 8930 bus 4031 convertible 3016 coupe 2462 suv 1965 other 390 Name: vehicle_type, dtype: int64
autos['gearbox'].value_counts()
manuell 34715 automatik 9856 Name: gearbox, dtype: int64
autos['gearbox'] = autos['gearbox'].str.replace('manuell', 'manual').str.replace('automatik', 'automatic')
autos['gearbox'].value_counts()
manual 34715 automatic 9856 Name: gearbox, dtype: int64
autos['fuel_type'].value_counts()
benzin 28540 diesel 14032 lpg 649 cng 71 hybrid 37 elektro 19 andere 15 Name: fuel_type, dtype: int64
autos['fuel_type'] = autos['fuel_type'].str.replace('benzin', 'petrol').str.replace('elektro', 'electric').str.replace('andere', 'other')
autos['fuel_type'].value_counts()
petrol 28540 diesel 14032 lpg 649 cng 71 hybrid 37 electric 19 other 15 Name: fuel_type, dtype: int64
autos['unrepaired_damage'].value_counts()
nein 33834 ja 4540 Name: unrepaired_damage, dtype: int64
autos['unrepaired_damage'] = autos['unrepaired_damage'].str.replace('ja', 'yes').str.replace('nein', 'no')
autos['unrepaired_damage'].value_counts()
no 33834 yes 4540 Name: unrepaired_damage, dtype: int64
Now, let's convert the dates to be uniform numeric data, so "2016-03-21" becomes the integer 20160321. The columns containing dates are: 'date_crawled', 'ad_created' and 'last_seen'
autos['date_crawled'] = autos['date_crawled'].str[:10]
autos['date_crawled'] = autos['date_crawled'].str.replace('-', '')
autos['date_crawled'] = autos['date_crawled'].astype(int)
autos['date_crawled'].value_counts().head()
20160403 1810 20160320 1775 20160321 1742 20160312 1719 20160404 1709 Name: date_crawled, dtype: int64
autos['ad_created'] = autos['ad_created'].str[:10]
autos['ad_created'] = autos['ad_created'].str.replace('-', '')
autos['ad_created'] = autos['ad_created'].astype(int)
autos['ad_created'].value_counts().head()
20160403 1821 20160320 1777 20160321 1752 20160404 1725 20160312 1711 Name: ad_created, dtype: int64
autos['last_seen'] = autos['last_seen'].str[:10]
autos['last_seen'] = autos['last_seen'].str.replace('-', '')
autos['last_seen'] = autos['last_seen'].astype(int)
autos['last_seen'].value_counts().head()
20160406 10425 20160407 6197 20160405 5854 20160317 1311 20160403 1174 Name: last_seen, dtype: int64
Let's try to find the most common brand/model combination for the top brands:
brand_model = {}
for brand in top_brands:
by_brand = autos[autos['brand'] == brand]
top_model = by_brand['model'].describe()['top']
brand_model[brand] = top_model
brand_model
#Below we see the most common model per brand
{'audi': 'a4', 'bmw': '3er', 'ford': 'focus', 'mercedes_benz': 'c_klasse', 'opel': 'corsa', 'volkswagen': 'golf'}
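describe()['top'] returns the mode of an object column, which is what we relied on above; value_counts().idxmax() is an equivalent and perhaps more explicit way to get it. A sketch:

```python
import pandas as pd

# Toy model column: 'golf' appears most often
models = pd.Series(['golf', 'polo', 'golf', 'passat', 'golf'])
print(models.value_counts().idxmax())  # golf
```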
Now let's see if the average prices follow any pattern based on mileage.
ranges = autos['odometer_km'].value_counts(bins=5).index
ranges
IntervalIndex([(121000.0, 150000.0], (63000.0, 92000.0], (34000.0, 63000.0], (4854.999, 34000.0], (92000.0, 121000.0]] closed='right', dtype='interval[float64]')
ranges_tuple = ranges.to_tuples()
km_price = {}
for r in ranges_tuple:
by_km = autos[autos['odometer_km'].between(r[0], r[1])]
price = by_km['price'].mean()
km_price[r] = int(price)
km_price
{(4854.999, 34000.0): 15122, (34000.0, 63000.0): 13721, (63000.0, 92000.0): 9563, (92000.0, 121000.0): 8132, (121000.0, 150000.0): 4107}
In the above two cells we split the odometer_km values into five equal-width ranges and computed the mean price for each range.
We can conclude that the lower the mileage of the car, the higher the average price.
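The same binning can be done more directly with pd.cut plus groupby, which also sidesteps a subtle issue with the loop above: between() is inclusive on both ends, so a value sitting exactly on a bin edge would be counted in two ranges. A sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({'odometer_km': [10000, 40000, 80000, 110000, 150000],
                    'price': [15000, 13000, 9500, 8000, 4000]})

# Cut mileage into 5 equal-width bins and average the price per bin
bins = pd.cut(toy['odometer_km'], bins=5)
mean_price_per_bin = toy.groupby(bins)['price'].mean()
print(mean_price_per_bin)
```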
How much cheaper are cars with damage than their non-damaged counterparts?
damage = autos['unrepaired_damage'].dropna().unique()
damaged_price = {}
for d in damage:
    by_d = autos[autos['unrepaired_damage'] == d]
    price = by_d['price'].mean()
    damaged_price[d] = price
damaged_price
{'no': 7164.033103, 'yes': 2241.146035}
On average, cars with unrepaired damage sell for roughly a third of the price of undamaged cars (about 2,241 vs 7,164 USD).
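A quick check of the size of the gap, using the two means from the output above:

```python
undamaged_mean = 7164.033103
damaged_mean = 2241.146035

# Damaged cars sell, on average, for about 31% of the undamaged mean
ratio = damaged_mean / undamaged_mean
print(round(ratio, 2))  # 0.31
```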