In this project, we will work with [dataset]https://www.kaggle.com/orgesleka/used-cars-database/data) of used cars from eBay Kleinanzeigen, a section of the German eBay website.
The dataset can be found here
To clean the data and analyze the included used car listings.
The data dictionary provided with data is as follows:
dateCrawled
- When this ad was first crawled. All field-values are taken from this date.name
- Name of the car.seller
- Whether the seller is private or a dealer.offerType
- The type of listingprice
- The price on the ad to sell the car.abtest
- Whether the listing is included in an A/B test.vehicleType
- The vehicle Type.yearOfRegistration
- The year in which the car was first registered.gearbox
- The transmission type.powerPS
- The power of the car in PS.model
- The car model name.kilometer
- How many kilometers the car has driven.monthOfRegistration
- The month in which the car was first registered.fuelType
- What type of fuel the car uses.brand
- The brand of the car.notRepairedDamage
- If the car has a damage which is not yet repaired.dateCreated
- The date on which the eBay listing was created.nrOfPictures
- The number of pictures in the ad.postalCode
- The postal code for the location of the vehicle.lastSeenOnline
- When the crawler saw this ad last online.pandas
and NumPy
, read the csv
file¶Import the libraries and read autos.csv
file into autos
.
Display the infomation and first 5 rows in autos
df.
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv',encoding='Latin-1')
autos.info()
five_rows = autos.head(5)
five_rows
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
From the above cell, we can infer that there are 20 columns in the DataFrame and all are non-null (objects or int64). Dates are stored along with time. The column names use camelcase
instead of Python's preferred snakecase
. The dataframe needs to be cleaned before we proceed any further.
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
autos.rename({'yearOfRegistration':'registration_year'},axis=1, inplace= True)
autos.rename({'monthOfRegistration':'registration_month'},axis=1, inplace= True)
autos.rename({'notRepairedDamage':'unrepaired_damage'},axis=1, inplace= True)
autos.rename({'dateCreated':'ad_created'},axis=1, inplace= True)
autos.rename({'dateCrawled':'date_crawled'},axis=1, inplace= True)
autos.rename({'offerType':'offer_type'},axis=1, inplace= True)
autos.rename({'vehicleType':'vehicle_type'},axis=1, inplace= True)
autos.rename({'fuelType':'fuel_type'},axis=1, inplace= True)
autos.rename({'nrOfPictures':'no_of_pictures'},axis=1, inplace= True)
autos.rename({'postalCode':'postal_code'},axis=1, inplace= True)
autos.rename({'lastSeen':'last_seen'},axis=1, inplace= True)
autos.rename({'powerPS':'power_in_ps'},axis=1, inplace= True)
autos.columns
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_in_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | no_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
nrOfPictures
which is the number of pictures in the ad is incorrectly named and is not intuitive. It has been changed to no_of_pictures
which seems more accurate.
Rest of the columns have been changed from camelcase
to Python's preferred snakecase
# explore first 5 rows
autos.head(5)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_in_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | no_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_in_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | no_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-09 11:54:38 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
From the above result cell,the following can be made:
no_of_pictures
can be dropped as it has only one value and others being mostly zero or NaN
price
and odometer
have both string and integer values.This should be cleaned to hold numeric datagearbox
,offer_type
,name
, abtest
,fuel_type
,brand
,unrepaired_damage
columns have numeric values which does not make sense in this contextad_created
, last_seen
, date_crawled
have invalid values.seller
and offer_type
columns have same values, these can be dropped.As per the instructions, let's remove the non-numeric values from price
and odometer
and rename odometer
to odometer_km
# rename column odometer
autos.rename({'odometer':'odometer_km'},axis=1, inplace= True)
autos["price"] = (autos["price"].astype(str)
.str.replace(',','')
.str.replace('$','')
.astype(int))
autos["odometer_km"] = (autos["odometer_km"].astype(str)
.str.replace(',','')
.str.replace('km','')
.astype(int)
)
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_in_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | no_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
# remove no_of_pictures, seller and offer_type
autos = autos.drop(["no_of_pictures", "seller", "offer_type"], axis=1)
autos.head(5)
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_in_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
odometer_km
and price
columns¶# show the number of unique values
print("uniques in price column",autos["price"].unique().shape)
print("uniques in odometer column",autos["odometer_km"].unique().shape)
uniques in price column (2357,) uniques in odometer column (13,)
# view the basic statistical details
print("Statistical details - Price column \n",autos["price"].describe())
print("\nStatistical details - Odometer column \n",autos["odometer_km"].describe())
Statistical details - Price column count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64 Statistical details - Odometer column count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
print("First 30 unique price counts :\n",autos["price"].value_counts().head(30))
# count the unique values and sort highest to lowest
print("Unique price counts - sorted highest to lowest :\n",autos["price"].value_counts()
.sort_index(ascending=False)
.head(30))
First 30 unique price counts : 0 1421 500 781 1500 734 2500 643 1000 639 1200 639 600 531 800 498 3500 498 2000 460 999 434 750 433 900 420 650 419 850 410 700 395 4500 394 300 384 2200 382 950 379 1100 376 1300 371 3000 365 550 356 1800 355 5500 340 1250 335 350 335 1600 327 1999 322 Name: price, dtype: int64 Unique price counts - sorted highest to lowest : 99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 194000 1 190000 1 180000 1 175000 1 169999 1 169000 1 163991 1 163500 1 155000 1 151990 1 Name: price, dtype: int64
The above result cells show that there are 1421 listed cars with price value zero and a few number of cars with unrealistic prices which fall between 9,99,990 and 9,99,99,999. We remove the car listings that fall above the price value 350,000 and those with price value zero.
autos = autos[autos["price"].between(1,351000)]
print("after removing outliers :\n",autos["price"])
autos["price"].describe()
after removing outliers : 0 5000 1 8500 2 8990 3 4350 4 1350 5 7900 6 300 7 1990 8 250 9 590 10 999 11 350 12 5299 13 1350 14 3999 15 18900 16 350 17 5500 18 300 19 4150 20 3500 21 41500 22 25450 23 7999 24 48500 25 90 26 777 28 5250 29 4999 30 80 ... 49968 2100 49969 4500 49970 15800 49971 950 49972 3300 49973 6000 49975 9700 49976 5900 49977 5500 49978 900 49979 11000 49980 400 49981 2000 49982 1950 49983 600 49985 1000 49986 15900 49987 21990 49988 9550 49989 150 49990 17500 49991 500 49992 4800 49993 1650 49994 5000 49995 24900 49996 1980 49997 13200 49998 22900 49999 1250 Name: price, Length: 48565, dtype: int64
count 48565.000000 mean 5888.935591 std 9059.854754 min 1.000000 25% 1200.000000 50% 3000.000000 75% 7490.000000 max 350000.000000 Name: price, dtype: float64
As we already discussed in exploring and cleaning section abve, the columns ad_created
, last_seen
, date_crawled
have invalid values. These need to be converted into meaningful data.
Firstly, lets understand the format of the three columns in question.
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
These columns are all represented in full timestamp values. The first 10 characters represent the day (e.g: 2016-03-26
). To calculate the distribution values, we have to first extract the date from the timestamp value.
# extract dates, sort in ascending order
# and calculate distribution for date_crawled
(autos["date_crawled"].str[:10]
.value_counts(normalize=True, dropna=False)
.sort_index()
)
2016-03-05 0.025327 2016-03-06 0.014043 2016-03-07 0.036014 2016-03-08 0.033296 2016-03-09 0.033090 2016-03-10 0.032184 2016-03-11 0.032575 2016-03-12 0.036920 2016-03-13 0.015670 2016-03-14 0.036549 2016-03-15 0.034284 2016-03-16 0.029610 2016-03-17 0.031628 2016-03-18 0.012911 2016-03-19 0.034778 2016-03-20 0.037887 2016-03-21 0.037373 2016-03-22 0.032987 2016-03-23 0.032225 2016-03-24 0.029342 2016-03-25 0.031607 2016-03-26 0.032204 2016-03-27 0.031092 2016-03-28 0.034860 2016-03-29 0.034099 2016-03-30 0.033687 2016-03-31 0.031834 2016-04-01 0.033687 2016-04-02 0.035478 2016-04-03 0.038608 2016-04-04 0.036487 2016-04-05 0.013096 2016-04-06 0.003171 2016-04-07 0.001400 Name: date_crawled, dtype: float64
The above values show that most of the listings were crawled between March and April, 2016.
registration_year
¶# extract dates, sort in ascending order
# and calculate distribution for ad_created
(autos["ad_created"].str[:10]
.value_counts(normalize=True, dropna=False)
.sort_index()
)
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000041 2016-01-13 0.000021 2016-01-14 0.000021 2016-01-16 0.000021 2016-01-22 0.000021 2016-01-27 0.000062 2016-01-29 0.000021 2016-02-01 0.000021 2016-02-02 0.000041 2016-02-05 0.000041 2016-02-07 0.000021 2016-02-08 0.000021 2016-02-09 0.000021 2016-02-11 0.000021 2016-02-12 0.000041 2016-02-14 0.000041 2016-02-16 0.000021 2016-02-17 0.000021 2016-02-18 0.000041 2016-02-19 0.000062 2016-02-20 0.000041 2016-02-21 0.000062 ... 2016-03-09 0.033151 2016-03-10 0.031895 2016-03-11 0.032904 2016-03-12 0.036755 2016-03-13 0.017008 2016-03-14 0.035190 2016-03-15 0.034016 2016-03-16 0.030125 2016-03-17 0.031278 2016-03-18 0.013590 2016-03-19 0.033687 2016-03-20 0.037949 2016-03-21 0.037579 2016-03-22 0.032801 2016-03-23 0.032060 2016-03-24 0.029280 2016-03-25 0.031751 2016-03-26 0.032266 2016-03-27 0.030989 2016-03-28 0.034984 2016-03-29 0.034037 2016-03-30 0.033501 2016-03-31 0.031875 2016-04-01 0.033687 2016-04-02 0.035149 2016-04-03 0.038855 2016-04-04 0.036858 2016-04-05 0.011819 2016-04-06 0.003253 2016-04-07 0.001256 Name: ad_created, Length: 76, dtype: float64
# extract dates, sort in ascending order
# and calculate distribution for last_seen
(autos["last_seen"].str[:10]
.value_counts(normalize=True, dropna=False)
.sort_index()
)
2016-03-05 0.001071 2016-03-06 0.004324 2016-03-07 0.005395 2016-03-08 0.007413 2016-03-09 0.009595 2016-03-10 0.010666 2016-03-11 0.012375 2016-03-12 0.023783 2016-03-13 0.008895 2016-03-14 0.012602 2016-03-15 0.015876 2016-03-16 0.016452 2016-03-17 0.028086 2016-03-18 0.007351 2016-03-19 0.015834 2016-03-20 0.020653 2016-03-21 0.020632 2016-03-22 0.021373 2016-03-23 0.018532 2016-03-24 0.019767 2016-03-25 0.019211 2016-03-26 0.016802 2016-03-27 0.015649 2016-03-28 0.020859 2016-03-29 0.022341 2016-03-30 0.024771 2016-03-31 0.023783 2016-04-01 0.022794 2016-04-02 0.024915 2016-04-03 0.025203 2016-04-04 0.024483 2016-04-05 0.124761 2016-04-06 0.221806 2016-04-07 0.131947 Name: last_seen, dtype: float64
registration_year
¶print("First 20 values in registration year\n",autos["registration_year"].head(20))
autos["registration_year"].describe()
First 20 values in registration year 0 2004 1 1997 2 2009 3 2007 4 2003 5 2006 6 1995 7 1998 8 2000 9 1997 10 2017 11 2000 12 2010 13 1999 14 2007 15 1982 16 1999 17 1990 18 1995 19 2004 Name: registration_year, dtype: int64
count 48565.000000 mean 2004.755421 std 88.643887 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The distribution results shows that the column contains some odd values like the minimum being 1000
ie, before cars were invented and maximum 9999
which is many years into the future.
Any registration_year
values above 2016 is inaccurate because a car cannot be first registered after being listed on eBay.
autos["registration_year"].head(10)
0 2004 1 1997 2 2009 3 2007 4 2003 5 2006 6 1995 7 1998 8 2000 9 1997 Name: registration_year, dtype: int64
We have registration years in the 1900s. So let's remove all years that do not between fall 1900 and 2016 (2016 being the year of listing).
autos = autos[autos["registration_year"].between(1900, 2016)]
autos["registration_year"].describe()
count 46681.000000 mean 2002.910756 std 7.185103 min 1910.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 2016.000000 Name: registration_year, dtype: float64
Now, the minimum is 1910 and maximum is 2016 meaning, out of the cars listed on eBay, the minimum registration year is 1910 and the maximum year or the latest is 2016.
autos["registration_year"].value_counts(normalize=True).head(20)
2000 0.067608 2005 0.062895 1999 0.062060 2004 0.057904 2003 0.057818 2006 0.057197 2001 0.056468 2002 0.053255 1998 0.050620 2007 0.048778 2008 0.047450 2009 0.044665 1997 0.041794 2011 0.034768 2010 0.034040 1996 0.029412 2012 0.028063 1995 0.026285 2016 0.026135 2013 0.017202 Name: registration_year, dtype: float64
The data shows that most of the cars were first registered before the year 2013
# unique values in brand
autos["brand"].unique().tolist()
['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos', 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia']
autos["brand"].value_counts(normalize=True)
volkswagen 0.211264 bmw 0.110045 opel 0.107581 mercedes_benz 0.096463 audi 0.086566 ford 0.069900 renault 0.047150 peugeot 0.029841 fiat 0.025642 seat 0.018273 skoda 0.016409 nissan 0.015274 mazda 0.015188 smart 0.014160 citroen 0.014010 toyota 0.012703 hyundai 0.010025 sonstige_autos 0.009811 volvo 0.009147 mini 0.008762 mitsubishi 0.008226 honda 0.007840 kia 0.007069 alfa_romeo 0.006641 porsche 0.006127 suzuki 0.005934 chevrolet 0.005698 chrysler 0.003513 dacia 0.002635 daihatsu 0.002506 jeep 0.002271 subaru 0.002142 land_rover 0.002099 saab 0.001649 jaguar 0.001564 daewoo 0.001500 trabant 0.001392 rover 0.001328 lancia 0.001071 lada 0.000578 Name: brand, dtype: float64
Volkswagen, BMW and Opel seem to be the popular cars listed. According to the instructions provided, we are selecting the top 10 cars for further analysis.
popular_cars_price={}
brands = autos["brand"].value_counts(normalize=True).head(10)
popular_brands = brands.index
for brand in popular_brands:
top_brands = autos[autos["brand"] == brand]
mean_price = top_brands ["price"].mean()
popular_cars_price[brand] = int(mean_price)
print(popular_cars_price)
## sort highest to lowest price
sorted(popular_cars_price.items(), key=lambda x: x[1], reverse=True)
{'peugeot': 3094, 'opel': 2975, 'fiat': 2813, 'ford': 3749, 'audi': 9336, 'volkswagen': 5402, 'seat': 4397, 'bmw': 8332, 'mercedes_benz': 8628, 'renault': 2474}
[('audi', 9336), ('mercedes_benz', 8628), ('bmw', 8332), ('volkswagen', 5402), ('seat', 4397), ('ford', 3749), ('peugeot', 3094), ('opel', 2975), ('fiat', 2813), ('renault', 2474)]
From the above result cell, brand sonstige_autos
has the highest mean price of 12338. The second popular car with highest mean price is mini
with 10613.
The most expensive cars from the top 10 cars are
And the less expensive ones are
# store mean_prices into dataframe
brand_mean_prices = pd.Series(popular_cars_price).sort_values(ascending=False)
price_dataframe = pd.DataFrame(brand_mean_prices, columns=["mean_price"])
print("Mean price of popular car brands \n")
price_dataframe
Mean price of popular car brands
mean_price | |
---|---|
audi | 9336 |
mercedes_benz | 8628 |
bmw | 8332 |
volkswagen | 5402 |
seat | 4397 |
ford | 3749 |
peugeot | 3094 |
opel | 2975 |
fiat | 2813 |
renault | 2474 |
# calculating mean mileage
car_mileage ={}
for brand in popular_brands:
brands = autos[autos["brand"] == brand]
mean_mileage = brands["odometer_km"].mean()
car_mileage[brand] = int(mean_mileage)
# store mean_mileage into dataframe
brand_mean_mileage = pd.Series(car_mileage).sort_values(ascending=False)
mileage_dataframe = pd.DataFrame(brand_mean_mileage, columns=["mean_mileage"])
print("Mean mileage of popular car brands \n")
mileage_dataframe
Mean mileage of popular car brands
mean_mileage | |
---|---|
bmw | 132572 |
mercedes_benz | 130788 |
opel | 129310 |
audi | 129157 |
volkswagen | 128707 |
renault | 128071 |
peugeot | 127153 |
ford | 124266 |
seat | 121131 |
fiat | 117121 |
## combine the dataframs and sort results
combined_cars_df = pd.concat([price_dataframe, mileage_dataframe], axis=1)
sorted_cars_df =combined_cars_df.sort_values(["mean_mileage"], ascending=False)
sorted_cars_df
mean_price | mean_mileage | |
---|---|---|
bmw | 8332 | 132572 |
mercedes_benz | 8628 | 130788 |
opel | 2975 | 129310 |
audi | 9336 | 129157 |
volkswagen | 5402 | 128707 |
renault | 2474 | 128071 |
peugeot | 3094 | 127153 |
ford | 3749 | 124266 |
seat | 4397 | 121131 |
fiat | 2813 | 117121 |
From the dataframe above, BMW
has the most mileage and is most expensive. We can say that despite having a higher mileage, some cars are still expensive( E.g. BMW
and Mercedes Benz
).
We have already established that there are 3 columns with datetime
type. These columns being the
date_crawled
ad_created
last_seen
These columns are all represented in full timestamp values. The first 10 characters represent the day (e.g: 2016-03-26). Let's change the type of these columns to numeric data, so that they are uniform across the dataset.
date_columns=["date_crawled","ad_created","last_seen"]
for i in date_columns:
col = autos[i].str[:10].str.replace('-','').astype(int)
autos[i] = col
autos.head(5)
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_in_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20160326 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 20160326 | 79588 | 20160406 |
1 | 20160404 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 20160404 | 71034 | 20160406 |
2 | 20160326 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 20160326 | 35394 | 20160406 |
3 | 20160312 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 20160312 | 33729 | 20160315 |
4 | 20160401 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 20160401 | 39218 | 20160401 |
autos["odometer_km"].unique()
array([150000, 70000, 50000, 80000, 10000, 30000, 125000, 90000, 20000, 60000, 5000, 40000, 100000])
To split the data into groups, we use the groupby
method. And then find the mean of the car prices on the grouped data..
# split the odometer_km column into groups
df_by_km = autos.groupby("odometer_km")
df_by_km
<pandas.core.groupby.DataFrameGroupBy object at 0x7f9445672a90>
# calculate mean and sort results
df_by_km["price"].mean().sort_values(ascending=False)
odometer_km 10000 20550.867220 20000 18448.477089 30000 16608.836842 40000 15499.568381 50000 13812.173212 60000 12385.004433 70000 10927.182814 80000 9721.947636 5000 8873.515924 90000 8465.025105 100000 8132.697279 125000 6214.022030 150000 3767.927107 Name: price, dtype: float64
Cars with mileage less than 10000 and greater than 70000 were being listed with lower mean price. Cars with mileage greater than 100000 have lesser value and cars with mileage that fall between 10000 and 30000 have more value.
df_by_damage = autos.groupby("unrepaired_damage")
df_by_damage["price"].describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
unrepaired_damage | ||||||||
ja | 4540.0 | 2241.146035 | 3563.276478 | 1.0 | 500.0 | 1000.0 | 2500.0 | 44200.0 |
nein | 33834.0 | 7164.033103 | 10078.475478 | 1.0 | 1800.0 | 4150.0 | 9000.0 | 350000.0 |
df_by_damage["price"].mean()
unrepaired_damage ja 2241.146035 nein 7164.033103 Name: price, dtype: float64
The cars with unrepaired damage is comparitively highly cheaper( average price approx. 5000 dollars) than the cars with damages repaired.
To find the common brand/model combinations, first group the dataset by model
and brand
and find the count
of the cars listed on Ebay by the listing date.
# group the dataset by model and brand
df_brand = autos.groupby(["brand","model"])
# find the count and sort desscending
brand_model = df_brand["date_crawled"].count().sort_values(ascending=False)
brand_model
brand model volkswagen golf 3707 bmw 3er 2615 volkswagen polo 1609 opel corsa 1592 volkswagen passat 1349 opel astra 1348 audi a4 1231 mercedes_benz c_klasse 1136 bmw 5er 1132 mercedes_benz e_klasse 958 audi a3 825 a6 797 ford focus 762 fiesta 722 volkswagen transporter 674 renault twingo 615 peugeot 2_reihe 600 smart fortwo 550 opel vectra 544 mercedes_benz a_klasse 539 bmw 1er 521 ford mondeo 479 renault clio 473 mercedes_benz andere 439 volkswagen touran 433 fiat punto 415 opel zafira 394 ford ka 349 renault megane 335 seat ibiza 328 ... land_rover range_rover 9 daewoo andere 9 renault r19 9 audi 90 8 daewoo nubira 8 trabant andere 8 smart andere 7 lada andere 7 chrysler crossfire 6 seat exeo 6 volkswagen amarok 6 dacia lodgy 5 saab 9000 5 lancia delta 5 land_rover range_rover_evoque 5 fiat croma 4 daihatsu materia 4 land_rover andere 4 alfa_romeo 145 4 daihatsu charade 3 lada samara 3 volvo v60 3 dacia andere 2 rover freelander 2 lancia kappa 2 bmw i3 1 rover rangerover 1 ford b_max 1 audi 200 1 rover discovery 1 Name: date_crawled, Length: 290, dtype: int64
*Volkswagen Golf* is the most common brand/model combination with total 3707 listings, followed by *BMW 3ER* and *Volkswagen Polo*
In this project we applied a variety of pandas methods to explore and understand a dataset on car listings on Ebay.