Guided Project: Exploring eBay Sales Data - Data Cleaning¶

The aim of this guided project is to clean and analyse used car listings. This notebook only contains the data cleaning process and some initial explorations, I'm going to analyze the data in a different notebook.

The original dataset was scraped from eBay Kleinanzeigen, a classified section of German eBay and is available here. But Dataquest prepared a modified version of it for study purposes (making it less clean) and that is what I'm going to use.

The dataset has 50,000 data points and the data dictionary is shown below:

dateCrawled - When this ad was first crawled. All field-values are taken from this date.
name - Name of the car.
seller - Whether the seller is private or a dealer.
offerType - The type of listing
price - The price on the ad to sell the car.
abtest - Whether the listing is included in an A/B test.
vehicleType - The vehicle Type.
yearOfRegistration - The year in which the car was first registered.
gearbox - The transmission type.
powerPS - The power of the car in PS.
model - The car model name.
kilometer - How many kilometers the car has driven.
monthOfRegistration - The month in which the car was first registered.
fuelType - What type of fuel the car uses.
brand - The brand of the car.
notRepairedDamage - If the car has a damage which is not yet repaired.
dateCreated - The date on which the eBay listing was created.
nrOfPictures - The number of pictures in the ad.
postalCode - The postal code for the location of the vehicle.
lastSeenOnline - When the crawler saw this ad last online.

Importing Libraries and Initial Exploration¶

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt

autos = pd.read_csv("autos.csv", encoding="Latin-1")
autos

Out[1]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	nrOfPictures	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	0	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	0	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	0	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	0	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	0	39218	2016-04-01 14:38:50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
49995	2016-03-27 14:38:19	Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon	privat	Angebot	$24,900	control	limousine	2011	automatik	239	q5	100,000km	1	diesel	audi	nein	2016-03-27 00:00:00	0	82131	2016-04-01 13:47:40
49996	2016-03-28 10:50:25	Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...	privat	Angebot	$1,980	control	cabrio	1996	manuell	75	astra	150,000km	5	benzin	opel	nein	2016-03-28 00:00:00	0	44807	2016-04-02 14:18:02
49997	2016-04-02 14:44:48	Fiat_500_C_1.2_Dualogic_Lounge	privat	Angebot	$13,200	test	cabrio	2014	automatik	69	500	5,000km	11	benzin	fiat	nein	2016-04-02 00:00:00	0	73430	2016-04-04 11:47:27
49998	2016-03-08 19:25:42	Audi_A3_2.0_TDI_Sportback_Ambition	privat	Angebot	$22,900	control	kombi	2013	manuell	150	a3	40,000km	11	diesel	audi	nein	2016-03-08 00:00:00	0	35683	2016-04-05 16:45:07
49999	2016-03-14 00:42:12	Opel_Vectra_1.6_16V	privat	Angebot	$1,250	control	limousine	1996	manuell	101	vectra	150,000km	1	benzin	opel	nein	2016-03-13 00:00:00	0	45897	2016-04-06 21:18:48

50000 rows × 20 columns

In [2]:

autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

In [3]:

autos.describe(include="all")

Out[3]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	nrOfPictures	postalCode	lastSeen
count	50000	50000	50000	50000	50000	50000	44905	50000.000000	47320	50000.000000	47242	50000	50000.000000	45518	50000	40171	50000	50000.0	50000.000000	50000
unique	48213	38754	2	2	2357	2	8	NaN	2	NaN	245	13	NaN	7	40	2	76	NaN	NaN	39481
top	2016-04-02 15:49:30	Ford_Fiesta	privat	Angebot	$0	test	limousine	NaN	manuell	NaN	golf	150,000km	NaN	benzin	volkswagen	nein	2016-04-03 00:00:00	NaN	NaN	2016-04-07 06:17:27
freq	3	78	49999	49999	1421	25756	12859	NaN	36993	NaN	4024	32424	NaN	30107	10687	35232	1946	NaN	NaN	8
mean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2005.073280	NaN	116.355920	NaN	NaN	5.723360	NaN	NaN	NaN	NaN	0.0	50813.627300	NaN
std	NaN	NaN	NaN	NaN	NaN	NaN	NaN	105.712813	NaN	209.216627	NaN	NaN	3.711984	NaN	NaN	NaN	NaN	0.0	25779.747957	NaN
min	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1000.000000	NaN	0.000000	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	0.0	1067.000000	NaN
25%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1999.000000	NaN	70.000000	NaN	NaN	3.000000	NaN	NaN	NaN	NaN	0.0	30451.000000	NaN
50%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2003.000000	NaN	105.000000	NaN	NaN	6.000000	NaN	NaN	NaN	NaN	0.0	49577.000000	NaN
75%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2008.000000	NaN	150.000000	NaN	NaN	9.000000	NaN	NaN	NaN	NaN	0.0	71540.000000	NaN
max	NaN	NaN	NaN	NaN	NaN	NaN	NaN	9999.000000	NaN	17700.000000	NaN	NaN	12.000000	NaN	NaN	NaN	NaN	0.0	99998.000000	NaN

Initial observations¶

There are 50000 rows and 20 columns in the dataset. Most columns are strings.

Column names are in CamelCase, so a bit difficult to read.
Some columns have only a few unique values and might be unnecessary for the analysis

seller
offerType
abtest
gearbox
nrOfPictures
notRepairedDamage

These object columns that should be converted to numerical columns:

price
odometer

The following columns have missing values, but none of them has more than 20% so we can replace them with a different value:

vehicleType
gearbox
model
fuelType
notRepairedDamage

Some columns have values in German which should be translated to English.

vehicleType
fuelType
notRepairedDamage

Date colums that should be converted to datetime format:

dateCrawled
dateCreated
lastSeen

Data Cleaning and Initial Exploration¶

Cleaning Column Names¶

Changing column names from CamelCase to snake_case for readability and simplifying the column name where possible. Also, it's not obvious what dateCreated refers to, so changing it to ad_created.

In [4]:

autos.columns

Out[4]:

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [5]:

new_labels = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']
autos.columns = new_labels
autos.head()

Out[5]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

Dropping columns¶

In [6]:

print(autos["seller"].value_counts(), "\n")
print(autos["offer_type"].value_counts(), "\n")
print(autos["abtest"].value_counts(), "\n")
print(autos["gearbox"].value_counts(), "\n")
print(autos["unrepaired_damage"].value_counts(), "\n")
print(autos["nr_of_pictures"].value_counts())

privat        49999
gewerblich        1
Name: seller, dtype: int64 

Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64 

test       25756
control    24244
Name: abtest, dtype: int64 

manuell      36993
automatik    10327
Name: gearbox, dtype: int64 

nein    35232
ja       4939
Name: unrepaired_damage, dtype: int64 

0    50000
Name: nr_of_pictures, dtype: int64

The gearbox and unrepaired_damage column seems to be important, but I don't need the abtest and postal_code columns for this analysis and seller, offer_type and nr_of_picturescolumns don't contain any meaningful information, so they can be safely dropped:

In [7]:

autos.drop(["abtest", "seller", "offer_type", "nr_of_pictures", "postal_code"], axis=1, inplace=True)
autos

Out[7]:

	date_crawled	name	price	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	$5,000	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	$8,500	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	$8,990	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	$4,350	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	$1,350	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	2016-04-01 14:38:50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
49995	2016-03-27 14:38:19	Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon	$24,900	limousine	2011	automatik	239	q5	100,000km	1	diesel	audi	nein	2016-03-27 00:00:00	2016-04-01 13:47:40
49996	2016-03-28 10:50:25	Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...	$1,980	cabrio	1996	manuell	75	astra	150,000km	5	benzin	opel	nein	2016-03-28 00:00:00	2016-04-02 14:18:02
49997	2016-04-02 14:44:48	Fiat_500_C_1.2_Dualogic_Lounge	$13,200	cabrio	2014	automatik	69	500	5,000km	11	benzin	fiat	nein	2016-04-02 00:00:00	2016-04-04 11:47:27
49998	2016-03-08 19:25:42	Audi_A3_2.0_TDI_Sportback_Ambition	$22,900	kombi	2013	manuell	150	a3	40,000km	11	diesel	audi	nein	2016-03-08 00:00:00	2016-04-05 16:45:07
49999	2016-03-14 00:42:12	Opel_Vectra_1.6_16V	$1,250	limousine	1996	manuell	101	vectra	150,000km	1	benzin	opel	nein	2016-03-13 00:00:00	2016-04-06 21:18:48

50000 rows × 15 columns

Converting `price` and `odometer` columns to numeric values¶

In [8]:

autos["price"].unique()

Out[8]:

array(['$5,000', '$8,500', '$8,990', ..., '$385', '$22,200', '$16,995'],
      dtype=object)

In [9]:

autos["odometer"].unique()

Out[9]:

array(['150,000km', '70,000km', '50,000km', '80,000km', '10,000km',
       '30,000km', '125,000km', '90,000km', '20,000km', '60,000km',
       '5,000km', '100,000km', '40,000km'], dtype=object)

In [10]:

autos["price"] = (autos["price"]
                  .str.replace("$", "")
                  .str.replace(",", "")
                  .astype(int)
                  )
autos["price"].head()

Out[10]:

0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int32

In [11]:

autos["odometer"] = (autos["odometer"]
                    .str.replace("km", "")
                     .str.replace(",", "")
                     .astype(int)
                    )
autos.rename({"odometer" : "odometer_km"}, axis=1, inplace=True)
autos["odometer_km"].head()

Out[11]:

0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer_km, dtype: int32

Exploring the Odometer and Price Columns¶

Odometer column:¶

In [12]:

autos["odometer_km"].unique()

Out[12]:

array([150000,  70000,  50000,  80000,  10000,  30000, 125000,  90000,
        20000,  60000,   5000, 100000,  40000])

In [13]:

autos["odometer_km"].describe()

Out[13]:

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [14]:

autos["odometer_km"].value_counts().sort_index()

Out[14]:

5000        967
10000       264
20000       784
30000       789
40000       819
50000      1027
60000      1164
70000      1230
80000      1436
90000      1757
100000     2169
125000     5170
150000    32424
Name: odometer_km, dtype: int64

In [15]:

plt.figure(figsize=(10,5))
autos["odometer_km"].value_counts().sort_index().plot.bar(title = "Odometer distribution")
plt.xlabel("kilometers")
plt.ylabel("frequency")
plt.show()

The values in this field are rounded, which might indicate that sellers had to choose from pre-set options for this field. Additionally, the distribution of kilometers is left skewed, there are more high mileage than low mileage vehicles.

Price column:¶

In [16]:

autos["price"].unique()

Out[16]:

array([ 5000,  8500,  8990, ...,   385, 22200, 16995])

In [17]:

autos["price"].unique().shape

Out[17]:

(2357,)

In [18]:

autos["price"].describe()

Out[18]:

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [19]:

autos["price"].value_counts().sort_index(ascending=False).head(20)

Out[19]:

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64

In [20]:

autos["price"].value_counts().sort_index(ascending=True).head(20)

Out[20]:

0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price, dtype: int64

In [21]:

plt.figure(figsize=(10,5))
autos["price"].value_counts().sort_index().plot.hist(title = "Price distribution")
plt.xlabel("price")
plt.show()

There are 1,421 cars listed with 0 dollars and 156 cars listed with 1 dollar. Considering the fact that eBay is an auction-like site (it's like an auction, but transacctions are not binding, ie both buyer or seller can walk away from the sale without penalty), it's possible that some bids start with 0 dollars.

The maximum price is one hundred million dollars, which seems a lot. Prices increase steadily till 350,000 dollars, and then jump to less realistic numbers (after a bit of browsing, the most expensive car sold on eBay was around 350,000 dollars as well) so I'm removing all listings higher than this price.

In [22]:

autos = autos[autos["price"].between(0, 351000)]
autos["price"].describe()

Out[22]:

count     49986.000000
mean       5721.525167
std        8983.617820
min           0.000000
25%        1100.000000
50%        2950.000000
75%        7200.000000
max      350000.000000
Name: price, dtype: float64

Exploring the date columns¶

There are 5 columns representing date values. Some of these columns were created by the crawler, some came from the website itself:

date_crawled: added by the crawler
last_seen: added by the crawler
ad_created: from the website
registration_month: from the website
registration_year: from the website

The date_crawled, last_seen, and ad_created columns are identified as string values by pandas, the registration_month and registration_year columns are integers.

In [23]:

autos[['date_crawled','ad_created','last_seen', 'registration_month', 'registration_year']][0:5]

Out[23]:

	date_crawled	ad_created	last_seen	registration_month	registration_year
0	2016-03-26 17:47:46	2016-03-26 00:00:00	2016-04-06 06:45:54	3	2004
1	2016-04-04 13:38:56	2016-04-04 00:00:00	2016-04-06 14:45:08	6	1997
2	2016-03-26 18:57:24	2016-03-26 00:00:00	2016-04-06 20:15:37	7	2009
3	2016-03-12 16:58:10	2016-03-12 00:00:00	2016-03-15 03:16:28	6	2007
4	2016-04-01 14:38:50	2016-04-01 00:00:00	2016-04-01 14:38:50	7	2003

In [24]:

autos["date_crawled"].str[:10].value_counts(normalize=True, dropna=False).sort_index()

Out[24]:

2016-03-05    0.025387
2016-03-06    0.013944
2016-03-07    0.035970
2016-03-08    0.033269
2016-03-09    0.033209
2016-03-10    0.032129
2016-03-11    0.032489
2016-03-12    0.036770
2016-03-13    0.015564
2016-03-14    0.036630
2016-03-15    0.033990
2016-03-16    0.029508
2016-03-17    0.031509
2016-03-18    0.013064
2016-03-19    0.034910
2016-03-20    0.037831
2016-03-21    0.037490
2016-03-22    0.032909
2016-03-23    0.032389
2016-03-24    0.029108
2016-03-25    0.031749
2016-03-26    0.032489
2016-03-27    0.031049
2016-03-28    0.034850
2016-03-29    0.034150
2016-03-30    0.033629
2016-03-31    0.031909
2016-04-01    0.033809
2016-04-02    0.035410
2016-04-03    0.038691
2016-04-04    0.036490
2016-04-05    0.013104
2016-04-06    0.003181
2016-04-07    0.001420
Name: date_crawled, dtype: float64

In [25]:

autos["date_crawled"].str[:10].value_counts(normalize=True, dropna=False).sort_values()

Out[25]:

2016-04-07    0.001420
2016-04-06    0.003181
2016-03-18    0.013064
2016-04-05    0.013104
2016-03-06    0.013944
2016-03-13    0.015564
2016-03-05    0.025387
2016-03-24    0.029108
2016-03-16    0.029508
2016-03-27    0.031049
2016-03-17    0.031509
2016-03-25    0.031749
2016-03-31    0.031909
2016-03-10    0.032129
2016-03-23    0.032389
2016-03-11    0.032489
2016-03-26    0.032489
2016-03-22    0.032909
2016-03-09    0.033209
2016-03-08    0.033269
2016-03-30    0.033629
2016-04-01    0.033809
2016-03-15    0.033990
2016-03-29    0.034150
2016-03-28    0.034850
2016-03-19    0.034910
2016-04-02    0.035410
2016-03-07    0.035970
2016-04-04    0.036490
2016-03-14    0.036630
2016-03-12    0.036770
2016-03-21    0.037490
2016-03-20    0.037831
2016-04-03    0.038691
Name: date_crawled, dtype: float64

In [26]:

plt.figure(figsize=(10,6))
autos["date_crawled"].str[:10].value_counts().sort_index().plot.barh(title="Distribution of crawling date")
plt.xlabel("frequency")
plt.ylabel("date")
plt.show()

The website was crawled over a period of month from March 5 to April 7, 2016. The distribution of listings crawled on each day is roughly uniform.

In [27]:

autos["last_seen"].str[:10].value_counts(normalize=True, dropna=False).sort_index()

Out[27]:

2016-03-05    0.001080
2016-03-06    0.004421
2016-03-07    0.005362
2016-03-08    0.007582
2016-03-09    0.009843
2016-03-10    0.010763
2016-03-11    0.012524
2016-03-12    0.023807
2016-03-13    0.008983
2016-03-14    0.012804
2016-03-15    0.015884
2016-03-16    0.016445
2016-03-17    0.027928
2016-03-18    0.007422
2016-03-19    0.015744
2016-03-20    0.020706
2016-03-21    0.020726
2016-03-22    0.021586
2016-03-23    0.018585
2016-03-24    0.019565
2016-03-25    0.019205
2016-03-26    0.016965
2016-03-27    0.016024
2016-03-28    0.020846
2016-03-29    0.022326
2016-03-30    0.024847
2016-03-31    0.023827
2016-04-01    0.023106
2016-04-02    0.024887
2016-04-03    0.025367
2016-04-04    0.024627
2016-04-05    0.124275
2016-04-06    0.220982
2016-04-07    0.130957
Name: last_seen, dtype: float64

In [28]:

plt.figure(figsize=(10,5))
autos["last_seen"].str[:10].value_counts().sort_index().plot.bar(title="Distribution of date last seen by the crawler")
plt.xlabel("date")
plt.ylabel("frequency")
plt.show()

The last_seen column shows the last time the crawler saw the data, which indicates the date each vehicle was probably sold. The distribution of the last 3 days is very high, which is because of the end of the crawling period, rather than a sudden spike in sales.

In [29]:

print(autos["ad_created"].str[:10].unique().shape)
autos["ad_created"].str[:10].value_counts(normalize=True, dropna=False).sort_index()

(76,)

Out[29]:

2015-06-11    0.000020
2015-08-10    0.000020
2015-09-09    0.000020
2015-11-10    0.000020
2015-12-05    0.000020
                ...   
2016-04-03    0.038931
2016-04-04    0.036850
2016-04-05    0.011843
2016-04-06    0.003261
2016-04-07    0.001280
Name: ad_created, Length: 76, dtype: float64

In [30]:

plt.figure(figsize=(10,10))
autos["ad_created"].str[:10].value_counts().sort_index().plot.barh(title="Distribution of ad created date")
plt.xlabel("frequency")
plt.ylabel("date")
plt.show()

There is a large variety of ad created dates. Most fall within 1-2 months of the listing date, but a few are quite old, with the oldest at around 9 months.

In [31]:

autos["registration_year"].describe()

Out[31]:

count    49986.000000
mean      2005.075721
std        105.727161
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The year that the car was first registered probably indicates the age of the car. The minimum value is 1000, long before cars were invented and the maximum is 9999, many years into the future.

Dealing with Incorrect Registration Year Data¶

The first car was invented in 1885, so any registration year before 1885 is inaccurate. Also, a car can't be registered after beeing seen in the listing, so a registration year above 2016 (and some in 2016) is inaccurate as well.

In [32]:

autos["registration_year"].value_counts(normalize=True).sort_index().head(20)

Out[32]:

1000    0.00002
1001    0.00002
1111    0.00002
1500    0.00002
1800    0.00004
1910    0.00018
1927    0.00002
1929    0.00002
1931    0.00002
1934    0.00004
1937    0.00008
1938    0.00002
1939    0.00002
1941    0.00004
1943    0.00002
1948    0.00002
1950    0.00006
1951    0.00004
1952    0.00002
1953    0.00002
Name: registration_year, dtype: float64

In [33]:

autos["registration_year"].value_counts(normalize=True).sort_index().tail(20)

Out[33]:

2011    0.032689
2012    0.026467
2013    0.016125
2014    0.013304
2015    0.007982
2016    0.026327
2017    0.029048
2018    0.009823
2019    0.000060
2800    0.000020
4100    0.000020
4500    0.000020
4800    0.000020
5000    0.000080
5911    0.000020
6200    0.000020
8888    0.000020
9000    0.000040
9996    0.000020
9999    0.000080
Name: registration_year, dtype: float64

In [34]:

wrong_date = len(autos[autos["registration_year"] >= 2016]) + len(autos[autos["registration_year"] < 1885])
wrong_date_percentage = wrong_date / len(autos) * 100
print(wrong_date_percentage)

6.573840675389109

About 6.6% of the data has wrong dates. I'll drop all rows with registration date before 1885 and after 2019. But most of the inaccurate registrations are in 2016-2018, so I'll keep these for now for further examination.

In [35]:

index_to_drop = autos[autos["registration_year"] > 2018].index
index_to_drop2 = autos[autos["registration_year"] < 1885].index

In [36]:

autos.drop(index_to_drop, inplace=True, axis=0)
autos.drop(index_to_drop2, inplace=True, axis=0)

C:\Users\eveli\anaconda3\lib\site-packages\pandas\core\frame.py:4163: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(

In [37]:

autos["registration_year"].value_counts().sort_index()

Out[37]:

1910       9
1927       1
1929       1
1931       1
1934       2
        ... 
2014     665
2015     399
2016    1316
2017    1452
2018     491
Name: registration_year, Length: 80, dtype: int64

Translating values to English¶

Using Google translate to some of the columns from German to English. Also, these columns, along with the model column have missing values which I'm going to fill with the string "unknown".

Vehicle type¶

In [38]:

autos["vehicle_type"].value_counts()

Out[38]:

limousine     12854
kleinwagen    10822
kombi          9125
bus            4092
cabrio         3061
coupe          2534
suv            1986
andere          420
Name: vehicle_type, dtype: int64

In [39]:

mapping_dict = {
    "limousine" : "sedan",
    "kleinwagen" : "small car",
    "kombi" : "station wagon",
    "bus" : "bus",
    "cabrio" : "convertible",
    "coupe" : "coupe",
    "suv" : "suv",
    "andere" : "other"
}
autos["vehicle_type"] = autos["vehicle_type"].replace(mapping_dict)
autos["vehicle_type"].value_counts()

<ipython-input-39-2fea61329317>:11: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  autos["vehicle_type"] = autos["vehicle_type"].replace(mapping_dict)

Out[39]:

sedan            12854
small car        10822
station wagon     9125
bus               4092
convertible       3061
coupe             2534
suv               1986
other              420
Name: vehicle_type, dtype: int64

In [40]:

autos["gearbox"].value_counts()

Out[40]:

manuell      36978
automatik    10321
Name: gearbox, dtype: int64

In [41]:

mapping_dict = {
    "manuell" : "manual",
    "automatik" : "automatic"
}
autos["gearbox"] = autos["gearbox"].replace(mapping_dict)
autos["gearbox"].value_counts()

<ipython-input-41-a9c465c124d7>:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  autos["gearbox"] = autos["gearbox"].replace(mapping_dict)

Out[41]:

manual       36978
automatic    10321
Name: gearbox, dtype: int64

In [42]:

autos["fuel_type"].value_counts()

Out[42]:

benzin     30092
diesel     14564
lpg          691
cng           75
hybrid        37
andere        22
elektro       19
Name: fuel_type, dtype: int64

In [43]:

mapping_dict = {
    "benzin" : "gasoline",
    "diesel" : "diesel",
    "lpg" : "lpg",
    "cng" : "cng",
    "hybrid" : "hybrid",
    "andere" : "other",
    "elektro" : "electric",
}
autos["fuel_type"] = autos["fuel_type"].replace(mapping_dict)
autos["fuel_type"].value_counts()

<ipython-input-43-b6b4d0fdc623>:10: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  autos["fuel_type"] = autos["fuel_type"].replace(mapping_dict)

Out[43]:

gasoline    30092
diesel      14564
lpg           691
cng            75
hybrid         37
other          22
electric       19
Name: fuel_type, dtype: int64

In [44]:

autos["unrepaired_damage"].value_counts()

Out[44]:

nein    35218
ja       4938
Name: unrepaired_damage, dtype: int64

In [45]:

mapping_dict = {
    "nein" : "no",
    "ja" : "yes"
}
autos["unrepaired_damage"] = autos["unrepaired_damage"].replace(mapping_dict)
autos["unrepaired_damage"].value_counts()

<ipython-input-45-e1feb9caa2fa>:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  autos["unrepaired_damage"] = autos["unrepaired_damage"].replace(mapping_dict)

Out[45]:

no     35218
yes     4938
Name: unrepaired_damage, dtype: int64

In [46]:

autos["name"].value_counts()

Out[46]:

Ford_Fiesta                                    78
BMW_316i                                       75
Volkswagen_Golf_1.4                            75
Volkswagen_Polo                                72
BMW_318i                                       72
                                               ..
Audi_80_Avant_2.0l_AHK                          1
Audi_A4_Avant_B5                                1
Renault_Megane_Grandtour_1.6_Fairway            1
_Zu_Verkaufen__Opel_Corsa_1.0_12V_World_Cup     1
Volkswagen_Sharan_2.0_Family                    1
Name: name, Length: 38718, dtype: int64

In [47]:

autos[["model","vehicle_type", "gearbox", "fuel_type", "unrepaired_damage"]] = autos[["model","vehicle_type", "gearbox", "fuel_type", "unrepaired_damage"]].fillna("unknown")

C:\Users\eveli\anaconda3\lib\site-packages\pandas\core\frame.py:3065: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]

In [48]:

autos.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49959 entries, 0 to 49999
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   date_crawled        49959 non-null  object
 1   name                49959 non-null  object
 2   price               49959 non-null  int32 
 3   vehicle_type        49959 non-null  object
 4   registration_year   49959 non-null  int64 
 5   gearbox             49959 non-null  object
 6   power_ps            49959 non-null  int64 
 7   model               49959 non-null  object
 8   odometer_km         49959 non-null  int32 
 9   registration_month  49959 non-null  int64 
 10  fuel_type           49959 non-null  object
 11  brand               49959 non-null  object
 12  unrepaired_damage   49959 non-null  object
 13  ad_created          49959 non-null  object
 14  last_seen           49959 non-null  object
dtypes: int32(2), int64(3), object(10)
memory usage: 5.7+ MB

In [49]:

autos

Out[49]:

	date_crawled	name	price	vehicle_type	registration_year	gearbox	power_ps	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	5000	bus	2004	manual	158	andere	150000	3	lpg	peugeot	no	2016-03-26 00:00:00	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	8500	sedan	1997	automatic	286	7er	150000	6	gasoline	bmw	no	2016-04-04 00:00:00	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	8990	sedan	2009	manual	102	golf	70000	7	gasoline	volkswagen	no	2016-03-26 00:00:00	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	4350	small car	2007	automatic	71	fortwo	70000	6	gasoline	smart	no	2016-03-12 00:00:00	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	1350	station wagon	2003	manual	0	focus	150000	7	gasoline	ford	no	2016-04-01 00:00:00	2016-04-01 14:38:50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
49995	2016-03-27 14:38:19	Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon	24900	sedan	2011	automatic	239	q5	100000	1	diesel	audi	no	2016-03-27 00:00:00	2016-04-01 13:47:40
49996	2016-03-28 10:50:25	Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...	1980	convertible	1996	manual	75	astra	150000	5	gasoline	opel	no	2016-03-28 00:00:00	2016-04-02 14:18:02
49997	2016-04-02 14:44:48	Fiat_500_C_1.2_Dualogic_Lounge	13200	convertible	2014	automatic	69	500	5000	11	gasoline	fiat	no	2016-04-02 00:00:00	2016-04-04 11:47:27
49998	2016-03-08 19:25:42	Audi_A3_2.0_TDI_Sportback_Ambition	22900	station wagon	2013	manual	150	a3	40000	11	diesel	audi	no	2016-03-08 00:00:00	2016-04-05 16:45:07
49999	2016-03-14 00:42:12	Opel_Vectra_1.6_16V	1250	sedan	1996	manual	101	vectra	150000	1	gasoline	opel	no	2016-03-13 00:00:00	2016-04-06 21:18:48

49959 rows × 15 columns

In [50]:

autos.to_csv("ebaycars_cleaned.csv")