Exploring eBay Car Sales Data

In this guided project, we will be cleaning and exploring car sales data from eBay Kleinanzeigen, a classifieds section of the German eBay website. The goal of this project is to demonstrate our ability to use and understand the functionality of pandas by working with Series and DataFrame objects, and to recognize issues and anomalies in our dataset.

The original dataset can be found on Kaggle. For this project, the data has been trimmed to a smaller size (50,000 data points) and dirtied to better suit the purpose of the project.

Overview of Columns

The data dictionary provided with the data is as follows:

  • dateCrawled - When this ad was first crawled. All field-values are taken from this date.
  • name - Name of the car.
  • seller - Whether the seller is private or a dealer.
  • offerType - The type of listing.
  • price - The price on the ad to sell the car.
  • abtest - Whether the listing is included in an A/B test.
  • vehicleType - The type of vehicle.
  • yearOfRegistration - The year in which the car was first registered.
  • gearbox - The transmission type.
  • powerPS - The power of the car in PS.
  • model - The car model name.
  • kilometer - How many kilometers the car has been driven.
  • monthOfRegistration - The month in which the car was first registered.
  • fuelType - What type of fuel the car uses.
  • brand - The brand of the car.
  • notRepairedDamage - Whether the car has damage that has not yet been repaired.
  • dateCreated - The date on which the eBay listing was created.
  • nrOfPictures - The number of pictures in the ad.
  • postalCode - The postal code for the location of the vehicle.
  • lastSeen - When the crawler last saw this ad online.
In [1]:
# importing the pandas and numpy libraries
import pandas as pd
import numpy as np

# reading the .csv file as a dataframe
autos = pd.read_csv('autos.csv', encoding='Latin-1')
In [2]:
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
Out[2]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

The number of non-null objects listed under each column tells us which parts of the data set we'll likely need to focus on cleaning. As we know from before, this data has 50,000 entries, so any column with fewer than 50,000 non-null objects contains some null values, which can prove difficult when we get to analysis. (A quick programmatic check follows the list below.)

The columns where we see this pattern are:

  • vehicleType
  • gearbox
  • model
  • fuelType
  • notRepairedDamage
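
One way to confirm this programmatically is shown below; this is a minimal check added for reference, not one of the original cells:

In [ ]:
# count missing values per column, keeping only columns with any nulls
null_counts = autos.isnull().sum()
null_counts[null_counts > 0]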

Additionally, there are a few columns that may benefit from a different data type. For example, the price column is currently stored as the object type, but it would likely be more useful as an int64 so that we can do things like sort the rows by price.

In [3]:
autos.columns
Out[3]:
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

Another problem with our data is that the column names are in camelCase (likeThis) instead of snake_case (like_this). Below, we'll convert the column names to snake_case.

In [4]:
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
                 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
                 'odometer', 'registration_month', 'fuel_type', 'brand',
                 'unrepaired_damage', 'ad_created', 'nr_pictures', 'postal_code',
                 'last_seen']
autos.columns
Out[4]:
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

Although this is merely a stylistic change, a little bit of readability can go a long way!
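
For reference, a mechanical camelCase-to-snake_case conversion can also be done with a regular expression. It wouldn't capture the semantic renames we chose above (e.g. yearOfRegistration to registration_year), which is why we assigned the names by hand, but a sketch looks like this:

In [ ]:
import re

def camel_to_snake(name):
    # insert an underscore before each uppercase letter, then lowercase
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

camel_to_snake('vehicleType')  # 'vehicle_type'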

Overview of Data

Let's use the DataFrame.describe() method to learn more about the columns and values stored in our data set. This will give us insight into which parts of our data we'll need to focus on cleaning. Note that we pass the include='all' parameter so that we get summary statistics for both numeric and object data types.

In [5]:
autos.describe(include='all')
Out[5]:
date_crawled name seller offer_type price ab_test vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-19 17:36:18 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

Problem 1: Redundancy

Looking at the summary statistics for the data, we can see there are a few columns that don't seem useful for our analysis. For instance, the seller, offer_type, and nr_pictures columns consist almost entirely of a single value.

In [6]:
autos['seller'].value_counts()
Out[6]:
privat        49999
gewerblich        1
Name: seller, dtype: int64
In [7]:
autos['offer_type'].value_counts()
Out[7]:
Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64
In [8]:
autos['nr_pictures'].value_counts()
Out[8]:
0    50000
Name: nr_pictures, dtype: int64

As we can see by using the Series.value_counts() method on the seller column, nearly all of our data is from private sellers; there is only one listing under 'gewerblich' (in English, commercial). Similarly, nearly all of the listings have an offer type of 'Angebot' (in English, offer, i.e. someone looking to sell their car), and only one listing is marked 'Gesuch' (in English, request, i.e. someone posting a listing looking to buy a car). Finally, every listing has a value of 0 in nr_pictures.

Since these columns each contain essentially one value, it would make sense to drop them from the data set.
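
If we were to drop them, a minimal sketch would look like the following. We leave the columns in place for the remainder of this project, so the outputs below still include them.

In [ ]:
# dropping the near-constant columns on a copy, so that the later
# outputs in this notebook still match
reduced = autos.drop(['seller', 'offer_type', 'nr_pictures'], axis=1)
reduced.shape  # (50000, 17)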

Problem 2: Incorrect Data Types

As mentioned before, there are a few columns where numeric data is stored as text rather than as integers. Both the price and odometer columns would be better suited as integers. Let's begin by looking at the price column.

In [9]:
autos['price'].head()
Out[9]:
0    $5,000
1    $8,500
2    $8,990
3    $4,350
4    $1,350
Name: price, dtype: object

It looks like the values in the price column contain the non-numeric characters $ and , (a dollar sign and thousands separators). We need to remove them before converting the strings into integers.

In [10]:
autos['price'] = autos['price'].str.replace('$', '')
autos['price'] = autos['price'].str.replace(',', '')
autos['price'].head()
Out[10]:
0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: object

Note that the dtype is still object, so we'll use the Series.astype() method to change it from object to integer.

In [11]:
autos['price'] = autos['price'].astype(int)
autos['price'].head()
Out[11]:
0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64

Now that the price column holds numeric values, let's use Series.describe() to look at the summary statistics.

In [12]:
autos['price'].describe()
Out[12]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

Let's apply the same process to the odometer column.

In [13]:
autos['odometer'].head()
Out[13]:
0    150,000km
1    150,000km
2     70,000km
3     70,000km
4    150,000km
Name: odometer, dtype: object
In [14]:
autos['odometer'] = autos['odometer'].str.replace(',', '')
autos['odometer'] = autos['odometer'].str.replace('km', '')
autos['odometer'] = autos['odometer'].astype(int)
In [15]:
autos['odometer'].head()
Out[15]:
0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer, dtype: int64
In [16]:
autos['odometer'].describe()
Out[16]:
count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer, dtype: float64

Before moving on, let's rename the odometer column to odometer_km so that we can remember that the values are in kilometers.

In [17]:
autos.rename({'odometer':'odometer_km'}, axis=1, inplace=True)
In [18]:
autos.describe(include='all')
Out[18]:
date_crawled name seller offer_type price ab_test vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_pictures postal_code last_seen
count 50000 50000 50000 50000 5.000000e+04 50000 44905 50000.000000 47320 50000.000000 47242 50000.000000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 NaN 2 8 NaN 2 NaN 245 NaN NaN 7 40 2 76 NaN NaN 39481
top 2016-03-19 17:36:18 Ford_Fiesta privat Angebot NaN test limousine NaN manuell NaN golf NaN NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 NaN 25756 12859 NaN 36993 NaN 4024 NaN NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN 9.840044e+03 NaN NaN 2005.073280 NaN 116.355920 NaN 125732.700000 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN 4.811044e+05 NaN NaN 105.712813 NaN 209.216627 NaN 40042.211706 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN 0.000000e+00 NaN NaN 1000.000000 NaN 0.000000 NaN 5000.000000 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN 1.100000e+03 NaN NaN 1999.000000 NaN 70.000000 NaN 125000.000000 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN 2.950000e+03 NaN NaN 2003.000000 NaN 105.000000 NaN 150000.000000 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN 7.200000e+03 NaN NaN 2008.000000 NaN 150.000000 NaN 150000.000000 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN 1.000000e+08 NaN NaN 9999.000000 NaN 17700.000000 NaN 150000.000000 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

Problem 3: Outliers

Let's take another look at the price column.

In [19]:
autos['price'].describe()
Out[19]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

It appears that some outliers in our data may be skewing the mean: note how the minimum price is 0 dollars and the maximum price is 100 million dollars. Below, we use the value_counts() and sort_index() methods to look at some of the specific, most extreme prices.

In [20]:
autos['price'].value_counts().sort_index(ascending=False)
Out[20]:
99999999       1
27322222       1
12345678       3
11111111       2
10000000       1
3890000        1
1300000        1
1234566        1
999999         2
999990         1
350000         1
345000         1
299000         1
295000         1
265000         1
259000         1
250000         1
220000         1
198000         1
197000         1
194000         1
190000         1
180000         1
175000         1
169999         1
169000         1
163991         1
163500         1
155000         1
151990         1
            ... 
66             1
65             5
60             9
59             1
55             2
50            49
49             4
47             1
45             4
40             6
35             1
30             7
29             1
25             5
20             4
18             1
17             3
15             2
14             1
13             2
12             3
11             2
10             7
9              1
8              1
5              2
3              1
2              3
1            156
0           1421
Name: price, Length: 2357, dtype: int64

It looks like there are a handful of listings priced either above 350,000 dollars or below 200 dollars. Let's exclude these outliers by creating a new dataframe, clean_autos.

In [21]:
clean_autos = autos[autos['price'].between(200, 350000)]
clean_autos['price'].describe()
Out[21]:
count     47645.000000
mean       6000.707273
std        9110.783444
min         200.000000
25%        1300.000000
50%        3190.000000
75%        7500.000000
max      350000.000000
Name: price, dtype: float64

Next, let's examine the odometer_km column in our new clean_autos data.

In [22]:
clean_autos['odometer_km'].describe()
Out[22]:
count     47645.000000
mean     125887.501312
std       39482.911790
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
In [23]:
clean_autos['odometer_km'].value_counts()
Out[23]:
150000    30781
125000     4998
100000     2083
90000      1720
80000      1407
70000      1209
60000      1145
50000      1007
40000       813
30000       773
20000       747
5000        720
10000       242
Name: odometer_km, dtype: int64

It looks like the odometer readings fall on a fixed set of values rather than exact figures, which indicates the readings aren't exact; the website likely asks sellers to choose from a preset list of kilometer ranges. Additionally, the majority of the listings are at the highest value, 150,000 km, indicating that most of the cars have 150,000 km or more on them.

Another column we should examine is registration_year.

In [24]:
clean_autos['registration_year'].describe()
Out[24]:
count    47645.000000
mean      2004.800084
std         88.423872
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

Some of the listings have registration years such as 1000 or 9999, which can't be accurate: cars didn't exist in the year 1000, and since the data was crawled in 2016, a car can't have been first registered in a later year. Let's remove these outliers from our dataset.

In [25]:
clean_autos['registration_year'].value_counts().sort_index(ascending=False)
Out[25]:
9999       3
9000       1
8888       1
6200       1
5911       1
5000       3
4800       1
4500       1
4100       1
2800       1
2019       1
2018     468
2017    1371
2016    1161
2015     376
2014     656
2013     797
2012    1307
2011    1617
2010    1587
2009    2079
2008    2206
2007    2272
2006    2667
2005    2906
2004    2694
2003    2692
2002    2470
2001    2611
2000    3019
        ... 
1964      12
1963       8
1962       4
1961       6
1960      21
1959       6
1958       4
1957       2
1956       4
1955       2
1954       2
1953       1
1952       1
1951       2
1950       1
1948       1
1943       1
1941       2
1939       1
1938       1
1937       4
1934       2
1931       1
1929       1
1927       1
1910       2
1800       2
1111       1
1001       1
1000       1
Name: registration_year, Length: 95, dtype: int64

We'll keep the listings with a registration year between 1900 and 2016 (the year the data was crawled), and drop the others.

In [26]:
clean_autos = clean_autos[clean_autos['registration_year'].between(1900, 2016)]
clean_autos['registration_year'].describe()
Out[26]:
count    45786.000000
mean      2002.993098
std          7.113188
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64
In [27]:
clean_autos['registration_year'].value_counts(normalize=True)
Out[27]:
2000    0.065937
2005    0.063469
1999    0.062137
2004    0.058839
2003    0.058795
2006    0.058249
2001    0.057026
2002    0.053947
1998    0.049906
2007    0.049622
2008    0.048181
2009    0.045407
1997    0.040449
2011    0.035316
2010    0.034661
2012    0.028546
1996    0.028437
2016    0.025357
1995    0.024440
2013    0.017407
2014    0.014328
1994    0.012952
1993    0.008867
2015    0.008212
1992    0.007753
1991    0.007207
1990    0.006989
1989    0.003691
1988    0.002905
1985    0.001922
          ...   
1976    0.000459
1960    0.000459
1969    0.000415
1975    0.000393
1965    0.000371
1964    0.000262
1963    0.000175
1961    0.000131
1959    0.000131
1962    0.000087
1956    0.000087
1937    0.000087
1958    0.000087
1955    0.000044
1954    0.000044
1957    0.000044
1951    0.000044
1910    0.000044
1934    0.000044
1941    0.000044
1927    0.000022
1929    0.000022
1950    0.000022
1931    0.000022
1948    0.000022
1938    0.000022
1939    0.000022
1953    0.000022
1943    0.000022
1952    0.000022
Name: registration_year, Length: 78, dtype: float64

Most of the listings are for cars registered within the 20 years before this data was collected. This is to be expected, as older cars generally lose utility and exit the market. The remaining, older listings are likely aimed at collectors or buyers specifically looking for vintage vehicles.

In [28]:
clean_autos.describe(include='all')
Out[28]:
date_crawled name seller offer_type price ab_test vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_pictures postal_code last_seen
count 45786 45786 45786 45786 45786.000000 45786 43394 45786.000000 43911 45786.000000 43717 45786.000000 45786.000000 42756 45786 37966 45786 45786.0 45786.000000 45786
unique 44289 35043 1 1 NaN 2 8 NaN 2 NaN 244 NaN NaN 7 40 2 74 NaN NaN 36495
top 2016-03-23 18:39:34 Volkswagen_Golf_1.4 privat Angebot NaN test limousine NaN manuell NaN golf NaN NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 75 45786 45786 NaN 23582 12489 NaN 34105 NaN 3639 NaN NaN 27978 9672 33670 1793 NaN NaN 8
mean NaN NaN NaN NaN 6092.594112 NaN NaN 2002.993098 NaN 119.168654 NaN 125699.340410 5.868759 NaN NaN NaN NaN 0.0 51160.666776 NaN
std NaN NaN NaN NaN 9229.962879 NaN NaN 7.113188 NaN 185.462987 NaN 39546.768047 3.650627 NaN NaN NaN NaN 0.0 25733.241750 NaN
min NaN NaN NaN NaN 200.000000 NaN NaN 1910.000000 NaN 0.000000 NaN 5000.000000 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN 1300.000000 NaN NaN 1999.000000 NaN 75.000000 NaN 100000.000000 3.000000 NaN NaN NaN NaN 0.0 30880.000000 NaN
50% NaN NaN NaN NaN 3200.000000 NaN NaN 2003.000000 NaN 109.000000 NaN 150000.000000 6.000000 NaN NaN NaN NaN 0.0 50127.000000 NaN
75% NaN NaN NaN NaN 7700.000000 NaN NaN 2008.000000 NaN 150.000000 NaN 150000.000000 9.000000 NaN NaN NaN NaN 0.0 72100.000000 NaN
max NaN NaN NaN NaN 350000.000000 NaN NaN 2016.000000 NaN 17700.000000 NaN 150000.000000 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

Exploring Date Data

There are 5 columns that represent date values, some from the website itself and some gathered by the crawler:

  • date_crawled: added by the crawler
  • last_seen: added by the crawler
  • ad_created: from the website
  • registration_month: from the website
  • registration_year: from the website

The values in date_crawled, ad_created, and last_seen are formatted as strings. Since we're only interested in the year, month, and day, we'll look at the first 10 characters of each value and use the Series.value_counts() method to see the distribution of the dates. Additionally, we'll use the normalize=True parameter to see percentages rather than counts.

In [29]:
clean_autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
Out[29]:
2016-03-05    0.025226
2016-03-06    0.014175
2016-03-07    0.036256
2016-03-08    0.033351
2016-03-09    0.033242
2016-03-10    0.032390
2016-03-11    0.032565
2016-03-12    0.036758
2016-03-13    0.015900
2016-03-14    0.036321
2016-03-15    0.034290
2016-03-16    0.029310
2016-03-17    0.031691
2016-03-18    0.012711
2016-03-19    0.034487
2016-03-20    0.037937
2016-03-21    0.037326
2016-03-22    0.032543
2016-03-23    0.032346
2016-03-24    0.029441
2016-03-25    0.031341
2016-03-26    0.032106
2016-03-27    0.030905
2016-03-28    0.034749
2016-03-29    0.034028
2016-03-30    0.034006
2016-03-31    0.031800
2016-04-01    0.033941
2016-04-02    0.035775
2016-04-03    0.038920
2016-04-04    0.036671
2016-04-05    0.013039
2016-04-06    0.003080
2016-04-07    0.001376
Name: date_crawled, dtype: float64

It looks like the data was collected by the crawler over a period of roughly a month, from March 5, 2016 to April 7, 2016.
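
We can confirm the crawl window directly (a quick added check); since the date strings are ISO-formatted, they sort correctly as text:

In [ ]:
# earliest and latest crawl dates
dates = clean_autos['date_crawled'].str[:10]
dates.min(), dates.max()  # ('2016-03-05', '2016-04-07')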

In [30]:
clean_autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False)
Out[30]:
2016-04-03    0.039160
2016-03-20    0.038003
2016-03-21    0.037544
2016-04-04    0.036998
2016-03-12    0.036561
2016-04-02    0.035426
2016-03-07    0.034989
2016-03-14    0.034967
2016-03-28    0.034836
2016-03-15    0.034093
2016-03-29    0.034050
2016-04-01    0.033941
2016-03-30    0.033809
2016-03-19    0.033416
2016-03-08    0.033351
2016-03-09    0.033329
2016-03-11    0.032848
2016-03-22    0.032302
2016-03-23    0.032237
2016-03-26    0.032106
2016-03-10    0.032106
2016-03-31    0.031887
2016-03-25    0.031494
2016-03-17    0.031385
2016-03-27    0.030774
2016-03-16    0.029725
2016-03-24    0.029398
2016-03-05    0.022845
2016-03-13    0.017276
2016-03-06    0.015398
                ...   
2016-02-25    0.000066
2016-02-02    0.000044
2016-01-27    0.000044
2016-02-24    0.000044
2016-02-12    0.000044
2016-02-18    0.000044
2016-02-20    0.000044
2016-02-14    0.000044
2016-02-26    0.000044
2016-02-05    0.000044
2016-01-10    0.000044
2016-02-07    0.000022
2015-11-10    0.000022
2016-01-13    0.000022
2015-06-11    0.000022
2016-01-16    0.000022
2016-02-17    0.000022
2015-09-09    0.000022
2016-02-09    0.000022
2016-01-14    0.000022
2016-02-11    0.000022
2016-01-07    0.000022
2015-12-30    0.000022
2016-02-22    0.000022
2015-08-10    0.000022
2016-02-01    0.000022
2016-02-16    0.000022
2015-12-05    0.000022
2016-02-08    0.000022
2016-01-03    0.000022
Name: ad_created, Length: 74, dtype: float64

Most of the listings were created during the same period in which the data was being crawled.

In [31]:
clean_autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
Out[31]:
2015-06-11    0.000022
2015-08-10    0.000022
2015-09-09    0.000022
2015-11-10    0.000022
2015-12-05    0.000022
2015-12-30    0.000022
2016-01-03    0.000022
2016-01-07    0.000022
2016-01-10    0.000044
2016-01-13    0.000022
2016-01-14    0.000022
2016-01-16    0.000022
2016-01-27    0.000044
2016-02-01    0.000022
2016-02-02    0.000044
2016-02-05    0.000044
2016-02-07    0.000022
2016-02-08    0.000022
2016-02-09    0.000022
2016-02-11    0.000022
2016-02-12    0.000044
2016-02-14    0.000044
2016-02-16    0.000022
2016-02-17    0.000022
2016-02-18    0.000044
2016-02-19    0.000066
2016-02-20    0.000044
2016-02-21    0.000066
2016-02-22    0.000022
2016-02-23    0.000087
                ...   
2016-03-09    0.033329
2016-03-10    0.032106
2016-03-11    0.032848
2016-03-12    0.036561
2016-03-13    0.017276
2016-03-14    0.034967
2016-03-15    0.034093
2016-03-16    0.029725
2016-03-17    0.031385
2016-03-18    0.013367
2016-03-19    0.033416
2016-03-20    0.038003
2016-03-21    0.037544
2016-03-22    0.032302
2016-03-23    0.032237
2016-03-24    0.029398
2016-03-25    0.031494
2016-03-26    0.032106
2016-03-27    0.030774
2016-03-28    0.034836
2016-03-29    0.034050
2016-03-30    0.033809
2016-03-31    0.031887
2016-04-01    0.033941
2016-04-02    0.035426
2016-04-03    0.039160
2016-04-04    0.036998
2016-04-05    0.011816
2016-04-06    0.003167
2016-04-07    0.001223
Name: ad_created, Length: 74, dtype: float64

The listings in our data set were created between June 11, 2015 and April 7, 2016.

In [32]:
clean_autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
Out[32]:
2016-03-05    0.001092
2016-03-06    0.004106
2016-03-07    0.005351
2016-03-08    0.007273
2016-03-09    0.009785
2016-03-10    0.010484
2016-03-11    0.012318
2016-03-12    0.023872
2016-03-13    0.008649
2016-03-14    0.012580
2016-03-15    0.015835
2016-03-16    0.016097
2016-03-17    0.028087
2016-03-18    0.007164
2016-03-19    0.015420
2016-03-20    0.020618
2016-03-21    0.020443
2016-03-22    0.020858
2016-03-23    0.018324
2016-03-24    0.019526
2016-03-25    0.018805
2016-03-26    0.016686
2016-03-27    0.015485
2016-03-28    0.020596
2016-03-29    0.021950
2016-03-30    0.024396
2016-03-31    0.023719
2016-04-01    0.023042
2016-04-02    0.024527
2016-04-03    0.025073
2016-04-04    0.024243
2016-04-05    0.126130
2016-04-06    0.224020
2016-04-07    0.133447
Name: last_seen, dtype: float64

It's notable that nearly 50% of the listings were last seen on April 5, 6, or 7. This is likely linked to the fact that the crawler stopped gathering data after April 7th: many of these listings were probably still open when crawling ended, rather than all closing on those dates.
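
That share can be verified directly (a small added check):

In [ ]:
# fraction of listings last seen in the final three days of crawling
last_days = ['2016-04-05', '2016-04-06', '2016-04-07']
clean_autos['last_seen'].str[:10].isin(last_days).mean()  # ~0.48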

Exploring Price by Brand

For this next section, let's look at the brand column and ask: which car brands are more expensive on average?

In [33]:
clean_autos['brand'].value_counts(normalize=True)
Out[33]:
volkswagen        0.211244
bmw               0.111257
opel              0.105906
mercedes_benz     0.097628
audi              0.087603
ford              0.068733
renault           0.046543
peugeot           0.029944
fiat              0.025095
seat              0.018171
skoda             0.016512
nissan            0.015289
mazda             0.015136
smart             0.014371
citroen           0.014131
toyota            0.012930
hyundai           0.010112
sonstige_autos    0.009544
volvo             0.009195
mini              0.008889
mitsubishi        0.008168
honda             0.007928
kia               0.007142
alfa_romeo        0.006683
porsche           0.006094
suzuki            0.005810
chevrolet         0.005722
chrysler          0.003560
dacia             0.002686
daihatsu          0.002490
jeep              0.002315
land_rover        0.002140
subaru            0.002097
saab              0.001660
jaguar            0.001529
daewoo            0.001485
rover             0.001332
trabant           0.001289
lancia            0.001048
lada              0.000590
Name: brand, dtype: float64

Let's look at the top 20 most commonly listed car brands and find the mean price for each brand.

In [35]:
# selecting the 20 most commonly listed brands
brands_list = clean_autos['brand'].value_counts().index[:20]
print(brands_list)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat', 'skoda', 'nissan', 'mazda', 'smart',
       'citroen', 'toyota', 'hyundai', 'sonstige_autos', 'volvo', 'mini'],
      dtype='object')

In the cell below, we create an empty dictionary and loop through the brands_list referenced above. For each brand, we calculate the mean price from the clean_autos dataframe, then add the brand and its mean price to the brand_price dictionary.

In [ ]:
brand_price = {}

for brand in brands_list:
    mean = int(clean_autos.loc[clean_autos['brand'] == brand, 'price'].mean())
    brand_price[brand] = mean

brand_price

For reference, the mean price across all of the listings in clean_autos is about 6092. The most commonly listed brand, Volkswagen, is cheaper on average than other cars in our data, while the second most commonly listed brand, BMW, is more expensive on average.
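
That reference figure comes straight from the cleaned data:

In [ ]:
# overall mean price, for comparison with the per-brand means
int(clean_autos['price'].mean())  # 6092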

Next, let's use the same process to calculate the average mileage (from the odometer_km column) across the selected brands.

In [ ]:
brand_km = {}

for brand in brands_list:
    mean = int(clean_autos.loc[clean_autos['brand'] == brand, 'odometer_km'].mean())
    brand_km[brand] = mean

brand_km

In the cell below, we'll convert the two dictionaries brand_price and brand_km to Series objects, then combine them into a single DataFrame that shares a common index (the brand).

In [ ]:
bmp_series = pd.Series(brand_price)
bkm_series = pd.Series(brand_km)

avg_df = pd.DataFrame(bmp_series, columns=['mean_price'])
avg_df['mean_km'] = bkm_series
avg_df
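
As an aside, the same table can be built without explicit loops using groupby; this is a sketch, and the named-aggregation syntax shown assumes pandas 0.25 or newer:

In [ ]:
# equivalent aggregation over the top-20 brands in one pass
top_20 = clean_autos[clean_autos['brand'].isin(brands_list)]
avg_df_alt = (top_20.groupby('brand')
                    .agg(mean_price=('price', 'mean'),
                         mean_km=('odometer_km', 'mean'))
                    .astype(int))
avg_df_alt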

The assumption is that the higher the mileage on a car, the cheaper the car becomes. However, looking at the dataframe, there is no clear relationship between mileage and price across brands. This may indicate that brand is a bigger factor affecting price than mileage. We should note, however, that as we saw before, the data in odometer_km is not exact, and the listed mileage on the eBay ads is likely limited to a preselected range.
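
One rough way to quantify "no clear relationship" is the correlation between the per-brand means; with only 20 brands this is just a sanity check, not a rigorous test:

In [ ]:
# Pearson correlation between mean price and mean mileage across brands
avg_df['mean_price'].corr(avg_df['mean_km'])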

Conclusion

The goal of this guided project was to demonstrate our ability to clean data by recognizing anomalies, organize DataFrame and Series objects, and alter data types using the functionality of pandas and NumPy. It is important not to declare the presence of a particular trend without verifying it through statistical analysis.