Exploring eBay Car Sales Data

In this guided project, we will be cleaning and exploring car sales data from eBay Kleinanzeigen, a classifieds section of the German eBay website. The goal of this project is to demonstrate our ability to use and understand the functionality of pandas by working with Series and DataFrame objects, and to recognize issues and anomalies in our dataset.

The original dataset can be found on Kaggle. For this project, the data has been trimmed to a smaller size (50,000 data points) and dirtied to better suit the purpose of the project.

Overview of Columns

The data dictionary provided with the data is as follows:

  • dateCrawled - When this ad was first crawled. All field-values are taken from this date.
  • name - Name of the car.
  • seller - Whether the seller is private or a dealer.
  • offerType - The type of listing.
  • price - The price on the ad to sell the car.
  • abtest - Whether the listing is included in an A/B test.
  • vehicleType - The type of vehicle.
  • yearOfRegistration - The year in which the car was first registered.
  • gearbox - The transmission type.
  • powerPS - The power of the car in PS.
  • model - The car model name.
  • kilometer - How many kilometers the car has been driven.
  • monthOfRegistration - The month in which the car was first registered.
  • fuelType - What type of fuel the car uses.
  • brand - The brand of the car.
  • notRepairedDamage - Whether the car has damage that has not yet been repaired.
  • dateCreated - The date on which the eBay listing was created.
  • nrOfPictures - The number of pictures in the ad.
  • postalCode - The postal code for the location of the vehicle.
  • lastSeen - When the crawler last saw this ad online.
In [1]:
# importing the pandas and numpy libraries
import pandas as pd
import numpy as np

# reading the .csv file as a dataframe
autos = pd.read_csv('autos.csv', encoding='Latin-1')
In [2]:
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
Out[2]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

The number of non-null objects listed under each column tells us which parts of the data set we'll likely need to focus on cleaning. As we know from before, this data has 50,000 entries, so any column with fewer than 50,000 non-null objects contains some null values, which can prove difficult when we get to analysis. (A quick programmatic check follows the list below.)

The columns where we see this pattern are:

  • vehicleType
  • gearbox
  • model
  • fuelType
  • notRepairedDamage
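
One way to confirm this programmatically is shown below; this is a minimal check added for reference, not one of the original cells:

In [ ]:
# count missing values per column, keeping only columns with any nulls
null_counts = autos.isnull().sum()
null_counts[null_counts > 0]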

Additionally, there are a few columns that may benefit from a different data type. For example, the price column is currently stored as the object type, but it would likely be more useful as an int64 so that we can do things like sort the rows by price.

In [3]:
autos.columns
Out[3]:
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

Another problem with our data is that the column names are in camelCase (likeThis) instead of snake_case (like_this). Below, we'll convert the column names to snake_case.

In [4]:
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
                 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
                 'odometer', 'registration_month', 'fuel_type', 'brand',
                 'unrepaired_damage', 'ad_created', 'nr_pictures', 'postal_code',
                 'last_seen']
autos.columns
Out[4]:
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

Although this is merely a stylistic change, a little bit of readability can go a long way!
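
For reference, a mechanical camelCase-to-snake_case conversion can also be done with a regular expression. It wouldn't capture the semantic renames we chose above (e.g. yearOfRegistration to registration_year), which is why we assigned the names by hand, but a sketch looks like this:

In [ ]:
import re

def camel_to_snake(name):
    # insert an underscore before each uppercase letter, then lowercase
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

camel_to_snake('vehicleType')  # 'vehicle_type'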

Overview of Data

Let's use the DataFrame.describe() method to learn more about the columns and values stored in our data set. This will give us insight into which parts of our data we'll need to focus on cleaning. Note that we pass the include='all' parameter so that we get summary statistics for both numeric and object data types.

In [5]:
autos.describe(include='all')
Out[5]:
date_crawled name seller offer_type price ab_test vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-19 17:36:18 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

Problem 1: Redundancy

Looking at the summary statistics for the data, we can see there are a few columns that don't seem useful for our analysis. For instance, the seller, offer_type, and nr_pictures columns consist almost entirely of a single value.

In [6]:
autos['seller'].value_counts()
Out[6]:
privat        49999
gewerblich        1
Name: seller, dtype: int64
In [7]:
autos['offer_type'].value_counts()
Out[7]:
Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64
In [8]:
autos['nr_pictures'].value_counts()
Out[8]:
0    50000
Name: nr_pictures, dtype: int64

As we can see by using the Series.value_counts() method on the seller column, nearly all of our data is from private sellers; there is only one listing under 'gewerblich' (in English, commercial). Similarly, nearly all of the listings have an offer type of 'Angebot' (in English, offer, i.e. someone looking to sell their car), and only one listing is marked 'Gesuch' (in English, request, i.e. someone posting a listing looking to buy a car). Finally, every listing has a value of 0 in nr_pictures.

Since these columns each contain essentially one value, it would make sense to drop them from the data set.
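
If we were to drop them, a minimal sketch would look like the following. We leave the columns in place for the remainder of this project, so the outputs below still include them.

In [ ]:
# dropping the near-constant columns on a copy, so that the later
# outputs in this notebook still match
reduced = autos.drop(['seller', 'offer_type', 'nr_pictures'], axis=1)
reduced.shape  # (50000, 17)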

Problem 2: Incorrect Data Types

As mentioned before, there are a few columns where numeric data is stored as text rather than as integers. Both the price and odometer columns would be better suited as integers. Let's begin by looking at the price column.

In [9]:
autos['price'].head()
Out[9]:
0    $5,000
1    $8,500
2    $8,990
3    $4,350
4    $1,350
Name: price, dtype: object

It looks like the values in the price column contain the non-numeric characters $ and , (a dollar sign and thousands separators). We need to remove them before converting the strings into integers.

In [10]:
autos['price'] = autos['price'].str.replace('$', '')
autos['price'] = autos['price'].str.replace(',', '')
autos['price'].head()
Out[10]:
0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: object

Note that the dtype is still object, so we'll use the Series.astype() method to change it from object to integer.

In [11]:
autos['price'] = autos['price'].astype(int)
autos['price'].head()
Out[11]:
0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64

Now that the price column holds numeric values, let's use Series.describe() to look at the summary statistics.

In [12]:
autos['price'].describe()
Out[12]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

Let's apply the same process to the odometer column.

In [13]:
autos['odometer'].head()
Out[13]:
0    150,000km
1    150,000km
2     70,000km
3     70,000km
4    150,000km
Name: odometer, dtype: object
In [14]:
autos['odometer'] = autos['odometer'].str.replace(',', '')
autos['odometer'] = autos['odometer'].str.replace('km', '')
autos['odometer'] = autos['odometer'].astype(int)
In [15]:
autos['odometer'].head()
Out[15]:
0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer, dtype: int64
In [16]:
autos['odometer'].describe()
Out[16]:
count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer, dtype: float64

Before moving on, let's rename the odometer column to odometer_km so that we can remember that the values are in kilometers.

In [17]:
autos.rename({'odometer':'odometer_km'}, axis=1, inplace=True)
In [18]:
autos.describe(include='all')
Out[18]:
date_crawled name seller offer_type price ab_test vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_pictures postal_code last_seen
count 50000 50000 50000 50000 5.000000e+04 50000 44905 50000.000000 47320 50000.000000 47242 50000.000000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 NaN 2 8 NaN 2 NaN 245 NaN NaN 7 40 2 76 NaN NaN 39481
top 2016-03-19 17:36:18 Ford_Fiesta privat Angebot NaN test limousine NaN manuell NaN golf NaN NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 NaN 25756 12859 NaN 36993 NaN 4024 NaN NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN 9.840044e+03 NaN NaN 2005.073280 NaN 116.355920 NaN 125732.700000 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN 4.811044e+05 NaN NaN 105.712813 NaN 209.216627 NaN 40042.211706 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN 0.000000e+00 NaN NaN 1000.000000 NaN 0.000000 NaN 5000.000000 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN 1.100000e+03 NaN NaN 1999.000000 NaN 70.000000 NaN 125000.000000 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN 2.950000e+03 NaN NaN 2003.000000 NaN 105.000000 NaN 150000.000000 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN 7.200000e+03 NaN NaN 2008.000000 NaN 150.000000 NaN 150000.000000 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN 1.000000e+08 NaN NaN 9999.000000 NaN 17700.000000 NaN 150000.000000 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

Problem 3: Outliers

Let's take another look at the price column.

In [19]:
autos['price'].describe()
Out[19]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

It appears that some outliers in our data may be skewing the mean: note how the minimum price is 0 dollars and the maximum price is 100 million dollars. Below, we use the value_counts() and sort_index() methods to look at some of the specific, most extreme prices.

In [20]:
autos['price'].value_counts().sort_index(ascending=False)
Out[20]:
99999999       1
27322222       1
12345678       3
11111111       2
10000000       1
3890000        1
1300000        1
1234566        1
999999         2
999990         1
350000         1
345000         1
299000         1
295000         1
265000         1
259000         1
250000         1
220000         1
198000         1
197000         1
194000         1
190000         1
180000         1
175000         1
169999         1
169000         1
163991         1
163500         1
155000         1
151990         1
            ... 
66             1
65             5
60             9
59             1
55             2
50            49
49             4
47             1
45             4
40             6
35             1
30             7
29             1
25             5
20             4
18             1
17             3
15             2
14             1
13             2
12             3
11             2
10             7
9              1
8              1
5              2
3              1
2              3
1            156
0           1421
Name: price, Length: 2357, dtype: int64

It looks like there are a handful of listings priced either above 350,000 dollars or below 200 dollars. Let's exclude these outliers by creating a new dataframe, clean_autos.

In [21]:
clean_autos = autos[autos['price'].between(200, 350000)]
clean_autos['price'].describe()
Out[21]:
count     47645.000000
mean       6000.707273
std        9110.783444
min         200.000000
25%        1300.000000
50%        3190.000000
75%        7500.000000
max      350000.000000
Name: price, dtype: float64

Next, let's examine the odometer_km column in our new clean_autos data.

In [22]:
clean_autos['odometer_km'].describe()
Out[22]:
count     47645.000000
mean     125887.501312
std       39482.911790
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
In [23]:
clean_autos['odometer_km'].value_counts()
Out[23]:
150000    30781
125000     4998
100000     2083
90000      1720
80000      1407
70000      1209
60000      1145
50000      1007
40000       813
30000       773
20000       747
5000        720
10000       242
Name: odometer_km, dtype: int64

It looks like the odometer readings fall on a fixed set of values rather than exact figures, which indicates the readings aren't exact; the website likely asks sellers to choose from a preset list of kilometer ranges. Additionally, the majority of the listings are at the highest value, 150,000 km, indicating that most of the cars have 150,000 km or more on them.

Another column we should examine is registration_year.

In [24]:
clean_autos['registration_year'].describe()
Out[24]:
count    47645.000000
mean      2004.800084
std         88.423872
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

Some of the listings have registration years such as 1000 or 9999, which can't be accurate: cars didn't exist in the year 1000, and since the data was crawled in 2016, a car can't have been first registered in a later year. Let's remove these outliers from our dataset.

In [25]:
clean_autos['registration_year'].value_counts().sort_index(ascending=False)
Out[25]:
9999       3
9000       1
8888       1
6200       1
5911       1
5000       3
4800       1
4500       1
4100       1
2800       1
2019       1
2018     468
2017    1371
2016    1161
2015     376
2014     656
2013     797
2012    1307
2011    1617
2010    1587
2009    2079
2008    2206
2007    2272
2006    2667
2005    2906
2004    2694
2003    2692
2002    2470
2001    2611
2000    3019
        ... 
1964      12
1963       8
1962       4
1961       6
1960      21
1959       6
1958       4
1957       2
1956       4
1955       2
1954       2
1953       1
1952       1
1951       2
1950       1
1948       1
1943       1
1941       2
1939       1
1938       1
1937       4
1934       2
1931       1
1929       1
1927       1
1910       2
1800       2
1111       1
1001       1
1000       1
Name: registration_year, Length: 95, dtype: int64

We'll keep the listings with a registration year between 1900 and 2016 (the year the data was crawled), and drop the others.

In [26]:
clean_autos = clean_autos[clean_autos['registration_year'].between(1900, 2016)]
clean_autos['registration_year'].describe()
Out[26]:
count    45786.000000
mean      2002.993098
std          7.113188
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64
In [27]:
clean_autos['registration_year'].value_counts(normalize=True)
Out[27]:
2000    0.065937
2005    0.063469
1999    0.062137
2004    0.058839
2003    0.058795
2006    0.058249
2001    0.057026
2002    0.053947
1998    0.049906
2007    0.049622
2008    0.048181
2009    0.045407
1997    0.040449
2011    0.035316
2010    0.034661
2012    0.028546
1996    0.028437
2016    0.025357
1995    0.024440
2013    0.017407
2014    0.014328
1994    0.012952
1993    0.008867
2015    0.008212
1992    0.007753
1991    0.007207
1990    0.006989
1989    0.003691
1988    0.002905
1985    0.001922
          ...   
1976    0.000459
1960    0.000459
1969    0.000415
1975    0.000393
1965    0.000371
1964    0.000262
1963    0.000175
1961    0.000131
1959    0.000131
1962    0.000087
1956    0.000087
1937    0.000087
1958    0.000087
1955    0.000044
1954    0.000044
1957    0.000044
1951    0.000044
1910    0.000044
1934    0.000044
1941    0.000044
1927    0.000022
1929    0.000022
1950    0.000022
1931    0.000022
1948    0.000022
1938    0.000022
1939    0.000022
1953    0.000022
1943    0.000022
1952    0.000022
Name: registration_year, Length: 78, dtype: float64

Most of the listings are for cars registered within the 20 years before this data was collected. This is to be expected, as older cars generally lose utility and exit the market. The remaining, older listings are likely aimed at collectors or buyers specifically looking for vintage vehicles.

In [28]:
clean_autos.describe(include='all')
Out[28]:
date_crawled name seller offer_type price ab_test vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_pictures postal_code last_seen
count 45786 45786 45786 45786 45786.000000 45786 43394 45786.000000 43911 45786.000000 43717 45786.000000 45786.000000 42756 45786 37966 45786 45786.0 45786.000000 45786
unique 44289 35043 1 1 NaN 2 8 NaN 2 NaN 244 NaN NaN 7 40 2 74 NaN NaN 36495
top 2016-03-23 18:39:34 Volkswagen_Golf_1.4 privat Angebot NaN test limousine NaN manuell NaN golf NaN NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 75 45786 45786 NaN 23582 12489 NaN 34105 NaN 3639 NaN NaN 27978 9672 33670 1793 NaN NaN 8
mean NaN NaN NaN NaN 6092.594112 NaN NaN 2002.993098 NaN 119.168654 NaN 125699.340410 5.868759 NaN NaN NaN NaN 0.0 51160.666776 NaN
std NaN NaN NaN NaN 9229.962879 NaN NaN 7.113188 NaN 185.462987 NaN 39546.768047 3.650627 NaN NaN NaN NaN 0.0 25733.241750 NaN
min NaN NaN NaN NaN 200.000000 NaN NaN 1910.000000 NaN 0.000000 NaN 5000.000000 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN 1300.000000 NaN NaN 1999.000000 NaN 75.000000 NaN 100000.000000 3.000000 NaN NaN NaN NaN 0.0 30880.000000 NaN
50% NaN NaN NaN NaN 3200.000000 NaN NaN 2003.000000 NaN 109.000000 NaN 150000.000000 6.000000 NaN NaN NaN NaN 0.0 50127.000000 NaN
75% NaN NaN NaN NaN 7700.000000 NaN NaN 2008.000000 NaN 150.000000 NaN 150000.000000 9.000000 NaN NaN NaN NaN 0.0 72100.000000 NaN
max NaN NaN NaN NaN 350000.000000 NaN NaN 2016.000000 NaN 17700.000000 NaN 150000.000000 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

Exploring Date Data

There are 5 columns that represent date values, some from the website itself and some gathered by the crawler:

  • date_crawled: added by the crawler
  • last_seen: added by the crawler
  • ad_created: from the website
  • registration_month: from the website
  • registration_year: from the website

The values in date_crawled, ad_created, and last_seen are formatted as strings. Since we're only interested in the year, month, and day, we'll look at the first 10 characters of each value and use the Series.value_counts() method to see the distribution of the dates. Additionally, we'll use the normalize=True parameter to see percentages rather than counts.

In [29]:
clean_autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
Out[29]:
2016-03-05    0.025226
2016-03-06    0.014175
2016-03-07    0.036256
2016-03-08    0.033351
2016-03-09    0.033242
2016-03-10    0.032390
2016-03-11    0.032565
2016-03-12    0.036758
2016-03-13    0.015900
2016-03-14    0.036321
2016-03-15    0.034290
2016-03-16    0.029310
2016-03-17    0.031691
2016-03-18    0.012711
2016-03-19    0.034487
2016-03-20    0.037937
2016-03-21    0.037326
2016-03-22    0.032543
2016-03-23    0.032346
2016-03-24    0.029441
2016-03-25    0.031341
2016-03-26    0.032106
2016-03-27    0.030905
2016-03-28    0.034749
2016-03-29    0.034028
2016-03-30    0.034006
2016-03-31    0.031800
2016-04-01    0.033941
2016-04-02    0.035775
2016-04-03    0.038920
2016-04-04    0.036671
2016-04-05    0.013039
2016-04-06    0.003080
2016-04-07    0.001376
Name: date_crawled, dtype: float64

It looks like the data was collected by the crawler over a period of roughly a month, from March 5, 2016 to April 7, 2016.
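
We can confirm the crawl window directly (a quick added check); since the date strings are ISO-formatted, they sort correctly as text:

In [ ]:
# earliest and latest crawl dates
dates = clean_autos['date_crawled'].str[:10]
dates.min(), dates.max()  # ('2016-03-05', '2016-04-07')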

In [30]:
clean_autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False)
Out[30]:
2016-04-03    0.039160
2016-03-20    0.038003
2016-03-21    0.037544
2016-04-04    0.036998
2016-03-12    0.036561
2016-04-02    0.035426
2016-03-07    0.034989
2016-03-14    0.034967
2016-03-28    0.034836
2016-03-15    0.034093
2016-03-29    0.034050
2016-04-01    0.033941
2016-03-30    0.033809
2016-03-19    0.033416
2016-03-08    0.033351
2016-03-09    0.033329
2016-03-11    0.032848
2016-03-22    0.032302
2016-03-23    0.032237
2016-03-26    0.032106
2016-03-10    0.032106
2016-03-31    0.031887
2016-03-25    0.031494
2016-03-17    0.031385
2016-03-27    0.030774
2016-03-16    0.029725
2016-03-24    0.029398
2016-03-05    0.022845
2016-03-13    0.017276
2016-03-06    0.015398
                ...   
2016-02-25    0.000066
2016-02-02    0.000044
2016-01-27    0.000044
2016-02-24    0.000044
2016-02-12    0.000044
2016-02-18    0.000044
2016-02-20    0.000044
2016-02-14    0.000044
2016-02-26    0.000044
2016-02-05    0.000044
2016-01-10    0.000044
2016-02-07    0.000022
2015-11-10    0.000022
2016-01-13    0.000022
2015-06-11    0.000022
2016-01-16    0.000022
2016-02-17    0.000022
2015-09-09    0.000022
2016-02-09    0.000022
2016-01-14    0.000022
2016-02-11    0.000022
2016-01-07    0.000022
2015-12-30    0.000022
2016-02-22    0.000022
2015-08-10    0.000022
2016-02-01    0.000022
2016-02-16    0.000022
2015-12-05    0.000022
2016-02-08    0.000022
2016-01-03    0.000022
Name: ad_created, Length: 74, dtype: float64

Most of the listings were created during the same period in which the data was being crawled.

In [31]:
clean_autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
Out[31]:
2015-06-11    0.000022
2015-08-10    0.000022
2015-09-09    0.000022
2015-11-10    0.000022
2015-12-05    0.000022
2015-12-30    0.000022
2016-01-03    0.000022
2016-01-07    0.000022
2016-01-10    0.000044
2016-01-13    0.000022
2016-01-14    0.000022
2016-01-16    0.000022
2016-01-27    0.000044
2016-02-01    0.000022
2016-02-02    0.000044
2016-02-05    0.000044
2016-02-07    0.000022
2016-02-08    0.000022
2016-02-09    0.000022
2016-02-11    0.000022
2016-02-12    0.000044
2016-02-14    0.000044
2016-02-16    0.000022
2016-02-17    0.000022
2016-02-18    0.000044
2016-02-19    0.000066
2016-02-20    0.000044
2016-02-21    0.000066
2016-02-22    0.000022
2016-02-23    0.000087
                ...   
2016-03-09    0.033329
2016-03-10    0.032106
2016-03-11    0.032848
2016-03-12    0.036561
2016-03-13    0.017276
2016-03-14    0.034967
2016-03-15    0.034093
2016-03-16    0.029725
2016-03-17    0.031385
2016-03-18    0.013367
2016-03-19    0.033416
2016-03-20    0.038003
2016-03-21    0.037544
2016-03-22    0.032302
2016-03-23    0.032237
2016-03-24    0.029398
2016-03-25    0.031494
2016-03-26    0.032106
2016-03-27    0.030774
2016-03-28    0.034836
2016-03-29    0.034050
2016-03-30    0.033809
2016-03-31    0.031887
2016-04-01    0.033941
2016-04-02    0.035426
2016-04-03    0.039160
2016-04-04    0.036998
2016-04-05    0.011816
2016-04-06    0.003167
2016-04-07    0.001223
Name: ad_created, Length: 74, dtype: float64

The listings in our data set were created between June 11, 2015 and April 7, 2016.

In [32]:
clean_autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
Out[32]:
2016-03-05    0.001092
2016-03-06    0.004106
2016-03-07    0.005351
2016-03-08    0.007273
2016-03-09    0.009785
2016-03-10    0.010484
2016-03-11    0.012318
2016-03-12    0.023872
2016-03-13    0.008649
2016-03-14    0.012580
2016-03-15    0.015835
2016-03-16    0.016097
2016-03-17    0.028087
2016-03-18    0.007164
2016-03-19    0.015420
2016-03-20    0.020618
2016-03-21    0.020443
2016-03-22    0.020858
2016-03-23    0.018324
2016-03-24    0.019526
2016-03-25    0.018805
2016-03-26    0.016686
2016-03-27    0.015485
2016-03-28    0.020596
2016-03-29    0.021950
2016-03-30    0.024396
2016-03-31    0.023719
2016-04-01    0.023042
2016-04-02    0.024527
2016-04-03    0.025073
2016-04-04    0.024243
2016-04-05    0.126130
2016-04-06    0.224020
2016-04-07    0.133447
Name: last_seen, dtype: float64

It's notable that nearly 50% of the listings were last seen on April 5, 6, or 7. This is likely linked to the fact that the crawler stopped gathering data after April 7th: many of these listings were probably still open when crawling ended, rather than all closing on those dates.
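
That share can be verified directly (a small added check):

In [ ]:
# fraction of listings last seen in the final three days of crawling
last_days = ['2016-04-05', '2016-04-06', '2016-04-07']
clean_autos['last_seen'].str[:10].isin(last_days).mean()  # ~0.48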

Exploring Price by Brand

For this next section, let's look at the brand column and ask: which car brands are more expensive on average?

In [33]:
clean_autos['brand'].value_counts(normalize=True)
Out[33]:
volkswagen        0.211244
bmw               0.111257
opel              0.105906
mercedes_benz     0.097628
audi              0.087603
ford              0.068733
renault           0.046543
peugeot           0.029944
fiat              0.025095
seat              0.018171
skoda             0.016512
nissan            0.015289
mazda             0.015136
smart             0.014371
citroen           0.014131
toyota            0.012930
hyundai           0.010112
sonstige_autos    0.009544
volvo             0.009195
mini              0.008889
mitsubishi        0.008168
honda             0.007928
kia               0.007142
alfa_romeo        0.006683
porsche           0.006094
suzuki            0.005810
chevrolet         0.005722
chrysler          0.003560
dacia             0.002686
daihatsu          0.002490
jeep              0.002315
land_rover        0.002140
subaru            0.002097
saab              0.001660
jaguar            0.001529
daewoo            0.001485
rover             0.001332
trabant           0.001289
lancia            0.001048
lada              0.000590
Name: brand, dtype: float64

Let's look at the top 20 most commonly listed car brands and find the mean price for each brand.

In [35]:
# selecting the 20 most commonly listed brands
brands_list = clean_autos['brand'].value_counts().index[:20]
print(brands_list)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat', 'skoda', 'nissan', 'mazda', 'smart',
       'citroen', 'toyota', 'hyundai', 'sonstige_autos', 'volvo', 'mini'],
      dtype='object')

In the cell below, we create an empty dictionary and loop through the brands_list referenced above. For each brand, we calculate the mean price from the clean_autos dataframe, then add the brand and its mean price to the brand_price dictionary.

In [ ]:
brand_price = {}

for brand in brands_list:
    mean = int(clean_autos.loc[clean_autos['brand'] == brand, 'price'].mean())
    brand_price[brand] = mean

brand_price

For reference, the mean price across all of the listings in clean_autos is about 6092. The most commonly listed brand, Volkswagen, is cheaper on average than other cars in our data, while the second most commonly listed brand, BMW, is more expensive on average.
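
That reference figure comes straight from the cleaned data:

In [ ]:
# overall mean price, for comparison with the per-brand means
int(clean_autos['price'].mean())  # 6092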

Next, let's use the same process to calculate the average mileage (from the odometer_km column) across the selected brands.

In [ ]:
brand_km = {}

for brand in brands_list:
    mean = int(clean_autos.loc[clean_autos['brand'] == brand, 'odometer_km'].mean())
    brand_km[brand] = mean

brand_km

In the cell below, we'll convert the two dictionaries brand_price and brand_km to Series objects, then combine them into a single DataFrame that shares a common index (the brand).

In [ ]:
bmp_series = pd.Series(brand_price)
bkm_series = pd.Series(brand_km)

avg_df = pd.DataFrame(bmp_series, columns=['mean_price'])
avg_df['mean_km'] = bkm_series
avg_df
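
As an aside, the same table can be built without explicit loops using groupby; this is a sketch, and the named-aggregation syntax shown assumes pandas 0.25 or newer:

In [ ]:
# equivalent aggregation over the top-20 brands in one pass
top_20 = clean_autos[clean_autos['brand'].isin(brands_list)]
avg_df_alt = (top_20.groupby('brand')
                    .agg(mean_price=('price', 'mean'),
                         mean_km=('odometer_km', 'mean'))
                    .astype(int))
avg_df_alt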

The assumption is that the higher the mileage on a car, the cheaper the car becomes. However, looking at the dataframe, there is no clear relationship between mileage and price across brands. This may indicate that brand is a bigger factor affecting price than mileage. We should note, however, that as we saw before, the data in odometer_km is not exact, and the listed mileage on the eBay ads is likely limited to a preselected range.
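
One rough way to quantify "no clear relationship" is the correlation between the per-brand means; with only 20 brands this is just a sanity check, not a rigorous test:

In [ ]:
# Pearson correlation between mean price and mean mileage across brands
avg_df['mean_price'].corr(avg_df['mean_km'])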

Conclusion

The goal of this guided project was to demonstrate our ability to clean data by recognizing anomalies, organize DataFrame and Series objects, and alter data types using the functionality of pandas and NumPy. It is important not to declare the presence of a particular trend without verifying it through statistical analysis.