Exploring Ebay Car Sales Data Project

In this project, i will be working on a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.The dataset was originally scraped and uploaded to Kaggle. The version of the dataset we are working with is a sample of 50,000 data points that was prepared by Dataquest including simulating a less-cleaned version of the data.
The data dictionary provided with data is as follows:

Column label What is represents
dateCrawled When this ad was first crawled
name Name of the car
seller Whether the seller is private or a dealer
offerType The type of listing
price The price on the ad to sell the car
abtest Whether the listing is included in an A/B test
vehicleType The vehicle type
yearOfRegistration The year in which the car was first registered
gearbox The transmission type
powerPS The power of the car in PS
model The car model name
kilometer How many kilometers the car has driven
monthOfRegistration The month in which the car was first registered
fueltype What type of fuel the car uses
brand The brand of the car
notRepairedDamage If the car has a damage which is not yet repaired
dateCreated The date on which the ebay listing was created
nrOfPictures The number of pictures in the ad
postalCode The postal code for the location of the vehicle
lastSeenOnline When the crawler saw this ad last online

The aim of this project is to clean the data and analyze the included used car listings.

In [1]:
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv',encoding = 'Latin-1')
In [2]:
autos.info()
print('\n')
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB


Out[2]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

Of the 20 autos dataset columns 5 have null values.
All the columns are of strings type except in vehicleType, gearbox, model, fueltype, noRepairedDamage columns.

In [3]:
print(autos.columns)
print('\n')
new_columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'powerps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'no_of_pics', 'postal_code',
       'last_seen']
autos.columns = new_columns
print(autos.head())
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')


          date_crawled                                               name  \
0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD   
1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik   
2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United   
3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...   
4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...   

   seller offer_type   price   abtest vehicle_type  registration_year  \
0  privat    Angebot  $5,000  control          bus               2004   
1  privat    Angebot  $8,500  control    limousine               1997   
2  privat    Angebot  $8,990     test    limousine               2009   
3  privat    Angebot  $4,350  control   kleinwagen               2007   
4  privat    Angebot  $1,350     test        kombi               2003   

     gearbox  powerps   model   odometer  registration_month fuel_type  \
0    manuell      158  andere  150,000km                   3       lpg   
1  automatik      286     7er  150,000km                   6    benzin   
2    manuell      102    golf   70,000km                   7    benzin   
3  automatik       71  fortwo   70,000km                   6    benzin   
4    manuell        0   focus  150,000km                   7    benzin   

        brand unrepaired_damage           ad_created  no_of_pics  postal_code  \
0     peugeot              nein  2016-03-26 00:00:00           0        79588   
1         bmw              nein  2016-04-04 00:00:00           0        71034   
2  volkswagen              nein  2016-03-26 00:00:00           0        35394   
3       smart              nein  2016-03-12 00:00:00           0        33729   
4        ford              nein  2016-04-01 00:00:00           0        39218   

             last_seen  
0  2016-04-06 06:45:54  
1  2016-04-06 14:45:08  
2  2016-04-06 20:15:37  
3  2016-03-15 03:16:28  
4  2016-04-01 14:38:50  

I have changed the column labels from camelcase to snakecase to make it easier to deal with.

In [4]:
autos.describe(include = 'all')
Out[4]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox powerps model odometer registration_month fuel_type brand unrepaired_damage ad_created no_of_pics postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-22 09:51:06 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

The seller, no_of_pics, offer_type columns have or almost all values are the same.
I drop them below.

In [5]:
autos = autos.drop(['seller','no_of_pics','offer_type'], axis = 1)

The price, odometer columns are numeric values stored as text.
I removed the non-numeric characters and converted it to numeric dtype.
I rename the column to capture the details ommitted after cleaning.

In [6]:
autos['price'] = autos['price'].str.replace('$','').str.replace(',','').astype(int)
autos['odometer'] = autos['odometer'].str.replace(',','').str.replace('km','').astype(int)
print(autos['price'].head(3))
print('\n')
print(autos['odometer'].head(3))
0    5000
1    8500
2    8990
Name: price, dtype: int64


0    150000
1    150000
2     70000
Name: odometer, dtype: int64
In [7]:
autos.rename({'odometer':'odometer_km'},axis = 1, inplace = True)
autos.rename({'price':'price_usd'},axis = 1, inplace = True)
print(autos.columns)
Index(['date_crawled', 'name', 'price_usd', 'abtest', 'vehicle_type',
       'registration_year', 'gearbox', 'powerps', 'model', 'odometer_km',
       'registration_month', 'fuel_type', 'brand', 'unrepaired_damage',
       'ad_created', 'postal_code', 'last_seen'],
      dtype='object')

I continue exploring the data, looking for data that doesn't look right.
Analyze the columns using minimum and maximum values and look for any values that look unrealistically high or low that we might want to remove.

In [8]:
print(autos['odometer_km'].unique().shape)
print(autos['odometer_km'].describe())
print(autos['odometer_km'].value_counts().head(10))
(13,)
count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
Name: odometer_km, dtype: int64

We can see that the values in this field are rounded, which might indicate that sellers had to choose from pre-set options for this field.
Additionally, there are more high mileage than low mileage vehicles.

In [9]:
print(autos['price_usd'].unique().shape)
print(autos['price_usd'].describe())
print(autos['price_usd'].value_counts().head(10))
(2357,)
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_usd, dtype: float64
0       1421
500      781
1500     734
2500     643
1000     639
1200     639
600      531
800      498
3500     498
2000     460
Name: price_usd, dtype: int64

the prices in this column seem rounded, however given there are 2357 unique values in the column, that may just be people's tendency to round prices on the site.
here are 1,421 cars listed with $0 price - given that this is only 2% of the of the cars, we might consider removing these rows. The maximum price is one hundred million dollars, which seems a lot, let's look at the highest prices further.

In [10]:
print(autos['price_usd'].value_counts().sort_index(ascending = False).head(10))
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
Name: price_usd, dtype: int64
In [11]:
print(autos['price_usd'].value_counts().sort_index(ascending = True).head(20))
0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price_usd, dtype: int64

There are a number of listings with prices below \$30, including about 1,500 at \$0. There are also a small number of listings with very high values, including 14 at around or over $1 million.<br> Given that eBay is an auction site, there could legitimately be items where the opening bid is \$1. We will keep the \$1 items, but remove anything above \$350,000, since it seems that prices increase steadily to that number and then jump up to less realistic numbers.

In [12]:
autos = autos[autos['price_usd'].between(1,351000)]

There are five columns with date information:

  • date_crawled
  • registration_month
  • registration_year
  • ad_created
  • last seen
    I willl explore all of this columns to learn more about the listing.
In [13]:
print(autos[['date_crawled','ad_created','last_seen']][0:5])
          date_crawled           ad_created            last_seen
0  2016-03-26 17:47:46  2016-03-26 00:00:00  2016-04-06 06:45:54
1  2016-04-04 13:38:56  2016-04-04 00:00:00  2016-04-06 14:45:08
2  2016-03-26 18:57:24  2016-03-26 00:00:00  2016-04-06 20:15:37
3  2016-03-12 16:58:10  2016-03-12 00:00:00  2016-03-15 03:16:28
4  2016-04-01 14:38:50  2016-04-01 00:00:00  2016-04-01 14:38:50
In [14]:
autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False).sort_index()
Out[14]:
2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64

Looks like the site was crawled daily over roughly a one month period in March and April 2016. The distribution of listings crawled on each day is roughly uniform.

In [15]:
autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).sort_index()
Out[15]:
2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-22    0.000021
2016-01-27    0.000062
2016-01-29    0.000021
2016-02-01    0.000021
2016-02-02    0.000041
2016-02-05    0.000041
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000041
2016-02-14    0.000041
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000041
2016-02-19    0.000062
2016-02-20    0.000041
2016-02-21    0.000062
                ...   
2016-03-09    0.033151
2016-03-10    0.031895
2016-03-11    0.032904
2016-03-12    0.036755
2016-03-13    0.017008
2016-03-14    0.035190
2016-03-15    0.034016
2016-03-16    0.030125
2016-03-17    0.031278
2016-03-18    0.013590
2016-03-19    0.033687
2016-03-20    0.037949
2016-03-21    0.037579
2016-03-22    0.032801
2016-03-23    0.032060
2016-03-24    0.029280
2016-03-25    0.031751
2016-03-26    0.032266
2016-03-27    0.030989
2016-03-28    0.034984
2016-03-29    0.034037
2016-03-30    0.033501
2016-03-31    0.031875
2016-04-01    0.033687
2016-04-02    0.035149
2016-04-03    0.038855
2016-04-04    0.036858
2016-04-05    0.011819
2016-04-06    0.003253
2016-04-07    0.001256
Name: ad_created, Length: 76, dtype: float64

There is a large variety of ad created dates. Most fall within 1-2 months of the listing date, but a few are quite old, with the oldest at around 9 months.

In [16]:
autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).sort_index()
Out[16]:
2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64

The crawler recorded the date it last saw any listing.The last three days contain a disproportionate amount of 'last seen' values. Given that these are 6-10x the values from the previous days, it's unlikely that there was a massive spike in sales, and more likely that these values are to do with the crawling period ending and don't indicate car sales.

In [17]:
autos['registration_year'].describe()
Out[17]:
count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

Looking at this column, there exists some odd values. The minimum value represents a year motorvehicles had no yet been invented, and the maximum is many years into the future.
Since it is not possible to register a car before the listing is seen, any vehicle with a registration year above 2016 is definitely inaccurate.
Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.

In [18]:
(~autos['registration_year'].between(1900,2016)).sum() / autos.shape[0]
Out[18]:
0.038793369710697

The number of listings with cars that fall outside the 1900 - 2016is below 4% hece we drop them below.

In [19]:
autos = autos[autos['registration_year'].between(1900,2016)]
autos['registration_year'].value_counts(normalize =True).head(15)
Out[19]:
2000    0.067608
2005    0.062895
1999    0.062060
2004    0.057904
2003    0.057818
2006    0.057197
2001    0.056468
2002    0.053255
1998    0.050620
2007    0.048778
2008    0.047450
2009    0.044665
1997    0.041794
2011    0.034768
2010    0.034040
Name: registration_year, dtype: float64

It seems like most of the vehicles were first registered in the past 20 years.

In [20]:
autos['brand'].value_counts(normalize = True)
Out[20]:
volkswagen        0.211264
bmw               0.110045
opel              0.107581
mercedes_benz     0.096463
audi              0.086566
ford              0.069900
renault           0.047150
peugeot           0.029841
fiat              0.025642
seat              0.018273
skoda             0.016409
nissan            0.015274
mazda             0.015188
smart             0.014160
citroen           0.014010
toyota            0.012703
hyundai           0.010025
sonstige_autos    0.009811
volvo             0.009147
mini              0.008762
mitsubishi        0.008226
honda             0.007840
kia               0.007069
alfa_romeo        0.006641
porsche           0.006127
suzuki            0.005934
chevrolet         0.005698
chrysler          0.003513
dacia             0.002635
daihatsu          0.002506
jeep              0.002271
subaru            0.002142
land_rover        0.002099
saab              0.001649
jaguar            0.001564
daewoo            0.001500
trabant           0.001392
rover             0.001328
lancia            0.001071
lada              0.000578
Name: brand, dtype: float64

German manufacturers represent four out of the top five brands, almost 50% of the overall listings. Volkswagen is by far the most popular brand, with approximately double the cars for sale of the next two brands combined.

In [21]:
brand_names = autos['brand'].value_counts(normalize = True)
pop_brand_names = brand_names[brand_names > 0.05].index
print(pop_brand_names)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')

There are lots of brands that don't have a significant percentage of listings, so we will limit our analysis to brands representing more than 5% of total listings.

In [22]:
brand_mean_prices = {}

for brand in pop_brand_names:
    brand_sel = autos[autos["brand"] == brand]
    mean_price = brand_sel["price_usd"].mean()
    brand_mean_prices[brand] = int(mean_price)
    
brand_mean_prices



        
Out[22]:
{'audi': 9336,
 'bmw': 8332,
 'ford': 3749,
 'mercedes_benz': 8628,
 'opel': 2975,
 'volkswagen': 5402}

Of the top 5 brands, there is a distinct price gap:

  • Audi, BMW and Mercedes Benz are more expensive
  • Ford and Opel are less expensive
  • Volkswagen is in between - this may explain its - popularity, it may be a 'best of 'both worlds' option.
In [23]:
brand_mean_milage = {}
for brand in pop_brand_names:
    brand_sel = autos[autos['brand'] == brand]
    mean = brand_sel['odometer_km'].mean()
    brand_mean_milage[brand] = int(mean)
    
brand_mean_milage
Out[23]:
{'audi': 129157,
 'bmw': 132572,
 'ford': 124266,
 'mercedes_benz': 130788,
 'opel': 129310,
 'volkswagen': 128707}
In [24]:
bmp_series = pd.Series(brand_mean_prices).sort_values(ascending = False)
print(bmp_series)
audi             9336
mercedes_benz    8628
bmw              8332
volkswagen       5402
ford             3749
opel             2975
dtype: int64
In [25]:
bmm_series = pd.Series(brand_mean_milage).sort_values(ascending = False)
print(bmm_series)
bmw              132572
mercedes_benz    130788
opel             129310
audi             129157
volkswagen       128707
ford             124266
dtype: int64
In [26]:
df = pd.DataFrame(bmp_series, columns = ['mean_price'])

df
Out[26]:
mean_price
audi 9336
mercedes_benz 8628
bmw 8332
volkswagen 5402
ford 3749
opel 2975
In [27]:
df_2 = pd.DataFrame(bmm_series, columns = ['mean_milage'])
df_2
Out[27]:
mean_milage
bmw 132572
mercedes_benz 130788
opel 129310
audi 129157
volkswagen 128707
ford 124266
In [28]:
df_2['mean_price'] = bmp_series
df_2
Out[28]:
mean_milage mean_price
bmw 132572 8332
mercedes_benz 130788 8628
opel 129310 2975
audi 129157 9336
volkswagen 128707 5402
ford 124266 3749

There is a slight trend to the more expensive vehicles having higher mileage, with the less expensive vehicles having lower mileage.