PayDirt

Analysing Used Car Sales on E-bay Germany

Index

1

Introduction

image.png

eBay Kleinanzeigen is a classifieds website of ebay Germany. Just like any other classifieds website, the site allows users to post products and services which other users maybe interested in. The site has specific sections for different product including a services section for cars, boats and bikes. The section is quite broad and its listings include:

  • Vehicles being offered at a particular price
  • Vehicles being requested at a particular price
  • Vehicles being bartered for other items
  • Vehicle parts
  • Vehicle scraps
  • Servicing offered for vehicle

The goal of this project will be to:

  • Clean up the dataset
  • Analyze the dataset and glean some useful insigts from the same.

It must be noted as mentioned above that the listings are quite broad. Our focus in this project will only be on listings associated to four-wheeled vehicles and not on parts or services associated to vehicles.

2

Reading the Data

Dataquest has provided a dataset containing 50,000 listings for analysis. The listings were randomly sampled and contains erroneous data. The data dictionary associated to the data set has been provided below.

  • dateCrawled - When this ad was first crawled. All field-values are taken from this date.
  • name - Name of the car.
  • seller - Whether the seller is private or a dealer.
  • offerType - The type of listing
  • price - The price on the ad to sell the car.
  • abtest - Whether the listing is included in an A/B test.
  • vehicleType - The vehicle Type.
  • yearOfRegistration - The year in which the car was first registered.
  • gearbox - The transmission type.
  • powerPS - The power of the car in PS.
  • model - The car model name.
  • kilometer - How many kilometers the car has driven.
  • monthOfRegistration - The month in which the car was first registered.
  • fuelType - What type of fuel the car uses.
  • brand - The brand of the car.
  • notRepairedDamage - If the car has a damage which is not yet repaired.
  • dateCreated - The date on which the eBay listing was created.
  • nrOfPictures - The number of pictures in the ad.
  • postalCode - The postal code for the location of the vehicle.
  • lastSeenOnline - When the crawler saw this ad last online.

Lets read the dataset below.

In [1]:
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt
In [2]:
import pandas as pd
In [3]:
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
autos.head(10)
Out[3]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
5 2016-03-21 13:47:45 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... privat Angebot $7,900 test bus 2006 automatik 150 voyager 150,000km 4 diesel chrysler NaN 2016-03-21 00:00:00 0 22962 2016-04-06 09:45:21
6 2016-03-20 17:55:21 VW_Golf_III_GT_Special_Electronic_Green_Metall... privat Angebot $300 test limousine 1995 manuell 90 golf 150,000km 8 benzin volkswagen NaN 2016-03-20 00:00:00 0 31535 2016-03-23 02:48:59
7 2016-03-16 18:55:19 Golf_IV_1.9_TDI_90PS privat Angebot $1,990 control limousine 1998 manuell 90 golf 150,000km 12 diesel volkswagen nein 2016-03-16 00:00:00 0 53474 2016-04-07 03:17:32
8 2016-03-22 16:51:34 Seat_Arosa privat Angebot $250 test NaN 2000 manuell 0 arosa 150,000km 10 NaN seat nein 2016-03-22 00:00:00 0 7426 2016-03-26 18:18:10
9 2016-03-16 13:47:02 Renault_Megane_Scenic_1.6e_RT_Klimaanlage privat Angebot $590 control bus 1997 manuell 90 megane 150,000km 7 benzin renault nein 2016-03-16 00:00:00 0 15749 2016-04-06 10:46:35
In [4]:
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

The data is clearly from a German audience. A large number of columns have string type data. There are a number of columns with missing data like vehicleType, gearbox etc.

In order to make analysis easier we will need to start by fixing the name of the columns. To begin with, the name of certain columns will need to be changed.

In [5]:
#Rename certain columns with relevant names
autos.rename(columns = {"yearOfRegistration":"registration_year",
                        "monthOfRegistration":"registration_month",
                        "notRepairedDamage":"unrepaired_damage",
                        "dateCreated":"ad_created"},
             inplace=True)

Now that we have renamed the columns, we can proceed to make the column name to snake case to make them consistent to Python standards.

In [6]:
def camel_to_snake(a_column_name):
    """
    Convert column name from camel to snake case
    
    Arg:
    a_column_name(str): Column name to be converted from camel to snake case
    
    Ret:
    a_column_name(str): Column name in snake case
    """
    pos = 0
    for letter in a_column_name:
        if letter.isupper() == True:
            pos = a_column_name.index(letter)
            string1 = a_column_name[:pos]
            string2 = a_column_name[pos:]
            a_column_name = string1+"_"+string2
    return (a_column_name.lower())
In [7]:
def bold_print(a_string,a_value=None):
    """
    Boldens the output
    
    Args:
        a_string (string): String to be bolded
    """
    print("\033[1m"+a_string+"\033[0m"+'\n')
In [8]:
autos_columns_fixed = []
for a_column in autos.columns:
    autos_columns_fixed.append(camel_to_snake(a_column))
autos.columns = autos_columns_fixed
bold_print("Cleaned column names")
autos.columns
Cleaned column names

Out[8]:
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_p_s', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

Now that we have clean column names it will be easier to refer to the columns in the dataset. Lets continue with cleaning the dataset further below.

3

That doesn't look right....

Issues in the data

Lets assess the nature of the data that has been read below.

In [9]:
autos.describe(include = 'all')
Out[9]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_p_s model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-04-02 15:49:30 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

At the outset the following issues come to light:

  • The price column is meant to be float, however it is currently string. Also each price value is preceded by a \$ symbol which must be removed.
  • The odometer column also registers as a string instead of integer. In addition, each odometer reading is suffixed by km which must be removed.
  • The columns seller, offer_type and nr_of_pictures have exactly the same data in every line which makes it irrelevant for analysis and thus can be considered for removal.

Lets begin by cleaning up the price and odometer columns.

In [10]:
autos['price'] = autos['price'].str.strip("$").str.replace(',','').astype('float')
autos['odometer'] = autos['odometer'].str.strip("km").str.replace(',','').astype('int')
autos.rename(columns = {'odometer':'odometer_km'},inplace=True)

The price and odometer columns have been cleaned by removing the special characters like $, km and , from the data.

Columns with the same data repeated multiple times will be dropped from the dataframe.

In [11]:
autos = autos.drop(["seller", "offer_type", "nr_of_pictures"], axis=1)
autos[:5]
Out[11]:
date_crawled name price abtest vehicle_type registration_year gearbox power_p_s model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 5000.0 control bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 2016-03-26 00:00:00 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 8500.0 control limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 2016-04-04 00:00:00 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 8990.0 test limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 2016-03-26 00:00:00 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4350.0 control kleinwagen 2007 automatik 71 fortwo 70000 6 benzin smart nein 2016-03-12 00:00:00 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 1350.0 test kombi 2003 manuell 0 focus 150000 7 benzin ford nein 2016-04-01 00:00:00 39218 2016-04-01 14:38:50

4

Seek and Ye shall find....

Analysing the data further

Considering that the columns seem clean a thorough analysis of each column will be required to find out whether the data itself has issues. A function has been created below to assess a given column by the column name.

In [12]:
def column_details(the_dataframe, a_column):
    """
    Provides more indepth detail of each column
    
    Args:
    the_dataframe(dataframe): Dataset
    a_column(string): Name of the column for which details are sought after 
    """
    
    #Prints the number of unique values in a given column
    print("Number of unique values in the {0} column:".format(a_column), the_dataframe[a_column].unique().shape[0])
    print("---------------------")
    
    #Prints data associated to the describe() function for a given column
    if the_dataframe[a_column].dtype == "object":
        print("Statistical data related to the {0} column:\n".format(a_column),the_dataframe[a_column].describe())
    else:
        print("Statistical data related to the {0} column:\n".format(a_column),the_dataframe[a_column].describe().map('{:,.2f}'.format))
    print("---------------------")
    
    #Prints the number of occurences of each unique_value
    print("Unique counts for each entry in the {0} column:\n".format(a_column))
    print(the_dataframe[a_column].value_counts().sort_index(ascending=True))

Assigning a column name in the cell below gives a statistical view of the data in the given column.

In [13]:
#Details related to a column:
column_name = "price"
column_details(autos,column_name)
Number of unique values in the price column: 2357
---------------------
Statistical data related to the price column:
 count        50,000.00
mean          9,840.04
std         481,104.38
min               0.00
25%           1,100.00
50%           2,950.00
75%           7,200.00
max      99,999,999.00
Name: price, dtype: object
---------------------
Unique counts for each entry in the price column:

0.0           1421
1.0            156
2.0              3
3.0              1
5.0              2
              ... 
10000000.0       1
11111111.0       2
12345678.0       3
27322222.0       1
99999999.0       1
Name: price, Length: 2357, dtype: int64

After an in-depth analysis of every column, the following issues have been realised:

  • name: is quite revealing. It shows that many of the items listed are not exactly vehicles. They can be advertisements for vehicle related services, parts of specific vehicles, exchange offers for vehicles, vehicles sold for scrap etc.
  • price: is a reflection of the above column. There are prices in the range of $0.00-99.00 which is not usually the price of a vehicle. But these prices were against listings that were set up for exchange offers or toy models of a vehicle. They are not the actual asking price of the vehicle.
  • vehicle_type: had missing values in many cases. This was especially true for the cases mentioned above.
  • registration_year: is a mandatory field. However some of the values in it ranged from 1800-1950 and 2017-5000. This would have to be considered erroneous as the data was scraped in 2016 and as we all know cars were not available in the 1800s
  • registration_month: values ranged from 0-12. The listing with 0 against them would have to be considered as erroneous data
  • brand and model: had missing values. The issue that arose because of this will be revealed later

5

What is a Vehicle?

Defining a Vehicle

Before going forward a definition is required for a vehicle listing as our goal is to analyse the same. After a detailed analysis, the following is a definition for the listing of a vehicle for sale on the classifieds:

  • It must have a price above $0.00 otherwise it was not meant for a sale

  • It must have a valid registration year and month. Most car buyers look out for that information and most sellers make sure to put that up to ensure the car gets sold

  • The year of registration must be between 1949-2018. One could argue that there were cars before 1949 but the assumptions is that those would be mostly collector's items. Also the number of cars below this year range are less.

It must be noted that cars with registration years 2017 and 2018 have been considered even though the dataset is for 2016. The assumption is that this data has been manually inserted. It is doubtful the system itself would have given the user the ability to set these years. Moreover there are about 800 vehicles with these years and removing them all could affect analysis considering other eliminations that are being carried out. Lastly the vehicles with these years have all their other data in order and a quick search using their names on Google reveals that they are vehicles manufactured before 2016.

Finally, while we could also stipulate that the listing must have a vehicle_type it has been noticed that there are vehicle listings without vehicle types that have all their other information in order. For now this can be left out.

In [14]:
cleaned_autos = autos[autos["price"]!=0]
cleaned_autos = cleaned_autos[cleaned_autos["registration_year"].between(1949,2018)]
cleaned_autos = cleaned_autos[cleaned_autos["registration_month"]>0]
cleaned_autos[:5]
Out[14]:
date_crawled name price abtest vehicle_type registration_year gearbox power_p_s model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 5000.0 control bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 2016-03-26 00:00:00 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 8500.0 control limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 2016-04-04 00:00:00 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 8990.0 test limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 2016-03-26 00:00:00 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4350.0 control kleinwagen 2007 automatik 71 fortwo 70000 6 benzin smart nein 2016-03-12 00:00:00 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 1350.0 test kombi 2003 manuell 0 focus 150000 7 benzin ford nein 2016-04-01 00:00:00 39218 2016-04-01 14:38:50

Based on the above filtering we should now have a set of vehicle listing.

6

Hallo to Hello

Translating values from German to English

The data values in some of the columns are in German. In order to speed-up analysis, it would be best to translate these values from German to English. The columns identified for the same are

  • vehicle_type
  • gearbox
  • fuel_type
  • unrepaired_damage
In [15]:
#Translate column vehicle_type
cleaned_autos["vehicle_type"] = cleaned_autos["vehicle_type"].map({"andere":"others", "bus":"bus", 
                                                                   "cabrios":"convertible", "coupe":"coupe", 
                                                                   "kleinewagen":"small_car", "kombi":"combi", 
                                                                   "limousine":"limousine", "suv":"suv"}, 
                                                                  na_action='ignore')
#Translate column gearbox
cleaned_autos["gearbox"] = cleaned_autos["gearbox"].map({"manuell":"manual", "automatik":"automatic"}, na_action = 'ignore')

#Translate column fuel_type
cleaned_autos["fuel_type"] = cleaned_autos["fuel_type"].map({"andere":"others", "benzin":"gasoline", 
                                                             "cng":"cng", "diesel":"diesel", 
                                                             "elektro":"electric", "hybrid":"hybrid", "lpg":"lpg"}, 
                                                            na_action='ignore')

#Translate column unrepaired_damage
cleaned_autos["unrepaired_damage"] = cleaned_autos["unrepaired_damage"].map({"ja":"yes", "nein":"no"}, na_action = 'ignore')
cleaned_autos[:10]
Out[15]:
date_crawled name price abtest vehicle_type registration_year gearbox power_p_s model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 5000.0 control bus 2004 manual 158 andere 150000 3 lpg peugeot no 2016-03-26 00:00:00 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 8500.0 control limousine 1997 automatic 286 7er 150000 6 gasoline bmw no 2016-04-04 00:00:00 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 8990.0 test limousine 2009 manual 102 golf 70000 7 gasoline volkswagen no 2016-03-26 00:00:00 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4350.0 control NaN 2007 automatic 71 fortwo 70000 6 gasoline smart no 2016-03-12 00:00:00 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 1350.0 test combi 2003 manual 0 focus 150000 7 gasoline ford no 2016-04-01 00:00:00 39218 2016-04-01 14:38:50
5 2016-03-21 13:47:45 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... 7900.0 test bus 2006 automatic 150 voyager 150000 4 diesel chrysler NaN 2016-03-21 00:00:00 22962 2016-04-06 09:45:21
6 2016-03-20 17:55:21 VW_Golf_III_GT_Special_Electronic_Green_Metall... 300.0 test limousine 1995 manual 90 golf 150000 8 gasoline volkswagen NaN 2016-03-20 00:00:00 31535 2016-03-23 02:48:59
7 2016-03-16 18:55:19 Golf_IV_1.9_TDI_90PS 1990.0 control limousine 1998 manual 90 golf 150000 12 diesel volkswagen no 2016-03-16 00:00:00 53474 2016-04-07 03:17:32
8 2016-03-22 16:51:34 Seat_Arosa 250.0 test NaN 2000 manual 0 arosa 150000 10 NaN seat no 2016-03-22 00:00:00 7426 2016-03-26 18:18:10
9 2016-03-16 13:47:02 Renault_Megane_Scenic_1.6e_RT_Klimaanlage 590.0 control bus 1997 manual 90 megane 150000 7 gasoline renault no 2016-03-16 00:00:00 15749 2016-04-06 10:46:35

7

The Price is Right....Wrong!

Identifying a Price range

Earlier while analyzing the price column it was noted that there were unusual prices for some listings. Some of the prices were as low as \$1.00 and the most was \\$99,999,999 which seems to be unusally high. The prices require further analysis.

In [16]:
#Details related to a column:
column_name = "price"
column_details(cleaned_autos,column_name)
Number of unique values in the price column: 2316
---------------------
Statistical data related to the price column:
 count        44,074.00
mean         10,374.52
std         506,282.97
min               1.00
25%           1,399.00
50%           3,400.00
75%           7,900.00
max      99,999,999.00
Name: price, dtype: object
---------------------
Unique counts for each entry in the price column:

1.0           88
2.0            1
3.0            1
5.0            1
9.0            1
              ..
10000000.0     1
11111111.0     1
12345678.0     2
27322222.0     1
99999999.0     1
Name: price, Length: 2316, dtype: int64

Since there are a large number of values it would be better to divide the values in to bins.

In [17]:
bold_print("1000 bins created from the Price column")
cleaned_autos['price'].value_counts(bins = 1000).sort_index()
1000 bins created from the Price column

Out[17]:
(-99998.99900000001, 100000.998]    44024
(100000.998, 200000.996]               31
(200000.996, 300000.994]                5
(300000.994, 400000.992]                2
(400000.992, 500000.99]                 0
                                    ...  
(99499999.01, 99599999.008]             0
(99599999.008, 99699999.006]            0
(99699999.006, 99799999.004]            0
(99799999.004, 99899999.002]            0
(99899999.002, 99999999.0]              1
Name: price, Length: 1000, dtype: int64

Based on the output above, it is clear that the maximum value for the price of a car that must be considered is 200000 because there are atleast 31 cars that fall in the price range 100,000-200,000. Given that the total number of listings is 44074, if we consider only the first 2 bins we would only be losing out on 19 (44074-44055) listings which is .04% of the total listings.

While analysing the dataset for the lowest value of the price range it was noticed that there were many listings for services that offered car shredding. These listings include the words ausschlachten or schlachten (literally means butchering). Such listings need to be removed before considering a lower range as they have very low price values.

In [18]:
bold_print("Price listings with *schlachten* in the listing names and the number of occurences")
cleaned_autos[((cleaned_autos['name'].str.contains('schlachten')) == True)]["price"].value_counts()
Price listings with *schlachten* in the listing names and the number of occurences

Out[18]:
200.0        11
100.0        11
350.0         8
150.0         8
250.0         7
300.0         6
800.0         4
450.0         3
600.0         3
500.0         3
120.0         3
700.0         2
1200.0        2
1.0           2
650.0         2
1000.0        2
130.0         2
400.0         2
370.0         1
890.0         1
248.0         1
249.0         1
850.0         1
1250.0        1
60.0          1
220.0         1
2500.0        1
470.0         1
75.0          1
385.0         1
50.0          1
175.0         1
290.0         1
1234566.0     1
1100.0        1
66.0          1
180.0         1
299.0         1
1950.0        1
Name: price, dtype: int64
In [19]:
bold_print("No. of listings with *schlachten*")
cleaned_autos[((cleaned_autos['name'].str.contains('schlachten')) == True)].shape[0]
No. of listings with *schlachten*

Out[19]:
102
In [20]:
#Remove listing with the word *schlachten*
cleaned_autos = cleaned_autos[((cleaned_autos['name'].str.contains('schlachten')) == False)]
In [21]:
#Price column after removing 'schlacten' listings
column_name = "price"
column_details(cleaned_autos,column_name)
Number of unique values in the price column: 2312
---------------------
Statistical data related to the price column:
 count        43,972.00
mean         10,369.62
std         506,836.00
min               1.00
25%           1,400.00
50%           3,400.00
75%           7,900.00
max      99,999,999.00
Name: price, dtype: object
---------------------
Unique counts for each entry in the price column:

1.0           86
2.0            1
3.0            1
5.0            1
9.0            1
              ..
10000000.0     1
11111111.0     1
12345678.0     2
27322222.0     1
99999999.0     1
Name: price, Length: 2312, dtype: int64

After evaluating the dataset a minimum value of \$200 seems to be the best option as there are many vehicles above that price. It must also be noted that **anything between \\$1-\$200 does not seem to be a valid price for a car and are therefore not being considered.**

Based on the above understanding a valid price range for a vehicle has been identified as \$200-\\$200,000. Any listing not within this range will be removed.

In [22]:
cleaned_autos = cleaned_autos[cleaned_autos["price"].between(200,200000)]
column_details( cleaned_autos, "price")
Number of unique values in the price column: 2241
---------------------
Statistical data related to the price column:
 count     43,433.00
mean       6,275.62
std        8,570.83
min          200.00
25%        1,450.00
50%        3,499.00
75%        7,990.00
max      198,000.00
Name: price, dtype: object
---------------------
Unique counts for each entry in the price column:

200.0       156
215.0         1
217.0         1
219.0         1
220.0        23
           ... 
180000.0      1
190000.0      1
194000.0      1
197000.0      1
198000.0      1
Name: price, Length: 2241, dtype: int64

8

Lets Talk Money

Gleaning insights from the cleaned dataset

Now that we have a cleaner dataset let us look in to analysing the same for some insights in to the data.

Question:How do German manufactured, most likely second hand vehicles, compare to each other when it comes to price?

In [23]:
#Identify all the brands in the dataset
cleaned_autos["brand"].value_counts()
Out[23]:
volkswagen        9109
bmw               4860
opel              4493
mercedes_benz     4335
audi              3832
ford              2940
renault           2026
peugeot           1290
fiat              1086
seat               821
skoda              735
nissan             679
mazda              643
smart              640
citroen            611
toyota             577
hyundai            444
mini               404
volvo              403
sonstige_autos     388
honda              349
mitsubishi         348
kia                331
alfa_romeo         287
porsche            267
suzuki             254
chevrolet          250
chrysler           151
dacia              121
jeep               102
daihatsu           101
land_rover          93
subaru              84
saab                74
jaguar              68
daewoo              67
rover               58
lancia              46
trabant             42
lada                24
Name: brand, dtype: int64

Of the identified brands above, a quick search reveals that the following are German brands:

  • Volkswagen
  • BMW
  • Opel
  • Mercedes_benz
  • Audi
  • Porsche

Using the price information we should be able to identify how the german car brands rank against each other.

In [24]:
import seaborn as sns
german_brands = ["volkswagen", "bmw", "opel", 
                 "mercedes_benz", "audi", "porsche"]
brand_mean_price = {}
for a_brand in german_brands:
    brand_mean_price[a_brand] = round(cleaned_autos[cleaned_autos["brand"] == a_brand]["price"].mean(),2)

#Creating a dataframe to plot a graph
mean_price_df = pd.Series(brand_mean_price).to_frame().reset_index()
mean_price_df.rename(columns = {"index":"brands", 0:"avg_price"},inplace=True)
sorted_price=mean_price_df.sort_values("avg_price",ascending=False)
In [25]:
def gen_barplot(the_data,x_axis,y_axis,title):
    """
    Generate a barplot with a fixed style 
    
    Args:
    x_axis (string): Column name of joined_data for which values are to be plotted 
    y_axis (string): Column name of joined_data for which values are to be plotted
    title (string): Name of the plot
    """
    fig = matplotlib.pyplot.figure(figsize=(7,7))
    plt = sns.barplot(data = the_data, x = x_axis, y = y_axis)
    sns.despine(left = True, top=False, bottom=True)
    plt.set_title(title, size = 17)
    plt.set_xlabel(None)
    plt.set_ylabel(None)
    plt.tick_params(left=False, labelsize=13)
    plt.xaxis.tick_top()
In [26]:
gen_barplot(sorted_price,'avg_price','brands',"Price($) Comparison of German Car Brands")

Insights

  • As seen Porche is the most expensive brand and clearly an outlier considering that its listing count is the least in comparison to other German brands
  • The second most expensive would be the Audi averaging at \$9715.78
  • More affordable German brands include Opel and Volkswagen

Question: An earlier insight revealed the brands and their average prices. However, these prices do not take in to account the distance covered by these vehicles. The more distance a vehicle has covered, the less its asking price (vice-versa may not be true as the age of the car also causes it to depreciate in value). Does this stand?

In [27]:
brand_mean_mileage = {}
for a_brand in german_brands:
    brand_mean_mileage[a_brand] = round(cleaned_autos[cleaned_autos["brand"] == a_brand]["odometer_km"].mean(),2)

mean_mileage_df = pd.Series(brand_mean_mileage).to_frame().reset_index()
mean_mileage_df.columns = ["brands","avg_mileage"]
sorted_mileage=mean_mileage_df.sort_values('avg_mileage',ascending=False)
In [28]:
gen_barplot(sorted_mileage,'avg_mileage','brands',"Mileage(km) Comparison of German Car Brands")

Insight

Brand value has preference over the distance covered by a vehicle.

A good example would be a comparison between the Audi and Opel. Both brands have almost the same average distance. But the mean price of the Audi is three times that of the Opel brand.

Question: What are the most common brand-model pairs?

In [29]:
common_brand_model = {}

for a_brand in cleaned_autos["brand"].unique():
    #List all models of a brand
    most_common_model=cleaned_autos[cleaned_autos["brand"] == a_brand]["model"].value_counts().sort_index(ascending = False)
    #Select the model with the most count
    if most_common_model.empty != True:
        common_brand_model[a_brand] = most_common_model.index[0]

brand_model = pd.Series(common_brand_model).to_frame().reset_index()
brand_model.columns = ["brand","top_model"]
bold_print("Brands and their Best models")
brand_model
Brands and their Best models

Out[29]:
brand top_model
0 peugeot andere
1 bmw z_reihe
2 volkswagen up
3 smart roadster
4 ford transit
5 chrysler voyager
6 seat toledo
7 renault twingo
8 audi tt
9 opel zafira
10 mazda rx_reihe
11 porsche cayenne
12 mini one
13 mercedes_benz vito
14 toyota yaris
15 dacia sandero
16 nissan x_trail
17 jeep wrangler
18 saab andere
19 volvo xc_reihe
20 mitsubishi pajero
21 jaguar x_type
22 fiat stilo
23 skoda yeti
24 subaru legacy
25 kia sportage
26 citroen c5
27 chevrolet spark
28 hyundai tucson
29 honda jazz
30 daewoo nubira
31 suzuki swift
32 trabant andere
33 land_rover range_rover_sport
34 alfa_romeo spider
35 daihatsu terios
36 rover rangerover
37 lancia ypsilon
38 lada samara

Insights

  • There are multiple brands for which models have not been listed. For these, the models have been listed as 'andere'(meaning others)
  • One of the brands not in this list is sonstige_autos which does not have a single model listed against it. I believe they might be resellers but I cannot be sure of it.

Question: How does damage affect the price of a vehicle?

In [30]:
repaired_mean_price = {}
non_repaired_mean_price = {}

for a_brand in cleaned_autos["brand"].unique():
    the_brand = cleaned_autos[cleaned_autos["brand"] == a_brand] # Select a brand
    
    #Select repaired and damaged vehicles of the brand
    its_repaired_vehicles = the_brand[the_brand["unrepaired_damage"] == "no"] 
    its_non_repaired_vehicles = the_brand[the_brand["unrepaired_damage"] == "yes"]
    
    #Calculate mean price of repaired and damaged vehicles.
    repaired_mean_price[a_brand] = round(its_repaired_vehicles["price"].mean(),2)
    non_repaired_mean_price[a_brand] = round(its_non_repaired_vehicles["price"].mean(),2)
    
repaired_mean_price_series = pd.Series(repaired_mean_price)
non_repaired_mean_price_series = pd.Series(non_repaired_mean_price)
In [31]:
#Tabulate the mean prices
repair_vs_non_repair = pd.DataFrame(repaired_mean_price_series, columns = ["repaired_mean_price"])
repair_vs_non_repair["non_repaired_mean_price"] = non_repaired_mean_price_series
repair_vs_non_repair["ratio"] = round(repair_vs_non_repair["repaired_mean_price"]/repair_vs_non_repair["non_repaired_mean_price"],2)
bold_print("Repaired vs. Damaged Vehicle Prices by Brand")
repair_vs_non_repair
Repaired vs. Damaged Vehicle Prices by Brand

Out[31]:
repaired_mean_price non_repaired_mean_price ratio
peugeot 3693.94 1568.28 2.36
bmw 9469.24 3848.39 2.46
volkswagen 6597.75 2426.36 2.72
smart 3924.23 1488.15 2.64
ford 4739.69 1587.60 2.99
chrysler 4013.97 2239.00 1.79
seat 5230.90 1854.59 2.82
renault 3151.62 1286.69 2.45
audi 10969.08 3541.31 3.10
sonstige_autos 15068.17 5916.76 2.55
opel 3727.36 1573.46 2.37
mazda 5195.75 1605.02 3.24
porsche 46256.49 14670.00 3.15
mini 11138.89 4595.00 2.42
mercedes_benz 9825.39 4126.84 2.38
toyota 5656.65 3630.95 1.56
dacia 6217.42 4459.62 1.39
nissan 5804.16 2112.18 2.75
jeep 12537.60 3074.88 4.08
saab 3942.26 657.00 6.00
volvo 5861.22 1669.21 3.51
mitsubishi 4274.43 1568.20 2.73
jaguar 14118.36 4993.62 2.83
fiat 3542.51 1405.57 2.52
skoda 7143.40 3735.67 1.91
subaru 5544.50 2856.29 1.94
kia 6971.32 2177.12 3.20
citroen 4293.60 2067.51 2.08
chevrolet 7425.04 3766.81 1.97
hyundai 6216.83 2964.89 2.10
honda 5035.65 1560.19 3.23
daewoo 1206.51 524.80 2.30
suzuki 4861.12 1726.00 2.82
trabant 3091.76 817.25 3.78
land_rover 21688.51 5223.75 4.15
alfa_romeo 4880.84 2118.50 2.30
daihatsu 1828.56 950.00 1.92
rover 1962.26 888.50 2.21
lancia 3106.69 4774.50 0.65
lada 3064.41 1400.00 2.19
In [32]:
#Reset the index and rename the original index column
repair_vs_non_repair = repair_vs_non_repair.reset_index()
repair_vs_non_repair.rename(columns = {'index':'brand'},inplace=True)
In [33]:
#Plot the tabulated mean prices
melted_df = repair_vs_non_repair[['brand','repaired_mean_price','non_repaired_mean_price']].melt(id_vars='brand',
                                                                                                 var_name = 'group',
                                                                                                 value_name = 'mean_price').sort_values('mean_price',ascending=False)
fig = plt.figure(figsize = (14,14))
sns.barplot(data = melted_df, x = 'mean_price', y = 'brand', hue = 'group')
sns.despine(left = True, top=False, bottom=True)
plt.title("Price($) of Repaired vs. Damaged Vehicles",size=18)
plt.tick_params(left=False, labelsize=13)
plt.xlabel(None)
plt.ylabel(None)
plt.legend(loc='center right',fontsize='x-large')
plt.show()

It is quite clear that repaired vehicles cost significantly more than non-repaired vehilcles. What is more revealing is that repaired vehicles of big brand names tend to cost significantly more than their damaged counterparts. e.g. Porche, Land Rover, Jaguar etc.

In [34]:
print("On average a repaired vehicle can be priced at {0} times its damaged counterpart".format(round(repair_vs_non_repair["ratio"].mean(),2)))
On average a repaired vehicle can be priced at 2.64 times its damaged counterpart

Insights

  • On average a repaired vehicle can be priced at 2.6 times its damaged counterpart
  • Repaired vehicles with great brand value tend to be priced signficantly more than their damaged counterparts.

9

Conclusion

A quick summary

eBay Kleinanzeigen has a lot of listings related to different vehicles including their servicing and parts. Through this project we were able to extract, clean and review a pre-sampled version of their unclean listing data related to cars from 2016.

Cleaning the data involved the removal of multiple rows and columns. Once the dataset was cleaned multiple insights were gained from the dataset.

One was that repaired cars tend to cost almost thrice their damaged counterparts. Another was that German vehicles that belong to big brand names tend to cost more despite the distance the vehicles have travelled. A third revelation was that big brand German vehicles tend to cost significantly more than their common used counterparts.