eBay Kleinanzeigen is a classifieds section of the German eBay website. In this project, we will use the used cars dataset from eBay Kleinanzeigen to analyze the used car ads created in Germany between 11th June 2015 and 7th April 2016.
The original dataset, uploaded to Kaggle by user orgesleka, is no longer available on Kaggle, but it is still accessible on data.world.
In this project, we use a modified version of the original dataset, which was prepared by Dataquest (among other changes, the full dataset was sampled down to 50,000 data points).
The explanations for the columns of the autos.csv dataset are as follows:

- dateCrawled: The date when this ad was first crawled
- name: Name of the car
- seller: A private seller or a dealer
- offerType: The listing type
- price: The asking price for selling the car
- abtest: Whether the listing is included in an A/B test
- vehicleType: The vehicle type
- yearOfRegistration: The year in which the car was first registered
- gearbox: The transmission type
- powerPS: The power of the car in PS
- model: The car model name
- odometer: How many kilometres the car has driven
- monthOfRegistration: The month in which the car was first registered
- fuelType: What type of fuel the car uses
- brand: The brand of the car
- notRepairedDamage: Whether the car has damage which still needs to be repaired
- dateCreated: The eBay listing creation date
- nrOfPictures: The number of pictures in the ad
- postalCode: The postal code for the location of the vehicle
- lastSeen: When the crawler saw this ad last online

The content of the dataset is in German, as the data was originally scraped from the German eBay website.
The goal of this project is to clean the data and analyze trends in the used car listings.
We cleaned the dataset and observed that March and the beginning of April 2016 were the busiest time for ad creation. We also found a lack of correlation between mean price and mean mileage by brand. The distributions of date_crawled and last_seen are fairly consistent throughout the whole data sampling period.
We read the autos.csv file into pandas and assign it to the variable name autos. The file can be read by using encoding='Latin-1' instead of the default UTF-8 encoding.
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding='Latin-1')
Let's briefly examine the data:
autos
autos.info()
Here is what we observe: the autos dataset contains 50,000 rows and 20 columns.

We will transform the column names from camelcase to the preferred snakecase and modify some of the column names to make them more descriptive.
The original column names are as follows:
# Show the original column names
autos.columns
We copy the array of column names, edit it, and assign the modified column names back to autos.columns:
# Edit the column names
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
'last_seen']
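As an aside, a camelcase-to-snakecase conversion can also be automated with a regular expression rather than edited by hand. Here is a minimal sketch; the helper name to_snake_case is our own, and it would not reproduce the reworded names (e.g. unrepaired_damage or power_ps), so the explicit list above remains the authoritative mapping.

import re

def to_snake_case(name):
    # Insert an underscore before each uppercase letter, then lowercase everything
    # e.g. 'yearOfRegistration' -> 'year_of_registration'
    return re.sub(r'([A-Z])', r'_\1', name).lower()

# autos.columns = [to_snake_case(c) for c in autos.columns]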
To verify whether the column names have been modified, we call autos.head()
to examine the column names of the autos
dataframe. Alternatively, we can also verify the column names by running autos.columns
.
# Display the data of the first 5 rows
autos.head()
The output shows that the column names have been converted to snakecase and several names have been reworded.
Now, we will explore the data a little bit more to determine the necessary basic data cleaning tasks. We use autos.describe() with include='all' to examine the descriptive statistics for both categorical and numeric columns.
# Display the descriptive statistics
autos.describe(include='all')
Based on the output, we notice that the nr_of_pictures
column has only one value, which is 0
.
# Show the unique value
autos['nr_of_pictures'].unique()
Since a single-value column would not add additional value to our data analysis, we will remove nr_of_pictures
from our dataset.
# Remove `nr_of_pictures` from the dataset
autos.drop(columns=['nr_of_pictures'], inplace=True)
To verify whether the nr_of_pictures column has been removed, we did the following:

- autos.shape: The number of columns has been reduced from 20 to 19.
- autos.columns: The nr_of_pictures column has disappeared.

These results confirm that the nr_of_pictures column has been removed from the dataset.
# Show the number of rows and columns
autos.shape
# Show the column names
autos.columns
We also observe that only one out of the 50,000 entries in the seller column is gewerblich (commercial in English). The rest are privat (private in English). We can delete the gewerblich entry to focus on the entries from private sellers.
Similarly, the offer_type column shows 49,999 entries for Angebot (Offer in English) and only one entry for Gesuch (Request in English). Since we focus on Angebot (Offer) listings, we will delete the Gesuch entry to prevent it from skewing our data analysis.
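These frequencies can be confirmed directly with value_counts(); a quick check, equivalent to reading the top/freq rows of the describe() output above:

# Confirm the near-constant `seller` and `offer_type` columns
print(autos['seller'].value_counts())
print(autos['offer_type'].value_counts())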
First, we isolate the rows for the gewerblich (commercial) seller and the Gesuch (Request) offer_type:
# Isolate `commercial` seller
seller_commercial = autos[autos['seller'] == 'gewerblich']
seller_commercial
# Isolate `Request` offer_type
offer_type_request = autos[autos['offer_type'] == 'Gesuch']
offer_type_request
Next, we remove these two rows by using autos.drop
and the indexes of these rows:
# Remove the rows of `commercial` seller and `Request` offer_type from our dataset
autos.drop(index=[7738, 17541], inplace=True)
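Note that hard-coding the index labels 7738 and 17541 is somewhat fragile. An equivalent, mask-based alternative (a sketch, producing the same result on this dataset) would keep only the rows we want:

# Keep only private sellers with 'Angebot' listings
mask = (autos['seller'] == 'privat') & (autos['offer_type'] == 'Angebot')
autos_filtered = autos[mask]  # same 49,998 rows as after the drop above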
The result from autos.shape
shows that the number of rows has been reduced from 50,000 to 49,998. This indicates that two rows have been deleted from the dataset.
# Show the number of rows and columns
autos.shape
To examine whether the two deleted rows are indeed the ones we wanted to delete, we look up their original integer positions with iloc. The results show that different rows now occupy these positions, which confirms that the two rows we did not intend to keep have been deleted.
# Index 7738 was the original index for `commercial` seller row
autos.iloc[7738]
# Index 17541 was the original index for `Request` offer_type row
autos.iloc[17541]
We notice that the numeric data in the following columns is stored as string objects, and we need to convert it to integers for analysis purposes:

- price
- odometer
autos.info()
We build a string_to_integer function to convert columns whose numeric data is stored as strings to the integer datatype. It also removes the unit and the , (thousands separator) from the values.
# `column` is the name of the column; `unit` is the non-digit character to be removed from the value
def string_to_integer(column, unit):
    autos[column] = autos[column].str.replace(unit, '', regex=False).str.replace(',', '', regex=False).astype(int)
    return autos[column]
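For reference, the same cleanup could be done in a single step with a regular expression that strips every non-digit character; a sketch (equivalent for these two columns, since neither contains decimals):

# e.g. '$5,000' -> 5000 and '150,000km' -> 150000
# autos['price'] = autos['price'].str.replace(r'\D', '', regex=True).astype(int)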
Next, we use string_to_integer to convert price and odometer to integers, and verify the conversion by printing their datatypes.
# Convert the `price` from string object to integer
string_to_integer('price', '$')
# Convert the `odometer` from string object to integer
string_to_integer('odometer', 'km')
# Check their dtypes after conversion
# Alternatively, we can also use `autos.info()`
print('The dtype for:')
print(' - price: ', autos['price'].dtype)
print(' - odometer: ', autos['odometer'].dtype)
We also update the column names by including their corresponding units — an important piece of information to keep.
# Rename the columns for `price` and `odometer`
autos.rename({'price': 'price_$', 'odometer' : 'odometer_km'}, axis=1, inplace=True)
Next, we confirm that all the changes have been made to the dataset by examining the autos dataframe.
# Display the data
autos
After the initial data cleaning, we explore the data in more detail, specifically looking for illogical values. We start by analyzing the price_$ column.
# Display the descriptive statistics
autos['price_$'].describe()
The statistics above are displayed in scientific notation, which is not so friendly to read in the context of prices. Therefore, we will change the display format to floats with one decimal place and include a thousands separator (,) by setting the pd.options.display.float_format option.
# Change the format to show floats with 1 decimal place and include a thousand separator
pd.options.display.float_format = '{:,.1f}'.format
# Use the following to reset the format
# pd.reset_option('^display.', silent=True)
After adjusting the format, we reprint the descriptive statistics:
# Display the descriptive statistics with updated format
autos['price_$'].describe()
print('The number of unique values: ', autos['price_$'].unique().shape[0])
We notice that the minimum price of 0 for listed cars is unreasonable, as it is basically FREE!
Surprisingly, the frequency of 0 is very high (1,420 entries), so these do not seem to be entry errors.
# Examine the value counts of `price_$` and sort their index in a descending manner
autos['price_$'].value_counts().sort_index().head(10)
We extract the details of these entries and assign them to the variable min_price for further investigation.
min_price = autos[autos['price_$'] == 0]
min_price
When we analyze the value counts of registration_year for the cars in min_price, we find that some of the cars are very new; for example, 96 of them were registered in 2016. Thus, it is unreasonable to us that the asking price is only 0.
Interestingly, there are also 60 cars registered in 2017, which is illogical, as the data was crawled in 2016. Because of that, we will need to delete the entries with registration years after 2016, which we do later when cleaning registration_year.
min_price['registration_year'].value_counts().head(10)
To investigate whether those 'free' cars belong to cheaper or unknown brands, we analyze their brands.
min_price['brand'].value_counts().head(10)
Given that the top 10 brands of the 'free' cars include many luxury brands (e.g. BMW, Audi and Mercedes-Benz), as well as decent, well-known brands (e.g. Volkswagen, Ford, Renault, etc.), it is bizarre that their asking price is only 0.
eBay is an auction site, which means that, in theory, a seller could set the starting price at 0. However, based on the points we just mentioned, 0 is still an unrealistically low asking price, especially for new (or not too old) cars and decent brands. Therefore, we will remove these entries from our dataset.
autos['price_$'].describe()
Based on the descriptive statistics for price_$, the maximum value is 99,999,999, which far exceeds the third quartile (Q3, the 75th percentile) of 7,200. Considering that the car in question is a fairly old one registered in 1999, the asking price is illogically high. Therefore, this entry is definitely an outlier that needs to be removed.
autos[autos['price_$'] == 99999999]
To find out more about other potential outliers at the upper end of the price, we investigate the data of the top 20 most expensive cars.
autos['price_$'].value_counts().sort_index(ascending=False).head(20)
# Based on the sorting result, the 20th most expensive car costs $197,000
# Call for the details of the entries of the top 20 most expensive cars
max_price = autos[autos['price_$'] >= 197000].sort_values('price_$', ascending=False)
max_price
Based on the output, we notice a huge price gap between USD 350,000 (Porsche 991, index 36818) and USD 999,990 (Volkswagen Jetta GT, index 37585). Most of the cars at or above USD 999,990 are fairly old (registered between 1960 and 2009), with several exceptions:

- index 42221: USD 27,322,222 (a Citroen C4)
- index 27371: USD 12,345,678
- index 39377: USD 12,345,678

The asking price for the Citroen C4 does not make sense, as its market price in 2016 was only between USD 9,314 and 12,454 (GBP 6,899 to 9,225) (Ref: Citroen C4 2016). The entries for registration years 2017 and 2018 are questionable, as the crawling date was in 2016, and the prices (12,345,678) look like invented numbers.
Meanwhile, a quick internet search shows that even the most expensive brand-new cars in 2016 cost on the order of one to a few million USD. If only brand-new luxury cars in 2016 cost above one million, any used car in this dataset priced close to or above one million is clearly unrealistically high. Therefore, we decided to keep the entries with prices between USD 1 and USD 350,000, i.e. to exclude the entries at USD 0 and above USD 350,000.
We are aware that a better filter might involve more manual checking of whether each asking price is realistic, similar to what we just did for the data above USD 999,990. However, that is impractical, as there are thousands of entries in the dataset. So, even though our filter is not perfect, it is fairly effective.
# Keep the rows with prices between 1 and 350,000, while discarding the rows with other values
autos = autos[autos['price_$'].between(1,350000)]
autos
# Display the descriptive statistics
autos.describe()
# Display the lowest price in an ascending manner
autos['price_$'].value_counts().sort_index().head()
# Display the highest price in a descending manner
autos['price_$'].value_counts().sort_index(ascending=False).head()
Based on the cells above, we confirm that the entries at price 0 and above 350,000 have been removed.
Next, we analyze the odometer_km
column.
autos['odometer_km'].describe()
print('The number of unique values: ', autos['odometer_km'].unique().shape[0])
autos['odometer_km'].value_counts().sort_index(ascending=True)
There are 13 unique values in the odometer_km column.
The value_counts result shows that there are no outliers. The counts skew towards the higher end, with the highest count at the maximum value, 150,000 km, followed by the second and third highest counts at 125,000 km and 100,000 km, the second and third largest values, respectively.
Overall, the data looks good.
Now, we will explore the date columns, which are:

- date_crawled
- ad_created
- last_seen
- registration_year
- registration_month
Let's look at the date formats of the three date columns:
# Display the first five rows of the three date columns
autos[['date_crawled', 'ad_created', 'last_seen']][0:5]
The output shows that the values in these date columns are stored in %Y-%m-%d %H:%M:%S format, where the first 10 characters represent the date (e.g. 2016-03-26).
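We will keep working with the date strings directly, but for reference, these columns could also be parsed into proper datetimes; a minimal sketch, not used in the rest of the analysis:

# Parse the crawl timestamps and extract the date part
# crawled = pd.to_datetime(autos['date_crawled'])
# crawled.dt.date.value_counts(normalize=True).sort_index()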
To investigate the distribution of the date values in these three columns as percentages, we chain the methods as follows:

- str[:10]: To extract the date
- value_counts(normalize=True, dropna=False): To get the relative frequency of the unique values, including missing values
- sort_index(): To sort by date in ascending order (the default)
- *100: To convert the relative frequency to a percentage

In the earlier cells, we changed the output format of the descriptive statistics from scientific notation to floats with 1 decimal place and a thousands separator. Now, we revert the format setting to make the analysis of the percentage values easier.
# Reset the float format setting
pd.reset_option('^display.', silent=True)
# 'date_crawled' — To get the distribution of date values as percentages
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()*100
The distribution of date_crawled is quite consistent throughout the whole crawling period: it fluctuates within about 3% most of the time, with several exceptions. The lower date_crawled percentages from 2016-04-05 to 2016-04-07 correlate with the lower ad_created percentages (see the next cell). This indicates that the percentage of ads created affects the percentage of ads crawled. Overall, the crawling efficiency seems to be fairly good.
# 'ad_created' — To get the distribution of date values as percentages
autos['ad_created'].str[:10].value_counts(normalize=True,dropna=False).sort_index()*100
March 2016 and the beginning of April 2016 were the busiest period for used car ad creation, with around 3+% of daily new ads. The percentage of daily ad creation decreased drastically from 2016-04-05 to 2016-04-07, falling from 1.181945% to 0.125607%.
The daily new ads created between June 2015 and February 2016 were very low in comparison to March 2016 (only about 0.002059% to 0.006177%), and ads were not created daily. In 2015, we also notice that there were only 0.002059% of new ads on a single day in June, August, September and November, respectively; 0.002059% on two days in December; and none in July.
Based on these observations, we propose that March, or early spring in general, is a good time for a buyer to search for a used car on eBay.
# 'last_seen' — To get the distribution of date values as percentages
autos['last_seen'].str[:10].value_counts(normalize=True,dropna=False).sort_index()*100
The distribution of last_seen percentages is rather consistent throughout the timeframe: it fluctuates between 0.107075% and 2.808665%. The percentage is a lot higher on the last three days, between 12.476320% and 22.178980%, which seems normal, as last_seen records the date the crawler last saw an ad online, and the crawl ended on those days.
Now, let's analyze the registration year data!
autos['registration_year'].describe()
The descriptive statistics of registration_year surprise us in two illogical aspects:

- The minimum registration year is 1000, centuries before cars existed.
- The maximum registration year is 9999, far in the future.

We investigate registration_year further to examine whether there are more unreasonable entries.
# To get the distribution of `registration_year` as percentages and sort the years in an ascending manner
autos['registration_year'].value_counts(normalize=True,dropna=False).sort_index()*100
In addition to the year 1000, we notice that the earliest registration years also include 1001, 1111, 1800 and 1910. Since the year 1886 is regarded as the birth year of the modern car (Ref: modern car), any entries before the year 1910 in this dataset should be removed.
Additionally, any entries after 2016 (the crawling year), i.e. between 2017 and 9999, should be removed as well.
# Remove the entries before the year 1910 and after the year 2016
autos = autos[autos['registration_year'].between(1910,2016)]
From the cell below, we can see that we have successfully removed the illogical years. Now, the minimum year is 1910, whereas the maximum year is 2016.
autos['registration_year'].describe()
# To get the distribution of `registration_year` as percentages and sort the years in a descending manner
autos['registration_year'].value_counts(normalize=True).sort_values(ascending=False).head(10)*100
# To get the distribution of `registration_year` as percentages and sort the years in an ascending manner
autos['registration_year'].value_counts(normalize=True).sort_values().head(30)*100
Based on the distribution, we can see that the top 10 registration_year values are between 1998 and 2007 (4.88% to 6.76%). There is also a small percentage of sellers selling their antique cars on eBay.
Next, we will analyze the registration month data.
autos['registration_month'].describe()
# To get the distribution of `registration_month` as percentages and sort the months in a descending manner
autos['registration_month'].value_counts(normalize=True).sort_values(ascending=False)*100
We are surprised by the fact that 8.62% of the registrations list month 0, which does not exist. Given that the registration month is usually not as crucial as the registration year in a marketplace, sellers might fill in month 0 when they do not remember the registration month. We will keep the month-0 entries as they are, since the registration month will not have a big impact on our analysis and we do not want to lose the data associated with these entries.
Based on the data, most of the cars were registered between spring and early summer (March to June), at 8.31% to 10.36% per month, presumably in preparation for summer holidays or spring/summer outdoor activities. Therefore, spring and early summer is the best time to change cars, especially March, as March is also the peak time for ad creation.
To have an overview of the number of entries, we print it out:
n_before_removal = autos.shape[0]
print('The number of entries before removal of illogical registration months: ', n_before_removal)
Now, let's examine whether there are any entries with illogical registration months. For example, it is illogical if an ad was created and crawled in March 2016 but the car was registered in August 2016.
The last ad creation date is 2016-04-07. In theory, the chance of someone buying and registering a new car and then selling it within the same month is extremely low. Therefore, we create a combined Boolean filter for car registrations after March 2016 using autos[(autos['registration_year'] == 2016) & (autos['registration_month'] > 3)], and remove those entries from autos using the autos.drop() method.
# Remove the entries of car registration after March 2016
autos = autos.drop(autos[(autos['registration_year'] == 2016) & (autos['registration_month'] > 3)].index)
autos
As we can see from the cell below, the number of entries has been reduced from 46,680 to 45,979 after removing 701 illogical entries.
n_after_removal = autos.shape[0]
print('The number of entries after removal of illogical registration months: ', n_after_removal)
print('The number of deleted entries: ', n_before_removal - n_after_removal)
When we check the entries for registration year 2016, we only obtain entries with registration months 0 to 3. This result further confirms the removal of the entries with illogical registration months.
# Verify the deletion
autos[autos['registration_year'] == 2016]
Now, we are going to analyze the price and mileage of cars by brand.
First, we explore the unique values in the brand column and sort them by using value_counts():
print('Unique brand:')
print(autos['brand'].unique())
print('\n')
print('The number of unique brand: ', len(autos['brand'].unique()))
# Display the percentage of values counts of all car brands
autos['brand'].value_counts(normalize=True).sort_values(ascending=False)*100
Out of the 40 unique brands, we decided to aggregate the top 10 brands for our analysis, as they represent a good proportion of the common car brands (1.81% to 21.07% each). The brands that we have chosen are shown below:
# Display the percentage of values counts of the top 10 car brands
autos['brand'].value_counts(normalize=True).sort_values(ascending=False).head(10)*100
We use aggregation to understand the brand column. We start by grouping the top 10 unique brands under the variable top_10_brands:
:
# Group the top 10 unique brands
top_10_brands = autos['brand'].value_counts(normalize=True).sort_values(ascending=False).head(10).index
print(top_10_brands)
Next, we loop over the brand names in top_10_brands to group the dataframe by unique brand name. For each brand, we compute brand_mean_prices and brand_mean_mileage and add them to the mean_prices_dict and mean_mileage_dict dictionaries, respectively.
# Dictionaries to hold the aggregate data
mean_prices_dict = {}
mean_mileage_dict = {}
# Loop over the top 10 unique brands
for brand_name in top_10_brands:
    brand_groups = autos[autos['brand'] == brand_name]  # Select the rows of the current brand
    # For mean price
    brand_mean_prices = int(brand_groups['price_$'].mean())  # Calculate the brand's mean price and convert it to an integer
    mean_prices_dict[brand_name] = brand_mean_prices  # Add to dictionary: key = brand_name; value = brand_mean_prices
    # For mean mileage
    brand_mean_mileage = int(brand_groups['odometer_km'].mean())  # Calculate the brand's mean mileage and convert it to an integer
    mean_mileage_dict[brand_name] = brand_mean_mileage  # Add to dictionary: key = brand_name; value = brand_mean_mileage
print('Mean price dictionary:', '\n', mean_prices_dict)
print('\n')
print('Mean mileage dictionary:', '\n', mean_mileage_dict)
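For reference, pandas can produce the same aggregates more concisely with groupby; a minimal sketch that should match both dictionaries (up to the int truncation above):

# Aggregate mean price and mean mileage per brand in one step
brand_means = autos[autos['brand'].isin(top_10_brands)].groupby('brand')[['price_$', 'odometer_km']].mean()
print(brand_means)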
Now, we will investigate whether there is any correlation between the mean price and the mean mileage by brand. To do so, we will:

- Convert mean_prices_dict and mean_mileage_dict to pandas Series objects
- Combine the two series into a single dataframe
- Sort the dataframe by mean price and by mean mileage, and compare the orderings
# Construct mean price series
mean_price_series = pd.Series(mean_prices_dict)
# Construct mean mileage series
mean_mileage_series = pd.Series(mean_mileage_dict)
print('** Mean price series **')
print(mean_price_series)
print('\n')
print('** Mean mileage series **')
print(mean_mileage_series)
# Construct a dataframe from `mean_price_series`
mean_price_mileage_df = pd.DataFrame(mean_price_series, columns=['mean_price_$'])
# Assign `mean_mileage_series` as a new column in this dataframe
mean_price_mileage_df['mean_mileage_km'] = mean_mileage_series
# Sort the mean price in a descending manner
mean_price_mileage_df.sort_values(['mean_price_$'], ascending=False)
# Sort the mean mileage in a descending manner
mean_price_mileage_df.sort_values(['mean_mileage_km'], ascending=False)
max_mileage = mean_price_mileage_df['mean_mileage_km'].max()
min_mileage = mean_price_mileage_df['mean_mileage_km'].min()
difference_mileage_percent = round((max_mileage - min_mileage) * 100 / max_mileage, 1)
print('The difference between the maximum and minimum mileage: ', difference_mileage_percent, '%')
We notice that there are three distinct price tiers among the top 10 brands (see the sorted mean-price table above).
The difference between the maximum and the minimum mean mileage among the top 10 brands is not big, only 11.7%.
Based on the comparison result, we do not observe any correlation between the mean price and the mean mileage by brand. In fact, the mean price of the brands is possibly determined by more complicated combined factors, such as credibility of the brand, marketing strategy, popularity in certain countries/regions, car specifications, etc.
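As a quick numerical check on this claim, the correlation between the two aggregated columns can be computed directly; a short sketch:

# Pearson correlation between mean price and mean mileage across the top 10 brands
print(mean_price_mileage_df['mean_price_$'].corr(mean_price_mileage_df['mean_mileage_km']))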
In this project, we cleaned the used cars dataset from eBay Kleinanzeigen by removing a non-value-adding column, outliers in the price column, and illogical registration year and month data. We further analyzed the car listings and discovered that:

- The distribution of date_crawled is fairly consistent throughout the whole crawling period.
- The busiest period for ad_created is March 2016 and the beginning of April 2016.
- The distribution of last_seen is rather consistent throughout the timeframe.