Exploring ebay Car Sales in Germany.¶

By working with the ebay car sales data, scraped from eBay Kleinanzeigen, a classifieds section of the German eBay website, we aim to clean the data and analyze the included used car listings.

The data dictionary provided with data is as follows:

1. dateCrawled - When this ad was first crawled. All field-values are taken from this date.

2. name - Name of the car. seller - Whether the seller is private or a dealer.

3. offerType - The type of listing

4. price - The price on the ad to sell the car.

5. abtest - Whether the listing is included in an A/B test.

6. vehicleType- The vehicle Type.

7. yearOfRegistration - The year in which the car was first registered.

8. gearbox - The transmission type.

9. powerPS- The power of the car in PS.

10. model - The car model name.

11. kilometer - How many kilometers the car has driven.

12. monthOfRegistration - The month in which the car was first registered.

13. fuelType - What type of fuel the car uses.

14. brand - The brand of the car.

15. notRepairedDamage - If the car has a damage which is not yet repaired.

16. dateCreated - The date on which the eBay listing was created.

17. nrOfPictures - The number of pictures in the ad.

18. postalCode - The postal code for the location of the vehicle.

18. lastSeenOnline - When the crawler saw this ad last online.

Let us start by importing the dataset.

In [1]:

import warnings
warnings.filterwarnings('ignore')
import pandas as pd
data=pd.read_csv("autos.csv",encoding="Latin-1")

In [2]:

#First few rows of the dataset.
data[:5]

Out[2]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

In [3]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

We can see that there are multiple columns with null values and multiple columns with "object" as datatype. Let us start our data cleaning here.

Data Cleaning¶

1. Null Values¶

Looking at the results above, lets us know that columns vehicleType,gearbox,model,fuelType and notRepairedDamage have some null values. The way to handle this missing data depends on the percentage of observations missing from each column.

In [4]:

def per_missing(col): #function to calculate percentage of missing values
    per=(data[col].isnull().sum()*100)/data.shape[0]
    print("Percentage of ",col," missing is: ",per,"%")
per_missing("vehicleType")
per_missing("gearbox")
per_missing("model")
per_missing("fuelType")
per_missing("notRepairedDamage")

Percentage of  vehicleType  missing is:  10.19 %
Percentage of  gearbox  missing is:  5.36 %
Percentage of  model  missing is:  5.516 %
Percentage of  fuelType  missing is:  8.964 %
Percentage of  notRepairedDamage  missing is:  19.658 %

As at most, the column missing values are 20% of the data,we won't be dealing with missing data now and proceed further

2. Ambiguous Column Names and Values¶

Let us see the column names.

In [5]:

print(data.columns)

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

Some of the column names are too long like 'yearOfRegistration', 'monthOfRegistration' and 'notRepairedDamage' Some of them are inappropriate like 'dateCreated' which is the date on which the eBay listing was created. Also let us change the camelcase of the column names to snakecase.

We will make the following changes:

yearOfRegistration to registration_year
monthOfRegistration to registration_month
notRepairedDamage to unrepaired_damage
dateCreated to ad_created
nrOfPictures to num_pics
The rest of the column names from camelcase to snakecase.

In [6]:

data.rename({"yearOfRegistration":"registration_year","monthOfRegistration":"registration_month",
            "notRepairedDamage":"unrepaired_damage","dateCreated":"ad_created",
            "dateCrawled":"date_crawled","offerType":"offer_type","powerPS":"power_ps",
            "nrOfPictures":"num_pics","fuelType":"fuel_type","postalCode":"postal_code",
            "postalCode":"postal_code",'vehicleType':'vehicle_type',"lastSeen":"last_seen"},inplace=True,axis=1)

In [7]:

print(data.columns)

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_pics', 'postal_code',
       'last_seen'],
      dtype='object')

2. Exploring column values for ambiguous values¶

a) Removing rows with values that are inappropriate.¶

In [8]:

data["registration_month"].value_counts()

Out[8]:

0     5075
3     5071
6     4368
5     4107
4     4102
7     3949
10    3651
12    3447
9     3389
11    3360
1     3282
8     3191
2     3008
Name: registration_month, dtype: int64

Let us remove the values with registration_month=0 because this value seems ambiguous

In [9]:

data=data[~(data["registration_month"]==0)]

The acceptable values for registration year are 1900-2016. So we will remove all other rows with values that do not fall in this interval.

In [10]:

data=data[~((data["registration_year"]>2016)|(data["registration_year"]<1900))]

data.shape

Out[10]:

(43441, 20)

b) Changing german words to english words.¶

In [11]:

data["seller"].unique()

Out[11]:

array(['privat', 'gewerblich'], dtype=object)

In german, privat means private and gewerblich means commercial.

In [12]:

ger_eng={'privat':"private","gewerblich":"commercial"}
data["seller"]=data["seller"].map(ger_eng)
data["seller"].unique()

Out[12]:

array(['private', 'commercial'], dtype=object)

In [13]:

data["offer_type"].unique()

Out[13]:

array(['Angebot'], dtype=object)

In german, Angebot means offer.

In [14]:

ger_eng={'Angebot':"offer"}
data["offer_type"]=data["offer_type"].map(ger_eng)
data["offer_type"].unique()

Out[14]:

array(['offer'], dtype=object)

In [15]:

data["gearbox"].unique()

Out[15]:

array(['manuell', 'automatik', nan], dtype=object)

In german, manuell means manual and automatik means automatic .

In [16]:

ger_eng={'manuell':"manual","automatik":"automatic"}
data["gearbox"]=data["gearbox"].map(ger_eng)
data["gearbox"].unique()

Out[16]:

array(['manual', 'automatic', nan], dtype=object)

In [17]:

data["fuel_type"].unique()

Out[17]:

array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro',
       'andere'], dtype=object)

In german, benzin means petrol and elektro means electro.

In [18]:

ger_eng={"andere":"others", "benzin":"gasoline","cng":"cng", "diesel":"diesel", 
                "elektro":"electric", "hybrid":"hybrid", "lpg":"lpg"}
data["fuel_type"]=data["fuel_type"].map(ger_eng)
data["fuel_type"].unique()

Out[18]:

array(['lpg', 'gasoline', 'diesel', nan, 'cng', 'hybrid', 'electric',
       'others'], dtype=object)

In [19]:

data["vehicle_type"].unique()

Out[19]:

array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv',
       'cabrio', 'andere'], dtype=object)

In [20]:

ger_eng={"andere":"others","bus":"bus","kleinewagen":"small_car","kombi":"combi","cabrios":"convertible",
         "coupe":"coupe","suv":"suv","limousine":"limousine"}

data["vehicle_type"]=data["vehicle_type"].map(ger_eng)
data["vehicle_type"].unique()

Out[20]:

array(['bus', 'limousine', nan, 'combi', 'coupe', 'suv', 'others'],
      dtype=object)

In [21]:

data["unrepaired_damage"].unique()

Out[21]:

array(['nein', nan, 'ja'], dtype=object)

In [22]:

data["unrepaired_damage"]=data["unrepaired_damage"].map({"ja":"yes", "nein":"no"})
data["unrepaired_damage"].unique()

Out[22]:

array(['no', nan, 'yes'], dtype=object)

3. Exploring data types¶

In [23]:

data.dtypes

Out[23]:

date_crawled          object
name                  object
seller                object
offer_type            object
price                 object
abtest                object
vehicle_type          object
registration_year      int64
gearbox               object
power_ps               int64
model                 object
odometer              object
registration_month     int64
fuel_type             object
brand                 object
unrepaired_damage     object
ad_created            object
num_pics               int64
postal_code            int64
last_seen             object
dtype: object

The data types of some columns seems wrong.

Column	Wrong Data Type	Correct Data Type
date_crawled	Object	datetime
price	Object	float
ad_created	Object	datetime
last_seen	Object	datetime
odometer	Object	float

Also, km suffix should be removed from the odometer column and then it will be converted to int. Also, from the price column,$ prefix should be removed and then it will be converted to float.

a) Removing inappropriate suffix/prefix from price and odometer data.¶

In [24]:

data["price"]=data["price"].str.replace("$","")
data["price"]=data["price"].str.replace(",","")
data["price"]

Out[24]:

0         5000
1         8500
2         8990
3         4350
4         1350
         ...  
49995    24900
49996     1980
49997    13200
49998    22900
49999     1250
Name: price, Length: 43441, dtype: object

In [25]:

data["odometer"]=data["odometer"].str.replace("km","")
data["odometer"]=data["odometer"].str.replace(",","")
data["odometer"]

Out[25]:

0        150000
1        150000
2         70000
3         70000
4        150000
          ...  
49995    100000
49996    150000
49997      5000
49998     40000
49999    150000
Name: odometer, Length: 43441, dtype: object

Let us also alter the column names to make them more descriptive.

In [26]:

data.rename({"price":"price($)","odometer":"odometer(km)"},inplace=True,axis=1)
data.columns

Out[26]:

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price($)', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer(km)', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_pics', 'postal_code',
       'last_seen'],
      dtype='object')

b) Changing the data types of the columns mentioned above.¶

In [27]:

data["date_crawled"]=pd.to_datetime(data["date_crawled"])
data["ad_created"]=pd.to_datetime(data["ad_created"])
data["last_seen"]=pd.to_datetime(data["last_seen"])

In [28]:

data["price($)"]=data["price($)"].astype("float")
data["odometer(km)"]=data["odometer(km)"].astype("float")

In [29]:

data.dtypes

Out[29]:

date_crawled          datetime64[ns]
name                          object
seller                        object
offer_type                    object
price($)                     float64
abtest                        object
vehicle_type                  object
registration_year              int64
gearbox                       object
power_ps                       int64
model                         object
odometer(km)                 float64
registration_month             int64
fuel_type                     object
brand                         object
unrepaired_damage             object
ad_created            datetime64[ns]
num_pics                       int64
postal_code                    int64
last_seen             datetime64[ns]
dtype: object

In [30]:

data["offer_type"].unique()

Out[30]:

array(['offer'], dtype=object)

In [31]:

data["num_pics"].unique()

Out[31]:

array([0])

offer_type and num_pics have same value across all rows. There is no benefit in keeping these columns in our analysis as they have redundant values

In [32]:

data.drop(["offer_type","num_pics"],axis=1,inplace=True)
data

Out[32]:

	date_crawled	name	seller	price($)	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer(km)	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	private	5000.0	control	bus	2004	manual	158	andere	150000.0	3	lpg	peugeot	no	2016-03-26	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	private	8500.0	control	limousine	1997	automatic	286	7er	150000.0	6	gasoline	bmw	no	2016-04-04	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	private	8990.0	test	limousine	2009	manual	102	golf	70000.0	7	gasoline	volkswagen	no	2016-03-26	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	private	4350.0	control	NaN	2007	automatic	71	fortwo	70000.0	6	gasoline	smart	no	2016-03-12	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	private	1350.0	test	combi	2003	manual	0	focus	150000.0	7	gasoline	ford	no	2016-04-01	39218	2016-04-01 14:38:50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
49995	2016-03-27 14:38:19	Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon	private	24900.0	control	limousine	2011	automatic	239	q5	100000.0	1	diesel	audi	no	2016-03-27	82131	2016-04-01 13:47:40
49996	2016-03-28 10:50:25	Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...	private	1980.0	control	NaN	1996	manual	75	astra	150000.0	5	gasoline	opel	no	2016-03-28	44807	2016-04-02 14:18:02
49997	2016-04-02 14:44:48	Fiat_500_C_1.2_Dualogic_Lounge	private	13200.0	test	NaN	2014	automatic	69	500	5000.0	11	gasoline	fiat	no	2016-04-02	73430	2016-04-04 11:47:27
49998	2016-03-08 19:25:42	Audi_A3_2.0_TDI_Sportback_Ambition	private	22900.0	control	combi	2013	manual	150	a3	40000.0	11	diesel	audi	no	2016-03-08	35683	2016-04-05 16:45:07
49999	2016-03-14 00:42:12	Opel_Vectra_1.6_16V	private	1250.0	control	limousine	1996	manual	101	vectra	150000.0	1	gasoline	opel	no	2016-03-13	45897	2016-04-06 21:18:48

43441 rows × 18 columns

Data cleaning seems over (as of now). Let us proceed to our analysis.

Analysis¶

In [33]:

import matplotlib.pyplot as plt
%matplotlib inline
reg_year_count=data["registration_year"].value_counts().sort_index()
plt.plot(reg_year_count)
plt.xlabel("Years")
plt.ylabel("Number of Registered Cars")
plt.title("Registrations of cars for each year")
plt.show()

During years 1920-1980, the registered cars are too low. Let us just focus on years after 1980 as they are more relevant to our analysis.

In [34]:

plt.plot(reg_year_count)
plt.xlim([1980, 2016])
plt.xlabel("Years")
plt.ylabel("Number of Registered Cars")
plt.title("Registrations of cars for each year")
plt.show()

Most number of vehicles on the listings were registered in the years 2000-2010.

Let's break this data by vehicle type.

In [35]:

for v in data["vehicle_type"].dropna().unique():
    vt=(data["vehicle_type"]==v)
    a=data.loc[vt,"registration_year"].value_counts(dropna=True).sort_index()
    plt.plot(a.index,a.values,label=v)
    plt.xlim(1980,2016)
    
plt.xlabel("Registration Year")    
plt.ylabel("Number of Registrations")    
plt.legend()
plt.show()

Breaking down the data into vehicle types, showed us that most limousines, combi and buses were registered during 1995-2010 but for other categories, the registration peaks were seen during different time frames.

Now we will look at each vehicle type individually and see what brands are common in the listings.

¶

In [36]:

for v in data["vehicle_type"].dropna().unique():
    
    vt=data["vehicle_type"]==v
    c=data.loc[vt,"brand"].value_counts().sort_values(ascending=False).head(5)
    plt.bar(c.index,c.values)
    plt.xticks(rotation=30)
    plt.title(v.upper()+" Top 5 brands and their Frequencies")
    plt.show()

For most vehicle types, the listings seem to show similar top 5 brands like Volkswagen, Audi, BMW, Opel and Mercedes Benz. Now, we will focus some of our analysis on these common brands.

In [37]:

popular_brands=["volkswagen","audi","bmw","mercedes_benz","opel"]

Let us compare the average prices of these common brands for used vehicles

In [38]:

v=(data["brand"].isin(popular_brands))
a=data[v].groupby(["brand","gearbox"])["price($)"].mean().unstack("gearbox")
a.plot(kind="bar",stacked=True)
plt.xlabel("Common Brands")
plt.ylabel("Average Price")
plt.xticks(rotation=30)
plt.title("Average Prices of Common Brands")
plt.show()

In common brands, the most expensive brand seems to be Mercedes Benz and other brands have comparably low prices.

In most brands, we see that automatic cars are priced at a higher value than manual cars. However, only for Opel brand, the reverse is true.

In [39]:

v=(data["brand"].isin(popular_brands))
a=data[v].groupby(["brand","gearbox"])["date_crawled"].count().unstack("gearbox")
a.plot(kind="bar",stacked=True)
plt.xlabel("Popular Brands")
plt.ylabel("Total Frequency")
plt.xticks(rotation=30)
plt.title("Frequency of Popular Brands")
plt.show()

The most common listings seems to be of Volkswagen and the least number of listings are for Audi. In most of these brands,more manual cars are up for sale than automatic cars.

If we compare the prices and frequency of most common brands, we see that on ebay, more number of listings are for cheaper vehicles.

Let us now compare odometer(km) i.e. total distance traveled of the common brands.

In [40]:

v=(data["brand"].isin(popular_brands))
a=data[v].groupby("brand")["odometer(km)"].mean()
a.plot(kind="bar")
plt.xlabel("Common Brands")
plt.ylabel("Average Distance(km)")
plt.xticks(rotation=30)
plt.ylim(125000,)
plt.title("Average Distance of Popular Brands")
plt.show()

Although the difference in average distance between different brands is not huge but we can use this bar chart to infer that on average, BMW has most distance traveled. Maybe, due to this reason, out of the luxury brands (BMW, Audi ,Mercedes), we compared, BMW was cheaper on average.

Now let us shift some gears(pun intended) and switch our analysis to look at different models for all brands. We will look for most common brand/model combinations.

In [41]:

a=data.groupby(["brand","model"]).size().reset_index()
print("Most common Brand/Model combinations:\n")
a.groupby(["brand"])["model",0].max()

Most common Brand/Model combinations:

Out[41]:

	model	0
brand
alfa_romeo	spider	79
audi	tt	1165
bmw	z_reihe	2479
chevrolet	spark	140
chrysler	voyager	44
citroen	c5	194
dacia	sandero	44
daewoo	nubira	22
daihatsu	terios	55
fiat	stilo	385
ford	transit	713
honda	jazz	153
hyundai	tucson	170
jaguar	x_type	37
jeep	wrangler	43
kia	sportage	64
lada	samara	15
lancia	ypsilon	19
land_rover	range_rover_sport	30
mazda	rx_reihe	158
mercedes_benz	vito	1090
mini	one	266
mitsubishi	pajero	95
nissan	x_trail	222
opel	zafira	1436
peugeot	andere	579
porsche	cayenne	140
renault	twingo	557
rover	rangerover	52
saab	andere	51
seat	toledo	312
skoda	yeti	290
smart	roadster	517
subaru	legacy	28
suzuki	swift	115
toyota	yaris	125
trabant	andere	33
volkswagen	up	3423
volvo	xc_reihe	85

Now, we will look at the difference between prices of damaged vehicles v/s repaired ones.

In [42]:

damaged=data[data["unrepaired_damage"]=="yes"]
repaired=data[data["unrepaired_damage"]=="no"]

avg_dam_price=damaged.groupby("brand")["price($)"].mean().sort_values(ascending=False).reset_index()
avg_rep_price=repaired.groupby("brand")["price($)"].mean().sort_values(ascending=False).reset_index()

In [43]:

avg_dam_price.rename({"price($)":"dam_price($)"},inplace=True,axis=1)

avg_rep_price.rename({"price($)":"rep_price($)"},inplace=True,axis=1)

a=pd.merge(avg_rep_price,avg_dam_price,on="brand")
a.set_index(["brand"],inplace=True)
a.head()

Out[43]:

	rep_price($)	dam_price($)
brand
sonstige_autos	66309.358804	6458.937500
porsche	50551.665323	14670.000000
land_rover	21688.512821	5223.750000
jaguar	14055.857143	4438.777778
jeep	12626.000000	3074.875000

In [44]:

plt.figure(figsize=(6, 6))
a.plot.bar()
plt.ylabel("Average Prices($)")
plt.show()

<Figure size 600x600 with 0 Axes>

There is a huge difference between average prices of damaged cars and repaired cars, across all brands.

This was our apriori expectation as well. Vehicles that are damaged are priced quite low as compared to repaired vehicles.

Conclusions.¶

Overall used cars on ebay were registered during 2000-2010 time frame.
The 5 most common brands for different vehicle types were found to be, Volkswagen, Audi, BMW, Opel and Mercedes Benz.
In common brands, the most expensive brand seems to be Mercedes Benz and other brands have comparably low prices.Also, mostly,we see that automatic cars are priced at a higher value than manual cars.
The most common listings seems to be of Volkswagen and the least number of listings are for Audi. In each of these brands,more manual cars are up for sale than automatic cars.
Using these two bar charts, we saw that on ebay, more number of listings are for cheaper vehicles**.
On average, BMW has vehicles with most distance traveled. Maybe, due to this reason, out of the luxury brands (BMW, Audi ,Mercedes), we compared, BMW was cheaper on average.
Vehicles that are damaged are priced quite low as compared to repaired vehicles.

In [ ]: