eBay Classifieds cleaning and analysis

In this project, I will work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The dataset was originally scraped and uploaded to Kaggle by user orgesleka. The original dataset isn't available on Kaggle anymore, but it can be found here.

The project's aim is to clean the data and analyze the included used car listings.

The data dictionary provided with the data is as follows:

Column name: Description

dateCrawled: When this ad was first crawled; all field values are taken from this date
name: Name of the car
seller: Whether the seller is private or a dealer
offerType: The type of listing
price: The price on the ad to sell the car
abtest: Whether the listing is included in an A/B test
vehicleType: The vehicle type
yearOfRegistration: The year in which the car was first registered
gearbox: The transmission type
powerPS: The power of the car in PS
model: The car model name
odometer: How many kilometers the car has driven
monthOfRegistration: The month in which the car was first registered
fuelType: What type of fuel the car uses
brand: The brand of the car
notRepairedDamage: Whether the car has damage that is not yet repaired
dateCreated: The date on which the eBay listing was created
nrOfPictures: The number of pictures in the ad
postalCode: The postal code for the location of the vehicle
lastSeen: When the crawler last saw this ad online

I will start by importing the libraries I need and reading in the data file.

In [1]:
#Importing the essential libraries
import pandas as pd
import numpy as np 

#Reading in the file; Latin-1 encoding handles the German characters
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
In [2]:
#Exploring basic info on the dataset
print(autos.info())
print(autos.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
None
           dateCrawled                                               name  \
0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD   
1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik   
2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United   
3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...   
4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...   

   seller offerType   price   abtest vehicleType  yearOfRegistration  \
0  privat   Angebot  $5,000  control         bus                2004   
1  privat   Angebot  $8,500  control   limousine                1997   
2  privat   Angebot  $8,990     test   limousine                2009   
3  privat   Angebot  $4,350  control  kleinwagen                2007   
4  privat   Angebot  $1,350     test       kombi                2003   

     gearbox  powerPS   model   odometer  monthOfRegistration fuelType  \
0    manuell      158  andere  150,000km                    3      lpg   
1  automatik      286     7er  150,000km                    6   benzin   
2    manuell      102    golf   70,000km                    7   benzin   
3  automatik       71  fortwo   70,000km                    6   benzin   
4    manuell        0   focus  150,000km                    7   benzin   

        brand notRepairedDamage          dateCreated  nrOfPictures  \
0     peugeot              nein  2016-03-26 00:00:00             0   
1         bmw              nein  2016-04-04 00:00:00             0   
2  volkswagen              nein  2016-03-26 00:00:00             0   
3       smart              nein  2016-03-12 00:00:00             0   
4        ford              nein  2016-04-01 00:00:00             0   

   postalCode             lastSeen  
0       79588  2016-04-06 06:45:54  
1       71034  2016-04-06 14:45:08  
2       35394  2016-04-06 20:15:37  
3       33729  2016-03-15 03:16:28  
4       39218  2016-04-01 14:38:50  

Issues with the data

The first issue is easily detected.

1. Column names do not follow Python convention; most of them are in camelCase.

We'll start by correcting the column names: we will convert the camelCase column names to snake_case and replace some of the cryptic names with clearer ones.

In [3]:
# Creating a specific function to replace columns
def clean_col(col):
    col = col.replace("yearOfRegistration","registration_year")
    col = col.replace("monthOfRegistration","registration_month")
    col = col.replace("notRepairedDamage","unrepaired_damage")
    col = col.replace("dateCreated","ad_created")
    col = col.replace("offerType","offer_type")
    col = col.replace("vehicleType","vehicle_type")
    col = col.replace("powerPS","power_ps")
    col = col.replace("fuelType","fuel_type")
    col = col.replace("nrOfPictures","nr_of_pictures")
    col = col.replace("postalCode","postal_code")
    col = col.replace("lastSeen","last_seen")
    col = col.replace("dateCrawled","date_crawled")
    return col

#Applying the function to build a list of corrected names
new_columns = []
for column in autos.columns:
    column = clean_col(column)
    new_columns.append(column)
    
#Assigning the modified column names 
autos.columns = new_columns

#Checking the result
autos.head()
Out[3]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
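
As an aside, the same renaming can be expressed more compactly with DataFrame.rename and an explicit mapping. This is just an equivalent sketch of the approach above, not a change in behavior; rename_map is a name introduced here for illustration.

#Sketch: one rename call with a mapping. Column names missing from the
#mapping are left unchanged, so this is safe to re-run.
rename_map = {
    "dateCrawled": "date_crawled",
    "offerType": "offer_type",
    "vehicleType": "vehicle_type",
    "yearOfRegistration": "registration_year",
    "powerPS": "power_ps",
    "monthOfRegistration": "registration_month",
    "fuelType": "fuel_type",
    "notRepairedDamage": "unrepaired_damage",
    "dateCreated": "ad_created",
    "nrOfPictures": "nr_of_pictures",
    "postalCode": "postal_code",
    "lastSeen": "last_seen",
}
autos = autos.rename(columns=rename_map)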

Further inspection of the data reveals another issue:

2. Columns where all values are the same

This can be detected by checking basic statistics for each column, as shown below.

In [4]:
#Describing both the numeric and categorical columns
autos.describe(include = "all")
Out[4]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-05 16:57:05 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

We can see that the nr_of_pictures column seems to have only one unique value. In addition, five columns have only two unique values. We will check whether one of the two values dominates so strongly that the column could be dropped. Below I print value counts for each of these columns.

In [5]:
print(autos["nr_of_pictures"].value_counts())
print(autos["abtest"].value_counts())
print(autos["seller"].value_counts())
print(autos["offer_type"].value_counts())
print(autos["gearbox"].value_counts())
print(autos["unrepaired_damage"].value_counts())
0    50000
Name: nr_of_pictures, dtype: int64
test       25756
control    24244
Name: abtest, dtype: int64
privat        49999
gewerblich        1
Name: seller, dtype: int64
Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64
manuell      36993
automatik    10327
Name: gearbox, dtype: int64
nein    35232
ja       4939
Name: unrepaired_damage, dtype: int64

Two of the columns we checked, offer_type and seller, have almost all identical values. This means they do not offer any substantial criterion for analyzing our data. We will drop both columns together with nr_of_pictures.

In [6]:
autos = autos.drop(["nr_of_pictures","seller", "offer_type"], axis=1)
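
For larger datasets, scanning describe() output by eye doesn't scale. A quick programmatic check, sketched below, finds constant columns directly (run before the drop above, it would have flagged nr_of_pictures):

#Sketch: list columns holding a single unique value (counting NaN as a value).
constant_cols = [col for col in autos.columns
                 if autos[col].nunique(dropna=False) == 1]
print(constant_cols)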

3. Numeric data stored as text.

This third issue is visible in the column descriptions printed above. The price and odometer columns contain commas, "$" and "km" signs that prevent conversion to numeric values. We will remove these characters and convert the values to integers. To keep the values understandable, we will add the measurement units to the column names.

In [7]:
#Removing odd characters, converting and renaming price column
autos["price"] = (autos["price"].str.replace("$","") 
                                .str.replace(",","").astype(int)
                 )
autos.rename({"price":"price_usd"}, axis = 1, inplace=True)

#Removing odd characters, converting and renaming odometer column
autos["odometer"] = (autos["odometer"].str.replace(",","")
                                      .str.replace("km","").astype(int)                 
                    )

autos.rename({"odometer":"odometer_km"}, axis = 1, inplace=True)

#Checking the result
print(autos.head())
          date_crawled                                               name  \
0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD   
1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik   
2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United   
3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...   
4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...   

   price_usd   abtest vehicle_type  registration_year    gearbox  power_ps  \
0       5000  control          bus               2004    manuell       158   
1       8500  control    limousine               1997  automatik       286   
2       8990     test    limousine               2009    manuell       102   
3       4350  control   kleinwagen               2007  automatik        71   
4       1350     test        kombi               2003    manuell         0   

    model  odometer_km  registration_month fuel_type       brand  \
0  andere       150000                   3       lpg     peugeot   
1     7er       150000                   6    benzin         bmw   
2    golf        70000                   7    benzin  volkswagen   
3  fortwo        70000                   6    benzin       smart   
4   focus       150000                   7    benzin        ford   

  unrepaired_damage           ad_created  postal_code            last_seen  
0              nein  2016-03-26 00:00:00        79588  2016-04-06 06:45:54  
1              nein  2016-04-04 00:00:00        71034  2016-04-06 14:45:08  
2              nein  2016-03-26 00:00:00        35394  2016-04-06 20:15:37  
3              nein  2016-03-12 00:00:00        33729  2016-03-15 03:16:28  
4              nein  2016-04-01 00:00:00        39218  2016-04-01 14:38:50  
<ipython-input-7-15f8e80f5b59>:2: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will*not* be treated as literal strings when regex=True.
  autos["price"] = (autos["price"].str.replace("$","")

4. Unrealistic data in price and odometer columns

We will now look further into the price_usd and odometer_km columns to check for any unrealistic values.

In [8]:
#How many unique values
autos["price_usd"].unique().shape
Out[8]:
(2357,)
In [9]:
#Descriptive stats
autos["price_usd"].describe()
Out[9]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_usd, dtype: float64
In [10]:
#15 highest values with their frequencies 
autos["price_usd"].value_counts().sort_index(ascending=False).head(15)
Out[10]:
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
Name: price_usd, dtype: int64

Several values in the price column are above USD 10 million and appear to have been entered arbitrarily (e.g., 12345678). The highest plausible price in the list is USD 350,000, so we will cut off all prices above this number. But first, let's check the lowest prices.

In [11]:
autos["price_usd"].value_counts().sort_index(ascending=True).head(15)
Out[11]:
0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
Name: price_usd, dtype: int64

There are 1,421 entries where the price equals 0. Since eBay is an auction site where bidding starts at USD 1, a price of zero doesn't make sense, so we'll remove these rows.

In [12]:
#Only keep the prices between 1 and 350 000
autos = autos[autos["price_usd"].between(1,350000)]

#Check
autos["price_usd"].value_counts().sort_index(ascending=False)
Out[12]:
350000      1
345000      1
299000      1
295000      1
265000      1
         ... 
8           1
5           2
3           1
2           3
1         156
Name: price_usd, Length: 2346, dtype: int64
In [13]:
#Check the whole column again
autos["price_usd"].describe()
Out[13]:
count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price_usd, dtype: float64

Now we'll look at the odometer column in the same way.

In [14]:
autos["odometer_km"].describe()
Out[14]:
count     48565.000000
mean     125770.101925
std       39788.636804
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

The min and max values seem reasonable. The mean is around 126,000 km, so high-mileage cars predominate. Let's check whether anything is wrong with the value frequencies.

In [15]:
autos["odometer_km"].value_counts(ascending=False)
Out[15]:
150000    31414
125000     5057
100000     2115
90000      1734
80000      1415
70000      1217
60000      1155
50000      1012
5000        836
40000       815
30000       780
20000       762
10000       253
Name: odometer_km, dtype: int64
In [16]:
autos["odometer_km"].value_counts(ascending=True).head(15)
Out[16]:
10000       253
20000       762
30000       780
40000       815
5000        836
50000      1012
60000      1155
70000      1217
80000      1415
90000      1734
100000     2115
125000     5057
150000    31414
Name: odometer_km, dtype: int64

Seems like our price_usd and odometer_km columns are in order now. We can move on to other issues.

5. Format of the date columns

Two main issues here:

  • The date_crawled, last_seen, and ad_created columns are all identified as string values by pandas.
  • registration_month and registration_year are stored as integers, but with impossible min and max values.

We will first look at the three string columns.

In [17]:
autos[["date_crawled","ad_created","last_seen"]][0:5]
Out[17]:
date_crawled ad_created last_seen
0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54
1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08
2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37
3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28
4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50

We can see that the first 10 characters of each cell represent the date. To check the distribution of values by date, we will extract these characters and build a frequency table (as proportions) of ads crawled, created and last seen each day.

In [18]:
(autos["date_crawled"]
      .str[:10]
      .value_counts(normalize=True, dropna=False)
      .sort_index()
     )
Out[18]:
2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64

The distribution seems reasonable, with fewer entries crawled on the last days, probably because the crawl was winding down.

In [19]:
(autos["ad_created"]
      .str[:10]
      .value_counts(normalize=True, dropna=False)
      .sort_index()
     )
Out[19]:
2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
                ...   
2016-04-03    0.038855
2016-04-04    0.036858
2016-04-05    0.011819
2016-04-06    0.003253
2016-04-07    0.001256
Name: ad_created, Length: 76, dtype: float64
In [20]:
(autos["last_seen"]
      .str[:10]
      .value_counts(normalize=True, dropna=False)
      .sort_index()
     )
Out[20]:
2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64

There's nothing wrong with these distributions either. The only thing left to do is convert the dates into numeric data. We'll do that by deleting the dashes between the numbers and changing the data type to integer.

In [21]:
autos["date_crawled"] = (autos["date_crawled"]
                         .str.split()
                         .str[0]
                         .str.replace("-","")
                         .astype(int)
                        )

print("date_crawled", "\n", autos["date_crawled"].head(10))

autos["ad_created"] = (autos["ad_created"]
                         .str.split()
                         .str[0]
                         .str.replace("-","")
                         .astype(int)
                        )
print("ad_created", "\n", autos["ad_created"].head(10))

autos["last_seen"] = (autos["last_seen"]
                         .str.split()
                         .str[0]
                         .str.replace("-","")
                         .astype(int)
                        )
print("last_seen", "\n", autos["last_seen"].head(10))
date_crawled 
 0    20160326
1    20160404
2    20160326
3    20160312
4    20160401
5    20160321
6    20160320
7    20160316
8    20160322
9    20160316
Name: date_crawled, dtype: int64
ad_created 
 0    20160326
1    20160404
2    20160326
3    20160312
4    20160401
5    20160321
6    20160320
7    20160316
8    20160322
9    20160316
Name: ad_created, dtype: int64
last_seen 
 0    20160406
1    20160406
2    20160406
3    20160315
4    20160401
5    20160406
6    20160323
7    20160407
8    20160326
9    20160406
Name: last_seen, dtype: int64
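
Integer dates like 20160326 sort correctly, but they don't support date arithmetic. A hedged alternative would keep the columns as real datetimes with pd.to_datetime; the sketch below assumes a fresh copy of the raw data, since the columns above are already integers:

#Sketch: parse the three date columns as datetime64 instead of integers.
dates = pd.read_csv('autos.csv', encoding='Latin-1',
                    usecols=['dateCrawled', 'dateCreated', 'lastSeen'])
for col in dates.columns:
    dates[col] = pd.to_datetime(dates[col])

#Datetimes allow arithmetic, e.g. how long each ad stayed online:
print((dates['lastSeen'] - dates['dateCrawled']).head())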

Now to the issues with the registration_year column, where we noticed very bizarre min (1000) and max (9999) values. Let's check again.

In [22]:
autos["registration_year"].describe()
Out[22]:
count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
In [23]:
#Looking closer at the 10 greatest values
(autos["registration_year"]
 .value_counts()
 .sort_index(ascending=False)
 .head(15)
)
Out[23]:
9999       3
9000       1
8888       1
6200       1
5911       1
5000       4
4800       1
4500       1
4100       1
2800       1
2019       2
2018     470
2017    1392
2016    1220
2015     392
Name: registration_year, dtype: int64

It is clear that any registration year after 2016, the year the ads were crawled, must be a random number and needs to be deleted. Now for the lowest values.

In [24]:
(autos["registration_year"]
 .value_counts()
 .sort_index(ascending=True)
 .head(15)
)
Out[24]:
1000    1
1001    1
1111    1
1800    2
1910    5
1927    1
1929    1
1931    1
1934    2
1937    4
1938    1
1939    1
1941    2
1943    1
1948    1
Name: registration_year, dtype: int64

A 1910 registration might still be genuine for a very, very old car. Years of 1800 and lower, however, are impossible, because there were no cars yet. These rows must be deleted too.

In [25]:
#Only keep the years between 1910 and 2016
autos = autos[autos["registration_year"].between(1910,2016)]

#Checking the resulting distribution
print(autos["registration_year"].value_counts(normalize=True))
print("\n")

#Checking the resulting column
print(autos["registration_year"].describe())
2000    0.067608
2005    0.062895
1999    0.062060
2004    0.057904
2003    0.057818
          ...   
1929    0.000021
1931    0.000021
1938    0.000021
1939    0.000021
1952    0.000021
Name: registration_year, Length: 78, dtype: float64


count    46681.000000
mean      2002.910756
std          7.185103
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64

We went from 48,565 rows to 46,681, losing around 4% of the entries. That is not a significant amount, so we will proceed with the reduced dataset, without extreme values skewing our calculations.

6. Translating German words

Some of the categorical data in the dataset is in German. To make the analysis accessible to English-speaking readers, I'll translate it.

First, let's display the unique values of the columns we need to translate.

In [26]:
print(autos["vehicle_type"].unique())
print(autos["gearbox"].unique())
print(autos["fuel_type"].unique())
print(autos["unrepaired_damage"].unique())
['bus' 'limousine' 'kleinwagen' 'kombi' nan 'coupe' 'suv' 'cabrio'
 'andere']
['manuell' 'automatik' nan]
['lpg' 'benzin' 'diesel' nan 'cng' 'hybrid' 'elektro' 'andere']
['nein' nan 'ja']
In [27]:
#Creating translation dictionaries
veh_type = {
    "bus":"bus",
    "limousine":"limousine",
    "kleinwagen":"small car",
    "kombi":"station wagon",
    "coupe":"coupe",
    "suv":"suv",
    "cabrio":"cabrio",
    "andere":"other"
    }

gearbox_tr = {
    "manuell":"manual",
    "automatik":"automatic"
    }
fuel_tr = {
    "lpg":"lpg",
    "diesel":"diesel",
    "cng":"cng",
    "benzin":"gasoline",
    "hybrid":"hybrid",
    "elektro":"electric",
    "andere":"other"
    }
damage_tr = {
    "nein":"no",
    "ja":"yes"
    }


#Mapping the translations into the columns
autos["vehicle_type"] = autos["vehicle_type"].map(veh_type)
autos["gearbox"] = autos["gearbox"].map(gearbox_tr)
autos["fuel_type"] = autos["fuel_type"].map(fuel_tr)
autos["unrepaired_damage"] = autos["unrepaired_damage"].map(damage_tr)

#Checking
print(autos["vehicle_type"].unique())
print(autos["gearbox"].unique())
print(autos["fuel_type"].unique())
print(autos["unrepaired_damage"].unique())
['bus' 'limousine' 'small car' 'station wagon' nan 'coupe' 'suv' 'cabrio'
 'other']
['manual' 'automatic' nan]
['lpg' 'gasoline' 'diesel' nan 'cng' 'hybrid' 'electric' 'other']
['no' nan 'yes']
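
A note on the design choice: Series.map sends any value missing from the dictionary to NaN, which is harmless here because the only unmapped values were already NaN. An alternative sketch with DataFrame.replace leaves unmatched values untouched instead (autos_translated is a name introduced here for illustration):

#Sketch: all four translations in one call; values absent from the nested
#dictionaries are left as they are, unlike with map().
autos_translated = autos.replace({
    "vehicle_type": veh_type,
    "gearbox": gearbox_tr,
    "fuel_type": fuel_tr,
    "unrepaired_damage": damage_tr,
})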

Analysis

Below we will find out:

  • which car brands are most popular within this dataset,
  • which brands are cheaper and which are more expensive.

1. Brand popularity and prices
In [28]:
#Showing top 10 brands in the dataset
popular_brands = (autos["brand"]
                  .value_counts(normalize=True)
                  .head(10)
                  .index
                 )
print(popular_brands)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat'],
      dtype='object')

The five most popular brands are German, which is not surprising for the German eBay site. Now we will calculate the mean price for each of those brands.

In [29]:
#Computing mean price by brand
mean_price_by_brand = {}
for b in popular_brands:
    selected_rows = autos[autos["brand"] == b]
    mean_price = selected_rows["price_usd"].mean()
    mean_price_by_brand[b] = mean_price
    
for i in mean_price_by_brand:
    print(i, ":", mean_price_by_brand[i])
volkswagen : 5402.410261610221
bmw : 8332.820517811953
opel : 2975.2419354838707
mercedes_benz : 8628.450366422385
audi : 9336.687453600594
ford : 3749.4695065890287
renault : 2474.8646069968195
peugeot : 3094.0172290021537
fiat : 2813.748538011696
seat : 4397.230949589683
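
The same aggregation can be written without an explicit loop. A sketch with groupby, equivalent to the dictionary built above:

#Sketch: mean price per brand via groupby, restricted to the top 10 brands.
mean_price_gb = (autos[autos["brand"].isin(popular_brands)]
                 .groupby("brand")["price_usd"]
                 .mean())
print(mean_price_gb.sort_values(ascending=False))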

Now that we have mean prices by brand, we can group the brands into expensive, medium and cheap tiers, anchoring the groups on the descriptive statistics we pulled from the price_usd column above.

In [30]:
#Grouping brands by mean price

expensive_brands = []
medium_low_brands = []
medium_upper_brands = []
low_cost_brands = []

for brand in mean_price_by_brand:
    if mean_price_by_brand[brand] <= 3000:
        low_cost_brands.append(brand)
    elif 3000 < mean_price_by_brand[brand]<= 5000:
        medium_low_brands.append(brand)
    elif 5000 < mean_price_by_brand[brand] <= 8000:
        medium_upper_brands.append(brand)
    elif 8000 < mean_price_by_brand[brand]:
        expensive_brands.append(brand)
   

print("Expensive car brands: ", expensive_brands)
print("\n")
print("Medium-high cost brands: ", medium_upper_brands)
print("\n")
print("Medium-low cost brands: ", medium_low_brands)
print("\n")
print("Low cost brands: ", low_cost_brands)

    
Expensive car brands:  ['bmw', 'mercedes_benz', 'audi']


Medium-high cost brands:  ['volkswagen']


Medium-low cost brands:  ['ford', 'peugeot', 'seat']


Low cost brands:  ['opel', 'renault', 'fiat']
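
The if/elif ladder above can also be expressed with pandas.cut; this sketch uses the same bin edges:

#Sketch: bucket the mean prices with bin edges matching the loop above.
tiers = pd.cut(pd.Series(mean_price_by_brand),
               bins=[0, 3000, 5000, 8000, float("inf")],
               labels=["low cost", "medium-low", "medium-high", "expensive"])
print(tiers.sort_values())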

The most popular brand, Volkswagen, is in the middle price category, which seems reasonable. Three of the five most popular brands - BMW, Mercedes-Benz and Audi - also turn out to be among the most expensive. This is probably because cars of these brands are of better quality and last longer, so they are more likely to be resold after use. Let's check.

2. Price and mileage connection

We will now calculate mean mileage for the most popular brands and see if there is any relation to the price distribution.

In [31]:
#Creating a dictionary with mean mileage by brand
mean_km_brand = {}
for b in popular_brands:
    selected_rows = autos[autos["brand"] == b]
    mean_km = selected_rows["odometer_km"].mean()
    mean_km_brand[b] = mean_km
    
for i in mean_km_brand:
    print(i, ":", mean_km_brand[i])
volkswagen : 128707.15879132022
bmw : 132572.51313996495
opel : 129310.0358422939
mercedes_benz : 130788.36331334666
audi : 129157.38678544914
ford : 124266.01287159056
renault : 128071.33121308497
peugeot : 127153.62526920316
fiat : 117121.9715956558
seat : 121131.30128956624

Below we will create a dataframe with mean price and mean mileage columns in order to make comparison easier.

In [32]:
#Converting price and mileage dictionaries into series
price_series = pd.Series(mean_price_by_brand)
km_series = pd.Series(mean_km_brand)

#Creating dataframe from price series
price_km_df = pd.DataFrame(price_series, columns = ["mean_price"])

#Adding odometer column to the dataframe
price_km_df["odometer_km"] = km_series 

#Checking the result
print(price_km_df)
                mean_price    odometer_km
volkswagen     5402.410262  128707.158791
bmw            8332.820518  132572.513140
opel           2975.241935  129310.035842
mercedes_benz  8628.450366  130788.363313
audi           9336.687454  129157.386785
ford           3749.469507  124266.012872
renault        2474.864607  128071.331213
peugeot        3094.017229  127153.625269
fiat           2813.748538  117121.971596
seat           4397.230950  121131.301290

The table is still hard to compare at a glance. Let's try the same grouping method we used on prices and see whether the groups coincide.

In [33]:
#Grouping brands by mean mileage

high_mileage = []
medium_mileage = []
low_mileage = []

for brand in mean_km_brand:
    if mean_km_brand[brand] <= 125000:
        low_mileage.append(brand)
    elif 125000 < mean_km_brand[brand]<= 130000:
        medium_mileage.append(brand)
    elif 130000 < mean_km_brand[brand]:
        high_mileage.append(brand)
    

print("High mileage brands: ", high_mileage)
print("\n")
print("Medium mileage brands: ", medium_mileage)
print("\n")
print("Low mileage brands: ", low_mileage)
High mileage brands:  ['bmw', 'mercedes_benz']


Medium mileage brands:  ['volkswagen', 'opel', 'audi', 'renault', 'peugeot']


Low mileage brands:  ['ford', 'fiat', 'seat']

Our hypothesis is largely confirmed: the most expensive brands, BMW and Mercedes-Benz, also show the highest mean mileage. As speculated above, this might be because expensive cars are more reliable and last longer, so they can be resold several times without a significant loss in value.
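
As a quick numeric cross-check (a sketch; output not shown here), the brand-level correlation between mean price and mean mileage could be computed directly from the dataframe we built:

#Sketch: a single summary number for the price/mileage relation across brands.
print(price_km_df["mean_price"].corr(price_km_df["odometer_km"]))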

3. Common brand/model combinations

Below I will calculate the most popular model for each of the popular car brands.

In [36]:
brand_model = {}
for b in popular_brands:
    selected_rows = autos[autos["brand"] == b]
    top_model = selected_rows["model"].value_counts().index[0]
    brand_model[b] = top_model
    
for i in brand_model:
    print(i, ":", brand_model[i])
volkswagen : golf
bmw : 3er
opel : corsa
mercedes_benz : c_klasse
audi : a4
ford : focus
renault : twingo
peugeot : 2_reihe
fiat : punto
seat : ibiza
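
An equivalent sketch without the loop uses groupby with the mode of each group:

#Sketch: most frequent model per brand; mode() returns the modal value(s).
top_models = (autos[autos["brand"].isin(popular_brands)]
              .groupby("brand")["model"]
              .agg(lambda models: models.mode().iat[0]))
print(top_models)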

4. Damaged car prices

How much cheaper are cars with damage than their non-damaged counterparts? Let's find out. I will calculate the average price for cars with and without unrepaired damage.

In [40]:
damaged = autos["unrepaired_damage"].dropna().unique()

mean_price_damage = {}
for i in damaged:
    selected_rows = autos[autos["unrepaired_damage"] == i]
    mean_price = selected_rows["price_usd"].mean()
    mean_price_damage[i] = mean_price
    
for i in mean_price_damage:
    print(i, ":", mean_price_damage[i])
no : 7164.033102796004
yes : 2241.146035242291
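The same two means in one line, as a groupby sketch:

#Sketch: mean price grouped by damage status (NaN rows are excluded by groupby).
print(autos.groupby("unrepaired_damage")["price_usd"].mean())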
In [49]:
price_difference = 100 - ((mean_price_damage["yes"] / mean_price_damage["no"]) * 100)
print(f"The damaged cars are {price_difference:.2f}% cheaper than the undamaged ones.")
The damaged cars are 68.72% cheaper than the undamaged ones.

Results

I have cleaned the autos dataset, resolving six different issues. The analysis of the dataset allows me to draw the following conclusions:

  1. The five brands most popular among German users on eBay are Volkswagen, BMW, Opel, Mercedes-Benz and Audi.
  2. More expensive cars tend to have more kilometers on the odometer.
  3. The most popular models for each of the popular brands are: Volkswagen Golf, BMW 3er, Opel Corsa, Mercedes C-Class, Audi A4.
  4. Cars with unrepaired damage are on average 69% cheaper than undamaged ones.