Guided Project: Exploring Ebay Car Sales Data

In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

The dataset was originally scraped and uploaded to Kaggle by user orgesleka. The original dataset isn't available on Kaggle anymore, but you can find it here.

We've made a few modifications from the original dataset:

  • We sampled 50,000 data points from the full dataset, to ensure your code runs quickly in our hosted environment.
  • We dirtied the dataset a bit to more closely resemble what you would expect from a scraped dataset (the version uploaded to Kaggle was cleaned to be easier to work with).

The data dictionary provided with the data is as follows:

  • dateCrawled - When this ad was first crawled. All field values are taken from this date.

  • name - Name of the car.

  • seller - Whether the seller is private or a dealer.

  • offerType - The type of listing.

  • price - The price on the ad to sell the car.

  • abtest - Whether the listing is included in an A/B test.

  • vehicleType - The vehicle Type.

  • yearOfRegistration - The year in which the car was first registered.

  • gearbox - The transmission type.

  • powerPS - The power of the car in PS.

  • model - The car model name.

  • kilometer - How many kilometers the car has driven.

  • monthOfRegistration - The month in which the car was first registered.

  • fuelType - What type of fuel the car uses.

  • brand - The brand of the car.

  • notRepairedDamage - If the car has a damage which is not yet repaired.

  • dateCreated - The date on which the eBay listing was created.

  • nrOfPictures - The number of pictures in the ad.

  • postalCode - The postal code for the location of the vehicle.

  • lastSeen - When the crawler saw this ad last online.

The aim of this project is to clean the data and analyze the included used car listings.

In [1]:
import pandas as pd
import numpy as np

autos = pd.read_csv('autos.csv', encoding="Latin-1")
autos
Out[1]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon privat Angebot $24,900 control limousine 2011 automatik 239 q5 100,000km 1 diesel audi nein 2016-03-27 00:00:00 0 82131 2016-04-01 13:47:40
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot $1,980 control cabrio 1996 manuell 75 astra 150,000km 5 benzin opel nein 2016-03-28 00:00:00 0 44807 2016-04-02 14:18:02
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot $13,200 test cabrio 2014 automatik 69 500 5,000km 11 benzin fiat nein 2016-04-02 00:00:00 0 73430 2016-04-04 11:47:27
49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition privat Angebot $22,900 control kombi 2013 manuell 150 a3 40,000km 11 diesel audi nein 2016-03-08 00:00:00 0 35683 2016-04-05 16:45:07
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V privat Angebot $1,250 control limousine 1996 manuell 101 vectra 150,000km 1 benzin opel nein 2016-03-13 00:00:00 0 45897 2016-04-06 21:18:48

50000 rows × 20 columns

In [2]:
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
Out[2]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

A brief description of our observations:
  • After using the autos.info() method we discovered that not all columns have the same amount of data. Most have 50,000 non-null values, but some have fewer. Several columns contain null values, but none have more than ~20% nulls.
  • The dataset contains 20 columns, most of which are strings.
  • dtypes: int64(5), object(15)
  • Some columns store numeric data as text (for example price and odometer).
  • The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores.

Cleaning Column Names:

Let's convert the column names from camelcase to snakecase and reword some of the column names based on the data dictionary to be more descriptive.
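As an aside, the case conversion itself could also be done programmatically. A minimal sketch, assuming autos is already loaded; note that it only handles the casing, so reworded names like ad_created (and acronyms like powerPS) would still need the explicit rename used below:

import re

def camel_to_snake(name):
    # insert an underscore before each uppercase letter, then lowercase everything
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

# e.g. 'yearOfRegistration' -> 'year_of_registration'; 'powerPS' -> 'power_p_s' (would need manual fixing)
snake_columns = [camel_to_snake(c) for c in autos.columns]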

In [3]:
autos = autos.rename({'dateCrawled': 'date_crawled', 
                        'offerType': 'offer_type',
                        'vehicleType': 'vehicle_type',
                        'yearOfRegistration': 'registration_year',
                        'monthOfRegistration': 'registration_month', 
                        'powerPS': 'power_ps',
                        'fuelType': 'fuel_type', 
                        'notRepairedDamage': 'unrepaired_damage',
                        'dateCreated': 'ad_created',
                        'nrOfPictures': 'num_of_pictures', 
                        'postalCode': 'postal_code', 
                        'lastSeen': 'last_seen'}, axis=1)
autos.columns


autos.head()
Out[3]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created num_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

Now all columns have more readable snakecase names.


Initial Exploration and Cleaning:

Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for:

  • Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis.
  • Examples of numeric data stored as text which can be cleaned and converted.

The following methods are helpful for exploring the data:

  • DataFrame.describe() (with include='all' to get both categorical and numeric columns)
  • Series.value_counts() and Series.head() if any columns need a closer look.
In [4]:
autos.describe(include='all')
Out[4]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created num_of_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-12 16:06:22 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

Our observations here:

  • Most of the columns contain reasonable-looking values.
  • The num_of_pictures column looks odd - it seems to contain only zeros.
  • There are a number of text columns where almost all of the values are the same (seller and offer_type). We will leave them as they are, because the single different value in those columns may come from some big company.
In [5]:
autos["num_of_pictures"].value_counts()
Out[5]:
0    50000
Name: num_of_pictures, dtype: int64
In [6]:
# Because all values in num_of_pictures are 0, we'll drop this column
autos = autos.drop('num_of_pictures', axis=1)
print("autos.columns:", autos.columns) #test
autos.columns: Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'postal_code', 'last_seen'],
      dtype='object')
In [7]:
autos['registration_year'].value_counts()
Out[7]:
2000    3354
2005    3015
1999    3000
2004    2737
2003    2727
        ... 
1931       1
1929       1
1001       1
9996       1
1952       1
Name: registration_year, Length: 97, dtype: int64
In [8]:
# For each column:
# Remove any non-numeric characters.
# Convert the column to a numeric dtype.


autos['price'] = (autos['price'].str.replace('$', '')
                  .str.replace(',', '').astype(int)
                  )


autos['odometer'] = (autos['odometer'].str[:-2]
                     .str.replace(',', '')
                     .astype(int)
                       )

#rename odometer to odometer_km:

autos.rename({'odometer': 'odometer_km'}, axis=1, inplace=True)

print('price type:', autos['price'].dtype) ##Test
print('odometer_km', autos['odometer_km'].dtype) ##Test
price type: int32
odometer_km int32
In [9]:
autos.head()
Out[9]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000 control bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 2016-03-26 00:00:00 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500 control limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 2016-04-04 00:00:00 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot 8990 test limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 2016-03-26 00:00:00 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot 4350 control kleinwagen 2007 automatik 71 fortwo 70000 6 benzin smart nein 2016-03-12 00:00:00 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot 1350 test kombi 2003 manuell 0 focus 150000 7 benzin ford nein 2016-04-01 00:00:00 39218 2016-04-01 14:38:50

Exploring the Odometer and Price Columns:

Let's continue exploring the data, specifically looking for data that doesn't look right. We'll start by analyzing the odometer_km and price columns. Here's the steps we'll take:

Analyze the columns using minimum and maximum values and look for any values that look unrealistically high or low (outliers) that we might want to remove.

In [10]:
# We'll use:
# - Series.unique().shape to see how many unique values
# - Series.describe() to view min/max/median/mean etc
# - Series.value_counts(), with some variations:
# chained to .head() if there are lots of values.
# - Because Series.value_counts() returns a series, 
# we can use Series.sort_index() with ascending= True or False 
# to view the highest and lowest values with their counts (can also chain to head() here).
# - When removing outliers, we can do df[(df["col"] > x ) & (df["col"] < y )],
# but it's more readable to use df[df["col"].between(x,y)]


# autos['odometer_km'] analyzing:

print(autos['odometer_km'].unique().shape)
autos['odometer_km'].describe()

autos['odometer_km'].value_counts()
(13,)
Out[10]:
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64
In [11]:
autos['odometer_km'].value_counts().sort_index(ascending=True).head()
Out[11]:
5000     967
10000    264
20000    784
30000    789
40000    819
Name: odometer_km, dtype: int64
In [12]:
autos['odometer_km'].value_counts().sort_index(ascending=False).head()
Out[12]:
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
Name: odometer_km, dtype: int64
In [13]:
# autos['price'] analyzing: 

print(autos['price'].unique().shape)
autos['price'].describe()

autos['price'].value_counts().head()
(2357,)
Out[13]:
0       1421
500      781
1500     734
2500     643
1000     639
Name: price, dtype: int64
In [14]:
autos['price'].value_counts().sort_index(ascending=True).head(10)
Out[14]:
0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
Name: price, dtype: int64
In [15]:
autos['price'].value_counts().sort_index(ascending=False).head(10)
Out[15]:
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
Name: price, dtype: int64

We can see that some cars are listed at $0, i.e. probably given away for free (that's possible). But there are also some suspiciously expensive listings, such as $99,999,999 and $27,322,222. Those are most likely outliers.
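Side note: the cell below blanks out the entire row (setting every column to NaN) for listings whose price falls outside the chosen range, which also upcasts the numeric columns to float. A more common alternative, sketched here under the same price cutoff, would be to drop those rows instead:

# sketch only: keep listings priced between 0 and 20,000,000 and drop the rest
autos_price_filtered = autos[autos['price'].between(0, 20000000)]
print(autos_price_filtered.shape)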

In [41]:
# set every row whose price falls outside 0 - 20,000,000 entirely to NaN
autos[~(autos['price'].between(0,20000000))] = np.NaN
autos['price'].value_counts().sort_index(ascending=True).head(10) # test
# autos['price'].value_counts().sort_index(ascending=True).tail(10) # test
Out[41]:
0.0     1335
1.0      150
2.0        2
3.0        1
5.0        2
8.0        1
9.0        1
10.0       6
11.0       2
12.0       3
Name: price, dtype: int64
In [17]:
autos['price'].describe()
Out[17]:
count    4.999800e+04
mean     7.293891e+03
std      1.287297e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.234568e+07
Name: price, dtype: float64

There are at least a few more columns that contain outliers, e.g. registration_year and possibly power_ps. We will deal with them below.


Exploring the date columns:

The columns we'll check first are date_crawled, last_seen, and ad_created.

In [18]:
autos['date_crawled'].describe()
Out[18]:
count                   49998
unique                  48211
top       2016-03-21 16:37:21
freq                        3
Name: date_crawled, dtype: object
In [19]:
autos['ad_created'].describe()
Out[19]:
count                   49998
unique                     76
top       2016-04-03 00:00:00
freq                     1946
Name: ad_created, dtype: object
In [20]:
autos['last_seen'].describe()
Out[20]:
count                   49998
unique                  39480
top       2016-04-07 06:17:27
freq                        8
Name: last_seen, dtype: object
In [21]:
# To include missing values in the distribution and to use 
# relative frequencies instead of counts, 
# we use the Series.value_counts(normalize=True, dropna=False) method.

(print('val_desc_date_crawled_sorted:', '\n', 
       autos['date_crawled'].str[:10]
 .value_counts(normalize=True, dropna=False)
 .sort_index(ascending=True))
)
print('\n')
# Find the few smallest shares (the very smallest belongs to NaN)
print('min_date_crawled:', '\n', 
       autos['date_crawled'].str[:10]
 .value_counts(normalize=True, dropna=False).tail())
 
print('\n')
# Find out max
print('max_date_crawled:', '\n', 
       autos['date_crawled'].str[:10]
 .value_counts(normalize=True, dropna=False).head())
val_desc_date_crawled_sorted: 
 2016-03-05    0.02538
2016-03-06    0.01394
2016-03-07    0.03596
2016-03-08    0.03328
2016-03-09    0.03322
2016-03-10    0.03212
2016-03-11    0.03248
2016-03-12    0.03678
2016-03-13    0.01556
2016-03-14    0.03662
2016-03-15    0.03398
2016-03-16    0.02950
2016-03-17    0.03152
2016-03-18    0.01306
2016-03-19    0.03490
2016-03-20    0.03782
2016-03-21    0.03752
2016-03-22    0.03292
2016-03-23    0.03238
2016-03-24    0.02910
2016-03-25    0.03174
2016-03-26    0.03248
2016-03-27    0.03104
2016-03-28    0.03484
2016-03-29    0.03418
2016-03-30    0.03362
2016-03-31    0.03192
2016-04-01    0.03380
2016-04-02    0.03540
2016-04-03    0.03868
2016-04-04    0.03652
2016-04-05    0.01310
2016-04-06    0.00318
2016-04-07    0.00142
NaN           0.00004
Name: date_crawled, dtype: float64


min_date_crawled: 
 2016-04-05    0.01310
2016-03-18    0.01306
2016-04-06    0.00318
2016-04-07    0.00142
NaN           0.00004
Name: date_crawled, dtype: float64


max_date_crawled: 
 2016-04-03    0.03868
2016-03-20    0.03782
2016-03-21    0.03752
2016-03-12    0.03678
2016-03-14    0.03662
Name: date_crawled, dtype: float64

From our date_crawled column exploration, we can conclude that:

  • the highest share of listings was crawled on 2016-04-03: about 3.87% of the data
  • the lowest share was crawled on 2016-04-07: about 0.14%
  • NaN accounts for about 0.004% of the values
In [22]:
(print('val_desc_ad_created =', '\n',
       autos['ad_created'].str[:10]
       .value_counts(normalize=True, dropna=False)
       .sort_index(ascending=True))
)
print('\n')

# Find out max 
(print('max_ad_created:', '\n',
       autos['ad_created'].str[:10]
       .value_counts(normalize=True, dropna=False)
       .head())
)
print('\n')

# Find the few smallest shares
(print('min_ad_created:', '\n',
       autos['ad_created'].str[:10]
       .value_counts(normalize=True, dropna=False)
       .tail(22).sort_index(ascending=True)) 
)  # with ".sort_index(ascending=True)" removed, we can see the ~20 smallest values directly
val_desc_ad_created = 
 2015-06-11    0.00002
2015-08-10    0.00002
2015-09-09    0.00002
2015-11-10    0.00002
2015-12-05    0.00002
               ...   
2016-04-04    0.03688
2016-04-05    0.01184
2016-04-06    0.00326
2016-04-07    0.00128
NaN           0.00004
Name: ad_created, Length: 77, dtype: float64


max_ad_created: 
 2016-04-03    0.03892
2016-03-20    0.03786
2016-03-21    0.03772
2016-04-04    0.03688
2016-03-12    0.03662
Name: ad_created, dtype: float64


min_ad_created: 
 2015-06-11    0.00002
2015-08-10    0.00002
2015-09-09    0.00002
2015-11-10    0.00002
2015-12-05    0.00002
2015-12-30    0.00002
2016-01-03    0.00002
2016-01-07    0.00002
2016-01-13    0.00002
2016-01-14    0.00002
2016-01-16    0.00002
2016-01-22    0.00002
2016-01-29    0.00002
2016-02-01    0.00002
2016-02-07    0.00002
2016-02-08    0.00002
2016-02-09    0.00004
2016-02-11    0.00002
2016-02-16    0.00002
2016-02-17    0.00002
2016-02-18    0.00004
2016-02-22    0.00002
Name: ad_created, dtype: float64

From our ad_created column exploration, we can conclude that:

  • the highest share of ads was created on 2016-04-03: about 3.89% of the data
  • the lowest shares belong to scattered dates between 2015-06 and 2016-02, each around 0.002%
  • NaN accounts for about 0.004% of the values
In [23]:
(print('val_desc_last_seen =', '\n', autos['last_seen']
       .str[:10].value_counts(normalize=True, dropna=False)
       .sort_index(ascending=True))
)
print('\n')

(print('max_last_seen =', '\n', autos['last_seen']
       .str[:10].value_counts(normalize=True, dropna=False)
       .head())
)
print('\n')

(print('min_last_seen =', '\n', autos['last_seen']
       .str[:10].value_counts(normalize=True, dropna=False)
       .tail(10))
)
val_desc_last_seen = 
 2016-03-05    0.00108
2016-03-06    0.00442
2016-03-07    0.00536
2016-03-08    0.00758
2016-03-09    0.00986
2016-03-10    0.01076
2016-03-11    0.01252
2016-03-12    0.02382
2016-03-13    0.00898
2016-03-14    0.01280
2016-03-15    0.01588
2016-03-16    0.01644
2016-03-17    0.02792
2016-03-18    0.00742
2016-03-19    0.01574
2016-03-20    0.02070
2016-03-21    0.02074
2016-03-22    0.02158
2016-03-23    0.01858
2016-03-24    0.01956
2016-03-25    0.01920
2016-03-26    0.01696
2016-03-27    0.01602
2016-03-28    0.02086
2016-03-29    0.02234
2016-03-30    0.02484
2016-03-31    0.02384
2016-04-01    0.02310
2016-04-02    0.02490
2016-04-03    0.02536
2016-04-04    0.02462
2016-04-05    0.12428
2016-04-06    0.22098
2016-04-07    0.13092
NaN           0.00004
Name: last_seen, dtype: float64


max_last_seen = 
 2016-04-06    0.22098
2016-04-07    0.13092
2016-04-05    0.12428
2016-03-17    0.02792
2016-04-03    0.02536
Name: last_seen, dtype: float64


min_last_seen = 
 2016-03-11    0.01252
2016-03-10    0.01076
2016-03-09    0.00986
2016-03-13    0.00898
2016-03-08    0.00758
2016-03-18    0.00742
2016-03-07    0.00536
2016-03-06    0.00442
2016-03-05    0.00108
NaN           0.00004
Name: last_seen, dtype: float64

From our last_seen column exploration, we can conclude that:

  • the highest share of listings was last seen on 2016-04-06: about 22.1% of the data
  • the lowest share was on 2016-03-05: about 0.11% (ignoring NaN)
  • NaN accounts for about 0.004% of the values

Now we use Series.describe() to understand the distribution of registration_year

In [24]:
autos['registration_year'].describe()
Out[24]:
count    49998.000000
mean      2005.073223
std        105.714916
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
Dealing with Incorrect Registration Year Data:

As we can see, there are cars registered far in the past and deep in the future. As expected, this column needs to be cleaned too.

  • The minimum value is 1000, before cars were invented
  • The maximum value is 9999, many years into the future

Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.

Let's count the number of listings with cars that fall outside the 1900 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic.
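A quick way to count them before deciding; a minimal sketch:

# how many listings have a registration year outside 1900-2016?
outside_range = ~autos['registration_year'].between(1900, 2016)
print(outside_range.sum(), 'listings,', round(outside_range.mean() * 100, 2), '% of the data')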

In [25]:
autos = autos[autos['registration_year'].between(1900,2016)]
autos['registration_year'].value_counts(normalize=True)
Out[25]:
2000.0    0.069837
2005.0    0.062778
1999.0    0.062445
2004.0    0.056990
2003.0    0.056782
            ...   
1939.0    0.000021
1938.0    0.000021
1931.0    0.000021
1929.0    0.000021
1927.0    0.000021
Name: registration_year, Length: 78, dtype: float64

We can see that most of the cars were registered between 1994 and 2016. Another thing worth mentioning is that three registration years account for the largest shares of offers:

  • 2000: about 7.0%
  • 2005: about 6.3%
  • 1999: about 6.2%

Exploring Price by Brand:

Now let's explore the unique values in the brand column and decide which brands to aggregate by.

In [26]:
autos['brand'].value_counts()
Out[26]:
volkswagen        10188
bmw                5284
opel               5195
mercedes_benz      4579
audi               4149
ford               3352
renault            2274
peugeot            1418
fiat               1242
seat                873
skoda               770
mazda               727
nissan              725
smart               668
citroen             668
toyota              599
sonstige_autos      526
hyundai             473
volvo               444
mini                415
mitsubishi          391
honda               377
kia                 341
alfa_romeo          318
porsche             293
suzuki              284
chevrolet           274
chrysler            176
dacia               123
daihatsu            123
jeep                108
subaru              105
land_rover           98
saab                 77
jaguar               76
trabant              75
daewoo               72
rover                65
lancia               52
lada                 29
Name: brand, dtype: int64

We will select the 10 most frequently listed car brands.

In [27]:
autos['brand'].value_counts().index[:10] # Only first 10 needed
Out[27]:
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat'],
      dtype='object')

We'll loop over the selected brands and assign each brand's mean price to a dictionary, with the brand name as the key. This gives us the mean price for each of the top 10 brands.
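For comparison, the same result can be obtained in one step with groupby; a minimal sketch, not used for the analysis below:

top_10_brands = autos['brand'].value_counts().index[:10]
mean_price_by_brand = (autos[autos['brand'].isin(top_10_brands)]
                       .groupby('brand')['price']
                       .mean()
                       .sort_values(ascending=False))
print(mean_price_by_brand)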

In [28]:
mean_price_for_top_10_cars_raw = {}

for b in autos['brand'].value_counts().index[:10]: # index[:10] -> we need just top 10 
    top_brands = autos[autos["brand"] == b]
    mean_price = top_brands['price'].mean()
    mean_price_for_top_10_cars_raw[b] = mean_price
    
for key in mean_price_for_top_10_cars_raw:
    print(key, ':', mean_price_for_top_10_cars_raw[key]) # print of unsorted dict. mean_price_for_top_10_cars_raw 
volkswagen : 6516.457597173145
bmw : 8334.645155185466
opel : 5252.61655437921
mercedes_benz : 8485.239571958942
audi : 9093.65003615329
ford : 7263.015811455847
renault : 2395.4164467897976
peugeot : 3039.4682651622
fiat : 2711.8011272141707
seat : 4296.492554410081

The dictionary gives us unsorted, hard-to-read output. Since we won't do any further analysis with this object, we'll convert it into a sorted list and print it nicely.

In [29]:
# sorted dictionary saved as a list of tuples
tuple_list = (sorted(((value, key) 
                      for (key,value) in mean_price_for_top_10_cars_raw.items())
                     , reverse=True)
             )

list_from_tuple = [list(x) for x in tuple_list] # convert tuple to list of lists

print('Mean price for top 10 most often offered cars is:', '\n')

# printing nice sorted output from tuple
for index in list_from_tuple:
    index_0 = index[1]
    index_1 = index[0]
    print(index_0, ":", int(index_1), "$") 
Mean price for top 10 most often offered cars is: 

audi : 9093 $
mercedes_benz : 8485 $
bmw : 8334 $
ford : 7263 $
volkswagen : 6516 $
opel : 5252 $
seat : 4296 $
peugeot : 3039 $
fiat : 2711 $
renault : 2395 $

There's a distinct price gap:

  • Audi, BMW and Mercedes Benz are clearly more expensive
  • Renault, Fiat and Peugeot are the least expensive
  • Ford, Volkswagen, Opel and Seat sit in between

Storing Aggregate Data in a DataFrame:

We aggregated across brands to understand mean price and observed a distinct price gap among the top brands.

For the top brands, let's use aggregation to understand the average mileage for those cars and see if there's any visible link with mean price. While our natural instinct may be to display both aggregated series objects and visually compare them, this has a few limitations:

  • it's difficult to compare more than two aggregate series objects if we want to extend to more columns
  • we can't compare more than a few rows from each series object
  • we can only sort by the index (brand name) of both series objects so we can easily make visual comparisons

Instead, we can combine the data from both series objects into a single dataframe (with a shared index) and display the dataframe directly. To do this, we'll need to learn two pandas methods:

  • pandas series constructor
  • pandas dataframe constructor
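A minimal sketch of the two constructors, using a few of the mean prices computed above:

brand_mean_prices = {'audi': 9093, 'mercedes_benz': 8485, 'bmw': 8334}  # values from the output above
bmp_series = pd.Series(brand_mean_prices)           # dict -> Series; the keys become the index
bmp_df = pd.DataFrame({'mean_price': bmp_series})   # Series -> single-column DataFrame sharing that index
print(bmp_df)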
In [30]:
top_six_brands = {}
for value in mean_price_for_top_10_cars_raw:
    if mean_price_for_top_10_cars_raw[value] >= 3039: # keep brands whose mean price is at least Peugeot's (3039)
        top_six_brands[value] = int(mean_price_for_top_10_cars_raw[value])
# test:
print("top 6 brands that represents the most crucial companies:")
for key in top_six_brands:
    print(key, ':', top_six_brands[key], "$") # print the unsorted top_six_brands dict

top_six_brands_pd_series = pd.Series(top_six_brands)
top_six_brands_pd = pd.DataFrame(top_six_brands_pd_series, 
                                  columns=['top 6 brands'])

print('\n')
mean_odometr_km = {}

for b in autos['brand'].value_counts().index[:10]: # index[:10] -> we need just top 10 
    top_brands = autos[autos["brand"] == b]
    mean_odo_km = top_brands['odometer_km'].mean()
    mean_odometr_km[b] = int(mean_odo_km)


# additional loop for printing "top 10 brands" sorted
print("top 10 of the most often offered brand's of cars")
for index in list_from_tuple:
    index_0 = index[1]
    index_1 = index[0]
    print(index_0, ":", int(index_1), "$") # printing nice sorted output from tuple


mean_odometr_km_pd_series = pd.Series(mean_odometr_km)
top_six_brands_pd["mean odometer km"] = mean_odometr_km_pd_series
top_six_brands_pd.index.name = "brand" # index name for index column

top_six_brands_pd.sort_values('mean odometer km')
top 6 brands that represents the most crucial companies:
volkswagen : 6516 $
bmw : 8334 $
opel : 5252 $
mercedes_benz : 8485 $
audi : 9093 $
ford : 7263 $
peugeot : 3039 $
seat : 4296 $


top 10 of the most often offered brand's of cars
audi : 9093 $
mercedes_benz : 8485 $
bmw : 8334 $
ford : 7263 $
volkswagen : 6516 $
opel : 5252 $
seat : 4296 $
peugeot : 3039 $
fiat : 2711 $
renault : 2395 $
Out[30]:
top 6 brands mean odometer km
brand
seat 4296 121563
ford 7263 124046
peugeot 3039 127136
volkswagen 6516 128730
opel 5252 129227
audi 9093 129287
mercedes_benz 8485 130856
bmw 8334 132434
Conclusions:

The conclusions from our research are:

  • the cheapest cars have the smallest mean odometer_km.
  • the most expensive cars also tend to have the highest mean odometer_km, and BMW and Mercedes Benz sit above Audi in mileage even though they are cheaper than Audi. There are a few possible reasons for this:
  1. BMW and Mercedes are more popular among people who drive more often
  2. BMW and Mercedes are seen as heavy-duty cars (meaning they don't break easily) that can handle more driving without needing repairs
  3. BMW and Mercedes are indeed more solid cars
  4. Audi is simply less popular - that's the only reason
  • because the cheapest cars have the smallest mean odometer_km, we can assume that point 3 is the most likely explanation.
  • Mercedes seems to be seen as the most solid premium brand of cars.
  • On the other hand, Audi seems to be seen as the most prestigious premium brand, because it's the most expensive.

Additional data cleaning and analysis:

Data cleaning next steps:
  • Identify categorical data that uses German words, translate it and map the values to their English counterparts
  • Convert the dates to be uniform numeric data, so "2016-03-21" becomes the integer 20160321.
  • See if there are particular keywords in the name column that we can extract as new columns
Analysis next steps:
  • Find the most common brand/model combinations
  • Split the odometer_km into groups, and use aggregation to see if average prices follow any patterns based on the mileage.
  • How much cheaper are cars with damage than their non-damaged counterparts?

Identify categorical data that uses German words, translate it and map the values to their English counterparts:
In [31]:
autos
Out[31]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000.0 control bus 2004.0 manuell 158.0 andere 150000.0 3.0 lpg peugeot nein 2016-03-26 00:00:00 79588.0 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500.0 control limousine 1997.0 automatik 286.0 7er 150000.0 6.0 benzin bmw nein 2016-04-04 00:00:00 71034.0 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot 8990.0 test limousine 2009.0 manuell 102.0 golf 70000.0 7.0 benzin volkswagen nein 2016-03-26 00:00:00 35394.0 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot 4350.0 control kleinwagen 2007.0 automatik 71.0 fortwo 70000.0 6.0 benzin smart nein 2016-03-12 00:00:00 33729.0 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot 1350.0 test kombi 2003.0 manuell 0.0 focus 150000.0 7.0 benzin ford nein 2016-04-01 00:00:00 39218.0 2016-04-01 14:38:50
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon privat Angebot 24900.0 control limousine 2011.0 automatik 239.0 q5 100000.0 1.0 diesel audi nein 2016-03-27 00:00:00 82131.0 2016-04-01 13:47:40
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot 1980.0 control cabrio 1996.0 manuell 75.0 astra 150000.0 5.0 benzin opel nein 2016-03-28 00:00:00 44807.0 2016-04-02 14:18:02
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot 13200.0 test cabrio 2014.0 automatik 69.0 500 5000.0 11.0 benzin fiat nein 2016-04-02 00:00:00 73430.0 2016-04-04 11:47:27
49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition privat Angebot 22900.0 control kombi 2013.0 manuell 150.0 a3 40000.0 11.0 diesel audi nein 2016-03-08 00:00:00 35683.0 2016-04-05 16:45:07
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V privat Angebot 1250.0 control limousine 1996.0 manuell 101.0 vectra 150000.0 1.0 benzin opel nein 2016-03-13 00:00:00 45897.0 2016-04-06 21:18:48

48026 rows × 19 columns

In [32]:
german_list = ['seller', 'vehicle_type', 'gearbox', 'fuel_type', 'unrepaired_damage']
for p in german_list:
    print(str(p), '\n', pd.unique(autos[p]), '\n', '\n')
seller 
 ['privat' 'gewerblich'] 
 

vehicle_type 
 ['bus' 'limousine' 'kleinwagen' 'kombi' nan 'coupe' 'suv' 'cabrio'
 'andere'] 
 

gearbox 
 ['manuell' 'automatik' nan] 
 

fuel_type 
 ['lpg' 'benzin' 'diesel' nan 'cng' 'hybrid' 'elektro' 'andere'] 
 

unrepaired_damage 
 ['nein' nan 'ja'] 
 

In [33]:
autos['seller'] = (autos['seller']
                    .map({'privat' : 'private', 'gewerblich': 'commercial'},
                         na_action='ignore')
                   )
# replace NaN to string 'no information':
autos['seller'] = autos['seller'].fillna('no information')

autos['vehicle_type'] = (autos['vehicle_type']
                    .map({'kleinwagen' : 'small car', 'andere': 'other'},
                         na_action='ignore')
                   )
# replace NaN to string 'no information':
autos['vehicle_type'] = autos['vehicle_type'].fillna('no information')

autos['gearbox'] = (autos['gearbox']
                    .map({'manuell': 'manual', 'automatik': 'automatic'},
                         na_action='ignore')
                   )
# replace NaN to string 'no information':
autos['gearbox'] = autos['gearbox'].fillna('no information')

# It could also be written like this:
# autos['gearbox'] = (autos['gearbox']
#                     .str.replace('manuell', 'manual').str.replace('automatik', 'automatic')
#                    )


autos['fuel_type'] = (autos['fuel_type']
                    .map({'benzin': 'petrol', 'elektro' : 'electro', 'andere': 'other'},
                         na_action='ignore')
                   )
# replace NaN to string 'no information':
autos['fuel_type'] = autos['fuel_type'].fillna('no information')

autos['unrepaired_damage'] = (autos['unrepaired_damage']
                    .map({'nein': 'no', 'ja': 'yes'})
                   )

# replace NaN to string 'no information':
autos['unrepaired_damage'] = autos['unrepaired_damage'].fillna('no information')

# autos #test

# test 2:
german_list = ['seller', 'vehicle_type', 'gearbox', 'fuel_type', 'unrepaired_damage']
for p in german_list:
    print(str(p), '\n', pd.unique(autos[p]), '\n', '\n')
seller 
 ['private' 'commercial'] 
 

vehicle_type 
 ['no information' 'small car' 'other'] 
 

gearbox 
 ['manual' 'automatic' 'no information'] 
 

fuel_type 
 ['no information' 'petrol' 'electro' 'other'] 
 

unrepaired_damage 
 ['no' 'no information' 'yes'] 
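
Note that Series.map() returns NaN for any value missing from the mapping dictionary, so the categories that were left out above (e.g. 'limousine', 'kombi', 'diesel', 'lpg') have been collapsed into 'no information'. If the goal were to translate every category, the maps could be extended; a sketch of fuller mappings, using standard German-to-English translations (not taken from the dataset's documentation), that would replace the partial dictionaries in the cell above:

vehicle_type_map = {'bus': 'bus', 'limousine': 'sedan', 'kleinwagen': 'small car',
                    'kombi': 'station wagon', 'coupe': 'coupe', 'suv': 'suv',
                    'cabrio': 'convertible', 'andere': 'other'}
fuel_type_map = {'benzin': 'petrol', 'diesel': 'diesel', 'lpg': 'lpg', 'cng': 'cng',
                 'hybrid': 'hybrid', 'elektro': 'electric', 'andere': 'other'}
# applied with the same pattern as above, e.g.:
# autos['vehicle_type'] = autos['vehicle_type'].map(vehicle_type_map).fillna('no information')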
 


Convert the dates to be uniform numeric data, so "2016-03-21" becomes the integer 20160321.
In [34]:
autos['date_crawled: time'] = autos['date_crawled'].str.split().str[1]
autos['date_crawled'] = (autos['date_crawled']
                         .str.replace('-', '')
                         .str.split().str[0].astype(int)
                        )

autos['ad_created: time'] = autos['ad_created'].str.split().str[1]
# because the time part of ad_created is always 00:00:00, it carries no information, so we drop this column.
autos.drop('ad_created: time', inplace=True, axis=1)

autos['ad_created'] = (autos['ad_created'].str.replace('-', '')
                       .str.split().str[0].astype(int)
                      )

autos['last_seen: time'] = autos['last_seen'].str.split().str[1]
autos['last_seen'] = (autos['last_seen']
                      .str.replace('-', '')
                      .str.split().str[0].astype(int)
                     )
autos #test
Out[34]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps ... odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen date_crawled: time last_seen: time
0 20160326 Peugeot_807_160_NAVTECH_ON_BOARD private Angebot 5000.0 control no information 2004.0 manual 158.0 ... 150000.0 3.0 no information peugeot no 20160326 79588.0 20160406 17:47:46 06:45:54
1 20160404 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik private Angebot 8500.0 control no information 1997.0 automatic 286.0 ... 150000.0 6.0 petrol bmw no 20160404 71034.0 20160406 13:38:56 14:45:08
2 20160326 Volkswagen_Golf_1.6_United private Angebot 8990.0 test no information 2009.0 manual 102.0 ... 70000.0 7.0 petrol volkswagen no 20160326 35394.0 20160406 18:57:24 20:15:37
3 20160312 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... private Angebot 4350.0 control small car 2007.0 automatic 71.0 ... 70000.0 6.0 petrol smart no 20160312 33729.0 20160315 16:58:10 03:16:28
4 20160401 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... private Angebot 1350.0 test no information 2003.0 manual 0.0 ... 150000.0 7.0 petrol ford no 20160401 39218.0 20160401 14:38:50 14:38:50
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 20160327 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon private Angebot 24900.0 control no information 2011.0 automatic 239.0 ... 100000.0 1.0 no information audi no 20160327 82131.0 20160401 14:38:19 13:47:40
49996 20160328 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... private Angebot 1980.0 control no information 1996.0 manual 75.0 ... 150000.0 5.0 petrol opel no 20160328 44807.0 20160402 10:50:25 14:18:02
49997 20160402 Fiat_500_C_1.2_Dualogic_Lounge private Angebot 13200.0 test no information 2014.0 automatic 69.0 ... 5000.0 11.0 petrol fiat no 20160402 73430.0 20160404 14:44:48 11:47:27
49998 20160308 Audi_A3_2.0_TDI_Sportback_Ambition private Angebot 22900.0 control no information 2013.0 manual 150.0 ... 40000.0 11.0 no information audi no 20160308 35683.0 20160405 19:25:42 16:45:07
49999 20160314 Opel_Vectra_1.6_16V private Angebot 1250.0 control no information 1996.0 manual 101.0 ... 150000.0 1.0 petrol opel no 20160313 45897.0 20160406 00:42:12 21:18:48

48026 rows × 21 columns

In [35]:
# We rearrange the columns so that, for example, the "date_crawled: time" column sits right after "date_crawled":


# Let's get a list so it will be faster to fill new sequence later:
# autos.columns.tolist() 

# New sequence:
autos = autos[[
               'name',
               'offer_type',
               'price',
               'abtest',
               'vehicle_type',
               'registration_year',
               'gearbox',
               'power_ps',
               'model',
               'odometer_km',
               'registration_month',
               'fuel_type',
               'brand',
               'unrepaired_damage',
               'postal_code',
               'date_crawled', 
               'date_crawled: time',
               'ad_created',
               'last_seen',
               'last_seen: time',
               'seller',
              ]]
autos
Out[35]:
name offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer_km ... fuel_type brand unrepaired_damage postal_code date_crawled date_crawled: time ad_created last_seen last_seen: time seller
0 Peugeot_807_160_NAVTECH_ON_BOARD Angebot 5000.0 control no information 2004.0 manual 158.0 andere 150000.0 ... no information peugeot no 79588.0 20160326 17:47:46 20160326 20160406 06:45:54 private
1 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik Angebot 8500.0 control no information 1997.0 automatic 286.0 7er 150000.0 ... petrol bmw no 71034.0 20160404 13:38:56 20160404 20160406 14:45:08 private
2 Volkswagen_Golf_1.6_United Angebot 8990.0 test no information 2009.0 manual 102.0 golf 70000.0 ... petrol volkswagen no 35394.0 20160326 18:57:24 20160326 20160406 20:15:37 private
3 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... Angebot 4350.0 control small car 2007.0 automatic 71.0 fortwo 70000.0 ... petrol smart no 33729.0 20160312 16:58:10 20160312 20160315 03:16:28 private
4 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... Angebot 1350.0 test no information 2003.0 manual 0.0 focus 150000.0 ... petrol ford no 39218.0 20160401 14:38:50 20160401 20160401 14:38:50 private
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon Angebot 24900.0 control no information 2011.0 automatic 239.0 q5 100000.0 ... no information audi no 82131.0 20160327 14:38:19 20160327 20160401 13:47:40 private
49996 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... Angebot 1980.0 control no information 1996.0 manual 75.0 astra 150000.0 ... petrol opel no 44807.0 20160328 10:50:25 20160328 20160402 14:18:02 private
49997 Fiat_500_C_1.2_Dualogic_Lounge Angebot 13200.0 test no information 2014.0 automatic 69.0 500 5000.0 ... petrol fiat no 73430.0 20160402 14:44:48 20160402 20160404 11:47:27 private
49998 Audi_A3_2.0_TDI_Sportback_Ambition Angebot 22900.0 control no information 2013.0 manual 150.0 a3 40000.0 ... no information audi no 35683.0 20160308 19:25:42 20160308 20160405 16:45:07 private
49999 Opel_Vectra_1.6_16V Angebot 1250.0 control no information 1996.0 manual 101.0 vectra 150000.0 ... petrol opel no 45897.0 20160314 00:42:12 20160313 20160406 21:18:48 private

48026 rows × 21 columns


Find the most common brand/model combinations:
In [36]:
autos['name'].describe()
Out[36]:
count           48026
unique          37018
top       Ford_Fiesta
freq               76
Name: name, dtype: object

The describe() method alone isn't enough; we want to know more. Let's check the most common name (brand/model combination) for each brand:
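An equivalent one-liner with groupby, shown only as a sketch for comparison with the explicit loop below:

# most frequent listing name per brand (ties broken arbitrarily)
most_common_name_per_brand = autos.groupby('brand')['name'].agg(lambda names: names.value_counts().idxmax())
print(most_common_name_per_brand)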

In [37]:
autos['brand'].value_counts().index
Out[37]:
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat', 'skoda', 'mazda', 'nissan', 'smart',
       'citroen', 'toyota', 'sonstige_autos', 'hyundai', 'volvo', 'mini',
       'mitsubishi', 'honda', 'kia', 'alfa_romeo', 'porsche', 'suzuki',
       'chevrolet', 'chrysler', 'dacia', 'daihatsu', 'jeep', 'subaru',
       'land_rover', 'saab', 'jaguar', 'trabant', 'daewoo', 'rover', 'lancia',
       'lada'],
      dtype='object')
In [38]:
common_brand_model = {}

for c in autos['brand'].value_counts().index:
    brands = autos[autos["brand"] == c]
    desc_model = brands['name'].describe()
    common_brand_model[c] = desc_model[2] # index 2 of describe() output is 'top', the most frequent name
  
print('the most common brand/model for each brand:', '\n')
for key in common_brand_model:
    print(key, ':', common_brand_model[key], '\n') # print the unsorted common_brand_model dict
the most common brand/model for each brand: 

volkswagen : Volkswagen_Golf_1.4 

bmw : BMW_316i 

opel : Opel_Corsa 

mercedes_benz : Mercedes_Benz_SLK_200_Kompressor 

audi : Audi_A4_Avant_2.0_TDI_DPF 

ford : Ford_Fiesta 

renault : Renault_Twingo 

peugeot : Peugeot_206 

fiat : Fiat_Punto 

seat : Seat_Ibiza 

skoda : Skoda_Fabia 

mazda : Mazda_MX_5_1.6i_16V 

nissan : Nissan_Micra 

smart : Smart_smart_&_passion 

citroen : Citroën_C1_1.0_Style 

toyota : Toyota_Aygo 

sonstige_autos : Dodge_RAM 

hyundai : Hyundai_i30_1.4_Classic 

volvo : Volvo_V70_2.5 

mini : MINI_Mini_One 

mitsubishi : Mitsubishi_Colt 

honda : Honda_Civic 

kia : Kia_Picanto 

alfa_romeo : Alfa_Romeo_147 

porsche : Porsche_Boxster 

suzuki : Suzuki_Swift_1.6_Sport 

chevrolet : Chevrolet_Spark_1.0_LS 

chrysler : Chrysler_Stratus_2.5_LX 

dacia : Dacia_Logan_MCV_1.6_16V_Laureate 

daihatsu : Daihatsu_Cuore 

jeep : Jeep_Grand_Cherokee_3.0_CRD_Automatik_Limited 

subaru : Subaru_Legacy_2.5_4WD_Automatik_GX 

land_rover : Land_Rover_Freelander_1.8i 

saab : Saab_9_3_2.2_TiD 

jaguar : Jaguar_XF_2.2_Diesel 

trabant : Trabant_601 

daewoo : Daewoo_Matiz 

rover : Rover_200_Limousine_abzugeben 

lancia : Lancia_Phedra_2.2_Multijet_16v_DPF 

lada : Lada_Niva 


Split the odometer_km into groups, and use aggregation to see if average prices follow any patterns based on the mileage (the cell below compares brands by mean price and mean mileage; a direct mileage grouping is sketched after the conclusions):
In [39]:
# We define a helper function that sorts a dictionary by its values:
def sort_dict(dictionary): # sorting dictionary #thx to Elena_Kosourova:)

    import operator
    dictionary = (sorted(dictionary.items(),
                         key=operator.itemgetter(1),
                         reverse=True)
                 )# reverse=True: the highest value on top
    return dict(dictionary) # without dict(), the result would just be a list of tuples


# compare brand to mean odometer:
mean_odometr_km = {}

for b in autos['brand'].value_counts().index:
    top_brands = autos[autos["brand"] == b]
    mean_odo_km = int(top_brands['odometer_km'].mean())
    mean_odometr_km[b] = int(mean_odo_km)

# print(mean_odometr_km) ##test
# print('\n')

# compare brand to mean price:

mean_price = {}

for b in autos['brand'].value_counts().index:
    top_brands = autos[autos["brand"] == b]
    mean_price_value = int(top_brands['price'].mean())
    mean_price[b] = int(mean_price_value)

# Using definition "sort_dict" on "mean_price":

mean_price_sorted = sort_dict(mean_price) 

# print('mean_price_sorted', mean_price_sorted) ## test


odometr_series = pd.Series(mean_odometr_km) # creating series from dictionary
autos_1 = pd.DataFrame(odometr_series,columns=['odometer km']) # creating new DataFrame
autos_1.index.name = "brand" # index name for index column

mean_price_series = pd.Series(mean_price) # creating series from dictionary
autos_1['mean price'] = mean_price_series # appending new column to the DataFrame
autos_mean_price = autos_1.sort_values('mean price') # sorting by the price
autos_mean_km = autos_1.sort_values('odometer km') # sorting by the odometer

# helper that returns the first and last x rows (default 4)
# note: DataFrame.append was removed in pandas 2.0; pd.concat([df.head(x), df.tail(x)]) is the modern equivalent
def head_tale(df, x=4):
    return df.head(x).append(df.tail(x))

print('first and last 4 rows for the mean price:')
print(head_tale(autos_mean_price))
print('\n')
print('first and last 4 rows for the mean odometer:')
print(head_tale(autos_mean_km))
first and last 4 rows for the mean price:
                odometer km  mean price
brand                                  
daewoo               122430        1019
rover                135615        1528
trabant               59666        1552
daihatsu             115284        1556
jeep                 127546       11434
land_rover           118010       19108
sonstige_autos        87262       39621
porsche               97457       44553


first and last 4 rows for the mean odometer:
                odometer km  mean price
brand                                  
trabant               59666        1552
dacia                 84268        5915
lada                  85517        2502
sonstige_autos        87262       39621
chrysler             133181        3229
rover                135615        1528
volvo                138355        4757
saab                 144415        3211
From the output, we can see that:
  • Trabant is the cheapest on average relative to its average mileage (which is the smallest).
  • Daewoo is cheap too, but it has roughly twice Trabant's average mileage.
  • Lada has the second-best mileage/price ratio after Trabant.
  • Porsche is the most expensive on average (who would have guessed?).
  • Jeep has the worst (highest) mileage-to-price relation: it's expensive and has a lot of mileage on the clock.
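The heading above also asks about splitting odometer_km itself into groups; a minimal sketch with pd.cut (the bin edges are an assumption, not part of the original analysis):

# group listings by mileage band and compare average prices
km_groups = pd.cut(autos['odometer_km'],
                   bins=[0, 50000, 100000, 150000],
                   labels=['0-50k km', '50k-100k km', '100k-150k km'],
                   include_lowest=True)
print(autos.groupby(km_groups)['price'].mean())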

How much cheaper are cars with damage than their non-damaged counterparts?:
In [40]:
mean_price_all_cars = autos['price'].mean()
damaged_cars = autos[autos['unrepaired_damage'] == "yes"]
undamaged_cars = autos[autos['unrepaired_damage'] == 'no']
mean_price_damaged_cars = damaged_cars['price'].mean()
mean_price_undamaged_cars = undamaged_cars['price'].mean()

# damaged_cars
# undamaged_cars
# mean_price_all_cars
print('Mean_price_damaged_cars:', int(mean_price_damaged_cars), '$')
print('\n')
print('Mean_price_undamaged_cars:', int(mean_price_undamaged_cars), '$')
print('\n')
print('The damaged cars are on average', int(mean_price_undamaged_cars - mean_price_damaged_cars), '$ cheaper than undamaged cars. ')
Mean_price_damaged_cars: 2335 $


Mean_price_undamaged_cars: 8233 $


The damaged cars are on average 5898 $ cheaper than undamaged cars. 

General conclusions:

  • The whole dataset had to be cleaned. Some string (object) columns had to be converted into int/float (numeric) columns, and some columns had to be split so the dataset would be easier to read.
  • There are 3 general segments of cars: cheap, mid-range and expensive.
  • The price of a particular brand/model depends on many factors (not only mileage and the condition of the car). Society has its own perception of value: some brands that are equal in quality to others will still be seen as less desirable, and so on.
In [ ]: