Exploring eBay Car Sales Data

Introduction

The given dataset contains information about used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The dataset was originally scraped and uploaded by another user; it was later modified and dirtied a bit, so the data needs to be cleaned before we can work with it. The original dataset isn't available on Kaggle anymore, but it can be found here.

The aim of this project is to analyze the included used car listings and draw some conclusions about the dataframe.

Importing the modules and opening the file

In [1]:
# Importing necessary modules - pandas and numpy
import pandas as pd
import numpy as np
In [2]:
# Reading the 'autos.csv' file as 'autos' 
autos = pd.read_csv('autos.csv', encoding='Latin-1')
In [3]:
# Only the first 5 rows are returned, as there's currently no need to examine all of them
autos.head() 
Out[3]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

Obtaining basic information about the dataframe

In [4]:
# Obtaining information about the dataframe
autos.info()
# Checking how many rows are present in the dataframe 
print('\nnumber of rows in df:', autos.shape[0])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

number of rows in df: 50000

Not all columns are fully populated: 'vehicleType', 'gearbox', 'model', 'fuelType', and 'notRepairedDamage' contain NaN values, since their non-null counts are below 50000, the total number of rows in the dataframe. There are 5 integer and 15 object columns - 20 columns in total.
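
The null counts can also be obtained directly; a minimal sketch:

# Counting the missing values per column (showing only columns that contain NaNs)
null_counts = autos.isnull().sum()
print(null_counts[null_counts > 0])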

In [5]:
# Getting an array of existing column names
autos.columns
Out[5]:
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

Editing column names

In [6]:
# Editing column names and converting them from camelcase to snakecase
autos.rename({'dateCrawled': 'date_crawled', 'offerType': 'offer_type',
              'vehicleType': 'vehicle_type', 'yearOfRegistration': 'registration_year',
              'powerPS': 'power_ps', 'monthOfRegistration': 'registration_month',
              'fuelType': 'fuel_type', 'notRepairedDamage': 'unrepaired_damage',
              'dateCreated': 'ad_created', 'nrOfPictures': 'nr_of_pictures',
              'postalCode': 'postal_code', 'lastSeen': 'last_seen'},
             axis=1, inplace=True)
In [7]:
# Making sure that the columns have been converted to snakecase
autos.columns
Out[7]:
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
In [8]:
# Making sure that the renamed columns are in place and the rest of the dataframe structure is unchanged
autos.head()
Out[8]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

Column labels were converted from camelcase to snakecase to make the data more readable; some column names ('yearOfRegistration', 'monthOfRegistration', 'notRepairedDamage', 'dateCreated') were also reworded to make them easier to understand and to avoid the confusion they could cause.
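
For reference, a generic camelcase-to-snakecase conversion can also be sketched with a regular expression. This is only a hypothetical alternative: it cannot produce the custom names chosen above (e.g. 'ad_created' from 'dateCreated'), which is why the explicit mapping was used.

import re

def to_snake_case(name):
    # Insert '_' before every uppercase letter (except a leading one), then lowercase
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

print(to_snake_case('yearOfRegistration'))  # year_of_registration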

Working with column values

In [9]:
# Looking at descriptive statistics for all columns
autos.describe(include='all')
Out[9]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-12 16:06:22 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

Results analysis:

1) 'nr_of_pictures' contains only the value 0 (its mean, std, min, and max are all 0), so it can be dropped. 'seller' and 'offer_type' each have two values, but one of them occurs 49999 times and the other only once, so they are candidates to be dropped as well - they contain almost no variation (see the quick check after this list)

2) 'price' and 'odometer' store numeric data as text, which prevents us from analyzing them as numbers (we cannot compute their min, max, or std)

3) 'registration_year', 'power_ps', 'registration_month', and 'postal_code' need more investigation, as their numbers of unique values show up as NaN. This is because their dtype is 'int64', and the '.describe()' method does not report unique counts for numeric columns
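
A quick check of the near-constant columns mentioned in 1), using the renamed columns - a small sketch:

# Printing the value counts of the suspect columns
for col in ['seller', 'offer_type', 'nr_of_pictures']:
    print(autos[col].value_counts(), '\n')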

Now that the results above have been examined, we can optimize the dataframe:

1) Removing statistically insignificant columns

In [10]:
# Removing the 'nr_of_pictures', 'seller', 'offer_type' columns
autos.drop(columns = ['nr_of_pictures','seller','offer_type'], inplace=True)
In [11]:
# Making sure that the mentioned columns have been removed
autos.shape[1]
Out[11]:
17

2) Converting text into numeric data
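
Both conversions below follow the same pattern: strip the literal non-numeric characters from the strings, then cast the result to an integer dtype. A minimal sketch of that pattern as a reusable helper (the cells below apply the same steps inline):

def strip_to_int(series, chars):
    # Remove each literal character, then cast what remains to int64
    for ch in chars:
        series = series.str.replace(ch, '', regex=False)
    return series.astype('int64')

# e.g. strip_to_int(autos['price'], ['$', ','])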

Converting 'price'

In [12]:
# Understanding what non-numeric characters are present in the 'price' column
autos['price'].head()
Out[12]:
0    $5,000
1    $8,500
2    $8,990
3    $4,350
4    $1,350
Name: price, dtype: object
In [13]:
# Removing the non-numeric characters (',' and '$') from the 'price' column
# and converting it to a numeric dtype; regex=False makes the replacements
# literal ('$' would otherwise be a regex anchor)
autos['price'] = (autos['price']
                  .str.replace(',', '', regex=False)
                  .str.replace('$', '', regex=False)
                  .astype('int64'))
In [14]:
# Making sure that non-numeric values have been removed
autos['price'].head()
Out[14]:
0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64

Converting 'odometer' and renaming it

In [15]:
# Understanding what non-numeric characters are present in the 'odometer' column
autos['odometer'].head() 
Out[15]:
0    150,000km
1    150,000km
2     70,000km
3     70,000km
4    150,000km
Name: odometer, dtype: object
In [16]:
# Removing non-numeric values ('km' and ',') from the 'odometer' column and converting it to a numeric dtype
autos['odometer'] = autos['odometer'].str.replace('km','').str.replace(',','').astype('int64')
In [17]:
# Making sure that non-numeric values have been removed
autos['odometer'].head()
Out[17]:
0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer, dtype: int64
In [18]:
# Renaming the column from 'odometer' to 'odometer_km'
autos.rename({'odometer':'odometer_km'}, axis=1, inplace=True)
In [19]:
# Making sure that the column has been renamed
autos.columns
Out[19]:
Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type',
       'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km',
       'registration_month', 'fuel_type', 'brand', 'unrepaired_damage',
       'ad_created', 'postal_code', 'last_seen'],
      dtype='object')

3) Describing numeric columns

Now we can find out how many unique values the numeric columns contain. We need to make sure that each of them takes more than one value across the rows - otherwise the column would be statistically insignificant and, consequently, would be removed from the dataframe

In [20]:
# Calculating the number of unique values for each numeric column
print('Unique values for numeric columns:')
print('registration_year:', *autos['registration_year'].unique().shape)
print('power_ps:', *autos['power_ps'].unique().shape)
print('registration_month:', *autos['registration_month'].unique().shape)
print('postal_code:', *autos['postal_code'].unique().shape)
Unique values for numeric columns:
registration_year: 97
power_ps: 448
registration_month: 13
postal_code: 7014

As we can see, all four numeric columns have plenty of unique values, which confirms that they are statistically significant and cannot be dropped
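
Equivalently, Series.nunique() returns these counts directly; a small sketch:

# Counting unique values per numeric column with nunique()
for col in ['registration_year', 'power_ps', 'registration_month', 'postal_code']:
    print(col + ':', autos[col].nunique())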

Dealing with outliers

Before removing the outliers from the 'odometer_km' and 'price' columns, we describe these columns (outliers included) so that we can later compare the statistics with their outlier-free counterparts

In [21]:
# Observing the values present in 'odometer_km' and 'price' columns before the outliers are removed
outliers_table = autos[['odometer_km', 'price']].describe()
outliers_table
Out[21]:
odometer_km price
count 50000.000000 5.000000e+04
mean 125732.700000 9.840044e+03
std 40042.211706 4.811044e+05
min 5000.000000 0.000000e+00
25% 125000.000000 1.100000e+03
50% 150000.000000 2.950000e+03
75% 150000.000000 7.200000e+03
max 150000.000000 1.000000e+08

Now, to remove the outliers (if there are any), we identify them with the 1.5×IQR rule. The rule keeps only the values inside the interval [Q1 - 1.5×IQR; Q3 + 1.5×IQR], where Q1 is the lower quartile (0.25 quantile), Q3 is the upper quartile (0.75 quantile), and IQR = Q3 - Q1. Quantiles can be computed with Series.quantile(q), where q ∈ [0, 1].
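
A minimal sketch of the rule as a reusable helper (the cells below apply the same logic inline to 'odometer_km' and 'price'):

def remove_outliers_iqr(df, col):
    # Keep only the rows whose values of 'col' fall inside [Q1 - 1.5*IQR; Q3 + 1.5*IQR]
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1
    return df[df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]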

Removing outliers from 'odometer_km'

In [22]:
# Removing the outliers from the 'odometer_km' column using the 1.5*IQR rule
q1 = autos['odometer_km'].quantile(0.25)
q3 = autos['odometer_km'].quantile(0.75)
iqr = q3 - q1
odometer_no_outliers = autos[autos['odometer_km'].between(q1 - 1.5*iqr, q3 + 1.5*iqr)]
In [23]:
# Comparing the number of rows with and without outliers
print('df with potential outliers:', *autos['odometer_km'].shape)
print('df without outliers:', *odometer_no_outliers['odometer_km'].shape) 
df with potential outliers: 50000
df without outliers: 41520

There were outliers in the 'odometer_km' column: the dataframe without outliers has 41520 rows, fewer than the 50000 of the dataframe with potential outliers. We can therefore replace 'autos' with 'odometer_no_outliers' - the same dataframe, but without the outlier rows in 'odometer_km'

In [24]:
# Removing the rows with the outliers in the 'odometer_km' column
autos = odometer_no_outliers
In [25]:
# Making sure that the number of rows in the new 'autos' version is less than 50000
autos.shape[0]
Out[25]:
41520

Removing outliers from 'price'

In [26]:
# Removing the outliers from the 'price' column using the 1.5*IQR rule
q1 = autos['price'].quantile(0.25)
q3 = autos['price'].quantile(0.75)
iqr = q3 - q1
price_no_outliers = autos[autos['price'].between(q1 - 1.5*iqr, q3 + 1.5*iqr)]
In [27]:
# Comparing the number of rows with and without outliers
print('df with potential outliers:', *autos['price'].shape)
print('df without outliers:', *price_no_outliers['price'].shape) 
df with potential outliers: 41520
df without outliers: 38386

There were outliers in the 'price' column: the dataframe without outliers has 38386 rows, fewer than the 41520 of the dataframe with potential outliers. We can therefore replace 'autos' with 'price_no_outliers' - the same dataframe, but without the outlier rows in 'price'

In [28]:
# Removing the rows with the outliers in the 'price' column
autos = price_no_outliers
In [29]:
# Making sure that the number of rows in the new 'autos' version is less than 41520
autos.shape[0]
Out[29]:
38386

Comparing the columns with and without outliers

Now we can compare the values of the columns with outliers included with the values of the same columns without outliers:

In [30]:
# Printing the table with outliers
outliers_table
Out[30]:
odometer_km price
count 50000.000000 5.000000e+04
mean 125732.700000 9.840044e+03
std 40042.211706 4.811044e+05
min 5000.000000 0.000000e+00
25% 125000.000000 1.100000e+03
50% 150000.000000 2.950000e+03
75% 150000.000000 7.200000e+03
max 150000.000000 1.000000e+08
In [31]:
# Observing the values left in 'odometer_km' and 'price' after removing outliers
autos[['odometer_km', 'price']].describe()
Out[31]:
odometer_km price
count 38386.000000 38386.000000
mean 142674.412546 3127.445136
std 16147.477812 2872.206674
min 90000.000000 0.000000
25% 150000.000000 950.000000
50% 150000.000000 2131.000000
75% 150000.000000 4500.000000
max 150000.000000 12250.000000

After the outliers were removed, 'odometer_km' changed noticeably. Its smallest values were outliers, so their removal led to a smaller standard deviation (16147 instead of 40042) and a larger minimum value (90000 instead of 5000).

The 'price' column has changed significantly as well: its standard deviation is now 2872 instead of 481104, and its maximum value is now 12250 instead of 1.0e+08 (100 million)

Calculating the distribution of values in columns concerning ads

Three columns ('date_crawled', 'ad_created', and 'last_seen') describe the ads themselves rather than the cars. Analyzing them gives a brief picture of when the ads were discovered by the crawler, when they were created, and when they went inactive.
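
The timestamps are strings like '2016-03-26 17:47:46', so the first 10 characters give the date; the cells below rely on that slicing. An equivalent sketch using proper datetime parsing (assuming the 'YYYY-MM-DD HH:MM:SS' format seen in the outputs above):

# Parsing the timestamps instead of slicing the strings
crawl_dates = pd.to_datetime(autos['date_crawled']).dt.date
print(crawl_dates.value_counts(normalize=True).sort_index().head())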

In [32]:
# Calculating the distribution of values in 'date_crawled'
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()  
Out[32]:
2016-03-05    0.024697
2016-03-06    0.013911
2016-03-07    0.036628
2016-03-08    0.034414
2016-03-09    0.033736
2016-03-10    0.032642
2016-03-11    0.032017
2016-03-12    0.037540
2016-03-13    0.015058
2016-03-14    0.037383
2016-03-15    0.034127
2016-03-16    0.030037
2016-03-17    0.032121
2016-03-18    0.013234
2016-03-19    0.033398
2016-03-20    0.037514
2016-03-21    0.037488
2016-03-22    0.032564
2016-03-23    0.032538
2016-03-24    0.028630
2016-03-25    0.032199
2016-03-26    0.032069
2016-03-27    0.030819
2016-03-28    0.034674
2016-03-29    0.033867
2016-03-30    0.034127
2016-03-31    0.031991
2016-04-01    0.033267
2016-04-02    0.034492
2016-04-03    0.037800
2016-04-04    0.036862
2016-04-05    0.013442
2016-04-06    0.003569
2016-04-07    0.001146
Name: date_crawled, dtype: float64

All the data was crawled within a single year (2016) and within only two months: most of it throughout March, the rest in early April.

In [33]:
# Calculating the distribution of values in 'ad_created'
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index() 
Out[33]:
2015-12-05    0.000026
2015-12-30    0.000026
2016-01-07    0.000026
2016-01-10    0.000026
2016-01-13    0.000026
2016-01-16    0.000026
2016-01-22    0.000026
2016-01-27    0.000052
2016-01-29    0.000026
2016-02-01    0.000026
2016-02-02    0.000052
2016-02-05    0.000026
2016-02-08    0.000026
2016-02-09    0.000026
2016-02-11    0.000026
2016-02-12    0.000078
2016-02-16    0.000026
2016-02-18    0.000052
2016-02-19    0.000026
2016-02-20    0.000052
2016-02-21    0.000078
2016-02-22    0.000026
2016-02-23    0.000026
2016-02-25    0.000052
2016-02-26    0.000026
2016-02-27    0.000130
2016-02-28    0.000208
2016-02-29    0.000130
2016-03-01    0.000130
2016-03-02    0.000078
                ...   
2016-03-09    0.033632
2016-03-10    0.032434
2016-03-11    0.032329
2016-03-12    0.037436
2016-03-13    0.016490
2016-03-14    0.035820
2016-03-15    0.033919
2016-03-16    0.030480
2016-03-17    0.031835
2016-03-18    0.013989
2016-03-19    0.032382
2016-03-20    0.037462
2016-03-21    0.037644
2016-03-22    0.032642
2016-03-23    0.032356
2016-03-24    0.028474
2016-03-25    0.032382
2016-03-26    0.032095
2016-03-27    0.030714
2016-03-28    0.034648
2016-03-29    0.034023
2016-03-30    0.033893
2016-03-31    0.031861
2016-04-01    0.033424
2016-04-02    0.034075
2016-04-03    0.038087
2016-04-04    0.037488
2016-04-05    0.012010
2016-04-06    0.003595
2016-04-07    0.001042
Name: ad_created, Length: 66, dtype: float64

The first ad was created in December 2015 and the last one in April 2016, so all the ads were created within a span of roughly four months. The vast majority of them were created in March and early April of 2016.

In [34]:
# Calculating the distribution of values in 'last_seen'
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index() 
Out[34]:
2016-03-05    0.001094
2016-03-06    0.005002
2016-03-07    0.006174
2016-03-08    0.008701
2016-03-09    0.010707
2016-03-10    0.011384
2016-03-11    0.014120
2016-03-12    0.026051
2016-03-13    0.009821
2016-03-14    0.013755
2016-03-15    0.016985
2016-03-16    0.017871
2016-03-17    0.030141
2016-03-18    0.007685
2016-03-19    0.017090
2016-03-20    0.022300
2016-03-21    0.022091
2016-03-22    0.022638
2016-03-23    0.019434
2016-03-24    0.021049
2016-03-25    0.020320
2016-03-26    0.017845
2016-03-27    0.017064
2016-03-28    0.022326
2016-03-29    0.023368
2016-03-30    0.025712
2016-03-31    0.024879
2016-04-01    0.024488
2016-04-02    0.025817
2016-04-03    0.025634
2016-04-04    0.025426
2016-04-05    0.116944
2016-04-06    0.205309
2016-04-07    0.120773
Name: last_seen, dtype: float64

All the ads were last seen within about a month, between 2016-03-05 and 2016-04-07; no ad was seen online by the crawler after 2016-04-07. The disproportionately large shares on the last three days most likely mark the end of the crawl period rather than a sudden spike in sales.

Analyzing 'registration_year'

In [35]:
# Describing the distribution of 'registration_year'
autos['registration_year'].describe() 
Out[35]:
count    38386.000000
mean      2002.499557
std         36.331672
min       1910.000000
25%       1999.000000
50%       2002.000000
75%       2006.000000
max       9000.000000
Name: registration_year, dtype: float64

The largest values of the series lie above the third quartile (75% = 2006), and the maximum, 9000, is clearly invalid - most of the values fall within the [1910; 2006] interval.

It is known, however, that values before 1903 cannot be taken into consideration, because the first car registrations took place in 1903. A car also cannot be registered in the future, so 2020 is taken as a generous upper limit. These facts give the interval [1903; 2020], and the rows whose 'registration_year' falls outside it need to be removed.
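
Before filtering, a quick sketch to count how many rows fall outside the interval and will therefore be dropped:

# Rows whose 'registration_year' lies outside [1903; 2020]
outside = ~autos['registration_year'].between(1903, 2020)
print('rows to remove:', outside.sum())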

In [36]:
# Distribution of registration years within the plausible interval
autos['registration_year'][autos['registration_year'].between(1903,2020)].value_counts(normalize=True)
Out[36]:
2000    0.080839
1999    0.074144
2005    0.068699
2001    0.065963
2003    0.065911
2004    0.064635
2002    0.061456
1998    0.060049
2006    0.057965
1997    0.050410
2007    0.044835
2008    0.036889
1996    0.035300
2017    0.032461
1995    0.031002
2016    0.029725
2009    0.027354
1994    0.015553
2010    0.014771
2018    0.010812
1993    0.010655
2011    0.009222
1992    0.008858
1990    0.008363
1991    0.008180
2012    0.004012
1989    0.003960
1988    0.002918
1985    0.001902
1987    0.001615
          ...   
1970    0.000391
1960    0.000391
1973    0.000391
1972    0.000365
1976    0.000339
1977    0.000287
1975    0.000208
2014    0.000208
1965    0.000182
1968    0.000156
1974    0.000156
2015    0.000130
1966    0.000130
1967    0.000104
1971    0.000104
1969    0.000104
1910    0.000052
1956    0.000052
1961    0.000052
1958    0.000052
1964    0.000052
1963    0.000052
2019    0.000052
1962    0.000026
1937    0.000026
1934    0.000026
1959    0.000026
1954    0.000026
1953    0.000026
1950    0.000026
Name: registration_year, Length: 69, dtype: float64

The largest share of registrations belongs to the 2000s and the late 1990s; the smallest belongs to the first half of the twentieth century (the 1930s through the 1950s).

In [37]:
# Removing the rows where the 'registration_year' values are outside the interval
autos = autos.loc[autos['registration_year'].between(1903,2020)]

Analyzing the 15 most sold brands

A decision was made to focus on the 15 most common brands in the listings; as shown below, 'volkswagen' is the most common of them and 'toyota' the least common.

In [38]:
# Picking 15 most common values in the 'brand' column
autos['brand'].value_counts().head(15) 
Out[38]:
volkswagen       8588
opel             4700
bmw              4052
mercedes_benz    3434
audi             3045
ford             2790
renault          2071
peugeot          1229
fiat             1007
seat              743
mazda             624
nissan            568
citroen           546
skoda             525
toyota            457
Name: brand, dtype: int64
In [39]:
# Creating a dictionary for mean price 
aggregate_price = {}
top_brands = autos['brand'].value_counts().head(15).index
for brand in top_brands:
    sel_row = autos[autos['brand']==brand]
    mean_row = sel_row['price'].mean()
    aggregate_price[brand] = mean_row
In [40]:
aggregate_price
Out[40]:
{'audi': 4322.562889983579,
 'bmw': 4476.98198420533,
 'citroen': 2698.168498168498,
 'fiat': 1843.119165839126,
 'ford': 2243.584946236559,
 'mazda': 2677.7596153846152,
 'mercedes_benz': 4145.680838672102,
 'nissan': 2648.906690140845,
 'opel': 2096.4821276595744,
 'peugeot': 2303.6102522375913,
 'renault': 1713.7469821342347,
 'seat': 2730.265141318977,
 'skoda': 4192.601904761905,
 'toyota': 3775.301969365427,
 'volkswagen': 3180.179902189101}
In [41]:
# Sorting the dictionary by its values in descending order
sorted(aggregate_price.items(), key=lambda x: x[1], reverse=True) 
Out[41]:
[('bmw', 4476.98198420533),
 ('audi', 4322.562889983579),
 ('skoda', 4192.601904761905),
 ('mercedes_benz', 4145.680838672102),
 ('toyota', 3775.301969365427),
 ('volkswagen', 3180.179902189101),
 ('seat', 2730.265141318977),
 ('citroen', 2698.168498168498),
 ('mazda', 2677.7596153846152),
 ('nissan', 2648.906690140845),
 ('peugeot', 2303.6102522375913),
 ('ford', 2243.584946236559),
 ('opel', 2096.4821276595744),
 ('fiat', 1843.119165839126),
 ('renault', 1713.7469821342347)]

'bmw' has the highest mean price among the top 15 brands, followed by 'audi' and 'skoda'. The lowest mean price among these brands belongs to 'renault'.
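
For reference, the same mean-price aggregation can be sketched with groupby as an alternative to the explicit loop used above:

# Mean price per brand, restricted to the 15 most common brands
top_brands = autos['brand'].value_counts().head(15).index
print(autos[autos['brand'].isin(top_brands)]
      .groupby('brand')['price'].mean()
      .sort_values(ascending=False))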

In [42]:
# Creating a dictionary for mean mileage
aggregate_mileage = {}
for brand in autos['brand'].value_counts().head(15).index:
    sel_row = autos[autos['brand'] == brand]
    mean_mileage = sel_row['odometer_km'].mean() 
    aggregate_mileage[brand] = mean_mileage
aggregate_mileage    
Out[42]:
{'audi': 146834.15435139573,
 'bmw': 145826.75222112538,
 'citroen': 138095.2380952381,
 'fiat': 137120.1588877855,
 'ford': 140992.83154121865,
 'mazda': 141153.84615384616,
 'mercedes_benz': 145789.16715200932,
 'nissan': 138345.0704225352,
 'opel': 142320.21276595743,
 'peugeot': 140537.02196908055,
 'renault': 141144.37469821342,
 'seat': 140343.20323014804,
 'skoda': 137285.7142857143,
 'toyota': 137133.47921225382,
 'volkswagen': 144168.60735910572}
In [43]:
# Creating series for mean price
bmp_series = pd.Series(aggregate_price) 
bmp_series
Out[43]:
audi             4322.562890
bmw              4476.981984
citroen          2698.168498
fiat             1843.119166
ford             2243.584946
mazda            2677.759615
mercedes_benz    4145.680839
nissan           2648.906690
opel             2096.482128
peugeot          2303.610252
renault          1713.746982
seat             2730.265141
skoda            4192.601905
toyota           3775.301969
volkswagen       3180.179902
dtype: float64
In [44]:
# Creating series for mean mileage
bmm_series = pd.Series(aggregate_mileage)
bmm_series
Out[44]:
audi             146834.154351
bmw              145826.752221
citroen          138095.238095
fiat             137120.158888
ford             140992.831541
mazda            141153.846154
mercedes_benz    145789.167152
nissan           138345.070423
opel             142320.212766
peugeot          140537.021969
renault          141144.374698
seat             140343.203230
skoda            137285.714286
toyota           137133.479212
volkswagen       144168.607359
dtype: float64
In [45]:
# Assigning the values of mean price to df
df = pd.DataFrame(bmp_series, columns=['mean_price'])
df
Out[45]:
mean_price
audi 4322.562890
bmw 4476.981984
citroen 2698.168498
fiat 1843.119166
ford 2243.584946
mazda 2677.759615
mercedes_benz 4145.680839
nissan 2648.906690
opel 2096.482128
peugeot 2303.610252
renault 1713.746982
seat 2730.265141
skoda 4192.601905
toyota 3775.301969
volkswagen 3180.179902
In [46]:
# Inserting the values of mean mileage as a new column
df.insert(1, 'mean_mileage', bmm_series)
df
Out[46]:
mean_price mean_mileage
audi 4322.562890 146834.154351
bmw 4476.981984 145826.752221
citroen 2698.168498 138095.238095
fiat 1843.119166 137120.158888
ford 2243.584946 140992.831541
mazda 2677.759615 141153.846154
mercedes_benz 4145.680839 145789.167152
nissan 2648.906690 138345.070423
opel 2096.482128 142320.212766
peugeot 2303.610252 140537.021969
renault 1713.746982 141144.374698
seat 2730.265141 140343.203230
skoda 4192.601905 137285.714286
toyota 3775.301969 137133.479212
volkswagen 3180.179902 144168.607359
In [47]:
# Sorting the dataframe by mean price in descending order
df.sort_values('mean_price', ascending=False)
Out[47]:
mean_price mean_mileage
bmw 4476.981984 145826.752221
audi 4322.562890 146834.154351
skoda 4192.601905 137285.714286
mercedes_benz 4145.680839 145789.167152
toyota 3775.301969 137133.479212
volkswagen 3180.179902 144168.607359
seat 2730.265141 140343.203230
citroen 2698.168498 138095.238095
mazda 2677.759615 141153.846154
nissan 2648.906690 138345.070423
peugeot 2303.610252 140537.021969
ford 2243.584946 140992.831541
opel 2096.482128 142320.212766
fiat 1843.119166 137120.158888
renault 1713.746982 141144.374698
In [48]:
# Calculating correlation between the two columns
df['mean_price'].corr(df['mean_mileage'])
Out[48]:
0.43168914135180714

The correlation between 'mean_price' and 'mean_mileage' is about 0.43, which is generally considered weak. The two characteristics are therefore only weakly related: a brand's mean mileage tells us little about its mean price (and vice versa).
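
An optional visual check of this weak relationship, assuming matplotlib is available in the environment:

# Scatter plot of mean mileage vs. mean price for the 15 brands
import matplotlib.pyplot as plt
df.plot.scatter(x='mean_mileage', y='mean_price')
plt.show()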

General conclusions:

  • There were 3 statistically insignificant columns ('seller', 'offer_type', and 'nr_of_pictures');
  • Removing the 1.5×IQR outliers from 'odometer_km' dropped 50000-41520=8480 rows, and removing those from 'price' dropped a further 41520-38386=3134 rows;
  • All the ads were first crawled and last seen in March and April of 2016, and most of the ads were created in that same period;
  • Mean price of a car brand and its mean mileage are weakly correlated.