Exploring eBay Car Sales Data: Understanding the Average Price of Cars for Each Popular Brand

In this project I work with a dataset containing information on eBay used car sales in Germany. I will explore the dataset, find the most popular car brands and look for patterns in relation to the average car price of each popular brand. The patterns I am interested in are covered distance, presence of unrepaired damage and average car age for each popular brand. In other words, I am going to analyse whether the average price for a brand moves together with the brand's other average characteristics. This will also help to understand whether some natural expectations of buyers and/or sellers about used car prices can be trusted.

You can download the dataset here.

Reading & Exploring data

Reading data

Pandas is the main module I use for the analysis. In addition, I imported NumPy and created a print_2n function; the latter prints 2 empty lines before each printed object. Numbers in tables will be printed with 3 digits after the decimal point.

In [31]:
import pandas as pd
import numpy as np
pd.set_option('display.float_format', lambda x: '%.3f' % x)

def print_2n(*args):
    for arg in args:
        print("\n"*2, arg)
        
autos = pd.read_csv('autos.csv', encoding="Latin-1")
autos
Out[31]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 26.3.16 17:47 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 26.3.16 00:00 0 79588 6.4.16 06:45
1 4.4.16 13:38 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 4.4.16 00:00 0 71034 6.4.16 14:45
2 26.3.16 18:57 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 26.3.16 00:00 0 35394 6.4.16 20:15
3 12.3.16 16:58 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 12.3.16 00:00 0 33729 15.3.16 03:16
4 1.4.16 14:38 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 1.4.16 00:00 0 39218 1.4.16 14:38
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 27.3.16 14:38 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon privat Angebot $24,900 control limousine 2011 automatik 239 q5 100,000km 1 diesel audi nein 27.3.16 00:00 0 82131 1.4.16 13:47
49996 28.3.16 10:50 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot $1,980 control cabrio 1996 manuell 75 astra 150,000km 5 benzin opel nein 28.3.16 00:00 0 44807 2.4.16 14:18
49997 2.4.16 14:44 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot $13,200 test cabrio 2014 automatik 69 500 5,000km 11 benzin fiat nein 2.4.16 00:00 0 73430 4.4.16 11:47
49998 8.3.16 19:25 Audi_A3_2.0_TDI_Sportback_Ambition privat Angebot $22,900 control kombi 2013 manuell 150 a3 40,000km 11 diesel audi nein 8.3.16 00:00 0 35683 5.4.16 16:45
49999 14.3.16 00:42 Opel_Vectra_1.6_16V privat Angebot $1,250 control limousine 1996 manuell 101 vectra 150,000km 1 benzin opel nein 13.3.16 00:00 0 45897 6.4.16 21:18

50000 rows × 20 columns

As we can see from the result of the code above, there are 50,000 rows and 20 columns in the dataset.

Exploring data

In order to see all columns, their types and the number of non-null values in each column, I use the info() method. In addition, printing the first 5 rows helps to comprehend what type of data we are dealing with.

In [32]:
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
Out[32]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 26.3.16 17:47 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 26.3.16 00:00 0 79588 6.4.16 06:45
1 4.4.16 13:38 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 4.4.16 00:00 0 71034 6.4.16 14:45
2 26.3.16 18:57 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 26.3.16 00:00 0 35394 6.4.16 20:15
3 12.3.16 16:58 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 12.3.16 00:00 0 33729 15.3.16 03:16
4 1.4.16 14:38 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 1.4.16 00:00 0 39218 1.4.16 14:38

Only 5 of the 20 columns are stored as an integer type; most columns are stored as strings.
The majority of columns have no null values. In the minority of columns where null values exist, their share is less than 20% per column.
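The per-column null shares quoted above can be computed in one line with `isnull().mean()`. A minimal sketch on a toy frame (the real check would run on `autos`):

```python
import pandas as pd

# Toy frame standing in for `autos`; two nulls out of four in one column.
df = pd.DataFrame({
    "vehicleType": ["bus", None, "kombi", None],
    "brand": ["peugeot", "bmw", "ford", "opel"],
})

# isnull() gives a boolean frame; the per-column mean is the null share.
null_share_pct = df.isnull().mean() * 100
print(null_share_pct)
```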

Cleaning dataset

Changing column labels

Below I printed the column names. From the result you can see that the column labels are in camelcase and some labels are unnecessarily long. I renamed them to snakecase and shortened some labels. Besides that, I added 'km' to the 'odometer' column label and 'usd' to the 'price' column label. The added information ('km' & 'usd') is present in all values of both corresponding columns, as you can see if you look closely at the data printed by the cells above. I will delete those characters later, when I convert the columns to a numeric type. This way the information that prices are in US dollars and covered distance is in kilometers will be preserved in the column labels.

In [33]:
print(autos.columns)
col_dict = {'dateCrawled': 'date_crawled', 'name': 'name', 'seller': 'seller',
            'offerType': 'offer_type', 'price': 'price_usd', 'abtest': 'ab_test',
            'vehicleType': 'vehicle_type', 'yearOfRegistration': 'registration_year',
            'gearbox': 'gearbox', 'powerPS': 'power_ps', 'model': 'model',
            'odometer': 'odometer_km', 'monthOfRegistration': 'registration_month',
            'fuelType': 'fuel_type', 'brand': 'brand','notRepairedDamage': 'unrepaired_damage',
            'dateCreated': 'ad_created', 'nrOfPictures': 'nr_of_pictures',
            'postalCode': 'postal_code','lastSeen': 'last_seen'}
autos.rename(columns=col_dict, inplace=True)
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')
Out[33]:
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price_usd', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
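An explicit rename dictionary was used because some labels also change meaning ('price' → 'price_usd'). For the pure camelcase-to-snakecase part, a regex would do the same job; a sketch (not the notebook's approach):

```python
import re

def camel_to_snake(label):
    # Insert '_' before every uppercase letter that is not at the start,
    # then lowercase the whole label.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', label).lower()

print(camel_to_snake('yearOfRegistration'))  # year_of_registration
print(camel_to_snake('notRepairedDamage'))   # not_repaired_damage
```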

Cleaning the 'odometer_km', 'price_usd' & 'unrepaired_damage' columns

As mentioned above, I will delete the special characters from the 'odometer_km' and 'price_usd' columns and then convert them to numeric data types. The main purpose is to use the resulting float kilometer and price values to calculate averages for the most frequently represented brands in the dataset.

In [34]:
autos['odometer_km'] = autos['odometer_km'].str.replace('km', '').str.replace(',', '').astype(float)
autos['price_usd'] = autos['price_usd'].str.replace('$', '').str.replace(',', '').astype(float)
autos.head(10)
Out[34]:
date_crawled name seller offer_type price_usd ab_test vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 26.3.16 17:47 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000.000 control bus 2004 manuell 158 andere 150000.000 3 lpg peugeot nein 26.3.16 00:00 0 79588 6.4.16 06:45
1 4.4.16 13:38 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500.000 control limousine 1997 automatik 286 7er 150000.000 6 benzin bmw nein 4.4.16 00:00 0 71034 6.4.16 14:45
2 26.3.16 18:57 Volkswagen_Golf_1.6_United privat Angebot 8990.000 test limousine 2009 manuell 102 golf 70000.000 7 benzin volkswagen nein 26.3.16 00:00 0 35394 6.4.16 20:15
3 12.3.16 16:58 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot 4350.000 control kleinwagen 2007 automatik 71 fortwo 70000.000 6 benzin smart nein 12.3.16 00:00 0 33729 15.3.16 03:16
4 1.4.16 14:38 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot 1350.000 test kombi 2003 manuell 0 focus 150000.000 7 benzin ford nein 1.4.16 00:00 0 39218 1.4.16 14:38
5 21.3.16 13:47 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... privat Angebot 7900.000 test bus 2006 automatik 150 voyager 150000.000 4 diesel chrysler NaN 21.3.16 00:00 0 22962 6.4.16 09:45
6 20.3.16 17:55 VW_Golf_III_GT_Special_Electronic_Green_Metall... privat Angebot 300.000 test limousine 1995 manuell 90 golf 150000.000 8 benzin volkswagen NaN 20.3.16 00:00 0 31535 23.3.16 02:48
7 16.3.16 18:55 Golf_IV_1.9_TDI_90PS privat Angebot 1990.000 control limousine 1998 manuell 90 golf 150000.000 12 diesel volkswagen nein 16.3.16 00:00 0 53474 7.4.16 03:17
8 22.3.16 16:51 Seat_Arosa privat Angebot 250.000 test NaN 2000 manuell 0 arosa 150000.000 10 NaN seat nein 22.3.16 00:00 0 7426 26.3.16 18:18
9 16.3.16 13:47 Renault_Megane_Scenic_1.6e_RT_Klimaanlage privat Angebot 590.000 control bus 1997 manuell 90 megane 150000.000 7 benzin renault nein 16.3.16 00:00 0 15749 6.4.16 10:46
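The chained `str.replace` calls above work; an equivalent, slightly more compact route strips every non-digit with a single regex and converts with `pd.to_numeric`. A sketch on toy values mimicking the raw formats:

```python
import pandas as pd

# Toy series in the raw '$5,000' / '150,000km' formats.
price = pd.Series(['$5,000', '$8,500', '$1,350'])
odometer = pd.Series(['150,000km', '70,000km', '5,000km'])

# Remove everything that is not a digit, then convert in one step.
price_num = pd.to_numeric(price.str.replace(r'[^\d]', '', regex=True))
odometer_num = pd.to_numeric(odometer.str.replace(r'[^\d]', '', regex=True))
print(price_num.tolist(), odometer_num.tolist())
```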

There are 2 columns with German words as values which I need to change:

  • unrepaired_damage - whether the car has damage which is not yet repaired.
  • gearbox - the transmission type.

I will change the German words for 'yes' and 'no' in the 'unrepaired_damage' column to 1's and 0's, and the German words for the transmission type to their English counterparts.

In [35]:
# leaving nan in its place 
damage_dict = {'nein': 0, 'ja': 1}
autos['unrepaired_damage']= autos['unrepaired_damage'].map(damage_dict)

gear_dict = {'automatik' : 'automatic', 'manuell': 'manual'}
autos['gearbox'] = autos['gearbox'].map(gear_dict)
print_2n(autos['unrepaired_damage'].value_counts(dropna=False),\
           autos['gearbox'].value_counts(dropna=False))
 0.000    35232
nan       9829
1.000     4939
Name: unrepaired_damage, dtype: int64


 manual       36993
automatic    10327
NaN           2680
Name: gearbox, dtype: int64
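A caveat on `map` worth keeping in mind: any value missing from the dictionary (not only NaN) silently becomes NaN, whereas `replace` leaves unlisted values untouched. A small illustration on a toy series:

```python
import pandas as pd
import numpy as np

s = pd.Series(['nein', 'ja', np.nan, 'unexpected'])

mapped = s.map({'nein': 0, 'ja': 1})        # 'unexpected' turns into NaN
replaced = s.replace({'nein': 0, 'ja': 1})  # 'unexpected' is preserved

print(mapped.tolist())
print(replaced.tolist())
```

With `map`, a typo like 'Nein' in the raw data would be lost without warning, so checking `value_counts(dropna=False)` afterwards, as the cell above does, is a good habit.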

I have already cleaned the price column of special characters, moved the information stored in the special character to the column label and converted the column to a numeric type. There are also no null values, as shown in the exploring data section. But before this column can be used in any analysis, we should check whether there are outliers or unrealistic values. Considering that we analyse the used car market, very expensive cars seem unrelated to the topic. Very high prices can come from rare old models, mistakes in the dataset, almost new cars and so on. In any case, these prices cannot be corrected on an objective basis at this step. There are also suspiciously cheap cars in the dataset. I will check the descriptive statistics plus the top 10 and bottom 10 values to better understand the extreme prices.

In [36]:
desc_stat_price = autos['price_usd'].describe()
print(desc_stat_price)

top_10_price = autos['price_usd'].value_counts().sort_index(ascending=False).head(10)
bottom_10_price = autos['price_usd'].value_counts().sort_index(ascending=False).tail(10)
print_2n(top_10_price, bottom_10_price)
count      50000.000
mean        9840.044
std       481104.381
min            0.000
25%         1100.000
50%         2950.000
75%         7200.000
max     99999999.000
Name: price_usd, dtype: float64


 99999999.000    1
27322222.000    1
12345678.000    3
11111111.000    2
10000000.000    1
3890000.000     1
1300000.000     1
1234566.000     1
999999.000      2
999990.000      1
Name: price_usd, dtype: int64


 12.000       3
11.000       2
10.000       7
9.000        1
8.000        1
5.000        2
3.000        1
2.000        3
1.000      156
0.000     1421
Name: price_usd, dtype: int64

The descriptive statistics in the output of the above cell show that the minimum and maximum prices are unexpectedly low (0 USD) and high (almost 100 million USD), respectively. The bottom 10 prices are below 15 dollars and the top ten lie between roughly 1 million and 100 million dollars. Therefore I should set boundaries, and I consider the range between 100 USD and 100,000 USD sensible. Below I check the number of cars that fall outside this range.

In [37]:
unreal_prices = autos[(autos['price_usd'] >= 100000) |\
      (autos['price_usd'] <= 100)].shape[0]
print('\n'*2, '# of rows with unrealistic prices (100 or below and 100k or above):'\
            , unreal_prices)
 # of rows with unrealistic prices (100 or below and 100k or above): 1949

In the cell below I remove those rows to eliminate the effect of price outliers, keeping only prices between 100 and 100,000 USD inclusive, and check the descriptive statistics again.

In [38]:
autos = autos[autos['price_usd'].between(100, 100000)]
desc_stat_price = autos['price_usd'].describe()
print_2n(desc_stat_price)
 count   48185.000
mean     5796.100
std      7525.532
min       100.000
25%      1250.000
50%      3000.000
75%      7499.000
max     99900.000
Name: price_usd, dtype: float64
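The [100, 100000] window is a judgment call; a data-driven alternative is Tukey's 1.5×IQR fence. A sketch on a toy price series (assumption: the classic 1.5 multiplier; the real computation would use autos['price_usd']):

```python
import pandas as pd

prices = pd.Series([100, 150, 1250, 2950, 3000, 7499, 99900])

# Fences at 1.5 interquartile ranges beyond the quartiles.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only values inside the fences; here only 99900 falls outside.
kept = prices[prices.between(lower, upper)]
print(len(kept))
```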

I use the same methods as above to check whether there are outliers in the 'odometer_km' column, which represents the distance covered by each car. There are also no null values in this column (see exploring data).

In [39]:
# checking descriptive statistics and unrealistically high and low 
# covered distances 
desc_stat_odo = autos['odometer_km'].describe()
print_2n(desc_stat_odo)

top_10_odo = autos['odometer_km'].value_counts().sort_index(ascending=False).head(10)
bottom_10_odo = autos['odometer_km'].value_counts().sort_index(ascending=False).tail(10)
print_2n(top_10_odo, bottom_10_odo)
 count    48185.000
mean    125986.718
std      39467.457
min       5000.000
25%     125000.000
50%     150000.000
75%     150000.000
max     150000.000
Name: odometer_km, dtype: float64


 150000.000    31209
125000.000     5035
100000.000     2099
90000.000      1733
80000.000      1411
70000.000      1214
60000.000      1151
50000.000      1007
40000.000       813
30000.000       771
Name: odometer_km, dtype: int64


 90000.000    1733
80000.000    1411
70000.000    1214
60000.000    1151
50000.000    1007
40000.000     813
30000.000     771
20000.000     749
10000.000     242
5000.000      751
Name: odometer_km, dtype: int64

The minimum (5,000 km) and maximum (150,000 km) values of the 'odometer_km' column make sense. The top 10 and bottom 10 values also seem realistic and usable for the later analysis, where I will aggregate average values.

Understanding and cleaning date columns

There are 5 columns containing date and/or time data:

  • registration_month - The month in which the car was first registered.
  • registration_year - The year in which the car was first registered.
  • date_crawled - When this ad was first crawled.
  • ad_created - The date on which the eBay listing was created.
  • last_seen - When the crawler saw this ad last online.

To get a better understanding, we should pay attention to the output of the cell below, which shows the data types and descriptive statistics of all 5 columns.

In [40]:
date_cols = ['registration_month', 'registration_year', 'date_crawled',\
             'ad_created' , 'last_seen']
print(autos[date_cols].dtypes)
autos[date_cols].describe(include='all')
registration_month     int64
registration_year      int64
date_crawled          object
ad_created            object
last_seen             object
dtype: object
Out[40]:
registration_month registration_year date_crawled ad_created last_seen
count 48185.000 48185.000 48185 48185 48185
unique nan nan 11994 76 10226
top nan nan 28.3.16 14:49 3.4.16 00:00 6.4.16 02:45
freq nan nan 15 1872 111
mean 5.802 2004.730 NaN NaN NaN
std 3.678 87.932 NaN NaN NaN
min 0.000 1000.000 NaN NaN NaN
25% 3.000 1999.000 NaN NaN NaN
50% 6.000 2004.000 NaN NaN NaN
75% 9.000 2008.000 NaN NaN NaN
max 12.000 9999.000 NaN NaN NaN

From the result of the above cell it is visible that the 'registration_month' and 'registration_year' columns are stored as numeric types, while the remaining 3 columns are stored as strings. To see the format in which the values of all 5 date columns are represented, I will separate the column labels into 2 lists based on data type. Then I will print the top 5 rows of both to see how the date-related information is stored.

In [41]:
str_date_cols = ['date_crawled','ad_created' , 'last_seen']
int_date_cols = ['registration_year', 'registration_month']
print_2n(autos[str_date_cols].head(), autos[int_date_cols].head())
     date_crawled     ad_created      last_seen
0  26.3.16 17:47  26.3.16 00:00   6.4.16 06:45
1   4.4.16 13:38   4.4.16 00:00   6.4.16 14:45
2  26.3.16 18:57  26.3.16 00:00   6.4.16 20:15
3  12.3.16 16:58  12.3.16 00:00  15.3.16 03:16
4   1.4.16 14:38   1.4.16 00:00   1.4.16 14:38


    registration_year  registration_month
0               2004                   3
1               1997                   6
2               2009                   7
3               2007                   6
4               2003                   7

As we can see, the 3 columns stored as strings contain dates and times. Later, dates or times can be extracted from these columns and used for analysis. The other 2 date columns represent the registration year and month as numbers and can therefore be used for further analysis without any change of data type.
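Instead of string splitting, the three string columns can also be parsed directly into datetimes; a sketch, assuming the 'day.month.2-digit-year hour:minute' pattern visible above holds for every row:

```python
import pandas as pd

stamps = pd.Series(['26.3.16 17:47', '4.4.16 13:38', '6.4.16 06:45'])
parsed = pd.to_datetime(stamps, format='%d.%m.%y %H:%M')

# The .dt accessor then exposes year, month, day etc. directly.
print(parsed.dt.month.tolist())
```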

Now that we have seen what data is saved in the date columns, we can pay closer attention to the 'last_seen', 'registration_year' & 'registration_month' columns. There are 2 reasons to look closer at these columns:

  • presence of very old cars that can be considered outliers and would distort the patterns in the analysis
  • presence of "impossibly new" cars: there are rows where the registration year is later than the corresponding year in the 'last_seen' column, which is impossible.

To clean these 2 categories, I will first check the distribution of the 'last_seen' column. This will help to understand the approximate upper limit for a realistic registration year. I will also check the distributions of the other 2 string date columns ('date_crawled', 'ad_created').

In [42]:
for col in str_date_cols:
    date = autos[col].str.split().str[0]
    month_year = date.str.split('.').str[1] + "-" +\
    date.str.split('.').str[2]
    autos[f'{col}_month_year'] = month_year
    month_year_dist = month_year.value_counts(normalize=True, dropna=False)\
                    .sort_values(ascending=False)*100
    print_2n(f"Distribution for '{col}_month_year' column:", month_year_dist)
C:\Users\Asus\anaconda3\lib\site-packages\ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
 Distribution for 'date_crawled_month_year' column:


 3-16   83.792
4-16   16.208
Name: date_crawled, dtype: float64


 Distribution for 'ad_created_month_year' column:


 3-16    83.734
4-16    16.103
2-16     0.127
1-16     0.025
12-15    0.004
6-15     0.002
11-15    0.002
9-15     0.002
8-15     0.002
Name: ad_created, dtype: float64


 Distribution for 'last_seen_month_year' column:


 4-16   57.651
3-16   42.349
Name: last_seen, dtype: float64

The output presents the date format as 'month number-2 digit year' (e.g. 4-16 is April 2016). The distributions of the 'last_seen' and 'date_crawled' columns show that there are only 2 values: 3-16 (March 2016) and 4-16 (April 2016). The distribution of the 'ad_created' column also shows no date later than 4-16 (April 2016). These facts tell us that the dataset presents the situation of the used car market on eBay in Germany as of March - April 2016.

The above observations make it clear that a registration year above 2016 is an error and cannot be realistic. Moreover, rows where the registration year is 2016 and the registration month is not earlier than the month in the 'last_seen' column are treated as containing false dates and are deleted.

In [43]:
autos['last_seen_month'] = autos['last_seen'].str.split().str[0]\
.str.split(".").str[1].astype(int)
autos = autos[~((autos['registration_year'] == 2016) &\
              (autos['registration_month'] >= autos['last_seen_month']))]

autos.to_csv('autos_cleaned.csv', index=False)
autos
C:\Users\Asus\anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
Out[43]:
date_crawled name seller offer_type price_usd ab_test vehicle_type registration_year gearbox power_ps ... brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen date_crawled_month_year ad_created_month_year last_seen_month_year last_seen_month
0 26.3.16 17:47 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000.000 control bus 2004 manual 158 ... peugeot 0.000 26.3.16 00:00 0 79588 6.4.16 06:45 3-16 3-16 4-16 4
1 4.4.16 13:38 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500.000 control limousine 1997 automatic 286 ... bmw 0.000 4.4.16 00:00 0 71034 6.4.16 14:45 4-16 4-16 4-16 4
2 26.3.16 18:57 Volkswagen_Golf_1.6_United privat Angebot 8990.000 test limousine 2009 manual 102 ... volkswagen 0.000 26.3.16 00:00 0 35394 6.4.16 20:15 3-16 3-16 4-16 4
3 12.3.16 16:58 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot 4350.000 control kleinwagen 2007 automatic 71 ... smart 0.000 12.3.16 00:00 0 33729 15.3.16 03:16 3-16 3-16 3-16 3
4 1.4.16 14:38 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot 1350.000 test kombi 2003 manual 0 ... ford 0.000 1.4.16 00:00 0 39218 1.4.16 14:38 4-16 4-16 4-16 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 27.3.16 14:38 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon privat Angebot 24900.000 control limousine 2011 automatic 239 ... audi 0.000 27.3.16 00:00 0 82131 1.4.16 13:47 3-16 3-16 4-16 4
49996 28.3.16 10:50 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot 1980.000 control cabrio 1996 manual 75 ... opel 0.000 28.3.16 00:00 0 44807 2.4.16 14:18 3-16 3-16 4-16 4
49997 2.4.16 14:44 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot 13200.000 test cabrio 2014 automatic 69 ... fiat 0.000 2.4.16 00:00 0 73430 4.4.16 11:47 4-16 4-16 4-16 4
49998 8.3.16 19:25 Audi_A3_2.0_TDI_Sportback_Ambition privat Angebot 22900.000 control kombi 2013 manual 150 ... audi 0.000 8.3.16 00:00 0 35683 5.4.16 16:45 3-16 3-16 4-16 4
49999 14.3.16 00:42 Opel_Vectra_1.6_16V privat Angebot 1250.000 control limousine 1996 manual 101 ... opel 0.000 13.3.16 00:00 0 45897 6.4.16 21:18 3-16 3-16 4-16 4

47440 rows × 24 columns
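The SettingWithCopyWarning in the output appears because `autos` is at this point itself a slice of the originally loaded frame, and assigning a new column to a slice is ambiguous. Chaining `.copy()` after each boolean filter (e.g. `autos = autos[...].copy()`) removes the ambiguity; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# An explicit copy makes the subset an independent frame,
# so adding a column to it cannot trigger the warning.
subset = df[df['a'] > 1].copy()
subset['c'] = subset['a'] + subset['b']
print(subset['c'].tolist())
```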

For the lower limit, let's first consider 1950. Now I need to see how many cars were registered before that year and decide whether the number is insignificant enough to delete the corresponding rows.

In [44]:
too_old_cars = autos[autos['registration_year'] < 1950].shape[0]
print(f'# of too old cars = {too_old_cars}')
# of too old cars = 22

As the number of cars registered before 1950 is just 22, removing them will not cause a significant data loss, so they can be deleted from the dataset.

In [45]:
# removing cars registered before 1950 and after 2016
autos = autos[autos['registration_year'].between(1949, 2017, inclusive=False)]

As a result of all the operations above, I got cleaned data and saved it to a new csv file ('autos_cleaned.csv'). I will use the cleaned data to analyse the used car market on German eBay based on the most common brands and their average mileage, percentage of unrepaired damage, gearbox type and average age. As mentioned before, I will try to find patterns linking the average prices of popular brands to these other factors.

Analysis

First of all, in order to analyse the most frequent brands in our dataset, we need detailed information on brands. Then I will create a new dataset containing only the popular brands and aggregated values for these brands. In the cell below I check the unique brands, the descriptive statistics of the 'brand' column, and the distribution of brands in percent in descending order.

In [46]:
unique_brands = autos['brand'].unique()
stat_brands = autos['brand'].describe()
brands_dist = autos['brand'].value_counts(normalize=True, dropna=False)*100
print_2n(unique_brands, stat_brands, brands_dist)
 ['peugeot' 'bmw' 'volkswagen' 'smart' 'ford' 'chrysler' 'seat' 'renault'
 'mercedes_benz' 'audi' 'sonstige_autos' 'opel' 'mazda' 'porsche' 'mini'
 'toyota' 'dacia' 'nissan' 'jeep' 'saab' 'volvo' 'mitsubishi' 'jaguar'
 'fiat' 'skoda' 'subaru' 'kia' 'citroen' 'chevrolet' 'hyundai' 'honda'
 'daewoo' 'suzuki' 'trabant' 'land_rover' 'alfa_romeo' 'lada' 'rover'
 'daihatsu' 'lancia']


 count          45551
unique            40
top       volkswagen
freq            9618
Name: brand, dtype: object


 volkswagen       21.115
bmw              11.093
opel             10.667
mercedes_benz     9.708
audi              8.724
ford              6.968
renault           4.702
peugeot           2.975
fiat              2.553
seat              1.811
skoda             1.655
nissan            1.528
mazda             1.521
smart             1.416
citroen           1.401
toyota            1.284
hyundai           1.001
sonstige_autos    0.933
volvo             0.920
mini              0.883
mitsubishi        0.819
honda             0.788
kia               0.716
alfa_romeo        0.667
suzuki            0.593
chevrolet         0.571
porsche           0.551
chrysler          0.351
dacia             0.266
daihatsu          0.252
jeep              0.231
land_rover        0.213
subaru            0.211
saab              0.167
jaguar            0.156
daewoo            0.151
trabant           0.138
rover             0.134
lancia            0.108
lada              0.059
Name: brand, dtype: float64

There are 40 unique brands in the dataset, with Volkswagen being the most frequent one. All of them appear in the list of unique brands above. But I plan to analyse patterns in popular brands only, so brands with a very low number of cars will not be included in the new dataset I create. The threshold I set for a brand to count as popular is 1 percent: if a brand's share in the distribution above is below 1%, it is left out. From the distribution we can see that the last brand that qualifies is 'hyundai' and the first one that does not is 'sonstige_autos'.
The next step is to create a new dataset with popular brands only. Each brand will be represented by one row, with the brand names as index values. I will also create the first column, containing the average price for each brand.

In [47]:
# selecting brands which cover 1 or more than 1 percent of total share
cmmn_brands = brands_dist[brands_dist >= 1]

# aggregating avg price for each common brand
# adding brand as key & avg price as value to dict
# creating series from dict: key turns to index,
# values to column and descending order
# creating df from series
brands_price_dict = {}
for brand in cmmn_brands.index:
    avg_price = autos.loc[autos['brand'] == brand, 'price_usd'].mean()
    brands_price_dict[brand] = avg_price
    
cmmn_brands_avg_price = pd.Series(brands_price_dict).sort_values(ascending=False)
cmmn_brands_df = pd.DataFrame(cmmn_brands_avg_price, columns=['avg_price'])
cmmn_brands_df.index.name = 'brand'
cmmn_brands_df
Out[47]:
avg_price
brand
audi 9379.054
mercedes_benz 8627.313
bmw 8293.899
skoda 6424.845
hyundai 5484.410
volkswagen 5482.053
toyota 5200.318
nissan 4803.099
seat 4479.248
mazda 4178.258
citroen 3841.723
ford 3774.345
smart 3614.164
peugeot 3129.524
opel 3026.059
fiat 2857.802
renault 2512.908
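The explicit loop above is easy to follow; `groupby` produces the same averages in one expression. A sketch on toy data (the real call would be on `autos` restricted to `cmmn_brands.index`):

```python
import pandas as pd

autos_toy = pd.DataFrame({
    'brand': ['audi', 'audi', 'bmw', 'bmw', 'lada'],
    'price_usd': [10000.0, 8000.0, 9000.0, 7000.0, 1000.0],
})
common = ['audi', 'bmw']  # stands in for cmmn_brands.index

# Mean price per common brand, sorted from most to least expensive.
avg_price = (autos_toy[autos_toy['brand'].isin(common)]
             .groupby('brand')['price_usd']
             .mean()
             .sort_values(ascending=False))
print(avg_price.to_dict())
```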

Now we have a new dataset called 'cmmn_brands_df'. In the next 6 cells I will add the other columns to this new dataset.

Adding new columns to the 'cmmn_brands_df' dataset

Adding the 'avg_km_covered' column:
Values in this column represent the average distance in kilometers covered by the cars of each brand.

In [48]:
# aggregating avg covered distance for each brand in the 
# common brands df and adding it as a series to the same df 
brands_avg_km = {}
for brand in cmmn_brands_df.index:
    avg_mlg = autos.loc[autos['brand'] == brand, 'odometer_km'].mean()
    brands_avg_km[brand] = avg_mlg

cmmn_brands_df['avg_km_covered'] = pd.Series(brands_avg_km)
cmmn_brands_df
Out[48]:
avg_price avg_km_covered
brand
audi 9379.054 129266.482
mercedes_benz 8627.313 131007.463
bmw 8293.899 132673.659
skoda 6424.845 110835.544
hyundai 5484.410 106293.860
volkswagen 5482.053 128614.057
toyota 5200.318 115777.778
nissan 4803.099 118520.115
seat 4479.248 121321.212
mazda 4178.258 124206.349
citroen 3841.723 119184.953
ford 3774.345 124188.721
smart 3614.164 99263.566
peugeot 3129.524 126974.170
opel 3026.059 129320.848
fiat 2857.802 116822.872
renault 2512.908 128204.949

Adding the 'damage_info_%' column:
Values in this column represent the percentage of cars of each brand for which there is information on whether the car has unrepaired damage or not.
In the next cells I will also add the percentage of cars with unrepaired damage and the percentage of healthy cars. But as we learned earlier, there are null values in the corresponding column. Therefore it is crucial to check whether the majority of cars have information about the presence of unrepaired damage.

In [49]:
damage_info_dict = {}
for brand in cmmn_brands_df.index:
    count_brands = autos[autos['brand'] == brand].shape[0] 
    count_na = autos.loc[(autos['brand'] == brand) &\
             (autos['unrepaired_damage'].isnull())].shape[0]
    share_dmg = 100 - (count_na / count_brands ) * 100 
    damage_info_dict[brand] = share_dmg

cmmn_brands_df['damage_info_%'] = pd.Series(damage_info_dict)
cmmn_brands_df
Out[49]:
avg_price avg_km_covered damage_info_%
brand
audi 9379.054 129266.482 84.625
mercedes_benz 8627.313 131007.463 85.052
bmw 8293.899 132673.659 85.197
skoda 6424.845 110835.544 88.859
hyundai 5484.410 106293.860 88.158
volkswagen 5482.053 128614.057 81.004
toyota 5200.318 115777.778 88.547
nissan 4803.099 118520.115 85.057
seat 4479.248 121321.212 84.606
mazda 4178.258 124206.349 81.818
citroen 3841.723 119184.953 81.818
ford 3774.345 124188.721 80.466
smart 3614.164 99263.566 83.876
peugeot 3129.524 126974.170 84.207
opel 3026.059 129320.848 80.078
fiat 2857.802 116822.872 77.988
renault 2512.908 128204.949 79.458

The output of the above cell shows that more than 75% of cars of each brand have information about the presence of unrepaired damage. As a significant majority of cars carry this information, we can rely on an analysis based on the percentage of healthy cars or cars with unrepaired damage.

Adding 'unrep_damage_%' column:
Values in this column represent the percentage of cars with unrepaired damage for each brand.

Adding 'healthy_car_%' column:
Values in this column represent the percentage of cars with no unrepaired damage for each brand.

In [50]:
unrep_damage_dict = {}
for brand in cmmn_brands_df.index:
    damaged_share = autos.loc[(autos['brand'] == brand) &\
                  (autos['unrepaired_damage'].notnull()), 'unrepaired_damage'].mean() * 100
    unrep_damage_dict[brand] = damaged_share

cmmn_brands_df['unrep_damage_%'] = pd.Series(unrep_damage_dict)
cmmn_brands_df['healthy_car_%'] = cmmn_brands_df['damage_info_%']\
                                - cmmn_brands_df['unrep_damage_%']
cmmn_brands_df
Out[50]:
avg_price avg_km_covered damage_info_% unrep_damage_% healthy_car_%
brand
audi 9379.054 129266.482 84.625 9.664 74.961
mercedes_benz 8627.313 131007.463 85.052 8.881 76.171
bmw 8293.899 132673.659 85.197 9.175 76.022
skoda 6424.845 110835.544 88.859 9.701 79.158
hyundai 5484.410 106293.860 88.158 13.930 74.228
volkswagen 5482.053 128614.057 81.004 10.602 70.402
toyota 5200.318 115777.778 88.547 7.915 80.632
nissan 4803.099 118520.115 85.057 15.372 69.686
seat 4479.248 121321.212 84.606 11.461 73.145
mazda 4178.258 124206.349 81.818 14.286 67.532
citroen 3841.723 119184.953 81.818 12.452 69.366
ford 3774.345 124188.721 80.466 15.427 65.040
smart 3614.164 99263.566 83.876 8.133 75.743
peugeot 3129.524 126974.170 84.207 15.688 68.519
opel 3026.059 129320.848 80.078 14.418 65.660
fiat 2857.802 116822.872 77.988 14.002 63.986
renault 2512.908 128204.949 79.458 15.805 63.654

Adding 'automatic_%' column:
Values in this column represent the percentage of cars with an automatic gearbox for each brand.

Adding 'manual_%' column:
Values in this column represent the percentage of cars with a manual gearbox for each brand.

In [51]:
automatic_dict = {}
manual_dict = {}
for brand in cmmn_brands_df.index:
    automatic_share = autos.loc[autos['brand'] == brand, 'gearbox']\
                .value_counts(normalize=True, dropna=False)['automatic'] * 100
    manual_share = autos.loc[autos['brand'] == brand, 'gearbox']\
                .value_counts(normalize=True, dropna=False)['manual'] * 100
    automatic_dict[brand] = automatic_share
    manual_dict[brand] = manual_share
    
cmmn_brands_df['automatic_%'] = pd.Series(automatic_dict)
cmmn_brands_df['manual_%'] = pd.Series(manual_dict)
cmmn_brands_df
Out[51]:
avg_price avg_km_covered damage_info_% unrep_damage_% healthy_car_% automatic_% manual_%
brand
audi 9379.054 129266.482 84.625 9.664 74.961 33.065 63.538
mercedes_benz 8627.313 131007.463 85.052 8.881 76.171 55.450 41.633
bmw 8293.899 132673.659 85.197 9.175 76.022 33.742 63.606
skoda 6424.845 110835.544 88.859 9.701 79.158 11.538 85.942
hyundai 5484.410 106293.860 88.158 13.930 74.228 12.719 84.211
volkswagen 5482.053 128614.057 81.004 10.602 70.402 13.329 82.086
toyota 5200.318 115777.778 88.547 7.915 80.632 12.137 84.274
nissan 4803.099 118520.115 85.057 15.372 69.686 10.920 85.201
seat 4479.248 121321.212 84.606 11.461 73.145 3.879 92.606
mazda 4178.258 124206.349 81.818 14.286 67.532 6.061 90.909
citroen 3841.723 119184.953 81.818 12.452 69.366 13.323 81.191
ford 3774.345 124188.721 80.466 15.427 65.040 8.034 87.492
smart 3614.164 99263.566 83.876 8.133 75.743 73.178 11.628
peugeot 3129.524 126974.170 84.207 15.688 68.519 5.461 90.849
opel 3026.059 129320.848 80.078 14.418 65.660 10.249 85.120
fiat 2857.802 116822.872 77.988 14.002 63.986 2.408 92.347
renault 2512.908 128204.949 79.458 15.805 63.654 6.676 88.562
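The gearbox shares could likewise be computed in one shot with pd.crosstab. A sketch on toy data; note one subtle difference: the loop above uses dropna=False, so missing gearbox values count toward the denominator there, while crosstab ignores NaN rows by default:

```python
import pandas as pd

# toy stand-in for the brand and gearbox columns
autos = pd.DataFrame({
    'brand': ['audi', 'audi', 'audi', 'bmw', 'bmw'],
    'gearbox': ['automatic', 'manual', 'manual', 'automatic', 'automatic'],
})

# row-normalized counts -> share of each gearbox type per brand, in percent
shares = pd.crosstab(autos['brand'], autos['gearbox'], normalize='index') * 100
print(shares)
```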

Adding 'vehicle_count' column:
Values in this column represent the number of cars for each brand.

In [52]:
count_dict = {}
for brand in cmmn_brands_df.index:
    vehicle_count = autos[autos['brand'] == brand].shape[0]
    count_dict[brand] = vehicle_count

cmmn_brands_df.insert(0, 'vehicle_count', pd.Series(count_dict))
cmmn_brands_df
Out[52]:
vehicle_count avg_price avg_km_covered damage_info_% unrep_damage_% healthy_car_% automatic_% manual_%
brand
audi 3974 9379.054 129266.482 84.625 9.664 74.961 33.065 63.538
mercedes_benz 4422 8627.313 131007.463 85.052 8.881 76.171 55.450 41.633
bmw 5053 8293.899 132673.659 85.197 9.175 76.022 33.742 63.606
skoda 754 6424.845 110835.544 88.859 9.701 79.158 11.538 85.942
hyundai 456 5484.410 106293.860 88.158 13.930 74.228 12.719 84.211
volkswagen 9618 5482.053 128614.057 81.004 10.602 70.402 13.329 82.086
toyota 585 5200.318 115777.778 88.547 7.915 80.632 12.137 84.274
nissan 696 4803.099 118520.115 85.057 15.372 69.686 10.920 85.201
seat 825 4479.248 121321.212 84.606 11.461 73.145 3.879 92.606
mazda 693 4178.258 124206.349 81.818 14.286 67.532 6.061 90.909
citroen 638 3841.723 119184.953 81.818 12.452 69.366 13.323 81.191
ford 3174 3774.345 124188.721 80.466 15.427 65.040 8.034 87.492
smart 645 3614.164 99263.566 83.876 8.133 75.743 73.178 11.628
peugeot 1355 3129.524 126974.170 84.207 15.688 68.519 5.461 90.849
opel 4859 3026.059 129320.848 80.078 14.418 65.660 10.249 85.120
fiat 1163 2857.802 116822.872 77.988 14.002 63.986 2.408 92.347
renault 2142 2512.908 128204.949 79.458 15.805 63.654 6.676 88.562

Adding 'avg_age' column:
Values in this column represent the average age of cars for each brand.

In [53]:
# as there are '0' values in registration month column I replaced them with 1
reg_month_0to1 = autos['registration_month'].replace(0, 1).astype(str)
# combining values in registration month and registration year columns
m_y_reg = reg_month_0to1 + '-' + autos['registration_year'].astype(str)
# converting above combination to date values
date_m_y_reg = pd.to_datetime(m_y_reg, format='%m-%Y')
# finding latest date from last seen column
latest_date = pd.to_datetime(autos['last_seen'].str.split().str[0], format='%d.%m.%y').max()
# finding age of each car in days
diff = latest_date - date_m_y_reg
# converting age from days to years and adding to autos dataset
autos['vehicle_age'] = diff.dt.components['days'].astype(int) / 365

# finding average age of cars by brand and adding it to cmmn_brands_df dataset
age_dict = {}
for brand in cmmn_brands_df.index:
    avg = autos.loc[autos['brand'] == brand, 'vehicle_age'].mean()
    age_dict[brand] = avg
    
cmmn_brands_df.insert(1, 'avg_age', pd.Series(age_dict))
#cmmn_brands_df.to_csv('autos_brands.csv')
cmmn_brands_df
C:\Users\Asus\anaconda3\lib\site-packages\ipykernel_launcher.py:12: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':
Out[53]:
vehicle_count avg_age avg_price avg_km_covered damage_info_% unrep_damage_% healthy_car_% automatic_% manual_%
brand
audi 3974 11.815 9379.054 129266.482 84.625 9.664 74.961 33.065 63.538
mercedes_benz 4422 13.924 8627.313 131007.463 85.052 8.881 76.171 55.450 41.633
bmw 5053 12.921 8293.899 132673.659 85.197 9.175 76.022 33.742 63.606
skoda 754 9.385 6424.845 110835.544 88.859 9.701 79.158 11.538 85.942
hyundai 456 8.780 5484.410 106293.860 88.158 13.930 74.228 12.719 84.211
volkswagen 9618 13.636 5482.053 128614.057 81.004 10.602 70.402 13.329 82.086
toyota 585 11.670 5200.318 115777.778 88.547 7.915 80.632 12.137 84.274
nissan 696 12.687 4803.099 118520.115 85.057 15.372 69.686 10.920 85.201
seat 825 11.593 4479.248 121321.212 84.606 11.461 73.145 3.879 92.606
mazda 693 13.174 4178.258 124206.349 81.818 14.286 67.532 6.061 90.909
citroen 638 11.612 3841.723 119184.953 81.818 12.452 69.366 13.323 81.191
ford 3174 13.258 3774.345 124188.721 80.466 15.427 65.040 8.034 87.492
smart 645 10.611 3614.164 99263.566 83.876 8.133 75.743 73.178 11.628
peugeot 1355 12.357 3129.524 126974.170 84.207 15.688 68.519 5.461 90.849
opel 4859 13.900 3026.059 129320.848 80.078 14.418 65.660 10.249 85.120
fiat 1163 12.877 2857.802 116822.872 77.988 14.002 63.986 2.408 92.347
renault 2142 13.163 2512.908 128204.949 79.458 15.805 63.654 6.676 88.562
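The SettingWithCopyWarning above typically appears when the DataFrame being assigned to is a slice of another frame, e.g. left over from row filtering during cleaning. A common fix, sketched here on made-up data, is to take an explicit copy before adding new columns:

```python
import pandas as pd

raw = pd.DataFrame({'price': [5000, 8500, 100],
                    'brand': ['peugeot', 'bmw', 'opel']})

# boolean filtering can leave a frame flagged as a slice of `raw`;
# assigning new columns to it may trigger SettingWithCopyWarning.
# An explicit .copy() creates an independent frame, so later
# column assignments are unambiguous.
autos = raw[raw['price'] > 500].copy()
autos['price_k'] = autos['price'] / 1000
print(autos)
```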

Now we have the dataset 'cmmn_brands_df' ready for analysis. I will use it to check whether there are similar patterns and trends between the average price and the data in the other columns.

At this step of my training I will not use statistical methods, as they haven't been covered in this course yet, although I have some knowledge from university. The techniques used below are simple, but some explanation of the logic behind the analysis is required. First, I will create a function named 'compare_around_mean' to see if there is a pattern between 2 columns. Then I will use it several times to check whether there is a pattern between the 'avg_price' column and 4 other columns.

In [54]:
def compare_around_mean(a_df, x_col, y_col):
    # print x_col values in descending order to spot anomalies
    print(f"'{x_col}' column values in descending order:\n \
    {a_df[x_col].sort_values(ascending=False)}")
    
    # split rows at the mean of x_col and compare averages of y_col
    x_mean = a_df[x_col].mean()
    bottom_x_y_avg = a_df.loc[a_df[x_col] < x_mean, y_col].mean()
    top_x_y_avg = a_df.loc[a_df[x_col] >= x_mean, y_col].mean()
    
    print_2n(f"Mean of '{y_col}' for below average values in '{x_col}': {bottom_x_y_avg}")
    print_2n(f"Mean of '{y_col}' for above average values in '{x_col}': {top_x_y_avg}")

I will explain what the function does and the logic behind it using the 'avg_price' and 'manual_%' columns as an example. The 'manual_%' column represents the share of cars with a manual gearbox for each brand:

    1. The function takes 3 arguments: 'a_df' (dataset), 'x_col' (the column which we think has an impact on 'y_col'), and 'y_col' (the column which we think is affected by 'x_col')
    2. Once called, the function prints the values in 'x_col' ('manual_%' in this example) in descending order so that we can spot any anomaly in the data
    3. Then it finds the average value of the 'manual_%' column
    4. Now we can imagine the values in 'manual_%' divided into 2 groups: 'values below average' and 'values above average'
    5. Next, the function calculates the average of 'y_col' ('avg_price' in this example) for the 'values below average' and 'values above average' groups of 'manual_%'
    6. Finally, the function prints the result
    7. If there is a significant difference between the average price for 'values below average' and 'values above average' in 'manual_%', we can say there is a pattern between the 2 columns, although it does not imply correlation
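The steps above can be demonstrated on a tiny made-up table (the numbers are invented for illustration; a simplified version of the function is redefined here without the printing, so the block is self-contained):

```python
import pandas as pd

def compare_around_mean(a_df, x_col, y_col):
    """Split rows at the mean of x_col and compare averages of y_col."""
    x_mean = a_df[x_col].mean()
    bottom = a_df.loc[a_df[x_col] < x_mean, y_col].mean()
    top = a_df.loc[a_df[x_col] >= x_mean, y_col].mean()
    return bottom, top

# toy data where price drops as the manual share rises;
# mean of 'manual_%' is 55, so the groups are {10, 40} and {80, 90}
toy = pd.DataFrame({'manual_%': [10, 40, 80, 90],
                    'avg_price': [9000, 8000, 3000, 2000]})
below, above = compare_around_mean(toy, 'manual_%', 'avg_price')
print(below, above)  # 8500.0 2500.0
```

Here the large gap between the two group means (8500 vs 2500) is the kind of signal the function is designed to surface.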

In the cell below we can see that the average price for brands with a below-average share of manual cars (7478.61 USD) is much higher than for brands with an above-average share (4245.74 USD). This pattern confirms the generally known fact that a car with a manual gearbox is usually cheaper. There is only one concern with this conclusion in our case: the share (and number) of cars with an automatic gearbox is really low relative to cars with a manual gearbox. Therefore the difference between the average prices could change significantly once more cars with automatic gearboxes are added to the data.

In [55]:
compare_around_mean(cmmn_brands_df, 'manual_%', 'avg_price')
'manual_%' column values in descending order:
     brand
seat            92.606
fiat            92.347
mazda           90.909
peugeot         90.849
renault         88.562
ford            87.492
skoda           85.942
nissan          85.201
opel            85.120
toyota          84.274
hyundai         84.211
volkswagen      82.086
citroen         81.191
bmw             63.606
audi            63.538
mercedes_benz   41.633
smart           11.628
Name: manual_%, dtype: float64


 Mean of 'avg_price' for below average values in 'manual_%': 7478.6076593315465


 Mean of 'avg_price' for above average values in 'manual_%': 4245.7378633367825

Before calling 'compare_around_mean' below I added a new column to the dataset: 'damage/age'. This column represents the ratio of the share of cars with unrepaired damage to the average age of cars for each brand, so it shows roughly what share of cars of each brand accumulates unrepaired damage every year. Ideally, the higher the ratio, the worse the condition of the car should be, so the price should also be lower compared to brands with a lower ratio. And as we can see from the output of the cell below, the average price for brands with a lower ratio is higher by approximately 60 percent.

In [56]:
cmmn_brands_df['damage/age'] = cmmn_brands_df['unrep_damage_%'] / cmmn_brands_df['avg_age']
compare_around_mean(cmmn_brands_df, 'damage/age', 'avg_price')
'damage/age' column values in descending order:
     brand
hyundai         1.587
peugeot         1.270
nissan          1.212
renault         1.201
ford            1.164
fiat            1.087
mazda           1.084
citroen         1.072
opel            1.037
skoda           1.034
seat            0.989
audi            0.818
volkswagen      0.778
smart           0.767
bmw             0.710
toyota          0.678
mercedes_benz   0.638
Name: damage/age, dtype: float64


 Mean of 'avg_price' for below average values in 'damage/age': 6439.435683507786


 Mean of 'avg_price' for above average values in 'damage/age': 4003.2973076149838

Another column we have to check for a relation between price and damage is the 'healthy_car_%' column. The cell below confirms the finding of the cell above: the higher the share of healthy cars (with no unrepaired damage), the higher the average price.

In [57]:
compare_around_mean(cmmn_brands_df, 'healthy_car_%', 'avg_price')
'healthy_car_%' column values in descending order:
     brand
toyota          80.632
skoda           79.158
mercedes_benz   76.171
bmw             76.022
smart           75.743
audi            74.961
hyundai         74.228
seat            73.145
volkswagen      70.402
nissan          69.686
citroen         69.366
peugeot         68.519
mazda           67.532
opel            65.660
ford            65.040
fiat            63.986
renault         63.654
Name: healthy_car_%, dtype: float64


 Mean of 'avg_price' for below average values in 'healthy_car_%': 3733.974541611802


 Mean of 'avg_price' for above average values in 'healthy_car_%': 6437.906498274766

The last column I will check in pair with the 'avg_price' column is 'avg_km_covered'. The natural expectation is: the higher the distance covered, the lower the price. From the output of the cell it is visible that the average price for brands with an above-average covered distance is lower by ~790 USD. This confirms the natural expectation, but the difference was larger in the previous cells. Therefore the difference between the average prices here indicates a weaker relationship between covered distance and price.

In [58]:
compare_around_mean(cmmn_brands_df, 'avg_km_covered', 'avg_price')
'avg_km_covered' column values in descending order:
     brand
bmw             132673.659
mercedes_benz   131007.463
opel            129320.848
audi            129266.482
volkswagen      128614.057
renault         128204.949
peugeot         126974.170
mazda           124206.349
ford            124188.721
seat            121321.212
citroen         119184.953
nissan          118520.115
fiat            116822.872
toyota          115777.778
skoda           110835.544
hyundai         106293.860
smart            99263.566
Name: avg_km_covered, dtype: float64


 Mean of 'avg_price' for below average values in 'avg_km_covered': 4588.201204252344


 Mean of 'avg_price' for above average values in 'avg_km_covered': 5378.157025187288

Conclusion

Even before looking at any data, anybody planning to set the price of a used car has some expectation of what affects that price and how. And if you want to buy one, you probably have some price in mind depending on the condition of the car. In this project I analysed the average prices of popular brands. In the trends & patterns section I found supporting indications for 4 of our natural expectations about the prices of cars that belong to a popular brand.

In short, the average price of a car that belongs to a certain brand is higher if the cars of that brand:

  • have a below-average share of manual-gearbox cars
  • accumulate less unrepaired damage every year
  • have a high share of cars with no unrepaired damage at the point of sale
  • have covered less distance on average relative to other cars
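As a possible next step beyond the split-at-the-mean comparison, pandas can compute correlation coefficients directly. A minimal sketch on a toy brand-level table (the numbers are loosely modelled on the tables above, not the project's exact values):

```python
import pandas as pd

# toy brand-level summary mimicking a few columns of cmmn_brands_df
brands = pd.DataFrame({
    'avg_price': [9379, 8294, 5482, 3026, 2513],
    'avg_km_covered': [129266, 132674, 128614, 129321, 128205],
    'healthy_car_%': [74.9, 76.0, 70.4, 65.7, 63.7],
})

# Pearson correlation of avg_price with every other column
print(brands.corr()['avg_price'])
```

A strongly positive coefficient for 'healthy_car_%' would point in the same direction as the split-at-the-mean comparison, with the usual caveat that correlation is not causation.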

Notes

  • The method I used here is simple and does not involve rigorous statistical methods. The reason is that at this stage of my training I tried to use only concepts already covered in the course. Therefore the method cannot be enough on its own to draw definitive conclusions, but it can serve as additional evidence.

  • There are some parts of code in this project that could be deleted, for example new columns added to the original dataset that I never used later. I decided to leave those parts in place. The reason is that I am still in my training phase: I want to type as much code as possible to memorize the concepts and syntax better, and keeping it here makes it easy to find what I learned in case I need to come back and refresh my knowledge.