Exploring eBay Car Sales Data: Understanding the Average Price of Cars for Each Popular Brand

In this project I work with a dataset containing information on eBay used car sales in Germany. I will explore the dataset, find the most popular car brands and look for patterns in relation to the average car price of each popular brand. The patterns I am interested in are covered distance, presence of unrepaired damage and average car age for each popular brand. In other words, I am going to analyse whether the average price for a brand moves together with the brand's other average characteristics. This will also help to understand whether some natural expectations of buyers and/or sellers about used car prices can be trusted.

You can download the dataset here.

Reading & Exploring data

Reading data

Pandas is the main module I use for the analysis. In addition, I imported NumPy and created a print_2n function; the latter prints 2 empty lines before each printed object. Numbers in tables will be printed with 3 digits after the decimal point.

In [31]:
import pandas as pd
import numpy as np
pd.set_option('display.float_format', lambda x: '%.3f' % x)

def print_2n(*args):
    for arg in args:
        print("\n"*2, arg)
        
autos = pd.read_csv('autos.csv', encoding="Latin-1")
autos
Out[31]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 26.3.16 17:47 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 26.3.16 00:00 0 79588 6.4.16 06:45
1 4.4.16 13:38 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 4.4.16 00:00 0 71034 6.4.16 14:45
2 26.3.16 18:57 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 26.3.16 00:00 0 35394 6.4.16 20:15
3 12.3.16 16:58 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 12.3.16 00:00 0 33729 15.3.16 03:16
4 1.4.16 14:38 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 1.4.16 00:00 0 39218 1.4.16 14:38
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 27.3.16 14:38 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon privat Angebot $24,900 control limousine 2011 automatik 239 q5 100,000km 1 diesel audi nein 27.3.16 00:00 0 82131 1.4.16 13:47
49996 28.3.16 10:50 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot $1,980 control cabrio 1996 manuell 75 astra 150,000km 5 benzin opel nein 28.3.16 00:00 0 44807 2.4.16 14:18
49997 2.4.16 14:44 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot $13,200 test cabrio 2014 automatik 69 500 5,000km 11 benzin fiat nein 2.4.16 00:00 0 73430 4.4.16 11:47
49998 8.3.16 19:25 Audi_A3_2.0_TDI_Sportback_Ambition privat Angebot $22,900 control kombi 2013 manuell 150 a3 40,000km 11 diesel audi nein 8.3.16 00:00 0 35683 5.4.16 16:45
49999 14.3.16 00:42 Opel_Vectra_1.6_16V privat Angebot $1,250 control limousine 1996 manuell 101 vectra 150,000km 1 benzin opel nein 13.3.16 00:00 0 45897 6.4.16 21:18

50000 rows × 20 columns

As we can see from the result of the code above, there are 50,000 rows and 20 columns in the dataset.

Exploring data

In order to see all columns, their types and the number of non-null values in each column, I use the info() method. In addition, printing the first 5 rows helps to comprehend what type of data we are dealing with.

In [32]:
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
Out[32]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 26.3.16 17:47 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 26.3.16 00:00 0 79588 6.4.16 06:45
1 4.4.16 13:38 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 4.4.16 00:00 0 71034 6.4.16 14:45
2 26.3.16 18:57 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 26.3.16 00:00 0 35394 6.4.16 20:15
3 12.3.16 16:58 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 12.3.16 00:00 0 33729 15.3.16 03:16
4 1.4.16 14:38 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 1.4.16 00:00 0 39218 1.4.16 14:38

Only 5 of the 20 columns are stored as an integer type; most columns are stored as strings.
The majority of columns have no null values. In the minority of columns where null values exist, their share is less than 20% per column.
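The per-column null shares quoted above can be computed in one line with `isnull().mean()`. A minimal sketch on a toy frame (the real check would run on `autos`):

```python
import pandas as pd

# Toy frame standing in for `autos`; two nulls out of four in one column.
df = pd.DataFrame({
    "vehicleType": ["bus", None, "kombi", None],
    "brand": ["peugeot", "bmw", "ford", "opel"],
})

# isnull() gives a boolean frame; the per-column mean is the null share.
null_share_pct = df.isnull().mean() * 100
print(null_share_pct)
```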

Cleaning dataset

Changing column labels

Below I printed the column names. From the result you can see that the column labels are in camelcase and some labels are unnecessarily long. I renamed them to snakecase and shortened some labels. Besides that, I added 'km' to the 'odometer' column label and 'usd' to the 'price' column label. The added information ('km' & 'usd') is present in all values of both corresponding columns, as you can see if you look closely at the data printed by the cells above. I will delete those characters later, when I convert the columns to a numeric type. This way the information that prices are in US dollars and covered distance is in kilometers will be preserved in the column labels.

In [33]:
print(autos.columns)
col_dict = {'dateCrawled': 'date_crawled', 'name': 'name', 'seller': 'seller',
            'offerType': 'offer_type', 'price': 'price_usd', 'abtest': 'ab_test',
            'vehicleType': 'vehicle_type', 'yearOfRegistration': 'registration_year',
            'gearbox': 'gearbox', 'powerPS': 'power_ps', 'model': 'model',
            'odometer': 'odometer_km', 'monthOfRegistration': 'registration_month',
            'fuelType': 'fuel_type', 'brand': 'brand','notRepairedDamage': 'unrepaired_damage',
            'dateCreated': 'ad_created', 'nrOfPictures': 'nr_of_pictures',
            'postalCode': 'postal_code','lastSeen': 'last_seen'}
autos.rename(columns=col_dict, inplace=True)
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')
Out[33]:
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price_usd', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
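An explicit rename dictionary was used because some labels also change meaning ('price' → 'price_usd'). For the pure camelcase-to-snakecase part, a regex would do the same job; a sketch (not the notebook's approach):

```python
import re

def camel_to_snake(label):
    # Insert '_' before every uppercase letter that is not at the start,
    # then lowercase the whole label.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', label).lower()

print(camel_to_snake('yearOfRegistration'))  # year_of_registration
print(camel_to_snake('notRepairedDamage'))   # not_repaired_damage
```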

Cleaning the 'odometer_km', 'price_usd' & 'unrepaired_damage' columns

As mentioned above, I will delete the special characters from the 'odometer_km' and 'price_usd' columns and then convert them to numeric data types. The main purpose is to use the resulting float kilometer and price values to calculate averages for the most frequently represented brands in the dataset.

In [34]:
autos['odometer_km'] = autos['odometer_km'].str.replace('km', '').str.replace(',', '').astype(float)
autos['price_usd'] = autos['price_usd'].str.replace('$', '').str.replace(',', '').astype(float)
autos.head(10)
Out[34]:
date_crawled name seller offer_type price_usd ab_test vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 26.3.16 17:47 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000.000 control bus 2004 manuell 158 andere 150000.000 3 lpg peugeot nein 26.3.16 00:00 0 79588 6.4.16 06:45
1 4.4.16 13:38 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500.000 control limousine 1997 automatik 286 7er 150000.000 6 benzin bmw nein 4.4.16 00:00 0 71034 6.4.16 14:45
2 26.3.16 18:57 Volkswagen_Golf_1.6_United privat Angebot 8990.000 test limousine 2009 manuell 102 golf 70000.000 7 benzin volkswagen nein 26.3.16 00:00 0 35394 6.4.16 20:15
3 12.3.16 16:58 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot 4350.000 control kleinwagen 2007 automatik 71 fortwo 70000.000 6 benzin smart nein 12.3.16 00:00 0 33729 15.3.16 03:16
4 1.4.16 14:38 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot 1350.000 test kombi 2003 manuell 0 focus 150000.000 7 benzin ford nein 1.4.16 00:00 0 39218 1.4.16 14:38
5 21.3.16 13:47 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... privat Angebot 7900.000 test bus 2006 automatik 150 voyager 150000.000 4 diesel chrysler NaN 21.3.16 00:00 0 22962 6.4.16 09:45
6 20.3.16 17:55 VW_Golf_III_GT_Special_Electronic_Green_Metall... privat Angebot 300.000 test limousine 1995 manuell 90 golf 150000.000 8 benzin volkswagen NaN 20.3.16 00:00 0 31535 23.3.16 02:48
7 16.3.16 18:55 Golf_IV_1.9_TDI_90PS privat Angebot 1990.000 control limousine 1998 manuell 90 golf 150000.000 12 diesel volkswagen nein 16.3.16 00:00 0 53474 7.4.16 03:17
8 22.3.16 16:51 Seat_Arosa privat Angebot 250.000 test NaN 2000 manuell 0 arosa 150000.000 10 NaN seat nein 22.3.16 00:00 0 7426 26.3.16 18:18
9 16.3.16 13:47 Renault_Megane_Scenic_1.6e_RT_Klimaanlage privat Angebot 590.000 control bus 1997 manuell 90 megane 150000.000 7 benzin renault nein 16.3.16 00:00 0 15749 6.4.16 10:46
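The chained `str.replace` calls above work; an equivalent, slightly more compact route strips every non-digit with a single regex and converts with `pd.to_numeric`. A sketch on toy values mimicking the raw formats:

```python
import pandas as pd

# Toy series in the raw '$5,000' / '150,000km' formats.
price = pd.Series(['$5,000', '$8,500', '$1,350'])
odometer = pd.Series(['150,000km', '70,000km', '5,000km'])

# Remove everything that is not a digit, then convert in one step.
price_num = pd.to_numeric(price.str.replace(r'[^\d]', '', regex=True))
odometer_num = pd.to_numeric(odometer.str.replace(r'[^\d]', '', regex=True))
print(price_num.tolist(), odometer_num.tolist())
```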

There are 2 columns with German words as values which I need to change:

  • unrepaired_damage - whether the car has damage which is not yet repaired.
  • gearbox - the transmission type.

I will change the German words for 'yes' and 'no' in the 'unrepaired_damage' column to 1's and 0's, and the German words for the transmission type to their English counterparts.

In [35]:
# leaving nan in its place 
damage_dict = {'nein': 0, 'ja': 1}
autos['unrepaired_damage']= autos['unrepaired_damage'].map(damage_dict)

gear_dict = {'automatik' : 'automatic', 'manuell': 'manual'}
autos['gearbox'] = autos['gearbox'].map(gear_dict)
print_2n(autos['unrepaired_damage'].value_counts(dropna=False),\
           autos['gearbox'].value_counts(dropna=False))
 0.000    35232
nan       9829
1.000     4939
Name: unrepaired_damage, dtype: int64


 manual       36993
automatic    10327
NaN           2680
Name: gearbox, dtype: int64
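A caveat on `map` worth keeping in mind: any value missing from the dictionary (not only NaN) silently becomes NaN, whereas `replace` leaves unlisted values untouched. A small illustration on a toy series:

```python
import pandas as pd
import numpy as np

s = pd.Series(['nein', 'ja', np.nan, 'unexpected'])

mapped = s.map({'nein': 0, 'ja': 1})        # 'unexpected' turns into NaN
replaced = s.replace({'nein': 0, 'ja': 1})  # 'unexpected' is preserved

print(mapped.tolist())
print(replaced.tolist())
```

With `map`, a typo like 'Nein' in the raw data would be lost without warning, so checking `value_counts(dropna=False)` afterwards, as the cell above does, is a good habit.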

I have already cleaned the price column of special characters, moved the information stored in the special character to the column label and converted the column to a numeric type. There are also no null values, as shown in the exploring data section. But before this column can be used in any analysis, we should check whether there are outliers or unrealistic values. Considering that we analyse the used car market, very expensive cars seem unrelated to the topic. Very high prices can come from rare old models, mistakes in the dataset, almost new cars and so on. In any case, these prices cannot be corrected on an objective basis at this step. There are also suspiciously cheap cars in the dataset. I will check the descriptive statistics plus the top 10 and bottom 10 values to better understand the extreme prices.

In [36]:
desc_stat_price = autos['price_usd'].describe()
print(desc_stat_price)

top_10_price = autos['price_usd'].value_counts().sort_index(ascending=False).head(10)
bottom_10_price = autos['price_usd'].value_counts().sort_index(ascending=False).tail(10)
print_2n(top_10_price, bottom_10_price)
count      50000.000
mean        9840.044
std       481104.381
min            0.000
25%         1100.000
50%         2950.000
75%         7200.000
max     99999999.000
Name: price_usd, dtype: float64


 99999999.000    1
27322222.000    1
12345678.000    3
11111111.000    2
10000000.000    1
3890000.000     1
1300000.000     1
1234566.000     1
999999.000      2
999990.000      1
Name: price_usd, dtype: int64


 12.000       3
11.000       2
10.000       7
9.000        1
8.000        1
5.000        2
3.000        1
2.000        3
1.000      156
0.000     1421
Name: price_usd, dtype: int64

The descriptive statistics in the output of the above cell show that the minimum and maximum prices are unexpectedly low (0 USD) and high (almost 100 million USD), respectively. The bottom 10 prices are below 15 dollars and the top ten lie between roughly 1 million and 100 million dollars. Therefore I should set boundaries, and I consider the range between 100 USD and 100,000 USD sensible. Below I check the number of cars that fall outside this range.

In [37]:
unreal_prices = autos[(autos['price_usd'] >= 100000) |\
      (autos['price_usd'] <= 100)].shape[0]
print('\n'*2, '# of rows with unrealistic prices (100 or below and 100k or above):'\
            , unreal_prices)
 # of rows with unrealistic prices (100 or below and 100k or above): 1949

In the cell below I remove those rows to eliminate the effect of price outliers, keeping only prices between 100 and 100,000 USD inclusive, and check the descriptive statistics again.

In [38]:
autos = autos[autos['price_usd'].between(100, 100000)]
desc_stat_price = autos['price_usd'].describe()
print_2n(desc_stat_price)
 count   48185.000
mean     5796.100
std      7525.532
min       100.000
25%      1250.000
50%      3000.000
75%      7499.000
max     99900.000
Name: price_usd, dtype: float64
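The [100, 100000] window is a judgment call; a data-driven alternative is Tukey's 1.5×IQR fence. A sketch on a toy price series (assumption: the classic 1.5 multiplier; the real computation would use autos['price_usd']):

```python
import pandas as pd

prices = pd.Series([100, 150, 1250, 2950, 3000, 7499, 99900])

# Fences at 1.5 interquartile ranges beyond the quartiles.
q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only values inside the fences; here only 99900 falls outside.
kept = prices[prices.between(lower, upper)]
print(len(kept))
```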

I use the same methods as above to check whether there are outliers in the 'odometer_km' column, which represents the distance covered by each car. There are also no null values in this column (see exploring data).

In [39]:
# checking descriptive statistics and unrealistically high and low 
# covered distances 
desc_stat_odo = autos['odometer_km'].describe()
print_2n(desc_stat_odo)

top_10_odo = autos['odometer_km'].value_counts().sort_index(ascending=False).head(10)
bottom_10_odo = autos['odometer_km'].value_counts().sort_index(ascending=False).tail(10)
print_2n(top_10_odo, bottom_10_odo)
 count    48185.000
mean    125986.718
std      39467.457
min       5000.000
25%     125000.000
50%     150000.000
75%     150000.000
max     150000.000
Name: odometer_km, dtype: float64


 150000.000    31209
125000.000     5035
100000.000     2099
90000.000      1733
80000.000      1411
70000.000      1214
60000.000      1151
50000.000      1007
40000.000       813
30000.000       771
Name: odometer_km, dtype: int64


 90000.000    1733
80000.000    1411
70000.000    1214
60000.000    1151
50000.000    1007
40000.000     813
30000.000     771
20000.000     749
10000.000     242
5000.000      751
Name: odometer_km, dtype: int64

The minimum (5,000 km) and maximum (150,000 km) values of the 'odometer_km' column make sense. The top 10 and bottom 10 values also seem realistic and usable for the later analysis, where I will aggregate average values.

Understanding and cleaning date columns

There are 5 columns containing date and/or time data:

  • registration_month - The month in which the car was first registered.
  • registration_year - The year in which the car was first registered.
  • date_crawled - When this ad was first crawled.
  • ad_created - The date on which the eBay listing was created.
  • last_seen - When the crawler saw this ad last online.

To get a better understanding, we should pay attention to the output of the cell below, which shows the data types and descriptive statistics of all 5 columns.

In [40]:
date_cols = ['registration_month', 'registration_year', 'date_crawled',\
             'ad_created' , 'last_seen']
print(autos[date_cols].dtypes)
autos[date_cols].describe(include='all')
registration_month     int64
registration_year      int64
date_crawled          object
ad_created            object
last_seen             object
dtype: object
Out[40]:
registration_month registration_year date_crawled ad_created last_seen
count 48185.000 48185.000 48185 48185 48185
unique nan nan 11994 76 10226
top nan nan 28.3.16 14:49 3.4.16 00:00 6.4.16 02:45
freq nan nan 15 1872 111
mean 5.802 2004.730 NaN NaN NaN
std 3.678 87.932 NaN NaN NaN
min 0.000 1000.000 NaN NaN NaN
25% 3.000 1999.000 NaN NaN NaN
50% 6.000 2004.000 NaN NaN NaN
75% 9.000 2008.000 NaN NaN NaN
max 12.000 9999.000 NaN NaN NaN

From the result of the above cell it is visible that the 'registration_month' and 'registration_year' columns are stored as numeric types, while the remaining 3 columns are stored as strings. To see the format in which the values of all 5 date columns are represented, I will separate the column labels into 2 lists based on data type. Then I will print the top 5 rows of both to see how the date-related information is stored.

In [41]:
str_date_cols = ['date_crawled','ad_created' , 'last_seen']
int_date_cols = ['registration_year', 'registration_month']
print_2n(autos[str_date_cols].head(), autos[int_date_cols].head())
     date_crawled     ad_created      last_seen
0  26.3.16 17:47  26.3.16 00:00   6.4.16 06:45
1   4.4.16 13:38   4.4.16 00:00   6.4.16 14:45
2  26.3.16 18:57  26.3.16 00:00   6.4.16 20:15
3  12.3.16 16:58  12.3.16 00:00  15.3.16 03:16
4   1.4.16 14:38   1.4.16 00:00   1.4.16 14:38


    registration_year  registration_month
0               2004                   3
1               1997                   6
2               2009                   7
3               2007                   6
4               2003                   7

As we can see, the 3 columns stored as strings contain dates and times. Later, dates or times can be extracted from these columns and used for analysis. The other 2 date columns represent the registration year and month as numbers and can therefore be used for further analysis without any change of data type.
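Instead of string splitting, the three string columns can also be parsed directly into datetimes; a sketch, assuming the 'day.month.2-digit-year hour:minute' pattern visible above holds for every row:

```python
import pandas as pd

stamps = pd.Series(['26.3.16 17:47', '4.4.16 13:38', '6.4.16 06:45'])
parsed = pd.to_datetime(stamps, format='%d.%m.%y %H:%M')

# The .dt accessor then exposes year, month, day etc. directly.
print(parsed.dt.month.tolist())
```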

Now that we have seen what data is saved in the date columns, we can pay closer attention to the 'last_seen', 'registration_year' & 'registration_month' columns. There are 2 reasons to look closer at these columns:

  • presence of very old cars that can be considered outliers and would distort the patterns in the analysis
  • presence of "impossibly new" cars: there are rows where the registration year is later than the corresponding year in the 'last_seen' column, which is impossible.

To clean these 2 categories, I will first check the distribution of the 'last_seen' column. This will help to understand the approximate upper limit for a realistic registration year. I will also check the distributions of the other 2 string date columns ('date_crawled', 'ad_created').

In [42]:
for col in str_date_cols:
    date = autos[col].str.split().str[0]
    month_year = date.str.split('.').str[1] + "-" +\
    date.str.split('.').str[2]
    autos[f'{col}_month_year'] = month_year
    month_year_dist = month_year.value_counts(normalize=True, dropna=False)\
                    .sort_values(ascending=False)*100
    print_2n(f"Distribution for '{col}_month_year' column:", month_year_dist)
C:\Users\Asus\anaconda3\lib\site-packages\ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """
 Distribution for 'date_crawled_month_year' column:


 3-16   83.792
4-16   16.208
Name: date_crawled, dtype: float64


 Distribution for 'ad_created_month_year' column:


 3-16    83.734
4-16    16.103
2-16     0.127
1-16     0.025
12-15    0.004
6-15     0.002
11-15    0.002
9-15     0.002
8-15     0.002
Name: ad_created, dtype: float64


 Distribution for 'last_seen_month_year' column:


 4-16   57.651
3-16   42.349
Name: last_seen, dtype: float64

The output presents the date format as 'month number-2 digit year' (e.g. 4-16 is April 2016). The distributions of the 'last_seen' and 'date_crawled' columns show that there are only 2 values: 3-16 (March 2016) and 4-16 (April 2016). The distribution of the 'ad_created' column also shows no date later than 4-16 (April 2016). These facts tell us that the dataset presents the situation of the used car market on eBay in Germany as of March - April 2016.

The above observations make it clear that a registration year above 2016 is an error and cannot be realistic. Moreover, rows where the registration year is 2016 and the registration month is not earlier than the month in the 'last_seen' column are treated as containing false dates and are deleted.

In [43]:
autos['last_seen_month'] = autos['last_seen'].str.split().str[0]\
.str.split(".").str[1].astype(int)
autos = autos[~((autos['registration_year'] == 2016) &\
              (autos['registration_month'] >= autos['last_seen_month']))]

autos.to_csv('autos_cleaned.csv', index=False)
autos
C:\Users\Asus\anaconda3\lib\site-packages\ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
Out[43]:
date_crawled name seller offer_type price_usd ab_test vehicle_type registration_year gearbox power_ps ... brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen date_crawled_month_year ad_created_month_year last_seen_month_year last_seen_month
0 26.3.16 17:47 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000.000 control bus 2004 manual 158 ... peugeot 0.000 26.3.16 00:00 0 79588 6.4.16 06:45 3-16 3-16 4-16 4
1 4.4.16 13:38 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500.000 control limousine 1997 automatic 286 ... bmw 0.000 4.4.16 00:00 0 71034 6.4.16 14:45 4-16 4-16 4-16 4
2 26.3.16 18:57 Volkswagen_Golf_1.6_United privat Angebot 8990.000 test limousine 2009 manual 102 ... volkswagen 0.000 26.3.16 00:00 0 35394 6.4.16 20:15 3-16 3-16 4-16 4
3 12.3.16 16:58 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot 4350.000 control kleinwagen 2007 automatic 71 ... smart 0.000 12.3.16 00:00 0 33729 15.3.16 03:16 3-16 3-16 3-16 3
4 1.4.16 14:38 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot 1350.000 test kombi 2003 manual 0 ... ford 0.000 1.4.16 00:00 0 39218 1.4.16 14:38 4-16 4-16 4-16 4
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 27.3.16 14:38 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon privat Angebot 24900.000 control limousine 2011 automatic 239 ... audi 0.000 27.3.16 00:00 0 82131 1.4.16 13:47 3-16 3-16 4-16 4
49996 28.3.16 10:50 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot 1980.000 control cabrio 1996 manual 75 ... opel 0.000 28.3.16 00:00 0 44807 2.4.16 14:18 3-16 3-16 4-16 4
49997 2.4.16 14:44 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot 13200.000 test cabrio 2014 automatic 69 ... fiat 0.000 2.4.16 00:00 0 73430 4.4.16 11:47 4-16 4-16 4-16 4
49998 8.3.16 19:25 Audi_A3_2.0_TDI_Sportback_Ambition privat Angebot 22900.000 control kombi 2013 manual 150 ... audi 0.000 8.3.16 00:00 0 35683 5.4.16 16:45 3-16 3-16 4-16 4
49999 14.3.16 00:42 Opel_Vectra_1.6_16V privat Angebot 1250.000 control limousine 1996 manual 101 ... opel 0.000 13.3.16 00:00 0 45897 6.4.16 21:18 3-16 3-16 4-16 4

47440 rows × 24 columns
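The SettingWithCopyWarning in the output appears because `autos` is at this point itself a slice of the originally loaded frame, and assigning a new column to a slice is ambiguous. Chaining `.copy()` after each boolean filter (e.g. `autos = autos[...].copy()`) removes the ambiguity; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# An explicit copy makes the subset an independent frame,
# so adding a column to it cannot trigger the warning.
subset = df[df['a'] > 1].copy()
subset['c'] = subset['a'] + subset['b']
print(subset['c'].tolist())
```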

For the lower limit, let's first consider 1950. Now I need to see how many cars were registered before that year and decide whether the number is insignificant enough to delete the corresponding rows.

In [44]:
too_old_cars = autos[autos['registration_year'] < 1950].shape[0]
print(f'# of too old cars = {too_old_cars}')
# of too old cars = 22

As the number of cars registered before 1950 is just 22, removing them will not cause a significant data loss, so they can be deleted from the dataset.

In [45]:
# removing cars registered before 1950 and after 2016
autos = autos[autos['registration_year'].between(1949, 2017, inclusive=False)]

As a result of all the operations above, I got cleaned data and saved it to a new csv file ('autos_cleaned.csv'). I will use the cleaned data to analyse the used car market on German eBay based on the most common brands and their average mileage, percentage of unrepaired damage, gearbox type and average age. As mentioned before, I will try to find patterns linking the average prices of popular brands to these other factors.

Analysis

First of all, in order to analyse the most frequent brands in our dataset, we need detailed information on brands. Then I will create a new dataset containing only the popular brands and aggregated values for these brands. In the cell below I check the unique brands, the descriptive statistics of the 'brand' column, and the distribution of brands in percent in descending order.

In [46]:
unique_brands = autos['brand'].unique()
stat_brands = autos['brand'].describe()
brands_dist = autos['brand'].value_counts(normalize=True, dropna=False)*100
print_2n(unique_brands, stat_brands, brands_dist)
 ['peugeot' 'bmw' 'volkswagen' 'smart' 'ford' 'chrysler' 'seat' 'renault'
 'mercedes_benz' 'audi' 'sonstige_autos' 'opel' 'mazda' 'porsche' 'mini'
 'toyota' 'dacia' 'nissan' 'jeep' 'saab' 'volvo' 'mitsubishi' 'jaguar'
 'fiat' 'skoda' 'subaru' 'kia' 'citroen' 'chevrolet' 'hyundai' 'honda'
 'daewoo' 'suzuki' 'trabant' 'land_rover' 'alfa_romeo' 'lada' 'rover'
 'daihatsu' 'lancia']


 count          45551
unique            40
top       volkswagen
freq            9618
Name: brand, dtype: object


 volkswagen       21.115
bmw              11.093
opel             10.667
mercedes_benz     9.708
audi              8.724
ford              6.968
renault           4.702
peugeot           2.975
fiat              2.553
seat              1.811
skoda             1.655
nissan            1.528
mazda             1.521
smart             1.416
citroen           1.401
toyota            1.284
hyundai           1.001
sonstige_autos    0.933
volvo             0.920
mini              0.883
mitsubishi        0.819
honda             0.788
kia               0.716
alfa_romeo        0.667
suzuki            0.593
chevrolet         0.571
porsche           0.551
chrysler          0.351
dacia             0.266
daihatsu          0.252
jeep              0.231
land_rover        0.213
subaru            0.211
saab              0.167
jaguar            0.156
daewoo            0.151
trabant           0.138
rover             0.134
lancia            0.108
lada              0.059
Name: brand, dtype: float64

There are 40 unique brands in the dataset, with Volkswagen being the most frequent one. All of them appear in the list of unique brands above. But I plan to analyse patterns in popular brands only, so brands with a very low number of cars will not be included in the new dataset I create. The threshold I set for a brand to count as popular is 1 percent: if a brand's share in the distribution above is below 1%, it is left out. From the distribution we can see that the last brand that qualifies is 'hyundai' and the first one that does not is 'sonstige_autos'.
The next step is to create a new dataset with popular brands only. Each brand will be represented by one row, with the brand names as index values. I will also create the first column, containing the average price for each brand.

In [47]:
# selecting brands which cover 1 or more than 1 percent of total share
cmmn_brands = brands_dist[brands_dist >= 1]

# aggregating avg price for each common brand
# adding brand as key & avg price as value to dict
# creating series from dict: key turns to index,
# values to column and descending order
# creating df from series
brands_price_dict = {}
for brand in cmmn_brands.index:
    avg_price = autos.loc[autos['brand'] == brand, 'price_usd'].mean()
    brands_price_dict[brand] = avg_price
    
cmmn_brands_avg_price = pd.Series(brands_price_dict).sort_values(ascending=False)
cmmn_brands_df = pd.DataFrame(cmmn_brands_avg_price, columns=['avg_price'])
cmmn_brands_df.index.name = 'brand'
cmmn_brands_df
Out[47]:
avg_price
brand
audi 9379.054
mercedes_benz 8627.313
bmw 8293.899
skoda 6424.845
hyundai 5484.410
volkswagen 5482.053
toyota 5200.318
nissan 4803.099
seat 4479.248
mazda 4178.258
citroen 3841.723
ford 3774.345
smart 3614.164
peugeot 3129.524
opel 3026.059
fiat 2857.802
renault 2512.908
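The explicit loop above is easy to follow; `groupby` produces the same averages in one expression. A sketch on toy data (the real call would be on `autos` restricted to `cmmn_brands.index`):

```python
import pandas as pd

autos_toy = pd.DataFrame({
    'brand': ['audi', 'audi', 'bmw', 'bmw', 'lada'],
    'price_usd': [10000.0, 8000.0, 9000.0, 7000.0, 1000.0],
})
common = ['audi', 'bmw']  # stands in for cmmn_brands.index

# Mean price per common brand, sorted from most to least expensive.
avg_price = (autos_toy[autos_toy['brand'].isin(common)]
             .groupby('brand')['price_usd']
             .mean()
             .sort_values(ascending=False))
print(avg_price.to_dict())
```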

Now we have a new dataset called 'cmmn_brands_df'. In the next 6 cells I will add the other columns to this new dataset.

Adding new columns to the 'cmmn_brands_df' dataset

Adding the 'avg_km_covered' column:
Values in this column represent the average distance in kilometers covered by the cars of each brand.

In [48]:
# aggregating avg covered distance for each brand in the 
# common brands df and adding it as a series to the same df 
brands_avg_km = {}
for brand in cmmn_brands_df.index:
    avg_mlg = autos.loc[autos['brand'] == brand, 'odometer_km'].mean()
    brands_avg_km[brand] = avg_mlg

cmmn_brands_df['avg_km_covered'] = pd.Series(brands_avg_km)
cmmn_brands_df
Out[48]:
avg_price avg_km_covered
brand
audi 9379.054 129266.482
mercedes_benz 8627.313 131007.463
bmw 8293.899 132673.659
skoda 6424.845 110835.544
hyundai 5484.410 106293.860
volkswagen 5482.053 128614.057
toyota 5200.318 115777.778
nissan 4803.099 118520.115
seat 4479.248 121321.212
mazda 4178.258 124206.349
citroen 3841.723 119184.953
ford 3774.345 124188.721
smart 3614.164 99263.566
peugeot 3129.524 126974.170
opel 3026.059 129320.848
fiat 2857.802 116822.872
renault 2512.908 128204.949

Adding the 'damage_info_%' column:
Values in this column represent the percentage of cars of each brand for which there is information on whether the car has unrepaired damage or not.
In the next cells I will also add the percentage of cars with unrepaired damage and the percentage of healthy cars. But as we learned earlier, there are null values in the corresponding column. Therefore it is crucial to check whether the majority of cars have information about the presence of unrepaired damage.

In [49]:
damage_info_dict = {}
for brand in cmmn_brands_df.index:
    count_brands = autos[autos['brand'] == brand].shape[0] 
    count_na = autos.loc[(autos['brand'] == brand) &\
             (autos['unrepaired_damage'].isnull())].shape[0]
    share_dmg = 100 - (count_na / count_brands ) * 100 
    damage_info_dict[brand] = share_dmg

cmmn_brands_df['damage_info_%'] = pd.Series(damage_info_dict)
cmmn_brands_df
Out[49]:
avg_price avg_km_covered damage_info_%
brand
audi 9379.054 129266.482 84.625
mercedes_benz 8627.313 131007.463 85.052
bmw 8293.899 132673.659 85.197
skoda 6424.845 110835.544 88.859
hyundai 5484.410 106293.860 88.158
volkswagen 5482.053 128614.057 81.004
toyota 5200.318 115777.778 88.547
nissan 4803.099 118520.115 85.057
seat 4479.248 121321.212 84.606
mazda 4178.258 124206.349 81.818
citroen 3841.723 119184.953 81.818
ford 3774.345 124188.721 80.466
smart 3614.164 99263.566 83.876
peugeot 3129.524 126974.170 84.207
opel 3026.059 129320.848 80.078
fiat 2857.802 116822.872 77.988
renault 2512.908 128204.949 79.458

The output of the above cell shows that more than 75% of cars of each brand have information about the presence of unrepaired damage. As a significant majority of cars carry this information, we can rely on an analysis based on the percentage of healthy cars or cars with unrepaired damage.

Adding 'unrep_damage_%' column:
Values in this column represent the percentage of cars with unrepaired damage for each brand.

Adding 'healthy_car_%' column:
Values in this column represent the percentage of cars with no unrepaired damage for each brand.

In [50]:
unrep_damage_dict = {}
for brand in cmmn_brands_df.index:
    damaged_share = autos.loc[(autos['brand'] == brand) &\
                  (autos['unrepaired_damage'].notnull()), 'unrepaired_damage'].mean() * 100
    unrep_damage_dict[brand] = damaged_share

cmmn_brands_df['unrep_damage_%'] = pd.Series(unrep_damage_dict)
cmmn_brands_df['healthy_car_%'] = cmmn_brands_df['damage_info_%']\
                                - cmmn_brands_df['unrep_damage_%']
cmmn_brands_df
Out[50]:
avg_price avg_km_covered damage_info_% unrep_damage_% healthy_car_%
brand
audi 9379.054 129266.482 84.625 9.664 74.961
mercedes_benz 8627.313 131007.463 85.052 8.881 76.171
bmw 8293.899 132673.659 85.197 9.175 76.022
skoda 6424.845 110835.544 88.859 9.701 79.158
hyundai 5484.410 106293.860 88.158 13.930 74.228
volkswagen 5482.053 128614.057 81.004 10.602 70.402
toyota 5200.318 115777.778 88.547 7.915 80.632
nissan 4803.099 118520.115 85.057 15.372 69.686
seat 4479.248 121321.212 84.606 11.461 73.145
mazda 4178.258 124206.349 81.818 14.286 67.532
citroen 3841.723 119184.953 81.818 12.452 69.366
ford 3774.345 124188.721 80.466 15.427 65.040
smart 3614.164 99263.566 83.876 8.133 75.743
peugeot 3129.524 126974.170 84.207 15.688 68.519
opel 3026.059 129320.848 80.078 14.418 65.660
fiat 2857.802 116822.872 77.988 14.002 63.986
renault 2512.908 128204.949 79.458 15.805 63.654

Adding 'automatic_%' column:
Values in this column represent the percentage of cars with an automatic gearbox for each brand.

Adding 'manual_%' column:
Values in this column represent the percentage of cars with a manual gearbox for each brand.

In [51]:
automatic_dict = {}
manual_dict = {}
for brand in cmmn_brands_df.index:
    automatic_share = autos.loc[autos['brand'] == brand, 'gearbox']\
                .value_counts(normalize=True, dropna=False)['automatic'] * 100
    manual_share = autos.loc[autos['brand'] == brand, 'gearbox']\
                .value_counts(normalize=True, dropna=False)['manual'] * 100
    automatic_dict[brand] = automatic_share
    manual_dict[brand] = manual_share
    
cmmn_brands_df['automatic_%'] = pd.Series(automatic_dict)
cmmn_brands_df['manual_%'] = pd.Series(manual_dict)
cmmn_brands_df
Out[51]:
avg_price avg_km_covered damage_info_% unrep_damage_% healthy_car_% automatic_% manual_%
brand
audi 9379.054 129266.482 84.625 9.664 74.961 33.065 63.538
mercedes_benz 8627.313 131007.463 85.052 8.881 76.171 55.450 41.633
bmw 8293.899 132673.659 85.197 9.175 76.022 33.742 63.606
skoda 6424.845 110835.544 88.859 9.701 79.158 11.538 85.942
hyundai 5484.410 106293.860 88.158 13.930 74.228 12.719 84.211
volkswagen 5482.053 128614.057 81.004 10.602 70.402 13.329 82.086
toyota 5200.318 115777.778 88.547 7.915 80.632 12.137 84.274
nissan 4803.099 118520.115 85.057 15.372 69.686 10.920 85.201
seat 4479.248 121321.212 84.606 11.461 73.145 3.879 92.606
mazda 4178.258 124206.349 81.818 14.286 67.532 6.061 90.909
citroen 3841.723 119184.953 81.818 12.452 69.366 13.323 81.191
ford 3774.345 124188.721 80.466 15.427 65.040 8.034 87.492
smart 3614.164 99263.566 83.876 8.133 75.743 73.178 11.628
peugeot 3129.524 126974.170 84.207 15.688 68.519 5.461 90.849
opel 3026.059 129320.848 80.078 14.418 65.660 10.249 85.120
fiat 2857.802 116822.872 77.988 14.002 63.986 2.408 92.347
renault 2512.908 128204.949 79.458 15.805 63.654 6.676 88.562
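The gearbox shares could likewise be computed in one shot with pd.crosstab. A sketch on toy data; note one subtle difference: the loop above uses dropna=False, so missing gearbox values count toward the denominator there, while crosstab ignores NaN rows by default:

```python
import pandas as pd

# toy stand-in for the brand and gearbox columns
autos = pd.DataFrame({
    'brand': ['audi', 'audi', 'audi', 'bmw', 'bmw'],
    'gearbox': ['automatic', 'manual', 'manual', 'automatic', 'automatic'],
})

# row-normalized counts -> share of each gearbox type per brand, in percent
shares = pd.crosstab(autos['brand'], autos['gearbox'], normalize='index') * 100
print(shares)
```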

Adding 'vehicle_count' column:
Values in this column represent the number of cars for each brand.

In [52]:
count_dict = {}
for brand in cmmn_brands_df.index:
    vehicle_count = autos[autos['brand'] == brand].shape[0]
    count_dict[brand] = vehicle_count

cmmn_brands_df.insert(0, 'vehicle_count', pd.Series(count_dict))
cmmn_brands_df
Out[52]:
vehicle_count avg_price avg_km_covered damage_info_% unrep_damage_% healthy_car_% automatic_% manual_%
brand
audi 3974 9379.054 129266.482 84.625 9.664 74.961 33.065 63.538
mercedes_benz 4422 8627.313 131007.463 85.052 8.881 76.171 55.450 41.633
bmw 5053 8293.899 132673.659 85.197 9.175 76.022 33.742 63.606
skoda 754 6424.845 110835.544 88.859 9.701 79.158 11.538 85.942
hyundai 456 5484.410 106293.860 88.158 13.930 74.228 12.719 84.211
volkswagen 9618 5482.053 128614.057 81.004 10.602 70.402 13.329 82.086
toyota 585 5200.318 115777.778 88.547 7.915 80.632 12.137 84.274
nissan 696 4803.099 118520.115 85.057 15.372 69.686 10.920 85.201
seat 825 4479.248 121321.212 84.606 11.461 73.145 3.879 92.606
mazda 693 4178.258 124206.349 81.818 14.286 67.532 6.061 90.909
citroen 638 3841.723 119184.953 81.818 12.452 69.366 13.323 81.191
ford 3174 3774.345 124188.721 80.466 15.427 65.040 8.034 87.492
smart 645 3614.164 99263.566 83.876 8.133 75.743 73.178 11.628
peugeot 1355 3129.524 126974.170 84.207 15.688 68.519 5.461 90.849
opel 4859 3026.059 129320.848 80.078 14.418 65.660 10.249 85.120
fiat 1163 2857.802 116822.872 77.988 14.002 63.986 2.408 92.347
renault 2142 2512.908 128204.949 79.458 15.805 63.654 6.676 88.562

Adding 'avg_age' column:
Values in this column represent the average age of cars for each brand.

In [53]:
# as there are '0' values in registration month column I replaced them with 1
reg_month_0to1 = autos['registration_month'].replace(0, 1).astype(str)
# combining values in registration month and registration year columns
m_y_reg = reg_month_0to1 + '-' + autos['registration_year'].astype(str)
# converting above combination to date values
date_m_y_reg = pd.to_datetime(m_y_reg, format='%m-%Y')
# finding latest date from last seen column
latest_date = pd.to_datetime(autos['last_seen'].str.split().str[0], format='%d.%m.%y').max()
# finding age of each car in days
diff = latest_date - date_m_y_reg
# converting age from days to years and adding to autos dataset
autos['vehicle_age'] = diff.dt.components['days'].astype(int) / 365

# finding average age of cars by brand and adding it to cmmn_brands_df dataset
age_dict = {}
for brand in cmmn_brands_df.index:
    avg = autos.loc[autos['brand'] == brand, 'vehicle_age'].mean()
    age_dict[brand] = avg
    
cmmn_brands_df.insert(1, 'avg_age', pd.Series(age_dict))
#cmmn_brands_df.to_csv('autos_brands.csv')
cmmn_brands_df
C:\Users\Asus\anaconda3\lib\site-packages\ipykernel_launcher.py:12: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':
Out[53]:
vehicle_count avg_age avg_price avg_km_covered damage_info_% unrep_damage_% healthy_car_% automatic_% manual_%
brand
audi 3974 11.815 9379.054 129266.482 84.625 9.664 74.961 33.065 63.538
mercedes_benz 4422 13.924 8627.313 131007.463 85.052 8.881 76.171 55.450 41.633
bmw 5053 12.921 8293.899 132673.659 85.197 9.175 76.022 33.742 63.606
skoda 754 9.385 6424.845 110835.544 88.859 9.701 79.158 11.538 85.942
hyundai 456 8.780 5484.410 106293.860 88.158 13.930 74.228 12.719 84.211
volkswagen 9618 13.636 5482.053 128614.057 81.004 10.602 70.402 13.329 82.086
toyota 585 11.670 5200.318 115777.778 88.547 7.915 80.632 12.137 84.274
nissan 696 12.687 4803.099 118520.115 85.057 15.372 69.686 10.920 85.201
seat 825 11.593 4479.248 121321.212 84.606 11.461 73.145 3.879 92.606
mazda 693 13.174 4178.258 124206.349 81.818 14.286 67.532 6.061 90.909
citroen 638 11.612 3841.723 119184.953 81.818 12.452 69.366 13.323 81.191
ford 3174 13.258 3774.345 124188.721 80.466 15.427 65.040 8.034 87.492
smart 645 10.611 3614.164 99263.566 83.876 8.133 75.743 73.178 11.628
peugeot 1355 12.357 3129.524 126974.170 84.207 15.688 68.519 5.461 90.849
opel 4859 13.900 3026.059 129320.848 80.078 14.418 65.660 10.249 85.120
fiat 1163 12.877 2857.802 116822.872 77.988 14.002 63.986 2.408 92.347
renault 2142 13.163 2512.908 128204.949 79.458 15.805 63.654 6.676 88.562
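The SettingWithCopyWarning above typically appears when the DataFrame being assigned to is a slice of another frame, e.g. left over from row filtering during cleaning. A common fix, sketched here on made-up data, is to take an explicit copy before adding new columns:

```python
import pandas as pd

raw = pd.DataFrame({'price': [5000, 8500, 100],
                    'brand': ['peugeot', 'bmw', 'opel']})

# boolean filtering can leave a frame flagged as a slice of `raw`;
# assigning new columns to it may trigger SettingWithCopyWarning.
# An explicit .copy() creates an independent frame, so later
# column assignments are unambiguous.
autos = raw[raw['price'] > 500].copy()
autos['price_k'] = autos['price'] / 1000
print(autos)
```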

Now we have the dataset 'cmmn_brands_df' ready for analysis. I will use it to check whether there are similar patterns and trends between the average price and the data in the other columns.

At this step of my training I will not use statistical methods, as they haven't been covered in this course yet, although I have some knowledge from university. The techniques used below are simple, but some explanation of the logic behind the analysis is required. First, I will create a function named 'compare_around_mean' to see if there is a pattern between 2 columns. Then I will use it several times to check whether there is a pattern between the 'avg_price' column and 4 other columns.

In [54]:
def compare_around_mean(a_df, x_col, y_col):
    # print x_col values in descending order to spot anomalies
    print(f"'{x_col}' column values in descending order:\n \
    {a_df[x_col].sort_values(ascending=False)}")
    
    # split rows at the mean of x_col and compare averages of y_col
    x_mean = a_df[x_col].mean()
    bottom_x_y_avg = a_df.loc[a_df[x_col] < x_mean, y_col].mean()
    top_x_y_avg = a_df.loc[a_df[x_col] >= x_mean, y_col].mean()
    
    print_2n(f"Mean of '{y_col}' for below average values in '{x_col}': {bottom_x_y_avg}")
    print_2n(f"Mean of '{y_col}' for above average values in '{x_col}': {top_x_y_avg}")

I will explain what the function does and the logic behind it using the 'avg_price' and 'manual_%' columns as an example. The 'manual_%' column represents the share of cars with a manual gearbox for each brand:

    1. The function takes 3 arguments: 'a_df' (dataset), 'x_col' (the column which we think has an impact on 'y_col'), and 'y_col' (the column which we think is affected by 'x_col')
    2. Once called, the function prints the values in 'x_col' ('manual_%' in this example) in descending order so that we can spot any anomaly in the data
    3. Then it finds the average value of the 'manual_%' column
    4. Now we can imagine the values in 'manual_%' divided into 2 groups: 'values below average' and 'values above average'
    5. Next, the function calculates the average of 'y_col' ('avg_price' in this example) for the 'values below average' and 'values above average' groups of 'manual_%'
    6. Finally, the function prints the result
    7. If there is a significant difference between the average price for 'values below average' and 'values above average' in 'manual_%', we can say there is a pattern between the 2 columns, although it does not imply correlation
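The steps above can be demonstrated on a tiny made-up table (the numbers are invented for illustration; a simplified version of the function is redefined here without the printing, so the block is self-contained):

```python
import pandas as pd

def compare_around_mean(a_df, x_col, y_col):
    """Split rows at the mean of x_col and compare averages of y_col."""
    x_mean = a_df[x_col].mean()
    bottom = a_df.loc[a_df[x_col] < x_mean, y_col].mean()
    top = a_df.loc[a_df[x_col] >= x_mean, y_col].mean()
    return bottom, top

# toy data where price drops as the manual share rises;
# mean of 'manual_%' is 55, so the groups are {10, 40} and {80, 90}
toy = pd.DataFrame({'manual_%': [10, 40, 80, 90],
                    'avg_price': [9000, 8000, 3000, 2000]})
below, above = compare_around_mean(toy, 'manual_%', 'avg_price')
print(below, above)  # 8500.0 2500.0
```

Here the large gap between the two group means (8500 vs 2500) is the kind of signal the function is designed to surface.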

In the cell below we can see that the average price for brands with a below-average share of manual cars (7478.61 USD) is much higher than for brands with an above-average share (4245.74 USD). This pattern confirms the generally known fact that a car with a manual gearbox is usually cheaper. There is only one concern with this conclusion in our case: the share (and number) of cars with an automatic gearbox is really low relative to cars with a manual gearbox. Therefore the difference between the average prices could change significantly once more cars with automatic gearboxes are added to the data.

In [55]:
compare_around_mean(cmmn_brands_df, 'manual_%', 'avg_price')
'manual_%' column values in descending order:
     brand
seat            92.606
fiat            92.347
mazda           90.909
peugeot         90.849
renault         88.562
ford            87.492
skoda           85.942
nissan          85.201
opel            85.120
toyota          84.274
hyundai         84.211
volkswagen      82.086
citroen         81.191
bmw             63.606
audi            63.538
mercedes_benz   41.633
smart           11.628
Name: manual_%, dtype: float64


 Mean of 'avg_price' for below average values in 'manual_%': 7478.6076593315465


 Mean of 'avg_price' for above average values in 'manual_%': 4245.7378633367825

Before calling 'compare_around_mean' below I added a new column to the dataset: 'damage/age'. This column represents the ratio of the share of cars with unrepaired damage to the average age of cars for each brand, so it shows roughly what share of cars of each brand accumulates unrepaired damage every year. Ideally, the higher the ratio, the worse the condition of the car should be, so the price should also be lower compared to brands with a lower ratio. And as we can see from the output of the cell below, the average price for brands with a lower ratio is higher by approximately 60 percent.

In [56]:
cmmn_brands_df['damage/age'] = cmmn_brands_df['unrep_damage_%'] / cmmn_brands_df['avg_age']
compare_around_mean(cmmn_brands_df, 'damage/age', 'avg_price')
'damage/age' column values in descending order:
     brand
hyundai         1.587
peugeot         1.270
nissan          1.212
renault         1.201
ford            1.164
fiat            1.087
mazda           1.084
citroen         1.072
opel            1.037
skoda           1.034
seat            0.989
audi            0.818
volkswagen      0.778
smart           0.767
bmw             0.710
toyota          0.678
mercedes_benz   0.638
Name: damage/age, dtype: float64


 Mean of 'avg_price' for below average values in 'damage/age': 6439.435683507786


 Mean of 'avg_price' for above average values in 'damage/age': 4003.2973076149838

Another column we have to check for a relation between price and damage is the 'healthy_car_%' column. The cell below confirms the finding of the cell above: the higher the share of healthy cars (with no unrepaired damage), the higher the average price.

In [57]:
compare_around_mean(cmmn_brands_df, 'healthy_car_%', 'avg_price')
'healthy_car_%' column values in descending order:
     brand
toyota          80.632
skoda           79.158
mercedes_benz   76.171
bmw             76.022
smart           75.743
audi            74.961
hyundai         74.228
seat            73.145
volkswagen      70.402
nissan          69.686
citroen         69.366
peugeot         68.519
mazda           67.532
opel            65.660
ford            65.040
fiat            63.986
renault         63.654
Name: healthy_car_%, dtype: float64


 Mean of 'avg_price' for below average values in 'healthy_car_%': 3733.974541611802


 Mean of 'avg_price' for above average values in 'healthy_car_%': 6437.906498274766

The last column I will check in pair with the 'avg_price' column is 'avg_km_covered'. The natural expectation is: the higher the distance covered, the lower the price. From the output of the cell it is visible that the average price for brands with an above-average covered distance is lower by ~790 USD. This confirms the natural expectation, but the difference was larger in the previous cells. Therefore the difference between the average prices here indicates a weaker relationship between covered distance and price.

In [58]:
compare_around_mean(cmmn_brands_df, 'avg_km_covered', 'avg_price')
'avg_km_covered' column values in descending order:
     brand
bmw             132673.659
mercedes_benz   131007.463
opel            129320.848
audi            129266.482
volkswagen      128614.057
renault         128204.949
peugeot         126974.170
mazda           124206.349
ford            124188.721
seat            121321.212
citroen         119184.953
nissan          118520.115
fiat            116822.872
toyota          115777.778
skoda           110835.544
hyundai         106293.860
smart            99263.566
Name: avg_km_covered, dtype: float64


 Mean of 'avg_price' for below average values in 'avg_km_covered': 4588.201204252344


 Mean of 'avg_price' for above average values in 'avg_km_covered': 5378.157025187288

Conclusion

Even before looking at any data, anybody planning to set the price of a used car has some expectation of what affects that price and how. And if you want to buy one, you probably have some price in mind depending on the condition of the car. In this project I analysed the average prices of popular brands. In the trends & patterns section I found supporting indications for 4 of our natural expectations about the prices of cars that belong to a popular brand.

In short, the average price of a car that belongs to a certain brand is higher if the cars of that brand:

  • have a below-average share of manual-gearbox cars
  • accumulate less unrepaired damage every year
  • have a high share of cars with no unrepaired damage at the point of sale
  • have covered less distance on average relative to other cars
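As a possible next step beyond the split-at-the-mean comparison, pandas can compute correlation coefficients directly. A minimal sketch on a toy brand-level table (the numbers are loosely modelled on the tables above, not the project's exact values):

```python
import pandas as pd

# toy brand-level summary mimicking a few columns of cmmn_brands_df
brands = pd.DataFrame({
    'avg_price': [9379, 8294, 5482, 3026, 2513],
    'avg_km_covered': [129266, 132674, 128614, 129321, 128205],
    'healthy_car_%': [74.9, 76.0, 70.4, 65.7, 63.7],
})

# Pearson correlation of avg_price with every other column
print(brands.corr()['avg_price'])
```

A strongly positive coefficient for 'healthy_car_%' would point in the same direction as the split-at-the-mean comparison, with the usual caveat that correlation is not causation.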

Notes

  • The method I used here is simple and does not involve rigorous statistical methods. The reason is that at this stage of my training I tried to use only concepts already covered in the course. Therefore the method cannot be enough on its own to draw definitive conclusions, but it can serve as additional evidence.

  • There are some parts of code in this project that could be deleted, for example new columns added to the original dataset that I never used later. I decided to leave those parts in place. The reason is that I am still in my training phase: I want to type as much code as possible to memorize the concepts and syntax better, and keeping it here makes it easy to find what I learned in case I need to come back and refresh my knowledge.