Project: Exploring Ebay Car Sales Data

This poject is aimed at gaining insight on used cars from eBay site The dataset was originally scraped and uploaded to Kaggle. Dataquest has made a few modifications from the original dataset that was uploaded to Kaggle.

A few notes describing the data columns can be found here

  • dateCrawled : when this ad was first crawled, all field-values are taken from this date
  • name : "name" of the car
  • seller : private or dealer
  • price : the price on the ad to sell the car
  • yearOfRegistration : at which year the car was first registered
  • powerPS : power of the car in PS
  • lastSeenOnline : when the crawler saw this ad last online
In [1]:
#importing the numpy and pandas library
import numpy as np
import pandas as pd

autos = pd.read_csv("autos.csv", encoding="Latin-1") #reading the datafile
In [2]:
#chcking for missing data and wrong data types using info()
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
In [3]:
#printing first few rows of data using .head()
autos.head() 
Out[3]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
In [4]:
#using DataFrame.columns to print out existing columns
autos.columns
Out[4]:
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')
In [5]:
#editing the column names
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']
In [6]:
#confirming changes to column names
autos.columns
Out[6]:
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
In [7]:
#using dataFrame.head() to look at the current state of autos dataframe
autos.head()
Out[7]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gear_box power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

The camelCase columns have been changed to snake_case, some column names have been edited to make more meanings from the data dictionary.

In [8]:
#looking at the descriptive statistics for all columns
autos.describe(include='all')
Out[8]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gear_box power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-04-02 15:49:30 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

There are about 50000 records in the dataset with 20 columns. 'vehicleType','model','fuelType','notRepairedDamage' all contain missing values. Columns casing are not regular. Price and Odometer contain unit symbols. dateCrawled,dateCreated dtype should be datetime and price,odometer should be int dtypes. seller, offerType, abtest, gearbox are about 2 unique values each.

The mean, std, median and percentiles for price and odometer are NaN which indicates a wrong data type. These requires more investigation

In [9]:
#renaming the price and odometer columns
autos.rename(columns={'price':'price_usd','odometer':'odometer_km'},inplace=True)
In [10]:
#confirming change in names
autos.columns
Out[10]:
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price_usd', 'abtest',
       'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
In [11]:
#removing the '$' and ',' chars in the price values
autos['price_usd'] = autos['price_usd'].str.replace('$','').str.replace(',','').astype('int64')
In [12]:
#viewing changes to price column
autos.head()
Out[12]:
date_crawled name seller offer_type price_usd abtest vehicle_type registration_year gear_box power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot 8990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot 4350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot 1350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
In [13]:
#confirming change to price dtype
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
date_crawled          50000 non-null object
name                  50000 non-null object
seller                50000 non-null object
offer_type            50000 non-null object
price_usd             50000 non-null int64
abtest                50000 non-null object
vehicle_type          44905 non-null object
registration_year     50000 non-null int64
gear_box              47320 non-null object
power_ps              50000 non-null int64
model                 47242 non-null object
odometer_km           50000 non-null object
registration_month    50000 non-null int64
fuel_type             45518 non-null object
brand                 50000 non-null object
unrepaired_damage     40171 non-null object
ad_created            50000 non-null object
nr_of_pictures        50000 non-null int64
postal_code           50000 non-null int64
last_seen             50000 non-null object
dtypes: int64(6), object(14)
memory usage: 7.6+ MB
In [14]:
#removing the km characters in odometer
autos['odometer_km'] = autos['odometer_km'].str.replace('km','').str.replace(',','')
In [15]:
#chaniging odometer_km to int type
autos['odometer_km'] = autos['odometer_km'].astype('int64')
In [16]:
autos.head()
Out[16]:
date_crawled name seller offer_type price_usd abtest vehicle_type registration_year gear_box power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000 control bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500 control limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot 8990 test limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot 4350 control kleinwagen 2007 automatik 71 fortwo 70000 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot 1350 test kombi 2003 manuell 0 focus 150000 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
In [17]:
#understanding the shape of the price column
autos['price_usd'].unique().shape
Out[17]:
(2357,)
In [18]:
#summary statistics for price column
autos['price_usd'].describe() #.sort_index(ascending=False)
Out[18]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_usd, dtype: float64
In [19]:
#Series.value_counts() returns a series, better to use Series.sort_index()
#with ascending= True or False, Second look at max outlier
autos['price_usd'].value_counts().sort_index(ascending=False).head()
Out[19]:
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
Name: price_usd, dtype: int64
In [20]:
#Second look at Min outlier
autos['price_usd'].value_counts().sort_index(ascending=False).tail()
Out[20]:
5       2
3       1
2       3
1     156
0    1421
Name: price_usd, dtype: int64
In [21]:
#removing outliers
autos = autos[autos["price_usd"].between(1,27322222)]
In [22]:
autos['price_usd'].describe()
Out[22]:
count    4.857800e+04
mean     8.069542e+03
std      1.800356e+05
min      1.000000e+00
25%      1.200000e+03
50%      3.000000e+03
75%      7.490000e+03
max      2.732222e+07
Name: price_usd, dtype: float64

After removing the outliers, it is observed that average sales is around $8,000 dollars

In [23]:
#understanding the shape of the odometer column
autos['odometer_km'].unique().shape
Out[23]:
(13,)
In [24]:
#summary statistics of the odometer column
autos['odometer_km'].describe()
Out[24]:
count     48578.000000
mean     125765.675820
std       39793.193823
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
In [25]:
#checking for high outliers for odometer
autos['odometer_km'].value_counts().sort_index(ascending=False).head()
Out[25]:
150000    31421
125000     5058
100000     2116
90000      1734
80000      1415
Name: odometer_km, dtype: int64
In [26]:
#checking for low outliers for odometer
autos['odometer_km'].value_counts().sort_index(ascending=False).tail()
Out[26]:
40000    816
30000    780
20000    762
10000    253
5000     837
Name: odometer_km, dtype: int64

The odometer column appears to not have outliers. Next stop is to look at the dates columns of 'date_crawled', 'ad_created', 'last_seen' which are in string formats

In [27]:
#understanding the date columns
autos[['date_crawled','ad_created','last_seen']][0:5]
Out[27]:
date_crawled ad_created last_seen
0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54
1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08
2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37
3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28
4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50
In [28]:
#calculating the distribution for date_crawled
#To rank by date in ascending order (earliest to latest), chain the 
#Series.sort_index() method 
autos_date_crawled = autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
In [29]:
autos_date_crawled
Out[29]:
2016-03-05    0.025320
2016-03-06    0.014039
2016-03-07    0.036004
2016-03-08    0.033328
2016-03-09    0.033101
2016-03-10    0.032175
2016-03-11    0.032566
2016-03-12    0.036930
2016-03-13    0.015666
2016-03-14    0.036539
2016-03-15    0.034275
2016-03-16    0.029602
2016-03-17    0.031640
2016-03-18    0.012907
2016-03-19    0.034769
2016-03-20    0.037877
2016-03-21    0.037404
2016-03-22    0.032998
2016-03-23    0.032216
2016-03-24    0.029334
2016-03-25    0.031599
2016-03-26    0.032196
2016-03-27    0.031084
2016-03-28    0.034851
2016-03-29    0.034131
2016-03-30    0.033678
2016-03-31    0.031846
2016-04-01    0.033678
2016-04-02    0.035469
2016-04-03    0.038598
2016-04-04    0.036519
2016-04-05    0.013092
2016-04-06    0.003170
2016-04-07    0.001400
Name: date_crawled, dtype: float64
In [30]:
#calculating the distribution of 'ad_created' values including missing values
#using Series.value_counts(normalize=True, dropna=False)
autos_ad_created = autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False)
In [51]:
#viewing the values
autos_ad_created.sort_index().head(10)
Out[51]:
2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
Name: ad_created, dtype: float64
In [52]:
    #viewing the values
autos_ad_created.sort_index().tail(10)
Out[52]:
2016-03-29    0.034069
2016-03-30    0.033493
2016-03-31    0.031887
2016-04-01    0.033678
2016-04-02    0.035139
2016-04-03    0.038845
2016-04-04    0.036889
2016-04-05    0.011816
2016-04-06    0.003253
2016-04-07    0.001256
Name: ad_created, dtype: float64
In [32]:
#calculating the distribution of 'last_seen' values including missing values
#using Series.value_counts(normalize=True, dropna=False)
autos_last_seen = autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False)
In [49]:
#viewing 'last_seen'
autos_last_seen.sort_index().head(10)
Out[49]:
2016-03-05    0.001070
2016-03-06    0.004323
2016-03-07    0.005393
2016-03-08    0.007431
2016-03-09    0.009613
2016-03-10    0.010663
2016-03-11    0.012372
2016-03-12    0.023797
2016-03-13    0.008893
2016-03-14    0.012598
Name: last_seen, dtype: float64
In [34]:
#understanding the distribution of registration year
autos['registration_year'].describe()
Out[34]:
count    48578.000000
mean      2004.753119
std         88.632571
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The earliest year is 1000 and the latest year is 9999. Weird dates with regards to when cars were produced and the future projection.In 1903, Henry Ford invented a modernized car and the year appears to most suitable while 2016 will be taken as the highest year since the last seen ads were in 2016.

In [35]:
#removing upper and lower bounds in the registration_year
autos = autos[autos["registration_year"].between(1902,2017)]
In [48]:
autos['registration_year'].value_counts(normalize=True).sort_index().head(10)
Out[48]:
1910    0.000104
1927    0.000021
1929    0.000021
1931    0.000021
1934    0.000042
1937    0.000083
1938    0.000021
1939    0.000021
1941    0.000042
1943    0.000021
Name: registration_year, dtype: float64

There appears to be more vehicle listings between 1990s and 2000s. A closer look at the brand of cars would be neccessary

In [37]:
#exploring the unique brand values
autos['brand'].unique()
Out[37]:
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler',
       'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos',
       'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan',
       'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda',
       'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda',
       'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada',
       'rover', 'daihatsu', 'lancia'], dtype=object)
In [38]:
#confirming brand index
autos.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48085 entries, 0 to 49999
Data columns (total 20 columns):
date_crawled          48085 non-null object
name                  48085 non-null object
seller                48085 non-null object
offer_type            48085 non-null object
price_usd             48085 non-null int64
abtest                48085 non-null object
vehicle_type          43988 non-null object
registration_year     48085 non-null int64
gear_box              45807 non-null object
power_ps              48085 non-null int64
model                 45696 non-null object
odometer_km           48085 non-null int64
registration_month    48085 non-null int64
fuel_type             44263 non-null object
brand                 48085 non-null object
unrepaired_damage     39184 non-null object
ad_created            48085 non-null object
nr_of_pictures        48085 non-null int64
postal_code           48085 non-null int64
last_seen             48085 non-null object
dtypes: int64(7), object(13)
memory usage: 7.7+ MB
In [39]:
#selecting the top 20 brands
top_5_brands = autos['brand'].value_counts().sort_values(ascending=False).head(5).index
top_5_brands
Out[39]:
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi'], dtype='object')
In [40]:
#aggregating prices for top 5 brands
top_brands_m_prices = {} #empty dictionary for top 5 brand mean prices

for brand in top_5_brands: #looping over top brands
    selected_brand = autos[autos['brand'] == brand] #selecting top brands autos dataset
    mean_price = selected_brand['price_usd'].mean()#calculating mean price for top brand
    top_brands_m_prices[brand] = int(mean_price) #assigning to dictionary
print(top_brands_m_prices) #printing out results 
{'bmw': 8518, 'opel': 5321, 'volkswagen': 6632, 'audi': 9239, 'mercedes_benz': 8528}
In [41]:
sorted(top_brands_m_prices.items(), key=lambda x: x[1])
Out[41]:
[('opel', 5321),
 ('volkswagen', 6632),
 ('bmw', 8518),
 ('mercedes_benz', 8528),
 ('audi', 9239)]

From the results of our aggregation, audi, mercedesbenz and bmw are most expensive brands, followed by volkswagen and opel is least amongst top 5.

In [42]:
#aggregating mileage for top 5 brands
top_brands_m_mileage = {} #empty dictionary for top 5 brand mean mileage

for brand in top_5_brands: #looping over top brands
    selected_brand = autos[autos['brand'] == brand] #selecting top brands autos dataset
    mean_mileage = selected_brand['odometer_km'].mean()#calculating mean mileage for top brand
    top_brands_m_mileage[brand] = int(mean_mileage) #assigning to dictionary
print(top_brands_m_mileage) #printing out results 
{'bmw': 132670, 'opel': 129419, 'volkswagen': 128934, 'audi': 129406, 'mercedes_benz': 130962}
In [43]:
#using pandas series constructor pd.Series to construct a series 
#for top brand mean prices
top_bmp_series = pd.Series(top_brands_m_prices)
print(top_bmp_series)
audi             9239
bmw              8518
mercedes_benz    8528
opel             5321
volkswagen       6632
dtype: int64
In [44]:
# using the pandas DataFrame constructor to construct a dataframe
#for mean prices
df_brands = pd.DataFrame(top_bmp_series, columns=['mean_price'])
df_brands
Out[44]:
mean_price
audi 9239
bmw 8518
mercedes_benz 8528
opel 5321
volkswagen 6632
In [45]:
#brands mean mileage
top_bmm_series = pd.Series(top_brands_m_mileage)
print(top_bmm_series)
audi             129406
bmw              132670
mercedes_benz    130962
opel             129419
volkswagen       128934
dtype: int64
In [46]:
#adding brands mean mileage to the dataframe
df_brands.insert(1, 'top_brands_m_mileage', top_bmm_series)
In [47]:
#viewing new dataframe
df_brands
Out[47]:
mean_price top_brands_m_mileage
audi 9239 129406
bmw 8518 132670
mercedes_benz 8528 130962
opel 5321 129419
volkswagen 6632 128934

Conclusion: The cost of top brand vehicles do not neccessarily determing the cars mileage.