Project: Exploring Ebay Car Sales Data¶

This poject is aimed at gaining insight on used cars from eBay site The dataset was originally scraped and uploaded to Kaggle. Dataquest has made a few modifications from the original dataset that was uploaded to Kaggle.

A few notes describing the data columns can be found here

dateCrawled : when this ad was first crawled, all field-values are taken from this date
name : "name" of the car
seller : private or dealer
price : the price on the ad to sell the car
yearOfRegistration : at which year the car was first registered
powerPS : power of the car in PS
lastSeenOnline : when the crawler saw this ad last online

In [1]:

#importing the numpy and pandas library
import numpy as np
import pandas as pd

autos = pd.read_csv("autos.csv", encoding="Latin-1") #reading the datafile

In [2]:

#chcking for missing data and wrong data types using info()
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

In [3]:

#printing first few rows of data using .head()
autos.head() 

Out[3]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

In [4]:

#using DataFrame.columns to print out existing columns
autos.columns

Out[4]:

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [5]:

#editing the column names
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']

In [6]:

#confirming changes to column names
autos.columns

Out[6]:

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

In [7]:

#using dataFrame.head() to look at the current state of autos dataframe
autos.head()

Out[7]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gear_box	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

The camelCase columns have been changed to snake_case, some column names have been edited to make more meanings from the data dictionary.

In [8]:

#looking at the descriptive statistics for all columns
autos.describe(include='all')

Out[8]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gear_box	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	nr_of_pictures	postal_code	last_seen
count	50000	50000	50000	50000	50000	50000	44905	50000.000000	47320	50000.000000	47242	50000	50000.000000	45518	50000	40171	50000	50000.0	50000.000000	50000
unique	48213	38754	2	2	2357	2	8	NaN	2	NaN	245	13	NaN	7	40	2	76	NaN	NaN	39481
top	2016-04-02 15:49:30	Ford_Fiesta	privat	Angebot	$0	test	limousine	NaN	manuell	NaN	golf	150,000km	NaN	benzin	volkswagen	nein	2016-04-03 00:00:00	NaN	NaN	2016-04-07 06:17:27
freq	3	78	49999	49999	1421	25756	12859	NaN	36993	NaN	4024	32424	NaN	30107	10687	35232	1946	NaN	NaN	8
mean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2005.073280	NaN	116.355920	NaN	NaN	5.723360	NaN	NaN	NaN	NaN	0.0	50813.627300	NaN
std	NaN	NaN	NaN	NaN	NaN	NaN	NaN	105.712813	NaN	209.216627	NaN	NaN	3.711984	NaN	NaN	NaN	NaN	0.0	25779.747957	NaN
min	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1000.000000	NaN	0.000000	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	0.0	1067.000000	NaN
25%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1999.000000	NaN	70.000000	NaN	NaN	3.000000	NaN	NaN	NaN	NaN	0.0	30451.000000	NaN
50%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2003.000000	NaN	105.000000	NaN	NaN	6.000000	NaN	NaN	NaN	NaN	0.0	49577.000000	NaN
75%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2008.000000	NaN	150.000000	NaN	NaN	9.000000	NaN	NaN	NaN	NaN	0.0	71540.000000	NaN
max	NaN	NaN	NaN	NaN	NaN	NaN	NaN	9999.000000	NaN	17700.000000	NaN	NaN	12.000000	NaN	NaN	NaN	NaN	0.0	99998.000000	NaN

There are about 50000 records in the dataset with 20 columns. 'vehicleType','model','fuelType','notRepairedDamage' all contain missing values. Columns casing are not regular. Price and Odometer contain unit symbols. dateCrawled,dateCreated dtype should be datetime and price,odometer should be int dtypes. seller, offerType, abtest, gearbox are about 2 unique values each.

The mean, std, median and percentiles for price and odometer are NaN which indicates a wrong data type. These requires more investigation

In [9]:

#renaming the price and odometer columns
autos.rename(columns={'price':'price_usd','odometer':'odometer_km'},inplace=True)

In [10]:

#confirming change in names
autos.columns

Out[10]:

Index(['date_crawled', 'name', 'seller', 'offer_type', 'price_usd', 'abtest',
       'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

In [11]:

#removing the '$' and ',' chars in the price values
autos['price_usd'] = autos['price_usd'].str.replace('$','').str.replace(',','').astype('int64')

In [12]:

#viewing changes to price column
autos.head()

Out[12]:

	date_crawled	name	seller	offer_type	price_usd	abtest	vehicle_type	registration_year	gear_box	power_ps	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	5000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	8500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	8990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	4350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	1350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

In [13]:

#confirming change to price dtype
autos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
date_crawled          50000 non-null object
name                  50000 non-null object
seller                50000 non-null object
offer_type            50000 non-null object
price_usd             50000 non-null int64
abtest                50000 non-null object
vehicle_type          44905 non-null object
registration_year     50000 non-null int64
gear_box              47320 non-null object
power_ps              50000 non-null int64
model                 47242 non-null object
odometer_km           50000 non-null object
registration_month    50000 non-null int64
fuel_type             45518 non-null object
brand                 50000 non-null object
unrepaired_damage     40171 non-null object
ad_created            50000 non-null object
nr_of_pictures        50000 non-null int64
postal_code           50000 non-null int64
last_seen             50000 non-null object
dtypes: int64(6), object(14)
memory usage: 7.6+ MB

In [14]:

#removing the km characters in odometer
autos['odometer_km'] = autos['odometer_km'].str.replace('km','').str.replace(',','')

In [15]:

#chaniging odometer_km to int type
autos['odometer_km'] = autos['odometer_km'].astype('int64')

In [16]:

autos.head()

Out[16]:

	date_crawled	name	seller	offer_type	price_usd	abtest	vehicle_type	registration_year	gear_box	power_ps	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	5000	control	bus	2004	manuell	158	andere	150000	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	8500	control	limousine	1997	automatik	286	7er	150000	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	8990	test	limousine	2009	manuell	102	golf	70000	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	4350	control	kleinwagen	2007	automatik	71	fortwo	70000	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	1350	test	kombi	2003	manuell	0	focus	150000	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

In [17]:

#understanding the shape of the price column
autos['price_usd'].unique().shape

Out[17]:

(2357,)

In [18]:

#summary statistics for price column
autos['price_usd'].describe() #.sort_index(ascending=False)

Out[18]:

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_usd, dtype: float64

In [19]:

#Series.value_counts() returns a series, better to use Series.sort_index()
#with ascending= True or False, Second look at max outlier
autos['price_usd'].value_counts().sort_index(ascending=False).head()

Out[19]:

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
Name: price_usd, dtype: int64

In [20]:

#Second look at Min outlier
autos['price_usd'].value_counts().sort_index(ascending=False).tail()

Out[20]:

5       2
3       1
2       3
1     156
0    1421
Name: price_usd, dtype: int64

In [21]:

#removing outliers
autos = autos[autos["price_usd"].between(1,27322222)]

In [22]:

autos['price_usd'].describe()

Out[22]:

count    4.857800e+04
mean     8.069542e+03
std      1.800356e+05
min      1.000000e+00
25%      1.200000e+03
50%      3.000000e+03
75%      7.490000e+03
max      2.732222e+07
Name: price_usd, dtype: float64

After removing the outliers, it is observed that average sales is around $8,000 dollars

In [23]:

#understanding the shape of the odometer column
autos['odometer_km'].unique().shape

Out[23]:

(13,)

In [24]:

#summary statistics of the odometer column
autos['odometer_km'].describe()

Out[24]:

count     48578.000000
mean     125765.675820
std       39793.193823
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [25]:

#checking for high outliers for odometer
autos['odometer_km'].value_counts().sort_index(ascending=False).head()

Out[25]:

150000    31421
125000     5058
100000     2116
90000      1734
80000      1415
Name: odometer_km, dtype: int64

In [26]:

#checking for low outliers for odometer
autos['odometer_km'].value_counts().sort_index(ascending=False).tail()

Out[26]:

40000    816
30000    780
20000    762
10000    253
5000     837
Name: odometer_km, dtype: int64

The odometer column appears to not have outliers. Next stop is to look at the dates columns of 'date_crawled', 'ad_created', 'last_seen' which are in string formats

In [27]:

#understanding the date columns
autos[['date_crawled','ad_created','last_seen']][0:5]

Out[27]:

	date_crawled	ad_created	last_seen
0	2016-03-26 17:47:46	2016-03-26 00:00:00	2016-04-06 06:45:54
1	2016-04-04 13:38:56	2016-04-04 00:00:00	2016-04-06 14:45:08
2	2016-03-26 18:57:24	2016-03-26 00:00:00	2016-04-06 20:15:37
3	2016-03-12 16:58:10	2016-03-12 00:00:00	2016-03-15 03:16:28
4	2016-04-01 14:38:50	2016-04-01 00:00:00	2016-04-01 14:38:50

In [28]:

#calculating the distribution for date_crawled
#To rank by date in ascending order (earliest to latest), chain the 
#Series.sort_index() method 
autos_date_crawled = autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()

In [29]:

autos_date_crawled

Out[29]:

2016-03-05    0.025320
2016-03-06    0.014039
2016-03-07    0.036004
2016-03-08    0.033328
2016-03-09    0.033101
2016-03-10    0.032175
2016-03-11    0.032566
2016-03-12    0.036930
2016-03-13    0.015666
2016-03-14    0.036539
2016-03-15    0.034275
2016-03-16    0.029602
2016-03-17    0.031640
2016-03-18    0.012907
2016-03-19    0.034769
2016-03-20    0.037877
2016-03-21    0.037404
2016-03-22    0.032998
2016-03-23    0.032216
2016-03-24    0.029334
2016-03-25    0.031599
2016-03-26    0.032196
2016-03-27    0.031084
2016-03-28    0.034851
2016-03-29    0.034131
2016-03-30    0.033678
2016-03-31    0.031846
2016-04-01    0.033678
2016-04-02    0.035469
2016-04-03    0.038598
2016-04-04    0.036519
2016-04-05    0.013092
2016-04-06    0.003170
2016-04-07    0.001400
Name: date_crawled, dtype: float64

In [30]:

#calculating the distribution of 'ad_created' values including missing values
#using Series.value_counts(normalize=True, dropna=False)
autos_ad_created = autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False)

In [51]:

#viewing the values
autos_ad_created.sort_index().head(10)

Out[51]:

2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
Name: ad_created, dtype: float64

In [52]:

    #viewing the values
autos_ad_created.sort_index().tail(10)

Out[52]:

2016-03-29    0.034069
2016-03-30    0.033493
2016-03-31    0.031887
2016-04-01    0.033678
2016-04-02    0.035139
2016-04-03    0.038845
2016-04-04    0.036889
2016-04-05    0.011816
2016-04-06    0.003253
2016-04-07    0.001256
Name: ad_created, dtype: float64

In [32]:

#calculating the distribution of 'last_seen' values including missing values
#using Series.value_counts(normalize=True, dropna=False)
autos_last_seen = autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False)

In [49]:

#viewing 'last_seen'
autos_last_seen.sort_index().head(10)

Out[49]:

2016-03-05    0.001070
2016-03-06    0.004323
2016-03-07    0.005393
2016-03-08    0.007431
2016-03-09    0.009613
2016-03-10    0.010663
2016-03-11    0.012372
2016-03-12    0.023797
2016-03-13    0.008893
2016-03-14    0.012598
Name: last_seen, dtype: float64

In [34]:

#understanding the distribution of registration year
autos['registration_year'].describe()

Out[34]:

count    48578.000000
mean      2004.753119
std         88.632571
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The earliest year is 1000 and the latest year is 9999. Weird dates with regards to when cars were produced and the future projection.In 1903, Henry Ford invented a modernized car and the year appears to most suitable while 2016 will be taken as the highest year since the last seen ads were in 2016.

In [35]:

#removing upper and lower bounds in the registration_year
autos = autos[autos["registration_year"].between(1902,2017)]

In [48]:

autos['registration_year'].value_counts(normalize=True).sort_index().head(10)

Out[48]:

1910    0.000104
1927    0.000021
1929    0.000021
1931    0.000021
1934    0.000042
1937    0.000083
1938    0.000021
1939    0.000021
1941    0.000042
1943    0.000021
Name: registration_year, dtype: float64

There appears to be more vehicle listings between 1990s and 2000s. A closer look at the brand of cars would be neccessary

In [37]:

#exploring the unique brand values
autos['brand'].unique()

Out[37]:

array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler',
       'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos',
       'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan',
       'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda',
       'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda',
       'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada',
       'rover', 'daihatsu', 'lancia'], dtype=object)

In [38]:

#confirming brand index
autos.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48085 entries, 0 to 49999
Data columns (total 20 columns):
date_crawled          48085 non-null object
name                  48085 non-null object
seller                48085 non-null object
offer_type            48085 non-null object
price_usd             48085 non-null int64
abtest                48085 non-null object
vehicle_type          43988 non-null object
registration_year     48085 non-null int64
gear_box              45807 non-null object
power_ps              48085 non-null int64
model                 45696 non-null object
odometer_km           48085 non-null int64
registration_month    48085 non-null int64
fuel_type             44263 non-null object
brand                 48085 non-null object
unrepaired_damage     39184 non-null object
ad_created            48085 non-null object
nr_of_pictures        48085 non-null int64
postal_code           48085 non-null int64
last_seen             48085 non-null object
dtypes: int64(7), object(13)
memory usage: 7.7+ MB

In [39]:

#selecting the top 20 brands
top_5_brands = autos['brand'].value_counts().sort_values(ascending=False).head(5).index
top_5_brands

Out[39]:

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi'], dtype='object')

In [40]:

#aggregating prices for top 5 brands
top_brands_m_prices = {} #empty dictionary for top 5 brand mean prices

for brand in top_5_brands: #looping over top brands
    selected_brand = autos[autos['brand'] == brand] #selecting top brands autos dataset
    mean_price = selected_brand['price_usd'].mean()#calculating mean price for top brand
    top_brands_m_prices[brand] = int(mean_price) #assigning to dictionary
print(top_brands_m_prices) #printing out results 

{'bmw': 8518, 'opel': 5321, 'volkswagen': 6632, 'audi': 9239, 'mercedes_benz': 8528}

In [41]:

sorted(top_brands_m_prices.items(), key=lambda x: x[1])

Out[41]:

[('opel', 5321),
 ('volkswagen', 6632),
 ('bmw', 8518),
 ('mercedes_benz', 8528),
 ('audi', 9239)]

From the results of our aggregation, audi, mercedesbenz and bmw are most expensive brands, followed by volkswagen and opel is least amongst top 5.

In [42]:

#aggregating mileage for top 5 brands
top_brands_m_mileage = {} #empty dictionary for top 5 brand mean mileage

for brand in top_5_brands: #looping over top brands
    selected_brand = autos[autos['brand'] == brand] #selecting top brands autos dataset
    mean_mileage = selected_brand['odometer_km'].mean()#calculating mean mileage for top brand
    top_brands_m_mileage[brand] = int(mean_mileage) #assigning to dictionary
print(top_brands_m_mileage) #printing out results 

{'bmw': 132670, 'opel': 129419, 'volkswagen': 128934, 'audi': 129406, 'mercedes_benz': 130962}

In [43]:

#using pandas series constructor pd.Series to construct a series 
#for top brand mean prices
top_bmp_series = pd.Series(top_brands_m_prices)
print(top_bmp_series)

audi             9239
bmw              8518
mercedes_benz    8528
opel             5321
volkswagen       6632
dtype: int64

In [44]:

# using the pandas DataFrame constructor to construct a dataframe
#for mean prices
df_brands = pd.DataFrame(top_bmp_series, columns=['mean_price'])
df_brands

Out[44]:

	mean_price
audi	9239
bmw	8518
mercedes_benz	8528
opel	5321
volkswagen	6632

In [45]:

#brands mean mileage
top_bmm_series = pd.Series(top_brands_m_mileage)
print(top_bmm_series)

audi             129406
bmw              132670
mercedes_benz    130962
opel             129419
volkswagen       128934
dtype: int64

In [46]:

#adding brands mean mileage to the dataframe
df_brands.insert(1, 'top_brands_m_mileage', top_bmm_series)

In [47]:

#viewing new dataframe
df_brands

Out[47]:

	mean_price	top_brands_m_mileage
audi	9239	129406
bmw	8518	132670
mercedes_benz	8528	130962
opel	5321	129419
volkswagen	6632	128934

Conclusion: The cost of top brand vehicles do not neccessarily determing the cars mileage.

Resources¶

Looping over series

Sorting the dictionary