This poject is aimed at gaining insight on used cars from eBay site The dataset was originally scraped and uploaded to Kaggle. Dataquest has made a few modifications from the original dataset that was uploaded to Kaggle.
A few notes describing the data columns can be found here
#importing the numpy and pandas library
import numpy as np
import pandas as pd
autos = pd.read_csv("autos.csv", encoding="Latin-1") #reading the datafile
#chcking for missing data and wrong data types using info()
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
#printing first few rows of data using .head()
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
#using DataFrame.columns to print out existing columns
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
#editing the column names
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
'last_seen']
#confirming changes to column names
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
#using dataFrame.head() to look at the current state of autos dataframe
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gear_box | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
The camelCase columns have been changed to snake_case, some column names have been edited to make more meanings from the data dictionary.
#looking at the descriptive statistics for all columns
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gear_box | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-04-02 15:49:30 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
There are about 50000 records in the dataset with 20 columns. 'vehicleType','model','fuelType','notRepairedDamage' all contain missing values. Columns casing are not regular. Price and Odometer contain unit symbols. dateCrawled,dateCreated dtype should be datetime and price,odometer should be int dtypes. seller, offerType, abtest, gearbox are about 2 unique values each.
The mean, std, median and percentiles for price and odometer are NaN which indicates a wrong data type. These requires more investigation
#renaming the price and odometer columns
autos.rename(columns={'price':'price_usd','odometer':'odometer_km'},inplace=True)
#confirming change in names
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price_usd', 'abtest', 'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
#removing the '$' and ',' chars in the price values
autos['price_usd'] = autos['price_usd'].str.replace('$','').str.replace(',','').astype('int64')
#viewing changes to price column
autos.head()
date_crawled | name | seller | offer_type | price_usd | abtest | vehicle_type | registration_year | gear_box | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
#confirming change to price dtype
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): date_crawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offer_type 50000 non-null object price_usd 50000 non-null int64 abtest 50000 non-null object vehicle_type 44905 non-null object registration_year 50000 non-null int64 gear_box 47320 non-null object power_ps 50000 non-null int64 model 47242 non-null object odometer_km 50000 non-null object registration_month 50000 non-null int64 fuel_type 45518 non-null object brand 50000 non-null object unrepaired_damage 40171 non-null object ad_created 50000 non-null object nr_of_pictures 50000 non-null int64 postal_code 50000 non-null int64 last_seen 50000 non-null object dtypes: int64(6), object(14) memory usage: 7.6+ MB
#removing the km characters in odometer
autos['odometer_km'] = autos['odometer_km'].str.replace('km','').str.replace(',','')
#chaniging odometer_km to int type
autos['odometer_km'] = autos['odometer_km'].astype('int64')
autos.head()
date_crawled | name | seller | offer_type | price_usd | abtest | vehicle_type | registration_year | gear_box | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
#understanding the shape of the price column
autos['price_usd'].unique().shape
(2357,)
#summary statistics for price column
autos['price_usd'].describe() #.sort_index(ascending=False)
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price_usd, dtype: float64
#Series.value_counts() returns a series, better to use Series.sort_index()
#with ascending= True or False, Second look at max outlier
autos['price_usd'].value_counts().sort_index(ascending=False).head()
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 Name: price_usd, dtype: int64
#Second look at Min outlier
autos['price_usd'].value_counts().sort_index(ascending=False).tail()
5 2 3 1 2 3 1 156 0 1421 Name: price_usd, dtype: int64
#removing outliers
autos = autos[autos["price_usd"].between(1,27322222)]
autos['price_usd'].describe()
count 4.857800e+04 mean 8.069542e+03 std 1.800356e+05 min 1.000000e+00 25% 1.200000e+03 50% 3.000000e+03 75% 7.490000e+03 max 2.732222e+07 Name: price_usd, dtype: float64
After removing the outliers, it is observed that average sales is around $8,000 dollars
#understanding the shape of the odometer column
autos['odometer_km'].unique().shape
(13,)
#summary statistics of the odometer column
autos['odometer_km'].describe()
count 48578.000000 mean 125765.675820 std 39793.193823 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
#checking for high outliers for odometer
autos['odometer_km'].value_counts().sort_index(ascending=False).head()
150000 31421 125000 5058 100000 2116 90000 1734 80000 1415 Name: odometer_km, dtype: int64
#checking for low outliers for odometer
autos['odometer_km'].value_counts().sort_index(ascending=False).tail()
40000 816 30000 780 20000 762 10000 253 5000 837 Name: odometer_km, dtype: int64
The odometer column appears to not have outliers. Next stop is to look at the dates columns of 'date_crawled', 'ad_created', 'last_seen' which are in string formats
#understanding the date columns
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
#calculating the distribution for date_crawled
#To rank by date in ascending order (earliest to latest), chain the
#Series.sort_index() method
autos_date_crawled = autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
autos_date_crawled
2016-03-05 0.025320 2016-03-06 0.014039 2016-03-07 0.036004 2016-03-08 0.033328 2016-03-09 0.033101 2016-03-10 0.032175 2016-03-11 0.032566 2016-03-12 0.036930 2016-03-13 0.015666 2016-03-14 0.036539 2016-03-15 0.034275 2016-03-16 0.029602 2016-03-17 0.031640 2016-03-18 0.012907 2016-03-19 0.034769 2016-03-20 0.037877 2016-03-21 0.037404 2016-03-22 0.032998 2016-03-23 0.032216 2016-03-24 0.029334 2016-03-25 0.031599 2016-03-26 0.032196 2016-03-27 0.031084 2016-03-28 0.034851 2016-03-29 0.034131 2016-03-30 0.033678 2016-03-31 0.031846 2016-04-01 0.033678 2016-04-02 0.035469 2016-04-03 0.038598 2016-04-04 0.036519 2016-04-05 0.013092 2016-04-06 0.003170 2016-04-07 0.001400 Name: date_crawled, dtype: float64
#calculating the distribution of 'ad_created' values including missing values
#using Series.value_counts(normalize=True, dropna=False)
autos_ad_created = autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False)
#viewing the values
autos_ad_created.sort_index().head(10)
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 2016-01-03 0.000021 2016-01-07 0.000021 2016-01-10 0.000041 2016-01-13 0.000021 Name: ad_created, dtype: float64
#viewing the values
autos_ad_created.sort_index().tail(10)
2016-03-29 0.034069 2016-03-30 0.033493 2016-03-31 0.031887 2016-04-01 0.033678 2016-04-02 0.035139 2016-04-03 0.038845 2016-04-04 0.036889 2016-04-05 0.011816 2016-04-06 0.003253 2016-04-07 0.001256 Name: ad_created, dtype: float64
#calculating the distribution of 'last_seen' values including missing values
#using Series.value_counts(normalize=True, dropna=False)
autos_last_seen = autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False)
#viewing 'last_seen'
autos_last_seen.sort_index().head(10)
2016-03-05 0.001070 2016-03-06 0.004323 2016-03-07 0.005393 2016-03-08 0.007431 2016-03-09 0.009613 2016-03-10 0.010663 2016-03-11 0.012372 2016-03-12 0.023797 2016-03-13 0.008893 2016-03-14 0.012598 Name: last_seen, dtype: float64
#understanding the distribution of registration year
autos['registration_year'].describe()
count 48578.000000 mean 2004.753119 std 88.632571 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The earliest year is 1000 and the latest year is 9999. Weird dates with regards to when cars were produced and the future projection.In 1903, Henry Ford invented a modernized car and the year appears to most suitable while 2016 will be taken as the highest year since the last seen ads were in 2016.
#removing upper and lower bounds in the registration_year
autos = autos[autos["registration_year"].between(1902,2017)]
autos['registration_year'].value_counts(normalize=True).sort_index().head(10)
1910 0.000104 1927 0.000021 1929 0.000021 1931 0.000021 1934 0.000042 1937 0.000083 1938 0.000021 1939 0.000021 1941 0.000042 1943 0.000021 Name: registration_year, dtype: float64
There appears to be more vehicle listings between 1990s and 2000s. A closer look at the brand of cars would be neccessary
#exploring the unique brand values
autos['brand'].unique()
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos', 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia'], dtype=object)
#confirming brand index
autos.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 48085 entries, 0 to 49999 Data columns (total 20 columns): date_crawled 48085 non-null object name 48085 non-null object seller 48085 non-null object offer_type 48085 non-null object price_usd 48085 non-null int64 abtest 48085 non-null object vehicle_type 43988 non-null object registration_year 48085 non-null int64 gear_box 45807 non-null object power_ps 48085 non-null int64 model 45696 non-null object odometer_km 48085 non-null int64 registration_month 48085 non-null int64 fuel_type 44263 non-null object brand 48085 non-null object unrepaired_damage 39184 non-null object ad_created 48085 non-null object nr_of_pictures 48085 non-null int64 postal_code 48085 non-null int64 last_seen 48085 non-null object dtypes: int64(7), object(13) memory usage: 7.7+ MB
#selecting the top 20 brands
top_5_brands = autos['brand'].value_counts().sort_values(ascending=False).head(5).index
top_5_brands
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi'], dtype='object')
#aggregating prices for top 5 brands
top_brands_m_prices = {} #empty dictionary for top 5 brand mean prices
for brand in top_5_brands: #looping over top brands
selected_brand = autos[autos['brand'] == brand] #selecting top brands autos dataset
mean_price = selected_brand['price_usd'].mean()#calculating mean price for top brand
top_brands_m_prices[brand] = int(mean_price) #assigning to dictionary
print(top_brands_m_prices) #printing out results
{'bmw': 8518, 'opel': 5321, 'volkswagen': 6632, 'audi': 9239, 'mercedes_benz': 8528}
sorted(top_brands_m_prices.items(), key=lambda x: x[1])
[('opel', 5321), ('volkswagen', 6632), ('bmw', 8518), ('mercedes_benz', 8528), ('audi', 9239)]
From the results of our aggregation, audi, mercedesbenz and bmw are most expensive brands, followed by volkswagen and opel is least amongst top 5.
#aggregating mileage for top 5 brands
top_brands_m_mileage = {} #empty dictionary for top 5 brand mean mileage
for brand in top_5_brands: #looping over top brands
selected_brand = autos[autos['brand'] == brand] #selecting top brands autos dataset
mean_mileage = selected_brand['odometer_km'].mean()#calculating mean mileage for top brand
top_brands_m_mileage[brand] = int(mean_mileage) #assigning to dictionary
print(top_brands_m_mileage) #printing out results
{'bmw': 132670, 'opel': 129419, 'volkswagen': 128934, 'audi': 129406, 'mercedes_benz': 130962}
#using pandas series constructor pd.Series to construct a series
#for top brand mean prices
top_bmp_series = pd.Series(top_brands_m_prices)
print(top_bmp_series)
audi 9239 bmw 8518 mercedes_benz 8528 opel 5321 volkswagen 6632 dtype: int64
# using the pandas DataFrame constructor to construct a dataframe
#for mean prices
df_brands = pd.DataFrame(top_bmp_series, columns=['mean_price'])
df_brands
mean_price | |
---|---|
audi | 9239 |
bmw | 8518 |
mercedes_benz | 8528 |
opel | 5321 |
volkswagen | 6632 |
#brands mean mileage
top_bmm_series = pd.Series(top_brands_m_mileage)
print(top_bmm_series)
audi 129406 bmw 132670 mercedes_benz 130962 opel 129419 volkswagen 128934 dtype: int64
#adding brands mean mileage to the dataframe
df_brands.insert(1, 'top_brands_m_mileage', top_bmm_series)
#viewing new dataframe
df_brands
mean_price | top_brands_m_mileage | |
---|---|---|
audi | 9239 | 129406 |
bmw | 8518 | 132670 |
mercedes_benz | 8528 | 130962 |
opel | 5321 | 129419 |
volkswagen | 6632 | 128934 |
Conclusion: The cost of top brand vehicles do not neccessarily determing the cars mileage.