Finding the best cars for germans - an analysis of used cars on ebay

The aim of this project is to practice data cleaning and analysis using the included used car listings dataset. For our practise we will be attempting to find out the best cars to purchase from the german ebay website.

The dataset I'll be working with is a dataset of used cars from eBay Kleinanzeigen, a classified section of the German eBay website. The dataset was originally scraped and uploaded to Kaggle but I'll be making use of a modified version that's been provided to me by dataquest.

The original dataset can be found along with information on this dataset by following the link.

Summary of Conclusion

After our analysis, we discovered that the Volkswagen was the most popular brand, the sonstige_autos was the most expensive on average and the volvo gave the most mileage. We also noticed a few brands, like the BMW and Mercedez Benz, which were expensive yet popular among buyers. We deduced that the reason for this was that they provided value which was worth their high price.

We concluded that the best brand to buy would be between a Volkwagen or volvo. If we chose to go for a more classy brand, a BMW or a Mercedez Benz would be better options.

Preprocessing

Before exploring the dataset, there are some things we need to do first such as handle missing values, convert categorical data to numerical data and so on. All this is to ensure we get as accurate an insight as possible from our data. Let's take a look at our data

In [1]:
import pandas as pd
import numpy as np

autos = pd.read_csv("autos.csv", encoding = "Latin-1")
autos.head()
Out[1]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
In [2]:
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

From the above code cell, We see that we have 20 columns in our dataset. Most of the columns have 50000 entries with a quater of them (five of them to be exact) having missing values of about 2000 to almost 9000 of them. These missing values will need to be cleaned out. We'll also need to change the column names to something that's a bit easier for us to work with. Let's begin shall we.

In [3]:
autos.columns
Out[3]:
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')
In [4]:
changes={'yearOfRegistration': 'registration_year',
        'monthOfRegistration': 'registration_month',
        'notRepairedDamage': 'unrepaired_damage',
         'dateCreated': 'ad_created', 'offerType': 'offer_type',
         'vehicleType': 'vehicle_type', 'powerPS': 'power_ps',
        'fuelType': 'fuel_type', 'nrOfPictures': 'no_of_pics',
        'postalCode': 'postal_code', 'lastSeen': 'last_seen', 
         'dateCrawled': 'date_crawled'}

autos.rename(changes, axis = 1, inplace=True)
autos.columns
Out[4]:
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'no_of_pics', 'postal_code',
       'last_seen'],
      dtype='object')
In [5]:
autos.head()
Out[5]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created no_of_pics postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

Great! Now our column names look better. All letters are in lower case and al words have been separated using an underscore _. This type of casing is known as snake case. We write in snake case to ease our typing. This way we don't have to alternate between the uppercase and lowercase letters.

Now we to find out what kind of data we're looking at and what we need to clean.

In [6]:
autos.describe(include='all')
Out[6]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created no_of_pics postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-21 20:37:19 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

We find that we have two alpha-numeric columns (price, odometer) which need to be cleaned to a numeric type, three date columns (date_crawled, ad_created, last_seen) which will be properly cleaned later five numeric columns (registration_year, regestration_month, no_of_pics, postal_code) and the rest of the columns are text columns.

We'll need to investigate the date columns some more to see how much insight we can gain about the timings of some events related to the cars. The no_of_pics column has only one value for all its rows so we'll drop that column.

We'll start our cleaning process by dealing with the alpha-numeric data first. We'll remove all the letters and symbols in the data and convert the columns to either an integer or a float type whichever is best suited for the column.

In [7]:
autos['price'] = autos['price'].str.replace('$','')
autos['price'] = autos['price'].str.replace(',', '')
autos['price'] = autos['price'].astype(int)
In [8]:
autos['odometer'] = autos['odometer'].str.replace('km', '')
autos['odometer'] = autos['odometer'].str.replace(',', '')
autos['odometer'] = autos['odometer'].astype(int)
In [9]:
autos.rename({'odometer': 'odometer_km', 'price': 'price_dollars'}, axis = 1, inplace=True)

Now that that's done let's take a look at the price column first.

In [10]:
autos['price_dollars'].unique().shape
Out[10]:
(2357,)

There are 2357 unique prices in price_dollars. Lets investigate some more to find out more about these values

In [11]:
autos['price_dollars'].describe()
Out[11]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_dollars, dtype: float64
In [12]:
autos['price_dollars'].value_counts(ascending=False).head(6)
Out[12]:
0       1421
500      781
1500     734
2500     643
1000     639
1200     639
Name: price_dollars, dtype: int64
In [13]:
autos['price_dollars'].value_counts(ascending=False).tail(6)
Out[13]:
6495     1
20790    1
8970     1
846      1
2895     1
33980    1
Name: price_dollars, dtype: int64
In [14]:
autos.loc[autos['price_dollars'] == 99999999, "name"].shape
Out[14]:
(1,)

Upon further investigation, we notice the following:

  • The highest price is 99999999 dollars with seems like an outlier.
  • that the minimum price is 0 dollars indicating that the vehicle is free.
  • Quite a number of cars seem to have prices within the thousands range and also at 500 dollars.

We also notice that the 0 dollar price has the highest frequency. This seems strange as it implies that quite a number of cars are free. I did some google searching and I discovered that the least price for a used care on ebay 4000 dollars indicating that no car should have a price of less than 4000 dollars.

Lets do some investigating. I decided to start my investigation by finding out the number of all the cars with a price of 0 dollars and an odometer reading of more than 100000km.

In [15]:
autos.loc[(autos['price_dollars'] == 0) & (autos['odometer_km'] > 10000), "price_dollars"].shape
Out[15]:
(1280,)
In [16]:
autos.loc[(autos['price_dollars'] == 0) & (autos['odometer_km'] <= 10000), :]
Out[16]:
date_crawled name seller offer_type price_dollars abtest vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created no_of_pics postal_code last_seen
71 2016-03-28 19:39:35 Suche_Opel_Astra_F__Corsa_oder_Kadett_E_mit_Re... privat Angebot 0 control NaN 1990 manuell 0 NaN 5000 0 benzin opel NaN 2016-03-28 00:00:00 0 4552 2016-04-07 01:45:48
167 2016-04-02 19:43:45 Suche_VW_Multivan_Innenausstattung_Set_oder_TE... privat Angebot 0 control NaN 2011 NaN 0 transporter 5000 0 NaN volkswagen NaN 2016-04-02 00:00:00 0 64739 2016-04-06 19:45:08
226 2016-03-25 23:52:12 Porsche_911_S_Targa__67er_SWB privat Angebot 0 control cabrio 1967 manuell 160 911 5000 12 benzin porsche nein 2016-03-25 00:00:00 0 44575 2016-04-05 14:46:39
259 2016-04-03 23:49:58 guenstiges_Auto_/_auch_defekt privat Angebot 0 control NaN 2000 NaN 0 NaN 5000 6 NaN sonstige_autos NaN 2016-04-03 00:00:00 0 89269 2016-04-06 07:16:22
301 2016-03-08 20:37:59 Kaufe_alle_Autos_bietet_an privat Angebot 0 control NaN 1990 NaN 0 NaN 5000 0 NaN sonstige_autos NaN 2016-03-08 00:00:00 0 13589 2016-04-05 18:44:43
418 2016-03-29 14:43:24 Fiat_SCUDO_8_Sitzer_Bus__Diesel_JTD__80_KW privat Angebot 0 test bus 2003 manuell 80 andere 5000 5 diesel fiat nein 2016-03-29 00:00:00 0 35315 2016-04-05 23:47:14
430 2016-03-18 23:52:40 Winterraeder_FORD privat Angebot 0 test NaN 2007 NaN 0 focus 5000 0 NaN ford NaN 2016-03-18 00:00:00 0 40549 2016-03-19 06:47:06
1937 2016-03-19 08:51:48 Vw_polo_1_9tdi privat Angebot 0 test kombi 2001 manuell 120 polo 5000 0 NaN volkswagen NaN 2016-03-19 00:00:00 0 4720 2016-04-06 07:46:03
2360 2016-04-04 16:44:31 Polo_86c_3f__g40_g60__vr6 privat Angebot 0 test NaN 1995 NaN 0 NaN 5000 0 NaN volkswagen NaN 2016-04-04 00:00:00 0 26529 2016-04-06 18:17:16
2466 2016-03-20 17:57:49 Auto_Haus_Kanaan_An&Verkauf_Gebrauchtwagen privat Angebot 0 test NaN 2015 NaN 0 NaN 5000 0 NaN sonstige_autos NaN 2016-03-20 00:00:00 0 55234 2016-03-21 21:17:30
2481 2016-04-04 11:47:53 Triton_Supermoto_400_ccm_Tuev_9/17_!_Verkauf_/... privat Angebot 0 control NaN 2017 automatik 0 NaN 10000 2 benzin sonstige_autos NaN 2016-04-04 00:00:00 0 58708 2016-04-06 12:46:49
2813 2016-03-20 09:51:34 Verkaufe_Fiat_Punto_Felgen privat Angebot 0 control NaN 1990 NaN 0 punto 5000 0 NaN fiat NaN 2016-03-20 00:00:00 0 80999 2016-04-06 05:45:13
2875 2016-04-04 08:53:00 VW_Polo_3F_Motorsport privat Angebot 0 control coupe 1991 manuell 75 polo 5000 1 benzin volkswagen nein 2016-04-04 00:00:00 0 54516 2016-04-06 10:45:56
3308 2016-03-28 20:58:50 Suche_dringend_ein_Kleinwagen privat Angebot 0 control NaN 2000 NaN 0 NaN 5000 0 NaN sonstige_autos NaN 2016-03-28 00:00:00 0 44532 2016-04-03 02:20:29
3394 2016-04-01 18:52:13 Opel_omega_b_schlachtauto_zu_verschenken privat Angebot 0 control NaN 2000 automatik 211 NaN 5000 0 benzin opel NaN 2016-04-01 00:00:00 0 84189 2016-04-03 15:46:14
3422 2016-03-28 18:56:23 Audi_A6_3.0Tdi_Gewinde___20zoll privat Angebot 0 control NaN 2006 NaN 0 a6 5000 0 NaN audi NaN 2016-03-28 00:00:00 0 26632 2016-04-04 22:18:49
3643 2016-03-27 00:49:35 Single_frame_golf_4 privat Angebot 0 test NaN 2000 NaN 0 golf 5000 0 NaN volkswagen NaN 2016-03-26 00:00:00 0 6528 2016-04-07 03:15:23
3928 2016-03-07 21:45:42 Oldtimer_GAZ_M_21_Wolga_viele_Ersatzteile_Youn... privat Angebot 0 control limousine 1960 manuell 0 NaN 5000 1 benzin sonstige_autos NaN 2016-03-07 00:00:00 0 98630 2016-04-06 01:15:19
4079 2016-03-07 18:38:53 Opel_Zafira_CNG_zu_verkaufen! privat Angebot 0 control NaN 2016 NaN 0 zafira 5000 0 cng opel ja 2016-03-07 00:00:00 0 44135 2016-03-08 16:46:52
4111 2016-03-12 08:53:03 Brauch_ein_neues_Autos_habt_ihr_was_zum_anbiet... privat Angebot 0 control NaN 2000 NaN 0 NaN 5000 0 NaN sonstige_autos NaN 2016-03-12 00:00:00 0 80809 2016-03-21 06:17:14
4509 2016-03-22 01:00:15 Fahrzeuglackierer privat Angebot 0 control NaN 2015 NaN 0 NaN 5000 0 benzin sonstige_autos NaN 2016-03-22 00:00:00 0 38440 2016-03-26 08:18:27
6308 2016-03-12 17:38:17 Kaufe_Autos_jeglicher privat Angebot 0 test NaN 9996 NaN 0 NaN 10000 0 NaN sonstige_autos NaN 2016-03-12 00:00:00 0 21244 2016-03-12 17:38:17
7172 2016-03-14 20:46:25 2_Ford_Escort_Cabrio_Paket_Preis privat Angebot 0 control NaN 1990 NaN 0 escort 5000 3 NaN ford NaN 2016-03-14 00:00:00 0 47804 2016-03-14 20:46:25
7234 2016-03-31 07:54:02 Renault_Clio_Grandtour_Dynamique_TCe_90_eco2 privat Angebot 0 test kombi 2014 manuell 90 NaN 10000 10 benzin renault nein 2016-03-31 00:00:00 0 1640 2016-04-06 02:17:46
7266 2016-03-17 19:53:31 Tauschen_golf3 privat Angebot 0 control NaN 2016 NaN 0 golf 5000 0 benzin volkswagen nein 2016-03-17 00:00:00 0 16515 2016-03-20 22:44:57
7499 2016-03-13 16:37:44 Karosserie__opel_Kadett_c_Limo privat Angebot 0 test limousine 1978 NaN 0 kadett 5000 0 NaN opel NaN 2016-03-13 00:00:00 0 93413 2016-03-31 21:47:16
7512 2016-03-28 16:50:43 Bmw_750i_Tausch_oder_Angebot privat Angebot 0 test limousine 1995 automatik 326 7er 5000 0 NaN bmw nein 2016-03-28 00:00:00 0 23826 2016-04-06 22:17:41
7672 2016-03-13 01:57:22 FORD_FOCUS_WENIG_KILOMETER privat Angebot 0 control NaN 2018 NaN 0 focus 5000 12 diesel ford nein 2016-03-13 00:00:00 0 52385 2016-03-14 21:46:28
7808 2016-03-09 12:54:08 Koppel_Autos_Motoren_und_Getriebe_laufen privat Angebot 0 control NaN 1980 NaN 0 escort 10000 0 NaN ford NaN 2016-03-09 00:00:00 0 25554 2016-03-13 00:17:30
8438 2016-03-14 21:37:21 Jaguar_xj_40_Daimler_Original_Baujahr_1994 privat Angebot 0 test NaN 1995 automatik 0 NaN 5000 3 NaN jaguar nein 2016-03-14 00:00:00 0 12099 2016-03-25 11:45:38
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
39594 2016-03-11 23:40:30 fiat_scudo__fuer_Bastler__laeuft_auf_3_zylinde... privat Angebot 0 test bus 2000 manuell 69 andere 5000 5 diesel fiat NaN 2016-03-11 00:00:00 0 21107 2016-03-17 15:17:51
39604 2016-03-23 21:51:40 Kaefer_1303__Projekt_mit_Subaru_EJ22_Motor_eve... privat Angebot 0 test limousine 1973 manuell 136 kaefer 5000 0 benzin volkswagen NaN 2016-03-23 00:00:00 0 9131 2016-04-06 04:15:27
39635 2016-03-24 13:51:55 SUCHE_fuer_TT8N privat Angebot 0 control coupe 2000 manuell 0 tt 5000 12 benzin audi nein 2016-03-24 00:00:00 0 61169 2016-04-05 23:46:41
40264 2016-03-27 11:49:32 Bmw_316_i_Schlachtfesst_austauschmotor privat Angebot 0 test NaN 2000 NaN 0 3er 5000 0 NaN bmw NaN 2016-03-27 00:00:00 0 87766 2016-04-07 07:16:08
40324 2016-03-29 00:57:04 Golf_5_Bj.2007 privat Angebot 0 test limousine 2007 manuell 140 golf 5000 6 benzin volkswagen NaN 2016-03-28 00:00:00 0 12435 2016-04-05 13:45:15
40568 2016-03-25 22:53:28 Opel_Vectra_B privat Angebot 0 control NaN 2017 automatik 136 vectra 5000 12 NaN opel nein 2016-03-25 00:00:00 0 25554 2016-04-05 15:45:46
40955 2016-03-06 00:54:53 Zer_Guten_Auto_800_€ privat Angebot 0 control NaN 2016 NaN 0 NaN 5000 0 NaN chevrolet NaN 2016-03-06 00:00:00 0 28325 2016-03-06 00:54:53
41251 2016-03-24 20:00:29 Tauschen_e36_cabrio privat Angebot 0 control cabrio 1995 manuell 116 NaN 5000 3 lpg bmw nein 2016-03-24 00:00:00 0 26897 2016-03-29 06:47:12
42181 2016-03-27 19:50:53 SAMSUNG_55“_3D_Tv_und_Soundbar_gegen_Auto privat Angebot 0 test NaN 1910 NaN 0 NaN 5000 0 NaN sonstige_autos NaN 2016-03-27 00:00:00 0 57080 2016-04-06 01:15:30
42851 2016-03-26 22:54:00 Mercedes_Benz._____________________Kabine_actr... privat Angebot 0 test NaN 2005 NaN 0 andere 5000 0 NaN mercedes_benz NaN 2016-03-26 00:00:00 0 46240 2016-04-07 01:45:36
43224 2016-03-18 23:48:59 Golf_3_in_Teile__..._einfach_Anfragen__4_Tuerer privat Angebot 0 test NaN 2000 NaN 60 NaN 5000 0 NaN volkswagen NaN 2016-03-18 00:00:00 0 32312 2016-04-06 00:46:08
43481 2016-03-20 10:59:19 Holzkoepfe/Arschmaden privat Angebot 0 control NaN 2000 NaN 0 NaN 5000 8 NaN sonstige_autos NaN 2016-03-20 00:00:00 0 23845 2016-03-20 10:59:19
43492 2016-04-03 13:52:33 Opel_Astra_G_mit_Klima_elektrische_fenster privat Angebot 0 control NaN 2000 NaN 101 astra 5000 0 NaN opel NaN 2016-04-03 00:00:00 0 47169 2016-04-05 12:46:36
44109 2016-03-13 19:37:21 BMW_e30_Sammlungsaufloesung privat Angebot 0 control NaN 1990 NaN 0 3er 5000 0 NaN bmw NaN 2016-03-13 00:00:00 0 97456 2016-03-13 19:37:21
44195 2016-03-11 12:50:39 whatsapp_gruppe_Ford_focus privat Angebot 0 test NaN 2015 NaN 0 focus 5000 0 NaN ford NaN 2016-03-11 00:00:00 0 72770 2016-04-07 06:45:56
44415 2016-03-10 22:55:34 Auffahrrampe_/_Ausstellungsrampe privat Angebot 0 control NaN 2013 NaN 0 NaN 5000 0 NaN sonstige_autos NaN 2016-03-10 00:00:00 0 31789 2016-03-18 04:15:19
44624 2016-03-20 13:44:18 Opel_zafira__selection!_2.2_benzin__lpg._Defect privat Angebot 0 test bus 2001 automatik 150 NaN 5000 10 lpg opel ja 2016-03-20 00:00:00 0 63452 2016-03-26 22:47:07
46213 2016-04-02 13:47:16 Bellier_Vario privat Angebot 0 test kleinwagen 1910 NaN 0 NaN 5000 1 andere sonstige_autos NaN 2016-04-02 00:00:00 0 93105 2016-04-04 11:16:30
46220 2016-03-29 04:03:36 Tausch_oder_Verkauf_Renault_twingo privat Angebot 0 test NaN 2000 manuell 0 twingo 5000 0 benzin renault NaN 2016-03-29 00:00:00 0 71159 2016-04-05 17:25:59
46366 2016-03-05 22:54:03 Lackierer_sucht_Auftraege privat Angebot 0 control NaN 1960 NaN 0 NaN 10000 0 NaN sonstige_autos NaN 2016-03-05 00:00:00 0 72622 2016-03-06 05:43:10
46665 2016-03-24 13:50:22 FIAT_COUPE_175_ERSATZTEILE privat Angebot 0 control coupe 1994 manuell 139 andere 5000 0 benzin fiat NaN 2016-03-24 00:00:00 0 58553 2016-04-07 05:17:26
47142 2016-03-19 08:37:44 GESUCHT_WIRD_UNBEDINGT_EIN_VOLKSWAGEN_TOURAN_S... privat Angebot 0 test bus 2003 manuell 101 touran 5000 0 diesel volkswagen ja 2016-03-19 00:00:00 0 27753 2016-04-06 07:17:47
47280 2016-04-05 11:43:12 PKW_gesucht___01793917553 privat Angebot 0 test NaN 2005 manuell 0 NaN 5000 0 benzin volkswagen NaN 2016-04-05 00:00:00 0 8066 2016-04-05 11:43:12
47310 2016-03-14 04:55:34 Vw_golf_3_gt privat Angebot 0 control NaN 2016 manuell 90 golf 5000 12 NaN volkswagen NaN 2016-03-14 00:00:00 0 99955 2016-04-07 06:45:52
47368 2016-04-01 23:44:14 ZU_VERSCHENKEN_AUTO privat Angebot 0 control kleinwagen 1998 manuell 60 polo 5000 8 benzin volkswagen NaN 2016-04-01 00:00:00 0 69469 2016-04-01 23:44:14
48193 2016-04-04 01:47:49 Mein_Auto_fuer_Ihre_Werbung privat Angebot 0 control NaN 2005 NaN 0 NaN 5000 0 NaN volkswagen NaN 2016-04-03 00:00:00 0 7549 2016-04-06 08:16:17
48290 2016-04-03 17:41:30 Suche_mk2_Fahrer privat Angebot 0 test NaN 2009 NaN 0 NaN 5000 0 NaN ford NaN 2016-04-03 00:00:00 0 45711 2016-04-05 17:26:42
48636 2016-03-27 00:52:29 Audi_A4_gemacht_!!! privat Angebot 0 test limousine 2000 manuell 368 a4 10000 7 benzin audi nein 2016-03-26 00:00:00 0 99734 2016-03-27 08:43:42
48985 2016-03-27 08:55:25 Stockcar_Autos_Peugeot privat Angebot 0 test NaN 2007 NaN 60 NaN 5000 0 NaN peugeot NaN 2016-03-27 00:00:00 0 17153 2016-04-07 05:17:18
49496 2016-03-26 13:55:28 Bmw_e39_520 privat Angebot 0 control limousine 1998 manuell 0 NaN 5000 0 NaN bmw NaN 2016-03-26 00:00:00 0 26188 2016-03-26 13:55:28

141 rows × 20 columns

After my investigation I noticed that there were a mix of fully-repaired vehicles, vehicles with unrepaired damages and those with an unknown status. I also noticed Vehicles with registration years ranging from the 1900s to 2016 - vehicles from anytime in the 2000s shouldn't be free - and they were all private vehicles.

Since we have no way of confirming the correct data for these rows, We will be removing all the rows with a price of zero dollars and 99999999 dollars (the outlier) since it is safe to do so.

In [17]:
autos = autos[(autos['price_dollars'].between(1, 6000000))]
autos.shape
Out[17]:
(48578, 20)

Cleaning the Odometer column

Now we move to the odometer column.

In [18]:
autos['odometer_km'].unique().shape
Out[18]:
(13,)
In [19]:
autos['odometer_km'].unique()
Out[19]:
array([150000,  70000,  50000,  80000,  10000,  30000, 125000,  90000,
        20000,  60000,   5000, 100000,  40000])
In [20]:
autos['odometer_km'].describe()
Out[20]:
count     48578.000000
mean     125765.675820
std       39793.193823
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
In [21]:
autos['odometer_km'].value_counts(ascending=False).head(6)
Out[21]:
150000    31421
125000     5058
100000     2116
90000      1734
80000      1415
70000      1217
Name: odometer_km, dtype: int64
In [22]:
autos['odometer_km'].value_counts(ascending=False).tail(6)
Out[22]:
50000    1014
5000      837
40000     816
30000     780
20000     762
10000     253
Name: odometer_km, dtype: int64

From the analysis we notice the following:

  • There are thirteen unique values
  • All the values are rounded up to the nearest thousands
  • The highest value is 150000km. It is also the most frequently occuring indicating that some of the used cars have been used quite a bit.
  • The least value is 5000km. 967 cars have an odometer reading of 5000km indicating that they werent used much. I expect some of the big named cars in this categories to be quite expensive. We'll look more into this next
In [23]:
autos.loc[autos['odometer_km'] == 5000, "brand"].value_counts().head()
Out[23]:
volkswagen       161
opel             106
bmw               81
mercedes_benz     64
ford              56
Name: brand, dtype: int64
In [24]:
autos.loc[autos['odometer_km'] == 5000, "price_dollars"].describe()
Out[24]:
count    8.370000e+02
mean     1.316806e+04
std      1.366148e+05
min      1.000000e+00
25%      4.500000e+02
50%      1.600000e+03
75%      7.499000e+03
max      3.890000e+06
Name: price_dollars, dtype: float64
In [25]:
autos.loc[(autos['odometer_km'] == 5000) & (autos['price_dollars'] == 3890000), "name"]
Out[25]:
47634    Ferrari_FXX
Name: name, dtype: object
In [26]:
autos.loc[(autos['odometer_km'] == 5000) & (autos['price_dollars'] == 1), ["name","registeration_year","unrepaired_damage"]]
/dataquest/system/env/python3/lib/python3.4/site-packages/ipykernel/__main__.py:1: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  if __name__ == '__main__':
/dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/indexing.py:1367: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)
Out[26]:
name registeration_year unrepaired_damage
344 Verkaufe_hier_mein_mein_schoener_honda_crx_in_... NaN NaN
1422 zerlege_Astra_f_teile_guenstig NaN NaN
1987 Audi_A4_B5_Ersatzteile NaN NaN
2892 Suche_Opel_Omega_gerne_DTI_ab_BJ_2000 NaN NaN
3679 Suche_Auto NaN NaN
4496 RENAULT__MEGAN__GRANTOUR____bj_2001____TAUSCHE... NaN nein
5246 Autoaufbereitung NaN NaN
5883 Suche_ein_Auto NaN NaN
7442 Suche_Mercedes_280_SE_w108_Karosserie_oder_zum... NaN NaN
9643 Golf_2_gti_Umbau_1_8_L_107_PS NaN nein
13679 Suche_Audi_80_Cabrio NaN NaN
17984 Oldtimer_Chevrolet_Malibu_5.7l_V8_Bj.:1978 NaN NaN
18569 Suche_Leasing_Audi_A4_Avant__A6_oder_Q5__SQ5_P... NaN NaN
24362 Suche_Mercedes_CLS_320_CDI NaN NaN
27171 Wartburg_Scheunenfund_Teilespender NaN NaN
32819 Suche_VW_/_Audi NaN NaN
36652 WER_WÜRDE_MICH_SEIN_AUTO_AUF_RATEN_VERKAUFEN NaN NaN
41577 Alfa_Romeo_156_Schlachtfest_2.5_V6_24V_428/B_B... NaN NaN
43623 1.6__60_kW_90ps NaN ja
47087 W124_SCHLACHTFEST_|_Kombi_|_1995_|_Aktiv_Bass NaN NaN
49845 Schlachte_VW_Sharan_vr6_Automatik___no_GTI_16V... NaN nein

As expected, the most expensive car with a 5000km odometer reading is a Ferrari_FXX which is worth 3890000 dollars.

Some other things I noticed include that:

  • the least expensive cars are twenty-one in number, from various brands and all with a price of one dollar.
  • the registration years and unrepaired damage status, for most of these cars, are unknown. Because of this, we can't draw any reasonable conclusions as to why the they are so cheap.
  • the Volkswagen, opel and BMW are the most frequent brands with a mileage of 5000km.

Now let's look at the cars with the most mileage

In [27]:
autos.loc[autos['odometer_km'] == 150000, "brand"].value_counts().head()
Out[27]:
volkswagen       7224
bmw              3881
opel             3589
mercedes_benz    3340
audi             2994
Name: brand, dtype: int64
In [28]:
autos.loc[autos['odometer_km'] == 150000, "name"].value_counts().head()
Out[28]:
BMW_316i               65
BMW_320i               57
Volkswagen_Golf_1.4    53
Opel_Corsa             49
BMW_318i               49
Name: name, dtype: int64
In [29]:
autos.loc[autos['odometer_km'] == 150000, "price_dollars"].describe()
Out[29]:
count    3.142100e+04
mean     5.362514e+03
std      1.363871e+05
min      1.000000e+00
25%      9.990000e+02
50%      2.190000e+03
75%      4.700000e+03
max      1.234568e+07
Name: price_dollars, dtype: float64
In [30]:
autos.loc[(autos['odometer_km'] == 150000) & (autos['price_dollars'] == 12345678), "brand"]
Out[30]:
27371     fiat
39377    volvo
47598     opel
Name: brand, dtype: object

After analysis, we noticed that of all the cars with a mileage of 150000km:

  • Volkswagen cars had the highest population.
  • The most expensive brands were the fiat, volvo and opel. Each had a price of 12345678 (that's weird).
  • The minimum price was also one dollar.
  • In terms of Car names, the BMW seems to have a higher population than the Volkswagen.

The highest seems strange. Typically prices within the millions range are rounded up to the nearest hundred thousands or higher. Since they are only three rows, its safe enough for us to remove them without losing any meaningful insights in from our data.

In [31]:
autos = autos.loc[~(autos['price_dollars'] == 12345678), :]
autos.shape
Out[31]:
(48575, 20)

Apart from the weird highest price I mentioned earlier, nothing else seems to be out of the ordinary with the odometer column. Lets move on to the next thing.

Exploring the date column

There are three date columns in our data:

  • date_crawled: date and time the webpage was crawled
  • ad_created: date and time the car sale ad was created
  • last_seen: date and time the car ad was last seen

We begin by extracting the date and time of each column to separate columns before performing our analysis. We will discard the original columns after this since we have no need for them. To do this, well create a function to avoid repeating our code multiple times.

In [32]:
def extract_date_time(column):
    extract_date = autos[column].str[:10]
    extract_time = autos[column].str[10:]
    autos.drop(column, axis = 1)
    
    return extract_date, extract_time
In [33]:
autos['date_crawled_date'], autos['date_crawled_time'] = extract_date_time('date_crawled')
In [34]:
autos['ad_created_date'], autos['ad_created_time'] = extract_date_time('ad_created')
In [35]:
autos['last_seen_date'], autos['last_seen_time'] = extract_date_time('last_seen')

Now that we've extracted the date and time of each column, we can begin our analysis.

In [36]:
autos['date_crawled_date'].value_counts(normalize=True, dropna=False)
Out[36]:
2016-04-03    0.038600
2016-03-20    0.037880
2016-03-21    0.037406
2016-03-12    0.036933
2016-03-14    0.036541
2016-04-04    0.036521
2016-03-07    0.036006
2016-04-02    0.035471
2016-03-28    0.034853
2016-03-19    0.034771
2016-03-15    0.034277
2016-03-29    0.034133
2016-04-01    0.033680
2016-03-30    0.033680
2016-03-08    0.033309
2016-03-09    0.033083
2016-03-22    0.033001
2016-03-11    0.032568
2016-03-23    0.032218
2016-03-26    0.032198
2016-03-10    0.032177
2016-03-31    0.031827
2016-03-17    0.031642
2016-03-25    0.031601
2016-03-27    0.031086
2016-03-16    0.029604
2016-03-24    0.029336
2016-03-05    0.025322
2016-03-13    0.015666
2016-03-06    0.014040
2016-04-05    0.013093
2016-03-18    0.012908
2016-04-06    0.003170
2016-04-07    0.001400
Name: date_crawled_date, dtype: float64

We can observe the following:

  • All ads were crawled in the year 2016
  • All ads were crawled sometime between March and April
  • The least crawled dates were on the 6th and 7th of April 2016
  • The most crawled dates were the 3rd of April and 20th to 21st of March, 2016
In [37]:
autos['ad_created_date'].value_counts(normalize=True, dropna=False)
Out[37]:
2016-04-03    0.038847
2016-03-20    0.037941
2016-03-21    0.037612
2016-04-04    0.036891
2016-03-12    0.036768
2016-03-14    0.035183
2016-04-02    0.035142
2016-03-28    0.034977
2016-03-07    0.034730
2016-03-29    0.034071
2016-03-15    0.034009
2016-03-19    0.033680
2016-04-01    0.033680
2016-03-30    0.033495
2016-03-08    0.033330
2016-03-09    0.033145
2016-03-11    0.032898
2016-03-22    0.032815
2016-03-26    0.032259
2016-03-23    0.032054
2016-03-10    0.031889
2016-03-31    0.031868
2016-03-25    0.031745
2016-03-17    0.031292
2016-03-27    0.030983
2016-03-16    0.030118
2016-03-24    0.029274
2016-03-05    0.022892
2016-03-13    0.017005
2016-03-06    0.015317
                ...   
2016-02-26    0.000041
2016-02-20    0.000041
2016-02-18    0.000041
2016-02-02    0.000041
2016-02-05    0.000041
2016-01-10    0.000041
2016-02-24    0.000041
2016-02-12    0.000041
2016-02-14    0.000041
2015-12-05    0.000021
2015-12-30    0.000021
2016-02-07    0.000021
2016-01-07    0.000021
2015-08-10    0.000021
2016-02-16    0.000021
2016-02-17    0.000021
2015-09-09    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-01-22    0.000021
2015-06-11    0.000021
2015-11-10    0.000021
2016-01-03    0.000021
2016-02-08    0.000021
2016-02-22    0.000021
2016-01-13    0.000021
2016-01-16    0.000021
2016-01-29    0.000021
2016-02-01    0.000021
2016-01-14    0.000021
Name: ad_created_date, Length: 76, dtype: float64

In the ad_created_date column, we notice that:

  • The ads were created between the June 2015 and April 2016.
  • The least amount of ads were created at the end of 2015 and the first two months of 2016.
  • The most amount of ads were created shortly after the year 2016 began. Sometime in March and April 2016.
In [38]:
autos['last_seen_date'].value_counts(normalize=True, dropna=False)
Out[38]:
2016-04-06    0.221801
2016-04-07    0.131920
2016-04-05    0.124776
2016-03-17    0.028080
2016-04-03    0.025198
2016-04-02    0.024931
2016-03-30    0.024766
2016-04-04    0.024478
2016-03-12    0.023798
2016-03-31    0.023778
2016-04-01    0.022790
2016-03-29    0.022357
2016-03-22    0.021369
2016-03-28    0.020875
2016-03-21    0.020648
2016-03-20    0.020648
2016-03-24    0.019763
2016-03-25    0.019207
2016-03-23    0.018528
2016-03-26    0.016799
2016-03-16    0.016449
2016-03-15    0.015872
2016-03-19    0.015831
2016-03-27    0.015646
2016-03-14    0.012599
2016-03-11    0.012373
2016-03-10    0.010664
2016-03-09    0.009593
2016-03-13    0.008893
2016-03-08    0.007432
2016-03-18    0.007349
2016-03-07    0.005394
2016-03-06    0.004323
2016-03-05    0.001071
Name: last_seen_date, dtype: float64

The crawler was last seen in 2016, sometime between most of March and the start of April 2016

In [39]:
autos['registration_year'].describe()
Out[39]:
count    48575.000000
mean      2004.752671
std         88.635269
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

Here we observe that:

  • The highest registration year is 9999
  • The lowest registration year is 1000

These are unrealistic because the ads were crawled in 2016 so its impossible for any car to have been registered after 2016. The registration year of 1000 is also impossible because at that time, cars had not been invented yet. We may have to remove rows like these which contain unrealistic date stamps but lets find out how many they are to see if its safe to do so. To do this we set our upper and lower bounds to be 2016 and 1940 respectively.

In [40]:
outlier_count = autos.loc[~(autos['registration_year'].between(1940, 2016)), 'registration_year'].count()
outlier_count
Out[40]:
1900

There are 1900 cars that were registered outside of 1900 to 2016. Considering how many rows we have dropped till now, It's I think it's okay to remove these ones outside our bounds.

In [41]:
autos = autos.loc[autos['registration_year'].between(1940, 2016), :]
In [42]:
autos['registration_year'].value_counts(normalize=True)
Out[42]:
2000    0.067616
2005    0.062903
1999    0.062089
2004    0.057911
2003    0.057825
2006    0.057225
2001    0.056476
2002    0.053262
1998    0.050627
2007    0.048784
2008    0.047456
2009    0.044692
1997    0.041800
2011    0.034772
2010    0.034044
1996    0.029416
2012    0.028066
1995    0.026310
2016    0.026138
2013    0.017204
2014    0.014226
1994    0.013476
1993    0.009106
2015    0.008399
1992    0.007949
1990    0.007434
1991    0.007263
1989    0.003728
1988    0.002892
1985    0.002057
          ...   
1981    0.000621
1968    0.000557
1967    0.000557
1971    0.000557
1973    0.000514
1960    0.000514
1974    0.000514
1966    0.000471
1977    0.000471
1976    0.000450
1969    0.000407
1975    0.000386
1965    0.000364
1964    0.000257
1963    0.000171
1961    0.000129
1959    0.000129
1962    0.000086
1956    0.000086
1958    0.000086
1950    0.000064
1957    0.000043
1955    0.000043
1954    0.000043
1951    0.000043
1941    0.000043
1948    0.000021
1943    0.000021
1953    0.000021
1952    0.000021
Name: registration_year, Length: 70, dtype: float64

Now that we've cleaned our data, we can assert the following conclusions:

  • The highest number of registrations occured in the year 2000
  • The least number of registrations occured in the year 1952
  • There were more registration of cars in the 2000s than in the 1900s

Great! We are finally done exploring all the date columns in out dataset for insights. We will move on to the next column, the brand column.

Exploring the brand column

To explore this column we will start by getting an aggregate of the unique values of this column. Because of how many unique brands are in the dataset, We will aggregate on the top 20 most common brands.

We will get this by using the ASeries.value_counts() method available in pandas to count the

In [43]:
brand_count = autos['brand'].value_counts().head(20)
brand_count
Out[43]:
volkswagen        9865
bmw               5137
opel              5019
mercedes_benz     4502
audi              4041
ford              3262
renault           2200
peugeot           1393
fiat              1197
seat               853
skoda              766
nissan             713
mazda              709
smart              661
citroen            655
toyota             593
hyundai            468
sonstige_autos     455
volvo              427
mini               409
Name: brand, dtype: int64

The most commonly sold brand is the volkswagen. The BMW and opel are a second and third respectively but there is a significantly lesser number of them than the volkswagen. The mini and volvo are the rarest brands of the top 20 with both less than 430 in number.

Not surprisingly, posh cars like the ferarri dont appear on the top twenty list. The ford appears to be one quite common while nissan is a closer to the bottom of the list. This could be as a result of the german's preference in cars or a nissan might be more expensive than a ford. Lets find out by taking a look at the mean price of the cars on this list.

To do this, We'll create a function to sort and display, in a descending order, the mean price dictionary we will create for our analysis.

In [64]:
def display_table(dataset, index = 0, is_dataset = True):
    """
    Displays frequency table of any column in the dataset in descending order.
    Dataset can be a frequency dictionary or a dataset.
    
    INPUT:
    dataset = The dataset or frequency table to be examined
    index = index of column whose frequency table we are after. Default = 0
    is_dataset = A boolean. True if dataset is a dataset else: false. Default = True
    
    OUTPUT:
    table_sorted = Frequency table of column corresponding to index sorted in descending order.
    """
    if is_dataset:
        table = freq_table(dataset, index)
    else:
        table = dataset
    
    table_val_sorted = sorted(table.values(), reverse = True)
    
    for value in table_val_sorted:
        for key in table:
            if table[key] == value:
                print(key, ':', value)
In [65]:
brand_unique = autos['brand'].unique()
brand_price_mean = {}

for brand in brand_count.index:
    brand_price_mean[brand] = autos.loc[autos['brand'] == brand, 'price_dollars'].mean()

display_table(brand_price_mean, is_dataset=False)
sonstige_autos : 45644.97802197802
citroen : 45486.68549618321
mini : 10613.459657701711
audi : 9336.687453600594
mercedes_benz : 8629.300755219901
bmw : 8570.910064239828
ford : 7447.770999386879
volkswagen : 6729.81956411556
skoda : 6368.0
hyundai : 5365.254273504273
toyota : 5167.091062394604
volvo : 4946.501170960188
nissan : 4743.40252454418
seat : 4397.230949589683
mazda : 4112.596614950635
smart : 3580.2239031770046
peugeot : 3094.0172290021537
opel : 2976.2472604104405
fiat : 2813.748538011696
renault : 2475.7172727272728

First we notice that the volkswagen (our most common brand) and ford (a common brand) are somewhere in the middle of the list. Indicating that the average brand price is a bit on the average side of the top 20 brands.

The opel is at the bottom of the price list indicating that its popularity may be largly due to its affordability. We also notice that the sonstige_autos and citroen are at the top of the list. Not surprising considering that they are luxury cars.

We also notice that the BMW and Mercedes Benz are quite expensive yet they are among the top 4 most common brands. This could be because they offer better mileage, last longer than most other brands or offer some other advantages over their competition that make them worth their price.

Now let's look at the mean mileage of these brands.

In [68]:
brand_mean_mileage = {}

for brand in brand_count.index:
    brand_mean_mileage[brand] = autos.loc[autos['brand'] == brand, 'odometer_km'].mean()

display_table(brand_mean_mileage, is_dataset=False)
volvo : 138067.9156908665
bmw : 132600.73973136072
mercedes_benz : 130816.30386494892
opel : 129336.5212193664
audi : 129157.38678544914
volkswagen : 128713.6340598074
renault : 128127.27272727272
peugeot : 127153.62526920316
mazda : 124464.03385049365
ford : 124351.62477007971
seat : 121131.30128956624
citroen : 119572.51908396947
nissan : 118330.99579242637
fiat : 117121.9715956558
toyota : 115944.35075885328
skoda : 110848.5639686684
hyundai : 106442.30769230769
smart : 99326.77760968229
sonstige_autos : 90219.78021978022
mini : 88105.13447432763

On this list we notice that the Volvo has the highest average mileage. The sonstige_autos (our most expensive brand offers one of the lowest mileages. This could explain why its ranked so low among the top 20 most common brands. The bmw, opel and Mercedes benz all have one of the highest mileages. They are topped only by Volvo. It's worth noting that despite the high price, the BMW and Mercedez were one of the most common brands. Ford is at the middle of the list, it is also fairly expensive and is also a fairly common brand. And Volkwagen (our most common brand) has a high average mileage and its quite affordable.

From what I gather, This signifies a trend, the most important factor in picking a brand is the mileage. The better the mileage a brand provides, the more likely the brand is to be bought. Other factors that affect the likelihood could be buyer tastes, the price also has some importance but, as in the case of the BMW and Mercedes Benz, they can be overlooked if the benefits the brand provides are worth it.

Let's take a closer look at the mean mileages of brands in the top 8 of the mean price list. We could find some connections between the two properties.

In [72]:
mean_price_series = pd.Series(brand_price_mean)
mean_mileage_series = pd.Series(brand_mean_mileage)
mean_stats_df = pd.DataFrame(mean_price_series, columns = ['mean_price'])
mean_stats_df['mean_mileage'] = mean_mileage_series
mean_stats_df['brand_count'] = brand_count
mean_stats_df
Out[72]:
mean_price mean_mileage brand_count
audi 9336.687454 129157.386785 4041
bmw 8570.910064 132600.739731 5137
citroen 45486.685496 119572.519084 655
fiat 2813.748538 117121.971596 1197
ford 7447.770999 124351.624770 3262
hyundai 5365.254274 106442.307692 468
mazda 4112.596615 124464.033850 709
mercedes_benz 8629.300755 130816.303865 4502
mini 10613.459658 88105.134474 409
nissan 4743.402525 118330.995792 713
opel 2976.247260 129336.521219 5019
peugeot 3094.017229 127153.625269 1393
renault 2475.717273 128127.272727 2200
seat 4397.230950 121131.301290 853
skoda 6368.000000 110848.563969 766
smart 3580.223903 99326.777610 661
sonstige_autos 45644.978022 90219.780220 455
toyota 5167.091062 115944.350759 593
volkswagen 6729.819564 128713.634060 9865
volvo 4946.501171 138067.915691 427

Sorting our new dataframe by mean price in descending order we will be able to easily find our top 8 brands with the highest prices and their corresponding mileages

In [73]:
mean_stats_df.sort_values('mean_price', axis=0, ascending=False)
Out[73]:
mean_price mean_mileage brand_count
sonstige_autos 45644.978022 90219.780220 455
citroen 45486.685496 119572.519084 655
mini 10613.459658 88105.134474 409
audi 9336.687454 129157.386785 4041
mercedes_benz 8629.300755 130816.303865 4502
bmw 8570.910064 132600.739731 5137
ford 7447.770999 124351.624770 3262
volkswagen 6729.819564 128713.634060 9865
skoda 6368.000000 110848.563969 766
hyundai 5365.254274 106442.307692 468
toyota 5167.091062 115944.350759 593
volvo 4946.501171 138067.915691 427
nissan 4743.402525 118330.995792 713
seat 4397.230950 121131.301290 853
mazda 4112.596615 124464.033850 709
smart 3580.223903 99326.777610 661
peugeot 3094.017229 127153.625269 1393
opel 2976.247260 129336.521219 5019
fiat 2813.748538 117121.971596 1197
renault 2475.717273 128127.272727 2200

Apart from the sonstige_autos and mini, all the brands in the top 8 of the mean price list have a mean mileage above 100000km. The brands with average mileages greater than 120000km have a high brand count values. This indicated that a car is most likely to be sold if it has a high mileage even if the car may be a bit expensive.

Conclusion

In this project, we anayzed a modified version of the Ebay car sales dataset provided on kaggle to find out the best cars to purchase and reached the following conclusions:

  • the most popular brand was the volkswagen
  • the most expensive brand among the top 20 brands was the sonstige_autos
  • the brand with the most mileage was the volvo

We also noticed a few brands that like the BMW and mercedez which were expensive yet popular among buyers. We deduced that the reason for this was that they provided value which was worth their price.

We concluded that the best brand to buy would be between a volkwagen or volvo. If we chose to go for a mor classy brand, a BMW or a Mercedez Benz would be better options.