We will be working with the dataset of used cars from eBay Kleinanzeigen - a classifieds section of the German eBay website.
The aim of this project is to clean the data and analyze the included used car listings.
The data dictionary provided with data is as follows:
dateCrawled
- When this ad was first crawled. All field-values are taken from this date.
name
- Name of the car.
seller
- Whether the seller is private or a dealer.
offerType
- The type of listing
price
- The price on the ad to sell the car.
abtest
- Whether the listing is included in an A/B test.
vehicleType
- The vehicle Type.
yearOfRegistration
- The year in which the car was first registered.
gearbox
- The transmission type.
powerPS
- The power of the car in PS.
model
- The car model name.
kilometer
- How many kilometers the car has driven.
monthOfRegistration
- The month in which the car was first registered.
fuelType
- What type of fuel the car uses.
brand
- The brand of the car.
notRepairedDamage
- If the car has a damage which is not yet repaired.
dateCreated
- The date on which the eBay listing was created.
nrOfPictures
- The number of pictures in the ad.
postalCode
- The postal code for the location of the vehicle.
lastSeenOnline
- When the crawler saw this ad last online.
If we are looking at cars, what place better to look at than the home of the modern car? Hence, let us start off our project by importing the libraries and read the dataset which we will work on using pandas.
On that note, let's go!
Or as a typical german would say, 'Los geht's!'
import pandas as pd
import numpy as nm
autos = pd.read_csv('autos.csv',encoding = "Latin-1")
autos #Jupyter notebook has a neat feature to render the first and last few fvalues of any pandas object!
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | $7,900 | test | bus | 2006 | automatik | 150 | voyager | 150,000km | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | $300 | test | limousine | 1995 | manuell | 90 | golf | 150,000km | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | $1,990 | control | limousine | 1998 | manuell | 90 | golf | 150,000km | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | $250 | test | NaN | 2000 | manuell | 0 | arosa | 150,000km | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | $590 | control | bus | 1997 | manuell | 90 | megane | 150,000km | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
10 | 2016-03-15 01:41:36 | VW_Golf_Tuning_in_siber/grau | privat | Angebot | $999 | test | NaN | 2017 | manuell | 90 | NaN | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-14 00:00:00 | 0 | 86157 | 2016-04-07 03:16:21 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | $350 | control | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | $5,299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50,000km | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | $1,350 | control | limousine | 1999 | manuell | 101 | a3 | 150,000km | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | $3,999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150,000km | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | $18,900 | test | coupe | 1982 | automatik | 203 | NaN | 80,000km | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | $350 | test | kombi | 1999 | manuell | 101 | vectra | 150,000km | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | $5,500 | test | coupe | 1990 | manuell | 205 | scirocco | 150,000km | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | $300 | control | bus | 1995 | manuell | 150 | 3er | 150,000km | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | $4,150 | control | suv | 2004 | manuell | 124 | andere | 150,000km | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | $3,500 | test | kombi | 2003 | manuell | 131 | a4 | 150,000km | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | $41,500 | test | cabrio | 2004 | manuell | 320 | 911 | 150,000km | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | $25,450 | control | cabrio | 2015 | manuell | 184 | cooper | 10,000km | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | $7,999 | control | bus | 2010 | manuell | 120 | NaN | 150,000km | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | $48,500 | control | limousine | 2014 | automatik | 306 | 5er | 30,000km | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
25 | 2016-03-21 21:56:18 | Ford_escort_kombi_an_bastler_mit_ghia_ausstattung | privat | Angebot | $90 | control | kombi | 1996 | manuell | 116 | NaN | 150,000km | 4 | benzin | ford | ja | 2016-03-21 00:00:00 | 0 | 27574 | 2016-04-01 05:16:49 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | $777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125,000km | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
27 | 2016-03-27 18:45:01 | Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE | privat | Angebot | $0 | control | NaN | 2005 | NaN | 0 | NaN | 150,000km | 0 | NaN | ford | NaN | 2016-03-27 00:00:00 | 0 | 66701 | 2016-03-27 18:45:01 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | $5,250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150,000km | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | $4,999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150,000km | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49970 | 2016-03-21 22:47:37 | c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... | privat | Angebot | $15,800 | control | bus | 2010 | automatik | 136 | c4 | 60,000km | 4 | diesel | citroen | nein | 2016-03-21 00:00:00 | 0 | 14947 | 2016-04-07 04:17:34 |
49971 | 2016-03-29 14:54:12 | W.Lupo_1.0 | privat | Angebot | $950 | test | kleinwagen | 2001 | manuell | 50 | lupo | 150,000km | 4 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 65197 | 2016-03-29 20:41:51 |
49972 | 2016-03-26 22:25:23 | Mercedes_Benz_Vito_115_CDI_Extralang_Aut. | privat | Angebot | $3,300 | control | bus | 2004 | automatik | 150 | vito | 150,000km | 10 | diesel | mercedes_benz | ja | 2016-03-26 00:00:00 | 0 | 65326 | 2016-03-28 11:28:18 |
49973 | 2016-03-27 05:32:39 | Mercedes_Benz_SLK_200_Kompressor | privat | Angebot | $6,000 | control | cabrio | 2004 | manuell | 163 | slk | 150,000km | 11 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 53567 | 2016-03-27 08:25:24 |
49974 | 2016-03-20 10:52:31 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | privat | Angebot | $0 | control | cabrio | 1983 | manuell | 70 | golf | 150,000km | 2 | benzin | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 8209 | 2016-03-27 19:48:16 |
49975 | 2016-03-27 20:51:39 | Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort | privat | Angebot | $9,700 | control | kleinwagen | 2012 | automatik | 88 | jazz | 100,000km | 11 | hybrid | honda | nein | 2016-03-27 00:00:00 | 0 | 84385 | 2016-04-05 19:45:34 |
49976 | 2016-03-19 18:56:05 | Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... | privat | Angebot | $5,900 | test | kombi | 1992 | automatik | 150 | 80 | 150,000km | 12 | benzin | audi | nein | 2016-03-19 00:00:00 | 0 | 36100 | 2016-04-07 06:16:44 |
49977 | 2016-03-31 18:37:18 | Mercedes_Benz_C200_Cdi_W203 | privat | Angebot | $5,500 | control | limousine | 2003 | manuell | 116 | c_klasse | 150,000km | 2 | diesel | mercedes_benz | nein | 2016-03-31 00:00:00 | 0 | 33739 | 2016-04-06 12:16:11 |
49978 | 2016-04-04 10:37:14 | Mercedes_Benz_E_200_Classic | privat | Angebot | $900 | control | limousine | 1996 | automatik | 136 | e_klasse | 150,000km | 9 | benzin | mercedes_benz | ja | 2016-04-04 00:00:00 | 0 | 24405 | 2016-04-06 12:44:20 |
49979 | 2016-03-20 18:38:40 | Volkswagen_Polo_1.6_TDI_Style | privat | Angebot | $11,000 | test | kleinwagen | 2011 | manuell | 90 | polo | 70,000km | 11 | diesel | volkswagen | nein | 2016-03-20 00:00:00 | 0 | 48455 | 2016-04-07 01:45:12 |
49980 | 2016-03-12 10:55:54 | Ford_Escort_Turnier_16V | privat | Angebot | $400 | control | kombi | 1995 | manuell | 105 | escort | 125,000km | 3 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 56218 | 2016-04-06 17:16:49 |
49981 | 2016-03-15 09:38:21 | Opel_Astra_Kombi_mit_Anhaengerkupplung | privat | Angebot | $2,000 | control | kombi | 1998 | manuell | 115 | astra | 150,000km | 12 | benzin | opel | nein | 2016-03-15 00:00:00 | 0 | 86859 | 2016-04-05 17:21:46 |
49982 | 2016-03-29 18:51:08 | Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm | privat | Angebot | $1,950 | control | kleinwagen | 2004 | manuell | 0 | fabia | 90,000km | 7 | benzin | skoda | NaN | 2016-03-29 00:00:00 | 0 | 45884 | 2016-03-29 18:51:08 |
49983 | 2016-03-06 12:43:04 | Ford_focus_99 | privat | Angebot | $600 | test | kleinwagen | 1999 | manuell | 101 | focus | 150,000km | 4 | benzin | ford | NaN | 2016-03-06 00:00:00 | 0 | 52477 | 2016-03-09 06:16:08 |
49984 | 2016-03-31 22:48:48 | Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... | privat | Angebot | $0 | test | NaN | 2000 | NaN | 0 | NaN | 150,000km | 0 | NaN | sonstige_autos | NaN | 2016-03-31 00:00:00 | 0 | 12103 | 2016-04-02 19:44:53 |
49985 | 2016-04-02 16:38:23 | Verkaufe_meinen_vw_vento! | privat | Angebot | $1,000 | control | NaN | 1995 | automatik | 0 | NaN | 150,000km | 0 | benzin | volkswagen | NaN | 2016-04-02 00:00:00 | 0 | 30900 | 2016-04-06 15:17:52 |
49986 | 2016-04-04 20:46:02 | Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... | privat | Angebot | $15,900 | control | limousine | 2010 | automatik | 218 | 300c | 125,000km | 11 | diesel | chrysler | nein | 2016-04-04 00:00:00 | 0 | 73527 | 2016-04-06 23:16:00 |
49987 | 2016-03-22 20:47:27 | Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... | privat | Angebot | $21,990 | control | limousine | 2013 | manuell | 150 | a3 | 50,000km | 11 | diesel | audi | nein | 2016-03-22 00:00:00 | 0 | 94362 | 2016-03-26 22:46:06 |
49988 | 2016-03-28 19:49:51 | BMW_330_Ci | privat | Angebot | $9,550 | control | coupe | 2001 | manuell | 231 | 3er | 150,000km | 10 | benzin | bmw | nein | 2016-03-28 00:00:00 | 0 | 83646 | 2016-04-07 02:17:40 |
49989 | 2016-03-11 19:50:37 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | privat | Angebot | $150 | test | kleinwagen | 1997 | manuell | 0 | polo | 150,000km | 5 | benzin | volkswagen | ja | 2016-03-11 00:00:00 | 0 | 21244 | 2016-03-12 10:17:55 |
49990 | 2016-03-21 19:54:19 | Mercedes_Benz_A_200__BlueEFFICIENCY__Urban | privat | Angebot | $17,500 | test | limousine | 2012 | manuell | 156 | a_klasse | 30,000km | 12 | benzin | mercedes_benz | nein | 2016-03-21 00:00:00 | 0 | 58239 | 2016-04-06 22:46:57 |
49991 | 2016-03-06 15:25:19 | Kleinwagen | privat | Angebot | $500 | control | NaN | 2016 | manuell | 0 | twingo | 150,000km | 0 | benzin | renault | NaN | 2016-03-06 00:00:00 | 0 | 61350 | 2016-03-06 18:24:19 |
49992 | 2016-03-10 19:37:38 | Fiat_Grande_Punto_1.4_T_Jet_16V_Sport | privat | Angebot | $4,800 | control | kleinwagen | 2009 | manuell | 120 | andere | 125,000km | 9 | lpg | fiat | nein | 2016-03-10 00:00:00 | 0 | 68642 | 2016-03-13 01:44:51 |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | privat | Angebot | $1,650 | control | kleinwagen | 1997 | manuell | 0 | NaN | 150,000km | 7 | benzin | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | $5,000 | control | kombi | 2001 | automatik | 299 | a6 | 150,000km | 1 | benzin | audi | nein | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
Ach so! Das ist interessant!
If we look at the data, we can see that the columns are in english (Phew!) however you will also come across words like Angebot
(Offer), kleinwagen
(small car) wtc which is not very easily decipherable for eyes not accustomed to German.
Before we get into this, a clarification: Sonstige
that is mentioned in the brand
actually means other or miscellaneous in German. Would be good to keep that in mind as it will be used quite extensively through the course of this project
We will need to work on that! Well, atleast there is no mention of Kraftfahrzeughaftpflichtversicherung
(automobile liability insurance ;) ) essentially making our lives easier.
Let's print the information out for our dataframe and have a look at their data types to understand a bit more of what we are dealing with.
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Data Exploration and Cleaning
The data set has 20 columns, 15 of which are object type and 5 are of int 64 type.
Looking at the data, at first glance we can identify that the columns of vehicleType
, gearbox
, model
, fuelType
and notRepairedDamage
are the only ones which have null values existing within them.
Moreover, there are some object type data like price
and odometer
which we would need to clean and convert into int64 or float64 types in order to analyze our data more accurately.
The column names use camelcase
(new word is capitalized but not seperated) instead of Python's preferred snakecase
(different words joined by an underscore), which means we can't just replace spaces with underscores.
In order to decide whether to keep the row which have null values or overall exclude them would be subject to some more analysis on our part. We would only be able to establish that once we dig a wee bit deeper into the data.
Let us start off by converting camelcase to snakecase for better nomenclature.
autos.columns #to printout all existing columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
autos.rename({"dateCrawled":"date_crawled","offerType":"offer_type","yearOfRegistration":"registration_year","vehicleType":"vehicle_type","powerPS":"power_ps","monthOfRegistration":"registration_month","fuelType":"fuel_type","notRepairedDamage":"unrepaired_damage","dateCreated":"ad_created","nrOfPictures":"nr_pictures","postalCode":"postal_code","abtest":"ab_test","lastSeen":"last_seen"},axis=1, inplace=True)
In this previous set we remaned all the columns and brought them to snakecase. Let us check how the columns look now
autos.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for:
The following methods are helpful for exploring the data: - DataFrame.describe()
(with include='all' to get both categorical and numeric columns) - Series.value_counts()
and Series.head()
if any columns need a closer look.
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-27 22:55:05 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
A few odd observations:
registration_year
: min : 1000 (not possible unless eBay dates back to the Holy Roman Empire and Otto the great was a tech and used-car fanatic)max : 9999 which also an absurd value
power_ps
: PS stands for pferdstärke which literally means horsepower and neither 0 nor 177000 are acceptable values. Needs further investigation
nr_pictures
- This essentially shows that there are no present values and that this column would be of no use for us as it doesnt paint a picture of any sort.
price
and odometer
need to be cleaned and then checked for patterns and discrepencies.
seller
and offer
have 49999 of the same values out of 50000 which essentially means that this column does not have any deviations or variations and can be deemed as contants and essentially useless for us.
autos["price"] = autos["price"].str.replace('$',"")
autos["price"] = autos["price"].str.replace(',',"").astype(int)
autos["odometer"] = autos["odometer"].str.replace('km',"")
autos["odometer"] = autos["odometer"].str.replace(',',"").astype(int)
autos.rename({"odometer":"odometer_km"},axis=1, inplace=True)
autos.describe()
price | registration_year | power_ps | odometer_km | registration_month | nr_pictures | postal_code | |
---|---|---|---|---|---|---|---|
count | 5.000000e+04 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.000000 | 50000.0 | 50000.000000 |
mean | 9.840044e+03 | 2005.073280 | 116.355920 | 125732.700000 | 5.723360 | 0.0 | 50813.627300 |
std | 4.811044e+05 | 105.712813 | 209.216627 | 40042.211706 | 3.711984 | 0.0 | 25779.747957 |
min | 0.000000e+00 | 1000.000000 | 0.000000 | 5000.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1.100000e+03 | 1999.000000 | 70.000000 | 125000.000000 | 3.000000 | 0.0 | 30451.000000 |
50% | 2.950000e+03 | 2003.000000 | 105.000000 | 150000.000000 | 6.000000 | 0.0 | 49577.000000 |
75% | 7.200000e+03 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 0.0 | 71540.000000 |
max | 1.000000e+08 | 9999.000000 | 17700.000000 | 150000.000000 | 12.000000 | 0.0 | 99998.000000 |
The max value for the cars does seem a little dicey unless because it is unlikely that there would be a car sold for 100 million dollars. So does the min price value of 0.
Let us dig deeper into these columns and see where it takes us
autos["price"].unique().shape #to see the unique values
(2357,)
autos["price"].describe() #in-depth value checking
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
autos["price"].value_counts().sort_index(ascending = False).head(10) #check for extraordinarily high values
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 Name: price, dtype: int64
auto2 = autos[autos["price"] > 1000000]
auto2[["name","price"]]
name | price | |
---|---|---|
2897 | Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000 | 11111111 |
7814 | Ferrari_F40 | 1300000 |
11137 | suche_maserati_3200_gt_Zustand_unwichtig_laufe... | 10000000 |
22947 | Bmw_530d_zum_ausschlachten | 1234566 |
24384 | Schlachte_Golf_3_gt_tdi | 11111111 |
27371 | Fiat_Punto | 12345678 |
39377 | Tausche_volvo_v40_gegen_van | 12345678 |
39705 | Tausch_gegen_gleichwertiges | 99999999 |
42221 | Leasinguebernahme | 27322222 |
47598 | Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar... | 12345678 |
47634 | Ferrari_FXX | 3890000 |
auto2 = autos[autos["price"] < 250]
auto2[["name","price"]].sort_index(ascending=False).head()
name | price | |
---|---|---|
49989 | VW_Polo_zum_Ausschlachten_oder_Wiederaufbau | 150 |
49984 | Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... | 0 |
49974 | Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... | 0 |
49960 | Ford_KA_zu_verschenken_***Reserviert*** | 0 |
49943 | Opel_astra | 0 |
Here we would generally have to consider that any price below 250 ia too low and anything above 100000 seems improbable too (considering no one would want to really sell their sports car on eBay)
Let's only keep the values between 250 and 100000 into our dataset
autos = autos[autos["price"].between(250,100000)]
autos.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 47249 entries, 0 to 49999 Data columns (total 20 columns): date_crawled 47249 non-null object name 47249 non-null object seller 47249 non-null object offer_type 47249 non-null object price 47249 non-null int64 ab_test 47249 non-null object vehicle_type 43092 non-null object registration_year 47249 non-null int64 gearbox 45198 non-null object power_ps 47249 non-null int64 model 44960 non-null object odometer_km 47249 non-null int64 registration_month 47249 non-null int64 fuel_type 43611 non-null object brand 47249 non-null object unrepaired_damage 38799 non-null object ad_created 47249 non-null object nr_pictures 47249 non-null int64 postal_code 47249 non-null int64 last_seen 47249 non-null object dtypes: int64(7), object(13) memory usage: 7.6+ MB
The odometer_km
on the other hand looks okay so we will let it be
Working with dates
Let's now move on to the date columns and understand the date range the data covers.
There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary
date_crawled
: added by the crawlerlast_seen
: added by the crawlerad_created
: from the websiteregistration_month
: from the websiteregistration_year
: from the websiteRight now, the date_crawled, last_seen, and ad_created columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively.
Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values, like so:
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
print(autos['date_crawled'].str[:10].head(15)) #To select the first 10 characters (the date) in each column
0 2016-03-26 1 2016-04-04 2 2016-03-26 3 2016-03-12 4 2016-04-01 5 2016-03-21 6 2016-03-20 7 2016-03-16 8 2016-03-22 9 2016-03-16 10 2016-03-15 11 2016-03-16 12 2016-03-31 13 2016-03-23 14 2016-03-23 Name: date_crawled, dtype: object
#date_drawled
print(autos['date_crawled'].value_counts(normalize=True, dropna=False).sort_index().head()) #Using head here to avoid printing walls of data
print(autos['date_crawled'].value_counts(normalize=True, dropna=False).sort_index(ascending=False).head())
2016-03-05 14:06:30 0.000021 2016-03-05 14:06:40 0.000021 2016-03-05 14:07:08 0.000021 2016-03-05 14:07:21 0.000021 2016-03-05 14:07:26 0.000021 Name: date_crawled, dtype: float64 2016-04-07 14:36:56 0.000021 2016-04-07 14:36:55 0.000021 2016-04-07 14:36:44 0.000021 2016-04-07 14:30:26 0.000021 2016-04-07 14:30:09 0.000021 Name: date_crawled, dtype: float64
Observation:
The date_crawled
section is set between the dates March 05 2016 and April 07 2016.
#ad_created
print(autos['ad_created'].value_counts(normalize=True, dropna=False).sort_index().head()) #Using head here to avoid printing walls of data
print(autos['ad_created'].value_counts(normalize=True, dropna=False).sort_index(ascending=False).head())
2015-06-11 00:00:00 0.000021 2015-08-10 00:00:00 0.000021 2015-09-09 00:00:00 0.000021 2015-11-10 00:00:00 0.000021 2015-12-05 00:00:00 0.000021 Name: ad_created, dtype: float64 2016-04-07 00:00:00 0.001206 2016-04-06 00:00:00 0.003259 2016-04-05 00:00:00 0.011852 2016-04-04 00:00:00 0.036953 2016-04-03 00:00:00 0.039048 Name: ad_created, dtype: float64
Observation:
ad_created
section is set between the dates June 11 2015 and April 03 2016. Which is essentially a bigger bracket than date_crawled
#last_seen. This parameter is when the ad was last seen, it could be presumed that it is the date when the car was sold.
print(autos['last_seen'].value_counts(normalize=True, dropna=False).sort_index().head()) #Using head here to avoid printing walls of data
print(autos['last_seen'].value_counts(normalize=True, dropna=False).sort_index(ascending=False).head())
2016-03-05 14:45:46 0.000021 2016-03-05 14:46:02 0.000021 2016-03-05 14:49:34 0.000021 2016-03-05 15:16:11 0.000021 2016-03-05 15:16:47 0.000021 Name: last_seen, dtype: float64 2016-04-07 14:58:50 0.000063 2016-04-07 14:58:48 0.000063 2016-04-07 14:58:46 0.000021 2016-04-07 14:58:45 0.000021 2016-04-07 14:58:44 0.000063 Name: last_seen, dtype: float64
Obersavation:
A similar date time frame as date_crawled
. This could be the presumed date where the car was sold or the ad was taken down altogether.
Let us look at the registration_year
series and analyze where we could make some changes, or if any changes are necessary at all.
autos['registration_year'].describe()
count 47249.000000 mean 2004.844547 std 88.789003 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
Here ofcourse, we need to exclude the year 1000 and 9999.
Moreover, any year prior to 1950 seems unlikely and later than 2017 is impossible.
So let us clean the data accordingly using the between
function like we used in price
autos = autos[autos['registration_year'].between(1950,2017)]
autos['registration_year'].value_counts().sort_values().tail()
2003 2691 2004 2694 1999 2814 2005 2898 2000 2974 Name: registration_year, dtype: int64
With this we can see that most of the cars being sold on eBay at the moment have been registered between 1999-2005 (around 29%)
Around the time, the most used and sold car in Germany (upto this date too in a way) is the VW Golf. If we have to start with a hypothesis it would be that VW Golf should be among the top listed cars considering the demographics and the years of registration. (could also be VW beetle because it picked up around 1999-2003 but highly unlikely because it did go out of style unlike VW Golf)
Brandwise Distribution of Cars
It woudl defintiely be of value for us to see which brands and which models have been listed the most on eBay.
It could help us identify the brand choices of people and how it would have evolved over years.
Here, we will use aggregation to understand the brand
column.
a = autos['brand'].value_counts(normalize=True).sort_values(ascending=False)[0:20] #We only need the top 20 manufacturers for now
autos['brand'].value_counts(normalize=True).sort_values(ascending=False)[0:20]
volkswagen 0.212938 bmw 0.110619 opel 0.105827 mercedes_benz 0.097720 audi 0.087708 ford 0.067899 renault 0.046699 peugeot 0.029692 fiat 0.025200 seat 0.018590 skoda 0.016386 nissan 0.015274 mazda 0.015253 smart 0.014632 citroen 0.014226 toyota 0.013006 hyundai 0.010097 volvo 0.009220 sonstige_autos 0.009156 mini 0.008878 Name: brand, dtype: float64
As assumed, VW is the most sold brand not only for new cars in Germany but also for used cars on eBay. (Only logical as usual inflow and outflow with market circulation is statistically and directly proportional) It is followed by the other German bigwigs like bmw, opel, mercedies and the VW descendent audi.
Let us see which of the brands have the highest average price and average odometer count based on this existing data that we have gathered.
autos_mean = {}
autos_drive = {}
for c in a.index:
selected_rows = autos[autos['brand'] == c]
autos_mean[c] = selected_rows['price'].mean()
autos_drive[c] = selected_rows['odometer_km'].mean()
print(autos_mean)
{'mini': 10616.96626506024, 'sonstige_autos': 11026.56308411215, 'seat': 4481.210586881473, 'hyundai': 5447.218220338983, 'toyota': 5160.1578947368425, 'audi': 9283.809024390244, 'mercedes_benz': 8496.2896234676, 'ford': 3882.059861373661, 'skoda': 6430.8616187989555, 'nissan': 4809.389355742297, 'bmw': 8241.758267259718, 'volvo': 4939.691415313225, 'fiat': 2934.799660441426, 'smart': 3550.3055555555557, 'renault': 2557.268437929455, 'volkswagen': 5486.6349206349205, 'peugeot': 3143.4531700288185, 'citroen': 3823.912781954887, 'opel': 3101.2429755407315, 'mazda': 4176.294530154278} {'mini': 89469.8795180723, 'sonstige_autos': 92406.54205607477, 'seat': 122094.36133486766, 'hyundai': 106822.03389830509, 'toyota': 116496.71052631579, 'audi': 129557.31707317074, 'mercedes_benz': 131417.469352014, 'ford': 124163.51606805294, 'skoda': 111174.93472584856, 'nissan': 118375.35014005602, 'bmw': 132925.9330883775, 'volvo': 138735.4988399072, 'fiat': 117016.12903225806, 'smart': 100562.86549707603, 'renault': 127830.96655978012, 'volkswagen': 128906.46976090015, 'peugeot': 126884.00576368877, 'citroen': 119992.48120300751, 'opel': 129167.17202344855, 'mazda': 124340.81346423563}
Observation There is a gap between the top 20 prices
volksvagen has mid-range prices which makes it the most popular brand
sonstige_autos, mini and audi have the highest prices
opel and renault have the lowest prices
#Extracting Observations based on odometer reading
print(autos_drive)
{'mini': 89469.8795180723, 'sonstige_autos': 92406.54205607477, 'seat': 122094.36133486766, 'hyundai': 106822.03389830509, 'toyota': 116496.71052631579, 'audi': 129557.31707317074, 'mercedes_benz': 131417.469352014, 'ford': 124163.51606805294, 'skoda': 111174.93472584856, 'nissan': 118375.35014005602, 'bmw': 132925.9330883775, 'volvo': 138735.4988399072, 'fiat': 117016.12903225806, 'smart': 100562.86549707603, 'renault': 127830.96655978012, 'volkswagen': 128906.46976090015, 'peugeot': 126884.00576368877, 'citroen': 119992.48120300751, 'opel': 129167.17202344855, 'mazda': 124340.81346423563}
Here is where things get interesting:
mini
and sonstige
(Miscellaneous) have the least mileage readings on the odometer.volvo
is the brand with the most mileage on the list with bmw
, audi
and opel
closely behind.# create a dataframe of the mean price and mean mileage to make it easy to compare
mean_price = pd.Series(autos_mean)
mean_mileage = pd.Series(autos_drive)
# create a dataframe of both series
brand_df = pd.DataFrame({'mean_price': mean_price, 'mean_mileage': mean_mileage})
brand_df
mean_mileage | mean_price | |
---|---|---|
audi | 129557.317073 | 9283.809024 |
bmw | 132925.933088 | 8241.758267 |
citroen | 119992.481203 | 3823.912782 |
fiat | 117016.129032 | 2934.799660 |
ford | 124163.516068 | 3882.059861 |
hyundai | 106822.033898 | 5447.218220 |
mazda | 124340.813464 | 4176.294530 |
mercedes_benz | 131417.469352 | 8496.289623 |
mini | 89469.879518 | 10616.966265 |
nissan | 118375.350140 | 4809.389356 |
opel | 129167.172023 | 3101.242976 |
peugeot | 126884.005764 | 3143.453170 |
renault | 127830.966560 | 2557.268438 |
seat | 122094.361335 | 4481.210587 |
skoda | 111174.934726 | 6430.861619 |
smart | 100562.865497 | 3550.305556 |
sonstige_autos | 92406.542056 | 11026.563084 |
toyota | 116496.710526 | 5160.157895 |
volkswagen | 128906.469761 | 5486.634921 |
volvo | 138735.498840 | 4939.691415 |
Mini and sonstige
(other) have the least odometer readings and the highest prices. Which could help us levitate towards the possibility of higher prices for used cars which have a lesser odometer reading (Which would imply that they have travalled lesser and have relagively healthier engine)
However, from a very high level, the point which more clearly justifies the price is actually (and more importantly, obviously) the brand.
Opel, Renault and Fiat are the reasonable cars, while BMW, Audi and Mercedes are the costlier ones.
VW on the other hand is placed very well in the middle strata of the market and has the highest sales.
The most commonly sold model
We proposed an assumption towards the beginnning that as golf should be one of the best sellers on eBay too as it is one of the best sellers in Germany overall.
Also, let us take a look into the more interesting brands and what models are most popular for brands like VW
, Audi
, BMW
, Opel
and Mercedes Benz
To do this we will look into the column labelled model
Let's start off with the models which are popular overall
autos['model'].value_counts().sort_values(ascending=False).head(6)
golf 3764 andere 3330 3er 2636 polo 1555 corsa 1522 passat 1367 Name: model, dtype: int64
Prima! Wir haben recht! We were right!
Golf is the most sold model overall!
If you have not heard of andere
, there would be no reason to beat yourself up as it just means 'other' in German.
So the next identifiable model would be 3er by BMW!
This is followed by Polo
by VW
, Corsa
by Opel
and Passat
by VW
again!
This essentially means that VW leads by a mile with BMW and Opel in close seconds.
Now what about the most preferred models brandwise?
brands = ['volkswagen','bmw','audi','mercedes_benz','opel']
for a in brands:
models_car = autos[autos['brand']==a]
b = models_car['model'].value_counts().sort_values(ascending=False).head(1)
print('{0} brand : {1}'.format(a,b))
print('\n')
volkswagen brand : golf 3764 Name: model, dtype: int64 bmw brand : 3er 2636 Name: model, dtype: int64 audi brand : a4 1241 Name: model, dtype: int64 mercedes_benz brand : c_klasse 1153 Name: model, dtype: int64 opel brand : corsa 1522 Name: model, dtype: int64
There we have it!
The Most popular models by brand!
Effect of Damage/Repairs on Price
We have a section which can help us identify if any damage as happened to the car with 'nein' for a no and 'ja' for a yes.
Let us analyze the series a bit more.
autos['unrepaired_damage'].describe()
count 38503 unique 2 top nein freq 34215 Name: unrepaired_damage, dtype: object
There are two main values ja and nein. Now let us tear these two apart, calculate the mean of their prices and compare them to see if damages affect the price of the cars in any way.
ohne_repair_mean = autos.loc[autos['unrepaired_damage']=="nein",'price'].mean()
mit_repair_mean = autos.loc[autos['unrepaired_damage']=="ja",'price'].mean()
print(ohne_repair_mean)
print(mit_repair_mean)
6978.685956451849 2384.205223880597
Quite an evident differnce between the price of cars without repair and the ones with repair.
The ones without repair have a clearly higher price (3 times!)
While we havent looked way to deep into the data, it does validate our thoughts and general knowledge of mint condition
items being more expensive than refurbished
items!
The inferences from this project are unique as well as sometimes directly aligning to everything with what we ideally consider to be common knowledge.
Unique observations which one might not ideally know:
Some observations which seem obvious but we did confirm them using this project:
sonstige
)