eBay Auto Dataset Project

In this project I will work with 50,000 row sample from an dataset of German eBay car sales. The base dataset had been cleaned for analysis. This sample has been "dirtied" again so that it resembles what one might find in a scraped dataset.

Initial Data Dictionary

  • dateCrawled - When this ad was first crawled. All field-values are taken from this date.
  • name - Name of the car.
  • seller - Whether the seller is private or a dealer.
  • offerType - The type of listing
  • price - The price on the ad to sell the car.
  • abtest - Whether the listing is included in an A/B test.
  • vehicleType - The vehicle Type.
  • yearOfRegistration - The year in which the car was first registered.
  • gearbox - The transmission type.
  • powerPS - The power of the car in PS.
  • model - The car model name.
  • odometer - How many kilometers the car has driven.
  • monthOfRegistration - The month in which the car was first registered.
  • fuelType - What type of fuel the car uses.
  • brand - The brand of the car.
  • notRepairedDamage - If the car has a damage which is not yet repaired.
  • dateCreated - The date on which the eBay listing was created.
  • nrOfPictures - The number of pictures in the ad.
  • postalCode - The postal code for the location of the vehicle.
  • lastSeenOnline - When the crawler saw this ad last online.
In [1]:
import pandas as pd
import numpy as np
import pprint as pp
In [2]:
autos = pd.read_csv("autos.csv", encoding="Latin-1" )
In [3]:
autos
Out[3]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
5 2016-03-21 13:47:45 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... privat Angebot $7,900 test bus 2006 automatik 150 voyager 150,000km 4 diesel chrysler NaN 2016-03-21 00:00:00 0 22962 2016-04-06 09:45:21
6 2016-03-20 17:55:21 VW_Golf_III_GT_Special_Electronic_Green_Metall... privat Angebot $300 test limousine 1995 manuell 90 golf 150,000km 8 benzin volkswagen NaN 2016-03-20 00:00:00 0 31535 2016-03-23 02:48:59
7 2016-03-16 18:55:19 Golf_IV_1.9_TDI_90PS privat Angebot $1,990 control limousine 1998 manuell 90 golf 150,000km 12 diesel volkswagen nein 2016-03-16 00:00:00 0 53474 2016-04-07 03:17:32
8 2016-03-22 16:51:34 Seat_Arosa privat Angebot $250 test NaN 2000 manuell 0 arosa 150,000km 10 NaN seat nein 2016-03-22 00:00:00 0 7426 2016-03-26 18:18:10
9 2016-03-16 13:47:02 Renault_Megane_Scenic_1.6e_RT_Klimaanlage privat Angebot $590 control bus 1997 manuell 90 megane 150,000km 7 benzin renault nein 2016-03-16 00:00:00 0 15749 2016-04-06 10:46:35
10 2016-03-15 01:41:36 VW_Golf_Tuning_in_siber/grau privat Angebot $999 test NaN 2017 manuell 90 NaN 150,000km 4 benzin volkswagen nein 2016-03-14 00:00:00 0 86157 2016-04-07 03:16:21
11 2016-03-16 18:45:34 Mercedes_A140_Motorschaden privat Angebot $350 control NaN 2000 NaN 0 NaN 150,000km 0 benzin mercedes_benz NaN 2016-03-16 00:00:00 0 17498 2016-03-16 18:45:34
12 2016-03-31 19:48:22 Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... privat Angebot $5,299 control kleinwagen 2010 automatik 71 fortwo 50,000km 9 benzin smart nein 2016-03-31 00:00:00 0 34590 2016-04-06 14:17:52
13 2016-03-23 10:48:32 Audi_A3_1.6_tuning privat Angebot $1,350 control limousine 1999 manuell 101 a3 150,000km 11 benzin audi nein 2016-03-23 00:00:00 0 12043 2016-04-01 14:17:13
14 2016-03-23 11:50:46 Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... privat Angebot $3,999 test kleinwagen 2007 manuell 75 clio 150,000km 9 benzin renault NaN 2016-03-23 00:00:00 0 81737 2016-04-01 15:46:47
15 2016-04-01 12:06:20 Corvette_C3_Coupe_T_Top_Crossfire_Injection privat Angebot $18,900 test coupe 1982 automatik 203 NaN 80,000km 6 benzin sonstige_autos nein 2016-04-01 00:00:00 0 61276 2016-04-02 21:10:48
16 2016-03-16 14:59:02 Opel_Vectra_B_Kombi privat Angebot $350 test kombi 1999 manuell 101 vectra 150,000km 5 benzin opel nein 2016-03-16 00:00:00 0 57299 2016-03-18 05:29:37
17 2016-03-29 11:46:22 Volkswagen_Scirocco_2_G60 privat Angebot $5,500 test coupe 1990 manuell 205 scirocco 150,000km 6 benzin volkswagen nein 2016-03-29 00:00:00 0 74821 2016-04-05 20:46:26
18 2016-03-26 19:57:44 Verkaufen_mein_bmw_e36_320_i_touring privat Angebot $300 control bus 1995 manuell 150 3er 150,000km 0 benzin bmw NaN 2016-03-26 00:00:00 0 54329 2016-04-02 12:16:41
19 2016-03-17 13:36:21 mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 privat Angebot $4,150 control suv 2004 manuell 124 andere 150,000km 2 lpg mazda nein 2016-03-17 00:00:00 0 40878 2016-03-17 14:45:58
20 2016-03-05 19:57:31 Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... privat Angebot $3,500 test kombi 2003 manuell 131 a4 150,000km 5 diesel audi NaN 2016-03-05 00:00:00 0 53913 2016-03-07 05:46:46
21 2016-03-06 19:07:10 Porsche_911_Carrera_4S_Cabrio privat Angebot $41,500 test cabrio 2004 manuell 320 911 150,000km 4 benzin porsche nein 2016-03-06 00:00:00 0 65428 2016-04-05 23:46:19
22 2016-03-28 20:50:54 MINI_Cooper_S_Cabrio privat Angebot $25,450 control cabrio 2015 manuell 184 cooper 10,000km 1 benzin mini nein 2016-03-28 00:00:00 0 44789 2016-04-01 06:45:30
23 2016-03-10 19:55:34 Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima privat Angebot $7,999 control bus 2010 manuell 120 NaN 150,000km 2 diesel peugeot nein 2016-03-10 00:00:00 0 30900 2016-03-17 08:45:17
24 2016-04-03 11:57:02 BMW_535i_xDrive_Sport_Aut. privat Angebot $48,500 control limousine 2014 automatik 306 5er 30,000km 12 benzin bmw nein 2016-04-03 00:00:00 0 22547 2016-04-07 13:16:50
25 2016-03-21 21:56:18 Ford_escort_kombi_an_bastler_mit_ghia_ausstattung privat Angebot $90 control kombi 1996 manuell 116 NaN 150,000km 4 benzin ford ja 2016-03-21 00:00:00 0 27574 2016-04-01 05:16:49
26 2016-04-03 22:46:28 Volkswagen_Polo_Fox privat Angebot $777 control kleinwagen 1992 manuell 54 polo 125,000km 2 benzin volkswagen nein 2016-04-03 00:00:00 0 38110 2016-04-05 23:46:48
27 2016-03-27 18:45:01 Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE privat Angebot $0 control NaN 2005 NaN 0 NaN 150,000km 0 NaN ford NaN 2016-03-27 00:00:00 0 66701 2016-03-27 18:45:01
28 2016-03-19 21:56:19 MINI_Cooper_D privat Angebot $5,250 control kleinwagen 2007 manuell 110 cooper 150,000km 7 diesel mini ja 2016-03-19 00:00:00 0 15745 2016-04-07 14:58:48
29 2016-04-02 12:45:44 Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... privat Angebot $4,999 test kombi 2004 automatik 204 e_klasse 150,000km 10 diesel mercedes_benz nein 2016-04-02 00:00:00 0 47638 2016-04-02 12:45:44
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49970 2016-03-21 22:47:37 c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... privat Angebot $15,800 control bus 2010 automatik 136 c4 60,000km 4 diesel citroen nein 2016-03-21 00:00:00 0 14947 2016-04-07 04:17:34
49971 2016-03-29 14:54:12 W.Lupo_1.0 privat Angebot $950 test kleinwagen 2001 manuell 50 lupo 150,000km 4 benzin volkswagen nein 2016-03-29 00:00:00 0 65197 2016-03-29 20:41:51
49972 2016-03-26 22:25:23 Mercedes_Benz_Vito_115_CDI_Extralang_Aut. privat Angebot $3,300 control bus 2004 automatik 150 vito 150,000km 10 diesel mercedes_benz ja 2016-03-26 00:00:00 0 65326 2016-03-28 11:28:18
49973 2016-03-27 05:32:39 Mercedes_Benz_SLK_200_Kompressor privat Angebot $6,000 control cabrio 2004 manuell 163 slk 150,000km 11 benzin mercedes_benz nein 2016-03-27 00:00:00 0 53567 2016-03-27 08:25:24
49974 2016-03-20 10:52:31 Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... privat Angebot $0 control cabrio 1983 manuell 70 golf 150,000km 2 benzin volkswagen nein 2016-03-20 00:00:00 0 8209 2016-03-27 19:48:16
49975 2016-03-27 20:51:39 Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort privat Angebot $9,700 control kleinwagen 2012 automatik 88 jazz 100,000km 11 hybrid honda nein 2016-03-27 00:00:00 0 84385 2016-04-05 19:45:34
49976 2016-03-19 18:56:05 Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... privat Angebot $5,900 test kombi 1992 automatik 150 80 150,000km 12 benzin audi nein 2016-03-19 00:00:00 0 36100 2016-04-07 06:16:44
49977 2016-03-31 18:37:18 Mercedes_Benz_C200_Cdi_W203 privat Angebot $5,500 control limousine 2003 manuell 116 c_klasse 150,000km 2 diesel mercedes_benz nein 2016-03-31 00:00:00 0 33739 2016-04-06 12:16:11
49978 2016-04-04 10:37:14 Mercedes_Benz_E_200_Classic privat Angebot $900 control limousine 1996 automatik 136 e_klasse 150,000km 9 benzin mercedes_benz ja 2016-04-04 00:00:00 0 24405 2016-04-06 12:44:20
49979 2016-03-20 18:38:40 Volkswagen_Polo_1.6_TDI_Style privat Angebot $11,000 test kleinwagen 2011 manuell 90 polo 70,000km 11 diesel volkswagen nein 2016-03-20 00:00:00 0 48455 2016-04-07 01:45:12
49980 2016-03-12 10:55:54 Ford_Escort_Turnier_16V privat Angebot $400 control kombi 1995 manuell 105 escort 125,000km 3 benzin ford NaN 2016-03-12 00:00:00 0 56218 2016-04-06 17:16:49
49981 2016-03-15 09:38:21 Opel_Astra_Kombi_mit_Anhaengerkupplung privat Angebot $2,000 control kombi 1998 manuell 115 astra 150,000km 12 benzin opel nein 2016-03-15 00:00:00 0 86859 2016-04-05 17:21:46
49982 2016-03-29 18:51:08 Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm privat Angebot $1,950 control kleinwagen 2004 manuell 0 fabia 90,000km 7 benzin skoda NaN 2016-03-29 00:00:00 0 45884 2016-03-29 18:51:08
49983 2016-03-06 12:43:04 Ford_focus_99 privat Angebot $600 test kleinwagen 1999 manuell 101 focus 150,000km 4 benzin ford NaN 2016-03-06 00:00:00 0 52477 2016-03-09 06:16:08
49984 2016-03-31 22:48:48 Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... privat Angebot $0 test NaN 2000 NaN 0 NaN 150,000km 0 NaN sonstige_autos NaN 2016-03-31 00:00:00 0 12103 2016-04-02 19:44:53
49985 2016-04-02 16:38:23 Verkaufe_meinen_vw_vento! privat Angebot $1,000 control NaN 1995 automatik 0 NaN 150,000km 0 benzin volkswagen NaN 2016-04-02 00:00:00 0 30900 2016-04-06 15:17:52
49986 2016-04-04 20:46:02 Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... privat Angebot $15,900 control limousine 2010 automatik 218 300c 125,000km 11 diesel chrysler nein 2016-04-04 00:00:00 0 73527 2016-04-06 23:16:00
49987 2016-03-22 20:47:27 Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... privat Angebot $21,990 control limousine 2013 manuell 150 a3 50,000km 11 diesel audi nein 2016-03-22 00:00:00 0 94362 2016-03-26 22:46:06
49988 2016-03-28 19:49:51 BMW_330_Ci privat Angebot $9,550 control coupe 2001 manuell 231 3er 150,000km 10 benzin bmw nein 2016-03-28 00:00:00 0 83646 2016-04-07 02:17:40
49989 2016-03-11 19:50:37 VW_Polo_zum_Ausschlachten_oder_Wiederaufbau privat Angebot $150 test kleinwagen 1997 manuell 0 polo 150,000km 5 benzin volkswagen ja 2016-03-11 00:00:00 0 21244 2016-03-12 10:17:55
49990 2016-03-21 19:54:19 Mercedes_Benz_A_200__BlueEFFICIENCY__Urban privat Angebot $17,500 test limousine 2012 manuell 156 a_klasse 30,000km 12 benzin mercedes_benz nein 2016-03-21 00:00:00 0 58239 2016-04-06 22:46:57
49991 2016-03-06 15:25:19 Kleinwagen privat Angebot $500 control NaN 2016 manuell 0 twingo 150,000km 0 benzin renault NaN 2016-03-06 00:00:00 0 61350 2016-03-06 18:24:19
49992 2016-03-10 19:37:38 Fiat_Grande_Punto_1.4_T_Jet_16V_Sport privat Angebot $4,800 control kleinwagen 2009 manuell 120 andere 125,000km 9 lpg fiat nein 2016-03-10 00:00:00 0 68642 2016-03-13 01:44:51
49993 2016-03-15 18:47:35 Audi_A3__1_8l__Silber;_schoenes_Fahrzeug privat Angebot $1,650 control kleinwagen 1997 manuell 0 NaN 150,000km 7 benzin audi NaN 2016-03-15 00:00:00 0 65203 2016-04-06 19:46:53
49994 2016-03-22 17:36:42 Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... privat Angebot $5,000 control kombi 2001 automatik 299 a6 150,000km 1 benzin audi nein 2016-03-22 00:00:00 0 46537 2016-04-06 08:16:39
49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon privat Angebot $24,900 control limousine 2011 automatik 239 q5 100,000km 1 diesel audi nein 2016-03-27 00:00:00 0 82131 2016-04-01 13:47:40
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot $1,980 control cabrio 1996 manuell 75 astra 150,000km 5 benzin opel nein 2016-03-28 00:00:00 0 44807 2016-04-02 14:18:02
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot $13,200 test cabrio 2014 automatik 69 500 5,000km 11 benzin fiat nein 2016-04-02 00:00:00 0 73430 2016-04-04 11:47:27
49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition privat Angebot $22,900 control kombi 2013 manuell 150 a3 40,000km 11 diesel audi nein 2016-03-08 00:00:00 0 35683 2016-04-05 16:45:07
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V privat Angebot $1,250 control limousine 1996 manuell 101 vectra 150,000km 1 benzin opel nein 2016-03-13 00:00:00 0 45897 2016-04-06 21:18:48

50000 rows × 20 columns

In [4]:
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

Observations

See above for details.

  • Column names are in English.
  • Data is in German
  • Number formats are US-style '.' for decimal, ',' for thousands.
    • price format is $X,XXX (not Euros)
    • odometer format is XXX,XXXXkm
    • these were not recognized as numbers and columns have "object type.
  • Timestamps in dateCrawled and adCreated and lastSeen weren't recognized as such.
  • Some columns contain null values
    • vehicleType, gearbox, model, fuelType, notRepairedDamage
  • notRepairedDamage column has "NaN", "nein" & 'ja'

2. Cleaning Column Names

In [5]:
autos.columns.tolist()
Out[5]:
['dateCrawled',
 'name',
 'seller',
 'offerType',
 'price',
 'abtest',
 'vehicleType',
 'yearOfRegistration',
 'gearbox',
 'powerPS',
 'model',
 'odometer',
 'monthOfRegistration',
 'fuelType',
 'brand',
 'notRepairedDamage',
 'dateCreated',
 'nrOfPictures',
 'postalCode',
 'lastSeen']

Updating coulumn names from CamelCase to snakecase, and to be more descriptive

In [6]:
new_column_names = ['date_crawled',
                     'name',
                     'seller',
                     'offer_type',
                     'price',
                     'abtest',
                     'vehicle_type',
                     'registration_year',
                     'gearbox',
                     'power_ps',
                     'model',
                     'odometer',
                     'registration_month',
                     'fuel_type',
                     'brand',
                     'unrepaired_damage',
                     'ad_created',
                     'number_of_pictures',
                     'postal_code',
                     'last_seen']
autos.columns = new_column_names
In [7]:
autos.head()
Out[7]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created number_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

3. Initial Exploration and Cleaning.

In [8]:
autos.describe()
Out[8]:
registration_year power_ps registration_month number_of_pictures postal_code
count 50000.000000 50000.000000 50000.000000 50000.0 50000.000000
mean 2005.073280 116.355920 5.723360 0.0 50813.627300
std 105.712813 209.216627 3.711984 0.0 25779.747957
min 1000.000000 0.000000 0.000000 0.0 1067.000000
25% 1999.000000 70.000000 3.000000 0.0 30451.000000
50% 2003.000000 105.000000 6.000000 0.0 49577.000000
75% 2008.000000 150.000000 9.000000 0.0 71540.000000
max 9999.000000 17700.000000 12.000000 0.0 99998.000000
In [9]:
autos.describe(include=np.object)
Out[9]:
date_crawled name seller offer_type price abtest vehicle_type gearbox model odometer fuel_type brand unrepaired_damage ad_created last_seen
count 50000 50000 50000 50000 50000 50000 44905 47320 47242 50000 45518 50000 40171 50000 50000
unique 48213 38754 2 2 2357 2 8 2 245 13 7 40 2 76 39481
top 2016-03-22 09:51:06 Ford_Fiesta privat Angebot $0 test limousine manuell golf 150,000km benzin volkswagen nein 2016-04-03 00:00:00 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 36993 4024 32424 30107 10687 35232 1946 8
In [10]:
for c in ["seller", "offer_type", "abtest", "vehicle_type", "gearbox", "fuel_type", "odometer", "brand"]:
    print(c, ":\n", autos[c].value_counts(), "\n")
seller :
 privat        49999
gewerblich        1
Name: seller, dtype: int64 

offer_type :
 Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64 

abtest :
 test       25756
control    24244
Name: abtest, dtype: int64 

vehicle_type :
 limousine     12859
kleinwagen    10822
kombi          9127
bus            4093
cabrio         3061
coupe          2537
suv            1986
andere          420
Name: vehicle_type, dtype: int64 

gearbox :
 manuell      36993
automatik    10327
Name: gearbox, dtype: int64 

fuel_type :
 benzin     30107
diesel     14567
lpg          691
cng           75
hybrid        37
andere        22
elektro       19
Name: fuel_type, dtype: int64 

odometer :
 150,000km    32424
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer, dtype: int64 

brand :
 volkswagen        10687
opel               5461
bmw                5429
mercedes_benz      4734
audi               4283
ford               3479
renault            2404
peugeot            1456
fiat               1308
seat                941
skoda               786
mazda               757
nissan              754
citroen             701
smart               701
toyota              617
sonstige_autos      546
hyundai             488
volvo               457
mini                424
mitsubishi          406
honda               399
kia                 356
alfa_romeo          329
porsche             294
suzuki              293
chevrolet           283
chrysler            181
dacia               129
daihatsu            128
jeep                110
subaru              109
land_rover           99
saab                 80
daewoo               79
trabant              78
jaguar               77
rover                69
lancia               57
lada                 31
Name: brand, dtype: int64 

Notes

Drop:

  • seller and offer_type, because all but one row have same value.
  • number_of_photos, because the value is 0 for all rows.
  • abtest because we don't have enough information for it to be useful.

Fix:

  • Convert price and odometer to numeric values
  • Rename odometer to odometer_km
  • Maybe convert: date_crawled and last_seen to datetime
In [11]:
autos["price"] = autos["price"].str.replace("$","").str.replace(",","").astype(int)
In [12]:
autos["odometer"] = autos["odometer"].str.replace(",","").str.replace("km","").astype(int)
In [13]:
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
In [14]:
autos.columns
Out[14]:
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer_km', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'number_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
In [15]:
autos.drop(columns=["seller", "offer_type", "number_of_pictures"], inplace=True)
In [16]:
autos.columns
Out[16]:
Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type',
       'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km',
       'registration_month', 'fuel_type', 'brand', 'unrepaired_damage',
       'ad_created', 'postal_code', 'last_seen'],
      dtype='object')

4. Exploring the Odometer and Price Columns

In [17]:
autos["odometer_km"].value_counts(dropna=False)
Out[17]:
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64

The above values seem reasonable. Nothing seems out of place

In [18]:
autos["price"].describe()
Out[18]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
In [19]:
autos["price"].dtype
#Weird, considering what's reported by describe.
Out[19]:
dtype('int64')
In [20]:
autos["price"].isnull().sum()
Out[20]:
0
In [21]:
autos["price"].value_counts(dropna=False).sort_index()
Out[21]:
0           1421
1            156
2              3
3              1
5              2
8              1
9              1
10             7
11             2
12             3
13             2
14             1
15             2
17             3
18             1
20             4
25             5
29             1
30             7
35             1
40             6
45             4
47             1
49             4
50            49
55             2
59             1
60             9
65             5
66             1
            ... 
151990         1
155000         1
163500         1
163991         1
169000         1
169999         1
175000         1
180000         1
190000         1
194000         1
197000         1
198000         1
220000         1
250000         1
259000         1
265000         1
295000         1
299000         1
345000         1
350000         1
999990         1
999999         2
1234566        1
1300000        1
3890000        1
10000000       1
11111111       2
12345678       3
27322222       1
99999999       1
Name: price, Length: 2357, dtype: int64

Some of these seem a little sketchy.

How likely is it that a car sold for \$99.9M? $27M seems like a stretch, but its possible.

Similarly, how likely is it that 1421 sold for \$0, or 156 for \$1? Does eBay even accept \$0?

I am going to discard the 1421 rows with a price of $0.

In [22]:
autos = autos[autos["price"] > 0 ]
len(autos)
Out[22]:
48579

Now lets look at the rows with a price >= $900K since there are only ~10.

In [23]:
autos[autos["price"] >= 900000]
Out[23]:
date_crawled name price abtest vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
514 2016-03-17 09:53:08 Ford_Focus_Turnier_1.6_16V_Style 999999 test kombi 2009 manuell 101 focus 125000 4 benzin ford nein 2016-03-17 00:00:00 12205 2016-04-06 07:17:35
2897 2016-03-12 21:50:57 Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000 11111111 test limousine 1973 manuell 48 escort 50000 3 benzin ford nein 2016-03-12 00:00:00 94469 2016-03-12 22:45:27
7814 2016-04-04 11:53:31 Ferrari_F40 1300000 control coupe 1992 NaN 0 NaN 50000 12 NaN sonstige_autos nein 2016-04-04 00:00:00 60598 2016-04-05 11:34:11
11137 2016-03-29 23:52:57 suche_maserati_3200_gt_Zustand_unwichtig_laufe... 10000000 control coupe 1960 manuell 368 NaN 100000 1 benzin sonstige_autos nein 2016-03-29 00:00:00 73033 2016-04-06 21:18:11
22947 2016-03-22 12:54:19 Bmw_530d_zum_ausschlachten 1234566 control kombi 1999 automatik 190 NaN 150000 2 diesel bmw NaN 2016-03-22 00:00:00 17454 2016-04-02 03:17:32
24384 2016-03-21 13:57:51 Schlachte_Golf_3_gt_tdi 11111111 test NaN 1995 NaN 0 NaN 150000 0 NaN volkswagen NaN 2016-03-21 00:00:00 18519 2016-03-21 14:40:18
27371 2016-03-09 15:45:47 Fiat_Punto 12345678 control NaN 2017 NaN 95 punto 150000 0 NaN fiat NaN 2016-03-09 00:00:00 96110 2016-03-09 15:45:47
37585 2016-03-29 11:38:54 Volkswagen_Jetta_GT 999990 test limousine 1985 manuell 111 jetta 150000 12 benzin volkswagen ja 2016-03-29 00:00:00 50997 2016-03-29 11:38:54
39377 2016-03-08 23:53:51 Tausche_volvo_v40_gegen_van 12345678 control NaN 2018 manuell 95 v40 150000 6 NaN volvo nein 2016-03-08 00:00:00 14542 2016-04-06 23:17:31
39705 2016-03-22 14:58:27 Tausch_gegen_gleichwertiges 99999999 control limousine 1999 automatik 224 s_klasse 150000 9 benzin mercedes_benz NaN 2016-03-22 00:00:00 73525 2016-04-06 05:15:30
42221 2016-03-08 20:39:05 Leasinguebernahme 27322222 control limousine 2014 manuell 163 c4 40000 2 diesel citroen NaN 2016-03-08 00:00:00 76532 2016-03-08 20:39:05
43049 2016-03-21 19:53:52 2_VW_Busse_T3 999999 test bus 1981 manuell 70 transporter 150000 1 benzin volkswagen NaN 2016-03-21 00:00:00 99880 2016-03-28 17:18:28
47598 2016-03-31 18:56:54 Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar... 12345678 control limousine 2001 manuell 101 vectra 150000 3 benzin opel nein 2016-03-31 00:00:00 4356 2016-03-31 18:56:54
47634 2016-04-04 21:25:21 Ferrari_FXX 3890000 test coupe 2006 NaN 799 NaN 5000 7 NaN sonstige_autos nein 2016-04-04 00:00:00 60313 2016-04-05 12:07:37

The Ferrari prices are probably legitimate, as is the Maserati. The Fords, VW Golf & Bus and BMW stationwagon probably aren't legitimate. The rest I don't know enough about.

Since there are only a few Ferraris in this price range, and since later analysis focuses on popular brands (which Ferarri isn't), I'm just going to drop everything above $900K.

In [24]:
autos = autos[autos["price"] < 900000]
len(autos)
Out[24]:
48565

5. Exploring the date columns

date_crawled, ad_created and last_seen all have timestamps, but are stored as objects.

In [25]:
autos[["date_crawled", "ad_created", "last_seen"]][0:5]
Out[25]:
date_crawled ad_created last_seen
0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54
1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08
2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37
3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28
4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50
In [26]:
for c in ["date_crawled", "ad_created", "last_seen"]:
    print(c,":")
    print(autos[c].str[:10].value_counts(normalize=True, dropna=False))
    print()
date_crawled :
2016-04-03    0.038608
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-12    0.036920
2016-03-14    0.036549
2016-04-04    0.036487
2016-03-07    0.036014
2016-04-02    0.035478
2016-03-28    0.034860
2016-03-19    0.034778
2016-03-15    0.034284
2016-03-29    0.034099
2016-04-01    0.033687
2016-03-30    0.033687
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-22    0.032987
2016-03-11    0.032575
2016-03-23    0.032225
2016-03-26    0.032204
2016-03-10    0.032184
2016-03-31    0.031834
2016-03-17    0.031628
2016-03-25    0.031607
2016-03-27    0.031092
2016-03-16    0.029610
2016-03-24    0.029342
2016-03-05    0.025327
2016-03-13    0.015670
2016-03-06    0.014043
2016-04-05    0.013096
2016-03-18    0.012911
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64

ad_created :
2016-04-03    0.038855
2016-03-20    0.037949
2016-03-21    0.037579
2016-04-04    0.036858
2016-03-12    0.036755
2016-03-14    0.035190
2016-04-02    0.035149
2016-03-28    0.034984
2016-03-07    0.034737
2016-03-29    0.034037
2016-03-15    0.034016
2016-03-19    0.033687
2016-04-01    0.033687
2016-03-30    0.033501
2016-03-08    0.033316
2016-03-09    0.033151
2016-03-11    0.032904
2016-03-22    0.032801
2016-03-26    0.032266
2016-03-23    0.032060
2016-03-10    0.031895
2016-03-31    0.031875
2016-03-25    0.031751
2016-03-17    0.031278
2016-03-27    0.030989
2016-03-16    0.030125
2016-03-24    0.029280
2016-03-05    0.022897
2016-03-13    0.017008
2016-03-06    0.015320
                ...   
2016-01-10    0.000041
2016-02-24    0.000041
2016-02-20    0.000041
2016-02-05    0.000041
2016-02-18    0.000041
2016-02-14    0.000041
2016-02-26    0.000041
2016-02-02    0.000041
2016-02-12    0.000041
2016-01-16    0.000021
2015-11-10    0.000021
2016-02-09    0.000021
2016-02-08    0.000021
2016-02-07    0.000021
2016-01-29    0.000021
2016-01-13    0.000021
2016-02-16    0.000021
2016-01-07    0.000021
2015-06-11    0.000021
2016-02-17    0.000021
2016-02-01    0.000021
2015-12-30    0.000021
2016-02-11    0.000021
2016-01-22    0.000021
2015-09-09    0.000021
2015-12-05    0.000021
2016-02-22    0.000021
2016-01-14    0.000021
2015-08-10    0.000021
2016-01-03    0.000021
Name: ad_created, Length: 76, dtype: float64

last_seen :
2016-04-06    0.221806
2016-04-07    0.131947
2016-04-05    0.124761
2016-03-17    0.028086
2016-04-03    0.025203
2016-04-02    0.024915
2016-03-30    0.024771
2016-04-04    0.024483
2016-03-31    0.023783
2016-03-12    0.023783
2016-04-01    0.022794
2016-03-29    0.022341
2016-03-22    0.021373
2016-03-28    0.020859
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-23    0.018532
2016-03-26    0.016802
2016-03-16    0.016452
2016-03-15    0.015876
2016-03-19    0.015834
2016-03-27    0.015649
2016-03-14    0.012602
2016-03-11    0.012375
2016-03-10    0.010666
2016-03-09    0.009595
2016-03-13    0.008895
2016-03-08    0.007413
2016-03-18    0.007351
2016-03-07    0.005395
2016-03-06    0.004324
2016-03-05    0.001071
Name: last_seen, dtype: float64

In [27]:
for c in ["date_crawled", "ad_created", "last_seen"]:
    print(c,":")
    print(autos[c].str[:10].value_counts(normalize=True, dropna=False).sort_index())
    print()
date_crawled :
2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64

ad_created :
2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-22    0.000021
2016-01-27    0.000062
2016-01-29    0.000021
2016-02-01    0.000021
2016-02-02    0.000041
2016-02-05    0.000041
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000041
2016-02-14    0.000041
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000041
2016-02-19    0.000062
2016-02-20    0.000041
2016-02-21    0.000062
                ...   
2016-03-09    0.033151
2016-03-10    0.031895
2016-03-11    0.032904
2016-03-12    0.036755
2016-03-13    0.017008
2016-03-14    0.035190
2016-03-15    0.034016
2016-03-16    0.030125
2016-03-17    0.031278
2016-03-18    0.013590
2016-03-19    0.033687
2016-03-20    0.037949
2016-03-21    0.037579
2016-03-22    0.032801
2016-03-23    0.032060
2016-03-24    0.029280
2016-03-25    0.031751
2016-03-26    0.032266
2016-03-27    0.030989
2016-03-28    0.034984
2016-03-29    0.034037
2016-03-30    0.033501
2016-03-31    0.031875
2016-04-01    0.033687
2016-04-02    0.035149
2016-04-03    0.038855
2016-04-04    0.036858
2016-04-05    0.011819
2016-04-06    0.003253
2016-04-07    0.001256
Name: ad_created, Length: 76, dtype: float64

last_seen :
2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64

  • date_crawled spans a ~one month range of dates from 2016/3/5 to 2016/4/7. From 0.3-3.8%
  • ad_created ranges from 2015/06/11 to 2016/04/07.
  • last_seen ranges from 2016-03-05 to 2016-04-07
In [28]:
for c in ["date_crawled", "ad_created", "last_seen"]:
    print(c,":")
    print(pd.to_datetime(autos[c].str[:10]).describe())
    print()
date_crawled :
count                   48565
unique                     34
top       2016-04-03 00:00:00
freq                     1875
first     2016-03-05 00:00:00
last      2016-04-07 00:00:00
Name: date_crawled, dtype: object

ad_created :
count                   48565
unique                     76
top       2016-04-03 00:00:00
freq                     1887
first     2015-06-11 00:00:00
last      2016-04-07 00:00:00
Name: ad_created, dtype: object

last_seen :
count                   48565
unique                     34
top       2016-04-06 00:00:00
freq                    10772
first     2016-03-05 00:00:00
last      2016-04-07 00:00:00
Name: last_seen, dtype: object

It is strange that out of 50000 listings, created over a span of almost a year, their are only 76 unique creation dates.

The range in the number of pages crawled on a given day is interesting. It varies over a pretty wide range. Thinking it through, it's probably a product of both the rate of the spider and the rate of ad creation over time. If the spider can't crawl all the ads in a given run, it may not detect a new ad for some period of time. Ads may be missed if they end before detection.

6. Dealing with Incorrect Registration Year Data

Cleaning Registration Year

In [29]:
autos["registration_year"].describe()
Out[29]:
count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
In [30]:
autos["registration_year"].value_counts().sort_index()
Out[30]:
1000       1
1001       1
1111       1
1800       2
1910       5
1927       1
1929       1
1931       1
1934       2
1937       4
1938       1
1939       1
1941       2
1943       1
1948       1
1950       3
1951       2
1952       1
1953       1
1954       2
1955       2
1956       4
1957       2
1958       4
1959       6
1960      23
1961       6
1962       4
1963       8
1964      12
        ... 
2000    3156
2001    2636
2002    2486
2003    2699
2004    2703
2005    2936
2006    2670
2007    2277
2008    2215
2009    2085
2010    1589
2011    1623
2012    1310
2013     803
2014     663
2015     392
2016    1220
2017    1392
2018     470
2019       2
2800       1
4100       1
4500       1
4800       1
5000       4
5911       1
6200       1
8888       1
9000       1
9999       3
Name: registration_year, Length: 95, dtype: int64
In [31]:
autos.loc[autos["registration_year"].between(1900,2016),"registration_year"].describe()
Out[31]:
count    46681.000000
mean      2002.910756
std          7.185103
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64

There are a number of entries with bogus registration years. I've eliminated anything more recent than 2016, when the crawls were performed. More precise would be to compare ad creation date to year and month of registration, but this is good enough to get a sense of the data.

I've also elimated anything with a registration date earlier than 1900. This is, on the one hand a bit anal, since it eliminates less than a dozen listings, while also possibly a bit lax, since I don't know when Germany first started registering cars.

The average car sold was ~15y old, and 75% of cars were at least ~8y old.

In [32]:
autos.loc[autos["registration_year"] < 1935,
          ["name", "model", "brand", "registration_year", "registration_month"]
         ]
Out[32]:
name model brand registration_year registration_month
2221 Sehr_seltener_Oldtimer_Opel_1210_zum_Restaurieren andere opel 1934 0
2573 Hanomag_rekord_15k_Suche_ersatz_teile NaN sonstige_autos 1934 1
3679 Suche_Auto NaN sonstige_autos 1910 0
10556 UNFAL_Auto NaN mitsubishi 1800 2
11246 Ford_Model_A_Roadster_Deluxe_1931 andere ford 1931 7
21416 Essex_super_six__Ford_A andere ford 1927 5
22101 BMW_Andere andere bmw 1929 1
22316 VW_Kaefer.__Zwei_zum_Preis_von_einem. kaefer volkswagen 1000 0
22659 Opel_Corsa_B corsa opel 1910 0
24511 Trabant__wartburg__Ostalgie NaN trabant 1111 0
28693 Renault_Twingo NaN renault 1910 0
30781 Opel_Calibra_V6_DTM_Bausatz_1:24 calibra opel 1910 0
32585 UNFAL_Auto NaN mitsubishi 1800 2
45157 Motorhaube NaN trabant 1910 0
49283 Citroen_HY andere citroen 1001 0

To my eye, the earliest genuine registration is a 1927 Ford Model A. The next is a 1931 Ford Model A. In between the two there is a BMW Andere (Other) registered 1929/1. There are some other early 30s listings that appear to be old cars. I'm going to pick 1927 as the earliest legitimate registration year.

In [33]:
autos.loc[autos["registration_year"] > 2016,
          ["name", "model", "brand", "registration_year", "registration_month"]
         ]
Out[33]:
name model brand registration_year registration_month
10 VW_Golf_Tuning_in_siber/grau NaN volkswagen 2017 4
55 Mercedes_E320_AMG_zu_Tauschen! e_klasse mercedes_benz 2017 7
65 Ford_Fiesta_zum_ausschlachten fiesta ford 2017 9
68 Mini_cooper_s_clubman_/vollausstattung_/_Navi/... clubman mini 2017 0
84 Renault_twingo twingo renault 2018 0
113 Golf_4_Anfaenger_auto golf volkswagen 2017 7
164 Opel_Meriva__nur_76000_Km__unfallfrei__scheckh... meriva opel 2018 4
197 VW_Polo_9N_an_Bastler polo volkswagen 2017 7
253 Ford_mondeo_Gas_anlage_mit_TÜV_04.2017 mondeo ford 2017 8
348 VW_Beetle_1.8Turbo_mit_Vollausstattung_und_seh... beetle volkswagen 2017 7
390 Fiat_Bertone_X_1_9__X_1/9__X19__X_19__X1_9__X_19 andere fiat 2018 6
438 VW_Golf_5_V_1.6_2004_Klima_Tempomat golf volkswagen 2017 0
453 Armee_Jeep andere jeep 4500 0
454 Ford_Fiesta_Servo_D3_Kat_86€_Steuern_65.000_tkm fiesta ford 2017 0
457 Mercedes_Benz_W203_C200_Kombi_Kompressor_c_kla... c_klasse mercedes_benz 2017 8
477 Citroën_C4 NaN citroen 2017 2
545 Bmw_x5_/__E53 x_reihe bmw 2017 7
548 Peugeot_206_1.9_Hdi_4_Tueren_Tuev_02.17 2_reihe peugeot 2017 6
557 Smart_Fortwo_Passion_MHD fortwo smart 2017 4
568 PRIMA_ANFÄNGERAUTO_smart_fortwo_cabrio fortwo smart 2017 12
577 Volkswagen_Golf_3 golf volkswagen 2017 1
582 V10_TDI_TOUAREG touareg volkswagen 2017 7
654 Ich_verkaufe_meinen_gepflegten_daewoo kalos daewoo 2017 5
740 Seat_Leon_zu_verkaufen! leon seat 2017 2
754 Ford_Ka_Kleinwagen_BJ_2008 ka ford 2017 3
758 Honda_Civic_mit_TÜV_!!! civic honda 2017 12
765 BMW_3er_Limousine_mit_1_Jahr_TÜV!!!! 3er bmw 2017 3
802 Lada_mit_wenig_km_neuem_Tuev_bj_08 kalina lada 2018 0
871 Citroen_C2_60_PS c2 citroen 2017 3
889 Verkaufe_Opel_Corsa_C_1_2 corsa opel 2017 10
... ... ... ... ... ...
49108 TDI_20_L_DIESEL140_PS..AUTOMATICK NaN volkswagen 2017 1
49130 Renault_Modus modus renault 2017 6
49153 Corsa_c20xe corsa opel 5000 0
49154 Auto_mit_TÜV_bis_September_2017 ibiza seat 2017 7
49178 Mazda_323_F_TÜV__neu_2te_Hand_viele_Neuteile 3_reihe mazda 2018 6
49185 Alfa_Romeo_156_Sportwagon_1_8_T_Spark 156 alfa_romeo 2019 5
49218 Ford_Fiesta_1.3_4/5_Tuerer_16_Monate_Tuev fiesta ford 2018 4
49245 Audi_A3_1.8T_tausch_moegl. a3 audi 2017 4
49259 Volkswagen_Lupo_TÜV_06.2017_Klima lupo volkswagen 2017 0
49261 Toyota_aygo aygo toyota 2017 6
49262 Opel_vectra NaN opel 2018 0
49266 Toyota_Corolla_Verso_7_Sitzer_104.000km_2_2_Di... NaN toyota 2017 12
49343 bmw_320i_touring 3er bmw 2017 0
49347 Honda_civic_mit_TÜV_3/2017 civic honda 2017 0
49354 Bmw_e39_523i_mit_neuem_Tuev 5er bmw 2018 8
49389 Astra_G_CC astra opel 2017 0
49411 Renault_twingo_Tuev_neu twingo renault 2018 0
49522 Mazda_Demio_zu_verkaufen!_RESERVIERT NaN mazda 2017 12
49557 E320_T_cdi_avangart e_klasse mercedes_benz 2017 3
49561 Ford_Galaxy_1_8_TDCI_7_Sitze_Paoramadach_DPF_AHK galaxy ford 2017 6
49653 Top_zustand_privat scenic renault 2017 8
49662 Daihatsu_Coure_122tkm_Tuev_1/17_2.Hand cuore daihatsu 2017 0
49689 VW_GOLF_V_1.9 NaN volkswagen 2017 0
49696 Audi_A3_1_6_Attraction_mit_TÜV_10/17 a3 audi 2017 8
49731 Renault_MEGANE_1.6_16V__neu_ZANHRIEMEN_TOP_ZUS... megane renault 2017 7
49770 VW_Polo_6n_Tuev_Neu!__1.6_75PS polo volkswagen 2018 12
49796 Opel_corsa_1.4_zu_verkaufen corsa opel 2017 7
49841 Passat_abzugeben. passat volkswagen 2017 7
49910 Schoener_fast_neuer_Opel_Mokka_in_Zell_Mosel_m... andere opel 9000 3
49935 Mercedes_A_klasse_angemeldet_mit_Tuef_und_Auto... a_klasse mercedes_benz 2017 9

1879 rows × 5 columns

In [34]:
autos.loc[autos["registration_year"] > 2016, "registration_year"].value_counts()
Out[34]:
2017    1392
2018     470
5000       4
9999       3
2019       2
5911       1
9000       1
2800       1
4800       1
8888       1
4500       1
6200       1
4100       1
Name: registration_year, dtype: int64

There are 1453 entries with a registration date of 2017, and 492 with a registration date of 2018. This makes me think it may be worthwhile to look within 2016 for bogus registration dates. The easiest is to look for anythin with a reg month more recent than 4/2016, since that's when the last crawl was done.

In [35]:
autos.loc[(autos["registration_year"] == 2016) & (autos["registration_month"] > 4),
         ["name", "model", "brand", "registration_year", "registration_month"]]
Out[35]:
name model brand registration_year registration_month
135 Opel_Meriva_B_Panoramadach__Sitz__und_Lenkradh... meriva opel 2016 8
256 Passat_1.9TDI_4Motion_Highline passat volkswagen 2016 11
295 Privat_anbiter astra opel 2016 8
307 Giessen_ford focus ford 2016 8
437 Mazda__klima_leder__Alufelgen NaN mazda 2016 7
495 BMW_1er_116i 1er bmw 2016 12
507 Citroën_Xsara NaN citroen 2016 7
574 VW_vento_2_8_vr6 andere volkswagen 2016 5
675 Verkaufe_honda_civic civic honda 2016 11
756 Golf_3_Variante_tuev_bis_Juni golf volkswagen 2016 8
914 Opel_Zafira__AHK__Standheizung__Mp3 zafira opel 2016 11
942 Colt_zu_verkaufen colt mitsubishi 2016 10
1021 BMW_316i_Leder_Voll_FAHRBETEIT 1er bmw 2016 6
1039 Vw_polo_classic NaN volkswagen 2016 7
1153 Audi_A3_Sportback_2.0TDI_S_Line a3 audi 2016 5
1268 PKW._Volkswagen_Golf_2_Madison_Edition_Typ_19E golf volkswagen 2016 6
1326 Opel_Meriva_A_in_einem_gepflegten_Zustand meriva opel 2016 11
1389 Fiat_Punto_1.2_60ps punto fiat 2016 5
1474 Golf_Vr6_Turbo_HGP_Turbo_Alles_Eingetragen golf volkswagen 2016 5
1593 BMW_520I_AUTOMATIKGETRIEBE 5er bmw 2016 6
1663 Opel_vectra_b_2_5_V6_mit_TÜV vectra opel 2016 5
1741 Vw_Golf_4_2.0__121000_km NaN volkswagen 2016 8
2014 Honda_civic_ej2_Coupe_USA_Model NaN honda 2016 5
2170 VW_Lupo_1.0_wenig_Km_Sitzheizung_._El._Fenster lupo volkswagen 2016 7
2333 Mitsubishi_Colt_+_Klimaanlage colt mitsubishi 2016 11
2350 Opel_corsa_b_Edition_100_cool corsa opel 2016 7
2371 Dringet_zuverkaufen_audi_a4_mit_maengel NaN audi 2016 12
2446 Golf_5__TDI_1.9 golf volkswagen 2016 8
2776 Bastlerfahrzeug_Seat_Ibiza ibiza seat 2016 12
2837 Seat_Leon_1.4_Euro3_und_D4_Klima leon seat 2016 6
... ... ... ... ... ...
47982 Renault_Espace_2.0_136_PS_1998ccm espace renault 2016 6
48021 Golf_4_1_4_tuev_bis_6/16 golf volkswagen 2016 5
48058 BMW_120D_Coupe_NAVI/_KEYLESS/LEDER/_TEMPOMAT/_PDC 1er bmw 2016 5
48062 Fiat_Scudo_1_6benzin NaN fiat 2016 8
48198 VW_Passat_3bg_tdi_131ps passat volkswagen 2016 12
48317 e36_325i_coupe_neulack_bilder_folgen_bbs_rf_st... 3er bmw 2016 6
48347 OPEL_CORSA_1.2 corsa opel 2016 9
48352 Ford_Mondeo_1.6_16v mondeo ford 2016 12
48464 A_Klasse_170_CDI_Klima_sitzheizung a_klasse mercedes_benz 2016 8
48514 Renault_Senic_Megane_16V_nur_34500km_TÜV_neu NaN renault 2016 8
48634 NUR_HEUTE_1100€_MAZDA_PREMACY andere mazda 2016 11
48736 Nissan_Micra_1l_mit_klima NaN nissan 2016 12
48752 Opel_Astra_1_4T_MAHAGONIBRAUN_1.HAND NaN opel 2016 10
48828 Grand_Cherokee_ZJ__5_2l__neue_Stossdaempfer__a... NaN jeep 2016 6
48866 Opel_vectra_B_1_8_16v vectra opel 2016 5
48893 Polo_Kombi polo volkswagen 2016 6
48952 Audi_80 NaN audi 2016 6
49145 Grand_Voyager_Chrysler_mit_TÜV_bis_08.2016 voyager chrysler 2016 10
49195 Golf_IV_Generation__1_4l__55_kw__Klima golf volkswagen 2016 9
49239 Audi_Coupe_typ_89_quattro_2_6_s2 NaN audi 2016 12
49258 BMW_e36_328i_Cabrio_!!!Neu_lackiert!!! NaN bmw 2016 5
49385 Golf_3_500€_Festpreis golf volkswagen 2016 5
49417 Opel_vectra_gruene_Plakette_guter_Zustand vectra opel 2016 7
49528 BMW_E36__316i_Compact_M_Paket_"Schlachtfest" 3er bmw 2016 12
49531 Fiat_Merena_Diesel_Kombi NaN fiat 2016 9
49547 Smart_Passion_mit_Panorama_Dach fortwo smart 2016 10
49852 TOP__Golf_3_1.8l golf volkswagen 2016 7
49876 Audi_a5_3.0_tdi_s_line NaN audi 2016 10
49919 Fiat_Punto punto fiat 2016 8
49938 Mercedes_Benz_A_160_Avantgarde a_klasse mercedes_benz 2016 6

618 rows × 5 columns

In [36]:
autos = autos[autos["registration_year"].between(1917,2016)]
autos = autos[~((autos["registration_year"] == 2016) & (autos["registration_month"] > 4))]
In [37]:
autos["registration_year"].value_counts(normalize=True)
Out[37]:
2000    0.068522
2005    0.063746
1999    0.062899
2004    0.058687
2003    0.058600
2006    0.057970
2001    0.057232
2002    0.053975
1998    0.051305
2007    0.049438
2008    0.048092
2009    0.045269
1997    0.042360
2011    0.035238
2010    0.034500
1996    0.029810
2012    0.028442
1995    0.026640
2013    0.017435
2014    0.014395
1994    0.013657
2016    0.013070
1993    0.009227
2015    0.008511
1992    0.008033
1990    0.007534
1991    0.007360
1989    0.003778
1988    0.002931
1985    0.002063
          ...   
1966    0.000478
1977    0.000478
1976    0.000456
1969    0.000413
1975    0.000391
1965    0.000369
1964    0.000261
1963    0.000174
1961    0.000130
1959    0.000130
1956    0.000087
1958    0.000087
1937    0.000087
1962    0.000087
1950    0.000065
1954    0.000043
1941    0.000043
1934    0.000043
1957    0.000043
1951    0.000043
1955    0.000043
1931    0.000022
1953    0.000022
1943    0.000022
1938    0.000022
1939    0.000022
1927    0.000022
1929    0.000022
1948    0.000022
1952    0.000022
Name: registration_year, Length: 77, dtype: float64
In [38]:
autos["registration_year"].describe()
Out[38]:
count    46058.000000
mean      2002.745213
std          7.004326
min       1927.000000
25%       1999.000000
50%       2003.000000
75%       2007.000000
max       2016.000000
Name: registration_year, dtype: float64

I've eliminated 2638 entries with problematic registration years. My observations after my first refinement of the dataset still holds:

The average car sold was ~15y old, and 75% of cars were at least ~8y old.

In [39]:
len(autos)
Out[39]:
46058

Top Brands

In [40]:
autos["brand"].describe()
Out[40]:
count          46058
unique            40
top       volkswagen
freq            9708
Name: brand, dtype: object
In [41]:
autos["brand"].value_counts(normalize=True)
Out[41]:
volkswagen        0.210778
bmw               0.110621
opel              0.107061
mercedes_benz     0.096813
audi              0.086912
ford              0.069716
renault           0.047071
peugeot           0.029767
fiat              0.025598
seat              0.018129
skoda             0.016501
nissan            0.015198
mazda             0.015155
smart             0.014069
citroen           0.013939
toyota            0.012723
hyundai           0.010053
sonstige_autos    0.009879
volvo             0.009227
mini              0.008793
mitsubishi        0.008207
honda             0.007838
kia               0.007165
alfa_romeo        0.006622
porsche           0.006166
suzuki            0.005949
chevrolet         0.005710
chrysler          0.003496
dacia             0.002627
daihatsu          0.002519
jeep              0.002280
subaru            0.002149
land_rover        0.002106
saab              0.001672
jaguar            0.001585
daewoo            0.001520
trabant           0.001390
rover             0.001324
lancia            0.001086
lada              0.000586
Name: brand, dtype: float64
In [42]:
top10_counts = autos["brand"].value_counts(normalize=True)[:10]
top10_counts.sum()
Out[42]:
0.8024664553389205

7. Exploring Price By Brand

There are forty unique auto brands represented in the dataset. The top four brands account for ~50% of the entries. Three are German domestic manufacturers, the fourth is Opel, which is GM's German subsidiary. The fifth most common brand is Audi, also German. The top five account for ~60% of the cars in the dataset.

The top ten brands account for 80% of the dataset.

I am going to look at some key aggregate details for the top ten brands.

Brands and Average Prices

In [43]:
top_brands = [brand for brand in top10_counts.index]
top_brands
Out[43]:
['volkswagen',
 'bmw',
 'opel',
 'mercedes_benz',
 'audi',
 'ford',
 'renault',
 'peugeot',
 'fiat',
 'seat']
In [44]:
brand_mean_prices = {}
for brand in top_brands:
    selection = autos.loc[autos["brand"] == brand]
    brand_mean_prices[brand] = selection["price"].mean()

brand_mean_prices
Out[44]:
{'audi': 9387.175118661004,
 'bmw': 8367.863788027478,
 'fiat': 2830.815945716709,
 'ford': 3790.819059483027,
 'mercedes_benz': 8681.313298945952,
 'opel': 2996.046643682823,
 'peugeot': 3108.5754923413565,
 'renault': 2493.9561808118083,
 'seat': 4443.226347305389,
 'volkswagen': 5444.281314379893}
In [45]:
sum(brand_mean_prices.values())/len(brand_mean_prices.values())
Out[45]:
5154.407318935543
In [46]:
autos["price"].describe()
Out[46]:
count     46058.000000
mean       6022.932954
std        9219.593098
min           1.000000
25%        1250.000000
50%        3150.000000
75%        7600.000000
max      350000.000000
Name: price, dtype: float64
  • The mean price for all cars in the dataset is $6,022. (It was $9,685 before removing (~10) cars above $900,000 from the dataset)

  • The mean price for cars from the top ten brands is $5154

  • The mean price of cars from the other top-ten brands is below the dataset mean.

  • Among USA-based manufactuers, the average Ford sells for 26% more than the average Opel.

  • Among the top tend brands, Audi, Mercedes and BMW cars command a premium. Volkswagon is right in the middle and Fiat, Ford, Opel, Peugot and Renault are budget brands.

8. Storing Aggregate Data in a DataFrame

Relationship between Price and Odometer

In [47]:
brand_mean_mileage = {}
for brand in top_brands:
    selection = autos.loc[autos["brand"] == brand]
    brand_mean_mileage[brand] = selection["odometer_km"].mean()
In [48]:
brand_mean_mileage
Out[48]:
{'audi': 129056.95728203848,
 'bmw': 132490.67713444555,
 'fiat': 116849.0245971162,
 'ford': 124051.6972905637,
 'mercedes_benz': 130638.0354339538,
 'opel': 129196.91746096127,
 'peugeot': 126867.25018234865,
 'renault': 127984.31734317343,
 'seat': 120838.32335329341,
 'volkswagen': 128550.16481252575}
In [49]:
bmp_series = pd.Series(brand_mean_prices)
bmm_series = pd.Series(brand_mean_mileage)
In [50]:
brand_data = pd.DataFrame(bmp_series, columns=['mean_price'])
In [51]:
brand_data["mean_mileage"] = bmm_series
In [52]:
brand_data
Out[52]:
mean_price mean_mileage
audi 9387.175119 129056.957282
bmw 8367.863788 132490.677134
fiat 2830.815946 116849.024597
ford 3790.819059 124051.697291
mercedes_benz 8681.313299 130638.035434
opel 2996.046644 129196.917461
peugeot 3108.575492 126867.250182
renault 2493.956181 127984.317343
seat 4443.226347 120838.323353
volkswagen 5444.281314 128550.164813
In [53]:
brand_data.describe()
Out[53]:
mean_price mean_mileage
count 10.000000 10.000000
mean 5154.407319 126652.336489
std 2676.064597 4764.710803
min 2493.956181 116849.024597
25% 3024.178856 124755.585514
50% 4117.022703 128267.241078
75% 7636.968170 129161.927416
max 9387.175119 132490.677134
  • There does not appear to be a strong relationship between mean price and mean odometer reading for any of the brands. Standard deviation for mean_price among brands is ~50% the mean, while the standard deviation for mileage among brands is <4% the mean.

It would be interesting to know the original selling price of the vehicles, so one could look at % depreciation vs mileage.

9. Next Steps

Translation of German Terms

Most of the categorical columns contain german words. In order to prepare for further analysis, I will identify those values and replace them with english equivalents.

In [54]:
autos.head()
Out[54]:
date_crawled name price abtest vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 5000 control bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 2016-03-26 00:00:00 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 8500 control limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 2016-04-04 00:00:00 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 8990 test limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 2016-03-26 00:00:00 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4350 control kleinwagen 2007 automatik 71 fortwo 70000 6 benzin smart nein 2016-03-12 00:00:00 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 1350 test kombi 2003 manuell 0 focus 150000 7 benzin ford nein 2016-04-01 00:00:00 39218 2016-04-01 14:38:50
In [55]:
#saving myself some typing?
to_translate =["vehicle_type", "gearbox", "fuel_type", "unrepaired_damage"]
for c in to_translate:
    values = autos[c].unique()
    temp_dict = {v:v for v in values}
    print(c," = ", end="")
    pp.pprint(temp_dict)
vehicle_type  = {nan: nan,
 'andere': 'andere',
 'bus': 'bus',
 'cabrio': 'cabrio',
 'coupe': 'coupe',
 'kleinwagen': 'kleinwagen',
 'kombi': 'kombi',
 'limousine': 'limousine',
 'suv': 'suv'}
gearbox  = {nan: nan, 'automatik': 'automatik', 'manuell': 'manuell'}
fuel_type  = {nan: nan,
 'andere': 'andere',
 'benzin': 'benzin',
 'cng': 'cng',
 'diesel': 'diesel',
 'elektro': 'elektro',
 'hybrid': 'hybrid',
 'lpg': 'lpg'}
unrepaired_damage  = {nan: nan, 'ja': 'ja', 'nein': 'nein'}
In [56]:
vehicle_type_translation = {'andere': 'other',
                 'bus': 'bus',
                 'cabrio': 'convertable',
                 'coupe': 'coupe',
                 'kleinwagen': 'subcompact',
                 'kombi': 'stationwagon',
                 'limousine': 'limousine',
                 'suv': 'suv'}

autos["vehicle_type"] = autos["vehicle_type"].map(vehicle_type_translation)
In [57]:
gearbox_translation = {'automatik': 'automatic', 'manuell': 'manual'}
autos["gearbox"] = autos["gearbox"].map(gearbox_translation)
In [58]:
fuel_type_translation = {'andere': 'other',
              'benzin': 'gasoline',
              'cng': 'cng',
              'diesel': 'diesel',
              'elektro': 'electric',
              'hybrid': 'hybrid',
              'lpg': 'lpg'}

autos["fuel_type"] = autos["fuel_type"].map(fuel_type_translation)
In [59]:
unrepaired_damage_translation = {'ja': 'yes', 'nein': 'no'}
autos["unrepaired_damage"] = autos["unrepaired_damage"].map(unrepaired_damage_translation)
In [60]:
autos["model"] = autos["model"].str.replace("andere","other")
In [61]:
autos.head()
Out[61]:
date_crawled name price abtest vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 5000 control bus 2004 manual 158 other 150000 3 lpg peugeot no 2016-03-26 00:00:00 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 8500 control limousine 1997 automatic 286 7er 150000 6 gasoline bmw no 2016-04-04 00:00:00 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 8990 test limousine 2009 manual 102 golf 70000 7 gasoline volkswagen no 2016-03-26 00:00:00 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4350 control subcompact 2007 automatic 71 fortwo 70000 6 gasoline smart no 2016-03-12 00:00:00 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 1350 test stationwagon 2003 manual 0 focus 150000 7 gasoline ford no 2016-04-01 00:00:00 39218 2016-04-01 14:38:50

Convert Date Columns

In [62]:
date_cols = ["date_crawled", "ad_created", "last_seen"]

for c in date_cols:
    autos[c] = autos[c].str[:10].str.replace("-","").astype(int)
In [63]:
autos.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 46058 entries, 0 to 49999
Data columns (total 17 columns):
date_crawled          46058 non-null int64
name                  46058 non-null object
price                 46058 non-null int64
abtest                46058 non-null object
vehicle_type          43975 non-null object
registration_year     46058 non-null int64
gearbox               44009 non-null object
power_ps              46058 non-null int64
model                 43938 non-null object
odometer_km           46058 non-null int64
registration_month    46058 non-null int64
fuel_type             42999 non-null object
brand                 46058 non-null object
unrepaired_damage     38004 non-null object
ad_created            46058 non-null int64
postal_code           46058 non-null int64
last_seen             46058 non-null int64
dtypes: int64(9), object(8)
memory usage: 6.3+ MB
In [64]:
autos[date_cols].head()
Out[64]:
date_crawled ad_created last_seen
0 20160326 20160326 20160406
1 20160404 20160404 20160406
2 20160326 20160326 20160406
3 20160312 20160312 20160315
4 20160401 20160401 20160401

Extracting Structured Data from Name Column

In [65]:
autos["name"]
Out[65]:
0                         Peugeot_807_160_NAVTECH_ON_BOARD
1               BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik
2                               Volkswagen_Golf_1.6_United
3        Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...
4        Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...
5        Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...
6        VW_Golf_III_GT_Special_Electronic_Green_Metall...
7                                     Golf_IV_1.9_TDI_90PS
8                                               Seat_Arosa
9                Renault_Megane_Scenic_1.6e_RT_Klimaanlage
11                              Mercedes_A140_Motorschaden
12       Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan...
13                                      Audi_A3_1.6_tuning
14       Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver...
15             Corvette_C3_Coupe_T_Top_Crossfire_Injection
16                                     Opel_Vectra_B_Kombi
17                               Volkswagen_Scirocco_2_G60
18                    Verkaufen_mein_bmw_e36_320_i_touring
19             mazda_tribute_2.0_mit_gas_und_tuev_neu_2018
20       Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*...
21                           Porsche_911_Carrera_4S_Cabrio
22                                    MINI_Cooper_S_Cabrio
23        Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima
24                              BMW_535i_xDrive_Sport_Aut.
25       Ford_escort_kombi_an_bastler_mit_ghia_ausstattung
26                                     Volkswagen_Polo_Fox
28                                           MINI_Cooper_D
29       Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz...
30                              Peugeot_206_Unfallfahrzeug
31                               Noch_gut_erhaltenen_C_320
                               ...                        
49968                    Mercedes_Benz_190_D_2.5_Automatik
49969               Nissan_X_Trail_2.2_dCi_4x4_Sport_m.AHZ
49970    c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp...
49971                                           W.Lupo_1.0
49972            Mercedes_Benz_Vito_115_CDI_Extralang_Aut.
49973                     Mercedes_Benz_SLK_200_Kompressor
49975            Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort
49976    Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga...
49977                          Mercedes_Benz_C200_Cdi_W203
49978                          Mercedes_Benz_E_200_Classic
49979                        Volkswagen_Polo_1.6_TDI_Style
49980                              Ford_Escort_Turnier_16V
49981               Opel_Astra_Kombi_mit_Anhaengerkupplung
49982              Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm
49983                                        Ford_focus_99
49985                            Verkaufe_meinen_vw_vento!
49986    Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst...
49987    Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__...
49988                                           BMW_330_Ci
49989          VW_Polo_zum_Ausschlachten_oder_Wiederaufbau
49990           Mercedes_Benz_A_200__BlueEFFICIENCY__Urban
49991                                           Kleinwagen
49992                Fiat_Grande_Punto_1.4_T_Jet_16V_Sport
49993             Audi_A3__1_8l__Silber;_schoenes_Fahrzeug
49994    Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc...
49995     Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon
49996    Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...
49997                       Fiat_500_C_1.2_Dualogic_Lounge
49998                   Audi_A3_2.0_TDI_Sportback_Ambition
49999                                  Opel_Vectra_1.6_16V
Name: name, Length: 46058, dtype: object

One might be able to extract engine engine details, like displacement and 16V from the name column in some of the rows.

In [66]:
autos["name"].str.extract(r'_(\d\.\d+)').value_counts()
/dataquest/system/env/python3/lib/python3.4/site-packages/ipykernel/__main__.py:1: FutureWarning: currently extract(expand=None) means expand=False (return Index/Series/DataFrame) but in a future version of pandas this will be changed to expand=True (return DataFrame)
  if __name__ == '__main__':
Out[66]:
2.0       3619
1.6       2643
1.4       2169
1.2       1623
1.9       1593
1.8       1444
3.0        671
1.0        648
2.5        497
2.2        487
1.3        374
2.4        207
1.1        190
1.7        186
1.5        179
2.7        153
2.8        144
2.3         96
3.2         80
4.2         65
1.25        65
2.6         46
4.0         29
2.9         21
3.3         20
3.5         16
5.0         16
0.8         16
1.33        13
5.2         10
          ... 
9.16         1
4.18         1
1.200        1
7.7          1
5.28         1
3.9          1
9.5          1
4.8          1
3.800        1
3.1          1
5.9          1
6.5          1
1.500        1
3.2016       1
8.17         1
3.16         1
5.3          1
6.7          1
5.2009       1
7.2017       1
8.2016       1
4.17         1
7.4          1
6.2016       1
5.17         1
5.2017       1
2.2005       1
4.9          1
5.16         1
8.4          1
Name: name, Length: 99, dtype: int64

A simple regular expression matches a lot of plausible values for displacement, as well as a number of implausible values. These implausible values could probably be discarded based some string and numeric filtering. This would likely still yield some bogus values, though. I will leave the rest of this exercise to the reader ;).

Most Common Models by Brand

In [67]:
brands = np.sort(autos["brand"].unique())
for b in brands:
    selection = (autos["brand"] == b)
    print(b,":")
    print(autos.loc[selection, "model"].value_counts(normalize=True)[:3])
    print()
alfa_romeo :
156      0.297945
147      0.267123
other    0.202055
Name: model, dtype: float64

audi :
a4    0.314763
a3    0.207445
a6    0.203594
Name: model, dtype: float64

bmw :
3er    0.530892
5er    0.230974
1er    0.105565
Name: model, dtype: float64

chevrolet :
other      0.588710
matiz      0.133065
captiva    0.096774
Name: model, dtype: float64

chrysler :
other        0.339869
voyager      0.274510
ptcruiser    0.196078
Name: model, dtype: float64

citroen :
other       0.350081
berlingo    0.155592
c5          0.116694
Name: model, dtype: float64

dacia :
sandero    0.361345
logan      0.352941
duster     0.226891
Name: model, dtype: float64

daewoo :
matiz    0.333333
kalos    0.257576
lanos    0.151515
Name: model, dtype: float64

daihatsu :
cuore     0.548673
other     0.123894
sirion    0.097345
Name: model, dtype: float64

fiat :
punto    0.360889
other    0.202667
500      0.105778
Name: model, dtype: float64

ford :
focus     0.243673
fiesta    0.229721
mondeo    0.153796
Name: model, dtype: float64

honda :
civic       0.479412
other       0.150000
cr_reihe    0.147059
Name: model, dtype: float64

hyundai :
i_reihe    0.395402
other      0.333333
getz       0.110345
Name: model, dtype: float64

jaguar :
other     0.541667
x_type    0.277778
s_type    0.180556
Name: model, dtype: float64

jeep :
grand       0.43
wrangler    0.24
cherokee    0.18
Name: model, dtype: float64

kia :
other       0.207668
sorento     0.166134
sportage    0.166134
Name: model, dtype: float64

lada :
niva      0.60
other     0.28
samara    0.12
Name: model, dtype: float64

lancia :
ypsilon    0.404255
other      0.234043
lybra      0.212766
Name: model, dtype: float64

land_rover :
freelander    0.322917
defender      0.229167
discovery     0.135417
Name: model, dtype: float64

mazda :
3_reihe    0.244344
6_reihe    0.238311
other      0.206637
Name: model, dtype: float64

mercedes_benz :
c_klasse    0.260198
e_klasse    0.218944
a_klasse    0.120765
Name: model, dtype: float64

mini :
cooper     0.668317
one        0.237624
clubman    0.064356
Name: model, dtype: float64

mitsubishi :
colt       0.304469
other      0.265363
carisma    0.159218
Name: model, dtype: float64

nissan :
micra     0.351032
other     0.176991
almera    0.131268
Name: model, dtype: float64

opel :
corsa     0.328635
astra     0.280034
vectra    0.112350
Name: model, dtype: float64

peugeot :
2_reihe    0.459984
3_reihe    0.245532
1_reihe    0.118104
Name: model, dtype: float64

porsche :
911        0.500000
boxster    0.167857
cayenne    0.167857
Name: model, dtype: float64

renault :
twingo    0.286667
clio      0.221905
megane    0.158095
Name: model, dtype: float64

rover :
other         0.931034
freelander    0.034483
rangerover    0.017241
Name: model, dtype: float64

saab :
other    0.716216
900      0.216216
9000     0.067568
Name: model, dtype: float64

seat :
ibiza    0.391626
leon     0.233990
arosa    0.128079
Name: model, dtype: float64

skoda :
octavia    0.407507
fabia      0.369973
superb     0.068365
Name: model, dtype: float64

smart :
fortwo      0.864952
forfour     0.072347
roadster    0.051447
Name: model, dtype: float64

sonstige_autos :
Series([], Name: model, dtype: float64)

subaru :
legacy     0.322581
impreza    0.247312
justy      0.193548
Name: model, dtype: float64

suzuki :
other    0.443609
swift    0.308271
jimny    0.124060
Name: model, dtype: float64

toyota :
yaris      0.221053
other      0.185965
corolla    0.149123
Name: model, dtype: float64

trabant :
601      0.854545
other    0.145455
Name: model, dtype: float64

volkswagen :
golf      0.389596
polo      0.168254
passat    0.143209
Name: model, dtype: float64

volvo :
v70      0.218978
v40      0.211679
other    0.199513
Name: model, dtype: float64

Above we can see the three most common models for each brand. In many cases "other" is one of the most common values, which suggests deficiencies in the listing-creation process.

I'm going to add a new column to the dataset that combines brand and model in case it is useful for later analysis.

In [68]:
autos["brand_model"] = autos["brand"] + "/" + autos["model"]
autos["brand_model"].value_counts().head(20)
Out[68]:
volkswagen/golf           3640
bmw/3er                   2595
volkswagen/polo           1572
opel/corsa                1562
volkswagen/passat         1338
opel/astra                1331
audi/a4                   1226
bmw/5er                   1129
mercedes_benz/c_klasse    1129
mercedes_benz/e_klasse     950
audi/a3                    808
audi/a6                    793
ford/focus                 751
ford/fiesta                708
volkswagen/transporter     670
renault/twingo             602
peugeot/2_reihe            592
smart/fortwo               538
opel/vectra                534
mercedes_benz/a_klasse     524
Name: brand_model, dtype: int64

Relationship between Price and Odometer Readings

An investigation of the relationship between selling price and odometer readings will have to take into account the wide range in average selling prices between the top ten brands. So, I've aggregated them by brand.

NOTE: Odometer readings in the dataset have already been binned, so no further binning has been performed.

In [69]:
for b in sorted(top_brands):
    brand_subset = autos[autos["brand"] == b]
    odometer_readings = np.sort(brand_subset["odometer_km"].unique())
    print(b,":")
    print(brand_subset[["price", "odometer_km"]].mean())
    for r in odometer_readings:
        odometer_subset = brand_subset[brand_subset["odometer_km"] == r]
        print(r,":", round(odometer_subset["price"].mean(),-3))
    price_40kkm = brand_subset.loc[brand_subset["odometer_km"] == 40000, "price"].mean()
    price_90kkm = brand_subset.loc[brand_subset["odometer_km"] == 90000, "price"].mean()
    print("90K/40K price ratio:",
         round(price_90kkm/price_40kkm,1))
    print()
audi :
price            9387.175119
odometer_km    129056.957282
dtype: float64
5000 : 16000.0
10000 : 28000.0
20000 : 27000.0
30000 : 27000.0
40000 : 26000.0
50000 : 23000.0
60000 : 20000.0
70000 : 22000.0
80000 : 16000.0
90000 : 16000.0
100000 : 15000.0
125000 : 12000.0
150000 : 6000.0
90K/40K price ratio: 0.6

bmw :
price            8367.863788
odometer_km    132490.677134
dtype: float64
5000 : 10000.0
10000 : 31000.0
20000 : 27000.0
30000 : 24000.0
40000 : 25000.0
50000 : 24000.0
60000 : 21000.0
70000 : 17000.0
80000 : 16000.0
90000 : 15000.0
100000 : 13000.0
125000 : 11000.0
150000 : 6000.0
90K/40K price ratio: 0.6

fiat :
price            2830.815946
odometer_km    116849.024597
dtype: float64
5000 : 5000.0
10000 : 10000.0
20000 : 7000.0
30000 : 8000.0
40000 : 7000.0
50000 : 5000.0
60000 : 5000.0
70000 : 5000.0
80000 : 4000.0
90000 : 4000.0
100000 : 3000.0
125000 : 2000.0
150000 : 2000.0
90K/40K price ratio: 0.5

ford :
price            3790.819059
odometer_km    124051.697291
dtype: float64
5000 : 8000.0
10000 : 16000.0
20000 : 12000.0
30000 : 13000.0
40000 : 10000.0
50000 : 10000.0
60000 : 7000.0
70000 : 8000.0
80000 : 6000.0
90000 : 4000.0
100000 : 5000.0
125000 : 4000.0
150000 : 2000.0
90K/40K price ratio: 0.4

mercedes_benz :
price            8681.313299
odometer_km    130638.035434
dtype: float64
5000 : 11000.0
10000 : 29000.0
20000 : 28000.0
30000 : 19000.0
40000 : 28000.0
50000 : 24000.0
60000 : 22000.0
70000 : 17000.0
80000 : 16000.0
90000 : 16000.0
100000 : 14000.0
125000 : 10000.0
150000 : 6000.0
90K/40K price ratio: 0.6

opel :
price            2996.046644
odometer_km    129196.917461
dtype: float64
5000 : 4000.0
10000 : 12000.0
20000 : 9000.0
30000 : 11000.0
40000 : 9000.0
50000 : 9000.0
60000 : 7000.0
70000 : 8000.0
80000 : 6000.0
90000 : 4000.0
100000 : 5000.0
125000 : 3000.0
150000 : 2000.0
90K/40K price ratio: 0.5

peugeot :
price            3108.575492
odometer_km    126867.250182
dtype: float64
5000 : 3000.0
10000 : 13000.0
20000 : 5000.0
30000 : 12000.0
40000 : 10000.0
50000 : 8000.0
60000 : 6000.0
70000 : 7000.0
80000 : 5000.0
90000 : 4000.0
100000 : 5000.0
125000 : 4000.0
150000 : 2000.0
90K/40K price ratio: 0.4

renault :
price            2493.956181
odometer_km    127984.317343
dtype: float64
5000 : 5000.0
10000 : 8000.0
20000 : 9000.0
30000 : 12000.0
40000 : 7000.0
50000 : 7000.0
60000 : 7000.0
70000 : 6000.0
80000 : 4000.0
90000 : 4000.0
100000 : 3000.0
125000 : 2000.0
150000 : 2000.0
90K/40K price ratio: 0.5

seat :
price            4443.226347
odometer_km    120838.323353
dtype: float64
5000 : 6000.0
10000 : 20000.0
20000 : 11000.0
30000 : 15000.0
40000 : 12000.0
50000 : 10000.0
60000 : 10000.0
70000 : 8000.0
80000 : 7000.0
90000 : 6000.0
100000 : 5000.0
125000 : 4000.0
150000 : 2000.0
90K/40K price ratio: 0.5

volkswagen :
price            5444.281314
odometer_km    128550.164813
dtype: float64
5000 : 4000.0
10000 : 19000.0
20000 : 17000.0
30000 : 15000.0
40000 : 16000.0
50000 : 13000.0
60000 : 13000.0
70000 : 12000.0
80000 : 11000.0
90000 : 9000.0
100000 : 8000.0
125000 : 6000.0
150000 : 4000.0
90K/40K price ratio: 0.6

As an alternative, lets display the aggreagated data as a data frame.

In [70]:
brand_data['90_40_price_ratio'] = np.NAN  #done to place early in column order
odometer_readings = np.sort(autos["odometer_km"].unique())    
for r in odometer_readings:
    odometer_subset = autos[autos["odometer_km"] == r]
    for b in brand_data.index:
        brand_subset = odometer_subset[odometer_subset["brand"] == b]
        brand_data.loc[b,r] = round(brand_subset["price"].mean(),-3)

price_ratios = {}
for b in brand_data.index:
    price_ratios[b] = round(brand_data.loc[b,90000]/brand_data.loc[b,40000],2)
    
price_ratios = pd.Series(price_ratios)
brand_data["90_40_price_ratio"] = price_ratios
In [71]:
brand_data
Out[71]:
mean_price mean_mileage 90_40_price_ratio 5000 10000 20000 30000 40000 50000 60000 70000 80000 90000 100000 125000 150000
audi 9387.175119 129056.957282 0.62 16000.0 28000.0 27000.0 27000.0 26000.0 23000.0 20000.0 22000.0 16000.0 16000.0 15000.0 12000.0 6000.0
bmw 8367.863788 132490.677134 0.60 10000.0 31000.0 27000.0 24000.0 25000.0 24000.0 21000.0 17000.0 16000.0 15000.0 13000.0 11000.0 6000.0
fiat 2830.815946 116849.024597 0.57 5000.0 10000.0 7000.0 8000.0 7000.0 5000.0 5000.0 5000.0 4000.0 4000.0 3000.0 2000.0 2000.0
ford 3790.819059 124051.697291 0.40 8000.0 16000.0 12000.0 13000.0 10000.0 10000.0 7000.0 8000.0 6000.0 4000.0 5000.0 4000.0 2000.0
mercedes_benz 8681.313299 130638.035434 0.57 11000.0 29000.0 28000.0 19000.0 28000.0 24000.0 22000.0 17000.0 16000.0 16000.0 14000.0 10000.0 6000.0
opel 2996.046644 129196.917461 0.44 4000.0 12000.0 9000.0 11000.0 9000.0 9000.0 7000.0 8000.0 6000.0 4000.0 5000.0 3000.0 2000.0
peugeot 3108.575492 126867.250182 0.40 3000.0 13000.0 5000.0 12000.0 10000.0 8000.0 6000.0 7000.0 5000.0 4000.0 5000.0 4000.0 2000.0
renault 2493.956181 127984.317343 0.57 5000.0 8000.0 9000.0 12000.0 7000.0 7000.0 7000.0 6000.0 4000.0 4000.0 3000.0 2000.0 2000.0
seat 4443.226347 120838.323353 0.50 6000.0 20000.0 11000.0 15000.0 12000.0 10000.0 10000.0 8000.0 7000.0 6000.0 5000.0 4000.0 2000.0
volkswagen 5444.281314 128550.164813 0.56 4000.0 19000.0 17000.0 15000.0 16000.0 13000.0 13000.0 12000.0 11000.0 9000.0 8000.0 6000.0 4000.0

Notice that the prices for the lowest mileage cars are unusually low, lower than for higher mileage cars. One hypothesis is that these tend to be wrecked cars with unrepaired damage. This could be investigated with the available data.

I've experimented with finding a quick metric for comparing depreciation among brands. For each brand, I've computed the ratio of mean price for 40,000km vehicles with those for 90,000km.

Between 40,000km and 90,000km:

  • Audi and BMW cars loose ~40% of their value
  • Mercedes, Volkswagon and Renault loose ~43%
  • Fiat, Ford, Open, Peugeot loose >55% of their value

Damaged Cars

In [72]:
damaged = autos[autos["unrepaired_damage"] == "yes"]
undamaged = autos[autos["unrepaired_damage"] == "no"]
In [73]:
print("Mean Damaged Price:", round(damaged["price"].mean(),-1))
print("Mean Undamaged Price:", round(undamaged["price"].mean(),-1))
Mean Damaged Price: 2260.0
Mean Undamaged Price: 7200.0

Cars with unrepaired damage sell, on average, for less than 1/3 the price of cars without unrepaired damage.

We can also look at price differences for damaged cars on a model by model basis.

In [74]:
autos.groupby(["brand_model","unrepaired_damage"])["price"].mean()
Out[74]:
brand_model             unrepaired_damage
alfa_romeo/145          no                    2750.000000
alfa_romeo/147          no                    2722.854545
                        yes                   1286.846154
alfa_romeo/156          no                    1689.516667
                        yes                   1190.388889
alfa_romeo/159          no                    6659.653846
                        yes                   7800.000000
alfa_romeo/other        no                    8204.046512
                        yes                   3548.625000
alfa_romeo/spider       no                    9182.925926
audi/100                no                    2031.725000
                        yes                   1166.500000
audi/200                no                    2250.000000
audi/80                 no                    1771.739583
                        yes                    641.421053
audi/90                 no                    1699.600000
audi/a1                 no                   14793.827160
audi/a2                 no                    3808.848485
                        yes                   2333.333333
audi/a3                 no                    9280.805466
                        yes                   2845.285714
audi/a4                 no                    8183.572204
                        yes                   2886.800000
audi/a5                 no                   22454.473214
                        yes                  21649.800000
audi/a6                 no                    9773.868552
                        yes                   3201.715789
audi/a8                 no                   11933.750000
                        yes                   3780.000000
audi/other              no                   22860.139175
                                                 ...     
volkswagen/polo         yes                    988.312139
volkswagen/scirocco     no                   12298.285714
volkswagen/sharan       no                    6684.231707
                        yes                   5235.214286
volkswagen/tiguan       no                   18086.550459
volkswagen/touareg      no                   16235.134146
                        yes                  20933.333333
volkswagen/touran       no                    8642.105882
                        yes                   5496.187500
volkswagen/transporter  no                   11924.129225
                        yes                   4932.614035
volkswagen/up           no                    8371.279070
                        yes                   6449.333333
volvo/850               no                    2280.388889
                        yes                    580.000000
volvo/c_reihe           no                    8339.076923
                        yes                   2999.000000
volvo/other             no                    4245.936508
                        yes                   2022.500000
volvo/s60               no                    7379.142857
                        yes                   3500.000000
volvo/v40               no                    2227.868852
                        yes                    756.461538
volvo/v50               no                    5545.375000
                        yes                   3495.000000
volvo/v60               no                   21000.000000
volvo/v70               no                    5009.258065
                        yes                   1570.615385
volvo/xc_reihe          no                   13378.673913
                        yes                  11450.000000
Name: price, Length: 524, dtype: float64

Another way of doing it...

In [75]:
damaged_comparison = pd.DataFrame(undamaged.groupby("brand_model")["price"].mean())
damaged_comparison.rename({"price": "mean_undamaged_price"}, axis=1, inplace=True)
damaged_comparison["mean_damaged_price"] = damaged.groupby("brand_model")["price"].mean()
damaged_comparison
Out[75]:
mean_undamaged_price mean_damaged_price
brand_model
alfa_romeo/145 2750.000000 NaN
alfa_romeo/147 2722.854545 1286.846154
alfa_romeo/156 1689.516667 1190.388889
alfa_romeo/159 6659.653846 7800.000000
alfa_romeo/other 8204.046512 3548.625000
alfa_romeo/spider 9182.925926 NaN
audi/100 2031.725000 1166.500000
audi/200 2250.000000 NaN
audi/80 1771.739583 641.421053
audi/90 1699.600000 NaN
audi/a1 14793.827160 NaN
audi/a2 3808.848485 2333.333333
audi/a3 9280.805466 2845.285714
audi/a4 8183.572204 2886.800000
audi/a5 22454.473214 21649.800000
audi/a6 9773.868552 3201.715789
audi/a8 11933.750000 3780.000000
audi/other 22860.139175 4823.000000
audi/q3 28346.962963 NaN
audi/q5 26827.660714 NaN
audi/q7 22040.184211 11950.000000
audi/tt 13743.041667 7521.111111
bmw/1er 12035.286353 6220.645161
bmw/3er 6960.184942 2335.140845
bmw/5er 8891.072855 3926.722772
bmw/6er 19530.782609 NaN
bmw/7er 10414.252747 5389.250000
bmw/m_reihe 28971.975610 8900.000000
bmw/other 19302.000000 10579.800000
bmw/x_reihe 19259.575290 11625.333333
... ... ...
volkswagen/beetle 7549.046729 3094.000000
volkswagen/bora 2441.763889 1762.416667
volkswagen/caddy 7958.980519 4643.125000
volkswagen/cc 15144.352941 NaN
volkswagen/eos 11690.161290 7500.000000
volkswagen/fox 2891.621212 1818.750000
volkswagen/golf 6114.460083 1877.132450
volkswagen/jetta 5339.961538 NaN
volkswagen/kaefer 8983.866667 2960.000000
volkswagen/lupo 1546.163462 781.853659
volkswagen/other 5389.897059 7116.666667
volkswagen/passat 5838.317526 2178.945946
volkswagen/phaeton 13704.222222 NaN
volkswagen/polo 3234.969981 988.312139
volkswagen/scirocco 12298.285714 NaN
volkswagen/sharan 6684.231707 5235.214286
volkswagen/tiguan 18086.550459 NaN
volkswagen/touareg 16235.134146 20933.333333
volkswagen/touran 8642.105882 5496.187500
volkswagen/transporter 11924.129225 4932.614035
volkswagen/up 8371.279070 6449.333333
volvo/850 2280.388889 580.000000
volvo/c_reihe 8339.076923 2999.000000
volvo/other 4245.936508 2022.500000
volvo/s60 7379.142857 3500.000000
volvo/v40 2227.868852 756.461538
volvo/v50 5545.375000 3495.000000
volvo/v60 21000.000000 NaN
volvo/v70 5009.258065 1570.615385
volvo/xc_reihe 13378.673913 11450.000000

287 rows × 2 columns

In [76]:
damaged_comparison
Out[76]:
mean_undamaged_price mean_damaged_price
brand_model
alfa_romeo/145 2750.000000 NaN
alfa_romeo/147 2722.854545 1286.846154
alfa_romeo/156 1689.516667 1190.388889
alfa_romeo/159 6659.653846 7800.000000
alfa_romeo/other 8204.046512 3548.625000
alfa_romeo/spider 9182.925926 NaN
audi/100 2031.725000 1166.500000
audi/200 2250.000000 NaN
audi/80 1771.739583 641.421053
audi/90 1699.600000 NaN
audi/a1 14793.827160 NaN
audi/a2 3808.848485 2333.333333
audi/a3 9280.805466 2845.285714
audi/a4 8183.572204 2886.800000
audi/a5 22454.473214 21649.800000
audi/a6 9773.868552 3201.715789
audi/a8 11933.750000 3780.000000
audi/other 22860.139175 4823.000000
audi/q3 28346.962963 NaN
audi/q5 26827.660714 NaN
audi/q7 22040.184211 11950.000000
audi/tt 13743.041667 7521.111111
bmw/1er 12035.286353 6220.645161
bmw/3er 6960.184942 2335.140845
bmw/5er 8891.072855 3926.722772
bmw/6er 19530.782609 NaN
bmw/7er 10414.252747 5389.250000
bmw/m_reihe 28971.975610 8900.000000
bmw/other 19302.000000 10579.800000
bmw/x_reihe 19259.575290 11625.333333
... ... ...
volkswagen/beetle 7549.046729 3094.000000
volkswagen/bora 2441.763889 1762.416667
volkswagen/caddy 7958.980519 4643.125000
volkswagen/cc 15144.352941 NaN
volkswagen/eos 11690.161290 7500.000000
volkswagen/fox 2891.621212 1818.750000
volkswagen/golf 6114.460083 1877.132450
volkswagen/jetta 5339.961538 NaN
volkswagen/kaefer 8983.866667 2960.000000
volkswagen/lupo 1546.163462 781.853659
volkswagen/other 5389.897059 7116.666667
volkswagen/passat 5838.317526 2178.945946
volkswagen/phaeton 13704.222222 NaN
volkswagen/polo 3234.969981 988.312139
volkswagen/scirocco 12298.285714 NaN
volkswagen/sharan 6684.231707 5235.214286
volkswagen/tiguan 18086.550459 NaN
volkswagen/touareg 16235.134146 20933.333333
volkswagen/touran 8642.105882 5496.187500
volkswagen/transporter 11924.129225 4932.614035
volkswagen/up 8371.279070 6449.333333
volvo/850 2280.388889 580.000000
volvo/c_reihe 8339.076923 2999.000000
volvo/other 4245.936508 2022.500000
volvo/s60 7379.142857 3500.000000
volvo/v40 2227.868852 756.461538
volvo/v50 5545.375000 3495.000000
volvo/v60 21000.000000 NaN
volvo/v70 5009.258065 1570.615385
volvo/xc_reihe 13378.673913 11450.000000

287 rows × 2 columns

In Summary

See above ;)