MARKET OF USED CARS

The given datasaet contains about 50,000 entries of deatils of used cars put up for sales. We will do some analysis on this dataset.

In [1]:
import pandas as pd
import numpy as np
autos=pd.read_csv('autos.csv',encoding = 'Latin-1')
In [2]:
autos
Out[2]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
5 2016-03-21 13:47:45 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... privat Angebot $7,900 test bus 2006 automatik 150 voyager 150,000km 4 diesel chrysler NaN 2016-03-21 00:00:00 0 22962 2016-04-06 09:45:21
6 2016-03-20 17:55:21 VW_Golf_III_GT_Special_Electronic_Green_Metall... privat Angebot $300 test limousine 1995 manuell 90 golf 150,000km 8 benzin volkswagen NaN 2016-03-20 00:00:00 0 31535 2016-03-23 02:48:59
7 2016-03-16 18:55:19 Golf_IV_1.9_TDI_90PS privat Angebot $1,990 control limousine 1998 manuell 90 golf 150,000km 12 diesel volkswagen nein 2016-03-16 00:00:00 0 53474 2016-04-07 03:17:32
8 2016-03-22 16:51:34 Seat_Arosa privat Angebot $250 test NaN 2000 manuell 0 arosa 150,000km 10 NaN seat nein 2016-03-22 00:00:00 0 7426 2016-03-26 18:18:10
9 2016-03-16 13:47:02 Renault_Megane_Scenic_1.6e_RT_Klimaanlage privat Angebot $590 control bus 1997 manuell 90 megane 150,000km 7 benzin renault nein 2016-03-16 00:00:00 0 15749 2016-04-06 10:46:35
10 2016-03-15 01:41:36 VW_Golf_Tuning_in_siber/grau privat Angebot $999 test NaN 2017 manuell 90 NaN 150,000km 4 benzin volkswagen nein 2016-03-14 00:00:00 0 86157 2016-04-07 03:16:21
11 2016-03-16 18:45:34 Mercedes_A140_Motorschaden privat Angebot $350 control NaN 2000 NaN 0 NaN 150,000km 0 benzin mercedes_benz NaN 2016-03-16 00:00:00 0 17498 2016-03-16 18:45:34
12 2016-03-31 19:48:22 Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... privat Angebot $5,299 control kleinwagen 2010 automatik 71 fortwo 50,000km 9 benzin smart nein 2016-03-31 00:00:00 0 34590 2016-04-06 14:17:52
13 2016-03-23 10:48:32 Audi_A3_1.6_tuning privat Angebot $1,350 control limousine 1999 manuell 101 a3 150,000km 11 benzin audi nein 2016-03-23 00:00:00 0 12043 2016-04-01 14:17:13
14 2016-03-23 11:50:46 Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... privat Angebot $3,999 test kleinwagen 2007 manuell 75 clio 150,000km 9 benzin renault NaN 2016-03-23 00:00:00 0 81737 2016-04-01 15:46:47
15 2016-04-01 12:06:20 Corvette_C3_Coupe_T_Top_Crossfire_Injection privat Angebot $18,900 test coupe 1982 automatik 203 NaN 80,000km 6 benzin sonstige_autos nein 2016-04-01 00:00:00 0 61276 2016-04-02 21:10:48
16 2016-03-16 14:59:02 Opel_Vectra_B_Kombi privat Angebot $350 test kombi 1999 manuell 101 vectra 150,000km 5 benzin opel nein 2016-03-16 00:00:00 0 57299 2016-03-18 05:29:37
17 2016-03-29 11:46:22 Volkswagen_Scirocco_2_G60 privat Angebot $5,500 test coupe 1990 manuell 205 scirocco 150,000km 6 benzin volkswagen nein 2016-03-29 00:00:00 0 74821 2016-04-05 20:46:26
18 2016-03-26 19:57:44 Verkaufen_mein_bmw_e36_320_i_touring privat Angebot $300 control bus 1995 manuell 150 3er 150,000km 0 benzin bmw NaN 2016-03-26 00:00:00 0 54329 2016-04-02 12:16:41
19 2016-03-17 13:36:21 mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 privat Angebot $4,150 control suv 2004 manuell 124 andere 150,000km 2 lpg mazda nein 2016-03-17 00:00:00 0 40878 2016-03-17 14:45:58
20 2016-03-05 19:57:31 Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... privat Angebot $3,500 test kombi 2003 manuell 131 a4 150,000km 5 diesel audi NaN 2016-03-05 00:00:00 0 53913 2016-03-07 05:46:46
21 2016-03-06 19:07:10 Porsche_911_Carrera_4S_Cabrio privat Angebot $41,500 test cabrio 2004 manuell 320 911 150,000km 4 benzin porsche nein 2016-03-06 00:00:00 0 65428 2016-04-05 23:46:19
22 2016-03-28 20:50:54 MINI_Cooper_S_Cabrio privat Angebot $25,450 control cabrio 2015 manuell 184 cooper 10,000km 1 benzin mini nein 2016-03-28 00:00:00 0 44789 2016-04-01 06:45:30
23 2016-03-10 19:55:34 Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima privat Angebot $7,999 control bus 2010 manuell 120 NaN 150,000km 2 diesel peugeot nein 2016-03-10 00:00:00 0 30900 2016-03-17 08:45:17
24 2016-04-03 11:57:02 BMW_535i_xDrive_Sport_Aut. privat Angebot $48,500 control limousine 2014 automatik 306 5er 30,000km 12 benzin bmw nein 2016-04-03 00:00:00 0 22547 2016-04-07 13:16:50
25 2016-03-21 21:56:18 Ford_escort_kombi_an_bastler_mit_ghia_ausstattung privat Angebot $90 control kombi 1996 manuell 116 NaN 150,000km 4 benzin ford ja 2016-03-21 00:00:00 0 27574 2016-04-01 05:16:49
26 2016-04-03 22:46:28 Volkswagen_Polo_Fox privat Angebot $777 control kleinwagen 1992 manuell 54 polo 125,000km 2 benzin volkswagen nein 2016-04-03 00:00:00 0 38110 2016-04-05 23:46:48
27 2016-03-27 18:45:01 Hat_einer_Ahnung_mit_Ford_Galaxy_HILFE privat Angebot $0 control NaN 2005 NaN 0 NaN 150,000km 0 NaN ford NaN 2016-03-27 00:00:00 0 66701 2016-03-27 18:45:01
28 2016-03-19 21:56:19 MINI_Cooper_D privat Angebot $5,250 control kleinwagen 2007 manuell 110 cooper 150,000km 7 diesel mini ja 2016-03-19 00:00:00 0 15745 2016-04-07 14:58:48
29 2016-04-02 12:45:44 Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... privat Angebot $4,999 test kombi 2004 automatik 204 e_klasse 150,000km 10 diesel mercedes_benz nein 2016-04-02 00:00:00 0 47638 2016-04-02 12:45:44
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49970 2016-03-21 22:47:37 c4_Grand_Picasso_mit_Automatik_Leder_Navi_Temp... privat Angebot $15,800 control bus 2010 automatik 136 c4 60,000km 4 diesel citroen nein 2016-03-21 00:00:00 0 14947 2016-04-07 04:17:34
49971 2016-03-29 14:54:12 W.Lupo_1.0 privat Angebot $950 test kleinwagen 2001 manuell 50 lupo 150,000km 4 benzin volkswagen nein 2016-03-29 00:00:00 0 65197 2016-03-29 20:41:51
49972 2016-03-26 22:25:23 Mercedes_Benz_Vito_115_CDI_Extralang_Aut. privat Angebot $3,300 control bus 2004 automatik 150 vito 150,000km 10 diesel mercedes_benz ja 2016-03-26 00:00:00 0 65326 2016-03-28 11:28:18
49973 2016-03-27 05:32:39 Mercedes_Benz_SLK_200_Kompressor privat Angebot $6,000 control cabrio 2004 manuell 163 slk 150,000km 11 benzin mercedes_benz nein 2016-03-27 00:00:00 0 53567 2016-03-27 08:25:24
49974 2016-03-20 10:52:31 Golf_1_Cabrio_Tuev_Neu_viele_Extras_alles_eing... privat Angebot $0 control cabrio 1983 manuell 70 golf 150,000km 2 benzin volkswagen nein 2016-03-20 00:00:00 0 8209 2016-03-27 19:48:16
49975 2016-03-27 20:51:39 Honda_Jazz_1.3_DSi_i_VTEC_IMA_CVT_Comfort privat Angebot $9,700 control kleinwagen 2012 automatik 88 jazz 100,000km 11 hybrid honda nein 2016-03-27 00:00:00 0 84385 2016-04-05 19:45:34
49976 2016-03-19 18:56:05 Audi_80_Avant_2.6_E__Vollausstattung!!_Einziga... privat Angebot $5,900 test kombi 1992 automatik 150 80 150,000km 12 benzin audi nein 2016-03-19 00:00:00 0 36100 2016-04-07 06:16:44
49977 2016-03-31 18:37:18 Mercedes_Benz_C200_Cdi_W203 privat Angebot $5,500 control limousine 2003 manuell 116 c_klasse 150,000km 2 diesel mercedes_benz nein 2016-03-31 00:00:00 0 33739 2016-04-06 12:16:11
49978 2016-04-04 10:37:14 Mercedes_Benz_E_200_Classic privat Angebot $900 control limousine 1996 automatik 136 e_klasse 150,000km 9 benzin mercedes_benz ja 2016-04-04 00:00:00 0 24405 2016-04-06 12:44:20
49979 2016-03-20 18:38:40 Volkswagen_Polo_1.6_TDI_Style privat Angebot $11,000 test kleinwagen 2011 manuell 90 polo 70,000km 11 diesel volkswagen nein 2016-03-20 00:00:00 0 48455 2016-04-07 01:45:12
49980 2016-03-12 10:55:54 Ford_Escort_Turnier_16V privat Angebot $400 control kombi 1995 manuell 105 escort 125,000km 3 benzin ford NaN 2016-03-12 00:00:00 0 56218 2016-04-06 17:16:49
49981 2016-03-15 09:38:21 Opel_Astra_Kombi_mit_Anhaengerkupplung privat Angebot $2,000 control kombi 1998 manuell 115 astra 150,000km 12 benzin opel nein 2016-03-15 00:00:00 0 86859 2016-04-05 17:21:46
49982 2016-03-29 18:51:08 Skoda_Fabia_4_Tuerer_Bj:2004__85.000Tkm privat Angebot $1,950 control kleinwagen 2004 manuell 0 fabia 90,000km 7 benzin skoda NaN 2016-03-29 00:00:00 0 45884 2016-03-29 18:51:08
49983 2016-03-06 12:43:04 Ford_focus_99 privat Angebot $600 test kleinwagen 1999 manuell 101 focus 150,000km 4 benzin ford NaN 2016-03-06 00:00:00 0 52477 2016-03-09 06:16:08
49984 2016-03-31 22:48:48 Student_sucht_ein__Anfaengerauto___ab_2000_BJ_... privat Angebot $0 test NaN 2000 NaN 0 NaN 150,000km 0 NaN sonstige_autos NaN 2016-03-31 00:00:00 0 12103 2016-04-02 19:44:53
49985 2016-04-02 16:38:23 Verkaufe_meinen_vw_vento! privat Angebot $1,000 control NaN 1995 automatik 0 NaN 150,000km 0 benzin volkswagen NaN 2016-04-02 00:00:00 0 30900 2016-04-06 15:17:52
49986 2016-04-04 20:46:02 Chrysler_300C_3.0_CRD_DPF_Automatik_Voll_Ausst... privat Angebot $15,900 control limousine 2010 automatik 218 300c 125,000km 11 diesel chrysler nein 2016-04-04 00:00:00 0 73527 2016-04-06 23:16:00
49987 2016-03-22 20:47:27 Audi_A3_Limousine_2.0_TDI_DPF_Ambition__NAVI__... privat Angebot $21,990 control limousine 2013 manuell 150 a3 50,000km 11 diesel audi nein 2016-03-22 00:00:00 0 94362 2016-03-26 22:46:06
49988 2016-03-28 19:49:51 BMW_330_Ci privat Angebot $9,550 control coupe 2001 manuell 231 3er 150,000km 10 benzin bmw nein 2016-03-28 00:00:00 0 83646 2016-04-07 02:17:40
49989 2016-03-11 19:50:37 VW_Polo_zum_Ausschlachten_oder_Wiederaufbau privat Angebot $150 test kleinwagen 1997 manuell 0 polo 150,000km 5 benzin volkswagen ja 2016-03-11 00:00:00 0 21244 2016-03-12 10:17:55
49990 2016-03-21 19:54:19 Mercedes_Benz_A_200__BlueEFFICIENCY__Urban privat Angebot $17,500 test limousine 2012 manuell 156 a_klasse 30,000km 12 benzin mercedes_benz nein 2016-03-21 00:00:00 0 58239 2016-04-06 22:46:57
49991 2016-03-06 15:25:19 Kleinwagen privat Angebot $500 control NaN 2016 manuell 0 twingo 150,000km 0 benzin renault NaN 2016-03-06 00:00:00 0 61350 2016-03-06 18:24:19
49992 2016-03-10 19:37:38 Fiat_Grande_Punto_1.4_T_Jet_16V_Sport privat Angebot $4,800 control kleinwagen 2009 manuell 120 andere 125,000km 9 lpg fiat nein 2016-03-10 00:00:00 0 68642 2016-03-13 01:44:51
49993 2016-03-15 18:47:35 Audi_A3__1_8l__Silber;_schoenes_Fahrzeug privat Angebot $1,650 control kleinwagen 1997 manuell 0 NaN 150,000km 7 benzin audi NaN 2016-03-15 00:00:00 0 65203 2016-04-06 19:46:53
49994 2016-03-22 17:36:42 Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... privat Angebot $5,000 control kombi 2001 automatik 299 a6 150,000km 1 benzin audi nein 2016-03-22 00:00:00 0 46537 2016-04-06 08:16:39
49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon privat Angebot $24,900 control limousine 2011 automatik 239 q5 100,000km 1 diesel audi nein 2016-03-27 00:00:00 0 82131 2016-04-01 13:47:40
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot $1,980 control cabrio 1996 manuell 75 astra 150,000km 5 benzin opel nein 2016-03-28 00:00:00 0 44807 2016-04-02 14:18:02
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot $13,200 test cabrio 2014 automatik 69 500 5,000km 11 benzin fiat nein 2016-04-02 00:00:00 0 73430 2016-04-04 11:47:27
49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition privat Angebot $22,900 control kombi 2013 manuell 150 a3 40,000km 11 diesel audi nein 2016-03-08 00:00:00 0 35683 2016-04-05 16:45:07
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V privat Angebot $1,250 control limousine 1996 manuell 101 vectra 150,000km 1 benzin opel nein 2016-03-13 00:00:00 0 45897 2016-04-06 21:18:48

50000 rows × 20 columns

In [3]:
autos.info()
autos.head(4)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
Out[3]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28

Columns "gearbox", "model", "vehicleType", "fuelType" and "notRepairedDamage" have null values for some entries. 5 columns have int type entries. While rest of them are object.

In [4]:
def clean_string(char):
    word=""
    for a in char:
        if a.isupper():
           new_a="_"+ a.lower()
           a=new_a
        word+=a
        word=word.strip("_")
    return word
        
    
auto_copy=autos
print(auto_copy.columns)
label_list=[]
for name in auto_copy.columns:
    if name=="yearOfRegistration":
        name="registration_year"
    elif name=="monthOfRegistration":
        name="registration_month"
    elif name=="notRepairedDamage":
        name="unrepaired_damage"
    elif name=="dateCreated":
        name="ad_created"
    else:
        name=clean_string(name)
    label_list.append(name)
auto_copy.columns=label_list
print(auto_copy.columns)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_p_s', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')

The column names have been made more readable.

In [5]:
print(auto_copy.describe())
       registration_year     power_p_s  registration_month  nr_of_pictures  \
count       50000.000000  50000.000000        50000.000000         50000.0   
mean         2005.073280    116.355920            5.723360             0.0   
std           105.712813    209.216627            3.711984             0.0   
min          1000.000000      0.000000            0.000000             0.0   
25%          1999.000000     70.000000            3.000000             0.0   
50%          2003.000000    105.000000            6.000000             0.0   
75%          2008.000000    150.000000            9.000000             0.0   
max          9999.000000  17700.000000           12.000000             0.0   

        postal_code  
count  50000.000000  
mean   50813.627300  
std    25779.747957  
min     1067.000000  
25%    30451.000000  
50%    49577.000000  
75%    71540.000000  
max    99998.000000  

1. "nr_of_pictures" seems to have no data. This column can be deleted.
2. "price" and "odometer" have been stored as a string. This needs to be converted into a numeric type data.
3. To calculate how old the car is, "registration year" could be converted into an datetime object. Also the "ad_created" column needs to be converted into a datetime object. With reference to date of ad creation, we can get an idea of how old the car is.

Converting odometer column into a numeric value

In [6]:
print(auto_copy.loc[:,"odometer"].head(2))
0    150,000km
1    150,000km
Name: odometer, dtype: object
In [7]:
auto_copy["odometer"]=auto_copy["odometer"].str.replace(",","").str.replace("km","")
auto_copy["odometer"]=auto_copy["odometer"].astype(float)
auto_copy.rename({"odometer":"odometer_km"},axis=1,inplace=True)

Converting price column into a numeric value

In [8]:
print(auto_copy["price"].head(2))
0    $5,000
1    $8,500
Name: price, dtype: object
In [9]:
auto_copy["price"]=auto_copy["price"].str.replace(",","").str.replace("$","")
auto_copy["price"]=auto_copy["price"].astype(float)
In [10]:
print(auto_copy["price"].describe())
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
In [11]:
print(auto_copy["price"].value_counts().sort_index().head(50))
0.0      1421
1.0       156
2.0         3
3.0         1
5.0         2
8.0         1
9.0         1
10.0        7
11.0        2
12.0        3
13.0        2
14.0        1
15.0        2
17.0        3
18.0        1
20.0        4
25.0        5
29.0        1
30.0        7
35.0        1
40.0        6
45.0        4
47.0        1
49.0        4
50.0       49
55.0        2
59.0        1
60.0        9
65.0        5
66.0        1
70.0       10
75.0        5
79.0        1
80.0       15
89.0        1
90.0        5
99.0       19
100.0     134
110.0       3
111.0       2
115.0       2
117.0       1
120.0      39
122.0       1
125.0       8
129.0       1
130.0      15
135.0       1
139.0       1
140.0       9
Name: price, dtype: int64

Prices have abnormal values. Now we delete data entries with car pricing less than 100 Dollars and greater than 28,000,000 Dollars.

In [12]:
auto_copy=auto_copy[auto_copy["price"].between(100,4000000)]
auto_copy["price"].value_counts().sort_index(ascending=False)
Out[12]:
3890000.0      1
1300000.0      1
1234566.0      1
999999.0       2
999990.0       1
350000.0       1
345000.0       1
299000.0       1
295000.0       1
265000.0       1
259000.0       1
250000.0       1
220000.0       1
198000.0       1
197000.0       1
194000.0       1
190000.0       1
180000.0       1
175000.0       1
169999.0       1
169000.0       1
163991.0       1
163500.0       1
155000.0       1
151990.0       1
145000.0       1
139997.0       1
137999.0       1
135000.0       1
130000.0       1
            ... 
199.0         41
198.0          1
195.0          2
193.0          1
190.0         16
188.0          1
185.0          1
180.0         35
179.0          1
175.0         12
173.0          1
170.0          7
160.0          8
156.0          2
150.0        224
149.0          7
145.0          2
140.0          9
139.0          1
135.0          1
130.0         15
129.0          1
125.0          8
122.0          1
120.0         39
117.0          1
115.0          2
111.0          2
110.0          3
100.0        134
Name: price, Length: 2315, dtype: int64

Now lets observe dates

In [13]:
#date_crawled
a=auto_copy["date_crawled"].str[:10]
print(a.value_counts(normalize=True,dropna=False).sort_index(ascending=False))
#ad_created
b=auto_copy["ad_created"].str[:10]
print(b.value_counts(normalize=True,dropna=False).sort_index(ascending=False).head(50))
#date_crawled
c=auto_copy["last_seen"].str[:10]
print(c.value_counts(normalize=True,dropna=False).sort_index(ascending=False))
2016-04-07    0.001389
2016-04-06    0.003172
2016-04-05    0.013062
2016-04-04    0.036575
2016-04-03    0.038607
2016-04-02    0.035600
2016-04-01    0.033693
2016-03-31    0.031847
2016-03-30    0.033734
2016-03-29    0.034128
2016-03-28    0.034957
2016-03-27    0.031122
2016-03-26    0.032304
2016-03-25    0.031495
2016-03-24    0.029442
2016-03-23    0.032283
2016-03-22    0.032905
2016-03-21    0.037217
2016-03-20    0.037798
2016-03-19    0.034729
2016-03-18    0.012897
2016-03-17    0.031516
2016-03-16    0.029463
2016-03-15    0.034315
2016-03-14    0.036658
2016-03-13    0.015675
2016-03-12    0.036906
2016-03-11    0.032594
2016-03-10    0.032283
2016-03-09    0.033009
2016-03-08    0.033174
2016-03-07    0.036056
2016-03-06    0.014037
2016-03-05    0.025358
Name: date_crawled, dtype: float64
2016-04-07    0.001244
2016-04-06    0.003255
2016-04-05    0.011798
2016-04-04    0.036927
2016-04-03    0.038855
2016-04-02    0.035289
2016-04-01    0.033672
2016-03-31    0.031889
2016-03-30    0.033548
2016-03-29    0.034087
2016-03-28    0.035061
2016-03-27    0.031039
2016-03-26    0.032366
2016-03-25    0.031619
2016-03-24    0.029380
2016-03-23    0.032117
2016-03-22    0.032718
2016-03-21    0.037446
2016-03-20    0.037860
2016-03-19    0.033610
2016-03-18    0.013581
2016-03-17    0.031184
2016-03-16    0.029961
2016-03-15    0.034045
2016-03-14    0.035289
2016-03-13    0.017043
2016-03-12    0.036741
2016-03-11    0.032905
2016-03-10    0.031993
2016-03-09    0.033091
2016-03-08    0.033174
2016-03-07    0.034792
2016-03-06    0.015302
2016-03-05    0.022911
2016-03-04    0.001493
2016-03-03    0.000871
2016-03-02    0.000104
2016-03-01    0.000104
2016-02-29    0.000166
2016-02-28    0.000207
2016-02-27    0.000124
2016-02-26    0.000041
2016-02-25    0.000062
2016-02-24    0.000041
2016-02-23    0.000083
2016-02-22    0.000021
2016-02-21    0.000062
2016-02-20    0.000041
2016-02-19    0.000062
2016-02-18    0.000041
Name: ad_created, dtype: float64
2016-04-07    0.132138
2016-04-06    0.221957
2016-04-05    0.125088
2016-04-04    0.024528
2016-04-03    0.025130
2016-04-02    0.024902
2016-04-01    0.022849
2016-03-31    0.023823
2016-03-30    0.024694
2016-03-29    0.022310
2016-03-28    0.020858
2016-03-27    0.015550
2016-03-26    0.016670
2016-03-25    0.019096
2016-03-24    0.019759
2016-03-23    0.018578
2016-03-22    0.021356
2016-03-21    0.020547
2016-03-20    0.020651
2016-03-19    0.015758
2016-03-18    0.007319
2016-03-17    0.028095
2016-03-16    0.016442
2016-03-15    0.015861
2016-03-14    0.012627
2016-03-13    0.008874
2016-03-12    0.023782
2016-03-11    0.012399
2016-03-10    0.010637
2016-03-09    0.009579
2016-03-08    0.007319
2016-03-07    0.005432
2016-03-06    0.004313
2016-03-05    0.001078
Name: last_seen, dtype: float64

Date_crawled and last_seen values span from March 2016 to April 2016.

Whereas, Ad_Created values span from Feb 2015 to April 2016.

In [14]:
auto_copy["registration_year"].describe()
Out[14]:
count    48230.000000
mean      2004.729795
std         87.892058
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

Registration year should be between 1900 to 2016. But there are values beyond this range. Hence there is some error in this data.

In [15]:
bool_reg=auto_copy["registration_year"].between (1950,2016)
corrected_auto_copy=auto_copy[bool_reg]
corrected_auto_copy["registration_year"].value_counts().sort_index()
Out[15]:
1950       1
1951       2
1952       1
1953       1
1954       2
1955       2
1956       4
1957       2
1958       4
1959       6
1960      22
1961       6
1962       4
1963       8
1964      12
1965      17
1966      22
1967      26
1968      26
1969      19
1970      37
1971      26
1972      33
1973      23
1974      24
1975      18
1976      21
1977      22
1978      42
1979      34
        ... 
1987      72
1988     133
1989     171
1990     332
1991     338
1992     368
1993     420
1994     626
1995    1193
1996    1355
1997    1925
1998    2340
1999    2880
2000    3104
2001    2629
2002    2477
2003    2693
2004    2699
2005    2911
2006    2669
2007    2273
2008    2210
2009    2081
2010    1587
2011    1618
2012    1308
2013     801
2014     662
2015     380
2016    1202
Name: registration_year, Length: 67, dtype: int64
In [16]:
auto_copy=corrected_auto_copy
In [17]:
print(auto_copy["brand"].value_counts(normalize=True))
volkswagen        0.211476
bmw               0.110205
opel              0.107227
mercedes_benz     0.096631
audi              0.086791
ford              0.069809
renault           0.047064
peugeot           0.029866
fiat              0.025614
seat              0.018256
skoda             0.016422
nissan            0.015343
mazda             0.015235
smart             0.014199
citroen           0.014048
toyota            0.012796
hyundai           0.010013
sonstige_autos    0.009430
volvo             0.009128
mini              0.008804
mitsubishi        0.008179
honda             0.007876
kia               0.007078
alfa_romeo        0.006668
porsche           0.006021
suzuki            0.005934
chevrolet         0.005675
chrysler          0.003517
dacia             0.002654
daihatsu          0.002503
jeep              0.002287
land_rover        0.002115
subaru            0.002115
saab              0.001662
jaguar            0.001532
daewoo            0.001489
trabant           0.001359
rover             0.001338
lancia            0.001057
lada              0.000583
Name: brand, dtype: float64

The first 17 manufacturers above represent about 90% of the data. Thus, below aggregation for these 17 brands will be done.

In [18]:
brand_dictionary_price={}
brand_dictionary_miles={}
brand_dictionary_counts={}
brands=auto_copy["brand"].value_counts(normalize=True).head(17)

for brand in brands.index:
    bool_brand=auto_copy["brand"]==brand
    selected_rows= auto_copy[bool_brand]
    
    mean_price=selected_rows["price"].mean()
    brand_dictionary_price[brand]=mean_price
    
    mean_miles=selected_rows["odometer_km"].mean()
    brand_dictionary_miles[brand]=mean_miles
    
    number_of_data=selected_rows["brand"].value_counts()
    brand_dictionary_counts[brand]=number_of_data.iloc[0]
    
#sorted_brand=sorted(brand_dictionary.items(),key=lambda x:x[1],reverse=True)
#$sorted_brand_cnt=sorted(brand_dictionary_counts.items(),key=lambda x:x[1],reverse=True)
final_table=pd.Series(brand_dictionary_price)
final_table=pd.DataFrame(final_table,columns=["mean_price"])
series_miles=pd.Series(brand_dictionary_miles)
series_counts=pd.Series(brand_dictionary_counts)
final_table['mean_miles']=series_miles
final_table['data_counts']=series_counts
print(final_table)

        
        
    
                mean_price     mean_miles  data_counts
audi           9380.718548  129245.400298         4022
bmw            8621.165459  132723.712551         5107
citroen        3796.262673  119554.531490          651
fiat           2836.873631  116950.294861         1187
ford           4074.959196  124386.398764         3235
hyundai        5411.075431  106885.775862          464
mazda          4129.774788  124553.824363          706
mercedes_benz  8669.448638  131069.673962         4478
nissan         4756.659634  118326.300985          711
opel           3005.930972  129405.312940         4969
peugeot        3113.860549  127127.890173         1384
renault        2496.940394  128337.918386         2181
seat           4433.419622  121536.643026          846
skoda          6409.609724  110906.701708          761
smart          3596.402736   99734.042553          658
toyota         5167.091062  115944.350759          593
volkswagen     5636.200306  128807.142857         9800

1. Mean price of audi is the highest which is closely followed by Mercedes Benz and BMW.

2. Opel, Fiat and Renault seems to be the cheapest car.

3. Also the number of data entries is more for volkswagen which is followed by BMW.

4. Total miles covered by the car does not seem to have a significant effect on the price.

5. It is the brand name that affects the price.

Data cleaning further is done below.

First translate german words in "unrepaired damage" column.

In [19]:
auto_copy.head(2)
Out[19]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_p_s model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000.0 control bus 2004 manuell 158 andere 150000.0 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500.0 control limousine 1997 automatik 286 7er 150000.0 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
In [20]:
auto_copy["unrepaired_damage"].value_counts()
Out[20]:
nein    33769
ja       4477
Name: unrepaired_damage, dtype: int64
In [21]:
auto_copy["unrepaired_damage"]=auto_copy["unrepaired_damage"].str.replace("nein","no")
auto_copy["unrepaired_damage"]=auto_copy["unrepaired_damage"].str.replace("ja","yes")
print(auto_copy["unrepaired_damage"].unique())
['no' nan 'yes']

Second, find most common brand/model combinations

In [22]:
print(auto_copy["model"].value_counts().describe())
print(auto_copy["brand"].value_counts().describe())
count     244.000000
mean      181.209016
std       430.101310
min         1.000000
25%        24.750000
50%        51.500000
75%       119.500000
max      3684.000000
Name: model, dtype: float64
count      40.000000
mean     1158.525000
std      1980.509427
min        27.000000
25%       113.500000
50%       393.500000
75%       782.250000
max      9800.000000
Name: brand, dtype: float64
In [23]:
print(auto_copy["model"].value_counts().head(5))
print(auto_copy["brand"].value_counts().head(5))
golf      3684
andere    3340
3er       2602
polo      1592
corsa     1567
Name: model, dtype: int64
volkswagen       9800
bmw              5107
opel             4969
mercedes_benz    4478
audi             4022
Name: brand, dtype: int64

There are 244 models and 40 brands. We will see most common models for first five brands.

In [24]:
brand_names=["volkswagen","bmw","opel","mercedes_benz","audi"]
for brand in brand_names:
    bool_brand=auto_copy["brand"]==brand
    selected_rows=auto_copy[bool_brand]
    common_models=selected_rows["model"].value_counts()
    common_model=common_models.index[0]
    print("Common model for {} is {} .".format(brand,common_model))
    
    
Common model for volkswagen is golf .
Common model for bmw is 3er .
Common model for opel is corsa .
Common model for mercedes_benz is c_klasse .
Common model for audi is a4 .
In [ ]: