Introduction

We are working with a german ebay Used Car dataset https://data.world/data-society/used-cars-data.

  • Examine the data
  • Reformat and Clean to prepare for Feature Engineering/Modeling
  • Look for any interesting insights
In [400]:
import pandas as pd
import numpy as np
import re
In [401]:
df = pd.read_csv("autos.csv", encoding='Windows-1252')
In [402]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
In [403]:
#percent null for column 15
299468/371528 
Out[403]:
0.8060442281604616

Columns needs to be cleaned up:

  1. lowercased/snakecased
  2. Some columns have null values - but greatest is 20% @ notrepaireddamage - which i assume its b/c the car was repaired
In [404]:
#let's start the cleanup
to_clean = df.columns

corrections = {
    "dateCrawled":"date_crawled",
    "name":"name",
    "seller":"seller",
    "offerType":"offer_type",
    "price":"price",
    "abtest":"ab_test",
    "vehicleType":"vehicle_type",
    "yearOfRegistration":"registration_year",
    "gearbox":"gearbox",
    "powerPS":"power_ps",
    "model":"model",
    "odometer":"odometer",
    "monthOfRegistration":"registration_month",
    "fuelType":"fuel_type",
    "brand":"brand",
    "notRepairedDamage":"unrepaired_damage",
    "dateCreated":"ad_created",
    "nrOfPictures":"num_of_pictures",
    "postalCode":"postal_code", 
    "lastSeen":"last_seen"
}

cleaned_col=to_clean.map(corrections)
In [405]:
cleaned_col
Out[405]:
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'num_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
In [406]:
#looks good, lets replace the original df columns
df.columns = cleaned_col
In [407]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   date_crawled        50000 non-null  object
 1   name                50000 non-null  object
 2   seller              50000 non-null  object
 3   offer_type          50000 non-null  object
 4   price               50000 non-null  object
 5   ab_test             50000 non-null  object
 6   vehicle_type        44905 non-null  object
 7   registration_year   50000 non-null  int64 
 8   gearbox             47320 non-null  object
 9   power_ps            50000 non-null  int64 
 10  model               47242 non-null  object
 11  odometer            50000 non-null  object
 12  registration_month  50000 non-null  int64 
 13  fuel_type           45518 non-null  object
 14  brand               50000 non-null  object
 15  unrepaired_damage   40171 non-null  object
 16  ad_created          50000 non-null  object
 17  num_of_pictures     50000 non-null  int64 
 18  postal_code         50000 non-null  int64 
 19  last_seen           50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

let's looks a quick look at the data

  • some object columns need to be converted to dates or numeric
  • it is not in english - so this might need to be addressed
In [408]:
df.head()
Out[408]:
date_crawled name seller offer_type price ab_test vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created num_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
In [409]:
df.describe(include='all')
Out[409]:
date_crawled name seller offer_type price ab_test vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created num_of_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-08 10:40:35 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN
In [410]:
#Drop columns that are overwhelmingly one value

to_drop = ['seller',"offer_type",'num_of_pictures']
df = df.drop(to_drop, axis=1)
In [411]:
#Now let's convert price and odometer to numeric
#remove non-numeric characters and then convert type

to_numeric = ['price','odometer']

for c in to_numeric:
    df[c] = df[c].str.replace(r'\D+', '').astype(int)
In [412]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   date_crawled        50000 non-null  object
 1   name                50000 non-null  object
 2   price               50000 non-null  int32 
 3   ab_test             50000 non-null  object
 4   vehicle_type        44905 non-null  object
 5   registration_year   50000 non-null  int64 
 6   gearbox             47320 non-null  object
 7   power_ps            50000 non-null  int64 
 8   model               47242 non-null  object
 9   odometer            50000 non-null  int32 
 10  registration_month  50000 non-null  int64 
 11  fuel_type           45518 non-null  object
 12  brand               50000 non-null  object
 13  unrepaired_damage   40171 non-null  object
 14  ad_created          50000 non-null  object
 15  postal_code         50000 non-null  int64 
 16  last_seen           50000 non-null  object
dtypes: int32(2), int64(4), object(11)
memory usage: 6.1+ MB
In [413]:
#let's update odometer name with removed units. 
#will consider doing that first in the future.
df.rename(columns={'odometer':'odometer_km'}, inplace=True)
In [414]:
df.price.describe()
Out[414]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
In [415]:
#that scientific notation isn't helpful. let's take another look
df.price.describe().apply(lambda x: format(x, '.2f'))
Out[415]:
count       50000.00
mean         9840.04
std        481104.38
min             0.00
25%          1100.00
50%          2950.00
75%          7200.00
max      99999999.00
Name: price, dtype: object
In [416]:
df.price.value_counts(normalize=True)
Out[416]:
0        0.02842
500      0.01562
1500     0.01468
2500     0.01286
1000     0.01278
          ...   
20790    0.00002
8970     0.00002
846      0.00002
2895     0.00002
33980    0.00002
Name: price, Length: 2357, dtype: float64
In [417]:
df.price.value_counts().sort_index(ascending=False).head(20)
Out[417]:
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price, dtype: int64

10% of power ps is zero

In [418]:
df.price.value_counts().sort_index(ascending=True).head(20)
Out[418]:
0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price, dtype: int64

Remove Outliers from Price

Let's remove outliers from price. 0's should probably be removed as they were likely unsold listings w/ out any bids. There are a few cars listed above a million with numbers that don't seem real (12345678)

In [419]:
df = df[df.price.between(1,350000)]
In [420]:
df.price.describe()
Out[420]:
count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64

More Examination

let's see if we can find issues with any other numeric columns

In [421]:
df.odometer_km.value_counts()
Out[421]:
150000    31414
125000     5057
100000     2115
90000      1734
80000      1415
70000      1217
60000      1155
50000      1012
5000        836
40000       815
30000       780
20000       762
10000       253
Name: odometer_km, dtype: int64

odometer data series looks fine with no outlier problems It seems that datetime data might need some help - date_crawled, last_seen, and ad_created are listed as objects.

In [422]:
df[["date_crawled",'ad_created','last_seen']].head()
Out[422]:
date_crawled ad_created last_seen
0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54
1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08
2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37
3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28
4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50
In [423]:
#let's examine each of these a little further starting with date_crawl

df.date_crawled.str[:10].value_counts(normalize=True, dropna=False).sort_index()
Out[423]:
2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64

It looks like the crawling for data happened pretty consistently within a month's time.

In [424]:
df.last_seen.str[:10].value_counts(normalize=True, dropna=False).sort_index()
Out[424]:
2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64

More towards the end, probably because there are more cars that haven't sold / removed their listing yet.

In [425]:
df.ad_created.str[:10].value_counts(dropna=False).loc[lambda x: x>2].sort_index()
Out[425]:
2016-01-27       3
2016-02-19       3
2016-02-21       3
2016-02-23       4
2016-02-25       3
2016-02-27       6
2016-02-28      10
2016-02-29       8
2016-03-01       5
2016-03-02       5
2016-03-03      42
2016-03-04      72
2016-03-05    1112
2016-03-06     744
2016-03-07    1687
2016-03-08    1618
2016-03-09    1610
2016-03-10    1549
2016-03-11    1598
2016-03-12    1785
2016-03-13     826
2016-03-14    1709
2016-03-15    1652
2016-03-16    1463
2016-03-17    1519
2016-03-18     660
2016-03-19    1636
2016-03-20    1843
2016-03-21    1825
2016-03-22    1593
2016-03-23    1557
2016-03-24    1422
2016-03-25    1542
2016-03-26    1567
2016-03-27    1505
2016-03-28    1699
2016-03-29    1653
2016-03-30    1627
2016-03-31    1548
2016-04-01    1636
2016-04-02    1707
2016-04-03    1887
2016-04-04    1790
2016-04-05     574
2016-04-06     158
2016-04-07      61
Name: ad_created, dtype: int64

Looks like most of these ads were created "recently" during with a few ads enduring for longer than a month long.

Finally, we should consider dropping the time portion from this column as it doesn't provide any information (all 00:00:00)

In [426]:
df['ad_created'] = df['ad_created'].str[:10]
In [427]:
#let's look at registration year
df.registration_year.describe()
Out[427]:
count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The min and max certainly don't make any sense. certainly anything over 2016 or 2017 is impossible. The years 25% to 75% makes sense though

In [428]:
1-(df.registration_year.between(1950,2017).sum()/df.shape[0])
Out[428]:
0.010542571810974954

We should this 1% of cars that are not between 1950-2017. I included 2017 since some cars do sell by manufacturers half a year ahead of time (ie. buying 2018 model at the end of 2017)

pre-clean -48565

In [429]:
df = df[df.registration_year.between(1950,2017)]
In [430]:
df.registration_year.value_counts(normalize=True).sort_index().tail(30)
Out[430]:
1988    0.002809
1989    0.003621
1990    0.007221
1991    0.007055
1992    0.007700
1993    0.008844
1994    0.013090
1995    0.025534
1996    0.028573
1997    0.040601
1998    0.049175
1999    0.060288
2000    0.065677
2001    0.054856
2002    0.051735
2003    0.056167
2004    0.056250
2005    0.061099
2006    0.055564
2007    0.047385
2008    0.046095
2009    0.043390
2010    0.033068
2011    0.033775
2012    0.027262
2013    0.016711
2014    0.013797
2015    0.008158
2016    0.025389
2017    0.028968
Name: registration_year, dtype: float64
In [431]:
df.registration_year.describe()
Out[431]:
count    48053.000000
mean      2003.349177
std          7.310371
min       1950.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2017.000000
Name: registration_year, dtype: float64

That looks better. It looks like the majority of cars are less than 20 years old and we were also correct to include 2017

In [432]:
df.brand.value_counts()
Out[432]:
volkswagen        10219
bmw                5230
opel               5207
mercedes_benz      4608
audi               4139
ford               3354
renault            2291
peugeot            1423
fiat               1242
seat                895
skoda               777
nissan              736
mazda               733
smart               688
citroen             678
toyota              609
hyundai             480
sonstige_autos      455
volvo               437
mini                418
mitsubishi          392
honda               380
kia                 342
alfa_romeo          319
porsche             286
suzuki              282
chevrolet           271
chrysler            169
dacia               128
daihatsu            120
jeep                107
subaru              101
land_rover           99
saab                 78
daewoo               75
jaguar               73
trabant              66
rover                65
lancia               54
lada                 27
Name: brand, dtype: int64
In [433]:
brand_counts = df.brand.value_counts(normalize=True)
common_brands = brand_counts[brand_counts > .01]
common_brands.shape
Out[433]:
(16,)

Here I was looing at whether we should remove some less popular cars, but the number of brands (40) is not too great and retaining all the brands should be more interesting.

Brand Mean Price & Mileage

Here we can take a look at Price and Mileage by Brand.

In [434]:
brand_price = df.groupby('brand')['price']
brand_price_describe = brand_price.describe().rename(columns={'mean':'mean_price'})
brand_price_describe.head()
Out[434]:
count mean_price std min 25% 50% 75% max
brand
alfa_romeo 319.0 4054.645768 5099.961862 1.0 994.5 2390.0 5316.5 55000.0
audi 4139.0 9239.323025 9972.875540 1.0 2216.0 6200.0 12900.0 175000.0
bmw 5230.0 8283.439197 9807.309036 1.0 2300.0 5900.0 11000.0 259000.0
chevrolet 271.0 6688.084871 7073.388239 1.0 2300.0 4350.0 8999.5 56500.0
chrysler 169.0 3519.384615 3925.459891 70.0 999.0 2250.0 3900.0 20000.0
In [435]:
brand_info = pd.DataFrame(brand_price_describe, columns=["mean_price"])
brand_info.head()
Out[435]:
mean_price
brand
alfa_romeo 4054.645768
audi 9239.323025
bmw 8283.439197
chevrolet 6688.084871
chrysler 3519.384615
In [436]:
brand_mileage = df.groupby('brand')['odometer_km']
brand_mileage_describe = brand_mileage.describe()
brand_info['mean_mileage'] = brand_mileage_describe['mean']
In [437]:
brand_info = brand_info.sort_values(by=['mean_price','mean_mileage'], ascending=False)
In [438]:
brand_info
Out[438]:
mean_price mean_mileage
brand
porsche 45643.937063 96853.146853
land_rover 18934.272727 118333.333333
sonstige_autos 12283.775824 90989.010989
jaguar 11635.493151 124178.082192
jeep 11590.214953 127102.803738
mini 10541.566986 88899.521531
audi 9239.323025 129406.861561
mercedes_benz 8525.090278 131004.774306
bmw 8283.439197 132691.204589
chevrolet 6688.084871 98948.339483
skoda 6341.280566 110984.555985
dacia 5920.382812 84218.750000
kia 5908.502924 112821.637427
hyundai 5358.306250 106625.000000
volkswagen 5347.828261 128930.913005
toyota 5151.940887 116338.259442
volvo 4872.565217 138283.752860
nissan 4670.418478 118682.065217
seat 4354.731844 121625.698324
suzuki 4128.592199 108723.404255
mazda 4066.230559 124665.757162
alfa_romeo 4054.645768 131379.310345
honda 4041.989474 123000.000000
subaru 3922.069307 125544.554455
citroen 3753.119469 120066.371681
ford 3720.407275 124360.465116
smart 3529.816860 100239.825581
chrysler 3519.384615 132366.863905
mitsubishi 3379.477041 126556.122449
saab 3192.269231 144487.179487
lancia 3170.574074 123055.555556
peugeot 3069.413914 127315.530569
opel 2953.367582 129441.136931
fiat 2791.086151 117367.149758
lada 2688.296296 83518.518519
renault 2443.572239 128149.279790
trabant 1816.515152 55909.090909
daihatsu 1612.233333 116833.333333
rover 1586.492308 138230.769231
daewoo 1091.066667 121266.666667
In [439]:
brand_info['mileage_per_price']=brand_info['mean_mileage']/brand_info['mean_price']
In [440]:
brand_info.sort_values(by=['mileage_per_price'], ascending=False)
Out[440]:
mean_price mean_mileage mileage_per_price
brand
daewoo 1091.066667 121266.666667 111.145057
rover 1586.492308 138230.769231 87.129807
daihatsu 1612.233333 116833.333333 72.466765
renault 2443.572239 128149.279790 52.443418
saab 3192.269231 144487.179487 45.261589
opel 2953.367582 129441.136931 43.828319
fiat 2791.086151 117367.149758 42.050708
peugeot 3069.413914 127315.530569 41.478775
lancia 3170.574074 123055.555556 38.811759
chrysler 3519.384615 132366.863905 37.610798
mitsubishi 3379.477041 126556.122449 37.448434
ford 3720.407275 124360.465116 33.426573
alfa_romeo 4054.645768 131379.310345 32.402167
subaru 3922.069307 125544.554455 32.009775
citroen 3753.119469 120066.371681 31.991087
lada 2688.296296 83518.518519 31.067453
trabant 1816.515152 55909.090909 30.778213
mazda 4066.230559 124665.757162 30.658802
honda 4041.989474 123000.000000 30.430559
smart 3529.816860 100239.825581 28.398024
volvo 4872.565217 138283.752860 28.380072
seat 4354.731844 121625.698324 27.929549
suzuki 4128.592199 108723.404255 26.334256
nissan 4670.418478 118682.065217 25.411441
volkswagen 5347.828261 128930.913005 24.109023
toyota 5151.940887 116338.259442 22.581443
hyundai 5358.306250 106625.000000 19.899012
kia 5908.502924 112821.637427 19.094793
skoda 6341.280566 110984.555985 17.501915
bmw 8283.439197 132691.204589 16.018854
mercedes_benz 8525.090278 131004.774306 15.366966
chevrolet 6688.084871 98948.339483 14.794719
dacia 5920.382812 84218.750000 14.225220
audi 9239.323025 129406.861561 14.006098
jeep 11590.214953 127102.803738 10.966389
jaguar 11635.493151 124178.082192 10.672352
mini 10541.566986 88899.521531 8.433236
sonstige_autos 12283.775824 90989.010989 7.407251
land_rover 18934.272727 118333.333333 6.249690
porsche 45643.937063 96853.146853 2.121928

Brand Mileage/Price

Here we can see the brands sorted by mileage per dollar.

we can also find the most popular models by brand in the following:

In [442]:
brand_model = df.groupby('brand',  dropna=False)['model'].value_counts()
In [443]:
print(brand_model.nlargest(20))
brand          model      
volkswagen     golf           3859
bmw            3er            2663
volkswagen     polo           1664
opel           corsa          1652
               astra          1393
volkswagen     passat         1374
audi           a4             1251
mercedes_benz  c_klasse       1155
bmw            5er            1145
mercedes_benz  e_klasse        971
audi           a3              860
               a6              809
ford           focus           784
               fiesta          745
volkswagen     transporter     685
renault        twingo          639
peugeot        2_reihe         609
mercedes_benz  a_klasse        575
smart          fortwo          568
opel           vectra          556
Name: model, dtype: int64
In [444]:
damaged=df.unrepaired_damage== 'ja'
undamaged=df.unrepaired_damage=='nein'
In [445]:
df[undamaged].price.describe()
Out[445]:
count     34501.000000
mean       7105.037535
std       10007.036750
min           1.000000
25%        1799.000000
50%        4100.000000
75%        8999.000000
max      350000.000000
Name: price, dtype: float64
In [446]:
df[damaged].price.describe()
Out[446]:
count     4664.000000
mean      2203.812822
std       3468.180983
min          1.000000
25%        500.000000
50%       1000.000000
75%       2500.000000
max      44200.000000
Name: price, dtype: float64

Damaged cars seem to be worth 70% less.

Additional/Outstanding Observations/Questions

  • Datetime related columns converted to datetime
  • Create a length of listing (can be done if last seen != last day maybe)
In [ ]: