We are working with a German eBay used-car dataset: https://data.world/data-society/used-cars-data.
import pandas as pd
import numpy as np
import re
#the file isn't valid UTF-8, so read it with the Windows-1252 encoding
df = pd.read_csv("autos.csv", encoding='Windows-1252')
df.info()
#share of non-null values in notRepairedDamage (column 15): ~81%, so ~19% missing
299468/371528
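Rather than hard-coding counts from info(), the null share of every column can be computed directly; a minimal sketch:

#fraction of null values per column, largest first
df.isnull().mean().sort_values(ascending=False)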
The column names need to be cleaned up:
#let's start the cleanup
to_clean = df.columns
corrections = {
"dateCrawled":"date_crawled",
"name":"name",
"seller":"seller",
"offerType":"offer_type",
"price":"price",
"abtest":"ab_test",
"vehicleType":"vehicle_type",
"yearOfRegistration":"registration_year",
"gearbox":"gearbox",
"powerPS":"power_ps",
"model":"model",
"odometer":"odometer",
"monthOfRegistration":"registration_month",
"fuelType":"fuel_type",
"brand":"brand",
"notRepairedDamage":"unrepaired_damage",
"dateCreated":"ad_created",
"nrOfPictures":"num_of_pictures",
"postalCode":"postal_code",
"lastSeen":"last_seen"
}
cleaned_col = to_clean.map(corrections)
cleaned_col
#looks good, let's replace the original df columns
df.columns = cleaned_col
df.info()
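As an aside, since re is already imported, the purely mechanical camelCase-to-snake_case renames could be generated with a small helper; a sketch (to_snake_case is illustrative, and the explicit dict above is still needed for non-mechanical renames like yearOfRegistration -> registration_year or powerPS -> power_ps):

def to_snake_case(name):
    #insert an underscore before each interior uppercase letter, then lowercase
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

to_snake_case('offerType')  # 'offer_type'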
Let's take a quick look at the data.
df.head()
df.describe(include='all')
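describe() shows a few columns dominated by a single value (seller, offer_type, num_of_pictures). The same check can be done programmatically; a sketch:

#share of rows taken up by each column's most frequent value
top_share = df.apply(lambda s: s.value_counts(normalize=True, dropna=False).iloc[0])
top_share.sort_values(ascending=False).head()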
#Drop columns that are overwhelmingly one value
to_drop = ['seller', 'offer_type', 'num_of_pictures']
df = df.drop(to_drop, axis=1)
#Now let's convert price and odometer to numeric
#remove non-numeric characters and then convert type
to_numeric = ['price','odometer']
for c in to_numeric:
    df[c] = df[c].str.replace(r'\D+', '', regex=True).astype(int)
df.info()
#let's rename odometer to note the km units we just stripped out
#(in hindsight, doing this rename before the cleanup would have been cleaner)
df.rename(columns={'odometer':'odometer_km'}, inplace=True)
df.price.describe()
#that scientific notation isn't helpful. let's take another look
df.price.describe().apply(lambda x: format(x, '.2f'))
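Alternatively, scientific notation can be turned off for all subsequent output through pandas' display options; a sketch:

#render floats with two decimals everywhere (pd.reset_option('display.float_format') undoes it)
pd.set_option('display.float_format', lambda x: f'{x:.2f}')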
df.price.value_counts(normalize=True)
df.price.value_counts().sort_index(ascending=False).head(20)
(Side note: about 10% of power_ps values are zero.)
df.price.value_counts().sort_index(ascending=True).head(20)
Let's remove outliers from price. Zeros should probably be removed, as they were likely unsold listings without any bids, and a few cars are listed above a million with prices that don't look real (e.g. 12345678).
df = df[df.price.between(1,350000)]
df.price.describe()
Let's see if we can find issues with any other numeric columns.
df.odometer_km.value_counts()
The odometer_km series looks fine, with no outlier problems. The datetime data might need some help, though: date_crawled, last_seen, and ad_created are stored as objects (strings).
df[["date_crawled",'ad_created','last_seen']].head()
#let's examine each of these a little further, starting with date_crawled
df.date_crawled.str[:10].value_counts(normalize=True, dropna=False).sort_index()
It looks like the crawling happened fairly consistently over roughly a month.
df.last_seen.str[:10].value_counts(normalize=True, dropna=False).sort_index()
The last_seen dates skew toward the end of the crawl window, probably because many cars hadn't yet sold or had their listings removed.
#keep only creation dates that appear more than twice, to trim the long tail
df.ad_created.str[:10].value_counts(dropna=False).loc[lambda x: x > 2].sort_index()
It looks like most of these ads were created recently, within the crawl window, with a few ads enduring for longer than a month.
Finally, we should drop the time portion from this column, as it provides no information (it is always 00:00:00).
df['ad_created'] = df['ad_created'].str[:10]
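If we wanted real datetime dtypes instead of string slices, pd.to_datetime handles these columns; a minimal sketch (we stick with strings here for simplicity):

#convert the three date columns from strings to datetime64
for col in ['date_crawled', 'ad_created', 'last_seen']:
    df[col] = pd.to_datetime(df[col])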
#let's look at registration year
df.registration_year.describe()
The min and max certainly don't make any sense: the data was crawled in 2016, so anything after 2016 or 2017 is impossible. The 25th-75th percentile range looks reasonable, though.
#fraction of rows whose registration_year falls outside 1950-2017
1 - (df.registration_year.between(1950, 2017).sum() / df.shape[0])
We should drop this ~1% of cars whose registration year is not between 1950 and 2017. I included 2017 since manufacturers do sell some cars half a year ahead of time (e.g. buying a 2018 model at the end of 2017).
Pre-clean row count: 48565.
df = df[df.registration_year.between(1950,2017)]
df.registration_year.value_counts(normalize=True).sort_index().tail(30)
df.registration_year.describe()
That looks better. It seems the majority of cars are less than 20 years old, and we were also correct to include 2017.
df.brand.value_counts()
brand_counts = df.brand.value_counts(normalize=True)
common_brands = brand_counts[brand_counts > .01]
common_brands.shape
Here I was looking at whether we should remove some of the less popular brands, but the number of brands (40) is not too great, and retaining all of them should make the analysis more interesting.
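For reference, had we decided the other way, restricting the frame to the common brands would be a one-liner; a sketch (df_common is illustrative):

#keep only rows whose brand appears in more than 1% of listings
df_common = df[df.brand.isin(common_brands.index)]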
Here we can take a look at price and mileage by brand.
brand_price = df.groupby('brand')['price']
brand_price_describe = brand_price.describe().rename(columns={'mean':'mean_price'})
brand_price_describe.head()
#keep just the mean price per brand
brand_info = brand_price_describe[['mean_price']].copy()
brand_info.head()
brand_mileage = df.groupby('brand')['odometer_km']
brand_mileage_describe = brand_mileage.describe()
brand_info['mean_mileage'] = brand_mileage_describe['mean']
brand_info = brand_info.sort_values(by=['mean_price','mean_mileage'], ascending=False)
brand_info
brand_info['mileage_per_price'] = brand_info['mean_mileage'] / brand_info['mean_price']
brand_info.sort_values(by=['mileage_per_price'], ascending=False)
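The same table could also be built in one step with named aggregation instead of two separate describe() calls; a sketch (brand_info_alt is illustrative):

brand_info_alt = (
    df.groupby('brand')
      .agg(mean_price=('price', 'mean'),
           mean_mileage=('odometer_km', 'mean'))
      .sort_values('mean_price', ascending=False)
)
brand_info_alt['mileage_per_price'] = brand_info_alt['mean_mileage'] / brand_info_alt['mean_price']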
brand_model = df.groupby('brand', dropna=False)['model'].value_counts()
print(brand_model.nlargest(20))
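To reduce that to the single most common model per brand, a sketch (assumes every brand has at least one non-null model):

#most frequent model within each brand
top_model = df.groupby('brand')['model'].agg(lambda s: s.value_counts().idxmax())
top_model.head()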
#'ja' means yes (has unrepaired damage), 'nein' means no
damaged = df.unrepaired_damage == 'ja'
undamaged = df.unrepaired_damage == 'nein'
df[undamaged].price.describe()
df[damaged].price.describe()
Cars with unrepaired damage appear to be worth roughly 70% less than undamaged ones.
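That figure can be checked directly from the group means; a sketch:

means = df.groupby('unrepaired_damage')['price'].mean()
discount = 1 - means['ja'] / means['nein']
print(f"Cars with unrepaired damage sell for about {discount:.0%} less on average.")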