We are working with a dataset of used-car listings scraped from the German eBay site:
https://data.world/data-society/used-cars-data
import pandas as pd
import numpy as np
import re
df = pd.read_csv("autos.csv", encoding='Windows-1252')
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
#percent null for column 15 (notRepairedDamage): 40171 of 50000 non-null
(50000 - 40171) / 50000
0.19658
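Rather than computing this by hand, pandas can report the null fraction for every column at once. A quick sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({"unrepaired_damage": ["nein", None, "ja", None],
                    "price": [5000, 8500, 8990, 4350]})
# Fraction of null values per column.
null_frac = toy.isnull().sum() / len(toy)
print(null_frac["unrepaired_damage"])  # 0.5
```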
Column names need to be cleaned up:
#let's start the cleanup
to_clean = df.columns
corrections = {
"dateCrawled":"date_crawled",
"name":"name",
"seller":"seller",
"offerType":"offer_type",
"price":"price",
"abtest":"ab_test",
"vehicleType":"vehicle_type",
"yearOfRegistration":"registration_year",
"gearbox":"gearbox",
"powerPS":"power_ps",
"model":"model",
"odometer":"odometer",
"monthOfRegistration":"registration_month",
"fuelType":"fuel_type",
"brand":"brand",
"notRepairedDamage":"unrepaired_damage",
"dateCreated":"ad_created",
"nrOfPictures":"num_of_pictures",
"postalCode":"postal_code",
"lastSeen":"last_seen"
}
cleaned_col=to_clean.map(corrections)
cleaned_col
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'num_of_pictures', 'postal_code', 'last_seen'], dtype='object')
#looks good, let's replace the original df columns
df.columns = cleaned_col
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date_crawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offer_type 50000 non-null object 4 price 50000 non-null object 5 ab_test 50000 non-null object 6 vehicle_type 44905 non-null object 7 registration_year 50000 non-null int64 8 gearbox 47320 non-null object 9 power_ps 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 registration_month 50000 non-null int64 13 fuel_type 45518 non-null object 14 brand 50000 non-null object 15 unrepaired_damage 40171 non-null object 16 ad_created 50000 non-null object 17 num_of_pictures 50000 non-null int64 18 postal_code 50000 non-null int64 19 last_seen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
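The explicit mapping dict is verbose but safe. A generic camelCase-to-snake_case regex is tempting, but it mishandles acronyms like `powerPS` (yielding `power_p_s` rather than `power_ps`), and `Index.map` with a dict silently produces NaN for any name missing from the mapping. A sketch of the regex approach, for comparison:

```python
import re

def to_snake(name: str) -> str:
    # Insert an underscore before every capital (except at the start), then lowercase.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

print(to_snake("dateCrawled"))  # date_crawled
print(to_snake("powerPS"))     # power_p_s -- why the explicit dict wins here
```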
Let's take a quick look at the data.
df.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
df.describe(include='all')
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-08 10:40:35 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
#Drop columns that are overwhelmingly one value
to_drop = ['seller', 'offer_type', 'num_of_pictures']
df = df.drop(to_drop, axis=1)
#Now let's convert price and odometer to numeric
#remove non-numeric characters and then convert type
to_numeric = ['price','odometer']
for c in to_numeric:
    # strip every non-digit character, then cast to int
    # (regex=True is needed: newer pandas treats the pattern as a literal by default)
    df[c] = df[c].str.replace(r'\D+', '', regex=True).astype(int)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date_crawled 50000 non-null object 1 name 50000 non-null object 2 price 50000 non-null int32 3 ab_test 50000 non-null object 4 vehicle_type 44905 non-null object 5 registration_year 50000 non-null int64 6 gearbox 47320 non-null object 7 power_ps 50000 non-null int64 8 model 47242 non-null object 9 odometer 50000 non-null int32 10 registration_month 50000 non-null int64 11 fuel_type 45518 non-null object 12 brand 50000 non-null object 13 unrepaired_damage 40171 non-null object 14 ad_created 50000 non-null object 15 postal_code 50000 non-null int64 16 last_seen 50000 non-null object dtypes: int32(2), int64(4), object(11) memory usage: 6.1+ MB
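A more defensive alternative, sketched below on toy strings, is `pd.to_numeric` with `errors='coerce'`, which turns anything unparseable into NaN instead of raising partway through the conversion:

```python
import pandas as pd

raw = pd.Series(["$5,000", "$8,500", "150,000km", None])
digits = raw.str.replace(r"\D+", "", regex=True)
# errors='coerce' maps empty or missing strings to NaN instead of raising.
values = pd.to_numeric(digits, errors="coerce")
print(values.tolist())  # [5000.0, 8500.0, 150000.0, nan]
```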
#rename odometer to odometer_km, recording the units we just stripped
#(next time, better to capture the units before removing them)
df.rename(columns={'odometer':'odometer_km'}, inplace=True)
df.price.describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
#that scientific notation isn't helpful. let's take another look
df.price.describe().apply(lambda x: format(x, '.2f'))
count 50000.00 mean 9840.04 std 481104.38 min 0.00 25% 1100.00 50% 2950.00 75% 7200.00 max 99999999.00 Name: price, dtype: object
df.price.value_counts(normalize=True)
0 0.02842 500 0.01562 1500 0.01468 2500 0.01286 1000 0.01278 ... 20790 0.00002 8970 0.00002 846 0.00002 2895 0.00002 33980 0.00002 Name: price, Length: 2357, dtype: float64
df.price.value_counts().sort_index(ascending=False).head(20)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 Name: price, dtype: int64
As an aside, roughly 10% of power_ps values are zero; we may want to handle those later.
df.price.value_counts().sort_index(ascending=True).head(20)
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 Name: price, dtype: int64
Let's remove price outliers. The zeros should probably be removed, as they were likely unsold listings without any bids, and a few cars are listed above a million at prices that don't look real (e.g. 12345678).
df = df[df.price.between(1,350000)]
df.price.describe()
count 48565.000000 mean 5888.935591 std 9059.854754 min 1.000000 25% 1200.000000 50% 3000.000000 75% 7490.000000 max 350000.000000 Name: price, dtype: float64
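Note that `Series.between` is inclusive on both ends by default, which is why the max in the summary above is exactly 350,000. A toy check:

```python
import pandas as pd

prices = pd.Series([0, 1, 500, 350_000, 12_345_678])
# between() keeps both boundary values by default.
kept = prices.between(1, 350_000)
print(kept.tolist())  # [False, True, True, True, False]
```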
Let's see if any other numeric columns have issues.
df.odometer_km.value_counts()
150000 31414 125000 5057 100000 2115 90000 1734 80000 1415 70000 1217 60000 1155 50000 1012 5000 836 40000 815 30000 780 20000 762 10000 253 Name: odometer_km, dtype: int64
The odometer_km series looks fine, with no outlier problems. The datetime data, however, might need some help: date_crawled, last_seen, and ad_created are stored as objects (strings).
df[["date_crawled",'ad_created','last_seen']].head()
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
#let's examine each of these a little further, starting with date_crawled
df.date_crawled.str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.025327 2016-03-06 0.014043 2016-03-07 0.036014 2016-03-08 0.033296 2016-03-09 0.033090 2016-03-10 0.032184 2016-03-11 0.032575 2016-03-12 0.036920 2016-03-13 0.015670 2016-03-14 0.036549 2016-03-15 0.034284 2016-03-16 0.029610 2016-03-17 0.031628 2016-03-18 0.012911 2016-03-19 0.034778 2016-03-20 0.037887 2016-03-21 0.037373 2016-03-22 0.032987 2016-03-23 0.032225 2016-03-24 0.029342 2016-03-25 0.031607 2016-03-26 0.032204 2016-03-27 0.031092 2016-03-28 0.034860 2016-03-29 0.034099 2016-03-30 0.033687 2016-03-31 0.031834 2016-04-01 0.033687 2016-04-02 0.035478 2016-04-03 0.038608 2016-04-04 0.036487 2016-04-05 0.013096 2016-04-06 0.003171 2016-04-07 0.001400 Name: date_crawled, dtype: float64
It looks like the crawling happened at a fairly steady rate over about a month.
df.last_seen.str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.001071 2016-03-06 0.004324 2016-03-07 0.005395 2016-03-08 0.007413 2016-03-09 0.009595 2016-03-10 0.010666 2016-03-11 0.012375 2016-03-12 0.023783 2016-03-13 0.008895 2016-03-14 0.012602 2016-03-15 0.015876 2016-03-16 0.016452 2016-03-17 0.028086 2016-03-18 0.007351 2016-03-19 0.015834 2016-03-20 0.020653 2016-03-21 0.020632 2016-03-22 0.021373 2016-03-23 0.018532 2016-03-24 0.019767 2016-03-25 0.019211 2016-03-26 0.016802 2016-03-27 0.015649 2016-03-28 0.020859 2016-03-29 0.022341 2016-03-30 0.024771 2016-03-31 0.023783 2016-04-01 0.022794 2016-04-02 0.024915 2016-04-03 0.025203 2016-04-04 0.024483 2016-04-05 0.124761 2016-04-06 0.221806 2016-04-07 0.131947 Name: last_seen, dtype: float64
last_seen is concentrated towards the end of the crawl, probably because many listings were still live (unsold or not yet removed) when crawling stopped.
df.ad_created.str[:10].value_counts(dropna=False).loc[lambda x: x>2].sort_index()
2016-01-27 3 2016-02-19 3 2016-02-21 3 2016-02-23 4 2016-02-25 3 2016-02-27 6 2016-02-28 10 2016-02-29 8 2016-03-01 5 2016-03-02 5 2016-03-03 42 2016-03-04 72 2016-03-05 1112 2016-03-06 744 2016-03-07 1687 2016-03-08 1618 2016-03-09 1610 2016-03-10 1549 2016-03-11 1598 2016-03-12 1785 2016-03-13 826 2016-03-14 1709 2016-03-15 1652 2016-03-16 1463 2016-03-17 1519 2016-03-18 660 2016-03-19 1636 2016-03-20 1843 2016-03-21 1825 2016-03-22 1593 2016-03-23 1557 2016-03-24 1422 2016-03-25 1542 2016-03-26 1567 2016-03-27 1505 2016-03-28 1699 2016-03-29 1653 2016-03-30 1627 2016-03-31 1548 2016-04-01 1636 2016-04-02 1707 2016-04-03 1887 2016-04-04 1790 2016-04-05 574 2016-04-06 158 2016-04-07 61 Name: ad_created, dtype: int64
Most of these ads were created recently, within the crawl window, though a few listings endured for longer than a month.
Finally, we should drop the time portion from ad_created, since it carries no information (it is always 00:00:00).
df['ad_created'] = df['ad_created'].str[:10]
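Rather than slicing strings, these columns could be parsed into real datetimes with `pd.to_datetime`, which unlocks the vectorized `.dt` accessor; a sketch:

```python
import pandas as pd

crawled = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"])
as_dt = pd.to_datetime(crawled)
# .dt gives vectorized access to date components.
print(as_dt.dt.day.tolist())  # [26, 4]
```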
#let's look at registration year
df.registration_year.describe()
count 48565.000000 mean 2004.755421 std 88.643887 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The min (1000) and max (9999) certainly don't make sense; anything after 2017 is impossible, since the data was crawled in 2016. The 25%-75% range looks reasonable, though.
1-(df.registration_year.between(1950,2017).sum()/df.shape[0])
0.010542571810974954
We should drop this ~1% of cars whose registration year falls outside 1950-2017. I included 2017 since manufacturers sometimes sell next year's model ahead of time (e.g. buying a 2018 model at the end of 2017).
Pre-clean row count: 48,565.
df = df[df.registration_year.between(1950,2017)]
df.registration_year.value_counts(normalize=True).sort_index().tail(30)
1988 0.002809 1989 0.003621 1990 0.007221 1991 0.007055 1992 0.007700 1993 0.008844 1994 0.013090 1995 0.025534 1996 0.028573 1997 0.040601 1998 0.049175 1999 0.060288 2000 0.065677 2001 0.054856 2002 0.051735 2003 0.056167 2004 0.056250 2005 0.061099 2006 0.055564 2007 0.047385 2008 0.046095 2009 0.043390 2010 0.033068 2011 0.033775 2012 0.027262 2013 0.016711 2014 0.013797 2015 0.008158 2016 0.025389 2017 0.028968 Name: registration_year, dtype: float64
df.registration_year.describe()
count 48053.000000 mean 2003.349177 std 7.310371 min 1950.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 2017.000000 Name: registration_year, dtype: float64
That looks better. The majority of cars are less than 20 years old, and we were right to include 2017.
df.brand.value_counts()
volkswagen 10219 bmw 5230 opel 5207 mercedes_benz 4608 audi 4139 ford 3354 renault 2291 peugeot 1423 fiat 1242 seat 895 skoda 777 nissan 736 mazda 733 smart 688 citroen 678 toyota 609 hyundai 480 sonstige_autos 455 volvo 437 mini 418 mitsubishi 392 honda 380 kia 342 alfa_romeo 319 porsche 286 suzuki 282 chevrolet 271 chrysler 169 dacia 128 daihatsu 120 jeep 107 subaru 101 land_rover 99 saab 78 daewoo 75 jaguar 73 trabant 66 rover 65 lancia 54 lada 27 Name: brand, dtype: int64
brand_counts = df.brand.value_counts(normalize=True)
common_brands = brand_counts[brand_counts > .01]
common_brands.shape
(16,)
Here I was looking at whether we should remove some less popular brands, but the total number of brands (40) is not too large, and retaining all of them should be more interesting.
Now we can take a look at price and mileage by brand.
brand_price = df.groupby('brand')['price']
brand_price_describe = brand_price.describe().rename(columns={'mean':'mean_price'})
brand_price_describe.head()
count | mean_price | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
brand | ||||||||
alfa_romeo | 319.0 | 4054.645768 | 5099.961862 | 1.0 | 994.5 | 2390.0 | 5316.5 | 55000.0 |
audi | 4139.0 | 9239.323025 | 9972.875540 | 1.0 | 2216.0 | 6200.0 | 12900.0 | 175000.0 |
bmw | 5230.0 | 8283.439197 | 9807.309036 | 1.0 | 2300.0 | 5900.0 | 11000.0 | 259000.0 |
chevrolet | 271.0 | 6688.084871 | 7073.388239 | 1.0 | 2300.0 | 4350.0 | 8999.5 | 56500.0 |
chrysler | 169.0 | 3519.384615 | 3925.459891 | 70.0 | 999.0 | 2250.0 | 3900.0 | 20000.0 |
brand_info = pd.DataFrame(brand_price_describe, columns=["mean_price"])
brand_info.head()
mean_price | |
---|---|
brand | |
alfa_romeo | 4054.645768 |
audi | 9239.323025 |
bmw | 8283.439197 |
chevrolet | 6688.084871 |
chrysler | 3519.384615 |
brand_mileage = df.groupby('brand')['odometer_km']
brand_mileage_describe = brand_mileage.describe()
brand_info['mean_mileage'] = brand_mileage_describe['mean']
brand_info = brand_info.sort_values(by=['mean_price','mean_mileage'], ascending=False)
brand_info
mean_price | mean_mileage | |
---|---|---|
brand | ||
porsche | 45643.937063 | 96853.146853 |
land_rover | 18934.272727 | 118333.333333 |
sonstige_autos | 12283.775824 | 90989.010989 |
jaguar | 11635.493151 | 124178.082192 |
jeep | 11590.214953 | 127102.803738 |
mini | 10541.566986 | 88899.521531 |
audi | 9239.323025 | 129406.861561 |
mercedes_benz | 8525.090278 | 131004.774306 |
bmw | 8283.439197 | 132691.204589 |
chevrolet | 6688.084871 | 98948.339483 |
skoda | 6341.280566 | 110984.555985 |
dacia | 5920.382812 | 84218.750000 |
kia | 5908.502924 | 112821.637427 |
hyundai | 5358.306250 | 106625.000000 |
volkswagen | 5347.828261 | 128930.913005 |
toyota | 5151.940887 | 116338.259442 |
volvo | 4872.565217 | 138283.752860 |
nissan | 4670.418478 | 118682.065217 |
seat | 4354.731844 | 121625.698324 |
suzuki | 4128.592199 | 108723.404255 |
mazda | 4066.230559 | 124665.757162 |
alfa_romeo | 4054.645768 | 131379.310345 |
honda | 4041.989474 | 123000.000000 |
subaru | 3922.069307 | 125544.554455 |
citroen | 3753.119469 | 120066.371681 |
ford | 3720.407275 | 124360.465116 |
smart | 3529.816860 | 100239.825581 |
chrysler | 3519.384615 | 132366.863905 |
mitsubishi | 3379.477041 | 126556.122449 |
saab | 3192.269231 | 144487.179487 |
lancia | 3170.574074 | 123055.555556 |
peugeot | 3069.413914 | 127315.530569 |
opel | 2953.367582 | 129441.136931 |
fiat | 2791.086151 | 117367.149758 |
lada | 2688.296296 | 83518.518519 |
renault | 2443.572239 | 128149.279790 |
trabant | 1816.515152 | 55909.090909 |
daihatsu | 1612.233333 | 116833.333333 |
rover | 1586.492308 | 138230.769231 |
daewoo | 1091.066667 | 121266.666667 |
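The two separate groupbys above can be collapsed into a single named-aggregation call, which builds both summary columns at once; a sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "audi"],
    "price": [8000, 9000, 9200],
    "odometer_km": [150_000, 125_000, 100_000],
})
# Named aggregation: one groupby, one output column per statistic.
brand_summary = toy.groupby("brand").agg(
    mean_price=("price", "mean"),
    mean_mileage=("odometer_km", "mean"),
)
print(brand_summary.loc["bmw", "mean_price"])  # 8500.0
```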
brand_info['mileage_per_price']=brand_info['mean_mileage']/brand_info['mean_price']
brand_info.sort_values(by=['mileage_per_price'], ascending=False)
mean_price | mean_mileage | mileage_per_price | |
---|---|---|---|
brand | |||
daewoo | 1091.066667 | 121266.666667 | 111.145057 |
rover | 1586.492308 | 138230.769231 | 87.129807 |
daihatsu | 1612.233333 | 116833.333333 | 72.466765 |
renault | 2443.572239 | 128149.279790 | 52.443418 |
saab | 3192.269231 | 144487.179487 | 45.261589 |
opel | 2953.367582 | 129441.136931 | 43.828319 |
fiat | 2791.086151 | 117367.149758 | 42.050708 |
peugeot | 3069.413914 | 127315.530569 | 41.478775 |
lancia | 3170.574074 | 123055.555556 | 38.811759 |
chrysler | 3519.384615 | 132366.863905 | 37.610798 |
mitsubishi | 3379.477041 | 126556.122449 | 37.448434 |
ford | 3720.407275 | 124360.465116 | 33.426573 |
alfa_romeo | 4054.645768 | 131379.310345 | 32.402167 |
subaru | 3922.069307 | 125544.554455 | 32.009775 |
citroen | 3753.119469 | 120066.371681 | 31.991087 |
lada | 2688.296296 | 83518.518519 | 31.067453 |
trabant | 1816.515152 | 55909.090909 | 30.778213 |
mazda | 4066.230559 | 124665.757162 | 30.658802 |
honda | 4041.989474 | 123000.000000 | 30.430559 |
smart | 3529.816860 | 100239.825581 | 28.398024 |
volvo | 4872.565217 | 138283.752860 | 28.380072 |
seat | 4354.731844 | 121625.698324 | 27.929549 |
suzuki | 4128.592199 | 108723.404255 | 26.334256 |
nissan | 4670.418478 | 118682.065217 | 25.411441 |
volkswagen | 5347.828261 | 128930.913005 | 24.109023 |
toyota | 5151.940887 | 116338.259442 | 22.581443 |
hyundai | 5358.306250 | 106625.000000 | 19.899012 |
kia | 5908.502924 | 112821.637427 | 19.094793 |
skoda | 6341.280566 | 110984.555985 | 17.501915 |
bmw | 8283.439197 | 132691.204589 | 16.018854 |
mercedes_benz | 8525.090278 | 131004.774306 | 15.366966 |
chevrolet | 6688.084871 | 98948.339483 | 14.794719 |
dacia | 5920.382812 | 84218.750000 | 14.225220 |
audi | 9239.323025 | 129406.861561 | 14.006098 |
jeep | 11590.214953 | 127102.803738 | 10.966389 |
jaguar | 11635.493151 | 124178.082192 | 10.672352 |
mini | 10541.566986 | 88899.521531 | 8.433236 |
sonstige_autos | 12283.775824 | 90989.010989 | 7.407251 |
land_rover | 18934.272727 | 118333.333333 | 6.249690 |
porsche | 45643.937063 | 96853.146853 | 2.121928 |
brand_model = df.groupby('brand', dropna=False)['model'].value_counts()
print(brand_model.nlargest(20))
brand model volkswagen golf 3859 bmw 3er 2663 volkswagen polo 1664 opel corsa 1652 astra 1393 volkswagen passat 1374 audi a4 1251 mercedes_benz c_klasse 1155 bmw 5er 1145 mercedes_benz e_klasse 971 audi a3 860 a6 809 ford focus 784 fiesta 745 volkswagen transporter 685 renault twingo 639 peugeot 2_reihe 609 mercedes_benz a_klasse 575 smart fortwo 568 opel vectra 556 Name: model, dtype: int64
damaged = df.unrepaired_damage == 'ja'
undamaged = df.unrepaired_damage == 'nein'
df[undamaged].price.describe()
count 34501.000000 mean 7105.037535 std 10007.036750 min 1.000000 25% 1799.000000 50% 4100.000000 75% 8999.000000 max 350000.000000 Name: price, dtype: float64
df[damaged].price.describe()
count 4664.000000 mean 2203.812822 std 3468.180983 min 1.000000 25% 500.000000 50% 1000.000000 75% 2500.000000 max 44200.000000 Name: price, dtype: float64
On average, cars with unrepaired damage sell for roughly 70% less (mean ~2,204 vs ~7,105).
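The ~70% figure comes from comparing the two group means. The same comparison can be done in one groupby; a sketch on toy data:

```python
import pandas as pd

toy = pd.DataFrame({"unrepaired_damage": ["ja", "nein", "nein", "ja"],
                    "price": [1000, 4000, 5000, 1200]})
means = toy.groupby("unrepaired_damage")["price"].mean()
# Relative discount of damaged vs undamaged cars.
discount = 1 - means["ja"] / means["nein"]
print(round(discount, 2))  # 0.76
```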