In this project we will work with a modified data set from a classifieds section of the German Ebay website. The goal of this project is to clean and analyze the data.
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding='Latin-1')
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Initial impressions of the data:
'dateCrawled' and 'lastSeenOnline' - might not be relevent to our ananlysis so might delete 'price' - convert to float 'model' - put this next to the 'brand' column for ease of reading 'odometer' - convert to int 'monthOfRegistration' - maybe convert to the full name of the month 'dateCreated' - convert to date
Some of the columns have null values that we will have to investigate.
Column titles we can clean, add spaces, and use lowercase.
First let's convert the column names from camelcase to snakecase and re-word some of them to make it more clear:
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
mapping_dictionary = {'dateCrawled':'date_crawled', 'offerType':'offer_type', 'abtest':'ab_test','vehicleType':'vehicle_type',
'yearOfRegistration':'registration_year', 'powerPS':'power_ps', 'monthOfRegistration':'registration_month',
'fuelType':'fuel_type', 'notRepairedDamage':'unrepaired_damage',
'dateCreated':'ad_created', 'nrOfPictures':'nr_of_pictures', 'postaCode':'postal_code', 'lastSeen':'last_seen'}
autos.rename(mapper=mapping_dictionary, axis=1, inplace=True)
autos.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postalCode | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postalCode | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-21 20:37:19 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
'seller' and 'offer_type' can be removed becasue amost all of the values are the same 'price', 'odometer' needs to be converted to int 'date_craweled', 'ad_created', and 'last_seen' can be converted to date format.
autos['price'] = autos['price'].str.replace('$','').str.replace(',','').astype(int)
autos['odometer'] = autos['odometer'].str.replace('km','').str.replace(',','').astype(int)
autos.rename({'odometer':'odometer_km'}, axis=1, inplace=True)
Let's continue exploring the data, specifically looking for data that doesn't look right. We'll start by analyzing the odometer_km and price columns.
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
autos['price'].nlargest(20)
39705 99999999 42221 27322222 27371 12345678 39377 12345678 47598 12345678 2897 11111111 24384 11111111 11137 10000000 47634 3890000 7814 1300000 22947 1234566 514 999999 43049 999999 37585 999990 36818 350000 14715 345000 34723 299000 35923 295000 12682 265000 47337 259000 Name: price, dtype: int64
price_bool = autos['price'] < 100
price_bool.value_counts()
False 48238 True 1762 Name: price, dtype: int64
We will remove the entries with prices 999,990 and above since those are either errors or extrememly expense that they may skew the results. There are over 1700 entries with a price less than 100 so we will exclude these as well.
autos = autos.loc[autos['price'].between(100, 350000), :]
autos['odometer_km'].min()
5000
We don't need decimals for the 'odometer_km' column so we will convert it to int. 150,000 km is reasonable for some used cars so there is no need to remove any on this basis. The lowest is 5000 km which is also reasonable.
autos['odometer_km'].describe()
count 48224.000000 mean 125919.148142 std 39543.339640 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos['price'].describe()
count 48224.000000 mean 5930.371433 std 9078.372762 min 100.000000 25% 1250.000000 50% 3000.000000 75% 7499.000000 max 350000.000000 Name: price, dtype: float64
Observations about the remaining values for 'price' and 'odometer_km':
The mean milage is over 125000 km so most of our entries are vehicles with high km. 75% of the vehicles have 125000 kilometers or more.
The average price is 5930.
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.025361 2016-03-06 0.014039 2016-03-07 0.036061 2016-03-08 0.033179 2016-03-09 0.033013 2016-03-10 0.032287 2016-03-11 0.032598 2016-03-12 0.036911 2016-03-13 0.015677 2016-03-14 0.036662 2016-03-15 0.034319 2016-03-16 0.029467 2016-03-17 0.031499 2016-03-18 0.012898 2016-03-19 0.034734 2016-03-20 0.037803 2016-03-21 0.037201 2016-03-22 0.032888 2016-03-23 0.032287 2016-03-24 0.029446 2016-03-25 0.031499 2016-03-26 0.032308 2016-03-27 0.031126 2016-03-28 0.034962 2016-03-29 0.034112 2016-03-30 0.033738 2016-03-31 0.031851 2016-04-01 0.033697 2016-04-02 0.035605 2016-04-03 0.038611 2016-04-04 0.036538 2016-04-05 0.013064 2016-04-06 0.003173 2016-04-07 0.001389 Name: date_crawled, dtype: float64
All of the entries were crawled in either March or April of 2016 with most of them being craweled in March
autos['ad_created'].str[:7].value_counts(normalize=True, dropna=False)
2016-03 0.837342 2016-04 0.161019 2016-02 0.001265 2016-01 0.000249 2015-12 0.000041 2015-08 0.000021 2015-06 0.000021 2015-11 0.000021 2015-09 0.000021 Name: ad_created, dtype: float64
Over 83% of the ad were created in March of 2016. The total range for all ad creation is from June 2015 through April 2016.
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.001078 2016-03-06 0.004313 2016-03-07 0.005433 2016-03-08 0.007320 2016-03-09 0.009580 2016-03-10 0.010638 2016-03-11 0.012400 2016-03-12 0.023785 2016-03-13 0.008875 2016-03-14 0.012629 2016-03-15 0.015863 2016-03-16 0.016444 2016-03-17 0.028098 2016-03-18 0.007320 2016-03-19 0.015760 2016-03-20 0.020654 2016-03-21 0.020550 2016-03-22 0.021359 2016-03-23 0.018580 2016-03-24 0.019762 2016-03-25 0.019098 2016-03-26 0.016672 2016-03-27 0.015552 2016-03-28 0.020840 2016-03-29 0.022292 2016-03-30 0.024697 2016-03-31 0.023826 2016-04-01 0.022852 2016-04-02 0.024884 2016-04-03 0.025133 2016-04-04 0.024531 2016-04-05 0.125062 2016-04-06 0.221964 2016-04-07 0.132154 Name: last_seen, dtype: float64
The date range for when the ad was last seen is from March 5th, 2016 to April 7th, 2016
autos['registration_year'].describe()
count 48224.000000 mean 2004.730964 std 87.897388 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
Some things stand out from the 'registration_year' column:
the minimum vallue is 1000 and the maximum value is 9999. The registration year can't be after the listing was seen so any entry after 2016 is inaccurate. The earliest possible registration date would be in the early 1900's when cars were invented. Let's take a look at some of the data to determine if we need to fix anything that may be unrealistic.
autos[autos['registration_year'] < 1940]
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postalCode | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2221 | 2016-03-15 14:57:07 | Sehr_seltener_Oldtimer_Opel_1210_zum_Restaurieren | privat | Angebot | 3350 | control | andere | 1934 | manuell | 0 | andere | 5000 | 0 | benzin | opel | ja | 2016-03-15 00:00:00 | 0 | 49828 | 2016-04-06 06:17:51 |
2573 | 2016-03-19 22:51:25 | Hanomag_rekord_15k_Suche_ersatz_teile | privat | Angebot | 3000 | test | andere | 1934 | NaN | 0 | NaN | 90000 | 1 | benzin | sonstige_autos | nein | 2016-03-19 00:00:00 | 0 | 90489 | 2016-03-19 22:51:25 |
10556 | 2016-04-01 06:02:10 | UNFAL_Auto | privat | Angebot | 450 | control | NaN | 1800 | NaN | 1800 | NaN | 5000 | 2 | NaN | mitsubishi | nein | 2016-04-01 00:00:00 | 0 | 63322 | 2016-04-01 09:42:30 |
11246 | 2016-03-26 19:49:59 | Ford_Model_A_Roadster_Deluxe_1931 | privat | Angebot | 27500 | control | cabrio | 1931 | manuell | 39 | andere | 10000 | 7 | benzin | ford | nein | 2016-03-26 00:00:00 | 0 | 9322 | 2016-04-06 09:46:59 |
21416 | 2016-03-12 08:36:21 | Essex_super_six__Ford_A | privat | Angebot | 16500 | control | cabrio | 1927 | manuell | 40 | andere | 5000 | 5 | benzin | ford | NaN | 2016-03-12 00:00:00 | 0 | 74821 | 2016-03-15 12:45:12 |
21421 | 2016-03-05 17:45:32 | Ford_Business_Coupe_Hotrod_Projekt.1937 | privat | Angebot | 7000 | test | coupe | 1937 | manuell | 85 | andere | 5000 | 8 | benzin | ford | ja | 2016-03-05 00:00:00 | 0 | 8359 | 2016-04-07 10:44:39 |
22101 | 2016-03-09 16:51:17 | BMW_Andere | privat | Angebot | 11500 | test | cabrio | 1929 | manuell | 15 | andere | 5000 | 1 | NaN | bmw | ja | 2016-03-09 00:00:00 | 0 | 70569 | 2016-04-07 06:17:11 |
22316 | 2016-03-29 16:56:41 | VW_Kaefer.__Zwei_zum_Preis_von_einem. | privat | Angebot | 1500 | control | NaN | 1000 | manuell | 0 | kaefer | 5000 | 0 | benzin | volkswagen | NaN | 2016-03-29 00:00:00 | 0 | 48324 | 2016-03-31 10:15:28 |
22659 | 2016-03-14 08:51:18 | Opel_Corsa_B | privat | Angebot | 500 | test | NaN | 1910 | NaN | 0 | corsa | 150000 | 0 | NaN | opel | NaN | 2016-03-14 00:00:00 | 0 | 52393 | 2016-04-03 07:53:55 |
23804 | 2016-03-30 22:42:50 | Mercedes_Benz_170V | privat | Angebot | 4800 | control | limousine | 1937 | manuell | 38 | andere | 5000 | 0 | benzin | mercedes_benz | NaN | 2016-03-30 00:00:00 | 0 | 23936 | 2016-03-30 22:42:50 |
24511 | 2016-03-17 19:45:11 | Trabant__wartburg__Ostalgie | privat | Angebot | 490 | control | NaN | 1111 | NaN | 0 | NaN | 5000 | 0 | NaN | trabant | NaN | 2016-03-17 00:00:00 | 0 | 16818 | 2016-04-07 07:17:29 |
24855 | 2016-03-31 13:57:18 | Andere_DKW_F8_Cabrio_Limosine | privat | Angebot | 26600 | control | cabrio | 1939 | NaN | 0 | NaN | 5000 | 7 | NaN | sonstige_autos | ja | 2016-03-31 00:00:00 | 0 | 55278 | 2016-04-06 07:15:57 |
26103 | 2016-03-16 12:37:38 | Oldtimer_Sport___Cabrio_Georges_Irat | privat | Angebot | 8500 | test | cabrio | 1938 | manuell | 0 | NaN | 5000 | 6 | benzin | sonstige_autos | ja | 2016-03-16 00:00:00 | 0 | 61118 | 2016-03-16 19:43:33 |
26607 | 2016-03-21 14:45:26 | MG_Andere | privat | Angebot | 25500 | control | cabrio | 1937 | manuell | 50 | NaN | 50000 | 5 | benzin | sonstige_autos | nein | 2016-03-21 00:00:00 | 0 | 53804 | 2016-04-06 11:17:17 |
28693 | 2016-03-22 17:48:41 | Renault_Twingo | privat | Angebot | 599 | control | kleinwagen | 1910 | manuell | 0 | NaN | 5000 | 0 | benzin | renault | NaN | 2016-03-22 00:00:00 | 0 | 70376 | 2016-04-06 09:16:59 |
32585 | 2016-04-02 16:56:39 | UNFAL_Auto | privat | Angebot | 450 | control | NaN | 1800 | NaN | 1800 | NaN | 5000 | 2 | NaN | mitsubishi | nein | 2016-04-02 00:00:00 | 0 | 63322 | 2016-04-04 14:46:21 |
39725 | 2016-03-25 22:54:31 | Dodge_1937_Hotrod_Flathead_Scheunenfund_Oldtim... | privat | Angebot | 8990 | control | limousine | 1937 | NaN | 0 | NaN | 150000 | 7 | benzin | sonstige_autos | NaN | 2016-03-25 00:00:00 | 0 | 3042 | 2016-04-05 15:47:14 |
49283 | 2016-03-15 18:38:53 | Citroen_HY | privat | Angebot | 7750 | control | NaN | 1001 | NaN | 0 | andere | 5000 | 0 | NaN | citroen | NaN | 2016-03-15 00:00:00 | 0 | 66706 | 2016-04-06 18:47:20 |
The entires with registration years 1800, 1000, 1111, 1800, 1001 are inaccurate. The entry of 1910 for an Opel Corsa B is inaccurate because Opel Corsas were not manufactured until 1982. The entry of 1910 for a Renault Twingo is inaccurate because the Twingo was not manufactured until 1992. The rest of the entries before 1940 are vintage cars which appear to be accurate so we will leave our analysis there.
We will remove entries with registration date before 1911 and after 2016
autos = autos[autos["registration_year"].between(1911, 2016)]
autos['registration_year'].value_counts(normalize=True)
2000 0.066969 2005 0.062805 1999 0.062114 2004 0.058231 2003 0.058101 ... 1927 0.000022 1929 0.000022 1950 0.000022 1948 0.000022 1952 0.000022 Name: registration_year, Length: 77, dtype: float64
We will aggregate by price using brands that make up 5% or greater of total entries. Over 20% of all entries are Volkswagens and the top 4 brands make up over 50% of the data.
autos['brand'].value_counts(normalize=False).head(10)
volkswagen 9799 bmw 5107 opel 4970 mercedes_benz 4480 audi 4022 ford 3237 renault 2181 peugeot 1384 fiat 1187 seat 846 Name: brand, dtype: int64
brands = autos['brand'].value_counts(normalize=True)
significant_brands = brands[brands > 0.05].index
print(significant_brands)
avg_price_by_brand = {}
for b in significant_brands:
selected_brand = autos[autos['brand'] == b]
price_mean = selected_brand['price'].mean()
avg_price_by_brand[b] = int(price_mean)
print(avg_price_by_brand)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object') {'volkswagen': 5436, 'bmw': 8381, 'opel': 3006, 'mercedes_benz': 8672, 'audi': 9380, 'ford': 3779}
The most common car brand in our dataset, Volkswagen, has an average list price of 5,436. Audi is the most expensive at 9,380. The least expenive are Ford and Opel.
For the top 6 brands, let's use aggregation to understand the average mileage for those cars and if there's any visible link with mean price.
avg_milage_by_brand = {}
for b in significant_brands:
selected_brand = autos[autos['brand'] == b]
milage_mean = selected_brand['odometer_km'].mean()
avg_milage_by_brand[b] = int(milage_mean)
print(avg_milage_by_brand)
{'volkswagen': 128799, 'bmw': 132695, 'opel': 129380, 'mercedes_benz': 131025, 'audi': 129245, 'ford': 124277}
Convert both dictionaries to series objects:
apb_series = pd.Series(avg_price_by_brand)
print(apb_series)
amb_series = pd.Series(avg_milage_by_brand)
print(amb_series)
volkswagen 5436 bmw 8381 opel 3006 mercedes_benz 8672 audi 9380 ford 3779 dtype: int64 volkswagen 128799 bmw 132695 opel 129380 mercedes_benz 131025 audi 129245 ford 124277 dtype: int64
price_milage_df = pd.DataFrame(apb_series, columns=['price_mean'])
price_milage_df
price_mean | |
---|---|
volkswagen | 5436 |
bmw | 8381 |
opel | 3006 |
mercedes_benz | 8672 |
audi | 9380 |
ford | 3779 |
price_milage_df['milage_mean'] = amb_series
print(price_milage_df)
price_mean milage_mean volkswagen 5436 128799 bmw 8381 132695 opel 3006 129380 mercedes_benz 8672 131025 audi 9380 129245 ford 3779 124277
The two least expensive brands have the lowest milage readings while the 3 most expensive have the most milage.
autos.head(50)
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postalCode | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | 7900 | test | bus | 2006 | automatik | 150 | voyager | 150000 | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | 300 | test | limousine | 1995 | manuell | 90 | golf | 150000 | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | privat | Angebot | 1990 | control | limousine | 1998 | manuell | 90 | golf | 150000 | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | privat | Angebot | 250 | test | NaN | 2000 | manuell | 0 | arosa | 150000 | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | 590 | control | bus | 1997 | manuell | 90 | megane | 150000 | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
11 | 2016-03-16 18:45:34 | Mercedes_A140_Motorschaden | privat | Angebot | 350 | control | NaN | 2000 | NaN | 0 | NaN | 150000 | 0 | benzin | mercedes_benz | NaN | 2016-03-16 00:00:00 | 0 | 17498 | 2016-03-16 18:45:34 |
12 | 2016-03-31 19:48:22 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | privat | Angebot | 5299 | control | kleinwagen | 2010 | automatik | 71 | fortwo | 50000 | 9 | benzin | smart | nein | 2016-03-31 00:00:00 | 0 | 34590 | 2016-04-06 14:17:52 |
13 | 2016-03-23 10:48:32 | Audi_A3_1.6_tuning | privat | Angebot | 1350 | control | limousine | 1999 | manuell | 101 | a3 | 150000 | 11 | benzin | audi | nein | 2016-03-23 00:00:00 | 0 | 12043 | 2016-04-01 14:17:13 |
14 | 2016-03-23 11:50:46 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | privat | Angebot | 3999 | test | kleinwagen | 2007 | manuell | 75 | clio | 150000 | 9 | benzin | renault | NaN | 2016-03-23 00:00:00 | 0 | 81737 | 2016-04-01 15:46:47 |
15 | 2016-04-01 12:06:20 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | privat | Angebot | 18900 | test | coupe | 1982 | automatik | 203 | NaN | 80000 | 6 | benzin | sonstige_autos | nein | 2016-04-01 00:00:00 | 0 | 61276 | 2016-04-02 21:10:48 |
16 | 2016-03-16 14:59:02 | Opel_Vectra_B_Kombi | privat | Angebot | 350 | test | kombi | 1999 | manuell | 101 | vectra | 150000 | 5 | benzin | opel | nein | 2016-03-16 00:00:00 | 0 | 57299 | 2016-03-18 05:29:37 |
17 | 2016-03-29 11:46:22 | Volkswagen_Scirocco_2_G60 | privat | Angebot | 5500 | test | coupe | 1990 | manuell | 205 | scirocco | 150000 | 6 | benzin | volkswagen | nein | 2016-03-29 00:00:00 | 0 | 74821 | 2016-04-05 20:46:26 |
18 | 2016-03-26 19:57:44 | Verkaufen_mein_bmw_e36_320_i_touring | privat | Angebot | 300 | control | bus | 1995 | manuell | 150 | 3er | 150000 | 0 | benzin | bmw | NaN | 2016-03-26 00:00:00 | 0 | 54329 | 2016-04-02 12:16:41 |
19 | 2016-03-17 13:36:21 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | privat | Angebot | 4150 | control | suv | 2004 | manuell | 124 | andere | 150000 | 2 | lpg | mazda | nein | 2016-03-17 00:00:00 | 0 | 40878 | 2016-03-17 14:45:58 |
20 | 2016-03-05 19:57:31 | Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... | privat | Angebot | 3500 | test | kombi | 2003 | manuell | 131 | a4 | 150000 | 5 | diesel | audi | NaN | 2016-03-05 00:00:00 | 0 | 53913 | 2016-03-07 05:46:46 |
21 | 2016-03-06 19:07:10 | Porsche_911_Carrera_4S_Cabrio | privat | Angebot | 41500 | test | cabrio | 2004 | manuell | 320 | 911 | 150000 | 4 | benzin | porsche | nein | 2016-03-06 00:00:00 | 0 | 65428 | 2016-04-05 23:46:19 |
22 | 2016-03-28 20:50:54 | MINI_Cooper_S_Cabrio | privat | Angebot | 25450 | control | cabrio | 2015 | manuell | 184 | cooper | 10000 | 1 | benzin | mini | nein | 2016-03-28 00:00:00 | 0 | 44789 | 2016-04-01 06:45:30 |
23 | 2016-03-10 19:55:34 | Peugeot_Boxer_2_2_HDi_120_Ps_9_Sitzer_inkl_Klima | privat | Angebot | 7999 | control | bus | 2010 | manuell | 120 | NaN | 150000 | 2 | diesel | peugeot | nein | 2016-03-10 00:00:00 | 0 | 30900 | 2016-03-17 08:45:17 |
24 | 2016-04-03 11:57:02 | BMW_535i_xDrive_Sport_Aut. | privat | Angebot | 48500 | control | limousine | 2014 | automatik | 306 | 5er | 30000 | 12 | benzin | bmw | nein | 2016-04-03 00:00:00 | 0 | 22547 | 2016-04-07 13:16:50 |
26 | 2016-04-03 22:46:28 | Volkswagen_Polo_Fox | privat | Angebot | 777 | control | kleinwagen | 1992 | manuell | 54 | polo | 125000 | 2 | benzin | volkswagen | nein | 2016-04-03 00:00:00 | 0 | 38110 | 2016-04-05 23:46:48 |
28 | 2016-03-19 21:56:19 | MINI_Cooper_D | privat | Angebot | 5250 | control | kleinwagen | 2007 | manuell | 110 | cooper | 150000 | 7 | diesel | mini | ja | 2016-03-19 00:00:00 | 0 | 15745 | 2016-04-07 14:58:48 |
29 | 2016-04-02 12:45:44 | Mercedes_Benz_E_320_T_CDI_Avantgarde_DPF7_Sitz... | privat | Angebot | 4999 | test | kombi | 2004 | automatik | 204 | e_klasse | 150000 | 10 | diesel | mercedes_benz | nein | 2016-04-02 00:00:00 | 0 | 47638 | 2016-04-02 12:45:44 |
31 | 2016-03-14 16:53:09 | Noch_gut_erhaltenen_C_320 | privat | Angebot | 2850 | test | kombi | 2002 | automatik | 218 | c_klasse | 150000 | 7 | benzin | mercedes_benz | nein | 2016-03-14 00:00:00 | 0 | 41065 | 2016-03-16 07:19:04 |
32 | 2016-03-20 05:03:03 | Corsa_mit_TÜV_5.2016 | privat | Angebot | 350 | control | kleinwagen | 1999 | manuell | 0 | corsa | 150000 | 7 | benzin | opel | NaN | 2016-03-20 00:00:00 | 0 | 27619 | 2016-04-06 03:15:20 |
33 | 2016-03-15 12:25:26 | Audi_A3_2.0_TDI_Sportback_DPF_Kupplung_defekt_... | privat | Angebot | 4800 | test | kombi | 2006 | manuell | 140 | a3 | 150000 | 8 | diesel | audi | nein | 2016-03-15 00:00:00 | 0 | 61169 | 2016-04-07 13:15:31 |
34 | 2016-04-05 07:55:07 | Bmw_530d__navi_tuv03/18 | privat | Angebot | 2600 | test | kombi | 2000 | automatik | 193 | 5er | 150000 | 6 | diesel | bmw | NaN | 2016-04-05 00:00:00 | 0 | 95111 | 2016-04-07 12:15:47 |
35 | 2016-03-29 21:52:56 | Ford_Mondeo | privat | Angebot | 1200 | test | kombi | 1998 | manuell | 0 | mondeo | 150000 | 12 | benzin | ford | nein | 2016-03-29 00:00:00 | 0 | 52428 | 2016-04-06 08:45:35 |
36 | 2016-03-16 23:46:31 | Ford_Mondeo_2.0_Turnier_TDCi_DPF_Titanium | privat | Angebot | 7500 | test | kombi | 2007 | manuell | 131 | mondeo | 150000 | 11 | diesel | ford | nein | 2016-03-16 00:00:00 | 0 | 34246 | 2016-04-05 16:21:16 |
37 | 2016-03-23 16:47:46 | Seat_Altea_XL_2.0_TDI_CR_DPF_Style | privat | Angebot | 13800 | test | bus | 2012 | manuell | 140 | altea | 90000 | 2 | diesel | seat | nein | 2016-03-23 00:00:00 | 0 | 85113 | 2016-04-05 20:17:53 |
38 | 2016-03-21 15:51:10 | Volkswagen_Golf_1.4_Special | privat | Angebot | 2850 | control | limousine | 2002 | manuell | 75 | golf | 125000 | 2 | benzin | volkswagen | nein | 2016-03-21 00:00:00 | 0 | 63674 | 2016-03-28 12:16:06 |
39 | 2016-03-11 13:58:38 | Mercedes_C_180_Kombi__zu_verkaufen | privat | Angebot | 1450 | test | kombi | 1999 | automatik | 122 | c_klasse | 150000 | 6 | benzin | mercedes_benz | nein | 2016-03-11 00:00:00 | 0 | 12683 | 2016-04-06 15:17:52 |
40 | 2016-03-07 14:50:03 | VW_Golf__4_Cabrio_2.0_GTI_16V___Leder_MFA_Alus... | privat | Angebot | 3500 | control | cabrio | 1999 | manuell | 150 | golf | 150000 | 1 | benzin | volkswagen | nein | 2016-03-07 00:00:00 | 0 | 6780 | 2016-03-12 02:15:52 |
41 | 2016-03-10 10:46:08 | Passat_3b_1.9_TDI_Highline__angemeldet_mit_tuv... | privat | Angebot | 3200 | test | kombi | 2003 | manuell | 131 | NaN | 150000 | 7 | NaN | volkswagen | nein | 2016-03-10 00:00:00 | 0 | 28259 | 2016-04-06 20:19:08 |
42 | 2016-03-22 21:37:03 | Vw_Polo_l.0_/60_PS_Blue_Motion_Technologie_Son... | privat | Angebot | 11900 | control | kleinwagen | 2014 | manuell | 60 | polo | 20000 | 7 | benzin | volkswagen | NaN | 2016-03-22 00:00:00 | 0 | 26629 | 2016-04-06 18:44:39 |
43 | 2016-03-08 16:51:22 | Audi_A1_1.2_TFSI_Attraction_mit_Garantie | privat | Angebot | 11500 | test | kleinwagen | 2011 | manuell | 86 | a1 | 60000 | 9 | benzin | audi | nein | 2016-03-08 00:00:00 | 0 | 85049 | 2016-04-07 03:16:05 |
44 | 2016-03-12 10:38:06 | Zu_verkaufen | privat | Angebot | 900 | control | kleinwagen | 1998 | manuell | 43 | twingo | 150000 | 7 | benzin | renault | nein | 2016-03-12 00:00:00 | 0 | 79725 | 2016-04-06 01:17:27 |
45 | 2016-03-12 17:39:55 | Toyota_Starlet_1.3 | privat | Angebot | 1000 | test | kleinwagen | 1995 | manuell | 75 | andere | 150000 | 3 | benzin | toyota | nein | 2016-03-12 00:00:00 | 0 | 74597 | 2016-04-07 08:45:19 |
46 | 2016-03-31 10:53:28 | BMW_mit__Lpg | privat | Angebot | 9000 | control | NaN | 2005 | automatik | 0 | NaN | 150000 | 0 | NaN | bmw | NaN | 2016-03-31 00:00:00 | 0 | 12351 | 2016-04-06 03:44:41 |
47 | 2016-03-28 19:51:51 | Mercedes_Benz_A_150 | privat | Angebot | 3499 | control | limousine | 2008 | manuell | 95 | a_klasse | 150000 | 8 | benzin | mercedes_benz | nein | 2016-03-28 00:00:00 | 0 | 59557 | 2016-04-01 05:17:51 |
48 | 2016-03-20 12:41:50 | Mercedes_C180_Sport_1_8l | privat | Angebot | 1099 | test | limousine | 1998 | manuell | 122 | c_klasse | 150000 | 2 | benzin | mercedes_benz | nein | 2016-03-20 00:00:00 | 0 | 83024 | 2016-03-24 03:17:53 |
49 | 2016-03-27 15:36:42 | Mercedes_Benz_CL_500__BlueEFFICIENCY__7G_TRONIC | privat | Angebot | 56000 | control | coupe | 2012 | automatik | 435 | cl | 60000 | 2 | benzin | mercedes_benz | nein | 2016-03-27 00:00:00 | 0 | 34474 | 2016-04-05 11:47:52 |
50 | 2016-03-09 16:48:05 | ??_????????_??Top!!!_Peugeot_308_sw_Sport_?... | privat | Angebot | 5999 | test | kombi | 2008 | manuell | 150 | 3_reihe | 125000 | 8 | benzin | peugeot | nein | 2016-03-09 00:00:00 | 0 | 55569 | 2016-04-07 06:17:09 |
51 | 2016-03-22 12:57:38 | Mercedes_Benz_S_320_CDI | privat | Angebot | 6000 | control | limousine | 2005 | automatik | 204 | s_klasse | 150000 | 7 | diesel | mercedes_benz | ja | 2016-03-22 00:00:00 | 0 | 49492 | 2016-04-04 00:17:51 |
52 | 2016-03-25 18:50:03 | Senator_A_3.0E_Karosserie_restauriert_m._viele... | privat | Angebot | 3500 | test | limousine | 1985 | NaN | 0 | andere | 5000 | 0 | benzin | opel | nein | 2016-03-25 00:00:00 | 0 | 63500 | 2016-04-07 00:46:00 |
53 | 2016-04-02 17:45:01 | Dacia_Sandero_Stepway_II_MPI_85__top_gepflegt | privat | Angebot | 9000 | control | limousine | 2012 | manuell | 84 | sandero | 30000 | 7 | benzin | dacia | nein | 2016-04-02 00:00:00 | 0 | 53639 | 2016-04-06 16:46:50 |
The following columns have German words: 'seller','offer_type', 'vehicle_type', 'gearbox', 'fuel_type', 'unrepaired_damage'.
Let's use the value_counts() method to find out all the words that are contained within each column:
autos['seller'].value_counts()
privat 46349 gewerblich 1 Name: seller, dtype: int64
autos['offer_type'].value_counts()
Angebot 46350 Name: offer_type, dtype: int64
autos['vehicle_type'].value_counts()
limousine 12558 kleinwagen 10497 kombi 8906 bus 4023 cabrio 3008 coupe 2456 suv 1962 andere 388 Name: vehicle_type, dtype: int64
autos['gearbox'].value_counts()
manuell 34546 automatik 9831 Name: gearbox, dtype: int64
autos['fuel_type'].value_counts()
benzin 28380 diesel 14013 lpg 648 cng 70 hybrid 37 elektro 18 andere 14 Name: fuel_type, dtype: int64
autos['unrepaired_damage'].value_counts()
nein 33772 ja 4482 Name: unrepaired_damage, dtype: int64
Here is the common english translation for these words in the context of vehicles:
'seller' privat = private, gewerblich = trade
'offer_type' Agnebot = asking price
'vehicle_type' limousine = sedan, kleinwagen = subcompact, kombi = sport_wagen, bus = van, cabrio = convertible, coupe = couple, suv = suv, andere = other
'gearbox' manuell = manual, automatik = automatic
'fuel_type' benzin = gas, diesel = diesel, lpg = lpg (liquified petroleum gas), cng = cng (compressed natural gas), hybrid = hybrid, elektro = electric, andere = other
'unrepaired_damage' nein = no, ja = yes
Now that we have our translations, lets replace the German words with the English words. To do this we will use the series.map method:
autos['seller'] = autos['seller'].map({'privat':'private', 'gewerblich':'trade'})
autos['offer_type'] = autos['offer_type'].map({'Angebot':'asking price'})
autos['vehicle_type'] = autos['vehicle_type'].map({'limousine':'sedan', 'bus':'bus', 'coupe':'coupe', 'suv':'suv','kleinwagen':'subcompact', 'kombi':'sport wagen', 'bus':'van', 'cabrio':'convertible', 'andere':'other'})
autos['gearbox'] = autos['gearbox'].map({'manuell':'manual', 'automatik':'automatic'})
autos['fuel_type'] = autos['fuel_type'].map({'benzin':'gas', 'diesel':'diesel', 'lpg':'lpg', 'cng':'cng', 'hybrid':'hybrid', 'elektro':'electric', 'andere':'other'})
autos['unrepaired_damage'] = autos['unrepaired_damage'].map({'nein':'no', 'ja':'yes'})
autos.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postalCode | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | private | asking price | 5000 | control | van | 2004 | manual | 158 | andere | 150000 | 3 | lpg | peugeot | no | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | private | asking price | 8500 | control | sedan | 1997 | automatic | 286 | 7er | 150000 | 6 | gas | bmw | no | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | private | asking price | 8990 | test | sedan | 2009 | manual | 102 | golf | 70000 | 7 | gas | volkswagen | no | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | private | asking price | 4350 | control | subcompact | 2007 | automatic | 71 | fortwo | 70000 | 6 | gas | smart | no | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | private | asking price | 1350 | test | sport wagen | 2003 | manual | 0 | focus | 150000 | 7 | gas | ford | no | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Convert the dates to be unifor numeric data so '2016-01-01' becomes integer 20160101
autos['date_crawled'] = autos['date_crawled'].str.replace('-', '').str[:9].astype(int)
autos['ad_created'] = autos['ad_created'].str.replace('-', '').str[:9].astype(int)
autos['last_seen'] = autos['last_seen'].str.replace('-', '').str[:9].astype(int)
autos.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postalCode | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20160326 | Peugeot_807_160_NAVTECH_ON_BOARD | private | asking price | 5000 | control | van | 2004 | manual | 158 | andere | 150000 | 3 | lpg | peugeot | no | 20160326 | 0 | 79588 | 20160406 |
1 | 20160404 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | private | asking price | 8500 | control | sedan | 1997 | automatic | 286 | 7er | 150000 | 6 | gas | bmw | no | 20160404 | 0 | 71034 | 20160406 |
2 | 20160326 | Volkswagen_Golf_1.6_United | private | asking price | 8990 | test | sedan | 2009 | manual | 102 | golf | 70000 | 7 | gas | volkswagen | no | 20160326 | 0 | 35394 | 20160406 |
3 | 20160312 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | private | asking price | 4350 | control | subcompact | 2007 | automatic | 71 | fortwo | 70000 | 6 | gas | smart | no | 20160312 | 0 | 33729 | 20160315 |
4 | 20160401 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | private | asking price | 1350 | test | sport wagen | 2003 | manual | 0 | focus | 150000 | 7 | gas | ford | no | 20160401 | 0 | 39218 | 20160401 |
import operator
vw = autos['brand'] == 'volkswagen'
vw_models = autos[vw]['model']
model_freq_vw = {}
for m in vw_models:
if m not in model_freq_vw:
model_freq_vw[m] = 1
else:
model_freq_vw[m] += 1
sorted_vw = sorted(model_freq_vw.items(), key=operator.itemgetter(1), reverse=True)
sorted_vw[0]
('golf', 3684)
From here we could use this same code over and over for each brand but to speed up the process, we will make a custom function:
import operator #This will let us sort our dictionary by value so that we can get the dictionary sorted by most common models
def brand_model_freq(b):
brand_filter = autos['brand'] == b
brand_models = autos[brand_filter]['model']
model_freq = {}
for m in brand_models:
if m not in model_freq:
model_freq[m] = 1
else:
model_freq[m] += 1
sorted_brand = sorted(model_freq.items(), key=operator.itemgetter(1), reverse=True)
return sorted_brand[0]
brand_model_freq('volkswagen')
('golf', 3684)
brand_model_freq('bmw')
('3er', 2602)
brand_model_freq('opel')
('corsa', 1567)
brand_model_freq('mercedes_benz')
('c_klasse', 1135)
brand_model_freq('audi')
('a4', 1226)
brand_model_freq('ford')
('focus', 757)
To answer this we will create milage ranges that make sense and then assign our entries into those respective groups.
autos['odometer_km'].value_counts()
group_less_than_20 = autos.loc[autos["odometer_km"]<20000,'price']
group_between_20_and_50 = autos.loc[autos["odometer_km"].between(20000,50000),'price']
group_between_50_and_80 = autos.loc[autos["odometer_km"].between(50000,80000),'price']
group_between_80_and_100 = autos.loc[autos["odometer_km"].between(80000,100000),'price']
group_between_100_and_125 = autos.loc[autos["odometer_km"].between(100000,125000),'price']
group_between_125_and_150 = autos.loc[autos["odometer_km"].between(125000,150000),'price']
group_150_and_over = autos.loc[autos["odometer_km"] == 150000,'price']
group_less_than_20.mean()
12635.730646871687
group_between_20_and_50.mean()
15959.703839122487
group_between_50_and_80.mean()
11563.292594178081
group_between_80_and_100.mean()
8700.130110062893
group_between_100_and_125.mean()
6817.424295263005
group_between_125_and_150.mean()
4133.174100843724
group_150_and_over.mean()
3792.4303924520727
In general, we see the price decreasing as milage increases which is expected.
To answer this we will use boolean indexing to filter our dataframe by damage, then find the mean:
with_damage = autos[autos['unrepaired_damage'] == 'yes']
with_damage['price'].mean()
2269.5798750557788
no_damage = autos[autos['unrepaired_damage'] == 'no']
no_damage['price'].mean()
7177.137984128864
On average, a vehicle with damage is one third the price of a vehicle with no damage.