import pandas as pd #importing pandas library
import numpy as np #importing numpy library
autos = pd.read_csv('autos.csv',encoding = 'Latin-1')
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
As is clear from the output of the code cell above, the dataset contains 50,000 rows and 20 columns. Most of the columns don't have missing values except for the 'vehicletype', 'gearbox', 'model', 'fueltype' and 'notRepairedDamage' columns. 12 columns are strings, 5 are integers while 3 columns have datetime values.
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
autos = autos.rename({'dateCrawled':'date_crawled',
'yearOfRegistration':'registration_year',
'monthOfRegistration':'registration_month',
'notRepairedDamage':'unrepaired_damage',
'dateCreated':'ad_created','offerType':'offer_type',
'abtest':'ab_test','vehicleType':'vehicle_type',
'gearBox':'gear_box','powerPS':'power_ps',
'fuelType':'fuel_type','nrOfPictures':'nr_of_pictures',
'postalCode':'postal_code',
'lastSeen':'last_seen'},axis = 1)
autos.head()
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-19 17:36:18 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
autos = autos.drop(columns = ['seller','offer_type','nr_of_pictures'])
autos['price'] = autos['price'].str.replace('$','')
autos['price'] = autos['price'].str.replace(',','').astype(int)
autos['odometer'] = autos['odometer'].str.replace('km','')
autos['odometer'] = autos['odometer'].str.replace(',','').astype(int)
autos = autos.rename({'odometer':'odometer_km'},axis = 1)
autos
date_crawled | name | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | 24900 | control | limousine | 2011 | automatik | 239 | q5 | 100000 | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | 1980 | control | cabrio | 1996 | manuell | 75 | astra | 150000 | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | 13200 | test | cabrio | 2014 | automatik | 69 | 500 | 5000 | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | 22900 | control | kombi | 2013 | manuell | 150 | a3 | 40000 | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | 1250 | control | limousine | 1996 | manuell | 101 | vectra | 150000 | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 17 columns
print(autos['price'].unique().shape)
autos['price'].describe()
(2357,)
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
autos['price'].value_counts().sort_index()
0 1421 1 156 2 3 3 1 5 2 ... 10000000 1 11111111 2 12345678 3 27322222 1 99999999 1 Name: price, Length: 2357, dtype: int64
autos.loc[autos['price'].between(0,1000),'price'] = np.nan
autos.loc[autos['price'].between(100000,100000000),'price'] = np.nan
autos['price'].value_counts().sort_index()
1039.0 1 1040.0 1 1049.0 6 1050.0 95 1059.0 1 .. 93911.0 1 94999.0 1 98500.0 1 99000.0 2 99900.0 2 Name: price, Length: 2051, dtype: int64
print(autos['price'].value_counts().shape)
autos['price'].describe()
(2051,)
count 37948.000000 mean 7191.577896 std 7920.021774 min 1039.000000 25% 2300.000000 50% 4499.000000 75% 8999.000000 max 99900.000000 Name: price, dtype: float64
Now that we have removed the outliers, we are left with around 38000 price entries. Standard deviation has also reduced from about 480000 to almost 8000 which is a direct result of removing the exceedingly large and small price values.
autos['odometer_km'].value_counts(dropna=False)
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: odometer_km, dtype: int64
autos['odometer_km'].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
The data in this column seems fine and it does not have any outliers that could affect our analysis.
We are only interested in number of sales or registrations on a single day, so we will not consider the time part for all the three date columns.
date_crawled = autos['date_crawled'].str[:10].value_counts(normalize=True,dropna=False).sort_index()*100
date_crawled
2016-03-05 2.538 2016-03-06 1.394 2016-03-07 3.596 2016-03-08 3.330 2016-03-09 3.322 2016-03-10 3.212 2016-03-11 3.248 2016-03-12 3.678 2016-03-13 1.556 2016-03-14 3.662 2016-03-15 3.398 2016-03-16 2.950 2016-03-17 3.152 2016-03-18 1.306 2016-03-19 3.490 2016-03-20 3.782 2016-03-21 3.752 2016-03-22 3.294 2016-03-23 3.238 2016-03-24 2.910 2016-03-25 3.174 2016-03-26 3.248 2016-03-27 3.104 2016-03-28 3.484 2016-03-29 3.418 2016-03-30 3.362 2016-03-31 3.192 2016-04-01 3.380 2016-04-02 3.540 2016-04-03 3.868 2016-04-04 3.652 2016-04-05 1.310 2016-04-06 0.318 2016-04-07 0.142 Name: date_crawled, dtype: float64
date_crawled.describe()
count 34.000000 mean 2.941176 std 0.976476 min 0.142000 25% 2.988500 50% 3.271000 75% 3.488500 max 3.868000 Name: date_crawled, dtype: float64
autos['date_crawled'].unique().shape
(48213,)
Clearly, their are 48213 unique values in the date_crawled column all of which are of April and March months of the year 2016 with each value occupying almost 3% of the total values.
autos['ad_created'].str[:10].describe()
count 50000 unique 76 top 2016-04-03 freq 1946 Name: ad_created, dtype: object
ad_created = (autos['ad_created'].str[:10].value_counts(normalize=True).sort_index())*100
ad_created
2015-06-11 0.002 2015-08-10 0.002 2015-09-09 0.002 2015-11-10 0.002 2015-12-05 0.002 ... 2016-04-03 3.892 2016-04-04 3.688 2016-04-05 1.184 2016-04-06 0.326 2016-04-07 0.128 Name: ad_created, Length: 76, dtype: float64
ad_created.describe()
count 76.000000 mean 1.315789 std 1.595835 min 0.002000 25% 0.002000 50% 0.014000 75% 3.227500 max 3.892000 Name: ad_created, dtype: float64
autos['ad_created'].unique().shape
(76,)
The ad_created column contains 76 unique values with each unique value occupying about 1.3% by average of the total values which means about 650 ads were created on a single day by average. Almost half of the values are of the months of March and April of the year 2016, about 5-10 values are from months of 2015.
autos['last_seen'].str[:10].value_counts()
2016-04-06 11050 2016-04-07 6546 2016-04-05 6214 2016-03-17 1396 2016-04-03 1268 2016-04-02 1245 2016-03-30 1242 2016-04-04 1231 2016-03-31 1192 2016-03-12 1191 2016-04-01 1155 2016-03-29 1117 2016-03-22 1079 2016-03-28 1043 2016-03-21 1037 2016-03-20 1035 2016-03-24 978 2016-03-25 960 2016-03-23 929 2016-03-26 848 2016-03-16 822 2016-03-27 801 2016-03-15 794 2016-03-19 787 2016-03-14 640 2016-03-11 626 2016-03-10 538 2016-03-09 493 2016-03-13 449 2016-03-08 380 2016-03-18 371 2016-03-07 268 2016-03-06 221 2016-03-05 54 Name: last_seen, dtype: int64
autos['last_seen'].str[:10].describe()
count 50000 unique 34 top 2016-04-06 freq 11050 Name: last_seen, dtype: object
last_seen = (autos['last_seen'].str[:10].value_counts(normalize=True).sort_index())*100
last_seen.mean()
2.941176470588235
autos['last_seen'].str[:10].unique().shape
(34,)
Clearly, there are 34 unique date values in the last_seen column all of them are of the year 2016 with each date occupying almost 3%, i.e., 1500 sales by average of all the values. Most sales are of the April month and most of these are 3 consecutive days- 5th, 6th and 7th of April - almost 4-7 times the average sales number.
autos['registration_year'].describe()
count 50000.000000 mean 2005.073280 std 105.712813 min 1000.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The registration_year column appears to have a number of invalid entries. The maximum and minimum values of the column are 9999 and 1000 which is impossible. We will need to fix these errors.
autos['registration_year'].unique()
array([2004, 1997, 2009, 2007, 2003, 2006, 1995, 1998, 2000, 2017, 2010, 1999, 1982, 1990, 2015, 2014, 1996, 1992, 2005, 2002, 2012, 2011, 2008, 1985, 2016, 1994, 1986, 2001, 2018, 2013, 1972, 1993, 1988, 1989, 1967, 1973, 1956, 1976, 4500, 1987, 1991, 1983, 1960, 1969, 1950, 1978, 1980, 1984, 1963, 1977, 1961, 1968, 1934, 1965, 1971, 1966, 1979, 1981, 1970, 1974, 1910, 1975, 5000, 4100, 2019, 1959, 9996, 9999, 6200, 1964, 1958, 1800, 1948, 1931, 1943, 9000, 1941, 1962, 1927, 1937, 1929, 1000, 1957, 1952, 1111, 1955, 1939, 8888, 1954, 1938, 2800, 5911, 1500, 1953, 1951, 4800, 1001])
We will remove the years earlier than 1910 (this is the earliest plausible year after the invention of the first car). We shall also remove the years later than 2016 since a car cannot be first registered after it was last seen on the website (we saw earlier that all the values in the last_seen column are of the year 2016). We will not remove the entire row since we may lose valuable data, so we will assign the corresponding value as NaN.
autos.loc[autos['registration_year'] < 1910,'registration_year'] = np.nan
autos.loc[autos['registration_year'] > 2016, 'registration_year'] = np.nan
autos['registration_year'].describe()
count 48028.00000 mean 2002.80351 std 7.31085 min 1910.00000 25% 1999.00000 50% 2003.00000 75% 2008.00000 max 2016.00000 Name: registration_year, dtype: float64
(autos['registration_year'].value_counts(normalize=True).sort_index().tail(50))*100
1967.0 0.056217 1968.0 0.054135 1969.0 0.039560 1970.0 0.093695 1971.0 0.056217 1972.0 0.072874 1973.0 0.054135 1974.0 0.049971 1975.0 0.039560 1976.0 0.056217 1977.0 0.045807 1978.0 0.097860 1979.0 0.072874 1980.0 0.201966 1981.0 0.064546 1982.0 0.089531 1983.0 0.110352 1984.0 0.110352 1985.0 0.218622 1986.0 0.158241 1987.0 0.156159 1988.0 0.295661 1989.0 0.376863 1990.0 0.822437 1991.0 0.741234 1992.0 0.814108 1993.0 0.926543 1994.0 1.374198 1995.0 2.733822 1996.0 3.006579 1997.0 4.222537 1998.0 5.107437 1999.0 6.246356 2000.0 6.983426 2001.0 5.627967 2002.0 5.274007 2003.0 5.677938 2004.0 5.698759 2005.0 6.277588 2006.0 5.638378 2007.0 4.797202 2008.0 4.645207 2009.0 4.368285 2010.0 3.325144 2011.0 3.402182 2012.0 2.754643 2013.0 1.678188 2014.0 1.386691 2015.0 0.830765 2016.0 2.740068 Name: registration_year, dtype: float64
((autos['registration_year'].value_counts(normalize=True))*100).mean()
1.2820512820512817
Each year makes approximately 1.3% of the entire registration_year column, i.e., about 620 cars/year by average. About 20-30 cars/year were registered before the 80s and the registrations start to increase after that. The maximum registrations were recorded in the year 2005 - around 3000, then we again see a steady decrease untill the year 2016 where the numbers jump from about 400 (in the year 2015) to almost 1300.
autos['brand'].describe()
count 50000 unique 40 top volkswagen freq 10687 Name: brand, dtype: object
(autos['brand'].value_counts())
volkswagen 10687 opel 5461 bmw 5429 mercedes_benz 4734 audi 4283 ford 3479 renault 2404 peugeot 1456 fiat 1308 seat 941 skoda 786 mazda 757 nissan 754 citroen 701 smart 701 toyota 617 sonstige_autos 546 hyundai 488 volvo 457 mini 424 mitsubishi 406 honda 399 kia 356 alfa_romeo 329 porsche 294 suzuki 293 chevrolet 283 chrysler 181 dacia 129 daihatsu 128 jeep 110 subaru 109 land_rover 99 saab 80 daewoo 79 trabant 78 jaguar 77 rover 69 lancia 57 lada 31 Name: brand, dtype: int64
There are 40 unique brands in the brand column with volkswagen being the most frequent with more than 10000 sales. We will leave aggregation of those brands whose frequencies are less than 1% of the total values in the brand column.
brand_sel = (autos['brand'].value_counts(normalize=True))*100 > 1
brands = ((autos['brand'].value_counts(normalize=True))*100)[brand_sel]
brands
volkswagen 21.374 opel 10.922 bmw 10.858 mercedes_benz 9.468 audi 8.566 ford 6.958 renault 4.808 peugeot 2.912 fiat 2.616 seat 1.882 skoda 1.572 mazda 1.514 nissan 1.508 citroen 1.402 smart 1.402 toyota 1.234 sonstige_autos 1.092 Name: brand, dtype: float64
sel_brands = brands.index
price_by_brand = {}
for b in sel_brands:
mean_price = autos.loc[autos['brand'] == b, 'price'].mean()
price_by_brand[b] = mean_price
price_by_brand
{'volkswagen': 6652.138861386138, 'opel': 4259.6706021251475, 'bmw': 8990.066779374472, 'mercedes_benz': 9202.106543822349, 'audi': 10242.50607615447, 'ford': 5351.12505663797, 'renault': 3625.2169274537696, 'peugeot': 4006.038684719536, 'fiat': 4033.7395061728394, 'seat': 5577.062682215744, 'skoda': 6894.382889200561, 'mazda': 5300.20955882353, 'nissan': 6403.148362235067, 'citroen': 4676.230188679246, 'smart': 3760.979623824451, 'toyota': 5597.41981981982, 'sonstige_autos': 12567.545691906005}
Next important parameter is the mileage of the cars. We have only the odometer_km column with us so we'd have to make do with it.
mileage = {}
for b in sel_brands:
mean = autos.loc[autos['brand'] == b, 'odometer_km'].mean()
mileage[b] = mean
mean_mileage_km = pd.Series(mileage)
mean_mileage_km = mean_mileage_km.sort_values(ascending = False)
mean_mileage_km
bmw 132521.643028 mercedes_benz 130886.142797 audi 129643.941163 opel 129298.663248 volkswagen 128955.272761 renault 128223.793677 peugeot 127352.335165 mazda 125132.100396 ford 124131.934464 seat 122061.636557 citroen 119764.621969 nissan 118978.779841 fiat 117037.461774 toyota 115988.654781 skoda 110947.837150 smart 100756.062767 sonstige_autos 87188.644689 dtype: float64
mean_price_dollars = pd.Series(price_by_brand)
mean_price_dollars = mean_price_dollars.sort_values(ascending = False)
mean_price_dollars
sonstige_autos 12567.545692 audi 10242.506076 mercedes_benz 9202.106544 bmw 8990.066779 skoda 6894.382889 volkswagen 6652.138861 nissan 6403.148362 toyota 5597.419820 seat 5577.062682 ford 5351.125057 mazda 5300.209559 citroen 4676.230189 opel 4259.670602 fiat 4033.739506 peugeot 4006.038685 smart 3760.979624 renault 3625.216927 dtype: float64
df = pd.DataFrame(mean_price_dollars,columns = ['mean_price_dollars'])
df
mean_price_dollars | |
---|---|
sonstige_autos | 12567.545692 |
audi | 10242.506076 |
mercedes_benz | 9202.106544 |
bmw | 8990.066779 |
skoda | 6894.382889 |
volkswagen | 6652.138861 |
nissan | 6403.148362 |
toyota | 5597.419820 |
seat | 5577.062682 |
ford | 5351.125057 |
mazda | 5300.209559 |
citroen | 4676.230189 |
opel | 4259.670602 |
fiat | 4033.739506 |
peugeot | 4006.038685 |
smart | 3760.979624 |
renault | 3625.216927 |
df['mean_mileage_km'] = mean_mileage_km
df = df.iloc[1:,:]
df['mean_price_dollars'] = df['mean_price_dollars'].astype(int)
df['mean_mileage_km'] = df['mean_mileage_km'].astype(int)
<ipython-input-343-82216bbf9c11>:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df['mean_price_dollars'] = df['mean_price_dollars'].astype(int) <ipython-input-343-82216bbf9c11>:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df['mean_mileage_km'] = df['mean_mileage_km'].astype(int)
df
mean_price_dollars | mean_mileage_km | |
---|---|---|
audi | 10242 | 129643 |
mercedes_benz | 9202 | 130886 |
bmw | 8990 | 132521 |
skoda | 6894 | 110947 |
volkswagen | 6652 | 128955 |
nissan | 6403 | 118978 |
toyota | 5597 | 115988 |
seat | 5577 | 122061 |
ford | 5351 | 124131 |
mazda | 5300 | 125132 |
citroen | 4676 | 119764 |
opel | 4259 | 129298 |
fiat | 4033 | 117037 |
peugeot | 4006 | 127352 |
smart | 3760 | 100756 |
renault | 3625 | 128223 |
autos.loc[autos['brand']=='bmw','model'].describe().top
'3er'
d = {}
for b in autos['brand'].unique():
d[b] = autos.loc[autos['brand']==b,'model'].describe().top
d
{'peugeot': '2_reihe', 'bmw': '3er', 'volkswagen': 'golf', 'smart': 'fortwo', 'ford': 'focus', 'chrysler': 'andere', 'seat': 'ibiza', 'renault': 'twingo', 'mercedes_benz': 'c_klasse', 'audi': 'a4', 'sonstige_autos': nan, 'opel': 'corsa', 'mazda': '3_reihe', 'porsche': '911', 'mini': 'cooper', 'toyota': 'yaris', 'dacia': 'sandero', 'nissan': 'micra', 'jeep': 'grand', 'saab': 'andere', 'volvo': 'v40', 'mitsubishi': 'colt', 'jaguar': 'andere', 'fiat': 'punto', 'skoda': 'octavia', 'subaru': 'legacy', 'kia': 'andere', 'citroen': 'andere', 'chevrolet': 'andere', 'hyundai': 'i_reihe', 'honda': 'civic', 'daewoo': 'matiz', 'suzuki': 'andere', 'trabant': '601', 'land_rover': 'freelander', 'alfa_romeo': '156', 'lada': 'niva', 'rover': 'andere', 'daihatsu': 'cuore', 'lancia': 'ypsilon'}
df['most_common_model'] = pd.Series(d)
<ipython-input-347-68a91684a4fa>:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df['most_common_model'] = pd.Series(d)
df
mean_price_dollars | mean_mileage_km | most_common_model | |
---|---|---|---|
audi | 10242 | 129643 | a4 |
mercedes_benz | 9202 | 130886 | c_klasse |
bmw | 8990 | 132521 | 3er |
skoda | 6894 | 110947 | octavia |
volkswagen | 6652 | 128955 | golf |
nissan | 6403 | 118978 | micra |
toyota | 5597 | 115988 | yaris |
seat | 5577 | 122061 | ibiza |
ford | 5351 | 124131 | focus |
mazda | 5300 | 125132 | 3_reihe |
citroen | 4676 | 119764 | andere |
opel | 4259 | 129298 | corsa |
fiat | 4033 | 117037 | punto |
peugeot | 4006 | 127352 | 2_reihe |
smart | 3760 | 100756 | fortwo |
renault | 3625 | 128223 | twingo |
Let us first rename the entries in unrepaired_damage column which are in German
autos.loc[autos['unrepaired_damage']=='nein','unrepaired_damage'] = 'no'
autos.loc[autos['unrepaired_damage']=='ja','unrepaired_damage'] ='yes'
autos.loc[(autos['brand']=='bmw') & (autos['unrepaired_damage']=='no'), 'price'].describe()
count 3823.000000 mean 9733.702851 std 8495.453418 min 1050.000000 25% 3500.000000 50% 7490.000000 75% 12999.000000 max 72900.000000 Name: price, dtype: float64
d1 = {}
for b in autos['brand'].unique():
d1[b] = int(autos.loc[(autos['brand']==b) & (autos['unrepaired_damage']=='no'), 'price'].mean())
df['mean_price_without_damage'] = pd.Series(d1)
<ipython-input-353-adb9091db602>:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df['mean_price_without_damage'] = pd.Series(d1)
df
mean_price_dollars | mean_mileage_km | most_common_model | mean_price_without_damage | |
---|---|---|---|---|
audi | 10242 | 129643 | a4 | 11374 |
mercedes_benz | 9202 | 130886 | c_klasse | 10066 |
bmw | 8990 | 132521 | 3er | 9733 |
skoda | 6894 | 110947 | octavia | 7291 |
volkswagen | 6652 | 128955 | golf | 7324 |
nissan | 6403 | 118978 | micra | 6922 |
toyota | 5597 | 115988 | yaris | 5931 |
seat | 5577 | 122061 | ibiza | 6019 |
ford | 5351 | 124131 | focus | 5768 |
mazda | 5300 | 125132 | 3_reihe | 5857 |
citroen | 4676 | 119764 | andere | 4792 |
opel | 4259 | 129298 | corsa | 4592 |
fiat | 4033 | 117037 | punto | 4356 |
peugeot | 4006 | 127352 | 2_reihe | 4286 |
smart | 3760 | 100756 | fortwo | 3956 |
renault | 3625 | 128223 | twingo | 3937 |
d2 = {}
for b in autos['brand'].unique():
d2[b] = (autos.loc[(autos['brand']==b) & (autos['unrepaired_damage']=='yes'), 'price'].mean())
df['mean_price_with_damage'] = pd.Series(d2)
<ipython-input-356-0cc0768a2817>:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df['mean_price_with_damage'] = pd.Series(d2)
autos.loc[(autos['unrepaired_damage'] != 'yes') & (autos['unrepaired_damage']!='no'),'unrepaired_damage'] = 'unknown'
autos['unrepaired_damage'].value_counts(dropna=False)
no 35232 unknown 9829 yes 4939 Name: unrepaired_damage, dtype: int64
df['mean_price_with_damage'] = df['mean_price_with_damage'].astype(int)
<ipython-input-359-26b4b8f8b3d4>:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df['mean_price_with_damage'] = df['mean_price_with_damage'].astype(int)
df['non_damaged_vs_damaged_price_ratio'] = df['mean_price_without_damage']/df['mean_price_with_damage']
<ipython-input-360-280aa0ccb949>:1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy df['non_damaged_vs_damaged_price_ratio'] = df['mean_price_without_damage']/df['mean_price_with_damage']
df
mean_price_dollars | mean_mileage_km | most_common_model | mean_price_without_damage | mean_price_with_damage | non_damaged_vs_damaged_price_ratio | |
---|---|---|---|---|---|---|
audi | 10242 | 129643 | a4 | 11374 | 4769 | 2.384986 |
mercedes_benz | 9202 | 130886 | c_klasse | 10066 | 5182 | 1.942493 |
bmw | 8990 | 132521 | 3er | 9733 | 5113 | 1.903579 |
skoda | 6894 | 110947 | octavia | 7291 | 5269 | 1.383754 |
volkswagen | 6652 | 128955 | golf | 7324 | 4006 | 1.828258 |
nissan | 6403 | 118978 | micra | 6922 | 3878 | 1.784941 |
toyota | 5597 | 115988 | yaris | 5931 | 4256 | 1.393562 |
seat | 5577 | 122061 | ibiza | 6019 | 3079 | 1.954855 |
ford | 5351 | 124131 | focus | 5768 | 3067 | 1.880665 |
mazda | 5300 | 125132 | 3_reihe | 5857 | 2475 | 2.366465 |
citroen | 4676 | 119764 | andere | 4792 | 3323 | 1.442070 |
opel | 4259 | 129298 | corsa | 4592 | 2762 | 1.662563 |
fiat | 4033 | 117037 | punto | 4356 | 2826 | 1.541401 |
peugeot | 4006 | 127352 | 2_reihe | 4286 | 2716 | 1.578056 |
smart | 3760 | 100756 | fortwo | 3956 | 2027 | 1.951653 |
renault | 3625 | 128223 | twingo | 3937 | 2563 | 1.536091 |
df['non_damaged_vs_damaged_price_ratio'].mean()
1.7834620365276217
Top three most expensive brands are 'Audi', 'Mercedes Benz' and 'BMW'. 'Ford' ,'Nissan', 'Volkswagen' and 'Toyota' are relatively less expensive and among the cheapest ones are 'Renault', 'Smart' and 'Fiat'.
There seems to be no direct correlation between mileage and average prices.
Average price of cars without damage is about 1.78 times the average price of non-damaged cars which is not surprising. For most cases this ratio is even more than 2.