Guided Project: Exploring Ebay Car Sales Data

In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.

The dataset was originally scraped and uploaded to Kaggle. A few modifications have been made from the original dataset that was uploaded to Kaggle:

  • 50,000 data points from the full dataset have been sampled
  • The dataset has been dirtied a bit to more closely resemble what would be expected from a scraped dataset (the version uploaded to Kaggle was cleaned to be easier to work with)

The file and relevant libraries have been uploaded below

In [1]:
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding = "Latin-1")

Initial inspection

  • Writing the name of a panda variable, will show the first few and last few rows
  • The file has been initally inspected, with info and head panda functions. The columns at index values 6, 8, 10, 13 and 15 all have null values.
  • There are also some numeric values, currently represented as a string e.g. price
In [2]:
autos
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
Out[2]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
  • The headings have been provided in CamelCase, rather than Snake_case
  • I will manually access and edit these, using the dataframe.column function
In [3]:
autos.columns
Out[3]:
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')
In [4]:
New_headings = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_PS', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']
In [5]:
autos.columns = New_headings

autos.head()
Out[5]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_PS model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
  • Next we will do some basic data exploration to determine what other cleaning tasks need to be done.
  • Initially we will look for: - Text columns where all or almost all values are the same. These can often be dropped as they don't have useful information for analysis. - Examples of numeric data stored as text which can be cleaned and converted.
In [6]:
autos.describe(include = 'all')
Out[6]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_PS model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-12 16:06:22 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN
  • The seller and offer_type columns each contain only 1 unique value, so we will consider dropping them
  • The columns price and odometer, are objects, so we will consider converting them to numeric values by stripping out the non-numeric characters, then converting to int64 or float64 and finally renaming the column
In [7]:
autos.drop('nr_of_pictures', axis=1)
Out[7]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_PS model odometer registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 39218 2016-04-01 14:38:50
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon privat Angebot $24,900 control limousine 2011 automatik 239 q5 100,000km 1 diesel audi nein 2016-03-27 00:00:00 82131 2016-04-01 13:47:40
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot $1,980 control cabrio 1996 manuell 75 astra 150,000km 5 benzin opel nein 2016-03-28 00:00:00 44807 2016-04-02 14:18:02
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot $13,200 test cabrio 2014 automatik 69 500 5,000km 11 benzin fiat nein 2016-04-02 00:00:00 73430 2016-04-04 11:47:27
49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition privat Angebot $22,900 control kombi 2013 manuell 150 a3 40,000km 11 diesel audi nein 2016-03-08 00:00:00 35683 2016-04-05 16:45:07
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V privat Angebot $1,250 control limousine 1996 manuell 101 vectra 150,000km 1 benzin opel nein 2016-03-13 00:00:00 45897 2016-04-06 21:18:48

50000 rows × 19 columns

In [8]:
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   date_crawled        50000 non-null  object
 1   name                50000 non-null  object
 2   seller              50000 non-null  object
 3   offer_type          50000 non-null  object
 4   price               50000 non-null  object
 5   abtest              50000 non-null  object
 6   vehicle_type        44905 non-null  object
 7   registration_year   50000 non-null  int64 
 8   gearbox             47320 non-null  object
 9   power_PS            50000 non-null  int64 
 10  model               47242 non-null  object
 11  odometer            50000 non-null  object
 12  registration_month  50000 non-null  int64 
 13  fuel_type           45518 non-null  object
 14  brand               50000 non-null  object
 15  unrepaired_damage   40171 non-null  object
 16  ad_created          50000 non-null  object
 17  nr_of_pictures      50000 non-null  int64 
 18  postal_code         50000 non-null  int64 
 19  last_seen           50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
In [9]:
autos.head()
Out[9]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_PS model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
In [10]:
autos.loc[:,"odometer"] = autos.loc[:,"odometer"].str.replace("km","").str.replace(",","")
autos.loc[:,"odometer"] = autos.loc[:,"odometer"].astype(int)
autos.rename({"odometer":"odometer_km"}, axis = 1, inplace = True)
In [11]:
autos.loc[:,"price"] = autos.loc[:,"price"].str.replace("$","").str.replace(",","")
autos.loc[:,"price"] = autos.loc[:,"price"].astype(int)
autos.head()
Out[11]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_PS model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000 control bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500 control limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot 8990 test limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot 4350 control kleinwagen 2007 automatik 71 fortwo 70000 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot 1350 test kombi 2003 manuell 0 focus 150000 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
  • Next, I will try to identify any outliers in the odometer_km and price columns.
  • I will use the .shape (unique values), .describe() (min, max, etc..) and .value_counts() (counts of unique values) functions.
In [12]:
autos.loc[:, 'odometer_km'].shape
Out[12]:
(50000,)
In [13]:
autos.loc[:, 'odometer_km'].describe()
Out[13]:
count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
In [14]:
autos.loc[:, 'odometer_km'].value_counts().sort_index(ascending = False)
Out[14]:
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
40000       819
30000       789
20000       784
10000       264
5000        967
Name: odometer_km, dtype: int64
In [15]:
autos.loc[:, 'price'].shape
Out[15]:
(50000,)
In [16]:
autos.loc[:, 'price'].describe()
Out[16]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
In [17]:
autos.loc[:, 'price'].value_counts().sort_index(ascending = False)
Out[17]:
99999999       1
27322222       1
12345678       3
11111111       2
10000000       1
            ... 
5              2
3              1
2              3
1            156
0           1421
Name: price, Length: 2357, dtype: int64
  • The top and bottom five values, look out of place, so I will convert values >= 10000000 and <= 100, to Null and then drop the rows with Null values
In [18]:
autos.loc[autos.loc[:, 'price'] == 99999999, 'price'] = np.nan
autos.loc[autos.loc[:, 'price'] < 50, 'price'] = np.nan
autos = autos.dropna()
In [19]:
# checking the new largest and smallest values in the price series
autos.loc[:, 'price'].value_counts().sort_index(ascending = False)
Out[19]:
12345678.0     1
11111111.0     1
999999.0       1
999990.0       1
350000.0       1
              ..
70.0           1
66.0           1
65.0           1
60.0           2
50.0          15
Name: price, Length: 2146, dtype: int64
  • The new largest and smallest values seem like sensible prices of cars, although you would have to be very rich to afford, the most expensive in the list!
In [20]:
autos.shape
Out[20]:
(34550, 20)

Dates and date ranges

  • Next we will move on to the date columns and understand the date range the data covers.
  • The date_crawled, last_seen, and ad_created columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively. The other two columns are represented as numeric values, so we can use methods like Series.describe() to understand the distribution without any extra data processing.
In [21]:
autos[['date_crawled','ad_created','last_seen']][0:5]
Out[21]:
date_crawled ad_created last_seen
0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54
1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08
2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37
3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28
4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50
In [22]:
autos.loc[:, 'date_crawled'] = autos.loc[:, 'date_crawled'].str[:10]
autos.loc[:, 'ad_created'] = autos.loc[:, 'ad_created'].str[:10]
autos.loc[:, 'last_seen'] = autos.loc[:, 'last_seen'].str[:10]
autos[['date_crawled','ad_created','last_seen']][0:5]
Out[22]:
date_crawled ad_created last_seen
0 2016-03-26 2016-03-26 2016-04-06
1 2016-04-04 2016-04-04 2016-04-06
2 2016-03-26 2016-03-26 2016-04-06
3 2016-03-12 2016-03-12 2016-03-15
4 2016-04-01 2016-04-01 2016-04-01
In [23]:
autos.loc[:, 'registration_year'].value_counts().sort_index(ascending = False)
Out[23]:
2018      1
2016     40
2015    344
2014    603
2013    731
       ... 
1950      1
1941      1
1937      1
1934      1
1931      1
Name: registration_year, Length: 68, dtype: int64
  • All the data from the date cells, looks sensible and there was no obvious pattern in the date_crawled,ad_created or last_seen cells. This may be because I dropped any "problem rows" when removing Null values.
  • However, to improve the quality of the data slightly, I've removed rows with registration years outside of 1950 - 2016
In [24]:
autos = autos.loc[autos.loc[:, 'registration_year'].between(1950,2016),:]
  • We will now convert dates to uniform integers in the format YYYYMMDD
In [25]:
autos.loc[:, 'date_crawled'] = autos.loc[:, 'date_crawled'].str.replace("-","").astype(int)
autos.loc[:, 'ad_created'] = autos.loc[:, 'ad_created'].str.replace("-","").astype(int)
autos.loc[:, 'last_seen'] = autos.loc[:, 'last_seen'].str.replace("-","").astype(int)

Brand Aggregation

  • In this section we will explore the differences between brands
  • We will only be looking at brands that occupy >= 5% of the total
  • These brands have been highlighted below
In [26]:
brands = autos.loc[:,'brand'].value_counts(normalize = True, ascending = False)
significant_brands = brands.loc[brands > 0.05].index
significant_brands
Out[26]:
Index(['volkswagen', 'bmw', 'mercedes_benz', 'opel', 'audi', 'ford'], dtype='object')
In [27]:
mean_price = {}

for b in significant_brands:
    mean = autos.loc[autos.loc[:,'brand']==b,'price'].mean()
    mean_price[b] = mean
    
mean_price
Out[27]:
{'volkswagen': 6445.090143218197,
 'bmw': 9230.636386200049,
 'mercedes_benz': 9543.794084507042,
 'opel': 6953.577796706417,
 'audi': 10581.772583701832,
 'ford': 9577.086712683347}
  • Next we will create a new dataframe comparing the mean price and mileage, for the top 6 brands, identified above.
  • This will allow us to identify any lin between price and mileage
In [28]:
mean_mileage = {}

for b in significant_brands:
    mean = autos.loc[autos.loc[:,'brand']==b,'odometer_km'].mean()
    mean_mileage[b] = mean
    
mean_mileage
Out[28]:
{'volkswagen': 126191.37882617243,
 'bmw': 131522.71034996276,
 'mercedes_benz': 129816.9014084507,
 'opel': 127535.49119818285,
 'audi': 126366.0770688566,
 'ford': 122918.46419327006}
In [29]:
price_series = pd.Series(mean_price)
mileage_series = pd.Series(mean_mileage)

df = pd.DataFrame(price_series, columns =['mean_price'])
df.loc[:,'mean_mileage'] = mileage_series

df
Out[29]:
mean_price mean_mileage
volkswagen 6445.090143 126191.378826
bmw 9230.636386 131522.710350
mercedes_benz 9543.794085 129816.901408
opel 6953.577797 127535.491198
audi 10581.772584 126366.077069
ford 9577.086713 122918.464193
In [30]:
df.loc[:,'price_per_mile'] = df.iloc[:,0] / df.iloc[:,1]

df.sort_values(by = 'mean_price', ascending = False)
Out[30]:
mean_price mean_mileage price_per_mile
audi 10581.772584 126366.077069 0.083739
ford 9577.086713 122918.464193 0.077914
mercedes_benz 9543.794085 129816.901408 0.073517
bmw 9230.636386 131522.710350 0.070183
opel 6953.577797 127535.491198 0.054523
volkswagen 6445.090143 126191.378826 0.051074
  • I have added in an extra column ("price_per_mile") to help with the analysis.
  • There does not appear to be any clear link between mean_price and mean_mileage
  • In the table above, as the mean_price decreases, the mean_mileage tends to stay fairly conisistent.
  • However, the results may indicate that the price is determined more so by the brand than the mileage.

Additional analysis

1) Find the most common brand/model combinations

2) Split the odometer_km into groups, and use aggregation to see if average prices follows any patterns based on the milage.

3) How much cheaper are cars with damage than their non-damaged counterparts?

In [37]:
autos.loc[:,'brand_model_comb'] = autos.loc[:,'brand'].str.cat(autos.loc[:,'model'])
autos.loc[:,'brand_model_comb'].value_counts().head(3)
Out[37]:
volkswagengolf    2748
bmw3er            2077
volkswagenpolo    1129
Name: brand_model_comb, dtype: int64
  • It appears that the 3 most popular brad/model combinations are the volkswagen gold, the BMW 3ER and the volskwagen polo
In [42]:
_0km_50km_mean = autos.loc[autos.loc[:,'odometer_km'].between(0,50000), 'price'].mean()
_50km_100km_mean = autos.loc[autos.loc[:,'odometer_km'].between(50000,100000), 'price'].mean()
_100km_150km_mean = autos.loc[autos.loc[:,'odometer_km'].between(100000,150000), 'price'].mean()
print(_0km_50km_mean)
print(_50km_100km_mean)
print(_100km_150km_mean)
20224.724452554743
12408.266312921267
5439.197750877979
  • It is apparent that the average price of a car, decreases as the mileage increases
In [43]:
autos.head()
Out[43]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_PS ... odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen brand_model_comb
0 20160326 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000.0 control bus 2004 manuell 158 ... 150000 3 lpg peugeot nein 20160326 0 79588 20160406 peugeotandere
1 20160404 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500.0 control limousine 1997 automatik 286 ... 150000 6 benzin bmw nein 20160404 0 71034 20160406 bmw7er
2 20160326 Volkswagen_Golf_1.6_United privat Angebot 8990.0 test limousine 2009 manuell 102 ... 70000 7 benzin volkswagen nein 20160326 0 35394 20160406 volkswagengolf
3 20160312 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot 4350.0 control kleinwagen 2007 automatik 71 ... 70000 6 benzin smart nein 20160312 0 33729 20160315 smartfortwo
4 20160401 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot 1350.0 test kombi 2003 manuell 0 ... 150000 7 benzin ford nein 20160401 0 39218 20160401 fordfocus

5 rows × 21 columns

In [44]:
autos.loc[:,'unrepaired_damage'].unique()
Out[44]:
array(['nein', 'ja'], dtype=object)
In [47]:
No_unrepaired_damage = autos.loc[autos.loc[:,'unrepaired_damage'] == 'nein', 'price'].mean()
unrepaired_damage = autos.loc[autos.loc[:,'unrepaired_damage'] == 'ja', 'price'].mean()
No_unrepaired_damage - unrepaired_damage
Out[47]:
5421.324586578532
  • On average, cars with no unrepaired damage, cost 5421.3 more than those with unrepaired damage