Exploring eBay Car Sales Data:


This project aims to clean and analyze a dataset from a German car classifieds section on eBay. Pandas will be used to analyze the data. 50,000 data points were sampled from the original data set. While the data is no longer available on Kaggle (where it was initially scraped and uploaded), it can be found here. The dataset was purposefully tampered with as a means of learning how to scrape and clean imperfect data.

In [1]:
import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
In [2]:
autos
Out[2]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon privat Angebot $24,900 control limousine 2011 automatik 239 q5 100,000km 1 diesel audi nein 2016-03-27 00:00:00 0 82131 2016-04-01 13:47:40
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot $1,980 control cabrio 1996 manuell 75 astra 150,000km 5 benzin opel nein 2016-03-28 00:00:00 0 44807 2016-04-02 14:18:02
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot $13,200 test cabrio 2014 automatik 69 500 5,000km 11 benzin fiat nein 2016-04-02 00:00:00 0 73430 2016-04-04 11:47:27
49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition privat Angebot $22,900 control kombi 2013 manuell 150 a3 40,000km 11 diesel audi nein 2016-03-08 00:00:00 0 35683 2016-04-05 16:45:07
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V privat Angebot $1,250 control limousine 1996 manuell 101 vectra 150,000km 1 benzin opel nein 2016-03-13 00:00:00 0 45897 2016-04-06 21:18:48

50000 rows × 20 columns

In [3]:
print(autos.info())
print(autos.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
None
           dateCrawled                                               name  \
0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD   
1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik   
2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United   
3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...   
4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...   

   seller offerType   price   abtest vehicleType  yearOfRegistration  \
0  privat   Angebot  $5,000  control         bus                2004   
1  privat   Angebot  $8,500  control   limousine                1997   
2  privat   Angebot  $8,990     test   limousine                2009   
3  privat   Angebot  $4,350  control  kleinwagen                2007   
4  privat   Angebot  $1,350     test       kombi                2003   

     gearbox  powerPS   model   odometer  monthOfRegistration fuelType  \
0    manuell      158  andere  150,000km                    3      lpg   
1  automatik      286     7er  150,000km                    6   benzin   
2    manuell      102    golf   70,000km                    7   benzin   
3  automatik       71  fortwo   70,000km                    6   benzin   
4    manuell        0   focus  150,000km                    7   benzin   

        brand notRepairedDamage          dateCreated  nrOfPictures  \
0     peugeot              nein  2016-03-26 00:00:00             0   
1         bmw              nein  2016-04-04 00:00:00             0   
2  volkswagen              nein  2016-03-26 00:00:00             0   
3       smart              nein  2016-03-12 00:00:00             0   
4        ford              nein  2016-04-01 00:00:00             0   

   postalCode             lastSeen  
0       79588  2016-04-06 06:45:54  
1       71034  2016-04-06 14:45:08  
2       35394  2016-04-06 20:15:37  
3       33729  2016-03-15 03:16:28  
4       39218  2016-04-01 14:38:50  

Some important inital observations are that


vehicleType,gearbox,model,fuelType,notRepairedDamage contain roughly 2,500 - 10,000 null values. Also, certain column values are not integers. Some data cleaning will be required for columns like dateCrawled,price,odometer,dateCreated, and lastSeen. In order to analyze the data in these columns, these values need to be evaluated as integers.

In [4]:
autos.columns
Out[4]:
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

The above columns need to be modified in the following way:

  1. yearOfRegistration,monthOfRegistration,notRepairedDamage, as well as dateCreated will be reworded and all columns will be rewritten in snakecase.

  2. The corrected snakecase columns will be reassigned back to autos.columns

In [5]:
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']
In [6]:
autos.columns
Out[6]:
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
In [7]:
autos.head()
Out[7]:
date_crawled name seller offer_type price ab_test vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

The column names in the Autos DataFrame above


is adjusteed so that it matches Python's preferred snakecase instead of the current camelcase.

In [8]:
autos.describe(include = 'all')
Out[8]:
date_crawled name seller offer_type price ab_test vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-08 10:40:35 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

Only one column (nr_of_pictures) appears like a good candidate to be dropped because the column description indicates 50,00 entries that are all NaN.

Suitable columns for investigation look like name, price, vehicle_type, registration_year, gearbox,model,odometer,fuel_type,brand,unrepaired_damage, ad_created, and last_seen.

Numeric data stored as text which needs to be cleaned for further analysis would be the following columns: date_crawled,price,odometer, ad_created, and last_seen.

In [9]:
vehicle_type_counts = autos["vehicle_type"].value_counts()
In [10]:
print(vehicle_type_counts)
limousine     12859
kleinwagen    10822
kombi          9127
bus            4093
cabrio         3061
coupe          2537
suv            1986
andere          420
Name: vehicle_type, dtype: int64

The data at a glance:

The data shows two columns with values with frequencies that are essentially one value: seller and offer_type. Other columns which need more investigation are vehicle_type, model, and brand. Examples of columns which are stored as text that need further cleaning are: price,registration_year,odometer, and nr_of_pictures.

In [11]:
autos["price"] = autos["price"].str.replace('$','').str.replace(',','').astype(int)
autos.rename({"price":"price_$"}, axis = 1, inplace = True)
In [12]:
autos.head()
Out[12]:
date_crawled name seller offer_type price_$ ab_test vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot 8990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot 4350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot 1350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
In [13]:
autos["odometer"] = autos["odometer"].str.replace(',','').str.replace('km','').astype(int)
In [14]:
autos.rename({"odometer":"odometer_km"}, axis = 1, inplace = True)            
In [15]:
autos.head()
Out[15]:
date_crawled name seller offer_type price_$ ab_test vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000 control bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500 control limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot 8990 test limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot 4350 control kleinwagen 2007 automatik 71 fortwo 70000 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot 1350 test kombi 2003 manuell 0 focus 150000 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

Now to explore:

the values by category within the odometer_km column and the price column.

In [16]:
odometer_km = autos["odometer_km"]
odometer_km.unique().shape
Out[16]:
(13,)
In [17]:
odometer_km.describe
Out[17]:
<bound method NDFrame.describe of 0        150000
1        150000
2         70000
3         70000
4        150000
          ...  
49995    100000
49996    150000
49997      5000
49998     40000
49999    150000
Name: odometer_km, Length: 50000, dtype: int64>
In [18]:
odometer_km.value_counts().sort_index(ascending = False).head()
Out[18]:
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
Name: odometer_km, dtype: int64
In [19]:
odometer_km.value_counts()
Out[19]:
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64
In [20]:
autos["price_$"].describe()
Out[20]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_$, dtype: float64
In [21]:
autos["price_$"].value_counts().sort_index(ascending = True).head(10)
Out[21]:
0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
Name: price_$, dtype: int64
In [22]:
autos["price_$"].value_counts().sort_index(ascending = False).head(20)
Out[22]:
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price_$, dtype: int64
In [23]:
autos = autos[autos["price_$"].between(1,100000)]
autos["price_$"].shape
Out[23]:
(48526,)
In [24]:
## autos.loc[autos["price_$"] < 1, "price_$" ] = np.nan
In [25]:
## autos.loc[autos["price_$"] > 100000, "price_$" ] = np.nan
In [26]:
## price_ = autos["price_$"]
## price_no_null = price_.dropna(axis = 0 )
In [27]:
## price_no_null.describe()
In [28]:
## price_no_null.unique().shape
In [29]:
autos["price_$"].describe()
Out[29]:
count    48526.000000
mean      5755.574146
std       7514.501772
min          1.000000
25%       1200.000000
50%       3000.000000
75%       7450.000000
max      99900.000000
Name: price_$, dtype: float64
In [30]:
autos["price_$"].value_counts().sort_index(ascending=True).head(10)
Out[30]:
1     156
2       3
3       1
5       2
8       1
9       1
10      7
11      2
12      3
13      2
Name: price_$, dtype: int64
In [31]:
autos["price_$"].value_counts().sort_index(ascending = False).head(20)
Out[31]:
99900    2
99000    2
98500    1
94999    1
93911    1
93000    2
89900    1
89000    1
88900    1
86500    1
85000    1
84997    1
84000    1
83000    1
82987    1
80000    3
79999    1
79980    1
79933    1
79500    1
Name: price_$, dtype: int64

While all odometer values don't exceed 150,000:


the price values are upwards of 100 million dollars. The price column needed to be brought within a more reasonable range. More than 90% of the data (counts after filtering divided by counts before filtering) was found with values between 1 and 100,000 dollars. Since it's not unreasonable to assume that bids start at 1 dollar, those values were included. Value counts also indicates that this price (1 dollar) contains two orders of magnitude more value counts than the next nine subsequent prices. Conversely, the highest 20 values still contained within the price range, (1- 1,000,000 dollars) have a great deal fewer value counts. While the higher range is important to include, it's clear that the listings of interest should probably not exceed 1 million dollars.

In [32]:
## autos["date_crawled"] = autos['date_crawled'].str[:10]

autos["date_crawled"] = autos["date_crawled"].str[:10]  
<ipython-input-32-d68c5cc0133a>:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  autos["date_crawled"] = autos["date_crawled"].str[:10]
In [33]:
autos['date_crawled'].value_counts(normalize=True, dropna=False).sort_index(ascending = True)
Out[33]:
2016-03-05    0.025306
2016-03-06    0.014054
2016-03-07    0.036043
2016-03-08    0.033281
2016-03-09    0.033096
2016-03-10    0.032210
2016-03-11    0.032601
2016-03-12    0.036929
2016-03-13    0.015682
2016-03-14    0.036578
2016-03-15    0.034270
2016-03-16    0.029613
2016-03-17    0.031633
2016-03-18    0.012880
2016-03-19    0.034806
2016-03-20    0.037897
2016-03-21    0.037361
2016-03-22    0.032910
2016-03-23    0.032230
2016-03-24    0.029345
2016-03-25    0.031612
2016-03-26    0.032189
2016-03-27    0.031076
2016-03-28    0.034847
2016-03-29    0.034126
2016-03-30    0.033652
2016-03-31    0.031839
2016-04-01    0.033652
2016-04-02    0.035507
2016-04-03    0.038598
2016-04-04    0.036517
2016-04-05    0.013086
2016-04-06    0.003174
2016-04-07    0.001401
Name: date_crawled, dtype: float64

Over the approximate one month span


the sight was crawled rather uniformly each day.

In [34]:
autos["ad_created"] = autos["ad_created"].str[:10]
<ipython-input-34-e8d1dccb314d>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  autos["ad_created"] = autos["ad_created"].str[:10]
In [35]:
autos['ad_created'].value_counts(normalize=True, dropna=False).sort_index(ascending = True)
Out[35]:
2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
                ...   
2016-04-03    0.038845
2016-04-04    0.036887
2016-04-05    0.011808
2016-04-06    0.003256
2016-04-07    0.001257
Name: ad_created, Length: 76, dtype: float64

Most ads were created within the 1-2 month span in 2016 of March and April.


However, there are a few that were created upwards of 9-10 months prior to this 1-2 month March/April span.

In [36]:
autos['last_seen'] = autos['last_seen'].str[:10]
<ipython-input-36-ba4857993d1d>:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  autos['last_seen'] = autos['last_seen'].str[:10]
In [37]:
autos['last_seen'].value_counts(normalize=True, dropna=False).sort_index(ascending = True)
Out[37]:
2016-03-05    0.001072
2016-03-06    0.004328
2016-03-07    0.005399
2016-03-08    0.007419
2016-03-09    0.009582
2016-03-10    0.010675
2016-03-11    0.012385
2016-03-12    0.023802
2016-03-13    0.008902
2016-03-14    0.012612
2016-03-15    0.015888
2016-03-16    0.016445
2016-03-17    0.028109
2016-03-18    0.007336
2016-03-19    0.015847
2016-03-20    0.020649
2016-03-21    0.020628
2016-03-22    0.021391
2016-03-23    0.018526
2016-03-24    0.019742
2016-03-25    0.019227
2016-03-26    0.016754
2016-03-27    0.015641
2016-03-28    0.020855
2016-03-29    0.022359
2016-03-30    0.024791
2016-03-31    0.023802
2016-04-01    0.022813
2016-04-02    0.024914
2016-04-03    0.025203
2016-04-04    0.024502
2016-04-05    0.124634
2016-04-06    0.221840
2016-04-07    0.131929
Name: last_seen, dtype: float64

While last_seen


is generally indicative of a car being sold, (and consequently the add being taken down) it may also indicate that the crawler had been spending a couple days taking inventory of what ads were stll listed on the site. Thus, the last time the ad was seen by human is not necessarily indicative of the higher frequencies shown April 5th through April 7th. It also more than likely indicates that there wasn't a spike in sales, but instead, a bot conflating differences in human and artificial last seen counts by perhaps taking a kind of monthly inventory.

In [38]:
autos['registration_year'].describe()
Out[38]:
count    48526.000000
mean      2004.754935
std         88.678603
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

There are values in registration_year above


which clearly need to be removed from the data set. The automobile was invented c. 1885 and at the time of this dataset being pulled, the year wasn't later than 2016. Furthermore, to be more conservative with our estimates, let's assume a vast majority of consumers will probably NOT be shopping for a car before 1900.

In [39]:
autos = autos[autos['registration_year'].between(1900,2016)]
In [40]:
autos['registration_year'].describe()
Out[40]:
count    46642.000000
mean      2002.908709
std          7.176230
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64

The registration year statistics above

describe what listings remain after being filtered by range of years. Since there is still a market for early 20th century cars, (given the first car was invented in 1886) and since the last listing was seen in April of 2016, it was determined that this range was best.

In [41]:
autos['registration_year'].value_counts(normalize = True).sort_index().tail(50)
Out[41]:
1967    0.000557
1968    0.000536
1969    0.000407
1970    0.000815
1971    0.000557
1972    0.000708
1973    0.000493
1974    0.000515
1975    0.000386
1976    0.000450
1977    0.000450
1978    0.000943
1979    0.000729
1980    0.001822
1981    0.000600
1982    0.000879
1983    0.001093
1984    0.001093
1985    0.002037
1986    0.001544
1987    0.001544
1988    0.002873
1989    0.003731
1990    0.007440
1991    0.007268
1992    0.007911
1993    0.009112
1994    0.013486
1995    0.026285
1996    0.029437
1997    0.041808
1998    0.050598
1999    0.062111
2000    0.067643
2001    0.056494
2002    0.053300
2003    0.057866
2004    0.057952
2005    0.062948
2006    0.057245
2007    0.048797
2008    0.047447
2009    0.044702
2010    0.033982
2011    0.034776
2012    0.028022
2013    0.017130
2014    0.014172
2015    0.008319
2016    0.026028
Name: registration_year, dtype: float64

By inspection


it's quite clear that with the exception of 2015, between 1994 and 2016 inclusive, that these registration years comprise greater than 1 percent of total listings with many of these years being well above 3 percent.

In [42]:
values = autos['brand'].value_counts(normalize=True)
print(values)
top_brands = values["volkswagen":"ford"].index ##help understand syntax?
## or top_brands = values[values > 0.05].index
print('\n')
print(values[values > 0.05])

print(top_brands)
volkswagen        0.211440
bmw               0.110072
opel              0.107671
mercedes_benz     0.096480
audi              0.086617
ford              0.069937
renault           0.047189
peugeot           0.029866
fiat              0.025664
seat              0.018288
skoda             0.016423
nissan            0.015287
mazda             0.015201
smart             0.014172
citroen           0.014022
toyota            0.012714
hyundai           0.010034
sonstige_autos    0.009712
volvo             0.009155
mini              0.008769
mitsubishi        0.008233
honda             0.007847
kia               0.007075
alfa_romeo        0.006646
suzuki            0.005939
chevrolet         0.005703
porsche           0.005574
chrysler          0.003516
dacia             0.002637
daihatsu          0.002508
jeep              0.002273
subaru            0.002144
land_rover        0.002101
saab              0.001651
jaguar            0.001565
daewoo            0.001501
trabant           0.001394
rover             0.001329
lancia            0.001072
lada              0.000579
Name: brand, dtype: float64


volkswagen       0.211440
bmw              0.110072
opel             0.107671
mercedes_benz    0.096480
audi             0.086617
ford             0.069937
Name: brand, dtype: float64
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')

Brand data being used is above 5%


in order to get a sense for what the mean cost is for any brand that comprises over 5 percent of the total listings. Thus, more popular well-known brands can be understood in this analysis as well as their mean costs.

In [43]:
MOT_automobiles = {} ##mean of top automobiles

for brands in top_brands:
    Selected_rows = autos[autos["brand"] == brands]
    
    mean = Selected_rows["price_$"].mean()
    
    MOT_automobiles[brands] = int(mean)
    
    MOT_auto_list = sorted(MOT_automobiles.items(), key = lambda x:x[1])
    sort_MOT_auto_list = dict(MOT_auto_list)




MOT_autolist = dict(MOT_auto_list)

                      
In [44]:
MOT_automobiles
Out[44]:
{'volkswagen': 5402,
 'bmw': 8201,
 'opel': 2975,
 'mercedes_benz': 8529,
 'audi': 9295,
 'ford': 3710}

Comparing the frequency of brand posted


and the mean price for the brand posted, there doesn't seem to be a correlation between how many posts a brand has with it's cost (i.e. more posts means higher cost). There is one example where a higher mean cost has the second lowest frequency as compared to the other six brands (audi) and also an example of where a relatively lower mean cost has the highest frequency of postings (volkswagen). Frequency is based on popularity and popularity seems to be multi-faceted; one component perhaps being due to affordability, another being due to perceived quality of the vehicle, and yet another perhaps being due to some combination of the aforementioned.

In [45]:
new_values = values[values > 0.05]
In [46]:
print(new_values)
volkswagen       0.211440
bmw              0.110072
opel             0.107671
mercedes_benz    0.096480
audi             0.086617
ford             0.069937
Name: brand, dtype: float64
In [47]:
new_series = pd.Series(MOT_autolist)
In [48]:
print(new_series)
opel             2975
ford             3710
volkswagen       5402
bmw              8201
mercedes_benz    8529
audi             9295
dtype: int64
In [49]:
new_df = pd.DataFrame(new_series, columns = ['mean_price'])
In [50]:
new_df
Out[50]:
mean_price
opel 2975
ford 3710
volkswagen 5402
bmw 8201
mercedes_benz 8529
audi 9295
In [51]:
Mean_MofTA = {} #mean mileage of top Automobiles dictionary

for brands in top_brands:
    selected_rows_miles = autos[autos["brand"] == brands]
    mean_miles = selected_rows_miles["odometer_km"].mean()
    Mean_MofTA[brands] = int(mean_miles)
    
    
    
    Mean_MofTA_list = sorted(Mean_MofTA.items(), key = lambda x:x[1])
    sort_Mean_MofTA_list = dict(Mean_MofTA_list)
    
Mean_MofTA_list = dict(Mean_MofTA_list)
Mean_MofTA_list
Out[51]:
{'ford': 124288,
 'volkswagen': 128707,
 'audi': 129188,
 'opel': 129310,
 'mercedes_benz': 130851,
 'bmw': 132633}
In [ ]:
 
In [52]:
MMofTA_series = pd.Series(Mean_MofTA_list)
print(MMofTA_series)
ford             124288
volkswagen       128707
audi             129188
opel             129310
mercedes_benz    130851
bmw              132633
dtype: int64
In [53]:
MMofTA_df = pd.DataFrame(MMofTA_series, columns = ['mean_mileage'])
MMofTA_df                         
                         
Out[53]:
mean_mileage
ford 124288
volkswagen 128707
audi 129188
opel 129310
mercedes_benz 130851
bmw 132633
In [54]:
d = {'mean_price': pd.Series(MOT_autolist), 'mean_mileage': pd.Series(MMofTA_series)}
pd.DataFrame(data = d, index = top_brands)
Out[54]:
mean_price mean_mileage
volkswagen 5402 128707
bmw 8201 132633
opel 2975 129310
mercedes_benz 8529 130851
audi 9295 129188
ford 3710 124288

Description for above

In [55]:
autos.head()
Out[55]:
date_crawled name seller offer_type price_$ ab_test vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000 control bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 2016-03-26 0 79588 2016-04-06
1 2016-04-04 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500 control limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 2016-04-04 0 71034 2016-04-06
2 2016-03-26 Volkswagen_Golf_1.6_United privat Angebot 8990 test limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 2016-03-26 0 35394 2016-04-06
3 2016-03-12 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot 4350 control kleinwagen 2007 automatik 71 fortwo 70000 6 benzin smart nein 2016-03-12 0 33729 2016-03-15
4 2016-04-01 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot 1350 test kombi 2003 manuell 0 focus 150000 7 benzin ford nein 2016-04-01 0 39218 2016-04-01
In [56]:
autos["vehicle_type"].unique()
Out[56]:
array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv',
       'cabrio', 'andere'], dtype=object)
In [57]:
autos["vehicle_type"].isnull().value_counts()
Out[57]:
False    43939
True      2703
Name: vehicle_type, dtype: int64
In [58]:
autos.head(20)
Out[58]:
date_crawled name seller offer_type price_$ ab_test vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000 control bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 2016-03-26 0 79588 2016-04-06
1 2016-04-04 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500 control limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 2016-04-04 0 71034 2016-04-06
2 2016-03-26 Volkswagen_Golf_1.6_United privat Angebot 8990 test limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 2016-03-26 0 35394 2016-04-06
3 2016-03-12 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot 4350 control kleinwagen 2007 automatik 71 fortwo 70000 6 benzin smart nein 2016-03-12 0 33729 2016-03-15
4 2016-04-01 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot 1350 test kombi 2003 manuell 0 focus 150000 7 benzin ford nein 2016-04-01 0 39218 2016-04-01
5 2016-03-21 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... privat Angebot 7900 test bus 2006 automatik 150 voyager 150000 4 diesel chrysler NaN 2016-03-21 0 22962 2016-04-06
6 2016-03-20 VW_Golf_III_GT_Special_Electronic_Green_Metall... privat Angebot 300 test limousine 1995 manuell 90 golf 150000 8 benzin volkswagen NaN 2016-03-20 0 31535 2016-03-23
7 2016-03-16 Golf_IV_1.9_TDI_90PS privat Angebot 1990 control limousine 1998 manuell 90 golf 150000 12 diesel volkswagen nein 2016-03-16 0 53474 2016-04-07
8 2016-03-22 Seat_Arosa privat Angebot 250 test NaN 2000 manuell 0 arosa 150000 10 NaN seat nein 2016-03-22 0 7426 2016-03-26
9 2016-03-16 Renault_Megane_Scenic_1.6e_RT_Klimaanlage privat Angebot 590 control bus 1997 manuell 90 megane 150000 7 benzin renault nein 2016-03-16 0 15749 2016-04-06
11 2016-03-16 Mercedes_A140_Motorschaden privat Angebot 350 control NaN 2000 NaN 0 NaN 150000 0 benzin mercedes_benz NaN 2016-03-16 0 17498 2016-03-16
12 2016-03-31 Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... privat Angebot 5299 control kleinwagen 2010 automatik 71 fortwo 50000 9 benzin smart nein 2016-03-31 0 34590 2016-04-06
13 2016-03-23 Audi_A3_1.6_tuning privat Angebot 1350 control limousine 1999 manuell 101 a3 150000 11 benzin audi nein 2016-03-23 0 12043 2016-04-01
14 2016-03-23 Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... privat Angebot 3999 test kleinwagen 2007 manuell 75 clio 150000 9 benzin renault NaN 2016-03-23 0 81737 2016-04-01
15 2016-04-01 Corvette_C3_Coupe_T_Top_Crossfire_Injection privat Angebot 18900 test coupe 1982 automatik 203 NaN 80000 6 benzin sonstige_autos nein 2016-04-01 0 61276 2016-04-02
16 2016-03-16 Opel_Vectra_B_Kombi privat Angebot 350 test kombi 1999 manuell 101 vectra 150000 5 benzin opel nein 2016-03-16 0 57299 2016-03-18
17 2016-03-29 Volkswagen_Scirocco_2_G60 privat Angebot 5500 test coupe 1990 manuell 205 scirocco 150000 6 benzin volkswagen nein 2016-03-29 0 74821 2016-04-05
18 2016-03-26 Verkaufen_mein_bmw_e36_320_i_touring privat Angebot 300 control bus 1995 manuell 150 3er 150000 0 benzin bmw NaN 2016-03-26 0 54329 2016-04-02
19 2016-03-17 mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 privat Angebot 4150 control suv 2004 manuell 124 andere 150000 2 lpg mazda nein 2016-03-17 0 40878 2016-03-17
20 2016-03-05 Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... privat Angebot 3500 test kombi 2003 manuell 131 a4 150000 5 diesel audi NaN 2016-03-05 0 53913 2016-03-07
In [59]:
vehicle_type_dict = {"kleinwagen" : "small car",
                     "kombi" : "station wagon",
                     "cabrio" : "convertible",
                     "andere" : "other",
                     "bus" : "bus",
                     "limousine" : "limousine",
                     "coupe" : "coupe",
                     "suv" : "suv"
                     
                    }

autos['vehicle_type'] = autos['vehicle_type'].map(vehicle_type_dict)
In [60]:
autos["vehicle_type"].fillna("unknown_type", inplace = True)
In [61]:
autos.head(20)
Out[61]:
date_crawled name seller offer_type price_$ ab_test vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000 control bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 2016-03-26 0 79588 2016-04-06
1 2016-04-04 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500 control limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 2016-04-04 0 71034 2016-04-06
2 2016-03-26 Volkswagen_Golf_1.6_United privat Angebot 8990 test limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 2016-03-26 0 35394 2016-04-06
3 2016-03-12 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot 4350 control small car 2007 automatik 71 fortwo 70000 6 benzin smart nein 2016-03-12 0 33729 2016-03-15
4 2016-04-01 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot 1350 test station wagon 2003 manuell 0 focus 150000 7 benzin ford nein 2016-04-01 0 39218 2016-04-01
5 2016-03-21 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... privat Angebot 7900 test bus 2006 automatik 150 voyager 150000 4 diesel chrysler NaN 2016-03-21 0 22962 2016-04-06
6 2016-03-20 VW_Golf_III_GT_Special_Electronic_Green_Metall... privat Angebot 300 test limousine 1995 manuell 90 golf 150000 8 benzin volkswagen NaN 2016-03-20 0 31535 2016-03-23
7 2016-03-16 Golf_IV_1.9_TDI_90PS privat Angebot 1990 control limousine 1998 manuell 90 golf 150000 12 diesel volkswagen nein 2016-03-16 0 53474 2016-04-07
8 2016-03-22 Seat_Arosa privat Angebot 250 test unknown_type 2000 manuell 0 arosa 150000 10 NaN seat nein 2016-03-22 0 7426 2016-03-26
9 2016-03-16 Renault_Megane_Scenic_1.6e_RT_Klimaanlage privat Angebot 590 control bus 1997 manuell 90 megane 150000 7 benzin renault nein 2016-03-16 0 15749 2016-04-06
11 2016-03-16 Mercedes_A140_Motorschaden privat Angebot 350 control unknown_type 2000 NaN 0 NaN 150000 0 benzin mercedes_benz NaN 2016-03-16 0 17498 2016-03-16
12 2016-03-31 Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... privat Angebot 5299 control small car 2010 automatik 71 fortwo 50000 9 benzin smart nein 2016-03-31 0 34590 2016-04-06
13 2016-03-23 Audi_A3_1.6_tuning privat Angebot 1350 control limousine 1999 manuell 101 a3 150000 11 benzin audi nein 2016-03-23 0 12043 2016-04-01
14 2016-03-23 Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... privat Angebot 3999 test small car 2007 manuell 75 clio 150000 9 benzin renault NaN 2016-03-23 0 81737 2016-04-01
15 2016-04-01 Corvette_C3_Coupe_T_Top_Crossfire_Injection privat Angebot 18900 test coupe 1982 automatik 203 NaN 80000 6 benzin sonstige_autos nein 2016-04-01 0 61276 2016-04-02
16 2016-03-16 Opel_Vectra_B_Kombi privat Angebot 350 test station wagon 1999 manuell 101 vectra 150000 5 benzin opel nein 2016-03-16 0 57299 2016-03-18
17 2016-03-29 Volkswagen_Scirocco_2_G60 privat Angebot 5500 test coupe 1990 manuell 205 scirocco 150000 6 benzin volkswagen nein 2016-03-29 0 74821 2016-04-05
18 2016-03-26 Verkaufen_mein_bmw_e36_320_i_touring privat Angebot 300 control bus 1995 manuell 150 3er 150000 0 benzin bmw NaN 2016-03-26 0 54329 2016-04-02
19 2016-03-17 mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 privat Angebot 4150 control suv 2004 manuell 124 andere 150000 2 lpg mazda nein 2016-03-17 0 40878 2016-03-17
20 2016-03-05 Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... privat Angebot 3500 test station wagon 2003 manuell 131 a4 150000 5 diesel audi NaN 2016-03-05 0 53913 2016-03-07
In [62]:
autos["date_crawled"] = autos["date_crawled"].str.replace("-","")
In [63]:
autos["ad_created"] = autos["ad_created"].str.replace("-","")
autos["last_seen"] = autos["last_seen"].str.replace("-","")
In [64]:
autos.head(20)
Out[64]:
date_crawled name seller offer_type price_$ ab_test vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 20160326 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000 control bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 20160326 0 79588 20160406
1 20160404 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500 control limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 20160404 0 71034 20160406
2 20160326 Volkswagen_Golf_1.6_United privat Angebot 8990 test limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 20160326 0 35394 20160406
3 20160312 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot 4350 control small car 2007 automatik 71 fortwo 70000 6 benzin smart nein 20160312 0 33729 20160315
4 20160401 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot 1350 test station wagon 2003 manuell 0 focus 150000 7 benzin ford nein 20160401 0 39218 20160401
5 20160321 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... privat Angebot 7900 test bus 2006 automatik 150 voyager 150000 4 diesel chrysler NaN 20160321 0 22962 20160406
6 20160320 VW_Golf_III_GT_Special_Electronic_Green_Metall... privat Angebot 300 test limousine 1995 manuell 90 golf 150000 8 benzin volkswagen NaN 20160320 0 31535 20160323
7 20160316 Golf_IV_1.9_TDI_90PS privat Angebot 1990 control limousine 1998 manuell 90 golf 150000 12 diesel volkswagen nein 20160316 0 53474 20160407
8 20160322 Seat_Arosa privat Angebot 250 test unknown_type 2000 manuell 0 arosa 150000 10 NaN seat nein 20160322 0 7426 20160326
9 20160316 Renault_Megane_Scenic_1.6e_RT_Klimaanlage privat Angebot 590 control bus 1997 manuell 90 megane 150000 7 benzin renault nein 20160316 0 15749 20160406
11 20160316 Mercedes_A140_Motorschaden privat Angebot 350 control unknown_type 2000 NaN 0 NaN 150000 0 benzin mercedes_benz NaN 20160316 0 17498 20160316
12 20160331 Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... privat Angebot 5299 control small car 2010 automatik 71 fortwo 50000 9 benzin smart nein 20160331 0 34590 20160406
13 20160323 Audi_A3_1.6_tuning privat Angebot 1350 control limousine 1999 manuell 101 a3 150000 11 benzin audi nein 20160323 0 12043 20160401
14 20160323 Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... privat Angebot 3999 test small car 2007 manuell 75 clio 150000 9 benzin renault NaN 20160323 0 81737 20160401
15 20160401 Corvette_C3_Coupe_T_Top_Crossfire_Injection privat Angebot 18900 test coupe 1982 automatik 203 NaN 80000 6 benzin sonstige_autos nein 20160401 0 61276 20160402
16 20160316 Opel_Vectra_B_Kombi privat Angebot 350 test station wagon 1999 manuell 101 vectra 150000 5 benzin opel nein 20160316 0 57299 20160318
17 20160329 Volkswagen_Scirocco_2_G60 privat Angebot 5500 test coupe 1990 manuell 205 scirocco 150000 6 benzin volkswagen nein 20160329 0 74821 20160405
18 20160326 Verkaufen_mein_bmw_e36_320_i_touring privat Angebot 300 control bus 1995 manuell 150 3er 150000 0 benzin bmw NaN 20160326 0 54329 20160402
19 20160317 mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 privat Angebot 4150 control suv 2004 manuell 124 andere 150000 2 lpg mazda nein 20160317 0 40878 20160317
20 20160305 Audi_A4_Avant_1.9_TDI_*6_Gang*AHK*Klimatronik*... privat Angebot 3500 test station wagon 2003 manuell 131 a4 150000 5 diesel audi NaN 20160305 0 53913 20160307
In [65]:
autos["name"] = autos["name"].str.replace("_"," ")
In [66]:
autos.head(20)
Out[66]:
date_crawled name seller offer_type price_$ ab_test vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 20160326 Peugeot 807 160 NAVTECH ON BOARD privat Angebot 5000 control bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 20160326 0 79588 20160406
1 20160404 BMW 740i 4 4 Liter HAMANN UMBAU Mega Optik privat Angebot 8500 control limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 20160404 0 71034 20160406
2 20160326 Volkswagen Golf 1.6 United privat Angebot 8990 test limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 20160326 0 35394 20160406
3 20160312 Smart smart fortwo coupe softouch/F1/Klima/Pan... privat Angebot 4350 control small car 2007 automatik 71 fortwo 70000 6 benzin smart nein 20160312 0 33729 20160315
4 20160401 Ford Focus 1 6 Benzin TÜV neu ist sehr gepfleg... privat Angebot 1350 test station wagon 2003 manuell 0 focus 150000 7 benzin ford nein 20160401 0 39218 20160401
5 20160321 Chrysler Grand Voyager 2.8 CRD Aut.Limited Sto... privat Angebot 7900 test bus 2006 automatik 150 voyager 150000 4 diesel chrysler NaN 20160321 0 22962 20160406
6 20160320 VW Golf III GT Special Electronic Green Metall... privat Angebot 300 test limousine 1995 manuell 90 golf 150000 8 benzin volkswagen NaN 20160320 0 31535 20160323
7 20160316 Golf IV 1.9 TDI 90PS privat Angebot 1990 control limousine 1998 manuell 90 golf 150000 12 diesel volkswagen nein 20160316 0 53474 20160407
8 20160322 Seat Arosa privat Angebot 250 test unknown_type 2000 manuell 0 arosa 150000 10 NaN seat nein 20160322 0 7426 20160326
9 20160316 Renault Megane Scenic 1.6e RT Klimaanlage privat Angebot 590 control bus 1997 manuell 90 megane 150000 7 benzin renault nein 20160316 0 15749 20160406
11 20160316 Mercedes A140 Motorschaden privat Angebot 350 control unknown_type 2000 NaN 0 NaN 150000 0 benzin mercedes_benz NaN 20160316 0 17498 20160316
12 20160331 Smart smart fortwo coupe softouch pure MHD Pan... privat Angebot 5299 control small car 2010 automatik 71 fortwo 50000 9 benzin smart nein 20160331 0 34590 20160406
13 20160323 Audi A3 1.6 tuning privat Angebot 1350 control limousine 1999 manuell 101 a3 150000 11 benzin audi nein 20160323 0 12043 20160401
14 20160323 Renault Clio 3 Dynamique 1.2 16 V; viele Ver... privat Angebot 3999 test small car 2007 manuell 75 clio 150000 9 benzin renault NaN 20160323 0 81737 20160401
15 20160401 Corvette C3 Coupe T Top Crossfire Injection privat Angebot 18900 test coupe 1982 automatik 203 NaN 80000 6 benzin sonstige_autos nein 20160401 0 61276 20160402
16 20160316 Opel Vectra B Kombi privat Angebot 350 test station wagon 1999 manuell 101 vectra 150000 5 benzin opel nein 20160316 0 57299 20160318
17 20160329 Volkswagen Scirocco 2 G60 privat Angebot 5500 test coupe 1990 manuell 205 scirocco 150000 6 benzin volkswagen nein 20160329 0 74821 20160405
18 20160326 Verkaufen mein bmw e36 320 i touring privat Angebot 300 control bus 1995 manuell 150 3er 150000 0 benzin bmw NaN 20160326 0 54329 20160402
19 20160317 mazda tribute 2.0 mit gas und tuev neu 2018 privat Angebot 4150 control suv 2004 manuell 124 andere 150000 2 lpg mazda nein 20160317 0 40878 20160317
20 20160305 Audi A4 Avant 1.9 TDI *6 Gang*AHK*Klimatronik*... privat Angebot 3500 test station wagon 2003 manuell 131 a4 150000 5 diesel audi NaN 20160305 0 53913 20160307
In [67]:
autos["name"].str.split().str[0:2]
Out[67]:
0            [Peugeot, 807]
1               [BMW, 740i]
2        [Volkswagen, Golf]
3            [Smart, smart]
4             [Ford, Focus]
                ...        
49995            [Audi, Q5]
49996         [Opel, Astra]
49997           [Fiat, 500]
49998            [Audi, A3]
49999        [Opel, Vectra]
Name: name, Length: 46642, dtype: object
In [68]:
models = autos["model"].unique()
In [69]:
print(models)
['andere' '7er' 'golf' 'fortwo' 'focus' 'voyager' 'arosa' 'megane' nan
 'a3' 'clio' 'vectra' 'scirocco' '3er' 'a4' '911' 'cooper' '5er' 'polo'
 'e_klasse' '2_reihe' 'c_klasse' 'corsa' 'mondeo' 'altea' 'a1' 'twingo'
 'a_klasse' 'cl' '3_reihe' 's_klasse' 'sandero' 'passat' 'primera'
 'wrangler' 'a6' 'transporter' 'astra' 'v40' 'ibiza' 'micra' '1er' 'yaris'
 'colt' '6_reihe' '5_reihe' 'corolla' 'ka' 'tigra' 'punto' 'vito'
 'cordoba' 'galaxy' '100' 'octavia' 'm_klasse' 'lupo' 'fiesta' 'superb'
 'meriva' 'c_max' 'laguna' 'touran' '1_reihe' 'm_reihe' 'touareg'
 'seicento' 'avensis' 'vivaro' 'x_reihe' 'ducato' 'carnival' 'boxster'
 'signum' 'sharan' 'zafira' 'rav' 'a5' 'beetle' 'c_reihe' 'phaeton'
 'i_reihe' 'sl' 'insignia' 'up' 'civic' '80' 'mx_reihe' 'omega' 'sorento'
 'z_reihe' 'berlingo' 'clk' 's_max' 'kalos' 'cx_reihe' 'grand' 'swift'
 'tiguan' 'sprinter' 'mii' 'viano' 'kaefer' 'almera' 'picanto' 'espace'
 'scenic' 'one' 'bora' 'fox' 'leon' 'transit' 'tucson' 'tt' 'qashqai'
 'mustang' 'jazz' 'nubira' 'v70' 'duster' 'stilo' 'justy' 'kangoo' 'fabia'
 'matiz' 'freelander' 'lancer' 'forester' 's_type' 'bravo' '500' '156'
 'cr_reihe' 'toledo' 'panda' 'a8' 'ceed' 'caddy' 'aygo' 'citigo' 'exeo'
 'c1' 'doblo' 'impreza' '147' 'agila' '145' 'c3' 'roomster' 'auris' '601'
 'eos' 'aveo' 'slk' 'terios' 'fusion' 'c4' 'glk' 'materia' 'v50' 'combo'
 'yeti' '900' '850' 'outlander' 'rio' 'escort' 'c2' 'jimny' 'forfour' 'a2'
 'spider' 'kadett' 'b_klasse' 'carisma' 'cayenne' 'r19' 'cc' 'note'
 'logan' 'xc_reihe' 'verso' 'v_klasse' 'jetta' 'c5' 'q5' 'getz' 'navara'
 'x_type' 'clubman' 's60' '4_reihe' 'lybra' 'cherokee' 'q3' 'q7'
 'roadster' 'captiva' 'cuore' '90' 'ptcruiser' 'kuga' 'x_trail' 'ypsilon'
 'pajero' 'delta' 'rx_reihe' 'modus' 'sportage' 'calibra' 'sirion' '6er'
 'gl' 'amarok' 'santa' '300c' 'antara' 'accord' 'crossfire' '159' 'niva'
 'galant' 'legacy' 'defender' 'range_rover' 'juke' 'spark' 'lanos' 'v60'
 'alhambra' 'discovery' 'range_rover_sport' 'range_rover_evoque' 'move'
 'croma' 'g_klasse' 'lodgy' 'rangerover' 'samara' 'kappa' '9000' 'charade'
 'i3' '200' 'b_max']
In [70]:
autos["model"].value_counts().head(20)
Out[70]:
golf           3707
andere         3370
3er            2615
polo           1609
corsa          1592
passat         1349
astra          1348
a4             1231
c_klasse       1136
5er            1132
e_klasse        958
a3              825
a6              797
focus           762
fiesta          722
transporter     674
twingo          615
2_reihe         600
fortwo          550
vectra          544
Name: model, dtype: int64
In [71]:
autos["model"].value_counts().head(12).sum() / autos["model"].shape[0]
Out[71]:
0.447493675228335
In [72]:
top_12_models = autos["model"].value_counts().head(12)
In [73]:
print(top_12_models)
golf        3707
andere      3370
3er         2615
polo        1609
corsa       1592
passat      1349
astra       1348
a4          1231
c_klasse    1136
5er         1132
e_klasse     958
a3           825
Name: model, dtype: int64
In [ ]:
 
In [74]:
Top_12_models = top_12_models[top_12_models > 837].index
In [75]:
print(Top_12_models)
Index(['golf', 'andere', '3er', 'polo', 'corsa', 'passat', 'astra', 'a4',
       'c_klasse', '5er', 'e_klasse'],
      dtype='object')
In [76]:
autos["model_sum"] = autos.groupby('model').cumcount() + 1
In [77]:
autos
Out[77]:
date_crawled name seller offer_type price_$ ab_test vehicle_type registration_year gearbox power_ps ... odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen model_sum
0 20160326 Peugeot 807 160 NAVTECH ON BOARD privat Angebot 5000 control bus 2004 manuell 158 ... 150000 3 lpg peugeot nein 20160326 0 79588 20160406 1
1 20160404 BMW 740i 4 4 Liter HAMANN UMBAU Mega Optik privat Angebot 8500 control limousine 1997 automatik 286 ... 150000 6 benzin bmw nein 20160404 0 71034 20160406 1
2 20160326 Volkswagen Golf 1.6 United privat Angebot 8990 test limousine 2009 manuell 102 ... 70000 7 benzin volkswagen nein 20160326 0 35394 20160406 1
3 20160312 Smart smart fortwo coupe softouch/F1/Klima/Pan... privat Angebot 4350 control small car 2007 automatik 71 ... 70000 6 benzin smart nein 20160312 0 33729 20160315 1
4 20160401 Ford Focus 1 6 Benzin TÜV neu ist sehr gepfleg... privat Angebot 1350 test station wagon 2003 manuell 0 ... 150000 7 benzin ford nein 20160401 0 39218 20160401 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 20160327 Audi Q5 3.0 TDI qu. S tr. Navi Panorama Xenon privat Angebot 24900 control limousine 2011 automatik 239 ... 100000 1 diesel audi nein 20160327 0 82131 20160401 62
49996 20160328 Opel Astra F Cabrio Bertone Edition TÜV neu+... privat Angebot 1980 control convertible 1996 manuell 75 ... 150000 5 benzin opel nein 20160328 0 44807 20160402 1348
49997 20160402 Fiat 500 C 1.2 Dualogic Lounge privat Angebot 13200 test convertible 2014 automatik 69 ... 5000 11 benzin fiat nein 20160402 0 73430 20160404 120
49998 20160308 Audi A3 2.0 TDI Sportback Ambition privat Angebot 22900 control station wagon 2013 manuell 150 ... 40000 11 diesel audi nein 20160308 0 35683 20160405 825
49999 20160314 Opel Vectra 1.6 16V privat Angebot 1250 control limousine 1996 manuell 101 ... 150000 1 benzin opel nein 20160313 0 45897 20160406 544

46642 rows × 21 columns

In [78]:
brand_model = {}

autos["model"].fillna("unknown",inplace = True)


Top_12_models = top_12_models[top_12_models > 837].index

for m in Top_12_models: 
    selected_rows = autos[autos["model"] == m ]
    sorted_rows = selected_rows.sort_values("model_sum",ascending = False)
    top_sum = sorted_rows.iloc[0]
    top_brands = top_sum["brand"]
    
    brand_model[m] = top_brands
    
    
    

brand_model    
Out[78]:
{'golf': 'volkswagen',
 'andere': 'fiat',
 '3er': 'bmw',
 'polo': 'volkswagen',
 'corsa': 'opel',
 'passat': 'volkswagen',
 'astra': 'opel',
 'a4': 'audi',
 'c_klasse': 'mercedes_benz',
 '5er': 'bmw',
 'e_klasse': 'mercedes_benz'}
In [79]:
brand_model_series = pd.Series(brand_model)
In [80]:
autos["date_crawled"] = autos["date_crawled"].astype(int)
In [81]:
autos["ad_created"] = autos["ad_created"].astype(int)
In [82]:
autos["last_seen"] = autos["last_seen"].astype(int)
In [83]:
brand_model_df = pd.DataFrame(brand_model_series, columns = ['model'])
In [84]:
print(brand_model_df)
                  model
golf         volkswagen
andere             fiat
3er                 bmw
polo         volkswagen
corsa              opel
passat       volkswagen
astra              opel
a4                 audi
c_klasse  mercedes_benz
5er                 bmw
e_klasse  mercedes_benz
In [85]:
model_count_series = pd.Series(top_12_models)
In [86]:
print(model_count_series)
golf        3707
andere      3370
3er         2615
polo        1609
corsa       1592
passat      1349
astra       1348
a4          1231
c_klasse    1136
5er         1132
e_klasse     958
a3           825
Name: model, dtype: int64
In [87]:
d = {'brand': pd.Series(brand_model), 'count': pd.Series(top_12_models)}
pd.DataFrame(data = d, index = Top_12_models)
Out[87]:
brand count
golf volkswagen 3707
andere fiat 3370
3er bmw 2615
polo volkswagen 1609
corsa opel 1592
passat volkswagen 1349
astra opel 1348
a4 audi 1231
c_klasse mercedes_benz 1136
5er bmw 1132
e_klasse mercedes_benz 958
In [88]:
autos['odometer_km'].unique()
Out[88]:
array([150000,  70000,  50000,  80000,  10000,  30000, 125000,  90000,
        20000,  60000,   5000,  40000, 100000])
In [89]:
AvgPrice = {}
miles_strata = autos["odometer_km"].unique()


for m in miles_strata:
    selected_rows = autos[autos["odometer_km"] == m]    
    mean_price = selected_rows["price_$"].mean()
    AvgPrice[m] = int(mean_price)

    AvgPrice_list = sorted(AvgPrice.items(), key = lambda x:x[1])
    sort_AvgPrice_list = dict(AvgPrice_list)

AvgPrice_list
  
    
    
Out[89]:
[(150000, 3754),
 (125000, 6165),
 (5000, 6455),
 (100000, 7932),
 (90000, 8465),
 (80000, 9652),
 (70000, 10927),
 (60000, 12109),
 (50000, 13527),
 (40000, 15305),
 (30000, 15672),
 (20000, 16635),
 (10000, 18946)]
In [90]:
odometer_km.value_counts()
Out[90]:
150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
70000      1230
60000      1164
50000      1027
5000        967
40000       819
30000       789
20000       784
10000       264
Name: odometer_km, dtype: int64

Price differences across odometer stratas


As a trend, (with an exception for 5,000 miles) prices tend to go up as odometer values decrease. What could be driving the cost for 5,000 km values is that these listings may be more recent and so bids are initially lower but one might expect those values to ultimately increase the longer the listing stays on the site.

In [105]:
odometer_500 = autos[autos["odometer_km"] == 500]
In [106]:
odomoter_500["ad_created"].unique()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-106-d87f6c8df839> in <module>
----> 1 odomoter_500["ad_created"].unique()

NameError: name 'odomoter_500' is not defined
In [96]:
odometer_500_dict = {}
odometer_500["ad_created_500_sum"] = 0
odometer_500["ad_created_500_sum"] = odometer_500.groupby('ad_created').cumcount() + 1

for rows in odometer_500:
    selected_rows = odometer_500[odometer_500["odometer_km"] == rows]
    sorted_rows = selected_rows.sort_values("ad_created_500_sum", ascending = True)
    
    top_row = sorted_rows.iloc[0]
    ad_list_dates = top_row["ad_created"]
    autos[rows] = ad_list_dates
    
odometer_500_dict
    
    
    
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-96-23b29866c556> in <module>
      7     sorted_rows = selected_rows.sort_values("ad_created_500_sum", ascending = True)
      8 
----> 9     top_row = sorted_rows.iloc[0]
     10     ad_list_dates = top_row["ad_created"]
     11     autos[rows] = ad_list_dates

/dataquest/system/env/python3/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
   1766 
   1767             maybe_callable = com.apply_if_callable(key, self.obj)
-> 1768             return self._getitem_axis(maybe_callable, axis=axis)
   1769 
   1770     def _is_scalar_access(self, key: Tuple):

/dataquest/system/env/python3/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   2136 
   2137             # validate the location
-> 2138             self._validate_integer(key, axis)
   2139 
   2140             return self._get_loc(key, axis=axis)

/dataquest/system/env/python3/lib/python3.8/site-packages/pandas/core/indexing.py in _validate_integer(self, key, axis)
   2061         len_axis = len(self.obj._get_axis(axis))
   2062         if key >= len_axis or key < -len_axis:
-> 2063             raise IndexError("single positional indexer is out-of-bounds")
   2064 
   2065     def _getitem_tuple(self, tup: Tuple):

IndexError: single positional indexer is out-of-bounds
In [ ]:
autos
In [ ]:
 
In [ ]:
autos['unrepaired_damage'].unique()
In [ ]:
 
In [ ]:
damage_cost = {}
Unrepaired = autos["unrepaired_damage"].value_counts(dropna = True).index

for u in Unrepaired: 
    selected_rows = autos[autos["unrepaired_damage"] == u]
    mean = selected_rows["price_$"].mean()
    
    damage_cost[u] = int(mean)
    
damage_cost 

In the above case, 'nein':

means there is no unrepaired damage. This makes sense (with my limited knowledge of German I do know 'nein' means no). The average price for cars with unrepaired damage is higher than the average cost for cars WITH unrepaired damage.

In [ ]: