Introduction

This is the first guided project in DataQuest. We'll be working with a dataset of used cars from a classified section of eBay. Data originally from Kaggle, but there have been a few modifications to the data to make it more useful to practice cleaning the data.

  • We sampled 50,000 data points from the full dataset.
  • Dirtied the dataset to resemble actual scraped data.

Purpose:

To clean the data and analyze the included car listings.

In [2]:
import pandas as pd
import numpy as np
In [3]:
autos = pd.read_csv('autos.csv',encoding='Latin-1')
In [4]:
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

As we can see from the .info method, we have a few columns with some null values. Columns being vehicleType, gearbox, model, fuelType, and nonRepairedDamage. We must inspect these columns to see how we can deal with these null values. Also, we might want to change the price column to a float dtype so it is easier to work with.

In [5]:
autos.head()
Out[5]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

From the .head method, we can see that there is a lot of data cleaning that we need to do.

  • Clean name column
  • Convert price column to int/float dtype
  • Inspect how many vehicleTypes. Clean accordingly
  • Correct misspellings in gearbox column
  • Inspect model
  • Convert odomoter dtype
  • Inspect notRepairedDamage column (seems like majority null)
  • Maybe we need to convert to datetime objects?
In [6]:
autos.columns
Out[6]:
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')
In [7]:
new_columns = ({'yearOfRegistration':'registration_year', 'monthOfRegistration':'registration_month',
                'notRepairedDamage':'unrepaired_damage','dateCreated':'ad_created', 'dateCrawled':'date_crawled',
               'offerType':'offer_type','vehicleType':"vehicle_type", 'powerPS':'power_ps','fuelType':'fuel_type',
               'nrOfPictures':'num_pictures','postalCode':'postal_code','lastSeen':'last_seen'}
              )

autos.rename(new_columns,axis=1,inplace=True)
autos.head()
Out[7]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created num_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

Camelcase to snakecase basically means changing the word to lower case and have underscores instead of spaces. So, I changed the all the columns to to snakecase just for it to be more readable. I'm not sure what date_crawled means so I just left it as is. One specific change I made was to nrOfPictures to num_pictures.

In [8]:
autos.describe(include='all')
Out[8]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created num_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-30 19:48:02 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

Price and odometer are stored as text, so we need the change these to numeric. We can say this because we see their descriptive statistics results in all null values

We can also see that columns seller, offer_type, abtest, gearbox, unrepaired_damage all have only 2 values. We may consider dropping these columns because they won't be useful in our analyses.

In [9]:
autos['price'] = autos['price'].str.replace('$','').str.replace(',','').astype(float)
In [10]:
autos['odometer'] = autos['odometer'].str.lower().str.replace('km','').str.replace(',','').astype(float)
autos.rename({'odometer':'odometer_km'}, axis=1,inplace=True)

Let's explore the price and odometer_km column

In [11]:
autos['price'].unique().shape 
# theres 2357 unique values
Out[11]:
(2357,)
In [12]:
print('Max: {}\nMin: {}'.format(autos['price'].max(), autos['price'].min()))
# What car costs 99,999,999 and 0? This isn't right
autos['price'].describe()
Max: 99999999.0
Min: 0.0
Out[12]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
In [13]:
(autos['price'] > 500000).value_counts()
# 14 cars that are more than $500,000
Out[13]:
False    49986
True        14
Name: price, dtype: int64
In [14]:
autos['price'].value_counts().sort_index(ascending=False).tail(200)
# Looking at how many occurrences of each price value in descending order. 
Out[14]:
760.0       8
755.0       1
750.0     433
749.0      23
745.0       2
         ... 
5.0         2
3.0         1
2.0         3
1.0       156
0.0      1421
Name: price, Length: 200, dtype: int64
In [15]:
price_q25, price_q75 = np.percentile(autos['price'].values, [25,75])
price_iqr = price_q75 - price_q25
price_cutoff = price_iqr * 2
price_upper, price_lower = price_q75+price_cutoff, price_q25-price_cutoff

price_outliers = [x for x in autos['price'] if x<price_lower or x>price_upper]

len(price_outliers)
Out[15]:
2644
In [16]:
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   date_crawled        50000 non-null  object 
 1   name                50000 non-null  object 
 2   seller              50000 non-null  object 
 3   offer_type          50000 non-null  object 
 4   price               50000 non-null  float64
 5   abtest              50000 non-null  object 
 6   vehicle_type        44905 non-null  object 
 7   registration_year   50000 non-null  int64  
 8   gearbox             47320 non-null  object 
 9   power_ps            50000 non-null  int64  
 10  model               47242 non-null  object 
 11  odometer_km         50000 non-null  float64
 12  registration_month  50000 non-null  int64  
 13  fuel_type           45518 non-null  object 
 14  brand               50000 non-null  object 
 15  unrepaired_damage   40171 non-null  object 
 16  ad_created          50000 non-null  object 
 17  num_pictures        50000 non-null  int64  
 18  postal_code         50000 non-null  int64  
 19  last_seen           50000 non-null  object 
dtypes: float64(2), int64(5), object(13)
memory usage: 7.6+ MB
In [17]:
new_autos = autos[autos['price'].between(price_lower,price_upper)]
new_autos.info()
# 2644 outliers removed
<class 'pandas.core.frame.DataFrame'>
Int64Index: 47356 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   date_crawled        47356 non-null  object 
 1   name                47356 non-null  object 
 2   seller              47356 non-null  object 
 3   offer_type          47356 non-null  object 
 4   price               47356 non-null  float64
 5   abtest              47356 non-null  object 
 6   vehicle_type        42311 non-null  object 
 7   registration_year   47356 non-null  int64  
 8   gearbox             44724 non-null  object 
 9   power_ps            47356 non-null  int64  
 10  model               44695 non-null  object 
 11  odometer_km         47356 non-null  float64
 12  registration_month  47356 non-null  int64  
 13  fuel_type           42950 non-null  object 
 14  brand               47356 non-null  object 
 15  unrepaired_damage   37668 non-null  object 
 16  ad_created          47356 non-null  object 
 17  num_pictures        47356 non-null  int64  
 18  postal_code         47356 non-null  int64  
 19  last_seen           47356 non-null  object 
dtypes: float64(2), int64(5), object(13)
memory usage: 7.6+ MB

To remove the outliers, I used the interquartile range method because it focuses solely on the data given. I used a multiplier of 2 for the cutoff because I noticed that some of the lower values were reasonable prices and I wanted to keep majority of them.

In [18]:
price_outliers.sort()
In [19]:
price_outliers
Out[19]:
[19444.0,
 19450.0,
 19450.0,
 19480.0,
 19490.0,
 19490.0,
 19490.0,
 19490.0,
 19499.0,
 19499.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19500.0,
 19550.0,
 19599.0,
 19600.0,
 19600.0,
 19600.0,
 19600.0,
 19600.0,
 19600.0,
 19650.0,
 19650.0,
 19650.0,
 19666.0,
 19690.0,
 19699.0,
 19700.0,
 19700.0,
 19700.0,
 19700.0,
 19750.0,
 19750.0,
 19750.0,
 19777.0,
 19780.0,
 19800.0,
 19800.0,
 19800.0,
 19800.0,
 19800.0,
 19800.0,
 19800.0,
 19800.0,
 19800.0,
 19800.0,
 19800.0,
 19800.0,
 19800.0,
 19800.0,
 19800.0,
 19800.0,
 19800.0,
 19800.0,
 19800.0,
 19800.0,
 19850.0,
 19850.0,
 19850.0,
 19850.0,
 19890.0,
 19890.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19900.0,
 19911.0,
 19950.0,
 19950.0,
 19950.0,
 19950.0,
 19950.0,
 19950.0,
 19950.0,
 19950.0,
 19950.0,
 19968.0,
 19970.0,
 19980.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19990.0,
 19995.0,
 19995.0,
 19995.0,
 19995.0,
 19998.0,
 19998.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 19999.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20000.0,
 20123.0,
 20150.0,
 20199.0,
 20200.0,
 20200.0,
 20200.0,
 20200.0,
 20200.0,
 20222.0,
 20250.0,
 20290.0,
 20299.0,
 20300.0,
 20390.0,
 20390.0,
 20399.0,
 20400.0,
 20400.0,
 20400.0,
 20400.0,
 20450.0,
 20480.0,
 20490.0,
 20490.0,
 20490.0,
 20490.0,
 20490.0,
 20490.0,
 20495.0,
 20495.0,
 20499.0,
 20499.0,
 20499.0,
 20499.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20500.0,
 20550.0,
 20550.0,
 20550.0,
 20589.0,
 20590.0,
 20595.0,
 20650.0,
 20650.0,
 20650.0,
 20700.0,
 20700.0,
 20750.0,
 20750.0,
 20750.0,
 20750.0,
 20790.0,
 20799.0,
 20800.0,
 20800.0,
 20800.0,
 20800.0,
 20800.0,
 20800.0,
 20800.0,
 20800.0,
 20800.0,
 20800.0,
 20800.0,
 20850.0,
 20850.0,
 20850.0,
 20850.0,
 20888.0,
 20890.0,
 20890.0,
 20890.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20900.0,
 20911.0,
 20950.0,
 20950.0,
 20950.0,
 20950.0,
 20950.0,
 20950.0,
 20980.0,
 20985.0,
 20987.0,
 20990.0,
 20990.0,
 20990.0,
 20990.0,
 20990.0,
 20990.0,
 20990.0,
 20990.0,
 20990.0,
 20990.0,
 20990.0,
 20990.0,
 20990.0,
 20990.0,
 20999.0,
 20999.0,
 20999.0,
 20999.0,
 20999.0,
 20999.0,
 20999.0,
 20999.0,
 20999.0,
 20999.0,
 20999.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21000.0,
 21049.0,
 21111.0,
 21200.0,
 21200.0,
 21200.0,
 21200.0,
 21250.0,
 21250.0,
 21275.0,
 21299.0,
 21300.0,
 21300.0,
 21300.0,
 21300.0,
 21300.0,
 21400.0,
 21400.0,
 21400.0,
 21400.0,
 21400.0,
 21400.0,
 21400.0,
 21400.0,
 21444.0,
 21450.0,
 21450.0,
 21490.0,
 21490.0,
 21490.0,
 21490.0,
 21490.0,
 21490.0,
 21490.0,
 21499.0,
 21499.0,
 21499.0,
 21499.0,
 21499.0,
 21499.0,
 21499.0,
 21499.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21500.0,
 21590.0,
 21590.0,
 21599.0,
 21600.0,
 21600.0,
 21600.0,
 21690.0,
 21700.0,
 21750.0,
 21750.0,
 21750.0,
 21780.0,
 21780.0,
 21790.0,
 21800.0,
 21800.0,
 21800.0,
 21800.0,
 21850.0,
 21850.0,
 21850.0,
 21888.0,
 21890.0,
 21890.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21900.0,
 21950.0,
 21950.0,
 21950.0,
 21950.0,
 21950.0,
 21950.0,
 21959.0,
 21980.0,
 21980.0,
 21990.0,
 21990.0,
 21990.0,
 21990.0,
 21990.0,
 21990.0,
 21990.0,
 21990.0,
 21990.0,
 21990.0,
 21990.0,
 21990.0,
 21990.0,
 21990.0,
 21990.0,
 21995.0,
 21999.0,
 21999.0,
 21999.0,
 21999.0,
 21999.0,
 21999.0,
 21999.0,
 21999.0,
 21999.0,
 21999.0,
 21999.0,
 21999.0,
 21999.0,
 21999.0,
 21999.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22000.0,
 22199.0,
 22200.0,
 22222.0,
 22222.0,
 22250.0,
 22250.0,
 22250.0,
 22280.0,
 22290.0,
 22299.0,
 22300.0,
 22300.0,
 22300.0,
 22300.0,
 22300.0,
 22300.0,
 22399.0,
 22400.0,
 22400.0,
 22400.0,
 22400.0,
 22400.0,
 22400.0,
 22400.0,
 22400.0,
 22450.0,
 22450.0,
 22450.0,
 22490.0,
 22490.0,
 22490.0,
 22490.0,
 22490.0,
 22490.0,
 22499.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22500.0,
 22555.0,
 22580.0,
 22600.0,
 22600.0,
 22600.0,
 22650.0,
 22650.0,
 22690.0,
 22700.0,
 22700.0,
 22750.0,
 22750.0,
 22790.0,
 22799.0,
 22800.0,
 22800.0,
 22800.0,
 22800.0,
 22800.0,
 22800.0,
 22800.0,
 22800.0,
 22800.0,
 22800.0,
 22850.0,
 22850.0,
 22850.0,
 22880.0,
 22890.0,
 22899.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22900.0,
 22950.0,
 22950.0,
 22950.0,
 22950.0,
 22950.0,
 22950.0,
 22950.0,
 22950.0,
 22980.0,
 22989.0,
 22990.0,
 22990.0,
 22990.0,
 22990.0,
 22990.0,
 22996.0,
 22999.0,
 22999.0,
 22999.0,
 22999.0,
 22999.0,
 22999.0,
 22999.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23000.0,
 23123.0,
 23200.0,
 23200.0,
 23200.0,
 23200.0,
 23233.0,
 23250.0,
 23250.0,
 23250.0,
 23300.0,
 23333.0,
 23333.0,
 23350.0,
 23400.0,
 23400.0,
 23400.0,
 23400.0,
 23450.0,
 23490.0,
 23490.0,
 23490.0,
 23490.0,
 23490.0,
 23499.0,
 23499.0,
 23499.0,
 23499.0,
 23499.0,
 23499.0,
 23499.0,
 23499.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23500.0,
 23540.0,
 23590.0,
 23600.0,
 23600.0,
 23600.0,
 23600.0,
 23650.0,
 23700.0,
 23700.0,
 23700.0,
 23700.0,
 23750.0,
 23750.0,
 23790.0,
 23799.0,
 23800.0,
 23800.0,
 ...]

Let's change columns to quantifiable dates

  • date_crawled, ad_created, last_seen are string values
  • registration_year, registration_month are numeric values
In [20]:
new_autos.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 47356 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   date_crawled        47356 non-null  object 
 1   name                47356 non-null  object 
 2   seller              47356 non-null  object 
 3   offer_type          47356 non-null  object 
 4   price               47356 non-null  float64
 5   abtest              47356 non-null  object 
 6   vehicle_type        42311 non-null  object 
 7   registration_year   47356 non-null  int64  
 8   gearbox             44724 non-null  object 
 9   power_ps            47356 non-null  int64  
 10  model               44695 non-null  object 
 11  odometer_km         47356 non-null  float64
 12  registration_month  47356 non-null  int64  
 13  fuel_type           42950 non-null  object 
 14  brand               47356 non-null  object 
 15  unrepaired_damage   37668 non-null  object 
 16  ad_created          47356 non-null  object 
 17  num_pictures        47356 non-null  int64  
 18  postal_code         47356 non-null  int64  
 19  last_seen           47356 non-null  object 
dtypes: float64(2), int64(5), object(13)
memory usage: 7.6+ MB
In [21]:
autos[['date_crawled','ad_created','last_seen']].head()
Out[21]:
date_crawled ad_created last_seen
0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54
1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08
2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37
3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28
4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50
In [22]:
new_autos['date_crawled'].str[:10].value_counts(normalize=True,dropna=False).sort_index()
# normalize make it a percentage instead of the actual count
# we want to keep the null values
# earliest to latest
Out[22]:
2016-03-05    0.025551
2016-03-06    0.013874
2016-03-07    0.036088
2016-03-08    0.033533
2016-03-09    0.033216
2016-03-10    0.032308
2016-03-11    0.032140
2016-03-12    0.037313
2016-03-13    0.015563
2016-03-14    0.037081
2016-03-15    0.033702
2016-03-16    0.029774
2016-03-17    0.031633
2016-03-18    0.012966
2016-03-19    0.034674
2016-03-20    0.037883
2016-03-21    0.037398
2016-03-22    0.032773
2016-03-23    0.032435
2016-03-24    0.029162
2016-03-25    0.031844
2016-03-26    0.032900
2016-03-27    0.030640
2016-03-28    0.034737
2016-03-29    0.034125
2016-03-30    0.033660
2016-03-31    0.031781
2016-04-01    0.033259
2016-04-02    0.035180
2016-04-03    0.038622
2016-04-04    0.036616
2016-04-05    0.012987
2016-04-06    0.003167
2016-04-07    0.001415
Name: date_crawled, dtype: float64

With the date_crawled column and its distribution, we can see that most occurrences were between the middle of March and very early April.

In [23]:
new_autos['ad_created'].str[:10].value_counts(normalize=True,dropna=False).sort_index()
Out[23]:
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
                ...   
2016-04-03    0.038876
2016-04-04    0.036975
2016-04-05    0.011699
2016-04-06    0.003231
2016-04-07    0.001288
Name: ad_created, Length: 74, dtype: float64

Similarly to the date_crawled distribution, we can see that most of the ads were created between mid March and very early April. One thing to note is that some ads were created in 2015 and the date_crawled column did not have any. This may be something we would want to explore later.

In [24]:
new_autos['last_seen'].str[:10].value_counts(normalize=True,dropna=False).sort_index()
Out[24]:
2016-03-05    0.001140
2016-03-06    0.004603
2016-03-07    0.005511
2016-03-08    0.007919
2016-03-09    0.010178
2016-03-10    0.011086
2016-03-11    0.013029
2016-03-12    0.024601
2016-03-13    0.009291
2016-03-14    0.012881
2016-03-15    0.016154
2016-03-16    0.016872
2016-03-17    0.028529
2016-03-18    0.007560
2016-03-19    0.016175
2016-03-20    0.021180
2016-03-21    0.021053
2016-03-22    0.022067
2016-03-23    0.019026
2016-03-24    0.019913
2016-03-25    0.019744
2016-03-26    0.017231
2016-03-27    0.016408
2016-03-28    0.021349
2016-03-29    0.022954
2016-03-30    0.025298
2016-03-31    0.024200
2016-04-01    0.023418
2016-04-02    0.025298
2016-04-03    0.025572
2016-04-04    0.025129
2016-04-05    0.121590
2016-04-06    0.216171
2016-04-07    0.126869
Name: last_seen, dtype: float64

The distribution of the last_seen column is as expected because you tend to see what is new rather than what is old in the public. So most of the occurrences lie within early April.

In [25]:
new_autos['registration_year'].describe()
Out[25]:
count    47356.000000
mean      2004.620724
std        101.921674
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

Using the describe method, we can see that there are definitely some unusual values, such as the max being 9999 and the min being 1000. Given the column tells us the year of the car was first released, these values (and others) may be removed.

Let's explore the registration_year column

In [26]:
(new_autos[new_autos['registration_year']<1900])

# 6 cars that have a registration_year under 1900
Out[26]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created num_pictures postal_code last_seen
10556 2016-04-01 06:02:10 UNFAL_Auto privat Angebot 450.0 control NaN 1800 NaN 1800 NaN 5000.0 2 NaN mitsubishi nein 2016-04-01 00:00:00 0 63322 2016-04-01 09:42:30
22316 2016-03-29 16:56:41 VW_Kaefer.__Zwei_zum_Preis_von_einem. privat Angebot 1500.0 control NaN 1000 manuell 0 kaefer 5000.0 0 benzin volkswagen NaN 2016-03-29 00:00:00 0 48324 2016-03-31 10:15:28
24511 2016-03-17 19:45:11 Trabant__wartburg__Ostalgie privat Angebot 490.0 control NaN 1111 NaN 0 NaN 5000.0 0 NaN trabant NaN 2016-03-17 00:00:00 0 16818 2016-04-07 07:17:29
32585 2016-04-02 16:56:39 UNFAL_Auto privat Angebot 450.0 control NaN 1800 NaN 1800 NaN 5000.0 2 NaN mitsubishi nein 2016-04-02 00:00:00 0 63322 2016-04-04 14:46:21
35238 2016-03-26 13:45:20 Suche_Skoda_Fabia____Skoda_Fabia_Combi_mit_Klima privat Angebot 0.0 control NaN 1500 NaN 0 NaN 5000.0 0 benzin skoda NaN 2016-03-26 00:00:00 0 15517 2016-04-04 00:16:54
49283 2016-03-15 18:38:53 Citroen_HY privat Angebot 7750.0 control NaN 1001 NaN 0 andere 5000.0 0 NaN citroen NaN 2016-03-15 00:00:00 0 66706 2016-04-06 18:47:20
In [27]:
(new_autos['registration_year']>2018).value_counts()
Out[27]:
False    47337
True        19
Name: registration_year, dtype: int64

Looking at these specific years, I would say keeping the values between 1900 and 2018 is the most ideal. I would say its common to see car listings for cars in the future (1 or 2 years) so I decided to keep a few rows.

In [28]:
new_autos = new_autos[new_autos['registration_year'].between(1900,2018)]
In [29]:
new_autos['registration_year'].value_counts(normalize=True).head(20)
Out[29]:
2000    0.070651
1999    0.063193
2005    0.063003
2004    0.057277
2003    0.057256
2001    0.056770
2006    0.056179
2002    0.053221
1998    0.051594
2007    0.046883
2008    0.044474
1997    0.042720
2009    0.041051
2017    0.030466
1996    0.030361
2010    0.028713
1995    0.027572
2011    0.027276
2016    0.027086
2012    0.020367
Name: registration_year, dtype: float64

The distribution of the remaining values of the registration year seems to be slightly skewed in the early 2000's. Then we see some 2010's and 1990s.

Let's explore the brand column through aggregation

In [30]:
new_autos['brand'].unique()
Out[30]:
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler',
       'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos',
       'opel', 'mazda', 'mini', 'toyota', 'dacia', 'nissan', 'jeep',
       'saab', 'volvo', 'mitsubishi', 'fiat', 'skoda', 'subaru', 'kia',
       'citroen', 'porsche', 'hyundai', 'chevrolet', 'honda', 'daewoo',
       'suzuki', 'trabant', 'land_rover', 'jaguar', 'alfa_romeo', 'lada',
       'rover', 'daihatsu', 'lancia'], dtype=object)
In [55]:
selected_brands = new_autos['brand'].value_counts(normalize=True).head(10).index
new_autos['brand'].value_counts(normalize=True).head(10)
# Lets explore the top 10 based on percentages
Out[55]:
volkswagen       0.216771
opel             0.114407
bmw              0.104287
mercedes_benz    0.088948
audi             0.079441
ford             0.071982
renault          0.050601
peugeot          0.030720
fiat             0.027530
seat             0.019459
Name: brand, dtype: float64

I have chosen to aggregate the top 10 brands based on their respective percentage in the dataset. We see mostly German car manufacturers so I think it would be interesting to find any insights within their listings.

In [32]:
brand_dict = {}

for x in selected_brands:
    rows = new_autos[new_autos['brand'] == x]
    mean = rows['price'].mean()
    brand_dict[x]=mean
    
sorted_dict = dict(sorted(brand_dict.items(),key=lambda x:x[1],reverse=True)) 
# the key parameter allows us to sort by values instead of keys
# reverse makes it in descending order
sorted_dict
Out[32]:
{'audi': 6090.976595744681,
 'bmw': 5944.947325769854,
 'mercedes_benz': 5710.772446555819,
 'volkswagen': 4309.721637426901,
 'seat': 3847.6916395222584,
 'ford': 3098.1209275022015,
 'peugeot': 2981.661623108666,
 'opel': 2685.0227146814404,
 'fiat': 2641.108979278588,
 'renault': 2245.678496868476}

From the selected brands, we see that the luxury brands have the highest mean price, as expected. The less known brands are much cheaper (to my knowledge). The means range from (2245,6090).

  • It's difficult to compare more than two aggregate series objects if we want to extend to more columns
  • We can't compare more than a few rows from each series object
  • We can only sort by the index (brand name) of both series objects so we can easily make visual comparisons

So let's make a single dataframe with mean mileage and mean price.

In [33]:
price_ser = pd.Series(data=sorted_dict,index=sorted_dict.keys())

price_df = pd.DataFrame(price_ser,columns=['mean_price'])
In [34]:
list_top6_brand = list(sorted_dict.keys())

mile_dict = {}

for x in list_top6_brand:
    selected = new_autos[new_autos['brand']==x]
    mean = selected['odometer_km'].mean()
    mile_dict[x] = mean
    
mile_dict
Out[34]:
{'audi': 137809.8404255319,
 'bmw': 137336.91247974068,
 'mercedes_benz': 136861.04513064132,
 'volkswagen': 131788.9863547758,
 'seat': 123979.37024972856,
 'ford': 125702.96448488407,
 'peugeot': 127479.36726272352,
 'opel': 130046.16805170821,
 'fiat': 117237.14504988489,
 'renault': 128540.70981210856}
In [35]:
price_df['mean_mileage'] = mile_dict.values()
In [36]:
price_df.round(2)
Out[36]:
mean_price mean_mileage
audi 6090.98 137809.84
bmw 5944.95 137336.91
mercedes_benz 5710.77 136861.05
volkswagen 4309.72 131788.99
seat 3847.69 123979.37
ford 3098.12 125702.96
peugeot 2981.66 127479.37
opel 2685.02 130046.17
fiat 2641.11 117237.15
renault 2245.68 128540.71

With both the mean price and mean mileage, we don't see any correlation between the 2.

In [37]:
new_autos
Out[37]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created num_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000.0 control bus 2004 manuell 158 andere 150000.0 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500.0 control limousine 1997 automatik 286 7er 150000.0 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot 8990.0 test limousine 2009 manuell 102 golf 70000.0 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot 4350.0 control kleinwagen 2007 automatik 71 fortwo 70000.0 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot 1350.0 test kombi 2003 manuell 0 focus 150000.0 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49993 2016-03-15 18:47:35 Audi_A3__1_8l__Silber;_schoenes_Fahrzeug privat Angebot 1650.0 control kleinwagen 1997 manuell 0 NaN 150000.0 7 benzin audi NaN 2016-03-15 00:00:00 0 65203 2016-04-06 19:46:53
49994 2016-03-22 17:36:42 Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... privat Angebot 5000.0 control kombi 2001 automatik 299 a6 150000.0 1 benzin audi nein 2016-03-22 00:00:00 0 46537 2016-04-06 08:16:39
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot 1980.0 control cabrio 1996 manuell 75 astra 150000.0 5 benzin opel nein 2016-03-28 00:00:00 0 44807 2016-04-02 14:18:02
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot 13200.0 test cabrio 2014 automatik 69 500 5000.0 11 benzin fiat nein 2016-04-02 00:00:00 0 73430 2016-04-04 11:47:27
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V privat Angebot 1250.0 control limousine 1996 manuell 101 vectra 150000.0 1 benzin opel nein 2016-03-13 00:00:00 0 45897 2016-04-06 21:18:48

47331 rows × 20 columns

Now, let's clean the data to make it more presentable.

  • name: Replace underscores with spaces
  • seller: correct spellings
  • offer_type: translate
  • vehicle_type: translate
  • gearbox: correct spellings
  • fuel_type: translate
  • unrepaired_damage: translate
  • date_crawled, last_seen: change to numeric values instead of date
In [38]:
new_autos['name'].str.replace('_',' ')
Out[38]:
0                         Peugeot 807 160 NAVTECH ON BOARD
1               BMW 740i 4 4 Liter HAMANN UMBAU Mega Optik
2                               Volkswagen Golf 1.6 United
3        Smart smart fortwo coupe softouch/F1/Klima/Pan...
4        Ford Focus 1 6 Benzin TÜV neu ist sehr gepfleg...
                               ...                        
49993             Audi A3  1 8l  Silber; schoenes Fahrzeug
49994    Audi A6  S6  Avant 4.2 quattro eventuell Tausc...
49996    Opel Astra F Cabrio Bertone Edition   TÜV neu+...
49997                       Fiat 500 C 1.2 Dualogic Lounge
49999                                  Opel Vectra 1.6 16V
Name: name, Length: 47331, dtype: object
In [39]:
new_autos['seller'].unique() #privat -> private, gewerblich -> commercial
Out[39]:
array(['privat', 'gewerblich'], dtype=object)
In [40]:
new_autos['seller'] = new_autos['seller'].str.replace('privat','private')
new_autos['seller'] = new_autos['seller'].str.replace('gewerblich','commercial')
In [41]:
new_autos['offer_type'].unique()
# angebot = offer, gesuch = request
Out[41]:
array(['Angebot', 'Gesuch'], dtype=object)
In [42]:
new_autos['offer_type'] = new_autos['offer_type'].str.replace('Angebot','offer').str.replace('Gesuch','request')
In [43]:
new_autos['vehicle_type'].unique()
Out[43]:
array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv',
       'cabrio', 'andere'], dtype=object)
In [44]:
#kleinwagen - supermini
#kombi - volkswagen type 2
#cabrio - convertible
#andere - other
new_autos['vehicle_type'].value_counts(dropna=False)
Out[44]:
limousine     12175
kleinwagen    10790
kombi          8744
NaN            5020
bus            3866
cabrio         2640
coupe          2127
suv            1556
andere          413
Name: vehicle_type, dtype: int64
In [45]:
new_autos.loc[new_autos['vehicle_type']=='kleinwagen','vehicle_type'] = 'supermini'
new_autos.loc[new_autos['vehicle_type']=='kombi','vehicle_type'] = 'volkswagen type 2'
new_autos.loc[new_autos['vehicle_type']=='cabrio','vehicle_type'] = 'convertible'
new_autos.loc[new_autos['vehicle_type']=='andere','vehicle_type'] = 'other'
In [46]:
new_autos['gearbox'].unique()
Out[46]:
array(['manuell', 'automatik', nan], dtype=object)
In [47]:
new_autos.loc[new_autos['gearbox']=='manuell','gearbox'] = 'manual'
new_autos.loc[new_autos['gearbox']=='automatik','gearbox'] = 'auto'
In [48]:
new_autos['fuel_type'].unique()
Out[48]:
array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro',
       'andere'], dtype=object)
In [49]:
#benzin = petrol
#lpg = liquefied petroleum gas
#cng = compressed natural gas
#elektro = electric
#andere = other

new_autos.loc[new_autos['fuel_type']=='benzin','fuel_type'] = 'petrol'
new_autos.loc[new_autos['fuel_type']=='elektro','fuel_type'] = 'electric'
new_autos.loc[new_autos['fuel_type']=='andere','fuel_type'] = 'other'
In [50]:
new_autos['unrepaired_damage'].unique()
#nein = no
#ja = yes
Out[50]:
array(['nein', nan, 'ja'], dtype=object)
In [51]:
new_autos.loc[new_autos['unrepaired_damage']=='nein','unrepaired_damage'] = 'no'
new_autos.loc[new_autos['unrepaired_damage']=='ja','unrepaired_damage'] = 'yes'
In [52]:
new_autos['date'] = new_autos['date_crawled'].str.split(' ',expand=True)[0].str.replace('-','').astype(int)
In [53]:
new_autos
Out[53]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps ... odometer_km registration_month fuel_type brand unrepaired_damage ad_created num_pictures postal_code last_seen date
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD private offer 5000.0 control bus 2004 manual 158 ... 150000.0 3 lpg peugeot no 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54 20160326
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik private offer 8500.0 control limousine 1997 auto 286 ... 150000.0 6 petrol bmw no 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08 20160404
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United private offer 8990.0 test limousine 2009 manual 102 ... 70000.0 7 petrol volkswagen no 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37 20160326
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... private offer 4350.0 control supermini 2007 auto 71 ... 70000.0 6 petrol smart no 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28 20160312
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... private offer 1350.0 test volkswagen type 2 2003 manual 0 ... 150000.0 7 petrol ford no 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50 20160401
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49993 2016-03-15 18:47:35 Audi_A3__1_8l__Silber;_schoenes_Fahrzeug private offer 1650.0 control supermini 1997 manual 0 ... 150000.0 7 petrol audi NaN 2016-03-15 00:00:00 0 65203 2016-04-06 19:46:53 20160315
49994 2016-03-22 17:36:42 Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... private offer 5000.0 control volkswagen type 2 2001 auto 299 ... 150000.0 1 petrol audi no 2016-03-22 00:00:00 0 46537 2016-04-06 08:16:39 20160322
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... private offer 1980.0 control convertible 1996 manual 75 ... 150000.0 5 petrol opel no 2016-03-28 00:00:00 0 44807 2016-04-02 14:18:02 20160328
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge private offer 13200.0 test convertible 2014 auto 69 ... 5000.0 11 petrol fiat no 2016-04-02 00:00:00 0 73430 2016-04-04 11:47:27 20160402
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V private offer 1250.0 control limousine 1996 manual 101 ... 150000.0 1 petrol opel no 2016-03-13 00:00:00 0 45897 2016-04-06 21:18:48 20160314

47331 rows × 21 columns

Conclusion

Findings:

  • Top brands by listings:
    • volkswagen (21.67%)
    • opel (11.44%)
    • bmw (10.43%)
    • mercedes (8.89%)
    • audi (7.94%)
    • ford (7.2%)
    • renault (5.06%)
    • peugeot (3.07%)
    • fiat (2.75%)
    • seat(1.95%)
  • German car brands make up for more than 50% of the listings.
  • There is no apparent correlation between price and mileage. Further analysis would have to be done to determine whether or not it is true.

Suggestions:

  • Find out the interactions with the listings. (Impressions, views, etc.)
  • Supply for each car model in respect to their demands.
  • Somehow simplify the name of model
In [ ]: