This is the first guided project in DataQuest. We'll be working with a dataset of used cars from the classifieds section of eBay. The data originally comes from Kaggle, but it has been modified slightly to make it more useful for practicing data cleaning.
Our goal is to clean the data and analyze the included car listings.
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding='Latin-1')  # the file isn't encoded as the default UTF-8
autos.info()
As we can see from the .info() method, a few columns contain null values: vehicleType, gearbox, model, fuelType, and notRepairedDamage. We'll need to inspect these columns to decide how to deal with the nulls. We might also want to convert the price column to a float dtype so it is easier to work with.
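To quantify the missing data before deciding how to handle it, a quick check like this (a minimal sketch using the same autos DataFrame) counts the nulls per column:

null_counts = autos.isnull().sum()
# Show only the columns that actually contain nulls
print(null_counts[null_counts > 0])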
autos.head()
From the .head() method, we can see that there is a lot of data cleaning to do.
autos.columns
new_columns = {'yearOfRegistration': 'registration_year', 'monthOfRegistration': 'registration_month',
               'notRepairedDamage': 'unrepaired_damage', 'dateCreated': 'ad_created', 'dateCrawled': 'date_crawled',
               'offerType': 'offer_type', 'vehicleType': 'vehicle_type', 'powerPS': 'power_ps', 'fuelType': 'fuel_type',
               'nrOfPictures': 'num_pictures', 'postalCode': 'postal_code', 'lastSeen': 'last_seen'}
autos.rename(new_columns,axis=1,inplace=True)
autos.head()
CamelCase to snake_case means lowercasing the words and separating them with underscores. I changed all the columns to snake_case to make them more readable. I'm not sure exactly what date_crawled means, so I left that name as is. One specific change I made was nrOfPictures to num_pictures.
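The renaming could also be automated; a minimal sketch of a camelCase-to-snake_case converter (the regex is my own choice, and it wouldn't reproduce manual renames like num_pictures) might look like this:

import re

def camel_to_snake(name):
    # Insert an underscore before each uppercase letter, then lowercase the result
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

camel_to_snake('yearOfRegistration')  # 'year_of_registration'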
autos.describe(include='all')
Price and odometer are stored as text, so we need to change them to numeric; we can tell because their descriptive statistics come back as all null values.
We can also see that the seller, offer_type, abtest, gearbox, and unrepaired_damage columns each have only two distinct values. We may consider dropping these columns because they won't be useful in our analysis.
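As a sketch of that idea (nothing is actually dropped yet), the near-constant columns could be found and removed like this:

# Columns with at most two distinct non-null values carry little signal
low_variety = [col for col in autos.columns if autos[col].nunique(dropna=True) <= 2]
print(low_variety)
# autos = autos.drop(columns=low_variety)  # uncomment to actually drop them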
# Strip the currency and thousands symbols; regex=False treats '$' as a literal character
autos['price'] = autos['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
autos['odometer'] = autos['odometer'].str.lower().str.replace('km', '', regex=False).str.replace(',', '', regex=False).astype(float)
autos.rename({'odometer':'odometer_km'}, axis=1,inplace=True)
autos['price'].unique().shape
# there are 2,357 unique price values
print('Max: {}\nMin: {}'.format(autos['price'].max(), autos['price'].min()))
# What car costs 99,999,999 and 0? This isn't right
autos['price'].describe()
(autos['price'] > 500000).value_counts()
# 14 cars that are more than $500,000
autos['price'].value_counts().sort_index(ascending=False).tail(200)
# Looking at how many occurrences of each price value in descending order.
# Interquartile range (IQR) of price
price_q25, price_q75 = np.percentile(autos['price'].values, [25, 75])
price_iqr = price_q75 - price_q25
# Use 2 * IQR (instead of the usual 1.5) as the cutoff distance from each quartile
price_cutoff = price_iqr * 2
price_upper, price_lower = price_q75 + price_cutoff, price_q25 - price_cutoff
price_outliers = [x for x in autos['price'] if x < price_lower or x > price_upper]
len(price_outliers)
autos.info()
# .copy() so later assignments modify new_autos itself rather than a view of autos
new_autos = autos[autos['price'].between(price_lower, price_upper)].copy()
new_autos.info()
# 2644 outliers removed
To remove the outliers, I used the interquartile range method because it relies solely on the data itself. I used a multiplier of 2 for the cutoff because some of the lower flagged values were reasonable prices and I wanted to keep the majority of them.
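The conventional cutoff is 1.5 * IQR; a quick sketch reusing the quartiles computed above shows how many rows each multiplier would keep:

for k in (1.5, 2):
    lower, upper = price_q25 - k * price_iqr, price_q75 + k * price_iqr
    kept = autos['price'].between(lower, upper).sum()
    print('multiplier {}: keeps {} of {} rows'.format(k, kept, len(autos)))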
price_outliers.sort()
price_outliers
new_autos.info()
autos[['date_crawled','ad_created','last_seen']].head()
new_autos['date_crawled'].str[:10].value_counts(normalize=True,dropna=False).sort_index()
# normalize=True gives each date's share of the total instead of a raw count
# dropna=False keeps any null values in the counts
# sort_index() orders the dates from earliest to latest
Looking at the distribution of the date_crawled column, we can see that most crawl dates fall between mid-March and very early April.
new_autos['ad_created'].str[:10].value_counts(normalize=True,dropna=False).sort_index()
Similar to the date_crawled distribution, most of the ads were created between mid-March and very early April. One thing to note is that some ads were created back in 2015, while date_crawled has no 2015 dates at all. This may be something we want to explore later.
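As a first look at that discrepancy (a minimal sketch, not a full analysis), we can pull out the ads whose creation year predates the crawl dates:

# Ads created in 2015, earlier than any date_crawled value we observed
old_ads = new_autos[new_autos['ad_created'].str[:4] == '2015']
print(len(old_ads))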
new_autos['last_seen'].str[:10].value_counts(normalize=True,dropna=False).sort_index()
The distribution of the last_seen column is as expected: newer listings are seen more than older ones, so most of the occurrences lie in early April.
new_autos['registration_year'].describe()
Using the describe method, we can see some clearly unusual values, such as a max of 9999 and a min of 1000. Since this column tells us the year the car was first registered, these values (and others like them) should be removed.
new_autos[new_autos['registration_year'] < 1900]
# 6 cars that have a registration_year under 1900
(new_autos['registration_year']>2018).value_counts()
Looking at these specific years, keeping values between 1900 and 2018 seems most reasonable. It's common to see listings for cars registered a year or two into the future (e.g. next year's models), so I decided to keep a few of those rows.
new_autos = new_autos[new_autos['registration_year'].between(1900,2018)]
new_autos['registration_year'].value_counts(normalize=True).head(20)
The distribution of the remaining registration years is concentrated in the early 2000s, with smaller shares in the 2010s and 1990s.
new_autos['brand'].unique()
selected_brands = new_autos['brand'].value_counts(normalize=True).head(10).index
new_autos['brand'].value_counts(normalize=True).head(10)
# Let's explore the top 10 brands based on their percentages
I have chosen to aggregate the top 10 brands based on their respective share of the dataset. Most of them are German manufacturers, so I think it would be interesting to look for insights within their listings.
brand_dict = {}
# Compute the mean listing price for each of the top 10 brands
for x in selected_brands:
    rows = new_autos[new_autos['brand'] == x]
    mean = rows['price'].mean()
    brand_dict[x] = mean
sorted_dict = dict(sorted(brand_dict.items(),key=lambda x:x[1],reverse=True))
# the key parameter allows us to sort by values instead of keys
# reverse makes it in descending order
sorted_dict
Among the selected brands, the luxury brands have the highest mean prices, as expected, while the lesser-known brands are much cheaper (to my knowledge). The means range from about 2,245 to 6,090.
So let's make a single dataframe with mean mileage and mean price.
price_ser = pd.Series(sorted_dict)  # the index defaults to the dict keys
price_df = pd.DataFrame(price_ser, columns=['mean_price'])
top_brands = list(sorted_dict.keys())  # the same 10 brands, ordered by mean price
mile_dict = {}
# Compute the mean odometer reading for each brand
for x in top_brands:
    selected = new_autos[new_autos['brand'] == x]
    mean = selected['odometer_km'].mean()
    mile_dict[x] = mean
mile_dict
# Align mean mileage to price_df by brand (index) rather than relying on dict order
price_df['mean_mileage'] = pd.Series(mile_dict)
price_df.round(2)
With both the mean price and mean mileage in one table, we don't see any obvious correlation between the two.
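To put a number on that impression (a quick check, and with only ten brands it is indicative at best), the Pearson correlation between the two columns can be computed directly:

# Pearson correlation across the top 10 brands; values near 0 suggest no linear relationship
print(price_df['mean_price'].corr(price_df['mean_mileage']))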
new_autos
# Make the name column readable; assign the result back (str.replace doesn't modify in place)
new_autos['name'] = new_autos['name'].str.replace('_', ' ', regex=False)
new_autos['seller'].unique() #privat -> private, gewerblich -> commercial
new_autos['seller'] = new_autos['seller'].str.replace('privat','private')
new_autos['seller'] = new_autos['seller'].str.replace('gewerblich','commercial')
new_autos['offer_type'].unique()
# angebot = offer, gesuch = request
new_autos['offer_type'] = new_autos['offer_type'].str.replace('Angebot','offer').str.replace('Gesuch','request')
new_autos['vehicle_type'].unique()
#kleinwagen - supermini (small car)
#kombi - station wagon (estate)
#cabrio - convertible
#andere - other
new_autos['vehicle_type'].value_counts(dropna=False)
new_autos.loc[new_autos['vehicle_type']=='kleinwagen','vehicle_type'] = 'supermini'
new_autos.loc[new_autos['vehicle_type']=='kombi','vehicle_type'] = 'station wagon'
new_autos.loc[new_autos['vehicle_type']=='cabrio','vehicle_type'] = 'convertible'
new_autos.loc[new_autos['vehicle_type']=='andere','vehicle_type'] = 'other'
new_autos['gearbox'].unique()
new_autos.loc[new_autos['gearbox']=='manuell','gearbox'] = 'manual'
new_autos.loc[new_autos['gearbox']=='automatik','gearbox'] = 'auto'
new_autos['fuel_type'].unique()
#benzin = petrol
#lpg = liquefied petroleum gas
#cng = compressed natural gas
#elektro = electric
#andere = other
new_autos.loc[new_autos['fuel_type']=='benzin','fuel_type'] = 'petrol'
new_autos.loc[new_autos['fuel_type']=='elektro','fuel_type'] = 'electric'
new_autos.loc[new_autos['fuel_type']=='andere','fuel_type'] = 'other'
new_autos['unrepaired_damage'].unique()
#nein = no
#ja = yes
new_autos.loc[new_autos['unrepaired_damage']=='nein','unrepaired_damage'] = 'no'
new_autos.loc[new_autos['unrepaired_damage']=='ja','unrepaired_damage'] = 'yes'
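All of the word-by-word translations above follow the same pattern, so they could equally have been written once with a mapping dictionary and Series.replace; a sketch of that alternative (covering the same values translated above):

translations = {
    'privat': 'private', 'gewerblich': 'commercial',
    'Angebot': 'offer', 'Gesuch': 'request',
    'kleinwagen': 'supermini', 'kombi': 'station wagon',
    'cabrio': 'convertible', 'andere': 'other',
    'manuell': 'manual', 'automatik': 'auto',
    'benzin': 'petrol', 'elektro': 'electric',
    'nein': 'no', 'ja': 'yes',
}
for col in ['seller', 'offer_type', 'vehicle_type', 'gearbox', 'fuel_type', 'unrepaired_damage']:
    # replace() maps exact matches and leaves NaN and unmapped values untouched
    new_autos[col] = new_autos[col].replace(translations)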
# Integer YYYYMMDD version of the crawl date (e.g. '2016-03-26 ...' -> 20160326)
new_autos['date'] = new_autos['date_crawled'].str.split(' ', expand=True)[0].str.replace('-', '', regex=False).astype(int)
new_autos