Factors affecting value's of used cars.¶

In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, which you can find it here.

A few modifications have been made from the original dataset:

50,000 data points were sampled from the full dataset
The dataset has been modified by Dataquest to be less clean.

The data dictionary provided with data is as follows:

dateCrawled - When this ad was first crawled. All field-values are taken from this date.
name - Name of the car.
seller - Whether the seller is private or a dealer.
offerType - The type of listing
price - The price on the ad to sell the car.
abtest - Whether the listing is included in an A/B test.
vehicleType - The vehicle Type.
yearOfRegistration - The year in which which year the car was first registered.
gearbox - The transmission type.
powerPS - The power of the car in PS.
model - The car model name.
kilometer - How many kilometers the car has driven.
monthOfRegistration - The month in which which year the car was first registered.
fuelType - What type of fuel the car uses.
brand - The brand of the car.
notRepairedDamage - If the car has a damage which is not yet repaired.
dateCreated - The date on which the eBay listing was created.
nrOfPictures - The number of pictures in the ad.
postalCode - The postal code for the location of the vehicle.
lastSeenOnline - When the crawler saw this ad last online.

The aim of this project is to clean the data and analyze the included used car listings.

Summary of results:¶

- price of the car depends on many factors(model, year, condition, car mileage, etc.)
- some brands are more expensive than other
- cars with lower mileage are more expensive
- damaged cars are on average 52% cheaper
- the most popular model is Volkswagen Golf

Let's start by importing the libraries we need and reading the dataset into pandas.

In [1]:

# import usefull libraries
import pandas as pd
import numpy as np

# read CSV file into pandas
autos = pd.read_csv('autos.csv', encoding = 'Latin-1') # use Latin-1 encoding

Lets quickly explore our data set.

In [2]:

autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

Out[2]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

Our data sets has 20 columns and 50 000 rows, most of them are stored as strings and some as float. 6 categories contain null entries, but none have more than 20% null values. The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores.

Column names¶

Let's convert the column names from camelcase to snakecase and reword some of the column names based on the data dictionary to be more descriptive.

In [3]:

autos.columns # use .columns to print columns names

Out[3]:

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

In [4]:

# modify column names
columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']

In [5]:

# assigned new column names

autos.columns = columns

autos.head(3)

Out[5]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37

We've changed the column names from camelcase to snakecase and editted some of the columns names to be more descriptive.

Now let's do some basic data exploration to determine what other cleaning tasks need to be done.

In [6]:

autos.describe(include='all') # include='all' to get both categorical and numeric columns

Out[6]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	nr_of_pictures	postal_code	last_seen
count	50000	50000	50000	50000	50000	50000	44905	50000.000000	47320	50000.000000	47242	50000	50000.000000	45518	50000	40171	50000	50000.0	50000.000000	50000
unique	48213	38754	2	2	2357	2	8	NaN	2	NaN	245	13	NaN	7	40	2	76	NaN	NaN	39481
top	2016-03-21 20:37:19	Ford_Fiesta	privat	Angebot	$0	test	limousine	NaN	manuell	NaN	golf	150,000km	NaN	benzin	volkswagen	nein	2016-04-03 00:00:00	NaN	NaN	2016-04-07 06:17:27
freq	3	78	49999	49999	1421	25756	12859	NaN	36993	NaN	4024	32424	NaN	30107	10687	35232	1946	NaN	NaN	8
mean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2005.073280	NaN	116.355920	NaN	NaN	5.723360	NaN	NaN	NaN	NaN	0.0	50813.627300	NaN
std	NaN	NaN	NaN	NaN	NaN	NaN	NaN	105.712813	NaN	209.216627	NaN	NaN	3.711984	NaN	NaN	NaN	NaN	0.0	25779.747957	NaN
min	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1000.000000	NaN	0.000000	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	0.0	1067.000000	NaN
25%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1999.000000	NaN	70.000000	NaN	NaN	3.000000	NaN	NaN	NaN	NaN	0.0	30451.000000	NaN
50%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2003.000000	NaN	105.000000	NaN	NaN	6.000000	NaN	NaN	NaN	NaN	0.0	49577.000000	NaN
75%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2008.000000	NaN	150.000000	NaN	NaN	9.000000	NaN	NaN	NaN	NaN	0.0	71540.000000	NaN
max	NaN	NaN	NaN	NaN	NaN	NaN	NaN	9999.000000	NaN	17700.000000	NaN	NaN	12.000000	NaN	NaN	NaN	NaN	0.0	99998.000000	NaN

Data cleaning¶

We are going to clean our data in few steps:

1. Drop any colunms that have mostly one value. 

2. Check columns for their unique values.

3. Change some columns for numeric values.

4. Change some columns for datetime dtype.

5. Translate some data from german to english.

1. Drop columns¶

We will drop seller and offer_type colunms since they have only one value and have a closer look at nr_of_pictures column.

In [7]:

autos['nr_of_pictures'].value_counts()

Out[7]:

0    50000
Name: nr_of_pictures, dtype: int64

Look like column nr_of_pictures has 0 as every value so we can also drop it.

In [8]:

autos.drop(['seller','offer_type', 'nr_of_pictures'], axis = 1)

Out[8]:

	date_crawled	name	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
49995	2016-03-27 14:38:19	Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon	$24,900	control	limousine	2011	automatik	239	q5	100,000km	1	diesel	audi	nein	2016-03-27 00:00:00	82131	2016-04-01 13:47:40
49996	2016-03-28 10:50:25	Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...	$1,980	control	cabrio	1996	manuell	75	astra	150,000km	5	benzin	opel	nein	2016-03-28 00:00:00	44807	2016-04-02 14:18:02
49997	2016-04-02 14:44:48	Fiat_500_C_1.2_Dualogic_Lounge	$13,200	test	cabrio	2014	automatik	69	500	5,000km	11	benzin	fiat	nein	2016-04-02 00:00:00	73430	2016-04-04 11:47:27
49998	2016-03-08 19:25:42	Audi_A3_2.0_TDI_Sportback_Ambition	$22,900	control	kombi	2013	manuell	150	a3	40,000km	11	diesel	audi	nein	2016-03-08 00:00:00	35683	2016-04-05 16:45:07
49999	2016-03-14 00:42:12	Opel_Vectra_1.6_16V	$1,250	control	limousine	1996	manuell	101	vectra	150,000km	1	benzin	opel	nein	2016-03-13 00:00:00	45897	2016-04-06 21:18:48

50000 rows × 17 columns

2. Unique values¶

We will check vehicle_type, fuel_type and brand for unique values.

In [9]:

# Vechicle Type
autos['vehicle_type'].value_counts(dropna=False)

Out[9]:

limousine     12859
kleinwagen    10822
kombi          9127
NaN            5095
bus            4093
cabrio         3061
coupe          2537
suv            1986
andere          420
Name: vehicle_type, dtype: int64

In [10]:

# Fuel Type
autos['fuel_type'].value_counts(dropna=False)

Out[10]:

benzin     30107
diesel     14567
NaN         4482
lpg          691
cng           75
hybrid        37
andere        22
elektro       19
Name: fuel_type, dtype: int64

In [11]:

# Brand
autos['brand'].value_counts(dropna=False)

Out[11]:

volkswagen        10687
opel               5461
bmw                5429
mercedes_benz      4734
audi               4283
ford               3479
renault            2404
peugeot            1456
fiat               1308
seat                941
skoda               786
mazda               757
nissan              754
smart               701
citroen             701
toyota              617
sonstige_autos      546
hyundai             488
volvo               457
mini                424
mitsubishi          406
honda               399
kia                 356
alfa_romeo          329
porsche             294
suzuki              293
chevrolet           283
chrysler            181
dacia               129
daihatsu            128
jeep                110
subaru              109
land_rover           99
saab                 80
daewoo               79
trabant              78
jaguar               77
rover                69
lancia               57
lada                 31
Name: brand, dtype: int64

Everything looks good here. Let's change some columns to numeric values.

3. Numeric values¶

We will change price and odometer for numeric values.

First we will delete '$' and ',' from price column and than change dtype for numeric.

In [12]:

autos['price'] = (autos['price']
                  .str.replace('$','') # delete $
                  .str.replace(',','') # delete ,
                  .astype(int) # convert column to numeric dtype
                 )

In [13]:

autos['price'].head()

Out[13]:

0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64

Let's do the same for odometer column.

In [14]:

autos['odometer'] = (autos['odometer'].str.replace('km','') # delete 'km'
                     .str.replace(',','') # delete ,
                     .astype(int) # change dtype to int
                    )

In [15]:

autos['odometer'].head()

Out[15]:

0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer, dtype: int64

In [16]:

#change column name to 'odometer_km'
autos.rename(columns={'odometer': 'odometer_km'}, inplace=True)

Let's continue exploring the data, specifically looking for data that doesn't look right. We'll start by analyzing the odometer_km, price and registration_year columns. We will analyze the columns using minimum and maximum values and look for any values that look unrealistically high or low (outliers) that we might want to remove.

odometer_km¶

In [17]:

# see how many unique values
autos['odometer_km'].unique().shape

Out[17]:

(13,)

In [18]:

# see min/max/median/mean etc
autos['odometer_km'].describe()

Out[18]:

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64

In [19]:

# see the highest and lowest values with their counts
autos['odometer_km'].value_counts().sort_index()

Out[19]:

5000        967
10000       264
20000       784
30000       789
40000       819
50000      1027
60000      1164
70000      1230
80000      1436
90000      1757
100000     2169
125000     5170
150000    32424
Name: odometer_km, dtype: int64

Everything looks good for this column.

price¶

In [20]:

# see how many unique values
autos['price'].unique().shape

Out[20]:

(2357,)

In [21]:

# see min/max/median/mean etc
autos['price'].describe()

Out[21]:

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [22]:

# see the lowest values with their counts
autos['price'].value_counts().sort_index().head(5)

Out[22]:

0    1421
1     156
2       3
3       1
5       2
Name: price, dtype: int64

In [23]:

# see the highest values with their counts
autos['price'].value_counts().sort_index().tail(15)

Out[23]:

265000      1
295000      1
299000      1
345000      1
350000      1
999990      1
999999      2
1234566     1
1300000     1
3890000     1
10000000    1
11111111    2
12345678    3
27322222    1
99999999    1
Name: price, dtype: int64

We can see many prices that doesn't look realistic. There are over 1400 cars listed for free. We have also some with a price like '1234566' or '11111111'.

Let's have a closer look at listings with price over 350 000.

In [24]:

autos[autos['price'] > 350000].sort_values('price')

Out[24]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
37585	2016-03-29 11:38:54	Volkswagen_Jetta_GT	privat	Angebot	999990	test	limousine	1985	manuell	111	jetta	150000	12	benzin	volkswagen	ja	2016-03-29 00:00:00	50997	2016-03-29 11:38:54
514	2016-03-17 09:53:08	Ford_Focus_Turnier_1.6_16V_Style	privat	Angebot	999999	test	kombi	2009	manuell	101	focus	125000	4	benzin	ford	nein	2016-03-17 00:00:00	12205	2016-04-06 07:17:35
43049	2016-03-21 19:53:52	2_VW_Busse_T3	privat	Angebot	999999	test	bus	1981	manuell	70	transporter	150000	1	benzin	volkswagen	NaN	2016-03-21 00:00:00	99880	2016-03-28 17:18:28
22947	2016-03-22 12:54:19	Bmw_530d_zum_ausschlachten	privat	Angebot	1234566	control	kombi	1999	automatik	190	NaN	150000	2	diesel	bmw	NaN	2016-03-22 00:00:00	17454	2016-04-02 03:17:32
7814	2016-04-04 11:53:31	Ferrari_F40	privat	Angebot	1300000	control	coupe	1992	NaN	0	NaN	50000	12	NaN	sonstige_autos	nein	2016-04-04 00:00:00	60598	2016-04-05 11:34:11
47634	2016-04-04 21:25:21	Ferrari_FXX	privat	Angebot	3890000	test	coupe	2006	NaN	799	NaN	5000	7	NaN	sonstige_autos	nein	2016-04-04 00:00:00	60313	2016-04-05 12:07:37
11137	2016-03-29 23:52:57	suche_maserati_3200_gt_Zustand_unwichtig_laufe...	privat	Angebot	10000000	control	coupe	1960	manuell	368	NaN	100000	1	benzin	sonstige_autos	nein	2016-03-29 00:00:00	73033	2016-04-06 21:18:11
2897	2016-03-12 21:50:57	Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000	privat	Angebot	11111111	test	limousine	1973	manuell	48	escort	50000	3	benzin	ford	nein	2016-03-12 00:00:00	94469	2016-03-12 22:45:27
24384	2016-03-21 13:57:51	Schlachte_Golf_3_gt_tdi	privat	Angebot	11111111	test	NaN	1995	NaN	0	NaN	150000	0	NaN	volkswagen	NaN	2016-03-21 00:00:00	18519	2016-03-21 14:40:18
27371	2016-03-09 15:45:47	Fiat_Punto	privat	Angebot	12345678	control	NaN	2017	NaN	95	punto	150000	0	NaN	fiat	NaN	2016-03-09 00:00:00	96110	2016-03-09 15:45:47
39377	2016-03-08 23:53:51	Tausche_volvo_v40_gegen_van	privat	Angebot	12345678	control	NaN	2018	manuell	95	v40	150000	6	NaN	volvo	nein	2016-03-08 00:00:00	14542	2016-04-06 23:17:31
47598	2016-03-31 18:56:54	Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar...	privat	Angebot	12345678	control	limousine	2001	manuell	101	vectra	150000	3	benzin	opel	nein	2016-03-31 00:00:00	4356	2016-03-31 18:56:54
42221	2016-03-08 20:39:05	Leasinguebernahme	privat	Angebot	27322222	control	limousine	2014	manuell	163	c4	40000	2	diesel	citroen	NaN	2016-03-08 00:00:00	76532	2016-03-08 20:39:05
39705	2016-03-22 14:58:27	Tausch_gegen_gleichwertiges	privat	Angebot	99999999	control	limousine	1999	automatik	224	s_klasse	150000	9	benzin	mercedes_benz	NaN	2016-03-22 00:00:00	73525	2016-04-06 05:15:30

Most of this cars have unrealistic price excpet of 3 models(2x Ferrari and Masseratti) from 1 300 000 to 10 000 000. So we are going to delete all cars listed for free and over 350 000, but we are going to keep this 3 more expensive models.

In [25]:

autos = autos[(((autos["price"] >= 1 ) & (autos["price"] <= 350000 )) | ((autos["price"] >= 1300000 ) & (autos["price"] <= 10000000 )))]

autos['price'].value_counts().sort_index()

Out[25]:

1           156
2             3
3             1
5             2
8             1
           ... 
345000        1
350000        1
1300000       1
3890000       1
10000000      1
Name: price, Length: 2349, dtype: int64

registration_year¶

In [26]:

# see how many unique values
autos['registration_year'].unique().shape

Out[26]:

(95,)

In [27]:

# see min/max/median/mean etc
autos['registration_year'].describe()

Out[27]:

count    48568.000000
mean      2004.754262
std         88.641401
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

In [28]:

# see the highest and lowest values with their counts
autos['registration_year'].value_counts().sort_index()

Out[28]:

1000    1
1001    1
1111    1
1800    2
1910    5
       ..
5911    1
6200    1
8888    1
9000    1
9999    3
Name: registration_year, Length: 95, dtype: int64

Since the car couldn't be registered before 1900 and after 2016 which is year where this listings were added we will remove this outliers.

In [29]:

autos = autos[autos["registration_year"].between(1900,2016)]

4. Date values¶

Let's now move on to the date columns and understand the date range the data covers. There are 5 columns that should represent date values.

Right now, the date_crawled, last_seen, and ad_created columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a astype datetime representation so we can understand it quantitatively. The other two registration_month and registration_year columns are represented as numeric values, so we can use methods like Series.describe() to understand the distribution without any extra data processing.

Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values, like so:

In [30]:

autos[['date_crawled','ad_created','last_seen']][0:5]

Out[30]:

	date_crawled	ad_created	last_seen
0	2016-03-26 17:47:46	2016-03-26 00:00:00	2016-04-06 06:45:54
1	2016-04-04 13:38:56	2016-04-04 00:00:00	2016-04-06 14:45:08
2	2016-03-26 18:57:24	2016-03-26 00:00:00	2016-04-06 20:15:37
3	2016-03-12 16:58:10	2016-03-12 00:00:00	2016-03-15 03:16:28
4	2016-04-01 14:38:50	2016-04-01 00:00:00	2016-04-01 14:38:50

We can notice that a date is in a format YYYY-MM-DD HH-MM-SS. We are going to use pd.to_datetime to change astype to datetime and we will change format to YYYY-MM-DD.

date_crawled¶

In [31]:

autos['date_crawled'] = pd.to_datetime(autos['date_crawled'], format="%Y-%m-%d %H:%M:%S") # change dtype to datetime
date_crawled = autos['date_crawled'].dt.strftime('%Y-%m-%d') # change format to YYYY-MM-DD

In [32]:

date_crawled.value_counts(normalize=True,dropna=False).sort_index() #normalize and dropna to inluce missing values and use percentages

Out[32]:

2016-03-05    0.025191
2016-03-06    0.014159
2016-03-07    0.036244
2016-03-08    0.033545
2016-03-09    0.033245
2016-03-10    0.032238
2016-03-11    0.032452
2016-03-12    0.036822
2016-03-13    0.015873
2016-03-14    0.036329
2016-03-15    0.034359
2016-03-16    0.029496
2016-03-17    0.031788
2016-03-18    0.012810
2016-03-19    0.034659
2016-03-20    0.038022
2016-03-21    0.037315
2016-03-22    0.032838
2016-03-23    0.032195
2016-03-24    0.029475
2016-03-25    0.031510
2016-03-26    0.032067
2016-03-27    0.030781
2016-03-28    0.034594
2016-03-29    0.034123
2016-03-30    0.033802
2016-03-31    0.031788
2016-04-01    0.033802
2016-04-02    0.035558
2016-04-03    0.038771
2016-04-04    0.036651
2016-04-05    0.013002
2016-04-06    0.003085
2016-04-07    0.001414
Name: date_crawled, dtype: float64

The adverts were first crawled between 5th of March and 7th of April 2016.

ad_created¶

In [33]:

autos['ad_created'] = pd.to_datetime(autos['ad_created'], format="%Y-%m-%d %H:%M:%S")
ad_created = autos['ad_created'].dt.strftime("%Y-%m") # keep only the year and month 

In [34]:

ad_created.value_counts(normalize=True,dropna=False).sort_index() #normalize and dropna to inluce missing values and use percentages

Out[34]:

2015-06    0.000021
2015-08    0.000021
2015-09    0.000021
2015-11    0.000021
2015-12    0.000043
2016-01    0.000193
2016-02    0.001307
2016-03    0.837139
2016-04    0.161233
Name: ad_created, dtype: float64

The eBay listings were created between June 2015 and April 2016. But 99% of the ads were created in March and April 2016. Most adds were creater in March 2016 - 83,7%

last_seen¶

In [35]:

autos['last_seen'] = pd.to_datetime(autos['last_seen'], format="%Y-%m-%d %H:%M:%S")
last_seen = autos['last_seen'].dt.strftime("%Y-%m-%d")

In [36]:

last_seen.value_counts(normalize=True,dropna=False).sort_index() #normalize and dropna to inluce missing values and use percentages

Out[36]:

2016-03-05    0.001071
2016-03-06    0.004113
2016-03-07    0.005377
2016-03-08    0.007476
2016-03-09    0.009768
2016-03-10    0.010689
2016-03-11    0.012381
2016-03-12    0.023755
2016-03-13    0.008654
2016-03-14    0.012660
2016-03-15    0.016001
2016-03-16    0.016280
2016-03-17    0.028082
2016-03-18    0.007219
2016-03-19    0.015616
2016-03-20    0.020628
2016-03-21    0.020585
2016-03-22    0.020842
2016-03-23    0.018357
2016-03-24    0.019686
2016-03-25    0.018936
2016-03-26    0.016794
2016-03-27    0.015637
2016-03-28    0.020692
2016-03-29    0.022085
2016-03-30    0.024612
2016-03-31    0.023627
2016-04-01    0.022941
2016-04-02    0.024655
2016-04-03    0.025148
2016-04-04    0.024120
2016-04-05    0.125439
2016-04-06    0.223331
2016-04-07    0.132744
Name: last_seen, dtype: float64

The crawler saw this ads between 3rd of May and 7th of April 2016.

registration_year¶

In [37]:

autos['registration_year'].describe()

Out[37]:

count    46684.000000
mean      2002.909669
std          7.187808
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64

Cars were first registered between 1910 and 2016.

We have cleared our data set, now lets analyze them.

5. Translate to english.¶

We are going to translate data from this columns: vehicle_type, gearbox, model, fuel_type, brand and unrepaired_damage from german to english. First we will check unique values for every column, translate them and than map the values to their english counterparts.

vehicle_type¶

In [38]:

print(pd.unique(autos['vehicle_type']))

['bus' 'limousine' 'kleinwagen' 'kombi' nan 'coupe' 'suv' 'cabrio'
 'andere']

In [39]:

# create dictionary with translations
vt_dict = {
    'limousine': 'limousine',
    'kleinwagen': 'small_car',
    'kombi': 'combi',
    'bus': 'bus',
    'cabrio': 'convertible',
    'coupe': 'coupe',
    'suv': 'suv',
    'andere': 'other'
}

autos['vehicle_type'] = autos['vehicle_type'].map(vt_dict) # map the new values
print(pd.unique(autos['vehicle_type'])) # check new values

['bus' 'limousine' 'small_car' 'combi' nan 'coupe' 'suv' 'convertible'
 'other']

gearbox¶

In [40]:

print(pd.unique(autos['gearbox']))

['manuell' 'automatik' nan]

In [41]:

# create dictionary with translations
gb_dict = {
    'manuell': 'manual',
    'automatik': 'automatic'
}
autos['gearbox'] = autos['gearbox'].map(gb_dict) # map the new values
print(pd.unique(autos['gearbox'])) # check new values

['manual' 'automatic' nan]

fuel_type¶

In [42]:

print(pd.unique(autos['fuel_type']))

['lpg' 'benzin' 'diesel' nan 'cng' 'hybrid' 'elektro' 'andere']

In [43]:

# create dictionary with translations
ft_dict = {
    'benzin': 'petrol',
    'diesel': 'diesel',
    'lpg': 'lpg',
    'cng': 'cng',
    'hybrid': 'hybrid',
    'elektro': 'electric',
    'andere': 'other'
}
autos['fuel_type'] = autos['fuel_type'].map(ft_dict) # map the new values
print(pd.unique(autos['fuel_type'])) # check new values

['lpg' 'petrol' 'diesel' nan 'cng' 'hybrid' 'electric' 'other']

unrepaired_damage¶

In [44]:

print(pd.unique(autos['unrepaired_damage']))

['nein' nan 'ja']

In [45]:

# create dictionary with translations
ud_dict = {
    'ja': 'yes',
    'nein': 'no'
}
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(ud_dict) # map the new values
print(pd.unique(autos['unrepaired_damage'])) # check new values

['no' nan 'yes']

model¶

In [46]:

print(pd.unique(autos['model']))

['andere' '7er' 'golf' 'fortwo' 'focus' 'voyager' 'arosa' 'megane' nan
 'a3' 'clio' 'vectra' 'scirocco' '3er' 'a4' '911' 'cooper' '5er' 'polo'
 'e_klasse' '2_reihe' 'c_klasse' 'corsa' 'mondeo' 'altea' 'a1' 'twingo'
 'a_klasse' 'cl' '3_reihe' 's_klasse' 'sandero' 'passat' 'primera'
 'wrangler' 'a6' 'transporter' 'astra' 'v40' 'ibiza' 'micra' '1er' 'yaris'
 'colt' '6_reihe' '5_reihe' 'corolla' 'ka' 'tigra' 'punto' 'vito'
 'cordoba' 'galaxy' '100' 'octavia' 'm_klasse' 'lupo' 'fiesta' 'superb'
 'meriva' 'c_max' 'laguna' 'touran' '1_reihe' 'm_reihe' 'touareg'
 'seicento' 'avensis' 'vivaro' 'x_reihe' 'ducato' 'carnival' 'boxster'
 'signum' 'sharan' 'zafira' 'rav' 'a5' 'beetle' 'c_reihe' 'phaeton'
 'i_reihe' 'sl' 'insignia' 'up' 'civic' '80' 'mx_reihe' 'omega' 'sorento'
 'z_reihe' 'berlingo' 'clk' 's_max' 'kalos' 'cx_reihe' 'grand' 'swift'
 'tiguan' 'sprinter' 'mii' 'viano' 'kaefer' 'almera' 'picanto' 'espace'
 'scenic' 'one' 'bora' 'fox' 'leon' 'transit' 'tucson' 'tt' 'qashqai'
 'mustang' 'jazz' 'nubira' 'v70' 'duster' 'stilo' 'justy' 'kangoo' 'fabia'
 'matiz' 'freelander' 'lancer' 'forester' 's_type' 'bravo' '500' '156'
 'cr_reihe' 'toledo' 'panda' 'a8' 'ceed' 'caddy' 'aygo' 'citigo' 'exeo'
 'c1' 'doblo' 'impreza' '147' 'agila' '145' 'c3' 'roomster' 'auris' '601'
 'eos' 'aveo' 'slk' 'terios' 'fusion' 'c4' 'glk' 'materia' 'v50' 'combo'
 'yeti' '900' '850' 'outlander' 'rio' 'escort' 'c2' 'jimny' 'forfour' 'a2'
 'spider' 'kadett' 'b_klasse' 'carisma' 'cayenne' 'r19' 'cc' 'note'
 'logan' 'xc_reihe' 'verso' 'v_klasse' 'jetta' 'c5' 'q5' 'getz' 'navara'
 'x_type' 'clubman' 's60' '4_reihe' 'lybra' 'cherokee' 'q3' 'q7'
 'roadster' 'captiva' 'cuore' '90' 'ptcruiser' 'kuga' 'x_trail' 'ypsilon'
 'pajero' 'delta' 'rx_reihe' 'modus' 'sportage' 'calibra' 'sirion' '6er'
 'gl' 'amarok' 'santa' '300c' 'antara' 'accord' 'crossfire' '159' 'niva'
 'galant' 'legacy' 'defender' 'range_rover' 'juke' 'spark' 'lanos' 'v60'
 'alhambra' 'discovery' 'range_rover_sport' 'range_rover_evoque' 'move'
 'croma' 'g_klasse' 'lodgy' 'rangerover' 'samara' 'kappa' '9000' 'charade'
 'i3' '200' 'b_max']

Since we have alot unique values and we want translate only one type andere, we will use str.replace().

In [47]:

autos['model'] = autos['model'].str.replace('andere','other')
print(pd.unique(autos['model']))

['other' '7er' 'golf' 'fortwo' 'focus' 'voyager' 'arosa' 'megane' nan 'a3'
 'clio' 'vectra' 'scirocco' '3er' 'a4' '911' 'cooper' '5er' 'polo'
 'e_klasse' '2_reihe' 'c_klasse' 'corsa' 'mondeo' 'altea' 'a1' 'twingo'
 'a_klasse' 'cl' '3_reihe' 's_klasse' 'sandero' 'passat' 'primera'
 'wrangler' 'a6' 'transporter' 'astra' 'v40' 'ibiza' 'micra' '1er' 'yaris'
 'colt' '6_reihe' '5_reihe' 'corolla' 'ka' 'tigra' 'punto' 'vito'
 'cordoba' 'galaxy' '100' 'octavia' 'm_klasse' 'lupo' 'fiesta' 'superb'
 'meriva' 'c_max' 'laguna' 'touran' '1_reihe' 'm_reihe' 'touareg'
 'seicento' 'avensis' 'vivaro' 'x_reihe' 'ducato' 'carnival' 'boxster'
 'signum' 'sharan' 'zafira' 'rav' 'a5' 'beetle' 'c_reihe' 'phaeton'
 'i_reihe' 'sl' 'insignia' 'up' 'civic' '80' 'mx_reihe' 'omega' 'sorento'
 'z_reihe' 'berlingo' 'clk' 's_max' 'kalos' 'cx_reihe' 'grand' 'swift'
 'tiguan' 'sprinter' 'mii' 'viano' 'kaefer' 'almera' 'picanto' 'espace'
 'scenic' 'one' 'bora' 'fox' 'leon' 'transit' 'tucson' 'tt' 'qashqai'
 'mustang' 'jazz' 'nubira' 'v70' 'duster' 'stilo' 'justy' 'kangoo' 'fabia'
 'matiz' 'freelander' 'lancer' 'forester' 's_type' 'bravo' '500' '156'
 'cr_reihe' 'toledo' 'panda' 'a8' 'ceed' 'caddy' 'aygo' 'citigo' 'exeo'
 'c1' 'doblo' 'impreza' '147' 'agila' '145' 'c3' 'roomster' 'auris' '601'
 'eos' 'aveo' 'slk' 'terios' 'fusion' 'c4' 'glk' 'materia' 'v50' 'combo'
 'yeti' '900' '850' 'outlander' 'rio' 'escort' 'c2' 'jimny' 'forfour' 'a2'
 'spider' 'kadett' 'b_klasse' 'carisma' 'cayenne' 'r19' 'cc' 'note'
 'logan' 'xc_reihe' 'verso' 'v_klasse' 'jetta' 'c5' 'q5' 'getz' 'navara'
 'x_type' 'clubman' 's60' '4_reihe' 'lybra' 'cherokee' 'q3' 'q7'
 'roadster' 'captiva' 'cuore' '90' 'ptcruiser' 'kuga' 'x_trail' 'ypsilon'
 'pajero' 'delta' 'rx_reihe' 'modus' 'sportage' 'calibra' 'sirion' '6er'
 'gl' 'amarok' 'santa' '300c' 'antara' 'accord' 'crossfire' '159' 'niva'
 'galant' 'legacy' 'defender' 'range_rover' 'juke' 'spark' 'lanos' 'v60'
 'alhambra' 'discovery' 'range_rover_sport' 'range_rover_evoque' 'move'
 'croma' 'g_klasse' 'lodgy' 'rangerover' 'samara' 'kappa' '9000' 'charade'
 'i3' '200' 'b_max']

brand¶

In [48]:

print(pd.unique(autos['brand']))

['peugeot' 'bmw' 'volkswagen' 'smart' 'ford' 'chrysler' 'seat' 'renault'
 'mercedes_benz' 'audi' 'sonstige_autos' 'opel' 'mazda' 'porsche' 'mini'
 'toyota' 'dacia' 'nissan' 'jeep' 'saab' 'volvo' 'mitsubishi' 'jaguar'
 'fiat' 'skoda' 'subaru' 'kia' 'citroen' 'chevrolet' 'hyundai' 'honda'
 'daewoo' 'suzuki' 'trabant' 'land_rover' 'alfa_romeo' 'lada' 'rover'
 'daihatsu' 'lancia']

In [49]:

autos['brand'] = autos['brand'].str.replace('sonstige_autos','other')
print(pd.unique(autos['brand']))

['peugeot' 'bmw' 'volkswagen' 'smart' 'ford' 'chrysler' 'seat' 'renault'
 'mercedes_benz' 'audi' 'other' 'opel' 'mazda' 'porsche' 'mini' 'toyota'
 'dacia' 'nissan' 'jeep' 'saab' 'volvo' 'mitsubishi' 'jaguar' 'fiat'
 'skoda' 'subaru' 'kia' 'citroen' 'chevrolet' 'hyundai' 'honda' 'daewoo'
 'suzuki' 'trabant' 'land_rover' 'alfa_romeo' 'lada' 'rover' 'daihatsu'
 'lancia']

Analyze Data¶

Mean price by brand.¶

First we will calculate the mean price for every brand. We can use aggregation to understand the brand column.

In [50]:

autos['brand'].value_counts().sort_index # count values in 'brand' column

Out[50]:

<bound method Series.sort_index of volkswagen       9862
bmw              5137
opel             5022
mercedes_benz    4503
audi             4041
ford             3263
renault          2201
peugeot          1393
fiat             1197
seat              853
skoda             766
nissan            713
mazda             709
smart             661
citroen           654
toyota            593
hyundai           468
other             461
volvo             427
mini              409
mitsubishi        384
honda             366
kia               330
alfa_romeo        310
porsche           286
suzuki            277
chevrolet         266
chrysler          164
dacia             123
daihatsu          117
jeep              106
subaru            100
land_rover         98
saab               77
jaguar             73
daewoo             70
trabant            65
rover              62
lancia             50
lada               27
Name: brand, dtype: int64>

We have 40 diffrent car brands in our data. The most popular brand is Volkswagen.
Will will continue analyze only for the 10 most popular brands.

In [51]:

top10 = autos['brand'].value_counts().index[:10] #names of Top10 brands

In [52]:

mean_price = {} #create dictionary for mean price
for brand in top10:
    selected_rows = autos[autos['brand'] == brand] #select rows for every brand
    price = selected_rows['price'].mean() # calculate mean price
    mean_price[brand] = int(price) #add mean price to dictionary as int.

mean_price = pd.Series(mean_price).sort_values(ascending=False) # change dictionary to pandas.series and sort it. 

In [53]:

brand_price = pd.DataFrame(mean_price, columns=["mean_price"]) # change pd.series to pd.DataFrame
brand_price

Out[53]:

	mean_price
audi	9336
mercedes_benz	8628
bmw	8332
volkswagen	5402
seat	4397
ford	3749
peugeot	3094
opel	2975
fiat	2813
renault	2474

We observed that in the top 10 brands, there's a distinct price gap.

Audi, Mercedes Benz and BMW are more expensive
Fiat, Peugeot, Reanult and Opel are less expensive
Volkswagen, Ford and Seat are in between

For the top 10 brands, let's use aggregation to understand the average mileage for those cars and if there's any visible link with mean price.

Mean car mileage by brand.¶

In [54]:

mean_mileage = {} #create dictionary for mean mileage
for brand in top10:
    selected_rows = autos[autos['brand'] == brand] #select rows for every brand
    mileage = selected_rows['odometer_km'].mean() # calculate mean mileagae
    mean_mileage[brand] = int(mileage) #add mean mileage to dictionary as int.
    
mean_mileage = pd.Series(mean_mileage).sort_values(ascending=False)  # change dictionary to pandas.series and sort it. 

In [55]:

brand_mileage = pd.DataFrame(mean_mileage,columns=['mean_mileage']) # change pd.series to pd.DataFrame
brand_mileage

Out[55]:

	mean_mileage
bmw	132572
mercedes_benz	130788
opel	129310
audi	129157
volkswagen	128707
renault	128071
peugeot	127153
ford	124266
seat	121131
fiat	117121

BMW cars have the highest car mileage.
Now we will check if the price of each brand depends on car mileage.

In [56]:

brand_price['mean_mileage'] = brand_mileage # combine to DataFrames 
brand_price

Out[56]:

	mean_price	mean_mileage
audi	9336	129157
mercedes_benz	8628	130788
bmw	8332	132572
volkswagen	5402	128707
seat	4397	121131
ford	3749	124266
peugeot	3094	127153
opel	2975	129310
fiat	2813	117121
renault	2474	128071

We cannot see a direct correlation between price and mileage. The price of the car is depending on many factors like: model, year, condition and mileage. We can conclude that the price is influenced by mileage but not depending on.

Average price per mileage¶

We will see how the car mileage influence the price.

In [57]:

### create a list of mileage
odometer = []
for x in autos['odometer_km']:
    if x not in odometer:
        odometer.append(x)
print(odometer)

[150000, 70000, 50000, 80000, 10000, 30000, 125000, 90000, 20000, 60000, 5000, 40000, 100000]

In [58]:

avg_price = {} # create empty dictionary
for x in odometer:
    selected_rows = autos[autos['odometer_km'] == x] #select rows
    price = selected_rows['price'].mean() # calculate mean
    avg_price[x] = int(price) # add price to dictionary
    
avg_price = pd.Series(avg_price).sort_values(ascending=False) # change dictionary to pandas.series and sort it.

In [59]:

odo_price = pd.DataFrame(avg_price,columns=['price']) # change pd.series to pd.DataFrame
odo_price

Out[59]:

	price
10000	20550
20000	18448
30000	16608
40000	15499
50000	15106
5000	13811
100000	12985
60000	12385
70000	10927
80000	9721
90000	8465
125000	6214
150000	3767

Car price depends of their car mileage. We can see that cars with smaller mileage are mostly more expensive.
This data shows that the conclusion we made before is correct.

How much cheaper are cars with damage than their non-damaged counterparts?¶

We will investigate how much cheaper are same models of cars when they damaged.
First we will calculate average price for every model not damaged and damaged. Than we will compare both and see the diffrence.

In [60]:

### create a list of models
models = []
for x in autos['model'].dropna(): # select rows for each model and drop NaN values
    if x not in models:
        models.append(x)

models.remove('other') # remove 'other' form the list 
print(models)

['7er', 'golf', 'fortwo', 'focus', 'voyager', 'arosa', 'megane', 'a3', 'clio', 'vectra', 'scirocco', '3er', 'a4', '911', 'cooper', '5er', 'polo', 'e_klasse', '2_reihe', 'c_klasse', 'corsa', 'mondeo', 'altea', 'a1', 'twingo', 'a_klasse', 'cl', '3_reihe', 's_klasse', 'sandero', 'passat', 'primera', 'wrangler', 'a6', 'transporter', 'astra', 'v40', 'ibiza', 'micra', '1er', 'yaris', 'colt', '6_reihe', '5_reihe', 'corolla', 'ka', 'tigra', 'punto', 'vito', 'cordoba', 'galaxy', '100', 'octavia', 'm_klasse', 'lupo', 'fiesta', 'superb', 'meriva', 'c_max', 'laguna', 'touran', '1_reihe', 'm_reihe', 'touareg', 'seicento', 'avensis', 'vivaro', 'x_reihe', 'ducato', 'carnival', 'boxster', 'signum', 'sharan', 'zafira', 'rav', 'a5', 'beetle', 'c_reihe', 'phaeton', 'i_reihe', 'sl', 'insignia', 'up', 'civic', '80', 'mx_reihe', 'omega', 'sorento', 'z_reihe', 'berlingo', 'clk', 's_max', 'kalos', 'cx_reihe', 'grand', 'swift', 'tiguan', 'sprinter', 'mii', 'viano', 'kaefer', 'almera', 'picanto', 'espace', 'scenic', 'one', 'bora', 'fox', 'leon', 'transit', 'tucson', 'tt', 'qashqai', 'mustang', 'jazz', 'nubira', 'v70', 'duster', 'stilo', 'justy', 'kangoo', 'fabia', 'matiz', 'freelander', 'lancer', 'forester', 's_type', 'bravo', '500', '156', 'cr_reihe', 'toledo', 'panda', 'a8', 'ceed', 'caddy', 'aygo', 'citigo', 'exeo', 'c1', 'doblo', 'impreza', '147', 'agila', '145', 'c3', 'roomster', 'auris', '601', 'eos', 'aveo', 'slk', 'terios', 'fusion', 'c4', 'glk', 'materia', 'v50', 'combo', 'yeti', '900', '850', 'outlander', 'rio', 'escort', 'c2', 'jimny', 'forfour', 'a2', 'spider', 'kadett', 'b_klasse', 'carisma', 'cayenne', 'r19', 'cc', 'note', 'logan', 'xc_reihe', 'verso', 'v_klasse', 'jetta', 'c5', 'q5', 'getz', 'navara', 'x_type', 'clubman', 's60', '4_reihe', 'lybra', 'cherokee', 'q3', 'q7', 'roadster', 'captiva', 'cuore', '90', 'ptcruiser', 'kuga', 'x_trail', 'ypsilon', 'pajero', 'delta', 'rx_reihe', 'modus', 'sportage', 'calibra', 'sirion', '6er', 'gl', 'amarok', 'santa', '300c', 'antara', 'accord', 'crossfire', '159', 'niva', 'galant', 'legacy', 'defender', 'range_rover', 'juke', 'spark', 'lanos', 'v60', 'alhambra', 'discovery', 'range_rover_sport', 'range_rover_evoque', 'move', 'croma', 'g_klasse', 'lodgy', 'rangerover', 'samara', 'kappa', '9000', 'charade', 'i3', '200', 'b_max']

In [61]:

avg_price_non_damage = {} # create an empty dictionary
for x in models:
    selected_rows = autos.loc[(autos['model'] == x) & (autos['unrepaired_damage'] == 'no')] #select rows for each model without unrepaired damage
    price = selected_rows['price'].mean() # calculate mean price
    avg_price_non_damage[x] = price # add the price to dictionary
    
avg_price_non_damage = pd.Series(avg_price_non_damage).sort_values(ascending=False) #change dictionary to pd.series

In [62]:

non_damage = pd.DataFrame(avg_price_non_damage,columns=['price_non_damage']) #change pd.series to pd.DataFrame
non_damage

Out[62]:

	price_non_damage
911	74240.922481
range_rover_sport	35049.900000
gl	34478.000000
defender	33009.125000
m_reihe	28971.975610
...	...
lybra	659.800000
lanos	593.625000
charade	466.666667
materia	NaN
i3	NaN

243 rows × 1 columns

In [63]:

avg_price_damage = {} # create an empty dictionary
for x in models:
    selected_rows = autos.loc[(autos['model'] == x) & (autos['unrepaired_damage'] == 'yes')] #select rows for each model with unrepaired damage
    price = selected_rows['price'].mean() # calculate mean pruce
    avg_price_damage[x] = price # add mean price to dictionary
    
avg_price_damage = pd.Series(avg_price_damage).sort_values(ascending=False) # change dictionary to pd.Series

In [64]:

damage = pd.DataFrame(avg_price_damage,columns=['price_damage']) #change pd.series to pd.DataFrame
damage

Out[64]:

	price_damage
viano	24333.333333
a5	21649.800000
touareg	20933.333333
911	20833.333333
cayenne	17666.666667
...	...
rangerover	NaN
kappa	NaN
charade	NaN
200	NaN
b_max	NaN

243 rows × 1 columns

In [65]:

non_damage['price_damage'] = damage # add column with mean price from damged cars
non_damage.dropna(inplace=True) # drop rows with empty values, because we can't compare them
non_damage

Out[65]:

	price_non_damage	price_damage
911	74240.922481	20833.333333
defender	33009.125000	12245.000000
m_reihe	28971.975610	8900.000000
g_klasse	25956.736842	6375.000000
cayenne	25918.923077	17666.666667
...	...	...
cordoba	921.172414	500.000000
r19	759.800000	662.250000
move	683.166667	400.000000
lybra	659.800000	799.000000
lanos	593.625000	200.000000

201 rows × 2 columns

In [66]:

non_damage['price_non_damage'] = non_damage['price_non_damage'].astype(int) #change astype to int
non_damage['price_damage'] = non_damage['price_damage'].astype(int) #change astype to int

In [67]:

# create an extra column (difference) with diffrence in price
non_damage['difference'] = non_damage['price_non_damage'] - non_damage['price_damage']

#create an extra column (difference%) with diffrence in price show in percentages
non_damage['difference(%)'] = (1 - (non_damage['price_damage'] / non_damage['price_non_damage'])) * 100

In [68]:

non_damage.describe()

Out[68]:

	price_non_damage	price_damage	difference	difference(%)
count	201.000000	201.000000	201.000000	201.000000
mean	7210.945274	3412.980100	3797.965174	52.045420
std	7459.927049	3855.750868	5301.511094	23.212794
min	593.000000	195.000000	-4764.000000	-29.463789
25%	2910.000000	1089.000000	1255.000000	40.549020
50%	5141.000000	2285.000000	2426.000000	56.291074
75%	8916.000000	4016.000000	4615.000000	67.547642
max	74240.000000	24333.000000	53407.000000	93.131868

We can see that some cars are getting more expensive when damaged, lets have a closer look at them.

In [69]:

non_damage =  non_damage[non_damage['difference'] < 0] 
non_damage

Out[69]:

	price_non_damage	price_damage	difference	difference(%)
viano	20365	24333	-3968	-19.484410
touareg	16169	20933	-4764	-29.463789
cl	13229	14425	-1196	-9.040744
superb	9682	10639	-957	-9.884321
159	6659	7800	-1141	-17.134705
carnival	2178	2410	-232	-10.651974
lybra	659	799	-140	-21.244310

Damage of the cars has big influence on the price. On average the damaged cars are 52% cheaper. We can see that few of damaged counterparts are more expensvive, this beacause the price depednds on many diffrent factor.

Find the most common brand/model combinations¶

We will look for most common model for every brand.

In [70]:

# create a list of brands
brands = []
for x in autos['brand']:
    if x not in brands:
        brands.append(x)
print(brands)

['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'other', 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia']

In [71]:

# remove 'other' since it's not a brand
brands.remove('other') 
print(brands)

['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia']

In [72]:

dtop_model_name = {}
dtop_model_amount = {}
dtop_model_pct = {}

for x in brands:
    b = autos[autos['brand'] == x] # select rows for every brand
    top_model = b['model'].value_counts().head(1) # find the most common model
    top_model_string = top_model.to_string() # change astype to string
    top_model_name = top_model_string.split(' ')[0] #use split to get only name of the model
    top_model_pct = b.value_counts('model', normalize=True).head(1) # percentages
    dtop_model_name[x] = top_model_name # add the name to the dictionary
    dtop_model_amount[x] = int(top_model) # add total amount to the dictionary
    dtop_model_pct[x] = round(float(top_model_pct), 2) # add percentages round to 2 digits
    
top_model_by_brand = pd.Series(dtop_model_name) #change dictionary to series
top_model_by_brand_pct = pd.Series(dtop_model_pct) #change dictionary to series
top_model_by_brand_amount = pd.Series(dtop_model_amount) #change dictionary to series
final = pd.DataFrame(top_model_by_brand, columns=['Model']) # change seires to DataFrame
final['Total'] = top_model_by_brand_amount # add Total column
final['Percentage'] = top_model_by_brand_pct.astype(str) + '%' #add percentages columb 

final.sort_values('Total', ascending=False, inplace=True) # sort by total
final
    

Out[72]:

	Model	Total	Percentage
volkswagen	golf	3707	0.39%
bmw	3er	2615	0.53%
opel	corsa	1592	0.33%
audi	a4	1231	0.31%
mercedes_benz	c_klasse	1136	0.26%
ford	focus	762	0.24%
renault	twingo	615	0.29%
peugeot	2_reihe	600	0.46%
smart	fortwo	550	0.87%
fiat	punto	415	0.36%
seat	ibiza	328	0.4%
skoda	octavia	305	0.41%
mini	cooper	271	0.66%
nissan	micra	243	0.35%
citroen	other	219	0.35%
hyundai	i_reihe	174	0.4%
honda	civic	166	0.48%
mazda	3_reihe	164	0.24%
chevrolet	other	147	0.59%
porsche	911	141	0.5%
toyota	yaris	126	0.22%
suzuki	other	119	0.44%
mitsubishi	colt	112	0.31%
volvo	v70	91	0.22%
alfa_romeo	156	88	0.3%
kia	other	65	0.21%
daihatsu	cuore	63	0.55%
rover	other	55	0.93%
saab	other	53	0.72%
chrysler	other	52	0.34%
trabant	601	47	0.85%
dacia	sandero	44	0.36%
jeep	grand	43	0.43%
jaguar	other	39	0.54%
land_rover	freelander	31	0.32%
subaru	legacy	30	0.32%
daewoo	matiz	22	0.33%
lancia	ypsilon	19	0.4%
lada	niva	15	0.6%

The most popular model is Volkswagen Golf.
Almost 87% of all smart car are model fortwo

Conclusion¶

In this project we wanted to clean our data and than see how price of the cars changes depending on different factors.

For cleaning data it's important to change column format and rename some of them based on the data dictionary to be more descriptive, to drop columns that have only one value which makes them not important for our analyze, to change astype of columns for one more suitable and to translate data from a foreign language.

The analyze shows that the price is influenced by many factors. Most expensive brands cost around 8000$ and cheapest 3000$. Cars with low mileage are more expensive. Damaged cars are on average 52% cheaper. The price of the car depends on all above and other factors combined.

The most popular model within all brands is Volkswagen Golf.