Factors affecting value's of used cars.

In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, which you can find it here.

A few modifications have been made from the original dataset:

  • 50,000 data points were sampled from the full dataset
  • The dataset has been modified by Dataquest to be less clean.

The data dictionary provided with data is as follows:

  • dateCrawled - When this ad was first crawled. All field-values are taken from this date.
  • name - Name of the car.
  • seller - Whether the seller is private or a dealer.
  • offerType - The type of listing
  • price - The price on the ad to sell the car.
  • abtest - Whether the listing is included in an A/B test.
  • vehicleType - The vehicle Type.
  • yearOfRegistration - The year in which which year the car was first registered.
  • gearbox - The transmission type.
  • powerPS - The power of the car in PS.
  • model - The car model name.
  • kilometer - How many kilometers the car has driven.
  • monthOfRegistration - The month in which which year the car was first registered.
  • fuelType - What type of fuel the car uses.
  • brand - The brand of the car.
  • notRepairedDamage - If the car has a damage which is not yet repaired.
  • dateCreated - The date on which the eBay listing was created.
  • nrOfPictures - The number of pictures in the ad.
  • postalCode - The postal code for the location of the vehicle.
  • lastSeenOnline - When the crawler saw this ad last online.

The aim of this project is to clean the data and analyze the included used car listings.

Summary of results:

- price of the car depends on many factors(model, year, condition, car mileage, etc.)
- some brands are more expensive than other
- cars with lower mileage are more expensive
- damaged cars are on average 52% cheaper
- the most popular model is Volkswagen Golf

Let's start by importing the libraries we need and reading the dataset into pandas.

In [1]:
# import usefull libraries
import pandas as pd
import numpy as np

# read CSV file into pandas
autos = pd.read_csv('autos.csv', encoding = 'Latin-1') # use Latin-1 encoding

Lets quickly explore our data set.

In [2]:
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
Out[2]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

Our data sets has 20 columns and 50 000 rows, most of them are stored as strings and some as float. 6 categories contain null entries, but none have more than 20% null values. The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores.

Column names

Let's convert the column names from camelcase to snakecase and reword some of the column names based on the data dictionary to be more descriptive.

In [3]:
autos.columns # use .columns to print columns names
Out[3]:
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')
In [4]:
# modify column names
columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen']
In [5]:
# assigned new column names

autos.columns = columns

autos.head(3)
Out[5]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37

We've changed the column names from camelcase to snakecase and editted some of the columns names to be more descriptive.

Now let's do some basic data exploration to determine what other cleaning tasks need to be done.

In [6]:
autos.describe(include='all') # include='all' to get both categorical and numeric columns
Out[6]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-21 20:37:19 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

Data cleaning

We are going to clean our data in few steps:

1. Drop any colunms that have mostly one value. 

2. Check columns for their unique values.

3. Change some columns for numeric values.

4. Change some columns for datetime dtype.

5. Translate some data from german to english. 

1. Drop columns

We will drop seller and offer_type colunms since they have only one value and have a closer look at nr_of_pictures column.

In [7]:
autos['nr_of_pictures'].value_counts()
Out[7]:
0    50000
Name: nr_of_pictures, dtype: int64

Look like column nr_of_pictures has 0 as every value so we can also drop it.

In [8]:
autos.drop(['seller','offer_type', 'nr_of_pictures'], axis = 1)
Out[8]:
date_crawled name price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 39218 2016-04-01 14:38:50
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon $24,900 control limousine 2011 automatik 239 q5 100,000km 1 diesel audi nein 2016-03-27 00:00:00 82131 2016-04-01 13:47:40
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... $1,980 control cabrio 1996 manuell 75 astra 150,000km 5 benzin opel nein 2016-03-28 00:00:00 44807 2016-04-02 14:18:02
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge $13,200 test cabrio 2014 automatik 69 500 5,000km 11 benzin fiat nein 2016-04-02 00:00:00 73430 2016-04-04 11:47:27
49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition $22,900 control kombi 2013 manuell 150 a3 40,000km 11 diesel audi nein 2016-03-08 00:00:00 35683 2016-04-05 16:45:07
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V $1,250 control limousine 1996 manuell 101 vectra 150,000km 1 benzin opel nein 2016-03-13 00:00:00 45897 2016-04-06 21:18:48

50000 rows × 17 columns

2. Unique values

We will check vehicle_type, fuel_type and brand for unique values.

In [9]:
# Vechicle Type
autos['vehicle_type'].value_counts(dropna=False)
Out[9]:
limousine     12859
kleinwagen    10822
kombi          9127
NaN            5095
bus            4093
cabrio         3061
coupe          2537
suv            1986
andere          420
Name: vehicle_type, dtype: int64
In [10]:
# Fuel Type
autos['fuel_type'].value_counts(dropna=False)
Out[10]:
benzin     30107
diesel     14567
NaN         4482
lpg          691
cng           75
hybrid        37
andere        22
elektro       19
Name: fuel_type, dtype: int64
In [11]:
# Brand
autos['brand'].value_counts(dropna=False)
Out[11]:
volkswagen        10687
opel               5461
bmw                5429
mercedes_benz      4734
audi               4283
ford               3479
renault            2404
peugeot            1456
fiat               1308
seat                941
skoda               786
mazda               757
nissan              754
smart               701
citroen             701
toyota              617
sonstige_autos      546
hyundai             488
volvo               457
mini                424
mitsubishi          406
honda               399
kia                 356
alfa_romeo          329
porsche             294
suzuki              293
chevrolet           283
chrysler            181
dacia               129
daihatsu            128
jeep                110
subaru              109
land_rover           99
saab                 80
daewoo               79
trabant              78
jaguar               77
rover                69
lancia               57
lada                 31
Name: brand, dtype: int64

Everything looks good here. Let's change some columns to numeric values.

3. Numeric values

We will change price and odometer for numeric values.

First we will delete '$' and ',' from price column and than change dtype for numeric.

In [12]:
autos['price'] = (autos['price']
                  .str.replace('$','') # delete $
                  .str.replace(',','') # delete ,
                  .astype(int) # convert column to numeric dtype
                 )
In [13]:
autos['price'].head()
Out[13]:
0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64

Let's do the same for odometer column.

In [14]:
autos['odometer'] = (autos['odometer'].str.replace('km','') # delete 'km'
                     .str.replace(',','') # delete ,
                     .astype(int) # change dtype to int
                    )
In [15]:
autos['odometer'].head()
Out[15]:
0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer, dtype: int64
In [16]:
#change column name to 'odometer_km'
autos.rename(columns={'odometer': 'odometer_km'}, inplace=True)

Let's continue exploring the data, specifically looking for data that doesn't look right. We'll start by analyzing the odometer_km, price and registration_year columns. We will analyze the columns using minimum and maximum values and look for any values that look unrealistically high or low (outliers) that we might want to remove.

odometer_km

In [17]:
# see how many unique values
autos['odometer_km'].unique().shape
Out[17]:
(13,)
In [18]:
# see min/max/median/mean etc
autos['odometer_km'].describe()
Out[18]:
count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
In [19]:
# see the highest and lowest values with their counts
autos['odometer_km'].value_counts().sort_index()
Out[19]:
5000        967
10000       264
20000       784
30000       789
40000       819
50000      1027
60000      1164
70000      1230
80000      1436
90000      1757
100000     2169
125000     5170
150000    32424
Name: odometer_km, dtype: int64

Everything looks good for this column.

price

In [20]:
# see how many unique values
autos['price'].unique().shape
Out[20]:
(2357,)
In [21]:
# see min/max/median/mean etc
autos['price'].describe()
Out[21]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
In [22]:
# see the lowest values with their counts
autos['price'].value_counts().sort_index().head(5)
Out[22]:
0    1421
1     156
2       3
3       1
5       2
Name: price, dtype: int64
In [23]:
# see the highest values with their counts
autos['price'].value_counts().sort_index().tail(15)
Out[23]:
265000      1
295000      1
299000      1
345000      1
350000      1
999990      1
999999      2
1234566     1
1300000     1
3890000     1
10000000    1
11111111    2
12345678    3
27322222    1
99999999    1
Name: price, dtype: int64

We can see many prices that doesn't look realistic. There are over 1400 cars listed for free. We have also some with a price like '1234566' or '11111111'.

Let's have a closer look at listings with price over 350 000.

In [24]:
autos[autos['price'] > 350000].sort_values('price')
Out[24]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
37585 2016-03-29 11:38:54 Volkswagen_Jetta_GT privat Angebot 999990 test limousine 1985 manuell 111 jetta 150000 12 benzin volkswagen ja 2016-03-29 00:00:00 0 50997 2016-03-29 11:38:54
514 2016-03-17 09:53:08 Ford_Focus_Turnier_1.6_16V_Style privat Angebot 999999 test kombi 2009 manuell 101 focus 125000 4 benzin ford nein 2016-03-17 00:00:00 0 12205 2016-04-06 07:17:35
43049 2016-03-21 19:53:52 2_VW_Busse_T3 privat Angebot 999999 test bus 1981 manuell 70 transporter 150000 1 benzin volkswagen NaN 2016-03-21 00:00:00 0 99880 2016-03-28 17:18:28
22947 2016-03-22 12:54:19 Bmw_530d_zum_ausschlachten privat Angebot 1234566 control kombi 1999 automatik 190 NaN 150000 2 diesel bmw NaN 2016-03-22 00:00:00 0 17454 2016-04-02 03:17:32
7814 2016-04-04 11:53:31 Ferrari_F40 privat Angebot 1300000 control coupe 1992 NaN 0 NaN 50000 12 NaN sonstige_autos nein 2016-04-04 00:00:00 0 60598 2016-04-05 11:34:11
47634 2016-04-04 21:25:21 Ferrari_FXX privat Angebot 3890000 test coupe 2006 NaN 799 NaN 5000 7 NaN sonstige_autos nein 2016-04-04 00:00:00 0 60313 2016-04-05 12:07:37
11137 2016-03-29 23:52:57 suche_maserati_3200_gt_Zustand_unwichtig_laufe... privat Angebot 10000000 control coupe 1960 manuell 368 NaN 100000 1 benzin sonstige_autos nein 2016-03-29 00:00:00 0 73033 2016-04-06 21:18:11
2897 2016-03-12 21:50:57 Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000 privat Angebot 11111111 test limousine 1973 manuell 48 escort 50000 3 benzin ford nein 2016-03-12 00:00:00 0 94469 2016-03-12 22:45:27
24384 2016-03-21 13:57:51 Schlachte_Golf_3_gt_tdi privat Angebot 11111111 test NaN 1995 NaN 0 NaN 150000 0 NaN volkswagen NaN 2016-03-21 00:00:00 0 18519 2016-03-21 14:40:18
27371 2016-03-09 15:45:47 Fiat_Punto privat Angebot 12345678 control NaN 2017 NaN 95 punto 150000 0 NaN fiat NaN 2016-03-09 00:00:00 0 96110 2016-03-09 15:45:47
39377 2016-03-08 23:53:51 Tausche_volvo_v40_gegen_van privat Angebot 12345678 control NaN 2018 manuell 95 v40 150000 6 NaN volvo nein 2016-03-08 00:00:00 0 14542 2016-04-06 23:17:31
47598 2016-03-31 18:56:54 Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar... privat Angebot 12345678 control limousine 2001 manuell 101 vectra 150000 3 benzin opel nein 2016-03-31 00:00:00 0 4356 2016-03-31 18:56:54
42221 2016-03-08 20:39:05 Leasinguebernahme privat Angebot 27322222 control limousine 2014 manuell 163 c4 40000 2 diesel citroen NaN 2016-03-08 00:00:00 0 76532 2016-03-08 20:39:05
39705 2016-03-22 14:58:27 Tausch_gegen_gleichwertiges privat Angebot 99999999 control limousine 1999 automatik 224 s_klasse 150000 9 benzin mercedes_benz NaN 2016-03-22 00:00:00 0 73525 2016-04-06 05:15:30

Most of this cars have unrealistic price excpet of 3 models(2x Ferrari and Masseratti) from 1 300 000 to 10 000 000. So we are going to delete all cars listed for free and over 350 000, but we are going to keep this 3 more expensive models.

In [25]:
autos = autos[(((autos["price"] >= 1 ) & (autos["price"] <= 350000 )) | ((autos["price"] >= 1300000 ) & (autos["price"] <= 10000000 )))]


autos['price'].value_counts().sort_index()
Out[25]:
1           156
2             3
3             1
5             2
8             1
           ... 
345000        1
350000        1
1300000       1
3890000       1
10000000      1
Name: price, Length: 2349, dtype: int64

registration_year

In [26]:
# see how many unique values
autos['registration_year'].unique().shape
Out[26]:
(95,)
In [27]:
# see min/max/median/mean etc
autos['registration_year'].describe()
Out[27]:
count    48568.000000
mean      2004.754262
std         88.641401
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
In [28]:
# see the highest and lowest values with their counts
autos['registration_year'].value_counts().sort_index()
Out[28]:
1000    1
1001    1
1111    1
1800    2
1910    5
       ..
5911    1
6200    1
8888    1
9000    1
9999    3
Name: registration_year, Length: 95, dtype: int64

Since the car couldn't be registered before 1900 and after 2016 which is year where this listings were added we will remove this outliers.

In [29]:
autos = autos[autos["registration_year"].between(1900,2016)]

4. Date values

Let's now move on to the date columns and understand the date range the data covers. There are 5 columns that should represent date values.

Right now, the date_crawled, last_seen, and ad_created columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a astype datetime representation so we can understand it quantitatively. The other two registration_month and registration_year columns are represented as numeric values, so we can use methods like Series.describe() to understand the distribution without any extra data processing.

Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values, like so:

In [30]:
autos[['date_crawled','ad_created','last_seen']][0:5]
Out[30]:
date_crawled ad_created last_seen
0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54
1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08
2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37
3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28
4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50

We can notice that a date is in a format YYYY-MM-DD HH-MM-SS. We are going to use pd.to_datetime to change astype to datetime and we will change format to YYYY-MM-DD.

date_crawled

In [31]:
autos['date_crawled'] = pd.to_datetime(autos['date_crawled'], format="%Y-%m-%d %H:%M:%S") # change dtype to datetime
date_crawled = autos['date_crawled'].dt.strftime('%Y-%m-%d') # change format to YYYY-MM-DD
In [32]:
date_crawled.value_counts(normalize=True,dropna=False).sort_index() #normalize and dropna to inluce missing values and use percentages
Out[32]:
2016-03-05    0.025191
2016-03-06    0.014159
2016-03-07    0.036244
2016-03-08    0.033545
2016-03-09    0.033245
2016-03-10    0.032238
2016-03-11    0.032452
2016-03-12    0.036822
2016-03-13    0.015873
2016-03-14    0.036329
2016-03-15    0.034359
2016-03-16    0.029496
2016-03-17    0.031788
2016-03-18    0.012810
2016-03-19    0.034659
2016-03-20    0.038022
2016-03-21    0.037315
2016-03-22    0.032838
2016-03-23    0.032195
2016-03-24    0.029475
2016-03-25    0.031510
2016-03-26    0.032067
2016-03-27    0.030781
2016-03-28    0.034594
2016-03-29    0.034123
2016-03-30    0.033802
2016-03-31    0.031788
2016-04-01    0.033802
2016-04-02    0.035558
2016-04-03    0.038771
2016-04-04    0.036651
2016-04-05    0.013002
2016-04-06    0.003085
2016-04-07    0.001414
Name: date_crawled, dtype: float64

The adverts were first crawled between 5th of March and 7th of April 2016.

ad_created

In [33]:
autos['ad_created'] = pd.to_datetime(autos['ad_created'], format="%Y-%m-%d %H:%M:%S")
ad_created = autos['ad_created'].dt.strftime("%Y-%m") # keep only the year and month 
In [34]:
ad_created.value_counts(normalize=True,dropna=False).sort_index() #normalize and dropna to inluce missing values and use percentages
Out[34]:
2015-06    0.000021
2015-08    0.000021
2015-09    0.000021
2015-11    0.000021
2015-12    0.000043
2016-01    0.000193
2016-02    0.001307
2016-03    0.837139
2016-04    0.161233
Name: ad_created, dtype: float64

The eBay listings were created between June 2015 and April 2016. But 99% of the ads were created in March and April 2016. Most adds were creater in March 2016 - 83,7%

last_seen

In [35]:
autos['last_seen'] = pd.to_datetime(autos['last_seen'], format="%Y-%m-%d %H:%M:%S")
last_seen = autos['last_seen'].dt.strftime("%Y-%m-%d")
In [36]:
last_seen.value_counts(normalize=True,dropna=False).sort_index() #normalize and dropna to inluce missing values and use percentages
Out[36]:
2016-03-05    0.001071
2016-03-06    0.004113
2016-03-07    0.005377
2016-03-08    0.007476
2016-03-09    0.009768
2016-03-10    0.010689
2016-03-11    0.012381
2016-03-12    0.023755
2016-03-13    0.008654
2016-03-14    0.012660
2016-03-15    0.016001
2016-03-16    0.016280
2016-03-17    0.028082
2016-03-18    0.007219
2016-03-19    0.015616
2016-03-20    0.020628
2016-03-21    0.020585
2016-03-22    0.020842
2016-03-23    0.018357
2016-03-24    0.019686
2016-03-25    0.018936
2016-03-26    0.016794
2016-03-27    0.015637
2016-03-28    0.020692
2016-03-29    0.022085
2016-03-30    0.024612
2016-03-31    0.023627
2016-04-01    0.022941
2016-04-02    0.024655
2016-04-03    0.025148
2016-04-04    0.024120
2016-04-05    0.125439
2016-04-06    0.223331
2016-04-07    0.132744
Name: last_seen, dtype: float64

The crawler saw this ads between 3rd of May and 7th of April 2016.

registration_year

In [37]:
autos['registration_year'].describe()
Out[37]:
count    46684.000000
mean      2002.909669
std          7.187808
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64

Cars were first registered between 1910 and 2016.

We have cleared our data set, now lets analyze them.

5. Translate to english.

We are going to translate data from this columns: vehicle_type, gearbox, model, fuel_type, brand and unrepaired_damage from german to english. First we will check unique values for every column, translate them and than map the values to their english counterparts.

vehicle_type

In [38]:
print(pd.unique(autos['vehicle_type']))
['bus' 'limousine' 'kleinwagen' 'kombi' nan 'coupe' 'suv' 'cabrio'
 'andere']
In [39]:
# create dictionary with translations
vt_dict = {
    'limousine': 'limousine',
    'kleinwagen': 'small_car',
    'kombi': 'combi',
    'bus': 'bus',
    'cabrio': 'convertible',
    'coupe': 'coupe',
    'suv': 'suv',
    'andere': 'other'
}

autos['vehicle_type'] = autos['vehicle_type'].map(vt_dict) # map the new values
print(pd.unique(autos['vehicle_type'])) # check new values
['bus' 'limousine' 'small_car' 'combi' nan 'coupe' 'suv' 'convertible'
 'other']

gearbox

In [40]:
print(pd.unique(autos['gearbox']))
['manuell' 'automatik' nan]
In [41]:
# create dictionary with translations
gb_dict = {
    'manuell': 'manual',
    'automatik': 'automatic'
}
autos['gearbox'] = autos['gearbox'].map(gb_dict) # map the new values
print(pd.unique(autos['gearbox'])) # check new values
['manual' 'automatic' nan]

fuel_type

In [42]:
print(pd.unique(autos['fuel_type']))
['lpg' 'benzin' 'diesel' nan 'cng' 'hybrid' 'elektro' 'andere']
In [43]:
# create dictionary with translations
ft_dict = {
    'benzin': 'petrol',
    'diesel': 'diesel',
    'lpg': 'lpg',
    'cng': 'cng',
    'hybrid': 'hybrid',
    'elektro': 'electric',
    'andere': 'other'
}
autos['fuel_type'] = autos['fuel_type'].map(ft_dict) # map the new values
print(pd.unique(autos['fuel_type'])) # check new values
['lpg' 'petrol' 'diesel' nan 'cng' 'hybrid' 'electric' 'other']

unrepaired_damage

In [44]:
print(pd.unique(autos['unrepaired_damage']))
['nein' nan 'ja']
In [45]:
# create dictionary with translations
ud_dict = {
    'ja': 'yes',
    'nein': 'no'
}
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(ud_dict) # map the new values
print(pd.unique(autos['unrepaired_damage'])) # check new values
['no' nan 'yes']

model

In [46]:
print(pd.unique(autos['model']))
['andere' '7er' 'golf' 'fortwo' 'focus' 'voyager' 'arosa' 'megane' nan
 'a3' 'clio' 'vectra' 'scirocco' '3er' 'a4' '911' 'cooper' '5er' 'polo'
 'e_klasse' '2_reihe' 'c_klasse' 'corsa' 'mondeo' 'altea' 'a1' 'twingo'
 'a_klasse' 'cl' '3_reihe' 's_klasse' 'sandero' 'passat' 'primera'
 'wrangler' 'a6' 'transporter' 'astra' 'v40' 'ibiza' 'micra' '1er' 'yaris'
 'colt' '6_reihe' '5_reihe' 'corolla' 'ka' 'tigra' 'punto' 'vito'
 'cordoba' 'galaxy' '100' 'octavia' 'm_klasse' 'lupo' 'fiesta' 'superb'
 'meriva' 'c_max' 'laguna' 'touran' '1_reihe' 'm_reihe' 'touareg'
 'seicento' 'avensis' 'vivaro' 'x_reihe' 'ducato' 'carnival' 'boxster'
 'signum' 'sharan' 'zafira' 'rav' 'a5' 'beetle' 'c_reihe' 'phaeton'
 'i_reihe' 'sl' 'insignia' 'up' 'civic' '80' 'mx_reihe' 'omega' 'sorento'
 'z_reihe' 'berlingo' 'clk' 's_max' 'kalos' 'cx_reihe' 'grand' 'swift'
 'tiguan' 'sprinter' 'mii' 'viano' 'kaefer' 'almera' 'picanto' 'espace'
 'scenic' 'one' 'bora' 'fox' 'leon' 'transit' 'tucson' 'tt' 'qashqai'
 'mustang' 'jazz' 'nubira' 'v70' 'duster' 'stilo' 'justy' 'kangoo' 'fabia'
 'matiz' 'freelander' 'lancer' 'forester' 's_type' 'bravo' '500' '156'
 'cr_reihe' 'toledo' 'panda' 'a8' 'ceed' 'caddy' 'aygo' 'citigo' 'exeo'
 'c1' 'doblo' 'impreza' '147' 'agila' '145' 'c3' 'roomster' 'auris' '601'
 'eos' 'aveo' 'slk' 'terios' 'fusion' 'c4' 'glk' 'materia' 'v50' 'combo'
 'yeti' '900' '850' 'outlander' 'rio' 'escort' 'c2' 'jimny' 'forfour' 'a2'
 'spider' 'kadett' 'b_klasse' 'carisma' 'cayenne' 'r19' 'cc' 'note'
 'logan' 'xc_reihe' 'verso' 'v_klasse' 'jetta' 'c5' 'q5' 'getz' 'navara'
 'x_type' 'clubman' 's60' '4_reihe' 'lybra' 'cherokee' 'q3' 'q7'
 'roadster' 'captiva' 'cuore' '90' 'ptcruiser' 'kuga' 'x_trail' 'ypsilon'
 'pajero' 'delta' 'rx_reihe' 'modus' 'sportage' 'calibra' 'sirion' '6er'
 'gl' 'amarok' 'santa' '300c' 'antara' 'accord' 'crossfire' '159' 'niva'
 'galant' 'legacy' 'defender' 'range_rover' 'juke' 'spark' 'lanos' 'v60'
 'alhambra' 'discovery' 'range_rover_sport' 'range_rover_evoque' 'move'
 'croma' 'g_klasse' 'lodgy' 'rangerover' 'samara' 'kappa' '9000' 'charade'
 'i3' '200' 'b_max']

Since we have alot unique values and we want translate only one type andere, we will use str.replace().

In [47]:
autos['model'] = autos['model'].str.replace('andere','other')
print(pd.unique(autos['model']))
['other' '7er' 'golf' 'fortwo' 'focus' 'voyager' 'arosa' 'megane' nan 'a3'
 'clio' 'vectra' 'scirocco' '3er' 'a4' '911' 'cooper' '5er' 'polo'
 'e_klasse' '2_reihe' 'c_klasse' 'corsa' 'mondeo' 'altea' 'a1' 'twingo'
 'a_klasse' 'cl' '3_reihe' 's_klasse' 'sandero' 'passat' 'primera'
 'wrangler' 'a6' 'transporter' 'astra' 'v40' 'ibiza' 'micra' '1er' 'yaris'
 'colt' '6_reihe' '5_reihe' 'corolla' 'ka' 'tigra' 'punto' 'vito'
 'cordoba' 'galaxy' '100' 'octavia' 'm_klasse' 'lupo' 'fiesta' 'superb'
 'meriva' 'c_max' 'laguna' 'touran' '1_reihe' 'm_reihe' 'touareg'
 'seicento' 'avensis' 'vivaro' 'x_reihe' 'ducato' 'carnival' 'boxster'
 'signum' 'sharan' 'zafira' 'rav' 'a5' 'beetle' 'c_reihe' 'phaeton'
 'i_reihe' 'sl' 'insignia' 'up' 'civic' '80' 'mx_reihe' 'omega' 'sorento'
 'z_reihe' 'berlingo' 'clk' 's_max' 'kalos' 'cx_reihe' 'grand' 'swift'
 'tiguan' 'sprinter' 'mii' 'viano' 'kaefer' 'almera' 'picanto' 'espace'
 'scenic' 'one' 'bora' 'fox' 'leon' 'transit' 'tucson' 'tt' 'qashqai'
 'mustang' 'jazz' 'nubira' 'v70' 'duster' 'stilo' 'justy' 'kangoo' 'fabia'
 'matiz' 'freelander' 'lancer' 'forester' 's_type' 'bravo' '500' '156'
 'cr_reihe' 'toledo' 'panda' 'a8' 'ceed' 'caddy' 'aygo' 'citigo' 'exeo'
 'c1' 'doblo' 'impreza' '147' 'agila' '145' 'c3' 'roomster' 'auris' '601'
 'eos' 'aveo' 'slk' 'terios' 'fusion' 'c4' 'glk' 'materia' 'v50' 'combo'
 'yeti' '900' '850' 'outlander' 'rio' 'escort' 'c2' 'jimny' 'forfour' 'a2'
 'spider' 'kadett' 'b_klasse' 'carisma' 'cayenne' 'r19' 'cc' 'note'
 'logan' 'xc_reihe' 'verso' 'v_klasse' 'jetta' 'c5' 'q5' 'getz' 'navara'
 'x_type' 'clubman' 's60' '4_reihe' 'lybra' 'cherokee' 'q3' 'q7'
 'roadster' 'captiva' 'cuore' '90' 'ptcruiser' 'kuga' 'x_trail' 'ypsilon'
 'pajero' 'delta' 'rx_reihe' 'modus' 'sportage' 'calibra' 'sirion' '6er'
 'gl' 'amarok' 'santa' '300c' 'antara' 'accord' 'crossfire' '159' 'niva'
 'galant' 'legacy' 'defender' 'range_rover' 'juke' 'spark' 'lanos' 'v60'
 'alhambra' 'discovery' 'range_rover_sport' 'range_rover_evoque' 'move'
 'croma' 'g_klasse' 'lodgy' 'rangerover' 'samara' 'kappa' '9000' 'charade'
 'i3' '200' 'b_max']

brand

In [48]:
print(pd.unique(autos['brand']))
['peugeot' 'bmw' 'volkswagen' 'smart' 'ford' 'chrysler' 'seat' 'renault'
 'mercedes_benz' 'audi' 'sonstige_autos' 'opel' 'mazda' 'porsche' 'mini'
 'toyota' 'dacia' 'nissan' 'jeep' 'saab' 'volvo' 'mitsubishi' 'jaguar'
 'fiat' 'skoda' 'subaru' 'kia' 'citroen' 'chevrolet' 'hyundai' 'honda'
 'daewoo' 'suzuki' 'trabant' 'land_rover' 'alfa_romeo' 'lada' 'rover'
 'daihatsu' 'lancia']
In [49]:
autos['brand'] = autos['brand'].str.replace('sonstige_autos','other')
print(pd.unique(autos['brand']))
['peugeot' 'bmw' 'volkswagen' 'smart' 'ford' 'chrysler' 'seat' 'renault'
 'mercedes_benz' 'audi' 'other' 'opel' 'mazda' 'porsche' 'mini' 'toyota'
 'dacia' 'nissan' 'jeep' 'saab' 'volvo' 'mitsubishi' 'jaguar' 'fiat'
 'skoda' 'subaru' 'kia' 'citroen' 'chevrolet' 'hyundai' 'honda' 'daewoo'
 'suzuki' 'trabant' 'land_rover' 'alfa_romeo' 'lada' 'rover' 'daihatsu'
 'lancia']

Analyze Data

Mean price by brand.

First we will calculate the mean price for every brand. We can use aggregation to understand the brand column.

In [50]:
autos['brand'].value_counts().sort_index # count values in 'brand' column
Out[50]:
<bound method Series.sort_index of volkswagen       9862
bmw              5137
opel             5022
mercedes_benz    4503
audi             4041
ford             3263
renault          2201
peugeot          1393
fiat             1197
seat              853
skoda             766
nissan            713
mazda             709
smart             661
citroen           654
toyota            593
hyundai           468
other             461
volvo             427
mini              409
mitsubishi        384
honda             366
kia               330
alfa_romeo        310
porsche           286
suzuki            277
chevrolet         266
chrysler          164
dacia             123
daihatsu          117
jeep              106
subaru            100
land_rover         98
saab               77
jaguar             73
daewoo             70
trabant            65
rover              62
lancia             50
lada               27
Name: brand, dtype: int64>

We have 40 diffrent car brands in our data. The most popular brand is Volkswagen.\ Will will continue analyze only for the 10 most popular brands.

In [51]:
top10 = autos['brand'].value_counts().index[:10] #names of Top10 brands
In [52]:
mean_price = {} #create dictionary for mean price
for brand in top10:
    selected_rows = autos[autos['brand'] == brand] #select rows for every brand
    price = selected_rows['price'].mean() # calculate mean price
    mean_price[brand] = int(price) #add mean price to dictionary as int.

mean_price = pd.Series(mean_price).sort_values(ascending=False) # change dictionary to pandas.series and sort it. 
In [53]:
brand_price = pd.DataFrame(mean_price, columns=["mean_price"]) # change pd.series to pd.DataFrame
brand_price
Out[53]:
mean_price
audi 9336
mercedes_benz 8628
bmw 8332
volkswagen 5402
seat 4397
ford 3749
peugeot 3094
opel 2975
fiat 2813
renault 2474

We observed that in the top 10 brands, there's a distinct price gap.

  • Audi, Mercedes Benz and BMW are more expensive
  • Fiat, Peugeot, Reanult and Opel are less expensive
  • Volkswagen, Ford and Seat are in between

For the top 10 brands, let's use aggregation to understand the average mileage for those cars and if there's any visible link with mean price.

Mean car mileage by brand.

In [54]:
mean_mileage = {} #create dictionary for mean mileage
for brand in top10:
    selected_rows = autos[autos['brand'] == brand] #select rows for every brand
    mileage = selected_rows['odometer_km'].mean() # calculate mean mileagae
    mean_mileage[brand] = int(mileage) #add mean mileage to dictionary as int.
    
mean_mileage = pd.Series(mean_mileage).sort_values(ascending=False)  # change dictionary to pandas.series and sort it. 
In [55]:
brand_mileage = pd.DataFrame(mean_mileage,columns=['mean_mileage']) # change pd.series to pd.DataFrame
brand_mileage
Out[55]:
mean_mileage
bmw 132572
mercedes_benz 130788
opel 129310
audi 129157
volkswagen 128707
renault 128071
peugeot 127153
ford 124266
seat 121131
fiat 117121

BMW cars have the highest car mileage.\ Now we will check if the price of each brand depends on car mileage.

In [56]:
brand_price['mean_mileage'] = brand_mileage # combine to DataFrames 
brand_price
Out[56]:
mean_price mean_mileage
audi 9336 129157
mercedes_benz 8628 130788
bmw 8332 132572
volkswagen 5402 128707
seat 4397 121131
ford 3749 124266
peugeot 3094 127153
opel 2975 129310
fiat 2813 117121
renault 2474 128071

We cannot see a direct correlation between price and mileage. The price of the car is depending on many factors like: model, year, condition and mileage. We can conclude that the price is influenced by mileage but not depending on.

Average price per mileage

We will see how the car mileage influence the price.

In [57]:
### create a list of mileage
odometer = []
for x in autos['odometer_km']:
    if x not in odometer:
        odometer.append(x)
print(odometer)
[150000, 70000, 50000, 80000, 10000, 30000, 125000, 90000, 20000, 60000, 5000, 40000, 100000]
In [58]:
avg_price = {} # create empty dictionary
for x in odometer:
    selected_rows = autos[autos['odometer_km'] == x] #select rows
    price = selected_rows['price'].mean() # calculate mean
    avg_price[x] = int(price) # add price to dictionary
    
avg_price = pd.Series(avg_price).sort_values(ascending=False) # change dictionary to pandas.series and sort it.
In [59]:
odo_price = pd.DataFrame(avg_price,columns=['price']) # change pd.series to pd.DataFrame
odo_price
Out[59]:
price
10000 20550
20000 18448
30000 16608
40000 15499
50000 15106
5000 13811
100000 12985
60000 12385
70000 10927
80000 9721
90000 8465
125000 6214
150000 3767

Car price depends of their car mileage. We can see that cars with smaller mileage are mostly more expensive.\ This data shows that the conclusion we made before is correct.

How much cheaper are cars with damage than their non-damaged counterparts?

We will investigate how much cheaper are same models of cars when they damaged.\ First we will calculate average price for every model not damaged and damaged. Than we will compare both and see the diffrence.

In [60]:
### create a list of models
models = []
for x in autos['model'].dropna(): # select rows for each model and drop NaN values
    if x not in models:
        models.append(x)

models.remove('other') # remove 'other' form the list 
print(models)
['7er', 'golf', 'fortwo', 'focus', 'voyager', 'arosa', 'megane', 'a3', 'clio', 'vectra', 'scirocco', '3er', 'a4', '911', 'cooper', '5er', 'polo', 'e_klasse', '2_reihe', 'c_klasse', 'corsa', 'mondeo', 'altea', 'a1', 'twingo', 'a_klasse', 'cl', '3_reihe', 's_klasse', 'sandero', 'passat', 'primera', 'wrangler', 'a6', 'transporter', 'astra', 'v40', 'ibiza', 'micra', '1er', 'yaris', 'colt', '6_reihe', '5_reihe', 'corolla', 'ka', 'tigra', 'punto', 'vito', 'cordoba', 'galaxy', '100', 'octavia', 'm_klasse', 'lupo', 'fiesta', 'superb', 'meriva', 'c_max', 'laguna', 'touran', '1_reihe', 'm_reihe', 'touareg', 'seicento', 'avensis', 'vivaro', 'x_reihe', 'ducato', 'carnival', 'boxster', 'signum', 'sharan', 'zafira', 'rav', 'a5', 'beetle', 'c_reihe', 'phaeton', 'i_reihe', 'sl', 'insignia', 'up', 'civic', '80', 'mx_reihe', 'omega', 'sorento', 'z_reihe', 'berlingo', 'clk', 's_max', 'kalos', 'cx_reihe', 'grand', 'swift', 'tiguan', 'sprinter', 'mii', 'viano', 'kaefer', 'almera', 'picanto', 'espace', 'scenic', 'one', 'bora', 'fox', 'leon', 'transit', 'tucson', 'tt', 'qashqai', 'mustang', 'jazz', 'nubira', 'v70', 'duster', 'stilo', 'justy', 'kangoo', 'fabia', 'matiz', 'freelander', 'lancer', 'forester', 's_type', 'bravo', '500', '156', 'cr_reihe', 'toledo', 'panda', 'a8', 'ceed', 'caddy', 'aygo', 'citigo', 'exeo', 'c1', 'doblo', 'impreza', '147', 'agila', '145', 'c3', 'roomster', 'auris', '601', 'eos', 'aveo', 'slk', 'terios', 'fusion', 'c4', 'glk', 'materia', 'v50', 'combo', 'yeti', '900', '850', 'outlander', 'rio', 'escort', 'c2', 'jimny', 'forfour', 'a2', 'spider', 'kadett', 'b_klasse', 'carisma', 'cayenne', 'r19', 'cc', 'note', 'logan', 'xc_reihe', 'verso', 'v_klasse', 'jetta', 'c5', 'q5', 'getz', 'navara', 'x_type', 'clubman', 's60', '4_reihe', 'lybra', 'cherokee', 'q3', 'q7', 'roadster', 'captiva', 'cuore', '90', 'ptcruiser', 'kuga', 'x_trail', 'ypsilon', 'pajero', 'delta', 'rx_reihe', 'modus', 'sportage', 'calibra', 'sirion', '6er', 'gl', 'amarok', 'santa', '300c', 'antara', 'accord', 'crossfire', '159', 'niva', 'galant', 'legacy', 'defender', 'range_rover', 'juke', 'spark', 'lanos', 'v60', 'alhambra', 'discovery', 'range_rover_sport', 'range_rover_evoque', 'move', 'croma', 'g_klasse', 'lodgy', 'rangerover', 'samara', 'kappa', '9000', 'charade', 'i3', '200', 'b_max']
In [61]:
avg_price_non_damage = {} # create an empty dictionary
for x in models:
    selected_rows = autos.loc[(autos['model'] == x) & (autos['unrepaired_damage'] == 'no')] #select rows for each model without unrepaired damage
    price = selected_rows['price'].mean() # calculate mean price
    avg_price_non_damage[x] = price # add the price to dictionary
    
avg_price_non_damage = pd.Series(avg_price_non_damage).sort_values(ascending=False) #change dictionary to pd.series
In [62]:
non_damage = pd.DataFrame(avg_price_non_damage,columns=['price_non_damage']) #change pd.series to pd.DataFrame
non_damage
Out[62]:
price_non_damage
911 74240.922481
range_rover_sport 35049.900000
gl 34478.000000
defender 33009.125000
m_reihe 28971.975610
... ...
lybra 659.800000
lanos 593.625000
charade 466.666667
materia NaN
i3 NaN

243 rows × 1 columns

In [63]:
avg_price_damage = {} # create an empty dictionary
for x in models:
    selected_rows = autos.loc[(autos['model'] == x) & (autos['unrepaired_damage'] == 'yes')] #select rows for each model with unrepaired damage
    price = selected_rows['price'].mean() # calculate mean pruce
    avg_price_damage[x] = price # add mean price to dictionary
    
avg_price_damage = pd.Series(avg_price_damage).sort_values(ascending=False) # change dictionary to pd.Series
In [64]:
damage = pd.DataFrame(avg_price_damage,columns=['price_damage']) #change pd.series to pd.DataFrame
damage
Out[64]:
price_damage
viano 24333.333333
a5 21649.800000
touareg 20933.333333
911 20833.333333
cayenne 17666.666667
... ...
rangerover NaN
kappa NaN
charade NaN
200 NaN
b_max NaN

243 rows × 1 columns

In [65]:
non_damage['price_damage'] = damage # add column with mean price from damged cars
non_damage.dropna(inplace=True) # drop rows with empty values, because we can't compare them
non_damage
Out[65]:
price_non_damage price_damage
911 74240.922481 20833.333333
defender 33009.125000 12245.000000
m_reihe 28971.975610 8900.000000
g_klasse 25956.736842 6375.000000
cayenne 25918.923077 17666.666667
... ... ...
cordoba 921.172414 500.000000
r19 759.800000 662.250000
move 683.166667 400.000000
lybra 659.800000 799.000000
lanos 593.625000 200.000000

201 rows × 2 columns

In [66]:
non_damage['price_non_damage'] = non_damage['price_non_damage'].astype(int) #change astype to int
non_damage['price_damage'] = non_damage['price_damage'].astype(int) #change astype to int
In [67]:
# create an extra column (difference) with diffrence in price
non_damage['difference'] = non_damage['price_non_damage'] - non_damage['price_damage']

#create an extra column (difference%) with diffrence in price show in percentages
non_damage['difference(%)'] = (1 - (non_damage['price_damage'] / non_damage['price_non_damage'])) * 100
In [68]:
non_damage.describe()
Out[68]:
price_non_damage price_damage difference difference(%)
count 201.000000 201.000000 201.000000 201.000000
mean 7210.945274 3412.980100 3797.965174 52.045420
std 7459.927049 3855.750868 5301.511094 23.212794
min 593.000000 195.000000 -4764.000000 -29.463789
25% 2910.000000 1089.000000 1255.000000 40.549020
50% 5141.000000 2285.000000 2426.000000 56.291074
75% 8916.000000 4016.000000 4615.000000 67.547642
max 74240.000000 24333.000000 53407.000000 93.131868

We can see that some cars are getting more expensive when damaged, lets have a closer look at them.

In [69]:
non_damage =  non_damage[non_damage['difference'] < 0] 
non_damage
Out[69]:
price_non_damage price_damage difference difference(%)
viano 20365 24333 -3968 -19.484410
touareg 16169 20933 -4764 -29.463789
cl 13229 14425 -1196 -9.040744
superb 9682 10639 -957 -9.884321
159 6659 7800 -1141 -17.134705
carnival 2178 2410 -232 -10.651974
lybra 659 799 -140 -21.244310

Damage of the cars has big influence on the price. On average the damaged cars are 52% cheaper. We can see that few of damaged counterparts are more expensvive, this beacause the price depednds on many diffrent factor.

Find the most common brand/model combinations

We will look for most common model for every brand.

In [70]:
# create a list of brands
brands = []
for x in autos['brand']:
    if x not in brands:
        brands.append(x)
print(brands)
['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'other', 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia']
In [71]:
# remove 'other' since it's not a brand
brands.remove('other') 
print(brands)
['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia']
In [72]:
dtop_model_name = {}
dtop_model_amount = {}
dtop_model_pct = {}

for x in brands:
    b = autos[autos['brand'] == x] # select rows for every brand
    top_model = b['model'].value_counts().head(1) # find the most common model
    top_model_string = top_model.to_string() # change astype to string
    top_model_name = top_model_string.split(' ')[0] #use split to get only name of the model
    top_model_pct = b.value_counts('model', normalize=True).head(1) # percentages
    dtop_model_name[x] = top_model_name # add the name to the dictionary
    dtop_model_amount[x] = int(top_model) # add total amount to the dictionary
    dtop_model_pct[x] = round(float(top_model_pct), 2) # add percentages round to 2 digits
    
top_model_by_brand = pd.Series(dtop_model_name) #change dictionary to series
top_model_by_brand_pct = pd.Series(dtop_model_pct) #change dictionary to series
top_model_by_brand_amount = pd.Series(dtop_model_amount) #change dictionary to series
final = pd.DataFrame(top_model_by_brand, columns=['Model']) # change seires to DataFrame
final['Total'] = top_model_by_brand_amount # add Total column
final['Percentage'] = top_model_by_brand_pct.astype(str) + '%' #add percentages columb 

final.sort_values('Total', ascending=False, inplace=True) # sort by total
final
    
Out[72]:
Model Total Percentage
volkswagen golf 3707 0.39%
bmw 3er 2615 0.53%
opel corsa 1592 0.33%
audi a4 1231 0.31%
mercedes_benz c_klasse 1136 0.26%
ford focus 762 0.24%
renault twingo 615 0.29%
peugeot 2_reihe 600 0.46%
smart fortwo 550 0.87%
fiat punto 415 0.36%
seat ibiza 328 0.4%
skoda octavia 305 0.41%
mini cooper 271 0.66%
nissan micra 243 0.35%
citroen other 219 0.35%
hyundai i_reihe 174 0.4%
honda civic 166 0.48%
mazda 3_reihe 164 0.24%
chevrolet other 147 0.59%
porsche 911 141 0.5%
toyota yaris 126 0.22%
suzuki other 119 0.44%
mitsubishi colt 112 0.31%
volvo v70 91 0.22%
alfa_romeo 156 88 0.3%
kia other 65 0.21%
daihatsu cuore 63 0.55%
rover other 55 0.93%
saab other 53 0.72%
chrysler other 52 0.34%
trabant 601 47 0.85%
dacia sandero 44 0.36%
jeep grand 43 0.43%
jaguar other 39 0.54%
land_rover freelander 31 0.32%
subaru legacy 30 0.32%
daewoo matiz 22 0.33%
lancia ypsilon 19 0.4%
lada niva 15 0.6%

The most popular model is Volkswagen Golf.\ Almost 87% of all smart car are model fortwo

Conclusion

In this project we wanted to clean our data and than see how price of the cars changes depending on different factors.

For cleaning data it's important to change column format and rename some of them based on the data dictionary to be more descriptive, to drop columns that have only one value which makes them not important for our analyze, to change astype of columns for one more suitable and to translate data from a foreign language.

The analyze shows that the price is influenced by many factors. Most expensive brands cost around 8000$ and cheapest 3000$. Cars with low mileage are more expensive. Damaged cars are on average 52% cheaper. The price of the car depends on all above and other factors combined.

The most popular model within all brands is Volkswagen Golf.