In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, which you can find it here.
A few modifications have been made from the original dataset:
The data dictionary provided with data is as follows:
The aim of this project is to clean the data and analyze the included used car listings.
- price of the car depends on many factors(model, year, condition, car mileage, etc.)
- some brands are more expensive than other
- cars with lower mileage are more expensive
- damaged cars are on average 52% cheaper
- the most popular model is Volkswagen Golf
Let's start by importing the libraries we need and reading the dataset into pandas.
# import usefull libraries
import pandas as pd
import numpy as np
# read CSV file into pandas
autos = pd.read_csv('autos.csv', encoding = 'Latin-1') # use Latin-1 encoding
Lets quickly explore our data set.
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Our data sets has 20 columns and 50 000 rows, most of them are stored as strings and some as float. 6 categories contain null entries, but none have more than 20% null values. The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores.
Let's convert the column names from camelcase to snakecase and reword some of the column names based on the data dictionary to be more descriptive.
autos.columns # use .columns to print columns names
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# modify column names
columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
'last_seen']
# assigned new column names
autos.columns = columns
autos.head(3)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
We've changed the column names from camelcase to snakecase and editted some of the columns names to be more descriptive.
Now let's do some basic data exploration to determine what other cleaning tasks need to be done.
autos.describe(include='all') # include='all' to get both categorical and numeric columns
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-21 20:37:19 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
We are going to clean our data in few steps:
1. Drop any colunms that have mostly one value.
2. Check columns for their unique values.
3. Change some columns for numeric values.
4. Change some columns for datetime dtype.
5. Translate some data from german to english.
We will drop seller
and offer_type
colunms since they have only one value and have a closer look at nr_of_pictures
column.
autos['nr_of_pictures'].value_counts()
0 50000 Name: nr_of_pictures, dtype: int64
Look like column nr_of_pictures
has 0 as every value so we can also drop it.
autos.drop(['seller','offer_type', 'nr_of_pictures'], axis = 1)
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 17 columns
We will check vehicle_type
, fuel_type
and brand
for unique values.
# Vechicle Type
autos['vehicle_type'].value_counts(dropna=False)
limousine 12859 kleinwagen 10822 kombi 9127 NaN 5095 bus 4093 cabrio 3061 coupe 2537 suv 1986 andere 420 Name: vehicle_type, dtype: int64
# Fuel Type
autos['fuel_type'].value_counts(dropna=False)
benzin 30107 diesel 14567 NaN 4482 lpg 691 cng 75 hybrid 37 andere 22 elektro 19 Name: fuel_type, dtype: int64
# Brand
autos['brand'].value_counts(dropna=False)
volkswagen 10687 opel 5461 bmw 5429 mercedes_benz 4734 audi 4283 ford 3479 renault 2404 peugeot 1456 fiat 1308 seat 941 skoda 786 mazda 757 nissan 754 smart 701 citroen 701 toyota 617 sonstige_autos 546 hyundai 488 volvo 457 mini 424 mitsubishi 406 honda 399 kia 356 alfa_romeo 329 porsche 294 suzuki 293 chevrolet 283 chrysler 181 dacia 129 daihatsu 128 jeep 110 subaru 109 land_rover 99 saab 80 daewoo 79 trabant 78 jaguar 77 rover 69 lancia 57 lada 31 Name: brand, dtype: int64
Everything looks good here. Let's change some columns to numeric values.
We will change price
and odometer
for numeric values.
First we will delete '$' and ',' from price column and than change dtype for numeric.
autos['price'] = (autos['price']
.str.replace('$','') # delete $
.str.replace(',','') # delete ,
.astype(int) # convert column to numeric dtype
)
autos['price'].head()
0 5000 1 8500 2 8990 3 4350 4 1350 Name: price, dtype: int64
Let's do the same for odometer
column.
autos['odometer'] = (autos['odometer'].str.replace('km','') # delete 'km'
.str.replace(',','') # delete ,
.astype(int) # change dtype to int
)
autos['odometer'].head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: odometer, dtype: int64
#change column name to 'odometer_km'
autos.rename(columns={'odometer': 'odometer_km'}, inplace=True)
Let's continue exploring the data, specifically looking for data that doesn't look right. We'll start by analyzing the odometer_km
, price
and registration_year
columns. We will analyze the columns using minimum and maximum values and look for any values that look unrealistically high or low (outliers) that we might want to remove.
# see how many unique values
autos['odometer_km'].unique().shape
(13,)
# see min/max/median/mean etc
autos['odometer_km'].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
# see the highest and lowest values with their counts
autos['odometer_km'].value_counts().sort_index()
5000 967 10000 264 20000 784 30000 789 40000 819 50000 1027 60000 1164 70000 1230 80000 1436 90000 1757 100000 2169 125000 5170 150000 32424 Name: odometer_km, dtype: int64
Everything looks good for this column.
# see how many unique values
autos['price'].unique().shape
(2357,)
# see min/max/median/mean etc
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
# see the lowest values with their counts
autos['price'].value_counts().sort_index().head(5)
0 1421 1 156 2 3 3 1 5 2 Name: price, dtype: int64
# see the highest values with their counts
autos['price'].value_counts().sort_index().tail(15)
265000 1 295000 1 299000 1 345000 1 350000 1 999990 1 999999 2 1234566 1 1300000 1 3890000 1 10000000 1 11111111 2 12345678 3 27322222 1 99999999 1 Name: price, dtype: int64
We can see many prices that doesn't look realistic. There are over 1400 cars listed for free. We have also some with a price like '1234566' or '11111111'.
Let's have a closer look at listings with price over 350 000.
autos[autos['price'] > 350000].sort_values('price')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
37585 | 2016-03-29 11:38:54 | Volkswagen_Jetta_GT | privat | Angebot | 999990 | test | limousine | 1985 | manuell | 111 | jetta | 150000 | 12 | benzin | volkswagen | ja | 2016-03-29 00:00:00 | 0 | 50997 | 2016-03-29 11:38:54 |
514 | 2016-03-17 09:53:08 | Ford_Focus_Turnier_1.6_16V_Style | privat | Angebot | 999999 | test | kombi | 2009 | manuell | 101 | focus | 125000 | 4 | benzin | ford | nein | 2016-03-17 00:00:00 | 0 | 12205 | 2016-04-06 07:17:35 |
43049 | 2016-03-21 19:53:52 | 2_VW_Busse_T3 | privat | Angebot | 999999 | test | bus | 1981 | manuell | 70 | transporter | 150000 | 1 | benzin | volkswagen | NaN | 2016-03-21 00:00:00 | 0 | 99880 | 2016-03-28 17:18:28 |
22947 | 2016-03-22 12:54:19 | Bmw_530d_zum_ausschlachten | privat | Angebot | 1234566 | control | kombi | 1999 | automatik | 190 | NaN | 150000 | 2 | diesel | bmw | NaN | 2016-03-22 00:00:00 | 0 | 17454 | 2016-04-02 03:17:32 |
7814 | 2016-04-04 11:53:31 | Ferrari_F40 | privat | Angebot | 1300000 | control | coupe | 1992 | NaN | 0 | NaN | 50000 | 12 | NaN | sonstige_autos | nein | 2016-04-04 00:00:00 | 0 | 60598 | 2016-04-05 11:34:11 |
47634 | 2016-04-04 21:25:21 | Ferrari_FXX | privat | Angebot | 3890000 | test | coupe | 2006 | NaN | 799 | NaN | 5000 | 7 | NaN | sonstige_autos | nein | 2016-04-04 00:00:00 | 0 | 60313 | 2016-04-05 12:07:37 |
11137 | 2016-03-29 23:52:57 | suche_maserati_3200_gt_Zustand_unwichtig_laufe... | privat | Angebot | 10000000 | control | coupe | 1960 | manuell | 368 | NaN | 100000 | 1 | benzin | sonstige_autos | nein | 2016-03-29 00:00:00 | 0 | 73033 | 2016-04-06 21:18:11 |
2897 | 2016-03-12 21:50:57 | Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000 | privat | Angebot | 11111111 | test | limousine | 1973 | manuell | 48 | escort | 50000 | 3 | benzin | ford | nein | 2016-03-12 00:00:00 | 0 | 94469 | 2016-03-12 22:45:27 |
24384 | 2016-03-21 13:57:51 | Schlachte_Golf_3_gt_tdi | privat | Angebot | 11111111 | test | NaN | 1995 | NaN | 0 | NaN | 150000 | 0 | NaN | volkswagen | NaN | 2016-03-21 00:00:00 | 0 | 18519 | 2016-03-21 14:40:18 |
27371 | 2016-03-09 15:45:47 | Fiat_Punto | privat | Angebot | 12345678 | control | NaN | 2017 | NaN | 95 | punto | 150000 | 0 | NaN | fiat | NaN | 2016-03-09 00:00:00 | 0 | 96110 | 2016-03-09 15:45:47 |
39377 | 2016-03-08 23:53:51 | Tausche_volvo_v40_gegen_van | privat | Angebot | 12345678 | control | NaN | 2018 | manuell | 95 | v40 | 150000 | 6 | NaN | volvo | nein | 2016-03-08 00:00:00 | 0 | 14542 | 2016-04-06 23:17:31 |
47598 | 2016-03-31 18:56:54 | Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar... | privat | Angebot | 12345678 | control | limousine | 2001 | manuell | 101 | vectra | 150000 | 3 | benzin | opel | nein | 2016-03-31 00:00:00 | 0 | 4356 | 2016-03-31 18:56:54 |
42221 | 2016-03-08 20:39:05 | Leasinguebernahme | privat | Angebot | 27322222 | control | limousine | 2014 | manuell | 163 | c4 | 40000 | 2 | diesel | citroen | NaN | 2016-03-08 00:00:00 | 0 | 76532 | 2016-03-08 20:39:05 |
39705 | 2016-03-22 14:58:27 | Tausch_gegen_gleichwertiges | privat | Angebot | 99999999 | control | limousine | 1999 | automatik | 224 | s_klasse | 150000 | 9 | benzin | mercedes_benz | NaN | 2016-03-22 00:00:00 | 0 | 73525 | 2016-04-06 05:15:30 |
Most of this cars have unrealistic price excpet of 3 models(2x Ferrari and Masseratti) from 1 300 000 to 10 000 000. So we are going to delete all cars listed for free and over 350 000, but we are going to keep this 3 more expensive models.
autos = autos[(((autos["price"] >= 1 ) & (autos["price"] <= 350000 )) | ((autos["price"] >= 1300000 ) & (autos["price"] <= 10000000 )))]
autos['price'].value_counts().sort_index()
1 156 2 3 3 1 5 2 8 1 ... 345000 1 350000 1 1300000 1 3890000 1 10000000 1 Name: price, Length: 2349, dtype: int64
# see how many unique values
autos['registration_year'].unique().shape
(95,)
# see min/max/median/mean etc
autos['registration_year'].describe()
count 48568.000000 mean 2004.754262 std 88.641401 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
# see the highest and lowest values with their counts
autos['registration_year'].value_counts().sort_index()
1000 1 1001 1 1111 1 1800 2 1910 5 .. 5911 1 6200 1 8888 1 9000 1 9999 3 Name: registration_year, Length: 95, dtype: int64
Since the car couldn't be registered before 1900 and after 2016 which is year where this listings were added we will remove this outliers.
autos = autos[autos["registration_year"].between(1900,2016)]
Let's now move on to the date columns and understand the date range the data covers. There are 5 columns that should represent date values.
Right now, the date_crawled
, last_seen
, and ad_created
columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a astype datetime
representation so we can understand it quantitatively. The other two registration_month
and registration_year
columns are represented as numeric values, so we can use methods like Series.describe()
to understand the distribution without any extra data processing.
Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values, like so:
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
We can notice that a date is in a format YYYY-MM-DD HH-MM-SS. We are going to use pd.to_datetime
to change astype to datetime and we will change format to YYYY-MM-DD.
autos['date_crawled'] = pd.to_datetime(autos['date_crawled'], format="%Y-%m-%d %H:%M:%S") # change dtype to datetime
date_crawled = autos['date_crawled'].dt.strftime('%Y-%m-%d') # change format to YYYY-MM-DD
date_crawled.value_counts(normalize=True,dropna=False).sort_index() #normalize and dropna to inluce missing values and use percentages
2016-03-05 0.025191 2016-03-06 0.014159 2016-03-07 0.036244 2016-03-08 0.033545 2016-03-09 0.033245 2016-03-10 0.032238 2016-03-11 0.032452 2016-03-12 0.036822 2016-03-13 0.015873 2016-03-14 0.036329 2016-03-15 0.034359 2016-03-16 0.029496 2016-03-17 0.031788 2016-03-18 0.012810 2016-03-19 0.034659 2016-03-20 0.038022 2016-03-21 0.037315 2016-03-22 0.032838 2016-03-23 0.032195 2016-03-24 0.029475 2016-03-25 0.031510 2016-03-26 0.032067 2016-03-27 0.030781 2016-03-28 0.034594 2016-03-29 0.034123 2016-03-30 0.033802 2016-03-31 0.031788 2016-04-01 0.033802 2016-04-02 0.035558 2016-04-03 0.038771 2016-04-04 0.036651 2016-04-05 0.013002 2016-04-06 0.003085 2016-04-07 0.001414 Name: date_crawled, dtype: float64
The adverts were first crawled between 5th of March and 7th of April 2016.
autos['ad_created'] = pd.to_datetime(autos['ad_created'], format="%Y-%m-%d %H:%M:%S")
ad_created = autos['ad_created'].dt.strftime("%Y-%m") # keep only the year and month
ad_created.value_counts(normalize=True,dropna=False).sort_index() #normalize and dropna to inluce missing values and use percentages
2015-06 0.000021 2015-08 0.000021 2015-09 0.000021 2015-11 0.000021 2015-12 0.000043 2016-01 0.000193 2016-02 0.001307 2016-03 0.837139 2016-04 0.161233 Name: ad_created, dtype: float64
The eBay listings were created between June 2015 and April 2016. But 99% of the ads were created in March and April 2016. Most adds were creater in March 2016 - 83,7%
autos['last_seen'] = pd.to_datetime(autos['last_seen'], format="%Y-%m-%d %H:%M:%S")
last_seen = autos['last_seen'].dt.strftime("%Y-%m-%d")
last_seen.value_counts(normalize=True,dropna=False).sort_index() #normalize and dropna to inluce missing values and use percentages
2016-03-05 0.001071 2016-03-06 0.004113 2016-03-07 0.005377 2016-03-08 0.007476 2016-03-09 0.009768 2016-03-10 0.010689 2016-03-11 0.012381 2016-03-12 0.023755 2016-03-13 0.008654 2016-03-14 0.012660 2016-03-15 0.016001 2016-03-16 0.016280 2016-03-17 0.028082 2016-03-18 0.007219 2016-03-19 0.015616 2016-03-20 0.020628 2016-03-21 0.020585 2016-03-22 0.020842 2016-03-23 0.018357 2016-03-24 0.019686 2016-03-25 0.018936 2016-03-26 0.016794 2016-03-27 0.015637 2016-03-28 0.020692 2016-03-29 0.022085 2016-03-30 0.024612 2016-03-31 0.023627 2016-04-01 0.022941 2016-04-02 0.024655 2016-04-03 0.025148 2016-04-04 0.024120 2016-04-05 0.125439 2016-04-06 0.223331 2016-04-07 0.132744 Name: last_seen, dtype: float64
The crawler saw this ads between 3rd of May and 7th of April 2016.
autos['registration_year'].describe()
count 46684.000000 mean 2002.909669 std 7.187808 min 1910.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 2016.000000 Name: registration_year, dtype: float64
Cars were first registered between 1910 and 2016.
We have cleared our data set, now lets analyze them.
We are going to translate data from this columns: vehicle_type
, gearbox
, model
, fuel_type
, brand
and unrepaired_damage
from german to english. First we will check unique values for every column, translate them and than map the values to their english counterparts.
print(pd.unique(autos['vehicle_type']))
['bus' 'limousine' 'kleinwagen' 'kombi' nan 'coupe' 'suv' 'cabrio' 'andere']
# create dictionary with translations
vt_dict = {
'limousine': 'limousine',
'kleinwagen': 'small_car',
'kombi': 'combi',
'bus': 'bus',
'cabrio': 'convertible',
'coupe': 'coupe',
'suv': 'suv',
'andere': 'other'
}
autos['vehicle_type'] = autos['vehicle_type'].map(vt_dict) # map the new values
print(pd.unique(autos['vehicle_type'])) # check new values
['bus' 'limousine' 'small_car' 'combi' nan 'coupe' 'suv' 'convertible' 'other']
print(pd.unique(autos['gearbox']))
['manuell' 'automatik' nan]
# create dictionary with translations
gb_dict = {
'manuell': 'manual',
'automatik': 'automatic'
}
autos['gearbox'] = autos['gearbox'].map(gb_dict) # map the new values
print(pd.unique(autos['gearbox'])) # check new values
['manual' 'automatic' nan]
print(pd.unique(autos['fuel_type']))
['lpg' 'benzin' 'diesel' nan 'cng' 'hybrid' 'elektro' 'andere']
# create dictionary with translations
ft_dict = {
'benzin': 'petrol',
'diesel': 'diesel',
'lpg': 'lpg',
'cng': 'cng',
'hybrid': 'hybrid',
'elektro': 'electric',
'andere': 'other'
}
autos['fuel_type'] = autos['fuel_type'].map(ft_dict) # map the new values
print(pd.unique(autos['fuel_type'])) # check new values
['lpg' 'petrol' 'diesel' nan 'cng' 'hybrid' 'electric' 'other']
print(pd.unique(autos['unrepaired_damage']))
['nein' nan 'ja']
# create dictionary with translations
ud_dict = {
'ja': 'yes',
'nein': 'no'
}
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(ud_dict) # map the new values
print(pd.unique(autos['unrepaired_damage'])) # check new values
['no' nan 'yes']
print(pd.unique(autos['model']))
['andere' '7er' 'golf' 'fortwo' 'focus' 'voyager' 'arosa' 'megane' nan 'a3' 'clio' 'vectra' 'scirocco' '3er' 'a4' '911' 'cooper' '5er' 'polo' 'e_klasse' '2_reihe' 'c_klasse' 'corsa' 'mondeo' 'altea' 'a1' 'twingo' 'a_klasse' 'cl' '3_reihe' 's_klasse' 'sandero' 'passat' 'primera' 'wrangler' 'a6' 'transporter' 'astra' 'v40' 'ibiza' 'micra' '1er' 'yaris' 'colt' '6_reihe' '5_reihe' 'corolla' 'ka' 'tigra' 'punto' 'vito' 'cordoba' 'galaxy' '100' 'octavia' 'm_klasse' 'lupo' 'fiesta' 'superb' 'meriva' 'c_max' 'laguna' 'touran' '1_reihe' 'm_reihe' 'touareg' 'seicento' 'avensis' 'vivaro' 'x_reihe' 'ducato' 'carnival' 'boxster' 'signum' 'sharan' 'zafira' 'rav' 'a5' 'beetle' 'c_reihe' 'phaeton' 'i_reihe' 'sl' 'insignia' 'up' 'civic' '80' 'mx_reihe' 'omega' 'sorento' 'z_reihe' 'berlingo' 'clk' 's_max' 'kalos' 'cx_reihe' 'grand' 'swift' 'tiguan' 'sprinter' 'mii' 'viano' 'kaefer' 'almera' 'picanto' 'espace' 'scenic' 'one' 'bora' 'fox' 'leon' 'transit' 'tucson' 'tt' 'qashqai' 'mustang' 'jazz' 'nubira' 'v70' 'duster' 'stilo' 'justy' 'kangoo' 'fabia' 'matiz' 'freelander' 'lancer' 'forester' 's_type' 'bravo' '500' '156' 'cr_reihe' 'toledo' 'panda' 'a8' 'ceed' 'caddy' 'aygo' 'citigo' 'exeo' 'c1' 'doblo' 'impreza' '147' 'agila' '145' 'c3' 'roomster' 'auris' '601' 'eos' 'aveo' 'slk' 'terios' 'fusion' 'c4' 'glk' 'materia' 'v50' 'combo' 'yeti' '900' '850' 'outlander' 'rio' 'escort' 'c2' 'jimny' 'forfour' 'a2' 'spider' 'kadett' 'b_klasse' 'carisma' 'cayenne' 'r19' 'cc' 'note' 'logan' 'xc_reihe' 'verso' 'v_klasse' 'jetta' 'c5' 'q5' 'getz' 'navara' 'x_type' 'clubman' 's60' '4_reihe' 'lybra' 'cherokee' 'q3' 'q7' 'roadster' 'captiva' 'cuore' '90' 'ptcruiser' 'kuga' 'x_trail' 'ypsilon' 'pajero' 'delta' 'rx_reihe' 'modus' 'sportage' 'calibra' 'sirion' '6er' 'gl' 'amarok' 'santa' '300c' 'antara' 'accord' 'crossfire' '159' 'niva' 'galant' 'legacy' 'defender' 'range_rover' 'juke' 'spark' 'lanos' 'v60' 'alhambra' 'discovery' 'range_rover_sport' 'range_rover_evoque' 'move' 'croma' 'g_klasse' 'lodgy' 'rangerover' 'samara' 'kappa' '9000' 'charade' 'i3' '200' 'b_max']
Since we have alot unique values and we want translate only one type andere
, we will use str.replace()
.
autos['model'] = autos['model'].str.replace('andere','other')
print(pd.unique(autos['model']))
['other' '7er' 'golf' 'fortwo' 'focus' 'voyager' 'arosa' 'megane' nan 'a3' 'clio' 'vectra' 'scirocco' '3er' 'a4' '911' 'cooper' '5er' 'polo' 'e_klasse' '2_reihe' 'c_klasse' 'corsa' 'mondeo' 'altea' 'a1' 'twingo' 'a_klasse' 'cl' '3_reihe' 's_klasse' 'sandero' 'passat' 'primera' 'wrangler' 'a6' 'transporter' 'astra' 'v40' 'ibiza' 'micra' '1er' 'yaris' 'colt' '6_reihe' '5_reihe' 'corolla' 'ka' 'tigra' 'punto' 'vito' 'cordoba' 'galaxy' '100' 'octavia' 'm_klasse' 'lupo' 'fiesta' 'superb' 'meriva' 'c_max' 'laguna' 'touran' '1_reihe' 'm_reihe' 'touareg' 'seicento' 'avensis' 'vivaro' 'x_reihe' 'ducato' 'carnival' 'boxster' 'signum' 'sharan' 'zafira' 'rav' 'a5' 'beetle' 'c_reihe' 'phaeton' 'i_reihe' 'sl' 'insignia' 'up' 'civic' '80' 'mx_reihe' 'omega' 'sorento' 'z_reihe' 'berlingo' 'clk' 's_max' 'kalos' 'cx_reihe' 'grand' 'swift' 'tiguan' 'sprinter' 'mii' 'viano' 'kaefer' 'almera' 'picanto' 'espace' 'scenic' 'one' 'bora' 'fox' 'leon' 'transit' 'tucson' 'tt' 'qashqai' 'mustang' 'jazz' 'nubira' 'v70' 'duster' 'stilo' 'justy' 'kangoo' 'fabia' 'matiz' 'freelander' 'lancer' 'forester' 's_type' 'bravo' '500' '156' 'cr_reihe' 'toledo' 'panda' 'a8' 'ceed' 'caddy' 'aygo' 'citigo' 'exeo' 'c1' 'doblo' 'impreza' '147' 'agila' '145' 'c3' 'roomster' 'auris' '601' 'eos' 'aveo' 'slk' 'terios' 'fusion' 'c4' 'glk' 'materia' 'v50' 'combo' 'yeti' '900' '850' 'outlander' 'rio' 'escort' 'c2' 'jimny' 'forfour' 'a2' 'spider' 'kadett' 'b_klasse' 'carisma' 'cayenne' 'r19' 'cc' 'note' 'logan' 'xc_reihe' 'verso' 'v_klasse' 'jetta' 'c5' 'q5' 'getz' 'navara' 'x_type' 'clubman' 's60' '4_reihe' 'lybra' 'cherokee' 'q3' 'q7' 'roadster' 'captiva' 'cuore' '90' 'ptcruiser' 'kuga' 'x_trail' 'ypsilon' 'pajero' 'delta' 'rx_reihe' 'modus' 'sportage' 'calibra' 'sirion' '6er' 'gl' 'amarok' 'santa' '300c' 'antara' 'accord' 'crossfire' '159' 'niva' 'galant' 'legacy' 'defender' 'range_rover' 'juke' 'spark' 'lanos' 'v60' 'alhambra' 'discovery' 'range_rover_sport' 'range_rover_evoque' 'move' 'croma' 'g_klasse' 'lodgy' 'rangerover' 'samara' 'kappa' '9000' 'charade' 'i3' '200' 'b_max']
print(pd.unique(autos['brand']))
['peugeot' 'bmw' 'volkswagen' 'smart' 'ford' 'chrysler' 'seat' 'renault' 'mercedes_benz' 'audi' 'sonstige_autos' 'opel' 'mazda' 'porsche' 'mini' 'toyota' 'dacia' 'nissan' 'jeep' 'saab' 'volvo' 'mitsubishi' 'jaguar' 'fiat' 'skoda' 'subaru' 'kia' 'citroen' 'chevrolet' 'hyundai' 'honda' 'daewoo' 'suzuki' 'trabant' 'land_rover' 'alfa_romeo' 'lada' 'rover' 'daihatsu' 'lancia']
autos['brand'] = autos['brand'].str.replace('sonstige_autos','other')
print(pd.unique(autos['brand']))
['peugeot' 'bmw' 'volkswagen' 'smart' 'ford' 'chrysler' 'seat' 'renault' 'mercedes_benz' 'audi' 'other' 'opel' 'mazda' 'porsche' 'mini' 'toyota' 'dacia' 'nissan' 'jeep' 'saab' 'volvo' 'mitsubishi' 'jaguar' 'fiat' 'skoda' 'subaru' 'kia' 'citroen' 'chevrolet' 'hyundai' 'honda' 'daewoo' 'suzuki' 'trabant' 'land_rover' 'alfa_romeo' 'lada' 'rover' 'daihatsu' 'lancia']
autos['brand'].value_counts().sort_index # count values in 'brand' column
<bound method Series.sort_index of volkswagen 9862 bmw 5137 opel 5022 mercedes_benz 4503 audi 4041 ford 3263 renault 2201 peugeot 1393 fiat 1197 seat 853 skoda 766 nissan 713 mazda 709 smart 661 citroen 654 toyota 593 hyundai 468 other 461 volvo 427 mini 409 mitsubishi 384 honda 366 kia 330 alfa_romeo 310 porsche 286 suzuki 277 chevrolet 266 chrysler 164 dacia 123 daihatsu 117 jeep 106 subaru 100 land_rover 98 saab 77 jaguar 73 daewoo 70 trabant 65 rover 62 lancia 50 lada 27 Name: brand, dtype: int64>
We have 40 diffrent car brands in our data. The most popular brand is Volkswagen.
Will will continue analyze only for the 10 most popular brands.
top10 = autos['brand'].value_counts().index[:10] #names of Top10 brands
mean_price = {} #create dictionary for mean price
for brand in top10:
selected_rows = autos[autos['brand'] == brand] #select rows for every brand
price = selected_rows['price'].mean() # calculate mean price
mean_price[brand] = int(price) #add mean price to dictionary as int.
mean_price = pd.Series(mean_price).sort_values(ascending=False) # change dictionary to pandas.series and sort it.
brand_price = pd.DataFrame(mean_price, columns=["mean_price"]) # change pd.series to pd.DataFrame
brand_price
mean_price | |
---|---|
audi | 9336 |
mercedes_benz | 8628 |
bmw | 8332 |
volkswagen | 5402 |
seat | 4397 |
ford | 3749 |
peugeot | 3094 |
opel | 2975 |
fiat | 2813 |
renault | 2474 |
We observed that in the top 10 brands, there's a distinct price gap.
Audi
, Mercedes Benz
and BMW
are more expensiveFiat
, Peugeot
, Reanult
and Opel
are less expensiveVolkswagen
, Ford
and Seat
are in betweenFor the top 10 brands, let's use aggregation to understand the average mileage for those cars and if there's any visible link with mean price.
mean_mileage = {} #create dictionary for mean mileage
for brand in top10:
selected_rows = autos[autos['brand'] == brand] #select rows for every brand
mileage = selected_rows['odometer_km'].mean() # calculate mean mileagae
mean_mileage[brand] = int(mileage) #add mean mileage to dictionary as int.
mean_mileage = pd.Series(mean_mileage).sort_values(ascending=False) # change dictionary to pandas.series and sort it.
brand_mileage = pd.DataFrame(mean_mileage,columns=['mean_mileage']) # change pd.series to pd.DataFrame
brand_mileage
mean_mileage | |
---|---|
bmw | 132572 |
mercedes_benz | 130788 |
opel | 129310 |
audi | 129157 |
volkswagen | 128707 |
renault | 128071 |
peugeot | 127153 |
ford | 124266 |
seat | 121131 |
fiat | 117121 |
BMW cars have the highest car mileage.
Now we will check if the price of each brand depends on car mileage.
brand_price['mean_mileage'] = brand_mileage # combine to DataFrames
brand_price
mean_price | mean_mileage | |
---|---|---|
audi | 9336 | 129157 |
mercedes_benz | 8628 | 130788 |
bmw | 8332 | 132572 |
volkswagen | 5402 | 128707 |
seat | 4397 | 121131 |
ford | 3749 | 124266 |
peugeot | 3094 | 127153 |
opel | 2975 | 129310 |
fiat | 2813 | 117121 |
renault | 2474 | 128071 |
We cannot see a direct correlation between price and mileage. The price of the car is depending on many factors like: model, year, condition and mileage. We can conclude that the price is influenced by mileage but not depending on.
We will see how the car mileage influence the price.
### create a list of mileage
odometer = []
for x in autos['odometer_km']:
if x not in odometer:
odometer.append(x)
print(odometer)
[150000, 70000, 50000, 80000, 10000, 30000, 125000, 90000, 20000, 60000, 5000, 40000, 100000]
avg_price = {} # create empty dictionary
for x in odometer:
selected_rows = autos[autos['odometer_km'] == x] #select rows
price = selected_rows['price'].mean() # calculate mean
avg_price[x] = int(price) # add price to dictionary
avg_price = pd.Series(avg_price).sort_values(ascending=False) # change dictionary to pandas.series and sort it.
odo_price = pd.DataFrame(avg_price,columns=['price']) # change pd.series to pd.DataFrame
odo_price
price | |
---|---|
10000 | 20550 |
20000 | 18448 |
30000 | 16608 |
40000 | 15499 |
50000 | 15106 |
5000 | 13811 |
100000 | 12985 |
60000 | 12385 |
70000 | 10927 |
80000 | 9721 |
90000 | 8465 |
125000 | 6214 |
150000 | 3767 |
Car price depends of their car mileage. We can see that cars with smaller mileage are mostly more expensive.
This data shows that the conclusion we made before is correct.
We will investigate how much cheaper are same models of cars when they damaged.
First we will calculate average price for every model not damaged and damaged.
Than we will compare both and see the diffrence.
### create a list of models
models = []
for x in autos['model'].dropna(): # select rows for each model and drop NaN values
if x not in models:
models.append(x)
models.remove('other') # remove 'other' form the list
print(models)
['7er', 'golf', 'fortwo', 'focus', 'voyager', 'arosa', 'megane', 'a3', 'clio', 'vectra', 'scirocco', '3er', 'a4', '911', 'cooper', '5er', 'polo', 'e_klasse', '2_reihe', 'c_klasse', 'corsa', 'mondeo', 'altea', 'a1', 'twingo', 'a_klasse', 'cl', '3_reihe', 's_klasse', 'sandero', 'passat', 'primera', 'wrangler', 'a6', 'transporter', 'astra', 'v40', 'ibiza', 'micra', '1er', 'yaris', 'colt', '6_reihe', '5_reihe', 'corolla', 'ka', 'tigra', 'punto', 'vito', 'cordoba', 'galaxy', '100', 'octavia', 'm_klasse', 'lupo', 'fiesta', 'superb', 'meriva', 'c_max', 'laguna', 'touran', '1_reihe', 'm_reihe', 'touareg', 'seicento', 'avensis', 'vivaro', 'x_reihe', 'ducato', 'carnival', 'boxster', 'signum', 'sharan', 'zafira', 'rav', 'a5', 'beetle', 'c_reihe', 'phaeton', 'i_reihe', 'sl', 'insignia', 'up', 'civic', '80', 'mx_reihe', 'omega', 'sorento', 'z_reihe', 'berlingo', 'clk', 's_max', 'kalos', 'cx_reihe', 'grand', 'swift', 'tiguan', 'sprinter', 'mii', 'viano', 'kaefer', 'almera', 'picanto', 'espace', 'scenic', 'one', 'bora', 'fox', 'leon', 'transit', 'tucson', 'tt', 'qashqai', 'mustang', 'jazz', 'nubira', 'v70', 'duster', 'stilo', 'justy', 'kangoo', 'fabia', 'matiz', 'freelander', 'lancer', 'forester', 's_type', 'bravo', '500', '156', 'cr_reihe', 'toledo', 'panda', 'a8', 'ceed', 'caddy', 'aygo', 'citigo', 'exeo', 'c1', 'doblo', 'impreza', '147', 'agila', '145', 'c3', 'roomster', 'auris', '601', 'eos', 'aveo', 'slk', 'terios', 'fusion', 'c4', 'glk', 'materia', 'v50', 'combo', 'yeti', '900', '850', 'outlander', 'rio', 'escort', 'c2', 'jimny', 'forfour', 'a2', 'spider', 'kadett', 'b_klasse', 'carisma', 'cayenne', 'r19', 'cc', 'note', 'logan', 'xc_reihe', 'verso', 'v_klasse', 'jetta', 'c5', 'q5', 'getz', 'navara', 'x_type', 'clubman', 's60', '4_reihe', 'lybra', 'cherokee', 'q3', 'q7', 'roadster', 'captiva', 'cuore', '90', 'ptcruiser', 'kuga', 'x_trail', 'ypsilon', 'pajero', 'delta', 'rx_reihe', 'modus', 'sportage', 'calibra', 'sirion', '6er', 'gl', 'amarok', 'santa', '300c', 'antara', 'accord', 'crossfire', '159', 'niva', 'galant', 'legacy', 'defender', 'range_rover', 'juke', 'spark', 'lanos', 'v60', 'alhambra', 'discovery', 'range_rover_sport', 'range_rover_evoque', 'move', 'croma', 'g_klasse', 'lodgy', 'rangerover', 'samara', 'kappa', '9000', 'charade', 'i3', '200', 'b_max']
avg_price_non_damage = {} # create an empty dictionary
for x in models:
selected_rows = autos.loc[(autos['model'] == x) & (autos['unrepaired_damage'] == 'no')] #select rows for each model without unrepaired damage
price = selected_rows['price'].mean() # calculate mean price
avg_price_non_damage[x] = price # add the price to dictionary
avg_price_non_damage = pd.Series(avg_price_non_damage).sort_values(ascending=False) #change dictionary to pd.series
non_damage = pd.DataFrame(avg_price_non_damage,columns=['price_non_damage']) #change pd.series to pd.DataFrame
non_damage
price_non_damage | |
---|---|
911 | 74240.922481 |
range_rover_sport | 35049.900000 |
gl | 34478.000000 |
defender | 33009.125000 |
m_reihe | 28971.975610 |
... | ... |
lybra | 659.800000 |
lanos | 593.625000 |
charade | 466.666667 |
materia | NaN |
i3 | NaN |
243 rows × 1 columns
avg_price_damage = {} # create an empty dictionary
for x in models:
selected_rows = autos.loc[(autos['model'] == x) & (autos['unrepaired_damage'] == 'yes')] #select rows for each model with unrepaired damage
price = selected_rows['price'].mean() # calculate mean pruce
avg_price_damage[x] = price # add mean price to dictionary
avg_price_damage = pd.Series(avg_price_damage).sort_values(ascending=False) # change dictionary to pd.Series
damage = pd.DataFrame(avg_price_damage,columns=['price_damage']) #change pd.series to pd.DataFrame
damage
price_damage | |
---|---|
viano | 24333.333333 |
a5 | 21649.800000 |
touareg | 20933.333333 |
911 | 20833.333333 |
cayenne | 17666.666667 |
... | ... |
rangerover | NaN |
kappa | NaN |
charade | NaN |
200 | NaN |
b_max | NaN |
243 rows × 1 columns
non_damage['price_damage'] = damage # add column with mean price from damged cars
non_damage.dropna(inplace=True) # drop rows with empty values, because we can't compare them
non_damage
price_non_damage | price_damage | |
---|---|---|
911 | 74240.922481 | 20833.333333 |
defender | 33009.125000 | 12245.000000 |
m_reihe | 28971.975610 | 8900.000000 |
g_klasse | 25956.736842 | 6375.000000 |
cayenne | 25918.923077 | 17666.666667 |
... | ... | ... |
cordoba | 921.172414 | 500.000000 |
r19 | 759.800000 | 662.250000 |
move | 683.166667 | 400.000000 |
lybra | 659.800000 | 799.000000 |
lanos | 593.625000 | 200.000000 |
201 rows × 2 columns
non_damage['price_non_damage'] = non_damage['price_non_damage'].astype(int) #change astype to int
non_damage['price_damage'] = non_damage['price_damage'].astype(int) #change astype to int
# create an extra column (difference) with diffrence in price
non_damage['difference'] = non_damage['price_non_damage'] - non_damage['price_damage']
#create an extra column (difference%) with diffrence in price show in percentages
non_damage['difference(%)'] = (1 - (non_damage['price_damage'] / non_damage['price_non_damage'])) * 100
non_damage.describe()
price_non_damage | price_damage | difference | difference(%) | |
---|---|---|---|---|
count | 201.000000 | 201.000000 | 201.000000 | 201.000000 |
mean | 7210.945274 | 3412.980100 | 3797.965174 | 52.045420 |
std | 7459.927049 | 3855.750868 | 5301.511094 | 23.212794 |
min | 593.000000 | 195.000000 | -4764.000000 | -29.463789 |
25% | 2910.000000 | 1089.000000 | 1255.000000 | 40.549020 |
50% | 5141.000000 | 2285.000000 | 2426.000000 | 56.291074 |
75% | 8916.000000 | 4016.000000 | 4615.000000 | 67.547642 |
max | 74240.000000 | 24333.000000 | 53407.000000 | 93.131868 |
We can see that some cars are getting more expensive when damaged, lets have a closer look at them.
non_damage = non_damage[non_damage['difference'] < 0]
non_damage
price_non_damage | price_damage | difference | difference(%) | |
---|---|---|---|---|
viano | 20365 | 24333 | -3968 | -19.484410 |
touareg | 16169 | 20933 | -4764 | -29.463789 |
cl | 13229 | 14425 | -1196 | -9.040744 |
superb | 9682 | 10639 | -957 | -9.884321 |
159 | 6659 | 7800 | -1141 | -17.134705 |
carnival | 2178 | 2410 | -232 | -10.651974 |
lybra | 659 | 799 | -140 | -21.244310 |
Damage of the cars has big influence on the price. On average the damaged cars are 52% cheaper. We can see that few of damaged counterparts are more expensvive, this beacause the price depednds on many diffrent factor.
We will look for most common model for every brand.
# create a list of brands
brands = []
for x in autos['brand']:
if x not in brands:
brands.append(x)
print(brands)
['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'other', 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia']
# remove 'other' since it's not a brand
brands.remove('other')
print(brands)
['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia']
dtop_model_name = {}
dtop_model_amount = {}
dtop_model_pct = {}
for x in brands:
b = autos[autos['brand'] == x] # select rows for every brand
top_model = b['model'].value_counts().head(1) # find the most common model
top_model_string = top_model.to_string() # change astype to string
top_model_name = top_model_string.split(' ')[0] #use split to get only name of the model
top_model_pct = b.value_counts('model', normalize=True).head(1) # percentages
dtop_model_name[x] = top_model_name # add the name to the dictionary
dtop_model_amount[x] = int(top_model) # add total amount to the dictionary
dtop_model_pct[x] = round(float(top_model_pct), 2) # add percentages round to 2 digits
top_model_by_brand = pd.Series(dtop_model_name) #change dictionary to series
top_model_by_brand_pct = pd.Series(dtop_model_pct) #change dictionary to series
top_model_by_brand_amount = pd.Series(dtop_model_amount) #change dictionary to series
final = pd.DataFrame(top_model_by_brand, columns=['Model']) # change seires to DataFrame
final['Total'] = top_model_by_brand_amount # add Total column
final['Percentage'] = top_model_by_brand_pct.astype(str) + '%' #add percentages columb
final.sort_values('Total', ascending=False, inplace=True) # sort by total
final
Model | Total | Percentage | |
---|---|---|---|
volkswagen | golf | 3707 | 0.39% |
bmw | 3er | 2615 | 0.53% |
opel | corsa | 1592 | 0.33% |
audi | a4 | 1231 | 0.31% |
mercedes_benz | c_klasse | 1136 | 0.26% |
ford | focus | 762 | 0.24% |
renault | twingo | 615 | 0.29% |
peugeot | 2_reihe | 600 | 0.46% |
smart | fortwo | 550 | 0.87% |
fiat | punto | 415 | 0.36% |
seat | ibiza | 328 | 0.4% |
skoda | octavia | 305 | 0.41% |
mini | cooper | 271 | 0.66% |
nissan | micra | 243 | 0.35% |
citroen | other | 219 | 0.35% |
hyundai | i_reihe | 174 | 0.4% |
honda | civic | 166 | 0.48% |
mazda | 3_reihe | 164 | 0.24% |
chevrolet | other | 147 | 0.59% |
porsche | 911 | 141 | 0.5% |
toyota | yaris | 126 | 0.22% |
suzuki | other | 119 | 0.44% |
mitsubishi | colt | 112 | 0.31% |
volvo | v70 | 91 | 0.22% |
alfa_romeo | 156 | 88 | 0.3% |
kia | other | 65 | 0.21% |
daihatsu | cuore | 63 | 0.55% |
rover | other | 55 | 0.93% |
saab | other | 53 | 0.72% |
chrysler | other | 52 | 0.34% |
trabant | 601 | 47 | 0.85% |
dacia | sandero | 44 | 0.36% |
jeep | grand | 43 | 0.43% |
jaguar | other | 39 | 0.54% |
land_rover | freelander | 31 | 0.32% |
subaru | legacy | 30 | 0.32% |
daewoo | matiz | 22 | 0.33% |
lancia | ypsilon | 19 | 0.4% |
lada | niva | 15 | 0.6% |
The most popular model is Volkswagen Golf
.
Almost 87% of all smart
car are model fortwo
In this project we wanted to clean our data and than see how price of the cars changes depending on different factors.
For cleaning data it's important to change column format and rename some of them based on the data dictionary to be more descriptive, to drop columns that have only one value which makes them not important for our analyze, to change astype of columns for one more suitable and to translate data from a foreign language.
The analyze shows that the price is influenced by many factors. Most expensive brands cost around 8000$ and cheapest 3000$. Cars with low mileage are more expensive. Damaged cars are on average 52% cheaper. The price of the car depends on all above and other factors combined.
The most popular model within all brands is Volkswagen Golf.