Cleaning Data From Kleinanziegen -
The German eBay Classifieds
Image Courtesy of Used Cars in Germany
#This is where we bring in our libraries to help complete the analysis
import pandas as pd
import numpy as np
#read in the csv that will be used for the project
autos = pd.read_csv('autos.csv', encoding = 'Latin 1')
#creating a new cell to test if the autos csv was read
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
#looking at the head and info of 'autos' Dataframe will help identify data types and columns
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Looking at the colum names and dataframe reveals some clues about the data to be cleaned. Of the twenty data types 15 are objects(most likely strings) and 5 are int64 (numeric values), which may need to be converted as they have commas in them, rather than periods. Dates will most likely have to be converted or extracted to highlight specific time spans as either days or hours. Names will have to be clenaned of underscores so that models can be extraced and viewed aside from make. The items in Sellers
, offerType
, gearbox
, fuelType
and notRepairedDamage
columns will have to be converted to English from German. The odometer
column will have to have km removed and placed in the column name. There are a few unknowns already visible looking at the info counts suggesting that vehicle type and model may need to be compared to identify missing descriptors. Gear
and fuel type
look to be challenging as other column names don't suggest relatable information. The most intriguing column is the notRepairedDamage
column as it points to possible valuable missing information about the condition of those cars. We'll find out once we look a little closer at the dataset. Finally, the column names could very well use a makeover for more clear and concise nomenclature.
#in this step column names will be changed for better readability
#and all column names in Camel Case (upper and lower case combined) will be
#converted to Snake Case (lower case only)
print(autos.columns)
#convert column names to snakecase (all lowercase)
autos.rename(columns = {'dateCrawled': 'date_crawled', 'name':'name', 'seller':'seller', 'offerType':'offer_type', 'price':'price', 'abtest':'abtest',
'vehicleType':'vehicle_type', 'yearOfRegistration':'registration_year', 'gearbox':'gearbox', 'powerPS':'power_ps', 'model':'model',
'odometer':'odometer', 'monthOfRegistration':'registration_month', 'fuelType':'fuel_type', 'brand':'brand',
'notRepairedDamage':'unrepaired_damage', 'dateCreated':'ad_created', 'nrOfPictures':'nmbr_of_pictures', 'postalCode':'postal_code',
'lastSeen':'last_seen'}, inplace = True)
#verify that changes were successful
print('\n')
print(' **Adjusted Column Headers Below**')
print('\n')
autos.head(2)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object') **Adjusted Column Headers Below**
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nmbr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
Seen above are the changes made to the header column names. By converting columm names to snakecase (lower case) uniformity has been achieved while,hopefully, confusion can be avoided for future data scientists who may mistake header names, in camelcase (upper and lower case), as function naming conventions. Additionally, and more importantly, creating some clear formatting provides improved ease of use for reading header names and clarifies some of the names that were previously crowded together.
#in this step we'll explore 'autos' descriptions and value counts to determine if there's
#data that can be omitted and numeric data, stored as text, that can be cleaned
autos.describe(include = 'all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nmbr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-21 16:37:21 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
#explore columns to get an idea of data characteristics
print('Vehicle Name')
autos['name'].value_counts()
Vehicle Name
Ford_Fiesta 78 Volkswagen_Golf_1.4 75 BMW_316i 75 Volkswagen_Polo 72 BMW_318i 72 .. Audi_80_Sport_Edition 1 Start_Stop_Leder_Lenkrad_mit_neuen_Alweterrreifen 1 Opel_Corsa_1.3_CDTI_DPF_Eco_5_Tuerer_HU_02/18 1 Daihatsu_Cuore_GL___HU_7/2017 1 Golf_1.4_16V_Variant 1 Name: name, Length: 38754, dtype: int64
print('Price')
autos['price'].value_counts()
Price
$0 1421 $500 781 $1,500 734 $2,500 643 $1,000 639 ... $34,650 1 $30,800 1 $8,854 1 $15,888 1 $388 1 Name: price, Length: 2357, dtype: int64
print('Model Name')
autos['model'].value_counts()
Model Name
golf 4024 andere 3528 3er 2761 polo 1757 corsa 1735 ... kalina 2 200 1 rangerover 1 b_max 1 i3 1 Name: model, Length: 245, dtype: int64
print('Mileage on Odometer')
autos['odometer'].value_counts()
Mileage on Odometer
150,000km 32424 125,000km 5170 100,000km 2169 90,000km 1757 80,000km 1436 70,000km 1230 60,000km 1164 50,000km 1027 5,000km 967 40,000km 819 30,000km 789 20,000km 784 10,000km 264 Name: odometer, dtype: int64
print('Fuel Type')
autos['fuel_type'].value_counts()
Fuel Type
benzin 30107 diesel 14567 lpg 691 cng 75 hybrid 37 andere 22 elektro 19 Name: fuel_type, dtype: int64
print('Auto Maker')
autos['brand'].value_counts()
Auto Maker
volkswagen 10687 opel 5461 bmw 5429 mercedes_benz 4734 audi 4283 ford 3479 renault 2404 peugeot 1456 fiat 1308 seat 941 skoda 786 mazda 757 nissan 754 citroen 701 smart 701 toyota 617 sonstige_autos 546 hyundai 488 volvo 457 mini 424 mitsubishi 406 honda 399 kia 356 alfa_romeo 329 porsche 294 suzuki 293 chevrolet 283 chrysler 181 dacia 129 daihatsu 128 jeep 110 subaru 109 land_rover 99 saab 80 daewoo 79 trabant 78 jaguar 77 rover 69 lancia 57 lada 31 Name: brand, dtype: int64
Looking at value counts for specific categories such as model
,name
and brand
(manufacturer) illustrate some redundancies. Unless we want to view very specific model characteristics, the column name
appears to be an unnecessary column since model
and brand
already clarify that information. Price
needs a deeper look to determine a range of values that hold the highest number of average prices. These values may also need to be converted to different currencies depending on who the target audience is for this analysis (current suggestion is U.S. Dollars based on author's location). The same can be said for mileage with a simple conversion from kilometers to miles. Fuel types are in German and, again, depending on the audience should be translated accordingly ex. benzin is German for gasoline or petrol.
For now removal of non numeric characters from price
and odometer
will clean them for subsequent analyses and conversion to numeric data type will allow for formulas requiring conversion between unit measurements.
#renaming column odometer to odeometer_km and converting to numeric data type(dtype)
autos['odometer'] = autos['odometer'].str.replace('km','').str.replace(',','')
autos['odometer'] = autos['odometer'].astype(int)
autos['price'] = autos['price'].str.replace('$','').str.replace(',','')
autos['price'] = autos['price'].astype(int)
autos.rename({'odometer':'odometer_km'}, axis = 1, inplace = True)
#verify non numeric symbol removal and data type change
autos[0:3]['price'],autos[0:3]['odometer_km']
(0 5000 1 8500 2 8990 Name: price, dtype: int64, 0 150000 1 150000 2 70000 Name: odometer_km, dtype: int64)
#looking further into price and odometer_km to find large, unrealistic values
#if those kinds of values appear, removal will occur to normalize the series
autos['price'].unique().shape
(2357,)
autos['price'].describe(include = all)
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
autos['price'].value_counts().sort_index(ascending = False)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 ... 5 2 3 1 2 3 1 156 0 1421 Name: price, Length: 2357, dtype: int64
#after viewing the descriptions and value counts, it's clear to see
#that some outliers in price are over a million on the high end and 0 at the low end
#to remedy this, removal of those values will be executed below
autos = autos[autos['price'].between(100,350000)]
#verify that the prices have fluctuated based on eliminating outliers
autos['price'].describe()
count 48224.000000 mean 5930.371433 std 9078.372762 min 100.000000 25% 1250.000000 50% 3000.000000 75% 7499.000000 max 350000.000000 Name: price, dtype: float64
After removing large price values (in excess of >350,000) and small values (less than <100) we can see how much the mean changed as it dropped from 9840 to 5930. This should allow us to view the average cost of used cars with a bit more confidence and help us start to identify specific types of cars and prices that are used car favorites. Now we'll take a look at the odometer_km
to find out if there's any really large discrepancies in mileage values.
#exploring odometer_km for large/small outliers in mileage
autos['odometer_km'].value_counts(ascending = False)
150000 31212 125000 5037 100000 2101 90000 1733 80000 1412 70000 1214 60000 1153 50000 1009 40000 814 30000 777 5000 760 20000 757 10000 245 Name: odometer_km, dtype: int64
Odometer_km
shows no irregularities in mileage as the values represented suggest normal ranges for used cars in terms of kilometers driven. Before moving on, there were three columns that had negligible and null values. They were seller
,offer_type
, and nmbr_of_pictures
and will be dropped from the analysis.
#dropping the three aforementioned columns
autos = autos.drop(['seller','offer_type','nmbr_of_pictures'], axis = 1)
#converting columns 'date_crawled','ad_created', 'last_seen' from string to numeric values
autos['date_crawled'].value_counts(ascending = False)
2016-03-09 11:54:38 3 2016-03-29 23:42:13 3 2016-03-05 16:57:05 3 2016-03-23 18:39:34 3 2016-03-25 19:57:10 3 .. 2016-03-26 12:50:12 1 2016-03-17 11:29:33 1 2016-03-09 22:55:53 1 2016-03-21 14:55:28 1 2016-03-20 13:46:21 1 Name: date_crawled, Length: 46571, dtype: int64
#removing hours, minutes and seconds from 'date_crawled'
autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False).mul(100).round(2).sort_index()
2016-03-05 2.54 2016-03-06 1.40 2016-03-07 3.61 2016-03-08 3.32 2016-03-09 3.30 2016-03-10 3.23 2016-03-11 3.26 2016-03-12 3.69 2016-03-13 1.57 2016-03-14 3.67 2016-03-15 3.43 2016-03-16 2.95 2016-03-17 3.15 2016-03-18 1.29 2016-03-19 3.47 2016-03-20 3.78 2016-03-21 3.72 2016-03-22 3.29 2016-03-23 3.23 2016-03-24 2.94 2016-03-25 3.15 2016-03-26 3.23 2016-03-27 3.11 2016-03-28 3.50 2016-03-29 3.41 2016-03-30 3.37 2016-03-31 3.19 2016-04-01 3.37 2016-04-02 3.56 2016-04-03 3.86 2016-04-04 3.65 2016-04-05 1.31 2016-04-06 0.32 2016-04-07 0.14 Name: date_crawled, dtype: float64
It appears that the highest percengage of 'crawling' (searching a website using bots or crawl agents) occured on 2016-04-03 @ 3.86%. Below we will see if any of the other dates can correlate any relevance to the highest percentage of crawling*
#removing hours, minutes and seconds from 'ad_created'
autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).mul(100).round(2).sort_index()
2015-06-11 0.00 2015-08-10 0.00 2015-09-09 0.00 2015-11-10 0.00 2015-12-05 0.00 ... 2016-04-03 3.89 2016-04-04 3.69 2016-04-05 1.18 2016-04-06 0.33 2016-04-07 0.12 Name: ad_created, Length: 76, dtype: float64
When viewing the ad_listed
column the highest percentage of listings occurs on 2016-04-03 @ 3.89%, which is the same day as the highest crawl count. This makes sense considering that the most time crawling would match the date with the most posts.
#removing hours, minutes and seconds from 'last_seen'
autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).mul(100).round(2).sort_index()
2016-03-05 0.11 2016-03-06 0.43 2016-03-07 0.54 2016-03-08 0.73 2016-03-09 0.96 2016-03-10 1.06 2016-03-11 1.24 2016-03-12 2.38 2016-03-13 0.89 2016-03-14 1.26 2016-03-15 1.59 2016-03-16 1.64 2016-03-17 2.81 2016-03-18 0.73 2016-03-19 1.58 2016-03-20 2.07 2016-03-21 2.05 2016-03-22 2.14 2016-03-23 1.86 2016-03-24 1.98 2016-03-25 1.91 2016-03-26 1.67 2016-03-27 1.56 2016-03-28 2.08 2016-03-29 2.23 2016-03-30 2.47 2016-03-31 2.38 2016-04-01 2.29 2016-04-02 2.49 2016-04-03 2.51 2016-04-04 2.45 2016-04-05 12.51 2016-04-06 22.20 2016-04-07 13.22 Name: last_seen, dtype: float64
The last_seen
column is a bit different because we're inferring that the highest percentage of times these listings were seen reprsents a spike not in sales, but in the number of cars still listed. So three days after (2016-04-03) the posting (2016-04-06) crawler bots show the highest number of listings at 22%. Could've just been a slow day for sales, or a trend in the data not yet seen? It directly coincides with the bots two lowest percentage crawls, which would make sense, suggesting that less activity (sales) resulted fewer crawls over the website.
#examining the distribution of registration years
autos['registration_year'].describe()
count 48224.000000 mean 2004.730964 std 87.897388 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
#it appears that there are some values that cannot exist (ex. 1000).
#look at unique values
autos['registration_year'].describe()
count 48224.000000 mean 2004.730964 std 87.897388 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
Registration year
has some values that cannot exist. Cars were not invented until the late 19th century, early 20th. It's safe to say that we'll have to create a realistic range for these cars to fall in and remove the others. It's clear to see that the newer te car the higher the percentage. This suggests that newer cars, most likely, have better resale value.
#create a range between 1900 and 2016
autos = autos[autos["registration_year"].between(1900, 2016)]
autos['registration_year'].value_counts(normalize = True).mul(100).round(2)
2000 6.70 2005 6.28 1999 6.21 2004 5.82 2003 5.81 ... 1938 0.00 1939 0.00 1953 0.00 1943 0.00 1952 0.00 Name: registration_year, Length: 78, dtype: float64
#identifying unique values in the 'brand' column
autos['brand'].unique()
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos', 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia'], dtype=object)
#finding highest percentages of 'brand' within the dataframe
autos['brand'].value_counts(normalize = True, dropna = False).mul(100).round(2)
volkswagen 21.14 bmw 11.02 opel 10.72 mercedes_benz 9.67 audi 8.68 ford 6.98 renault 4.71 peugeot 2.99 fiat 2.56 seat 1.83 skoda 1.64 nissan 1.53 mazda 1.52 smart 1.42 citroen 1.40 toyota 1.28 hyundai 1.00 sonstige_autos 0.95 volvo 0.91 mini 0.88 mitsubishi 0.82 honda 0.79 kia 0.71 alfa_romeo 0.67 porsche 0.60 suzuki 0.59 chevrolet 0.57 chrysler 0.35 dacia 0.27 daihatsu 0.25 jeep 0.23 subaru 0.21 land_rover 0.21 saab 0.17 jaguar 0.15 daewoo 0.15 trabant 0.14 rover 0.13 lancia 0.11 lada 0.06 Name: brand, dtype: float64
#assign the top ten brands
top_10_brands = autos['brand'].value_counts(normalize = True, dropna = False).head(10).mul(100).round(2).index
print(top_10_brands)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat'], dtype='object')
Looking at the list of auto brands shows that the top ten automakers represent close to 75 percent of the availablle vehicles. This shows manufacterers who have an established market presence resulting in added options, across models, for buyers.
#creating an empty dictionary to hold the top ten 'brands'
top_brand_values = {}
for brand_10 in top_10_brands:
mean_price = autos[autos['brand'] == brand_10]['price'].mean().round()
top_brand_values[brand_10] = mean_price
top_brand_values
{'volkswagen': 5437.0, 'bmw': 8382.0, 'opel': 3005.0, 'mercedes_benz': 8673.0, 'audi': 9381.0, 'ford': 3779.0, 'renault': 2496.0, 'peugeot': 3114.0, 'fiat': 2837.0, 'seat': 4433.0}
Price
doesn't necessarily correlate to brand
, but it makes sense that Volkswagen sits right in the middle of the range of values for mean. As Volkswagen reprsents the highest selling brand
present would suggest consumer confidence while mid-range pricing illustrates afffordability without skimping on build out factors that make cars reliable. Below we will create the same dictionary for mileage by mean and create a dataframe from series objects derived from the series constructors
#finding highest mean kilometers of 'odometer_km' within the series
top_brand_kilometers = {}
for brand_new in top_10_brands:
mean_km = autos[autos['brand']== brand_new]['odometer_km'].mean().round()
top_brand_kilometers[brand_new]= mean_km
top_brand_kilometers
{'volkswagen': 128800.0, 'bmw': 132695.0, 'opel': 129384.0, 'mercedes_benz': 131026.0, 'audi': 129245.0, 'ford': 124277.0, 'renault': 128281.0, 'peugeot': 127128.0, 'fiat': 116950.0, 'seat': 121537.0}
#creating series for top_brand_kilometers and top_brand_values
tbv_series = pd.Series(top_brand_values).sort_values(ascending = False)
tbk_series = pd.Series(top_brand_kilometers).sort_values(ascending = False)
#combine both tbv and tbk series into a dataframe
price_km_df = pd.DataFrame(tbv_series, columns = ['mean_prices'])
#add column for tbk_series
price_km_df['mean_kilometers'] = tbk_series
price_km_df
mean_prices | mean_kilometers | |
---|---|---|
audi | 9381.0 | 129245.0 |
mercedes_benz | 8673.0 | 131026.0 |
bmw | 8382.0 | 132695.0 |
volkswagen | 5437.0 | 128800.0 |
seat | 4433.0 | 121537.0 |
ford | 3779.0 | 124277.0 |
peugeot | 3114.0 | 127128.0 |
opel | 3005.0 | 129384.0 |
fiat | 2837.0 | 116950.0 |
renault | 2496.0 | 128281.0 |
After narrowing our results to see top brands, prices and mileage we can see that mileage is in a range of 10,000 km across brands, suggesting that mileage may effect price at higher (lower price) and lower(higher price) values. Of note would be the significance of sales price in relation to location. We will save that for another project. For now, it appears, that buying a volkswagen is the most popular choice amongst used car buyers and should not come as a surprise cosidering they are Germany's largest automaker.
We'll take a look at these unique values first, translate them and map their English translations.
#changing column data that has German words to English
autos.head(3)
#columns vehicle_type, gearbox, and fuel_type will be translated
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
#create dictionary, convert via .map
autos['vehicle_type'].unique()
array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv', 'cabrio', 'andere'], dtype=object)
#create dictionary, convert via .map
vehicle_dict = {'bus':'bus','limousine':'limousine','kleinwagen':'small car','kombi':'station_wagon','nan':'unknown','coupe':'two_door','suv':'suv','cabrio':'convertible','andere':'other'}
autos['vehicle_type'] = autos['vehicle_type'].map(vehicle_dict)
#verify translation changes
autos['vehicle_type'].unique()
array(['bus', 'limousine', 'small car', 'station_wagon', nan, 'two_door', 'suv', 'convertible', 'other'], dtype=object)
autos['gearbox'].unique()
array(['manuell', 'automatik', nan], dtype=object)
gear_dict = {'manuell':'manual','automatik':'automatic'}
autos['gearbox'] = autos['gearbox'].map(gear_dict)
autos['gearbox'].unique()
array(['manual', 'automatic', nan], dtype=object)
autos['fuel_type'].unique()
array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro', 'andere'], dtype=object)
fuel_dict = {'lpg':'lpg','benzin':'gas','diesel':'diesel','cng':'cng','hybrid':'hybrid','elektro':'electric','andere':'other'}
autos['fuel_type'] = autos['fuel_type'].map(fuel_dict)
autos['fuel_type'].unique()
array(['lpg', 'gas', 'diesel', nan, 'cng', 'hybrid', 'electric', 'other'], dtype=object)
autos['unrepaired_damage'].unique()
array(['nein', nan, 'ja'], dtype=object)
unrepaired_dict = {'nein':'no','ja':'yes'}
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(unrepaired_dict)
autos['unrepaired_damage'].unique()
array(['no', nan, 'yes'], dtype=object)
Conversion of all date columns to uniform, numeric format will be perfomed to allow for further queries against the dataset. Converting date formats from yyyy-dd-mm to yyyyddmm by removing dashes and extra characters from columns date_crawled
, ad_created
& last_seen
.
#starting with date_crawled column by removing hours, mins, secs and dashes
autos['date_crawled'].head()
0 2016-03-26 17:47:46 1 2016-04-04 13:38:56 2 2016-03-26 18:57:24 3 2016-03-12 16:58:10 4 2016-04-01 14:38:50 Name: date_crawled, dtype: object
autos['date_crawled_new'] = (autos['date_crawled']
.str.split()
.str[0])
autos['date_crawled_new'].head(3)
0 2016-03-26 1 2016-04-04 2 2016-03-26 Name: date_crawled_new, dtype: object
autos['date_crawled_new'] = autos['date_crawled_new'].str.replace('-','')
print(autos['date_crawled_new'])
0 20160326 1 20160404 2 20160326 3 20160312 4 20160401 ... 49995 20160327 49996 20160328 49997 20160402 49998 20160308 49999 20160314 Name: date_crawled_new, Length: 46352, dtype: object
autos['date_crawled_new'] = autos['date_crawled_new'].astype(int)
autos['date_crawled_new'].head(5)
0 20160326 1 20160404 2 20160326 3 20160312 4 20160401 Name: date_crawled_new, dtype: int64
#starting with ad_created column by removing hours, mins, secs and dashes
autos['ad_created'].head(2)
0 2016-03-26 00:00:00 1 2016-04-04 00:00:00 Name: ad_created, dtype: object
autos['ad_created_new'] = (autos['ad_created']
.str.split()
.str[0])
autos['ad_created_new'].head(3)
0 2016-03-26 1 2016-04-04 2 2016-03-26 Name: ad_created_new, dtype: object
autos['ad_created_new'] = autos['ad_created_new'].str.replace('-','')
print(autos['ad_created_new'])
0 20160326 1 20160404 2 20160326 3 20160312 4 20160401 ... 49995 20160327 49996 20160328 49997 20160402 49998 20160308 49999 20160313 Name: ad_created_new, Length: 46352, dtype: object
autos['ad_created_new'] = autos['ad_created_new'].astype(int)
autos['ad_created_new'].head(3)
0 20160326 1 20160404 2 20160326 Name: ad_created_new, dtype: int64
#starting with last_seen column by removing hours, mins, secs and dashes
autos['last_seen'].head(3)
0 2016-04-06 06:45:54 1 2016-04-06 14:45:08 2 2016-04-06 20:15:37 Name: last_seen, dtype: object
autos['last_seen_new'] = (autos['last_seen']
.str.split()
.str[0])
autos['last_seen_new'].head(2)
0 2016-04-06 1 2016-04-06 Name: last_seen_new, dtype: object
autos['last_seen_new'] = autos['last_seen_new'].str.replace('-','')
print(autos['last_seen_new'])
0 20160406 1 20160406 2 20160406 3 20160315 4 20160401 ... 49995 20160401 49996 20160402 49997 20160404 49998 20160405 49999 20160406 Name: last_seen_new, Length: 46352, dtype: object
autos['last_seen_new'] = autos['last_seen_new'].astype(int)
autos['last_seen_new'].head(3)
0 20160406 1 20160406 2 20160406 Name: last_seen_new, dtype: int64
Now that we've converted the date columns to numeric integer format in new columns, we can remove the original columns.
#dropping unformatted date columns, keeping newly created ones
autos = autos.drop(['date_crawled', 'ad_created','last_seen'], axis = 1)
Exploring the name
column to discover more descriptive and relevant keywords.
#look over name column for key words that may be helpful
autos['name'].head(10)
0 Peugeot_807_160_NAVTECH_ON_BOARD 1 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 2 Volkswagen_Golf_1.6_United 3 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 5 Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... 6 VW_Golf_III_GT_Special_Electronic_Green_Metall... 7 Golf_IV_1.9_TDI_90PS 8 Seat_Arosa 9 Renault_Megane_Scenic_1.6e_RT_Klimaanlage Name: name, dtype: object
#remove underscores
autos['name'] = (autos['name'].str.replace('_',' '))
autos['name'].head(3)
0 Peugeot 807 160 NAVTECH ON BOARD 1 BMW 740i 4 4 Liter HAMANN UMBAU Mega Optik 2 Volkswagen Golf 1.6 United Name: name, dtype: object
#split the column into workable lists to choose model type. Will most likely look at first four
autos['name'] = (autos['name']
.str.split()
.str[0:3])
print(autos['name'])
0 [Peugeot, 807, 160] 1 [BMW, 740i, 4] 2 [Volkswagen, Golf, 1.6] 3 [Smart, smart, fortwo] 4 [Ford, Focus, 1] ... 49995 [Audi, Q5, 3.0] 49996 [Opel, Astra, F] 49997 [Fiat, 500, C] 49998 [Audi, A3, 2.0] 49999 [Opel, Vectra, 1.6] Name: name, Length: 46352, dtype: object
#the newly created lists from the name column appear to have model at the second index.
#put these in a new column and remove name column
autos['name'] = (autos['name'].str[1])
#rename name column to model
autos.head()
name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | postal_code | date_crawled_new | ad_created_new | last_seen_new | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 807 | 5000 | control | bus | 2004 | manual | 158 | andere | 150000 | 3 | lpg | peugeot | no | 79588 | 20160326 | 20160326 | 20160406 |
1 | 740i | 8500 | control | limousine | 1997 | automatic | 286 | 7er | 150000 | 6 | gas | bmw | no | 71034 | 20160404 | 20160404 | 20160406 |
2 | Golf | 8990 | test | limousine | 2009 | manual | 102 | golf | 70000 | 7 | gas | volkswagen | no | 35394 | 20160326 | 20160326 | 20160406 |
3 | smart | 4350 | control | small car | 2007 | automatic | 71 | fortwo | 70000 | 6 | gas | smart | no | 33729 | 20160312 | 20160312 | 20160315 |
4 | Focus | 1350 | test | station_wagon | 2003 | manual | 0 | focus | 150000 | 7 | gas | ford | no | 39218 | 20160401 | 20160401 | 20160401 |
While breaking apart the name
column clarified some model name extensions, it was difficult to determine what to keep with an extremeley high number of unique names in the dataset. For now using brand
and name
should be sufficient for finding the most common brand and model combination. Just like brand
, we'll use the top ten for brand_model
combination below.
#clean up columns name and brand by capitalizing first letter of each word
autos['brand'] = autos['brand'].str.capitalize()
autos['name'] = autos['name'].str.capitalize()
#combine columns brand and name to create a brand model column
autos['brand_model'] = autos['brand'] + ' ' + autos['name']
#checking the number of unique brand models in autos
autos['brand_model'].nunique()
3974
#create the top ten brand model percentage counts
autos['brand_model'].value_counts(normalize = True, dropna = False).head(10).mul(100).round(2)
Mercedes_benz Benz 6.96 Volkswagen Golf 5.56 Opel Corsa 2.94 Volkswagen Polo 2.72 Opel Astra 2.55 Volkswagen Passat 2.39 Audi A4 2.32 Audi A6 1.56 Audi A3 1.55 Ford Focus 1.51 Name: brand_model, dtype: float64
The top 10 brand models contain nine German automakers, with Mercedes and Volkswagen coming in with over 12.5% of popular models. This shouldn't come as a surprise considering that this site is based in Germany. It Should be noted that Benz is not a model name, but rather a part of the brand, suggesting that name may be a bit unreliable. For now we'll focus on the current result. Let's explore the odometer_km
column by splitting it into groups to see if mileage ranges have a specific impact on pricing.
#breaking up the odometer_km column into groups using the groupby method
#and finding the average price by mileage
split_odom = autos.groupby('odometer_km')['price'].mean()
split_odom.head(13)
odometer_km 5000 9794.102672 10000 21255.304721 20000 18523.200271 30000 16674.564069 40000 15518.952261 50000 13853.911111 60000 12406.989343 70000 10954.826014 80000 9743.110058 90000 8470.087321 100000 8188.223581 125000 6238.277181 150000 3792.320241 Name: price, dtype: float64
As we can see the average cost per 10,000 km driven decreases significantly after 10,000 km. This makes sense as the cost of maintenance and upkeep increases with added mileage. Seen here is an inverse relationship to distnace driven. As km increase, average price decreases. It should be noted that the average cost of cars driven 5000 km shows an average price that is half that of those driven 10,000 km. We'lll close out this analysis by seeing if damage to cars, listed under unrepaired_damage
, shows significant differences between values
#create means for damaged and undamaged then compare price
damaged_cars = autos[autos['unrepaired_damage']== 'yes']
damaged_price = damaged_cars['price'].mean()
print('$',damaged_price.round(2))
$ 2269.58
undamaged_cars = autos[autos['unrepaired_damage']=='no']
undamaged_price = undamaged_cars['price'].mean()
print('$',undamaged_price.round(2))
$ 7177.14
#show difference in price between damaged and undamaged cars
print('Damaged cars sell for $',(undamaged_cars['price'].mean().round(2)) - (damaged_cars['price'].mean().round(2)), 'less than undamaged cars')
Damaged cars sell for $ 4907.56 less than undamaged cars
-Buying used cars from Kleinanziegen shows that prices generally decrase with increasing mileage and unrepaired damage.
-The majority of brand automakers are from Germany, so variety for international autos is limited.