Cleaning Data From Kleinanziegen -
The German eBay Classifieds

image Image Courtesy of Used Cars in Germany


This analysis will be looking at a dataset that was scraped by orgesleka from the classifieds section of German eBay (Kleinanzeigen). There are apporximately 50,000 data points. The purpose of this project is to clean the data and analyze it for a deeper understanding of the varied used car listings. Usage of Jupyter Notebooks and Pandas will allow us to dig down into the data for fine tuning and detailed insighgts.


In [100]:
#This is where we bring in our libraries to help complete the analysis

import pandas as pd
import numpy as np

#read in the csv that will be used for the project

autos = pd.read_csv('autos.csv', encoding = 'Latin 1')
In [101]:
#creating a new cell to test if the autos csv was read

autos
Out[101]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon privat Angebot $24,900 control limousine 2011 automatik 239 q5 100,000km 1 diesel audi nein 2016-03-27 00:00:00 0 82131 2016-04-01 13:47:40
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot $1,980 control cabrio 1996 manuell 75 astra 150,000km 5 benzin opel nein 2016-03-28 00:00:00 0 44807 2016-04-02 14:18:02
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot $13,200 test cabrio 2014 automatik 69 500 5,000km 11 benzin fiat nein 2016-04-02 00:00:00 0 73430 2016-04-04 11:47:27
49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition privat Angebot $22,900 control kombi 2013 manuell 150 a3 40,000km 11 diesel audi nein 2016-03-08 00:00:00 0 35683 2016-04-05 16:45:07
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V privat Angebot $1,250 control limousine 1996 manuell 101 vectra 150,000km 1 benzin opel nein 2016-03-13 00:00:00 0 45897 2016-04-06 21:18:48

50000 rows × 20 columns

In [102]:
#looking at the head and info of 'autos' Dataframe will help identify data types and columns

autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
Out[102]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50


Looking at the colum names and dataframe reveals some clues about the data to be cleaned. Of the twenty data types 15 are objects(most likely strings) and 5 are int64 (numeric values), which may need to be converted as they have commas in them, rather than periods. Dates will most likely have to be converted or extracted to highlight specific time spans as either days or hours. Names will have to be clenaned of underscores so that models can be extraced and viewed aside from make. The items in Sellers, offerType, gearbox, fuelType and notRepairedDamage columns will have to be converted to English from German. The odometer column will have to have km removed and placed in the column name. There are a few unknowns already visible looking at the info counts suggesting that vehicle type and model may need to be compared to identify missing descriptors. Gear and fuel type look to be challenging as other column names don't suggest relatable information. The most intriguing column is the notRepairedDamage column as it points to possible valuable missing information about the condition of those cars. We'll find out once we look a little closer at the dataset. Finally, the column names could very well use a makeover for more clear and concise nomenclature.


In [103]:
#in this step column names will be changed for better readability
#and all column names in Camel Case (upper and lower case combined) will be 
#converted to Snake Case (lower case only)

print(autos.columns)

#convert column names to snakecase (all lowercase)

autos.rename(columns = {'dateCrawled': 'date_crawled', 'name':'name', 'seller':'seller', 'offerType':'offer_type', 'price':'price', 'abtest':'abtest',
       'vehicleType':'vehicle_type', 'yearOfRegistration':'registration_year', 'gearbox':'gearbox', 'powerPS':'power_ps', 'model':'model',
       'odometer':'odometer', 'monthOfRegistration':'registration_month', 'fuelType':'fuel_type', 'brand':'brand',
       'notRepairedDamage':'unrepaired_damage', 'dateCreated':'ad_created', 'nrOfPictures':'nmbr_of_pictures', 'postalCode':'postal_code',
       'lastSeen':'last_seen'}, inplace = True) 

#verify that changes were successful

print('\n')
print('                                     **Adjusted Column Headers Below**')
print('\n')
autos.head(2)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')


                                     **Adjusted Column Headers Below**


Out[103]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nmbr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08


Seen above are the changes made to the header column names. By converting columm names to snakecase (lower case) uniformity has been achieved while,hopefully, confusion can be avoided for future data scientists who may mistake header names, in camelcase (upper and lower case), as function naming conventions. Additionally, and more importantly, creating some clear formatting provides improved ease of use for reading header names and clarifies some of the names that were previously crowded together.


In [104]:
#in this step we'll explore 'autos' descriptions and value counts to determine if there's
#data that can be omitted and numeric data, stored as text, that can be cleaned

autos.describe(include = 'all')
Out[104]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nmbr_of_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-21 16:37:21 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN
In [105]:
#explore columns to get an idea of data characteristics

print('Vehicle Name')
autos['name'].value_counts()
Vehicle Name
Out[105]:
Ford_Fiesta                                          78
Volkswagen_Golf_1.4                                  75
BMW_316i                                             75
Volkswagen_Polo                                      72
BMW_318i                                             72
                                                     ..
Audi_80_Sport_Edition                                 1
Start_Stop_Leder_Lenkrad_mit_neuen_Alweterrreifen     1
Opel_Corsa_1.3_CDTI_DPF_Eco_5_Tuerer_HU_02/18         1
Daihatsu_Cuore_GL___HU_7/2017                         1
Golf_1.4_16V_Variant                                  1
Name: name, Length: 38754, dtype: int64
In [106]:
print('Price')
autos['price'].value_counts()
Price
Out[106]:
$0         1421
$500        781
$1,500      734
$2,500      643
$1,000      639
           ... 
$34,650       1
$30,800       1
$8,854        1
$15,888       1
$388          1
Name: price, Length: 2357, dtype: int64
In [107]:
print('Model Name')
autos['model'].value_counts()
Model Name
Out[107]:
golf          4024
andere        3528
3er           2761
polo          1757
corsa         1735
              ... 
kalina           2
200              1
rangerover       1
b_max            1
i3               1
Name: model, Length: 245, dtype: int64
In [108]:
print('Mileage on Odometer')
autos['odometer'].value_counts()
Mileage on Odometer
Out[108]:
150,000km    32424
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer, dtype: int64
In [109]:
print('Fuel Type')
autos['fuel_type'].value_counts()
Fuel Type
Out[109]:
benzin     30107
diesel     14567
lpg          691
cng           75
hybrid        37
andere        22
elektro       19
Name: fuel_type, dtype: int64
In [110]:
print('Auto Maker')
autos['brand'].value_counts()
Auto Maker
Out[110]:
volkswagen        10687
opel               5461
bmw                5429
mercedes_benz      4734
audi               4283
ford               3479
renault            2404
peugeot            1456
fiat               1308
seat                941
skoda               786
mazda               757
nissan              754
citroen             701
smart               701
toyota              617
sonstige_autos      546
hyundai             488
volvo               457
mini                424
mitsubishi          406
honda               399
kia                 356
alfa_romeo          329
porsche             294
suzuki              293
chevrolet           283
chrysler            181
dacia               129
daihatsu            128
jeep                110
subaru              109
land_rover           99
saab                 80
daewoo               79
trabant              78
jaguar               77
rover                69
lancia               57
lada                 31
Name: brand, dtype: int64


Looking at value counts for specific categories such as model,name and brand(manufacturer) illustrate some redundancies. Unless we want to view very specific model characteristics, the column name appears to be an unnecessary column since model and brand already clarify that information. Price needs a deeper look to determine a range of values that hold the highest number of average prices. These values may also need to be converted to different currencies depending on who the target audience is for this analysis (current suggestion is U.S. Dollars based on author's location). The same can be said for mileage with a simple conversion from kilometers to miles. Fuel types are in German and, again, depending on the audience should be translated accordingly ex. benzin is German for gasoline or petrol.

For now removal of non numeric characters from price and odometer will clean them for subsequent analyses and conversion to numeric data type will allow for formulas requiring conversion between unit measurements.


In [111]:
#renaming column odometer to odeometer_km and converting to numeric data type(dtype)

autos['odometer'] = autos['odometer'].str.replace('km','').str.replace(',','')
autos['odometer'] = autos['odometer'].astype(int)
autos['price'] = autos['price'].str.replace('$','').str.replace(',','')
autos['price'] = autos['price'].astype(int)

autos.rename({'odometer':'odometer_km'}, axis = 1, inplace = True)
 
In [112]:
#verify non numeric symbol removal and data type change

autos[0:3]['price'],autos[0:3]['odometer_km']
Out[112]:
(0    5000
 1    8500
 2    8990
 Name: price, dtype: int64,
 0    150000
 1    150000
 2     70000
 Name: odometer_km, dtype: int64)
In [113]:
#looking further into price and odometer_km to find large, unrealistic values
#if those kinds of values appear, removal will occur to normalize the series

autos['price'].unique().shape
Out[113]:
(2357,)
In [114]:
autos['price'].describe(include = all)
Out[114]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
In [115]:
autos['price'].value_counts().sort_index(ascending = False)
Out[115]:
99999999       1
27322222       1
12345678       3
11111111       2
10000000       1
            ... 
5              2
3              1
2              3
1            156
0           1421
Name: price, Length: 2357, dtype: int64
In [116]:
#after viewing the descriptions and value counts, it's clear to see
#that some outliers in price are over a million on the high end and 0 at the low end
#to remedy this, removal of those values will be executed below

autos = autos[autos['price'].between(100,350000)]
In [117]:
#verify that the prices have fluctuated based on eliminating outliers

autos['price'].describe()
Out[117]:
count     48224.000000
mean       5930.371433
std        9078.372762
min         100.000000
25%        1250.000000
50%        3000.000000
75%        7499.000000
max      350000.000000
Name: price, dtype: float64


After removing large price values (in excess of >350,000) and small values (less than <100) we can see how much the mean changed as it dropped from 9840 to 5930. This should allow us to view the average cost of used cars with a bit more confidence and help us start to identify specific types of cars and prices that are used car favorites. Now we'll take a look at the odometer_km to find out if there's any really large discrepancies in mileage values.


In [118]:
#exploring odometer_km for large/small outliers in mileage

autos['odometer_km'].value_counts(ascending = False)
Out[118]:
150000    31212
125000     5037
100000     2101
90000      1733
80000      1412
70000      1214
60000      1153
50000      1009
40000       814
30000       777
5000        760
20000       757
10000       245
Name: odometer_km, dtype: int64


Odometer_km shows no irregularities in mileage as the values represented suggest normal ranges for used cars in terms of kilometers driven. Before moving on, there were three columns that had negligible and null values. They were seller,offer_type, and nmbr_of_pictures and will be dropped from the analysis.


In [119]:
#dropping the three aforementioned columns

autos = autos.drop(['seller','offer_type','nmbr_of_pictures'], axis = 1)
In [120]:
#converting columns 'date_crawled','ad_created', 'last_seen' from string to numeric values

autos['date_crawled'].value_counts(ascending = False)
Out[120]:
2016-03-09 11:54:38    3
2016-03-29 23:42:13    3
2016-03-05 16:57:05    3
2016-03-23 18:39:34    3
2016-03-25 19:57:10    3
                      ..
2016-03-26 12:50:12    1
2016-03-17 11:29:33    1
2016-03-09 22:55:53    1
2016-03-21 14:55:28    1
2016-03-20 13:46:21    1
Name: date_crawled, Length: 46571, dtype: int64
In [121]:
#removing hours, minutes and seconds from 'date_crawled'

autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False).mul(100).round(2).sort_index()
Out[121]:
2016-03-05    2.54
2016-03-06    1.40
2016-03-07    3.61
2016-03-08    3.32
2016-03-09    3.30
2016-03-10    3.23
2016-03-11    3.26
2016-03-12    3.69
2016-03-13    1.57
2016-03-14    3.67
2016-03-15    3.43
2016-03-16    2.95
2016-03-17    3.15
2016-03-18    1.29
2016-03-19    3.47
2016-03-20    3.78
2016-03-21    3.72
2016-03-22    3.29
2016-03-23    3.23
2016-03-24    2.94
2016-03-25    3.15
2016-03-26    3.23
2016-03-27    3.11
2016-03-28    3.50
2016-03-29    3.41
2016-03-30    3.37
2016-03-31    3.19
2016-04-01    3.37
2016-04-02    3.56
2016-04-03    3.86
2016-04-04    3.65
2016-04-05    1.31
2016-04-06    0.32
2016-04-07    0.14
Name: date_crawled, dtype: float64


It appears that the highest percengage of 'crawling' (searching a website using bots or crawl agents) occured on 2016-04-03 @ 3.86%. Below we will see if any of the other dates can correlate any relevance to the highest percentage of crawling*


In [122]:
#removing hours, minutes and seconds from 'ad_created'

autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).mul(100).round(2).sort_index()
Out[122]:
2015-06-11    0.00
2015-08-10    0.00
2015-09-09    0.00
2015-11-10    0.00
2015-12-05    0.00
              ... 
2016-04-03    3.89
2016-04-04    3.69
2016-04-05    1.18
2016-04-06    0.33
2016-04-07    0.12
Name: ad_created, Length: 76, dtype: float64


When viewing the ad_listed column the highest percentage of listings occurs on 2016-04-03 @ 3.89%, which is the same day as the highest crawl count. This makes sense considering that the most time crawling would match the date with the most posts.


In [123]:
#removing hours, minutes and seconds from 'last_seen'

autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).mul(100).round(2).sort_index()
Out[123]:
2016-03-05     0.11
2016-03-06     0.43
2016-03-07     0.54
2016-03-08     0.73
2016-03-09     0.96
2016-03-10     1.06
2016-03-11     1.24
2016-03-12     2.38
2016-03-13     0.89
2016-03-14     1.26
2016-03-15     1.59
2016-03-16     1.64
2016-03-17     2.81
2016-03-18     0.73
2016-03-19     1.58
2016-03-20     2.07
2016-03-21     2.05
2016-03-22     2.14
2016-03-23     1.86
2016-03-24     1.98
2016-03-25     1.91
2016-03-26     1.67
2016-03-27     1.56
2016-03-28     2.08
2016-03-29     2.23
2016-03-30     2.47
2016-03-31     2.38
2016-04-01     2.29
2016-04-02     2.49
2016-04-03     2.51
2016-04-04     2.45
2016-04-05    12.51
2016-04-06    22.20
2016-04-07    13.22
Name: last_seen, dtype: float64


The last_seen column is a bit different because we're inferring that the highest percentage of times these listings were seen reprsents a spike not in sales, but in the number of cars still listed. So three days after (2016-04-03) the posting (2016-04-06) crawler bots show the highest number of listings at 22%. Could've just been a slow day for sales, or a trend in the data not yet seen? It directly coincides with the bots two lowest percentage crawls, which would make sense, suggesting that less activity (sales) resulted fewer crawls over the website.


In [124]:
#examining the distribution of registration years

autos['registration_year'].describe()
Out[124]:
count    48224.000000
mean      2004.730964
std         87.897388
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
In [125]:
#it appears that there are some values that cannot exist (ex. 1000).
#look at unique values

autos['registration_year'].describe()
Out[125]:
count    48224.000000
mean      2004.730964
std         87.897388
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64


Registration year has some values that cannot exist. Cars were not invented until the late 19th century, early 20th. It's safe to say that we'll have to create a realistic range for these cars to fall in and remove the others. It's clear to see that the newer te car the higher the percentage. This suggests that newer cars, most likely, have better resale value.


In [126]:
#create a range between 1900 and 2016

autos = autos[autos["registration_year"].between(1900, 2016)]
autos['registration_year'].value_counts(normalize = True).mul(100).round(2)
Out[126]:
2000    6.70
2005    6.28
1999    6.21
2004    5.82
2003    5.81
        ... 
1938    0.00
1939    0.00
1953    0.00
1943    0.00
1952    0.00
Name: registration_year, Length: 78, dtype: float64
In [127]:
#identifying unique values in the 'brand' column 

autos['brand'].unique()
Out[127]:
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler',
       'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos',
       'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan',
       'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda',
       'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda',
       'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada',
       'rover', 'daihatsu', 'lancia'], dtype=object)
In [128]:
#finding highest percentages of 'brand' within the dataframe

autos['brand'].value_counts(normalize = True, dropna = False).mul(100).round(2)
Out[128]:
volkswagen        21.14
bmw               11.02
opel              10.72
mercedes_benz      9.67
audi               8.68
ford               6.98
renault            4.71
peugeot            2.99
fiat               2.56
seat               1.83
skoda              1.64
nissan             1.53
mazda              1.52
smart              1.42
citroen            1.40
toyota             1.28
hyundai            1.00
sonstige_autos     0.95
volvo              0.91
mini               0.88
mitsubishi         0.82
honda              0.79
kia                0.71
alfa_romeo         0.67
porsche            0.60
suzuki             0.59
chevrolet          0.57
chrysler           0.35
dacia              0.27
daihatsu           0.25
jeep               0.23
subaru             0.21
land_rover         0.21
saab               0.17
jaguar             0.15
daewoo             0.15
trabant            0.14
rover              0.13
lancia             0.11
lada               0.06
Name: brand, dtype: float64
In [129]:
#assign the top ten brands

top_10_brands = autos['brand'].value_counts(normalize = True, dropna = False).head(10).mul(100).round(2).index



print(top_10_brands)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat'],
      dtype='object')


Looking at the list of auto brands shows that the top ten automakers represent close to 75 percent of the availablle vehicles. This shows manufacterers who have an established market presence resulting in added options, across models, for buyers.


In [130]:
#creating an empty dictionary to hold the top ten 'brands'

top_brand_values = {}

for brand_10 in top_10_brands:
    mean_price = autos[autos['brand'] == brand_10]['price'].mean().round()
    top_brand_values[brand_10] = mean_price

top_brand_values
Out[130]:
{'volkswagen': 5437.0,
 'bmw': 8382.0,
 'opel': 3005.0,
 'mercedes_benz': 8673.0,
 'audi': 9381.0,
 'ford': 3779.0,
 'renault': 2496.0,
 'peugeot': 3114.0,
 'fiat': 2837.0,
 'seat': 4433.0}


Price doesn't necessarily correlate to brand, but it makes sense that Volkswagen sits right in the middle of the range of values for mean. As Volkswagen reprsents the highest selling brand present would suggest consumer confidence while mid-range pricing illustrates afffordability without skimping on build out factors that make cars reliable. Below we will create the same dictionary for mileage by mean and create a dataframe from series objects derived from the series constructors


In [131]:
#finding highest mean kilometers of 'odometer_km' within the series

top_brand_kilometers = {}

for brand_new in top_10_brands:
    mean_km = autos[autos['brand']== brand_new]['odometer_km'].mean().round()
    top_brand_kilometers[brand_new]= mean_km
    
top_brand_kilometers    
Out[131]:
{'volkswagen': 128800.0,
 'bmw': 132695.0,
 'opel': 129384.0,
 'mercedes_benz': 131026.0,
 'audi': 129245.0,
 'ford': 124277.0,
 'renault': 128281.0,
 'peugeot': 127128.0,
 'fiat': 116950.0,
 'seat': 121537.0}
In [132]:
#creating series for top_brand_kilometers and top_brand_values

tbv_series = pd.Series(top_brand_values).sort_values(ascending = False)
tbk_series = pd.Series(top_brand_kilometers).sort_values(ascending = False)

#combine both tbv and tbk series into a dataframe

price_km_df = pd.DataFrame(tbv_series, columns = ['mean_prices'])

#add column for tbk_series
price_km_df['mean_kilometers'] = tbk_series

price_km_df
Out[132]:
mean_prices mean_kilometers
audi 9381.0 129245.0
mercedes_benz 8673.0 131026.0
bmw 8382.0 132695.0
volkswagen 5437.0 128800.0
seat 4433.0 121537.0
ford 3779.0 124277.0
peugeot 3114.0 127128.0
opel 3005.0 129384.0
fiat 2837.0 116950.0
renault 2496.0 128281.0


After narrowing our results to see top brands, prices and mileage we can see that mileage is in a range of 10,000 km across brands, suggesting that mileage may effect price at higher (lower price) and lower(higher price) values. Of note would be the significance of sales price in relation to location. We will save that for another project. For now, it appears, that buying a volkswagen is the most popular choice amongst used car buyers and should not come as a surprise cosidering they are Germany's largest automaker.



We'll take a look at these unique values first, translate them and map their English translations.


In [133]:
#changing column data that has German words to English

autos.head(3)

#columns vehicle_type, gearbox, and fuel_type will be translated
Out[133]:
date_crawled name price abtest vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 5000 control bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 2016-03-26 00:00:00 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 8500 control limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 2016-04-04 00:00:00 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 8990 test limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 2016-03-26 00:00:00 35394 2016-04-06 20:15:37
In [134]:
#create dictionary, convert via .map

autos['vehicle_type'].unique()
Out[134]:
array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv',
       'cabrio', 'andere'], dtype=object)
In [135]:
#create dictionary, convert via .map

vehicle_dict = {'bus':'bus','limousine':'limousine','kleinwagen':'small car','kombi':'station_wagon','nan':'unknown','coupe':'two_door','suv':'suv','cabrio':'convertible','andere':'other'}
autos['vehicle_type'] = autos['vehicle_type'].map(vehicle_dict)
In [136]:
#verify translation changes

autos['vehicle_type'].unique()
Out[136]:
array(['bus', 'limousine', 'small car', 'station_wagon', nan, 'two_door',
       'suv', 'convertible', 'other'], dtype=object)
In [137]:
autos['gearbox'].unique()
Out[137]:
array(['manuell', 'automatik', nan], dtype=object)
In [138]:
gear_dict = {'manuell':'manual','automatik':'automatic'}
autos['gearbox'] = autos['gearbox'].map(gear_dict)
In [139]:
autos['gearbox'].unique()
Out[139]:
array(['manual', 'automatic', nan], dtype=object)
In [140]:
autos['fuel_type'].unique()
Out[140]:
array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro',
       'andere'], dtype=object)
In [141]:
fuel_dict = {'lpg':'lpg','benzin':'gas','diesel':'diesel','cng':'cng','hybrid':'hybrid','elektro':'electric','andere':'other'}
autos['fuel_type'] = autos['fuel_type'].map(fuel_dict)
In [142]:
autos['fuel_type'].unique()
Out[142]:
array(['lpg', 'gas', 'diesel', nan, 'cng', 'hybrid', 'electric', 'other'],
      dtype=object)
In [143]:
autos['unrepaired_damage'].unique()
Out[143]:
array(['nein', nan, 'ja'], dtype=object)
In [144]:
unrepaired_dict = {'nein':'no','ja':'yes'}
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(unrepaired_dict)
In [145]:
autos['unrepaired_damage'].unique()
Out[145]:
array(['no', nan, 'yes'], dtype=object)


Conversion of all date columns to uniform, numeric format will be perfomed to allow for further queries against the dataset. Converting date formats from yyyy-dd-mm to yyyyddmm by removing dashes and extra characters from columns date_crawled, ad_created & last_seen.


In [146]:
#starting with date_crawled column by removing hours, mins, secs and dashes

autos['date_crawled'].head()
Out[146]:
0    2016-03-26 17:47:46
1    2016-04-04 13:38:56
2    2016-03-26 18:57:24
3    2016-03-12 16:58:10
4    2016-04-01 14:38:50
Name: date_crawled, dtype: object
In [147]:
autos['date_crawled_new'] = (autos['date_crawled']
                                     .str.split()
                                     .str[0])
autos['date_crawled_new'].head(3)
Out[147]:
0    2016-03-26
1    2016-04-04
2    2016-03-26
Name: date_crawled_new, dtype: object
In [148]:
autos['date_crawled_new'] = autos['date_crawled_new'].str.replace('-','')

print(autos['date_crawled_new'])
0        20160326
1        20160404
2        20160326
3        20160312
4        20160401
           ...   
49995    20160327
49996    20160328
49997    20160402
49998    20160308
49999    20160314
Name: date_crawled_new, Length: 46352, dtype: object
In [149]:
autos['date_crawled_new'] = autos['date_crawled_new'].astype(int)
autos['date_crawled_new'].head(5)
Out[149]:
0    20160326
1    20160404
2    20160326
3    20160312
4    20160401
Name: date_crawled_new, dtype: int64
In [150]:
#starting with ad_created column by removing hours, mins, secs and dashes

autos['ad_created'].head(2)
Out[150]:
0    2016-03-26 00:00:00
1    2016-04-04 00:00:00
Name: ad_created, dtype: object
In [151]:
autos['ad_created_new'] = (autos['ad_created']
                                     .str.split()
                                     .str[0])
autos['ad_created_new'].head(3)
Out[151]:
0    2016-03-26
1    2016-04-04
2    2016-03-26
Name: ad_created_new, dtype: object
In [152]:
autos['ad_created_new'] = autos['ad_created_new'].str.replace('-','')

print(autos['ad_created_new'])
0        20160326
1        20160404
2        20160326
3        20160312
4        20160401
           ...   
49995    20160327
49996    20160328
49997    20160402
49998    20160308
49999    20160313
Name: ad_created_new, Length: 46352, dtype: object
In [153]:
autos['ad_created_new'] = autos['ad_created_new'].astype(int)
autos['ad_created_new'].head(3)
Out[153]:
0    20160326
1    20160404
2    20160326
Name: ad_created_new, dtype: int64
In [154]:
#starting with last_seen column by removing hours, mins, secs and dashes

autos['last_seen'].head(3)
Out[154]:
0    2016-04-06 06:45:54
1    2016-04-06 14:45:08
2    2016-04-06 20:15:37
Name: last_seen, dtype: object
In [155]:
autos['last_seen_new'] = (autos['last_seen']
                                  .str.split()
                                  .str[0])
autos['last_seen_new'].head(2)
                            
Out[155]:
0    2016-04-06
1    2016-04-06
Name: last_seen_new, dtype: object
In [156]:
autos['last_seen_new'] = autos['last_seen_new'].str.replace('-','')

print(autos['last_seen_new'])
0        20160406
1        20160406
2        20160406
3        20160315
4        20160401
           ...   
49995    20160401
49996    20160402
49997    20160404
49998    20160405
49999    20160406
Name: last_seen_new, Length: 46352, dtype: object
In [157]:
autos['last_seen_new'] = autos['last_seen_new'].astype(int)
autos['last_seen_new'].head(3)
Out[157]:
0    20160406
1    20160406
2    20160406
Name: last_seen_new, dtype: int64


Now that we've converted the date columns to numeric integer format in new columns, we can remove the original columns.


In [158]:
#dropping unformatted date columns, keeping newly created ones

autos = autos.drop(['date_crawled', 'ad_created','last_seen'], axis = 1)


Exploring the name column to discover more descriptive and relevant keywords.


In [159]:
#look over name column for key words that may be helpful

autos['name'].head(10)
Out[159]:
0                     Peugeot_807_160_NAVTECH_ON_BOARD
1           BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik
2                           Volkswagen_Golf_1.6_United
3    Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...
4    Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...
5    Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...
6    VW_Golf_III_GT_Special_Electronic_Green_Metall...
7                                 Golf_IV_1.9_TDI_90PS
8                                           Seat_Arosa
9            Renault_Megane_Scenic_1.6e_RT_Klimaanlage
Name: name, dtype: object
In [160]:
#remove underscores

autos['name'] = (autos['name'].str.replace('_',' '))

autos['name'].head(3)
Out[160]:
0              Peugeot 807 160 NAVTECH ON BOARD
1    BMW 740i 4 4 Liter HAMANN UMBAU Mega Optik
2                    Volkswagen Golf 1.6 United
Name: name, dtype: object
In [161]:
#split the column into workable lists to choose model type.  Will most likely look at first four

autos['name'] = (autos['name']
                         .str.split()
                         .str[0:3])
print(autos['name'])
0            [Peugeot, 807, 160]
1                 [BMW, 740i, 4]
2        [Volkswagen, Golf, 1.6]
3         [Smart, smart, fortwo]
4               [Ford, Focus, 1]
                  ...           
49995            [Audi, Q5, 3.0]
49996           [Opel, Astra, F]
49997             [Fiat, 500, C]
49998            [Audi, A3, 2.0]
49999        [Opel, Vectra, 1.6]
Name: name, Length: 46352, dtype: object
In [162]:
#the newly created lists from the name column appear to have model at the second index. 
#put these in a new column and remove name column

autos['name'] = (autos['name'].str[1])

#rename name column to model


autos.head()
Out[162]:
name price abtest vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage postal_code date_crawled_new ad_created_new last_seen_new
0 807 5000 control bus 2004 manual 158 andere 150000 3 lpg peugeot no 79588 20160326 20160326 20160406
1 740i 8500 control limousine 1997 automatic 286 7er 150000 6 gas bmw no 71034 20160404 20160404 20160406
2 Golf 8990 test limousine 2009 manual 102 golf 70000 7 gas volkswagen no 35394 20160326 20160326 20160406
3 smart 4350 control small car 2007 automatic 71 fortwo 70000 6 gas smart no 33729 20160312 20160312 20160315
4 Focus 1350 test station_wagon 2003 manual 0 focus 150000 7 gas ford no 39218 20160401 20160401 20160401


While breaking apart the name column clarified some model name extensions, it was difficult to determine what to keep with an extremeley high number of unique names in the dataset. For now using brand and name should be sufficient for finding the most common brand and model combination. Just like brand, we'll use the top ten for brand_model combination below.


In [163]:
#clean up columns name and brand by capitalizing first letter of each word


autos['brand'] = autos['brand'].str.capitalize()
autos['name'] = autos['name'].str.capitalize()


#combine columns brand and name to create a brand model column

autos['brand_model'] = autos['brand'] + ' ' + autos['name']
In [164]:
#checking the number of unique brand models in autos

autos['brand_model'].nunique()
Out[164]:
3974
In [165]:
#create the top ten brand model percentage counts

autos['brand_model'].value_counts(normalize = True, dropna = False).head(10).mul(100).round(2)
Out[165]:
Mercedes_benz Benz    6.96
Volkswagen Golf       5.56
Opel Corsa            2.94
Volkswagen Polo       2.72
Opel Astra            2.55
Volkswagen Passat     2.39
Audi A4               2.32
Audi A6               1.56
Audi A3               1.55
Ford Focus            1.51
Name: brand_model, dtype: float64


The top 10 brand models contain nine German automakers, with Mercedes and Volkswagen coming in with over 12.5% of popular models. This shouldn't come as a surprise considering that this site is based in Germany. It Should be noted that Benz is not a model name, but rather a part of the brand, suggesting that name may be a bit unreliable. For now we'll focus on the current result. Let's explore the odometer_km column by splitting it into groups to see if mileage ranges have a specific impact on pricing.


In [166]:
#breaking up the odometer_km column into groups using the groupby method
#and finding the average price by mileage


split_odom = autos.groupby('odometer_km')['price'].mean()
split_odom.head(13)
Out[166]:
odometer_km
5000       9794.102672
10000     21255.304721
20000     18523.200271
30000     16674.564069
40000     15518.952261
50000     13853.911111
60000     12406.989343
70000     10954.826014
80000      9743.110058
90000      8470.087321
100000     8188.223581
125000     6238.277181
150000     3792.320241
Name: price, dtype: float64


As we can see the average cost per 10,000 km driven decreases significantly after 10,000 km. This makes sense as the cost of maintenance and upkeep increases with added mileage. Seen here is an inverse relationship to distnace driven. As km increase, average price decreases. It should be noted that the average cost of cars driven 5000 km shows an average price that is half that of those driven 10,000 km. We'lll close out this analysis by seeing if damage to cars, listed under unrepaired_damage, shows significant differences between values


In [167]:
#create means for damaged and undamaged then compare price

damaged_cars = autos[autos['unrepaired_damage']== 'yes']
damaged_price = damaged_cars['price'].mean()
print('$',damaged_price.round(2))
$ 2269.58
In [168]:
undamaged_cars = autos[autos['unrepaired_damage']=='no']
undamaged_price = undamaged_cars['price'].mean()
print('$',undamaged_price.round(2))
$ 7177.14
In [169]:
#show difference in price between damaged and undamaged cars

print('Damaged cars sell for $',(undamaged_cars['price'].mean().round(2)) - (damaged_cars['price'].mean().round(2)), 'less than undamaged cars')
Damaged cars sell for $ 4907.56 less than undamaged cars

Conclusions

-Buying used cars from Kleinanziegen shows that prices generally decrase with increasing mileage and unrepaired damage.

-The majority of brand automakers are from Germany, so variety for international autos is limited.

-By analyzing brand, models, kilometers and price we can see that automakers like Volkswagen are reliable, mid-range in pricing and heavily represented in used car sales availability.