Cleaning Data From Kleinanziegen -
The German eBay Classifieds
¶

This analysis will be looking at a dataset that was scraped by orgesleka from the classifieds section of German eBay (Kleinanzeigen). There are apporximately 50,000 data points. The purpose of this project is to clean the data and analyze it for a deeper understanding of the varied used car listings. Usage of Jupyter Notebooks and Pandas will allow us to dig down into the data for fine tuning and detailed insighgts.¶

In [100]:

#This is where we bring in our libraries to help complete the analysis

import pandas as pd
import numpy as np

#read in the csv that will be used for the project

autos = pd.read_csv('autos.csv', encoding = 'Latin 1')

In [101]:

#creating a new cell to test if the autos csv was read

autos

Out[101]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	nrOfPictures	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	0	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	0	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	0	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	0	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	0	39218	2016-04-01 14:38:50
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
49995	2016-03-27 14:38:19	Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon	privat	Angebot	$24,900	control	limousine	2011	automatik	239	q5	100,000km	1	diesel	audi	nein	2016-03-27 00:00:00	0	82131	2016-04-01 13:47:40
49996	2016-03-28 10:50:25	Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+...	privat	Angebot	$1,980	control	cabrio	1996	manuell	75	astra	150,000km	5	benzin	opel	nein	2016-03-28 00:00:00	0	44807	2016-04-02 14:18:02
49997	2016-04-02 14:44:48	Fiat_500_C_1.2_Dualogic_Lounge	privat	Angebot	$13,200	test	cabrio	2014	automatik	69	500	5,000km	11	benzin	fiat	nein	2016-04-02 00:00:00	0	73430	2016-04-04 11:47:27
49998	2016-03-08 19:25:42	Audi_A3_2.0_TDI_Sportback_Ambition	privat	Angebot	$22,900	control	kombi	2013	manuell	150	a3	40,000km	11	diesel	audi	nein	2016-03-08 00:00:00	0	35683	2016-04-05 16:45:07
49999	2016-03-14 00:42:12	Opel_Vectra_1.6_16V	privat	Angebot	$1,250	control	limousine	1996	manuell	101	vectra	150,000km	1	benzin	opel	nein	2016-03-13 00:00:00	0	45897	2016-04-06 21:18:48

50000 rows × 20 columns

In [102]:

#looking at the head and info of 'autos' Dataframe will help identify data types and columns

autos.info()
autos.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

Out[102]:

	dateCrawled	name	seller	offerType	price	abtest	vehicleType	yearOfRegistration	gearbox	powerPS	model	odometer	monthOfRegistration	fuelType	brand	notRepairedDamage	dateCreated	postalCode	lastSeen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	privat	Angebot	$8,990	test	limousine	2009	manuell	102	golf	70,000km	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37
3	2016-03-12 16:58:10	Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...	privat	Angebot	$4,350	control	kleinwagen	2007	automatik	71	fortwo	70,000km	6	benzin	smart	nein	2016-03-12 00:00:00	33729	2016-03-15 03:16:28
4	2016-04-01 14:38:50	Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...	privat	Angebot	$1,350	test	kombi	2003	manuell	0	focus	150,000km	7	benzin	ford	nein	2016-04-01 00:00:00	39218	2016-04-01 14:38:50

Looking at the colum names and dataframe reveals some clues about the data to be cleaned. Of the twenty data types 15 are objects(most likely strings) and 5 are int64 (numeric values), which may need to be converted as they have commas in them, rather than periods. Dates will most likely have to be converted or extracted to highlight specific time spans as either days or hours. Names will have to be clenaned of underscores so that models can be extraced and viewed aside from make. The items in Sellers, offerType, gearbox, fuelType and notRepairedDamage columns will have to be converted to English from German. The odometer column will have to have km removed and placed in the column name. There are a few unknowns already visible looking at the info counts suggesting that vehicle type and model may need to be compared to identify missing descriptors. Gear and fuel type look to be challenging as other column names don't suggest relatable information. The most intriguing column is the notRepairedDamage column as it points to possible valuable missing information about the condition of those cars. We'll find out once we look a little closer at the dataset. Finally, the column names could very well use a makeover for more clear and concise nomenclature.

In [103]:

#in this step column names will be changed for better readability
#and all column names in Camel Case (upper and lower case combined) will be 
#converted to Snake Case (lower case only)

print(autos.columns)

#convert column names to snakecase (all lowercase)

autos.rename(columns = {'dateCrawled': 'date_crawled', 'name':'name', 'seller':'seller', 'offerType':'offer_type', 'price':'price', 'abtest':'abtest',
       'vehicleType':'vehicle_type', 'yearOfRegistration':'registration_year', 'gearbox':'gearbox', 'powerPS':'power_ps', 'model':'model',
       'odometer':'odometer', 'monthOfRegistration':'registration_month', 'fuelType':'fuel_type', 'brand':'brand',
       'notRepairedDamage':'unrepaired_damage', 'dateCreated':'ad_created', 'nrOfPictures':'nmbr_of_pictures', 'postalCode':'postal_code',
       'lastSeen':'last_seen'}, inplace = True) 

#verify that changes were successful

print('\n')
print('                                     **Adjusted Column Headers Below**')
print('\n')
autos.head(2)

Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')


                                     **Adjusted Column Headers Below**

Out[103]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	nmbr_of_pictures	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	privat	Angebot	$5,000	control	bus	2004	manuell	158	andere	150,000km	3	lpg	peugeot	nein	2016-03-26 00:00:00	0	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	privat	Angebot	$8,500	control	limousine	1997	automatik	286	7er	150,000km	6	benzin	bmw	nein	2016-04-04 00:00:00	0	71034	2016-04-06 14:45:08

Seen above are the changes made to the header column names. By converting columm names to snakecase (lower case) uniformity has been achieved while,hopefully, confusion can be avoided for future data scientists who may mistake header names, in camelcase (upper and lower case), as function naming conventions. Additionally, and more importantly, creating some clear formatting provides improved ease of use for reading header names and clarifies some of the names that were previously crowded together.

In [104]:

#in this step we'll explore 'autos' descriptions and value counts to determine if there's
#data that can be omitted and numeric data, stored as text, that can be cleaned

autos.describe(include = 'all')

Out[104]:

	date_crawled	name	seller	offer_type	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer	registration_month	fuel_type	brand	unrepaired_damage	ad_created	nmbr_of_pictures	postal_code	last_seen
count	50000	50000	50000	50000	50000	50000	44905	50000.000000	47320	50000.000000	47242	50000	50000.000000	45518	50000	40171	50000	50000.0	50000.000000	50000
unique	48213	38754	2	2	2357	2	8	NaN	2	NaN	245	13	NaN	7	40	2	76	NaN	NaN	39481
top	2016-03-21 16:37:21	Ford_Fiesta	privat	Angebot	$0	test	limousine	NaN	manuell	NaN	golf	150,000km	NaN	benzin	volkswagen	nein	2016-04-03 00:00:00	NaN	NaN	2016-04-07 06:17:27
freq	3	78	49999	49999	1421	25756	12859	NaN	36993	NaN	4024	32424	NaN	30107	10687	35232	1946	NaN	NaN	8
mean	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2005.073280	NaN	116.355920	NaN	NaN	5.723360	NaN	NaN	NaN	NaN	0.0	50813.627300	NaN
std	NaN	NaN	NaN	NaN	NaN	NaN	NaN	105.712813	NaN	209.216627	NaN	NaN	3.711984	NaN	NaN	NaN	NaN	0.0	25779.747957	NaN
min	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1000.000000	NaN	0.000000	NaN	NaN	0.000000	NaN	NaN	NaN	NaN	0.0	1067.000000	NaN
25%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	1999.000000	NaN	70.000000	NaN	NaN	3.000000	NaN	NaN	NaN	NaN	0.0	30451.000000	NaN
50%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2003.000000	NaN	105.000000	NaN	NaN	6.000000	NaN	NaN	NaN	NaN	0.0	49577.000000	NaN
75%	NaN	NaN	NaN	NaN	NaN	NaN	NaN	2008.000000	NaN	150.000000	NaN	NaN	9.000000	NaN	NaN	NaN	NaN	0.0	71540.000000	NaN
max	NaN	NaN	NaN	NaN	NaN	NaN	NaN	9999.000000	NaN	17700.000000	NaN	NaN	12.000000	NaN	NaN	NaN	NaN	0.0	99998.000000	NaN

In [105]:

#explore columns to get an idea of data characteristics

print('Vehicle Name')
autos['name'].value_counts()

Vehicle Name

Out[105]:

Ford_Fiesta                                          78
Volkswagen_Golf_1.4                                  75
BMW_316i                                             75
Volkswagen_Polo                                      72
BMW_318i                                             72
                                                     ..
Audi_80_Sport_Edition                                 1
Start_Stop_Leder_Lenkrad_mit_neuen_Alweterrreifen     1
Opel_Corsa_1.3_CDTI_DPF_Eco_5_Tuerer_HU_02/18         1
Daihatsu_Cuore_GL___HU_7/2017                         1
Golf_1.4_16V_Variant                                  1
Name: name, Length: 38754, dtype: int64

In [106]:

print('Price')
autos['price'].value_counts()

Price

Out[106]:

$0         1421
$500        781
$1,500      734
$2,500      643
$1,000      639
           ... 
$34,650       1
$30,800       1
$8,854        1
$15,888       1
$388          1
Name: price, Length: 2357, dtype: int64

In [107]:

print('Model Name')
autos['model'].value_counts()

Model Name

Out[107]:

golf          4024
andere        3528
3er           2761
polo          1757
corsa         1735
              ... 
kalina           2
200              1
rangerover       1
b_max            1
i3               1
Name: model, Length: 245, dtype: int64

In [108]:

print('Mileage on Odometer')
autos['odometer'].value_counts()

Mileage on Odometer

Out[108]:

150,000km    32424
125,000km     5170
100,000km     2169
90,000km      1757
80,000km      1436
70,000km      1230
60,000km      1164
50,000km      1027
5,000km        967
40,000km       819
30,000km       789
20,000km       784
10,000km       264
Name: odometer, dtype: int64

In [109]:

print('Fuel Type')
autos['fuel_type'].value_counts()

Fuel Type

Out[109]:

benzin     30107
diesel     14567
lpg          691
cng           75
hybrid        37
andere        22
elektro       19
Name: fuel_type, dtype: int64

In [110]:

print('Auto Maker')
autos['brand'].value_counts()

Auto Maker

Out[110]:

volkswagen        10687
opel               5461
bmw                5429
mercedes_benz      4734
audi               4283
ford               3479
renault            2404
peugeot            1456
fiat               1308
seat                941
skoda               786
mazda               757
nissan              754
citroen             701
smart               701
toyota              617
sonstige_autos      546
hyundai             488
volvo               457
mini                424
mitsubishi          406
honda               399
kia                 356
alfa_romeo          329
porsche             294
suzuki              293
chevrolet           283
chrysler            181
dacia               129
daihatsu            128
jeep                110
subaru              109
land_rover           99
saab                 80
daewoo               79
trabant              78
jaguar               77
rover                69
lancia               57
lada                 31
Name: brand, dtype: int64

Looking at value counts for specific categories such as model,name and brand(manufacturer) illustrate some redundancies. Unless we want to view very specific model characteristics, the column name appears to be an unnecessary column since model and brand already clarify that information. Price needs a deeper look to determine a range of values that hold the highest number of average prices. These values may also need to be converted to different currencies depending on who the target audience is for this analysis (current suggestion is U.S. Dollars based on author's location). The same can be said for mileage with a simple conversion from kilometers to miles. Fuel types are in German and, again, depending on the audience should be translated accordingly ex. benzin is German for gasoline or petrol.

For now removal of non numeric characters from price and odometer will clean them for subsequent analyses and conversion to numeric data type will allow for formulas requiring conversion between unit measurements.

In [111]:

#renaming column odometer to odeometer_km and converting to numeric data type(dtype)

autos['odometer'] = autos['odometer'].str.replace('km','').str.replace(',','')
autos['odometer'] = autos['odometer'].astype(int)
autos['price'] = autos['price'].str.replace('$','').str.replace(',','')
autos['price'] = autos['price'].astype(int)

autos.rename({'odometer':'odometer_km'}, axis = 1, inplace = True)
 

In [112]:

#verify non numeric symbol removal and data type change

autos[0:3]['price'],autos[0:3]['odometer_km']

Out[112]:

(0    5000
 1    8500
 2    8990
 Name: price, dtype: int64,
 0    150000
 1    150000
 2     70000
 Name: odometer_km, dtype: int64)

In [113]:

#looking further into price and odometer_km to find large, unrealistic values
#if those kinds of values appear, removal will occur to normalize the series

autos['price'].unique().shape

Out[113]:

(2357,)

In [114]:

autos['price'].describe(include = all)

Out[114]:

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64

In [115]:

autos['price'].value_counts().sort_index(ascending = False)

Out[115]:

99999999       1
27322222       1
12345678       3
11111111       2
10000000       1
            ... 
5              2
3              1
2              3
1            156
0           1421
Name: price, Length: 2357, dtype: int64

In [116]:

#after viewing the descriptions and value counts, it's clear to see
#that some outliers in price are over a million on the high end and 0 at the low end
#to remedy this, removal of those values will be executed below

autos = autos[autos['price'].between(100,350000)]

In [117]:

#verify that the prices have fluctuated based on eliminating outliers

autos['price'].describe()

Out[117]:

count     48224.000000
mean       5930.371433
std        9078.372762
min         100.000000
25%        1250.000000
50%        3000.000000
75%        7499.000000
max      350000.000000
Name: price, dtype: float64

After removing large price values (in excess of >350,000) and small values (less than <100) we can see how much the mean changed as it dropped from 9840 to 5930. This should allow us to view the average cost of used cars with a bit more confidence and help us start to identify specific types of cars and prices that are used car favorites. Now we'll take a look at the odometer_km to find out if there's any really large discrepancies in mileage values.

In [118]:

#exploring odometer_km for large/small outliers in mileage

autos['odometer_km'].value_counts(ascending = False)

Out[118]:

150000    31212
125000     5037
100000     2101
90000      1733
80000      1412
70000      1214
60000      1153
50000      1009
40000       814
30000       777
5000        760
20000       757
10000       245
Name: odometer_km, dtype: int64

Odometer_km shows no irregularities in mileage as the values represented suggest normal ranges for used cars in terms of kilometers driven. Before moving on, there were three columns that had negligible and null values. They were seller,offer_type, and nmbr_of_pictures and will be dropped from the analysis.

In [119]:

#dropping the three aforementioned columns

autos = autos.drop(['seller','offer_type','nmbr_of_pictures'], axis = 1)

In [120]:

#converting columns 'date_crawled','ad_created', 'last_seen' from string to numeric values

autos['date_crawled'].value_counts(ascending = False)

Out[120]:

2016-03-09 11:54:38    3
2016-03-29 23:42:13    3
2016-03-05 16:57:05    3
2016-03-23 18:39:34    3
2016-03-25 19:57:10    3
                      ..
2016-03-26 12:50:12    1
2016-03-17 11:29:33    1
2016-03-09 22:55:53    1
2016-03-21 14:55:28    1
2016-03-20 13:46:21    1
Name: date_crawled, Length: 46571, dtype: int64

In [121]:

#removing hours, minutes and seconds from 'date_crawled'

autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False).mul(100).round(2).sort_index()

Out[121]:

2016-03-05    2.54
2016-03-06    1.40
2016-03-07    3.61
2016-03-08    3.32
2016-03-09    3.30
2016-03-10    3.23
2016-03-11    3.26
2016-03-12    3.69
2016-03-13    1.57
2016-03-14    3.67
2016-03-15    3.43
2016-03-16    2.95
2016-03-17    3.15
2016-03-18    1.29
2016-03-19    3.47
2016-03-20    3.78
2016-03-21    3.72
2016-03-22    3.29
2016-03-23    3.23
2016-03-24    2.94
2016-03-25    3.15
2016-03-26    3.23
2016-03-27    3.11
2016-03-28    3.50
2016-03-29    3.41
2016-03-30    3.37
2016-03-31    3.19
2016-04-01    3.37
2016-04-02    3.56
2016-04-03    3.86
2016-04-04    3.65
2016-04-05    1.31
2016-04-06    0.32
2016-04-07    0.14
Name: date_crawled, dtype: float64

It appears that the highest percengage of 'crawling' (searching a website using bots or crawl agents) occured on 2016-04-03 @ 3.86%. Below we will see if any of the other dates can correlate any relevance to the highest percentage of crawling*

In [122]:

#removing hours, minutes and seconds from 'ad_created'

autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).mul(100).round(2).sort_index()

Out[122]:

2015-06-11    0.00
2015-08-10    0.00
2015-09-09    0.00
2015-11-10    0.00
2015-12-05    0.00
              ... 
2016-04-03    3.89
2016-04-04    3.69
2016-04-05    1.18
2016-04-06    0.33
2016-04-07    0.12
Name: ad_created, Length: 76, dtype: float64

When viewing the ad_listed column the highest percentage of listings occurs on 2016-04-03 @ 3.89%, which is the same day as the highest crawl count. This makes sense considering that the most time crawling would match the date with the most posts.

In [123]:

#removing hours, minutes and seconds from 'last_seen'

autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).mul(100).round(2).sort_index()

Out[123]:

2016-03-05     0.11
2016-03-06     0.43
2016-03-07     0.54
2016-03-08     0.73
2016-03-09     0.96
2016-03-10     1.06
2016-03-11     1.24
2016-03-12     2.38
2016-03-13     0.89
2016-03-14     1.26
2016-03-15     1.59
2016-03-16     1.64
2016-03-17     2.81
2016-03-18     0.73
2016-03-19     1.58
2016-03-20     2.07
2016-03-21     2.05
2016-03-22     2.14
2016-03-23     1.86
2016-03-24     1.98
2016-03-25     1.91
2016-03-26     1.67
2016-03-27     1.56
2016-03-28     2.08
2016-03-29     2.23
2016-03-30     2.47
2016-03-31     2.38
2016-04-01     2.29
2016-04-02     2.49
2016-04-03     2.51
2016-04-04     2.45
2016-04-05    12.51
2016-04-06    22.20
2016-04-07    13.22
Name: last_seen, dtype: float64

The last_seen column is a bit different because we're inferring that the highest percentage of times these listings were seen reprsents a spike not in sales, but in the number of cars still listed. So three days after (2016-04-03) the posting (2016-04-06) crawler bots show the highest number of listings at 22%. Could've just been a slow day for sales, or a trend in the data not yet seen? It directly coincides with the bots two lowest percentage crawls, which would make sense, suggesting that less activity (sales) resulted fewer crawls over the website.

In [124]:

#examining the distribution of registration years

autos['registration_year'].describe()

Out[124]:

count    48224.000000
mean      2004.730964
std         87.897388
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

In [125]:

#it appears that there are some values that cannot exist (ex. 1000).
#look at unique values

autos['registration_year'].describe()

Out[125]:

count    48224.000000
mean      2004.730964
std         87.897388
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

Registration year has some values that cannot exist. Cars were not invented until the late 19th century, early 20th. It's safe to say that we'll have to create a realistic range for these cars to fall in and remove the others. It's clear to see that the newer te car the higher the percentage. This suggests that newer cars, most likely, have better resale value.

In [126]:

#create a range between 1900 and 2016

autos = autos[autos["registration_year"].between(1900, 2016)]
autos['registration_year'].value_counts(normalize = True).mul(100).round(2)

Out[126]:

2000    6.70
2005    6.28
1999    6.21
2004    5.82
2003    5.81
        ... 
1938    0.00
1939    0.00
1953    0.00
1943    0.00
1952    0.00
Name: registration_year, Length: 78, dtype: float64

In [127]:

#identifying unique values in the 'brand' column 

autos['brand'].unique()

Out[127]:

array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler',
       'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos',
       'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan',
       'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda',
       'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda',
       'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada',
       'rover', 'daihatsu', 'lancia'], dtype=object)

In [128]:

#finding highest percentages of 'brand' within the dataframe

autos['brand'].value_counts(normalize = True, dropna = False).mul(100).round(2)

Out[128]:

volkswagen        21.14
bmw               11.02
opel              10.72
mercedes_benz      9.67
audi               8.68
ford               6.98
renault            4.71
peugeot            2.99
fiat               2.56
seat               1.83
skoda              1.64
nissan             1.53
mazda              1.52
smart              1.42
citroen            1.40
toyota             1.28
hyundai            1.00
sonstige_autos     0.95
volvo              0.91
mini               0.88
mitsubishi         0.82
honda              0.79
kia                0.71
alfa_romeo         0.67
porsche            0.60
suzuki             0.59
chevrolet          0.57
chrysler           0.35
dacia              0.27
daihatsu           0.25
jeep               0.23
subaru             0.21
land_rover         0.21
saab               0.17
jaguar             0.15
daewoo             0.15
trabant            0.14
rover              0.13
lancia             0.11
lada               0.06
Name: brand, dtype: float64

In [129]:

#assign the top ten brands

top_10_brands = autos['brand'].value_counts(normalize = True, dropna = False).head(10).mul(100).round(2).index

print(top_10_brands)

Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault',
       'peugeot', 'fiat', 'seat'],
      dtype='object')

Looking at the list of auto brands shows that the top ten automakers represent close to 75 percent of the availablle vehicles. This shows manufacterers who have an established market presence resulting in added options, across models, for buyers.

In [130]:

#creating an empty dictionary to hold the top ten 'brands'

top_brand_values = {}

for brand_10 in top_10_brands:
    mean_price = autos[autos['brand'] == brand_10]['price'].mean().round()
    top_brand_values[brand_10] = mean_price

top_brand_values

Out[130]:

{'volkswagen': 5437.0,
 'bmw': 8382.0,
 'opel': 3005.0,
 'mercedes_benz': 8673.0,
 'audi': 9381.0,
 'ford': 3779.0,
 'renault': 2496.0,
 'peugeot': 3114.0,
 'fiat': 2837.0,
 'seat': 4433.0}

Price doesn't necessarily correlate to brand, but it makes sense that Volkswagen sits right in the middle of the range of values for mean. As Volkswagen reprsents the highest selling brand present would suggest consumer confidence while mid-range pricing illustrates afffordability without skimping on build out factors that make cars reliable. Below we will create the same dictionary for mileage by mean and create a dataframe from series objects derived from the series constructors

In [131]:

#finding highest mean kilometers of 'odometer_km' within the series

top_brand_kilometers = {}

for brand_new in top_10_brands:
    mean_km = autos[autos['brand']== brand_new]['odometer_km'].mean().round()
    top_brand_kilometers[brand_new]= mean_km
    
top_brand_kilometers    

Out[131]:

{'volkswagen': 128800.0,
 'bmw': 132695.0,
 'opel': 129384.0,
 'mercedes_benz': 131026.0,
 'audi': 129245.0,
 'ford': 124277.0,
 'renault': 128281.0,
 'peugeot': 127128.0,
 'fiat': 116950.0,
 'seat': 121537.0}

In [132]:

#creating series for top_brand_kilometers and top_brand_values

tbv_series = pd.Series(top_brand_values).sort_values(ascending = False)
tbk_series = pd.Series(top_brand_kilometers).sort_values(ascending = False)

#combine both tbv and tbk series into a dataframe

price_km_df = pd.DataFrame(tbv_series, columns = ['mean_prices'])

#add column for tbk_series
price_km_df['mean_kilometers'] = tbk_series

price_km_df

Out[132]:

	mean_prices	mean_kilometers
audi	9381.0	129245.0
mercedes_benz	8673.0	131026.0
bmw	8382.0	132695.0
volkswagen	5437.0	128800.0
seat	4433.0	121537.0
ford	3779.0	124277.0
peugeot	3114.0	127128.0
opel	3005.0	129384.0
fiat	2837.0	116950.0
renault	2496.0	128281.0

After narrowing our results to see top brands, prices and mileage we can see that mileage is in a range of 10,000 km across brands, suggesting that mileage may effect price at higher (lower price) and lower(higher price) values. Of note would be the significance of sales price in relation to location. We will save that for another project. For now, it appears, that buying a volkswagen is the most popular choice amongst used car buyers and should not come as a surprise cosidering they are Germany's largest automaker.

We'll take a look at these unique values first, translate them and map their English translations.

In [133]:

#changing column data that has German words to English

autos.head(3)

#columns vehicle_type, gearbox, and fuel_type will be translated

Out[133]:

	date_crawled	name	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	ad_created	postal_code	last_seen
0	2016-03-26 17:47:46	Peugeot_807_160_NAVTECH_ON_BOARD	5000	control	bus	2004	manuell	158	andere	150000	3	lpg	peugeot	nein	2016-03-26 00:00:00	79588	2016-04-06 06:45:54
1	2016-04-04 13:38:56	BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik	8500	control	limousine	1997	automatik	286	7er	150000	6	benzin	bmw	nein	2016-04-04 00:00:00	71034	2016-04-06 14:45:08
2	2016-03-26 18:57:24	Volkswagen_Golf_1.6_United	8990	test	limousine	2009	manuell	102	golf	70000	7	benzin	volkswagen	nein	2016-03-26 00:00:00	35394	2016-04-06 20:15:37

In [134]:

#create dictionary, convert via .map

autos['vehicle_type'].unique()

Out[134]:

array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv',
       'cabrio', 'andere'], dtype=object)

In [135]:

#create dictionary, convert via .map

vehicle_dict = {'bus':'bus','limousine':'limousine','kleinwagen':'small car','kombi':'station_wagon','nan':'unknown','coupe':'two_door','suv':'suv','cabrio':'convertible','andere':'other'}
autos['vehicle_type'] = autos['vehicle_type'].map(vehicle_dict)

In [136]:

#verify translation changes

autos['vehicle_type'].unique()

Out[136]:

array(['bus', 'limousine', 'small car', 'station_wagon', nan, 'two_door',
       'suv', 'convertible', 'other'], dtype=object)

In [137]:

autos['gearbox'].unique()

Out[137]:

array(['manuell', 'automatik', nan], dtype=object)

In [138]:

gear_dict = {'manuell':'manual','automatik':'automatic'}
autos['gearbox'] = autos['gearbox'].map(gear_dict)

In [139]:

autos['gearbox'].unique()

Out[139]:

array(['manual', 'automatic', nan], dtype=object)

In [140]:

autos['fuel_type'].unique()

Out[140]:

array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro',
       'andere'], dtype=object)

In [141]:

fuel_dict = {'lpg':'lpg','benzin':'gas','diesel':'diesel','cng':'cng','hybrid':'hybrid','elektro':'electric','andere':'other'}
autos['fuel_type'] = autos['fuel_type'].map(fuel_dict)

In [142]:

autos['fuel_type'].unique()

Out[142]:

array(['lpg', 'gas', 'diesel', nan, 'cng', 'hybrid', 'electric', 'other'],
      dtype=object)

In [143]:

autos['unrepaired_damage'].unique()

Out[143]:

array(['nein', nan, 'ja'], dtype=object)

In [144]:

unrepaired_dict = {'nein':'no','ja':'yes'}
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(unrepaired_dict)

In [145]:

autos['unrepaired_damage'].unique()

Out[145]:

array(['no', nan, 'yes'], dtype=object)

Conversion of all date columns to uniform, numeric format will be perfomed to allow for further queries against the dataset. Converting date formats from yyyy-dd-mm to yyyyddmm by removing dashes and extra characters from columns date_crawled, ad_created & last_seen.

In [146]:

#starting with date_crawled column by removing hours, mins, secs and dashes

autos['date_crawled'].head()

Out[146]:

0    2016-03-26 17:47:46
1    2016-04-04 13:38:56
2    2016-03-26 18:57:24
3    2016-03-12 16:58:10
4    2016-04-01 14:38:50
Name: date_crawled, dtype: object

In [147]:

autos['date_crawled_new'] = (autos['date_crawled']
                                     .str.split()
                                     .str[0])
autos['date_crawled_new'].head(3)

Out[147]:

0    2016-03-26
1    2016-04-04
2    2016-03-26
Name: date_crawled_new, dtype: object

In [148]:

autos['date_crawled_new'] = autos['date_crawled_new'].str.replace('-','')

print(autos['date_crawled_new'])

0        20160326
1        20160404
2        20160326
3        20160312
4        20160401
           ...   
49995    20160327
49996    20160328
49997    20160402
49998    20160308
49999    20160314
Name: date_crawled_new, Length: 46352, dtype: object

In [149]:

autos['date_crawled_new'] = autos['date_crawled_new'].astype(int)
autos['date_crawled_new'].head(5)

Out[149]:

0    20160326
1    20160404
2    20160326
3    20160312
4    20160401
Name: date_crawled_new, dtype: int64

In [150]:

#starting with ad_created column by removing hours, mins, secs and dashes

autos['ad_created'].head(2)

Out[150]:

0    2016-03-26 00:00:00
1    2016-04-04 00:00:00
Name: ad_created, dtype: object

In [151]:

autos['ad_created_new'] = (autos['ad_created']
                                     .str.split()
                                     .str[0])
autos['ad_created_new'].head(3)

Out[151]:

0    2016-03-26
1    2016-04-04
2    2016-03-26
Name: ad_created_new, dtype: object

In [152]:

autos['ad_created_new'] = autos['ad_created_new'].str.replace('-','')

print(autos['ad_created_new'])

0        20160326
1        20160404
2        20160326
3        20160312
4        20160401
           ...   
49995    20160327
49996    20160328
49997    20160402
49998    20160308
49999    20160313
Name: ad_created_new, Length: 46352, dtype: object

In [153]:

autos['ad_created_new'] = autos['ad_created_new'].astype(int)
autos['ad_created_new'].head(3)

Out[153]:

0    20160326
1    20160404
2    20160326
Name: ad_created_new, dtype: int64

In [154]:

#starting with last_seen column by removing hours, mins, secs and dashes

autos['last_seen'].head(3)

Out[154]:

0    2016-04-06 06:45:54
1    2016-04-06 14:45:08
2    2016-04-06 20:15:37
Name: last_seen, dtype: object

In [155]:

autos['last_seen_new'] = (autos['last_seen']
                                  .str.split()
                                  .str[0])
autos['last_seen_new'].head(2)
                            

Out[155]:

0    2016-04-06
1    2016-04-06
Name: last_seen_new, dtype: object

In [156]:

autos['last_seen_new'] = autos['last_seen_new'].str.replace('-','')

print(autos['last_seen_new'])

0        20160406
1        20160406
2        20160406
3        20160315
4        20160401
           ...   
49995    20160401
49996    20160402
49997    20160404
49998    20160405
49999    20160406
Name: last_seen_new, Length: 46352, dtype: object

In [157]:

autos['last_seen_new'] = autos['last_seen_new'].astype(int)
autos['last_seen_new'].head(3)

Out[157]:

0    20160406
1    20160406
2    20160406
Name: last_seen_new, dtype: int64

Now that we've converted the date columns to numeric integer format in new columns, we can remove the original columns.

In [158]:

#dropping unformatted date columns, keeping newly created ones

autos = autos.drop(['date_crawled', 'ad_created','last_seen'], axis = 1)

Exploring the name column to discover more descriptive and relevant keywords.

In [159]:

#look over name column for key words that may be helpful

autos['name'].head(10)

Out[159]:

0                     Peugeot_807_160_NAVTECH_ON_BOARD
1           BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik
2                           Volkswagen_Golf_1.6_United
3    Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...
4    Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...
5    Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...
6    VW_Golf_III_GT_Special_Electronic_Green_Metall...
7                                 Golf_IV_1.9_TDI_90PS
8                                           Seat_Arosa
9            Renault_Megane_Scenic_1.6e_RT_Klimaanlage
Name: name, dtype: object

In [160]:

#remove underscores

autos['name'] = (autos['name'].str.replace('_',' '))

autos['name'].head(3)

Out[160]:

0              Peugeot 807 160 NAVTECH ON BOARD
1    BMW 740i 4 4 Liter HAMANN UMBAU Mega Optik
2                    Volkswagen Golf 1.6 United
Name: name, dtype: object

In [161]:

#split the column into workable lists to choose model type.  Will most likely look at first four

autos['name'] = (autos['name']
                         .str.split()
                         .str[0:3])
print(autos['name'])

0            [Peugeot, 807, 160]
1                 [BMW, 740i, 4]
2        [Volkswagen, Golf, 1.6]
3         [Smart, smart, fortwo]
4               [Ford, Focus, 1]
                  ...           
49995            [Audi, Q5, 3.0]
49996           [Opel, Astra, F]
49997             [Fiat, 500, C]
49998            [Audi, A3, 2.0]
49999        [Opel, Vectra, 1.6]
Name: name, Length: 46352, dtype: object

In [162]:

#the newly created lists from the name column appear to have model at the second index. 
#put these in a new column and remove name column

autos['name'] = (autos['name'].str[1])

#rename name column to model

autos.head()

Out[162]:

	name	price	abtest	vehicle_type	registration_year	gearbox	power_ps	model	odometer_km	registration_month	fuel_type	brand	unrepaired_damage	postal_code	date_crawled_new	ad_created_new	last_seen_new
0	807	5000	control	bus	2004	manual	158	andere	150000	3	lpg	peugeot	no	79588	20160326	20160326	20160406
1	740i	8500	control	limousine	1997	automatic	286	7er	150000	6	gas	bmw	no	71034	20160404	20160404	20160406
2	Golf	8990	test	limousine	2009	manual	102	golf	70000	7	gas	volkswagen	no	35394	20160326	20160326	20160406
3	smart	4350	control	small car	2007	automatic	71	fortwo	70000	6	gas	smart	no	33729	20160312	20160312	20160315
4	Focus	1350	test	station_wagon	2003	manual	0	focus	150000	7	gas	ford	no	39218	20160401	20160401	20160401

While breaking apart the name column clarified some model name extensions, it was difficult to determine what to keep with an extremeley high number of unique names in the dataset. For now using brand and name should be sufficient for finding the most common brand and model combination. Just like brand, we'll use the top ten for brand_model combination below.

In [163]:

#clean up columns name and brand by capitalizing first letter of each word

autos['brand'] = autos['brand'].str.capitalize()
autos['name'] = autos['name'].str.capitalize()

#combine columns brand and name to create a brand model column

autos['brand_model'] = autos['brand'] + ' ' + autos['name']

In [164]:

#checking the number of unique brand models in autos

autos['brand_model'].nunique()

Out[164]:

In [165]:

#create the top ten brand model percentage counts

autos['brand_model'].value_counts(normalize = True, dropna = False).head(10).mul(100).round(2)

Out[165]:

Mercedes_benz Benz    6.96
Volkswagen Golf       5.56
Opel Corsa            2.94
Volkswagen Polo       2.72
Opel Astra            2.55
Volkswagen Passat     2.39
Audi A4               2.32
Audi A6               1.56
Audi A3               1.55
Ford Focus            1.51
Name: brand_model, dtype: float64

The top 10 brand models contain nine German automakers, with Mercedes and Volkswagen coming in with over 12.5% of popular models. This shouldn't come as a surprise considering that this site is based in Germany. It Should be noted that Benz is not a model name, but rather a part of the brand, suggesting that name may be a bit unreliable. For now we'll focus on the current result. Let's explore the odometer_km column by splitting it into groups to see if mileage ranges have a specific impact on pricing.

In [166]:

#breaking up the odometer_km column into groups using the groupby method
#and finding the average price by mileage


split_odom = autos.groupby('odometer_km')['price'].mean()
split_odom.head(13)

Out[166]:

odometer_km
5000       9794.102672
10000     21255.304721
20000     18523.200271
30000     16674.564069
40000     15518.952261
50000     13853.911111
60000     12406.989343
70000     10954.826014
80000      9743.110058
90000      8470.087321
100000     8188.223581
125000     6238.277181
150000     3792.320241
Name: price, dtype: float64

As we can see the average cost per 10,000 km driven decreases significantly after 10,000 km. This makes sense as the cost of maintenance and upkeep increases with added mileage. Seen here is an inverse relationship to distnace driven. As km increase, average price decreases. It should be noted that the average cost of cars driven 5000 km shows an average price that is half that of those driven 10,000 km. We'lll close out this analysis by seeing if damage to cars, listed under unrepaired_damage, shows significant differences between values

In [167]:

#create means for damaged and undamaged then compare price

damaged_cars = autos[autos['unrepaired_damage']== 'yes']
damaged_price = damaged_cars['price'].mean()
print('$',damaged_price.round(2))

$ 2269.58

In [168]:

undamaged_cars = autos[autos['unrepaired_damage']=='no']
undamaged_price = undamaged_cars['price'].mean()
print('$',undamaged_price.round(2))

$ 7177.14

In [169]:

#show difference in price between damaged and undamaged cars

print('Damaged cars sell for $',(undamaged_cars['price'].mean().round(2)) - (damaged_cars['price'].mean().round(2)), 'less than undamaged cars')

Damaged cars sell for $ 4907.56 less than undamaged cars

Conclusions¶

-Buying used cars from Kleinanziegen shows that prices generally decrase with increasing mileage and unrepaired damage.

-The majority of brand automakers are from Germany, so variety for international autos is limited.

**-By analyzing brand, models, kilometers and price we can see that automakers like Volkswagen are reliable, mid-range in pricing and heavily represented in used car sales availability.**

Cleaning Data From Kleinanziegen - The German eBay Classifieds¶

Conclusions¶

Cleaning Data From Kleinanziegen -
The German eBay Classifieds
¶