This is the first guided project in DataQuest. We'll be working with a dataset of used cars from a classified section of eBay. Data originally from Kaggle, but there have been a few modifications to the data to make it more useful to practice cleaning the data.
To clean the data and analyze the included car listings.
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv',encoding='Latin-1')
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
As we can see from the .info method, we have a few columns with some null values. Columns being vehicleType, gearbox, model, fuelType, and nonRepairedDamage. We must inspect these columns to see how we can deal with these null values. Also, we might want to change the price column to a float dtype so it is easier to work with.
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
From the .head method, we can see that there is a lot of data cleaning that we need to do.
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
new_columns = ({'yearOfRegistration':'registration_year', 'monthOfRegistration':'registration_month',
'notRepairedDamage':'unrepaired_damage','dateCreated':'ad_created', 'dateCrawled':'date_crawled',
'offerType':'offer_type','vehicleType':"vehicle_type", 'powerPS':'power_ps','fuelType':'fuel_type',
'nrOfPictures':'num_pictures','postalCode':'postal_code','lastSeen':'last_seen'}
)
autos.rename(new_columns,axis=1,inplace=True)
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Camelcase to snakecase basically means changing the word to lower case and have underscores instead of spaces. So, I changed the all the columns to to snakecase just for it to be more readable. I'm not sure what date_crawled means so I just left it as is. One specific change I made was to nrOfPictures to num_pictures.
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-30 19:48:02 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Price and odometer are stored as text, so we need the change these to numeric. We can say this because we see their descriptive statistics results in all null values
We can also see that columns seller, offer_type, abtest, gearbox, unrepaired_damage all have only 2 values. We may consider dropping these columns because they won't be useful in our analyses.
autos['price'] = autos['price'].str.replace('$','').str.replace(',','').astype(float)
autos['odometer'] = autos['odometer'].str.lower().str.replace('km','').str.replace(',','').astype(float)
autos.rename({'odometer':'odometer_km'}, axis=1,inplace=True)
autos['price'].unique().shape
# theres 2357 unique values
(2357,)
print('Max: {}\nMin: {}'.format(autos['price'].max(), autos['price'].min()))
# What car costs 99,999,999 and 0? This isn't right
autos['price'].describe()
Max: 99999999.0 Min: 0.0
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
(autos['price'] > 500000).value_counts()
# 14 cars that are more than $500,000
False 49986 True 14 Name: price, dtype: int64
autos['price'].value_counts().sort_index(ascending=False).tail(200)
# Looking at how many occurrences of each price value in descending order.
760.0 8 755.0 1 750.0 433 749.0 23 745.0 2 ... 5.0 2 3.0 1 2.0 3 1.0 156 0.0 1421 Name: price, Length: 200, dtype: int64
price_q25, price_q75 = np.percentile(autos['price'].values, [25,75])
price_iqr = price_q75 - price_q25
price_cutoff = price_iqr * 2
price_upper, price_lower = price_q75+price_cutoff, price_q25-price_cutoff
price_outliers = [x for x in autos['price'] if x<price_lower or x>price_upper]
len(price_outliers)
2644
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date_crawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offer_type 50000 non-null object 4 price 50000 non-null float64 5 abtest 50000 non-null object 6 vehicle_type 44905 non-null object 7 registration_year 50000 non-null int64 8 gearbox 47320 non-null object 9 power_ps 50000 non-null int64 10 model 47242 non-null object 11 odometer_km 50000 non-null float64 12 registration_month 50000 non-null int64 13 fuel_type 45518 non-null object 14 brand 50000 non-null object 15 unrepaired_damage 40171 non-null object 16 ad_created 50000 non-null object 17 num_pictures 50000 non-null int64 18 postal_code 50000 non-null int64 19 last_seen 50000 non-null object dtypes: float64(2), int64(5), object(13) memory usage: 7.6+ MB
new_autos = autos[autos['price'].between(price_lower,price_upper)]
new_autos.info()
# 2644 outliers removed
<class 'pandas.core.frame.DataFrame'> Int64Index: 47356 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date_crawled 47356 non-null object 1 name 47356 non-null object 2 seller 47356 non-null object 3 offer_type 47356 non-null object 4 price 47356 non-null float64 5 abtest 47356 non-null object 6 vehicle_type 42311 non-null object 7 registration_year 47356 non-null int64 8 gearbox 44724 non-null object 9 power_ps 47356 non-null int64 10 model 44695 non-null object 11 odometer_km 47356 non-null float64 12 registration_month 47356 non-null int64 13 fuel_type 42950 non-null object 14 brand 47356 non-null object 15 unrepaired_damage 37668 non-null object 16 ad_created 47356 non-null object 17 num_pictures 47356 non-null int64 18 postal_code 47356 non-null int64 19 last_seen 47356 non-null object dtypes: float64(2), int64(5), object(13) memory usage: 7.6+ MB
To remove the outliers, I used the interquartile range method because it focuses solely on the data given. I used a multiplier of 2 for the cutoff because I noticed that some of the lower values were reasonable prices and I wanted to keep majority of them.
price_outliers.sort()
price_outliers
[19444.0, 19450.0, 19450.0, 19480.0, 19490.0, 19490.0, 19490.0, 19490.0, 19499.0, 19499.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19500.0, 19550.0, 19599.0, 19600.0, 19600.0, 19600.0, 19600.0, 19600.0, 19600.0, 19650.0, 19650.0, 19650.0, 19666.0, 19690.0, 19699.0, 19700.0, 19700.0, 19700.0, 19700.0, 19750.0, 19750.0, 19750.0, 19777.0, 19780.0, 19800.0, 19800.0, 19800.0, 19800.0, 19800.0, 19800.0, 19800.0, 19800.0, 19800.0, 19800.0, 19800.0, 19800.0, 19800.0, 19800.0, 19800.0, 19800.0, 19800.0, 19800.0, 19800.0, 19800.0, 19850.0, 19850.0, 19850.0, 19850.0, 19890.0, 19890.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19900.0, 19911.0, 19950.0, 19950.0, 19950.0, 19950.0, 19950.0, 19950.0, 19950.0, 19950.0, 19950.0, 19968.0, 19970.0, 19980.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19990.0, 19995.0, 19995.0, 19995.0, 19995.0, 19998.0, 19998.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 19999.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20000.0, 20123.0, 20150.0, 20199.0, 20200.0, 20200.0, 20200.0, 20200.0, 20200.0, 20222.0, 20250.0, 20290.0, 20299.0, 20300.0, 20390.0, 20390.0, 20399.0, 20400.0, 20400.0, 20400.0, 20400.0, 20450.0, 20480.0, 20490.0, 20490.0, 20490.0, 20490.0, 20490.0, 20490.0, 20495.0, 20495.0, 20499.0, 20499.0, 20499.0, 20499.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20500.0, 20550.0, 20550.0, 20550.0, 20589.0, 20590.0, 20595.0, 20650.0, 20650.0, 20650.0, 20700.0, 20700.0, 20750.0, 20750.0, 20750.0, 20750.0, 20790.0, 20799.0, 20800.0, 20800.0, 20800.0, 20800.0, 20800.0, 20800.0, 20800.0, 20800.0, 20800.0, 20800.0, 20800.0, 20850.0, 20850.0, 20850.0, 20850.0, 20888.0, 20890.0, 20890.0, 20890.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20900.0, 20911.0, 20950.0, 20950.0, 20950.0, 20950.0, 20950.0, 20950.0, 20980.0, 20985.0, 20987.0, 20990.0, 20990.0, 20990.0, 20990.0, 20990.0, 20990.0, 20990.0, 20990.0, 20990.0, 20990.0, 20990.0, 20990.0, 20990.0, 20990.0, 20999.0, 20999.0, 20999.0, 20999.0, 20999.0, 20999.0, 20999.0, 20999.0, 20999.0, 20999.0, 20999.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21000.0, 21049.0, 21111.0, 21200.0, 21200.0, 21200.0, 21200.0, 21250.0, 21250.0, 21275.0, 21299.0, 21300.0, 21300.0, 21300.0, 21300.0, 21300.0, 21400.0, 21400.0, 21400.0, 21400.0, 21400.0, 21400.0, 21400.0, 21400.0, 21444.0, 21450.0, 21450.0, 21490.0, 21490.0, 21490.0, 21490.0, 21490.0, 21490.0, 21490.0, 21499.0, 21499.0, 21499.0, 21499.0, 21499.0, 21499.0, 21499.0, 21499.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21500.0, 21590.0, 21590.0, 21599.0, 21600.0, 21600.0, 21600.0, 21690.0, 21700.0, 21750.0, 21750.0, 21750.0, 21780.0, 21780.0, 21790.0, 21800.0, 21800.0, 21800.0, 21800.0, 21850.0, 21850.0, 21850.0, 21888.0, 21890.0, 21890.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21900.0, 21950.0, 21950.0, 21950.0, 21950.0, 21950.0, 21950.0, 21959.0, 21980.0, 21980.0, 21990.0, 21990.0, 21990.0, 21990.0, 21990.0, 21990.0, 21990.0, 21990.0, 21990.0, 21990.0, 21990.0, 21990.0, 21990.0, 21990.0, 21990.0, 21995.0, 21999.0, 21999.0, 21999.0, 21999.0, 21999.0, 21999.0, 21999.0, 21999.0, 21999.0, 21999.0, 21999.0, 21999.0, 21999.0, 21999.0, 21999.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22000.0, 22199.0, 22200.0, 22222.0, 22222.0, 22250.0, 22250.0, 22250.0, 22280.0, 22290.0, 22299.0, 22300.0, 22300.0, 22300.0, 22300.0, 22300.0, 22300.0, 22399.0, 22400.0, 22400.0, 22400.0, 22400.0, 22400.0, 22400.0, 22400.0, 22400.0, 22450.0, 22450.0, 22450.0, 22490.0, 22490.0, 22490.0, 22490.0, 22490.0, 22490.0, 22499.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22500.0, 22555.0, 22580.0, 22600.0, 22600.0, 22600.0, 22650.0, 22650.0, 22690.0, 22700.0, 22700.0, 22750.0, 22750.0, 22790.0, 22799.0, 22800.0, 22800.0, 22800.0, 22800.0, 22800.0, 22800.0, 22800.0, 22800.0, 22800.0, 22800.0, 22850.0, 22850.0, 22850.0, 22880.0, 22890.0, 22899.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22900.0, 22950.0, 22950.0, 22950.0, 22950.0, 22950.0, 22950.0, 22950.0, 22950.0, 22980.0, 22989.0, 22990.0, 22990.0, 22990.0, 22990.0, 22990.0, 22996.0, 22999.0, 22999.0, 22999.0, 22999.0, 22999.0, 22999.0, 22999.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23000.0, 23123.0, 23200.0, 23200.0, 23200.0, 23200.0, 23233.0, 23250.0, 23250.0, 23250.0, 23300.0, 23333.0, 23333.0, 23350.0, 23400.0, 23400.0, 23400.0, 23400.0, 23450.0, 23490.0, 23490.0, 23490.0, 23490.0, 23490.0, 23499.0, 23499.0, 23499.0, 23499.0, 23499.0, 23499.0, 23499.0, 23499.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23500.0, 23540.0, 23590.0, 23600.0, 23600.0, 23600.0, 23600.0, 23650.0, 23700.0, 23700.0, 23700.0, 23700.0, 23750.0, 23750.0, 23790.0, 23799.0, 23800.0, 23800.0, ...]
new_autos.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 47356 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date_crawled 47356 non-null object 1 name 47356 non-null object 2 seller 47356 non-null object 3 offer_type 47356 non-null object 4 price 47356 non-null float64 5 abtest 47356 non-null object 6 vehicle_type 42311 non-null object 7 registration_year 47356 non-null int64 8 gearbox 44724 non-null object 9 power_ps 47356 non-null int64 10 model 44695 non-null object 11 odometer_km 47356 non-null float64 12 registration_month 47356 non-null int64 13 fuel_type 42950 non-null object 14 brand 47356 non-null object 15 unrepaired_damage 37668 non-null object 16 ad_created 47356 non-null object 17 num_pictures 47356 non-null int64 18 postal_code 47356 non-null int64 19 last_seen 47356 non-null object dtypes: float64(2), int64(5), object(13) memory usage: 7.6+ MB
autos[['date_crawled','ad_created','last_seen']].head()
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
new_autos['date_crawled'].str[:10].value_counts(normalize=True,dropna=False).sort_index()
# normalize make it a percentage instead of the actual count
# we want to keep the null values
# earliest to latest
2016-03-05 0.025551 2016-03-06 0.013874 2016-03-07 0.036088 2016-03-08 0.033533 2016-03-09 0.033216 2016-03-10 0.032308 2016-03-11 0.032140 2016-03-12 0.037313 2016-03-13 0.015563 2016-03-14 0.037081 2016-03-15 0.033702 2016-03-16 0.029774 2016-03-17 0.031633 2016-03-18 0.012966 2016-03-19 0.034674 2016-03-20 0.037883 2016-03-21 0.037398 2016-03-22 0.032773 2016-03-23 0.032435 2016-03-24 0.029162 2016-03-25 0.031844 2016-03-26 0.032900 2016-03-27 0.030640 2016-03-28 0.034737 2016-03-29 0.034125 2016-03-30 0.033660 2016-03-31 0.031781 2016-04-01 0.033259 2016-04-02 0.035180 2016-04-03 0.038622 2016-04-04 0.036616 2016-04-05 0.012987 2016-04-06 0.003167 2016-04-07 0.001415 Name: date_crawled, dtype: float64
With the date_crawled column and its distribution, we can see that most occurrences were between the middle of March and very early April.
new_autos['ad_created'].str[:10].value_counts(normalize=True,dropna=False).sort_index()
2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 2015-12-30 0.000021 ... 2016-04-03 0.038876 2016-04-04 0.036975 2016-04-05 0.011699 2016-04-06 0.003231 2016-04-07 0.001288 Name: ad_created, Length: 74, dtype: float64
Similarly to the date_crawled distribution, we can see that most of the ads were created between mid March and very early April. One thing to note is that some ads were created in 2015 and the date_crawled column did not have any. This may be something we would want to explore later.
new_autos['last_seen'].str[:10].value_counts(normalize=True,dropna=False).sort_index()
2016-03-05 0.001140 2016-03-06 0.004603 2016-03-07 0.005511 2016-03-08 0.007919 2016-03-09 0.010178 2016-03-10 0.011086 2016-03-11 0.013029 2016-03-12 0.024601 2016-03-13 0.009291 2016-03-14 0.012881 2016-03-15 0.016154 2016-03-16 0.016872 2016-03-17 0.028529 2016-03-18 0.007560 2016-03-19 0.016175 2016-03-20 0.021180 2016-03-21 0.021053 2016-03-22 0.022067 2016-03-23 0.019026 2016-03-24 0.019913 2016-03-25 0.019744 2016-03-26 0.017231 2016-03-27 0.016408 2016-03-28 0.021349 2016-03-29 0.022954 2016-03-30 0.025298 2016-03-31 0.024200 2016-04-01 0.023418 2016-04-02 0.025298 2016-04-03 0.025572 2016-04-04 0.025129 2016-04-05 0.121590 2016-04-06 0.216171 2016-04-07 0.126869 Name: last_seen, dtype: float64
The distribution of the last_seen column is as expected because you tend to see what is new rather than what is old in the public. So most of the occurrences lie within early April.
new_autos['registration_year'].describe()
count 47356.000000 mean 2004.620724 std 101.921674 min 1000.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
Using the describe method, we can see that there are definitely some unusual values, such as the max being 9999 and the min being 1000. Given the column tells us the year of the car was first released, these values (and others) may be removed.
(new_autos[new_autos['registration_year']<1900])
# 6 cars that have a registration_year under 1900
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10556 | 2016-04-01 06:02:10 | UNFAL_Auto | privat | Angebot | 450.0 | control | NaN | 1800 | NaN | 1800 | NaN | 5000.0 | 2 | NaN | mitsubishi | nein | 2016-04-01 00:00:00 | 0 | 63322 | 2016-04-01 09:42:30 |
22316 | 2016-03-29 16:56:41 | VW_Kaefer.__Zwei_zum_Preis_von_einem. | privat | Angebot | 1500.0 | control | NaN | 1000 | manuell | 0 | kaefer | 5000.0 | 0 | benzin | volkswagen | NaN | 2016-03-29 00:00:00 | 0 | 48324 | 2016-03-31 10:15:28 |
24511 | 2016-03-17 19:45:11 | Trabant__wartburg__Ostalgie | privat | Angebot | 490.0 | control | NaN | 1111 | NaN | 0 | NaN | 5000.0 | 0 | NaN | trabant | NaN | 2016-03-17 00:00:00 | 0 | 16818 | 2016-04-07 07:17:29 |
32585 | 2016-04-02 16:56:39 | UNFAL_Auto | privat | Angebot | 450.0 | control | NaN | 1800 | NaN | 1800 | NaN | 5000.0 | 2 | NaN | mitsubishi | nein | 2016-04-02 00:00:00 | 0 | 63322 | 2016-04-04 14:46:21 |
35238 | 2016-03-26 13:45:20 | Suche_Skoda_Fabia____Skoda_Fabia_Combi_mit_Klima | privat | Angebot | 0.0 | control | NaN | 1500 | NaN | 0 | NaN | 5000.0 | 0 | benzin | skoda | NaN | 2016-03-26 00:00:00 | 0 | 15517 | 2016-04-04 00:16:54 |
49283 | 2016-03-15 18:38:53 | Citroen_HY | privat | Angebot | 7750.0 | control | NaN | 1001 | NaN | 0 | andere | 5000.0 | 0 | NaN | citroen | NaN | 2016-03-15 00:00:00 | 0 | 66706 | 2016-04-06 18:47:20 |
(new_autos['registration_year']>2018).value_counts()
False 47337 True 19 Name: registration_year, dtype: int64
Looking at these specific years, I would say keeping the values between 1900 and 2018 is the most ideal. I would say its common to see car listings for cars in the future (1 or 2 years) so I decided to keep a few rows.
new_autos = new_autos[new_autos['registration_year'].between(1900,2018)]
new_autos['registration_year'].value_counts(normalize=True).head(20)
2000 0.070651 1999 0.063193 2005 0.063003 2004 0.057277 2003 0.057256 2001 0.056770 2006 0.056179 2002 0.053221 1998 0.051594 2007 0.046883 2008 0.044474 1997 0.042720 2009 0.041051 2017 0.030466 1996 0.030361 2010 0.028713 1995 0.027572 2011 0.027276 2016 0.027086 2012 0.020367 Name: registration_year, dtype: float64
The distribution of the remaining values of the registration year seems to be slightly skewed in the early 2000's. Then we see some 2010's and 1990s.
new_autos['brand'].unique()
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos', 'opel', 'mazda', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'porsche', 'hyundai', 'chevrolet', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'jaguar', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia'], dtype=object)
selected_brands = new_autos['brand'].value_counts(normalize=True).head(10).index
new_autos['brand'].value_counts(normalize=True).head(10)
# Lets explore the top 10 based on percentages
volkswagen 0.216771 opel 0.114407 bmw 0.104287 mercedes_benz 0.088948 audi 0.079441 ford 0.071982 renault 0.050601 peugeot 0.030720 fiat 0.027530 seat 0.019459 Name: brand, dtype: float64
I have chosen to aggregate the top 10 brands based on their respective percentage in the dataset. We see mostly German car manufacturers so I think it would be interesting to find any insights within their listings.
brand_dict = {}
for x in selected_brands:
rows = new_autos[new_autos['brand'] == x]
mean = rows['price'].mean()
brand_dict[x]=mean
sorted_dict = dict(sorted(brand_dict.items(),key=lambda x:x[1],reverse=True))
# the key parameter allows us to sort by values instead of keys
# reverse makes it in descending order
sorted_dict
{'audi': 6090.976595744681, 'bmw': 5944.947325769854, 'mercedes_benz': 5710.772446555819, 'volkswagen': 4309.721637426901, 'seat': 3847.6916395222584, 'ford': 3098.1209275022015, 'peugeot': 2981.661623108666, 'opel': 2685.0227146814404, 'fiat': 2641.108979278588, 'renault': 2245.678496868476}
From the selected brands, we see that the luxury brands have the highest mean price, as expected. The less known brands are much cheaper (to my knowledge). The means range from (2245,6090).
So let's make a single dataframe with mean mileage and mean price.
price_ser = pd.Series(data=sorted_dict,index=sorted_dict.keys())
price_df = pd.DataFrame(price_ser,columns=['mean_price'])
list_top6_brand = list(sorted_dict.keys())
mile_dict = {}
for x in list_top6_brand:
selected = new_autos[new_autos['brand']==x]
mean = selected['odometer_km'].mean()
mile_dict[x] = mean
mile_dict
{'audi': 137809.8404255319, 'bmw': 137336.91247974068, 'mercedes_benz': 136861.04513064132, 'volkswagen': 131788.9863547758, 'seat': 123979.37024972856, 'ford': 125702.96448488407, 'peugeot': 127479.36726272352, 'opel': 130046.16805170821, 'fiat': 117237.14504988489, 'renault': 128540.70981210856}
price_df['mean_mileage'] = mile_dict.values()
price_df.round(2)
mean_price | mean_mileage | |
---|---|---|
audi | 6090.98 | 137809.84 |
bmw | 5944.95 | 137336.91 |
mercedes_benz | 5710.77 | 136861.05 |
volkswagen | 4309.72 | 131788.99 |
seat | 3847.69 | 123979.37 |
ford | 3098.12 | 125702.96 |
peugeot | 2981.66 | 127479.37 |
opel | 2685.02 | 130046.17 |
fiat | 2641.11 | 117237.15 |
renault | 2245.68 | 128540.71 |
With both the mean price and mean mileage, we don't see any correlation between the 2.
new_autos
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000.0 | control | bus | 2004 | manuell | 158 | andere | 150000.0 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500.0 | control | limousine | 1997 | automatik | 286 | 7er | 150000.0 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990.0 | test | limousine | 2009 | manuell | 102 | golf | 70000.0 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350.0 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000.0 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350.0 | test | kombi | 2003 | manuell | 0 | focus | 150000.0 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | privat | Angebot | 1650.0 | control | kleinwagen | 1997 | manuell | 0 | NaN | 150000.0 | 7 | benzin | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | privat | Angebot | 5000.0 | control | kombi | 2001 | automatik | 299 | a6 | 150000.0 | 1 | benzin | audi | nein | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | 1980.0 | control | cabrio | 1996 | manuell | 75 | astra | 150000.0 | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | 13200.0 | test | cabrio | 2014 | automatik | 69 | 500 | 5000.0 | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | 1250.0 | control | limousine | 1996 | manuell | 101 | vectra | 150000.0 | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
47331 rows × 20 columns
new_autos['name'].str.replace('_',' ')
0 Peugeot 807 160 NAVTECH ON BOARD 1 BMW 740i 4 4 Liter HAMANN UMBAU Mega Optik 2 Volkswagen Golf 1.6 United 3 Smart smart fortwo coupe softouch/F1/Klima/Pan... 4 Ford Focus 1 6 Benzin TÜV neu ist sehr gepfleg... ... 49993 Audi A3 1 8l Silber; schoenes Fahrzeug 49994 Audi A6 S6 Avant 4.2 quattro eventuell Tausc... 49996 Opel Astra F Cabrio Bertone Edition TÜV neu+... 49997 Fiat 500 C 1.2 Dualogic Lounge 49999 Opel Vectra 1.6 16V Name: name, Length: 47331, dtype: object
new_autos['seller'].unique() #privat -> private, gewerblich -> commercial
array(['privat', 'gewerblich'], dtype=object)
new_autos['seller'] = new_autos['seller'].str.replace('privat','private')
new_autos['seller'] = new_autos['seller'].str.replace('gewerblich','commercial')
new_autos['offer_type'].unique()
# angebot = offer, gesuch = request
array(['Angebot', 'Gesuch'], dtype=object)
new_autos['offer_type'] = new_autos['offer_type'].str.replace('Angebot','offer').str.replace('Gesuch','request')
new_autos['vehicle_type'].unique()
array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv', 'cabrio', 'andere'], dtype=object)
#kleinwagen - supermini
#kombi - volkswagen type 2
#cabrio - convertible
#andere - other
new_autos['vehicle_type'].value_counts(dropna=False)
limousine 12175 kleinwagen 10790 kombi 8744 NaN 5020 bus 3866 cabrio 2640 coupe 2127 suv 1556 andere 413 Name: vehicle_type, dtype: int64
new_autos.loc[new_autos['vehicle_type']=='kleinwagen','vehicle_type'] = 'supermini'
new_autos.loc[new_autos['vehicle_type']=='kombi','vehicle_type'] = 'volkswagen type 2'
new_autos.loc[new_autos['vehicle_type']=='cabrio','vehicle_type'] = 'convertible'
new_autos.loc[new_autos['vehicle_type']=='andere','vehicle_type'] = 'other'
new_autos['gearbox'].unique()
array(['manuell', 'automatik', nan], dtype=object)
new_autos.loc[new_autos['gearbox']=='manuell','gearbox'] = 'manual'
new_autos.loc[new_autos['gearbox']=='automatik','gearbox'] = 'auto'
new_autos['fuel_type'].unique()
array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro', 'andere'], dtype=object)
#benzin = petrol
#lpg = liquefied petroleum gas
#cng = compressed natural gas
#elektro = electric
#andere = other
new_autos.loc[new_autos['fuel_type']=='benzin','fuel_type'] = 'petrol'
new_autos.loc[new_autos['fuel_type']=='elektro','fuel_type'] = 'electric'
new_autos.loc[new_autos['fuel_type']=='andere','fuel_type'] = 'other'
new_autos['unrepaired_damage'].unique()
#nein = no
#ja = yes
array(['nein', nan, 'ja'], dtype=object)
new_autos.loc[new_autos['unrepaired_damage']=='nein','unrepaired_damage'] = 'no'
new_autos.loc[new_autos['unrepaired_damage']=='ja','unrepaired_damage'] = 'yes'
new_autos['date'] = new_autos['date_crawled'].str.split(' ',expand=True)[0].str.replace('-','').astype(int)
new_autos
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | ... | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_pictures | postal_code | last_seen | date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | private | offer | 5000.0 | control | bus | 2004 | manual | 158 | ... | 150000.0 | 3 | lpg | peugeot | no | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 | 20160326 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | private | offer | 8500.0 | control | limousine | 1997 | auto | 286 | ... | 150000.0 | 6 | petrol | bmw | no | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 | 20160404 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | private | offer | 8990.0 | test | limousine | 2009 | manual | 102 | ... | 70000.0 | 7 | petrol | volkswagen | no | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 | 20160326 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | private | offer | 4350.0 | control | supermini | 2007 | auto | 71 | ... | 70000.0 | 6 | petrol | smart | no | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 | 20160312 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | private | offer | 1350.0 | test | volkswagen type 2 | 2003 | manual | 0 | ... | 150000.0 | 7 | petrol | ford | no | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 | 20160401 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49993 | 2016-03-15 18:47:35 | Audi_A3__1_8l__Silber;_schoenes_Fahrzeug | private | offer | 1650.0 | control | supermini | 1997 | manual | 0 | ... | 150000.0 | 7 | petrol | audi | NaN | 2016-03-15 00:00:00 | 0 | 65203 | 2016-04-06 19:46:53 | 20160315 |
49994 | 2016-03-22 17:36:42 | Audi_A6__S6__Avant_4.2_quattro_eventuell_Tausc... | private | offer | 5000.0 | control | volkswagen type 2 | 2001 | auto | 299 | ... | 150000.0 | 1 | petrol | audi | no | 2016-03-22 00:00:00 | 0 | 46537 | 2016-04-06 08:16:39 | 20160322 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | private | offer | 1980.0 | control | convertible | 1996 | manual | 75 | ... | 150000.0 | 5 | petrol | opel | no | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 | 20160328 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | private | offer | 13200.0 | test | convertible | 2014 | auto | 69 | ... | 5000.0 | 11 | petrol | fiat | no | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 | 20160402 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | private | offer | 1250.0 | control | limousine | 1996 | manual | 101 | ... | 150000.0 | 1 | petrol | opel | no | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 | 20160314 |
47331 rows × 21 columns