We're working with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
The dataset was originally scraped and uploaded to Kaggle by user orgesleka. The original dataset isn't available on Kaggle anymore, but you can find it here.
A few modifications were made to the original dataset:
Let's start by importing the libraries we need and reading the dataset into pandas.
import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv", encoding='Latin-1')
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
From the above output we can see that our dataset contains 50,000 rows and 20 columns.
# Inspecting further with info() and head()
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
Most of our columns are objects, while yearOfRegistration, powerPS, monthOfRegistration, nrOfPictures and postalCode are int64.
We can make the following observations:
Let's convert the column names from camelCase to snake_case and reword some of the column names, based on the data dictionary, to be more descriptive.
print(autos.columns)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
autos.columns = ['datecrawled', 'name', 'seller', 'offertype', 'price', 'abtest',
'vehicletype', 'registration_year', 'gearbox', 'powerps', 'model',
'odometer', 'registration_month', 'fueltype', 'brand',
'unrepaired_damage', 'ad_created', 'nrofpictures', 'postalcode',
'lastseen']
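The rename above uses an explicit list. As a hypothetical alternative (not part of the original notebook), the camelCase-to-snake_case conversion can also be done programmatically with a regular expression:

```python
import re

def camel_to_snake(name):
    # Insert an underscore before each interior uppercase letter, then lowercase
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

print(camel_to_snake('dateCrawled'))          # date_crawled
print(camel_to_snake('yearOfRegistration'))   # year_of_registration
```

Note this produces fully underscored names, whereas the notebook simply lowercases most of them; either convention works as long as it is applied consistently.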
autos.head()
datecrawled | name | seller | offertype | price | abtest | vehicletype | registration_year | gearbox | powerps | model | odometer | registration_month | fueltype | brand | unrepaired_damage | ad_created | nrofpictures | postalcode | lastseen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for:
autos.describe(include='all')
datecrawled | name | seller | offertype | price | abtest | vehicletype | registration_year | gearbox | powerps | model | odometer | registration_month | fueltype | brand | unrepaired_damage | ad_created | nrofpictures | postalcode | lastseen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-11 22:38:16 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
What can be seen is that the columns 'seller' and 'offertype' consist almost entirely of a single value. These columns are therefore not useful for further analysis and will be dropped from the dataframe.
# Taking a quick look at individual columns
autos['odometer'].value_counts(dropna = False)
150,000km 32424 125,000km 5170 100,000km 2169 90,000km 1757 80,000km 1436 70,000km 1230 60,000km 1164 50,000km 1027 5,000km 967 40,000km 819 30,000km 789 20,000km 784 10,000km 264 Name: odometer, dtype: int64
# Remove columns that don't provide useful data
autos.drop(['seller','offertype'],axis=1,inplace=True)
For each column we will:
- Remove all non-numeric characters
- Convert the column to a numeric type
- Rename the 'odometer' column to 'odometer_km'
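The steps above can be wrapped in a small reusable helper (a sketch on a toy Series; `strip_to_numeric` is a hypothetical name, and the notebook applies the same logic inline below):

```python
import pandas as pd

def strip_to_numeric(series, chars=("$", ",", "km")):
    """Remove the given non-numeric substrings and cast to float."""
    out = series.astype(str)
    for ch in chars:
        out = out.str.replace(ch, "", regex=False)  # literal match, not regex
    return out.astype(float)

print(strip_to_numeric(pd.Series(["$5,000", "$8,500"])).tolist())   # [5000.0, 8500.0]
print(strip_to_numeric(pd.Series(["150,000km"])).tolist())          # [150000.0]
```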
#price
autos['price'] = (autos['price']
                  .str.replace("$", "", regex=False)  # "$" is a regex metacharacter, so match it literally
                  .str.replace(",", "", regex=False)
                  .astype(float)
                  )
#odometer
autos['odometer'] = (autos['odometer']
                     .str.replace("km", "", regex=False)
                     .str.replace(",", "", regex=False)
                     .astype(int)
                     )
autos.rename({'odometer': 'odometer_km'},axis=1,inplace=True)
autos['price'].head()
0 5000.0 1 8500.0 2 8990.0 3 4350.0 4 1350.0 Name: price, dtype: float64
autos['odometer_km'].value_counts()
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: odometer_km, dtype: int64
The columns 'price' and 'odometer_km' have now successfully been converted to numeric values. They will now be analysed further to check for outliers that might need to be removed.
Let's continue exploring the data, specifically looking for data that doesn't look right. We'll start by analyzing the odometer_km and price columns
autos['price'].nunique()
2357
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
Here we can see that the price column has a positive (right) skew: its maximum value (100,000,000) is roughly four orders of magnitude greater than the 75th percentile (7,200).
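The skew can be confirmed numerically by comparing the maximum against the upper quartile, or with `Series.skew()` (a sketch on a toy right-skewed series, not the real data):

```python
import pandas as pd

s = pd.Series([100, 200, 300, 400, 100_000_000])  # one extreme outlier at the top

ratio = s.max() / s.quantile(0.75)
print(ratio)      # the max dwarfs the 75th percentile
print(s.skew())   # positive value -> right (positive) skew
```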
autos['price'].unique().shape
(2357,)
autos['price'].value_counts()
0.0 1421 500.0 781 1500.0 734 2500.0 643 1200.0 639 ... 6202.0 1 18310.0 1 898.0 1 11240.0 1 789.0 1 Name: price, Length: 2357, dtype: int64
autos['odometer_km'].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
Here we can see that the 'odometer_km' column has a negative (left) skew: its minimum value (5,000 km) is roughly 25 times smaller than the 25th percentile (125,000 km).
autos['odometer_km'].shape
(50000,)
autos['odometer_km'].unique().shape
(13,)
# Removing outliers in the Price column and checking the new statistical distribution
autos['price'] = autos.loc[autos['price'].between(0,30000), 'price']
autos['price'].describe()
count 49206.000000 mean 5025.773483 std 5679.154441 min 0.000000 25% 1100.000000 50% 2850.000000 75% 6900.000000 max 30000.000000 Name: price, dtype: float64
The lowest price is zero, which occurs 1,421 times in the dataset. Price represents the starting value of an auction. Since it is only the starting price, it does not mean that a car will actually sell for that amount. Therefore anything below 500 euros (which seems like a realistic lower bound for used car prices) is not removed from the dataset.
There are also outliers at the top. When investigating the unique values with their respective counts, the starting price increases gradually up to 350,000; beyond that it jumps rapidly. Listings with a starting price higher than 350,000 will be investigated further.
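One way to spot where that jump happens is to sort the unique prices in descending order and look at everything above the threshold (a sketch on synthetic prices; with the real data one would sort `autos['price'].unique()`):

```python
import pandas as pd

prices = pd.Series([500, 1200, 2500, 350_000, 999_999, 10_000_000])

# Sort unique values descending to see where the top of the range breaks away
top = pd.Series(prices.unique()).sort_values(ascending=False)
suspicious = top[top > 350_000]
print(suspicious.tolist())   # [10000000, 999999]
```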
autos['price'].nunique()
1985
# Removing outliers in the odometer_km column and checking its statistical distribution
autos['odometer_km'] = autos.loc[autos['odometer_km'].between(10000,150000), 'odometer_km']
autos['odometer_km'].nunique()
12
# The new Dataframe
autos.describe()
price | registration_year | powerps | odometer_km | registration_month | nrofpictures | postalcode | |
---|---|---|---|---|---|---|---|
count | 49206.000000 | 50000.000000 | 50000.000000 | 49033.000000 | 50000.000000 | 50000.0 | 50000.000000 |
mean | 5025.773483 | 2005.073280 | 116.355920 | 128113.719332 | 5.723360 | 0.0 | 50813.627300 |
std | 5679.154441 | 105.712813 | 209.216627 | 36631.381785 | 3.711984 | 0.0 | 25779.747957 |
min | 0.000000 | 1000.000000 | 0.000000 | 10000.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1100.000000 | 1999.000000 | 70.000000 | 125000.000000 | 3.000000 | 0.0 | 30451.000000 |
50% | 2850.000000 | 2003.000000 | 105.000000 | 150000.000000 | 6.000000 | 0.0 | 49577.000000 |
75% | 6900.000000 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 0.0 | 71540.000000 |
max | 30000.000000 | 9999.000000 | 17700.000000 | 150000.000000 | 12.000000 | 0.0 | 99998.000000 |
Because these three date columns ('datecrawled', 'ad_created' and 'lastseen') are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively.
The other two time columns, 'registration_month' and 'registration_year', are already numeric, so their distribution can be inspected directly with the Series.describe() method.
autos.columns
Index(['datecrawled', 'name', 'price', 'abtest', 'vehicletype', 'registration_year', 'gearbox', 'powerps', 'model', 'odometer_km', 'registration_month', 'fueltype', 'brand', 'unrepaired_damage', 'ad_created', 'nrofpictures', 'postalcode', 'lastseen'], dtype='object')
autos.loc[0:5, ['datecrawled', 'ad_created', 'lastseen']]
datecrawled | ad_created | lastseen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | 2016-03-21 00:00:00 | 2016-04-06 09:45:21 |
Let's extract just the date portion and calculate the distribution of values in these three columns.
# INVESTIGATING THE DATE_CRAWLED COLUMN
date_crawled = autos['datecrawled'].str[:10].value_counts(normalize= True, dropna = False).mul(100).round(1).astype(str) + '%'
date_crawled.sort_index()
2016-03-05 2.5% 2016-03-06 1.4% 2016-03-07 3.6% 2016-03-08 3.3% 2016-03-09 3.3% 2016-03-10 3.2% 2016-03-11 3.2% 2016-03-12 3.7% 2016-03-13 1.6% 2016-03-14 3.7% 2016-03-15 3.4% 2016-03-16 2.9% 2016-03-17 3.2% 2016-03-18 1.3% 2016-03-19 3.5% 2016-03-20 3.8% 2016-03-21 3.8% 2016-03-22 3.3% 2016-03-23 3.2% 2016-03-24 2.9% 2016-03-25 3.2% 2016-03-26 3.2% 2016-03-27 3.1% 2016-03-28 3.5% 2016-03-29 3.4% 2016-03-30 3.4% 2016-03-31 3.2% 2016-04-01 3.4% 2016-04-02 3.5% 2016-04-03 3.9% 2016-04-04 3.7% 2016-04-05 1.3% 2016-04-06 0.3% 2016-04-07 0.1% Name: datecrawled, dtype: object
Looking at the data above, the period over which the data was crawled covers roughly one month (March-April 2016). The distribution is more or less uniform.
# INVESTIGATING THE ad_created COLUMN
ad_created = autos['ad_created'].str[:10].value_counts(normalize= True, dropna = False).mul(100).round(1).astype(str) + '%'
ad_created.sort_index()
2015-06-11 0.0% 2015-08-10 0.0% 2015-09-09 0.0% 2015-11-10 0.0% 2015-12-05 0.0% ... 2016-04-03 3.9% 2016-04-04 3.7% 2016-04-05 1.2% 2016-04-06 0.3% 2016-04-07 0.1% Name: ad_created, Length: 76, dtype: object
The dates ads were created range from June 2015 until April 2016. The vast majority (~97%) of ads in the dataset were created after the date on which the data was first crawled.
This makes sense, as most listings are only live for a short period of time.
# INVESTIGATING THE lastseen COLUMN
last_seen = autos['lastseen'].str[:10].value_counts(normalize= True, dropna = False).mul(100).round(1).astype(str) + '%'
last_seen.sort_index()
2016-03-05 0.1% 2016-03-06 0.4% 2016-03-07 0.5% 2016-03-08 0.8% 2016-03-09 1.0% 2016-03-10 1.1% 2016-03-11 1.3% 2016-03-12 2.4% 2016-03-13 0.9% 2016-03-14 1.3% 2016-03-15 1.6% 2016-03-16 1.6% 2016-03-17 2.8% 2016-03-18 0.7% 2016-03-19 1.6% 2016-03-20 2.1% 2016-03-21 2.1% 2016-03-22 2.2% 2016-03-23 1.9% 2016-03-24 2.0% 2016-03-25 1.9% 2016-03-26 1.7% 2016-03-27 1.6% 2016-03-28 2.1% 2016-03-29 2.2% 2016-03-30 2.5% 2016-03-31 2.4% 2016-04-01 2.3% 2016-04-02 2.5% 2016-04-03 2.5% 2016-04-04 2.5% 2016-04-05 12.4% 2016-04-06 22.1% 2016-04-07 13.1% Name: lastseen, dtype: object
The last seen dates look fairly uniform, but the last three days account for a disproportionate share of 'last seen' values. Given that these are 6-10x the values of the preceding days, it's unlikely there was a massive spike in sales; more likely these values reflect the crawling period ending, and they don't indicate car sales.
registration_year = autos["registration_year"].describe()
registration_year
count 50000.000000 mean 2005.073280 std 105.712813 min 1000.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
autos["registration_year"].head(10)
0 2004 1 1997 2 2009 3 2007 4 2003 5 2006 6 1995 7 1998 8 2000 9 1997 Name: registration_year, dtype: int64
The minimum and maximum registration years look wrong.
The lowest registration year is 1000, which must be incorrect, since cars only started appearing in the late 1800s; the maximum year of 9999 lies in the future. Because of this, all listings with a registration year before 1885 (the first patented practical automobile) will be removed.
All registration years after 2016 must also be incorrect, as the ads were created in 2015 and 2016; these will be removed from the dataset as well.
A car can't be registered after its listing was last seen, so any vehicle with a registration year above 2016 can be deduced to be inaccurate: the ad_created column, which represents the date of listing, only contains dates up to 2016.
Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s. To validate this, we can count the number of listings that fall outside the 1800-2016 interval used below and see if it is safe to remove those rows entirely or if they require custom logic.
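Counting rows outside the interval amounts to inverting `Series.between` (a sketch on a toy series; on the real data the same mask would be applied to `autos['registration_year']`):

```python
import pandas as pd

years = pd.Series([1000, 1985, 2004, 2016, 5000, 9999])

in_range = years.between(1800, 2016)   # inclusive bounds
n_outside = (~in_range).sum()
share = n_outside / len(years)
print(n_outside, round(share, 2))      # 3 0.5
```

If the share of out-of-range rows is small, dropping them outright is usually safe; a large share would call for custom handling instead.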
autos[autos['registration_year'].between(1800,2016)].describe()
price | registration_year | powerps | odometer_km | registration_month | nrofpictures | postalcode | |
---|---|---|---|---|---|---|---|
count | 47245.000000 | 48030.000000 | 48030.000000 | 47117.000000 | 48030.000000 | 48030.0 | 48030.000000 |
mean | 5094.483014 | 2002.795066 | 117.140496 | 127874.864699 | 5.767604 | 0.0 | 50936.383094 |
std | 5729.183673 | 7.426905 | 195.449149 | 36786.714694 | 3.696805 | 0.0 | 25791.666655 |
min | 0.000000 | 1800.000000 | 0.000000 | 10000.000000 | 0.000000 | 0.0 | 1067.000000 |
25% | 1100.000000 | 1999.000000 | 71.000000 | 125000.000000 | 3.000000 | 0.0 | 30459.000000 |
50% | 2900.000000 | 2003.000000 | 107.000000 | 150000.000000 | 6.000000 | 0.0 | 49696.000000 |
75% | 6999.000000 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 0.0 | 71665.000000 |
max | 30000.000000 | 2016.000000 | 17700.000000 | 150000.000000 | 12.000000 | 0.0 | 99998.000000 |
There is still a large variety in the registration years of the cars. The mean of approximately 2003, with a small standard deviation (~7 years), indicates that most cars are roughly between 7 and 21 years old.
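Since the data was crawled in 2016, car age can be derived directly from the registration year (a sketch; using 2016 as the reference year is an assumption taken from the crawl dates):

```python
import pandas as pd

reg_years = pd.Series([1999, 2003, 2008, 2016])

age = 2016 - reg_years   # age at the time of crawling
print(age.tolist())      # [17, 13, 8, 0]
```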
autos['brand'].value_counts()
volkswagen 10687 opel 5461 bmw 5429 mercedes_benz 4734 audi 4283 ford 3479 renault 2404 peugeot 1456 fiat 1308 seat 941 skoda 786 mazda 757 nissan 754 smart 701 citroen 701 toyota 617 sonstige_autos 546 hyundai 488 volvo 457 mini 424 mitsubishi 406 honda 399 kia 356 alfa_romeo 329 porsche 294 suzuki 293 chevrolet 283 chrysler 181 dacia 129 daihatsu 128 jeep 110 subaru 109 land_rover 99 saab 80 daewoo 79 trabant 78 jaguar 77 rover 69 lancia 57 lada 31 Name: brand, dtype: int64
Volkswagen is the most common brand, accounting for over 20% of all listings, followed by opel, bmw, mercedes_benz, audi, ford, renault, peugeot and fiat.
autos['brand'].nunique()
40
autos['brand'].value_counts(normalize = True).mul(100).round(1).astype(str) + '%'
volkswagen 21.4% opel 10.9% bmw 10.9% mercedes_benz 9.5% audi 8.6% ford 7.0% renault 4.8% peugeot 2.9% fiat 2.6% seat 1.9% skoda 1.6% mazda 1.5% nissan 1.5% smart 1.4% citroen 1.4% toyota 1.2% sonstige_autos 1.1% hyundai 1.0% volvo 0.9% mini 0.8% mitsubishi 0.8% honda 0.8% kia 0.7% alfa_romeo 0.7% porsche 0.6% suzuki 0.6% chevrolet 0.6% chrysler 0.4% dacia 0.3% daihatsu 0.3% jeep 0.2% subaru 0.2% land_rover 0.2% saab 0.2% daewoo 0.2% trabant 0.2% jaguar 0.2% rover 0.1% lancia 0.1% lada 0.1% Name: brand, dtype: object
When working with data on cars, it's good to explore variations across different car brands. Aggregation can be used to understand the brand column.
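The per-brand aggregation built with a dictionary loop below can equivalently be expressed with `groupby` (a sketch on toy data, not the notebook's code):

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["bmw", "bmw", "opel", "opel"],
    "price": [8000, 6000, 3000, 2000],
})

# Mean price per brand, sorted highest first
mean_prices = df.groupby("brand")["price"].mean().sort_values(ascending=False)
print(mean_prices.to_dict())   # {'bmw': 7000.0, 'opel': 2500.0}
```

The loop approach is more explicit and easier to follow step by step; `groupby` is the idiomatic one-liner for the same result.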
#selecting the top 20 brands
auto_brands_20 = autos['brand'].value_counts().head(20).index
auto_brands_20
Index(['volkswagen', 'opel', 'bmw', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat', 'skoda', 'mazda', 'nissan', 'smart', 'citroen', 'toyota', 'sonstige_autos', 'hyundai', 'volvo', 'mini'], dtype='object')
brand_mean_prices = {}
for brand in auto_brands_20:
mean_price = round(autos.loc[autos['brand'] == brand, 'price'].mean())
# Creating a new key_value pair with this mean price
brand_mean_prices[brand] = mean_price
brand_mean_prices
{'volkswagen': 4945.0, 'opel': 2834.0, 'bmw': 7142.0, 'mercedes_benz': 7209.0, 'audi': 7640.0, 'ford': 3421.0, 'renault': 2314.0, 'peugeot': 3011.0, 'fiat': 2698.0, 'seat': 4188.0, 'skoda': 6271.0, 'mazda': 3773.0, 'nissan': 4511.0, 'smart': 3483.0, 'citroen': 3645.0, 'toyota': 4984.0, 'sonstige_autos': 6692.0, 'hyundai': 5317.0, 'volvo': 4686.0, 'mini': 10281.0}
# Sorting the brand dictionary by mean price, highest first
brand_mean_prices = dict(sorted(brand_mean_prices.items(), key = lambda data: data[1], reverse = True ))
brand_mean_prices
{'mini': 10281.0, 'audi': 7640.0, 'mercedes_benz': 7209.0, 'bmw': 7142.0, 'sonstige_autos': 6692.0, 'skoda': 6271.0, 'hyundai': 5317.0, 'toyota': 4984.0, 'volkswagen': 4945.0, 'volvo': 4686.0, 'nissan': 4511.0, 'seat': 4188.0, 'mazda': 3773.0, 'citroen': 3645.0, 'smart': 3483.0, 'ford': 3421.0, 'peugeot': 3011.0, 'opel': 2834.0, 'fiat': 2698.0, 'renault': 2314.0}
We aggregated across brands to understand mean price, and observed a distinct price gap within the top brands:
- Audi, BMW and Mercedes-Benz are more expensive
- Ford and Opel are less expensive
- Volkswagen sits in between, at about 4,945
From this brief statistical aggregation, the brand mini has the highest mean price among the top 20 brands in our dataset, followed by audi, mercedes_benz, bmw and sonstige_autos.
Using these two aggregated series objects, we can then combine them into a single subset dataframe (with a shared index).
brand_mean_mileage = {}
for brand in auto_brands_20:
mean_mileage = round(autos.loc[autos['brand'] == brand, 'odometer_km'].mean())
# Creating a new key-value pair with this mean mileage
brand_mean_mileage[brand] = mean_mileage
brand_mean_mileage
{'volkswagen': 131187.0, 'opel': 132091.0, 'bmw': 134817.0, 'mercedes_benz': 132721.0, 'audi': 130967.0, 'ford': 126436.0, 'renault': 130734.0, 'peugeot': 128540.0, 'fiat': 120208.0, 'seat': 123958.0, 'skoda': 113009.0, 'mazda': 127067.0, 'nissan': 121135.0, 'smart': 101444.0, 'citroen': 121257.0, 'toyota': 117262.0, 'sonstige_autos': 101298.0, 'hyundai': 107836.0, 'volvo': 139812.0, 'mini': 91205.0}
# Sorting the brand dictionary by mean mileage, highest first
brand_mean_mileage = dict(sorted(brand_mean_mileage.items(), key = lambda data: data[1], reverse = True ))
brand_mean_mileage
{'volvo': 139812.0, 'bmw': 134817.0, 'mercedes_benz': 132721.0, 'opel': 132091.0, 'volkswagen': 131187.0, 'audi': 130967.0, 'renault': 130734.0, 'peugeot': 128540.0, 'mazda': 127067.0, 'ford': 126436.0, 'seat': 123958.0, 'citroen': 121257.0, 'nissan': 121135.0, 'fiat': 120208.0, 'toyota': 117262.0, 'skoda': 113009.0, 'hyundai': 107836.0, 'smart': 101444.0, 'sonstige_autos': 101298.0, 'mini': 91205.0}
# Converting both dictionaries to Series objects, using the Series constructor
mean_prices_top20 = pd.Series(brand_mean_prices)
mean_milieage_top20 = pd.Series(brand_mean_mileage)
# Creating a dataframe from the mean prices series
top_20_brands = pd.DataFrame(mean_prices_top20, columns=['mean_price'])
top_20_brands
mean_price | |
---|---|
mini | 10281.0 |
audi | 7640.0 |
mercedes_benz | 7209.0 |
bmw | 7142.0 |
sonstige_autos | 6692.0 |
skoda | 6271.0 |
hyundai | 5317.0 |
toyota | 4984.0 |
volkswagen | 4945.0 |
volvo | 4686.0 |
nissan | 4511.0 |
seat | 4188.0 |
mazda | 3773.0 |
citroen | 3645.0 |
smart | 3483.0 |
ford | 3421.0 |
peugeot | 3011.0 |
opel | 2834.0 |
fiat | 2698.0 |
renault | 2314.0 |
# Appending the mean_mileage data in the other series as a column of this new dataframe to see if there is any possible link
top_20_brands['mean_milieage'] = mean_milieage_top20
top_20_brands.sort_values('mean_milieage', ascending = False )
mean_price | mean_milieage | |
---|---|---|
volvo | 4686.0 | 139812.0 |
bmw | 7142.0 | 134817.0 |
mercedes_benz | 7209.0 | 132721.0 |
opel | 2834.0 | 132091.0 |
volkswagen | 4945.0 | 131187.0 |
audi | 7640.0 | 130967.0 |
renault | 2314.0 | 130734.0 |
peugeot | 3011.0 | 128540.0 |
mazda | 3773.0 | 127067.0 |
ford | 3421.0 | 126436.0 |
seat | 4188.0 | 123958.0 |
citroen | 3645.0 | 121257.0 |
nissan | 4511.0 | 121135.0 |
fiat | 2698.0 | 120208.0 |
toyota | 4984.0 | 117262.0 |
skoda | 6271.0 | 113009.0 |
hyundai | 5317.0 | 107836.0 |
smart | 3483.0 | 101444.0 |
sonstige_autos | 6692.0 | 101298.0 |
mini | 10281.0 | 91205.0 |
The table shows both mean price and mean mileage, and brands with higher mean mileage tend to have lower mean prices. Even so, it is impossible to conclude from this alone whether higher mileage is driving down the price: within a brand there are many other variables affecting price (such as car type, engine type, registration year, etc.). To confirm whether mileage affects the price, we would need a slice of the dataset in which all those other variables are held as constant as possible.
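One way to hold the other variables roughly constant is to slice a single brand/model combination and compare mean price across mileage levels (a sketch on toy data; with the real dataset one might slice, e.g., volkswagen golf):

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["volkswagen"] * 4,
    "model": ["golf"] * 4,
    "odometer_km": [50_000, 50_000, 150_000, 150_000],
    "price": [9000, 8500, 3000, 3500],
})

# Fix brand and model, then compare price across mileage levels
one_model = df[(df["brand"] == "volkswagen") & (df["model"] == "golf")]
by_mileage = one_model.groupby("odometer_km")["price"].mean()
print(by_mileage.to_dict())   # {50000: 8750.0, 150000: 3250.0}
```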
autos.head(10)
datecrawled | name | price | abtest | vehicletype | registration_year | gearbox | powerps | model | odometer_km | registration_month | fueltype | brand | unrepaired_damage | ad_created | nrofpictures | postalcode | lastseen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000.0 | control | bus | 2004 | manuell | 158 | andere | 150000.0 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500.0 | control | limousine | 1997 | automatik | 286 | 7er | 150000.0 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990.0 | test | limousine | 2009 | manuell | 102 | golf | 70000.0 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350.0 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000.0 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350.0 | test | kombi | 2003 | manuell | 0 | focus | 150000.0 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
5 | 2016-03-21 13:47:45 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | 7900.0 | test | bus | 2006 | automatik | 150 | voyager | 150000.0 | 4 | diesel | chrysler | NaN | 2016-03-21 00:00:00 | 0 | 22962 | 2016-04-06 09:45:21 |
6 | 2016-03-20 17:55:21 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | 300.0 | test | limousine | 1995 | manuell | 90 | golf | 150000.0 | 8 | benzin | volkswagen | NaN | 2016-03-20 00:00:00 | 0 | 31535 | 2016-03-23 02:48:59 |
7 | 2016-03-16 18:55:19 | Golf_IV_1.9_TDI_90PS | 1990.0 | control | limousine | 1998 | manuell | 90 | golf | 150000.0 | 12 | diesel | volkswagen | nein | 2016-03-16 00:00:00 | 0 | 53474 | 2016-04-07 03:17:32 |
8 | 2016-03-22 16:51:34 | Seat_Arosa | 250.0 | test | NaN | 2000 | manuell | 0 | arosa | 150000.0 | 10 | NaN | seat | nein | 2016-03-22 00:00:00 | 0 | 7426 | 2016-03-26 18:18:10 |
9 | 2016-03-16 13:47:02 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | 590.0 | control | bus | 1997 | manuell | 90 | megane | 150000.0 | 7 | benzin | renault | nein | 2016-03-16 00:00:00 | 0 | 15749 | 2016-04-06 10:46:35 |
vehicletype, gearbox and fueltype are three columns with values in German (as is unrepaired_damage, with its 'ja'/'nein' values).
autos["vehicletype"].unique()
array(['bus', 'limousine', 'kleinwagen', 'kombi', nan, 'coupe', 'suv', 'cabrio', 'andere'], dtype=object)
autos["gearbox"].unique()
array(['manuell', 'automatik', nan], dtype=object)
autos["fueltype"].unique()
array(['lpg', 'benzin', 'diesel', nan, 'cng', 'hybrid', 'elektro', 'andere'], dtype=object)
# Identify categorical data that uses German words, translate them, and map the values to their English counterparts
translate_map = { 'kleinwagen' : "small car",
'kombi' : "estate car" ,
'cabrio' : "convertible",
'andere' : "other",
'elektro' : "electric",
'benzin' : 'petrol',
'manuell' : "manual",
'automatik' : "automatic",
"bus" : "bus",
"limousine" : "limousine",
"coupe":"coupe",
"suv" : "suv",
"lpg" : "lpg",
"diesel" : "diesel",
"cng" : "cng",
"hybrid": "hybrid",
"Unknown" : "Unknown",
"nein" : "no",
"ja" : "yes", }
categorical = ["fueltype", "gearbox", "vehicletype",
"unrepaired_damage"]
for word in categorical:
autos[word] = autos[word].map(translate_map)
# Replace NaN values with "Unknown"
for c in categorical:
autos.loc[ autos[c].isnull() , c ] = "Unknown"
autos["gearbox"].unique()
array(['manual', 'automatic', 'Unknown'], dtype=object)
autos["vehicletype"].unique()
array(['bus', 'limousine', 'small car', 'estate car', 'Unknown', 'coupe', 'suv', 'convertible', 'other'], dtype=object)
autos["fueltype"].unique()
array(['lpg', 'petrol', 'diesel', 'Unknown', 'cng', 'hybrid', 'electric', 'other'], dtype=object)
autos["unrepaired_damage"].unique()
array(['no', 'Unknown', 'yes'], dtype=object)
#Clean the date columns: convert each date to a uniform integer, so "2016-03-21" becomes 20160321
date_cols = ["datecrawled", "ad_created", "lastseen"]
for c in date_cols:
temp = autos[c].str[:10].str.replace("-","").astype(int)
autos[c] = temp
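The conversion above can be illustrated on a toy timestamp column in the same "YYYY-MM-DD HH:MM:SS" format as datecrawled (toy values, not the real data):

```python
import pandas as pd

# Toy timestamps in the same format as the datecrawled column
dates = pd.Series(["2016-03-21 17:47:46", "2016-04-06 06:45:54"])

# Keep the first 10 characters (the date part), drop the dashes, cast to int
as_int = dates.str[:10].str.replace("-", "").astype(int)
print(as_int.tolist())  # [20160321, 20160406]
```

Storing dates as integers like this keeps them sortable and comparable without parsing them into full datetime objects.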
autos.head(20)
datecrawled | name | price | abtest | vehicletype | registration_year | gearbox | powerps | model | odometer_km | registration_month | fueltype | brand | unrepaired_damage | ad_created | nrofpictures | postalcode | lastseen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20160326 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000.0 | control | bus | 2004 | manual | 158 | andere | 150000.0 | 3 | lpg | peugeot | no | 20160326 | 0 | 79588 | 20160406 |
1 | 20160404 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500.0 | control | limousine | 1997 | automatic | 286 | 7er | 150000.0 | 6 | petrol | bmw | no | 20160404 | 0 | 71034 | 20160406 |
2 | 20160326 | Volkswagen_Golf_1.6_United | 8990.0 | test | limousine | 2009 | manual | 102 | golf | 70000.0 | 7 | petrol | volkswagen | no | 20160326 | 0 | 35394 | 20160406 |
3 | 20160312 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350.0 | control | small car | 2007 | automatic | 71 | fortwo | 70000.0 | 6 | petrol | smart | no | 20160312 | 0 | 33729 | 20160315 |
4 | 20160401 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350.0 | test | estate car | 2003 | manual | 0 | focus | 150000.0 | 7 | petrol | ford | no | 20160401 | 0 | 39218 | 20160401 |
5 | 20160321 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | 7900.0 | test | bus | 2006 | automatic | 150 | voyager | 150000.0 | 4 | diesel | chrysler | Unknown | 20160321 | 0 | 22962 | 20160406 |
6 | 20160320 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | 300.0 | test | limousine | 1995 | manual | 90 | golf | 150000.0 | 8 | petrol | volkswagen | Unknown | 20160320 | 0 | 31535 | 20160323 |
7 | 20160316 | Golf_IV_1.9_TDI_90PS | 1990.0 | control | limousine | 1998 | manual | 90 | golf | 150000.0 | 12 | diesel | volkswagen | no | 20160316 | 0 | 53474 | 20160407 |
8 | 20160322 | Seat_Arosa | 250.0 | test | Unknown | 2000 | manual | 0 | arosa | 150000.0 | 10 | Unknown | seat | no | 20160322 | 0 | 7426 | 20160326 |
9 | 20160316 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | 590.0 | control | bus | 1997 | manual | 90 | megane | 150000.0 | 7 | petrol | renault | no | 20160316 | 0 | 15749 | 20160406 |
10 | 20160315 | VW_Golf_Tuning_in_siber/grau | 999.0 | test | Unknown | 2017 | manual | 90 | NaN | 150000.0 | 4 | petrol | volkswagen | no | 20160314 | 0 | 86157 | 20160407 |
11 | 20160316 | Mercedes_A140_Motorschaden | 350.0 | control | Unknown | 2000 | Unknown | 0 | NaN | 150000.0 | 0 | petrol | mercedes_benz | Unknown | 20160316 | 0 | 17498 | 20160316 |
12 | 20160331 | Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan... | 5299.0 | control | small car | 2010 | automatic | 71 | fortwo | 50000.0 | 9 | petrol | smart | no | 20160331 | 0 | 34590 | 20160406 |
13 | 20160323 | Audi_A3_1.6_tuning | 1350.0 | control | limousine | 1999 | manual | 101 | a3 | 150000.0 | 11 | petrol | audi | no | 20160323 | 0 | 12043 | 20160401 |
14 | 20160323 | Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver... | 3999.0 | test | small car | 2007 | manual | 75 | clio | 150000.0 | 9 | petrol | renault | Unknown | 20160323 | 0 | 81737 | 20160401 |
15 | 20160401 | Corvette_C3_Coupe_T_Top_Crossfire_Injection | 18900.0 | test | coupe | 1982 | automatic | 203 | NaN | 80000.0 | 6 | petrol | sonstige_autos | no | 20160401 | 0 | 61276 | 20160402 |
16 | 20160316 | Opel_Vectra_B_Kombi | 350.0 | test | estate car | 1999 | manual | 101 | vectra | 150000.0 | 5 | petrol | opel | no | 20160316 | 0 | 57299 | 20160318 |
17 | 20160329 | Volkswagen_Scirocco_2_G60 | 5500.0 | test | coupe | 1990 | manual | 205 | scirocco | 150000.0 | 6 | petrol | volkswagen | no | 20160329 | 0 | 74821 | 20160405 |
18 | 20160326 | Verkaufen_mein_bmw_e36_320_i_touring | 300.0 | control | bus | 1995 | manual | 150 | 3er | 150000.0 | 0 | petrol | bmw | Unknown | 20160326 | 0 | 54329 | 20160402 |
19 | 20160317 | mazda_tribute_2.0_mit_gas_und_tuev_neu_2018 | 4150.0 | control | suv | 2004 | manual | 124 | andere | 150000.0 | 2 | lpg | mazda | no | 20160317 | 0 | 40878 | 20160317 |
Finding the most common model for each brand; this can be done with aggregation.
unique_brands = autos['brand'].unique()
unique_brands
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos', 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia'], dtype=object)
dict_model_brand = {}
for ub in unique_brands:
    model_counts = autos.loc[autos["brand"] == ub, "model"].value_counts()
    # idxmax() returns the most frequent model; guard against brands with no model data
    dict_model_brand[ub] = model_counts.idxmax() if not model_counts.empty else np.nan
dict_model_brand
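The per-brand loop can also be written as a single groupby aggregation. A minimal sketch on toy brand/model pairs (illustrative only, not the real listings):

```python
import pandas as pd

# Toy brand/model pairs (illustrative only)
df = pd.DataFrame({
    "brand": ["vw", "vw", "vw", "bmw", "bmw", "bmw"],
    "model": ["golf", "golf", "up", "3er", "3er", "5er"],
})

# For each brand, count the models and keep the most frequent one
most_common = df.groupby("brand")["model"].agg(lambda s: s.value_counts().idxmax())
print(most_common.to_dict())  # {'bmw': '3er', 'vw': 'golf'}
```

groupby avoids repeatedly filtering the full DataFrame once per brand, which matters more as the number of groups grows.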
Using aggregation to see whether the average price follows any pattern based on mileage.
groups_odometer = sorted(autos["odometer_km"].dropna().unique())
groups_odometer
[10000.0, 20000.0, 30000.0, 40000.0, 50000.0, 60000.0, 70000.0, 80000.0, 90000.0, 100000.0, 125000.0, 150000.0]
dict_avg_price_mil = {}
for kilom in groups_odometer:
    temp = autos.loc[autos["odometer_km"] == kilom, "price"].mean()
    dict_avg_price_mil[kilom] = round(temp, 2)
dict_avg_price_mil
{10000.0: 12652.79, 20000.0: 12008.03, 30000.0: 12416.99, 40000.0: 12365.93, 50000.0: 11109.58, 60000.0: 10098.24, 70000.0: 9450.56, 80000.0: 8523.17, 90000.0: 7448.37, 100000.0: 6919.56, 125000.0: 5637.48, 150000.0: 3519.07}
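The same mileage aggregation can be done in one step with groupby, which also skips NaN mileage values automatically. A sketch on toy rows (illustrative only):

```python
import pandas as pd

# Toy mileage/price rows (illustrative only)
df = pd.DataFrame({
    "odometer_km": [50000.0, 50000.0, 150000.0, 150000.0],
    "price": [12000.0, 10000.0, 4000.0, 3000.0],
})

# Mean price per mileage band, rounded to two decimals;
# rows with NaN odometer_km are excluded from the groups automatically
avg_price = df.groupby("odometer_km")["price"].mean().round(2)
print(avg_price.to_dict())  # {50000.0: 11000.0, 150000.0: 3500.0}
```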
The average car price decreases as mileage increases. Unrepaired damage could be part of the reason, but further analysis is needed.
Checking how much cheaper damaged cars are than their non-damaged counterparts.
autos['unrepaired_damage'].unique()
array(['no', 'Unknown', 'yes'], dtype=object)
autos['unrepaired_damage'].value_counts()
no         35232
Unknown     9829
yes         4939
Name: unrepaired_damage, dtype: int64
damage_not_damaged = ['no', 'Unknown', 'yes']
damage_not_damaged_counterparts = {}
for up in damage_not_damaged:
val = autos.loc[autos["unrepaired_damage"] == up, "price" ].mean()
damage_not_damaged_counterparts[up] = val
damage_not_damaged_counterparts
{'no': 6091.86273770872, 'Unknown': 2768.6617932726717, 'yes': 2043.6258876039765}
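This per-category loop, too, collapses into a single groupby. A sketch on toy damage/price rows (illustrative only, not the real listings):

```python
import pandas as pd

# Toy damage/price rows (illustrative only)
df = pd.DataFrame({
    "unrepaired_damage": ["no", "yes", "no", "Unknown"],
    "price": [6000.0, 2000.0, 8000.0, 3000.0],
})

# Mean price per damage status in one pass over the data
avg_by_damage = df.groupby("unrepaired_damage")["price"].mean()
print(avg_by_damage.to_dict())  # {'Unknown': 3000.0, 'no': 7000.0, 'yes': 2000.0}
```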
diff_in_price = damage_not_damaged_counterparts["no"] - damage_not_damaged_counterparts["yes"]
diff_in_price
4048.236850104743
There is a difference of about 4,000 dollars between the average prices of damaged and non-damaged cars: non-damaged cars are, on average, roughly 4,000 dollars more expensive than cars with unrepaired damage.