The aim of this project is to clean the data and analyze the included used car listings.
The data dictionary provided with the data is as follows:
dateCrawled - When this ad was first crawled. All field-values are taken from this date.
name - Name of the car.
seller - Whether the seller is private or a dealer.
offerType - The type of listing.
price - The price on the ad to sell the car.
abtest - Whether the listing is included in an A/B test.
vehicleType - The vehicle Type.
yearOfRegistration - The year in which the car was first registered.
gearbox - The transmission type.
powerPS - The power of the car in PS.
model - The car model name.
kilometer - How many kilometers the car has driven.
monthOfRegistration - The month in which the car was first registered.
fuelType - What type of fuel the car uses.
brand - The brand of the car.
notRepairedDamage - Whether the car has damage that is not yet repaired.
dateCreated - The date on which the eBay listing was created.
nrOfPictures - The number of pictures in the ad.
postalCode - The postal code for the location of the vehicle.
lastSeenOnline - When the crawler saw this ad last online.
import numpy as np
import pandas as pd
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
autos
First, a concise summary of the DataFrame:
autos.info()
autos.head()
We observed a couple of things:
| Column | Non-null values | % missing |
|---|---|---|
| vehicleType | 44905 | 10.19% |
| gearbox | 47320 | 5.36% |
| model | 47242 | 5.52% |
| fuelType | 45518 | 8.96% |
| notRepairedDamage | 40171 | 19.66% |
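The percentages in the table come from comparing each column's non-null count against the 50,000 rows. A minimal sketch of that computation, using a toy DataFrame (real column names, illustrative values):

```python
import pandas as pd

# Toy DataFrame standing in for autos: real column names, illustrative values.
df = pd.DataFrame({
    "vehicleType": ["bus", None, "limousine", None],
    "gearbox": ["manuell", "automatik", None, "manuell"],
})

# Non-null count and percentage of missing values per column,
# mirroring the structure of the table above.
non_null = df.notnull().sum()
pct_missing = (df.isnull().sum() / len(df) * 100).round(2)

summary = pd.DataFrame({"non_null": non_null, "pct_missing": pct_missing})
print(summary)
```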
| Column | Current dtype | Target dtype |
|---|---|---|
| price | object | int64 |
| powerPS | object | int64 |
| odometer | object | int64 |
| dateCreated | object | int64 |
autos.columns
autos.rename(columns = {"yearOfRegistration":"registration_year",
"monthOfRegistration":"registration_month",
"notRepairedDamage":"unrepaired_damage",
"dateCreated":"ad_created",
"dateCrawled":"date_crawled",
"offerType":"offer_type",
"vehicleType":"vehicle_type",
"powerPS":"power_ps",
"nrOfPictures":"nr_ofpictures",
"postalCode":"postal_code",
"fuelType":"fuel_type",
"lastSeen":"last_seen"}, inplace = True)
This renaming was already applied in place in the previous step via the `inplace=True` parameter:
autos.head(2)
Renaming the columns serves two purposes:
autos.describe(include='all').round() # get both categorical and numeric columns
Candidate columns to be dropped:

Columns that need more investigation for further analysis:

With this data we could segment the market by these sectors and obtain deeper insight; however, that is not the objective of this study, so we will remove them.

Some numeric data is stored as text and needs to be cleaned.

Text columns where all or almost all values are the same can often be dropped, as they carry no useful information for analysis.
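One hedged way to spot such near-constant columns is to look at the share of the most frequent value per column. This sketch uses a toy frame and an illustrative 0.85 threshold (in the real data, 'seller' and 'offer_type' behave this way):

```python
import pandas as pd

# Toy DataFrame: 'seller' is almost entirely one value, as in the real data.
df = pd.DataFrame({
    "seller": ["privat"] * 9 + ["gewerblich"],
    "brand": ["bmw", "audi", "ford", "bmw", "opel",
              "audi", "bmw", "ford", "opel", "bmw"],
})

# Share of the most frequent value in each column; values near 1.0
# flag candidates for dropping.
dominance = df.apply(lambda col: col.value_counts(normalize=True).max())
to_drop = dominance[dominance > 0.85].index.tolist()
print(to_drop)  # ['seller']
```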
autos.drop(['nr_ofpictures', 'abtest', 'seller', 'offer_type'], axis=1, inplace=True)
# regex=False treats '$' as a literal character, not a regex end-of-string anchor
autos["price"] = autos["price"].str.replace('$', '', regex=False)
autos["price"] = autos["price"].str.replace(',', '', regex=False)
autos["odometer"] = autos["odometer"].str.replace('km', '', regex=False)
autos["odometer"] = autos["odometer"].str.replace(',', '', regex=False)
autos["price"] = autos["price"].astype(int)
autos["price"].describe().round(2)
autos["odometer"] = autos["odometer"].astype(int)
autos["odometer"].describe().round(2)
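The same cleaning pattern on a toy Series, to make the chain explicit (regex=False keeps '$' literal rather than treating it as a regex anchor):

```python
import pandas as pd

# Toy price strings in the same '$1,234' format as the raw column.
price = pd.Series(["$5,000", "$12,500", "$0"])

price = (price.str.replace("$", "", regex=False)  # strip the currency symbol
              .str.replace(",", "", regex=False)  # strip thousands separators
              .astype(int))                       # now safe to cast to int
print(price.tolist())  # [5000, 12500, 0]
```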
There seem to be outliers in the price column, both at the top and at the bottom.
autos.rename(columns = {"odometer":"odometer_km"}, inplace = True)
autos.rename(columns = {"price":"price_Dollars"}, inplace = True)
autos.head(2) # Just to check the columns.
autos["odometer_km"].unique().shape
autos["price_Dollars"].unique().shape
autos["odometer_km"].value_counts()
autos["price_Dollars"].value_counts().sort_index(ascending = True).head()
autos["odometer_km"].describe()
These cars seem to have a lot of kilometers on them.
The average distance travelled is very high and should be taken into account.
autos["price_Dollars"].describe().round()
We already know the maximum value is 99999999.0 and the minimum is 0. Neither can be true: one is an implausibly large amount, and the other would mean the vehicle is being given away, on a buying-and-selling website.
For that reason, let's look at what happens to the values at the extreme ends of our list.
We will split the list into an upper part and a lower part.
In the upper part we will pick a high value, check whether it makes sense (that is, whether a vehicle with that price actually exists), and filter from there.
autos["price_Dollars"].value_counts().sort_index(ascending = False).head(15)
Let's see which vehicle the value of 265000 corresponds to.
If this price is consistent with the vehicle, we can take it as a reference.
autos[autos["price_Dollars"] == 265000]
It makes sense, so we will use our comparison from this value upwards and see what we find.
https://www.carindigo.com/used-cars/porsche-911-gt3-rs
As there are not too many values to check, it seems sensible to start from this model and see what we can eliminate and what we cannot.
top_List = autos[(autos["price_Dollars"] >= 265000)].sort_index(ascending = False)
top_List
There is too much data to compare properly, so let's pick the columns that matter: the price and all the information we can extract about the vehicle itself.
This should help us judge whether the price fits the class of vehicle it has been assigned to.
top_List[["price_Dollars","brand","model","vehicle_type"]].sort_index(ascending = False)
This view is clearer: some vehicles make sense on this list, while others clearly do not. For example, strange values such as 1234566 for a BMW, 11111111 for a Volkswagen, or a C4 supposedly worth 27322222.
These cases can clearly be removed as errors.
However, there is a group of vehicles labelled 'sonstige_autos' (other cars) that we know nothing about; since they carry significant weight in the list, it is important to clarify whether or not we should remove them.
top_List[top_List["brand"] == "sonstige_autos"]
High-end vehicles with a very high market value appear here; they will have to be reviewed.
| name | price in Dollars | registration_year | source |
|---|---|---|---|
| Ferrari_FXX | 4,000,000 | 2006 | https://www.autoblog.com/2006/06/14/for-sale-2006-ferrari-fxx-slightly-used/?guccounter=1 |
| Rolls_Royce_Phantom_Drophead_Coupe | From 450,000 | 2012 | https://www.cars.com/shopping/rolls_royce-phantom_drophead_coupe-2012/ |
| Ferrari_F40 | 1,959,900 | 1992 | https://www.dupontregistry.com/autos/listing/1992/ferrari/f40/2418434 |
| maserati 3200 gt | Note | | |
Note: regarding the Maserati, there are several things to consider.
The 3200 GT was in production from 1998 to 2001, so the 1960 registration date we have does not match. However, during 1957-1964 the brand produced only one model, the 3500 GT: just 2,222 units between the Coupé and Spider versions.
We should also take into account the description in the name cell:
"Zustand_unwichtig_laufe... // Condition_unimportant_running..."
The person who placed the advert was looking for a Maserati in any condition (running or not), which suggests the model in question is not the newer one but probably the 1960s model.
On top of that, the listed price is 10,000,000 dollars, which is surely wrong: a search for the car's value shows it ranges between 151,290.57 USD and 863,170.50 USD.
So, given that this is an offer from a private individual who does not care about the condition, a price of 100000 would be a bargain for a vehicle like this.
Change the price of the Maserati to $100000.
autos.iloc[11137, 2] = 100000  # row position 11137; column position 2 is price_Dollars
We can now see that the prices we were examining make sense, as they all correspond to high-end vehicles.
Next, select everything other than 'sonstige_autos' and 'porsche':
remove_bad_price_bool = ((top_List["brand"] != 'sonstige_autos') & (top_List["brand"] != 'porsche'))
remove_bad_price_bool.head(3)  # the boolean mask to be used.
bad_cars = top_List[remove_bad_price_bool]  # a DataFrame of the rows to remove from autos
bad_cars
We extract the indices of the rows we want to remove from the main dataframe.
index_to_remove = bad_cars.index
index_to_remove  # indices to drop from the main dataframe autos.
autos.drop(index_to_remove, inplace=True)  # drop all flagged rows in one call
autos.iloc[27371]  # check that the Fiat Panda was removed
autos["price_Dollars"].value_counts().sort_index(ascending= True)
autos[autos["price_Dollars"] == 1]
The values here do not resemble the case we handled at the top of the list, so we will need a different strategy.
Looking at the ads with a value equal to 1, we realize this value is simply a flag for some particularity noted in the name field.
Now that we understand the purpose of these ads, we will remove them, taking into account the maximum price we have in our list.
autos["price_Dollars"].max()
autos = autos[autos["price_Dollars"].between(2,3890000)]
autos["price_Dollars"].describe().round()
| price_Dollars | Before cleaning | After cleaning | % diff |
|---|---|---|---|
| count | 50000.00 | 48412.0 | -3.18% |
| mean | 9840.04 | 6017.0 | -38.85% |
| std | 481104.38 | 20703.0 | -95.7% |
| 25% | 1100.00 | 1250.0 | +13.63% |
| 50% | 2950.00 | 3000.0 | +1.69% |
| 75% | 7200.00 | 7498.0 | +4.13% |
| max | 99999999.00 | 3890000.0 | still implausible |
As we can see, the cleaning process removed only 3.18% of the samples, while the mean price decreased by 38.85%, the interquartile values increased by between 4% and 14%, and the standard deviation also improved dramatically.
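The inclusive range filter used above (`Series.between`) can be sketched on toy values containing both kinds of implausible prices:

```python
import pandas as pd

# Toy prices: a symbolic $1 listing and an absurdly large typo,
# plus three plausible values.
prices = pd.Series([1, 500, 3000, 99999999, 7200])

# Series.between is inclusive on both bounds by default.
kept = prices[prices.between(2, 3890000)]
print(kept.tolist())  # [500, 3000, 7200]
```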
Looking at the name column of the listings priced at, for example, $1, we see that these are not mistakes: they are ads from people who intend to trade their vehicle or to give it away altruistically, so such low values should not be dismissed as mere noise.
There are several columns that represent date values:
- `date_crawled`: When the ad was first crawled. Added by the crawler.
- `last_seen`: When the crawler saw the ad last online. Added by the crawler.
- `ad_created`: The date on which the eBay listing was created. Created by the website.
- `registration_month`: The month in which the car was first registered. Created by the website.
- `registration_year`: The year in which the car was first registered. Created by the website.

autos[['date_crawled','ad_created','last_seen']][0:5]
autos["date_crawled"].describe()
We will extract the date from these object (string) columns by slicing the first 10 characters with the `.str` accessor, and then examine the distributions with `Series.value_counts()`. Assigning the sliced Series back to the column keeps the DataFrame consistent.
autos["date_crawled"] = autos["date_crawled"].str[:10]
autos['date_crawled']
# calculate the distribution of values in percentages instead of counts
autos['date_crawled'].value_counts(normalize=True, dropna=False).describe()
autos['ad_created'].describe()
# extract the date from the string
autos['ad_created']=autos['ad_created'].str[:10]
# calculate the distribution of values in percentages
autos['ad_created'].value_counts(normalize=True, dropna=False).sort_index(ascending=True)
autos['ad_created'].shape
# share of the most frequent creation day, and a check that head(1) returns it
most_common_share = autos['ad_created'].value_counts(normalize=True, dropna=False).max()
autos['ad_created'].value_counts(normalize=True, dropna=False).head(1) == most_common_share
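The date-slicing plus normalized value_counts pattern, sketched on a toy Series of raw timestamp strings:

```python
import pandas as pd

# Toy timestamps in the raw 'YYYY-MM-DD hh:mm:ss' shape of these columns.
ad_created = pd.Series([
    "2016-04-03 11:37:04",
    "2016-04-03 22:46:59",
    "2016-03-12 16:06:22",
    "2016-03-05 14:38:50",
])

# Keep only the first 10 characters (the date), then compute each
# day's share of all ads.
dist = ad_created.str[:10].value_counts(normalize=True, dropna=False)
print(dist)
```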
last_seen

autos['last_seen'].describe()
| Column | Range of dates | Most frequent day |
|---|---|---|
| date_crawled | 2016-03-05 to 2016-04-07 | 2016-04-03 (3.86%) |
| ad_created | 2015-06-11 to 2016-04-07 | 2016-04-03 (3.88%) |
| last_seen | 2016-03-05 to 2016-04-07 | 2016-04-06 (2.21%) |
- `date_crawled`: the dates the ads were crawled. Crawling started in March 2016 and ended in April of the same year.
- `ad_created`: listing creation dates range from 2015-06-11 to 2016-04-07; the most frequent creation day is 2016-04-03 (3.88%).
- `last_seen`: the date when the crawler last saw the ad online. The day when most ads were last seen online is 2016-04-06 (2.21%).

autos['registration_year'].describe()
Given that 'registration_year' is the year in which the car was first registered, we know it cannot be later than 2015, the year the earliest ads were created. As for the earliest acceptable registration year, we will accept any year after 1900.
# remove the rows where the 'registration_year' values are outside the range we have defined
autos = autos[autos['registration_year'].between(1900,2015)]
autos['registration_year'].value_counts(normalize=True)
autos['registration_year'].value_counts(normalize=True, dropna=False)[0:15]
sum(autos['registration_year'].value_counts(normalize=True, dropna=False)[0:15])
In addition, almost 80% of all car registrations occurred between 1997 and 2011. We can conclude from this analysis that, at the time of crawling, the majority of cars on sale on this platform were more than 5 years old.
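The "top-N years cover X% of registrations" computation can be sketched on toy years (positions taken via `.iloc`, since normalized counts are sorted by frequency):

```python
import pandas as pd

# Toy registration years; the real analysis sums the normalized counts
# of the 15 most common years.
years = pd.Series([2005, 2005, 2006, 1999, 2005, 2006, 2011, 1999])

shares = years.value_counts(normalize=True)  # sorted by share, descending
top_share = shares.iloc[0:2].sum()           # share covered by the top 2 years
print(round(top_share, 3))  # 0.625
```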
top_brands = autos['brand'].value_counts(normalize=True, dropna=False).head(20)
top_brands
Next, we create an empty dictionary to hold the aggregate data, loop over the selected brands, and assign each brand's mean price to the dictionary with the brand name as the key. We then print the dictionary and analyze the results.
autos['brand'].unique()
autos['price_Dollars']
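The dictionary-based aggregation described above can be sketched on a toy frame with the cleaned column names (values are illustrative):

```python
import pandas as pd

# Toy listings with the cleaned column names; values are illustrative.
autos_demo = pd.DataFrame({
    "brand": ["bmw", "audi", "bmw", "ford", "audi", "bmw"],
    "price_Dollars": [8000, 9000, 12000, 3000, 7000, 10000],
})

# Loop over the brands and store each brand's mean price in a dict,
# keyed by brand name.
mean_price_by_brand = {}
for brand in autos_demo["brand"].unique():
    brand_prices = autos_demo.loc[autos_demo["brand"] == brand, "price_Dollars"]
    mean_price_by_brand[brand] = brand_prices.mean()

print(mean_price_by_brand)  # {'bmw': 10000.0, 'audi': 8000.0, 'ford': 3000.0}
```

The same result is available via `autos_demo.groupby('brand')['price_Dollars'].mean()`; the explicit loop simply mirrors the dictionary approach described above.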