The aim of this guided project is to clean and analyze used car listings. This notebook contains only the data cleaning process and some initial exploration; the analysis itself is in a separate notebook.
The original dataset was scraped from eBay Kleinanzeigen, a classifieds section of the German eBay site, and is available here. Dataquest prepared a modified version of it for study purposes (making it less clean), and that is the version I'm going to use.
The dataset has 50,000 data points and the data dictionary is shown below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
autos = pd.read_csv("autos.csv", encoding="Latin-1")
autos
autos.info()
autos.describe(include="all")
There are 50,000 rows and 20 columns in the dataset. Most columns are strings.
Column names are in camelCase, which makes them a bit hard to read.
Some columns have only a few unique values and might be unnecessary for the analysis.
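The renaming below uses a hand-written list of labels. As a side note, simple camelCase names could also be converted mechanically; a minimal sketch (the example names follow the style of the original dataset's columns, and note that renames which also reorder words, like the registration columns, can't be produced this way):

```python
import re

def camel_to_snake(name):
    # Insert an underscore before every uppercase letter (except at the
    # start of the string), then lowercase everything.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

# Example names in the style of the original dataset's columns:
print(camel_to_snake("dateCrawled"))  # date_crawled
print(camel_to_snake("offerType"))    # offer_type
```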
autos.columns
new_labels = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
              'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
              'odometer', 'registration_month', 'fuel_type', 'brand',
              'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
              'last_seen']
autos.columns = new_labels
autos.head()
print(autos["seller"].value_counts(), "\n")
print(autos["offer_type"].value_counts(), "\n")
print(autos["abtest"].value_counts(), "\n")
print(autos["gearbox"].value_counts(), "\n")
print(autos["unrepaired_damage"].value_counts(), "\n")
print(autos["nr_of_pictures"].value_counts())
The gearbox and unrepaired_damage columns seem to be important, but I don't need the abtest and postal_code columns for this analysis, and the seller, offer_type, and nr_of_pictures columns don't contain any meaningful information, so they can be safely dropped:
autos.drop(["abtest", "seller", "offer_type", "nr_of_pictures", "postal_code"], axis=1, inplace=True)
autos
Converting the price and odometer columns to numeric values
autos["price"].unique()
autos["odometer"].unique()
autos["price"] = (autos["price"]
                  .str.replace("$", "", regex=False)  # strip literal "$" (regex=False: "$" is a regex metacharacter)
                  .str.replace(",", "", regex=False)
                  .astype(int)
                  )
autos["price"].head()
autos["odometer"] = (autos["odometer"]
                     .str.replace("km", "", regex=False)
                     .str.replace(",", "", regex=False)
                     .astype(int)
                     )
autos.rename({"odometer" : "odometer_km"}, axis=1, inplace=True)
autos["odometer_km"].head()
autos["odometer_km"].unique()
autos["odometer_km"].describe()
autos["odometer_km"].value_counts().sort_index()
plt.figure(figsize=(10,5))
autos["odometer_km"].value_counts().sort_index().plot.bar(title = "Odometer distribution")
plt.xlabel("kilometers")
plt.ylabel("frequency")
plt.show()
The values in this field are rounded, which suggests that sellers had to choose from pre-set options. Additionally, the distribution of kilometers is left-skewed: there are more high-mileage than low-mileage vehicles.
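The left skew can also be checked numerically with pandas' skew(). A minimal sketch on toy data shaped like the odometer column (the notebook itself would call autos["odometer_km"].skew()):

```python
import pandas as pd

# Toy mileage values mimicking the real column: most cars at high mileage,
# with a thin tail of low-mileage vehicles.
toy_km = pd.Series([150000] * 30 + [125000] * 5 + [10000, 20000, 30000])

# A negative skewness coefficient confirms the left (negative) skew.
print(toy_km.skew())
```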
autos["price"].unique()
autos["price"].unique().shape
autos["price"].describe()
autos["price"].value_counts().sort_index(ascending=False).head(20)
autos["price"].value_counts().sort_index(ascending=True).head(20)
plt.figure(figsize=(10,5))
autos["price"].plot.hist(title = "Price distribution")  # histogram of prices themselves, not of their counts
plt.xlabel("price")
plt.show()
There are 1,421 cars listed at 0 dollars and 156 cars listed at 1 dollar. Considering that eBay is an auction-like site (it's like an auction, but transactions are not binding, i.e. either buyer or seller can walk away from the sale without penalty), it's possible that some bids start at 0 dollars.
The maximum price is one hundred million dollars, which seems like a lot. Prices increase steadily up to 350,000 dollars and then jump to less realistic numbers (after a bit of browsing, I found that the most expensive car sold on eBay was around 350,000 dollars as well), so I'm removing all listings above this price.
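The 350,000 cutoff comes from browsing comparable sales. A purely statistical alternative (just a sketch, not what this notebook applies) is the conventional 1.5 × IQR fence, shown here on toy data in place of autos["price"]:

```python
import pandas as pd

# Toy prices with one implausible listing; the notebook would use autos["price"].
prices = pd.Series([500, 1200, 2500, 4000, 7500, 9000, 15000, 99999999])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)  # conventional outlier threshold

filtered = prices[prices <= upper_fence]
print(filtered.max())  # the implausible listing is gone
```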
autos = autos[autos["price"].between(0, 351000)]
autos["price"].describe()
There are 5 columns representing date values. Some of these columns were created by the crawler, some came from the website itself:
date_crawled: added by the crawler
last_seen: added by the crawler
ad_created: from the website
registration_month: from the website
registration_year: from the website
The date_crawled, last_seen, and ad_created columns are identified as string values by pandas; the registration_month and registration_year columns are integers.
autos[['date_crawled','ad_created','last_seen', 'registration_month', 'registration_year']][0:5]
autos["date_crawled"].str[:10].value_counts(normalize=True, dropna=False).sort_index()
autos["date_crawled"].str[:10].value_counts(normalize=True, dropna=False).sort_values()
plt.figure(figsize=(10,6))
autos["date_crawled"].str[:10].value_counts().sort_index().plot.barh(title="Distribution of crawling date")
plt.xlabel("frequency")
plt.ylabel("date")
plt.show()
The website was crawled over a period of about a month, from March 5 to April 7, 2016. The distribution of listings crawled on each day is roughly uniform.
autos["last_seen"].str[:10].value_counts(normalize=True, dropna=False).sort_index()
plt.figure(figsize=(10,5))
autos["last_seen"].str[:10].value_counts().sort_index().plot.bar(title="Distribution of date last seen by the crawler")
plt.xlabel("date")
plt.ylabel("frequency")
plt.show()
The last_seen column shows the last time the crawler saw a listing, which likely indicates the date the vehicle was sold. The counts for the last 3 days are very high, which reflects the end of the crawling period rather than a sudden spike in sales.
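Since last_seen approximates the sale date, the gap between date_crawled and last_seen estimates how long a listing stayed live. A sketch on two toy timestamps in the dataset's string format (the notebook would apply this to the real columns):

```python
import pandas as pd

# Toy timestamps in the same "YYYY-MM-DD HH:MM:SS" format as the dataset.
toy = pd.DataFrame({
    "date_crawled": ["2016-03-05 14:06:30", "2016-03-12 09:00:00"],
    "last_seen":    ["2016-03-17 17:40:17", "2016-04-07 06:45:59"],
})

# Parse both string columns and take the difference in whole days.
days_live = (pd.to_datetime(toy["last_seen"])
             - pd.to_datetime(toy["date_crawled"])).dt.days
print(days_live.tolist())
```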
print(autos["ad_created"].str[:10].unique().shape)
autos["ad_created"].str[:10].value_counts(normalize=True, dropna=False).sort_index()
plt.figure(figsize=(10,10))
autos["ad_created"].str[:10].value_counts().sort_index().plot.barh(title="Distribution of ad created date")
plt.xlabel("frequency")
plt.ylabel("date")
plt.show()
Ad creation dates vary widely. Most fall within 1-2 months of the crawl date, but a few are quite old, with the oldest around 9 months earlier.
autos["registration_year"].describe()
The year the car was first registered probably indicates the age of the car. The minimum value is 1000, long before cars were invented, and the maximum is 9999, many years in the future.
The first car was invented in 1885, so any registration year before 1885 is inaccurate. Also, a car can't be registered after being seen in a listing, so registration years above 2016 (and some during 2016) are inaccurate as well.
autos["registration_year"].value_counts(normalize=True).sort_index().head(20)
autos["registration_year"].value_counts(normalize=True).sort_index().tail(20)
wrong_date = len(autos[autos["registration_year"] >= 2016]) + len(autos[autos["registration_year"] < 1885])
wrong_date_percentage = wrong_date / len(autos) * 100
print(wrong_date_percentage)
About 6.6% of the data has wrong dates. I'll drop all rows with a registration year before 1885 or after 2018. Most of the inaccurate registrations fall in 2016-2018, so I'll keep those for now for further examination.
index_to_drop = autos[autos["registration_year"] > 2018].index
index_to_drop2 = autos[autos["registration_year"] < 1885].index
autos.drop(index_to_drop, inplace=True, axis=0)
autos.drop(index_to_drop2, inplace=True, axis=0)
autos["registration_year"].value_counts().sort_index()
autos["vehicle_type"].value_counts()
mapping_dict = {
    "limousine" : "sedan",
    "kleinwagen" : "small car",
    "kombi" : "station wagon",
    "bus" : "bus",
    "cabrio" : "convertible",
    "coupe" : "coupe",
    "suv" : "suv",
    "andere" : "other"
}
autos["vehicle_type"] = autos["vehicle_type"].replace(mapping_dict)
autos["vehicle_type"].value_counts()
autos["gearbox"].value_counts()
mapping_dict = {
    "manuell" : "manual",
    "automatik" : "automatic"
}
autos["gearbox"] = autos["gearbox"].replace(mapping_dict)
autos["gearbox"].value_counts()
autos["fuel_type"].value_counts()
mapping_dict = {
    "benzin" : "gasoline",
    "diesel" : "diesel",
    "lpg" : "lpg",
    "cng" : "cng",
    "hybrid" : "hybrid",
    "andere" : "other",
    "elektro" : "electric"
}
autos["fuel_type"] = autos["fuel_type"].replace(mapping_dict)
autos["fuel_type"].value_counts()
autos["unrepaired_damage"].value_counts()
mapping_dict = {
    "nein" : "no",
    "ja" : "yes"
}
autos["unrepaired_damage"] = autos["unrepaired_damage"].replace(mapping_dict)
autos["unrepaired_damage"].value_counts()
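The translation steps above each call replace on one column. They could equally be collapsed into a single call with a nested column-to-mapping dict; a sketch on toy data standing in for the real frame:

```python
import pandas as pd

# Toy frame holding the German values seen in the dataset.
toy = pd.DataFrame({
    "gearbox": ["manuell", "automatik"],
    "unrepaired_damage": ["nein", "ja"],
})

# One nested dict, one replace call, all columns translated at once.
translations = {
    "gearbox": {"manuell": "manual", "automatik": "automatic"},
    "unrepaired_damage": {"nein": "no", "ja": "yes"},
}
toy = toy.replace(translations)
print(toy)
```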
autos["name"].value_counts()
autos[["model","vehicle_type", "gearbox", "fuel_type", "unrepaired_damage"]] = autos[["model","vehicle_type", "gearbox", "fuel_type", "unrepaired_damage"]].fillna("unknown")
autos.info()
autos
autos.to_csv("ebaycars_cleaned.csv", index=False)  # index=False avoids writing the row index as an extra column
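As a final sanity check, the saved CSV can be read back to confirm the round trip preserves the frame's shape and columns. A sketch using a toy frame and an in-memory buffer rather than the real file (the notebook would reload ebaycars_cleaned.csv); passing index=False keeps the row index out of the output:

```python
import io

import pandas as pd

# Toy stand-in for the cleaned frame; the notebook would use autos.
cleaned = pd.DataFrame({"price": [5000, 8990], "odometer_km": [150000, 70000]})

# Round-trip through an in-memory buffer instead of a file on disk.
buf = io.StringIO()
cleaned.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)

print(reloaded.shape, list(reloaded.columns))
```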