We'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The aim of this project is to clean the data and analyze the included used car listings.
# import the NumPy and pandas libraries and read the dataset
import numpy as np
import pandas as pd
autos = pd.read_csv("autos.csv", encoding = "Windows-1252")
autos
autos.info()
autos.head()
Quick observations:
- The dataset contains 20 columns, most of which are strings.
- The columns 'vehicleType', 'gearbox', 'model', 'fuelType', and 'notRepairedDamage' have some null values, but none of them has more than ~20% nulls.
- The data in the rows is in German. We will have to translate it.
- The columns 'price' and 'odometer' contain non-numeric characters. We will clean them.
- The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores.
print(autos.columns)
# rename the columns from camelcase to snakecase
autos.rename(columns={
    "dateCrawled": "date_crawled",
    "offerType": "offer_type",
    "vehicleType": "vehicle_type",
    "yearOfRegistration": "registration_year",
    "powerPS": "power_ps",
    "monthOfRegistration": "registration_month",
    "fuelType": "fuel_type",
    "notRepairedDamage": "unrepaired_damage",
    "dateCreated": "ad_created",
    "nrOfPictures": "number_of_pictures",
    "postalCode": "postal_code",
    "lastSeen": "last_seen",
}, inplace=True)
autos.head()
We renamed the columns from camelcase to snakecase to make them more readable.
Now we'll explore the columns to see if we need to clean data.
autos.describe(include='all')
autos["seller"].value_counts()
autos["offer_type"].value_counts()
autos['number_of_pictures'].value_counts()
We have discovered that:
- Almost all values in the columns 'seller' and 'offer_type' are the same, and the column 'number_of_pictures' contains only one value, '0'. Hence we are going to drop them.
- The columns 'price' and 'odometer' contain numeric values stored as text, and the column 'date_crawled' contains a date and time.
- The columns 'vehicle_type', 'gearbox', 'model', 'fuel_type', and 'unrepaired_damage' have missing values.
We'll start by converting the columns 'price' and 'odometer' to a numeric type.
autos["odometer"].head()
autos["odometer"].value_counts()
# removing non-numeric characters
autos['odometer'] = autos['odometer'].astype(str)
autos['odometer'] = (autos['odometer']
                     .str.replace('km', '', regex=False)
                     .str.replace(',', '', regex=False)
                     .astype(int))
autos.rename(columns={'odometer':'odometer_km'}, inplace=True)
autos['price'] = autos['price'].astype(str)
autos['price'] = (autos['price']
                  .str.replace('$', '', regex=False)
                  .str.replace(',', '', regex=False)
                  .astype(int))
autos.rename(columns={'price':"price_dollars"}, inplace=True)
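The string-cleaning pattern above can be sketched on a toy DataFrame (the values here are illustrative, not taken from the dataset). Passing `regex=False` makes pandas treat `$` and `km` as literal text rather than regular expressions:

```python
import pandas as pd

# Toy data mimicking the raw 'odometer' and 'price' strings
raw = pd.DataFrame({
    "odometer": ["150,000km", "70,000km"],
    "price": ["$5,000", "$1,350"],
})

# Strip the unit/currency symbols and thousands separators, then cast to int.
raw["odometer_km"] = (raw["odometer"]
                      .str.replace("km", "", regex=False)
                      .str.replace(",", "", regex=False)
                      .astype(int))
raw["price_dollars"] = (raw["price"]
                        .str.replace("$", "", regex=False)
                        .str.replace(",", "", regex=False)
                        .astype(int))
```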
# analysing the 'odometer_km' column
autos['odometer_km'].unique()
autos['odometer_km'].describe()
autos['odometer_km'].value_counts()
autos["price_dollars"].unique().shape
autos["price_dollars"].describe()
autos["price_dollars"].value_counts().head()
autos["price_dollars"].value_counts().sort_index(ascending=False).head(10)
autos["price_dollars"].value_counts().sort_index(ascending=True).head(10)
autos['price_dollars'].value_counts().sort_index(ascending=True).tail(20)
We can see that the price jumps drastically from 350,000 to 999,990. We are going to remove the rows where the price equals 0 or is higher than $350,000, as such prices don't seem reasonable. However, we'll keep the $1 prices, as they may indicate the starting point for bidding.
#removing outliers
autos = autos[autos['price_dollars'].between(1,350000)]
autos['price_dollars'].value_counts().sort_index(ascending=True).tail(20)
autos['price_dollars'].value_counts().sort_index(ascending=True).head(20)
Now the rows with outliers are gone and the price range seems reasonable.
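The filter relies on `Series.between`, which is inclusive on both ends by default, so the $1 and $350,000 listings survive. A minimal sketch on toy prices:

```python
import pandas as pd

prices = pd.Series([0, 1, 500, 350000, 999990])
# between() keeps values in [1, 350000] inclusive, dropping 0 and 999990
kept = prices[prices.between(1, 350000)]
```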
There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself:
- date_crawled: added by the crawler
- last_seen: added by the crawler
- ad_created: from the website
- registration_month: from the website
- registration_year: from the website

Right now, the date_crawled, last_seen, and ad_created columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively.
Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values, like so:
autos[['date_crawled','ad_created','last_seen']][0:5]
To understand the date range, we can extract just the date values, generate a distribution, and then sort by the index.
print(autos['date_crawled'].str[:10].head())
Exploring date_crawled
autos["date_crawled"].str[:10].value_counts(normalize=True, dropna=False).sort_index()
We can see that the data was crawled from the 3rd of March, 2016 to the 7th of April, 2016.
Exploring the ad_created column
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=True)
After examining these three columns, we can say that the website's ads work well: listings disappear from the site, which likely means the cars were sold.
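The date-distribution pattern used above can be sketched on a few toy timestamps: slice off the first 10 characters to keep only the date, then compute relative frequencies in date order.

```python
import pandas as pd

stamps = pd.Series([
    "2016-03-05 14:06:30",
    "2016-03-05 16:57:10",
    "2016-04-07 03:16:57",
])
# Keep only the date part, then compute relative frequencies sorted by date
dist = stamps.str[:10].value_counts(normalize=True).sort_index()
```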
autos['registration_year'].describe()
This column clearly has outliers: the registration year cannot be higher than 2016, because the data was crawled in 2016, and values below 1950 are suspect. We will only keep the rows with registration_year values between 1950 and 2016, because we don't know for sure whether cars registered earlier are antiques or just data-entry mistakes.
autos['registration_year'].value_counts().sort_index(ascending=True).head(20)
autos['registration_year'].value_counts().sort_index(ascending=False).head(20)
#removing outliers
autos = autos[autos['registration_year'].between(1950,2016)]
autos['registration_year'].value_counts(normalize=True)
We have removed the rows with the outliers. Now we'll have a closer look at the 'brand' column.
autos['brand'].unique()
top_autos = autos["brand"].value_counts().sort_values(ascending=False).head(20).index
top_brands = {}
for c in top_autos:
    selected_rows = autos[autos['brand'] == c]
    mean_price = selected_rows['price_dollars'].mean()
    top_brands[c] = mean_price
top_brands_df = pd.DataFrame.from_dict(top_brands, orient='index')
top_brands_df.sort_values(by=0, ascending=False)
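The per-brand loop above can also be expressed with `groupby`, which produces the same means in one call. A sketch on toy data (brand names and prices here are illustrative):

```python
import pandas as pd

cars = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford"],
    "price_dollars": [8000, 12000, 4000],
})
# Equivalent of the loop: one mean per brand, sorted descending by price
mean_price = (cars.groupby("brand")["price_dollars"]
              .mean()
              .sort_values(ascending=False))
```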
The brands with the highest mean price:
The brands with the lowest mean price:
Now we'll calculate the mean mileage for these brands and for a couple of in-between ones.
mean_mileage = {}
for c in top_autos:
    selected_rows = autos[autos['brand'] == c]
    mileage = selected_rows['odometer_km'].mean()
    mean_mileage[c] = mileage
tb_series = pd.Series(top_brands)
mm_series = pd.Series(mean_mileage)
mean_data = pd.DataFrame(tb_series, columns=["mean_price"])
mean_data['mean_mileage'] = mm_series
mean_data.sort_values(by=["mean_price"], ascending=False)
Data cleaning next steps: translate the German values into English and convert the date columns to a uniform numeric format.
Analysis next steps: find the most common brand/model combinations, check how price varies with mileage, and compare damaged vs. undamaged cars.
autos.head(2)
Now we'll translate the following columns:
autos['seller'].value_counts()
autos['seller'] = autos['seller'].str.replace('privat', "private").str.replace('gewerblich','commercial')
autos['seller'].value_counts()
autos['offer_type'].value_counts()
autos['offer_type'] = autos['offer_type'].str.replace('Angebot','offer')
autos['offer_type'].value_counts()
autos['vehicle_type'].value_counts()
autos['vehicle_type'] = autos['vehicle_type'].str.replace(
'limousine','sedan').str.replace('kleinwagen','small car').str.replace('kombi', 'station wagon').str.replace(
'cabrio','convertible').str.replace('andere','other')
autos['vehicle_type'].value_counts()
autos['gearbox'].value_counts()
autos['gearbox'] = autos['gearbox'].str.replace('manuell','manual').str.replace('automatik','automatic')
autos['gearbox'].value_counts()
autos['model'].value_counts()
autos['model'] = autos['model'].str.replace('andere', 'other')
autos['fuel_type'].value_counts()
autos['fuel_type'] = autos['fuel_type'].str.replace('benzin','gasoline').str.replace(
'elektro','electric').str.replace('andere','other')
autos['fuel_type'].value_counts()
autos['unrepaired_damage'].value_counts()
autos['unrepaired_damage'] = autos['unrepaired_damage'].str.replace('nein', 'no').str.replace('ja', 'yes')
autos['unrepaired_damage'].value_counts()
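An alternative to the chained `str.replace` translations above is `Series.map` with a dictionary. It replaces whole values rather than substrings, so it cannot accidentally rewrite part of another entry; note that values missing from the dictionary become NaN. A minimal sketch:

```python
import pandas as pd

col = pd.Series(["nein", "ja", None])
# map() replaces whole values; anything not in the dict becomes NaN
translated = col.map({"nein": "no", "ja": "yes"})
```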
autos.head(2)
We have successfully translated all the German data into English. Now we will convert the dates to uniform numeric data.
The columns containing dates are 'date_crawled', 'ad_created', and 'last_seen'.
We'll convert the dates to uniform numeric data, so "2016-03-21" becomes the integer 20160321.
autos['date_crawled'] = autos['date_crawled'].str[:10]
autos['date_crawled'] = autos['date_crawled'].str.replace('-','')
autos['date_crawled'] = autos['date_crawled'].astype(int)
autos['date_crawled'].value_counts().head()
autos['ad_created'] = autos['ad_created'].str[:10]
autos['ad_created'] = autos['ad_created'].str.replace('-','')
autos['ad_created'] = autos['ad_created'].astype(int)
autos['ad_created'].value_counts().head()
autos['last_seen'] = autos['last_seen'].str[:10]
autos['last_seen'] = autos['last_seen'].str.replace('-','')
autos['last_seen'] = autos['last_seen'].astype(int)
autos['last_seen'].value_counts().head()
Now that we have converted the text and numeric values, the dataset is clean and looks like this:
autos.head(5)
First we will find the most common brand/model combinations.
brand_counts = autos['brand'].value_counts(normalize=True)*100
top_brand = brand_counts[brand_counts > 5].index
top_brand
brand_model = {}
for c in top_brand:
    a_brand = autos[autos['brand'] == c]
    a_model = a_brand['model'].describe()['top']
    brand_model[c] = a_model
brand_model
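An equivalent way to get the most common model per brand (what `describe()['top']` returns for a string column) is `groupby` with `value_counts().idxmax()`. A sketch on toy data:

```python
import pandas as pd

cars = pd.DataFrame({
    "brand": ["vw", "vw", "vw", "bmw"],
    "model": ["golf", "golf", "polo", "3er"],
})
# value_counts() sorts by count, so idxmax() gives each brand's top model
top_model = cars.groupby("brand")["model"].agg(lambda s: s.value_counts().idxmax())
```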
Now we'll see if the average price follows any pattern based on the mileage.
# we split the 'odometer_km' values into five bins (ranges)
ranges = autos['odometer_km'].value_counts(bins=5).index
ranges
ranges_tuple = ranges.to_tuples()
km_price = {}
#we calculate the mean price for each odometer bin
for r in ranges_tuple:
    by_km = autos[autos['odometer_km'].between(r[0], r[1])]
    price = by_km['price_dollars'].mean()
    km_price[r] = int(price)
km_price
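The bin-and-average loop above can also be written with `pd.cut` plus `groupby`, which assigns each row to a bin and averages in one step. A sketch with toy mileage and prices (two bins for brevity):

```python
import pandas as pd

cars = pd.DataFrame({
    "odometer_km": [10000, 60000, 140000, 150000],
    "price_dollars": [9000, 7000, 3000, 2500],
})
# pd.cut splits the mileage range into equal-width bins;
# grouping by those bins gives the mean price per bin
bins = pd.cut(cars["odometer_km"], bins=2)
mean_by_bin = cars.groupby(bins, observed=True)["price_dollars"].mean()
```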
We can conclude that the higher the mileage, the lower the price.
Now we'll check how much cheaper cars with damage are than their non-damaged counterparts.
damage = autos['unrepaired_damage'].dropna().unique()
damaged_price = {}
for d in damage:
    by_d = autos[autos['unrepaired_damage'] == d]
    price = by_d['price_dollars'].mean()
    damaged_price[d] = price
damaged_price
We can see that the price of cars with unrepaired damage is about 30% lower than that of undamaged cars.
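The damage discount can be computed directly from the two group means. A sketch with toy numbers (not the real dataset's figures):

```python
import pandas as pd

cars = pd.DataFrame({
    "unrepaired_damage": ["no", "no", "yes", "yes"],
    "price_dollars": [10000, 10000, 7000, 7000],
})
means = cars.groupby("unrepaired_damage")["price_dollars"].mean()
# fraction by which damaged cars are cheaper than undamaged ones
discount = 1 - means["yes"] / means["no"]
```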