This analysis will make use of Pandas.
The dataset contains used car listings scraped from eBay Kleinanzeigen, a classifieds section of the German eBay site.
# Mount Google Drive.
from google.colab import drive
drive.mount("/content/gdrive", force_remount=True)
#drive.flush_and_unmount()
# Find out where the data is.
!ls "gdrive/MyDrive/Colab Notebooks/gp3_autos/"
import pandas as pd
import numpy as np
autos = pd.read_csv('gdrive/MyDrive/Colab Notebooks/gp3_autos/autos.csv', encoding="Latin-1")
autos.info()
autos.head()
autos["name"].value_counts()
autos["offerType"].value_counts()
NaN values in the string columns are problematic.
Datetime columns are stored as plain strings rather than datetime values.
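A quick way to quantify the missing values before cleaning, for example:
# Count NaN values in each column to see which string columns are affected.
print(autos.isnull().sum().sort_values(ascending=False))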
autos.dtypes
autos.describe(include='all')
autos["seller"].value_counts()
autos["offerType"].value_counts()
autos["nrOfPictures"].value_counts()
autos["vehicleType"].iloc[0:22]
autos['gearbox'].value_counts()
autos['brand'].value_counts()
The following steps will be done in the cleaning process:
Some column names use camelCase and are generally hard to read, so they will be renamed to snake_case.
# Keep a copy of the old column names for reference if needed.
columns_old = autos.columns.copy()
print(columns_old)
columns_new = ['date_crawled',        # to snake_case
               'name',                # no change
               'seller',              # no change
               'offer_type',          # to snake_case
               'price',               # no change
               'ab_test',             # to snake_case
               'vehicle_type',        # to snake_case
               'registration_year',   # clarity
               'gearbox',             # no change
               'power_ps',            # to snake_case
               'model',               # no change
               'odometer_km',         # more accurate
               'registration_month',  # clarity
               'fuel_type',           # to snake_case
               'brand',               # no change
               'unrepaired_damage',   # clarity
               'ad_created',          # to snake_case
               'num_pictures',        # clarity and to snake_case
               'postcode',            # common usage
               'last_seen']           # to snake_case
# Assign a copy of the new names (keeping columns_new intact for reference).
autos.columns = columns_new.copy()
autos.columns
"num_pictures", "seller" and "offer_type" should be dropped.
autos = autos.drop(columns=["num_pictures", "seller", "offer_type"], axis=1)
autos.columns
autos["registration_year"].describe()
Confirm there are years well out of range.
autos["registration_year"].value_counts().head()
Confirm our desired values are the dominant values.
# initial_year_distribution = autos["registration_year"].value_counts().sort_index()
# initial_year_distribution.plot()
Plot the initial distribution, which we will crop to an approximately normal range.
range_bool = autos["registration_year"].between(1980,2020)
final_distribution = autos[range_bool]
final_distribution["registration_year"].value_counts().sort_index().plot()
Cars registered between 1980 and 2020 create a reasonable bell-curve-shaped distribution with a few interesting features.
autos["registration_year"] = final_distribution["registration_year"]
Finally, write the filtered values back to the main dataset; rows with out-of-range years become NaN.
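Because the filtered frame is only a subset, we can count how many rows fell outside the accepted range, for example:
# Rows with out-of-range registration years were not carried over and are now NaN.
print(autos["registration_year"].isnull().sum())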
autos["price"].describe()
Confirm max and min values are outliers.
autos["price"].value_counts().sort_index().plot()
Plot the initial price distribution, which we will crop to an approximately normal range.
price_bool = autos["price"].between(1,45000)
final_price_distribution = autos[price_bool]
final_price_distribution["price"].value_counts().sort_index().plot()
Not sure why this isn't charting properly; the heavy skew towards low prices may be the cause.
autos["price"] = final_price_distribution["price"]
Write the new slice back to the main dataset anyway.
autos["price"].describe()
Confirm we have a good distribution even if we can't chart it properly.
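If the value_counts line plot is hard to read, a histogram of the cropped prices may chart more clearly; a quick sketch:
# Histogram of the cropped price range; increase bins for finer detail.
final_price_distribution["price"].plot.hist(bins=50)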
There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary:
date_crawled : added by the crawler : string
last_seen : added by the crawler : string
ad_created : from the website : string
registration_month : from the website : int64
registration_year : from the website : float
print(autos["date_crawled"].dtypes)
print(autos["last_seen"].dtypes)
print(autos["ad_created"].dtypes)
print(autos["registration_month"].dtypes)
print(autos["registration_year"].dtypes)
autos[['date_crawled','ad_created','last_seen']][0:5]
print(autos['date_crawled'].str[:10].value_counts(dropna=False).sort_index())
It is clear the crawl covered quite a tight date range of roughly one month.
This column should be cast as datetime.
autos['date_crawled'] = pd.to_datetime(autos['date_crawled'])
print(autos["date_crawled"].dtype)
print(autos['ad_created'].str[:10].value_counts(dropna=False).sort_index())
"ad_created" has values outside the range of "date_crawled".
Investigate if these rows are valid.
bool_2015 = autos["ad_created"].str[:4] == "2015"
slice_2015 = autos[bool_2015]
slice_2015.head()
These appear to be adverts that were created long before the crawl but were still live, and the rows are otherwise valid.
Cast as datetime.
autos['ad_created'] = pd.to_datetime(autos['ad_created'])
print(autos["ad_created"].dtype)
print(autos['last_seen'].str[:10].value_counts(dropna=False).sort_index())
"last_seen" has exactly the same date range as "date_crawled" suggesting this was done as part of the crawl criteria.
Cast as datetime.
autos['last_seen'] = pd.to_datetime(autos['last_seen'])
print(autos["last_seen"].dtype)
autos["registration_month"].describe()
A month value of "0" is not very helpful.
If this is the month of registration of the vehicle, it's hard to imagine any analysis would use this.
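A quick count of the zero placeholder, for example:
# Number of listings where the registration month is recorded as 0 (unknown).
print((autos["registration_month"] == 0).sum())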
Drop "registration_month"
autos = autos.drop(columns=["registration_month"], axis=1)
autos.columns
One fruitful area of analysis will be to look at brands and see whether brand correlates with sale price, listing frequency, and time spent listed on eBay.
To start, it's necessary to explore this series in more detail.
autos["brand"].value_counts(dropna=False, normalize=True).head(16)
The series is very clean, with no null values and no duplicate brands from spelling mistakes; the values were probably selected from a drop-down rather than typed by the user.
Many of the brands are rarely sold, making up less than 1% of listings.
"sonstige_autos" means "other cars" so we can eliminate that because it's vague.
Therefore, we will only aggregate the top 16 most frequently listed brands.
mean_price_agg = {}
top_16_index = autos["brand"].value_counts(dropna=False, normalize=True).head(16)
for brand in top_16_index.index:
    brand_slice = autos[autos["brand"] == brand]
    agg_mean = brand_slice["price"].mean()
    mean_price_agg[brand] = round(agg_mean)
print(mean_price_agg)
mean_price_series = pd.Series(mean_price_agg)
mean_price_df = pd.DataFrame(mean_price_series, columns=["mean_price"])
mean_price_df = mean_price_df.sort_values(by=["mean_price"], ascending=False)
mean_price_df
mean_price_df.sort_values(by=["mean_price"], ascending=True).plot.barh()
This aggregation shows what you would expect for vehicle depreciation.
German cars cost more, and hold their value much better than French cars.
Further aggregation analysis could look at the relationship between price and mileage:
Audi, BMW, and Mercedes are luxury brands, so a high price is expected, but Skoda, Volkswagen and Toyota are more affordable brands whose prices nonetheless remain high.
This is especially true for Skoda, which appears to be in the wrong grouping.
It's possible that the prestige cars are higher mileage, or that Skodas hold their value only when their mileage is low.
To prove this hypothesis, we can construct a dataframe that allows us to visualise price and odometer values together.
We can create a new aggregation dictionary, then construct and join dataframes.
First, we should investigate this series, since we haven't examined it yet.
autos["odometer_km"].describe()
autos["odometer_km"].value_counts()
"odometer_km" seems nearly categorical it's so perfect.
mean_odo_agg = {}
top_16_index = autos["brand"].value_counts(dropna=False, normalize=True).head(16)
for brand in top_16_index.index:
    brand_slice = autos[autos["brand"] == brand]
    mean_odo = brand_slice["odometer_km"].mean()
    # Divide by 10 so the odometer values sit on a scale comparable to the
    # mean prices when both are plotted together below.
    mean_odo_agg[brand] = round(mean_odo / 10)
print(mean_odo_agg)
mean_odo_series = pd.Series(mean_odo_agg)
mean_odo_df = pd.DataFrame(mean_odo_series, columns=["mean_odometer"])
mean_odo_df.sort_index()
mean_price_df.sort_index()
join_them = [mean_odo_df, mean_price_df]
mean_odo_price_df = pd.concat(join_them, axis=1)
mean_odo_price_df.sort_values(by="mean_price", ascending=False)
import matplotlib.pyplot as plt
mean_odo_price_df.sort_values(by="mean_price", ascending=False).head(6).plot(kind='bar', stacked=False)
As expected, the cheaper brands in this group tend to have lower mean "odometer_km" values, which supports the idea that affordable brands such as Skoda and Toyota only command high prices at lower mileage. In other words, high values for "odometer_km" suppress "price", so there is an inverse relationship between mileage and price at the level of individual cars.
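To check the direction of the relationship across individual listings rather than brand means, one could compute the correlation directly; a sketch using the cleaned columns:
# Pearson correlation between price and mileage across listings (NaN rows are ignored).
print(autos[["price", "odometer_km"]].corr())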