We are working with a German eBay used-car dataset: https://data.world/data-society/used-cars-data.
import pandas as pd
import numpy as np
import re
#the file isn't valid UTF-8, so read it with the Windows-1252 encoding
df = pd.read_csv("autos.csv", encoding='Windows-1252')
df.info()
#share of non-null values in notRepairedDamage (column 15): ~81%, so ~19% missing
299468/371528
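Rather than hard-coding counts from info(), the null share of every column can be computed directly; a minimal sketch:

#fraction of null values per column, largest first
df.isnull().mean().sort_values(ascending=False)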
The column names need to be cleaned up:
#let's start the cleanup
to_clean = df.columns
corrections = {
"dateCrawled":"date_crawled",
"name":"name",
"seller":"seller",
"offerType":"offer_type",
"price":"price",
"abtest":"ab_test",
"vehicleType":"vehicle_type",
"yearOfRegistration":"registration_year",
"gearbox":"gearbox",
"powerPS":"power_ps",
"model":"model",
"odometer":"odometer",
"monthOfRegistration":"registration_month",
"fuelType":"fuel_type",
"brand":"brand",
"notRepairedDamage":"unrepaired_damage",
"dateCreated":"ad_created",
"nrOfPictures":"num_of_pictures",
"postalCode":"postal_code",
"lastSeen":"last_seen"
}
cleaned_col = to_clean.map(corrections)
cleaned_col
#looks good, let's replace the original df columns
df.columns = cleaned_col
df.info()
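As an aside, since re is already imported, the purely mechanical camelCase-to-snake_case renames could be generated with a small helper; a sketch (to_snake_case is illustrative, and the explicit dict above is still needed for non-mechanical renames like yearOfRegistration -> registration_year or powerPS -> power_ps):

def to_snake_case(name):
    #insert an underscore before each interior uppercase letter, then lowercase
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

to_snake_case('offerType')  # 'offer_type'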
Let's take a quick look at the data.
df.head()
df.describe(include='all')
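describe() shows a few columns dominated by a single value (seller, offer_type, num_of_pictures). The same check can be done programmatically; a sketch:

#share of rows taken up by each column's most frequent value
top_share = df.apply(lambda s: s.value_counts(normalize=True, dropna=False).iloc[0])
top_share.sort_values(ascending=False).head()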
#Drop columns that are overwhelmingly one value
to_drop = ['seller', 'offer_type', 'num_of_pictures']
df = df.drop(to_drop, axis=1)
#Now let's convert price and odometer to numeric
#remove non-numeric characters and then convert type
to_numeric = ['price','odometer']
for c in to_numeric:
    df[c] = df[c].str.replace(r'\D+', '', regex=True).astype(int)
df.info()
#let's rename odometer to note the km units we just stripped out
#(in hindsight, doing this rename before the cleanup would have been cleaner)
df.rename(columns={'odometer':'odometer_km'}, inplace=True)
df.price.describe()
#that scientific notation isn't helpful. let's take another look
df.price.describe().apply(lambda x: format(x, '.2f'))
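Alternatively, scientific notation can be turned off for all subsequent output through pandas' display options; a sketch:

#render floats with two decimals everywhere (pd.reset_option('display.float_format') undoes it)
pd.set_option('display.float_format', lambda x: f'{x:.2f}')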
df.price.value_counts(normalize=True)
df.price.value_counts().sort_index(ascending=False).head(20)
(Side note: about 10% of power_ps values are zero.)
df.price.value_counts().sort_index(ascending=True).head(20)
Let's remove outliers from price. Zeros should probably be removed, as they were likely unsold listings without any bids, and a few cars are listed above a million with prices that don't look real (e.g. 12345678).
df = df[df.price.between(1,350000)]
df.price.describe()
Let's see if we can find issues with any other numeric columns.
df.odometer_km.value_counts()
The odometer_km series looks fine, with no outlier problems. The datetime data might need some help, though: date_crawled, last_seen, and ad_created are stored as objects (strings).
df[["date_crawled",'ad_created','last_seen']].head()
#let's examine each of these a little further, starting with date_crawled
df.date_crawled.str[:10].value_counts(normalize=True, dropna=False).sort_index()
It looks like the crawling happened fairly consistently over roughly a month.
df.last_seen.str[:10].value_counts(normalize=True, dropna=False).sort_index()
The last_seen dates skew toward the end of the crawl window, probably because many cars hadn't yet sold or had their listings removed.
#keep only creation dates that appear more than twice, to trim the long tail
df.ad_created.str[:10].value_counts(dropna=False).loc[lambda x: x > 2].sort_index()
It looks like most of these ads were created recently, within the crawl window, with a few ads enduring for longer than a month.
Finally, we should drop the time portion from this column, as it provides no information (it is always 00:00:00).
df['ad_created'] = df['ad_created'].str[:10]
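If we wanted real datetime dtypes instead of string slices, pd.to_datetime handles these columns; a minimal sketch (we stick with strings here for simplicity):

#convert the three date columns from strings to datetime64
for col in ['date_crawled', 'ad_created', 'last_seen']:
    df[col] = pd.to_datetime(df[col])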
#let's look at registration year
df.registration_year.describe()
The min and max certainly don't make any sense: the data was crawled in 2016, so anything after 2016 or 2017 is impossible. The 25th-75th percentile range looks reasonable, though.
#fraction of rows whose registration_year falls outside 1950-2017
1 - (df.registration_year.between(1950, 2017).sum() / df.shape[0])
We should drop this ~1% of cars whose registration year is not between 1950 and 2017. I included 2017 since manufacturers do sell some cars half a year ahead of time (e.g. buying a 2018 model at the end of 2017).
Pre-clean row count: 48565.
df = df[df.registration_year.between(1950,2017)]
df.registration_year.value_counts(normalize=True).sort_index().tail(30)
df.registration_year.describe()
That looks better. It seems the majority of cars are less than 20 years old, and we were also correct to include 2017.
df.brand.value_counts()
brand_counts = df.brand.value_counts(normalize=True)
common_brands = brand_counts[brand_counts > .01]
common_brands.shape
Here I was looking at whether we should remove some of the less popular brands, but the number of brands (40) is not too great, and retaining all of them should make the analysis more interesting.
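For reference, had we decided the other way, restricting the frame to the common brands would be a one-liner; a sketch (df_common is illustrative):

#keep only rows whose brand appears in more than 1% of listings
df_common = df[df.brand.isin(common_brands.index)]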
Here we can take a look at price and mileage by brand.
brand_price = df.groupby('brand')['price']
brand_price_describe = brand_price.describe().rename(columns={'mean':'mean_price'})
brand_price_describe.head()
#keep just the mean price per brand
brand_info = brand_price_describe[['mean_price']].copy()
brand_info.head()
brand_mileage = df.groupby('brand')['odometer_km']
brand_mileage_describe = brand_mileage.describe()
brand_info['mean_mileage'] = brand_mileage_describe['mean']
brand_info = brand_info.sort_values(by=['mean_price','mean_mileage'], ascending=False)
brand_info
brand_info['mileage_per_price'] = brand_info['mean_mileage'] / brand_info['mean_price']
brand_info.sort_values(by=['mileage_per_price'], ascending=False)
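The same table could also be built in one step with named aggregation instead of two separate describe() calls; a sketch (brand_info_alt is illustrative):

brand_info_alt = (
    df.groupby('brand')
      .agg(mean_price=('price', 'mean'),
           mean_mileage=('odometer_km', 'mean'))
      .sort_values('mean_price', ascending=False)
)
brand_info_alt['mileage_per_price'] = brand_info_alt['mean_mileage'] / brand_info_alt['mean_price']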
brand_model = df.groupby('brand', dropna=False)['model'].value_counts()
print(brand_model.nlargest(20))
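To reduce that to the single most common model per brand, a sketch (assumes every brand has at least one non-null model):

#most frequent model within each brand
top_model = df.groupby('brand')['model'].agg(lambda s: s.value_counts().idxmax())
top_model.head()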
#'ja' means yes (has unrepaired damage), 'nein' means no
damaged = df.unrepaired_damage == 'ja'
undamaged = df.unrepaired_damage == 'nein'
df[undamaged].price.describe()
df[damaged].price.describe()
Cars with unrepaired damage appear to be worth roughly 70% less than undamaged ones.
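That figure can be checked directly from the group means; a sketch:

means = df.groupby('unrepaired_damage')['price'].mean()
discount = 1 - means['ja'] / means['nein']
print(f"Cars with unrepaired damage sell for about {discount:.0%} less on average.")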