The dataset was scraped from eBay Kleinanzeigen, a classifieds section of the German eBay website.
Data dictionary:
# import pandas and numpy and read in our autos dataset
import numpy as np
import pandas as pd
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
autos
autos.info()
autos.isnull().sum()
After reading in the dataset, we can see that there are some null (NaN) values present that we may have to delete or keep depending on our goal.
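If we later decide to drop or fill them, a minimal sketch could look like the code below (gearbox and model are used purely as example columns here, not a final decision):
# illustrative only: drop rows that are missing a value we care about...
autos_no_na = autos.dropna(subset=['gearbox', 'model'])
# ...or keep every row and fill the gaps with a placeholder instead
autos_filled = autos.fillna({'gearbox': 'unknown', 'model': 'unknown'})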
# change the column names from camelCase to snake_case
autos.columns
new_columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price',
               'ab_test', 'vehicle_type', 'registration_year', 'gearbox',
               'power_in_ps', 'model', 'odometer_km', 'registration_month',
               'fuel_type', 'brand', 'unrepaired_damage', 'ad_created',
               'num_pictures', 'postal_code', 'last_seen']
autos.columns = new_columns
Above I copied the column names of the autos dataset, manually converted them from camelCase to snake_case (adjusting a few names for clarity along the way), and then reassigned the new names to the DataFrame.
autos.describe(include = 'all').head()
From exploring the data more closely, we can see that the price and odometer columns are stored as text, so we need to convert them to numeric values:
# converting the price and odometer columns to numeric
autos['price'] = (autos['price']
                  .str.replace('$', '', regex=False)
                  .str.replace(',', '', regex=False)
                  .astype(float)
                  )
autos['odometer_km'] = (autos['odometer_km']
                        .str.replace('km', '', regex=False)
                        .str.replace(',', '', regex=False)
                        .astype(float)
                        )
autos.head()
# start analyzing the odometer column to identify special cases
autos['odometer_km'].unique().shape
# we have 13 unique odometer readings
autos['odometer_km'].describe()
# count how many listings there are at each odometer reading
print(autos['odometer_km']
.value_counts()
.sort_index(ascending = False)
)
By analyzing the odometer column, we can see that all the values seem reasonable: the minimum reading is 5,000 km and the maximum is 150,000 km.
# now we analyze the price column
autos['price'].unique().shape
# we have 2,357 unique prices
autos['price'].describe()
print(autos['price']
.value_counts()
.sort_index(ascending = False)
.head(10)
)
print(autos['price']
.value_counts()
.sort_index(ascending = False)
.tail(10)
)
While I am tempted to count the price of zero as an outlier, I'm not sure if I should: 1,421 cars were listed at that value, and it's unclear whether they are genuinely free or just bad data. The same goes for the maximum values; I would count the top two as outliers, because who spends upwards of 100 million on eBay? If we want to analyze 'reasonably priced, non-free cars', we can eliminate the free cars and the listings in the millions.
autos_non_free = (autos[autos['price']
                  .between(0, 1000000, inclusive='neither')]
                  )
print(autos_non_free['price']
.value_counts()
.sort_index(ascending = False)
.head()
)
print(autos_non_free['price']
.value_counts()
.sort_index(ascending = False)
.tail()
)
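As a quick sanity check (not part of the main analysis), we can also see how many listings the price filter removed and what the trimmed price distribution looks like:
# how many listings the price filter dropped, and the cleaned distribution
print(autos.shape[0] - autos_non_free.shape[0], 'listings removed')
print(autos_non_free['price'].describe())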
Right now, the date_crawled, last_seen and ad_created columns are all identified as string (object) values by pandas. We need to convert them to numerical data so we can analyze them quantitatively, just like registration_year and registration_month.
print(autos['date_crawled'].describe())
print('\n')
print(autos['last_seen'].describe())
print('\n')
print(autos['ad_created'].describe())
Notice how the data is presented as a full timestamp. The first 10 characters are the year, month and day. We should extract just the date and compute the relative frequency of each date, sorted from earliest to latest.
(autos['date_crawled']
.str[:10]
.value_counts(normalize = True, dropna = False)
.sort_index()
)
(autos['ad_created']
.str[:10]
.value_counts(normalize = True, dropna = False)
.sort_index()
)
(autos['last_seen']
.str[:10]
.value_counts(normalize = True, dropna = False)
.sort_index()
)
After analyzing these columns, we can see that the date_crawled and last_seen columns only contain dates from 2016, while the ad_created column contains dates from both 2015 and 2016 and has more unique values.
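If we wanted to treat these columns as real dates rather than strings, one option (not used for the rest of this notebook) is pandas' to_datetime, which also gives a quick way to confirm the date ranges described above:
# optional: parse the string timestamps and print each column's date range
for col in ['date_crawled', 'ad_created', 'last_seen']:
    dates = pd.to_datetime(autos[col])
    print(col, dates.min(), dates.max())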
autos['registration_year'].describe()
From exploring the registration_year column, we can see that the minimum and maximum registration years are impossible, so we have to clean those values.
print(autos['registration_year']
.value_counts(dropna = False)
.sort_index()
.head(8)
)
print(autos['registration_year']
.value_counts(dropna = False)
.sort_index()
.tail(15)
)
Comparing what we see above with the earliest realistic registration year of 1901, we can eliminate some values. Likewise, since the last_seen dates are all from 2016, it's impossible for a car to have been first registered after 2016, so we remove that data as well.
autos = (autos[autos['registration_year']
.between(1901, 2016)]
)
print(autos['registration_year']
.value_counts(normalize = True, dropna = False)
.sort_index()
.tail(8)
)
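A quick check of how much data survives the registration-year filter (again just a sanity check):
# confirm the remaining registration years fall within 1901-2016
print(autos.shape)
print(autos['registration_year'].describe())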
top_6_brands = (autos['brand']
.value_counts()
.head(6)
.index
)
print(top_6_brands)
Right now we're going to aggregate the data around the brand and price columns. For this, I will use the top 6 selling brands and compare the average price of the cars sold for each. Earlier we cleaned the price column, but I didn't save the outlier-free version back into the autos DataFrame because I was unsure which values actually matter, so I'm going to compare the autos_non_free DataFrame against the regular autos DataFrame.
print(autos_non_free['price']
.value_counts()
.sort_index()
.head()
)
print('\n')
print(autos['price']
.value_counts()
.sort_index()
.head()
)
price_per_brand = {}
# key is the brand name, value is the mean price for that brand
for brand in top_6_brands:
    mean = autos.loc[autos['brand'] == brand, 'price'].mean()
    price_per_brand[brand] = mean
price_per_brand
Of these brands, Mercedes-Benz, Audi and BMW have the highest average prices.
km_per_brand = {}
# key is brand name, value is the mileage (kmage?)
for brand in top_6_brands:
    kmage = autos.loc[autos['brand'] == brand, 'odometer_km'].mean()
    km_per_brand[brand] = kmage
km_per_brand
sr = pd.Series(price_per_brand)
df = pd.DataFrame(sr, columns = ['mean_price'])
sr2 = pd.Series(km_per_brand)
df['mean_km'] = sr2
df
Above we created a new Series from the price_per_brand dict, which let pandas assign the dictionary keys as the index. We then converted it to a DataFrame, which allowed us to add another column, and since pandas aligns the indices, the values from both dictionaries line up by brand, making the data easy to explore.
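For reference, the same brand-level table could also be built in one step with groupby and agg; this is just an equivalent alternative to the dictionary approach above:
# equivalent aggregation using groupby instead of loops and dictionaries
(autos[autos['brand'].isin(top_6_brands)]
 .groupby('brand')[['price', 'odometer_km']]
 .mean()
 .rename(columns={'price': 'mean_price', 'odometer_km': 'mean_km'})
)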
Looking at the data, we can see that most of the brands have similar average mileages despite quite different average prices, which suggests there is no strong correlation between price and mileage; the price differences are most likely due to other factors, such as model.
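One way to back that claim up numerically (a quick check rather than a full analysis) is the correlation between price and mileage, using the outlier-trimmed prices so the extreme listings don't skew the result:
# correlation between price and odometer reading on the trimmed data
autos_non_free[['price', 'odometer_km']].corr()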