In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
The dataset was originally scraped and uploaded to Kaggle; a few modifications have been made to that original version.
In this project I will start by cleaning the data, then analyse the different brands for price and popularity, and finally look for any connections between price, mileage and damage. Throughout this project I will use the pandas Python library, visualizing some of the data with matplotlib.
The data dictionary provided with the data is as follows:
Now let's begin by importing the necessary modules and reading in the file.
# Import modules
import pandas as pd
import numpy as np
#Read dataset into Pandas
autos = pd.read_csv('autos.csv',encoding='Latin-1', engine='python')
Great, now that we have the dataset as a dataframe, let's take a look at the first 5 rows and then a summary of the dataset to see what we can learn.
#First 5 rows
autos.head()
# Info about each column
autos.info()
A few observations from the dataset:
The dataset contains 20 columns, most of which are strings.
Some columns have null values, but none have more than ~20% null values.
The column names use camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores.
The price and odometer columns are both object type, which means that we need to change them to integer types in order to access the numeric values.
We need to do some cleaning:
Firstly I will change the column names from camelcase to snakecase, and also change a couple of the column names so that they better match the descriptions in the data dictionary.
# Change column names
new_columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year',
'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage',
'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen']
autos.columns = new_columns
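As an aside, the camelcase-to-snakecase step could also be partly automated with a small regex helper. This is a hypothetical sketch rather than part of the project, and it would still need manual fixes (e.g. it turns 'powerPS' into 'power_p_s', and 'notRepairedDamage' was renamed to 'unrepaired_damage' by hand):

```python
import re

def camel_to_snake(name):
    # Insert an underscore before each uppercase letter (except at the start),
    # then lowercase the whole string
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

print([camel_to_snake(c) for c in ['dateCrawled', 'offerType', 'vehicleType']])
# → ['date_crawled', 'offer_type', 'vehicle_type']
```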
Now I will convert the price and odometer columns to integer types. This involves removing the non-numeric characters ('$', ',' and 'km') from the values and recording the units in the column names instead.
# Rename price and odometer columns to show units
autos = autos.rename(columns={'price': 'price($)', 'odometer':'odometer(km)'})
# Strip non-numeric characters from price and odometer columns and convert to integer dtype
autos['price($)'] = autos['price($)'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(int)
autos['odometer(km)'] = autos['odometer(km)'].str.replace('km', '', regex=False).str.replace(',', '', regex=False).astype(int)
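A slightly more defensive alternative (a sketch on synthetic values, not the project's code) is to strip every non-digit character with a single regex and let `pd.to_numeric` handle the conversion:

```python
import pandas as pd

prices = pd.Series(['$5,000', '$18,300', '$0'])
# Drop everything that is not a digit, then convert to numbers
cleaned = pd.to_numeric(prices.str.replace(r'[^\d]', '', regex=True))
print(cleaned.tolist())
# → [5000, 18300, 0]
```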
Let's take a look at the first 5 rows of the dataset to check our new column names and to check the values in the price and odometer columns.
# First 5 rows
autos.head()
The columns have been cleaned successfully so let's move on.
Let's take a look at the summary statistics of the data to see if there is anything else that needs cleaning up or could be removed.
# Show data description
autos.describe(include='all')
From the description we can see that there are 2 columns with only 2 unique values, and that in each case one of the values accounts for 49,999 out of 50,000 occurrences. We can therefore remove the seller and offer_type columns as they are effectively redundant.
# Remove unneeded columns
del autos['seller']
del autos['offer_type']
Now let's look for values in the data that could be incorrect because they are unrealistically high or low. I will focus on the columns with numeric values, where unrealistic outliers can affect the reliability of our findings.
I will therefore take a look at the following columns, starting with price($):
# Show description of price column
autos['price($)'].describe()
The max value is 100,000,000 dollars; I don't think anyone is buying a car on eBay for that much. Let's have a look at the 20 highest values in the price column.
# Show most expensive cars in dataset
value_counts = autos['price($)'].value_counts()
value_counts.sort_index().tail(20)
There are 14 cars listed at 999,990 dollars and above. Below that there is a big gap all the way down to 350,000 dollars, which seems about right for the most expensive vehicle on eBay. I will therefore remove all rows with a price above 350,000.
# Remove unrealistic prices
autos = autos[autos['price($)'].between(0, 350000)]
# Show description of cleaned price column
autos['price($)'].describe()
Now for the 20 lowest values:
# Show least expensive cars
autos['price($)'].value_counts().sort_index().head(20)
There aren't any outliers at this end. Yes, selling for nothing seems strange, but there are 1421 listings at this price. With nearly 3 percent of listings priced at 0, these aren't outliers; a starting price of 0 is clearly a legitimate practice on eBay. I therefore won't be removing any rows for having a price that is too low.
Now let's take a look at the odometer column:
# Show description of odometer column
autos['odometer(km)'].describe()
# Show value counts in order of index
autos['odometer(km)'].value_counts().sort_index()
Only 13 unique values. The odometer values have clearly been grouped together and there are no outliers here. We can therefore move on.
Now let's take a look at the columns with dates. There are 5 in total. I will begin by looking at the 3 string columns:
# Show 3 columns with string dates
autos[['date_crawled', 'ad_created', 'last_seen']][0:5]
# Show dates of date crawled column
autos['date_crawled'].str[:10].value_counts(dropna=False)
All of the listings were crawled between March 6th and April 7th 2016.
# Show dates of ad created column
autos['ad_created'].str[:10].value_counts(dropna=False).sort_index()
Ads were created between June 11th 2015 and April 7th 2016.
# Show dates of last seen column
autos['last_seen'].str[:10].value_counts(dropna=False).sort_index().tail(20)
All listings were last seen between March 19th and April 7th 2016.
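Slicing with str[:10] works because these three columns store timestamps as 'YYYY-MM-DD HH:MM:SS' strings. An alternative sketch is to parse them properly with pd.to_datetime, which also unlocks date arithmetic:

```python
import pandas as pd

last_seen = pd.Series(['2016-03-19 17:36:18', '2016-04-07 03:16:57'])
parsed = pd.to_datetime(last_seen)
# Reduce each timestamp to its calendar date
print(parsed.dt.strftime('%Y-%m-%d').tolist())
# → ['2016-03-19', '2016-04-07']
```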
For these 3 columns nothing stands out as being out of the ordinary. Now let's check the registration year column to see if it matches what we would expect from the above data. There should be no registration years later than 2016 as that is when the last ads were created and the data was crawled.
# Show description of registration year column
autos['registration_year'].describe()
Hmm, well that is strange. We have cars registered as early as the year 1000 and as late as 9999. Incredible! We need to take a closer look at the earliest and latest values in the registration year column.
# Show earliest dates
autos['registration_year'].value_counts().sort_index().head(20)
# Show latest dates
autos['registration_year'].value_counts().sort_index().tail(20)
There are dates here which are not possible.
We need to remove the rows with impossible registration dates. The Ford Model T, the first mass-produced car, entered production in 1908. I will use that year as the cut-off and remove everything from before it.
As we saw from the ad_created column, these cars were all listed in 2015 and 2016. We therefore cannot have any cars with a registration date later than 2016 and must remove those rows as well.
# Remove impossible dates
autos = autos[autos['registration_year'].between(1908,2016)]
Let's take a look at the remaining values in a histogram.
# Import graph plotting module
from matplotlib import pyplot as plt
%matplotlib inline
# Plot histogram
ax = autos['registration_year'].plot(kind='hist', bins=30, range=(1908, 2016))
ax.set_xlabel('Date')
With the unrealistic dates gone we now have the distribution you would expect: lots of cars from the past 20 years and progressively fewer as we go further back in time.
Our dataset now only contains accurate dates.
The next step is to translate from German to English.
autos.head()
Looking through the columns we can see that 'vehicle_type', 'gearbox', 'fuel_type' and 'unrepaired_damage' all contain strings that may be in German (other columns also contain strings, but they are names, so we don't need to worry about translating them). I am going to first check the unique values in each column and then replace the German words with English ones. This will require some help from Google Translate.
# Check for any German in vehicle type column
autos['vehicle_type'].unique()
# Translate German to English
autos['vehicle_type'] = autos['vehicle_type'].str.replace('kleinwagen', 'mini').str.replace('kombi', 'combination')
autos['vehicle_type'] = autos['vehicle_type'].str.replace('cabrio', 'convertible').str.replace('andere', 'other')
# Check to see new values in vehicle type column
autos['vehicle_type'].unique()
# Check for any German in gearbox column
autos['gearbox'].unique()
# Translate German to English
autos['gearbox'] = autos['gearbox'].str.replace('manuell', 'manual').str.replace('automatik', 'automatic')
# Check to see new values in Gearbox column
autos['gearbox'].unique()
# Check for any German in Fuel type column
autos['fuel_type'].unique()
# Translate German to English
autos['fuel_type'] = autos['fuel_type'].str.replace('benzin', 'petrol').str.replace('elektro', 'electric').str.replace('andere', 'other')
# Check to see new values in fuel type column
autos['fuel_type'].unique()
# Check for any German in unrepaired damage column
autos['unrepaired_damage'].unique()
# Translate German to English
autos['unrepaired_damage'] = autos['unrepaired_damage'].str.replace('nein', 'no').str.replace('ja', 'yes')
# Check to see new values in unrepaired damage column
autos['unrepaired_damage'].unique()
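The chained str.replace calls work here because none of the German words is a substring of another value in its column. For whole-value translation in general, a safer pattern (sketched on synthetic data) is Series.replace with a mapping dict, which leaves unmapped values, including missing ones, untouched:

```python
import pandas as pd

gearbox = pd.Series(['manuell', 'automatik', None, 'manuell'])
translations = {'manuell': 'manual', 'automatik': 'automatic'}
# Replace whole values according to the mapping
print(gearbox.replace(translations).tolist())
# → ['manual', 'automatic', None, 'manual']
```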
Great, our dataset now contains English instead of German. Everything seems to be ready to do some data analysis!
Now that the data is cleaned we can start to analyse it.
First, I am going to look at the different car brands in the dataset: which brands are the most popular, and then narrow the data down to just those popular brands. I will count as popular any brand that makes up more than 2% of the dataset; 2% of 50,000 is 1,000 listings, which I think is a reasonable cut-off point.
# Loop through brand column and build a frequency table
brand_frequency = {}
for value in autos['brand']:
    if value in brand_frequency:
        brand_frequency[value] += 1
    else:
        brand_frequency[value] = 1
# Find % of each car brand
total = sum(brand_frequency.values())
percentages = {}
for key, value in brand_frequency.items():
    percentages[key] = value * 100 / total
# Narrow down to brands above 2%
brands_above_2_percent = {}
for key, value in percentages.items():
    if value > 2:
        brands_above_2_percent[key] = value
# Show dictionary
brands_above_2_percent
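The loops above reimplement what value_counts can do in one pass. A vectorized sketch of the same logic on a small synthetic Series (a 20% cutoff stands in for the real 2%):

```python
import pandas as pd

brands = pd.Series(['vw', 'bmw', 'vw', 'opel', 'vw', 'bmw', 'ford', 'vw'])
# Share of each brand as a percentage of all listings
shares = brands.value_counts(normalize=True) * 100
# Keep only brands above the cutoff
print(shares[shares > 20].to_dict())
# → {'vw': 50.0, 'bmw': 25.0}
```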
I'm going to make this information a little easier to view by putting it into a bar chart:
# Plot Bar chart
plt.bar(brands_above_2_percent.keys(), brands_above_2_percent.values())
plt.ylabel('% of cars sold')
plt.xlabel('Brands')
plt.xticks(rotation=90)
plt.title('Most popular car brands')
plt.show()
Here we can see that Volkswagen is by far the most popular car brand on German eBay. It is followed by 4 more German brands: BMW, Mercedes, Audi and Opel. The most popular non-German brand is Ford.
Now let's have a look at the average price of these popular car brands:
# Find average price of each brand
mean_brands = {}
brand_names = autos['brand'].unique()
for brand in brand_names:
    selected_rows = autos[autos['brand'] == brand]
    mean_brands[brand] = selected_rows['price($)'].mean()
# Filter out brands with less than 2 percent of listings
popular_brands = brands_above_2_percent.keys()
mean_price_popular_brands = {}
for key, value in mean_brands.items():
    if key in popular_brands:
        mean_price_popular_brands[key] = value
#Show dictionary
mean_price_popular_brands
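The same per-brand averages can be computed without an explicit loop using groupby; a sketch on a synthetic frame:

```python
import pandas as pd

df = pd.DataFrame({
    'brand': ['vw', 'bmw', 'vw', 'bmw'],
    'price($)': [4000, 8000, 6000, 10000],
})
# Mean price per brand, returned as a Series indexed by brand
mean_prices = df.groupby('brand')['price($)'].mean()
print(mean_prices.to_dict())
# → {'bmw': 9000.0, 'vw': 5000.0}
```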
Again I will put this information into a bar chart to make it more readable.
#Plot Bar Chart
plt.bar(mean_price_popular_brands.keys(), mean_price_popular_brands.values())
plt.title('Mean price of Popular car brands')
plt.xticks(rotation=90)
plt.ylabel('Price ($)')
plt.xlabel('brands')
plt.show()
We can see that the three luxury German brands are the three most expensive, with a fourth German brand, Volkswagen, coming next. After that come Ford and Peugeot, followed by Renault, Opel and Fiat. So BMW, Mercedes and Audi are not only among the most popular brands but are also the three most expensive.
Next, let's take a look at the average mileage of each of these brands. As with the price, I will plot the results in a bar chart.
# Find average mileage of car brands
mean_mileage_brands = {}
brand_names = autos['brand'].unique()
for brand in brand_names:
    selected_rows = autos[autos['brand'] == brand]
    mean_mileage_brands[brand] = selected_rows['odometer(km)'].mean()
# Filter out brands with less than 2 percent of listings
popular_brands = brands_above_2_percent.keys()
mean_mileage_popular_brands = {}
for key, value in mean_mileage_brands.items():
    if key in popular_brands:
        mean_mileage_popular_brands[key] = value
mean_mileage_popular_brands
This data is easier to visualize in a bar chart:
# Plot bar chart
plt.bar(mean_mileage_popular_brands.keys(), mean_mileage_popular_brands.values())
plt.ylim(110000, 140000)
plt.xticks(rotation=90)
plt.ylabel('Average Mileage(km)')
plt.xlabel('brands')
plt.title('Average mileage of most popular brands')
plt.show()
We can see here that on average each brand has a similar mileage on the clock. BMW has the highest average and Fiat the lowest, but on the whole there is little difference between brands, with the majority averaging between 125,000 and 135,000 km.
I now want to see whether there is a relationship between the average price and the average mileage of these popular brands.
# Create pandas Series for mean price and mean mileage
mean_price_series = pd.Series(mean_price_popular_brands)
mean_mileage_series = pd.Series(mean_mileage_popular_brands)
# Create dataframe using these 2 series
price_mileage_dataframe = pd.DataFrame(mean_price_series, columns=['mean_price'])
price_mileage_dataframe['mean_mileage'] = mean_mileage_series
# Show mean price and mileage side by side
price_mileage_dataframe
This time, for a visualization of the data, I will plot a scatter chart with the mean mileage on the x axis and the mean price on the y axis.
# Plot scatter chart
ax = price_mileage_dataframe.plot(x='mean_mileage', y='mean_price', kind='scatter', title='Mileage x price')
We would expect the cars with higher mileage to cost less. However, in the top right-hand corner we can see that the three highest-priced brands (BMW, Mercedes, Audi) also have the highest average mileage. This is largely a result of the average mileage being very similar across brands. All we can conclude here is that the price differences between brands are not explained by mileage but by other factors.
In order to see the difference that mileage makes to price, we need to look at all of the cars in the dataset individually and see if there is a pattern between mileage and average price.
# Create dictionary with mileage as the key and average price as the value
price_mileage = {}
for miles in autos['odometer(km)'].unique():
    selected_rows = autos[autos['odometer(km)'] == miles]
    price_mileage[str(miles)] = selected_rows['price($)'].mean()
# Build a new dictionary with the keys in ascending numerical order
sorted_dict = {key: price_mileage[key] for key in sorted(price_mileage, key=int)}
# Show dictionary
sorted_dict
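The dictionary building and sorting above could also be collapsed into a single groupby; since groupby sorts its group keys by default, the numeric mileage index comes back already in ascending order. A sketch on synthetic data:

```python
import pandas as pd

df = pd.DataFrame({
    'odometer(km)': [150000, 5000, 150000, 90000],
    'price($)': [3000, 12000, 5000, 7000],
})
# Mean price per mileage grouping, with group keys in ascending order
mean_by_mileage = df.groupby('odometer(km)')['price($)'].mean()
print(mean_by_mileage.to_dict())
# → {5000: 12000.0, 90000: 7000.0, 150000: 4000.0}
```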
Now to put the dictionary into a bar chart for a better visualization.
# Plot bar chart showing average price and average mileage
plt.bar(sorted_dict.keys(), sorted_dict.values())
plt.xticks(rotation=90)
plt.ylabel('Average Price($)')
plt.xlabel('Mileage (km)')
plt.title('Average price of different mileage groupings')
plt.show()
The results here are clear: the more kilometres on the clock, the less a car will cost. The one outlier is cars with very few kilometres on the clock (around 5,000 km).
There could be numerous reasons why the cars with the lowest mileage fall outside this trend. Perhaps people only sell cars with low mileage when there is something wrong with them. Let's finish by having a look at the relationship between average price and whether a car has unrepaired damage:
# Split cars between damaged and undamaged
damaged = autos[autos['unrepaired_damage'] == 'yes']
undamaged = autos[autos['unrepaired_damage'] == 'no']
# Find mean price for damaged and undamaged cars
damaged_mean_price = damaged['price($)'].mean()
undamaged_mean_price = undamaged['price($)'].mean()
# Plot bar chart to visualize data
plt.bar(['damaged', 'undamaged'], [damaged_mean_price, undamaged_mean_price])
plt.title('Average price of damaged and undamaged cars')
plt.ylabel('Price($)')
plt.show()
The results here are very clear. A damaged car is on average 5000 dollars cheaper than an undamaged car.
3 Main Takeaways
BMW, Mercedes and Audi are the kings of the German auto eBay market: they are among the most popular brands and, on average, the three most expensive. With Volkswagen the most popular brand of all, it is clear that Germans like to buy German.
Mileage has a significant effect on the price of vehicles: the more kilometres on the clock, the less expensive the car tends to be.
The same is true of cars with unrepaired damage: on average, a car with unrepaired damage costs 5,000 dollars less than one without.