The aim of this project is to practice data cleaning and analysis using the included used car listings dataset. For our practice, we will attempt to find the best cars to purchase from the German eBay website.
The dataset I'll be working with contains used car listings from eBay Kleinanzeigen, the classifieds section of the German eBay website. The dataset was originally scraped and uploaded to Kaggle, but I'll be using a modified version provided to me by Dataquest.
The original dataset, along with more information about it, can be found by following the link.
After our analysis, we discovered that Volkswagen was the most popular brand, sonstige_autos was the most expensive on average, and Volvo gave the most mileage. We also noticed a few brands, like BMW and Mercedes-Benz, that were expensive yet popular among buyers. We deduced that the reason for this was that they provided value worth their high price.
We concluded that the best brand to buy would be either a Volkswagen or a Volvo. If we chose to go for a classier brand, a BMW or a Mercedes-Benz would be a better option.
Before exploring the dataset, there are some things we need to do first, such as handling missing values and converting categorical data to numerical data. All this is to ensure we get as accurate an insight as possible from our data. Let's take a look at our data.
import pandas as pd
import numpy as np

# Read the listings; the file is encoded in Latin-1 rather than UTF-8
autos = pd.read_csv("autos.csv", encoding="Latin-1")
autos.head()
autos.info()
From the above code cell, we see that we have 20 columns in our dataset. Most of the columns have 50000 entries, with a quarter of them (five, to be exact) missing between roughly 2000 and almost 9000 values. These missing values will need to be cleaned out. We'll also need to change the column names to something that's a bit easier for us to work with. Let's begin, shall we?
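Before renaming anything, it helps to pin down exactly where those gaps are. A quick way to do this (just a sketch; the output isn't reproduced here) is to count the null values per column:
autos.isnull().sum()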
autos.columns
changes={'yearOfRegistration': 'registration_year',
'monthOfRegistration': 'registration_month',
'notRepairedDamage': 'unrepaired_damage',
'dateCreated': 'ad_created', 'offerType': 'offer_type',
'vehicleType': 'vehicle_type', 'powerPS': 'power_ps',
'fuelType': 'fuel_type', 'nrOfPictures': 'no_of_pics',
'postalCode': 'postal_code', 'lastSeen': 'last_seen',
'dateCrawled': 'date_crawled'}
autos.rename(changes, axis = 1, inplace=True)
autos.columns
autos.head()
Great! Now our column names look better. All letters are lowercase and all words are separated by an underscore (_). This style of casing is known as snake case; writing in snake case eases typing, since we don't have to alternate between uppercase and lowercase letters.
Now we need to find out what kind of data we're looking at and what we need to clean.
autos.describe(include='all')
We find that we have two alphanumeric columns (price, odometer) that need to be converted to a numeric type; three date columns (date_crawled, ad_created, last_seen) that will be properly cleaned later; five numeric columns (registration_year, registration_month, power_ps, no_of_pics, postal_code); and the rest are text columns.
We'll need to investigate the date columns some more to see how much insight we can gain about the timing of some events related to the cars. The no_of_pics column has only one value for all its rows, so we'll drop that column.
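Since no_of_pics carries no information, dropping it is a one-liner. A minimal sketch (none of the later analysis relies on this column):
autos = autos.drop('no_of_pics', axis=1)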
We'll start our cleaning process by dealing with the alphanumeric data. We'll remove all letters and symbols from the values and convert each column to either an integer or a float type, whichever is best suited.
# Strip the currency symbol and thousands separators, then convert to int
autos['price'] = autos['price'].str.replace('$', '', regex=False)
autos['price'] = autos['price'].str.replace(',', '', regex=False)
autos['price'] = autos['price'].astype(int)

# Strip the 'km' suffix and thousands separators, then convert to int
autos['odometer'] = autos['odometer'].str.replace('km', '', regex=False)
autos['odometer'] = autos['odometer'].str.replace(',', '', regex=False)
autos['odometer'] = autos['odometer'].astype(int)
autos.rename({'odometer': 'odometer_km', 'price': 'price_dollars'}, axis = 1, inplace=True)
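To confirm the conversions took effect, we can quickly check that both columns now hold integers:
autos[['price_dollars', 'odometer_km']].dtypes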
Now that that's done, let's take a look at the price column first.
autos['price_dollars'].unique().shape
There are 2357 unique prices in price_dollars. Let's investigate further to find out more about these values.
autos['price_dollars'].describe()
autos['price_dollars'].value_counts(ascending=False).head(6)
autos['price_dollars'].value_counts(ascending=False).tail(6)
autos.loc[autos['price_dollars'] == 99999999, "name"].shape
Upon further investigation, we notice the following: the minimum price is 0 dollars, indicating that the vehicle is free, and the 0-dollar price has the highest frequency. This seems strange, as it implies that quite a number of cars are free. Some Google searching revealed that used cars on eBay start at around 4000 dollars, suggesting that no car should be priced below that. At the other extreme, the maximum price is 99999999 dollars, which is clearly an outlier.
Let's do some investigating. I decided to start by finding the number of cars with a price of 0 dollars and an odometer reading of more than 10000 km.
autos.loc[(autos['price_dollars'] == 0) & (autos['odometer_km'] > 10000), "price_dollars"].shape
autos.loc[(autos['price_dollars'] == 0) & (autos['odometer_km'] <= 10000), :]
After my investigation, I noticed that there was a mix of fully repaired vehicles, vehicles with unrepaired damage, and those with an unknown status. I also noticed vehicles with registration years ranging from the 1900s to 2016 - vehicles from any time in the 2000s shouldn't be free - and they were all private vehicles.
Since we have no way of confirming the correct data for these rows, we will remove all the rows with a price of 0 dollars, as well as the 99999999-dollar outlier, since it is safe to do so.
# Keep only rows priced above 0 dollars, excluding the 99999999 outlier
autos = autos[(autos['price_dollars'] > 0) & (autos['price_dollars'] != 99999999)]
autos.shape
Odometer column

Now we move to the odometer column.
autos['odometer_km'].unique().shape
autos['odometer_km'].unique()
autos['odometer_km'].describe()
autos['odometer_km'].value_counts(ascending=False).head(6)
autos['odometer_km'].value_counts(ascending=False).tail(6)
From the analysis, we notice a few things worth a closer look. Let's start with the cars that have the lowest odometer reading (5000 km).
autos.loc[autos['odometer_km'] == 5000, "brand"].value_counts().head()
autos.loc[autos['odometer_km'] == 5000, "price_dollars"].describe()
autos.loc[(autos['odometer_km'] == 5000) & (autos['price_dollars'] == 3890000), "name"]
autos.loc[(autos['odometer_km'] == 5000) & (autos['price_dollars'] == 1), ["name", "registration_year", "unrepaired_damage"]]
As expected, the most expensive car with a 5000 km odometer reading is a Ferrari_FXX, worth 3890000 dollars.
Some other things I noticed include a car with a 5000 km reading listed for just 1 dollar (queried above). Now let's look at the cars with the most mileage.
autos.loc[autos['odometer_km'] == 150000, "brand"].value_counts().head()
autos.loc[autos['odometer_km'] == 150000, "name"].value_counts().head()
autos.loc[autos['odometer_km'] == 150000, "price_dollars"].describe()
autos.loc[(autos['odometer_km'] == 150000) & (autos['price_dollars'] == 12345678), "brand"]
After analysis, we noticed that of all the cars with a mileage of 150000 km, the highest price is 12345678 dollars. This price seems strange: typically, prices in the millions are rounded to the nearest hundred thousand or higher. Since there are only three such rows, it's safe enough for us to remove them without losing any meaningful insight from our data.
autos = autos.loc[~(autos['price_dollars'] == 12345678), :]
autos.shape
Apart from the weird highest price I mentioned earlier, nothing else seems out of the ordinary with the odometer column. Let's move on to the next thing.
There are three date columns in our data: date_crawled, ad_created, and last_seen.
We begin by extracting the date and time of each column into separate columns before performing our analysis. We will discard the original columns afterwards since we have no further need for them. To do this, we'll create a function so we don't repeat the same code multiple times.
def extract_date_time(column):
    # Split a timestamp column into its date (first 10 characters)
    # and time components, then drop the original column
    extract_date = autos[column].str[:10]
    extract_time = autos[column].str[10:]
    autos.drop(column, axis=1, inplace=True)
    return extract_date, extract_time
autos['date_crawled_date'], autos['date_crawled_time'] = extract_date_time('date_crawled')
autos['ad_created_date'], autos['ad_created_time'] = extract_date_time('ad_created')
autos['last_seen_date'], autos['last_seen_time'] = extract_date_time('last_seen')
Now that we've extracted the date and time of each column, we can begin our analysis.
autos['date_crawled_date'].value_counts(normalize=True, dropna=False)
We can observe that the listings were crawled daily over a period of roughly a month, from early March to early April 2016.
autos['ad_created_date'].value_counts(normalize=True, dropna=False)
In the ad_created_date column, we notice that most ads were created within the crawl period, with a small number dating back earlier.
autos['last_seen_date'].value_counts(normalize=True, dropna=False)
The listings were last seen between early March and the start of April 2016, which matches the crawl period.
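We can verify this range directly. Since the extracted dates are strings in YYYY-MM-DD form, their string minimum and maximum correspond to the earliest and latest dates (a quick check):
print(autos['last_seen_date'].min(), autos['last_seen_date'].max())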
autos['registration_year'].describe()
Here we observe that the minimum registration year is 1000 and the maximum is 9999. These are unrealistic: the ads were crawled in 2016, so it's impossible for any car to have been registered after 2016, and a registration year of 1000 is equally impossible because cars had not been invented yet. We may have to remove rows containing such unrealistic dates, but let's first find out how many there are to see if it's safe to do so. We set our lower and upper bounds to 1940 and 2016 respectively.
outlier_count = autos.loc[~(autos['registration_year'].between(1940, 2016)), 'registration_year'].count()
outlier_count
There are 1900 cars that were registered outside of 1940 to 2016. Considering how many rows we have left, I think it's okay to remove these ones outside our bounds.
autos = autos.loc[autos['registration_year'].between(1940, 2016), :]
autos['registration_year'].value_counts(normalize=True)
Now that we've cleaned our data, we can see that most of the cars were first registered within the last two decades, with registration years in the early 2000s being the most common.
Great! We are finally done exploring all the date columns in our dataset for insights. We will move on to the next column, the brand column.
Brand column

To explore this column, we will start by getting an aggregate of the unique values in it. Because of how many unique brands are in the dataset, we will aggregate over the top 20 most common brands. We will get these using the Series.value_counts() method available in pandas to count the number of listings per brand.
brand_count = autos['brand'].value_counts().head(20)
brand_count
The most commonly listed brand is Volkswagen. BMW and Opel are second and third respectively, but in significantly smaller numbers than Volkswagen. Mini and Volvo are the rarest brands of the top 20, both with fewer than 430 listings.
Not surprisingly, posh cars like Ferrari don't appear on the top-twenty list. Ford appears to be quite common, while Nissan is closer to the bottom of the list. This could be a result of German buyers' preferences, or a Nissan might be more expensive than a Ford. Let's find out by taking a look at the mean price of the cars on this list.
To do this, we'll create a function to sort and display, in descending order, the mean price dictionary we will build for our analysis.
def display_table(dataset, index=0, is_dataset=True):
    """
    Displays a frequency table of any column in the dataset in descending order.
    dataset can be a frequency dictionary or a dataframe.
    INPUT:
    dataset = The dataframe or frequency dictionary to be examined
    index = Index of the column whose frequency table we are after. Default = 0
    is_dataset = A boolean. True if dataset is a dataframe, else False. Default = True
    OUTPUT:
    Prints the frequency table of the column corresponding to index, sorted in descending order.
    """
    if is_dataset:
        # Build a frequency dictionary from the column at the given index
        table = dataset.iloc[:, index].value_counts().to_dict()
    else:
        table = dataset
    # Sort by value in descending order and print each key-value pair
    for key, value in sorted(table.items(), key=lambda item: item[1], reverse=True):
        print(key, ':', value)
# Compute the mean price for each of the top 20 brands
brand_price_mean = {}
for brand in brand_count.index:
    brand_price_mean[brand] = autos.loc[autos['brand'] == brand, 'price_dollars'].mean()
display_table(brand_price_mean, is_dataset=False)
First, we notice that Volkswagen (our most common brand) and Ford (another common brand) sit somewhere in the middle of the list, indicating that their average prices are moderate compared to the rest of the top 20 brands.
Opel is at the bottom of the price list, indicating that its popularity may be largely due to its affordability. We also notice that sonstige_autos (German for "other cars", i.e., listings outside the named brands) and Citroen are at the top of the list - not surprising considering that they are luxury cars.
We also notice that BMW and Mercedes-Benz are quite expensive, yet they are among the top four most common brands. This could be because they offer better mileage, last longer than most other brands, or offer some other advantage over the competition that makes them worth their price.
Now let's look at the mean mileage of these brands.
# Compute the mean mileage for each of the top 20 brands
brand_mean_mileage = {}
for brand in brand_count.index:
    brand_mean_mileage[brand] = autos.loc[autos['brand'] == brand, 'odometer_km'].mean()
display_table(brand_mean_mileage, is_dataset=False)
On this list, we notice that Volvo has the highest average mileage. The sonstige_autos (our most expensive brand) offers one of the lowest mileages, which could explain why it ranks so low among the top 20 most common brands. BMW, Opel, and Mercedes-Benz all have some of the highest mileages, topped only by Volvo. It's worth noting that despite their high prices, BMW and Mercedes-Benz were among the most common brands. Ford sits in the middle of the list; it is fairly expensive and also fairly common. And Volkswagen (our most common brand) has a high average mileage and is quite affordable.
From what I gather, this signifies a trend: the most important factor in picking a brand is mileage. The better the mileage a brand provides, the more likely it is to be bought. Other factors, like buyer taste and price, also matter, but as in the case of BMW and Mercedes-Benz, a high price can be overlooked if the benefits the brand provides are worth it.
Let's take a closer look at the mean mileages of the brands in the top 8 of the mean price list. We may find some connections between the two properties.
mean_price_series = pd.Series(brand_price_mean)
mean_mileage_series = pd.Series(brand_mean_mileage)
mean_stats_df = pd.DataFrame(mean_price_series, columns = ['mean_price'])
mean_stats_df['mean_mileage'] = mean_mileage_series
mean_stats_df['brand_count'] = brand_count
mean_stats_df
Sorting our new dataframe by mean_price in descending order lets us easily find the top 8 brands with the highest prices and their corresponding mileages.
mean_stats_df.sort_values('mean_price', axis=0, ascending=False)
Apart from sonstige_autos and Mini, all the brands in the top 8 of the mean price list have a mean mileage above 100000 km. The brands with average mileages greater than 120000 km also have high brand counts. This indicates that a car is more likely to be sold if it has high mileage, even if it is a bit expensive.
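As a rough check on this reading, we can compute the pairwise correlations between the three columns of our summary dataframe. This is only a sketch: with just 20 brands, the numbers are indicative at best.
mean_stats_df.corr()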
In this project, we analyzed a modified version of the eBay car sales dataset provided on Kaggle to find the best cars to purchase, and reached the following conclusions: Volkswagen was the most popular brand, sonstige_autos was the most expensive on average, and Volvo gave the most mileage.
We also noticed a few brands, like BMW and Mercedes-Benz, that were expensive yet popular among buyers. We deduced that the reason for this was that they provided value worth their price.
We concluded that the best brand to buy would be either a Volkswagen or a Volvo. If we chose to go for a classier brand, a BMW or a Mercedes-Benz would be a better option.