In this guided project, we'll work with a dataset of used cars from the classifieds section of the German eBay website. The dataset was originally scraped and uploaded to Kaggle.
The data dictionary provided with the data is as follows:
dateCrawled - When this ad was first crawled. All field values are taken from this date
name - Name of the car
seller - Whether the seller is private or a dealer
offerType - The type of listing
price - The price on the ad to sell the car
abtest - Whether the listing is included in an A/B test
vehicleType - The vehicle type
yearOfRegistration - The year in which the car was first registered
gearBox - The transmission type
powerPS - The power of the car in PS
model - The car model name
kilometer - How many kilometers the car has been driven
monthOfRegistration - The month in which the car was first registered
fuelType - What type of fuel the car uses
brand - The brand of the car
notRepairedDamage - Whether the car has damage which is not yet repaired
dateCreated - The date on which the eBay listing was created
postalCode - The postal code for the location of the vehicle
lastSeenOnline - When the crawler saw this ad last online
The aim of this analysis is to clean the data and analyze the included used car listings.
import pandas as pd
autos = pd.read_csv("C:\\Users\\USER\\Videos\\Coding Videos\\my_datasets\\autos.csv", encoding = "Latin-1")
autos
autos.info()  # Gives information about our dataset; we can observe from the output that we have 20 columns, most
# of which are strings.
When we extract the information of our dataframe using the info method, we see that some columns have missing values, but none of them has more than 20% null values.
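As a quick check of that claim, a minimal sketch like the one below (not part of the original project flow, and assuming autos has been loaded as above) would show the share of missing values per column; the 20% threshold is only illustrative.
# Share of missing values per column, sorted from most to least missing.
null_share = autos.isnull().sum() / len(autos)
print(null_share.sort_values(ascending=False))
# Columns exceeding an illustrative 20% threshold, if any:
print(null_share[null_share > 0.20])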
autos.head()#print the first five rows of the dataframe
Exploring the head of our dataframe, we can see that the column labels are written in camelcase instead of the standard Python snakecase style. So, we will rename the column labels to snakecase.
def clean_col(col):
    col = col.replace("dateCrawled", "date_crawled")
    col = col.replace("offerType", "offer_type")
    col = col.replace("vehicleType", "vehicle_type")
    col = col.replace("powerPS", "power_ps")
    col = col.replace("fuelType", "fuel_type")
    col = col.replace("monthOfRegistration", "registration_month")
    col = col.replace("yearOfRegistration", "registration_year")
    col = col.replace("notRepairedDamage", "unrepaired_damage")
    col = col.replace("dateCreated", "ad_created")
    col = col.replace("lastSeen", "last_seen")
    col = col.replace("postalCode", "postal_code")
    col = col.replace("nrOfPictures", "no_of_pics")
    return col
cleaned_labels = []
for c in autos.columns:
    cleaned = clean_col(c)
    cleaned_labels.append(cleaned)
autos.columns = cleaned_labels
print(autos.columns)#Examining the column labels, using the Dataframe.columns attribute
print(autos.head())
print(autos.describe(include = "all"))
A closer look at the dataframe explored above shows that some columns are worth dropping and others are worth modifying. Findings:
seller column - This column is worth dropping, because it has just two unique values and has little or no significance to our analysis.
offer_type column - This column, just like the previous one, has only two unique values and is therefore not significant to our analysis.
abtest column - This column is also a candidate to be dropped, since the A/B test split is not relevant to our analysis.
no_of_pics column - This column is also a candidate to be dropped.
Both the price and odometer columns have to be cleaned, because they are supposed to be numeric data, not text.
Now, let's drop the columns that need to be dropped.
autos = autos.drop(columns = ["abtest", "no_of_pics", "offer_type", "seller"])#Drop method drops either row or column.
Now, we will have to clean the price and odometer columns and rename them appropriately
autos["price"]= autos["price"].str.replace("$","").str.replace(",","").astype(int)
autos.rename(columns ={"price":"price_dollars"}, inplace = True)
autos["odometer"] = autos["odometer"].str.replace("km","").str.replace(",","").astype(int)
autos.rename(columns = {"odometer":"odometer_km"}, inplace = True)
print(autos.describe(include ="all"))
From the output in the cell above, we can see that we have successfully dropped the columns not relevant to our analysis and also successfully cleaned and renamed the columns that contained unwanted characters.
Now, we will continue to explore the data in these two columns and look for values that don't look right.
First, we will look for outliers in each of these columns and remove them.
print(autos["price_dollars"].unique().shape)
print("\n")
print(autos["price_dollars"].describe())
print("\n")
print(autos["price_dollars"].value_counts().head(20))
Exploration of the price column shows us the following:
There are 2357 unique values in our price column.
Around 30% of prices equal 0; these rows might be considered for removal from the dataset.
The minimum value is 0 and the maximum value is 100,000,000. These values look unrealistically low and unrealistically high, respectively.
Let's explore further.
autos["price_dollars"].value_counts().sort_index(ascending=False).head(20)#Value_counts method counts the number of a
#particular value present in that series. Sort_index, sorts the values in order of acending or descending and aids in
#showing the lowesrt or highest value.
autos["price_dollars"].value_counts().sort_index(ascending=True).head(20)
Findings:
Quite a number of the prices (about 35% or more) are below 30 dollars.
It is impossible to post a car for 0 dollars, so we have to drop the rows with a price of 0 dollars.
Above 350,000 dollars, the prices roughly double from one listing to the next, which suggests the prices above 350,000 dollars are unrealistic.
Considering the fact that eBay is an auction site and prices could start at 1 dollar, we will keep prices from 1 dollar to 350,000 dollars in the dataset and remove the rest, which we consider outliers.
# To remove outliers, we specify the limit of values. One way to do so is:
# df[(df["col"] > x) & (df["col"] < y)], where x and y are the limit values. The easier way to do it is shown below.
autos = autos[autos["price_dollars"].between(1,350000)]
autos["price_dollars"].describe()
From the output above, we see that we have effectively removed the outliers, and now our minimum value is 1 dollar and our maximum value is 350,000 dollars.
autos["odometer_km"].nunique()#This tells us the number of unique values in the column, more like the shape.
autos["odometer_km"].describe()
autos["odometer_km"].value_counts().sort_index(ascending = False)
We can see from the exploration of the odometer_km column that both the minimum and maximum values are realistic. Since the cars are used cars, it is expected that the majority of them will have very high odometer readings. Hence, there are no outliers.
From our dataset, we can see there are five columns with date data:
date_crawled
registration_year
registration_month
last_seen
ad_created
print(autos["date_crawled"].describe())
print(autos["registration_year"].describe())
print(autos["registration_month"].describe())
print(autos["last_seen"].describe())
print(autos["ad_created"].describe())
autos[["date_crawled", "last_seen", "registration_month", "registration_year", "ad_created"]].info()
From our exploration of the 5 columns with date data, we see that 3 are stored as text and 2 are stored in numeric format. Hence, the 3 columns stored as text need extra handling before we can work with their dates (a parsing sketch follows the list below). These columns are:
ad_created
last_seen
date_crawled
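As a side note, these string dates could also be parsed into proper datetime values with pd.to_datetime. The sketch below is illustrative only and is not applied to autos here, since the string-slicing approach used in the following cells expects the columns to stay as text.
# Illustrative only: parse the string date columns into datetime64 values.
# Not applied to `autos` in this project, because later cells slice the strings directly.
parsed = autos[["date_crawled", "last_seen", "ad_created"]].apply(pd.to_datetime)
print(parsed.dtypes)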
Now, let's explore the 3 columns stored as strings.
autos[["date_crawled", "last_seen", "ad_created"]][0:5]
We will notice that the first 10 characters represent the day (e.g. 2016-03-26).
To get a good grasp of the data, we are going to extract the distribution of dates in each column and then sort by index for easier analysis.
autos["date_crawled"].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = False)#the value_counts has a few
#parameters that takes several argument, for example the normalize parameter takes a boolean argument and returns the
#relative frequency of the value count on a particular data point.By default, it is set to false.On the other hand, the
#dropna parameter is set to True, which excludes null or NaN values.But here, we set it to False.2
From the extraction, we have been able to make some findings:
The data was crawled every day from 5th March 2016 to 7th April 2016.
The amount of data crawled on each day is roughly uniform.
autos["last_seen"].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = False)
From our exploration of the last_seen column, we find the following:
The distribution above shows the relative frequency of removed ads per day.
Far more ads were removed in the last 3 days.
The majority of the ads were last seen on April 6th or April 7th.
autos["ad_created"].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = False)
Exploration of the ad_created column enables us to draw some conclusions:
Ads were created from 11th June 2015 to 7th April 2016.
The majority of the ads were created in March and April 2016.
autos["registration_year"].describe()
From the exploration of this column, we see that the minimum and maximum values are unrealistic, so further analysis has to be done.
It is worth noting that a car cannot be registered after its listing was created. Hence, any year in the registration_year column above 2016 is definitely inaccurate.
autos["registration_year"].value_counts().sort_index()
Recall that from our exploration of the last_seen column, we found that its values lie between 5th March and 7th April 2016, so no registration year should exceed 2016.
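Before dropping anything, a quick sketch like the one below can count how many rows actually fall outside a plausible registration-year range; the 1917-2016 bounds mirror the filter applied a few cells further down and are illustrative here.
# How many rows have a registration year outside the plausible 1917-2016 range?
out_of_range = ~autos["registration_year"].between(1917, 2016)
print(out_of_range.sum(), "rows,", round(out_of_range.mean() * 100, 2), "% of the remaining data")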
autos["registration_year"].value_counts(dropna = False).sort_index(ascending = False).head(50)
A further exploration of the registration_year column shows registration years as early as 1000, which is unrealistic, as cars had not been made by then. For the registration year, we will therefore only consider years from 1917 to 2016; all rows carrying a year outside this range will be dropped.
autos = autos[autos["registration_year"].between(1917, 2016)]
autos["registration_year"].describe()
autos.describe()
Now, we will explore each brand of car in our dataset and see the price variation of each.
First, let's see the columns available in our dataset again.
autos.head()
autos["brand"].unique()#returns the various brands in the brand columns.
top_10_brands = autos["brand"].value_counts()[:10]
print(top_10_brands)
top_6_brands = autos["brand"].value_counts()[:6]
print(top_6_brands)
bottom_10_brands = autos["brand"].value_counts()[-10:]
print(bottom_10_brands)
Looking carefully at the top_10_brands and bottom_10_brands lists, we see that the most popular cars in the eBay listings are Volkswagen, BMW and Mercedes-Benz, alongside Audi; many of the remaining brands are foreign (non-German).
Since the top 10 brands account for about 80% of the total listings, we will focus our brand/price analysis on the most popular of these cars.
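As a rough check of that coverage figure, a sketch along these lines could compute the share of listings covered by the top 10 brands; the exact percentage depends on the cleaned data and is not reported in the original cells.
# Share of all listings accounted for by the 10 most common brands (illustrative check).
top_10_share = autos["brand"].value_counts(normalize=True).head(10).sum()
print(round(top_10_share * 100, 1), "% of listings belong to the top 10 brands")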
# We want to use aggregation to explore the top brands in terms of price.
mean_car_price = {}
top_6_brands = autos["brand"].value_counts()[:6].index
for cars in top_6_brands:
    selected_rows = autos[autos["brand"] == cars]
    mean_price = selected_rows["price_dollars"].mean()
    mean_price = int(mean_price)
    mean_car_price[cars] = mean_price
print(mean_car_price)
print("\n")
sorted(mean_car_price.items(), key=lambda x: x[1], reverse=True)
top_6 = autos["brand"].value_counts()[:6]
print(top_6)
From the analysis above, we see that the mean prices of Audi, Mercedes-Benz and BMW are the highest, respectively, while Volkswagen has a moderate mean price.
Previously, we analysed the prices of the top 6 brands and discovered that the more luxurious brands are BMW, Audi and Mercedes-Benz. Now, we will compare the mileage (odometer) of these top 6 brands to see how it fares when compared with price.
mean_car_mileage = {}
top_6_brands = autos["brand"].value_counts()[:6].index
for cars in top_6_brands:
    selected_rows = autos[autos["brand"] == cars]
    mean_mileage = selected_rows["odometer_km"].mean()
    mean_mileage = int(mean_mileage)
    mean_car_mileage[cars] = mean_mileage
print(mean_car_mileage)
sorted(mean_car_mileage.items(), key=lambda x: x[1], reverse=True)
We see that brands such as BMW, Mercedes-Benz, Audi and Volkswagen have high mean mileage. It is also worth noting that the mean mileage of the top 6 brands lies between roughly 117,000 km and 133,000 km.
Now, we will use the mean mileage of these cars alongside the mean prices to see if there is any visible link.
To do this, we have to tie both the mean_car_mileage and mean_car_price dictionaries together into a dataframe. This will make our analysis a lot easier.
# Turning the mean_car_price dict into a Series.
top_6_price = pd.Series(mean_car_price)
top_6_mileage = pd.Series(mean_car_mileage)
top_6_brands = pd.DataFrame((top_6_price), columns = ["mean_price"])  # Converting the Series object to a DataFrame,
# done by setting the columns parameter, which accepts an array-like object specifying the column name.
top_6_brands["mean_car_mileage"] = top_6_mileage
print(top_6_brands)
From our newly created dataframe, we are able to draw some findings:
Mileage doesn't appear to have a close relationship with price (a quick check follows this list).
The most luxurious brands do tend to have somewhat higher mileage, though.
A couple of low-priced brands still have high mileage, e.g. Opel.
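A quick way to quantify that impression is a correlation between the two aggregated columns. The sketch below is only a rough check over six data points, so it should be read with caution.
# Rough check: correlation between mean price and mean mileage across the top 6 brands.
print(top_6_brands["mean_price"].corr(top_6_brands["mean_car_mileage"]))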
brand_model = autos[["brand", "model"]]
brand_model.isnull().sum()
brand_model = brand_model.dropna(axis=0)
brand_model.isnull().sum()
From the cells above, we see that the model column had 2190 missing values, and since we can't work with those, we removed the rows containing them.
brand_model.groupby("brand")["model"].value_counts(dropna = False).sort_values(ascending=False).head()
From this, we can see that the most common brand/model combinations are Volkswagen Golf, BMW 3er, Volkswagen Polo, Opel Corsa and the rest seen above.
autos["odometer_km"].value_counts().sort_index()
autos["odometer_km"].unique()
autos.groupby("odometer_km", as_index = False)["price_dollars"].mean().sort_values("price_dollars", ascending =False).round(0)
We have successfully grouped the odometer readings of our cars with their associated mean prices. Taking a closer look at the table, we can conclude that cars with very high odometer readings have lower prices, except for the cars with a 5,000 km reading, which also have somewhat low prices. This could be because the cars with such low mileage are not popular brands, so buyers don't fancy them.
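One way to probe that explanation is to look at which brands actually appear among the 5,000 km listings. The sketch below is a quick, illustrative check rather than part of the original analysis.
# Which brands dominate the listings with a 5,000 km odometer reading?
low_mileage = autos[autos["odometer_km"] == 5000]
print(low_mileage["brand"].value_counts(normalize=True).head(10))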
autos["unrepaired_damage"].describe()
Before proceeding, we see that this column contains values written in German, so we have to do some data cleaning on it.
mapping_dict = {
    "nein": "No",
    "ja": "Yes"
}
autos["unrepaired_damage"]=autos["unrepaired_damage"].map(mapping_dict)
autos["unrepaired_damage"].describe()#We have succesfully corrected bad values.
#damaged = autos.loc[(autos["unrepaired_damage"] == "Yes"), ["price_dollars"]]
#damaged["price_dollars"].describe()
#damaged = autos.loc[(autos["unrepaired_damage"] == "No"), ["price_dollars"]]
#damaged["price_dollars"].describe()
autos.groupby("unrepaired_damage")["price_dollars"].max()#We can also compare the unrepaired column with the price column
#like so.
The results in the cells above were obtained using two different techniques. For the first two (commented out), we indexed the unrepaired_damage and price columns, selecting the rows where the former equals "Yes" (and then "No").
For the second technique, we used the groupby method of the dataframe and compared the unrepaired_damage column with the price column to obtain the maximum.
We can conclude from this that cars with damage not yet repaired cost less than those without damage.
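Comparing only the maximum prices is a fairly weak test. A sketch comparing the mean and median prices of damaged versus undamaged cars, as below, would support the same conclusion more directly; it is a minimal check rather than part of the original cells.
# Mean and median price of cars with and without unrepaired damage.
print(autos.groupby("unrepaired_damage")["price_dollars"].agg(["mean", "median"]))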