In this project we are going to use a dataset of used cars scraped from the German eBay website.
The data was modified, shortened and uploaded to Kaggle; our version of the dataset consists of 50,000 data points.
The goal of this project is to clean the data and try to answer some interesting questions about the car sales listings.
The project is separated into four parts.
# First, we import the numpy and pandas libraries.
import numpy as np
import pandas as pd
# Using the pandas library we read the file autos.csv and assign it to the variable autos.
# The encoding argument is "Latin-1", because the default encoding ("UTF-8") cannot decode this file.
autos = pd.read_csv("autos.csv", encoding="Latin-1")
autos.info()
autos.head()
Our dataset consists of 50,000 entries and 20 columns.
We can clearly see that the column names use camelCase rather than snake_case. Of the 20 columns, 15 are stored as strings (object) and 5 as integers (int64), i.e. 75% strings and 25% integers.
From the non-null counts we can see that some columns have null values: vehicleType, gearbox, model, fuelType and notRepairedDamage.
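A quick way to quantify these missing values is `DataFrame.isnull().sum()`. A minimal sketch on a hypothetical toy frame (the column names mirror the dataset, but the values are made up for illustration):

```python
import pandas as pd

# Toy frame mimicking a few columns of autos (values are illustrative only)
toy = pd.DataFrame({
    "vehicleType": ["bus", None, "limousine"],
    "gearbox": ["manuell", "automatik", None],
    "price": ["$5,000", "$1,200", "$300"],
})

# Count missing values per column
null_counts = toy.isnull().sum()
print(null_counts)
```

Running the same call on the full autos frame lists every column alongside its missing-value count.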
In this part of the project we will rename the columns from camelCase to snake_case.
#changing column names
autos.columns
autos_copy = autos.copy()
col = autos_copy.columns
#creating a map so that we can change all column names together
mapping_columns = {
'dateCrawled':'date_crawled',
'name':'name',
'seller':'seller',
'offerType':'offer_type',
'price':'price',
'abtest':'abtest',
'vehicleType':'vehicle_type',
'yearOfRegistration':'registration_year',
'gearbox':'gearbox',
'powerPS':'power_ps',
'model':'model',
'odometer':'odometer',
'monthOfRegistration':'registration_month',
'fuelType':'fuel_type',
'brand':'brand',
'notRepairedDamage':'unrepaired_damage',
'dateCreated':'ad_created',
'nrOfPictures':'nr_of_pictures',
'postalCode':'postal_code',
'lastSeen':'last_seen'}
# mapping the old column names to the new ones with the dictionary
autos_copy.columns = pd.Series(col).map(mapping_columns)
# assigning the renamed columns from autos_copy back to autos
autos.columns = autos_copy.columns
autos.columns
autos.describe(include="all") #passing argument include=all so all columns can be included
Now that we have renamed the columns, we can make some observations from the summary statistics of the autos dataset.
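One thing `describe(include="all")` can reveal is columns dominated by a single value, which carry little information. A sketch of how to check for that, on a hypothetical miniature frame (the column names echo the dataset, the values are invented):

```python
import pandas as pd

# Hypothetical miniature frame; "seller" and "nr_of_pictures" stand in for
# columns that the summary statistics suggest are nearly constant
toy = pd.DataFrame({
    "seller": ["privat", "privat", "privat", "privat", "gewerblich"],
    "nr_of_pictures": [0, 0, 0, 0, 0],
})

# Share of the most frequent value per column; values near 1.0 flag
# near-constant columns that are candidates for dropping
for col in toy.columns:
    top_share = toy[col].value_counts(normalize=True).iloc[0]
    print(col, top_share)
```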
autos["price"] = autos["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(int)  # strip currency symbol and separators, convert to int
autos["odometer"] = autos["odometer"].str.replace("km", "", regex=False).str.replace(",", "", regex=False).astype(int)  # strip "km" and separators, convert to int
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)  # rename odometer to keep the km unit in the column name
autos.head()
After converting the odometer and price columns we can calculate the maximum and minimum values for those columns with the max() and min() methods.
max_price = autos["price"].max()
min_price = autos["price"].min()
print("Max price = ",max_price)
print("Min price = ",min_price)
max_odometer_km = autos["odometer_km"].max()
min_odometer_km = autos["odometer_km"].min()
print("Max odometer km = ",max_odometer_km)
print("Min odometer km = ",min_odometer_km)
autos["price"].unique().shape #number of unique values in the price column
autos["price"].describe() #statistics for the price column
sorted_value = autos["price"].value_counts().sort_index(ascending=True).head() #using value counts and sort index to analyse price data
print(sorted_value)
top_sorted_value = autos["price"].value_counts().sort_index(ascending=False).head() #top 5 prices
print(top_sorted_value)
autos = autos[autos["price"].between(1,350000)]
print(autos["price"].describe())
Analysis of the price column:
We can see that a surprisingly large number of listings have a price of 0. A likely explanation is that eBay sellers set very low or zero starting prices to attract more people to their advertisements.
We can also see that the maximum price is $999,999, which is implausibly high and suggests something is wrong with that listing.
Given that eBay is an auction website, a skewed distribution of prices should not surprise us, but we should still eliminate any data that could lead us to bad conclusions.
In the last step we therefore removed rows with a price higher than $350,000 or lower than $1.
Let's first understand how the values in the three string date columns are formatted. These columns all represent full timestamp values, like so:
autos[['date_crawled','ad_created','last_seen']][0:5]
#separating the date string and calculating the distribution of values
autos["date_crawled"].str[:10].value_counts(normalize=True, dropna= False).sort_index(ascending=True)
autos["ad_created"].str[:10].value_counts(normalize=True, dropna= False).sort_index(ascending=True)
autos["last_seen"].str[:10].value_counts(normalize=True, dropna= False).sort_index(ascending=True)
autos["registration_year"].describe()
first_bool = autos["registration_year"] >= 1885
second_bool = autos["registration_year"] <= 2016
combined = first_bool & second_bool
autos = autos[combined] #keep only rows with a plausible registration year
autos["registration_year"].value_counts(normalize=True).head(10)
From the date columns and the registration_year column we can make some observations:
Most of the ads were created in March 2016, which we can see from the distribution of values in the ad_created column. We can also assume that most of the car sales happened in March, since it is the month with the largest share of values in the date_crawled column.
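Instead of slicing date strings by hand, pandas can parse them with `pd.to_datetime`, which gives direct access to calendar fields. A small sketch on hypothetical timestamps in the same format as date_crawled:

```python
import pandas as pd

# Hypothetical timestamps, formatted like the date_crawled column
s = pd.Series(["2016-03-21 17:36:42",
               "2016-03-12 09:10:05",
               "2016-04-01 08:00:00"])

# Parse to datetime, then extract the month for aggregation
dates = pd.to_datetime(s)
month_share = dates.dt.month.value_counts(normalize=True)
print(month_share)
```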
The registration_year column contains some odd data:
It is very strange to see cars registered in the future. Keeping in mind that the first car was built in 1885, we cleaned our data and kept only rows with a registration year between 1885 and 2016.
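The two boolean conditions above can also be expressed with `Series.between`, which makes it easy to count how many rows fall outside the plausible range. A sketch on hypothetical registration years:

```python
import pandas as pd

# Hypothetical registration years, including impossible values
years = pd.Series([2016, 1999, 4500, 1000, 2007])

# between() builds the same combined boolean mask in one call
mask = years.between(1885, 2016)
print(mask.sum(), "rows kept,", (~mask).sum(), "rows dropped")
```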
autos["brand"].value_counts(normalize=True, dropna=False).head(10) #share of listings per brand, top 10
#creating a dictionary with brand/price pairs and calculating the average price per brand
avg_price_for_brand = {}
brand = ['volkswagen','opel','bmw','mercedes_benz','audi','ford']
for b in brand:
    selected_rows = autos[autos["brand"] == b]
    mean = selected_rows["price"].mean()
    avg_price_for_brand[b] = int(mean)
print(avg_price_for_brand)
bmp_series = pd.Series(avg_price_for_brand) #creating new series
pd.DataFrame(bmp_series, columns=["mean_price"]) #creating data frame
#creating a dictionary with brand/mileage pairs and calculating the average mileage per brand
avg_mil_for_brand = {}
for b in brand:
    selected_rows = autos[autos["brand"] == b]
    mean_mileage = selected_rows["odometer_km"].mean()
    avg_mil_for_brand[b] = int(mean_mileage)
print(avg_mil_for_brand)
bmm_series = pd.Series(avg_mil_for_brand)
df2 =pd.DataFrame(bmm_series,columns=["mean_mileage"])
print(df2)
df2["mean_price"] = bmp_series
print(df2)
From the data gathered while exploring price by brand we can make some observations:
First step: find all columns that use German words and translate them into English*
*only the name column has not been changed, because of the complexity of finding the right translation
We have changed all German words in the columns: seller, offer_type, vehicle_type, gearbox, fuel_type, unrepaired_damage
#mapping and updating series
seller = autos["seller"]
map_seller = {
"privat" :"private",
"gewerblich":"commercial"
}
autos["seller"] = pd.Series(seller).map(map_seller)
offer_type = autos["offer_type"]
map_offer = {
"Angebot" :"Offer",
"Gesuch":"Request"
}
autos["offer_type"] = pd.Series(offer_type).map(map_offer)
vehicle_type = autos["vehicle_type"]
map_vehicle = {
"limousine":"limousine",
"kleinwagen":"small car",
"kombi":"combi",
"bus": "bus",
"cabrio":"convertible",
"coupe":"coupe",
"suv":"suv",
"andere":"other"
}
autos["vehicle_type"] = pd.Series(vehicle_type).map(map_vehicle)
gear_box = autos["gearbox"]
map_gearbox = {
"manuell" :"manual",
"automatik":"automatic"
}
autos["gearbox"] = pd.Series(gear_box).map(map_gearbox)
fuel_type = autos["fuel_type"]
map_fuel_type = {
"benzin" :"petrol",
"diesel":"diesel",
"lpg" :"lpg",
"cng":"cng",
"hybrid" :"hybrid",
"andere":"other",
"elektro":"electric",
}
autos["fuel_type"] = pd.Series(fuel_type).map(map_fuel_type)
damage_type = autos["unrepaired_damage"]
map_damage_type = {
"nein" :"no",
"ja":"yes",
}
autos["unrepaired_damage"] = pd.Series(damage_type).map(map_damage_type)
Second step: convert the dates to uniform numeric data, so "2016-03-21" becomes the integer 20160321.
autos["date_crawled"].describe()
#slicing the first 10 characters of the date string and converting them to an integer
autos["date_crawled"] = autos["date_crawled"].str[:10].str.replace("-","").astype(int)
print(autos["date_crawled"].head())
Finding the most common brand/model combinations
From the analysis below we identified that the 5 most common brand/model combinations are:
cols = ["brand","model"]
autos_sel = autos[cols].head()
print(autos_sel)
# grouping the dataset by model and brand
df_brand = autos.groupby(["brand","model"])
# find the count and sort descending
brand_model = df_brand["date_crawled"].count().sort_values(ascending=False)
print(brand_model[:5])
Third step: split odometer_km into groups, and use aggregation to see whether average prices follow any pattern based on the mileage.
autos["odometer_groups"] = pd.qcut(autos["odometer_km"], q=10, duplicates="drop")
odometri = autos["odometer_groups"].unique()
print(odometri)
avg_price_for_groups_km = {}
for od in odometri:
    sel_rows = autos[autos["odometer_groups"] == od]
    mean_km_price_for_groups = sel_rows["price"].mean()
    avg_price_for_groups_km[od] = int(mean_km_price_for_groups)
odo_series = pd.Series(avg_price_for_groups_km)
df3 =pd.DataFrame(odo_series,columns=["mean_price"])
print(df3)
The odometer_km column has been split into 4 groups: 5,000–60,000, 60,000–90,000, 90,000–125,000 and 125,000–150,000 km.
We can clearly see that average prices follow a pattern based on the mileage: cars with lower mileage are more expensive.
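The dictionary loop above can be collapsed into a single `groupby` aggregation. A sketch on a hypothetical toy frame (mileage and price values are made up):

```python
import pandas as pd

# Hypothetical mileage/price pairs
toy = pd.DataFrame({
    "odometer_km": [50000, 50000, 100000, 150000, 150000],
    "price": [9000, 11000, 6000, 3000, 5000],
})

# groupby + mean computes the per-group average price in one step
mean_price = toy.groupby("odometer_km")["price"].mean()
print(mean_price)
```

The same pattern applied to autos, grouping on odometer_groups, would reproduce the dictionary built in the loop.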
Fourth step: how much cheaper are cars with damage than their non-damaged counterparts?
autos["unrepaired_damage"].value_counts()
repaired_unrepaired_price = {}
damage = autos["unrepaired_damage"].dropna().unique() #drop NaN so the comparison only covers listings that answered the question
print(damage)
for d in damage:
    selected_rows = autos[autos["unrepaired_damage"] == d]
    mean = selected_rows["price"].mean()
    repaired_unrepaired_price[d] = float(mean)
print(repaired_unrepaired_price)
From the analysis we can see that cars without unrepaired damage have a considerably higher average price than damaged cars.
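Here too, `groupby` offers a shortcut: it skips NaN keys by default, so the manual dictionary loop and NaN handling are unnecessary. A sketch on a hypothetical toy frame (all values invented):

```python
import pandas as pd

# Hypothetical prices; None stands for listings that did not answer
toy = pd.DataFrame({
    "unrepaired_damage": ["no", "no", "yes", "yes", None],
    "price": [8000, 12000, 3000, 5000, 7000],
})

# groupby drops the NaN group automatically
mean_by_damage = toy.groupby("unrepaired_damage")["price"].mean()
print(mean_by_damage)
```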
For further analysis, we could also look into the name column, which contains information that is not captured by the other columns:
autos["name"].head(20)
In this project we answered our questions and tried to give readers a better understanding of why the data cleaning process is the most important part of getting correct results and avoiding bad observations.