The aim of this project is to clean a large dataset and analyze the information using pandas
and numpy
libraries of python.
The dataset comes from eBay Kleinanzeigen classified section on ebay, you can download the full version in the this link.
The data dictionary provided with data is as follows:
dataCrawled
- When this ad was first crawled. All field-values are taken from this date.name
- Name of the car.seller
- Whether the seller is private or a dealer.offerType
- The type of listingprice
- The price on the ad to sell the car.abtest
- Whether the listing is included in an A/B test.vehicleType
- The vehicle Type.yearOfRegistration
- The year in which the car was first registered.gearbox
- The transmission type.powerPS
- The power of the car in PS.model
- he car model name.kilometer
- How many kilometers the car has driven.monthOfRegistration
- The month in which the car was first registered.fuelType
- What type of fuel the car uses.brand
- The brand of the car.notRepairedDamage
- If the car has a damage which is not yet repaired.dateCreated
- The date on which the eBay listing was created.nrOfPictures
- The number of pictures in the ad.postalCode
- The postal code for the location of the vehicle.lastSeenOnline
- When the crawler saw this ad last online.import pandas as pd
import numpy as np
# To read the csv file we use the .read_csv method
autos = pd.read_csv('autos(1).csv', encoding = 'Latin-1')
autos
At first glance we have the following information of the dataset using the .info method:
fueltype
and nonreparied
damage have null valuesautos.info()
autos.head()
We are going to convert the names case so we can have a more workable case for the dataset.
columns = autos.columns
columns
new_names= ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer','registration_month','fuel_type','brand',
'unreparied_damage','ad_created', 'nr_of_pictures', 'postal_code',
'last_seen']
autos.columns = new_names
autos.columns
Now we can see the changes in the columns looking the results with the .head
method
autos.head()
In this step we are going to explore the data in detail, first using .describe
method to see some statistical results and look what kind of objects we have in the data.
autos.describe(include='all')
nr_of_pictures
have mostly 0 value, so we can remove the column without afecting the dataregistration_year
have a min value of 1000, it's really difficult to find a car of the year 1000 now a days, so we need to check that.registration_month
have a min value of 0, it's not a valid month so we need to check that toopower_ps
column have min 0 value, is really dificult to find a car with 0 power, so we need to check that.price
and odometer
are registered as strings, we need to convert to number so we can use the data.seller
, offer_type
and nr_of_pictures
have unique values, so we can drop this columns without affecting the data.So we need to drop the Columns that have the same values, we are not going to use the columns at this moment.
autos = autos.drop(["seller","offer_type","nr_of_pictures"], axis=1)
Now we need to clean the data in the 'price'
column, and also convert this data in integers for further calculus, the steps are:
astype
method..head()
auto_desc = autos["price"].describe()
auto_desc
# We need to replace the $ symbol and , separator with str.repalce
autos["price"] = autos["price"].str.replace("$","").str.replace(",","")
# Once we replace we can convert to int with .astype method
autos["price"] = autos["price"].astype(int)
autos["price"].head()
The steps are similar with prices, but in this case we are going to use .rename()
method at the to convert the name from odometer
to odometer_km
, so the person that will use the dataset could know which units use the odometer.
autos["odometer"].value_counts()
# We need to replace the $ symbol and , separator with str.repalce
autos["odometer"] = autos["odometer"].str.replace("km","").str.replace(",","")
# Once we replace we can convert to int with .astype method
autos["odometer"] = autos["odometer"].astype(int)
autos["odometer"].head()
#With the rename method we need to indicate the previous name ant the new name
autos.rename(columns={"odometer":"odometer_km"}, inplace=True)
autos.describe(include = 'all')
Now that we have the data converted to int
type, we need to explore the outliers, some data that have no sense or are not adjusted to our goal. The steps are the same for odometer_km
and price
.
.unique().shape
methodvalue_counts().sort_index()
method..between()
method to indicate the range of values that we are going to useautos["odometer_km"].unique().shape
autos["odometer_km"].describe()
autos["odometer_km"].value_counts().sort_index(ascending=False)
# With this we obtain the number of unique values
autos["price"].unique().shape
autos["price"].describe()
# First we need to see the extreme superior values, prices above 1 million USD must be cleaned
autos["price"].value_counts().sort_index(ascending=False).head(20)
# Using `sort_index()` we can sort using the unique values, now we see the low prices, 0-100USD are a very low price
autos["price"].value_counts().sort_index(ascending=True).head(20)
#Using only .value_counts() we can see the values by frequency
autos["price"].value_counts().head(50)
With this info is better to start from 100USD than from 1USD, is difficult to asses if below this values is realistic to purchase a car, we need more info about how this market works.
#With between method we remove the outliers and reduce the dataset.
autos = autos[autos["price"].between(100,351000)]
autos["price"].describe()
We have 5 types of data from dates:
date_crawled
: added by the crawlerlast_seen
: added by the crawlerad_created
: from the websiteregistration_month
: from the websiteregistration_year
: from the websiteWe need to convert the dates that are identified as strings to date format.
#First we see how is written the dates
autos[['date_crawled','ad_created','last_seen']][0:5]
# We need to remove the hour and leave only the dates using .str method
autos['date_crawled'].str[:10].head(10)
#Now we use value_counts() to see the most frequent dates
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).head(10)
# We need to identify the first dates of crawling using .sort_index()
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index().head(10)
#Then we identify the last dates using the same method
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index().tail(10)
#The same previous process we will use with the other dates
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).head(10)
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index().head(10)
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index().tail(10)
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).head(10)
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index().head(10)
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index().tail(10)
From the previous exercise we identified:
With this info is clear that we need to review the registration year of the cars. Similar to previous step with the price, we need to fix a range and exclude some data.
autos["registration_year"].describe()
Looking the data we find that we have cars with registered years before 1900 (when we don't even have a clue of what is a Mercedes Benz) and years before 2016, the last year when we retrieved the data. So We need to explore the data set and removes this years.
autos["registration_year"].value_counts(normalize=True)
autos["registration_year"].value_counts().sort_index().head(20)
autos["registration_year"].value_counts().sort_index().tail(20)
autos = autos[autos["registration_year"].between(1900,2019)]
autos["registration_year"].describe()
autos["registration_year"].value_counts(normalize=True)
We prefer to mantain until 2019: could a mistake of typo not a deliberate error.
Now we need to explore which are the most popular brands in the used cars, and explore only this data.
top_brand_index
autos["brand"].value_counts(normalize=True).head(10)
brand_counts = autos["brand"].value_counts(normalize=True).head(10)
# Series.index attribute to access the labels
top_brand = brand_counts[brand_counts > 0.05]
print(top_brand)
print('\n')
top_brand_index = brand_counts[brand_counts > 0.05].index
print(top_brand_index)
Once we identified the top brands, we are going to use the following information:
mean()
method the mean value for brand..Series()
method we are going to convert the dictionary into a single dataframe .dataframe()
method.mileage
and mean_price
we are going to do a single dataframe, combining the tables.brand_prices_dict = {}
for b in top_brand_index:
selected_rows = autos[autos["brand"] == b]
# Calculate the mean average price for just those rows
mean = selected_rows["price"].mean()
# Assign the mean value to the dictionary, using the
# brand name as the key
brand_prices_dict[b] = mean
brand_prices_dict
# pd. Series convert our dictionary into a single dataframe
brand_prices = pd.Series(brand_prices_dict)
brand_prices
# We can convert the series to a single Dataframe with column
brand_prices = pd.DataFrame(brand_prices, columns=['mean_price'])
brand_prices
mileage_dict = {}
for b in top_brand_index:
selected_rows = autos[autos["brand"] == b]
# Calculate the mean average price for just those rows
mean = selected_rows["odometer_km"].mean()
# Assign the mean value to the dictionary, using the
# brand name as the key
mileage_dict[b] = mean
mileage_dict
# pd. Series convert our dictionary into a single dataframe
mileage = pd.Series(mileage_dict)
mileage
mileage = pd.DataFrame(mileage, columns = ['mileage'])
mileage
mileage["mean_prices"] = brand_prices
mileage
mileage["mileage"] = mileage["mileage"].astype(int)
mileage["mean_prices"] = mileage["mean_prices"].astype(int)
mileage
The principal findings here are:
Several columns still use the same German words, so we can find them and translate to english.
The steps are the same in previous methods:
str.replace()
method to replace the words.autos.head()
autos["vehicle_type"].unique()
autos["fuel_type"].unique()
autos["vehicle_type"] = autos["vehicle_type"].str.replace("kleinwagen","small_car").str.replace("kombi","station_wagon").str.replace("cabrio","convertible").str.replace("andere","other")
autos["vehicle_type"].unique()
autos["fuel_type"] = autos["fuel_type"].str.replace("benzin","petrol").str.replace("elektro","electro")
autos["fuel_type"].unique()
In this part we are going to convert a string to int.
str.split
to select only the first part of the string.astype
to convert the string to integer.autos["date_crawled"] = (autos["date_crawled"]
.str.split()
.str[0]
)
autos["date_crawled"].head()
autos["date_crawled"] = autos["date_crawled"].str.replace("-","")
# Once we replace we can convert to int with .astype method
autos["date_crawled"] = autos["date_crawled"].astype(int)
autos["date_crawled"].head()
In this part we are going to identify the most offered models of each brand.
These are the steps:
split
method to select the second word of the name, that word is the modelautos["name_model"] = autos["name"].str.replace("_"," ")
autos["name_model"] = (autos["name_model"]
.str.split()
.str[1]
)
autos["name_model"].head()
autos["name_model"].head(10)
model_counts = autos["name_model"].value_counts(normalize=True).head(10)
# Series.index attribute to access the labels
top_model = model_counts[model_counts > 0.01]
top_model.head(20)
We are now tray to find how much change the mean price of a used cars if the mileage increase.
For this we are going to do the following:
.mean()
methododometer_to_50k = autos[autos["odometer_km"] <= 50000]
odometer_to_100k = autos[(autos["odometer_km"] > 50000) & (autos["odometer_km"] <= 100000)]
odometer_to_150k = autos[(autos["odometer_km"] > 100000) & (autos["odometer_km"] <= 150000)]
Price1 = round(odometer_to_50k["price"].mean(),0)
Price2 = round (odometer_to_100k["price"].mean(),0)
Price3 = round (odometer_to_150k["price"].mean(),0)
print("Average price with odometer from 0 to 50k km:", Price1)
print("Average price with odometer from 50 to 100k km:", Price2)
print("Average price with odometer from 100 to 150k km:", Price3)
Finally we are going to find the difference on price between a car with no damage and with damage.
"unrepaired_damage"
autos["unreparied_damage"].unique()
no_damage = autos[autos["unreparied_damage"] == 'nein']
damage = autos[autos["unreparied_damage"] == 'ja']
Price_1 = round(no_damage["price"].mean())
Price_2 = round(damage["price"].mean())
print("Average no damage:", Price_1)
print("Average with damage:", Price_2)
ratio_damage = 7099/2250
ratio_damage
We resume the main findings of our data here: