Our dataset contains data about used cars offered for sale on eBay Kleinanzeigen, a classifieds section of the German eBay website. This dataset was originally scraped and uploaded to Kaggle by user orgesleka. The original dataset isn't available on Kaggle anymore, but you can find it here. The goal of my work is to analyze this dataset and examine general trends, if there are any.
# Import required modules - pandas and numpy
import pandas as pd
import numpy as np
# Read the 'autos.csv' file as 'autos'
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
# Print first five rows
print(autos.head())
# Get general information about the dataset
autos.info()
# Check that the dataset doesn't have an index column
autos.index
As we see, the column names use camelCase instead of Python's preferred snake_case. Get the column names:
autos.columns
and manually rename them to snake_case by calling the rename() function; also rename 'price' to 'price_usd' and 'odometer' to 'odometer_km':
autos.rename({'dateCrawled':'date_crawled', 'offerType':'offer_type', 'price':'price_usd',
'vehicleType':'vehicle_type', 'yearOfRegistration':'registration_year', 'powerPS':'power_ps',
'odometer':'odometer_km', 'monthOfRegistration':'registration_month', 'fuelType':'fuel_type',
'notRepairedDamage':'unrepaired_damage', 'dateCreated':'ad_created',
'nrOfPictures':'nr_of_pictures', 'postalCode':'postal_code',
'lastSeen':'last_seen'}, axis=1, inplace=True)
and check the result:
autos.head()
Let's look at the data in our dataset with the describe() function.
autos.describe(include = 'all')
As we see, the 'seller', 'offer_type' and 'nr_of_pictures' columns carry almost no useful information, and the price and odometer values are stored as text. Let's estimate the range and format of values in the 'odometer_km', 'price_usd' and 'registration_year' columns to determine the conversion criteria.
s = autos['odometer_km'].unique()
print("odometer_km", s, s.shape, sep = "\n")
p = autos['price_usd'].unique()
print("price_usd", p, p.shape, sep = "\n")
y = autos['registration_year'].unique()
print("registration_year", y, y.shape, sep = "\n")
Let's begin cleaning the dataset: drop the uninformative columns, then convert 'price_usd' and 'odometer_km' to integers.
autos.drop(columns = ['nr_of_pictures','seller','offer_type'], inplace=True)
autos.info()
# Strip the '$' sign and thousands separators, then convert price to integer
autos['price_usd'] = (autos['price_usd'].str.replace('$', '', regex=False)
                      .str.replace(',', '', regex=False).astype('int64'))
# Strip the 'km' suffix and thousands separators, then convert mileage to integer
autos['odometer_km'] = (autos['odometer_km'].str.replace('km', '', regex=False)
                        .str.replace(',', '', regex=False).astype('int64'))
autos[['odometer_km', 'price_usd', 'registration_year', 'power_ps' ]].describe()
Let's analyze these values. First, calculate the IQR (interquartile range) for each of them:
iqr_reg_year = autos['registration_year'].quantile(0.75) - autos['registration_year'].quantile(0.25)
print('IQR reg_year =', iqr_reg_year)
iqr_power_ps = autos['power_ps'].quantile(0.75) - autos['power_ps'].quantile(0.25)
print('IQR power_ps =', iqr_power_ps)
iqr_price_usd = autos['price_usd'].quantile(0.75) - autos['price_usd'].quantile(0.25)
print('IQR price_usd =', iqr_price_usd)
Remember the general rule about outlier values (Tukey's fences): a value is considered an outlier if it is below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
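This rule is easy to wrap in a small reusable helper; a minimal sketch (the tukey_fences function below is my own illustration, not part of the original analysis):
# Tukey's fences for a numeric column; the helper name is illustrative
def tukey_fences(series):
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

for col in ['registration_year', 'power_ps', 'price_usd']:
    low, high = tukey_fences(autos[col])
    print(col, low, high)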
Before estimating the high outlier limit for years, look at the max and min values of the 'date_crawled' and 'ad_created' columns to understand when the data was published:
ad_created_max = max(pd.to_datetime(autos['ad_created']))
ad_created_min = min(pd.to_datetime(autos['ad_created']))
print("ad_created_max = ", ad_created_max, ", ad_created_min = ", ad_created_min )
date_crawled_max = max(pd.to_datetime(autos['date_crawled']))
date_crawled_min = min(pd.to_datetime(autos['date_crawled']))
print("date_crawled_max = ",date_crawled_max, ", date_crawled_min", date_crawled_min )
Define high and low outlier limits for year, power and price:
1. Define the low and high outlier limits for 'registration_year'.
For the low limit use the formula (Q1 = 1999, IQR = 9):
year_low_limit = 1999 - 1.5 * 9 = 1985.5, rounded down to 1985
To define the high limit, recall the publication period of the advertisements in the dataset (see above): June 2015 to April 2016. Automakers write the real date of manufacture in the vehicle registration certificate, hence the high outlier limit can't be greater than 2016.
year_high_limit = 2016
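As a quick sanity check, we can count how many ads claim a registration year later than the crawl period (a one-line sketch):
# Ads whose registration year is later than the last crawl year
print((autos['registration_year'] > 2016).sum())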
2. Define the low and high outlier limits for 'power_ps'.
If we formally applied the formula we would get -50 PS, and the 'power_ps' column also contains zero values. Extract the unique values from the 'power_ps' column and sort them:
power = sorted(autos['power_ps'].unique())
print(power[0:10])
Assign the lowest value that is not zero (and is above the formal limit of -50 PS):
power_low_limit = 1
For the high limit use the formula (Q3 = 150, IQR = 80):
power_high_limit = 150 + 80 * 1.5 = 270
NB: this cap treats some genuinely powerful cars as outliers.
3. Define the low and high outlier limits for 'price_usd'.
If we used the formula we would get a low outlier limit of about -8,000 USD. Instead, use the real minimal value that is not zero:
price = sorted(autos['price_usd'].unique())
print(price[0:10])
so define:
price_low_limit = 1
For the high outlier limit use the formula (Q3 = 7200, IQR = 6100):
price_high_limit = 7200 + 6100 * 1.5 = 16350
We have defined the low and high limits for the necessary values; now let's modify the autos dataset. Create a copy of the current autos DataFrame named 'autos_raw_copy', then extract the data without outliers in several steps. Print the summary statistics of autos_raw_copy for later comparison.
autos_raw_copy = autos.copy()
autos_raw_copy[['odometer_km', 'price_usd', 'registration_year', 'power_ps' ]].describe()
Clean 'registration_year':
autos= autos[autos['registration_year'].between(1985, 2016)]
autos[['odometer_km', 'price_usd', 'registration_year', 'power_ps' ]].describe()
As we see, 2,785 records (5.57% of the original) are outliers by registration year.
Clean 'power_ps':
autos= autos[autos['power_ps'].between(1, 270)]
autos[['odometer_km', 'price_usd', 'registration_year', 'power_ps' ]].describe()
As we see, 6,170 records (12.34% of the original) are outliers by power_ps.
Clean 'price_usd':
autos= autos[autos['price_usd'].between(1, 16350)]
autos[['odometer_km', 'price_usd', 'registration_year', 'power_ps' ]].describe()
As we see, 3,559 records (7.12% of the original) are outliers by price_usd.
As a result, 12,514 records (25.03% of the original) are outliers by year, power and price.
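These figures can be reproduced directly from the row counts; a minimal sketch comparing the cleaned dataset against the saved copy:
# Compare row counts before and after outlier removal
removed = len(autos_raw_copy) - len(autos)
print(removed, 'records removed,',
      round(100 * removed / len(autos_raw_copy), 2), '% of the original')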
Check the cleaned dataset:
autos.info()
Formally, we could also remove the records that contain nulls in other columns, but given the free-form style of sale advertisements on eBay Kleinanzeigen this would be redundant: we already have a reliable dataset in which registration year, price and power contain non-zero values. Let's think about which values have real significance for further analysis.
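To see how many nulls actually remain in each column, a quick check (sketch):
# Count remaining null values per column, most affected first
print(autos.isnull().sum().sort_values(ascending=False))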
**Summary: we will analyze the following values:** ad creation date, registration year, brand, price and power.
# Extract datetime values
year_month_dt = pd.to_datetime(autos['ad_created'])
# Convert to 'Month, Year' strings
year_month_str = year_month_dt.dt.strftime('%B, %Y')
# Count ads per 'Month, Year' with a dictionary
month_year_dic = {}
for value in year_month_str:
    if value not in month_year_dic:
        month_year_dic[value] = 1
    else:
        month_year_dic[value] += 1
# Convert the dictionary to lists
month_year = []
values = []
for k, v in month_year_dic.items():
    month_year.append(k)
    values.append(v)
# Convert the lists to a dataframe
month_year_df = pd.DataFrame(list(zip(month_year, values)),
                             columns=['Month, Year', 'Ad numbers'])
# Calculate sum of column Ad numbers
total = month_year_df['Ad numbers'].sum()
# Add new column Percent
month_year_df['Percents %'] = month_year_df['Ad numbers'] / total
# Sort by descending share
month_year_df.sort_values(by=['Percents %'], ascending=False, inplace=True)
# Format output for dataframe
month_year_df.style.format({"Ad numbers":"{:,}", "Percents %":"{:.3%}"})
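As an aside, pandas can build the same frequency table without a manual dictionary; an equivalent sketch using value_counts(normalize=True):
# Equivalent frequency table using pandas built-ins
ad_share = pd.to_datetime(autos['ad_created']).dt.strftime('%B, %Y').value_counts(normalize=True)
print(ad_share)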
Remember that ad_created_max = 07 April 2016 and ad_created_min = 11 June 2015. As we can see, on average a car takes no more than two months to sell.
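This can be cross-checked by comparing 'ad_created' with 'last_seen' (a rough sketch; 'last_seen' is only the last crawl date, so it merely approximates the sale date):
# Approximate listing lifetime as the gap between creation and last crawl
lifetime = pd.to_datetime(autos['last_seen']) - pd.to_datetime(autos['ad_created'])
print(lifetime.mean())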
# Extract values from column 'registration_year'
reg_year = autos['registration_year']
# Count ads per registration year with a dictionary
reg_year_dic = {}
for value in reg_year:
    if value not in reg_year_dic:
        reg_year_dic[value] = 1
    else:
        reg_year_dic[value] += 1
year = []
values = []
for k, v in reg_year_dic.items():
    year.append(k)
    values.append(v)
# Convert the lists to a dataframe
year_df = pd.DataFrame(list(zip(year, values)),
                       columns=['Year reg', 'Ad numbers'])
# Calculate sum of column Ad numbers
total = year_df['Ad numbers'].sum()
# Sort by descending
year_df.sort_values(by = ['Ad numbers'], inplace = True, ascending=False)
# Add new column Percent
year_df['Percents %'] = year_df['Ad numbers'] / total
year_df.style.format({"Ad numbers":"{:,}", "Percents %":"{:.3%}"})
As we see, 76.18% of the cars offered for sale were made between 1997 and 2009.
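This share can be verified with a one-line check (sketch):
# Fraction of ads with a registration year between 1997 and 2009 inclusive
print(autos['registration_year'].between(1997, 2009).mean())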
Get the list of the top 15 brands:
autos['brand'].value_counts().head(15)
As we see, most of the cars offered for sale are made by the big German car trinity: VAG (which also owns the Audi, SEAT and Škoda marques), BMW and Mercedes-Benz.
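To quantify their combined share, we can sum over those brands; a sketch (the brand labels below are my assumptions about how they are spelled in this dataset, so check them against autos['brand'].unique() first):
# Combined share of VAG brands, BMW and Mercedes-Benz (labels assumed)
german_trinity = ['volkswagen', 'audi', 'seat', 'skoda', 'bmw', 'mercedes_benz']
print(autos['brand'].isin(german_trinity).mean())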
As we know, the mileage column contains discrete and, in my opinion, artificial values, so instead I examine the mean price, power and year of make for these brands. This time I don't use a dictionary, but apply the same idea to create a small dataframe. Let's begin:
# Create a list of lists: each brand with its mean price, power and make year
aggregate = []
for brand in autos['brand'].value_counts().head(15).index:
    row_list = []
    row_list.append(brand)
    row_list.append(int(round(autos[autos['brand'] == brand]
                              ['price_usd'].mean(), 0)))
    row_list.append(int(round(autos[autos['brand'] == brand]
                              ['power_ps'].mean(), 0)))
    row_list.append(int(round(autos[autos['brand'] == brand]
                              ['registration_year'].mean(), 0)))
    aggregate.append(row_list)
# Convert the list of lists to a dataframe
aggregate_df = pd.DataFrame.from_records(aggregate)
# Assign column names
aggregate_df.columns = ["Brand", "Price", "Power PS", "Made year"]
# Sort by price, descending
aggregate_df.sort_values(by=['Price'], ascending=False, inplace=True)
# Apply a custom format to the price
aggregate_df.style.format({"Price":"{:,}"})
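For comparison, the same table can be produced with a single groupby; an equivalent sketch:
# Equivalent aggregation with groupby: mean price, power and year per brand
top_brands = autos['brand'].value_counts().head(15).index
means = (autos[autos['brand'].isin(top_brands)]
         .groupby('brand')[['price_usd', 'power_ps', 'registration_year']]
         .mean().round(0).astype(int)
         .sort_values('price_usd', ascending=False))
print(means)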
We see the dispersion by price: the Audi, BMW and Mercedes-Benz brands are the most expensive (with almost the same years of make). Let's see the dispersion by power:
aggregate_df.sort_values(by=['Power PS'], ascending=False, inplace=True)
# Apply a custom format to the price
aggregate_df.style.format({"Price":"{:,}"})
As we see, the Audi, BMW and Mercedes-Benz brands have the most powerful engines (with almost the same years of make). Let's see the dispersion by year of make:
aggregate_df.sort_values(by=['Made year'], ascending=False, inplace=True)
# Apply a custom format to the price
aggregate_df.style.format({"Price":"{:,}"})
As we see, the Škoda, Smart and Citroën brands have the youngest cars.
Created on Jan 27, 2021
@author: Vadim Maklakov, used some ideas from public Internet resources.
© 3-clause BSD License
Software environment:
Debian 10.7
Python 3.8.7
Required preinstalled Python modules:
csv
numpy
pandas