Our dataset contains data about used cars offered for sale on eBay Kleinanzeigen, a classifieds section of the German eBay website. This dataset was originally scraped and uploaded to Kaggle by user orgesleka. The original dataset isn't available on Kaggle anymore, but you can find it here. The goal of my work is to analyze this dataset and examine general trends, if there are any.
# Import required modules - pandas and numpy
import pandas as pd
import numpy as np
# Read the 'autos.csv' file as 'autos'
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
# Print first five rows
print(autos.head())
# Get general information about the dataset
autos.info()
# Check that the dataset doesn't have an index column
autos.index
As we see, the column names use camelCase instead of Python's preferred snake_case. Get the column names:
autos.columns
and manually rename them to snake_case by calling the rename() function; also rename 'price' to 'price_usd' and 'odometer' to 'odometer_km':
autos.rename({'dateCrawled':'date_crawled', 'offerType':'offer_type', 'price':'price_usd',
'vehicleType':'vehicle_type', 'yearOfRegistration':'registration_year', 'powerPS':'power_ps',
'odometer':'odometer_km', 'monthOfRegistration':'registration_month', 'fuelType':'fuel_type',
'notRepairedDamage':'unrepaired_damage', 'dateCreated':'ad_created',
'nrOfPictures':'nr_of_pictures', 'postalCode':'postal_code',
'lastSeen':'last_seen'}, axis=1, inplace=True)
and check the result:
autos.head()
Let's look at the data in our dataset with the describe() function.
autos.describe(include = 'all')
As we see, the 'seller', 'offer_type' and 'nr_of_pictures' columns carry almost no useful information, and the price and odometer values are stored as text. Let's estimate the range and format of values in the 'odometer_km', 'price_usd' and 'registration_year' columns to determine the conversion criteria.
s = autos['odometer_km'].unique()
print("odometer_km", s, s.shape, sep = "\n")
p = autos['price_usd'].unique()
print("price_usd", p, p.shape, sep = "\n")
y = autos['registration_year'].unique()
print("registration_year", y, y.shape, sep = "\n")
Let's begin cleaning the dataset: drop the uninformative columns, then convert 'price_usd' and 'odometer_km' to integers.
autos.drop(columns = ['nr_of_pictures','seller','offer_type'], inplace=True)
autos.info()
# Strip the '$' sign and thousands separators, then convert price to integer
autos['price_usd'] = (autos['price_usd'].str.replace('$', '', regex=False)
                      .str.replace(',', '', regex=False).astype('int64'))
# Strip the 'km' suffix and thousands separators, then convert mileage to integer
autos['odometer_km'] = (autos['odometer_km'].str.replace('km', '', regex=False)
                        .str.replace(',', '', regex=False).astype('int64'))
autos[['odometer_km', 'price_usd', 'registration_year', 'power_ps' ]].describe()
Let's analyze these values. First, calculate the IQR (interquartile range) for each of them:
iqr_reg_year = autos['registration_year'].quantile(0.75) - autos['registration_year'].quantile(0.25)
print('IQR reg_year =', iqr_reg_year)
iqr_power_ps = autos['power_ps'].quantile(0.75) - autos['power_ps'].quantile(0.25)
print('IQR power_ps =', iqr_power_ps)
iqr_price_usd = autos['price_usd'].quantile(0.75) - autos['price_usd'].quantile(0.25)
print('IQR price_usd =', iqr_price_usd)
Remember the general rule about outlier values (Tukey's fences): a value is considered an outlier if it is below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
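This rule is easy to wrap in a small reusable helper; a minimal sketch (the tukey_fences function below is my own illustration, not part of the original analysis):
# Tukey's fences for a numeric column; the helper name is illustrative
def tukey_fences(series):
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

for col in ['registration_year', 'power_ps', 'price_usd']:
    low, high = tukey_fences(autos[col])
    print(col, low, high)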
Before estimating the high outlier limit for years, look at the max and min values of the 'date_crawled' and 'ad_created' columns to understand when the data was published:
ad_created_max = max(pd.to_datetime(autos['ad_created']))
ad_created_min = min(pd.to_datetime(autos['ad_created']))
print("ad_created_max = ", ad_created_max, ", ad_created_min = ", ad_created_min )
date_crawled_max = max(pd.to_datetime(autos['date_crawled']))
date_crawled_min = min(pd.to_datetime(autos['date_crawled']))
print("date_crawled_max = ",date_crawled_max, ", date_crawled_min", date_crawled_min )
Define high and low outlier limits for year, power and price:
1. Define the low and high outlier limits for 'registration_year'.
For the low limit use the formula (Q1 = 1999, IQR = 9):
year_low_limit = 1999 - 1.5 * 9 = 1985.5, rounded down to 1985
To define the high limit, recall the publication period of the advertisements in the dataset (see above): June 2015 to April 2016. Automakers write the real date of manufacture in the vehicle registration certificate, hence the high outlier limit can't be greater than 2016.
year_high_limit = 2016
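As a quick sanity check, we can count how many ads claim a registration year later than the crawl period (a one-line sketch):
# Ads whose registration year is later than the last crawl year
print((autos['registration_year'] > 2016).sum())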
2. Define the low and high outlier limits for 'power_ps'.
If we formally applied the formula we would get -50 PS, and the 'power_ps' column also contains zero values. Extract the unique values from the 'power_ps' column and sort them:
power = sorted(autos['power_ps'].unique())
print(power[0:10])
Assign the lowest value that is not zero (and is above the formal limit of -50 PS):
power_low_limit = 1
For the high limit use the formula (Q3 = 150, IQR = 80):
power_high_limit = 150 + 80 * 1.5 = 270
NB: this cap treats some genuinely powerful cars as outliers.
3. Define the low and high outlier limits for 'price_usd'.
If we used the formula we would get a low outlier limit of about -8,000 USD. Instead, use the real minimal value that is not zero:
price = sorted(autos['price_usd'].unique())
print(price[0:10])
so define:
price_low_limit = 1
For the high outlier limit use the formula (Q3 = 7200, IQR = 6100):
price_high_limit = 7200 + 6100 * 1.5 = 16350
We have defined the low and high limits for the necessary values; now let's modify the autos dataset. Create a copy of the current autos DataFrame named 'autos_raw_copy', then extract the data without outliers in several steps. Print the summary statistics of autos_raw_copy for later comparison.
autos_raw_copy = autos.copy()
autos_raw_copy[['odometer_km', 'price_usd', 'registration_year', 'power_ps' ]].describe()
Clean 'registration_year':
autos= autos[autos['registration_year'].between(1985, 2016)]
autos[['odometer_km', 'price_usd', 'registration_year', 'power_ps' ]].describe()
As we see, 2,785 records (5.57% of the original) are outliers by registration year.
Clean 'power_ps':
autos= autos[autos['power_ps'].between(1, 270)]
autos[['odometer_km', 'price_usd', 'registration_year', 'power_ps' ]].describe()
As we see, 6,170 records (12.34% of the original) are outliers by power_ps.
Clean 'price_usd':
autos= autos[autos['price_usd'].between(1, 16350)]
autos[['odometer_km', 'price_usd', 'registration_year', 'power_ps' ]].describe()
As we see, 3,559 records (7.12% of the original) are outliers by price_usd.
As a result, 12,514 records (25.03% of the original) are outliers by year, power and price.
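These figures can be reproduced directly from the row counts; a minimal sketch comparing the cleaned dataset against the saved copy:
# Compare row counts before and after outlier removal
removed = len(autos_raw_copy) - len(autos)
print(removed, 'records removed,',
      round(100 * removed / len(autos_raw_copy), 2), '% of the original')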
Check the cleaned dataset:
autos.info()
Formally, we could also remove the records that contain nulls in other columns, but given the free-form style of sale advertisements on eBay Kleinanzeigen this would be redundant: we already have a reliable dataset in which registration year, price and power contain non-zero values. Let's think about which values have real significance for further analysis.
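To see how many nulls actually remain in each column, a quick check (sketch):
# Count remaining null values per column, most affected first
print(autos.isnull().sum().sort_values(ascending=False))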
**Summary: we will analyze the following values:** ad creation date, registration year, brand, price and power.
# Extract datetime values
year_month_dt = pd.to_datetime(autos['ad_created'])
# Convert to 'Month, Year' strings
year_month_str = year_month_dt.dt.strftime('%B, %Y')
# Count ads per 'Month, Year' with a dictionary
month_year_dic = {}
for value in year_month_str:
    if value not in month_year_dic:
        month_year_dic[value] = 1
    else:
        month_year_dic[value] += 1
# Convert the dictionary to lists
month_year = []
values = []
for k, v in month_year_dic.items():
    month_year.append(k)
    values.append(v)
# Convert the lists to a dataframe
month_year_df = pd.DataFrame(list(zip(month_year, values)),
                             columns=['Month, Year', 'Ad numbers'])
# Calculate sum of column Ad numbers
total = month_year_df['Ad numbers'].sum()
# Add new column Percent
month_year_df['Percents %'] = month_year_df['Ad numbers'] / total
# Sort by descending share
month_year_df.sort_values(by=['Percents %'], ascending=False, inplace=True)
# Format output for dataframe
month_year_df.style.format({"Ad numbers":"{:,}", "Percents %":"{:.3%}"})
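As an aside, pandas can build the same frequency table without a manual dictionary; an equivalent sketch using value_counts(normalize=True):
# Equivalent frequency table using pandas built-ins
ad_share = pd.to_datetime(autos['ad_created']).dt.strftime('%B, %Y').value_counts(normalize=True)
print(ad_share)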
Remember that ad_created_max = 07 April 2016 and ad_created_min = 11 June 2015. As we can see, on average a car takes no more than two months to sell.
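This can be cross-checked by comparing 'ad_created' with 'last_seen' (a rough sketch; 'last_seen' is only the last crawl date, so it merely approximates the sale date):
# Approximate listing lifetime as the gap between creation and last crawl
lifetime = pd.to_datetime(autos['last_seen']) - pd.to_datetime(autos['ad_created'])
print(lifetime.mean())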
# Extract values from column 'registration_year'
reg_year = autos['registration_year']
# Count ads per registration year with a dictionary
reg_year_dic = {}
for value in reg_year:
    if value not in reg_year_dic:
        reg_year_dic[value] = 1
    else:
        reg_year_dic[value] += 1
year = []
values = []
for k, v in reg_year_dic.items():
    year.append(k)
    values.append(v)
# Convert the lists to a dataframe
year_df = pd.DataFrame(list(zip(year, values)),
                       columns=['Year reg', 'Ad numbers'])
# Calculate sum of column Ad numbers
total = year_df['Ad numbers'].sum()
# Sort by descending
year_df.sort_values(by = ['Ad numbers'], inplace = True, ascending=False)
# Add new column Percent
year_df['Percents %'] = year_df['Ad numbers'] / total
year_df.style.format({"Ad numbers":"{:,}", "Percents %":"{:.3%}"})
As we see, 76.18% of the cars offered for sale were made between 1997 and 2009.
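This share can be verified with a one-line check (sketch):
# Fraction of ads with a registration year between 1997 and 2009 inclusive
print(autos['registration_year'].between(1997, 2009).mean())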
Get the list of the top 15 brands:
autos['brand'].value_counts().head(15)
As we see, most of the cars offered for sale are made by the big German car trinity: VAG (which also owns the Audi, SEAT and Škoda marques), BMW and Mercedes-Benz.
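To quantify their combined share, we can sum over those brands; a sketch (the brand labels below are my assumptions about how they are spelled in this dataset, so check them against autos['brand'].unique() first):
# Combined share of VAG brands, BMW and Mercedes-Benz (labels assumed)
german_trinity = ['volkswagen', 'audi', 'seat', 'skoda', 'bmw', 'mercedes_benz']
print(autos['brand'].isin(german_trinity).mean())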
As we know, the mileage column contains discrete and, in my opinion, artificial values, so instead I examine the mean price, power and year of make for these brands. This time I don't use a dictionary, but apply the same idea to create a small dataframe. Let's begin:
# Create a list of lists: each brand with its mean price, power and make year
aggregate = []
for brand in autos['brand'].value_counts().head(15).index:
    row_list = []
    row_list.append(brand)
    row_list.append(int(round(autos[autos['brand'] == brand]
                              ['price_usd'].mean(), 0)))
    row_list.append(int(round(autos[autos['brand'] == brand]
                              ['power_ps'].mean(), 0)))
    row_list.append(int(round(autos[autos['brand'] == brand]
                              ['registration_year'].mean(), 0)))
    aggregate.append(row_list)
# Convert the list of lists to a dataframe
aggregate_df = pd.DataFrame.from_records(aggregate)
# Assign column names
aggregate_df.columns = ["Brand", "Price", "Power PS", "Made year"]
# Sort by price, descending
aggregate_df.sort_values(by=['Price'], ascending=False, inplace=True)
# Apply a custom format to the price
aggregate_df.style.format({"Price":"{:,}"})
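For comparison, the same table can be produced with a single groupby; an equivalent sketch:
# Equivalent aggregation with groupby: mean price, power and year per brand
top_brands = autos['brand'].value_counts().head(15).index
means = (autos[autos['brand'].isin(top_brands)]
         .groupby('brand')[['price_usd', 'power_ps', 'registration_year']]
         .mean().round(0).astype(int)
         .sort_values('price_usd', ascending=False))
print(means)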
We see the dispersion by price: the Audi, BMW and Mercedes-Benz brands are the most expensive (with almost the same years of make). Let's see the dispersion by power:
aggregate_df.sort_values(by=['Power PS'], ascending=False, inplace=True)
# Apply a custom format to the price
aggregate_df.style.format({"Price":"{:,}"})
As we see, the Audi, BMW and Mercedes-Benz brands have the most powerful engines (with almost the same years of make). Let's see the dispersion by year of make:
aggregate_df.sort_values(by=['Made year'], ascending=False, inplace=True)
# Apply a custom format to the price
aggregate_df.style.format({"Price":"{:,}"})
As we see, the Škoda, Smart and Citroën brands have the youngest cars.
Created on Jan 27, 2021
@author: Vadim Maklakov, used some ideas from public Internet resources.
© 3-clause BSD License
Software environment:
Debian 10.7
Python 3.8.7
Required preinstalled Python modules:
csv
numpy
pandas