In this project we are going to use a dataset of used cars scraped from the German eBay website.
The data was modified, shortened and uploaded to Kaggle; our version of the dataset consists of 50,000 data points.
The goal of this project is to clean the data and try to answer some interesting questions about the car sales listings.
The project is separated into four parts.
# First, we import the numpy and pandas libraries.
import numpy as np
import pandas as pd
# Using the pandas library we read the file autos.csv and assign it to the variable autos.
# The encoding argument is "Latin-1", because the default encoding ("UTF-8") cannot decode this file.
autos = pd.read_csv("autos.csv", encoding="Latin-1")
autos.info()
autos.head()
Our dataset consists of 50,000 entries and 20 columns.
We can clearly see that the column names use camelCase rather than snake_case. Of the 20 columns, 15 are stored as strings (object) and 5 as integers (int64), i.e. 75% strings and 25% integers.
From the non-null counts we can see that some columns have null values: vehicleType, gearbox, model, fuelType and notRepairedDamage.
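A quick way to quantify these missing values is `DataFrame.isnull().sum()`. A minimal sketch on a hypothetical toy frame (the column names mirror the dataset, but the values are made up for illustration):

```python
import pandas as pd

# Toy frame mimicking a few columns of autos (values are illustrative only)
toy = pd.DataFrame({
    "vehicleType": ["bus", None, "limousine"],
    "gearbox": ["manuell", "automatik", None],
    "price": ["$5,000", "$1,200", "$300"],
})

# Count missing values per column
null_counts = toy.isnull().sum()
print(null_counts)
```

Running the same call on the full autos frame lists every column alongside its missing-value count.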
In this part of the project we will rename the columns from camelCase to snake_case.
#changing column names
autos.columns
autos_copy = autos.copy()
col = autos_copy.columns
#creating a map so that we can change all column names together
mapping_columns = {
'dateCrawled':'date_crawled',
'name':'name',
'seller':'seller',
'offerType':'offer_type',
'price':'price',
'abtest':'abtest',
'vehicleType':'vehicle_type',
'yearOfRegistration':'registration_year',
'gearbox':'gearbox',
'powerPS':'power_ps',
'model':'model',
'odometer':'odometer',
'monthOfRegistration':'registration_month',
'fuelType':'fuel_type',
'brand':'brand',
'notRepairedDamage':'unrepaired_damage',
'dateCreated':'ad_created',
'nrOfPictures':'nr_of_pictures',
'postalCode':'postal_code',
'lastSeen':'last_seen'}
# mapping the old column names to the new ones with the dictionary
autos_copy.columns = pd.Series(col).map(mapping_columns)
# assigning the renamed columns from autos_copy back to autos
autos.columns = autos_copy.columns
autos.columns
autos.describe(include="all") #passing argument include=all so all columns can be included
Now that we have renamed the columns, we can make some observations from the summary statistics of the autos dataset.
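One thing `describe(include="all")` can reveal is columns dominated by a single value, which carry little information. A sketch of how to check for that, on a hypothetical miniature frame (the column names echo the dataset, the values are invented):

```python
import pandas as pd

# Hypothetical miniature frame; "seller" and "nr_of_pictures" stand in for
# columns that the summary statistics suggest are nearly constant
toy = pd.DataFrame({
    "seller": ["privat", "privat", "privat", "privat", "gewerblich"],
    "nr_of_pictures": [0, 0, 0, 0, 0],
})

# Share of the most frequent value per column; values near 1.0 flag
# near-constant columns that are candidates for dropping
for col in toy.columns:
    top_share = toy[col].value_counts(normalize=True).iloc[0]
    print(col, top_share)
```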
autos["price"] = autos["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(int)  # strip currency symbol and separators, convert to int
autos["odometer"] = autos["odometer"].str.replace("km", "", regex=False).str.replace(",", "", regex=False).astype(int)  # strip "km" and separators, convert to int
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)  # rename odometer to keep the km unit in the column name
autos.head()
After converting the odometer and price columns we can calculate the maximum and minimum values for those columns with the max() and min() methods.
max_price = autos["price"].max()
min_price = autos["price"].min()
print("Max price = ",max_price)
print("Min price = ",min_price)
max_odometer_km = autos["odometer_km"].max()
min_odometer_km = autos["odometer_km"].min()
print("Max odometer km = ",max_odometer_km)
print("Min odometer km = ",min_odometer_km)
autos["price"].unique().shape #number of unique values in the price column
autos["price"].describe() #statistics for the price column
sorted_value = autos["price"].value_counts().sort_index(ascending=True).head() #using value counts and sort index to analyse price data
print(sorted_value)
top_sorted_value = autos["price"].value_counts().sort_index(ascending=False).head() #top 5 prices
print(top_sorted_value)
autos = autos[autos["price"].between(1,350000)]
print(autos["price"].describe())
Analysis of the price column:
We can see that a surprisingly large number of listings have a price of 0. A likely explanation is that eBay sellers set very low or zero starting prices to attract more people to their advertisements.
We can also see that the maximum price is $999,999, which is implausibly high and suggests something is wrong with that listing.
Given that eBay is an auction website, a skewed distribution of prices should not surprise us, but we should still eliminate any data that could lead us to bad conclusions.
In the last step we therefore removed rows with a price higher than $350,000 or lower than $1.
Let's first understand how the values in the three string date columns are formatted. These columns all represent full timestamp values, like so:
autos[['date_crawled','ad_created','last_seen']][0:5]
#separating the date string and calculating the distribution of values
autos["date_crawled"].str[:10].value_counts(normalize=True, dropna= False).sort_index(ascending=True)
autos["ad_created"].str[:10].value_counts(normalize=True, dropna= False).sort_index(ascending=True)
autos["last_seen"].str[:10].value_counts(normalize=True, dropna= False).sort_index(ascending=True)
autos["registration_year"].describe()
first_bool = autos["registration_year"] >= 1885
second_bool = autos["registration_year"] <= 2016
combined = first_bool & second_bool
autos = autos[combined] #keep only rows with a plausible registration year
autos["registration_year"].value_counts(normalize=True).head(10)
From the date columns and the registration_year column we can make some observations:
Most of the ads were created in March 2016, which we can see from the distribution of values in the ad_created column. We can also assume that most of the car sales happened in March, since it is the month with the largest share of values in the date_crawled column.
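Instead of slicing date strings by hand, pandas can parse them with `pd.to_datetime`, which gives direct access to calendar fields. A small sketch on hypothetical timestamps in the same format as date_crawled:

```python
import pandas as pd

# Hypothetical timestamps, formatted like the date_crawled column
s = pd.Series(["2016-03-21 17:36:42",
               "2016-03-12 09:10:05",
               "2016-04-01 08:00:00"])

# Parse to datetime, then extract the month for aggregation
dates = pd.to_datetime(s)
month_share = dates.dt.month.value_counts(normalize=True)
print(month_share)
```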
The registration_year column contains some odd data:
It is very strange to see cars registered in the future. Keeping in mind that the first car was built in 1885, we cleaned our data and kept only rows with a registration year between 1885 and 2016.
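The two boolean conditions above can also be expressed with `Series.between`, which makes it easy to count how many rows fall outside the plausible range. A sketch on hypothetical registration years:

```python
import pandas as pd

# Hypothetical registration years, including impossible values
years = pd.Series([2016, 1999, 4500, 1000, 2007])

# between() builds the same combined boolean mask in one call
mask = years.between(1885, 2016)
print(mask.sum(), "rows kept,", (~mask).sum(), "rows dropped")
```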
autos["brand"].value_counts(normalize=True, dropna=False).head(10) #share of listings per brand, top 10
#creating a dictionary with brand/price pairs and calculating the average price per brand
avg_price_for_brand = {}
brand = ['volkswagen','opel','bmw','mercedes_benz','audi','ford']
for b in brand:
    selected_rows = autos[autos["brand"] == b]
    mean = selected_rows["price"].mean()
    avg_price_for_brand[b] = int(mean)
print(avg_price_for_brand)
bmp_series = pd.Series(avg_price_for_brand) #creating new series
pd.DataFrame(bmp_series, columns=["mean_price"]) #creating data frame
#creating a dictionary with brand/mileage pairs and calculating the average mileage per brand
avg_mil_for_brand = {}
for b in brand:
    selected_rows = autos[autos["brand"] == b]
    mean_mileage = selected_rows["odometer_km"].mean()
    avg_mil_for_brand[b] = int(mean_mileage)
print(avg_mil_for_brand)
bmm_series = pd.Series(avg_mil_for_brand)
df2 =pd.DataFrame(bmm_series,columns=["mean_mileage"])
print(df2)
df2["mean_price"] = bmp_series
print(df2)
From the data gathered while exploring price by brand we can make some observations:
First step: find all columns that use German words and translate them into English*
*only the name column has not been changed, because of the complexity of finding the right translation
We have changed all German words in the columns: seller, offer_type, vehicle_type, gearbox, fuel_type, unrepaired_damage
#mapping and updating series
seller = autos["seller"]
map_seller = {
"privat" :"private",
"gewerblich":"commercial"
}
autos["seller"] = pd.Series(seller).map(map_seller)
offer_type = autos["offer_type"]
map_offer = {
"Angebot" :"Offer",
"Gesuch":"Request"
}
autos["offer_type"] = pd.Series(offer_type).map(map_offer)
vehicle_type = autos["vehicle_type"]
map_vehicle = {
"limousine":"limousine",
"kleinwagen":"small car",
"kombi":"combi",
"bus": "bus",
"cabrio":"convertible",
"coupe":"coupe",
"suv":"suv",
"andere":"other"
}
autos["vehicle_type"] = pd.Series(vehicle_type).map(map_vehicle)
gear_box = autos["gearbox"]
map_gearbox = {
"manuell" :"manual",
"automatik":"automatic"
}
autos["gearbox"] = pd.Series(gear_box).map(map_gearbox)
fuel_type = autos["fuel_type"]
map_fuel_type = {
"benzin" :"petrol",
"diesel":"diesel",
"lpg" :"lpg",
"cng":"cng",
"hybrid" :"hybrid",
"andere":"other",
"elektro":"electric",
}
autos["fuel_type"] = pd.Series(fuel_type).map(map_fuel_type)
damage_type = autos["unrepaired_damage"]
map_damage_type = {
"nein" :"no",
"ja":"yes",
}
autos["unrepaired_damage"] = pd.Series(damage_type).map(map_damage_type)
Second step: convert the dates to uniform numeric data, so "2016-03-21" becomes the integer 20160321.
autos["date_crawled"].describe()
#slicing the first 10 characters of the date string and converting them to an integer
autos["date_crawled"] = autos["date_crawled"].str[:10].str.replace("-","").astype(int)
print(autos["date_crawled"].head())
Finding the most common brand/model combinations
From the analysis below we identified that the 5 most common brand/model combinations are:
cols = ["brand","model"]
autos_sel = autos[cols].head()
print(autos_sel)
# grouping the dataset by model and brand
df_brand = autos.groupby(["brand","model"])
# find the count and sort descending
brand_model = df_brand["date_crawled"].count().sort_values(ascending=False)
print(brand_model[:5])
Third step: split odometer_km into groups, and use aggregation to see whether average prices follow any pattern based on the mileage.
autos["odometer_groups"] = pd.qcut(autos["odometer_km"], q=10, duplicates="drop")
odometri = autos["odometer_groups"].unique()
print(odometri)
avg_price_for_groups_km = {}
for od in odometri:
    sel_rows = autos[autos["odometer_groups"] == od]
    mean_km_price_for_groups = sel_rows["price"].mean()
    avg_price_for_groups_km[od] = int(mean_km_price_for_groups)
odo_series = pd.Series(avg_price_for_groups_km)
df3 =pd.DataFrame(odo_series,columns=["mean_price"])
print(df3)
The odometer_km column has been split into 4 groups: 5,000–60,000, 60,000–90,000, 90,000–125,000 and 125,000–150,000 km.
We can clearly see that average prices follow a pattern based on the mileage: cars with lower mileage are more expensive.
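The dictionary loop above can be collapsed into a single `groupby` aggregation. A sketch on a hypothetical toy frame (mileage and price values are made up):

```python
import pandas as pd

# Hypothetical mileage/price pairs
toy = pd.DataFrame({
    "odometer_km": [50000, 50000, 100000, 150000, 150000],
    "price": [9000, 11000, 6000, 3000, 5000],
})

# groupby + mean computes the per-group average price in one step
mean_price = toy.groupby("odometer_km")["price"].mean()
print(mean_price)
```

The same pattern applied to autos, grouping on odometer_groups, would reproduce the dictionary built in the loop.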
Fourth step: how much cheaper are cars with damage than their non-damaged counterparts?
autos["unrepaired_damage"].value_counts()
repaired_unrepaired_price = {}
damage = autos["unrepaired_damage"].dropna().unique() #drop NaN so the comparison only covers listings that answered the question
print(damage)
for d in damage:
    selected_rows = autos[autos["unrepaired_damage"] == d]
    mean = selected_rows["price"].mean()
    repaired_unrepaired_price[d] = float(mean)
print(repaired_unrepaired_price)
From the analysis we can see that cars without unrepaired damage have a considerably higher average price than damaged cars.
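Here too, `groupby` offers a shortcut: it skips NaN keys by default, so the manual dictionary loop and NaN handling are unnecessary. A sketch on a hypothetical toy frame (all values invented):

```python
import pandas as pd

# Hypothetical prices; None stands for listings that did not answer
toy = pd.DataFrame({
    "unrepaired_damage": ["no", "no", "yes", "yes", None],
    "price": [8000, 12000, 3000, 5000, 7000],
})

# groupby drops the NaN group automatically
mean_by_damage = toy.groupby("unrepaired_damage")["price"].mean()
print(mean_by_damage)
```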
For further analysis, we could also look into the name column, which contains information that is not captured by the other columns:
autos["name"].head(20)
In this project we answered our questions and tried to give readers a better understanding of why the data cleaning process is the most important part of getting correct results and avoiding bad observations.