PROJECT: Exploring EBAY Car sales data


Image


The project is separated in 4 parts:

  • Introduction
  • Data cleaning & Date exploring with answering questions
  • Extra analysis
  • Conclusion

Introduction:

In this project we are going to use a dataset of used cars from the German ebay website.

The data was modified , shortened and uploaded to Kaggle. Our dataset consists of 50,000 data points.

The goal of this project is to clean and try to give answers to some interesting questions about the Car sales data.

In [1]:
# First, we import numpy and pandas libraries.

import numpy as np
import pandas as pd
In [2]:
# Using pandas library we read the file Autos.csv and assign it to a variable:autos
# The encoding argument is "Latin-1", because the default encoding argument ("UTF-8) is not correct.
autos = pd.read_csv("autos.csv", encoding = "Latin-1")

autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
Out[2]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

Our data set consists of 50,000 entries and 20 columns.

We can clearly see that the names of the columns use camelcase and not snakecase. The dataset contains 75% strings and 25% of integers which we can see from the type of series(int64 -5 , object-15).

From the entries we can see that some columns have null values, like vehicleType, gearbox,model, fuelType and notRepairedDamage.


Data Cleaning & Exploring

In this part of the project we will:

  • Change the column names
  • Clean price and odometer columns
  • Clean the date columns
  • Exploring price by brand
  • Calculating average mean price and average mean mileage per brand
  • Extra analysis
In [3]:
#changing column names
autos.columns

autos_copy = autos.copy()
col = autos_copy.columns

#creating a map so that we can change all column names together
mapping_columns = {
    'dateCrawled':'date_crawled',
    'name':'name',
    'seller':'seller',
    'offerType':'offer_type',
    'price':'price',
    'abtest':'abtest',
    'vehicleType':'vehicle_type',
    'yearOfRegistration':'registration_year',
    'gearbox':'gearbox',
    'powerPS':'power_ps',
    'model':'model',
    'odometer':'odometer',
    'monthOfRegistration':'registration_month',
    'fuelType':'fuel_type',
    'brand':'brand',
    'notRepairedDamage':'unrepaired_damage',
    'dateCreated':'ad_created',
    'nrOfPictures':'nr_of_pictures',
    'postalCode':'postal_code',
    'lastSeen':'last_seen'}

#creating Series with the mapping columns dictionary
autos_copy.columns = pd.Series(col).map(mapping_columns)


#assigning the columns name from autos.copy to autos
autos.columns = autos_copy.columns
autos.columns
Out[3]:
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
       'last_seen'],
      dtype='object')
In [4]:
autos.describe(include="all") #passing argument include=all so all columns can be included
Out[4]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-16 21:50:53 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

We can make some observations from analysing the dataset autos now after we changed the column names.

  • Some columns have null values
  • Some columns have only 2 unique values like: seller, offer_type, abtest, gearbox and unrepaired_damage
  • The text in the columns is in German
  • The date in the columns: Date crawled , Ad created and Last seen are in Date Time format
  • The columns: price,odometer should be with numeric values so that we can further our analysis

Data cleaning of columns :Price and Odometer

In [5]:
autos["price"] = autos["price"].str.replace("$","").str.replace(",","").astype(int) #removing currency, change from obj to int
autos["odometer"] = autos["odometer"].str.replace("km","").str.replace(",","").astype(int) #removing km, change from obj to int

autos.rename({"odometer":"odometer_km"}, axis = 1, inplace = True) #renaming column odometer to add the km string
autos.head()
Out[5]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000 control bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500 control limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot 8990 test limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot 4350 control kleinwagen 2007 automatik 71 fortwo 70000 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot 1350 test kombi 2003 manuell 0 focus 150000 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

After converting the odometer and price column we can calculate the max and the min values for that columns with the max() and min() methods.

In [6]:
max_price = autos["price"].max()
min_price = autos["price"].min()

print("Max price = ",max_price)
print("Min price = ",min_price)

max_odometer_km = autos["odometer_km"].max()
min_odometer_km = autos["odometer_km"].min()

print("Max odometer km = ",max_odometer_km)
print("Min odometer km = ",min_odometer_km)
Max price =  99999999
Min price =  0
Max odometer km =  150000
Min odometer km =  5000
In [7]:
autos["price"].unique().shape #how many unique values has the price column
Out[7]:
(2357,)
In [8]:
autos["price"].describe() #statistics for the price column
Out[8]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
In [9]:
sorted_value = autos["price"].value_counts().sort_index(ascending=True).head() #using value counts and sort index to analyse price data
print(sorted_value)
0    1421
1     156
2       3
3       1
5       2
Name: price, dtype: int64
In [10]:
top_sorted_value = autos["price"].value_counts().sort_index(ascending=False).head() #top 5 prices
print(top_sorted_value)
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
Name: price, dtype: int64
In [11]:
autos = autos[autos["price"].between(1,350000)]

print(autos["price"].describe())
count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64

Analysis of the price column:

We can see that most of the prices are with 0 value. The answer to this can be found in that usually Ebay sellers try to attract more people to look their advertisments with cheap prices or prices that are with null values.

Also we can see that the max price is 999,999 $ which is also too much and something is not good with that car price.

Given the fact that Ebay is auction website, the different distribution of prices shouldn't influence us but we should eliminate any potencial data that can lead us to making bad conclusions.

In the last step we removed data that has price bigger than 350,000 dollars and price lower than 1 dollar.


Cleaning and analysing the date columns

Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values, like so:

In [12]:
autos[['date_crawled','ad_created','last_seen']][0:5]
Out[12]:
date_crawled ad_created last_seen
0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54
1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08
2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37
3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28
4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50
In [13]:
#separating the date string and calculating the distribution of values
autos["date_crawled"].str[:10].value_counts(normalize=True, dropna= False).sort_index(ascending=True)
Out[13]:
2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64
In [14]:
autos["ad_created"].str[:10].value_counts(normalize=True, dropna= False).sort_index(ascending=True)
Out[14]:
2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-22    0.000021
2016-01-27    0.000062
2016-01-29    0.000021
2016-02-01    0.000021
2016-02-02    0.000041
2016-02-05    0.000041
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000041
2016-02-14    0.000041
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000041
2016-02-19    0.000062
2016-02-20    0.000041
2016-02-21    0.000062
                ...   
2016-03-09    0.033151
2016-03-10    0.031895
2016-03-11    0.032904
2016-03-12    0.036755
2016-03-13    0.017008
2016-03-14    0.035190
2016-03-15    0.034016
2016-03-16    0.030125
2016-03-17    0.031278
2016-03-18    0.013590
2016-03-19    0.033687
2016-03-20    0.037949
2016-03-21    0.037579
2016-03-22    0.032801
2016-03-23    0.032060
2016-03-24    0.029280
2016-03-25    0.031751
2016-03-26    0.032266
2016-03-27    0.030989
2016-03-28    0.034984
2016-03-29    0.034037
2016-03-30    0.033501
2016-03-31    0.031875
2016-04-01    0.033687
2016-04-02    0.035149
2016-04-03    0.038855
2016-04-04    0.036858
2016-04-05    0.011819
2016-04-06    0.003253
2016-04-07    0.001256
Name: ad_created, Length: 76, dtype: float64
In [15]:
autos["last_seen"].str[:10].value_counts(normalize=True, dropna= False).sort_index(ascending=True)
Out[15]:
2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64
In [16]:
autos["registration_year"].describe()
Out[16]:
count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
In [17]:
first_bool = autos["registration_year"] >= 1885
second_bool = autos["registration_year"] <= 2016

combined =first_bool & second_bool

autos.loc[combined, "registration_year"]

autos["registration_year"].value_counts(normalize=True).head(10)
Out[17]:
2000    0.064985
2005    0.060455
1999    0.059652
2004    0.055657
2003    0.055575
2006    0.054978
2001    0.054278
2002    0.051189
1998    0.048656
2007    0.046886
Name: registration_year, dtype: float64

From the date columns and the column- registration year we can make some observations:

Most of the ads we made in March 2016 , which we can see from the distribution of values of the column ad_created. Also we can assume that most of the car sales were made in March because it is the month with the biggest distribution of values for the column date_crawled.

The registration year column has some interesting data like:

  • first registration year: 1000
  • last registration year: 9999

It is very odd to think that some car was registrated in the future. Having in mind that the first car was made in 1885, we cleaned our data and left only columns with registration year from 1885 up until 2016.


Exploring price by brand

In [18]:
autos["brand"].value_counts(normalize=True,dropna=False).head(10) #sorting data in brand column by their value
Out[18]:
volkswagen       0.212828
opel             0.108658
bmw              0.108597
mercedes_benz    0.095789
audi             0.085823
ford             0.069639
renault          0.047874
peugeot          0.029445
fiat             0.025986
seat             0.018944
Name: brand, dtype: float64
In [19]:
#creating dictionary with brand/price and calculating average price for brand

avg_price_for_brand = {}

brand = ['volkswagen','opel','bmw','mercedes_benz','audi','ford']

for b in brand:
    selected_rows = autos[autos["brand"] == b]
    mean = selected_rows["price"].mean()
    avg_price_for_brand[b] = int(mean)

print(avg_price_for_brand)
{'ford': 3728, 'volkswagen': 5332, 'opel': 2944, 'audi': 9212, 'mercedes_benz': 8536, 'bmw': 8261}
In [20]:
bmp_series = pd.Series(avg_price_for_brand) #creating new series
pd.DataFrame(bmp_series, columns=["mean_price"]) #creating data frame
Out[20]:
mean_price
audi 9212
bmw 8261
ford 3728
mercedes_benz 8536
opel 2944
volkswagen 5332
In [21]:
#creating dictionary with brand/price and calculating average price for brand
avg_mil_for_brand = {}

for b in brand:
    selected_rows = autos[autos["brand"] == b]
    mean_mileage = selected_rows["odometer_km"].mean()
    avg_mil_for_brand[b] = int(mean_mileage)
    
print(avg_mil_for_brand)
{'ford': 124349, 'volkswagen': 128896, 'opel': 129383, 'audi': 129492, 'mercedes_benz': 130796, 'bmw': 132682}
In [22]:
bmm_series = pd.Series(avg_mil_for_brand)

df2 =pd.DataFrame(bmm_series,columns=["mean_mileage"])
print(df2)
               mean_mileage
audi                 129492
bmw                  132682
ford                 124349
mercedes_benz        130796
opel                 129383
volkswagen           128896
In [23]:
df2["mean_price"] = bmp_series
In [24]:
print(df2)
               mean_mileage  mean_price
audi                 129492        9212
bmw                  132682        8261
ford                 124349        3728
mercedes_benz        130796        8536
opel                 129383        2944
volkswagen           128896        5332

From the data gathered with exploring price by brand we can make some observations:

  • Listings with more than 5 % from the total data sorted by brand: VOLKSWAGEN, OPEL, BMW, MERCEDES BENZ, AUDI,
  • The top 6 brand are from Germany
  • Mercedes has the biggest average price, and Opel the lowest average price.
  • Mercedes is the most expensive, and Opel is the cheapest brand by average price.
  • BMW has the biggest mean mileage, and FORD the lowest mean mileage.
  • We have corelation between the mean mileage and mean price , suggesting that cars that have bigger mean mileage also have bigger price. This can be based on the fact that cars that are old timers have ussually bigger price.

EXTRA ANALYSIS

First step: find all columns that use german words and translate them in english*

*only the column name has not been changed for the complexity in finding the right translation

We have changed all german words in the columns: seller, offer_type, vehicle_type, gearbox, fuel_type, unrepaired_change

In [25]:
#mapping and updating series
seller = autos["seller"]
map_seller = {
    "privat" :"private",
    "gewerblich":"commercial"
}

autos["seller"] = pd.Series(seller).map(map_seller)

offer_type = autos["offer_type"]
map_offer = {
    "Angebot" :"Offer",
    "Gesuch":"Request"
}

autos["offer_type"] = pd.Series(offer_type).map(map_offer)

vehicle_type = autos["vehicle_type"]
map_vehicle = {
    "limousine":"limousine",
    "kleinwagen":"small car",
    "kombi":"combi",
    "bus": "bus",
    "cabrio":"convertible",
    "coupe":"coupe",
    "suv":"suv",
    "andere":"other"
}

autos["vehicle_type"] = pd.Series(vehicle_type).map(map_vehicle)

gear_box = autos["gearbox"]
map_gearbox = {
    "manuell" :"manual",
    "automatik":"automatic"
}

autos["gearbox"] = pd.Series(gear_box).map(map_gearbox)

fuel_type = autos["fuel_type"]
map_fuel_type = {
    "benzin" :"petrol",
    "diesel":"diesel",
    "lpg" :"lpg",
    "cpg":"cpg",
    "hybrid" :"hybrid",
    "andere":"other",
    "elektro":"electric",
}

autos["fuel_type"] = pd.Series(fuel_type).map(map_fuel_type)

damage_type = autos["unrepaired_damage"]
map_damage_type = {
    "nein" :"no",
    "ja":"yes",
}

autos["unrepaired_damage"] = pd.Series(damage_type).map(map_damage_type)

Second step : convert the dates to be uniform numeric data, so "2016-03-21" becomes the integer

In the data set there are different brands and models of cars. From the analysis we made we can clearly see that the 5 most common brand/model combinations are:

  • volkswagen golf
  • bmw 3er
  • volkswagen polo
  • opel corsa
  • opel astra
In [26]:
autos["date_crawled"].describe()
Out[26]:
count                   48565
unique                  46882
top       2016-03-19 17:36:18
freq                        3
Name: date_crawled, dtype: object
In [27]:
#cutting our string in columns and converting it in numeric data
autos["date_crawled"] = autos["date_crawled"].str[:10].str.replace("-","").astype(int)
print(autos["date_crawled"].head())
0    20160326
1    20160404
2    20160326
3    20160312
4    20160401
Name: date_crawled, dtype: int64

Finding the most common brand/model combinations

From the analysis below we identified that the 5 most common brand/model combinations are:

  • brand model
    2) volkswagen golf 3) bmw 3er 4) volkswagen polo 5) opel corsa 6) opel astra
In [28]:
cols = ["brand","model"]

autos_sel = autos[cols].head()
print(autos_sel)
        brand   model
0     peugeot  andere
1         bmw     7er
2  volkswagen    golf
3       smart  fortwo
4        ford   focus
In [29]:
# grouping the dataset by model and brand
df_brand = autos.groupby(["brand","model"])
# find the count and sort descending
brand_model = df_brand["date_crawled"].count().sort_values(ascending=False)
print(brand_model[:5])
brand       model
volkswagen  golf     3900
bmw         3er      2686
volkswagen  polo     1688
opel        corsa    1681
            astra    1410
Name: date_crawled, dtype: int64

Third step : Split the odometer_km into groups, and use aggregation to see if average prices follows any patterns based on the milage.

In [30]:
autos["odometer_groups"] = pd.qcut(autos['odometer_km'], q=10,duplicates ="drop")

odometri = autos["odometer_groups"].unique()
print(odometri)
avg_price_for_groups_km ={}

for od in odometri:
    sel_rows = autos[autos["odometer_groups"] == od]
    mean_km_price_for_groups = sel_rows["price"].mean()
    odo = int(mean_km_price_for_groups)
    avg_price_for_groups_km[od] = odo
    

odo_series = pd.Series(avg_price_for_groups_km)

df3 =pd.DataFrame(odo_series,columns=["mean_price"])
print(df3)
[(125000.0, 150000.0], (60000.0, 90000.0], (4999.999, 60000.0], (90000.0, 125000.0]]
Categories (4, interval[float64]): [(4999.999, 60000.0] < (60000.0, 90000.0] < (90000.0, 125000.0] < (125000.0, 150000.0]]
                      mean_price
(4999.999, 60000.0]        14170
(60000.0, 90000.0]          9469
(90000.0, 125000.0]         6692
(125000.0, 150000.0]        3728

The odometer_km column has been split in 4 groups: 5.000-60.000 , 60.000-90.000, 90.000-125.000, 125.000-150.000

We can clearly see that average prices follows patterns based on the milage and the cars that have lower mileage are more expensive.


4th step: How much cheaper are cars with damage than their non-damaged counterparts?

In [31]:
autos["unrepaired_damage"].value_counts()
Out[31]:
no     34775
yes     4689
Name: unrepaired_damage, dtype: int64
In [32]:
repaired_unrepaired_price = {}

damage = autos["unrepaired_damage"].unique()
print(damage)

for d in damage:
    selected_rows = autos[autos["unrepaired_damage"] == d]
    mean = selected_rows["price"].mean()
    repaired_unrepaired_price[d] = float(mean)

print(repaired_unrepaired_price)
['no' nan 'yes']
{nan: nan, 'yes': 2221.887609298358, 'no': 7086.8027318475915}

From the analysis we can see that cars that were not damaged have bigger average price than the cars that were repaired.


If we want to make some further analysis , we can also look in the name column which contains some other information that is not included in the other columns like :

  1. Does the car possess clima control system;
  2. Engine type and ect.
In [33]:
autos["name"].head(20)
Out[33]:
0                      Peugeot_807_160_NAVTECH_ON_BOARD
1            BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik
2                            Volkswagen_Golf_1.6_United
3     Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...
4     Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...
5     Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...
6     VW_Golf_III_GT_Special_Electronic_Green_Metall...
7                                  Golf_IV_1.9_TDI_90PS
8                                            Seat_Arosa
9             Renault_Megane_Scenic_1.6e_RT_Klimaanlage
10                         VW_Golf_Tuning_in_siber/grau
11                           Mercedes_A140_Motorschaden
12    Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan...
13                                   Audi_A3_1.6_tuning
14    Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver...
15          Corvette_C3_Coupe_T_Top_Crossfire_Injection
16                                  Opel_Vectra_B_Kombi
17                            Volkswagen_Scirocco_2_G60
18                 Verkaufen_mein_bmw_e36_320_i_touring
19          mazda_tribute_2.0_mit_gas_und_tuev_neu_2018
Name: name, dtype: object

Conclusion

In this project we got answers to our questions are tried to give the readers better understanding how the data cleaning process is the most important part for getting the right results and not making bad observations.

In [ ]: