The project is separated into four parts.
In this project we are going to use a dataset of used cars from the German eBay website.
The data was modified, shortened and uploaded to Kaggle; our dataset consists of 50,000 data points.
The goal of this project is to clean the data and try to answer some interesting questions about the car sales.
# First, we import numpy and pandas libraries.
import numpy as np
import pandas as pd
# Using the pandas library we read the file autos.csv and assign it to the variable autos.
# The encoding argument is "Latin-1", because the default encoding ("UTF-8") cannot decode this file.
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Our dataset consists of 50,000 entries and 20 columns.
We can clearly see that the column names use camelCase rather than snake_case. From the dtypes (int64: 5, object: 15) we can see that the dataset is 75% strings and 25% integers.
We can also see that some columns have null values: vehicleType, gearbox, model, fuelType and notRepairedDamage.
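A quick way to quantify those missing values is `isnull().sum()`. A minimal sketch on a toy frame (the toy data here is illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame mimicking a few columns of the dataset, with some missing values
sample = pd.DataFrame({
    "vehicleType": ["bus", None, "limousine"],
    "gearbox": ["manuell", "automatik", None],
    "price": ["$5,000", "$8,500", "$8,990"],
})

# Count nulls per column, then keep only the columns that actually have any
null_counts = sample.isnull().sum()
print(null_counts[null_counts > 0])
```

Run on the full dataset, the same two lines would list the five columns mentioned above.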
In this part of the project we will rename the columns from camelCase to snake_case, making some names more descriptive along the way.
#changing column names
autos.columns
autos_copy = autos.copy()
col = autos_copy.columns
#creating a map so that we can change all column names together
mapping_columns = {
'dateCrawled':'date_crawled',
'name':'name',
'seller':'seller',
'offerType':'offer_type',
'price':'price',
'abtest':'abtest',
'vehicleType':'vehicle_type',
'yearOfRegistration':'registration_year',
'gearbox':'gearbox',
'powerPS':'power_ps',
'model':'model',
'odometer':'odometer',
'monthOfRegistration':'registration_month',
'fuelType':'fuel_type',
'brand':'brand',
'notRepairedDamage':'unrepaired_damage',
'dateCreated':'ad_created',
'nrOfPictures':'nr_of_pictures',
'postalCode':'postal_code',
'lastSeen':'last_seen'}
#creating a Series from the columns and mapping it with the dictionary
autos_copy.columns = pd.Series(col).map(mapping_columns)
#assigning the column names from autos_copy back to autos
autos.columns = autos_copy.columns
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
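The renaming could also be automated with a regular expression instead of the hand-written dictionary. A sketch (note this mechanical conversion yields `year_of_registration` rather than the friendlier `registration_year` chosen above, and splits consecutive capitals such as `powerPS`):

```python
import re

def camel_to_snake(name):
    # Insert an underscore before every uppercase letter (except at the start), then lowercase
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

print([camel_to_snake(c) for c in ["dateCrawled", "offerType", "yearOfRegistration"]])
```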
autos.describe(include="all") #include="all" so non-numeric columns are summarised too
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-16 21:50:53 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Now that the column names are changed, we can analyse the autos dataset with describe() and make some observations.
autos["price"] = autos["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(int) #removing currency symbols, converting object to int
autos["odometer"] = autos["odometer"].str.replace("km", "", regex=False).str.replace(",", "", regex=False).astype(int) #removing the km suffix, converting object to int
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True) #renaming odometer to keep the unit in the column name
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
After converting the odometer and price columns we can calculate the max and min values for those columns with the max() and min() methods.
max_price = autos["price"].max()
min_price = autos["price"].min()
print("Max price = ",max_price)
print("Min price = ",min_price)
max_odometer_km = autos["odometer_km"].max()
min_odometer_km = autos["odometer_km"].min()
print("Max odometer km = ",max_odometer_km)
print("Min odometer km = ",min_odometer_km)
Max price =  99999999
Min price =  0
Max odometer km =  150000
Min odometer km =  5000
autos["price"].unique().shape #number of unique values in the price column
(2357,)
autos["price"].describe() #statistics for the price column
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
sorted_value = autos["price"].value_counts().sort_index(ascending=True).head() #using value counts and sort index to analyse price data
print(sorted_value)
0    1421
1     156
2       3
3       1
5       2
Name: price, dtype: int64
top_sorted_value = autos["price"].value_counts().sort_index(ascending=False).head() #top 5 prices
print(top_sorted_value)
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
Name: price, dtype: int64
autos = autos[autos["price"].between(1,350000)]
print(autos["price"].describe())
count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price, dtype: float64
Analysis of the price column:
The most frequent price is 0 (1,421 listings). A likely explanation is that eBay sellers set very low or zero starting prices to attract more people to their advertisements.
At the other end, the max price is 99,999,999 $, which is far too much; something is clearly wrong with that listing.
Given that eBay is an auction website, a wide distribution of prices is expected, but we should still eliminate any data that could lead us to bad conclusions.
In the last step we removed rows with a price higher than 350,000 dollars or lower than 1 dollar.
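`Series.between` is inclusive on both ends, so the filter above keeps listings priced exactly 1 or 350,000. A small sketch of the equivalent boolean-mask form, on toy prices:

```python
import pandas as pd

prices = pd.Series([0, 1, 2950, 350000, 99999999])

# Explicit mask equivalent to prices.between(1, 350000)
kept = prices[(prices >= 1) & (prices <= 350000)]
print(kept.tolist())
```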
Let's first understand how the values in the three string date columns (date_crawled, ad_created and last_seen) are formatted. They all represent full timestamp values, like so:
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
#separating the date string and calculating the distribution of values
autos["date_crawled"].str[:10].value_counts(normalize=True, dropna= False).sort_index(ascending=True)
2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64
autos["ad_created"].str[:10].value_counts(normalize=True, dropna= False).sort_index(ascending=True)
2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
2016-01-03    0.000021
2016-01-07    0.000021
2016-01-10    0.000041
2016-01-13    0.000021
2016-01-14    0.000021
2016-01-16    0.000021
2016-01-22    0.000021
2016-01-27    0.000062
2016-01-29    0.000021
2016-02-01    0.000021
2016-02-02    0.000041
2016-02-05    0.000041
2016-02-07    0.000021
2016-02-08    0.000021
2016-02-09    0.000021
2016-02-11    0.000021
2016-02-12    0.000041
2016-02-14    0.000041
2016-02-16    0.000021
2016-02-17    0.000021
2016-02-18    0.000041
2016-02-19    0.000062
2016-02-20    0.000041
2016-02-21    0.000062
...
2016-03-09    0.033151
2016-03-10    0.031895
2016-03-11    0.032904
2016-03-12    0.036755
2016-03-13    0.017008
2016-03-14    0.035190
2016-03-15    0.034016
2016-03-16    0.030125
2016-03-17    0.031278
2016-03-18    0.013590
2016-03-19    0.033687
2016-03-20    0.037949
2016-03-21    0.037579
2016-03-22    0.032801
2016-03-23    0.032060
2016-03-24    0.029280
2016-03-25    0.031751
2016-03-26    0.032266
2016-03-27    0.030989
2016-03-28    0.034984
2016-03-29    0.034037
2016-03-30    0.033501
2016-03-31    0.031875
2016-04-01    0.033687
2016-04-02    0.035149
2016-04-03    0.038855
2016-04-04    0.036858
2016-04-05    0.011819
2016-04-06    0.003253
2016-04-07    0.001256
Name: ad_created, Length: 76, dtype: float64
autos["last_seen"].str[:10].value_counts(normalize=True, dropna= False).sort_index(ascending=True)
2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64
autos["registration_year"].describe()
count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
first_bool = autos["registration_year"] >= 1885
second_bool = autos["registration_year"] <= 2016
combined = first_bool & second_bool
#keeping only rows with a plausible registration year (1885-2016)
autos = autos[combined]
autos["registration_year"].value_counts(normalize=True).head(10)
2000    0.064985
2005    0.060455
1999    0.059652
2004    0.055657
2003    0.055575
2006    0.054978
2001    0.054278
2002    0.051189
1998    0.048656
2007    0.046886
Name: registration_year, dtype: float64
From the date columns and the registration_year column we can make some observations:
Most of the ads were created in March 2016, which we can see from the distribution of values in the ad_created column. We can also assume that most of the car sales happened in March, because it is the month with the biggest share of values in the date_crawled column.
The registration_year column contains some odd data: the minimum is 1000 and the maximum is 9999.
It is very odd to think that a car was registered before cars existed or in the future. Keeping in mind that the first car was made in 1885, we cleaned our data and kept only rows with a registration year from 1885 up until 2016.
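The same bounds check can be written with `Series.between`, which also makes it easy to count how many rows a filter would drop before applying it. A sketch on toy years:

```python
import pandas as pd

years = pd.Series([1000, 1885, 1999, 2016, 2018, 9999])

# True for plausible registration years, False for the outliers
valid = years.between(1885, 2016)
print("kept:", valid.sum(), "dropped:", (~valid).sum())
```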
autos["brand"].value_counts(normalize=True,dropna=False).head(10) #top 10 brands by share of listings
volkswagen       0.212828
opel             0.108658
bmw              0.108597
mercedes_benz    0.095789
audi             0.085823
ford             0.069639
renault          0.047874
peugeot          0.029445
fiat             0.025986
seat             0.018944
Name: brand, dtype: float64
#creating dictionary with brand/price and calculating average price for brand
avg_price_for_brand = {}
brand = ['volkswagen','opel','bmw','mercedes_benz','audi','ford']
for b in brand:
selected_rows = autos[autos["brand"] == b]
mean = selected_rows["price"].mean()
avg_price_for_brand[b] = int(mean)
print(avg_price_for_brand)
{'ford': 3728, 'volkswagen': 5332, 'opel': 2944, 'audi': 9212, 'mercedes_benz': 8536, 'bmw': 8261}
bmp_series = pd.Series(avg_price_for_brand) #creating new series
pd.DataFrame(bmp_series, columns=["mean_price"]) #creating data frame
mean_price | |
---|---|
audi | 9212 |
bmw | 8261 |
ford | 3728 |
mercedes_benz | 8536 |
opel | 2944 |
volkswagen | 5332 |
#creating dictionary with brand/price and calculating average price for brand
avg_mil_for_brand = {}
for b in brand:
selected_rows = autos[autos["brand"] == b]
mean_mileage = selected_rows["odometer_km"].mean()
avg_mil_for_brand[b] = int(mean_mileage)
print(avg_mil_for_brand)
{'ford': 124349, 'volkswagen': 128896, 'opel': 129383, 'audi': 129492, 'mercedes_benz': 130796, 'bmw': 132682}
bmm_series = pd.Series(avg_mil_for_brand)
df2 =pd.DataFrame(bmm_series,columns=["mean_mileage"])
print(df2)
               mean_mileage
audi                 129492
bmw                  132682
ford                 124349
mercedes_benz        130796
opel                 129383
volkswagen           128896
df2["mean_price"] = bmp_series
print(df2)
               mean_mileage  mean_price
audi                 129492        9212
bmw                  132682        8261
ford                 124349        3728
mercedes_benz        130796        8536
opel                 129383        2944
volkswagen           128896        5332
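The two brand loops above can be collapsed into a single `groupby` aggregation, which computes both means in one pass. A sketch on toy data (brands and numbers are illustrative):

```python
import pandas as pd

cars = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford", "ford"],
    "price": [9000, 7000, 4000, 3000],
    "odometer_km": [150000, 125000, 125000, 100000],
})

# Mean price and mean mileage per brand in one call
summary = cars.groupby("brand")[["price", "odometer_km"]].mean()
print(summary)
```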
From the data gathered by exploring price by brand we can make some observations: Audi, Mercedes-Benz and BMW have the highest average prices of the six brands, Opel and Ford the lowest, with Volkswagen in between.
First step: find all columns that use German words and translate them into English.*
*Only the name column has not been translated, because of the complexity of finding the right translations.
We have translated the German words in the columns: seller, offer_type, vehicle_type, gearbox, fuel_type and unrepaired_damage.
#mapping and updating series
seller = autos["seller"]
map_seller = {
"privat" :"private",
"gewerblich":"commercial"
}
autos["seller"] = pd.Series(seller).map(map_seller)
offer_type = autos["offer_type"]
map_offer = {
"Angebot" :"Offer",
"Gesuch":"Request"
}
autos["offer_type"] = pd.Series(offer_type).map(map_offer)
vehicle_type = autos["vehicle_type"]
map_vehicle = {
"limousine":"limousine",
"kleinwagen":"small car",
"kombi":"combi",
"bus": "bus",
"cabrio":"convertible",
"coupe":"coupe",
"suv":"suv",
"andere":"other"
}
autos["vehicle_type"] = pd.Series(vehicle_type).map(map_vehicle)
gear_box = autos["gearbox"]
map_gearbox = {
"manuell" :"manual",
"automatik":"automatic"
}
autos["gearbox"] = pd.Series(gear_box).map(map_gearbox)
fuel_type = autos["fuel_type"]
map_fuel_type = {
"benzin" :"petrol",
"diesel":"diesel",
"lpg" :"lpg",
"cng":"cng",
"hybrid" :"hybrid",
"andere":"other",
"elektro":"electric",
}
autos["fuel_type"] = pd.Series(fuel_type).map(map_fuel_type)
damage_type = autos["unrepaired_damage"]
map_damage_type = {
"nein" :"no",
"ja":"yes",
}
autos["unrepaired_damage"] = pd.Series(damage_type).map(map_damage_type)
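All six translations could also be applied in one call with `DataFrame.replace` and a nested dictionary. Unlike `map`, `replace` leaves values that are missing from the dictionary untouched instead of turning them into NaN. A sketch with two of the columns:

```python
import pandas as pd

df = pd.DataFrame({
    "seller": ["privat", "gewerblich"],
    "gearbox": ["manuell", "automatik"],
})

# Outer keys are column names, inner dicts map German values to English
translations = {
    "seller": {"privat": "private", "gewerblich": "commercial"},
    "gearbox": {"manuell": "manual", "automatik": "automatic"},
}
df = df.replace(translations)
print(df)
```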
Second step: convert the dates to uniform numeric data, so "2016-03-21" becomes the integer 20160321.
autos["date_crawled"].describe()
count                   48565
unique                  46882
top       2016-03-19 17:36:18
freq                        3
Name: date_crawled, dtype: object
#slicing the date part of the string and converting it to an integer
autos["date_crawled"] = autos["date_crawled"].str[:10].str.replace("-", "", regex=False).astype(int)
print(autos["date_crawled"].head())
0    20160326
1    20160404
2    20160326
3    20160312
4    20160401
Name: date_crawled, dtype: int64
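An alternative to storing dates as integers is `pd.to_datetime`, which keeps real date semantics (sorting, differences, resampling) available; the compact integer form can still be derived from it. A sketch:

```python
import pandas as pd

crawled = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"])

# Parse to datetimes, then derive the integer form if needed
as_dates = pd.to_datetime(crawled)
print(as_dates.dt.strftime("%Y%m%d").astype(int).tolist())
```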
Finding the most common brand/model combinations
From the analysis below we can identify the 5 most common brand/model combinations.
cols = ["brand","model"]
autos_sel = autos[cols].head()
print(autos_sel)
        brand   model
0     peugeot  andere
1         bmw     7er
2  volkswagen    golf
3       smart  fortwo
4        ford   focus
# grouping the dataset by model and brand
df_brand = autos.groupby(["brand","model"])
# find the count and sort descending
brand_model = df_brand["date_crawled"].count().sort_values(ascending=False)
print(brand_model[:5])
brand       model
volkswagen  golf     3900
bmw         3er      2686
volkswagen  polo     1688
opel        corsa    1681
            astra    1410
Name: date_crawled, dtype: int64
Third step: split odometer_km into groups, and use aggregation to see whether average prices follow any pattern based on the mileage.
autos["odometer_groups"] = pd.qcut(autos['odometer_km'], q=10, duplicates="drop") #duplicate quantile edges collapse the 10 requested bins into 4
odometri = autos["odometer_groups"].unique()
print(odometri)
avg_price_for_groups_km ={}
for od in odometri:
sel_rows = autos[autos["odometer_groups"] == od]
mean_km_price_for_groups = sel_rows["price"].mean()
odo = int(mean_km_price_for_groups)
avg_price_for_groups_km[od] = odo
odo_series = pd.Series(avg_price_for_groups_km)
df3 =pd.DataFrame(odo_series,columns=["mean_price"])
print(df3)
[(125000.0, 150000.0], (60000.0, 90000.0], (4999.999, 60000.0], (90000.0, 125000.0]]
Categories (4, interval[float64]): [(4999.999, 60000.0] < (60000.0, 90000.0] < (90000.0, 125000.0] < (125000.0, 150000.0]]
                      mean_price
(4999.999, 60000.0]        14170
(60000.0, 90000.0]          9469
(90000.0, 125000.0]         6692
(125000.0, 150000.0]        3728
The odometer_km column has been split into 4 groups: 5,000-60,000, 60,000-90,000, 90,000-125,000 and 125,000-150,000 km.
We can clearly see that average prices follow a pattern based on the mileage: cars with lower mileage are more expensive.
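The loop over the unique intervals can also be replaced by grouping on the categorical column directly. A sketch with hand-made bins and toy prices (one row per bin, so each group's mean equals that row's price):

```python
import pandas as pd

cars = pd.DataFrame({
    "odometer_km": [30000, 70000, 100000, 150000],
    "price": [14000, 9000, 6500, 3700],
})

# Bin mileage into the same four ranges, then average price per bin
cars["odometer_groups"] = pd.cut(cars["odometer_km"],
                                 bins=[5000, 60000, 90000, 125000, 150000])
mean_price = cars.groupby("odometer_groups", observed=True)["price"].mean()
print(mean_price)
```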
Fourth step: how much cheaper are cars with damage than their non-damaged counterparts?
autos["unrepaired_damage"].value_counts()
no     34775
yes     4689
Name: unrepaired_damage, dtype: int64
repaired_unrepaired_price = {}
damage = autos["unrepaired_damage"].unique()
print(damage)
for d in damage:
selected_rows = autos[autos["unrepaired_damage"] == d]
mean = selected_rows["price"].mean()
repaired_unrepaired_price[d] = float(mean)
print(repaired_unrepaired_price)
['no' nan 'yes']
{nan: nan, 'yes': 2221.887609298358, 'no': 7086.8027318475915}
From the analysis we can see that undamaged cars have a much higher average price (about 7,087 $) than cars with unrepaired damage (about 2,222 $): damaged cars sell for roughly a third of the price.
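The same damaged-vs-undamaged comparison works without a loop, since `groupby` skips NaN keys by default. A sketch on toy numbers:

```python
import pandas as pd
import numpy as np

cars = pd.DataFrame({
    "unrepaired_damage": ["no", "yes", "no", np.nan],
    "price": [8000, 2500, 6000, 1000],
})

# NaN rows are excluded from the groups automatically
avg_by_damage = cars.groupby("unrepaired_damage")["price"].mean()
print(avg_by_damage)
```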
If we want to make some further analysis, we can also look into the name column, which contains information that is not included in the other columns, for example:
autos["name"].head(20)
0     Peugeot_807_160_NAVTECH_ON_BOARD
1     BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik
2     Volkswagen_Golf_1.6_United
3     Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...
4     Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...
5     Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto...
6     VW_Golf_III_GT_Special_Electronic_Green_Metall...
7     Golf_IV_1.9_TDI_90PS
8     Seat_Arosa
9     Renault_Megane_Scenic_1.6e_RT_Klimaanlage
10    VW_Golf_Tuning_in_siber/grau
11    Mercedes_A140_Motorschaden
12    Smart_smart_fortwo_coupe_softouch_pure_MHD_Pan...
13    Audi_A3_1.6_tuning
14    Renault_Clio_3__Dynamique_1.2__16_V;_viele_Ver...
15    Corvette_C3_Coupe_T_Top_Crossfire_Injection
16    Opel_Vectra_B_Kombi
17    Volkswagen_Scirocco_2_G60
18    Verkaufen_mein_bmw_e36_320_i_touring
19    mazda_tribute_2.0_mit_gas_und_tuev_neu_2018
Name: name, dtype: object
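As a sketch of how such extra information could be pulled out of the names, `str.contains` can flag listings that mention a keyword (the keyword "Klima", air conditioning, is just an illustrative choice):

```python
import pandas as pd

names = pd.Series([
    "Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...",
    "Renault_Megane_Scenic_1.6e_RT_Klimaanlage",
    "Seat_Arosa",
])

# Case-insensitive keyword flag; could become a new boolean column
has_klima = names.str.contains("Klima", case=False)
print(has_klima.tolist())
```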
In this project we answered our questions and tried to give readers a better understanding of how the data cleaning process is the most important part of getting the right results and not making bad observations.