This project will explore used car sale data on eBay Kleinanzeigen, the classified information portion of the German eBay site. For each of the 50,000 listings, various attributes are given, such as the car's brand, fuel type and kilometers driven. The data will first be cleaned, and then analyzed for trends.
Import necessary libraries and store .csv file data as a dataframe.
import numpy as np
import pandas as pd
autos = pd.read_csv("autos3.csv", encoding = "Latin-1")
Get information about the dataframe.
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
Print first five rows of dataset.
print(autos.head())
dateCrawled name \ 0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... seller offerType price abtest vehicleType yearOfRegistration \ 0 privat Angebot $5,000 control bus 2004 1 privat Angebot $8,500 control limousine 1997 2 privat Angebot $8,990 test limousine 2009 3 privat Angebot $4,350 control kleinwagen 2007 4 privat Angebot $1,350 test kombi 2003 gearbox powerPS model odometer monthOfRegistration fuelType \ 0 manuell 158 andere 150,000km 3 lpg 1 automatik 286 7er 150,000km 6 benzin 2 manuell 102 golf 70,000km 7 benzin 3 automatik 71 fortwo 70,000km 6 benzin 4 manuell 0 focus 150,000km 7 benzin brand notRepairedDamage dateCreated nrOfPictures \ 0 peugeot nein 2016-03-26 00:00:00 0 1 bmw nein 2016-04-04 00:00:00 0 2 volkswagen nein 2016-03-26 00:00:00 0 3 smart nein 2016-03-12 00:00:00 0 4 ford nein 2016-04-01 00:00:00 0 postalCode lastSeen 0 79588 2016-04-06 06:45:54 1 71034 2016-04-06 14:45:08 2 35394 2016-04-06 20:15:37 3 33729 2016-03-15 03:16:28 4 39218 2016-04-01 14:38:50
From the first few rows, there are a lot of columns with values written in German, which makes sense because the data was scraped from the German version of eBay. There are also quite a few NaN values that should be investigated, especially in the notRepairedDamage column.
Print the column names for the dataframe.
print(autos.columns)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
Want to change column names from camelcase to snakecase.
new_columns = ["date_crawled", "name", "seller", "offer_type", "price", "abtest", "vehicle_type",
"registration_year", "gearbox", "powerPS", "model", "odometer", "registration_month",
"fuel_type", "brand", "unrepaired_damage", "ad_created", "nr_of_pictures",
"postal_code", "last_seen"]
autos.columns = new_columns
print(autos.head())
date_crawled name \ 0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... seller offer_type price abtest vehicle_type registration_year \ 0 privat Angebot $5,000 control bus 2004 1 privat Angebot $8,500 control limousine 1997 2 privat Angebot $8,990 test limousine 2009 3 privat Angebot $4,350 control kleinwagen 2007 4 privat Angebot $1,350 test kombi 2003 gearbox powerPS model odometer registration_month fuel_type \ 0 manuell 158 andere 150,000km 3 lpg 1 automatik 286 7er 150,000km 6 benzin 2 manuell 102 golf 70,000km 7 benzin 3 automatik 71 fortwo 70,000km 6 benzin 4 manuell 0 focus 150,000km 7 benzin brand unrepaired_damage ad_created nr_of_pictures \ 0 peugeot nein 2016-03-26 00:00:00 0 1 bmw nein 2016-04-04 00:00:00 0 2 volkswagen nein 2016-03-26 00:00:00 0 3 smart nein 2016-03-12 00:00:00 0 4 ford nein 2016-04-01 00:00:00 0 postal_code last_seen 0 79588 2016-04-06 06:45:54 1 71034 2016-04-06 14:45:08 2 35394 2016-04-06 20:15:37 3 33729 2016-03-15 03:16:28 4 39218 2016-04-01 14:38:50
Get descriptive statistics for all columns.
autos.describe(include = "all")
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | powerPS | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-05 16:57:05 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
The seller and offer_type columns are mostly one value, so they could be dropped. Also, the nr_of_pictures columns has zero for all values. Price and odometer columns are stored as text, so need to be cleaned so data is in numeric form.
Remove the seller, offer_type, and nr_of_pictures columns.
autos = autos.drop(columns = ["seller","offer_type","nr_of_pictures"])
Remove non-numeric characters and convert price and odometer columns to numeric dtype.
autos["price"] = autos["price"].str.replace(",","",regex=True).str.replace("$","",regex=True).astype(int)
autos["odometer"] = autos["odometer"].str.replace(",","",regex=True).str.replace("km","").astype(int)
autos.rename({"odometer":"odometer_km"},axis=1,inplace=True)
print(autos.head())
date_crawled name \ 0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... price abtest vehicle_type registration_year gearbox powerPS model \ 0 5000 control bus 2004 manuell 158 andere 1 8500 control limousine 1997 automatik 286 7er 2 8990 test limousine 2009 manuell 102 golf 3 4350 control kleinwagen 2007 automatik 71 fortwo 4 1350 test kombi 2003 manuell 0 focus odometer_km registration_month fuel_type brand unrepaired_damage \ 0 150000 3 lpg peugeot nein 1 150000 6 benzin bmw nein 2 70000 7 benzin volkswagen nein 3 70000 6 benzin smart nein 4 150000 7 benzin ford nein ad_created postal_code last_seen 0 2016-03-26 00:00:00 79588 2016-04-06 06:45:54 1 2016-04-04 00:00:00 71034 2016-04-06 14:45:08 2 2016-03-26 00:00:00 35394 2016-04-06 20:15:37 3 2016-03-12 00:00:00 33729 2016-03-15 03:16:28 4 2016-04-01 00:00:00 39218 2016-04-01 14:38:50
Explore the price column.
print(autos["price"].unique().shape[0],"\n")
print(autos["price"].describe(),"\n")
print(autos["price"].value_counts().sort_index(ascending = False).head(),"\n")
print(autos["price"].value_counts().sort_index().head())
2357 count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64 99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 Name: price, dtype: int64 0 1421 1 156 2 3 3 1 5 2 Name: price, dtype: int64
Explore the odometer_km column.
print(autos["odometer_km"].unique().shape[0],"\n")
print(autos["odometer_km"].describe(),"\n")
print(autos["odometer_km"].value_counts().sort_index(ascending = False).head(),"\n")
print(autos["odometer_km"].value_counts().sort_index().head())
13 count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64 150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 Name: odometer_km, dtype: int64 5000 967 10000 264 20000 784 30000 789 40000 819 Name: odometer_km, dtype: int64
While the range of 5,000 to 150,000 kilometers is reasonable for the odometer_km column, there are outliers in the price column. A price of 0 does not make sense, while a price of 10,000,000 is too high. Values outside the price range of 1 to 200,000 will be removed.
autos = autos[autos["price"].between(1,200000)]
print(autos["price"].describe())
count 48557.000000 mean 5842.888914 std 8301.871663 min 1.000000 25% 1200.000000 50% 3000.000000 75% 7450.000000 max 198000.000000 Name: price, dtype: float64
Will now explore date values in the date_crawled, ad_created, and last_seen columns, will look at percentage distributions.
Start with date_crawled column.
print(autos["date_crawled"].str[:10].value_counts(normalize = True, dropna = False).sort_index())
2016-03-05 0.025331 2016-03-06 0.014045 2016-03-07 0.036020 2016-03-08 0.033301 2016-03-09 0.033095 2016-03-10 0.032189 2016-03-11 0.032580 2016-03-12 0.036926 2016-03-13 0.015672 2016-03-14 0.036555 2016-03-15 0.034290 2016-03-16 0.029615 2016-03-17 0.031633 2016-03-18 0.012913 2016-03-19 0.034784 2016-03-20 0.037894 2016-03-21 0.037358 2016-03-22 0.032992 2016-03-23 0.032210 2016-03-24 0.029347 2016-03-25 0.031612 2016-03-26 0.032210 2016-03-27 0.031077 2016-03-28 0.034825 2016-03-29 0.034104 2016-03-30 0.033672 2016-03-31 0.031839 2016-04-01 0.033692 2016-04-02 0.035484 2016-04-03 0.038594 2016-04-04 0.036493 2016-04-05 0.013077 2016-04-06 0.003172 2016-04-07 0.001400 Name: date_crawled, dtype: float64
The date_crawled column data is fairly evenly distributed, with each day containing roughly 3% of the listings.
print(autos["ad_created"].str[:10].value_counts(normalize = True, dropna = False).sort_index())
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 ... 2016-04-03 0.038841 2016-04-04 0.036864 2016-04-05 0.011801 2016-04-06 0.003254 2016-04-07 0.001256 Name: ad_created, Length: 76, dtype: float64
The ad_created column has data spanning a much larger range than the date_crawled and last_seen columns.
print(autos["last_seen"].str[:10].value_counts(normalize = True, dropna = False).sort_index())
2016-03-05 0.001071 2016-03-06 0.004325 2016-03-07 0.005396 2016-03-08 0.007414 2016-03-09 0.009597 2016-03-10 0.010668 2016-03-11 0.012377 2016-03-12 0.023786 2016-03-13 0.008897 2016-03-14 0.012604 2016-03-15 0.015878 2016-03-16 0.016455 2016-03-17 0.028091 2016-03-18 0.007352 2016-03-19 0.015837 2016-03-20 0.020656 2016-03-21 0.020636 2016-03-22 0.021377 2016-03-23 0.018535 2016-03-24 0.019771 2016-03-25 0.019215 2016-03-26 0.016805 2016-03-27 0.015631 2016-03-28 0.020841 2016-03-29 0.022345 2016-03-30 0.024775 2016-03-31 0.023786 2016-04-01 0.022798 2016-04-02 0.024919 2016-04-03 0.025187 2016-04-04 0.024487 2016-04-05 0.124740 2016-04-06 0.221801 2016-04-07 0.131948 Name: last_seen, dtype: float64
The most common dates for the last_seen column are April 5th, 6th, and 7th in 2016.
print(autos["registration_year"].describe())
count 48557.000000 mean 2004.754968 std 88.651037 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
There are clearly outliers in the registration year column, as there were no cars in the year 1000, and the year 9999 has not happened yet. The data in this column will have to be cleaned.
Will remove all listings for which the registration_year value is not between 1900 and 2016.
num_range = autos[autos["registration_year"].between(1900,2016)].shape[0]
print(num_range)
autos = autos[autos["registration_year"].between(1900,2016)]
print(autos["registration_year"].describe())
46673 count 46673.000000 mean 2002.909969 std 7.183604 min 1910.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 2016.000000 Name: registration_year, dtype: float64
Will analyze the top 20 most popular car brands based on average price to find patterns.
top_20 = autos["brand"].value_counts()[:20].index
brand_price = {}
for brand in top_20:
brand_price[brand] = round(autos.loc[autos["brand"] == brand,"price"].mean(),2)
brand_tuple = []
for key in brand_price:
brand_tuple.append((brand_price[key],key))
brand_tuple = sorted(brand_tuple, reverse = True)
for value in brand_tuple:
print("{brand}: ${price}".format(brand = value[1], price = value[0]))
sonstige_autos: $11610.63 mini: $10613.46 audi: $9336.69 mercedes_benz: $8628.45 bmw: $8236.94 skoda: $6368.0 volkswagen: $5402.41 hyundai: $5365.25 toyota: $5167.09 volvo: $4946.5 nissan: $4743.4 seat: $4397.23 mazda: $4112.6 citroen: $3779.14 ford: $3749.47 smart: $3580.22 peugeot: $3094.02 opel: $2975.24 fiat: $2813.75 renault: $2474.86
Out of the top 20 most popular car brands, sonstige_autos and minis tend to have the highest used car resale price, while fiats and renauls tend to have the lowest.
For each brand, will compare the average price with the average mileage to see if there are any observable trends.
brand_mileage = {}
for brand in top_20:
brand_mileage[brand] = autos.loc[autos["brand"] == brand, "odometer_km"].mean()
price_series = pd.Series(brand_price)
brand_df = pd.DataFrame(price_series, columns = ["mean_price"])
brand_df["mean_mileage"] = pd.Series(brand_mileage)
print(brand_df.sort_values("mean_price",ascending = False))
mean_price mean_mileage sonstige_autos 11610.63 90109.409190 mini 10613.46 88105.134474 audi 9336.69 129157.386785 mercedes_benz 8628.45 130788.363313 bmw 8236.94 132619.279455 skoda 6368.00 110848.563969 volkswagen 5402.41 128707.158791 hyundai 5365.25 106442.307692 toyota 5167.09 115944.350759 volvo 4946.50 138067.915691 nissan 4743.40 118330.995792 seat 4397.23 121131.301290 mazda 4112.60 124464.033850 citroen 3779.14 119694.189602 ford 3749.47 124266.012872 smart 3580.22 99326.777610 peugeot 3094.02 127153.625269 opel 2975.24 129310.035842 fiat 2813.75 117121.971596 renault 2474.86 128071.331213
There appears to be no identifiable correlation between the average price and average mileage for each brand. This implies that there is an intrinsic value to a car brand not linked to mileage.
Want to replace categorical data that is written in German with English words using a mapping dictionary.
print(autos.head())
date_crawled name \ 0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... price abtest vehicle_type registration_year gearbox powerPS model \ 0 5000 control bus 2004 manuell 158 andere 1 8500 control limousine 1997 automatik 286 7er 2 8990 test limousine 2009 manuell 102 golf 3 4350 control kleinwagen 2007 automatik 71 fortwo 4 1350 test kombi 2003 manuell 0 focus odometer_km registration_month fuel_type brand unrepaired_damage \ 0 150000 3 lpg peugeot nein 1 150000 6 benzin bmw nein 2 70000 7 benzin volkswagen nein 3 70000 6 benzin smart nein 4 150000 7 benzin ford nein ad_created postal_code last_seen 0 2016-03-26 00:00:00 79588 2016-04-06 06:45:54 1 2016-04-04 00:00:00 71034 2016-04-06 14:45:08 2 2016-03-26 00:00:00 35394 2016-04-06 20:15:37 3 2016-03-12 00:00:00 33729 2016-03-15 03:16:28 4 2016-04-01 00:00:00 39218 2016-04-01 14:38:50
The vehicle_type, gearbox, fuel_type, and unrepaired_damage columns are all written in German. They will have to be translated to make the data more readable.
Will start with the vehicle_type column. Want to get list of all unique values in the column to make translation easier.
print(autos["vehicle_type"].unique())
['bus' 'limousine' 'kleinwagen' 'kombi' nan 'coupe' 'suv' 'cabrio' 'andere']
Map data values, convert null values to unknown.
vh_type_dict = {"bus":"bus","limousine":"limousine","kleinwagen":"small car","kombi":"station wagon",
"coupe":"coupe","suv":"suv","cabrio":"convertible","andere":"other"}
autos["vehicle_type"] = autos["vehicle_type"].map(vh_type_dict)
autos.loc[autos["vehicle_type"].isnull(),"vehicle_type"] = "unknown"
print(autos["vehicle_type"].unique())
['bus' 'limousine' 'small car' 'station wagon' 'unknown' 'coupe' 'suv' 'convertible' 'other']
Will repeat this process with the gearbox column.
print(autos["gearbox"].unique())
['manuell' 'automatik' nan]
gearbox_dict = {"manuell":"manual","automatik":"automatic"}
autos["gearbox"] = autos["gearbox"].map(gearbox_dict)
autos.loc[autos["gearbox"].isnull(),"gearbox"] = "unknown"
print(autos["gearbox"].unique())
['manual' 'automatic' 'unknown']
Will repeat this process with the fuel_type column.
print(autos["fuel_type"].unique())
['lpg' 'benzin' 'diesel' nan 'cng' 'hybrid' 'elektro' 'andere']
fuel_type_dict = {"lpg":"lpg","benzin":"petrol","diesel":"diesel","cng":"cng","hybrid":"hybrid","elektro":"electric",
"andere":"other"}
autos["fuel_type"] = autos["fuel_type"].map(fuel_type_dict)
autos.loc[autos["fuel_type"].isnull(),"fuel_type"] = "unknown"
print(autos["fuel_type"].unique())
['lpg' 'petrol' 'diesel' 'unknown' 'cng' 'hybrid' 'electric' 'other']
Will repeat this process with the unrepaired_damage column.
print(autos["unrepaired_damage"].unique())
['nein' nan 'ja']
un_dmg_dict = {"nein":"no","ja":"yes"}
autos["unrepaired_damage"] = autos["unrepaired_damage"].map(un_dmg_dict)
autos.loc[autos["unrepaired_damage"].isnull(),"unrepaired_damage"] = "unknown"
print(autos["unrepaired_damage"].unique())
['no' 'unknown' 'yes']
The next step in this process will be to convert dates to numeric and datetime data from string form. This will be for the ad_created column. Will update the old ad_created column and will create a new ad_created_dt column.
import datetime as dt
print(autos["ad_created"].head())
0 2016-03-26 00:00:00 1 2016-04-04 00:00:00 2 2016-03-26 00:00:00 3 2016-03-12 00:00:00 4 2016-04-01 00:00:00 Name: ad_created, dtype: object
The format of this column is year-month-day hour:minute:second.
autos["ad_created_dt"] = pd.to_datetime(autos["ad_created"])
autos["ad_created"] = autos["ad_created"].str[:10].str.replace("-","").astype(int)
print(autos["ad_created"].head(),"\n")
print(autos["ad_created_dt"].head())
0 20160326 1 20160404 2 20160326 3 20160312 4 20160401 Name: ad_created, dtype: int64 0 2016-03-26 1 2016-04-04 2 2016-03-26 3 2016-03-12 4 2016-04-01 Name: ad_created_dt, dtype: datetime64[ns]
Identify the most common brand/model combinations. Will do top three most common models for top ten most common brands.
top_10 = autos["brand"].value_counts()[:10].index
first = {}
second = {}
third = {}
for brand in top_10:
brand_model = autos.loc[autos["brand"] == brand, "model"].value_counts().index.to_list()
if len(brand_model) >= 1:
first[brand] = brand_model[0]
if len(brand_model) >= 2:
second[brand] = brand_model[1]
if len(brand_model) >= 3:
third[brand] = brand_model[2]
bm_series = pd.Series(first)
bm_df = pd.DataFrame(bm_series, columns = ["1"])
bm_df["2"] = pd.Series(second)
bm_df["3"] = pd.Series(third)
print(bm_df)
1 2 3 volkswagen golf polo passat bmw 3er 5er 1er opel corsa astra vectra mercedes_benz c_klasse e_klasse a_klasse audi a4 a3 a6 ford focus fiesta mondeo renault twingo clio megane peugeot 2_reihe 3_reihe 1_reihe fiat punto andere 500 seat ibiza leon arosa
Will look at trends in the odometer_km column to see if mileage affects price.
import math
km_max = autos["odometer_km"].max()
km_min = autos["odometer_km"].min()
km_diff = 5000
tuple_list = []
for x in range(km_min,km_max,km_diff):
tuple_list.append((x,x+km_diff))
km_dict = {}
for value in tuple_list:
km_dict[value] = round(autos.loc[autos["odometer_km"].between(value[0],value[1]),"price"].mean(), 2)
km_dict_up = {}
for key in km_dict:
if math.isnan(km_dict[key]) == False:
new_key = str(key[0]) + "-" + str(key[1])
km_dict_up[new_key] = km_dict[key]
km_df = pd.DataFrame(pd.Series(km_dict_up), columns = ["price"])
print(km_df.sort_values("price"))
price 145000-150000 3767.93 120000-125000 6214.02 125000-130000 6214.02 100000-105000 7991.29 95000-100000 7991.29 90000-95000 8465.03 85000-90000 8465.03 80000-85000 9721.95 75000-80000 9721.95 5000-10000 10526.88 65000-70000 10927.18 70000-75000 10927.18 55000-60000 12385.00 60000-65000 12385.00 45000-50000 13812.17 50000-55000 13812.17 40000-45000 15499.57 35000-40000 15499.57 30000-35000 16340.86 25000-30000 16340.86 20000-25000 17682.12 15000-20000 17682.12 10000-15000 20550.87
There is a correlation between mileage and average price. The higher mileage a car has, the lower the price it tends to be. However, there are a few exceptions, as cars with a mileage of 5000-10000 have a lower price than their mileage would suggest.
Want to compare the prices of damaged and undamaged cars. Cars are determined to be the same if they have the same brand and model.
autos = autos[(autos["brand"].isnull() == False) & (autos["model"].isnull() == False)]
brand_list = autos["brand"].unique()
brand_model = {}
for brand in brand_list:
brand_model[brand] = autos.loc[autos["brand"] == brand,"model"].unique()
bm_price_un = {}
bm_price_re = {}
for brand in brand_model:
for model in brand_model[brand]:
un_bool = (autos["brand"] == brand) & (autos["model"] == model) & (autos["unrepaired_damage"] == "yes")
bm_price_un[(brand,model)] = round(autos.loc[un_bool,"price"].mean(), 2)
re_bool = (autos["brand"] == brand) & (autos["model"] == model) & (autos["unrepaired_damage"] == "no")
bm_price_re[(brand,model)] = round(autos.loc[re_bool,"price"].mean(), 2)
bm_price_df = pd.DataFrame(pd.Series(bm_price_re), columns = ["undamaged"])
bm_price_df["damaged"] = pd.Series(bm_price_un)
bm_price_df["diff"] = bm_price_df["undamaged"] - bm_price_df["damaged"]
bm_price_df = bm_price_df.sort_values("diff", ascending = False)
bm_price_df.dropna(inplace = True)
print(bm_price_df.head(),"\n")
print(bm_price_df.tail())
undamaged damaged diff porsche 911 65710.31 20833.33 44876.98 land_rover range_rover 25411.43 1800.00 23611.43 defender 33009.12 12245.00 20764.12 bmw m_reihe 28971.98 8900.00 20071.98 mercedes_benz g_klasse 25956.74 6375.00 19581.74 undamaged damaged diff mercedes_benz cl 13229.33 14425.00 -1195.67 volkswagen andere 5564.46 7116.67 -1552.21 lancia andere 4913.00 8750.00 -3837.00 mercedes_benz viano 20365.25 24333.33 -3968.08 volkswagen touareg 16169.65 20933.33 -4763.68
Porsche 911s have the largest postive difference between the undamaged and damaged average prices, while the volkswagen touareg has the largest negative difference.
print(round(bm_price_df["diff"].mean(), 2))
3817.21
Cars that are undamaged tend to be on average $3817.21 more expensive than if they were damaged.
Want to look at distributions of car price by month, will add a column to the autos dataframe with the month of each listing from the ad_created_dt column of datetime objects.
autos["month"] = autos["ad_created_dt"].dt.strftime("%b")
print(autos.head())
date_crawled name \ 0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... price abtest vehicle_type registration_year gearbox powerPS \ 0 5000 control bus 2004 manual 158 1 8500 control limousine 1997 automatic 286 2 8990 test limousine 2009 manual 102 3 4350 control small car 2007 automatic 71 4 1350 test station wagon 2003 manual 0 model odometer_km registration_month fuel_type brand \ 0 andere 150000 3 lpg peugeot 1 7er 150000 6 petrol bmw 2 golf 70000 7 petrol volkswagen 3 fortwo 70000 6 petrol smart 4 focus 150000 7 petrol ford unrepaired_damage ad_created postal_code last_seen \ 0 no 20160326 79588 2016-04-06 06:45:54 1 no 20160404 71034 2016-04-06 14:45:08 2 no 20160326 35394 2016-04-06 20:15:37 3 no 20160312 33729 2016-03-15 03:16:28 4 no 20160401 39218 2016-04-01 14:38:50 ad_created_dt month 0 2016-03-26 Mar 1 2016-04-04 Apr 2 2016-03-26 Mar 3 2016-03-12 Mar 4 2016-04-01 Apr
month_list = autos["month"].unique()
month_p_dict = {}
for month in month_list:
month_p_dict[month] = round(autos.loc[autos["month"] == month,"price"].mean(), 2)
tuple_list = []
for key in month_p_dict:
tuple_list.append((month_p_dict[key], key))
tuple_list = sorted(tuple_list, reverse = True)
for entry in tuple_list:
print("{month}: ${price}".format(month = entry[1], price = entry[0]))
Jun: $47900.0 Sep: $9500.0 Jan: $9182.43 Feb: $7198.63 Apr: $6257.89 Mar: $5935.49 Nov: $5499.0 Aug: $5200.0 Dec: $1650.0
June appears to be the most profitable time of the year to sell cars, while december seems like the least.
Thanks for taking the time to look through my project, any and all feedback is greatly appreciated!