eBay Car Sales Data Exploration

This project will explore used car sale data on eBay Kleinanzeigen, the classified information portion of the German eBay site. For each of the 50,000 listings, various attributes are given, such as the car's brand, fuel type and kilometers driven. The data will first be cleaned, and then analyzed for trends.

Initialization

Import necessary libraries and store .csv file data as a dataframe.

In [1]:
import numpy as np
import pandas as pd

autos = pd.read_csv("autos3.csv", encoding = "Latin-1")

Get information about the dataframe.

In [2]:
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

Print first five rows of dataset.

In [3]:
print(autos.head())
           dateCrawled                                               name  \
0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD   
1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik   
2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United   
3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...   
4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...   

   seller offerType   price   abtest vehicleType  yearOfRegistration  \
0  privat   Angebot  $5,000  control         bus                2004   
1  privat   Angebot  $8,500  control   limousine                1997   
2  privat   Angebot  $8,990     test   limousine                2009   
3  privat   Angebot  $4,350  control  kleinwagen                2007   
4  privat   Angebot  $1,350     test       kombi                2003   

     gearbox  powerPS   model   odometer  monthOfRegistration fuelType  \
0    manuell      158  andere  150,000km                    3      lpg   
1  automatik      286     7er  150,000km                    6   benzin   
2    manuell      102    golf   70,000km                    7   benzin   
3  automatik       71  fortwo   70,000km                    6   benzin   
4    manuell        0   focus  150,000km                    7   benzin   

        brand notRepairedDamage          dateCreated  nrOfPictures  \
0     peugeot              nein  2016-03-26 00:00:00             0   
1         bmw              nein  2016-04-04 00:00:00             0   
2  volkswagen              nein  2016-03-26 00:00:00             0   
3       smart              nein  2016-03-12 00:00:00             0   
4        ford              nein  2016-04-01 00:00:00             0   

   postalCode             lastSeen  
0       79588  2016-04-06 06:45:54  
1       71034  2016-04-06 14:45:08  
2       35394  2016-04-06 20:15:37  
3       33729  2016-03-15 03:16:28  
4       39218  2016-04-01 14:38:50  

From the first few rows, there are a lot of columns with values written in German, which makes sense because the data was scraped from the German version of eBay. There are also quite a few NaN values that should be investigated, especially in the notRepairedDamage column.

Data Cleaning

Print the column names for the dataframe.

In [4]:
print(autos.columns)
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

Want to change column names from camelcase to snakecase.

In [5]:
new_columns = ["date_crawled", "name", "seller", "offer_type", "price", "abtest", "vehicle_type", 
               "registration_year", "gearbox", "powerPS", "model", "odometer", "registration_month", 
               "fuel_type", "brand", "unrepaired_damage", "ad_created", "nr_of_pictures", 
               "postal_code", "last_seen"]

autos.columns = new_columns
print(autos.head())
          date_crawled                                               name  \
0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD   
1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik   
2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United   
3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...   
4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...   

   seller offer_type   price   abtest vehicle_type  registration_year  \
0  privat    Angebot  $5,000  control          bus               2004   
1  privat    Angebot  $8,500  control    limousine               1997   
2  privat    Angebot  $8,990     test    limousine               2009   
3  privat    Angebot  $4,350  control   kleinwagen               2007   
4  privat    Angebot  $1,350     test        kombi               2003   

     gearbox  powerPS   model   odometer  registration_month fuel_type  \
0    manuell      158  andere  150,000km                   3       lpg   
1  automatik      286     7er  150,000km                   6    benzin   
2    manuell      102    golf   70,000km                   7    benzin   
3  automatik       71  fortwo   70,000km                   6    benzin   
4    manuell        0   focus  150,000km                   7    benzin   

        brand unrepaired_damage           ad_created  nr_of_pictures  \
0     peugeot              nein  2016-03-26 00:00:00               0   
1         bmw              nein  2016-04-04 00:00:00               0   
2  volkswagen              nein  2016-03-26 00:00:00               0   
3       smart              nein  2016-03-12 00:00:00               0   
4        ford              nein  2016-04-01 00:00:00               0   

   postal_code            last_seen  
0        79588  2016-04-06 06:45:54  
1        71034  2016-04-06 14:45:08  
2        35394  2016-04-06 20:15:37  
3        33729  2016-03-15 03:16:28  
4        39218  2016-04-01 14:38:50  

Get descriptive statistics for all columns.

In [6]:
autos.describe(include = "all")
Out[6]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox powerPS model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-05 16:57:05 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

The seller and offer_type columns are mostly one value, so they could be dropped. Also, the nr_of_pictures columns has zero for all values. Price and odometer columns are stored as text, so need to be cleaned so data is in numeric form.

Remove the seller, offer_type, and nr_of_pictures columns.

In [7]:
autos = autos.drop(columns = ["seller","offer_type","nr_of_pictures"])

Remove non-numeric characters and convert price and odometer columns to numeric dtype.

In [8]:
autos["price"] = autos["price"].str.replace(",","",regex=True).str.replace("$","",regex=True).astype(int)

autos["odometer"] = autos["odometer"].str.replace(",","",regex=True).str.replace("km","").astype(int)
autos.rename({"odometer":"odometer_km"},axis=1,inplace=True)

print(autos.head())
          date_crawled                                               name  \
0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD   
1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik   
2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United   
3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...   
4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...   

   price   abtest vehicle_type  registration_year    gearbox  powerPS   model  \
0   5000  control          bus               2004    manuell      158  andere   
1   8500  control    limousine               1997  automatik      286     7er   
2   8990     test    limousine               2009    manuell      102    golf   
3   4350  control   kleinwagen               2007  automatik       71  fortwo   
4   1350     test        kombi               2003    manuell        0   focus   

   odometer_km  registration_month fuel_type       brand unrepaired_damage  \
0       150000                   3       lpg     peugeot              nein   
1       150000                   6    benzin         bmw              nein   
2        70000                   7    benzin  volkswagen              nein   
3        70000                   6    benzin       smart              nein   
4       150000                   7    benzin        ford              nein   

            ad_created  postal_code            last_seen  
0  2016-03-26 00:00:00        79588  2016-04-06 06:45:54  
1  2016-04-04 00:00:00        71034  2016-04-06 14:45:08  
2  2016-03-26 00:00:00        35394  2016-04-06 20:15:37  
3  2016-03-12 00:00:00        33729  2016-03-15 03:16:28  
4  2016-04-01 00:00:00        39218  2016-04-01 14:38:50  

Explore the price column.

In [9]:
print(autos["price"].unique().shape[0],"\n")
print(autos["price"].describe(),"\n")
print(autos["price"].value_counts().sort_index(ascending = False).head(),"\n")
print(autos["price"].value_counts().sort_index().head())
2357 

count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64 

99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
Name: price, dtype: int64 

0    1421
1     156
2       3
3       1
5       2
Name: price, dtype: int64

Explore the odometer_km column.

In [10]:
print(autos["odometer_km"].unique().shape[0],"\n")
print(autos["odometer_km"].describe(),"\n")
print(autos["odometer_km"].value_counts().sort_index(ascending = False).head(),"\n")
print(autos["odometer_km"].value_counts().sort_index().head())
13 

count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64 

150000    32424
125000     5170
100000     2169
90000      1757
80000      1436
Name: odometer_km, dtype: int64 

5000     967
10000    264
20000    784
30000    789
40000    819
Name: odometer_km, dtype: int64

While the range of 5,000 to 150,000 kilometers is reasonable for the odometer_km column, there are outliers in the price column. A price of 0 does not make sense, while a price of 10,000,000 is too high. Values outside the price range of 1 to 200,000 will be removed.

In [11]:
autos = autos[autos["price"].between(1,200000)]
print(autos["price"].describe())
count     48557.000000
mean       5842.888914
std        8301.871663
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7450.000000
max      198000.000000
Name: price, dtype: float64

Will now explore date values in the date_crawled, ad_created, and last_seen columns, will look at percentage distributions.

Start with date_crawled column.

In [12]:
print(autos["date_crawled"].str[:10].value_counts(normalize = True, dropna = False).sort_index())
2016-03-05    0.025331
2016-03-06    0.014045
2016-03-07    0.036020
2016-03-08    0.033301
2016-03-09    0.033095
2016-03-10    0.032189
2016-03-11    0.032580
2016-03-12    0.036926
2016-03-13    0.015672
2016-03-14    0.036555
2016-03-15    0.034290
2016-03-16    0.029615
2016-03-17    0.031633
2016-03-18    0.012913
2016-03-19    0.034784
2016-03-20    0.037894
2016-03-21    0.037358
2016-03-22    0.032992
2016-03-23    0.032210
2016-03-24    0.029347
2016-03-25    0.031612
2016-03-26    0.032210
2016-03-27    0.031077
2016-03-28    0.034825
2016-03-29    0.034104
2016-03-30    0.033672
2016-03-31    0.031839
2016-04-01    0.033692
2016-04-02    0.035484
2016-04-03    0.038594
2016-04-04    0.036493
2016-04-05    0.013077
2016-04-06    0.003172
2016-04-07    0.001400
Name: date_crawled, dtype: float64

The date_crawled column data is fairly evenly distributed, with each day containing roughly 3% of the listings.

In [13]:
print(autos["ad_created"].str[:10].value_counts(normalize = True, dropna = False).sort_index())
2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
                ...   
2016-04-03    0.038841
2016-04-04    0.036864
2016-04-05    0.011801
2016-04-06    0.003254
2016-04-07    0.001256
Name: ad_created, Length: 76, dtype: float64

The ad_created column has data spanning a much larger range than the date_crawled and last_seen columns.

In [14]:
print(autos["last_seen"].str[:10].value_counts(normalize = True, dropna = False).sort_index())
2016-03-05    0.001071
2016-03-06    0.004325
2016-03-07    0.005396
2016-03-08    0.007414
2016-03-09    0.009597
2016-03-10    0.010668
2016-03-11    0.012377
2016-03-12    0.023786
2016-03-13    0.008897
2016-03-14    0.012604
2016-03-15    0.015878
2016-03-16    0.016455
2016-03-17    0.028091
2016-03-18    0.007352
2016-03-19    0.015837
2016-03-20    0.020656
2016-03-21    0.020636
2016-03-22    0.021377
2016-03-23    0.018535
2016-03-24    0.019771
2016-03-25    0.019215
2016-03-26    0.016805
2016-03-27    0.015631
2016-03-28    0.020841
2016-03-29    0.022345
2016-03-30    0.024775
2016-03-31    0.023786
2016-04-01    0.022798
2016-04-02    0.024919
2016-04-03    0.025187
2016-04-04    0.024487
2016-04-05    0.124740
2016-04-06    0.221801
2016-04-07    0.131948
Name: last_seen, dtype: float64

The most common dates for the last_seen column are April 5th, 6th, and 7th in 2016.

In [15]:
print(autos["registration_year"].describe())
count    48557.000000
mean      2004.754968
std         88.651037
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

There are clearly outliers in the registration year column, as there were no cars in the year 1000, and the year 9999 has not happened yet. The data in this column will have to be cleaned.

Will remove all listings for which the registration_year value is not between 1900 and 2016.

In [16]:
num_range = autos[autos["registration_year"].between(1900,2016)].shape[0]
print(num_range)

autos = autos[autos["registration_year"].between(1900,2016)]

print(autos["registration_year"].describe())
46673
count    46673.000000
mean      2002.909969
std          7.183604
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64

Data Analysis

Will analyze the top 20 most popular car brands based on average price to find patterns.

In [17]:
top_20 = autos["brand"].value_counts()[:20].index

brand_price = {}
for brand in top_20:
    brand_price[brand] = round(autos.loc[autos["brand"] == brand,"price"].mean(),2)
    
brand_tuple = []
for key in brand_price:
    brand_tuple.append((brand_price[key],key))
    
brand_tuple = sorted(brand_tuple, reverse = True)
for value in brand_tuple:
    print("{brand}: ${price}".format(brand = value[1], price = value[0]))
sonstige_autos: $11610.63
mini: $10613.46
audi: $9336.69
mercedes_benz: $8628.45
bmw: $8236.94
skoda: $6368.0
volkswagen: $5402.41
hyundai: $5365.25
toyota: $5167.09
volvo: $4946.5
nissan: $4743.4
seat: $4397.23
mazda: $4112.6
citroen: $3779.14
ford: $3749.47
smart: $3580.22
peugeot: $3094.02
opel: $2975.24
fiat: $2813.75
renault: $2474.86

Out of the top 20 most popular car brands, sonstige_autos and minis tend to have the highest used car resale price, while fiats and renauls tend to have the lowest.

For each brand, will compare the average price with the average mileage to see if there are any observable trends.

In [18]:
brand_mileage = {}
for brand in top_20:
    brand_mileage[brand] = autos.loc[autos["brand"] == brand, "odometer_km"].mean()
    
price_series = pd.Series(brand_price)
brand_df = pd.DataFrame(price_series, columns = ["mean_price"])
brand_df["mean_mileage"] = pd.Series(brand_mileage)

print(brand_df.sort_values("mean_price",ascending = False))
                mean_price   mean_mileage
sonstige_autos    11610.63   90109.409190
mini              10613.46   88105.134474
audi               9336.69  129157.386785
mercedes_benz      8628.45  130788.363313
bmw                8236.94  132619.279455
skoda              6368.00  110848.563969
volkswagen         5402.41  128707.158791
hyundai            5365.25  106442.307692
toyota             5167.09  115944.350759
volvo              4946.50  138067.915691
nissan             4743.40  118330.995792
seat               4397.23  121131.301290
mazda              4112.60  124464.033850
citroen            3779.14  119694.189602
ford               3749.47  124266.012872
smart              3580.22   99326.777610
peugeot            3094.02  127153.625269
opel               2975.24  129310.035842
fiat               2813.75  117121.971596
renault            2474.86  128071.331213

There appears to be no identifiable correlation between the average price and average mileage for each brand. This implies that there is an intrinsic value to a car brand not linked to mileage.

Back to Data Cleaning

Want to replace categorical data that is written in German with English words using a mapping dictionary.

In [19]:
print(autos.head())
          date_crawled                                               name  \
0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD   
1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik   
2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United   
3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...   
4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...   

   price   abtest vehicle_type  registration_year    gearbox  powerPS   model  \
0   5000  control          bus               2004    manuell      158  andere   
1   8500  control    limousine               1997  automatik      286     7er   
2   8990     test    limousine               2009    manuell      102    golf   
3   4350  control   kleinwagen               2007  automatik       71  fortwo   
4   1350     test        kombi               2003    manuell        0   focus   

   odometer_km  registration_month fuel_type       brand unrepaired_damage  \
0       150000                   3       lpg     peugeot              nein   
1       150000                   6    benzin         bmw              nein   
2        70000                   7    benzin  volkswagen              nein   
3        70000                   6    benzin       smart              nein   
4       150000                   7    benzin        ford              nein   

            ad_created  postal_code            last_seen  
0  2016-03-26 00:00:00        79588  2016-04-06 06:45:54  
1  2016-04-04 00:00:00        71034  2016-04-06 14:45:08  
2  2016-03-26 00:00:00        35394  2016-04-06 20:15:37  
3  2016-03-12 00:00:00        33729  2016-03-15 03:16:28  
4  2016-04-01 00:00:00        39218  2016-04-01 14:38:50  

The vehicle_type, gearbox, fuel_type, and unrepaired_damage columns are all written in German. They will have to be translated to make the data more readable.

Will start with the vehicle_type column. Want to get list of all unique values in the column to make translation easier.

In [20]:
print(autos["vehicle_type"].unique())
['bus' 'limousine' 'kleinwagen' 'kombi' nan 'coupe' 'suv' 'cabrio'
 'andere']

Map data values, convert null values to unknown.

In [21]:
vh_type_dict = {"bus":"bus","limousine":"limousine","kleinwagen":"small car","kombi":"station wagon",
                "coupe":"coupe","suv":"suv","cabrio":"convertible","andere":"other"}

autos["vehicle_type"] = autos["vehicle_type"].map(vh_type_dict)
autos.loc[autos["vehicle_type"].isnull(),"vehicle_type"] = "unknown"

print(autos["vehicle_type"].unique())
['bus' 'limousine' 'small car' 'station wagon' 'unknown' 'coupe' 'suv'
 'convertible' 'other']

Will repeat this process with the gearbox column.

In [22]:
print(autos["gearbox"].unique())
['manuell' 'automatik' nan]
In [23]:
gearbox_dict = {"manuell":"manual","automatik":"automatic"}

autos["gearbox"] = autos["gearbox"].map(gearbox_dict)
autos.loc[autos["gearbox"].isnull(),"gearbox"] = "unknown"

print(autos["gearbox"].unique())
['manual' 'automatic' 'unknown']

Will repeat this process with the fuel_type column.

In [24]:
print(autos["fuel_type"].unique())
['lpg' 'benzin' 'diesel' nan 'cng' 'hybrid' 'elektro' 'andere']
In [25]:
fuel_type_dict = {"lpg":"lpg","benzin":"petrol","diesel":"diesel","cng":"cng","hybrid":"hybrid","elektro":"electric",
                  "andere":"other"}

autos["fuel_type"] = autos["fuel_type"].map(fuel_type_dict)
autos.loc[autos["fuel_type"].isnull(),"fuel_type"] = "unknown"

print(autos["fuel_type"].unique())
['lpg' 'petrol' 'diesel' 'unknown' 'cng' 'hybrid' 'electric' 'other']

Will repeat this process with the unrepaired_damage column.

In [26]:
print(autos["unrepaired_damage"].unique())
['nein' nan 'ja']
In [27]:
un_dmg_dict = {"nein":"no","ja":"yes"}

autos["unrepaired_damage"] = autos["unrepaired_damage"].map(un_dmg_dict)
autos.loc[autos["unrepaired_damage"].isnull(),"unrepaired_damage"] = "unknown"

print(autos["unrepaired_damage"].unique())
['no' 'unknown' 'yes']

The next step in this process will be to convert dates to numeric and datetime data from string form. This will be for the ad_created column. Will update the old ad_created column and will create a new ad_created_dt column.

In [28]:
import datetime as dt

print(autos["ad_created"].head())
0    2016-03-26 00:00:00
1    2016-04-04 00:00:00
2    2016-03-26 00:00:00
3    2016-03-12 00:00:00
4    2016-04-01 00:00:00
Name: ad_created, dtype: object

The format of this column is year-month-day hour:minute:second.

In [29]:
autos["ad_created_dt"] = pd.to_datetime(autos["ad_created"])
autos["ad_created"] = autos["ad_created"].str[:10].str.replace("-","").astype(int)

print(autos["ad_created"].head(),"\n")
print(autos["ad_created_dt"].head())
0    20160326
1    20160404
2    20160326
3    20160312
4    20160401
Name: ad_created, dtype: int64 

0   2016-03-26
1   2016-04-04
2   2016-03-26
3   2016-03-12
4   2016-04-01
Name: ad_created_dt, dtype: datetime64[ns]

Back to Data Analysis

Identify the most common brand/model combinations. Will do top three most common models for top ten most common brands.

In [30]:
top_10 = autos["brand"].value_counts()[:10].index
first = {}
second = {}
third = {}

for brand in top_10:
    brand_model = autos.loc[autos["brand"] == brand, "model"].value_counts().index.to_list()
    if len(brand_model) >= 1:
        first[brand] = brand_model[0]
        if len(brand_model) >= 2:
            second[brand] = brand_model[1]
            if len(brand_model) >= 3:
                third[brand] = brand_model[2]
    
bm_series = pd.Series(first)
bm_df = pd.DataFrame(bm_series, columns = ["1"])
bm_df["2"] = pd.Series(second)
bm_df["3"] = pd.Series(third)

print(bm_df)
                      1         2         3
volkswagen         golf      polo    passat
bmw                 3er       5er       1er
opel              corsa     astra    vectra
mercedes_benz  c_klasse  e_klasse  a_klasse
audi                 a4        a3        a6
ford              focus    fiesta    mondeo
renault          twingo      clio    megane
peugeot         2_reihe   3_reihe   1_reihe
fiat              punto    andere       500
seat              ibiza      leon     arosa

Will look at trends in the odometer_km column to see if mileage affects price.

In [31]:
import math

km_max = autos["odometer_km"].max()
km_min = autos["odometer_km"].min()
km_diff = 5000

tuple_list = []
for x in range(km_min,km_max,km_diff):
    tuple_list.append((x,x+km_diff))

km_dict = {}
for value in tuple_list:
    km_dict[value] = round(autos.loc[autos["odometer_km"].between(value[0],value[1]),"price"].mean(), 2)

km_dict_up = {}
for key in km_dict:
    if math.isnan(km_dict[key]) == False:
        new_key = str(key[0]) + "-" + str(key[1])
        km_dict_up[new_key] = km_dict[key]
        
km_df = pd.DataFrame(pd.Series(km_dict_up), columns = ["price"])
print(km_df.sort_values("price"))
                  price
145000-150000   3767.93
120000-125000   6214.02
125000-130000   6214.02
100000-105000   7991.29
95000-100000    7991.29
90000-95000     8465.03
85000-90000     8465.03
80000-85000     9721.95
75000-80000     9721.95
5000-10000     10526.88
65000-70000    10927.18
70000-75000    10927.18
55000-60000    12385.00
60000-65000    12385.00
45000-50000    13812.17
50000-55000    13812.17
40000-45000    15499.57
35000-40000    15499.57
30000-35000    16340.86
25000-30000    16340.86
20000-25000    17682.12
15000-20000    17682.12
10000-15000    20550.87

There is a correlation between mileage and average price. The higher mileage a car has, the lower the price it tends to be. However, there are a few exceptions, as cars with a mileage of 5000-10000 have a lower price than their mileage would suggest.

Want to compare the prices of damaged and undamaged cars. Cars are determined to be the same if they have the same brand and model.

In [32]:
autos = autos[(autos["brand"].isnull() == False) & (autos["model"].isnull() == False)]

brand_list = autos["brand"].unique()
brand_model = {}

for brand in brand_list:
    brand_model[brand] = autos.loc[autos["brand"] == brand,"model"].unique()
    
bm_price_un = {}
bm_price_re = {}
for brand in brand_model:
    for model in brand_model[brand]:
        un_bool = (autos["brand"] == brand) & (autos["model"] == model) & (autos["unrepaired_damage"] == "yes")
        bm_price_un[(brand,model)] = round(autos.loc[un_bool,"price"].mean(), 2)
        
        re_bool = (autos["brand"] == brand) & (autos["model"] == model) & (autos["unrepaired_damage"] == "no")
        bm_price_re[(brand,model)] = round(autos.loc[re_bool,"price"].mean(), 2)
        
bm_price_df = pd.DataFrame(pd.Series(bm_price_re), columns = ["undamaged"])
bm_price_df["damaged"] = pd.Series(bm_price_un)

bm_price_df["diff"] = bm_price_df["undamaged"] - bm_price_df["damaged"]

bm_price_df = bm_price_df.sort_values("diff", ascending = False)
bm_price_df.dropna(inplace = True)

print(bm_price_df.head(),"\n")
print(bm_price_df.tail())
                           undamaged   damaged      diff
porsche       911           65710.31  20833.33  44876.98
land_rover    range_rover   25411.43   1800.00  23611.43
              defender      33009.12  12245.00  20764.12
bmw           m_reihe       28971.98   8900.00  20071.98
mercedes_benz g_klasse      25956.74   6375.00  19581.74 

                       undamaged   damaged     diff
mercedes_benz cl        13229.33  14425.00 -1195.67
volkswagen    andere     5564.46   7116.67 -1552.21
lancia        andere     4913.00   8750.00 -3837.00
mercedes_benz viano     20365.25  24333.33 -3968.08
volkswagen    touareg   16169.65  20933.33 -4763.68

Porsche 911s have the largest postive difference between the undamaged and damaged average prices, while the volkswagen touareg has the largest negative difference.

In [33]:
print(round(bm_price_df["diff"].mean(), 2))
3817.21

Cars that are undamaged tend to be on average $3817.21 more expensive than if they were damaged.

Want to look at distributions of car price by month, will add a column to the autos dataframe with the month of each listing from the ad_created_dt column of datetime objects.

In [34]:
autos["month"] = autos["ad_created_dt"].dt.strftime("%b")
print(autos.head())
          date_crawled                                               name  \
0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD   
1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik   
2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United   
3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...   
4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...   

   price   abtest   vehicle_type  registration_year    gearbox  powerPS  \
0   5000  control            bus               2004     manual      158   
1   8500  control      limousine               1997  automatic      286   
2   8990     test      limousine               2009     manual      102   
3   4350  control      small car               2007  automatic       71   
4   1350     test  station wagon               2003     manual        0   

    model  odometer_km  registration_month fuel_type       brand  \
0  andere       150000                   3       lpg     peugeot   
1     7er       150000                   6    petrol         bmw   
2    golf        70000                   7    petrol  volkswagen   
3  fortwo        70000                   6    petrol       smart   
4   focus       150000                   7    petrol        ford   

  unrepaired_damage  ad_created  postal_code            last_seen  \
0                no    20160326        79588  2016-04-06 06:45:54   
1                no    20160404        71034  2016-04-06 14:45:08   
2                no    20160326        35394  2016-04-06 20:15:37   
3                no    20160312        33729  2016-03-15 03:16:28   
4                no    20160401        39218  2016-04-01 14:38:50   

  ad_created_dt month  
0    2016-03-26   Mar  
1    2016-04-04   Apr  
2    2016-03-26   Mar  
3    2016-03-12   Mar  
4    2016-04-01   Apr  
In [35]:
month_list = autos["month"].unique()
month_p_dict = {}
for month in month_list:
    month_p_dict[month] = round(autos.loc[autos["month"] == month,"price"].mean(), 2)
    
tuple_list = []
for key in month_p_dict:
    tuple_list.append((month_p_dict[key], key))
    
tuple_list = sorted(tuple_list, reverse = True)
for entry in tuple_list:
    print("{month}: ${price}".format(month = entry[1], price = entry[0]))
    
Jun: $47900.0
Sep: $9500.0
Jan: $9182.43
Feb: $7198.63
Apr: $6257.89
Mar: $5935.49
Nov: $5499.0
Aug: $5200.0
Dec: $1650.0

June appears to be the most profitable time of the year to sell cars, while december seems like the least.

Thanks for taking the time to look through my project, any and all feedback is greatly appreciated!

In [ ]: