Guided Project: Exploring eBay Car Sales Data

Introduction

In this project we'll use an adapted version of the dataset of used cars from eBay Kleinanzeigen, a classifies section of the German eBay website. The original dataset is available on: https://data.world/data-society/used-cars-data.

Data dictionary

Feature Description
dateCrawled When this ad was first crawled. All field-values are taken from this date.
name Name of the car.
seller Whether the seller is private or a dealer.
offerType The type of listing.
price The price on the ad to sell the car.
abtest Whether the listing is included in an A/B test.
vehicleType The vehicle Type.
yearOfRegistration The year in which the car was first registered.
gearbox The transmission type.
powerPS The power of the car in PS.
model The car model name.
kilometer How many kilometers the car has driven.
monthOfRegistration The month in which the car was first registered.
fuelType What type of fuel the car uses.
brand The brand of the car.
notRepairedDamage If the car has a damage which is not yet repaired.
dateCreated The date on which the eBay listing was created.
nrOfPictures The number of pictures in the ad.
postalCode The postal code for the location of the vehicle.
lastSeenOnline When the crawler saw this ad last online.

Aim of the project

The aim of this project is to clean and the data and analyze the included used cars listings.

Reading the CSV File

In [1]:
import pandas as pd
import numpy as np

autos = pd.read_csv('autos.csv', encoding = 'Latin-1', sep = ',')
In [2]:
autos.head()
Out[2]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
In [3]:
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

First observations:

  • There are some columns with missing data: vehicleType, gearbox, model, fuelType, notRepairedDamage.
  • The dataset has not float64 dtype for any column.
  • The column names use camelcase instead of Python's preferred snakecase.

In response to the last point, all columns will be converted from camelcase to snakecase according to the data dictionary and some of them will be reworded.

In [4]:
autos.columns
Out[4]:
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')
In [5]:
autos.rename({'yearOfRegistration':'registration_year'}, axis = 1, inplace = True)
autos.rename({'monthOfRegistration':'registration_month'}, axis = 1, inplace = True)
autos.rename({'notRepairedDamage':'unrepaired_damage'}, axis = 1, inplace = True)
autos.rename({'dateCreated':'ad_created'}, axis = 1, inplace = True)

new_names_columns = []

for name in autos.columns:
  name = name.lower()
  new_names_columns.append(name)

autos.columns = new_names_columns
In [6]:
autos.columns
Out[6]:
Index(['datecrawled', 'name', 'seller', 'offertype', 'price', 'abtest',
       'vehicletype', 'registration_year', 'gearbox', 'powerps', 'model',
       'odometer', 'registration_month', 'fueltype', 'brand',
       'unrepaired_damage', 'ad_created', 'nrofpictures', 'postalcode',
       'lastseen'],
      dtype='object')

Data Exploration

In [7]:
autos.describe(include = 'all')
Out[7]:
datecrawled name seller offertype price abtest vehicletype registration_year gearbox powerps model odometer registration_month fueltype brand unrepaired_damage ad_created nrofpictures postalcode lastseen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-05 16:57:05 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN

First observations:

  • Some columns have just two values (seller, offertype, abtest, gearbox, unrepaired_damage). It can be worth investigating if they can be dropped.
  • price and odometer should be stored as a numeric type, not text.

In response to the last point, the columns price and odometer will be converted from text to numeric dtype. The odometer column will be renamed.

Exploring "odometer" and "price" columns

In [8]:
autos["price"] = autos["price"].str.replace('$','')
autos["price"] = autos["price"].str.replace(',','')
autos["odometer"] = autos["odometer"].str.replace('km','')
autos["odometer"] = autos["odometer"].str.replace(',','')

autos["price"] = autos["price"].astype(float)
autos["odometer"] = autos["odometer"].astype(float)

autos.rename({'odometer':'odometer_km'}, axis = 1, inplace = True)
In [9]:
autos.head()
Out[9]:
datecrawled name seller offertype price abtest vehicletype registration_year gearbox powerps model odometer_km registration_month fueltype brand unrepaired_damage ad_created nrofpictures postalcode lastseen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot 5000.0 control bus 2004 manuell 158 andere 150000.0 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot 8500.0 control limousine 1997 automatik 286 7er 150000.0 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot 8990.0 test limousine 2009 manuell 102 golf 70000.0 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot 4350.0 control kleinwagen 2007 automatik 71 fortwo 70000.0 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot 1350.0 test kombi 2003 manuell 0 focus 150000.0 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
In [10]:
autos.dtypes
Out[10]:
datecrawled            object
name                   object
seller                 object
offertype              object
price                 float64
abtest                 object
vehicletype            object
registration_year       int64
gearbox                object
powerps                 int64
model                  object
odometer_km           float64
registration_month      int64
fueltype               object
brand                  object
unrepaired_damage      object
ad_created             object
nrofpictures            int64
postalcode              int64
lastseen               object
dtype: object

To continue exploring data, the odometer_km and price columns will be analyzed closer.

In [11]:
# Number of unique values in the odometer_km column:
autos["odometer_km"].unique().shape
Out[11]:
(13,)
In [12]:
# Statistics for the odometer_km column:
autos["odometer_km"].describe()
Out[12]:
count     50000.000000
mean     125732.700000
std       40042.211706
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
In [13]:
# Values distribution for the odometer_km column:
autos["odometer_km"].value_counts().sort_values(ascending = False)
Out[13]:
150000.0    32424
125000.0     5170
100000.0     2169
90000.0      1757
80000.0      1436
70000.0      1230
60000.0      1164
50000.0      1027
5000.0        967
40000.0       819
30000.0       789
20000.0       784
10000.0       264
Name: odometer_km, dtype: int64
In [14]:
# Number of unique values in price column:
autos["price"].unique().shape
Out[14]:
(2357,)
In [15]:
# Statistics for the price column:
autos["price"].describe()
Out[15]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
In [16]:
# Values distribution for the price column:
autos["price"].value_counts().sort_index(ascending = False).head(10)
Out[16]:
99999999.0    1
27322222.0    1
12345678.0    3
11111111.0    2
10000000.0    1
3890000.0     1
1300000.0     1
1234566.0     1
999999.0      2
999990.0      1
Name: price, dtype: int64
In [17]:
autos["price"].value_counts().sort_index(ascending = False).tail(10)
Out[17]:
12.0       3
11.0       2
10.0       7
9.0        1
8.0        1
5.0        2
3.0        1
2.0        3
1.0      156
0.0     1421
Name: price, dtype: int64

First of all: some rows have "0.0" as price. Hardly a car costs "0.0" dollars. Probably the information about the price was not available for some cars of the dataset. Due to this fact, the rows with the price "0.0" will be removed.

In [18]:
autos = autos.loc[autos["price"] != 0.0, :]
In [19]:
autos.shape
Out[19]:
(48579, 20)
In [20]:
autos["price"].describe()
Out[20]:
count    4.857900e+04
mean     1.012788e+04
std      4.880873e+05
min      1.000000e+00
25%      1.200000e+03
50%      3.000000e+03
75%      7.490000e+03
max      1.000000e+08
Name: price, dtype: float64

As observed earlier, for some large prices there are one or two cars. These points should be removed for the standard deviation for the data can decrease. Like a metric to this removal, the rows with prices above three times the third quartile value will be eliminating. The effect on standard deviation will be observed.

In [21]:
autos = autos.loc[autos["price"] <= 22470.0, :]
In [22]:
autos.shape
Out[22]:
(46669, 20)
In [23]:
autos["price"].describe()
Out[23]:
count    46669.000000
mean      4681.993486
std       4761.927364
min          1.000000
25%       1200.000000
50%       2900.000000
75%       6700.000000
max      22450.000000
Name: price, dtype: float64

The standard deviation was from about 488,087 dollars to 4,761 dollars. A total of 3,331 rows were removed from the original dataset so far.

Exploring date columns

There are 5 columns that should represent date values. Some os these columns were created by the crawler, some came from the website itself.

  • date_crawled: added by the crawler
  • last_seen: added by the crawler
  • ad_created: from the website
  • registration_month: from the website
  • registration_year: from the website

Just registration_month and registration_year are numeric columns. The others columns are represented as string values by pandas. It is important to understand how the date values are stored in the dataset for the string columns.

In [24]:
autos[["datecrawled","ad_created","lastseen"]][0:5]
Out[24]:
datecrawled ad_created lastseen
0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54
1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08
2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37
3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28
4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50

Since the main interest is on the day (e.g. 2016-03-12), the first ten characters of every string column should be analyzed.

In [25]:
autos["datecrawled"].str[:10].value_counts(normalize = True, dropna = False).sort_index()
Out[25]:
2016-03-05    0.025542
2016-03-06    0.014014
2016-03-07    0.036212
2016-03-08    0.033363
2016-03-09    0.033127
2016-03-10    0.032398
2016-03-11    0.032334
2016-03-12    0.037241
2016-03-13    0.015685
2016-03-14    0.036812
2016-03-15    0.034070
2016-03-16    0.029763
2016-03-17    0.031627
2016-03-18    0.012814
2016-03-19    0.034605
2016-03-20    0.037884
2016-03-21    0.037391
2016-03-22    0.032913
2016-03-23    0.032270
2016-03-24    0.029291
2016-03-25    0.031670
2016-03-26    0.032463
2016-03-27    0.030748
2016-03-28    0.034755
2016-03-29    0.034134
2016-03-30    0.033791
2016-03-31    0.031670
2016-04-01    0.033448
2016-04-02    0.035270
2016-04-03    0.038591
2016-04-04    0.036577
2016-04-05    0.013006
2016-04-06    0.003150
2016-04-07    0.001371
Name: datecrawled, dtype: float64
In [26]:
autos["ad_created"].str[:10].value_counts(normalize = True, dropna = False).sort_index()
Out[26]:
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
2015-12-30    0.000021
                ...   
2016-04-03    0.038827
2016-04-04    0.036962
2016-04-05    0.011699
2016-04-06    0.003214
2016-04-07    0.001243
Name: ad_created, Length: 75, dtype: float64
In [27]:
autos["lastseen"].str[:10].value_counts(normalize = True, dropna = False).sort_index()
Out[27]:
2016-03-05    0.001114
2016-03-06    0.004457
2016-03-07    0.005507
2016-03-08    0.007671
2016-03-09    0.009878
2016-03-10    0.010907
2016-03-11    0.012707
2016-03-12    0.024384
2016-03-13    0.009171
2016-03-14    0.012664
2016-03-15    0.016006
2016-03-16    0.016692
2016-03-17    0.028563
2016-03-18    0.007371
2016-03-19    0.016178
2016-03-20    0.020978
2016-03-21    0.020828
2016-03-22    0.021749
2016-03-23    0.018856
2016-03-24    0.020056
2016-03-25    0.019670
2016-03-26    0.017013
2016-03-27    0.015921
2016-03-28    0.021256
2016-03-29    0.022927
2016-03-30    0.025070
2016-03-31    0.024127
2016-04-01    0.022970
2016-04-02    0.025049
2016-04-03    0.025520
2016-04-04    0.024984
2016-04-05    0.122544
2016-04-06    0.218132
2016-04-07    0.129079
Name: lastseen, dtype: float64

The most of registers done by the crawler are from 2016. This fact demonstrates that there should be no records beyond 2016.

Now, the registration_year column will be analyzed to confirm this fact.

In [28]:
autos["registration_year"].describe()
Out[28]:
count    46669.000000
mean      2004.496089
std         88.288345
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

The minimum and maximum years values are 1000 and 9999, respectively. This is not possible because in the year 1000, there no are cars and the year 9999 has even not come. The dataset will be maintained between the years 1900 and 2016.

In [29]:
autos = autos.loc[autos["registration_year"].between(1900,2016), :]
In [30]:
autos["registration_year"].describe()
Out[30]:
count    44799.000000
mean      2002.660417
std          6.917061
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2007.000000
max       2016.000000
Name: registration_year, dtype: float64
In [31]:
autos["registration_year"].value_counts(normalize = True)
Out[31]:
2000    0.070269
2005    0.065113
1999    0.064577
2003    0.060024
2004    0.059845
          ...   
1948    0.000022
1927    0.000022
1941    0.000022
1929    0.000022
1952    0.000022
Name: registration_year, Length: 72, dtype: float64

Exploring brand column

In [32]:
autos["brand"].unique()
Out[32]:
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler',
       'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos',
       'opel', 'mazda', 'mini', 'toyota', 'dacia', 'nissan', 'jeep',
       'saab', 'volvo', 'mitsubishi', 'fiat', 'skoda', 'subaru', 'kia',
       'citroen', 'porsche', 'hyundai', 'chevrolet', 'honda', 'daewoo',
       'suzuki', 'trabant', 'land_rover', 'jaguar', 'alfa_romeo', 'lada',
       'rover', 'daihatsu', 'lancia'], dtype=object)

Top 20 brands more present in the dataset:

In [33]:
autos["brand"].value_counts().head(20)
Out[33]:
volkswagen        9591
opel              5000
bmw               4790
mercedes_benz     4134
audi              3639
ford              3218
renault           2196
peugeot           1391
fiat              1196
seat               846
skoda              751
nissan             703
mazda              698
smart              661
citroen            651
toyota             586
hyundai            465
volvo              421
sonstige_autos     394
mini               388
Name: brand, dtype: int64

From this information, it is possible to answer the question: how much is the price mean for each of the top 20 most popular brands?

In [34]:
top20 = autos["brand"].value_counts().head(20).index
In [35]:
top20_brands_mean_prices = {}

for brand in top20:
    selected_rows = autos.loc[autos["brand"] == brand, :]
    mean_price = selected_rows["price"].mean(axis = 0)
    top20_brands_mean_prices[brand] = mean_price
In [36]:
top20_brands_mean_prices
Out[36]:
{'volkswagen': 4742.8270253362525,
 'opel': 2876.961,
 'bmw': 6593.0787056367435,
 'mercedes_benz': 6416.294146105467,
 'audi': 6807.801593844462,
 'ford': 3324.6743318831573,
 'renault': 2393.0724043715845,
 'peugeot': 3063.606757728253,
 'fiat': 2795.198160535117,
 'seat': 4221.321513002364,
 'skoda': 5972.252996005326,
 'nissan': 4422.687055476529,
 'mazda': 3715.515759312321,
 'smart': 3580.2239031770046,
 'citroen': 3672.284178187404,
 'toyota': 4879.250853242321,
 'hyundai': 5235.376344086021,
 'volvo': 4636.976247030879,
 'sonstige_autos': 6566.5,
 'mini': 9809.605670103092}

Among the 20 most popular brands, mini, audi, and bmw are the more expensive. renault, fiat, and opel are the brands less expensive.

The mean prices can be related to mean mileage of the cars. To confirm this inference, the mean mileage will be calculated, and aggregated to brands and mean prices.

In [37]:
top20_brands_mean_mileage = {}

for brand in top20:
    selected_rows = autos.loc[autos["brand"] == brand, :]
    mean_mileage = selected_rows["odometer_km"].mean()
    top20_brands_mean_mileage[brand] = mean_mileage
In [38]:
# Coverting both dictionaries to series objects:

top20_price_series = pd.Series(top20_brands_mean_prices)
print(top20_price_series)
print('')
top20_mileage_series = pd.Series(top20_brands_mean_mileage)
print(top20_mileage_series)
volkswagen        4742.827025
opel              2876.961000
bmw               6593.078706
mercedes_benz     6416.294146
audi              6807.801594
ford              3324.674332
renault           2393.072404
peugeot           3063.606758
fiat              2795.198161
seat              4221.321513
skoda             5972.252996
nissan            4422.687055
mazda             3715.515759
smart             3580.223903
citroen           3672.284178
toyota            4879.250853
hyundai           5235.376344
volvo             4636.976247
sonstige_autos    6566.500000
mini              9809.605670
dtype: float64

volkswagen        130773.120634
opel              129710.000000
bmw               136492.693111
mercedes_benz     135584.179971
audi              135959.054685
ford              125419.515227
renault           128212.659381
peugeot           127286.125090
fiat              117203.177258
seat              121879.432624
skoda             112416.777630
nissan            119374.110953
mazda             126045.845272
smart              99326.777610
citroen           119807.987711
toyota            116962.457338
hyundai           106892.473118
volvo             138836.104513
sonstige_autos     93591.370558
mini               91262.886598
dtype: float64
In [39]:
# Converting price series to a dataframe:
top20_df = pd.DataFrame(top20_price_series, columns = ["mean_price"])
top20_df
Out[39]:
mean_price
volkswagen 4742.827025
opel 2876.961000
bmw 6593.078706
mercedes_benz 6416.294146
audi 6807.801594
ford 3324.674332
renault 2393.072404
peugeot 3063.606758
fiat 2795.198161
seat 4221.321513
skoda 5972.252996
nissan 4422.687055
mazda 3715.515759
smart 3580.223903
citroen 3672.284178
toyota 4879.250853
hyundai 5235.376344
volvo 4636.976247
sonstige_autos 6566.500000
mini 9809.605670
In [40]:
# Adding a new column: mean mileage
top20_df["mean_mileage"] = top20_mileage_series
top20_df
Out[40]:
mean_price mean_mileage
volkswagen 4742.827025 130773.120634
opel 2876.961000 129710.000000
bmw 6593.078706 136492.693111
mercedes_benz 6416.294146 135584.179971
audi 6807.801594 135959.054685
ford 3324.674332 125419.515227
renault 2393.072404 128212.659381
peugeot 3063.606758 127286.125090
fiat 2795.198161 117203.177258
seat 4221.321513 121879.432624
skoda 5972.252996 112416.777630
nissan 4422.687055 119374.110953
mazda 3715.515759 126045.845272
smart 3580.223903 99326.777610
citroen 3672.284178 119807.987711
toyota 4879.250853 116962.457338
hyundai 5235.376344 106892.473118
volvo 4636.976247 138836.104513
sonstige_autos 6566.500000 93591.370558
mini 9809.605670 91262.886598

Observations:

  • The brand more expensive (mini) has the lower mean mileage.
  • The brands less expensive (renault,fiat,opel) haven't the highest mileage averages.
  • The highest mean mileage belongs to the brand audi, one of most expensive.

Conclusions:

The dataset was cleaned and some points were analyzed. It was possible to detect that most of the records are from 2016. It was concluded that not necessarily the most expensive car brand is the one that has the lower mean mileage.