INTRODUCTION TO OUR DATASET

In this guided project, we'll work with a dataset of used cars from the classifieds section of the German eBay website. The dataset was originally scraped and uploaded to Kaggle.

The data dictionary provided with the data is as follows:

  • dateCrawled - When this ad was first crawled. All field values are taken from this date

  • name - Name of the car

  • seller - Whether the seller is private or a dealer

  • offerType - The type of listing

  • price - The price on the ad to sell the car

  • abtest - Whether the listing is included in an A/B test

  • vehicleType - The vehicle Type

  • yearOfRegistration - The year in which the car was first registered

  • gearBox - The transmission type

  • powerPS - The power of the car in PS

  • model - The car model name

  • kilometer - How many kilometers the car has driven

  • monthOfRegistration - The month in which the car was first registered

  • fuelType - What type of fuel the car uses

  • brand - The brand of the car

  • notRepairedDamage - If the car has a damage which is not yet repaired

  • dateCreated - The date on which the eBay listing was created

  • postalCode - The postal code for the location of the vehicle

  • lastSeenOnline - When the crawler saw this ad last online

The aim of this analysis is to clean the data and analyze the included used car listings.

READING OUR CSV FILE AND SPECIFYING ENCODING

In [1]:
import pandas as pd

autos = pd.read_csv("C:\\Users\\USER\\Videos\\Coding Videos\\my_datasets\\autos.csv", encoding = "Latin-1")
In [2]:
autos
Out[2]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49995 2016-03-27 14:38:19 Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon privat Angebot $24,900 control limousine 2011 automatik 239 q5 100,000km 1 diesel audi nein 2016-03-27 00:00:00 0 82131 2016-04-01 13:47:40
49996 2016-03-28 10:50:25 Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... privat Angebot $1,980 control cabrio 1996 manuell 75 astra 150,000km 5 benzin opel nein 2016-03-28 00:00:00 0 44807 2016-04-02 14:18:02
49997 2016-04-02 14:44:48 Fiat_500_C_1.2_Dualogic_Lounge privat Angebot $13,200 test cabrio 2014 automatik 69 500 5,000km 11 benzin fiat nein 2016-04-02 00:00:00 0 73430 2016-04-04 11:47:27
49998 2016-03-08 19:25:42 Audi_A3_2.0_TDI_Sportback_Ambition privat Angebot $22,900 control kombi 2013 manuell 150 a3 40,000km 11 diesel audi nein 2016-03-08 00:00:00 0 35683 2016-04-05 16:45:07
49999 2016-03-14 00:42:12 Opel_Vectra_1.6_16V privat Angebot $1,250 control limousine 1996 manuell 101 vectra 150,000km 1 benzin opel nein 2016-03-13 00:00:00 0 45897 2016-04-06 21:18:48

50000 rows × 20 columns

In [3]:
autos.info()#Prints a concise summary of our dataset; we can observe from the output that we have 20 columns, most
#of which are stored as strings (object dtype).
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

When we extract the information of our dataframe using the info method, we see that some columns have missing values, but none of them is more than 20% null. The columns with null values are:

  • vehicleType
  • gearbox
  • model
  • fuelType
  • notRepairedDamage
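The 20% figure can be checked with a quick null-share computation. This is only a sketch on a tiny stand-in frame (the rows here are invented); in the notebook it would run on `autos`:

```python
import pandas as pd

# Tiny stand-in frame; in the notebook this check runs on `autos`
df = pd.DataFrame({
    "vehicleType": ["bus", None, "kombi", None],
    "gearbox": ["manuell", "automatik", None, "manuell"],
    "price": ["$5,000", "$8,500", "$1,350", "$900"],
})

# Percentage of missing values per column, largest first
null_share = df.isnull().mean().mul(100).sort_values(ascending=False)
print(null_share)

# Columns under a chosen threshold (e.g. 20%) can be kept as-is
keep = null_share[null_share <= 20].index.tolist()
```

On the real dataset, every column would clear the 20% threshold.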
In [4]:
autos.head()#print the first five rows of the dataframe
Out[4]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

A closer look at the head of our dataframe shows that the column labels are written in camelcase rather than Python's standard snakecase style. So, we will rename the column labels to follow snakecase.

CLEANING COLUMN NAMES

Now, we will clean the column labels by making use of a dataframe attribute (df.columns). We will write a function that takes a single column label as its argument, iterate through the column labels calling the clean function on each, and then assign the modified labels back to the DataFrame.columns attribute.

In [5]:
def clean_col(col):
    col = col.replace("dateCrawled", "date_crawled")
    col = col.replace("offerType", "offer_type")
    col = col.replace("vehicleType", "vehicle_type")
    col = col.replace("powerPS", "power_ps")
    col = col.replace("fuelType", "fuel_type")
    col = col.replace("monthOfRegistration", "registration_month")
    col = col.replace("yearOfRegistration", "registration_year")
    col = col.replace("notRepairedDamage", "unrepaired_damage")
    col = col.replace("dateCreated", "ad_created")
    col = col.replace("lastSeen", "last_seen")
    col = col.replace("postalCode", "postal_code")
    col = col.replace("nrOfPictures", "no_of_pics")
    return col

cleaned_labels = []
for c in autos.columns:
    cleaned = clean_col(c)
    cleaned_labels.append(cleaned)
autos.columns = cleaned_labels
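An equivalent, arguably more idiomatic approach (shown here only as a sketch, not as the project's required method) is to pass a mapping dict to DataFrame.rename, which avoids the helper function entirely:

```python
import pandas as pd

# Toy frame carrying a few of the original camelCase labels
df = pd.DataFrame(columns=["dateCrawled", "offerType", "yearOfRegistration"])

# Map each camelCase label to its snake_case replacement
mapping = {
    "dateCrawled": "date_crawled",
    "offerType": "offer_type",
    "yearOfRegistration": "registration_year",
}
df = df.rename(columns=mapping)
print(df.columns.tolist())
```

Labels absent from the mapping are left unchanged, which matches the behaviour of the clean_col function above.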
    
    
In [6]:
print(autos.columns)#Examining the column labels, using the Dataframe.columns attribute
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
       'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
       'odometer', 'registration_month', 'fuel_type', 'brand',
       'unrepaired_damage', 'ad_created', 'no_of_pics', 'postal_code',
       'last_seen'],
      dtype='object')
In [7]:
print(autos.head())
          date_crawled                                               name  \
0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD   
1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik   
2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United   
3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...   
4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...   

   seller offer_type   price   abtest vehicle_type  registration_year  \
0  privat    Angebot  $5,000  control          bus               2004   
1  privat    Angebot  $8,500  control    limousine               1997   
2  privat    Angebot  $8,990     test    limousine               2009   
3  privat    Angebot  $4,350  control   kleinwagen               2007   
4  privat    Angebot  $1,350     test        kombi               2003   

     gearbox  power_ps   model   odometer  registration_month fuel_type  \
0    manuell       158  andere  150,000km                   3       lpg   
1  automatik       286     7er  150,000km                   6    benzin   
2    manuell       102    golf   70,000km                   7    benzin   
3  automatik        71  fortwo   70,000km                   6    benzin   
4    manuell         0   focus  150,000km                   7    benzin   

        brand unrepaired_damage           ad_created  no_of_pics  postal_code  \
0     peugeot              nein  2016-03-26 00:00:00           0        79588   
1         bmw              nein  2016-04-04 00:00:00           0        71034   
2  volkswagen              nein  2016-03-26 00:00:00           0        35394   
3       smart              nein  2016-03-12 00:00:00           0        33729   
4        ford              nein  2016-04-01 00:00:00           0        39218   

             last_seen  
0  2016-04-06 06:45:54  
1  2016-04-06 14:45:08  
2  2016-04-06 20:15:37  
3  2016-03-15 03:16:28  
4  2016-04-01 14:38:50  

INITIAL EXPLORATION AND MORE DATA CLEANING

In [8]:
print(autos.describe(include = "all"))
               date_crawled         name  seller offer_type  price abtest  \
count                 50000        50000   50000      50000  50000  50000   
unique                48213        38754       2          2   2357      2   
top     2016-03-22 09:51:06  Ford_Fiesta  privat    Angebot     $0   test   
freq                      3           78   49999      49999   1421  25756   
mean                    NaN          NaN     NaN        NaN    NaN    NaN   
std                     NaN          NaN     NaN        NaN    NaN    NaN   
min                     NaN          NaN     NaN        NaN    NaN    NaN   
25%                     NaN          NaN     NaN        NaN    NaN    NaN   
50%                     NaN          NaN     NaN        NaN    NaN    NaN   
75%                     NaN          NaN     NaN        NaN    NaN    NaN   
max                     NaN          NaN     NaN        NaN    NaN    NaN   

       vehicle_type  registration_year  gearbox      power_ps  model  \
count         44905       50000.000000    47320  50000.000000  47242   
unique            8                NaN        2           NaN    245   
top       limousine                NaN  manuell           NaN   golf   
freq          12859                NaN    36993           NaN   4024   
mean            NaN        2005.073280      NaN    116.355920    NaN   
std             NaN         105.712813      NaN    209.216627    NaN   
min             NaN        1000.000000      NaN      0.000000    NaN   
25%             NaN        1999.000000      NaN     70.000000    NaN   
50%             NaN        2003.000000      NaN    105.000000    NaN   
75%             NaN        2008.000000      NaN    150.000000    NaN   
max             NaN        9999.000000      NaN  17700.000000    NaN   

         odometer  registration_month fuel_type       brand unrepaired_damage  \
count       50000        50000.000000     45518       50000             40171   
unique         13                 NaN         7          40                 2   
top     150,000km                 NaN    benzin  volkswagen              nein   
freq        32424                 NaN     30107       10687             35232   
mean          NaN            5.723360       NaN         NaN               NaN   
std           NaN            3.711984       NaN         NaN               NaN   
min           NaN            0.000000       NaN         NaN               NaN   
25%           NaN            3.000000       NaN         NaN               NaN   
50%           NaN            6.000000       NaN         NaN               NaN   
75%           NaN            9.000000       NaN         NaN               NaN   
max           NaN           12.000000       NaN         NaN               NaN   

                 ad_created  no_of_pics   postal_code            last_seen  
count                 50000     50000.0  50000.000000                50000  
unique                   76         NaN           NaN                39481  
top     2016-04-03 00:00:00         NaN           NaN  2016-04-07 06:17:27  
freq                   1946         NaN           NaN                    8  
mean                    NaN         0.0  50813.627300                  NaN  
std                     NaN         0.0  25779.747957                  NaN  
min                     NaN         0.0   1067.000000                  NaN  
25%                     NaN         0.0  30451.000000                  NaN  
50%                     NaN         0.0  49577.000000                  NaN  
75%                     NaN         0.0  71540.000000                  NaN  
max                     NaN         0.0  99998.000000                  NaN  

A careful look at the summary above shows that some columns are worth dropping and others are worth modifying. Findings:

  • seller column - A candidate for dropping: it has just two unique values (49,999 of the 50,000 rows are "privat"), so it adds little to our analysis.

  • offer_type column - Like the previous column, it is nearly constant (49,999 rows are "Angebot") and hence is not significant to our analysis.

  • abtest column - Also a candidate for dropping.

  • no_of_pics column - Every value is 0, so it is also a candidate for dropping.

Both the price and odometer columns have to be cleaned because they store numeric data as text.

Now, let's drop the columns that need to be dropped.

In [9]:
autos = autos.drop(columns = ["abtest", "no_of_pics", "offer_type", "seller"])#DataFrame.drop can remove rows or columns; here the columns keyword drops the listed columns.

Now, we will clean the price and odometer columns and rename them appropriately.

In [10]:
#regex = False makes str.replace treat "$" as a literal character (in newer pandas versions the default
#would otherwise interpret it as a regular expression).
autos["price"] = autos["price"].str.replace("$", "", regex = False).str.replace(",", "", regex = False).astype(int)
autos.rename(columns = {"price":"price_dollars"}, inplace = True)

autos["odometer"] = autos["odometer"].str.replace("km", "", regex = False).str.replace(",", "", regex = False).astype(int)
autos.rename(columns = {"odometer":"odometer_km"}, inplace = True)
In [11]:
print(autos.describe(include ="all"))
               date_crawled         name  price_dollars vehicle_type  \
count                 50000        50000   5.000000e+04        44905   
unique                48213        38754            NaN            8   
top     2016-03-22 09:51:06  Ford_Fiesta            NaN    limousine   
freq                      3           78            NaN        12859   
mean                    NaN          NaN   9.840044e+03          NaN   
std                     NaN          NaN   4.811044e+05          NaN   
min                     NaN          NaN   0.000000e+00          NaN   
25%                     NaN          NaN   1.100000e+03          NaN   
50%                     NaN          NaN   2.950000e+03          NaN   
75%                     NaN          NaN   7.200000e+03          NaN   
max                     NaN          NaN   1.000000e+08          NaN   

        registration_year  gearbox      power_ps  model    odometer_km  \
count        50000.000000    47320  50000.000000  47242   50000.000000   
unique                NaN        2           NaN    245            NaN   
top                   NaN  manuell           NaN   golf            NaN   
freq                  NaN    36993           NaN   4024            NaN   
mean          2005.073280      NaN    116.355920    NaN  125732.700000   
std            105.712813      NaN    209.216627    NaN   40042.211706   
min           1000.000000      NaN      0.000000    NaN    5000.000000   
25%           1999.000000      NaN     70.000000    NaN  125000.000000   
50%           2003.000000      NaN    105.000000    NaN  150000.000000   
75%           2008.000000      NaN    150.000000    NaN  150000.000000   
max           9999.000000      NaN  17700.000000    NaN  150000.000000   

        registration_month fuel_type       brand unrepaired_damage  \
count         50000.000000     45518       50000             40171   
unique                 NaN         7          40                 2   
top                    NaN    benzin  volkswagen              nein   
freq                   NaN     30107       10687             35232   
mean              5.723360       NaN         NaN               NaN   
std               3.711984       NaN         NaN               NaN   
min               0.000000       NaN         NaN               NaN   
25%               3.000000       NaN         NaN               NaN   
50%               6.000000       NaN         NaN               NaN   
75%               9.000000       NaN         NaN               NaN   
max              12.000000       NaN         NaN               NaN   

                 ad_created   postal_code            last_seen  
count                 50000  50000.000000                50000  
unique                   76           NaN                39481  
top     2016-04-03 00:00:00           NaN  2016-04-07 06:17:27  
freq                   1946           NaN                    8  
mean                    NaN  50813.627300                  NaN  
std                     NaN  25779.747957                  NaN  
min                     NaN   1067.000000                  NaN  
25%                     NaN  30451.000000                  NaN  
50%                     NaN  49577.000000                  NaN  
75%                     NaN  71540.000000                  NaN  
max                     NaN  99998.000000                  NaN  

From the output in the cell above, we can see that we have successfully dropped the columns that are not relevant to our analysis, and cleaned and renamed the columns that stored numeric data as text.

EXPLORING THE ODOMETER AND PRICE COLUMNS

Now, we will continue to explore the data in these two columns and look for values that don't look right.

First, we will look for outliers in each of these columns and remove them.

In [12]:
print(autos["price_dollars"].unique().shape)
print("\n")
print(autos["price_dollars"].describe())
print("\n")
print(autos["price_dollars"].value_counts().head(20))
(2357,)


count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_dollars, dtype: float64


0       1421
500      781
1500     734
2500     643
1000     639
1200     639
600      531
800      498
3500     498
2000     460
999      434
750      433
900      420
650      419
850      410
700      395
4500     394
300      384
2200     382
950      379
Name: price_dollars, dtype: int64

Exploration of the price column shows the following:

  • There are 2,357 unique values in the price column.

  • 1,421 prices (about 2.8% of listings) equal 0; these rows might be considered for removal from the dataset.

  • The minimum value is 0 and the maximum value is 100,000,000. These look implausibly small and implausibly large, respectively.
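The share of zero-price listings quoted above can be computed directly. A minimal sketch on a stand-in series (on the real data, the series would be autos["price_dollars"]):

```python
import pandas as pd

# Stand-in prices; in the notebook this would be autos["price_dollars"]
prices = pd.Series([0, 0, 500, 1500, 0, 2500])

# Relative frequency of the value 0, expressed as a percentage
zero_share = (prices == 0).mean() * 100
print(f"{zero_share:.1f}% of listings have a price of 0")
```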

Let's explore further.

In [13]:
autos["price_dollars"].value_counts().sort_index(ascending=False).head(20)#value_counts counts the occurrences of
#each value in the series. sort_index orders the result by index (here, the prices themselves) in ascending or
#descending order, which helps in showing the lowest or highest values.
Out[13]:
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price_dollars, dtype: int64
In [14]:
autos["price_dollars"].value_counts().sort_index(ascending=True).head(20)
Out[14]:
0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price_dollars, dtype: int64

Findings:

  • Quite a number of prices are below 30 dollars (about 1,600 listings, roughly 3% of the dataset).

  • A car cannot realistically be sold for 0 dollars, so we have to drop the rows with a price of 0.

  • Above 350,000 dollars, the prices jump abruptly (the next value is nearly 1,000,000), which suggests these listings are not genuine.

Considering the fact that eBay is an auction site and prices could start at 1 dollar, we will keep prices from 1 dollar to 350,000 dollars in the dataset and remove the rest, which I consider outliers.

In [15]:
#To remove outliers, we restrict values to a range. One way to do so is:
#df[(df["col"] > x) & (df["col"] < y)], where x and y are the limit values. An easier way is shown below.
autos = autos[autos["price_dollars"].between(1,350000)]
autos["price_dollars"].describe()
Out[15]:
count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price_dollars, dtype: float64

From the output above, we see that we have effectively removed the outliers; our minimum value is now 1 dollar and our maximum value is 350,000 dollars.

In [16]:
autos["odometer_km"].nunique()#nunique returns the number of distinct values in the column.
Out[16]:
13
In [17]:
autos["odometer_km"].describe()
Out[17]:
count     48565.000000
mean     125770.101925
std       39788.636804
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
In [18]:
autos["odometer_km"].value_counts().sort_index(ascending = False)
Out[18]:
150000    31414
125000     5057
100000     2115
90000      1734
80000      1415
70000      1217
60000      1155
50000      1012
40000       815
30000       780
20000       762
10000       253
5000        836
Name: odometer_km, dtype: int64

Exploration of the odometer_km column shows that both the minimum and maximum values are realistic. Since these are used cars, it is expected that the majority will have a very high odometer reading. Hence, there are no outliers.

EXPLORING DATE COLUMNS

From our dataset, we can see there are five columns with date data:

  • date_crawled

  • registration_year

  • registration_month

  • last_seen

  • ad_created

In [19]:
print(autos["date_crawled"].describe())
print(autos["registration_year"].describe())
print(autos["registration_month"].describe())
print(autos["last_seen"].describe())
print(autos["ad_created"].describe())
count                   48565
unique                  46882
top       2016-03-25 19:57:10
freq                        3
Name: date_crawled, dtype: object
count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
count    48565.000000
mean         5.782251
std          3.685595
min          0.000000
25%          3.000000
50%          6.000000
75%          9.000000
max         12.000000
Name: registration_month, dtype: float64
count                   48565
unique                  38474
top       2016-04-07 06:17:27
freq                        8
Name: last_seen, dtype: object
count                   48565
unique                     76
top       2016-04-03 00:00:00
freq                     1887
Name: ad_created, dtype: object
In [20]:
autos[["date_crawled", "last_seen", "registration_month", "registration_year", "ad_created"]].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48565 entries, 0 to 49999
Data columns (total 5 columns):
date_crawled          48565 non-null object
last_seen             48565 non-null object
registration_month    48565 non-null int64
registration_year     48565 non-null int64
ad_created            48565 non-null object
dtypes: int64(2), object(3)
memory usage: 2.2+ MB

From our exploration of the 5 columns with date data, we see that 3 are stored as text and 2 are stored in numeric format. Hence, to work with them quantitatively, we need to process the 3 columns represented as text. These columns are

  • ad_created

  • last_seen

  • date_crawled
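If we wanted true datetime values, pandas' to_datetime could convert these columns; this is only a sketch (the notebook below deliberately keeps them as strings and slices out the date part instead):

```python
import pandas as pd

# Two sample rows in the same format as the dataset's date strings
df = pd.DataFrame({
    "date_crawled": ["2016-03-26 17:47:46", "2016-04-04 13:38:56"],
    "ad_created": ["2016-03-26 00:00:00", "2016-04-04 00:00:00"],
})

# Parse each string column into a proper datetime64 dtype
for col in ["date_crawled", "ad_created"]:
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d %H:%M:%S")

print(df.dtypes)
```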

Now, let's explore the 3 columns stored as strings.

In [21]:
autos[["date_crawled", "last_seen", "ad_created"]][0:5]
Out[21]:
date_crawled last_seen ad_created
0 2016-03-26 17:47:46 2016-04-06 06:45:54 2016-03-26 00:00:00
1 2016-04-04 13:38:56 2016-04-06 14:45:08 2016-04-04 00:00:00
2 2016-03-26 18:57:24 2016-04-06 20:15:37 2016-03-26 00:00:00
3 2016-03-12 16:58:10 2016-03-15 03:16:28 2016-03-12 00:00:00
4 2016-04-01 14:38:50 2016-04-01 14:38:50 2016-04-01 00:00:00

Notice that the first 10 characters represent the day (e.g. 2016-03-26).

To get a good grasp of the data, we will extract the counts of each date in the column and then sort by index for easier analysis.

In [22]:
autos["date_crawled"].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = False)#value_counts
#takes a few parameters: normalize takes a boolean argument and returns the relative frequency of each value (by
#default it is set to False). The dropna parameter is set to True by default, which excludes null or NaN values,
#but here we set it to False.
Out[22]:
2016-04-07    0.001400
2016-04-06    0.003171
2016-04-05    0.013096
2016-04-04    0.036487
2016-04-03    0.038608
2016-04-02    0.035478
2016-04-01    0.033687
2016-03-31    0.031834
2016-03-30    0.033687
2016-03-29    0.034099
2016-03-28    0.034860
2016-03-27    0.031092
2016-03-26    0.032204
2016-03-25    0.031607
2016-03-24    0.029342
2016-03-23    0.032225
2016-03-22    0.032987
2016-03-21    0.037373
2016-03-20    0.037887
2016-03-19    0.034778
2016-03-18    0.012911
2016-03-17    0.031628
2016-03-16    0.029610
2016-03-15    0.034284
2016-03-14    0.036549
2016-03-13    0.015670
2016-03-12    0.036920
2016-03-11    0.032575
2016-03-10    0.032184
2016-03-09    0.033090
2016-03-08    0.033296
2016-03-07    0.036014
2016-03-06    0.014043
2016-03-05    0.025327
Name: date_crawled, dtype: float64

From this extraction, we can make some findings:

  • The data was crawled every day from 5th March 2016 to 7th April 2016

  • The amount of data crawled on each day is roughly uniform.

In [23]:
autos["last_seen"].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = False)
Out[23]:
2016-04-07    0.131947
2016-04-06    0.221806
2016-04-05    0.124761
2016-04-04    0.024483
2016-04-03    0.025203
2016-04-02    0.024915
2016-04-01    0.022794
2016-03-31    0.023783
2016-03-30    0.024771
2016-03-29    0.022341
2016-03-28    0.020859
2016-03-27    0.015649
2016-03-26    0.016802
2016-03-25    0.019211
2016-03-24    0.019767
2016-03-23    0.018532
2016-03-22    0.021373
2016-03-21    0.020632
2016-03-20    0.020653
2016-03-19    0.015834
2016-03-18    0.007351
2016-03-17    0.028086
2016-03-16    0.016452
2016-03-15    0.015876
2016-03-14    0.012602
2016-03-13    0.008895
2016-03-12    0.023783
2016-03-11    0.012375
2016-03-10    0.010666
2016-03-09    0.009595
2016-03-08    0.007413
2016-03-07    0.005395
2016-03-06    0.004324
2016-03-05    0.001071
Name: last_seen, dtype: float64

From our exploration of the last_seen column, we find:

  • The distribution above shows the relative frequency of removed ads per day.

  • Far more ads were removed in the last 3 crawled days; these spikes likely reflect the end of the crawling period rather than a genuine burst of sales.

  • The majority of the ads were last seen on April 6th or April 7th

In [24]:
autos["ad_created"].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = False)
Out[24]:
2016-04-07    0.001256
2016-04-06    0.003253
2016-04-05    0.011819
2016-04-04    0.036858
2016-04-03    0.038855
                ...   
2015-12-05    0.000021
2015-11-10    0.000021
2015-09-09    0.000021
2015-08-10    0.000021
2015-06-11    0.000021
Name: ad_created, Length: 76, dtype: float64

Exploration of the ad_created column enables us to draw some conclusions:

  • The ads were created between 11th June 2015 and 7th April 2016

  • The majority of the ads were created in March and April 2016

In [25]:
autos["registration_year"].describe()
Out[25]:
count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64

From the exploration of this column, we see that the minimum and maximum values are implausible, so further analysis has to be done.

It is worth noting that a car cannot be registered after its listing appeared. Hence, any year in the registration_year column above 2016 is definitely inaccurate.
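A quick way to quantify how many rows such impossible years affect (a sketch on stand-in values; on the real data the series would be autos["registration_year"]):

```python
import pandas as pd

# Stand-in registration years; the real series is autos["registration_year"]
years = pd.Series([1000, 1999, 2004, 2017, 9999, 2016])

# Registrations "after" the 2016 crawl are necessarily wrong
n_future = int((years > 2016).sum())
share = n_future / len(years) * 100
print(n_future, f"({share:.1f}% of rows)")
```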

In [26]:
autos["registration_year"].value_counts().sort_index()
Out[26]:
1000    1
1001    1
1111    1
1800    2
1910    5
       ..
5911    1
6200    1
8888    1
9000    1
9999    3
Name: registration_year, Length: 95, dtype: int64

Recall from our exploration of the last_seen column that its values fall between 5th March and 7th April 2016, so no registration year should exceed 2016.

In [27]:
autos["registration_year"].value_counts(dropna = False).sort_index(ascending = False).head(50)
Out[27]:
9999       3
9000       1
8888       1
6200       1
5911       1
5000       4
4800       1
4500       1
4100       1
2800       1
2019       2
2018     470
2017    1392
2016    1220
2015     392
2014     663
2013     803
2012    1310
2011    1623
2010    1589
2009    2085
2008    2215
2007    2277
2006    2670
2005    2936
2004    2703
2003    2699
2002    2486
2001    2636
2000    3156
1999    2897
1998    2363
1997    1951
1996    1373
1995    1227
1994     629
1993     425
1992     370
1991     339
1990     347
1989     174
1988     135
1987      72
1986      72
1985      95
1984      51
1983      51
1982      41
1981      28
1980      85
Name: registration_year, dtype: int64

Further exploration of the registration_year column shows values as low as the year 1000, which is unrealistic, as cars had not been invented then. Since series production of automobiles only took hold in the early 20th century, for the registration year we will only consider years from 1917 to 2016. All rows carrying years outside this range will be dropped.

In [28]:
autos = autos[autos["registration_year"].between(1917, 2016)]
autos["registration_year"].describe()
Out[28]:
count    46676.000000
mean      2002.920709
std          7.120843
min       1927.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64
In [29]:
autos.describe()
Out[29]:
price_dollars registration_year power_ps odometer_km registration_month postal_code
count 46676.000000 46676.000000 46676.000000 46676.000000 46676.000000 46676.000000
mean 5978.332612 2002.920709 117.905562 125594.631074 5.827749 51096.456166
std 9178.207768 7.120843 184.928789 39842.598992 3.670001 25755.911915
min 1.000000 1927.000000 0.000000 5000.000000 0.000000 1067.000000
25% 1250.000000 1999.000000 75.000000 100000.000000 3.000000 30827.000000
50% 3100.000000 2003.000000 109.000000 150000.000000 6.000000 49826.000000
75% 7500.000000 2008.000000 150.000000 150000.000000 9.000000 71732.000000
max 350000.000000 2016.000000 17700.000000 150000.000000 12.000000 99998.000000

EXPLORING PRICE BY BRAND

Now, we will explore each car brand in our dataset and see how prices vary across them.

First, let's look again at the columns available in our dataset.

In [30]:
autos.head()
Out[30]:
date_crawled name price_dollars vehicle_type registration_year gearbox power_ps model odometer_km registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 5000 bus 2004 manuell 158 andere 150000 3 lpg peugeot nein 2016-03-26 00:00:00 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 8500 limousine 1997 automatik 286 7er 150000 6 benzin bmw nein 2016-04-04 00:00:00 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 8990 limousine 2009 manuell 102 golf 70000 7 benzin volkswagen nein 2016-03-26 00:00:00 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4350 kleinwagen 2007 automatik 71 fortwo 70000 6 benzin smart nein 2016-03-12 00:00:00 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 1350 kombi 2003 manuell 0 focus 150000 7 benzin ford nein 2016-04-01 00:00:00 39218 2016-04-01 14:38:50
In [31]:
autos["brand"].unique()#returns the various brands in the brand column.
Out[31]:
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler',
       'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos',
       'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan',
       'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda',
       'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda',
       'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada',
       'rover', 'daihatsu', 'lancia'], dtype=object)
In [32]:
top_10_brands = autos["brand"].value_counts()[:10]
print(top_10_brands)
volkswagen       9862
bmw              5137
opel             5020
mercedes_benz    4503
audi             4041
ford             3263
renault          2200
peugeot          1393
fiat             1197
seat              853
Name: brand, dtype: int64
In [33]:
top_6_brands = autos["brand"].value_counts()[:6]
print(top_6_brands)
volkswagen       9862
bmw              5137
opel             5020
mercedes_benz    4503
audi             4041
ford             3263
Name: brand, dtype: int64
In [34]:
bottom_10_brands = autos["brand"].value_counts()[-10:]
print(bottom_10_brands)
jeep          106
subaru        100
land_rover     98
saab           77
jaguar         73
daewoo         70
trabant        64
rover          62
lancia         50
lada           27
Name: brand, dtype: int64

Looking carefully at the top_10_brands and bottom_10_brands lists, we see that the most popular brands in the eBay car listings are the German ones: Volkswagen, BMW, Opel, Mercedes-Benz and Audi. Most of the remaining brands are foreign.

Since the top_10_brands account for about 80% of all listings, we will focus our brand/price analysis on just these cars.

In [35]:
#We will use aggregation to explore brands in terms of price.
mean_car_price = {}

top_6_brands = autos["brand"].value_counts()[:6].index
for cars in top_6_brands:
    selected_rows = autos[autos["brand"] == cars]
    mean_price = selected_rows["price_dollars"].mean()
    mean_price = int(mean_price)
    mean_car_price[cars] = mean_price
print(mean_car_price)
print("\n")
sorted(mean_car_price.items(), key=lambda x: x[1], reverse=True)
{'volkswagen': 5402, 'bmw': 8332, 'opel': 2976, 'mercedes_benz': 8628, 'audi': 9336, 'ford': 3749}


Out[35]:
[('audi', 9336),
 ('mercedes_benz', 8628),
 ('bmw', 8332),
 ('volkswagen', 5402),
 ('ford', 3749),
 ('opel', 2976)]
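As a side note, the dict-building loop above can be collapsed into a single groupby call. A minimal sketch on a toy frame (the brands and prices below are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy frame with the same column names the loop assumes.
autos = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford", "ford"],
    "price_dollars": [8000, 8600, 3500, 4000],
})

# groupby replaces the manual loop: one mean per brand, sorted descending.
mean_price = autos.groupby("brand")["price_dollars"].mean().sort_values(ascending=False)
print(mean_price)
```

The loop version is easier to follow step by step, which is why it is used here, but groupby is the idiomatic pandas approach.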
From the analysis above, we see that Audi, Mercedes-Benz and BMW have the highest mean prices, in that order, while Volkswagen has a moderate mean price.

STORING AGGREGATE DATA IN A DATAFRAME

Previously, we analysed the prices of the top 6 brands and found that the luxury brands are BMW, Audi and Mercedes-Benz. Now, we will compare the mileage (odometer reading) of these top 6 brands to see how it fares when set against price.

In [37]:
mean_car_mileage = {}

top_6_brands = autos["brand"].value_counts()[:6].index
for cars in top_6_brands:
    selected_rows = autos[autos["brand"]==cars]
    mean_mileage = selected_rows["odometer_km"].mean()
    mean_mileage = int(mean_mileage)
    mean_car_mileage[cars] = mean_mileage
print(mean_car_mileage)
sorted(mean_car_mileage.items(), key = lambda x:x[1], reverse = True)
{'volkswagen': 128707, 'bmw': 132572, 'opel': 129311, 'mercedes_benz': 130788, 'audi': 129157, 'ford': 124266}
Out[37]:
[('bmw', 132572),
 ('mercedes_benz', 130788),
 ('opel', 129311),
 ('audi', 129157),
 ('volkswagen', 128707),
 ('ford', 124266)]

We see that brands such as BMW, Mercedes-Benz, Audi and Volkswagen have high mean mileages. It is also worth noting that the mean mileage of the top 6 brands lies between roughly 124,000 km and 133,000 km.

Now, we will set the mean mileage of these cars against the mean prices, to see if there is any visible link.

To do this, we will combine the mean_car_mileage and mean_car_price dictionaries into a single dataframe. This will make our analysis a lot easier.

In [38]:
#Turning the mean_car_price dict into a Series.
top_6_price = pd.Series(mean_car_price)

top_6_mileage = pd.Series(mean_car_mileage)

top_6_brands = pd.DataFrame(top_6_price, columns=["mean_price"])#Converting the Series to a DataFrame;
#the columns parameter accepts an array-like object specifying the column name
top_6_brands["mean_car_mileage"] = top_6_mileage
In [39]:
print(top_6_brands)
               mean_price  mean_car_mileage
volkswagen           5402            128707
bmw                  8332            132572
opel                 2976            129311
mercedes_benz        8628            130788
audi                 9336            129157
ford                 3749            124266

From our newly created DataFrame, we are able to draw some findings:

  • Mileage does not appear to have a close relationship with price.

  • The most luxurious brands do, however, have somewhat higher mileage.

  • A couple of low-priced brands still have high mileage, e.g. Opel.
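The same DataFrame could also have been built in one constructor call from a dict of Series, which aligns the two on the shared brand index automatically. A sketch using two of the values from the output above:

```python
import pandas as pd

# Values copied from the aggregation output above (bmw and ford only).
mean_price = pd.Series({"bmw": 8332, "ford": 3749})
mean_mileage = pd.Series({"bmw": 132572, "ford": 124266})

# A dict of Series becomes a DataFrame with one column per dict key,
# aligned on the shared brand index.
summary = pd.DataFrame({"mean_price": mean_price,
                        "mean_car_mileage": mean_mileage})
print(summary)
```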

MORE CLEANING STEPS AND ANALYSIS

COMMON BRAND AND MODEL

In [40]:
brand_model = autos[["brand", "model"]]
brand_model.isnull().sum()
Out[40]:
brand       0
model    2190
dtype: int64
In [41]:
brand_model = brand_model.dropna(axis=0)
brand_model.isnull().sum()
Out[41]:
brand    0
model    0
dtype: int64

From the cells above, we see that the model column had 2190 missing values. Since we can't work with those, we dropped the rows containing them.
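For reference, dropna(axis=0) removes every row that contains at least one missing value in the selected columns. A minimal sketch on a toy frame (the rows below are illustrative):

```python
import numpy as np
import pandas as pd

brand_model = pd.DataFrame({
    "brand": ["volkswagen", "bmw", "opel"],
    "model": ["golf", np.nan, "corsa"],
})

# axis=0 drops whole rows that contain a missing value.
cleaned = brand_model.dropna(axis=0)
print(len(cleaned))  # 2 rows remain
```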

In [42]:
brand_model.groupby("brand")["model"].value_counts(dropna = False).sort_values(ascending=False).head()
Out[42]:
brand       model 
volkswagen  golf      3707
bmw         3er       2615
volkswagen  polo      1609
opel        corsa     1591
volkswagen  passat    1349
Name: model, dtype: int64

From this, we can see that the most common brand/model combinations are Volkswagen Golf, BMW 3er, Volkswagen Polo, Opel Corsa and Volkswagen Passat.
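An alternative way to extract just the single most frequent model per brand is to aggregate with the mode. A sketch on a toy frame (the rows below are illustrative):

```python
import pandas as pd

brand_model = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "volkswagen", "bmw", "bmw"],
    "model": ["golf", "golf", "polo", "3er", "3er"],
})

# For each brand, keep only the most frequent model.
top_model = brand_model.groupby("brand")["model"].agg(lambda s: s.mode().iloc[0])
print(top_model)
```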

SPLIT ODOMETER AND PRICE

In [43]:
autos["odometer_km"].value_counts().sort_index()
Out[43]:
5000        782
10000       241
20000       742
30000       760
40000       797
50000       993
60000      1128
70000      1187
80000      1375
90000      1673
100000     2057
125000     4857
150000    30084
Name: odometer_km, dtype: int64
In [44]:
autos["odometer_km"].unique()
Out[44]:
array([150000,  70000,  50000,  80000,  10000,  30000, 125000,  90000,
        20000,  60000,   5000,  40000, 100000], dtype=int64)
In [45]:
autos.groupby("odometer_km", as_index = False)["price_dollars"].mean().sort_values("price_dollars", ascending =False).round(0)
Out[45]:
odometer_km price_dollars
1 10000 20551.0
2 20000 18448.0
3 30000 16609.0
4 40000 15500.0
5 50000 13812.0
6 60000 12385.0
7 70000 10927.0
8 80000 9722.0
0 5000 8907.0
9 90000 8465.0
10 100000 8137.0
11 125000 6214.0
12 150000 3768.0

We have successfully isolated the odometer readings of our cars and their associated mean prices. Taking a closer look at the table, we can conclude that cars with very high odometer readings have lower prices. The exception is the 5,000 km group, which has a somewhat low mean price; this may be because cars listed with such low mileage are not popular brands, so buyers don't fancy them.
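The inverse trend can be quantified with a correlation coefficient. As a rough check (a sketch, not part of the original analysis), we can compute the Pearson correlation on the per-bucket mean prices copied from the table above:

```python
import pandas as pd

# Per-bucket mean prices copied from the Out[45] table above.
buckets = pd.DataFrame({
    "odometer_km":   [5000, 10000, 20000, 30000, 40000, 50000, 60000,
                      70000, 80000, 90000, 100000, 125000, 150000],
    "price_dollars": [8907, 20551, 18448, 16609, 15500, 13812, 12385,
                      10927, 9722, 8465, 8137, 6214, 3768],
})

# Pearson correlation; a value close to -1 confirms the inverse relationship.
corr = buckets["odometer_km"].corr(buckets["price_dollars"])
print(round(corr, 2))
```

The 5,000 km outlier weakens the correlation somewhat, but it remains strongly negative.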

COMPARING PRICE OF DAMAGED AND NON DAMAGED CARS.

In [46]:
autos["unrepaired_damage"].describe()
Out[46]:
count     38374
unique        2
top        nein
freq      33834
Name: unrepaired_damage, dtype: object

Before proceeding, we see that this column contains values written in German. We will do some data cleaning on this column to translate them.

In [47]:
mapping_dict = {
    "nein":"No",
    "ja":"Yes"
}
autos["unrepaired_damage"]=autos["unrepaired_damage"].map(mapping_dict)
In [48]:
autos["unrepaired_damage"].describe()#We have successfully corrected the values.
Out[48]:
count     38374
unique        2
top          No
freq      33834
Name: unrepaired_damage, dtype: object
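One behaviour of Series.map worth knowing: any value without an entry in the mapping dict becomes NaN rather than being left unchanged. A small sketch (the value "vielleicht" is made up for illustration):

```python
import pandas as pd

damage = pd.Series(["nein", "ja", "vielleicht"])
mapped = damage.map({"nein": "No", "ja": "Yes"})

# "vielleicht" has no entry in the dict, so map turns it into NaN.
print(mapped)
```

Since our column only ever contains "nein", "ja" or NaN, the mapping above is safe, but on a dirtier column this silent NaN conversion could drop data.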
In [49]:
#damaged = autos.loc[(autos["unrepaired_damage"] == "Yes"), ["price_dollars"]]
#damaged["price_dollars"].describe()
In [50]:
#not_damaged = autos.loc[(autos["unrepaired_damage"] == "No"), ["price_dollars"]]
#not_damaged["price_dollars"].describe()
In [51]:
autos.groupby("unrepaired_damage")["price_dollars"].max()#We can also compare the unrepaired column with the price column
#like so.
Out[51]:
unrepaired_damage
No     350000
Yes     44200
Name: price_dollars, dtype: int32

The results in the cells above were obtained using two different techniques. For the first two, we indexed the unrepaired_damage and price columns, filtering the former on its "Yes" and "No" values.

For the last, we used the DataFrame groupby method to compare the unrepaired_damage column with the price column and obtain the maximum per group.

We can conclude from this that cars with damage not yet repaired cost less than those without damage.
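The max comparison above could be extended to several statistics at once with agg. A sketch on a toy frame (the prices below are illustrative, not from the dataset):

```python
import pandas as pd

# Toy frame with the same column names as the cleaned dataset.
autos = pd.DataFrame({
    "unrepaired_damage": ["No", "No", "Yes", "Yes"],
    "price_dollars": [7000, 5000, 2000, 3000],
})

# agg computes several statistics per group in one pass.
stats = autos.groupby("unrepaired_damage")["price_dollars"].agg(["mean", "max", "count"])
print(stats)
```

Comparing means as well as maxima would make the damaged-vs-undamaged conclusion more robust than a single extreme value.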

CONCLUSION SNIPPETS

With this, we come to the conclusion of our data cleaning and analysis of the eBay car sales data. We cleaned the data by renaming the columns from camelCase to snake_case.
We also cleaned the price and odometer columns, converting them from text to numeric types, and then renamed them.
Finally, we looked at vehicles with unrepaired damage to compare maximum prices with and without damage present.