In this guided project, we'll work with a dataset of used cars from the classifieds section of the German eBay website. The dataset was originally scraped and uploaded to Kaggle.
The data dictionary provided with the data is as follows:
dateCrawled - When this ad was first crawled. All field values are taken from this date
name - Name of the car
seller - Whether the seller is private or a dealer
offerType - The type of listing
price - The price on the ad to sell the car
abtest - Whether the listing is included in an A/B test
vehicleType - The vehicle type
yearOfRegistration - The year in which the car was first registered
gearBox - The transmission type
powerPS - The power of the car in PS
model - The car model name
kilometer - How many kilometers the car has been driven
monthOfRegistration - The month in which the car was first registered
fuelType - What type of fuel the car uses
brand - The brand of the car
notRepairedDamage - Whether the car has damage which is not yet repaired
dateCreated - The date on which the eBay listing was created
postalCode - The postal code for the location of the vehicle
lastSeenOnline - When the crawler saw this ad last online
The aim of this analysis is to clean the data and analyze the included used car listings.
import pandas as pd
autos = pd.read_csv("C:\\Users\\USER\\Videos\\Coding Videos\\my_datasets\\autos.csv", encoding = "Latin-1")
autos
autos.info()  # Gives information about our dataset; we can observe from the output that we have 20 columns, most
# of which are strings.
When we extract the information of our dataframe using the info method, we see that some columns have missing values, but none of them has more than 20% null values.
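As a quick check of that claim, a minimal sketch like the one below (not part of the original project flow, and assuming autos has been loaded as above) would show the share of missing values per column; the 20% threshold is only illustrative.
# Share of missing values per column, sorted from most to least missing.
null_share = autos.isnull().sum() / len(autos)
print(null_share.sort_values(ascending=False))
# Columns exceeding an illustrative 20% threshold, if any:
print(null_share[null_share > 0.20])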
autos.head()#print the first five rows of the dataframe
Exploring the head of our dataframe, we can see that the column labels are written in camelcase instead of the standard Python snakecase style. So, we will rename the column labels to snakecase.
def clean_col(col):
    col = col.replace("dateCrawled", "date_crawled")
    col = col.replace("offerType", "offer_type")
    col = col.replace("vehicleType", "vehicle_type")
    col = col.replace("powerPS", "power_ps")
    col = col.replace("fuelType", "fuel_type")
    col = col.replace("monthOfRegistration", "registration_month")
    col = col.replace("yearOfRegistration", "registration_year")
    col = col.replace("notRepairedDamage", "unrepaired_damage")
    col = col.replace("dateCreated", "ad_created")
    col = col.replace("lastSeen", "last_seen")
    col = col.replace("postalCode", "postal_code")
    col = col.replace("nrOfPictures", "no_of_pics")
    return col
cleaned_labels = []
for c in autos.columns:
    cleaned = clean_col(c)
    cleaned_labels.append(cleaned)
autos.columns = cleaned_labels
print(autos.columns)#Examining the column labels, using the Dataframe.columns attribute
print(autos.head())
print(autos.describe(include = "all"))
A closer look at the dataframe explored above shows that some columns are worth dropping and others are worth modifying. Findings:
seller column - This column is worth dropping, because it has just two unique values and has little or no significance to our analysis.
offer_type column - This column, just like the previous one, has only two unique values and is therefore not significant to our analysis.
abtest column - This column is also a candidate to be dropped, since the A/B test split is not relevant to our analysis.
no_of_pics column - This column is also a candidate to be dropped.
Both the price and odometer columns have to be cleaned, because they are supposed to be numeric data, not text.
Now, let's drop the columns that need to be dropped.
autos = autos.drop(columns = ["abtest", "no_of_pics", "offer_type", "seller"])#Drop method drops either row or column.
Now, we will have to clean the price and odometer columns and rename them appropriately
autos["price"]= autos["price"].str.replace("$","").str.replace(",","").astype(int)
autos.rename(columns ={"price":"price_dollars"}, inplace = True)
autos["odometer"] = autos["odometer"].str.replace("km","").str.replace(",","").astype(int)
autos.rename(columns = {"odometer":"odometer_km"}, inplace = True)
print(autos.describe(include ="all"))
From the output in the cell above, we can see that we have successfully dropped the columns not relevant to our analysis and also successfully cleaned and renamed the columns that contained unwanted characters.
Now, we will continue to explore the data in these two columns and look for values that don't look right.
First, we will look for outliers in each of these columns and remove them.
print(autos["price_dollars"].unique().shape)
print("\n")
print(autos["price_dollars"].describe())
print("\n")
print(autos["price_dollars"].value_counts().head(20))
Exploration of the price column shows us the following:
There are 2357 unique values in our price column.
Around 30% of prices equal 0; these rows might be considered for removal from the dataset.
The minimum value is 0 and the maximum value is 100,000,000. These values look unrealistically low and unrealistically high, respectively.
Let's explore further.
autos["price_dollars"].value_counts().sort_index(ascending=False).head(20)#Value_counts method counts the number of a
#particular value present in that series. Sort_index, sorts the values in order of acending or descending and aids in
#showing the lowesrt or highest value.
autos["price_dollars"].value_counts().sort_index(ascending=True).head(20)
Findings:
Quite a number of the prices (about 35% or more) are below 30 dollars.
It is impossible to post a car for 0 dollars, so we have to drop the rows with a price of 0 dollars.
Above 350,000 dollars, the prices roughly double from one listing to the next, which suggests the prices above 350,000 dollars are unrealistic.
Considering the fact that eBay is an auction site and prices could start at 1 dollar, we will keep prices from 1 dollar to 350,000 dollars in the dataset and remove the rest, which we consider outliers.
# To remove outliers, we specify the limit of values. One way to do so is:
# df[(df["col"] > x) & (df["col"] < y)], where x and y are the limit values. The easier way to do it is shown below.
autos = autos[autos["price_dollars"].between(1,350000)]
autos["price_dollars"].describe()
From the output above, we see that we have effectively removed the outliers, and now our minimum value is 1 dollar and our maximum value is 350,000 dollars.
autos["odometer_km"].nunique()#This tells us the number of unique values in the column, more like the shape.
autos["odometer_km"].describe()
autos["odometer_km"].value_counts().sort_index(ascending = False)
We can see from the exploration of the odometer_km column that both the minimum and maximum values are realistic. Since the cars are used cars, it is expected that the majority of them will have very high odometer readings. Hence, there are no outliers.
From our dataset, we can see there are five columns with date data:
date_crawled
registration_year
registration_month
last_seen
ad_created
print(autos["date_crawled"].describe())
print(autos["registration_year"].describe())
print(autos["registration_month"].describe())
print(autos["last_seen"].describe())
print(autos["ad_created"].describe())
autos[["date_crawled", "last_seen", "registration_month", "registration_year", "ad_created"]].info()
From our exploration of the 5 columns with date data, we see that 3 are stored as text and 2 are stored in numeric format. Hence, the 3 columns stored as text need extra handling before we can work with their dates (a parsing sketch follows the list below). These columns are:
ad_created
last_seen
date_crawled
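As a side note, these string dates could also be parsed into proper datetime values with pd.to_datetime. The sketch below is illustrative only and is not applied to autos here, since the string-slicing approach used in the following cells expects the columns to stay as text.
# Illustrative only: parse the string date columns into datetime64 values.
# Not applied to `autos` in this project, because later cells slice the strings directly.
parsed = autos[["date_crawled", "last_seen", "ad_created"]].apply(pd.to_datetime)
print(parsed.dtypes)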
Now, let's explore the 3 columns stored as strings.
autos[["date_crawled", "last_seen", "ad_created"]][0:5]
We will notice that the first 10 characters represent the day (e.g. 2016-03-26).
To get a good grasp of the data, we are going to extract the distribution of dates in each column and then sort by index for easier analysis.
autos["date_crawled"].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = False)#the value_counts has a few
#parameters that takes several argument, for example the normalize parameter takes a boolean argument and returns the
#relative frequency of the value count on a particular data point.By default, it is set to false.On the other hand, the
#dropna parameter is set to True, which excludes null or NaN values.But here, we set it to False.2
From the extraction, we have been able to make some findings:
The data was crawled every day from 5th March 2016 to 7th April 2016.
The amount of data crawled on each day is roughly uniform.
autos["last_seen"].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = False)
From our exploration of the last_seen column, we find the following:
The distribution above shows the relative frequency of removed ads per day.
Far more ads were removed in the last 3 days.
The majority of the ads were last seen on April 6th or April 7th.
autos["ad_created"].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = False)
Exploration of the ad_created column enables us to draw some conclusions:
Ads were created from 11th June 2015 to 7th April 2016.
The majority of the ads were created in March and April 2016.
autos["registration_year"].describe()
From the exploration of this column, we see that the minimum and maximum values are unrealistic, so further analysis has to be done.
It is worth noting that a car cannot be registered after its listing was created. Hence, any year in the registration_year column above 2016 is definitely inaccurate.
autos["registration_year"].value_counts().sort_index()
Recall that from our exploration of the last_seen column, we found that its values lie between 5th March and 7th April 2016, so no registration year should exceed 2016.
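Before dropping anything, a quick sketch like the one below can count how many rows actually fall outside a plausible registration-year range; the 1917-2016 bounds mirror the filter applied a few cells further down and are illustrative here.
# How many rows have a registration year outside the plausible 1917-2016 range?
out_of_range = ~autos["registration_year"].between(1917, 2016)
print(out_of_range.sum(), "rows,", round(out_of_range.mean() * 100, 2), "% of the remaining data")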
autos["registration_year"].value_counts(dropna = False).sort_index(ascending = False).head(50)
A further exploration of the registration_year column shows registration years as early as 1000, which is unrealistic, as cars had not been made by then. For the registration year, we will therefore only consider years from 1917 to 2016; all rows carrying a year outside this range will be dropped.
autos = autos[autos["registration_year"].between(1917, 2016)]
autos["registration_year"].describe()
autos.describe()
Now, we will explore each brand of car in our dataset and see the price variation of each.
First, let's see the columns available in our dataset again.
autos.head()
autos["brand"].unique()#returns the various brands in the brand columns.
top_10_brands = autos["brand"].value_counts()[:10]
print(top_10_brands)
top_6_brands = autos["brand"].value_counts()[:6]
print(top_6_brands)
bottom_10_brands = autos["brand"].value_counts()[-10:]
print(bottom_10_brands)
Looking carefully at the top_10_brands and bottom_10_brands lists, we see that the most popular cars in the eBay listings are Volkswagen, BMW and Mercedes-Benz, alongside Audi; many of the remaining brands are foreign (non-German).
Since the top 10 brands account for about 80% of the total listings, we will focus our brand/price analysis on the most popular of these cars.
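As a rough check of that coverage figure, a sketch along these lines could compute the share of listings covered by the top 10 brands; the exact percentage depends on the cleaned data and is not reported in the original cells.
# Share of all listings accounted for by the 10 most common brands (illustrative check).
top_10_share = autos["brand"].value_counts(normalize=True).head(10).sum()
print(round(top_10_share * 100, 1), "% of listings belong to the top 10 brands")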
# We want to use aggregation to explore the top brands in terms of price.
mean_car_price = {}
top_6_brands = autos["brand"].value_counts()[:6].index
for cars in top_6_brands:
    selected_rows = autos[autos["brand"] == cars]
    mean_price = selected_rows["price_dollars"].mean()
    mean_price = int(mean_price)
    mean_car_price[cars] = mean_price
print(mean_car_price)
print("\n")
sorted(mean_car_price.items(), key=lambda x: x[1], reverse=True)
top_6 = autos["brand"].value_counts()[:6]
print(top_6)
From the analysis above, we see that the mean prices of Audi, Mercedes-Benz and BMW are the highest, respectively, while Volkswagen has a moderate mean price.
Previously, we analysed the prices of the top 6 brands and discovered that the more luxurious brands are BMW, Audi and Mercedes-Benz. Now, we will compare the mileage (odometer) of these top 6 brands to see how it fares when compared with price.
mean_car_mileage = {}
top_6_brands = autos["brand"].value_counts()[:6].index
for cars in top_6_brands:
    selected_rows = autos[autos["brand"] == cars]
    mean_mileage = selected_rows["odometer_km"].mean()
    mean_mileage = int(mean_mileage)
    mean_car_mileage[cars] = mean_mileage
print(mean_car_mileage)
sorted(mean_car_mileage.items(), key=lambda x: x[1], reverse=True)
We see that brands such as BMW, Mercedes-Benz, Audi and Volkswagen have high mean mileage. It is also worth noting that the mean mileage of the top 6 brands lies between roughly 117,000 km and 133,000 km.
Now, we will use the mean mileage of these cars alongside the mean prices to see if there is any visible link.
To do this, we have to tie both the mean_car_mileage and mean_car_price dictionaries together into a dataframe. This will make our analysis a lot easier.
# Turning the mean_car_price dict into a Series.
top_6_price = pd.Series(mean_car_price)
top_6_mileage = pd.Series(mean_car_mileage)
top_6_brands = pd.DataFrame((top_6_price), columns = ["mean_price"])  # Converting the Series object to a DataFrame,
# done by setting the columns parameter, which accepts an array-like object specifying the column name.
top_6_brands["mean_car_mileage"] = top_6_mileage
print(top_6_brands)
From our newly created dataframe, we are able to draw some findings:
Mileage doesn't appear to have a close relationship with price (a quick check follows this list).
The most luxurious brands do tend to have somewhat higher mileage, though.
A couple of low-priced brands still have high mileage, e.g. Opel.
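A quick way to quantify that impression is a correlation between the two aggregated columns. The sketch below is only a rough check over six data points, so it should be read with caution.
# Rough check: correlation between mean price and mean mileage across the top 6 brands.
print(top_6_brands["mean_price"].corr(top_6_brands["mean_car_mileage"]))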
brand_model = autos[["brand", "model"]]
brand_model.isnull().sum()
brand_model = brand_model.dropna(axis=0)
brand_model.isnull().sum()
From the cells above, we see that the model column had 2190 missing values, and since we can't work with those, we removed the rows containing them.
brand_model.groupby("brand")["model"].value_counts(dropna = False).sort_values(ascending=False).head()
From this, we can see that the most common brand/model combinations are Volkswagen Golf, BMW 3er, Volkswagen Polo, Opel Corsa and the rest seen above.
autos["odometer_km"].value_counts().sort_index()
autos["odometer_km"].unique()
autos.groupby("odometer_km", as_index = False)["price_dollars"].mean().sort_values("price_dollars", ascending =False).round(0)
We have successfully grouped the odometer readings of our cars with their associated mean prices. Taking a closer look at the table, we can conclude that cars with very high odometer readings have lower prices, except for the cars with a 5,000 km reading, which also have somewhat low prices. This could be because the cars with such low mileage are not popular brands, so buyers don't fancy them.
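One way to probe that explanation is to look at which brands actually appear among the 5,000 km listings. The sketch below is a quick, illustrative check rather than part of the original analysis.
# Which brands dominate the listings with a 5,000 km odometer reading?
low_mileage = autos[autos["odometer_km"] == 5000]
print(low_mileage["brand"].value_counts(normalize=True).head(10))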
autos["unrepaired_damage"].describe()
Before proceeding, we see that this column contains values written in German, so we have to do some data cleaning on it.
mapping_dict = {
    "nein": "No",
    "ja": "Yes"
}
autos["unrepaired_damage"]=autos["unrepaired_damage"].map(mapping_dict)
autos["unrepaired_damage"].describe()#We have succesfully corrected bad values.
#damaged = autos.loc[(autos["unrepaired_damage"] == "Yes"), ["price_dollars"]]
#damaged["price_dollars"].describe()
#damaged = autos.loc[(autos["unrepaired_damage"] == "No"), ["price_dollars"]]
#damaged["price_dollars"].describe()
autos.groupby("unrepaired_damage")["price_dollars"].max()#We can also compare the unrepaired column with the price column
#like so.
The results in the cells above were obtained using two different techniques. For the first two (commented out), we indexed the unrepaired_damage and price columns, selecting the rows where the former equals "Yes" (and then "No").
For the second technique, we used the groupby method of the dataframe and compared the unrepaired_damage column with the price column to obtain the maximum.
We can conclude from this that cars with damage not yet repaired cost less than those without damage.
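Comparing only the maximum prices is a fairly weak test. A sketch comparing the mean and median prices of damaged versus undamaged cars, as below, would support the same conclusion more directly; it is a minimal check rather than part of the original cells.
# Mean and median price of cars with and without unrepaired damage.
print(autos.groupby("unrepaired_damage")["price_dollars"].agg(["mean", "median"]))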