In this guided project, we'll work with a dataset of used cars from the classifieds section of the German eBay website. The dataset was originally scraped and uploaded to Kaggle.
The data dictionary provided with the data is as follows:
dateCrawled - When this ad was first crawled. All field values are taken from this date
name - Name of the car
seller - Whether the seller is private or a dealer
offerType - The type of listing
price - The price on the ad to sell the car
abtest - Whether the listing is included in an A/B test
vehicleType - The vehicle Type
yearOfRegistration - The year in which the car was first registered
gearBox - The transmission type
powerPS - The power of the car in PS
model - The car model name
kilometer - How many kilometers the car has driven
monthOfRegistration - The month in which the car was first registered
fuelType - What type of fuel the car uses
brand - The brand of the car
notRepairedDamage - If the car has a damage which is not yet repaired
dateCreated - The date on which the eBay listing was created
postalCode - The postal code for the location of the vehicle
lastSeen - When the crawler saw this ad last online
The aim of this analysis is to clean the data and analyze the included used car listings.
import pandas as pd
autos = pd.read_csv("C:\\Users\\USER\\Videos\\Coding Videos\\my_datasets\\autos.csv", encoding = "Latin-1")
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info()#Gives an overview of our dataset; we can observe from the output that we have 20 columns, most
#of which are stored as strings (object dtype).
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
When we extract the information of our dataframe using the info method, we see that some columns have missing values, but none of them is more than 20% null. The columns with null values are vehicleType, gearbox, model, fuelType and notRepairedDamage.
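The proportion of missing values per column can be computed directly from isnull(); a minimal sketch on a toy frame (the column names mirror our dataset, the values are made up):

```python
import pandas as pd

# Toy frame mimicking the kind of gaps seen in vehicleType and gearbox
df = pd.DataFrame({
    "vehicleType": ["bus", None, "limousine", None],
    "gearbox": ["manuell", "automatik", None, "manuell"],
})

# Fraction of nulls per column; on the real data every value is below 0.20
null_share = df.isnull().sum() / len(df)
print(null_share)
```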
autos.head()#print the first five rows of the dataframe
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Exploring the head of our dataframe, we can observe that the column labels are written in camelCase rather than Python's conventional snake_case style. So our next task is to rename the column labels accordingly.
#Map each camelCase label to its snake_case equivalent, then rename in one pass
col_mapping = {
    "dateCrawled": "date_crawled",
    "offerType": "offer_type",
    "vehicleType": "vehicle_type",
    "powerPS": "power_ps",
    "fuelType": "fuel_type",
    "monthOfRegistration": "registration_month",
    "yearOfRegistration": "registration_year",
    "notRepairedDamage": "unrepaired_damage",
    "dateCreated": "ad_created",
    "lastSeen": "last_seen",
    "postalCode": "postal_code",
    "nrOfPictures": "no_of_pics",
}
autos.rename(columns = col_mapping, inplace = True)
print(autos.columns)#Examining the column labels, using the Dataframe.columns attribute
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'no_of_pics', 'postal_code', 'last_seen'], dtype='object')
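The renaming above maps each label explicitly. A general camelCase-to-snake_case conversion could instead use a regular expression; a sketch (an alternative, not what we ran above — note that an acronym like powerPS comes out as power_p_s, which is one reason to map some labels by hand):

```python
import re

def to_snake_case(label):
    # Insert an underscore before every capital letter, then lowercase
    return re.sub(r"(?<!^)(?=[A-Z])", "_", label).lower()

print(to_snake_case("yearOfRegistration"))  # year_of_registration
```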
print(autos.head())
date_crawled name \ 0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... seller offer_type price abtest vehicle_type registration_year \ 0 privat Angebot $5,000 control bus 2004 1 privat Angebot $8,500 control limousine 1997 2 privat Angebot $8,990 test limousine 2009 3 privat Angebot $4,350 control kleinwagen 2007 4 privat Angebot $1,350 test kombi 2003 gearbox power_ps model odometer registration_month fuel_type \ 0 manuell 158 andere 150,000km 3 lpg 1 automatik 286 7er 150,000km 6 benzin 2 manuell 102 golf 70,000km 7 benzin 3 automatik 71 fortwo 70,000km 6 benzin 4 manuell 0 focus 150,000km 7 benzin brand unrepaired_damage ad_created no_of_pics postal_code \ 0 peugeot nein 2016-03-26 00:00:00 0 79588 1 bmw nein 2016-04-04 00:00:00 0 71034 2 volkswagen nein 2016-03-26 00:00:00 0 35394 3 smart nein 2016-03-12 00:00:00 0 33729 4 ford nein 2016-04-01 00:00:00 0 39218 last_seen 0 2016-04-06 06:45:54 1 2016-04-06 14:45:08 2 2016-04-06 20:15:37 3 2016-03-15 03:16:28 4 2016-04-01 14:38:50
print(autos.describe(include = "all"))
date_crawled name seller offer_type price abtest \ count 50000 50000 50000 50000 50000 50000 unique 48213 38754 2 2 2357 2 top 2016-03-22 09:51:06 Ford_Fiesta privat Angebot $0 test freq 3 78 49999 49999 1421 25756 mean NaN NaN NaN NaN NaN NaN std NaN NaN NaN NaN NaN NaN min NaN NaN NaN NaN NaN NaN 25% NaN NaN NaN NaN NaN NaN 50% NaN NaN NaN NaN NaN NaN 75% NaN NaN NaN NaN NaN NaN max NaN NaN NaN NaN NaN NaN vehicle_type registration_year gearbox power_ps model \ count 44905 50000.000000 47320 50000.000000 47242 unique 8 NaN 2 NaN 245 top limousine NaN manuell NaN golf freq 12859 NaN 36993 NaN 4024 mean NaN 2005.073280 NaN 116.355920 NaN std NaN 105.712813 NaN 209.216627 NaN min NaN 1000.000000 NaN 0.000000 NaN 25% NaN 1999.000000 NaN 70.000000 NaN 50% NaN 2003.000000 NaN 105.000000 NaN 75% NaN 2008.000000 NaN 150.000000 NaN max NaN 9999.000000 NaN 17700.000000 NaN odometer registration_month fuel_type brand unrepaired_damage \ count 50000 50000.000000 45518 50000 40171 unique 13 NaN 7 40 2 top 150,000km NaN benzin volkswagen nein freq 32424 NaN 30107 10687 35232 mean NaN 5.723360 NaN NaN NaN std NaN 3.711984 NaN NaN NaN min NaN 0.000000 NaN NaN NaN 25% NaN 3.000000 NaN NaN NaN 50% NaN 6.000000 NaN NaN NaN 75% NaN 9.000000 NaN NaN NaN max NaN 12.000000 NaN NaN NaN ad_created no_of_pics postal_code last_seen count 50000 50000.0 50000.000000 50000 unique 76 NaN NaN 39481 top 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27 freq 1946 NaN NaN 8 mean NaN 0.0 50813.627300 NaN std NaN 0.0 25779.747957 NaN min NaN 0.0 1067.000000 NaN 25% NaN 0.0 30451.000000 NaN 50% NaN 0.0 49577.000000 NaN 75% NaN 0.0 71540.000000 NaN max NaN 0.0 99998.000000 NaN
A closer look at the summary statistics above shows that some columns are worth dropping and others are worth modifying. Findings:
seller column - worthy of being dropped: it has just two unique values (49,999 of the 50,000 rows are 'privat'), so it has little or no significance to our analysis.
offer_type column - like the previous one, this has two unique values (almost all 'Angebot') and hence is not significant to our analysis.
abtest column - a candidate to be dropped as well.
no_of_pics column - a candidate to be dropped as well, since every value is 0.
Both the price and odometer columns have to be cleaned, because they are stored as text but should be numeric.
Now, let's drop the columns that need to be dropped.
autos = autos.drop(columns = ["abtest", "no_of_pics", "offer_type", "seller"])#drop removes either rows or columns; columns= targets columns by label
Now, we will have to clean the price and odometer columns and rename them appropriately
autos["price"] = autos["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(int)#regex=False treats "$" and "," as literal text
autos.rename(columns = {"price":"price_dollars"}, inplace = True)
autos["odometer"] = autos["odometer"].str.replace("km", "", regex=False).str.replace(",", "", regex=False).astype(int)
autos.rename(columns = {"odometer":"odometer_km"}, inplace = True)
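One detail worth flagging: on recent pandas versions, str.replace treats its pattern as a regular expression unless regex=False is passed, and "$" is a regex anchor rather than a literal character. A self-contained sketch of the cleaning step on made-up price strings:

```python
import pandas as pd

prices = pd.Series(["$5,000", "$8,500", "$1,350"])

# regex=False makes "$" and "," plain text rather than regex syntax
cleaned = (prices.str.replace("$", "", regex=False)
                 .str.replace(",", "", regex=False)
                 .astype(int))
print(cleaned.tolist())  # [5000, 8500, 1350]
```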
print(autos.describe(include ="all"))
date_crawled name price_dollars vehicle_type \ count 50000 50000 5.000000e+04 44905 unique 48213 38754 NaN 8 top 2016-03-22 09:51:06 Ford_Fiesta NaN limousine freq 3 78 NaN 12859 mean NaN NaN 9.840044e+03 NaN std NaN NaN 4.811044e+05 NaN min NaN NaN 0.000000e+00 NaN 25% NaN NaN 1.100000e+03 NaN 50% NaN NaN 2.950000e+03 NaN 75% NaN NaN 7.200000e+03 NaN max NaN NaN 1.000000e+08 NaN registration_year gearbox power_ps model odometer_km \ count 50000.000000 47320 50000.000000 47242 50000.000000 unique NaN 2 NaN 245 NaN top NaN manuell NaN golf NaN freq NaN 36993 NaN 4024 NaN mean 2005.073280 NaN 116.355920 NaN 125732.700000 std 105.712813 NaN 209.216627 NaN 40042.211706 min 1000.000000 NaN 0.000000 NaN 5000.000000 25% 1999.000000 NaN 70.000000 NaN 125000.000000 50% 2003.000000 NaN 105.000000 NaN 150000.000000 75% 2008.000000 NaN 150.000000 NaN 150000.000000 max 9999.000000 NaN 17700.000000 NaN 150000.000000 registration_month fuel_type brand unrepaired_damage \ count 50000.000000 45518 50000 40171 unique NaN 7 40 2 top NaN benzin volkswagen nein freq NaN 30107 10687 35232 mean 5.723360 NaN NaN NaN std 3.711984 NaN NaN NaN min 0.000000 NaN NaN NaN 25% 3.000000 NaN NaN NaN 50% 6.000000 NaN NaN NaN 75% 9.000000 NaN NaN NaN max 12.000000 NaN NaN NaN ad_created postal_code last_seen count 50000 50000.000000 50000 unique 76 NaN 39481 top 2016-04-03 00:00:00 NaN 2016-04-07 06:17:27 freq 1946 NaN 8 mean NaN 50813.627300 NaN std NaN 25779.747957 NaN min NaN 1067.000000 NaN 25% NaN 30451.000000 NaN 50% NaN 49577.000000 NaN 75% NaN 71540.000000 NaN max NaN 99998.000000 NaN
From the output in the above cell, we can see that we have successfully dropped the columns not relevant to our analysis, and successfully cleaned and renamed the columns that needed it.
Now, we will continue to explore data from these two columns and look for values that don't look right.
First, we will look for outliers in each of these columns and remove them.
print(autos["price_dollars"].unique().shape)
print("\n")
print(autos["price_dollars"].describe())
print("\n")
print(autos["price_dollars"].value_counts().head(20))
(2357,)


count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_dollars, dtype: float64


0       1421
500      781
1500     734
2500     643
1000     639
1200     639
600      531
800      498
3500     498
2000     460
999      434
750      433
900      420
650      419
850      410
700      395
4500     394
300      384
2200     382
950      379
Name: price_dollars, dtype: int64
Exploration of the price column shows us the following:
There are 2,357 unique values in our price column.
1,421 prices (around 3% of rows) equal 0; these rows might be considered for removal from the dataset.
The minimum value is 0 and the maximum value is 100,000,000, which look unrealistically low and unrealistically high respectively.
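The share of any given price can be read off value_counts(normalize=True); a sketch on toy prices (made-up values, not our actual column):

```python
import pandas as pd

prices = pd.Series([0, 0, 500, 1500, 0, 2500])

# normalize=True returns relative frequencies instead of raw counts
zero_share = prices.value_counts(normalize=True).loc[0]
print(zero_share)  # 0.5
```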
Let's explore further.
autos["price_dollars"].value_counts().sort_index(ascending=False).head(20)#value_counts counts the occurrences of
#each value in the series. sort_index sorts by the index in ascending or descending order, which helps in
#showing the lowest or highest values.
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
259000      1
250000      1
220000      1
198000      1
197000      1
Name: price_dollars, dtype: int64
autos["price_dollars"].value_counts().sort_index(ascending=True).head(20)
0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
20       4
25       5
29       1
30       7
35       1
Name: price_dollars, dtype: int64
Findings:
Quite a number of the prices (over 1,600 listings, roughly 3%) are 30 dollars or below.
It is implausible for a car to actually sell for 0 dollars; hence, we have to drop the 0-dollar rows.
Above 350,000 dollars, the prices jump abruptly (the next listed price is 999,990 dollars), which suggests the values above that point are not genuine.
Considering the fact that eBay is an auction site and prices could legitimately start at 1 dollar, we will keep prices from 1 dollar to 350,000 dollars in the dataset and remove the rest, which we treat as outliers.
#To remove outliers, we specify limits on the values. One way to do so is:
#df[(df["col"] > x) & (df["col"] < y)], where x and y are the limit values. The easier way, Series.between
#(inclusive of both bounds), is shown below.
autos = autos[autos["price_dollars"].between(1,350000)]
autos["price_dollars"].describe()
count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price_dollars, dtype: float64
From the output above, we see that we have effectively removed the outliers: our minimum value is now 1 dollar and the maximum is 350,000 dollars.
autos["odometer_km"].nunique()#nunique returns the number of distinct values in the column
13
autos["odometer_km"].describe()
count     48565.000000
mean     125770.101925
std       39788.636804
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
autos["odometer_km"].value_counts().sort_index(ascending = False)
150000    31414
125000     5057
100000     2115
90000      1734
80000      1415
70000      1217
60000      1155
50000      1012
40000       815
30000       780
20000       762
10000       253
5000        836
Name: odometer_km, dtype: int64
We can see from the exploration of the odometer_km column that both the minimum and maximum values are realistic. Since these are used cars, it is expected that the majority will have a very high odometer reading; hence, there are no outliers here.
From our dataset, we can see there are five columns with date information:
date_crawled
registration_year
registration_month
last_seen
ad_created
print(autos["date_crawled"].describe())
print(autos["registration_year"].describe())
print(autos["registration_month"].describe())
print(autos["last_seen"].describe())
print(autos["ad_created"].describe())
count 48565 unique 46882 top 2016-03-25 19:57:10 freq 3 Name: date_crawled, dtype: object count 48565.000000 mean 2004.755421 std 88.643887 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64 count 48565.000000 mean 5.782251 std 3.685595 min 0.000000 25% 3.000000 50% 6.000000 75% 9.000000 max 12.000000 Name: registration_month, dtype: float64 count 48565 unique 38474 top 2016-04-07 06:17:27 freq 8 Name: last_seen, dtype: object count 48565 unique 76 top 2016-04-03 00:00:00 freq 1887 Name: ad_created, dtype: object
autos[["date_crawled", "last_seen", "registration_month", "registration_year", "ad_created"]].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 48565 entries, 0 to 49999
Data columns (total 5 columns):
date_crawled          48565 non-null object
last_seen             48565 non-null object
registration_month    48565 non-null int64
registration_year     48565 non-null int64
ad_created            48565 non-null object
dtypes: int64(2), object(3)
memory usage: 2.2+ MB
From our exploration of the 5 columns with date information, we see that 3 are stored as text (object dtype) and 2 as integers. To work with the text columns as dates, they will eventually need converting to a proper date representation. These columns are:
ad_created
last_seen
date_crawled
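For genuine date arithmetic, these columns could be parsed with pd.to_datetime rather than kept as text; a minimal sketch on sample values in the same format as our data:

```python
import pandas as pd

dates = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"])

# Parse the full timestamps; .dt then exposes year, month, day, etc.
parsed = pd.to_datetime(dates)
print(parsed.dt.year.tolist())  # [2016, 2016]
```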
Now, let's explore the 3 columns stored as strings.
autos[["date_crawled", "last_seen", "ad_created"]][0:5]
date_crawled | last_seen | ad_created | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-04-06 06:45:54 | 2016-03-26 00:00:00 |
1 | 2016-04-04 13:38:56 | 2016-04-06 14:45:08 | 2016-04-04 00:00:00 |
2 | 2016-03-26 18:57:24 | 2016-04-06 20:15:37 | 2016-03-26 00:00:00 |
3 | 2016-03-12 16:58:10 | 2016-03-15 03:16:28 | 2016-03-12 00:00:00 |
4 | 2016-04-01 14:38:50 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 |
We will notice that the first 10 characters of each value represent the date (e.g. 2016-03-26).
To get a good grasp of the data, we are going to extract the relative counts of each date in the column and then sort by index for easier analysis.
autos["date_crawled"].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = False)#value_counts takes a few
#parameters: normalize takes a boolean and returns the relative frequency of each value instead of a raw count
#(it defaults to False). The dropna parameter defaults to True, which excludes null or NaN values; here we set
#it to False so that missing dates would also be counted.
2016-04-07 0.001400 2016-04-06 0.003171 2016-04-05 0.013096 2016-04-04 0.036487 2016-04-03 0.038608 2016-04-02 0.035478 2016-04-01 0.033687 2016-03-31 0.031834 2016-03-30 0.033687 2016-03-29 0.034099 2016-03-28 0.034860 2016-03-27 0.031092 2016-03-26 0.032204 2016-03-25 0.031607 2016-03-24 0.029342 2016-03-23 0.032225 2016-03-22 0.032987 2016-03-21 0.037373 2016-03-20 0.037887 2016-03-19 0.034778 2016-03-18 0.012911 2016-03-17 0.031628 2016-03-16 0.029610 2016-03-15 0.034284 2016-03-14 0.036549 2016-03-13 0.015670 2016-03-12 0.036920 2016-03-11 0.032575 2016-03-10 0.032184 2016-03-09 0.033090 2016-03-08 0.033296 2016-03-07 0.036014 2016-03-06 0.014043 2016-03-05 0.025327 Name: date_crawled, dtype: float64
From the extraction, we have been able to make some findings. These are:
The data was crawled every day from 5th March 2016 to 7th April 2016.
The amount of data crawled on each day is roughly uniform.
autos["last_seen"].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = False)
2016-04-07 0.131947 2016-04-06 0.221806 2016-04-05 0.124761 2016-04-04 0.024483 2016-04-03 0.025203 2016-04-02 0.024915 2016-04-01 0.022794 2016-03-31 0.023783 2016-03-30 0.024771 2016-03-29 0.022341 2016-03-28 0.020859 2016-03-27 0.015649 2016-03-26 0.016802 2016-03-25 0.019211 2016-03-24 0.019767 2016-03-23 0.018532 2016-03-22 0.021373 2016-03-21 0.020632 2016-03-20 0.020653 2016-03-19 0.015834 2016-03-18 0.007351 2016-03-17 0.028086 2016-03-16 0.016452 2016-03-15 0.015876 2016-03-14 0.012602 2016-03-13 0.008895 2016-03-12 0.023783 2016-03-11 0.012375 2016-03-10 0.010666 2016-03-09 0.009595 2016-03-08 0.007413 2016-03-07 0.005395 2016-03-06 0.004324 2016-03-05 0.001071 Name: last_seen, dtype: float64
From our exploration of the last_seen column, these are our findings:
The distribution above shows the relative frequency of removed ads per day.
Disproportionately many ads were removed in the last three days: roughly 48% of all listings were last seen between April 5th and April 7th.
autos["ad_created"].str[:10].value_counts(normalize = True, dropna = False).sort_index(ascending = False)
2016-04-07 0.001256 2016-04-06 0.003253 2016-04-05 0.011819 2016-04-04 0.036858 2016-04-03 0.038855 ... 2015-12-05 0.000021 2015-11-10 0.000021 2015-09-09 0.000021 2015-08-10 0.000021 2015-06-11 0.000021 Name: ad_created, Length: 76, dtype: float64
Exploring the ad_created column enables us to draw some conclusions:
Ads were created between 11th June 2015 and 7th April 2016.
The majority of ads were created in March and April 2016.
autos["registration_year"].describe()
count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
From the exploration of this column, we see that the minimum and maximum values are outrageous, so further analysis has to be done.
It is worth noting that a car cannot be first registered after its listing was seen; hence, any value in the registration_year column above 2016 is definitely inaccurate.
autos["registration_year"].value_counts().sort_index()
1000 1 1001 1 1111 1 1800 2 1910 5 .. 5911 1 6200 1 8888 1 9000 1 9999 3 Name: registration_year, Length: 95, dtype: int64
Recall from our exploration of the last_seen column that its values fall between 5th March and 7th April 2016, which means no registration year should exceed 2016.
autos["registration_year"].value_counts(dropna = False).sort_index(ascending = False).head(50)
9999 3 9000 1 8888 1 6200 1 5911 1 5000 4 4800 1 4500 1 4100 1 2800 1 2019 2 2018 470 2017 1392 2016 1220 2015 392 2014 663 2013 803 2012 1310 2011 1623 2010 1589 2009 2085 2008 2215 2007 2277 2006 2670 2005 2936 2004 2703 2003 2699 2002 2486 2001 2636 2000 3156 1999 2897 1998 2363 1997 1951 1996 1373 1995 1227 1994 629 1993 425 1992 370 1991 339 1990 347 1989 174 1988 135 1987 72 1986 72 1985 95 1984 51 1983 51 1982 41 1981 28 1980 85 Name: registration_year, dtype: int64
Further exploration of the registration_year column shows registrations as early as the year 1000, which is unrealistic, since cars had not been invented then. Realistic values should lie between the early twentieth century and 2016, so we will keep only registration years from 1917 to 2016. All rows carrying a year outside this range will be dropped.
autos = autos[autos["registration_year"].between(1917, 2016)]
autos["registration_year"].describe()
count    46676.000000
mean      2002.920709
std          7.120843
min       1927.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64
autos.describe()
price_dollars | registration_year | power_ps | odometer_km | registration_month | postal_code | |
---|---|---|---|---|---|---|
count | 46676.000000 | 46676.000000 | 46676.000000 | 46676.000000 | 46676.000000 | 46676.000000 |
mean | 5978.332612 | 2002.920709 | 117.905562 | 125594.631074 | 5.827749 | 51096.456166 |
std | 9178.207768 | 7.120843 | 184.928789 | 39842.598992 | 3.670001 | 25755.911915 |
min | 1.000000 | 1927.000000 | 0.000000 | 5000.000000 | 0.000000 | 1067.000000 |
25% | 1250.000000 | 1999.000000 | 75.000000 | 100000.000000 | 3.000000 | 30827.000000 |
50% | 3100.000000 | 2003.000000 | 109.000000 | 150000.000000 | 6.000000 | 49826.000000 |
75% | 7500.000000 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 71732.000000 |
max | 350000.000000 | 2016.000000 | 17700.000000 | 150000.000000 | 12.000000 | 99998.000000 |
Now, we will explore each brand of car in our dataset and see how prices vary between them.
First, let's see the columns available in our dataset again.
autos.head()
date_crawled | name | price_dollars | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
autos["brand"].unique()#returns the distinct brands in the brand column
array(['peugeot', 'bmw', 'volkswagen', 'smart', 'ford', 'chrysler', 'seat', 'renault', 'mercedes_benz', 'audi', 'sonstige_autos', 'opel', 'mazda', 'porsche', 'mini', 'toyota', 'dacia', 'nissan', 'jeep', 'saab', 'volvo', 'mitsubishi', 'jaguar', 'fiat', 'skoda', 'subaru', 'kia', 'citroen', 'chevrolet', 'hyundai', 'honda', 'daewoo', 'suzuki', 'trabant', 'land_rover', 'alfa_romeo', 'lada', 'rover', 'daihatsu', 'lancia'], dtype=object)
top_10_brands = autos["brand"].value_counts()[:10]
print(top_10_brands)
volkswagen 9862 bmw 5137 opel 5020 mercedes_benz 4503 audi 4041 ford 3263 renault 2200 peugeot 1393 fiat 1197 seat 853 Name: brand, dtype: int64
top_6_brands = autos["brand"].value_counts()[:6]
print(top_6_brands)
volkswagen 9862 bmw 5137 opel 5020 mercedes_benz 4503 audi 4041 ford 3263 Name: brand, dtype: int64
bottom_10_brands = autos["brand"].value_counts()[-10:]
print(bottom_10_brands)
jeep 106 subaru 100 land_rover 98 saab 77 jaguar 73 daewoo 70 trabant 64 rover 62 lancia 50 lada 27 Name: brand, dtype: int64
Looking carefully at the top_10_brands and bottom_10_brands lists, we see that the most popular cars in the eBay listings are the German brands Volkswagen, BMW, Opel, Mercedes-Benz and Audi; most of the remaining brands are foreign.
Since the top 10 brands account for roughly 80% of all listings, we will focus our brand/price analysis on the most common of these brands.
#We will use aggregation to explore the top brands in terms of price
mean_car_price = {}
top_6_brands = autos["brand"].value_counts()[:6].index
for cars in top_6_brands:
    selected_rows = autos[autos["brand"] == cars]
    mean_price = selected_rows["price_dollars"].mean()
    mean_price = int(mean_price)
    mean_car_price[cars] = mean_price
print(mean_car_price)
print("\n")
sorted(mean_car_price.items(), key=lambda x:x[1], reverse = True)
{'volkswagen': 5402, 'bmw': 8332, 'opel': 2976, 'mercedes_benz': 8628, 'audi': 9336, 'ford': 3749}
[('audi', 9336), ('mercedes_benz', 8628), ('bmw', 8332), ('volkswagen', 5402), ('ford', 3749), ('opel', 2976)]
top_6 = autos["brand"].value_counts()[:6]
print(top_6)
volkswagen 9862 bmw 5137 opel 5020 mercedes_benz 4503 audi 4041 ford 3263 Name: brand, dtype: int64
From the analysis above, we see that Audi, Mercedes-Benz and BMW have the highest mean prices, in that order, while Volkswagen has a moderate mean price.
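The loop-and-dictionary aggregation used above can also be written with groupby; a sketch on toy data (made-up brands and prices, not our actual frame):

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["audi", "opel", "audi", "opel"],
    "price_dollars": [9000, 3000, 10000, 2900],
})

# Mean price per brand, highest first -- equivalent to looping over brands
mean_prices = toy.groupby("brand")["price_dollars"].mean().sort_values(ascending=False)
print(mean_prices.to_dict())  # {'audi': 9500.0, 'opel': 2950.0}
```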
Previously, we analysed the prices of the top 6 brands and found that the luxury brands are BMW, Audi and Mercedes-Benz. Now, we will compare the mean mileage (odometer reading) of these top 6 brands to see how it fares against price.
mean_car_mileage = {}
top_6_brands = autos["brand"].value_counts()[:6].index
for cars in top_6_brands:
    selected_rows = autos[autos["brand"] == cars]
    mean_mileage = selected_rows["odometer_km"].mean()
    mean_mileage = int(mean_mileage)
    mean_car_mileage[cars] = mean_mileage
print(mean_car_mileage)
sorted(mean_car_mileage.items(), key = lambda x:x[1], reverse = True)
{'volkswagen': 128707, 'bmw': 132572, 'opel': 129311, 'mercedes_benz': 130788, 'audi': 129157, 'ford': 124266}
[('bmw', 132572), ('mercedes_benz', 130788), ('opel', 129311), ('audi', 129157), ('volkswagen', 128707), ('ford', 124266)]
We see now that BMW, Mercedes-Benz, Audi and Volkswagen all have high mean mileage. It is also worth noting that the mean mileage of the top 6 brands lies between roughly 124,000 km and 133,000 km.
Now, we will set the mean mileage of these brands against their mean prices, to see if there is any visible link.
To do this, we will tie the mean_car_mileage and mean_car_price dictionaries together into one dataframe. This will make the comparison a lot easier.
#Turning the mean_car_price dict into a Series
top_6_price = pd.Series(mean_car_price)
top_6_mileage = pd.Series(mean_car_mileage)
top_6_brands = pd.DataFrame(top_6_price, columns = ["mean_price"])#Converting the Series object to a DataFrame,
#done by setting the columns parameter, which accepts an array-like object specifying the column name
top_6_brands["mean_car_mileage"] = top_6_mileage
print(top_6_brands)
               mean_price  mean_car_mileage
volkswagen           5402            128707
bmw                  8332            132572
opel                 2976            129311
mercedes_benz        8628            130788
audi                 9336            129157
ford                 3749            124266
From our newly created DataFrame, we are able to draw some findings:
Mileage does not appear to have a close relationship with price.
However, the most luxurious brands do have somewhat higher mileage.
A couple of low-priced brands still have high mileage, e.g. Opel.
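As a quick sanity check on these observations, we can compute the correlation between the two columns. The snippet below rebuilds the top-6 summary from the values printed above so it is self-contained; it is only six aggregated points, so the number should be read loosely.

```python
import pandas as pd

# Rebuild the top-6 summary DataFrame from the values printed above
top_6_brands = pd.DataFrame({
    "mean_price": {"volkswagen": 5402, "bmw": 8332, "opel": 2976,
                   "mercedes_benz": 8628, "audi": 9336, "ford": 3749},
    "mean_car_mileage": {"volkswagen": 128707, "bmw": 132572, "opel": 129311,
                         "mercedes_benz": 130788, "audi": 129157, "ford": 124266},
})

# Pearson correlation between mean price and mean mileage across the brands
corr = top_6_brands["mean_price"].corr(top_6_brands["mean_car_mileage"])
print(round(corr, 2))
```

A moderately positive value here is consistent with the note above: the luxury brands carry somewhat higher mileage, but with only six points this is far from a strong relationship.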
brand_model = autos[["brand", "model"]]
brand_model.isnull().sum()
brand 0 model 2190 dtype: int64
brand_model = brand_model.dropna(axis=0)
brand_model.isnull().sum()
brand 0 model 0 dtype: int64
From the cells above, we see that the model column had 2190 missing values; since we can't work with those, we removed the corresponding rows.
brand_model.groupby("brand")["model"].value_counts(dropna = False).sort_values(ascending=False).head()
brand model volkswagen golf 3707 bmw 3er 2615 volkswagen polo 1609 opel corsa 1591 volkswagen passat 1349 Name: model, dtype: int64
From this, we can see that the most common brand/model combinations are Volkswagen Golf, BMW 3er, Volkswagen Polo, Opel Corsa and the rest seen above.
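If we only wanted the single most frequent model per brand rather than the full ranking, one option is to aggregate with `value_counts().idxmax()`. The sketch below uses a small hypothetical sample in place of the full brand/model columns:

```python
import pandas as pd

# A small hypothetical sample standing in for the brand/model columns
brand_model = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "volkswagen", "bmw", "bmw", "opel"],
    "model": ["golf", "golf", "polo", "3er", "3er", "corsa"],
})

# For each brand, keep only its single most frequent model
top_model = (brand_model.groupby("brand")["model"]
             .agg(lambda s: s.value_counts().idxmax()))
print(top_model)
```

On the real dataset this would return one row per brand, e.g. `golf` for Volkswagen and `3er` for BMW, matching the ranking above.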
autos["odometer_km"].value_counts().sort_index()
5000 782 10000 241 20000 742 30000 760 40000 797 50000 993 60000 1128 70000 1187 80000 1375 90000 1673 100000 2057 125000 4857 150000 30084 Name: odometer_km, dtype: int64
autos["odometer_km"].unique()
array([150000, 70000, 50000, 80000, 10000, 30000, 125000, 90000, 20000, 60000, 5000, 40000, 100000], dtype=int64)
autos.groupby("odometer_km", as_index = False)["price_dollars"].mean().sort_values("price_dollars", ascending =False).round(0)
odometer_km | price_dollars | |
---|---|---|
1 | 10000 | 20551.0 |
2 | 20000 | 18448.0 |
3 | 30000 | 16609.0 |
4 | 40000 | 15500.0 |
5 | 50000 | 13812.0 |
6 | 60000 | 12385.0 |
7 | 70000 | 10927.0 |
8 | 80000 | 9722.0 |
0 | 5000 | 8907.0 |
9 | 90000 | 8465.0 |
10 | 100000 | 8137.0 |
11 | 125000 | 6214.0 |
12 | 150000 | 3768.0 |
We have successfully split out the odometer readings of our cars and their associated mean prices. Taking a closer look at the table, we can conclude that cars with very high odometer readings have lower prices. The exception is cars with a 5000 km reading, which have a somewhat low mean price. This can be attributed to the fact that cars with such low mileage are not popular brands, so buyers don't fancy them.
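To put a number on this downward trend, we can correlate the odometer buckets with their mean prices. The values below are copied from the table above so the snippet runs on its own:

```python
import pandas as pd

# Mean price per odometer bucket, copied from the table above
mileage_price = pd.Series(
    {5000: 8907, 10000: 20551, 20000: 18448, 30000: 16609, 40000: 15500,
     50000: 13812, 60000: 12385, 70000: 10927, 80000: 9722, 90000: 8465,
     100000: 8137, 125000: 6214, 150000: 3768}
)

# Correlation between odometer reading (index) and mean price (values)
corr = pd.Series(mileage_price.index).corr(pd.Series(mileage_price.values))
print(round(corr, 2))
```

The correlation comes out clearly negative, confirming the pattern in the table; the 5000 km bucket is the outlier that weakens it slightly.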
autos["unrepaired_damage"].describe()
count 38374 unique 2 top nein freq 33834 Name: unrepaired_damage, dtype: object
Before proceeding, we see that the column contains values written in German rather than English, so we have to do some data cleaning on this column.
mapping_dict = {
"nein":"No",
"ja":"Yes"
}
autos["unrepaired_damage"]=autos["unrepaired_damage"].map(mapping_dict)
autos["unrepaired_damage"].describe()#We have successfully translated the values.
count 38374 unique 2 top No freq 33834 Name: unrepaired_damage, dtype: object
#damaged = autos.loc[(autos["unrepaired_damage"] == "Yes"), ["price_dollars"]]
#damaged["price_dollars"].describe()
#damaged = autos.loc[(autos["unrepaired_damage"] == "No"), ["price_dollars"]]
#damaged["price_dollars"].describe()
autos.groupby("unrepaired_damage")["price_dollars"].max()#We can also compare the unrepaired column with the price column
#like so.
unrepaired_damage No 350000 Yes 44200 Name: price_dollars, dtype: int32
The results in the cells above are obtained using two different techniques. For the first, we indexed the unrepaired_damage and price_dollars columns, selecting the rows where the former equals "Yes" (and then "No").
For the second, we used the groupby method of the DataFrame object to compare the unrepaired_damage column with the price column and obtain the maximum.
We can conclude from this that cars with damage not yet repaired cost less than those without damage.
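The maximum is sensitive to single expensive listings, so the same groupby can also report the mean for a fuller comparison. The sketch below uses a small hypothetical sample in place of the full dataset:

```python
import pandas as pd

# Hypothetical listings standing in for the full dataset
autos = pd.DataFrame({
    "unrepaired_damage": ["No", "No", "Yes", "Yes", "No"],
    "price_dollars": [9000, 12000, 3000, 4500, 15000],
})

# Aggregate price by damage status; mean and max side by side
summary = autos.groupby("unrepaired_damage")["price_dollars"].agg(["mean", "max"])
print(summary)
```

On the real data, comparing means the same way would show more directly how much of a discount unrepaired damage carries on a typical listing, rather than just at the extremes.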