This is the first guided project in DataQuest. We'll be working with a dataset of used cars from the classifieds section of eBay. The data originally comes from Kaggle, but it has been modified slightly to make it more useful for practicing data cleaning.
Our goal is to clean the data and analyze the included car listings.
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding='Latin-1')  # the file isn't encoded as the default UTF-8
autos.info()
As we can see from the .info() method, a few columns contain null values: vehicleType, gearbox, model, fuelType, and notRepairedDamage. We'll need to inspect these columns to decide how to deal with the nulls. We might also want to convert the price column to a float dtype so it is easier to work with.
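To quantify the missing data before deciding how to handle it, a quick check like this (a minimal sketch using the same autos DataFrame) counts the nulls per column:

null_counts = autos.isnull().sum()
# Show only the columns that actually contain nulls
print(null_counts[null_counts > 0])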
autos.head()
From the .head() method, we can see that there is a lot of data cleaning to do.
autos.columns
new_columns = {'yearOfRegistration': 'registration_year', 'monthOfRegistration': 'registration_month',
               'notRepairedDamage': 'unrepaired_damage', 'dateCreated': 'ad_created', 'dateCrawled': 'date_crawled',
               'offerType': 'offer_type', 'vehicleType': 'vehicle_type', 'powerPS': 'power_ps', 'fuelType': 'fuel_type',
               'nrOfPictures': 'num_pictures', 'postalCode': 'postal_code', 'lastSeen': 'last_seen'}
autos.rename(new_columns,axis=1,inplace=True)
autos.head()
CamelCase to snake_case means lowercasing the words and separating them with underscores. I changed all the columns to snake_case to make them more readable. I'm not sure exactly what date_crawled means, so I left that name as is. One specific change I made was nrOfPictures to num_pictures.
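The renaming could also be automated; a minimal sketch of a camelCase-to-snake_case converter (the regex is my own choice, and it wouldn't reproduce manual renames like num_pictures) might look like this:

import re

def camel_to_snake(name):
    # Insert an underscore before each uppercase letter, then lowercase the result
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

camel_to_snake('yearOfRegistration')  # 'year_of_registration'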
autos.describe(include='all')
Price and odometer are stored as text, so we need to change them to numeric; we can tell because their descriptive statistics come back as all null values.
We can also see that the seller, offer_type, abtest, gearbox, and unrepaired_damage columns each have only two distinct values. We may consider dropping these columns because they won't be useful in our analysis.
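As a sketch of that idea (nothing is actually dropped yet), the near-constant columns could be found and removed like this:

# Columns with at most two distinct non-null values carry little signal
low_variety = [col for col in autos.columns if autos[col].nunique(dropna=True) <= 2]
print(low_variety)
# autos = autos.drop(columns=low_variety)  # uncomment to actually drop them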
# Strip the currency and thousands symbols; regex=False treats '$' as a literal character
autos['price'] = autos['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
autos['odometer'] = autos['odometer'].str.lower().str.replace('km', '', regex=False).str.replace(',', '', regex=False).astype(float)
autos.rename({'odometer':'odometer_km'}, axis=1,inplace=True)
autos['price'].unique().shape
# there are 2,357 unique price values
print('Max: {}\nMin: {}'.format(autos['price'].max(), autos['price'].min()))
# What car costs 99,999,999 and 0? This isn't right
autos['price'].describe()
(autos['price'] > 500000).value_counts()
# 14 cars that are more than $500,000
autos['price'].value_counts().sort_index(ascending=False).tail(200)
# Looking at how many occurrences of each price value in descending order.
# Interquartile range (IQR) of price
price_q25, price_q75 = np.percentile(autos['price'].values, [25, 75])
price_iqr = price_q75 - price_q25
# Use 2 * IQR (instead of the usual 1.5) as the cutoff distance from each quartile
price_cutoff = price_iqr * 2
price_upper, price_lower = price_q75 + price_cutoff, price_q25 - price_cutoff
price_outliers = [x for x in autos['price'] if x < price_lower or x > price_upper]
len(price_outliers)
autos.info()
# .copy() so later assignments modify new_autos itself rather than a view of autos
new_autos = autos[autos['price'].between(price_lower, price_upper)].copy()
new_autos.info()
# 2644 outliers removed
To remove the outliers, I used the interquartile range method because it relies solely on the data itself. I used a multiplier of 2 for the cutoff because some of the lower flagged values were reasonable prices and I wanted to keep the majority of them.
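The conventional cutoff is 1.5 * IQR; a quick sketch reusing the quartiles computed above shows how many rows each multiplier would keep:

for k in (1.5, 2):
    lower, upper = price_q25 - k * price_iqr, price_q75 + k * price_iqr
    kept = autos['price'].between(lower, upper).sum()
    print('multiplier {}: keeps {} of {} rows'.format(k, kept, len(autos)))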
price_outliers.sort()
price_outliers
new_autos.info()
autos[['date_crawled','ad_created','last_seen']].head()
new_autos['date_crawled'].str[:10].value_counts(normalize=True,dropna=False).sort_index()
# normalize=True gives each date's share of the total instead of a raw count
# dropna=False keeps any null values in the counts
# sort_index() orders the dates from earliest to latest
Looking at the distribution of the date_crawled column, we can see that most crawl dates fall between mid-March and very early April.
new_autos['ad_created'].str[:10].value_counts(normalize=True,dropna=False).sort_index()
Similar to the date_crawled distribution, most of the ads were created between mid-March and very early April. One thing to note is that some ads were created back in 2015, while date_crawled has no 2015 dates at all. This may be something we want to explore later.
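As a first look at that discrepancy (a minimal sketch, not a full analysis), we can pull out the ads whose creation year predates the crawl dates:

# Ads created in 2015, earlier than any date_crawled value we observed
old_ads = new_autos[new_autos['ad_created'].str[:4] == '2015']
print(len(old_ads))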
new_autos['last_seen'].str[:10].value_counts(normalize=True,dropna=False).sort_index()
The distribution of the last_seen column is as expected: newer listings are seen more than older ones, so most of the occurrences lie in early April.
new_autos['registration_year'].describe()
Using the describe method, we can see some clearly unusual values, such as a max of 9999 and a min of 1000. Since this column tells us the year the car was first registered, these values (and others like them) should be removed.
new_autos[new_autos['registration_year'] < 1900]
# 6 cars that have a registration_year under 1900
(new_autos['registration_year']>2018).value_counts()
Looking at these specific years, keeping values between 1900 and 2018 seems most reasonable. It's common to see listings for cars registered a year or two into the future (e.g. next year's models), so I decided to keep a few of those rows.
new_autos = new_autos[new_autos['registration_year'].between(1900,2018)]
new_autos['registration_year'].value_counts(normalize=True).head(20)
The distribution of the remaining registration years is concentrated in the early 2000s, with smaller shares in the 2010s and 1990s.
new_autos['brand'].unique()
selected_brands = new_autos['brand'].value_counts(normalize=True).head(10).index
new_autos['brand'].value_counts(normalize=True).head(10)
# Let's explore the top 10 brands based on their percentages
I have chosen to aggregate the top 10 brands based on their respective share of the dataset. Most of them are German manufacturers, so I think it would be interesting to look for insights within their listings.
brand_dict = {}
# Compute the mean listing price for each of the top 10 brands
for x in selected_brands:
    rows = new_autos[new_autos['brand'] == x]
    mean = rows['price'].mean()
    brand_dict[x] = mean
sorted_dict = dict(sorted(brand_dict.items(),key=lambda x:x[1],reverse=True))
# the key parameter allows us to sort by values instead of keys
# reverse makes it in descending order
sorted_dict
Among the selected brands, the luxury brands have the highest mean prices, as expected, while the lesser-known brands are much cheaper (to my knowledge). The means range from about 2,245 to 6,090.
So let's make a single dataframe with mean mileage and mean price.
price_ser = pd.Series(sorted_dict)  # the index defaults to the dict keys
price_df = pd.DataFrame(price_ser, columns=['mean_price'])
top_brands = list(sorted_dict.keys())  # the same 10 brands, ordered by mean price
mile_dict = {}
# Compute the mean odometer reading for each brand
for x in top_brands:
    selected = new_autos[new_autos['brand'] == x]
    mean = selected['odometer_km'].mean()
    mile_dict[x] = mean
mile_dict
# Align mean mileage to price_df by brand (index) rather than relying on dict order
price_df['mean_mileage'] = pd.Series(mile_dict)
price_df.round(2)
With both the mean price and mean mileage in one table, we don't see any obvious correlation between the two.
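To put a number on that impression (a quick check, and with only ten brands it is indicative at best), the Pearson correlation between the two columns can be computed directly:

# Pearson correlation across the top 10 brands; values near 0 suggest no linear relationship
print(price_df['mean_price'].corr(price_df['mean_mileage']))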
new_autos
# Make the name column readable; assign the result back (str.replace doesn't modify in place)
new_autos['name'] = new_autos['name'].str.replace('_', ' ', regex=False)
new_autos['seller'].unique() #privat -> private, gewerblich -> commercial
new_autos['seller'] = new_autos['seller'].str.replace('privat','private')
new_autos['seller'] = new_autos['seller'].str.replace('gewerblich','commercial')
new_autos['offer_type'].unique()
# angebot = offer, gesuch = request
new_autos['offer_type'] = new_autos['offer_type'].str.replace('Angebot','offer').str.replace('Gesuch','request')
new_autos['vehicle_type'].unique()
#kleinwagen - supermini (small car)
#kombi - station wagon (estate)
#cabrio - convertible
#andere - other
new_autos['vehicle_type'].value_counts(dropna=False)
new_autos.loc[new_autos['vehicle_type']=='kleinwagen','vehicle_type'] = 'supermini'
new_autos.loc[new_autos['vehicle_type']=='kombi','vehicle_type'] = 'station wagon'
new_autos.loc[new_autos['vehicle_type']=='cabrio','vehicle_type'] = 'convertible'
new_autos.loc[new_autos['vehicle_type']=='andere','vehicle_type'] = 'other'
new_autos['gearbox'].unique()
new_autos.loc[new_autos['gearbox']=='manuell','gearbox'] = 'manual'
new_autos.loc[new_autos['gearbox']=='automatik','gearbox'] = 'auto'
new_autos['fuel_type'].unique()
#benzin = petrol
#lpg = liquefied petroleum gas
#cng = compressed natural gas
#elektro = electric
#andere = other
new_autos.loc[new_autos['fuel_type']=='benzin','fuel_type'] = 'petrol'
new_autos.loc[new_autos['fuel_type']=='elektro','fuel_type'] = 'electric'
new_autos.loc[new_autos['fuel_type']=='andere','fuel_type'] = 'other'
new_autos['unrepaired_damage'].unique()
#nein = no
#ja = yes
new_autos.loc[new_autos['unrepaired_damage']=='nein','unrepaired_damage'] = 'no'
new_autos.loc[new_autos['unrepaired_damage']=='ja','unrepaired_damage'] = 'yes'
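All of the word-by-word translations above follow the same pattern, so they could equally have been written once with a mapping dictionary and Series.replace; a sketch of that alternative (covering the same values translated above):

translations = {
    'privat': 'private', 'gewerblich': 'commercial',
    'Angebot': 'offer', 'Gesuch': 'request',
    'kleinwagen': 'supermini', 'kombi': 'station wagon',
    'cabrio': 'convertible', 'andere': 'other',
    'manuell': 'manual', 'automatik': 'auto',
    'benzin': 'petrol', 'elektro': 'electric',
    'nein': 'no', 'ja': 'yes',
}
for col in ['seller', 'offer_type', 'vehicle_type', 'gearbox', 'fuel_type', 'unrepaired_damage']:
    # replace() maps exact matches and leaves NaN and unmapped values untouched
    new_autos[col] = new_autos[col].replace(translations)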
# Integer YYYYMMDD version of the crawl date (e.g. '2016-03-26 ...' -> 20160326)
new_autos['date'] = new_autos['date_crawled'].str.split(' ', expand=True)[0].str.replace('-', '', regex=False).astype(int)
new_autos