Our personal goal is to familiarize ourselves with various data cleaning tasks and workflows using numpy and pandas.
As for the project itself, we will be trying to clean the data set and do some preliminary analysis. The data we will be using for this project are ads for used cars from the classifieds section of the German eBay website.
The data dictionary for our data is as follows:
Variable Name | Description |
---|---|
dateCrawled | when this ad was first crawled (all field values are taken from this date) |
name | name of the car |
seller | whether the seller is private or a dealer |
offerType | the type of listing |
price | the price on the ad to sell the car |
abtest | whether the listing is included in an A/B test |
vehicleType | the vehicle type |
yearOfRegistration | the year in which the car was first registered |
gearbox | the transmission type |
powerPS | the power of the car in PS |
model | the car model name |
kilometer | how many kilometers the car has driven |
monthOfRegistration | the month in which the car was first registered |
fuelType | what type of fuel the car uses |
brand | the brand of the car |
notRepairedDamage | if the car has damage which is not yet repaired |
dateCreated | the date on which the eBay listing was created |
nrOfPictures | the number of pictures in the ad |
postalCode | the postal code for the location of the vehicle |
lastSeenOnline | when the crawler saw this ad last online |
We will start by importing the pandas and numpy libraries, as well as loading the data set.
# importing libraries
import pandas as pd
import numpy as np
# loading the file
autos = pd.read_csv('autos.csv', encoding='Latin-1')
# Checking what our data set looks like
# This cell also shows some features of Jupyter Notebook in displaying dataframes.
autos
Let's also have a quick look at the shape and data types in our data set using the info() and head() methods.
autos.info()
autos.head(3)
Since the column names are currently in camelCase, let's rename them to follow the snake_case convention. Dataquest also suggests renaming some specific columns, such as notRepairedDamage to unrepaired_damage, among others.
We'll do this manually since we're not yet experts in regular expressions and advanced string manipulation.
# Checking the column names
print(autos.columns)
# Renaming the columns
new_columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price',
'ab_test', 'vehicle_type', 'registration_year',
'gear_box', 'power_ps', 'model', 'odometer',
'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_of_pictures',
'postal_code', 'last_seen'
]
autos.columns = new_columns
# Checking the new column names by printing the first few rows of our dataframe
autos.head(2)
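As an aside, a regex-based rename is possible once we're more comfortable with regular expressions. Below is a minimal sketch of a hypothetical helper that converts camelCase to snake_case; note it would not reproduce the custom names we chose above (such as odometer or ad_created).
# hypothetical helper, not used in this project
import re

def to_snake_case(name):
    # insert an underscore before each capital letter (except at the start), then lowercase
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

# example: to_snake_case('yearOfRegistration') returns 'year_of_registration'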
Let's look at the characteristics of our data set so we can start figuring out what data cleaning tasks we need to do. We use the describe() method and specify the parameter include='all' so that non-numeric columns are included.
# Checking column descriptions/summaries
autos.describe(include='all')
Some quick observations here:
- almost all values in the seller column are tagged as 'privat'
- almost all values in the offer_type column are tagged as 'Angebot'
- all values in the nr_of_pictures column appear to be zero (we can check this by using the value_counts() method later)

We'll consider dropping these three columns, but let's check the nr_of_pictures column more closely first.
# Checking value counts of the three columns being considered for deletion.
# We're most interested in the the nr_of_pictures column
print(autos['nr_of_pictures'].value_counts())
print('\n')
print(autos['seller'].value_counts())
print('\n')
print(autos['offer_type'].value_counts())
The previous cell confirms that all values in the nr_of_pictures column are zero. Also, there isn't really any significant variation in the values contained in the seller and offer_type columns, with only one observation in each having a different value. We'll drop those two as well.
# Dropping the three columns
autos.drop(['seller', 'offer_type', 'nr_of_pictures'], axis=1, inplace=True)
# Checking if they are really dropped
print(autos.info())
Some other observations from checking the info of our dataframe:
- date_crawled, odometer, price, ad_created, and last_seen are stored as strings (we may want to convert these to dates and integers, as appropriate)
- registration_year and registration_month are stored as integers (we may want to convert these to datetime format)

Hence, these are our tasks for further cleaning the data:
1. Convert odometer and price to numeric format.
2. Convert the date_crawled, ad_created, last_seen, registration_year, and registration_month columns to datetime format.

Since item no. 1 should be so much easier compared to item no. 2, let's start with that.
# Checking what the values in price and odometer columns look like
print(autos['price'].unique())
print(autos['odometer'].unique())
It appears that for the price column, we need to remove the leading $ sign and the commas. For the odometer column, we need to remove the trailing km indicator and also the commas.
# Converting the two columns to numeric
autos['price'] = (autos['price'].str.replace(',','')
.str.replace('$','')
.astype(int)
)
autos['odometer'] = (autos['odometer'].str.replace(',','')
.str.replace('km','')
.astype(int)
)
# Renaming odometer column
autos.rename({'odometer':'odometer_km'}, axis=1, inplace=True)
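One note on str.replace: depending on the pandas version, the pattern may be interpreted as a regular expression by default, and the $ sign is a regex special character. If that ever causes trouble, passing regex=False (where supported) makes the literal-match intent explicit. A commented-out sketch of the equivalent call:
# equivalent conversion with the literal-match intent made explicit (sketch only)
# autos['price'] = (autos['price'].str.replace('$', '', regex=False)
#                                 .str.replace(',', '', regex=False)
#                                 .astype(int))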
After converting price and odometer to numeric, let's begin exploring these two columns.
Dataquest recommends the following steps:
- Series.unique().shape to see how many unique values there are
- Series.describe() to view min/max/median/mean, etc.
- Series.value_counts() to see frequency distributions

Since those are a lot of steps, let's define a function that will do all those things for us.
# Defining a function to make our examination of the series faster
def explore(series):
    unique_values = series.unique().shape[0]
    print('The series "{s}" has {u} unique values'
          .format(s=series.name, u=unique_values))
    print('\n')
    print('Description')
    print(series.describe())
    print('\n')
    print('Value counts top')
    print(series.value_counts()
          .sort_index(ascending=False)
          .head(10))
    print('\n')
    print('Value counts bottom')
    print(series.value_counts()
          .sort_index(ascending=True)
          .head(10))
# Using the explore function to check price and odometer_km columns
explore(autos['price'])
print('\n')
explore(autos['odometer_km'])
For the price column, we remove observations with zero prices (lower outliers) and also those that cost more than $900,000 (upper outliers).
Note that the cars priced above $900,000 may still be valid data points. However, it is likely that these very high prices are typographical errors or items that are just being advertised but not really for sale. In any case, these extremely high values (and there are only a few of them) might disproportionately affect our observations, so we will drop them for now.
# Removing zero prices and overpriced observations
autos = autos[autos['price'].between(1,900000)]
#autos = autos.drop(autos[(autos['price'] == 0)
# | (autos['price'] > 900000)].index)
explore(autos['price'])
After dropping the outliers, the number of unique prices has dropped from 2,357 to 2,346. Our total number of observations also dropped from 50,000 to 48,565. We only lost a small number of observations, so this is acceptable.
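As a quick sanity check on those numbers, we can print the remaining row count directly:
# sanity check: rows remaining after the price filter
print(autos.shape[0])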
For the odometer_km column, there are no obvious outliers since the minimum and maximum values appear reasonable enough. We won't be dropping anything for this column.
Recall that we have some columns that give us datetime-related information. These are the date_crawled, registration_year, registration_month, ad_created, and last_seen columns. Some of these columns are stored as strings. Let's check these columns again, clean them as necessary, and try out some preliminary analytical tasks.
date_columns = ['date_crawled', 'registration_year', 'registration_month', 'ad_created', 'last_seen']
print(autos[date_columns].info())
autos[date_columns][:4]
The columns registration_year and registration_month are in int64 format. For the purpose of our analysis, there is no need to convert these to datetime format. We may consider, however, combining the two into a single variable that contains month and year information if we want to study trends over several years.
For the other time-related columns, let's convert them to datetime format. While we're not interested in the exact time values contained in the columns, there may be a use for those in later analysis. What we'll do, instead, is create a new dataframe with columns that are of datetime format but contain only date (and not time) information. We create a new dataframe because adding new columns to our autos dataframe appears to give us a SettingWithCopyWarning.
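As an aside, that warning typically appears because autos is itself a filtered slice of the original dataframe. A common alternative, shown here only as a commented-out sketch and not the approach we take below, is to take an explicit copy before adding columns:
# sketch of an alternative approach: copy first, then add columns directly
# autos = autos.copy()
# autos['date_crawled_dt'] = pd.to_datetime(autos['date_crawled'].str[:10])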
# importing datetime library
import datetime as dt
# creating function for conversion
date_format = '%Y-%m-%d'
def convert_dt(series):
    # keep only the date part (the first 10 characters, 'YYYY-MM-DD')
    date = series.str[:10]
    date_dt = pd.to_datetime(date, format=date_format)
    return date_dt
# Creating an empty dataframe
autos_copy = pd.DataFrame()
# Adding columns to our new dataframe
autos_copy['date_crawled_dt'] = convert_dt(autos['date_crawled'])
autos_copy['ad_created_dt'] = convert_dt(autos['ad_created'])
autos_copy['last_seen_dt'] = convert_dt(autos['last_seen'])
Now we check the distributions of the three date columns, sorted by earliest date to latest date.
# Date crawled
print('date_crawled')
print(autos_copy['date_crawled_dt'].
value_counts(normalize=True, dropna=False).
sort_index()
)
print('\n')
# Ad created
print('ad_created')
print(autos_copy['ad_created_dt'].
value_counts(normalize=True, dropna=False).
sort_index()
)
print('\n')
# Last seen
print('last_seen')
print(autos_copy['last_seen_dt'].
value_counts(normalize=True, dropna=False).
sort_index()
)
In terms of frequency distribution, there is no pattern that really pops out. Looking at the date_crawled_dt column, we learn that the crawler collected this data over a span of around a month in March/April of 2016.
The oldest ad seen by the crawler, as shown by the ad_created_dt column, was from June 2015, while the latest one (April 4, 2016) coincides with the last day the crawler collected data.
Not much can be said about the last_seen column, other than that over 40% of the ads were still seen by the crawler during its last three days (April 5-7) of crawling.
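If we want to verify that figure, one quick check (a sketch) is to sum the normalized counts for the last three crawl dates:
# rough check: share of ads last seen on the final three crawl dates
last_three = (autos_copy['last_seen_dt']
              .value_counts(normalize=True)
              .sort_index()
              .tail(3))
print(last_three.sum())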
It's now time to look at registration_year and registration_month. Again, we have not created a new series that combines these two. Let's explore them separately for now.
# Checking the registration year and month information
print(autos[['registration_year', 'registration_month']].describe())
print('\n')
# Checking frequency distributions
print(autos['registration_year'].value_counts(dropna=False).head(5))
print(autos['registration_year'].value_counts(dropna=False).tail(5))
print('\n')
# Checking frequency distribution of registration month
print(autos['registration_month'].value_counts(dropna=False))
Discussion of the observations from the above results is in the next section.
Upon checking the results from using the describe method for registration_year, we see some outliers that don't make sense. The minimum is 1,000 while the maximum is 9,999. The lower outlier doesn't make sense because the year is far too early, while the higher outliers are years that haven't occurred yet.
We may want to remove these using the following criteria:
- drop registration years earlier than 1900 (lower bound)
- drop registration years later than 2016 (upper bound)

The reason for the lower-bound criterion is that mass production of vehicles only really started around the 1900s. The reason for the upper-bound criterion is that the crawler couldn't possibly have detected vehicles that were registered in the future (the crawler's last day in our data set is in April 2016).
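Before dropping anything, we can check how many rows fall outside that range to gauge the impact (a small sketch):
# counting rows with registration years outside the 1900-2016 range
outside_range = ~autos['registration_year'].between(1900, 2016)
print(outside_range.sum())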
As for registration_month, we see that March is the most frequent month of registration for the vehicles in our data set. However, we do see that there are vehicles identified as being registered in month zero. We have several options to deal with this: (1) drop the rows where the registration month is zero, (2) convert the zero months to 1 (January), or (3) convert the zero months to missing (NaN).
In doing the steps above, we need to clean the year data first, because it's likely that the zero-month rows coincide with the outlier years where the data is questionable. For the rows that remain after dropping the outlier years, we will then convert the zero months to missing.
# Removing outlier registration years
autos = autos[autos['registration_year'].between(1900,2016)]
# Checking data and months
print('registration year')
print(autos['registration_year'].describe())
print('\n')
print('registration month')
print(autos['registration_month'].value_counts())
We dropped around 2,000 observations after removing the outliers. The registration years look okay now, with the earliest observation tagged as 1910.
Upon checking the registration month, there are still months listed as zero. We'll assume that these months are 'unknown' and just tag them as missing. This way our frequency distributions for registration months can still be analyzed, while the zeros are removed so they don't affect any numerical computations or aggregations.
Note that if we want to create a new datetime column, we may want to go with the January conversion instead, although this will inflate the numbers for January registrations, hence making any trend analysis tenuous.
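If we ever do want that combined year/month column, a minimal sketch (not used in this project, and assuming the January fallback for zero or missing months) could look like this:
# sketch: assemble a registration date from year and month,
# falling back to January where the month is zero or missing
month_filled = autos['registration_month'].replace(0, 1).fillna(1).astype(int)
reg_parts = pd.DataFrame({'year': autos['registration_year'],
                          'month': month_filled,
                          'day': 1})
registration_date = pd.to_datetime(reg_parts)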
# replacing zero months with NaN (missing)
zero_month = (autos['registration_month'] == 0)
autos.loc[zero_month, 'registration_month'] = np.nan
# re-checking the registration month column
print(autos['registration_month'].value_counts())
Let's now check how price patterns vary across brands. Let's start by looking at the unique brands in our data set.
print('Unique brands: ', len(autos['brand'].unique()))
autos['brand'].value_counts(normalize=True)
We have over 40 brands of automobiles in our data set, with Volkswagen being the most common brand. That's a lot of brands, so we'll perform some aggregation for all of them, since it is fairly simple to do using loops, but we'll only delve deeper into the most common brands (those that account for at least 5% of listings).
# Creating a dictionary of the average price per brand
mean_price_brand = {}
brands = autos['brand'].unique()
for b in brands:
    brand = autos[autos['brand'] == b]
    mean_price = brand['price'].mean()
    mean_price_brand[b] = int(mean_price)
mean_price_brand
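For reference, pandas can do the same per-brand aggregation without an explicit loop; a one-line groupby sketch that should give equivalent results:
# sketch: the same mean price per brand via groupby (truncated to int like the loop above)
mean_price_by_brand = autos.groupby('brand')['price'].mean().astype(int)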
It turns out that analyzing the dictionary with all the brands is quite difficult for us at our current skill level. Let's do the aggregation again, this time only for the brands that account for at least 5% of the total number of observations.
# Getting the top brands
brand_share = autos['brand'].value_counts(normalize=True)
top_brands = brand_share[brand_share > 0.05].index
mean_price_top_brands = {}
for b in top_brands:
    brand = autos[autos['brand'] == b]
    mean_price = brand['price'].mean()
    mean_price_top_brands[b] = int(mean_price)
mean_price_top_brands
We can convert dictionaries to series and then to dataframes using some pandas functionality, which will help us analyze the aggregated data much more easily.
# Converting a couple of dictionaries we've already created
mean_price_all = pd.Series(mean_price_brand)
mean_price_top = pd.Series(mean_price_top_brands)
# Converting the series into dataframes
mean_price_all_df = pd.DataFrame(mean_price_all, columns=['mean_price'])
mean_price_top_df = pd.DataFrame(mean_price_top, columns=['mean_price'])
mean_price_all_df.sort_values('mean_price', ascending=False)
mean_price_top_df.sort_values('mean_price', ascending=False)
When accounting for all the brands, we see that the most expensive brand, on the average, is Porsche with Land Rover a distant second. The cheapest brand, on the average, is Daewoo.
When we consider only the top brands (those which account for >5% of total observations), Audi cars are the most expensive with a mean price of 9,366 while Opel is the cheapest with a mean price of 2,975.
We see from the top brands that Audi, Mercedes Benz, and BMW are relatively more expensive (average prices greater than 8,000) while Ford and Opel are very cheap by comparison (average prices less than 4,000).
It may be the brands themselves creating this price differential. However, it's also possible that the price differentials are the result of different average mileage or age of the vehicles. It may be the case that certain brands have, on the average, more mileage or are older, hence the lower average price.
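One rough way to probe this idea before aggregating (a sketch, not part of the guided steps) is to look at the pairwise correlations between price, mileage, and registration year:
# rough check: pairwise correlations between price, mileage, and registration year
print(autos[['price', 'odometer_km', 'registration_year']].corr())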
Let's check whether our guess is correct by aggregating mileage (odometer_km) and registration years (registration_year) by brand, then appending them to a single dataframe along with the mean prices we've already generated. This time, we limit our analysis to the top brands.
# Creating a copy dataframe for our aggregations
aggregate_df = mean_price_top_df.copy()
# For mileage
mean_mileage_brands = {}
for b in top_brands:
    brand = autos[autos['brand'] == b]
    mean_mileage = brand['odometer_km'].mean()
    mean_mileage_brands[b] = int(mean_mileage)
mean_mileage_series = pd.Series(mean_mileage_brands)
aggregate_df['mean_mileage'] = mean_mileage_series
# For registration year
mean_reg_year_brands = {}
for b in top_brands:
    brand = autos[autos['brand'] == b]
    mean_reg_year = brand['registration_year'].mean()
    mean_reg_year_brands[b] = int(mean_reg_year)
mean_reg_year_series = pd.Series(mean_reg_year_brands)
aggregate_df['mean_reg_year'] = mean_reg_year_series
# Displaying our new dataframe
aggregate_df.sort_values('mean_price', ascending=False)
From the table above, we see that our initial guess as to the relationship between price and mileage for the top brands may be incorrect. In fact, Audi cars have more mileage (129,157 km), on the average, than Ford cars (124,266 km), and only slightly less than the much cheaper Opel cars (129,310 km).
However, we also see that the Audis may be newer cars, with an average registration year of 2004 compared to Ford and Opel's 2002. We cannot make any definitive conclusions for now, especially since Mercedes Benz cars are, on the average, just as old as the cheaper brands but are still more expensive. We could expand this analysis to include all the brands in our data set, but we'll settle for this for now.
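If we do extend the comparison to every brand later, a named-aggregation sketch (groupby/agg, available in newer pandas versions) could build the whole table in one step:
# sketch: the same aggregate table for all brands via groupby named aggregation
agg_all_brands = (autos.groupby('brand')
                       .agg(mean_price=('price', 'mean'),
                            mean_mileage=('odometer_km', 'mean'),
                            mean_reg_year=('registration_year', 'mean'))
                       .sort_values('mean_price', ascending=False))
agg_all_brands.head(10)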
For this project, we were able to practice a lot of skills:
- renaming and reorganizing dataframe columns
- converting string columns to numeric types
- identifying and removing outliers
- working with date and datetime information
- aggregating data across categories using loops, dictionaries, series, and dataframes
The skills we practiced for this project will prove to be useful in future data science tasks.