The dataset was originally scraped and uploaded to Kaggle. We've made a few modifications from the original dataset that was uploaded to Kaggle:
We sampled 50,000 data points from the full dataset, to ensure your code runs quickly in our hosted environment We dirtied the dataset a bit to more closely resemble what you would expect from a scraped dataset (the version uploaded to Kaggle was cleaned to be easier to work with) The data dictionary provided with data is as follows:
dateCrawled - When this ad was first crawled. All field-values are taken from this date. name - Name of the car. seller - Whether the seller is private or a dealer. offerType - The type of listing price - The price on the ad to sell the car. abtest - Whether the listing is included in an A/B test. vehicleType - The vehicle Type. yearOfRegistration - The year in which the car was first registered. gearbox - The transmission type. powerPS - The power of the car in PS. model - The car model name. kilometer - How many kilometers the car has driven. monthOfRegistration - The month in which the car was first registered. fuelType - What type of fuel the car uses. brand - The brand of the car. notRepairedDamage - If the car has a damage which is not yet repaired. dateCreated - The date on which the eBay listing was created. nrOfPictures - The number of pictures in the ad. postalCode - The postal code for the location of the vehicle. lastSeenOnline - When the crawler saw this ad last online. The aim of this project is to clean the data and analyze the included used car listings. You'll also become familiar with some of the unique benefits jupyter notebook provides for pandas.
Let's start by importing the libraries we need and reading the dataset into pandas.
import pandas as pd
import numpy as np
autos = pd.read_csv('autos.csv', encoding='Windows-1252')
autos.info()
autos.head()
autos.info()
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer_km', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'num_photos', 'postal_code',
'last_seen']
autos.head()
autos.describe(include='all')
autos["num_photos"].value_counts()
autos = autos.drop(["num_photos", "seller" , "offer_type"], axis=1)
autos["price"] = (autos["price"]
.str.replace("$","")
.str.replace(",","")
.astype(int)
)
autos["price"].head()
autos["odometer_km"] = (autos["odometer_km"]
.str.replace("km","")
.str.replace(",","")
.astype(float)
)
autos["odometer_km"].value_counts()
print(autos["price"].unique().shape)
print(autos["price"].describe())
autos["price"].value_counts()
autos["price"].value_counts().sort_index(ascending=False).head(20)
autos["price"].value_counts().sort_index(ascending=True).head(20)
autos = autos[autos["price"].between(1,350001)]
autos["price"].describe()
autos.info()
autos[['date_crawled','ad_created','last_seen']][0:5]
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=True)
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=True)
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index(ascending=True)
#Dealing with Incorrect Registration Year Data
autos = autos[autos['registration_year'].between(1901, 2020)]
autos.describe()
autos["brand"].unique().shape
brands = autos['brand'].value_counts(normalize=True).sort_values(ascending=False)[0:20].index
brands
price_by_brand = {}
for brand in brands:
price_mean = autos.loc[autos['brand'] == brand, 'price'].mean()
price_by_brand[brand] = price_mean
price_by_brand
# Audi, BMW and Mercedes Benz are more expensive
# Ford and Opel are less expensive
# Volkswagen is in between
bmp_series = pd.Series(price_by_brand)
print(bmp_series)
df = pd.DataFrame(bmp_series, columns=['mean_price'])
df
#storing aggregate data in a dataframe
mileage_by_brand = {}
for brand in brands:
mileage_mean = autos.loc[autos['brand'] == brand, 'odometer_km'].mean()
mileage_by_brand[brand] = mileage_mean
mileage_by_brand
mbb_series = pd.Series(mileage_by_brand)
d = {'mean_price' : bmp_series, 'mean_mileage' : mbb_series}
dff = pd.DataFrame(data=d)
dff