This project analyzes used-vehicle listings from eBay Kleinanzeigen, the classifieds section of eBay Germany. The dataset was originally scraped (March 5, 2016 through April 7, 2016) and uploaded to Kaggle. The data contains several variables describing the vehicle in each ad (e.g., model, fuel type, kilometers driven) and variables about the listing itself (e.g., type of seller, type of listing, number of pictures in the ad). Here is the data dictionary associated with the data:
dateCrawled
- When this ad was first crawled. All field values are taken from this date.
name
- Name of the car.
seller
- Whether the seller is private or a dealer.
offerType
- The type of listing.
price
- The price on the ad to sell the car.
abtest
- Whether the listing is included in an A/B test.
vehicleType
- The vehicle type.
yearOfRegistration
- The year in which the car was first registered.
gearbox
- The transmission type.
powerPS
- The power of the car in PS.
model
- The car model name.
kilometer
- How many kilometers the car has driven.
monthOfRegistration
- The month in which the car was first registered.
fuelType
- What type of fuel the car uses.
brand
- The brand of the car.
notRepairedDamage
- Whether the car has damage which is not yet repaired.
dateCreated
- The date on which the eBay listing was created.
nrOfPictures
- The number of pictures in the ad.
postalCode
- The postal code for the location of the vehicle.
lastSeenOnline
- When the crawler last saw this ad online.
The objective of this project is to conduct exploratory analysis to understand what types of used-vehicle ads appear on the website and, in doing so, to identify which vehicles are potentially worth purchasing.
The dataset is a randomly selected sub-sample of 50,000 records drawn from the full dump of roughly 370,000 listings. This project will clean the data, explore the listings, and identify candidate vehicles.
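For reference, a minimal sketch of how such a sub-sample could have been drawn (the full-dump file name here is an assumption):
import pandas as pd
# Hypothetical sketch: draw a reproducible 50,000-record sample from the full dump
full = pd.read_csv("autos_full.csv", encoding = "Latin-1") # assumed file name
sample = full.sample(n = 50000, random_state = 1) # random_state makes the draw reproducible
sample.to_csv("autos.csv", index = False)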
# Import packages
import pandas as pd
import numpy as np
# Load dataset - used Latin-1 encoding to avoid errors
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
autos.shape # 50k rows, with 20 columns
autos.info()
print("-----")
print(autos.isnull().sum())
print("-----")
autos.head()
Note that there are missing values for some variables (e.g., vehicleType, gearbox, model, fuelType, notRepairedDamage). In addition, some numeric columns contain embedded characters (e.g., "$" and "," in price, "km" in odometer). Lastly, some variables contain German words, which makes sense given that this data was scraped from a German website.
We can also see that some variables have incorrect data types. For example, dateCrawled, dateCreated, and lastSeen are stored as object (string) columns but should be datetime. Similarly, price and odometer can become integer columns once the embedded characters are removed.
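As a quick sketch of what that conversion looks like (the real conversions are applied after cleaning, further below), pd.to_datetime parses an object (string) column into datetime64[ns]:
# Display-only sketch: parse the raw string column into datetimes
pd.to_datetime(autos["dateCrawled"]).head()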
# Checked for duplicate rows. There should be no duplicates given each row represents one unique ad.
duplicate_bool = autos.duplicated()
autos[duplicate_bool]
# Renamed columns to make them easier to work with
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
                 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
                 'odometer(km)', 'registration_months', 'fuel_type', 'brand',
                 'not_repaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
                 'last_seen']
# Dropped the nr_of_pictures column, as the author of the web scraper noted a bug that prevented this data from being captured.
autos.drop(["nr_of_pictures"], axis = 1, inplace = True)
# Created a mapping and helper function to translate German values to English.
mappings = {"privat": "private", "gewerblich": "commercial", "Angebot": "offer", "Gesuch": "request",
            "kleinwagen": "small car", # matches the "small car" label used when filling vehicle_type below
            "kombi": "station wagon", "cabrio": "convertible", "limousine": "sedan", "andere": "other",
            "manuell": "manual", "automatik": "automatic", "benzin": "gas", "elektro": "electric",
            "sonstige_auto": "other", "sonstige_autos": "other", "nein": "no", "ja": "yes"}
def super_translate(string):
    # Return the English translation if one exists; otherwise return the value unchanged.
    if string in mappings:
        return mappings[string]
    return string
columns_change = ["seller", "offer_type", "vehicle_type", "gearbox", "model", "fuel_type", "brand", "not_repaired_damage"]
autos[columns_change] = autos[columns_change].applymap(super_translate)
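As a design note, the same translation can be done without a helper function: DataFrame.replace with a dict leaves unmapped values untouched, matching super_translate. A minimal sketch:
# Equivalent vectorized alternative to applymap(super_translate)
autos[columns_change] = autos[columns_change].replace(mappings)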
# Removed embedded characters from columns of interest (regex=False so "$" is treated literally)
autos["price"] = autos["price"].str.replace("$", "", regex = False).str.replace(",", "", regex = False)
autos["odometer(km)"] = autos["odometer(km)"].str.replace("km", "", regex = False).str.replace(",", "", regex = False)
# Changed dtypes of relevant columns
autos = (autos.astype({"price": "int64", "odometer(km)": "int64", "date_crawled": "datetime64[ns]",
"ad_created": "datetime64[ns]", "last_seen": "datetime64[ns]"}))
# Note: timestamp portion of datetime is lost for ad_created when casting correct dtype
autos["odometer(km)"].value_counts(normalize = True).sort_index(ascending = False)*100
autos["price"].value_counts().sort_index()
There are a couple of observations. Given that eBay is an auction website, the very low values in the price column are plausible. However, several of the entries above 350,000 appear to be illegitimate (e.g., 1234566, 12345678, 1111111, 99999999), and it is unlikely that such prices reflect real vehicle values on a used-vehicle classifieds site. Therefore, anything below $1 or above $350,000 will be removed, as these are likely errors or illegitimate entries.
autos = autos[(autos.loc[:, "price"] > 0) & (autos.loc[:, "price"] <= 350000)]
len(autos)
# Date columns
print(autos["date_crawled"].describe())
print("-------")
print(autos["registration_year"].value_counts(ascending = False).sort_index())
There are a couple of observations here as well: registration years before 1900 predate the automobile era, and years after 2016 are impossible given that the data was crawled in 2016. Therefore, anything below 1900 or beyond 2016 will be removed.
autos = autos[(autos.loc[:, "registration_year"] >=1900) & (autos.loc[:, "registration_year"] <= 2016)]
len(autos)
# Exploring null values
autos.isnull().sum()
autos.vehicle_type.unique()
As seen when translating the German values to English, the site appears to use a preset list of vehicle types. Upon examining the name column, there are keywords that can be used to fill in some of the NaN values in the vehicle_type column.
import warnings
warnings.filterwarnings("ignore", 'This pattern has match groups')
# Used str.contains as a makeshift boolean index, hence disabling the warnings.
# Filled vehicle_type based on keyword matches in name (note: a match overwrites any existing value).
autos.loc[autos["name"].str.contains(r"(.*kombi)", case = False), "vehicle_type"] = "station wagon"
autos.loc[autos["name"].str.contains(r"(.*cabrio)", case = False), "vehicle_type"] = "convertible"
autos.loc[autos["name"].str.contains(r"(.*klein)", case = False), "vehicle_type"] = "small car"
autos.loc[autos["name"].str.contains(r"(.*limo[^n])", case = False), "vehicle_type"] = "sedan"
autos.loc[autos["name"].str.contains(r"(.*coupe)", case = False), "vehicle_type"] = "coupe"
autos.loc[autos["name"].str.contains(r"(.*bus)", case = False), "vehicle_type"] = "bus"
autos.loc[autos["name"].str.contains(r"(.*caravan)", case = False), "vehicle_type"] = "van"
# Some names contain multiple vehicle-type keywords. The "smart" rule is applied last,
# re-assigning Smart models to "small car" to reverse some of these cleaning-induced errors.
autos.loc[autos["name"].str.contains(r"(.*smart)", case = False), "vehicle_type"] = "small car"
# Re-enabled the filtered warnings
warnings.filterwarnings("default", 'This pattern has match groups')
autos.isnull().sum()
len(autos)
The data cleaning so far has also reduced the null counts in the affected columns, most notably vehicle_type after the keyword fills above. We have lost approximately 6.6% of our data to cleaning, leaving 46,681 records, roughly 93% of our original sample of 50,000. If we were instead to drop every record containing a NaN, the loss would grow to about 31%, leaving 34,619 of the 50,000 records.
The "not_repaired_damage" category contains 8,307 (18% of remaining sample of 46,681) blanks. In other words, sellers did not declare whether the listed vehicle has damage that has yet to be repaired. This could suggest that:
From the available data, there is no way to differentiate these categories. In addition, the definition of "damage" is subjective. For example, while a scratch is damage in a strict definition of the word, most people would consider this minor compared to the transmission not working properly. Furthermore, sellers may define what constitutes "damage" differently than a buyer. These blanks will remain as generally people will (hopefully) examine or have a mechanic examine the vehicle before purchasing.
Recall that the objective of this project is to conduct exploratory analysis and identify potential candidate vehicles for purchase. In addition to brand, odometer, and price, vehicle_type and model provide useful information about the vehicles listed on the site. Therefore, all records with a NaN in either the vehicle_type or model column will be removed. This leaves us with 42,387 records, roughly 85% of our original data.
On to the analysis!
autos = autos.dropna(thresh = 2, subset = ["vehicle_type", "model"])
print(len(autos))
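A quick sanity check of the retained share:
# Share of the original 50,000 records retained after the drops above
print(f"{len(autos) / 50000:.1%}") # 42,387 / 50,000 ≈ 84.8%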
# Create summary output table of brand and price and sort by count.
summary_bp_group = autos.groupby(["brand"])["price"]
summary_bp_output = summary_bp_group.agg(["size", "mean", "std"]).sort_values(by = ["size"], ascending = False).reset_index() # sort by count
summary_bp_output["percentage_total"] = summary_bp_output["size"] / len(autos)*100
summary_bp_output = (summary_bp_output.iloc[:, [0,1,4,2,3]].rename(columns= {"size": "count",
"mean": "average_price",
"std":"standard_deviation(price)"})) #re-order and rename columns
#truncate to whole numbers (convert dtypes)
summary_bp_output = summary_bp_output.astype({"percentage_total":"int64", "average_price":"int64", "standard_deviation(price)":"int64"})
summary_bp_output
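As a design note, pandas' named aggregation (available since version 0.25) can build much the same table without the separate rename and reorder steps; a minimal sketch:
# Named aggregation: output column names are declared directly in agg()
summary_alt = (autos.groupby("brand")["price"]
                    .agg(count = "size", average_price = "mean", std_price = "std")
                    .sort_values("count", ascending = False)
                    .reset_index())
summary_alt["percentage_total"] = summary_alt["count"] / len(autos) * 100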
# Import packages for plotting
import matplotlib.pyplot as plt
import seaborn as sns
# Plotting the distribution of price for all brands
fig = plt.figure(figsize=(10,6))
sns.set(style = "white")
sns.kdeplot(autos["price"], legend = False)
sns.despine(bottom = True, left = True)
plt.title("Kernel density estimate plot of price for all brands", fontsize = 20, pad = 30)
plt.xlabel("Price", fontsize = 14, labelpad = 15)
plt.xlim(0, 50000) #capped at 50,000 to improve visibility
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.show()
Unsurprisingly, given this is a used-vehicle site, a high proportion of the vehicles in the sample are listed below $10,000. Given the share of the data represented by the top five brands, let's examine these brands specifically.
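A quick check of that claim:
# Proportion of listings priced under $10,000
print(f'{(autos["price"] < 10000).mean():.1%}')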
# Plot top five brands
summary_bp_output_top5 = summary_bp_output[0:5].sort_values(by = "average_price", ascending = False)
fig = plt.figure(figsize=(10,6))
sns.set(style = "white")
sns.barplot(x="brand", y = "average_price", data= summary_bp_output_top5, saturation= 1, palette = "tab20c")
plt.title("Barplot showing the top 5 brands and their average list price", fontsize = 20, pad = 30)
plt.xlabel("Brand", fontsize = 14, labelpad = 15)
plt.ylabel("Average Price", fontsize = 14, labelpad = 15)
plt.xticks(np.arange(5), ["Audi", "Mercedes-Benz", "BMW", "Volkswagen", "Opel"], fontsize = 12)
plt.yticks(np.arange(0,10500,1000), fontsize = 12)
sns.despine(bottom = True, left = True)
for index, row in summary_bp_output_top5.reset_index().iterrows():
    plt.text(index, row["average_price"] + 50, str(int(row["average_price"])), ha = "center", color = "black")
plt.show()
From the table and graph above, Audi has the highest average list price of the top five brands, followed by Mercedes-Benz and BMW, with Volkswagen and Opel trailing. Another key factor in deciding whether to purchase a vehicle is the number of kilometers on it, so let's examine the odometer variable next.
# Examine the odometer(km) category
autos_odo_grouped = (autos.groupby(["odometer(km)"]).size().reset_index().rename(columns ={0:"count"}).sort_values(by = "odometer(km)",
ascending = False))
autos_odo_grouped["percentage"] = autos_odo_grouped["count"] / len(autos)*100
autos_odo_grouped = autos_odo_grouped.astype({"odometer(km)":"int64","percentage":"int64"}) #truncate values by converting to int type
autos_odo_grouped
# Plotting the distribution of odometer(km) readings for all brands
fig = plt.figure(figsize=(10,6))
sns.set(style = "white")
sns.kdeplot(autos["odometer(km)"], legend = False)
sns.despine(bottom = True, left = True)
plt.title("Kernel density estimate plot of the odometer(km) for all brands", fontsize = 20, pad = 30)
plt.xlabel("Odometer(km)", fontsize = 14, labelpad = 15)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.show()
From the table and KDE plot above, we can see that the distribution of odometer readings across all brands is skewed to the left. Interestingly, a high proportion (approximately 64%) of readings sit at 150,000 km. Given that the average life expectancy of a vehicle is around 200,000 miles (~321,000 km), one possible explanation is that sellers want to recover some value from their vehicle before it reaches the end of its useful life. Another is that sellers become dissatisfied with their vehicle, or start to see it degrade, around 150,000 km and want to upgrade.
It should also be noted that the odometer readings appear to be estimates (the recorded values are all round numbers), which suggests the website forces users to pick from preset ranges rather than enter an exact figure. Let's examine the average odometer readings for the top five brands.
# Generate summary table of price and odometer readings by brand
summary_bo_group = autos.groupby(["brand"])[["price", "odometer(km)"]]
summary_bo_output = (summary_bo_group.agg(["size", "mean", "std"]).sort_values(
    by = [("price", "size")], ascending = False).reset_index().droplevel(0, axis = 1) # sort by size, drop outer column level
)
summary_bo_output.columns = ["brand", "count", "average_price", "standard_deviation(price)",
                             "drop1", "average_odometer(km)", "standard_deviation(odometer)"] # rename columns
summary_bo_output.drop("drop1", axis = 1, inplace = True) # drop the duplicate size column
summary_bo_output = (summary_bo_output.astype({"average_price":"int64", "standard_deviation(price)":"int64",
"average_odometer(km)":"int64", "standard_deviation(odometer)": "int64"}))
# Select the top five brands
summary_bo_output_top5 = summary_bo_output[0:5].sort_values("average_odometer(km)", ascending = False).reset_index().drop("index", axis = 1)
summary_bo_output_top5
# Plot barplot of odometer readings with error bars.
std = np.array(summary_bo_output_top5["standard_deviation(odometer)"])
fig = plt.figure(figsize = (12,6))
sns.barplot(x="brand", y = "average_odometer(km)", data= summary_bo_output_top5, saturation= 1, palette = "tab20c")
sns.set(style = "white")
sns.despine(left = True, bottom = True)
plt.title("Barplot showing the top 5 brands and their average listed odometer readings", fontsize = 20, pad = 30)
plt.xlabel("Brand", fontsize = 14, labelpad = 15)
plt.xticks(np.arange(5), ["BMW", "Mercedes-Benz", "Opel", "Audi", "Volkswagen"], fontsize = 12)
plt.ylabel("Average Odometer Reading (km)", fontsize = 14, labelpad = 15)
plt.ylim(0, 170000)
plt.yticks(np.arange(0, 170000, 15000),fontsize = 12)
plt.errorbar(x= np.array(summary_bo_output_top5["brand"]), y = np.array(summary_bo_output_top5["average_odometer(km)"]),
yerr=std, color = "black", capsize = 10, fmt=" ")
for index, row in summary_bo_output_top5.iterrows():
    label = "{} +/-\n({})".format(int(row["average_odometer(km)"]), int(row["standard_deviation(odometer)"]))
    plt.text(index, row["average_odometer(km)"] - 60000, label, ha = "center", color = "black", fontsize = 12)
plt.text(-1, - 40000 , "Note: Error bars represent standard deviations.", ha = 'left', fontsize = 12)
plt.show()
From the table and barplot above, BMW has the highest average odometer reading of the top five brands and Volkswagen the lowest, though the error bars overlap considerably.
Let's examine the models for each of the top five brands.
autos.model.value_counts(normalize = True).head(10) * 100
From the table above, the Golf makes up the highest proportion of models at just over 8%, followed by the "other" category at approximately 7.5%. Let's examine the models further.
# Examine the model category for all brands
autos_model_grouped = (autos.groupby(["brand", "model"]).size().reset_index().
rename(columns=({0:"model_count"})).sort_values(by = "model_count", ascending = False))
autos_model_grouped["percentage_overall"] = autos_model_grouped["model_count"] / len(autos)*100
# Add percentage of brand columns
list1 = []
for index, row in autos_model_grouped.iterrows():
    brand = row["brand"]
    count = row["model_count"]
    list1.append([index, count / len(autos[autos["brand"] == brand]) * 100])
dataframe = pd.DataFrame(list1).set_index(0)
autos_model_grouped = pd.concat([autos_model_grouped, dataframe.rename(columns=({1:"percentage_brand"}))], axis = 1)
autos_model_grouped = autos_model_grouped.astype({"percentage_overall":"int64", "percentage_brand":"int64"}) # Convert to int to truncate the percentage columns
autos_model_grouped.head(10)
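As a design note, the row-wise loop above can be replaced by a vectorized mapping from each brand to its total listing count; a minimal sketch that produces the same percentage_brand column:
# Vectorized alternative: divide each model count by its brand's total count
brand_totals = autos["brand"].value_counts()
autos_model_grouped["percentage_brand"] = (autos_model_grouped["model_count"]
                                           / autos_model_grouped["brand"].map(brand_totals)
                                           * 100).astype("int64")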
Perhaps unsurprisingly, the top ten models all belong to our top five brands. We can also see that these models account for a large share of each brand's listings; for example, the BMW 3er accounts for 53% of all BMW listings. Given that the top five brands make up 62% of the data, let's examine the top three models of each of these brands.
# Filter dataset - Top three models for each of the top five brands
autos_model_grouped_top3 = pd.DataFrame()
brands = ["audi", "mercedes_benz", "bmw", "volkswagen", "opel"] #brands of interest
for brand in brands:
temp = autos_model_grouped[autos_model_grouped["brand"] == brand].sort_values("model_count", ascending = False)
autos_model_grouped_top3 = pd.concat([autos_model_grouped_top3, temp[0:3]], axis = 0)
autos_model_grouped_top3.groupby(["brand"])["percentage_brand"].sum()
We can see from the table above that the top three models of each of the top five brands account for a high proportion of that brand's listings, with Mercedes-Benz the lowest at 60%.
# Filter dataset - Top four models for each of the top five brands
autos_model_grouped_top4 = pd.DataFrame()
brands = ["audi", "mercedes_benz", "bmw", "volkswagen", "opel"] # brands of interest
for brand in brands:
temp = autos_model_grouped[autos_model_grouped["brand"] == brand].sort_values("model_count", ascending = False)
autos_model_grouped_top4 = pd.concat([autos_model_grouped_top4, temp[0:4]], axis = 0)
autos_model_grouped_top4.groupby(["brand"])["percentage_brand"].sum()
autos_model_grouped_top4
Including the top four models for each of the top five brands further increases the share of listings covered for each brand, with the minimum now close to 70%. However, the Mercedes-Benz and Audi brands now include the "other" category as a model. Given that this category could cover a wide range of vehicles, we will continue our analysis with the top three models for each of the top five brands.
# Extract all records with brand and model of interest
brand_model = autos_model_grouped_top3.iloc[:, 0:2] # keep only the brand and model columns
#obtain raw records from cleaned dataset
autos_bmpo_top5_brand = pd.merge(left = autos[["brand", "model", "price", "odometer(km)", "postal_code"]], right = brand_model, how = "inner")
len(autos_bmpo_top5_brand)
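The inner merge above acts as a semi-join: it keeps only the rows of autos whose (brand, model) pair appears in brand_model. An equivalent boolean-mask formulation, as a sketch:
# Build the set of wanted (brand, model) pairs, then filter by membership
wanted_pairs = set(map(tuple, brand_model.to_numpy()))
mask = autos[["brand", "model"]].apply(tuple, axis = 1).isin(wanted_pairs)
autos_bmpo_alt = autos.loc[mask, ["brand", "model", "price", "odometer(km)", "postal_code"]]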
# Create summary table
autos_bmpo_top5_brand_summary_output = autos_bmpo_top5_brand.groupby(["brand", "model"])[["price", "odometer(km)"]].mean().reset_index()
autos_bmpo_top5_brand_summary_output = autos_bmpo_top5_brand_summary_output.rename(columns = {"price": "average_price",
"odometer(km)":"average_odometer(km)"})
#Rank the brands to order the graphs
order_map = {"bmw": 5, "audi": 4, "mercedes_benz": 3, "volkswagen": 2, "opel": 1}
autos_bmpo_top5_brand_summary_output["order"] = autos_bmpo_top5_brand_summary_output["brand"].map(order_map)
autos_bmpo_top5_brand_summary_output = autos_bmpo_top5_brand_summary_output.sort_values(["order", "average_price"],
ascending = False).reset_index().drop("index", axis = 1)
autos_bmpo_top5_brand_summary_output
# Create barplot of the average price of the top three models for each brand
fig = plt.figure(figsize =(16,7))
ax1 = fig.add_subplot(1, 2, 1)
ax2 = fig.add_subplot(1, 2, 2)
sns.set(style = "white")
sns.despine(left = True, bottom = True)
plt.suptitle("Average price and odometer readings for the top 3 models", fontsize= 22, x = 0.5, y = 1.02)
sns.barplot(x = "model", y = "average_price", hue = "brand", dodge = False, ax = ax1, data = autos_bmpo_top5_brand_summary_output,
saturation = 1, palette = "tab20c")
ax1.set_title("Average price (by brand)", fontsize = 18, pad = 15)
x_tick_labels = autos_bmpo_top5_brand_summary_output["model"].str.replace("_","-").str.capitalize()
# ['1er', '5er', '3er', 'A6', 'A3', 'A4', 'E klasse', 'C klasse', 'A klasse', 'Golf', 'Passat', 'Polo', 'Astra', 'Corsa', 'Vectra']
ax1.set_xlabel("Model", fontsize = 14, labelpad = 15)
ax1.set_xticklabels(x_tick_labels, fontsize = 12, rotation = 90)
ax1.set_ylabel("Average Price", fontsize = 14, labelpad = 15)
ax1.set_yticks(np.arange(0, 13000, 1000))
labels_legend = ["BMW", "Audi", "Mercedes-Benz", "Volkswagen","Opel"]
h, l = ax1.get_legend_handles_labels()
ax1.legend(h,labels_legend,edgecolor = "None", ncol = 5, bbox_to_anchor=(1.8, -0.3), fontsize = 14)
#add price labels
for index, row in autos_bmpo_top5_brand_summary_output.iterrows():
    ax1.text(index, row["average_price"] + 100, str(int(row["average_price"])), ha = "center", color = "black", fontsize = 12)
#Odometer Plot
order_bars = autos_bmpo_top5_brand_summary_output.sort_values("average_odometer(km)", ascending = False).model.to_list()
sns.barplot(x = "average_odometer(km)", y = "model", order = order_bars, hue = "brand", dodge = False, ax = ax2,
data = autos_bmpo_top5_brand_summary_output, saturation = 1, palette = "tab20c")
ax2.set_title("Average odometer readings (ranked by model)", fontsize = 18, pad = 15)
ax2.set_xlabel("Average Odometer Reading", fontsize = 14, labelpad = 20)
for label in ax2.get_xticklabels():
label.set_fontsize(12)
label.set_rotation(90)
ax2.set_xticks(np.arange(0, 150000, 10000))
y_tick_labels = (autos_bmpo_top5_brand_summary_output.sort_values("average_odometer(km)", ascending = False)["model"].str.replace("_","-").str.capitalize())
ax2.set_yticklabels(y_tick_labels, fontsize = 12)
ax2.set_ylabel("")
ax2.legend().remove()
#add odometer labels
for index, row in autos_bmpo_top5_brand_summary_output.sort_values("average_odometer(km)", ascending = False).reset_index().iterrows():
    ax2.text(row["average_odometer(km)"] + 1000, index, str(int(row["average_odometer(km)"])), va = "center", color = "black", fontsize = 12)
plt.subplots_adjust(wspace = 0.145)
plt.show()
Based on the table and graph above, the Opel Corsa, Opel Astra, and Volkswagen Polo appear to be the most affordable of the top models, making them our top candidate vehicles. We will proceed with identifying the best places to look for these candidates.
It might be of interest to potential buyers where to travel to view the candidate vehicles. Given that a person might view several vehicles in a day before making a decision, let's examine which locations have the most ads for our candidate vehicles. The following sources were used to create the map: a Wikipedia table of German postal code regions (cleaned and saved as region_map2.csv) and a GeoJSON file of Germany's two-digit postal code areas (germany_2.geojson).
# Import packages
import geopandas
import folium
# Create postal code dataframe containing only the top 3 models
postal_codes = autos_bmpo_top5_brand.loc[:, ["postal_code", "model"]].reset_index().drop("index", axis = 1)
postal_codes = postal_codes[postal_codes.model.str.contains("corsa|astra|polo")] # keep only the three candidate models
postal_codes["region"] = postal_codes.postal_code.astype("str").str.zfill(5).str[0:2] #pad 0s and create region column to match with json.
# Add a count column - count the number of listings per model and region
postal_codes = postal_codes.groupby(["model", "region"]).size().reset_index().rename(columns=({0:"Model Count", "region":"plz"}))
postal_codes = postal_codes.iloc[:, [1,0,2]] # re order columns
# Clean the Wikipedia dataset (saved to a CSV file with Windows-1252/WinLatin1 encoding)
region_map = pd.read_csv("region_map2.csv", encoding = "latin-1", dtype = {'Region #': "str"})
region_map["Region #"] = region_map["Region #"].str.strip()
region_map["Area"] = region_map["Area"].str.split(",").str[0]
region_map["Area"] = region_map["Area"].str.strip()
# Merge with used vehicles dataset
postal_codes = pd.merge(left = postal_codes, right = region_map, left_on = "plz", right_on = "Region #", how = "left").drop("Region #",
axis = 1)
# Check for nulls
print(postal_codes.isnull().sum()) # no null values
# Read json into geopandas dataframe
geopandas_germany = geopandas.read_file("germany_2.geojson")
geo_germany_merge = geopandas_germany.merge(postal_codes, on = "plz", how = "left") # add counts and area name into geopandas dataframe
# Create separate geopandas dataframes for each candidate vehicle
corsa = geo_germany_merge[geo_germany_merge.model == "corsa"]
astra = geo_germany_merge[geo_germany_merge.model == "astra"]
polo = geo_germany_merge[geo_germany_merge.model == "polo"]
# Create choropleth map that shows areas with the most models
m = folium.Map(location=[52.520008, 13.404954], zoom_start=5)
# Corsa
corsa_geo = folium.Choropleth(
geo_data = corsa,
name='Corsa',
data = corsa,
columns=["plz", "Model Count"],
key_on='feature.properties.plz',
fill_color="OrRd",
highlight = True,
nan_fill_color='white',
fill_opacity=0.7,
line_opacity=0.3,
show = True,
).add_to(m)
corsa_geo.geojson.add_child(
folium.features.GeoJsonTooltip(["Model Count", "Area"]))
# Astra
astra_geo = folium.Choropleth(
geo_data = astra,
name='Astra',
data = astra,
columns=["plz", "Model Count"],
key_on='feature.properties.plz',
fill_color="BuGn",
highlight = True,
nan_fill_color='white',
fill_opacity=0.7,
line_opacity=0.3,
show = False
).add_to(m)
astra_geo.geojson.add_child(
folium.features.GeoJsonTooltip(["Model Count", "Area"]))
# Polo
polo_geo = folium.Choropleth(
geo_data = polo,
name='Polo',
data = polo,
columns=["plz", "Model Count"],
key_on='feature.properties.plz',
fill_color="PuBu",
highlight = True,
nan_fill_color='white',
fill_opacity=0.7,
line_opacity=0.3,
show = False
).add_to(m)
polo_geo.geojson.add_child(
folium.features.GeoJsonTooltip(["Model Count", "Area"]))
folium.LayerControl(position='topright', collapsed=False).add_to(m)
m
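To view or share the interactive map outside the notebook, folium can write it to a standalone HTML file:
# Save the map to HTML (the file name is arbitrary)
m.save("candidate_vehicle_map.html")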