#!/usr/bin/env python # coding: utf-8 # ## About the Data: # # This rich data set provides a comprehensive window into Germany's used automobile market, spanning over 400,000 online sales listings from 2014-2016. # # - The contents of the data was in German, so some transalations and modifications (replacements of words) were made on the relevant fields. # # - Data Source is from [Data World](https://data.world/data-society/used-cars-data). However, the data was scrapped from eBay # # - Data was visualised using the Matplotlib library. # # - This data was from 2014 - 2016 # # - Business problems intended to be analysed and identified from tyhis data were: # - understanding the most used brands and their relationship with the vehicle types # - undersatnding how prices affect the brands and vehicle types # - understand consuer's preference for fuel type # - Understand the sdvertisement trend on ebay (for cars) # - identiffy the seller category with the most advertised cars on ebay. # - Comparisms between high-end luxury vehicles versus their affordability # - other metrics like pricing, brands, models, mileage, repairs and modifications were uncovered; revealling an intriguing demand patterns and shifts within this vast market. # # # # A few major dimensions stand out as offering meaningful insights. These analyses revealed how the ebay platform performed and insights to help them understand conusmers' preferences. # # Hopefully, this informs listings their sales team would be more on the lookout for (cosidering their increasing customer demands). # # ---------------------------------------------------------- # **Data fields**: # - dateCrawled : when this ad was first crawled, all field-values are taken from this date # - name : "name" of the car # - seller : private or dealer # - offerType *(in German)* # - price : the price on the ad to sell the car # - abtest # - vehicleType *(in German)* # - yearOfRegistration : at which year the car was first registered # - gearbox *(in German)* # - powerPS : power of the car in PS # - model *(in German)* # - kilometer : how many kilometers the car has driven # - monthOfRegistration : at which month the car was first registered # - fuelType *(in German)* # - brand # - notRepairedDamage : if the car has a damage which is not repaired yet *(in German)* # - dateCreated : the date for which the ad at ebay was created # - nrOfPictures : number of pictures in the ad # - postalCode # - lastSeenOnline : when the crawler saw this ad last online # # # ### Import the Data # In[1]: import pandas as pd # In[3]: df = pd.read_csv(r"C:\Users\Teni\Desktop\Git-Github\Online Datasets\autos.csv") # In[4]: df # The original dataset # In[5]: # To get the summary, and datatype, of the fields. df.info() # #### Based on the highlighted business problems to answer: # To solve the business questions on: # - most advertised brands and vehicle types # - how the price affects the influx of cars advertised (are there more 'less priced' vehicles on the platform or more' pricey' vehicles?) # - what type of sellers do we have on the platfomr the most # - how does milleage and powerps of the vehicles relate with the brand and vehicle types? Perhaps there's a consumer trend outside ebay we're not paying attention to, yet # # *The needed fields are: seller, price, vehicletype, gearbox, powerups, kilometer, datecreated, fueltype* # # ### Clean the data # # ** Translate needed columns to the English Language ** # In[6]: # summarizing the categorical data uder offertype. This is in German language; so the next code replaces the field values df['offertype'].value_counts() # In[7]: df['offertype'] = df['offertype'].replace({'Angebot': 'Offer', 'Gesuch': 'Request'}) # In[8]: # updated field value replacement df['offertype'].value_counts() # In[9]: # summarizing the categorical data uder fueltype. This is in German language; so the next code replaces the field values df['fueltype'].value_counts() # In[10]: df['fueltype']= df['fueltype'].replace({'benzin':'Petrol', 'diesel':'Diesel','lpg':'Liquefied Petroleum Gas', 'cng':'Compressed Natural Gas', 'hybrid': 'Hybrid', 'andere': 'Others', 'elektro':'Electric'}) # In[11]: # updated field replacement df['fueltype'].value_counts() # In[12]: df['gearbox']= df['gearbox'].replace({'manuell': 'Manual', 'automatik': 'Automatic'}) # In[13]: # updated field replacement f the gearbox attribute of the data df['gearbox'].value_counts() # In[14]: df['abtest'].value_counts() # In[15]: df['vehicletype'].value_counts() # In[16]: df['vehicletype']=df['vehicletype'].replace({'kleinwagen': 'small car', 'kombi': 'station wagon', 'cabrio': 'convertible', 'andere': 'other'}) # In[17]: df['seller'].value_counts() # In[18]: df['seller'] = df['seller'].replace({'privat':'private', 'gewerblich':'dealer'}) df['seller'].value_counts() # In[19]: df['notrepaireddamage'].value_counts() # In[20]: df['notrepaireddamage'] = df['notrepaireddamage'].replace({'nein': 'No', 'ja': 'Yes'}) # In[21]: df['model'].head(100) # In[22]: df['model']= df['model'].replace({'andere': 'other', }) # In[23]: df # ** Drop irrelevant columns ** # # *Some attributes are not needed in solving the business problem, so they'd all be dropped* # In[24]: df['nrofpictures'].value_counts() # In[25]: df = df.drop(['nrofpictures', 'lastseen', 'postalcode'], axis=1) # In[26]: df # #### Trends to Uncover from the data : # # # 1. **Pricing Analysis**: Investigate the price distribution for different vehicle types, brands, or models to understand pricing trends and identify outliers. # # 2. **Trend Analysis over Time**: Analyze the trends in the number of listings or sales over different months or years to understand seasonal patterns in the automotive market. # # 3. **Brand Comparison**: Compare the popularity or sales of different car brands to understand which brands are more preferred in the market and how they perform in terms of pricing or demand. # # 4. **Vehicle Condition and Price Correlation**: Examine how the condition of a vehicle (repaired damage or not) correlates with its pricing. Determine if repaired damage affects the selling price significantly. # # 5. **Fuel Type Preference**: Analyze the preference for different fuel types among different vehicle types or brands. Understand if there's a shift in preference towards more eco-friendly options like hybrid or electric vehicles. # # 6. **Feature Importance**: Analyze the impact of specific vehicle features (like power, gearbox type, vehicle type, etc.) on the pricing to identify which features drive higher prices or are more sought after by buyers. # # # #### Task 1. # **Pricing Analysis**: Investigate the price distribution for different vehicle types, brands, or models to understand pricing trends and identify outliers. # In[27]: df # In[28]: df.info() # In[29]: df['price'].isnull() # In[30]: df = df.fillna('None') # Since there's no Null value in Price, it's safe to fill all null values in the dataframe with the string 'None' than the integer '0' # #### Price and Brand Relationship # # In[31]: df # In[32]: import matplotlib.pyplot as plt import pandas as pd # Calculate the average price for each brand (top 10) avg_prices = df.groupby('brand')['price'].mean().nlargest(10) # Get the number of brands for each category (top 10) brand_count = df['brand'].value_counts()[:10] # Using the add_axes because I'd like to display 2 related information in one plot fig = plt.figure(figsize=(22, 15)) # Adjust the figure size if needed ax1 = fig.add_axes([0, 0, 1, 1]) # Plotting the Main graph for the Avg_prices for the top 10 brands ax1.bar(avg_prices.index,avg_prices, color='blue') # Defining the title of the main plot ax1.set_title('Prices of the top 10 Brands advertised on the platform', fontsize=20, fontweight='bold',color='blue') # Defining the x and y labels ax1.set_xlabel('Brand', fontsize=20,fontweight='bold') ax1.set_ylabel('Average Price', fontsize=20, fontweight='bold') # # Adjusting the x and y items ax1.tick_params(axis='x', rotation=45, labelsize=16) ax1.tick_params(axis='y', rotation=45, labelsize=16) # Plotting the inner graph for the count of the top 10 Brands ax2 = fig.add_axes([0.5, 0.5, 0.25, 0.25]) ax2.bar(brand_count.index,brand_count, color='skyblue') # Defining the title of the main plot ax2.set_title('Top 10 advertised brands on the platform ', fontsize=20, fontweight='bold',color='skyblue') # Defining the x and y labels ax2.set_xlabel('Brand', fontsize=14,fontweight='bold') ax2.set_ylabel('Count', fontsize=14, fontweight='bold') # Adjusting the x and y items ax2.tick_params(axis='x', rotation=45, labelsize=15) ax2.tick_params(axis='y', labelsize=15) # Display the plot plt.show() # ### Possible reasons for the inverse relationship between the Brand price and the quantity: # # - The data reveals that the expensive cars do not meet the top 10 most sold or advertised cars on Ebay. Invariably, the less pricey cars are the most sold/advertised cars # - This tells us that either the demand and supply power is in play. The less pricey the car, the more they are in quantity, and vice-versa. This would nt be restricted to the platfomr alone, but the automobile industry in general. # - Most of the pricey cars are probably not sold on ebay (probaly there's a different market for those types of cars), and ebay is mostly popular for selling the less-pricey cars. # - ebay doesn't have as much car sellers of whose who own/deal in luxury cars in comparism with those who own the less pricey cars (*data sufficiency wouldn't allow us deterine this*). # # In[33]: df['vehicletype'].value_counts() # In[34]: import matplotlib.pyplot as plt import pandas as pd # Assuming df is your DataFrame # Filter out rows where vehicletype is 'None' or 'Other' filtered_df = df[~df['vehicletype'].isin(['None', 'other'])] # Updated vehicle type and price data after filtering vehicletype = filtered_df['vehicletype'].value_counts() price_type = filtered_df.groupby('vehicletype')['price'].mean().sort_values(ascending=False) # Create a figure and two subplots fig, (axes1, axes2) = plt.subplots(1, 2, figsize=(12, 5)) # First subplot - Average prices based on vehicle types axes1.bar(price_type.index, price_type, color='purple') axes1.set_xlabel('Vehicle Types') axes1.set_ylabel('Prices') axes1.set_title('Top 7 Vehices per Price', fontweight='bold') axes1.tick_params(axis='x', rotation=45, labelsize=10) # Rotate x-axis labels for better reada6bility # Second subplot - Count of vehicles based on types axes2.bar(vehicletype.index, vehicletype, color='blue') axes2.set_xlabel('Vehicle Type') axes2.set_ylabel('Count') axes2.set_title('Top 7 Vehicles', fontweight='bold') axes2.tick_params(axis='x', rotation=45, labelsize=10) # Rotate x-axis labels for better readability plt.tight_layout() plt.show() # ### Vehicle Type and Brand have the same trend; inverse relationship between the price and the vehicle type # # - As the brands, the trend is the same. The high-pricey vehicles are low in quantity, while the less-pricey vehicles are high in quantity. # - The data reveals that the expensive *vehicle types* do not meet the top 7 most sold or advertised *vehicle types* on Ebay. Invariably, the less pricey *vehicle types* are the most sold/advertised *vehicle types* # - In the automobile industry, either the **limited quantity** of the expensive *vehicle types* influenced the increased prices or because they're made with **more quality-sourced resources**, they're more expensive. Brands would also play a pivotal role in the pricing. The demand and supply power is in play. The less pricey the *vehicle types* , the more they are in quantity, and vice-versa. This would nt be restricted to the platfomr alone, but the automobile industry in general. # - Most of the pricey *vehicle types* are probably not sold on eBay (probaly there's a different market for those types of cars), and ebay is mostly popular for selling the less-pricey *vehicle types* . # - eBay doesn't have as much car sellers of whose who own/deal in luxury cars in comparism with those who own the less pricey cars (*data sufficiency wouldn't allow us deterine this*). # - This gives an insight to the type of Private Dealers on the platfomr. Middle to low-inclome earners. # #### Trend Analysis over Time # In[35]: df # In[36]: df['datecreated'] = pd.to_datetime(df['datecreated']) year = df['datecreated'].dt.year year_count = year.value_counts() year_count # In[37]: import matplotlib.pyplot as plt import pandas as pd import matplotlib.ticker as ticker df['datecreated'] = pd.to_datetime(df['datecreated']) year_created = df['datecreated'].dt.year year_count = year_created.value_counts() fig = plt.figure() axes1 = fig.add_axes([0, 0, 1, 1]) axes1.bar(year_count.index, year_count, color='blue') axes1.set_xlabel('Year') axes1.set_ylabel('Count') axes1.set_title('Sales Year Trend', fontweight='bold') axes1.xaxis.set_major_locator(ticker.MaxNLocator(integer=True)) # Ensure integer values on x-axis axes2 = fig.add_axes([0.25, 0.25, 0.25, 0.25]) axes2.bar(year_count.index,year_count, color='purple') axes2.set_xlim(2014, 2015) axes2.set_ylim(0, 35) axes2.set_xlabel('Year', fontsize=8) axes2.set_ylabel('Count', fontsize=8) axes2.set_title('Zoomed_In View for years 2014 and 2015', fontsize=10) axes2.xaxis.set_major_locator(ticker.MaxNLocator(integer=True)) # Ensure integer values on x-axis plt.show() plt.tight_layout() # ### Analyzing the reasons for the spike in Car Advertisements in 2016 # More Ads were put out in 2016 than ever advertised. What was the cause for this? # # - Firstly, most of our top advertised brands released new models in 2016. As a matter of [fact](https://www.best-selling-cars.com/germany/2016-full-year-germany-30-best-selling-car-models/), they topped the list of top 25 bestmodels and car sales in year 2016 # - The most cars were sold in 2016, than in previous year. This was a general market trend. # - A speculation could be due to the. From this [report](https://www.best-selling-cars.com/germany/2016-germany-developments-new-car-market/), 2016 increased by 4.51% to almost 3.4 million cars sold; the best since 2009. # # - This general rise were attributed to [environmental bonus](https://www.electrive.com/2023/09/29/germany-hits-2-million-approved-environmental-bonus-mark/#:~:text=Since%202016%2C%20the%20German%20government,bonus%20experienced%20a%20strong%20boost.) (which must have been promoted since 2016) but was introducted in 2020. # In[38]: import matplotlib.pyplot as plt sellers = df['seller'].value_counts() fig = plt.figure() ax1 = fig.add_axes([0, 0, 1, 1]) ax1.bar(sellers.index, sellers) ax1.set_title("Sellers' Index", fontweight='bold') ax2=fig.add_axes([0.5, 0.5, 0.25, 0.25]) ax2.bar(sellers.index, sellers) ax2.set_ylim(0, 30) ax2.set_title('Zoomed_In view', size=10) # ### Why do we have more Privtate Sellers? # # - Firstly, the introduction of the [environmental bonus](https://www.electrive.com/2023/09/29/germany-hits-2-million-approved-environmental-bonus-mark/#:~:text=Since%202016%2C%20the%20German%20government,bonus%20experienced%20a%20strong%20boost.) tilted to favour individual ownership of vehicles. It's safe to assume most of the private sellers took the most of this bonus by plunging fulltime in car businesses. # # - Other questions revealed from thsi question is; # - is our marketing content selling our plaform more to private sellers car dealers? # - Could this mean, dealers have other marketing platforms or means they use in selling or advertising, that we're not aware of? # # #### Brand Comparism # In[39]: df # In[40]: df['brand'] # In[41]: import matplotlib.pyplot as plt import pandas as pd filtered_df = df[~df['vehicletype'].isin(['None', 'other'])] vehicle_type = filtered_df['vehicletype'].value_counts() brand = df['brand'].value_counts().nlargest(15) fig = plt.figure(figsize=(24, 16)) axes1 = fig.add_axes([0, 0, 1, 1]) axes1.bar(brand.index, brand, color='#50ABC7') axes1.set_title('Brand', fontweight='bold', fontsize=20) axes1.set_xlabel('brands') axes1.set_ylabel('brand count') axes1.tick_params(axis='x', rotation=45, labelsize=13) axes2 = fig.add_axes([0.5, 0.5, 0.25, 0.25]) axes2.bar(vehicle_type.index,vehicle_type) axes2.set_title('Types of Vehicle', fontweight='bold') # axes2.set_xlabel('vehicle types') axes2.set_ylabel('count') axes2.tick_params(axis='x', rotation=45, labelsize=10) plt.show() # ### Why do we have more Privtate Sellers? # # - Firstly, the introduction of the [environmental bonus](https://www.electrive.com/2023/09/29/germany-hits-2-million-approved-environmental-bonus-mark/#:~:text=Since%202016%2C%20the%20German%20government,bonus%20experienced%20a%20strong%20boost.) tilted to favour individual ownership of vehicles. It's safe to assume most of the private sellers took the most of this bonus by plunging fulltime in car businesses. # # - Other questions revealed from thsi question is; # - is our marketing content selling our plaform more to private sellers car dealers? # - Could this mean, dealers have other marketing platforms or means they use in selling or advertising, that we're not aware of? # # #### Vehicle Condition and Price Correlation # In[42]: df # In[43]: df['notrepaireddamage'] = df['notrepaireddamage'].replace({'Yes': 'repaired', 'No': 'never_repaired'}) filtered_damages = df['notrepaireddamage'][~df['notrepaireddamage'].isin(['None'])].value_counts() filtered_damages # In[44]: gearbox_counts = df[df['gearbox'] !='None']['gearbox'].value_counts() filtered_df = df[df['gearbox'] != 'None'] gear_box = round(filtered_df.groupby('gearbox')['kilometer'].mean(), 2) gear_box # In[ ]: # In[45]: import matplotlib.pyplot as plt import pandas as pd filtered_damages = df['notrepaireddamage'][df['notrepaireddamage'].isin(['Yes', 'No'])].value_counts() filtered_damages = df[~df['notrepaireddamage'].isin(['None'])]['notrepaireddamage'].value_counts() prices = df['price'].value_counts() gearbox_counts = df[df['gearbox'] !='None']['gearbox'].value_counts() filtered_df = df[df['gearbox'] != 'None'] gear_box = round(filtered_df.groupby('gearbox')['kilometer'].mean(), 2) fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(38, 15)) ax1.bar(gearbox_counts.index, gearbox_counts, label='Gearbox') ax1.set_xlabel('Gear_Box', size=22, weight='bold') ax1.set_ylabel('Count', size=22, weight='bold') ax1.tick_params(axis='x', labelsize=20) ax1.tick_params(axis='y', labelsize=20) ax2.bar(filtered_damages.index, filtered_damages, label='Repair Status', color='skyblue') ax2.set_xlabel('Repair Status', size=22, weight='bold') ax2.set_ylabel('Count', size=22, weight='bold') ax2.tick_params(axis='x', labelsize=20) ax2.tick_params(axis='y', labelsize=20) ax3.bar(gear_box.index, gear_box, label='Avg kilomenter per Gear_Type') ax3.set_xlabel('Gear_Type', size=22, weight='bold') ax3.set_ylabel('Kilometer', size=22, weight='bold') ax3.tick_params(axis='x', labelsize=20) ax3.tick_params(axis='y', labelsize=20) ax1.legend(fontsize=26) ax2.legend(loc='center right', fontsize=26) ax3.legend(loc='lower left', fontsize=26) # plt.tight_layout() plt.show() # ### Vehicle conditions sold on the platform. # # - The data reveals that manual cars are the stongest and they make up cars which haven't been repaired. # - If we are able to analyse the sales data, we can determine if the vehicle conditions have imporved sales. # #### Fuel Type Preference # In[46]: df # In[47]: import matplotlib.pyplot as plt fuel_type = df['fueltype'][~df['fueltype'].isin(['None', 'Others'])].value_counts() plt.figure(figsize=(12, 8)) plt.bar(fuel_type.index, fuel_type, label='Fuel Type Preference', color='#C79F50') plt.tick_params(axis='x', rotation = 45, labelsize=8) plt.tick_params(axis='y', labelsize=8) plt.legend(); # ### Vehicle Conditions # # - From research, this is a [general trend](https://www.acea.auto/figure/trends-in-fuel-type-of-new-cars-between-2015-and-2016-by-country/) across the country # #### Feature Importances # In[48]: df # In[49]: import matplotlib.pyplot as plt powerps = df['powerps'].value_counts() # plt.bar(powerps.index, powerps) powerps # power, gearbox type, vehicle type, # In[50]: filtered_gearbox = df[~df['gearbox'].isin(['None'])] avg_power_gear = round(filtered_gearbox.groupby('gearbox')['powerps'].mean(), 2) # In[51]: filtered_vehicles = df[df['vehicletype']!='None'] avg_power_veh = round(filtered_vehicles.groupby('vehicletype')['powerps'].mean()) avg_power_veh # In[52]: import matplotlib.pyplot as plt filtered_gearbox = df[~df['gearbox'].isin(['None'])] avg_power_gear = round(filtered_gearbox.groupby('gearbox')['powerps'].mean(), 2) filtered_vehicles = df[df['vehicletype']!='None'] avg_power_veh = round(filtered_vehicles.groupby('vehicletype')['powerps'].mean()) fig, (axis1, axis2) = plt.subplots(1, 2, figsize=(20, 14), gridspec_kw={'width_ratios': [1, 2]}) axis1.bar(avg_power_gear.index, avg_power_gear) axis1.set_xlabel('Gearbox', fontweight='bold') axis1.set_ylabel('Powerps', fontweight='bold') axis1.set_title('Avg Powerups per GearBox', fontweight='bold', fontsize=10) # Axis2 axis2.bar(avg_power_veh.index, avg_power_veh) axis2.set_title('Avg Powerups per Vehicle Type', fontweight='bold', fontsize=10) plt.show() # ### Powerps of Advertised Vehicles # # - The data reveals that automatic cars have more Powerps than manual Cars; even though previous graph showed that manual cars are sttronger, with less repairs. This simply means Automatic cars focus on efficiency and speed, rather than the durability and effectiveness of namual cars. # # Since majority of the cars sold on the platform are manual cars, this would help us understand customer preference # ### Conclusion: # # - Automatic cars have more horsepower than manual cars, showing a focus on efficiency and speed for automatics versus the durability of manuals. Despite this, most cars sold are manuals, reflecting customer preference. # - Expensive cars aren't in the top-selling list on eBay; cheaper cars dominate. Likely, demand and supply dynamics favor lower-priced cars, possibly due to a different market for pricier vehicles. # - Similar trends exist among brands and vehicle types: high-priced ones are less common, suggesting limited quantities or higher-quality resources contributing to their cost. # - More ads were posted in 2016, likely due to top brands releasing new models, a general market trend of increased car sales, and factors like the [environmental bonus](https://www.electrive.com/2023/09/29/germany-hits-2-million-approved-environmental-bonus-mark/#:~:text=Since%202016%2C%20the%20German%20government,bonus%20experienced%20a%20strong%20boost.) introduced in 2020. # - The environmental bonus likely encouraged individual ownership, possibly driving more private sellers into the car business. # - Questions arise about the effectiveness of the platform's marketing to private sellers and whether dealers use alternative marketing channels we're not aware of. # #