#!/usr/bin/env python
# coding: utf-8

# # Exploring eBay Car Sales Data
#
# The aim of this project is to clean a dataset of used cars from eBay Kleinanzeigen (a classifieds section of the German eBay website) and analyze the included used car listings.
#
# The dataset was originally scraped and uploaded to Kaggle. In this project we are using a reduced version of it (50,000 data points sampled from the full dataset) with more dirty data. The original dataset, as well as the data dictionary, can be found [here](https://www.kaggle.com/orgesleka/used-cars-database/data).

# ## Data Downloading and Understanding

# In[1]:

import pandas as pd
import numpy as np

# In[2]:

autos = pd.read_csv('autos.csv', encoding='Latin-1')

# In[3]:

autos

# In[4]:

autos.info()

# Some observations can be made here.
#
# - Our dataset has 50,000 entries and 20 columns.
# - A few columns (`'vehicleType'`, `'gearbox'`, `'model'`, `'fuelType'`, `'notRepairedDamage'`) have null values; for `'notRepairedDamage'` up to 20% of all values are missing.
# - The majority of columns store their values as strings (including `'price'` and `'odometer'`, which will have to be fixed to avoid issues).
# - The column names use JavaScript's preferred camelcase instead of Python's preferred snakecase, which means we can't just replace spaces with underscores. Some of the names are long and compound, even containing prepositions, which makes their meaning harder to grasp.
#
# Let's now investigate the first 10 rows of the dataset.

# In[5]:

autos.head(10)

# - At first glance, the columns containing dates and times appear to have a consistent format.
# - In some columns (`'seller'`, `'offerType'`, `'gearbox'`, `'notRepairedDamage'`) values are written in German.
# - The columns `'price'` and `'odometer'` show the same units for each value.
# - The columns `'seller'`, `'offerType'`, `'notRepairedDamage'`, and `'nrOfPictures'` each seem to have a dominant value (with `'notRepairedDamage'` also having a lot of null values) and probably will not be of much interest for further analysis.

# ## Cleaning Column Names

# At this stage we convert the column names from camelcase to snakecase and reword some of them, based on the data dictionary, to be more descriptive.

# In[6]:

autos.columns

# For converting the column names from camelcase to snakecase, we will use the following function:

# In[7]:

def camel_to_snake(name):
    """Convert a camelcase string to snakecase."""
    res = [name[0].lower()]
    for c in name[1:]:
        if c in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
            res.append('_')
            res.append(c.lower())
        else:
            res.append(c)
    return ''.join(res)

# Converting the column names
new_columns = []
for c in autos.columns:
    column_snake = camel_to_snake(c)
    new_columns.append(column_snake)

autos.columns = new_columns
autos.columns

# The column names now look more Pythonic. However, to make them more descriptive, it's worth doing some further renaming.

# In[8]:

# Renaming columns
autos.rename({'abtest': 'ab_test',
              'year_of_registration': 'registration_year',
              'power_p_s': 'power_ps',
              'month_of_registration': 'registration_month',
              'not_repaired_damage': 'unrepaired_damage',
              'date_created': 'ad_created',
              'nr_of_pictures': 'number_pictures'},
             axis=1, inplace=True)
autos.columns

# In[9]:

autos.head()

# Now all the column names follow Python's naming convention and clearly describe their contents.
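# As an aside, the same camelcase-to-snakecase conversion can be done with a one-line regular expression. A minimal sketch (equivalent for our column names, though not used above):

# In[ ]:

import re

def camel_to_snake_re(name):
    # Insert an underscore before every uppercase letter (except at the start), then lowercase
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

camel_to_snake_re('notRepairedDamage')  # 'not_repaired_damage'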
# ## Initial Data Exploration and Cleaning

# Let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially we will look for:
#
# - Text columns where all or almost all values are the same. These can be dropped as they don't hold useful information for analysis.
# - Examples of numeric data stored as text, which can be cleaned and converted.
# - Categorical data that uses German words, which can be translated and mapped to English counterparts.

# In[10]:

# Checking the dataset statistics including string columns
autos.describe(include='all')

# From these statistics we can draw several insights.
#
# - The columns `'seller'`, `'offer_type'` and `'number_pictures'` have practically one value each, which is exactly what we supposed earlier when analyzing the first 10 rows of the dataset. These columns should be dropped.
# - The `'price'` and `'odometer'` columns are numeric data stored as text (again, as noticed earlier). They have to be cleaned and converted to a numeric data type.
# - Judging by their minimum and maximum values, the columns `'registration_year'`, `'registration_month'` and `'power_ps'` have obvious outliers. The incorrect data has to be removed or corrected.
# - One curious observation about the `'price'` column: 1,421 out of 50,000 entries have a price of \$0, meaning that around 3% of the cars in our dataset are listed as free of charge.
# - More than 9.5% of all the cars have unrepaired damage.
#
# Let's drop the non-informative columns `'seller'`, `'offer_type'`, and `'number_pictures'`.

# In[11]:

# Dropping columns
autos.drop(['seller', 'offer_type', 'number_pictures'], axis=1, inplace=True)

# There is one more interesting column to look at: `'fuel_type'`. Of the 7 different fuel types represented in the dataset, petrol ("benzin" in German) is the most common one (60% of all the entries, including null values). What are the other 6?

# In[12]:

autos['fuel_type'].value_counts()

# The second most common fuel type is diesel (quite predictably), while the other types, all together less than 2%, are liquefied/compressed natural gas, hybrid, electric, and "other" ("andere" in German).
#
# Next, we are going to identify categorical data that uses German words, translate it, and map it to its English counterparts. After dropping the non-informative columns, the following ones contain German words: `'gearbox'`, `'fuel_type'` and `'unrepaired_damage'`. Let's investigate their vocabulary.

# In[13]:

print(autos['gearbox'].unique())
print(autos['fuel_type'].unique())
print(autos['unrepaired_damage'].unique())

# Let's create a single "German dictionary" for these words and map them to their English meanings.
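# One caveat before we do: `Series.map` returns NaN for any value absent from the mapping dictionary, which is why the dictionary below includes identity entries for words that stay the same. `Series.replace` leaves unmapped values untouched, so those entries could be skipped. A small self-contained demo of the difference (hypothetical values, not used below):

# In[ ]:

demo = pd.Series(['manuell', 'automatik', 'lpg'])
translations = {'manuell': 'manual', 'automatik': 'automatic'}
print(demo.replace(translations))  # 'lpg' stays 'lpg'
print(demo.map(translations))      # 'lpg' becomes NaN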
# In[14]:

# Creating a dictionary
german_dict = {
    'manuell': 'manual',
    'automatik': 'automatic',
    'benzin': 'petrol',
    'elektro': 'electric',
    'andere': 'other',
    'nein': 'no',
    'ja': 'yes',
    # The next words remain the same but have to be mapped anyway to avoid NaN
    'lpg': 'lpg',
    'diesel': 'diesel',
    'cng': 'cng',
    'hybrid': 'hybrid'
}

# Mapping German words to English
autos['gearbox'] = autos['gearbox'].map(german_dict)
autos['fuel_type'] = autos['fuel_type'].map(german_dict)
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(german_dict)

# Checking updated unique values
print(autos['gearbox'].unique())
print(autos['fuel_type'].unique())
print(autos['unrepaired_damage'].unique())

# ## Exploring the `odometer` and `price` Columns: Looking for Outliers

# Let's return to the `'price'` and `'odometer'` columns: we have to remove the non-numeric characters ('$' and 'km' respectively, plus the thousands-separator commas from both), convert the columns to a numeric data type, and rename them so the information about units isn't lost.

# In[15]:

# Removing non-numeric characters and converting to a numeric dtype
autos['price'] = autos['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(int)
autos['odometer'] = autos['odometer'].str.replace('km', '', regex=False).str.replace(',', '', regex=False).astype(int)

# Renaming columns
autos.rename({'price': 'price_dollars', 'odometer': 'odometer_km'}, axis=1, inplace=True)

# In[16]:

autos.head()

# Now the columns `'price_dollars'` and `'odometer_km'` are in the right format to be analyzed.

# In[17]:

autos['price_dollars'].unique().shape

# In[18]:

autos['odometer_km'].unique().shape

# The column `'price_dollars'` has 2,357 unique values; the column `'odometer_km'` has 13.
#
# Let's look at the price statistics.

# In[19]:

autos['price_dollars'].describe()

# There is an outlier in the prices: the maximum value is five orders of magnitude larger than the third quartile (75%), and in general \$100,000,000 is not a realistic price for a used car on sale. The minimum value of 0, on the other hand, looks fine and represents the cars offered free of charge.
#
# We will now investigate the maximum value in more detail.

# In[20]:

# Finding the 5 biggest prices in descending order
five_biggest_prices = autos['price_dollars'].value_counts().sort_index(ascending=False).head()
five_biggest_prices

# The maximum value is clearly wrong; it was probably intended to represent an undefined value. We have to remove it from the dataset.

# In[21]:

# Removing the outlier
autos = autos.loc[autos['price_dollars'].between(0, 99999998), :]

# In[22]:

# Double-checking the 5 biggest prices
five_biggest_prices = autos['price_dollars'].value_counts().sort_index(ascending=False).head()
five_biggest_prices

# In[23]:

autos['price_dollars'].describe()

# Now the values of the `'price_dollars'` column are clean and meaningful, with a maximum price of \$27,322,220, an average of \$7,840, and the cheapest cars free of charge.
#
# Let's conduct a similar analysis for the `'odometer_km'` column.

# In[24]:

autos['odometer_km'].describe()

# In[25]:

odometer_values = autos['odometer_km'].value_counts().sort_index()
odometer_values.head()

# In[26]:

odometer_values.tail()

# In this column we don't have any outliers. The value range is from 5,000 to 150,000 km, with an average of 125,732 km. These statistics are heavily influenced by the frequency of the maximum value: we can deduce from the previous cells that about 65% of the cars in our dataset have driven 150,000 km.
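# We can check that share directly with a quick one-line sketch:

# In[ ]:

# Share of listings at the 150,000 km odometer cap
(autos['odometer_km'] == 150000).mean()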
# ## Exploring the Date Columns

# Let's now understand the date range the data covers.
#
# There are 5 columns representing date values. Some of these columns were created by the crawler, some came from the website itself.
#
# - `'date_crawled'`: added by the crawler
# - `'last_seen'`: added by the crawler
# - `'ad_created'`: from the website
# - `'registration_month'`: from the website
# - `'registration_year'`: from the website
#
# The first 3 of these columns are identified as string values by pandas, so we need to convert them into a numerical representation for further quantitative analysis. The other 2 columns are already numeric.
#
# Let's first understand how the values in the 3 string columns are formatted.

# In[27]:

autos[['date_crawled', 'ad_created', 'last_seen']].head()

# These columns all hold full timestamp values, with the first 10 characters representing the date and the rest the time. For our purposes we are only interested in the dates.
#
# We are going to calculate the distribution of dates in the `'date_crawled'`, `'ad_created'`, and `'last_seen'` columns as percentages, including missing values and ranked from the earliest date to the latest.

# In[28]:

# Extracting dates, counting values in % including missing values, and sorting from earliest to latest
date_crawled_dates = autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
date_crawled_dates

# The dates when the ads in our dataset were first crawled range from the 5th of March 2016 to the 7th of April 2016. The distribution of frequencies looks quite uniform.

# In[29]:

# Extracting dates, counting values in % including missing values, and sorting from earliest to latest
ad_created_dates = autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
ad_created_dates

# In our dataset the earliest eBay listing was created on the 11th of June 2015 and the latest on the 7th of April 2016. As for the distribution, the frequencies are very low at the beginning and then grow significantly. Let's try to trace where this transition takes place.

# In[30]:

ad_created_dates[30:]

# An evident transition is observed on the 5th of March, exactly the day when the first ads were crawled from the website.

# In[31]:

# Extracting dates, counting values in % including missing values, and sorting from earliest to latest
last_seen_dates = autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
last_seen_dates

# The dates the crawler last saw each ad online also range (like `'date_crawled'`) from the 5th of March 2016 to the 7th of April 2016. The distribution of frequencies mostly looks uniform, increasing only over the last 3 days.
#
# Let's now take a look at the statistics of the `'registration_year'` column, which is already numeric.

# In[32]:

autos['registration_year'].describe()

# We clearly see 2 issues here: an impossible minimum value of the year 1000 and an equally impossible maximum of the year 9999. These values certainly bias the other statistics as well (mean and standard deviation).
#
# Because a car can't be first registered after its listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. As for the earliest valid year, realistically, it could be somewhere in the first few decades of the 1900s.
#
# Let's count the listings with cars that fall outside the 1900 - 2016 interval (see the sketch below) and then see whether it's safe to remove those rows entirely, or if we need more custom logic.
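# A quick sketch of that count, assuming the 1900 - 2016 bounds discussed above:

# In[ ]:

# Number of listings registered outside the plausible 1900-2016 window
(~autos['registration_year'].between(1900, 2016)).sum()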
# In[33]:

autos['registration_year'].value_counts().sort_index()

# The earliest plausible registration year in the data is 1910. It's worth looking at the 9 entries with that year in more detail.

# In[34]:

autos[autos['registration_year'] == 1910]

# Most of the columns in these entries have missing values. The `'odometer_km'` column looks realistic, with low values for almost all the cars, which is in line with the technology level of more than 100 years ago. The prices are low (down to 0): plausible for merely old cars, but unexpectedly low for anything considered a rarity. In any case, the significant proportion of missing values suggests that these rows aren't of particular interest for further analysis and can be removed.
#
# Let's investigate the next 3 early years from our value counts: 1927, 1929, and 1931.

# In[35]:

autos[autos['registration_year'] == 1927]

# In[36]:

autos[autos['registration_year'] == 1929]

# In[37]:

autos[autos['registration_year'] == 1931]

# These 3 rows have almost no missing values, and all the other values look plausible: high prices fitting such rarities, the cabrio vehicle type, a manual gearbox, low horsepower, low odometer values.
#
# Hence, it seems reasonable to use 1927 as the lowest acceptable year and 2016 (discussed earlier) as the highest one. All the values outside this interval have to be removed.

# In[38]:

# Removing years outside the interval
autos = autos.loc[autos['registration_year'].between(1927, 2016)]

# Let's calculate the distribution of frequencies of the remaining values.

# In[39]:

autos['registration_year'].value_counts(normalize=True).sort_index()

# The frequencies for the older years are, naturally, very low. For more recent years the values are much higher, although they fluctuate from one year to another.

# ## Exploring Top Brand Prices

# Our next step is to explore how average prices vary across car brands. To start, let's take a look at the unique brands in our dataset.

# In[40]:

autos['brand'].unique()

# Now let's look at the number of cars (as a % of the total) by brand.

# In[41]:

# Brands by percentage of the total number of cars
brands = autos['brand'].value_counts(normalize=True)
brands

# In[42]:

# Finding brands with 5% or more of the total number of cars
top_brands = brands[brands > 0.05].index
top_brands

# A few brands each account for over 5% of the total number of cars. Let's calculate the average price for each of them.

# In[43]:

# Creating an empty dictionary
top_brands_average_prices = {}

# Looping over the top brands and assigning the average price to the dictionary
for brand in top_brands:
    top_brands_average_prices[brand] = autos[autos['brand'] == brand]['price_dollars'].mean()

top_brands_average_prices

# Volkswagen, the most common brand in our dataset by a wide margin, has a relatively low average price among the top brands. Audi, the 4th most common brand, has the highest average price of the top brands, while Opel has the lowest; the gap between these two is \$3,839.
#
# For the top 6 brands, we will also analyze average mileage and check whether there is any visible link with average price (a more idiomatic shortcut for these computations is sketched below).
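# As an aside, the loop-and-dictionary pattern used above (and for mileage below) can be expressed as a single groupby aggregation. A sketch producing the same per-brand means in one step:

# In[ ]:

# Hypothetical one-step equivalent of both loops
top_brand_stats = (autos[autos['brand'].isin(top_brands)]
                   .groupby('brand')
                   .agg(average_price=('price_dollars', 'mean'),
                        average_mileage=('odometer_km', 'mean')))
top_brand_stats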
# In[44]:

# Converting the average price dictionary into a series object
prices_series = pd.Series(top_brands_average_prices)
prices_series

# In[45]:

# Creating an empty dictionary for average mileage values
top_brands_average_mileage = {}

# Looping over the top brands and assigning the average mileage to the dictionary
for brand in top_brands:
    top_brands_average_mileage[brand] = autos[autos['brand'] == brand]['odometer_km'].mean()

top_brands_average_mileage

# In[46]:

# Converting the average mileage dictionary into a series object
mileage_series = pd.Series(top_brands_average_mileage)
mileage_series

# In[47]:

# Creating a single dataframe from both series objects
prices_mileage = pd.DataFrame(prices_series, columns=['average_price'])
prices_mileage['average_mileage'] = mileage_series
prices_mileage

# Intuitively, we would expect an inverse relation between average price and average mileage. However, we observe just the opposite: the 3 brands with the highest average prices (Audi, Mercedes-Benz and BMW) also have the biggest average mileage values. On the other hand, the variation in mileage here is not significant (7% between the extreme values), so this factor is not of much use in our case.

# ## Finding the Most Common Brand/Model Combinations

# Let's investigate the most common combinations of brand and model in our dataset. We can try to find, say, all the models represented by 1,500 or more cars.

# In[48]:

# Finding brand/model combinations represented by 1,500 or more cars each
brand_model = autos.groupby('brand').model.value_counts().loc[lambda x: x > 1500]
print(brand_model)

# The most common brand/model combinations are: Volkswagen Golf, BMW 3er, Volkswagen Polo, and Opel Corsa. It's not surprising that Volkswagen leads here: from the previous analysis we remember that the brand itself is the most common one, accounting for 21% of all the cars in the dataset.

# ## Analyzing Price Patterns Based on Mileage

# Now we will investigate how mileage influences the average car price. For this purpose it is convenient to split the `'odometer_km'` values into groups, but first we have to check these values.

# In[49]:

autos['odometer_km'].unique()

# We can define the following 7 groups of values (in thousands of km): 5-10, 20-30, 40-50, 60-70, 80-90, 100-125, and 150. Let's calculate the average price for each of them.

# In[50]:

# Grouping by mileage and calculating the average price for each group
odometer_5_10 = autos[autos['odometer_km'] <= 10000]['price_dollars'].mean()
odometer_20_30 = autos[(autos['odometer_km'] >= 20000) & (autos['odometer_km'] <= 30000)]['price_dollars'].mean()
odometer_40_50 = autos[(autos['odometer_km'] >= 40000) & (autos['odometer_km'] <= 50000)]['price_dollars'].mean()
odometer_60_70 = autos[(autos['odometer_km'] >= 60000) & (autos['odometer_km'] <= 70000)]['price_dollars'].mean()
odometer_80_90 = autos[(autos['odometer_km'] >= 80000) & (autos['odometer_km'] <= 90000)]['price_dollars'].mean()
odometer_100_125 = autos[(autos['odometer_km'] >= 100000) & (autos['odometer_km'] <= 125000)]['price_dollars'].mean()
odometer_150 = autos[autos['odometer_km'] >= 150000]['price_dollars'].mean()

print(odometer_5_10)
print(odometer_20_30)
print(odometer_40_50)
print(odometer_60_70)
print(odometer_80_90)
print(odometer_100_125)
print(odometer_150)
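# The seven filters above can also be produced in one step with `pd.cut`. A sketch using the same bucket edges (right-inclusive by default, so each interval captures exactly the odometer values of its group):

# In[ ]:

# Hypothetical one-step equivalent of the seven filters above
bins = [0, 10000, 30000, 50000, 70000, 90000, 125000, 150000]
labels = ['5-10k', '20-30k', '40-50k', '60-70k', '80-90k', '100-125k', '150k']
autos.groupby(pd.cut(autos['odometer_km'], bins=bins, labels=labels))['price_dollars'].mean()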
# We see that average prices follow some clear patterns. The highest average price (\$36,375) belongs to used cars with medium mileage (40,000 - 50,000 km); such cars have probably been in use for about 2-3 years. The cars with smaller mileage, despite obviously being less used and, in theory, better preserved, cost half as much. This is not surprising: a car with low mileage (used for a short period, or even put on sale soon after being bought) can look less attractive to potential buyers, because it may have hidden flaws that are hard to detect at first, prompting the current owner to get rid of it. For mileage above 40,000 - 50,000 km (that is, for cars that have been used more and are therefore more worn), average prices decrease again, and the higher the mileage the lower the average price, down to \$4,514 for the cars that have run 150,000 km.

# ## Comparing Prices of Damaged vs. Non-Damaged Cars

# Our last step is to find out how much cheaper cars with damage are than their non-damaged counterparts. First, let's see how many cars have unrepaired damage.

# In[51]:

autos['unrepaired_damage'].value_counts(dropna=False)

# We can conclude that at least 9.5% of all the cars have unrepaired damage (we don't know the status of the numerous listings with null values). Now we will calculate the average prices of both categories.

# In[52]:

# Calculating average prices for damaged and non-damaged cars
unrepaired = autos.groupby('unrepaired_damage')['price_dollars'].mean()
print(unrepaired)

# The difference in prices is significant: on average, the damaged cars cost 3.5 times less than the non-damaged ones. Out of curiosity, let's have a look at the most expensive car among the damaged ones.

# In[53]:

# Finding the most expensive car among the damaged ones
autos.loc[autos[autos['unrepaired_damage'] == 'yes']['price_dollars'].idxmax()]

# We see that premium cars remain quite expensive even among the damaged ones, though evidently much cheaper than their non-damaged counterparts.

# ## Conclusion

# To summarize, in this project we cleaned and explored a dataset of used cars from the German eBay website. The cleaning tasks included renaming the columns according to Python's naming convention and giving them descriptive names, deleting non-informative columns, mapping German words to English, converting strings to numeric data types, removing outliers, and restricting the registration year range. Exploring the date columns, we found that the ads were added to the website from June 2015 till April 2016 and crawled from March 2016 till April 2016. We identified the most common brands in our dataset, with Volkswagen in first place by a wide margin. Only 6 brands each account for more than 5% of the total number of cars; for these top brands, average prices and mileage values were calculated. We discovered that about 3% of the cars in the dataset are listed free of charge, the most common brand/model combination is Volkswagen Golf, and 65% of the cars have the highest mileage (150,000 km). The mileage-based price patterns were analyzed: the highest average prices relate to cars with a medium mileage of 40,000 - 50,000 km, while for lower mileage values, and especially for higher ones, average prices decrease. Lastly, we explored the difference in average prices between non-damaged and damaged cars and observed that the latter are on average 3.5 times cheaper.