This project represents an exploration into data cleaning and analysis. This project uses scraped data from used car advertisements on eBay Kleinanzeigen. The primary focus of the project was the introduction of basic dataset cleaning. The basic outline and work flow of this project is based of the )Dataquest project of the same name , as well the following blog. I used these outlines as a reference piece, but have made my own concessions regarding the cleaning of the dataset and the Visualization. All text and code was written by myself.

The project aims to accomplish the following:

- Clean, and prepare the scraped dataset for analysis and visualization.
- Perform basic visual analysis on the dataset
- Determine which vehicle type is most commonly featured.
- Determine which brands are most commonly featured and which brand is most expensive.
- Determine various relationships between price and odometer reading, vehicle type, age, and vehicle damage.

In [1]:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
```

The dataset is a collection of over 370000 used cars scraped with Scrapy from Ebay-Kleinanzeigen.

Data dictionary:

- dateCrawled : when this ad was first crawled, all field-values are taken from this date
- name : "name" of the car
- seller : private or dealer
- offerType
- price : the price on the ad to sell the car
- abtest
- vehicleType
- yearOfRegistration : at which year the car was first registered
- gearbox
- powerPS : power of the car in PS, the metric conversion of HP
- model
- kilometer : how many kilometers the car has driven
- monthOfRegistration : at which month the car was first registered
- fuelType
- brand
- notRepairedDamage : if the car has a damage which is not repaired yet
- dateCreated : the date for which the ad at ebay was created
- nrOfPictures : number of pictures in the ad
- postalCode
- lastSeenOnline : when the crawler saw this ad last online

We will first display the first five rows of the dataframe to get an initial understanding of our dataframe.

In [2]:

```
autos = pd.read_csv(r"C:\Users\AlexM\OneDrive\Documents\Python\data_sets\data-society-used-cars-data\autos.csv", encoding = "latin-1" )
#encoding Latin-1 as UTF-8 gave error.
autos.head()
```

Out[2]:

We will also display dataframe information using `dataframe.info`

and `dataframe.describe`

to get gain a numerical understanding of our dataframe.

In [3]:

```
autos.info()
```

In [4]:

```
pd.set_option('display.float_format', lambda x: '%.1f' % x) # This removes the scientific notation for readability
autos.describe(include = 'all')
```

Out[4]:

From this brief exploration we have learned the following:

- The dataset has 20 columns and 371527 rows.
- 13 columns are object data types:
`str`

- 7 columns are numerical data types:
`int64`

- Various column entries are in German

The following columns have null values:

- 6 vehicleType:
`37869 null`

- 8 gearbox:
`20209 null`

- 10 model:
`351044 null`

- 13 fuelType:
`33386 null`

- 15 notRepairedDamage:
`72059 null`

The following columns appear to have a limited diversity of entry types and may not offer much information regarding the dataset:

- seller
- offerType
- abtest
- nrOfPictures
- notRepairedDamage

We will briefly expand our analysis on the columns with a limited diversity of entry types by viewing the unique values of the columns. This will provide us with more information about the columns and allow us to determine whether we should retain the columns for analysis.

In [5]:

```
autos["seller"].value_counts(dropna=False) #Counts items in seller column and provides a total of each item, includes any NaN values.
```

Out[5]:

In [6]:

```
autos["offerType"].value_counts(dropna=False)
```

Out[6]:

In [7]:

```
autos["abtest"].value_counts(dropna=False)
```

Out[7]:

In [8]:

```
autos["nrOfPictures"].value_counts(dropna=False)
```

Out[8]:

In [9]:

```
autos["notRepairedDamage"].value_counts(dropna=False)
```

Out[9]:

Using the 'count','unique','top','freq' values from `dataset.describe()`

to guide us and keeping in mind that the dataset has 371527 entries, we find the following:

The 'Seller' column has two unique values associated with it: 'privat' and 'gewerblich'.'privat' is the most frequent entry type with '371525' entries.

The 'offerType' column has two unique values associated with it: 'Angebot' and 'Gesuch'. 'Angebot' is the most frequent entry with '371516'entries.

The 'abtest' column has two unique values associated with it: 'test' and 'control' with 'test' being the most frequent with '192585' entries.

The 'nrOfPictures' column has '371528' entries, but no numerical values associated with it.

The 'notRepairedDamage' column has three entry types: 'nein', 'Nan', and Ja. Nein is the most frequent entry with '263182 entries.

In conjunction with the above, we determine that:

Almost all entries for the 'Seller' Column reference 'Privat'. According to the data dictionary, this column represents whether the listing on ebay was by a seller or a commercial company. Due to the skew in the frequency of 'privat' in comparison to 'gewerblich', thus no comparative analysis can be undertaken and we will remove the column from our analysis.

The 'offerType' column has two unique values associated with it, but no associated explanation in the data dictionary. Again, there is a skew in the frequency of 'Angebot' in comparison to 'Gesuch', thus no comparative analysis can be undertaken and we will remove the column from our analysis.

The 'abtest' column has two unique values associated with it, but no associated explanation in the data dictionary. Due to this we will remove the column from our analysis.'

The 'nrOfPictures' column has '371528' entries, but no numerical values associated with it. This indicates that no car listings have pictures associated with their entries. This may be a result of the scraping that was performed. Regardless, this column provides no insight and we remove the column from our analysis.

Despite the low diversity of entry types in the 'notRepairedDamage' column, the associated entries match with the information given by the data dictionary, and thus will be kept for analysis.

Note: We will also remove the 'dateCrawled' column from our analysis as it will provide no additional insight to any analysis performed.

In [10]:

```
#Removing columns from dataframe
autos.drop(["seller","offerType","abtest","nrOfPictures","dateCrawled"],axis = 1,inplace = True)
```

As can be seen above, all columns are written in CamelCase. We will convert the column labels to snake_case as is standard as it allows for easier use in the code. We also note that numerous entries are not in English and we will rectify this issue below.

We will again display the dataframe to confirm that we have removed the above columns.

In [11]:

```
autos.head()
```

Out[11]:

In [12]:

```
#creating a renaming dictionary to change column indexes to snake case and relabeling
name_dictionary = {
'name':'name',
'price':'price',
'vehicleType':'vehicle_type',
'yearOfRegistration':'registration_year',
'gearbox':'gearbox',
'powerPS':'power_ps',
'model':'model',
'kilometer':'odometer',
'monthOfRegistration':'registration_month',
'fuelType':'fuel_type',
'brand':'brand',
'notRepairedDamage':'unrepaired_damage',
'dateCreated':'ad_created',
'postalCode':'postal_code',
'lastSeen':'last_seen'}
autos = autos.rename(columns = name_dictionary)
```

We will now inspect the ranges of numerical columns to see if there are any unusually high or low values.

In [13]:

```
autos.describe(include = np.number) #returns numerical columns only.
```

Out[13]:

We note the following unusual characteristics:

price: The minimum price(in euros) is: € 0 (free) and the max price is: € 2,147,483,647 (2 Billion). Incidentally, this is the eighth Mersenne prime.

registration_year: The minimum registration year is: 1000, and the maximum registration year is: 9999.

power_ps: The minimum PS is: 0 PS and the maximum PS is: 20,000 PS. Note that the 75th percentile has a value of 150 - thus 20,000 represents a very large increase.

- Note that PS is Metric Horsepower, which is roughly equivalent to Horsepower.

registration_month: The values of registration month range from 0 to 12.This represents a total 13 months.

As we saw above, the minimum price is €0 (free) and the maximum price is €2,147,483,647 (2 Billion+). Entries with the minimum price of €0 may have this value as a placeholder, as a missing value, or to indicate that the seller would like to trade the vehicle. With no way to determine this, we will drop the entries with a price of €0 from the data set.

The extremely high maximum price leads us to believe there are other significantly large outliers. We will either confirm or refute this claim below.

We will first start the process of determining a reasonable listing price. First we will get a better understanding of the maximum and minimum prices and the frequency in which they occur.

In [14]:

```
autos["price"].value_counts().sort_index(ascending=True) #Counts unique values and displays values from low to high
```

Out[14]:

In [15]:

```
autos["price"].value_counts().sort_values(ascending=False) #Counts unique values and displays number of values from low to high
```

Out[15]:

Immediately we can determine from the above that the following holds:

- There are over 10,000 entries with a price of 0, which represents the most common value.
- There are several significantly large outliers, which confirms our assumption.
- There are 5597 unique prices.
- After a price of zero; €500, €1500, €1000, and €1200 represent the most common listing price.

We will now remove the outliers from this column.

Originally, using a heuristic approach, I created bins between the values of 500 and 100,000 as these values seemed appropriate given that 500 was the second most common value and there were 75 entries between the values of 90050.0 and 100,000.The upper bound of this approach seemed reasonable as there may be some vintage cars in the listings and thus skew the upper bound.

To keep this project simple, I instead use the interquartile range (IQR) to clean the data. The IQR provides reasonable upper and lower bounds based on statistical dispersion. See this Wikipedia article for more information.

We now create our Inter quartile range by finding the difference between the third quartile and second quartile of our data. We then set the lower bound equal to the first quartile minus 1.5* the IQR and set our upper bound equal to the third quartile plus 1.5* the IQR.

In [16]:

```
Q1 = autos["price"].quantile(0.25)
Q3 = autos["price"].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - (1.5*IQR)
upper_bound = Q3 + (1.5*IQR)
autos = autos.loc[autos["price"].between(lower_bound,upper_bound)]
print('Our lower bound is:',lower_bound)
print('Our upper bound is:',upper_bound)
```

A digression on why our lower bound is negative is outside of the scope of this project, and will not be mentioned here. Despite this negative lower bound, we can continue with our analysis as the minimum price is €0 and within the bounds. Recall from above we determined that zero was the most frequent price, and the next most frequent price is €500.

We will make a brief exploration into prices between €101 and €499 to determine whether values in the range add value to the project.The arbitrary lower bound of €100 has been used as there are limited values below this price.

In [17]:

```
lower_price = autos.loc[autos['price'].between(101,499)]
lower_price.describe()
```

Out[17]:

In [18]:

```
plt.figure(figsize=(10, 10))
sns.histplot(data=lower_price,
x='price'
).set(title='Distribution of car prices between €101 - €499 ')
plt.show()
```

From the basic descriptive statistics and the visualization of the price distributions between €101 and €499, we will choose to remove all values below the 25th percentile(€230). A noted observation is the tendency for users to place their cars at the rounded numbers of of 100's and 50's.

In [19]:

```
autos = autos.loc[autos["price"]>= 230] #remove entries with price less than 500.
```

In [20]:

```
print('The size of our dataset is now:',autos.shape)
```

We now visualize the distribution of prices inclusive of the upper bound we have selected:

In [21]:

```
plt.figure(figsize=(10, 4))
ax =sns.histplot(data=autos,
x='price', kde=True,
bins=50).set(title='Distribution of car prices')
plt.show()
```

With the cleaning of the price column completed, and matching our assumptions of prices, we will turn to cleaning the registration_year column. We will return to applying more limited analysis to the price column below.

As we saw above, the minimum registration year is: 1000 and the maximum registration year is: 9999. Obviously these values are incorrect. Note that this dataset was scraped in 2016, and so the maximum registration year can be no higher than 2016.

As we can automatically have an upper bound of 2016, we will explore the lower bound of our data. According to this wikipedia page on the Automotive industry in Germany - in 1901 there were 900 cars produced each year. Using this information, we will create a filter between the years 1901 and 1959. The year 1959 was chosen as I am making the assumption that cars were a fairly common household item by this year.

First we will gain a brief numerical understanding of our lower bound.

In [22]:

```
lower_b = autos.loc[autos['registration_year'].between(1901,1959)] #filtering the registration between year 1901 and 1959
lower_b["registration_year"].value_counts().sort_index()
```

Out[22]:

In [23]:

```
lower_b["registration_year"].describe()
```

Out[23]:

In [24]:

```
lower_b.corr()
```

Out[24]:

In [25]:

```
plt.figure(figsize=(10, 10))
sns.histplot(data=lower_b,
x='registration_year',
bins = 50, kde=True
).set(title='Distribution of car registrations 1901 - 1959')
plt.show()
```

From the basic descriptive statistics and the visualization of our filter, we observe the following:

- There are no vehicles for sale with a registration year below 1910, or between the years 1911 and 1923.
- There are 209 listings within this filter.
- 75% of all vehicles have a registration year above 1957.
- There appears to be a weak correlation (Pearsons 0.3) between price and registration year.

Regarding cars with registration years at or below 1959, we expect the listing price to be quite high as these cars could be considered collectibles. We will quickly observe this relationship between the registration year and median price to further inform our lower bound.

In [26]:

```
plt.figure(figsize=(10, 10))
ax = sns.barplot(data=lower_b,y = 'price',
x = 'registration_year',
estimator = np.median,
ci = None
).set(title='Relationship Between Registration Year and Price (1901 - 1959)')
plt.xticks(rotation=90)
plt.show()
```

From the above visualization, we can see that there is a limited and seemingly random relationship between price and registration year. Using a similar method to cleaning the prices, we will remove all entries below the 25th percentile.

In [27]:

```
autos = autos.loc[autos['registration_year'].between(1937, 2016)]
autos['registration_year'].describe()
```

Out[27]:

In [28]:

```
plt.figure(figsize=(10, 10))
ax = sns.histplot(data=autos,
x='registration_year',
bins=50).set(title='Distribution of Car Registrations by Year')
plt.show()
```

In [29]:

```
print('The size of our dataset is now:',autos.shape)
```

With the cleaning of the registration_year column complete, we will turn to cleaning the registration_month column.

As we saw above, The values of registration month range from 0 to 12, which implies there are 13 months total. Obviously, we must inspect this error and determine a course of action.

Note the visualization below:

In [30]:

```
plt.figure(figsize=(10, 4))
plt.xticks(([0,1,2,3,4,5,6,7,8,9,10,11,12]))
ax = sns.histplot(data=autos,
x='registration_month',
discrete=True,
)
ax.set(title='Distribution of car registrations(month)')
plt.show()
```

As we can see,the value zero occurs over 25,000 times.This value may represent placeholder values, or may be an error from the scraping. As we have no other information regarding this value, we will drop all rows with zero as the registration month.

In [31]:

```
autos = autos.loc[autos['registration_month']>=1]
```

As was noted above, The minimum PS is: 0 PS and the max PS is: 20,000 PS. Note, PS (or Metric Horsepower) is roughly equivalent to HP.

We first notice that the 75th percentile of the power_ps column is 143, thus the maximum of 20,000 PS represents a significant outlier. Before cleaning the data, we will first gain a numerical and visual understanding of the data. We will create a filter between 0 and 450. This upper bound represents the maximum PS in most commercially available vehicles.

In [32]:

```
autos['power_ps'].describe()
```

Out[32]:

In [33]:

```
power_ps_bound = autos.loc[autos['power_ps'].between(0, 450)]
plt.figure(figsize=(10, 10))
ax =sns.histplot(data=power_ps_bound,
x='power_ps',
bins=50).set(title='Distribution of power_ps')
plt.show()
```

The visualization and description of the power_ps column presents the following findings:

- There are 20000 listings with 0 PS
- Most listings have a PS value between 50 and 250.
- The mean value is 179.3 PS
- There are limited entires with a PS above 400.

Using the above information, we will create a filter to futher clean our data.

In [34]:

```
autos = autos.loc[autos['power_ps'].between(50, 400)] #creating a filter with a power PS ranging from 50 to 400.
autos['power_ps'].describe()
```

Out[34]:

In [35]:

```
print('The size of our dataset is now:',autos.shape)
```

As this dataset was scraped from a German website, most worded entries are of German origin. Below, we will create dictionaries to remap/translate the German words to English words.

In [36]:

```
#Create a series of unique entry types
vehicle_type_series = pd.Series(autos["vehicle_type"].unique())
gearbox_series = pd.Series(autos["gearbox"].unique())
fuel_type_series = pd.Series(autos["fuel_type"].unique())
unrepaired_damage_series = pd.Series(autos["unrepaired_damage"].unique())
#create a dictionary from the series above and use the print function so we can manually create the dictionaries.
print(vehicle_type_series.to_dict())
print(gearbox_series.to_dict())
print(fuel_type_series.to_dict())
print(unrepaired_damage_series.to_dict())
```

In [37]:

```
vehicleType_dict = {'coupe':'coupe',
'suv':'suv',
'kleinwagen':'small car',
'limousine':'limousine',
'cabrio':'convertible',
'bus':'bus',
'kombi':'kombi',
'andere':'other'}
gearbox_dict = {'manuell':'manual',
'automatik':'automatic'
}
fuelType_dict = {'benzin':'petrol',
'diesel':'diesel',
'lpg':'lpg',
'andere':'other',
'hybrid':'hybrid',
'cng':'cng',
'elektro':'electric'}
notRepairedDamage_dict = {'ja':'yes',
'nein':'no'}
```

In [38]:

```
#create a data dictionary to map german words to english words
autos["gearbox"] = autos["gearbox"].map(gearbox_dict)
autos["vehicle_type"] = autos["vehicle_type"].map(vehicleType_dict)
autos["fuel_type"] = autos["fuel_type"].map(fuelType_dict)
autos["unrepaired_damage"] = autos["unrepaired_damage"].map(notRepairedDamage_dict)
```

We will display the head of the dataframe to confirm that our mappings have been applied.

In [39]:

```
autos.head()
```

Out[39]:

It appears that our dictionaries have successfully remapped the German words to English. With this in mind, we will now remove the underscores that appear in the 'brand','name', and model columns and replace them with a space.

In [57]:

```
#remove underscore from car brands
autos["brand"] = autos["brand"].str.replace("_", " ")
#remove underscore from car name
autos["name"] = autos["name"].str.replace("_", " ")
#remove underscore from car model
autos["model"] = autos["model"].str.replace("_", " ")
```

We will display the head of the dataframe to confirm that the underscores have been applied.

In [58]:

```
autos.head()
```

Out[58]:

As we can see, the underscores have now been removed. We will now perform our final cleaning task by converting the strings in the 'ad_created' and 'last_seen' column. With the conversion from strings to datetime objects, Pandas will recognize the entries as dates and this will allow us to visualize our data.

In [40]:

```
format_str = "%Y-%m-%d %H:%M:%S"
autos["ad_created"] = pd.to_datetime(autos["ad_created"], format = format_str) #converts to datetime64[ns]
autos["last_seen"] = pd.to_datetime(autos["last_seen"], format = format_str)
```

Finally, we near the end of this project. We will now perform some basic visualizations of the dataset which may be of interest, given the dataset. No statistical analysis will be performed, as it is out of the scope of this project.

In [42]:

```
order_count_top_10 = autos['brand'].value_counts().sort_values(ascending=False).head(10).index
plt.figure(figsize=(10, 10))
ax =sns.countplot(data=autos,
y = 'brand',
order = order_count_top_10
).set(title='Top Ten Most Common Brand of Car For Sale')
```

As we can see from the visualization above, Volkswagen has the highest count at approximately 55000 listings. This is followed by Bmw and Opel, which both have just under 30000 listings.

In [43]:

```
order_count = autos['brand'].value_counts().sort_values(ascending=False).index
plt.figure(figsize=(10, 10))
ax =sns.countplot(data=autos,
y = 'brand',
order = order_count
).set(title='Count of Listing By Brand')
```

As we can see from this visualization, the brands: Trabant, Lada, and Rover have the least listings. Trabant has 5 Listings, lada has 145 listings, and Rover has 314 listings.

In [44]:

```
order_price=autos.groupby('brand')['price'].agg('median').sort_values(ascending=False).index
plt.figure(figsize=(10, 10))
ax =sns.barplot(data=autos,
x = 'price',
y = 'brand',
estimator = np.median,
order = order_price,
ci = None
).set(title='Median Price of Listing by Brand')
plt.show()
```

As we can see from the visualization above, the median price varies from approximately €1000 to just under €12000. Porsche has the highest median price of €11500, followed by Mini, and Land Rover with median prices of €7990, and €6625 respectively.

Rover, Daewoo, and Lancia have the lowest median prices, which are all under €1500.

In [45]:

```
order_count_type = autos['vehicle_type'].value_counts().sort_values(ascending=False).index
plt.figure(figsize=(10, 10))
ax =sns.countplot(data=autos,
y = 'vehicle_type',
order = order_count_type
).set(title='Most Common Vehicle Type')
```

Interestingly, 'Limousine' is the most common type of car in the dataset, with a count of approximately 75000. This is followed by the type 'Small Car' with a count of approximately 60000.

In [46]:

```
order_car = autos.groupby('vehicle_type')['price'].agg('median').sort_values(ascending=False).index
plt.figure(figsize=(10, 10))
ax =sns.barplot(data=autos,
x = 'price',
y = 'vehicle_type',
estimator = np.median,
order = order_car,
ci = None
).set(title='Median Price of Vehicle by Type')
plt.show()
```

In relation to the to most common vehicle type for sale, we have a visualization of the median price of those vehicle type above. As can be seen: Suv has the highest median price of just under €8000, and small cars have the lowest median price of just under €2000.

In [74]:

```
plt.figure(figsize = (14, 5))
plt.title(label='Median Price of Car vs Odometer Reading')
ax = sns.lineplot(data=autos,
x = 'odometer',
y = 'price',
estimator = np.median,
ci=None
)
plt.show()
```

This graph confirms the base assumption that the higher the odometer reading, the lower the asking price for the car.

In [76]:

```
# relation between age and selling price
sns.set(rc={'figure.figsize':(22,8)})
sns.barplot(data=autos,
x = 'registration_year',
y = 'price',
estimator = np.median,
ci=None,
).set(title='Median Price of Car by Registration Year')
plt.xticks(rotation=90)
plt.show()
```

This visualization is a presentation of the relationship between the median price of the car that is listed and the year in which the car was registered. Note the following:

- There is a downward trend in price from the mid 19070's until 1997, after which the median price begins to trend upwards.
- The year 2016 has a very low median price, this is most likely due to limited entries, as the dataset was scraped in 2016.
- Cars in 1952 have the highest median price of approximately €14500.
- Cars in 1995 have the lowest median price of approximately €1000.

In [67]:

```
plt.figure(figsize=(5, 5))
plt.title('Median Price of Car With Damage vs No Damage')
plt.xticks(rotation = 90)
sns.barplot(
data = autos,
x = 'unrepaired_damage',
y = 'price',
estimator = np.median,
ci = None
)
plt.show()
```

This graph confirms the base assumption that cars with damage have a lower median listing price than cars that have no damage.

This project used scraped data and was focused on basic 'data cleaning'. The cleaning that was performed related to the usability of the dataset for any future analysis. The cleaning performed involved the following:

- Removing columns with data that provided little insight.
- Reformatting headers, strings, and dates for the purpose of usability and readability.
- Removing numerical outliers.

Alongside the 'data cleaning', we also used visualizations to enhance our understanding of the dataset that we were working with. The visualizations helped guide our thought processes in determining which entires to remove.

Various visualizations were also performed to generate understanding of the cleaned dataset. We visualized the following:

- Median Price of Car With Damage vs No Damage'
- Median Price of Car by Registration Year'
- Median Price of Car by Odometer
- Median Price of Vehicle by Type
- Most Common Vehicle Type
- Median price of Listing by Brand
- Count of Listing By Brand
- Top Ten Most Common Brand of Car For Sale

In future, I may return to this project to perform further analysis on the dataset. With new tools, there may be interesting insights to be gained.