Exploring Used Car Sales Data

In this assignment, I will be analyzing a dataset of used cars from the classifieds section of a German eBay website. The aim of this project is to go through the process of cleaning the data, analyzing it, and trying to draw conclusions from it. Also, in doing this project I will become more familiar with the unique benefits that Jupyter Notebook provides for pandas.

In [146]:
#import the pandas and numpy packages

import pandas as pd
import numpy as np
#read the csv file; call it autos
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
In [147]:
#first review of the data; this should show us the variable types and the size of our dataset
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64 
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64 
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64 
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64 
 18  postalCode           50000 non-null  int64 
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
Out[147]:
dateCrawled name seller offerType price abtest vehicleType yearOfRegistration gearbox powerPS model odometer monthOfRegistration fuelType brand notRepairedDamage dateCreated nrOfPictures postalCode lastSeen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50
In [148]:
#show the column names currently
autos.columns
Out[148]:
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest',
       'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model',
       'odometer', 'monthOfRegistration', 'fuelType', 'brand',
       'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode',
       'lastSeen'],
      dtype='object')

Above, we can see that there are 20 columns within this dataset, most of which are strings. The column names are written in camelCase instead of Python's preferred snake_case, and since there are no spaces to swap for underscores, we can't just do a simple substitution. The next step will be to convert the column names from camelCase to snake_case and to rename some of the columns, based on the data dictionary, to be more descriptive.

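Rather than typing every name by hand, the camelCase-to-snake_case conversion can also be done programmatically. Below is a minimal sketch (my own aside, not part of the original notebook) using a regular expression; it only handles the mechanical conversion, while the next cell also renames a few columns to more descriptive names from the data dictionary.

import re

def camel_to_snake(name):
    #insert an underscore before each uppercase letter, then lowercase everything
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

#e.g. camel_to_snake('yearOfRegistration') returns 'year_of_registration'
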
In [149]:
#rename the columns
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type',
              'registration_year', 'gearbox', 'power_ps', 'model',
              'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage',
              'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen']

autos.columns
#show the header for each of the columns to make sure it is in snake_case and looks right.
autos.head()
Out[149]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD privat Angebot $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 0 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik privat Angebot $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 0 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United privat Angebot $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 0 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... privat Angebot $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 0 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... privat Angebot $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 0 39218 2016-04-01 14:38:50

There. Now that the column names are cleaned up, we can move forward.

In looking at the autos dataframe, we can try to figure out if other data cleaning tasks need to be done.

I should be on the lookout for text columns where all or almost all of the values are the same. If that is the case for a column, it should be dropped, as it wouldn't contain any useful information for analysis.

Also, I should look for examples of numeric data stored as text, which can be cleaned and converted.

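As a quick programmatic check for near-constant text columns, a sketch like the following (my own addition; the 95% cutoff is an arbitrary assumption) would flag any object column dominated by a single value:

#flag object columns whose most frequent value covers over 95% of rows
for col in autos.select_dtypes(include='object').columns:
    top_share = autos[col].value_counts(normalize=True).iloc[0]
    if top_share > 0.95:
        print(col, round(top_share, 4))
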
In [150]:
autos.describe(include = 'all')  #run this to keep exploring the dataframe
Out[150]:
date_crawled name seller offer_type price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created nr_of_pictures postal_code last_seen
count 50000 50000 50000 50000 50000 50000 44905 50000.000000 47320 50000.000000 47242 50000 50000.000000 45518 50000 40171 50000 50000.0 50000.000000 50000
unique 48213 38754 2 2 2357 2 8 NaN 2 NaN 245 13 NaN 7 40 2 76 NaN NaN 39481
top 2016-03-19 17:36:18 Ford_Fiesta privat Angebot $0 test limousine NaN manuell NaN golf 150,000km NaN benzin volkswagen nein 2016-04-03 00:00:00 NaN NaN 2016-04-07 06:17:27
freq 3 78 49999 49999 1421 25756 12859 NaN 36993 NaN 4024 32424 NaN 30107 10687 35232 1946 NaN NaN 8
mean NaN NaN NaN NaN NaN NaN NaN 2005.073280 NaN 116.355920 NaN NaN 5.723360 NaN NaN NaN NaN 0.0 50813.627300 NaN
std NaN NaN NaN NaN NaN NaN NaN 105.712813 NaN 209.216627 NaN NaN 3.711984 NaN NaN NaN NaN 0.0 25779.747957 NaN
min NaN NaN NaN NaN NaN NaN NaN 1000.000000 NaN 0.000000 NaN NaN 0.000000 NaN NaN NaN NaN 0.0 1067.000000 NaN
25% NaN NaN NaN NaN NaN NaN NaN 1999.000000 NaN 70.000000 NaN NaN 3.000000 NaN NaN NaN NaN 0.0 30451.000000 NaN
50% NaN NaN NaN NaN NaN NaN NaN 2003.000000 NaN 105.000000 NaN NaN 6.000000 NaN NaN NaN NaN 0.0 49577.000000 NaN
75% NaN NaN NaN NaN NaN NaN NaN 2008.000000 NaN 150.000000 NaN NaN 9.000000 NaN NaN NaN NaN 0.0 71540.000000 NaN
max NaN NaN NaN NaN NaN NaN NaN 9999.000000 NaN 17700.000000 NaN NaN 12.000000 NaN NaN NaN NaN 0.0 99998.000000 NaN
  1. From looking at the summary, 'seller' and 'offer_type' each have only 2 unique values, and the top value accounts for 49,999 of the 50,000 rows, so these columns are likely unusable.

  2. The 'nr_of_pictures' column has numeric values but a mean and standard deviation of 0, so it would be a good column to look at in more detail.

  3. The 'price' and 'odometer' columns contain numeric data stored as text, so I will try to convert them to numeric types.

In [151]:
#review the nr_of_pictures column
autos['nr_of_pictures'].value_counts()
Out[151]:
0    50000
Name: nr_of_pictures, dtype: int64

So it looks like the 'nr_of_pictures' column only has 0 as its value, so it would be smart to drop this column in addition to the 'seller' and 'offer_type' columns.

In [152]:
#delete these columns 
autos.drop(columns = ["nr_of_pictures", "seller", "offer_type"], inplace = True)
autos.head()
Out[152]:
date_crawled name price abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD $5,000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik $8,500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United $8,990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... $4,350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... $1,350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 39218 2016-04-01 14:38:50

Next is formatting the price column. It looks like we have to remove the ',' and the '$'.

In [153]:
#convert the values in the price column from text to numeric;
#regex=False treats '$' as a literal character and avoids the pandas FutureWarning
autos['price'] = (autos['price'].str.replace('$', '', regex=False)
                  .str.replace(',', '', regex=False)
                  .str.replace(' ', '', regex=False)
                  .astype(int)
                 )
#review the data
autos['price'].describe()
Out[153]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price, dtype: float64
In [154]:
#changing the column name
autos.rename({'price' : 'price_dollar'}, axis=1, inplace=True)
autos.head()
Out[154]:
date_crawled name price_dollar abtest vehicle_type registration_year gearbox power_ps model odometer registration_month fuel_type brand unrepaired_damage ad_created postal_code last_seen
0 2016-03-26 17:47:46 Peugeot_807_160_NAVTECH_ON_BOARD 5000 control bus 2004 manuell 158 andere 150,000km 3 lpg peugeot nein 2016-03-26 00:00:00 79588 2016-04-06 06:45:54
1 2016-04-04 13:38:56 BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik 8500 control limousine 1997 automatik 286 7er 150,000km 6 benzin bmw nein 2016-04-04 00:00:00 71034 2016-04-06 14:45:08
2 2016-03-26 18:57:24 Volkswagen_Golf_1.6_United 8990 test limousine 2009 manuell 102 golf 70,000km 7 benzin volkswagen nein 2016-03-26 00:00:00 35394 2016-04-06 20:15:37
3 2016-03-12 16:58:10 Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... 4350 control kleinwagen 2007 automatik 71 fortwo 70,000km 6 benzin smart nein 2016-03-12 00:00:00 33729 2016-03-15 03:16:28
4 2016-04-01 14:38:50 Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... 1350 test kombi 2003 manuell 0 focus 150,000km 7 benzin ford nein 2016-04-01 00:00:00 39218 2016-04-01 14:38:50
In [155]:
#check the values in the price_dollar column
autos['price_dollar'].unique().shape
Out[155]:
(2357,)
In [156]:
autos['price_dollar'].describe()
Out[156]:
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_dollar, dtype: float64

So it looks like the non-numeric characters have been removed. The next step is going to be converting the 'odometer' column and renaming it.

Now that the 'price_dollar' column looks set, I can move to the 'odometer' column. The first step is to figure out which non-numeric characters are currently in it.

In [157]:
#Figure out what non-numeric characters are in the column

autos['odometer'].head()
Out[157]:
0    150,000km
1    150,000km
2     70,000km
3     70,000km
4    150,000km
Name: odometer, dtype: object
In [158]:
#Remove the non-numeric values (',' and the km) from the column and convert it to a numeric dtype.

autos['odometer'] = autos['odometer'].str.replace(',', '').str.replace('km', '').astype('int64')

autos['odometer'].head()
Out[158]:
0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer, dtype: int64

That looks better: simple, clean numeric values for the odometer column. It would be smart to specify the units, so I will rename the column to 'odometer_km'.

In [159]:
#Change the column name to odometer_km
autos.rename({'odometer':'odometer_km'}, axis = 1, inplace = True)
#double check to make sure that the column was successfully renamed
autos.columns
Out[159]:
Index(['date_crawled', 'name', 'price_dollar', 'abtest', 'vehicle_type',
       'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km',
       'registration_month', 'fuel_type', 'brand', 'unrepaired_damage',
       'ad_created', 'postal_code', 'last_seen'],
      dtype='object')

So now the price and odometer columns look set. Next, I want to continue learning about the other numeric columns. It is also important to make sure each column has more than one unique value; a column where every value is identical carries no information for analysis and should be deleted from the dataframe.

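As an aside, a more compact sketch of the same check uses DataFrame.nunique over several columns at once; the cell below spells the counts out individually instead.

#compact alternative: count unique values for several columns at once
autos[['registration_year', 'registration_month', 'postal_code', 'power_ps']].nunique()
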
In [160]:
#Calculate the unique values for the other columns
print('Unique Values for this dataset:')
print('registration_year:', *autos['registration_year'].unique().shape)
print('registration_month:', *autos['registration_month'].unique().shape)

print('postal_code:', *autos['postal_code'].unique().shape)
print('power_ps:', *autos['power_ps'].unique().shape)
Unique Values for this dataset:
registration_year: 97
registration_month: 13
postal_code: 7014
power_ps: 448

From the output above, all four of those columns have plenty of unique values, so they are informative and should stay in the dataset.

Looking at the Outliers

The next step is to figure out whether there are big outliers in the data that are skewing the distributions. To do this, let's create a table of summary statistics, including the min and max, and look for outliers in the 'odometer_km' and 'price_dollar' columns.
In [161]:
outliers_table = autos[['odometer_km', 'price_dollar']].describe()
outliers_table
Out[161]:
odometer_km price_dollar
count 50000.000000 5.000000e+04
mean 125732.700000 9.840044e+03
std 40042.211706 4.811044e+05
min 5000.000000 0.000000e+00
25% 125000.000000 1.100000e+03
50% 150000.000000 2.950000e+03
75% 150000.000000 7.200000e+03
max 150000.000000 1.000000e+08

To remove the outliers, I first need to figure out if there are any. I can use the '1.5xIQR rule': any value below Q1 - 1.5xIQR or above Q3 + 1.5xIQR, where IQR = Q3 - Q1 is the interquartile range, is flagged as a potential outlier.

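To keep the filtering readable, the rule can be wrapped in a small helper; this is just a sketch of the idea, while the cells below compute the quantiles inline.

#helper implementing the 1.5*IQR rule: returns the (lower, upper) bounds
def iqr_bounds(series, k=1.5):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

#e.g. low, high = iqr_bounds(autos['odometer_km'])
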
In [162]:
#remove the outliers from the odometer_km column using the 1.5*IQR rule
q1 = autos['odometer_km'].quantile(0.25)
q3 = autos['odometer_km'].quantile(0.75)
iqr = q3 - q1
odometer_no_outliers = autos[autos['odometer_km'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
In [163]:
#print the dataframe with and without outliers
print('odometer_km with potential outliers:', *autos['odometer_km'].shape)
print('odometer_km without potential outliers:', *odometer_no_outliers['odometer_km'].shape)
odometer_km with potential outliers: 50000
odometer_km without potential outliers: 41520

So it looks like there were outliers in the 'odometer_km' column, because the resulting row count of 41,520 is lower than the original 50,000. I will replace the 'autos' dataframe with 'odometer_no_outliers', which is the same dataframe minus the rows flagged as outliers in the 'odometer_km' column.

In [164]:
#Remove the rows with the outliers in the 'odometer_km' column.
autos = odometer_no_outliers
#double check that the new autos dataframe no longer contains those outlier rows
autos.shape[0]
Out[164]:
41520

Now let's run the same code for the 'price_dollar' column and see what we can get rid of. Again, I will use the '1.5xIQR rule'.

In [165]:
#removing the outliers from the price column using the 1.5*IQR rule
q1 = autos['price_dollar'].quantile(0.25)
q3 = autos['price_dollar'].quantile(0.75)
iqr = q3 - q1
price_no_outliers = autos[autos['price_dollar'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
In [166]:
#Show the autos dataframe so that you can compare price_no_outliers to original
print('autos with potential price outliers:', *autos['price_dollar'].shape)
print('autos without potential price outliers:', *price_no_outliers['price_dollar'].shape)
autos with potential price outliers: 41520
autos without potential price outliers: 38386

So it looks like some price outliers were removed when we applied the '1.5xIQR rule': the row count without potential price outliers is 38,386, lower than the 41,520 rows in the previous version of the autos dataframe. Like earlier, I can replace the 'autos' dataframe with 'price_no_outliers' without anything bad happening to the rest of the data.

In [167]:
#update the 'autos' dataframe so that the likely price outliers have been removed.

autos = price_no_outliers
In [168]:
#double check that the number of rows in the new autos version looks better

autos.shape[0]
Out[168]:
38386

Comparing the columns with and without outliers

Let's compare the statistics of the columns with the outliers included against the statistics without them and see what changed.

In [169]:
print('Table with Outliers')
outliers_table
Table with Outliers
Out[169]:
odometer_km price_dollar
count 50000.000000 5.000000e+04
mean 125732.700000 9.840044e+03
std 40042.211706 4.811044e+05
min 5000.000000 0.000000e+00
25% 125000.000000 1.100000e+03
50% 150000.000000 2.950000e+03
75% 150000.000000 7.200000e+03
max 150000.000000 1.000000e+08
In [170]:
print('Table w/o Outliers')
autos[['odometer_km','price_dollar']].describe()
Table w/o Outliers
Out[170]:
odometer_km price_dollar
count 38386.000000 38386.000000
mean 142674.412546 3127.445136
std 16147.477812 2872.206674
min 90000.000000 0.000000
25% 150000.000000 950.000000
50% 150000.000000 2131.000000
75% 150000.000000 4500.000000
max 150000.000000 12250.000000

From looking at the two columns, 'odometer_km' went through a couple of changes. The standard deviation decreased (from 40042 to 16147) while the minimum value increased quite a bit (from 5000 to 90000).

The 'price_dollar' column changed significantly as well. Its standard deviation decreased (from 481104 to 2872), and its maximum value is now 12250, which is a lot more realistic than the old maximum of $100,000,000.

To start looking carefully at the date-related columns, it would help if I first review how they are formatted. I want to start with the 'date_crawled', 'ad_created', and 'last_seen' columns.

In [171]:
#review formatting for the date-related columns
autos[['date_crawled', 'ad_created', 'last_seen']].head()
Out[171]:
date_crawled ad_created last_seen
0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54
1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08
4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50
5 2016-03-21 13:47:45 2016-03-21 00:00:00 2016-04-06 09:45:21
6 2016-03-20 17:55:21 2016-03-20 00:00:00 2016-03-23 02:48:59

Looks like the format is year-month-day, with the hours in 24-hour time. Next I want to review the range of the 'date_crawled' column and see if there are any big outliers which may impact the conclusions.

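Since these columns are stored as strings, I will extract the date with string slicing below. An alternative sketch would be to parse them into proper datetimes first and use the .dt accessor:

#alternative: parse the strings as datetime64 instead of slicing
crawled = pd.to_datetime(autos['date_crawled'])
crawled.dt.date.value_counts(normalize=True).sort_index()
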
In [172]:
#review the date range in the 'date_crawled' column
autos['date_crawled'].str[:10].value_counts(normalize = True, dropna = False).sort_index()
Out[172]:
2016-03-05    0.024697
2016-03-06    0.013911
2016-03-07    0.036628
2016-03-08    0.034414
2016-03-09    0.033736
2016-03-10    0.032642
2016-03-11    0.032017
2016-03-12    0.037540
2016-03-13    0.015058
2016-03-14    0.037383
2016-03-15    0.034127
2016-03-16    0.030037
2016-03-17    0.032121
2016-03-18    0.013234
2016-03-19    0.033398
2016-03-20    0.037514
2016-03-21    0.037488
2016-03-22    0.032564
2016-03-23    0.032538
2016-03-24    0.028630
2016-03-25    0.032199
2016-03-26    0.032069
2016-03-27    0.030819
2016-03-28    0.034674
2016-03-29    0.033867
2016-03-30    0.034127
2016-03-31    0.031991
2016-04-01    0.033267
2016-04-02    0.034492
2016-04-03    0.037800
2016-04-04    0.036862
2016-04-05    0.013442
2016-04-06    0.003569
2016-04-07    0.001146
Name: date_crawled, dtype: float64

The data seems fine: the listings were crawled fairly evenly across the range, mostly in March with a few days in April, all in 2016.

Next lets review the date range for 'ad_created'.

In [173]:
#review the date range for 'ad_created'
autos['ad_created'].str[:10].value_counts(normalize = True, dropna = False).sort_index()

So it looks like nearly all of the ads were created in March and April 2016, with a distribution similar to 'date_crawled'. Good to know.

Next let's review the distribution of the 'last_seen' column and see what that looks like.

In [174]:
#review the date range for 'last_seen' column
autos['last_seen'].str[:10].value_counts(normalize = True, dropna = False).sort_index()
Out[174]:
2016-03-05    0.001094
2016-03-06    0.005002
2016-03-07    0.006174
2016-03-08    0.008701
2016-03-09    0.010707
2016-03-10    0.011384
2016-03-11    0.014120
2016-03-12    0.026051
2016-03-13    0.009821
2016-03-14    0.013755
2016-03-15    0.016985
2016-03-16    0.017871
2016-03-17    0.030141
2016-03-18    0.007685
2016-03-19    0.017090
2016-03-20    0.022300
2016-03-21    0.022091
2016-03-22    0.022638
2016-03-23    0.019434
2016-03-24    0.021049
2016-03-25    0.020320
2016-03-26    0.017845
2016-03-27    0.017064
2016-03-28    0.022326
2016-03-29    0.023368
2016-03-30    0.025712
2016-03-31    0.024879
2016-04-01    0.024488
2016-04-02    0.025817
2016-04-03    0.025634
2016-04-04    0.025426
2016-04-05    0.116944
2016-04-06    0.205309
2016-04-07    0.120773
Name: last_seen, dtype: float64

The date range matches the 'date_crawled' column, from March 5, 2016 to April 7, 2016, but this distribution is not even: the last three days alone account for roughly 44% of the values. That spike is more likely an artifact of the crawl ending than a genuine surge in sales.

Next I want to review the 'registration_year' column. There might be some outliers there, so let me check that out.

In [175]:
#review the 'registration_year' column to see what the distribution is
print(autos['registration_year'].describe())

print(autos['registration_year'].value_counts().sort_index().head())
count    38386.000000
mean      2002.499557
std         36.331672
min       1910.000000
25%       1999.000000
50%       2002.000000
75%       2006.000000
max       9000.000000
Name: registration_year, dtype: float64
1910    2
1934    1
1937    1
1950    1
1953    1
Name: registration_year, dtype: int64

Yeah, this column needs to get cleaned up a little. A registration year of 9000 is obviously not realistic, and since the data was scraped in 2016, no registration year should be later than 2016. Cars only really appeared in society around the early 1900s, so I will keep the years between 1900 and 2016. Let's remove some rows!

In [176]:
#remove the rows where 'registration_year' is out of range
autos = autos[autos['registration_year'].between(1900, 2016)]
In [177]:
#double check that the values that remain are realistic
autos["registration_year"].value_counts(normalize = True).sort_index()
Out[177]:
1910    0.000054
1934    0.000027
1937    0.000027
1950    0.000027
1953    0.000027
          ...   
2012    0.004194
2013    0.000654
2014    0.000218
2015    0.000136
2016    0.031071
Name: registration_year, Length: 66, dtype: float64
In [178]:
print(autos['registration_year'].describe())
count    36722.000000
mean      2001.640924
std          5.980791
min       1910.000000
25%       1998.000000
50%       2002.000000
75%       2005.000000
max       2016.000000
Name: registration_year, dtype: float64

That seems better. We can see that the majority of the cars were registered from the mid-1990s up to our assigned maximum year of 2016. Good to know.

Review the Top 10 Car Brands in this Dataset

For this final part of this guided project, Exploring Car Sales Data, I want to do an analysis of the top 10 car brands sold within this dataset.

In [179]:
#review the top 10 brands in the 'brand' column
autos['brand'].value_counts().head(10)
Out[179]:
volkswagen       8161
opel             4469
bmw              3933
mercedes_benz    3307
audi             2928
ford             2684
renault          1958
peugeot          1195
fiat              948
seat              684
Name: brand, dtype: int64

In this dataset, we can see that the most common brand was 'volkswagen', while 'seat' appeared least often among the top 10. Next I want to get a better sense of the mean price of these cars and figure out whether there is any strong relationship between the brand of a car and the price it is listed for.

In [180]:
#creating a dictionary for mean price
aggregate_price = {}  #empty dictionary called aggregate_price
top_brands = autos['brand'].value_counts().head(10).index
for brand in top_brands:
    sel_row = autos[autos['brand'] == brand]
    mean_row = sel_row['price_dollar'].mean()
    aggregate_price[brand] = mean_row
In [181]:
#show the aggregate (mean) price for each brand of car
aggregate_price
Out[181]:
{'volkswagen': 3205.765960053915,
 'opel': 2115.4443947191767,
 'bmw': 4501.369692346809,
 'mercedes_benz': 4178.984275778651,
 'audi': 4362.219945355191,
 'ford': 2242.295827123696,
 'renault': 1728.5985699693565,
 'peugeot': 2328.5171548117155,
 'fiat': 1831.6582278481012,
 'seat': 2713.6988304093566}
In [182]:
#sort the aggregate (mean) price from high to low
sorted(aggregate_price.items(), key = lambda x: x[1], reverse = True)
Out[182]:
[('bmw', 4501.369692346809),
 ('audi', 4362.219945355191),
 ('mercedes_benz', 4178.984275778651),
 ('volkswagen', 3205.765960053915),
 ('seat', 2713.6988304093566),
 ('peugeot', 2328.5171548117155),
 ('ford', 2242.295827123696),
 ('opel', 2115.4443947191767),
 ('fiat', 1831.6582278481012),
 ('renault', 1728.5985699693565)]

From looking at this data, 'bmw' has the highest mean price and 'renault' the lowest among the top 10 brands. Now that we know this, let's analyze the mileage and see what kind of relationship there is between mileage, selling price, and brand.

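As an aside, the loop-and-dictionary approach used below could also be expressed with groupby; a sketch of the equivalent aggregation:

#equivalent aggregation sketch using groupby instead of a loop
top_10 = autos['brand'].value_counts().head(10).index
autos[autos['brand'].isin(top_10)].groupby('brand')[['price_dollar', 'odometer_km']].mean()
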
In [183]:
#Create a dictionary for mean mileage called 'aggregate_mileage'
aggregate_mileage = {}
for brand in autos['brand'].value_counts().head(10).index:
    sel_row = autos[autos['brand'] == brand]
    mean_mileage = sel_row['odometer_km'].mean()
    aggregate_mileage[brand] = mean_mileage

aggregate_mileage    
Out[183]:
{'volkswagen': 144072.41759588287,
 'opel': 142254.41933318417,
 'bmw': 145766.59038901603,
 'mercedes_benz': 145777.13940127005,
 'audi': 146784.49453551913,
 'ford': 141067.43666169894,
 'renault': 141036.77221654748,
 'peugeot': 140380.75313807532,
 'fiat': 137420.88607594935,
 'seat': 140445.90643274854}

From this data, it looks like 'audi' had the highest average mileage while 'fiat' had the lowest. I think it would be smart to create a pandas Series of this data and then compare it with a pandas Series of the aggregate (mean) price.

In [184]:
#create a series for 'mean_mileage'
carsmm_series = pd.Series(aggregate_mileage)
carsmm_series
Out[184]:
volkswagen       144072.417596
opel             142254.419333
bmw              145766.590389
mercedes_benz    145777.139401
audi             146784.494536
ford             141067.436662
renault          141036.772217
peugeot          140380.753138
fiat             137420.886076
seat             140445.906433
dtype: float64
In [185]:
#create a series for 'mean_price'
carsmp_series = pd.Series(aggregate_price)
carsmp_series
Out[185]:
volkswagen       3205.765960
opel             2115.444395
bmw              4501.369692
mercedes_benz    4178.984276
audi             4362.219945
ford             2242.295827
renault          1728.598570
peugeot          2328.517155
fiat             1831.658228
seat             2713.698830
dtype: float64
In [186]:
#use the Series constructor with the 'aggregate_price' dictionary
bmp_series = pd.Series(aggregate_price)
print(bmp_series)
volkswagen       3205.765960
opel             2115.444395
bmw              4501.369692
mercedes_benz    4178.984276
audi             4362.219945
ford             2242.295827
renault          1728.598570
peugeot          2328.517155
fiat             1831.658228
seat             2713.698830
dtype: float64
In [187]:
df = pd.DataFrame(bmp_series, columns=['mean_price'])
df
Out[187]:
mean_price
volkswagen 3205.765960
opel 2115.444395
bmw 4501.369692
mercedes_benz 4178.984276
audi 4362.219945
ford 2242.295827
renault 1728.598570
peugeot 2328.517155
fiat 1831.658228
seat 2713.698830
In [188]:
#insert the values for mean mileage as another column
df.insert(1, 'mean_mileage', carsmm_series)
df
Out[188]:
mean_price mean_mileage
volkswagen 3205.765960 144072.417596
opel 2115.444395 142254.419333
bmw 4501.369692 145766.590389
mercedes_benz 4178.984276 145777.139401
audi 4362.219945 146784.494536
ford 2242.295827 141067.436662
renault 1728.598570 141036.772217
peugeot 2328.517155 140380.753138
fiat 1831.658228 137420.886076
seat 2713.698830 140445.906433
In [189]:
#sort the dataframe by mean price in descending order
df.sort_values('mean_price', ascending=False)
Out[189]:
mean_price mean_mileage
bmw 4501.369692 145766.590389
audi 4362.219945 146784.494536
mercedes_benz 4178.984276 145777.139401
volkswagen 3205.765960 144072.417596
seat 2713.698830 140445.906433
peugeot 2328.517155 140380.753138
ford 2242.295827 141067.436662
opel 2115.444395 142254.419333
fiat 1831.658228 137420.886076
renault 1728.598570 141036.772217
In [190]:
#calculate the correlation between the 2 columns
df['mean_price'].corr(df['mean_mileage'])
Out[190]:
0.8992195082897082

The correlation between mean price and mean mileage across these top 10 brands comes out to about 0.90, which is a strong positive correlation: the brands with higher average mileage (bmw, audi, mercedes_benz) also tend to have higher average prices. Keep in mind that this is computed over only 10 brand-level means, so it describes the brands, not individual cars.

Review:

-There were 3 uninformative columns within this dataset: 'seller', 'offer_type', and 'nr_of_pictures'. Each held essentially a single value, so they were dropped.
-The 1.5xIQR rule flagged 8,480 outlier rows in the 'odometer_km' column and a further 3,134 in the 'price_dollar' column.
-All the ads were crawled between March 5, 2016 and April 7, 2016, with the ads created in roughly the same timeframe.
-Across the top 10 brands, mean price and mean mileage are strongly correlated (about 0.90).