Mobile Market is a retailer leveraging machine learning to profit in the growing secondary market for mobile devices. Our goal is to analyze their supplied data and lend insights—such as factors that influence price—through a linear model.
Data from 2021 contains the following features:
# math and data
import numpy as np
import scipy.stats as stats
import pandas as pd
# visualization
import matplotlib.pyplot as plt
import seaborn as sns
# to split the data set
from sklearn.model_selection import train_test_split
# to build the linear model
import statsmodels
import statsmodels.api as sm
phone=pd.read_csv('dataset.csv')
phone_copy=phone.copy()
phone.shape
(3454, 15)
phone.head()
 | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Honor | Android | 14.50 | yes | no | 13.0 | 5.0 | 64.0 | 3.0 | 3020.0 | 146.0 | 2020 | 127 | 4.307572 | 4.715100 |
1 | Honor | Android | 17.30 | yes | yes | 13.0 | 16.0 | 128.0 | 8.0 | 4300.0 | 213.0 | 2020 | 325 | 5.162097 | 5.519018 |
2 | Honor | Android | 16.69 | yes | yes | 13.0 | 8.0 | 128.0 | 8.0 | 4200.0 | 213.0 | 2020 | 162 | 5.111084 | 5.884631 |
3 | Honor | Android | 25.50 | yes | yes | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 480.0 | 2020 | 345 | 5.135387 | 5.630961 |
4 | Honor | Android | 15.32 | yes | no | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 185.0 | 2020 | 293 | 4.389995 | 4.947837 |
This is a sizable data set: 15 columns with over 3400 entries! I'm curious whether OS will end up being a significant factor; I'd guess brand will matter more. Additionally, the normalized prices both seem surprisingly low, so I'll follow up with histograms and boxplots later.
phone.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3454 entries, 0 to 3453 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 brand_name 3454 non-null object 1 os 3454 non-null object 2 screen_size 3454 non-null float64 3 4g 3454 non-null object 4 5g 3454 non-null object 5 main_camera_mp 3275 non-null float64 6 selfie_camera_mp 3452 non-null float64 7 int_memory 3450 non-null float64 8 ram 3450 non-null float64 9 battery 3448 non-null float64 10 weight 3447 non-null float64 11 release_year 3454 non-null int64 12 days_used 3454 non-null int64 13 normalized_used_price 3454 non-null float64 14 normalized_new_price 3454 non-null float64 dtypes: float64(9), int64(2), object(4) memory usage: 404.9+ KB
A first inspection shows the data types all appear to be correct. From the head, we can see that 4g and 5g are recorded as 'yes' or 'no', so it follows that these columns would be objects.
phone.isnull().sum()
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 179 selfie_camera_mp 2 int_memory 4 ram 4 battery 6 weight 7 release_year 0 days_used 0 normalized_used_price 0 normalized_new_price 0 dtype: int64
We do have some null entries, most notably in the main_camera_mp column with 179. Even so, that is only around 5% of the data. The other columns with null entries are selfie_camera_mp, int_memory, ram, battery, and weight.
There is a possibility that brand_name or os have missing data as '?' or 'unknown', so let's check for that too.
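Beyond eyeballing value counts, placeholder strings can be scanned for directly with `isin`. A minimal sketch on a toy frame (the placeholder strings checked here are assumptions, not confirmed categories in the data):

```python
import pandas as pd

# toy frame standing in for phone[['brand_name', 'os']]
toy = pd.DataFrame({'brand_name': ['Honor', '?', 'Samsung'],
                    'os': ['Android', 'Android', 'unknown']})

# hypothetical placeholder strings to scan for
placeholders = ['?', 'unknown', 'Unknown', 'N/A']

# count suspect entries per column
counts = toy.isin(placeholders).sum()
print(counts)
```

The same call applied to the real columns would surface any such placeholders without reading through every category by hand.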
phone['brand_name'].value_counts()
Others 502 Samsung 341 Huawei 251 LG 201 Lenovo 171 ZTE 140 Xiaomi 132 Oppo 129 Asus 122 Alcatel 121 Micromax 117 Vivo 117 Honor 116 HTC 110 Nokia 106 Motorola 106 Sony 86 Meizu 62 Gionee 56 Acer 51 XOLO 49 Panasonic 47 Realme 41 Apple 39 Lava 36 Celkon 33 Spice 30 Karbonn 29 Coolpad 22 BlackBerry 22 Microsoft 22 OnePlus 22 Google 15 Infinix 10 Name: brand_name, dtype: int64
There's a catchall 'Others' group. This is effectively missing data and could distort our model, since presumably more than one brand is lumped into the category. We will revisit this during model tuning.
phone['os'].value_counts()
Android 3214 Others 137 Windows 67 iOS 36 Name: os, dtype: int64
There is also an 'Others' label here. Again, we may need to remove these to create a powerful model.
phone.duplicated().sum()
0
There are no duplicate rows.
phone.describe().T
 | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
screen_size | 3454.0 | 13.713115 | 3.805280 | 5.080000 | 12.700000 | 12.830000 | 15.340000 | 30.710000 |
main_camera_mp | 3275.0 | 9.460208 | 4.815461 | 0.080000 | 5.000000 | 8.000000 | 13.000000 | 48.000000 |
selfie_camera_mp | 3452.0 | 6.554229 | 6.970372 | 0.000000 | 2.000000 | 5.000000 | 8.000000 | 32.000000 |
int_memory | 3450.0 | 54.573099 | 84.972371 | 0.010000 | 16.000000 | 32.000000 | 64.000000 | 1024.000000 |
ram | 3450.0 | 4.036122 | 1.365105 | 0.020000 | 4.000000 | 4.000000 | 4.000000 | 12.000000 |
battery | 3448.0 | 3133.402697 | 1299.682844 | 500.000000 | 2100.000000 | 3000.000000 | 4000.000000 | 9720.000000 |
weight | 3447.0 | 182.751871 | 88.413228 | 69.000000 | 142.000000 | 160.000000 | 185.000000 | 855.000000 |
release_year | 3454.0 | 2015.965258 | 2.298455 | 2013.000000 | 2014.000000 | 2015.500000 | 2018.000000 | 2020.000000 |
days_used | 3454.0 | 674.869716 | 248.580166 | 91.000000 | 533.500000 | 690.500000 | 868.750000 | 1094.000000 |
normalized_used_price | 3454.0 | 4.364712 | 0.588914 | 1.536867 | 4.033931 | 4.405133 | 4.755700 | 6.619433 |
normalized_new_price | 3454.0 | 5.233107 | 0.683637 | 2.901422 | 4.790342 | 5.245892 | 5.673718 | 7.847841 |
Some highlights of the statistical summary: RAM clusters tightly at 4GB (the 25th through 75th percentiles are all 4GB), internal memory is heavily right-skewed (median 32GB against a maximum of 1024GB), and weight has extreme outliers (maximum 855g against a median of 160g).
phone.describe(include='object').T
 | count | unique | top | freq
---|---|---|---|---
brand_name | 3454 | 34 | Others | 502 |
os | 3454 | 4 | Android | 3214 |
4g | 3454 | 2 | yes | 2335 |
5g | 3454 | 2 | no | 3302 |
For the non-numeric columns, there are 34 brands, though the catchall 'Others' category is the most common. Android is the OS on the overwhelming majority of devices on Mobile Market, and 4G connectivity is far more common than 5G.
sns.set_theme()
We'll start with a first glance at each of our columns.
plt.figure(figsize=(8,5))
plt.title('Countplot of Phone Brands',fontsize=16)
sns.countplot(data=phone,
x='brand_name',
order=phone['brand_name'].value_counts().index)
plt.xticks(rotation=90)
plt.xlabel('Brand Name');
Aside from the 'Others' designation, Samsung is the most common brand in the Mobile Market system, followed by Huawei, LG, and Lenovo. OnePlus, Google, and Infinix are the least common brands.
plt.figure(figsize=(8,5))
plt.title('Countplot of Phone OS',fontsize=16)
sns.countplot(data=phone,
x='os',
order=phone['os'].value_counts().index)
plt.xlabel('Operating System');
phone['os'].value_counts(normalize=True)
Android 0.930515 Others 0.039664 Windows 0.019398 iOS 0.010423 Name: os, dtype: float64
Android is far and away the most popular phone OS in the Mobile Market system. It is installed on 93% of the phones.
plt.figure(figsize=(8,5))
plt.title('Histogram of Screen Size',fontsize=16)
sns.histplot(data=phone,x='screen_size',bins=12)
plt.xlabel('Screen Size (cm on the diagonal)');
The most common screen size is around 12.5cm on the diagonal (about 5in). The most popular screen sizes fall between 10cm and 18cm. There are some devices on Mobile Market with screens larger than 20cm; there are roughly as many of these as there are devices with screens smaller than 10cm.
plt.figure(figsize=(10,5))
# 4g
plt.subplot(1,2,1)
plt.title('4G Capable?')
sns.countplot(data=phone,x='4g',order=('yes','no'))
# 5g
plt.subplot(1,2,2)
plt.title('5G Capable?')
sns.countplot(data=phone,x='5g',order=('yes','no'))
plt.tight_layout()
While roughly two-thirds of phones on Mobile Market have 4G capability, very few have 5G.
# import numeric data type check from pandas
from pandas.api.types import is_numeric_dtype
def hist_box(df,col,*,kde=False):
'''Plot a histogram and boxplot
of numeric column in data frame.
Optional KDE boolean.'''
# include only numeric cols
if not is_numeric_dtype(df[col]):
raise TypeError('Column must be numeric.')
# figure setup
plt.figure(figsize=(12,5))
# histogram
plt.subplot(1,2,1)
plt.title('Histogram of '+str(col),fontsize=14)
sns.histplot(data=df,x=col,kde=kde)
# boxplot
plt.subplot(1,2,2)
plt.title('Boxplot of '+str(col),fontsize=14)
sns.boxplot(data=df,x=col)
# display figure
plt.tight_layout()
return
hist_box(phone,'main_camera_mp')
The distribution of main camera resolution is irregular, but 75% of the observations fall at or below 13MP.
hist_box(phone,'selfie_camera_mp')
The histogram of selfie camera resolution would lead one to believe these observations are more spread out than those for the main camera, but the boxplot shows that the data is still concentrated on the lower end. There are just more outliers in these observations that skew the data.
hist_box(phone,'int_memory')
Internal memory is heavily skewed by outliers, so let's visualize the data without the outliers.
plt.figure(figsize=(8,5))
plt.title('Internal Memory (Outliers Removed)',fontsize=16)
sns.boxplot(data=phone,x='int_memory',showfliers=False)
plt.xlabel('Memory (GB)');
Now we can see that almost all phones on Mobile Market have between 0 and 130GB of internal storage. The data still looks right-skewed, as the lower 50% is narrower than the upper 50%. However, memory typically comes in powers of 2, and the 25th, 50th, and 75th percentiles are 16GB, 32GB, and 64GB respectively. In other words, internal memory is not a continuous variable, so the skewed appearance of the data is likely misleading.
two_powers=[float(2**i) for i in range(12)]
mem_counts=phone['int_memory'].value_counts(normalize=True)
# sum the shares of capacities that are exact powers of 2
percent=mem_counts[mem_counts.index.isin(two_powers)].sum()
print(percent)
0.9875362318840579
In fact, a quick calculation shows that almost all (98.75%) internal memory comes as a power of 2.
hist_box(phone,'ram')
phone['ram'].value_counts(normalize=True)
4.00 0.815942 6.00 0.044638 8.00 0.037681 2.00 0.026087 0.25 0.024058 3.00 0.023478 1.00 0.009855 12.00 0.005217 0.02 0.005217 0.03 0.004928 0.50 0.002609 1.50 0.000290 Name: ram, dtype: float64
We find that 81.6% of phones on Mobile Market have 4GB of RAM. Other common options are 2GB, 6GB, and 8GB.
hist_box(phone,'battery',kde=True)
The battery data is concentrated between roughly 1500mAh and 5000mAh, with three peaks in between. Battery capacity is right skewed with many outliers.
plt.figure(figsize=(8,5))
plt.title('Battery Capacity (Outliers Removed)',fontsize=16)
sns.boxplot(data=phone,x='battery',showfliers=False)
plt.xlabel('Capacity (mAh)');
Without the outliers, battery capacity looks more evenly distributed. The median capacity is around 3000mAh, with the middle 50% concentrated between 2000mAh and 4000mAh.
hist_box(phone,'weight')
plt.figure(figsize=(8,5))
plt.title('Device Weight (Outliers Removed)',fontsize=16)
sns.boxplot(data=phone,x='weight',showfliers=False)
plt.xlabel('Weight (g)');
While device weight is generally concentrated between 75g and 250g, there are too many outliers to throw them all away. In particular, there's a real possibility these outliers will have an effect on our model.
phone['release_year'].unique()
array([2020, 2019, 2013, 2014, 2016, 2018, 2015, 2017])
plt.figure(figsize=(8,5))
plt.title('Device Release Year',fontsize=16)
sns.countplot(data=phone,x='release_year')
plt.xlabel('Year');
The most common release year is 2014, followed by 2013 and 2015. There's a dip in popularity after 2015, with 2019 being the exception.
hist_box(phone,'days_used')
Days used is fairly evenly distributed, with a broad peak around 600 days.
hist_box(phone,'normalized_new_price',kde=True)
Normalized new price is approximately normally distributed with long tails on both sides. The peak is around 5€. The boxplot shows that there are many outliers.
hist_box(phone,'normalized_used_price',kde=True)
Used price is distributed smoothly between 2€ and 6€ with a slight left skew. The peak is around 4.25€, and the boxplot shows many outliers.
# comparison of new price and used price
plt.figure(figsize=(8,5))
plt.title('New Price Versus Used Price',fontsize=16)
sns.histplot(data=phone,x='normalized_new_price',
color='darkslateblue',label='New',kde=True)
sns.histplot(data=phone,x='normalized_used_price',
color='orange',label='Used',kde=True)
plt.legend()
plt.xlabel('Price (€)');
Here is a direct comparison of new and used prices. We see that new prices generally trend higher.
Next, let's look at how RAM varies by brand.
plt.figure(figsize=(20,30))
rows=1+len(phone['brand_name'].unique())//4
# subplot generating loop
for idx, brand in enumerate(phone['brand_name'].sort_values().unique()):
plt.subplot(rows,4,idx+1)
plt.title(brand)
sns.histplot(data=phone[phone['brand_name']==brand]['ram'])
plt.tight_layout()
What's clear from this breakdown is that every phone brand offers a product with 4GB of RAM on Mobile Market, and it is unmistakably the most common device configuration. For Microsoft, Panasonic, and XOLO, the data set only includes devices with 4GB of RAM. Celkon and Infinix are the only brands for which 4GB of RAM is not the most common configuration. Like most other brands, the most common configuration for OnePlus phones is 4GB of RAM, but unlike the others, OnePlus phones have a very high proportion of offerings with 8GB of RAM.
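The same breakdown can be summarized numerically with a crosstab of brand against RAM. A sketch on a toy frame (the brand/RAM rows below are illustrative, not taken from the data set):

```python
import pandas as pd

# toy rows standing in for phone[['brand_name', 'ram']]
toy = pd.DataFrame({'brand_name': ['Honor', 'Honor', 'OnePlus', 'OnePlus', 'OnePlus'],
                    'ram': [4.0, 4.0, 4.0, 8.0, 8.0]})

# count of each RAM configuration per brand
table = pd.crosstab(toy['brand_name'], toy['ram'])
print(table)
```

Applied to the full data, one table replaces the 34-panel histogram grid when exact counts are needed.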
Next we will look at the weight of devices with high-capacity batteries.
# collect phones with battery capacity >4500mAh
heavy=phone[phone['battery']>4500]
plt.figure(figsize=(8,5))
plt.title('High-Capacity Batteries (>4500 mAh) and Weight',fontsize=16)
sns.regplot(data=heavy,x='battery',y='weight')
plt.xlabel('Capacity (mAh)')
plt.ylabel('Weight (g)');
There is a positive trend between battery capacity and phone weight. There are, however, a few observations far from the regression line that have high capacity batteries but lower weight. Such devices are generally more valuable since they last longer on a single charge but do not weigh much.
We will now examine several aspects across brands, the first being the distribution of large screen devices.
# number of phones with screen size >6in = 15.24cm
phone[phone['screen_size']>15.24].shape[0]
1099
plt.figure(figsize=(8,5))
plt.title('Brands with Large Screen (>6in diagonal)',fontsize=16)
sns.countplot(data=phone[phone['screen_size']>15.24],
x='brand_name',
order=phone[phone['screen_size']>15.24]['brand_name'].value_counts().index)
plt.xticks(rotation=90)
plt.xlabel('Brand Name');
There are 1099 devices in this data set with a screen size greater than 6in diagonally. Huawei and Samsung offer the greatest number of devices with large screens. Panasonic, Spice, and Microsoft offer very few large screen devices through Mobile Market.
plt.figure(figsize=(8,5))
plt.title('Brands with High-Resolution Selfie Camera (>8MP)',fontsize=16)
sns.countplot(data=phone[phone['selfie_camera_mp']>8],
x='brand_name',
order=phone[phone['selfie_camera_mp']>8]['brand_name'].value_counts().index)
plt.xticks(rotation=90)
plt.xlabel('Brand Name');
Huawei, Vivo, and Oppo offer the greatest number of devices with a high resolution selfie camera, while Micromax, Panasonic, and Acer offer the fewest on Mobile Market.
# logarithmic regression of memory vs used price
plt.figure(figsize=(8,5))
plt.title('Memory Capacity and Used Price',fontsize=16)
sns.regplot(data=phone,
x='int_memory',
y='normalized_used_price',
logx=True,
x_jitter=10,
marker='+',
scatter_kws={'color':'blue'},
line_kws={'color':'red'},
ci=False)
plt.xlabel('Memory (GB)')
plt.ylabel('Used Price (€)');
There seems to be a positive trend between device memory capacity and normalized used price. In particular, price appears to be proportional to the logarithm of memory capacity.
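The claimed logarithmic relationship can be sanity-checked by fitting price against log2(memory). A sketch with synthetic, noise-free numbers (the memory/price pairs below are made up for illustration):

```python
import numpy as np

# hypothetical (memory GB, used price) pairs following a log trend
mem = np.array([8, 16, 32, 64, 128, 256], dtype=float)
price = 3.5 + 0.25 * np.log2(mem)  # synthetic data for illustration

# a linear fit in log2(memory) recovers the slope per doubling of capacity
slope, intercept = np.polyfit(np.log2(mem), price, 1)
print(round(slope, 3), round(intercept, 3))
```

On the real columns, a clearly positive slope with a decent fit would support the logarithmic reading of the regplot.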
plt.figure(figsize=(12,8))
plt.title('Correlation Heatmap',fontsize=16)
sns.heatmap(phone.corr(),annot=True,cmap='Spectral');
The correlation matrix highlights several previously unseen relationships. There are some clear ones, like the strong negative correlation between release year and days_used. But we have detected moderately positive correlations between normalized_used_price and the following: screen_size, main_camera_mp, selfie_camera_mp, and battery. We will visualize these, along with the strong positive relationship between battery capacity and weight.
plt.figure(figsize=(16,16))
plt.suptitle('Used Price versus Various Features', fontsize=16)
# screen_size
plt.subplot(2,2,1)
sns.scatterplot(data=phone,x='screen_size',y='normalized_used_price')
plt.xlabel('Screen Size (cm diagonal)')
plt.ylabel('Price (€)')
# main_camera_mp
plt.subplot(2,2,2)
sns.scatterplot(data=phone,x='main_camera_mp',y='normalized_used_price')
plt.xlabel('Main Camera Resolution (MP)')
plt.ylabel('Price (€)')
# selfie_camera_mp
plt.subplot(2,2,3)
sns.scatterplot(data=phone,x='selfie_camera_mp',y='normalized_used_price')
plt.xlabel('Selfie Camera Resolution (MP)')
plt.ylabel('Price (€)')
# battery
plt.subplot(2,2,4)
sns.scatterplot(data=phone,x='battery',y='normalized_used_price')
plt.xlabel('Battery Capacity (mAh)')
plt.ylabel('Price (€)')
plt.show()
The four scatterplots all illustrate positive correlation but have decidedly different distributions. Note too that some of the relationships have a bit of a curve to the trend, like a logarithmic relationship.
Next, let's look at battery capacity, weight, and screen size of devices on Mobile Market.
plt.figure(figsize=(8,5))
plt.title('Battery and Weight',fontsize=16)
sns.scatterplot(data=phone,x='battery',y='weight')
plt.xlabel('Capacity (mAh)')
plt.ylabel('Weight (g)');
plt.figure(figsize=(16,5))
plt.suptitle('Screen Size versus Battery and Weight',fontsize=16)
# battery
plt.subplot(1,2,1)
sns.scatterplot(data=phone,x='screen_size',y='battery')
plt.xlabel('Screen Size (cm diagonal)')
plt.ylabel('Capacity (mAh)')
# weight
plt.subplot(1,2,2)
sns.scatterplot(data=phone,x='screen_size',y='weight')
plt.xlabel('Screen Size (cm diagonal)')
plt.ylabel('Weight (g)');
While we can observe a positive trend in the first plot, the correlations are more apparent in the latter two (relating screen size to battery and weight).
Consider next days used and new price.
plt.figure(figsize=(8,5))
plt.title('Days Used and New Price')
sns.scatterplot(data=phone,x='days_used',y='normalized_new_price')
plt.xlabel('Days')
plt.ylabel('Price (€)');
plt.figure(figsize=(8,5))
plt.title('Days Used and New Price')
sns.scatterplot(data=phone,x='days_used',y='normalized_new_price',hue='release_year')
plt.xlabel('Days')
plt.ylabel('Price (€)')
plt.legend(title='');
On the other hand, there is no apparent trend in the relationship between days_used and normalized_new_price; the scatterplot is just noise. But by coloring points with the release year, we find that phones released earlier have generally been used for longer. Not a big surprise!
plt.figure(figsize=(8,5))
plt.title('Release Year and Days Used',fontsize=16)
sns.boxplot(data=phone,x='release_year',y='days_used')
plt.xlabel('Year')
plt.ylabel('Days Used');
Breaking the relationship between release_year and days_used down further, we find that the spread of days_used is similar for release years up to 2016, after which days_used drops steadily for more recently released phones.
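The trend in the boxplot can also be put into numbers with a groupby. A sketch on toy data (the year/days pairs below are illustrative):

```python
import pandas as pd

# toy rows standing in for phone[['release_year', 'days_used']]
toy = pd.DataFrame({'release_year': [2014, 2014, 2018, 2018, 2020, 2020],
                    'days_used':    [900,  800,  500,  450,  150,  120]})

# median days used per release year: newer phones have been used less
medians = toy.groupby('release_year')['days_used'].median()
print(medians)
```

The same one-liner on the real frame would give the exact medians behind each box.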
sns.jointplot(data=phone,
x='normalized_new_price',
y='normalized_used_price',
hue='os',
kind='kde');
Lastly, looking at the density of new price versus used price, we can identify specific price brackets broken down by operating system. Devices running Android have the greatest variability in price. On the other hand, iOS appears on higher-priced devices, namely Apple products. The least expensive devices, new or used, run an operating system other than Android, Windows, or iOS.
phone.isnull().sum()
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 179 selfie_camera_mp 2 int_memory 4 ram 4 battery 6 weight 7 release_year 0 days_used 0 normalized_used_price 0 normalized_new_price 0 dtype: int64
The feature with the most missing data is main_camera_mp. We will impute the null values with the median camera resolution for the brand of that row in the dataframe.
# collect medians for main_camera_mp by brand
camera_medians={}
for brand in phone['brand_name'].unique():
    med=phone.loc[phone['brand_name']==brand,'main_camera_mp'].median()
    # brands whose resolutions are all null fall back to 0
    camera_medians[brand]=med if not np.isnan(med) else 0
# impute main_camera_mp with brand median
for idx,val in phone['main_camera_mp'].items():
    if np.isnan(val):
        phone.loc[idx,'main_camera_mp']=camera_medians[phone.loc[idx,'brand_name']]
/usr/local/lib/python3.7/dist-packages/numpy/lib/nanfunctions.py:1117: RuntimeWarning: Mean of empty slice return np.nanmean(a, axis, out=out, keepdims=keepdims)
phone.isnull().sum()
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 0 selfie_camera_mp 2 int_memory 4 ram 4 battery 6 weight 7 release_year 0 days_used 0 normalized_used_price 0 normalized_new_price 0 dtype: int64
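For reference, the brand-wise median imputation can also be written in one pass with pandas groupby/transform. A sketch on a toy frame (the brand labels are made up; a brand with no data at all yields NaN and falls back to 0, matching the loop's behavior):

```python
import numpy as np
import pandas as pd

# toy stand-in for phone[['brand_name', 'main_camera_mp']]
toy = pd.DataFrame({'brand_name': ['A', 'A', 'A', 'B', 'B'],
                    'main_camera_mp': [8.0, 12.0, np.nan, np.nan, np.nan]})

# per-row brand median; all-null brands produce NaN, replaced by 0
brand_median = toy.groupby('brand_name')['main_camera_mp'].transform('median')
toy['main_camera_mp'] = toy['main_camera_mp'].fillna(brand_median.fillna(0))
print(toy['main_camera_mp'].tolist())  # [8.0, 12.0, 10.0, 0.0, 0.0]
```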
We will drop rows with the last few null entries, as we will not be losing much data.
# drop rows with remain few null entries
phone.dropna(axis=0,inplace=True)
phone.isnull().sum()
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 0 selfie_camera_mp 0 int_memory 0 ram 0 battery 0 weight 0 release_year 0 days_used 0 normalized_used_price 0 normalized_new_price 0 dtype: int64
We can confirm that there are no more null entries. Next, we will also drop any row where both the brand name and the OS have the value 'Others', since this combination is too vague to contribute positively to our model.
print('Number of rows before dropping "Other/Other":',
phone.shape[0])
print('Number of "Other/Other" rows:',
phone[(phone['brand_name']=='Others')&(phone['os']=='Others')].shape[0])
Number of rows before dropping "Other/Other": 3432 Number of "Other/Other" rows: 28
# drop rows with brand_name=Others AND os=Others
other_idx=phone[(phone['brand_name']=='Others')&(phone['os']=='Others')].index
phone.drop(other_idx,axis=0,inplace=True)
print('Number of rows after dropping "Other/Other":',phone.shape[0])
Number of rows after dropping "Other/Other": 3404
We have dropped all 28 Other/Other rows.
phone.reset_index(inplace=True,drop=True)
After five experiments building the linear regression, I have found that the precise release year is not as important as whether the device is older or newer. I can confirm that this change will significantly reduce multicollinearity with negligible effect on the final R-squared. I used the boxplot from the EDA, recalled below, to define the threshold for an old phone.
plt.figure(figsize=(8,5))
plt.title('Release Year and Days Used',fontsize=16)
sns.boxplot(data=phone,x='release_year',y='days_used')
plt.xlabel('Year')
plt.ylabel('Days Used');
# classify rows: releases through 2016 count as 'old'
old_phones=[2013,2014,2015,2016]
# generate feature
phone['age']=pd.Categorical(np.where(phone['release_year'].isin(old_phones),'old','new'))
phone=phone.drop(['release_year'],axis=1)
Since over 80% of devices on Mobile Market have 4GB of RAM, we will create a new categorical column. It will have a 1 if the device has 4GB of RAM and a 0 otherwise. This is one way of treating outliers in RAM.
# generate feature: 1 if the device has exactly 4GB of RAM, else 0
phone['ram4']=pd.Categorical((phone['ram']==4.0).astype(int))
We will also add a column for the ratio of battery capacity to weight. We can confidently design this feature because battery and weight are strongly correlated. This will have the effect of retaining data about battery and weight while reducing dimension.
phone['battery_by_weight']=phone['battery']/phone['weight']
I feel we have adequately addressed outliers. Those visible on the boxplots for weight or price are necessary aspects of the selection process. Our model would be incapable of predicting price for the breadth of devices on Mobile Market without these data points.
hist_box(phone,'screen_size')
Screen size remains highly variable after our data transformations.
plt.figure(figsize=(10,5))
# 4g
plt.subplot(1,2,1)
plt.title('4G Capable?')
sns.countplot(data=phone,x='4g',order=('yes','no'))
# 5g
plt.subplot(1,2,2)
plt.title('5G Capable?')
sns.countplot(data=phone,x='5g',order=('yes','no'))
plt.tight_layout()
The proportions of 4G-capable and 5G-capable phones are essentially unchanged from before.
plt.figure(figsize=(8,5))
plt.title('Device Camera Resolutions',fontsize=16)
# main camera
sns.histplot(phone,x='main_camera_mp',
bins=12,
color='darkslateblue',
label='main')
#selfie camera
sns.histplot(phone,x='selfie_camera_mp',
bins=12,
color='orange',
label='selfie')
plt.xlabel('Resolution (MP)')
plt.legend();
The selfie camera generally has a lower resolution than the main camera. This makes sense, as there is more room in the device frame for a main camera sensor than a selfie camera, since the latter is crowded by the display panel.
plt.figure(figsize=(8,5))
plt.title('Internal Memory',fontsize=16)
sns.histplot(phone,x='int_memory',bins=12)
plt.xlabel('Memory Capacity (GB)');
Since memory comes in powers of 2, and less memory is generally cheaper, it tracks that most of the devices have less memory, but that the outliers are far greater than the bulk of the data. The capacity is literally increasing exponentially!
plt.figure(figsize=(8,5))
plt.title('Internal Memory (No Outliers)',fontsize=16)
sns.boxplot(data=phone,x='int_memory',showfliers=False)
plt.xlabel('Memory Capacity (GB)');
To understand the spread of the majority of our data, consider the boxplot above. Outliers are hidden in this plot, and we see that the middle 50% of the data is between 16GB and 64GB, with a median of 32GB (all powers of 2).
plt.figure(figsize=(8,5))
plt.title('Screen Size and RAM',fontsize=16)
sns.regplot(data=phone,x='screen_size',y='ram')
plt.xlabel('Screen Size (cm diagonal)')
plt.ylabel('RAM (GB)');
Generally a larger screen, in particular more pixels, requires more RAM for the device to operate smoothly. The trend is visible here, albeit only mildly.
plt.figure(figsize=(16,5))
plt.suptitle('Radio Capabilities and Device Age',fontsize=16)
plt.subplot(1,2,1)
sns.countplot(data=phone,x='age',hue='4g',hue_order=('yes','no'))
plt.subplot(1,2,2)
sns.countplot(data=phone,x='age',hue='5g',hue_order=('yes','no'));
Here we see that new devices are far more likely to have 4G radios; almost no new devices lack one. Conversely, about half of old devices lack 4G connectivity.
There are no old devices available on Mobile Market with 5G connectivity, and most of the new devices don't have it either.
Now we can start preparing our data for modeling. In particular, let's use one hot encoding for our categorical variables.
phone_dummies=pd.get_dummies(data=phone,drop_first=True)
# dataframe of independent variables plus constant
X=phone_dummies.drop(['normalized_used_price'],axis=1)
X=sm.add_constant(X)
# response variable series
y=phone_dummies['normalized_used_price']
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/tsatools.py:142: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only x = pd.concat(x[::order], 1)
# split data
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.3,random_state=1)
# train first model
model=sm.OLS(y_train,X_train)
results=model.fit()
We start by defining functions that will help us assess model performance. The first function generates a table of VIF stats and tests R-squared values for different dropped columns. The second incorporates the first into a data-rich model tuning interface. We will use the standard VIF cutoff of 5, but the function allows different thresholds.
# import variance inflation factor function
from statsmodels.stats.outliers_influence import variance_inflation_factor
def high_vif_stats(y_data,X_data,vif_cutoff=5):
'''Summarize VIF and test R-squared
for VIF > cutoff.'''
# generate vif table
vif=pd.Series([variance_inflation_factor(X_data.values,i) for i in range(X_data.shape[1])],
index=X_data.columns).sort_values(ascending=False)
print('Variance Inflation Factors')
print(vif)
# collect high vif columns (omit constant col)
high_vif_col=[]
for idx, val in vif.drop(['const'],axis=0).items():
if val>vif_cutoff:
high_vif_col.append(idx)
if not high_vif_col:
print('\nThere are no features with VIF greater than',vif_cutoff)
return
# create dataframe with original R-squared
results=sm.OLS(y_data,X_data).fit()
rsq_comp=pd.DataFrame({'R-squared':np.round(results.rsquared,3),
'Adj. R-squared':np.round(results.rsquared_adj,3)},
index=['Before'])
# tabulate R-square and adj. R-squared without high vif cols
for col in high_vif_col:
res=sm.OLS(y_data,X_data.drop([col],axis=1)).fit()
df=pd.DataFrame({'R-squared':np.round(res.rsquared,3),
'Adj. R-squared':np.round(res.rsquared_adj,3)},
index=['Without '+col])
rsq_comp=pd.concat([rsq_comp,df],axis=0)
# print table of R-squared values
title='R-squared for data with dropped high-VIF columns'
print('='*len(title))
print(title)
print(rsq_comp)
return
def model_tuning(y_data,X_data,*,threshold=5,hide_vif=False):
'''Presents summary of OLS model
and multicollinearity check given
VIF threshold (default=5).'''
# generate model
model=sm.OLS(y_data,X_data).fit()
# print model statistics and high VIF table
print(model.summary())
if not hide_vif:
print('\n\n')
high_vif_stats(y_data,X_data,vif_cutoff=threshold)
return model
# print series without truncating
pd.set_option('display.max_rows',None)
model_tuning(y_train,X_train)
OLS Regression Results ================================================================================= Dep. Variable: normalized_used_price R-squared: 0.842 Model: OLS Adj. R-squared: 0.838 Method: Least Squares F-statistic: 247.8 Date: Sat, 08 Oct 2022 Prob (F-statistic): 0.00 Time: 04:45:01 Log-Likelihood: 131.16 No. Observations: 2382 AIC: -160.3 Df Residuals: 2331 BIC: 134.2 Df Model: 50 Covariance Type: nonrobust ========================================================================================= coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------------- const 1.3025 0.079 16.399 0.000 1.147 1.458 screen_size 0.0235 0.003 6.764 0.000 0.017 0.030 main_camera_mp 0.0214 0.001 14.830 0.000 0.019 0.024 selfie_camera_mp 0.0146 0.001 12.402 0.000 0.012 0.017 int_memory 0.0002 6.09e-05 2.571 0.010 3.71e-05 0.000 ram 0.0144 0.005 2.738 0.006 0.004 0.025 battery -2.531e-05 1.29e-05 -1.967 0.049 -5.06e-05 -7.64e-08 weight 0.0012 0.000 5.808 0.000 0.001 0.002 days_used -4.071e-05 2.98e-05 -1.365 0.172 -9.92e-05 1.78e-05 normalized_new_price 0.4132 0.012 34.505 0.000 0.390 0.437 battery_by_weight 0.0044 0.003 1.572 0.116 -0.001 0.010 brand_name_Alcatel -0.0150 0.046 -0.323 0.747 -0.106 0.076 brand_name_Apple 0.0483 0.148 0.326 0.745 -0.242 0.339 brand_name_Asus 0.0144 0.047 0.309 0.758 -0.077 0.106 brand_name_BlackBerry 0.0652 0.072 0.911 0.363 -0.075 0.206 brand_name_Celkon -0.2933 0.069 -4.252 0.000 -0.429 -0.158 brand_name_Coolpad 0.0045 0.072 0.062 0.951 -0.137 0.146 brand_name_Gionee -0.0109 0.056 -0.194 0.846 -0.121 0.099 brand_name_Google 0.0453 0.092 0.493 0.622 -0.135 0.226 brand_name_HTC -0.0079 0.048 -0.166 0.868 -0.101 0.086 brand_name_Honor 0.0124 0.048 0.258 0.797 -0.082 0.107 brand_name_Huawei -0.0272 0.043 -0.628 0.530 -0.112 0.058 brand_name_Infinix 0.3247 0.090 3.617 0.000 0.149 0.501 brand_name_Karbonn 0.0134 0.066 0.203 0.839 -0.116 0.143 brand_name_LG -0.0457 0.044 -1.035 0.301 
-0.132 0.041 brand_name_Lava -0.0053 0.060 -0.089 0.929 -0.122 0.112 brand_name_Lenovo 0.0124 0.044 0.279 0.780 -0.075 0.100 brand_name_Meizu -0.0309 0.055 -0.566 0.572 -0.138 0.076 brand_name_Micromax -0.0294 0.047 -0.631 0.528 -0.121 0.062 brand_name_Microsoft 0.0589 0.086 0.684 0.494 -0.110 0.228 brand_name_Motorola -0.0813 0.048 -1.680 0.093 -0.176 0.014 brand_name_Nokia 0.0603 0.050 1.195 0.232 -0.039 0.159 brand_name_OnePlus 0.0134 0.067 0.202 0.840 -0.117 0.144 brand_name_Oppo 0.0029 0.047 0.062 0.951 -0.089 0.095 brand_name_Others -0.0174 0.041 -0.426 0.670 -0.098 0.063 brand_name_Panasonic -0.0103 0.055 -0.187 0.852 -0.118 0.098 brand_name_Realme -0.0009 0.060 -0.015 0.988 -0.119 0.117 brand_name_Samsung -0.0381 0.042 -0.903 0.367 -0.121 0.045 brand_name_Sony -0.0764 0.050 -1.529 0.126 -0.174 0.022 brand_name_Spice -0.0445 0.061 -0.729 0.466 -0.164 0.075 brand_name_Vivo -0.0334 0.048 -0.694 0.488 -0.128 0.061 brand_name_XOLO -0.0700 0.061 -1.155 0.248 -0.189 0.049 brand_name_Xiaomi 0.0645 0.047 1.367 0.172 -0.028 0.157 brand_name_ZTE -0.0257 0.046 -0.565 0.572 -0.115 0.064 os_Others -0.0492 0.037 -1.312 0.190 -0.123 0.024 os_Windows -0.0310 0.044 -0.696 0.487 -0.118 0.056 os_iOS -0.0876 0.147 -0.595 0.552 -0.376 0.201 4g_yes 0.0681 0.015 4.500 0.000 0.038 0.098 5g_yes 0.0222 0.034 0.658 0.511 -0.044 0.088 age_old -0.0468 0.017 -2.832 0.005 -0.079 -0.014 ram4_1 0.0276 0.017 1.611 0.107 -0.006 0.061 ============================================================================== Omnibus: 188.856 Durbin-Watson: 1.943 Prob(Omnibus): 0.000 Jarque-Bera (JB): 369.169 Skew: -0.534 Prob(JB): 6.85e-81 Kurtosis: 4.605 Cond. No. 1.73e+05 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.73e+05. This might indicate that there are strong multicollinearity or other numerical problems. 
Variance Inflation Factors const 280.388122 weight 14.647755 brand_name_Apple 14.121957 os_iOS 12.770393 battery 11.965925 brand_name_Others 8.850750 screen_size 7.589599 battery_by_weight 7.447852 brand_name_Samsung 6.910259 brand_name_Huawei 5.764543 brand_name_LG 4.726981 brand_name_Lenovo 4.173822 brand_name_ZTE 3.957683 brand_name_Oppo 3.672643 brand_name_Xiaomi 3.595153 brand_name_Alcatel 3.360767 brand_name_Vivo 3.343906 brand_name_Micromax 3.322402 brand_name_Asus 3.277016 brand_name_Nokia 3.225654 brand_name_Motorola 3.173537 brand_name_Honor 3.159061 brand_name_HTC 3.116526 selfie_camera_mp 2.886479 age_old 2.875527 normalized_new_price 2.855934 brand_name_Sony 2.723139 days_used 2.453434 ram 2.303676 main_camera_mp 2.201178 4g_yes 2.185732 brand_name_Meizu 2.137792 5g_yes 2.079510 brand_name_Panasonic 2.012944 brand_name_Gionee 1.971333 brand_name_Realme 1.937008 os_Others 1.904531 ram4_1 1.899102 brand_name_Microsoft 1.788582 brand_name_Lava 1.776196 brand_name_Celkon 1.760827 brand_name_Spice 1.719549 brand_name_XOLO 1.697516 brand_name_OnePlus 1.637108 brand_name_BlackBerry 1.615950 os_Windows 1.594822 brand_name_Karbonn 1.542344 brand_name_Coolpad 1.440145 brand_name_Infinix 1.347722 int_memory 1.321038 brand_name_Google 1.257138 dtype: float64 ================================================ R-squared for data with dropped high-VIF columns R-squared Adj. R-squared Before 0.842 0.838 Without weight 0.839 0.836 Without brand_name_Apple 0.842 0.838 Without os_iOS 0.842 0.838 Without battery 0.841 0.838 Without brand_name_Others 0.842 0.838 Without screen_size 0.839 0.835 Without battery_by_weight 0.842 0.838 Without brand_name_Samsung 0.842 0.838 Without brand_name_Huawei 0.842 0.838
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f75e9713110>
We have seen that weight is highly positively correlated with both screen size and battery capacity. Since its VIF is the highest, we will drop it first.
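The VIF tables above are produced inside model_tuning; for reference, here is a minimal sketch of how such a table can be computed with statsmodels' variance_inflation_factor (the helper name and toy data are illustrative, not the notebook's real X_train):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_series(X: pd.DataFrame) -> pd.Series:
    """One VIF per column, sorted descending (illustrative helper)."""
    vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return pd.Series(vifs, index=X.columns).sort_values(ascending=False)

# toy data: z is nearly a linear combination of x and y, so its VIF is high
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.normal(size=200)
z = 2 * x - y + rng.normal(scale=0.1, size=200)
X = pd.DataFrame({"const": 1.0, "x": x, "y": y, "z": z})
print(vif_series(X))
```

Dropping then proceeds highest-VIF-first, one column at a time, refitting after each drop to confirm R-squared is preserved.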
# drop weight column
X_train=X_train.drop(['weight'],axis=1)
# keep track of dropped features
dropped_cols=['weight']
model_tuning(y_train,X_train)
OLS Regression Results ================================================================================= Dep. Variable: normalized_used_price R-squared: 0.839 Model: OLS Adj. R-squared: 0.836 Method: Least Squares F-statistic: 248.7 Date: Sat, 08 Oct 2022 Prob (F-statistic): 0.00 Time: 04:45:02 Log-Likelihood: 114.04 No. Observations: 2382 AIC: -128.1 Df Residuals: 2332 BIC: 160.7 Df Model: 49 Covariance Type: nonrobust ========================================================================================= coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------------- const 1.4061 0.078 18.041 0.000 1.253 1.559 screen_size 0.0352 0.003 12.329 0.000 0.030 0.041 main_camera_mp 0.0210 0.001 14.470 0.000 0.018 0.024 selfie_camera_mp 0.0143 0.001 12.103 0.000 0.012 0.017 int_memory 0.0001 6.12e-05 2.316 0.021 2.17e-05 0.000 ram 0.0129 0.005 2.436 0.015 0.003 0.023 battery 2.934e-05 8.84e-06 3.319 0.001 1.2e-05 4.67e-05 days_used -2.261e-05 2.99e-05 -0.757 0.449 -8.12e-05 3.6e-05 normalized_new_price 0.4142 0.012 34.352 0.000 0.391 0.438 battery_by_weight -0.0078 0.002 -4.292 0.000 -0.011 -0.004 brand_name_Alcatel -0.0260 0.047 -0.557 0.577 -0.117 0.065 brand_name_Apple -0.0411 0.148 -0.277 0.782 -0.332 0.250 brand_name_Asus 0.0115 0.047 0.245 0.807 -0.080 0.103 brand_name_BlackBerry 0.0550 0.072 0.762 0.446 -0.086 0.196 brand_name_Celkon -0.3110 0.069 -4.482 0.000 -0.447 -0.175 brand_name_Coolpad -0.0086 0.072 -0.119 0.905 -0.151 0.133 brand_name_Gionee -0.0139 0.057 -0.245 0.806 -0.125 0.097 brand_name_Google 0.0362 0.093 0.391 0.696 -0.145 0.218 brand_name_HTC -0.0228 0.048 -0.476 0.634 -0.117 0.071 brand_name_Honor -0.0031 0.049 -0.064 0.949 -0.098 0.092 brand_name_Huawei -0.0440 0.044 -1.012 0.312 -0.129 0.041 brand_name_Infinix 0.3066 0.090 3.394 0.001 0.129 0.484 brand_name_Karbonn 0.0020 0.067 0.030 0.976 -0.129 0.133 brand_name_LG -0.0535 0.044 -1.204 0.229 -0.141 0.034 brand_name_Lava -0.0198 0.060 
-0.329 0.742 -0.138 0.098 brand_name_Lenovo 0.0013 0.045 0.029 0.977 -0.086 0.089 brand_name_Meizu -0.0375 0.055 -0.681 0.496 -0.145 0.070 brand_name_Micromax -0.0365 0.047 -0.778 0.437 -0.129 0.056 brand_name_Microsoft 0.0353 0.087 0.407 0.684 -0.135 0.205 brand_name_Motorola -0.0892 0.049 -1.830 0.067 -0.185 0.006 brand_name_Nokia 0.0494 0.051 0.972 0.331 -0.050 0.149 brand_name_OnePlus -0.0042 0.067 -0.062 0.950 -0.135 0.127 brand_name_Oppo -0.0070 0.047 -0.147 0.883 -0.100 0.086 brand_name_Others -0.0204 0.041 -0.495 0.620 -0.101 0.060 brand_name_Panasonic -0.0245 0.055 -0.442 0.659 -0.133 0.084 brand_name_Realme -0.0139 0.061 -0.230 0.818 -0.133 0.105 brand_name_Samsung -0.0507 0.042 -1.193 0.233 -0.134 0.033 brand_name_Sony -0.0880 0.050 -1.751 0.080 -0.187 0.011 brand_name_Spice -0.0466 0.061 -0.758 0.449 -0.167 0.074 brand_name_Vivo -0.0468 0.048 -0.966 0.334 -0.142 0.048 brand_name_XOLO -0.0798 0.061 -1.307 0.192 -0.199 0.040 brand_name_Xiaomi 0.0524 0.047 1.104 0.270 -0.041 0.145 brand_name_ZTE -0.0376 0.046 -0.821 0.412 -0.128 0.052 os_Others -0.0119 0.037 -0.321 0.748 -0.085 0.061 os_Windows -0.0262 0.045 -0.584 0.559 -0.114 0.062 os_iOS -0.0141 0.148 -0.095 0.924 -0.304 0.276 4g_yes 0.0687 0.015 4.505 0.000 0.039 0.099 5g_yes 0.0184 0.034 0.541 0.589 -0.048 0.085 age_old -0.0430 0.017 -2.587 0.010 -0.076 -0.010 ram4_1 0.0270 0.017 1.563 0.118 -0.007 0.061 ============================================================================== Omnibus: 186.892 Durbin-Watson: 1.942 Prob(Omnibus): 0.000 Jarque-Bera (JB): 359.880 Skew: -0.534 Prob(JB): 7.13e-79 Kurtosis: 4.576 Cond. No. 1.72e+05 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.72e+05. This might indicate that there are strong multicollinearity or other numerical problems. 
Variance Inflation Factors const 266.254541 brand_name_Apple 13.969785 os_iOS 12.675829 brand_name_Others 8.849361 brand_name_Samsung 6.892157 brand_name_Huawei 5.738640 battery 5.569033 screen_size 5.048455 brand_name_LG 4.722569 brand_name_Lenovo 4.166099 brand_name_ZTE 3.949684 brand_name_Oppo 3.667826 brand_name_Xiaomi 3.588137 brand_name_Alcatel 3.355143 brand_name_Vivo 3.336237 brand_name_Micromax 3.320114 brand_name_Asus 3.276640 brand_name_Nokia 3.221178 battery_by_weight 3.180797 brand_name_Motorola 3.171056 brand_name_Honor 3.149329 brand_name_HTC 3.107480 selfie_camera_mp 2.882209 age_old 2.871069 normalized_new_price 2.855337 brand_name_Sony 2.718741 days_used 2.426666 ram 2.298031 main_camera_mp 2.196235 4g_yes 2.185643 brand_name_Meizu 2.136881 5g_yes 2.078721 brand_name_Panasonic 2.008991 brand_name_Gionee 1.971167 brand_name_Realme 1.934318 ram4_1 1.899025 os_Others 1.848840 brand_name_Microsoft 1.784595 brand_name_Lava 1.773107 brand_name_Celkon 1.757360 brand_name_Spice 1.719489 brand_name_XOLO 1.696226 brand_name_OnePlus 1.633716 brand_name_BlackBerry 1.614968 os_Windows 1.594271 brand_name_Karbonn 1.540983 brand_name_Coolpad 1.438734 brand_name_Infinix 1.346102 int_memory 1.318765 brand_name_Google 1.256768 dtype: float64 ================================================ R-squared for data with dropped high-VIF columns R-squared Adj. R-squared Before 0.839 0.836 Without brand_name_Apple 0.839 0.836 Without os_iOS 0.839 0.836 Without brand_name_Others 0.839 0.836 Without brand_name_Samsung 0.839 0.836 Without brand_name_Huawei 0.839 0.836 Without battery 0.839 0.835 Without screen_size 0.829 0.825
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f75e952f190>
There's an interesting argument that we can drop either brand_name_Apple or os_iOS. Although they come from different categorical variables, these two features are perfectly correlated: the only brand running iOS is Apple, and the only OS on Apple devices is iOS, so including both is redundant. brand_name_Apple has the higher VIF, but os_iOS also has a very high p-value. Moreover, after rebuilding the subsequent regressions several times, I found that losing the OS feature is not detrimental to our model (its p-value remains high regardless).
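A quick way to verify this redundancy is a brand-by-OS crosstab (on the full data, pd.crosstab(phone['brand_name'], phone['os'])). A toy illustration of the idea:

```python
import pandas as pd

# toy frame mimicking the brand/OS relationship described above
toy = pd.DataFrame({
    "brand_name": ["Apple", "Apple", "Samsung", "Honor", "Nokia"],
    "os": ["iOS", "iOS", "Android", "Android", "Windows"],
})
# iOS counts appear only in Apple's row, so the dummy columns
# brand_name_Apple and os_iOS encode the same indicator
print(pd.crosstab(toy["brand_name"], toy["os"]))
```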
# drop os_iOS column
X_train=X_train.drop(['os_iOS'],axis=1)
# add col to dropped_cols list
dropped_cols.append('os_iOS')
model_tuning(y_train,X_train)
OLS Regression Results ================================================================================= Dep. Variable: normalized_used_price R-squared: 0.839 Model: OLS Adj. R-squared: 0.836 Method: Least Squares F-statistic: 254.0 Date: Sat, 08 Oct 2022 Prob (F-statistic): 0.00 Time: 04:45:03 Log-Likelihood: 114.04 No. Observations: 2382 AIC: -130.1 Df Residuals: 2333 BIC: 152.9 Df Model: 48 Covariance Type: nonrobust ========================================================================================= coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------------- const 1.4056 0.078 18.078 0.000 1.253 1.558 screen_size 0.0353 0.003 12.384 0.000 0.030 0.041 main_camera_mp 0.0210 0.001 14.494 0.000 0.018 0.024 selfie_camera_mp 0.0143 0.001 12.117 0.000 0.012 0.017 int_memory 0.0001 6.12e-05 2.315 0.021 2.17e-05 0.000 ram 0.0129 0.005 2.455 0.014 0.003 0.023 battery 2.934e-05 8.84e-06 3.319 0.001 1.2e-05 4.67e-05 days_used -2.267e-05 2.99e-05 -0.759 0.448 -8.12e-05 3.59e-05 normalized_new_price 0.4142 0.012 34.360 0.000 0.391 0.438 battery_by_weight -0.0078 0.002 -4.292 0.000 -0.011 -0.004 brand_name_Alcatel -0.0260 0.047 -0.558 0.577 -0.117 0.065 brand_name_Apple -0.0540 0.058 -0.933 0.351 -0.168 0.060 brand_name_Asus 0.0115 0.047 0.245 0.806 -0.080 0.103 brand_name_BlackBerry 0.0546 0.072 0.758 0.448 -0.087 0.196 brand_name_Celkon -0.3112 0.069 -4.487 0.000 -0.447 -0.175 brand_name_Coolpad -0.0085 0.072 -0.118 0.906 -0.151 0.133 brand_name_Gionee -0.0138 0.057 -0.244 0.807 -0.125 0.097 brand_name_Google 0.0361 0.093 0.390 0.696 -0.145 0.218 brand_name_HTC -0.0227 0.048 -0.475 0.635 -0.117 0.071 brand_name_Honor -0.0031 0.049 -0.064 0.949 -0.098 0.092 brand_name_Huawei -0.0440 0.044 -1.012 0.311 -0.129 0.041 brand_name_Infinix 0.3066 0.090 3.396 0.001 0.130 0.484 brand_name_Karbonn 0.0021 0.067 0.032 0.974 -0.129 0.133 brand_name_LG -0.0535 0.044 -1.203 0.229 -0.141 0.034 brand_name_Lava -0.0196 
0.060 -0.327 0.744 -0.137 0.098 brand_name_Lenovo 0.0013 0.045 0.030 0.976 -0.086 0.089 brand_name_Meizu -0.0374 0.055 -0.680 0.496 -0.145 0.070 brand_name_Micromax -0.0364 0.047 -0.777 0.437 -0.128 0.056 brand_name_Microsoft 0.0352 0.087 0.407 0.684 -0.135 0.205 brand_name_Motorola -0.0891 0.049 -1.830 0.067 -0.185 0.006 brand_name_Nokia 0.0492 0.051 0.970 0.332 -0.050 0.149 brand_name_OnePlus -0.0042 0.067 -0.063 0.950 -0.135 0.127 brand_name_Oppo -0.0069 0.047 -0.146 0.884 -0.099 0.086 brand_name_Others -0.0204 0.041 -0.494 0.621 -0.101 0.060 brand_name_Panasonic -0.0244 0.055 -0.441 0.659 -0.133 0.084 brand_name_Realme -0.0140 0.061 -0.230 0.818 -0.133 0.105 brand_name_Samsung -0.0507 0.042 -1.193 0.233 -0.134 0.033 brand_name_Sony -0.0880 0.050 -1.751 0.080 -0.187 0.011 brand_name_Spice -0.0464 0.061 -0.756 0.450 -0.167 0.074 brand_name_Vivo -0.0468 0.048 -0.966 0.334 -0.142 0.048 brand_name_XOLO -0.0797 0.061 -1.306 0.192 -0.199 0.040 brand_name_Xiaomi 0.0524 0.047 1.104 0.270 -0.041 0.145 brand_name_ZTE -0.0376 0.046 -0.820 0.412 -0.127 0.052 os_Others -0.0110 0.036 -0.307 0.759 -0.081 0.059 os_Windows -0.0260 0.045 -0.581 0.561 -0.114 0.062 4g_yes 0.0687 0.015 4.506 0.000 0.039 0.099 5g_yes 0.0182 0.034 0.536 0.592 -0.048 0.085 age_old -0.0430 0.017 -2.587 0.010 -0.076 -0.010 ram4_1 0.0269 0.017 1.561 0.119 -0.007 0.061 ============================================================================== Omnibus: 187.067 Durbin-Watson: 1.942 Prob(Omnibus): 0.000 Jarque-Bera (JB): 360.181 Skew: -0.535 Prob(JB): 6.13e-79 Kurtosis: 4.576 Cond. No. 1.69e+05 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.69e+05. This might indicate that there are strong multicollinearity or other numerical problems. 
Variance Inflation Factors const 265.104024 brand_name_Others 8.847330 brand_name_Samsung 6.892140 brand_name_Huawei 5.738639 battery 5.568998 screen_size 5.011843 brand_name_LG 4.721959 brand_name_Lenovo 4.165953 brand_name_ZTE 3.949164 brand_name_Oppo 3.667339 brand_name_Xiaomi 3.588128 brand_name_Alcatel 3.355138 brand_name_Vivo 3.336228 brand_name_Micromax 3.319684 brand_name_Asus 3.276583 brand_name_Nokia 3.217382 battery_by_weight 3.180109 brand_name_Motorola 3.170931 brand_name_Honor 3.149329 brand_name_HTC 3.106789 selfie_camera_mp 2.874054 age_old 2.871001 normalized_new_price 2.855336 brand_name_Sony 2.718676 days_used 2.425685 ram 2.279765 main_camera_mp 2.191216 4g_yes 2.185600 brand_name_Meizu 2.136820 brand_name_Apple 2.128776 5g_yes 2.068311 brand_name_Panasonic 2.008574 brand_name_Gionee 1.970664 brand_name_Realme 1.934292 ram4_1 1.892896 brand_name_Microsoft 1.784541 brand_name_Lava 1.772362 brand_name_Celkon 1.756081 os_Others 1.726867 brand_name_Spice 1.718455 brand_name_XOLO 1.695882 brand_name_OnePlus 1.633656 brand_name_BlackBerry 1.609489 os_Windows 1.592640 brand_name_Karbonn 1.540439 brand_name_Coolpad 1.438564 brand_name_Infinix 1.346046 int_memory 1.318438 brand_name_Google 1.256753 dtype: float64 ================================================ R-squared for data with dropped high-VIF columns R-squared Adj. R-squared Before 0.839 0.836 Without brand_name_Others 0.839 0.836 Without brand_name_Samsung 0.839 0.836 Without brand_name_Huawei 0.839 0.836 Without battery 0.839 0.835 Without screen_size 0.829 0.825
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f75e9ab6750>
brand_name_Others appears to add more noise than signal: as a catch-all category, it is likely too imprecise to benefit our model. Let's drop it, knowing that the R-squared value will not decrease.
# drop brand_name_Others
X_train=X_train.drop(['brand_name_Others'],axis=1)
# add to dropped_col
dropped_cols.append('brand_name_Others')
model_tuning(y_train,X_train)
OLS Regression Results ================================================================================= Dep. Variable: normalized_used_price R-squared: 0.839 Model: OLS Adj. R-squared: 0.836 Method: Least Squares F-statistic: 259.5 Date: Sat, 08 Oct 2022 Prob (F-statistic): 0.00 Time: 04:45:04 Log-Likelihood: 113.91 No. Observations: 2382 AIC: -131.8 Df Residuals: 2334 BIC: 145.4 Df Model: 47 Covariance Type: nonrobust ========================================================================================= coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------------- const 1.3861 0.067 20.717 0.000 1.255 1.517 screen_size 0.0354 0.003 12.458 0.000 0.030 0.041 main_camera_mp 0.0210 0.001 14.492 0.000 0.018 0.024 selfie_camera_mp 0.0143 0.001 12.125 0.000 0.012 0.017 int_memory 0.0001 6.12e-05 2.311 0.021 2.14e-05 0.000 ram 0.0129 0.005 2.457 0.014 0.003 0.023 battery 2.918e-05 8.83e-06 3.303 0.001 1.19e-05 4.65e-05 days_used -2.261e-05 2.99e-05 -0.757 0.449 -8.12e-05 3.59e-05 normalized_new_price 0.4141 0.012 34.362 0.000 0.390 0.438 battery_by_weight -0.0078 0.002 -4.285 0.000 -0.011 -0.004 brand_name_Alcatel -0.0077 0.028 -0.272 0.786 -0.063 0.048 brand_name_Apple -0.0356 0.044 -0.804 0.421 -0.123 0.051 brand_name_Asus 0.0299 0.029 1.046 0.296 -0.026 0.086 brand_name_BlackBerry 0.0729 0.062 1.181 0.238 -0.048 0.194 brand_name_Celkon -0.2930 0.059 -4.994 0.000 -0.408 -0.178 brand_name_Coolpad 0.0099 0.062 0.160 0.873 -0.112 0.132 brand_name_Gionee 0.0048 0.042 0.113 0.910 -0.078 0.088 brand_name_Google 0.0549 0.084 0.650 0.516 -0.111 0.220 brand_name_HTC -0.0043 0.030 -0.144 0.886 -0.063 0.055 brand_name_Honor 0.0154 0.031 0.500 0.617 -0.045 0.076 brand_name_Huawei -0.0256 0.022 -1.149 0.251 -0.069 0.018 brand_name_Infinix 0.3252 0.082 3.962 0.000 0.164 0.486 brand_name_Karbonn 0.0204 0.055 0.368 0.713 -0.088 0.129 brand_name_LG -0.0350 0.024 -1.460 0.144 -0.082 0.012 brand_name_Lava -0.0013 0.047 
-0.027 0.978 -0.094 0.091 brand_name_Lenovo 0.0197 0.025 0.789 0.430 -0.029 0.069 brand_name_Meizu -0.0189 0.040 -0.470 0.639 -0.098 0.060 brand_name_Micromax -0.0182 0.029 -0.630 0.529 -0.075 0.038 brand_name_Microsoft 0.0527 0.079 0.666 0.505 -0.102 0.208 brand_name_Motorola -0.0705 0.031 -2.287 0.022 -0.131 -0.010 brand_name_Nokia 0.0675 0.035 1.946 0.052 -0.001 0.135 brand_name_OnePlus 0.0144 0.055 0.259 0.795 -0.094 0.123 brand_name_Oppo 0.0116 0.029 0.406 0.685 -0.045 0.068 brand_name_Panasonic -0.0060 0.041 -0.147 0.883 -0.087 0.074 brand_name_Realme 0.0047 0.047 0.100 0.921 -0.088 0.098 brand_name_Samsung -0.0322 0.020 -1.589 0.112 -0.072 0.008 brand_name_Sony -0.0695 0.034 -2.074 0.038 -0.135 -0.004 brand_name_Spice -0.0280 0.049 -0.574 0.566 -0.124 0.068 brand_name_Vivo -0.0282 0.031 -0.925 0.355 -0.088 0.032 brand_name_XOLO -0.0614 0.049 -1.265 0.206 -0.157 0.034 brand_name_Xiaomi 0.0710 0.029 2.457 0.014 0.014 0.128 brand_name_ZTE -0.0191 0.026 -0.723 0.470 -0.071 0.033 os_Others -0.0103 0.036 -0.286 0.775 -0.081 0.060 os_Windows -0.0251 0.045 -0.561 0.575 -0.113 0.063 4g_yes 0.0686 0.015 4.500 0.000 0.039 0.098 5g_yes 0.0182 0.034 0.536 0.592 -0.048 0.085 age_old -0.0426 0.017 -2.567 0.010 -0.075 -0.010 ram4_1 0.0269 0.017 1.561 0.119 -0.007 0.061 ============================================================================== Omnibus: 187.235 Durbin-Watson: 1.941 Prob(Omnibus): 0.000 Jarque-Bera (JB): 360.648 Skew: -0.535 Prob(JB): 4.86e-79 Kurtosis: 4.577 Cond. No. 6.69e+04 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 6.69e+04. This might indicate that there are strong multicollinearity or other numerical problems. 
Variance Inflation Factors const 196.353264 battery 5.560984 screen_size 4.984176 battery_by_weight 3.179004 selfie_camera_mp 2.873727 age_old 2.864429 normalized_new_price 2.855014 days_used 2.425647 ram 2.279733 main_camera_mp 2.191015 4g_yes 2.185181 5g_yes 2.068308 ram4_1 1.892896 os_Others 1.723775 os_Windows 1.589856 brand_name_Samsung 1.573346 brand_name_Nokia 1.504856 brand_name_Huawei 1.502879 brand_name_Microsoft 1.488295 brand_name_LG 1.372475 brand_name_Oppo 1.350397 brand_name_Xiaomi 1.330653 brand_name_Vivo 1.325473 int_memory 1.318334 brand_name_ZTE 1.308460 brand_name_Lenovo 1.296540 brand_name_Motorola 1.272006 brand_name_Honor 1.270812 brand_name_Celkon 1.256991 brand_name_Micromax 1.253321 brand_name_Apple 1.247784 brand_name_Alcatel 1.240127 brand_name_HTC 1.221159 brand_name_Asus 1.215106 brand_name_Sony 1.209331 brand_name_BlackBerry 1.183462 brand_name_Realme 1.181635 brand_name_Meizu 1.139277 brand_name_OnePlus 1.118116 brand_name_Infinix 1.112590 brand_name_Gionee 1.100861 brand_name_Panasonic 1.100701 brand_name_Lava 1.094078 brand_name_Spice 1.088881 brand_name_XOLO 1.073786 brand_name_Karbonn 1.065913 brand_name_Coolpad 1.055485 brand_name_Google 1.045864 dtype: float64 ================================================ R-squared for data with dropped high-VIF columns R-squared Adj. R-squared Before 0.839 0.836 Without battery 0.839 0.835
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f75e7505490>
Our R-squared did not change and brand_name_Samsung no longer has high VIF.
Since battery_by_weight and screen_size together largely predict battery, we can confidently remove battery itself, which now has the highest remaining VIF.
# drop battery column
X_train=X_train.drop(['battery'],axis=1)
# append to list
dropped_cols.append('battery')
model_tuning(y_train,X_train)
OLS Regression Results ================================================================================= Dep. Variable: normalized_used_price R-squared: 0.839 Model: OLS Adj. R-squared: 0.835 Method: Least Squares F-statistic: 263.8 Date: Sat, 08 Oct 2022 Prob (F-statistic): 0.00 Time: 04:45:05 Log-Likelihood: 108.36 No. Observations: 2382 AIC: -122.7 Df Residuals: 2335 BIC: 148.7 Df Model: 46 Covariance Type: nonrobust ========================================================================================= coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------------- const 1.2957 0.061 21.175 0.000 1.176 1.416 screen_size 0.0430 0.002 26.279 0.000 0.040 0.046 main_camera_mp 0.0207 0.001 14.259 0.000 0.018 0.024 selfie_camera_mp 0.0139 0.001 11.816 0.000 0.012 0.016 int_memory 0.0001 6.13e-05 2.441 0.015 2.94e-05 0.000 ram 0.0134 0.005 2.543 0.011 0.003 0.024 days_used -1.819e-05 2.99e-05 -0.609 0.543 -7.68e-05 4.04e-05 normalized_new_price 0.4169 0.012 34.603 0.000 0.393 0.441 battery_by_weight -0.0043 0.001 -2.893 0.004 -0.007 -0.001 brand_name_Alcatel -0.0166 0.028 -0.586 0.558 -0.072 0.039 brand_name_Apple -0.0220 0.044 -0.498 0.619 -0.109 0.065 brand_name_Asus 0.0311 0.029 1.086 0.278 -0.025 0.087 brand_name_BlackBerry 0.0711 0.062 1.150 0.250 -0.050 0.192 brand_name_Celkon -0.2889 0.059 -4.914 0.000 -0.404 -0.174 brand_name_Coolpad 0.0070 0.062 0.112 0.911 -0.115 0.129 brand_name_Gionee 0.0119 0.042 0.282 0.778 -0.071 0.095 brand_name_Google 0.0497 0.085 0.588 0.557 -0.116 0.216 brand_name_HTC -0.0064 0.030 -0.213 0.831 -0.065 0.053 brand_name_Honor 0.0099 0.031 0.320 0.749 -0.051 0.070 brand_name_Huawei -0.0248 0.022 -1.112 0.266 -0.069 0.019 brand_name_Infinix 0.3327 0.082 4.046 0.000 0.171 0.494 brand_name_Karbonn 0.0107 0.055 0.193 0.847 -0.098 0.119 brand_name_LG -0.0356 0.024 -1.482 0.139 -0.083 0.012 brand_name_Lava -0.0055 0.047 -0.116 0.908 -0.098 0.087 brand_name_Lenovo 0.0219 0.025 
0.878 0.380 -0.027 0.071 brand_name_Meizu -0.0225 0.040 -0.559 0.576 -0.101 0.056 brand_name_Micromax -0.0220 0.029 -0.763 0.445 -0.079 0.035 brand_name_Microsoft 0.0503 0.079 0.635 0.525 -0.105 0.206 brand_name_Motorola -0.0678 0.031 -2.194 0.028 -0.128 -0.007 brand_name_Nokia 0.0654 0.035 1.883 0.060 -0.003 0.134 brand_name_OnePlus 0.0064 0.055 0.115 0.908 -0.102 0.115 brand_name_Oppo 0.0105 0.029 0.364 0.716 -0.046 0.067 brand_name_Panasonic -0.0113 0.041 -0.275 0.784 -0.092 0.069 brand_name_Realme 0.0100 0.047 0.212 0.832 -0.083 0.103 brand_name_Samsung -0.0313 0.020 -1.539 0.124 -0.071 0.009 brand_name_Sony -0.0690 0.034 -2.053 0.040 -0.135 -0.003 brand_name_Spice -0.0230 0.049 -0.470 0.638 -0.119 0.073 brand_name_Vivo -0.0314 0.031 -1.027 0.305 -0.091 0.029 brand_name_XOLO -0.0643 0.049 -1.322 0.186 -0.160 0.031 brand_name_Xiaomi 0.0755 0.029 2.613 0.009 0.019 0.132 brand_name_ZTE -0.0211 0.026 -0.798 0.425 -0.073 0.031 os_Others 0.0177 0.035 0.505 0.614 -0.051 0.086 os_Windows -0.0266 0.045 -0.595 0.552 -0.114 0.061 4g_yes 0.0695 0.015 4.552 0.000 0.040 0.099 5g_yes 0.0204 0.034 0.599 0.549 -0.046 0.087 age_old -0.0443 0.017 -2.665 0.008 -0.077 -0.012 ram4_1 0.0269 0.017 1.561 0.119 -0.007 0.061 ============================================================================== Omnibus: 186.544 Durbin-Watson: 1.941 Prob(Omnibus): 0.000 Jarque-Bera (JB): 359.965 Skew: -0.533 Prob(JB): 6.83e-79 Kurtosis: 4.578 Cond. No. 1.40e+04 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.4e+04. This might indicate that there are strong multicollinearity or other numerical problems. 
Variance Inflation Factors const 163.546190 age_old 2.861644 normalized_new_price 2.841189 selfie_camera_mp 2.841148 days_used 2.420772 ram 2.278053 4g_yes 2.184430 main_camera_mp 2.179395 battery_by_weight 2.087648 5g_yes 2.067531 ram4_1 1.892894 screen_size 1.651488 os_Others 1.628177 os_Windows 1.589682 brand_name_Samsung 1.573021 brand_name_Nokia 1.504374 brand_name_Huawei 1.502714 brand_name_Microsoft 1.488174 brand_name_LG 1.372401 brand_name_Oppo 1.350188 brand_name_Xiaomi 1.327600 brand_name_Vivo 1.324158 int_memory 1.316189 brand_name_ZTE 1.307764 brand_name_Lenovo 1.295566 brand_name_Motorola 1.271094 brand_name_Honor 1.267041 brand_name_Celkon 1.256428 brand_name_Micromax 1.251259 brand_name_Apple 1.236995 brand_name_Alcatel 1.229015 brand_name_HTC 1.220608 brand_name_Asus 1.214908 brand_name_Sony 1.209302 brand_name_BlackBerry 1.183374 brand_name_Realme 1.180261 brand_name_Meizu 1.138431 brand_name_OnePlus 1.115987 brand_name_Infinix 1.111752 brand_name_Panasonic 1.099036 brand_name_Gionee 1.097969 brand_name_Lava 1.093285 brand_name_Spice 1.087834 brand_name_XOLO 1.073437 brand_name_Karbonn 1.062911 brand_name_Coolpad 1.055264 brand_name_Google 1.045507 dtype: float64 There are no features with VIF greater than 5
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f75e8fe0790>
There was no effect on R-squared, and we have now eliminated all features with high VIF.
Next we'll look at p-values, using a 5% level of significance. The p-values for os_Others and os_Windows are both high, indicating that operating system is generally not a good predictor of used price. We will drop the remaining os features.
# drop remaining os columns
X_train=X_train.drop(['os_Others','os_Windows'],axis=1)
# record both dropped os columns
dropped_cols.extend(['os_Others','os_Windows'])
model_tuning(y_train,X_train,hide_vif=True)
OLS Regression Results ================================================================================= Dep. Variable: normalized_used_price R-squared: 0.839 Model: OLS Adj. R-squared: 0.836 Method: Least Squares F-statistic: 275.9 Date: Sat, 08 Oct 2022 Prob (F-statistic): 0.00 Time: 04:45:07 Log-Likelihood: 108.01 No. Observations: 2382 AIC: -126.0 Df Residuals: 2337 BIC: 133.9 Df Model: 44 Covariance Type: nonrobust ========================================================================================= coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------------- const 1.3053 0.059 22.168 0.000 1.190 1.421 screen_size 0.0429 0.002 26.747 0.000 0.040 0.046 main_camera_mp 0.0206 0.001 14.243 0.000 0.018 0.023 selfie_camera_mp 0.0139 0.001 11.874 0.000 0.012 0.016 int_memory 0.0001 6.13e-05 2.429 0.015 2.87e-05 0.000 ram 0.0128 0.005 2.459 0.014 0.003 0.023 days_used -1.819e-05 2.99e-05 -0.609 0.543 -7.68e-05 4.04e-05 normalized_new_price 0.4164 0.012 34.658 0.000 0.393 0.440 battery_by_weight -0.0043 0.001 -2.907 0.004 -0.007 -0.001 brand_name_Alcatel -0.0151 0.028 -0.536 0.592 -0.070 0.040 brand_name_Apple -0.0184 0.044 -0.421 0.674 -0.104 0.067 brand_name_Asus 0.0322 0.029 1.129 0.259 -0.024 0.088 brand_name_BlackBerry 0.0818 0.059 1.397 0.163 -0.033 0.197 brand_name_Celkon -0.2840 0.058 -4.905 0.000 -0.397 -0.170 brand_name_Coolpad 0.0077 0.062 0.124 0.901 -0.114 0.130 brand_name_Gionee 0.0126 0.042 0.298 0.766 -0.070 0.095 brand_name_Google 0.0508 0.085 0.600 0.548 -0.115 0.217 brand_name_HTC -0.0066 0.030 -0.220 0.826 -0.066 0.052 brand_name_Honor 0.0104 0.031 0.338 0.736 -0.050 0.071 brand_name_Huawei -0.0243 0.022 -1.089 0.276 -0.068 0.019 brand_name_Infinix 0.3301 0.082 4.019 0.000 0.169 0.491 brand_name_Karbonn 0.0101 0.055 0.183 0.855 -0.098 0.119 brand_name_LG -0.0349 0.024 -1.457 0.145 -0.082 0.012 brand_name_Lava -0.0071 0.047 -0.150 0.881 -0.100 0.085 brand_name_Lenovo 0.0226 0.025 
0.907 0.365 -0.026 0.072 brand_name_Meizu -0.0216 0.040 -0.536 0.592 -0.100 0.057 brand_name_Micromax -0.0217 0.029 -0.754 0.451 -0.078 0.035 brand_name_Microsoft 0.0243 0.066 0.367 0.714 -0.105 0.154 brand_name_Motorola -0.0672 0.031 -2.176 0.030 -0.128 -0.007 brand_name_Nokia 0.0645 0.032 1.995 0.046 0.001 0.128 brand_name_OnePlus 0.0080 0.055 0.144 0.885 -0.101 0.116 brand_name_Oppo 0.0109 0.029 0.381 0.703 -0.045 0.067 brand_name_Panasonic -0.0107 0.041 -0.259 0.796 -0.091 0.070 brand_name_Realme 0.0096 0.047 0.204 0.839 -0.083 0.103 brand_name_Samsung -0.0299 0.020 -1.480 0.139 -0.070 0.010 brand_name_Sony -0.0674 0.034 -2.012 0.044 -0.133 -0.002 brand_name_Spice -0.0235 0.049 -0.481 0.630 -0.119 0.072 brand_name_Vivo -0.0306 0.031 -1.003 0.316 -0.090 0.029 brand_name_XOLO -0.0651 0.049 -1.340 0.180 -0.160 0.030 brand_name_Xiaomi 0.0762 0.029 2.639 0.008 0.020 0.133 brand_name_ZTE -0.0202 0.026 -0.767 0.443 -0.072 0.032 4g_yes 0.0691 0.015 4.529 0.000 0.039 0.099 5g_yes 0.0214 0.034 0.631 0.528 -0.045 0.088 age_old -0.0449 0.017 -2.709 0.007 -0.077 -0.012 ram4_1 0.0252 0.017 1.481 0.139 -0.008 0.059 ============================================================================== Omnibus: 184.690 Durbin-Watson: 1.942 Prob(Omnibus): 0.000 Jarque-Bera (JB): 357.701 Skew: -0.527 Prob(JB): 2.12e-78 Kurtosis: 4.579 Cond. No. 1.38e+04 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.38e+04. This might indicate that there are strong multicollinearity or other numerical problems.
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f75e9a8f350>
Dropping os_Others and os_Windows has had no impact on the model's R-squared.
With a p-value of 0.543, days_used is clearly adding no predictive value to our model. Let's drop it.
# drop days_used column
X_train=X_train.drop(['days_used'],axis=1)
# append
dropped_cols.append('days_used')
model_tuning(y_train,X_train,hide_vif=True)
OLS Regression Results ================================================================================= Dep. Variable: normalized_used_price R-squared: 0.839 Model: OLS Adj. R-squared: 0.836 Method: Least Squares F-statistic: 282.4 Date: Sat, 08 Oct 2022 Prob (F-statistic): 0.00 Time: 04:45:07 Log-Likelihood: 107.82 No. Observations: 2382 AIC: -127.6 Df Residuals: 2338 BIC: 126.5 Df Model: 43 Covariance Type: nonrobust ========================================================================================= coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------------- const 1.2961 0.057 22.781 0.000 1.185 1.408 screen_size 0.0429 0.002 26.896 0.000 0.040 0.046 main_camera_mp 0.0205 0.001 14.234 0.000 0.018 0.023 selfie_camera_mp 0.0140 0.001 11.981 0.000 0.012 0.016 int_memory 0.0001 6.12e-05 2.443 0.015 2.95e-05 0.000 ram 0.0131 0.005 2.523 0.012 0.003 0.023 normalized_new_price 0.4159 0.012 34.687 0.000 0.392 0.439 battery_by_weight -0.0042 0.001 -2.861 0.004 -0.007 -0.001 brand_name_Alcatel -0.0155 0.028 -0.550 0.582 -0.071 0.040 brand_name_Apple -0.0176 0.044 -0.402 0.688 -0.103 0.068 brand_name_Asus 0.0319 0.029 1.119 0.263 -0.024 0.088 brand_name_BlackBerry 0.0826 0.059 1.411 0.158 -0.032 0.197 brand_name_Celkon -0.2835 0.058 -4.899 0.000 -0.397 -0.170 brand_name_Coolpad 0.0071 0.062 0.115 0.909 -0.115 0.129 brand_name_Gionee 0.0122 0.042 0.288 0.774 -0.071 0.095 brand_name_Google 0.0512 0.085 0.605 0.545 -0.115 0.217 brand_name_HTC -0.0065 0.030 -0.216 0.829 -0.065 0.052 brand_name_Honor 0.0110 0.031 0.356 0.722 -0.049 0.071 brand_name_Huawei -0.0239 0.022 -1.072 0.284 -0.068 0.020 brand_name_Infinix 0.3317 0.082 4.040 0.000 0.171 0.493 brand_name_Karbonn 0.0100 0.055 0.181 0.856 -0.099 0.119 brand_name_LG -0.0342 0.024 -1.430 0.153 -0.081 0.013 brand_name_Lava -0.0075 0.047 -0.160 0.873 -0.100 0.085 brand_name_Lenovo 0.0226 0.025 0.905 0.366 -0.026 0.072 brand_name_Meizu -0.0223 0.040 -0.556 
0.578 -0.101 0.056 brand_name_Micromax -0.0223 0.029 -0.775 0.438 -0.079 0.034 brand_name_Microsoft 0.0238 0.066 0.359 0.719 -0.106 0.153 brand_name_Motorola -0.0655 0.031 -2.131 0.033 -0.126 -0.005 brand_name_Nokia 0.0648 0.032 2.005 0.045 0.001 0.128 brand_name_OnePlus 0.0072 0.055 0.130 0.897 -0.101 0.116 brand_name_Oppo 0.0110 0.029 0.383 0.702 -0.045 0.067 brand_name_Panasonic -0.0116 0.041 -0.282 0.778 -0.092 0.069 brand_name_Realme 0.0117 0.047 0.248 0.804 -0.081 0.104 brand_name_Samsung -0.0298 0.020 -1.473 0.141 -0.069 0.010 brand_name_Sony -0.0670 0.034 -1.999 0.046 -0.133 -0.001 brand_name_Spice -0.0249 0.049 -0.511 0.610 -0.121 0.071 brand_name_Vivo -0.0303 0.031 -0.994 0.321 -0.090 0.030 brand_name_XOLO -0.0651 0.049 -1.339 0.181 -0.160 0.030 brand_name_Xiaomi 0.0761 0.029 2.635 0.008 0.019 0.133 brand_name_ZTE -0.0205 0.026 -0.777 0.437 -0.072 0.031 4g_yes 0.0689 0.015 4.522 0.000 0.039 0.099 5g_yes 0.0224 0.034 0.660 0.509 -0.044 0.089 age_old -0.0493 0.015 -3.287 0.001 -0.079 -0.020 ram4_1 0.0228 0.017 1.377 0.169 -0.010 0.055 ============================================================================== Omnibus: 184.367 Durbin-Watson: 1.941 Prob(Omnibus): 0.000 Jarque-Bera (JB): 356.117 Skew: -0.527 Prob(JB): 4.68e-78 Kurtosis: 4.573 Cond. No. 2.03e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 2.03e+03. This might indicate that there are strong multicollinearity or other numerical problems.
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f75e74ed890>
Dropping days_used had no impact on R-squared either.
Despite high p-values, I do not believe we can drop only some of the brand_name columns: dropping part of a one-hot-encoded categorical variable would hurt the interpretability of the model. That leaves two other features with high p-values: 5g_yes and ram4_1. The former records whether the device has 5G connectivity, which is apparently a poor predictor of price, especially compared with 4G connectivity. The latter is a feature I engineered in an attempt to bin the RAM values; it does not appear to benefit the model.
We will still only drop one column at a time to record the effect on the R-squared value.
# drop the 5G indicator from the training features
X_train=X_train.drop(['5g_yes'],axis=1)
# record the drop so the same column can be removed from the test set later
dropped_cols.append('5g_yes')
model_tuning(y_train,X_train,hide_vif=True)
OLS Regression Results ================================================================================= Dep. Variable: normalized_used_price R-squared: 0.839 Model: OLS Adj. R-squared: 0.836 Method: Least Squares F-statistic: 289.2 Date: Sat, 08 Oct 2022 Prob (F-statistic): 0.00 Time: 04:45:07 Log-Likelihood: 107.60 No. Observations: 2382 AIC: -129.2 Df Residuals: 2339 BIC: 119.2 Df Model: 42 Covariance Type: nonrobust ========================================================================================= coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------------- const 1.2898 0.056 22.997 0.000 1.180 1.400 screen_size 0.0429 0.002 26.912 0.000 0.040 0.046 main_camera_mp 0.0204 0.001 14.237 0.000 0.018 0.023 selfie_camera_mp 0.0140 0.001 11.983 0.000 0.012 0.016 int_memory 0.0002 6.12e-05 2.468 0.014 3.1e-05 0.000 ram 0.0148 0.005 3.253 0.001 0.006 0.024 normalized_new_price 0.4171 0.012 35.205 0.000 0.394 0.440 battery_by_weight -0.0042 0.001 -2.885 0.004 -0.007 -0.001 brand_name_Alcatel -0.0152 0.028 -0.539 0.590 -0.070 0.040 brand_name_Apple -0.0192 0.044 -0.440 0.660 -0.105 0.066 brand_name_Asus 0.0320 0.029 1.121 0.263 -0.024 0.088 brand_name_BlackBerry 0.0825 0.059 1.410 0.159 -0.032 0.197 brand_name_Celkon -0.2819 0.058 -4.876 0.000 -0.395 -0.169 brand_name_Coolpad 0.0089 0.062 0.143 0.886 -0.113 0.130 brand_name_Gionee 0.0121 0.042 0.286 0.775 -0.071 0.095 brand_name_Google 0.0476 0.084 0.565 0.572 -0.118 0.213 brand_name_HTC -0.0070 0.030 -0.232 0.817 -0.066 0.052 brand_name_Honor 0.0116 0.031 0.376 0.707 -0.049 0.072 brand_name_Huawei -0.0240 0.022 -1.077 0.281 -0.068 0.020 brand_name_Infinix 0.3308 0.082 4.030 0.000 0.170 0.492 brand_name_Karbonn 0.0105 0.055 0.189 0.850 -0.098 0.119 brand_name_LG -0.0342 0.024 -1.429 0.153 -0.081 0.013 brand_name_Lava -0.0072 0.047 -0.152 0.879 -0.100 0.085 brand_name_Lenovo 0.0231 0.025 0.924 0.356 -0.026 0.072 brand_name_Meizu -0.0225 0.040 -0.561 
0.575 -0.101 0.056 brand_name_Micromax -0.0218 0.029 -0.757 0.449 -0.078 0.035 brand_name_Microsoft 0.0241 0.066 0.364 0.716 -0.106 0.154 brand_name_Motorola -0.0651 0.031 -2.117 0.034 -0.125 -0.005 brand_name_Nokia 0.0660 0.032 2.043 0.041 0.003 0.129 brand_name_OnePlus 0.0078 0.055 0.141 0.888 -0.101 0.116 brand_name_Oppo 0.0116 0.029 0.404 0.687 -0.045 0.068 brand_name_Panasonic -0.0112 0.041 -0.274 0.784 -0.092 0.069 brand_name_Realme 0.0149 0.047 0.318 0.751 -0.077 0.107 brand_name_Samsung -0.0302 0.020 -1.497 0.135 -0.070 0.009 brand_name_Sony -0.0668 0.033 -1.993 0.046 -0.132 -0.001 brand_name_Spice -0.0246 0.049 -0.505 0.614 -0.120 0.071 brand_name_Vivo -0.0306 0.031 -1.004 0.315 -0.090 0.029 brand_name_XOLO -0.0649 0.049 -1.336 0.182 -0.160 0.030 brand_name_Xiaomi 0.0765 0.029 2.649 0.008 0.020 0.133 brand_name_ZTE -0.0203 0.026 -0.769 0.442 -0.072 0.031 4g_yes 0.0689 0.015 4.518 0.000 0.039 0.099 age_old -0.0492 0.015 -3.287 0.001 -0.079 -0.020 ram4_1 0.0184 0.015 1.214 0.225 -0.011 0.048 ============================================================================== Omnibus: 184.759 Durbin-Watson: 1.942 Prob(Omnibus): 0.000 Jarque-Bera (JB): 356.947 Skew: -0.528 Prob(JB): 3.09e-78 Kurtosis: 4.575 Cond. No. 2.03e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 2.03e+03. This might indicate that there are strong multicollinearity or other numerical problems.
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f75e75f5510>
Dropping 5g_yes had no effect on the R-squared. What's more, the p-value for ram4_1 actually went up (from 0.169 to 0.225), further supporting our suspicion that this feature is not necessary.
# drop the RAM bin feature from the training features
X_train=X_train.drop(['ram4_1'],axis=1)
# record the drop so the same column can be removed from the test set later
dropped_cols.append('ram4_1')
model=model_tuning(y_train,X_train,hide_vif=True)
OLS Regression Results ================================================================================= Dep. Variable: normalized_used_price R-squared: 0.838 Model: OLS Adj. R-squared: 0.836 Method: Least Squares F-statistic: 296.1 Date: Sat, 08 Oct 2022 Prob (F-statistic): 0.00 Time: 04:45:08 Log-Likelihood: 106.85 No. Observations: 2382 AIC: -129.7 Df Residuals: 2340 BIC: 112.9 Df Model: 41 Covariance Type: nonrobust ========================================================================================= coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------------- const 1.3002 0.055 23.454 0.000 1.191 1.409 screen_size 0.0428 0.002 26.886 0.000 0.040 0.046 main_camera_mp 0.0207 0.001 14.505 0.000 0.018 0.023 selfie_camera_mp 0.0138 0.001 11.940 0.000 0.012 0.016 int_memory 0.0001 6.01e-05 2.282 0.023 1.93e-05 0.000 ram 0.0146 0.005 3.222 0.001 0.006 0.024 normalized_new_price 0.4185 0.012 35.462 0.000 0.395 0.442 battery_by_weight -0.0043 0.001 -2.934 0.003 -0.007 -0.001 brand_name_Alcatel -0.0171 0.028 -0.609 0.543 -0.072 0.038 brand_name_Apple -0.0200 0.044 -0.459 0.647 -0.106 0.066 brand_name_Asus 0.0329 0.029 1.153 0.249 -0.023 0.089 brand_name_BlackBerry 0.0823 0.059 1.406 0.160 -0.032 0.197 brand_name_Celkon -0.2913 0.057 -5.082 0.000 -0.404 -0.179 brand_name_Coolpad 0.0089 0.062 0.144 0.885 -0.113 0.131 brand_name_Gionee 0.0128 0.042 0.303 0.762 -0.070 0.096 brand_name_Google 0.0434 0.084 0.515 0.607 -0.122 0.209 brand_name_HTC -0.0077 0.030 -0.257 0.797 -0.067 0.051 brand_name_Honor 0.0094 0.031 0.306 0.760 -0.051 0.070 brand_name_Huawei -0.0248 0.022 -1.113 0.266 -0.068 0.019 brand_name_Infinix 0.3248 0.082 3.964 0.000 0.164 0.485 brand_name_Karbonn 0.0090 0.055 0.162 0.871 -0.100 0.118 brand_name_LG -0.0357 0.024 -1.495 0.135 -0.083 0.011 brand_name_Lava -0.0079 0.047 -0.168 0.867 -0.100 0.085 brand_name_Lenovo 0.0231 0.025 0.926 0.355 -0.026 0.072 brand_name_Meizu -0.0235 0.040 -0.584 
0.559 -0.102 0.055 brand_name_Micromax -0.0214 0.029 -0.742 0.458 -0.078 0.035 brand_name_Microsoft 0.0245 0.066 0.371 0.711 -0.105 0.154 brand_name_Motorola -0.0662 0.031 -2.154 0.031 -0.126 -0.006 brand_name_Nokia 0.0600 0.032 1.879 0.060 -0.003 0.123 brand_name_OnePlus 0.0028 0.055 0.051 0.959 -0.105 0.111 brand_name_Oppo 0.0098 0.029 0.342 0.732 -0.046 0.066 brand_name_Panasonic -0.0101 0.041 -0.246 0.806 -0.091 0.070 brand_name_Realme 0.0081 0.047 0.173 0.863 -0.083 0.100 brand_name_Samsung -0.0306 0.020 -1.517 0.129 -0.070 0.009 brand_name_Sony -0.0674 0.033 -2.013 0.044 -0.133 -0.002 brand_name_Spice -0.0239 0.049 -0.489 0.625 -0.120 0.072 brand_name_Vivo -0.0309 0.031 -1.013 0.311 -0.091 0.029 brand_name_XOLO -0.0644 0.049 -1.324 0.186 -0.160 0.031 brand_name_Xiaomi 0.0754 0.029 2.611 0.009 0.019 0.132 brand_name_ZTE -0.0211 0.026 -0.799 0.425 -0.073 0.031 4g_yes 0.0695 0.015 4.561 0.000 0.040 0.099 age_old -0.0461 0.015 -3.125 0.002 -0.075 -0.017 ============================================================================== Omnibus: 188.661 Durbin-Watson: 1.940 Prob(Omnibus): 0.000 Jarque-Bera (JB): 364.229 Skew: -0.538 Prob(JB): 8.10e-80 Kurtosis: 4.585 Cond. No. 2.02e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 2.02e+03. This might indicate that there are strong multicollinearity or other numerical problems.
R-squared only decreased by 0.001 after dropping ram4_1. Outside of the brand_name dummies, no feature has a p-value greater than 0.05. With that, we have a model that explains 83.8% of the variance in normalized_used_price on the training data!
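As a sanity check, the adjusted R-squared reported in the summary (0.836) can be reproduced from the plain R-squared, the number of observations, and the model degrees of freedom, all taken from the summary above:

```python
# adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)
r2 = 0.838      # R-squared from the summary
n = 2382        # number of observations
p = 41          # model degrees of freedom (predictors, excluding the constant)

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))  # agrees with the reported 0.836 up to rounding of the inputs
```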
# tabulate model performance
model_perf=pd.DataFrame.from_dict({'Actual':y_train.values,
'Fitted':model.fittedvalues.values,
'Residuals':model.resid.values})
model_perf.head()
Actual | Fitted | Residuals | |
---|---|---|---|
0 | 5.169518 | 5.033365 | 0.136153 |
1 | 4.284276 | 4.485839 | -0.201563 |
2 | 4.955757 | 4.771354 | 0.184403 |
3 | 4.460607 | 4.403883 | 0.056724 |
4 | 4.005695 | 4.063245 | -0.057549 |
If our data really does exhibit a linear relationship, then there should be no pattern in the plot of fitted values versus residuals.
plt.figure(figsize=(8,5))
plt.title('Linearity Check')
sns.residplot(data=model_perf,x='Fitted',y='Residuals',lowess=True);
There is at worst a weak trend apparent here; the residuals are essentially noise around zero. This supports a linear relationship between our inputs and the response variable, normalized_used_price.
The absence of any regular pattern in the residuals is also consistent with their being independent of one another.
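The regression summary above also reports a Durbin-Watson statistic of about 1.94; values near 2 indicate little autocorrelation in the residuals. As a minimal standalone sketch (using synthetic residuals, not the model's), the statistic can be computed with statsmodels:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# synthetic independent residuals stand in for the model's residuals here
rng = np.random.default_rng(0)
resid = rng.normal(0, 1, 2000)

dw = durbin_watson(resid)
print(dw)  # close to 2 when residuals are uncorrelated
```

On the fitted model, the analogous call would be `durbin_watson(model_perf['Residuals'])`.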
plt.figure(figsize=(8,5))
plt.title('Histogram of Linear Regression Residuals',fontsize=16)
sns.histplot(data=model_perf,x='Residuals',kde=True);
All told, this distribution is reasonably normal. The only issues seem to be in the tails. Further evidence in favor of normality can be obtained by studying the QQ plot.
plt.figure(figsize=(8,5))
# stats.probplot can draw directly onto pyplot; no need for the deprecated pylab module
stats.probplot(model_perf['Residuals'],dist='norm',plot=plt);
Again, save for the tails, we find that the distribution is close to normal. Recall that normally distributed data will follow the diagonal reference line of a QQ plot.
But let's go further by running a Shapiro-Wilk test at a 5% level of significance. We make the following hypotheses: $$H_0: \text{Residuals are normally distributed.}$$ $$H_a: \text{Residuals are NOT normally distributed.}$$
test_stat,p_val=stats.shapiro(model_perf['Residuals'])
print('The p-value of the Shapiro-Wilk test for residuals is:',p_val)
The p-value of the Shapiro-Wilk test for residuals is: 6.636463188267622e-21
The Shapiro-Wilk test rejects the null hypothesis: the residuals are almost certainly not exactly normal. This is not surprising. With over 2,300 observations, the test has the power to detect even small departures from normality, such as the heavy tails visible in the histogram and QQ plot. The distribution is still approximately normal, so we can proceed with the usual inferences, keeping this limitation in mind.
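To illustrate that sensitivity, here is a hypothetical sketch: draw a large sample from a Student's t distribution with 5 degrees of freedom, which is bell-shaped but has heavier tails than the normal, much like our residuals. At a sample size comparable to our training set, Shapiro-Wilk rejects it decisively:

```python
import numpy as np
from scipy import stats

# heavy-tailed but bell-shaped sample, similar in size to our training set
rng = np.random.default_rng(1)
sample = rng.standard_t(df=5, size=2382)

stat, p = stats.shapiro(sample)
print(p)  # far below 0.05: normality rejected despite the bell shape
```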
Homoscedasticity concerns the variance of residuals. We require that the variance of residuals remain consistent throughout the data.
We will run the Goldfeld-Quandt test, a hypothesis test for homoscedasticity. We make the assumption that the residuals are homoscedastic, with the alternative being that they are heteroscedastic. $$H_0: \text{Residuals are homoscedastic.}$$ $$H_a: \text{Residuals are heteroscedastic.}$$ We will assume a 5% level of significance.
# homoscedasticity test (Goldfeld-Quandt)
from statsmodels.stats.diagnostic import het_goldfeldquandt
test=het_goldfeldquandt(model_perf['Residuals'],X_train)
print('The p-value for the Goldfeld-Quandt test is:',test[1])
The p-value for the Goldfeld-Quandt test is: 0.24735380493522216
Since the p-value is greater than 0.05, we fail to reject the null hypothesis: there is no evidence of heteroscedasticity, so we may treat the residuals as homoscedastic.
# drop required columns from test data
X_test=X_test.drop(dropped_cols,axis=1)
# generate predicted y values with model
y_predicted=model.predict(X_test)
from sklearn.metrics import mean_squared_error, mean_absolute_error
print('The RMSE on the training data is:',
np.sqrt(mean_squared_error(y_train,model_perf['Fitted'])))
The RMSE on the training data is: 0.23135627303854692
print('The RMSE on the test data is:',
np.sqrt(mean_squared_error(y_test,y_predicted)))
The RMSE on the test data is: 0.2344360049678886
The RMSE is low for both the training data and the test data. What's more, the two values are nearly equal, which suggests that overfitting is not an issue for our model.
print('The mean absolute error for the training data is:',
mean_absolute_error(y_train,model_perf['Fitted']))
The mean absolute error for the training data is: 0.18154048705559045
print('The mean absolute error for the test data is:',
mean_absolute_error(y_test,y_predicted))
The mean absolute error for the test data is: 0.18160092038083928
Again, the values for the training data and the test data are quite close. The low mean absolute error on the test data indicates that the average error committed by the model in predicting normalized_used_price is only about 0.18 (on the normalized price scale, not directly in euros).
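If the normalized prices are natural-log prices, which is an assumption on my part (it is consistent with values like exp(4.3) ≈ 74), then an MAE of roughly 0.18 on the log scale translates into a multiplicative error on the original price scale:

```python
import numpy as np

mae_log = 0.18  # MAE on the normalized (assumed natural-log) price scale

# an average error of 0.18 log-units means predictions are off by a factor of ~1.2
factor = np.exp(mae_log)
print(factor)  # about 1.197, i.e. typically within roughly 20% of the actual price
```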
We conclude that this is a strong model that performs well on out-of-sample data.
For reference, the model parameters are listed below. We can use these to draw business conclusions.
model.params
const 1.300184 screen_size 0.042756 main_camera_mp 0.020663 selfie_camera_mp 0.013775 int_memory 0.000137 ram 0.014610 normalized_new_price 0.418463 battery_by_weight -0.004308 brand_name_Alcatel -0.017111 brand_name_Apple -0.020045 brand_name_Asus 0.032903 brand_name_BlackBerry 0.082250 brand_name_Celkon -0.291263 brand_name_Coolpad 0.008939 brand_name_Gionee 0.012789 brand_name_Google 0.043397 brand_name_HTC -0.007726 brand_name_Honor 0.009408 brand_name_Huawei -0.024759 brand_name_Infinix 0.324816 brand_name_Karbonn 0.008970 brand_name_LG -0.035735 brand_name_Lava -0.007912 brand_name_Lenovo 0.023106 brand_name_Meizu -0.023462 brand_name_Micromax -0.021357 brand_name_Microsoft 0.024526 brand_name_Motorola -0.066186 brand_name_Nokia 0.059966 brand_name_OnePlus 0.002819 brand_name_Oppo 0.009788 brand_name_Panasonic -0.010076 brand_name_Realme 0.008064 brand_name_Samsung -0.030627 brand_name_Sony -0.067437 brand_name_Spice -0.023866 brand_name_Vivo -0.030908 brand_name_XOLO -0.064353 brand_name_Xiaomi 0.075372 brand_name_ZTE -0.021065 4g_yes 0.069481 age_old -0.046131 dtype: float64
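To surface the strongest drivers for the business discussion, the coefficients can be ranked by magnitude (on the live model, via `model.params`). As a standalone illustration using a handful of coefficients copied from the output above:

```python
import pandas as pd

# a subset of the fitted coefficients, copied from model.params above
coefs = pd.Series({
    'normalized_new_price': 0.418463,
    'brand_name_Infinix': 0.324816,
    'brand_name_Celkon': -0.291263,
    '4g_yes': 0.069481,
    'age_old': -0.046131,
    'screen_size': 0.042756,
})

# sort by absolute size to see the largest effects first
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

One caveat: the continuous features were not standardized, so coefficient magnitudes are directly comparable only among the dummy variables.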
Model Insights
Mobile Market Recommendations