Mobile Market is a retailer leveraging machine learning to profit in the growing secondary market for mobile devices. Our goal is to analyze their supplied data and lend insights—such as factors that influence price—through a linear model.
Data from 2021 contains the following features:
# math and data
import numpy as np
import scipy.stats as stats
import pandas as pd
# visualization
import matplotlib.pyplot as plt
import seaborn as sns
# to split the data set
from sklearn.model_selection import train_test_split
# to build the linear model
import statsmodels
import statsmodels.api as sm
phone=pd.read_csv('dataset.csv')
phone_copy=phone.copy()
phone.shape
(3454, 15)
phone.head()
 | brand_name | os | screen_size | 4g | 5g | main_camera_mp | selfie_camera_mp | int_memory | ram | battery | weight | release_year | days_used | normalized_used_price | normalized_new_price
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | Honor | Android | 14.50 | yes | no | 13.0 | 5.0 | 64.0 | 3.0 | 3020.0 | 146.0 | 2020 | 127 | 4.307572 | 4.715100 |
1 | Honor | Android | 17.30 | yes | yes | 13.0 | 16.0 | 128.0 | 8.0 | 4300.0 | 213.0 | 2020 | 325 | 5.162097 | 5.519018 |
2 | Honor | Android | 16.69 | yes | yes | 13.0 | 8.0 | 128.0 | 8.0 | 4200.0 | 213.0 | 2020 | 162 | 5.111084 | 5.884631 |
3 | Honor | Android | 25.50 | yes | yes | 13.0 | 8.0 | 64.0 | 6.0 | 7250.0 | 480.0 | 2020 | 345 | 5.135387 | 5.630961 |
4 | Honor | Android | 15.32 | yes | no | 13.0 | 8.0 | 64.0 | 3.0 | 5000.0 | 185.0 | 2020 | 293 | 4.389995 | 4.947837 |
This is a sizable data set: 15 columns with over 3400 entries! I'm curious whether OS will end up being a significant factor; I'd guess brand will matter more. Additionally, the normalized prices both seem surprisingly low, so I'll follow up with histograms and boxplots later.
phone.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3454 entries, 0 to 3453 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 brand_name 3454 non-null object 1 os 3454 non-null object 2 screen_size 3454 non-null float64 3 4g 3454 non-null object 4 5g 3454 non-null object 5 main_camera_mp 3275 non-null float64 6 selfie_camera_mp 3452 non-null float64 7 int_memory 3450 non-null float64 8 ram 3450 non-null float64 9 battery 3448 non-null float64 10 weight 3447 non-null float64 11 release_year 3454 non-null int64 12 days_used 3454 non-null int64 13 normalized_used_price 3454 non-null float64 14 normalized_new_price 3454 non-null float64 dtypes: float64(9), int64(2), object(4) memory usage: 404.9+ KB
A first inspection shows the data types all appear to be correct. From the head, we can see that 4g and 5g are recorded as 'yes' or 'no', so it follows that these columns would be objects.
phone.isnull().sum()
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 179 selfie_camera_mp 2 int_memory 4 ram 4 battery 6 weight 7 release_year 0 days_used 0 normalized_used_price 0 normalized_new_price 0 dtype: int64
We do have some null entries, most notably in the main_camera_mp column with 179. Even so, that is only around 5% of the data. The other columns with null entries are selfie_camera_mp, int_memory, ram, battery, and weight.
There is a possibility that brand_name or os have missing data as '?' or 'unknown', so let's check for that too.
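Beyond eyeballing value counts, placeholder strings can be scanned for directly with `isin`. A minimal sketch on a toy frame (the placeholder strings checked here are assumptions, not confirmed categories in the data):

```python
import pandas as pd

# toy frame standing in for phone[['brand_name', 'os']]
toy = pd.DataFrame({'brand_name': ['Honor', '?', 'Samsung'],
                    'os': ['Android', 'Android', 'unknown']})

# hypothetical placeholder strings to scan for
placeholders = ['?', 'unknown', 'Unknown', 'N/A']

# count suspect entries per column
counts = toy.isin(placeholders).sum()
print(counts)
```

The same call applied to the real columns would surface any such placeholders without reading through every category by hand.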
phone['brand_name'].value_counts()
Others 502 Samsung 341 Huawei 251 LG 201 Lenovo 171 ZTE 140 Xiaomi 132 Oppo 129 Asus 122 Alcatel 121 Micromax 117 Vivo 117 Honor 116 HTC 110 Nokia 106 Motorola 106 Sony 86 Meizu 62 Gionee 56 Acer 51 XOLO 49 Panasonic 47 Realme 41 Apple 39 Lava 36 Celkon 33 Spice 30 Karbonn 29 Coolpad 22 BlackBerry 22 Microsoft 22 OnePlus 22 Google 15 Infinix 10 Name: brand_name, dtype: int64
There's a catchall 'Others' group. This is effectively missing data and could distort our model, since presumably more than one brand is lumped into the category. We will revisit this during model tuning.
phone['os'].value_counts()
Android 3214 Others 137 Windows 67 iOS 36 Name: os, dtype: int64
There is also an 'Others' label here. Again, we may need to remove these to create a powerful model.
phone.duplicated().sum()
0
There are no duplicate rows.
phone.describe().T
 | count | mean | std | min | 25% | 50% | 75% | max
---|---|---|---|---|---|---|---|---
screen_size | 3454.0 | 13.713115 | 3.805280 | 5.080000 | 12.700000 | 12.830000 | 15.340000 | 30.710000 |
main_camera_mp | 3275.0 | 9.460208 | 4.815461 | 0.080000 | 5.000000 | 8.000000 | 13.000000 | 48.000000 |
selfie_camera_mp | 3452.0 | 6.554229 | 6.970372 | 0.000000 | 2.000000 | 5.000000 | 8.000000 | 32.000000 |
int_memory | 3450.0 | 54.573099 | 84.972371 | 0.010000 | 16.000000 | 32.000000 | 64.000000 | 1024.000000 |
ram | 3450.0 | 4.036122 | 1.365105 | 0.020000 | 4.000000 | 4.000000 | 4.000000 | 12.000000 |
battery | 3448.0 | 3133.402697 | 1299.682844 | 500.000000 | 2100.000000 | 3000.000000 | 4000.000000 | 9720.000000 |
weight | 3447.0 | 182.751871 | 88.413228 | 69.000000 | 142.000000 | 160.000000 | 185.000000 | 855.000000 |
release_year | 3454.0 | 2015.965258 | 2.298455 | 2013.000000 | 2014.000000 | 2015.500000 | 2018.000000 | 2020.000000 |
days_used | 3454.0 | 674.869716 | 248.580166 | 91.000000 | 533.500000 | 690.500000 | 868.750000 | 1094.000000 |
normalized_used_price | 3454.0 | 4.364712 | 0.588914 | 1.536867 | 4.033931 | 4.405133 | 4.755700 | 6.619433 |
normalized_new_price | 3454.0 | 5.233107 | 0.683637 | 2.901422 | 4.790342 | 5.245892 | 5.673718 | 7.847841 |
Some highlights of the statistical summary: RAM clusters tightly at 4GB (the 25th through 75th percentiles are all 4GB), internal memory is heavily right-skewed (median 32GB against a maximum of 1024GB), and weight has extreme outliers (maximum 855g against a median of 160g).
phone.describe(include='object').T
 | count | unique | top | freq
---|---|---|---|---
brand_name | 3454 | 34 | Others | 502 |
os | 3454 | 4 | Android | 3214 |
4g | 3454 | 2 | yes | 2335 |
5g | 3454 | 2 | no | 3302 |
For the non-numeric columns, there are 34 brands, though the catchall 'Others' category is the most common. Android is the OS on the overwhelming majority of devices on Mobile Market, and 4G connectivity is far more common than 5G.
sns.set_theme()
We'll start with a first glance at each of our columns.
plt.figure(figsize=(8,5))
plt.title('Countplot of Phone Brands',fontsize=16)
sns.countplot(data=phone,
x='brand_name',
order=phone['brand_name'].value_counts().index)
plt.xticks(rotation=90)
plt.xlabel('Brand Name');
Aside from the 'Others' designation, Samsung is the most common brand in the Mobile Market system, followed by Huawei, LG, and Lenovo. OnePlus, Google, and Infinix are the least common brands.
plt.figure(figsize=(8,5))
plt.title('Countplot of Phone OS',fontsize=16)
sns.countplot(data=phone,
x='os',
order=phone['os'].value_counts().index)
plt.xlabel('Operating System');
phone['os'].value_counts(normalize=True)
Android 0.930515 Others 0.039664 Windows 0.019398 iOS 0.010423 Name: os, dtype: float64
Android is far and away the most popular phone OS in the Mobile Market system. It is installed on 93% of the phones.
plt.figure(figsize=(8,5))
plt.title('Histogram of Screen Size',fontsize=16)
sns.histplot(data=phone,x='screen_size',bins=12)
plt.xlabel('Screen Size (cm on the diagonal)');
The most common screen size is around 12.5cm on the diagonal (about 5in). The most popular screen sizes fall between 10cm and 18cm. There are some devices on Mobile Market with screens larger than 20cm; there are roughly as many of these as there are devices with screens smaller than 10cm.
plt.figure(figsize=(10,5))
# 4g
plt.subplot(1,2,1)
plt.title('4G Capable?')
sns.countplot(data=phone,x='4g',order=('yes','no'))
# 5g
plt.subplot(1,2,2)
plt.title('5G Capable?')
sns.countplot(data=phone,x='5g',order=('yes','no'))
plt.tight_layout()
While roughly two-thirds of phones on Mobile Market have 4G capability, very few have 5G.
# import numeric data type check from pandas
from pandas.api.types import is_numeric_dtype
def hist_box(df,col,*,kde=False):
'''Plot a histogram and boxplot
of numeric column in data frame.
Optional KDE boolean.'''
# include only numeric cols
if not is_numeric_dtype(df[col]):
raise TypeError('Column must be numeric.')
# figure setup
plt.figure(figsize=(12,5))
# histogram
plt.subplot(1,2,1)
plt.title('Histogram of '+str(col),fontsize=14)
sns.histplot(data=df,x=col,kde=kde)
# boxplot
plt.subplot(1,2,2)
plt.title('Boxplot of '+str(col),fontsize=14)
sns.boxplot(data=df,x=col)
# display figure
plt.tight_layout()
return
hist_box(phone,'main_camera_mp')
The distribution of main camera resolution is irregular, but 75% of the observations fall at or below 13MP.
hist_box(phone,'selfie_camera_mp')
The histogram of selfie camera resolution would lead one to believe these observations are more spread out than those for the main camera, but the boxplot shows that the data is still concentrated on the lower end. There are just more outliers in these observations that skew the data.
hist_box(phone,'int_memory')
Internal memory is heavily skewed by outliers, so let's visualize the data without the outliers.
plt.figure(figsize=(8,5))
plt.title('Internal Memory (Outliers Removed)',fontsize=16)
sns.boxplot(data=phone,x='int_memory',showfliers=False)
plt.xlabel('Memory (GB)');
Now we can see that almost all phones on Mobile Market have between 0 and 130GB of internal storage. The data still looks right-skewed, as the lower 50% is narrower than the upper 50%. However, memory typically comes in powers of 2, and the 25th, 50th, and 75th percentiles are 16GB, 32GB, and 64GB respectively. In other words, internal memory is not a continuous variable, so the skewed appearance of the data is likely misleading.
two_powers=[float(2**i) for i in range(12)]
mem_counts=phone['int_memory'].value_counts(normalize=True)
# sum the shares of capacities that are exact powers of 2
percent=mem_counts[mem_counts.index.isin(two_powers)].sum()
print(percent)
0.9875362318840579
In fact, a quick calculation shows that almost all (98.75%) internal memory comes as a power of 2.
hist_box(phone,'ram')
phone['ram'].value_counts(normalize=True)
4.00 0.815942 6.00 0.044638 8.00 0.037681 2.00 0.026087 0.25 0.024058 3.00 0.023478 1.00 0.009855 12.00 0.005217 0.02 0.005217 0.03 0.004928 0.50 0.002609 1.50 0.000290 Name: ram, dtype: float64
We find that 81.6% of phones on Mobile Market have 4GB of RAM. Other common options are 2GB, 6GB, and 8GB.
hist_box(phone,'battery',kde=True)
The battery data is concentrated between roughly 1500mAh and 5000mAh, with three peaks in between. Battery capacity is right skewed with many outliers.
plt.figure(figsize=(8,5))
plt.title('Battery Capacity (Outliers Removed)',fontsize=16)
sns.boxplot(data=phone,x='battery',showfliers=False)
plt.xlabel('Capacity (mAh)');
Without the outliers, battery capacity looks more evenly distributed. The median capacity is around 3000mAh, with the middle 50% concentrated between 2000mAh and 4000mAh.
hist_box(phone,'weight')
plt.figure(figsize=(8,5))
plt.title('Device Weight (Outliers Removed)',fontsize=16)
sns.boxplot(data=phone,x='weight',showfliers=False)
plt.xlabel('Weight (g)');
While device weight is generally concentrated between 75g and 250g, there are too many outliers to throw them all away. In particular, there's a real possibility these outliers will have an effect on our model.
phone['release_year'].unique()
array([2020, 2019, 2013, 2014, 2016, 2018, 2015, 2017])
plt.figure(figsize=(8,5))
plt.title('Device Release Year',fontsize=16)
sns.countplot(data=phone,x='release_year')
plt.xlabel('Year');
The most common release year is 2014, followed by 2013 and 2015. There's a dip in popularity after 2015, with 2019 being the exception.
hist_box(phone,'days_used')
Days used is fairly evenly distributed, with a broad peak around 600 days.
hist_box(phone,'normalized_new_price',kde=True)
Normalized new price is approximately normally distributed with long tails on both sides. The peak is around 5€. The boxplot shows that there are many outliers.
hist_box(phone,'normalized_used_price',kde=True)
Used price is distributed smoothly between 2€ and 6€ with a slight left skew. The peak is around 4.25€, and the boxplot shows many outliers.
# comparison of new price and used price
plt.figure(figsize=(8,5))
plt.title('New Price Versus Used Price',fontsize=16)
sns.histplot(data=phone,x='normalized_new_price',
color='darkslateblue',label='New',kde=True)
sns.histplot(data=phone,x='normalized_used_price',
color='orange',label='Used',kde=True)
plt.legend()
plt.xlabel('Price (€)');
Here is a direct comparison of new and used prices. We see that new prices generally trend higher.
Next, let's look at how RAM varies by brand.
plt.figure(figsize=(20,30))
rows=1+len(phone['brand_name'].unique())//4
# subplot generating loop
for idx, brand in enumerate(phone['brand_name'].sort_values().unique()):
plt.subplot(rows,4,idx+1)
plt.title(brand)
sns.histplot(data=phone[phone['brand_name']==brand]['ram'])
plt.tight_layout()
What's clear from this breakdown is that every phone brand offers a product with 4GB of RAM on Mobile Market, and it is unmistakably the most common device configuration. For Microsoft, Panasonic, and XOLO, the data set only includes devices with 4GB of RAM. Celkon and Infinix are the only brands for which 4GB of RAM is not the most common configuration. Like most other brands, the most common configuration for OnePlus phones is 4GB of RAM, but unlike the others, OnePlus phones have a very high proportion of offerings with 8GB of RAM.
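The same breakdown can be summarized numerically with a crosstab of brand against RAM. A sketch on a toy frame (the brand/RAM rows below are illustrative, not taken from the data set):

```python
import pandas as pd

# toy rows standing in for phone[['brand_name', 'ram']]
toy = pd.DataFrame({'brand_name': ['Honor', 'Honor', 'OnePlus', 'OnePlus', 'OnePlus'],
                    'ram': [4.0, 4.0, 4.0, 8.0, 8.0]})

# count of each RAM configuration per brand
table = pd.crosstab(toy['brand_name'], toy['ram'])
print(table)
```

Applied to the full data, one table replaces the 34-panel histogram grid when exact counts are needed.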
Next we will look at the weight of devices with high-capacity batteries.
# collect phones with battery capacity >4500mAh
heavy=phone[phone['battery']>4500]
plt.figure(figsize=(8,5))
plt.title('High-Capacity Batteries (>4500 mAh) and Weight',fontsize=16)
sns.regplot(data=heavy,x='battery',y='weight')
plt.xlabel('Capacity (mAh)')
plt.ylabel('Weight (g)');
There is a positive trend between battery capacity and phone weight. There are, however, a few observations far from the regression line that have high capacity batteries but lower weight. Such devices are generally more valuable since they last longer on a single charge but do not weigh much.
We will now examine several aspects across brands, the first being the distribution of large screen devices.
# number of phones with screen size >6in = 15.24cm
phone[phone['screen_size']>15.24].shape[0]
1099
plt.figure(figsize=(8,5))
plt.title('Brands with Large Screen (>6in diagonal)',fontsize=16)
sns.countplot(data=phone[phone['screen_size']>15.24],
x='brand_name',
order=phone[phone['screen_size']>15.24]['brand_name'].value_counts().index)
plt.xticks(rotation=90)
plt.xlabel('Brand Name');
There are 1099 devices in this data set with a screen size greater than 6in diagonally. Huawei and Samsung offer the greatest number of devices with large screens. Panasonic, Spice, and Microsoft offer very few large screen devices through Mobile Market.
plt.figure(figsize=(8,5))
plt.title('Brands with High-Resolution Selfie Camera (>8MP)',fontsize=16)
sns.countplot(data=phone[phone['selfie_camera_mp']>8],
x='brand_name',
order=phone[phone['selfie_camera_mp']>8]['brand_name'].value_counts().index)
plt.xticks(rotation=90)
plt.xlabel('Brand Name');
Huawei, Vivo, and Oppo offer the greatest number of devices with a high resolution selfie camera, while Micromax, Panasonic, and Acer offer the fewest on Mobile Market.
# logarithmic regression of memory vs used price
plt.figure(figsize=(8,5))
plt.title('Memory Capacity and Used Price',fontsize=16)
sns.regplot(data=phone,
x='int_memory',
y='normalized_used_price',
logx=True,
x_jitter=10,
marker='+',
scatter_kws={'color':'blue'},
line_kws={'color':'red'},
ci=False)
plt.xlabel('Memory (GB)')
plt.ylabel('Used Price (€)');
There seems to be a positive trend between device memory capacity and normalized used price. In particular, price appears to be proportional to the logarithm of memory capacity.
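The claimed logarithmic relationship can be sanity-checked by fitting price against log2(memory). A sketch with synthetic, noise-free numbers (the memory/price pairs below are made up for illustration):

```python
import numpy as np

# hypothetical (memory GB, used price) pairs following a log trend
mem = np.array([8, 16, 32, 64, 128, 256], dtype=float)
price = 3.5 + 0.25 * np.log2(mem)  # synthetic data for illustration

# a linear fit in log2(memory) recovers the slope per doubling of capacity
slope, intercept = np.polyfit(np.log2(mem), price, 1)
print(round(slope, 3), round(intercept, 3))
```

On the real columns, a clearly positive slope with a decent fit would support the logarithmic reading of the regplot.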
plt.figure(figsize=(12,8))
plt.title('Correlation Heatmap',fontsize=16)
sns.heatmap(phone.corr(),annot=True,cmap='Spectral');
The correlation matrix highlights several previously unseen relationships. There are some clear ones, like the strong negative correlation between release year and days_used. But we have detected moderately positive correlations between normalized_used_price and the following: screen_size, main_camera_mp, selfie_camera_mp, and battery. We will visualize these, along with the strong positive relationship between battery capacity and weight.
plt.figure(figsize=(16,16))
plt.suptitle('Used Price versus Various Features', fontsize=16)
# screen_size
plt.subplot(2,2,1)
sns.scatterplot(data=phone,x='screen_size',y='normalized_used_price')
plt.xlabel('Screen Size (cm diagonal)')
plt.ylabel('Price (€)')
# main_camera_mp
plt.subplot(2,2,2)
sns.scatterplot(data=phone,x='main_camera_mp',y='normalized_used_price')
plt.xlabel('Main Camera Resolution (MP)')
plt.ylabel('Price (€)')
# selfie_camera_mp
plt.subplot(2,2,3)
sns.scatterplot(data=phone,x='selfie_camera_mp',y='normalized_used_price')
plt.xlabel('Selfie Camera Resolution (MP)')
plt.ylabel('Price (€)')
# battery
plt.subplot(2,2,4)
sns.scatterplot(data=phone,x='battery',y='normalized_used_price')
plt.xlabel('Battery Capacity (mAh)')
plt.ylabel('Price (€)')
plt.show()
The four scatterplots all illustrate positive correlation but have decidedly different distributions. Note too that some of the relationships have a bit of a curve to the trend, like a logarithmic relationship.
Next, let's look at battery capacity, weight, and screen size of devices on Mobile Market.
plt.figure(figsize=(8,5))
plt.title('Battery and Weight',fontsize=16)
sns.scatterplot(data=phone,x='battery',y='weight')
plt.xlabel('Capacity (mAh)')
plt.ylabel('Weight (g)');
plt.figure(figsize=(16,5))
plt.suptitle('Screen Size versus Battery and Weight',fontsize=16)
# battery
plt.subplot(1,2,1)
sns.scatterplot(data=phone,x='screen_size',y='battery')
plt.xlabel('Screen Size (cm diagonal)')
plt.ylabel('Capacity (mAh)')
# weight
plt.subplot(1,2,2)
sns.scatterplot(data=phone,x='screen_size',y='weight')
plt.xlabel('Screen Size (cm diagonal)')
plt.ylabel('Weight (g)');
While we can observe a positive trend in the first plot, the correlations are more apparent in the latter two (relating screen size to battery and weight).
Consider next days used and new price.
plt.figure(figsize=(8,5))
plt.title('Days Used and New Price')
sns.scatterplot(data=phone,x='days_used',y='normalized_new_price')
plt.xlabel('Days')
plt.ylabel('Price (€)');
plt.figure(figsize=(8,5))
plt.title('Days Used and New Price')
sns.scatterplot(data=phone,x='days_used',y='normalized_new_price',hue='release_year')
plt.xlabel('Days')
plt.ylabel('Price (€)')
plt.legend(title='');
On the other hand, there is no apparent trend in the relationship between days_used and normalized_new_price; the scatterplot is just noise. But by coloring points with the release year, we find that phones released earlier have generally been used for longer. Not a big surprise!
plt.figure(figsize=(8,5))
plt.title('Release Year and Days Used',fontsize=16)
sns.boxplot(data=phone,x='release_year',y='days_used')
plt.xlabel('Year')
plt.ylabel('Days Used');
Breaking the relationship between release_year and days_used down further, we find that the spread of days_used is similar for release years up to 2016, after which days_used drops steadily for more recently released phones.
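The trend in the boxplot can also be put into numbers with a groupby. A sketch on toy data (the year/days pairs below are illustrative):

```python
import pandas as pd

# toy rows standing in for phone[['release_year', 'days_used']]
toy = pd.DataFrame({'release_year': [2014, 2014, 2018, 2018, 2020, 2020],
                    'days_used':    [900,  800,  500,  450,  150,  120]})

# median days used per release year: newer phones have been used less
medians = toy.groupby('release_year')['days_used'].median()
print(medians)
```

The same one-liner on the real frame would give the exact medians behind each box.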
sns.jointplot(data=phone,
x='normalized_new_price',
y='normalized_used_price',
hue='os',
kind='kde');
Lastly, looking at the density of new price versus used price, we can identify specific price brackets broken down by operating system. Devices running Android have the greatest variability in price. On the other hand, iOS appears on higher-priced devices, namely Apple products. The least expensive devices, new or used, run an operating system other than Android, Windows, or iOS.
phone.isnull().sum()
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 179 selfie_camera_mp 2 int_memory 4 ram 4 battery 6 weight 7 release_year 0 days_used 0 normalized_used_price 0 normalized_new_price 0 dtype: int64
The feature with the most missing data is main_camera_mp. We will impute the null values with the median camera resolution for the brand of that row in the dataframe.
# collect medians for main_camera_mp by brand
camera_medians={}
for brand in phone['brand_name'].unique():
    med=phone.loc[phone['brand_name']==brand,'main_camera_mp'].median()
    # brands whose resolutions are all null fall back to 0
    camera_medians[brand]=med if not np.isnan(med) else 0
# impute main_camera_mp with brand median
for idx,val in phone['main_camera_mp'].items():
    if np.isnan(val):
        phone.loc[idx,'main_camera_mp']=camera_medians[phone.loc[idx,'brand_name']]
/usr/local/lib/python3.7/dist-packages/numpy/lib/nanfunctions.py:1117: RuntimeWarning: Mean of empty slice return np.nanmean(a, axis, out=out, keepdims=keepdims)
phone.isnull().sum()
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 0 selfie_camera_mp 2 int_memory 4 ram 4 battery 6 weight 7 release_year 0 days_used 0 normalized_used_price 0 normalized_new_price 0 dtype: int64
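For reference, the brand-wise median imputation can also be written in one pass with pandas groupby/transform. A sketch on a toy frame (the brand labels are made up; a brand with no data at all yields NaN and falls back to 0, matching the loop's behavior):

```python
import numpy as np
import pandas as pd

# toy stand-in for phone[['brand_name', 'main_camera_mp']]
toy = pd.DataFrame({'brand_name': ['A', 'A', 'A', 'B', 'B'],
                    'main_camera_mp': [8.0, 12.0, np.nan, np.nan, np.nan]})

# per-row brand median; all-null brands produce NaN, replaced by 0
brand_median = toy.groupby('brand_name')['main_camera_mp'].transform('median')
toy['main_camera_mp'] = toy['main_camera_mp'].fillna(brand_median.fillna(0))
print(toy['main_camera_mp'].tolist())  # [8.0, 12.0, 10.0, 0.0, 0.0]
```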
We will drop rows with the last few null entries, as we will not be losing much data.
# drop rows with remain few null entries
phone.dropna(axis=0,inplace=True)
phone.isnull().sum()
brand_name 0 os 0 screen_size 0 4g 0 5g 0 main_camera_mp 0 selfie_camera_mp 0 int_memory 0 ram 0 battery 0 weight 0 release_year 0 days_used 0 normalized_used_price 0 normalized_new_price 0 dtype: int64
We can confirm that there are no more null entries. Next, we will also drop any row where both the brand name and the OS have the value 'Others', since this combination is too vague to contribute positively to our model.
print('Number of rows before dropping "Other/Other":',
phone.shape[0])
print('Number of "Other/Other" rows:',
phone[(phone['brand_name']=='Others')&(phone['os']=='Others')].shape[0])
Number of rows before dropping "Other/Other": 3432 Number of "Other/Other" rows: 28
# drop rows with brand_name=Others AND os=Others
other_idx=phone[(phone['brand_name']=='Others')&(phone['os']=='Others')].index
phone.drop(other_idx,axis=0,inplace=True)
print('Number of rows after dropping "Other/Other":',phone.shape[0])
Number of rows after dropping "Other/Other": 3404
We have dropped all 28 Other/Other rows.
phone.reset_index(inplace=True,drop=True)
After five experiments building the linear regression, I have found that the precise release year is not as important as whether the device is older or newer. I can confirm that this change will significantly reduce multicollinearity with negligible effect on the final R-squared. I used the boxplot from the EDA, recalled below, to define the threshold for an old phone.
plt.figure(figsize=(8,5))
plt.title('Release Year and Days Used',fontsize=16)
sns.boxplot(data=phone,x='release_year',y='days_used')
plt.xlabel('Year')
plt.ylabel('Days Used');
# classify rows: releases through 2016 count as 'old'
old_phones=[2013,2014,2015,2016]
# generate feature
phone['age']=pd.Categorical(np.where(phone['release_year'].isin(old_phones),'old','new'))
phone=phone.drop(['release_year'],axis=1)
Since over 80% of devices on Mobile Market have 4GB of RAM, we will create a new categorical column. It will have a 1 if the device has 4GB of RAM and a 0 otherwise. This is one way of treating outliers in RAM.
# generate feature: 1 if the device has exactly 4GB of RAM, else 0
phone['ram4']=pd.Categorical((phone['ram']==4.0).astype(int))
We will also add a column for the ratio of battery capacity to weight. We can confidently design this feature because battery and weight are strongly correlated. This will have the effect of retaining data about battery and weight while reducing dimension.
phone['battery_by_weight']=phone['battery']/phone['weight']
I feel we have adequately addressed outliers. Those visible on the boxplots for weight or price are necessary aspects of the selection process. Our model would be incapable of predicting price for the breadth of devices on Mobile Market without these data points.
hist_box(phone,'screen_size')
Screen size remains highly variable after our data transformations.
plt.figure(figsize=(10,5))
# 4g
plt.subplot(1,2,1)
plt.title('4G Capable?')
sns.countplot(data=phone,x='4g',order=('yes','no'))
# 5g
plt.subplot(1,2,2)
plt.title('5G Capable?')
sns.countplot(data=phone,x='5g',order=('yes','no'))
plt.tight_layout()
The proportions of 4G-capable and 5G-capable phones are essentially unchanged from before.
plt.figure(figsize=(8,5))
plt.title('Device Camera Resolutions',fontsize=16)
# main camera
sns.histplot(phone,x='main_camera_mp',
bins=12,
color='darkslateblue',
label='main')
#selfie camera
sns.histplot(phone,x='selfie_camera_mp',
bins=12,
color='orange',
label='selfie')
plt.xlabel('Resolution (MP)')
plt.legend();
The selfie camera generally has a lower resolution than the main camera. This makes sense, as there is more room in the device frame for a main camera sensor than a selfie camera, since the latter is crowded by the display panel.
plt.figure(figsize=(8,5))
plt.title('Internal Memory',fontsize=16)
sns.histplot(phone,x='int_memory',bins=12)
plt.xlabel('Memory Capacity (GB)');
Since memory comes in powers of 2, and less memory is generally cheaper, it tracks that most of the devices have less memory, but that the outliers are far greater than the bulk of the data. The capacity is literally increasing exponentially!
plt.figure(figsize=(8,5))
plt.title('Internal Memory (No Outliers)',fontsize=16)
sns.boxplot(data=phone,x='int_memory',showfliers=False)
plt.xlabel('Memory Capacity (GB)');
To understand the spread of the majority of our data, consider the boxplot above. Outliers are hidden in this plot, and we see that the middle 50% of the data is between 16GB and 64GB, with a median of 32GB (all powers of 2).
plt.figure(figsize=(8,5))
plt.title('Screen Size and RAM',fontsize=16)
sns.regplot(data=phone,x='screen_size',y='ram')
plt.xlabel('Screen Size (cm diagonal)')
plt.ylabel('RAM (GB)');
Generally a larger screen, in particular more pixels, requires more RAM for the device to operate smoothly. The trend is visible here, albeit only mildly.
plt.figure(figsize=(16,5))
plt.suptitle('Radio Capabilities and Device Age',fontsize=16)
plt.subplot(1,2,1)
sns.countplot(data=phone,x='age',hue='4g',hue_order=('yes','no'))
plt.subplot(1,2,2)
sns.countplot(data=phone,x='age',hue='5g',hue_order=('yes','no'));
Here we see that new devices are far more likely to have 4G radios; almost no new devices lack one. Conversely, about half of old devices lack 4G connectivity.
There are no old devices available on Mobile Market with 5G connectivity, and most of the new devices don't have it either.
Now we can start preparing our data for modeling. In particular, let's use one hot encoding for our categorical variables.
phone_dummies=pd.get_dummies(data=phone,drop_first=True)
# dataframe of independent variables plus constant
X=phone_dummies.drop(['normalized_used_price'],axis=1)
X=sm.add_constant(X)
# response variable series
y=phone_dummies['normalized_used_price']
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/tsatools.py:142: FutureWarning: In a future version of pandas all arguments of concat except for the argument 'objs' will be keyword-only x = pd.concat(x[::order], 1)
# split data
X_train, X_test, y_train, y_test=train_test_split(X,y,test_size=0.3,random_state=1)
# train first model
model=sm.OLS(y_train,X_train)
results=model.fit()
We start by defining functions that will help us assess model performance. The first function generates a table of VIF stats and tests R-squared values for different dropped columns. The second incorporates the first into a data-rich model tuning interface. We will use the standard VIF cutoff of 5, but the function allows different thresholds.
# import variance inflation factor function
from statsmodels.stats.outliers_influence import variance_inflation_factor
def high_vif_stats(y_data,X_data,vif_cutoff=5):
'''Summarize VIF and test R-squared
for VIF > cutoff.'''
# generate vif table
vif=pd.Series([variance_inflation_factor(X_data.values,i) for i in range(X_data.shape[1])],
index=X_data.columns).sort_values(ascending=False)
print('Variance Inflation Factors')
print(vif)
# collect high vif columns (omit constant col)
high_vif_col=[]
for idx, val in vif.drop(['const'],axis=0).items():
if val>vif_cutoff:
high_vif_col.append(idx)
if not high_vif_col:
print('\nThere are no features with VIF greater than',vif_cutoff)
return
# create dataframe with original R-squared
results=sm.OLS(y_data,X_data).fit()
rsq_comp=pd.DataFrame({'R-squared':np.round(results.rsquared,3),
'Adj. R-squared':np.round(results.rsquared_adj,3)},
index=['Before'])
# tabulate R-square and adj. R-squared without high vif cols
for col in high_vif_col:
res=sm.OLS(y_data,X_data.drop([col],axis=1)).fit()
df=pd.DataFrame({'R-squared':np.round(res.rsquared,3),
'Adj. R-squared':np.round(res.rsquared_adj,3)},
index=['Without '+col])
rsq_comp=pd.concat([rsq_comp,df],axis=0)
# print table of R-squared values
title='R-squared for data with dropped high-VIF columns'
print('='*len(title))
print(title)
print(rsq_comp)
return
def model_tuning(y_data,X_data,*,threshold=5,hide_vif=False):
'''Presents summary of OLS model
and multicollinearity check given
VIF threshold (default=5).'''
# generate model
model=sm.OLS(y_data,X_data).fit()
# print model statistics and high VIF table
print(model.summary())
if not hide_vif:
print('\n\n')
high_vif_stats(y_data,X_data,vif_cutoff=threshold)
return model
# print series without truncating
pd.set_option('display.max_rows',None)
model_tuning(y_train,X_train)
OLS Regression Results ================================================================================= Dep. Variable: normalized_used_price R-squared: 0.842 Model: OLS Adj. R-squared: 0.838 Method: Least Squares F-statistic: 247.8 Date: Sat, 08 Oct 2022 Prob (F-statistic): 0.00 Time: 04:45:01 Log-Likelihood: 131.16 No. Observations: 2382 AIC: -160.3 Df Residuals: 2331 BIC: 134.2 Df Model: 50 Covariance Type: nonrobust ========================================================================================= coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------------- const 1.3025 0.079 16.399 0.000 1.147 1.458 screen_size 0.0235 0.003 6.764 0.000 0.017 0.030 main_camera_mp 0.0214 0.001 14.830 0.000 0.019 0.024 selfie_camera_mp 0.0146 0.001 12.402 0.000 0.012 0.017 int_memory 0.0002 6.09e-05 2.571 0.010 3.71e-05 0.000 ram 0.0144 0.005 2.738 0.006 0.004 0.025 battery -2.531e-05 1.29e-05 -1.967 0.049 -5.06e-05 -7.64e-08 weight 0.0012 0.000 5.808 0.000 0.001 0.002 days_used -4.071e-05 2.98e-05 -1.365 0.172 -9.92e-05 1.78e-05 normalized_new_price 0.4132 0.012 34.505 0.000 0.390 0.437 battery_by_weight 0.0044 0.003 1.572 0.116 -0.001 0.010 brand_name_Alcatel -0.0150 0.046 -0.323 0.747 -0.106 0.076 brand_name_Apple 0.0483 0.148 0.326 0.745 -0.242 0.339 brand_name_Asus 0.0144 0.047 0.309 0.758 -0.077 0.106 brand_name_BlackBerry 0.0652 0.072 0.911 0.363 -0.075 0.206 brand_name_Celkon -0.2933 0.069 -4.252 0.000 -0.429 -0.158 brand_name_Coolpad 0.0045 0.072 0.062 0.951 -0.137 0.146 brand_name_Gionee -0.0109 0.056 -0.194 0.846 -0.121 0.099 brand_name_Google 0.0453 0.092 0.493 0.622 -0.135 0.226 brand_name_HTC -0.0079 0.048 -0.166 0.868 -0.101 0.086 brand_name_Honor 0.0124 0.048 0.258 0.797 -0.082 0.107 brand_name_Huawei -0.0272 0.043 -0.628 0.530 -0.112 0.058 brand_name_Infinix 0.3247 0.090 3.617 0.000 0.149 0.501 brand_name_Karbonn 0.0134 0.066 0.203 0.839 -0.116 0.143 brand_name_LG -0.0457 0.044 -1.035 0.301 
-0.132 0.041 brand_name_Lava -0.0053 0.060 -0.089 0.929 -0.122 0.112 brand_name_Lenovo 0.0124 0.044 0.279 0.780 -0.075 0.100 brand_name_Meizu -0.0309 0.055 -0.566 0.572 -0.138 0.076 brand_name_Micromax -0.0294 0.047 -0.631 0.528 -0.121 0.062 brand_name_Microsoft 0.0589 0.086 0.684 0.494 -0.110 0.228 brand_name_Motorola -0.0813 0.048 -1.680 0.093 -0.176 0.014 brand_name_Nokia 0.0603 0.050 1.195 0.232 -0.039 0.159 brand_name_OnePlus 0.0134 0.067 0.202 0.840 -0.117 0.144 brand_name_Oppo 0.0029 0.047 0.062 0.951 -0.089 0.095 brand_name_Others -0.0174 0.041 -0.426 0.670 -0.098 0.063 brand_name_Panasonic -0.0103 0.055 -0.187 0.852 -0.118 0.098 brand_name_Realme -0.0009 0.060 -0.015 0.988 -0.119 0.117 brand_name_Samsung -0.0381 0.042 -0.903 0.367 -0.121 0.045 brand_name_Sony -0.0764 0.050 -1.529 0.126 -0.174 0.022 brand_name_Spice -0.0445 0.061 -0.729 0.466 -0.164 0.075 brand_name_Vivo -0.0334 0.048 -0.694 0.488 -0.128 0.061 brand_name_XOLO -0.0700 0.061 -1.155 0.248 -0.189 0.049 brand_name_Xiaomi 0.0645 0.047 1.367 0.172 -0.028 0.157 brand_name_ZTE -0.0257 0.046 -0.565 0.572 -0.115 0.064 os_Others -0.0492 0.037 -1.312 0.190 -0.123 0.024 os_Windows -0.0310 0.044 -0.696 0.487 -0.118 0.056 os_iOS -0.0876 0.147 -0.595 0.552 -0.376 0.201 4g_yes 0.0681 0.015 4.500 0.000 0.038 0.098 5g_yes 0.0222 0.034 0.658 0.511 -0.044 0.088 age_old -0.0468 0.017 -2.832 0.005 -0.079 -0.014 ram4_1 0.0276 0.017 1.611 0.107 -0.006 0.061 ============================================================================== Omnibus: 188.856 Durbin-Watson: 1.943 Prob(Omnibus): 0.000 Jarque-Bera (JB): 369.169 Skew: -0.534 Prob(JB): 6.85e-81 Kurtosis: 4.605 Cond. No. 1.73e+05 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.73e+05. This might indicate that there are strong multicollinearity or other numerical problems. 
Variance Inflation Factors const 280.388122 weight 14.647755 brand_name_Apple 14.121957 os_iOS 12.770393 battery 11.965925 brand_name_Others 8.850750 screen_size 7.589599 battery_by_weight 7.447852 brand_name_Samsung 6.910259 brand_name_Huawei 5.764543 brand_name_LG 4.726981 brand_name_Lenovo 4.173822 brand_name_ZTE 3.957683 brand_name_Oppo 3.672643 brand_name_Xiaomi 3.595153 brand_name_Alcatel 3.360767 brand_name_Vivo 3.343906 brand_name_Micromax 3.322402 brand_name_Asus 3.277016 brand_name_Nokia 3.225654 brand_name_Motorola 3.173537 brand_name_Honor 3.159061 brand_name_HTC 3.116526 selfie_camera_mp 2.886479 age_old 2.875527 normalized_new_price 2.855934 brand_name_Sony 2.723139 days_used 2.453434 ram 2.303676 main_camera_mp 2.201178 4g_yes 2.185732 brand_name_Meizu 2.137792 5g_yes 2.079510 brand_name_Panasonic 2.012944 brand_name_Gionee 1.971333 brand_name_Realme 1.937008 os_Others 1.904531 ram4_1 1.899102 brand_name_Microsoft 1.788582 brand_name_Lava 1.776196 brand_name_Celkon 1.760827 brand_name_Spice 1.719549 brand_name_XOLO 1.697516 brand_name_OnePlus 1.637108 brand_name_BlackBerry 1.615950 os_Windows 1.594822 brand_name_Karbonn 1.542344 brand_name_Coolpad 1.440145 brand_name_Infinix 1.347722 int_memory 1.321038 brand_name_Google 1.257138 dtype: float64 ================================================ R-squared for data with dropped high-VIF columns R-squared Adj. R-squared Before 0.842 0.838 Without weight 0.839 0.836 Without brand_name_Apple 0.842 0.838 Without os_iOS 0.842 0.838 Without battery 0.841 0.838 Without brand_name_Others 0.842 0.838 Without screen_size 0.839 0.835 Without battery_by_weight 0.842 0.838 Without brand_name_Samsung 0.842 0.838 Without brand_name_Huawei 0.842 0.838
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f75e9713110>
We have seen that weight is highly positively correlated with both screen size and battery capacity. Since its VIF is the highest, we will drop it first.
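The VIF tables above are produced inside model_tuning; for reference, here is a minimal sketch of how such a table can be computed with statsmodels' variance_inflation_factor (the helper name and toy data are illustrative, not the notebook's real X_train):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_series(X: pd.DataFrame) -> pd.Series:
    """One VIF per column, sorted descending (illustrative helper)."""
    vifs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return pd.Series(vifs, index=X.columns).sort_values(ascending=False)

# toy data: z is nearly a linear combination of x and y, so its VIF is high
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = rng.normal(size=200)
z = 2 * x - y + rng.normal(scale=0.1, size=200)
X = pd.DataFrame({"const": 1.0, "x": x, "y": y, "z": z})
print(vif_series(X))
```

Dropping then proceeds highest-VIF-first, one column at a time, refitting after each drop to confirm R-squared is preserved.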
# drop weight column
X_train=X_train.drop(['weight'],axis=1)
# keep track of dropped features
dropped_cols=['weight']
model_tuning(y_train,X_train)
OLS Regression Results ================================================================================= Dep. Variable: normalized_used_price R-squared: 0.839 Model: OLS Adj. R-squared: 0.836 Method: Least Squares F-statistic: 248.7 Date: Sat, 08 Oct 2022 Prob (F-statistic): 0.00 Time: 04:45:02 Log-Likelihood: 114.04 No. Observations: 2382 AIC: -128.1 Df Residuals: 2332 BIC: 160.7 Df Model: 49 Covariance Type: nonrobust ========================================================================================= coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------------- const 1.4061 0.078 18.041 0.000 1.253 1.559 screen_size 0.0352 0.003 12.329 0.000 0.030 0.041 main_camera_mp 0.0210 0.001 14.470 0.000 0.018 0.024 selfie_camera_mp 0.0143 0.001 12.103 0.000 0.012 0.017 int_memory 0.0001 6.12e-05 2.316 0.021 2.17e-05 0.000 ram 0.0129 0.005 2.436 0.015 0.003 0.023 battery 2.934e-05 8.84e-06 3.319 0.001 1.2e-05 4.67e-05 days_used -2.261e-05 2.99e-05 -0.757 0.449 -8.12e-05 3.6e-05 normalized_new_price 0.4142 0.012 34.352 0.000 0.391 0.438 battery_by_weight -0.0078 0.002 -4.292 0.000 -0.011 -0.004 brand_name_Alcatel -0.0260 0.047 -0.557 0.577 -0.117 0.065 brand_name_Apple -0.0411 0.148 -0.277 0.782 -0.332 0.250 brand_name_Asus 0.0115 0.047 0.245 0.807 -0.080 0.103 brand_name_BlackBerry 0.0550 0.072 0.762 0.446 -0.086 0.196 brand_name_Celkon -0.3110 0.069 -4.482 0.000 -0.447 -0.175 brand_name_Coolpad -0.0086 0.072 -0.119 0.905 -0.151 0.133 brand_name_Gionee -0.0139 0.057 -0.245 0.806 -0.125 0.097 brand_name_Google 0.0362 0.093 0.391 0.696 -0.145 0.218 brand_name_HTC -0.0228 0.048 -0.476 0.634 -0.117 0.071 brand_name_Honor -0.0031 0.049 -0.064 0.949 -0.098 0.092 brand_name_Huawei -0.0440 0.044 -1.012 0.312 -0.129 0.041 brand_name_Infinix 0.3066 0.090 3.394 0.001 0.129 0.484 brand_name_Karbonn 0.0020 0.067 0.030 0.976 -0.129 0.133 brand_name_LG -0.0535 0.044 -1.204 0.229 -0.141 0.034 brand_name_Lava -0.0198 0.060 
-0.329 0.742 -0.138 0.098 brand_name_Lenovo 0.0013 0.045 0.029 0.977 -0.086 0.089 brand_name_Meizu -0.0375 0.055 -0.681 0.496 -0.145 0.070 brand_name_Micromax -0.0365 0.047 -0.778 0.437 -0.129 0.056 brand_name_Microsoft 0.0353 0.087 0.407 0.684 -0.135 0.205 brand_name_Motorola -0.0892 0.049 -1.830 0.067 -0.185 0.006 brand_name_Nokia 0.0494 0.051 0.972 0.331 -0.050 0.149 brand_name_OnePlus -0.0042 0.067 -0.062 0.950 -0.135 0.127 brand_name_Oppo -0.0070 0.047 -0.147 0.883 -0.100 0.086 brand_name_Others -0.0204 0.041 -0.495 0.620 -0.101 0.060 brand_name_Panasonic -0.0245 0.055 -0.442 0.659 -0.133 0.084 brand_name_Realme -0.0139 0.061 -0.230 0.818 -0.133 0.105 brand_name_Samsung -0.0507 0.042 -1.193 0.233 -0.134 0.033 brand_name_Sony -0.0880 0.050 -1.751 0.080 -0.187 0.011 brand_name_Spice -0.0466 0.061 -0.758 0.449 -0.167 0.074 brand_name_Vivo -0.0468 0.048 -0.966 0.334 -0.142 0.048 brand_name_XOLO -0.0798 0.061 -1.307 0.192 -0.199 0.040 brand_name_Xiaomi 0.0524 0.047 1.104 0.270 -0.041 0.145 brand_name_ZTE -0.0376 0.046 -0.821 0.412 -0.128 0.052 os_Others -0.0119 0.037 -0.321 0.748 -0.085 0.061 os_Windows -0.0262 0.045 -0.584 0.559 -0.114 0.062 os_iOS -0.0141 0.148 -0.095 0.924 -0.304 0.276 4g_yes 0.0687 0.015 4.505 0.000 0.039 0.099 5g_yes 0.0184 0.034 0.541 0.589 -0.048 0.085 age_old -0.0430 0.017 -2.587 0.010 -0.076 -0.010 ram4_1 0.0270 0.017 1.563 0.118 -0.007 0.061 ============================================================================== Omnibus: 186.892 Durbin-Watson: 1.942 Prob(Omnibus): 0.000 Jarque-Bera (JB): 359.880 Skew: -0.534 Prob(JB): 7.13e-79 Kurtosis: 4.576 Cond. No. 1.72e+05 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.72e+05. This might indicate that there are strong multicollinearity or other numerical problems. 
Variance Inflation Factors const 266.254541 brand_name_Apple 13.969785 os_iOS 12.675829 brand_name_Others 8.849361 brand_name_Samsung 6.892157 brand_name_Huawei 5.738640 battery 5.569033 screen_size 5.048455 brand_name_LG 4.722569 brand_name_Lenovo 4.166099 brand_name_ZTE 3.949684 brand_name_Oppo 3.667826 brand_name_Xiaomi 3.588137 brand_name_Alcatel 3.355143 brand_name_Vivo 3.336237 brand_name_Micromax 3.320114 brand_name_Asus 3.276640 brand_name_Nokia 3.221178 battery_by_weight 3.180797 brand_name_Motorola 3.171056 brand_name_Honor 3.149329 brand_name_HTC 3.107480 selfie_camera_mp 2.882209 age_old 2.871069 normalized_new_price 2.855337 brand_name_Sony 2.718741 days_used 2.426666 ram 2.298031 main_camera_mp 2.196235 4g_yes 2.185643 brand_name_Meizu 2.136881 5g_yes 2.078721 brand_name_Panasonic 2.008991 brand_name_Gionee 1.971167 brand_name_Realme 1.934318 ram4_1 1.899025 os_Others 1.848840 brand_name_Microsoft 1.784595 brand_name_Lava 1.773107 brand_name_Celkon 1.757360 brand_name_Spice 1.719489 brand_name_XOLO 1.696226 brand_name_OnePlus 1.633716 brand_name_BlackBerry 1.614968 os_Windows 1.594271 brand_name_Karbonn 1.540983 brand_name_Coolpad 1.438734 brand_name_Infinix 1.346102 int_memory 1.318765 brand_name_Google 1.256768 dtype: float64 ================================================ R-squared for data with dropped high-VIF columns R-squared Adj. R-squared Before 0.839 0.836 Without brand_name_Apple 0.839 0.836 Without os_iOS 0.839 0.836 Without brand_name_Others 0.839 0.836 Without brand_name_Samsung 0.839 0.836 Without brand_name_Huawei 0.839 0.836 Without battery 0.839 0.835 Without screen_size 0.829 0.825
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f75e952f190>
There's an interesting argument that we can drop either brand_name_Apple or os_iOS. Although they come from different categorical variables, these two features are perfectly correlated: the only brand running iOS is Apple, and the only OS on Apple devices is iOS, so including both is redundant. brand_name_Apple has the higher VIF, but os_iOS also has a very high p-value. Moreover, after rebuilding the subsequent regressions several times, I found that losing the OS feature is not detrimental to our model (its p-value remains high regardless).
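A quick way to verify this redundancy is a brand-by-OS crosstab (on the full data, pd.crosstab(phone['brand_name'], phone['os'])). A toy illustration of the idea:

```python
import pandas as pd

# toy frame mimicking the brand/OS relationship described above
toy = pd.DataFrame({
    "brand_name": ["Apple", "Apple", "Samsung", "Honor", "Nokia"],
    "os": ["iOS", "iOS", "Android", "Android", "Windows"],
})
# iOS counts appear only in Apple's row, so the dummy columns
# brand_name_Apple and os_iOS encode the same indicator
print(pd.crosstab(toy["brand_name"], toy["os"]))
```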
# drop os_iOS column
X_train=X_train.drop(['os_iOS'],axis=1)
# add col to dropped_cols list
dropped_cols.append('os_iOS')
model_tuning(y_train,X_train)
OLS Regression Results ================================================================================= Dep. Variable: normalized_used_price R-squared: 0.839 Model: OLS Adj. R-squared: 0.836 Method: Least Squares F-statistic: 254.0 Date: Sat, 08 Oct 2022 Prob (F-statistic): 0.00 Time: 04:45:03 Log-Likelihood: 114.04 No. Observations: 2382 AIC: -130.1 Df Residuals: 2333 BIC: 152.9 Df Model: 48 Covariance Type: nonrobust ========================================================================================= coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------------- const 1.4056 0.078 18.078 0.000 1.253 1.558 screen_size 0.0353 0.003 12.384 0.000 0.030 0.041 main_camera_mp 0.0210 0.001 14.494 0.000 0.018 0.024 selfie_camera_mp 0.0143 0.001 12.117 0.000 0.012 0.017 int_memory 0.0001 6.12e-05 2.315 0.021 2.17e-05 0.000 ram 0.0129 0.005 2.455 0.014 0.003 0.023 battery 2.934e-05 8.84e-06 3.319 0.001 1.2e-05 4.67e-05 days_used -2.267e-05 2.99e-05 -0.759 0.448 -8.12e-05 3.59e-05 normalized_new_price 0.4142 0.012 34.360 0.000 0.391 0.438 battery_by_weight -0.0078 0.002 -4.292 0.000 -0.011 -0.004 brand_name_Alcatel -0.0260 0.047 -0.558 0.577 -0.117 0.065 brand_name_Apple -0.0540 0.058 -0.933 0.351 -0.168 0.060 brand_name_Asus 0.0115 0.047 0.245 0.806 -0.080 0.103 brand_name_BlackBerry 0.0546 0.072 0.758 0.448 -0.087 0.196 brand_name_Celkon -0.3112 0.069 -4.487 0.000 -0.447 -0.175 brand_name_Coolpad -0.0085 0.072 -0.118 0.906 -0.151 0.133 brand_name_Gionee -0.0138 0.057 -0.244 0.807 -0.125 0.097 brand_name_Google 0.0361 0.093 0.390 0.696 -0.145 0.218 brand_name_HTC -0.0227 0.048 -0.475 0.635 -0.117 0.071 brand_name_Honor -0.0031 0.049 -0.064 0.949 -0.098 0.092 brand_name_Huawei -0.0440 0.044 -1.012 0.311 -0.129 0.041 brand_name_Infinix 0.3066 0.090 3.396 0.001 0.130 0.484 brand_name_Karbonn 0.0021 0.067 0.032 0.974 -0.129 0.133 brand_name_LG -0.0535 0.044 -1.203 0.229 -0.141 0.034 brand_name_Lava -0.0196 
0.060 -0.327 0.744 -0.137 0.098 brand_name_Lenovo 0.0013 0.045 0.030 0.976 -0.086 0.089 brand_name_Meizu -0.0374 0.055 -0.680 0.496 -0.145 0.070 brand_name_Micromax -0.0364 0.047 -0.777 0.437 -0.128 0.056 brand_name_Microsoft 0.0352 0.087 0.407 0.684 -0.135 0.205 brand_name_Motorola -0.0891 0.049 -1.830 0.067 -0.185 0.006 brand_name_Nokia 0.0492 0.051 0.970 0.332 -0.050 0.149 brand_name_OnePlus -0.0042 0.067 -0.063 0.950 -0.135 0.127 brand_name_Oppo -0.0069 0.047 -0.146 0.884 -0.099 0.086 brand_name_Others -0.0204 0.041 -0.494 0.621 -0.101 0.060 brand_name_Panasonic -0.0244 0.055 -0.441 0.659 -0.133 0.084 brand_name_Realme -0.0140 0.061 -0.230 0.818 -0.133 0.105 brand_name_Samsung -0.0507 0.042 -1.193 0.233 -0.134 0.033 brand_name_Sony -0.0880 0.050 -1.751 0.080 -0.187 0.011 brand_name_Spice -0.0464 0.061 -0.756 0.450 -0.167 0.074 brand_name_Vivo -0.0468 0.048 -0.966 0.334 -0.142 0.048 brand_name_XOLO -0.0797 0.061 -1.306 0.192 -0.199 0.040 brand_name_Xiaomi 0.0524 0.047 1.104 0.270 -0.041 0.145 brand_name_ZTE -0.0376 0.046 -0.820 0.412 -0.127 0.052 os_Others -0.0110 0.036 -0.307 0.759 -0.081 0.059 os_Windows -0.0260 0.045 -0.581 0.561 -0.114 0.062 4g_yes 0.0687 0.015 4.506 0.000 0.039 0.099 5g_yes 0.0182 0.034 0.536 0.592 -0.048 0.085 age_old -0.0430 0.017 -2.587 0.010 -0.076 -0.010 ram4_1 0.0269 0.017 1.561 0.119 -0.007 0.061 ============================================================================== Omnibus: 187.067 Durbin-Watson: 1.942 Prob(Omnibus): 0.000 Jarque-Bera (JB): 360.181 Skew: -0.535 Prob(JB): 6.13e-79 Kurtosis: 4.576 Cond. No. 1.69e+05 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.69e+05. This might indicate that there are strong multicollinearity or other numerical problems. 
Variance Inflation Factors const 265.104024 brand_name_Others 8.847330 brand_name_Samsung 6.892140 brand_name_Huawei 5.738639 battery 5.568998 screen_size 5.011843 brand_name_LG 4.721959 brand_name_Lenovo 4.165953 brand_name_ZTE 3.949164 brand_name_Oppo 3.667339 brand_name_Xiaomi 3.588128 brand_name_Alcatel 3.355138 brand_name_Vivo 3.336228 brand_name_Micromax 3.319684 brand_name_Asus 3.276583 brand_name_Nokia 3.217382 battery_by_weight 3.180109 brand_name_Motorola 3.170931 brand_name_Honor 3.149329 brand_name_HTC 3.106789 selfie_camera_mp 2.874054 age_old 2.871001 normalized_new_price 2.855336 brand_name_Sony 2.718676 days_used 2.425685 ram 2.279765 main_camera_mp 2.191216 4g_yes 2.185600 brand_name_Meizu 2.136820 brand_name_Apple 2.128776 5g_yes 2.068311 brand_name_Panasonic 2.008574 brand_name_Gionee 1.970664 brand_name_Realme 1.934292 ram4_1 1.892896 brand_name_Microsoft 1.784541 brand_name_Lava 1.772362 brand_name_Celkon 1.756081 os_Others 1.726867 brand_name_Spice 1.718455 brand_name_XOLO 1.695882 brand_name_OnePlus 1.633656 brand_name_BlackBerry 1.609489 os_Windows 1.592640 brand_name_Karbonn 1.540439 brand_name_Coolpad 1.438564 brand_name_Infinix 1.346046 int_memory 1.318438 brand_name_Google 1.256753 dtype: float64 ================================================ R-squared for data with dropped high-VIF columns R-squared Adj. R-squared Before 0.839 0.836 Without brand_name_Others 0.839 0.836 Without brand_name_Samsung 0.839 0.836 Without brand_name_Huawei 0.839 0.836 Without battery 0.839 0.835 Without screen_size 0.829 0.825
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f75e9ab6750>
brand_name_Others appears to add more noise than signal: as a catch-all category, it is likely too imprecise to benefit our model. Let's drop it, knowing that the R-squared value will not decrease.
# drop brand_name_Others
X_train=X_train.drop(['brand_name_Others'],axis=1)
# add to dropped_col
dropped_cols.append('brand_name_Others')
model_tuning(y_train,X_train)
OLS Regression Results ================================================================================= Dep. Variable: normalized_used_price R-squared: 0.839 Model: OLS Adj. R-squared: 0.836 Method: Least Squares F-statistic: 259.5 Date: Sat, 08 Oct 2022 Prob (F-statistic): 0.00 Time: 04:45:04 Log-Likelihood: 113.91 No. Observations: 2382 AIC: -131.8 Df Residuals: 2334 BIC: 145.4 Df Model: 47 Covariance Type: nonrobust ========================================================================================= coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------------- const 1.3861 0.067 20.717 0.000 1.255 1.517 screen_size 0.0354 0.003 12.458 0.000 0.030 0.041 main_camera_mp 0.0210 0.001 14.492 0.000 0.018 0.024 selfie_camera_mp 0.0143 0.001 12.125 0.000 0.012 0.017 int_memory 0.0001 6.12e-05 2.311 0.021 2.14e-05 0.000 ram 0.0129 0.005 2.457 0.014 0.003 0.023 battery 2.918e-05 8.83e-06 3.303 0.001 1.19e-05 4.65e-05 days_used -2.261e-05 2.99e-05 -0.757 0.449 -8.12e-05 3.59e-05 normalized_new_price 0.4141 0.012 34.362 0.000 0.390 0.438 battery_by_weight -0.0078 0.002 -4.285 0.000 -0.011 -0.004 brand_name_Alcatel -0.0077 0.028 -0.272 0.786 -0.063 0.048 brand_name_Apple -0.0356 0.044 -0.804 0.421 -0.123 0.051 brand_name_Asus 0.0299 0.029 1.046 0.296 -0.026 0.086 brand_name_BlackBerry 0.0729 0.062 1.181 0.238 -0.048 0.194 brand_name_Celkon -0.2930 0.059 -4.994 0.000 -0.408 -0.178 brand_name_Coolpad 0.0099 0.062 0.160 0.873 -0.112 0.132 brand_name_Gionee 0.0048 0.042 0.113 0.910 -0.078 0.088 brand_name_Google 0.0549 0.084 0.650 0.516 -0.111 0.220 brand_name_HTC -0.0043 0.030 -0.144 0.886 -0.063 0.055 brand_name_Honor 0.0154 0.031 0.500 0.617 -0.045 0.076 brand_name_Huawei -0.0256 0.022 -1.149 0.251 -0.069 0.018 brand_name_Infinix 0.3252 0.082 3.962 0.000 0.164 0.486 brand_name_Karbonn 0.0204 0.055 0.368 0.713 -0.088 0.129 brand_name_LG -0.0350 0.024 -1.460 0.144 -0.082 0.012 brand_name_Lava -0.0013 0.047 
-0.027 0.978 -0.094 0.091 brand_name_Lenovo 0.0197 0.025 0.789 0.430 -0.029 0.069 brand_name_Meizu -0.0189 0.040 -0.470 0.639 -0.098 0.060 brand_name_Micromax -0.0182 0.029 -0.630 0.529 -0.075 0.038 brand_name_Microsoft 0.0527 0.079 0.666 0.505 -0.102 0.208 brand_name_Motorola -0.0705 0.031 -2.287 0.022 -0.131 -0.010 brand_name_Nokia 0.0675 0.035 1.946 0.052 -0.001 0.135 brand_name_OnePlus 0.0144 0.055 0.259 0.795 -0.094 0.123 brand_name_Oppo 0.0116 0.029 0.406 0.685 -0.045 0.068 brand_name_Panasonic -0.0060 0.041 -0.147 0.883 -0.087 0.074 brand_name_Realme 0.0047 0.047 0.100 0.921 -0.088 0.098 brand_name_Samsung -0.0322 0.020 -1.589 0.112 -0.072 0.008 brand_name_Sony -0.0695 0.034 -2.074 0.038 -0.135 -0.004 brand_name_Spice -0.0280 0.049 -0.574 0.566 -0.124 0.068 brand_name_Vivo -0.0282 0.031 -0.925 0.355 -0.088 0.032 brand_name_XOLO -0.0614 0.049 -1.265 0.206 -0.157 0.034 brand_name_Xiaomi 0.0710 0.029 2.457 0.014 0.014 0.128 brand_name_ZTE -0.0191 0.026 -0.723 0.470 -0.071 0.033 os_Others -0.0103 0.036 -0.286 0.775 -0.081 0.060 os_Windows -0.0251 0.045 -0.561 0.575 -0.113 0.063 4g_yes 0.0686 0.015 4.500 0.000 0.039 0.098 5g_yes 0.0182 0.034 0.536 0.592 -0.048 0.085 age_old -0.0426 0.017 -2.567 0.010 -0.075 -0.010 ram4_1 0.0269 0.017 1.561 0.119 -0.007 0.061 ============================================================================== Omnibus: 187.235 Durbin-Watson: 1.941 Prob(Omnibus): 0.000 Jarque-Bera (JB): 360.648 Skew: -0.535 Prob(JB): 4.86e-79 Kurtosis: 4.577 Cond. No. 6.69e+04 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 6.69e+04. This might indicate that there are strong multicollinearity or other numerical problems. 
Variance Inflation Factors const 196.353264 battery 5.560984 screen_size 4.984176 battery_by_weight 3.179004 selfie_camera_mp 2.873727 age_old 2.864429 normalized_new_price 2.855014 days_used 2.425647 ram 2.279733 main_camera_mp 2.191015 4g_yes 2.185181 5g_yes 2.068308 ram4_1 1.892896 os_Others 1.723775 os_Windows 1.589856 brand_name_Samsung 1.573346 brand_name_Nokia 1.504856 brand_name_Huawei 1.502879 brand_name_Microsoft 1.488295 brand_name_LG 1.372475 brand_name_Oppo 1.350397 brand_name_Xiaomi 1.330653 brand_name_Vivo 1.325473 int_memory 1.318334 brand_name_ZTE 1.308460 brand_name_Lenovo 1.296540 brand_name_Motorola 1.272006 brand_name_Honor 1.270812 brand_name_Celkon 1.256991 brand_name_Micromax 1.253321 brand_name_Apple 1.247784 brand_name_Alcatel 1.240127 brand_name_HTC 1.221159 brand_name_Asus 1.215106 brand_name_Sony 1.209331 brand_name_BlackBerry 1.183462 brand_name_Realme 1.181635 brand_name_Meizu 1.139277 brand_name_OnePlus 1.118116 brand_name_Infinix 1.112590 brand_name_Gionee 1.100861 brand_name_Panasonic 1.100701 brand_name_Lava 1.094078 brand_name_Spice 1.088881 brand_name_XOLO 1.073786 brand_name_Karbonn 1.065913 brand_name_Coolpad 1.055485 brand_name_Google 1.045864 dtype: float64 ================================================ R-squared for data with dropped high-VIF columns R-squared Adj. R-squared Before 0.839 0.836 Without battery 0.839 0.835
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f75e7505490>
Our R-squared did not change and brand_name_Samsung no longer has high VIF.
Since battery_by_weight and screen_size together largely predict battery, we can confidently remove battery itself, which now has the highest remaining VIF.
# drop battery column
X_train=X_train.drop(['battery'],axis=1)
# append to list
dropped_cols.append('battery')
model_tuning(y_train,X_train)
OLS Regression Results ================================================================================= Dep. Variable: normalized_used_price R-squared: 0.839 Model: OLS Adj. R-squared: 0.835 Method: Least Squares F-statistic: 263.8 Date: Sat, 08 Oct 2022 Prob (F-statistic): 0.00 Time: 04:45:05 Log-Likelihood: 108.36 No. Observations: 2382 AIC: -122.7 Df Residuals: 2335 BIC: 148.7 Df Model: 46 Covariance Type: nonrobust ========================================================================================= coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------------- const 1.2957 0.061 21.175 0.000 1.176 1.416 screen_size 0.0430 0.002 26.279 0.000 0.040 0.046 main_camera_mp 0.0207 0.001 14.259 0.000 0.018 0.024 selfie_camera_mp 0.0139 0.001 11.816 0.000 0.012 0.016 int_memory 0.0001 6.13e-05 2.441 0.015 2.94e-05 0.000 ram 0.0134 0.005 2.543 0.011 0.003 0.024 days_used -1.819e-05 2.99e-05 -0.609 0.543 -7.68e-05 4.04e-05 normalized_new_price 0.4169 0.012 34.603 0.000 0.393 0.441 battery_by_weight -0.0043 0.001 -2.893 0.004 -0.007 -0.001 brand_name_Alcatel -0.0166 0.028 -0.586 0.558 -0.072 0.039 brand_name_Apple -0.0220 0.044 -0.498 0.619 -0.109 0.065 brand_name_Asus 0.0311 0.029 1.086 0.278 -0.025 0.087 brand_name_BlackBerry 0.0711 0.062 1.150 0.250 -0.050 0.192 brand_name_Celkon -0.2889 0.059 -4.914 0.000 -0.404 -0.174 brand_name_Coolpad 0.0070 0.062 0.112 0.911 -0.115 0.129 brand_name_Gionee 0.0119 0.042 0.282 0.778 -0.071 0.095 brand_name_Google 0.0497 0.085 0.588 0.557 -0.116 0.216 brand_name_HTC -0.0064 0.030 -0.213 0.831 -0.065 0.053 brand_name_Honor 0.0099 0.031 0.320 0.749 -0.051 0.070 brand_name_Huawei -0.0248 0.022 -1.112 0.266 -0.069 0.019 brand_name_Infinix 0.3327 0.082 4.046 0.000 0.171 0.494 brand_name_Karbonn 0.0107 0.055 0.193 0.847 -0.098 0.119 brand_name_LG -0.0356 0.024 -1.482 0.139 -0.083 0.012 brand_name_Lava -0.0055 0.047 -0.116 0.908 -0.098 0.087 brand_name_Lenovo 0.0219 0.025 
0.878 0.380 -0.027 0.071 brand_name_Meizu -0.0225 0.040 -0.559 0.576 -0.101 0.056 brand_name_Micromax -0.0220 0.029 -0.763 0.445 -0.079 0.035 brand_name_Microsoft 0.0503 0.079 0.635 0.525 -0.105 0.206 brand_name_Motorola -0.0678 0.031 -2.194 0.028 -0.128 -0.007 brand_name_Nokia 0.0654 0.035 1.883 0.060 -0.003 0.134 brand_name_OnePlus 0.0064 0.055 0.115 0.908 -0.102 0.115 brand_name_Oppo 0.0105 0.029 0.364 0.716 -0.046 0.067 brand_name_Panasonic -0.0113 0.041 -0.275 0.784 -0.092 0.069 brand_name_Realme 0.0100 0.047 0.212 0.832 -0.083 0.103 brand_name_Samsung -0.0313 0.020 -1.539 0.124 -0.071 0.009 brand_name_Sony -0.0690 0.034 -2.053 0.040 -0.135 -0.003 brand_name_Spice -0.0230 0.049 -0.470 0.638 -0.119 0.073 brand_name_Vivo -0.0314 0.031 -1.027 0.305 -0.091 0.029 brand_name_XOLO -0.0643 0.049 -1.322 0.186 -0.160 0.031 brand_name_Xiaomi 0.0755 0.029 2.613 0.009 0.019 0.132 brand_name_ZTE -0.0211 0.026 -0.798 0.425 -0.073 0.031 os_Others 0.0177 0.035 0.505 0.614 -0.051 0.086 os_Windows -0.0266 0.045 -0.595 0.552 -0.114 0.061 4g_yes 0.0695 0.015 4.552 0.000 0.040 0.099 5g_yes 0.0204 0.034 0.599 0.549 -0.046 0.087 age_old -0.0443 0.017 -2.665 0.008 -0.077 -0.012 ram4_1 0.0269 0.017 1.561 0.119 -0.007 0.061 ============================================================================== Omnibus: 186.544 Durbin-Watson: 1.941 Prob(Omnibus): 0.000 Jarque-Bera (JB): 359.965 Skew: -0.533 Prob(JB): 6.83e-79 Kurtosis: 4.578 Cond. No. 1.40e+04 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.4e+04. This might indicate that there are strong multicollinearity or other numerical problems. 
Variance Inflation Factors const 163.546190 age_old 2.861644 normalized_new_price 2.841189 selfie_camera_mp 2.841148 days_used 2.420772 ram 2.278053 4g_yes 2.184430 main_camera_mp 2.179395 battery_by_weight 2.087648 5g_yes 2.067531 ram4_1 1.892894 screen_size 1.651488 os_Others 1.628177 os_Windows 1.589682 brand_name_Samsung 1.573021 brand_name_Nokia 1.504374 brand_name_Huawei 1.502714 brand_name_Microsoft 1.488174 brand_name_LG 1.372401 brand_name_Oppo 1.350188 brand_name_Xiaomi 1.327600 brand_name_Vivo 1.324158 int_memory 1.316189 brand_name_ZTE 1.307764 brand_name_Lenovo 1.295566 brand_name_Motorola 1.271094 brand_name_Honor 1.267041 brand_name_Celkon 1.256428 brand_name_Micromax 1.251259 brand_name_Apple 1.236995 brand_name_Alcatel 1.229015 brand_name_HTC 1.220608 brand_name_Asus 1.214908 brand_name_Sony 1.209302 brand_name_BlackBerry 1.183374 brand_name_Realme 1.180261 brand_name_Meizu 1.138431 brand_name_OnePlus 1.115987 brand_name_Infinix 1.111752 brand_name_Panasonic 1.099036 brand_name_Gionee 1.097969 brand_name_Lava 1.093285 brand_name_Spice 1.087834 brand_name_XOLO 1.073437 brand_name_Karbonn 1.062911 brand_name_Coolpad 1.055264 brand_name_Google 1.045507 dtype: float64 There are no features with VIF greater than 5
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f75e8fe0790>
There was no effect on R-squared, and we have now eliminated all features with high VIF.
Next we'll look at p-values, using a 5% level of significance. The p-values for os_Others and os_Windows are both high, indicating that operating system is generally not a good predictor of used price. We will drop the remaining os features.
# drop remaining os columns
X_train=X_train.drop(['os_Others','os_Windows'],axis=1)
# record both dropped os columns
dropped_cols.extend(['os_Others','os_Windows'])
model_tuning(y_train,X_train,hide_vif=True)
OLS Regression Results ================================================================================= Dep. Variable: normalized_used_price R-squared: 0.839 Model: OLS Adj. R-squared: 0.836 Method: Least Squares F-statistic: 275.9 Date: Sat, 08 Oct 2022 Prob (F-statistic): 0.00 Time: 04:45:07 Log-Likelihood: 108.01 No. Observations: 2382 AIC: -126.0 Df Residuals: 2337 BIC: 133.9 Df Model: 44 Covariance Type: nonrobust ========================================================================================= coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------------- const 1.3053 0.059 22.168 0.000 1.190 1.421 screen_size 0.0429 0.002 26.747 0.000 0.040 0.046 main_camera_mp 0.0206 0.001 14.243 0.000 0.018 0.023 selfie_camera_mp 0.0139 0.001 11.874 0.000 0.012 0.016 int_memory 0.0001 6.13e-05 2.429 0.015 2.87e-05 0.000 ram 0.0128 0.005 2.459 0.014 0.003 0.023 days_used -1.819e-05 2.99e-05 -0.609 0.543 -7.68e-05 4.04e-05 normalized_new_price 0.4164 0.012 34.658 0.000 0.393 0.440 battery_by_weight -0.0043 0.001 -2.907 0.004 -0.007 -0.001 brand_name_Alcatel -0.0151 0.028 -0.536 0.592 -0.070 0.040 brand_name_Apple -0.0184 0.044 -0.421 0.674 -0.104 0.067 brand_name_Asus 0.0322 0.029 1.129 0.259 -0.024 0.088 brand_name_BlackBerry 0.0818 0.059 1.397 0.163 -0.033 0.197 brand_name_Celkon -0.2840 0.058 -4.905 0.000 -0.397 -0.170 brand_name_Coolpad 0.0077 0.062 0.124 0.901 -0.114 0.130 brand_name_Gionee 0.0126 0.042 0.298 0.766 -0.070 0.095 brand_name_Google 0.0508 0.085 0.600 0.548 -0.115 0.217 brand_name_HTC -0.0066 0.030 -0.220 0.826 -0.066 0.052 brand_name_Honor 0.0104 0.031 0.338 0.736 -0.050 0.071 brand_name_Huawei -0.0243 0.022 -1.089 0.276 -0.068 0.019 brand_name_Infinix 0.3301 0.082 4.019 0.000 0.169 0.491 brand_name_Karbonn 0.0101 0.055 0.183 0.855 -0.098 0.119 brand_name_LG -0.0349 0.024 -1.457 0.145 -0.082 0.012 brand_name_Lava -0.0071 0.047 -0.150 0.881 -0.100 0.085 brand_name_Lenovo 0.0226 0.025 
0.907 0.365 -0.026 0.072 brand_name_Meizu -0.0216 0.040 -0.536 0.592 -0.100 0.057 brand_name_Micromax -0.0217 0.029 -0.754 0.451 -0.078 0.035 brand_name_Microsoft 0.0243 0.066 0.367 0.714 -0.105 0.154 brand_name_Motorola -0.0672 0.031 -2.176 0.030 -0.128 -0.007 brand_name_Nokia 0.0645 0.032 1.995 0.046 0.001 0.128 brand_name_OnePlus 0.0080 0.055 0.144 0.885 -0.101 0.116 brand_name_Oppo 0.0109 0.029 0.381 0.703 -0.045 0.067 brand_name_Panasonic -0.0107 0.041 -0.259 0.796 -0.091 0.070 brand_name_Realme 0.0096 0.047 0.204 0.839 -0.083 0.103 brand_name_Samsung -0.0299 0.020 -1.480 0.139 -0.070 0.010 brand_name_Sony -0.0674 0.034 -2.012 0.044 -0.133 -0.002 brand_name_Spice -0.0235 0.049 -0.481 0.630 -0.119 0.072 brand_name_Vivo -0.0306 0.031 -1.003 0.316 -0.090 0.029 brand_name_XOLO -0.0651 0.049 -1.340 0.180 -0.160 0.030 brand_name_Xiaomi 0.0762 0.029 2.639 0.008 0.020 0.133 brand_name_ZTE -0.0202 0.026 -0.767 0.443 -0.072 0.032 4g_yes 0.0691 0.015 4.529 0.000 0.039 0.099 5g_yes 0.0214 0.034 0.631 0.528 -0.045 0.088 age_old -0.0449 0.017 -2.709 0.007 -0.077 -0.012 ram4_1 0.0252 0.017 1.481 0.139 -0.008 0.059 ============================================================================== Omnibus: 184.690 Durbin-Watson: 1.942 Prob(Omnibus): 0.000 Jarque-Bera (JB): 357.701 Skew: -0.527 Prob(JB): 2.12e-78 Kurtosis: 4.579 Cond. No. 1.38e+04 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.38e+04. This might indicate that there are strong multicollinearity or other numerical problems.
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f75e9a8f350>
Dropping os_Others and os_Windows has had no impact on the model's R-squared.
With a p-value of 0.543, days_used is clearly adding no predictive value to our model. Let's drop it.
# drop days_used column
X_train=X_train.drop(['days_used'],axis=1)
# append
dropped_cols.append('days_used')
model_tuning(y_train,X_train,hide_vif=True)
OLS Regression Results ================================================================================= Dep. Variable: normalized_used_price R-squared: 0.839 Model: OLS Adj. R-squared: 0.836 Method: Least Squares F-statistic: 282.4 Date: Sat, 08 Oct 2022 Prob (F-statistic): 0.00 Time: 04:45:07 Log-Likelihood: 107.82 No. Observations: 2382 AIC: -127.6 Df Residuals: 2338 BIC: 126.5 Df Model: 43 Covariance Type: nonrobust ========================================================================================= coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------------- const 1.2961 0.057 22.781 0.000 1.185 1.408 screen_size 0.0429 0.002 26.896 0.000 0.040 0.046 main_camera_mp 0.0205 0.001 14.234 0.000 0.018 0.023 selfie_camera_mp 0.0140 0.001 11.981 0.000 0.012 0.016 int_memory 0.0001 6.12e-05 2.443 0.015 2.95e-05 0.000 ram 0.0131 0.005 2.523 0.012 0.003 0.023 normalized_new_price 0.4159 0.012 34.687 0.000 0.392 0.439 battery_by_weight -0.0042 0.001 -2.861 0.004 -0.007 -0.001 brand_name_Alcatel -0.0155 0.028 -0.550 0.582 -0.071 0.040 brand_name_Apple -0.0176 0.044 -0.402 0.688 -0.103 0.068 brand_name_Asus 0.0319 0.029 1.119 0.263 -0.024 0.088 brand_name_BlackBerry 0.0826 0.059 1.411 0.158 -0.032 0.197 brand_name_Celkon -0.2835 0.058 -4.899 0.000 -0.397 -0.170 brand_name_Coolpad 0.0071 0.062 0.115 0.909 -0.115 0.129 brand_name_Gionee 0.0122 0.042 0.288 0.774 -0.071 0.095 brand_name_Google 0.0512 0.085 0.605 0.545 -0.115 0.217 brand_name_HTC -0.0065 0.030 -0.216 0.829 -0.065 0.052 brand_name_Honor 0.0110 0.031 0.356 0.722 -0.049 0.071 brand_name_Huawei -0.0239 0.022 -1.072 0.284 -0.068 0.020 brand_name_Infinix 0.3317 0.082 4.040 0.000 0.171 0.493 brand_name_Karbonn 0.0100 0.055 0.181 0.856 -0.099 0.119 brand_name_LG -0.0342 0.024 -1.430 0.153 -0.081 0.013 brand_name_Lava -0.0075 0.047 -0.160 0.873 -0.100 0.085 brand_name_Lenovo 0.0226 0.025 0.905 0.366 -0.026 0.072 brand_name_Meizu -0.0223 0.040 -0.556 
0.578 -0.101 0.056 brand_name_Micromax -0.0223 0.029 -0.775 0.438 -0.079 0.034 brand_name_Microsoft 0.0238 0.066 0.359 0.719 -0.106 0.153 brand_name_Motorola -0.0655 0.031 -2.131 0.033 -0.126 -0.005 brand_name_Nokia 0.0648 0.032 2.005 0.045 0.001 0.128 brand_name_OnePlus 0.0072 0.055 0.130 0.897 -0.101 0.116 brand_name_Oppo 0.0110 0.029 0.383 0.702 -0.045 0.067 brand_name_Panasonic -0.0116 0.041 -0.282 0.778 -0.092 0.069 brand_name_Realme 0.0117 0.047 0.248 0.804 -0.081 0.104 brand_name_Samsung -0.0298 0.020 -1.473 0.141 -0.069 0.010 brand_name_Sony -0.0670 0.034 -1.999 0.046 -0.133 -0.001 brand_name_Spice -0.0249 0.049 -0.511 0.610 -0.121 0.071 brand_name_Vivo -0.0303 0.031 -0.994 0.321 -0.090 0.030 brand_name_XOLO -0.0651 0.049 -1.339 0.181 -0.160 0.030 brand_name_Xiaomi 0.0761 0.029 2.635 0.008 0.019 0.133 brand_name_ZTE -0.0205 0.026 -0.777 0.437 -0.072 0.031 4g_yes 0.0689 0.015 4.522 0.000 0.039 0.099 5g_yes 0.0224 0.034 0.660 0.509 -0.044 0.089 age_old -0.0493 0.015 -3.287 0.001 -0.079 -0.020 ram4_1 0.0228 0.017 1.377 0.169 -0.010 0.055 ============================================================================== Omnibus: 184.367 Durbin-Watson: 1.941 Prob(Omnibus): 0.000 Jarque-Bera (JB): 356.117 Skew: -0.527 Prob(JB): 4.68e-78 Kurtosis: 4.573 Cond. No. 2.03e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 2.03e+03. This might indicate that there are strong multicollinearity or other numerical problems.
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f75e74ed890>
Dropping days_used had no impact on R-squared either.
Despite high p-values, I do not believe we can drop only some of the brand_name columns: dropping part of a one-hot-encoded categorical variable would hurt the interpretability of the model. That leaves two other features with high p-values: 5g_yes and ram4_1. The former records whether the device has 5G connectivity, which is apparently a poor predictor of price, especially compared with 4G connectivity. The latter is a feature I engineered in an attempt to bin the RAM values; it does not appear to benefit the model.
We will still only drop one column at a time to record the effect on the R-squared value.
# drop the 5G indicator from the training features
X_train=X_train.drop(['5g_yes'],axis=1)
# record the drop so the same column can be removed from the test set later
dropped_cols.append('5g_yes')
model_tuning(y_train,X_train,hide_vif=True)
OLS Regression Results ================================================================================= Dep. Variable: normalized_used_price R-squared: 0.839 Model: OLS Adj. R-squared: 0.836 Method: Least Squares F-statistic: 289.2 Date: Sat, 08 Oct 2022 Prob (F-statistic): 0.00 Time: 04:45:07 Log-Likelihood: 107.60 No. Observations: 2382 AIC: -129.2 Df Residuals: 2339 BIC: 119.2 Df Model: 42 Covariance Type: nonrobust ========================================================================================= coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------------- const 1.2898 0.056 22.997 0.000 1.180 1.400 screen_size 0.0429 0.002 26.912 0.000 0.040 0.046 main_camera_mp 0.0204 0.001 14.237 0.000 0.018 0.023 selfie_camera_mp 0.0140 0.001 11.983 0.000 0.012 0.016 int_memory 0.0002 6.12e-05 2.468 0.014 3.1e-05 0.000 ram 0.0148 0.005 3.253 0.001 0.006 0.024 normalized_new_price 0.4171 0.012 35.205 0.000 0.394 0.440 battery_by_weight -0.0042 0.001 -2.885 0.004 -0.007 -0.001 brand_name_Alcatel -0.0152 0.028 -0.539 0.590 -0.070 0.040 brand_name_Apple -0.0192 0.044 -0.440 0.660 -0.105 0.066 brand_name_Asus 0.0320 0.029 1.121 0.263 -0.024 0.088 brand_name_BlackBerry 0.0825 0.059 1.410 0.159 -0.032 0.197 brand_name_Celkon -0.2819 0.058 -4.876 0.000 -0.395 -0.169 brand_name_Coolpad 0.0089 0.062 0.143 0.886 -0.113 0.130 brand_name_Gionee 0.0121 0.042 0.286 0.775 -0.071 0.095 brand_name_Google 0.0476 0.084 0.565 0.572 -0.118 0.213 brand_name_HTC -0.0070 0.030 -0.232 0.817 -0.066 0.052 brand_name_Honor 0.0116 0.031 0.376 0.707 -0.049 0.072 brand_name_Huawei -0.0240 0.022 -1.077 0.281 -0.068 0.020 brand_name_Infinix 0.3308 0.082 4.030 0.000 0.170 0.492 brand_name_Karbonn 0.0105 0.055 0.189 0.850 -0.098 0.119 brand_name_LG -0.0342 0.024 -1.429 0.153 -0.081 0.013 brand_name_Lava -0.0072 0.047 -0.152 0.879 -0.100 0.085 brand_name_Lenovo 0.0231 0.025 0.924 0.356 -0.026 0.072 brand_name_Meizu -0.0225 0.040 -0.561 
0.575 -0.101 0.056 brand_name_Micromax -0.0218 0.029 -0.757 0.449 -0.078 0.035 brand_name_Microsoft 0.0241 0.066 0.364 0.716 -0.106 0.154 brand_name_Motorola -0.0651 0.031 -2.117 0.034 -0.125 -0.005 brand_name_Nokia 0.0660 0.032 2.043 0.041 0.003 0.129 brand_name_OnePlus 0.0078 0.055 0.141 0.888 -0.101 0.116 brand_name_Oppo 0.0116 0.029 0.404 0.687 -0.045 0.068 brand_name_Panasonic -0.0112 0.041 -0.274 0.784 -0.092 0.069 brand_name_Realme 0.0149 0.047 0.318 0.751 -0.077 0.107 brand_name_Samsung -0.0302 0.020 -1.497 0.135 -0.070 0.009 brand_name_Sony -0.0668 0.033 -1.993 0.046 -0.132 -0.001 brand_name_Spice -0.0246 0.049 -0.505 0.614 -0.120 0.071 brand_name_Vivo -0.0306 0.031 -1.004 0.315 -0.090 0.029 brand_name_XOLO -0.0649 0.049 -1.336 0.182 -0.160 0.030 brand_name_Xiaomi 0.0765 0.029 2.649 0.008 0.020 0.133 brand_name_ZTE -0.0203 0.026 -0.769 0.442 -0.072 0.031 4g_yes 0.0689 0.015 4.518 0.000 0.039 0.099 age_old -0.0492 0.015 -3.287 0.001 -0.079 -0.020 ram4_1 0.0184 0.015 1.214 0.225 -0.011 0.048 ============================================================================== Omnibus: 184.759 Durbin-Watson: 1.942 Prob(Omnibus): 0.000 Jarque-Bera (JB): 356.947 Skew: -0.528 Prob(JB): 3.09e-78 Kurtosis: 4.575 Cond. No. 2.03e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 2.03e+03. This might indicate that there are strong multicollinearity or other numerical problems.
<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f75e75f5510>
Dropping 5g_yes had no effect on the R-squared. What's more, the p-value for ram4_1 actually went up (from 0.169 to 0.225), further supporting our suspicion that this feature is not necessary.
# drop the RAM bin feature from the training features
X_train=X_train.drop(['ram4_1'],axis=1)
# record the drop so the same column can be removed from the test set later
dropped_cols.append('ram4_1')
model=model_tuning(y_train,X_train,hide_vif=True)
OLS Regression Results ================================================================================= Dep. Variable: normalized_used_price R-squared: 0.838 Model: OLS Adj. R-squared: 0.836 Method: Least Squares F-statistic: 296.1 Date: Sat, 08 Oct 2022 Prob (F-statistic): 0.00 Time: 04:45:08 Log-Likelihood: 106.85 No. Observations: 2382 AIC: -129.7 Df Residuals: 2340 BIC: 112.9 Df Model: 41 Covariance Type: nonrobust ========================================================================================= coef std err t P>|t| [0.025 0.975] ----------------------------------------------------------------------------------------- const 1.3002 0.055 23.454 0.000 1.191 1.409 screen_size 0.0428 0.002 26.886 0.000 0.040 0.046 main_camera_mp 0.0207 0.001 14.505 0.000 0.018 0.023 selfie_camera_mp 0.0138 0.001 11.940 0.000 0.012 0.016 int_memory 0.0001 6.01e-05 2.282 0.023 1.93e-05 0.000 ram 0.0146 0.005 3.222 0.001 0.006 0.024 normalized_new_price 0.4185 0.012 35.462 0.000 0.395 0.442 battery_by_weight -0.0043 0.001 -2.934 0.003 -0.007 -0.001 brand_name_Alcatel -0.0171 0.028 -0.609 0.543 -0.072 0.038 brand_name_Apple -0.0200 0.044 -0.459 0.647 -0.106 0.066 brand_name_Asus 0.0329 0.029 1.153 0.249 -0.023 0.089 brand_name_BlackBerry 0.0823 0.059 1.406 0.160 -0.032 0.197 brand_name_Celkon -0.2913 0.057 -5.082 0.000 -0.404 -0.179 brand_name_Coolpad 0.0089 0.062 0.144 0.885 -0.113 0.131 brand_name_Gionee 0.0128 0.042 0.303 0.762 -0.070 0.096 brand_name_Google 0.0434 0.084 0.515 0.607 -0.122 0.209 brand_name_HTC -0.0077 0.030 -0.257 0.797 -0.067 0.051 brand_name_Honor 0.0094 0.031 0.306 0.760 -0.051 0.070 brand_name_Huawei -0.0248 0.022 -1.113 0.266 -0.068 0.019 brand_name_Infinix 0.3248 0.082 3.964 0.000 0.164 0.485 brand_name_Karbonn 0.0090 0.055 0.162 0.871 -0.100 0.118 brand_name_LG -0.0357 0.024 -1.495 0.135 -0.083 0.011 brand_name_Lava -0.0079 0.047 -0.168 0.867 -0.100 0.085 brand_name_Lenovo 0.0231 0.025 0.926 0.355 -0.026 0.072 brand_name_Meizu -0.0235 0.040 -0.584 
0.559 -0.102 0.055 brand_name_Micromax -0.0214 0.029 -0.742 0.458 -0.078 0.035 brand_name_Microsoft 0.0245 0.066 0.371 0.711 -0.105 0.154 brand_name_Motorola -0.0662 0.031 -2.154 0.031 -0.126 -0.006 brand_name_Nokia 0.0600 0.032 1.879 0.060 -0.003 0.123 brand_name_OnePlus 0.0028 0.055 0.051 0.959 -0.105 0.111 brand_name_Oppo 0.0098 0.029 0.342 0.732 -0.046 0.066 brand_name_Panasonic -0.0101 0.041 -0.246 0.806 -0.091 0.070 brand_name_Realme 0.0081 0.047 0.173 0.863 -0.083 0.100 brand_name_Samsung -0.0306 0.020 -1.517 0.129 -0.070 0.009 brand_name_Sony -0.0674 0.033 -2.013 0.044 -0.133 -0.002 brand_name_Spice -0.0239 0.049 -0.489 0.625 -0.120 0.072 brand_name_Vivo -0.0309 0.031 -1.013 0.311 -0.091 0.029 brand_name_XOLO -0.0644 0.049 -1.324 0.186 -0.160 0.031 brand_name_Xiaomi 0.0754 0.029 2.611 0.009 0.019 0.132 brand_name_ZTE -0.0211 0.026 -0.799 0.425 -0.073 0.031 4g_yes 0.0695 0.015 4.561 0.000 0.040 0.099 age_old -0.0461 0.015 -3.125 0.002 -0.075 -0.017 ============================================================================== Omnibus: 188.661 Durbin-Watson: 1.940 Prob(Omnibus): 0.000 Jarque-Bera (JB): 364.229 Skew: -0.538 Prob(JB): 8.10e-80 Kurtosis: 4.585 Cond. No. 2.02e+03 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 2.02e+03. This might indicate that there are strong multicollinearity or other numerical problems.
R-squared only decreased by 0.001 after dropping ram4_1. Outside of the brand_name dummies, no feature has a p-value greater than 0.05. With that, we have a model that explains 83.8% of the variance in normalized_used_price on the training data!
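As a sanity check, the adjusted R-squared reported in the summary (0.836) can be reproduced from the plain R-squared, the number of observations, and the model degrees of freedom, all taken from the summary above:

```python
# adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)
r2 = 0.838      # R-squared from the summary
n = 2382        # number of observations
p = 41          # model degrees of freedom (predictors, excluding the constant)

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))  # agrees with the reported 0.836 up to rounding of the inputs
```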
# tabulate model performance
model_perf=pd.DataFrame.from_dict({'Actual':y_train.values,
'Fitted':model.fittedvalues.values,
'Residuals':model.resid.values})
model_perf.head()
Actual | Fitted | Residuals | |
---|---|---|---|
0 | 5.169518 | 5.033365 | 0.136153 |
1 | 4.284276 | 4.485839 | -0.201563 |
2 | 4.955757 | 4.771354 | 0.184403 |
3 | 4.460607 | 4.403883 | 0.056724 |
4 | 4.005695 | 4.063245 | -0.057549 |
If our data really does exhibit a linear relationship, then there should be no pattern in the plot of fitted values versus residuals.
plt.figure(figsize=(8,5))
plt.title('Linearity Check')
sns.residplot(data=model_perf,x='Fitted',y='Residuals',lowess=True);
There is at worst a weak trend apparent here; the residuals are essentially noise around zero. This supports a linear relationship between our inputs and the response variable, normalized_used_price.
The absence of any regular pattern in the residuals is also consistent with their being independent of one another.
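The regression summary above also reports a Durbin-Watson statistic of about 1.94; values near 2 indicate little autocorrelation in the residuals. As a minimal standalone sketch (using synthetic residuals, not the model's), the statistic can be computed with statsmodels:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# synthetic independent residuals stand in for the model's residuals here
rng = np.random.default_rng(0)
resid = rng.normal(0, 1, 2000)

dw = durbin_watson(resid)
print(dw)  # close to 2 when residuals are uncorrelated
```

On the fitted model, the analogous call would be `durbin_watson(model_perf['Residuals'])`.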
plt.figure(figsize=(8,5))
plt.title('Histogram of Linear Regression Residuals',fontsize=16)
sns.histplot(data=model_perf,x='Residuals',kde=True);
All told, this distribution is reasonably normal. The only issues seem to be in the tails. Further evidence in favor of normality can be obtained by studying the QQ plot.
plt.figure(figsize=(8,5))
# stats.probplot can draw directly onto pyplot; no need for the deprecated pylab module
stats.probplot(model_perf['Residuals'],dist='norm',plot=plt);
Again, save for the tails, we find that the distribution is close to normal. Recall that normally distributed data will follow the diagonal reference line of a QQ plot.
But let's go further by running a Shapiro-Wilk test at a 5% level of significance. We make the following hypotheses: $$H_0: \text{Residuals are normally distributed.}$$ $$H_a: \text{Residuals are NOT normally distributed.}$$
test_stat,p_val=stats.shapiro(model_perf['Residuals'])
print('The p-value of the Shapiro-Wilk test for residuals is:',p_val)
The p-value of the Shapiro-Wilk test for residuals is: 6.636463188267622e-21
The Shapiro-Wilk test rejects the null hypothesis: the residuals are almost certainly not exactly normal. This is not surprising. With over 2,300 observations, the test has the power to detect even small departures from normality, such as the heavy tails visible in the histogram and QQ plot. The distribution is still approximately normal, so we can proceed with the usual inferences, keeping this limitation in mind.
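To illustrate that sensitivity, here is a hypothetical sketch: draw a large sample from a Student's t distribution with 5 degrees of freedom, which is bell-shaped but has heavier tails than the normal, much like our residuals. At a sample size comparable to our training set, Shapiro-Wilk rejects it decisively:

```python
import numpy as np
from scipy import stats

# heavy-tailed but bell-shaped sample, similar in size to our training set
rng = np.random.default_rng(1)
sample = rng.standard_t(df=5, size=2382)

stat, p = stats.shapiro(sample)
print(p)  # far below 0.05: normality rejected despite the bell shape
```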
Homoscedasticity concerns the variance of residuals. We require that the variance of residuals remain consistent throughout the data.
We will run the Goldfeld-Quandt test, a hypothesis test for homoscedasticity. We make the assumption that the residuals are homoscedastic, with the alternative being that they are heteroscedastic. $$H_0: \text{Residuals are homoscedastic.}$$ $$H_a: \text{Residuals are heteroscedastic.}$$ We will assume a 5% level of significance.
# homoscedasticity test (Goldfeld-Quandt)
from statsmodels.stats.diagnostic import het_goldfeldquandt
test=het_goldfeldquandt(model_perf['Residuals'],X_train)
print('The p-value for the Goldfeld-Quandt test is:',test[1])
The p-value for the Goldfeld-Quandt test is: 0.24735380493522216
Since the p-value is greater than 0.05, we fail to reject the null hypothesis: there is no evidence of heteroscedasticity, so we may treat the residuals as homoscedastic.
# drop required columns from test data
X_test=X_test.drop(dropped_cols,axis=1)
# generate predicted y values with model
y_predicted=model.predict(X_test)
from sklearn.metrics import mean_squared_error, mean_absolute_error
print('The RMSE on the training data is:',
np.sqrt(mean_squared_error(y_train,model_perf['Fitted'])))
The RMSE on the training data is: 0.23135627303854692
print('The RMSE on the test data is:',
np.sqrt(mean_squared_error(y_test,y_predicted)))
The RMSE on the test data is: 0.2344360049678886
The RMSE is low for both the training data and the test data. What's more, the two values are nearly equal, which suggests that overfitting is not an issue for our model.
print('The mean absolute error for the training data is:',
mean_absolute_error(y_train,model_perf['Fitted']))
The mean absolute error for the training data is: 0.18154048705559045
print('The mean absolute error for the test data is:',
mean_absolute_error(y_test,y_predicted))
The mean absolute error for the test data is: 0.18160092038083928
Again, the values for the training data and the test data are quite close. The low mean absolute error on the test data indicates that the average error committed by the model in predicting normalized_used_price is only about 0.18 (on the normalized price scale, not directly in euros).
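If the normalized prices are natural-log prices, which is an assumption on my part (it is consistent with values like exp(4.3) ≈ 74), then an MAE of roughly 0.18 on the log scale translates into a multiplicative error on the original price scale:

```python
import numpy as np

mae_log = 0.18  # MAE on the normalized (assumed natural-log) price scale

# an average error of 0.18 log-units means predictions are off by a factor of ~1.2
factor = np.exp(mae_log)
print(factor)  # about 1.197, i.e. typically within roughly 20% of the actual price
```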
We conclude that this is a strong model that performs well on out-of-sample data.
For reference, the model parameters are listed below. We can use these to draw business conclusions.
model.params
const 1.300184 screen_size 0.042756 main_camera_mp 0.020663 selfie_camera_mp 0.013775 int_memory 0.000137 ram 0.014610 normalized_new_price 0.418463 battery_by_weight -0.004308 brand_name_Alcatel -0.017111 brand_name_Apple -0.020045 brand_name_Asus 0.032903 brand_name_BlackBerry 0.082250 brand_name_Celkon -0.291263 brand_name_Coolpad 0.008939 brand_name_Gionee 0.012789 brand_name_Google 0.043397 brand_name_HTC -0.007726 brand_name_Honor 0.009408 brand_name_Huawei -0.024759 brand_name_Infinix 0.324816 brand_name_Karbonn 0.008970 brand_name_LG -0.035735 brand_name_Lava -0.007912 brand_name_Lenovo 0.023106 brand_name_Meizu -0.023462 brand_name_Micromax -0.021357 brand_name_Microsoft 0.024526 brand_name_Motorola -0.066186 brand_name_Nokia 0.059966 brand_name_OnePlus 0.002819 brand_name_Oppo 0.009788 brand_name_Panasonic -0.010076 brand_name_Realme 0.008064 brand_name_Samsung -0.030627 brand_name_Sony -0.067437 brand_name_Spice -0.023866 brand_name_Vivo -0.030908 brand_name_XOLO -0.064353 brand_name_Xiaomi 0.075372 brand_name_ZTE -0.021065 4g_yes 0.069481 age_old -0.046131 dtype: float64
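To surface the strongest drivers for the business discussion, the coefficients can be ranked by magnitude (on the live model, via `model.params`). As a standalone illustration using a handful of coefficients copied from the output above:

```python
import pandas as pd

# a subset of the fitted coefficients, copied from model.params above
coefs = pd.Series({
    'normalized_new_price': 0.418463,
    'brand_name_Infinix': 0.324816,
    'brand_name_Celkon': -0.291263,
    '4g_yes': 0.069481,
    'age_old': -0.046131,
    'screen_size': 0.042756,
})

# sort by absolute size to see the largest effects first
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

One caveat: the continuous features were not standardized, so coefficient magnitudes are directly comparable only among the dummy variables.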
Model Insights
Mobile Market Recommendations