https://medium.com/analytics-vidhya/implementing-linear-regression-using-sklearn-76264a3c073c
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_excel('./Data/Multiple_variable.xlsx',sheet_name='Sheet2')
df.head()
| | Age | YearsExperience | Salary | Gender | Classification | Job |
|---|---|---|---|---|---|---|
| 0 | 22 | 1.1 | 39343 | Female | Low | Assistant |
| 1 | 22 | 1.3 | 46205 | Male | TOP | Professor |
| 2 | 23 | 1.5 | 37731 | Female | TOP | Administrative |
| 3 | 24 | 2.0 | 43525 | Female | Medium | Assistant |
| 4 | 25 | 2.2 | 39891 | Male | Medium | Professor |
df.shape
(36, 6)
df.describe()
| | Age | YearsExperience | Salary |
|---|---|---|---|
| count | 36.000000 | 36.000000 | 36.000000 |
| mean | 34.472222 | 6.008333 | 82228.277778 |
| std | 6.942565 | 3.031489 | 28784.838078 |
| min | 22.000000 | 1.100000 | 37731.000000 |
| 25% | 29.000000 | 3.575000 | 57050.000000 |
| 50% | 37.000000 | 5.600000 | 82225.500000 |
| 75% | 40.250000 | 9.000000 | 110232.000000 |
| max | 49.000000 | 10.500000 | 122391.000000 |
df.dtypes
Age                  int64
YearsExperience    float64
Salary               int64
Gender              object
Classification      object
Job                 object
dtype: object
sns.pairplot(df, hue='Gender')
<seaborn.axisgrid.PairGrid at 0x2b8f3b597c0>
Salary rises with years of experience, and with age as well.
sns.pairplot(df,x_vars=['Age','YearsExperience'],y_vars=['Salary'],hue='Gender')
<seaborn.axisgrid.PairGrid at 0x2b8f2b5d2e0>
Both Age and YearsExperience are positively correlated with Salary.
df.corr(numeric_only=True)  # numeric_only skips the object columns; required in pandas >= 2.0
| | Age | YearsExperience | Salary |
|---|---|---|---|
| Age | 1.000000 | 0.858866 | 0.825977 |
| YearsExperience | 0.858866 | 1.000000 | 0.982536 |
| Salary | 0.825977 | 0.982536 | 1.000000 |
sns.heatmap(df.corr(numeric_only=True),annot=True,lw=1)
<AxesSubplot:>
sns.boxplot(y='Age',x='Gender',data=df)
<AxesSubplot:xlabel='Gender', ylabel='Age'>
sns.boxplot(y='Salary',x='Classification',data=df)
<AxesSubplot:xlabel='Classification', ylabel='Salary'>
Regression results are easiest to interpret when dummy variables are limited to two specific values, 1 or 0. Typically, 1 represents the presence of a qualitative attribute, and 0 represents the absence.
X = df[['Age', 'YearsExperience', 'Gender', 'Classification', 'Job']]
X = pd.get_dummies(data=X, drop_first=True)
X.head()
| | Age | YearsExperience | Gender_Male | Classification_Medium | Classification_TOP | Job_Assistant | Job_Manager | Job_Professor | Job_Senior Manager |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 1.1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 22 | 1.3 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 23 | 1.5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 24 | 2.0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | 25 | 2.2 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
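The drop_first=True above matters: with k categories, keeping all k dummy columns makes them sum to 1 in every row, duplicating the intercept and creating perfect multicollinearity (the "dummy variable trap"). A quick sketch, reusing the df loaded above:

# Sketch: without drop_first, the dummies for a k-level factor always sum
# to 1, colliding with the model's intercept (the dummy variable trap)
full = pd.get_dummies(df['Classification'])   # keeps all 3 levels: Low/Medium/TOP
print(full.sum(axis=1).unique())              # -> [1] for every row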
Y = df['Salary']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=101)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(21, 9)
(15, 9)
(21,)
(15,)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)
LinearRegression()
# print the intercept
print(model.intercept_)
40767.075290708104
The sign of each coefficient indicates the direction of the relationship between a predictor variable and the response variable.
coeff_parameter = pd.DataFrame(model.coef_,X.columns,columns=['Coefficient'])
coeff_parameter
| | Coefficient |
|---|---|
| Age | -548.323776 |
| YearsExperience | 10743.731522 |
| Gender_Male | -655.537127 |
| Classification_Medium | -6061.914786 |
| Classification_TOP | -1234.672994 |
| Job_Assistant | -1114.042048 |
| Job_Manager | 2291.025846 |
| Job_Professor | 964.080429 |
| Job_Senior Manager | -2141.064227 |
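As a sanity check (a sketch, using the objects already defined above), a single prediction can be rebuilt by hand, since a fitted linear model is just y_hat = intercept + sum(coef_i * x_i):

# Rebuild the first test-set prediction manually from intercept + coefficients
first_row = X_test.iloc[[0]]                        # double brackets keep it 2-D
manual = model.intercept_ + np.dot(first_row.values[0], model.coef_)
print(manual, model.predict(first_row)[0])          # the two numbers should match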
predictions = model.predict(X_test)
predictions
array([ 80970.53158392,  54147.79230197,  84608.3284182 , 112838.31994826,
       111603.646954  , 121986.73865586, 111208.49523541, 125909.94403861,
       112148.27092338,  43036.55273237, 127901.81847906,  85930.91891418,
        71270.93953831,  62820.7589416 ,  41918.81087788])
sns.regplot(x=y_test,y=predictions)  # newer seaborn versions require keyword arguments here
<AxesSubplot:xlabel='Salary'>
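Beyond the visual check above, it is worth quantifying the fit on the held-out set. A minimal sketch with sklearn.metrics (the printed values will depend on your data and split):

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Held-out performance of the sklearn model
print('R^2 :', r2_score(y_test, predictions))
print('MAE :', mean_absolute_error(y_test, predictions))
print('RMSE:', np.sqrt(mean_squared_error(y_test, predictions)))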
import statsmodels.api as sm
X_train_Sm= sm.add_constant(X_train)
ls=sm.OLS(y_train,X_train_Sm).fit()
print(ls.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Salary   R-squared:                       0.971
Model:                            OLS   Adj. R-squared:                  0.952
Method:                 Least Squares   F-statistic:                     51.09
Date:                Sun, 09 Jul 2023   Prob (F-statistic):           4.20e-08
Time:                        22:21:20   Log-Likelihood:                -207.81
No. Observations:                  21   AIC:                             433.6
Df Residuals:                      12   BIC:                             443.0
Df Model:                           8
Covariance Type:            nonrobust
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                  3.261e+04   9431.499      3.458      0.005    1.21e+04    5.32e+04
Age                    -548.3238    427.836     -1.282      0.224   -1480.499     383.852
YearsExperience        1.074e+04   1744.592      6.158      0.000    6942.592    1.45e+04
Gender_Male            -655.5371   3091.666     -0.212      0.836   -7391.698    6080.624
Classification_Medium  -6061.9148   4374.305     -1.386      0.191   -1.56e+04    3468.878
Classification_TOP     -1234.6730   4574.620     -0.270      0.792   -1.12e+04    8732.568
Job_Assistant           7039.3730   3997.588      1.761      0.104   -1670.623    1.57e+04
Job_Manager             1.044e+04   5193.314      2.011      0.067    -870.818    2.18e+04
Job_Professor           9117.4955   4842.954      1.883      0.084   -1434.395    1.97e+04
Job_Senior Manager      6012.3508   6454.697      0.931      0.370   -8051.225    2.01e+04
==============================================================================
Omnibus:                        9.775   Durbin-Watson:                   2.131
Prob(Omnibus):                  0.008   Jarque-Bera (JB):                2.092
Skew:                           0.115   Prob(JB):                        0.351
Kurtosis:                       1.471   Cond. No.                     4.70e+17
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.18e-31. This might indicate that there are strong multicollinearity problems or that the design matrix is singular.
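Note [2] in the summary flags a near-singular design matrix, i.e. strong multicollinearity (Age and YearsExperience are highly correlated, and the dummy sets overlap). One way to inspect this is with variance inflation factors from statsmodels (a sketch, reusing X_train_Sm from above):

from statsmodels.stats.outliers_influence import variance_inflation_factor
# VIF per column of the design matrix; values far above ~10 (or inf, for a
# truly singular matrix) point at multicollinearity
vifs = pd.Series([variance_inflation_factor(X_train_Sm.values, i)
                  for i in range(X_train_Sm.shape[1])],
                 index=X_train_Sm.columns)
print(vifs)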
We use adjusted R-squared to compare the goodness of fit of regression models that contain different numbers of independent variables.
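For reference, adjusted R-squared rescales R-squared by the degrees of freedom: adj R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), with n observations and p predictors. A quick check against the summary above:

# Adjusted R-squared by hand, compared with the value statsmodels reports
n = int(ls.nobs)
p = int(ls.df_model)   # effective predictors: 8 here, not 9, due to the rank-deficient design
adj_r2 = 1 - (1 - ls.rsquared) * (n - 1) / (n - p - 1)
print(adj_r2, ls.rsquared_adj)   # both should agree (~0.952 here)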