https://medium.com/analytics-vidhya/implementing-linear-regression-using-sklearn-76264a3c073c
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_excel('./Data/Multiple_variable.xlsx',sheet_name='Sheet2')
df.head()
| | Age | YearsExperience | Salary | Gender | Classification | Job |
|---|---|---|---|---|---|---|
| 0 | 22 | 1.1 | 39343 | Female | Low | Assistant |
| 1 | 22 | 1.3 | 46205 | Male | TOP | Professor |
| 2 | 23 | 1.5 | 37731 | Female | TOP | Administrative |
| 3 | 24 | 2.0 | 43525 | Female | Medium | Assistant |
| 4 | 25 | 2.2 | 39891 | Male | Medium | Professor |
df.shape
(36, 6)
df.describe()
| | Age | YearsExperience | Salary |
|---|---|---|---|
| count | 36.000000 | 36.000000 | 36.000000 |
| mean | 34.472222 | 6.008333 | 82228.277778 |
| std | 6.942565 | 3.031489 | 28784.838078 |
| min | 22.000000 | 1.100000 | 37731.000000 |
| 25% | 29.000000 | 3.575000 | 57050.000000 |
| 50% | 37.000000 | 5.600000 | 82225.500000 |
| 75% | 40.250000 | 9.000000 | 110232.000000 |
| max | 49.000000 | 10.500000 | 122391.000000 |
df.dtypes
Age                  int64
YearsExperience    float64
Salary               int64
Gender              object
Classification      object
Job                 object
dtype: object
sns.pairplot(df, hue='Gender')
<seaborn.axisgrid.PairGrid at 0x2b8f3b597c0>
Salary rises with years of experience, and with age as well.
sns.pairplot(df,x_vars=['Age','YearsExperience'],y_vars=['Salary'],hue='Gender')
<seaborn.axisgrid.PairGrid at 0x2b8f2b5d2e0>
Both Age and YearsExperience are positively correlated with Salary.
df.corr(numeric_only=True)  # numeric_only skips the object columns; required in pandas >= 2.0
| | Age | YearsExperience | Salary |
|---|---|---|---|
| Age | 1.000000 | 0.858866 | 0.825977 |
| YearsExperience | 0.858866 | 1.000000 | 0.982536 |
| Salary | 0.825977 | 0.982536 | 1.000000 |
sns.heatmap(df.corr(numeric_only=True),annot=True,lw=1)
<AxesSubplot:>
sns.boxplot(y='Age',x='Gender',data=df)
<AxesSubplot:xlabel='Gender', ylabel='Age'>
sns.boxplot(y='Salary',x='Classification',data=df)
<AxesSubplot:xlabel='Classification', ylabel='Salary'>
Regression results are easiest to interpret when dummy variables are limited to two specific values, 1 or 0. Typically, 1 represents the presence of a qualitative attribute, and 0 represents the absence.
X = df[['Age', 'YearsExperience', 'Gender', 'Classification', 'Job']]
X = pd.get_dummies(data=X, drop_first=True)
X.head()
| | Age | YearsExperience | Gender_Male | Classification_Medium | Classification_TOP | Job_Assistant | Job_Manager | Job_Professor | Job_Senior Manager |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 22 | 1.1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 22 | 1.3 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 23 | 1.5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 24 | 2.0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | 25 | 2.2 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
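The drop_first=True above matters: with k categories, keeping all k dummy columns makes them sum to 1 in every row, duplicating the intercept and creating perfect multicollinearity (the "dummy variable trap"). A quick sketch, reusing the df loaded above:

# Sketch: without drop_first, the dummies for a k-level factor always sum
# to 1, colliding with the model's intercept (the dummy variable trap)
full = pd.get_dummies(df['Classification'])   # keeps all 3 levels: Low/Medium/TOP
print(full.sum(axis=1).unique())              # -> [1] for every row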
Y = df['Salary']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=101)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
(21, 9)
(15, 9)
(21,)
(15,)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)
LinearRegression()
# print the intercept
print(model.intercept_)
40767.075290708104
The sign of each coefficient indicates the direction of the relationship between a predictor variable and the response variable.
coeff_parameter = pd.DataFrame(model.coef_,X.columns,columns=['Coefficient'])
coeff_parameter
| | Coefficient |
|---|---|
| Age | -548.323776 |
| YearsExperience | 10743.731522 |
| Gender_Male | -655.537127 |
| Classification_Medium | -6061.914786 |
| Classification_TOP | -1234.672994 |
| Job_Assistant | -1114.042048 |
| Job_Manager | 2291.025846 |
| Job_Professor | 964.080429 |
| Job_Senior Manager | -2141.064227 |
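As a sanity check (a sketch, using the objects already defined above), a single prediction can be rebuilt by hand, since a fitted linear model is just y_hat = intercept + sum(coef_i * x_i):

# Rebuild the first test-set prediction manually from intercept + coefficients
first_row = X_test.iloc[[0]]                        # double brackets keep it 2-D
manual = model.intercept_ + np.dot(first_row.values[0], model.coef_)
print(manual, model.predict(first_row)[0])          # the two numbers should match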
predictions = model.predict(X_test)
predictions
array([ 80970.53158392,  54147.79230197,  84608.3284182 , 112838.31994826,
       111603.646954  , 121986.73865586, 111208.49523541, 125909.94403861,
       112148.27092338,  43036.55273237, 127901.81847906,  85930.91891418,
        71270.93953831,  62820.7589416 ,  41918.81087788])
sns.regplot(x=y_test,y=predictions)  # newer seaborn versions require keyword arguments here
<AxesSubplot:xlabel='Salary'>
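Beyond the visual check above, it is worth quantifying the fit on the held-out set. A minimal sketch with sklearn.metrics (the printed values will depend on your data and split):

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# Held-out performance of the sklearn model
print('R^2 :', r2_score(y_test, predictions))
print('MAE :', mean_absolute_error(y_test, predictions))
print('RMSE:', np.sqrt(mean_squared_error(y_test, predictions)))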
import statsmodels.api as sm
X_train_Sm= sm.add_constant(X_train)
ls=sm.OLS(y_train,X_train_Sm).fit()
print(ls.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 Salary   R-squared:                       0.971
Model:                            OLS   Adj. R-squared:                  0.952
Method:                 Least Squares   F-statistic:                     51.09
Date:                Sun, 09 Jul 2023   Prob (F-statistic):           4.20e-08
Time:                        22:21:20   Log-Likelihood:                -207.81
No. Observations:                  21   AIC:                             433.6
Df Residuals:                      12   BIC:                             443.0
Df Model:                           8
Covariance Type:            nonrobust
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
const                  3.261e+04   9431.499      3.458      0.005    1.21e+04    5.32e+04
Age                    -548.3238    427.836     -1.282      0.224   -1480.499     383.852
YearsExperience        1.074e+04   1744.592      6.158      0.000    6942.592    1.45e+04
Gender_Male            -655.5371   3091.666     -0.212      0.836   -7391.698    6080.624
Classification_Medium  -6061.9148   4374.305     -1.386      0.191   -1.56e+04    3468.878
Classification_TOP     -1234.6730   4574.620     -0.270      0.792   -1.12e+04    8732.568
Job_Assistant           7039.3730   3997.588      1.761      0.104   -1670.623    1.57e+04
Job_Manager             1.044e+04   5193.314      2.011      0.067    -870.818    2.18e+04
Job_Professor           9117.4955   4842.954      1.883      0.084   -1434.395    1.97e+04
Job_Senior Manager      6012.3508   6454.697      0.931      0.370   -8051.225    2.01e+04
==============================================================================
Omnibus:                        9.775   Durbin-Watson:                   2.131
Prob(Omnibus):                  0.008   Jarque-Bera (JB):                2.092
Skew:                           0.115   Prob(JB):                        0.351
Kurtosis:                       1.471   Cond. No.                     4.70e+17
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.18e-31. This might indicate that there are strong multicollinearity problems or that the design matrix is singular.
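Note [2] in the summary flags a near-singular design matrix, i.e. strong multicollinearity (Age and YearsExperience are highly correlated, and the dummy sets overlap). One way to inspect this is with variance inflation factors from statsmodels (a sketch, reusing X_train_Sm from above):

from statsmodels.stats.outliers_influence import variance_inflation_factor
# VIF per column of the design matrix; values far above ~10 (or inf, for a
# truly singular matrix) point at multicollinearity
vifs = pd.Series([variance_inflation_factor(X_train_Sm.values, i)
                  for i in range(X_train_Sm.shape[1])],
                 index=X_train_Sm.columns)
print(vifs)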
We use adjusted R-squared to compare the goodness of fit of regression models that contain different numbers of independent variables.
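For reference, adjusted R-squared rescales R-squared by the degrees of freedom: adj R2 = 1 - (1 - R2)(n - 1)/(n - p - 1), with n observations and p predictors. A quick check against the summary above:

# Adjusted R-squared by hand, compared with the value statsmodels reports
n = int(ls.nobs)
p = int(ls.df_model)   # effective predictors: 8 here, not 9, due to the rank-deficient design
adj_r2 = 1 - (1 - ls.rsquared) * (n - 1) / (n - p - 1)
print(adj_r2, ls.rsquared_adj)   # both should agree (~0.952 here)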