Linear regression is a predictive model that assumes a linear relationship between the dependent variable (the variable we are trying to predict or estimate) and one or more independent variables (the input variables used in the prediction).
In the following example, we will perform multiple linear regression for a fictitious economy, where index_price is the dependent variable and the two independent (input) variables are interest_rate and unemployment_rate.
Please note that you will have to validate that several assumptions are met before you apply linear regression models. Most notably, you have to make sure that a linear relationship exists between the dependent variable and the independent variable/s (more on that under the checking for linearity section).
import pandas as pd
data = {
    'year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
    'month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
    'interest_rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
    'unemployment_rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
    'index_price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719],
}
df = pd.DataFrame(data)
print(df)
    year  month  interest_rate  unemployment_rate  index_price
0   2017     12           2.75                5.3         1464
1   2017     11           2.50                5.3         1394
2   2017     10           2.50                5.3         1357
3   2017      9           2.50                5.3         1293
4   2017      8           2.50                5.4         1256
5   2017      7           2.50                5.6         1254
6   2017      6           2.50                5.5         1234
7   2017      5           2.25                5.5         1195
8   2017      4           2.25                5.5         1159
9   2017      3           2.25                5.6         1167
10  2017      2           2.00                5.7         1130
11  2017      1           2.00                5.9         1075
12  2016     12           2.00                6.0         1047
13  2016     11           1.75                5.9          965
14  2016     10           1.75                5.8          943
15  2016      9           1.75                6.1          958
16  2016      8           1.75                6.2          971
17  2016      7           1.75                6.1          949
18  2016      6           1.75                6.1          884
19  2016      5           1.75                6.1          866
20  2016      4           1.75                5.9          876
21  2016      3           1.75                6.2          822
22  2016      2           1.75                6.2          704
23  2016      1           1.75                6.1          719
import matplotlib.pyplot as plt
plt.scatter(df['interest_rate'], df['index_price'], color='red')
plt.title('Index Price Vs Interest Rate', fontsize=14)
plt.xlabel('Interest Rate', fontsize=14)
plt.ylabel('Index Price', fontsize=14)
plt.grid(True)
plt.show()
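The scatter plot above covers only interest_rate. As a rough numeric complement, the following sketch computes the Pearson correlation of each input variable with index_price. It uses NumPy, which is an assumption here (the original example relies only on pandas and matplotlib), and pearson_r is a helper name introduced for illustration. A coefficient close to +1 or -1 is consistent with a linear relationship but does not prove one, so the plots are still worth inspecting.

```python
import numpy as np

# Same values as in the DataFrame above
interest_rate = np.array([2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25, 2, 2,
                          2, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75])
unemployment_rate = np.array([5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9,
                              6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2, 6.1])
index_price = np.array([1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075,
                        1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719], dtype=float)


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays (illustrative helper)."""
    return float(np.corrcoef(a, b)[0, 1])


r_interest = pearson_r(interest_rate, index_price)
r_unemployment = pearson_r(unemployment_rate, index_price)

print(f"interest_rate vs index_price:     r = {r_interest:.3f}")
print(f"unemployment_rate vs index_price: r = {r_unemployment:.3f}")
```

One would expect a positive value for interest_rate and a negative value for unemployment_rate, matching the direction of the trend in the data.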
x = df[['interest_rate','unemployment_rate']]
y = df['index_price']
import statsmodels.api as sm
x = sm.add_constant(x)
model = sm.OLS(y, x).fit()
predictions = model.predict(x)
print_model = model.summary()
print(print_model)
                            OLS Regression Results                            
==============================================================================
Dep. Variable:            index_price   R-squared:                       0.898
Model:                            OLS   Adj. R-squared:                  0.888
Method:                 Least Squares   F-statistic:                     92.07
Date:                Sun, 09 Jul 2023   Prob (F-statistic):           4.04e-11
Time:                        17:15:55   Log-Likelihood:                -134.61
No. Observations:                  24   AIC:                             275.2
Df Residuals:                      21   BIC:                             278.8
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const              1798.4040    899.248      2.000      0.059     -71.685    3668.493
interest_rate       345.5401    111.367      3.103      0.005     113.940     577.140
unemployment_rate  -250.1466    117.950     -2.121      0.046    -495.437      -4.856
==============================================================================
Omnibus:                        2.691   Durbin-Watson:                   0.530
Prob(Omnibus):                  0.260   Jarque-Bera (JB):                1.551
Skew:                          -0.612   Prob(JB):                        0.461
Kurtosis:                       3.226   Cond. No.                         394.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
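The coef column of the summary defines the fitted equation: index_price = const + (interest_rate coef) × interest_rate + (unemployment_rate coef) × unemployment_rate. As a minimal sketch, the snippet below plugs the rounded coefficients from the summary into that equation by hand. The function name predict_index_price is made up for illustration, and the input values are simply those of the first row of the data.

```python
# Coefficients copied (rounded) from the OLS summary above
INTERCEPT = 1798.4040
COEF_INTEREST_RATE = 345.5401
COEF_UNEMPLOYMENT_RATE = -250.1466


def predict_index_price(interest_rate, unemployment_rate):
    """Evaluate the fitted linear equation for one observation (illustrative helper)."""
    return (INTERCEPT
            + COEF_INTEREST_RATE * interest_rate
            + COEF_UNEMPLOYMENT_RATE * unemployment_rate)


# First row of the data: interest_rate = 2.75, unemployment_rate = 5.3
predicted = predict_index_price(2.75, 5.3)
print(f"Predicted index_price: {predicted:.2f}")  # prints 1422.86
```

The observed index_price for that row is 1464, so the difference of roughly 41 points is the residual for that observation.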
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(x, y)
LinearRegression()
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)
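Intercept: 
 1798.4039776258558
Coefficients: 
 [ 0. 345.54008701 -250.14657137]
The leading 0 in the coefficient array belongs to the const column that sm.add_constant added to x. By default, sklearn's LinearRegression fits its own intercept, so that constant column contributes nothing. The remaining two coefficients (interest_rate and unemployment_rate) and the intercept match the statsmodels results.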
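Both libraries arrive at the same numbers because they solve the same ordinary least-squares problem. As a cross-check, the sketch below solves that problem directly with numpy.linalg.lstsq. This is an independent illustration rather than what either library necessarily does internally, and NumPy is an assumption not used in the original example.

```python
import numpy as np

# Same values as in the DataFrame above
interest_rate = np.array([2.75, 2.5, 2.5, 2.5, 2.5, 2.5, 2.5, 2.25, 2.25, 2.25, 2, 2,
                          2, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75])
unemployment_rate = np.array([5.3, 5.3, 5.3, 5.3, 5.4, 5.6, 5.5, 5.5, 5.5, 5.6, 5.7, 5.9,
                              6, 5.9, 5.8, 6.1, 6.2, 6.1, 6.1, 6.1, 5.9, 6.2, 6.2, 6.1])
index_price = np.array([1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075,
                        1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719], dtype=float)

# Design matrix: a column of ones (for the intercept) plus the two input variables
X = np.column_stack([np.ones(len(index_price)), interest_rate, unemployment_rate])

# Find beta that minimizes the sum of squared residuals ||X @ beta - y||^2
beta, _, _, _ = np.linalg.lstsq(X, index_price, rcond=None)
intercept, coef_interest, coef_unemployment = beta

print("Intercept:", intercept)
print("interest_rate coefficient:", coef_interest)
print("unemployment_rate coefficient:", coef_unemployment)
```

The printed values should agree with the statsmodels and sklearn results above up to floating-point rounding.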
Linear regression is often used in Machine Learning. You have seen some examples of how to perform multiple linear regression in Python using both sklearn and statsmodels.
Before applying linear regression models, make sure to check that a linear relationship exists between the dependent variable (i.e., what you are trying to predict) and the independent variables (i.e., the input variables).