Regression analysis is one of the most important statistical techniques for business applications. It's a statistical methodology that helps estimate the strength and direction of the relationship between two (or more) variables. Regression results show whether the relationship is meaningful or not, and the fitted relationship can also be used to predict an unknown value based on the known values.
Regression analysis is a parametric technique used to predict the value of an unknown target variable (or dependent variable) $y$ based on one or more known input features (or independent variables, predictors), often denoted by $x$.
Linearity, one of the assumptions of this approach, suggests that the relationship between the dependent and independent variables can be expressed as a straight line.
In general, regression analysis helps us describe the relationship between variables, assess how strong that relationship is, and predict unknown values from it.
Linear regression is simply a manifestation of $$y=mx+c$$ or, alternatively, $$y = \beta_0+ \beta_1 x$$ This is as complicated as our linear regression model gets. The equation here is the same one used to find a line in algebra, but in statistics, the actual data points don't all lie exactly on a line!
The real challenge for regression analysis is to fit, out of an infinite number of possible lines, the one line that best describes the data.
This best-fit line is written as $$\hat y = \hat m x+ \hat c$$ or $$\hat y = \hat \beta_0+ \hat \beta_1 x$$ As you can see, we're using "hat" notation, which indicates that we are working with estimates of the true parameters.
Before we calculate the best-fit line, we have to make sure that we have calculated the following measures for variables X and Y:
The mean of the X values $(\bar{X})$
The mean of the Y values $(\bar{Y})$
The standard deviation of the X values $(S_X)$
The standard deviation of the Y values $(S_Y)$
The correlation between X and Y (often denoted by the Greek letter "rho", $\rho$ - the Pearson correlation)
With the above ingredients in hand, we can calculate the slope (shown as $\hat m$ below) of the best-fit line, using the formula:
$$\hat m = \rho \frac{S_Y}{S_X}$$ This is the slope estimate produced by the least squares method.
You can visit this Wikipedia link to take a look at the maths behind the derivation of this formula.
The slope of the best-fit line can be a negative number following a negative correlation. For example, if an increase in police officers is related to a decrease in the number of crimes in a linear fashion, the correlation and hence the slope of the best-fitting line in this particular setting is negative.
So now that we have the slope value $(\hat m)$, we can put it back into our formula $(\hat y = \hat m x+ \hat c)$ to calculate the intercept. The idea is that
$$\bar{Y} = \hat c + \hat m \bar{X}$$ $$\hat c = \bar{Y} - \hat m\bar{X}$$ Recall that $\bar{X}$ and $\bar{Y}$ are the mean values of the variables X and Y. So, to calculate the intercept $\hat c$ of the best-fit line, we start by finding the slope using the formula above. Then we multiply the slope by the mean of X and subtract the result from the mean of Y.
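As a quick illustration, here is a minimal sketch (using made-up example data, not any dataset from this lesson) of computing the slope and intercept directly from these summary statistics with NumPy:

import numpy as np

# hypothetical example data: any paired X and Y values would do
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

rho = np.corrcoef(X, Y)[0, 1]        # Pearson correlation between X and Y
m_hat = rho * Y.std() / X.std()      # slope: rho * S_Y / S_X
c_hat = Y.mean() - m_hat * X.mean()  # intercept: mean(Y) - slope * mean(X)

print('slope:', round(m_hat, 3), 'intercept:', round(c_hat, 3))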
As mentioned before, once you have a regression line with its slope and intercept estimated as above, you can easily predict the $\hat{y}$ (target) value for a new $x$ (feature) value using the estimated parameter values:
$$\hat{y} = \hat m x + \hat c$$ Remember that the difference between $y$ and $\hat{y}$ is that $\hat{y}$ is the value predicted by the fitted model, whereas $y$ holds the actual observed values (the truth values) that were used to calculate the best fit.
The linearity assumption requires that there is a linear relationship between the response variable (Y) and the predictor (X). Linear means that the change in Y for a 1-unit change in X is constant.
If we try to fit a linear model to a non-linear data set, Ordinary Least Squares (OLS) will fail to capture the trend mathematically, resulting in an inaccurate relationship. This will also result in erroneous predictions on an unseen data set.
The linearity assumption can best be tested with scatter plots.
For non-linear relationships, you can use non-linear mathematical functions to fit the data e.g. polynomial and exponential functions. You'll come across these later.
Note: As an extra measure, it is also important to check for outliers as the presence of outliers in the data can have a major impact on the model.
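As a minimal sketch of such a check (assuming `df` is a pandas DataFrame with a `height` predictor and a `weight` target, as in the statsmodels example further below), a quick scatter plot is usually enough to eyeball linearity and spot obvious outliers:

import matplotlib.pyplot as plt

# scatter the raw data: a straight-line trend suggests the linearity assumption is reasonable
plt.scatter(df['height'], df['weight'], alpha=0.7)
plt.xlabel('height')
plt.ylabel('weight')
plt.title('Linearity check: weight vs. height')
plt.show()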
The normality assumption states that the model residuals should follow a normal distribution.
Note that the normality assumption applies to the model residuals and not to the distributions of the variables themselves! In general, data scientists will often check the distributions of the variables as well. Keep in mind that the normality assumption is mandatory for the residuals, and checking the normality of your variables is useful for spotting anything unusual (more on data distributions later), but OLS works fine for non-normal variable distributions in the context of prediction.
The easiest way to check the normality assumption is with a histogram or a Q-Q plot of the residuals.
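For example, here is a minimal sketch of both checks on the residuals (assuming a fitted statsmodels model called `model`, as created further below):

import matplotlib.pyplot as plt
import statsmodels.api as sm

# histogram of the residuals: should look roughly bell-shaped
plt.hist(model.resid, bins=20)
plt.title('Histogram of residuals')
plt.show()

# Q-Q plot: points should fall close to the 45-degree line if the residuals are normal
sm.qqplot(model.resid, line='45', fit=True)
plt.show()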
Heteroscedasticity (also spelled heteroskedasticity) refers to the circumstance in which the variability of the dependent variable is unequal across the range of values of the predictor(s).
When there is heteroscedasticity in the data, a scatterplot of these variables will often create a cone-like shape. The scatter of the dependent variable widens or narrows as the value of the independent variable increases.
The inverse of heteroscedasticity is homoscedasticity, which indicates that a dependent variable's variability is equal across values of the independent variable. Homoscedasticity is the third assumption necessary when creating a linear regression model.
A scatter plot is a good way to check whether the data are homoscedastic (meaning the residuals are equal in spread across the regression line). The scatter plots shown here are examples of data that are heteroscedastic (except the plot on the far right). You can also use significance tests like the Breusch-Pagan / Cook-Weisberg test or White's general test to detect this phenomenon. If these tests give you a p-value < 0.05, the null hypothesis can be rejected, and you can assume the data are heteroscedastic.
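As a minimal sketch (again assuming a fitted statsmodels model called `model`), the Breusch-Pagan test mentioned above can be run as follows; White's test is available in the same module as `het_white`:

import statsmodels.stats.api as sms

# Breusch-Pagan test: the null hypothesis is homoscedasticity
names = ['LM statistic', 'LM p-value', 'F statistic', 'F p-value']
bp_test = sms.het_breuschpagan(model.resid, model.model.exog)
list(zip(names, bp_test))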
The $R^2$, or coefficient of determination, is a statistical measure that is used to assess the goodness of fit of a regression model.
Here is how it works.
R-Squared uses a so-called "baseline" model, which is the worst model. This baseline model does not make use of any independent variables to predict the value of the dependent variable Y. Instead, it uses the mean of the observed responses of the dependent variable $y$ and always predicts this mean as the value of $\hat y$ for any value of $x$. Any regression model that we fit is compared to this baseline model to understand its goodness of fit. Simply put, R-Squared just explains how good your model is compared to the baseline model. That's about it.
The mathematical formula to calculate R-Squared for a linear regression line is in terms of squared errors for the fitted model and the baseline model. It's calculated as:
$$ \large R^2 = 1- \dfrac{SS_{RES}}{SS_{TOT}} = 1 - \dfrac{\sum_i(y_i - \hat y_i)^2}{\sum_i(y_i - \overline y)^2} $$ $SS_{RES}$ (also called RSS) is the residual sum of squared errors of our regression model, also known as $SSE$ (Sum of Squared Errors). $SS_{RES}$ is the sum of squared differences between $y$ and $\hat y$. For the one highlighted observation in our graph above, the $SS_{RES}$ contribution is denoted by the red arrow. This part of the error is not explained by our model.
$SS_{TOT}$ (also called TSS) is the total sum of squares. $SS_{TOT}$ is the sum of squared differences between $y$ and $\overline y$. For the one highlighted observation in our graph above, the $SS_{TOT}$ contribution is denoted by the green arrow.
Looking at this, you'll understand that you can interpret R-Squared as "1 minus the proportion of the variance not explained by the model", which amounts to "the proportion of the variation explained by the model". As a result, you'll want to maximize the R-Squared.
For completeness: R-Squared can take values between 0 and 1, where values closer to 0 represent a poor fit and values closer to 1 represent an (almost) perfect fit.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
def calc_slope(xs, ys):
    # least-squares slope from the "means" formula
    m = (((np.mean(xs)*np.mean(ys)) - np.mean(xs*ys)) /
         ((np.mean(xs)**2) - np.mean(xs*xs)))
    return m

def best_fit(xs, ys):
    # slope from calc_slope, intercept from b = mean(y) - m*mean(x)
    m = calc_slope(xs, ys)
    b = np.mean(ys) - m*np.mean(xs)
    return m, b

def reg_line(m, b, X):
    # predicted y value for every x on the fitted line
    return [(m*x)+b for x in X]

def sum_sq_err(ys_real, ys_predicted):
    # sum of squared differences between actual and predicted values
    sse = sum((ys_predicted - ys_real) * (ys_predicted - ys_real))
    return sse

def r_squared(ys_real, ys_predicted):
    # Calculate Y_mean, squared error for regression and mean line, and calculate r-squared
    y_mean = [np.mean(ys_real) for y in ys_real]
    sq_err_reg = sum_sq_err(ys_real, ys_predicted)
    sq_err_y_mean = sum_sq_err(ys_real, y_mean)
    # Calculate r-squared
    r_sq = 1 - (sq_err_reg/sq_err_y_mean)
    return r_sq

def plot_reg(X, Y, Y_pred):
    # scatter the data and overlay the fitted regression line
    plt.scatter(X, Y, color='#003F72', label='data')
    plt.plot(X, Y_pred, label='regression line')
    plt.legend(loc=4)
    plt.show()
    return None
# X and Y should be NumPy arrays holding the predictor and target values
X = predictor
Y = target

m, b = best_fit(X, Y)
Y_pred = reg_line(m, b, X)
r_sq = r_squared(Y, Y_pred)

print('Basic Regression Diagnostics')
print('----------------------------')
print('Slope:', round(m, 2))
print('Y-Intercept:', round(b, 2))
print('R-Squared:', round(r_sq, 2))
print('----------------------------')
print('Model: Y =', round(m, 2), '* X +', round(b, 2))

plot_reg(X, Y, Y_pred)
# make predictions for a new x value and plot
x_new = 4.5
y_new = (m*x_new) + b
y_new  # predicted value (displayed as the cell output in a notebook)

plt.scatter(X, Y, color='#000F72', label='data')
plt.plot(X, Y_pred, color='#880000', label='regression line')
plt.scatter(x_new, y_new, color='r', label='Prediction: ' + str(np.round(y_new, 1)))
plt.legend(loc=4)
plt.show()
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
plt.style.use('seaborn')
#visualize distribution of data
df.plot.kde()
plt.title("distribution check for dependent and independent variable")
plt.show()
#formula for simple regression 'target~feature'
f = 'weight~height'
#fit model
model = ols(formula=f, data=df).fit()
model.summary()
#plot error visuals
fig = plt.figure(figsize=(15,8))
fig = sm.graphics.plot_regress_exog(model, "height", fig=fig)
plt.show()
The Jarque-Bera (JB) test is a test for normality. This test is usually used for large data sets, because other checks like Q-Q plots can become unreliable when your sample size is large.
$$JB = n \Bigl(\dfrac{S^2}{6} + \dfrac{(K - 3)^2}{24}\Bigr)$$ The Jarque-Bera test inspects the skewness and kurtosis of the data to see if they match a normal distribution. It is a common method for inspecting the distribution of the errors in a regression, as shown below.
Here, $n$ is the sample size, $S$ is the sample skewness coefficient and $K$ is the sample kurtosis.
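To make the formula concrete, here is a minimal sketch of computing the JB statistic by hand with SciPy, using simulated residuals rather than a real model (note that `kurtosis` is asked for non-excess kurtosis so that $K \approx 3$ for normal data):

import numpy as np
from scipy.stats import skew, kurtosis

np.random.seed(11)
resid = np.random.normal(size=500)   # stand-in for model residuals

n = len(resid)
S = skew(resid)
K = kurtosis(resid, fisher=False)    # Pearson (non-excess) kurtosis: normal data -> roughly 3

JB = n * (S**2 / 6 + (K - 3)**2 / 24)
print('Jarque-Bera statistic:', round(JB, 3))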
Here is how you use the JB test in statsmodels. A JB value of roughly 6 or higher indicates that the errors are not normally distributed; in other words, the normality null hypothesis is rejected at the $5\%$ significance level. A value close to 0, on the contrary, indicates that the data *is* normally distributed. We have already seen the JB test in the output of model.summary(). The code below shows you how to run this test on its own.
# JB test (the standalone test lives in statsmodels' stats module)
import statsmodels.stats.api as sms

name = ['Jarque-Bera', 'Prob', 'Skew', 'Kurtosis']
test = sms.jarque_bera(model.resid)
list(zip(name, test))
The Goldfeld-Quandt (GQ) test is used in regression analysis to check for homoscedasticity in the error terms. The GQ test checks whether you can define a break point that splits the data into groups with different error variances. It is a parametric test and relies on the assumption that the data is normally distributed, so it is general practice to check for normality before moving on to the GQ test!
In the image below, you can see how the observations are split into two groups. Next, a test statistic is computed as the ratio of the mean squared residual errors for the regressions fitted on the two subsets. Evidence of heteroskedasticity is based on performing a hypothesis test on this ratio.
Here is a brief description of how the test works:
Large F values typically indicate that the variances are different. If the error term is homoscedastic, there should be no systematic difference between the residuals of the two groups, and the F value will be small. However, if the standard deviation of the distribution of the error term is proportional to the X variable, one part will generate a higher sum of squares than the other.
The null hypothesis for the GQ test is homoskedasticity. The larger the F-statistic, the more evidence we will have against the homoskedasticity assumption and the more likely we have heteroskedasticity (different variance for the two groups).
The p-value of the test tells us whether or not to reject the null hypothesis of homoscedasticity at a significance level of alpha = 0.05.
# Run the Goldfeld-Quandt test (sms is statsmodels.stats.api, imported above)
# `indices` is assumed to order the observations by the predictor before splitting
name = ['F statistic', 'p-value']
test = sms.het_goldfeldquandt(model.resid.iloc[indices], model.model.exog[indices])
list(zip(name, test))
We can add as many predictors as we like. What is important to note, however, is that visually representing a multiple linear model with more than 2 predictors becomes very difficult, and even impossible! Still, it is done all the time, as linear models of all sorts are extensively used in many fields!
When thinking of lines and slopes statistically, slope parameters associated with a particular predictor $x_i$ are often denoted by $\beta_i$. Extending this example mathematically, you would write a multiple linear regression model as follows:
$$ \hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 +\ldots + \hat\beta_n x_n $$ where $n$ is the number of predictors, $\hat\beta_0$ is the intercept, and $\hat y$ is the so-called "fitted line", i.e. the predicted value of the dependent variable.
statsmodels returns a full results table when .summary() is called on the fitted model:
# X (boston_features) and y come from the Boston housing data prepared earlier
X = boston_features
y = pd.DataFrame(boston.target, columns=["price"])

import statsmodels.api as sm

# add a constant column so that statsmodels fits an intercept
X_int = sm.add_constant(X)
model = sm.OLS(y, X_int).fit()
model.summary()
scikit-learn does not produce a summary table; you need explicit commands to view the results:
from sklearn.linear_model import LinearRegression

# scikit-learn adds the intercept automatically; X_smaller is a subset of the features
linreg = LinearRegression()
linreg.fit(X_smaller, y)
print(linreg.intercept_)
print(linreg.coef_)
Now, instead of completely "deleting" certain predictors from a model (which is equivalent to setting their coefficients equal to zero), wouldn't it be interesting to just make coefficients smaller? So-called penalized estimation operates in a way where parameter shrinkage effects are used to make some or all of the coefficients smaller in magnitude (closer to zero). Some of the penalties have the property of performing both variable selection (setting some coefficients exactly equal to zero) and shrinking the other coefficients. Ridge and Lasso regression are two examples of penalized estimation, and there are multiple advantages to using these methods.
Lasso and Ridge are two commonly used so-called regularization techniques. Regularization is a general term for techniques used to battle overfitting. Regularization techniques will be covered in more depth when we move into machine learning!
In ridge regression, the linear regression cost function is changed by adding a penalty term proportional to the square of the magnitude of the coefficients:
$$ \text{cost}_{\text{ridge}} = \sum_{i=1}^n \Big(y_i - \sum_{j=1}^k m_j x_{ij} - b\Big)^2 + \lambda \sum_{j=1}^k m_j^2$$ Recall that you want to minimize your cost function, so by adding the penalty term $\lambda \sum_j m_j^2$, ridge regression puts a constraint on the coefficients $m_j$: large coefficients penalize the optimization function. That's why ridge regression leads to a shrinkage of the coefficients and helps to reduce model complexity and multicollinearity.
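As a minimal sketch of what this cost function computes (with made-up arrays for the data and candidate coefficients, not a fitted model), the two terms can be evaluated directly with NumPy:

import numpy as np

# hypothetical data: 5 observations, 2 predictors
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 10.5])

m = np.array([1.0, 1.0])   # candidate slope coefficients
b = 0.5                    # candidate intercept
lam = 1.0                  # the penalty strength lambda

residual_term = np.sum((y - (X @ m + b))**2)   # ordinary least squares part
penalty_term = lam * np.sum(m**2)              # ridge (L2) penalty on the slopes
print('ridge cost:', residual_term + penalty_term)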
$\lambda$ is a so-called hyperparameter, which means you have to specify its value yourself. For a small lambda, the outcome of your ridge regression will resemble a plain linear regression model. For a large lambda, penalization will increase and more parameters will shrink towards zero.
Ridge regression is often also referred to as L2 norm regularization.
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error

ridge = Ridge()  # Ridge applies the L2 penalty; the default alpha (lambda) is 1.0
ridge.fit(X_train, y_train)

print('Training r^2:', ridge.score(X_train, y_train))
print('Testing r^2:', ridge.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, ridge.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, ridge.predict(X_test)))
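Since $\lambda$ (called `alpha` in scikit-learn) has to be chosen by you, a quick sketch like the one below (assuming the same `X_train`, `y_train`, `X_test`, and `y_test` splits as above) shows how the fit changes as the penalty grows:

# compare ridge fits for increasing penalty strengths
for alpha in [0.01, 1, 10, 100]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    print('alpha =', alpha,
          '| train r^2:', round(ridge.score(X_train, y_train), 3),
          '| test r^2:', round(ridge.score(X_test, y_test), 3))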
Lasso regression is very similar to ridge regression, except that the magnitudes of the coefficients are not squared in the penalty term. So, while ridge regression keeps the sum of the squared regression coefficients (except for the intercept) bounded, the lasso method bounds the sum of their absolute values.
$$ \text{cost}_{\text{lasso}} = \sum_{i=1}^n \Big(y_i - \sum_{j=1}^k m_j x_{ij} - b\Big)^2 + \lambda \sum_{j=1}^k \mid m_j \mid$$ The name "Lasso" comes from 'Least Absolute Shrinkage and Selection Operator'.
While this looks similar to the definition of the ridge estimator, the effect of using absolute values is that some coefficients might be set exactly equal to zero, while the other coefficients are shrunk towards zero. Hence the lasso method is attractive because it performs estimation and selection simultaneously, which is especially useful when the number of predictors is very high.
Lasso regression is often also referred to as L1 norm regularization.
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error

lasso = Lasso()  # Lasso applies the L1 penalty; the default alpha (lambda) is 1.0
lasso.fit(X_train, y_train)

print('Training r^2:', lasso.score(X_train, y_train))
print('Testing r^2:', lasso.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, lasso.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, lasso.predict(X_test)))
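To see the selection property described above in action, a short sketch like this (again assuming the `X_train` and `y_train` split used above) counts how many coefficients Lasso sets exactly to zero as the penalty grows:

import numpy as np

# larger alphas push more lasso coefficients exactly to zero
for alpha in [0.01, 0.1, 1, 10]:
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train, y_train)
    n_zero = np.sum(lasso.coef_ == 0)
    print('alpha =', alpha, '| coefficients set to zero:', n_zero, 'out of', lasso.coef_.size)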