Regression analysis is one of the most important statistical techniques for business applications. It's a statistical methodology that helps estimate the strength and direction of the relationship between two (or more) variables. Regression results show whether the relationship is meaningful or not, and the fitted relationship can also be used to predict an unknown value based on the known values.
Regression analysis is a parametric technique used to predict the value of an unknown target variable (or dependent variable) $y$ based on one or more known input features (or independent variables, predictors), often denoted by $x$.
Linearity, one of the assumptions of this approach, suggests that the relationship between the dependent and independent variables can be expressed as a straight line.
In general, regression analysis helps us describe the relationship between variables, assess how strong that relationship is, and predict unknown values from it.
Linear regression is simply a manifestation of $$y=mx+c$$ or, alternatively, $$y = \beta_0+ \beta_1 x$$ This is as complicated as our linear regression model gets. The equation here is the same one used to find a line in algebra, but in statistics, the actual data points don't all lie exactly on a line!
The real challenge for regression analysis is to fit, out of an infinite number of possible lines, the one line that best describes the data.
This best-fit line is written as $$\hat y = \hat m x+ \hat c$$ or $$\hat y = \hat \beta_0+ \hat \beta_1 x$$ As you can see, we're using "hat" notation, which indicates that we are working with estimates of the true parameters.
Before we calculate the best-fit line, we have to make sure that we have calculated the following measures for variables X and Y:
The mean of the X values $(\bar{X})$
The mean of the Y values $(\bar{Y})$
The standard deviation of the X values $(S_X)$
The standard deviation of the Y values $(S_Y)$
The correlation between X and Y (often denoted by the Greek letter "rho", $\rho$ - the Pearson correlation)
With the above ingredients in hand, we can calculate the slope (shown as $\hat m$ below) of the best-fit line, using the formula:
$$\hat m = \rho \frac{S_Y}{S_X}$$ This is the slope estimate produced by the least squares method.
You can visit this Wikipedia link to take a look at the maths behind the derivation of this formula.
The slope of the best-fit line can be a negative number following a negative correlation. For example, if an increase in police officers is related to a decrease in the number of crimes in a linear fashion, the correlation and hence the slope of the best-fitting line in this particular setting is negative.
So now that we have the slope value $(\hat m)$, we can put it back into our formula $(\hat y = \hat m x+ \hat c)$ to calculate the intercept. The idea is that
$$\bar{Y} = \hat c + \hat m \bar{X}$$ $$\hat c = \bar{Y} - \hat m\bar{X}$$ Recall that $\bar{X}$ and $\bar{Y}$ are the mean values of the variables X and Y. So, to calculate the intercept $\hat c$ of the best-fit line, we start by finding the slope using the formula above. Then we multiply the slope by the mean of X and subtract the result from the mean of Y.
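As a quick illustration, here is a minimal sketch (using made-up example data, not any dataset from this lesson) of computing the slope and intercept directly from these summary statistics with NumPy:

import numpy as np

# hypothetical example data: any paired X and Y values would do
X = np.array([1, 2, 3, 4, 5], dtype=float)
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

rho = np.corrcoef(X, Y)[0, 1]        # Pearson correlation between X and Y
m_hat = rho * Y.std() / X.std()      # slope: rho * S_Y / S_X
c_hat = Y.mean() - m_hat * X.mean()  # intercept: mean(Y) - slope * mean(X)

print('slope:', round(m_hat, 3), 'intercept:', round(c_hat, 3))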
As mentioned before, once you have a regression line with its slope and intercept estimated as above, you can easily predict the $\hat{y}$ (target) value for a new $x$ (feature) value using the estimated parameter values:
$$\hat{y} = \hat m x + \hat c$$ Remember that the difference between $y$ and $\hat{y}$ is that $\hat{y}$ is the value predicted by the fitted model, whereas $y$ holds the actual observed values (the truth values) that were used to calculate the best fit.
The linearity assumption requires that there is a linear relationship between the response variable (Y) and the predictor (X). Linear means that the change in Y for a 1-unit change in X is constant.
If we try to fit a linear model to a non-linear data set, Ordinary Least Squares (OLS) will fail to capture the trend mathematically, resulting in an inaccurate relationship. This will also result in erroneous predictions on an unseen data set.
The linearity assumption can best be tested with scatter plots.
For non-linear relationships, you can use non-linear mathematical functions to fit the data e.g. polynomial and exponential functions. You'll come across these later.
Note: As an extra measure, it is also important to check for outliers as the presence of outliers in the data can have a major impact on the model.
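As a minimal sketch of such a check (assuming `df` is a pandas DataFrame with a `height` predictor and a `weight` target, as in the statsmodels example further below), a quick scatter plot is usually enough to eyeball linearity and spot obvious outliers:

import matplotlib.pyplot as plt

# scatter the raw data: a straight-line trend suggests the linearity assumption is reasonable
plt.scatter(df['height'], df['weight'], alpha=0.7)
plt.xlabel('height')
plt.ylabel('weight')
plt.title('Linearity check: weight vs. height')
plt.show()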
The normality assumption states that the model residuals should follow a normal distribution.
Note that the normality assumption applies to the model residuals and not to the distributions of the variables themselves! In general, data scientists will often check the distributions of the variables as well. Keep in mind that the normality assumption is mandatory for the residuals, and checking the normality of your variables is useful for spotting anything unusual (more on data distributions later), but OLS works fine for non-normal variable distributions in the context of prediction.
The easiest way to check the normality assumption is with a histogram or a Q-Q plot of the residuals.
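For example, here is a minimal sketch of both checks on the residuals (assuming a fitted statsmodels model called `model`, as created further below):

import matplotlib.pyplot as plt
import statsmodels.api as sm

# histogram of the residuals: should look roughly bell-shaped
plt.hist(model.resid, bins=20)
plt.title('Histogram of residuals')
plt.show()

# Q-Q plot: points should fall close to the 45-degree line if the residuals are normal
sm.qqplot(model.resid, line='45', fit=True)
plt.show()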
Heteroscedasticity (also spelled heteroskedasticity) refers to the circumstance in which the variability of the dependent variable is unequal across the range of values of the predictor(s).
When there is heteroscedasticity in the data, a scatterplot of these variables will often create a cone-like shape. The scatter of the dependent variable widens or narrows as the value of the independent variable increases.
The inverse of heteroscedasticity is homoscedasticity, which indicates that a dependent variable's variability is equal across values of the independent variable. Homoscedasticity is the third assumption necessary when creating a linear regression model.
A scatter plot is a good way to check whether the data are homoscedastic (meaning the residuals are equal in spread across the regression line). The scatter plots shown here are examples of data that are heteroscedastic (except the plot on the far right). You can also use significance tests like the Breusch-Pagan / Cook-Weisberg test or White's general test to detect this phenomenon. If these tests give you a p-value < 0.05, the null hypothesis can be rejected, and you can assume the data are heteroscedastic.
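As a minimal sketch (again assuming a fitted statsmodels model called `model`), the Breusch-Pagan test mentioned above can be run as follows; White's test is available in the same module as `het_white`:

import statsmodels.stats.api as sms

# Breusch-Pagan test: the null hypothesis is homoscedasticity
names = ['LM statistic', 'LM p-value', 'F statistic', 'F p-value']
bp_test = sms.het_breuschpagan(model.resid, model.model.exog)
list(zip(names, bp_test))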
The $R^2$, or coefficient of determination, is a statistical measure that is used to assess the goodness of fit of a regression model.
Here is how it works.
R-Squared uses a so-called "baseline" model, which is the worst model. This baseline model does not make use of any independent variables to predict the value of the dependent variable Y. Instead, it uses the mean of the observed responses of the dependent variable $y$ and always predicts this mean as the value of $\hat y$ for any value of $x$. Any regression model that we fit is compared to this baseline model to understand its goodness of fit. Simply put, R-Squared just explains how good your model is compared to the baseline model. That's about it.
The mathematical formula to calculate R-Squared for a linear regression line is in terms of squared errors for the fitted model and the baseline model. It's calculated as:
$$ \large R^2 = 1- \dfrac{SS_{RES}}{SS_{TOT}} = 1 - \dfrac{\sum_i(y_i - \hat y_i)^2}{\sum_i(y_i - \overline y)^2} $$ $SS_{RES}$ (also called RSS) is the residual sum of squared errors of our regression model, also known as $SSE$ (Sum of Squared Errors). $SS_{RES}$ is the sum of squared differences between $y$ and $\hat y$. For the one highlighted observation in our graph above, the $SS_{RES}$ contribution is denoted by the red arrow. This part of the error is not explained by our model.
$SS_{TOT}$ (also called TSS) is the total sum of squares. $SS_{TOT}$ is the sum of squared differences between $y$ and $\overline y$. For the one highlighted observation in our graph above, the $SS_{TOT}$ contribution is denoted by the green arrow.
Looking at this, you'll understand that you can interpret R-Squared as "1 minus the proportion of the variance not explained by the model", which amounts to "the proportion of the variation explained by the model". As a result, you'll want to maximize the R-Squared.
For completeness: R-Squared can take values between 0 and 1, where values closer to 0 represent a poor fit and values closer to 1 represent an (almost) perfect fit.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
def calc_slope(xs, ys):
    # least-squares slope from the "means" formula
    m = (((np.mean(xs)*np.mean(ys)) - np.mean(xs*ys)) /
         ((np.mean(xs)**2) - np.mean(xs*xs)))
    return m

def best_fit(xs, ys):
    # slope from calc_slope, intercept from b = mean(y) - m*mean(x)
    m = calc_slope(xs, ys)
    b = np.mean(ys) - m*np.mean(xs)
    return m, b

def reg_line(m, b, X):
    # predicted y value for every x on the fitted line
    return [(m*x)+b for x in X]

def sum_sq_err(ys_real, ys_predicted):
    # sum of squared differences between actual and predicted values
    sse = sum((ys_predicted - ys_real) * (ys_predicted - ys_real))
    return sse

def r_squared(ys_real, ys_predicted):
    # Calculate Y_mean, squared error for regression and mean line, and calculate r-squared
    y_mean = [np.mean(ys_real) for y in ys_real]
    sq_err_reg = sum_sq_err(ys_real, ys_predicted)
    sq_err_y_mean = sum_sq_err(ys_real, y_mean)
    # Calculate r-squared
    r_sq = 1 - (sq_err_reg/sq_err_y_mean)
    return r_sq

def plot_reg(X, Y, Y_pred):
    # scatter the data and overlay the fitted regression line
    plt.scatter(X, Y, color='#003F72', label='data')
    plt.plot(X, Y_pred, label='regression line')
    plt.legend(loc=4)
    plt.show()
    return None
# X and Y should be NumPy arrays holding the predictor and target values
X = predictor
Y = target

m, b = best_fit(X, Y)
Y_pred = reg_line(m, b, X)
r_sq = r_squared(Y, Y_pred)

print('Basic Regression Diagnostics')
print('----------------------------')
print('Slope:', round(m, 2))
print('Y-Intercept:', round(b, 2))
print('R-Squared:', round(r_sq, 2))
print('----------------------------')
print('Model: Y =', round(m, 2), '* X +', round(b, 2))

plot_reg(X, Y, Y_pred)
# make predictions for a new x value and plot
x_new = 4.5
y_new = (m*x_new) + b
y_new  # predicted value (displayed as the cell output in a notebook)

plt.scatter(X, Y, color='#000F72', label='data')
plt.plot(X, Y_pred, color='#880000', label='regression line')
plt.scatter(x_new, y_new, color='r', label='Prediction: ' + str(np.round(y_new, 1)))
plt.legend(loc=4)
plt.show()
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import matplotlib.pyplot as plt
plt.style.use('seaborn')
#visualize distribution of data
df.plot.kde()
plt.title("distribution check for dependent and independent variable")
plt.show()
#formula for simple regression 'target~feature'
f = 'weight~height'
#fit model
model = ols(formula=f, data=df).fit()
model.summary()
#plot error visuals
fig = plt.figure(figsize=(15,8))
fig = sm.graphics.plot_regress_exog(model, "height", fig=fig)
plt.show()
The Jarque-Bera (JB) test is a test for normality. This test is usually used for large data sets, because other checks like Q-Q plots can become unreliable when your sample size is large.
$$JB = n \Bigl(\dfrac{S^2}{6} + \dfrac{(K - 3)^2}{24}\Bigr)$$ The Jarque-Bera test inspects the skewness and kurtosis of the data to see if they match a normal distribution. It is a common method for inspecting the distribution of the errors in a regression, as shown below.
Here, $n$ is the sample size, $S$ is the sample skewness coefficient and $K$ is the sample kurtosis.
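To make the formula concrete, here is a minimal sketch of computing the JB statistic by hand with SciPy, using simulated residuals rather than a real model (note that `kurtosis` is asked for non-excess kurtosis so that $K \approx 3$ for normal data):

import numpy as np
from scipy.stats import skew, kurtosis

np.random.seed(11)
resid = np.random.normal(size=500)   # stand-in for model residuals

n = len(resid)
S = skew(resid)
K = kurtosis(resid, fisher=False)    # Pearson (non-excess) kurtosis: normal data -> roughly 3

JB = n * (S**2 / 6 + (K - 3)**2 / 24)
print('Jarque-Bera statistic:', round(JB, 3))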
Here is how you use the JB test in statsmodels. A JB value of roughly 6 or higher indicates that the errors are not normally distributed; in other words, the normality null hypothesis is rejected at the $5\%$ significance level. A value close to 0, on the contrary, indicates that the data *is* normally distributed. We have already seen the JB test in the output of model.summary(). The code below shows you how to run this test on its own.
# JB test (the standalone test lives in statsmodels' stats module)
import statsmodels.stats.api as sms

name = ['Jarque-Bera', 'Prob', 'Skew', 'Kurtosis']
test = sms.jarque_bera(model.resid)
list(zip(name, test))
The Goldfeld-Quandt (GQ) test is used in regression analysis to check for homoscedasticity in the error terms. The GQ test checks whether you can define a break point that splits the data into groups with different error variances. It is a parametric test and relies on the assumption that the data is normally distributed, so it is general practice to check for normality before moving on to the GQ test!
In the image below, you can see how the observations are split into two groups. Next, a test statistic is computed as the ratio of the mean squared residual errors for the regressions fitted on the two subsets. Evidence of heteroskedasticity is based on performing a hypothesis test on this ratio.
Here is a brief description of how the test works:
Large F values typically indicate that the variances are different. If the error term is homoscedastic, there should be no systematic difference between the residuals of the two groups, and the F value will be small. However, if the standard deviation of the distribution of the error term is proportional to the X variable, one part will generate a higher sum of squares than the other.
The null hypothesis for the GQ test is homoskedasticity. The larger the F-statistic, the more evidence we will have against the homoskedasticity assumption and the more likely we have heteroskedasticity (different variance for the two groups).
The p-value of the test tells us whether or not to reject the null hypothesis of homoscedasticity at a significance level of alpha = 0.05.
# Run the Goldfeld-Quandt test (sms is statsmodels.stats.api, imported above)
# `indices` is assumed to order the observations by the predictor before splitting
name = ['F statistic', 'p-value']
test = sms.het_goldfeldquandt(model.resid.iloc[indices], model.model.exog[indices])
list(zip(name, test))
We can add as many predictors as we like. What is important to note, however, is that visually representing a multiple linear model with more than 2 predictors becomes very difficult, and even impossible! Still, it is done all the time, as linear models of all sorts are extensively used in many fields!
When thinking of lines and slopes statistically, slope parameters associated with a particular predictor $x_i$ are often denoted by $\beta_i$. Extending this example mathematically, you would write a multiple linear regression model as follows:
$$ \hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 +\ldots + \hat\beta_n x_n $$ where $n$ is the number of predictors, $\hat\beta_0$ is the intercept, and $\hat y$ is the so-called "fitted line", i.e. the predicted value of the dependent variable.
statsmodels returns a full results table when .summary() is called on the fitted model:
# X (boston_features) and y come from the Boston housing data prepared earlier
X = boston_features
y = pd.DataFrame(boston.target, columns=["price"])

import statsmodels.api as sm

# add a constant column so that statsmodels fits an intercept
X_int = sm.add_constant(X)
model = sm.OLS(y, X_int).fit()
model.summary()
scikit-learn does not produce a summary table; you need explicit commands to view the results:
from sklearn.linear_model import LinearRegression

# scikit-learn adds the intercept automatically; X_smaller is a subset of the features
linreg = LinearRegression()
linreg.fit(X_smaller, y)
print(linreg.intercept_)
print(linreg.coef_)
Now, instead of completely "deleting" certain predictors from a model (which is equivalent to setting their coefficients equal to zero), wouldn't it be interesting to just make coefficients smaller? So-called penalized estimation operates in a way where parameter shrinkage effects are used to make some or all of the coefficients smaller in magnitude (closer to zero). Some of the penalties have the property of performing both variable selection (setting some coefficients exactly equal to zero) and shrinking the other coefficients. Ridge and Lasso regression are two examples of penalized estimation, and there are multiple advantages to using these methods.
Lasso and Ridge are two commonly used so-called regularization techniques. Regularization is a general term for techniques used to battle overfitting. Regularization techniques will be covered in more depth when we move into machine learning!
In ridge regression, the linear regression cost function is changed by adding a penalty term proportional to the square of the magnitude of the coefficients:
$$ \text{cost}_{\text{ridge}} = \sum_{i=1}^n \Big(y_i - \sum_{j=1}^k m_j x_{ij} - b\Big)^2 + \lambda \sum_{j=1}^k m_j^2$$ Recall that you want to minimize your cost function, so by adding the penalty term $\lambda \sum_j m_j^2$, ridge regression puts a constraint on the coefficients $m_j$: large coefficients penalize the optimization function. That's why ridge regression leads to a shrinkage of the coefficients and helps to reduce model complexity and multicollinearity.
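As a minimal sketch of what this cost function computes (with made-up arrays for the data and candidate coefficients, not a fitted model), the two terms can be evaluated directly with NumPy:

import numpy as np

# hypothetical data: 5 observations, 2 predictors
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 10.5])

m = np.array([1.0, 1.0])   # candidate slope coefficients
b = 0.5                    # candidate intercept
lam = 1.0                  # the penalty strength lambda

residual_term = np.sum((y - (X @ m + b))**2)   # ordinary least squares part
penalty_term = lam * np.sum(m**2)              # ridge (L2) penalty on the slopes
print('ridge cost:', residual_term + penalty_term)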
$\lambda$ is a so-called hyperparameter, which means you have to specify its value yourself. For a small lambda, the outcome of your ridge regression will resemble a plain linear regression model. For a large lambda, penalization will increase and more parameters will shrink towards zero.
Ridge regression is often also referred to as L2 norm regularization.
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error

ridge = Ridge()  # Ridge applies the L2 penalty; the default alpha (lambda) is 1.0
ridge.fit(X_train, y_train)

print('Training r^2:', ridge.score(X_train, y_train))
print('Testing r^2:', ridge.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, ridge.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, ridge.predict(X_test)))
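Since $\lambda$ (called `alpha` in scikit-learn) has to be chosen by you, a quick sketch like the one below (assuming the same `X_train`, `y_train`, `X_test`, and `y_test` splits as above) shows how the fit changes as the penalty grows:

# compare ridge fits for increasing penalty strengths
for alpha in [0.01, 1, 10, 100]:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    print('alpha =', alpha,
          '| train r^2:', round(ridge.score(X_train, y_train), 3),
          '| test r^2:', round(ridge.score(X_test, y_test), 3))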
Lasso regression is very similar to ridge regression, except that the magnitudes of the coefficients are not squared in the penalty term. So, while ridge regression keeps the sum of the squared regression coefficients (except for the intercept) bounded, the lasso method bounds the sum of their absolute values.
$$ \text{cost}_{\text{lasso}} = \sum_{i=1}^n \Big(y_i - \sum_{j=1}^k m_j x_{ij} - b\Big)^2 + \lambda \sum_{j=1}^k \mid m_j \mid$$ The name "Lasso" comes from 'Least Absolute Shrinkage and Selection Operator'.
While this looks similar to the definition of the ridge estimator, the effect of using absolute values is that some coefficients might be set exactly equal to zero, while the other coefficients are shrunk towards zero. Hence the lasso method is attractive because it performs estimation and selection simultaneously, which is especially useful when the number of predictors is very high.
Lasso regression is often also referred to as L1 norm regularization.
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error

lasso = Lasso()  # Lasso applies the L1 penalty; the default alpha (lambda) is 1.0
lasso.fit(X_train, y_train)

print('Training r^2:', lasso.score(X_train, y_train))
print('Testing r^2:', lasso.score(X_test, y_test))
print('Training MSE:', mean_squared_error(y_train, lasso.predict(X_train)))
print('Testing MSE:', mean_squared_error(y_test, lasso.predict(X_test)))
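To see the selection property described above in action, a short sketch like this (again assuming the `X_train` and `y_train` split used above) counts how many coefficients Lasso sets exactly to zero as the penalty grows:

import numpy as np

# larger alphas push more lasso coefficients exactly to zero
for alpha in [0.01, 0.1, 1, 10]:
    lasso = Lasso(alpha=alpha)
    lasso.fit(X_train, y_train)
    n_zero = np.sum(lasso.coef_ == 0)
    print('alpha =', alpha, '| coefficients set to zero:', n_zero, 'out of', lasso.coef_.size)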