Last edited: 2019-10-20
A large portion of this lecture is taken from QuantEcon (https://quantecon.org). A more detailed version of the lecture is available at https://python.quantecon.org/ols.html
Linear regression is a standard tool for analyzing the relationship between two or more variables.
In this lecture, we’ll use the Python package statsmodels
to estimate, interpret, and visualize linear regression models.
Along the way, we’ll discuss a variety of topics, including

- simple and multivariate linear regression
- visualization
- endogeneity and omitted variable bias
- two-stage least squares
As an example, we will replicate results from Acemoglu, Johnson and Robinson’s seminal paper [AJR01].
You can download a copy here.
In the paper, the authors emphasize the importance of institutions in economic development. The main contribution is the use of settler mortality rates as a source of exogenous variation in institutional differences.
Let’s start with some imports:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import pandas as pd
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col
from linearmodels.iv import IV2SLS
In addition to what’s in Anaconda, this lecture will need the following libraries:
!pip install linearmodels
[AJR01] wish to determine whether or not differences in institutions can help to explain observed economic outcomes. How do we measure institutional differences and economic outcomes?
In this paper:

- economic outcomes are proxied by log GDP per capita in 1995
- institutional differences are proxied by an index of protection against expropriation averaged over 1985-95
These variables and other data used in the paper are available for download on Daron Acemoglu’s webpage.
We will use pandas’ .read_stata() function to read in the data contained in the .dta files as dataframes:
df1 = pd.read_stata('https://github.com/QuantEcon/QuantEcon.lectures.code/raw/master/ols/maketable1.dta')
df1.head()
|   | shortnam | euro1900 | excolony | avexpr | logpgp95 | cons1 | cons90 | democ00a | cons00a | extmort4 | logem4 | loghjypl | baseco |
|---|----------|----------|----------|--------|----------|-------|--------|----------|---------|----------|--------|----------|--------|
| 0 | AFG | 0.000000 | 1.0 | NaN | NaN | 1.0 | 2.0 | 1.0 | 1.0 | 93.699997 | 4.540098 | NaN | NaN |
| 1 | AGO | 8.000000 | 1.0 | 5.363636 | 7.770645 | 3.0 | 3.0 | 0.0 | 1.0 | 280.000000 | 5.634789 | -3.411248 | 1.0 |
| 2 | ARE | 0.000000 | 1.0 | 7.181818 | 9.804219 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | ARG | 60.000004 | 1.0 | 6.386364 | 9.133459 | 1.0 | 6.0 | 3.0 | 3.0 | 68.900002 | 4.232656 | -0.872274 | 1.0 |
| 4 | ARM | 0.000000 | 0.0 | NaN | 7.682482 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Let’s use a scatterplot to see whether any obvious relationship exists between GDP per capita and the protection against expropriation index.
plt.style.use('seaborn')
df1.plot(x='avexpr', y='logpgp95', kind='scatter')
plt.show()
The plot shows a fairly strong positive relationship between protection against expropriation and log GDP per capita. Specifically, if higher protection against expropriation is a measure of institutional quality, then better institutions appear to be positively correlated with better economic outcomes (higher GDP per capita).
Given the plot, choosing a linear model to describe this relationship seems like a reasonable assumption. We can write our model as

$$ {logpgp95}_i = \beta_0 + \beta_1 {avexpr}_i + u_i $$

where:

- $ \beta_0 $ is the intercept of the linear trend line on the y-axis
- $ \beta_1 $ is the slope of the linear trend line, representing the marginal effect of protection against expropriation risk on log GDP per capita
- $ u_i $ is a random error term (deviations of observations from the linear trend due to factors not included in the model)
# Dropping NA's is required to use numpy's polyfit
df1_subset = df1.dropna(subset=['logpgp95', 'avexpr'])
# Use only 'base sample' for plotting purposes
df1_subset = df1_subset[df1_subset['baseco'] == 1]
X = df1_subset['avexpr']
y = df1_subset['logpgp95']
labels = df1_subset['shortnam']
# Replace markers with country labels
fig, ax = plt.subplots()
ax.scatter(X, y, marker='')
for i, label in enumerate(labels):
ax.annotate(label, (X.iloc[i], y.iloc[i]))
# Fit a linear trend line
ax.plot(np.unique(X),
np.poly1d(np.polyfit(X, y, 1))(np.unique(X)),
color='black')
ax.set_xlim([3.3,10.5])
ax.set_ylim([4,10.5])
ax.set_xlabel('Average Expropriation Risk 1985-95')
ax.set_ylabel('Log GDP per capita, PPP, 1995')
ax.set_title('Figure 2: OLS relationship between expropriation \
risk and income')
plt.show()
The most common technique to estimate the parameters ($ \beta $’s) of the linear model is Ordinary Least Squares (OLS). As the name implies, an OLS model is solved by finding the parameters that minimize the sum of squared residuals, i.e. $$ \underset{\hat{\beta}}{\min} \sum^N_{i=1}{\hat{u}^2_i} $$ where $ \hat{u}_i $ is the difference between the observation and the predicted value of the dependent variable.
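The mechanics can be seen directly with a minimal sketch (using made-up data, not the AJR sample): compute the OLS solution in closed form and check that perturbing the coefficients only increases the sum of squared residuals.

```python
import numpy as np

# Toy data (made up for illustration, not from the paper)
rng = np.random.default_rng(0)
x = rng.uniform(3, 10, size=50)
y = 4.6 + 0.53 * x + rng.normal(scale=0.5, size=50)

# Closed-form OLS: beta_hat = (X'X)^{-1} X'y
X = np.column_stack([np.ones_like(x), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

def ssr(b):
    """Sum of squared residuals for coefficient vector b."""
    resid = y - X @ b
    return resid @ resid

# Moving away from beta_hat in any direction increases the SSR
assert ssr(beta_hat) <= ssr(beta_hat + np.array([0.1, 0.0]))
assert ssr(beta_hat) <= ssr(beta_hat + np.array([0.0, 0.1]))
```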
To estimate the constant term $ \beta_0 $, we need to add a column of 1’s to our dataset (consider the equation if $ \beta_0 $ was replaced with $ \beta_0 x_i $ and $ x_i = 1 $)
df1['const'] = 1
Now we can construct our model in statsmodels using the OLS function. We will use pandas dataframes with statsmodels; however, standard arrays can also be used as arguments:
reg1 = sm.OLS(endog=df1['logpgp95'], exog=df1[['const', 'avexpr']], missing='drop')
type(reg1)
statsmodels.regression.linear_model.OLS
So far, we have simply constructed our model. We need to use .fit() to obtain the parameter estimates $ \hat{\beta}_0 $ and $ \hat{\beta}_1 $:
results = reg1.fit()
type(results)
statsmodels.regression.linear_model.RegressionResultsWrapper
We now have the fitted regression model stored in results. To view the OLS regression results, we can call the .summary() method.
Note that an observation was mistakenly dropped from the results in the original paper (see the note located in maketable2.do from Acemoglu’s webpage), and thus the coefficients differ slightly.
print(results.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:               logpgp95   R-squared:                       0.611
Model:                            OLS   Adj. R-squared:                  0.608
Method:                 Least Squares   F-statistic:                     171.4
Date:                Sun, 20 Oct 2019   Prob (F-statistic):           4.16e-24
Time:                        22:13:25   Log-Likelihood:                -119.71
No. Observations:                 111   AIC:                             243.4
Df Residuals:                     109   BIC:                             248.8
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.6261      0.301     15.391      0.000       4.030       5.222
avexpr         0.5319      0.041     13.093      0.000       0.451       0.612
==============================================================================
Omnibus:                        9.251   Durbin-Watson:                   1.689
Prob(Omnibus):                  0.010   Jarque-Bera (JB):                9.170
Skew:                          -0.680   Prob(JB):                       0.0102
Kurtosis:                       3.362   Cond. No.                         33.2
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
From our results, we see that:

- the intercept $ \hat{\beta}_0 = 4.63 $
- the slope $ \hat{\beta}_1 = 0.53 $
- the positive $ \hat{\beta}_1 $ estimate implies that institutional quality has a positive effect on economic outcomes, as we saw in the figure
- the p-value of 0.000 for $ \hat{\beta}_1 $ implies that the effect of institutions on GDP is statistically significant (using p < 0.05 as a rejection rule)
- the R-squared value of 0.611 indicates that around 61% of the variation in log GDP per capita is explained by protection against expropriation
Using our parameter estimates, we can now write our estimated relationship as

$$ \widehat{logpgp95}_i = 4.63 + 0.53 \ {avexpr}_i $$

This equation describes the line that best fits our data, as shown in Figure 2.
We can use this equation to predict the level of log GDP per capita for
a value of the index of expropriation protection.
For example, for a country with an index value of 6.52 (the average over our subset of the
dataset), we find that its predicted level of log GDP per capita in 1995 is about 8.08.
mean_expr = np.mean(df1_subset['avexpr'])
mean_expr
6.515625
predicted_logpdp95 = 4.63 + 0.53 * 6.52
predicted_logpdp95
8.0856
An easier (and more accurate) way to obtain this result is to use .predict() and set $ constant = 1 $ and $ {avexpr}_i = mean\_expr $:
results.predict(exog=[1, mean_expr])
array([8.09156367])
We can obtain an array of predicted $ {logpgp95}_i $ for every value
of $ {avexpr}_i $ in our dataset by calling .predict()
on our
results.
Plotting the predicted values against $ {avexpr}_i $ shows that the
predicted values lie along the linear line that we fitted above.
The observed values of $ {logpgp95}_i $ are also plotted for
comparison purposes.
# Drop missing observations from whole sample
df1_plot = df1.dropna(subset=['logpgp95', 'avexpr'])
# Plot predicted values
fig, ax = plt.subplots()
ax.scatter(df1_plot['avexpr'], results.predict(), alpha=0.5,
label='predicted')
# Plot observed values
ax.scatter(df1_plot['avexpr'], df1_plot['logpgp95'], alpha=0.5,
label='observed')
ax.legend()
ax.set_title('OLS predicted values')
ax.set_xlabel('avexpr')
ax.set_ylabel('logpgp95')
plt.show()
So far we have only accounted for institutions affecting economic
performance - almost certainly there are numerous other factors
affecting GDP that are not included in our model.
Leaving out variables that affect $ logpgp95_i $ will result in omitted variable bias, yielding biased and inconsistent parameter estimates.
We can extend our bivariate regression model to a multivariate regression model by adding in other factors that may affect $ logpgp95_i $.
[AJR01] consider other factors such as:

- the effect of climate on economic outcomes, proxied by latitude
- differences that affect both economic performance and institutions (e.g. cultural or historical factors), controlled for with the use of continent dummies
Let’s estimate some of the extended models considered in the paper
(Table 2) using data from maketable2.dta
df2 = pd.read_stata('https://github.com/QuantEcon/QuantEcon.lectures.code/raw/master/ols/maketable2.dta')
# Add constant term to dataset
df2['const'] = 1
# Create lists of variables to be used in each regression
X1 = ['const', 'avexpr']
X2 = ['const', 'avexpr', 'lat_abst']
X3 = ['const', 'avexpr', 'lat_abst', 'asia', 'africa', 'other']
# Estimate an OLS regression for each set of variables
reg1 = sm.OLS(df2['logpgp95'], df2[X1], missing='drop').fit()
reg2 = sm.OLS(df2['logpgp95'], df2[X2], missing='drop').fit()
reg3 = sm.OLS(df2['logpgp95'], df2[X3], missing='drop').fit()
Now that we have fitted our models, we will use summary_col to display the results in a single table (model numbers correspond to those in the paper).
info_dict={'R-squared' : lambda x: f"{x.rsquared:.2f}",
'No. observations' : lambda x: f"{int(x.nobs):d}"}
results_table = summary_col(results=[reg1,reg2,reg3],
float_format='%0.2f',
stars = True,
model_names=['Model 1',
'Model 3',
'Model 4'],
info_dict=info_dict,
regressor_order=['const',
'avexpr',
'lat_abst',
'asia',
'africa'])
results_table.add_title('Table 2 - OLS Regressions')
print(results_table)
        Table 2 - OLS Regressions
=========================================
                 Model 1 Model 3 Model 4
-----------------------------------------
const            4.63*** 4.87*** 5.85***
                 (0.30)  (0.33)  (0.34)
avexpr           0.53*** 0.46*** 0.39***
                 (0.04)  (0.06)  (0.05)
lat_abst                 0.87*   0.33
                         (0.49)  (0.45)
asia                             -0.15
                                 (0.15)
africa                           -0.92***
                                 (0.17)
other                            0.30
                                 (0.37)
R-squared        0.61    0.62    0.72
No. observations 111     111     111
=========================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
As [AJR01] discuss, the OLS models likely suffer from
endogeneity issues, resulting in biased and inconsistent model
estimates.
Namely, there is likely a two-way relationship between institutions and economic outcomes:

- richer countries may be able to afford or prefer better institutions
- variables that affect income may also be correlated with institutional differences
To deal with endogeneity, we can use two-stage least squares (2SLS) regression, which is an extension of OLS regression.
This method requires replacing the endogenous variable $ {avexpr}_i $ with a variable that is:

1. correlated with $ {avexpr}_i $
2. not correlated with the error term (i.e. it should not directly affect the dependent variable, otherwise it would be correlated with $ u_i $ due to omitted variable bias)
The new set of regressors is called an instrument, which aims to
remove endogeneity in our proxy of institutional differences.
The main contribution of [AJR01] is the use of settler mortality rates to instrument for institutional differences. They hypothesize that higher mortality rates of colonizers led to the establishment of institutions that were more extractive in nature (less protection against expropriation), and these institutions still persist today.
Using a scatterplot (Figure 3 in [AJR01]), we can see protection against expropriation is negatively correlated with settler mortality rates, coinciding with the authors’ hypothesis and satisfying the first condition of a valid instrument.
# Dropping NA's is required to use numpy's polyfit
df1_subset2 = df1.dropna(subset=['logem4', 'avexpr'])
X = df1_subset2['logem4']
y = df1_subset2['avexpr']
labels = df1_subset2['shortnam']
# Replace markers with country labels
fig, ax = plt.subplots()
ax.scatter(X, y, marker='')
for i, label in enumerate(labels):
ax.annotate(label, (X.iloc[i], y.iloc[i]))
# Fit a linear trend line
ax.plot(np.unique(X),
np.poly1d(np.polyfit(X, y, 1))(np.unique(X)),
color='black')
ax.set_xlim([1.8,8.4])
ax.set_ylim([3.3,10.4])
ax.set_xlabel('Log of Settler Mortality')
ax.set_ylabel('Average Expropriation Risk 1985-95')
ax.set_title('Figure 3: First-stage relationship between settler mortality \
and expropriation risk')
plt.show()
The second condition may not be satisfied if settler mortality rates in the 17th to 19th centuries have a direct effect on current GDP (in addition to their indirect effect through institutions).
For example, settler mortality rates may be related to the current disease environment in a country, which could affect current economic performance.
[AJR01] argue this is unlikely because:

- the majority of settler deaths were due to malaria and yellow fever, which had a limited effect on local people
- the disease burden on local people in Africa or India, for example, did not appear to be higher than average, as supported by the relatively high population densities in these areas before colonization
As we appear to have a valid instrument, we can use 2SLS regression to obtain consistent and unbiased parameter estimates.
First stage
The first stage involves regressing the endogenous variable
($ {avexpr}_i $) on the instrument. The instrument is the set of all exogenous variables in our model (and
not just the variable we have replaced). Using model 1 as an example, our instrument is simply a constant and
settler mortality rates $ {logem4}_i $.
Therefore, we will estimate the first-stage regression as
$$
{avexpr}_i = \delta_0 + \delta_1 {logem4}_i + v_i
$$
The data we need to estimate this equation is located in maketable4.dta (only complete data, indicated by baseco = 1, is used for estimation).
# Import and select the data
df4 = pd.read_stata('https://github.com/QuantEcon/QuantEcon.lectures.code/raw/master/ols/maketable4.dta')
df4 = df4[df4['baseco'] == 1]
# Add a constant variable
df4['const'] = 1
# Fit the first stage regression and print summary
results_fs = sm.OLS(df4['avexpr'],
df4[['const', 'logem4']],
missing='drop').fit()
print(results_fs.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 avexpr   R-squared:                       0.270
Model:                            OLS   Adj. R-squared:                  0.258
Method:                 Least Squares   F-statistic:                     22.95
Date:                Sun, 20 Oct 2019   Prob (F-statistic):           1.08e-05
Time:                        22:30:20   Log-Likelihood:                -104.83
No. Observations:                  64   AIC:                             213.7
Df Residuals:                      62   BIC:                             218.0
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          9.3414      0.611     15.296      0.000       8.121      10.562
logem4        -0.6068      0.127     -4.790      0.000      -0.860      -0.354
==============================================================================
Omnibus:                        0.035   Durbin-Watson:                   2.003
Prob(Omnibus):                  0.983   Jarque-Bera (JB):                0.172
Skew:                           0.045   Prob(JB):                        0.918
Kurtosis:                       2.763   Cond. No.                         19.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Second stage
We need to retrieve the predicted values of $ {avexpr}_i $ using .predict().
We then replace the endogenous variable $ {avexpr}_i $ with the
predicted values $ \widehat{avexpr}_i $ in the original linear model.
Our second stage regression is thus
$$
{logpgp95}_i = \beta_0 + \beta_1 \widehat{avexpr}_i + u_i
$$
df4['predicted_avexpr'] = results_fs.predict()
results_ss = sm.OLS(df4['logpgp95'],
df4[['const', 'predicted_avexpr']]).fit()
print(results_ss.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:               logpgp95   R-squared:                       0.477
Model:                            OLS   Adj. R-squared:                  0.469
Method:                 Least Squares   F-statistic:                     56.60
Date:                Sun, 20 Oct 2019   Prob (F-statistic):           2.66e-10
Time:                        22:31:07   Log-Likelihood:                -72.268
No. Observations:                  64   AIC:                             148.5
Df Residuals:                      62   BIC:                             152.9
Df Model:                           1
Covariance Type:            nonrobust
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const                1.9097      0.823      2.320      0.024       0.264       3.555
predicted_avexpr     0.9443      0.126      7.523      0.000       0.693       1.195
==============================================================================
Omnibus:                       10.547   Durbin-Watson:                   2.137
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               11.010
Skew:                          -0.790   Prob(JB):                      0.00407
Kurtosis:                       4.277   Cond. No.                         58.1
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The second-stage regression results give us an unbiased and consistent estimate of the effect of institutions on economic outcomes. The result suggests a stronger positive relationship than what the OLS results indicated. Note that while our parameter estimates are correct, our standard errors are not; for this reason, computing 2SLS 'manually' (in stages with OLS) is not recommended.
We can correctly estimate a 2SLS regression in one step using the linearmodels package, an extension of statsmodels.
Note that when using IV2SLS, the exogenous and instrument variables are split up in the function arguments (whereas before, the instrument included exogenous variables).
iv = IV2SLS(dependent=df4['logpgp95'],
exog=df4['const'],
endog=df4['avexpr'],
instruments=df4['logem4']).fit(cov_type='unadjusted')
print(iv.summary)
                          IV-2SLS Estimation Summary
==============================================================================
Dep. Variable:               logpgp95   R-squared:                      0.1870
Estimator:                    IV-2SLS   Adj. R-squared:                 0.1739
No. Observations:                  64   F-statistic:                    37.568
Date:                Sun, Oct 20 2019   P-value (F-stat)                0.0000
Time:                        22:32:21   Distribution:                  chi2(1)
Cov. Estimator:            unadjusted

                             Parameter Estimates
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          1.9097     1.0106     1.8897     0.0588     -0.0710      3.8903
avexpr         0.9443     0.1541     6.1293     0.0000      0.6423      1.2462
==============================================================================

Endogenous: avexpr
Instruments: logem4
Unadjusted Covariance (Homoskedastic)
Debiased: False
Given that we now have consistent and unbiased estimates, we can infer from the model we have estimated that institutional differences (stemming from institutions set up during colonization) can help to explain differences in income levels across countries today.
[AJR01] use a marginal effect of 0.94 to calculate that the difference in the index between Chile and Nigeria (i.e. institutional quality) implies up to a 7-fold difference in income, emphasizing the significance of institutions in economic development.
Earlier in the lecture, we argued that the original model likely suffers from endogeneity bias due to the effect income may have on institutional development.
Although endogeneity is often best identified by thinking about the data
and model, we can formally test for endogeneity using the Hausman
test.
We want to test for correlation between the endogenous variable, $ avexpr_i $, and the errors, $ u_i $. The test runs in two stages.
First, we regress $ avexpr_i $ on the instrument, $ logem4_i $
Second, we retrieve the residuals $ \hat{\upsilon}_i $ and include them in the original equation
$$ logpgp95_i = \beta_0 + \beta_1 avexpr_i + \alpha \hat{\upsilon}_i + u_i $$

If $ \alpha $ is statistically significant (with a p-value < 0.05), then we reject the null hypothesis and conclude that $ avexpr_i $ is endogenous.
Using the above information, estimate a Hausman test and interpret your
results.
The OLS parameter vector $ \beta $ can also be estimated using matrix algebra and numpy.
The linear equation we want to estimate is (written in matrix form)

$$ y = X\beta + u $$
To solve for the unknown parameter $ \beta $, we want to minimize the sum of squared residuals
$$ \underset{\hat{\beta}}{\min} \hat{u}'\hat{u} $$

Rearranging the first equation and substituting into the second equation, we can write

$$ \underset{\hat{\beta}}{\min} \ (y - X\hat{\beta})' (y - X\hat{\beta}) $$

Solving this optimization problem gives the solution for the $ \hat{\beta} $ coefficients

$$ \hat{\beta} = (X'X)^{-1}X'y $$

Using the above information, compute $ \hat{\beta} $ from model 1
using numpy; your results should be the same as those in the statsmodels output from earlier in the lecture.
# Load in data
df4 = pd.read_stata('https://github.com/QuantEcon/QuantEcon.lectures.code/raw/master/ols/maketable4.dta')
# Add a constant term
df4['const'] = 1
# Estimate the first stage regression
reg1 = sm.OLS(endog=df4['avexpr'],
exog=df4[['const', 'logem4']],
missing='drop').fit()
# Retrieve the residuals
df4['resid'] = reg1.resid
# Estimate the second stage regression
reg2 = sm.OLS(endog=df4['logpgp95'],
exog=df4[['const', 'avexpr', 'resid']],
missing='drop').fit()
print(reg2.summary())
The output shows that the coefficient on the residuals is statistically significant, indicating $ avexpr_i $ is endogenous.
# Load in data
df1 = pd.read_stata('https://github.com/QuantEcon/QuantEcon.lectures.code/raw/master/ols/maketable1.dta')
df1 = df1.dropna(subset=['logpgp95', 'avexpr'])
# Add a constant term
df1['const'] = 1
# Define the X and y variables
y = np.asarray(df1['logpgp95'])
X = np.asarray(df1[['const', 'avexpr']])
# Compute β_hat
β_hat = np.linalg.solve(X.T @ X, X.T @ y)
# Print out the results from the 2 x 1 vector β_hat
print(f'β_0 = {β_hat[0]:.2}')
print(f'β_1 = {β_hat[1]:.2}')
β_0 = 4.6
β_1 = 0.53
It is also possible to use np.linalg.inv(X.T @ X) @ X.T @ y to solve for $ \beta $; however, .solve() is preferred, as it is faster and more numerically stable than explicitly computing the matrix inverse.
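For least-squares problems specifically, numpy also provides np.linalg.lstsq, which works directly on X (avoiding the formation of X'X) and handles rank-deficient design matrices gracefully. A quick sketch with toy data (made up for illustration):

```python
import numpy as np

# Toy design matrix (constant plus one regressor) and response
X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 5.0])])
y = np.array([4.9, 5.6, 6.1, 6.5, 7.2])

# Normal equations via solve, as in the exercise above
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)

# lstsq operates on X and y directly
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_solve, beta_lstsq))  # True
```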