Last edited: 2019-10-12
You should have gotten to this point vis this link:
#libraries:
!pip install linearmodels
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col
from linearmodels.iv import IV2SLS
# inline graphics
%matplotlib inline
Requirement already satisfied: linearmodels in /Users/delong/anaconda3/lib/python3.6/site-packages (4.13)
Requirement already satisfied: scipy>=0.18 in /Users/delong/anaconda3/lib/python3.6/site-packages (from linearmodels) (0.19.0)
Requirement already satisfied: mypy-extensions>=0.4 in /Users/delong/anaconda3/lib/python3.6/site-packages (from linearmodels) (0.4.2)
Requirement already satisfied: patsy in /Users/delong/anaconda3/lib/python3.6/site-packages (from linearmodels) (0.4.1)
Requirement already satisfied: statsmodels>=0.8 in /Users/delong/anaconda3/lib/python3.6/site-packages (from linearmodels) (0.8.0)
Requirement already satisfied: pandas>=0.20 in /Users/delong/anaconda3/lib/python3.6/site-packages (from linearmodels) (0.20.1)
Requirement already satisfied: cached-property>=1.5.1 in /Users/delong/anaconda3/lib/python3.6/site-packages (from linearmodels) (1.5.1)
Requirement already satisfied: numpy>=1.13 in /Users/delong/anaconda3/lib/python3.6/site-packages (from linearmodels) (1.17.2)
Requirement already satisfied: six in /Users/delong/anaconda3/lib/python3.6/site-packages (from patsy->linearmodels) (1.10.0)
Requirement already satisfied: python-dateutil>=2 in /Users/delong/anaconda3/lib/python3.6/site-packages (from pandas>=0.20->linearmodels) (2.6.0)
Requirement already satisfied: pytz>=2011k in /Users/delong/anaconda3/lib/python3.6/site-packages (from pandas>=0.20->linearmodels) (2017.2)
WARNING: You are using pip version 19.2.3, however version 19.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
/Users/delong/anaconda3/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead. from pandas.core import datetools
ajr_df = pd.read_csv('https://delong.typepad.com/files/ajr.csv')
ajr_df.head()
shortnam | euro1900 | excolony | avexpr | logpgp95 | cons1 | cons90 | democ00a | cons00a | extmort4 | logem4 | loghjypl | baseco | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AFG | 0.000000 | 1.0 | NaN | NaN | 1.0 | 2.0 | 1.0 | 1.0 | 93.699997 | 4.540098 | NaN | NaN |
1 | AGO | 8.000000 | 1.0 | 5.363636 | 7.770645 | 3.0 | 3.0 | 0.0 | 1.0 | 280.000000 | 5.634789 | -3.411248 | 1.0 |
2 | ARE | 0.000000 | 1.0 | 7.181818 | 9.804219 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | ARG | 60.000004 | 1.0 | 6.386364 | 9.133459 | 1.0 | 6.0 | 3.0 | 3.0 | 68.900002 | 4.232656 | -0.872274 | 1.0 |
4 | ARM | 0.000000 | 0.0 | NaN | 7.682482 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Let’s use a scatterplot to see whether any obvious relationship exists between GDP per capita and the protection against expropriation index:
plt.style.use('seaborn')
ajr_df.plot.scatter(x='avexpr', y='logpgp95')
plt.show()
Let's add three-letter country labels to the points:
x = ajr_df['avexpr']
y = ajr_df['logpgp95']
labels = ajr_df['shortnam']
fig, ax = plt.subplots()
ax.scatter(x, y, marker='.')
for i, txt in enumerate(labels):
ax.annotate(txt, (x.iloc[i], y.iloc[i]))
plt.show()
1 5.363636 3 6.386364 5 9.318182 11 4.454545 12 5.136364 15 7.500000 19 5.636364 20 7.909091 25 9.727273 27 7.818182 29 7.000000 30 6.454545 31 4.681818 32 7.318182 35 7.045455 40 6.181818 41 6.500000 42 6.545455 43 6.772727 47 5.727273 51 7.818182 54 6.272727 55 6.545455 56 8.272727 59 5.136364 60 5.886364 61 8.136364 62 5.318182 64 3.727273 66 7.590909 ... 90 7.090909 92 4.454545 93 7.500000 95 4.000000 96 7.227273 103 7.954545 105 5.000000 106 5.545455 107 5.227273 111 9.727273 113 6.045455 114 5.909091 115 5.772727 121 6.954545 127 4.000000 128 6.000000 129 9.318182 130 5.818182 131 5.000000 141 6.909091 145 7.454545 146 6.454545 149 6.636364 150 4.454545 152 7.000000 153 10.000000 155 7.136364 156 6.409091 159 6.863636 160 3.500000 Name: avexpr, Length: 64, dtype: float64
x
1 5.363636 3 6.386364 5 9.318182 11 4.454545 12 5.136364 15 7.500000 19 5.636364 20 7.909091 25 9.727273 27 7.818182 29 7.000000 30 6.454545 31 4.681818 32 7.318182 35 7.045455 40 6.181818 41 6.500000 42 6.545455 43 6.772727 47 5.727273 51 7.818182 54 6.272727 55 6.545455 56 8.272727 59 5.136364 60 5.886364 61 8.136364 62 5.318182 64 3.727273 66 7.590909 ... 90 7.090909 92 4.454545 93 7.500000 95 4.000000 96 7.227273 103 7.954545 105 5.000000 106 5.545455 107 5.227273 111 9.727273 113 6.045455 114 5.909091 115 5.772727 121 6.954545 127 4.000000 128 6.000000 129 9.318182 130 5.818182 131 5.000000 141 6.909091 145 7.454545 146 6.454545 149 6.636364 150 4.454545 152 7.000000 153 10.000000 155 7.136364 156 6.409091 159 6.863636 160 3.500000 Name: avexpr, Length: 64, dtype: float64
Let's fit a linear model to this scatter:
(1) $ \ln(pgp_95)_i= β_0 + β_1(avexpr_i) + u_i $
Fitting this linear model chooses a straight line that best fits the data in a least-squares, as in the following plot (Figure 2 in AJR):
# dropping NA's is required to use numpy's polyfit...
# using only 'base sample' for plotting purposes...
ajr_df = ajr_df.dropna(subset=['logpgp95', 'avexpr'])
ajr_df = ajr_df[ajr_df['baseco'] == 1]
x = ajr_df['avexpr'].tolist()
y = ajr_df['logpgp95'].tolist()
labels = ajr_df['shortnam'].tolist()
fig, ax = plt.subplots()
ax.scatter(x, y, marker='.')
for i, txt in enumerate(labels):
ax.annotate(txt, (x[i], y[i]))
ax.plot(np.unique(x),
np.poly1d(np.polyfit(x, y, 1))(np.unique(x)),
color='black')
ax.set_xlabel('Inverse Expropriation Risk Classification, 1985-95')
ax.set_ylabel('Log GDP per capita 1995 (PPP)')
ax.set_title('Figure 2: OLS Relationship: Prosperity and "Property Security Institutions"')
plt.show()
To estimate the constant term $ β_0 $, we need to add a column of 1’s to our dataframe so that we can use statsmodels's OLS routines:
ajr_df['constant'] = 1
regression_1 = sm.OLS(endog=ajr_df['logpgp95'],
exog=ajr_df[['constant', 'avexpr']],
missing='drop')
results_1 = regression_1.fit()
print(results_1.summary())
OLS Regression Results ============================================================================== Dep. Variable: logpgp95 R-squared: 0.540 Model: OLS Adj. R-squared: 0.533 Method: Least Squares F-statistic: 72.82 Date: Wed, 16 Oct 2019 Prob (F-statistic): 4.72e-12 Time: 11:23:22 Log-Likelihood: -68.168 No. Observations: 64 AIC: 140.3 Df Residuals: 62 BIC: 144.7 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ constant 4.6604 0.409 11.408 0.000 3.844 5.477 avexpr 0.5221 0.061 8.533 0.000 0.400 0.644 ============================================================================== Omnibus: 7.098 Durbin-Watson: 2.064 Prob(Omnibus): 0.029 Jarque-Bera (JB): 6.657 Skew: -0.781 Prob(JB): 0.0358 Kurtosis: 3.234 Cond. No. 31.2 ============================================================================== Warnings: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We extend our bivariate regression model to a multivariate regression model by adding in other factors correlated with $ \ln(pgp_95)_i $:
latitude is used to proxy this differences that affect both economic performance and institutions, eg. cultural, historical, etc.; controlled for with the use of continent dummies Let’s estimate some of the extended models considered in the paper (Table 2) using data from
ajr2_df = pd.read_csv('https://delong.typepad.com/files/ajr2.csv')
ajr2_df['constant'] = 1
X1 = ['constant', 'avexpr']
X2 = ['constant', 'avexpr', 'lat_abst']
X3 = ['constant', 'avexpr', 'lat_abst', 'asia', 'africa', 'other']
regression_2 = sm.OLS(ajr2_df['logpgp95'], ajr2_df[X1], missing='drop').fit()
regression_3 = sm.OLS(ajr2_df['logpgp95'], ajr2_df[X2], missing='drop').fit()
regression_4 = sm.OLS(ajr2_df['logpgp95'], ajr2_df[X3], missing='drop').fit()
info_dict={'R-squared' : lambda x: f"{x.rsquared:.2f}",
'No. observations' : lambda x: f"{int(x.nobs):d}"}
results_table = summary_col(results=[regression_2, regression_3, regression_4],
float_format='%0.2f',
stars = True,
model_names=['Model 1',
'Model 3',
'Model 4'],
info_dict=info_dict,
regressor_order=['constant',
'avexpr',
'lat_abst',
'asia',
'africa'])
results_table.add_title('Table 2 - OLS Regressions')
print(results_table)
Table 2 - OLS Regressions ========================================= Model 1 Model 3 Model 4 ----------------------------------------- constant 4.63*** 4.87*** 5.85*** (0.30) (0.33) (0.34) avexpr 0.53*** 0.46*** 0.39*** (0.04) (0.06) (0.05) lat_abst 0.87* 0.33 (0.49) (0.45) asia -0.15 (0.15) africa -0.92*** (0.17) other 0.30 (0.37) R-squared 0.61 0.62 0.72 No. observations 111 111 111 ========================================= Standard errors in parentheses. * p<.1, ** p<.05, ***p<.01
# Dropping NA's is required to use numpy's polyfit
df1_subset2 = ajr_df.dropna(subset=['logem4', 'avexpr'])
X = df1_subset2['logem4']
y = df1_subset2['avexpr']
labels = df1_subset2['shortnam']
# Replace markers with country labels
fig, ax = plt.subplots()
ax.scatter(X, y, marker='')
for i, label in enumerate(labels):
ax.annotate(label, (X.iloc[i], y.iloc[i]))
# Fit a linear trend line
ax.plot(np.unique(X),
np.poly1d(np.polyfit(X, y, 1))(np.unique(X)),
color='black')
ax.set_xlim([1.8,8.4])
ax.set_ylim([3.3,10.4])
ax.set_xlabel('Log of Settler Mortality')
ax.set_ylabel('Average Expropriation Risk 1985-95')
ax.set_title('Figure 3: First-stage relationship between settler mortality \
and expropriation risk')
plt.show()
df4 = pd.read_stata('https://github.com/QuantEcon/QuantEcon.lectures.code/raw/master/ols/maketable4.dta')
df4 = df4[df4['baseco'] == 1]
df4['const'] = 1
iv = IV2SLS(dependent=df4['logpgp95'],
exog=df4['const'],
endog=df4['avexpr'],
instruments=df4['logem4']).fit(cov_type='unadjusted')
print(iv.summary)
IV-2SLS Estimation Summary ============================================================================== Dep. Variable: logpgp95 R-squared: 0.1870 Estimator: IV-2SLS Adj. R-squared: 0.1739 No. Observations: 64 F-statistic: 37.568 Date: Wed, Oct 16 2019 P-value (F-stat) 0.0000 Time: 11:23:24 Distribution: chi2(1) Cov. Estimator: unadjusted Parameter Estimates ============================================================================== Parameter Std. Err. T-stat P-value Lower CI Upper CI ------------------------------------------------------------------------------ const 1.9097 1.0106 1.8897 0.0588 -0.0710 3.8903 avexpr 0.9443 0.1541 6.1293 0.0000 0.6423 1.2462 ============================================================================== Endogenous: avexpr Instruments: logem4 Unadjusted Covariance (Homoskedastic) Debiased: False
pwt91_df = pd.read_csv('https://delong.typepad.com/files/pwt91-data.csv')
pwt91_df.head()
countrycode | country | currency_unit | year | rgdpe | rgdpo | pop | emp | avh | hc | ... | csh_x | csh_m | csh_r | pl_c | pl_i | pl_g | pl_x | pl_m | pl_n | pl_k | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ABW | Aruba | Aruban Guilder | 1950 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | ABW | Aruba | Aruban Guilder | 1951 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | ABW | Aruba | Aruban Guilder | 1952 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | ABW | Aruba | Aruban Guilder | 1953 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | ABW | Aruba | Aruban Guilder | 1954 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 52 columns