#!/usr/bin/env python
# coding: utf-8
# In[24]:
import pandas as pd
import statsmodels.formula.api as smf
import numpy as np
import matplotlib.pyplot as plt
# # What Is Panel Data?
# **Panel data** is a hybrid data type that has feature of both _cross section_ and _time series_. Actually panel data are the most common data type in industry, for instance a car manufacturer has record of its suppliers' price level over time, a bank has full history of its clients' monthly balance for many years. Needless to say, to carry out serious researches, you must use panel data.
#
#
# Here we will use the data from "Why has Productivity Declined? Productivity and Public Investment" written by Munell, A.
#
# Variable names defined as below:
# ```
# STATE = state name
# ST_ABB = state abbreviation
# YR = 1970,...,1986
# P_CAP = public capital
# HWY = highway capital
# WATER = water utility capital
# UTIL = utility capital
# PC = private capital
# GSP = gross state product
# EMP = employment
# UNEMP = unemployment rate
# ```
# In[8]:
df = pd.read_excel(
"Basic_Econometrics_practice_data.xlsx", sheet_name="Prod_PubInvestment"
)
df.head(5)
# In[9]:
df.tail(5)
# Each state is recorded over time in several aspects, such as public capitals, highway capital, water facility capital and etc. If each state is recorded in equal length of time period, we call it **balanced panel**, otherwise **unbalanced panel**.
# Estimation methods includes four approaches
# 1. Pooled OLS model
# 2. Fixed effects least square dummy variable (LSDV) model
# 3. Fixed effects within-in group model
# 4. Random effects model
# # Pooled OLS Regression
# \begin{aligned}
# ln{GSP}_{i t} &=\beta_{1}+\beta_{2} \ln{PCAP}_{i t}+\beta_{3} \ln{HWY}_{i t}+\beta_{4} \ln{WATER}_{i t}+\beta_{5} \ln{UTIL}_{i t}+\beta_{6} \ln{EMP}_{i t}+u_{i t}
# \end{aligned}
# where $i$ means the $i$the state, $t$ means time period.
# In[23]:
model = smf.ols(
formula="np.log(GSP) ~ np.log(P_CAP) + np.log(PC) + np.log(HWY) + np.log(WATER) + np.log(UTIL) + np.log(EMP)",
data=df,
)
results = model.fit()
print(results.summary())
# The common symptoms of pooled regression on panel data is that all most of coefficients will be highly significant and also $R^2$ is exceedingly high. However, we can still spot some problems, the conditional number is high, meaning multicollinearity and Durbin-Watson test is close to $0$ meaning autocorrelation or specification error.
# But the most prominent issue of this model is that it camouflages the heterogeneity that may exist among states. The heterogeneity of each state is subsumed by the disturbance term, which causes correlation between independent variables and disturbance terms, therefore OLS estimates are bound to be biased and inconsistent.
# # The Fixed Effect LSDV Model
# LSDV model allows heterogeneity to take part in by adding different intercept value
# \begin{aligned}
# ln{GSP}_{i t} &=\beta_{1i}+\beta_{2} \ln{PCAP}_{i t}+\beta_{3} \ln{HWY}_{i t}+\beta_{4} \ln{WATER}_{i t}+\beta_{5} \ln{UTIL}_{i t}+\beta_{6} \ln{EMP}_{i t}+u_{i t}
# \end{aligned}
# $\beta_{1i}$ represents the intercept for each state $i$. There are various possible reasons contributing to heterogeneity among states, such as population, average education level and urbanization rate, etc.
#
# _Fixed effect_ means that though each state has its own intercept, but it is **time-invariant**, i.e. constant over the time. If we assume **time-variant** intercept, the notation would be $\beta_{1it}$
# In[25]:
df
# In[44]:
fig, ax = plt.subplots(nrows=2, ncols=3, figsize=(18, 12))
ax[0, 0].scatter(df["GSP"], df["P_CAP"], c="r", s=5)
ax[0, 0].grid()
ax[0, 0].set_xlabel("Public Capital")
ax[0, 0].set_ylabel("Gross Regional Produce")
ax[0, 1].scatter(df["GSP"], df["HWY"], c="r", s=5)
ax[0, 1].grid()
ax[0, 1].set_xlabel("High Way Capital")
ax[0, 1].set_ylabel("Gross Regional Produce")
ax[0, 2].scatter(df["GSP"], df["WATER"], c="r", s=5)
ax[0, 2].grid()
ax[0, 2].set_xlabel("Water Facility")
ax[0, 2].set_ylabel("Gross Regional Produce")
ax[1, 0].scatter(df["GSP"], df["UTIL"], c="r", s=5)
ax[1, 0].grid()
ax[1, 0].set_xlabel("Utiltiy Capital")
ax[1, 0].set_ylabel("Gross Regional Produce")
ax[1, 1].scatter(df["GSP"], df["PC"], c="r", s=5)
ax[1, 1].grid()
ax[1, 1].set_xlabel("Private Capital")
ax[1, 1].set_ylabel("Gross Regional Produce")
ax[1, 2].scatter(df["GSP"], df["EMP"], c="r", s=5)
ax[1, 2].grid()
ax[1, 2].set_xlabel("Employement")
ax[1, 2].set_ylabel("Gross Regional Produce")
plt.show()
# Check how many states are there in the panel data
# In[59]:
print(df["STATE"].unique())
print(len(df["STATE"].unique()))
# To avoid dummy variable trap, we can define $47$ dummy intercepts.
# Add dummies onto the intercept
#
# \begin{aligned}
# ln{GSP}_{i t} &=\alpha_{1}+ \sum_{j=2}^{48}\alpha_{j} D_{j i}+\beta_{2} \ln{PCAP}_{i t}+\beta_{3} \ln{HWY}_{i t}+\beta_{4} \ln{WATER}_{i t}+\beta_{5} \ln{UTIL}_{i t}+\beta_{6} \ln{EMP}_{i t}+u_{i t}
# \end{aligned}
#
# Use ```STATE``` as the dummy column and add ```drop_fist``` to avoid dummy trap.
# In[62]:
df_dum = pd.get_dummies(data=df, columns=["STATE"], drop_first=True)
# In[72]:
df_dum
# In[ ]: