A starter notebook covering some basics of statsmodels in Python, aimed at finance analysts who are new to Python data analysis. I won't explain all of the Python syntax, just demonstrate the bare minimum required to load data, explore it, and fit a few regression models.
Statsmodels
import pandas as pd
import statsmodels.api as sm
# plotting
import seaborn as sns
sns.set()  # apply seaborn's default plot styling
For the purposes of this demo we'll use a built-in dataset.
# Load Data
longley = sm.datasets.longley.load_pandas()
# Convert to dataframe to keep it simple
df = longley.data
This classic dataset of labor statistics was one of the first used to test the accuracy of least squares computations. The response variable (y) is the Total Derived Employment and the predictor variables are GNP Implicit Price Deflator with Year 1954 = 100 (x1), Gross National Product (x2), Unemployment (x3), Size of Armed Forces (x4), Non-Institutional Population Age 14 & Over (x5), and Year (x6). Source
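The DataFrame uses shorter column names for these variables (TOTEMP, GNP, UNEMP, ARMED, YEAR, and so on); a quick way to check them before going further:
# List the column names used in the DataFrame
list(df.columns)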
If you want to load in your own data, I'd recommend two options:
df = pd.read_excel("filename.xlsx")
A simpler alternative, especially if you're new to Python, is the read_clipboard() method. In your Excel file, highlight the data table you want to use, copy it with Ctrl+C, then switch to your Jupyter notebook and enter:
df = pd.read_clipboard()
If it doesn't work, you may need to reformat your data in excel.
# see dataframe representation
df.head()
# A quick scatter plot with pandas (seaborn is only providing the styling here)
df.plot(x="GNP", y="TOTEMP", kind="scatter", title="A simple plot", figsize=(10,6))
df.describe()
# look at distributions
df.hist('TOTEMP')
You can easily find the correlation values in a dataframe with the .corr() method.
df.corr()
We can visualize this with a heatmap using sns.heatmap():
sns.heatmap(df.corr())
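If the default heatmap is hard to read, one optional tweak is to annotate each cell with its correlation value:
# Same heatmap, with the correlation values printed in each cell
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")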
Notice there's a lot of multicollinearity here, so we should exclude some of those terms if we want to use linear regression. Here's an article on why multicollinearity is bad and what you can do about it.
Here's a simple example of multiple linear regression with ordinary least squares (the 'vanilla' linear regression). I chose not to use the R-style formulas here, to keep things simple.
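For reference, if you prefer the R-style formula interface, the equivalent call would look roughly like the sketch below (using an illustrative subset of predictors just to show the syntax); the rest of this notebook sticks with the array interface.
# R-style formula interface, shown only for comparison; not used below
import statsmodels.formula.api as smf
formula_res = smf.ols("TOTEMP ~ GNP + UNEMP + ARMED", data=df).fit()
formula_res.summary()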
y = df.iloc[:, 0]
X = df.iloc[:, 1:]
# Add a constant term if we want to fit an intercept
X = sm.add_constant(X, prepend=True)
# Drop the YEAR column
X = X.drop("YEAR", axis=1)
# Declare/specify the model
mod = sm.OLS(endog=y, exog=X)
# Fit the model
res = mod.fit()
# Present the summary statistics from the fit
res.summary()
Notice that statsmodels calls out in note [2] of the summary that there is strong multicollinearity, which we saw earlier in the correlation matrix and the heatmap.
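If you want particular numbers from the fit rather than the full text summary, the results object exposes them as attributes; for example:
# Pull individual results out of the fit instead of reading the summary table
print(res.params)    # estimated coefficients
print(res.rsquared)  # R-squared of the fit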
We'll use the variance inflation factor (VIF) to find the variables that contribute the most to the multicollinearity, borrowing this helper function from this article:
from statsmodels.stats.outliers_influence import variance_inflation_factor
def calc_vif(X):
    # Calculate the VIF for each column of X
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif
# Check VIF for the predictors, excluding the constant column
calc_vif(X.iloc[:, 1:])
Now that we have a table of VIF values, we will iteratively remove the variable with the highest VIF and recalculate the remaining VIF values, with the goal of getting them as low as possible (ideally <5 as a rule of thumb).
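Here's a minimal sketch of that iterative pruning using the calc_vif helper above (drop_highest_vif is a hypothetical helper, not part of statsmodels); in practice you'd stop once the VIFs are acceptable or you're down to the predictors you want to keep.
# Hypothetical helper: drop the single column with the highest VIF
def drop_highest_vif(X):
    vif = calc_vif(X)
    worst = vif.loc[vif["VIF"].idxmax(), "variables"]
    print(f"dropping {worst}")
    return X.drop(worst, axis=1)

# Apply it a couple of times, checking the remaining VIFs as you go
reduced = drop_highest_vif(X.iloc[:, 1:])  # start from the predictors, excluding the constant
reduced = drop_highest_vif(reduced)
calc_vif(reduced)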
calc_vif(X[["GNP", "UNEMP", "ARMED"]])
X[["GNP", "UNEMP", "ARMED"]].corr()
There's still a good amount of multicollinearity going on here, which is not great, and the VIF values are still quite high, but this is good enough to go on.
Next, we'll refit the linear regression with these terms, making sure to include that constant term.
# Fit the limited model
mod = sm.OLS(endog=y, exog=X[["const", "GNP", "UNEMP", "ARMED"]])
res = mod.fit()
res.summary()
So we see that the higher the unemployment, the lower the total employment, which makes sense. We see here and in our EDA that GNP is a valuable predictor. However, I would remove ARMED because its confidence interval includes 0, indicating that the actual coefficient may be 0 and that the variable has no effect on the outcome. So let's do that:
# Refit the model without `ARMED`
mod = sm.OLS(endog=y, exog=X[["const", "GNP", "UNEMP"]])
res = mod.fit()
res.summary()
And here we have a good model. To simplify it further, we could remove UNEMP and fit a model with GNP alone and still get a very good fit. Determining when to add or remove terms can be subjective; there are cases in which you would still want to keep some terms.
Next we'll cover autoregression, in which we look purely at a single column in our data and try to calculate how well past values relate to future values.
from statsmodels.tsa.ar_model import AutoReg
df.plot(x="YEAR", y="TOTEMP")
# Fit an autoregressive model of TOTEMP on its first three lags
res = AutoReg(df["TOTEMP"], lags=[1, 2, 3]).fit()
res.summary()
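Once fitted, the AR results object can also produce out-of-sample predictions. A quick sketch (the exact index handling depends on your data's index; here it's a plain integer index):
# Predict a few steps beyond the end of the observed data
res.predict(start=len(df), end=len(df) + 4)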
Next we'll demonstrate vector autoregression (VAR) with statsmodels. See Forecasting: Principles and Practice for an introduction to VAR.
from statsmodels.tsa.api import VAR
For vector autoregression, instead of having a matrix of X values and a single y variable, we model several series jointly, with each series predicted from lagged values of itself and of the other series.
var_data = df[["TOTEMP", "GNP", "UNEMP"]]
Next, we fit the model with statsmodels and display the results, similar to what we did with AR and linear regression before.
var_model = VAR(var_data)
res = var_model.fit(1)  # VAR(1) model, i.e. use one lag of each variable
res.summary()
We see that the results for the TOTEMP and UNEMP equations are not so good; it appears that the lag-1 values are poor predictors. However, the results for the GNP equation are more promising, as L1.GNP is a good predictor of GNP.
So if we wanted to build a predictive model, there is some hope here: we have a reasonable model for GNP one year in the future, and we could then use that predicted GNP to create a prediction for what TOTEMP would be.
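As a closing sketch, here is roughly how you'd produce that forecast from the fitted VAR model; forecast() needs the last lag-order rows of the data as its starting point (one row here, since we fit a VAR(1)).
# Forecast all three series a few years ahead, starting from the last observed row(s)
lag_order = res.k_ar
res.forecast(var_data.values[-lag_order:], steps=3)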