(time-nowcasting)=
Nowcasting is the prediction of economic indicators in the present, the very near future, and the very recent past, usually in a context where there is a lag in collecting and compiling information on the economic indicator in question and only partial information is available. For GDP, it can take over a year after the end of the so-called reference period for the final version of the data to be released. The two contexts in which you'll most often hear about nowcasting are the weather and GDP, two variables that people care deeply about!
For most of this chapter, we'll code up a nowcast following notes by Federal Reserve Board economist Chad Fulton.
Let's import the packages we'll need for this chapter:
import types
import warnings
import matplotlib as mpl
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Plot settings
plt.style.use(
"https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt"
)
mpl.rcParams.update({"lines.linewidth": 1.2})
# Set max rows displayed for readability
pd.set_option("display.max_rows", 8)
# Set seed for random numbers
seed_for_prng = 78557
prng = np.random.default_rng(seed_for_prng)  # prng = pseudo-random number generator
# Turn off warnings
warnings.filterwarnings("ignore")
We need to begin with some definitions. A reference period is the time period of interest to the nowcast, i.e. the time period for which you would like to estimate a data point. The data vintage is the time when the data were issued; this could be before or after the reference period (indeed, for data that are revised, it can be some time after the reference period). Finally, the frequency is how often reference periods occur. For nowcasting GDP, the frequency is typically quarterly.
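To make these definitions concrete, here's a small sketch of one reference period being estimated by several successive data vintages; the release dates and numbers below are entirely made up for illustration:

# Illustrative only: one quarterly reference period, three data vintages
reference_period = pd.Period("2020Q2", freq="Q")
# Successive vintages (indexed by release date) of the same data point:
vintage_estimates = pd.Series(
    [-1.2, -1.0, -0.9],  # made-up successive estimates
    index=pd.to_datetime(["2020-07-30", "2020-08-27", "2020-09-30"]),
    name=f"estimate of GDP growth in {reference_period}",
)
print(vintage_estimates)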
There are three parts to a typical nowcast:

- the *forecast* period, before the reference period begins;
- the *true nowcast* period, during the reference period itself;
- the *backcast* period, after the reference period has ended but before the data are released.
line_pos = [0, 3, 9, 12]
colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]
labels = [r"$t-1$", r"$t$", r"$t+1 \longrightarrow$"]
text_height_1 = 5
text_height_2 = 2
fig, ax = plt.subplots(figsize=(5, 2))
ax.set_xlim(line_pos[0], line_pos[-1])
ax.set_ylim(line_pos[0], 6)
ax.axvline(3, color="k", linestyle=":")
ax.axvline(9, color="k", linestyle=":")
ax.annotate("Forecast", xy=(1.5, text_height_1), ha="center")
ax.annotate("True Nowcast", xy=(6, text_height_1), ha="center")
ax.annotate("Backcast", xy=(10.5, text_height_1), ha="center")
ax.annotate("Reference period", xy=(6, text_height_2), ha="center")
ax.axvspan(xmin=line_pos[0], xmax=line_pos[1], facecolor=colors[0], alpha=0.3)
ax.axvspan(xmin=line_pos[1], xmax=line_pos[2], facecolor=colors[1], alpha=0.3)
ax.axvspan(xmin=line_pos[2], xmax=line_pos[3], facecolor=colors[2], alpha=0.3)
ax.set_xticks(line_pos[1:])
ax.set_xticklabels(labels)
ax.set_yticklabels([])
ax.set_yticks([])
ax.set_xlabel("Time (reference period)")
ax.spines["right"].set_visible(True)
ax.spines["top"].set_visible(True)
ax.set_title("The different periods of a nowcast", fontsize=10)
plt.show()
In the above figure, the model is the same across the three different sub-periods; the only difference is whether the target event has happened yet.
Nowcasting is, in many ways, more complicated than forecasting. With forecasting, it's often the case that the information set, $\Omega$, that you're using is fixed at a point in time (say, at time $t-1$), and you plug that information into a model, $f$, to predict your target variable at time $t$, i.e. $y_{t}$. With nowcasting, the problem becomes dynamic: we could be at $t-1$, $t$, or even $t+1$ (and anywhere in between), but we still want to use the same model to nowcast the value of $y$ at time $t$! This is a tall order for any model; it has to cope with input data that change over time.
Nowcasts should be able to be constructed whenever new data become available, that is, whenever the information set, $\Omega$, is updated. This could happen several times over the nowcasting period (in the forecast, nowcast, or backcast sub-periods).
Nowcasts have to deal with two particular challenges:

- *mixed frequencies*: the input series are released at different frequencies (e.g. weekly, monthly, quarterly);
- the *"ragged edge"*: because series are published with different lags, the most recent part of the dataset has missing values that differ across series.
As a concrete example of the new data that arrive while a nowcast is in operation (and of these two challenges), let's imagine we are trying to nowcast quarterly GDP in a setting where, every month, there is a survey of growth expectations and, every three weeks on a Monday, there is a release of sales data (which has some predictive power for GDP). The figure below lays out the timing of these releases.
time_min = "2019-11-30"
periods = 50
weekly_range = pd.date_range("01-01-2020", freq="W", periods=periods)
months = mdates.MonthLocator(interval=1)
months_fmt = mdates.DateFormatter("%b")
weeks = mdates.WeekdayLocator(byweekday=mdates.MO)
date_lines = [
(weekly_range.min() - pd.DateOffset(months=1)).to_period("Q"),
(weekly_range.min() + pd.DateOffset(months=1)).to_period("Q"),
(weekly_range.min() + pd.DateOffset(months=5)).to_period("Q"),
(weekly_range.min() + pd.DateOffset(months=8)).to_period("Q"),
]
locs_release_sales = pd.date_range(time_min, freq="W-MON", periods=periods)
locs_survey = pd.date_range(time_min, freq="M", periods=periods)
text_height = 7
fig, ax = plt.subplots(figsize=(7, 2))
ax.set_xlim(weekly_range.min() - pd.DateOffset(months=1), weekly_range.max())
ax.set_ylim(0, 8)
for date_val in date_lines:
ax.axvline(date_val, color="k", linestyle=":", alpha=0.3)
annotations = [" Q1", " Q2 (ref. period)", " Q3", " Q4"]
for i, annotation in enumerate(annotations):
    ax.annotate(annotation, xy=(date_lines[i], text_height), ha="left", fontsize=9)
ax.annotate(
" (Q1 GDP\n 1st estimate released)",
xy=(date_lines[2], 6),
ha="left",
va="center",
fontsize=7,
)
ax.annotate(
" (Q2 GDP 1st estimate released)",
xy=(date_lines[3], 6),
ha="left",
va="center",
fontsize=7,
)
for val in locs_release_sales[::3]:
ax.annotate(
"Sales Release",
xy=(val, 0),
xytext=(val, 1),
fontsize=6,
rotation=90,
ha="center",
)
for val in locs_survey:
ax.annotate(
"Survey", xy=(val, 0), xytext=(val, 3), fontsize=6, rotation=90, ha="center"
)
ax.spines["right"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.xaxis.set_major_locator(months)
ax.xaxis.set_major_formatter(months_fmt)
ax.xaxis.set_minor_locator(weeks)
ax.yaxis.set_ticks([])
ax.yaxis.set_ticklabels([])
ax.tick_params(axis="x", which="minor", bottom=True)
ax.set_title(
"An example of high-frequency information releases relevant to a GDP nowcast",
fontsize=10,
)
plt.show()
In the above example we have three time series, all with different frequencies: the GDP first-estimate releases (quarterly, but with a one-quarter lag), which we would use as an autoregressive term in a model; the sales releases, which come out every three weeks on a Monday; and the once-a-month survey releases. (In the diagram, the smaller tick marks denote Mondays.)
A good nowcasting model needs to be able to incorporate all of the information in the example above, and be able to run at any point in time shown.
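Put in terms of data, the model has to handle a table like the following sketch: a mixed-frequency panel with a "ragged edge" of missing recent values. All the numbers here are made up, purely for illustration:

# Illustrative mixed-frequency panel with a "ragged edge" (made-up numbers)
idx = pd.period_range("2020-01", "2020-06", freq="M")
ragged = pd.DataFrame(
    {
        "survey": [0.2, 0.1, -0.5, -2.1, 1.0, np.nan],  # monthly, short lag
        "sales": [0.4, 0.3, -1.0, -3.2, np.nan, np.nan],  # longer publication lag
        "gdp": [np.nan, np.nan, 0.6, np.nan, np.nan, np.nan],  # quarterly, placed in the last month of Q1
    },
    index=idx,
)
print(ragged)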
There are a range of methods for nowcasting around, but the most popular broad classes are probably:

- bridge equations;
- mixed data sampling (MIDAS) models;
- dynamic factor models.
For more on both MIDAS and bridge equation approaches, see {cite:t}`schumacher2016comparison`. For the rest of this chapter, we're going to look at the practical application in code of a dynamic factor model. The classic reference on dynamic factor models is
The equation below defines a dynamic factor model:
$$
\begin{aligned}
\vec{y}_t & = \Lambda \vec{f}_t + \epsilon_t \\
\vec{f}_t & = A_1 \vec{f}_{t-1} + \dots + A_p \vec{f}_{t-p} + u_t
\end{aligned}
$$

Here, $\vec{y}_t$ is a vector of observed variables, $\vec{f}_t$ is a vector of factors, $\Lambda$ is a matrix of factor loadings (a map from factor space to the space of observed variables), and $\epsilon_t \thicksim \mathcal{N}(0, \Sigma_\epsilon)$ and $u_t \thicksim \mathcal{N}(0, \Sigma_u)$, with $\Sigma_\epsilon$ and $\Sigma_u$ assumed to be diagonal matrices.
The model above is specified in so-called state space form.
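To make this concrete, here's a minimal sketch, using the statsmodels library, of a one-factor dynamic factor model estimated on simulated data. Everything below (the AR(1) coefficient, the loadings, the series names) is illustrative rather than the chapter's eventual model:

import statsmodels.api as sm

# Simulate an AR(1) factor and three observed series that load on it
n_obs = 200
factor = np.zeros(n_obs)
for t in range(1, n_obs):
    factor[t] = 0.8 * factor[t - 1] + prng.normal()
loadings = np.array([1.0, 0.5, -0.7])
observed = factor[:, None] * loadings + prng.normal(scale=0.5, size=(n_obs, 3))
observed = pd.DataFrame(
    observed,
    columns=["y1", "y2", "y3"],
    index=pd.period_range("2000-01", periods=n_obs, freq="M"),
)
# One latent factor following an AR(1), estimated by maximum likelihood
mod = sm.tsa.DynamicFactor(observed, k_factors=1, factor_order=1)
res = mod.fit(disp=False)
# Model-based predictions of all observed series
print(res.forecast(steps=2))

Because the model is cast in state space form, the Kalman filter handles any missing values in the observed data automatically, which is exactly what we need for the ragged edge.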
As above, the information set at vintage $v$ is $\Omega_v$. The nowcast is given by

$$
\mathbb{E}\left[y_t \mid \Omega_v \right]
$$

For our data, we'll use a US panel of economic data released at a monthly frequency, along with GDP, which is only released at a quarterly frequency in the US. The monthly datasets that we'll be using come from the FRED-MD database {cite:t}`mccracken2016fred`, and we will take real GDP from the companion FRED-QD database.
The FRED-MD dataset was launched in January 2015, and vintages are available for each month since then. The FRED-QD dataset was fully launched in May 2018, and monthly vintages are available for each month since then. Our baseline exercise will use the February 2020 dataset (which includes data through reference month January 2020), and then we will examine how incoming data over March to June 2020 influenced the model's nowcast of real GDP growth in 2020Q2.
Now, the data we download are not going to be stationary (for more on stationarity, see {ref}`time-series`), so we first need to transform them. Let's write a function that will do all of the transformations we need:
def transform(column, transforms):
    """Transforms a pandas series to make it stationary.

    :param column: the series to transform
    :type column: pd.Series
    :param transforms: mapping from series name to transformation code (1 to 7)
    :type transforms: dict or pd.Series
    :return: the transformed series
    :rtype: pd.Series
    """
transformation = transforms[column.name]
# For quarterly data like GDP, we will compute
# annualized percent changes
mult = 4 if column.index.freqstr[0] == "Q" else 1
# 1 => No transformation
if transformation == 1:
pass
# 2 => First difference
elif transformation == 2:
column = column.diff()
# 3 => Second difference
elif transformation == 3:
column = column.diff().diff()
# 4 => Log
elif transformation == 4:
column = np.log(column)
# 5 => Log first difference, multiplied by 100
# (i.e. approximate percent change)
# with optional multiplier for annualization
elif transformation == 5:
column = np.log(column).diff() * 100 * mult
# 6 => Log second difference, multiplied by 100
# with optional multiplier for annualization
elif transformation == 6:
column = np.log(column).diff().diff() * 100 * mult
# 7 => Exact percent change, multiplied by 100
# with optional annualization
elif transformation == 7:
column = ((column / column.shift(1)) ** mult - 1.0) * 100
return column
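As a quick sanity check that the function behaves as expected, here's a toy example (the values are made up) using transformation code 5 on a quarterly series, where the multiplier annualizes the growth rate:

# Toy quarterly series to sanity-check `transform` (illustrative values)
toy = pd.Series(
    [100.0, 102.0, 101.0, 103.0, 105.0],
    index=pd.period_range("2019Q1", periods=5, freq="Q"),
    name="TOYGDP",
)
# Code 5: log first difference x 100, annualized (x4) for quarterly data
print(transform(toy, {"TOYGDP": 5}))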
Following {cite:t}`mccracken2016fred`, we remove outliers (setting their value to missing). These are defined as observations that are more than 10 times the interquartile range from the series mean.
However, in this exercise we are interested in nowcasting real GDP growth for 2020Q2, which was greatly affected by economic shutdowns stemming from the COVID-19 pandemic. During the first half of 2020, there are a number of series which include extreme observations, many of which would be excluded by this outlier test. Because these observations are likely to be informative about real GDP in 2020Q2, we only remove outliers for the period 1959-01 through 2019-12.
The details of outlier removal are in the `remove_outliers` function, below:
def remove_outliers(dta):
# Compute the mean and interquartile range
mean = dta.mean()
iqr = dta.quantile([0.25, 0.75]).diff().T.iloc[:, 1]
# Replace entries that are more than 10 times the IQR
# away from the mean with NaN (denotes a missing entry)
mask = np.abs(dta) > mean + 10 * iqr
treated = dta.copy()
treated[mask] = np.nan
return treated
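And a quick illustrative check on a toy column with one extreme entry (numbers made up):

# The 50.0 entry is far outside mean + 10*IQR, so it is replaced with NaN
toy_df = pd.DataFrame({"x": [0.1, -0.2, 0.3, 50.0, 0.0, -0.1]})
print(remove_outliers(toy_df))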
We will now write a function that can load the data for us. It's going to perform the following steps for both the FRED-MD and FRED-QD datasets:

1. Download the raw data, stored in the `orig_[m|q]` variable.
2. Extract the column describing which transformation to apply to each series, stored in the `transform_[m|q]` variable (and, for the quarterly dataset, also extract the column describing which factor an earlier paper assigned each variable to).
3. Apply the transformations and remove pre-2020 outliers, with the results stored in `dta_[m|q]`.

Here's the function:
def load_fredmd_data(vintage):
"""Given a vintage, finds the FRED data and returns it.
:param vintage: The year month combo, eg "2020-02"
:type vintage: str
:return: all data associated with that vintage
:rtype: simple namespace
"""
base_url = "https://files.stlouisfed.org/files/htdocs/fred-md"
# - FRED-MD --------------------------------------------------------------
# 1. Download data
orig_m = pd.read_csv(f"{base_url}/monthly/{vintage}.csv").dropna(how="all")
# 2. Extract transformation information
transform_m = orig_m.iloc[0, 1:]
orig_m = orig_m.iloc[1:]
# 3. Extract the date as an index
orig_m.index = pd.PeriodIndex(orig_m.sasdate.tolist(), freq="M")
orig_m.drop("sasdate", axis=1, inplace=True)
# 4. Apply the transformations
dta_m = orig_m.apply(transform, axis=0, transforms=transform_m)
# 5. Remove outliers (but not in 2020)
dta_m.loc[:"2019-12"] = remove_outliers(dta_m.loc[:"2019-12"])
# - FRED-QD --------------------------------------------------------------
# 1. Download data
orig_q = pd.read_csv(f"{base_url}/quarterly/{vintage}.csv").dropna(how="all")
# 2. Extract factors and transformation information
factors_q = orig_q.iloc[0, 1:]
transform_q = orig_q.iloc[1, 1:]
orig_q = orig_q.iloc[2:]
# 3. Extract the date as an index
orig_q.index = pd.PeriodIndex(orig_q.sasdate.tolist(), freq="Q")
orig_q.drop("sasdate", axis=1, inplace=True)
# 4. Apply the transformations
dta_q = orig_q.apply(transform, axis=0, transforms=transform_q)
# 5. Remove outliers (but not in 2020)
dta_q.loc[:"2019Q4"] = remove_outliers(dta_q.loc[:"2019Q4"])
# - Output datasets ------------------------------------------------------
return types.SimpleNamespace(
orig_m=orig_m,
orig_q=orig_q,
dta_m=dta_m,
transform_m=transform_m,
dta_q=dta_q,
transform_q=transform_q,
factors_q=factors_q,
)
Let's now grab the vintages we want and show some information about what we downloaded:
vintages = ["2020-02", "2020-03", "2020-04", "2020-05", "2020-06"]
dta = {date: load_fredmd_data(date) for date in vintages}
# Print some information about the base dataset
selected_vintage = "2020-02"
n, k = dta[selected_vintage].dta_m.shape
start = dta[selected_vintage].dta_m.index[0]
end = dta[selected_vintage].dta_m.index[-1]
print(
f"For vintage {selected_vintage}, there are {k} series and {n} observations,"
f" over the period {start} to {end}."
)
To see how the transformation and outlier removal work, here we plot three graphs of the RPI variable (“Real Personal Income”) over the period 2000-01 to 2020-01.
Notice that the large negative growth rate in January 2013 was deemed to be an outlier and so was replaced with a missing value (a NaN value).
The BEA release at the time noted that this was related to “the effect of special factors, which boosted employee contributions for government social insurance in January [2013] and which had boosted wages and salaries and personal dividends in December [2012].”
fig, axes = plt.subplots(3, figsize=(14, 6))
# Plot the raw data from the February 2020 vintage, for:
vintage = "2020-02"
variable = "RPI"
start = "2000-01"
end = "2020-01"
# 1. Plot the original dataset, for 2000-01 through 2020-01
dta[vintage].orig_m.loc[start:end, variable].plot(ax=axes[0])
axes[0].set(xlim=("2000", "2020"), ylabel="Billions of $")
axes[0].set_title("Original data", loc="center")
# 2. Plot the transformed data, still including outliers
# (we only stored the transformation with outliers removed, so
# here we'll manually perform the transformation)
transformed = transform(dta[vintage].orig_m[variable], dta[vintage].transform_m)
transformed.loc[start:end].plot(ax=axes[1])
mean = transformed.mean()
iqr = transformed.quantile([0.25, 0.75]).diff().iloc[1]
axes[1].hlines(
[mean - 10 * iqr, mean + 10 * iqr],
transformed.index[0],
transformed.index[-1],
linestyles="--",
linewidth=1,
)
axes[1].set(
title="Transformed data, with bands showing outliers cutoffs",
xlim=("2000", "2020"),
ylim=(mean - 15 * iqr, mean + 15 * iqr),
ylabel="Percent",
)
axes[1].annotate(
"Outlier",
xy=("2013-01", transformed.loc["2013-01"]),
xytext=("2014-01", -5.3),
textcoords="data",
arrowprops=dict(arrowstyle="->", connectionstyle="arc3"),
)
# 3. Plot the transformed data, with outliers removed (see missing value for 2013-01)
dta[vintage].dta_m.loc[start:end, "RPI"].plot(ax=axes[2])
axes[2].set(
title="Transformed data, with outliers removed",
xlim=("2000", "2020"),
ylabel="Percent",
)
axes[2].annotate(
"Missing value in place of outlier",
xy=("2013-01", -1),
xytext=("2014-01", -2),
textcoords="data",
arrowprops=dict(arrowstyle="->", connectionstyle="arc3"),
)
fig.suptitle("Real Personal Income (RPI)")
fig.tight_layout(rect=[0, 0.00, 1, 0.95]);
In addition to providing the raw data, {cite:t}`mccracken2016fred` and the FRED-MD/FRED-QD datasets provide additional information in appendices about the variables in each dataset. In particular, we're interested in:

- the description of each variable; and
- the group that each variable is assigned to.
The descriptions make it easier to understand the results, while the variable groupings can be useful in specifying the factor block structure. For example, we may want to have one or more “Global” factors that load on all variables while at the same time having one or more group-specific factors that only load on variables in a particular group (there's a sketch of this after the data are loaded below).
The information from those appendices has been extracted into CSV files that we can load in:
# Definitions from the Appendix for FRED-MD variables
# defn_m = pd.read_csv("https://github.com/aeturrell/coding-for-economists/raw/main/data/fred/FRED-MD_updated_appendix.csv")
defn_m = pd.read_csv("data/fred/FRED-MD_updated_appendix.csv")
defn_m.index = defn_m.fred
# Definitions from the Appendix for FRED-QD variables
# defn_q = pd.read_csv("https://github.com/aeturrell/coding-for-economists/raw/main/data/fred/FRED-QD_updated_appendix.csv")
defn_q = pd.read_csv("data/fred/FRED-QD_updated_appendix.csv")
defn_q.index = defn_q.fred
# Example of the information in these files:
defn_m.head()
defn_q.head()
print(defn_m["group"].unique())
print(defn_q["Group"].unique())
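As a sketch of how this group information could feed into a factor block structure, here's one hypothetical way to give every variable a “Global” factor plus a factor specific to its appendix group. This is the kind of specification that statsmodels' `DynamicFactorMQ` accepts via its `factors` argument; the construction below is illustrative rather than the chapter's final model:

# Hypothetical block structure: each monthly variable loads on a "Global"
# factor plus one factor for its appendix group (illustrative only)
factor_blocks = {name: ["Global", group] for name, group in defn_m["group"].items()}
# Peek at a few entries
print({k: factor_blocks[k] for k in list(factor_blocks)[:3]})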