#!/usr/bin/env python
# coding: utf-8

# (time-nowcasting)=
# # Introduction to Nowcasting
#
# Nowcasting is the prediction of economic indicators in the present, the very near future, and the very recent past—usually in a context where there is a lag in collecting and compiling information on those economic indicators, so that only partial information is available. For GDP, it can take over a year after the end of the so-called reference period for the final version of the data to be released. The two contexts in which you'll most often hear about nowcasting are the weather and GDP, two variables that people care deeply about!
#
# For most of this chapter, we'll work through an example of coding up a nowcast, based on Federal Reserve Board economist Chad Fulton's [notes](http://www.chadfulton.com/topics/statespace_large_dynamic_factor_models.html).
#
# Let's import the packages we'll need for this chapter:

# In[ ]:

import types
import warnings

import matplotlib as mpl
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Plot settings
plt.style.use(
    "https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt"
)
mpl.rcParams.update({"lines.linewidth": 1.2})
# Set max rows displayed for readability
pd.set_option("display.max_rows", 8)
# Set seed for random numbers
seed_for_prng = 78557
prng = np.random.default_rng(
    seed_for_prng
)  # prng=probabilistic random number generator
# Turn off warnings
warnings.filterwarnings("ignore")


# ## Nowcasting
#
# We need to begin with some definitions. A *reference period* is the time period of interest to the nowcast, i.e. the time period for which you would like to estimate a data point. The *data vintage* is the time when the data were issued, and this could be before or after the reference period—indeed, for data that are revised, it can be some time *after* the reference period. Finally, the *frequency* is how often reference periods occur. For nowcasting GDP, the frequency is typically quarterly.
#
# There are three parts to a typical nowcast:
#
# 1. the forecast (the estimate of the data point made shortly *before* the reference period)
# 2. the nowcast (the estimate of the data point made *during* the reference period)
# 3. the backcast (the estimate of the data point made *after* the reference period)
# In[ ]:

# TODO remove input
line_pos = [0, 3, 9, 12]
colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]
labels = [r"$t-1$", r"$t$", r"$t+1 \longrightarrow$"]
text_height_1 = 5
text_height_2 = 2

fig, ax = plt.subplots(figsize=(5, 2))
ax.set_xlim(line_pos[0], line_pos[-1])
ax.set_ylim(line_pos[0], 6)
ax.axvline(3, color="k", linestyle=":")
ax.axvline(9, color="k", linestyle=":")
ax.annotate("Forecast", xy=(1.5, text_height_1), ha="center")
ax.annotate("True Nowcast", xy=(6, text_height_1), ha="center")
ax.annotate("Backcast", xy=(10.5, text_height_1), ha="center")
ax.annotate("Reference period", xy=(6, text_height_2), ha="center")
ax.axvspan(xmin=line_pos[0], xmax=line_pos[1], facecolor=colors[0], alpha=0.3)
ax.axvspan(xmin=line_pos[1], xmax=line_pos[2], facecolor=colors[1], alpha=0.3)
ax.axvspan(xmin=line_pos[2], xmax=line_pos[3], facecolor=colors[2], alpha=0.3)
ax.set_xticks(line_pos[1:])
ax.set_xticklabels(labels)
ax.set_yticklabels([])
ax.set_yticks([])
ax.set_xlabel("Time (reference period)")
ax.spines["right"].set_visible(True)
ax.spines["top"].set_visible(True)
ax.set_title("The different periods of a nowcast", fontsize=10)
plt.show()


# In the above figure, the model is the same across the three different sub-periods; the only difference is whether the target event has happened yet.
#
# Nowcasting is, in many ways, more complicated than forecasting. With forecasting, it's often the case that the *information set*, $\Omega$, that you're using is fixed at a point in time (say, at time $t-1$), and you plug that information into a model, $f$, to try and predict the value of your target variable, $y$, at time $t$, i.e. $y_{t}$. With nowcasting, the problem becomes *dynamic*. We could be at $t-1$, $t$, or even $t+1$ (and anything in between), but we still want to use the *same* model to nowcast the value of $y$ at $t$! This is a tall order for any model; it has to cover input data that are changing over time.
#
# Nowcasts should be able to be constructed *whenever* new data are available; that is, whenever the *information set*, $\Omega$, is updated. And this could happen several times through the nowcasting period (in the forecast, nowcast, or backcast periods).
#
# Nowcasts have to deal with two particular challenges:
#
# - Nowcasts should be able to deal with mixed frequency data, so as to take in information that is released between and within reference periods.
# - Nowcasts should be able to deal with "ragged edges" in the data; that is, because different variables in the information set have different release schedules, making a nowcast at any point in time will inevitably collect an information set that has some data missing (a toy example of such an information set follows below).
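# Before the fuller example that comes next, here is a minimal sketch, with made-up numbers and illustrative variable names, of what a mixed-frequency information set with a ragged edge might look like in pandas part-way through a quarter:

# In[ ]:

# A toy mixed-frequency information set with a "ragged edge".
# All numbers are made up for illustration.
monthly_index = pd.period_range("2020-01", "2020-06", freq="M")
toy_info_set = pd.DataFrame(
    {
        # Monthly survey: the latest month hasn't been released yet
        "survey": [0.5, 0.1, -2.3, -6.0, 1.2, np.nan],
        # Sales come out on a different schedule: two recent months missing
        "sales": [1.1, 0.8, -4.2, -9.5, np.nan, np.nan],
        # Quarterly GDP only has values in the final month of each quarter,
        # and the current reference quarter hasn't been released at all
        "gdp": [np.nan, np.nan, -1.3, np.nan, np.nan, np.nan],
    },
    index=monthly_index,
)
print(toy_info_set)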
# As a concrete example of the new data that arrive during the period of operation of a nowcast (and of these two challenges), let's imagine we are trying to nowcast quarterly GDP, but that every month there is a survey of growth expectations, and every three weeks, on a Monday, there is a release on sales (which has some predictive power for GDP). And then let's imagine we are trying to make a nowcast.

# In[ ]:

# TODO remove input
time_min = "30-11-2019"
periods = 50
weekly_range = pd.date_range("01-01-2020", freq="W", periods=periods)
months = mdates.MonthLocator(interval=1)
months_fmt = mdates.DateFormatter("%b")
weeks = mdates.WeekdayLocator(byweekday=mdates.MO)
date_lines = [
    (weekly_range.min() - pd.DateOffset(months=1)).to_period("Q"),
    (weekly_range.min() + pd.DateOffset(months=1)).to_period("Q"),
    (weekly_range.min() + pd.DateOffset(months=5)).to_period("Q"),
    (weekly_range.min() + pd.DateOffset(months=8)).to_period("Q"),
]
locs_release_sales = pd.date_range(time_min, freq="W-MON", periods=periods)
locs_survey = pd.date_range(time_min, freq="M", periods=periods)
text_height = 7

fig, ax = plt.subplots(figsize=(7, 2))
ax.set_xlim(weekly_range.min() - pd.DateOffset(months=1), weekly_range.max())
ax.set_ylim(0, 8)
for date_val in date_lines:
    ax.axvline(date_val, color="k", linestyle=":", alpha=0.3)
annotations = [" Q1", " Q2 (ref. period)", " Q3", " Q4"]
for i, tation in enumerate(annotations):
    ax.annotate(tation, xy=(date_lines[i], text_height), ha="left", fontsize=9)
ax.annotate(
    " (Q1 GDP\n 1st estimate released)",
    xy=(date_lines[2], 6),
    ha="left",
    va="center",
    fontsize=7,
)
ax.annotate(
    " (Q2 GDP 1st estimate released)",
    xy=(date_lines[3], 6),
    ha="left",
    va="center",
    fontsize=7,
)
for val in locs_release_sales[::3]:
    ax.annotate(
        "Sales Release",
        xy=(val, 0),
        xytext=(val, 1),
        fontsize=6,
        rotation=90,
        ha="center",
    )
for val in locs_survey:
    ax.annotate(
        "Survey", xy=(val, 0), xytext=(val, 3), fontsize=6, rotation=90, ha="center"
    )
ax.spines["right"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.xaxis.set_major_locator(months)
ax.xaxis.set_major_formatter(months_fmt)
ax.xaxis.set_minor_locator(weeks)
ax.yaxis.set_ticks([])
ax.yaxis.set_ticklabels([])
ax.tick_params(axis="x", which="minor", bottom=True)
ax.set_title(
    "An example of high-frequency information releases relevant to a GDP nowcast",
    fontsize=10,
)
plt.show()


# In the above example, we have three time series, all with different frequencies: the GDP 1st estimate releases (quarterly, but with a one-quarter lag), which we would use as an auto-regressive term in a model; the sales releases that come out every three weeks on a Monday; and the once-a-month survey releases. (In the diagram, the smaller tick marks denote Mondays.)
#
# A good nowcasting model needs to be able to incorporate all of the information in the example above, *and* be able to run at any point in time shown.
#
# There is a range of methods for nowcasting, but the most popular broad classes are probably:
#
# - *bridge models*, which use timely indicators to forecast a target variable. They create a "bridge" between higher-frequency series and the target variable by regressing the target variable on the higher-frequency indicators. They tend to produce iterated forecasts, and the high-frequency series are likely to be aggregated to the lower frequency (a minimal sketch of a bridge regression follows this list).
# - *MIDAS*, or mixed-data sampling, regressions, which use a multi-step forecast, weight the high-frequency indicators using a special type of lag polynomial, and include leads of the high-frequency indicators. There is not, typically, any time aggregation.
# - *dynamic factor models*, which suppose that a small number of unobserved "factors" can be used to explain a substantial portion of the variation and dynamics in a larger number of observed variables, and (most usefully for nowcasting) allow variables that are lower frequency to be estimated from the high-frequency indicators.
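# To give a flavour of the simplest of these approaches, here is a minimal sketch of a bridge regression on synthetic data: a monthly indicator is aggregated up to quarterly frequency by averaging, and the quarterly target is regressed on the aggregate. All of the names and numbers here are made up for illustration; this is not part of the FRED exercise that follows.

# In[ ]:

# A minimal bridge-model sketch on synthetic data (illustrative only)
n_months = 120
monthly_idx = pd.period_range("2010-01", periods=n_months, freq="M")
indicator = pd.Series(prng.normal(size=n_months), index=monthly_idx)

# Aggregate the monthly indicator up to quarterly frequency (the "bridge")
quarterly_indicator = indicator.resample("Q").mean()

# Construct a quarterly target that depends on the aggregated indicator
target = 0.5 + 2.0 * quarterly_indicator + prng.normal(
    scale=0.1, size=len(quarterly_indicator)
)

# Regress the quarterly target on the quarterly-aggregated indicator
X = np.column_stack([np.ones(len(quarterly_indicator)), quarterly_indicator])
beta, *_ = np.linalg.lstsq(X, target, rcond=None)
print(f"Estimated intercept={beta[0]:.2f}, slope={beta[1]:.2f}")

# In a real application, the most recent quarter's average would have to be
# computed from whichever months have been released so far, which is exactly
# where the ragged-edge problem bites.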
# For more on both MIDAS and bridge equation approaches, see {cite:t}`schumacher2016comparison`. For the rest of this chapter, we're going to look at the practical application in code of a dynamic factor model. The classic reference on dynamic factor models is the survey by Stock and Watson.

# ## Dynamic Factor Model
#
# The equations below define a dynamic factor model:
#
# $$
# \begin{aligned}
# \vec{y}_t & = \Lambda \vec{f}_t + \epsilon_t \\
# \vec{f}_t & = A_1 \vec{f}_{t-1} + \dots + A_p \vec{f}_{t-p} + u_t
# \end{aligned}
# $$
#
# Here, $\vec{y}_t$ is a vector of observed variables, $\vec{f}_t$ is a vector of unobserved factors, $\Lambda$ is a matrix of *factor loadings* (these are a map from factor space to the space of observed variables), and $\epsilon_t \thicksim \mathcal{N}(0, \Sigma_\epsilon)$ and $u_t \thicksim \mathcal{N}(0, \Sigma_u)$, with $\Sigma_\epsilon$ and $\Sigma_u$ assumed to be diagonal matrices.
#
# The model above is specified in so-called *state space form*.
#
# As above, the information set at vintage $v$ is $\Omega_v$. The nowcast is then given by
#
# $$
# \mathbb{E}\left[\vec{y}_t \mid \Omega_v \right]
# $$
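# The two equations above are straightforward to simulate, which is a useful way to build intuition for what the loadings $\Lambda$ do. Below is a minimal sketch that generates data from a single-factor model with an AR(1) factor; all of the parameter values are made up for illustration.

# In[ ]:

# Simulate a one-factor dynamic factor model (illustrative parameters only)
n_obs, n_series = 200, 4
lambda_loadings = prng.uniform(0.5, 1.5, size=n_series)  # the loadings, Lambda

# The factor follows an AR(1): f_t = a1 * f_{t-1} + u_t
a1 = 0.8
factor = np.zeros(n_obs)
for t in range(1, n_obs):
    factor[t] = a1 * factor[t - 1] + prng.normal(scale=0.5)

# Each observed series = its loading times the factor, plus idiosyncratic noise
y = factor[:, None] * lambda_loadings[None, :] + prng.normal(
    scale=0.3, size=(n_obs, n_series)
)

# The single factor drives strong common co-movement across all the series
print(np.corrcoef(y, rowvar=False).round(2))

# The harder problem, which the state space machinery solves, is the reverse:
# recovering the unobserved factor and the loadings from the observed series.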
# ### Data
#
# For our data, we'll use a US panel of economic data released at a monthly frequency, along with GDP, which is only released at a quarterly frequency in the US. The monthly datasets that we'll be using come from the FRED-MD database {cite:t}`mccracken2016fred`, and we will take real GDP from the companion FRED-QD database.
#
# The FRED-MD dataset was launched in January 2015, and vintages are available for each month since then. The FRED-QD dataset was fully launched in May 2018, and monthly vintages are available for each month since then. Our baseline exercise will use the February 2020 dataset (which includes data through reference month January 2020), and then we will examine how incoming data over March to June 2020 influenced the model's nowcast of real GDP growth in 2020Q2.
#
# #### Transformations
#
# Now, the data we download are not going to be stationary (for more on stationarity, see {ref}`time-series`), so we first need to transform them. Let's write a function that will do all of the transformations we need:

# In[ ]:

def transform(column, transforms):
    """Transforms a pandas series to make it stationary.

    :param column: the series to transform
    :type column: pd.Series
    :param transforms: mapping from series name to transformation code (1 to 7)
    :type transforms: pd.Series
    :return: the transformed series
    :rtype: pd.Series
    """
    transformation = transforms[column.name]
    # For quarterly data like GDP, we will compute
    # annualized percent changes
    mult = 4 if column.index.freqstr[0] == "Q" else 1
    # 1 => No transformation
    if transformation == 1:
        pass
    # 2 => First difference
    elif transformation == 2:
        column = column.diff()
    # 3 => Second difference
    elif transformation == 3:
        column = column.diff().diff()
    # 4 => Log
    elif transformation == 4:
        column = np.log(column)
    # 5 => Log first difference, multiplied by 100
    #      (i.e. approximate percent change)
    #      with optional multiplier for annualization
    elif transformation == 5:
        column = np.log(column).diff() * 100 * mult
    # 6 => Log second difference, multiplied by 100
    #      with optional multiplier for annualization
    elif transformation == 6:
        column = np.log(column).diff().diff() * 100 * mult
    # 7 => Exact percent change, multiplied by 100
    #      with optional annualization
    elif transformation == 7:
        column = ((column / column.shift(1)) ** mult - 1.0) * 100
    return column
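# Before applying `transform` to real data, here's a quick sanity check on a made-up series, using transformation code 5 (log first difference): a series growing by 1% each month should transform to a value of roughly 1 from the second month onwards.

# In[ ]:

# A quick check of `transform` on a toy monthly series (illustrative only)
toy = pd.Series(
    100 * 1.01 ** np.arange(12),  # grows by 1% per month
    index=pd.period_range("2020-01", periods=12, freq="M"),
    name="TOY",
)
toy_transforms = pd.Series({"TOY": 5})
print(transform(toy, toy_transforms).round(2))  # NaN, then ~1.0 thereafter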
# #### Outliers
#
# Following {cite:t}`mccracken2016fred`, we remove outliers (setting their value to missing). These are defined as observations that are more than 10 times the interquartile range away from the series mean.
#
# However, in this exercise we are interested in nowcasting real GDP growth for 2020Q2, which was greatly affected by economic shutdowns stemming from the COVID-19 pandemic. During the first half of 2020, there are a number of series that include extreme observations, many of which would be excluded by this outlier test. Because these observations are likely to be informative about real GDP in 2020Q2, we only remove outliers for the period 1959-01 through 2019-12.
#
# The details of outlier removal are in the `remove_outliers` function, below:

# In[ ]:

def remove_outliers(dta):
    """Replaces outlying entries in each series with NaN."""
    # Compute the mean and interquartile range of each series
    mean = dta.mean()
    iqr = dta.quantile([0.25, 0.75]).diff().T.iloc[:, 1]

    # Replace entries that are more than 10 times the IQR
    # away from the mean with NaN (denotes a missing entry)
    mask = np.abs(dta) > mean + 10 * iqr
    treated = dta.copy()
    treated[mask] = np.nan

    return treated


# #### Loading the data
#
# We will now write a function that can load the data for us. It's going to perform the following steps for both the FRED-MD and FRED-QD datasets:
#
# 1. Based on the vintage argument, it downloads a particular vintage of these datasets from the base URL https://files.stlouisfed.org/files/htdocs/fred-md into the `orig_[m|q]` variable.
# 2. Extracts the row describing which transformation to apply into the `transform_[m|q]` variable (and, for the quarterly dataset, also extracts the row describing which factor an earlier paper assigned each variable to).
# 3. Extracts the observation date (from the "sasdate" column) and uses it as the index of the dataset.
# 4. Applies the transformations we need for stationarity.
# 5. Removes outliers, but only for the period up to the end of 2019.
#
# Here's the function:

# In[ ]:

def load_fredmd_data(vintage):
    """Given a vintage, finds the FRED data and returns it.

    :param vintage: The year-month combo, e.g. "2020-02"
    :type vintage: str
    :return: all data associated with that vintage
    :rtype: types.SimpleNamespace
    """
    base_url = "https://files.stlouisfed.org/files/htdocs/fred-md"

    # - FRED-MD --------------------------------------------------------------
    # 1. Download data
    orig_m = pd.read_csv(f"{base_url}/monthly/{vintage}.csv").dropna(how="all")
    # 2. Extract transformation information
    transform_m = orig_m.iloc[0, 1:]
    orig_m = orig_m.iloc[1:]
    # 3. Extract the date as an index
    orig_m.index = pd.PeriodIndex(orig_m.sasdate.tolist(), freq="M")
    orig_m.drop("sasdate", axis=1, inplace=True)
    # 4. Apply the transformations
    dta_m = orig_m.apply(transform, axis=0, transforms=transform_m)
    # 5. Remove outliers (but not in 2020)
    dta_m.loc[:"2019-12"] = remove_outliers(dta_m.loc[:"2019-12"])

    # - FRED-QD --------------------------------------------------------------
    # 1. Download data
    orig_q = pd.read_csv(f"{base_url}/quarterly/{vintage}.csv").dropna(how="all")
    # 2. Extract factors and transformation information
    factors_q = orig_q.iloc[0, 1:]
    transform_q = orig_q.iloc[1, 1:]
    orig_q = orig_q.iloc[2:]
    # 3. Extract the date as an index
    orig_q.index = pd.PeriodIndex(orig_q.sasdate.tolist(), freq="Q")
    orig_q.drop("sasdate", axis=1, inplace=True)
    # 4. Apply the transformations
    dta_q = orig_q.apply(transform, axis=0, transforms=transform_q)
    # 5. Remove outliers (but not in 2020)
    dta_q.loc[:"2019Q4"] = remove_outliers(dta_q.loc[:"2019Q4"])

    # - Output datasets ------------------------------------------------------
    return types.SimpleNamespace(
        orig_m=orig_m,
        orig_q=orig_q,
        dta_m=dta_m,
        transform_m=transform_m,
        dta_q=dta_q,
        transform_q=transform_q,
        factors_q=factors_q,
    )


# Let's now grab the vintages we want and show some information about what we downloaded:

# In[ ]:

vintages = ["2020-02", "2020-03", "2020-04", "2020-05", "2020-06"]
dta = {date: load_fredmd_data(date) for date in vintages}


# In[ ]:

# Print some information about the base dataset
selected_vintage = "2020-02"

n, k = dta[selected_vintage].dta_m.shape
start = dta[selected_vintage].dta_m.index[0]
end = dta[selected_vintage].dta_m.index[-1]

print(
    f"For vintage {selected_vintage}, there are {k} series and {n} observations,"
    f" over the period {start} to {end}."
)
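# As a quick check on the vintages (assuming the downloads above succeeded), each successive monthly vintage should extend the monthly dataset by one reference month:

# In[ ]:

# Each vintage contains monthly data through the previous reference month
for vintage in vintages:
    print(f"Vintage {vintage}: monthly data end at {dta[vintage].dta_m.index[-1]}")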
# To see how the transformation and outlier removal work, here we plot three graphs of the RPI variable ("Real Personal Income") over the period 2000-01 to 2020-01:
#
# 1. The original dataset (which is in billions of chained 2012 dollars)
# 2. The transformed data (RPI has a transformation code of "5", corresponding to log first difference)
# 3. The transformed data, with outliers removed
#
# Notice that the large negative growth rate in January 2013 was deemed to be an outlier and so was replaced with a missing value (a NaN value).
#
# The [BEA release](https://www.bea.gov/news/2013/personal-income-and-outlays-january-2013) at the time noted that this was related to "the effect of special factors, which boosted employee contributions for government social insurance in January [2013] and which had boosted wages and salaries and personal dividends in December [2012]."

# In[ ]:

fig, axes = plt.subplots(3, figsize=(14, 6))

# Plot the raw data from the February 2020 vintage, for:
vintage = "2020-02"
variable = "RPI"
start = "2000-01"
end = "2020-01"

# 1. Plot the original dataset, for 2000-01 through 2020-01
dta[vintage].orig_m.loc[start:end, variable].plot(ax=axes[0])
axes[0].set(xlim=("2000", "2020"), ylabel="Billions of $")
axes[0].set_title("Original data", loc="center")

# 2. Plot the transformed data, still including outliers
# (we only stored the transformation with outliers removed, so
# here we'll manually perform the transformation)
transformed = transform(dta[vintage].orig_m[variable], dta[vintage].transform_m)
transformed.loc[start:end].plot(ax=axes[1])

mean = transformed.mean()
iqr = transformed.quantile([0.25, 0.75]).diff().iloc[1]
axes[1].hlines(
    [mean - 10 * iqr, mean + 10 * iqr],
    transformed.index[0],
    transformed.index[-1],
    linestyles="--",
    linewidth=1,
)
axes[1].set(
    title="Transformed data, with bands showing outlier cutoffs",
    xlim=("2000", "2020"),
    ylim=(mean - 15 * iqr, mean + 15 * iqr),
    ylabel="Percent",
)
axes[1].annotate(
    "Outlier",
    xy=("2013-01", transformed.loc["2013-01"]),
    xytext=("2014-01", -5.3),
    textcoords="data",
    arrowprops=dict(arrowstyle="->", connectionstyle="arc3"),
)

# 3. Plot the transformed data, with outliers removed (see missing value for 2013-01)
dta[vintage].dta_m.loc[start:end, "RPI"].plot(ax=axes[2])
axes[2].set(
    title="Transformed data, with outliers removed",
    xlim=("2000", "2020"),
    ylabel="Percent",
)
axes[2].annotate(
    "Missing value in place of outlier",
    xy=("2013-01", -1),
    xytext=("2014-01", -2),
    textcoords="data",
    arrowprops=dict(arrowstyle="->", connectionstyle="arc3"),
)

fig.suptitle("Real Personal Income (RPI)")
fig.tight_layout(rect=[0, 0.00, 1, 0.95]);


# #### Data definitions and details
#
# In addition to providing the raw data, {cite:t}`mccracken2016fred` and the FRED-MD/FRED-QD datasets provide further information in appendices about the variables in each dataset:
#
# - FRED-MD Updated Appendix
# - FRED-QD Updated Appendix
#
# In particular, we're interested in:
#
# - The human-readable description (e.g. the description of the variable "RPI" is "Real Personal Income"—*not* the retail price index!)
# - The grouping of the variable (e.g. "RPI" is part of the "Output and income" group, while "FEDFUNDS" is in the "Interest and exchange rates" group)
#
# The descriptions make it easier to understand the results, while the variable groupings can be useful in specifying the factor block structure (a sketch of this follows below). For example, we may want to have one or more "Global" factors that load on all variables, while at the same time having one or more group-specific factors that only load on variables in a particular group.
#
# The information from those appendices has been extracted into CSVs that we can load in:

# In[ ]:

# Definitions from the Appendix for FRED-MD variables
# defn_m = pd.read_csv('https://github.com/aeturrell/coding-for-economists/blob/main/data/fred/FRED-MD_updated_appendix.csv')
defn_m = pd.read_csv("data/fred/FRED-MD_updated_appendix.csv")
defn_m.index = defn_m.fred

# Definitions from the Appendix for FRED-QD variables
# defn_q = pd.read_csv('https://github.com/aeturrell/coding-for-economists/blob/main/data/fred/FRED-QD_updated_appendix.csv')
defn_q = pd.read_csv("data/fred/FRED-QD_updated_appendix.csv")
defn_q.index = defn_q.fred

# Example of the information in these files:
defn_m.head()


# In[ ]:

defn_q.head()


# In[ ]:

print(defn_m["group"].unique())
print(defn_q["Group"].unique())
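# As a taster for how the groupings might be used, below is a minimal sketch that maps each FRED-MD variable to a "Global" factor plus a factor for its own group. This dictionary format mirrors the one accepted by the `factors` argument of `DynamicFactorMQ` in `statsmodels`, but note that this is only a sketch: no model is fitted here, and the exact group labels depend on the appendix file loaded above.

# In[ ]:

# Sketch: assign each FRED-MD variable a "Global" factor plus its group factor.
# The keys are variable names and the values are lists of factor names, as in
# the `factors` argument of statsmodels' DynamicFactorMQ.
factor_blocks = {var: ["Global", group] for var, group in defn_m["group"].items()}

# Check the assignment for one variable
print(factor_blocks["RPI"])  # e.g. ['Global', 'Output and income']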