This notebook demonstrates how to use the library to perform regression without delving into technical details. The interested reader is referred to the Robust Local Polynomial Regression with Similarity Kernels draft paper. It is assumed that the library is installed in the environment that you are using. You can install the library from PyPI using pip: pip install rsklpr
Note, the plots are interactive, it is possible to pan, zoom, rotate as well as toggle viewing of elements by clicking / double-clicking the legend items.
To begin, let's define a function that generates some data to perform regression on.
from typing import Tuple
import numpy as np
def benchmark_curve_1(noise_ratio: float, hetero: bool, num_points: int) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Samples noisy observations from a smooth one dimensional curve. For heteroscedastic noise the
    per-point noise scale is drawn uniformly up to noise_ratio of the response range.

    Args:
        noise_ratio: The ratio of noise of the response range used to generate noise.
        hetero: True to generate heteroscedastic noise, False for homoscedastic noise.
        num_points: The number of points sampled from the curve.

    Returns:
        The predictor, response and ground truth.
    """
    rng: np.random.Generator = np.random.default_rng(seed=14)

    # Jittered design points on roughly [0, 1], sorted so the predictor is monotone.
    x_: np.ndarray = np.linspace(start=0.0, stop=1.0, num=num_points)
    x_ += rng.normal(scale=1 / np.sqrt(num_points), size=x_.shape[0])
    x_ = np.sort(x_)

    # Smooth ground-truth curve: a cusp-like base plus a damped oscillation.
    base: np.ndarray = np.sqrt(np.abs(np.power(x_, 3) - 4 * np.power(x_, 4) / 3))
    wiggle: np.ndarray = 0.31 * x_ * np.square(np.sin(x_ * 3 * np.pi)) / np.max(x_)
    y_true_: np.ndarray = base + wiggle

    # Noise scale: per-point uniform draw (heteroscedastic) or one shared value (homoscedastic).
    amplitude: float = noise_ratio * (y_true_.max() - y_true_.min())
    scale: np.ndarray
    if hetero:
        scale = rng.uniform(low=0.001, high=amplitude, size=x_.shape[0])
    else:
        scale = np.full(shape=x_.shape[0], fill_value=amplitude)

    y_: np.ndarray = y_true_ + rng.normal(scale=scale)
    return x_, y_, y_true_
Next, define a function to plot our data and regression results.
import pandas as pd
from typing import Optional
import plotly
import plotly.graph_objects as go
def plot_results_1d(
    x_: Optional[np.ndarray],
    y_: Optional[np.ndarray],
    y_true_: Optional[np.ndarray] = None,
    estimates_: Optional[pd.DataFrame] = None,
    title: str = "",
    estimates_mode: str = "lines+markers",
) -> None:
    """
    Plots the observations, the ground truth and one trace per estimator column in estimates_.
    It is assumed estimates_.index holds the same locations as x_ since both serve as plot x coordinates.

    Args:
        x_: The predictors X for all observations with shape [N,K] where N is the number of observations and K is the
            dimension of X.
        y_: The corresponding response for all observations.
        y_true_: The ground truth response at the locations x.
        estimates_: The estimates y_hat given x. it is assumed each column in the dataframe is the results of a
            different estimator.
        title: The plot title.
        estimates_mode: The style mode for estimates data.
    """
    figure = go.Figure()

    if x_ is not None and y_ is not None:
        figure.add_trace(go.Scatter(x=x_, y=y_, mode="markers", name="data"))

    if y_true_ is not None:
        # Fall back to the estimates index for x coordinates when x_ is not provided.
        truth_locations = x_ if x_ is not None else estimates_.index
        figure.add_trace(go.Scatter(x=truth_locations, y=y_true_, mode="markers", name="y_true"))

    if estimates_ is not None:
        # One trace per estimator column, all sharing the dataframe index as x coordinates.
        for column in estimates_.columns:
            figure.add_trace(go.Scatter(x=estimates_.index, y=estimates_[column], mode=estimates_mode, name=column))

    figure.update_layout(go.Layout(title=title, autosize=True, height=500))
    plotly.offline.iplot(figure)
We can now use the function to generate some data as well as the ground truth and plot it.
# Generate a 100 point heteroscedastic noisy sample from the benchmark curve
# and visualize it together with the noise-free ground truth.
x: np.ndarray
y: np.ndarray
y_true: np.ndarray
x, y, y_true = benchmark_curve_1(noise_ratio=0.15, hetero=True, num_points=100)
plot_results_1d(x_=x, y_=y, y_true_=y_true, title="Generated noisy data and ground truth")
To perform regression we need to instantiate a Rsklpr object, fit the data and predict the outputs. The main parameter that requires attention and has the largest impact is the size_neighborhood. Having a larger neighborhood thus including more datapoints results in a smoother estimate with lower variance but increases the bias. Let's regress the data with a few neighborhood sizes to see the impact.
from rsklpr.rsklpr import Rsklpr

# Regress with a small neighborhood; calling the estimator directly is
# shorthand for fit followed by predict at all dataset locations.
rsklpr: Rsklpr = Rsklpr(size_neighborhood=30)
y_hat: np.ndarray = rsklpr(x=x, y=y)
plot_results_1d(
    x_=x,
    y_=y,
    y_true_=y_true,
    # The dataframe index doubles as the x coordinates in the plot.
    estimates_=pd.DataFrame(data=y_hat, columns=["y_hat"], index=x),
    title="Regression with size_neighborhood=30",
)
Note that calling the model directly, i.e. `rsklpr(x=x, y=y)`, is shorthand for calling fit and then predicting at all locations given in the dataset. We can also make the curve smoother by increasing the number of datapoints in the neighborhood.
# Repeat the regression with a larger neighborhood (60 of the 100 points):
# more neighbors smooth the estimate but increase bias.
rsklpr = Rsklpr(size_neighborhood=60)
y_hat = rsklpr(x=x, y=y)
plot_results_1d(
    x_=x,
    y_=y,
    y_true_=y_true,
    estimates_=pd.DataFrame(data=y_hat, columns=["y_hat"], index=x),
    title="Regression with size_neighborhood=60",
)
# The extreme case: the neighborhood spans the entire dataset, yielding the
# smoothest (and most biased) estimate.
rsklpr = Rsklpr(size_neighborhood=100)
y_hat = rsklpr(x=x, y=y)
plot_results_1d(
    x_=x,
    y_=y,
    y_true_=y_true,
    estimates_=pd.DataFrame(data=y_hat, columns=["y_hat"], index=x),
    title="Regression with size_neighborhood=100",
)
There are also metrics provided to help with evaluating the quality of fit such as the residuals and local R-Squared statistics. There are additional metrics available which are described in more detail in the documentation.
# Request fit-quality metrics alongside the estimate; the residuals are then
# exposed as an attribute on the fitted model.
rsklpr = Rsklpr(size_neighborhood=35)
y_hat = rsklpr(x=x, y=y, metrics=["residuals", "r_squared"])
plot_results_1d(
    x_=x,
    y_=y,
    # Stack the estimate and its residuals as two side-by-side columns.
    estimates_=pd.DataFrame(
        data=np.concatenate([y_hat.reshape((-1, 1)), rsklpr.residuals.reshape((-1, 1))], axis=1),
        columns=["y_hat", "residuals"],
        index=x,
    ),
    title="Regression residuals with size_neighborhood=35",
    estimates_mode="markers",
)
# Plot the local R-squared statistic computed during the previous fit next to
# the estimate itself (both as markers only).
plot_results_1d(
    x_=x,
    y_=y,
    y_true_=y_true,
    estimates_=pd.DataFrame(
        data=np.concatenate([y_hat.reshape((-1, 1)), rsklpr.r_squared.reshape((-1, 1))], axis=1),
        columns=["y_hat", "r_squared"],
        index=x,
    ),
    title="Regression local r_squared with size_neighborhood=35",
    estimates_mode="markers",
)
It is also possible to predict values in locations not present in the original dataset.
# Fit once, then predict at new, linearly spaced locations that are not part
# of the training data.
rsklpr = Rsklpr(size_neighborhood=35)
rsklpr.fit(x=x, y=y)
x_linspace: np.ndarray = np.linspace(start=x.min(), stop=x.max())
y_hat = rsklpr.predict(x=x_linspace)
plot_results_1d(
    x_=x,
    y_=y,
    y_true_=y_true,
    # Index the estimates by the prediction locations rather than the training x.
    estimates_=pd.DataFrame(data=y_hat, columns=["y_hat"], index=x_linspace),
    title="Regression in linearly spaced locations",
)
We can also use the library to calculate confidence intervals for the prediction. In this case we calculate the 95% confidence intervals at the linearly spaced locations.
import numpy as np  # NOTE(review): numpy is already imported above; this re-import is redundant but harmless

# Compute bootstrap 95% confidence intervals for the prediction at the
# linearly spaced locations.
rsklpr = Rsklpr(size_neighborhood=35)
rsklpr.fit(x=x, y=y)
y_hat = rsklpr.predict(x=x_linspace)
# predict_bootstrap returns the bootstrap mean estimate and the lower/upper
# confidence bounds (the trailing value is unused here).
y_mean_hat: np.ndarray
conf_low: np.ndarray
conf_high: np.ndarray
y_mean_hat, conf_low, conf_high, _ = rsklpr.predict_bootstrap(x=x_linspace)
# Collect all series into one dataframe indexed by the prediction locations.
estimates: pd.DataFrame = pd.DataFrame(
    data=np.concatenate([y_hat.reshape((-1, 1)), y_mean_hat.reshape((-1, 1))], axis=1),
    columns=["y_hat", "y_mean_hat"],
    index=x_linspace,
)
estimates["conf_0.025"] = conf_low
estimates["conf_0.975"] = conf_high
plot_results_1d(
    x_=x,
    y_=y,
    y_true_=y_true,
    estimates_=estimates,
    title="Regression in linearly spaced locations with 95% confidence intervals",
)
Now let's regress a multivariate predictor. In this case to be able to visualize we regress a 2D predictor but it is possible to input data of any dimension. As before start by generating some data and a function to visualize the results.
from typing import List
def benchmark_plane_1(noise_ratio: float, hetero: bool, num_points: int) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """
    Samples noisy observations from a smooth two dimensional surface. For heteroscedastic noise
    the per-point noise scale is drawn uniformly up to noise_ratio of the response range.

    Args:
        noise_ratio: The ratio of noise of the response range used to generate noise.
        hetero: True to generate heteroscedastic noise, False for homoscedastic noise.
        num_points: The number of points sampled from the function.

    Returns:
        The predictor, response and ground truth.
    """
    rng: np.random.Generator = np.random.default_rng(seed=14)

    # Two independent uniform coordinates stacked into a [num_points, 2] predictor.
    x_: np.ndarray = np.concatenate(
        [
            rng.uniform(low=0.0, high=0.7, size=(num_points, 1)),
            rng.uniform(low=-0.5, high=0.8, size=(num_points, 1)),
        ],
        axis=1,
    )

    # Smooth ground-truth surface combining a cosine ridge and a sine of the squared coordinate sum.
    y_true_: np.ndarray = np.square(x_[:, 0]) * np.cos(1.5 * np.pi * x_[:, 1]) + np.sin(np.square(x_.sum(axis=1)))

    # Noise scale: per-point uniform draw (heteroscedastic) or one shared value (homoscedastic).
    amplitude: float = noise_ratio * (y_true_.max() - y_true_.min())
    scale: np.ndarray
    if hetero:
        scale = rng.uniform(low=0.001, high=amplitude, size=num_points)
    else:
        scale = np.full(shape=num_points, fill_value=amplitude)

    y_: np.ndarray = y_true_ + rng.normal(scale=scale)
    return x_, y_, y_true_
def plot_results_2d(
    x_: np.ndarray,
    y_: np.ndarray,
    y_true_: Optional[np.ndarray],
    estimates_: Optional[pd.DataFrame] = None,
    title: str = "",
) -> None:
    """
    Plots a 3D scatter of the observations, the ground truth and one trace per estimator column in
    estimates_, plus vertical guide lines spanning all plotted values at each location.
    It is assumed all plots share the same x coordinates.

    Args:
        x_: The predictors X for all observations with shape [N,K] where N is the number of observations and K is the
            dimension of X.
        y_: The corresponding response for all observations.
        y_true_: The ground truth response at the locations x.
        estimates_: The estimates y_hat given x. it is assumed each column in the dataframe is the results of a
            different estimator.
        title: The plot title.
    """
    fig = go.Figure()
    fig.add_trace(go.Scatter3d(x=x_[:, 0], y=x_[:, 1], z=y_, mode="markers", name="data"))

    if y_true_ is not None:
        fig.add_trace(go.Scatter3d(x=x_[:, 0], y=x_[:, 1], z=y_true_, mode="markers", name="y_true"))

    if estimates_ is not None:
        for c in estimates_.columns:
            fig.add_trace(go.Scatter3d(x=x_[:, 0], y=x_[:, 1], z=estimates_[c], mode="markers", name=c))

        # Compute the per-location extents over estimates, data and (if given) ground truth.
        # Work on a copy: the previous implementation added and dropped columns on the
        # caller's dataframe in place and raised a KeyError on drop when y_true_ was None.
        extents: pd.DataFrame = estimates_.copy()
        extents["y"] = y_
        if y_true_ is not None:
            extents["y_true"] = y_true_
        min_z: np.ndarray = extents.values.min(axis=1)
        max_z: np.ndarray = extents.values.max(axis=1)

        # Build one polyline trace of disconnected vertical segments; a None triplet
        # breaks the line between consecutive locations.
        x_lines: List[Optional[float]] = []
        y_lines: List[Optional[float]] = []
        z_lines: List[Optional[float]] = []
        for i in range(x_.shape[0]):
            x_lines.extend([float(x_[i, 0]), float(x_[i, 0]), None])
            y_lines.extend([float(x_[i, 1]), float(x_[i, 1]), None])
            z_lines.extend([float(min_z[i]), float(max_z[i]), None])
        fig.add_trace(go.Scatter3d(x=x_lines, y=y_lines, z=z_lines, mode="lines", name="lines"))

    fig.update_traces(marker=dict(size=3))
    fig.update_layout(go.Layout(title=title, autosize=True, height=800))
    plotly.offline.iplot(fig)
# Generate a 200 point heteroscedastic noisy sample from the 2D benchmark
# surface and visualize it alongside the ground truth.
x, y, y_true = benchmark_plane_1(noise_ratio=0.15, hetero=True, num_points=200)
plot_results_2d(x_=x, y_=y, y_true_=y_true, title="Generated noisy data and ground truth")
Next regress the plane at the locations in the dataset.
# Regress the 2D predictor at the dataset locations; half of the 200 points
# fall inside each neighborhood.
rsklpr: Rsklpr = Rsklpr(size_neighborhood=100)
y_hat: np.ndarray = rsklpr(x=x, y=y)
plot_results_2d(
    x_=x,
    y_=y,
    y_true_=y_true,
    estimates_=pd.DataFrame(data=y_hat, columns=["y_hat"]),
    title="Regression with size_neighborhood=100",
)