`condensier`

and `sl3`

¶For a moment, we will go back to simple data structures: we have observations which are realizations of univariate random variables, $$X_1, \ldots, X_n \sim F,$$ where $F$ denotes an unknown cumulative distribution function (CDF). The goal is to estimate the distribution $F$. In particular, we are interested in estimating the density $f = F′$ , assuming that it exists.

The histogram is the oldest and most popular density estimator. We need to specify an "origin" $x_0$ and the class width $h$ for the specifications of the intervals: $$I_j =(x_0 + j \cdot h,x_0 + (j + 1) \cdot h), (j = \ldots, −1, 0, 1, \ldots)$$

where $w(x) = \frac{1}{2}, \mid X \mid \leq 1; 0, \text{otherwise}$. This is merely a simple weight function that places a rectangular box around each interval $(x - h, x + h)$.

By replaceing $w$ with a generalized smooth kernel function, we get the definition of the *kernel density estimator*:
$$\hat{f}(x) = \frac{1}{nh} \sum_{i = 1}^{n} K \left(\frac{x - X_i}{h}\right),$$
where $$K(x) \geq 0, \int_{-\infty}^{\infty} K(x) dx = 1, K(x) = K(-x).$$

The positivity of the kernel function $K(\cdot)$ guarantees a positive density estimate $f(\cdot)$ and the normalization $K(x)dx = 1$ implies that $f(x)dx = 1$, which is necessary for $f(\cdot)$ to be a density. Typically, the kernel function $K(\cdot)$ is chosen as a probability density which is symmetric around $0$. Additionally, the smoothness of $f(\cdot)$ is inherited from the smoothness of the kernel.

In the above definition, we leave the bandwidth $h$ as a *tuning parameter*, which can be chosen so as to minimize an arbitrary distance metric that ensures the estimated function is optimal, given the available data. For large bandwidth $h$, the estimate $f(x)$ tends to be very slowly varying as a function of $x$, while small bandwidths will produce a more variable function estimate.

The bandwidth h is often also called the "smoothing parameter". It should be clear that for $h \to 0$, we will have spikes at every observation $X_i$, whereas $f(\cdot) = fh(\cdot)$ becomes smoother as $h$ is increasing. In the above, we use a global bandwidth, which we might choose optimally using cross-validation, but, we can also use variable bandwidths (locally changing bandwidths $h(x)$), with the general idea her being to use a large bandwidth for regions where the data is sparse. With respect to the *bias-variance tradeoff*: **the (absolute value of the) bias of $\hat{f}$ increases and the variance of $\hat{f}$ decreases as $h$ increases**.

`condensier`

¶First, let's load the packages we'll be using and some core project management tools.

In [1]:

```
library(usethis)
usethis::create_project(".")
library(here)
library(tidyverse)
library(data.table, quietly = TRUE)
```

In [2]:

```
library(simcausal)
library(condensier)
library(sl3)
```

We begin by simulating a simple data set and illustrating a simple execution of how to use `condensier`

to perform conditional density estimation:

In [3]:

```
D <- DAG.empty()
D <- D + node("W1", distr = "rbern", prob = 0.5) +
node("W2", distr = "rbern", prob = 0.3) +
node("W3", distr = "rbern", prob = 0.3) +
node("A", distr = "rnorm", mean = (0.98 * W1 + 0.58 * W2 + 0.33 * W3), sd = 1)
D <- set.DAG(D, n.test = 10)
```

In [4]:

```
plotDAG(D, xjitter = 0.3, yjitter = 0.04, edge_attrs = list(width = 0.5, arrow.width = 0.4, arrow.size = 0.8), vertex_attrs = list(size = 12, label.cex = 0.8))
```

Now that we've taken a look at the structure of the data-generating process (the DAG), let's generate some data and take a quick look at the data set:

In [5]:

```
data_O <- as.data.table(sim(D, n = 10000, rndseed = 57192))
head(data_O)
```

In [6]:

```
newdata <- data_O[sample(seq_len(1000)), c("W1", "W2", "W3", "A"), with = FALSE]
```

Now that we've generate some data, we will generate an `Task`

object for use with `sl3`

:

In [7]:

```
task <- sl3_Task$new(data_O, covariates = c("W1", "W2", "W3"),
outcome = "A")
task
```

We'll do this same conversion for the sub-sampled testing data as well:

In [8]:

```
new_task <- sl3_Task$new(newdata, covariates = c("W1", "W2", "W3"),
outcome = "A")
```

In [13]:

```
sl3::sl3_list_properties()
```

In [14]:

```
lrn <- Lrnr_condensier$new(nbins = 10, bin_method = "equal.len",
bin_estimator = Lrnr_xgboost$new(nrounds = 50, objective = "reg:logistic"))
```

In [15]:

```
trained_lrn = lrn$train(task)
pred_probs = trained_lrn$predict(new_task)
head(pred_probs)
```

Next, ...

In [16]:

```
lrn1 <- Lrnr_condensier$new(nbins = 25, bin_method = "equal.len",
bin_estimator = Lrnr_glm_fast$new())
lrn2 <- Lrnr_condensier$new(nbins = 35, bin_method = "equal.len",
bin_estimator = Lrnr_xgboost$new(nrounds = 50, objective = "reg:logistic"))
```

In [19]:

```
sl <- Lrnr_sl$new(learners = list(lrn1, lrn2),
metalearner = Lrnr_solnp_density$new())
sl_fit <- sl$train(task)
head(sl_fit$predict(new_task))
```

In [18]:

```
?Lrnr_solnp_density
```

Note that bin_estimator can be also a Super-Learner object from sl3. In this case the bin hazard will be estimated by stacking several candidate estimators. For example, below, we define a single density learner lrn, with the hazard estimator defined by the Super-Learner that stacks two candidates (GLM and xgboost GBM). Note that in contrast to the above example, this Super-Learner fit will be optimized for the logistic regression problem (estimating pooled bin hazards), but still using internal 10-fold cross-validation.

In [20]:

```
lrn <- Lrnr_condensier$new(nbins = 35, bin_method = "equal.len", bin_estimator =
Lrnr_sl$new(
learners = list(
Lrnr_glm_fast$new(family = "binomial"),
Lrnr_xgboost$new(nrounds = 50, objective = "reg:logistic")
),
metalearner = Lrnr_glm$new()
))
binSL_fit <- lrn$train(task)
head(binSL_fit$predict(new_task))
```

`sl3`

¶What learners to contribute: https://github.com/tlverse/sl3/issues/114 How do learners work: https://sl3.tlverse.org/articles/custom_lrnrs.html

Tips and best practices:

- For the
`sl3`

repo from https://github.com/tlverse/sl3 - Using git, create a new branch for your proposed contribution (e.g.,
`Lrnr_ranger`

) - Generate a template for your new learner via
`sl3::write_learner_template(here::here("R", "Lrnr_ranger.R"))`

- Fill out the template based on the properties of your new learner. Try looking at already written learners to get a better idea of how to fill out the various methods slots.
- Write a set of unit tests under
`tests/testthat/test_Lrnr_ranger.R`

that ensure that your new learner works as intended/expected. - For a guide on how to write unit tests properly, try looking at the numerous unit tests that already exist inside the
`sl3`

package. - For background on unit testing in R, take a look here: http://r-pkgs.had.co.nz/tests.html
- The
`tlverse`

ecosystem uses the "`Tidyverse`

Style Guide". Make sure to follow the formatting guidelines for your code and for pull requests. You can use the`styler`

R package to automatically reformat your code.