Notebook

Modeling the COVID-19 Pandemic¶

Reading¶

Bregman D.J., A.D. Langmuir, 1990. Farr's law applied to AIDS projections. JAMA, 263:1522–1525. doi: 10.1001/jama.263.11.1522.
King, A.A., M.D. de Cellès, F.M.G. Magpantay, and P. Rohani, 2015. Avoidable errors in the modelling of outbreaks of emerging pathogens, with special reference to Ebola, Proc. R. Soc. B. 28220150347 http://doi.org/10.1098/rspb.2015.0347.
Murray, C.J.L., and IHME COVID-19 health service utilization forecasting team, 2020. Forecasting COVID-19 impact on hospital bed-days, ICU-days, ventilator-days and deaths by US state in the next 4 months,

https://www.medrxiv.org/content/10.1101/2020.03.27.20043752v1.full-text

Jewell, N.P., J.A. Lewnard, and B.L. Jewell, 2020.

Predictive Mathematical Models of the COVID-19 Pandemic: Underlying Principles and Value of Projections, JAMA, 323(19):1893-1894. doi:10.1001/jama.2020.6585

Froese, H., 2020. Infectious Disease Modelling: Fit Your Model to Coronavirus Data, Towards Data Science, https://towardsdatascience.com/infectious-disease-modelling-fit-your-model-to-coronavirus-data-2568e672dbc7; https://github.com/henrifroese/infectious_disease_modelling
Smith, D., and L. Moore, 2004. The SIR model for spread of disease, Convergence, https://www.maa.org/press/periodicals/loci/joma/the-sir-model-for-spread-of-disease-introduction

Models of epidemics¶

This notebook explores two basic approaches to modeling the spread of a communicable disease through a population. There are more complicated approaches than are considered here, but these are common, and the predictions of COVID-19 infections, ICU beds, and deaths are often based on variations of these models.

SIR models¶

Mathematical models of epidemics date back to the early 1900s.

One of the simplest models of epidemics is the SIR model, which partitions the population into three disjoint groups: Susceptible, Infected, and Recovered (or Removed).

The SIR model treats the population as fixed: no births, immigration, emigration, or deaths from other causes. It ignores the possibility of carriers, population heterogeneity, or geographic heterogeneity. It treats everyone as equivalent and assumes the population "mixes" perfectly.

Here are the state variables of the SIR model:

$N$: initial population size
$S(t)$: number of susceptible individuals in the population at time $t$
$I(t)$: number of infected individuals in the population at time $t$
$R(t)$: number "removed" from risk (dead or recovered) in the population at time $t$

Assumptions:

Everyone is either susceptible, infected, or removed (recovered or dead). (There are more complex models with more "compartments.")
There are no births, immigration, emigration, or deaths from other causes.
If a susceptible person is infected, that person eventually dies or recovers.
While a person is infected, the person is infectious.
There is no incubation period between being infected and being infectious.
If an infected person gets "close enough" to a susceptible person, the susceptible person becomes infected.
Every unit of time, every infected person exposes $\beta$ people.
Every susceptible person exposed to an infected person becomes infected.
Every infected person exposes a disjoint group of people.
Every unit of time, a fraction $\gamma$ of infected people recover or die.
People who recover or die are not infectious.
People who recover or die never become susceptible again.

Then $S + I + R = N$ is constant, the initial population size.

The basic reproductive number is $R_0 = \beta/\gamma$. Initially, when essentially the entire population is susceptible (the proportion who are infected or recovered is small compared to $N$), this is the number of people each infected person infects before recovering or dying. (Note that $R_0$ has no connection to $R$: the notation is unfortunate.)

Define $s(t) \equiv S(t)/N$, $i(t) \equiv I(t)/N$, and $r(t) \equiv R(t)/N$. Then $s+i+r = 1$ for all $t$. There is an epidemic if $s(0)\beta/\gamma > 1$: the rate at which susceptible people are getting infected is greater than the rate at which infected people are recoving.

The parameters of the model are:

$\beta$, the number of exposed people per unit of time per infected persion
$\gamma$, the fraction of infected people who are removed (recover or die) per unit time

The state of the population evolves according to three coupled differential equations:

$dS/dt = -\beta I S/N$
$dI/dt = \beta I S/N - \gamma I$ (equivalently, $di/dt = \beta is - \gamma i$)
$dR/dt = \gamma I$

Another interesting variable is the cumulative number of infections, $C$.

$dC/dt = \beta I S/N$; $C(0) = I(0)$.

The SIR model has some built-in features that might not match reality. First, note that the derivative of $R$ is always non-negative and proportional to $I$. Since $R \le N$ and $N$ is finite, $I$ must eventually be zero; that is, the SIR implies that eventually the entire population will be disease-free. Once $I=0$, it can never grow again, since the derivative of $I$ is proportional to $I$. (If nobody is infected, nobody can infect anyone.) Epidemics that follow the SIR model are necessarily self-limiting.

That isn't the case if immunity wears off over time (or if the population is not isolated). In that case, we might we parametrize the transition from recovered to susceptible by assuming that a fraction $\delta$ of $R$ return to $S$ in each time period. (There are other ways we could do this, for instance, having a variable "lag" between recovery and becoming susceptible again.) That yields:

$dS/dt = -\beta I S/N + \delta R$
$dI/dt = \beta I S/N - \gamma I$ (equivalently, $di/dt = \beta is - \gamma i$)
$dR/dt = \gamma I - \delta R$

In this model, $I$ does not necessarily go to zero eventually.

Other generalizations include separate "compartments" for dead versus recovered), needing an ICU bed, etc., as well as time-varying values of $\beta$ to account for lockdowns or other policy interventions, heterogeneity in the population, and more.

Let's run this model (in discrete time) for a population of $N=1,000,000$ people of whom 0.05% are initially infected. We will take the time interval to be 1 day, the period of infectiousness to be $1/\gamma = 14$ days, and $R_0 = 2 = \beta/\gamma$, so $\beta = 2 \gamma = 1/7$. We will start with $\delta=0$ (no re-infections).

In [1]:

# boilerplate
import numpy as np
import scipy as sp
from scipy.integrate import odeint
import matplotlib.pyplot as plt
import matplotlib.dates as mdates

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
from IPython.display import clear_output, display, HTML

In [2]:

def sir_model(N, i_0, beta=1, gamma=1, delta=0, steps=365):
    '''
    Run SIR model as coupled finite-difference equations.
    
    Assumes that initially a fraction i_0 of the population of N people is infected and 
    everyone else is susceptible (none has died or recovered).
    
    Parameters
    -----------
    N     : int
        population size
    i_0   : double in (0, 1)
        initial fraction of population infected
    beta  : double in [0, infty)
        number of encounters each infected person has per time step
    gamma : double in (0, 1)
        fraction of infecteds who recover or die per time step
    delta : double in (0, 1)
        fraction of recovereds who become susceptible per time step
    steps : int
        number of steps in time to run the model
    
    Returns
    --------
    S, I, R, C 
    S     : list
        time history of susceptibles
    I     : list
        time history of infecteds
    R     : list
        time history of recovered/dead    
    C     : list
        cumulative number of infections over time
    '''
    assert i_0 > 0, 'initial rate of infection is zero'
    assert i_0 <= 1, 'infection rate greater than 1'
    assert beta >= 0, 'beta must be nonnegative'
    assert gamma > 0, 'gamma must be positive'
    assert gamma < 1, 'gamma must be less than 1'
    assert delta >= 0, 'delta must be nonnegative'
    assert delta < 1, 'delta must be less than 1'
    S = np.zeros(steps)
    I = np.zeros(steps)
    R = np.zeros(steps)
    C = np.zeros(steps)
    I[0] = int(N*i_0)
    C[0] = I[0]
    S[0] = N-I[0]
    for i in range(steps-1):
        new_i = beta*I[i]*S[i]/N
        S[i+1] = max(0, S[i] - new_i + delta*R[i])
        I[i+1] = max(0, I[i] + new_i - gamma*I[i])
        R[i+1] = max(0, R[i] + gamma*I[i] - delta*R[i])
        C[i+1] = C[i] + new_i
    return S, I, R, C

In [3]:

def plot_sir(N, i_0, beta=2/14, gamma=1/14, delta=0, steps=365, verbose=False):
    '''
    Plot the time history of an SIR model.
    
    Parameters
    -----------
    N     : int
        population size
    i_0   : double in (0, 1)
        infection rate at time 0
    beta  :  double in [0, infty)
        number of encounters each infected person has per time step
    gamma : double in (0, 1)
        fraction of infecteds who recover or die per time step
    delta : double in (0, 1)
        fraction of recovereds who become susceptible per time step
    steps : int
        number of time steps to run the model
    verbose : Boolean
        if True, return the model predictions
    
    Returns (if verbose == True)
    --------
    S : list
        susceptibles as a function of time
    I : list
        infecteds as a function of time
    R : list
        recovereds as a function of time
    C : list
        cumulative incidence as a function of time
    '''
    S, I, R, C = sir_model(N, i_0, beta=beta, gamma=gamma, delta=delta, steps=steps)
    times = list(range(steps))
    fig, ax = plt.subplots(nrows=1, ncols=1)
    ax.plot(times, I, linestyle='--', color='r', label='Infected')
    ax.plot(times, S, linestyle='-', color='b', label='Susceptible')
    ax.plot(times, R, linestyle=':', color='g', label='Recovered')
    ax.plot(times, C, linestyle='-.', color='k', label='Tot infect.')
    ax.legend(loc='best')
    plt.show()
    if verbose:
        return S, I, R, C

In [4]:

i_0 = 0.0005
N = int(10**6)
steps = 365

plot_sir(N, i_0)

For these parameters, the number of infections is shaped like a bell curve and the cumulative number of infections is shaped like a CDF: it is "sigmoidal" (vaguely "S"-shaped). You might imagine fitting a scaled CDF to the curve. Essentially, the IHME model fits a scaled Gaussian CDF to the total number of infections. (More on this below.)

What if immunity wears off over time? Let's try a positive value of $\delta$:

In [5]:

plot_sir(N, i_0, delta=0.01)

In [6]:

interact(plot_sir, N=fixed(N),
                   i_0 = widgets.FloatSlider(min=0, max=1, value=i_0),
                   beta=widgets.FloatSlider(value=2/14, min=0.01, max=100),
                   gamma=widgets.FloatSlider(value=1/14, min=0, max=1, step=0.01),
                   delta=widgets.FloatSlider(value=0, min=0, max=1, step=0.01),
                   steps=fixed(steps),
                   verbose=fixed(False))

interactive(children=(FloatSlider(value=0.0005, description='i_0', max=1.0), FloatSlider(value=0.1428571428571…

Out[6]:

<function __main__.plot_sir(N, i_0, beta=0.14285714285714285, gamma=0.07142857142857142, delta=0, steps=365, verbose=False)>

Fitting the SIR model to data¶

Now we will use least squares to fit the SIR model to a time history of infection data collected by researchers at Johns Hopkins University. See https://github.com/CSSEGISandData

This exercise is for illustration. There are many reasons to be wary of this modeling:

There are serious issues with the data quality: these are cases that were reported according to the rules and circumstances of where they were detected. For mortality (rather than incidence), excess mortality might give a more accurate measure.
The model is a cartoon, not "physics" of epidemics, and it omits many factors that plausibly matter. It is not clear what estimates of the parameters mean when the model is wrong.
Absent a trustworthy generative model for the data, it is not clear how to assess or interpret the uncertainty of the estimates.
If the estimated model is to be used for prediction, there is no obvious way to assign meaningful uncertainties to the predictions.
The optimization problem to fit the parameters has no statistical content or motivation. It is not clear that it yields an estimate that is "good" in a useful sense.
The objective function is not convex in the model parameters, and the optimization algorithm is not guaranteed to solve the optimization problem.

The data record total "confirmed" cases as a function of time, deaths, and recoveries: $C(t)$

In [7]:

from scipy.optimize import curve_fit  # nonlinear least squares

def f(x, beta, gamma):
    '''
    Model cumulative infections
    
    This is just a wrapper for a call to sir_model, to generate a set of
    predictions of the cumulative incidence for curve fitting
    '''    
    return sir_model(N, i_0, beta, gamma, delta=0, steps=len(x))[3]  # C

In [8]:

# test f
N = int(10**6)
i_0 = 0.0005
x = range(25)
y = f(x, 1/7, 1/14 )
y

Out[8]:

array([ 500.        ,  571.39285714,  647.87464122,  729.80689646,
        817.5766761 ,  911.59831599, 1012.31532767, 1120.20241833,
       1235.76764552, 1359.55471484, 1492.14542917, 1634.1622985 ,
       1786.27131969, 1949.18493587, 2123.66518564, 2310.5270524 ,
       2510.64202461, 2724.94187786, 2954.42268994, 3200.14910031,
       3463.25882519, 3744.96743963, 4046.57343789, 4369.46358264,
       4715.11855381])

In [9]:

# test curve_fit
popt, pvoc = curve_fit(f, x, y, p0=[1, 0.5], bounds = (0, [np.inf, 1]))
popt

Out[9]:

array([0.14285714, 0.07142857])

In [10]:

# try some real data from the JHU site
import pandas as pd
# data for countries. THIS IS A "LIVE" DATASET THAT IS UPDATED FREQUENTLY
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
df = pd.read_csv(url, sep=",")

In [11]:

df.head()

Out[11]:

	Province/State	Country/Region	Lat	Long	...	3/4/22	3/5/22	3/6/22	3/7/22	3/8/22	3/9/22	3/10/22	3/11/22	3/12/22	3/13/22
0	NaN	Afghanistan	33.93911	67.709953	...	174214	174331	174582	175000	175353	175525	175893	175974	176039	176201
1	NaN	Albania	41.15330	20.168300	...	272030	272030	272210	272250	272337	272412	272479	272552	272621	272663
2	NaN	Algeria	28.03390	1.659600	...	265186	265227	265265	265297	265323	265346	265366	265391	265410	265432
3	NaN	Andorra	42.50630	1.521800	...	38434	38434	38434	38620	38710	38794	38794	38794	38794	38794
4	NaN	Angola	-11.20270	17.873900	...	98796	98796	98806	98806	98829	98855	98855	98855	98909	98927

5 rows × 786 columns

In [12]:

# data for Denmark
DK = df.loc[(df["Country/Region"] == "Denmark") & df["Province/State"].isna()]  # Denmark
DK = DK.drop(['Province/State','Country/Region','Lat','Long'], axis=1).T        # remove fields we don't need
y = DK.to_numpy()   # turn series into a vector
y = y[np.nonzero(y)] # remove the data before the first detected case
x = range(len(y))    # for model-fitting and plotting
print(len(y),y)

746 [      1       1       3       4       4       6      10      10      23
      23      35      90     262     442     615     801     827     864
     914     977    1057    1151    1255    1326    1395    1450    1591
    1724    1877    2046    2201    2395    2577    2860    3107    3386
    3757    4077    4369    4681    5071    5402    5635    5819    5996
    6174    6318    6511    6681    6879    7073    7242    7384    7515
    7695    7912    8073    8210    8445    8575    8698    8851    9008
    9158    9311    9407    9523    9670    9821    9938   10083   10218
   10319   10429   10513   10591   10667   10713   10791   10858   10927
   10968   11044   11117   11182   11230   11289   11360   11387   11428
   11480   11512   11593   11633   11669   11699   11734   11771   11811
   11875   11924   11948   11962   12001   12016   12035   12099   12139
   12193   12217   12250   12294   12344   12391   12391   12391   12527
   12561   12615   12636   12675   12675   12675   12751   12768   12794
   12815   12832   12832   12832   12878   12888   12900   12916   12946
   12946   12946   13037   13061   13092   13124   13173   13173   13173
   13262   13302   13350   13390   13438   13438   13438   13547   13577
   13634   13725   13789   13789   13789   13996   14073   14185   14306
   14442   14442   14442   14815   14959   15070   15214   15379   15483
   15617   15740   15855   15940   16056   16127   16239   16317   16397
   16480   16537   16627   16700   16779   16891   16985   17084   17195
   17374   17547   17736   17883   18113   18356   18607   18924   19216
   19557   19890   20237   20571   20571   21393   21847   22436   22905
   23323   23799   24357   24916   25594   26213   26637   27072   27464
   27998   28396   28932   29302   29680   30057   30379   30710   31156
   31638   32082   32422   32811   33101   33593   34023   34441   34941
   35392   35844   36373   37003   37763   38622   39411   40356   41412
   42157   43174   44034   45225   46351   47299   48241   49594   50530
   51753   53180   54230   55121   55892   56958   57952   58963   60000
   61078   62136   63331   64551   65808   67105   68362   69635   70485
   71654   73021   74204   75395   76718   78354   79352   80481   81949
   83535   85140   86743   88858   90603   92649   94799   97357  100489
  103564  107116  109758  113095  116087  119779  123813  128321  131606
  134434  137632  140175  143472  146341  149333  151167  153347  155826
  158447  161230  163479  165930  167541  168711  170787  172779  174995
  176837  178497  180240  181486  182725  183801  185159  185159  187320
  188199  189088  189895  190619  191505  192265  193038  193917  194671
  195296  195948  196540  197208  197664  198095  198472  198960  199357
  199782  200335  200773  201186  201621  202051  202417  202887  203365
  203793  204067  204362  204799  205183  205597  206065  206617  207081
  207577  208027  208556  209079  209682  210212  210732  211195  211692
  212224  212798  213318  213932  214326  214839  215264  215791  217798
  218660  219305  219918  220459  221071  221842  222629  223415  224258
  224848  225505  226777  227031  225030  225844  226633  227049  228013
  228692  229902  230603  231265  231973  232718  233318  233797  234317
  234931  235648  236346  237101  237792  238306  238869  239532  240330
  241007  241731  242633  243374  244065  244868  245761  246463  247010
  247622  248326  248950  249785  250554  250554  252045  252912  253673
  254482  254482  256482  257505  258182  259056  259988  260913  262159
  263514  264465  265539  266503  267339  268255  269343  270557  271908
  272659  273494  274413  275207  276280  277399  278396  279434  280383
  281227  282135  283089  284117  285044  285636  286489  286948  287325
  288229  288704  289122  289559  289874  290111  290333  290686  291017
  291220  291463  291652  291801  291956  292179  292352  292574  292769
  292943  293094  293337  293677  294152  294478  294925  295317  295654
  296196  296885  297543  298094  298614  299223  300071  301126  302328
  303469  304429  305459  306100  306944  307764  308615  309420  310127
  310876  311520  312292  312851  314135  314983  316068  316807  317700
  318485  319295  320222  321154  322019  322965  323786  324721  325725
  326887  327972  329010  329927  330777  331736  332622  333815  334799
  335722  336746  337466  338240  339580  340567  341549  342474  343351
  344088  344850  345693  346518  347212  347909  348347  348979  349440
  349891  350405  350996  351553  351939  352373  352636  353061  353431
  353744  354068  354393  354645  354913  355257  355603  355944  356326
  356684  357037  357370  357827  358369  358796  359237  359663  360031
  360411  360888  361461  362068  362738  363356  363900  364464  365051
  365840  366607  367307  368060  368651  369403  370159  371286  372533
  373803  375065  376414  377825  379078  380949  382796  384580  386251
  387783  389240  391221  393199  395797  398048  400145  402561  404855
  407417  410434  412566  417151  420099  423322  426992  430891  434798
  438811  442881  446676  450091  453802  458001  462427  466817  470893
  474637  478927  483253  487401  492521  497201  501760  506085  509111
  516257  522581  529210  536195  542794  548400  554389  562188  570502
  579275  589274  600468  609062  617274  627356  647934  661320  673807
  685036  685036  685036  726071  739071  762299  783702  802397  823282
  831236  840037  865110  893393  919388  937649  950237  969485  983899
 1006295 1030638 1056389 1080003 1105037 1131206 1159986 1193479 1232238
 1272864 1319695 1355815 1397833 1438181 1484771 1531518 1582551 1636206
 1677289 1713485 1742569 1787935 1842936 1887161 1927340 1966530 2003042
 2037891 2087689 2142809 2196556 2244726 2289076 2327399 2356873 2399851
 2442799 2483399 2519057 2552361 2578051 2606934 2637414 2666454 2691663
 2714447 2731806 2748260 2764838 2783399 2803857 2803857 2837501 2853236
 2864063 2876288 2891763 2908721 2923030 2936856 2943947 2950605]

In [13]:

plt.plot(x,y, linestyle='-', color='r')
plt.title('"Confirmed" COVID-19 Cases in Denmark')
plt.xlabel('Days since first detected case')

Out[13]:

Text(0.5, 0, 'Days since first detected case')

In [14]:

N = 5792202 # estimated population of DK in 2020
i_0 = y[0]/N # initial prevalence
x = np.array(range(len(y)))

In [15]:

# fit the model by nonlinear least squares
popt, pvoc = curve_fit(f, x, y, p0=[1, 0.5], bounds = (0, [500, 1]), maxfev=10000)
popt

Out[15]:

array([2.11506788e-02, 4.12605498e-16])

In [16]:

S, I, R, C = sir_model(N, i_0, beta=popt[0], gamma=popt[1], delta=0, steps=len(y))
times = list(range(len(y)))
fig, ax = plt.subplots(nrows=1, ncols=1)
ax.plot(times, y, linestyle='-', color='r', label='Data')
ax.plot(times, C, linestyle='-.', color='k', label='SIR Model')
ax.legend(loc='best')
plt.xlabel('Days since first detected case')
plt.title('"Confirmed" COVID-19 Infections in Denmark')
plt.show()

What happens if we run the model into the future?

In [17]:

S, I, R, C = plot_sir(N, i_0, beta=popt[0], gamma=popt[1], delta=0, steps=1000, verbose=True)

The total number of infections stabilizes, and the pandemic abates. The effective reproductive number $R_0$ is less than 1.

Let's do a quick sanity check. Recall that $R_0 = \beta/\gamma$ for the SIR model initially, when the number infected, recovered, or dead are negligible compared to the total population. Once some of the population has recovered, some of the people an infectious person encounters wiil be people who have recovered and not susceptible to reinfection (according to the model). To first order, $\beta$ is effectively reduced by $S/N$.

In [18]:

# R_0 at the time the first infection is detected
popt[0]/popt[1] # greater than 1. According to the model, the infection will spread if R_0 > 1.

Out[18]:

51261262636618.664

In [19]:

# After 1000 timesteps, R_0 is different 
(S[-1]/N)*popt[0]/popt[1] # less than 1: According to the model, the infection will abate.

Out[19]:

219047240451.27252

Fitting curves to the cumulative incidence function¶

The SIR model and its variants attempt to capture the dynamics of transmission and recovery, For some parameter choices, that leads to a sigmoidal shape for the cumulative incidence of infections.

Some pandemic predictions simply fit a sigmoidal function to the cumulative number of infections, an approach that dates back to William Farr in the 1800s. King et al. (2015) showed that this approach does not work well for AIDS or Ebola. Nonetheless, this is the approach the IHME takes: they fit a scaled and shifted Gaussian cdf to cumulative incidence.

Sigmoidal Models¶

Common sigmoidal models include CDFs of unimodal distributions (such as the Gaussian cdf $\Phi(x)$) and the sigmoid function $1/(e^{-x} + 1)$, the inverse of the logistic function.

To allow this function to be shifted and scaled, we can introduce the family of curves

\begin{equation} \sigma(x; a, b, c) \equiv \frac{c}{e^{-b(x-a)}+1}. \end{equation}

The parameter $a$ controls where the function crosses $c/2$; $b$ controls how rapidly it increases, and $c$ is its asymptotic limit.

Let's fit this to the Danish data.

In [20]:

def g(x, a, b, c):
    '''
    Wrapper for the sigmoid to pass to the curve-fitting function
    '''    
    return c/(np.exp(-b*(x-a))+1)

In [21]:

# test g visually.
z = np.array(range(-8,15,1))
plt.plot(z, g(z,3,1/2,5))

Out[21]:

[<matplotlib.lines.Line2D at 0x7f98b0dde0a0>]

In [22]:

# fit the DK data
popts, pvocs = curve_fit(g, x, y, p0=[len(x)/2, 0.02, np.max(y)/2], bounds = ([0, 0, 0], [500, 500, N]), maxfev=10000)
popts

Out[22]:

array([5.00000000e+02, 1.11921669e-02, 1.16292975e+06])

In [23]:

# plot
Cs = g(x, popts[0], popts[1], popts[2])
times = list(range(len(y)))
fig, ax = plt.subplots(nrows=1, ncols=1)
ax.plot(times, y, linestyle='-', color='r', label='Data')
ax.plot(times, C[:len(times)], linestyle='-.', color='k', label='SIR Model')
ax.plot(times, Cs, linestyle=':', color='b', label='Sigmoid Model')
ax.legend(loc='best')
plt.xlabel('Days since first detected case')
plt.title('"Confirmed" COVID-19 Infections in Denmark')
plt.show()

Both models agree reasonably well with past data. But they have rather different predictions about the future.

Let's look at how the predictions vary over time, as the dataset grows.

In [24]:

def plot_predictions(y, n, points):
    '''
    plot SIR and sigmoid models fitted to the first n data
    
    Parameters
    ----------
    y : list of floats
        number of infections at each time point
    n : int
        fit the models to the first n elements of y
    points : int
        number of time points for which to plot the prediction
        
    Returns
    -------
    no return value
    '''
    # fit the two models
    x = list(range(len(y)))
    popt, pvoc = curve_fit(f, x[:n], y[:n], p0=[1, 0.5], bounds = (0, [500, 1]),
                           maxfev=10000)
    popts, pvocs = curve_fit(g, x[:n], y[:n], p0=[len(x)/2, 0.02, np.max(y)/2],
                             bounds = ([0, 0, 0], [500, 500, N]), maxfev=10000)
    times = list(range(points))
    S, I, R, C = sir_model(N, i_0, beta=popt[0], gamma=popt[1], delta=0, steps=points)
    Cs = g(times, popts[0], popts[1], popts[2])
    fig, ax = plt.subplots(nrows=1, ncols=1)
    ax.plot(x, y, linestyle='-', color='r', label='Data')
    ax.plot(times, C, linestyle='-.', color='k', label='SIR Model')
    ax.plot(times, Cs, linestyle=':', color='b', label='Sigmoid Model')
    ax.vlines(x[n-1], 0, np.max([C, Cs]), label='truncation time')
    ax.legend(loc='best')
    plt.xlabel('Days since first detected case')
    plt.title('"Confirmed" COVID-19 Infections in Denmark')
    plt.show()

In [25]:

interact(plot_predictions, y=fixed(y),
                           n = widgets.IntSlider(min=20, max=len(y), value=100),
                           points=fixed(2*len(y)))

interactive(children=(IntSlider(value=100, description='n', max=746, min=20), Output()), _dom_classes=('widget…

Out[25]:

<function __main__.plot_predictions(y, n, points)>

The predictions depend on the model and details of the data. A few additional days of data can drastically change the predictions. For some end dates, the two models have nearly identical predictions; for others, they differ by a factor of 20 or more.

Unlike the SIR model, the sigmoid model does not "know" how many people are in the population, so even if recovery conveys immunity, the cumulative number of infections in the sigmoid model can exceed the size of the population.

This mindless "curve fitting" approach is not a reliable basis for predicting the course of the pandemic, predicting ICU demand, predicting the effect of an intervention, allocating resources, or setting public policy. It is a mechanical application of models and algorithms to data, not science.

Unsurprisingly, models of this type do not predict infections well in practice. The most-cited predictions have been those promulgated by the Institute for Health Metrics and Evaluation (IHME) at the University of Washington. They say,

Our model is designed to be a planning tool for government officials who need to know how different policy decisions can radically alter the trajectory of COVID-19 for better or worse.

Our model aimed at helping hospital administrators and government officials understand when demand on health system resources will be greatest. (http://www.healthdata.org/covid/faqs#differences%20in%20modeling, last visited 23 January 2021)

IHME's model is described here. Early in the pandemic, IHME predicted that there would be about 60,000 COVID-19 deaths in ths U.S. As of 18 February 2021, there have been more than 490,000. Between March and August, 2020, next-day deaths were within the IHME 95% prediction interval only about 30% of the time, despite revisions of the model https://arxiv.org/abs/2004.04734. (We shall not delve into how IHME produces its prediction intervals.) See also https://www.vox.com/future-perfect/2020/5/2/21241261/coronavirus-modeling-us-deaths-ihme-pandemic.

One feature of the sigmoidal models is that they predict rapid declines after the peak (the decline is just as rapid as the rise), a feature that some politicians like. That feature is built into the sigmoidal curve: it is an assumption, not something that comes from the data.

There are also more complicated models of pandemics, including models with more "compartments," models based on other sigmoidal functions (such as the Gaussian CDF), and models that allow various kinds of heterogeneity within the population and simulate interactions between individuals. Some have proved more accurate than others. Just how poorly some models perform is obscured by the practice of revising the model and projections frequently--even daily.

Sometimes, a moddle with little substantive foundation can nonetheless predict the behavior of a system, at least if the system is not changing quickly. Accurate predictions of the effect of interventions (masks, lockdowns, vaccination, and so on) require a close tie between the world and the model: the model needs a substantive basis. For causal inferences, the model needs to be a generative model or response schedule.

George Box famously wrote, "all models are wrong, but some are useful." While that quotation is often invoked as a general license to throw models at data, it should instead prompt questions:

useful for what?
how can you tell whether a model is useful for a particular goal?

In the same paper, Box wrote, "It is inappropriate to be concerned with mice when there are tigers abroad." Most of the fretting people do about statistical models is at the level of mice (e.g., asymptotics under unrealistic assumptions), while the problems with most statistical models are tigers: the models have little or no connection to the actual scientific question.

New variants¶

As of this writing (10 February 2021), new strains of COVID-19 have been detected in the UK, South Africa, and the US, where cases of the new strain have a doubling time of about 10 days. The UK strain is estimated to be 50% to 70% more transmissible than the original virus. There is mounting evidence that the strain is also more lethal. https://www.wsj.com/articles/new-u-k-covid-19-variant-could-be-more-deadly-british-officials-say-11611338370 (last visited 22 January 2021).