Infectious diseases

Estimating the Size of a COVID-19 Epidemic from Surveillance Systems

Yue, Mu; Clapham, Hannah E.; Cook, Alex R.

doi: 10.1097/EDE.0000000000001202


As the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)1 spreads around the world from its initial focus in Wuhan, China,2 causing local Coronavirus Disease 2019 (COVID-19) epidemics, public health policy makers in countries or territories face the decision of when to switch from containment to mitigation measures.3 This decision rests upon an accurate estimate of the size of the local outbreak. Where intensive contact tracing has been undertaken, as in Singapore,4 or mass testing, as in South Korea, there may be some degree of confidence that most cases have been identified and thus that the order of magnitude of the outbreak is known. Otherwise, however, policy makers may be reliant on passive surveillance streams to infer the size of the outbreak. Such inference may be challenged by incomplete coverage and by the rapid growth of the outbreak, which, coupled with the lag between onset of symptoms and detection by the surveillance system, means that the raw counts must be statistically inflated to correct the estimates.

This note outlines a simple Bayesian model designed to estimate the outbreak size during the exponential growth phase of the COVID-19 epidemic from one or two surveillance streams providing counts of cases meeting various criteria. We illustrate it through scenarios based on virologic surveillance from a network of influenza-like illness consultations in primary care and from pneumonia cases in hospitals, but the approach generalizes to other surveillance streams such as mortalities.


We assume that we remain in the initial phase of the epidemic, when both the total and the new number of cases (regardless of whether they are imported or autochthonous) grow exponentially,5 prior to herd immunity taking hold. The number of new cases on day t is therefore modeled as an exponential function of t. We assume that growth has a constant exponent, b, as it might if control has been implemented to a constant degree. Time 0 is arbitrary but may be set to the day the alarm was first raised in Wuhan, which coincidentally was 31 December 2019, allowing t to represent the day of the year in 2020.

The number of new cases detected by surveillance stream s on day t is modeled as a Poisson variable. We assume that a fraction of cases enters each surveillance stream at an average lag after onset, which we assume does not change over time. For instance, a fraction of cases may develop pneumonia or may present to a primary care clinic that is part of a virologic surveillance network. The likelihood function obtains from the Poisson distribution of these detected counts.

The parameter space comprises the growth parameters together with the fraction and lag for each stream. In the early phase of the outbreak, the scarcity of local data may necessitate fixing some parameters using knowledge obtained from elsewhere (say, on the proportion of cases developing pneumonia) or from the nature of the surveillance system (say, on the proportion of primary care clinics in the network). The target of inference is the size of the outbreak, including the total cases to date. This can be estimated by (1) setting noninformative prior distributions for the parameters we have information to estimate and informative or Dirac delta priors for those we have not, (2) running a standard Metropolis-Hastings algorithm,6 and (3) transforming the primary estimates to obtain posterior distributions for the target quantities. We have developed example R code7 to implement this algorithm. We now illustrate the approach through two examples.
The pattern of cases together with the dynamic estimates of the size of the outbreak for illustrative examples 1 and 2 are presented in the Table.
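The three estimation steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' R code: it assumes new cases follow N(t) = exp(log_a + b·t), that daily detections in one stream are Poisson with mean fraction × N(t − lag), and that the fraction and lag are fixed (a Dirac delta prior) while log_a and b receive flat priors on an illustrative bounded box. All function names, prior bounds, proposal scales, and the example values of fraction and lag are hypothetical.

```python
import math
import random


def log_likelihood(log_a, b, counts, fraction, lag):
    """Poisson log-likelihood of one stream's daily counts for days 1..T.

    Assumed model: new cases N(t) = exp(log_a + b * t); detections on day t
    are Poisson with mean fraction * N(t - lag).
    """
    total = 0.0
    for day, x in enumerate(counts, start=1):
        mean = fraction * math.exp(log_a + b * (day - lag))
        total += x * math.log(mean) - mean - math.lgamma(x + 1)
    return total


def log_prior(log_a, b):
    """Flat prior on a bounded box; the bounds are illustrative assumptions."""
    inside = -20.0 < log_a < 10.0 and 0.0 < b < 1.0
    return 0.0 if inside else -math.inf


def metropolis_hastings(counts, fraction, lag, n_iter=20000, burn_in=5000, seed=1):
    """Random-walk Metropolis-Hastings over (log_a, b); fraction and lag stay fixed."""
    rng = random.Random(seed)
    log_a, b = 0.0, 0.1
    current = log_prior(log_a, b) + log_likelihood(log_a, b, counts, fraction, lag)
    samples = []
    for i in range(n_iter):
        new_log_a = log_a + rng.gauss(0.0, 0.5)
        new_b = b + rng.gauss(0.0, 0.02)
        proposed = log_prior(new_log_a, new_b)
        if proposed > -math.inf:
            proposed += log_likelihood(new_log_a, new_b, counts, fraction, lag)
            # Symmetric proposal: accept with probability min(1, posterior ratio).
            if rng.random() < math.exp(min(0.0, proposed - current)):
                log_a, b, current = new_log_a, new_b, proposed
        if i >= burn_in:
            samples.append((log_a, b))
    return samples


def total_cases(log_a, b, last_day):
    """Cumulative cases from day 0 to last_day under the exponential model."""
    return sum(math.exp(log_a + b * t) for t in range(last_day + 1))


def summarize(values):
    """Posterior median and central 95% interval from a list of draws."""
    ordered = sorted(values)

    def pick(q):
        return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

    return pick(0.5), pick(0.025), pick(0.975)


if __name__ == "__main__":
    counts = [0] * 30 + [1]   # made-up series: first detection on day 31
    draws = metropolis_hastings(counts, fraction=0.2, lag=7.0)  # hypothetical values
    totals = [total_cases(la, b, len(counts)) for la, b in draws]
    print(summarize(totals))
```

With so few observations the posterior is dominated by the priors, so the numbers this sketch prints depend heavily on the illustrative bounds and should not be compared with the estimates reported in the examples.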

Table. Estimated Outbreak Size to Date Based on the Pneumonia Surveillance System and on the Pneumonia and ILI Surveillance Systems

Example 1: The First Case of Pneumonia

In city X, all cases of pneumonia are being tested for SARS-CoV-2 infection. Two model parameters are fixed at assumed values taken from the literature.8 In this scenario, after 30 days of negative tests, a positive case is identified on day 31. The 95% credible interval (CrI) for the estimated number of infections at that point is 2–93. This estimate will necessarily evolve over the next few days as more cases come in, or not. Should there be no new cases by day 35, the 95% CrI would narrow to 2–64. Should the first case instead be followed by one more pneumonia a day for 4 days, the 95% CrI would shift to 26–311.
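The three situations in Example 1 differ only in the daily count series fed to the model. As a small illustration, they could be encoded as below; the function name and scenario labels are hypothetical, and the series simply transcribe the description above (day 1 is the first day of testing).

```python
def example1_counts(scenario):
    """Daily positive pneumonia tests, one entry per day starting at day 1."""
    first_detection = [0] * 30 + [1]            # days 1-30 negative, day 31 positive
    if scenario == "day_31":
        return first_detection
    if scenario == "day_35_no_new_cases":
        return first_detection + [0] * 4        # days 32-35: nothing further
    if scenario == "day_35_one_per_day":
        return first_detection + [1] * 4        # days 32-35: one more each day
    raise ValueError(f"unknown scenario: {scenario}")


for name in ("day_31", "day_35_no_new_cases", "day_35_one_per_day"):
    series = example1_counts(name)
    print(name, "days observed:", len(series), "cases detected:", sum(series))
```

Each series would then be passed, together with the fixed fraction and lag, to whatever sampler is used for estimation.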

Example 2: A Smattering of Pneumonias and Influenza-like Illnesses

In country Y, two passive surveillance systems are used to detect COVID-19: all pneumonias are tested, and a network of primary care doctors takes nasopharyngeal swabs from patients with influenza-like illness (ILI), which are tested virologically for SARS-CoV-2 in addition to influenza. The network covers approximately 4% of ILIs presenting to primary care. We assume that 43% of SARS-CoV-2 cases develop ILI9 and consult an average of 5.5 days after onset,10 so that roughly 1.7% (0.04 × 0.43) of cases enter the ILI stream at an average lag of 5.5 days; the other assumptions are as before.
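With two streams, one natural way to combine the data is to add the streams' Poisson log-likelihoods. The sketch below assumes the streams are independent given the epidemic curve and that new cases follow N(t) = exp(log_a + b·t); both are assumptions of this illustration. The ILI fraction (0.04 × 0.43) and lag (5.5 days) come from the text, whereas the pneumonia fraction and lag, the daily counts, and all names are made-up placeholders.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Stream:
    fraction: float      # share of all cases that enters this stream
    lag_days: float      # average delay from onset to detection
    counts: List[int]    # daily detections for days 1..T


def stream_log_likelihood(log_a, b, stream):
    """Poisson log-likelihood for one stream under N(t) = exp(log_a + b * t)."""
    total = 0.0
    for day, x in enumerate(stream.counts, start=1):
        mean = stream.fraction * math.exp(log_a + b * (day - stream.lag_days))
        total += x * math.log(mean) - mean - math.lgamma(x + 1)
    return total


def joint_log_likelihood(log_a, b, streams):
    """Streams are treated as independent given the epidemic curve, so terms add."""
    return sum(stream_log_likelihood(log_a, b, s) for s in streams)


ILI_FRACTION = 0.04 * 0.43   # network coverage x share of cases developing ILI
ILI_LAG_DAYS = 5.5           # average days from onset to consultation

ili = Stream(ILI_FRACTION, ILI_LAG_DAYS, counts=[0, 0, 1, 0, 2])   # made-up counts
pneumonia = Stream(0.2, 7.0, counts=[0, 0, 0, 1, 0])               # made-up values
print(joint_log_likelihood(math.log(5.0), 0.1, [ili, pneumonia]))
```

Because the lag enters only through the exponent, a non-integer lag such as 5.5 days needs no special handling in this formulation.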


The method we outline can readily be implemented on a daily basis as new reports come in. It will be affected by delays in reporting, which should be accommodated through the lag parameter(s) or by revising previous estimates as cases are reported. In the early period of the outbreak, it may be necessary to use estimates of parameters such as the growth rate b from China5 or the second wave of countries to be affected. As the estimates are dependent on the prior distribution assumed for these parameters, sensitivity analyses may be conducted to assess how robust the estimates are to misspecification of input parameters.11 As the local outbreak continues, there may be sufficient information to permit localized parameterization and to use model predictive checks to assess whether its assumptions, for instance of exponential growth, are valid.
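A back-of-envelope calculation shows why the assumed growth rate matters. If new cases grow as exp(b·t) and a stream detects a fixed fraction of cases after a fixed lag (assumptions of this sketch), then current daily incidence exceeds the expected daily detections by a factor of exp(b × lag) / fraction. The snippet below tabulates that factor for the ILI stream values of Example 2 across several hypothetical growth rates. It is a simplified illustration, not the full prior sensitivity analysis described above.

```python
import math


def inflation_factor(b, lag_days, fraction):
    """Ratio of current daily incidence to the expected daily detections."""
    return math.exp(b * lag_days) / fraction


def doubling_time(b):
    """Days for incidence to double at exponential growth rate b."""
    return math.log(2.0) / b


fraction = 0.04 * 0.43   # ILI stream: network coverage x share of cases with ILI
lag_days = 5.5           # ILI stream: average days from onset to consultation

for b in (0.05, 0.10, 0.15, 0.20):   # hypothetical growth rates per day
    print(
        f"b={b:.2f}  doubling time={doubling_time(b):.1f} d  "
        f"inflation={inflation_factor(b, lag_days, fraction):.0f}x"
    )
```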

As countries repatriate their citizens from areas of heightened transmission, the growth in allochthonous and autochthonous infections may diverge, and thus the exclusion of the former may be warranted when using this method.


1. Gorbalenya AE, Baker SC, Baric RS, et al. The species severe acute respiratory syndrome-related coronavirus: classifying 2019-nCoV and naming it SARS-CoV-2. Nat Microbiol. 2020;5:536–544.
2. Li Q, Guan X, Wu P, et al. Early transmission dynamics in Wuhan, China, of novel coronavirus–infected pneumonia. N Engl J Med. 2020;382:1199–1207.
3. Heymann DL, Shindo N; WHO Scientific and Technical Advisory Group for Infectious Hazards. COVID-19: what is next for public health? Lancet. 2020;395:542–545.
4. Wong JE, Leo YS, Tan CC. COVID-19 in Singapore—current experience: critical global issues that require attention and action. JAMA. 2020;323:1243–1244.
5. Wu JT, Leung K, Leung GM. Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study. Lancet. 2020;395:689–697.
6. Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis. Boca Raton, FL: CRC Press; 2013.
7. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2014.
8. The Novel Coronavirus Pneumonia Emergency Response Epidemiology Team. The epidemiological characteristics of an outbreak of 2019 novel coronavirus diseases (COVID-19) — China, 2020. China CDC Wkly. 2020;2:113–122.
9. Guan W, Ni Z, Hu Y, et al. Clinical characteristics of Coronavirus Disease 2019 in China. N Engl J Med. 2020. doi: 10.1056/NEJMoa2002032
10. World Health Organization. Report of the WHO-China Joint Mission on Coronavirus Disease 2019 (COVID-19). Published February 2020. Accessed 6 March 2020.
11. Lee VJ, Chen MI, Yap J, et al. Comparability of different methods for estimating influenza infection rates over a single epidemic wave. Am J Epidemiol. 2011;174:468–478.

Bayesian inference; COVID-19; Epidemic size; Surveillance

Copyright © 2020 The Author(s). Published by Wolters Kluwer Health, Inc.