# Alternative Methods of Estimating an Incubation Distribution: Examples From Severe Acute Respiratory Syndrome

Background: Accurate and precise estimates of the incubation distribution of novel, emerging infectious diseases are vital to inform public health policy and to parameterize mathematical models.

Methods: We discuss and compare different methods of estimating the incubation distribution allowing for interval censoring of exposures, using data from the severe acute respiratory syndrome (SARS) epidemic in 2003 as an example.

Results: Combining data on unselected samples of 149 and 168 patients with defined exposure intervals from Toronto and Hong Kong, respectively, we estimated the mean and variance of the incubation period to be 5.1 day and 18.3 days and the 95th percentile to be 12.9 days. We conducted multiple linear regression on the log incubation times and found that incubation was significantly longer in Toronto than in Hong Kong and in older compared with younger patients, while it was significantly shorter in healthcare workers than in other patients.

Conclusions: Our findings suggest subtle but important heterogeneities in the incubation period of SARS among different strata of patients. Robust estimation of the incubation period should be independently carried out in different settings and subgroups for novel human pathogens using valid statistical methods.

From the *Department of Community Medicine and School of Public Health, the University of Hong Kong, Hong Kong; †Mount Sinai Hospital, Toronto, Ontario, Canada; and ‡Canadian SARS Research Network, Canada.

Submitted 15 May 2006; accepted 4 September 2006.

Supported in part through a commissioned research grant from the Research Fund for the Control of Infectious Diseases of the Health, Welfare, and Food Bureau of the Hong Kong Special Administrative Region Government; by the University of Hong Kong SARS Research Fund; and by the EU Sixth Framework Programme for research for policy support (contract SP22-CT-2004-511066). The Canadian SARS Research Network was funded by the Canadian Institutes for Health Research.

Correspondence: Benjamin Cowling, Department of Community Medicine, the University of Hong Kong, 21 Sassoon Road, Pokfulam, Hong Kong. E-mail: bcowling@hku.hk

During the 2003 severe acute respiratory syndrome (SARS) epidemic, knowledge of the incubation period of SARS was critical to the global effort to control the spread of illness. Based on initial estimates of incubation time, a 10-day quarantine period was adopted by most health authorities, although some suggested that a longer quarantine period might have been more appropriate.^{1} Furthermore, the incubation period was incorporated into the case definition for SARS early in the outbreak: patients were only considered to have SARS if they had a potential exposure within the 10-day period before symptom onset.^{2} Estimates of the incubation distribution were also required for developing the mathematical models used to understand transmission dynamics and to predict the impact of different control policies.^{3} Thus, accurate estimates of the incubation distribution, and particularly the right-hand tail in the first 2 scenarios, are central in informing public health policy.

Estimation of the incubation period is complicated because infection events cannot be directly observed. Instead, patients typically report a period of exposure during which infection likely occurred; if the infection occurred at time X, an exposure period between times A and B, where A ≤ X ≤ B, is usually documented. If symptom onset occurred at time Z, interest lies in the incubation time T = Z-X, but we can only observe that symptoms began between L = Z-B and R = Z-A days after infection.

These data are a special type of survival data, and a natural approach would be to “reverse” the time axis setting Z as the origin and X as the outcome time.^{4} Then each patient’s incubation time is not exactly known, but must be in the interval (L, R). As De Gruttola and Lagakos^{5} explained, “reversing” the time axis is “valid only when the density function for infection is uniform in chronologic time.” This condition is reasonable in the setting of SARS, with each exposure interval being relatively short, and will be assumed to hold for all the analyses described below; in fact, this assumption underlay all previous estimates of SARS incubation either implicitly^{6–13} or explicitly.^{14} Further details on this assumption are given in the technical appendix at the end of this article.

In the following sections we describe nonparametric and parametric statistical techniques for accurate estimation of the incubation distribution and illustrate the bias in some commonly used methods. We present the first inferential analysis of SARS incubation data from Toronto. By combining data from Hong Kong and Toronto, we have conducted the largest analysis of the SARS incubation distribution in the literature. Furthermore, we review all previous incubation period estimates,^{6–14} and investigate in our data whether such varied by country, age, sex, or occupation.

## METHODS

The simplest estimate of the incubation distribution is obtained by excluding cases in which exposure occurred within a defined interval, and using data only from cases in which a single exposure occurred at a defined time.^{8,9} However, this approach may underestimate the incubation distribution because of recall bias, since recent exposures are likely to be more precisely remembered. To incorporate defined exposure periods, the infection dates could be imputed as the midpoint of those exposure intervals, which then permits empirical estimation.^{7,13} Alternatively, one could assume that all infections occurred at the beginning or end of each interval, and then calculate the corresponding empirical distribution functions.

Turnbull^{15} derived the nonparametric maximum likelihood estimator of the distribution function for interval-censored exposure data, which simplifies to the empirical distribution function if all exposure times are exactly observed. As the maximum likelihood estimate, this approach should be viewed as the nonparametric gold-standard. It can be implemented in R (R Development Core Team, Vienna) through the package “Icens,” or in SAS (SAS Institute Inc., Cary, NC).^{4} Pointwise 95% confidence intervals may be calculated from the observed information matrix. An alternative simple nonparametric estimate of the incubation distribution for interval-censored data is given by Meltzer.^{11}

It is often useful to summarize the incubation distribution by a parametric model, which can easily accommodate interval-censoring (appendix).^{4,16} Gamma,^{6,10,12} Weibull,^{14} lognormal,^{10} and loggamma^{10} distributions have previously been used for SARS. Comparison between models may be through visual comparison with a nonparametric estimate, or by Akaike’s Information Criterion.^{17} Finally, Kuk and Ma^{14} present a parametric approach to estimating the incubation distribution when detailed contact tracing data are available by analyzing the serial intervals between symptom onset in index and secondary cases.

A further advantage of parametric models is that covariate information can be incorporated through the scale parameter, and these models are known as accelerated life or accelerated failure time models.^{4} When the lognormal distribution is used, the corresponding regression model is essentially a multiple linear regression, or equivalently an analysis of covariance, on the natural logarithm of the incubation time. Given that incubation distributions are typically right-skewed and always positive-valued, it is more appropriate to perform linear regression on log-transformed incubation times than on untransformed times. The estimated regression coefficients, β, can be interpreted as the expected change in median log incubation time relative to baseline, while the transformed effects exp(β) can be interpreted as acceleration factors ie, proportional increases or decreases in the median incubation time. Parametric regression models for interval-censored data may be applied in R (through the package “survival”) or in SAS (using “proc lifereg”). All analyses presented here were conducted using R version 2.3.1 (R Development Core Team, Vienna).

### APPLICATION TO SARS DATA

Data on a defined exposure period from Toronto were available from 149 (51%) of all 291 adults diagnosed with “probable SARS,”^{18} whereas similarly specified data were available from 168 (10%) of the 1755 probable SARS patients in Hong Kong.^{12}

Figure 1A compares the Turnbull nonparametric estimate and 3 parametric models on the Toronto data. By Akaike’s Information Criterion, the best-fitting distribution was the lognormal (AIC = 478), followed by the Weibull (AIC= 480), and then the gamma (AIC = 482). Figures 1B and 1C show the same distributions fitted to the Hong Kong data and the combined data respectively, where again the lognormal fitted best. Visual inspection of the parametric curves against the Turnbull estimates in Figures 1A–C confirm that the lognormal distribution fits the data appropriately. For the combined data, the Turnbull estimate had mean and variance 5.0 days and 11.5 days, respectively, and the 95th percentile was 11.0 days. Using the lognormal parametric model for the combined data, the estimated mean and variance were 5.1 days and 18.3 days, respectively, and the 95th percentile was 12.9 days (95% confidence interval = 11.7–14.5 estimated by bootstrapping).

The estimated means, 95th and 99th percentiles of the lognormal, Weibull, and gamma distributions are presented in Table 1, with 95% confidence intervals generated by bootstrapping. The 95th and 99th percentiles would be of interest for public health planning, in particular for specifying case definitions and the quarantine period. All 3 models have fairly similar values of AIC, but quite significant discrepancies in the tails, where the fitted lognormal distributions have the longest right-hand tails. We also investigated the loggamma distribution which is a generalization of the gamma and lognormal distributions,^{10} and found the results were almost identical to the lognormal model in each case (data not shown). As a sensitivity analysis, we fitted lognormal models to the incubation data for only the 128 and 136 laboratory confirmed SARS cases in Toronto and Hong Kong and found that the estimates were very similar to those for all probable SARS patients (Figs. 2A and B). As a further sensitivity analysis, we fitted a lognormal model to data on exactly known exposures only (n = 65) and found that the mean and variance of the SARS incubation period were 5.1 days (95% CI = 4.2–5.9) and 16.5 days, respectively, and the 95th percentile was 12.6 days (CI = 10.1–15.4).

Previous estimates of the incubation distribution are summarized in Table 2 and compared with the estimates for data presented here. Early estimates from Hong Kong,^{6} and estimates from Singapore^{7,14} provided the shortest incubation times whereas a recent estimate from Beijing^{13} is consistent with our results.

Figure 1D compares the alternative nonparametric estimates of the incubation distribution. Imputing the infection date as the midpoint of exposure intervals overestimated the median of the incubation distribution compared with Turnbull’s estimate, and the bias was worse under Meltzer’s nonparametric estimate. Finally, the 2 extreme empirical estimates, where infection dates were imputed as either the left or the right end of all exposure intervals, gave slightly wider bounds for the incubation distribution than the 95% pointwise confidence intervals of the Turnbull estimate. Figure 1E compares the Turnbull estimates of the estimated incubation distributions in Hong Kong and Toronto, while the lognormal estimates are compared in Figure 1F.

We performed a multiple linear regression on the log of the incubation times, allowing for interval censoring. The resulting regression coefficients and acceleration factors are shown in Table 3. Older patients had 45% longer median incubation times, while patients in Hong Kong and healthcare workers had 21% and 34% shorter median incubation times than patients in Toronto and nonhealthcare workers, respectively. These findings were fairly consistent under separate analyses for each center (Appendix Table).

## DISCUSSION

We have estimated the incubation distribution of SARS based on 317 patients with known exposure periods, which is the largest combined database in the literature to date. The estimated mean and 95th percentile are consistent with other recent estimates in the SARS literature, while slightly longer than the earliest published estimates (Table 2). Two previous empirical analyses of small subsets of SARS patients in Toronto with exact infection times found mean incubation times of 5 and 4.7 days,^{8,9} slightly shorter than the 5.6 day mean in the full Toronto data analyzed here. Previous analyses of the incubation data in Hong Kong typically excluded longer exposure intervals,^{6,10,12} and this may have biased those estimates towards shorter times if recent exposures were more precisely recalled than earlier exposures. It is also noticeable from Figure 1 and Table 1 that the gamma distribution used in those previous estimates has a shorter right tail than the lognormal and Weibull distributions, and thus a shorter 95th percentile. When only the exact exposure times were analyzed, we found that the estimated incubation distribution was very similar while the smaller sample size resulted in a less precise estimate of the 95th percentile.

We found that the 95th percentile of the incubation distribution was 11 days under the nonparametric estimate, but 12.9 days under the best-fitting parametric model. However, the means of the nonparametric and parametric estimates were very close. A potential reason for the smaller right tail of the nonparametric estimate could be because the World Health Organization clinical case definitions for SARS required exposure within 10 days,^{2} and therefore some earlier potential exposures (ie, longer incubation times) may not have been reported, or even if reported these may not have been recorded in the charts due to the perceived implausibility of long incubation time. Furthermore, the lack of exposure data for many patients, particularly in Hong Kong, could have biased the estimates of the incubation distribution. We found that the lognormal distribution gave best fit to the SARS data, and this is biologically plausible since this distribution has given best fit to the incubation distribution of a wide variety of infectious diseases.^{19,20} We investigated the incubation distribution using data on probable SARS cases to be consistent with the literature,^{6–14} and we note that our results were very similar when we analyzed the subgroup with laboratory confirmation of SARS.

The alternative parametric models varied quite significantly in the estimated length of the right-hand tail of the incubation distribution (Table 1). The 95th percentile, or a similar statistic from the right-hand tail of the incubation distribution, is important for public health planning for 2 reasons. First, it would allow proper specification of the case definitions and minimize misdiagnosis of patients with long intervals between exposure and onset of symptoms of the disease in question. Second, it would allow proper specification of the quarantine period, essentially the length of time that an exposed but asymptomatic person will be inconvenienced before they can be assumed to be disease free. Given that public health officials should aim for conservatism, a useful approach might be to fit a variety of parametric models, and not only select the best-fitting model (according to robust criteria) but also look at the longest (ie, most conservative) tail percentiles.

We found significant geographical variation in the incubation distribution between Hong Kong and Toronto, even after adjusting for other important factors (Table 3). Analysis of incubation data from China^{13} also found significant geographical variation and hypothesized that these may have been due to differences in host resistance, or differences in socio-demographic or environmental characteristics in the various areas. An *ex ante* alternative explanation is that geographical variation in the incubation period could have been due to variation in genotypic differences of the SARS virus although subsequent investigation showed that the Metropole Hotel strain from Hong Kong seeded the worldwide outbreak, including Toronto. Another explanation for the differences between Hong Kong and Toronto is that reporting behavior was different, either due to differences in recognition of initial symptoms or differences in elicitation of the exposure data. Finally, as commented previously, the patients who provided exposure data may not have formed a representative sample, which could have led to biases in the estimated incubation distributions in either or both locations.

Our results suggest that incubation times varied with age and occupation. The longer incubation periods experienced by older patients might have resulted from a delayed immune response (albeit the disease consequences were more severe with age), given that SARS has characteristics of an immunomodulated disease.^{21} The shorter incubation periods of healthcare workers might have been due to the exposure of healthcare workers to patients with high viral loads, as peak viral load and peak infectivity typically occurred at day 7 to day 10 of illness, a time point at which most SARS patients had already been hospitalized. Alternatively, these findings may be due to reporting bias if younger individuals or healthcare workers were more vigilant for symptom onset. While an earlier analysis of incubation data in China^{13} did not find any statistically significant relationship between these variables, the direction of effects in their results is consistent with those presented in Table 3.

Turnbull’s nonparametric maximum likelihood estimate should be viewed as the gold standard estimate of the incubation distribution and used in preference to midpoint imputation whenever possible. For incubation distributions that are right-skewed, as is typically the case,^{19,20} the midpoint imputation approach will overestimate the incubation distribution. However, midpoint imputation should provide practical estimates during the early stages of an emerging epidemic when data are scarce and any estimate will have considerable uncertainty, noting that overestimation should be preferable to underestimation. As a final caveat, estimation of the incubation period can be further complicated if the onset date is unclear although for SARS this was not a problem given that the defining symptom was fever.^{22}

In emerging infectious diseases, accurate and precise estimates of the distribution of incubation times are necessary to advise public health policy, to specify case definitions, and to facilitate robust mathematical modeling. In this paper we have described and illustrated the alternative methods which could be used to derive the incubation distribution of SARS. The statistical techniques discussed here are generalizable to any infectious disease where exposure is not exactly known and thus incubation data are interval censored.

## ACKNOWLEDGMENTS

The authors thank their colleagues in the Hong Kong Department of Health and Hospital Authority and in Toronto who were involved with the public health control of the SARS epidemic and data collection and processing.

## REFERENCES

*Science*. 2003;300:1961–1966.

*Stat Med*. 1999 Apr 15;18:890].

*Stat Med*. 1998;17:219–238.

*Biometrics*. 1989;45:1–11.

*Lancet*. 2003;361:1761–1766.

*Morb Mortal Wkly Rep*. 2003;52:405–411.

*N Engl J Med*. 2004;350:2352–2361.

*CMAJ*. 2003;169:285–292.

*Stat Med*. 2005;24:3431–3445.

*Emerg Infect Dis*. 2004;10:207–209.

*Ann Intern Med*. 2004;141:662–673.

*Am J Epidemiol*. 2006;163:211–216.

*Stat Med*. 2005;24:2525–2537.

*J Roy Statist Soc: Series B*. 1976;38:290–295.

*Survival Analysis: Techniques for Censored and Truncated Data. Statistics for Biology and Health*. 2nd ed. New York: Springer; 2003.

*Lifetime Data Anal*. 1998;4:329–354.

*Eur J Clin Microbiol Infect Dis*. 2006;25:230–237.

*Am J Hygiene*. 1950;51:310–318.

*Jinrui Idengaku Zasshi [Japanese Journal of Human Genetics]*. 1977;21:217–237.

*Lancet*. 2003;361:1767–1772.

*Ann Intern Med*. 2004;141:333–342.

## APPENDIX

##### A Discussion of the Assumption of Uniform Probability of Infection

If infection occurred on the interval (A, B), and symptoms occurred at Z, then the likelihood contribution for an individual is of the form

where *g*(*u*) describes the probability of infection at calendar time *u*, and *f*(·) is the density of the incubation distribution. This simplifies to the expression {*F*(*Z* − *A*) − *F*(*Z* − *B*)}, where *F*(·) is the cumulative density of the incubation distribution, if *g*(*u*) is the uniform density.^{5} Therefore the assumption of constant infection probability through any given exposure interval allows us to “reverse the time axis,” since the simplified expression above is identical to the likelihood contribution for survival data which are censored on the interval (Z-B, Z-A).^{16}

De Gruttola and Lagakos^{5} propose a method for joint estimation of *g*(*u*) and *f*(·), where *g*(*u*) represents the changing population risk of infection; however, this does not seem appropriate in the example of SARS, in which population risk was negligible throughout the epidemic. Individual risk may have varied considerably (eg, when healthcare workers were on or off duty) but such detailed data on precise exposure intervals are rarely available. As an example of a situation where the population risk of infection *g*(*u*) is of interest, De Gruttola and Lagakos^{5} estimate the increase in risk of HIV over calendar time for hemophiliacs using contaminated blood banks.

The sensitivity of the estimated incubation distribution to the assumption of constant risk in the 2 most extreme cases of decreasing (increasing) risk can be shown by calculating the empirical distribution functions assuming that all infections occurred at the beginning (end) of individual exposure intervals. The extreme estimates for the SARS example are shown in Figure 1D, and are somewhat wider than the 95% pointwise confidence intervals about the Turnbull estimate.

##### Fitting Parametric Models to Interval Censored Survival Data

We have considered 3 parametric models for the incubation distribution:

(1)The lognormal distribution with parameters μ, σ and probability density function

(2)The Weibull distribution with parameters r, λ and probability density function

(3)The gamma distribution with parameters k, θ and probability density function

Each of these distributions may be fit to data which include incubation times written in the form [L_{i}, R_{i}] for individuals *i* = 1,…,*n*, where for L_{i} < R_{i} these are interval censored times and for L_{i} = R_{i} they are exactly observed incubation times. Note the special conventions in notation L_{i} = 0 for left-censored incubation times and *R* _{i} = ∞ for right-censored incubation times. Then each of the parametric models may be fit by maximizing (eg, by numerical optimization) the likelihood function of the form

where *F* (·) and *f* (·) are the relevant cumulative distribution function and probability density function, respectively.^{4}