L(β, G) now has 3 components: the specified regression model fβ(Y|X) in the first bracket, the unspecified g(X) in the second bracket, and the outcome-dependent-sampling-induced probability P(Yki∈Ck), which ties fβ(Y|X) and g(X) together, in the third bracket. The first component would be the usual likelihood function for the observed data had the sampling been simple random sampling. The last component reflects the biased nature of the outcome-dependent sampling design; ignoring it in the analysis would result in biased estimates of β. Hence g(X), or G, in the second bracket cannot simply be factored out, as it could be under a simple random sample design. Statistical inference about β using the standard maximum likelihood estimation method will depend on a known or parameterized G. In practice, however, G is rarely known, and misspecifying its distribution could bias the parameter estimates and lead to erroneous conclusions. Consequently, statistical approaches that do not rely on an extra parameterization of G are desirable.
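The 3-bracket structure described above can be written out explicitly. The following display is a reconstruction in the spirit of Zhou et al,14 not necessarily the paper's exact equation (3); here k = 0 indexes the overall random sample and k = 1, …, K index the supplemental strata Ck:

```latex
L(\beta, G)
  = \underbrace{\Bigl[\prod_{k=0}^{K}\prod_{i=1}^{n_k} f_{\beta}(y_{ki}\mid x_{ki})\Bigr]}_{\text{regression model}}
    \underbrace{\Bigl[\prod_{k=0}^{K}\prod_{i=1}^{n_k} g(x_{ki})\Bigr]}_{\text{unspecified } g}
    \underbrace{\Bigl[\prod_{k=1}^{K} \Pr(Y \in C_k)^{-n_k}\Bigr]}_{\text{sampling correction}},
\qquad
\Pr(Y \in C_k) = \int \Pr(Y \in C_k \mid x)\, g(x)\, dx .
```

The third bracket couples β and G through the integral, which is why G cannot simply be factored out of the likelihood.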
To estimate β without specifying G(X), Zhou et al14 developed a maximum likelihood-based approach that maximizes L(β, G) by modeling G nonparametrically. The approach uses the profile likelihood idea: it fixes β in equation (3) and, using the Lagrange multiplier technique, solves for an empirical likelihood estimate Ĝ(β) from a constrained likelihood function, with constraints on Ĝ reflecting its properties as a discrete distribution function. An explicit solution for Ĝ(β) can be obtained. Plugging Ĝ(β) back into equation (3) and maximizing the resulting likelihood with the Newton–Raphson procedure yields the Zhou et al estimator β̂Z. An explicit standard error formula based on the asymptotic distribution has been provided.14 A statistical program for this analysis can be obtained from the web page (www.bios.unc.edu/∼zhou) or from the authors.
The inverse-probability weighted approach of Horvitz and Thompson15 can also be adapted to this situation by crudely treating all observed data, including the simple random sample, as if they were sampled from 3 strata, each with a given selection probability. Like the Zhou et al estimator, the inverse-probability weighted approach yields a consistent estimate of β. This method is commonly used with data from 2-stage studies.16,17 If all N individuals in the population were fully observed, the log likelihood function would be ∑i=1N log P(yi|xi;β). This quantity can be estimated by taking the completely observed individuals and weighting their contributions inversely by their probability of selection into the second stage. The inverse-probability weighted estimator β̂IPW is the solution to the following weighted score equation
where pk can be estimated by
if complete information is available. Note that this method needs more information than the likelihood approach of Zhou et al, because it requires the sampling probabilities to be known. However, because the inverse-probability weighted approach is based on a crude accounting for the sampling scheme and on an estimating-equations approach, it may not be as efficient as a likelihood-based estimator. When the number of categories of Y is small, the estimating-equation method is not efficient.18 The realistic settings for the outcome-dependent sampling design we considered had k between 2 and 4. Generally speaking, when the same amount of information is used, the maximum likelihood approach is the most efficient of the various statistical approaches to computing an estimator.
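As a concrete illustration, for a linear mean model the weighted score equation reduces to weighted least squares with weights equal to the inverse stratum selection probabilities. The sketch below is a minimal illustration of this idea, not the authors' software; the function name and interface are our own.

```python
import numpy as np

def ipw_estimate(y, X, stratum, p):
    """Inverse-probability-weighted estimate for a linear model.

    y: outcomes; X: design matrix (including an intercept column);
    stratum: stratum index k for each observation;
    p: estimated selection probability for each stratum
       (e.g. number sampled / number in cohort stratum).
    Solves the weighted normal equations X' W X beta = X' W y,
    which is the weighted score equation for a linear mean model.
    """
    w = 1.0 / p[stratum]          # weight = inverse selection probability
    XtW = X.T * w                 # broadcasts w across columns of X.T
    return np.linalg.solve(XtW @ X, XtW @ y)
```

With all selection probabilities equal, this reduces to ordinary least squares; the weighting matters only when, as under outcome-dependent sampling, the tail strata are oversampled.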
A SIMULATION STUDY
We designed a simulation to study the efficiency of different methods under a variety of conditions that one might face in real applications. The basic simulation setting is modeled after a study by Daniels et al19 of prenatal exposure to low levels of PCB in relation to mental and motor development. An outcome-dependent sampling design was used in the data collection. The data were generated according to the following model,
where X is the exposure variable, which takes on several distributions; p = 1 indicates a linear dose-response relationship and p = 2 a nonlinear relationship for E[Y|X]; the Zs are independent covariates; and e is a standard normal random error. We generated X from several distributions, including normal, exponential, lognormal, and Bernoulli. These choices reflect realistic situations in which X could be a rare binary variable, a continuous variable, or a skewed variable. We generated Z1 from a binary distribution (Bernoulli (0.45)), Z2 from a log-normal distribution (LN (0, 0.25)), and Z3 from a 3-level multinomial distribution (P(n, 0.3, 0.7)).
In our simulation, we first generated a cohort of 100,000 individuals according to the above model and drew outcome-dependent samples from this cohort. We drew an overall random sample of size n0, in which we observed {Y, X, Z1, Z2, Z3}. We also drew a supplemental random sample of size n1 from the lower tail of Y, defined by {Y<Ȳ−a*σY}, and a supplemental random sample of size n3 from the upper tail of Y, defined by {Y>Ȳ+a*σY}, where σY is the standard deviation of Y and a is a known constant. In addition to various configurations of the parameter values, we investigated the effect of varying the cut point a on the performance of the methods. We also investigated the impact on statistical power of varying the contribution of the overall random sample to the total sample size (ρ = n0/(n0 + n1 + n3)). Note that taking n2 = 0 here does not reduce generality.
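The cohort-generation and tail-sampling steps above can be sketched as follows. This is an illustrative reconstruction under simplifying assumptions: the intercept and slope values are our own, and Z3 and the non-normal exposure scenarios are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(2007)

# Illustrative parameter values (the paper varies these by scenario).
N, beta0, beta1 = 100_000, 1.0, 0.5
x = rng.normal(size=N)                    # exposure (normal scenario, p = 1)
z1 = rng.binomial(1, 0.45, size=N)        # Bernoulli(0.45)
z2 = rng.lognormal(0.0, 0.25, size=N)     # LN(0, 0.25)
y = beta0 + beta1 * x + 0.3 * z1 + 0.3 * z2 + rng.standard_normal(N)

def ods_sample(y, n0, n1, n3, a):
    """Indices of an outcome-dependent sample: an overall random sample of
    size n0, plus supplemental samples of sizes n1 and n3 drawn from the
    lower and upper tails of Y defined by the cut point a."""
    mu, sd = y.mean(), y.std()
    idx = np.arange(len(y))
    srs = rng.choice(idx, n0, replace=False)
    lower = rng.choice(idx[y < mu - a * sd], n1, replace=False)
    upper = rng.choice(idx[y > mu + a * sd], n3, replace=False)
    return srs, lower, upper

srs, lower, upper = ods_sample(y, n0=200, n1=100, n3=100, a=1.0)
```

The three index sets correspond to the (n0, n1, n3) allocation discussed in the text; varying `a` and the allocation reproduces the design factors examined in the simulations.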
Under each setting, we compared the Zhou et al estimator, denoted by β̂Z, with 4 other estimators: (i) the naive maximum likelihood estimator, β̂N, based on the observed ODS data but ignoring the sampling scheme; (ii) an inverse-probability weighted estimator (β̂IPW); (iii) the maximum likelihood estimator based on a plain random sample of the same size as the ODS sample (β̂P); and (iv) 2 logistic regression estimators (β̂Lk) based on dichotomizing the continuous Y by defining the outcome D as D = 1 if Y > mean(Y) + kσY and D = 0 otherwise, where k = 0, 1. The weight used in calculating β̂IPW is the inverse of the observed probability of being sampled in the respective stratum of Y. The β̂N and β̂P estimates are the same as the ordinary least squares estimates in our simulation setting. Each set of simulations generated 1000 data sets.
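For clarity, the cut-point rule defining the dichotomized outcome used by the logistic estimators β̂Lk can be written as a one-liner (the function name is our own; the logistic regression of D on the exposure would then be fit with any standard routine):

```python
import numpy as np

def dichotomize(y, k=0):
    """Binary outcome for the logistic estimators beta_Lk:
    D = 1 if Y > mean(Y) + k * sd(Y), else 0, with k = 0 or 1."""
    return (y > y.mean() + k * y.std()).astype(int)
```

Collapsing a continuous Y this way discards the within-category variation in the outcome, which is the source of the power loss seen in the simulation results below.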
SIMULATION STUDY RESULTS
Table 1 shows the simulation results for a = 1 and (n0, n1, n3) = (200, 100, 100) (hence ρ = 0.5) for various exposure effects. The mean estimate given by β̂N is biased for β1 in the simulation; thus, ignoring the sampling scheme leads to a biased estimate of the exposure effect. It is also clear from Table 1 that different dichotomizations of the continuous Y lead to various inconsistent estimates (β̂L0 and β̂L1) of β1. Perhaps more importantly, the logistic estimators can be less able to detect the true underlying relationship, as reflected by a P value of 0.08 for β̂L1, compared with P < 0.05 for all other methods using the continuous response. We do not present results for these 2 estimators in further comparisons. The other 3 methods all yielded consistent estimates of β1. The actual coverages of their nominal 95% confidence intervals (CIs) are all close to 95%, indicating that a good approximation to asymptotic normality is achieved at this sample size, and the estimated standard errors (SEs) are close to the true standard deviations. Under the setting considered, β̂Z has the smallest SE while β̂P has the largest. Because β̂N is a biased estimator and its SE underestimates the true variation, we excluded it from the further studies of sample size and power below. These observations are consistent across the different exposure effects listed in Table 1. Under the same linear model but with a quadratic X term, β̂Z is again more efficient.
Results in the lower portion of Table 1 contrast the normally distributed X with extreme cases, namely a skewed exposure (lognormal) and a rare binary exposure (Bernoulli (0.05)). Compared with the normally distributed exposure, the SE for β̂Z is even smaller in the skewed exposure situation. For the rare binary exposure, the results in Table 1 show that β̂Z is still the most efficient overall, though the sample size considered was not sufficiently large for any of the methods. This is reflected in the fact that the estimated standard errors are larger than the estimates of the slope. This is not surprising, because with a rare binary X there may not be enough information in the data set, as X = 1 could be sparse. Future development of a modified outcome-dependent sampling design for this situation is certainly warranted.
Figure 1 shows the power for testing H0: β1 = 0 versus H1: β1 = true value, for n = 400 and type I error fixed at 5%. The points corresponding to a true value of β1 = 0 show the empirical type I error rate for each test. All 3 methods yield close to 5% type I error. For all 3 estimators, statistical power increases with β1, and the power ordering is β̂Z > β̂IPW > β̂P.
Table 2 shows the sample sizes required to achieve a given statistical power for 3 values of β1, by type of estimator, under the same settings used in the top panel of Table 1. Outcome-dependent sampling with an appropriate estimator requires a smaller sample size. In this setting, the β̂Z method on average needs about 60% of the subjects who would be needed if the study were conducted with a simple random sampling scheme; the β̂IPW method needs about 83%. Further, for a given power, as the true value of β1 moves farther from 0, relatively fewer subjects are needed with β̂Z than with β̂P; that is, the efficiency gain increases as β1 moves farther from 0.
Figure 2 shows the impact of 2 factors in a given outcome-dependent sampling setting. The impact of varying a, the cut point that determines the strata of Y from which the supplemental random samples are drawn, is shown in Figure 2A. When a = 0, there is only plain random sampling, and in this instance all 3 methods have the same power. As a increases, however, β̂Z (solid line) has better power than the other 2. With β̂Z, the increase in power appears to be monotonic in a. Figure 2B shows the impact of ρ, the fraction of the overall random sample in the total outcome-dependent sample (ρ = n0/(n0 + n1 + n3)). At ρ = 1, there is no supplemental sample and the outcome-dependent sample is a plain random sample. As ρ decreases below 0.7, however, β̂Z (solid curve) is again the most powerful of the 3 estimators.
Table 3 presents different allocations of outcome-dependent sampling when the exposure variable is skewed (X ∼ lognormal). Compared with the results in the lognormal panel of Table 1, we see that allocating more of the sample to the upper tail of Y improves the efficiency. The standard error of the slope estimator decreases as more of the sample is shifted from the lower tail to the upper tail, ie, 0.0222 → 0.0201 → 0.0192 as the allocation changes (200, 150, 50) → (200, 100, 100) → (200, 50, 150).
Real Data Example 1
In the motivating example noted in the introductory text, a nested case–control study would have been implemented had the outcome been dichotomous. However, the outcome of interest (the score on the Bayley Scale of Infant Development) was continuous, and treating it as dichotomous would have resulted in a loss of statistical power. Measurements of polychlorinated biphenyls (the exposure of interest) are expensive; thus, minimizing sample size while maintaining power was especially important. The authors drew a random sample of cohort members and 2 additional random samples, one from each tail of the outcome distribution.19 Using the inverse-probability weighted estimator, the estimated β̂IPW was 0.47 BSID units/μg/L PCB, with an estimated SE of 0.32 (P = 0.14).14 Using the Zhou et al estimator, the estimated β̂Z was 0.44 with SE = 0.22 (P = 0.02). Using the simple random sample data alone, β̂N was 0.29 (SE = 0.29, P = 0.32). Daniels et al19 also examined and confirmed the shape of the dose-response relation in both the outcome-dependent and the simple random samples. The example demonstrates the improved efficiency of the Zhou et al estimator.
Real Data Example 2
Rissanen et al20 examined the relation of serum lycopene concentration to the thickness of carotid arteries among 1028 men in Finland. Using their data, we selected samples of 400 in 2 ways. First, we selected a simple random sample of 400. Second, we selected a random sample of 200, plus a random sample of 100 from those with carotid artery thickness above the 90th percentile and a random sample of 100 from those with carotid artery thickness below the 10th percentile. We then analyzed the data using an ordinary regression model with all 1028 subjects, and again with the 400 selected at random. In addition, we analyzed the 200-100-100 sample with the inverse-probability weighted method and then with the Zhou et al estimator. The lycopene results were adjusted for the same covariates (age, year, and sonographer) in all models. Results are given in Table 4. With the full sample of 1028, the estimated coefficient for lycopene was −0.14, with an estimated SE of 0.04 and P = 0.0011. With the random sample of 400, the estimated effect β̂P was −0.10, with SE = 0.06 and P = 0.096. With the 200-100-100 sample, β̂IPW was −0.19 with SE = 0.08 and P = 0.017, and β̂Z was −0.24 with SE = 0.07 and P = 0.0009. This example suggests that the outcome-dependent sampling scheme with the Zhou et al estimator provided nearly as much power as analysis of the full dataset. Furthermore, the Zhou et al estimator clearly had greater efficiency than the inverse-probability weighted approach. The larger βs obtained with outcome-dependent sampling and its estimators reflect the shape of the dose-response curve, which had larger negative slopes near the tails of the outcome. Although we analyzed the lycopene association as linear (a trend test), a curvilinear approach would better describe the relation and could easily be accommodated with either the inverse-probability weighted or the Zhou et al estimator.
Real Data Example 3
Korrick et al21 conducted a case–control study of hypertension and bone lead level. For logistical reasons, the sampling probabilities for cases and controls could not be determined. Measurements of blood pressure were available. The example, presented as Appendix 1 in the online supplementary material available with the online version of the article, shows that the proposed outcome-dependent sampling approach could be used to estimate the coefficient for bone lead in a model of blood pressure.
DISCUSSION
When there is a fixed, limited number of observations for examining the linear relation between a continuous exposure and a continuous outcome, outcome-dependent sampling is more efficient than a random sample. The simulations provide a simple and intuitive expression of this benefit: for a given level of statistical power, the number of observations can be reduced by about 40% compared with a random sample, while the estimator still targets the same parameters as an analysis of the underlying complete population. Thus, the benefits of outcome-dependent sampling apply to continuous outcomes as well as to dichotomous ones through case–control designs. For a binary response variable, our approach, implemented with logistic regression, is equivalent to the case–control analysis.22
To implement the weighted method, one needs to know or estimate the weights, which requires at least empirical data about the distribution of Y. Such information may not be a problem for nested studies; however, it can be difficult to calculate the weights for studies that are not nested. The difficulty arises because good-quality data on the distribution of Y, and an enumeration of potential subjects, may not be available for the base population. Inverse-probability weighting also suffers from the fact that, in the typical outcome-dependent sampling setting, the natural number of categories of Y is not large enough to yield the variability in weights that would make it efficient.18 Some recently developed methods may help to identify better weights.18,23 The Zhou et al method, on the other hand, does not use selection probabilities and hence does not require enumeration of the base population.
Our results showed that with a up to about 1 and ρ near 0.5 and a total sample size of 400, outcome-dependent sampling using the estimator of Zhou et al increased power about twice as much as the weighted estimating equation approach (Fig. 2). With larger values of a or smaller values of ρ, the advantage of the Zhou et al estimator was greater. This reflects a greater influence on the regression parameters for data from the tail areas than the middle areas. In general, the efficiency gain of the inverse-probability weighted method over the simple random sample analysis is notable; thus we would suggest in practice that one should at least do the weighted analysis if one cannot implement the Zhou et al method.
If an outcome-dependent sampling procedure is to be used, questions will arise regarding the optimal choice of a and ρ. For example, consider an examination of the serum level of contaminant X among pregnant women in relation to a continuous outcome in offspring. Subject-matter considerations might support large values of a (eg, >1) so that a corresponds to a clinically abnormal value. Values of a >1 might seem appealing because of the resulting increase in power, especially with the Zhou et al estimator (Fig. 2). But this reward depends on an assumption about fβ(Y|X) across the range of Y (see the online Appendix 2 for discussion of an example). Similarly, a choice of ρ > 0 has several advantages over ρ = 0,14 since including an overall random sample provides the flexibility of a cohort study and allows for model checking. In general, choosing ρ in the range [0.2, 0.5] provides much improved power with the Zhou et al estimator compared with the weighted-estimating-equation estimator, while still allocating enough observations to the simple random sample.
Similarly, the relative size of n1 and n3 might be affected by several factors that will vary across studies. For example, if the exposure variable is known to be skewed with a long tail to the right and β is known to be either zero or positive, then increasing the size of n3 relative to n1 would be sensible. A large n3 relative to n1 would also make sense when there is little interest in the determinants of a low value of the outcome variable.
Prospective designs coupled with relatively expensive measures of exposure are being used with increasing frequency in epidemiologic research. Furthermore, the scope of epidemiologic research increasingly includes outcomes best measured on a continuous scale. Given these trends, methods that allow cost-cutting while maintaining statistical efficiency are likely to see greater use. Recently, similar outcome-dependent sampling ideas have been extended to the situation in which information other than the exposure variable is also available for both the sample and the rest of the base population.24 Methods have been developed that account for an ordinal outcome variable in a generalized linear model setting; other methods incorporate auxiliary information about the exposure variable that is available for the entire base population.25 Much work, however, remains to be done; for example, how to use outcome-dependent sampling with longitudinal data is still an open question. A survey of recent statistical research on outcome-dependent sampling designs can be found in Zhou and You.26
ACKNOWLEDGMENTS
We thank Clare Weinberg and Beth Gladen for their careful reading of the paper and helpful suggestions. We also thank the reviewers for their helpful suggestions, which led to a much more complete version of the manuscript.
REFERENCES
1. Cornfield J. A method of estimating comparative rates from clinical data. Applications to cancer of lung, breast, and cervix. J Natl Cancer Inst. 1951;11:1269–1275.
2. Anderson JA. Separate sample logistic discrimination. Biometrika. 1972;59:19–35.
3. Breslow NE, Cain KC. Logistic regression for two-stage case–control data. Biometrika. 1988;75:11–20.
4. Prentice RL, Pyke R. Logistic disease incidence models and case–control studies. Biometrika. 1979;66:403–411.
5. Prentice RL. A case-cohort design for epidemiologic studies and disease prevention trials. Biometrika. 1986;73:1–11.
6. White JE. A two stage design for the study of the relationship between a rare exposure and a rare disease. Am J Epidemiol. 1982;115:119–128.
7. Wacholder S, Weinberg CR. Flexible maximum likelihood methods for assessing joint effects in case–control studies with complex sampling. Biometrics. 1994;50:350–357.
8. Imbens GW, Lancaster T. Efficient estimation and stratified sampling. J Econom. 1996;74:289–318.
9. Cosslett SR. Maximum likelihood estimator for choice-based samples. Econometrica. 1981;49:1289–1316.
10. Holt D, Smith TMF, Winter PD. Regression analysis of data from complex surveys. J R Stat Soc A. 1980;143:474–487.
11. Li R, Folsom AR, Sharrett AR, et al. Interaction of the glutathione S-transferase genes and cigarette smoking on risk of lower extremity arterial disease: the Atherosclerosis Risk in Communities (ARIC) study. Atherosclerosis. 2001;154:729–738.
12. Iribarren C, Folsom AR, Jacobs DR Jr, et al. Association of serum vitamin levels, LDL susceptibility to oxidation, and autoantibodies against MDA-LDL with carotid atherosclerosis. A case–control study. The ARIC Study Investigators. Atherosclerosis Risk in Communities. Arterioscler Thromb Vasc Biol. 1997;17:1171–1177.
13. Suissa S. Binary methods for continuous outcome: a parametric alternative. J Clin Epidemiol. 1991;44:241–248.
14. Zhou H, Weaver MA, Qin J, et al. A semiparametric empirical likelihood method for data from an outcome-dependent sampling design with a continuous outcome. Biometrics. 2002;58:413–421.
15. Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite universe. J Am Stat Assoc. 1952;47:663–685.
16. Flanders WD, Greenland S. Analytical methods for two-stage case–control studies and other stratified designs. Stat Med. 1991;10:739–747.
17. Zhao LP, Lipsitz S. Designs and analysis of two-stage studies. Stat Med. 1992;11:769–782.
18. Godambe VP, Vijyan K. Optimal estimation for response-dependent retrospective sampling. J Am Stat Assoc. 1996;91:1724–1734.
19. Daniels JL, Longnecker MP, Klebanoff MA, et al. Prenatal exposure to low-level polychlorinated biphenyls in relation to mental and motor development at 8 months. Am J Epidemiol. 2003;157:485–492.
20. Rissanen T, Voutilainen S, Nyyssonen K, et al. Low plasma lycopene concentration is associated with increased intima-media thickness of the carotid artery wall. Arterioscler Thromb Vasc Biol. 2000;20:2677–2681.
21. Korrick SA, Hunter DJ, Rotnitzky A, et al. Lead and hypertension in a sample of middle-aged women. Am J Public Health. 1999;89:330–335.
22. Qin J, Zhang B. A goodness-of-fit test for logistic regression models based on case–control data. Biometrika. 1997;84:609–618.
23. Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc. 1994;89:846–866.
24. Weaver M, Zhou H. An estimated likelihood method for continuous outcome regression models with outcome-dependent subsampling. J Am Stat Assoc. 2005;100:459–469.
25. Wang X, Zhou H. A semiparametric empirical likelihood method for biased sampling schemes in epidemiologic studies with auxiliary covariates. Biometrics. 2006;62:1149–1160.
26. Zhou H, You J. Semiparametric methods for data from an outcome-dependent sampling scheme. In: Hong D, Shyr Y, eds. Quantitative Medical Data Analysis Using Mathematical Tools and Statistical Techniques. Singapore: World Scientific Publications; 2006.
Supplemental Digital Content
© 2007 Lippincott Williams & Wilkins, Inc.
Source: Epidemiology. 18(4):461–468, July 2007.
