In each scenario, we derived the expected OR estimate based on the EM method (Appendix C). We also report the expected OR estimates yielded by the first-stage method.13
For the investigated errors in the group-level exposure XG, the first-stage method generally implied bias away from the null when the exposure had an effect (Table 3). The EM method was robust to errors in XG when the case and control participation rates at the second stage were equal (both 75%). By contrast, with unequal case/control participation rates, marked bias generally occurred in both directions, even when the exposure had no effect. When the errors in the group-level data were consistent underestimations (Fig. 1A) or overestimations (Fig. 1B), the direction and magnitude of the bias were determined largely by control participation relative to case participation at the second stage, whereas varying the distribution of the controls across the groups (Figs. 2A–C) had less impact (Table 3). When truly low exposure proportions were overestimated and truly high exposure proportions were underestimated (Fig. 1C), the control distribution across the groups, as well as the control participation rate, clearly influenced the resulting bias. Only marginal bias resulted from such errors when the distribution was uniform, whereas marked bias occurred when the exposure distribution was skewed; modest bias occurred for the bimodal distribution.
Björk et al.10 conducted a population-based case-control study for a series of 255 Philadelphia chromosome-positive cases of chronic myeloid leukemia from Southern Sweden, 1976–1993. For each case, 3 age- and sex-matched controls were sampled randomly from the population at risk at the time of diagnosis. For each subject, we obtained first-stage data as occupational titles for 1960, 1970, 1975, 1980, 1985, and 1990 from national censuses. Jobs held at a census were assumed to be held until the following census. We here concentrate on occupational exposure to organic solvents. To assign the first-stage exposure probabilities, we used a Swedish translation of a Finnish job-exposure matrix.3 In addition to occupational title and agent, the matrix incorporates calendar year as a third dimension. Up to 5 exposure probabilities were assigned to each individual i: XG(1),i, XG(2),i, XG(3),i, XG(4),i, XG(5),i; viz. 1 for each time epoch (with respect to the censuses) during the 20-year time period before the year of diagnosis. Each subject's exposure probability was then calculated as
$$X_{G,i} = 1 - \prod_{k}\left(1 - X_{G(k),i}\right),$$
with the product taken over the time epochs k assigned to individual i.
We used the individually assigned exposure probabilities as the first-stage exposure data. The underlying assumption of independence over time epochs may be too strong in practice. Using the maximum exposure probability (of XG(1),i, XG(2),i, XG(3),i, XG(4),i, XG(5),i) instead did not alter the results noticeably.
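To illustrate the two ways of combining epoch-specific probabilities mentioned above, the following minimal sketch computes the probability of exposure in at least one epoch under independence across epochs (one minus the product of the non-exposure probabilities) and, as the alternative, the maximum epoch-specific probability. The function names and the example values are hypothetical and are not taken from the study data.

```python
from math import prod


def combined_exposure_probability(epoch_probs):
    """P(exposed in at least one epoch), assuming independence across epochs."""
    return 1.0 - prod(1.0 - p for p in epoch_probs)


def max_exposure_probability(epoch_probs):
    """Alternative summary: the largest epoch-specific probability."""
    return max(epoch_probs, default=0.0)


# Hypothetical work history with 5 census-based epochs.
example = [0.0, 0.05, 0.05, 0.10, 0.0]
combined = combined_exposure_probability(example)
largest = max_exposure_probability(example)
```

The combined probability is never smaller than the maximum, so the two summaries differ most for subjects with several moderately exposed epochs.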
At the second stage, all cases and 1 randomly selected control in each matched set were selected for structured telephone interviews. If a control declined to participate, another control from the matched set was contacted. Occupational hygienists assessed individual exposures to organic solvents based on the interview information. The numbers of participants who contributed with individual-level exposure data on organic solvents were 195 cases (76% of the 255 first-stage cases) and 219 controls (29% of the 765 first-stage controls). Among the first-stage controls who were first contacted for interview in each matched set (n = 255), 151 (59%) contributed with individual-level exposure data.
Effect Estimation, Including Covariates
We estimated the age- and sex-adjusted OR based on the first-stage data by using an extension of the simple linear OR model (equation 1), viz. the additive-multiplicative logistic regression model13,14:
$$\mathrm{OR}(X_G, S_1, \ldots, S_J) = \exp\Big(\sum_{j=1}^{J} \gamma_j S_j\Big)\,(1 + \beta X_G),$$
where S1,..., SJ are indicators for relevant strata of age and sex, and γj is the log-transformed OR associated with stratum j. We used 6 strata of age and sex (age groups ≤54, 55–69, and 70+ years; separate for males and females). There are 2 underlying assumptions: (1) a common OR (exposed vs. unexposed) across the strata (ie, no effect modification) and (2) the exposure probability for a given work history i (XG(1),i, XG(2),i, XG(3),i, XG(4),i, XG(5),i) does not depend on the covariates (age and sex in our example). Alternatively, one can use the conditional additive-multiplicative logistic regression model based on the matched sets of cases and controls.16 Estimates of β and SE(β) can be obtained in the software package EGRET for Windows (Cytel Software Corp., Cambridge, MA).
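As a rough illustration of how such a model combines its two parts, the sketch below assumes the odds of being a case take the form exp(α + γj) × (1 + β XG): multiplicative in the stratum effects and linear in the group-level exposure probability. This functional form, the intercept α, and all names and values are assumptions made for the illustration; the sketch does not reproduce the EGRET implementation.

```python
import math


def case_odds(x_g, stratum, beta, gammas, alpha=0.0):
    """Assumed model: odds = exp(alpha + gamma_stratum) * (1 + beta * x_g)."""
    return math.exp(alpha + gammas[stratum]) * (1.0 + beta * x_g)


def odds_ratio_vs_unexposed(x_g, stratum, beta, gammas):
    """OR for exposure probability x_g versus x_g = 0 within one stratum."""
    return case_odds(x_g, stratum, beta, gammas) / case_odds(0.0, stratum, beta, gammas)


# Hypothetical log-ORs for 6 age-sex strata (first stratum as reference).
gammas = [0.0, 0.2, 0.5, -0.1, 0.1, 0.4]
or_fully_exposed = odds_ratio_vs_unexposed(1.0, 2, beta=2.0, gammas=gammas)
```

Under this form the stratum terms cancel in the within-stratum OR, which is what the common-OR assumption (1) amounts to: a group with XG = 1 has OR = 1 + β in every stratum.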
Generalizations of the second-stage and EM methods for handling stratified data are straightforward. The computations use stratum-specific numbers of exposed and unexposed among the cases and controls (observed numbers in the second-stage method and observed + expected numbers in the EM method). Then the OR with a 95% CI can be estimated by using logistic regression (assuming a common OR across the strata).
The OR estimate for chronic myeloid leukemia associated with occupational exposure to organic solvents was 0.75 (95% CI = 0.30–1.9) based on the first-stage exposure probabilities. A similar estimate was obtained from the conditional additive-multiplicative logistic regression model (OR = 0.78; CI = 0.32–1.9), using the first-stage exposure data with respect to the matched sets of cases and controls. A somewhat higher estimate (1.1; 0.63–1.8) was obtained based on the second-stage (interview) data. Among the participating controls, the exposure prevalence was 17% according to the second-stage data compared with only 4% according to the first-stage probabilities. The first-stage exposure probabilities were even lower among the nonparticipating controls (mean 2%), indicating some association between group-level exposure probability and participation. The EM method based on all available data yielded an elevated OR estimate of 2.2 (1.4–3.4). Including only the first contacted control in each matched set in the analysis implies more equal case/control participation rates; the EM method then yielded a lower OR estimate (1.2; 0.71–2.0).
The simulation study compared the performance of the first-stage, second-stage, and EM methods, considering scenarios in which group-level exposure data are correct. The better precision in OR estimates is an advantage of the EM method over the first-stage method (and, not surprisingly, over the second-stage method). In the scenarios with participation bias, the EM method eliminated the bias at the second stage by incorporating error-free probabilities of exposure (first-stage information obtained from external data sources) for the nonparticipants. The bias-free results from the EM method rely on the assumption that the individual-level exposure data on xi are missing at random within each group for the first-stage cases and controls, respectively. We allowed the participation rates among the first-stage cases and controls to differ in each group. When a stratified analysis based on covariates (such as age and sex) is performed, the missing-at-random assumption is conditional on disease status, group affiliation, and stratum; hence, the participation rate is also allowed to vary across the strata.
In practice, however, participation within an exposure group may depend not only on disease status, but also on exposure status, making inference about the OR more problematic. For example, we modified the selection of second-stage controls in our simulation setting by assuming participation rates of 17.3% and 25.9% among the exposed and unexposed, respectively, within each group (implying an overall control participation rate of 25%); the second-stage method then yielded, for a true OR = 3, an expected OR = 4.5 (= 3 × 0.259/0.173; see Rothman and Greenland2(p.356)). The EM method is expected to reduce the participation bias even if the missing-at-random assumption is violated, at least when the first-stage exposure data are accurate. In the example, the EM method yielded an OR = 3.1 (geometric mean; coverage of 95% CIs = 94.6%). As the overall control participation rate increases, and with it the relative contribution of second-stage exposure data in the EM method, less reduction of the participation bias (induced by violating the missing-at-random assumption) can be expected. Potential users should address whether the missing-at-random assumption is reasonable in their applications.
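The arithmetic behind the expected OR of 4.5 can be checked with a short sketch. It uses the standard selection-bias factor for a 2-by-2 table (the cross-ratio of the four participation probabilities) and assumes, as in the example, that case participation does not depend on exposure. The function names are hypothetical.

```python
def second_stage_expected_or(true_or, ctrl_rate_exposed, ctrl_rate_unexposed,
                             case_rate_exposed=1.0, case_rate_unexposed=1.0):
    """True OR times the selection-bias factor (cross-ratio of participation rates)."""
    factor = (case_rate_exposed * ctrl_rate_unexposed) / (
        case_rate_unexposed * ctrl_rate_exposed)
    return true_or * factor


def implied_control_exposure_prevalence(overall_rate, rate_exposed, rate_unexposed):
    """Exposure prevalence q solving q*rate_exposed + (1-q)*rate_unexposed = overall_rate."""
    return (rate_unexposed - overall_rate) / (rate_unexposed - rate_exposed)


biased = second_stage_expected_or(3.0, 0.173, 0.259)                   # about 4.49
prevalence = implied_control_exposure_prevalence(0.25, 0.173, 0.259)   # about 0.105
```

The second function shows that the stated rates (17.3%, 25.9%, overall 25%) are mutually consistent when roughly 10% of the controls are exposed.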
Our major concern with the EM method is the accuracy of the assigned group-level exposure probabilities. It is generally believed that exposure estimates based on general population job-exposure matrices are less accurate than individual exposure estimates based on occupational hygienists’ reviews. Indeed, we assumed that the individual exposures (exposed, unexposed) were correctly classified. We have previously reported how erroneous group-level probabilities can produce substantially biased ORs from the first-stage method, if the exposure has an effect.13,17 The EM method was robust to bias for the investigated error structures when the expected participation rates among the first-stage cases and controls were equal in each group (75%). We did not investigate scenarios with varying participation rates across groups in combination with erroneous group-level data; the EM method may produce a markedly biased OR under such circumstances even if the overall case and control participation rates are equal. Nevertheless, we showed that unequal participation rates among the first-stage cases and controls in each group and overall, in combination with errors in the group-level exposure data, can produce severe bias under the EM method. This bias is different from the bias encountered under the first-stage method. As an example, when the group-level exposure data were underestimations and the control distribution across the groups was bimodal, no or only marginal bias occurred from the first-stage method (Table 3). By contrast, the EM method yielded substantial bias away from the null if the participation rate among the controls (25%) was much lower than among the cases (75%). Such bias appeared because the number of exposed among the nonparticipating controls was underestimated in the EM algorithm. 
Consequently, as a result of the low participation rate among the controls, the total number of exposed controls was pronouncedly underestimated, which produced bias away from the null.
In the empiric example on organic solvent exposure and chronic myeloid leukemia, the effect estimate from the EM method, using all available control data (n = 765; 29% participated), was positively biased compared with the estimate from the first-stage method. Among the interviewed controls, the exposure prevalence was 17% according to the second-stage data (xi) compared with only 4% according to the first-stage probabilities (XG), which gave a clear indication of discrepancies between the first- and second-stage exposure assessment methods. We believe that errors in the first-stage exposure probabilities, originating from both the censuses and the job-exposure matrix, can explain the elevated estimate from the EM method; bias was introduced by the very different participation rates among the cases and controls. Accordingly, when only the first contacted control in each matched set was included in the analysis, implying more equal case/control participation rates (76%/59%), the OR estimate from the EM method was not noticeably elevated.
We recommend that potential users of the EM method perform a sensitivity analysis with respect to the assigned group-level exposure probabilities.2(pp.343–357) Table 3 can be helpful when addressing sensitivity to errors. The results are based on 12 exposure groups defined by various mean exposure probabilities (0, 0.05, 0.15,..., 0.85, 0.95, 1), which were chosen with consideration of our empiric example. In other applications, the number of assigned mean probabilities based on a job-exposure matrix may well be fewer (in the range of 3–5)18; the EM method may then be more sensitive to assignment errors.
In some study settings, relevant group-level exposure data are unavailable. Nevertheless, it might be possible to stratify the first-stage cases and controls on group affiliation and then estimate the group-level exposure probabilities based on the individual-level exposure data collected for the participating controls in each group. The accuracy of the estimated group-level exposures relies on the missing-at-random assumption (and the accuracy of the individual-level exposure data). This approach is basically the same as constructing a population-specific job-exposure matrix from interviews of a job-stratified sample of subjects.19 In our empiric example, however, each subject's group affiliation was essentially unique; it was based on a detailed job history for a 20-year period obtained from censuses. The EM method based on collected individual-level exposure data only may rely on additional programming to derive a valid CI around the OR estimate in the final step.20 By contrast, in our proposed EM method, which incorporates external group-level exposure information, a 95% CI is calculated by standard techniques based on a 2-by-2 table (or stratified 2-by-2 tables) with the cell frequencies obtained in the final expectation step of the algorithm. Our simulations demonstrated accurate coverage of such conventionally calculated 95% CIs in error-free scenarios.
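One standard technique for a 2-by-2 table is the logit-based (Woolf) interval, sketched below. The text does not state which technique was used, so the choice of Woolf's method, the function name, and the example frequencies are assumptions made for illustration. The cell frequencies may be non-integer when they come from an expectation step.

```python
import math


def woolf_ci(a, b, c, d, z=1.96):
    """OR and logit-based CI from a 2-by-2 table.

    a, b: exposed and unexposed cases; c, d: exposed and unexposed controls.
    All four frequencies must be positive.
    """
    or_hat = (a * d) / (b * c)
    se_log_or = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    lower = math.exp(math.log(or_hat) - z * se_log_or)
    upper = math.exp(math.log(or_hat) + z * se_log_or)
    return or_hat, lower, upper


# Hypothetical expected frequencies from a final E step.
or_hat, lower, upper = woolf_ci(30.0, 70.0, 10.0, 90.0)
```

Treating expected frequencies as if they were observed ignores the uncertainty in the imputed part of the table, which is why the coverage of such intervals is something to verify by simulation, as reported above for the error-free scenarios.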
The EM method could be applicable in other studies in occupational and environmental epidemiology. We provide 2 examples, with grouping based on residential area. The first example is related to our population-based case-control study on leukemia, considering work as a farmer or farmhand to be a proxy for the exposures of interest (a protective effect associated with such agricultural life was suggested10). Information on the proportion of farmers and farmhands in each municipality may be obtained. At the second stage, interviews can provide individual information on current and past occupations, to assess accurately whether a subject has been working on a farm for a sufficient time period. The second-stage data collection may, however, suffer from selective participation. In other applications, it might be necessary to generalize the EM method for handling polytomous or continuous exposure variables.9,20 There are also important design considerations, such as efficient sampling of second-stage subjects.12,21,22 As another example, consider cases of airway diseases and controls selected at the first stage. Assume that the first-stage subjects can be classified into exposure groups based on ambient monitoring of pollutants in the subjects’ current residential areas. A questionnaire including items on residence history, usual indoor and outdoor times and activities, and so on, may then be administered to the second-stage subjects.21
1. Olson SH, Voigt LF, Begg CB, et al. Reporting participation in case-control studies. Epidemiology
2. Rothman KJ, Greenland S, eds. Modern Epidemiology, 2nd ed. Philadelphia: Lippincott-Raven; 1998.
3. Kauppinen T, Toikkanen J, Pukkala E. From cross-tabulations to multipurpose exposure information systems: a new job-exposure matrix. Am J Ind Med
4. Kauppinen T, Toikkanen J, Pedersen D, et al. Occupational exposure to carcinogens in the European Union. Occup Environ Med
5. Ihrig MM, Shalat SL, Baynes C. A hospital-based case-control study of stillbirths and environmental exposure to arsenic using an atmospheric dispersion model linked to a geographical information system. Epidemiology
6. Bobak M, Leon DA. The effect of air pollution on infant mortality appears specific for respiratory causes in the postneonatal period. Epidemiology
7. Nyberg F, Gustavsson P, Jarup L, et al. Urban air pollution and lung cancer in Stockholm. Epidemiology
8. Richiardi L, Boffetta P, Merletti F. Analysis of nonresponse bias in a population-based case-control study on lung cancer. J Clin Epidemiol
9. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B
10. Björk J, Albin M, Welinder H, et al. Are occupational, hobby, or lifestyle exposures associated with Philadelphia chromosome positive chronic myeloid leukaemia? Occup Environ Med
11. White JE. A two stage design for the study of the relationship between a rare exposure and a rare disease. Am J Epidemiol
12. Cain KC, Breslow NE. Logistic regression analysis and efficient design for two-stage studies. Am J Epidemiol
13. Björk J, Strömberg U. Effects of systematic exposure assessment errors in partially ecologic case-control studies. Int J Epidemiol
14. Bouyer J, Hémon D. Comparison of three methods of estimating odds ratios from a job exposure matrix in occupational case-control studies. Am J Epidemiol
15. Brenner H, Savitz DA, Jockel KH, et al. Effects of nondifferential exposure misclassification in ecologic studies. Am J Epidemiol
16. EGRET for Windows. Software for the Analysis of Biomedical and Epidemiological Studies. User Manual. Cambridge, MA: CYTEL Software Corp; 1999:148–149.
17. Webster T. Commentary: does the spectre of ecologic bias haunt epidemiology? Int J Epidemiol
18. Kernan GJ, Ji BT, Dosemeci M, et al. Occupational risk factors for pancreatic cancer: a case-control study based on death certificates from 24 US states. Am J Ind Med
19. Kromhout H, Heederik D, Dalderup LM, et al. Performance of two job-exposure matrices in a study of lung cancer morbidity in the Zutphen cohort. Am J Epidemiol
20. Wacholder S, Weinberg CR. Flexible maximum likelihood methods for assessing joint effects in case-control studies with complex sampling. Biometrics
21. Navidi W, Thomas D, Stram D, et al. Design and analysis of multilevel analytic studies with applications to a study of air pollution. Environ Health Perspect. 1994;102(suppl 8):25–32.
22. Weinberg CR, Wacholder S. The design and analysis of case-control studies with biased sampling. Biometrics
23. Björk J, Strömberg U. Attributable fraction estimation in partially ecologic case-control studies. Epidemiology
Formulae for Generating Group- and Individual-Level Exposures
Let P(XG|D = 1) and P(XG|D = 0) denote the proportions in a group (defined by XG) among the first-stage cases (with D = 1 denoting the presence of disease) and controls (D = 0), respectively. The relation
$$P(X_G \mid D = 1) = \frac{(1 + \beta X_G)\, P(X_G \mid D = 0)}{\sum_{X'_G} (1 + \beta X'_G)\, P(X'_G \mid D = 0)},$$
where β = OR – 1 is the excess OR (exposed vs. unexposed) is true provided that the OR can be interpreted as a relative risk, the group-level exposure probabilities are error-free, and confounding is absent.13,14 Hence, based on a theoretical group distribution of the population (ie, specified values on P(XG|D = 0) for XG = 0, 0.05, 0.15,..., 0.85, 0.95, 1) and the true OR (= 1 + β), the values on each P(XG|D = 1) can be calculated.
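The calculation of the case distribution from the control distribution can be sketched as follows, assuming the linear OR model in which a group with exposure probability XG has OR = 1 + β XG relative to a fully unexposed group, so that P(XG|D = 1) is proportional to (1 + β XG) P(XG|D = 0). The function name and the uniform control distribution are illustrative choices.

```python
GROUP_PROBS = [0.0, 0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95, 1.0]


def case_group_distribution(control_dist, beta):
    """Map {XG: P(XG|D=0)} to {XG: P(XG|D=1)} under the linear OR model."""
    weights = {x: q * (1.0 + beta * x) for x, q in control_dist.items()}
    total = sum(weights.values())  # normalizing constant
    return {x: w / total for x, w in weights.items()}


# Uniform control distribution over the 12 groups; true OR = 3, so beta = 2.
uniform_controls = {x: 1.0 / len(GROUP_PROBS) for x in GROUP_PROBS}
case_dist = case_group_distribution(uniform_controls, beta=2.0)
```

With β = 0 the case and control distributions coincide; with β > 0 the cases are shifted toward the groups with high exposure probabilities.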
Individual-level exposures (xi) for the cases and controls were generated from Bernoulli variables with correct exposure probabilities imposed on the groups. If the true exposure probability for a control is p, then the probability for a case in the same group is23
$$\frac{p \cdot \mathrm{OR}}{1 + p\,(\mathrm{OR} - 1)}. \qquad \text{(A1)}$$
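A minimal sketch of this data-generating step is given below. It assumes the case exposure probability implied by the OR, namely p × OR / (1 + p(OR − 1)) for a control probability p, and draws Bernoulli exposures with Python's standard random module. The seed, the sample sizes, and the function names are arbitrary choices for the illustration.

```python
import random


def case_exposure_probability(p, odds_ratio):
    """Exposure probability for a case in a group where controls have probability p."""
    return p * odds_ratio / (1.0 + p * (odds_ratio - 1.0))


def generate_exposures(n, prob, rng):
    """Draw n independent Bernoulli(prob) exposure indicators (1 = exposed)."""
    return [1 if rng.random() < prob else 0 for _ in range(n)]


rng = random.Random(2004)  # arbitrary seed, for reproducibility only
control_x = generate_exposures(1000, 0.25, rng)
case_x = generate_exposures(1000, case_exposure_probability(0.25, 3.0), rng)
```

For p = 0.25 and OR = 3 the case probability is 0.5, so about half of the simulated cases in that group are exposed.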
Let Njp denote the number of subjects with disease status D = j in a group with group-level exposure probability XG = p. Let njkp denote the number of the Njp subjects with known individual exposure status xi = k (second-stage participants); mjp = Njp – nj0p – nj1p is the number with unknown individual exposure status (nonparticipants).
The EM method is carried out as follows:
Expectation (E) step
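For a given value of the OR (OR*), the expected exposure probability for a case with XG = p, denoted π_p here, is
$$\pi_p = \frac{p \cdot \mathrm{OR}^*}{1 + p\,(\mathrm{OR}^* - 1)}$$
(see equation A1). Thus, the expected total number of exposed (xi = 1) cases (D = 1) is
$$\sum_p \left( n_{11p} + m_{1p}\,\pi_p \right),$$
the expected total number of unexposed (xi = 0) cases (D = 1) is
$$\sum_p \left( n_{10p} + m_{1p}\,(1 - \pi_p) \right),$$
the expected total number of exposed (xi = 1) controls (D = 0) is
$$\sum_p \left( n_{01p} + m_{0p}\,p \right),$$
and the expected total number of unexposed (xi = 0) controls (D = 0) is
$$\sum_p \left( n_{00p} + m_{0p}\,(1 - p) \right).$$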
Maximization (M) step
Calculate the maximum likelihood estimate of the OR, using the frequencies calculated in the E step.
The procedure is repeated (with OR* set to the current point estimate of the OR) until convergence (convergence criterion: change of OR* < 0.0001).
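The E and M steps can be sketched in a few lines of Python. The sketch assumes that nonparticipating cases in a group with XG = p are expected to be exposed with probability p × OR* / (1 + p(OR* − 1)) and nonparticipating controls with probability p, and that the M step is the cross-product ratio of the resulting 2-by-2 table. The Group container, the starting value OR* = 1, and the iteration cap are hypothetical choices that the text does not specify.

```python
from dataclasses import dataclass


@dataclass
class Group:
    p: float    # assigned group-level exposure probability XG
    n11: float  # participating cases, exposed
    n10: float  # participating cases, unexposed
    m1: float   # nonparticipating cases (exposure unknown)
    n01: float  # participating controls, exposed
    n00: float  # participating controls, unexposed
    m0: float   # nonparticipating controls (exposure unknown)


def e_step(groups, or_star):
    """Expected 2-by-2 table: exposed/unexposed cases, exposed/unexposed controls."""
    a = b = c = d = 0.0
    for g in groups:
        pi = g.p * or_star / (1.0 + g.p * (or_star - 1.0))
        a += g.n11 + g.m1 * pi
        b += g.n10 + g.m1 * (1.0 - pi)
        c += g.n01 + g.m0 * g.p
        d += g.n00 + g.m0 * (1.0 - g.p)
    return a, b, c, d


def em_odds_ratio(groups, start=1.0, tol=1e-4, max_iter=1000):
    """Alternate E and M steps until OR* changes by less than tol."""
    or_star = start
    for _ in range(max_iter):
        a, b, c, d = e_step(groups, or_star)
        new_or = (a * d) / (b * c)  # M step; requires b > 0 and c > 0
        if abs(new_or - or_star) < tol:
            return new_or, (a, b, c, d)
        or_star = new_or
    raise RuntimeError("EM did not converge within max_iter iterations")
```

The returned table holds the frequencies from the last expectation step. With no nonparticipants the procedure reduces to the ordinary cross-product ratio of the observed table.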
Calculation of Expected OR Estimate From the EM Method
We here use the same notation as in Appendix B. The total numbers of first-stage cases and controls, together with the specified distribution of the first-stage controls across the groups and the true OR, yield the expected values on Njp (see Appendix A). Each expected value on njkp can be calculated from the group-specific participation rates among the first-stage cases and controls, the true group-level exposure value XG = p, and the true OR (which is taken into account for the participating cases; see equation A1). Then, based on the specified (erroneous) values on XG, the EM algorithm can be carried out as described in Appendix B; the final OR estimate corresponds to the expected OR.
© 2004 Lippincott Williams & Wilkins, Inc.
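The calculation can be sketched as below: expected frequencies are built from the true group-level probabilities and the true OR, and the EM iteration is then run with the assigned (possibly erroneous) probabilities. For brevity the sketch uses one participation rate for all case groups and one for all control groups, although the text allows group-specific rates; it also assumes the linear OR model for the case distribution and a starting value OR* = 1. All names are hypothetical.

```python
GROUP_PROBS = [0.0, 0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95, 1.0]


def case_prob(p, odds_ratio):
    """Exposure probability for a case, given control probability p."""
    return p * odds_ratio / (1.0 + p * (odds_ratio - 1.0))


def expected_counts(true_probs, control_dist, true_or, n_cases, n_controls,
                    case_rate, control_rate):
    """Expected n11, n10, m1, n01, n00, m0 per group, from the TRUE probabilities."""
    weights = [q * (1.0 + (true_or - 1.0) * p) for p, q in zip(true_probs, control_dist)]
    total = sum(weights)
    rows = []
    for p, q, w in zip(true_probs, control_dist, weights):
        big_n1, big_n0 = n_cases * w / total, n_controls * q
        pc = case_prob(p, true_or)
        rows.append((big_n1 * case_rate * pc, big_n1 * case_rate * (1.0 - pc),
                     big_n1 * (1.0 - case_rate),
                     big_n0 * control_rate * p, big_n0 * control_rate * (1.0 - p),
                     big_n0 * (1.0 - control_rate)))
    return rows


def expected_em_or(rows, assigned_probs, tol=1e-4, max_iter=10000):
    """Run the EM iteration on expected counts, using the ASSIGNED probabilities."""
    or_star = 1.0
    for _ in range(max_iter):
        a = b = c = d = 0.0
        for (n11, n10, m1, n01, n00, m0), p in zip(rows, assigned_probs):
            pi = case_prob(p, or_star)
            a += n11 + m1 * pi
            b += n10 + m1 * (1.0 - pi)
            c += n01 + m0 * p
            d += n00 + m0 * (1.0 - p)
        new_or = (a * d) / (b * c)
        if abs(new_or - or_star) < tol:
            return new_or
        or_star = new_or
    raise RuntimeError("EM did not converge within max_iter iterations")


uniform = [1.0 / len(GROUP_PROBS)] * len(GROUP_PROBS)
rows = expected_counts(GROUP_PROBS, uniform, 3.0, 1000, 1000, 0.75, 0.25)
error_free = expected_em_or(rows, GROUP_PROBS)
# Hypothetical error structure: every assigned probability is half the true one.
underestimated = expected_em_or(rows, [0.5 * p for p in GROUP_PROBS])
```

With error-free assigned probabilities the true OR is a fixed point of the iteration, so the expected OR equals the true OR even when the case and control participation rates differ. Replacing the assigned probabilities with erroneous values shows the bias for a given error structure; the halving used here is only an example and is not one of the error structures in Figure 1.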