In case-control studies with exposure data obtained from interviews, participation is an issue of concern. Use of external group-level exposure information, available for all cases and controls (including nonparticipants), can reduce participation bias and improve precision of effect estimates. Our methodologic investigation was motivated by a population-based case-control study on occupational exposures and leukemia. We assessed exposure using dichotomous data collected in interviews, and also using census data on past and current occupational groups for all subjects. Based on information from a job-exposure database, a group-level probability of exposure was assigned to each subject. We studied the performance of the iterative expectation-maximization method for estimating the odds ratio (OR) by using the individual-level exposure data on the interviewed participants together with the assigned group-level exposure probabilities for the nonparticipants. In each iteration, the expected numbers of exposed and unexposed among the nonparticipating cases and controls were calculated from their assigned exposure probabilities and, for the cases only, from the current OR estimate. We then estimated the OR based on the total (observed plus expected) numbers and repeated the procedure until convergence. The expectation-maximization method eliminated participation bias and improved precision for scenarios with error-free group-level exposures and individual-level exposure data missing at random conditional on disease status and group affiliation. We specifically addressed consequences of assigning erroneous exposure probabilities to the nonparticipating subjects. In such situations, the expectation-maximization method can produce biased estimates if the participation rates among the cases and controls differ substantially.
From the *Department of Occupational and Environmental Medicine, Lund University, and the †Competence Centre for Clinical Research, Lund University Hospital, Lund, Sweden.
Submitted 17 February 2003; final version accepted 12 March 2004.
Financial support provided by the Swedish Council for Working Life and Social Research (grant nos. 2001–1188 and 2002–0097).
Correspondence: Ulf Strömberg, Department of Occupational and Environmental Medicine, University Hospital, SE-221 85 Lund, Sweden. E-mail: email@example.com
Nonresponse in case-control studies is a ubiquitous problem.1 Participation (self-selection) bias arises if the relation between exposure and disease is different for those who participate.2(p.119) This can occur if, for example, exposed cases are more likely to participate than exposed controls. Such bias may not be controllable as a result of limited information on the factors affecting participation.
Nonetheless, population registers and census data can, in some countries, provide partial information on virtually all cases and controls, such as general individual characteristics (eg, age and sex) and group affiliations (eg, occupational group or residential area). Such grouping is relevant in population-based case-control studies on occupational and environmental exposures. Group-level exposure information may be obtained from an exposure database. For example, a job-exposure matrix can supply group-level information such as exposure proportions or average intensities for various occupational groups.3,4 Similarly, a geographic information system can supply average concentrations of air pollutants for various residential areas.5–7 Occupational or residential affiliation may be a primary determinant not only of exposure, but also of participation, such as in situations in which socioeconomic status is associated with nonresponse.8 Appropriate use of complementary group-level exposure information may reduce participation bias and improve precision of effect estimates.
This article proposes a method for estimating the odds ratio (OR) given dichotomous, or appropriately dichotomized, individual-level exposure data for the participants as well as exposure proportions on the group level for the nonparticipants. Based on the group-level information, a probability of exposure is assigned to each subject. We use an iterative estimation method, the expectation-maximization (EM) method,9 which takes into account the assigned exposure probabilities for the nonparticipants. We study the performance of the EM method using simulated data. We quantify the bias implied by incorporating erroneous group-level exposure information. We apply the EM method using empiric data from a population-based case-control study on occupational exposures and chronic myeloid leukemia.10
Simulation Study—Scenarios in Which Group-Level Exposure Data Are Correct
We consider a population-based case-control study of a rare disease in which the incident cases during a recruitment period are included. Those cases are referred to as the first-stage cases. The first-stage controls are sampled randomly among those who remain free of disease at the end of the recruitment period.2(pp.110–111) All first-stage cases and first-stage controls are eligible for the interview portion of the study. We assume that no selection bias arises from the recruitment of the first-stage cases and controls. The first-stage subjects can be classified into exposure groups with various exposure proportions XG. Assuming a rare disease, XG should reflect the exposure probability for the disease-free subjects in each group of the population. Within each group, the first-stage cases and controls are assigned the same value on XG. The second-stage cases and controls are those who participate in interviews and thereby provide individual-level exposure data (xi = 1, exposed; xi = 0, unexposed). Thus, we consider a 2-stage case-control study,11,12 in which XG is the first-stage exposure variable (with no missing values) and xi is the second-stage exposure variable (with data only for the participants).
We assume that data on xi are missing at random, conditional on group affiliation and disease status. Thus, within each group, participation (in interview) among the first-stage cases as well as among the first-stage controls is unrelated to exposure status on the individual level. Moreover, the case and control participation rates can differ in each group and can differ overall. Consequently, participation bias is absent within each group, but not necessarily absent overall if the case and control participation rates vary differently across the groups. For example, if the control participation rate decreases with XG, the expected exposure prevalence among the second-stage (participating) controls is lower than in the total population.
In each simulation, we included 400 first-stage cases and 1200 first-stage controls distributed over 12 groups, with the possible values of XG restricted to 0, 0.05, 0.15,..., 0.85, 0.95, 1. We assume that each subject is placed in 1 of the 12 groups based on the subject's occupation. For example, if the proportion exposed is 5% for occupation A according to the exposure database, a subject with occupation A is placed in the group with XG = 0.05. In practice, a group may be defined by an interval of exposure probabilities (eg, 0.01–0.10); XG reflects the mean probability (eg, 0.05) in each group.13 The first-stage controls were distributed over the groups according to specified group affiliation probabilities (Table 1); this skewed control distribution across the groups was chosen to resemble the empiric example presented subsequently. The first-stage cases were distributed over the groups correspondingly. For a true OR >1, higher proportions of the cases than of the controls can be expected to belong to the groups with XG > 0 (Appendix A).
Of the 400 first-stage cases, 300 second-stage (participating) cases were selected completely at random in each simulation (participation rate = 75%). Of the 1200 first-stage controls, we selected 300 second-stage controls (25%). Two types of scenarios were simulated: without and with participation bias. In the scenarios without participation bias, the second-stage controls were selected completely at random. We introduced participation bias by decreasing or increasing the control participation rate with XG (Table 1). Data on xi for the second-stage participants were generated from Bernoulli variables with correct exposure probabilities imposed on the groups. If the exposure implies an increase in risk (OR >1), the actual exposure probability for a case is greater than the true exposure probability for a control within the same group with 0 < XG < 1 (Appendix A).
We assumed 3 true ORs: 1, 2, and 3. For each scenario, we carried out 1000 replications.
Estimates of the OR were obtained from 3 methods: (1) first-stage method (analysis of first-stage data only, for all subjects); (2) second-stage method (analysis of second-stage data only, for the participants); and (3) EM method.
When the disease is rare, the association between the first-stage exposure variable XG and the disease odds is linear in the absence of confounding:
\[ \text{odds}(D = 1 \mid X_G) = \alpha\,(1 + \beta X_G), \]
where α is a positive constant.13,14 For effect estimation with the first-stage method, we thus used the linear OR model,
\[ \mathrm{OR}(X_G) = 1 + \beta X_G, \tag{1} \]
where OR = 1 + β is the odds ratio comparing the exposed with the unexposed. The maximum likelihood estimate of β together with the standard error [SE(β̂)] were obtained as described by Björk and Strömberg.13 An approximate 95% confidence interval (CI) for the OR was calculated as described in the same reference.13
In the second-stage method, the maximum likelihood estimate of the OR and a 95% CI (Wald limits) were obtained conventionally from a 2-by-2-table.2(pp.241–246)
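The conventional 2-by-2-table calculation referred to here can be sketched as follows. This is a minimal illustration, assuming the usual cross-product estimate and Wald limits on the log scale; the function name and the counts are hypothetical and are not taken from the study.

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.96):
    """OR and Wald limits from a 2-by-2 table.

    a = exposed cases,    b = unexposed cases,
    c = exposed controls, d = unexposed controls.
    All four cells must be positive.
    """
    or_hat = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper


# Hypothetical counts, for illustration only (not data from the study)
or_hat, lower, upper = odds_ratio_wald(60, 240, 30, 270)
```

The same function can be applied to noninteger cell frequencies, which is how the EM method uses the observed plus expected numbers.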
The EM method is an iterative method. In each iteration, the expected numbers of exposed and unexposed among the nonparticipating cases and controls, respectively, were calculated from their assigned exposure probabilities in the first stage and, for the cases only, the current OR estimate. We then estimated the OR based on the total (observed + expected) numbers and repeated the procedure until convergence (Appendix B). The OR estimate was obtained from a 2-by-2-table in each maximization step of the EM algorithm and, in the last step, a 95% CI (Wald limits) was calculated from the final 2-by-2-table.
Unbiased OR estimates were obtained with all methods in the scenarios without participation bias (Table 2). The standard deviations (SD) of the logarithmically transformed OR estimates produced by the EM method were 17% to 24% lower than with the first-stage method and 30% to 35% lower than with the second-stage method. The EM method yielded 95% CIs with accurate coverage.
A bias away from the null appeared from the second-stage analyses if the control participation rate decreased with XG (Table 2); exposed second-stage controls were thereby underrepresented (7.2% compared with the expected exposure prevalence of 10.4%). On the other hand, a negative bias appeared if the control participation rate increased with XG. By using the EM method, such participation bias was eliminated. Compared with the first-stage method, the EM method yielded better precision.
Bias Assessment—Scenarios in Which Group-Level Exposure Data Are Erroneous
The conditions here are the same as in the simulation study, except that we allow erroneous group-level exposure data on XG. We studied 3 different error structures: (1) systematic underestimations (Fig. 1A); (2) systematic overestimations (Fig. 1B); and (3) perfectly negative correlation between the true values and the differences between assigned and true values (Fig. 1C). The latter error structure arises if the exposure proportions from the external data source are established with constant sensitivity and specificity (here, both 90%) across all groups.13,15 Furthermore, we studied 3 different distributions of the first-stage controls across the groups: (1) the distribution used in the simulation study, ie, controls concentrated in groups with low exposure proportions, referred to as the skewed distribution (Table 2; Fig. 2A; true overall exposure prevalence among the controls = 10.4%); (2) uniformly distributed (Fig. 2B; 50.0%); and (3) controls concentrated in groups with low as well as high exposure proportions, referred to as the bimodal distribution (Fig. 2C; 30.8%). Skewed distributions are common for occupational and environmental exposures in population-based studies. An example of an exposure that is likely to have a bimodal distribution is animal dust (a proxy for animal-borne viruses), which is highly prevalent in certain occupational groups (eg, farmers, farmhands, and butchers). We also varied the true OR (1, 2, and 3) and the participation rate among the first-stage controls (25%, 50%, 75%, and 90%). The participation rate among the first-stage cases was 75% in all scenarios. The second-stage cases and controls (participants) were selected completely at random. Hence, participation bias was absent.
In each scenario, we derived the expected OR estimate based on the EM method (Appendix C). We also report the expected OR estimates yielded by the first-stage method.13
For the investigated errors in the group-level exposure XG, the first-stage method generally implied bias away from the null when the exposure had an effect (Table 3). The EM method was robust to errors in XG when the case and control participation rates at the second stage were equal (both 75%). By contrast, with unequal case/control participation rates, marked bias generally occurred in both directions, even when the exposure had no effect. When the errors in the group-level data were consistent underestimations (Fig. 1A) or overestimations (Fig. 1B), the direction and magnitude of the bias were determined largely by control participation relative to case participation at the second stage, whereas varying the distribution of the controls across the groups (Figs. 2A–C) had less impact (Table 3). When truly low exposure proportions were overestimated and truly high exposure proportions were underestimated (Fig. 1C), the control distribution across the groups, as well as the control participation rate, clearly influenced the resulting bias. Only marginal bias resulted from such errors when the distribution was uniform, whereas marked bias occurred when the exposure distribution was skewed; modest bias occurred for the bimodal distribution.
Björk et al.10 conducted a population-based case-control study for a series of 255 Philadelphia chromosome-positive cases of chronic myeloid leukemia from Southern Sweden, 1976–1993. For each case, 3 age- and sex-matched controls were sampled randomly from the population at risk at the time of diagnosis. For each subject, we obtained first-stage data as occupational titles for 1960, 1970, 1975, 1980, 1985, and 1990 from national censuses. Jobs held at a census were assumed to be held until the following census. We here concentrate on occupational exposure to organic solvents. To assign the first-stage exposure probabilities, we used a Swedish translation of a Finnish job-exposure matrix.3 In addition to occupational title and agent, the matrix incorporates calendar year as a third dimension. Up to 5 exposure probabilities were assigned to each individual i: XG(1),i, XG(2),i, XG(3),i, XG(4),i, XG(5),i; viz. 1 for each time epoch (with respect to the censuses) during the 20-year time period before the year of diagnosis. Each subject's exposure probability was then calculated as
\[ X_{G,i} = 1 - \prod_{t=1}^{5} \bigl(1 - X_{G(t),i}\bigr). \]
We used the individually assigned exposure probabilities as the first-stage exposure data. The underlying assumption of independence over time epochs may be too strong in practice. Using the maximum exposure probability (of XG(1),i, XG(2),i, XG(3),i, XG(4),i, XG(5),i) instead did not alter the results noticeably.
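The two ways of summarizing a work history mentioned here can be sketched in a few lines. This sketch assumes that the combined probability is the probability of exposure in at least 1 epoch under independence across epochs, which is how the independence assumption above is read; the function names and the example probabilities are hypothetical.

```python
from math import prod


def ever_exposed_probability(epoch_probs):
    """Probability of exposure in at least one epoch, assuming independence."""
    return 1.0 - prod(1.0 - p for p in epoch_probs)


def max_exposure_probability(epoch_probs):
    """Alternative summary: the largest epoch-specific probability."""
    return max(epoch_probs)


# Hypothetical work history with 5 epoch-specific probabilities
history = [0.1, 0.1, 0.0, 0.2, 0.0]
combined = ever_exposed_probability(history)   # 1 - 0.9 * 0.9 * 0.8
largest = max_exposure_probability(history)
```

The combined value is never smaller than the maximum, so the two summaries differ most for subjects with moderate probabilities in several epochs.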
At the second stage, all cases and 1 randomly selected control in each matched set were selected for structured telephone interviews. If a control declined to participate, another control from the matched set was contacted. Occupational hygienists assessed individual exposures to organic solvents based on the interview information. The numbers of participants who contributed individual-level exposure data on organic solvents were 195 cases (76% of the 255 first-stage cases) and 219 controls (29% of the 765 first-stage controls). Among the first-stage controls who were first contacted for interview in each matched set (n = 255), 151 (59%) contributed individual-level exposure data.
Effect Estimation, Including Covariates
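We estimated the age- and sex-adjusted OR based on the first-stage data by using an extension of the simple linear OR model (equation 1), viz. the additive-multiplicative logistic regression model13,14:
\[ \text{odds}(D = 1 \mid X_G, S_1, \ldots, S_J) = \alpha\,(1 + \beta X_G)\,\exp\Bigl(\sum_{j=1}^{J} \gamma_j S_j\Bigr), \]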
where S1,..., SJ are indicators for relevant strata of age and sex; γj is the log-transformed OR associated with stratum j. We used 6 strata of age and sex (age groups ≤54, 55–69, and 70+ years; separate for males and females). There are 2 underlying assumptions: (1) a common OR (exposed vs. unexposed) across the strata (ie, no effect modification) and (2) the exposure probability for a given work history i (XG(1),i, XG(2),i, XG(3),i, XG(4),i, XG(5),i) does not depend on the covariates (age and sex in our example). Alternatively, one can use the conditional additive-multiplicative logistic regression model based on the matched sets of cases and controls.16 Estimates of β and SE(β) can be obtained in the software package EGRET for Windows (Cytel Software Corp., Cambridge, MA).
Generalizations of the second-stage and EM methods for handling stratified data are straightforward. The computations use stratum-specific numbers of exposed and unexposed among the cases and controls (observed numbers in the second-stage method and observed + expected numbers in the EM method). Then the OR with a 95% CI can be estimated by using logistic regression (assuming a common OR across the strata).
The OR estimate for chronic myeloid leukemia associated with occupational exposure to organic solvents was 0.75 (95% CI = 0.30–1.9) based on the first-stage exposure probabilities. A similar estimate was obtained from the conditional additive-multiplicative logistic regression model (OR = 0.78; CI = 0.32–1.9), using the first-stage exposure data with respect to the matched sets of cases and controls. A somewhat higher estimate (1.1; 0.63–1.8) was obtained based on the second-stage (interview) data. Among the participating controls, the exposure prevalence was 17% according to the second-stage data compared with only 4% according to the first-stage probabilities. The first-stage exposure probabilities were even lower among the nonparticipating controls (mean 2%), indicating some association between group-level exposure probability and participation. The EM method based on all available data yielded an elevated OR estimate of 2.2 (1.4–3.4). Including only the first contacted control in each matched set in the analysis implies more equal case/control participation rates; the EM method then yielded a lower OR estimate (1.2; 0.71–2.0).
The simulation study compared the performance of the first-stage, second-stage, and EM methods, considering scenarios in which group-level exposure data are correct. The better precision in OR estimates is an advantage of the EM method over the first-stage method (and, not surprisingly, over the second-stage method). In the scenarios with participation bias, the EM method eliminated the bias at the second stage by incorporating error-free probabilities of exposure (first-stage information obtained from external data sources) for the nonparticipants. The bias-free results from the EM method rely on the assumption that the individual-level exposure data on xi are missing at random within each group for the first-stage cases and controls, respectively. We allowed the participation rates among the first-stage cases and controls to differ in each group. When a stratified analysis based on covariates (such as age and sex) is performed, the missing-at-random assumption is conditional on disease status, group affiliation, and stratum; hence, the participation rate is also allowed to vary across the strata.
In practice, however, participation within an exposure group may depend not only on disease status, but also on exposure status, making inference about the OR more problematic. For example, we modified the selection of second-stage controls in our simulation setting by assuming participation rates of 17.3% and 25.9% among the exposed and unexposed, respectively, within each group (implying overall control participation rate = 25%); the second-stage method then yielded, for a true OR = 3, an expected OR = 4.5 (= 3 × 0.259/0.173; see Rothman and Greenland2(p.356)). The EM method is expected to reduce the participation bias even if the missing-at-random assumption is violated, at least when the first-stage exposure data are accurate. In the example, the EM method yielded an OR = 3.1 (geometric mean; coverage of 95% CIs = 94.6%). By increasing the overall control participation rate, and thereby the relative contribution of second-stage exposure data in the EM method, less reduction of the participation bias (induced by violating the missing-at-random assumption) can be expected. Potential users should address whether the missing-at-random assumption is reasonable in their applications.
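The arithmetic behind this example can be checked directly. The short sketch below uses the exposure prevalence of 10.4% reported for the skewed control distribution and assumes that case participation is unrelated to exposure, so that the second-stage OR is inflated by the ratio of the unexposed to the exposed control participation rates; the variable names are illustrative.

```python
prevalence = 0.104        # expected exposure prevalence among first-stage controls
rate_exposed = 0.173      # participation rate among exposed controls
rate_unexposed = 0.259    # participation rate among unexposed controls
true_or = 3.0

# Overall control participation rate implied by the two exposure-specific rates
overall_rate = prevalence * rate_exposed + (1 - prevalence) * rate_unexposed

# Expected second-stage OR: true OR times the selection-bias factor
expected_or = true_or * rate_unexposed / rate_exposed
```

Both quantities agree with the text after rounding: an overall control participation rate of about 25% and an expected OR of about 4.5.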
Our major concern with the EM method is the accuracy of the assigned group-level exposure probabilities. It is generally believed that exposure estimates based on general population job-exposure matrices are less accurate than individual exposure estimates based on occupational hygienists’ reviews. Indeed, we assumed that the individual exposures (exposed, unexposed) were correctly classified. We have previously reported how erroneous group-level probabilities can produce substantially biased ORs from the first-stage method, if the exposure has an effect.13,17 The EM method was robust to bias for the investigated error structures when the expected participation rates among the first-stage cases and controls were equal in each group (75%). We did not investigate scenarios with varying participation rates across groups in combination with erroneous group-level data; the EM method may produce a markedly biased OR under such circumstances even if the overall case and control participation rates are equal. Nevertheless, we showed that unequal participation rates among the first-stage cases and controls in each group and overall, in combination with errors in the group-level exposure data, can produce severe bias under the EM method. This bias is different from the bias encountered under the first-stage method. As an example, when the group-level exposure data were underestimations and the control distribution across the groups was bimodal, no or only marginal bias occurred from the first-stage method (Table 3). By contrast, the EM method yielded substantial bias away from the null if the participation rate among the controls (25%) was much lower than among the cases (75%). Such bias appeared because the number of exposed among the nonparticipating controls was underestimated in the EM algorithm. 
Consequently, because of the low participation rate among the controls, the total number of exposed controls was markedly underestimated, which produced bias away from the null.
In the empiric example on organic solvent exposure and chronic myeloid leukemia, the effect estimate from the EM method, using all available control data (n = 765; 29% participated), was positively biased compared with the estimate from the first-stage method. Among the interviewed controls, the exposure prevalence was 17% according to the second-stage data (xi) compared with only 4% according to the first-stage probabilities (XG), which gave a clear indication of discrepancies between the first- and second-stage exposure assessment methods. We believe that errors in the first-stage exposure probabilities, originating from both the censuses and the job-exposure matrix, can explain the elevated estimate from the EM method; bias was introduced by the very different participation rates among the cases and controls. Accordingly, when only the first contacted control in each matched set was included in the analysis, implying more equal case/control participation rates (76%/59%), the OR estimate from the EM method was not noticeably elevated.
We recommend that potential users of the EM method perform a sensitivity analysis with respect to the assigned group-level exposure probabilities.2(pp.343–357) Table 3 can be helpful when addressing sensitivity to errors. The results are based on 12 exposure groups defined by various mean exposure probabilities (0, 0.05, 0.15,..., 0.85, 0.95, 1), which were chosen with consideration of our empiric example. In other applications, the number of assigned mean probabilities based on a job-exposure matrix may well be fewer (in the range of 3–5)18; the EM method may then be more sensitive to assignment errors.
In some study settings, relevant group-level exposure data are unavailable. Nevertheless, it might be possible to stratify the first-stage cases and controls on group affiliation and then estimate the group-level exposure probabilities based on the individual-level exposure data collected for the participating controls in each group. The accuracy of the estimated group-level exposures relies on the missing-at-random assumption (and the accuracy of the individual-level exposure data). This approach is basically the same as constructing a population-specific job-exposure matrix from interviews of a job-stratified sample of subjects.19 In our empiric example, however, each subject's group affiliation was essentially unique; it was based on a detailed job history for a 20-year period obtained from censuses. The EM method based on collected individual-level exposure data only may rely on additional programming to derive a valid CI around the OR estimate in the final step.20 By contrast, in our proposed EM method, which incorporates external group-level exposure information, a 95% CI is calculated by standard techniques based on a 2-by-2-table (or stratified 2-by-2-tables) with the cell frequencies obtained in the final expectation step of the algorithm. Our simulations demonstrated accurate coverage of such conventionally calculated 95% CIs in error-free scenarios.
The EM method could be applicable in other studies in occupational and environmental epidemiology. We provide 2 examples, with grouping based on residential area. The first example is related to our population-based case-control study on leukemia, considering work as a farmer or farmhand to be a proxy for the exposures of interest (a protective effect associated with such agricultural life was suggested10). Information on the proportion of farmers and farmhands in each municipality may be obtained. At the second stage, interviews can provide individual information on current and past occupations, to assess accurately whether a subject has been working on a farm for a sufficient time period. The second-stage data collection may, however, suffer from selective participation. In other applications, it might be necessary to generalize the EM method for handling polytomous or continuous exposure variables.9,20 There are also important design considerations, such as efficient sampling of second-stage subjects.12,21,22 As another example, consider cases of airway diseases and controls selected at the first stage. Assume that the first-stage subjects can be classified into exposure groups based on ambient monitoring of pollutants in the subjects’ current residential areas. A questionnaire including items on residence history, usual indoor and outdoor times and activities, and so on, may then be administered to the second-stage subjects.21
1. Olson SH, Voigt LF, Begg CB, et al. Reporting participation in case-control studies. Epidemiology
2. Rothman KJ, Greenland S, eds. Modern Epidemiology, 2nd ed. Philadelphia: Lippincott-Raven; 1998.
3. Kauppinen T, Toikkanen J, Pukkala E. From cross-tabulations to multipurpose exposure information systems: a new job-exposure matrix. Am J Ind Med
4. Kauppinen T, Toikkanen J, Pedersen D, et al. Occupational exposure to carcinogens in the European Union. Occup Environ Med
5. Ihrig MM, Shalat SL, Baynes C. A hospital-based case-control study of stillbirths and environmental exposure to arsenic using an atmospheric dispersion model linked to a geographical information system. Epidemiology
6. Bobak M, Leon DA. The effect of air pollution on infant mortality appears specific for respiratory causes in the postneonatal period. Epidemiology
7. Nyberg F, Gustavsson P, Jarup L, et al. Urban air pollution and lung cancer in Stockholm. Epidemiology
8. Richiardi L, Boffetta P, Merletti F. Analysis of nonresponse bias in a population-based case-control on lung cancer. J Clin Epidemiol
9. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc B
10. Björk J, Albin M, Welinder H, et al. Are occupational, hobby, or lifestyle exposures associated with Philadelphia chromosome positive chronic myeloid leukaemia? Occup Environ Med
11. White JE. A two stage design for the study of the relationship between a rare exposure and a rare disease. Am J Epidemiol
12. Cain KC, Breslow NE. Logistic regression analysis and efficient design for two-stage studies. Am J Epidemiol
13. Björk J, Strömberg U. Effects of systematic exposure assessment errors in partially ecologic case-control studies. Int J Epidemiol
14. Bouyer J, Hémon D. Comparison of three methods of estimating odds ratios from a job exposure matrix in occupational case-control studies. Am J Epidemiol
15. Brenner H, Savitz DA, Jockel KH, et al. Effects of nondifferential exposure misclassification in ecologic studies. Am J Epidemiol
16. EGRET for Windows. Software for the Analysis of Biomedical and Epidemiological Studies. User Manual. Cambridge, MA: CYTEL Software Corp; 1999:148–149.
17. Webster T. Commentary: does the spectre of ecologic bias haunt epidemiology? Int J Epidemiol
18. Kernan GJ, Ji BT, Dosemeci M, et al. Occupational risk factors for pancreatic cancer: a case-control study based on death certificates from 24 US states. Am J Ind Med
19. Kromhout H, Heederik D, Dalderup LM, et al. Performance of two job-exposure matrices in a study of lung cancer morbidity in the Zutphen cohort. Am J Epidemiol
20. Wacholder S, Weinberg CR. Flexible maximum likelihood methods for assessing joint effects in case-control studies with complex sampling. Biometrics
21. Navidi W, Thomas D, Stram D, et al. Design and analysis of multilevel analytic studies with applications to a study of air pollution. Environ Health Perspect. 1994;102(suppl 8):25–32.
22. Weinberg CR, Wacholder S. The design and analysis of case-control studies with biased sampling. Biometrics
23. Björk J, Strömberg U. Attributable fraction estimation in partially ecologic case-control studies. Epidemiology
Appendix A: Formulae for Generating Group- and Individual-Level Exposures
Let P(XG|D = 1) and P(XG|D = 0) denote the proportions in a group (defined by XG) among the first-stage cases (with D = 1 denoting the presence of disease) and controls (D = 0), respectively. The relation
\[ P(X_G \mid D = 1) = \frac{P(X_G \mid D = 0)\,(1 + \beta X_G)}{\sum_{X_G'} P(X_G' \mid D = 0)\,(1 + \beta X_G')}, \]
where β = OR – 1 is the excess OR (exposed vs. unexposed), is true provided that the OR can be interpreted as a relative risk, the group-level exposure probabilities are error-free, and confounding is absent.13,14 Hence, based on a theoretical group distribution of the population (ie, specified values on P(XG|D = 0) for XG = 0, 0.05, 0.15,..., 0.85, 0.95, 1) and the true OR (= 1 + β), the values on each P(XG|D = 1) can be calculated.
Individual-level exposures (xi) for the cases and controls were generated from Bernoulli variables with correct exposure probabilities imposed on the groups. If the true exposure probability for a control is p, then the probability for a case in the same group is23
\[ \frac{p \cdot \mathrm{OR}}{1 - p + p \cdot \mathrm{OR}}. \tag{A1} \]
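The two calculations in this appendix can be sketched as follows. The sketch assumes that the case distribution is obtained by reweighting the control distribution with the linear odds 1 + βXG and renormalizing, and that the case exposure probability has the form p·OR/(1 − p + p·OR); the function names and the three-group control distribution are hypothetical and do not reproduce Table 1.

```python
def case_group_distribution(control_dist, true_or):
    """Map P(XG | D = 0) to P(XG | D = 1) under a rare disease.

    control_dist: dict mapping XG -> P(XG | D = 0), summing to 1.
    """
    beta = true_or - 1.0  # excess OR
    weights = {x: q * (1.0 + beta * x) for x, q in control_dist.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def case_exposure_probability(p, odds_ratio):
    """Exposure probability for a case in a group where controls have probability p."""
    return p * odds_ratio / (1.0 - p + p * odds_ratio)


# Hypothetical control distribution over 3 groups (illustration only)
control_dist = {0.0: 0.7, 0.15: 0.2, 0.85: 0.1}
case_dist = case_group_distribution(control_dist, true_or=2.0)
p_case = case_exposure_probability(0.5, odds_ratio=3.0)
```

With a true OR above 1, the cases shift toward the groups with higher XG, and within a group with 0 < XG < 1 a case is more likely to be exposed than a control.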
Appendix B
Let Njp denote the number of subjects with disease status D = j in a group with group-level exposure probability XG = p. Let njkp denote the number of the Njp subjects with known individual exposure status xi = k (second-stage participants); mjp = Njp – nj0p – nj1p is the number with unknown individual exposure status (nonparticipants).
The EM method is carried out as follows:
Expectation (E) step
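For a given value of the OR (OR*), the expected exposure probability for a case with XG = p is
\[ p^* = \frac{p \cdot \mathrm{OR}^*}{1 - p + p \cdot \mathrm{OR}^*} \]
(see equation A1). Thus, the expected total number of exposed (xi = 1) cases (D = 1) is
\[ \sum_p \bigl(n_{11p} + m_{1p}\,p^*\bigr), \]
the expected total number of unexposed (xi = 0) cases (D = 1) is
\[ \sum_p \bigl(n_{10p} + m_{1p}\,(1 - p^*)\bigr), \]
the expected total number of exposed (xi = 1) controls (D = 0) is
\[ \sum_p \bigl(n_{01p} + m_{0p}\,p\bigr), \]
and the expected total number of unexposed (xi = 0) controls (D = 0) is
\[ \sum_p \bigl(n_{00p} + m_{0p}\,(1 - p)\bigr). \]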
Maximization (M) step
Calculate the maximum likelihood estimate of the OR, using the frequencies calculated in the E step.
The procedure is repeated (with OR* set to the current point estimate of the OR) until convergence (convergence criterion: change of OR* < 0.0001).
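The E and M steps described in this appendix can be sketched in Python as below. This is an illustrative sketch rather than the authors' program: the starting value OR* = 1, the dictionary layout, the function name, and the example counts are assumptions, and the case exposure probability is taken to be p·OR*/(1 − p + p·OR*).

```python
def em_odds_ratio(groups, tol=1e-4, max_iter=1000):
    """EM estimate of the OR from participants plus group-level probabilities.

    Each group is a dict with keys
      p        : assigned group-level exposure probability XG
      n11, n10 : exposed / unexposed participating cases
      n01, n00 : exposed / unexposed participating controls
      m1, m0   : nonparticipating cases / controls
    Returns the OR estimate and the final (observed + expected) 2-by-2 table.
    """
    or_star = 1.0  # starting value (an assumption of this sketch)
    for _ in range(max_iter):
        a = b = c = d = 0.0
        # E step: add expected counts for the nonparticipants
        for g in groups:
            p = g["p"]
            p_case = p * or_star / (1.0 - p + p * or_star)
            a += g["n11"] + g["m1"] * p_case          # exposed cases
            b += g["n10"] + g["m1"] * (1.0 - p_case)  # unexposed cases
            c += g["n01"] + g["m0"] * p               # exposed controls
            d += g["n00"] + g["m0"] * (1.0 - p)       # unexposed controls
        # M step: cross-product OR from the completed table (needs b, c > 0)
        or_new = (a * d) / (b * c)
        if abs(or_new - or_star) < tol:
            return or_new, (a, b, c, d)
        or_star = or_new
    raise RuntimeError("EM did not converge")


# Hypothetical counts for 2 exposure groups (illustration only)
groups = [
    {"p": 0.05, "n11": 5, "n10": 95, "n01": 10, "n00": 190, "m1": 30, "m0": 600},
    {"p": 0.85, "n11": 45, "n10": 5, "n01": 40, "n00": 10, "m1": 20, "m0": 50},
]
or_hat, table = em_odds_ratio(groups)

# With no nonparticipants the estimate reduces to the ordinary 2-by-2 OR
complete = [{"p": 0.1, "n11": 60, "n10": 240, "n01": 30, "n00": 270, "m1": 0, "m0": 0}]
or_complete, _ = em_odds_ratio(complete)
```

The returned table holds noninteger cell frequencies; Wald limits can be computed from it in the same way as for an observed 2-by-2 table.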
Appendix C: Calculation of Expected OR Estimate From the EM Method
We here use the same notation as in Appendix B. The total numbers of first-stage cases and controls, together with the specified distribution of the first-stage controls across the groups and the true OR, yield the expected values on Njp (see Appendix A). Each expected value on njkp can be calculated from the group-specific participation rates among the first-stage cases and controls, the true group-level exposure value XG = p, and the true OR (which is taken into account for the participating cases; see equation A1). Then, based on the specified (erroneous) values on XG, the EM algorithm can be carried out as described in Appendix B; the final OR estimate corresponds to the expected OR.
© 2004 Lippincott Williams & Wilkins, Inc.
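The expected-OR calculation can be sketched by feeding expected (noninteger) counts into the EM iteration. Everything specific in this sketch is an assumption made for illustration: the three-group control distribution, a single participation rate per disease status (participation completely at random), the starting value OR* = 1, and an error structure in which every assigned probability is half the true value, which is not one of the structures shown in Fig. 1.

```python
def em_or(groups, tol=1e-4, max_iter=1000):
    """EM estimate of the OR; group keys follow the notation of Appendix B."""
    or_star = 1.0  # starting value (an assumption of this sketch)
    for _ in range(max_iter):
        a = b = c = d = 0.0
        for g in groups:
            p = g["p"]
            p_case = p * or_star / (1.0 - p + p * or_star)
            a += g["n11"] + g["m1"] * p_case
            b += g["n10"] + g["m1"] * (1.0 - p_case)
            c += g["n01"] + g["m0"] * p
            d += g["n00"] + g["m0"] * (1.0 - p)
        or_new = a * d / (b * c)
        if abs(or_new - or_star) < tol:
            return or_new
        or_star = or_new
    raise RuntimeError("EM did not converge")


def expected_groups(n_cases, n_controls, control_dist, true_or,
                    assigned, case_rate, control_rate):
    """Expected counts per group; 'assigned' maps true XG -> assigned XG."""
    beta = true_or - 1.0
    weights = {x: q * (1.0 + beta * x) for x, q in control_dist.items()}
    total = sum(weights.values())
    groups = []
    for x, q in control_dist.items():
        n1 = n_cases * weights[x] / total   # expected first-stage cases
        n0 = n_controls * q                 # expected first-stage controls
        p_case = x * true_or / (1.0 - x + x * true_or)
        groups.append({
            "p": assigned[x],
            "n11": case_rate * n1 * p_case,
            "n10": case_rate * n1 * (1.0 - p_case),
            "n01": control_rate * n0 * x,
            "n00": control_rate * n0 * (1.0 - x),
            "m1": (1.0 - case_rate) * n1,
            "m0": (1.0 - control_rate) * n0,
        })
    return groups


control_dist = {0.0: 0.6, 0.15: 0.3, 0.85: 0.1}  # hypothetical
correct = {x: x for x in control_dist}           # error-free assignment
halved = {x: 0.5 * x for x in control_dist}      # illustrative underestimation

or_correct = em_or(expected_groups(400, 1200, control_dist, 2.0, correct, 0.75, 0.25))
or_halved = em_or(expected_groups(400, 1200, control_dist, 2.0, halved, 0.75, 0.25))
```

With error-free assigned probabilities the iteration returns the true OR. With underestimated probabilities and control participation (25%) well below case participation (75%), the expected OR lies above the true value, in line with the direction of bias discussed in the text.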