# Incorporating Group-Level Exposure Information in Case-Control Studies With Missing Data on Dichotomous Exposures

In case-control studies with exposure data obtained from interviews, participation is an issue of concern. Use of external group-level exposure information, available for all cases and controls (including nonparticipants), can reduce participation bias and improve precision of effect estimates. Our methodologic investigation was motivated by a population-based case-control study on occupational exposures and leukemia. We assessed exposure using dichotomous data collected in interviews, and also using census data on past and current occupational groups for all subjects. Based on information from a job-exposure database, a group-level probability of exposure was assigned to each subject. We studied the performance of the iterative expectation-maximization method for estimating the odds ratio (OR) by using the individual-level exposure data on the interviewed participants together with the assigned group-level exposure probabilities for the nonparticipants. In each iteration, the expected numbers of exposed and unexposed among the nonparticipating cases and controls were calculated from their assigned exposure probabilities and, for the cases only, from the current OR estimate. We then estimated the OR based on the total (observed plus expected) numbers and repeated the procedure until convergence. The expectation-maximization method eliminated participation bias and improved precision for scenarios with error-free group-level exposures and individual-level exposure data missing at random conditional on disease status and group affiliation. We specifically addressed consequences of assigning erroneous exposure probabilities to the nonparticipating subjects. In such situations, the expectation-maximization method can produce biased estimates if the participation rates among the cases and controls differ substantially.

From the *Department of Occupational and Environmental Medicine, Lund University, and the †Competence Centre for Clinical Research, Lund University Hospital, Lund, Sweden.

Submitted 17 February 2003; final version accepted 12 March 2004.

Financial support provided by the Swedish Council for Working Life and Social Research (grant nos. 2001–1188 and 2002–0097).

Correspondence: Ulf Strömberg, Department of Occupational and Environmental Medicine, University Hospital, SE-221 85 Lund, Sweden. E-mail: ulf.stromberg@ymed.lu.se

Nonresponse in case-control studies is a ubiquitous problem.^{1} Participation (self-selection) bias arises if the relation between exposure and disease is different for those who participate.^{2} ^{(p.119)} This can occur if, for example, exposed cases are more likely to participate than exposed controls. Such bias may not be controllable as a result of limited information on the factors affecting participation.

Nonetheless, population registers and census data can, in some countries, provide partial information on virtually all cases and controls such as general individual characteristics (eg, age and sex) and group affiliations (eg, occupational group or residential area). Such grouping is relevant in population-based case-control studies on occupational and environmental exposures. Group-level exposure information may be obtained from an exposure database. For example, a job-exposure matrix can supply group-level information such as exposure proportions or average intensities for various occupational groups.^{3,4} Similarly, a geographic information system can supply average concentrations of air pollutants for various residential areas.^{5–7} Occupational or residential affiliation may be a primary determinant not only of exposure, but also of participation such as in situations in which socioeconomic status is associated with nonresponse.^{8} Appropriate use of complementary group-level exposure information may reduce participation bias and improve precision of effect estimates.

This article proposes a method for estimating the odds ratio (OR) given dichotomous, or appropriately dichotomized, individual-level exposure data for the participants as well as exposure proportions on the group level for the nonparticipants. Based on the group-level information, a probability of exposure is assigned to each subject. We use an iterative estimation method, the expectation-maximization (EM) method,^{9} which takes into account the assigned exposure probabilities for the nonparticipants. We study the performance of the EM method using simulated data. We quantify the bias implied by incorporating erroneous group-level exposure information. We apply the EM method using empiric data from a population-based case-control study on occupational exposures and chronic myeloid leukemia.^{10}

## Simulation Study—Scenarios in Which Group-Level Exposure Data Are Correct

### Setting

We consider a population-based case-control study of a rare disease in which the incident cases during a recruitment period are included. Those cases are referred to as the first-stage cases. The first-stage controls are sampled randomly among those who remain free of disease at the end of the recruitment period.^{2} ^{(pp.110–111)} All first-stage cases and first-stage controls are eligible for the interview portion of the study. We assume that no selection bias arises from the recruitment of the first-stage cases and controls. The first-stage subjects can be classified into exposure groups with various exposure proportions *XG*. Assuming a rare disease, *XG* should reflect the exposure probability for the disease-free subjects in each group of the population. Within each group, the first-stage cases and controls are assigned the same value on *XG*. The second-stage cases and controls are those who participate in interviews and thereby provide individual-level exposure data (*xi* = 1, exposed; *xi* = 0, unexposed). Thus, we consider a 2-stage case-control study,^{11,12} in which *XG* is the first-stage exposure variable (with no missing values) and *xi* is the second-stage exposure variable (with data only for the participants).

### Missing-at-Random Assumption

We assume that data on *xi* are missing at random, conditional on group affiliation and disease status. Thus, within each group, participation (in interview) among the first-stage cases as well as among the first-stage controls is unrelated to exposure status on the individual level. Moreover, the case and control participation rates can differ in each group and can differ overall. Consequently, participation bias is absent within each group, but not necessarily absent overall if the case and control participation rates vary differently across the groups. For example, if the control participation rate decreases with *XG*, the expected exposure prevalence among the second-stage (participating) controls is lower than in the total population.
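The last point can be checked with a little arithmetic. The sketch below uses a hypothetical three-group population (the shares, exposure proportions, and participation rates are illustrative assumptions, not the values of Table 1) and computes the expected exposure prevalence among all first-stage controls and among the participating controls when participation decreases with *XG*.

```python
def prevalence(group_shares, exposure_props, participation=None):
    """Expected exposure prevalence among controls.

    group_shares:   share of first-stage controls in each group
    exposure_props: exposure proportion X_G of each group
    participation:  per-group participation rate (None = all first-stage controls)
    """
    if participation is None:
        participation = [1.0] * len(group_shares)
    # Weight each group by the share of controls that actually contribute data.
    weights = [g * r for g, r in zip(group_shares, participation)]
    exposed = sum(w * x for w, x in zip(weights, exposure_props))
    return exposed / sum(weights)


# Hypothetical values, chosen only for illustration.
group_shares = [0.7, 0.2, 0.1]
exposure_props = [0.0, 0.25, 0.5]
decreasing_rates = [0.30, 0.20, 0.10]  # participation falls as X_G rises

all_controls = prevalence(group_shares, exposure_props)
participants = prevalence(group_shares, exposure_props, decreasing_rates)
print(all_controls, participants)
```

In this illustration the participating controls show a lower exposure prevalence than the full control series, although participation is unrelated to individual exposure within every group.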

### Simulations

In each simulation, we included 400 first-stage cases and 1200 first-stage controls distributed over 12 groups, with the possible values of *XG* restricted to 0, 0.05, 0.15,..., 0.85, 0.95, 1. We assume that each subject is placed in 1 of the 12 groups based on the subject's occupation. For example, if the proportion exposed is 5% for occupation A according to the exposure database, a subject with occupation A is placed in the group with *XG* = 0.05. In practice, a group may be defined by an interval of exposure probabilities (eg, 0.01–0.10); *XG* reflects the mean probability (eg, 0.05) in each group.^{13} The first-stage controls were distributed over the groups according to specified group affiliation probabilities (Table 1); this skewed control distribution across the groups was chosen to resemble the empiric example presented subsequently. The first-stage cases were distributed over the groups correspondingly. For a true OR >1, higher proportions of the cases than of the controls can be expected to belong to the groups with *XG* > 0 (Appendix A).

Of the 400 first-stage cases, 300 second-stage (participating) cases were selected completely at random in each simulation (participation rate = 75%). Of the 1200 first-stage controls, we selected 300 second-stage controls (25%). Two types of scenarios were simulated: without and with participation bias. In the scenarios without participation bias, the second-stage controls were selected completely at random. We introduced participation bias by decreasing or increasing the control participation rate with *XG* (Table 1). Data on *xi* for the second-stage participants were generated from Bernoulli variables with correct exposure probabilities imposed on the groups. If the exposure implies an increase in risk (OR >1), the actual exposure probability for a case is greater than the true exposure probability for a control within the same group with 0 < *XG* < 1 (Appendix A).

We assumed 3 true ORs: 1, 2, and 3. For each scenario, we carried out 1000 replications.

### Estimation Methods

Estimates of the OR were obtained from 3 methods: (1) *first-stage method* (analysis of first-stage data only, for all subjects); (2) *second-stage method* (analysis of second-stage data only, for the participants); and (3) *EM method*.

When the disease is rare, the association between the first-stage exposure variable *XG* and the disease odds is linear in the absence of confounding:

$$
\text{odds}(D = 1 \mid X_G) = \alpha\,(1 + \beta X_G),
$$

where α is a positive constant.^{13,14} For effect estimation with the first-stage method, we thus used the linear OR model,

$$
\text{OR}(X_G) = 1 + \beta X_G, \tag{1}
$$

where

$$
\text{OR} = 1 + \beta
$$

is the odds ratio comparing the exposed with the unexposed. The maximum likelihood estimate of β, $\hat{\beta}$, together with its standard error [SE($\hat{\beta}$)] were obtained as described by Björk and Strömberg.^{13} An approximate 95% confidence interval (CI) for the OR was calculated as^{13}

In the second-stage method, the maximum likelihood estimate of the OR and a 95% CI (Wald limits) were obtained conventionally from a 2-by-2-table.^{2} ^{(pp.241–246)}
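As a minimal sketch of the second-stage computation, assuming that "conventionally" refers to the cross-product ratio with Wald limits on the log scale, the estimate can be obtained as follows. The function name and the counts are hypothetical.

```python
import math


def odds_ratio_wald(a, b, c, d, z=1.96):
    """OR and Wald limits from a 2-by-2 table.

    a: exposed cases      b: unexposed cases
    c: exposed controls   d: unexposed controls
    """
    or_hat = (a * d) / (b * c)
    # Standard error of log(OR): square root of the summed reciprocal cell counts.
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper


# Hypothetical counts for 300 participating cases and 300 participating controls.
or_hat, lower, upper = odds_ratio_wald(60, 240, 30, 270)
print(or_hat, lower, upper)
```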

The EM method is an iterative method. In each iteration, the expected numbers of exposed and unexposed among the nonparticipating cases and controls, respectively, were calculated from their assigned exposure probabilities in the first stage and, for the cases only, the current OR estimate. We then estimated the OR based on the total (observed + expected) numbers and repeated the procedure until convergence (Appendix B). The OR estimate was obtained from a 2-by-2-table in each maximization step of the EM algorithm and, in the last step, a 95% CI (Wald limits) was calculated from the final 2-by-2-table.

### Results

Unbiased OR estimates were obtained with all methods in the scenarios without participation bias (Table 2). The standard deviations (SD) of the logarithmically transformed OR estimates produced by the EM method were 17% to 24% lower than with the first-stage method and 30% to 35% lower than with the second-stage method. The EM method yielded 95% CIs with accurate coverage.

A bias away from the null appeared from the second-stage analyses if the control participation rate decreased with *XG* (Table 2); exposed second-stage controls were thereby underrepresented (7.2% compared with the expected exposure prevalence of 10.4%). On the other hand, a negative bias appeared if the control participation rate increased with *XG*. By using the EM method, such participation bias was eliminated. Compared with the first-stage method, the EM method yielded better precision.

## Bias Assessment: Scenarios in Which Group-Level Exposure Data Are Erroneous

The conditions here are the same as in the simulation study, except that we allow erroneous group-level exposure data on *XG*. We studied 3 different error structures: (1) systematic underestimations (Fig. 1A); (2) systematic overestimations (Fig. 1B); and (3) perfect negative correlation between the true values and the differences between assigned and true values (Fig. 1C). The latter error structure arises if the exposure proportions from the external data source are established with constant sensitivity and specificity (here, both 90%) across all groups.^{13,15} Furthermore, we studied 3 different distributions of the first-stage controls across the groups: (1) the distribution used in the simulation study, ie, controls concentrated in groups with low exposure proportions, referred to as the skewed distribution (Table 2; Fig. 2A; true overall exposure prevalence among the controls = 10.4%); (2) uniformly distributed (Fig. 2B; 50.0%); and (3) controls concentrated in groups with low as well as high exposure proportions, referred to as the bimodal distribution (Fig. 2C; 30.8%). Skewed distributions are common for occupational and environmental exposures in population-based studies. An example of an exposure that is likely to have a bimodal distribution is animal dust (a proxy for animal-borne viruses), which is highly prevalent in certain occupational groups (eg, farmers, farmhands, and butchers). We also varied the true OR (1, 2, and 3) and the participation rate among the first-stage controls (25%, 50%, 75%, and 90%). The participation rate among the first-stage cases was 75% in all scenarios. The second-stage cases and controls (participants) were selected completely at random. Hence, participation bias was absent.
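The third error structure can be made concrete with a short sketch. It assumes the usual misclassification relation, in which a source with constant sensitivity and specificity reports the true positives plus the false positives; whether Fig. 1C was generated in exactly this way is an assumption, and the function name is hypothetical.

```python
def assigned_proportion(true_p, sensitivity=0.9, specificity=0.9):
    """Exposure proportion reported by a source with the given accuracy."""
    # True positives among the exposed plus false positives among the unexposed.
    return sensitivity * true_p + (1 - specificity) * (1 - true_p)


# The 12 group-level exposure proportions used in the simulations.
true_values = [0, 0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95, 1]
errors = [assigned_proportion(p) - p for p in true_values]
print(errors)
```

With both values at 90%, the assigned proportion equals 0.1 + 0.8 × (true proportion), so the error, 0.1 − 0.2 × (true proportion), is an exactly linear, decreasing function of the true value: low proportions are overestimated and high proportions underestimated.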

In each scenario, we derived the expected OR estimate based on the EM method (Appendix C). We also report the expected OR estimates yielded by the first-stage method.^{13}

### Results

For the investigated errors in the group-level exposure *XG*, the first-stage method generally implied bias away from the null when the exposure had an effect (Table 3). The EM method was robust to errors in *XG* when the case and control participation rates at the second stage were equal (both 75%). By contrast, with unequal case/control participation rates, marked bias generally occurred in both directions, even when the exposure had no effect. When the errors in the group-level data were consistent underestimations (Fig. 1A) or overestimations (Fig. 1B), the direction and magnitude of the bias were determined largely by control participation relative to case participation at the second stage, whereas varying the distribution of the controls across the groups (Figs. 2A–C) had less impact (Table 3). When truly low exposure proportions were overestimated and truly high exposure proportions were underestimated (Fig. 1C), the control distribution across the groups, as well as the control participation rate, clearly influenced the resulting bias. Only marginal bias resulted from such errors when the distribution was uniform, whereas marked bias occurred when the exposure distribution was skewed; modest bias occurred for the bimodal distribution.

## Empiric Example

Björk et al.^{10} conducted a population-based case-control study for a series of 255 Philadelphia chromosome-positive cases of chronic myeloid leukemia from Southern Sweden, 1976–1993. For each case, 3 age- and sex-matched controls were sampled randomly from the population at risk at the time of diagnosis. For each subject, we obtained first-stage data as occupational titles for 1960, 1970, 1975, 1980, 1985, and 1990 from national censuses. Jobs held at a census were assumed to be held until the following census. We here concentrate on occupational exposure to organic solvents. To assign the first-stage exposure probabilities, we used a Swedish translation of a Finnish job-exposure matrix.^{3} In addition to occupational title and agent, the matrix incorporates calendar year as a third dimension. Up to 5 exposure probabilities were assigned to each individual *i*: *XG(1),i*, *XG(2),i*, *XG(3),i*, *XG(4),i*, *XG(5),i*; viz. 1 for each time epoch (with respect to the censuses) during the 20-year time period before the year of diagnosis. Each subject's exposure probability was then calculated as

$$
X_{G,i} = 1 - \prod_{k} \left(1 - X_{G(k),i}\right),
$$

where the product runs over the time epochs available for subject *i*.

We used the individually assigned exposure probabilities as the first-stage exposure data. The underlying assumption of independence over time epochs may be too strong in practice. Using the maximum exposure probability (of *XG(1),i*, *XG(2),i*, *XG(3),i*, *XG(4),i*, *XG(5),i*) instead did not alter the results noticeably.
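A minimal sketch of the two ways of combining epoch-specific probabilities is given below. It assumes that the independence-based probability is the probability of being exposed in at least one epoch, ie, 1 minus the product of the epoch-specific probabilities of being unexposed; the function names and the work history are hypothetical.

```python
def combined_probability(epoch_probs):
    """Probability of exposure in at least one epoch, assuming independence."""
    unexposed = 1.0
    for p in epoch_probs:
        unexposed *= 1.0 - p  # probability of remaining unexposed through this epoch
    return 1.0 - unexposed


def max_probability(epoch_probs):
    """Alternative summary: the highest epoch-specific probability."""
    return max(epoch_probs)


# Hypothetical job history over 5 census epochs.
history = [0.0, 0.1, 0.1, 0.5, 0.0]
print(combined_probability(history), max_probability(history))
```

The independence-based value is never smaller than the maximum, so the two summaries differ most for subjects with several epochs of moderate exposure probability.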

At the second stage, all cases and 1 randomly selected control in each matched set were selected for structured telephone interviews. If a control declined to participate, another control from the matched set was contacted. Occupational hygienists assessed individual exposures to organic solvents based on the interview information. The numbers of participants who contributed with individual-level exposure data on organic solvents were 195 cases (76% of the 255 first-stage cases) and 219 controls (29% of the 765 first-stage controls). Among the first-stage controls who were first contacted for interview in each matched set (n = 255), 151 (59%) contributed with individual-level exposure data.

### Effect Estimation, Including Covariates

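We estimated the age- and sex-adjusted OR based on the first-stage data by using an extension of the simple linear OR model (equation 1), viz. the additive-multiplicative logistic regression model^{13,14}:

$$
\text{odds}(D = 1 \mid X_G, S_1, \ldots, S_J) = \alpha \exp\left(\sum_{j=1}^{J} \gamma_j S_j\right) (1 + \beta X_G),
$$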

where *S* _{1},..., *S* _{J} are indicators for relevant strata of age and sex; γ_{j} is the log-transformed OR associated with stratum *j*. We used 6 strata of age and sex (age groups ≤54, 55–69, and 70+ years; separate for males and females). There are 2 underlying assumptions: (1) a common OR (exposed vs. unexposed) across the strata (ie, no effect modification) and (2) the exposure probability for a given work history *i* (*XG(1),i*, *XG(2),i*, *XG(3),i*, *XG(4),i*, *XG(5),i*) does not depend on the covariates (age and sex in our example). Alternatively, one can use the conditional additive-multiplicative logistic regression model based on the matched sets of cases and controls.^{16} Estimates of β and SE(β) can be obtained in the software package EGRET for Windows (Cytel Software Corp., Cambridge, MA).

Generalizations of the second-stage and EM methods for handling stratified data are straightforward. The computations use stratum-specific numbers of exposed and unexposed among the cases and controls (observed numbers in the second-stage method and observed + expected numbers in the EM method). Then the OR with a 95% CI can be estimated by using logistic regression (assuming a common OR across the strata).

### Results

The OR estimate for chronic myeloid leukemia associated with occupational exposure to organic solvents was 0.75 (95% CI = 0.30–1.9) based on the first-stage exposure probabilities. A similar estimate was obtained from the conditional additive-multiplicative logistic regression model (OR = 0.78; CI = 0.32–1.9), using the first-stage exposure data with respect to the matched sets of cases and controls. A somewhat higher estimate (1.1; 0.63–1.8) was obtained based on the second-stage (interview) data. Among the participating controls, the exposure prevalence was 17% according to the second-stage data compared with only 4% according to the first-stage probabilities. The first-stage exposure probabilities were even lower among the nonparticipating controls (mean 2%), indicating some association between group-level exposure probability and participation. The EM method based on all available data yielded an elevated OR estimate of 2.2 (1.4–3.4). Including only the first contacted control in each matched set in the analysis implies more equal case/control participation rates; the EM method then yielded a lower OR estimate (1.2; 0.71–2.0).

## DISCUSSION

The simulation study compared the performance of the first-stage, second-stage, and EM methods, considering scenarios in which group-level exposure data are correct. The better precision in OR estimates is an advantage of the EM method over the first-stage method (and, not surprisingly, over the second-stage method). In the scenarios with participation bias, the EM method eliminated the bias at the second stage by incorporating error-free probabilities of exposure (first-stage information obtained from external data sources) for the nonparticipants. The bias-free results from the EM method rely on the assumption that the individual-level exposure data on *xi* are missing at random within each group for the first-stage cases and controls, respectively. We allowed the participation rates among the first-stage cases and controls to differ in each group. When a stratified analysis based on covariates (such as age and sex) is performed, the missing-at-random assumption is conditional on disease status, group affiliation, *and* stratum; hence, the participation rate is also allowed to vary across the strata.

In practice, however, participation within an exposure group may depend not only on disease status, but also on exposure status, making inference about the OR more problematic. For example, we modified the selection of second-stage controls in our simulation setting by assuming participation rates of 17.3% and 25.9% among the exposed and unexposed, respectively, within each group (implying overall control participation rate = 25%); the second-stage method then yielded, for a true OR = 3, an expected OR = 4.5 (= 3 × 0.259/0.173; see Rothman and Greenland^{2(p.356)}). The EM method is expected to reduce the participation bias even if the missing-at-random assumption is violated, at least when the first-stage exposure data are accurate. In the example, the EM method yielded an OR = 3.1 (geometric mean; coverage of 95% CIs = 94.6%). By increasing the overall control participation rate, and thereby the relative contribution of second-stage exposure data in the EM method, less reduction of the participation bias (induced by violating the missing-at-random assumption) can be expected. Potential users should address whether the missing-at-random assumption is reasonable in their applications.
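The arithmetic behind the expected OR of 4.5 can be retraced in a few lines. The sketch assumes the standard selection-bias factor (the cross-product ratio of the four participation rates) and uses the 10.4% exposure prevalence among the first-stage controls from the simulation setting; the function name is hypothetical.

```python
def selection_bias_factor(case_exp, case_unexp, control_exp, control_unexp):
    """Factor multiplying the true OR, given the four participation rates."""
    return (case_exp * control_unexp) / (case_unexp * control_exp)


true_or = 3
# Cases participate at 75% regardless of exposure; controls at 17.3% (exposed)
# and 25.9% (unexposed).
factor = selection_bias_factor(0.75, 0.75, 0.173, 0.259)
expected_or = true_or * factor

# Overall control participation, weighting by the 10.4% exposure prevalence.
overall_control_rate = 0.104 * 0.173 + (1 - 0.104) * 0.259
print(round(expected_or, 1), round(overall_control_rate, 3))
```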

Our major concern with the EM method is the accuracy of the assigned group-level exposure probabilities. It is generally believed that exposure estimates based on general population job-exposure matrices are less accurate than individual exposure estimates based on occupational hygienists’ reviews. Indeed, we assumed that the individual exposures (exposed, unexposed) were correctly classified. We have previously reported how erroneous group-level probabilities can produce substantially biased ORs from the first-stage method, if the exposure has an effect.^{13,17} The EM method was robust to bias for the investigated error structures when the expected participation rates among the first-stage cases and controls were equal in each group (75%). We did not investigate scenarios with varying participation rates across groups in combination with erroneous group-level data; the EM method may produce a markedly biased OR under such circumstances even if the overall case and control participation rates are equal. Nevertheless, we showed that unequal participation rates among the first-stage cases and controls in each group and overall, in combination with errors in the group-level exposure data, can produce severe bias under the EM method. This bias is different from the bias encountered under the first-stage method. As an example, when the group-level exposure data were underestimations and the control distribution across the groups was bimodal, no or only marginal bias occurred from the first-stage method (Table 3). By contrast, the EM method yielded substantial bias away from the null if the participation rate among the controls (25%) was much lower than among the cases (75%). Such bias appeared because the number of exposed among the nonparticipating controls was underestimated in the EM algorithm. Consequently, as a result of the low participation rate among the controls, the total number of exposed controls was pronouncedly underestimated, which produced bias away from the null.

In the empiric example on organic solvent exposure and chronic myeloid leukemia, the effect estimate from the EM method, using all available control data (n = 765; 29% participated), was positively biased compared with the estimate from the first-stage method. Among the interviewed controls, the exposure prevalence was 17% according to the second-stage data (*xi*) compared with only 4% according to the first-stage probabilities (*XG*), which gave a clear indication of discrepancies between the first- and second-stage exposure assessment methods. We believe that errors in the first-stage exposure probabilities, originating from both the censuses and the job-exposure matrix, can explain the elevated estimate from the EM method; bias was introduced by the very different participation rates among the cases and controls. Accordingly, when only the first contacted control in each matched set was included in the analysis, implying more equal case/control participation rates (76%/59%), the OR estimate from the EM method was not noticeably elevated.

We recommend that potential users of the EM method perform a sensitivity analysis with respect to the assigned group-level exposure probabilities.^{2} ^{(pp.343–357)} Table 3 can be helpful when addressing sensitivity to errors. The results are based on 12 exposure groups defined by various mean exposure probabilities (0, 0.05, 0.15,..., 0.85, 0.95, 1), which were chosen with consideration of our empiric example. In other applications, the number of assigned mean probabilities based on a job-exposure matrix may well be fewer (in the range of 3–5)^{18}; the EM method may then be more sensitive to assignment errors.

In some study settings, relevant group-level exposure data are unavailable. Nevertheless, it might be possible to stratify the first-stage cases and controls on group affiliation and then estimate the group-level exposure probabilities based on the individual-level exposure data collected for the participating controls in each group. The accuracy of the estimated group-level exposures relies on the missing-at-random assumption (and the accuracy of the individual-level exposure data). This approach is basically the same as constructing a population-specific job-exposure matrix from interviews of a job-stratified sample of subjects.^{19} In our empiric example, however, each subject's group affiliation was essentially unique; it was based on a detailed job history for a 20-year period obtained from censuses. The EM method based on collected individual-level exposure data only may rely on additional programming to derive a valid CI around the OR estimate in the final step.^{20} By contrast, in our proposed EM method, which incorporates external group-level exposure information, a 95% CI is calculated by standard techniques based on a 2-by-2-table (or stratified 2-by-2-tables) with the cell frequencies obtained in the final expectation step of the algorithm. Our simulations demonstrated accurate coverage of such conventionally calculated 95% CIs in error-free scenarios.

The EM method could be applicable in other studies in occupational and environmental epidemiology. We provide 2 examples, with grouping based on residential area. The first example is related to our population-based case-control study on leukemia, considering work as a farmer or farmhand to be a proxy for the exposures of interest (a protective effect associated with such agricultural life was suggested^{10}). Information on the proportion of farmers and farmhands in each municipality may be obtained. At the second stage, interviews can provide individual information on current and past occupations, to assess accurately whether a subject has been working on a farm for a sufficient time period. The second-stage data collection may, however, suffer from selective participation. In other applications, it might be necessary to generalize the EM method for handling polytomous or continuous exposure variables.^{9,20} There are also important design considerations, such as efficient sampling of second-stage subjects.^{12,21,22} As another example, consider cases of airway diseases and controls selected at the first stage. Assume that the first-stage subjects can be classified into exposure groups based on ambient monitoring of pollutants in the subjects’ current residential areas. A questionnaire including items on residence history, usual indoor and outdoor times and activities, and so on, may then be administered to the second-stage subjects.^{21}

## REFERENCES

1. *Epidemiology*. 2002;13:123–126.
2. *Modern Epidemiology*, 2nd ed. Philadelphia: Lippincott-Raven; 1998.
3. *Am J Ind Med*. 1998;33:409–417.
4. *Occup Environ Med*. 2000;57:10–18.
5. *Epidemiology*. 1998;9:290–294.
6. *Epidemiology*. 1999;10:666–670.
7. *Epidemiology*. 2000;11:487–495.
8. *J Clin Epidemiol*. 2002;55:1033–1040.
9. *J R Stat Soc B*. 1977;39:1–38.
10. *Occup Environ Med*. 2001;58:722–727.
11. *Am J Epidemiol*. 1982;115:119–128.
12. *Am J Epidemiol*. 1988;128:1198–1206.
13. *Int J Epidemiol*. 2002;31:154–160.
14. *Am J Epidemiol*. 1993;137:472–481.
15. *Am J Epidemiol*. 1992;135:85–95.
16. *Software for the Analysis of Biomedical and Epidemiological Studies. User Manual*. Cambridge, MA: CYTEL Software Corp; 1999:148–149.
17. *Int J Epidemiol*. 2002;31:161–162.
18. *Am J Ind Med*. 1999;36:260–270.
19. *Am J Epidemiol*. 1992;136:698–711.
20. *Biometrics*. 1994;50:350–357.
21. *Environ Health Perspect*. 1994;102(suppl 8):25–32.
22. *Biometrics*. 1990;46:963–975.
23. *Epidemiology*. 2002;13:459–466.

## APPENDIX A

### Formulae for Generating Group- and Individual-Level Exposures

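Let *P*(*XG*|*D* = 1) and *P*(*XG*|*D* = 0) denote the proportions in a group (defined by *XG*) among the first-stage cases (with *D* = 1 denoting the presence of disease) and controls (*D* = 0), respectively. The relation

$$
P(X_G \mid D = 1) = \frac{P(X_G \mid D = 0)\,(1 + \beta X_G)}{\sum_{X_G'} P(X_G' \mid D = 0)\,(1 + \beta X_G')},
$$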

where β = OR – 1 is the excess OR (exposed vs. unexposed) is true provided that the OR can be interpreted as a relative risk, the group-level exposure probabilities are error-free, and confounding is absent.^{13,14} Hence, based on a theoretical group distribution of the population (ie, specified values on *P*(*XG|D* = 0) for *XG* = 0, 0.05, 0.15,..., 0.85, 0.95, 1) and the true OR (= 1 + β), the values on each *P*(*XG|D* = 1) can be calculated.
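The reasoning behind this relation can be sketched in three steps, under the stated conditions (OR interpretable as a relative risk, error-free group-level probabilities, no confounding). The symbol $R_0$, denoting the disease risk among the unexposed, is introduced here only for illustration. First, the risk in a group is a mixture of the risks of its exposed and unexposed members:

$$
P(D = 1 \mid X_G) = R_0\left[(1 - X_G) + \text{OR} \cdot X_G\right] = R_0\,(1 + \beta X_G).
$$

Second, by Bayes' theorem, the group distribution among the cases is proportional to the group-specific risk times the group distribution in the population, which for a rare disease is approximately the distribution among the disease-free:

$$
P(X_G \mid D = 1) \propto P(D = 1 \mid X_G)\, P(X_G) \approx R_0\,(1 + \beta X_G)\, P(X_G \mid D = 0).
$$

Third, normalizing over all groups removes $R_0$:

$$
P(X_G \mid D = 1) = \frac{P(X_G \mid D = 0)\,(1 + \beta X_G)}{\sum_{X_G'} P(X_G' \mid D = 0)\,(1 + \beta X_G')}.
$$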

Individual-level exposures (*xi*) for the cases and controls were generated from Bernoulli variables with correct exposure probabilities imposed on the groups. If the true exposure probability for a control is *p*, then the probability for a case in the same group is^{23}

$$
\frac{p \cdot \text{OR}}{1 - p + p \cdot \text{OR}}. \tag{A1}
$$
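The formulae of this appendix can be sketched in code as follows. The sketch assumes that the case distribution across groups is proportional to the control distribution times (1 + β *XG*) and that the exposure probability for a case is *p*·OR/(1 − *p* + *p*·OR); the function names, the three-group control distribution, and the random seed are hypothetical.

```python
import random


def case_exposure_probability(p, odds_ratio):
    """Exposure probability for a case in a group where controls have probability p."""
    return p * odds_ratio / (1 - p + p * odds_ratio)


def case_group_distribution(control_dist, odds_ratio):
    """Map X_G -> P(X_G | D=1), given a dict mapping X_G -> P(X_G | D=0)."""
    beta = odds_ratio - 1  # excess OR
    weights = {x: share * (1 + beta * x) for x, share in control_dist.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def draw_exposures(n, prob, rng):
    """Draw n Bernoulli exposure indicators (1 = exposed, 0 = unexposed)."""
    return [1 if rng.random() < prob else 0 for _ in range(n)]


# Hypothetical three-group control distribution, for illustration only.
control_dist = {0.0: 0.6, 0.25: 0.3, 0.75: 0.1}
case_dist = case_group_distribution(control_dist, odds_ratio=3)

rng = random.Random(1)
sample = draw_exposures(10, case_exposure_probability(0.25, 3), rng)
print(case_dist, sample)
```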

## APPENDIX B

### EM Method

Let *Njp* denote the number of subjects with disease status *D* = *j* in a group with group-level exposure probability *XG* = *p*. Let *njkp* denote the number of the *Njp* subjects with known individual exposure status *xi* = *k* (second-stage participants); *mjp* = *Njp* – *nj0p – nj1p* is the number with unknown individual exposure status (nonparticipants).

The EM method is carried out as follows:

### Expectation (E) step

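For a given value of the OR (OR*), the expected exposure probability for a case with *XG* = *p* is

$$
\frac{p \cdot \text{OR}^*}{1 - p + p \cdot \text{OR}^*}
$$

(see equation A1). Thus, the expected total number of exposed (*xi* = 1) cases (*D* = 1) is

$$
\sum_{p} \left( n_{11p} + m_{1p}\, \frac{p \cdot \text{OR}^*}{1 - p + p \cdot \text{OR}^*} \right),
$$

the expected total number of unexposed (*xi* = 0) cases (*D* = 1) is

$$
\sum_{p} \left( n_{10p} + m_{1p}\, \frac{1 - p}{1 - p + p \cdot \text{OR}^*} \right),
$$

the expected total number of exposed (*xi* = 1) controls (*D* = 0) is

$$
\sum_{p} \left( n_{01p} + m_{0p}\, p \right),
$$

and the expected total number of unexposed (*xi* = 0) controls (*D* = 0) is

$$
\sum_{p} \left( n_{00p} + m_{0p}\, (1 - p) \right).
$$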

### Maximization (M) step

Calculate the maximum likelihood estimate of the OR, using the frequencies calculated in the E step.

The procedure is repeated (with OR* set to the current point estimate of the OR) until convergence (convergence criterion: change of OR* < 0.0001).
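The two steps can be put together in a short sketch. It is an illustration under stated assumptions rather than the authors' implementation: the starting value OR* = 1, the dictionary layout, the function name, and the two-group data are all hypothetical, and the exposure probability for a nonparticipating case is taken to be *p*·OR*/(1 − *p* + *p*·OR*).

```python
def em_odds_ratio(groups, start=1.0, tol=1e-4, max_iter=1000):
    """EM estimate of the OR from grouped two-stage data.

    Each group is a dict with keys
      p        : assigned exposure probability X_G
      n11, n10 : participating cases, exposed / unexposed
      n01, n00 : participating controls, exposed / unexposed
      m1, m0   : nonparticipating cases / controls
    Returns (OR estimate, final table (a, b, c, d)).
    """
    # Expected control counts do not depend on OR*, so compute them once.
    c = sum(g["n01"] + g["m0"] * g["p"] for g in groups)
    d = sum(g["n00"] + g["m0"] * (1 - g["p"]) for g in groups)
    or_star = start
    for _ in range(max_iter):
        # E step: expected exposed (a) and unexposed (b) cases under OR*.
        a = b = 0.0
        for g in groups:
            p = g["p"]
            q = p * or_star / (1 - p + p * or_star)
            a += g["n11"] + g["m1"] * q
            b += g["n10"] + g["m1"] * (1 - q)
        # M step: cross-product ratio of the observed + expected table.
        or_new = (a * d) / (b * c)
        if abs(or_new - or_star) < tol:
            return or_new, (a, b, c, d)
        or_star = or_new
    raise RuntimeError("EM did not converge")


# Hypothetical data for two groups, for illustration only.
groups = [
    {"p": 0.05, "n11": 10, "n10": 140, "n01": 8, "n00": 142, "m1": 50, "m0": 450},
    {"p": 0.45, "n11": 90, "n10": 60, "n01": 70, "n00": 80, "m1": 50, "m0": 150},
]
or_hat, table = em_odds_ratio(groups)
print(or_hat, table)
```

Wald limits can then be computed from the returned table in the usual way. When no subject is missing individual-level data, the procedure reduces to the ordinary cross-product ratio of the observed table.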

## APPENDIX C

### Calculation of Expected OR Estimate From the EM Method

We here use the same notation as in Appendix B. The total numbers of first-stage cases and controls, together with the specified distribution of the first-stage controls across the groups and the true OR, yield the expected values on *Njp* (see Appendix A). Each expected value on *njkp* can be calculated from the group-specific participation rates among the first-stage cases and controls, the true group-level exposure value *XG* = *p*, and the true OR (which is taken into account for the participating cases; see equation A1). Then, based on the specified (erroneous) values on *XG*, the EM algorithm can be carried out as described in Appendix B; the final OR estimate corresponds to the expected OR.
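The calculation described above can be sketched as follows, replacing observed counts by their expected values. The sketch makes simplifying assumptions that are not taken from the article: participation rates are common to all groups (as in the bias assessment, where participants were selected completely at random), the algorithm starts at OR* = 1, and the control distribution and the error pattern (assigned values equal to half the true values) are hypothetical.

```python
def case_prob(p, odds_ratio):
    """Exposure probability for a case when controls in the group have probability p."""
    return p * odds_ratio / (1 - p + p * odds_ratio)


def expected_em_or(control_dist, true_or, assigned, case_rate, control_rate,
                   n_cases=400, n_controls=1200, tol=1e-4, max_iter=10000):
    """Expected EM estimate of the OR.

    control_dist: true X_G -> share of first-stage controls
    assigned:     true X_G -> assigned (possibly erroneous) X_G
    """
    beta = true_or - 1
    norm = sum(share * (1 + beta * x) for x, share in control_dist.items())
    a_obs = b_obs = c_obs = d_obs = 0.0
    missing = []  # (assigned p, expected nonparticipating cases, controls)
    for x, share in control_dist.items():
        n1 = n_cases * share * (1 + beta * x) / norm  # expected cases in group
        n0 = n_controls * share                       # expected controls in group
        q = case_prob(x, true_or)
        # Participants contribute according to the TRUE exposure probabilities.
        a_obs += case_rate * n1 * q
        b_obs += case_rate * n1 * (1 - q)
        c_obs += control_rate * n0 * x
        d_obs += control_rate * n0 * (1 - x)
        missing.append((assigned[x], (1 - case_rate) * n1, (1 - control_rate) * n0))
    # Nonparticipants contribute according to the ASSIGNED probabilities.
    c = c_obs + sum(m0 * p for p, _, m0 in missing)
    d = d_obs + sum(m0 * (1 - p) for p, _, m0 in missing)
    or_star = 1.0
    for _ in range(max_iter):
        a = a_obs + sum(m1 * case_prob(p, or_star) for p, m1, _ in missing)
        b = b_obs + sum(m1 * (1 - case_prob(p, or_star)) for p, m1, _ in missing)
        or_new = (a * d) / (b * c)
        if abs(or_new - or_star) < tol:
            return or_new
        or_star = or_new
    raise RuntimeError("EM did not converge")


# Hypothetical three-group setting, for illustration only.
control_dist = {0.0: 0.6, 0.25: 0.3, 0.75: 0.1}
correct = {x: x for x in control_dist}
halved = {x: 0.5 * x for x in control_dist}  # systematic underestimation

unbiased = expected_em_or(control_dist, 2, correct, 0.75, 0.25)
biased = expected_em_or(control_dist, 2, halved, 0.75, 0.25)
print(unbiased, biased)
```

In this illustration, error-free assigned probabilities return the true OR, whereas underestimated probabilities combined with low control participation push the expected estimate away from the null.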