# Estimating Attributable Fraction in Partially Ecologic Case-Control Studies

Partially ecologic case-control studies combine group-level exposure data with individual-level data on disease status, group membership, and covariates. If the exposure measure is the exposure prevalence of various groups, the attributable fraction (AF; the estimated proportion of cases that are attributable to exposure) can be estimated by classifying all subjects in groups with exposure prevalence above zero as exposed. Such a threshold AF estimator (*ÂF*_{T}) is unbiased in confounding-free situations if the threshold is 100% sensitive, but it might be imprecise. We propose an alternative AF estimator, *ÂF*_{L}, for partially ecologic case-control studies under a linear model for the association between the exposure prevalence and the odds ratio. The proposed estimator can also be applied to situations in which covariate adjustment is necessary. *ÂF*_{T} and *ÂF*_{L} are compared with respect to precision and bias. *ÂF*_{L} is also unbiased when the exposure prevalence is zero in the group(s) assessed as unexposed. Using *ÂF*_{L} will consistently result in improved precision compared with *ÂF*_{T}, although the results may not differ substantially. The 95% confidence intervals for both AF estimators show satisfactory coverage in bias-free exposure scenarios. Pronounced negative bias and decreased coverage result for both AF estimators even when small fractions (3–9%) of exposed subjects are included in the group assessed as unexposed.

From the Department of Occupational and Environmental Medicine, Lund, Sweden.

Address correspondence to: Jonas Björk, Department of Occupational and Environmental Medicine, Lund University Hospital, SE-221 85 Lund, Sweden; Jonas.Bjork@ymed.lu.se

This study was supported in part by the Swedish Council for Working Life and Social Research.

Submitted 11 July 2001; final version accepted 4 March 2002.

Partially ecologic case-control studies combine group-level exposure data (*eg*, obtained from an exposure database) with individual-level data on disease status, group membership, and covariates to estimate exposure-disease associations on the individual level. ^{1} Others have referred to similar study designs as semi-individual studies. ^{2,3} Common practice in such studies focusing on occupational exposures is to use a job exposure matrix (JEM) with binary or categorized exposure data ^{4–6} or to categorize a JEM comprising exposure prevalences or average intensities. ^{7–11} For exposures with low overall prevalences, such approaches usually produce risk ratio (RR) estimates that are considerably biased toward the null unless the threshold criterion for exposure has high specificity. ^{12–16} There is an attractive alternative for exposures that are dichotomous (or dichotomized) on the individual level. This approach models the linear association between the corresponding group-level measure (*ie*, the exposure prevalence of various groups) and the odds ratio (OR). ^{1,12} The linear OR model has so far rarely been applied ^{17} despite its bias-reducing potential. ^{12} Partially ecologic case-control studies in environmental epidemiology often use group-level measures (such as average concentrations of air pollutants in narrowly defined districts) obtained from geographical information systems or dispersion models. ^{18–22} If data on the variability of the exposure within a district are available, the proportion of individuals exposed above some known or suspected effect level in various districts can be estimated, which makes the linear OR model applicable.

The attributable fraction ^{23} (AF; the estimated proportion of cases that would not have occurred if exposure above the reference [*here*, the unexposed] category had not occurred) is an interesting epidemiologic measure from a public-health perspective if a study is population based. The AF can be estimated by classifying as exposed all subjects in groups with exposure prevalence above zero. This threshold method with a 100% sensitive—but not necessarily 100% specific—threshold (*ie*, all truly exposed and possibly some truly unexposed are classified as exposed) implies that an underestimation of the OR is balanced by an overestimation of the exposure prevalence. The resulting AF estimate is unbiased, ^{24–26} at least in the absence of confounding and effect modification. The conclusion holds for individual-level as well as partially ecologic case-control studies. However, reduced precision may be the price for using the threshold AF estimator based on a broad definition of exposure. ^{24,26}

In this paper, we propose an alternative AF estimator for partially ecologic case-control studies of dichotomous exposures, using the linear OR model. We point out that the proposed AF estimator is unbiased under the same conditions as the threshold AF. The two AF estimators are compared with respect to precision and coverage of 95% confidence intervals (CIs), using simulated case-control study data. In addition, we investigate the effect on bias when grouping the study subjects with a sensitivity less than 100%, using both simulated and empirical data. We also address covariate adjustments in the estimation procedure.

### Unadjusted Attributable Fraction Estimation

We assume that the OR can be interpreted as a RR, that the population has *J* groups (*eg, J* different occupations or geographical areas), and that each group *j* (*j* = 0, 1, ..., *J* − 1) comprises η_{j} × 100% of all individuals. Further, we assume that the prevalence of a harmful binary exposure in each group *j* is *x*_{j} and that *x*_{0} = 0. The overall exposure prevalence of the population is

$$\bar{x} = \sum_{j=0}^{J-1} \eta_j x_j \tag{1}$$

and the variance of the group-specific exposure prevalences is

$$\operatorname{var}(x) = \sum_{j=0}^{J-1} \eta_j (x_j - \bar{x})^2. \tag{2}$$

The exposure prevalence of a group can be interpreted as the exposure probability of a randomly selected group member. ^{12} The OR comparing group *j* with the reference group 0 is *OR*_{j} (*OR*_{0} = 1). The unadjusted AF associated with the excess risks in groups 1, ..., *J* − 1 *vs* group 0 can be expressed as ^{27}

$$AF = \frac{OR_{\bar{x}} - 1}{OR_{\bar{x}}} \tag{3}$$

where

$$OR_{\bar{x}} = \sum_{j=0}^{J-1} \eta_j \, OR_j$$

is the population OR (see also Greenland ^{28}), *ie*, the weighted average of the group-specific ORs in a population with overall exposure prevalence *x̄*. The true association between the exposure prevalence *x* and the OR in each group is linear in the absence of confounding and effect modification from other risk factors across and within groups: ^{1,12,29}

$$OR(x) = 1 + \beta x. \tag{4}$$

This linear OR model can be used in partially ecologic case-control studies by assigning an exposure probability *x* to each study subject solely on the basis of group membership. Note that 1 + β̂ is an unbiased estimator of the OR for exposed *vs* unexposed subjects if the exposure probabilities are estimated without errors. ^{1} The population OR is

$$OR_{\bar{x}} = 1 + \beta \bar{x} \tag{5}$$

and hence Eq 3 implies that the AF for the linear model can be expressed as

$$AF_L = \frac{\beta \bar{x}}{1 + \beta \bar{x}}. \tag{6}$$
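As a minimal numerical sketch of the quantities defined so far, the snippet below computes the overall exposure prevalence, the variance of the group-specific prevalences, and the AF under a linear OR model of the form OR(*x*) = 1 + β*x*. The function names and the example population (60% of subjects in an unexposed group 0, four equally sized groups with prevalences 0.1, 0.2, 0.5, and 0.6, and β = 2) are illustrative assumptions borrowed from the simulation scenarios, not part of the original analysis.

```python
from typing import Sequence


def mean_prevalence(eta: Sequence[float], x: Sequence[float]) -> float:
    """Overall exposure prevalence: sum over groups of eta_j * x_j."""
    return sum(e * xj for e, xj in zip(eta, x))


def var_prevalence(eta: Sequence[float], x: Sequence[float]) -> float:
    """Variance of the group-specific prevalences, weighted by group size."""
    xbar = mean_prevalence(eta, x)
    return sum(e * (xj - xbar) ** 2 for e, xj in zip(eta, x))


def af_linear(beta: float, xbar: float) -> float:
    """AF = (OR_pop - 1) / OR_pop with population OR = 1 + beta * xbar."""
    or_pop = 1.0 + beta * xbar
    return (or_pop - 1.0) / or_pop


# Hypothetical population: group 0 unexposed, groups 1-4 of equal size.
eta = [0.6, 0.1, 0.1, 0.1, 0.1]
x = [0.0, 0.1, 0.2, 0.5, 0.6]
beta = 2.0  # assumed slope; corresponds to an OR of 3.0 for exposed vs unexposed

xbar = mean_prevalence(eta, x)
var_x = var_prevalence(eta, x)
af_l = af_linear(beta, xbar)
print(f"xbar = {xbar:.3f}, var(x) = {var_x:.4f}, AF_L = {af_l:.4f}")
```

In this hypothetical population only 14% of subjects are exposed, yet an OR of 3.0 translates into roughly one fifth of cases being attributable to the exposure.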

Next, we assume that the threshold method classifies group 0 as unexposed and all other groups as exposed. The population OR under this 100% sensitive threshold classification is

$$OR_{\bar{x}} = 1 + \eta_T (OR_T - 1) \tag{7}$$

where η_{T} = 1 − η_{0} is the proportion of subjects classified as exposed and *OR*_{T} is the OR for the exposed *vs* the unexposed according to the threshold. Thus, the AF for the threshold method is

$$AF_T = \frac{\eta_T (OR_T - 1)}{1 + \eta_T (OR_T - 1)}. \tag{8}$$

Estimators based on Eqs 6 and 8, *ÂF*_{L} and *ÂF*_{T}, are directly applicable when the controls are density sampled, ^{30} if confounding and effect modification from other risk factors are absent. In particular, *x̄* is estimated as the average exposure probability among the control subjects.
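The agreement between the two approaches at the population level can be checked with a short sketch. It assumes a linear OR model with group-specific ORs of 1 + β*x*_{j}, so that the pooled OR for all groups classified as exposed is the size-weighted average of their ORs. The population and the value of β are hypothetical, and the variable names are not taken from the paper.

```python
def af_threshold(eta_t: float, or_t: float) -> float:
    """Threshold AF: eta_T (OR_T - 1) / (1 + eta_T (OR_T - 1))."""
    excess = eta_t * (or_t - 1.0)
    return excess / (1.0 + excess)


# Hypothetical population with a truly unexposed group 0.
eta = [0.6, 0.1, 0.1, 0.1, 0.1]
x = [0.0, 0.1, 0.2, 0.5, 0.6]
beta = 2.0  # assumed slope of the linear OR model

xbar = sum(e * xj for e, xj in zip(eta, x))
eta_t = sum(e for e, xj in zip(eta, x) if xj > 0)

# Pooled OR for groups 1-4 vs group 0, assuming OR_j = 1 + beta * x_j.
or_t = sum(e * (1.0 + beta * xj) for e, xj in zip(eta, x) if xj > 0) / eta_t

af_t = af_threshold(eta_t, or_t)
af_l = beta * xbar / (1.0 + beta * xbar)
print(f"OR_T = {or_t:.2f}, AF_T = {af_t:.4f}, AF_L = {af_l:.4f}")
```

The pooled OR is diluted from 3.0 to 1.7, but the larger share of subjects counted as exposed compensates exactly, so both expressions return the same AF. The two estimators therefore differ in precision, not in what they estimate.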

### Variance Estimators

In Appendix A, we show that the variance of *ÂF*_{L} can be estimated as

[Eq 9: equation missing in source]

where *n* is the number of controls and $\hat{\bar{x}}$ and vâr(*x*) are estimates of the mean and the variance of the exposure probability in the population (Eqs 1 and 2), which can be directly estimated from density-sampled controls. A valid estimate of the variance of the unadjusted *AF*_{T} estimator is given by Greenland. ^{31} This applies also to the adjusted *AF*_{T} estimator (see below).

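In practice, we will construct confidence limits for ln(1 − *AF*), which are then transformed back to limits for *AF*. ^{23} Using the delta method, ^{32} a variance estimator for ln(1 − *ÂF*) is

$$\widehat{\operatorname{var}}\left[\ln(1 - \widehat{AF})\right] = \frac{\widehat{\operatorname{var}}(\widehat{AF})}{(1 - \widehat{AF})^2}$$

where *ÂF* is either of the two AF estimators.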
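The back-transformation can be sketched as follows. The function name, the normal quantile of 1.96, and the example inputs are assumptions chosen for illustration. The variance of the AF estimate must be supplied by the caller from whichever variance estimator applies.

```python
import math
from typing import Tuple


def af_confidence_interval(af: float, var_af: float, z: float = 1.96) -> Tuple[float, float]:
    """Confidence limits for AF built on the ln(1 - AF) scale and transformed back."""
    # Delta method: var[ln(1 - AF)] is approximately var(AF) / (1 - AF)^2.
    se_log = math.sqrt(var_af) / (1.0 - af)
    log_complement = math.log(1.0 - af)
    # A larger ln(1 - AF) means a smaller AF, so the limits swap on the way back.
    lower = 1.0 - math.exp(log_complement + z * se_log)
    upper = 1.0 - math.exp(log_complement - z * se_log)
    return lower, upper


# Hypothetical point estimate and standard error, for illustration only.
lower, upper = af_confidence_interval(af=0.22, var_af=0.06 ** 2)
print(f"95% CI: {lower:.3f} to {upper:.3f}")
```

Working on the ln(1 − *AF*) scale keeps the upper limit below 1 and gives an interval that is asymmetric around the point estimate.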

### Adjusted Attributable Fraction Estimation

In matched study designs with a representative case series, AF expressions based on the exposure prevalence of the cases can be used. ^{27,33,34} The overall exposure prevalence of the cases (Appendix B) is

$$\bar{x}' = \sum_{j=0}^{J-1} \eta'_j \, x'_j = \sum_{j=0}^{J-1} \eta'_j \, \frac{x_j (1 + \beta)}{1 + \beta x_j} \tag{10}$$

where η′_{j} is the proportion of the cases that belongs to group *j* and *x*′_{j} is the exposure probability of the cases in group *j*. Thus,

$$AF'_L = \bar{x}' \, \frac{\beta}{1 + \beta} \tag{11}$$

is analogous with Eq 6 and can be estimated by using the distribution of the cases and the conditional maximum likelihood estimate of β. Similarly, ^{27,33,34}

$$AF'_T = \eta'_T \, \frac{OR_T - 1}{OR_T} \tag{12}$$

where η′_{T} is the proportion of the cases classified as exposed according to the threshold, is analogous with Eq 8. AF estimators based on *AF*′_{L} and *AF*′_{T} are also applicable if confounding-adjusted OR estimates are obtained. ^{27} Confounding across groups, *ie*, when other risk factors are associated with the determinants of group membership, ^{35,36} can be adjusted for if individual data on confounders are available for each case and each control. The linear OR model (Eq 4) extends to an additive-relative OR model if we assume a common RR for exposed *vs* unexposed individuals across the various levels of the confounders, ^{1,12}

$$OR(x, s) = \exp\left(\sum_i \alpha_i s_i\right) (1 + \beta x) \tag{13}$$

where *s*_{i} is an indicator variable for stratum *i* in a complete, or sufficient, cross-classification of the confounders and α_{i} is the log-transformed OR associated with stratum *i*. We apply the additive-relative OR model in the empirical example below when estimating β and hence *AF*′_{L} (Eq 11) in the presence of confounding across groups.

The variance estimator of Eq 9 is valid for the *AF*′_{L} estimator if *n* is replaced by *m* (the number of cases) and if estimates of *x̄* and var(*x*) are derived from the observed distribution of exposure probabilities of the cases. Combining Eqs 6 and 11 yields ^{34}

$$\bar{x} = \frac{\bar{x}'}{1 + \beta (1 - \bar{x}')}.$$
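The link between the case-based and the population-based quantities can be illustrated numerically. The sketch assumes a linear OR model OR(*x*) = 1 + β*x*, under which cases fall in group *j* in proportion to η_{j}(1 + β*x*_{j}) and the exposure probability among cases in group *j* is *x*_{j}(1 + β)/(1 + β*x*_{j}). The population, the value of β, and all variable names are hypothetical.

```python
# Hypothetical population and effect size.
eta = [0.6, 0.1, 0.1, 0.1, 0.1]   # group proportions in the population
x = [0.0, 0.1, 0.2, 0.5, 0.6]     # exposure prevalence per group
beta = 2.0                        # assumed slope of the linear OR model

xbar = sum(e * xj for e, xj in zip(eta, x))
or_pop = 1.0 + beta * xbar

# Distribution of the cases over groups, and exposure probability among cases.
eta_case = [e * (1.0 + beta * xj) / or_pop for e, xj in zip(eta, x)]
x_case = [xj * (1.0 + beta) / (1.0 + beta * xj) for xj in x]
xbar_case = sum(ec * xc for ec, xc in zip(eta_case, x_case))

# AF from the case series vs AF from the population (control) distribution.
af_from_cases = xbar_case * beta / (1.0 + beta)
af_from_population = beta * xbar / or_pop

# Population exposure prevalence recovered from the case series alone.
xbar_recovered = xbar_case / (1.0 + beta * (1.0 - xbar_case))

print(f"xbar' = {xbar_case:.4f}, AF' = {af_from_cases:.4f}, AF = {af_from_population:.4f}")
print(f"xbar recovered from cases = {xbar_recovered:.4f}")
```

Cases are more often exposed than the population at large (about 33% against 14% here), and the case-based expression returns the same AF as the population-based one.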

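Estimating var(*x*) requires estimates of η_{j} (*j* = 0, 1, ..., *J* − 1), which can be obtained as

$$\hat{\eta}_0 = \left[\sum_{j=0}^{J-1} \frac{\eta'_j}{\eta'_0 (1 + \hat{\beta} x_j)}\right]^{-1}$$

and

$$\hat{\eta}_j = \hat{\eta}_0 \, \frac{\eta'_j}{\eta'_0 (1 + \hat{\beta} x_j)}, \quad j = 1, \ldots, J - 1.$$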

(See Appendix B for details.)

### Bias Resulting from Exposure Misclassification

The formulation of Eq 3 represents the summary AF for all factors (*ie*, the exposure of interest as well as other possible risk factors) that may be responsible for the excess risks of groups 1, ..., *J* − 1 *vs* group 0. ^{27} However, Eq 3 is a valid expression for the AF *associated with the exposure* only if the population OR (*OR*_{x̄}) is obtained adequately. In particular, the threshold estimator *ÂF*_{T} (*cf* Eq 8), based on the *OR*_{x̄} estimator of Eq 7, is unbiased only if ^{24–26}

- there is no confounding or effect modification from other risk factors across or within groups and
- the identified group 0 comprises truly unexposed subjects only.

To see why the same conclusion generally holds for *ÂF*_{L}, suppose that the linear OR model (Eq 4) is valid and that group 0 has been correctly defined (*ie*, the exposure prevalence of the group is zero). By contrast, the exposure prevalences of groups 1, ..., *J* − 1 may be estimated with errors, producing expected estimates β^{*}, *x̄*^{*}, and *OR*^{*}_{x̄} of the true β, *x̄*, and *OR*_{x̄}. Notice that when group 0 is correct, *OR*_{x̄} under the linear OR model (Eq 5) and *OR*_{x̄} under the threshold method (Eq 7) are numerically identical, although the estimated exposure prevalences may differ. In ordinary least-squares linear regression with measurement errors in *x* only, the point (*x̄*^{*}, *ȳ*), where *ȳ* is the true mean of the response variable, can be expected to lie on the regression line (see also Draper and Smith ^{37}). Similarly, if the direction of the estimated effect is not reversed (*ie*, when β^{*} > 0), one can (at least asymptotically) expect the linear OR model to pass through (*x̄*^{*}, *OR*_{x̄}). Thus,

$$OR^{*}_{\bar{x}} = 1 + \beta^{*} \bar{x}^{*} = OR_{\bar{x}},$$

*ie*, the expected value of the estimated population risk ratio and hence the estimated AF are unaffected by errors in the estimated exposure prevalences of groups 1, ..., *J* − 1. In contrast, if group 0 is contaminated by exposed subjects, both approaches will tend to attenuate the estimate of *OR*_{x̄} and, consequently, of AF.

For the threshold method, Hsieh and Walter ^{25} have noted that the expected estimate *AF*^{*}_{T} is a constant proportion of the true AF regardless of the true OR in the presence of exposed subjects in group 0, if the misclassification is nondifferential. Using the notation of the present paper,

$$AF^{*}_{T} = \left(1 - \frac{x_0}{\bar{x}}\right) AF \tag{14}$$

where *x*_{0} is the true exposure prevalence of the identified group 0, and *x̄* is the exposure prevalence of the total population. The bias in *ÂF*_{L} produced by contaminating group 0 with exposed subjects is examined with simulations (see below). However, for a given measurement error structure, the further classification of the subjects in groups 1, ..., *J* − 1, which increases var(*x*), is likely to reduce the bias somewhat in β̂ and, accordingly, in *ÂF*_{L} relative to *ÂF*_{T} (see also Wacholder ^{38}).
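The size of this attenuation can be gauged with a short sketch. It assumes a linear OR model OR(*x*) = 1 + β*x* and nondifferential misclassification, so that the expected threshold AF equals the true AF multiplied by 1 − *x*_{0}/*x̄*. The scenario (60% of subjects in group 0, a mean exposure prevalence of 0.35 in the other groups, β = 2) and the function name are illustrative assumptions. Contaminating group 0 is taken to raise *x̄* while the other groups stay unchanged.

```python
def attenuation_factor(x0: float, xbar: float) -> float:
    """Expected threshold AF divided by the true AF when group 0 has prevalence x0."""
    return 1.0 - x0 / xbar


eta0 = 0.6                   # share of the population in the identified group 0
mean_other_groups = 0.35     # mean exposure prevalence in groups 1 to J - 1
beta = 2.0                   # assumed slope of the linear OR model

results = []
for x0 in (0.0, 0.03, 0.09):
    xbar = eta0 * x0 + (1.0 - eta0) * mean_other_groups
    true_af = beta * xbar / (1.0 + beta * xbar)
    # Direct calculation: population risk relative to the contaminated group 0.
    or_pop_star = (1.0 + beta * xbar) / (1.0 + beta * x0)
    af_star = (or_pop_star - 1.0) / or_pop_star
    factor = attenuation_factor(x0, xbar)
    results.append((x0, true_af, af_star, factor))
    print(f"x0 = {x0:.2f}: true AF = {true_af:.3f}, expected AF_T = {af_star:.3f}, ratio = {factor:.2f}")
```

In this hypothetical scenario an exposure prevalence of only 3% in group 0 already removes about a fifth of the true AF, and the ratio does not depend on β.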

Bias evaluation in the presence of confounding or effect modification is more complex. An adjusted AF estimator is valid when the identified group 0 comprises truly unexposed subjects only, provided that the model used to obtain the adjusted OR estimate is correctly specified. By contrast, inadequate adjustment for confounding or effect modification from other risk factors may bias the effect estimate and hence the corresponding adjusted AF estimate in either direction. ^{39}

### Simulation Methods

We studied several different exposure scenarios in a population with five groups. The proportion of the subjects belonging to group 0 (η_{0}) ranged between 0.1 and 0.8. Groups 1, 2, 3, and 4 were of equal size with exposure prevalences either equal to 0.1, 0.2, 0.5, and 0.6, respectively (mean 0.35), or equal to 0.4, 0.6, 0.8, and 1.0 (mean 0.7). In addition, under some scenarios, we changed the exposure prevalences of groups 1–4 while keeping the mean unchanged. For each exposure scenario, we simulated 1,000 unmatched partially ecologic case-control studies, each including 250 cases and 250 controls and with a true OR of 3.0. We used a standard maximum likelihood technique for parameter estimation under the linear OR model. *ÂF*_{T} and *ÂF*_{L} were compared with respect to precision, measured by the empirical standard deviations. We also examined the coverage and the precision (width) of the corresponding CIs. In addition, we investigated the bias produced by contaminating group 0 with exposed subjects.
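A simulation of this kind can be sketched in pure Python. The code below is not the software used in the study. It assumes a linear OR model OR(*x*) = 1 + β*x* and replaces the standard maximum likelihood routine with a simple profile-likelihood search (bisection for the intercept, golden-section search for β). It runs 100 replicates of one scenario instead of 1,000 and assumes that group 0 is listed first with prevalence zero. All function names, the seed, and the search bounds are hypothetical.

```python
import math
import random
import statistics


def profile_loglik(beta, cases, controls, x):
    """Log-likelihood at beta, maximised over the intercept by bisection."""
    r = [1.0 + beta * xj for xj in x]               # group ORs relative to group 0
    tot = [a + b for a, b in zip(cases, controls)]
    n_cases = sum(cases)
    lo, hi = -30.0, 30.0                            # bounds for ln(baseline odds)
    for _ in range(50):
        mid = 0.5 * (lo + hi)
        theta = math.exp(mid)
        fitted = sum(t * theta * rj / (1.0 + theta * rj) for t, rj in zip(tot, r))
        if fitted < n_cases:                        # too few fitted cases: raise intercept
            lo = mid
        else:
            hi = mid
    theta = math.exp(0.5 * (lo + hi))
    return sum(a * math.log(theta * rj) - t * math.log(1.0 + theta * rj)
               for a, t, rj in zip(cases, tot, r))


def fit_beta(cases, controls, x, upper=50.0, iters=40):
    """Golden-section search for the beta maximising the profile log-likelihood."""
    lo = -1.0 / max(x) + 1e-6                       # keeps every 1 + beta * x_j positive
    hi = upper
    g = (math.sqrt(5.0) - 1.0) / 2.0
    c, d = hi - g * (hi - lo), lo + g * (hi - lo)
    fc = profile_loglik(c, cases, controls, x)
    fd = profile_loglik(d, cases, controls, x)
    for _ in range(iters):
        if fc > fd:                                 # maximum lies in [lo, d]
            hi, d, fd = d, c, fc
            c = hi - g * (hi - lo)
            fc = profile_loglik(c, cases, controls, x)
        else:                                       # maximum lies in [c, hi]
            lo, c, fc = c, d, fd
            d = lo + g * (hi - lo)
            fd = profile_loglik(d, cases, controls, x)
    return 0.5 * (lo + hi)


def simulate_study(eta, x, beta, n_cases, n_controls, rng):
    """Draw group counts for cases and controls under OR(x) = 1 + beta * x."""
    case_weights = [e * (1.0 + beta * xj) for e, xj in zip(eta, x)]
    groups = range(len(eta))
    cases, controls = [0] * len(eta), [0] * len(eta)
    for j in rng.choices(groups, weights=case_weights, k=n_cases):
        cases[j] += 1
    for j in rng.choices(groups, weights=eta, k=n_controls):
        controls[j] += 1
    return cases, controls


def estimate_afs(cases, controls, x):
    """Return (AF_L, AF_T) for one study; group 0 must be the first entry."""
    n = sum(controls)
    xbar_hat = sum(b * xj for b, xj in zip(controls, x)) / n
    beta_hat = fit_beta(cases, controls, x)
    af_l = beta_hat * xbar_hat / (1.0 + beta_hat * xbar_hat)
    a0, b0 = cases[0], controls[0]
    a1, b1 = sum(cases) - a0, n - b0
    or_t = (a1 * b0) / (a0 * b1)                    # exposed vs unexposed by threshold
    excess = (b1 / n) * (or_t - 1.0)
    return af_l, excess / (1.0 + excess)


rng = random.Random(2002)
eta = [0.6, 0.1, 0.1, 0.1, 0.1]
x = [0.0, 0.1, 0.2, 0.5, 0.6]
true_beta = 2.0                                     # true OR of 3.0
results = [estimate_afs(*simulate_study(eta, x, true_beta, 250, 250, rng), x)
           for _ in range(100)]
af_l_values = [r[0] for r in results]
af_t_values = [r[1] for r in results]
print(f"AF_L: mean {statistics.mean(af_l_values):.3f}, SD {statistics.stdev(af_l_values):.3f}")
print(f"AF_T: mean {statistics.mean(af_t_values):.3f}, SD {statistics.stdev(af_t_values):.3f}")
```

The true AF in this scenario is about 0.219, so both means should land close to that value, and the two empirical standard deviations give the precision comparison. The golden-section search assumes a single maximum of the profile likelihood, which is reasonable for a sketch but would need checking in real use.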

### Simulation Results

The estimator *ÂF*_{L} exhibited precision that was consistently, but not always substantially, better than *ÂF*_{T} (Table 1). The coverage of the CIs was generally satisfactory for both *ÂF*_{T} and *ÂF*_{L}. When the overall exposure prevalence was low (0.07), however, the coverage of the CIs for *ÂF*_{L} was too low. Generally, the improvement in precision of *ÂF*_{L} relative to *ÂF*_{T}, as well as of the corresponding CIs, increased with increasing proportion of subjects in groups 1–4, *ie*, in groups with nonzero exposure prevalences. There was, however, a markedly improved precision with the linear OR model for AF estimation only when a clear majority of the subjects belonged to groups with low but nonzero exposure prevalences (here, ≤0.6). When the exposure prevalences of groups 1–4 were simulated to be more concentrated around the mean (*eg*, 0.2, 0.3, 0.4, and 0.5, rather than 0.1, 0.2, 0.5, and 0.6), the improvement in precision of *ÂF*_{L} relative to *ÂF*_{T} was less pronounced (not in tables).

Results concerning bias attributable to contamination of group 0 are presented for one of the exposure scenarios (η_{0} = 0.60 and 35% mean exposure prevalence in groups 1–4; Table 2). Pronounced bias of the AF estimates was seen even when only a small fraction (3%) of group 0 was exposed. The AF estimates were much more sensitive to bias and decreased coverage of the corresponding CIs than were the OR estimates obtained from the linear OR model. *ÂF*_{L} and *ÂF*_{T} were almost equally sensitive to bias; increased sample size did not alter this finding (not in tables).

### Empirical Example

Using a series of 372 cases of acute myeloid leukemia (AML) from southern Sweden in 1976–1993, Albin *et al.* ^{40} conducted a case-control study with one age- and gender-matched control per case. They assessed the exposure individually on the basis of data on occupations obtained by structured telephone interviews. An association between occupational exposure to organic solvents and AML was suggested (OR = 1.5, CI = 0.95–2.4; our calculations), corresponding to an AF estimate of 6.6% (CI = 0.0–13%; 20% exposed cases).

For comparison, we conducted a partially ecologic analysis of the same case series, including two extra controls in each matched set to improve the precision. We obtained occupational titles for 1960 and for every fifth year between 1970 and 1990 from national census data, and we used exposure probabilities from a Swedish translation of a Finnish JEM (FINJEM). ^{41} Age- and gender-adjusted ORs and corresponding AFs were estimated both under the threshold approach, by classifying all subjects with exposure probabilities above zero as exposed, and under the additive-relative OR model (Eq 13), using EGRET^{®} for Windows (Cytel Software Corp, version 2.0).

The partially ecologic data did not discern any AML risk attributable to organic solvents. Only 16% of the cases had nonzero exposure probabilities and the estimated exposure prevalence was low ($\hat{\bar{x}}'$ = 4.2%; Eq 10). For the threshold approach, *ÔR*_{T} was 1.1 (CI = 0.8–1.6), corresponding to an *ÂF*′_{T} of 1.5% (Eq 12).

We obtained similarly low estimates under the additive-relative OR model (*ÔR*_{L} = 1.2, CI = 0.6–2.7; *ÂF*′_{L} = 0.84% [Eq 11]). Among subjects with zero exposure probability according to the census/JEM data, 11% were classified as exposed according to the individual assessments. This severe contamination of group 0 with exposed subjects was an important explanation for the observed negative bias of the AF estimates for the partially ecologic data compared with the AF estimate based on individual exposure data. Many occupations with low (<5%) exposure prevalences are assigned zero exposure probability in FINJEM, ^{41} which may to some extent explain the contamination of group 0.
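The reported AF estimates can be roughly cross-checked from the rounded figures quoted in the text. The sketch assumes the case-based form AF = (exposure prevalence among cases) × (OR − 1)/OR, which covers the individual assessment, the threshold approach, and, with OR = 1 + β, the linear approach. The helper name is hypothetical, and because the inputs are rounded the results only approximate the published values.

```python
def case_based_af(case_exposure_prevalence: float, odds_ratio: float) -> float:
    """AF = (exposure prevalence among cases) * (OR - 1) / OR."""
    return case_exposure_prevalence * (odds_ratio - 1.0) / odds_ratio


# Rounded inputs taken from the text; the published estimates used unrounded values.
af_individual = case_based_af(0.20, 1.5)   # individually assessed exposure
af_threshold = case_based_af(0.16, 1.1)    # cases with nonzero exposure probability
af_linear = case_based_af(0.042, 1.2)      # mean exposure probability among cases

for label, value in [("individual", af_individual),
                     ("threshold", af_threshold),
                     ("linear", af_linear)]:
    print(f"{label}: {value:.2%}")
```

With these rounded inputs the three values come out near 6.7%, 1.5%, and 0.7%. The gap between the last value and the reported 0.84% is of the size expected from rounding the OR to 1.2.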

## Discussion

We propose a new AF estimator for partially ecologic case-control studies of dichotomous exposures. The partially ecologic case-control design may well be a rapid and cost-efficient approach to conduct population-based studies. Although AF is seldom used in epidemiologic reports, this measure has the potential of identifying risk factors for which effective interventions may have great public health impact. ^{42}

Our simulations show that the intuitively appealing threshold AF estimator *ÂF*_{T} and its CIs have a precision that is close to the precision of the AF estimator under the linear OR model, *ÂF*_{L}, in situations in which the proportion of subjects in groups with nonzero exposure prevalences is low. Under such exposure scenarios (often encountered in studies of occupational exposure to chemicals, for example), it appears that the precision in the AF estimate can be improved substantially only by assessing the exposure individually. On the other hand, if a large part of the population belongs to groups with relatively low but nonzero exposure prevalences (which may be the case in studies of exposure to extremely low frequency magnetic fields, for example), the gain in precision by using *ÂF*_{L} instead of *ÂF*_{T} may be substantial. In addition, *ÂF*_{L} (but not *ÂF*_{T}) can be used when estimating the impact of partial exposure reduction, implying reduced exposure prevalence in some but not in all groups (see also Greenland ^{28}). Contrary to *ÂF*_{L}, however, *ÂF*_{T} can be used even if the exposure database (*eg*, a JEM) comprises binary exposure classifications rather than exposure proportions. The complexity of the variance expressions for *ÂF*_{T} and *ÂF*_{L} hampered our ability to identify beforehand exposure scenarios with considerable improvement in precision under the linear OR model.

The proposed method for estimating a 95% confidence interval around *ÂF*_{L} generally led to satisfactory coverage. However, decreased coverage resulted when a large majority of the population (80%) was truly unexposed and the remaining population had a low exposure probability, resulting in a low overall exposure prevalence (0.07). Under such scenarios, the linear OR model should be used with caution because it may produce unstable estimates of the regression parameter β and its variability var(β̂) (see also the empirical example of Ref 1).

We investigated bias produced by incorrectly identifying the group with zero exposure prevalence. The observed difference in bias sensitivity between the two AF estimators was small and may not be of practical importance. Nevertheless, the bias in *ÂF*_{T} (Eq 14) can be viewed as the upper limit of the (negative) bias in *ÂF*_{L} in the absence of confounding and effect modification. This is true unless errors in the estimated exposure prevalences reverse the direction of the estimated effect. Even small proportions of exposed subjects within the group regarded as truly unexposed produced severe bias of both AF estimates and decreased coverage of corresponding CIs. As highlighted by the empirical example, it is crucial when establishing exposure databases that the exposure prevalence is also assessed for groups in which the exposure is expected to be rare, rather than simply set to zero.

When estimating the AF based on the exposure proportion of the cases, a lack of representativeness of the case series may attenuate or exaggerate the AF estimate even if the OR estimate is unbiased. ^{27} Confounding across groups can be adjusted for in partially ecologic settings if individual data on confounders are available. It should be noted, however, that the additive-relative OR model (Eq 13) provides a valid confounder adjustment only if each exposure probability *x* is constant across various levels of the confounders, *ie*, when there is no confounding within groups. Any residual confounding across or within groups may hide, alter, or even spuriously create a nonzero AF. However, the partially ecologic case-control design facilitates a more detailed grouping of the population, which may reduce such ecologic bias ^{43} and makes the design far more attractive than the traditional pure ecologic design.

## Acknowledgments

Maria Albin and Timo Kauppinen gave access to the empirical dataset presented in the text.

## Appendix A: Derivation of vâr(*AF*_{L})

Let *S*_{ca} and *S*_{co} denote the sum of the observed exposure probabilities among the cases and controls, respectively, and let

[equation missing in source]

The mean exposure probability among the controls is

$$\hat{\bar{x}} = \frac{S_{co}}{n}$$

where *n* is the number of controls. The logit transformation of *AF*_{L} (Eq 6) is

$$\operatorname{logit}(AF_L) = \ln\frac{AF_L}{1 - AF_L} = \ln\beta + \ln\bar{x}.$$

Thus, by assuming that the control selection is such that $\hat{\bar{x}}$ is a valid estimate of the exposure prevalence *x̄* in the population, it follows that

[equation missing in source]

If ln β̂ is an unbiased estimator of the true value ln β and *U*(ln β) is the efficient score evaluated at ln β, then it can be shown that asymptotically, subject to certain regularity conditions, ^{31,44}

[equation missing in source]

For binary regression models, ^{45} such as the linear OR model,

[equation missing in source]

Using the delta method, ^{32} it follows that

[equation missing in source]

and, furthermore,

[equation missing in source]

and

[equation missing in source]

Accordingly, var(*ÂF*_{L}) can be estimated as

[Eq 9: equation missing in source]

## Appendix B: The Distribution of the Exposure Probabilities Estimated from the Case Series

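The exposure probability for the cases in group *j*, *x*′_{j}, satisfies ^{27}

$$x'_j = \frac{x_j (1 + \beta)}{OR_j}$$

and

$$OR_j = 1 + \beta x_j$$

such that

$$x'_j = \frac{x_j (1 + \beta)}{1 + \beta x_j}.$$

Thus, *x*′_{j} > *x*_{j} if β > 0 and 0 < *x*_{j} < 1, *ie*, when there is a harmful effect of exposure, cases are more likely to have been exposed than controls within the same group. As a result, the overall exposure prevalence among the cases is

$$\bar{x}' = \sum_{j=0}^{J-1} \eta'_j \, x'_j$$

where η′_{j} is the proportion of cases that belongs to group *j*.

Under the linear OR model (Eq 4), and as a reasonable approximation under the additive-relative OR model (Eq 13), the group-specific proportions satisfy ^{12}

$$\frac{\eta'_j}{\eta'_0} = \frac{\eta_j}{\eta_0} (1 + \beta x_j)$$

which is equivalent to

$$\eta_j = \eta_0 \, \frac{\eta'_j}{\eta'_0 (1 + \beta x_j)}.$$

Given that ∑_{j = 0}^{J − 1} η_{j} = 1, some algebraic manipulations yield

$$\eta_0 = \left[\sum_{j=0}^{J-1} \frac{\eta'_j}{\eta'_0 (1 + \beta x_j)}\right]^{-1}.$$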

Thus, η_{j} (*j* = 0, 1, ..., *J* − 1) and hence var(*x*) can be estimated on the basis of the observed distribution of the cases in the various groups.
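A round-trip check illustrates how the population group proportions, and hence var(*x*), can be recovered from the case series. The sketch assumes a linear OR model OR(*x*) = 1 + β*x*, under which the proportion of cases in group *j* is proportional to η_{j}(1 + β*x*_{j}), so that η_{j} is proportional to η′_{j}/(1 + β*x*_{j}). The function name, the population, and the value of β are hypothetical. In practice β would be replaced by its estimate.

```python
from typing import List, Sequence


def shares_from_case_distribution(eta_case: Sequence[float],
                                  x: Sequence[float],
                                  beta: float) -> List[float]:
    """Recover population shares: eta_j proportional to eta'_j / (1 + beta * x_j)."""
    weights = [ec / (1.0 + beta * xj) for ec, xj in zip(eta_case, x)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical population used to generate a case distribution.
eta = [0.6, 0.1, 0.1, 0.1, 0.1]
x = [0.0, 0.1, 0.2, 0.5, 0.6]
beta = 2.0

or_pop = 1.0 + beta * sum(e * xj for e, xj in zip(eta, x))
eta_case = [e * (1.0 + beta * xj) / or_pop for e, xj in zip(eta, x)]

# Recover the population shares, then the mean and variance of x.
eta_recovered = shares_from_case_distribution(eta_case, x, beta)
xbar_recovered = sum(e * xj for e, xj in zip(eta_recovered, x))
var_recovered = sum(e * (xj - xbar_recovered) ** 2 for e, xj in zip(eta_recovered, x))

print("eta recovered:", [round(e, 4) for e in eta_recovered])
print(f"xbar = {xbar_recovered:.4f}, var(x) = {var_recovered:.4f}")
```

Normalising the weights so that they sum to one is equivalent to solving for η_{0} first and scaling the remaining groups, which avoids dividing by the proportion of cases in group 0.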

## References

*et al*. Occupational exposure to pesticides and pancreatic cancer. Am J Ind Med 2001; 39: 92–99.

*et al*. Risk of pancreatic cancer and occupational exposures in Spain. Ann Occup Hyg 2000; 44: 391–403.

*et al*. Are occupational, hobby, or lifestyle exposures associated with Philadelphia chromosome-positive chronic myeloid leukemia? Occup Environ Med 2001; 58: 722–727.

*et al*. Urban air pollution and lung cancer in Stockholm. Epidemiology 2000; 11: 487–495.

*et al*. Acute myeloid leukemia and clonal chromosome aberrations in relation to past exposure to organic solvents. Scand J Work Environ Health 2000; 26: 482–491.

**Keywords:**

etiologic fraction; attributable risk; bias; ecologic studies; job exposure matrix; occupational exposure; epidemiologic method; linear models