**ArticlePlus**

Click on the links below to access all the ArticlePlus files for this article.

Please note that ArticlePlus files may launch a viewer application outside of your web browser.

* http://links.lww.com/EDE/A97

* http://links.lww.com/EDE/A98

In epidemiologic research, evidence for a postulated biologic mechanism may be strengthened if 1 risk factor can be shown to modify the action of another. Theories regarding potential effect modification are often founded in evidence from laboratory research and are intended to tease apart the complex web of causation. A wealth of experimental data may allow for meticulously specified mechanistic hypotheses; nevertheless, the correct statistical methods with which to investigate effect modification in human populations are not always clear.

Given the well-recognized difficulty of interpreting statistical interactions as evidence in support of biologic effect modification,^{1–3} it is not uncommon for researchers to explore effect modification on more than one statistical scale,^{4–12} possibly out of uncertainty about which scale is “correct.” Rather than clarifying the consistency of data with postulated biologic mechanisms, this approach may merely obfuscate the target of scientific inquiry, because multiple testing usually inflates the probability of falsely concluding that biologic effect modification has occurred. To our knowledge, however, no one has ever empirically evaluated the degree of type I error inflation associated with testing for interaction on more than one scale. Although the hazards of multiple significance-testing are widely appreciated by epidemiologists, this awareness has apparently not extended to the specific context of interaction hypothesis-testing on multiple measurement scales; in none of the cited examples do authors acknowledge this problem.^{4–12}

We addressed this statistical problem by performing a simulation study of how testing for interaction on both the additive and multiplicative scales might affect type I error rates in case-control studies. We studied validity of *P* values not because we advocate the use of dichotomous significance-testing, but because this provides a clear way to assess the hazards of measuring interaction on more than 1 scale.^{13}

#### METHODS

##### Simulation Study Design

We simulated data that could have been collected from epidemiologic case-control studies under a range of underlying “true” scenarios with 2 dichotomous exposure variables. These scenarios differed in the underlying individual and joint exposure probabilities (Table 1), in the individual exposure odds ratios (OR = 1, 2, 5, or 10), and in the measurement scale under which the exposures did not modify each other’s effects—additive, multiplicative, or both. We varied the sample sizes from 500 to 2500 individuals, half of whom were cases. Given the exposure probability distribution among the controls, one can algebraically compute the exposure probability distribution among cases that would have resulted in the specified individual and joint exposure ORs. Using these distributional characteristics of each scenario, we randomly generated 5000 datasets.
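The original simulations were programmed in Stata; as an illustration, the data-generating step can be sketched in Python (function names are ours). Given the control-group cell probabilities and the specified cell ORs, the case-group cell probabilities follow algebraically, and a dataset is drawn as independent samples of cases and controls:

```python
import numpy as np

def case_cell_probs(p_ctrl, ors):
    """Exposure-cell probabilities among cases implied by the control
    distribution and the cell-specific ORs.

    Cell order: neither exposure, X1 only, X2 only, both exposures;
    the jointly unexposed cell is the referent (OR = 1)."""
    p = np.asarray(p_ctrl, float) * np.asarray(ors, float)
    return p / p.sum()

def simulate_dataset(n, p_ctrl, ors, rng):
    """One simulated case-control dataset: n/2 controls drawn from the
    control cell probabilities and n/2 cases from the implied case
    cell probabilities."""
    half = n // 2
    ctrl_cells = rng.choice(4, size=half, p=np.asarray(p_ctrl, float))
    case_cells = rng.choice(4, size=half, p=case_cell_probs(p_ctrl, ors))
    cells = np.concatenate([ctrl_cells, case_cells])
    y = np.repeat([0, 1], half)                # 0 = control, 1 = case
    x1 = np.isin(cells, [1, 3]).astype(int)
    x2 = np.isin(cells, [2, 3]).astype(int)
    return y, x1, x2

# Under a multiplicative null the joint-exposure OR is the product of the
# single-exposure ORs; under an additive null it is their sum minus 1.
```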

##### Assessment of Interaction on the Multiplicative Scale

We assessed risk ratio (RR) heterogeneity by using a test based on the multiplicative interaction term (M) from a logistic regression model. We fit a saturated logistic regression model to each dataset, in which *X1* and *X2* were dichotomous indicators of exposures 1 and 2 and the jointly unexposed group was the referent category:

logit[Pr(case)] = β_{0} + β_{1}*X1* + β_{2}*X2* + β_{3}*X1X2* (1)

We then fit a reduced logistic regression model excluding the multiplicative interaction term, β_{3}:

logit[Pr(case)] = β_{0} + β_{1}*X1* + β_{2}*X2* (2)

We compared models (1) and (2) by using a likelihood ratio test.
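The original analyses were run in Stata; as an illustration, the comparison of the saturated and reduced logistic models amounts to a 1-degree-of-freedom likelihood ratio test, which can be sketched in Python by direct likelihood maximization (the function names are ours, not from the article):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def logistic_nll(beta, X, y):
    """Negative log-likelihood of a logistic regression model."""
    eta = X @ beta
    return np.sum(np.log1p(np.exp(eta)) - y * eta)

def lrt_multiplicative(y, x1, x2):
    """Likelihood ratio test of the multiplicative interaction term:
    saturated model with X1, X2, and the X1*X2 product term versus the
    reduced model without the product term (1 degree of freedom)."""
    X_full = np.column_stack([np.ones(len(y)), x1, x2, x1 * x2]).astype(float)
    y = np.asarray(y, float)
    full = minimize(logistic_nll, np.zeros(4), args=(X_full, y), method="BFGS")
    red = minimize(logistic_nll, np.zeros(3), args=(X_full[:, :3], y), method="BFGS")
    stat = 2.0 * (red.fun - full.fun)
    return stat, chi2.sf(stat, df=1)
```

With balanced cell counts (no association on either scale), the statistic is near zero and the *P* value near 1.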

##### Assessment of Interaction on the Additive Scale

We assessed risk ratio difference (RRD) heterogeneity by using a test based on the interaction term (A) from an additive relative risk regression model, in which the RRDs are approximated through odds ratio differences (ORD) when the outcome being modeled is rare.^{14–16} We first fit a saturated model parameterized to include an additive relative risk interaction coefficient:

OR(*X1*, *X2*) = 1 + α_{1}*X1* + α_{2}*X2* + α_{3}*X1X2* (3)

where the interaction term, α_{3}, is an estimate of the ORD. To assess the statistical significance of this estimate, we used a likelihood ratio test to compare the saturated model (3) with a reduced model in which the interaction term was constrained to zero:

OR(*X1*, *X2*) = 1 + α_{1}*X1* + α_{2}*X2* (4)

In keeping with our goal of assessing inflation of the type I error rate when commonly used techniques are applied, we fit the additive relative risk regression models using standard software. We thus did not program constrained estimation of these models, which would have prevented the estimated coefficients from implying negative ORs. Instead, after each estimation, we checked whether the parameter estimates implied negative ORs; we removed such datasets without replacement and tallied their number.
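The additive relative risk model and the post-fit negative-OR check can be sketched as follows. This is our own minimal Python illustration, assuming the model is fit on the odds scale by unconstrained likelihood maximization, as the article describes for standard software; function names and starting values are illustrative:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def additive_rr_nll(theta, x1, x2, y):
    """Negative log-likelihood of an additive relative risk model in odds
    form: odds(X1, X2) = exp(b0) * (1 + a1*X1 + a2*X2 + a3*X1*X2)."""
    b0, a1, a2, a3 = theta
    rel = 1.0 + a1 * x1 + a2 * x2 + a3 * x1 * x2
    if np.any(rel <= 0):  # outside the admissible parameter space
        return np.inf
    eta = b0 + np.log(rel)
    return np.sum(np.log1p(np.exp(eta)) - y * eta)

def implies_negative_or(a1, a2, a3):
    """Post-fit check: do the estimates imply a negative OR in any cell?"""
    return min(1 + a1, 1 + a2, 1 + a1 + a2 + a3) <= 0

def lrt_additive(y, x1, x2):
    """Likelihood ratio test of the additive interaction coefficient a3,
    comparing the saturated model with the reduced model (a3 = 0)."""
    full = minimize(additive_rr_nll, [0.0, 0.1, 0.1, 0.0],
                    args=(x1, x2, y), method="Nelder-Mead")
    red = minimize(lambda t: additive_rr_nll(np.append(t, 0.0), x1, x2, y),
                   [0.0, 0.1, 0.1], method="Nelder-Mead")
    stat = 2.0 * (red.fun - full.fun)
    return stat, chi2.sf(stat, df=1), full.x
```

Datasets for which `implies_negative_or` is true would be discarded, mirroring the exclusion rule above.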

##### Computation of Type I Error

We excluded from all analyses any dataset in which (a) the individuals in a given exposure stratum were either all cases or all controls or (b) the additive relative risk regression model produced ORD parameter estimates that implied negative ORs. We recorded the numbers of each of these types of omitted datasets. The individual test type I error rates were calculated as the proportion of the remaining datasets in which the respective null hypotheses for tests based on A and M were rejected even though they were true. We then computed the joint type I error rate as the proportion of datasets in which the null hypothesis was rejected by the test based on A, the test based on M, or both, when there was no underlying interaction on either scale. We computed exact binomial confidence intervals (CIs) for these probabilities. We also assessed the proportion of simulations in which the tests based on A and M yielded concordant or discordant results.
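The joint rejection rate and its exact binomial (Clopper-Pearson) CI can be computed from the per-dataset *P* values; a short sketch, with function names of our own choosing:

```python
import numpy as np
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) binomial confidence interval for k/n."""
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

def joint_rejection_rate(p_a, p_m, alpha=0.05):
    """Proportion of datasets in which the null hypothesis is rejected by
    the test based on A, the test based on M, or both, with an exact CI."""
    reject = (np.asarray(p_a) < alpha) | (np.asarray(p_m) < alpha)
    k, n = int(reject.sum()), reject.size
    return k / n, clopper_pearson(k, n)
```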

We performed all simulations in Stata version 7.0.^{17}

#### RESULTS

##### Description of Data

Fifty-five of the 144 scenarios generated complete sets of 5000 datasets in which none led to negative ORs in the additive relative risk regression model (data available with the electronic version of this article). The negative OR problem was less frequent with low ORs when there was no “true” ORD heterogeneity and with higher ORs when there was no “true” OR heterogeneity. Parameter estimates that would have led to negative ORs occurred more frequently as the magnitude of the 2 ORs diverged, when there was no underlying ORD heterogeneity, and particularly at smaller sample sizes.

Nine datasets contained cell frequencies that were zero; all occurred in the scenario that had both single-exposure ORs equal to 10, OR homogeneity, sample sizes of 500, and high exposure prevalence.

##### Tests of Interaction Considered Individually

In evaluating departures from additivity of the ORDs, the scenario-specific A rejection rate ranged from 2.3% to 5.5% (Table 2). This test was generally conservative, as demonstrated by the observation that in all 57 of the 90 scenarios in which the 95% CI excluded 0.05, the CI lay entirely below 0.05. There was no apparent pattern in the type I error rate in relation to either the exposure prevalence or the ORs, but the type I error rate increased slightly with increasing sample sizes (Table 2).

The type I error rate of the test based on M was close to its nominal level more often, and at smaller sample sizes and lower ORs, than that of the test based on A (Table 2). In the scenarios in which there was no underlying heterogeneity of the ORs, the M rejection rate ranged from 3.2% to 5.9%. The 95% CI excluded 0.05 in only 21 of the 90 scenarios, and these tended to be the scenarios with the lowest exposure prevalence. As the sample size, exposure prevalence, or (especially) the ORs increased, so did the type I error rate.

##### Simultaneous Tests of Interaction on Both the Additive and Multiplicative Scales

In scenarios in which the underlying ORs conformed to the “true” null hypothesis—that there was no interaction on either statistical scale—one of the ORs was necessarily equal to 1. When interactions were explored with both A and M under these conditions, 25 of 36 scenarios (69%; 95% CI = 52–84%) yielded rejection rates above the nominal 5% (Table 3). The highest type I error rate observed with this combination of statistical tests was 7.3% (95% CI = 6.5–8.2%). The inflation of the joint A/M rejection rate tended to increase as the underlying exposure probability, second exposure OR, or sample size increased.

##### Comparison of the Rejection Regions of Tests of ORD and OR Heterogeneity

Tests of both M and A tended to indicate rejection of their respective null hypotheses in the same datasets when both ORs were equal to 1 (Fig. 1). As the second exposure OR increased, the overlap in the rejection regions of the 2 tests decreased, both in the absolute proportion and as a percentage of the datasets in which either hypothesis was rejected. This occurred despite the fact that the average rejection rates of either A or M individually did not necessarily increase with the second exposure OR. Neither sample size (Fig. 1) nor exposure prevalence (Table 3) affected the patterns of overlap of the 2 tests in any systematic way.

##### Other Tests of ORD Heterogeneity

We compared these results regarding joint testing with M and A with those obtained when M was used in conjunction with each of 3 other ORD heterogeneity tests: tests based on the relative excess risk resulting from interaction (RERI),^{18,19} the synergy index (SI),^{18,19} and the proportion of interaction attributable to the joint exposure to 2 risk factors (APAB).^{18–20} Among the 4 joint tests, type I error inflation was lowest when M was used in conjunction with either the RERI or SI, each of which performed extremely conservatively on an individual basis (data available with the electronic version of this article). Type I error inflation was highest for the joint test based on APAB and M, for which the rejection rate exceeded 5% in all scenarios; the APAB rejection rate also tended to be inflated when it was used alone.
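The point estimates behind these three alternative statistics follow standard definitions from the cell-specific ORs.^{18–20} A minimal sketch (tests of these quantities additionally require variance estimates, e.g., by the delta method, which are omitted here; function names are ours):

```python
def reri(or10, or01, or11):
    """Relative excess risk due to interaction: OR11 - OR10 - OR01 + 1."""
    return or11 - or10 - or01 + 1.0

def synergy_index(or10, or01, or11):
    """Synergy index: excess joint OR over the sum of the single excess ORs."""
    return (or11 - 1.0) / ((or10 - 1.0) + (or01 - 1.0))

def ap_ab(or10, or01, or11):
    """Proportion of the joint-exposure effect attributable to interaction."""
    return reri(or10, or01, or11) / or11
```

Under exact OR additivity (e.g., OR10 = 2, OR01 = 3, OR11 = 4), RERI = 0, SI = 1, and APAB = 0.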

#### DISCUSSION

In simulations of case-control data generated over a range of underlying ORs, exposure prevalence, and sample sizes, we observed inflation of the type I error rate when we evaluated statistical tests of interaction on a combination of the additive and multiplicative scales. These results show the limited validity of presenting the test with the lowest *P* value as the one that is more indicative of underlying biology. Just as is already widely appreciated with regard to the assessment of first-order associations, lack of a priori specification of the scale on which one will measure effect modification will often lead to falsely low *P* values.

The inflation of the type I error rate was generally less than might have been expected in a multiple-testing setting such as this; under many scenarios, the joint-test type I error rates remained below 5%. This occurred, in part, because the 2 tests falsely rejected their respective null hypotheses in many of the same datasets. The dependence between the M and A tests was less marked when the second exposure OR was high, but the number of datasets missing as a result of negative ORs was also higher in these scenarios, which complicates the interpretation of this trend. The joint A/M type I error inflation was also lessened by the poor approximation of the chi-squared distribution for the test of A, which was extremely conservative. Although many acknowledge that estimation of ORDs necessitates large sample sizes,^{16,21} it is generally underappreciated how large the sample size must be for the results of approximate tests based on these estimators to be valid. Even with samples as large as 2500, rejection rates for the individual test based on A were lower than the nominal value.

Still, among the scenarios in which we simulated sample sizes of 1000 or greater, the proportion with inflated type I errors was greater (19 of 24, or 79%; 95% CI = 58–93%), probably because larger sample sizes allowed the tests to behave closer to their approximated distributions. Similarly, as the exposure prevalence increased, so did the proportion of scenarios with inflated type I errors. These patterns suggest that type I error inflation can be a problem in practice.

We excluded the additive relative risk regression models in which the estimated coefficients would have produced negative ORs, which have no realistic interpretation. In practice, this problem can be prevented by constraining the regression model estimation. The need to program a constrained additive relative risk regression model is one of the technical challenges that currently discourages its widespread use; it would be useful for commercial software to include a constrained additive relative risk regression program.

A further limitation of this study was our focus on low exposure prevalence. We initially chose this range to minimize the number of empty cells when the ORs were high, to avoid programming an exact test for these datasets. Thus, our results may apply only to the limited range of exposure prevalence investigated here.

The exploration of statistical interactions on multiple scales may be entirely appropriate in some specific research contexts. In exploratory data analysis, for example, researchers might be less interested in whether additivity of ORDs or multiplicativity of ORs holds in their data and more interested in the degree to which these relationships do not hold.^{15,22,23} Or, for purposes of prediction within a specific population, one might compare results from a variety of statistical models to find the one that best predicts a particular outcome in data sampled from that population. When the purpose of the data analysis is to assess consistency with specific etiologic mechanisms, however, authors should state their reasons for having performed interaction hypothesis tests on a given scale. To aid readers in interpreting epidemiologic findings regarding potential effect modification, authors should make sure to report the magnitude of the interaction estimate along with a measure of its variance.

Our results regarding type I error rates should be considered among other possible reasons for false-positive tests for interactions. Statistical interactions in the absence of biologic interactions can result from model misspecification, including misassignment of the cutpoint when a continuous outcome is dichotomized. Similarly, exposure measurement error may cause 2 exposures to appear to interact even when they do not.^{24} Thus, we cannot emphasize strongly enough that statistical interaction is not a sufficient criterion for biologic interaction. Furthermore, our results demonstrate that statistical interactions should be interpreted especially cautiously when one has performed these tests on more than 1 measurement scale. We agree with the advice that researchers should “…seek different, complementary, quantifiable, and verifiable implications of the biologic model. Only this would reduce the likelihood that the posited biologic model is one among many possible contenders.”^{1}

#### ACKNOWLEDGMENTS

We thank Stephen M. Schwartz for his suggestions.