The last 10 years have seen many studies of the dependence of daily mortality or hospital admissions in cities on daily concentrations of ambient pollution in the same city. 1 Similar designs are used to study effects of changing weather on health, and can be used to study acute effects of any risk factor that changes over time. Recently, several studies have focused on whether the pollution effect is modified by factors that might affect the vulnerability of subjects, such as pre-existing disease and socio-economic status (SES). 2,3 The conventional approach to the investigation of such modification is by inclusion of interaction terms in a regression analysis of outcome on the exposure and modifier, or by carrying out separate regressions on strata with different modifier status (eg, vulnerable and not vulnerable). However, this approach has some drawbacks. It requires intricate and hard-to-validate modeling of fluctuations in mortality over time not due to the variable of interest; if there are no denominator data, it must also reflect changes in the denominator over time. If there are denominator data, data on exposure and the effect modifier in this population are needed.
To avoid these problems, we propose using the case-only approach. Originally proposed for use in studies of gene–environment interactions, this approach involves measuring the environmental factor and the genotype (modifier) of a series of cases, without controls. 4–6 If the environmental factor is more often present in cases with a specific genotype than in other cases, then this is evidence that the gene predisposes to the environmental factor causing the disease. The argument can be quantified. Crucially, this conclusion rests on the assumption that the genotype and environmental factor are not associated in the base population from which the cases arose, an assumption that is sometimes controversial. 7,8
The motivation for using the case-only approach in gene–environment studies is different from the motivation for using it in time-series studies. The issue for genetic epidemiology is efficiency: avoiding costs of genotyping and exposure measurement in controls, and increasing power. In fact, even with the same number of cases, the case-only approach has more power than a case-control study. For most time-series studies, there are only “cases,” and no denominators or controls. Here we are proposing the case-only approach for analysis rather than for design, and the motivation is simplification of modeling and reduced vulnerability to model misspecification bias. It seems likely that power is also increased (because information-using devices to substitute for denominators are not needed), but it is unclear whether this effect would be substantial.
Specifically, we propose the use of the case-only approach to study modification by individual factors that are effectively time invariant (such as sex, SES, or housing type) in time-series studies of effects of air pollution, weather or other time-varying factors. The key features of this context are that the risk factors of interest (pollution or weather) vary in time, but not over individuals at the same time, with the reverse being the case for modifiers (they vary over individuals, but not over time). This orthogonality virtually guarantees the assumption of independence of the exposure of interest and the potential modifier in the population from which cases arose.
First, we restate arguments for the validity of the approach for dichotomous exposure and modifier, and then consider polytomous and numerical exposures and modifiers. We provide an example and discuss potential problems.
TERMINOLOGY, NOTATION, AND MODEL
The time-series regression model we propose follows closely that used extensively in the air-pollution time-series literature, such as that described by Dominici. 9 Thus, we describe it only briefly here.
The primary variables are:
* Yij: Number of outcome events in subpopulation i on day j. To make the development more concrete, we will refer to these as deaths.
* xj: Explanatory variable of primary interest (eg, air pollution, ambient temperature) on day j.
* zi: Characteristic of subpopulation i possibly modifying the effect of the primary explanatory variable on mortality (eg, pre-existing disease, housing characteristic); may also be an independent predictor of mortality.
Variables not of primary interest that also influence occurrence rates include:
* wj: A vector of other measured time-varying factors that affect mortality, such as flu epidemics or meteorological factors not of primary interest, may be modeled as linear terms or smooth curves; for simplicity, we assume linear terms φTwj, where T represents “transpose.”
* j (time): Medium- and long-term changes in mortality are expected due to unmeasured factors that change over time. These are modeled as smooth functions (smoothing or natural splines or LOESS smooths) of j: S0(γ, j), where the parameter γ determines smoothness.
The distribution of Yij given xj and zi is assumed to be well approximated by the Poisson distribution, with log link and linear predictor, so that
$$\log E(Y_{ij}) = S_0(\gamma, j) + \phi^{T} w_j + \beta x_j + \delta z_i + \lambda x_j z_i \qquad (1)$$
The parameter of interest is the coefficient λ of the interaction term xjzi, which reflects the modification of the effect of x by z. Quantitatively, it is log of the interaction rate ratio—the change in log rate ratio per unit change in x given a unit change in z, with simpler interpretations if either or both are dichotomous. There may also be interest in β, the “main effect” of exposure x when z = 0, and sometimes also φ, the effect of the time-varying confounders; however, δ is not usually of interest, because in the absence of person-time denominators, this reflects not only differential mortality according to z, but also the population distribution of z.
Dichotomous x and z
Although this case has been considered in general for gene–environment interactions, it is instructive to consider it in this specific context. We assume x and z take the values 0 or 1. The expected numbers of deaths with (x, z) = (0, 0), (0, 1), (1, 0) and (1, 1) are given by summing the daily expectations (model 1) over days in which xj = 0 and xj = 1 (call these j |xj = 0 and j |xj = 1). It simplifies later expressions to define the expectations of sums of daily deaths on “exposed” (x = 1) and “unexposed” (x = 0) days, first for the group without the potential modifier (z = 0):
$$E\Big(\sum_{j \mid x_j = 0} Y_{ij}\Big) = \sum_{j \mid x_j = 0} \exp\{S_0(\gamma, j) + \phi^{T} w_j\} = k_0, \qquad E\Big(\sum_{j \mid x_j = 1} Y_{ij}\Big) = e^{\beta} \sum_{j \mid x_j = 1} \exp\{S_0(\gamma, j) + \phi^{T} w_j\} = k_1 e^{\beta} \qquad (z_i = 0)$$
For the group with the modifier (z = 1), the days on which x = 1 are the same as for the group without the modifier (all experience the same pollution and weather). This is analogous to the assumption in gene–environment studies that the gene and the environmental factor are independent in the population. Thus, the expected numbers of deaths in the group with the potential modifier (z = 1) are the same sums, with additional multipliers to reflect the distribution of z, its effect on mortality, and the modification of this effect by x:
$$E\Big(\sum_{j \mid x_j = 0} Y_{ij}\Big) = k_0 e^{\delta}, \qquad E\Big(\sum_{j \mid x_j = 1} Y_{ij}\Big) = k_1 e^{\beta + \delta + \lambda} \qquad (z_i = 1)$$
The expected distribution of total deaths according to x and z is given in Table 1.
The odds ratio for the association between the exposure x and the modifier z among deaths is, thus, from Table 1:
$$OR = \frac{k_1 e^{\beta + \delta + \lambda} \cdot k_0}{k_1 e^{\beta} \cdot k_0 e^{\delta}} = e^{\lambda}$$
This confirms that the exposure–modifier interaction odds ratio from a case-only sample (ie, deaths only) tabulated as in Table 1 estimates the interaction rate ratio describing the modification of the effect of exposure on mortality by the presence of z.
Note that k0 and k1, which are functions of nuisance variables and parameters, cancel out. This is a major and very useful simplification, as the choice of specific models for these is one of the difficult and debated aspects of time-series regressions. However, the canceling of the main effect β of x is a limiting feature of the approach—main effects are not estimable, and only interactions can be estimated.
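The cancellation can be checked numerically. The sketch below assumes the four expected cell counts take the form k0, k1·exp(β), k0·exp(δ) and k1·exp(β + δ + λ), as implied by model 1. The function names and all numerical values are illustrative and not taken from the paper.

```python
import math

def expected_case_table(k0, k1, beta, delta, lam):
    """Expected deaths keyed by (x, z), assuming the multiplicative form of model 1."""
    return {
        (0, 0): k0,
        (1, 0): k1 * math.exp(beta),
        (0, 1): k0 * math.exp(delta),
        (1, 1): k1 * math.exp(beta + delta + lam),
    }

def case_only_odds_ratio(table):
    """Exposure-modifier odds ratio among deaths only."""
    return (table[(1, 1)] * table[(0, 0)]) / (table[(1, 0)] * table[(0, 1)])

lam = math.log(1.2)  # hypothetical interaction rate ratio of 1.2

# Two very different sets of nuisance sums and main effects, same lambda.
or_a = case_only_odds_ratio(expected_case_table(5000.0, 800.0, 0.10, -0.5, lam))
or_b = case_only_odds_ratio(expected_case_table(12000.0, 300.0, 0.40, 0.9, lam))
print(or_a, or_b)
```

Both odds ratios equal exp(λ) whatever values are chosen for k0, k1, β and δ, which is the sense in which the nuisance terms and the main effects drop out.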
Polytomous and Numerical x with Dichotomous z
The overwhelming majority of mortality time-series regression studies have focused on numerical explanatory variables of interest (pollution or temperature). As noted by Albert, 7 we can extend the above argument to derive the probability of z conditional on a numerical x among deaths. Imagine, for example, Table 1 extended to as many rows as there are unique x values or groups. For each value of x in the table, the probability of z being 1 reduces simply to:
$$P(z = 1 \mid x) = \frac{\exp(\delta + \lambda x)}{1 + \exp(\delta + \lambda x)} = \operatorname{expit}(\delta + \lambda x) \qquad (2)$$
Here, subscripts i and j have been omitted for simplicity. Equation 2 is a logistic model, implying that the interaction parameter λ is estimable from a logistic regression of z on x.
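The following simulation sketches this result under stated assumptions. Daily deaths are drawn from a Poisson model of the form of model 1 with a seasonal nuisance term that is correlated with x. A logistic regression of z on x is then fitted to the deaths alone by Newton-Raphson. All parameter values, the seasonal shape and the helper name `fit_case_only_logistic` are hypothetical choices for illustration. NumPy is assumed to be available.

```python
import numpy as np

def fit_case_only_logistic(x, deaths_z1, deaths_total, n_iter=25):
    """Newton-Raphson fit of logit P(z=1 | x) = delta + lam * x to daily death counts."""
    X = np.column_stack([np.ones_like(x), x])
    coef = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ coef)))
        score = X.T @ (deaths_z1 - deaths_total * p)
        info = (X * (deaths_total * p * (1.0 - p))[:, None]).T @ X
        coef = coef + np.linalg.solve(info, score)
    std_err = np.sqrt(np.diag(np.linalg.inv(info)))
    return coef, std_err

rng = np.random.default_rng(0)
day = np.arange(1461)
season = np.sin(2.0 * np.pi * day / 365.25)            # nuisance, never modeled below
x = np.maximum(0.0, 3.0 * season + rng.normal(0.0, 2.0, day.size))

beta_true, delta_true, lam_true = 0.03, -0.4, 0.05     # illustrative values only
mu_z0 = 40.0 * np.exp(-0.3 * season + beta_true * x)   # group without modifier
mu_z1 = mu_z0 * np.exp(delta_true + lam_true * x)      # group with modifier
deaths_z0 = rng.poisson(mu_z0)
deaths_z1 = rng.poisson(mu_z1)

coef, std_err = fit_case_only_logistic(x, deaths_z1, deaths_z0 + deaths_z1)
print("delta_hat, lam_hat:", coef, "SE:", std_err)
```

The fitted slope should lie close to the simulated λ even though neither the seasonal term nor β appears in the fitted model.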
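For polytomous x with L levels, we reinterpret x as a vector of L-1 indicators, and λ as a vector of length L-1.
To extend the model to polytomous modifier z (say z = 1, …, K), we extend the logistic regression (model 2) to a polytomous (multinomial) logistic regression model:
$$P(z = k \mid x) = \frac{\exp(\delta_k + \lambda_k x)}{\sum_{l=1}^{K} \exp(\delta_l + \lambda_l x)}, \qquad \delta_1 = \lambda_1 = 0 \qquad (3)$$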
This expression extends the definition of the function expit. This can again be motivated heuristically by imagining a table set out like Table 1, but with K columns rather than 2.
Most major statistical packages fit this model, using maximum likelihood. For dichotomous and numerical x, there will be K-1 interaction parameters (λ1 is set to 0), reflecting how the baseline (z = 1) effect of exposure on mortality is modified by each other level of z. For polytomous x of L levels, there will be (K-1)(L-1) parameters, probably best avoided by dichotomizing x or z, or assuming a numerical score for one of them (see below for numerical z).
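As a small illustration, the probabilities of the polytomous model can be computed directly. This sketch assumes model 3 has the usual multinomial-logit form with baseline δ1 = λ1 = 0. The function name and the parameter values are hypothetical.

```python
import math

def modifier_probabilities(x, deltas, lams):
    """P(z = k | x) among deaths; deltas[0] and lams[0] are the baseline zeros."""
    etas = [d + l * x for d, l in zip(deltas, lams)]
    top = max(etas)                       # subtract the maximum for numerical stability
    weights = [math.exp(e - top) for e in etas]
    total = sum(weights)
    return [w / total for w in weights]

def expit(u):
    return 1.0 / (1.0 + math.exp(-u))

# Four modifier levels (K = 4) at x = 3, with K - 1 = 3 free interaction parameters.
probs4 = modifier_probabilities(3.0, [0.0, 0.2, -0.1, -0.3], [0.0, 0.01, -0.02, -0.05])

# With K = 2 the expression reduces to the binary logistic model 2.
probs2 = modifier_probabilities(3.0, [0.0, -0.4], [0.0, 0.05])
print(probs4, probs2[1], expit(-0.4 + 0.05 * 3.0))
```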
An alternative and probably equivalent Poisson log-linear model for polytomous modifiers in case-only analyses has also been proposed. 5 In this context, we prefer the polytomous logistic formulation because of its simple motivation as an extension of the binary logistic model, and its natural handling of numerical continuous explanatory variables (x) of interest.
We have found no presentation of methods for a numerical modifier z in the case-only literature, presumably because genotype is intrinsically categorical. In the full model 1 a numerical–numerical interaction term is defined as the product of x and z. The interaction coefficient λ is, however, invariant to centering x and z around alternative origins (hence, product (x − x0)(z − z0)), which is sometimes useful to reduce correlation of the product with x or z.
We propose using constrained polytomous regression to fit this model in the case-only approach. Thus, we will use the approach of the previous subsection, but constrain the parameters λk to reflect a linear increase or decrease in the modification of the effect of x by z as z increases.
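Assume first that there are only a few (K) distinct values of z: z1, …, zK. We can then apply model 3 but with a constraint to the parameters λk as follows:
$$\lambda_k = \lambda (z_k - z_1), \qquad k = 1, \ldots, K$$
This reproduces the baseline (λ1 = 0) of the unconstrained model, and substituting it in model 3 gives
$$P(z = z_k \mid x) = \frac{\exp\{\delta_k + \lambda (z_k - z_1) x\}}{\sum_{l=1}^{K} \exp\{\delta_l + \lambda (z_l - z_1) x\}}, \qquad \delta_1 = 0 \qquad (4)$$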
The case-only model 4 can thus be deduced from the full-data model 1 if z is centered around its lowest value z1, using the same heuristic argument we used above for the unconstrained logistic model. Constraints are accommodated in some statistical packages (eg, Stata, College Station, TX). If there are too many distinct values of z to allow each to be a level in such a model, an approximation is obtained by grouping z, and using group means as zk. However, we have not investigated the impact of this approximation, and expect to encounter lost precision and power.
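Where a package with constraint facilities is not to hand, the constrained model can be fitted directly by maximum likelihood. The sketch below is one possible implementation, not the procedure used in the paper. It assumes model 4 has the form P(z = zk | x) ∝ exp{δk + λ(zk − z1)x} with δ1 = 0, and fits (δ2, …, δK, λ) by Newton-Raphson on simulated data. The function name, simulated parameter values and scores are hypothetical. NumPy is assumed to be available.

```python
import numpy as np

def fit_constrained_polytomous(x, counts, scores, n_iter=30):
    """Fit (delta_2..delta_K, lam) where eta_jk = delta_k + lam * (z_k - z_1) * x_j."""
    counts = np.asarray(counts, dtype=float)
    J, K = counts.shape
    s = scores - scores[0]                        # centre z at its lowest value
    D = np.zeros((J, K, K))                       # D[j, k, :] = d eta_jk / d theta
    for k in range(1, K):
        D[:, k, k - 1] = 1.0                      # delta_k enters eta_k only
    D[:, :, K - 1] = x[:, None] * s[None, :]      # lam enters through (z_k - z_1) * x
    N = counts.sum(axis=1)
    theta = np.zeros(K)
    for _ in range(n_iter):
        eta = D @ theta
        eta -= eta.max(axis=1, keepdims=True)
        p = np.exp(eta)
        p /= p.sum(axis=1, keepdims=True)
        score = np.einsum('jk,jkp->p', counts - N[:, None] * p, D)
        d_bar = np.einsum('jk,jkp->jp', p, D)
        info = (np.einsum('j,jk,jkp,jkq->pq', N, p, D, D)
                - np.einsum('j,jp,jq->pq', N, d_bar, d_bar))
        theta = theta + np.linalg.solve(info, score)
    return theta

rng = np.random.default_rng(1)
day = np.arange(1461)
x = np.maximum(0.0, 3.0 * np.sin(2.0 * np.pi * day / 365.25)
               + rng.normal(0.0, 2.0, day.size))
scores = np.array([1.0, 2.0, 3.0, 4.0])           # group scores z_k
delta_true = np.array([0.0, 0.1, -0.2, -0.3])     # illustrative values only
lam_true = -0.02
mu = 25.0 * np.exp(delta_true[None, :] + 0.03 * x[:, None]
                   + lam_true * (scores - scores[0])[None, :] * x[:, None])
counts = rng.poisson(mu)                          # deaths by day and modifier level

theta = fit_constrained_polytomous(x, counts, scores)
lam_hat = theta[-1]
print("delta_hat:", theta[:-1], "lam_hat:", lam_hat)
```

The single trend parameter replaces the K − 1 separate interaction parameters of the unconstrained model, which is the source of any gain in precision when the linear trend assumption holds.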
Extension to Multiple Time Series
It is not uncommon to study multiple time series (say from several cities) simultaneously. 9,10 Usually, the main focus of such studies is the modification of exposure effects by factors measured at a city level (eg, average SES). This type of modification is not amenable to study using the case-only approach, because we can no longer assume that the distribution of exposure (weather or pollution) is independent of the modifier (SES) in the population.
However, sometimes there are modifier subgroups within cities, and a researcher may be interested in making an estimate of the exposure–modifier interaction that draws information from all the cities. This can be achieved by combining city-specific estimates (either by case-only or conventional methods) using meta-analytic techniques. It can also be achieved by including deaths from all cities in a case-only logistic regression stratified by city (ie, in which an indicator for city is included).
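One common meta-analytic technique for combining city-specific estimates is inverse-variance (fixed-effect) weighting. The sketch below assumes the interaction is the same in every city. The function name and the estimates are hypothetical, and a random-effects method may be preferable where heterogeneity between cities is expected.

```python
import math

def pool_fixed_effect(estimates, std_errors):
    """Inverse-variance weighted mean of city-specific interaction estimates."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Hypothetical lambda estimates and standard errors from two cities.
pooled, pooled_se = pool_fixed_effect([0.1, 0.3], [0.1, 0.1])
print(pooled, pooled_se)
```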
EXAMPLE: HIGH TEMPERATURE AND SOCIO-ECONOMIC STATUS IN SAO PAULO
Daily mortality (in persons age 65 years and above) and temperature measurements were obtained for the period 1991-1994 (1,461 days). We were interested in the modification of the effects of high temperature by the SES of area of residence (58 areas), classified in quartiles of areas. 3 High temperature was considered as a numerical variable, defined as the number of degrees that the 2-day mean temperature (index and previous day) rose above 20°C, and zero if this mean was below 20°C. We estimated this by conventional and case-only analysis:
1. By conventional time-series regression (model 1) with a set of 3 indicator variables for SES. Potentially confounding variables were particulate air pollution (PM10) the day before, humidity, holidays, cold temperature in the previous week (degrees below 20°C), and day of the week. Long-term changes over time were modeled as smoothing splines with 7 degrees of freedom per year (in Stata, convergence tolerance 10^−8).
2. By case-only methods as described above. Because the potential modifier SES was polytomous (4 levels), we used a polytomous logistic regression model 3 with SES as outcome and high temperature as explanatory variable. The constrained polytomous model 4 was used to estimate the trend across SES groups, by scoring the groups zk = 1, 2, 3, 4.
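The high-temperature variable used in both analyses can be computed as in the following sketch. The function name and the temperatures are illustrative. The first day of the series is dropped because it has no previous day.

```python
def heat_excess(daily_mean_temps, threshold=20.0):
    """Degrees by which the 2-day mean (index and previous day) exceeds the threshold."""
    excess = []
    for previous, today in zip(daily_mean_temps[:-1], daily_mean_temps[1:]):
        two_day_mean = (previous + today) / 2.0
        excess.append(max(0.0, two_day_mean - threshold))
    return excess

# Hypothetical daily mean temperatures in degrees Celsius.
heat = heat_excess([18.0, 19.0, 24.0, 26.0])
print(heat)
```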
Key results are shown in Table 2. The interaction parameters (λk and the trend λ) and their standard errors, as estimated by the two methods, are very similar. None of the individual parameters were much larger than their standard errors, although the point estimate indicated a lower heat effect in the highest SES group. The middle column (βk) gives in its first row the baseline main effect of high temperatures (% increment in log rate per degree), which is that in the first SES group; ie, β1 = β from model 1. The heat effect in each of the other groups is derived from this baseline and the interaction term in the first column (βk = β + λk). These estimates are only possible from the conventional analysis.
The key assumption of independence of time-varying factors and time-fixed modifiers is more secure than is the analogous assumption for gene–environment interactions. However, there may be situations in which the assumption is violated. For example, persons of high SES might move out of the city in times of heat, and their deaths might be uncounted. This would cause an association between proportion in the high SES group (the modifier) in the population at risk and temperature, thus invalidating the independence assumption, and causing a spurious reduction in mortality during heat among high SES persons.
If the deaths among migrating high-SES persons were counted, but not affected by heat because they had moved to cooler places, then this would also cause an apparent reduced effect of heat on mortality in that group. This is not a violation of the assumption of independence, as the observed temperature series is not associated with the proportion in the high SES group, which does not change appreciably over time. We could say that the high SES group is truly less affected by heat in the city, the mechanism being its migration to cooler places. The true temperature series, however, is different in the high SES group, so true temperature is associated with high SES; we could say that the reduced effect of heat in high SES persons is an artifact due to bias. We suggest below that this can better be viewed as a measurement error issue—temperature is not correctly recorded for high SES persons.
A sufficient (though not necessary) condition for the independence assumption is that the distribution of the modifier over persons at risk does not change over time. Plausibility arguments may indicate whether this is likely to be approximately met. If not, it might be possible to test the independence assumption if information is available on variation in time of z in the population (eg, variations over time in proportion in the high SES group).
The validity of the case-only approach does not require assumptions about other time-varying factors w, which are not restricted in model 1. Recall that these risk factors for daily mortality are specified in the conventional model 1, but not specified in the case-only model 2 or 3, because their effects “cancel out” in the case-only odds ratios. They may be correlated with the factor of interest x, and can in particular include terms representing interactions between x and other time-varying factors.
Neither the case-only nor the conventional model 1 requires specification of time-fixed factors other than z. In the conventional model, the main effects of such factors are of no interest in the absence of denominators, and they do not confound effects of time-varying factors. Furthermore, there seems no reason to expect such factors to confound the interaction of interest, even if they were associated with the time-fixed factor of interest (or included interaction terms with it). Robustness of the case-only approach to the presence of unspecified time-fixed risk factors is suggested by analogy with the time-varying effects, and can also be seen by noting that the effect of time-fixed factors would apply equally to all days; ie, in Table 1, both rows in each column would be affected equally, so odds ratios would not change.
There are some modeling difficulties that the case-only approach does not simplify. It is critical to the argument for the validity of this approach that there are no other interactions in model 1 of time-varying factors with time-fixed factors. These would not “cancel out” in Table 1 and, hence, could confound the interaction of interest. The following two types of such interactions would almost inevitably confound:
a. Interaction of the putative modifier under investigation with other time-varying factors. (In the example, SES modifies, say, the pollution effect.)
b. Interaction of the time-varying factor of interest with another time-fixed variable. (In the example, the temperature effect is modified by, say, housing density.)
These types of interactions would also confound in the conventional approach. An alternative conventional approach that is robust to interactions of type a is to fit completely separate models of form 1 to each modifier group (SES = 1, 2, 3, 4 in the example), and compare the coefficients for x (heat in the example). This approach is equivalent to fitting an interaction between SES and each term in model 1, thus allowing the possibility that the effects of risk factors other than hot temperature, including those captured in the time smooth, are modified by SES. The robustness of this model comes at the cost of reduced power and precision if these other interactions do not in fact exist. Alternatively, explicit interaction terms could be added to the model. This could also incorporate interactions of type b.
Confounding interactions of type a could also be modeled in the case-only approach. For example, air pollution could be entered as an additional regressor in the case-only model 2 or 3. Inclusion of an annual sine-cosine pair might be sufficient to control for a seasonal pattern not related to temperature that might plausibly be modified by SES. However, details have not been worked out, and incorporation of interactions of type b seems less obvious.
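A regressor row for such an extended case-only model might be assembled as in the sketch below. This shows only the construction of the covariates (intercept, heat, air pollution and an annual sine-cosine pair), since the details of the approach have not been worked out. The function names, the 365.25-day period and the values are assumptions made for illustration.

```python
import math

def annual_harmonic(day_index, period=365.25):
    """Sine-cosine pair for a smooth seasonal pattern with one cycle per year."""
    angle = 2.0 * math.pi * day_index / period
    return math.sin(angle), math.cos(angle)

def case_only_design_row(heat, pm10, day_index):
    """Regressors for a logistic regression of the modifier on time-varying factors."""
    sin_term, cos_term = annual_harmonic(day_index)
    return [1.0, heat, pm10, sin_term, cos_term]

# Hypothetical day: 2.5 degrees of heat excess, PM10 of 60, first day of the series.
row = case_only_design_row(2.5, 60.0, 0)
print(row)
```

In such a model the coefficient of heat would still estimate the interaction of interest, while the coefficients of pollution and the seasonal terms would absorb any modification of those effects by the modifier.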
In conclusion, the case-only approach may ignore risk factors varying over time, provided their effect is not modified by the time-fixed variable of interest (or correlates of it). Also, time-fixed risk factors may be ignored, provided they do not modify the effect of the time-varying factor of interest or its correlates.
If conditions for the validity of the case-only approach are met, is it useful? We noted above that the main motivation is reduced modeling complexity and dependence on model assumptions. Concerns about such assumptions have been highlighted by recognition of convergence and inference problems of generalized additive models in the time-series context. 11 Analysis of modification of effects in a manner not dependent on those assumptions can provide reassurance that conclusions are not sensitive to them. However, the inability of the case-only approach to provide estimates of the main effects would lead most investigators to consider it as a supplement to rather than as a replacement of conventional methods.
In the Sao Paulo data of the example, conventional analyses in fact showed rather little sensitivity of the estimate of heat–SES interaction to model specification. If all time-varying factors other than heat were omitted, point estimates of heat–SES interactions changed little (eg, −1.18 vs. −1.11 for category 4 vs. 1). It may be that the conditions for validity of the case-only approach also limit confounding of the interaction in the conventional model. However, the resulting model fitted very poorly. Scale overdispersion was 1.37 and residuals were highly autocorrelated, both sources of concern for most analysts. If a simple correction was made to standard errors for overdispersion, the standard error of λ was higher in the reduced conventional model (0.086) than in the case-only model (0.072).
A further potential advantage of the case-only approach is its practical simplification of data analysis. In the example described, analysis with conventional methods was not particularly complex, but in multi-city studies, computational complexity of conventional methods has led investigators to carry out two-step analyses, with city-specific time-series analyses followed by “meta-analytic” combination of effect parameters from each city. 9 Investigation of modification by factors varying within-city would add further complexity. Case-only analysis would be computationally much simpler (many fewer parameters), and could probably be carried out without the need to use two steps. The case-crossover approach provides another alternative to the conventional analysis with similar computational simplification. 12,13 Unlike the case-only approach, it has the ability to estimate main effects, but it has some problems of its own. 14
It seems likely that the assumption that modifiers are fixed in time could be relaxed to allow modifiers that change in time much more slowly than does the exposure of interest, by stratifying by time periods. Examples are age, chronic disease, and some long-term treatments. Details, however, remain to be worked out.
There are limitations that affect case-only and conventional approaches equally. For example, errors in measuring the principal exposure may distort patterns of effect modification: the lower heat effect in high SES areas in Sao Paulo might result from temperature in these areas in fact being lower than the central meteorological station measurements—an issue similar to that discussed above in the context of the independence assumption. Also, in both models the interaction measures departure from the multiplicative model, rather than the additive model, which is arguably more fundamental. However, given the usually small rate ratios involved, this distinction typically will be minor.
In conclusion, the case-only approach provides a method for analyses of effect modification in time-series regressions with fewer assumptions than the conventional approach. Furthermore, it provides computational simplification that is potentially useful in complex data sets for which conventional methods are problematic.
Nelson Gouveia provided the data from Sao Paulo, and Sam Pattenden made helpful comments.
© 2003 Lippincott Williams & Wilkins, Inc.