From the (1) Department of Biostatistics and (2) Department of Environmental Health, University of Washington, Seattle, Washington.
Address correspondence to: Lianne Sheppard, Box 357232, Department of Biostatistics, Seattle, WA 98195-7232; Sheppard@u.washington.edu
Editors’ note: Another invited commentary on this topic appears on page 11.
In this issue of Epidemiology, Ramsay et al. 1 discuss the effect of concurvity on estimation of air pollution health effects. Their basic message is simple: the S-PLUS generalized additive model (GAM) implementation (the gam function), and possibly others, moderately underestimates standard errors in air pollution time series studies. The resulting inflation in type I error is about 23%. This result is worth knowing, but it is not catastrophic. By comparison, it is much smaller than the inflation that can occur when two independent comparisons are made but only the more significant one is reported (in which case the inflation is 95%).
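The two inflation figures can be checked with a few lines of arithmetic. The sketch below assumes a nominal 5% test level, which the text does not state but which is consistent with the 95% figure; the function names are illustrative only.

```python
def best_of_k_error(alpha, k):
    """Type I error rate when k independent tests are run at level alpha
    and only the most significant one is reported."""
    return 1.0 - (1.0 - alpha) ** k


def relative_inflation(actual, nominal):
    """Inflation of the actual error rate relative to the nominal rate."""
    return actual / nominal - 1.0


NOMINAL = 0.05  # assumed nominal level (not stated in the text)

# Two independent comparisons, reporting only the more significant one.
best_of_two = best_of_k_error(NOMINAL, 2)
print("actual rate:", best_of_two)
print("relative inflation:", relative_inflation(best_of_two, NOMINAL))

# A 23% inflation, by contrast, moves a nominal 5% test to about 6.15%.
print("rate after 23% inflation:", NOMINAL * 1.23)
```

For two comparisons the relative inflation works out to exactly one minus the nominal level, which is why a 5% test gives 95%.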
The mathematical reasons for the bias are complicated. Ramsay et al. 1 provide a useful analogy to collinearity in regression, but a precise understanding of what goes wrong requires knowledge of the underlying mathematics. In this case the problem is with the uncertainty in the shape of the smooth seasonal function included to control for temporal confounding. The approximation used to compute the standard errors ignores the effect of correlations between the fitted smooth function and the temperature or pollution effects. Air pollution time series models are unusual in that pollution is largely uncorrelated with time but very strongly correlated with the fitted smooth function of time, and this causes the bias to be noticeable.
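The collinearity analogy can be made concrete in ordinary linear regression. The sketch below is not the GAM calculation: it assumes two covariates, x (standing in for pollution) and z (standing in for the fitted seasonal term), with an invented correlation of 0.9, and compares a standard error that ignores the correlation with the full least-squares standard error and with the spread seen in simulation.

```python
import math
import random
import statistics


def centered(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def se_comparison(r, n=200, reps=500, sigma=1.0, seed=1):
    """Compare naive and full standard errors for the coefficient of x."""
    rng = random.Random(seed)
    # Fixed design: z is built to have correlation of roughly r with x.
    x = centered([rng.gauss(0, 1) for _ in range(n)])
    z = centered([r * xi + math.sqrt(1 - r * r) * rng.gauss(0, 1) for xi in x])
    sxx = sum(a * a for a in x)
    szz = sum(b * b for b in z)
    sxz = sum(a * b for a, b in zip(x, z))
    det = sxx * szz - sxz ** 2

    naive_se = sigma / math.sqrt(sxx)       # pretends z is unrelated to x
    full_se = sigma * math.sqrt(szz / det)  # accounts for the correlation

    # Simulate outcomes (true x effect 0, true z effect 1) and refit by
    # least squares to see how much the x coefficient actually varies.
    estimates = []
    for _ in range(reps):
        y = [zi + rng.gauss(0, sigma) for zi in z]
        sxy = sum(a * c for a, c in zip(x, y))
        szy = sum(b * c for b, c in zip(z, y))
        estimates.append((szz * sxy - sxz * szy) / det)

    return {
        "sample_r": sxz / math.sqrt(sxx * szz),
        "naive_se": naive_se,
        "full_se": full_se,
        "empirical_sd": statistics.stdev(estimates),
    }


result = se_comparison(r=0.9)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

In this linear setting the naive standard error is smaller than the full one by the factor sqrt(1 - r^2), where r is the sample correlation, and the simulated spread tracks the full value.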
Results in Ramsay et al., 1 combined with the discovery of premature convergence of the gam function back-fitting algorithm under default convergence criteria in S-PLUS version 3.4, 2 have shaken scientists and policy makers, and warranted coverage in the New York Times 2a and elsewhere. In air pollution epidemiology much of the recent evidence comes from time series studies using the gam function. We address two important questions: Why are the problems with the gam function appearing and how serious are they?
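Why a loose stopping rule matters when covariates are strongly correlated can be seen in a toy version of back-fitting. The sketch below is not the S-PLUS algorithm or its convergence criterion: it assumes two linear terms in place of smooth functions, noise-free data, a stopping rule based on the absolute change in the coefficients, and invented values for the correlation (0.99) and the two effects (0.01 and 1.0).

```python
import math


def backfit(x1, x2, y, tol, max_iter=100_000):
    """Alternately refit each term to the partial residuals of the other.

    Stops when neither coefficient changes by more than tol in one cycle.
    Returns (b1, b2, cycles_used).
    """
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    b1 = b2 = 0.0
    for iteration in range(1, max_iter + 1):
        new_b1 = sum(a * (c - b2 * b) for a, b, c in zip(x1, x2, y)) / s11
        new_b2 = sum(b * (c - new_b1 * a) for a, b, c in zip(x1, x2, y)) / s22
        change = max(abs(new_b1 - b1), abs(new_b2 - b2))
        b1, b2 = new_b1, new_b2
        if change < tol:
            break
    return b1, b2, iteration


# Two mean-zero covariates with correlation r, built from orthogonal vectors.
r = 0.99
u = [1.0, -1.0, 1.0, -1.0]
v = [1.0, 1.0, -1.0, -1.0]
x1 = u                                                            # "pollution"
x2 = [r * a + math.sqrt(1 - r * r) * b for a, b in zip(u, v)]     # "season"

TRUE_B1, TRUE_B2 = 0.01, 1.0  # invented: small effect next to a large one
y = [TRUE_B1 * a + TRUE_B2 * b for a, b in zip(x1, x2)]

loose = backfit(x1, x2, y, tol=1e-3)
strict = backfit(x1, x2, y, tol=1e-10)
print("loose  tolerance: b1 = %.5f after %d cycles" % (loose[0], loose[2]))
print("strict tolerance: b1 = %.5f after %d cycles" % (strict[0], strict[2]))
```

In this toy setup the loose rule stops while the small coefficient is still well away from its true value, even though successive cycles barely differ; the strict rule needs many more cycles but recovers it.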
The potential for incorrect standard errors in GAMs is clearly implicit in textbook descriptions. For instance, Hastie and Tibshirani, 3 who developed both the model and the S-PLUS software, noted (at that time) that direct computation of the matrix needed for the standard errors was “formidable”. The S-PLUS gam function implementation was described in 1992 as “admittedly ad hoc” by Hastie. 4 In both texts this information is embedded in a dense technical description of the method. Over the past decade these and other researchers have found ways to fit GAMs without the approximations that Ramsay et al. criticize. Pseudosplines 5 and P-splines 6 are examples of techniques with similar characteristics to smoothing splines that require much less computation for standard errors. However, the new methods have not been implemented in widely used software, and so have not reached epidemiologists (although implementations are often available using S). One of the difficulties comes from inadequate attention to the details of statistical programming. This is a direct consequence of the strong disincentive in academic statistics for researchers to take statistical computing seriously.
The recently published Statistical Rules of Thumb 7 (p. 185) lists as Rule 8.8 “Any basic statistical computing package will do.” Although this advice may be useful in general, an important lesson from the recent controversy over GAM is that the word “basic” is important: the rule is useful for simple and standard methods. Logistic regression, for example, is both simple and standard; the GAM is neither. There are two reasons to be cautious about a statistical technique: when there are few implementations of it, and when it has many options or tuning parameters under control of the analyst. When a method is not widely implemented, there will have been less testing of the software. When a method has many tuning parameters, it is harder to test and requires more expertise from users.
So what can epidemiologists do? These models cannot simply be abandoned as they address scientifically relevant questions with inherent methodologic challenges. The counsel of perfection would be to engage an expert in statistical computing. This is often impossible because there are few experts with in-depth knowledge of both statistical theory and programming. In the absence of expert assistance, epidemiologists should identify whether the method is standard. It is not sufficient that there is a single widely used implementation of a method or that it has become standard in one narrow field. This narrow view of accepted practice is one of the reasons why, well before their limitations were uncovered, GAMs became so widely used in air pollution epidemiology.
Applications of novel methods always require careful scrutiny. Strategies for checking software without a programming expert are limited. We advise that, first, epidemiologists should repeat the results with different software. Clearly, if results change, more investigation is needed. Second, within any software package, identify all the possible tuning parameters, note their defaults, and look at the sensitivity of the results to varying them. Finally, look at the examples in statistical papers that propose or evaluate the method. When data like yours have been examined in the statistical literature, then it is more likely that relevant technical issues with the software will have been evaluated.
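The second check, varying each tuning parameter away from its default, can be organized as a small loop. The sketch below is only a scaffold: `one_at_a_time` is a hypothetical helper, and `toy_fit` is a made-up stand-in for a real model-fitting call, with invented parameter names and an invented response, so its numbers carry no scientific meaning.

```python
def one_at_a_time(fit, defaults, alternatives):
    """Refit with each tuning parameter changed in turn, others at default.

    fit          -- callable taking the tuning parameters as keyword arguments
                    and returning the estimate of interest
    defaults     -- dict of parameter name -> default value
    alternatives -- dict of parameter name -> list of values to try
    Returns the baseline estimate and a list of
    (parameter, value, estimate, change_from_baseline) rows.
    """
    baseline = fit(**defaults)
    rows = []
    for name, values in alternatives.items():
        for value in values:
            settings = {**defaults, name: value}
            estimate = fit(**settings)
            rows.append((name, value, estimate, estimate - baseline))
    return baseline, rows


def toy_fit(tolerance, df_per_year):
    # Stand-in for a real analysis; replace with a call to the actual software.
    return 0.010 + 20.0 * tolerance + 0.0005 * (7 - df_per_year)


defaults = {"tolerance": 1e-3, "df_per_year": 7}
alternatives = {"tolerance": [1e-6, 1e-9], "df_per_year": [4, 10]}

baseline, rows = one_at_a_time(toy_fit, defaults, alternatives)
print("baseline estimate:", baseline)
for name, value, estimate, change in rows:
    print(f"{name}={value}: estimate {estimate:.5f} (change {change:+.5f})")
```

Recording the defaults explicitly, rather than relying on whatever the software supplies, also makes it easier to repeat the same settings in a second package.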
GAMs are appealing for air pollution time series studies because they are flexible and don't require strong assumptions about the shape of the seasonal pattern. Within the GAM context, researchers can use generalized linear models with natural splines. For a fixed number and location of knots, natural spline models are parametric. This avoids reliance on the iteration and tuning parameters needed in the GAM back-fitting algorithm. Many statisticians view GAM primarily as a useful exploratory tool; once the exploratory analysis is complete, they prefer to replace the arbitrary smooth functions with parametric alternatives. It is an open research question whether natural splines in generalized linear models are a better choice than other GAMs for the air pollution application. However, researchers should be suspicious of any results that change meaningfully when the technical details of the smoothing change.
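The statement that natural spline models are parametric once the knots are fixed can be illustrated by writing out the basis. The sketch below uses the truncated-power form of the natural cubic spline basis, a standard textbook construction; the knot locations are invented, and other software may use a different but equivalent basis.

```python
def natural_spline_basis(x, knots):
    """Natural cubic spline basis values at x for K ordered knots.

    Returns K numbers: an intercept, a linear term, and K - 2 cubic terms
    constrained to be linear beyond the first and last knots.
    """
    def pos_cube(t):
        return t ** 3 if t > 0 else 0.0

    last = knots[-1]

    def d(k):
        return (pos_cube(x - knots[k]) - pos_cube(x - last)) / (last - knots[k])

    basis = [1.0, float(x)]
    d_penultimate = d(len(knots) - 2)
    for k in range(len(knots) - 2):
        basis.append(d(k) - d_penultimate)
    return basis


# Hypothetical knots on a time axis measured in years.
knots = [0.0, 1.0, 2.0, 3.0]
times = [0.25 * i for i in range(13)]  # 0, 0.25, ..., 3

# Each row is one observation; the columns are ordinary regression covariates.
design = [natural_spline_basis(t, knots) for t in times]
print(len(design), "rows,", len(design[0]), "columns")
```

Because the design matrix has a fixed, finite set of columns, the seasonal term can be fitted by any generalized linear model routine, and its standard errors come from the usual parametric formulas rather than from a back-fitting approximation.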
In practice the effects of technical details such as the one noted by Ramsay et al. 1 are likely to be smaller than the scientific uncertainty about how much smoothing is appropriate. Appropriate control for seasonal effects is clearly important because these are much larger than the air pollution effects of interest. Control of seasonal effects with smooth functions of time is an approximate approach to controlling for strong but unmeasured confounders, such as influenza epidemics and time-varying population behaviors. Many analyses are highly dependent upon the degree of control analysts invoke for this approximate confounder. There is no automated procedure that can be used to determine the appropriate degree of smoothing or which smoother is best. This problem is not unique to time series studies; similar challenges are faced in modeling the dependence structure in spatial epidemiology.
Estimation of very weak associations in the presence of measurement error and strong confounding is inherently challenging. In this situation, prudent epidemiologists should recognize that residual bias can dominate their results. Because the possible mechanisms of action and their latencies are uncertain, the biologically correct models are unknown. This model selection problem is exacerbated by the common practice of screening multiple analyses and then selectively reporting only a few important results. This practice can inflate the type I errors much more than 23% and enhances the likelihood of publication bias. Technical difficulties should be placed in the context of the more important biases from model selection in both relative risk estimates and standard errors. GAM software does not even attempt to account for these, but epidemiologists must.
About the Authors
THOMAS LUMLEY and LIANNE SHEPPARD are members of the National Research Center for Statistics and the Environment and the EPA Northwest Center for Particulate Matter and Health. They are both on the faculty of the Department of Biostatistics at the University of Washington. Dr. Sheppard is also a faculty member in the UW Department of Environmental Health and Dr. Lumley is a member of the R and Omega statistical computing projects.
1. Ramsay TO, Burnett RT, Krewski D. The effect of concurvity in generalized additive models linking mortality to ambient particulate matter. Epidemiology 2003; 14: 18–23.
2. Dominici F, McDermott A, Zeger SL, Samet JM. On the use of generalized additive models in time series of air pollution and health. Am J Epidemiol 2002; 156: 193–203.
2a. Revkin AC. Data revised on soot in air and deaths. NY Times 5 June 2002.
3. Hastie TJ, Tibshirani RJ. Generalized Additive Models. New York: Chapman and Hall, 1990.
4. Hastie TJ. Generalized additive models. In: Chambers JM, Hastie TJ, eds. Statistical Models in S. Pacific Grove, Calif: Wadsworth & Brooks/Cole, 1992; 249–308.
5. Hastie TJ. Pseudosplines. J R Stat Soc Ser B Methodol 1996; 58: 379–396.
6. Eilers PHC, Marx BD. Flexible smoothing with B-splines and penalties. Stat Sci 1996; 11: 89–121.
7. van Belle G. Statistical Rules of Thumb. New York: Wiley, 2002.
© 2003 Lippincott Williams & Wilkins, Inc.