In recent years, a number of studies have applied generalized additive models to time series data to estimate associations between exposure to air pollution and cardiorespiratory morbidity and mortality. If concurvity, the nonparametric analogue of multicollinearity, is present in the data, statistical software such as S-plus can seriously underestimate the variance of fitted model parameters, leading to significance tests with inflated type 1 error. This paper uses computer simulation and analyses of actual epidemiologic data to explore this underestimation of standard errors. We provide a method for assessing concurvity in data and an alternate class of models that is unaffected by concurvity. We argue that some degree of concurvity is likely to be present in all epidemiologic time series datasets and we explore through the use of meta-analysis the possible impact of concurvity on the existing body of work relating ambient levels of sulfate particles to mortality.
From 1Health Canada,
2McLaughlin Centre for Population Health Risk Assessment, and
3University of Ottawa, Ottawa, Canada.
Address correspondence to: Tim Ramsay, McLaughlin Centre for Population Health Risk Assessment, Room 321, One Stewart St., Ottawa, K1N 6N5, Canada; firstname.lastname@example.org
This research was supported in part by the Toxic Substances Research Initiative of Health Canada, the National Science and Engineering Research Council of Canada, and the Health Effects Institute in Cambridge, Massachusetts.
D. Krewski is the NSERC/SSHRC/McLaughlin Chair in Population Health Risk Assessment at the University of Ottawa.
Submitted 26 March 2002; final version accepted 27 September 2002.
Editors’ note: Invited commentaries on this article appear on pages 11 and 13.
Click on the links below to access all the ArticlePlus for this article.
Please note that ArticlePlus files may launch a viewer application outside of your web browser.
Epidemiologic studies have consistently demonstrated associations between exposure to urban air pollution and cardiorespiratory morbidity and mortality. Associations between hospital admissions and short-term increases in the levels of particulate and gaseous pollutants have been explored using time series (Burnett et al.) 1 and case-crossover studies (Lin et al.). 2 Associations between mortality and long-term exposure to fine particulate air pollution have been demonstrated in cohort mortality studies conducted by Dockery et al. 3 and Pope et al. 4
Time series analysts have long recognized the importance of including temporal trends and serial autocorrelation in time series models (Burnett and Krewski). 5 Time series analysis that compensates for both secular trends and autocorrelation in the residuals provides a means of establishing associations between short-term fluctuations in ambient air pollution levels and population health outcomes (Cakmak et al.). 6 Although serial autocorrelation can be explicitly included in time series models, nonparametric local smoothing methods based on generalized additive models (GAMs) have been used as an alternative analytic approach (Schwartz 7; Schwartz et al. 8). Such methods eliminate the need to specify a parametric form for secular trends and allow a greater degree of robustness against model misspecification.
The purpose of this paper is to examine the use of generalized additive models for the analysis of temporal data used in epidemiologic studies of the effects of ambient air pollution on population health. It has recently been noted elsewhere (Dominici et al.) 9 that using the default convergence criteria in S-Plus can also result in biased fitted linear parameters. We show that the variance estimation method currently used in commercial software can underestimate the variance of fitted linear parameters.
In the next section, we describe smoothers and generalized additive models and review the notion of concurvity, a nonparametric analogue of multicollinearity. The consequence of concurvity in the analysis of temporal air pollution data is then examined, using Monte Carlo analysis–based actual epidemiologic data. We show that the inability of the generalized additive model to detect concurvity can lead to misleading statistical inferences, notably an overstatement of the significance of associations between air pollution and health status. All of the simulations and calculations for this paper were executed with S-Plus Version 6. The implications of this observation for population health risk assessment are discussed in the concluding section.
Generalized Additive Models
The Generalized Additive Model
The GAM is a powerful and widely used tool that allows researchers to fit models without specifying parametric relations between the dependent and independent variables (Hastie and Tibshirani). 10 The general form of a GAM is MATHwhere g is the link function and the fi are smooth functions defining the relation between the transformed mean response μ and the independent variables Xi. The functions fi can be chosen to be either parametric or nonparametric and thus allow for a great deal of flexibility in constructing models with nonlinear effects. The gam function in the statistical software package S-Plus makes it easy to fit GAM models with a variety of families of predictor functions fi and different link functions g.
The generalized linear model, or GLM (McCullagh and Nelder), 11 is a special case of the GAM in which none of the functions fi is nonparametric. Linear regression is a special case of the GLM in which the link function is taken to be the identity function. The GAM is therefore a generalization of linear regression and most of the familiar diagnostic tests for fitted linear regression models have analogues applicable to fitted GAMs. One important exception to this rule is concurvity, the nonparametric analogue of multicollinearity. At present, S-Plus provides no diagnostic tools for assessing the impact of concurvity on a fitted GAM.
Scatterplot smoothing functions, commonly referred to as smoothers, are central to GAMs. A smoother is a mathematic technique for approximating an observed variable Y by a smooth function of one (or several) independent variable(s) X. Some examples of commonly used smoothers are smoothing splines, regression splines, LOESS functions (also known as locally estimated polynomial regression) and kernel smoothers. These methods are called nonparametric because they make no parametric assumption about the shape of the function being estimated. Instead, they assume only that the function Y (X) is smooth, or that the value of the variable Y at similar values of X will be similar.
The most common smoothers, including all those mentioned here, are linear smoothers. A linear smoother (see Hastie and Tibshirani) 10p.44 is a smoother in which the smoothed values Ŷ can be expressed in the form Ŷ =AX for some matrix A. In other words, the value of the estimated smooth function at a given point X is always a linear combination of all the observed values of Y. Because different smoothers use different methods for estimating the value of Y at each observed value of X, one can expect different methods to give somewhat different estimates of the smooth function.
Each smoother has a parameter that determines how smooth the resulting function will be. For LOESS functions, that parameter is called span. For smoothing splines and natural splines, the degree of smoothing can be specified through the degrees of freedom parameter. Figure 1 shows some simulated data smoothed by LOESS functions with three different spans. The solid line is the most smooth; the dashed line is the least smooth. In general, the amount of smoothing selected will have more impact on the final function than the type of smoother chosen.
Concurvity, like multicollinearity, can be defined in terms of the eigenvalues of a certain projection matrix associated with the model (Hastie and Tibshirani). 10p.123 This definition does not easily lead to an intuitive understanding of what concurvity means in a practical modeling situation, however, and concurvity is easier to understand by analogy to multicollinearity.
Consider the situation in which one is using linear regression to model the expected value of a variable Y as a function of p variables X1,…, Xp. We say that multicollinearity is present in the data if some subset of the regressors Xi is highly correlated (Montgomery and Peck). 12p.165 More formally, multicollinearity occurs when one of those variables, say Xp, can be approximated by a linear combination of the remaining variables:MATH
If the equality in Eq 2 is exact, then for any constant λ, MATHand the model is nonidentifiable. If, as is more likely to be the case, the equality is only approximate, then the model is identifiable but the parameter estimates associated with the multicollinear variables become both highly unstable and highly correlated.
Concurvity is a nonparametric extension of the concept of multicollinearity. Suppose that one is modeling the expected value of Y as in Eq 1, with each function fi specified to be in some set of functions Hi. Some common choices for the set Hi are linear functions, higher-order polynomials, LOESS (local polynomial regression) functions or regression splines. Concurvity occurs when a function of one of the variables, say Xp, can be approximated by a linear combinations of functions of the other variables with the function hi lying in the set Hi, with MATH
Again, if the equality in Eq 3 is exact, the model is nonidentifiable because the equality MATHholds for any constant λ. We formalize this description of concurvity in the definition below. (Note that this definition is somewhat different from that given by Hastie and Tibshirani, 10 and is equivalent to what they call approximate concurvity.)
Suppose one wishes to fit the GAM model in Eq 1, with fi ∈Hi for some specified function spaces,MATH from observations Y and MATH. We say that there is concurvity in the data if there exists a set of functions MATH such that gi(Xi) is highly correlated with MATH.
As is the case with linear regression (Montgomery and Peck), 12p.190 the parameter estimates of a fitted GAM are highly unstable if there is concurvity in the data. Unfortunately, because of the way variances are estimated in the S-Plus gam function, the variance estimates do not reflect this instability. In other words, the variance estimates produced by the gam function will be biased downwards whenever there is concurvity in the independent variables. This will result in confidence intervals on linear parameters being too narrow and in misleading P-values arising from hypothesis tests based on variance estimates.
Assuming that the smoothers used in a GAM are linear, as are both the spline and LOESS smoothers, there is an exact expression for the asymptotic standard error of the predicted parameters. Unfortunately, the matrices necessary for estimating this expression are not produced by the back-fitting algorithm used to fit GAMs. Furthermore, the algorithm proposed by Hastie and Tibshirani 10p.127 for computing these matrices is computationally expensive for large datasets. Instead, the gam function approximates the standard errors with an admittedly ad hoc method. The technical details of this approximation, also proposed by Hastie and Tibshirani, 10 are discussed in Chambers and Hastie 13 (sections 7.4.3 and 7.4.4). As the simulations of the next section will show, this approximation works well only when there is little or no concurvity present in the data.
An alternate strategy for estimating the variance is to use a Monte Carlo method such as the parametric bootstrap. Our simulations suggest that the variance estimates produced by the parametric bootstrap, although less biased than those produced by S-Plus, are still biased downwards. A better solution is to abandon the nonparametric model altogether by using a parametric smoother such as natural splines in place of the nonparametric smoother. This approach, which effectively turns the GAM into a GLM and thus produces correct variance estimates, is discussed further in Dominici et al. 9
Consequence of Concurvity
Example 1: Real Data
In this section, we present the results of a simulation to illustrate the variance inflation that can arise from concurvity. These results will show that because of the inability of the gam function to detect concurvity, significance tests based on gam variance estimates can lead to understated P-values and therefore to inflated type I errors. The simulation is based on an analysis done by Burnett et al. 14 relating nonaccidental deaths in Toronto to PM2.5. Before describing the simulation, we will describe the original analysis.
The dependent variable in this study was the daily count of nonaccidental deaths (dt) in Toronto for the years of 1980 through 1994. The independent variables were time (t), daily minimum temperature (mint), daily dewpoint temperature (dewt), and a daily measure of suspended particulate matter with diameter less than or equal to 2.5 microns (PM2.5t). These data were fitted with a Poisson GAM model of the form MATHwhere t = 1,…,5479 represents the day and f, g, and h were LOESS functions. The error term, εi, is assumed to have mean 0 and variance φμt. The overdispersion parameter φ was estimated from the data.
This model was obtained by Burnett et al. 14 through stepwise variable selection from a larger set of potential predictor variables. The linear effect of suspended particulate matter on mortality, β̂, was estimated by the S-Plus gam function to be 1.13 × 10−3 with a standard error of 2.66 × 10−4. Based on these estimates, it was concluded that the effect of PM2.5 was statistically substantial.
Upon reanalysis, however, we noticed some concurvity present in the data. The concurvity can be seen from the fact that the correlation between PM2.5 and the predicted values obtained from fitting the model E (PM2.5t) =f (t) +g (mint) +h (dewt) is 0.59. To test the impact of this concurvity, we executed the following simulation 1000 times. First, we simulated overdispersed Poisson data of the form MATHwhere all the parameters were obtained by fitting to the real data. We then fit to this simulated data and recorded both the slope estimate and the estimated standard error of the slope.
Using the convergence criteria suggested by Dominici et al., 9 the average estimated standard error of the slope parameter given by S-Plus was 2.68 × 10−4. By contrast, the standard deviation of the thousand estimated slopes was 3.13 × 10−4. In other words, the true standard error is about 16% higher than the S-Plus output would lead us to believe. This bias is a result of the concurvity in the data. A standard normal test of the true null hypotheses that β = 1.13 × 10−3, tested at the 0.05 level of significance, was rejected 176 times out of 1,000. This represents a type I error rate of about 18%, rather than the 5% that one should observe with a test at the 0.05 level of significance. Clearly, it is unwise to conduct significance tests based on the S-Plus error estimate for this type of model.
Using the default convergence criteria built into S-Plus rather than the stricter criteria introduced by Dominici et al. 9 resulted in an even larger bias. When the simulation just described was run with the default criteria, the empiric estimate of standard error was 3.30 × 10−4, or 23% higher than the S-Plus estimate. The S-Plus estimate of standard error did not change. The larger empiric error was presumably a result of the failure of the back-fitting algorithm to converge to the maximum likelihood estimate, thus adding an extra source of variation to the parameter estimates. Note that, because the original study used the default convergence criteria, 23% is probably a more realistic estimate of the standard-error bias in the Burnett et al. 14 study than 18%.
Example 2: Simulated Data
One can easily explore the relation between concurvity and biased standard error estimates further by generating data with various degrees of concurvity. In this section, we provide a simple example of how to do this in S-Plus, together with the results of a simulation based on the example. The example is based on the Poisson model:MATH
To create concurvity, we let both X1 and X2 be uniformly distributed on the interval [0,π], and we let X3 = 0.1 sin(X1) + 0.1 sin(X2) + τ. The error term τ is normally distributed with mean 0 and standard deviation 0.05. This data can then be fit with a Poisson GAM using LOESS functions for X1 and X2 and a linear function of X3. The S-Plus code for a simulation of this model is available with the electronic version of this article at www.epidem.com. When we ran this simulation, we found the average S-Plus standard error estimate to be 1.08 whereas the sample standard deviation of the hundred slope estimates was 1.37. This discrepancy can be made larger by making the standard deviation of the error τ smaller (ie, by making the concurvity worse).
Concurvity in Time Series Data
In the GAM models used by researchers to explore the effect of air pollution on population health, a Poisson model is typically used with time series data to explain log counts of an outcome such as mortality or hospital admissions as a function of air pollution plus a sum of functions of other confounding variables. One of the confounding variables is usually time, which is used to control for seasonal and other temporal trends. Because many variables display temporal trends, it is very likely that at least some degree of concurvity will be present whenever time is used as an independent variable in a GAM. For example, it is reasonable to assume that some of the variation in any continuously varying air pollution variable can be explained by a smooth function of time.
The problem created by concurvity can be understood intuitively by thinking about temporal trends in the data. If the smooth function of time is left out of the model, the residuals tend to be autocorrelated (they display a temporal trend). This is thought to be because some unknown confounding variables, each with some temporal trend of its own, have been left out of the model. The smooth function of time is included to filter out the effect of these unknown variables. If another covariate, say PM10, displays a temporal trend and has a linear effect on log mortality, then the effect of that covariate will also have a temporal trend. The gam function is then given the rather ill-defined job of extracting the temporal trend from log mortality and apportioning some of that trend to time and some to PM10. It is precisely this ambiguity that leads to a larger variance in the estimates. The problem with the variance estimates will also occur with normally distributed data (as well as for any other distribution) and, as far as we know, with all other current software that fits nonparametric GAMs.
The general problem of assessing concurvity, which has been studied by Donnell et al. 15 and, to a certain degree, by Gu, 16 has not been adequately solved. The “semiparametric” GAMs (containing at least one linear parameter) used in epidemiologic investigations, however, have a particular structure that makes concurvity detection relatively straightforward. Suppose that one is fitting a model of the form MATHwith each function fi specified to belong to a set of functions Hi. Suppose further that one is only really interested in the value of the linear parameter β. The variance of β̂ will only be affected by concurvity if there exists a relation of the form MATHwhere the function gi is in the function space Hi and the variance of ε is small relative to the variance of X1. To assess the degree of concurvity in the data, one should fit Eq 5 and then compute the correlation between X1 and X̂1, the fitted values from Eq 5. Although it is difficult to say exactly what magnitude of correlation coefficient should be a cause for concern, our experience suggests that there is certainly reason to expect problems with concurvity if the correlation coefficient is greater than 0.5 (ie, R2 = 0.25).
If one does suspect a problem with concurvity, we recommend the use of a parametric smoother such as the natural spline (ns) in S-Plus. Substituting natural splines with the same degrees of freedom as a nonparametric smoother turns a nonparametric GAM into a fully parametric model that can fit similar trends to the GAM model. This is discussed further in Dominici et al., 9 where it is proposed that using natural splines rather than nonparametric smoothers also removes the problem of biased parameter estimates attributable to lack of convergence of the GAM-estimation algorithm. Given the fact that time series data based on air pollution are unlikely ever to be completely free of concurvity, the best course of action for analysts is certainly always to use fully parametric models. Nonparametric smoother-based models may, however, still be valuable for exploratory analyses. They may well be able to suggest adequate parametric forms. For example, it may well be the case that the effect of a certain variable might be described well by a simple quadratic function or, in the case of a periodic seasonal effect, by a sum of trigonometric terms. If this were the case, it would make sense to use the quadratic or trigonometric function rather than a natural spline in the parametric model.
Air Pollution and Mortality
We observed in the introduction that the epidemiologic literature contains a wealth of reports of a statistical association between air pollution and cardiorespiratory morbidity and mortality. Given that many of these results are based on analyses using generalized additive models and that most of these models are likely to be affected by concurvity, we conducted a meta-analysis to investigate the potential impact of inflated standard errors in these studies. We did this by reanalyzing a meta-analysis by Stieb et al., 17 which examined a collection of comparable epidemiologic studies on the relation between mortality and air pollution. In the reanalysis, we assumed that the pollution effect in every study had been underestimated.
Stieb et al. 17 identified 31 epidemiologic studies, conducted throughout the world, that compared daily variations in PM10 concentrations and daily numbers of nonaccidental deaths. The estimate of the log-relative risk, β̂i, in the daily mortality rate associated with a μg/m3 change in PM10 was recorded for each study in addition to the standard error of this estimate, v̂i. Stieb et al. 17 used a linear random effects model to pool the β̂i among studies, assuming that the true association varied at random. Specifically, it was assumed that the β̂i were normally distributed with mean β and variance θ + v̂i, with θ representing the true variation in the log-relative risks between studies. They estimated β to be 2.0 × 10−2 with a standard error of 3.0 × 10−3. Their estimate of θ was 1.0 × 10−4.
Many of these studies used nonparametric smoothers to model both time and weather variables. Therefore, the reported standard errors of the air pollution association with mortality may have been underestimated. Our analysis of the association between PM2.5 and mortality in Toronto, using the default convergence criteria provided by S-Plus, indicated that the true standard error was 23% greater than the estimated standard error. We repeated the random effects analysis of the 31 studies examining PM10 effects on mortality by inflating the standard error reported in these studies by 23%. The estimate of the pooled log-relative risk was then calculated to be 1.8 × 10−2 with a standard error of 3.3 × 10−3. The estimated variation in log-relative risks between studies, v̂, was 5.3 × 10−5.
The standard error of the mean effect was only 8% larger in the analysis that used inflated study-specific standard errors (23%) compared with the analysis that was based on the data reported in the literature. The estimate of the standard error of the mean effect is a function of the true variation in effects among studies and the estimation of the study-specific effect. If the study-specific estimation error is increased, then the estimate of the true variation among studies is reduced as in our example. Thus, the impact of underestimating the study-specific standard error will be mitigated somewhat through the meta-analysis.
The conclusions based on these two analyses are similar in that there is strong evidence of a statistical association between daily variations in PM10 concentrations and nonaccidental mortality. However, the estimated variation in estimates between studies was cut almost in half. Although the statistical evidence to suggest that there exists true variation in the PM10 association with mortality among studies is reduced from P = 0.000003 to P = 0.009116, one would still likely conclude that such variation is real based on both sets of analyses. Of course, this analysis depends on the assumption that the standard error of each of the excess risks was underestimated by 23%. Even though the true degree of underestimation will certainly not be uniformly equal to 23%, these results suggest that a meta-analysis based on correctly estimated excess risk would still show a statistically substantial PM10 effect.
1. Burnett RT, Brook J, Dann T, et al
. Association between particulate and gas phase components of urban air pollution and daily mortality in eight Canadian cities. Inhal Toxicol
2. Lin M, Chen Y, Burnett RT, Villeneuve P, Krewski D. The influence of ambient coarse particulate matter on hospitalization in children: case-crossover and time series analyses. Environ Health Perspect (in press).
3. Dockery DW, Pope CA, Xu X, et al
. An association between air pollution and mortality in six U.S. cities. N Engl J Med 1993; 329: 1753–1759.
4. Pope CA, Thun MJ, Namboodiri MM, et al
. Particulate air pollution as a predictor of mortality in a prospective study of U.S. Adults. Am J Respir Crit Care Med 1995; 151: 669–674.
5. Burnett RT, Krewski D. Air pollution effects on hospital admission rates: a random effects modeling approach. Can J Stat 1994; 22: 441–458.
6. Cakmak S, Burnett RT, Krewski D. Adjusting for temporal variation in the analysis of parallel time series of health and environmental variables. J Exp Anal Environ Epid 1998; 8: 129–144.
7. Schwartz J. Nonparametric smoothing in the analysis of air pollution and respiratory illness. Can J Stat 1994; 22: 471–488.
8. Schwartz J. Air pollution and hospital admissions for heart disease in eight U.S. counties. Epidemiology 1999; 10: 17–22.
9. Dominici F, McDermott A, Zeger S, Samet J. On the use of generealized additive models in time-series studies of air pollution and health. Am J Epidemiol
10. Hastie T, Tibshirani R. Generalized Additive Models. London: Chapman and Hall, 1990.
11. McCullagh P, Nelder J. Generalized Linear Models. 2nd ed. New York: Chapman and Hall, 1989.
12. Montgomery DC, Peck EA. Introduction to Regression Analysis. 2nd ed. New York: John Wiley and Sons, 1992.
13. Chambers J, Hastie T, eds. Statistical Models in S. Pacific Grove, Calif: Wadsworth and Brooks, 1992.
14. Burnett RT, Dales R, Krewski D, Vincent R, Dann T, Brook JR. Associations between ambient particulate sulfate and admissions to Ontario hospitals for cardiac and respiratory diseases. Am J Epidemiol 1995; 142: 15–22.
15. Donnell DJ, Buja A, Stuetzle W. Analysis of additive dependencies and concurvities using smallest additive principal components. Ann Stat 1994; 22: 1635–1673.
16. Gu C. Diagnostics for nonparametric regression with additive terms. J Am Stat Soc 1992; 87: 1051–1058.
17. Stieb DM, Judek S, Burnett RT. Meta-analysis of time series studies of air pollution and mortality: effects of gases and particles and the influence of cause of death, age and season. J Air Waste Manage Assoc 2002; 52: 470–484.
This article has been cited
concurvity; generalized additive model; sulfate particles; air pollution; PM2.5; PM10
Supplemental Digital Content
© 2003 Lippincott Williams & Wilkins, Inc.