To the Editor:
Generalized estimating equations (GEEs) are popular tools for estimating associations in clustered data settings. The semiparametric nature of this approach makes it highly appealing because unbiased effect estimators can be obtained without knowing the true distribution of the data being modeled. For example, it is unnecessary to specify a parametric distribution or even the correct correlation structure within the data – the mean model parameter estimators are unbiased if the mean model is correctly specified. However, a working correlation that is close to the structure of the true data-generating mechanism provides greater efficiency than a poorly specified working correlation.1 Thus, it is tempting to employ some method of choosing the working correlation structure, potentially reducing standard errors and improving the power to detect an association between a covariate and the outcome.
To this end, several criteria for specifically selecting the working correlation structure (as opposed to selecting covariates in the mean model) have been proposed, including Pan’s seminal quasi-likelihood information criterion2 and variations thereof.3,4 Many of these criteria have been implemented in commonly used software such as SAS and Stata, which facilitates their use by data analysts, some of whom may not be fully aware of the drawbacks of the criteria.
While such information criteria have sound theoretical bases, their use can have unintended consequences if their application leads the analyst to choose an inappropriate working correlation structure for the chosen mean model. For instance, GEEs yield biased estimators of cross-sectional model parameters when the true data-generating mechanism relies on covariate history5 (such as when a “cross-sectional” model is being fit to data and the true underlying data-generating mechanism is not cross-sectional) unless an independence correlation structure is assumed. For example, one may wish to understand the predictive value of current covariate measurements on current health status to understand what can be learned from the information available in a given visit without relying on historical measurements. Current health is highly likely to be predicted by additional antecedent factors, e.g., previous health status. In this setting, data analysts must use an independence working correlation when regressing health status on covariates using GEEs.
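The point can be illustrated with a minimal simulation sketch. Everything in it is a hypothetical construction for illustration and is not the simulation reported in the eAppendix: clusters have two visits, the cross-sectional mean model E[Y_t | X_t] = beta * X_t holds at both visits, but the second outcome also depends on the earlier covariate. The working correlation parameter rho is held fixed rather than estimated from residuals, so for an identity link the estimating equation has a closed-form solution.

```python
import random


def simulate_cluster(rng, beta=1.0, gamma=1.0, phi=0.5):
    """Draw one two-visit cluster (hypothetical data-generating mechanism).

    E[Y_t | X_t] = beta * X_t holds at both visits, but Y_2 also depends
    on the earlier covariate X_1 through gamma, i.e. on covariate history.
    """
    x1 = rng.gauss(0.0, 1.0)
    # X_2 is standard normal with correlation phi to X_1.
    x2 = phi * x1 + (1.0 - phi ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
    y1 = beta * x1 + rng.gauss(0.0, 1.0)
    # (x1 - phi * x2) has conditional mean zero given x2, so the
    # cross-sectional slope at visit 2 is still beta.
    y2 = beta * x2 + gamma * (x1 - phi * x2) + rng.gauss(0.0, 1.0)
    return x1, x2, y1, y2


def gee_slope(clusters, rho):
    """Solve the linear GEE for a no-intercept slope with a fixed 2x2
    working correlation [[1, rho], [rho, 1]].

    The inverse working correlation is proportional to
    [[1, -rho], [-rho, 1]]; the constant factor cancels in the ratio.
    rho = 0 gives the independence working correlation.
    """
    num = 0.0
    den = 0.0
    for x1, x2, y1, y2 in clusters:
        num += x1 * y1 + x2 * y2 - rho * (x1 * y2 + x2 * y1)
        den += x1 * x1 + x2 * x2 - 2.0 * rho * x1 * x2
    return num / den


rng = random.Random(2018)  # arbitrary seed, for reproducibility only
clusters = [simulate_cluster(rng) for _ in range(50000)]

b_independence = gee_slope(clusters, rho=0.0)
b_exchangeable = gee_slope(clusters, rho=0.5)

print(f"independence working correlation: {b_independence:.3f}")
print(f"fixed rho = 0.5 working correlation: {b_exchangeable:.3f}")
```

Under these assumed parameter values (beta = 1, gamma = 1, phi = 0.5), the independence estimator converges to beta = 1, whereas the estimator with rho = 0.5 converges to beta - rho * gamma * (1 - phi^2) / (2 * (1 - rho * phi)) = 0.75. A criterion that favors the non-independence structure in such a setting would therefore steer the analyst toward a biased estimator.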
We have previously demonstrated6 that type I error is distorted because of postselection inference, i.e., the use of confidence intervals or significance tests following model selection. Moreover, in the eAppendix (https://links.lww.com/EDE/B384), we demonstrate via brief simulations that bias can arise from using information criteria in settings where an independence working correlation is required. While these limitations of model selection in the GEE context are well known to statisticians, this message appears to be insufficiently disseminated to other fields. For instance, more than 80% of the citations of Pan’s quasi-likelihood information criterion are in nonstatistical journals,6 suggesting that the criterion is being used in routine data analysis in fields such as epidemiology and cancer biology. Even in our institution, the routine use of these criteria is encouraged, without mention of the potential perils discussed above. Moreover, new criteria continue to be developed7,8 despite these potential perils.
We urge data analysts to consider selection of the working correlation structure based on the data-generating mechanism and not solely on information criteria. The development or extension of ever more methods for choosing among different correlation structures is of little use, and such methods may even be counterproductive if used in the same manner as the previously developed criteria already in use. Because GEEs offer consistency without perfect knowledge of the correlation structure, reliance on this known and proven property may be the most prudent and fruitful analysis approach.
Wilhemina Adoma Pels
African Institute for Mathematical Sciences
Mbour, Senegal
Montreal, Canada
Lindsay N. Carpp
Vaccine and Infectious Disease Division
Fred Hutchinson Cancer Research Center
Erica E. M. Moodie
1. Diggle PJ, Heagerty P, Liang K-Y, Zeger SL. Analysis of Longitudinal Data. 2nd ed. Oxford, UK: Oxford University Press; 2002.
2. Pan W. Akaike’s information criterion in generalized estimating equations. Biometrics. 2001;57:120–125.
3. Hardin JW, Hilbe JM. Generalized Linear Models and Extensions. College Station, TX: Stata Press; 2007.
4. Hin LY, Wang YG. Working-correlation-structure identification in generalized estimating equations. Stat Med. 2008;28:642–658.
5. Pepe MS, Anderson GL. A cautionary note on inference for marginal regression models with longitudinal data and general correlated response data. Commun Stat Simul Comput. 1994;23:939–951.
6. Wang Y, Murphy O, Turgeon M, et al. The perils of quasi-likelihood information criteria. Stat. 2015;4:246–254.
7. Jaman A, Latif MA, Bari W, Wahed AS. A determinant-based criterion for working correlation structure selection in generalized estimating equations. Stat Med. 2016;35:1819–1833.
8. Wang P, Zhou J, Qu A. Correlation structure selection for longitudinal data with diverging cluster size. Can J Stat. 2016;44:343–360.