Multilevel Modeling in Epidemiology with GLIMMIX
Witte, John S.1; Greenland, Sander2; Kim, Lee-Lian1; Arab, Lenore3
From the 1Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH; 2Department of Epidemiology, University of California-Los Angeles School of Public Health; and 3Departments of Epidemiology and Nutrition, University of North Carolina School of Public Health, Chapel Hill, NC.
This work was supported by National Institutes of Health Grant CA73270.
Submitted February 2, 2000; final version accepted April 24, 2000.
Address reprint requests to: John S. Witte, Department of Epidemiology and Biostatistics, Case Western Reserve University, 2500 MetroHealth Drive, Cleveland, OH 44109-1998.
Previous work has shown that multilevel modeling can be a valuable technique for epidemiologic analysis. The complexity of using this approach, however, continues to restrict its general application. A critical factor is the lack of flexible and appropriate software for multilevel modeling. SAS provides a macro, GLIMMIX, that can be used for multilevel modeling, but that is not sufficient for a complete epidemiologic analysis. We here provide additional code to obtain epidemiologic output from GLIMMIX, illustrated with new data on diet and breast cancer from the European Community Multicenter Study on Antioxidants, Myocardial Infarction, and Breast Cancer (EURAMIC). Our results give epidemiologists an easily used tool for fitting multilevel models.
Multilevel modeling (also known as hierarchical regression) is an important technique for epidemiologic analysis for three key reasons. First, it allows one to incorporate multiple levels of information into a single epidemiologic analysis. Doing this can be critical for adequately modeling exposure-disease relations driven by risk factors arising at both the individual and ecologic levels. This advantage helped spur the use of multilevel modeling in social science research and is currently helping prompt its application in epidemiology. For example, a recent study used a multilevel model to examine concurrently the impact of individual- and neighborhood-level factors on poor health. This analysis found that both individual educational status and neighborhood socioeconomic environment were associated with self-reported poor health. 1
Second, multilevel modeling can provide estimates of effect that are more accurate and more plausible than those from conventional models. 2 Much research indicates that in the analysis of epidemiologic data on multiple exposures, multilevel models can statistically outperform conventional approaches (for example, one-stage logistic regression). 3–11 This outperformance occurs in part because the higher levels of a multilevel model incorporate additional information for estimation. For example, in an application to nutritional epidemiology, a hierarchical model improved conventional estimates of food effects by shrinking them toward each other when their corresponding foods had similar levels of nutrients. 7
Third, multilevel modeling provides a solution to problems of multiple comparisons. Proposed solutions of this issue have ranged from ignoring the multiple comparisons to “adjusting” for them by altering the alpha level used in statistical procedures. 12–17 Instead of ignoring or adjusting for multiple comparisons, one can solve this problem by using a multilevel model to simultaneously evaluate multiple exposures or outcomes. 3–6 For example, hierarchical modeling has been used in genetic epidemiology for evaluating associations between large numbers of candidate human leukocyte antigen genes and insulin-dependent diabetes mellitus 18 and for evaluating gene and diet interactions in the etiology of colon polyps. 11 In a similar fashion, hierarchical regression (in the form of random coefficient modeling) provides an alternative to variable selection in problems involving many potential confounders. 19
Despite the benefits of multilevel modeling, lack of software tailored to epidemiology remains a major obstacle to its use. In a previous report, we provided SAS code for two-stage multilevel modeling. 20 This code uses a basic two-step approach whereby regression coefficients from a first-stage model are analyzed with a second-stage linear weighted-least-squares regression. 5,6 This approach requires that the data be adequate for fitting the first-stage model without any constraints. One can instead undertake multilevel modeling with the SAS macro GLIMMIX (http://ftp.sas.com/techsup/download/stat/glmm612.sas), 21 which allows for more than two stages and does not require fitting distinct models in the different stages. The GLIMMIX fitting method (penalized quasi-likelihood) has properties superior to the basic two-step approach outlined above 9 and allows one to evaluate the fit of a multilevel model with conventional likelihood ratio tests. The value of GLIMMIX is limited, however, because it does not provide statistics [for example, relative risk estimates and 95% confidence intervals (CIs)] for quantities of primary epidemiologic interest. The additional work required to obtain this information from GLIMMIX may deter its use—and the application of multilevel models—in epidemiology. Therefore, in the present paper, we show how to use GLIMMIX for epidemiologic analysis with a new application of multilevel modeling.
In a previous report, 7 we applied multilevel modeling to a study of diet and breast cancer. In particular, we developed a multilevel application to produce improved relative risk estimates of breast cancer for foods by using a second-stage nutrient-composition model to pull ordinary estimates toward each other when they have similar levels of nutrients (such as fat or fiber). We here use this approach for the Berlin European Community Multicenter Study on Antioxidants, Myocardial Infarction, and Breast Cancer (EURAMIC) data on diet and breast cancer. 22 We have complete information from 40 cases and 90 controls and are interested in estimating the effects of the following 12 foods on breast cancer: cauliflower, broccoli, Brussels sprouts, red cabbage, white cabbage, savoy cabbage, sauerkraut, allium vegetables (that is, onions and garlic), tomato, pizza, tomato salad, and strawberries. These foods were selected for investigation a priori on the basis of their known levels of anticarcinogenic constituents (for example, glucosinolates).
Our conventional analysis of the data used logistic regression to model the risk of breast cancer as MATH where p = risk of breast cancer, X is the matrix of food intake information, W is a matrix of covariate data (on age, calories, body mass index, socioeconomic status, alcohol intake, and supplemental…, hormonal use), α is the intercept term, and β = (β1,…, β12)′ and γ are the vectors of logistic regression coefficients corresponding to the 12 foods and six covariates. Fitting model 1 to the data by maximum likelihood yields the odds ratio (OR) estimates and 95% CIs given in the first set of columns in Table 1. (All results in Table 1 correspond to a 100-gm-per-week increase, vs none, in the foods). Some of the ORs show an inverse association with breast cancer, whereas Brussels sprouts and white cabbage show positive associations. The latter results are implausible (a priori we expected all foods to be either inversely associated or not associated with breast cancer owing to their known constituents’ potential anticarcinogenic properties) and may only reflect small-sample bias 23–25 or instability of the maximum-likelihood estimates from the conventional logistic regression: with only 40/19 = 2.1 cases per parameter, the dataset fails the criterion of at least five cases and five controls per parameter for valid maximum-likelihood estimates of the coefficients.
We can improve our estimates by using a multilevel model. This approach uses external (prior) information to develop a regression model for a second level or stage in which the effects of exposure variables are partially determined by other factors. 2,5–7 The second level (stage) of our multilevel model is a linear regression model for the logistic coefficients β of the dietary items, MATH where the i th row of Z contains second-stage covariates for the i th dietary item, π is a vector of coefficients corresponding to the effects of second-stage covariates on breast cancer, and the elements of δ are independent normal random variables with zero means and variances τi2. The second-stage covariates are defined as food constituents that may contribute to dietary effects on breast cancer. 5 Thus, Z is the diet-nutrient table that gives the amount of nutrients in each food (that is, element zij of Z is the amount of constituent j reported in food i). Here we include the following eight second-stage covariates that may contribute to the effects of food items on breast cancer: glucosinolates, alpha- and beta-carotene, lutein, lycopene, vitamin E, organosulfate, and fiber. Table 2 gives the 12-by-eight second-stage design matrix Z used here.
Fitting the Multilevel Model with GLIMMIX
To use GLIMMIX, we combine models 1 and 2 into a “mixed-effects” logistic model, MATH where the parameters are as defined above;π and γ are treated as vectors of fixed coefficients and, as above, δ is treated as a vector of random coefficients with mean 0 and variance τ2. Note that the τ2 values reflect any residual associations for the 12 foods on breast cancer, after incorporating the eight constituents in Z. At first glance, one might think that incorporating XZ in model 3 would overfit the data relative to the conventional one-stage model 1. Just the opposite is the case, however; the conventional approach (1) is equivalent to using model 3 with no constraint on the residual effects δ (that is, with τ2 = ∞) and thus results in more overfitting than the constrained (that is, τ2 < ∞) multilevel approach.
The GLIMMIX default restricts one to empirical-Bayes, in which a single common variance for the random coefficients, τ2, is estimated from the data. Several applications and simulation studies indicate, however, that when the study size is small, as in the present example, prespecifying the τ2 to define a reasonable range of residual values (that is, a semi-Bayes approach) can work as well as or better than empirical-Bayes, 5–8 and there are also philosophical reasons for preferring prespecification. 19 One can modify GLIMMIX for semi-Bayes multilevel modeling by incorporating in the random statement the option gdata = “filename,” where the file in “filename” contains prespecified τ2 values. In the present example we use a semi-Bayes approach; because Z includes most of the known relevant food constituents, we set the elements of τ to a relatively moderate value (0.35) for all foods. Assuming a normal distribution for the δ, τi = 0.35 corresponds to a 95% prior certainty that the residual relative risk for the effect of a 100-gm increase in food item i lies in a fourfold range (that is, 0.5–2.0), because exp (3.92τi) = 4.0 if τi = 0.35. 5–7
Fitting model 3 to the diet and breast cancer data with GLIMMIX gives estimates of the fixed and random coefficients and their corresponding covariances (Table 3). We can use this output from GLIMMIX to calculate the food-effect coefficients MATH the natural antilogarithms of which give OR estimates. To obtain 95% CIs, we need the covariance matrix estimate for β̂ given by MATH where cov (MATH), cov (MATH), and cov (MATH, MATH) = cov (MATH, MATH) ′ are covariance-matrix estimates from GLIMMIX, and an apostrophe denotes matrix transposition (Table 3). The square root of the diagonal of cov(MATH) gives standard error estimates for the MATH, which we can use to calculate corresponding 95% CIs.
Upon fitting the mixed model 3 to the example data and calculating the semi-Bayes estimates from model 4, estimates that were unstable and nonsensical became more precise and reasonable, whereas those that were relatively precise did not change much. For example, the maximum-likelihood estimate of the OR for eating 100 additional grams of white cabbage was 2.5 (95% CI = 0.75–8.4); in contrast, the semi-Bayes OR estimate from GLIMMIX was 1.2 (95% CI = 0.59–2.6) (Table 1). In contrast, the OR for eating 100 additional grams of allium vegetables changed little upon applying the multilevel model (Table 1). Setting the elements of τ to larger values (for example, 0.53, which corresponds to a 95% prior certainty that the residual relative risk lies in an eightfold range) gave less stable results; here, the semi-Bayes OR for eating 100 additional grams of white cabbage was 1.7 (95% CI = 0.64–4.4). Using the GLIMMIX default (that is, empirical-Bayes) led to the common τ2 being estimated as 0, which ignores any residual second-stage effects and results in overly precise estimates that are pulled too close together. 5
For comparison, we also fit the two-level model (model 1 plus model 2) to the diet and breast cancer data using our earlier weighted-least-squares method. 5–8,20 This method (third set of columns in Table 1) generally gave results almost identical to the penalized likelihood results from GLIMMIX. The similarity of these results, and their limited improvement over the conventional approach (Table 1), occurred because the mixed model 3 with δ = 0 fit the data reasonably well.
Whereas the multilevel estimates in Table 1 are certainly more credible than the maximum-likelihood estimates, previous applications 3–7,11,19 have shown more improvement from multilevel modeling than the current analysis. For example, in our earlier diet and breast cancer application, 7 we obtained a maximum-likelihood estimate of OR for eating four 4-inch celery sticks per week vs none equal to 5.1 (95% CI = 0.89–29); in contrast, the multilevel modeling estimate was 0.90 (95% CI = 0.28–2.9). Part of the difference in impact may be due to the fact that our earlier example involved even fewer cases per first-stage parameter (140/92 = 1.5).
The Appendix provides the SAS IML code used to calculate ORs and 95% CIs from our GLIMMIX output. The complete program, a corresponding SAS macro, and instructions on how to obtain results using a multilevel model with GLIMMIX, are available at URL http://darwin.cwru.edu/∼witte/hm.html.
With the explication and code given here, epidemiologists can use GLIMMIX to analyze their own data with a multilevel model. Strengths of GLIMMIX include allowing for more than two stages and providing diagnostic statistics. A recent report 26 provides further evaluation of GLIMMIX and comparison with variance component software packages specifically written for multilevel modeling. 27–29 As with SAS GLIMMIX, however, use of these packages for epidemiologic analysis will require either writing special code or altering output to give the relative risk estimates and corresponding confidence intervals. 20 Multilevel modeling can also be undertaken with procedures available in the SAS IML and GAUSS languages. 5,6,20,30 These procedures provide standard epidemiologic output and are available from URL http://darwin.cwru.edu/∼witte/hm.html.
We thank the EURAMIC Study for the example diet and breast cancer data used here.
1. Malmstrom M, Sundquist J, Johansson SE. Neighborhood environment and self-reported health status: a multilevel analysis. Am J Public Health 1999; 89: 1181–1186.
2. Greenland S. Principles of multilevel modeling. Int J Epidemiol 2000; 29: 158–167.
3. Efron B, Morris C. Data analysis using Stein’s estimator and its generalizations. J Am Stat Assoc 1975; 70: 311–319.
4. Thomas DC, Siemiatycki J, Dewar R, Robins J, Goldberg M, Armstrong BG. The problem of multiple inference in studies designed to generate hypotheses. Am J Epidemiol 1985; 122: 1080–1095.
5. Greenland S. A semi-Bayes approach to the analysis of correlated multiple associations, with an application to an occupational cancer-mortality study. Stat Med 1992; 11: 219–230.
6. Greenland S. Methods for epidemiologic analyses of multiple exposures: a review and comparative study of maximum-likelihood, preliminary-testing, and empirical-Bayes regression. Stat Med 1993; 12: 717–736.
7. Witte JS, Greenland S, Bird CL, Haile RW. Hierarchical regression analysis applied to a study of multiple dietary exposures and breast cancer. Epidemiology 1994; 5: 612–621.
8. Witte JS, Greenland S. Simulation study of hierarchical regression. Stat Med 1996; 15: 1161–1170.
9. Greenland S. Second-stage least squares versus penalized quasi-likelihood for fitting hierarchical models in epidemiologic analyses. Stat Med 1997; 16: 515–526.
10. Witte JS. Genetic analysis with hierarchical models. Genet Epidemiol 1997; 14: 1137–1142.
11. Aragaki CC, Greenland S, Probst-Hensch N, Haile RW. Hierarchical modeling of gene-environment interactions: estimating NAT2* genotype-specific dietary effects on adenomatous polyps. Cancer Epidemiol Biomarkers Prev 1997; 6: 307–314.
12. Rothman K. No adjustments are needed for multiple comparisons. Epidemiology 1990; 1: 43–46.
13. Witte JS, Elston RC, Schork NJ. Re: Genetic dissection of complex traits. Nat Genet 1996; 12: 355–356.
14. Thompson JR. Invited commentary. Re: “Multiple comparisons and related issues in the interpretation of epidemiologic data.” Am J Epidemiol 1998; 147: 801–806.
15. Goodman SN. Multiple comparisons, explained. Am J Epidemiol 1998; 147: 807–812.
16. Savitz DA, Olshan AF. Describing data requires no adjustment for multiple comparisons: a reply from Savitz and Olshan. Am J Epidemiol 1998; 147: 813–814.
17. Thompson JR. Re: Describing data requires no adjustment for multiple comparisons. Am J Epidemiol 1998; 147: 815.
18. Thomas DC, Langholz B, Clayton D, Pitkaniemi J, Tuomilehto-Wolf E, Tuomilehto J. Empirical Bayes methods for testing associations with large numbers of candidate genes in the presence of environmental risk factors, with applications to HLA association in IDDM. Ann Med 1992; 24: 387–392.
19. Greenland S. When should epidemiologic regressions use random coefficients? Biometrics 2000 (in press).
20. Witte JS, Greenland S, Kim L-L. Software for hierarchical modeling of epidemiologic data. Epidemiology 1998; 9: 563–566.
21. Wolfinger R, O’Connell M. Generalized linear mixed models: a pseudo-likelihood approach. J Stat Comput Simul 1993; 48: 223–243.
22. Simonsen NR, Fernández Crehuet Navajas J, Martín Moreno JM, Strain JJ, Huttunen JK, Martin BC, Thamm M, Kardinaal AF, vant Veer P, Kok FJ, Kohlmeier L, van t’Veer P. Tissue stores of individual monounsaturated fatty acids and breast cancer: the EURAMIC study. European Community Multicenter Study on Antioxidants, Myocardial Infarction, and Breast Cancer. Am J Clin Nutr 1998; 68: 134–141.
23. Jewell NP. On the bias of commonly used measures of association for 2 × 2 tables. Biometrics 1986; 42: 351–358.
24. Greenland S. Small-sample bias and corrections for conditional maximum-likelihood odds-ratio estimators. Biostatistics 2000; 1: 113–122.
25. Greenland S, Schwartzbaum JA, Finkle WD. Small-sample and sparse-data problems in conditional logistic regression analysis. Am J Epidemiol 2000; 151: 531–539.
26. Zhou XH, Perkins AJ, Hui SL. Comparisons of software packages for generalized linear multilevel models. Am Stat 1999; 53: 282–290.
27. HLM, Version 4.0. Chicago: Scientific Software Inc., 1996.
28. MLn, Version 1.0a. London: Multilevel Models Project, 1996.
29. VARCL. Groningen, The Netherlands: ProGAMMA, 1996.
30. The GAUSS System, Version 3.2. Maple Valley, WA: Aptech Systems, Inc., 1996.
31. Fenwick GR, Heaney RK, Mullin WJ. Glucosinolates and their breakdown products in food and food plants. Crit Rev Food Sci Nutr 1983; 18: 123–201.
32. Chug-Ahuja JK, Holden JM, Forman MR, Mangels AR, Beecher GR, Lanza E. The development and application of a carotenoid database for fruits, vegetables, and selected multicomponent foods. J Am Diet Assoc 1993; 93: 318–323.
33. Haussler A, Rehm J, Nass E, Kohlmeier L. Data bases for nutritional epidemiology: The food code of the German Federal Republic (BLS). In: Kohlmaier L, ed. The Diet History Method. London: Smith-Gordon, 1991; 103–108.
breast neoplasms; diet; epidemiologic methods; logistic regression; multilevel models; odds ratio; relative risk; software
© 2000 Lippincott Williams & Wilkins, Inc.
What does "Remember me" mean?
By checking this box, you'll stay logged in until you logout. You'll get easier access to your articles, collections,
media, and all your other content, even if you close your browser or shut down your
To protect your most sensitive data and activities (like changing your password),
we'll ask you to re-enter your password when you access these services.
What if I'm on a computer that I share with others?
If you're using a public computer or you share this computer with others, we recommend
that you uncheck the "Remember me" box.
Highlight selected keywords in the article text.
Data is temporarily unavailable. Please try again soon.