# Multilevel Modeling in Epidemiology with GLIMMIX

Previous work has shown that multilevel modeling can be a valuable technique for epidemiologic analysis. The complexity of using this approach, however, continues to restrict its general application. A critical factor is the lack of flexible and appropriate software for multilevel modeling. SAS provides a macro, GLIMMIX, that can be used for multilevel modeling, but that is not sufficient for a complete epidemiologic analysis. We here provide additional code to obtain epidemiologic output from GLIMMIX, illustrated with new data on diet and breast cancer from the European Community Multicenter Study on Antioxidants, Myocardial Infarction, and Breast Cancer (EURAMIC). Our results give epidemiologists an easily used tool for fitting multilevel models.

From the ^{1}Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH; ^{2}Department of Epidemiology, University of California-Los Angeles School of Public Health; and ^{3}Departments of Epidemiology and Nutrition, University of North Carolina School of Public Health, Chapel Hill, NC.

This work was supported by National Institutes of Health Grant CA73270.

Submitted February 2, 2000; final version accepted April 24, 2000.

Address reprint requests to: John S. Witte, Department of Epidemiology and Biostatistics, Case Western Reserve University, 2500 MetroHealth Drive, Cleveland, OH 44109-1998.

Multilevel modeling (also known as hierarchical regression) is an important technique for epidemiologic analysis for three key reasons. First, it allows one to incorporate multiple levels of information into a single epidemiologic analysis. Doing this can be critical for adequately modeling exposure-disease relations driven by risk factors arising at both the individual and ecologic levels. This advantage helped spur the use of multilevel modeling in social science research and is currently helping prompt its application in epidemiology. For example, a recent study used a multilevel model to examine concurrently the impact of individual- and neighborhood-level factors on poor health. This analysis found that both individual educational status and neighborhood socioeconomic environment were associated with self-reported poor health. ^{1}

Second, multilevel modeling can provide estimates of effect that are more accurate and more plausible than those from conventional models. ^{2} Much research indicates that in the analysis of epidemiologic data on multiple exposures, multilevel models can statistically outperform conventional approaches (for example, one-stage logistic regression). ^{3–11} This outperformance occurs in part because the higher levels of a multilevel model incorporate additional information for estimation. For example, in an application to nutritional epidemiology, a hierarchical model improved conventional estimates of food effects by shrinking them toward each other when their corresponding foods had similar levels of nutrients. ^{7}

Third, multilevel modeling provides a solution to problems of multiple comparisons. Proposed solutions of this issue have ranged from ignoring the multiple comparisons to “adjusting” for them by altering the alpha level used in statistical procedures. ^{12–17} Instead of ignoring or adjusting for multiple comparisons, one can solve this problem by using a multilevel model to simultaneously evaluate multiple exposures or outcomes. ^{3–6} For example, hierarchical modeling has been used in genetic epidemiology for evaluating associations between large numbers of candidate human leukocyte antigen genes and insulin-dependent diabetes mellitus ^{18} and for evaluating gene and diet interactions in the etiology of colon polyps. ^{11} In a similar fashion, hierarchical regression (in the form of random coefficient modeling) provides an alternative to variable selection in problems involving many potential confounders. ^{19}

Despite the benefits of multilevel modeling, lack of software tailored to epidemiology remains a major obstacle to its use. In a previous report, we provided SAS code for two-stage multilevel modeling. ^{20} This code uses a basic two-step approach whereby regression coefficients from a first-stage model are analyzed with a second-stage linear weighted-least-squares regression. ^{5,6} This approach requires that the data be adequate for fitting the first-stage model without any constraints. One can instead undertake multilevel modeling with the SAS macro GLIMMIX (http://ftp.sas.com/techsup/download/stat/glmm612.sas), ^{21} which allows for more than two stages and does not require fitting distinct models in the different stages. The GLIMMIX fitting method (penalized quasi-likelihood) has properties superior to the basic two-step approach outlined above ^{9} and allows one to evaluate the fit of a multilevel model with conventional likelihood ratio tests. The value of GLIMMIX is limited, however, because it does not provide statistics [for example, relative risk estimates and 95% confidence intervals (CIs)] for quantities of primary epidemiologic interest. The additional work required to obtain this information from GLIMMIX may deter its use—and the application of multilevel models—in epidemiology. Therefore, in the present paper, we show how to use GLIMMIX for epidemiologic analysis with a new application of multilevel modeling.

## Multilevel Modeling

In a previous report, ^{7} we applied multilevel modeling to a study of diet and breast cancer. In particular, we developed a multilevel application to produce improved relative risk estimates of breast cancer for foods by using a second-stage nutrient-composition model to pull ordinary estimates toward each other when they have similar levels of nutrients (such as fat or fiber). We here use this approach for the Berlin European Community Multicenter Study on Antioxidants, Myocardial Infarction, and Breast Cancer (EURAMIC) data on diet and breast cancer. ^{22} We have complete information from 40 cases and 90 controls and are interested in estimating the effects of the following 12 foods on breast cancer: cauliflower, broccoli, Brussels sprouts, red cabbage, white cabbage, savoy cabbage, sauerkraut, allium vegetables (that is, onions and garlic), tomato, pizza, tomato salad, and strawberries. These foods were selected for investigation *a priori* on the basis of their known levels of anticarcinogenic constituents (for example, glucosinolates).

Our conventional analysis of the data used logistic regression to model the risk of breast cancer as MATH where *p* = risk of breast cancer, **X** is the matrix of food intake information, **W** is a matrix of covariate data (on age, calories, body mass index, socioeconomic status, alcohol intake, and supplemental…, hormonal use), α is the intercept term, and β = (β_{1},…, β_{12})′ and γ are the vectors of logistic regression coefficients corresponding to the 12 foods and six covariates. Fitting model 1 to the data by maximum likelihood yields the odds ratio (OR) estimates and 95% CIs given in the first set of columns in Table 1. (All results in Table 1 correspond to a 100-gm-per-week increase, *vs* none, in the foods). Some of the ORs show an inverse association with breast cancer, whereas Brussels sprouts and white cabbage show positive associations. The latter results are implausible (*a priori* we expected all foods to be either inversely associated or not associated with breast cancer owing to their known constituents’ potential anticarcinogenic properties) and may only reflect small-sample bias ^{23–25} or instability of the maximum-likelihood estimates from the conventional logistic regression: with only 40/19 = 2.1 cases per parameter, the dataset fails the criterion of at least five cases and five controls per parameter for valid maximum-likelihood estimates of the coefficients.

We can improve our estimates by using a multilevel model. This approach uses external (prior) information to develop a regression model for a second level or stage in which the effects of exposure variables are partially determined by other factors. ^{2,5–7} The second level (stage) of our multilevel model is a linear regression model for the logistic coefficients β of the dietary items, MATH where the *i* th row of **Z** contains second-stage covariates for the *i* th dietary item, π is a vector of coefficients corresponding to the effects of second-stage covariates on breast cancer, and the elements of δ are independent normal random variables with zero means and variances τ_{i}^{2}. The second-stage covariates are defined as food constituents that may contribute to dietary effects on breast cancer. ^{5} Thus, **Z** is the diet-nutrient table that gives the amount of nutrients in each food (that is, element *z*_{ij} of **Z** is the amount of constituent *j* reported in food *i*). Here we include the following eight second-stage covariates that may contribute to the effects of food items on breast cancer: glucosinolates, alpha- and beta-carotene, lutein, lycopene, vitamin E, organosulfate, and fiber. Table 2 gives the 12-by-eight second-stage design matrix **Z** used here.

## Fitting the Multilevel Model with GLIMMIX

To use GLIMMIX, we combine models 1 and 2 into a “mixed-effects” logistic model, MATH where the parameters are as defined above;π and γ are treated as vectors of fixed coefficients and, as above, δ is treated as a vector of random coefficients with mean 0 and variance τ^{2}. Note that the τ^{2} values reflect any residual associations for the 12 foods on breast cancer, after incorporating the eight constituents in **Z**. At first glance, one might think that incorporating **XZ** in model 3 would overfit the data relative to the conventional one-stage model 1. Just the opposite is the case, however; the conventional approach (1) is equivalent to using model 3 with no constraint on the residual effects δ (that is, with τ^{2} = ∞) and thus results in more overfitting than the constrained (that is, τ^{2} < ∞) multilevel approach.

The GLIMMIX default restricts one to empirical-Bayes, in which a single common variance for the random coefficients, τ^{2}, is estimated from the data. Several applications and simulation studies indicate, however, that when the study size is small, as in the present example, prespecifying the τ^{2} to define a reasonable range of residual values (that is, a semi-Bayes approach) can work as well as or better than empirical-Bayes, ^{5–8} and there are also philosophical reasons for preferring prespecification. ^{19} One can modify GLIMMIX for semi-Bayes multilevel modeling by incorporating in the *random* statement the option gdata = “filename,” where the file in “filename” contains prespecified τ^{2} values. In the present example we use a semi-Bayes approach; because **Z** includes most of the known relevant food constituents, we set the elements of τ to a relatively moderate value (0.35) for all foods. Assuming a normal distribution for the δ, τ_{i} = 0.35 corresponds to a 95% prior certainty that the residual relative risk for the effect of a 100-gm increase in food item *i* lies in a fourfold range (that is, 0.5–2.0), because exp (3.92τ_{i}) = 4.0 if τ_{i} = 0.35. ^{5–7}

Fitting model 3 to the diet and breast cancer data with GLIMMIX gives estimates of the fixed and random coefficients and their corresponding covariances (Table 3). We can use this output from GLIMMIX to calculate the food-effect coefficients MATH the natural antilogarithms of which give OR estimates. To obtain 95% CIs, we need the covariance matrix estimate for β̂ given by MATH where cov (MATH), cov (MATH), and cov (MATH, MATH) = cov (MATH, MATH) ′ are covariance-matrix estimates from GLIMMIX, and an apostrophe denotes matrix transposition (Table 3). The square root of the diagonal of cov(MATH) gives standard error estimates for the MATH, which we can use to calculate corresponding 95% CIs.

Upon fitting the mixed model 3 to the example data and calculating the semi-Bayes estimates from model 4, estimates that were unstable and nonsensical became more precise and reasonable, whereas those that were relatively precise did not change much. For example, the maximum-likelihood estimate of the OR for eating 100 additional grams of white cabbage was 2.5 (95% CI = 0.75–8.4); in contrast, the semi-Bayes OR estimate from GLIMMIX was 1.2 (95% CI = 0.59–2.6) (Table 1). In contrast, the OR for eating 100 additional grams of allium vegetables changed little upon applying the multilevel model (Table 1). Setting the elements of τ to larger values (for example, 0.53, which corresponds to a 95% prior certainty that the residual relative risk lies in an eightfold range) gave less stable results; here, the semi-Bayes OR for eating 100 additional grams of white cabbage was 1.7 (95% CI = 0.64–4.4). Using the GLIMMIX default (that is, empirical-Bayes) led to the common τ^{2} being estimated as 0, which ignores any residual second-stage effects and results in overly precise estimates that are pulled too close together. ^{5}

For comparison, we also fit the two-level model (model 1 plus model 2) to the diet and breast cancer data using our earlier weighted-least-squares method. ^{5–8,20} This method (third set of columns in Table 1) generally gave results almost identical to the penalized likelihood results from GLIMMIX. The similarity of these results, and their limited improvement over the conventional approach (Table 1), occurred because the mixed model 3 with δ = 0 fit the data reasonably well.

Whereas the multilevel estimates in Table 1 are certainly more credible than the maximum-likelihood estimates, previous applications ^{3–7,11,19} have shown more improvement from multilevel modeling than the current analysis. For example, in our earlier diet and breast cancer application, ^{7} we obtained a maximum-likelihood estimate of OR for eating four 4-inch celery sticks per week *vs* none equal to 5.1 (95% CI = 0.89–29); in contrast, the multilevel modeling estimate was 0.90 (95% CI = 0.28–2.9). Part of the difference in impact may be due to the fact that our earlier example involved even fewer cases per first-stage parameter (140/92 = 1.5).

The Appendix provides the SAS IML code used to calculate ORs and 95% CIs from our GLIMMIX output. The complete program, a corresponding SAS macro, and instructions on how to obtain results using a multilevel model with GLIMMIX, are available at URL http://darwin.cwru.edu/∼witte/hm.html.

## Conclusion

With the explication and code given here, epidemiologists can use GLIMMIX to analyze their own data with a multilevel model. Strengths of GLIMMIX include allowing for more than two stages and providing diagnostic statistics. A recent report ^{26} provides further evaluation of GLIMMIX and comparison with variance component software packages specifically written for multilevel modeling. ^{27–29} As with SAS GLIMMIX, however, use of these packages for epidemiologic analysis will require either writing special code or altering output to give the relative risk estimates and corresponding confidence intervals. ^{20} Multilevel modeling can also be undertaken with procedures available in the SAS IML and GAUSS languages. ^{5,6,20,30} These procedures provide standard epidemiologic output and are available from URL http://darwin.cwru.edu/∼witte/hm.html.

## Acknowledgment

We thank the EURAMIC Study for the example diet and breast cancer data used here.

## References

## Cited By:

This article has been cited
**4** time(s).

Analysis of Multiple Exposures: An Empirical Comparison of Results From Conventional and Semi-Bayes Modeling Strategies

Epidemiology, 21(1): 144-151.

10.1097/EDE.0b013e3181c297c7

PDF (292) | CrossRef

Dissecting Effects of Complex Mixtures: Who’s Afraid of Informative Priors?

Epidemiology, 18(2): 186-190.

10.1097/01.ede.0000254682.47697.70

PDF (191) | CrossRef

Effectiveness of Crime Prevention Through Environmental Design in Reducing Criminal Activity in Liquor Stores: A Pilot Study

Journal of Occupational and Environmental Medicine, 46(5): 450-458.

PDF (212)

Hearing Loss Among Licensed Pesticide Applicators in the Agricultural Health Study

Journal of Occupational and Environmental Medicine, 50(7): 817-826.

10.1097/JOM.0b013e31816a8caf

PDF (228) | CrossRef

**Keywords:**

breast neoplasms; diet; epidemiologic methods; logistic regression; multilevel models; odds ratio; relative risk; software