Modeling Physical Activity Outcomes: A Two-part Generalized-estimating-equations Approach

Lee, Andy H. (a); Xiang, Liming (b); Hirayama, Fumi (a)

doi: 10.1097/EDE.0b013e3181e9428b
Methods: Brief Report

The health benefits of physical activity are well known. Various parameters of physical activity, such as time or energy expenditure, are often assessed in observational and experimental studies. This article highlights several methodologic issues concerning the analysis of physical activity. These include non-normality, presence of many zeros, and violation of the independence assumption. Application of the standard regression model to a (log-transformed) physical activity variable may lead to spurious associations and misleading conclusions. We developed an alternative 2-part generalized-estimating-equations (GEE) approach to analyze the heterogeneous and correlated physical activity data. We first estimated a logistic GEE model for the prevalence of physical activity and factors affecting physical activity participation. We then fit a gamma GEE model to assess the effects of predictors among persons engaging in physical activity. An empirical application to an epidemiologic study of physical activity of community-dwelling older adults illustrates the proposed methodology.

From the (a) Department of Epidemiology and Biostatistics, School of Public Health, Curtin University of Technology, Perth, Western Australia; and (b) Division of Mathematical Sciences, SPMS, Nanyang Technological University, 21 Nanyang Link, Singapore.

Submitted 19 October 2009; accepted 23 January 2010; posted 29 June 2010.

Correspondence: Andy H. Lee, Department of Epidemiology and Biostatistics, School of Public Health, Curtin Health Innovation Research Institute, Curtin University of Technology, GPO Box U 1987, Perth, WA 6845, Australia. E-mail:

Physical activity is a modifiable lifestyle factor with well-established health benefits. As much as half of all functional decline associated with ageing may be preventable if adequate levels of physical activity are maintained.1 Physical activity may also enhance psychologic and emotional well-being,2 and increase satisfaction with life.3,4 A sedentary lifestyle is responsible for 1.9 million deaths globally each year, according to the World Health Organization.5 The US Centers for Disease Control have developed guidelines on physical activity, recommending a minimum of 30 minutes of moderate activity daily.6

Given its importance to health and well-being, physical activity levels are often assessed and monitored. Measurements of such activity include duration, frequency, and intensity. Data are usually collected by self-report using a validated questionnaire, or more objectively by accelerometry.7 Physical activity is usually summarized in terms of time (eg, minutes per week) or energy expenditure (metabolic equivalent tasks [MET]). A MET score is the ratio of the metabolic rate during an activity to the metabolic rate at rest. Each type of activity (vigorous, moderate, walking, etc) is allocated a MET score under a standard compendium.8 Many studies model physical activity and correlate it with pertinent risk factors. Several methodologic issues arise with the application of regression-type analyses.
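As an illustration of how a weekly energy-expenditure total might be assembled from questionnaire responses, the following minimal sketch multiplies minutes per day, days per week, and a MET weight for each activity type. The MET weights (3.3, 4.0, 8.0) are an assumption taken from the commonly used scoring protocol of the International Physical Activity Questionnaire short form and are not stated in this article; the function name and the respondent are hypothetical.

```python
# Assumed MET weights (IPAQ short-form scoring protocol); not given in this article.
MET_WEIGHTS = {"walking": 3.3, "moderate": 4.0, "vigorous": 8.0}


def met_minutes_per_week(minutes_per_day, days_per_week):
    """Sum MET-min/wk over activity types.

    Both arguments are dicts keyed by activity type; a missing key counts as zero.
    """
    total = 0.0
    for activity, weight in MET_WEIGHTS.items():
        minutes = minutes_per_day.get(activity, 0)
        days = days_per_week.get(activity, 0)
        total += weight * minutes * days
    return total


# Hypothetical respondent: 30 min of walking on 5 days, 20 min of moderate activity on 2 days.
example_total = met_minutes_per_week(
    {"walking": 30, "moderate": 20},
    {"walking": 5, "moderate": 2},
)
print(example_total)  # about 655 MET-min/wk (3.3*30*5 + 4.0*20*2)

# A respondent reporting no activity gets a total of exactly zero.
sedentary_total = met_minutes_per_week({}, {})
```

A respondent with no reported activity receives a total of zero, which is the source of the zeros discussed later in the article.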

Non-normality

The typical distribution of physical activity variables is highly skewed,7,9–11 with extreme scores and outliers that severely violate the normality assumption.9 Given a non-normal distribution, continuous physical activity variables are often reported in medians and interquartile ranges rather than means.11–13

One standard approach to skewness is to remove outliers and apply a transformation such as the logarithm.7,9 Unfortunately, rules for trimming outlying observations and for selecting an appropriate transformation are often chosen arbitrarily, without theoretical justification or support.13 Alternatively, observed physical activity levels are dichotomized at the median to create binary outcomes for logistic regression analysis.10,14 However, converting a continuous physical activity variable into a discrete one results in loss of information.

Presence of Many Zeros

According to the World Health Organization, at least 60% of the world's population has less than 30 minutes of moderate-intensity physical activity daily.15 Elderly people are particularly unlikely to be active. In the United States, for example, 51% of adults aged more than 65 years are inactive.16 This holds true especially for those with illness and chronic disease.11,17 The high prevalence of sedentary persons leads to a preponderance of zeros, which poses difficulties for modeling and analysis.
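One concrete way in which zeros complicate analysis is that the logarithm is undefined at zero, so a log transformation needs an added constant whose value is arbitrary. The sketch below uses hypothetical MET-min/wk values (not data from the study) to show that the mean on the log scale depends on that constant.

```python
import math

# Hypothetical weekly totals (MET-min/wk); the zeros are sedentary respondents.
activity = [0.0, 0.0, 198.0, 693.0, 1386.0, 4158.0]

n_zero = sum(1 for y in activity if y == 0)

# math.log raises ValueError at zero, so the raw log transform fails for sedentary persons.
try:
    math.log(activity[0])
    log_of_zero_defined = True
except ValueError:
    log_of_zero_defined = False


def mean_log_shifted(values, shift):
    """Mean of log(y + shift); the shift is an arbitrary analyst choice."""
    return sum(math.log(v + shift) for v in values) / len(values)


mean_shift_1 = mean_log_shifted(activity, 1.0)
mean_shift_half = mean_log_shifted(activity, 0.5)

print(n_zero, log_of_zero_defined)
print(mean_shift_1 - mean_shift_half)  # nonzero: the summary depends on the shift
```

The difference comes almost entirely from the zero observations, which is one motivation for modeling them separately from the positive values.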

Correlated Observations

For subjects residing in the same catchment area, correlations among physical activity levels are expected due to similar socioeconomic status and environmental conditions. In longitudinal studies, repeated physical activity measures are recorded for persons in the cohort; these (within-subject) observations are also likely to exhibit high correlation. When the independence assumption is violated, application of standard regression analysis may result in spurious associations and misleading inferences, and adjustment for clustering effects is necessary.12
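To give a sense of the size of the problem, the sketch below uses the standard survey-sampling design-effect approximation, 1 + (n - 1)ρ for clusters of equal size n and intracluster correlation ρ. This approximation is not part of the article's method, and the cluster size and correlation are hypothetical; it is shown only to illustrate how a small correlation can noticeably understate a naive standard error.

```python
import math


def design_effect(cluster_size, icc):
    """Approximate variance inflation of a mean under equal-sized clusters."""
    return 1.0 + (cluster_size - 1) * icc


# Hypothetical values: 50 persons per cluster, intracluster correlation 0.03.
deff = design_effect(50, 0.03)

# Factor by which a standard error that ignores clustering is too small.
se_inflation = math.sqrt(deff)

print(round(deff, 2), round(se_inflation, 2))  # 2.47 1.57
```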

Two common objectives of epidemiologic studies are (1) to estimate the prevalence of physical activity and factors affecting participation and (2) to assess the effects of variables on level of physical activity. A 2-part generalized-estimating-equation (GEE) modeling approach is constructed below to achieve both purposes. The GEE method adjusts for the inherent correlation of the observations, and provides robust standard errors for regression coefficients.18

Suppose that a subject engages in a certain level of physical activity with probability π. Let the binary variable Zij = 1 for person j within cluster i whose physical activity is >0, and Zij = 0 for those who are completely sedentary. Let πij = π(xij, α) = P(Zij = 1|xij) for given covariates xij and associated parameters α. Sedentary persons contribute a likelihood factor of 1 − πij, whereas those who engage in physical activity contribute a likelihood factor πij f(yij, wij, β), where f represents the probability density of the nonzero physical activity levels Yij (j = 1, 2, . . ., ni; i = 1, 2, . . ., m) with covariates wij and parameters β. The likelihood function L(α, β) factorizes into a probability component which concerns only the aspect of participation,

$$L_1(\alpha) = \prod_{i=1}^{m} \prod_{j=1}^{n_i} \pi_{ij}^{z_{ij}} (1 - \pi_{ij})^{1 - z_{ij}},$$

and a conditional component which concerns realization of the physical activity event,

$$L_2(\beta) = \prod_{i=1}^{m} \prod_{j:\, z_{ij} = 1} f(y_{ij}, w_{ij}, \beta).$$

Hence, the aforementioned objectives can be treated completely separately. Such a 2-part conditional approach is in principle analogous to the hurdle model used for modeling discrete counts with extra zeros.19–21
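The separation of the two components can be checked numerically. The following minimal sketch assumes independent observations (the GEE step relaxes this) and a gamma density parameterized by its mean and a shape parameter; all data values, probabilities, means, and function names are hypothetical. It confirms that the joint log-likelihood equals the sum of a participation part and a conditional part.

```python
import math


def gamma_logpdf(y, mu, shape):
    """Log density of a gamma distribution with mean mu and shape parameter shape."""
    rate = shape / mu
    return (shape * math.log(rate) + (shape - 1) * math.log(y)
            - rate * y - math.lgamma(shape))


def part1_loglik(z, pi):
    """Participation component: depends on the probabilities pi only."""
    return sum(math.log(p) if zi == 1 else math.log(1 - p) for zi, p in zip(z, pi))


def part2_loglik(y, mu, shape):
    """Conditional component: gamma log-likelihood over the nonzero outcomes only."""
    return sum(gamma_logpdf(yi, mi, shape) for yi, mi in zip(y, mu) if yi > 0)


def joint_loglik(y, pi, mu, shape):
    """Joint log-likelihood accumulated observation by observation."""
    total = 0.0
    for yi, p, mi in zip(y, pi, mu):
        if yi == 0:
            total += math.log(1 - p)                       # sedentary person
        else:
            total += math.log(p) + gamma_logpdf(yi, mi, shape)  # active person
    return total


# Hypothetical data and parameter values.
y = [0.0, 0.0, 198.0, 693.0, 1386.0]
z = [1 if yi > 0 else 0 for yi in y]
pi = [0.70, 0.80, 0.80, 0.85, 0.90]
mu = [900.0, 1000.0, 1000.0, 1100.0, 1200.0]
shape = 0.8

separate = part1_loglik(z, pi) + part2_loglik(y, mu, shape)
joint = joint_loglik(y, pi, mu, shape)
print(abs(separate - joint) < 1e-9)  # True
```

Because the first part involves only the participation probabilities and the second only the gamma parameters, each can be maximized (or, in the GEE setting, solved) without reference to the other.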

A natural way to relate π with covariate vector x is via the logistic regression model:

$$\operatorname{logit}(\pi_{ij}) = \log\left(\frac{\pi_{ij}}{1 - \pi_{ij}}\right) = x_{ij}^{\top} \alpha,$$

whereas the gamma distribution accommodates different degrees of skewness of the underlying physical activity variable through a scale parameter v. The gamma regression model relates μij = E(Yij) to the covariates wij by a link function:

$$g(\mu_{ij}) = w_{ij}^{\top} \beta.$$

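The two link functions can be illustrated by mapping linear predictors back to the natural scale. In this sketch the coefficient values and covariate vectors are hypothetical, and a log link is assumed for the gamma part. With a log link, the exponential of a coefficient can be read as a ratio of mean activity levels between two groups, which is a standard property of the log link rather than a result from this article.

```python
import math


def linear_predictor(coefs, covariates):
    """Inner product of a coefficient vector and a covariate vector."""
    return sum(c * v for c, v in zip(coefs, covariates))


def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients: intercept plus one binary covariate.
alpha = [1.2, -0.3]   # logistic part
beta = [6.5, 0.25]    # gamma part, log link assumed

x = [1.0, 1.0]        # covariates for the participation model
w = [1.0, 1.0]        # covariates for the activity-level model

pi_hat = inv_logit(linear_predictor(alpha, x))   # probability of any activity
mu_hat = math.exp(linear_predictor(beta, w))     # mean activity among the active
mean_ratio = math.exp(beta[1])                   # multiplicative effect of the covariate

print(round(pi_hat, 3), round(mu_hat, 1), round(mean_ratio, 3))
```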
To accommodate the correlation among observations within clusters, the parameters α and β are estimated using the GEE approach instead of the likelihood-based method. For the gamma part, let Ri(θ) be an ni × ni working correlation matrix. A common form for clustered data is the exchangeable correlation structure with θ = Corr(Yij, Yik), j ≠ k. To allow more flexibility, a different θ is prespecified in the logistic part. Model fitting for the 2 parts can be readily implemented in the SAS statistical package; see Appendix for instructions.
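For concreteness, an exchangeable working correlation matrix has ones on the diagonal and the same value θ everywhere else. The sketch below builds such a matrix for a cluster of a given size; the function name and the example values are hypothetical, and the range check reflects the usual condition for the matrix to be positive definite.

```python
def exchangeable_corr(n, theta):
    """Return an n x n exchangeable working correlation matrix as nested lists."""
    if n < 1:
        raise ValueError("cluster size must be at least 1")
    lower = -1.0 / (n - 1) if n > 1 else -1.0
    if not lower < theta < 1.0:
        raise ValueError("theta outside the range that keeps the matrix positive definite")
    return [[1.0 if j == k else theta for k in range(n)] for j in range(n)]


# Hypothetical cluster of 3 persons with common correlation 0.03.
R = exchangeable_corr(3, 0.03)
for row in R:
    print(row)
```

A single parameter therefore describes all within-cluster pairs, regardless of cluster size.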

From a public health perspective, measurement of physical activity levels allows identification of at-risk population subgroups and development of tailored interventions.14,22 A community-based study was conducted in Japan in 2006.23 A total of 355 men and 220 women aged 55-75 years were recruited from 10 prefectures. Information on physical activity levels was obtained by administering the International Physical Activity Questionnaire.13 Total physical activity was defined as the sum of walking, moderate, and vigorous activities. Overall, 114 (20%) of study participants were completely sedentary (physical activity = 0). The median (total) physical activity was 693 (interquartile range = 1308) MET min/wk, which is below the Japanese government recommendation of 1380 MET min/wk.24 Figure 1 shows that the distribution of the amount of physical activity was extremely skewed for those participants with any physical activity (n = 461).



The available covariates were body mass index (BMI, kg/m²), sex (male, female), age group (55-59, 60-64, 65-69, 70-75 years), education (high school or below, college/university), marital status (single/divorced, married), employment status (unemployed, employed), smoking (nonsmoker, current smoker), and presence of comorbidity (no, yes). The majority of participants were men (62%), with mean age 63.2 years (SD = 6.4) and mean BMI 23.3 (SD = 3.1). Most (79%) were married and had at least a high school education (62%). The prevalence of smoking was 18%, 38% were not working, and half (49%) had an adverse health condition.

The logistic GEE model was first fitted to the binary physical activity data Zij clustered within prefectures. The Table presents the regression results. None of the covariates was found to affect π, and the estimated exchangeable correlation coefficient θ was 0.03. The gamma GEE model was next fitted to the Yij > 0 via a log-link function g(μij) = log(μij). The estimated scale parameter v was 0.782, and the estimated θ was 0.014. Education level, employment status, and age group were factors associated with physical activity. Specifically, those aged 60-64 years had substantially higher physical activity levels than the reference group aged 55-59 years, whereas participants who were unemployed and participants who were college/university graduates tended to have relatively lower physical activity levels. However, education level and employment status had little association with log-transformed physical activity under the standard linear regression model, and the effects of sex, marital status, and smoking went in opposite directions in the 2 models. This suggests that logarithmic transformation alone is insufficient for handling such highly skewed and heterogeneous data. Finally, to assess the adequacy of the fitted gamma GEE model, the corresponding Pearson residuals were plotted against the fitted values (Fig. 2), with no evidence of systematic departure.
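The residual check can be sketched as follows. For a gamma model the variance function is V(μ) = μ², so the unscaled Pearson residual is (y − μ)/μ. The observed and fitted values below are hypothetical, and statistical packages may additionally divide by the square root of an estimated dispersion, so this is an illustration rather than a reproduction of the reported analysis.

```python
import math


def gamma_pearson_residual(y, mu):
    """Unscaled Pearson residual (y - mu) / sqrt(V(mu)) with V(mu) = mu**2."""
    return (y - mu) / math.sqrt(mu ** 2)


# Hypothetical observed and fitted MET-min/wk values for four active participants.
observed = [198.0, 693.0, 1500.0, 4158.0]
fitted = [400.0, 700.0, 1000.0, 2000.0]

residuals = [gamma_pearson_residual(y, m) for y, m in zip(observed, fitted)]

# Pairs of (fitted value, residual) are what would be plotted to look for systematic patterns.
for m, r in zip(fitted, residuals):
    print(m, round(r, 3))
```

Dividing by the fitted mean puts residuals from participants with low and high expected activity on a comparable scale, which is why a plot without trend or funnel shape suggests the variance function is adequate.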







Successful interventions to increase physical activity levels require an understanding of the associated factors.22,25 Appropriate physical activity guidelines should be developed specifically for older adults to maintain their health and well-being.23 Even though our study did not comprehensively evaluate demographic and lifestyle factors, a few population subgroups (for example, the unemployed, and also those with university education) were highlighted for attention based on the gamma GEE model.

An appealing feature of the GEE approach is its robustness to misspecification of the within-cluster/subject correlation. An inverse Gaussian distribution can be assumed as an alternative to the gamma distribution to tackle the highly skewed physical activity observations. The 2-part conditional approach provides a practical and statistically valid method for analyzing heterogeneous and clustered physical activity data. Fitting of the models can be readily accomplished using common software programs. Another advantage is that the 2 sets of parameters are estimated separately and therefore not susceptible to the identifiability problem that arises when the same covariates are used in both logistic and gamma parts (as in zero-inflated Poisson or mixture regression models26).
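The practical difference between the gamma and inverse Gaussian alternatives lies in their variance functions: V(μ) = μ² for the gamma and V(μ) = μ³ for the inverse Gaussian, both standard results for these distributions. The short sketch below, with hypothetical mean values, shows how much faster the inverse Gaussian variance grows when the mean doubles.

```python
def gamma_variance(mu):
    """Variance function of the gamma distribution (up to a dispersion factor)."""
    return mu ** 2


def inverse_gaussian_variance(mu):
    """Variance function of the inverse Gaussian distribution (up to a dispersion factor)."""
    return mu ** 3


# Hypothetical means: doubling the mean from 500 to 1000 MET-min/wk.
low, high = 500.0, 1000.0
gamma_growth = gamma_variance(high) / gamma_variance(low)
ig_growth = inverse_gaussian_variance(high) / inverse_gaussian_variance(low)

print(gamma_growth, ig_growth)  # 4.0 8.0
```

An inverse Gaussian specification may therefore suit data whose spread increases even more steeply with the mean than the gamma model allows.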

Appendix: SAS Instructions for 2-Part GEE Model Fitting

PROC GENMOD is invoked to fit a logistic GEE model for binary outcome bipa in part 1 and a gamma GEE model for continuous response totalmet in part 2.

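data PA;
   set PA;
   /* bipa = 1 if any physical activity was reported, 0 if completely sedentary */
   if totalmet = 0 then bipa = 0;
   else if totalmet NE 0 then bipa = 1;   /* note: a missing totalmet also satisfies NE 0 */
run;

/* Part 1: logistic GEE for participation, clustered within prefecture */
proc genmod data=PA descending;
   class catage prefecture;
   model bipa = gender catage bmi /* . . . further covariates elided in the source */ / dist=bin;
   repeated subject=prefecture / type=exch;
run;

/* Keep only participants with nonzero physical activity */
data GPA;
   set PA;
   where (bipa NE 0);
run;

/* Part 2: gamma GEE for the nonzero physical activity levels */
proc genmod data=GPA;
   class catage prefecture;
   /* link=log matches the log link described in the text; */
   /* the GENMOD default link for dist=gamma is the reciprocal */
   model totalmet = gender catage bmi /* . . . further covariates elided in the source */ / dist=gamma link=log;
   repeated subject=prefecture / type=exch;
run;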

References

1. O'Brien Cousins S. Grounding theory in self-referent thinking: conceptualizing motivation for older adult physical activity. Psychol Sport Exerc. 2003;4:81–100.
2. Adams TB, Bezner JR, Whistler LS. The relationship between physical activity and indicators of perceived wellness. Am J Health Stud. 1999;15:130–138.
3. Rejeski WJ, Mihalko SL. Physical activity and quality of life in older adults. J Gerontol A Biol Sci Med Sci. 2001;56:23–35.
4. Kritz-Silverstein D, Barrett-Connor E, Corbeau C. Cross-sectional and prospective study of exercise and depressed mood in the elderly: the Rancho Bernardo study. Am J Epidemiol. 2001;153:596–603.
5. World Health Organization. Annual global move for health initiative: a concept paper. 2003. Available at: Accessed September 8, 2009.
6. Pate RR, Pratt M, Blair SN, et al. Physical activity and public health. A recommendation from the Centers for Disease Control and Prevention and the American College of Sports Medicine. JAMA. 1995;273:402–407.
7. Ekelund U, Sepp H, Brage S, et al. Criterion-related validity of the last 7-day, short form of the International Physical Activity Questionnaire in Swedish adults. Public Health Nutr. 2006;9:258–265.
8. Ainsworth BE, Haskell WL, Whitt MC, et al. Compendium of physical activities: an update of activity codes and MET intensities. Med Sci Sports Exerc. 2000;32:S498–S504.
9. Rzewnicki R, Vanden Auweele Y, De Bourdeaudhuij I. Addressing overreporting on the International Physical Activity Questionnaire (IPAQ) telephone survey with a population sample. Public Health Nutr. 2003;6:299–305.
10. Jurj AL, Wen W, Gao YT, et al. Patterns and correlates of physical activity: a cross-sectional study in urban Chinese women. BMC Public Health. 2007;7:213.
11. Harrison S, Hayes SC, Newman B. Level of physical activity and characteristics associated with change following breast cancer diagnosis and treatment. Psychooncology. 2009;18:387–394.
12. International Physical Activity Questionnaire. Guidelines for data processing and analysis of the international physical activity questionnaire (IPAQ). November 2005. Available at: Accessed September 25, 2009.
13. Allman-Farinelli MA, Chey T, Merom D, Bowles H, Bauman AE. The effects of age, birth cohort, and survey period on leisure-time physical activity by Australian adults: 1990-2005. Br J Nutr. 2009;101:609–617.
14. Nitzan Kaluski D, Demem Mazengia G, Shimony T, Goldsmith R, Berry EM. Prevalence and determinants of physical activity and lifestyle in relation to obesity among schoolchildren in Israel. Public Health Nutr. 2009;12:774–782.
15. World Health Organization. Benefits of physical activity. 2004. Available at: Accessed October 2, 2009.
16. US Department of Health and Human Services. Healthy people 2010-reproductive health. 2001. Available at: Accessed October 1, 2009.
17. Hirayama F, Lee AH, Binns CW, Leong CC, Hiramatsu T. Physical activity of patients with chronic obstructive pulmonary disease: implications for pulmonary rehabilitation. J Cardiopulm Rehabil Prev. 2008;28:330–334.
18. Liang KY, Zeger SL. Longitudinal data analysis using generalized linear models. Biometrika. 1986;73:13–22.
19. Mullahy J. Specification and testing of some modified count data models. J Econometrics. 1986;33:341–365.
20. Welsh AH, Cunningham RB, Donnelly CF, Lindenmayer DB. Modelling the abundance of rare species: statistical models for counts with extra zeros. Ecol Model. 1996;88:297–308.
21. Yau KK, Lee AH. Zero-inflated Poisson regression with random effects to evaluate an occupational injury prevention programme. Stat Med. 2001;20:2907–2920.
22. Shibata A, Oka K, Nakamura Y, Muraoka I. Prevalence and demographic correlates of meeting the physical activity recommendation among Japanese adults. J Phys Act Health. 2009;6:24–32.
23. Hirayama F, Lee AH, Binns CW. Physical activity of adults aged 55 to 75 years in Japan. J Phys Ther Sci. 2008;20:217–220.
24. Ministry of Health, Labour and Welfare. Standard of physical activity for healthy life 2006. 2006. Available at: Accessed September 30, 2009.
25. Jancey JM, Lee AH, Howat PA, Clarke A, Wang K, Shilton T. The effectiveness of a physical activity intervention for seniors. Am J Health Promot. 2008;22:318–321.
26. Xiang L, Yau KK, Van Hui Y, Lee AH. Minimum Hellinger distance estimation for k-component Poisson mixture with random effects. Biometrics. 2008;64:508–518.
© 2010 Lippincott Williams & Wilkins, Inc.