

Applying Generalizability Theory to Estimate Habitual Activity Levels


Medicine & Science in Sports & Exercise: August 2010 - Volume 42 - Issue 8 - p 1528-1534
doi: 10.1249/MSS.0b013e3181d107c4


Physical activity is a complex behavior that has proven difficult to assess in a reliable and valid way. A variety of assessment techniques are available, but each has inherent limitations. Objective monitoring with pedometers or accelerometry-based activity monitors has become standard practice because it helps to overcome the subjective nature of self-report instruments. However, many challenges remain with processing and interpreting physical activity data (27).

The standard protocol in most activity monitoring studies has been to monitor activity patterns during a 7-d period and to use data from multiple days (4-7) to generalize to typical behavior. This protocol has been justified by well-intentioned reliability studies using these standard protocols (12-14,21,28,31). The general approach in these studies has been to compute the intraclass correlation coefficient over days (ICC = between-person variance/total variance). The ICC value is then used to estimate the number of monitoring days needed to obtain a desired level of reliability using the Spearman-Brown prophecy formula (number of monitoring days = [ICCdesired/(1 − ICCdesired)] × [(1 − ICCestimated)/ICCestimated]). An implicit assumption of this approach is that correlations among days are similar; however, there is ample evidence that this assumption does not hold. Several studies have demonstrated that considerable variation in activity exists among days within a week (7,8,25,28). Lower mean levels of activity have been reported across seasons (2,24), and considerable intraindividual variation in activity has been reported across repeated measurements during a year (15). As described by Baranowski et al. (1) in a recent review of this issue, violations of this assumption lead to overestimates of reliability and corresponding underestimates of the number of days needed to obtain a desired level of reliability using the Spearman-Brown formula. This suggests that activity levels reported in most studies may not generalize to typical activity behavior.
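The Spearman-Brown step described above can be sketched in code. This is a minimal illustration of the formula as stated in the text; the ICC values used in the example are hypothetical and not taken from any of the cited studies.

```python
def days_needed(icc_estimated: float, icc_desired: float) -> float:
    """Spearman-Brown prophecy formula: number of monitoring days needed
    to reach a desired reliability, given the ICC estimated for one day."""
    return (icc_desired / (1 - icc_desired)) * ((1 - icc_estimated) / icc_estimated)

# Hypothetical example: if a single monitoring day yields an ICC of 0.50,
# the formula projects that about 4 d are needed to reach a reliability of 0.80.
projected_days = days_needed(icc_estimated=0.50, icc_desired=0.80)
```

Note that this projection is only as trustworthy as the assumption that days are interchangeable, which is the assumption the paragraph above calls into question.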

Logistical considerations (e.g., cost, feasibility, and participant burden) impose limitations on physical activity monitoring protocols, but it is important for researchers to understand the implications of using activity measures with low relative reliability in various types of research. Imprecise measures may attenuate correlations between activity and health-related variables and also reduce power in behavioral interventions designed to promote physical activity. To advance research on physical activity epidemiology, it is important to better quantify the variability in free-living behavior so that future studies can be appropriately powered. Quantifying the variance in activity will also advance our knowledge about the biological basis of activity (10,22,23,32).

Improved designs and methods are needed to determine the reliability of free-living physical activity behavior. As Brennan (3) notes, reliability is a property of a set of scores that can be influenced by the parameters of the study design. Therefore, reliability estimates of physical activity across days will differ from reliability estimates obtained across seasons or even years. To date, most studies include days as the parameter of interest, and therefore, less is known about the reliability estimates of activity during longer durations. A more complex model (i.e., one that can handle daily variability and seasonal fluctuations) is needed to address the issue of habitual estimates of activity.

Generalizability theory (G-theory) is an extension of classical test theory and provides a systematic approach to examining reliability when more than one condition is present. In contrast to classical test theory, which suggests that an observed measure can be separated into the true score and a single undifferentiated random error term, G-theory allows researchers to partition the total variance into multiple sources of measurement error variance as well as the interactions that contribute to measurement error variance. For example, in physical activity research, reliability can be influenced by the participants, the assessment tool, the time/type of day, the number of days, and the season. G-theory consists of two types of studies: generalizability (G-study) and decision (D-study) studies. A G-study is conducted to quantify the amount of variance attributed to the facets, or conditions, included in a model. On the basis of the relative size of each variance component, multiple D-studies are conducted to make informed decisions about the conditions of a study. Two coefficients ranging from 0 to 1 are determined from each D-study. The generalizability (Eρ²) coefficient is a norm-referenced value used for relative decisions, whereas the phi (Φ) coefficient is a criterion-referenced value used for absolute decisions. In physical activity research, the generalizability coefficient could be used to determine the frequency of assessment days and/or months needed to distinguish between the most and least active individuals (relative decision). In contrast, the phi coefficient could be used to determine the frequency of assessment days and/or months needed to report compliance rates for a physical activity recommendation (absolute decision).


Reliable estimates of physical activity are needed to improve our understanding of the relationship between activity and health and also to report compliance rates for physical activity recommendations. The present study will apply G-theory to evaluate the reliability of physical activity data collected with pedometers in youth. The specific aims of this study were 1) to quantify total variability in physical activity according to (a) differences among participants, (b) inconsistency across days, (c) relative differences among seasons, and (d) the interactions among these factors and 2) to project the number of monitoring days and seasons needed to characterize long-term levels of physical activity.



Participants were from the control group of an intervention program (SWITCH) aimed at improving lifestyle behaviors (n = 648, 45% males; 2005-2006) (9). Anthropometric characteristics (age, body mass, and stature) were collected by a trained technician following standard procedures, and body mass index was calculated (BMI = kg·m−2). Physical activity was assessed with a pedometer (Digiwalker SW-200, New Lifestyles, Inc., Lee's Summit, MO) across seven consecutive days during three separate months (September, January, and May). Detailed instructions about the pedometer were provided at the onset of the study protocol. Specifically, participants were instructed to 1) wear the pedometer on the right side of their body in line with the knee during all waking hours (except while bathing/swimming), 2) record total pedometer steps and daily start and stop times on a note card, and 3) reset the pedometer to zero for the subsequent day. Pedometers and note cards were collected by a research team member after each assessment period and were redistributed nearly 4 months later. At the start of each subsequent assessment period, detailed instructions about the pedometer were provided to the participants. Forty-one participants did not have any physical activity data, and they were removed from the analysis, leaving 607 participants (299 boys and 308 girls). Criteria described by Rowe et al. (21) were used to screen the pedometer data for possible outliers. Individual days within each month were eliminated if pedometer steps were <1000 (n = 65) or >30,000 steps (n = 68). The final data set included 80 participants (42 boys and 38 girls; 13% of the total control sample) with 7 d of physical activity data across each month. Written informed consent and assent were provided by parents/guardians and minors, respectively, before participating in the original study. This study was approved by the institutional review board.

Statistical Analysis

Descriptive statistics (mean and SD) were computed for the participants' characteristics and the physical activity data. Independent t-tests were used to compare descriptive characteristics and physical activity levels between those included in the current analysis and excluded participants. The physical activity universe score for each participant was determined by taking the mean value across 21 monitoring days (7 d × 3 months).


A fully crossed, three-factor design (participant (P) × season (S) × day (D)) was used for this study because each participant had seven monitoring days from three separate months. Season, rather than month, was identified as a factor in the G-study because the variance across assessment intervals was more likely representative of seasonal variation instead of monthly variation. Season and day were considered random factors to allow for a broader generalization of the study results, and participants were considered the object of measurement. A two-way (season × day) repeated-measures ANOVA was performed to provide the mean square values for each factor and their interaction terms. Variance component estimates were calculated for each factor and all interaction terms (σ²(P), σ²(S), σ²(D), σ²(P×S), σ²(P×D), σ²(S×D), and σ²(P×S×D)) according to the procedures outlined by Brennan (3). The relative contribution (%) from each variance component was estimated by dividing individual variance components by the total variance and multiplying by 100.
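The variance-component estimation step can be sketched as follows. This is our own illustration of the expected-mean-squares algebra for a fully crossed P × S × D random design with one observation per cell, in the spirit of the procedures Brennan describes; the function name and dictionary keys are ours, not from the original analysis.

```python
def variance_components(ms, n_p, n_s, n_d):
    """Estimate G-study variance components for a fully crossed
    P x S x D random design (one observation per cell) from ANOVA
    mean squares, by inverting the expected-mean-squares equations.
    `ms` maps effect names ('p','s','d','ps','pd','sd','psd') to mean squares."""
    var = {}
    # Highest-order interaction is confounded with unexplained error (the residual).
    var['psd'] = ms['psd']
    var['ps'] = (ms['ps'] - ms['psd']) / n_d
    var['pd'] = (ms['pd'] - ms['psd']) / n_s
    var['sd'] = (ms['sd'] - ms['psd']) / n_p
    var['p'] = (ms['p'] - ms['ps'] - ms['pd'] + ms['psd']) / (n_s * n_d)
    var['s'] = (ms['s'] - ms['ps'] - ms['sd'] + ms['psd']) / (n_p * n_d)
    var['d'] = (ms['d'] - ms['pd'] - ms['sd'] + ms['psd']) / (n_p * n_s)
    return var
```

In practice, sampling variability can drive some of these estimates negative (as the authors note later); a common convention is to set such estimates to zero.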



Several D-studies were conducted using the fully crossed, random design to determine the optimal number of days (nd) and seasons (ns) needed to obtain an acceptable level of reliability (≥0.80). For each D-study, two reliability coefficients (generalizability (Eρ²) and phi (Φ)) were calculated. Similarities exist in the equations to determine each coefficient (equations 1 and 2); however, structural differences in the denominator influence their interpretation. The generalizability coefficient contains variance components interacting with the participant (relative error variance, σ²(δ)) and is used to make relative decisions about activity levels. In contrast, the phi coefficient contains all variance components except the variance attributed to the participant (absolute error variance, σ²(Δ)) and is used to make decisions about the absolute level of activity for a specific individual.
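For the fully crossed random design, the two coefficients take the standard G-theory form, with σ²(τ) = σ²(P); this is a reconstruction consistent with the definitions in this paragraph and with Brennan's notation, not a verbatim copy of the original equations:

```latex
E\rho^2 = \frac{\sigma^2(\tau)}{\sigma^2(\tau) + \sigma^2(\delta)},
\qquad
\sigma^2(\delta) = \frac{\sigma^2(P \times S)}{n_s}
                 + \frac{\sigma^2(P \times D)}{n_d}
                 + \frac{\sigma^2(P \times S \times D)}{n_s n_d}
\tag{1}

\Phi = \frac{\sigma^2(\tau)}{\sigma^2(\tau) + \sigma^2(\Delta)},
\qquad
\sigma^2(\Delta) = \sigma^2(\delta)
                 + \frac{\sigma^2(S)}{n_s}
                 + \frac{\sigma^2(D)}{n_d}
                 + \frac{\sigma^2(S \times D)}{n_s n_d}
\tag{2}
```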

Multiple D-studies were also conducted using a mixed-model design, where season was fixed and day was random. Fixing a factor restricts the universe of generalization and changes the estimation of σ²(τ), σ²(δ), and σ²(Δ) (see below); however, no changes are made to the form of the coefficient estimates in equations 1 and 2. Conceptually, "fixing" the season term implies that daily measurements would be obtained during the same season, thereby reducing the variability within the data.


For each D-study, the relative and absolute SEMs were calculated by taking the square root of the relative (σ²(δ)) and absolute (σ²(Δ)) error variance terms, respectively. Although the SEM is not unique to G-theory, it provides an indication of how much the activity level obtained by an individual would likely vary from one situation to another. Smaller SEM values indicate greater reliability in the physical activity measure. The square root of the absolute error variance represents a 68% CI for the participant's universe score. Analyses were conducted in 2009.
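The D-study computations described above can be sketched as one function. This is our own hedged implementation: the names are ours, the fixed-season branch follows the usual convention that σ²(P×S)/ns moves into universe-score variance while the season main effect is absorbed into the grand mean, and the S × D term is retained in absolute error because day remains random (an assumption on our part, not a statement of the authors' exact model).

```python
import math

def d_study(var, n_s, n_d, season_fixed=False):
    """D-study for a P x S x D design. `var` holds G-study variance
    components keyed 'p','s','d','ps','pd','sd','psd'. Returns the
    generalizability coefficient (E rho^2), phi, and both SEMs."""
    rel_err = var['pd'] / n_d + var['psd'] / (n_s * n_d)   # shared part of sigma^2(delta)
    abs_extra = var['d'] / n_d + var['sd'] / (n_s * n_d)   # extra terms in sigma^2(Delta)
    tau = var['p']                                         # sigma^2(tau)
    if season_fixed:
        tau += var['ps'] / n_s        # P x S becomes universe-score variance
    else:
        rel_err += var['ps'] / n_s    # random season: P x S stays in relative error
        abs_extra += var['s'] / n_s   # and the season main effect enters absolute error
    abs_err = rel_err + abs_extra
    return {'E_rho2': tau / (tau + rel_err),
            'phi': tau / (tau + abs_err),
            'SEM_rel': math.sqrt(rel_err),
            'SEM_abs': math.sqrt(abs_err)}
```

With any plausible set of components, fixing season raises both coefficients for a given ns and nd, which mirrors the pattern the Results section reports for Table 3.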


Descriptive results.

Table 1 compares the characteristics and physical activity data between the current study sample (n = 80) and the excluded participants. Mean values for age, height, and weight were similar between groups. Activity levels were comparable between included and excluded participants during September and January; however, participants in the current study had higher mean levels of activity in May compared with the excluded participants (P < 0.05). A detailed description of physical activity for the current study sample is reported in Table 2. In general, the mean activity level and time worn were highest on Friday (13,331 ± 3657 steps; 804 ± 92 min) and lowest on Sunday (10,134 ± 3217 steps; 712 ± 90 min). Table 2 also reveals a seasonal influence in mean activity levels (January > September > May; P < 0.05 for pairwise comparisons). The mean ± SD universe scores for activity and time worn for the entire sample were 11,730 ± 2620 steps and 750 ± 48 min, respectively. Individual universe scores for activity ranged from 5643 to 20,509 steps per day.

Comparison of descriptive characteristics and physical activity data between current study sample and excluded participants.
Mean ± SD pedometer steps and time worn data across days and months.


The purpose of the G-study was to partition and quantify the variance in physical activity. Although activity was assessed for each participant across 7 d during three separate months, the variance component estimates refer only to a single observation (3). Variance components are the fundamental building blocks of G-theory that allow investigators to quantify the consistencies and inconsistencies among repeated measures. The largest source of error in the fully crossed model was attributed to residual variance (55.64%), which includes error from the P × S × D interaction term and any other unidentified sources of error not systematically incorporated into the G-study. Approximately 19% of the variance was from the participant term, whereas smaller amounts of variance were explained by the season (6.59%) and day (2.67%) terms. The relatively low contribution from the P × D interaction term (1.04%) suggests that the rank order of activity among participants was similar regardless of the day. In contrast, the relatively higher contribution of variance from the P × S interaction term (13.95%) indicates that some variation in rank order existed among participants across seasons. The rank order of activity among days was relatively unchanged across seasons (S × D interaction term = 1.38%). In some cases, sampling variability in the study design may result in negative variance component estimates (3,5). We did not observe this phenomenon in the fully crossed G-study.


Multiple D-studies were conducted to determine the effects of various levels of seasons (ns) and days (nd) on variance components and reliability estimates. As shown in Table 3, increasing the number of seasons for a given assessment period length (i.e., 7 d) decreases the percent variance attributable to the P × S × D interaction term and increases the percent variance attributable to the participant term. Table 3 also reports reliability estimates for several conditional D-studies using a random and a mixed design. For each condition, higher reliability estimates (Eρ² and Φ) were obtained using a mixed design compared with a random design, because σ²(P × S) in a mixed design contributes to universe score variance (σ²(τ)) and not error variance (σ²(δ) or σ²(Δ)). Also, for each conditional D-study, the generalizability coefficient is larger than the phi coefficient because the denominator in the generalizability coefficient contains relative error variance (σ²(δ)), whereas the denominator in the phi coefficient contains absolute error variance (σ²(Δ)). As expected, an unreliable estimate of habitual activity was obtained from a single monitoring day using a random (Eρ² = 0.21; Φ = 0.19) or mixed (Eρ² = 0.37; Φ = 0.35) design (Fig. 1). It is also evident from Figure 1 that increasing the number of assessment days within a single, random season does not increase reliability to an acceptable level. This finding is troublesome given that most studies use a single assessment period to make decisions about the rank order of participants (relative decisions) and physical activity compliance rates (absolute decisions). Using a random design, acceptable reliability (Eρ² ≥ 0.80) was obtained using ns = 4 and nd = 12. Acceptable reliability estimates for absolute decisions were not reached using a plausible number of days across four seasons (Φ < 0.80).
Using a mixed design, where season is fixed and day is random, an acceptable level of reliability was reached using ns = 1 and nd = 7 (Eρ² ≥ 0.80) and ns = 1 and nd = 8 (Φ ≥ 0.80). A trade-off existed between the number of days and seasons needed to achieve reliability coefficients ≥0.80. For example, acceptable reliability can be achieved for relative decisions using a mixed design with ns = 1 and nd = 7 (Eρ² = 0.80) or ns = 4 and nd = 3 (Eρ² = 0.82).

Estimated variance components, reliability indicators, and calculated SEM values for the random and mixed D-studies.
Generalizability coefficients (y-axes) for several conditional D-studies using random (left) and mixed (right) designs. Random estimates using four seasons are not provided because these represent the maximal number of seasons within a year.


Relative and absolute SEM for several conditional D-studies are also shown in Table 3. Overall, smaller SEM values were obtained by either increasing the monitoring frequency (day or season) or using a mixed design for a given conditional study.


G-theory was used in this study to quantify the sources of variance in activity (i.e., pedometer steps per day) and to identify the study conditions needed to attain a desired level of reliability among youth. In contrast to previous studies that monitored activity across a series of days, the universe of admissible observations in the current analysis included days and seasons. Extending the universe provides a more robust interpretation of reliability and can likely be used to improve relative and absolute decisions in physical activity research.

Consistent with previous reports (11,13,28), poor reliability estimates of physical activity (Eρ² and Φ < 0.38) and an undesirable distribution of variance (residual error ≈ 56%; participant variance ≈ 19%) were obtained using a single monitoring day. This distribution of variance is undesirable in physical activity research because it limits any conclusions about the measured activity level. Ideally, most of the variance in activity should be attributed to the participant term because this represents true score variance. Intuitively, increasing the frequency of either condition (season or day) would increase the variance attributed to the participant term and decrease the total variance. In this study, the decision to change the frequency of conditions was motivated by the variance component estimates obtained in the G-study. The variance attributable to the season term was nearly two and a half times as large as the variance from the day term. This suggests that increasing the number of seasons would have a greater effect on the total variance (and reliability estimates) compared with increasing the number of days. Although this finding may partly be expected given the seasonal changes in activity due to the weather (4,6,30) and the time available to spend outdoors (26), it does address a critical gap in the literature regarding long-term reliability estimates of physical activity. Specifically, earlier studies were designed primarily to estimate reliability during a single measurement period and, as a result, were unable to account for the tremendous amount of intraindividual variation in activity that exists across a 1-yr period (15). The results from this study demonstrate the limitations of using a single assessment period to estimate typical levels of activity. Among this sample, both coefficients (Eρ² and Φ) begin to plateau after 20 monitoring days within a single, random season and fail to increase above 0.60.
This finding is troublesome because a single assessment period is often used to make decisions about the rank order of participants (relative decisions) and physical activity compliance rates (absolute decisions) (18-20,29). In practice, unreliable estimates of activity may contribute to the inconsistent relationships reported between activity and a variety of health outcomes among youth (16,17). The results of this study also demonstrate the advantage of using a mixed design, where one term is fixed (season) and the other term is random (day). Fixing the season term, as we did in this study, restricts the universe of generalization; however, it does reduce the burden placed on the participant and the investigator.

The results of the study provide new information about the reliability of objectively monitored physical activity in youth. Numerous studies have used 7-d monitoring protocols to capture "typical" activity behavior, and in most cases, 4-5 d of clean data are viewed as acceptable. These decisions have been justified on the basis of calculations from the Spearman-Brown prophecy formula, but G-theory provides a more robust approach for reliability determinations. For example, consider the D-study in Table 3 for ns = 1 and nd = 7 (random design), which gives Eρ² = 0.46. For an assessment period that is twice as long (i.e., ns = 1 and nd = 14), the Spearman-Brown formula predicts Eρ² = [2(0.46)/(1 + 0.46)] = 0.63. In contrast, G-theory yields a lower reliability estimate for ns = 1 and nd = 14 (Eρ² = 0.51). In G-theory, variance attributable to σ²(P × S) remains unchanged for any number of days, whereas in the above example using the Spearman-Brown formula, all relative error components (e.g., σ²(P × S), σ²(P × D), σ²(P × S × D)) are divided by the same constant. As a result, the Spearman-Brown formula predicts a larger value for reliability and a smaller relative error variance compared with the values obtained from G-theory. The observed differences can likely be attributed to the structural differences in the measurement error models underlying the Spearman-Brown formula and G-theory. G-theory can also handle data with varying variance and covariance structures.
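The structural point in this paragraph can be reproduced in miniature. The variance components below are hypothetical, chosen only to illustrate why doubling the days gains less reliability than the Spearman-Brown formula predicts: the σ²(P×S)/ns term is untouched by adding days.

```python
def e_rho2(var, n_s, n_d):
    """Generalizability coefficient for a fully crossed random P x S x D design."""
    rel_err = var['ps'] / n_s + var['pd'] / n_d + var['psd'] / (n_s * n_d)
    return var['p'] / (var['p'] + rel_err)

var = {'p': 2.0, 'ps': 1.5, 'pd': 0.1, 'psd': 6.0}   # hypothetical components
g7 = e_rho2(var, n_s=1, n_d=7)       # reliability for 7 d within one season
g14 = e_rho2(var, n_s=1, n_d=14)     # doubling the days under G-theory
sb14 = 2 * g7 / (1 + g7)             # Spearman-Brown prediction for doubling
# Spearman-Brown divides every relative error term by 2, including the
# sigma^2(PS)/n_s term that adding days cannot shrink, so sb14 > g14.
```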

Analytic error is a concern for any field-based study where participants are instructed to report daily activity levels. In this study, residual variance contributed a large proportion of the total variance (∼56%), and we were unable to determine the amount of error directly attributable from the P × S × D interaction term and that from other unidentified sources. Given the wide range in activity universe scores among participants (5643-20,509 steps per day), we contend that error from the P × S × D interaction term was the largest source of residual error. Although data are not available to validate this assumption, we argue that analytic error was minimized in the study design by providing detailed instructions to all participants, removing excessively high (>30,000 steps) and low (<1000 steps) daily pedometer values and capturing total activity across most of the monitoring day (mean pedometer wear time = 750 ± 48 min). We do acknowledge the risk of using self-reported pedometer data; however, we have no reason to believe that the amount of analytic error in the current study differed from other studies using a similar protocol. Additional sources of variance unaccounted for in the current design (e.g., pedometer type or body mass index) could have also contributed toward residual error.

The present study demonstrates the potential of using G-theory to understand the reliability of youth activity behavior. The findings are based on a relatively small sample, so additional research with larger samples is needed to provide comparison data. To date, few studies report variance component estimates, so direct comparisons are difficult. The results herein are based on an indicator of physical activity volume (daily pedometer steps); therefore, future studies should use G-theory to examine the variance from different instruments (e.g., accelerometry-based activity monitors) and for different outcome measures (e.g., moderate-to-vigorous physical activity or sedentary behavior). Investigators are encouraged to report variance component estimates to allow for comparisons across studies. In doing so, a method to determine the importance (or unimportance) of factors could be developed. G-theory allows researchers to 1) examine multiple sources of measurement error variance, 2) examine the interactions that contribute to measurement error variance, and 3) estimate reliability coefficients on the basis of different types of decisions. Quantifying variance in physical activity research is important for identifying sources of error. The systematic approach used to obtain the variance component estimates reported herein represents a novel way to examine sources of variance and a unique contribution to physical activity assessment research.

This work was not supported by a funding source. The authors have no conflicts of interest to declare. The authors thank Dr. Joe Eisenmann for insightful conversations regarding the variance in physical activity. SWITCH is a trademark of the National Institute on Media and the Family, and the SWITCH study was funded by the Medica Foundation, the Healthy and Active America Foundation, Fairview Health Services, and Cargill, Inc.

The results of the present study do not constitute endorsement by the American College of Sports Medicine.


1. Baranowski T, Mâsse LC, Ragan B, Welk G. How many days was that? We're still not sure, but we're asking the question better! Med Sci Sports Exerc. 2008;40(7 suppl):S544-9.
2. Beighle A, Alderman B, Morgan CF, Le Masurier G. Seasonality in children's pedometer-measured physical activity levels. Res Q Exerc Sport. 2008;79(2):256-60.
3. Brennan RL. Generalizability Theory. New York (NY): Springer; 2001. p. 538.
4. Chan CB, Ryan DA, Tudor-Locke C. Relationship between objective measures of physical activity and weather: a longitudinal study. Int J Behav Nutr Phys Act. 2006;3(21):1-9.
5. Cronbach LJ, Gleser GS, Nanda H, Rajaratnam N. The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. New York (NY): Wiley; 1972. p. 410.
6. Duncan JS, Hopkins WG, Schofield G, Duncan EK. Effects of weather on pedometer-determined physical activity in children. Med Sci Sports Exerc. 2008;40(8):1432-8.
7. Duncan JS, Schofield G, Duncan EK. Pedometer-determined physical activity and body composition in New Zealand children. Med Sci Sports Exerc. 2006;38(8):1402-9.
8. Duncan MJ, Al-Nakeeb Y, Woodfield L, Lyons M. Pedometer determined physical activity levels in primary school children from central England. Prev Med. 2007;44(5):416-20.
9. Eisenmann JC, Gentile DA, Welk GJ, et al. SWITCH: rationale, design, and implementation of a community, school, and family-based intervention to modify behaviors related to childhood obesity. BMC Public Health. 2008;8:223.
10. Eisenmann JC, Wickel EE. The biological basis of physical activity in children: revisited. Pediatr Exerc Sci. 2009;21(3):257-72.
11. Fairclough SJ, Butcher ZH, Stratton G. Whole-day and segmented-day physical activity variability of northwest England school children. Prev Med. 2007;44(5):421-5.
12. Gretebeck RJ, Montoye HJ. Variability of some objective measures of physical activity. Med Sci Sports Exerc. 1992;24(10):1167-72.
13. Janz KF, Witt J, Mahoney LT. The stability of children's physical activity as measured by accelerometry and self-report. Med Sci Sports Exerc. 1995;27(9):1326-32.
14. Matthews CE, Ainsworth BE, Thompson RW, Bassett DR. Sources of variance in daily physical activity levels as measured by an accelerometer. Med Sci Sports Exerc. 2002;34(8):1376-81.
15. Mattocks C, Leary S, Ness A, et al. Intraindividual variation of objectively measured physical activity in children. Med Sci Sports Exerc. 2007;39(4):622-9.
16. Must A, Tybor DJ. Physical activity and sedentary behavior: a review of longitudinal studies of weight and adiposity in youth. Int J Obes (Lond). 2005;29(2 suppl):S84-S96.
17. Parsons TJ, Power C, Logan S, Summerbell CD. Childhood predictors of adult obesity: a systematic review. Int J Obes Relat Metab Disord. 1999;23(8 suppl):S1-S107.
18. Pate RR, Freedson PS, Sallis JF, et al. Compliance with physical activity guidelines: prevalence in a population of children and youth. Ann Epidemiol. 2002;12(5):303-8.
19. Riddoch CJ, Andersen LB, Wedderkopp N, et al. Physical activity levels and patterns of 9- and 15-yr-old European children. Med Sci Sports Exerc. 2004;36(1):86-92.
20. Riddoch CJ, Mattocks C, Deere K, et al. Objective measurement of levels and patterns of physical activity. Arch Dis Child. 2007;92(11):963-9.
21. Rowe DA, Mahar MT, Raedeke TD, Lore J. Measuring physical activity in children with pedometers: reliability, reactivity, and replacement of missing data. Pediatr Exerc Sci. 2004;16(4):343-54.
22. Rowland TW. The biological basis of physical activity. Med Sci Sports Exerc. 1998;30(3):392-9.
23. Rowlands AV. Methodological approaches for investigating the biological basis for physical activity in children. Pediatr Exerc Sci. 2009;21(3):273-8.
24. Rowlands AV, Hughes DR. Variability of physical activity patterns by type of day and season in 8-10-year-old boys. Res Q Exerc Sport. 2006;77(3):391-5.
25. Rowlands AV, Pilgrim EL, Eston RG. Patterns of habitual activity across weekdays and weekend days in 9-11-year-old children. Prev Med. 2008;46(4):317-24.
26. Sallis JF, Prochaska JJ, Taylor WC. A review of correlates of physical activity of children and adolescents. Med Sci Sports Exerc. 2000;32(5):963-75.
27. Troiano RP. A timely meeting: objective measurement of physical activity. Med Sci Sports Exerc. 2005;37(11 suppl):S487-9.
28. Trost SG, Pate RR, Freedson PS, Sallis JF, Taylor WC. Using objective physical activity measures with youth: how many days of monitoring are needed? Med Sci Sports Exerc. 2000;32(2):426-31.
29. Trost SG, Pate RR, Sallis JF, et al. Age and gender differences in objectively measured physical activity in youth. Med Sci Sports Exerc. 2002;34(2):350-5.
30. Tucker P, Gilliland J. The effect of season and weather on physical activity: a systematic review. Public Health. 2007;121(12):909-22.
31. Vincent SD, Pangrazi RP. Does reactivity exist in children when measuring activity levels with pedometers? Pediatr Exerc Sci. 2002;14(1):56-63.
32. Wilkin TJ, Mallam KM, Metcalf BS, Jeffery AN, Voss LD. Variation in physical activity lies with the child, not his environment: evidence for an 'activitystat' in young children (EarlyBird 16). Int J Obes (Lond). 2006;30(7):1050-5.


© 2010 The American College of Sports Medicine