Decreases in the levels of physical activity in youth have led to intense interest among investigators in the creation and testing of interventions to promote increased physical activity in children. Accelerometry offers an objective measure that can be used to evaluate these interventions. Accelerometers record counts that represent movements summed over a specified time interval. Accelerometry counts have been calibrated against oxygen consumption and can be used to predict energy expenditure (^{16} ). In addition, counts have been calibrated against known intensity levels of activity, and thus can provide estimates of time spent performing sedentary, light, moderate, and vigorous activity (^{21} ). Coupled with these attributes, the relatively low cost of accelerometry data and its application to free-living situations have led to its use in studies evaluating physical activity interventions in children (^{1,6,12,14,18,20} ).

Where the intervention manipulates the social or physical environment, involves group processes, or cannot be delivered to individuals, these trials usually employ a group-randomized trial (GRT) design. In a GRT, identifiable social groups are randomly assigned to study conditions, and observations are made on the members of those groups (^{8} , p. 3). Unfortunately, GRT usually have less statistical power than trials in which individuals are randomized because there is extra variation among the groups beyond the variation among the members and because the degrees of freedom available to estimate that extra variation are often limited. The magnitude of the extra variation is indexed by the group-level intraclass correlation (ICC); in order to plan a trial with sufficient power, it is critical for investigators to have a good estimate of the expected ICC and to be aware of other issues in study design and analysis that can impact the power of a GRT (^{8} , pp. 6-9).

We know from previous studies that school-level ICC for a variety of variables can vary by grade, time of year, and other factors (^{10} ). In the Pathways study moderate to vigorous physical activity (MVPA) was assessed by accelerometry in third grade American Indian children with all children within a school measured in a single 24-h period; the school-level ICC for MVPA was very high at 0.2 (^{19} ). In contrast, the estimate of the school-level ICC for physical activity assessed by accelerometry was zero in a pilot study of eighth grade girls for the Trial of Activity in Adolescent Girls (TAAG) (^{9} ). In that study MVPA was measured over a 7-d period and girls within schools were measured in three different waves defined by the date on which measurements began. These large differences in the ICC estimates could be due to differences in study design, data collection or analytic methods, the ages of the children, the study populations, or other factors.

In this paper, we use accelerometry data from sixth grade girls who participated in the TAAG baseline survey to investigate the effect of selected design and analytic methods on the school-level ICC for MVPA. For both weighted and unweighted measures of MVPA, we compare the impact of day of the week, number of days of data collection, and number of waves of data collection. We hypothesized that the ICC would be larger when data from girls within schools were collected 1) on weekdays compared with weekend days, 2) over a shorter (1 d) as compared with a longer (6 d) time interval, and 3) in the same week for all participants in a school compared with different weeks. Each of these factors should serve to increase the homogeneity of activity and thus increase the ICC. Though we did not have additional a priori hypotheses, we were also interested in the ICC calculated for individual days and for the 4-d period Thursday through Sunday. Some studies might have resources only for a single day's measurements; others might have resources for several days, but not for a full week. Investigators planning such studies might benefit from having ICC estimates based on a single day or on several days. After consultation with several physical activity specialists associated with TAAG, we determined that Thursday through Sunday was the period that would likely be used for measurements if a full week was not possible, both because this period included weekend days and weekdays and because it would allow staff to fit the girls with the devices on a school day (Thursday morning) and recover them on another school day (Monday morning). Those specialists could not agree on which single day would be used, so we examined all of them. Thus, although we had no a priori expectations about the values we might obtain, we did have an a priori reason for reporting the ICC for individual days and for the period Thursday through Sunday. We demonstrate the impact of these design features on statistical power and make recommendations to enhance efficiency.

METHODS
TAAG.
TAAG is a multicenter GRT designed to test an intervention to reduce by half the age-related decline in MVPA in middle school girls (^{18} ). TAAG has six sites (at the Universities of Arizona, Maryland, Minnesota, and South Carolina; San Diego State University; and Tulane University), a coordinating center (at the University of North Carolina, Chapel Hill), and a project office at the National Heart Lung and Blood Institute. Each site recruited six schools according to standard eligibility criteria. Schools were randomly allocated within site and school district to either an intervention or a control condition. Baseline measures were collected from a cross-sectional sample of sixth grade girls in each school in spring 2003, and the intervention began soon thereafter. Follow-up measures were collected in spring 2005 from a cross-sectional sample of eighth grade girls in the same schools. The intervention effect will be estimated as the mean difference in MVPA between intervention and control schools in the eighth grade, controlling for sixth grade levels. Activity was measured using an Actigraph accelerometer (Actigraph, Manufacturing Technologies, Inc., http://www.mtiactigraph.com/ ).

The TAAG baseline measurement of MVPA.
Each of the six sites obtained a complete roster of the sixth grade girls enrolled in each participating school in the winter of 2003. The coordinating center selected 60 girls at random from each school for invitation to the measurement of MVPA by accelerometry. Inclusion criteria were: 1) informed consent signed by a parent or guardian, and 2) informed assent from the girl. Exclusion criteria were: 1) medical condition for which physical activity is contraindicated, and 2) inability to complete a brief questionnaire. Girls who were ineligible or who transferred out of the participating schools before the data were collected were replaced by other girls selected at random from the school rosters.

At each school, data were collected over a period of two to five nonconsecutive weeks. Most girls were measured in the first 2 wk, and the remaining girls were measured as they could be scheduled. The goal at each school was to measure 80% of the 60 girls selected to represent their school, so TAAG staff at each site made repeat visits to the participating schools to measure as many of the selected girls as possible. As in the earlier pilot study, this measurement schedule permitted examination of week-to-week variation in physical activity due to factors such as weather, field trips, etc. We refer to the first and second weeks of data collection in each school as the first and second waves because the specific weeks involved varied from school to school; because most girls were measured in the first two waves, subsequent weeks were pooled and treated as a third wave.

Human subjects approval.
The TAAG baseline data collection protocol was approved by the institutional review board of each participating site and collaborating institution. It was also approved by the appropriate school and school district officials.

Variables of interest and their measures.
The primary outcome for TAAG is intensity-weighted minutes (i.e., MET-minutes) of total MVPA, measured across both school time and nonschool time and across the entire week. Other measures include unweighted minutes of MVPA as well as demographic and anthropometric variables (ethnicity, age, weight, height, and body mass index). All measurement staff completed a measurement training and certification program.

The measurement of MVPA required six measurement days and two school visits by TAAG staff for each girl. During the first school visit, staff confirmed eligibility, consent, and assent; obtained baseline information including date of birth, height, weight, and ethnicity; and gave instructions. Using a standardized protocol, staff measured each girl's weight twice to the nearest 0.1 kg on an electronic scale (Seca, Model 770, Hamburg, Germany). Staff measured height twice to the nearest 0.1 cm using a portable stadiometer (Shorr Height Measuring Board, Olney, MD). Each girl was asked to wear the accelerometer for six complete days, except at night while sleeping, during any activity that might get the accelerometer wet (e.g., bathing, swimming), or during any activity for which the wearing of the accelerometer was prohibited (e.g., competitive team sport activities). Girls were instructed to wear the accelerometer on their right hip, attached to a belt provided by the study and standardized across sites; girls returned the accelerometer 6 d later at the second school visit. We avoided Fridays and Mondays as starting days because absenteeism is usually higher on those days.

Data processing and analysis.
Accelerometer readings were processed using methods similar to those reported by Puyau et al. (^{13} ). Readings above 1500 counts per half minute were treated as MVPA, while readings below that threshold were ignored; we reported previously that this threshold had the optimal sensitivity and specificity for discriminating brisk walking from less vigorous activities in eighth grade girls (^{21} ). Half-minute counts were used instead of full-minute counts based on the expectation that they would be more sensitive to fluctuations in activity levels.

Occasional missing accelerometry data within a girl's 6-d record were replaced via imputation based on the expectation maximization (EM) algorithm; details are reported elsewhere (^{4} ) and we provide only a brief summary here. The monitor records the slightest motion as a nonzero count, so we judged a sustained (20 min) period of zero counts as a time when the monitor was not being worn. There was considerable between- and within-girl variation in the amount of time they wore the monitors; for example, the girls wore the devices for an average of 9.5 h·d^{−1} on weekends and 13.2 h·d^{−1} on weekdays. We judged girls to be compliant with the protocol if they wore the monitor 80% of the time available in a given block of time; blocks represented before school, during school, after school, early evening, and evening. If the girl was compliant for a block, we used the data provided; if not, we used imputation to fill in the missing data for that block, with at least one compliant day required for each girl. The result was a set of six 18-h days of data for each girl, 6 a.m. to midnight. Evaluation of our imputation procedure indicated that it provided valid results, even when data were not missing at random (^{4} ).

Counts above 1500 per half minute were converted into METs using a regression equation developed from a second pilot study for TAAG (^{16} ); the METs were summed over the 6 a.m.-midnight day to provide MET-minutes per day of MVPA, where 1 MET·min represents the metabolic equivalent of energy expended sitting at rest for one minute.

The goal of the analysis was to estimate the variability due to sites, schools (within sites), wave of data collection (within school and site), and girls (within wave, school, and site) to allow calculation of the school-level and wave-level ICC. To do so, a general linear mixed model (^{7} ) was applied to predict the intensity-weighted MVPA score and separately to predict the unweighted MVPA score. The general linear mixed model is appropriate for correlated data with normally distributed errors. The model was fit using SAS PROC MIXED, Version 8.2 (^{15} ). Site, school, wave, and girl were modeled as random effects. Because our previous study had shown that the school-level ICC was reduced with regression adjustment for ethnicity (^{9} ), we evaluated ethnicity as a fixed effect. Confidence intervals for the school-level ICC were calculated based on the F-distribution with 30 df for schools (6 sites × (6 schools/site - 1)), 72 df for wave (6 sites × 6 schools/site × (3 waves/school - 1)), and 1495 df for girls (6 sites × 6 schools/site × 3 waves/school × (14.84 girls/wave - 1)) (^{17} , pp. 245-246).
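The degrees of freedom quoted above can be checked with a few lines of arithmetic. The sketch below uses only counts reported in this paper (6 sites, 6 schools per site, 3 waves per school, and the 1603 girls with usable data reported in the Results); the variable names are illustrative.

```python
# Design counts taken from the text.
sites = 6
schools_per_site = 6
waves_per_school = 3
girls_total = 1603  # girls with usable MVPA data (see Results)

# Number of school-by-wave cells and the average number of girls in each.
cells = sites * schools_per_site * waves_per_school
girls_per_wave = girls_total / cells

# Nested degrees of freedom for each level of the design.
df_school = sites * (schools_per_site - 1)
df_wave = sites * schools_per_site * (waves_per_school - 1)
df_girl = cells * (girls_per_wave - 1)

print(df_school, df_wave, round(df_girl), round(girls_per_wave, 2))
```

The average of 14.84 is therefore girls per wave within a school (1603 girls spread over 108 school-by-wave cells), which is what the 1495 df for girls requires.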

We conducted this set of analyses separately for MET-minutes and for minutes of MVPA. We also repeated the analyses after omitting wave as a random effect to gauge the benefit of having that term in the model. We repeated these analyses separately for all days combined, all weekdays combined, both weekend days combined, Thursday through Sunday combined, and each day of the week separately to estimate ICC as they might be available to investigators who could not collect data for six consecutive days. Finally, we conducted all analyses initially for waves 1-3 combined and then for wave 1 only to gauge the benefit of spreading data collection in a school over time versus concentrating it in a single week.

We use the general notation of Murray (^{8} , pp. 131-132) in which m members (here girls) are nested within each of s subgroups (here waves), which are nested within each of g groups (here schools), so that there are ms girls per school. We use σ_{m} ^{2} , σ_{s} ^{2} , and σ_{g} ^{2} to represent member (here girl), subgroup (here wave), and group (here school) components of variance, respectively. Excluding site, which plays no role in the calculation of the school and wave ICC, the three components of variance sum to the total random variation in the dependent variable, σ_{y} ^{2} =σ_{m} ^{2} +σ_{s} ^{2} +σ_{g} ^{2} . The wave- and school-level ICC are defined as:
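Written out under the assumption that each ICC is the share of the total variation σ_{y} ^{2} carried by its own component (an assumption consistent with the text that follows, though the exact form should be checked against Murray (^{8} , pp. 131-132)), equation 1 would read:

```latex
\begin{equation}
\mathrm{ICC}_{\text{wave}} = \frac{\sigma_s^2}{\sigma_m^2 + \sigma_s^2 + \sigma_g^2},
\qquad
\mathrm{ICC}_{\text{school}} = \frac{\sigma_g^2}{\sigma_m^2 + \sigma_s^2 + \sigma_g^2}
\tag{1}
\end{equation}
```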

In the models that ignored wave, there is no estimate of σ_{s} ^{2} and thus no wave-level ICC; the school-level ICC is then calculated as shown above but omitting σ_{s} ^{2} , recognizing that the estimates for σ_{m} ^{2} and σ_{g} ^{2} will be different in models that do and do not include wave.
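A minimal sketch of this calculation follows, assuming each ICC is its variance component divided by the sum of the components in the model. The function name and the numeric variance components are hypothetical and are not taken from Tables 2 or 3; they only illustrate that the school-level ICC and its inputs change when wave is dropped.

```python
def icc_from_components(var_girl, var_school, var_wave=None):
    """Return (school_icc, wave_icc); wave_icc is None when wave is ignored."""
    if var_wave is None:
        # Model without wave: only girl and school components are estimated.
        total = var_girl + var_school
        return var_school / total, None
    # Model with wave nested in school: three components sum to the total.
    total = var_girl + var_wave + var_school
    return var_school / total, var_wave / total


# Hypothetical components (illustration only, not the TAAG estimates).
with_wave = icc_from_components(var_girl=9000.0, var_school=100.0, var_wave=150.0)
no_wave = icc_from_components(var_girl=9050.0, var_school=200.0)

print(with_wave)
print(no_wave)
```

In this made-up case the variation assigned to wave in the first model is absorbed partly by school and partly by girl in the second, so the school-level ICC is larger when wave is ignored.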

We applied these estimates of the components of variance at the school, wave, and girl levels to a hypothetical new study to examine the impact of the study design on detectable difference and power given a specified number of schools per condition. We also examined the impact of the study design on the number of schools per condition required for a specified detectable difference and level of power.

ICC: RESULTS AND DISCUSSION
Across the six sites, 1603 girls provided useable data for MVPA, representing 74.2% of those invited. The number of girls per site ranged from 232 to 291; 44.5% were white, 22.0% were African-American, 21.8% were Hispanic, 3.8% were Asian, 0.8% were Native American, and 7.1% were multiethnic. Most of the observations were collected in wave 1 (52.2%) and wave 2 (35.8%), with many fewer in wave 3 (12.0%).

Table 1 provides descriptive statistics for age, BMI, weight, minutes of MVPA, and MET-minutes of MVPA. Table 1 also presents descriptive statistics for MVPA and MET-minutes of MVPA by wave; the average and median values on both measures increased over the sequence of waves, though their variability did not. Table 2 presents the variance components and ICC for MET-minutes of MVPA as a function of the days and waves included in the analysis and whether wave was modeled or ignored. All values are adjusted for ethnicity; in all cases, the adjusted variance components and ICC were smaller than the unadjusted values, which are not shown. Table 3 presents the parallel information for minutes of MVPA. The 95% confidence bounds are provided to indicate the precision of the measurement of the ICC and should not be used as the basis for ignoring the ICC if the bound includes zero. (Standard errors for very small components of variance are not well estimated, and power for the test that ICC = 0 is often poor. The prudent course is to employ an analysis that reflects fully the nested design (^{5,8} , p. 232)).

TABLE 1: Descriptive statistics for age, BMI, weight, minutes of MVPA, and MET-minutes of MVPA.

TABLE 2: Variance component estimates and intraclass correlations for MET-minutes of MVPA as a function of the days and waves included in the analysis and whether wave was modeled or ignored. All values are adjusted for ethnicity.

TABLE 3: Variance component estimates and intraclass correlations for minutes of MVPA as a function of the days and waves included in the analysis and whether wave was modeled or ignored. All values are adjusted for ethnicity.

Several patterns are apparent from the results presented in Tables 2 and 3 . First, the patterns are the same for MET-minutes of MVPA and minutes of MVPA, with only minor discrepancies, though the absolute values differ for the two measures.

Second, the school-level ICC were much lower when estimated from a model that included wave than when estimated from a model that ignored wave. When wave is included in the model, some of the variance otherwise attributable to school is attributed instead to wave, so it is not surprising that the school-level ICC were lower when estimated from a model that included wave. As presented later, the variance of the intervention effect is defined differently in analyses that include or exclude wave so that the benefit of the smaller school-level ICC is not as great as it might first appear.

Third, consistent with hypothesis 1, the school-level ICC were uniformly lower when calculated across weekend days rather than weekdays. For example, in Table 2 , the school-level ICC when wave was ignored was 0.024 for weekdays but 0.012 for weekends. This suggests that physical activity patterns are more idiosyncratic on weekend days than on weekdays, perhaps because girls get a higher proportion of their activity in school-related group programs such as physical education classes on weekdays compared with weekends. As noted above, compliance for wearing the monitors was lower on weekends than weekdays. This difference in compliance could be responsible for some of the differences observed in the ICC, but we remind the reader that our imputation scheme generated a complete 18-h day for each girl, both for weekend days and for weekdays, so that compliance differences among the days of the week did not result in differences in the amount of data available for analysis.

Fourth, the school-level ICC were smaller when estimated from data collected on Thursday through Sunday than when collected on all 6 d, across weekdays, or across weekends. For example, in Table 2 , the school-level ICC when wave was ignored was 0.004 for Thursday through Sunday, but 0.022 for all 6 d, 0.024 for weekdays, and 0.012 for weekends. This is due in part to the inclusion of two weekend days and just two weekdays, since the weekend days had lower ICC than the weekdays. But the analyses across the weekend days resulted in higher school-level ICC than those obtained for the data collected on Thursday through Sunday, so that the inclusion of two weekend days is not the only explanation for this pattern. It may be that the factors responsible for shared variation attributable to school are different during the week and on weekends so that when those periods are combined, the net shared variation attributable to school is lower than in either period alone.

Fifth, the results with regard to hypothesis 2 were mixed. We did observe the highest ICC in the analyses of data collected on a single day; the highest school-level ICC were often associated with Wednesdays, though other weekdays sometimes had higher values. At the same time, many days had ICC that were appreciably lower than when estimated from all 6 d; for example, Sunday was associated with the lowest school-level ICC in all cases. Thus we cannot conclude that the ICC are always higher when based on a shorter time interval, and instead must conclude that the nature of the time interval plays an important role. The pattern observed among the day-specific ICC is not likely due to chance, as the value for a given day of the week represents a weighted average over many Tuesdays, or many Thursdays, etc., with at least 3 d from each field center, often measured in different months. If the observed pattern were due entirely to chance variation from day to day, we would expect these variations to average out over a series of Tuesdays over several months, reducing the variation in the day-specific ICC. Instead, we see appreciable differences among the day-specific ICC, with much lower values on some days and higher values on other days. The confidence intervals are of course wider than the CI reported for the ICC based on all days, as the day-specific ICC were based on a single day instead of 6 d. As a result, few of the day-specific values are significantly different from one another. Even so, the day-specific differences may be real, and it will be important for other investigators to look for such patterns in their data.

Sixth, consistent with hypothesis 3, the ICC based on 6 d of measurement were higher when estimated from the first wave than when estimated from all three waves combined. This comparison is made using the "all days" rows for "all waves, wave ignored" versus "wave 1, wave not applicable." For weighted and unweighted MVPA, the school-level ICC were twice as large when based on one wave as when based on three waves combined.

These patterns provide additional evidence that school-level ICC might be dependent on the schedule of data collection. Previously, we have seen evidence that school-level ICC for cigarette smoking are higher when measured in the spring than when measured in the fall, at least among junior and senior high school youth (^{10} ). Here, we see evidence that school-level ICC for physical activity are higher when measured during a single week in a school compared with measurements taken over several weeks. We see evidence that school-level ICC are higher when measured on weekdays than on weekend days. And we see considerable variation among measures taken on a single day, with the lowest values on Sundays and the highest values during the week.

We would offer a strong cautionary note against choosing a data collection schedule based only on the magnitude of the ICC or the standard error for the intervention effect and their implications for power. Our results suggest, for example, that the most efficient plan for an intervention effect in MET-minutes of MVPA would be to collect data on Thursday through Sunday over three waves and to ignore wave in the analysis. But reliability was reported at only 0.42 when based on a single day of measurement, increased to 0.75 when based on 4 d, and increased to 0.82 when based on 6 d (^{9} ); subsequent analysis suggested that reliability was only 0.62 when based on Thursday through Sunday. Certainly, investigators should recognize the value of trading power for reliability and validity, and in this case, the higher reliability and validity available given 6 d of measurements would seem to justify that approach, given adequate resources, even if that required a somewhat larger study. The estimate of the treatment effect may also depend on the days used for measurement; for example, if students are more likely to follow the intervention during weekdays than weekends, it would be important to measure as many weekdays as possible.

APPLICATION: POWER, DETECTABLE DIFFERENCE, AND SAMPLE SIZE IN A NEW TRIAL
We will demonstrate the application of these results and implications for study planning through power, detectable difference, and sample size calculations for a hypothetical new trial. We assume that the goal of the new trial will be to evaluate a 2-yr intervention among fifth and sixth graders designed to increase their MVPA. Baseline data will be collected in late fourth grade or early fifth grade, and follow-up data will be collected in late sixth grade. We can compare trials that collect data over six consecutive days per girl with trials that collect data over a long weekend or on a single day. We can compare trials that collect data over several weeks in each school with trials that collect data during a single week in each school. We will assume that these trials will be single-center trials, so that there is no need to represent site in the design or in the analysis. We will also assume that the trials will have a nested cross-sectional design, that is, independent cross-sectional surveys at baseline and follow-up.

Sample size calculation for any trial requires explication of the primary analysis plan for the trial. We will use the approach adopted for TAAG, where the analysis will be conducted in two stages as if there were no overlap among girls measured at baseline and follow-up. This two-stage approach avoids many of the complexities inherent in a single-stage mixed-model regression analysis and mimics an analysis of covariance performed on follow-up data with regression adjustment for baseline values on the primary endpoint (^{9,18} ). Importantly, our developmental work in TAAG showed that this analytic approach had better power in TAAG than the alternatives considered, including a traditional cohort design in which the same girls would be followed over time and analyzed using repeated-measures techniques (^{18} ); similar findings have been reported in another study that employed a nested cross-sectional design (^{11} ). Certainly, there will be overlap among the independent cross-sectional samples of sixth and eighth graders, but we will ignore it in the primary analysis, as the expected benefit from using it (variance reduction due to correlation over time) is quite small relative to the expected price for using it (diluted intervention effect from imputing data for girls lost to follow-up). We will assume ethnicity is included as a covariate in the first stage; we will compare models that include wave as a nested random effect with models that ignore wave.

In the first stage of the adjusted analysis, the girls' MET-minutes or minutes of MVPA would be regressed on school, time (baseline, follow-up), their interaction, wave, and ethnicity; study condition would not be included in that model.
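One way to write this first-stage model is sketched below. The subscripts and symbols are assumptions introduced for illustration (girl i, time j, wave k, school l) and are not necessarily the authors' notation: G is school, T is time, (GT) is their interaction, x_i holds the ethnicity indicators, W is wave, and e is the girl-level residual.

```latex
\begin{equation}
Y_{ijkl} = \mu + G_l + T_j + (GT)_{jl} + \mathbf{x}_i^{\top}\boldsymbol{\beta}
         + W_{k:jl} + e_{i:k:jl},
\qquad
W_{k:jl} \sim N(0, \sigma_s^2), \quad e_{i:k:jl} \sim N(0, \sigma_m^2)
\tag{2}
\end{equation}
```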

Here, all terms would be modeled as fixed effects, except for wave and girl. Of interest is the estimation of ethnicity-adjusted school means for MET-minutes of MVPA, estimated for each school at baseline and at follow-up. By performing a pooled analysis of both surveys in the first stage, we would standardize the results for the two surveys against the same reference distribution for ethnicity, here the average ethnicity distribution over time. The result of the first stage would be two adjusted mean MVPA values (baseline, follow-up) for each participating school. The version of equation 2 without wave would differ only in the omission of wave.

The second stage of analysis would be conducted on the adjusted or unadjusted means from the first stage. We would regress the follow-up school mean for MET-minutes or minutes of MVPA on condition, adjusting for the baseline school mean:
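A sketch of this second-stage model, with notation assumed for illustration: C_l is an indicator for the condition to which school l was assigned, the barred terms are the school means from the first stage, and G_{l:c} is the random effect of school l nested within condition.

```latex
\begin{equation}
\bar{Y}_{l,\text{follow-up}} = \beta_0 + \beta_1 C_l
  + \beta_2\,\bar{Y}_{l,\text{baseline}} + G_{l:c}
\tag{3}
\end{equation}
```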

Here, intercept, condition, and baseline MVPA would be fixed effects while school would be a random effect; in this model, there is no residual error beyond school. Given a proper randomization and a well-executed study, this model provides an unbiased test of the intervention effect. It also provides the statistical basis for inferences to schools like those included in the study.

Assumptions.
The means and standard deviations for MET-minutes and minutes of MVPA for sixth graders were taken from the results reported in Table 1 . The test of the intervention effect will be two-tailed, with a Type I error rate of 5%, and based on the two-stage analysis used in TAAG. To be conservative, the sample size calculations ignore any correlation over time at the level of the girl or school. The calculations assume an attrition rate of 36% over 2 yr, which is the same assumption that was made for TAAG; power will, of course, be better for studies whose investigators expect a lower attrition rate. The calculations also assume that occasional missing accelerometry data for a single girl will be replaced via imputation using the EM algorithm (^{4} ). The calculations assume that 20% of girls will refuse the accelerometry measurements at the baseline and follow-up surveys; we will not impute data for refusals based on our earlier observation that doing so is likely to have little impact in terms of bias or power (^{9} ).

Consistent with the expectations for TAAG, the sample size calculations assume that girls enrolled in the intervention schools during the 2-yr intervention will display an intervention effect of 10% more MET-minutes or minutes of MVPA than girls in the control schools. Also consistent with the expectations for TAAG, the calculations assume that girls enrolled in the intervention schools only in the second year of the intervention period will display an intervention effect of 6% more MET-minutes or minutes of MVPA. Combined with expectations for attrition and refusals, we anticipate that the average intervention effect will be 9.2% of the mean, or 13.4 MET·min of MVPA and 2.18 min of MVPA.

Methods.
The test for the intervention effect is given by the F-test for condition from the second stage of the analysis. Because that test involves a 1-df contrast, it is convenient to use the equivalent t-test for the power analysis, written generically as t = Δ/σ_{Δ} , where Δ is the intervention effect and σ_{Δ} is the standard error of the intervention effect. We calculated the approximate power using equation 9.7 from Murray (^{8} , p. 358):
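A sketch of equation 4 in the form usually given for this calculation; the layout is an assumption and should be checked against Murray (^{8} , p. 358). Power is the probability that a t-variate with the available df falls below t_{β}:

```latex
\begin{equation}
t_{\beta} = \frac{\Delta}{\sigma_{\Delta}} - t_{\alpha/2},
\qquad
\text{power} = \Pr\left(t_{df} \le t_{\beta}\right)
\tag{4}
\end{equation}
```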

The t_{α/2} and t_{β} are the critical values for the t-distribution reflecting the Type I and II error rates and the df available for the test for the intervention effect. Those df are calculated as c(g − 1) − 1, where there are c conditions and g schools per condition and there is one school-level covariate, the baseline school mean for MVPA.

We calculated the absolute detectable difference using equation 9.11 from Murray (^{8} , p. 360):
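Rearranging the power expression for Δ gives what we take to be equation 5. This is a hedged reconstruction, with t_{β} now fixed at the value corresponding to the desired power:

```latex
\begin{equation}
\Delta = \sigma_{\Delta}\left(t_{\alpha/2} + t_{\beta}\right)
\tag{5}
\end{equation}
```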

The relative detectable difference was calculated as (Δ/μ) × 100, where μ is the mean.

In our earlier paper, we provided a formula for σ_{Δ} ^{2} based on equation 9.25 from Murray (^{8} , p. 367). That formula was based on the results of the analysis of eighth grade data, where there was no evidence of any variation attributable to wave:
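A sketch of equation 6, assuming the standard variance for a difference between two condition means with g schools per condition. Here m denotes the number of girls measured per school, since this form has no wave level; that reading of the notation is an assumption.

```latex
\begin{equation}
\sigma_{\Delta}^{2} = \frac{2\left(\sigma_m^2 + m\,\sigma_g^2\right)}{m\,g}
\tag{6}
\end{equation}
```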

Equation 6 is appropriate for the estimates derived from the analytic model that ignores wave, as the only components of variance estimated were for girls and schools. However, if wave is included as a nested factor and the wave component of variance is greater than zero, equation 6 will underestimate the variance of the intervention effect, and this could lead to an undersized study. A more general form of equation 6 that would allow for a nonzero component of variance for wave is:
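A sketch of equation 7 under the same assumptions, with m girls per wave, s waves per school, and g schools per condition, so that each school contributes ms girls:

```latex
\begin{equation}
\sigma_{\Delta}^{2} =
  \frac{2\left(\sigma_m^2 + m\,\sigma_s^2 + m s\,\sigma_g^2\right)}{m\,s\,g}
\tag{7}
\end{equation}
```

Setting σ_{s} ^{2} = 0 and treating ms as the number of girls per school recovers the two-level form.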

We calculated the number of schools (g) required based on equation 9.22 from Murray (^{8} , p. 364). Working from equation 6:
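Setting Δ = σ_{Δ} (t_{α/2} + t_{β} ) and solving the two-level variance, assumed here to be σ_{Δ} ^{2} = 2(σ_{m} ^{2} + mσ_{g} ^{2} )/(mg) with m girls per school, for g gives the following sketch of equation 8:

```latex
\begin{equation}
g = \frac{2\left(\sigma_m^2 + m\,\sigma_g^2\right)\left(t_{\alpha/2} + t_{\beta}\right)^{2}}
         {m\,\Delta^{2}}
\tag{8}
\end{equation}
```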

Working from equation 7:
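The same rearrangement applied to the three-level variance, assumed here to be σ_{Δ} ^{2} = 2(σ_{m} ^{2} + mσ_{s} ^{2} + msσ_{g} ^{2} )/(msg), gives the following sketch of equation 9:

```latex
\begin{equation}
g = \frac{2\left(\sigma_m^2 + m\,\sigma_s^2 + m s\,\sigma_g^2\right)
          \left(t_{\alpha/2} + t_{\beta}\right)^{2}}
         {m\,s\,\Delta^{2}}
\tag{9}
\end{equation}
```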

The df on which the t-values depend will change as the value calculated for g changes; convergence may require two or three iterations.

Investigators may not have access to variance component estimates; instead, they may have only estimates of ICC and σ_{y} ^{2} . Equations 6 and 7 can be rewritten in terms of these parameters working from the expressions provided in equation 1.
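A sketch of that rewriting, assuming each ICC is its component divided by σ_{y} ^{2} , so that σ_{g} ^{2} = ICC_school σ_{y} ^{2} , σ_{s} ^{2} = ICC_wave σ_{y} ^{2} , and σ_{m} ^{2} is the remainder. The first line is the form without wave (m girls per school); the second is the form with wave (m girls per wave, s waves per school).

```latex
\begin{align}
\sigma_{\Delta}^{2} &=
  \frac{2\,\sigma_y^2\left[1 + (m - 1)\,\mathrm{ICC}_{\text{school}}\right]}{m\,g}
  \\
\sigma_{\Delta}^{2} &=
  \frac{2\,\sigma_y^2\left[1 + (m - 1)\,\mathrm{ICC}_{\text{wave}}
        + (m s - 1)\,\mathrm{ICC}_{\text{school}}\right]}{m\,s\,g}
\end{align}
```

The bracketed terms follow from substituting the three components into σ_{m} ^{2} + mσ_{s} ^{2} + msσ_{g} ^{2} and collecting the coefficients of each ICC.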

Example.
Consider the adjusted components of variance for MET-minutes of MVPA in sixth grade girls measured over all days in all three waves, as reported in the first row of Table 2. Assume 30 girls are to be measured in each of three waves in each of 20 schools in each condition in the new trial. Given c = 2, g = 20, and df = c(g - 1) - 1 = 37, the critical t-values are t_{α/2} = 2.026 and t_{β} = 0.851. Because we included wave in the model, equation 7 is appropriate.

The power to detect a 10% relative difference in MET-minutes of MVPA (14.5 MET·min, or 13.4 MET·min after correction for attrition and refusals) is calculated using equation 4:

This means that there would be 88% power to detect an intervention effect equal to 10% of the mean, given the assumptions described above.
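The power figure can be checked approximately with standard-library Python. This is a sketch under stated assumptions: power is taken as the probability that a t-variate with 37 df falls below Δ/σ_{Δ} − t_{α/2}, and σ_{Δ} = 4.17 is not taken from the tables but back-calculated from the 12.0 MET·min detectable difference reported in this example (12.0/(2.026 + 0.851)). The helper names are hypothetical, and the t CDF is obtained by numerical integration rather than from a statistics library.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """CDF by Simpson's rule on [0, |x|] plus symmetry (steps must be even)."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def approximate_power(delta, sigma_delta, t_alpha_2, df):
    """Probability that a t-variate falls below delta/sigma_delta - t_alpha_2."""
    return t_cdf(delta / sigma_delta - t_alpha_2, df)


# sigma_delta = 4.17 is an assumption back-calculated from the text.
power = approximate_power(delta=13.4, sigma_delta=4.17, t_alpha_2=2.026, df=37)
print(power)
```

Under these assumptions the result falls close to the 88% reported above.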

The detectable difference in MET-minutes of MVPA is calculated using equation 5:

So with 20 schools per condition and 80% power, the study would be able to detect an absolute difference between the two conditions of 12.0 MET·min, after correction for refusals and attrition, or 13.0 MET·min for students exposed to the full intervention, equal to a relative effect of 13.0/145.6 = 8.9%.

The investigators may choose to reduce the size and cost of the study by determining the number of schools required to achieve 80% power to detect a 10% relative difference, or 13.4 MET·min after correction for attrition and refusals. Using equation 9:

Rounding up, this initial calculation suggests that only 17 schools may be required per condition for 80% power for an absolute difference of 13.4 MET·min. But this calculation used t-values based on 20 schools per condition and 2(20 - 1) - 1 = 37 df, and it is important to repeat the calculation with df based on 17 schools per condition:

Because this result also rounds up to 17, the iteration process has converged and we can stop. The result indicates that at least 80% power would be available for a relative difference of 10%, or 13.4 MET·min after correction for refusals and attrition, with 17 schools per condition.
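The iteration can be sketched in standard-library Python. Several assumptions are made and should be read as such: the required g is taken as (g·σ_{Δ}^2)(t_{α/2} + t_{β})^2/Δ^2, where g·σ_{Δ}^2 does not depend on g; that product is set to 20 × 4.17^2, with σ_{Δ} = 4.17 back-calculated from the 12.0 MET·min detectable difference at 20 schools per condition and not taken from Table 2; and the t quantiles are found by bisection on a numerically integrated CDF. All function names are hypothetical.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=1000):
    """CDF by Simpson's rule on [0, |x|] plus symmetry (steps must be even)."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_ppf(p, df):
    """Quantile for p >= 0.5, found by bisection on t_cdf."""
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def schools_per_condition(g_times_var, delta, g_start=20, c=2,
                          alpha=0.05, power=0.80, max_iter=10):
    """Recompute g with df = c(g - 1) - 1 until the rounded-up value repeats."""
    g = g_start
    for _ in range(max_iter):
        df = c * (g - 1) - 1
        t_sum = t_ppf(1 - alpha / 2, df) + t_ppf(power, df)
        g_new = math.ceil(g_times_var * t_sum ** 2 / delta ** 2)
        if g_new == g:
            return g
        g = g_new
    return g


# g_times_var is an assumption: 20 schools times the back-calculated variance.
g_required = schools_per_condition(g_times_var=20 * 4.17 ** 2, delta=13.4)
print(g_required)
```

Starting from 20 schools, the sketch moves to 17 on the first pass and stays at 17 when the t-values are recomputed with 31 df, mirroring the two-step convergence described above.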

SAMPLE SIZE CALCULATIONS: RESULTS AND DISCUSSION
Tables 4 and 5 summarize the results for power and detectable difference (absolute and relative) given 20 schools per condition for a new trial separately for each set of variance components reported in Tables 2 and 3, respectively. Tables 4 and 5 also summarize the results for schools per condition given 80% power for an intervention effect equal to 10% of the mean, after correction for attrition and refusals. In this new trial, schools would be randomized to conditions, independent cross-sectional samples of girls would be measured at baseline and 2 yr later, an intervention to promote physical activity would be introduced in the intervention schools for 2 yr, and assumptions would be made as noted earlier. Each trial would be based on two conditions, and 90 girls would be measured per school, regardless of the number of schools or the schedule of those measurements. Several patterns emerge from Tables 4 and 5. First, the patterns in the two tables are generally similar in relative terms, though the absolute values in some columns are quite different.

TABLE 4: Standard errors, power, detectable differences (absolute and relative), and schools per condition for each set of variance component estimates reported in Table 2 for MET-minutes of MVPA.

TABLE 5: Standard errors, power, detectable differences (absolute and relative), and schools per condition for each set of variance component estimates reported in Table 3 for minutes of MVPA.

Second, the estimates for σ_{Δ} were often lower when estimated based on an analysis that modeled rather than ignored wave as a nested random effect. This was true in 6 of 11 analyses for MET-minutes of MVPA, and in 7 of 11 analyses for minutes of MVPA. One previous paper has reported almost uniformly higher estimates of σ_{Δ} when subgroup was modeled rather than ignored (^{11} ); that paper reported findings for alcohol, tobacco, and other drug use among adolescents. The explanation for these divergent findings lies in how much of the variability due to the subgroup (here wave) is random versus systematically associated with the group (here school). When variability is largely random, ignoring the subgroup will cause the subgroup variation to drop to the residual error where it will be well controlled by the msg in the denominator of the standard error formula; when variation is systematically associated with the group (here school), ignoring the subgroup will cause the subgroup variation to rise to the group where it will not be as well controlled because the denominator is reduced to g. In these data from TAAG, the variation attributable to the wave appears to be at least in part systematically associated with school, perhaps due to the weather, school functions, and other factors that would vary by wave within school.

Third, the estimates for σ_{Δ} were uniformly lower when calculated across weekend days rather than weekdays. This translated into substantial differences in the number of schools required per condition. This is likely due to the fact that physical education classes and other activities affect girls in groups during the week, whereas girls are more idiosyncratic in their pursuits on weekends.

Fourth, the estimates for σ_{Δ} were uniformly smaller when estimated from data collected on Thursday through Sunday than when collected on all 6 d, across weekdays, or across weekends. Here, too, this translated into substantial differences in the number of schools required per condition.

Fifth, the largest estimates for σ_{Δ} were obtained when estimated from data for a single day, always a weekday. The lowest estimate for σ_{Δ} from a single day was obtained when estimated from data for Sunday.

Sixth, the estimates for σ_{Δ} based on 6 d of measurement were higher when estimated from the first wave than when estimated from all three waves combined. Because girls were not randomized to waves or days, it is not possible to conclude based on this evidence that spreading the survey out over several weeks caused the lower standard error. As noted earlier, it may be that the girls who were surveyed in wave 1 were somehow different from those surveyed in later waves and that those differences account for this pattern. Even so, these results are consistent with the interpretation that spreading data collection out over time may help reduce school-level ICC and so reduce σ_{Δ} in a school-based GRT.

SUMMARY AND COMMENT
These results affirm the importance of developing accurate estimates of the ICC expected to apply in the primary analysis as part of the planning for any new GRT. Here we report estimates for total MVPA including school and nonschool time. The previous estimates reported for eighth grade girls were considerably smaller than for sixth grade girls, so that a much larger study would be required for a comparable intervention effect in sixth grade girls. We could not have known that without having estimates for sixth graders. As important, the ICC measured using the same protocol were appreciably different as a function of the days and waves included in the analysis and whether wave was modeled or ignored. Those differences translated into often dramatic differences in the estimates of σ_{Δ} , power, detectable difference, and sample size. The implications for cost and logistics are obvious.

These results also affirm the importance of publishing variance components and ICC for physical activity from other datasets, so that the population of available estimates can grow. It would be of interest, for example, to see whether other investigators replicate the patterns we have observed associated with the sample analyzed, whether wave was modeled or ignored, and which days were included in the analysis. Once a sufficient number of estimates is available, it may be possible to take advantage of those estimates in the analysis of existing studies and in planning new studies (^{2,3} ). In the meantime, the estimates and methods presented here, and in our earlier paper, can be used to calculate the size required for a new study.

This research was funded by grants from the National Heart, Lung, and Blood Institute (U01HL66858, U01HL66857, U01HL66845, U01HL66856, U01HL66855, U01HL66853, U01HL66852).

None of the authors has any personal or professional relationship with any company or manufacturer that might benefit from the results of this study.