Johnston, Kylie N.; Jenkins, Sue C.; Stick, Stephen M.
Cardiopulmonary exercise testing (CPET) is widely used in children with obesity, lung and heart disease to assess clinical progress and outcomes of treatment.1 Despite the frequent use of CPET, a criticism of previous studies measuring V̇O2peak in children is the failure to report the repeatability coefficients of this measure.2 A number of recent articles in the exercise and sports medicine literature3–5 have emphasized the limitations of the Pearson correlation coefficient (r) and, to a lesser extent, the intraclass correlation coefficient (ICC) for reporting repeatability in exercise physiology. In contrast, calculation of repeatability coefficients more clearly describes the level of agreement between repeated measurements. Originally described by Bland and Altman in 1986,6 this method provides 95% limits of agreement for measurement variability. Differences that fall outside these limits can be attributed to actual change over time or to an intervention, rather than to measurement variability.
The purpose of this study was to determine repeatability coefficients of physiological variables during a maximal CPET in children, in order to facilitate valid interpretation of clinical and research interventions using CPET as an outcome measure. As background to this study, a review of previous pediatric studies reporting repeatability of V̇O2peak was performed. A discussion of the statistical methods previously used and an introduction to limits of agreement analysis are provided to critique the relative value of these approaches.
Reliability: the Intraclass Correlation Coefficient
A variety of approaches to the assessment of reliability and repeatability have been applied to exercise test data in children. An early study reported a Pearson correlation coefficient of r = 0.53 for tests four to five months apart in 10-year-old boys,7 a low value related to the long between-test interval. However, Pearson correlations measure the strength of the relationship between two variables, not the agreement between them.8 Reliability is classically defined as the proportion of the total observed variance that is attributable to true variance, ie, true variance/(true variance + error variance).9 This ratio indicates how large the error variance within subjects is relative to the variance between subjects, for a given measure. The intraclass correlation coefficient (ICC) provides such an index. Six different equations have been described by Shrout and Fleiss10 for calculating the ICC, with formula (2,1) most suitable for repeated measures undertaken by a single investigator:
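$$\mathrm{ICC}(2,1) = \frac{\mathrm{BMS} - \mathrm{EMS}}{\mathrm{BMS} + (k - 1)\,\mathrm{EMS} + k\,(\mathrm{TMS} - \mathrm{EMS})/n}$$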
where BMS = between-subjects mean square, EMS = error mean square, TMS = trials mean square, k = number of trials and n = number of subjects.
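As a minimal sketch of how the Shrout and Fleiss ICC (2,1) can be obtained from raw test-retest scores, the following Python function derives BMS, TMS and EMS from a two-way analysis of variance and then applies the formula. The function name `icc_2_1` and the example values are illustrative assumptions, not data or software from this study.

```python
from statistics import mean


def icc_2_1(scores):
    """ICC (2,1) of Shrout and Fleiss from a subjects x trials table.

    scores: list of per-subject lists, each holding k trial values.
    (Hypothetical helper written for illustration.)
    """
    n = len(scores)      # number of subjects
    k = len(scores[0])   # number of trials
    grand = mean(v for row in scores for v in row)
    subject_means = [mean(row) for row in scores]
    trial_means = [mean(row[j] for row in scores) for j in range(k)]

    # Two-way ANOVA sums of squares
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in subject_means)
    ss_trials = n * sum((m - grand) ** 2 for m in trial_means)
    ss_error = ss_total - ss_subjects - ss_trials

    bms = ss_subjects / (n - 1)             # between-subjects mean square
    tms = ss_trials / (k - 1)               # trials mean square
    ems = ss_error / ((n - 1) * (k - 1))    # error mean square

    return (bms - ems) / (bms + (k - 1) * ems + k * (tms - ems) / n)


# Hypothetical V̇O2peak values (ml/kg/min): [test 1, test 2] for five children
example = [[41.0, 43.5], [48.2, 47.9], [52.6, 55.0], [38.4, 39.1], [45.0, 47.2]]
print(round(icc_2_1(example), 2))
```

Because the trials mean square enters the denominator, a systematic shift between test 1 and test 2 lowers this form of the ICC, which a Pearson correlation would not detect.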
Investigators have reported high ICCs for V̇O2peak in healthy children (r = 0.96, n = 9)11 and adolescent athletes (r = 0.97 and r = 0.88 described for groups who did [n = 7] and did not [n = 13] achieve V̇O2peak plateau criteria, respectively).12 In a study of 10 boys, a coefficient of reliability (described as between-subject variance/[between-subject variance + within-subject variance]) of 0.65 for V̇O2peak demonstrates one of the significant limitations of this approach.13 A group of subjects whose V̇O2peak values are similar will have a lower between-subject variance for a given error variance, compared with a more heterogeneous group. This will result in a lower ICC value, which does not reflect a greater degree of error but is simply a function of lower variability between subjects. Conversely, ICC values (and Pearson’s correlation coefficients) can be inflated by including subjects with a wide variability in raw scores, leading to an overestimation of reliability.14
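The dependence of the reliability ratio on sample heterogeneity can be illustrated with a short worked example. The variances below are hypothetical values chosen only for arithmetic convenience; the error variance is held constant while the between-subject (true) variance changes.

```latex
\[
\text{reliability} = \frac{\sigma^2_{\text{true}}}{\sigma^2_{\text{true}} + \sigma^2_{\text{error}}}
\]
\[
\text{Heterogeneous group: } \frac{96}{96 + 4} = 0.96
\qquad
\text{Homogeneous group: } \frac{8}{8 + 4} \approx 0.67
\]
```

The measurement error is identical in both cases, yet the coefficient falls from 0.96 to 0.67 solely because the subjects are more alike.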
Response Stability: Coefficient of Variation
As well as the reliability of exercise test measurements (as provided by the ICC), clinicians require information about the consistency of repeated responses over time. Various forms of the coefficient of variation (CV), calculated most often as the standard deviation of the differences in test-retest scores divided by the mean of the test-retest scores, have been used to assess stability of responses across repeated trials.9 Reported values for the CV of V̇O2peak include 7.5% (in 61 girls, trials six weeks apart15), 4.4% for two maximal treadmill tests performed one week apart in subjects 11 to 14 years old,16 and 3.8% in repeated tests on nine young females whose ages were not reported.17 Unlike the correlation coefficient, this approach reflects the percentage of variation from trial to trial, and is not affected by a lack of variation in raw scores. However, the CV does not account for systematic variation between repeated tests. Another limitation is that use of the CV is based on the assumption that the degree of agreement between tests depends on the magnitude of the measured value, ie, the assumption that the largest test-retest variation occurs in those individuals with the highest scores in the test, a phenomenon called heteroscedasticity.5 It would be preferable to first examine the data for the presence of heteroscedasticity by plotting the differences between tests against the mean scores before coefficients of variation are reported.5
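The two steps described above, a coefficient of variation and a prior check for heteroscedasticity, can be sketched as follows. This is an illustration under stated assumptions: the CV is taken as the standard deviation of the test-retest differences divided by the overall mean score, multiplied by 100, and heteroscedasticity is screened numerically by correlating the absolute differences with the pair means, as a stand-in for visual inspection of the plot. The function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def pooled_cv(test1, test2):
    """CV (%) = SD of test-retest differences / overall mean score x 100."""
    diffs = [b - a for a, b in zip(test1, test2)]
    overall_mean = mean(list(test1) + list(test2))
    return 100 * stdev(diffs) / overall_mean


def heteroscedasticity_r(test1, test2):
    """Pearson r between absolute differences and pair means.

    A clearly positive r suggests that test-retest differences grow
    with the size of the measurement (heteroscedasticity).
    """
    abs_diffs = [abs(b - a) for a, b in zip(test1, test2)]
    pair_means = [(a + b) / 2 for a, b in zip(test1, test2)]
    mx, my = mean(pair_means), mean(abs_diffs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(pair_means, abs_diffs))
    sxx = sum((x - mx) ** 2 for x in pair_means)
    syy = sum((y - my) ** 2 for y in abs_diffs)
    if sxx == 0 or syy == 0:
        return 0.0  # no spread, so no trend can be estimated
    return sxy / sqrt(sxx * syy)


# Hypothetical V̇O2peak values (ml/kg/min) for five children
t1 = [41.0, 48.2, 52.6, 38.4, 45.0]
t2 = [43.5, 47.9, 55.0, 39.1, 47.2]
print(f"CV = {pooled_cv(t1, t2):.1f}%")
print(f"r(|difference|, mean) = {heteroscedasticity_r(t1, t2):.2f}")
```

With samples as small as those discussed here, the correlation is only a rough screen and does not replace inspection of the plotted differences.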
Repeatability: 95% Limits of Agreement
Initially described by Bland and Altman in 1986,6 the 95% limits of agreement technique has been advocated as a means of assessing test-retest repeatability, indicating the extent to which a person’s score varies on repeated measurement and expressing this variation in actual units.5 This approach has been applied to repeatability studies of V̇O2peak18 and ratings of perceived exertion19 in healthy adults, the three-minute step test in children with cystic fibrosis,20 and the constant-load cycle test in adults with chronic obstructive pulmonary disease.21 Agreement between repeated tests is examined graphically by plotting the difference between each pair of measurements against their mean value. Bias is calculated as the mean difference between measurements in test 1 and test 2. The 95% limits of agreement are defined as the bias ± 2 standard deviations of the differences between measurements in test 1 and test 2.
Although 95% limits of agreement data allow valid interpretation of repeated exercise tests in a clinical setting or intervention study, this approach has not been reported for children who are healthy. Therefore, the purpose of this study was to apply 95% limits of agreement analysis to determine repeatability of V̇O2peak during an incremental maximal exercise test in this group. In addition, we determined repeatability coefficients for minute ventilation (V̇E) and heart rate (HR) at peak exercise in this population. To allow comparison with previous studies, ICCs and coefficients of variation have also been presented for these variables.
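The limits of agreement calculation described above can be sketched in a few lines of Python. The sign convention (test 2 minus test 1) and the example values are assumptions made for illustration; the function name is hypothetical.

```python
from statistics import mean, stdev


def limits_of_agreement(test1, test2):
    """Return bias, repeatability coefficient and 95% limits of agreement.

    Differences are taken as test 2 minus test 1 (an assumed convention),
    so a positive bias means higher values on the repeat test.
    """
    diffs = [b - a for a, b in zip(test1, test2)]
    bias = mean(diffs)                 # mean difference between tests
    repeatability = 2 * stdev(diffs)   # two SDs of the differences
    return bias, repeatability, (bias - repeatability, bias + repeatability)


# Hypothetical V̇O2peak values (ml/kg/min) for five children
t1 = [41.0, 48.2, 52.6, 38.4, 45.0]
t2 = [43.5, 47.9, 55.0, 39.1, 47.2]
bias, rc, (lower, upper) = limits_of_agreement(t1, t2)
print(f"bias = {bias:.1f}, repeatability coefficient = {rc:.1f}")
print(f"95% limits of agreement: {lower:.1f} to {upper:.1f} ml/kg/min")

# Points for the accompanying plot: x = pair mean, y = difference
plot_points = [((a + b) / 2, b - a) for a, b in zip(t1, t2)]
```

Plotting `plot_points` with horizontal lines at the bias and at each limit gives the graphical display described in the text.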
Subjects. Nine children who were healthy (six boys) aged eight to 11 years (mean 10.5 ± 1.0 years) were recruited in response to advertising within the hospital, and took part in the study. Informed consent was obtained from parents of all subjects prior to their participation, and the study was approved by the Princess Margaret Hospital for Children Ethics Committee. Height (cm), weight (kg) and pulmonary function (forced expiratory volume in one second [FEV1]) were measured prior to exercise testing. An estimate of participation in physical activity during the past year was obtained by interviewing the parents and children together, using a standardized questionnaire.22
Exercise Testing Procedures
The children performed two exercise tests three to seven days apart. All tests were conducted by the same investigator under conditions of controlled temperature (20–22°C). Exercise was performed on a treadmill (Marquette Electronics, model 2000, Milwaukee, Wisconsin) using a continuous, incremental protocol. Children were first habituated to the laboratory and allowed to practice running on the treadmill. Treadmill speed was individually selected on the initial testing occasion based on age, lung function and height. The speed ranged from 5.5 to 6 km/hr, with the aim of achieving a maximal test within eight to 12 minutes.23 On the second testing occasion for each child, the protocol was reproduced using the initial treadmill speed. Children were given standardized instructions prior to the test and encouragement was provided each minute during the test using standardized wording, with the aim of producing a maximal effort. After a stationary baseline period of one minute, children ran at the selected speed with zero inclination for a two-minute warm-up, then for one minute at a grade of 4%, after which the grade was increased by 1% each minute. Treadmill speed remained constant throughout the test. Test termination occurred at the point of voluntary exhaustion, when the child was unable to continue despite strong verbal encouragement.
Criteria for accepting the tests as maximal effort2 included: signs of intense effort (hyperpnea, facial flushing, sweating, unsteady gait); heart rate (HR) ≥ 90% of predicted maximum (200 beats/min); and respiratory exchange ratio (RER) ≥ 1.00. These three criteria were met in all tests.
A three lead electrocardiograph (Sensormedics, Yorba Linda, California) was used to monitor HR continuously and oxyhemoglobin saturation was recorded with a pulse oximeter (N-395, Nellcor Puritan Bennett Inc., California with Nellcor Oxisensor II I-20LF finger sensor). A pediatric face mask (Hans Rudolph, Kansas City) (dead space 32 ml) with harness was fitted and checked for leaks prior to test commencement. Inspired and expired gases were monitored throughout the test with measurement of breath-by-breath rate of oxygen consumption (V̇O2), carbon dioxide production (V̇CO2) and ventilation (V̇E) using Vmax29 hardware and software (Sensormedics). The system was calibrated prior to each test according to the manufacturer’s instructions. Peak oxygen uptake was obtained by averaging data over the last 30 seconds before the termination of exercise.2,24
Means and standard deviations were reported for children’s demographic details, lung function and physical activity levels at their initial visit. Repeatability of V̇O2peak, HRpeak and V̇Epeak was assessed according to the method described by Bland and Altman.6 Bias was calculated as the mean difference between measurements in test 1 and test 2. Coefficients of repeatability (two standard deviations [SDs] of the differences between tests) were calculated, as well as the range within which 95% of the differences between the two tests are expected to fall (mean difference ± repeatability coefficient). Wilcoxon signed rank tests were used to compare variables between tests because of the small sample size. A probability value of p < 0.05 was considered significant. In order to provide comparison with statistical methods used in previous studies, the ICC and coefficient of variation were calculated for each variable. ICC (2,1) was calculated according to the formula described in the background of this article.10 Data were assessed for heteroscedasticity, and the pooled coefficient of variation was calculated as previously described (the standard deviation of differences between test and retest divided by the overall mean score of the test and retest, multiplied by 100).
The children had a mean height of 142.7 ± 8.2 cm and weighed 33.7 ± 6.0 kg. Their lung function was normal (FEV1 = 101 ± 9.2% predicted25) and they participated in an average of 3.9 ± 1.2 hours per week of vigorous physical activity (range 1.6–5.2 hours). Exercise test duration was 11.4 ± 2.3 min and 12.2 ± 2.2 min in tests 1 and 2, respectively, and not significantly different between testing occasions (p = 0.06).
Raw values for V̇O2peak are given in Table 1. Mean differences between tests 1 and 2 were not significant for V̇O2peak, HRpeak or V̇Epeak. V̇O2peak scores demonstrated a bias of 1.4 ml/kg/min with repeatability coefficient of 4.4 ml/kg/min. Bias and repeatability coefficients for mean V̇O2peak, HRpeak and V̇Epeak are summarized in Table 2, along with the calculated ICC and CV for each variable. The magnitude of differences did not increase with mean score for any of these variables (ie, no heteroscedasticity detected).
Bias and repeatability coefficient ranges are illustrated with Bland and Altman plots for each variable in Figures 1 through 3. The ranges included by the coefficients of repeatability were −3.0 to +5.8 ml/kg/min for V̇O2peak, −10.2 to +5.8 beats/min for HRpeak and −8.7 to +12.9 liters/min for V̇Epeak.
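The arithmetic behind these ranges can be checked directly. The V̇O2peak limits follow from the bias and repeatability coefficient reported in the text. The bias and coefficient shown for HRpeak and V̇Epeak are back-calculated here from the reported ranges (midpoint and half-width) rather than quoted from Table 2, so they should be read as an illustrative consistency check.

```latex
\[
\text{limits} = \text{bias} \pm \text{repeatability coefficient}
\]
\[
\dot{V}O_{2\text{peak}}:\quad 1.4 - 4.4 = -3.0, \qquad 1.4 + 4.4 = 5.8 \ \text{ml/kg/min}
\]
\[
HR_{\text{peak}}:\quad \text{bias} = \frac{-10.2 + 5.8}{2} = -2.2, \qquad
\text{coefficient} = \frac{5.8 - (-10.2)}{2} = 8.0 \ \text{beats/min}
\]
\[
\dot{V}E_{\text{peak}}:\quad \text{bias} = \frac{-8.7 + 12.9}{2} = 2.1, \qquad
\text{coefficient} = \frac{12.9 - (-8.7)}{2} = 10.8 \ \text{liters/min}
\]
```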
This study has determined repeatability coefficients for V̇O2peak, HRpeak and V̇Epeak during CPET in nine children who are healthy, and demonstrated no significant difference in these variables on repeated testing.
Although previous studies of children have demonstrated strong correlations between repeated tests,7,11,12,26 this is insufficient for determining the clinical significance of changes in CPET variables.6 Based upon the results of this study, a clinically significant improvement in V̇O2peak would be indicated if the difference between the tests was greater than the mean difference plus the repeatability coefficient, ie, greater than 5.8 ml/kg/min. Similarly, a clinically significant deterioration in V̇O2peak would be indicated by a reduction of more than 3 ml/kg/min, ie, the mean difference minus the repeatability coefficient. Examination of Figure 1 shows that differences between V̇O2peak in tests 1 and 2 did not vary in any systematic way over the range of measurement, indicating the 95% limits of agreement are appropriate across the spectrum of values in this sample.6
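The decision rule described above can be written as a short function. This is a sketch only: the default limits are the V̇O2peak values reported in this study (−3.0 and +5.8 ml/kg/min), the change is assumed to be expressed as follow-up minus baseline, and the function name and wording of the returned labels are hypothetical. Other laboratories would substitute limits derived under their own test conditions.

```python
def interpret_change(baseline, follow_up, lower=-3.0, upper=5.8):
    """Classify a change in V̇O2peak (ml/kg/min) against limits of agreement.

    lower/upper default to the limits reported in this study; the change
    is follow_up minus baseline (an assumed sign convention).
    """
    change = follow_up - baseline
    if change > upper:
        return "improvement beyond measurement variability"
    if change < lower:
        return "deterioration beyond measurement variability"
    return "within 95% limits of agreement"


# Hypothetical examples
print(interpret_change(40.0, 46.5))  # change of +6.5
print(interpret_change(40.0, 44.0))  # change of +4.0
print(interpret_change(40.0, 36.0))  # change of -4.0
```

Note that the limits are asymmetric around zero because of the positive bias on repeat testing, so a gain must be larger than a loss to fall outside them.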
The data presented in this study suggest a trend toward a higher value for V̇O2peak on repeated testing (mean difference 1.4 ± 2.2 ml/kg/min). This variation in V̇O2peak of 3 ± 5% is consistent with test-retest variation of 2 ± 6% in a group of 19 untrained adults who performed incremental cycle ergometer tests on two occasions.18 In the same study, the adults who had previously trained on a cycle ergometer had less variation in their V̇O2peak (mean difference 0 ± 4%), indicating fewer effects of learning or familiarization in subjects accustomed to the type and intensity of exercise required. The children in our study practiced running on the treadmill prior to testing, but none had previously trained on a treadmill. On repeat testing, children in this study may have developed increased treadmill running efficiency (thus demonstrating a longer test duration for similar oxygen consumption), as well as familiarization with the sensations of high level exercise. The trend toward greater V̇O2peak on repeat testing should be taken into consideration when using this variable to evaluate interventions. Small changes in V̇O2peak on repeated testing (ie, within the 95% limits of agreement) may be attributed to other sources of variation including familiarization and learning, rather than the effects of an intervention.
The high ICC for V̇O2peak in this study (0.96) can be compared with values from recent studies in children (0.65 reported by Unnithan et al13 and 0.96 reported by Golden et al11). However, the interpretation of the ICC is limited by small sample size (≤ 10) in all three studies, as data with at least 25 degrees of freedom are recommended.27 The coefficient of variation for V̇O2peak in this study (5.1%) was lower than or comparable with previous reports (eg, 7.5%15). However, examination of the data in this study showed that test-retest differences did not increase with measurement size. In this instance, where the assumption of heteroscedasticity was not met, or in studies where heteroscedasticity has not been reported, coefficients of variation should be interpreted with caution.
Repeatability of exercise testing has been evaluated in this study in eight to 11 year old children who were healthy and active. However, clinical CPET and interventions to improve exercise capacity are more likely to be indicated in children who are less active with chronic conditions including asthma, diabetes and obesity. Studies in clinical samples of interest (eg, children who are overweight and obese) would more accurately define repeatability coefficients for these specific populations. While the measurements of repeatability in this study will be helpful in the estimation of sample size for clinical exercise studies in children,5 repeatability data are specific to population, exercise protocol, test equipment and time interval between tests. For these reasons it is recommended that exercise laboratories develop repeatability coefficients for their own specific test conditions and clinical samples of interest.
This study provides values for the repeatability of V̇O2peak, HRpeak and V̇Epeak in children who are healthy during maximal incremental treadmill tests. The measurement of repeatability coefficients as performed in this study defines variation consistent with repeated exercise tests in children under standardized conditions. These data are essential for the valid interpretation of repeated exercise tests performed to evaluate change in clinical status, or to determine the effect of therapeutic interventions.
The authors wish to thank Dr. Ric Roberts for his advice regarding testing protocol, Rachel Crocker for assistance with exercise testing, and the participants and their families.
1. Fahey J, Nemet D, Cooper DM. Clinical exercise testing in children. In: Weisman IM, Zeballos RJ, eds. Clinical Exercise Testing. Basel, Switzerland: Karger; 2002:282–299.
2. Armstrong N, Welsman JR. Aerobic fitness. In: Armstrong N, van Mechelen W, eds. Paediatric Exercise Science and Medicine. Oxford: Oxford University Press; 2000:65–75.
3. Atkinson G. A comparison of statistical methods for assessing measurement repeatability in ergonomics research. In: Atkinson G, Reilly T, eds. Third International Conference on Sport, Leisure and Ergonomics; 12th–14th July, 1995; Burton Manor, The Wirral, UK: E & FN Spon; 1995:219–222.
4. Lamb KL. Test-retest reliability in quantitative physical education research: a commentary. Eur Phys Ed Rev 1998;4(2):145–152.
5. Atkinson G, Nevill AM. Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Med 1998;26(4):217–238.
6. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;i:307–310.
7. Cunningham D, Van Waterschoot B, Paterson D, Lefcoe M, Sangal S. Reliability and reproducibility of maximal oxygen uptake measurement in children. Med Sci Sports Exerc 1977;9(2):104–108.
8. Bland JM, Altman DG. Comparing two methods of clinical measurement: a personal history. Int J Epidemiol 1995;24(Suppl 1):S7–S14.
9. Portney LG, Watkins MP. Chapter 26: Statistical measures of reliability. In: Foundations of Clinical Research: Applications to Practice. Norwalk, Connecticut: Appleton and Lange; 1993:505–528.
10. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull 1979;86(2):420–428.
11. Golden J, Janz KF, Clarke WR, Mahoney LT. New protocol for submaximal and peak exercise values for children and adolescents: The Muscatine study. Ped Exerc Sci 1991;3:129–140.
12. Rivera-Brown AM, Frontera WR. Achievement of plateau and reliability of VO2max in trained adolescents tested with different ergometers. Ped Exerc Sci 1998;10:164–175.
13. Unnithan V, Murray L, Timmons J, Buchanan D, Paton J. Reproducibility of cardiorespiratory measurements during submaximal and maximal running in children. Br J Sports Med 1995;29(1):66–71.
14. Bland JM, Altman DG. A note on the use of the intraclass correlation coefficient in the evaluation of agreement between two methods of measurement. Comput Biol Med 1990;20:337–340.
15. Figueroa-Colon R, Hunter GR, Mayo MS, Aldridge RA, Goran MI, Weinsier RL. Reliability of treadmill measures and criteria to determine V̇O2max in prepubertal girls. Med Sci Sports Exerc 2000;32(4):865–869.
16. Boileau R, Bonen A, Heyward V, BH M. Maximal aerobic capacity on the treadmill and bicycle ergometer of boys 11–14 years of age. J Sports Med 1977;17:153–162.
17. Jones NL, Kane JW. Quality control of exercise test measurements. Med Sci Sports Exerc 1979;11(4):368–372.
18. Bingisser R, Kaplan V, Scherer T, Russi EW, Bloch KE. Effect of training on repeatability of cardiopulmonary exercise performance in normal men and women. Med Sci Sports Exerc 1997;29(11):1499–1504.
19. Lamb KL, Eston RG, Corns D. Reliability of ratings of perceived exertion during progressive treadmill exercise. Br J Sports Med 1999;33:336–339.
20. Balfour-Lynn IM, Prasad SA, Laverty A, Whitehead BF, Dinwiddie R. A step in the right direction: assessing exercise tolerance in cystic fibrosis. Pediatr Pulmonol 1998;25:278–284.
21. van’t Hul A, Gosselink R, Kwakkel G. Constant-load cycle endurance performance: test-retest reliability and validity in patients with COPD. J Cardiopulm Rehabil 2003;23:143–150.
22. Aaron DJ, Kriska AM, Dearwater SR, Cauley JA, Metz KF, LaPorte RE. Reproducibility and validity of an epidemiologic questionnaire to assess past year physical activity in adolescents. Am J Epidemiol 1995;142(2):191–201.
23. Wasserman K, Hansen JE, Sue DY, Casaburi R, Whipp BJ. Principles of Exercise Testing and Interpretation. 3rd ed. Baltimore: Lippincott Williams & Wilkins; 1999.
24. Zeballos RJ, Weisman IM. Behind the scenes of cardiopulmonary exercise testing. Clin Chest Med 1994;15(2):193–213.
25. Knudson R, Lebowitz M, Holberg C, Burrows B. Changes in the normal maximal expiratory flow volume curve with growth and aging. Am Rev Respir Dis 1983;127:725–734.
26. Turley KR, Rogers DM, Harper KM, Kujawa KI, Wilmore JH. Maximal treadmill versus cycle ergometry testing in children: differences, reliability and variability of responses. Ped Exerc Sci 1995;7:49–60.
27. Chinn S. Repeatability and method comparison. Thorax 1991;46:454–456.
© 2005 Lippincott Williams & Wilkins, Inc.