It is important to have accurate and reproducible measures of physiological variables during maximal exercise, especially maximal oxygen intake (V̇O2max), which is considered the best indicator of cardiorespiratory fitness (16). These maximal values are often used to prescribe exercise and to describe the intensity (% V̇O2max) of exercise. Reproducibility also is an important consideration when one wants to determine the significance of changes in V̇O2max or other cardiorespiratory endurance phenotypes that might occur with endurance exercise training. That is, if trial-to-trial reliability is low and day-to-day biological variability is large, then it may be difficult to determine the true magnitude of the changes induced by training. These concerns are even greater when data are pooled from several centers using the same procedures, as there might be differences in reproducibility across these centers.
The HERITAGE Family study is a large, multicenter clinical trial studying the possible genetic bases for the variation in response to endurance exercise training of various physiological measures and risk factors for cardiovascular disease and non-insulin-dependent diabetes mellitus. This study has been described in detail in a previous publication (2). The purpose of the present study was to determine the reproducibility of several cardiovascular, respiratory, and metabolic variables obtained during maximal exercise in 390 HERITAGE Family study subjects studied at the four participating Clinical Centers. In addition, reproducibility of maximal data was determined in 55 subjects tested four times within their own Clinical Center and in a group of eight subjects tested at each of the four Clinical Centers. The reproducibility of submaximal data was reported earlier (23).
The HERITAGE subjects came from families that included the natural mother and father (age 65 yr or less) and their natural children, 17–40 yr of age. The Caucasian families had at least three offspring, whereas the African-American families were smaller. This paper describes the results from the first 390 subjects (198 men and 192 women) studied at the four Clinical Centers (Arizona State University [Indiana University since January, 1996], Laval University, University of Minnesota, and The University of Texas at Austin). Subjects were healthy and sedentary and met a number of inclusion and exclusion criteria (2). They also had passed a medical examination with a physician that included a 12-lead electrocardiogram obtained at rest and during a maximal exercise test. The study protocol had been previously approved by the Human Subjects Committee at each of the four Clinical Centers. Informed consent was obtained from each subject.
Each Clinical Center recruited five additional subjects every 6 months over three 6-month periods to participate in an intracenter quality control (ICQC) substudy during the second and third year of data collection. Data were available from 55 subjects across all four centers. In addition, eight subjects (four men and four women) were tested in all four centers (four just before the onset of data collection in the first year and four during the second year) as part of an intercenter comparison. This traveling crew quality control (TCQC) substudy was done to compare results from the same subjects across centers. With the exception of family membership, the subjects in the two substudies met all criteria for admission to the HERITAGE Family Study. The physical characteristics of subjects in the three studies are presented in Table 1.
Maximal, submaximal, and submaximal-maximal exercise tests were each performed on separate days before and after the 20-wk training program on a SensorMedics (Yorba Linda, CA) 800S cycle ergometer connected to a SensorMedics 2900 metabolic measurement cart. The electrocardiogram was used to monitor heart rate before exercise, during the last 15 s of each exercise stage, at the end of exercise, and for several minutes after the exercise tests. During each exercise stage, gas exchange variables (V̇O2, V̇CO2, V̇E, and respiratory exchange ratio (RER)) were recorded as a rolling average of three 20-s intervals. The criteria for V̇O2max were: RER > 1.1, plateau in V̇O2 (change of < 100 mL·min−1 in the last three consecutive 20-s averages), and a heart rate (HR) within 10 beats·min−1 of the maximal level predicted by age. All subjects achieved a V̇O2max by at least one of these criteria. The majority of the exercise tests were conducted at the same time of day, with at least 48 h between tests.
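The three criteria above can be sketched as a small check. This is only an illustration, not the study's software: the names `MaxTestSummary`, `criteria_met`, and `achieved_max` are hypothetical, the plateau is interpreted as a range of less than 100 mL·min−1 across the last three 20-s averages, the HR criterion is interpreted as reaching at least the predicted maximum minus 10 beats·min−1, and the age-predicted maximal HR is passed in because the text does not state which prediction formula was used.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class MaxTestSummary:
    """End-of-test values for one maximal test (hypothetical container)."""
    rer: float                        # respiratory exchange ratio at maximum
    last_vo2_ml_min: Sequence[float]  # consecutive 20-s VO2 averages, mL/min
    peak_hr: float                    # highest heart rate, beats/min
    predicted_max_hr: float           # age-predicted maximum (formula not given in the text)


def criteria_met(t: MaxTestSummary) -> Dict[str, bool]:
    """Evaluate each criterion separately (interpretations are assumptions)."""
    last3 = list(t.last_vo2_ml_min)[-3:]
    return {
        "rer": t.rer > 1.1,
        # Assumed reading: spread of the last three averages is under 100 mL/min.
        "plateau": len(last3) == 3 and (max(last3) - min(last3)) < 100,
        # Assumed reading: peak HR is no more than 10 beats/min below prediction.
        "hr": t.peak_hr >= t.predicted_max_hr - 10,
    }


def achieved_max(t: MaxTestSummary) -> bool:
    """The text counts a test as maximal if at least one criterion is met."""
    return any(criteria_met(t).values())
```

For example, a test with RER 1.15, final V̇O2 averages of 2450, 2500, and 2520 mL·min−1, and a peak HR 15 beats·min−1 below prediction would satisfy the RER and plateau criteria but not the HR criterion under these assumptions.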
For the first maximal test, subjects exercised at a power output (PO) of 50 W for 3 min, followed by increases of 25 W each 2 min until volitional exhaustion. Because a given PO requires a similar V̇O2 (mL·min−1) for all subjects, 50 W would require a higher %V̇O2max for older, smaller, or less fit persons. For these subjects, the test started at 40 W, with increases of 10–20 W each 2 min thereafter; this was done to increase the number of stages they could do before reaching their maximum, as results of this test were used to select the PO for the subsequent tests.
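The incremental protocol can be written out as a stage schedule. The sketch below is illustrative only; `stage_schedule` and its parameter names are hypothetical. For the modified protocol the text gives a 40-W start and 10–20 W steps but not the length of the first stage, so the 3-min default is an assumption there.

```python
from typing import List, Tuple


def stage_schedule(
    start_w: float,
    step_w: float,
    n_stages: int,
    first_stage_min: float = 3.0,   # 3 min at the starting PO, as in the standard protocol
    later_stage_min: float = 2.0,   # 2 min for every subsequent stage
) -> List[Tuple[float, float]]:
    """Return (stage start time in min, power output in W) for each stage."""
    if n_stages < 1:
        raise ValueError("n_stages must be at least 1")
    schedule = []
    t = 0.0
    for i in range(n_stages):
        schedule.append((t, start_w + i * step_w))
        t += first_stage_min if i == 0 else later_stage_min
    return schedule


# Standard protocol: 50 W for 3 min, then +25 W every 2 min.
standard = stage_schedule(50, 25, 4)
# Modified protocol (one possible step size within the 10-20 W range).
modified = stage_schedule(40, 15, 4)
```

The number of stages is open-ended in practice, since the test continues until volitional exhaustion; `n_stages` here only truncates the listing.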
The submaximal exercise test consisted of two stages. Subjects exercised 8–12 min at an absolute (50 W) and at a relative PO equivalent to 60% V̇O2max. This was done to get steady-state data for V̇O2, HR, and cardiac output before and after training.
The protocol for the submaximal-maximal exercise test was the same as that for the submaximal test, except that after exercising at 60% V̇O2max, subjects also exercised for 3 min at 80% V̇O2max. The resistance was then increased to the highest PO attained in the first maximal test. If subjects were able to pedal after 2 min, PO was increased each 2 min thereafter until they reached volitional fatigue.
Thus, each subject had duplicate submaximal and duplicate maximal exercise tests. The submaximal values for V̇O2, HR, and PO from the three tests were used to prescribe the various intensities for each subject’s training program. Data from the submaximal tests were not used in this study but have been reported elsewhere (23).
For the ICQC substudy, subjects were tested to maximum once using the maximal exercise test protocol and then three more times over the next 2 wk using the submaximal-maximal exercise test protocol. For the TCQC substudy, subjects were tested with the maximal exercise protocol at the Quebec center. Shortly thereafter, they were tested once at each of the four Clinical Centers with the submaximal-maximal exercise protocol; these four tests were given over a 2-wk period, with at least 3 d between tests.
The HR was determined manually from electrocardiographic tracings, and values were recorded during the last 15 s of each stage. Blood pressure (BP) was obtained during the last minute of each stage using a Colin STBP-780 automated BP unit (San Antonio, TX). Two electrocondensor microphones embedded in the cuff synchronized the sound signal to the R-wave of the ECG. Earphones allowed the technician to confirm the BP values selected by the instrument’s detection algorithm. At rest, the first Korotkoff sound was used to identify systolic BP, whereas the fifth Korotkoff sound (the point at which sound disappears) identified diastolic BP. During exercise, the fifth sound may not be heard, in which case the fourth sound (the point marked by the distinct, abrupt muffling of sound) was used.
Quality assurance, quality control, and statistical methodology.
Several important quality assurance and quality control procedures were instituted at each of the four Clinical Centers. Staff from all Clinical Centers were trained centrally before data collection began and all staff had to be certified annually on each technique for which they were responsible. A detailed manual of procedures (MOP) was developed. Every 6 months, staff reviewed those sections of the MOP for which they were responsible. More information on quality control and quality assurance can be found in two recent publications (2,7).
All data were analyzed using the SAS statistical package. Data are expressed as mean ± SD. Technical errors (TE), coefficients of variation for repeated measures (CV), and intraclass correlation coefficients (ICC) were computed to evaluate the reproducibility in both the HERITAGE sample and the ICQC sample using the model of Shrout and Fleiss (18). With this model, the ith measurement on the jth subject, xij, is given by:
$$x_{ij} = \mu + b_j + w_{ij}$$
where μ is the population mean, bj is the difference from μ of the mean of the repeated measurements of the jth subject (or true score), and wij is a residual that encompasses the inseparable effects of the rater, the rater × subject interaction, and the error term. Both bj and wij are assumed to be normally distributed and independent, with zero means and with standard deviations of στ and σω, respectively. To compute the ICC, PROC GLM in SAS was used to run a one-way repeated measures ANOVA, providing a between-subjects mean square (BMS) and a within-subjects mean square (WMS). These were used to estimate the ICC according to Shrout and Fleiss (18):
$$\mathrm{ICC} = \frac{\mathrm{BMS} - \mathrm{WMS}}{\mathrm{BMS} + (k - 1)\,\mathrm{WMS}}$$
where k is the number of replicate measurements on a subject. TE, or σω, is the within-subject standard deviation, derived as the square root of the WMS from the ANOVA. The CV within subjects was computed as:
$$\mathrm{CV} = 100 \times \frac{\sigma_\omega}{\mu}$$
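As a rough illustration of these three statistics, the following sketch computes them from a balanced table of repeated measurements. It is not the SAS code used in the study. It assumes the one-way Shrout and Fleiss estimate ICC = (BMS − WMS)/(BMS + (k − 1)·WMS), TE as the square root of WMS, and CV as 100·TE divided by the grand mean; the function name `reproducibility` and the example values are hypothetical.

```python
from math import sqrt
from typing import Dict, Sequence


def reproducibility(data: Sequence[Sequence[float]]) -> Dict[str, float]:
    """data[j][i] is the i-th repeated measurement on subject j (balanced design)."""
    n = len(data)
    k = len(data[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in data):
        raise ValueError("need >= 2 subjects, each with the same k >= 2 measurements")

    grand_mean = sum(sum(row) for row in data) / (n * k)
    subject_means = [sum(row) / k for row in data]

    # One-way ANOVA mean squares: between subjects and within subjects.
    bms = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    wms = sum(
        (x - m) ** 2 for row, m in zip(data, subject_means) for x in row
    ) / (n * (k - 1))

    te = sqrt(wms)  # technical error = within-subject SD
    return {
        "BMS": bms,
        "WMS": wms,
        "TE": te,
        "CV": 100.0 * te / grand_mean,  # assumed definition: percent of grand mean
        "ICC": (bms - wms) / (bms + (k - 1) * wms),
    }


# Made-up values for three subjects measured twice (not study data).
result = reproducibility([[10.0, 12.0], [20.0, 18.0], [30.0, 32.0]])
```

With these made-up values WMS is 2.0, so TE is about 1.41 in the units of the variable, while the ICC is about 0.98 because the three subjects differ widely from one another.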
The ICC, CV, and TE are all different measures of reproducibility. Whereas ICC and CV are unitless and are therefore relative measures, the TE is scale-dependent and is therefore an absolute measure of reproducibility. The TE would be most informative for investigators who are familiar with the scale and units of measurements of a particular variable. The relative measures (ICC and CV) are generalizable to any measure.
A multiple-testing analysis of variance (ANOVA) was implemented using the general linear models (GLM) procedure to assess whether there were any differences across the two tests for the total sample and across the four tests for the ICQC sample and across the four Clinical Centers for the TCQC sample. Tukey’s Studentized range (HSD) test was used to identify the source of significant differences between trials. The multiple-testing ANOVA controls for all potential sources of between-individual variation. Statistical significance was set at the 0.05 level.
The cardiovascular, respiratory, and metabolic responses to maximal exercise are presented in Table 2 for the total sample of 390 HERITAGE subjects across 2 d at baseline. With the exception of a drop in diastolic BP in the second test, there was little difference between the two trials in the mean values for all of the other variables. The CV were below 10% for all variables except diastolic BP, and the ICC were 0.88 to 0.95 for all variables except systolic BP, diastolic BP, and RER. Similar results were found across the four centers before and after training (data not shown).
Table 3 contains the results from the ICQC substudy on 55 subjects. There were similar results across the four trials, with the only significant difference being a lower but physiologically insignificant RER at the first maximal test. The TE, CV, and ICC values were similar to those found in the larger HERITAGE Family study sample shown in Table 2. Consistent with results from the HERITAGE sample, all ICC were 0.88 to 0.96 with the exception of those for systolic BP, diastolic BP, and RER.
Table 4 contains the data from the TCQC substudy on eight subjects. No significant differences were found across the tests done at the four centers. Again, with the exception of BP and RER, the ICC values for HR, ventilation, V̇O2, and V̇CO2 ranged from 0.87 to 0.97.
One of the purposes of the HERITAGE Family study is to determine the possible role of genetic factors in various phenotypes studied at baseline and in the different responses in these phenotypes to the same exercise training stimulus. The fact that we may need to detect fairly small changes in some variables and that data from four centers are being pooled to obtain an adequate sample size mandates that data be collected at each center in a highly reproducible manner. To be able to determine what role genetics might play, as well as to identify any relevant genes, it is also important to have accurate and reproducible measurements so that the sources of variation within and between families can be studied.
Because data obtained at maximal levels of exercise are so important for evaluating the response to training, the reproducibility of these variables was determined in three groups of subjects. A total of 390 HERITAGE subjects was tested twice at maximum before and after training in the four centers.
An additional 55 subjects (groups of 5 subjects in the ICQC substudy) were tested four times in each center over several 6-month periods to determine whether there was a drift in reproducibility over time. Change over time was considered important because there could have been greater error (less attention to detail, addition of new personnel who might have been trained differently, etc.) or less error (staff got better at taking measurements), both of which could influence the subsequent analyses and interpretation of results. More details about drift over time in the reproducibility of HERITAGE data can be found in a recent publication (4).
As well, eight subjects (two groups of 4 subjects) were tested in each of the four Clinical Centers. It could be argued that the intracenter reliability is more critical because we want to detect changes resulting from training within each Clinical Center. If there is good reproducibility, then any systematic differences among the four centers should have little impact. Nevertheless, we felt that it was also important to determine whether the data across the centers were similar.
There is a fair amount of information in the literature on the reproducibility of V̇O2max and maximal HR, but little information is available on such variables measured at maximal exercise as ventilation, RER, or BP. The reader should realize that the interpretation of the ICC in this study can be affected by the homogeneity or heterogeneity of the data. For example, maximal values for V̇O2 (and for ventilation due to differences in body size) are quite heterogeneous because of the large range in functional capacity within the three groups studied. Maximal values of HR are also heterogeneous due to its relationship with age. Because of this heterogeneity, one should expect higher ICC. On the other hand, maximal values of BP and RER can be quite similar in subjects with high and low functional capacities who are normotensive, as was the case in these studies. Therefore, the ICC of these data are not quite as reflective of reproducibility as would be the case for data on V̇O2, HR, and ventilation.
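The dependence of the ICC on between-subject spread can be made explicit. Under a one-way model with between-subject standard deviation στ and within-subject standard deviation σω, the population value estimated by the ICC is the standard ratio

$$\mathrm{ICC} = \frac{\sigma_\tau^2}{\sigma_\tau^2 + \sigma_\omega^2}$$

Using purely hypothetical values chosen for illustration (not taken from the tables), a fixed within-subject error of σω = 1 gives ICC = 9/(9 + 1) = 0.90 when στ = 3, but ICC = 1/(1 + 1) = 0.50 when στ = 1. The measurement error is identical in both cases; only the heterogeneity of the sample differs.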
The reader also should be aware that most of the r values reported from the previous literature are not ICC but are Pearson product-moment correlations. Product-moment correlations are usually not appropriate for assessing reproducibility, except when there are only two measurements.
Even when studies are available, there is a problem because most of the information was obtained from small sample sizes. For example, of the 11 publications found reporting data on V̇O2max and maximal HR on healthy subjects, five (5,6,11,17,20) had only 5–17 subjects and six (1,13,14,19,21,22) had 26–47 subjects. Five of these 11 studies also involved young students and athletes aged 18–26 yr (5,11,13,14,19), whereas six (1,6,17,20–22) had subjects aged 17–55 yr. There was one study of 16 patients aged 51–75 yr with chronic, stable congestive heart failure (CHF) who were tested to their “maximum” five or more times over a period of 3–22 months (10) and another study of peak V̇O2 in 51 patients aged 30–76 yr with mild to moderate CHF who were tested twice within 8 d (3). Thus, the present study has data on a larger and more heterogeneous sample of subjects; even the ICQC substudy had more subjects than was reported in these 13 studies.
The reproducibility of V̇O2max in data from HERITAGE (Table 2), from the ICQC substudy (Table 3) and from the TCQC substudy (Table 4) compares favorably with data available from other studies. The CV of 5.1% (HERITAGE), 4.7% (ICQC), and 4.1% (TCQC) were consistent and similar to values of 2 to 6% found in the literature (6,10,11,17). Similarly, the ICC of 0.97 (HERITAGE and TCQC) and 0.96 (ICQC) were consistent with most values presented in the literature that vary between 0.90 and 0.97 using various forms of exercise in healthy persons and CHF patients (1,3,5,13–15,17,20–22). Only the study by Taylor done in 1944 (19) reported a lower value of 0.70. Although not reported here, similar results (r = 0.97–0.99) were found before and after training across the four centers.
Maximal heart rate.
The CV of 2.9% (HERITAGE), 2.0% (ICQC), and 2.1% (TCQC) were similar to values reported in the literature that range from 1.5 to 4.0% (6,11,17) in healthy subjects and 3.9% in 16 heart failure patients (10). The ICC values of 0.88 (HERITAGE and ICQC) and 0.87 (TCQC) were also similar to reported values of 0.90 (5) and 0.91 (1) but higher than values of 0.76 (13,14), 0.81 (19,22) and 0.82 (17,21). Similar results (r = 0.82–0.92) were found before and after training across the four centers.
For maximal ventilation, the CV of 9.5% (HERITAGE), 7.3% (ICQC), and 8.5% (TCQC) were similar to the only other value found in the literature of 7.0% (17). The ICC of 0.89 (HERITAGE), 0.90 (ICQC) and 0.87 (TCQC) were similar to values of 0.88 found in two studies (1,13) but higher than 0.78 (14,22), 0.73 (21), 0.58 (5), and 0.53 (19) found in five other studies. Values of 0.87 to 0.94 were found across the four centers before and after training.
Although there is little or no difference in the mean values for RER (Tables 2–4), the ICC were 0.52 for HERITAGE, 0.55 for ICQC, and 0.74 for TCQC. Across the four centers before and after training, ICC ranged from 0.38 to 0.63. As these low coefficients were found in spite of the high reproducibility in the two variables used to calculate it (V̇O2 = 0.96, 0.97, and 0.97 and V̇CO2 = 0.94, 0.95, and 0.95 for HERITAGE, ICQC, and TCQC, respectively), it appears that RER should still be used as one criterion (e.g., RER >1.10) to evaluate whether subjects have pushed themselves to maximum. Although the smaller ICC values for RER suggest that it is not highly reproducible, the ICC may be artificially small because the variability in maximal RER is also low as the subjects had to have an RER >1.10 to be considered at maximum. Our values were similar to reported values of 0.52 (14), 0.48 (21), and 0.44 (22) but higher than 0.23 found in swimmers (13). No data were found in the literature on CV for maximal RER.
We found little published data on the reproducibility of systolic BP at maximal exercise. The only CV found was obtained on 16 patients aged 51–75 yr with chronic stable congestive heart failure (10) who exercised to exhaustion during five or more incremental treadmill tests over a period of 3–22 months. The CV of 6.7% in that study was similar to those found for the HERITAGE data (6.9%), the ICQC data (6.0%), and the TCQC data (6.8%). Relative to ICC, only one study was found. In this study (9), 156 CHD patients were tested, and the mean values were compared with those obtained 9 months later. The test-retest reliability coefficient was 0.67, a value similar to 0.75 from HERITAGE and 0.72 from ICQC but higher than 0.53 from TCQC. Data from these two studies may not be comparable because the clinical condition of the patients could have remained the same, improved, or worsened over time.
No data were found in the literature for diastolic BP during maximal exercise. The low ICC in HERITAGE (0.54), ICQC (0.52), TCQC (−0.03), and across the four centers (0.32 to 0.66) before and after training were not surprising, however, as it has been shown that the error in measuring diastolic BP becomes greater as exercise intensity increases (8). Although fourth-phase diastolic BP tends to remain steady during dynamic exercise in healthy subjects, attempting to find the point where the Korotkoff sounds disappear (fifth-phase diastolic BP) is often difficult because of the noise and movement associated with exercise (especially maximal exercise). In addition, even though the diastolic BP might not change, these sounds can often be heard all the way down to zero pressure (12), resulting in variable and inaccurate measurements. Therefore, only the fourth phase should be used to reflect diastolic BP during vigorous exercise, and the fifth-phase diastolic BP should be used at rest.
Similar results were found in the HERITAGE, ICQC, and TCQC data regarding the ICC values. For the most part, the results were consistent across the four Clinical Centers for each variable. There was no drift over time in these variables (4). There is also fair agreement from pre- to post-training and no trend for the ICC to become higher or lower after training. The data presented here also were similar to those obtained during submaximal exercise on the same subjects, as reported by Wilmore et al. (23), particularly at 60% V̇O2max.
With the exception of BP and RER, it is concluded that the day-to-day variations in responses to maximal cycle ergometer exercise were low in the HERITAGE subjects both before and after training, over several 6-month periods (ICQC) and in the same subjects tested in all four centers over a 2-wk period (TCQC). Thus, the reproducibility of responses to maximal exercise was high at each of the four centers in the HERITAGE Family Study. Reproducibility was excellent (CVs below 10% and ICC over 0.86) in those variables considered most important for evaluating the response to endurance exercise training (i.e., V̇O2max, V̇CO2max, maximal HR, and maximal ventilation). This was due in part to the careful and constant program of quality control and quality assurance implemented in the HERITAGE Family Study (7). This high level of reproducibility provides a robust foundation for investigating the role of genetic and nongenetic factors in phenotypes associated with maximal exercise.
The HERITAGE Family study is supported by the National Heart, Lung and Blood Institute through the following grants: HL45670 (C. Bouchard, PI); HL47323 (A. S. Leon, PI); HL47317 (D. C. Rao, PI); HL47327 (J. S. Skinner, PI); and HL47321 (J. H. Wilmore, PI). Credit is also given to the University of Minnesota Clinical Research Center, NIH grant MO1-RR000400. Jack H. Wilmore was partially supported by the Margie Gurley Seay Centennial Professorship, and Arthur S. Leon is partially supported by the Henry L. Taylor endowed Professorship in Exercise Science and Health Enhancement.
Thanks are expressed to all of the co-principal investigators, investigators, co-investigators, local project coordinators, research assistants, laboratory technicians, and secretaries who have contributed to this study. Finally, the HERITAGE consortium is very thankful to those hard-working families whose participation has made these data possible.
1. Aunola, S., and H. Rusko. Reproducibility of aerobic and anaerobic thresholds in 20–50 year old men. Eur. J. Appl. Physiol. 53:260–266, 1984.
2. Bouchard, C., A. S. Leon, D. C. Rao, J. S. Skinner, J. H. Wilmore, and J. Gagnon. The HERITAGE Family Study: aims, design, and measurement protocol. Med. Sci. Sports Exerc. 27:721–729, 1995.
3. Cohen-Solal, A., F. Zannad, J. G. Kayanakis, P. Gueret, J. F. Aupetit, and H. Kolsky. Multicentre study of the determination of peak oxygen uptake and ventilatory threshold during bicycle exercise in chronic heart failure. Eur. Heart J. 12:1055–1063, 1991.
4. Daw, E. W., M. A. Province, J. Gagnon, et al. Reproducibility of the HERITAGE Family Study intervention protocol: drift over time. Ann. Epidemiol. 7:452–462, 1997.
5. Ferguson, R. J., G. G. Marcotte, and R. R. Montpetit. A maximal oxygen uptake test during ice skating. Med. Sci. Sports 1:207–211, 1969.
6. Froelicher, V. F., M. C. Maj, H. Brammell, et al. A comparison of the reproducibility and physiologic response to three maximal treadmill exercise protocols. Chest 65:512–517, 1974.
7. Gagnon, J., M. A. Province, C. Bouchard, et al. The HERITAGE Family Study: quality assurance and quality control. Ann. Epidemiol. 6:520–529, 1996.
8. Griffin, S. E., R. A. Robergs, and V. H. Heyward. Blood pressure measurement during exercise: a review. Med. Sci. Sports Exerc. 29:149–159, 1997.
9. Irving, J. B., R. A. Bruce, and T. A. Derouen. Variations in and significance of systolic pressure during maximal exercise (treadmill) testing: relation to severity of coronary artery disease and cardiac mortality. Am. J. Cardiol. 39:841–843, 1977.
10. Janicki, J. S., S. Gupta, S. T. Ferris, and P. A. McElroy. Long-term reproducibility of respiratory gas exchange measurements during exercise in patients with stable cardiac failure. Chest 97:12–17, 1990.
11. Katch, V. L., S. S. Sady, and P. Freedson. Biological variability in maximum aerobic power. Med. Sci. Sports Exerc. 14:21–25, 1982.
12. Madsen, E. B., and V. Froelicher. The use of exercise testing to evaluate patients after myocardial infarction. In: Heart Disease and Rehabilitation, 2nd Ed. M. Pollock and D. Schmidt (Eds.). New York: Wiley, 1986, pp. 455–476.
13. Magel, J. R., and J. A. Faulkner. Maximum oxygen uptake of college swimmers. J. Appl. Physiol. 22:929–933, 1967.
14. McArdle, W. D., F. I. Katch, G. Pechar, L. Jacobson, and S. Ruck. Reliability and interrelationships between maximal oxygen intake, physical work capacity, and step-test scores in college women. Med. Sci. Sports Exerc. 4:182–186, 1972.
15. McArdle, W. D., F. I. Katch, and G. Pechar. Comparison of continuous and discontinuous treadmill and bicycle tests for max V̇O2. Med. Sci. Sports Exerc. 5:156–160, 1973.
16. Mitchell, J. H., B. J. Sproule, and C. B. Chapman. The physiological meaning of the maximal oxygen intake test. J. Clin. Invest. 37:538–547, 1958.
17. Nordrehaug, J. E., R. Danielsen, L. Stangeland, G. A. Rosland, and H. Vik-Mo. Respiratory gas exchange during treadmill exercise testing: reproducibility and comparison of different exercise protocols. Scand. J. Clin. Lab. Invest. 51:655–658, 1991.
18. Shrout, P. E., and J. L. Fleiss. Intraclass correlations: uses in assessing rater reliability. Psychol. Bull. 86:420–428, 1979.
19. Taylor, C. Some properties of maximal and submaximal exercise with reference to physiological variation and the measurement of exercise tolerance. Am. J. Physiol. 142:200–212, 1944.
20. Taylor, H. L., E. Buskirk, and A. Henschel. Maximal oxygen intake as an objective measure of cardio-respiratory performance. J. Appl. Physiol. 8:73–80, 1955.
21. Wilmore, J. H., J. A. Davis, R. S. O’Brien, P. A. Vodak, G. R. Walder, and E. A. Amsterdam. Physiological alterations consequent to 20-week conditioning programs of bicycling, tennis, and jogging. Med. Sci. Sports Exerc. 12:1–8, 1980.
22. Wilmore, J. H., B. J. Freund, M. J. Joyner, et al. Acute response to submaximal and maximal exercise consequent to beta-adrenergic blockade: implications for the prescription of exercise. Am. J. Cardiol. 55:135D–141D, 1985.
23. Wilmore, J. H., P. R. Stanforth, K. R. Turley, et al. Reproducibility of cardiovascular, respiratory and metabolic responses to submaximal exercise: the HERITAGE Family Study. Med. Sci. Sports Exerc. 30:259–265, 1998.