For many years researchers have used time to exhaustion in tests performed at constant power output or speed to study treatments that affect endurance performance. (See, for example, the publications of R. J. Maughan and coworkers.) This measure regularly revealed statistically significant large effects, but the fact that it showed large test-retest variability in reliability studies (coefficients of variation of ∼15–25%) led some researchers to conclude that it was unsuitable for investigating the small changes in endurance performance that matter to competitive athletes (^{6} ). These researchers suggested that performance tests in which the athlete attempts to maximize power output during performance of a constant amount of work, or over a constant distance or for a constant duration, were not only more race specific, but were apparently more reliable. Indeed, an athlete's performance in such tests varies typically by only a few percent in reliability studies (^{4} ).

Other researchers have argued that the apparently poor reliability of time to exhaustion is an artifact of the relationship between exercise duration and power output (^{4} ). The relationship is such that small random changes in a subject's ability to output power from test to test (e.g., ∼1%) result in much larger random changes in time to exhaustion (∼10–20%). But a treatment that produces a substantial change in a subject's ability to output power will also result in a large change in time to exhaustion, which will stand out against the large random changes. When these researchers used duration–power relationships to convert changes in time to exhaustion into equivalent changes in power output in a constant-duration time trial, they found that time to exhaustion was among the most reliable of measures of endurance performance (^{4} ). They suggested that the measures derived from race-specific constant-work or constant-duration time trials are, if anything, less reliable because optimal performance in such time trials requires subjects to set a pace appropriate for the distance or duration, and appropriate for their current state of fitness. In contrast, tests conducted at constant power, along with incremental tests to maximum effort, require no self-selection of pace, and potentially less familiarization with the test protocol. Such tests might therefore be better for studying factors that affect power output, whereas time trials are more appropriate for studies of pacing.

Notwithstanding the strength of these arguments, we continue to encounter skepticism about the utility of time to exhaustion as a measure of performance. The skepticism may arise in part from lack of understanding of the calculus used to derive the relationship between changes in power output and changes in time to exhaustion. One aim of the present study was therefore to provide a detailed account of the application of calculus to convert changes in time to exhaustion into changes in a time trial of similar duration. This account should help researchers understand why the high test-retest variability of time to exhaustion does not imply poor reliability. A further aim was to present empirical evidence of high reliability of time to exhaustion. For this aim, we performed a reliability study in which our subjects ran to exhaustion at three speeds, then repeated the set of runs at various times over 14 wk. For each set of three runs, we used critical-power modeling and log-log modeling of the relationship between run distance and run time to predict time-trial performance over standard competition distances. We reasoned that high reliability for these predicted times would be indirect but compelling evidence of functionally high reliability in times to exhaustion, from which they were derived.

METHODS
Choice and use of models.
The relationship between exercise duration and the maximum work or mean power output that can be achieved during the exercise is the key to understanding test-retest variation in time to exhaustion. In the present study, we restricted the exercise duration to 1–10 min, a range in which the critical-power model can be used for the relationship (^{2} ).

The critical-power model is based on the following assumptions: work is derived from aerobic and anaerobic mechanisms, the aerobic system provides power at maximum (m, the so-called critical power) for the entire duration (T) of the exercise, and the point of exhaustion coincides with the depletion of the anaerobic work capacity (a) needed to sustain the supraaerobic maximal exercise. It follows that W = a + mT, where W is the total work done (^{2} ). For running, the model is D = a + mT, where D is the total distance run, a is the distance run purely anaerobically, and m is the maximum aerobic running speed (^{2} ). Most researchers have used the critical-power model to make inferences about reliability of the parameters a and m, and about effects of experimental treatments on these parameters. Few researchers have used the model to predict and analyze performance times for set distances or set work. In the only such study of runners (^{5} ), the prediction was for a 10-km running time, which is far beyond the theoretical maximum time (∼10 min) for the critical-power model. To make predictions in the present study, we expressed T as a function of D: T = D/m − a/m.
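The prediction step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis code used in the study: the run data are hypothetical values constructed to lie exactly on the line D = 200 + 5T, the function names are our own, and regressing time on distance (rather than distance on time) is an assumption based on the form T = D/m − a/m.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_critical_power(distances, times):
    """Fit T = D/m - a/m and return (a, m).

    The slope of T on D is 1/m and the intercept is -a/m.
    """
    intercept, slope = fit_line(distances, times)
    m = 1.0 / slope
    a = -intercept * m
    return a, m


def predict_time_cp(a, m, distance):
    """Predicted time (s) for a set distance (m): T = D/m - a/m."""
    return distance / m - a / m


# Hypothetical runner: three runs to exhaustion (distances in m, times in s).
run_distances = [650.0, 1250.0, 2600.0]
run_times = [90.0, 210.0, 480.0]

a, m = fit_critical_power(run_distances, run_times)
for d in (800, 1500, 3000):
    print(f"{d} m: predicted {predict_time_cp(a, m, d):.1f} s")
```

With real data the three points will not lie exactly on a line, so the recovered a and m depend on which variable is treated as the dependent one.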

The fit of the critical-power model to performance times and distances produces high correlations (^{7} ). Although our data were no exception, analysis of residuals showed systematic and nonuniform deviations from the model (see Results section), a phenomenon that appears to have gone unnoticed in previous studies. We therefore chose an additional model for the relationship between T and D, an empirical log-log model with a long history (^{1} ). The model is easiest to understand in the form log(S) = klog(T) + c, where S is running speed or power, and k and c are constants. By applying differential calculus, dS/S = kdT/T, so the model implies that percent changes in speed (100dS/S) are proportional to percent changes in exercise duration (100dT/T), and k is the constant of proportionality. To make predictions of performance times for set distances, we used S = D/T to express the model as log(T) = log(D)/(1 + k) − c/(1 + k), or log(T) = k′log(D) + c′. Researchers interested in modeling the duration–power relationship in human performance should note that the log-log model readily accommodates modes of exercise where air or water resistance makes power output proportional to speed raised to some exponent, as in rowing, swimming, and cycling. The effect of such nonlinear resistance is simply to change the values of k and c in the log-log model.
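The log-log prediction can be sketched in the same way. The code below is an illustration under stated assumptions: the run times are hypothetical, the distances are generated from an assumed power law S = 10T^(−0.1) so that the fit can be checked, and all function names are ours.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_loglog(distances, times):
    """Fit log(T) = k_prime * log(D) + c_prime; return (k_prime, c_prime)."""
    log_d = [math.log(d) for d in distances]
    log_t = [math.log(t) for t in times]
    c_prime, k_prime = fit_line(log_d, log_t)
    return k_prime, c_prime


def predict_time_loglog(k_prime, c_prime, distance):
    """Predicted time (s) for a set distance (m) from the log-log model."""
    return math.exp(k_prime * math.log(distance) + c_prime)


def speed_exponent(k_prime):
    """Recover k in log(S) = k*log(T) + c from k_prime = 1/(1 + k)."""
    return 1.0 / k_prime - 1.0


# Hypothetical data generated from S = 10 * T**-0.1, so D = S*T = 10 * T**0.9.
run_times = [90.0, 210.0, 480.0]
run_distances = [10.0 * t ** 0.9 for t in run_times]

k_prime, c_prime = fit_loglog(run_distances, run_times)
print("k =", round(speed_exponent(k_prime), 4))
print("1500 m:", round(predict_time_loglog(k_prime, c_prime, 1500), 1), "s")
```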

We also investigated a three-parameter quadratic log-log model in an attempt to improve reliability of predicted times. The reliability of predicted times for 3000 m was substantially worse with this model, presumably because of the effect of overfitting on predictions requiring extrapolation beyond the range of the observations. The analyses are not included in this paper. Other three-parameter models are available (^{1} ), but these models would probably suffer from the same problem of overfitting with only three observations.

To derive the relationship between change in time to exhaustion at a constant speed and change in time in a constant-distance time trial of similar duration, it is necessary to express the critical-power model first as a relationship between speed (S) and distance traveled (D). The critical-power model is D = a + mT, therefore T = (D − a)/m, and so S = D/T = D/[(D − a)/m] = mD/(D − a). Figure 1 shows this relationship as a curve over the range of distances corresponding to run times of 1–8 min for the mean values of a and m observed in this study. Figure 1 also shows the new curve when a and m change by small arbitrary amounts, and the inset of Figure 1 shows the resulting small change in S for a constant-distance time trial and small change in D for a constant-speed run to exhaustion at given values of S and D. The ratio −ΔS/ΔD is approximately the slope of either curve. But, using differential calculus, the slope of the curve = dS/dD = −am/(D − a)^{2} ≈ −ΔS/ΔD. Rearranging, ΔS/S ≈ [am/(D − a)^{2} ]ΔD/S = [a/(mT)]ΔD/D. Therefore, 100ΔS/S = percent change in speed in a constant-distance time trial ≈ [a/(mT)]100ΔD/D = a/(mT) times the percent change in distance in a constant-speed run to exhaustion. But D = ST, so using differential calculus again, ΔD = SΔT, and therefore 100ΔD/D = 100SΔT/D = 100ΔT/T = a small percent change in time to exhaustion. Therefore, the constant to convert small percent changes in time to exhaustion into small percent changes in a constant-distance time trial is a/(mT). By a similar analysis, the constant to convert small percent changes in time to exhaustion into small percent changes in a constant-time time trial is a/(a + mT), although this constant is not used in this study. Similarly, the conversion constants derived from the log-log model for constant-distance and constant-duration time trials are −k/(1 + k) (or k′ − 1) and −k (or (k′ − 1)/k′), respectively.
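The conversion constants derived above can be checked numerically. The following sketch uses hypothetical values (a = 200 m, m = 5 m·s^{−1}, T = 240 s, k = −0.1, and a 15% variability in time to exhaustion) chosen only for illustration; the function names are ours. It compares the analytic slope dS/dD with a finite-difference estimate and then applies the constant a/(mT).

```python
def cp_factor_constant_distance(a, m, T):
    """Critical-power constant for a constant-distance time trial: a/(mT)."""
    return a / (m * T)


def cp_factor_constant_duration(a, m, T):
    """Critical-power constant for a constant-time time trial: a/(a + mT)."""
    return a / (a + m * T)


def loglog_factor_constant_distance(k):
    """Log-log constant for a constant-distance time trial: -k/(1 + k)."""
    return -k / (1 + k)


def loglog_factor_constant_duration(k):
    """Log-log constant for a constant-duration time trial: -k."""
    return -k


def speed_for_distance(a, m, D):
    """Critical-power model expressed as S = mD/(D - a)."""
    return m * D / (D - a)


# Hypothetical parameter values, for illustration only.
a, m, T = 200.0, 5.0, 240.0
D = a + m * T  # distance covered in a run to exhaustion lasting T

# Central finite difference versus the analytic slope -am/(D - a)^2.
h = 1e-3
numeric_slope = (speed_for_distance(a, m, D + h)
                 - speed_for_distance(a, m, D - h)) / (2 * h)
analytic_slope = -a * m / (D - a) ** 2

# A 15% variability in time to exhaustion, expressed as the equivalent
# percent variability in a constant-distance time trial.
cv_tte = 15.0
cv_tt = cp_factor_constant_distance(a, m, T) * cv_tte
print(f"slopes: {numeric_slope:.6e} vs {analytic_slope:.6e}")
print(f"equivalent time-trial variability: {cv_tt:.2f}%")
```

With these assumed values the constant is 200/(5 × 240) = 1/6, so a 15% variability in time to exhaustion corresponds to about 2.5% in the equivalent constant-distance time trial.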

FIGURE 1: Curves showing the relationship between running speed and distance traveled under the assumptions of the critical-power model. The solid curve is for mean values of anaerobic capacity (a) and maximum aerobic power (m) observed in this study for treadmill running at an inclination of 1.5%; the dashed curve is for arbitrary small increases in a (5%) and m (2%). The inset shows that the resulting small change in S (ΔS) for a constant-distance test divided by the small change in D (ΔD) for a constant-speed test is approximately equal to the slope of the curves at the given values of S and D.

Subjects.
Eight male distance runners (age 27 ± 9 yr, mean ± SD) at the regional and national competitive levels participated in the study. The runners' current best competitive performance speed as a percent of world record in the 800–10,000 m was 79 ± 11 (mean ± SD). They provided informed consent, and the study was conducted in accord with the institutional ethics policy. Runners were excluded from the study if they were suffering from illness or injury that would interfere with normal training and competition.

Design.
Variability of a subject's performance can depend on the time between tests. We therefore designed this reliability study to determine variability of performance over a period of a few days (during which changes in performance should be close to minimal) and over a period of several months (which is typical for training or other long-term interventions). To this end, runners were recruited with the understanding that they would maintain their usual training (∼7 h·wk^{−1} ) over a 15-wk period in a summer season, and that during this time they would visit the laboratory on seven separate occasions to perform a set of three runs to exhaustion. The first visit was a familiarization session, and the remaining six visits, beginning 1 wk later, were testing sessions. Pairs of testing sessions were scheduled 5 d apart every 7 wk for a total of 14 wk. One subject failed to report to the lab for his last test.

As models with two parameters, the critical-power and log-log models require at least two runs to exhaustion for derivation of slope and intercept parameters and for subsequent prediction of time-trial times. Three or more runs allow derivation of goodness of fit and more accurate estimation of parameters and predictions. We limited the number of runs to three because we were concerned that more runs might compromise the performance of the runners in some runs.

The sample size for the reliability study produced 20 df for the estimate of 5-d test-retest error of measurement (within-subject variability). The sample size is the equivalent of 21 subjects tested twice (after familiarization), which is more than that of most studies of reliability of performance (^{4} ). The resulting precision of error of measurement (90% confidence limits ×/÷1.2) is adequate, although not ideal, for comparison of errors (^{3} ).

Runs to exhaustion.
We instructed the runners to engage in little or no training for at least 1 d before each visit. At each visit, the runners performed three runs to exhaustion on a treadmill (Powerjog M30, Biddeford, ME) at three speeds. In a pilot study with two runners and during the familiarization session, the runners indicated they had recovered sufficiently to perform the set of three runs with a 30-min rest between runs. In the familiarization session, the speed of the treadmill was adjusted to elicit exhaustion times of approximately 1–2, 3–4, and 7–10 min, in that order. The first speed was based on the individual's current best 800- or 1500-m time. The runner's speeds were then held constant for the six testing sessions. The slope was fixed at 1.5% for all runs. The runners received no feedback about elapsed time or distance traveled during the runs.

For two of the runners, speed of the third run was changed after the first testing session to keep time to exhaustion within 10 min. Analysis of variability of time to exhaustion requires the speed of the treadmill to be constant for the given runner, so for these runners we excluded the times for the third run in the first test from the analysis. These times were, however, included in the critical-power modeling, and were therefore included in reliability analyses of parameters and estimates derived from the critical-power modeling. We also excluded the third run of one runner in the fifth test because the standardized residual of this run in the reliability analysis was 6.2. Critical-power modeling was not applied to this runner's other two runs in the fifth test.

Statistical analysis.
For the prediction of time-trial times, we fit a least-squares straight line to the run time and run distance (critical-power predictions) and to the natural logarithms of run time and run distance (log-log predictions) for each runner's set of three runs. The parameter estimates were then used to predict times for the standard competition distances of 800, 1500, and 3000 m, as illustrated in Figure 2 for the set of three runs with the worst fit for the critical-power model. We also predicted each runner's times for the group mean distance run in each of the three runs (640, 1200, and 2300 m). Measures of goodness of fit for each set of three runs were the adjusted correlation coefficient (square root of the R^{2} adjusted for degrees of freedom) and the SE of the estimate; the latter were averaged over all six runs of all runners (via averaging of their squares). Residuals and predicted values for each set of three runs were also output for plotting and averaging to assess the uniformity of fit.
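The goodness-of-fit measures named above can be computed as in the following sketch. The run times are hypothetical, the distances are the group mean distances quoted in the text, and the function names are ours; the adjustment of R^{2} uses the usual formula for one predictor, which we assume matches the adjustment intended here.

```python
import math


def fit_stats(x, y):
    """Least-squares line of y on x.

    Returns (adjusted_r, se_estimate, residuals).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    df_res = n - 2  # two fitted parameters
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / df_res
    adjusted_r = math.sqrt(max(adj_r2, 0.0))
    se_estimate = math.sqrt(ss_res / df_res)
    return adjusted_r, se_estimate, residuals


def mean_se(se_values):
    """Average several SEs of the estimate via the mean of their squares."""
    return math.sqrt(sum(s * s for s in se_values) / len(se_values))


# Group mean distances from the text (m) with hypothetical run times (s).
distances = [640.0, 1200.0, 2300.0]
times = [100.0, 215.0, 430.0]

adjusted_r, se_estimate, residuals = fit_stats(distances, times)
print(f"adjusted r = {adjusted_r:.4f}, SE of estimate = {se_estimate:.2f} s")
print("residuals (s):", [round(r, 2) for r in residuals])
```

With only three points there is a single residual degree of freedom, so the SE of the estimate from any one set of runs is imprecise; averaging over many sets, as described above, gives a more stable value.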

Mixed-model analyses of log-transformed times (three) provided estimates of test-retest variability of times to exhaustion, predicted times, and model parameters as coefficients of variation (percent of each subject's mean). The mixed model had a single fixed effect for identity of trial (six levels, excluding the familiarization trial), a residual representing within-subject variation (error of measurement) between any pair of trials, a random effect representing additional within-subject variation between trials 7 or 14 wk apart, and a random effect for subject identity.
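For readers without access to mixed-modeling software, the idea of expressing error of measurement as a coefficient of variation from log-transformed times can be illustrated with a much simpler two-trial calculation. This sketch is not the mixed model described above: it handles only one pair of trials, it ignores the additional random effect for trials 7 or 14 wk apart, and the data and function name are hypothetical.

```python
import math
import statistics


def cv_from_two_trials(trial1, trial2):
    """Within-subject variability (%) from one pair of trials.

    Uses the SD of the difference in log-transformed scores divided by
    sqrt(2), then back-transforms to a coefficient of variation.
    """
    diffs = [math.log(t2) - math.log(t1) for t1, t2 in zip(trial1, trial2)]
    sd_diff = statistics.stdev(diffs)
    typical_error_log = sd_diff / math.sqrt(2)
    return 100.0 * (math.exp(typical_error_log) - 1.0)


# Hypothetical times to exhaustion (s) for eight runners in two trials.
trial1 = [212.0, 198.0, 240.0, 225.0, 187.0, 260.0, 231.0, 205.0]
trial2 = [230.0, 185.0, 251.0, 210.0, 199.0, 244.0, 248.0, 214.0]

print(f"within-runner variability: {cv_from_two_trials(trial1, trial2):.1f}%")
```

Working with differences in log-transformed times removes any uniform percent shift between trials, so a pure learning or fatigue effect common to all runners does not inflate the estimate.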

RESULTS
The adjusted correlation coefficients for the critical-power and log-log modeling of the sets of three runs were all at least 0.998, and mean SE of the estimates were 4.8 s and 2.0% for the critical-power and log-log models, respectively. A plot of residuals against predicted values from the critical-power modeling of all sets of three runs showed clear evidence of nonuniformity of error (Fig. 3): the means of residuals for the three runs (shortest to longest) were 1.4, −2.4, and 1.0 s; the SDs were 2.0, 2.9, and 1.2 s. (The root-mean square of all the residuals in Figure 3, appropriately corrected for degrees of freedom, is the mean SE of the estimate, 4.8 s.) The residuals from the log-log modeling showed a similar pattern of nonuniformity, but with reversed sign for the means for the three runs (shortest to longest): −0.7, 1.0, and −0.3%; the SDs were 0.8, 1.2, and 0.6%.

Table 1 shows values of parameters derived from the modeling as means and variabilities.

TABLE 1: Reliability analysis of parameters derived from the critical-power and log-log models for each set of three runs over the six trials: overall mean, between-runner variability in any one trial, and within-runner variability from trial to trial.

The reliability analyses for time to exhaustion in each of the three runs over the six tests produced negative estimates for additional error variance when tests were 7 wk apart. Because it is unrealistic that there should be better reliability between tests 7 wk apart than between tests 5 d apart, the additional variance was set to zero, and the resulting estimates of variability (in which the six tests contribute equally) are shown in Table 2. Also shown in Table 2 for each of the three runs is the within-runner variability of an equivalent constant-distance time trial estimated from the within-runner variability in time to exhaustion via the critical-power and log-log models.

TABLE 2: Analysis of time to exhaustion for each of the three runs over the six trials for treadmill running at an inclination of 1.5%: overall mean, between-runner variability in any one trial, within-runner variability from trial to trial, and within-runner variability of an equivalent constant-distance time trial estimated from the critical-power and log-log models.

The reliability analysis of time-trial times predicted for the distances of each run and for standard competition distances using the critical-power and log-log models is summarized in Table 3. The analysis for the shortest distance produced negative estimates for additional error variance when tests were 7 wk apart. As in the analysis of times to exhaustion, the additional variance was set to zero, and the resulting estimates of 5-d and 7-wk variability in Table 3 are therefore equal. Variabilities of the time-trial times for the other distances were greater for tests 7 wk apart than for tests 5 d apart, as shown in Table 3.

TABLE 3: Within-runner variability of constant-distance time-trial times predicted for various distances from each runner's set of three runs using the critical-power and log-log models. The distances are the mean distance in each of the three runs to exhaustion (for comparison with variabilities in Table 2) and standard competition distances.

DISCUSSION
In the present study, we used several approaches to derive estimates of test-retest error of measurement from times to exhaustion in tests conducted at constant running speed. All estimates were <3%, and some were ∼1%, which represents excellent reliability in comparison with other tests of endurance performance (^{4} ). These findings should lay to rest any concerns that time to exhaustion is inherently an unreliable measure of endurance performance.

Although the errors were small, there were considerable differences in the estimates (up to a factor of 2) between the different approaches. Given the uncertainty in the estimates represented by the confidence limits (×/÷1.2–1.3), these differences are likely to represent substantial real differences rather than simply sampling variation. The most noteworthy differences are for the within-runner equivalent time-trial variabilities estimated from single runs to exhaustion (Table 2, last two columns): the variabilities derived from the critical-power model get smaller for longer runs; the trend is in the other direction and less marked for variabilities derived from the log-log model. These differences arise presumably from the systematic lack of fit of the models revealed by the mean values of the residuals for the three runs. Evidently, there is some curvature in the relationship between run time and run distance, and log-log transformation overcompensates by inverting the curvature. The factors for converting variation in time to exhaustion into variation in equivalent time-trial time are based on the slope of the appropriate version of each model, so failure to fit curvature in the data will result in systematic errors in the factors. We suspect that the biological within-subject variability in mean power output for exercise durations over the range of 1–10 min is practically constant, when expressed as a coefficient of variation. In our view, the trends shown in the last two columns of Table 2 are therefore artifacts of the models. A model that fit the curvature would probably produce equivalent time-trial time variability of ∼1.7% (the variability for both models for the second of the three runs) for runs of any duration in the 1- to 10-min range.

Although neither model produced an ideal fit for predicting time-trial times from single runs to exhaustion, Table 3 provides some evidence of the superiority of the log-log model when combining three runs. The 5-d variability from the log-log model differs little for the shortest and longest of the three runs (640 and 2300 m). The variability for the middle run (1200 m) is a little lower, as one might expect for a prediction for the middle of a set of points in comparison with predictions for the extremes. In contrast, the 5-d variability from the critical-power model appears to have a systematic trend toward less variability for longer runs, similar to that in Table 2. Note, however, that the superiority of the log-log model applies only to three runs to exhaustion with a 30-min rest between runs. With longer periods of rest or with runs performed on different days, goodness of fit of both models might change.

The estimates of 7-wk variability for run times predicted from each set of three runs (Table 3) were similar to the variability of times derived from single runs (Table 2). Variability over 5 d, when it could be estimated, was a little smaller. Greater variability is expected as time between tests increases, so the failure to find any such increase for the shortest runs in Table 3 could simply represent sampling variation. It is also possible that over the 14 wk of the study, the runners tended to experience more variability in the aerobic (endurance) component of their fitness.

Estimability of 5-d and 7-wk variability for predicted times for 1200- and 2300-m runs (Table 3) was not matched by such estimability for corresponding runs to exhaustion (Table 2). Thus, single runs to exhaustion appear to have less sensitivity to subtle changes in reliability. We can think of two likely explanations. First, there is a large increase in variability of the time to exhaustion (9.2 to 16%) as the duration of the run increases from the shortest (1–2 min) to the longest (5–10 min). Within a single run, there is a twofold range of run time between runners. It follows that variability of time to exhaustion within a single run will be greater for the runners with longer run times. This “variability of variability” presumably tends to mask the random effect representing additional variability for tests 7 wk apart. The other likely explanation is a similar variability of variability arising from individual differences in the constants that convert changes in time to exhaustion into changes in equivalent time-trial time (27–28%, Table 1). Both these sources of variation are eliminated when the critical-power or log-log models are used to predict times for specific distances from two or more runs.

An incidental outcome in our study is estimation of the reliability of parameters in the critical-power model. The only comparable studies were performed over a shorter time frame (≤1 wk) on cycle ergometers (reviewed in Hopkins et al. (^{4} )). In these studies, aerobic power was more variable (2.3–7.6%) and anaerobic capacity was less variable (8.4–14%), possibly reflecting the different time frame and mode of ergometry.

In conclusion, it is clear that time to exhaustion is a reliable measure, but it is now also clear that converting change in time to exhaustion from trials at one speed or power output into change in time-trial time can only be approximate. Researchers or sport scientists may still wish to use time-to-exhaustion tests because of their potential to avoid the problems associated with pacing. We recommend researchers use at least two trials differing widely in speed or power, then combine the trials with an appropriate model to predict time-trial times or mean power over race-specific distances. Choice of the model can make substantial differences in the predictions for the race distances. For time to exhaustion in the 1- to 10-min range, the log-log model appears to be appropriate and superior to the critical-power model. For longer times the critical-power model is theoretically inappropriate, and research is needed to determine the suitability of the log-log model. More research is also needed to determine how well the log-log model predicts race times from times to exhaustion in modes of exercise where there is a nonlinear relationship between power and speed.