The primary aim of researchers in the field of sport performance enhancement is to determine the effect of training, nutritional, or other treatments on the medal-winning prospects of top athletes. In this respect the field may be unique among the biomedical sciences, which are concerned typically with defining the effects of treatments on the average individual in a population, not on the individuals at one extreme of the population distribution. This difference impacts on an important aspect of research design: sample size. Previous methods for estimating sample size, for example, those based on various definitions of effect size ^{(3)} , cannot be assumed to apply to research aimed at delimiting enhancements that matter to an elite athlete.

Another unusual aspect of research on performance enhancement is that researchers have seldom, if ever, investigated the effect of treatments on performance in actual competitive events. Instead they have used performance in laboratory or field tests that simulate the event, probably because it is easier to recruit subjects and measure variables that underlie or explain the effects of treatments that might help or hinder performance. Unfortunately the relationship between performance in tests and performance in events has not been explored adequately, so it is uncertain how a change in performance in a test translates into a change in performance in an event.

In this article we address these problems by first using simulation to define the magnitude of the smallest enhancement that matters to an elite athlete. Next, we calculate the sample sizes that would be needed to delimit such an enhancement in crossover and fully controlled experiments, if the event itself were used to assay the effect of a treatment on performance. We then explain how the reliability of a test impacts on sample size when the test is used to measure performance enhancement, and we review methods and results of recent studies on the reliability of various laboratory tests. Useful tests have to be valid as well as reliable, so we discuss the validity of tests and present a new method to quantify validity. Finally we identify special features of research designed to measure performance enhancement in tests and events. Our findings are relevant to all sports in which athletes compete as individuals and are ranked for medals according to their time, distance, points, or other performance score.

WORTHWHILE ENHANCEMENTS OF PERFORMANCE FOR ELITE ATHLETES
How big does an enhancement of performance have to be before it makes a difference to the medal-winning prospects of an elite athlete? Two important factors are the variation in an athlete's performance between events (also known as within-athlete variation), and the variation in performance between athletes in the same event (also known as between-athlete variation).

The importance of within-athlete variation is illustrated in Figure 1, which depicts several athletes of equal ability vying for first place in an Olympic final. At first glance at the final in Figure 1A, it would appear that even a minuscule enhancement of performance would be worthwhile for one of these athletes, because it would put that athlete ahead of the others. But within-athlete variation has to be taken into account. On re-running the race, these four athletes would be unlikely to cross the line together again, because within-athlete variation would produce a slightly different outcome each time, as shown for a hypothetical re-run in Figure 1B. The typical within-athlete variation (SD) is shown by the arrows drawn on the runners in Figure 1C. For 100-m runners in the 1997 Grand Prix events of the International Amateur Athletic Federation, this variation amounts to a variation in position at the finishing line of about ± 0.9 m, or a variation of about ± 0.09 s in finishing time (W. G. Hopkins, unpublished observations, 1998). When expressed as a percent of a runner's mean performance, the within-athlete standard deviation is known as the coefficient of variation (CV). Here the CV is about 0.9%.

Figure 1-Four of the contestants at the finishing line of the 100-m running final at the Barcelona Olympics: A) The real image of the final. B) Image manipulated to simulate an outcome of a re-run of the final. C) Arrows indicate typical within-athlete variation in position of each competitor if the event were re-run. Original image courtesy of Sporting Images (http://www.sportingimages.com.au/, Nundah, Brisbane, Australia) and photographer Chuck Muhlstock. Reproduced with permission.

For equally matched athletes, an enhancement of performance much smaller than the CV will obviously have no effect on an athlete's chances of winning an event. It should be equally obvious that enhancements much greater than the CV will guarantee the athlete first place. The enhancement that begins to make a difference to the athlete's chance of winning is somewhere between these two extremes, similar in magnitude to the CV.

Between-athlete variation represents the true variation in ability between athletes. Between-athlete variation has an important effect on medal winning because it adds to the spread between athletes at the finish of an event. The greater the spread in ability relative to the within-athlete variation, the greater the enhancement needed to lift an athlete to first place from a lower ranking.

We have performed simulations to assess the impact of performance enhancements on the gold-medal prospects of particular place-getters in an Olympic final (Appendix 1). We assumed equal between-athlete and within-athlete variations, because in most international track and field athletics events of the 1997 season, the between-athlete SD was one to two times the within-athlete SD (W. G. Hopkins, unpublished observations, 1998). Ten thousand events were then simulated, each with an independent draw of 15 athletes, and in each of which a particular athlete (e.g., the true fourth place-getter) was given a particular enhancement (e.g., 1.5 × CV). The chance of that athlete winning was then calculated as the percent of events in which the athlete placed first. The results are shown in Figure 2. The simulations were then repeated for a between-athlete variation of zero, for a between-athlete variation equal to twice the CV, for a grossly non-normal distribution of between-athlete variation, and for eight athletes. In each case the absolute chance of winning changed for each place-getter, but the increase in the chance of winning for a given enhancement was similar.
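The simulation can be sketched as follows. This is a minimal Python illustration of the procedure just described, not the actual code of Appendix 1; the function name and defaults are our own, and performance is in arbitrary units of the within-athlete SD.

```python
import numpy as np

rng = np.random.default_rng(42)

def chance_of_winning(placing, enhancement, n_athletes=15,
                      between_sd=1.0, within_sd=1.0, n_events=10_000):
    """Percent chance that the athlete of true rank `placing` (1 = best)
    wins, given an enhancement expressed in multiples of the
    within-athlete SD (i.e., of the CV, on a percent scale)."""
    wins = 0
    for _ in range(n_events):
        # Independent draw of true abilities, sorted so index 0 is the best.
        ability = np.sort(rng.normal(0.0, between_sd, n_athletes))[::-1]
        # Event-day performance = true ability + within-athlete variation.
        perf = ability + rng.normal(0.0, within_sd, n_athletes)
        # Give the chosen place-getter the enhancement.
        perf[placing - 1] += enhancement
        if np.argmax(perf) == placing - 1:
            wins += 1
    return 100.0 * wins / n_events
```

With these defaults, `chance_of_winning(1, 0.0)` confirms that even the unenhanced true first place-getter wins well under half the time.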

Figure 2-Effect of enhancement of performance on the chance of winning an event for an athlete who averages a certain placing (labels on curves) without the enhancement. The event has 15 athletes and equal within- and between-athlete variations.

Before Figure 2 can be used to make a decision about the smallest worthwhile enhancement, it is necessary to decide what represents a worthwhile increase in the chance of winning an event. With certain exceptions in the realm of public health, an absolute increase of 10% can be regarded as the smallest worthwhile increase in the frequency of occurrence of anything, regardless of the initial frequency (W. G. Hopkins, unpublished observations, 1998; based partly on converting a frequency difference to a correlation, then applying Cohen's interpretation of magnitude of correlations ^{(3)} ). Figure 2 shows that the magnitude of the enhancement that produces such an increase depends on the average placing of the athlete. For example, the athlete who averages 10th place wins only 1% of events, but the frequency rises to 11% with an enhancement of 1.3 of the CV. The athlete who averages fourth place needs an enhancement of 0.6 of the CV to increase the chance of winning from 9% to 19%. The first place-getter normally takes home the gold a surprisingly low 38% of the time, but with an enhancement of 0.3 of the CV the chance rises to 48%. The conclusion from these simulations is that an enhancement as small as 0.3 to 0.4 of the CV is important for the best athletes.

Some researchers and athletes will be interested in how the chance of winning is affected by a performance decrement, such as might occur with illness, injury, or inappropriate preparation or competition strategies. By extrapolating the curves of Figure 2 leftwards, it is easy to see that decrements have the greatest effect on the best athlete, and that a decrement of ∼0.3 of the CV reduces the chance of winning by about the same absolute amount (10%) that an enhancement of similar magnitude increases it. Greater decrements are needed to reduce the winning chances of the true second, third, and fourth place-getters by 10%. Athletes further down the field are out of the running for a gold regardless of any decrement in performance.

For the individual athlete an absolute change of 10% in the chance of winning may be barely noticeable, so it could be argued that 20% is more realistic when evaluating treatments or strategies that might affect performance. In this case, the minimum worthwhile enhancement for top athletes is doubled to about 0.7 of the CV. On the other hand, the medal tally of a country like the United States is affected substantially by an absolute increase of 10%: if the average best athlete in the world in a given sport has only a 40% chance of winning an Olympic gold, raising this chance to 50% for all such athletes based in the United States would push the U.S. medal tally up by a factor of 5/4, or 25%. Of course, no single strategy will raise the medal chances of top athletes in every sport, but the U.S. Olympic Committee could aim for an across-the-board increase by investigating every potential strategy in every sport. Unfortunately, sample size is a major problem for studies aimed at measuring enhancements that increase performance by 0.3 of the CV.

SAMPLE SIZE FOR STUDIES OF PERFORMANCE ENHANCEMENT
Researchers traditionally are supposed to use a sample size that will permit them to detect (declare statistically significant) the smallest worthwhile effect, with acceptably low rates for detection of nonexistent effects (5%) and failed detection of the smallest worthwhile effect (20%). If the competitive event itself is used as the performance test in a study of performance enhancement, the sample size needed to detect a given fraction or multiple of a coefficient of variation can be obtained from tables (e.g., ^{(3)} ) or statistical programs. For a crossover study in which every athlete receives an experimental and control treatment, 0.7 of the CV can be detected with a sample size of about 34. For a fully controlled study in which performance is assessed before and after either an experimental treatment for one-half the athletes, and before and after a control treatment for the other half, four times as many subjects, or about 140, are needed. If the focus of the study is performance enhancement that ultimately benefits the U.S. medal tally, detecting performance enhancements of ∼0.35 of the CV requires four times as many subjects again: 140 for a crossover, and 550 for a fully controlled study. Such sample sizes are beyond the resources of most sport scientists.
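Under a normal approximation, these traditional sample sizes can be reproduced roughly as follows. This is our sketch, not the cited tables; we assume the SD of the experimental-minus-control difference score in a crossover is √2 times the within-athlete CV, and the small discrepancies from the numbers above presumably reflect t-distribution corrections in the tables.

```python
import math

Z_ALPHA = 1.960  # two-sided type I error rate of 5%
Z_BETA = 0.842   # 80% power (20% type II error rate)

def crossover_n(effect_over_cv):
    """Approximate sample size for a crossover to detect an enhancement
    expressed as a fraction of the within-athlete CV."""
    return math.ceil((Z_ALPHA + Z_BETA) ** 2 * 2.0 / effect_over_cv ** 2)

def controlled_n(effect_over_cv):
    """A fully controlled (two-group, pre-post) design needs about four
    times as many subjects as the crossover."""
    return 4 * crossover_n(effect_over_cv)
```

For example, `crossover_n(0.7)` gives 33 and `controlled_n(0.7)` gives 132, close to the 34 and 140 quoted above; `crossover_n(0.35)` gives 129.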

A new approach to sample-size estimation focuses on precision of estimation of the magnitude of effect of a treatment rather than detection of a nonzero effect (W. G. Hopkins, unpublished observations, 1998). Precision of estimation is defined by 95% confidence limits, which represent the likely range of the true value of the effect for the average subject. If the smallest worthwhile effect (e.g., a difference between means) is d, it is important that the confidence interval does not overlap a worthwhile positive effect (+d) and a worthwhile negative effect (−d) simultaneously. It follows that the acceptable confidence limits are ±d, because only with an observed effect of exactly zero would the confidence interval touch +d and −d. Acceptable confidence limits are obtained with sample sizes about one-half those of the traditional approach (Appendix 2). For example, if the within-athlete CV is 1% and the smallest worthwhile enhancement is regarded as 0.7%, the acceptable confidence limits of ± 0.7% would be provided by a sample of 16 athletes in a crossover and 65 in a fully controlled study; to delimit an enhancement of 0.35%, the corresponding sample sizes are about 65 and 260.
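In code, the precision-based numbers work out as follows (again a sketch under the same √2 × CV assumption for the crossover difference score, not the exact calculation of Appendix 2):

```python
import math

def precision_n_crossover(d_over_cv, z=1.960):
    """Sample size for a crossover such that the 95% confidence limits on
    the enhancement are +/-d, with d expressed as a fraction of the
    within-athlete CV (confidence limits = z * sqrt(2) * CV / sqrt(n))."""
    return math.ceil(z ** 2 * 2.0 / d_over_cv ** 2)

def precision_n_controlled(d_over_cv, z=1.960):
    """The fully controlled design again needs about four times as many."""
    return 4 * precision_n_crossover(d_over_cv, z)
```

Here `precision_n_crossover(0.7)` returns 16, and the other three numbers come out at 64, 63, and 252, in reasonable agreement with the 65, 65, and 260 quoted above.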

What are the appropriate sample sizes when the researcher uses a test rather than an event to study performance enhancement? To answer this question, the researcher has to be reasonably certain that a particular treatment under investigation produces the same average enhancement in the event as in the test. Reasonable certainty can be provided by an appropriate validity study (described subsequently). Sample size for a study using a valid test is obtained by multiplying the sample size for a study using the event by the square of the CV for the test relative to the event (Appendix 2). For example, a study using a test that has twice the CV of the event needs four times as many subjects as a study using the event itself to get the same precision for the estimate of the enhancement. Such a test would need unrealistic sample sizes to delimit the smallest performance enhancement that matters to the elite athlete. On the other hand, a study using a test with half the CV of the event would need only one-quarter the number of subjects of a study using the event. Delimiting enhancements as small as 0.3-0.4 of the CV would be possible with such a test, especially in a crossover.

Clearly the within-subject variation in performance of a test is crucial for the estimation of sample size. In the following section we describe recent research on within-subject variation and discuss other aspects of reliability.

RELIABILITY OF PERFORMANCE TESTS
Measures of Reliability
For the scientist undertaking repeated testing of athletes, three components of reliability are important: changes in mean performance, within-subject variation, and retest correlation. Changes in the mean indicate the extent to which the average performance of athletes gets better or worse when the test is repeated. Within-subject variation, as we have seen, represents the typical variation of an athlete's performance between trials, after any changes in the mean have been taken out of consideration. Retest correlation, the correlation between pairs of trials, is a measure that represents how well athletes retain their rank order on retest.

Shifts in the mean are important when researchers (and coaches) first begin to monitor the performance of athletes. At this time, improvements are likely to occur as the athletes learn how to perform the test. In a controlled study, such overall shifts in the mean are seemingly irrelevant because they disappear when performance of a treatment group is compared with that of a control group. However, there are almost certain to be individual differences between athletes in the magnitude of the learning effect. These individual differences result in a larger overall within-subject variation so that a larger sample size is needed to obtain acceptable confidence limits for the effect of a treatment. Researchers should therefore use tests with little or no learning effects, or they should get athletes to perform sufficient familiarization trials to eliminate learning effects.

Within-subject variation is the most important of the three measures of reliability because it defines the sample size for an experimental study. A reliability study should therefore be designed to obtain acceptable precision for the estimate of the within-subject variation. Ten athletes and three trials will give a barely acceptable 95% confidence interval for the estimate (0.75 to 1.5 of the observed value). More trials and more athletes will allow firmer conclusions about learning effects between trials and about the magnitude of the within-subject variation relative to that of other tests in the literature. To take into account changes in the mean between trials, the analysis should be a two-way ANOVA or a repeated-measures analysis; averaging the athletes' standard deviations, as some researchers have done, is not appropriate. Analysis of the log-transformed performance variable yields within-subject variation as a percent (the CV); changes in the mean between trials are also derived as percent changes by the same analysis ^{(18-20)}.
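The recommended analysis can be sketched as follows (our illustration, not the code of references 18-20): the residual of a two-way (athlete × trial) decomposition of the log-transformed scores gives the within-subject variation with trial-to-trial shifts in the mean removed.

```python
import numpy as np

def within_subject_cv(scores):
    """Within-subject CV (%) from an (athletes x trials) array of
    performance scores.  Log-transforming first makes the result a
    percent of each athlete's mean; subtracting athlete and trial means
    removes changes in the mean between trials, which simple averaging
    of each athlete's SD would not do."""
    y = np.log(np.asarray(scores, dtype=float))
    n_a, n_t = y.shape
    resid = (y - y.mean(axis=1, keepdims=True)
               - y.mean(axis=0, keepdims=True) + y.mean())
    # Residual mean square with (n_a - 1)(n_t - 1) degrees of freedom.
    ms_resid = (resid ** 2).sum() / ((n_a - 1) * (n_t - 1))
    return 100.0 * (np.exp(np.sqrt(ms_resid)) - 1.0)

def mean_changes_pct(scores):
    """Percent changes in the mean between consecutive trials, from the
    same log-transformed analysis."""
    trial_means = np.log(np.asarray(scores, dtype=float)).mean(axis=0)
    return 100.0 * (np.exp(np.diff(trial_means)) - 1.0)
```

For small CVs the back-transformed value is close to 100 times the log-scale SD, as noted in the text.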

Retest correlation is a combination of within-athlete and between-athlete variation. A high correlation for a given sample of athletes means that the within variation is much less than the between variation. A high correlation therefore implies that an athlete's performance score on retest will not change much relative to other athletes, so the rank order of the athletes will be largely retained. When the value for the retest correlation is derived from an ANOVA or repeated-measures analysis, it represents an average of correlations between all possible pairs of trials, and it is known as the intraclass correlation coefficient. It is difficult to compare the retest correlation between studies because the value of the correlation is sensitive to the between-athlete variation in the studies. For example, a study of reliability of a test with a group of elite athletes might produce a small CV, but the retest correlation would be low if all athletes in the group had similar ability. A poorer test would have a larger CV, but if the sample of athletes had a wider range in ability, the correlation could be higher than that of the better test. Thus, comparison of the correlations would give a misleading impression of the relative reliabilities of the two tests.
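The sensitivity of the retest correlation to between-athlete spread is easy to demonstrate numerically (a sketch with hypothetical SDs, in percent units):

```python
def retest_correlation(between_sd, within_sd):
    """Intraclass retest correlation implied by the between-athlete and
    within-athlete SDs (same units for both, e.g., percent)."""
    b2, w2 = between_sd ** 2, within_sd ** 2
    return b2 / (b2 + w2)

# A reliable test (CV 1%) in a homogeneous elite squad (between SD 0.5%)...
good_test_elite = retest_correlation(between_sd=0.5, within_sd=1.0)
# ...scores far below a poorer test (CV 2%) in a mixed-ability group
# (between SD 4%), even though the first test is the better one.
poor_test_mixed = retest_correlation(between_sd=4.0, within_sd=2.0)
```

Here `good_test_elite` is 0.2 while `poor_test_mixed` is 0.8, exactly the misleading comparison described above.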

Studies of Reliability
Despite the obvious importance of reliability when tests are used to measure changes in performance, there is a lack of data for many of the tests used to study well-trained athletes. For this reason we have undertaken a series of investigations aimed at determining the reliability of a variety of laboratory tests of endurance performance ^{(12,14,18-20)}. The data from these and other studies are shown in Table 1. We will restrict subsequent discussion to tests of endurance, but the principles apply to any tests of athletic performance.

TABLE 1: Some ergometer-based tests of performance for well-trained endurance athletes ranked by the coefficient of variation (CV) for measured or estimated mean power.

Table 1 includes most of the published studies in which reliability of an endurance test has been investigated in well-trained athletes and calculated as the within-subject coefficient of variation. The studies cover several modes of endurance exercise (cycling, rowing, running) on several different kinds of ergometer (electrically braked cycle, wind-braked cycle, wind-braked rowing ergometer, treadmill) for tests of various durations (0.5 min to 2.5 h) using various protocols (constant power to exhaustion, constant work or distance, constant duration, constant power followed by constant duration, and incremental power to exhaustion).

Depending on the protocol and ergometer, performance in a test can be expressed as duration, distance covered, mean speed, or mean power. Which of these measures is most appropriate to compare the reliability of the tests? Duration, distance, and speed are practical measures that allow an athlete or coach to compare performances on a given ergometer. For most ergometers the manufacturer simulates these measures on the basis of power developed by the athlete, but it is not always clear whether the manufacturer has used an accurate simulation. For example, with the Cybex MET100 cycle ergometer (Cybex Corp., Huntsville, AL), the relationship between power (P) and speed (V) is P = kV^{1.5}, where k is a constant ^{(7)}, whereas the relationship for the Kingcycle is P = aV + bV^{3}, where a and b are constants ^{(14)}, which approximates to P = kV^{2.2} for speeds around 40 km·h^{−1}. Clearly these simulations cannot both be correct, so a comparison of the CV for speed would be inappropriate. On the other hand, both ergometers probably measure power accurately and could therefore be compared for reliability on the basis of their CV for power. There are other reasons for using power to represent reliability. First, the calibration of power on the ergometer could be wrong by a constant factor, but the CV for power (representing percent variation) would be unaffected. Second, if the researcher knows the relationship between mean power and performance time or speed in an event, the CV for power can be converted to a CV for performance time or speed in that event. The researcher can thereby determine the sample size that would be needed to delimit enhancements for that event.
Finally, the relationship between the CV for power in the test and the CV for performance in the event is the same as that between percent enhancement of power in the test and percent enhancement of performance in the event (see next paragraph); the effect of a treatment on enhancement of power in the test can therefore be translated into enhancement of performance in any sport for which the relationship between power and performance is known.

In Table 1 we have included data for the CV of mean power in each test. In some cases these CV are estimates because the authors did not provide them. Before we compare the CV, we will explain how we derived the estimates. We assumed a relationship between power (P) and speed (V) of the form P = kV^{n} , where k and n are constants for the particular ergometer. By elementary calculus, it follows that 100ΔP/P ≈ n(100ΔV/V); that is, the percent change in power (either the CV or a performance enhancement) is approximately n times the corresponding percent change in speed. (For percent changes of less than 10% the approximation is practically exact, and for percent changes derived from log-transformed variables the relationship is exact regardless of the magnitudes of the changes.) For the Kingcycle ergometer (Kingcycle Ltd., High Wycombe, Bucks, U.K.) and tests lasting about an hour, n ≈ 2.2. For a treadmill, power cannot be measured directly, but from measurement of oxygen consumption it is known that running speed is directly proportional to output power ^{(16)} . Thus for a treadmill, P = kV, i.e., n = 1, so the CV for mean speed or time to complete a set distance is the same as the CV for power.
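For reference, the conversion implied by P = kV^{n} can be written out exactly (a sketch; the exact form follows from taking logs, so it holds for changes of any size):

```python
def pct_power_change(pct_speed_change, n):
    """Exact percent change in power for a given percent change in speed
    when P = k * V**n; approximately n times the speed change when the
    change is small."""
    return 100.0 * ((1.0 + pct_speed_change / 100.0) ** n - 1.0)
```

For the Kingcycle, `pct_power_change(1.0, 2.2)` gives about 2.21%, against the n × 1% = 2.2% approximation; for a treadmill (n = 1) the percent changes in power and speed are identical.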

Table 1 also includes an estimate for the effective CV of mean power for a constant-power test calculated from the CV for time to exhaustion. The large CV for time to exhaustion in these tests are a consequence of the fact that a small change in an athlete's output power between tests results in a large change in endurance time. To convert changes in duration to effective changes in mean power, the data of Peronnet and Thibault ^{(16)} were used. These authors found that for individual trained runners power expressed as a percent of V̇O_{2max} lies on a straight line with slope 6.4 ± 1.5 (mean ± SD) when plotted against endurance time on a log scale (Fig. 3). Thus 100ΔP/V̇O_{2max} = 6.4Δ(log(T)) ≈ 6.4ΔT/T, where T is the endurance time and Δ represents a small change. Rearranging, 100ΔT/T ≈ (100ΔP/P)[(100P/V̇O_{2max})/6.4], or percent change in endurance time is approximately equal to percent change in power multiplied by a constant (where the constant is power as percent of V̇O_{2max} divided by 6.4). For tests lasting about 7 min, P = 100% of V̇O_{2max}, and the constant is 100/6.4 ≈ 16; that is, a 1% change in power results in a 16% change in time to exhaustion, with a typical variation between runners of 13% to 20%. For tests lasting 2 h, P = 80% of V̇O_{2max}, so the change in time to exhaustion is 13% for a 1% change in power.
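The conversion can be packaged as follows (a sketch; the slope of 6.4 is the Peronnet-Thibault mean value quoted above, and the function names are our own):

```python
def time_per_power_factor(pct_vo2max, slope=6.4):
    """Multiplier turning a percent change in sustainable power into the
    corresponding percent change in time to exhaustion: power as a
    percent of VO2max divided by the slope of the power vs log(time) line."""
    return pct_vo2max / slope

def cv_power_from_cv_time(cv_time, pct_vo2max, slope=6.4):
    """Effective CV for mean power implied by an observed CV for time to
    exhaustion in a constant-power test."""
    return cv_time / time_per_power_factor(pct_vo2max, slope)
```

So `time_per_power_factor(100)` is about 16 for a ~7-min test and `time_per_power_factor(80)` about 13 for a ~2-h test; a 16% CV for time to exhaustion at 100% of V̇O_{2max} corresponds to an effective CV for power of about 1%.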

Figure 3-The relationship between mean power in a constant-work test and duration of a constant-power test performed at the same power, for a runner whose endurance doubles for a drop in power of 5% of maximal oxygen uptake (V̇O_{2max}).

The practical implication of this relationship between power and endurance is that published CV for time to exhaustion in constant-power tests lasting a few minutes to a few hours need to be divided by a factor of 13-16 to make them comparable to CV for constant-work or constant-time tests. Furthermore, a treatment that increases sustainable power by 1% produces an increase in speed of 1% in a constant-work or constant-time test, but in a constant-power test the increase in time to exhaustion is typically 10-20%, depending on the duration of the test and the physiology of the athlete. A study of the effects of a treatment on performance in a constant-power test can therefore provide only a rough estimate of the effect in an event.

CV for two of the tests in Table 1 cannot be compared with those of other tests. The test consisting of a fixed period at constant power followed by a time trial is one of several similar examples in the literature. These tests are designed to allow researchers to monitor physiological variables at a defined intensity before the athletes go all out for the time trial, which provides the performance measure. The CV is calculated for the time trial, but there appears to be no way to transform this CV into an equivalent for mean power over the total duration of the test. For the same reason, percent performance enhancements observed with this test cannot be transformed into expected performance enhancements for an event of similar duration. The test with an incremental-power protocol has a similar problem. Tests of this kind are usually conducted to measure V̇O_{2peak}, although the peak power or speed attained by an athlete in these tests is now recognized as having greater validity in predicting endurance performance than V̇O_{2peak} ^{(6)}. The CV of the incremental-power test represents the percent variation in a subject's peak power, but whether this CV can be converted to a CV for mean power in a time trial of any duration is unclear. Similarly, enhancements in peak power observed in this test may not translate, even qualitatively, into enhancements in performance in an event. Researchers whose primary interest is competitive performance enhancement should not use either of these tests.

Possible Factors Affecting Reliability
In assessing the factors that might contribute to the reliabilities of the various tests in Table 1, one must take into consideration the confidence intervals of the reliabilities. CV of tests at the top of the table probably differ from those at the bottom, because their confidence intervals do not overlap. Confidence intervals of other pairs of tests overlap considerably, so the real CV may not be different. Firm conclusions about differences of ∼1% are important, because differences of this order represent large differences in the number of athletes needed in studies using the tests. Such conclusions are possible only with more subjects and/or more trials than were used in most of the studies shown in Table 1.

Mode and duration of exercise. These factors do not greatly affect reliability of mean power. Cycling tests, for example, have the smallest and largest CV, and two tests from the same laboratory that differ only in duration (∼12 vs ∼100 min) have similar CV. Research on the reliability of performance in competitive events may be more likely to reveal any differences in the reliability of these and other modes and durations of exercise because the large numbers of events and athletes give greater precision to the estimates. For example, as noted previously, the CV of the 100-m competitors in the 1997 Grand Prix events was 0.9%, but CV increased gradually with event distance to reach about 1.5% for the 5000 m (W. G. Hopkins, unpublished observations, 1998).

Caliber of athletes. Although all subjects in the studies detailed in Table 1 were defined by the authors as well-trained athletes, some groups may have been at a higher competitive level or have had more competitive experience. These athletes may have paced themselves more reproducibly during the tests, either because they had a better "feel" for pace or because their perceptions of effort or fatigue that limit pace or performance were less variable. Experience with the ergometer may also be a related factor. In the 2000-m rowing test, the athletes were accustomed to using the ergometer in training and in performance trials for team selection ^{(19)}. In the tests undertaken on the Kingcycle ergometer, the cyclists were able to ride their own racing bikes instead of conventional ergometers, which may not match their normal riding position.

Type of ergometer. Some ergometers are probably more stable than others in providing resistance to the athlete's power output. For example, the Concept II rowing ergometer (Concept II Ltd., Morrisville, VT) provides a fixed resistance requiring no calibration, whereas the rolling resistance on the Kingcycle ergometer has to be set for each subject and may drift with changes in temperature, humidity, and the pressure of the bicycle tires. The reliability of the Cybex MET100 may also be higher than that of the Kingcycle, because it also does not have a frictional component that needs calibrating.

Specificity of the test. Tests with a pattern of pacing that mimics an event ought to be more reliable than other tests, because the athletes can reproduce a familiar pacing strategy in the tests ^{(4)}. But the data in Table 1 show little evidence for this concept. For example, the test for lactate threshold produces highly reliable estimates of endurance race pace from only a few 5-min bouts at submaximal intensities; the constant-power test to exhaustion is not particularly race specific, yet it appears to be one of the more reliable tests; and the 100-km cycling test that includes sprints to simulate real road races has the worst reliability.

VALIDITY OF TESTS
Regardless of the reliability of a test, the researcher needs to be satisfied that enhancements measured in the test will be reproduced in the event. In the past researchers have addressed this problem by performing a validity study in which performance of a group of athletes in the test is correlated either with their performance in the event or with current personal best performance. If a high correlation is obtained, the researchers conclude that the test is valid.

There are several problems with this approach. First, no one has presented a rationale for deciding how high the correlation should be before the test is considered valid. Is 0.90 acceptable, or does it have to be at least 0.95? Second, the magnitude of the correlation is sensitive to the heterogeneity (between-subject variation) of the subjects, but no one has taken this effect into account when assessing the observed correlation. Finally, the approach is based on the assumption that a strong relationship between single performances in the test and event implies a strong relationship between changes in performance in the test and the event, but no one has analyzed whether this assumption is tenable. Indeed, the assumption is impossible to verify directly because it would require a study in which athletes perform the test and the event both before and after every experimental treatment that is ever likely to be investigated with the test. Furthermore, the experimental treatments would have to change performance.

We have devised a method for assessing the validity of tests that addresses these problems. It is based on a new statistical procedure that allows estimation of random variation contributed by several sources in repeated-measures data (Appendix 3). The researcher has to perform a reliability study for the test and a reliability study for the event concurrently and with the same athletes. The analysis allows estimation of the usual within- and between-subject variations that are obtained in separate reliability studies, but it partitions the between-subject variations into three components: variation common to test and event, variation unique to the test, and variation unique to the event. These between-subject variations arise from corresponding factors that affect performance of the subjects in the test and event:

test performance = common factors + test factors + test error, (1)

event performance = common factors + event factors + event error. (2)

In these equations, the common factors give rise to the common variation, the test and event factors give rise to the variation unique to the test and event, and the error terms give rise to the within-subject variations in test and event (usually expressed as the CV for the test and the event).
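This partition can be illustrated with simulated data. The sketch below (in Python with NumPy; it is not the authors' mixed-model procedure of Appendix 3, and all component SDs are invented for illustration) uses the fact that, under the additive model above, the covariance between a subject's mean test score and mean event score estimates the common variance, and whatever between-subject variance remains after removing the common and within-subject parts is the variance unique to the test (or event).

```python
import numpy as np

rng = np.random.default_rng(1)
n_sub, n_trials = 400, 2                        # subjects; trials each of test and event
sd_common, sd_test, sd_event = 5.0, 0.5, 1.0    # hypothetical between-subject SDs (%)
cv_test, cv_event = 1.0, 1.5                    # hypothetical within-subject errors (%)

common = rng.normal(0.0, sd_common, n_sub)
test_mean = common + rng.normal(0.0, sd_test, n_sub)    # Equation 1, error omitted
event_mean = common + rng.normal(0.0, sd_event, n_sub)  # Equation 2, error omitted
test = test_mean[:, None] + rng.normal(0.0, cv_test, (n_sub, n_trials))
event = event_mean[:, None] + rng.normal(0.0, cv_event, (n_sub, n_trials))

# Covariance of subjects' mean test and mean event scores estimates the common
# variance; subtracting it and the (trial-averaged) within-subject variance from
# the between-subject variance of test means leaves the variance unique to the test.
mt, me = test.mean(axis=1), event.mean(axis=1)
var_common = np.cov(mt, me)[0, 1]                       # ~ sd_common**2
var_within_test = test.var(axis=1, ddof=1).mean()       # ~ cv_test**2
var_unique_test = mt.var(ddof=1) - var_common - var_within_test / n_trials
print(round(var_common, 1), round(var_within_test, 2), round(var_unique_test, 2))
```

With these inputs the estimates should land near 25, 1, and 0.25, apart from sampling error.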

For most tests and events the main contributors to the common variance are factors related to the output power of the athletes: maximum sustainable power, efficiency of motion, anaerobic capacity, and so on. An example of a factor unique to the test could be effort or motivation: some subjects may be reproducibly under-motivated to perform to maximum effort in a test while others hold nothing back, whereas all athletes might make their best effort in the event. Skill could be an example of a factor unique to the event when the test measures purely power output but the event requires skilled movement (example: a rowing ergometer vs on-water rowing). Anxiety could be another factor unique to the event if anxiety affects the performance of different subjects reproducibly in different ways. Other examples might be diet or supplementation strategies if these differ between subjects in the event and if researchers ensure that all athletes have the same values for these factors for the test. Equipment that varies between athletes in an event is another example, in a situation where the equipment affects performance and is not used in the test.

The validity analysis is easiest to understand when it reveals no unique variation in the test and event. In this ideal situation, a treatment can change performance in the test and event only by changing the common factors. Inasmuch as these factors are the same in test and event, the treatment must produce the same performance enhancement in the test and event, apart from the random test error and event error. When unique variation is present in the test or event, a factor affects performance uniquely in the test or event. A treatment might therefore change the value of this factor and thereby produce a different enhancement of performance in the test versus the event, or it might change the value of a common factor and produce the same enhancement. The researcher cannot tell; therefore, the enhancement in the test cannot be presumed to apply in the event and the test should not be used.

In a decision about whether the unique factors are negligible, the important issue is the relative magnitudes of the common variation and the unique variations. When these are expressed as standard deviations or CV, they represent the typical magnitudes of differences between subjects attributable to the common and unique factors. For example, the common factor, unique test factor, and unique event factor might account for variations of ± 5%, ± 0.5%, and ± 1% in performance between subjects. (Note: these numbers do not represent the so-called "variance explained.") A treatment would therefore enhance performance by 5-10 times more when it changes common factors than when it changes the unique factors. In this situation the researcher can be confident that most treatments will produce similar changes in performance in the test and in the event, provided the treatment does not produce changes in the values of the unique factors that are large relative to the changes in the common factors. On the other hand, when the common and unique factors produce similar variations (e.g., 3%), there is every possibility that a treatment will produce different enhancements in test versus event.
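These component magnitudes also determine what the traditional validity correlation would be. Under the simple additive model, the correlation between subjects' test and event scores (ignoring within-subject error) is the common variance divided by the product of the two between-subject SDs. A short sketch using the hypothetical values from the text:

```python
# Hypothetical component SDs from the example in the text (% units).
sd_common, sd_test, sd_event = 5.0, 0.5, 1.0

# Implied between-subject correlation of test and event performance.
r = sd_common**2 / ((sd_common**2 + sd_test**2) ** 0.5 *
                    (sd_common**2 + sd_event**2) ** 0.5)
print(round(r, 3))  # -> 0.976
```

A correlation of 0.98 can thus coexist with unique factors, which is one reason the magnitude of a correlation on its own cannot settle the question of validity.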

We recognize that the complexity of the statistical analysis will make this new approach to validity difficult for researchers to implement. In Appendix 4 we offer a less powerful but more accessible method to check on validity , based on the traditional approach of correlating test and event performances.

Researchers should note that both methods of validation depend on using a sample of athletes with a substantial between-athlete variation of performance. Thus it is impossible even in principle to validate tests only for the very best athletes. The only way to be reasonably certain that a treatment will enhance performance of the best athletes is to try the treatment with the best athletes in real events.

Finally, researchers need to be aware that a test with apparently high validity can produce opposite effects with the same treatment in the test and the event. This seemingly impossible outcome can occur when a condition that affects performance is held constant in the test and has a different constant value in the event. If the treatment then interacts with the condition, it could enhance performance in the test and hinder performance in the event. A validity study would not detect the potential for this anomaly because the factor representing the condition is constant in the test and the event and therefore cannot contribute to between-athlete variation in either. An example of such a condition is carbohydrate supplementation in studies of the effects of diet on endurance performance. In these studies researchers usually have not provided a carbohydrate supplement during the test ^{(5)} , yet in endurance events athletes usually consume considerable amounts of carbohydrate. It is possible that one diet is beneficial to performance in the test without supplementation and detrimental to performance when the athletes supplement in the event. Environmental temperature is another factor that could interact with certain treatments to give different outcomes in test and event. Researchers should be cautious in extrapolating the results of tests to events when they make the conditions of the test different from those of the event.

SPECIAL FEATURES OF RESEARCH ON PERFORMANCE ENHANCEMENT
The sporting press is dominated by reports of novel training programs and aids, equipment, diets, and dietary supplements that purport to enhance competitive performance. Against this background of anecdotes and testimonials, athletes and coaches are told that a scientific study, or the consensus from a number of controlled studies, provides the only sure way to establish the efficacy of a potential performance-enhancing treatment. For the most part, these studies are conducted by scientists whose primary interest is to understand the mechanisms controlling or limiting human physical performance. Features important in the design and execution of these studies, which are common to most experimental biomedical research, are summarized in Table 2 . We will not discuss these further but will focus instead on additional and alternative features that we believe are important for studies of competitive performance. These features are summarized in Table 3 .

TABLE 2: Desirable features of experimental studies aimed at understanding mechanisms of human physical performance.

TABLE 3: Additional or alternative features of studies aimed at quantifying athletic performance enhancement.

Sample Size; Reliable and Valid tests
We have already shown that, for realistic sample sizes, the CV of a performance test has to be equal to, or preferably less than, that of the event. At present there are not enough reliability studies of tests and events with elite athletes to determine whether any tests satisfy this requirement, so we can only speculate. We expect athletes to be intrinsically more reliable in their production of power for an event because they are likely to take more care in preparing for and performing an event. On the other hand, performance in tests is unaffected by extrinsic factors such as tactics, terrain, and environmental conditions, which may reduce the reliability of some events. Thus, shorter track events may be more reliable than any test that can be devised, but longer events such as road cycling may be less reliable than an appropriate laboratory-based test. For example, the simulated 100-km cycle time-trial (which incorporated 4 × 1 km and 4 × 4 km sprints) appears to have poor reliability relative to other tests in Table 1 , but the CV for this test is probably smaller than that of 100-km road races, owing to the way riders can "draft" behind each other in these races ^{(11,15)} .

High validity is another essential property of a performance test, but again, appropriate data are not available for making judgments about the validity of any tests. Our impression, based on many observations of highly trained athletes in numerous laboratory tests, is that the constant-distance, constant-work, and constant-duration tests simulate events well. Measures of lactate threshold may also have high validity even though they are based on incremental-exercise protocols that do not mimic a real event. The unique test and event factors may therefore turn out to be negligible for these tests. We are uncertain about the validity of tests in which the athlete maintains constant power to exhaustion. Performance in the longest of these tests may be limited partly by boredom rather than maximum sustainable power. In any case, the new method of analyzing validity does not lend itself to these tests, and enhancements in the tests are difficult to translate into enhancements in events.

Use of the Event
If the researcher cannot find or devise a test with better reliability than the event, or if there is concern that the test has not been validated for the very best athletes, there is no alternative but to perform the research using performance in the event itself. There are formidable problems with this approach: top athletes may be unwilling to be randomized to control and treatment groups; the time frame of an experiment will be dictated by the frequency of competitive events ; the study will almost certainly have to be conducted over an extended period involving many events that not all the athletes enter; and analysis of performance in multiple events with missing values requires an advanced statistical procedure not accessible to some researchers. If these problems can be overcome, the results of a study of performance enhancement would be definitive: no measure of performance is more valid for the athlete than performance in the actual event.

Selection of Best Athletes
The results of a research study apply with reasonable certainty only to populations that have similar characteristics to the sample under study. Elite athletes almost certainly have genetic endowment, training history, and training programs that differ from those of subelite athletes. A treatment may therefore produce different effects on performance in these two groups. It follows that the subjects in a study have to be elite athletes for the results to apply convincingly to elite athletes . Accessing a sufficient number of elite athletes for a study may be difficult because they are a scarce resource, and it may also be difficult to arrange a study that does not conflict with their commitments to training and competition. A shortfall of elite athletes , or indeed of any subjects, can be offset partly by performing extra trials before and after the treatment. For example, under most circumstances doubling the number of trials gives about the same increase in precision for the estimate of enhancement as doubling the number of subjects. Doubling up on trials also allows estimation of individual differences in the enhancement (see below).

If elite athletes cannot be recruited, research targeted at serious athletes should involve subjects who are at least well trained in the sport or mode of exercise that is being studied. Well-trained subjects are also likely to perform more reliably in the chosen performance task. Studies on untrained or recreationally trained subjects will therefore require more subjects to delimit small enhancements, but even with a large sample size the findings may not apply to athletes.

The focus of this paper is performance enhancement for athletes of the highest caliber, but we recognize that many subelite and recreational athletes are interested in improving their performance in events that range from regional championships to local events. Research aimed at estimating performance enhancements for these athletes should, of course, involve a sample of subjects that is representative of this group. The smallest worthwhile enhancement for this group presumably has the same basis as for elite athletes -an enhancement that would increase the chance of winning an event-but the events probably have greater between-athlete variation than Olympic finals. Enhancements will therefore need to be larger to make a difference in winning, so sample sizes will not need to be quite as large as for studies of elite athletes . The other issues of design discussed in this paper are also appropriate for research on subelite athletes, but we emphasize that the findings in these studies provide only suggestive evidence of the effects of treatments on elite athletes .

Replication of Training and Diet Between Trials
The time between trials in an experimental study of human performance is usually at least several days, because this period is needed either for the intervention to be effected or for the subjects to recover from the maximal effort of the previous test. Conditions and behaviors that affect performance can easily change over such a period of time, and if these do not change equally in the experimental and control groups, differences in performance in the groups cannot be attributed solely to the treatment. It is therefore usual for the experimenter to reproduce environmental conditions for the trials of the performance test and to exhort subjects to replicate their behaviors before each trial.

Two major behaviors that need to be replicated are training and diet, both of which affect performance in numerous ways. In some studies authors simply note that subjects were instructed to eat their usual diet and maintain usual training. In other investigations subjects record food intake before the first trial, then use the record to replicate intake before subsequent trials ^{(1)} . We have found that such strategies still allow for considerable variation in pretrial preparation. Furthermore, athletes frequently fail to comply with diets or training that they believe might disadvantage their long-term training or nutrition goals. Greater attention should therefore be devoted to educating subjects about the importance of the standardization of training and nutritional practices, especially 1-2 d before an experiment. If subjects have to be left to their own devices during this time, records of all food intake and exercise should be kept before each trial. Recordings of heart rate (HR) from monitors worn during the pretrial period may also be useful to check that training is replicated before baseline and experimental trials. Better standardization of pretrial behaviors can be achieved by providing subjects with all their food and by supervising their training before each trial. The extent of compliance to diet and training should be noted in the publication of the study.

Realistic Conditions and Behaviors
In the section on validity we pointed out that the effect of a treatment on performance in a test will not be reproduced in an event when conditions and behaviors that interact with the treatment are different for the test and the event. The researcher may not know which conditions and behaviors interact with the treatment, so it is crucial that conditions and athlete behaviors in a study are the same as those for competitions.

The most obvious conditions that could affect performance enhancement are environmental. For example, treatments that increase plasma volume might have no effect on endurance running tests undertaken in a cool environment. On the other hand, the same treatment may enhance performance in a competition held in a warm humid environment where fluid balance becomes one of the factors limiting performance. The competitive setting of the event is another condition that some researchers try to reproduce in the laboratory, for example, by having athletes "compete" side by side on identical ergometers ^{(19)} or by offering performance-based prizes ^{(14)} .

An obvious athlete behavior that needs to be reproduced for a test is training. Most competitive athletes follow periodized training programs in which they perform different amounts and types of training in different phases and cycles of their program. Training at the time of important competitions is almost certain to be different from training at other times of the year, yet most athletes could not or would not participate in a research study at the time of such competitions. When athletes do participate, they should attempt to prepare for the tests by training in the same way they would prepare for competitions; otherwise the findings may not apply to competitions. An important training behavior that needs to be replicated before tests is tapering, especially when the treatment being studied is some form of training. For example, a study of the effects of several weeks of dramatically increased hard training may well show a detrimental effect on performance if athletes do not taper before the tests. Yet the same training followed by a taper before an event may show an enhancement of performance.

Dietary behaviors also need to be replicated between test and event. If the treatment under investigation is also dietary, the need to replicate other dietary behaviors is all the more important because dietary behaviors, like training behaviors, are likely to interact in their effect on performance. In this respect research designs in published studies have been surprisingly unrealistic. Most endurance athletes consume a carbohydrate-rich meal before their event; yet in most studies on the effects of dietary supplements consumed during exercise, the subjects have been tested after an overnight fast. Similarly, most athletes consume carbohydrate-rich drinks or foods during endurance events, yet in studies of the effects of pre-exercise meals on performance, athletes have usually consumed only water during the tests. For example, 11 of 12 studies of the effects of carbohydrate ingestion 30-60 min before an endurance test have been designed in this manner ^{(5)} ; outcomes were somewhat different in the twelfth study, in which athletes consumed carbohydrate during the tests ^{(1)} .

Researchers have a major problem replicating training and dietary practices in a study if these practices differ among athletes. One solution is to impose the mean behavior on the athletes during the study, but the results of the study will then apply to an event only when athletes behave in this way for the event. A better but more costly solution is to perform a series of studies, each with a different behavior or a different combination of behaviors. In the next section we offer another solution.

Design and Analysis for Individual Differences
Subjects sampled from a population differ in their genetic and acquired characteristics. In experimental research these characteristics may interact with (i.e., modify the effect of) the treatment, resulting in individual differences in the response to the treatment. When individual differences are present, they increase the width of the confidence interval of the treatment effect, so more subjects are needed to obtain acceptable precision for the effect ^{(9)} . Furthermore, if the average characteristics differ between the treatment and control groups, any observed differences in the effect of the treatment could result from the interaction of the treatment and subject characteristics rather than a direct effect of the treatment. In other words, the treatment might have no substantial effect on the average subject in the population even though it produced an effect in the treatment group.

Researchers address the problem of individual differences by sampling a subset of a population that is as uniform as possible on characteristics such as sex, age, and fitness. As we have discussed in the previous section, they then usually select for or impose uniformity of training and dietary behaviors, sometimes unrealistically. If successful, this strategy eliminates individual differences , but the findings apply to performance in events only when athletes have the same characteristics and display the same behaviors as imposed in the study.

It may be possible to modify this approach for research focusing on athletic performance enhancement. Sampling would still need to be restricted to a subset of the population-in this case, elite or highly trained athletes-but restrictions on behavior could be relaxed to allow individual athletes to follow some of their usual training and dietary practices. These practices would be assayed and included in the analysis as several covariates to allow prediction of the effects of the treatment for subjects who differ in training or diet. In this way, the results of the study would be more applicable to real events. The approach might also make it easier to recruit elite athletes into studies that use events rather than tests to assay performance because the athletes would not have to modify their usual preparatory and performance practices. The appropriate statistical analysis shows that athletes do not even have to maintain the same behaviors between trials (W. G. Hopkins, unpublished observations, 1998), but issues related to sample size and errors in measurement of the covariates have to be addressed before recommendations can be made for such a radical departure from an accepted principle of research design.

Even when researchers adhere to the traditional approach of standardizing behaviors between and within athletes, they cannot standardize all the genetic and acquired characteristics of the athletes. Thus, depending on the treatment there are likely to be individual differences in the response to the treatment that arise from differences in the anatomy, biochemistry, physiology, and psychology of the athletes. Researchers should therefore design their studies to estimate individual differences in the outcome, either by including several tests before or after the treatment, and/or by assaying subject characteristics that might account for the individual differences ^{(9)} .

Placebo Effects in Unblinded Studies
When a treatment cannot be disguised sufficiently to conduct a blinded study, there is a possibility that any observed performance enhancement may be partly or wholly a placebo effect-a change brought about solely by the expectation of the athletes. But if the researcher finds a correlation between the change in performance and the change in an underlying variable involved in the mechanism of the treatment, the enhancement cannot be entirely a placebo effect. It is not known whether placebo effects are large enough to affect performance of elite athletes in competitive events , but recent research indicates that the effects may be substantial for subelite athletes in a laboratory setting ^{(2)} . Researchers should therefore include at least one likely mechanism variable in unblinded lab studies of sport performance. The athletes must be unaware of any change in the mechanism variable because a noticeable change could inspire a placebo effect (example: muscle girth in a study of strength). Variables measured during the performance test, such as blood lactate concentration or muscle electrical activity, are also not appropriate to exclude placebo effects because the greater effort inspired by a placebo effect can also change the values of such variables.

A substantial correlation between change in a mechanism variable and change in performance is possible only when there are individual differences in the response to the treatment. However, some apparent individual differences result from within-subject random error in performance between tests. This random error will therefore degrade the correlation between change in the mechanism variable and change in performance, making the magnitude of the correlation hard to interpret. The problem is solved by including the mechanism variable as a covariate in the analysis and by then interpreting the strength of the association directly from the solution for the covariate rather than by calculating a correlation. The mechanism variable needs to be as valid and reliable as possible because errors in its measurement will still degrade the strength of the association (W. G. Hopkins, unpublished observations, 1998).
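A small simulation (a hedged sketch, not from the paper; all numbers are invented) illustrates the attenuation. Here the true change in performance is exactly 0.8 times the change in the mechanism variable, yet within-subject error in the pre and post performance trials pulls the observed correlation down to about 0.5, while the regression slope for the covariate still recovers roughly the true coefficient:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
mech_change = rng.normal(0.0, 1.0, n)   # individual changes in the mechanism variable
true_resp = 0.8 * mech_change           # performance response, deterministic here

# Each observed change score carries within-subject error from two trials
# (pre and post), so its error SD is cv * sqrt(2).
cv = 1.0
obs_change = true_resp + rng.normal(0.0, cv * np.sqrt(2), n)

r = np.corrcoef(mech_change, obs_change)[0, 1]       # attenuated, ~0.5
slope = np.polyfit(mech_change, obs_change, 1)[0]    # ~0.8, the true coefficient
print(round(r, 2), round(slope, 2))
```

The slope, like the solution for a covariate in the analysis the text recommends, is interpretable directly as the change in performance per unit change in the mechanism variable, whereas the correlation is degraded by the test error.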

If there are unlikely to be individual differences in the effects of the treatment, the researcher should consider introducing some by varying the dose or nature of the treatment between subjects. The differences in the treatment can be entered into the analysis as a covariate. This strategy will help account for placebo effects only if the subjects are uncertain as to how the differences in the treatment are supposed to affect performance.

Interpretation of Outcomes for Athletes
Conclusions in a high proportion of studies in the exercise-science literature are based on the statistical significance of the effect without any consideration of the magnitude of the effect. In a typical example, a statistically nonsignificant enhancement of performance of 2.9 min in a time trial lasting 160 min was reported as no effect ^{(13)} . In reality there is a good chance that the treatment did have a real effect of about the observed magnitude (1.8%), and as we have seen, such an effect would probably be worthwhile for elite athletes . (The 95% confidence limits for this outcome would include an enhancement of at least twice this magnitude, which would definitely be worthwhile, but the exact confidence limits cannot be calculated because the authors did not provide the P value.) The correct conclusion in this case is: This treatment appeared to produce a performance enhancement that might benefit competitive athletes, but we would need to test more athletes to be sure.

Even when the observed effect is too small to be of interest to well-trained athletes, it is wrong to conclude there is no effect, unless the study has been performed with an adequate sample size. For example, on finding an enhancement in running speed of 0.1% with a P value close to 1 (e.g., 0.96), many researchers would conclude that there is no effect. But the confidence interval for the change in performance with this P value is about −4.0% to +4.2%. The correct conclusion in this case is: This treatment had no effect on performance with our sample, but until we test more subjects we cannot exclude the possibility of a substantial positive or negative effect for the average competitive athlete. Only with about 16 times as many subjects would the confidence interval be narrow enough (±1%) to exclude substantial effects.
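Confidence limits of this kind can be recovered from just the reported effect and its P value. The sketch below assumes a z-based (normal) sampling distribution; a t-based version with the study's degrees of freedom would give slightly wider limits, closer to the −4.0% to +4.2% quoted above.

```python
from statistics import NormalDist

def ci_from_p(effect, p, level=0.95):
    """Approximate confidence limits for an effect reported only with its
    two-sided P value, assuming a normal sampling distribution."""
    z_obs = NormalDist().inv_cdf(1 - p / 2)   # |effect|/SE implied by the P value
    se = abs(effect) / z_obs
    z_crit = NormalDist().inv_cdf((1 + level) / 2)
    return effect - z_crit * se, effect + z_crit * se

lo, hi = ci_from_p(0.1, 0.96)   # 0.1% enhancement, P = 0.96
print(round(lo, 1), round(hi, 1))  # -> -3.8 4.0
```

The near-unity P value implies an enormous standard error relative to the tiny observed effect, which is why the interval spans substantial positive and negative enhancements.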

To encourage greater understanding of research on performance enhancement among sport scientists, coaches, and athletes, we make the following recommendations for presentation of findings:

Report the outcome as a percent change in a measure of athletic performance and in mean power where possible.
Report the 95% confidence limits for the outcome. Describe the confidence limits as the likely range of the true effect of the treatment on the average subject for the benefit of those who do not understand statistical jargon.
Interpret the magnitude of the outcome and confidence limits in terms of the likely effect on athletes in an event.
Report recent best competitive performances of the athletes as a percent of the world record to make it clear to what caliber of athlete the outcome can be generalized.
Discuss the possibility of different outcomes for athletes who differ in caliber, training and dietary practices, and other characteristics.
CONCLUSIONS
In the light of the issues raised in this paper, we believe that most published estimates of the effect of a treatment on human physical performance cannot be assumed to apply to highly trained athletes in competitive events . Previous studies have been deficient on one or more of the following counts: the sample size was too small for adequate precision of the estimate of enhancement; the performance test had questionable validity or produced a measure of enhancement with an unclear relationship to performance in the event; the athletes were not of sufficient caliber; or their behaviors in the study were not representative of behaviors of athletes in training or competitions.

Research on performance enhancement is at an early stage of development. We can expect to see greater attention to questions of validity and reliability of performance tests, bigger sample sizes, studies with more realistic tests and athlete behaviors, design and analysis to account for individual differences , and studies based on performance in real competitions.

REFERENCES
1. Burke, L. M., A. Claassen, J. A. Hawley, and T. D. Noakes. No effect of glycemic index of pre-exercise meals with carbohydrate intake during exercise. J. Appl. Physiol. 85:2220-2226, 1998.

2. Clark, V. R., W. G. Hopkins, J. A. Hawley, and L. M. Burke. The size of the placebo effect of a sports drink in endurance cycling performance. Med. Sci. Sports Exerc. 30:S61, 1998.

3. Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2nd Ed. Hillsdale, NJ: Lawrence Erlbaum, 1988, pp. 37, 79.

4. Foster, C., M. A. Green, A. C. Snyder, and N. N. Thompson. Physiological responses during simulated competition. Med. Sci. Sports Exerc. 25:877-882, 1993.

5. Hawley, J. A. and L. M. Burke. Effect of meal frequency and timing on physical performance. Br. J. Nutr. 77:S91-S103, 1997.

6. Hawley, J. A. and T. D. Noakes. Peak sustained power output predicts V̇O_{2max} and performance time in trained cyclists. Eur. J. Appl. Physiol. 65:79-83, 1992.

7. Hickey, M. S., D. L. Costill, G. K. McConell, J. J. Widrick, and H. Tanaka. Day to day variation in time-trial cycling performance. Int. J. Sports Med. 13:467-470, 1992.

8. Hopkins, W. G. and J. R. Green. Combining event scores to estimate the ability of competitors. Med. Sci. Sports Exerc. 27:592-598, 1995.

9. Hopkins, W. G. and R. D. Wolfinger. Estimating "individual differences" in the response to an experimental treatment. Med. Sci. Sports Exerc. 30:S125, 1998.

10. Jeukendrup, A., W. H. M. Saris, F. Brouns, and A. D. M. Kester. A new validated endurance performance test. Med. Sci. Sports Exerc. 28:266-270, 1996.

11. Jeukendrup, A. and A. van Diemen. HR monitoring during training and competition in cyclists. J. Sports Sci. 16:S91-S99, 1998.

12. Lindsay, F. H., J. A. Hawley, K. H. Myburgh, H. H. Schomer, T. D. Noakes, and S. C. Dennis. Improved athletic performance in highly trained cyclists after interval training. Med. Sci. Sports Exerc. 28:1427-1434, 1996.

13. Madsen, K., D. A. MacLean, B. Kiens, and D. Christensen. Effects of glucose, glucose plus branched-chain amino acids, or placebo on bike performance over 100 km. J. Appl. Physiol. 81:2644-2650, 1996.

14. Palmer, G. S., S. C. Dennis, T. D. Noakes, and J. A. Hawley. Assessment of the reproducibility of performance testing on an air-braked cycle ergometer. Int. J. Sports Med. 17:293-298, 1996.

15. Palmer, G. S., J. A. Hawley, S. C. Dennis, and T. D. Noakes. HR responses during a 4-d cycle stage race. Med. Sci. Sports Exerc. 26:1278-1283, 1994.

16. Peronnet, F., G. Thibault, E. D. Rhodes, and D. C. McKenzie. Correlation between ventilatory threshold and endurance capability in marathon runners. Med. Sci. Sports Exerc. 19:610-615, 1987.

17. Pfitzinger, P. and P. S. Freedson. The reliability of lactate measurements during exercise. Int. J. Sports Med. 19:349-357, 1998.

18. Schabort, E. J., W. G. Hopkins, and J. A. Hawley. Reproducibility of self-paced treadmill performance of trained endurance runners. Int. J. Sports Med. 19:48-51, 1998.

19. Schabort, E. J., W. G. Hopkins, J. A. Hawley, and H. Blum. High reliability of performance of well-trained rowers on a rowing ergometer. J. Sports Sci. (in press).

20. Schabort, E. J., J. A. Hawley, W. G. Hopkins, I. Mujika, and T. D. Noakes. A new reliable laboratory test of endurance performance for road cyclists. Med. Sci. Sports Exerc. 30:1744-1750, 1998.

APPENDICES

Appendix 1. Effect of Performance Enhancements on Chance of Winning
This program, written for the Statistical Analysis System (SAS Institute, Cary, NC), generated the data shown in Figure 2 .

%let place = 4;    *average placing of a particular athlete;
%let noath = 15;   *number of athletes in the event;
%let between = 1;  *between-athlete SD;
*For between = 0, use between = 0.001;

data dat1;
  do trial = 1 to 10000;
    do n = 1 to &noath;
      true = -&between*rannor(0);
      *true = -&between*rangam(0, 1);  *nonnormal distribution;
      perf = true + rannor(0);  *within CV = 1% for all athletes;
      output;
    end;
  end;

proc rank out=dat1;
  var true;
  ranks ranktrue;
  by trial;

*adds enhancements;
data dat1;
  set;
  perf0p0 = perf;
  perf0p5 = perf;
  perf1p0 = perf;
  perf1p5 = perf;
  perf2p0 = perf;
  select (ranktrue);
    when (&place) do;
      perf0p5 = perf0p5 - 0.5;
      perf1p0 = perf1p0 - 1.0;
      perf1p5 = perf1p5 - 1.5;
      perf2p0 = perf2p0 - 2.0;
    end;
    otherwise;
  end;

proc rank out=dat1;
  var perf0p0 perf0p5 perf1p0 perf1p5 perf2p0;
  by trial;

data dat1;
  set;
  if ranktrue = &place;

proc freq;
  tables perf0p0 perf0p5 perf1p0 perf1p5 perf2p0/nofreq;
  title "Effect of adding enhancement of 0.0 to 2.0 CVs";
  title2 "for between=&between, placegetter of true rank = &place";
  title3 "and &noath athletes in the event,";
  title4 "for normally distributed between-athlete ability";
  *title4 "for gamma distribution of ability, with shape parameter = 1";
  title5 "(the first line in each table gives the chance of a gold medal)";
run;
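For readers without access to SAS, the logic of this program can be sketched in Python with NumPy. This is our minimal translation, not part of the original: the function name `chance_of_winning` and the random seed are our choices, and the parameters mirror the macro variables above (place, noath, between), with the within-athlete CV fixed at 1 as in the SAS code.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed for reproducibility

def chance_of_winning(place=4, noath=15, between=1.0,
                      enhancement=0.0, trials=10000):
    """Estimate the chance that the athlete of a given true rank wins
    after an enhancement (in units of within-athlete CV) is added.
    Lower scores are better, as in the SAS program."""
    wins = 0
    for _ in range(trials):
        true = -between * rng.standard_normal(noath)  # true ability
        perf = true + rng.standard_normal(noath)      # within-athlete CV = 1
        # the athlete whose true ability ranks at 'place' (1 = best)
        athlete = np.argsort(true)[place - 1]
        perf[athlete] -= enhancement                  # subtract = enhance
        wins += int(np.argmin(perf) == athlete)
    return wins / trials

p0 = chance_of_winning(enhancement=0.0)  # baseline chance of gold
p2 = chance_of_winning(enhancement=2.0)  # chance after a 2-CV enhancement
```

As with the SAS program, increasing the enhancement from 0 to 2 CVs markedly raises the simulated chance of a gold medal for the rank-4 athlete.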

Appendix 2. Sample Sizes Based on Confidence Limits
In this new approach, the researcher uses a sample size that gives acceptable 95% confidence limits for the observed performance enhancement. For the smallest worthwhile enhancement d, the acceptable confidence limits are ±d. For a crossover study, the performance enhancement is the difference between performance in an event after and before a treatment. If the within-athlete SD (or coefficient of variation) is CV, then the SD of the mean difference score for a sample of size n is given by CV√2/√n. The 95% confidence limits are therefore ±tCV√2/√n, where t is the appropriate value of the t statistic. The approximate value of t is 2, so if d = 0.7CV, then 0.7CV = 2CV√2/√n, i.e., n = 16. For a fully controlled study, the difference in performance between events after and before a control treatment is subtracted from a similar difference score for the experimental treatment. The SD of the combined difference score is therefore multiplied by √2, so twice as many subjects are needed, in twice as many groups, or a total of four times as many subjects as in a crossover.

To derive the sample size for a study using a test rather than the event, one has to assume that performance enhancement in the test has the same magnitude on average as in the event. From the above equations, the confidence interval of the enhancement in the test or event is proportional to CV/√n. The confidence interval for an enhancement in the test must therefore equal the confidence interval for the enhancement in the event when CV_{T} /√n_{T} = CV_{E} /√n_{E} , where subscripts T and E represent test and event. Rearranging, n_{T} = n_{E} (CV_{T} /CV_{E} )^{2} . For example, if the test has a CV of 1.7% and the CV of the event is 2%, (1.7/2)^{2} or about three-quarters as many subjects would be needed for a study using the test as for a study using the event.

Appendix 3. A New Method for Analysis of Validity
This method is a combined analysis of a reliability study for the test and a reliability study for the event. The SAS program that performs the analysis first generates data for a sample of athletes who perform two tests and enter two events. The data are drawn from a population with between- and within-athlete variations, expressed as CV, as follows:

between-athlete CV common to test and event = 5;
between-athlete CV unique to test = 0.5;
between-athlete CV unique to event = 1.0;
within-athlete CV in test and event = 1.5.
The data are then analyzed with Proc Mixed, which is set up to estimate the above CVs as variances, which SAS labels LIN(1) through LIN(4) in the output. LIN(5) is the estimate of the difference between the within-athlete variances in test and event, something that should concern the researcher who wants to use a particular test. (Any difference between the unique between-athlete CVs is not an issue.)

By running the program you will find that a sample size of at least 40 is needed for acceptable confidence limits on the estimates of the between-athlete CVs and on the difference between the within-athlete CVs. Estimation of the within-athlete CVs is more precise if fewer athletes complete more tests and events, but the precision of the between-athlete CVs is reduced with fewer athletes, regardless of the number of tests and events.

Note that Proc Mixed allows you to use data for athletes with missing values in some tests and events. This feature is particularly important when you are analyzing data from events.

data dat1;
  ssize = 40;  *sample size (set to 400 to see precise estimates);
  do athlete = 1 to ssize;
    factorc = 1 + 5*rannor(0);  *common CV;
    factort = 0.5*rannor(0);    *unique test CV;
    factore = 1*rannor(0);      *unique event CV;
    test1 = factorc + factort + 1.5*rannor(0);
    test2 = factorc + factort + 1.5*rannor(0);
    event1 = factorc + factore + 1.5*rannor(0);
    event2 = factorc + factore + 1.5*rannor(0);
    output;
  end;
run;

data dat2;
  set dat1;
  perf = test1; trial = 1; output;
  perf = test2; trial = 2; output;
  perf = event1; trial = 3; output;
  perf = event2; trial = 4; output;

*Covariance matrix;
data dat3;
  input parm row col1-col4;
  datalines;
1 1 1 1 1 1
1 2 1 1 1 1
1 3 1 1 1 1
1 4 1 1 1 1
2 1 1 1 0 0
2 2 1 1 0 0
3 3 0 0 1 1
3 4 0 0 1 1
4 1 1 0 0 0
4 2 0 1 0 0
4 3 0 0 1 0
4 4 0 0 0 1
5 1 1 0 0 0
5 2 0 1 0 0
;
run;

proc mixed data=dat2 covtest cl;
  class athlete trial;
  model perf = trial;
  repeated/subject=athlete type=lin(5) rcorr ldata=dat3;
  parms (25) (0) (0) (1) (1);
run;
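The variance components in this model can also be recovered without Proc Mixed by the method of moments, which makes the structure of the analysis transparent. The following Python sketch is our illustration under the same simulated model (the large sample size is our choice, used so the moment estimates sit close to the true values; with only 40 athletes they would be noisy, which is why the text above discusses confidence limits):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000  # large sample so the moment estimates are close to the truth

# Same generating model as the SAS data step (values are CVs used as SDs)
common = 5.0 * rng.standard_normal(n)   # between-athlete, common to test/event
uniq_t = 0.5 * rng.standard_normal(n)   # between-athlete, unique to test
uniq_e = 1.0 * rng.standard_normal(n)   # between-athlete, unique to event
test1 = common + uniq_t + 1.5 * rng.standard_normal(n)
test2 = common + uniq_t + 1.5 * rng.standard_normal(n)
event1 = common + uniq_e + 1.5 * rng.standard_normal(n)
event2 = common + uniq_e + 1.5 * rng.standard_normal(n)

cov = lambda a, b: np.cov(a, b)[0, 1]
common_var = cov(test1, event1)                 # cross test-event covariance
uniq_t_var = cov(test1, test2) - common_var     # test-retest minus common
uniq_e_var = cov(event1, event2) - common_var   # event-retest minus common
within_t = np.var(test1 - test2, ddof=1) / 2    # within-athlete test variance
```

The test-event covariance isolates the common between-athlete variance (≈ 5² = 25); the retest covariances add the unique components, and the variance of the test-retest difference gives twice the within-athlete variance (≈ 1.5² = 2.25).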

Appendix 4. A More Accessible Check on Validity
This method is based on correlating event performance against test performance, which is the traditional method of analyzing validity. It provides an estimate of the sum of the variances of any unique event and test factors, but it does not provide a confidence interval for that estimate.

First use a repeated-measures analysis or ANOVA of the log-transformed test scores to determine the coefficient of variation (CV_{T} ) for several (n_{T} ) trials of the test with the best possible athletes ^{(18-20)} . Similarly, determine the coefficient of variation (CV_{E} ) for several (n_{E} ) events entered by these athletes. Next, regress the mean event performance for each athlete (Y axis) against the mean test performance (X axis) and compute the standard error of the estimate (SEE_{E} ) and the correlation (r). If athletes missed one or more of the events (or tests), use least-squares means rather than raw means ^{(8)} . If r is reasonably high (0.9 or more), it can be shown that CV^{2} _{U} = SEE^{2} _{E} − CV^{2} _{T} /n_{T} − CV^{2} _{E} /n_{E} (Equation A1), where CV^{2} _{U} is the sum of the unique test and event variances. If CV_{U} is much less than the between-athlete CV (calculated from any trial of the test or the event), the test is valid. For example, if the CV of the test is 1.0% and the CV of the event is 2.0%, and you observe 1.6% for the SEE of the mean performance of two events regressed against the mean performance of two trials of the test, then the unique variances are zero and the test is valid.
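The worked example can be checked numerically. The Python sketch below is our illustration (the function name is hypothetical), using the relation CV_{U} ² = SEE_{E} ² − CV_{T} ²/n_{T} − CV_{E} ²/n_{E} reconstructed from the worked example in this paragraph:

```python
import math

def unique_cv(see_e, cv_t, n_t, cv_e, n_e):
    """CV_U from CV_U^2 = SEE_E^2 - CV_T^2/n_T - CV_E^2/n_E."""
    cv_u_sq = see_e**2 - cv_t**2 / n_t - cv_e**2 / n_e
    return math.sqrt(max(cv_u_sq, 0.0))  # sampling error may push it below 0

# Worked example: test CV 1.0%, event CV 2.0%, two trials of each,
# observed SEE of 1.6% for mean event vs. mean test performance
cv_u = unique_cv(see_e=1.6, cv_t=1.0, n_t=2, cv_e=2.0, n_e=2)
```

Here CV_{U} ² = 2.56 − 0.5 − 2.0 = 0.06, i.e., CV_{U} is negligible next to a typical between-athlete CV of several percent, so the test would be judged valid.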

A check on validity can also be performed with correlations. If r_{T} and r_{E} are the retest correlations for the mean of the test and event scores, and if r is reasonably high (0.9 or more), then r = √(r_{T} r_{E} ) implies that the unique test and event variations are negligible and that the test is therefore valid. The retest correlations for the mean are given by r_{T} ≈ 1 − (1 − r_{t} )/n_{T} and r_{E} ≈ 1 − (1 − r_{e} )/n_{E} , where r_{t} and r_{e} are the usual retest correlations for individual performances in the test and event. A problem with this method is that one does not know how much less than √(r_{T} r_{E} ) the validity correlation r should be before the test is declared invalid.
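The correlation check reduces to two small formulas, sketched here in Python as our illustration (function names and the example r values are hypothetical):

```python
import math

def mean_retest_r(r_single, n):
    """Approximate retest correlation for the mean of n trials,
    from r_mean ~ 1 - (1 - r_single)/n as given in the text."""
    return 1 - (1 - r_single) / n

def expected_validity_r(r_t_single, n_t, r_e_single, n_e):
    """Validity correlation expected when unique variations are
    negligible: r = sqrt(r_T * r_E)."""
    r_t = mean_retest_r(r_t_single, n_t)
    r_e = mean_retest_r(r_e_single, n_e)
    return math.sqrt(r_t * r_e)

# Hypothetical example: individual retest correlations of 0.92 (test)
# and 0.85 (event), two trials of each
r_expected = expected_validity_r(0.92, 2, 0.85, 2)
```

An observed validity correlation close to `r_expected` would suggest the unique test and event variations are negligible; how far below it r may fall before the test is declared invalid remains, as noted, an open question.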