Introduction
Statistics provide a basis for evaluating strength and conditioning programs. Indeed, standard scores (z scores) are described in Essentials of Strength Training and Conditioning (2), the textbook that serves as the principal resource for prospective and practicing strength and conditioning specialists. Standard scores rank-order performances (that is, who is best, second best, and so forth), but they are less useful for detecting unusual changes in performance. In this research note, a new statistic for evaluating strength and conditioning programs is described: the standard difference score (SDS). The note describes how to calculate, plot, and evaluate SDSs using Microsoft Excel (Microsoft Corporation, Redmond, WA, USA), along with how to interpret SDSs for strength and conditioning programs. Standard difference scores overcome the proportional biases associated with the measurement statistics on change in performance scores that are used in validity and reliability studies. Moreover, SDSs, when considered with skewness, can be substituted for conventional exploratory statistics generated using higher end statistical software programs. This research note summarizes the calculation and use of this novel statistic.

A Review of Standard Scores
Standard scores are basic statistics described for evaluating strength and conditioning programs (2). Specifically, the z score provides a ranking of an individual on a given test and is calculated by subtracting the mean from the individual's score and dividing by the SD. Therefore, a z score of 0.0 is equivalent to the mean, a z score of 1.0 is 1 SD above the mean, and a z score of 2.0 is 2 SD above the mean (Figure 1). Conversely, negative z scores represent rankings below the mean. With strength and conditioning data, rankings on z scores should be based on relative rather than absolute measurements (e.g., relative to body mass). For instance, if 2 athletes of different body masses (e.g., 80 and 100 kg) both attained bench press test scores of 110 kg, they would have identical z scores on the bench press test if their absolute scores were considered; however, the athlete with a body mass of 80 kg should be assigned a higher z score based on relative performance.
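
As a minimal sketch of the z score calculation described above, the following Python snippet standardizes bench press scores on both an absolute and a body-mass-relative basis. The data and the helper name z_scores are hypothetical and serve only as an illustration; the sample SD is assumed.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return (score - mean) / SD for each value, using the sample SD."""
    m = mean(values)
    sd = stdev(values)
    return [(x - m) / sd for x in values]

# Hypothetical data: (body mass in kg, bench press in kg) for five athletes.
# The first two athletes mirror the 80 kg and 100 kg example in the text.
athletes = [(80, 110), (100, 110), (90, 100), (85, 95), (95, 120)]

absolute = [bench for _, bench in athletes]
relative = [bench / mass for mass, bench in athletes]

z_abs = z_scores(absolute)  # identical for the first two athletes
z_rel = z_scores(relative)  # higher for the lighter athlete
```

On absolute scores, the first two athletes receive the same z score; on relative scores, the 80 kg athlete ranks higher.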

Figure 1: Normal and skewed distributions.

Z scores must be inverted (i.e., multiplied by −1) when lower values are deemed better (e.g., sprint times, error scores, or percentage body fat). Z scores offer the advantage that they can be aggregated to generate a composite performance score because they account for differences in variance between tests, whereas percentiles do not (5). When z scores are averaged from multiple tests, the mean score will remain at 0.0, whereas the SD will not remain at 1.0. Rather, the SD will assuredly shrink below 1.0 because of random and systematic errors in the given measurements. Mathematically, it is improbable for athletes to maintain identical percentile rankings relative to the mean on every test administered. Indeed, the inclusion of multiple tests within the test battery, along with tests that tend not to correlate well with each other (e.g., vertical jump vs. percentage body fat), causes greater shrinkage of the range and SD for composite z scores (Figure 2).
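
The inversion and the shrinkage of the composite SD can be illustrated with a short Python sketch. The test battery, the data, and the lower_is_better parameter are hypothetical; the snippet assumes an unweighted average of the z scores from each test.

```python
from statistics import mean, stdev

def z_scores(values, lower_is_better=False):
    """Standardize values; multiply by -1 when lower raw values are better."""
    m, sd = mean(values), stdev(values)
    sign = -1.0 if lower_is_better else 1.0
    return [sign * (x - m) / sd for x in values]

# Hypothetical test battery for six athletes.
vertical_jump_cm = [50, 62, 55, 48, 70, 58]
sprint_40m_s = [5.4, 5.1, 5.6, 5.0, 5.3, 5.2]  # lower is better
body_fat_pct = [12, 18, 10, 15, 14, 20]        # lower is better

z_tests = [
    z_scores(vertical_jump_cm),
    z_scores(sprint_40m_s, lower_is_better=True),
    z_scores(body_fat_pct, lower_is_better=True),
]

# Composite z score: the average of each athlete's z scores across tests.
composite = [mean(per_athlete) for per_athlete in zip(*z_tests)]
composite_mean = mean(composite)  # remains at 0.0 (within rounding)
composite_sd = stdev(composite)   # falls below 1.0
```

Because the three tests do not rank the athletes identically, the composite SD is smaller than 1.0 even though each individual test has an SD of exactly 1.0.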

Figure 2: Shrinking of distribution as standard scores are averaged from multiple tests.

A Review of Normality and Skewness
A normal distribution is characterized by an equal distribution of scores below and above the mean. Tests containing extreme scores will shift the mean relative to the median, causing a skewed distribution (Figure 1). For instance, if an athlete has an extremely high vertical jump performance relative to the rest of the team, the score from that performance will inflate the mean relative to the median, resulting in a positive skewness statistic. Determining whether negative or positive skewness has an effect on the normality of a distribution requires calculating the standard error of skewness (SES) as SES = √(6/n), where n represents the number of athletes within the group. By convention, a distribution is considered to violate normality when the absolute value of the skewness statistic equals or exceeds 2 times the SES (2 × SES) (5).
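
The skewness check can be sketched in Python as follows. The vertical jump data are hypothetical, and the skewness formula is an assumption: it is the adjusted sample skewness that, to the best of my understanding, corresponds to what spreadsheet SKEW functions report. The SES follows the √(6/n) approximation given in the text.

```python
from math import sqrt
from statistics import mean, median, stdev

def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum of cubed z scores."""
    n = len(values)
    m, sd = mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / sd) ** 3 for x in values)

def standard_error_of_skewness(n):
    """SES approximated as sqrt(6 / n), as described in the text."""
    return sqrt(6 / n)

# Hypothetical vertical jumps (cm); the last athlete is an extreme performer.
jumps_cm = [48, 50, 51, 52, 53, 54, 55, 56, 57, 85]

skew = sample_skewness(jumps_cm)
ses = standard_error_of_skewness(len(jumps_cm))
violates_normality = abs(skew) >= 2 * ses

# The extreme score pulls the mean above the median (positive skew).
mean_above_median = mean(jumps_cm) > median(jumps_cm)
```

In this example, the single extreme jump produces a positive skewness statistic that exceeds 2 × SES.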

A Review of Typical Methods for Assessing Change in Strength and Conditioning Performance Scores
Standard scores serve to rank-order athletes at both pre- and posttesting. Inferential statistics, such as the t-test or the analysis of variance, help determine whether a group has experienced significant change in performance overall; however, neither standard scores nor inferential statistics can easily detect individuals whose rank order shifts relative to the rest of the team from pre- to posttesting. As such, calculating change in strength and conditioning program performance scores can be useful.

Change in raw performance scores can be sorted and rank-ordered; however, detecting athletes whose rank order changes considerably can be difficult without knowing the magnitude of the difference relative to the mean and SD. One technique for identifying large changes in measurement, relative to a group of scores, is the Bland-Altman (B-A) plot (1). The basis of the B-A plot is to calculate change scores for each subject and display lines perpendicular to the y-axis at the mean difference score, the +2 SD difference score, and the −2 SD difference score. The x-axis for the B-A plot is the arithmetic average of the pre- and posttesting scores. On the B-A plot, changes in strength and conditioning program performance are considered extreme when scores reside at or beyond ±2 SD.
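
The quantities plotted in a B-A plot can be sketched in Python as shown here. The pre- and posttest values are hypothetical, and the snippet computes only the plotted quantities (no drawing), assuming the ±2 SD convention described above.

```python
from statistics import mean, stdev

# Hypothetical pre- and posttest scores (kg) for eight athletes.
pre = [100, 110, 120, 95, 130, 105, 115, 125]
post = [105, 113, 126, 99, 118, 110, 121, 129]

# y-axis: change score; x-axis: average of pre- and posttest scores.
differences = [after - before for before, after in zip(pre, post)]
averages = [(before + after) / 2 for before, after in zip(pre, post)]

mean_diff = mean(differences)
sd_diff = stdev(differences)
upper_limit = mean_diff + 2 * sd_diff
lower_limit = mean_diff - 2 * sd_diff

# Zero-based indices of athletes whose change is at or beyond +/-2 SD.
extreme = [
    i for i, d in enumerate(differences)
    if d >= upper_limit or d <= lower_limit
]
```

Here, only the fifth athlete (index 4), whose score dropped by 12 kg, falls outside the limits.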

Hopkins, a sport scientist and expert on statistics, described previously how the B-A plot is prone to artifactual proportional bias and therefore has cautioned against its use for validity (4) and reliability (3) measurements. Specifically, Hopkins illustrated how subjects with higher scores on a given measurement have higher absolute measurement errors than subjects with lower scores. For instance, assume that the coefficient of variation for vertical jumping performance is 2%. An athlete with a vertical jump of 50 cm would therefore have an absolute measurement error of 1 cm, whereas an athlete with a vertical jump of 25 cm would have an error of 0.5 cm (i.e., half the absolute error). The B-A plot, in this example, would show a proportionately greater error for the athletes capable of jumping higher. In essence, Hopkins argued that the statistic itself yields an error. Such an artifact would also need to be considered in reverse when lower scores are deemed better (e.g., sprint times, percentage body fat); that is, for these examples, slower sprint times and higher percentages of body fat would be biased. One must keep in mind that Hopkins framed his arguments toward researchers considering the validity or reliability of a measurement. Evaluating changes in response to a strength and conditioning program imposes these measurement challenges along with the additional challenge that one should expect athletes to respond to training at different rates.

In evaluating changes in responses to a strength and conditioning program, one would expect greater change in measurements for subjects starting out at lower rankings. That is, athletes with poorer performances or physiologic measures at pretesting intuitively would have a greater capacity to improve in response to a strength and conditioning program than athletes with better performances or physiologic measures. Such an issue renders both the B-A plot and the recommended alternative statistics of regression or typical error calculations (3,4) less useful for evaluating changes in performance scores. The following SDS technique provides a simple alternative that avoids these problems.

Introducing the Standard Difference Score
The calculation of SDS, using Microsoft Excel, requires a few simple steps (note: brackets surround commands, where G:G denotes the column and rows of data to include in the calculation). With pre- and posttesting scores arranged in 2 columns, a third column of change in raw performance scores is tabulated by subtracting pretest from posttest scores. The mean and SD are needed for calculating z scores and are obtained using the commands [=AVERAGE(G:G)] and [=STDEV(G:G)], respectively. The SDS is a z score on the change in performance score (i.e., posttest minus pretest), where the mean difference score is subtracted from each individual difference score and the result is divided by the SD of the difference scores (note: using the $ symbol before the column and row for the mean and SD cells will lock each cell). Finally, the skewness statistic along with 2 × SES should be calculated. Skewness is calculated using [=SKEW(G:G)], whereas 2 × SES is calculated using [=2*SQRT(6/COUNT(G:G))], where the COUNT command returns the sample size for that given test.
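
For readers working outside a spreadsheet, the steps above can be sketched in Python. The squat data are hypothetical, and the comments map each quantity to the spreadsheet command it is assumed to mirror (sample SD for STDEV, adjusted sample skewness for SKEW).

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical pre- and posttest squat scores (kg) for ten athletes.
pre_squat = [140, 150, 135, 160, 145, 155, 130, 165, 150, 142]
post_squat = [148, 157, 144, 166, 153, 161, 139, 171, 140, 150]

# Third column: change scores (posttest minus pretest).
changes = [after - before for before, after in zip(pre_squat, post_squat)]

n = len(changes)             # COUNT
mean_change = mean(changes)  # AVERAGE
sd_change = stdev(changes)   # STDEV (sample SD)

# SDS: z score on the change score for each athlete.
sds = [(c - mean_change) / sd_change for c in changes]

# Skewness of the change scores (assumed equivalent to SKEW) and 2 x SES.
skewness = n / ((n - 1) * (n - 2)) * sum(z ** 3 for z in sds)
two_ses = 2 * sqrt(6 / n)
significantly_skewed = abs(skewness) >= two_ses
```

In this example, the ninth athlete (index 8) declined by 10 kg while teammates improved, so that athlete has the lowest SDS and the distribution of change scores is negatively skewed.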

Standard difference scores can be sorted and rank-ordered, whereby scores at or beyond ±2.0 on an individual test represent approximately the outer 5% of changes in performance. As with z scores on a single test, SDSs should be calculated using relative and not absolute measurements when applicable (e.g., bench press relative to body mass) and should be inverted when lower scores are deemed better. Standard difference scores should be plotted without use of an x-axis to avoid proportional bias (Figure 3). For instance, the SDS plot of squat performance in Figure 3 illustrates that athlete 11 is very close to an SDS value of −2.0. Such a drastic response to a strength and conditioning program might not appear obvious in a data table or in a plot of raw data using a spaghetti graph (Figure 4); however, given this score relative to those of teammates (e.g., the next lowest scores at or about −1.0), athlete 11 is detected easily in the plot. Exploration of SDSs on other tests revealed that athlete 11 had an SDS of −1.11 on the bench press (Figure 3), thereby exhibiting a pattern of decline in performance warranting screening by the sports medicine staff for overreaching/overtraining syndrome.
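
Sorting and flagging SDSs, including the inversion for tests where lower scores are better, might look like the following Python sketch. The sprint times, the function name standard_difference_scores, and the lower_is_better parameter are hypothetical and are not drawn from the data in Figure 3.

```python
from statistics import mean, stdev

def standard_difference_scores(pre, post, lower_is_better=False):
    """z scores on change scores, inverted when lower raw scores are better."""
    changes = [after - before for before, after in zip(pre, post)]
    m, sd = mean(changes), stdev(changes)
    sign = -1.0 if lower_is_better else 1.0
    return [sign * (c - m) / sd for c in changes]

# Hypothetical 40 m sprint times (s) for eight athletes; lower is better.
pre = [5.40, 5.10, 5.60, 5.00, 5.30, 5.20, 5.50, 5.25]
post = [5.30, 5.02, 5.49, 4.93, 5.21, 5.55, 5.41, 5.17]

sds = standard_difference_scores(pre, post, lower_is_better=True)

# Rank-order athletes (numbered from 1) from lowest to highest SDS.
ranked = sorted(enumerate(sds, start=1), key=lambda pair: pair[1])

# Flag athletes at or beyond +/-2.0.
flagged = [athlete for athlete, score in ranked if abs(score) >= 2.0]
```

Athlete 6 slowed by 0.35 s while teammates improved, so after inversion that athlete has the lowest SDS and is the only one flagged.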

Figure 3: Plots of SDSs (z scores) and composite z scores of 3 tests. SDS = standard difference score.

Figure 4: Individual absolute squat performances at pre- and posttesting.

When the skewness statistic is ≥2 × SES, athletes scoring at or beyond ±2 SD should be highly scrutinized. The reason for this precaution is that when data are skewed, one or more scores caused the overall change scores to violate normality. Significant skewness shifts the mean relative to the median (Figure 1) and expands the ±2 SD limits. Athletes with SDSs at or beyond ±2 SD within a skewed distribution should be highly scrutinized because these changes in performance are extreme relative to those of the other athletes, who presumably engaged in the same strength and conditioning program. It should be noted that among smaller data sets (e.g., <20 athletes), one athlete exhibiting an extreme change in performance can cause skewness.

Regardless of whether SDSs are normally distributed or skewed, plots of SDSs can be used as a substitute for higher end statistical software programs and, in some instances, may be more advantageous. To justify this claim, consider the following example. Data on the same group of athletes appearing in Figure 3 were explored using a higher end statistical software program. Figure 5 illustrates a Q-Q plot for change in squat performance data (note: changes in bench press performance are expressed relative to body mass). These data are positively skewed, most likely by scores from athletes 13, 48, and 61. Notice on the Q-Q plot that the departure of athlete 11 from the rest of the team is less pronounced because of the positive skewness. Conversely, each of these athletes, including athlete 11, is detected using the simpler SDS plot (Figure 3). Because the 3 extreme positive scores have inflated the mean relative to the median (Figure 1), and correspondingly the SD, scores at or near ±2 SD represent extraordinary changes in performance. Issues such as injuries at pretesting might explain a large increase in squat performance. In the case of athlete 11, the decline in performance on the bench press (i.e., an SDS of approximately −1.0), a test of a completely different body part, raised greater concern. The main point is that athlete 11 could quite conceivably have gone undetected had these data been explored with conventional techniques as opposed to the SDS plot.

Figure 5: Sample exploratory plot for change in squat performance.

Standard difference scores may also be averaged, or aggregated, to allow for evaluating an overall change in performance based on a battery of tests. As indicated previously, when the z scores of multiple tests are averaged, the SD will most assuredly shrink (Figure 2). Therefore, the ±2 SD values must be accurately calculated to detect extraordinary responses to a strength and conditioning program. An example of a composite z score plot is illustrated in Figure 3 (right panel). As depicted in this plot, the mean z score was 0.0, whereas the ±2 SD limits were ±1.39, as opposed to ±2.0. The strength and conditioning professional may wish to apply weightings to the individual tests in a composite z score (by default, z scores from each test within a test battery are weighted evenly in a composite score). Such a weighting is easily applied with proportions (i.e., a 20% contribution equals 0.20). In generating weightings, one should consider the relative importance of each measure in a composite score based on the phase of periodization. For instance, if the phase of the season is focused on intensity and acquisition of power, such measures may be weighted more substantially.
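
A weighted composite and its recalculated ±2 SD limits can be sketched in Python as follows. The change scores, test names, and weights are hypothetical; the weights are assumed to be proportions that sum to 1.0.

```python
from statistics import mean, stdev

def standardize(values):
    """Return z scores using the sample SD."""
    m, sd = mean(values), stdev(values)
    return [(x - m) / sd for x in values]

# Hypothetical change scores (posttest minus pretest) for ten athletes.
changes = {
    "squat_kg": [8, 7, 9, 6, 8, 6, 9, 6, -10, 8],
    "bench_kg": [3, 5, 2, 4, 6, 3, 5, 4, -2, 3],
    "jump_cm": [2, 1, 3, 0, 2, 4, 1, 3, 1, 2],
}

# Hypothetical weightings expressed as proportions (must sum to 1.0).
weights = {"squat_kg": 0.40, "bench_kg": 0.20, "jump_cm": 0.40}

sds = {test: standardize(values) for test, values in changes.items()}
n_athletes = len(changes["squat_kg"])

# Weighted composite SDS for each athlete.
composite = [
    sum(weights[test] * sds[test][i] for test in sds)
    for i in range(n_athletes)
]

# The composite SD is below 1.0, so the limits must be recalculated.
composite_sd = stdev(composite)
lower_limit = mean(composite) - 2 * composite_sd
upper_limit = mean(composite) + 2 * composite_sd
```

Setting every weight to 1/3 in this sketch reproduces the default, evenly weighted composite.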

The strength and conditioning professional should also bear in mind that, because the SD of a composite z score is <1.0 (Figure 2), percentile rankings will also change (i.e., the percentile rankings displayed in Figure 1 are no longer accurate). As such, it is necessary to calculate new percentile rankings. To do so, use the following function: [=(PERCENTRANK(G3:G66,G3))*100], where G3:G66 represents the array of scores to include from column G, and where the G3 following the comma represents the cell containing the composite z score for which the percentile rank is being calculated.
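
A comparable percentile rank can be sketched in Python. This is an assumption about how the spreadsheet function behaves for a score that appears in the array (the proportion of scores below it, divided by n − 1), and it ignores any rounding the spreadsheet applies. The composite z scores and the function name percent_rank are hypothetical.

```python
def percent_rank(scores, score):
    """Percentile rank (0 to 100) of a score that is present in scores."""
    below = sum(1 for s in scores if s < score)
    return 100 * below / (len(scores) - 1)

# Hypothetical composite z scores for five athletes.
composite_z = [-1.2, -0.4, 0.0, 0.5, 1.1]

percentiles = [percent_rank(composite_z, z) for z in composite_z]
```

With these five scores, the lowest composite z score has a percentile rank of 0, the middle score 50, and the highest 100.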

As a function of mathematics, one or more accumulated change in performance scores or composite z scores (Figure 3, right panel) should fall outside the ±2 SD limits (i.e., ∼5% of the sample). Should the distribution of composite SDSs be significantly skewed, the strength and conditioning professional should assume that 1 or more athletes are exhibiting extraordinary changes in performance overall, relative to the rest of the team. It should be noted that athletes exhibiting extraordinary changes in performance may not be the “best” or “worst” athletes on the team (i.e., SDSs are rankings on overall change in performance and not overall performance per se). The SDS method helps the strength and conditioning professional survey change in performance data both quickly and thoroughly. Extraordinary SDSs may identify athlete(s) who were ill during pre- or posttesting. Less commonly considered but equally plausible, they may help detect athlete(s) using a banned/illegal performance-enhancing substance.

Practical Applications
Standard scores and inferential statistics have been described for evaluating strength and conditioning programs (2); however, such conventional statistics do not easily detect large individual changes in rank order. Conversely, the SDS offers a viable option for detecting individuals exhibiting extraordinary changes in performance. Standard difference scores are easy to calculate and can be used as a substitute for the exploratory statistics provided by higher end statistical software programs. As such, they are suitable statistics for the practicing strength and conditioning professional.

Acknowledgments
I am indebted to Dr. Steve Bohnenblust, Director of the Center of Scholarship and Research at Minnesota State University, Mankato, Minnesota, for his technical advice concerning the usage of this statistic. I thank Julie Stella, staff member, for her thoughtful editing of this article.

References
1. Bland, JM and Altman, DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1: 307-310, 1986.

2. Harman, E and Garhammer, J. Administration, scoring, and interpretation of selected tests. In: Essentials of Strength Training and Conditioning (3rd ed.). Baechle, TR and Earle, RW, eds. Champaign, IL: Human Kinetics, 2008. pp. 249-292.

3. Hopkins, WG. Measures of reliability in sports medicine and science. Sports Med 30: 1-15, 2000.

4. Hopkins, WG. Bias in Bland-Altman but not regression validity analyses. Sportscience 8: 42-46, 2004.

5. Kenney, JF and Keeping, ES. Mathematics of Statistics. Princeton, NJ: Van Nostrand, 1962.