## INTRODUCTION

As strength and conditioning coaches, we routinely put our athletes through a variety of fitness assessments to determine their physical capability, so that we can tailor the design of their training program and adapt accordingly. Similarly, the psychologist, physiotherapist, and technical coaches also assess the athlete, with the results equally used to inform future interventions and team selection. But with so much data collected and thus available for discussion, athlete review meetings that all staff attend can often see each practitioner providing more discrete detail than is necessary. For example, although jump height may be informative to the strength and conditioning coach, this score, in this context, may not prove overly helpful to discussions fed into by the coaches and other members of the sport science disciplines. These situations, therefore, lend themselves to the strength and conditioning coach providing a single score for the athlete's physical fitness, rather than separately discussing each individual test result. Such an approach can streamline collaborative communication, maximizing the time available for planning and practical delivery.

Furthermore, coaches may be less concerned with the raw score of each athlete than with where that score ranks among their teammates, especially when there is competition for places. For example, a coach may have no concept of what is deemed a good jump height or back squat, with this information only becoming apparent through some analysis that reveals the score is among the highest or lowest in the squad. Also, it can be rare for the athlete who scored highest on the bench press to also score highest on a change of direction speed test or Yo-Yo test, for example, suggesting that there is some compromise among the different components of fitness that collectively define an athlete's athleticism. So, although we want to rank how each athlete compared with their teammates across each test (to highlight test-specific strengths and weaknesses), we also want to be able to judge how they did holistically, that is, have some measure of general athleticism, where moderate scores across all tests may in fact be more beneficial to performance than scoring really high in some while doing terribly in others. The aim of this article is to describe a method that provides a single score of holistic fitness, referred to as a total score of athleticism (TSA).

### USING STANDARDIZED SCORES AND DEFINING WINDOWS OF OPPORTUNITY

A TSA is derived by averaging a set of standardized scores (here, either *z*-scores for large groups or *t*-scores for small groups) from a series of tests undertaken by an athlete (^{9}). A standardized score (of a single test), and therefore the TSA (of a series of tests), allows coaches to examine contextualized data of individual athletes relative to their teammates and thus set benchmarks and training goals that are realistic to the demands placed on players by the club. For example, each player's physical capacity will to some extent be a consequence of the coach's training philosophy, which determines competition tactics (or style of play) and their attitude toward strength and conditioning practices (^{11}). Furthermore, results may also be a consequence of general time allocated to training (e.g., semiprofessional athletes vs. professional athletes) and naturally, the age and maturation of the players (^{5,7}). So, although comparative data may be available outside of the club, enabling comparisons with professional athletes, for example, it may create unrealistic targets. This is because using comparative data may establish benchmarks or test goals, which require a time allowance to fitness training that is at odds with that which is allotted, and requires financial and logistical input that is not supported. Equally, comparative data drawn from other teams may represent a trend toward a particular set of fitness characteristics that maps back to a style of play that is not universally adopted. Finally, given it is likely that within-club comparisons will be used for team selection purposes, between-player comparisons are likely the most beneficial use of fitness testing data.

The usefulness of *z*-scores (the standardized score we will initially discuss) can be noted when we consider the following question. During a fitness testing battery, if an athlete squats 140 kg and has a beep (aerobic shuttle) test score corresponding to level 15, how well did they do, and on which did they do best? The first stage of answering this is to establish the maximum values attained from each athlete, within the tested squad. For squats, the highest recorded score may have been 220 kg and for the beep test, level 17. Therefore, the athlete attained a score of 64% and 88%, respectively, relative to the maximum. So, on a percentage basis, the athlete performed better on the beep test. However, the 88% on the beep test may have been one of the lowest scores among all those tested. By contrast, the 64% may have been one of the highest; so, arguably, the athlete scored best on the squats. Such information enables strength and conditioning coaches to more precisely highlight athlete strengths and weaknesses and program accordingly. Therefore, the final piece of information used is a measure of how well someone did relative to all who took those tests. A *z*-score contains all of this information, and because it is unit-less, it enables comparisons between other tests which otherwise would not be possible (^{3,9}).
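To make the contrast between percentage-of-maximum and *z*-scores concrete, the following is a minimal Python sketch of the example above. The squad values are hypothetical (only the 140 kg, 220 kg, level 15, and level 17 figures come from the text), the function names are our own, and the sample SD is assumed.

```python
from statistics import mean, stdev

# Hypothetical squad results; the first entry in each list is the athlete in question.
squat_kg = [140, 220, 120, 115, 110, 105, 100, 100]
beep_level = [15, 17, 17, 16.5, 16, 16, 16, 15.5]


def percent_of_max(score, squad):
    """Score expressed as a percentage of the best score in the squad."""
    return 100 * score / max(squad)


def z_score(score, squad):
    """Number of (sample) SDs the score sits above or below the squad mean."""
    return (score - mean(squad)) / stdev(squad)


squat_pct = percent_of_max(squat_kg[0], squat_kg)
beep_pct = percent_of_max(beep_level[0], beep_level)
squat_z = z_score(squat_kg[0], squat_kg)
beep_z = z_score(beep_level[0], beep_level)

print(f"Squat: {squat_pct:.0f}% of max, z = {squat_z:+.2f}")
print(f"Beep:  {beep_pct:.0f}% of max, z = {beep_z:+.2f}")
```

With these illustrative numbers, the athlete's squat *z*-score is positive and the beep test *z*-score is negative, even though the beep test has the higher percentage of maximum.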

By plotting athlete data as *z*-scores, coaches, athletes, and sport scientists can get a quick and easy to read data point and graph, indicating how well each athlete did on each test relative to their teammates, and which areas are strengths and which are weaknesses. For example, looking at Figure 1, where zero represents the team average, anything above the zero-line means that the athlete is better than average, and anything below means they are worse. Practically, this means that anything below the line represents a clear window of opportunity that should be targeted when individualizing the athletes' next training program.

The next question involves the interpretation of how good or bad they are at each test (relative to their teammates). For this, we must be able to interpret the *z*-score value (on the *y* axis), which corresponds to the height of each bar. To fully understand these values, the mean and SD need to be examined, whereby the former provides the average score and the latter the dispersion of data (a smaller SD means that the data set contains values that are, on average, close to the mean, while a larger SD suggests the opposite). Together, the mean ± 1 SD will contain ∼68% of all test scores, the mean ± 2 SD ∼95%, and the mean ± 3 SD ∼99% (Figure 2 and Table). *Z*-scores rescale values to show how many SDs away from the mean they are and therefore have a mean of zero and an SD of 1 (^{6}). We can interpret values by using a normal distribution (refer to Figure 2). So, if an athlete scores +2, it indicates that the athlete scored 2 SD above the mean, meaning that they performed better than 97% of all scores (50% up to the mean plus 34% up to +1 SD and another 13% up to +2 SD). A score of +1 informs us that they scored better than 84% of others who were tested, while −1 suggests 84% did better than them. So, when we analyze Figure 1 again, we must make note of the values on the *y* axis to determine their test scores. For this reason, when producing charts for each athlete, it can be useful to fix the *y* axis values (i.e., use Excel's chart formatting function to manually set max and min values) to make interpretation easier and more accessible to coaches and athletes, by allowing them to simply gauge performance through the height of each bar (if *y* axis values are not fixed, the histogram is plotted based on the largest *y* values).
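For readers who want to check the percentages quoted above, they can be reproduced from the standard normal distribution. The sketch below uses Python's built-in `statistics.NormalDist`; the helper names `share_below` and `share_within` are our own, and the results only apply if the test scores are approximately normally distributed.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_below(z):
    """Proportion of a normal distribution lying below a given z-score."""
    return standard_normal.cdf(z)


def share_within(k):
    """Proportion lying within k SDs either side of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"mean ± {k} SD covers {share_within(k):.1%}")
for z in (-1, 1, 2):
    print(f"z = {z:+d}: better than {share_below(z):.1%} of scores")
```

The more precise values (68.3%, 95.4%, and 99.7% for the three bands; 84.1% and 97.7% for *z*-scores of +1 and +2) sit close to the rounded figures used in the text.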

Finally, in sport, smaller values can of course indicate superior performances, for example, 30-m sprint time. For these tests, negative values for *z*-scores would be achieved for athletes who were better than average. When this occurs, the final value can simply be multiplied by −1. This reversing of positive values to negative values and vice versa enables all scores above the line to be seen as an athlete's strength and all scores below the line to be seen as an athlete's weakness (relative to those who took the test); again, this adjustment simply makes for easier interpretation.

### CALCULATING A TOTAL SCORE OF ATHLETICISM

Coaches are often interested in 1 score that represents how “fit” a given athlete is. For this we can use the TSA, calculated by averaging the *z*-scores from each test (^{9}). Using the average, rather than the sum, mitigates scenarios whereby an athlete is missing a particular test due to an injury. Leaving the cells blank ensures this is picked up when interpreting the graphs (Figure 3). Another reason to average scores is to ensure the athlete is “well rounded.” For example, while an athlete may have a very high score for the bench press, their score for some test of aerobic capacity could be low, and in this case, the low score will neutralize the high score. The TSA is therefore indicative of the fact that sport often requires several athletic abilities; thus, athletes cannot just focus on 1 facet of physical performance at the expense of others. Similarly, researchers are also starting to use an averaging of *z*-scores to better understand in-competition metrics, by correlating this 1 measure of holistic fitness with key performance indicators such as tackles, shots, and passes (^{1,4,12}). Again, this seems logical because on-field metrics are simultaneously driven by several physical competencies, working in concert with one another, and thus, this represents a potentially fruitful addition to the traditional relationships identified between key performance indicators and single components of fitness (such as 30-m speed or 1 repetition maximum back squat).
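As a minimal illustration of the averaging step, the Python sketch below computes a TSA for one athlete who has missed a test. The *z*-scores and test names are hypothetical, and treating a missing test as `None` (the equivalent of a blank cell) is our own assumption about how the data are stored.

```python
from statistics import mean

# Hypothetical z-scores for one athlete; None marks a test missed through injury.
athlete_z = {
    "back squat": 1.2,
    "countermovement jump": 0.4,
    "30-m sprint": None,
    "Yo-Yo test": -0.7,
}


def total_score_of_athleticism(z_scores):
    """Average the available z-scores, skipping any missing tests."""
    available = [z for z in z_scores.values() if z is not None]
    if not available:
        raise ValueError("no test results available")
    return mean(available)


tsa = total_score_of_athleticism(athlete_z)
tests_done = sum(z is not None for z in athlete_z.values())
print(f"TSA = {tsa:+.2f} from {tests_done} tests")
```

Because the average is taken over the tests actually completed, the missing sprint neither inflates nor deflates the TSA, although it is worth flagging on the report that the score rests on fewer tests.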

Finally, after the completion of a comprehensive needs analysis, a quick reference assessment of who is the most athletic (relative to the demands of the sport) can be plotted by graphing all athlete TSA scores (Figure 4). Interpretation can then be made easier still by sorting them in Excel (highest to lowest) or ranking athletes as described elsewhere (^{10}) and below.

### DETERMINING THE FITNESS TESTING BATTERY

Given *z*-scores from each test are averaged, it is important to choose tests that represent the athletic components required of the sport in question. For example, choosing 1 test for strength, 1 for power, speed, aerobic capacity, and so forth, ensures a rounded approach to athleticism. Having more tests geared toward strength and power assessment relative to aerobic capacity, for example, is indicative of a requirement in athleticism centering on those qualities, perhaps because the sport event is highly intensive and of short duration, or that this particular combination better suits the specific positional demands. Whatever tests and weighting of tests are decided on, the validity of the TSA is governed by the tests used to make up its score. Using several tests that favor 1 attribute of athleticism (or giving 1 test a higher weighting), such as strength, will bias scores in its favor (^{8}). Practitioners therefore should also consider splitting squad assessments by positional groups (e.g., judging soccer goalkeepers by aerobic capacity may disadvantage them).

### CALCULATING THE TOTAL SCORE OF ATHLETICISM AND PLOTTING GRAPHS IN EXCEL

To calculate the *z*-score of any given test, the squad's average test score is subtracted from the athlete's test score, then this value is divided by the squad's SD; so, the equation reads as follows: *z*-score = (athlete score − team mean)/team SD. This can easily be computed in Excel by using the “STANDARDIZE” formula or inputted manually using the equation provided (and as illustrated in Figure 5). The formula contained within the cell (cell E2 in the example given in Figure 5) can then be dragged down and then across to compute *z*-scores for all athletes across all tests. However, before doing so, the test mean and SD must first be fixed using the “$” sign as per the formula highlighted in Figure 5. Furthermore, for purposes of drawing graphs and to ensure scores above the line are seen as strengths, and those below the line are seen as weaknesses (as per Figure 1), multiply speed-based time tests by −1 (Figure 6). Finally, the TSA is calculated by averaging all *z*-scores (Figure 7). For ease of interpretation for coaches and athletes, the TSA and each test's *z*-score can then be ranked, and a “traffic light” system can be used (Figure 8) to highlight how each athlete's fitness compares with their teammates; an example of how this can be presented (using the “VLOOKUP” function) is shown in Figure 9.
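For practitioners who prefer a script to a spreadsheet, the same workflow can be sketched in Python. This is an illustration under stated assumptions rather than a reproduction of the Excel figures: the squad data are hypothetical, the sample SD is assumed (the text does not specify which SD is used), and names such as `standardize` and `lower_is_better` are our own.

```python
from statistics import mean, stdev

# Hypothetical raw results for a five-player squad.
results = {
    "back squat (kg)": {"A": 140, "B": 160, "C": 120, "D": 150, "E": 130},
    "30-m sprint (s)": {"A": 4.30, "B": 4.10, "C": 4.50, "D": 4.20, "E": 4.40},
    "Yo-Yo (level)": {"A": 18.0, "B": 16.0, "C": 19.0, "D": 17.5, "E": 20.5},
}
lower_is_better = {"30-m sprint (s)"}  # time-based tests to multiply by -1


def standardize(scores, flip=False):
    """z = (athlete score - team mean) / team SD, optionally sign-flipped."""
    values = list(scores.values())
    team_mean = mean(values)
    team_sd = stdev(values)  # sample SD, an assumption
    sign = -1 if flip else 1
    return {name: sign * (x - team_mean) / team_sd for name, x in scores.items()}


# One z-score per athlete per test, with time-based tests reversed.
z_table = {
    test: standardize(scores, flip=test in lower_is_better)
    for test, scores in results.items()
}

# TSA is the average z-score across tests; rank from highest to lowest.
players = list(results["back squat (kg)"])
tsa = {p: mean(z_table[test][p] for test in z_table) for p in players}
ranking = sorted(players, key=tsa.get, reverse=True)

for rank, p in enumerate(ranking, start=1):
    print(f"{rank}. {p}: TSA = {tsa[p]:+.2f}")
```

Computing the mean and SD once per test inside `standardize` plays the same role as fixing those cells with the “$” sign before dragging the formula across the sheet.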

### CONVERTING *Z*-SCORES TO *T*-SCORES

There are actually 2 forms of *t*-scores: one used to transform *z*-scores into more user-friendly numbers, which we will discuss now, and one used to standardize scores in small squads, which we will discuss in the following section. Some coaches and athletes may not like the format of a *z*-score, which is a small number that can be positive or negative. In these instances, or just through general preference, *z*-scores can be converted to *t*-scores using the following formula: *t* = (*z* × 10) + 50 (Figure 10). In this format, 50 represents the mean value (as opposed to 0 in *z*-scores), with 10 used to represent an interval equivalent to 1 SD (^{5}). Therefore, a score of 60 represents a score that is 1 SD above the mean and 70 a score 2 SD above the mean. Conversely, a score of 40 represents a score that is 1 SD below the mean and 30 a score 2 SD below the mean. We should also point out that raw scores can be directly converted to *t*-scores using the following formula: *t* = 50 + 10 × (athlete score − team mean)/team SD (Figure 11). Because *t*-scores produce a number that is more conventionally appreciated by athletes, that is, a score that typically falls between 0 and 100 rather than, for example, −3 to 3 (as per *z*-scores), the final overall TSA score is presented in this way, as illustrated in Figure 9. Anecdotally, however, it may still be better to illustrate any data contained in graphs through *z*-scores, as these more readily illustrate better and worse than average (and by what magnitude) through bars being above or below the zero line.
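The two conversion formulas above can be sketched in a few lines of Python. The jump heights are hypothetical, the function names are our own, and the sample SD is again assumed.

```python
from statistics import mean, stdev


def z_to_t(z):
    """Rescale a z-score so the mean is 50 and 1 SD spans 10 points."""
    return z * 10 + 50


def raw_to_t(score, squad):
    """Convert a raw score straight to the rescaled t-score."""
    return 50 + 10 * (score - mean(squad)) / stdev(squad)


for z in (-2, -1, 0, 1, 2):
    print(f"z = {z:+d} -> t = {z_to_t(z)}")

# Hypothetical squad jump heights in cm.
jump_cm = [38.0, 41.5, 35.0, 44.0, 40.0, 36.5]
jump_t = [raw_to_t(x, jump_cm) for x in jump_cm]
print([round(t, 1) for t in jump_t])
```

Both routes give the same number for a given athlete. A value only leaves the 0 to 100 range when the underlying *z*-score is beyond ±5, which is very unlikely in a squad setting.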

### FITNESS TESTING FOR SMALL SQUADS

The use of *z*-scores normally requires achieving 1 of 2 conditions. First, normally distributed data (as illustrated in Figure 2), which given the central limit theorem, are achieved with a sample size of >30 (^{2}). Second, it requires us to know the population SD (σ), which in reality, is rarely known. Therefore, when testing players from a squad of <30, the data are likely to follow a *t*-distribution, which is essentially shorter and fatter than the normal distribution associated with *z*-scores (^{2}). In these instances, where the shape of the curve is dependent on sample size, reference tables must be used to interpret the magnitude of difference for the assessed value relative to the mean; this is in contrast to *z*-scores, where a value of 1 always implies a 34% difference relative to the mean (Figure 2). Therefore, if we were to use *z*-scores on small squads, we could not be confident in interpreting the magnitude of difference from the mean; thus, *t*-scores are advised. To reiterate, these are different from the *t*-scores presented above, with these *t*-scores computed as follows: *t* = (athlete score − team mean)/(SD/SQRT(n)), where SQRT(n) requires you to take the square root of the sample size. Of note, this is the only difference from the formula used to compute a *z*-score. Figure 12 shows how this can be computed in Excel.
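The small-squad formula can be compared with the *z*-score directly. The sketch below applies both to a hypothetical squad of 8 (the data and variable names are our own, and the sample SD is assumed).

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical squat results (kg) for a squad of eight.
squat_kg = {"A": 140, "B": 220, "C": 120, "D": 115,
            "E": 110, "F": 105, "G": 100, "H": 95}

values = list(squat_kg.values())
team_mean, team_sd, n = mean(values), stdev(values), len(values)

# z divides by the SD; the small-squad t divides by SD / sqrt(n).
z = {p: (x - team_mean) / team_sd for p, x in squat_kg.items()}
t = {p: (x - team_mean) / (team_sd / sqrt(n)) for p, x in squat_kg.items()}

for p in squat_kg:
    print(f"{p}: z = {z[p]:+.2f}, t = {t[p]:+.2f}")
```

With the formula as written, each *t*-score is simply the corresponding *z*-score multiplied by the square root of n, so the ordering of athletes is unchanged and only the scale of the values differs.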

The issue with using *t*-scores is that, as mentioned above, interpreting them requires reference tables, which is a lengthy and onerous task for those producing the athlete reports. However, even without the use of reference tables, the relative difference of each score can still be gauged from the graph, that is, above the line implies better than average and below the line implies worse, with the height of the bar indicating by how much. Furthermore, the average *t*-score can still be computed and used to rank holistic fitness (i.e., the TSA) among the athlete's teammates. However, to turn the *t*-score derived TSA into a score between 0 and 100, which again may carry more contextual meaning for coaches and players, we use the “PERCENTRANK” formula in Excel (Figure 13). The score now informs athletes and coaches (as a percentage) where they sit relative to their teammates, noting that, like the *t*-score originally introduced, 50% represents the midpoint of the squad.
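A percent rank of this kind can be approximated in Python as follows. This is a sketch of our understanding of an inclusive percent rank (the number of scores below the athlete's, divided by n − 1) and not a reimplementation of Excel's function: ties, interpolation for values not in the list, and Excel's rounding are deliberately ignored, and the TSA values are hypothetical.

```python
def percent_rank(values, x):
    """Share of the other scores that fall below x, as a 0-100 percentage."""
    if len(values) < 2:
        raise ValueError("need at least two scores")
    below = sum(v < x for v in values)
    return 100 * below / (len(values) - 1)


# Hypothetical TSA values (average t-scores) for a squad of five.
tsa = {"A": -0.11, "B": 1.15, "C": -1.94, "D": 0.80, "E": 0.10}

scores = list(tsa.values())
for player, value in sorted(tsa.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{player}: {percent_rank(scores, value):.0f}")
```

In this example the top-ranked athlete scores 100, the bottom-ranked athlete scores 0, and the middle-ranked athlete scores 50.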

In closing, a player profile produced using *t*-scores is presented in Figure 14; this is the same player used above in Figure 9, allowing you to note the subtle difference between the 2 methods of analysis. Incidentally, the rank you get from *t*-score analysis is generally identical to the rank you get from a *z*-score analysis; it is just unfortunate that *t*-scores (unlike *z*-scores) are affected by sample size and thus require reference tables to determine relative difference from the mean.

## CONCLUSION

Oftentimes, the various coaching staff, sport science, and medical practitioners of a sports club require a single, holistic indication of an athlete's athleticism. Currently, there is no consensus on how this is best defined, and thus, a TSA may provide one such method. The validity of the TSA score is governed by the relevance of the fitness tests used, so coaches must be able to rationalize their choices based on the information derived from a comprehensive needs analysis of the sport including positional demands.

Finally, data visualization is an important consideration to maximize the effectiveness of this approach, and the figures used should be simple to interpret for both coaches and athletes. Histograms may provide a logical and easy way to understand the data, as scores above the line mean an athlete is better than average and scores below the line suggest they are worse, with the height of the bar indicating by how much. This information can then be used to identify areas to be targeted when the next training program is individualized for each athlete. Of course, it would be remiss of us to not point out that standardized scores essentially rank athletes within the tested population; thus, half the athletes will always be below average. Some consideration should therefore be given to whether this highlights windows of opportunity in these athletes or is a natural byproduct of exceptional fitness within the tested squad. If it were the latter, then other areas should be targeted, with this a natural consequence of analysis through standardized scores. For interested readers, a step-by-step guide for the calculation of *z*-scores and the TSA, along with how to graph results (as histogram or radar plot), is available elsewhere (^{10}).