With this logic in mind, the sensitivity and specificity of doping tests, along with the prevalence of doping (15,18,19), would make it possible to determine the true validity of doping tests. With this information, the positive predictive value (PPV) and the false discovery rate (FDR) of each doping test could be determined. The PPV, calculated according to Formula 1, reflects the proportion of true dopers among individuals with a positive test result and thus provides vital information about the performance of a test. Conversely, the FDR gives the probability of a false-positive result among individuals with a positive doping test (i.e., the FDR equals the proportion of clean athletes sanctioned by a false-positive test). As shown in the Figure, the prevalence strongly influences the validity of the test.
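The relationship behind Formula 1 can be sketched numerically as follows (a minimal illustration, not part of the original analysis; the example inputs are hypothetical):

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value (Formula 1): probability that an
    individual with a positive test result is a true doper."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


def fdr(sensitivity: float, specificity: float, prevalence: float) -> float:
    """False discovery rate: proportion of clean athletes among all
    individuals with a positive test result."""
    return 1 - ppv(sensitivity, specificity, prevalence)
```

For example, with a sensitivity of 0.997, a specificity of 0.985, and a doping prevalence of 14%, the FDR would be about 8.5%; that is, roughly 1 in 12 positive results would come from a clean athlete.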
Theoretically, the effectiveness of every doping test and/or the likelihood that an athlete with a positive doping test is a true doper can be determined using these calculations. However, as described in Table 1, this would require precise doping prevalence data as well as specificity and sensitivity for every doping substance/method tested in an athlete cohort at any given time.
From the above analysis, it is clear that the true doping prevalence remains the great unknown, yet it is essential for drawing meaningful conclusions about the effectiveness of the antidoping test program. In 2011, a study was conducted to assess past-year doping prevalence among athletes participating at the 2011 World Championships in Athletics (Daegu, South Korea) and the Pan-Arab Games (Doha, Qatar). Using a randomized response technique, which guarantees anonymity of the participants, the past-year doping prevalence was estimated at 43.6% at the World Championships in Athletics and 57.1% at the Pan-Arab Games (7). Based on the available scientific literature, de Hon et al. (17) estimated the prevalence of doping in elite sport to be between 14% and 39%. With a focus on blood doping, Sottas et al. (20) published in 2011 a retrospective analysis of blood data from 2737 athletes who participated at international athletics competitions and reported that an average of 14% of the samples showed evidence of blood doping.
An essential requirement for establishing the validity of an antidoping test is for all WADA-accredited laboratories to report their own test validity, including the sensitivity and specificity for all substances tested. For example, after the test for recombinant human erythropoietin (rHuEpo) was established (21), the different laboratories were required to determine their own test validity. To date, these data have not been published; therefore, it is not possible to calculate the risk of false positives.
Assume a hypothetical doping test with a chain of custody free of human error, bribery, corruption, and intentional underreporting (i.e., all situations with a precedent in elite sport), and with the same validity as the human immunodeficiency virus (HIV) test, the best and most well-studied antibody-based enzyme immunoassay in the world, with a sensitivity and specificity of 99.7% and 98.5%, respectively (22). Even then, the FDR diverges widely depending on the prevalence of doping (see Table 2). Subtle differences in doping test settings in an athlete cohort with the same overall doping prevalence can also influence the FDR as a result of selection bias (Table 2). The first three scenarios illustrated in Table 2 refer to the problem of selection bias; this may theoretically occur in regular out-of-competition testing, where athletes using prohibited substances and methods may accept a missed antidoping test rather than deliver a positive one. The last two scenarios in Table 2 illustrate how the number of total tests conducted, combined with an athlete group that knows how to avoid having a substance detected in their urine, will lead to a high number of false positives and a lower number of true positives.
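The prevalence dependence of the FDR can be reproduced numerically (a sketch assuming the HIV-assay validity quoted above; the prevalence values are illustrative and are not the exact Table 2 scenarios):

```python
sens, spec = 0.997, 0.985  # HIV-assay validity quoted in the text

for prev in (0.01, 0.14, 0.40):  # illustrative doping prevalences
    true_pos = sens * prev
    false_pos = (1 - spec) * (1 - prev)
    fdr = false_pos / (true_pos + false_pos)
    print(f"prevalence {prev:.0%}: FDR = {fdr:.1%}")
```

With these inputs, at 1% prevalence roughly 60% of all positive results would be false, whereas at 40% prevalence only about 2% would be, which is why the same assay can be trustworthy in one population and unreliable in another.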
The figures presented in Table 2 apply only to A-sample testing and not to the whole procedure, in which a positive A-sample is followed by a verification test using a B-sample. When the B-sample is taken at the same point in time and analyzed in the same laboratory using the same test procedure, such “retesting” does not constitute an additional independent test and therefore may not significantly alter the FDR. Whether this method of repeat testing (i.e., samples A and B) confers any benefit on the probability of false-positive outcomes remains to be determined and will require the testing authorities to report A- and B-sample test outcomes. The main conclusions drawn from Table 2 would nevertheless remain the same, with FDR values that vary for the same test depending on the doping test setting and doping prevalence. Notably, HIV testing showed the same unreliability in populations with 1% HIV prevalence as depicted in the last scenario (Table 2), where more individuals received a false-positive result than a true-positive one. In that case, the solution was not to implement a “B-sample” but to add an independent testing procedure for samples with a positive result in the initial (A-sample) test, because a “B-sample” has a very high likelihood of repeating the false-positive outcome. In antidoping, the urinary Epo test shows variable antibody cross-reactivity due to unknown individual factors that can lead to false positives (23), so B-sample testing is likely to yield the same false result.
The importance of independent verification can be illustrated with the following example. When performing 25,000 Epo tests (roughly the number of Epo tests conducted in 2012) on truly clean athletes and assuming a specificity of 0.985 (similar to the HIV test), at least 375 false-positive A-samples would be expected. If a second, independently taken sample were tested using an alternative method with an equally high specificity as the first test, the total specificity of the antidoping procedure would increase according to Formula 2 (14).
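Formula 2 for two fully independent tests can be sketched as follows (a minimal illustration using only the figures given in the text):

```python
def combined_specificity(spec_a: float, spec_b: float) -> float:
    """Formula 2: a clean sample is flagged overall only if BOTH
    independent tests deliver a false positive."""
    return 1 - (1 - spec_a) * (1 - spec_b)


n_clean_samples = 25_000  # roughly the Epo tests conducted in 2012
spec_single = 0.985       # similar to the HIV test

# Expected false positives with a single (A-sample) test only:
print(round(n_clean_samples * (1 - spec_single)))           # -> 375

# Overall specificity when a second, fully independent test is added:
print(round(combined_specificity(spec_single, spec_single), 6))  # -> 0.999775
```

The key assumption is full independence of the two tests; a B-sample analyzed by the same laboratory with the same method does not satisfy it.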
Using these two test models, the overall specificity would be high at 0.999775 but would still generate five false-positive samples for every 25,000 samples tested in a group of clean athletes (i.e., prevalence = 0%) and up to three false positives in a group with a prevalence of 40% (i.e., false positives = number of tests on clean athletes × (1 − total specificity)). Notably, this calculation is only correct if the confirmatory test is completely independent of the first test, which would require a different sample taken from the athlete on a different occasion and analyzed using a different method in a different laboratory. Additionally, this calculation assumes that the sample is tested for a single substance only. In a real-world scenario, a single sample is tested for a large number of substances, which reduces the specificity: the multiple-substance specificity equals (specificity)^(N substances tested) (14). Assuming that a single urine A-sample is tested for three different erythropoiesis-stimulating agents with the same test, the specificity would fall from 0.985 to 0.956. With this A-sample specificity, the number of false positives among 25,000 samples would increase accordingly to 17 (prevalence, 0%) or 10 (prevalence, 40%) (A-sample specificity, 0.956; B-sample specificity, 0.985).
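The erosion of specificity from multi-substance screening, and the false-positive counts quoted above, can be checked numerically (a sketch using only the figures given in the text):

```python
spec_single = 0.985
n_substances = 3  # three erythropoiesis-stimulating agents in one A-sample

# Multiple-substance specificity: (specificity) ** N substances tested (14)
spec_a = spec_single ** n_substances  # ~0.956
spec_b = 0.985                        # independent confirmatory test
total_spec = 1 - (1 - spec_a) * (1 - spec_b)

for prevalence in (0.0, 0.40):
    n_clean = 25_000 * (1 - prevalence)
    expected_fp = round(n_clean * (1 - total_spec))
    print(f"prevalence {prevalence:.0%}: ~{expected_fp} false positives")
```

Even with an independent confirmatory test, screening one sample for several substances multiplies the opportunities for a false positive, which is why per-substance specificities need to be reported.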
In 2013, WADA published some general Epo test figures on adverse analytical findings (AAF) from the previous year (24). Unfortunately, the published figures did not differentiate the numbers of tests for the different Epo generations, nor did they list separate A- and B-sample outcomes. The report did include the total number of samples analyzed for Epo (N = 25,405) and 44 AAF (see ref. 24, pages 47–49). The percentage of AAF was 0 for most agencies, even those with high test numbers, but >3% and >10% for two testing centers. Subsequent publications of laboratory data by WADA after 2013 were even less detailed and raise questions about the low proportion of AAF compared with the assumed high prevalence of doping. In all future publications, the “total number of tests” should reflect and be corrected for samples being tested for more than one parameter (e.g., at least two different generations of Epo, delivering two different potential test outcomes in one run). This reporting also needs to be done for every substance tested. The roughly 300,000 samples tested per year can easily sum to several million single test runs on different substances, all with different test validities for A-sample and B-sample testing if a different procedure is used. This is the case for testosterone doping tests, where a positive testosterone-to-epitestosterone ratio (T/E) is followed by mass spectrometry for exogenous (nonhuman) testosterone. In this pertinent example, the validity of the T/E depends on the ethnicity of the individual tested and genetic factors influencing steroid metabolism. Therefore, the specificity and sensitivity for each Epo generation and for each other substance tested, along with the respective laboratory quality control figures, need to be published. Publication of these data would help clarify the likely reasons behind the striking differences between laboratories.
There is general consensus to introduce “intelligent antidoping testing,” but this requires resolving the important issues raised in this commentary. In addition, we propose alternative verification of doping test results. In principle, such testing is already possible for certain substances; for Epo testing, alternative verification is available in the form of a suspicious blood profile. The athlete biological passport (ABP) does not rely on the same (urine) sample but requires a blood sample taken in a time window close to the positive Epo test and analyzed according to the ABP procedure, which is methodologically completely distinct from Epo testing. In such a case, a positive screening result by either Epo or ABP testing could be followed by confirmatory testing with the respective other test, equivalent to standard clinical laboratory procedures in infectiology (as for HIV testing). However, the ABP is less effective at identifying modern strategies of low-dose doping with erythropoiesis-stimulating agents or blood doping (25), and there is limited prospect of using such testing for verification of positive Epo tests. Confirmatory testing would also apply when an ABP (strengthened by next-generation “omics” biomarkers) is used as a starting point to generate a level of evidence for blood doping. Other possibilities need to be sought to generate additional evidence of doping, to reduce the FDR and avoid false positives; one example is verification of a positive Epo test by extracting the peptides in the gel region of the positive rHuEpo signal and subjecting them to confirmatory testing by mass spectrometry (Werner Franke, personal communication).
Other screening options indicating potential doping are being established, such as the steroidal profile of the ABP (26), but these are prone to the same limitations as those reported for the hematological/blood module (low sensitivity, high cost, and low practicality) (27). Recently, we have shown that low-dose rHuEpo administration induces a robust transcriptomic response in whole blood, with some transcripts remaining significantly altered up to 3 wk after rHuEpo withdrawal (28). Such next-generation “omics” approaches have been criticized for their inability to deliver test outcomes with high enough validity as a standalone antidoping test (29). However, because these “omics” approaches can screen for abnormal profiles in blood or urine that reflect many different doping procedures, they have the potential to serve as both screening and confirmatory tests. A screening test should be highly sensitive, but not necessarily specific, with a distinct cutoff set for the “omics”-based readouts as an indicator of a specific doping procedure. Confirmatory testing could use conventional testing but restrict it to those substances known to contribute to the abnormal profile. Alternatively, because these “omics” biomarkers enable longer-term detection, the test could be reapplied with more conservative cutoffs that warrant high enough specificity to confirm a positive sample in the days following the original positive doping test.
The efficacy of current antidoping science and its potential “side effects” for clean athletes cannot be meaningfully assessed with the currently published data. Publishing the deidentified essential data, which are in principle available, would not diminish the potential of doping tests to catch doping athletes. The absolute minimal requirements for standard quality control reports should be: the sensitivity and specificity of each antidoping test; the number and type of tests conducted on each sample; the number of analysis runs for each doping substance tested; the number of samples tested; and the number of AAF at each step in the analysis (A- and B-samples) for every single substance tested. This information, along with blind quality control testing across laboratories, needs to be published annually. Publishing these data will pave the way for future testing with a lower FDR, enhancing the ability to sanction a greater number of true-positive athletes with the smallest likelihood of false positives. Without these essential data, “intelligent testing” lacks adequate quality control. For a more effective fight against doping, it is imperative that confirmatory testing by additional procedures, independent of those that delivered a positive A-sample, is introduced. While transparency with data is the essential starting point to enable researchers to optimize such procedures, novel indirect approaches such as “omics”-based biomarkers that deliver signatures indicative of specific doping procedures hold great potential for use in both screening and confirmatory testing. Resolving the issues raised here would be a game changer for the clean athlete and would make the system more resistant to bribery, corruption, or other attempts to manipulate drug testing outcomes.
2. Ayotte C, Miller J, Thevis M. Challenges in modern anti-doping analytical science. Acute Topics AntiDop. 2017; 62:68–76.
3. Houlihan B. Achieving compliance in international anti-doping policy: an analysis of the 2009 World Anti-Doping Code. Sport Manage. Rev. 2014; 17:265–76.
4. Overbye M. Deterrence by risk of detection? An inquiry into how elite athletes perceive the deterrent effect of the doping testing regime in their sport. Drugs Educ. Prev. Policy. 2017; 24:206–19.
6. Frenger M, Emrich E, Pitsch W. How to produce the belief in clean sports which sells. Perform. Enhanc. Health. 2013; 2:210–5.
7. Ulrich R, Pope HG, Cleret L, et al. Doping in two elite athletics competitions assessed by randomized-response surveys. Sports Med. 2018; 48:211–9.
9. Durussel J, Haile DW, Mooses K, et al. Blood transcriptional signature of recombinant human erythropoietin administration and implications for antidoping strategies. Physiol. Genomics. 2016; 48:202–9.
10. Berry DA. The science of doping. Nature. 2008; 454:692–3.
11. Ljungqvist A, Horta L, Wadler G. Doping: world agency sets standards to promote fair play. Nature. 2008; 455:1176.
12. Fero M. Doping: ignorance of basic statistics is all too common. Nature. 2008; 455:166.
13. Altschuler EL. Doping: similar problems arise in medical clinics. Nature. 2008; 455:167.
14. Pitsch W. The science of doping revisited: fallacies of the current anti-doping regime. Eur. J. Sport Sci. 2009; 9:87–95.
15. Waaler HT, Siem H, Aalen OO. Can we trust doping tests? Tidsskr. Nor. Laegeforen. 2011; 131:1760–1.
16. Boye E, Skotland T, Osterud B, Nissen-Meyer J. Doping and drug testing: anti-doping work must be transparent and adhere to good scientific practices to ensure public trust. EMBO Rep. 2017; 18:351–4.
17. de Hon O, Kuipers H, van Bottenburg M. Prevalence of doping use in elite sports: a review of numbers and methods. Sports Med. 2015; 45:57–69.
18. Pielke R. Assessing doping prevalence is possible. So what are we waiting for? Sports Med. 2018; 48:207–9.
19. Colquhoun D. An investigation of the false discovery rate and the misinterpretation of p-values. R. Soc. Open Sci. 2014; 1(3):140216.
20. Sottas PE, Robinson N, Fischetto G, et al. Prevalence of blood doping in samples collected from elite track and field athletes. Clin. Chem. 2011; 57:762–9.
21. Lasne F, de Ceaurriz J. Recombinant erythropoietin in urine. Nature. 2000; 405:635.
22. Chou R, Huffman LH, Fu R, et al. Screening for HIV: a review of the evidence for the U.S. Preventive Services Task Force. Ann. Intern. Med. 2005; 143:55–73.
23. Franke WW, Heid H. Pitfalls, errors and risks of false-positive results in urinary Epo drug tests. Clin. Chim. Acta. 2006; 373:189–90.
25. Lundby C, Robach P, Saltin B. The evolving science of detection of “blood doping”. Br. J. Pharmacol. 2012; 165:1306–15.
26. Saugy M, Lundby C, Robinson N. Monitoring of biological markers indicative of doping: the athlete biological passport. Br. J. Sports Med. 2014; 48:827–32.
27. Baume N, Geyer H, Vouillamoz M, et al. Evaluation of longitudinal steroid profiles from male football players in UEFA competitions between 2008 and 2013. Drug Test. Anal. 2016; 8:603–12.
28. Wang G, Durussel J, Shurlock J, et al. Validation of whole-blood transcriptome signature during microdose recombinant human erythropoietin (rHuEpo) administration. BMC Genomics. 2017; 18(Suppl 8):817.
29. Neuberger EW, Moser DA, Simon P. Principle considerations for the use of transcriptomics in doping research. Drug Test. Anal. 2011; 3:668–75.
Copyright © 2018 by the American College of Sports Medicine.