Although not the only source of clinical evidence (1, 2), analyses of large, multicentre, randomized controlled trials (RCTs) influence clinical practice in most areas of medicine, including critical care (3). Optimal patient care is supported if physicians understand the results and limitations within such published reports of randomized trials. However, it has been observed repeatedly that medical students and physicians lack knowledge of relevant biostatistical methodology (4, 5). These difficulties have been linked to persistent statistical flaws in published research reports (6–11), with these errors in turn ascribed to deficiencies in undergraduate medical student and postgraduate physician education combined with insufficient statistical support of clinical research (4, 12–19). Potentially compounding these biostatistical difficulties, serial comparisons of scientific publications over several decades highlight the progressive rise of regression modeling and other more sophisticated statistical techniques within clinical research (17, 20–24).
Clinician discomfort with statistical concepts is also demonstrated by the popularity of “positive” p value results less than 0.05, despite extensive advice to deemphasize such arbitrary statistical inference and instead prefer effect size determinations with point estimates and measures of uncertainty, such as confidence, credibility, or prediction intervals; or their Bayesian congeners (25–28).
Assessments of statistical procedures encountered in published medical research have included analyses within one journal (14, 20, 24), across several journals (12, 17, 29), and within specialty audiences (30–32). The aim of the present investigation was to update the curriculum of biostatistical tests encountered within a cohort of recent RCTs relevant to the broad clinical practice of intensive care physicians.
MATERIALS AND METHODS
Journal and Article Selection
The published articles for analysis were identified by a two-step process. First, journals included in the Thomson Reuters InCites Journal Citation Reports were searched by the categories “Medicine, General and Internal” and “Critical Care Medicine” and then ranked by impact factor, that is, the average number of times articles from the journal published in the previous 2 years were cited in the Journal Citation Reports year (33). The 10 journals with the highest impact factors were then searched using an Ovid MEDLINE search strategy (Supplemental Digital Content 1, http://links.lww.com/CCM/D933) for RCTs relevant to adult/pediatric critical care or intensive care practice published between 2011 and 2015 inclusive. This 5-year period was chosen to provide an adequate sample size of publications likely to be more recent than those included in some existing methods analyses. All identified articles then underwent a second-stage manual screening process that excluded articles not meeting the above criteria, leaving the analysis set of published RCTs (Fig. 1).
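The two-year impact factor definition above reduces to a single ratio; a minimal sketch follows, with wholly hypothetical citation and article counts for illustration.

```python
def impact_factor(citations_in_jcr_year: int, citable_items_prev_two_years: int) -> float:
    """Two-year impact factor: citations received in the JCR year to items
    published in the previous 2 years, divided by the number of citable
    items published in those same 2 years."""
    return citations_in_jcr_year / citable_items_prev_two_years

# Hypothetical journal: 2,000 citations in the JCR year to 400 articles
# published across the previous 2 years.
print(impact_factor(2000, 400))  # → 5.0
```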
Data Extraction and Terminology
The following three terms describe the hierarchy of labels given to the identified statistical procedures: 1) “test” describes any use of statistical inference or a statistical procedure regardless of name or underlying methodology, 2) “method” is a recognizably named statistical “test” (such as chi-square), and 3) “category” refers to a collection of “tests” or “methods” with similar underlying statistical assumptions (such as the relevance of chi-square and Fisher exact tests to analysis of data within 2 × 2 contingency tables) (Supplemental Digital Content 2, http://links.lww.com/CCM/D934).
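This test/method/category hierarchy amounts to a lookup from named methods to categories. A minimal sketch follows; the method names and groupings are illustrative examples only, not the study's actual Supplemental Digital Content 2 classification.

```python
# Illustrative method-to-category mapping; the groupings here are
# hypothetical examples, not the study's actual classification scheme.
CATEGORY_OF_METHOD = {
    "chi-square": "2x2 contingency table",
    "Fisher exact": "2x2 contingency table",
    "Wilcoxon rank-sum": "nonparametric rank order",
    "log-rank": "unadjusted time-to-event",
    "Cox proportional hazards": "adjusted time-to-event",
}

def categorize(test_name: str) -> str:
    """Map a named "method" to its "category"; any other "test" falls through."""
    return CATEGORY_OF_METHOD.get(test_name, "uncategorized/other")

print(categorize("Fisher exact"))       # → 2x2 contingency table
print(categorize("unnamed procedure"))  # → uncategorized/other
```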
Two authors (J.M., J.P.) assessed all articles in the analysis set for the statistical methodology used to test the trial outcomes stated in the article. Tests associated with primary outcomes were also recorded separately. Each type of statistical test was only recorded once per publication, even if used to return more than one result in the published article. When no specific statistical test or method was recorded, the method of data display was noted.
Differences between these two reviewers were resolved by discussion, consulting supplementary material accompanying the published article if necessary.
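The counting rule described above, in which each distinct test is tallied at most once per article, can be sketched as follows; the article identifiers and test lists are hypothetical stand-ins for the extracted data.

```python
from collections import Counter

# Hypothetical extraction: article -> all tests mentioned for its outcomes
tests_by_article = {
    "trial_01": ["chi-square", "Cox proportional hazards", "chi-square"],
    "trial_02": ["chi-square", "log-rank"],
    "trial_03": ["Wilcoxon rank-sum"],
}

freq = Counter()
for tests in tests_by_article.values():
    freq.update(set(tests))  # deduplicate: each test counted once per article

n_articles = len(tests_by_article)
proportions = {test: count / n_articles for test, count in freq.items()}
print(freq["chi-square"], round(proportions["chi-square"], 2))  # → 2 0.67
```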
Outcomes and Analysis
The primary outcome of the present investigation was the overall frequency of each type of identified statistical test or method. Similar types of statistical tests or methods were grouped into categories (Supplemental Digital Content 2, http://links.lww.com/CCM/D934) based upon previously published classifications (12, 14, 20, 34, 35). The proportions of selected types of statistical methods or category of methods identified in this study were compared with proportions reported by two key published works (4, 14). All analyses were performed using SAS Studio release 3.4 (SAS Institute, Cary, NC) or Stata release 15 (StataCorp, College Station, TX). Graphs were created with GraphPad Prism 6 (GraphPad Software, La Jolla, CA).
RESULTS
Journal Selection and Publication Characteristics
The Ovid MEDLINE search of the 10 journals between 2011 and 2015 inclusive identified 218 candidate publications, which were individually screened, resulting in an analysis cohort comprising 116 RCTs (Table 1). The most common source journal was the New England Journal of Medicine (37/116; 32%), whereas the most common region of origin of the trial’s first author was Europe (62/116; 54%). The majority (79%) of the randomized trials reported adult patients. An interim analysis was mentioned in 51% of studies, and 15% of studies reported a cluster design. Only one RCT article did not report any p values, and 50/116 (43%) reported statistical tests of baseline characteristics.
Overall Distribution of Statistical Tests
Within the 116 RCTs, there were 580 identified statistical tests or methods directly associated with at least one reported trial outcome. At least 48 different descriptive terms or phrases were used for these statistical tests or methods (Fig. 2). Slightly over half of these terms (26/48; 54%) described tests or methods that could be regarded as basic rather than more advanced or multivariate in construction (Supplemental Digital Content 2, http://links.lww.com/CCM/D934).
The most frequently reported individual statistical methods were chi-square (70/116 articles; 60%), Cox proportional hazards models (63/116; 54%), logistic regression (53/116; 46%), and Fisher exact, log-rank, and Wilcoxon rank-sum tests (each 52/116; 45%). Together, these six statistical methods accounted for 342 of the 580 (59%) tests or methods identified.
When the tests or methods were classified according to the underlying statistical assumptions used to compare the trial random samples (Supplemental Digital Content 2, http://links.lww.com/CCM/D934; and Fig. 3), the two most common categories within the 116 RCTs were unadjusted tests of 2 × 2 contingency tables and nonparametric tests of rank order.
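As a minimal sketch of the most common category, the Pearson chi-square statistic for a 2 × 2 contingency table can be computed directly from the standard shortcut formula; the cell counts below are invented for illustration, and no continuity correction is applied.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (no continuity correction) for the
    2x2 table [[a, b], [c, d]], via the shortcut formula
    chi2 = N * (ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d))."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical trial arms: 10/30 events vs 30/70 events
print(round(chi_square_2x2(10, 20, 30, 40), 3))  # → 0.794
```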
More complex categories were also encountered commonly within the 116 RCTs, including adjusted time-to-event analyses, identified 63 times, other generalized linear models noted 40 times (of which 17 were Poisson regression), and 21 generalized linear mixed effects models.
Test Frequency Relative to Previous Reports
The relative frequency of selected statistical methods was compared with previous work published approximately 5 and 10 years prior to the present RCT cohort (4, 14). Compared with those prior descriptions (Table 2; and Supplemental Digital Content 3, http://links.lww.com/CCM/D935), changes over time were evident, with chi-square, logistic, time-to-event, and mixed model methods potentially more prevalent in recent RCT analyses.
Statistical Tests of the RCT Primary Outcome
Within the 116 RCTs, there were 122 statistical tests or methods (described by 28 different terms) identified as directly associated with a trial primary outcome. The most commonly encountered tests or methods, together accounting for half of all those identified, were chi-square (25/122; 20%), Cox proportional hazards models (14/122; 11%), Fisher or other exact methodology (12/122; 10%), and logistic regression (10/122; 8%).
When tests were grouped according to similarity of statistical methodology, analyses of simple 2 × 2 contingency tables were the most frequent tests used to evaluate the primary outcomes of RCTs (40/122; 33%), followed by nonparametric tests of rank order (17/122; 14%). In the remaining analyses, adjusted time-to-event methods, logistic regression, and other generalized linear models were somewhat more commonly applied to RCT primary outcomes than unadjusted time-to-event analyses, t tests, ordinary linear regression, contingency tests greater than 2 × 2, or generalized linear mixed effects models.
DISCUSSION
The present study describes and categorizes the statistical tests and methods used to derive primary and secondary outcomes within recent RCTs relevant to critical care medicine reported in highly cited clinical journals. These statistical methods were numerous and diverse, with 580 individual tests or methods described using 48 terms or phrases. As well as documenting the large curriculum of statistical methods encountered by clinicians evaluating these RCTs, the present study also suggested an increasing complexity of statistical methodology within RCT reports.
The present investigation observed that in addition to basic statistical comparisons of proportions or mean values between two experimental groups, current clinical audiences need to interpret more sophisticated biostatistical techniques including time-to-event analyses and regression models incorporating a mixture of fixed and random effects. The substantial curriculum of biostatistical methodology observed within the current cohort of RCTs agrees with other recent work demonstrating an increasing quantity and sophistication of statistical methods across most clinical research literature (14, 18, 24).
These circumstances define a challenge to clinicians without substantial biostatistical training who aspire to understand research trial analyses. Not only are there a substantial number of statistical methods in common use, but also there is a specific and large biostatistical vocabulary within those clinical trial reports (17). Although basic statistical analyses remain in common use for both primary and secondary trial outcomes, the present set of RCTs also commonly reported multivariable-adjusted analyses. In particular, adjusted time-to-event analyses such as Cox proportional hazards models and multivariable logistic models were common, with these two methods among those known to be poorly interpreted by trainee clinicians (4). An important observation of the present study was that readers may encounter some form of generalized linear model, or generalized linear mixed-effects regression model for longitudinal and correlated data (36), in up to one half of modern reports of clinical RCTs.
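To illustrate the time-to-event methodology discussed above, a minimal sketch of the unadjusted Kaplan–Meier (product-limit) survival estimate, the quantity that log-rank tests and Cox models build upon, follows; the follow-up times and event indicators are invented, and this is not the adjusted modeling the trials themselves report.

```python
from itertools import groupby

def kaplan_meier(times, events):
    """Unadjusted Kaplan-Meier product-limit estimate.
    times: follow-up times; events: 1 = event observed, 0 = censored.
    Returns [(time, S(time))] at each distinct event time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    for t, group in groupby(data, key=lambda pair: pair[0]):
        group = list(group)
        deaths = sum(event for _, event in group)
        if deaths:
            survival *= 1 - deaths / at_risk  # product-limit step
            curve.append((t, survival))
        at_risk -= len(group)  # events and censorings both leave the risk set
    return curve

# Hypothetical arm: events at t=1, 2, 4; censored follow-up at t=3 and 5
for t, s in kaplan_meier([1, 2, 3, 4, 5], [1, 1, 0, 1, 0]):
    print(t, round(s, 3))  # survival steps: 0.8, 0.6, 0.3
```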
Only one of 116 articles within the present study reported no p values, confirming the persistent popularity of testing for arbitrary binary statistical significance despite strong recommendations that the strength of evidence for a real treatment effect is better presented as estimated effect sizes with accompanying uncertainty metrics (5, 8, 27, 28, 37, 38).
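The recommended reporting style, an effect estimate with an uncertainty interval rather than a bare p value, can be sketched for a binary outcome as a risk difference with a normal-approximation (Wald) 95% confidence interval; the event counts below are invented for illustration.

```python
from math import sqrt

def risk_difference_ci(events1, n1, events2, n2, z=1.96):
    """Risk difference between two arms with a Wald (normal-approximation)
    confidence interval: a point estimate plus an uncertainty metric."""
    p1, p2 = events1 / n1, events2 / n2
    rd = p1 - p2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return rd, rd - z * se, rd + z * se

# Hypothetical trial: 30/100 events vs 20/100 events
rd, lower, upper = risk_difference_ci(30, 100, 20, 100)
print(round(rd, 3), round(lower, 3), round(upper, 3))  # → 0.1 -0.019 0.219
```

Here the interval crossing zero conveys the uncertainty that a bare "p > 0.05" would compress into a binary verdict.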
Although previously identified deficits in the biostatistical knowledge of clinical researchers may be improving (10), the present observations together suggest that it remains unrealistic to assume that clinicians without specific biostatistical training can adequately understand the statistical basis of current clinical research trial reports. Although education of clinicians through guidelines (26) and organized programs (19) remains relevant, ready access within academic health science institutions to formally trained biostatistical support appears increasingly relevant to the scientific integrity of all forms of clinical research and practice (15, 16).
A number of biostatistical methods were notably absent from the present cohort of randomized trial reports. Although most clinical studies continue to be designed from a simple two-arm frequentist statistical perspective (39), concurrent evaluation of multiple treatments may be more efficient and ethical (40). Response-adaptive multiarm and Bayesian trials have been described in oncology and related areas for many years (41, 42), with these methods expected to become more widely adopted in clinical trials on other topics “including critical care” (43), especially as the relative efficiency of multiarm trials and similar designs becomes more widely appreciated (40, 44–46).
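As a sketch of the Bayesian reasoning behind such response-adaptive designs, the posterior probability that one arm's response rate exceeds another's can be approximated by Monte Carlo sampling from Beta posteriors; uniform Beta(1, 1) priors are assumed here, and the interim arm data are invented.

```python
import random

def prob_arm_a_superior(succ_a, n_a, succ_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(p_A > p_B) under independent
    Beta(1 + successes, 1 + failures) posteriors (uniform priors)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        p_a = rng.betavariate(1 + succ_a, 1 + n_a - succ_a)
        p_b = rng.betavariate(1 + succ_b, 1 + n_b - succ_b)
        wins += p_a > p_b
    return wins / draws

# Hypothetical interim data: 30/40 responses on arm A vs 10/40 on arm B
print(prob_arm_a_superior(30, 40, 10, 40))  # near 1, strongly favoring arm A
```

In a response-adaptive design, a probability like this could shift future allocation toward the better-performing arm, a decision a fixed two-arm frequentist design cannot make mid-trial.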
A strength of the present study is its characterization of the curriculum required for in-depth understanding of the biostatistical analyses within recent influential RCTs relevant to critical illness, both for primary trial outcomes and secondary analyses.
This study has several limitations, including sample size, both in the range of journals and number of clinical trial reports examined. There is an exclusive focus on RCTs, without considering the wider statistical methods that may be required to evaluate observational data (47–49). Despite consensus from two authors, there was also the potential for error in the classification of the statistical procedures and tests encountered.
The present study has again highlighted the complexities facing clinicians when evaluating recent potentially influential clinical trial reports, which contain a substantial curriculum of biostatistical methods. In addition to ongoing efforts to support formal training of clinicians in relevant methodology (14, 19), we suggest that accessible specialist biostatistical support is not only integral to the quality of clinical research, but also an asset to evidence-based clinical practice.
ACKNOWLEDGMENTS
We thank Lars Eriksson, University of Queensland Library, Herston Health Sciences Library, Brisbane, Queensland, Australia, for assistance with the search strategy.
REFERENCES
1. Ioannidis JP, Haidich AB, Pappa M, et al. Comparison of evidence of treatment effects in randomized and nonrandomized studies. JAMA 2001; 286:821–830
2. Concato J, Shah N, Horwitz RI. Randomized, controlled trials, observational studies, and the hierarchy of research designs. N Engl J Med 2000; 342:1887–1892
3. Myburgh JA, Finfer S, Bellomo R, et al; CHEST Investigators; Australian and New Zealand Intensive Care Society Clinical Trials Group: Hydroxyethyl starch or saline for fluid resuscitation in intensive care. N Engl J Med 2012; 367:1901–1911
4. Windish DM, Huot SJ, Green ML. Medicine residents’ understanding of the biostatistics and results in the medical literature. JAMA 2007; 298:1010–1022
5. Altman DG, Bland JM. Improving doctors’ understanding of statistics. J R Stat Soc Ser A Stat Soc 1991; 154:223–267
6. Murray GD: Teaching statistics to physicians. In: Encyclopedia of Biostatistics. Armitage P, Colton T (Eds). Second Edition. Chichester, John Wiley & Sons, 2005
7. Altman DG. The scandal of poor medical research. BMJ 1994; 308:283–284
8. Ioannidis JP. Why most published research findings are false. PLoS Med 2005; 2:e124
10. Dechartres A, Trinquart L, Atal I, et al. Evolution of poor reporting and inadequate methods over time in 20 920 randomised controlled trials included in Cochrane reviews: Research on research study. BMJ 2017; 357:j2490
11. Fernandes-Taylor S, Hyun JK, Reeder RN, et al. Common statistical and research design problems in manuscripts submitted to high-impact medical journals. BMC Res Notes 2011; 4:304
12. Tetrault JM, Sauler M, Wells CK, et al. Reporting of multivariable methods in the medical literature. J Investig Med 2008; 56:954–957
13. Swift L, Miles S, Price GM, et al. Do doctors need statistics? Doctors’ use of and attitudes to probability and statistics. Stat Med 2009; 28:1969–1981
14. Arnold LD, Braganza M, Salih R, et al. Statistical trends in the Journal of the American Medical Association and implications for training across the continuum of medical education. PLoS One 2013; 8:e77301
15. Welty LJ, Carter RE, Finkelstein DM, et al; Biostatistics, Epidemiology, and Research Design Key Function Committee of the Clinical and Translational Science Award Consortium: Strategies for developing biostatistics resources in an academic health center. Acad Med 2013; 88:454–460
16. Moher D, Altman DG. Four proposals to help improve the medical research literature. PLoS Med 2015; 12:e1001864
17. Narayanan R, Nugent R, Nugent K. An investigation of the variety and complexity of statistical methods used in current internal medicine literature. South Med J 2015; 108:629–634
18. Yi D, Ma D, Li G, et al. Statistical use in clinical studies: Is there evidence of a methodological shift? PLoS One 2015; 10:e0140159
19. Thor J, Olsson D, Nordenström J. The design, fate and impact of a hospital-wide training program in evidence-based medicine for physicians - an observational study. BMC Med Educ 2016; 16:86
20. Emerson JD, Colditz GA. Use of statistical analysis in the New England Journal of Medicine. N Engl J Med 1983; 309:709–713
21. Horton NJ, Switzer SS. Statistical methods in the journal. N Engl J Med 2005; 353:1977–1979
22. Strasak AM, Zaman Q, Marinell G, et al. The use of statistics in medical research. Am Stat 2007; 61:47–55
23. Saquib N, Saquib J, Ioannidis JP. Practices and impact of primary outcome adjustment in randomized controlled trials: Meta-epidemiologic study. BMJ 2013; 347:f4313
24. Sato Y, Gosho M, Nagashima K, et al. Statistical methods in the journal - an update. N Engl J Med 2017; 376:1086–1087
25. Shakespeare TP, Gebski VJ, Veness MJ, et al. Improving interpretation of clinical studies by use of confidence levels, clinical significance curves, and risk-benefit contours. Lancet 2001; 357:1349–1353
26. Schulz KF, Altman DG, Moher D; CONSORT Group: CONSORT 2010 statement: Updated guidelines for reporting parallel group randomised trials. BMJ 2010; 340:c332
27. Chavalarias D, Wallach JD, Li AH, et al. Evolution of reporting p values in the biomedical literature, 1990-2015. JAMA 2016; 315:1141–1148
28. Wasserstein RL, Lazar NA. The ASA’s statement on p-values: Context, process, and purpose. Am Stat 2016; 70:129–133
29. Taback N, Krzyzanowska MK. A survey of abstracts of high-impact clinical journals indicated most statistical methods presented are summary statistics. J Clin Epidemiol 2008; 61:277–281
30. Scott FI, McConnell RA, Lewis ME, et al. Increasing complexity of clinical research in gastroenterology: Implications for the training of clinician-scientists. Am J Gastroenterol 2012; 107:496–500
31. Kurichi JE, Sonnad SS. Statistical methods in the surgical literature. J Am Coll Surg 2006; 202:476–484
32. Tilson JK, Marshall K, Tam JJ, et al. A bibliometric analysis of statistical terms used in American Physical Therapy Association journals (2011-2012): Evidence for educating physical therapists. BMC Med Educ 2016; 16:118
34. Scotch M, Duggal M, Brandt C, et al. Use of statistical analysis in the biomedical informatics literature. J Am Med Inform Assoc 2010; 17:3–5
35. Kirkwood BR, Sterne JA. Linking analysis to study design: Summary of methods. In: Essentials of Medical Statistics. Second Edition. Oxford, United Kingdom, Blackwell, 2003, pp 395–412
36. Detry MA, Ma Y. Analyzing repeated measurements using mixed models. JAMA 2016; 315:407–408
37. Pocock SJ, Stone GW. The primary outcome fails - what next? N Engl J Med 2016; 375:861–870
38. Pocock SJ, Stone GW. The primary outcome is positive - is that good enough? N Engl J Med 2016; 375:971–979
39. Chevret S. Bayesian adaptive clinical trials: A dream for statisticians only? Stat Med 2012; 31:1002–1013
40. Lin J, Bunn V. Comparison of multi-arm multi-stage design and adaptive randomization in platform clinical trials. Contemp Clin Trials 2017; 54:48–59
41. Berry DA. Adaptive clinical trials in oncology. Nat Rev Clin Oncol 2011; 9:199–207
42. Ventz S, Barry WT, Parmigiani G, et al. Bayesian response-adaptive designs for basket trials. Biometrics 2017; 73:905–915
43. Kalil AC, Sun J. Bayesian methodology for the design and interpretation of clinical trials in critical care medicine: A primer for clinicians. Crit Care Med 2014; 42:2267–2277
44. Berry SM, Connor JT, Lewis RJ. The platform trial: An efficient strategy for evaluating multiple treatments. JAMA 2015; 313:1619–1620
45. Saville BR, Berry SM. Efficiencies of platform clinical trials: A vision of the future. Clin Trials 2016; 13:358–366
46. Ventz S, Alexander BM, Parmigiani G, et al. Designing clinical trials that accept new arms: An example in metastatic breast cancer. J Clin Oncol 2017; 35:3160–3168
47. von Elm E, Altman DG, Egger M, et al; STROBE Initiative: The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: Guidelines for reporting observational studies. Lancet 2007; 370:1453–1457
48. Ferreira JA. Some models and methods for the analysis of observational data. Statist Surv 2015; 9:106–208
49. Cooke CR, Iwashyna TJ. Using existing data to address important clinical questions in critical care. Crit Care Med 2013; 41:886–896