Considerable attention has been drawn to poor replicability and reproducibility of biomedical studies.1 One explanation is inadequate reporting of statistical methods by authors and inadequate assessment of statistical reporting and methods during peer review.
In this narrative review, we describe the published experiences with statistical review. We review studies found using multiple searches in PubMed and Google Scholar. Many lack structured abstracts; we failed to identify a structured search protocol that found all the studies. However, every reference of the studies was reviewed for any additional studies. Using Google Scholar, every citation also was reviewed. As of March 29, 2016, our review included every reference and citation with data.
EDITOR INSTRUCTIONS TO AUTHORS
Geoffrey Loftus became the Editor-in-Chief of Memory & Cognition in 1994. Based on his long-standing concerns with null hypothesis statistical testing, in January 1993, Editor-Elect Loftus instructed authors that “data should be conveyed as a figure depicting sample means with associated standard errors and/or, where appropriate, standard deviations.”2 In Loftus’ view, “more often than not, inspection of such a figure will immediately obviate the necessity of any hypothesis testing procedures. In such situations, presentation of the usual hypothesis-testing information (F values, P values, etc) will be discouraged.” Loftus’ term as Editor-in-Chief lasted from 1994 through 1997. In 2004 (ie, available to us in our decision-making for Anesthesia & Analgesia), Finch et al3 examined the impact of Loftus’ statistical requests on manuscripts published in Memory & Cognition. In 1990, 1991, and 1992, error bars on figures appeared on approximately 10% of manuscripts. During Loftus’ tenure as Editor-in-Chief, error bars appeared in approximately 40% of papers. After Loftus’ term as Editor-in-Chief, the use of error bars in figures decreased from approximately 35% in 1998 to just 21% in 2000. Finch et al concluded: “at no stage, however, were confidence intervals or other error bars often used to support the interpretation of data, and null hypothesis statistical testing remained throughout the greatly dominant approach to inference. A substantial proportion of psychologists will need to make major changes to their statistical practices if the latest recommendations of the American Psychological Association are to be implemented successfully.”3
These disappointing results suggest that author instructions, editorials, and changes in editorial policy alone are insufficient to change the quality of the reporting of studies. Editors need internal processes to assure adherence to published policies.
In 2008, recognizing that “a detailed and comprehensive set of publication guidelines” was needed “for larger and more complex [quality] improvement studies,” an interdisciplinary team led by Greg Ogrinc of Dartmouth Medical School published a statement of “Standards for QUality Improvement Reporting Excellence” (SQUIRE).4,a The SQUIRE checklist has 19 sections and 50 individual statements. In 2015, Howell et al5 evaluated how uniformly the SQUIRE guidelines had been adopted by 4 journals in the field of health care quality: the American Journal of Medical Quality, the BMJ Quality and Safety, the International Journal for Quality in Healthcare, and the Joint Commission Journal on Quality and Patient Safety. They analyzed articles published from 2006 to 2008 (pre-SQUIRE) and from 2010 to 2012 (post-SQUIRE). Of the 50 individual statements in the SQUIRE checklist, papers published before publication of the checklist reported a mean of 40% (standard deviation [SD], 10%) of the items, whereas papers published after publication of the checklist reported a mean of 41% (SD, 14%) of the items.5 Reporting by the authors was not more complete after publication of the checklist than before (P = .9).5 The authors concluded: “the lack of positive findings in this study suggests that the SQUIRE guidelines are not being optimally implemented by authors and journals publishing quality improvement reports.”5 These results suggest that the availability of a well-publicized checklist does not, by itself, change article quality.
In 2007, the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) initiative simultaneously published, in multiple high-impact journals, recommendations “on what should be included in an accurate and complete report of an observational study.”6–13 Anesthesia & Analgesia started to require use of the STROBE checklist in 2010.14
In 2013, Bastuji-Garin et al15 examined the quality of reporting of observational studies before and after the publication of the STROBE statement in 4 highly cited dermatology journals: the Journal of Investigative Dermatology, the British Journal of Dermatology, the Journal of the American Academy of Dermatology, and the Archives of Dermatology. The outcome variable studied “was the STROBE score, defined as the number of the 22 STROBE items adequately reported divided by the number of applicable items, expressed as a percentage.”15 “The 13 STROBE items with several questions (2–15 questions per item, online supplement) were considered adequately reported when at least 50% of their questions had ‘yes’ answers (after exclusion of the ‘not applicable’ components).”15 From 3 years before through 3 years after the publication of the STROBE statement, the median of the STROBE score among dermatologic epidemiology studies increased from 48% to 58%.15 Using segmented (interrupted) linear regression, the authors determined that the observed change was a consistent trend over time, not influenced (P = .64) by publication of the STROBE checklist.15 This is consistent with the findings reported for the SQUIRE checklist: the availability of a checklist alone does not increase the quality of study reporting.
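The scoring rule quoted above reduces to simple arithmetic. The following Python sketch is one possible reading of that rule, not the code used by Bastuji-Garin et al. The function names, the data layout, and the use of `None` for "not applicable" are assumptions made for illustration, and the example article is hypothetical.

```python
from typing import Optional, Sequence

# Each item is a list of answers to its questions:
# True = "yes", False = "no", None = "not applicable" (an assumed encoding).
Answers = Sequence[Optional[bool]]


def item_adequate(answers: Answers) -> Optional[bool]:
    """Return True/False for an applicable item, or None if no question applies."""
    applicable = [a for a in answers if a is not None]
    if not applicable:
        return None  # the whole item is "not applicable"
    # Adequately reported when at least 50% of the applicable questions are "yes".
    return 2 * sum(applicable) >= len(applicable)


def strobe_score(items: Sequence[Answers]) -> float:
    """Adequately reported items divided by applicable items, as a percentage."""
    verdicts = [v for v in (item_adequate(i) for i in items) if v is not None]
    if not verdicts:
        raise ValueError("no applicable items")
    return 100.0 * sum(verdicts) / len(verdicts)


# Hypothetical article scored on 4 items (the real checklist has 22).
example = [
    [True],                # single-question item, reported
    [True, False, None],   # 1 of 2 applicable questions: adequate (50%)
    [False, False, True],  # 1 of 3 applicable questions: not adequate
    [None, None],          # not applicable: excluded from the denominator
]
print(strobe_score(example))
```

In this hypothetical example, 2 of the 3 applicable items are adequately reported, so the score is two-thirds, expressed as a percentage.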
Similarly, in 2016, Pouwels et al16 examined the impact of the STROBE checklist on the quality of reporting in 5 epidemiology journals (the International Journal of Epidemiology, Epidemiology, the Journal of Clinical Epidemiology, the American Journal of Epidemiology, and the Journal of Epidemiology) and 5 general medical journals (New England Journal of Medicine, The Lancet, the Journal of the American Medical Association, the BMJ, and the Annals of Internal Medicine). The authors compared the quality of reporting in the 3 years before the STROBE statement with the quality of reporting in the first 3 years after the release of the STROBE statement.16 The authors focused on whether confounding variables were reported, including “characteristics of key confounders,” “why potential confounders were selected for analysis,” and “methods to control for confounding (eg, stratification, multivariate regression, [or] propensity score matching).”16 The results also were disappointing. There was a small, albeit statistically significant, shift in the number of items reported after publication of the STROBE statement. However, the median number of items reported remained 4 (interquartile range, 3–5 before vs 3.75–5 after). Several of the journals reviewed had participated in the 2007 release of the STROBE statement (The Lancet,7 Epidemiology,8 the BMJ,11 the Annals of Internal Medicine,12 and the Journal of Clinical Epidemiology13). Compliance with the STROBE statement was not significantly better among the journals that published the initial statement than among the journals that did not publish the initial statement.16 Similarly, including a specific instruction to use the STROBE statement in the Guide for Authors did not improve compliance with the STROBE guidelines. Among the journals considered, only the BMJ required a completed STROBE checklist to accompany submitted manuscripts. The authors found that articles in the BMJ did not have better reporting of confounding variables than journals that endorsed the STROBE checklist but did not require checklist completion.16
Given the widespread publication of the STROBE checklist in 2007 in highly cited journals, and subsequent endorsement of the STROBE checklist by many journals, including Anesthesia & Analgesia, it is remarkable that the STROBE checklist has had so little impact on the quality of reporting for observational studies. These findings suggest that the publication of checklists, the endorsement of checklists by journals, and even a requirement for submission of completed checklists have little demonstrable impact on the quality of reporting.
MISSING ITEMS ARE OFTEN BASIC STATISTICS
Langan et al17 reviewed cohort, case–control, and cross-sectional epidemiologic studies published between 2005 and 2007 in the “highest impact factor” dermatology journals for missing STROBE elements. The journals reviewed were the Archives of Dermatology, the British Journal of Dermatology, the Journal of the American Academy of Dermatology, the Journal of Investigative Dermatology, and the Acta Dermato-Venereologica. Basic statistical information was frequently missing in the surveyed papers. Of the 137 papers reviewed, the number of individuals at each applicable stage of the study (eg, individuals examined for eligibility) was reported in only 54 studies (39%), losses to follow-up were reported in only 17 studies (12%), and management of missing data was reported in only 8 studies (6%). Even more remarkable is that the statistical methods were reported in only 19 studies (14%), and only 10 studies (7%) included sample size calculations.
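The percentages in this paragraph follow directly from the reported counts and the denominator of 137 papers. As a minimal sketch of the kind of arithmetic check a statistical reviewer might run, the Python below recomputes each percentage. The function name and the tuple layout are assumptions made for illustration.

```python
# (item, count, denominator, percentage as reported in the text)
reported = [
    ("individuals at each stage", 54, 137, 39),
    ("losses to follow-up", 17, 137, 12),
    ("management of missing data", 8, 137, 6),
    ("statistical methods", 19, 137, 14),
    ("sample size calculations", 10, 137, 7),
]


def mismatches(rows):
    """Return (item, reported, recomputed) for each percentage that disagrees."""
    bad = []
    for label, count, total, pct in rows:
        # round() uses round-half-to-even; that matters only for exact .5 values.
        recomputed = round(100 * count / total)
        if recomputed != pct:
            bad.append((label, pct, recomputed))
    return bad


print(mismatches(reported))  # prints [] because every percentage matches
```

An empty list means that each reported whole-number percentage agrees with its count.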
Šimundić and Nikolac18 reviewed all manuscripts submitted to Biochemia Medica between 2006 and 2009 with “some kind of statistical analysis of the data.” The authors identified at least 1 basic statistical error in all 55 manuscripts.18 The most common error was lack of power analysis, which was not reported in any of the papers.18 Three-fourths of the papers used the wrong statistical test to compare 3 or more groups. Two-thirds of the papers incorrectly presented P values.18 Other common errors were failing “to use non-parametric test like Mann-Whitney Rank Sum test when [the] sample [size was] too small or when data [were] not normally distributed” and failing “to use Fisher exact test instead of testing data with chi-square test due to the low cell frequencies.”18
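The chi-square versus Fisher exact test error can be illustrated with a short sketch. The Python below uses only the standard library to compute the expected cell counts of a 2 × 2 table and a two-sided Fisher exact P value. The threshold of 5 for a "low" expected count is a conventional rule of thumb and an assumption here; the source says only "low cell frequencies." The function names and the example table are hypothetical.

```python
from math import comb


def expected_counts(a, b, c, d):
    """Expected cell counts of the 2x2 table [[a, b], [c, d]] under independence."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    return [[r * k / n for k in cols] for r in rows]


def small_expected(a, b, c, d, threshold=5):
    """True when any expected count is below the rule-of-thumb threshold."""
    return any(e < threshold for row in expected_counts(a, b, c, d) for e in row)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value: the summed probability of all tables with
    the observed margins that are no more likely than the observed table."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)

    def prob(x):  # hypergeometric probability that the top-left cell equals x
        return comb(r1, x) * comb(r2, c1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # The small tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Hypothetical table with small cells: every expected count is 2.
table = (3, 1, 1, 3)
if small_expected(*table):
    print("expected counts are low; Fisher exact P =", fisher_exact_two_sided(*table))
```

In practice a reviewer would use a validated statistical package rather than hand-written code; the sketch only shows what the check involves.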
Vähänikkilä et al19 reviewed articles published in 2010 in 5 dental journals: Journal of Dental Research, Caries Research, Community Dentistry and Oral Epidemiology, Journal of Dentistry, and Acta Odontologica Scandinavica. Of the 200 articles reviewed by the authors, only 18 (9%) were deemed as having acceptable statistical reporting.19 Among the 182 (91%) articles with inadequate statistical reporting, the authors identified inadequate “justification for the number of cases” in 160 articles (80%), “incomplete description of statistical procedures” in 65 papers (33%), and P values reported as thresholds (eg, “P > .05”) in 58 papers (29%).19
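Reporting an exact P value instead of a threshold such as "P > .05" is largely a formatting matter. The sketch below shows one way to do it in Python. The style without a leading zero mirrors how P values appear in this article, whereas the function name, the default of 2 decimal places, and the floor of .001 are assumptions made for illustration.

```python
def format_p(p: float, digits: int = 2, floor: float = 0.001) -> str:
    """Format an exact P value without a leading zero, eg 'P = .64'.

    Values below `floor` are reported as 'P < .001' rather than rounded to zero.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("P value must lie in [0, 1]")
    if p < floor:
        return "P < " + f"{floor:.3f}".lstrip("0")
    text = f"{p:.{digits}f}"
    if float(text) == 0.0:
        # The value would round to .00, so keep a third decimal place instead.
        text = f"{p:.3f}"
    return "P = " + text.lstrip("0")


for value in (0.64, 0.004, 0.0004):
    print(format_p(value))
```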
Prescott and Civil20 reviewed “a consecutive series of submissions” to the journal Injury, each of which was sent to Prescott for blinded statistical review between December 2010 and January 2012. The review included articles that were eventually published as well as articles rejected during the review process. Of the 100 papers, 47 had incorrect analyses. The most common errors were incorrect use of a 2 × 2 contingency table (14 papers), false precision in reporting percentages (30 papers), and “numerical errors” (19 papers).20
These results show that improper handling of basic statistical issues presents an ongoing problem in the medical literature. Our review of the literature on statistical reporting did not identify a single paper suggesting that any journal had high-quality statistical reporting. The evidence suggests that inadequate reporting of basic statistics is a fundamental and ubiquitous problem in the medical literature.
STATISTICAL REVIEW OF ALL RESEARCH PAPERS
Lukić and Marusić21 reviewed articles with “some type of statistical analysis” published in the Croatian Medical Journal before and after the appointment of a statistical editor in 1996. The authors identified errors in the statistical analysis in 52 of 97 papers (54%) “before the appointment of [the] statistical editor.” Remarkably, the estimated incidence of statistical errors was greater afterward: “after the appointment of [the] statistical editor,” “the statistics was not satisfactory” in 91 of 144 papers (63%).21 “Comparison of the average number of errors per published manuscript before and after” the appointment “did not reveal any notable difference.”21 Of these 144 papers with statistical analysis, the Editor-in-Chief sent only 30 papers (21%) for statistical review. The authors explain that the Editor-in-Chief selected manuscripts for statistical review based on the perceived complexity of the statistical analysis. The authors concluded that, in the case of the Croatian Medical Journal, “most problems arise from inadequate editorial practice.” The authors recommended “strict editorial policy on sending out all manuscripts with numerical data for statistical review, strict monitoring of revisions made in manuscripts…and reviewers with background not only in statistics, but in biomedicine as well.”
In an unusual prospective trial, Cobo et al22 used a balanced factorial design to randomize articles submitted to Medicina Clínica into 4 groups. The 2 interventions tested were assigning a specific statistical reviewer and suggesting reporting guidelines to the reviewer. Articles were evaluated based on the Goodman scale, which evaluates study design, sample size rationale, reporting of major variables, etc. The authors found that assigning a dedicated statistical reviewer significantly increased the quality of the article. There was no significant effect for the second intervention, recommending specific guidelines to reviewers.
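A balanced 2 × 2 factorial allocation of this kind can be sketched in a few lines. The Python below assigns manuscripts to the 4 combinations of the 2 interventions in shuffled blocks of 4. The allocation procedure that Cobo et al actually used is not described here, so the permuted-block scheme, the function name, and the seed are all assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import product

# The 4 cells of a 2 x 2 factorial design:
# (statistical reviewer assigned?, reporting guidelines suggested?)
ARMS = list(product((False, True), repeat=2))


def allocate(n_manuscripts: int, seed: int = 0) -> list:
    """Assign manuscripts to the 4 arms in shuffled blocks of 4, so that the
    arm sizes never differ by more than 1 manuscript."""
    rng = random.Random(seed)  # fixed seed makes the allocation reproducible
    allocation = []
    while len(allocation) < n_manuscripts:
        block = ARMS[:]
        rng.shuffle(block)
        allocation.extend(block)
    return allocation[:n_manuscripts]


arms = allocate(12)
print(Counter(arms))
```

Because each block contains every arm exactly once, each intervention is applied to half of the manuscripts whenever the total is a multiple of 4, which is what allows the 2 main effects to be estimated from the same sample.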
Pitkin et al23 reviewed the quality of statistical review in Obstetrics & Gynecology. In 1993, Obstetrics & Gynecology conducted the first of several assessments of their process for statistical review. They found that among 100 consecutive articles, there were 15 with “statistical or design flaws that the [statistical] consultant concluded were sufficient for the article to be rejected on that basis alone.” These errors were “unrecognized by the standard reviews.” Concerned by these findings, the journal instituted a requirement for 100% statistical review of all articles that “passed the standard peer review process.” In the first year of the new system, 16% of articles were found with serious enough “statistical or design flaws as to prompt rejection.” These were articles considered ready to be accepted by the reviewers before being evaluated by the Statistical Editor. The process was further formalized in 2001 with earlier and more formal statistical review. The authors found close agreement between the recommendations of the statistical reviewer and the ultimate disposition of the manuscript.
Henly et al24 assessed scientific and statistical reviews of articles submitted to Nursing Research over 1.5 years. They used the “General Assessment of Reviews of Nursing Research” as their assessment criteria.25 This is a tool to assess quality of reviews of nursing research, where quality refers to “adequacy of the argument used to support a judgment about the worth of a scientific manuscript.”25 The authors defined a fatal flaw as “a methodological problem that could not be corrected by revision of the manuscript.” The authors found that fatal flaws were more frequently identified “in a statistical review than in a scientific review (36% vs 16%, P < .02; N = 36 and 69, respectively).”24 The authors also assessed how useful reviews were in guiding the editorial decision on a 1 to 5 (poor to excellent) scale. They found that statistical reviews were more useful than scientific reviews “(mean = 3.2, SD = 0.64 vs mean = 2.5, SD = 0.94, P < .002)” in guiding the editorial decision.24 Using the same scale, they also found that the statistical review was more useful to authors than the scientific review “(mean = 3.1, SD = 0.79 vs mean = 2.5, SD = 0.79; P < .002).”24
These findings suggest that a formal policy of 100% statistical review is likely an important component of peer review for scientific papers. Editors-in-Chief, and other reviewers, are generally not able to assess adequately whether statistical review is needed. The data also suggest that scientific review is not a substitute for adequate statistical review.
Three findings are evident from prior studies of statistical reviewing. First, instructions to authors and statistical checklists are not sufficient. No studies suggested that instructions or checklists improve the quality of statistical methods and reporting. Second, even basic statistics such as power analyses are frequently missing or incorrectly performed. Third, statistical review is needed for all papers that involve data (ie, not just those with data analysis, because a common problem in research reports is the absence of analysis). A consistent finding in these studies is that nonstatistical reviewers (eg, “scientific reviewers”) and journal editors generally assess statistical quality poorly. Scientific reviewers and journal editors may lack the necessary expertise for statistical review. Alternatively, or in addition, scientific reviewers and journal editors may perform a fundamentally different assessment than statistical reviewers, one that precludes the rigid focus on analysis that statistical review requires.
The findings in this review are consistent with our experience. The second author (Dr. Shafer) served as Editor-in-Chief of Anesthesia & Analgesia from 2006 to 2016. The Journal progressively instituted statistical review of all research papers from 2011 through late 2013 and maintained that throughout 2014 and 2015. The first author (Dr. Dexter) served as Statistical Editor of Anesthesia & Analgesia from 2010 to 2015. Our tasks were completely different. As Editor-in-Chief, Dr. Shafer had to focus on an overall evaluation: Was it believable? Was it readable? Was it important? Was it novel? As statistical reviewer, Dr. Dexter needed to assess whether the statistics were clearly explained and appropriately conducted and whether the words in the text matched the conclusions of the data analysis.
There is a paradox, however, in these findings. If most statistical limitations of articles are basic statistics, as suggested by these studies, why is it necessary to have a dedicated statistical review of all papers? It would seem that competent editors and scientific reviewers should be able to identify errors in basic statistical analysis and reporting. If checklists are available for nearly all types of data analysis, why is a statistical reviewer necessary? It would seem that an editor, scientific reviewer, or even an editorial assistant could verify that the checklist was adequately followed. The studies do not answer this question. However, as a practical matter, our experience matches the findings reported here. As suggested above, reviewers and editors miss errors in statistical analysis and reporting that are immediately obvious to a dedicated statistical reviewer.
A limitation of our article is that although previous studies found that missing items were often basic biostatistics,17–20 the journals were not from our specialty, and the extent to which the results are generalizable is unknown. Anesthesia & Analgesia handles multiple papers that involve engineering statistics, including nonlinear modeling. These include method comparisons, psychometrics, and business statistics including computer simulation.26–31 A focus of industry-sponsored research published in Anesthesia & Analgesia is engineering.32 Even if a paper has 3 comments about basic statistics, a single comment about an engineering statistics issue may take 1 month or longer of analysis time to address. This is particularly the case if the statistical issue was not recognized by the authors before statistical review.
Another limitation is that modern statistical analyses are often complex and technically difficult, as shown in the May 2016 issue of Anesthesia & Analgesia.33–35,b Statisticians rely on software and algorithms that may introduce untested assumptions or unexpected errors. None of the studies we reviewed addressed whether statistical review could reduce errors in papers with complicated statistical analyses.
Based on this review, we conclude with 3 recommendations:
- Journals should have designated statistical reviewer(s).
- Journals should adhere to a policy of 100% statistical review of all manuscripts that have any form of data. Neither expediency nor the judgment of editors and reviewers is an excuse for bypassing statistical review.
- Statistical checklists and guidelines do not substitute for statistical review. No data suggest that they improve the quality of statistical reporting.
Jennifer Espy obtained library articles, edited the manuscript, and assisted in preparation of the journal submission.
Name: Franklin Dexter, MD, PhD.
Contribution: This author helped design the study, conduct the study, and write the manuscript.
Conflict of Interest: The Division of Management Consulting offers a course in planning statistical analyses in anesthesia. Dr. Dexter receives no funds personally other than his salary and allowable expense reimbursements from the University of Iowa and has tenure with no incentive program. He and his family have no financial holdings in any company related to his work other than indirectly through mutual funds for retirement. Income from the Division’s consulting work is used to fund Division research.
Name: Steven L. Shafer, MD.
Contribution: This author helped write the manuscript.
Conflicts of Interest: Steven L. Shafer declares no conflicts of interest.
This manuscript was handled by: Thomas R. Vetter, MD, MPH.
1. Vetter TR, McGwin G Jr, Pittet JF. Replicability, reproducibility, and fragility of research findings—ultimately, caveat emptor. Anesth Analg. 2016;123:244–248.
2. Loftus GR. Editorial comment. Mem Cogn. 1993;21:1–3.
3. Finch S, Cumming G, Williams J, et al. Reform of statistical inference in psychology: the case of memory & cognition. Behav Res Methods Instrum Comput. 2004;36:312–324.
4. Ogrinc G, Mooney SE, Estrada C, et al. The SQUIRE (Standards for QUality Improvement Reporting Excellence) guidelines for quality improvement reporting: explanation and elaboration. Qual Saf Health Care. 2008;17(suppl 1):i13–i32.
5. Howell V, Schwartz AE, O’Leary JD, Mc Donnell C. The effect of the SQUIRE (Standards of QUality Improvement Reporting Excellence) guidelines on reporting standards in the quality improvement literature: a before-and-after study. BMJ Qual Saf. 2015;24:400–406.
6. von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP; STROBE Initiative. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. PLoS Med. 2007;4:e296.
7. von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP; STROBE Initiative. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Lancet. 2007;370:1453–1457.
8. von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP; STROBE Initiative. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Epidemiology. 2007;18:800–804.
9. von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP; STROBE Initiative. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Bull World Health Organ. 2007;85:867–872.
10. von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP; STROBE Initiative. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Prev Med. 2007;45:247–251.
11. von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP; STROBE Initiative. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. BMJ. 2007;335:806–808.
12. von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP; STROBE Initiative. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. Ann Intern Med. 2007;147:573–577.
13. von Elm E, Altman DG, Egger M, Pocock SJ, Gøtzsche PC, Vandenbroucke JP; STROBE Initiative. The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement: guidelines for reporting observational studies. J Clin Epidemiol. 2008;61:344–349.
14. 2010 Anesthesia & Analgesia Guide for Authors: 2009–2010 Editorial Board, Anesthesia & Analgesia. Anesth Analg. 2010;111:525–538.
15. Bastuji-Garin S, Sbidian E, Gaudy-Marqueste C, et al.; European Dermatology Network (EDEN). Impact of STROBE statement publication on quality of observational study reporting: interrupted time series versus before-after analysis. PLoS One. 2013;8:e64733.
16. Pouwels KB, Widyakusuma NN, Groenwold RH, Hak E. Quality of reporting of confounding remained suboptimal after the STROBE guideline. J Clin Epidemiol. 2016;69:217–224.
17. Langan S, Schmitt J, Coenraads PJ, Svensson A, von Elm E, Williams H; European Dermato-Epidemiology Network (EDEN). The reporting of observational research studies in dermatology journals: a literature-based study. Arch Dermatol. 2010;146:534–541.
18. Šimundić AM, Nikolac N. Statistical errors in manuscripts submitted to Biochemia Medica journal. Biochemia Medica. 2009;19:294–300.
19. Vähänikkilä H, Tjäderhane L, Nieminen P. The statistical reporting quality of articles published in 2010 in five dental journals. Acta Odontol Scand. 2015;73:76–80.
20. Prescott RJ, Civil I. Lies, damn lies and statistics: errors and omission in papers submitted to INJURY 2010-2012. Injury. 2013;44:6–11.
21. Lukić IK, Marusić M. Appointment of statistical editor and quality of statistics in a small medical journal. Croat Med J. 2001;42:500–503.
22. Cobo E, Selva-O’Callaghan A, Ribera JM, Cardellach F, Dominguez R, Vilardell M. Statistical reviewers improve reporting in biomedical articles: a randomized trial. PLoS One. 2007;2:e332.
23. Pitkin RM, Scott JR, Burmeister LF. The numbers game: evaluation of statistics by Obstetrics & Gynecology. Obstet Gynecol. 2014;123:353–355.
24. Henly SJ, Bennett JA, Dougherty MC. Scientific and statistical reviews of manuscripts submitted to Nursing Research: comparison of completeness, quality, and usefulness. Nurs Outlook. 2010;58:188–199.
25. Henly SJ, Dougherty MC. Quality of manuscript reviews in nursing research. Nurs Outlook. 2009;57:18–26.
26. Ledolter J, Dexter F. Analysis of interventions influencing or reducing patient waiting while stratifying by surgical procedure. Anesth Analg. 2011;112:950–957.
27. Ledolter J, Dexter F, Epstein RH. Analysis of variance of communication latencies in anesthesia: comparing means of multiple log-normal distributions. Anesth Analg. 2011;113:888–896.
28. Dexter F, Epstein RH, Bayman EO, Ledolter J. Estimating surgical case durations and making comparisons among facilities: identifying facilities with lower anesthesia professional fees. Anesth Analg. 2013;116:1103–1115.
29. Dexter F, Ledolter J, Hindman BJ. Bernoulli Cumulative Sum (CUSUM) control charts for monitoring of anesthesiologists’ performance in supervising anesthesia residents and nurse anesthetists. Anesth Analg. 2014;119:679–685.
30. Saugel B, Grothe O, Wagner JY. Tracking changes in cardiac output: statistical considerations on the 4-quadrant plot and the polar plot methodology. Anesth Analg. 2015;121:514–524.
31. Dexter F, Ledolter J, Hindman BJ. Quantifying the diversity and similarity of surgical procedures among hospitals and anesthesia providers. Anesth Analg. 2016;122:251–263.
32. Cannesson M, Shafer SL. All boxes are black. Anesth Analg. 2016;122:309–317.
33. Glance LG, Hannan EL, Fleisher LA, et al. Feasibility of report cards for measuring anesthesiologist quality for cardiac surgery. Anesth Analg. 2016;122:1603–1613.
34. Glance LG, Dick AW. In response. Anesth Analg. 2016;122:1722–1727.
35. Shafer SL. Broken hearts. Anesth Analg. 2016;122:1231–1233.