
The Quality of Randomized Controlled Trials in Major Anesthesiology Journals

Greenfield, Mary Lou V. H. MPH, MS; Rosenberg, Andrew L. MD; O’Reilly, Michael MS, MD; Shanks, Amy M. MS; Sliwinski, Michelle J. MS; Nauss, Michael D. BS

doi: 10.1213/01.ANE.0000150612.71007.A3
Economics, Education, and Health Systems Research: Research Report

Increased attention has been directed at the quality of randomized controlled trials (RCTs) and how they are being reported. We examined leading anesthesiology journals to identify whether there were specific areas for improvement in the design and analysis of published clinical studies. All RCTs that appeared between January 2000 and December 2000 in leading anesthesiology journals (Anesthesiology, Anesthesia & Analgesia, Anaesthesia, and Canadian Journal of Anaesthesia) were retrieved by a MEDLINE search. We used a previously validated assessment tool, including 14 items associated with study quality, to determine a quality score for each article. The overall mean weighted quality score was 44% ± 16%. Overall average scores were relatively high for appropriate controls (77% ± 7%) and discussions of side effects (67% ± 6%). Scores were very low for randomization blinding (5% ± 2%), blinding observers to results (1% ± 1%), and post-beta estimates (16% ± 13%). Important pretreatment clinical predictors were absent in 32% of all studies. Significant improvement in the reporting and conduct of RCTs is required and should focus on randomization methodology, the blinding of investigators, and sample size estimates. Repeat assessments of the literature may improve the adoption of guidelines for the improvement of the quality of randomized controlled trials.

IMPLICATIONS: The quality of reporting of randomized controlled trials in general anesthesiology journals may be enhanced by clarification of study methods. Randomization methods and blinding of investigators who collect data from treatment strategies are among the areas requiring the most attention.

Department of Anesthesiology, University of Michigan, Ann Arbor

Presented, in part, at the annual meeting of the International Anesthesia Research Society, New Orleans, LA, March 22, 2003.

Accepted for publication October 29, 2004.

Address correspondence and reprint requests to Andrew L. Rosenberg, MD, The University of Michigan, Department of Anesthesiology, UH 1H247 Box 0048, 1500 East Medical Center Dr., Ann Arbor, MI 48109-0048. Address e-mail to

It is generally agreed that randomized controlled trials (RCTs) are the most valid method to compare treatments and make inferences regarding cause and effect (1). The results of an RCT may also directly and immediately affect patient care more than any other type of study design (1,2). For these goals to be achieved, the RCT should adhere to the highest standard of quality. Attention to the quality of RCTs suggests that many may be suboptimal (3–5). Investigators have devised various quality assessment tools to evaluate the reporting, design, and methodology of clinical trials (1,6–9). As a result, in the last 10 years, consolidated standards of reporting trials (CONSORT) to improve the quality of RCTs have been developed. The CONSORT initiative was developed by an international group of clinical trialists, statisticians, and medical journal editors and has been accepted by several editorial groups including the International Committee of Medical Journal Editors. CONSORT is a 22-item checklist and flow diagram intended for authors to adequately describe key study methodologies that are required for readers to evaluate the validity, limitations, and generalizability of clinical trials including details for the enrollment, interventions, allocation, follow-up, and statistical analysis of the clinical trial (2,10). Although RCTs are frequently reported in anesthesiology journals, there has been only one study evaluating the quality of these types of publications in the anesthesiology literature (11). These authors demonstrated only a modest improvement in some areas of RCTs. Because this study evaluated manuscripts from several years, only four papers from a journal for a given year were sampled. A more comprehensive review of a larger body of literature would enable a more detailed assessment of study methodologies and analysis among RCTs. Therefore, we conducted a thorough 1-year review of the elements of all RCTs published in leading anesthesiology journals. Our goal was to more clearly delineate specific areas for improvement in the conduct, implementation, analysis, and reporting of RCTs in the anesthesiology literature.

Methods

All RCTs published between January 2000 and December 2000 in four anesthesiology journals (Anesthesiology, Anesthesia & Analgesia, Anaesthesia, and Canadian Journal of Anesthesia) were retrieved with a MEDLINE search. This search was then limited to only those articles that met the following criteria: (a) English language, (b) human trial, and (c) randomized controlled trial. Three hundred thirty-nine abstracts met our search criteria. To confirm that our search methods captured all the RCTs published in these journals, we selected, by the random number tables method, one volume from each journal during 2000 and checked it for missed articles. No missed articles were found. A trained researcher (MDN) not involved in the quality assessment of articles independently evaluated abstracts from these 339 articles to verify that the papers met inclusion criteria. For example, abstracts that indicated that the articles were commentaries, communications, or systematic reviews were excluded. We also excluded 29 articles that were healthy volunteer studies and 8 pharmacokinetic studies. Of the 339 articles that seemed to meet our study inclusion criteria, an additional 60 articles had to be excluded because our review revealed that they were meta-analyses of RCTs, correspondence about RCTs, studies of equipment or cadavers, or were not RCTs. Our final study sample consisted of 279 manuscripts (Fig. 1).

Figure 1

These 279 articles were photocopied, and all identifiers were removed from all pages by three investigators (MDN, AS, and MJS) who were not involved in further evaluation; identifiers included names and affiliations of authors, journal name, corresponding authors, or any other unique identifiers. Articles were presented to other reviewers (MLG and AR) who were blinded or masked to all unique identifiers. Articles were offered in a random order using a computer-generated randomization scheme. Both reviewers have had formal training in research design, epidemiology, and biostatistics.

We used a modified version of Chalmers quality assessment tool (9,12) to evaluate each article. This tool in its modified or original form has been used extensively to evaluate RCTs (13–16). It uses a scale to evaluate eight domains associated with the study protocol and six domains related to data analysis (Table 1). The weighted scores for each domain have precise requirements for what must be recorded to achieve a certain score. Each of these domains was evaluated, and a numeric value for each was assigned depending on the quality. The scores were then used to generate a quality score for each domain. Percentages (total score divided by total possible score) were assigned because there were some items not applicable to the study under review. Thus, the scores were proportions, with the lowest possible score being 0% and the highest possible score being 100%. This method of scoring has been validated in numerous studies of clinical research (17–21).
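The percentage scoring described above can be sketched as follows. This is only an illustration of the arithmetic, not the Chalmers instrument itself: the item names and point values are hypothetical, and using `None` to mark an item as not applicable is an assumption made for the example.

```python
from typing import Optional

def quality_score(items: dict[str, tuple[Optional[float], float]]) -> float:
    """Return a quality score as a percentage (0 to 100).

    `items` maps an item name to (points awarded, points possible).
    Points awarded is None when the item does not apply to the study;
    such items are dropped from both the numerator and the denominator.
    """
    applicable = [(got, possible) for got, possible in items.values() if got is not None]
    total_possible = sum(possible for _, possible in applicable)
    if total_possible == 0:
        raise ValueError("no applicable items to score")
    total_awarded = sum(got for got, _ in applicable)
    return 100.0 * total_awarded / total_possible

# Hypothetical article: item names and weights are illustrative only.
example = {
    "randomization_blinding": (0, 10),
    "patient_blinding": (None, 8),  # not applicable, e.g. blinding impossible
    "control_regimen": (3, 3),
    "sample_size_estimate": (3, 3),
    "statistical_analysis": (2, 4),
}
score = quality_score(example)
```

Dropping non-applicable items from the denominator is what keeps a study from being penalized for a domain it could not have satisfied.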

Table 1

We derived an overall quality score for each article. The two reviewers reached a consensus on those items on which they disagreed (22). Analysis of variance with Duncan correction for multiple comparisons was used to test overall quality score differences among the four journals; Pearson’s χ² test or Fisher’s exact test was used, as appropriate, for comparisons of individual score assignment for each quality item. Statistical analyses were conducted using the Statistical Analysis System (SAS 8.0; SAS Institute, Cary, NC).
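For readers unfamiliar with the test, the F statistic of a one-way analysis of variance can be sketched in a few lines. This is a generic illustration rather than the SAS procedure used in this study: the scores below are hypothetical, and neither the Duncan correction nor the P value (which requires the F distribution) is computed.

```python
from statistics import mean

def one_way_anova_f(groups: list[list[float]]) -> tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical quality scores (%) for three journals; not data from this study.
f_stat, df_b, df_w = one_way_anova_f([[40, 45, 50], [45, 50, 55], [50, 55, 60]])
```

A large F indicates that the journal means differ by more than the spread of scores within journals would suggest.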

Results

The mean weighted quality scores (±sd) for the four individual journals, in random order, were 47% ± 17%, 38% ± 15%, 46% ± 15%, and 43% ± 15% (Fig. 2). The mean quality score for all studies combined was 44% (±16%). The percentage of studies within each journal with the highest scores for each quality item is presented in Table 1. Analysis of variance with Duncan’s correction showed a statistically significant difference in overall quality scores among the four journals (F = 3.79; df = 3, 276; P = 0.01). These differences were in the proportions of studies that adequately described the methodology for blinding patients and the effect of randomization on pretreatment variables (Table 1).

Figure 2

In 20% (41 of 201) of the papers in which a control population was feasible, the control group had either a different control appearance or regimen (i.e., the experimental group and the control group did not have the identical appearance for their course of therapy or treatment) or the control regimen was unstated and could not be determined from the context of the article. In the 279 studies reviewed, the final sample size differed from the number enrolled in 10% of the studies. This was either because of significant withdrawals within studies or was not explained at all. Pretreatment variable distributions, such as demographics, and important clinical predictors, such as comorbidities or previous opioid use in pain patients, were not present in any form in 32% (86 of 271) of the study articles. Side effects were not addressed in 11% of the studies, and an additional 21% listed side effects but did not discuss the impact on study findings.

Thirty-five percent of the studies did not adequately report appropriate methodology for how the process of randomization was blinded to the study investigators. For example, it could not be determined from these articles if an appropriate method of randomization (e.g., computer-generated or random numbers tables) was used or how the randomization was blinded (e.g., opaque envelopes). Twenty-two percent of studies did not report appropriate blinding of patients. In some articles, authors reported that studies were “single-blinded” but did not include a description of the blinding methodology. Other articles reported the blinding methodology, but there was evidence through patient side effects that the patients might be able to discern their treatment assignment. Thirty-seven percent of studies did not describe how observers who might influence the outcome reporting (such as a nurse in the postanesthesia care unit recording postoperative pain scores) were blinded to patient study enrollment or treatment assignment. Also, 98% of articles presented no details of efforts to blind the observers to continuing study results.

Three of 279 studies (1%) included neither P values nor test statistics, whereas 55% included either the P value or the test statistic, but not both. Forty-one percent of studies received a good rating for statistical analysis, and an excellent rating was assigned to 11% of studies. Standard deviations or standard errors were presented in 83% of the studies; confidence intervals were presented properly in only 11%. Forty-eight percent of the studies evaluated did not conduct an analysis of the numbers of patients required to detect differences proposed to be important by the authors; 52% (145 of 279) conducted sample size estimates. Fifty-three percent (149 of 279) of studies had negative results. Of these 149 negative studies, 107 (72%) contained no explanations of how a Type II error may have accounted for a lack of statistical significance in key outcomes. Eighteen of 149 studies (12%) alluded to the problem or admitted to the necessity for more patients, whereas 24 of 149 (16%) estimated the statistical possibility of a Type II error after the fact (a post hoc power analysis) or commented about the confidence interval around the differences.
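The breakdown of the negative studies can be checked with simple arithmetic. The counts below are taken from the paragraph above; the category labels are shortened paraphrases introduced for this sketch.

```python
negative_total = 149  # studies with negative results

# Counts reported in the text; labels are paraphrased for brevity.
breakdown = {
    "no discussion of Type II error": 107,
    "alluded to the problem or to the need for more patients": 18,
    "post hoc power estimate or confidence interval comment": 24,
}

# The three categories should account for every negative study.
categories_sum = sum(breakdown.values())

# Percentage of negative studies in each category, rounded to whole numbers.
percentages = {label: round(100 * n / negative_total) for label, n in breakdown.items()}

# Studies that addressed a possible Type II error in any way.
addressed = round(100 * (18 + 24) / negative_total)
```

The categories sum to 149, and the two groups that acknowledged the problem together make up the share of negative studies that addressed a possible Type II error at all.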

Discussion

There is significant room for improvement in the quality of RCT reporting in leading anesthesiology journals. In the four journals whose articles were evaluated, the overall quality score for methodology reporting ranged from 38% to 47%. These scores are neither better nor worse than the scores of other medical disciplines reporting clinical research. Similar studies in which the Chalmers method has been used have found mean scores of 40%–50% (23–26). A score of 80% is considered to be reasonably good using this technique (9). Our results were consistent with the previous work of Pua et al. (11) in demonstrating that the domains related to study protocol and data analysis were fairly similarly reported among the four journals. However, pretreatment clinical variables that may influence outcome have not been explicitly evaluated in anesthesiology RCTs, and patients blinded to treatment outcomes may have been evaluated differently by our study than by others. Not only did we require a comment or description about how patients were blinded and how that was achieved, we also evaluated whether patients might be able to determine which treatment they received by side effects. Furthermore, by evaluating the entire population of RCTs for a given year, we were able to more accurately measure the proportion of studies reporting in a given domain than other studies that sample publications. Finally, the differences among journals may reflect differences in editorial requirements among the different journals.

In this study, it was important to use a quality assessment tool that had been validated for RCTs and that had been more broadly used for evaluating clinical trials in a variety of medical and other scientific disciplines. The modified version of Chalmers quality assessment tool (9,12) was selected to evaluate RCTs in the major anesthesiology journals because it has been extensively used to evaluate the quality of articles in clinical journals as well as other scientific publications. We chose the modified Chalmers tool in our study because of its attention to the details of protocol design and data analysis. Would another tool have produced similar results? Does the subjectivity that accompanies the implementation of any evaluative tool introduce bias into our assessment? Work by Detsky et al. (7) has demonstrated that, in a comparison of 18 RCTs using different tools, the overall quality assessment of articles did not change significantly from tool to tool.

Our study suggests that significant deficiencies in the quality of RCT reporting are especially prevalent for the way randomization and blinding techniques are performed or reported. Our results suggest that these two domains may also have significant review or editorial bias, as indicated by the significant differences in the proportion of studies among the four journals adequately reporting these methodologies. Randomization and blinding are particularly important because these techniques are the basis for reducing bias and are the hallmarks of the RCT. Proper randomization requires that patients have an equal chance of being assigned to either the treatment or the control groups and that the method used to assign them is free of bias. Any method in which the investigator can determine or influence the group to which the next study patient will be assigned is to be avoided. For example, randomization assignments should be determined by an individual not involved in the actual treatment; random numbers tables and computer-generated randomization assignments are less amenable to manipulation than the toss of a die or the drawing of cards from a shuffled deck. It is also important that the assignment codes, even when generated properly, are concealed from the investigator (i.e., randomization concealment). For example, the study assignments should be in opaque envelopes or communicated by telephone from someone not involved in study implementation so that the investigator cannot determine which treatment is next in line.
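A computer-generated assignment list of the kind described above might be produced as in the following sketch. Permuted-block randomization is one common approach and is an assumption here; the text specifies only that the list be computer-generated or drawn from random number tables. The function name, block size, and seed are hypothetical.

```python
import random

def permuted_block_schedule(n_patients, block_size=4,
                            arms=("treatment", "control"), seed=None):
    """Return a list of arm assignments built from shuffled balanced blocks.

    Each block holds every arm equally often, so group sizes stay close
    throughout enrollment while the order within a block is random.
    """
    if block_size % len(arms) != 0:
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)  # independent generator, reproducible if seeded
    schedule = []
    while len(schedule) < n_patients:
        block = list(arms) * (block_size // len(arms))
        rng.shuffle(block)
        schedule.extend(block)
    return schedule[:n_patients]

# Illustrative run; the fixed seed is only for a reproducible example.
schedule = permuted_block_schedule(20, block_size=4, seed=2000)
```

Generating the list is only half of the requirement. For concealment, the list would be produced and held by someone not involved in enrollment or treatment, and each assignment revealed only after a patient has been entered into the study, for instance through sequentially numbered opaque envelopes.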

Most investigators understand the importance of randomization, but we found that the details for the methods used were frequently inadequate. For example, authors reported that patients were assigned by “random envelope method” or by “choosing colored balls from an opaque bag” as well as by “a number of randomly allocated cards,” by the “shuffling of sequentially numbered envelopes,” or by a “systematic random sample technique.” Not only are these vague, but they are also not truly unbiased because they allow a variety of factors to interfere with the treatment assignments. Moreover, we found few trials that adequately reported randomization concealment despite its fundamental importance to the validity of the RCT. Of the 279 articles we evaluated, <5% reported both the method of randomization and that the randomization results were blinded to the investigators. Our figure is even lower than that reported by Pua et al. (11), probably because our study also evaluated randomization blinding.

It is often not possible or ethical to blind patients or investigators in certain studies, and the Chalmers assessment tool does not penalize studies in which it is impossible to blind patients or investigators. In our review, in those studies in which blinding was possible, investigators often reported that a study was single-blind or double-blind, but rarely described the actual methods used to blind those involved. Other times, investigators described little else than that the observer was independent. Ninety-eight percent of the studies presented did not mention a safety monitoring committee and gave no details of efforts to blind any observers to continuing study results.

This study also found significant flaws in the analytic methods described in anesthesiology RCTs. Nearly half of the journal articles (48%) did not state whether an estimate of sample size was made before the study began. Reporting appropriate sample size estimates to avoid Type II errors (i.e., establishing a priori the power of the study design to detect a statistically significant difference between the outcomes in the study groups if there is such a difference) is not only methodologically required to improve the validity of a study; it has ethical implications as well. Many clinicians would consider it unreasonable to expose patients to potentially harmful treatments, or side effects from treatments, without knowing before the study begins how many patients are required. Clinical research is currently undergoing more scrutiny by institutional review boards to justify the reasons for a research project and to provide evidence that the least risky protocol is being applied to the fewest required subjects. Of the 149 clinical trials that reported negative results, only 28% addressed the possibility that such findings may have been caused by a small sample size or a Type II error. Researchers and readers alike often interpret the absence of statistical significance to mean that there is no relationship between treatment and outcome (27). In our review, we found that many authors concluded that because there was no difference in a study outcome, the treatments were equivalent or equally effective or that the new treatment was an acceptable alternative to the standard treatment. No proof of a difference is not equivalent to proof of no difference (28).
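An a priori sample size estimate of the kind discussed above can be sketched for the simplest case. The example assumes a two-sided comparison of two equal-sized groups on a continuous outcome with a common standard deviation and uses the normal approximation; the function names and the numbers (a 10-point difference, a standard deviation of 20) are hypothetical and do not come from any study reviewed here.

```python
from math import ceil, sqrt
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Patients needed per group to detect a mean difference `delta`."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for a two-sided test
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

def approx_power(n, delta, sd, alpha=0.05):
    """Approximate power with n patients per group (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    standard_error = sd * sqrt(2 / n)
    return z.cdf(delta / standard_error - z_alpha)

# Hypothetical design: detect a 10-point difference when the SD is 20.
required = n_per_group(delta=10, sd=20)
power_at_required = approx_power(required, delta=10, sd=20)
power_at_20 = approx_power(20, delta=10, sd=20)
```

Under these assumptions about 63 patients per group are needed for 80% power, whereas a trial with 20 patients per group would have well under a 50% chance of detecting the same difference. A negative result from the smaller trial would therefore say little about whether the treatments differ.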

There are several limitations to our approach of measuring the quality of reporting among these RCTs. First, the fact that an item was not reported does not necessarily mean that it was not performed in the actual trial. This concern has been expressed by others regarding the quality of reporting versus the quality of the design and conduct of RCTs (29–32). However, it seems to be generally accepted that the methodological rigor of the study is reflected in its reporting and that incomplete reporting frequently represents poor quality studies (33). Second, because these journal articles were published several years ago, it is possible that they may not reflect current reporting practices. Journal quality may have improved, as evidenced by the work of Moher et al. (34) comparing studies before and after suggestions for improving the quality of RCTs. Studies by Pua et al. (11) evaluating the quality of RCTs from periods between 1980 and 2000, as well as the accompanying editorial by Todd (35) urging improvement in the conduct and reporting of RCTs, also indicate that some attention is being applied to improving the quality of reporting of RCT articles. This study does not evaluate more recent articles to determine whether quality has improved. However, it is important to note that only two of the major general anesthesiology journals reviewed in this article have adopted CONSORT guidelines in their instructions to authors. Finally, there is no clear “gold standard” for evaluating controlled research reporting or quality, as evidenced by the plethora of tools available for use (36), and this study represents the use of only one such tool. Although evidence suggests that validated quality tools generally agree with each other (7), it is possible that the use of another assessment tool would have resulted in different findings.

In conclusion, our data indicate that investigators, reviewers, editorial boards, and readers alike should recognize that the quality of reporting of RCTs in the anesthesiology literature has scope for improvement. Because the results of RCTs are considered the gold standard for implementing changes in clinical practice, our findings suggest that significant improvement in the quality of the most important published articles may be achieved by a more rigorous application of the CONSORT guidelines. It would be instructive for future studies to compare, over time, journals that have adopted CONSORT guidelines with those that have not. Efforts to improve the reporting and the conduct of RCTs should focus on randomization methodology, the blinding of patients, investigators, and observers, and sample size estimation and power analysis.

The authors thank Ann E. Nadeau for her assistance in the preparation of this manuscript.

References

1. Altman DG. Better reporting of randomised controlled trials: the CONSORT statement. BMJ 1996;313:570–1.
2. Begg C, Cho M, Eastwood S, et al. Improving the quality of reporting of randomized controlled trials: the CONSORT statement. JAMA 1996;276:637–9.
3. Mosteller F, Gilbert JP, McPeek B. Reporting standards and research strategies for controlled trials. Control Clin Trials 1980;1:37–58.
4. Freiman JA, Chalmers TC, Smith H, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial: survey of 71 “negative” trials. N Engl J Med 1978;299:690–4.
5. Moher D, Dulberg CS, Wells GA. Statistical power, sample size, and their reporting in randomized controlled trials. JAMA 1994;272:122–4.
6. DerSimonian R, Charette LJ, McPeek B, Mosteller F. Reporting on methods in clinical trials. N Engl J Med 1982;306:1332–7.
7. Detsky AS, Naylor CD, O’Rourke K, et al. Incorporating variations in the quality of individual randomized trials into meta-analysis. J Clin Epidemiol 1992;45:255–65.
8. Jadad AR, Moore RA, Carroll D, et al. Assessing the quality of reports of randomized clinical trials: is blinding necessary? Control Clin Trials 1996;17:1–12.
9. Chalmers TC, Smith H Jr, Blackburn B, et al. A method for assessing the quality of a randomized control trial. Control Clin Trials 1981;2:31–49.
10. Altman DG, Schulz KF, Moher D, et al. The revised CONSORT statement for reporting randomized trials: explanation and elaboration. Ann Intern Med 2001;134:663–94.
11. Pua HL, Lerman J, Crawford MW, Wright JG. An evaluation of the quality of clinical trials in anesthesia. Anesthesiology 2001;95:1068–73.
12. Rochon PA, Gurwitz JH, Cheung CM, et al. Evaluating the quality of articles published in journal supplements compared with the quality of those published in the parent journal. JAMA 1994;272:108–13.
13. Solomon MJ, Laxamana A, Devore L, McLeod RS. Randomized controlled trials in surgery. Surgery 1994;115:707–12.
14. Haines SJ. Randomized clinical trials in neurosurgery. Neurosurgery 1983;12:259–64.
15. Moss RL, Henry MC, Dimmitt RA, et al. The role of prospective randomized clinical trials in pediatric surgery: state of the art? J Pediatr Surg 2001;36:1182–6.
16. Pater JL, Weir L. Reporting the results of randomized trials of empiric antibiotics in febrile neutropenic patients: a critical survey. J Clin Oncol 1986;4:346–52.
17. Lam W, Sze PC, Sacks HS, Chalmers TC. Meta-analysis of randomised controlled trials of nicotine chewing-gum. Lancet 1987;2:27–30.
18. Glanz M, Klawansky S, Stason W, et al. Biofeedback therapy in poststroke rehabilitation: a meta-analysis of the randomized controlled trials. Arch Phys Med Rehabil 1995;76:508–15.
19. Sonis J, Joines J. The quality of clinical trials published in The Journal of Family Practice, 1974–1991. J Fam Pract 1994;39:225–35.
20. Hwang P, Auclair B, Beechinor D, et al. Efficacy of pentosan polysulfate in the treatment of interstitial cystitis: a meta-analysis. Urology 1997;50:39–43.
21. Paesmans M. Statistical considerations in clinical trials testing empiric antibiotic regimens in patients with febrile neutropenia. Support Care Cancer 1998;6:438–43.
22. Fink A, Kosecoff J, Chassin M, Brook RH. Consensus methods: characteristics and guidelines for use. Am J Public Health 1984;74:979–83.
23. Forssell H, Kalso E, Koskela P, et al. Occlusal treatments in temporomandibular disorders: a qualitative systematic review of randomized controlled trials. Pain 1999;83:549–60.
24. Sculier JP, Berghmans T, Castaigne C, et al. Maintenance chemotherapy for small cell lung cancer: a critical review of the literature. Lung Cancer 1998;19:141–51.
25. Bernard-Bonnin AC, Stachenko S, Bonin D, et al. Self-management teaching programs and morbidity of pediatric asthma: a meta-analysis. J Allergy Clin Immunol 1995;95:34–41.
26. Fabbri A, Magrini N, Bianchi G, et al. Overview of randomized clinical trials of oral branched-chain amino acid treatment in chronic hepatic encephalopathy. JPEN J Parenter Enteral Nutr 1996;20:159–64.
27. Borenstein M. The case for confidence intervals in controlled clinical trials. Control Clin Trials 1994;15:411–28.
28. Gallagher EJ. No proof of a difference is not equivalent to proof of no difference. J Emerg Med 1994;12:525–7.
29. Stewart LA, Parmar MK. Bias in the analysis and reporting of randomized controlled trials. Int J Technol Assess Health Care 1996;12:264–75.
30. Moher D, Pham B, Jones A, et al. Does quality of reports of randomised trials affect estimates of intervention efficacy reported in meta-analyses? Lancet 1998;352:609–13.
31. Huwiler-Muntener K, Juni P, Junker C, Egger M. Quality of reporting of randomized trials as a measure of methodologic quality. JAMA 2002;287:2801–4.
32. Esposito M, Coulthard P, Worthington HV, Jokstad A. Quality assessment of randomized controlled trials of oral implants. Int J Oral Maxillofac Implants 2001;16:783–92.
33. Altman DG, Dore CJ. Randomisation and baseline comparisons in clinical trials. Lancet 1990;335:149–53.
34. Moher D, Schulz KF, Altman D. The CONSORT statement: revised recommendations for improving the quality of reports of parallel-group randomized trials. JAMA 2001;285:1987–91.
35. Todd MM. Clinical research manuscripts in Anesthesiology. Anesthesiology 2001;95:1051–3.
36. Moher D, Jadad AR, Nichol G, et al. Assessing the quality of randomized controlled trials: an annotated bibliography of scales and checklists. Control Clin Trials 1995;16:62–73.
© 2005 International Anesthesia Research Society