Not all research studies are created equal. Decisions made during the conceptualization, design, conduct, and analysis of a research study can affect the quality of the study and the trustworthiness of its results. Identifying study limitations is of primary importance in ensuring that practice decisions incorporate the highest quality evidence so that the best outcomes are achieved for patients. In systematic reviews, guidelines, and health technology assessments, it is common to perform a critical assessment of potential limitations of a study. This assessment goes by several names in the literature, including “risk of bias,” “critical appraisal,” “study validity,” “methodological quality,” and “methodological limitations.”1–7 These terms are sometimes used interchangeably but may be conceptualized differently. The terminology used to describe the assessment of study methodology has a long history grounded in medical education and practice, and reflects the different approaches and concepts surrounding the appropriate assessment of primary studies for use in synthesized research.
In this paper, we discuss the evolution of this process and the nomenclature used, aiming to bring some clarity to the discussion. Recent advances in the methodological literature and development of new tools have led to a change in direction at JBI and prompted the JBI Effectiveness Methodology Group to consider the way forward for critical appraisal of JBI systematic reviews of effectiveness.
Guidelines for the critical appraisal of primary research emerged in the 1980s with the rise of evidence-based medicine (EBM).8–11 EBM is an approach to medical education and practice that is achieved through consultation of scientific research alongside clinical judgment and the consideration of patient values to assist clinical decision-making in a particular context.8,9 One of the steps embedded within the EBM approach is the critical appraisal of research studies following a comprehensive search of the literature and analysis of the relevance of retrieved studies.12 Critical appraisal was proposed by Sackett et al.11 as a way to systematically evaluate the medical literature for its trustworthiness, value, and relevance within a particular context.13,14 Accordingly, the defining characteristics of a critical appraisal are whether i) the study addresses a clearly focused question; ii) the study used valid methods to address the research question; iii) the valid results of the study are important; and iv) these valid, important results are applicable to the target patient or population.15 These criteria are the basis for the development of several structured approaches to critical appraisal supported by use of critical appraisal tools.
The Critical Appraisal Skills Program (CASP) and the Centre for Evidence Based Medicine (CEBM) were established in the 1990s in response to the need to integrate science into clinical education and practice.16,17 These collaborations paved the way for the development of educational programs addressing the need for EBM, in which critical appraisal methods form a central part. The CASP and CEBM critical appraisal tools systematically assess the quality and utility of different research designs based on the critical appraisal criteria developed by Sackett et al.11,12 Aligning studies with these criteria has the potential to influence study results and the way conclusions of a study are interpreted by research consumers.18 Therefore, the use of critical appraisal tools is crucial to ascertain whether the results of the study can be believed, interpreted, and transferred appropriately into other contexts, such as policy, further research, education, and clinical practice.18 There are now more than 100 critical appraisal tools available for use that contain a variety of items to evaluate study methodology; however, there is no consensus regarding the best tool for any particular research context.18 JBI has also developed a suite of critical appraisal tools for each study design, with targeted questions addressing the methodological characteristics of a study.3
While critical appraisal includes a process to assess the internal validity of a study (criterion ii),15 the original conception of critical appraisal and tools by groups such as CASP, CEBM, and JBI sometimes goes beyond internal validity into assessment of external validity, statistical conclusion validity, generalizability, and even reporting quality (see Appendix I for a glossary of terms). As such, we can view the term “critical appraisal” as a broad term that may cross into fields where other tools or processes exist, such as assessment of the certainty of review findings (eg, using the Grading of Recommendations, Assessment, Development and Evaluation [GRADE] assessment)19 or compliance with reporting standards.20 Also important is that critical appraisal questions focusing on the applicability, transferability, or statistical conclusion validity of a particular result from an individual study may not be as relevant in a systematic review process where these judgments are made on a synthesized result. As such, many current critical appraisal tools may be better used to create critically appraised topics, in journal clubs, or as an educational tool.
Perhaps an even broader definition of the assessment of study methodology is encapsulated by the term “quality” when used in clinical education and practice. Quality can be characterized as the assessment of study validity together with other indicators of the overall standard of the published research; it tends to answer the question of whether the authors of a published research article did the best they could with the methodologies they employed. However, recent reflections on critical appraisal and methodological quality have found that both have a broader scope than is required for the assessment of study validity in evidence synthesis.5 This has led to different ideas among research consumers and methodologists regarding how quality should be conceptualized when developing tools and evaluating primary studies for the specific purpose of synthesizing research.
Some methodological quality assessment tools have historically gone beyond the remit of methods and items related to the protection of a study’s internal validity to include external validity; other methodological components, such as sample size and power; as well as reporting quality and ethical considerations.5,21 The inclusion of these various constructs in a tool depends largely on the source of the guidance in creating such an assessment (eg, the research group and the tools’ intended use), researcher knowledge and expertise in developing this assessment, and the theoretical framework underlying the concept of methodological quality.22 For example, unrelated items regarding sample size (random error) may “slip in” to an assessment for internal validity due to a lack of researcher expertise. This lack of clarity has been problematic for the assessment of primary research, particularly for evidence synthesis, because what is required for evidence synthesis is to summarize a body of research and understand how reliable the results of the included studies are with respect to the pooled estimate of effect.5 Therefore, in this context, measurement of the potential for systematic error in the results of the included studies is needed.5,23 When used in evidence synthesis, methodological quality assessment tools should solely measure the implementation of internal validity safeguards that protect against systematic error (bias) during the design, conduct, and analysis of the study.23 However, despite this understanding, the inclusion of unrelated constructs in methodological quality assessment tools has persisted. Importantly, we should distinguish between the concepts of “study quality” (which can include additional considerations to internal validity) and “methodological quality” (which is focused only on the internal validity of a study).
Risk of bias
To overcome this problem, in 2008, the Cochrane Collaboration advocated for broad use of the term “risk of bias” in place of “methodological quality” in order to emphasize the link to systematic error and overcome the ambiguity between quality of the research and other constructs, particularly reporting quality, random error, and external validity.22,24 Similar recommendations have been made in environmental evidence synthesis research by the Collaboration for Environmental Evidence,25 and in preclinical research synthesis by the Systematic Review Centre for Laboratory Animal Research (SYRCLE),26 which has resulted in the development of risk of bias frameworks and principles for use in these fields. However, risk of bias assessment tools may still mix in the same constructs unrelated to bias that have muddled methodological quality assessment, with some risk of bias tools still directing the user towards reporting items.27,28 Moreover, as a result of the introduction of the term “risk of bias,” the term “methodological quality” has been used ambiguously to include constructs unrelated to internal validity and now often entails a broader definition, even when used in evidence synthesis. While having 2 terms may help to distinguish between these assessments, it also implies that methodological quality is a separate and distinct construct from risk of bias, which may be inaccurate. Methodological quality, when defined only in terms of internal validity of a study, can be (and sometimes is) used interchangeably with risk of bias, particularly when the purpose is to assess primary studies included in synthesized research. Perhaps one notable distinction is that risk of bias tools usually contain rules for making absolute value judgments regarding the risk of bias.2
A recent article by Furuya-Kanamori et al.2 attempts to redefine methodological quality and risk of bias based on the intent of the assessment. In this scheme, a methodological quality assessment tool contains a list of safeguards that can be counted, and it becomes a risk of bias assessment tool when the methodological quality assessment tool is updated with rules that enable safeguard (item) counts to be converted into a qualitative judgment (eg, high or low risk of bias judgments).2 Tools are then labeled as a methodological quality assessment tool or a risk of bias assessment tool, depending on the intent. This paper distinguishes the 2 terms for researchers; however, this terminology is yet to be adopted by the research community.
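The distinction drawn in this scheme can be illustrated with a minimal sketch. The safeguard names, the 0.75 cut-off, and the function names below are hypothetical and chosen only for illustration; they are not taken from Furuya-Kanamori et al. or from any JBI or Cochrane tool. The first function stops at a count of safeguards (a methodological quality assessment); the second adds a rule that converts the count into a qualitative judgment (a risk of bias assessment).

```python
from typing import Dict

# Hypothetical checklist for one study: True means the safeguard was implemented.
# Item names are illustrative only and do not come from any published tool.
study_safeguards: Dict[str, bool] = {
    "random_sequence_generation": True,
    "allocation_concealment": True,
    "blinding_of_outcome_assessors": False,
    "complete_outcome_data": True,
}


def quality_count(safeguards: Dict[str, bool]) -> int:
    """Methodological quality assessment: count the implemented safeguards."""
    return sum(safeguards.values())


def risk_of_bias_judgment(safeguards: Dict[str, bool],
                          low_threshold: float = 0.75) -> str:
    """Risk of bias assessment: apply a rule that turns the count into a
    qualitative judgment. The default cut-off is an assumption made for
    this example, not a published rule."""
    if not safeguards:
        raise ValueError("at least one safeguard item is required")
    proportion = quality_count(safeguards) / len(safeguards)
    return "low" if proportion >= low_threshold else "high"


count = quality_count(study_safeguards)
judgment = risk_of_bias_judgment(study_safeguards)
print(f"{count} of {len(study_safeguards)} safeguards implemented; "
      f"risk of bias: {judgment}")
```

Both functions read the same underlying items; only the presence of a decision rule separates them, which is the point of the intent-based labeling.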
Methodological quality based on a quality count (ie, safeguards enumerated to produce a quality score) can be interpreted to be the probability that a study is credible.29 Significant attempts have been made to overcome the errors previously made in understanding methodological quality assessments, either through the intent of the assessment tool or by shifting the emphasis to the need for a strong link between the items included in a tool and bias through use of the term “risk of bias.”5 However, when methodological quality and risk of bias involve assessments through the same underlying safeguards, the intent is to describe the assessment of internal validity of a study as it relates to systematic error through the implementation or non-implementation of methodological safeguards in a study.
Different research groups have different interpretations of the term “methodological quality” and researchers may also associate this term with “quality of the evidence,” which was the historic term used by the GRADE Working Group that considered methodological limitations and other quality concepts, such as imprecision, publication bias, and indirectness.5 Quality is a diffuse concept and means different things to different groups. For example, when using an approach to assessing quality that does not focus on internal validity items alone, a study may be considered of “high quality” (for its particular objective and design) but still be at more than a low risk of bias. Due to these issues, GRADE now uses the term “certainty” of the evidence as opposed to “quality,” with risk of bias being just one of the contributing determinants of certainty.
Recommendation for JBI to adopt the term “risk of bias”
The current JBI critical appraisal tools for effectiveness reviews are checklists for each study design (herein focusing on analytical study designs), with the majority of questions addressing the methodological safeguards implemented in the primary study. The JBI Effectiveness Methodology Group has made the decision to move to the term “risk of bias” to describe the assessment of methodological safeguards in a study to mitigate biased results (for quantitative study designs). In formalizing this recommendation regarding the way the term “risk of bias” should and should not be used in relation to JBI reviews of effectiveness, bias is considered to be systematic error in the results of quantitative studies. The term “risk of bias” refers to the potential for bias, and not measurement of the actual bias in a study. Further, risk of bias does not include the assessment of other quality or critical appraisal constructs, such as reporting quality, random error, external validity, or ethical considerations.
This decision was made to align the JBI Collaboration with other research groups, such as Cochrane, to avoid some of the misunderstandings surrounding the terms “quality” and “critical appraisal.” Therefore, the way forward for JBI is to adopt the term “risk of bias” to reduce confusion for reviewers by using similar terms to collaborators and to unify the nomenclature in this field. As mentioned previously, risk of bias is one of the grading domains in a GRADE assessment where authors can lower their certainty in the body of evidence by 1 or 2 grades. Adopting the term “risk of bias” instead of “critical appraisal” and shifting to internal validity assessment will align JBI reviewers more closely with the GRADE approach. Importantly, this terminology only applies to JBI’s analytical quantitative tools, as different terminology is more appropriate when considering qualitative or other forms of evidence. A common language is necessary to ensure a universally understood assessment on which recommendations for clinical practice are made.
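As a small illustration of how a risk of bias judgment feeds into GRADE, the sketch below lowers a certainty rating by 1 or 2 levels. The four level names follow GRADE's published certainty ratings, which are not listed in this paper; the function name and the choice to stop at the lowest level are assumptions made for the example.

```python
# GRADE certainty ratings, ordered from lowest to highest.
GRADE_LEVELS = ["very low", "low", "moderate", "high"]


def downgrade_for_risk_of_bias(certainty: str, levels: int) -> str:
    """Lower a certainty rating by 0, 1, or 2 levels for risk of bias.

    The rating cannot fall below "very low"; that floor is how this
    sketch handles the lower bound, not a statement of GRADE guidance.
    """
    if levels not in (0, 1, 2):
        raise ValueError("risk of bias can lower certainty by 0, 1, or 2 levels")
    index = GRADE_LEVELS.index(certainty)
    return GRADE_LEVELS[max(index - levels, 0)]


print(downgrade_for_risk_of_bias("high", 1))
print(downgrade_for_risk_of_bias("high", 2))
```

Risk of bias is only one of the domains considered, so in a full assessment the result of this step would be combined with judgments on the other domains.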
Funding
MK is supported by the INTER-EXCELLENCE grant number LTC20031—Towards an International Network for Evidence-based Research in Clinical Health Research in the Czech Republic. ZM is supported by an NHMRC Investigator Grant, APP1195676.
Author contributions
JCS for conception and design of the work. All authors critically revised the manuscript for intellectual content and approved the final version.
Appendix I: Glossary of terms
||The process of systematically assessing the outcome of scientific research to judge its trustworthiness, value, and relevance in a particular context.
||Applicability of the findings to a given population.
||The degree to which the results of a study can be applied to a broader population or situation.
||The GRADE approach to rating imprecision focuses on the 95% CI around the best estimate of the absolute effect.
||The GRADE approach to rating indirectness focuses on concerns about how the population, intervention, or outcomes differ from those of interest.
||A measure of how well a study is conducted and how accurately its results reflect the studied group.
||The extent to which there is potential for errors and bias in the design and execution of a study.
||The probability of finding a statistically significant result.
||The likelihood studies have not been published based on the outcome of the research study.
||An error in measurement caused by factors that vary from one measurement to another.
||The extent to which a complete and transparent description of the design, conduct, and analysis of a study is given.
|Risk of bias||The likelihood that features of the study design or conduct of the study will give misleading results.
||The number of participants or observations included in a study.
|Statistical conclusion validity||The extent to which the conclusions of research are founded on adequate analysis of the data.
||Errors that affect the accuracy of a measurement or cause readings to differ from the true value by a consistent amount each time a measurement is made.
References
1. Hartling L, Ospina M, Liang Y, Dryden DM, Hooton N, Seida JK, et al. Risk of bias versus quality assessment of randomised controlled trials: cross sectional study. BMJ 2009;339:b4012.
2. Furuya-Kanamori L, Xu C, Hasan S, Sa D. Quality versus risk of bias assessment in clinical research. J Clin Epidemiol 2021;129:172–5.
3. Aromataris E, Munn Z. JBI Manual for Evidence Synthesis [internet]. Adelaide: JBI; 2020 [cited 2022 Apr 10]. Available from: https://synthesismanual.jbi.global
4. Munkholm K, Paludan-Müller AS, Boesen K. Considering the methodological limitations in the evidence base of antidepressants for depression: a reanalysis of a network meta-analysis. BMJ Open 2019;9:e024886.
5. Higgins JPT, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA, editors. Cochrane handbook for systematic reviews of interventions version 6.2 [internet]. Cochrane; 2021 [cited 2022 Apr 10]. Available from: https://training.cochrane.org/handbook
6. Rodgers JB, Ritskes-Hoitinga M Sánchez Morgado JM, Brønstad A. Systematic reviews. Experimental design and reproducibility in preclinical animal studies. Cham: Springer International Publishing; 2021. p. 213–61.
7. Centre for Reviews and Dissemination. Systematic reviews: CRD’s guidance for undertaking reviews in health care [internet]. University of York; 2008 [cited 2022 Apr 10]. Available from: https://www.york.ac.uk/media/crd/Systematic_Reviews.pdf
8. Sackett DL, Rosenberg WMC, Gray JAM, Haynes RB, Richardson WS. Evidence based medicine: what it is and what it isn’t. BMJ 1996;312:71–2.
9. Guyatt G, Cairns J, Churchill D, Cook D, Haynes B, Hirsh J, et al. Evidence-based medicine: a new approach to teaching the practice of medicine. JAMA 1992;268(17):2420–5.
10. Guyatt GH. Evidence-based medicine. ACP Journal Club; 1991.
11. Anon. How to read clinical journals: I. why to read them and how to start reading them critically. Can Med Assoc J 1981;124(5):555–8.
12. Sackett DL. Evidence-based medicine. Seminars in perinatology. Elsevier. 1997;21(1):3–5.
13. Mhaskar R, Emmanuel P, Mishra S, Patel S, Naik E, Kumar A. Critical appraisal skills are essential to informed decision-making. Indian J Sex Transm Dis AIDS 2009;30(2):112–19.
14. Sur RL, Dahm P. History of evidence-based medicine. Indian J Urol 2011;27(4):487–9.
15. Tikkinen KAO, Guyatt GH. Understanding of research results, evidence summaries and their applicability-not critical appraisal-are core skills of medical curriculum. BMJ Evid Based Med 2021;26(5):231–3.
16. University of Oxford. The Centre for Evidence-Based Medicine [internet]. University of Oxford; n.d. [cited 2022 Apr 10]. Available from: https://www.cebm.net/
17. Critical Appraisal Skills Programme (CASP). Critical Appraisal Skills Programme (CASP) [internet]. CASP; n.d. [cited 2022 Apr 10]. Available from: https://casp-uk.net/
18. Katrak P, Bialocerkowski AE, Massy-Westropp N, Kumar S, Grimmer KA. A systematic review of the content of critical appraisal tools. BMC Med Res Methodol 2004;4:22.
19. The GRADE Working Group. The grading of recommendations assessment, development and evaluation [internet]. GRADE; n.d. [cited 2022 Apr 10]. Available from: https://www.gradeworkinggroup.org/
20. Altman DG, Simera I. A history of the evolution of guidelines for reporting medical research: the long road to the EQUATOR Network. J R Soc Med 2016;109(2):67–77.
21. Harrison JK, Reid J, Quinn TJ, Shenkin SD. Using quality assessment tools to critically appraise ageing research: a guide for clinicians. Age Ageing 2017;46(3):359–65.
22. Viswanathan M, Ansari MT, Berkman ND, Chang S, Hartling L, McPheeters LM, et al. Assessing the risk of bias of individual studies in systematic reviews of health care interventions. Agency for Healthcare Research and Quality Methods Guide for Comparative Effectiveness Reviews AHRQ Publication No 12-EHC047-EF; 2012.
23. Moher D, Jadad A, Nichol G, Penman M, Tugwell P, Walsh S. Assessing the quality of randomized controlled trials: an annotated bibliography of scales and checklists. Control Clin Trials 1995;16(1):62–73.
24. Higgins JPT, Green S. Cochrane handbook for systematic reviews of interventions version 5.0.0 [internet]. The Cochrane Collaboration; 2008 [cited 2022 Apr 10]. Available from: https://training.cochrane.org/handbook/archive/v5.0.0/
25. Frampton G, Whaley P, Bennett M, Bilotta G, Dorne J-LCM, Eales J, et al. Principles and framework for assessing the risk of bias for studies included in comparative quantitative environmental systematic reviews. Environ Evid 2022;11(1):12.
26. Hooijmans CR, Rovers MM, de Vries RB, Leenaars M, Ritskes-Hoitinga M, Langendam MW. SYRCLE’s risk of bias tool for animal studies. BMC Med Res Methodol 2014;14:43.
27. Kennedy CE, Fonner VA, Armstrong KA, Denison JA, Yeh PT, O’Reilly KR, et al. The Evidence Project risk of bias tool: assessing study rigor for both randomized and non-randomized intervention studies. Syst Rev 2019;8(1):3.
28. Babic A, Pijuk A, Brázdilová L, Georgieva Y, Raposo Pereira MA, Poklepovic Pericic T, et al. The judgement of biases included in the category “other bias” in Cochrane systematic reviews of interventions: a systematic survey. BMC Med Res Methodol 2019;19(1):77.
29. Doi SA, Barendregt JJ, Khan S, Thalib L, Williams GM. Advances in the meta-analysis of heterogeneous clinical trials II: the quality effects model. Contemp Clin Trials 2015;45(Pt A):123–9.