Critical Appraisal Turkey Shoot: Linking Critical Appraisal to Clinical Decision Making


Section Editor(s): Regehr, Glenn PhD

PAPERS: Thoughts on Thinking

Correspondence: Kevin Eva, Department of Psychology, McMaster University Faculty of Medicine, Hamilton, Ontario L8S 4K1, Canada.

The authors thank Glenn Jones for generating the checklist that was used by the class of 2000 and Annette Schrapp for administrative support in preparing and distributing the materials to study participants and tutors.

Since the publication of Physicians for the Twenty-First Century — “the GPEP Report” of 1984, medical educators have identified the need for physicians to become lifelong learners.1 Part of the impetus for this conclusion arises from several studies that have demonstrated that knowledge and/or competence of physicians decline as a function of time since graduation; the evidence indicates the cause to be failure to acquire new knowledge rather than a tendency to forget previously learned material.2 Thus, physicians need to be trained to identify the relevant medical literature (i.e., information-seeking skills) and to apply “critical appraisal” techniques to analyze potentially useful articles culled from the literature search.

There is little published evidence that educational interventions around critical appraisal teaching in undergraduate or postgraduate medical curricula have a sustained impact on knowledge of epidemiologic principles or on the critical application of current research information to clinical decision making.3 In considering the impact on conceptual knowledge, one could argue that there is a lack of validated tools available for evaluating critical appraisal skills; alternatively, the format of instruction, timing in the curriculum, and duration of instruction may be at fault. More important, studies have not addressed the issue of whether the demonstration of mastery of particular critical appraisal skills can be related to clinical decision making. Ultimately, such mastery becomes largely irrelevant if it does not translate into better judgment.

The authors of this study were concerned that, despite the inclusion in the first-year undergraduate curriculum of several focused objectives surrounding critical appraisal in the domain of clinical epidemiology, feedback from clinical faculty suggested that students had only rudimentary knowledge of the application of these principles at the end of the first year. In contrast to this feedback, problem-based learning (PBL) is believed to hold the potential to equip graduates with the skills to learn after graduation. In fact, several studies have shown significant differences between students of PBL and students of conventional curricula in the use of recently published medical literature.4,5,6 With this inconsistency in mind, two experimental questions were asked.

  1. Are critical appraisal concepts to which students are “exposed” in PBL in earlier curricular blocks retained sufficiently to allow identification of methodologic errors in formal articles?
  2. Does awareness of such methodologic flaws transfer to an appreciation of how these errors might invalidate the conclusions of the journal articles' authors?

Thus, the goal of this study was to investigate the relationship between understanding the concepts of critical appraisal and their application in clinical decision making. Understanding this relationship can potentially improve the teaching of critical appraisal and the evaluation of this teaching.

Method

Participants. This was a single-blinded experimental study. The participant pool comprised two consecutive first-year undergraduate medical school classes (the graduating classes of 1999 and 2000) in a PBL curriculum at McMaster University. Each class was composed of 18 tutorial groups of five to six students each. The students had some background in critical appraisal, as it had been studied in a readily identifiable manner during the first curricular unit at the beginning of the first academic year. For each class, the study took place during the third curricular unit, which ran during the final three months of the first academic year.

Materials. The subunit planners for each month-long subunit in that third curricular unit selected two journal articles from their respective expert domains of gastroenterology, hematology, and endocrinology. These content experts chose articles that met the defined criteria of being (a) methodologically sound and (b) not directly covered within the context of the unit's curricular problems. Within each of the six articles so identified, one, two, or three different methodologic flaws were implanted, each flaw sufficiently egregious to warrant dismissal of the author's conclusions. The methodologic flaws inserted related to concepts that students were expected to have come across previously in the curriculum. Six categories of errors were examined (participant assembly, randomization, contrast, follow-up, analysis, and other). For example, the study group may have been inappropriately pooled, or randomization may have been unblinded. The text of the journal articles was retyped with the titles, tables, authors, and journal names absent. After this was done, the original six “gold” articles and their flawed counterparts, the “turkey” articles, were superficially indistinguishable.

For each of the six articles a related clinical scenario was generated that would present a clinical management problem for which a specific intervention was to be considered. Each problem was relevant to the unit of study but was not directly related to the health care problems in the curriculum and could not be answered using standard textbooks. Also, according to the subunit planners, the answers to the problems should have been obvious if the relevant recent literature was known.

Procedure. Within both the class of 1999 and the class of 2000, students were randomly allocated biweekly to receive either a gold or a turkey article, for a total of six articles over 12 weeks. Randomization took place across the entire class, not by tutorial group, since the students worked on the exercise independently, and assignment was by use of a table of random numbers. The students were all given a “pre-appraisal” response sheet with the appropriate clinical scenario and were asked to respond on an anchored seven-point Likert-type scale whether they agreed or disagreed with the management option or intervention suggested. The scale was anchored between “definitely yes” (1), “probably yes” (2), “probably no” (5), and “definitely no” (7). This pre-appraisal response sheet served as a baseline of the students' knowledge of the condition demonstrated by the scenario. The students were then given two weeks to work on the articles they had been assigned. Afterward, the students completed a “post-appraisal” sheet that presented the same clinical scenario and the same clinical question that they had seen two weeks earlier. In addition, they were asked to identify any methodologic flaws in the articles they had read. For the class of 1999, this identification took place using an open format. For the class of 2000, the identification of flaws was noted by ticking them off a checklist that contained 29 potential methodologic errors, three to six per category. Responses to the post-appraisal questionnaire would allow us to estimate the students' ability to detect methodologic flaws and to assess whether or not the author's conclusions had influenced their clinical decisions. The responses were handed in to the tutor and a “tutor-guide” was provided to briefly explain the inserted flaws, thereby allowing discussion of the critical appraisal issues during tutorials.
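The biweekly allocation described above can be sketched as follows. This is an illustrative stand-in only: the original study used a printed table of random numbers, and the student identifiers, seed, and per-cycle independence are assumptions.

```python
import random

def allocate(students, n_cycles, seed=2000):
    """Independently assign each student to a 'gold' or 'turkey'
    article in each biweekly cycle (illustrative stand-in for the
    table of random numbers used in the original study)."""
    rng = random.Random(seed)
    return {s: [rng.choice(["gold", "turkey"]) for _ in range(n_cycles)]
            for s in students}

# 100 students, six biweekly cycles over 12 weeks, as in each class
assignments = allocate([f"student_{i:03d}" for i in range(100)], n_cycles=6)
```

With 100 students and six cycles, chance alone yields a roughly even split between arms, consistent with the 49.3% and 52.0% gold-arm proportions reported below.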

Results

Eighty-nine of the 100 students in the class of 1999 completed both the pre- and the post-appraisal questionnaires for at least one of the six questions. The average number of completed questions per participant was 5.61, with 69 of the 89 students completing all six questions. In the class of 2000, 63 of the 100 students completed both pre- and post-appraisal questionnaires at least once, averaging 5.68 questions per participant, with 50 of the 63 students completing all six questions. The decreased participation by students in the second year reflected ambivalence on the part of some of the tutors in dealing with the logistics of the exercise. Two hundred and forty-six (49.3%) of the 499 observations collected from the class of 1999 and 186 (52.0%) of the 358 observations collected from the class of 2000 were from the gold arm of the studies, thereby indicating that the questions were not completed differentially for the two types of papers provided.

Table 1 presents the mean pre-test and post-test scores for both the turkey and the gold groups of both classes. Upon coding the data, some scales were reversed so that the low end of the seven-point Likert scale was always the “correct” response. In neither class did the pre-test scores of the two groups differ significantly from one another. A 2 (time: pre vs. post) × 2 (arm: gold vs. turkey) repeated-measures analysis of variance revealed a significant interaction between time and arm (F(1,497) = 7.043, p <.01) for the class of 1999. The same analysis revealed an effect that bordered on significance for the class of 2000 (F(1,356) = 3.273, p <.075). Planned comparison t-tests for both classes revealed the nature of these interactions. Mean post-test scores of both gold groups decreased significantly relative to their pre-scores (t[245] = 5.198, p <.01 and t[185] = 4.834, p <.01 for the class of 1999 and the class of 2000, respectively). In contrast, mean post-test scores of both turkey groups did not reveal a significant effect of time (t[252] = 1.323, p >.18 and t[171] = 1.693, p >.09 for the class of 1999 and the class of 2000, respectively). Therefore, students were more likely to change their management decisions in an appropriate direction if they had read a methodologically error-free version of the paper.
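As a rough illustration (not the authors' code), the planned comparisons amount to paired t-tests within each arm, and for two groups the time × arm interaction of a 2 × 2 mixed-design ANOVA is equivalent to an independent t-test on the pre-to-post change scores (F = t²). All scores below are invented for demonstration.

```python
import numpy as np
from scipy import stats

def analyze(pre_gold, post_gold, pre_turkey, post_turkey):
    """Paired t-tests within each arm, plus an independent t-test on
    change scores, equivalent to the time x arm interaction of a
    2 x 2 mixed-design ANOVA with two groups."""
    t_g, p_g = stats.ttest_rel(pre_gold, post_gold)      # gold: pre vs. post
    t_t, p_t = stats.ttest_rel(pre_turkey, post_turkey)  # turkey: pre vs. post
    d_gold = np.asarray(post_gold) - np.asarray(pre_gold)
    d_turk = np.asarray(post_turkey) - np.asarray(pre_turkey)
    t_int, p_int = stats.ttest_ind(d_gold, d_turk)       # interaction term
    return {"gold": (t_g, p_g), "turkey": (t_t, p_t),
            "interaction": (t_int, p_int)}

# Invented scores on the seven-point scale (low = "more correct"):
# gold readers shift toward the correct decision; turkey readers barely move.
res = analyze(pre_gold=[5, 6, 4, 5, 6, 5], post_gold=[2, 3, 2, 3, 2, 3],
              pre_turkey=[5, 5, 6, 4, 6, 5], post_turkey=[5, 4, 6, 5, 6, 4])
```

On these toy data the gold arm shows a significant pre-to-post drop, the turkey arm does not, and the change-score comparison (the interaction) is significant, mirroring the pattern of the reported results.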



The participants who read the error-free gold version of the article did report having found errors, as can also be observed in Table 1, but they reported having found significantly fewer errors than those who read the turkey version of the article (t[496] = −3.252, p <.01 and t[357] = −3.338, p <.01, for the class of 1999 and the class of 2000, respectively). Collapsing across arms, there was a significant positive relationship in both classes between the number of problems raised and the post-score assigned (r = 0.230, p <.01 for the class of 1999, r = 0.344, p <.01 for the class of 2000). This indicates that the fewer errors raised, the lower (i.e., more correct) the post-score that was assigned. This relationship remained significant when the analysis was limited to the correct identification of the errors that had been planted within the turkey articles (r = 0.163, p <.01 and r = 0.251, p <.01 for the classes of 1999 and 2000, respectively). These analyses provide converging evidence that students were altering their management decisions based on the strength of the method that they perceived. In addition, it is reassuring that the participants did not appear to allow their prior impressions of the appropriate management decisions to influence their critical appraisals of the articles presented. This is evidenced by the lack of a relationship between the number of problems raised and the pre-score assigned (r = 0.016, p >.72 and r = 0.068, p >.19 for the classes of 1999 and 2000, respectively).
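The correlations reported above between the number of problems raised and the post-score assigned are plain Pearson coefficients, which can be computed as below; the per-student data shown are invented for illustration.

```python
from scipy.stats import pearsonr

# Invented per-student data: number of flaws each student raised, and
# the seven-point post-appraisal score each assigned (low = "correct").
errors_raised = [0, 0, 1, 1, 2, 2, 3, 3]
post_scores   = [1, 2, 2, 3, 3, 5, 5, 6]

r, p = pearsonr(errors_raised, post_scores)
# A positive r means students who flagged more flaws assigned higher
# (less management-changing) post-scores, the pattern reported in both classes.
```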

Finally, taking into account the numbers of turkey articles read and the numbers of errors embedded, the potential numbers of errors that could be correctly identified were 505 and 343 for the classes of 1999 and 2000, respectively; 178 (35.2%) of them were identified by the class of 1999 and 80 (23.3%) by the class of 2000. Review of the actual methodologic flaws identified by the students demonstrated no consistent pattern between the two classes. The proportions of the six individual error categories correctly identified by the class of 1999 were 33/86 (38%) for participant assembly, 37/98 (38%) for randomization, 63/163 (39%) for contrast, 11/45 (24%) for follow-up, 8/68 (12%) for analysis, and 26/45 (58%) for other. The corresponding proportions correctly identified by the class of 2000 were 13/56 (23%), 21/56 (38%), 26/113 (23%), 16/29 (55%), 4/59 (7%), and 0/30 (0%) for the same six categories, respectively.
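The category-level identification rates in the paragraph above reduce to simple proportions; the sketch below recomputes them from the counts reported for the class of 1999.

```python
# Counts of planted flaws correctly identified / total planted,
# per error category, as reported for the class of 1999.
identified_1999 = {
    "participant assembly": (33, 86),
    "randomization":        (37, 98),
    "contrast":             (63, 163),
    "follow-up":            (11, 45),
    "analysis":             (8, 68),
    "other":                (26, 45),
}

# Per-category identification rates
rates = {cat: hit / total for cat, (hit, total) in identified_1999.items()}

# Overall rate: 178 identified out of 505 planted (the reported 35.2%)
overall = (sum(h for h, _ in identified_1999.values())
           / sum(t for _, t in identified_1999.values()))
```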

Discussion

An ultimate objective in teaching critical appraisal concepts is for medical students to view literature searching and critical appraisal as fundamental skills required for effective medical practice. As Norman et al. demonstrated in a recent review of teaching critical appraisal, most reported teaching interventions, even the few controlled studies published, have assessed short-term gains in acquiring knowledge of critical appraisal techniques rather than their application to clinical decision making.3 The results of these studies were largely consistent with the anecdotal feedback that we have received from tutors—students appear to be poor critical appraisers. While it is important to be able to demonstrate some knowledge of the principles of how to scrutinize the medical literature carefully and critically, some demonstration of putting these principles into practice would seem to be just as desirable an educational outcome. By using a more decision-oriented outcome measure, the current findings suggest that the studies reviewed by Norman et al. and the interactions between students and tutors might underestimate students' ability to critically appraise scientific articles.

This study demonstrated that first-year medical students can alter their clinical management decisions appropriately as a function of whether they have read a methodologically sound or flawed journal article. When provided with the “gold” journal articles, these students changed their clinical decisions in the post-test in the direction of the correct management decisions, despite apparently identifying some putative methodologic flaws in these “gold” papers. As expected, however, fewer errors were identified by students in the “gold” articles, and there was a significant positive relationship between identifying fewer errors and assigning a “more correct” clinical decision score on the post-test.

The findings from the turkey articles require more explanation. As expected, students identified more errors in the turkey papers. However, at most, only 35% of the deliberately inserted methodologic flaws were correctly identified. Despite being unable to accurately identify all of these errors, the students tended not to alter their original management decisions when they had been assigned turkey papers. It seems that the students were uncomfortable with the authors' conclusions and, without necessarily being able to specify the flaws, decided to either maintain their original management decisions or make small changes in either direction. While the authors had anticipated from a curriculum review that the “flaws” inserted into the articles might be identified by students, one weakness of this study is that there was no assessment of the tutors' abilities to identify them.

Finally, there was no relationship between the number of flaws identified and the “correctness” of the scores the students assigned on the pre-test. This implies that the students were able to read the articles critically without being biased by their perceptions of the correct management decisions, thereby providing further evidence that our students treated the articles in a rational manner.

In summary, the current findings show that our first-year students do indeed have relatively limited ability to identify specific methodologic issues in journal articles. Despite this, however, the clinical decision-making results demonstrated a gratifying relationship between the students' perceptions of the “quality of evidence” and appropriate changes in their management decisions. This suggests that students are reading the literature more critically than might be assumed by simply testing their knowledge of particular critical appraisal concepts. That is, while seeming to treat articles appropriately, students may not be able to articulate specific methodologic errors, thereby giving the appearance of poor critical appraisal skills. While it is important for students to be able to articulate critical appraisal concepts, the current results suggest that examining students' abilities in this domain should take place in the context of clinical decision making. Our participants' capacity to alter their decisions in a rational manner suggests that even novice medical students should be strongly encouraged to critically appraise. Future research will determine to what extent the correct or incorrect perceptions by students of particular methodologic flaws influence their clinical decision making.

References

1. Muller S (chairman). Physicians for the twenty-first century: report of the project panel on the general professional education of the physician and college preparation for medicine. J Med Educ. 1984;59(11 Pt 2).
2. Day SC, Norcini JJ, Webster GD, Viner ED, Chirico AM. The effect of change in medical knowledge on examination performance at the time of re-certification. Proc Annu Conf Res Med Educ. 1988;22:139–44.
3. Norman GR, et al. Effectiveness of instruction in critical appraisal (evidence-based medicine) skills: a critical appraisal. Can Med Assoc J. 1998;158:177–81.
4. Blumberg P, Michael J. Development of self-directed learning behaviours in a partially teacher-directed problem-based learning curriculum. Teach Learn Med. 1992;4:3–8.
5. Marshall JG, Fitzgerald D, Busby L, et al. A study of library use in problem-based and traditional medical curricula. Bull Med Libr Assoc. 1992;81:299–305.
6. Shin JH, Haynes RB, Johnston ME. Effect of problem-based, self-directed undergraduate education on life-long learning. Can Med Assoc J. 1993;148:969–76.

Section Description

Research in Medical Education: Proceedings of the Thirty-ninth Annual Conference. October 30 - November 1, 2000. Chair: Beth Dawson. Editor: M. Brownell Anderson. Foreword by Beth Dawson, PhD.

© 2000 by the Association of American Medical Colleges