The diagnostic reasoning literature suggests that expert performance does not stem from the application of generic problem-solving strategies.1–4 Rather, when tackling a clinical problem, individuals appear to link the information they gather to their own idiosyncratic memories, to sort rapidly between possible diagnoses, and concurrently to identify data that might not “fit” their hypotheses.2,5–8 Models codifying these complex processes have recently centered on two systems of cognition, termed System 1 and System 2.9–12 System 1, which likely drives most first impressions, describes the automatic, rapid, subconscious processes that are activated when individuals encounter familiar problems; System 2, which is more deliberate, describes the analytic, slow, conscious processes that individuals employ when considering novel or ambiguous problems.9–12 Although the two systems are distinguished for explanatory purposes, actual clinical reasoning typically involves both, an interaction referred to as “dual processing.”13 Using both systems in concert may offer the best chance at diagnostic success.14
Although the literature on cognition and clinical reasoning has led to a variety of curricular interventions aimed at improving diagnostic performance15—such as tailoring curricula to emphasize illness scripts16 or teaching strategies to reduce bias and emphasize metacognition17,18—few researchers have applied this accumulated understanding to improving assessment protocols.19 In an effort to further this translation, we hypothesized that we could gain unique insights by giving instructions in an assessment setting that biased participants toward particular modes of reasoning. We expect that many aspects of clinical practice (e.g., time pressures and distractions) push clinicians toward System 1 processing. Thus, it is plausible that testing conditions that similarly promote System 1 processing might more accurately reflect performance in practice. At the same time, if the deliberate application of System 2 processing is essential for improved diagnostic performance,17 then testing conditions that emphasize automatic approaches could underestimate candidates’ actual ability. Knowing whether deliberate instructions to use a particular diagnostic approach do or do not improve performance in a testing environment can help advance assessment practices. Further, providing explicit directions to use a specific diagnostic approach also affords a direct test of whether cognitive debiasing strategies aimed at promoting System 2 processing help clinicians avoid System 1–induced errors.15,16 Answering such questions cleanly is difficult because System 1 and System 2 rarely, if ever, operate alone.20 Manipulating instructions and the conditions under which clinicians make decisions can, however, influence their diagnostic accuracy; such strategies have been used to demonstrate that both automatic and analytic processing can help as well as hinder diagnostic performance.21
Whether or not such manipulations can influence the utility of diagnostic assessment strategies remains unknown; therefore, we developed an online vignette-based assessment that guides participants toward using either automatic or analytic reasoning processes to make a diagnosis.22 Previously, we tested participants with different levels of clinical experience at a single institution, and our results suggested that straightforward vignettes under both testing conditions provided similar reliability for medical students, whereas complex vignettes under instructions that encouraged automatic thinking offered the greatest reliability for PGY1 and PGY2 residents.22 Since then, research has shown that faster responses within a test correlate with better performance when clinicians are encouraged to “work as quickly as [they] can without sacrificing accuracy”; this finding further suggests that diagnostic accuracy is a function of individuals’ baseline knowledge and experience with specific problems more than the application of any particular decision-making processes.23
The current investigation, which has two aims, extends this research. First, to determine whether instructions that encourage clinicians to use more deliberate thinking improve diagnostic accuracy, we compared clinicians’ performance under conditions designed to encourage either automatic or analytic reasoning processes. Second, to evaluate the impact of these instructions on the utility of the assessment protocol, we examined the reliability of the results, the feasibility of the testing administration, and the results’ correlation with United States Medical Licensing Exam (USMLE) scores under both conditions. Accordingly, we sought to answer four questions:
- Does diagnostic accuracy improve when clinicians are encouraged to consider clinical vignettes through deliberate processes (as opposed to automatic processes)?
- Does the ability of the assessment to consistently differentiate among candidates (i.e., reliability) differ across testing conditions?
- Does the relationship between performance on the assessment and scores on high-stakes licensing examinations differ across testing conditions?
- When balancing testing time and reliability, how feasible is it to administer each testing condition?
This multicenter, randomized, cross-sectional study measured clinicians’ diagnostic accuracy on a series of previously developed clinical vignettes24,25 under two instructional conditions designed to encourage either automatic or analytic reasoning. We constructed our clinical assessment using a Web-based testing platform. The institutional review board at the coordinating institution approved this study. As compensation for their time, participants were entered into a raffle to win one of three iPads (Apple, Cupertino, California).
We recruited third- and fourth-year medical students, along with residents, fellows, and faculty from emergency and internal medicine from eight academic institutions in the western United States between December 2011 and January 2012. We grouped participants into three cohorts based on their clinical experience: students, residents/fellows, and faculty.
We invited eligible participants (total N = 3,757) via e-mail during a two-week enrollment period at each site. The e-mail they received included a brief description of the study and a link to the testing Web site. On clicking the link, respondents were directed to an open-source blog that provided more information about the study, including that the investigators would collaborate with the National Board of Medical Examiners (NBME) to link performance on the vignette-based diagnostic assessment to past scores on the USMLE. Through the Web site, we collected participants’ consent and demographic information. Further, we programmed the Web site such that when participants clicked the button corresponding to their training level, it automatically randomized each of them into one of six assessment groups (three per instructional condition) with stratification by training level to ensure an even distribution across the six versions of the assessment (see Figure 1). The NBME confidentially linked assessment results with participants’ USMLE scores at the conclusion of the study and returned an anonymized dataset to the investigators for analysis.
We designed each instructional condition to encourage participants to use either automatic or analytic reasoning processes to solve clinical vignettes. All of the instructions for participants in both conditions appeared on a single computer screen. To encourage automatic reasoning, we drew from the work of Ark and colleagues.14 Half of the participants received testing instructions encouraging them to use similarity as their guide and to record their first impressions (FI) as their diagnosis:
Please read the vignette below. This case may look similar to things you have seen before (i.e., during training): trust your sense of familiarity and enter [in the box that followed] the first diagnosis that seems reasonable upon reading the case.
To encourage analytic forms of reasoning, we used an abbreviated set of prompts similar to those employed by Mamede and colleagues.25 Thus, the remaining half of participants received instructions designed to facilitate a directed search (DS) for the correct diagnosis; their instructions comprised four prompts asking them to explain their reasoning:
1. Think about how you would put together the information you have just been given about this case into a one-sentence summary. Write that one-sentence summary in the box below, using the terms you feel most accurately and concisely sum up the case.
2. List each diagnosis you are considering. For each diagnosis, write the features of the history, exam, and/or test results that support that diagnosis, and features that do not explain that diagnosis.
3. Now list no more than three most probable diagnoses for this case in rank order (#1—most likely, #2—next most likely, etc.).
4. What diagnosis is important not to miss in this case, regardless of probability? This diagnosis may or may not be on the list above.
Each version of the assessment began with a training vignette that demonstrated the expected free-text answer format for each instructional condition, followed by eight testing vignettes. A previous publication22 describes the instrument development and pilot process. Briefly, we selected cases from a pool of clinical vignettes developed and tested by Mamede and colleagues.24,25 Vignettes reported patients’ demographic and historical information, physical examination findings, and lab or other test results. We used 12 vignettes covering a spectrum of complexity: half were “simple” vignettes that described features of a typical presentation of a single diagnosis, and half were “complex” vignettes that intentionally introduced features that suggested more than one diagnosis.22,24 We previously published validity evidence for these vignettes in the domains of content, response process, relation to other variables, and internal structure.22
We delivered the assessment to participants via Catalyst (Catalyst Development Corporation, Yucca Valley, California), an online software program that allows for asynchronous testing. We divided the 12 vignettes into three blocks of four, each with two “simple” cases and two “complex” cases. We constructed three versions of the assessment by combining each of the three blocks in a pairwise fashion such that each assessment included eight vignettes (see Figure 1). We instructed participants not to use supplemental material, not to discuss the vignettes with others, and to allot sufficient time to complete the assessment in one uninterrupted sitting. The Catalyst software automatically measured total testing times.
Using a previously described rubric,22 two raters (J.S.I. and J.L.B.) scored all responses; generalizability coefficients were then calculated to evaluate the interrater reliability. Under the FI condition, we allowed only a single diagnosis; if a participant entered more than one diagnosis in the free-text box, we scored the entire response as incorrect. For comparability, we scored responses in the DS condition as correct only if a participant had ranked the correct diagnosis first. For discrepant scores, we came to a consensus through discussion before assigning a score for analysis. The interrater reliability of our scoring was high (G > 0.99) for each assessment.
We measured and compared diagnostic accuracy and overall time required across instructional conditions using analysis of variance (ANOVA). We also compared results of the assessment under the two conditions with USMLE scores using Pearson correlation coefficients.
Because the main outcome of interest was reliability achieved under each instructional condition (FI and DS), we based our sample size calculations on the equations presented by Streiner and Norman.26 To calculate reliability meaningfully, yet separately for simple and complex cases, we assumed the use of four vignettes and a reliability of at least 0.75. We calculated that we would require a sample of N = 53 per instructional condition to achieve a 95% confidence interval (CI) of ± 0.05. To consider the reliability separately for each experience level, we aimed to enroll 60 participants × 6 groups (2 conditions × 3 experience levels) for a total N = 360. Pilot testing revealed that a sample this size should provide > 95% power to detect a 10% difference in performance between groups with moderate (d = 0.5) effect size.
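As a rough check on the power statement, the conventional normal-approximation sample size formula for a two-sided, two-group comparison can be sketched as follows (a generic illustration using standard critical values, not the equations of Streiner and Norman26):

```python
import math

def n_per_group(d: float, z_alpha: float = 1.96, z_beta: float = 1.645) -> int:
    """Participants per group needed to detect effect size d (Cohen's d)
    at alpha = .05 (two-sided) with 95% power, via the normal approximation:
    n = 2 * ((z_alpha/2 + z_beta) / d) ** 2, rounded up."""
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

# A moderate effect (d = 0.5) requires roughly 104 participants per group,
# so a planned 180 per instructional condition comfortably exceeds 95% power.
print(n_per_group(0.5))  # 104
```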
We used generalizability theory to calculate the internal consistency of the assessment scores for each level of experience. This allowed us to analyze how well our diagnostic reasoning assessment could consistently differentiate individuals within a group of peers, without artificially inflating the reliability by analyzing across the full (and, hence, more heterogeneous) sample. Generalizability theory also enabled us to account for the three replications of the effect of test (instructional) condition. We treated participants as the facet of differentiation and vignette as the facet of generalization.26 We then performed a series of Spearman–Brown calculations to examine assessment feasibility by determining how many questions are required within each condition to achieve reliability of at least 0.75. To examine the feasibility of testing within each condition, we multiplied the proportional increase in the number of items by the observed testing time.
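The Spearman–Brown step can be made concrete with a short sketch (the reliability, item count, and testing time below are illustrative placeholders, not the study's observed values):

```python
import math

def spearman_brown_factor(observed_rel: float, target_rel: float = 0.75) -> float:
    """Factor by which test length must grow for a test with reliability
    observed_rel to reach target_rel (Spearman-Brown prophecy formula)."""
    return (target_rel * (1 - observed_rel)) / (observed_rel * (1 - target_rel))

def projected_requirements(observed_rel: float, n_items: int, minutes: float,
                           target_rel: float = 0.75):
    """Items and testing time needed to reach target_rel, scaling the
    observed testing time in proportion to the increase in test length."""
    k = spearman_brown_factor(observed_rel, target_rel)
    return math.ceil(n_items * k), minutes * k

# Illustrative values: 4 vignettes with observed reliability 0.50 completed
# in 30 minutes would require a test 3 times as long to reach 0.75.
print(projected_requirements(0.50, 4, 30.0))  # (12, 90.0)
```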
Figure 1 illustrates the trial flow, and Table 1 provides the characteristics of the study participants. A total of 393 individuals participated in the study, though we excluded 5 who provided nonsensical or partial responses.
Overall mean testing time was 74.9 minutes (standard deviation [SD] = 149.8). Of the 388 participants we included, 6 (1.5%) had testing times classified as outliers because they were > 3 SDs from the mean. Excluding these individuals, mean testing times in the FI and DS conditions were 29.1 minutes (95% CI = 23.8–34.5) and 88.6 minutes (95% CI = 83.4–93.9), respectively (F[1,381] = 242, P < .001, d = 1.27). Testing time was not statistically related to experience (mean = 63.4, 57.4, and 55.8 minutes for, respectively, students, residents/fellows, and faculty; F[2,381] = 1.72, P = .18, d = 0.16 for the largest difference). Further, testing time did not reveal an interaction between experience and instructional condition.
Overall diagnostic performance
Our 2 (discipline) × 2 (instructional condition) × 2 (complexity) ANOVA revealed that overall mean scores for emergency medicine participants (68.1%) did not differ significantly from those of internal medicine participants (67.6%; F[1,228] < 1, P = .82, d = 0.03), nor did discipline and condition interact (F[1,228] < 1, P = .60). To conserve statistical power, we did not consider the discipline of participants in any further analyses.
Table 2 shows diagnostic accuracy as a function of instructional condition, vignette complexity, and experience level. A 2 (instructional condition) × 2 (complexity) × 3 (experience level) ANOVA revealed no difference as a function of instructional condition (F[1,379] = 2.44, P = .12, d = 0.12), but demonstrated the expected main effects of complexity (F[1,379] = 965.2, P < .001, d = 1.92) and experience (F[2,379] = 39.6, P < .001, d = 0.73). We found no statistically significant interactions among these three variables apart from experience × complexity (F[2,379] = 6.17, P = .002), which revealed that the difference in accuracy between simple and complex vignettes declined with experience, although the effect size was large in all instances. Mean differences were 48.6% (d = 2.11), 46.2% (d = 2.01), and 37.6% (d = 1.63) for, respectively, medical students, residents/fellows, and faculty.
Correlates with other performance measures
Table 3 presents Pearson correlations examining the relationship between average diagnostic assessment scores and USMLE performance, birth year, and testing time. Despite including only eight vignettes, we observed statistically significant correlations with each step of the USMLE. This trend was true both when we considered all data (i.e., derived from FI and DS instructions together) and when we analyzed data collected from the two conditions separately. In 11 out of 11 comparisons between instructional conditions (shown in the first three rows of Table 3), the FI instruction yielded a higher correlation between our diagnostic assessment scores and USMLE scores than did the DS instruction (P < .05). We also observed a small negative correlation between overall performance and birth year in both instructional conditions such that older (more experienced) participants performed slightly better on average. Performance on the assessments did not correlate significantly with testing time in any instance.
Assessment reliability by instructional condition
To assess the internal consistency of our diagnostic assessment, we calculated Cronbach alpha for each of the three assessment versions. Table 4 shows the average reliability as a function of complexity and instructional condition. Table 4 also presents the results of Spearman–Brown calculations performed with these coefficients to determine the number of items (N(i)) required within each experimental condition to achieve reliability of at least 0.75 and the time requirements to complete an assessment of that length. These analyses revealed that an alpha of at least 0.75 can be achieved much more efficiently in the FI condition relative to the DS condition, although testing times remain long in both conditions.
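For reference, Cronbach alpha for dichotomously scored vignettes can be computed directly from a participants-by-items score matrix (a generic sketch with fabricated 1/0 scores, not the study data):

```python
def cronbach_alpha(scores):
    """Cronbach alpha for a participants-by-items matrix:
    alpha = (k / (k - 1)) * (1 - sum(item variances) / variance of totals)."""
    k = len(scores[0])  # number of items

    def pvariance(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Fabricated 1/0 (correct/incorrect) scores: 4 participants, 4 vignettes.
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(data), 3))  # 0.667
```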
Results from this large, experientially diverse cohort of clinicians suggest that diagnostic performance is similar when participants are encouraged to trust their first impressions compared with when they are encouraged to take a more deliberate approach. Further, guiding clinicians into an automatic reasoning mode seems to have improved both the efficiency of the diagnostic assessment and the correlation of the assessment’s results with performance scores on the USMLE (Table 3)—despite the number of vignettes remaining constant. These results suggest that instructions can have a substantial impact on the ability of vignette-based diagnostic assessments to differentiate meaningfully among clinicians for reasons independent of the gains that can be achieved through increased sampling.
This large-scale implementation of our assessment intervention attempted to address some of the limitations of our initial single-institution study.22 The lack of difference in mean diagnostic accuracy between instructional conditions stands in contrast with our prior results in which diagnostic accuracy was greater in the DS condition.22 These differences likely stem from methodological differences between the two studies. In our first study, we presented the FI and DS conditions in sequence, such that each participant answered six questions under FI instructions, then six under DS instructions. Possibly then, in our earlier investigation,22 the DS instruction amounted more to a sequential instruction to use a combined reasoning approach than to employ an analytic framework alone. This effect would be consistent with past data suggesting that explicit or implicit prompts encouraging both automatic and analytic reasoning improve performance compared with prompts encouraging one mode alone.14 In contrast, the between-subjects design of the current study exposed participants to only one instruction, minimizing the potential influence toward dual processing.
Our findings that graduated physicians performed better than students, and that complex vignettes elicited less accurate responses than simple vignettes (Table 2), support the argument that the assessment measured diagnostic skill. The statistically significant correlations between our assessment scores and USMLE scores suggest both that performance on our assessment reflects learners’ ability to access and apply their medical knowledge to clinically oriented problems and that the clinicians working under the FI condition performed better in this regard than did those working under the DS condition.
Although necessarily speculative, this higher correlation between our assessment results and USMLE scores in the FI condition (Table 3) may suggest that USMLE scores heavily reflect nonanalytic processing. Though not designed to be “speeded” examinations (i.e., exams that challenge respondents to answer all questions within the time provided),27 the time pressure inherent in high-stakes USMLE testing may encourage more automatic than analytic thinking. Given the demonstrated correlations between licensure examinations and performance in clinical practice,28–30 as well as the fundamental link between knowledge and automatic reasoning,5–8 it is possible that assessments that privilege this type of reasoning—whether by time limitations or manipulations of testing instructions—provide more accurate measurements of clinical expertise. The lack of difference between the mean scores under our two testing conditions further suggests that encouraging test candidates to respond with their first impressions does not harm their overall performance.
The fact that the correlations between USMLE scores and diagnostic assessment results (under both testing conditions) were low to moderate likely reflects the unreliability of any test containing just a few items, especially given the robustness of context specificity in diagnostic reasoning. Extrapolation of our data suggests that increasing the number of items would achieve acceptable reliability, but the already-long testing times associated with the DS condition likely preclude feasible implementation of that intervention. The finding that testing under the FI condition, with a constant number of vignettes, resulted in consistently higher correlations with USMLE performance than did the DS condition suggests either that less time can be spent on the vignette-based diagnostic assessment or that the same testing time can be used to present more clinical vignettes, enabling better content coverage without harming other aspects of the assessment’s utility or candidates’ performance.
Although internal consistency was relatively low in all groups (Table 4), a few interesting observations should be highlighted. If anything, DS instructions differentiated among medical students better than FI instructions did, whereas the reverse was true for more experienced clinicians. This observation is consistent with the notion that inexperienced novices are less skilled at automatic impression formation and accordingly do not yet show replicable differences in their ability to use such processes. In contrast, experienced clinicians should have extensive tacit knowledge available for use in diagnostic reasoning. We would thus expect that careful and deliberate processing would reveal relative homogeneity (and lower reliability), whereas stronger diagnosticians may more easily be shown to outperform their peers when encouraged to use automatic processes. Similarly, simple items appear better able to differentiate among students than do complex items, whereas the reverse was true (if anything) for more experienced (i.e., graduated) physicians. Again, this finding is not surprising given that truly difficult vignettes are too difficult for even the best trainees, and truly easy vignettes are too easy for even the weakest experienced diagnosticians.
Our study has several limitations. First, we used volunteer participants, which may have resulted in a sample with more competent or motivated individuals. This influence, however, should narrow the range of scores and result in more conservative reliability estimates. Further, such a bias would be unlikely to affect the results systematically across the instructional conditions given the random assignment of participants. Second, we collected data only when participants completed their entire assessment. Possibly then, because of the substantially more laborious nature of the DS condition, more of the participants randomized to that condition started but did not finish their assessment. Strengths of this investigation include the use of realistic clinical vignettes that have been vetted in prior investigations,22,24,25 the large, experientially diverse sample, and the deliberate collection of validity evidence to ground these results in the context of existing literature and past performance.31
In conclusion, this investigation suggests that instructions to trust one’s first impressions result in similar diagnostic accuracy when compared with instructions to consider clinical information in a structured and systematic fashion. These results run counter to the assertion that encouraging clinicians to focus their thinking toward analytic processes is likely to reduce diagnostic errors. Further, instructions to trust first impressions created a substantial advantage in terms of the utility of the assessment protocol, enabling more feasible test administration in a manner that appears to correlate better with high-stakes licensing examinations.
1. Elstein AS, Schwartz A. Clinical problem solving and diagnostic decision making: Selective review of the cognitive literature. BMJ. 2002;324:729–732
2. Coderre S, Mandin H, Harasym PH, Fick GH. Diagnostic reasoning strategies and diagnostic success. Med Educ. 2003;37:695–703
3. Mandin H, Jones A, Woloschuk W, Harasym P. Helping students learn to think like experts when solving clinical problems. Acad Med. 1997;72:173–179
4. Norman G, Young M, Brooks L. Non-analytical models of clinical reasoning: The role of experience. Med Educ. 2007;41:1140–1145
5. Brooks LR, Norman GR, Allen SW. Role of specific similarity in a medical diagnostic task. J Exp Psychol Gen. 1991;120:278–287
6. Schmidt HG, Norman GR, Boshuizen HP. A cognitive perspective on medical expertise: Theory and implication. Acad Med. 1990;65:611–621
7. Custers EJ, Boshuizen HP, Schmidt HG. The influence of medical expertise, case typicality, and illness script component on case processing and disease probability estimates. Mem Cognit. 1996;24:384–399
8. Charlin B, Tardif J, Boshuizen HP. Scripts and medical diagnostic knowledge: Theory and applications for clinical reasoning instruction and research. Acad Med. 2000;75:182–190
9. Croskerry P. A universal model of diagnostic reasoning. Acad Med. 2009;84:1022–1028
10. Kahneman D. Thinking, Fast and Slow. New York, NY: Farrar, Straus and Giroux; 2011
11. Tversky A, Kahneman D. The framing of decisions and the psychology of choice. Science. 1981;211:453–458
12. Tversky A, Kahneman D. Judgment under uncertainty: Heuristics and biases. Science. 1974;185:1124–1131
13. Norman G. Dual processing and diagnostic errors. Adv Health Sci Educ Theory Pract. 2009;14(suppl 1):37–49
14. Ark TK, Brooks LR, Eva KW. Giving learners the best of both worlds: Do clinical teachers need to guard against teaching pattern recognition to novices? Acad Med. 2006;81:405–409
15. Bowen JL. Educational strategies to promote clinical diagnostic reasoning. N Engl J Med. 2006;355:2217–2225
16. Croskerry P. The importance of cognitive errors in diagnosis and strategies to minimize them. Acad Med. 2003;78:775–780
17. Graber ML. Educational strategies to reduce diagnostic error: Can you teach this stuff? Adv Health Sci Educ Theory Pract. 2009;14(suppl 1):63–69
18. Graber ML, Kissam S, Payne VL, et al. Cognitive interventions to reduce diagnostic error: A narrative review. BMJ Qual Saf. 2012;21:535–557
19. Ilgen JS, Humbert AJ, Kuhn G, et al. Assessing diagnostic reasoning: A consensus statement summarizing theory, practice, and future needs. Acad Emerg Med. 2012;19:1454–1461
20. Jacoby LL. A process dissociation framework: Separating automatic from intentional uses of memory. J Mem Lang. 1991;30:513–541
21. Norman GR, Eva KW. Diagnostic error and clinical reasoning. Med Educ. 2010;44:94–100
22. Ilgen JS, Bowen JL, Yarris LM, Fu R, Lowe RA, Eva K. Adjusting our lens: Can developmental differences in diagnostic reasoning be harnessed to improve health professional and trainee assessment? Acad Emerg Med. 2011;18(suppl 2):S79–S86
23. Sherbino J, Dore KL, Wood TJ, et al. The relationship between response time and diagnostic accuracy. Acad Med. 2012;87:785–791
24. Mamede S, Schmidt HG, Rikers RM, Penaforte JC, Coelho-Filho JM. Breaking down automaticity: Case ambiguity and the shift to reflective approaches in clinical reasoning. Med Educ. 2007;41:1185–1192
25. Mamede S, Schmidt HG, Penaforte JC. Effects of reflective practice on the accuracy of medical diagnoses. Med Educ. 2008;42:468–475
26. Streiner DL, Norman GR. Health Measurement Scales: A Practical Guide to Their Development and Use. 4th ed. New York, NY: Oxford University Press; 2008
27. O’Donnell MJ, Obenshain SS, Erdmann JB. Background essential to the proper use of results of Step 1 and Step 2 of the USMLE. Acad Med. 1993;68:734–739
28. Tamblyn R, Abrahamowicz M, Brailovsky C, et al. Association between licensing examination scores and resource use and quality of care in primary care practice. JAMA. 1998;280:989–996
29. Tamblyn R, Abrahamowicz M, Dauphinee WD, et al. Association between licensure examination scores and practice in primary care. JAMA. 2002;288:3019–3026
30. Ramsey PG, Carline JD, Inui TS, et al. Predictive validity of certification by the American Board of Internal Medicine. Ann Intern Med. 1989;110:719–726
31. Downing S. Validity: On meaningful interpretation of assessment data. Med Educ. 2003;37:830–837