Many psychologists presume that problem solving relies on cognitive shortcuts, or heuristics, that not only reduce one’s cognitive load but also may lead to biases. They also believe that these cognitive biases may lead to diagnostic errors. Consider the following scenarios:
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice and participated in antinuclear demonstrations.
Which is the more likely statement?
A. Linda is a bank teller.
B. Linda is a bank teller and active in the feminist movement.
Rahim is a 55-year-old male who presents to the emergency department with multiple injuries following a car accident. On examination, he has diminished breath sounds on the left side and a tender abdomen. His blood pressure is 90/55, and his pulse is 135 beats per minute.
Which is the more likely statement?
A. Rahim has a pneumothorax.
B. Rahim has a pneumothorax and a ruptured spleen.
The first scenario above has achieved a certain degree of notoriety. Daniel Kahneman,1 the Nobel Prize–winning psychologist, once described Linda as “one of the best-known characters in the heuristics and biases literature.” The correct response in the scenario is option A, because bank teller includes both feminist and nonfeminist, according to formal logic.* Yet, in repeated studies, between 85% and 89% of respondents chose option B. Psychologists explain this behavior by arguing that participants match the scenario against exemplars of each role (bank teller and activist) and reason that, although Linda may be an activist, she also has to support herself. In doing so, they do not consider the basic logic of the situation. Although we invented the second scenario for this report, and so it has not been formally tested, we believe that the majority of physicians would choose option B, again the incorrect answer for the same reasons according to formal logic. However, if they chose the logically correct answer, the patient would likely die.
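The logic behind option A in the Linda scenario can be checked with a few lines of arithmetic. The sketch below is illustrative only: the probabilities are hypothetical values chosen for the example, not estimates from any study, and the function name `joint_probability` is ours.

```python
def joint_probability(p_a: float, p_b_given_a: float) -> float:
    """Return P(A and B) using the product rule P(A) * P(B | A)."""
    return p_a * p_b_given_a


# Hypothetical values, for illustration only.
p_teller = 0.05                 # P(Linda is a bank teller)
p_feminist_given_teller = 0.90  # P(active feminist | bank teller)

p_both = joint_probability(p_teller, p_feminist_given_teller)

# Because P(B | A) cannot exceed 1, the conjunction can never be more
# probable than the single statement it contains.
assert p_both <= p_teller
```

However high the conditional probability is set, the joint statement (option B) remains no more probable than the single statement (option A).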
The tendency to select the joint option in such scenarios, called the conjunction fallacy, is one of a number of heuristics and biases uncovered by Kahneman and his colleague, Amos Tversky, in a research program dating back to the 1970s. They explained that humans can access two thinking processes, described by Stanovich2 as System 1 and System 2. An extensive literature describes these two processes.3 In Kahneman’s1 words:
The operations of System 1 are fast, automatic, effortless, associative, and difficult to control or modify. The operations of System 2 are slower, serial, effortful, and deliberately controlled; they are also relatively flexible and potentially rule-governed.
Kahneman and Tversky4 explain further that, in their view, the errors that result from human information processing represent a very particular kind of failure of both systems: “… errors of intuitive judgment involve failures of both systems: System 1, which generated the error, and System 2, which failed to detect and correct it.” In their opinion, then, all errors originate in System 1 processes and remain simply because the logical, analytical processes of System 2 have failed to identify and correct these errors.
However, in clinical choice situations, like the second scenario above, the statistically correct conclusion, which amounts to failing to diagnose both potentially lethal conditions, would raise questions of clinical competence. We believe that both scenarios highlight a fundamental problem with the perspective that errors are a direct reflection of cognitive biases. From a purely logical perspective, one option is unequivocally correct; from a practical perspective, however, the same option is egregiously incomplete. Despite our concerns, and similar issues arising from other biases, the research program of Kahneman and Tversky has been accepted virtually intact by researchers in the area of diagnostic errors, which are now viewed by most observers as arising almost entirely from cognitive biases on the part of the physician.5–8 For example, Croskerry6 stated, “Most errors occur with Type 1 [System 1] and may to some extent be expected whereas Type 2 [System 2] errors are infrequent and unexpected.” Croskerry6 also distinguished between the two kinds of processes, using descriptions similar to Kahneman’s: “Type 1 processes are fast, reflexive, intuitive” and “Type 2 processes are analytic, slow and deliberate.”
Surprisingly, as errors are almost unanimously presumed to be directly caused by System 1–related cognitive biases, only one study has claimed to actually document error from cognitive bias. Graber and colleagues9 retrospectively analyzed 100 cases of proven diagnostic error arising in the emergency departments of three academic hospitals. They identified cognitive errors in 74 of those cases, system-related errors in 65, and no-fault errors in 44. The most common cognitive error they found was premature closure or arriving at an incorrect diagnosis before eliciting a critical piece of information, usually from the laboratory. Their demonstration does not, in any sense, confirm a dual processing model; it does, however, suggest that errors may be associated with incomplete information and, hence, shorter times to diagnosis.†
Some experimental evidence also suggests that errors are associated with rapid processing. Mamede and colleagues11 conducted an experiment in which medical residents diagnosed a series of cases under two conditions: the first five cases with no instruction and the second five with the information that the cases “had been seen by experienced physicians who failed to diagnose them accurately.” The cases were counterbalanced. Accuracy was higher (2.78/4 versus 2.20/4), and time taken was slightly longer for the first group (60 seconds versus 55 seconds). However, Mamede and colleagues deliberately designed the cases to be difficult—a situation in which rapid, System 1 processing may fail. A second study from the same report, although showing a significant time difference between instructional conditions (198 seconds versus 164 seconds), did not show any difference in accuracy.
Although these studies appear to support the claim that speed leads to errors, the finding is not universal. In fact, some evidence from visual diagnosis research suggests the opposite. Norman et al12 showed that, for accurate diagnoses, time to diagnosis was negatively correlated with expertise (experts took 8 seconds for an accurate diagnosis, novices 12 seconds); conversely, when the diagnosis was incorrect, time and expertise were positively correlated. Similarly, Lehr and colleagues13 found that radiologists took 113 seconds to diagnose a case when they were correct but 147 seconds when they were wrong. No research has confirmed that these results from visual diagnosis studies generalize to more typical diagnostic tasks.
In the present study, we experimentally tested the relationship between speed and accuracy on a series of representative diagnostic cases. We presented junior physicians with a series of written cases on a computer and examined their time to solution and accuracy, both within cases and summarized across cases. Our specific interest was whether time to diagnosis was positively or inversely related to accuracy. We also examined the relationship between speed, accuracy, and performance on the Canadian national licensing examinations.
We recruited participants from the examinees sitting for the Canadian national performance-based licensing examination (the Medical Council of Canada Qualifying Examination Part II or MCCQE2) at test centers located within the Michael G. DeGroote School of Medicine, McMaster University; the Faculty of Medicine, University of Ottawa; and the McGill University Faculty of Medicine. The MCCQE2 is administered on the same day to examinees at test centers located within the 17 medical schools in Canada; to sit for the exam, the examinees must have passed the Medical Council of Canada Qualifying Examination Part I (MCCQE1) and have had at least one year of clinical postgraduate training. The examination has two administrations on that day, a morning and afternoon administration. Examinees who have completed the morning administration in eastern Canada are sequestered for approximately 1.5 hours until all examinees from across the country have started the afternoon session.
We approached sequestered examinees at the three schools (approximately 128 at McMaster, 64 at Ottawa, and 64 at McGill) at the start of sequestering and asked them to participate in our study. We offered candidates $30 to participate, and participation was on a first-come, first-served basis. We obtained informed consent from each participant, which included a statement that their results on our test would be matched to their results on the MCCQE1 and MCCQE2, but that all data would be anonymized. The Medical Council of Canada reviewed our data-sharing procedures and deemed that they met the council’s rules for privacy. The medical research ethics boards at all three universities granted our study ethical approval.
At all three sites, we escorted the sequestered examinees who had consented to participate to a computer laboratory for the duration of the study. We asked participants to read the study instructions on a computer screen and to view a practice case; we did not evaluate their work on this case but instead used it to orientate the participant to the computer program. The instructions read:
Imagine you’re in an emergency department, and there is a large backlog of patients. You are about to see a series of patients, and you have a limited time (30 minutes) to see them all. It is likely that you will not be able to complete all the cases, but work as quickly as you can without sacrificing accuracy. On each case, we will display a countdown clock which will show your elapsed time on each case. As there are a total of 25 cases, the clock will show 100% elapsed time at one minute. However the case will not finish until you press “Enter.” Some cases are harder than others. If you need more time to complete the case, use it. Accuracy is as important as speed; you get no points for wrong answers.
Once the participant read the instructions and completed the practice case, he or she proceeded through the test materials, described below.
Four board-certified academic emergency physicians, supervised by J.S., developed the test materials. Cases reflected the spectrum of acute, general medical problems, based on the medical expert objectives of the MCCQE2.14 Cases represented a range of difficulty, from straightforward presentations of common conditions to rare, complex, or atypical cases. Four content experts (senior board-certified internal and academic emergency medicine faculty who were experienced teachers and program directors) further reviewed each case. To ensure a range of difficulty, we pilot tested 40 cases using a convenience sample of 10 first- and third-year internal and emergency medicine residents. We eliminated cases that were either too easy or too difficult (diagnostic accuracy approaching 100% or 0%), leaving 25 cases for our study. Each case included a patient history, physical examination results, and pertinent lab findings, including one diagnostic test that required interpretation (chest X-ray, CT scan, ECG, or laboratory [microbiology, biochemistry] test). We took care to ensure that all cases, regardless of difficulty, were of approximately equal length (mean length of 220 words, standard deviation of 50). We chose the final number of cases deliberately to exceed the number that any participant could likely complete during the 30-minute test. We presented cases in a fixed sequence, on computer screens, using Run Time Revolution, version 2.8.1 (Edinburgh, Scotland), which displays verbal and visual information and captures responses and response times (RTs). This presentation allowed participants to access all of the relevant data at once, not in a fixed sequential process, thus allowing participants to rapidly read information they deemed essential.
After completion of the test, we asked participants to estimate the total number of cases that they had encountered in their medical training that corresponded to each of the diagnoses in the study.
We used a three-point scoring key for each case (2 for completely correct; 1 for partially correct; 0 for incorrect). We computed scores for each participant for each case using this key. One of us (J.S.), who was blind to the performance data and any identifying information, adjudicated any discrepancies. For each candidate, we then computed the RT in seconds, or the time to read and process data, for each case. Next, we calculated an overall accuracy score and the overall time based on the number of cases completed by each participant. We also had available total scores and subscores on the MCCQE1 (total, multiple-choice, clinical decision making)15 and MCCQE2 (total, communication, data gathering, problem solving).
We first analyzed data within each case. To address the skewed distributions common to RTs, when analyzing individual cases, we correlated the logarithm of RT with accuracy. We also computed the correlation between self-reported experience with each diagnosis and accuracy. We then summarized the data by computing, for each participant: (1) the number of cases completed, (2) the total score for completed cases, and (3) the total RT for completed cases. Our primary analysis examined the correlation between overall RT and accuracy. By focusing only on completed cases, we avoided inducing an artifact in which slower candidates would receive a score of 0 for cases that they did not complete.
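The within-case and per-participant computations described above can be sketched as follows. This is a minimal illustration rather than the analysis code used in the study (which was run in SPSS); the function names and the sample values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def log_rt_accuracy_correlation(rts, scores):
    """Correlate log(RT) with accuracy for one case across participants."""
    return pearson([math.log(t) for t in rts], scores)


def summarize_participant(completed):
    """completed: (score, rt_seconds) pairs for completed cases only.

    Returns (cases completed, total score, total RT). Cases that were
    never reached are omitted rather than scored 0.
    """
    n_cases = len(completed)
    total_score = sum(score for score, _ in completed)
    total_rt = sum(rt for _, rt in completed)
    return n_cases, total_score, total_rt


# Hypothetical data for a single case: five participants.
rts = [30.0, 45.0, 60.0, 90.0, 150.0]  # seconds
scores = [2, 2, 1, 1, 0]               # 0/1/2 scoring key
r = log_rt_accuracy_correlation(rts, scores)  # negative for these values
```

Taking the logarithm compresses the long right tail of the RT distribution, so that a few very slow responses do not dominate the correlation.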
Finally, we examined the reliability of the scores by conducting a repeated-measures ANOVA and computing Cronbach alpha. We assessed the validity of the scores by correlating overall accuracy and RT with MCCQE1 and MCCQE2 scores and associated subscores. We calculated Pearson correlations, raw and disattenuated, accounting for the reliability of all tests. We used SPSS (PASW) version 19 (Chicago, Illinois) for all analyses.
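For readers unfamiliar with Cronbach alpha, the standard formula can be sketched as below. The sketch assumes a complete participant-by-case score matrix with no missing values, which is a simplification (participants in the study completed different numbers of cases); the data are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """scores: one row per participant, one column per item (case).

    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(scores[0])  # number of items
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical 0/1/2 scores for four participants on three cases.
example = [
    [2, 1, 2],
    [1, 1, 0],
    [0, 0, 1],
    [2, 2, 2],
]
alpha = cronbach_alpha(example)
```

Alpha rises as the items covary more strongly relative to their individual variances, and it generally rises with the number of items, which is one reason a brief test tends to yield a modest value.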
A total of 95 examinees participated in the study—41 at McMaster, 19 at Ottawa, and 36 at McGill. The number of participants at each site was determined by the number of computer workstations available. See Table 1 for the demographics of these participants. Most of the participants (75/95; 79%) graduated from a Canadian medical school, and most (72/95; 76%) were entering their second year of postgraduate medical training. About one-third were in family medicine. All chose to complete the MCCQE2 in English (French was an option at other sites).
Because international medical graduates (IMGs) are more likely to speak English as a second language, they likely will take longer to read each case. We therefore conducted a specific analysis of Canadian medical graduates (CMGs) versus IMGs. IMGs had longer RTs (88 seconds per case versus 61 seconds per case, t = 5.56, P < .00001) and had lower scores (mean score of 43% versus 47%), although this difference was not significant. As a result of longer RTs, IMGs completed fewer cases (17.6 versus 23.3, t = 6.51, P < .0001). They also had significantly lower MCCQE1 scores (474 versus 522, t = 2.89, P < .01) and MCCQE2 scores (415 versus 536, t = 6.39, P < .0001) and had attempted both parts more often than CMGs (about 1.5 attempts on average versus just over 1.0).
Because IMGs had both longer RTs and lower levels of competence, by a variety of measures, and because they completed substantially fewer cases, the demonstration of a negative relationship between time and accuracy might be confounded by the comparison of CMGs with IMGs. For this reason, we decided to conduct our analysis only on the 75 (79%) participants who graduated from Canadian medical schools.
See Table 2 for accuracy scores and RTs for each case. Accuracy scores ranged from 0.08 to 0.85, with a mean of 0.488. Although this may appear to be a low average score, our scoring criteria were stringent; for example, a diagnosis of ST-segment elevation acute myocardial infarction received two points, whereas acute myocardial infarction only received one point. Average RT per case varied from 20 seconds to 101 seconds, with a mean of 56 seconds; the longest RT was associated with the most difficult case (as judged by overall cohort accuracy), sulfonylurea-induced hypoglycemia. What we consider the most important data, the correlations between accuracy and RT, are in the sixth column of Table 2. When analyzed case by case, the correlation in 23 of 25 (92%) cases was negative (P < .0001 by sign test); the average correlation was −0.214, and 11 of 25 cases (44%) were significant. Despite the subjectivity of the participants’ estimates of experience, 19 of 25 (76%) correlations between accuracy and self-reported experience were positive (P < .005 by sign test), and the mean correlation was 0.16. Thus, increased accuracy, on a case-by-case basis, was associated with greater self-reported experience and decreased RT.
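The sign test on the 23 negative correlations out of 25 can be reproduced with an exact binomial calculation. The sketch below assumes a two-sided test with no ties (no correlation exactly equal to zero); the text does not say whether the reported P value was one- or two-sided, and the function name is ours.

```python
from math import comb


def sign_test_p(n_negative: int, n_total: int) -> float:
    """Exact two-sided sign test under H0: each sign is equally likely."""
    k = max(n_negative, n_total - n_negative)
    # Upper-tail probability of observing k or more like signs.
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2 * tail)


p_value = sign_test_p(23, 25)
assert p_value < 0.0001  # consistent with the reported P < .0001
```

The test ignores the size of each correlation and asks only how often a split this lopsided would arise if negative and positive correlations were equally likely.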
We then confirmed the relationship between RT and accuracy by examining overall RT and accuracy for each participant. The correlation between RT and accuracy was −0.54 (P < .0001), as shown in Figure 1. We computed the correlation between RT per word and case difficulty to rule out the possibility that more difficult cases may include more words and thus require a longer RT. Whereas we found a small negative relationship between accuracy and case length (r = −0.132, not significant), the relationship between accuracy and RT per word (r = −0.55, P < .0001) was almost identical to the relationship between accuracy and overall RT. Thus, increased accuracy was strongly associated with shorter RT.
Next, we examined our test from a psychometric perspective. We found test reliability (alpha) of the accuracy scores to be 0.41. Although this alpha is low for a summative examination, our test was not developed to discriminate among residents and had an overall duration of only 30 minutes. See Table 3 for validation of the accuracy scores versus MCCQE1 and MCCQE2 scores. Correlation with the MCCQE1 total score was 0.40 (disattenuated = 0.65). Correlation with the MCCQE2 total score was 0.21 (disattenuated = 0.39) and with the MCCQE2 problem solving score was 0.31 (disattenuated = 0.71). We found no significant correlation with MCCQE2 communication and data gathering subscores. RT was correlated negatively with all MCCQE2 subscores, but those correlations were all smaller than 0.1 in magnitude and not statistically significant. More competent participants, then, as judged by both the written examination (MCCQE1) and the problem solving component of the MCCQE2, were significantly more accurate on our brief case-based test.
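Disattenuation here presumably refers to the classical correction for attenuation, in which an observed correlation is divided by the square root of the product of the two tests' reliabilities. The sketch below illustrates the formula; the reliability of 0.41 comes from the text, whereas the value of 0.92 for the MCCQE1 is a hypothetical placeholder, because the licensing examination reliabilities are not reported here.

```python
import math


def disattenuate(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Correct an observed correlation for unreliability in both measures."""
    return r_xy / math.sqrt(rel_x * rel_y)


r_observed = 0.40         # accuracy versus MCCQE1 total (from the text)
alpha_case_test = 0.41    # reliability of the accuracy scores (from the text)
assumed_rel_mccqe1 = 0.92  # HYPOTHETICAL value, for illustration only

r_corrected = disattenuate(r_observed, alpha_case_test, assumed_rel_mccqe1)
```

With these inputs the corrected value comes out close to the reported 0.65, but that agreement depends entirely on the assumed MCCQE1 reliability. The correction estimates what the correlation would be if both measures were perfectly reliable, so a low alpha on the brief case test inflates the corrected value substantially.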
Finally, to determine the relationship between RT and overall competence, we conducted a regression analysis, predicting a participant’s accuracy score from average RT and MCCQE1 overall score. Both predictors were highly significant, with beta weights of −0.528 and +0.378, respectively. Thus, the ability to rapidly arrive at a diagnosis and knowledge were independent predictors of diagnostic accuracy.
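For a regression with two predictors, the standardized beta weights follow directly from the three pairwise correlations. The sketch below shows that relationship; it is not the SPSS analysis itself. The correlation between RT and MCCQE1 score is not reported in the text, so the value used for it is a hypothetical placeholder, and the −0.54 used here is the reported correlation for overall RT rather than average RT.

```python
def standardized_betas(r_y1: float, r_y2: float, r_12: float):
    """Standardized weights for y regressed on two predictors.

    r_y1, r_y2: correlations of the outcome with predictors 1 and 2.
    r_12: correlation between the two predictors.
    """
    denom = 1 - r_12 ** 2
    beta_1 = (r_y1 - r_y2 * r_12) / denom
    beta_2 = (r_y2 - r_y1 * r_12) / denom
    return beta_1, beta_2


r_acc_rt = -0.54      # accuracy versus overall RT (from the text)
r_acc_mccqe1 = 0.40   # accuracy versus MCCQE1 total (from the text)
assumed_r_rt_mccqe1 = -0.03  # HYPOTHETICAL value, for illustration only

beta_rt, beta_mccqe1 = standardized_betas(
    r_acc_rt, r_acc_mccqe1, assumed_r_rt_mccqe1
)
```

When the two predictors are nearly uncorrelated, each beta weight stays close to the corresponding simple correlation, which is what independent prediction means in this context.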
To our knowledge, this study is the first attempt to link medical diagnostic performance on a representative sample of acute general medicine cases to the speed of reaching that diagnosis. Our results are unequivocal. In 23 of 25 cases, increased accuracy was consistently associated with decreased time and, hence, greater speed. Overall, the correlation between time and accuracy was −0.54. Importantly, speed of diagnosis predicted accuracy even when controlling for knowledge as measured by MCCQE scores. Thus, we found no support for the assertion that a longer time to diagnosis (typically associated with more deliberate or System 2 processing) results in fewer errors.
Our study has several limitations. Most of the participants were at the same intermediate level of expertise, mid-second-year residents. However, our sample of relatively junior clinicians should exhibit more deliberative processing than experts because they are less experienced and less knowledgeable than their more senior colleagues. Further, the homogeneity of our sample may lead to an underestimation of the correlation between accuracy and speed. Thus, sampling bias would, if anything, result in more conservative estimates of any associations. Finally, the task of diagnosing our cases was necessarily artificial, involving responses to written cases via a computer; however, our data showed a moderate relationship to well-validated licensing examination scores, and thus this artificiality should not affect our results significantly.
We directed test development at creating a series of cases that spanned the continuum of difficulty, with the expectation that we would see more evidence of rapid and accurate reasoning with easy cases; yet, in fact, all cases presented diagnostic difficulties for participants. Despite this, we still found that accuracy and speed were strongly associated.
Other studies have found that slowing down and reflecting lead to more accurate diagnoses. In one study,16 the researchers experimentally induced an availability bias by showing residents a series of prototypical cases. The residents were then asked to diagnose a series of new cases as quickly as possible; half resembled the prototype cases but had a different diagnosis, and half were unrelated new cases. Residents then reviewed the cases a second time and went through a rigorous procedure of writing down their chosen diagnoses and alternative diagnoses, listing features in favor of each and, finally, changing their minds if they felt it was warranted. After the first phase, residents showed an availability bias, amounting to scores that were 25% lower for similar cases. However, after the reflection phase, their scores improved by about 30%. In another study,17 the authors showed that a similar reflection manipulation increased accuracy with difficult cases; however, the findings indicated that reflection actually reduced accuracy with simpler cases. The results from these studies appear to show that System 2 reasoning is superior to that of System 1. However, both studies were engineered to induce a misleading availability bias by exposing participants to similar cases with different diagnoses. Another study by Mamede and colleagues11 showed a weak positive relationship between time and accuracy.
Taken together with the results of our study, these findings suggest that, with routine cases, rapid processing is both efficient and effective. However, when cases are more demanding, there may be value in more deliberative thinking. Speed is not an independent causal variable; more likely, both speed and accuracy reflect greater knowledge and experience with the specific problem. As educators, we should not encourage learners to speed up or to avoid any reflection. But, by the same token, we should not presume that all errors are a consequence of cognitive biases associated with rapid System 1 thinking, which we can ameliorate with instruction about cognitive biases and admonitions to slow down and be reflective.18 Instead, our results are consistent with the view that expert clinicians (and learners) both develop rapid and efficient heuristics to solve routine problems and strategies to slow down and increase vigilance as the situation demands. Some authors describe this approach as adaptive expertise.19,20
In conclusion, the results of our study demonstrate that the accuracy of a diagnosis is directly related to the speed of processing the information used to make that diagnosis. Our findings do not indicate that a rapid diagnosis (System 1) is prone to more errors than a slower, analytical diagnosis (System 2).
Acknowledgments: The authors acknowledge the support of the Medical Council of Canada in refining their proposal, collaborating on the coordination of this study with the MCCQE2 examination, and providing MCCQE1 and MCCQE2 scores.
Funding/Support: The authors acknowledge the financial support of the Medical Council of Canada. Additionally, Dr. Norman is funded by the Canadian Institutes for Health Research through a Canada research chair.
Other disclosures: None.
Ethical approval: The authors obtained ethical approval from the medical research ethics boards at the Michael G. DeGroote School of Medicine, McMaster University, the Faculty of Medicine, University of Ottawa, and the McGill University Faculty of Medicine.
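* When two independent outcomes, a and b, occur with probabilities 0 < Pa < 1 and 0 < Pb < 1, the probability that both occur, PaPb, is necessarily less than either alone: PaPb < Pa and PaPb < Pb. More generally, without assuming independence, the probability of the conjunction can never exceed that of either outcome.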
† Although the diagnostic error literature is almost unanimous in its adoption of a dual processing theory of reasoning, this theory has been criticized by some psychologists, including one of the authors (W.G.).10
1. Kahneman D. Maps of bounded rationality: A perspective on intuitive judgment and choice. In: Frangsmyr T, ed. Les Prix Nobel: The Nobel Prizes 2002. Stockholm, Sweden: The Nobel Foundation; 2003:449–489
2. Stanovich KE. Note on the interpretation of interactions in comparative research. Am J Ment Defic. 1977;81:394–396
3. Evans JS. Dual-processing accounts of reasoning, judgment, and social cognition. Annu Rev Psychol. 2008;59:255–278
4. Kahneman D, Tversky A. On the study of statistical intuitions. In: Kahneman D, Slovic P, Tversky A, eds. Judgment Under Uncertainty: Heuristics and Biases. New York, NY: Cambridge University Press; 1982
5. Redelmeier DA. Improving patient care. The cognitive psychology of missed diagnoses. Ann Intern Med. 2005;142:115–120
6. Croskerry P. The importance of cognitive errors in diagnosis and strategies to minimize them. Acad Med. 2003;78:775–780
7. Elstein AS, Schwartz A. Clinical problem solving and diagnostic decision making: Selective review of the cognitive literature. BMJ. 2002;324:729–732
8. Klein G. Five pitfalls in decisions about diagnosis and prescribing. BMJ. 2005;330:781–784
9. Graber ML, Franklin N, Gordon R. Diagnostic error in internal medicine. Arch Intern Med. 2005;165:1493–1499
10. Marewski JN, Gaissmaier W, Gigerenzer G. We favor formal models of heuristics rather than lists of loose dichotomies: A reply to Evans and Over. Cogn Process. 2010;11:177–179
11. Mamede S, Schmidt HG, Penaforte JC. Effects of reflective practice on the accuracy of medical diagnoses. Med Educ. 2008;42:468–475
12. Norman GR, Rosenthal D, Brooks LR, Allen SW, Muzzin LJ. The development of expertise in dermatology. Arch Dermatol. 1989;125:1063–1068
13. Lehr JL, Lodwick GS, Farrell C, Braaten MO, Virtama P, Kolvisto EL. Direct measurement of the effect of film miniaturization on diagnostic accuracy. Radiology. 1976;118:257–263
14. Medical Council of Canada. Objectives for the Qualifying Examination. http://www.mcc.ca/en/exams/objectives. Accessed February 16, 2012.
15. Page G, Bordage G, Allen T. Developing key-feature problems and examinations to assess clinical decision-making skills. Acad Med. 1995;70:194–201
16. Mamede S, van Gog T, van den Berge K, et al. Effect of availability bias and reflective reasoning on diagnostic accuracy among internal medicine residents. JAMA. 2010;304:1198–1203
17. Mamede S, Schmidt HG, Rikers RM, Custers EJ, Splinter TA, van Saase JL. Conscious thought beats deliberation without attention in diagnostic decision-making: At least when you are an expert. Psychol Res. 2010;74:586–592
18. Sherbino J, Dore KL, Siu E, Norman GR. The effectiveness of cognitive forcing strategies to decrease diagnostic error: An exploratory study. Teach Learn Med. 2011;23:78–84
19. Mylopoulos M, Woods NN. Having our cake and eating it too: Seeking the best of both worlds in expertise research. Med Educ. 2009;43:406–413
20. Moulton CA, Regehr G, Lingard L, Merritt C, MacRae H. Slowing down to stay out of trouble in the operating room: Remaining attentive in automaticity. Acad Med. 2010;85:1571–1577