When patients visit their doctors with a health problem and follow their prescriptions to recover, they do so on the tacit assumption that the diagnosis and advice are solely based on their symptoms and complaints and on the doctor’s knowledge of their disease. They might, therefore, be more than surprised to discover that the doctor’s diagnostic judgment could have been influenced by something the doctor read in a newspaper or on the Internet on the morning of their visit, rather than resting solely on the doctor’s assessment of their particular health condition.
If a doctor’s diagnosis is influenced by information from an unrelated source that leads to an incorrect diagnosis, the doctor is said to be the victim of a cognitive error called availability bias, defined by Croskerry1 as
the disposition to judge things as being more likely, or frequently occurring, if they readily come to mind. Thus, recent experience with a disease might inflate the likelihood of it being diagnosed.
The medical literature is replete with suggestions that cognitive biases play an important role in errors of judgment in medical diagnosis.1–4 Graber et al,5 for instance, maintain that cognitive bias is present in almost 75% of the diagnostic errors in internal medicine, contributing to a large fraction of the as many as 98,000 avoidable deaths estimated to derive from medical errors in the United States every year.6
However, there is surprisingly little direct evidence that cognitive bias actually does play a role in medical diagnostic error. The evidence is either indirect or anecdotal. An example of the former is the following: During outbreaks of West Nile fever in Israel and the United States, an association was demonstrated between the number of weekly reports in the mass media about the disease and the number of laboratory samples submitted to test for West Nile virus.7 The authors suggest that this increase in the number of lab tests ordered was due to an availability bias of the doctors concerned, caused by information in the media. However, we argue that a doctor ordering such a test is not necessarily biased. If West Nile fever is around, that slightly raises the chance that the particular patient in front of you, with symptoms resembling those of that disease, actually has West Nile fever. Testing for the West Nile virus would then not be a manifestation of bias but simply an act of caution. In addition, it is not certain that the doctors involved really believed that their patients had the disease; it may have been their patients who were influenced by the media coverage and believed that they might have the disease. These interpretational complexities are inevitable because that study and others were observational ones.8 An experimental approach might clarify such doubts about causality. However, experiments demonstrating bias in doctors as a result of being exposed to media-distributed information are, as far as we know, lacking.
We undertook the present study, therefore, to investigate whether exposure to media-distributed disease information would bias doctors into using that information in an unrelated context, leading to diagnostic mistakes.
The second purpose of our study was to investigate whether an antidote for such cognitive bias exists. In previous studies,9,10 we demonstrated that being encouraged to reflect on an earlier diagnosis (which could have been right or wrong) produced quite extensive improvements in diagnostic accuracy with complex cases. But would reflection also help when doctors are led to believe that they have solved a simple, straightforward case?
To study these issues, we conducted a three-phase experiment with residents in internal medicine:
* Phase 1 aimed at exposing these physicians to disease information reported by the media. Half of the doctors were asked to evaluate the accuracy of the Wikipedia entry for Legionnaires’ disease. The other half were asked to evaluate the accuracy of the Wikipedia entry for Q fever.
* In Phase 2, six hours later, as part of a seemingly unrelated study, all these physicians were requested to diagnose eight clinical cases. All cases had a diagnosis different from the diseases seen in Phase 1, but two of them had signs and symptoms similar to those of Legionnaires’ disease, and two had signs and symptoms that resembled those of Q fever. We predicted that if the Phase 1 session had produced an availability bias, the participants would tend to misdiagnose these similar-looking cases more often as either Legionnaires’ disease or Q fever.
* In Phase 3, participants were asked to diagnose again the two cases from Phase 2 that could have been affected by an availability bias, by following a procedure for structured reflection on the case (see below). We predicted that this procedure would override the bias and improve the initial diagnoses.
The 38 participants of the study were residents in training to become specialists in internal medicine. These physicians were from four teaching hospitals associated with the medical schools of three universities in the Netherlands: Erasmus University Rotterdam, Radboud University, and Maastricht University. The study was conducted in May and June 2010. All participants were invited to volunteer for the study through their residency program directors, who arranged for invitational letters to be handed out during routine meetings. Participants received book vouchers in return.
Materials and procedure
The experiment consisted of three distinct phases. In Phase 1 of the experiment, participants were requested to help in judging the accuracy of information about diseases as presented on the Internet. They were told that, more and more often, patients would consult sites such as Wikipedia, a popular online encyclopedia, to check what kind of disease they might have. It is therefore important, they were told, that the information relayed via this medium be accurate and comprehensive. To that end, they were requested to evaluate the quality of information that laypersons would encounter in a Wikipedia entry about a particular disease. Eighteen of the participants received a copy of the entry concerning Legionnaires’ disease; 20 received a copy of the entry concerning Q fever. All participants were requested to underline the accurate statements about epidemiology, transmission, symptoms, and therapy encountered in the text. Subsequently, they judged the accuracy, comprehensiveness, and clarity of the information by attributing a score, on a five-point scale, to the Wikipedia entry they had been assigned. The participants were randomly assigned to judge either the Legionnaires’ disease11 or Q fever12 entry in Wikipedia. In this way the two groups could act as each other’s control.
After completing this task, they were thanked for their contribution. They then returned to their daily clinical work, seeing patients with diverse problems.
In Phase 2, taking place about six hours later, the same participants were asked to diagnose eight clinical cases. Great care was taken to ensure that this phase appeared as an unrelated study. Invitational letters came from a different institution, the materials had different letterheads and a different font type, the experimenter was a different person, and the task (diagnosis) was seemingly unrelated to the one presented in Phase 1 (information accuracy judgment). They were informed that the cases to be presented were to be used in another study but that a check was needed to ascertain that the assumed diagnoses were in fact correct. Their help was needed in this respect. They were asked to read each of the cases and provide a diagnosis.
Each case consisted of a written description of a patient’s medical history, signs and symptoms, and test results. (See an example of a case in Box 1.) Two experts in internal medicine (one of us [J.v.S.] and a colleague), both board-certified internists with more than 15 years of experience in clinical practice and teaching in internal medicine, developed the cases. One of these experts (J.v.S.) prepared the cases, in collaboration with another one of us (K.v.B.), on the basis of our experience with real patients. The correct diagnosis was confirmed by presenting each case to the second expert (the colleague), who was not involved in the study and not aware of the study hypothesis or the Wikipedia cases.
Four of the eight cases were neutral to the purpose of the experiment (so-called “filler” cases), and four were test cases. Two of those four were descriptions of diseases that had clinical manifestations similar to those frequently encountered in patients with Legionnaires’ disease, and two were diseases similar to Q fever, although all cases had diagnoses different from those of Legionnaires’ disease or Q fever. For example, a patient with viral respiratory infection may present with signs and symptoms similar to Q fever. See List 1 for the names of the cases.
The cases were presented in a booklet in random sequence. All participants diagnosed the same eight cases in Phase 2 by following instructions to read the case and write down the most likely diagnosis for the case, trying to be as fast as possible but without compromising accuracy.
In Phase 3, each resident received four of the eight cases diagnosed in Phase 2: two filler cases and the two cases related to the Wikipedia entry (evaluated in Phase 1) that he or she had diagnosed in Phase 2. Those Phase 2 diagnoses could have been influenced by an availability bias from the earlier scrutiny of the Wikipedia entry. The cases were presented randomly in a booklet. Participants were told that we would like them to revisit some of the cases presented earlier, and they were told that, because of time constraints, they would only have to review four of the eight cases. They were then requested to follow a set of procedures intended to induce detailed processing of each sign and symptom. They were asked
to read the case again,
to write down the diagnosis they initially gave for the case in Phase 2,
to list the signs and symptoms in the case description that support this initial diagnosis,
to list the signs and symptoms that speak against this initial diagnosis, and
to list signs and symptoms that would be expected to be present if this initial diagnosis were true but that were not described in the case.
Participants were subsequently asked
to list alternative diagnoses if they felt their initial diagnosis to be incorrect, and
to follow the same procedures described above for each alternative diagnosis considered for the case.
On the basis of this analysis, participants were finally requested
to rank the diagnoses in order of likelihood and select their final diagnosis for the case.
This extended procedure has been shown to elicit analytical processing of a case, as opposed to eliciting predominantly nonanalytical processing.9,10
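The structured reflection steps listed above can be pictured as a simple record that a participant fills in for each case. The following is only an illustrative sketch: the class names, field names, and example entries are hypothetical and are not part of the study materials. The example entries loosely draw on the case in Box 1 and are invented for illustration, not taken from participant data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hypothesis:
    """One diagnosis with the three lists requested in the procedure."""
    diagnosis: str
    supporting: List[str] = field(default_factory=list)           # findings that support it
    contradicting: List[str] = field(default_factory=list)        # findings that speak against it
    expected_but_absent: List[str] = field(default_factory=list)  # expected findings not in the case


@dataclass
class ReflectionRecord:
    """Hypothetical container for one participant's reflection on one case."""
    initial: Hypothesis
    alternatives: List[Hypothesis] = field(default_factory=list)
    ranking: List[str] = field(default_factory=list)  # diagnoses, most likely first

    def final_diagnosis(self) -> str:
        # The top-ranked diagnosis is the final one; if no ranking was
        # recorded, fall back to the initial diagnosis.
        return self.ranking[0] if self.ranking else self.initial.diagnosis


# Invented example entries (assumption, for illustration only).
record = ReflectionRecord(
    initial=Hypothesis(
        diagnosis="Q fever",
        supporting=["fever", "works in an abattoir"],
        contradicting=["diastolic murmur", "red stripes under the nails"],
    ),
    alternatives=[
        Hypothesis(
            diagnosis="acute bacterial endocarditis",
            supporting=["intravenous drug use", "diastolic murmur", "fever"],
        )
    ],
    ranking=["acute bacterial endocarditis", "Q fever"],
)
print(record.final_diagnosis())
```

Keeping the initial diagnosis and each alternative in the same structure mirrors the instruction to repeat the same three lists for every alternative considered.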
On the basis of the average time required for diagnosing the cases through nonanalytical and analytical reasoning found in previous studies,9,10 10 minutes were allocated for all the cases in Phase 2 and 30 minutes for Phase 3.
Afterwards, participants were asked to formulate for themselves what they thought was the purpose of the study. None of them showed awareness of the experimental manipulation. Subsequently, the participants were debriefed. At debriefing, they received written information on the study purposes and theoretical background.
The cases were based on real patients and had a confirmed diagnosis, which was used as the criterion for the evaluation of participants’ responses. Two experts in internal medicine (J.v.S. and a colleague) independently assessed the diagnoses provided by the participants, without being aware of the condition under which the diagnoses had been formulated. The diagnoses were judged correct, partially correct, or incorrect, scored respectively 1, 0.5, or 0. A diagnosis was considered correct whenever the participant cited the core diagnosis (for instance, “endocarditis” in the example presented in Box 1). When the core diagnosis was not mentioned but a constituent element of the diagnosis was cited, the diagnosis was judged as partially correct (e.g., “bacterial infection” in the case presented in Box 1). A diagnosis that did not fall into one of these categories was considered incorrect. The experts agreed on 88% of the diagnoses, and disagreements were resolved by discussion.
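As a minimal sketch of the scoring scheme just described, the snippet below maps the three judgment categories to the scores 1, 0.5, and 0 and computes simple percent agreement between two raters. The function names and the sample judgments are hypothetical, and the text does not state how the 88% agreement figure was computed; simple percent agreement is assumed here.

```python
from statistics import mean

# Score values as given in the text: correct = 1, partially correct = 0.5, incorrect = 0.
SCORES = {"correct": 1.0, "partial": 0.5, "incorrect": 0.0}


def mean_accuracy(judgments):
    """Mean diagnostic accuracy score over a list of judgment labels."""
    return mean(SCORES[j] for j in judgments)


def percent_agreement(rater_a, rater_b):
    """Fraction of diagnoses on which two raters gave the same judgment."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must judge the same non-empty set of diagnoses")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


# Invented judgments for four diagnoses (not study data).
rater_a = ["correct", "partial", "incorrect", "correct"]
rater_b = ["correct", "partial", "correct", "correct"]

agreement = percent_agreement(rater_a, rater_b)  # raters differ on one of four
accuracy_a = mean_accuracy(rater_a)              # (1 + 0.5 + 0 + 1) / 4
print(agreement, accuracy_a)
```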
Only residents who participated in all three phases were included in the analysis. For each participant, we computed the mean diagnostic accuracy scores obtained in Phase 2 for the two test cases that had similarities to the disease that the participant had encountered while evaluating the Wikipedia entry in Phase 1 and for the two cases that did not (the four filler cases were not used in the analysis). For Phase 3, we computed the mean diagnostic scores of the two cases whose diagnoses could have been biased in Phase 2. Paired t tests were performed to compare the diagnostic accuracy under the two conditions in Phase 2 and to check for differences between the diagnostic performance in Phase 2 (nonanalytical reasoning) and Phase 3 (reflective reasoning). To check whether the cases seen in Phase 2 had in fact been confounded with the diseases encountered in Phase 1, we computed the number of diagnoses of Q fever or Legionnaires’ disease mistakenly cited by those participants who had been previously primed for the disease and by those participants who had not encountered information on the disease before. A paired t test was performed to compare the number of diagnoses of the two diseases under the two conditions.
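For readers less familiar with the paired t test, the sketch below shows how the test statistic is obtained from per-participant scores under two conditions. It is an illustration under stated assumptions, not the analysis code used in the study: the function name and the four pairs of scores are invented, and only the t statistic and degrees of freedom are computed (a P value would additionally require the t distribution).

```python
import math
from statistics import mean, stdev


def paired_t(x, y):
    """Return (t, df) for a paired t test on two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two pairs")
    diffs = [a - b for a, b in zip(x, y)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("t is undefined when all differences are identical")
    n = len(diffs)
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Invented mean accuracy scores for four participants (not study data):
# cases unrelated to the Wikipedia entry read vs. cases similar to it.
not_exposed = [1.0, 0.5, 1.0, 0.75]
exposed = [0.5, 0.5, 0.5, 0.25]

t, df = paired_t(not_exposed, exposed)
print(t, df)
```

With n participants the test has n - 1 degrees of freedom, so a sample of 38 residents corresponds to tests reported as t(37).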
The research ethics committee of the Department of Psychology, Erasmus University Rotterdam, approved the study.
All 38 residents completed all three phases of the study. The mean age of the participants was 29.0 years (SD = 2.25 years); 15 were men and 23 were women; they had on average 4.5 years (SD = 2.5 years) of clinical practice.
Table 1 presents the mean diagnostic accuracy scores obtained when the physicians had been previously exposed to the Wikipedia information on a disease similar to the to-be-diagnosed disease and when they had not.
When the physicians diagnosed the cases after having read, six hours before, the Wikipedia entry about a disease similar to the one described in the cases to be diagnosed, their diagnostic performance decreased. A t test showed this difference to be statistically significant, t(37) = 2.52, P = .016. The overall decrease in diagnostic accuracy was caused by a larger number of incorrect Q fever or Legionnaires’ disease diagnoses, in line with previous exposure, as shown in Table 2.
Participants misdiagnosed a case that looked similar to the disease described on Wikipedia significantly more frequently when they had read the Wikipedia information on this disease, t(37) = 3.14, P = .003. Subsequent reflection, however (in Phase 3), significantly improved diagnostic accuracy of the cases affected by previous exposure to biasing information, t(37) = 2.90, P = .006. The improvement derived from repair of initially incorrect diagnoses: After reflecting on the subject-to-bias cases, the number of mistaken diagnoses of the Wikipedia disease that had been attributed to similar-looking cases significantly decreased, t(37) = 2.30, P = .027 (Table 2). It is interesting to note that reflection increased performance to the same level as that of performance on the test cases that were not subject to bias: mean = 0.71 versus mean = 0.70, respectively.
These findings strongly suggest that an availability bias may emerge from exposure to disease information in the media, inducing physicians to use this information while diagnosing patients’ problems in another context and, thus, to make more diagnostic errors than would otherwise have occurred. In the present study, these errors were shown to be a result of availability bias because most of the wrong diagnoses could be attributed to the Wikipedia entry examined earlier in the day. If the mistakes had originated from factors other than availability bias, the incorrect diagnoses would not necessarily have been of the same disease the residents had read about in the morning. Being later encouraged to reflect on the initial diagnosis counteracted the bias and restored diagnostic accuracy.
The size of the bias was substantial. Simply reading about the diseases on the Internet increased by almost 100% the number of cases mistakenly diagnosed as one of those diseases. Studies on the availability heuristic among naïve participants, for example, with frequency-of-occurrence judgments,13 self-judgments of assertiveness,14 or vulnerability to heart disease,15 have shown much smaller effects. More surprising was that the effect emerged from a task carried out six hours earlier in a context entirely different from the diagnostic task. During the intervening time, physicians were engaged in their routine clinical duties, encountering several patients with a diversity of problems. In addition, care was taken to prevent the physicians from thinking that the morning task was in any way related to the afternoon task. Most studies on the availability heuristic requested participants to make judgments about a particular event immediately after performing a task that was designed to make that event easily retrievable from memory.13–16
Finally, whereas the participants in most studies on the availability heuristic have been naïve (e.g., undergraduate students),13–17 the present study showed bias to also occur among resident physicians with a mean of 4.5 years of experience. There is some experimental evidence that doctors indeed can be influenced by characteristics of patients seen previously. Norman and his associates,18,19 for instance, were able to show that details of a case previously seen would influence diagnosis of a subsequent, similar-looking case. See also a study by Mamede et al.20 In these studies, however, such effects emerged in the context of previous experiences with similar patients in a diagnostic situation, not as a result of exposure to information available in the mass media.
The present study has some limitations. First, it should be emphasized that although the effect of availability was substantial, this cannot be interpreted as a general estimate of the magnitude of availability bias. The cases were specifically designed to contain cues that could also be encountered in patients with the Wikipedia disease, a sine qua non because availability bias would not be expected if the clinical presentations were completely unrelated to the previously encountered information. Moreover, after having made a wrong initial hypothesis triggered by the availability bias, other biases such as confirmation bias and premature closure may have come into action to hinder repair. Any one of these other biases, however, would apply to any diagnosis and not only to the one Wikipedia disease encountered earlier.
Second, the cases were written ones, which may be seen as limiting the generalization of findings to real settings; however, recent evidence suggests that written cases may well have an effect on clinical reasoning that is equivalent to the effect of higher-fidelity simulations.21
Third, the attribution of a 0.5 score to a participant’s response is a judgment call and may have influenced the results.
Finally, some degree of analysis might have happened when participants diagnosed the cases in Phase 2. Any diagnostic decision involves some degree of both nonanalytical and analytical reasoning, although one reasoning mode may prevail. Even if participants engaged in some analysis while reading the cases, pattern-recognition-based reasoning tends to predominate when participants are instructed to be fast and time is short.22,23
Why is it that physicians are so sensitive to information that is seemingly irrelevant to the task at hand? One can only speculate about the answer because the cognitive mechanism underlying availability bias is still subject to discussion.24 Because availability bias is universal in humans, it must have had adaptive advantages in the past.16,17,24 We can reasonably argue that, in our ancestral environment, most experiences with the world were personal, and, because that environment must have been fairly stable over long stretches of time, responses to these personal experiences were highly predictive for the most appropriate response to later, similar experiences. Therefore, the more easily these earlier experiences (together with one’s successful responses to them) could be accessed from memory, the greater the advantage when dealing with a new but similar event.
Today, however, many of our experiences with the world are no longer personal because they are experienced through the media. If one sees on television riots in the streets of Cairo, these riots have no personal consequences (unless one is actually living in Cairo). Our memory, however, is extremely poor in distinguishing between events that were personally experienced and events that were experienced vicariously but could have been personally experienced; it simply was not built for that task.25 Therefore, our memory treats recent vicarious events in much the same way as it would treat events actually experienced by us: It makes them preferentially available for use when something similar is encountered. Because the doctors in our study had encountered the criterion disease in Wikipedia in the morning, its attributes had become easily accessible in memory for use when the need would arise. The opportunity to use it came in the afternoon and led to large numbers of judgmental errors. Clearly, this strongly suggests that exposure to information picked up in the media can lead to mistakes even among fairly experienced professionals.
If the effect of availability bias is so substantial and so easily produced among doctors, one must fear for the consequences to society. Our findings seem to lend credibility to the claims that cognitive errors account for a large proportion of all medical mistakes and are directly implicated in adverse outcomes and deaths.1–5 The problem seems to be aggravated by the fact that physicians tend to rely on pattern recognition; that is, they diagnose routine problems in a predominantly automatic way, through recognition of similarities between the case at hand and (prototypical) examples of previous patients stored in memory.26,27 This mode of reasoning is fast, requires minimal effort, and is usually efficient in routine situations. Nevertheless, as it occurs largely without conscious control, it would increase proneness to cognitive biases and, consequently, to errors.28,29 Studies of decision making in other domains22 and, more recently, also in medicine,20 have, however, suggested that reflective reasoning can repair immediate erroneous judgments based on automatic reasoning.
Our findings are consistent with this suggestion: The experimentally induced bias provoked mistakes when physicians initially diagnosed the cases through nonanalytical reasoning, but this bias was counteracted by reflection. A more reflective reasoning process apparently not only improves the quality of physicians’ diagnoses when problems are complex9,10 but may also prevent them from falling prey to the availability bias. This finding suggests that physicians and students might need to be aware of the potentially adverse influence of media-distributed information on their diagnostic reasoning.
The study used a procedure for reflective reasoning that improved diagnostic accuracy, counteracting the bias. The procedure can be relatively easily used in educational settings. Whether the procedure—in its full format or in a simplified version—can also be used in clinical settings to improve diagnoses is still a question that requires further investigation.
In summary, to our knowledge, the present study provides the first experimental evidence that availability bias may arise in medical diagnosis simply as a result of exposure to media-provided information about a disease. The effect of the bias was shown to be substantial, reaffirming that cognitive errors may be implicated in many medical mistakes and their adverse effects for patients. The bias seems to be associated with nonanalytical reasoning and could therefore be counteracted by reflection.
Acknowledgments: The authors are grateful to the residents who dedicated their scarce time to participate in the study, and to Prof. Dr. Paul van Daele for his collaboration in the case development and the analysis of participants’ responses. The authors also thank Prof. Dr. Geoff Norman for his valuable suggestions for the revision of the manuscript.
Box 1 Example of a Clinical Case Used in Phase 2 of the Present Study*
A 41-year-old man presents at the emergency department because of mild dyspnea. He reports fever and moderate weight loss; he is a bit confused. The patient has used illicit intravenous drugs for several years. He is single and works as a butcher in an abattoir.
Blood pressure, heart rate and breathing: normal.
Heart: normal first and second sounds, diastolic murmur grade 3/6 with punctum maximum located at the second intercostal space on the right side. Lungs: subtle bilateral basal crepitus. Abdomen: enlarged liver and spleen. Extremities: small, red stripes under the nails on several fingers.
Erythrocyte sedimentation rate 100 mm/h; hematocrit 40%; leucocytes 15 × 10⁹/mm³ with 8% rods in the differential; mildly elevated liver enzymes. Electrolytes and creatinine levels are normal.
*Thirty-eight internal medicine residents read a Wikipedia entry about one of two diseases (Phase 1). Six hours later, in a seemingly unrelated study, they used nonanalytical reasoning to diagnose eight clinical cases (Phase 2). See List 1 for the list of cases; see the Method section of this report for how the cases were used. (The correct diagnosis for this case is acute bacterial endocarditis.)