Examining the Diagnostic Justification Abilities of Fourth-Year Medical Students

Williams, Reed G. PhD; Klamen, Debra L. MD, MHPE

doi: 10.1097/ACM.0b013e31825cfcff
Diagnostic Reasoning

Purpose: Fostering ability to organize and use medical knowledge to guide data collection, make diagnostic decisions, and defend those decisions is at the heart of medical training. However, these abilities are not systematically examined prior to graduation. This study examined diagnostic justification (DXJ) ability of medical students shortly before graduation.

Method: All senior medical students in the Classes of 2011 (n = 67) and 2012 (n = 70) at Southern Illinois University were required to take and pass a 14-case, standardized patient examination prior to graduation. For nine cases, students were required to write a free-text response indicating how they used patient data to move from their differential to their final diagnosis. Two physicians graded each DXJ response. DXJ scores were compared with traditional standardized patient examination (SCCX) scores.

Results: The average intraclass correlation between raters’ rankings of DXJ responses was 0.75 and 0.64 for the Classes of 2011 and 2012, respectively. Student DXJ scores were consistent across the nine cases. Using SCCX and DXJ scores led to the same pass–fail decision in a majority of cases. However, there were many cases where discrepancies occurred. In a majority of those cases, students would fail using the DXJ score but pass using the SCCX score. Common DXJ errors are described.

Conclusions: Commonly used standardized patient examination component scores (history/physical examination checklist score, findings, differential diagnosis, diagnosis) are not direct, comprehensive measures of DXJ ability. Critical deficiencies in DXJ abilities may thus go undiscovered.

Dr. Williams is J. Roland Folse Professor of Surgical Education Emeritus, Department of Surgery, Southern Illinois University School of Medicine, Springfield, Illinois.

Dr. Klamen is associate dean for education and curriculum, and professor and chair, Department of Medical Education, Southern Illinois University School of Medicine, Springfield, Illinois.

Correspondence should be addressed to Dr. Williams, Department of Surgery, Southern Illinois University School of Medicine, 800 North Rutledge St., PO Box 19638, Springfield, IL 62794-9638; telephone: (217) 545-0529; fax: (217) 545-1793; e-mail:

One of the core abilities of physicians is the ability to organize and use their knowledge to collect pertinent data about patients, to generate an appropriate list of likely diagnoses given the chief complaint of the patient, and to make good diagnostic decisions based on the available data. In this report, we will refer to this constellation of abilities as diagnostic justification (DXJ) ability. Given the importance of this ability for physicians, developing DXJ abilities in medical students is a primary goal of medical school training.

Most of the studies on DXJ abilities of medical students1–3 have been conducted with a small subset of medical students or with one class in which medical students reasoned through one or two cases. There are few studies of DXJ ability once medical students leave medical school. Most studies that identify performance problems after medical school4 focus on professional behavior rather than clinical reasoning problems. Those that investigate clinical reasoning problems of physicians5 are focused on the nature of clinical reasoning problems rather than on the performance of the population of physicians as a whole.

We would argue that the clinical reasoning abilities of entire classes of medical students are not systematically and uniformly examined during the clinical years of training. Most commonly, faculty members evaluate diagnostic reasoning ability using secondary evidence such as oral case presentations or patient notes.6 There are several problems with this method: (1) oral case presentations or patient notes represent the collective thinking of the medical team rather than the independent abilities of a single medical student, (2) only a few medical students are asked to give oral case presentations, and the process of selecting students to do this is not systematic, (3) students volunteer to present only when they are confident of their knowledge, thus potentially leading to unduly positive generalizations about clinical reasoning ability because they are based on this sample of volunteered presentations, and (4) many students may never volunteer to present and may not be called on by faculty members to present if they do not volunteer. Finally, students are not systematically reviewed across a range of cases but, instead, only on the cases they encounter on a given clinical rotation. We believe that medical schools rarely require students to perform totally independently on a systematic set of cases, and, as a result, the schools are seldom in a position to assess the diagnostic reasoning ability of entire classes of medical students.

The purpose of this study was to directly measure the DXJ ability of all senior medical students in one medical school using a free-response written measure of DXJ ability. This measure directly probed medical students’ DXJ ability following completion of individual standardized patient cases. Our DXJ measure is based on current conceptualizations of clinical reasoning ability that come from clinical reasoning research, research in psychology on “dual-process” models of thinking,7 and research on diagnostic errors made by practicing physicians.5,8,9 Specifically, we were interested in answering three questions:

  • What is the level of DXJ ability of medical students shortly before graduation from medical school?
  • Are traditional standardized patient examination scores a good proxy measure for DXJ ability (i.e., is the DXJ score redundant with other standardized patient examination scores and, thus, not worth the time and effort required to administer and score it)?
  • Do traditional senior clinical comprehensive examination (SCCX) and DXJ performance measures fail the same subset of students when using the same passing standard?

Method

During the 2009–2010 and 2010–2011 academic years, all fourth-year medical students (n = 67 and n = 70, respectively) who had completed all required clerkships at Southern Illinois University School of Medicine were required to take and pass a 14-case, standardized patient examination as a graduation requirement. These students served as the participants in this study. This examination occurred during the first part of the students’ fourth year of undergraduate medical training.

Test materials

Clinical comprehensive examination. The 14 cases on the standardized patient examination, known as the senior clinical comprehensive examination (SCCX), covered common chief complaints and common diagnoses underlying each chief complaint. Students were required to interview and examine standardized patients. For nine cases, after each patient encounter, students were required to type a list of pertinent positive and negative findings (free-response item), type in their differential diagnosis (free-response item), order laboratory investigations as needed from a computerized list of options, interpret laboratory investigation results, and type in their final diagnosis, DXJ, and initial management plan. Students were allowed 45 minutes to complete the postencounter portion of the case, including the DXJ item, which is described below. For four additional cases, these postencounter abilities were measured by having students write a SOAP (Subjective, Objective, Assessment, Plan) note in the manner used by the United States Medical Licensing Examination on the Step 2 Clinical Skills examination.10–12 Those cases were excluded from this study because, in the SCCX, students were already being asked to write an assessment of the patient using a different scoring scheme. We felt that asking them to do a DXJ also would be redundant. The remaining case focused primarily on management and also was excluded from the analyses for this study.

DXJ item. For the cases that are the focus of this study, students were required to provide a written justification of their diagnosis. The specific task presented to students was as follows: “Please explain your thought processes in getting to your final diagnosis; how you used the data you collected from the patient and from laboratory work to move from your initial differential diagnoses to your final diagnosis. Be thorough in listing your key findings (both pertinent positives and negatives) and explaining how they influenced your thinking.” This task called for a typed free-response answer, which students generally wrote in paragraph form. The intent of this task was to require students to reveal their thoughts for inspection and, thus, to provide graders and others with a better indication of students’ ability to use their skills in data collection, medical knowledge, and data interpretation in the service of diagnosing patients’ problems.

Grading of diagnosis justification responses. Two physician raters, from a group of 15 raters across the two years, independently graded each student’s DXJ response for each case, aided by a grading protocol. The grading protocol used for the Class of 2011 is provided in Appendix 1. The protocol was simplified somewhat for grading Class of 2012 responses, based on the experience gained in the previous year. Likewise, the weighting of the DXJ response was increased from 10% to 20% for the Class of 2012. The grading protocol and especially the checklist items used to characterize responses that were judged as borderline or poor were based on research findings regarding diagnostic errors made by practicing physicians5,7,9,13 and similar errors observed in other professions.14,15

Appendix 1

One of us (D.K.) trained the raters and served as one rater for all student DXJ responses for all cases. Training of physician raters involved showing them the rating form and how to use it, and giving them instructions to read all student responses and initially sort them into four piles: “excellent responses,” “competent responses,” “borderline responses,” and “poor responses.”

For training purposes, the physician raters were then given three examples of student responses to grade (a poor performance, an excellent performance, and an average performance), and their ratings were compared with those of the trainer. Questions about the use of the form were answered at that time as well.

Raters were blinded as to the name and, thus, the performance history of the students who prepared the responses. Because responses were typed, score variation due to clarity of writing was also minimized. One physician graded all the responses for all of the cases. An additional rater graded all student responses for single cases. The ratings for each item on the DXJ rating form were summed to create a DXJ score for the case. The DXJ score used for this study was the average DXJ rating by the two raters.
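As a minimal sketch of the scoring arithmetic just described, the Python function below sums each rater's item ratings, averages the two sums, and expresses the result as a percent of the points available. The function name, the example ratings, and `max_points` are illustrative assumptions; the actual items and point values are those of the grading protocol in Appendix 1.

```python
def dxj_case_score(rater_a_items, rater_b_items, max_points):
    """Return a case-level DXJ percent score from two raters' item ratings.

    max_points is a hypothetical maximum raw score for the rating form.
    """
    total_a = sum(rater_a_items)  # rater A's summed item ratings
    total_b = sum(rater_b_items)  # rater B's summed item ratings
    mean_total = (total_a + total_b) / 2  # average of the two raters
    return 100.0 * mean_total / max_points


# Hypothetical ratings on a three-item form worth 9 points in total.
example_score = dxj_case_score([2, 3, 1], [3, 3, 2], max_points=9)
```

Because the percent conversion is linear, averaging the two raters before or after converting to percent gives the same result.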

Data analysis and interpretation

Intraclass correlations were computed to determine the agreement between pairs of raters on a case-by-case basis. The analyses were first done using the consistency model to determine how well raters agreed in ranking the performances. The analyses were then repeated using the absolute agreement model to determine how well the two raters agreed on the absolute score assigned to the response. The mean of the intraclass correlations for the nine cases was computed and reported along with the results for individual cases.
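The difference between the two ICC models can be illustrated with a short sketch. The Python function below computes a two-way intraclass correlation from the usual analysis-of-variance mean squares, in either the consistency or the absolute agreement form. It is an illustration under stated assumptions, not the analysis code used in the study: the average-measures form is assumed by default because the scores analyzed were pooled across two raters, and the function name and example ratings are hypothetical.

```python
from statistics import mean


def icc_two_way(ratings, absolute=False, average=True):
    """Two-way ICC for a table with one row per response and one column per rater.

    absolute=False gives the consistency form; absolute=True gives absolute agreement.
    average=True gives the average-measures form (an assumption in this sketch).
    """
    n = len(ratings)      # number of rated responses
    k = len(ratings[0])   # number of raters
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))

    numerator = ms_rows - ms_err
    if average:
        denominator = ms_rows
        if absolute:
            denominator += (ms_cols - ms_err) / n
    else:
        denominator = ms_rows + (k - 1) * ms_err
        if absolute:
            denominator += k * (ms_cols - ms_err) / n
    return numerator / denominator


# Hypothetical data: rater 2 scores every response 10 points above rater 1.
example = [[60, 70], [70, 80], [80, 90], [50, 60]]
icc_consistency = icc_two_way(example, absolute=False)
icc_absolute = icc_two_way(example, absolute=True)
```

In this hypothetical example the two raters rank the responses identically, so the consistency ICC is 1, while the constant 10-point offset lowers the absolute agreement ICC to about 0.87.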

The component scores that are most directly associated with diagnostic reasoning ability (history and physical examination; findings; differential diagnosis; diagnosis; diagnosis justification) were then examined and compared; mean component scores were computed and averaged across cases. Likewise, item–total correlations adjusted for the contribution of each component score were computed to determine the relative contribution of each component to the total score. Additionally, the correlation between each case component score and the total examination score for all 14 cases of the SCCX was computed to determine the contribution of each component to the total examination score, which we considered to be the closest approximation to a gold standard of clinical performance ability for each student.
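One common way to adjust an item–total correlation for a component's own contribution is to correlate the component with the total of the remaining components. The sketch below assumes that approach and assumes that the total is a simple sum of component percent scores; the exact computation used in the study is not specified here, and the names and data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_total(components):
    """Correlate each component with the total of the other components.

    components maps a component name to a list of scores, one per student.
    """
    names = list(components)
    n_students = len(components[names[0]])
    totals = [sum(components[c][i] for c in names) for i in range(n_students)]
    result = {}
    for c in names:
        # Remove the component's own contribution from the total.
        rest = [totals[i] - components[c][i] for i in range(n_students)]
        result[c] = pearson(components[c], rest)
    return result


# Hypothetical percent scores for five students on one case.
example_components = {
    "history_physical": [80, 75, 90, 70, 85],
    "findings": [70, 60, 85, 65, 80],
    "diagnosis": [100, 50, 100, 50, 100],
    "dxj": [60, 40, 80, 45, 75],
}
example_correlations = corrected_item_total(example_components)
```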

We also computed the correlation between each traditional standardized patient case component score (history and physical examination checklist score, findings score, differential diagnosis score, and diagnosis score) and the DXJ score to determine whether the diagnosis justification score was a redundant measure that added no new information about medical student diagnostic reasoning and clinical knowledge utilization.

Intraclass correlations (consistency model and absolute agreement) among the nine case scores were computed to determine the similarity of DXJ and SCCX scores for each student across the nine cases.

To compare the similarity of decisions using the two performance measures (SCCX and DXJ), we set a common absolute passing score for each and compared the decisions that would be made with each measure. We used a 65% absolute passing score for each case. A passing score of 65% on the SCCX component means that the student achieved 65% of the points that reflected the standard of care for this case. This absolute standard was established in advance by the SCCX committee on the basis of evidence from the medical literature and collective judgments about acceptable standards of care for the patient. In doing so, committee members were aware of historical class performances on the case when available and current class performance results where committee members were blinded to the identity of students. For the purposes of this study, we also adopted the passing standard of 65% for the DXJ scores. A score of 66.67% on the DXJ translates as a competent performance. Thus, a 65% DXJ passing standard would translate to a borderline performance. Results for groups of students with large discrepancies in DXJ and SCCX scores were then compared.
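The comparison of pass–fail decisions amounts to a two-by-two cross-tabulation. The sketch below applies a common 65% passing score to paired SCCX and DXJ percent scores and counts the four possible outcomes. Treating a score exactly at 65% as passing is an assumption, and the function name and example scores are hypothetical.

```python
PASS_MARK = 65.0  # common absolute passing score, in percent


def decision_table(sccx_scores, dxj_scores, pass_mark=PASS_MARK):
    """Count (SCCX decision, DXJ decision) pairs for paired percent scores."""
    table = {
        ("pass", "pass"): 0,
        ("pass", "fail"): 0,
        ("fail", "pass"): 0,
        ("fail", "fail"): 0,
    }
    for sccx, dxj in zip(sccx_scores, dxj_scores):
        # Assumption: a score equal to the pass mark counts as a pass.
        sccx_decision = "pass" if sccx >= pass_mark else "fail"
        dxj_decision = "pass" if dxj >= pass_mark else "fail"
        table[(sccx_decision, dxj_decision)] += 1
    return table


def agreement_rate(table):
    """Proportion of pairs on which the two measures give the same decision."""
    same = table[("pass", "pass")] + table[("fail", "fail")]
    return same / sum(table.values())


# Hypothetical scores for four student-case pairs.
example_table = decision_table([70, 80, 60, 66], [50, 90, 70, 64])
example_agreement = agreement_rate(example_table)
```

The off-diagonal cells, pass on SCCX with fail on DXJ and the reverse, are the discrepancies of interest.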

All measures were reported and analyzed as percent scores, thus avoiding errors of analysis that might result from differing numbers of raw score points being associated with the various component scores.

This study was reviewed and approved as exempt on December 20, 2010, by the Springfield Committee on Research Involving Human Subjects at Southern Illinois University School of Medicine. The study was conducted in accordance with the research protocol submitted.

Results

Comparison of DXJ and SCCX scores

The agreement between DXJ raters for each case was measured using the intraclass correlation coefficient (ICC) as a reliability index. Because we used the pooled ratings by the two raters as the best indication of DXJ performance for each case, the ICC reported is an indication of the pooled agreement for pairs of raters. We first determined rater agreement in ranking student DXJ responses. The average ICC representing rankings of student DXJ responses was 0.75, with a range from 0.63 to 0.83 for the nine cases on the Class of 2011 examination. Three cases had a reliability index at or above 0.8. Four had reliabilities between 0.7 and 0.79. The average ICC for ranking of DXJ responses for the Class of 2012 was 0.64, with a range from 0.33 to 0.81 for the nine cases. In only one case was the interrater reliability above 0.8. Two had reliabilities of 0.7 and 0.78, respectively. Seven of the nine cases had reliabilities of 0.6 or above.

Because the absolute score assigned is also important when using criterion-referenced measurement, as was done in this study, we also determined the ICC representing absolute agreement between pairs of raters in assigning DXJ scores. The average ICC for pairs of raters based on absolute agreement of the rating assigned for the Class of 2011 was 0.73, with a case range from 0.53 to 0.83. Four cases had interrater reliabilities of 0.8 or above. Two additional cases had reliabilities of 0.74 and 0.77. The absolute agreement ICC for the Class of 2012 was 0.56, with a case range from 0.3 to 0.81. Only one case had a reliability above 0.80. Two cases had reliabilities of 0.7 and 0.74, respectively. The absolute agreement ICC was 0.60 or above for four of the nine cases on the 2012 examination.

Individual student DXJ scores were similar for the nine cases. ICCs reflecting the relative ranking of students based on DXJ case scores for the nine cases were 0.87 for the Class of 2011 and 0.75 for the Class of 2012. The correlations based on absolute DXJ score assigned were 0.86 for the Class of 2011 and 0.71 for the Class of 2012. For comparison purposes, the correlations reflecting ranking based on traditional SCCX scores were 0.81 for the Class of 2011 and 0.79 for the Class of 2012. ICCs based on absolute score agreement for the nine cases based on traditional SCCX scores were 0.79 for the Class of 2011 and 0.78 for the Class of 2012.

Grading student DXJ responses took an average of 80 seconds per case according to the experience of the one rater who graded all of those responses.

Table 1 provides comparative information on traditional SCCX component and DXJ score characteristics for the nine cases. The table indicates the average component score and the range of component scores for the nine cases. As can be seen, the DXJ score generally had the lowest mean, indicating that this was the most difficult task for these medical students. The history and physical examination and the diagnosis tasks seemed to be the easiest tasks as measured.

Table 1

The DXJ component has a substantially higher correlation with the total case score than does any other component and has the highest correlation with the total SCCX exam score, which is arguably the closest thing to a gold standard measure of clinical performance ability available to us. Because the four traditional SCCX component scores each have similar low to moderate correlations with the DXJ score, none of them seem to be replacements for what the DXJ scores measure.

Charts 1 and 2 indicate the discrepancies that would occur in pass–fail decisions using DXJ and SCCX scores and a 65% passing standard. Chart 1 assumes that a passing standard is set according to the average score across all nine cases, which is one strategy that medical schools use. Chart 2 examines discrepancies that would occur if pass–fail decisions are made at the case level, which is the other commonly used strategy.

Chart 1 Discrepancies in Pass–Fail Decisions at the Examination Level for the Classes of 2011 (n = 67) and 2012 (n = 70) at Southern Illinois University School of Medicine Using a 65% Passing Standard for Each Measure*,†

Chart 2 Discrepancies in Pass–Fail Decisions at the Case Level for the Classes of 2011 (n = 67) and 2012 (n = 70) at Southern Illinois University School of Medicine Using a 65% Passing Standard for Each Measure*,†

In both Charts 1 and 2, the first finding is that, apparently, far fewer students would pass using the DXJ score than would pass using the traditional SCCX score.

The second finding is that use of either of the two scores would lead to the same pass–fail decision in a majority of cases. Most discrepancies that occurred resulted in the student’s failing when the DXJ score was used to make pass–fail decisions and passing when the SCCX score was used to make such decisions. However, there were some cases where students would fail on the basis of the SCCX score and pass if the DXJ score was used. (See, for example, results for the Class of 2011 in Chart 2.)

The third key finding is that the number of DXJ performances below the passing standard for the Class of 2012 was lower than was the case for the Class of 2011. Providing students with practice and placing more emphasis on diagnosis justification may be having the desired effect on DXJ performance. The average DXJ score was 7 percentage points below the SCCX score for students in the Class of 2011 but was only approximately 1 percentage point lower than the SCCX score for students in the Class of 2012. Fourteen out of 67 students in the Class of 2011 (21% of the class) had DXJ scores that averaged 20 or more percentage points lower than their SCCX scores, whereas this was true for only 2 students in the Class of 2012 (less than 3% of the class).

On the opposite end of the spectrum, there were five students (7%) in the Class of 2011 who had DXJ scores that averaged 16 percentage points higher than their SCCX scores. The DXJ scores for these five students were higher than the SCCX scores in 37 out of 45 possible cases (82% of the cases). In the Class of 2012, seven students (10% of the class) had DXJ scores that were 10 to 14 percentage points above their SCCX scores. The DXJ scores for these seven students were higher than the SCCX scores in 48 out of 63 cases (76% of cases).

Common DXJ errors observed

Two broad categories of poor or borderline performance emerged. Premature closure was a very common finding among poorly or borderline-performing students. Students fixated on a single diagnosis and focused their arguments on why that diagnosis was right, without considering competing diagnoses, even though they were explicitly instructed to consider them.

The second common finding was that poorly or borderline-performing students often failed to recognize and use common symptom patterns to reason clinically. Students tended to use one or two pieces of information and call that a pattern; disconfirming data, pertinent negatives, or even a full use of key positives were not called into play in making their diagnostic argument.

Although there were multiple instances of poor or borderline performance on the DXJ question, there were many students who clearly could reason well diagnostically, meeting all of the expectations specified in the grading protocol presented as Appendix 1.

Discussion

We are certain that medical school faculty members would agree that one of the most important goals of medical training is to produce doctors who are able to integrate and use their medical knowledge to guide their data collection when interviewing and examining patients, to support interpretation of those data, and to accurately diagnose patients’ medical problems. Further, we believe that students’ ability to support those diagnoses is the best indication of their clinical competence.

The development and widespread deployment of standardized patient examination technology has improved the opportunity to objectively measure the clinical performance abilities of medical students and residents. However, we would argue that the traditional metrics used in standardized patient examinations (history and physical examination checklist, differential diagnosis, diagnosis) are necessary but not sufficient measures of diagnostic reasoning ability. We would also argue that a test-wise student may get a high score on the history and physical examination checklist for most standardized patient cases by applying a rote-memorized data collection strategy to all patients. Likewise, test-wise students may get the correct diagnosis by choosing a common diagnosis associated with the given chief complaint, with no additional data needed from the standardized patient, given that most standardized patient cases are developed to cover common chief complaints and common diagnoses underlying those chief complaints. Our experience has been that most examinees get the correct diagnosis for standardized patient cases, a claim that is supported by the high diagnosis score for these students as documented in Table 1.

The results presented here suggest that the ability of fourth-year medical students to provide an adequate justification for their diagnoses may be markedly less advanced than the other skills routinely measured on standardized patient examinations in medical school and probably less advanced than we and other medical faculty would have hoped. If we are to improve these abilities in medical students, we need to employ feasible, direct measures of diagnostic reasoning ability throughout medical school and continue to refine our methods for systematically fostering development of these skills by all medical students. We believe that the physician time required to read and score student responses (160 seconds for two physician raters to read and score an examinee’s response for each case) is well worth the time and effort when the goal is to gain insight into a student’s ability to understand, diagnose, and defend the diagnosis of a patient’s medical problem.

One limitation of this study is that the results are based on the performance of medical students at only one medical school. However, we have no reason to expect that the diagnostic reasoning abilities of students at this school differ from those of students at other schools. In fact, there is some specific evidence to support the expectation that their diagnostic reasoning abilities are similar.16 A second limitation is that the diagnosis justification question tests primarily the analytic reasoning ability of students. If one embraces the dual reasoning argument from clinical reasoning research as we do, one has to acknowledge that some students might make good diagnostic decisions using pattern-matching ability and not be able to articulate the reasons for those decisions. We accept this possibility but believe that it does not negate the value of attempting to systematically measure DXJ ability of medical students as best we can.

If our results do generalize to other medical schools, they suggest that relying on traditional standardized patient scoring metrics (history and physical examination score plus findings plus differential diagnoses plus diagnosis) is likely to mislead medical school faculty into drawing a more sanguine conclusion about the diagnostic reasoning ability of students near the end of medical school training than would result when using a more direct assessment like the DXJ score.

Acknowledgments: The authors would like to thank the following raters who each graded diagnostic justification responses for one or more cases: Tracy Aldridge, MD, Kelly Armstrong, PhD, Careyanna Brenham, MD, Ted Clark, MD, Christopher Gleason, MD, Imran Hassan, MD, Hojiang Huang, MD, Tracy Lower, MD, Patrick McKenna, MD, Robert McLafferty, MD, Erica Nelson, MD, Ayman Omar, MD, Gary Rull, MD, and Christine Todd, MD. We would also like to thank the Senior Clinical Comprehensive Examination Committee (chair: Erica Nelson, MD) and the case authors who developed and refined this examination. Finally, the authors would like to thank Linda Morrison and Mary Aiello for training the patients, coordinating the clinical comprehensive examination, and compiling the performance data.

Dedication: This report is dedicated to the memory of Howard S. Barrows, MD, who made the study of diagnostic reasoning ability a major part of his life’s work and who played a major role in developing the senior clinical comprehensive examination at Southern Illinois University School of Medicine.

Funding/Support: None.

Other disclosures: None.

Ethical approval: This study was reviewed and approved as exempt on December 20, 2010, by the Springfield Committee on Research Involving Human Subjects at Southern Illinois University School of Medicine.

References

1. Voytovich AE, Rippey RM, Suffredini A. Premature conclusions in diagnostic reasoning. J Med Educ. 1985;60:302–307
2. Rao G. Probability error in diagnosis: The conjunction fallacy among beginning medical students. Fam Med. 2009;41:262–265
3. Friedman MH, Connell KJ, Olthoff AJ, Sinacore JM, Bordage G. Medical student errors in making a diagnosis. Acad Med. 1998;73(10 suppl):S19–S21
4. Papadakis MA, Teherani A, Banach MA, et al. Disciplinary action by medical boards and prior behavior in medical school. N Engl J Med. 2005;353:2673–2682
5. Graber ML, Franklin N, Gordon R. Diagnostic error in internal medicine. Arch Intern Med. 2005;165:1493–1499
6. Williams RG, Dunnington GL. Assessing the ACGME competencies with methods that improve the quality of evidence and adequacy of sampling. ACGME Bull. April 2006:38–42
7. Norman GR, Eva KW. Diagnostic error and clinical reasoning. Med Educ. 2010;44:94–100
8. Berner ES, Miller RA, Graber ML. Missed and delayed diagnoses in the ambulatory setting. Ann Intern Med. 2007;146:470
9. Berner ES, Graber ML. Overconfidence as a cause of diagnostic error in medicine. Am J Med. 2008;121(5 suppl):S2–S23
10. Berg K, Winward M, Clauser BE, et al. The relationship between performance on a medical school’s clinical skills assessment and USMLE Step 2 CS. Acad Med. 2008;83(10 suppl):S37–S40
11. Harik P, Clauser BE, Grabovsky I, Margolis MJ, Dillon GF, Boulet JR. Relationships among subcomponents of the USMLE Step 2 Clinical Skills Examination, the Step 1, and the Step 2 Clinical Knowledge Examinations. Acad Med. 2006;81(10 suppl):S21–S24
12. Swygert KA, Muller ES, Scott CL, Swanson DB. The relationship between USMLE Step 2 CS patient note ratings and time spent on the note: Do examinees who spend more time write better notes? Acad Med. 2010;85(10 suppl):S89–S92
13. Klamen D, Williams R. The Diagnosis and Treatment of the Failing Student (Standardized Patient Exam Failures). Springfield, Ill: Southern Illinois University School of Medicine; 2009
14. Endsley MR. Toward a theory of situation awareness in dynamic-systems. Hum Factors. 1995;37:32–64
15. Jones DG, Endsley MR. Sources of situation awareness errors in aviation. Aviat Space Environ Med. 1996;67:507–512
16. Williams RG, Klamen DL, White CB, et al. Tracking development of clinical reasoning ability across five medical schools using a progress test. Acad Med. 2011;86:1148–1154
© 2012 Association of American Medical Colleges