Research Reports

Variations in Senior Medical Student Diagnostic Justification Ability

Williams, Reed G., PhD; Klamen, Debra L., MD, MHPE; Markwell, Stephen J., MA; Cianciolo, Anna T., PhD; Colliver, Jerry A., PhD; Verhulst, Steven J., PhD

doi: 10.1097/ACM.0000000000000215


In its 2001 report, “The AAMC Project on the Clinical Education of Medical Students,” the Association of American Medical Colleges (AAMC) found that its membership institutions had become “increasingly aware of apparent deficiencies in the design, content, and conduct of the clinical education of medical students.”1 This report was based on the effort of working panels of medical education opinion leaders from medical schools throughout the United States, who were charged with better understanding the problems of clinical education and making recommendations for improvements. The report noted that the conceptual basis for clinical education in medical schools throughout the 20th century

was highly flawed, primarily because the clerkship experiences, even in the individual clinical disciplines, were highly variable. The variability was inevitable, because of the varied nature of the clinical sites to which the students were assigned over the course of any given year, the variable spectrum of the conditions encountered at those sites, and the variable quality of the supervision and teaching provided by resident physicians and attending physicians at those sites. As a result, it was not possible for medical schools to ensure that all students were having comparable educational experiences during the last two years of the curriculum.1

One of the major conclusions from the report was that most schools did not pay enough attention to “ensuring that students acquire fundamental clinical skills, particularly physical diagnosis skills.”1 However, this conclusion was based on opinion data and was not accompanied by direct evidence of clinical performance inadequacies among students.

As a result, the working panels made two major recommendations: (1) medical schools should ensure that “clerkships are designed and conducted so that above all else students acquire fundamental clinical skills”; and (2) medical schools should make sure that students, through their clinical clerkships, are exposed to enough cases of common disorders “that are representative of those seen in the clinical practice of the relevant discipline.”1

The Liaison Committee on Medical Education (LCME) has established two medical school accreditation standards for U.S. and Canadian medical schools that address variability in clinical education. Standard ED-37 indicates that the medical education program

should ensure that each academic period of the curriculum maintains common standards for content. Such standards should address the depth and breadth of knowledge required for a general professional education … and the extent of redundancy needed to reinforce learning of complex topics.2

Standard ER-6 states that:

The clinical resources of the medical education program should be sufficient to ensure the breadth and quality of ambulatory and inpatient teaching. These resources include adequate numbers and types of patients (e.g. acuity, case mix, age, gender) and physical resources.2

Both the AAMC and LCME have stressed the importance of medical schools providing adequate systematic clinical experience to all medical students to ensure student and graduate ability to use fundamental knowledge, skills, and clinical reasoning skills to diagnose and manage commonly occurring patient problems.

A number of clinical performance studies have found that medical student clinical performances vary greatly from case to case,3–6 a phenomenon commonly referred to as case specificity. However, studies demonstrating case specificity frequently did not exert tight control over the clinical tasks posed and evaluated; the studies were typically done with standardized patient examinations featuring multiple cases developed by multiple individuals. Although the cases were created using general sets of case development rules, it was common for the cases to vary in the emphasis of the clinical tasks assessed and the measurement methods used for assessment. For example, some cases emphasized data collection, others emphasized data interpretation, and still others emphasized management. Also, with some exceptions,5 studies of case specificity have been conducted from a measurement perspective. Thus, little effort has been directed toward exploring the nature of case-to-case performance variation. Instead, studies have focused on the degree to which case specificity exists and on determining how many cases are needed on such examinations to allow a generalizable estimate of student performance ability given the observed degree of case specificity.

To determine whether medical school curricula are producing the observed medical student performance variation, an approach that controls for study method variance among cases while allowing case content to vary is needed. In this study, we investigated case-to-case variation in individual student diagnostic performance, in situations where the clinical tasks posed for students, and the means of observing and recording students’ performances, were invariant from case to case. Although our study uses generalizability theory and methods, we focus on exploring the impact of medical school clinical education design and management on student diagnostic proficiency.


As a graduation requirement and after completing required clerkships, all medical students at Southern Illinois University School of Medicine (SIUSM) must pass the Senior Clinical Comprehensive Examination, a 14-case standardized patient examination. The blueprint for this examination is explicitly based on the school’s published graduation competencies, which specifically outline the chief complaints that all students must be prepared to handle.7 For 9 of these cases, after interviewing and examining the patient, students complete a free-response diagnosis justification exercise, which should describe how they reached the final diagnosis. The results of the diagnosis justification portion of this examination provide unique insight regarding the diagnostic skills of these physicians-in-training; this is one of the few opportunities to study diagnostic (clinical reasoning and knowledge utilization) performance systematically and in depth where (1) the clinical cases are standardized and represent a broad spectrum of common chief complaints and diagnoses, (2) the examination and scoring conditions are controlled, and (3) the tasks are a close approximation to normal clinical responsibilities of physicians.

The focus of this research is to determine the extent to which medical students who have completed all required clinical clerkships demonstrate consistent diagnostic justification competencies when examined across a range of cases representing the most common chief complaints and diagnoses of primary care patients in the United States and Canada. We focused on the degree to which all fourth-year SIUSM medical students mastered these diagnostic skills after completing the required clerkships. Because the clinical training regimen at SIUSM is similar to that at virtually all other U.S. medical schools, our findings offer a close look at the effects of clinical training on medical students’ diagnostic justification abilities when measured in a carefully controlled but authentic way.

In this study we address the following questions: (1) Does the diagnostic justification performance of medical students near the end of medical school consistently meet expectations across a series of common cases representing the range of presenting complaints and diagnoses commonly seen in clinical practice? and (2) Is the performance of medical students consistent across three primary diagnostic tasks that make up diagnostic justification ability?


For this study, we used diagnostic justification exercise data gathered from the Senior Clinical Comprehensive Examination taken by the classes of 2011 (n = 67), 2012 (n = 66), and 2013 (n = 79) in the year they were scheduled to graduate from SIUSM. This study was reviewed and judged to be exempt from further review by the Springfield Committee for Research Involving Human Subjects Institutional Review Board. The study was conducted in accordance with the research protocol submitted.

Senior Clinical Comprehensive Examination

The 14 cases on SIUSM’s Senior Clinical Comprehensive Examination are designed to represent primary care patients’ most common chief complaints and the most common diagnoses underlying them. Developed initially by a team of physician and scientist authors, they are based on actual cases and focus on data collection (history taking, physical examination, test ordering), data interpretation (findings, differential diagnosis, diagnosis), initial management, and patient satisfaction with the service provided. The team develops a blueprint for each case that includes evidence from the research literature documenting the importance of the data collected for diagnosis and initial management. Each case is then reviewed by a committee of approximately 25 members representing many medical disciplines as well as specialists in test and standardized patient examination development. This review is intended to ensure that the details of the case are clear and that the focus of the case is appropriate for medical students. The committee makes case and data collection instrument alterations as needed. Finally, case coverage is systematically compared with SIUSM’s published Commencement Objectives, which specify performance expectations for graduates.7 Generally, the cases used in each Senior Clinical Comprehensive Examination cover approximately 80% of the chief complaints included in the graduation objectives and represent the most common diagnoses underlying those chief complaints. The cases used in this study varied from year to year with an average of four to five new cases introduced each year.

Diagnosis justification exercise and rating process

For nine of these cases (eight in the class of 2013), students were required to provide a written justification for their final diagnosis as a postencounter exercise. Students were presented with this prompt: “Please explain your thought processes in getting to your final diagnosis; how you used the data you collected from the patient and from laboratory work to move from your initial differential diagnoses to your final diagnosis. Be thorough in listing your key findings (both pertinent positives and negatives) and explaining how they influenced your thinking.” More details about this assessment, the Hofstee standard-setting process used,8 use of examination results, and examination validation have been published in earlier articles.9,10

Two physicians independently graded all of the student responses for each case. Physician judges were blinded to the name and thus the background of the student respondents. One physician (D.K.) graded all of the responses for all of the cases. This rater has studied the diagnostic reasoning process in depth and functioned as a generalist physician rater for each year of the study. For each case, one additional physician, who was the developer of the case and a specialist in the area, graded all student responses. Thus, there were two physician ratings for each response for each case. These second raters varied from year to year. The average intraclass correlations between raters’ rankings of diagnosis justification responses (interrater agreement), reported earlier,10 were 0.75 and 0.64 for the classes of 2011 and 2012, respectively. The average intraclass correlation coefficient between raters’ rankings for the class of 2013 was 0.64.
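As a rough illustration of how agreement between two raters on a single case could be quantified, the sketch below computes a one-way random-effects intraclass correlation, ICC(1,1). The article does not state which form of the intraclass correlation was used, so the choice of form, the function name `icc_oneway`, and the sample ratings are all assumptions made for illustration only.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC(1,1).

    ratings: one row per student response, each row holding the scores
    given by the k raters (k = 2 in the design described above).
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-response and within-response mean squares.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical ratings on the 0-3 scale: (generalist, specialist).
example = [(3, 3), (2, 1), (1, 1), (0, 1), (2, 2)]
print(round(icc_oneway(example), 2))
```

Other ICC forms (for example, two-way models that separate rater effects) would give different values on the same data, so this sketch should not be read as reproducing the reported coefficients.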

For each case, students were graded on three diagnosis justification components, which made up the diagnosis justification composite score for each examinee:

  • Differential Diagnosis (DDX): Based on the diagnostic possibilities discussed, did the student consider an appropriate range of diagnostic possibilities given the findings of the case?
  • Recognition and Use of Key Findings: What percent of available key findings (pertinent positives and negatives alike) did students use in building an argument for the final diagnosis?
  • Thought Processes and Clinical Knowledge Utilization: How effective was the argument that students built for their final diagnosis?

Raters assigned a grade to each student’s response for each task using a four-point scale: 0 = poor, 1 = borderline, 2 = competent, and 3 = excellent. Raters were calibrated through a process involving discussion of the diagnosis justification rating scale and through practice rating and then discussing ratings of sample diagnosis justification responses.

Data collection and analysis

We obtained and used an anonymized database of diagnosis justification scores for the classes of 2011, 2012, and 2013 in this study. The database consisted of diagnosis justification scores at the component level by each of the two expert raters for each examinee. These scores were first analyzed using a generalized variance components program (GENOVA)11 to establish the relative impact of various factors on the scores. The design variables were student (S), case (C), and task (T) along with all interaction terms described in Table 1. The analyses used the averaged score for the two raters as the diagnosis justification score. The averaged score of the two raters was considered the best indicator of diagnostic justification ability because this reflected the combined judgment of a generalist physician, who specializes in studying diagnostic reasoning, and the judgment of the case author, who was a specialist in the practice area. The generalizability analyses were also performed separately for each rater, and the results were similar (not shown). The S × C interaction term most directly addresses our first research question: consistency of diagnostic justification performance across cases; and the S × T interaction term most directly addresses our second research question: consistency of student performance across tasks. Subsequent analyses were done to document, expand on, and illustrate the GENOVA findings.
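To make the variance decomposition concrete, the following is a minimal sketch of how variance components for a fully crossed student × case × task design could be estimated from mean squares. It is not GENOVA and is not the authors' analysis; the function names, the nested-list data layout, and the truncation of negative estimates to zero are assumptions of this sketch.

```python
from itertools import product
from statistics import mean


def variance_components(x):
    """Estimate variance components for a fully crossed S x C x T design.

    x[s][c][t] is the (rater-averaged) score of student s on case c, task t.
    With one observation per cell, the S x C x T term is confounded with error.
    """
    ns, nc, nt = len(x), len(x[0]), len(x[0][0])
    S, C, T = range(ns), range(nc), range(nt)

    # Grand mean and marginal means.
    g = mean(x[s][c][t] for s, c, t in product(S, C, T))
    m_s = [mean(x[s][c][t] for c, t in product(C, T)) for s in S]
    m_c = [mean(x[s][c][t] for s, t in product(S, T)) for c in C]
    m_t = [mean(x[s][c][t] for s, c in product(S, C)) for t in T]
    m_sc = [[mean(x[s][c][t] for t in T) for c in C] for s in S]
    m_st = [[mean(x[s][c][t] for c in C) for t in T] for s in S]
    m_ct = [[mean(x[s][c][t] for s in S) for t in T] for c in C]

    # Mean squares from the usual ANOVA decomposition.
    ms_s = nc * nt * sum((m - g) ** 2 for m in m_s) / (ns - 1)
    ms_c = ns * nt * sum((m - g) ** 2 for m in m_c) / (nc - 1)
    ms_t = ns * nc * sum((m - g) ** 2 for m in m_t) / (nt - 1)
    ms_sc = nt * sum((m_sc[s][c] - m_s[s] - m_c[c] + g) ** 2
                     for s, c in product(S, C)) / ((ns - 1) * (nc - 1))
    ms_st = nc * sum((m_st[s][t] - m_s[s] - m_t[t] + g) ** 2
                     for s, t in product(S, T)) / ((ns - 1) * (nt - 1))
    ms_ct = ns * sum((m_ct[c][t] - m_c[c] - m_t[t] + g) ** 2
                     for c, t in product(C, T)) / ((nc - 1) * (nt - 1))
    ms_sct = sum((x[s][c][t] - m_sc[s][c] - m_st[s][t] - m_ct[c][t]
                  + m_s[s] + m_c[c] + m_t[t] - g) ** 2
                 for s, c, t in product(S, C, T)
                 ) / ((ns - 1) * (nc - 1) * (nt - 1))

    # Solve the expected-mean-square equations. Negative estimates are
    # truncated to zero (an assumption of this sketch).
    est = {
        "S": (ms_s - ms_sc - ms_st + ms_sct) / (nc * nt),
        "C": (ms_c - ms_sc - ms_ct + ms_sct) / (ns * nt),
        "T": (ms_t - ms_st - ms_ct + ms_sct) / (ns * nc),
        "SxC": (ms_sc - ms_sct) / nt,
        "SxT": (ms_st - ms_sct) / nc,
        "CxT": (ms_ct - ms_sct) / ns,
        "SxCxT,e": ms_sct,
    }
    return {name: max(0.0, value) for name, value in est.items()}


def percent_of_total(components):
    """Express each component as a percentage of total score variance."""
    total = sum(components.values())
    return {name: 100 * value / total for name, value in components.items()}
```

Calling `percent_of_total(variance_components(scores))` on a students × cases × tasks array yields the kind of percentage breakdown that a variance components table summarizes; the "SxC" entry corresponds to case specificity and the "S" entry to overall student proficiency.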

Table 1:
Variance Components for Scores Based on Three Years of Diagnostic Justification Exercises Completed by Medical Students at Southern Illinois University School of Medicine, 2011–2013a


Table 1 provides a summary of the generalizability analysis findings that quantify the relative impact of various components (factors) on the diagnosis justification scores. The data in Table 1 show that the variation in student diagnostic justification performance from case to case (S × C) is the most substantial determining factor, accounting for an average of 31.36% of score variation. This factor was the largest contributor to score variation for each of the three years of this study. This result indicates that student diagnostic justification performance was highly variable from case to case and that the nature and degree of this variability were different for different students. This has been referred to in the research literature, described earlier, as case specificity.3–6

To investigate the nature and degree of case specificity in these results, we plotted the diagnosis justification cross-case performance profiles for every student in three performance segments (top, middle, and bottom 20% of the class) for each class. Figure 1 shows the results for students in one class and illustrates the variability in case-to-case performance. Although there is somewhat more case-to-case consistency among students from the top 20% of the class, lack of performance consistency across cases characterizes all three groups. Results for the other two classes are similar but not shown.

Figure 1:
Diagnosis justification (DXJ) performance by case for the bottom (A), middle (B), and top (C) 20% of students in one medical school class at the Southern Illinois University School of Medicine, based on averaged expert ratings of performance [nominal meaning of scores: 0 = poor, 1 = borderline, 2 = competent (expected performance for this level of training), 3 = excellent]. Each line represents one student.

To illustrate this finding further, Table 2 depicts the case-to-case variability in diagnosis justification performance for three randomly selected students from each of the three performance segments (top, middle, and bottom 20%) for one class. The results show the percentile rank score for each student for each case. A percentile rank of 84 indicates that the student’s diagnosis justification performance was equal to or better than that of 84% of the students in the class. Percentile ranks show the relative difficulty of the case for each student. As can be seen, the students’ diagnosis justification performance is highly variable from case to case. For example, the percentile rank for Student G, in the bottom 20% of the class, ranged from the 4th to 94th percentile for the various cases. As mentioned above, this variability is especially pronounced for the middle and bottom 20% of the class.
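The percentile rank definition given above can be sketched in a few lines. The function name, the handling of ties (scores equal to the student's own count as "equal to or better than"), and the sample scores are assumptions for illustration, not the authors' computation.

```python
def percentile_rank(score, class_scores):
    """Percent of class scores that are less than or equal to `score`."""
    at_or_below = sum(1 for s in class_scores if s <= score)
    return 100 * at_or_below / len(class_scores)


# Hypothetical composite scores for one case (0-3 scale).
case_scores = [0.5, 1.0, 1.5, 1.5, 2.0, 2.0, 2.5, 2.5, 3.0, 3.0]
print(percentile_rank(2.0, case_scores))  # prints 60.0
```

Because the rank is computed within each case, the same raw score can map to very different percentile ranks on an easy case and on a hard one.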

Table 2:
Diagnosis Justification Performance Variation From Case to Case for Three Randomly Selected Students From Each of Three Performance Groups Based on Overall Performance in Southern Illinois University School of Medicine’s Senior Clinical Comprehensive Exam, 2012

Because percentile ranks do not indicate the magnitude of score differences (i.e., practical significance), Table 3 addresses practical significance by indicating the number of students in each class segment who received borderline or poor diagnosis justification scores for various numbers of cases. Of the students in the classes of 2011, 2012, and 2013, 57% (38/67), 23% (15/66), and 33% (26/79), respectively, received diagnosis justification scores of borderline or poor on more than half of the cases. Table 3 shows that the majority of students in the bottom 20% of each class were rated borderline or poor for a majority of the cases. Students in the middle 20% of the class were rated borderline or poor for two to seven of the cases depending on the year. Students in the top 20% generally were rated borderline or poor on zero to three cases. These results demonstrate that the borderline and poor ratings are concentrated among a relatively small but practically meaningful subgroup of the student population.
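The percentages quoted above follow directly from the reported counts, as the short check below shows. The helper `low_on_majority` is a hypothetical sketch of how a student might be flagged; treating any case score below 2.0 ("competent") as borderline or poor is an assumption, since the exact cutoff applied to averaged ratings is not restated here.

```python
def percent(count, total):
    """Whole-number percentage of count out of total."""
    return round(100 * count / total)


# Counts reported in the text: students rated borderline or poor
# on more than half of the cases, by graduating class.
reported = {2011: (38, 67), 2012: (15, 66), 2013: (26, 79)}
for year, (count, total) in reported.items():
    print(year, f"{percent(count, total)}%")


def low_on_majority(case_scores, cutoff=2.0):
    """True if more than half of a student's case scores fall below cutoff.

    The cutoff of 2.0 is an assumption of this sketch.
    """
    low = sum(1 for s in case_scores if s < cutoff)
    return low > len(case_scores) / 2
```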

Table 3:
Number of Students at Southern Illinois University School of Medicine Who Demonstrated Borderline or Poor Diagnosis Justification (DXJ) Performance, by Number of Cases, 2011–2013a

The second largest variance component was student diagnostic performance proficiency across cases and tasks, which accounted for approximately 23% of the score variation (see Table 1). The student proficiency factor was the second largest contributor in all three years of the study. Because the purpose for examining students is to establish a general indication of ability to perform clinically across a range of cases, tasks, and situations, this variance component finding indicates that an average diagnosis justification score for this number of cases provides a reasonable estimate of student diagnosis justification proficiency when scored by expert raters.

Task difficulty and case difficulty were the next largest determinants of score variation, but the magnitude of the contributions of these factors was minor by comparison with case specificity and student diagnosis justification proficiency. Table 4 provides the three diagnosis justification task scores broken down by class year and cases.

Table 4:
Mean Diagnosis Justification Performance Scores of Students at the Southern Illinois University School of Medicine, 2011–2013a

Repeated-measures analysis of variance, conducted separately for each class, demonstrated that the diagnosis justification tasks, represented by the three diagnosis justification components (DDX, Findings, and Thought Processes), differed in difficulty (P < .001). Task performance accounted for 37% of the score variation (partial eta-squared) in 2011 and 2012 and 76% of the score variation in 2013. Pairwise comparisons, again done separately for each class, indicated that the three tasks were all different in difficulty except that Findings and Thought Processes ratings for the class of 2013 did not differ from each other. The differences between Findings and Thought Processes scores for each class were relatively small, indicating that these tasks were similar in difficulty, and the task of creating an appropriate differential diagnosis appears to have been easier than the other two tasks for students (Table 4). Notably, the combined physician judge ratings of class ability to recognize and use pertinent findings for diagnostic purposes (Findings and Thought Processes scores, respectively) were below the level that these raters considered competent (a score of 2.0) in 2011 and 2013.
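For readers unfamiliar with partial eta-squared, the sketch below computes it for the task effect in a one-way repeated-measures layout (students as rows, tasks as columns), using the standard definition SS_task / (SS_task + SS_error). The function name and the sample data are hypothetical, and the sketch makes no claim about the software or exact model the authors used.

```python
def partial_eta_squared(scores):
    """Partial eta-squared for the task effect in a one-way
    repeated-measures design.

    scores: one row per student, one column per task.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    task_means = [sum(row[j] for row in scores) / n for j in range(k)]
    student_means = [sum(row) / k for row in scores]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_task = n * sum((m - grand) ** 2 for m in task_means)
    ss_student = k * sum((m - grand) ** 2 for m in student_means)
    ss_error = ss_total - ss_task - ss_student  # task x student residual

    return ss_task / (ss_task + ss_error)


# Hypothetical mean scores per student for (DDX, Findings, Thought Processes).
sample = [
    [2.5, 1.5, 2.0],
    [2.0, 1.5, 1.5],
    [3.0, 2.0, 2.0],
    [2.0, 1.0, 1.5],
]
print(round(partial_eta_squared(sample), 2))
```

A value near 1 means that, once stable differences between students are set aside, most of the remaining variation is attributable to differences in task difficulty.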


In this study, we have provided an in-depth look at clinical reasoning abilities central to clinical practice—in this case, diagnostic justification proficiency—and we have provided data on three entire classes of medical students, rather than only on the performance of students who volunteered to participate. Our results show that student diagnostic justification performance is highly variable (not consistent) across the range of cases, common chief complaints, and common underlying diagnoses used in the Senior Clinical Comprehensive Examination. Equally important, we found that a substantial portion of students in each class (23%, 33%, or 57% of the students, depending on year) provided diagnosis justification responses that were judged borderline or poor for more than 50% of the cases.

Our finding of case specificity is not new. It has been reported as a major, often primary, determinant of standardized patient score variation.3–6 Our results extend this work by establishing that case specificity is present even when the clinical tasks and the measurement procedures are tightly controlled and uniform across cases. Therefore, our findings suggest that case specificity is a function of variations in student training and experience.

Our findings suggest that SIUSM’s planned set of medical school activities, operating under the LCME’s medical education standards, does not result in consistent diagnostic justification skills for all students. In fact, a substantial portion of each class received borderline or poor diagnosis justification scores for a majority of the cases on each examination. Admittedly, these results could be partially attributable to unreasonably high faculty expectations for student performance. We have tried to control for this by using two expert raters, one specialist and one generalist, to grade each diagnostic justification response, and then combining these independent grades into a single score. Acknowledging that possible alternative explanation, we still believe that our investigation of senior medical students’ diagnostic justification proficiency, as judged by two expert physician raters who were blinded to the students’ prior performance and to their diagnosis justification ratings on other cases on the examination, does advance the possibility that near-graduates’ diagnostic justification proficiency is highly variable across commonly encountered patient cases and falls short of faculty expectations. In previous research12 we have documented that a one-month clinical reasoning elective for all students who fail this examination has resulted in improved clinical reasoning ability. While discussing limitations, we also point out that the interrater agreement results are less than ideal, reminding us that there is not a uniform, commonly held definition of diagnostic justification ability among expert raters. We have addressed this by using the combined scores of two expert raters as our diagnosis justification score for each case.

In the beginning of this research report, we outlined the first recommendation from Phase 1 of “The AAMC Project on the Clinical Education of Medical Students”:

First, they (medical schools) must ensure to the degree possible that the clerkships are designed and conducted so that above all else students acquire the fundamental clinical skills that they will need throughout their professional careers.1

If one accepts that diagnostic justification skills are representative of the “fundamental clinical skills” mentioned above, and given our finding that diagnostic justification ability of near-graduates is lower than expected, we posit that, at least at SIUSM, the AAMC’s first recommendation has not been achieved, 12 years after the release of its report. We believe that these results regarding diagnostic justification abilities of senior medical students will be a surprise to most medical school faculty as well.

Is it acceptable for a substantial segment of medical students to graduate from medical school without being judged proficient at diagnosing a majority of cases with common chief complaints and common underlying diagnoses? Some medical school faculty would have us disregard this state of affairs, saying that the problem will be solved during residency training. This is of course a possibility, but comprehensive standardized patient examinations like the one used in this study are rarely employed during residency training, so we cannot confirm that these variations in diagnostic proficiency are resolved through the added clinical experience of residency training. More important, the absence of standardized patient examinations in residency training means that residency program faculty cannot readily identify residents who are unable to independently perform these critical tasks across an acceptable range of cases. As a result, residents who lack proficiency are not identified and therefore not helped to overcome those deficiencies before being certified.

Why is case specificity a primary defining characteristic of medical student clinical performance ability? Because of the variation in patients in the hospital or clinic at any one time, the clinical opportunities afforded medical students in the clinical years of training are essentially random. The LCME2 has acknowledged and addressed this problem by introducing standard ED-2, which requires medical schools to list the types of patients that all students must see in a clinical setting. However, this remedy is insufficient because exposure to single patients with particular chief complaints is unlikely to result in competence. Anders Ericsson,13–15 who has studied the development of expertise most thoroughly and systematically, maintains that deliberate practice is critical to the acquisition of expertise. Using sports examples, he differentiates deliberate practice from playing the game (e.g., tennis matches), making the case that little is gained from playing the game. Instead, he argues that most performance increments come from deliberate practice, which involves isolating component skills, practicing them literally hundreds of times under controlled conditions, and receiving feedback after practice sessions from coaches who observe the practice. The clinical experience that medical students acquire in typical clinical clerkships and rotations is analogous to playing the game rather than to performing deliberate practice.

Likewise, there is huge variability in the amount and nature of attending physician attention, instruction, and advice received by individual medical students during clinical training. This variation in clinical experience has been cited by Petrusa16 as the likely cause of low scores on standardized patient examinations and the difficulties associated with setting criterion-referenced passing standards for standardized-patient-based clinical performance examinations.

It is helpful to identify large deficiencies in diagnostic justification skills in senior medical students and establish that these results are most likely due to lack of control of clinical training rather than inadequate control of assessment methods. However, a bigger challenge is to find a way to remedy those instructional deficiencies. Medical schools’ longitudinal clinical curricula should be aligned with a set of critical clinical competencies that every medical student must be able to handle at graduation. This would allow for a stepwise introduction of critical diagnostic justification skills, with enough time for repeated deliberate practice across a defined set of clinical experiences available to all medical students. Assessing those competencies systematically through direct observation under circumstances where the student must perform without assistance from other members of the medical team (e.g., other students, residents, and consultants) will ensure that performance deficiencies are identified early and addressed. Routine authentic assessment of clinical performance with preset standards followed by direct observation and feedback based on that observation will ensure that students are afforded the needed opportunities to master critical competencies and are not certified as competent until they do. These changes will be difficult, as they require adjustments to teaching and clinical experience that conflict with the culture and goals of medical practice.

“The AAMC Project on the Clinical Education of Medical Students” concluded with this statement:

The observations made during the conduct of … Phase I have revealed a number of major issues of concern regarding the quality of the students’ educational experiences. The major concerns include:

The lack of adequate teaching of fundamental clinical skills, including rigorously conducted formative assessment of students’ performances

The lack of appropriate patient populations for medical student experiences, particularly in certain disciplines

The lack of adequate centralized oversight and management of medical students’ clinical education1

Our findings help document the impact of variation in clinical education (between and within hospital sites and between medical students) on graduating medical student clinical proficiency. These findings also reinforce the need to investigate new curricular methods that ensure repeated deliberate practice in a defined set of clinical experiences that does not depend on the mix of patients in the hospital. The curricular methods necessary to optimize diagnostic proficiency will probably conflict with current health care system expectations for medical students.

Acknowledgments: The authors thank the developers of the cases used for these examinations, the committee members who reviewed and refined the cases, and the physicians who blindly rated the diagnosis justification responses of students. They also thank Linda Morrison, Mary Aiello, and Melissa Smock, who managed the administration of the examination and the processing of score information.


1. Nutter DO, Whitcomb ME. The AAMC Project on the Clinical Education of Medical Students. Washington, DC: Association of American Medical Colleges; 2001
2. Liaison Committee on Medical Education. Function and Structure of a Medical School—Standards for Accreditation of Medical Education Programs Leading to the M.D. Degree. May 2012. Accessed January 22, 2014
3. Colliver JA, Markwell SJ, Vu NV, Barrows HS. Case specificity of standardized-patient examinations—consistency of performance on components of clinical competence within and between cases. Eval Health Prof. 1990;13:252–261
4. Schuwirth LW, van der Vleuten CP. The use of clinical simulations in assessment. Med Educ. 2003;37(suppl 1):65–71
5. Norman G, Bordage G, Page G, Keane D. How specific is case specificity? Med Educ. 2006;40:618–623
6. Wimmers PF, Fung CC. The impact of case specificity and generalisable skills on clinical performance: A correlated traits-correlated methods approach. Med Educ. 2008;42:580–588
7. Southern Illinois University School of Medicine. SIU School of Medicine Objectives for Graduation. Updated April 8, 2013. Accessed January 22, 2014
8. Downing SM, Tekian A, Yudkowsky R. Procedures for establishing defensible absolute passing scores on performance examinations in health professions education. Teach Learn Med. 2006;18:50–57
9. Cianciolo AT, Williams RG, Klamen DL, Roberts NK. Biomedical knowledge, clinical cognition and diagnostic justification: A structural equation model. Med Educ. 2013;47:309–316
10. Williams RG, Klamen DL. Examining the diagnostic justification abilities of fourth-year medical students. Acad Med. 2012;87:1008–1014
11. Brennan R. Elements of Generalizability Theory. Iowa City, Iowa: ACT Publications; 1983
12. Klamen DL, Williams RG. The efficacy of a targeted remediation process for students who fail standardized patient examinations. Teach Learn Med. 2011;23:3–11
13. Ericsson KA. Deliberate practice and the acquisition and maintenance of expert performance in medicine and related domains. Acad Med. 2004;79(10 suppl):S70–S81
14. Ericsson KA, Charness N. Expert performance—its structure and acquisition. Am Psychol. 1994;49:725–747
15. Ericsson KA, Lehmann AC. Expert and exceptional performance: Evidence of maximal adaptation to task constraints. Annu Rev Psychol. 1996;47:273–305
16. Petrusa ER. Taking standardized patient-based examinations to the next level. Teach Learn Med. 2004;16:98–110
© 2014 by the Association of American Medical Colleges