
Characteristics and Implications of Diagnostic Justification Scores Based on the New Patient Note Format of the USMLE Step 2 CS Exam

Yudkowsky, Rachel MD, MHPE; Park, Yoon Soo PhD; Hyderi, Abbas MD, MPH; Bordage, Georges MD, PhD

doi: 10.1097/ACM.0000000000000900

Purpose To determine the psychometric characteristics of diagnostic justification scores based on the patient note format of the United States Medical Licensing Examination Step 2 Clinical Skills exam, which requires students to document history and physical findings, differential diagnoses, diagnostic justification, and plan for immediate workup.

Method End-of-third-year medical students at one institution wrote notes for five standardized patient cases in May 2013 (n = 180) and 2014 (n = 177). Each case was scored using a four-point rubric to rate each of the four note components. Descriptive statistics and item analyses were computed and a generalizability study done.

Results Across cases, 10% to 48% provided no diagnostic justification or had several missing or incorrect links between history and physical findings and diagnoses. The average intercase correlation for justification scores ranged from 0.06 to 0.16; internal consistency reliability of justification scores (coefficient alpha across cases) was 0.38. Overall, justification scores had the highest mean item discrimination across cases. The generalizability study showed that person–case interaction (12%) and task–case interaction (13%) had the largest variance components, indicating substantial case specificity.

Conclusions The diagnostic justification task provides unique information about student achievement and curricular gaps. Students struggled to correctly justify their diagnoses; performance was highly case specific. Diagnostic justification was the most discriminating element of the patient note and had the greatest variability in student performance across cases. The curriculum should provide a wide range of clinical cases and emphasize recognition and interpretation of clinically discriminating findings to promote the development of clinical reasoning skills.

Funding/Support: This work was supported by local department resources.

Other disclosures: None reported.

Ethical approval: Ethical approval for this study was obtained from the University of Illinois at Chicago institutional review board, protocol 20050091; renewed November 7, 2014.

Previous presentations: Originally presented as an oral research abstract at the Association of American Medical Colleges Medical Education Meeting, Chicago, Illinois, November 2014.

Correspondence: Rachel Yudkowsky, MD, MHPE, Department of Medical Education 986 CMET, University of Illinois at Chicago College of Medicine, 808 S. Wood St. MC 591, Chicago, IL 60612; telephone: (312) 996-3598; e-mail:

Clinical reasoning is a fundamental skill that medical students must acquire in the course of training to become competent clinicians. Clinical reasoning includes the ability to judiciously gather patient data through a history and physical exam (PE) and interpret those data to generate a differential diagnosis and plan; an essential element in moving from data collection to differential diagnosis is the ability to correctly link key clinical findings with corresponding diagnoses.1–4 Clinical reasoning takes place in the minds of students and can be assessed indirectly through patient write-ups and oral presentations to residents and faculty preceptors, but only if preceptors probe for the rationale behind the student’s differential diagnosis to determine whether the student is making correct or incorrect links between findings and diagnoses. A student may reach the correct diagnosis despite an incorrect link to a clinical feature that is not clinically discriminating—for example, concluding that “this must be strep throat because his throat looks red.” On the other hand, a lack of awareness of evidence-based links will result in students who do not seek and cannot interpret key clinical findings.

One way that medical schools and licensing bodies formally assess clinical reasoning is by asking students to write a “patient note” following a standardized patient (SP) encounter.4–6 Typically these notes require students to document the history and PE findings, generate a problem list or differential diagnosis, and list their plan for investigation or management. Recent studies have reported efforts to include an assessment of students’ ability to link findings and diagnoses. Durning et al7 used a postencounter form instructing students to provide a one-sentence summary statement, a problem list, a differential diagnosis, a most likely diagnosis, and four key history and/or PE findings to support the most likely diagnosis (“supporting data”). Williams et al8,9 studied postencounter notes that included a diagnostic justification component in which students explained how they used the patient and laboratory data to move from their initial differential to their final diagnosis, including listing the key findings and how these influenced their thinking. Students were given 45 minutes to complete the postencounter note for each case. Williams et al used a global four-point scale (poor to excellent) to rate four elements of the diagnostic justification response: the differential, final diagnosis, recognition and use of key findings, and thought processes and clinical knowledge utilization. A large proportion of their students (37% across three class years) struggled with the diagnostic justification task; performance was highly variable and case specific. They concluded that the diagnostic justification task provided useful information beyond the typical SP encounter assessment elements and that the typical clinical curriculum does not provide students with sufficient opportunities to master this skill.

The United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills Examination (Step2CS)6 consists of twelve 15-minute SP encounters; students have 10 minutes to complete the postencounter note following each encounter. In 2012 the USMLE changed the format of their note to require a ranked and justified differential, in which students are asked to list the history and PE findings that support each of their diagnoses, in addition to documenting findings and providing a plan for immediate diagnostic workup. In response to this change, the University of Illinois at Chicago (UIC) updated the patient note format used in its graduation competency exam (GCE) to mirror that of the USMLE and developed a rubric to score the new note format. Initial validity evidence for that rubric was reported by Park et al.10 The purpose of the present study was to obtain additional validity evidence for the UIC patient note rubric by investigating the psychometric characteristics and curricular implications of the diagnostic justification scores in the context of the USMLE 2012 format for the postencounter notes.




We studied patient notes written during UIC’s GCE administered in 2013 and 2014. The GCE is a mandatory, moderate-stakes exam with formative and summative assessment and curriculum evaluation goals; the SP component of the GCE consisted of five SP encounters that were structured similarly to the USMLE Step2CS6 and served in part to prepare students for that exam. Students had 15 minutes to obtain a focused history and PE and communicate their initial findings to the patient, followed by 10 minutes to complete a structured patient note in which students had to (1) document the relevant history and PE findings (“Documentation”), (2) list up to three diagnoses, ranked in order of likelihood (“DDX”), (3) justify their differential diagnosis by listing the positive and negative history and PE findings they elicited that support each of the diagnoses (“Justification”), and (4) list up to five items for the immediate diagnostic workup of the patient (“Workup”). Passing scores were set by faculty using a modified Angoff method11,12; students who failed the SP portion of the GCE met with a faculty preceptor to review their scores and video recordings and develop an individual remediation plan.

The cases included in the 2013 and 2014 GCEs were based on a blueprint of 25 chief complaints that UIC students are expected to master by the end of their third year. SP training materials and a gold standard “exemplar note” were developed for each case by faculty teams. SPs participated in three to six hours of training for each case, including training in case portrayal, history and PE checklist completion, communication scale completion, and providing verbal feedback to students. Quality assurance measures included periodic video review during the exam to ensure continued accuracy and consistency of portrayal and checklist completion.

The scoring rubric consisted of a four-point anchored scale for each of the four note tasks: Documentation, DDX, Justification, and Workup (see Table 1). A full description of the development of the rubric and initial validity evidence was published by Park et al.10 The previously published version of the rubric was modified on the basis of feedback from faculty raters; the original “DDX” section, which included an integrated rating of both the appropriateness of the diagnoses and the quality of the supporting links, was split into separate sections for DDX and Justification. The version used in the 2013 exam was further revised for the 2014 exam: The first two levels of the justification section, which had been “no justification provided” and “many incorrect or missing links,” respectively, in 2013, were merged in 2014 to a single level, and the second level was changed to “some incorrect or missing links between findings and Dx” (see Table 1). Levels 3 and 4 were unchanged. This change was needed to make level 2 correspond to marginally acceptable performance, in line with the rubric for the other tasks; by 2013 there were very few students who offered no justification at all.

Table 1


Each note was scored by a single rater. Raters included curricular deans, clerkship directors, and clinical faculty. Before scoring the notes all raters met as a group to review the note scoring rubric and the exemplar notes and to discuss scoring guidelines for each case. This group also served to provide a content validity check on the exemplar notes, confirming that key history and PE findings to be documented were flagged, that the differential was appropriate to the clinical case and to the developmental stage of the students, and that the history and PE items listed to justify each diagnosis were correct and complete.

In 2013 each case was assigned to one of five raters, who scored all the notes of that case. In 2014 the notes from each case were divided among two or three raters, for a total of 11 raters. The raters for each case met separately for a “calibration meeting” to score and discuss 10 notes, ensuring a shared mental model and agreement about how to apply the rubric and scoring guidelines among the raters of each case. Once training and calibration were completed, raters scored their assigned notes individually online over a period of about one month, using CAE LearningSpace.13 In a separate rater analysis,14 a subset of the 2014 GCE (55 randomly selected notes per case, 31% of full data) was rescored by six raters from the original 11; interrater agreement (weighted kappa) was 0.79 (Documentation = 0.63, DDX = 0.90, Justification = 0.48, and Workup = 0.54), indicating reproducible scores by the raters.15 The full dataset using single ratings per note was used for this study.
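As a rough illustration of the agreement statistic mentioned above, the sketch below computes Cohen's weighted kappa for two raters on the four-point rubric. It is not the analysis from the separate rater study: that study involved six raters, and the text does not say which weighting scheme was used, so the linear weights (`power=1`), the function name `weighted_kappa`, and the example ratings are all assumptions made here for illustration.

```python
from collections import Counter

def weighted_kappa(r1, r2, levels=(1, 2, 3, 4), power=1):
    """Cohen's weighted kappa for two raters.

    power=1 gives linear disagreement weights, power=2 quadratic weights.
    """
    if len(r1) != len(r2) or not r1:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(r1)
    span = max(levels) - min(levels)
    joint = Counter(zip(r1, r2))          # observed joint counts
    m1, m2 = Counter(r1), Counter(r2)     # marginal counts per rater
    observed = 0.0
    expected = 0.0
    for a in levels:
        for b in levels:
            w = (abs(a - b) / span) ** power   # disagreement weight in [0, 1]
            observed += w * joint[(a, b)] / n
            expected += w * (m1[a] / n) * (m2[b] / n)
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one level")
    return 1.0 - observed / expected

# Hypothetical rubric levels assigned by two raters to the same ten notes.
rater_a = [1, 2, 2, 3, 3, 4, 4, 2, 3, 1]
rater_b = [1, 2, 3, 3, 2, 4, 3, 2, 3, 2]
kappa = weighted_kappa(rater_a, rater_b)
```

With linear weights, a one-level disagreement is penalized a third as much as a three-level disagreement, which suits an ordinal rubric where adjacent levels are near misses.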

The minimum passing score for the note rubric was set by the rater group using a modified (extended) Angoff method.12,16 For each task on the rubric, each rater indicated the expected performance of a borderline (minimally competent) student. Overall, raters considered level 2 of each task to be the minimally acceptable performance, resulting in a minimum passing score of 50%. This passing score was applied across all cases.
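The arithmetic behind the 50% cut score can be sketched as follows. This is a minimal illustration, not the authors' scoring code: the task weights are hypothetical placeholders (the article states only that Workup carried 10 of 100 points), and `note_percent` is a name introduced here. The sketch also assumes each rubric level is converted to a fraction of the four-point maximum.

```python
# Hypothetical task weights summing to 100 points; only the Workup
# weight (10 points) is stated in the article, the rest are assumed.
WEIGHTS = {"Documentation": 30, "DDX": 30, "Justification": 30, "Workup": 10}
MAX_LEVEL = 4  # four-point anchored scale

def note_percent(levels, weights=WEIGHTS):
    """Return the weighted note score as a percentage (0-100)."""
    total = sum(weights.values())
    earned = sum(weights[task] * levels[task] / MAX_LEVEL for task in weights)
    return 100.0 * earned / total

# A borderline student rated at level 2 on every task.
borderline = {task: 2 for task in WEIGHTS}
passing_score = note_percent(borderline)
```

Under these assumptions a note at level 2 on every task earns half the available points whatever the weights are, which is why a single passing score could be applied across all cases.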

In addition to the notes, assessment elements for SP cases included history and PE checklists and a 14-item behaviorally anchored, patient-centered communication scale,17 both scored by the SPs. Relationships between the note scores and the SP ratings were reported previously10 and are outside the scope of this paper.



Descriptive statistics were used to examine the distribution of clinical reasoning task performance using unweighted and weighted scores for each case and for the overall GCE. Passing rates were calculated for each case. The number of cases on which students received “1” or “2” score levels was tabulated to examine the distribution of students with low performance on the Justification task.

Item discrimination (correlation between each clinical reasoning task and the case-level score), interitem correlation (average pairwise correlation between clinical reasoning tasks), and Cronbach alpha (internal consistency between clinical reasoning tasks) were calculated to evaluate psychometric characteristics of each case-level clinical reasoning task. A generalizability study (G-study) was conducted using the person (p) × case (c) × task (t) design, assuming cases were random samples from a population of potential cases; tasks were assumed to be fixed as the finite set of clinical reasoning skills assessed. Variance components were estimated with and without justification, to examine factors affecting variability in scores. Changes in person-by-case interactions were examined when justification was excluded from the G-study design. Decision studies were conducted to project the number of cases required to reach a sufficient level of reliability.
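The three item statistics named above can be sketched for a single case as follows. This is an illustrative Python version, not the Stata analysis used in the study: the score data are invented, the function names are introduced here, and item discrimination is taken as the uncorrected item-total correlation, which is an assumption because the text does not say whether the task was removed from the total before correlating.

```python
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def cronbach_alpha(items):
    """Coefficient alpha; items is one list of student scores per task."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variance = sum(pvariance(col) for col in items)
    return k / (k - 1) * (1 - item_variance / pvariance(totals))

def item_discrimination(items, index):
    """Correlation of one task with the total (case-level) score."""
    totals = [sum(scores) for scores in zip(*items)]
    return pearson(items[index], totals)

def mean_interitem_correlation(items):
    """Average pairwise correlation between tasks."""
    pairs = [(i, j) for i in range(len(items)) for j in range(i + 1, len(items))]
    return mean(pearson(items[i], items[j]) for i, j in pairs)

# Hypothetical rubric levels for six students on one case, in the order
# Documentation, DDX, Justification, Workup.
items = [
    [3, 2, 4, 3, 2, 4],
    [3, 2, 3, 4, 2, 3],
    [2, 1, 4, 3, 1, 4],
    [2, 2, 3, 3, 1, 3],
]
alpha = cronbach_alpha(items)
justification_discrimination = item_discrimination(items, 2)
average_r = mean_interitem_correlation(items)
```

Alpha is unchanged whether population or sample variances are used, provided the same choice is made for the items and the total, because the correction factor cancels in the ratio.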

Data compilation and analyses were conducted using Stata 13 (StataCorp, College Station, Texas). The institutional review board of the University of Illinois at Chicago approved this study.



A total of 180 and 177 M3 students participated in the GCE in 2013 and 2014, respectively, representing the total class size with the exception of a few students who had not yet completed their core clerkships. The student cohorts in 2013 and 2014 were 53% (2013) and 51% (2014) female, 35% and 40% white, 26% and 24% Asian, 20% and 17% Hispanic, 10% and 12% black, and 9% and 6% no response, respectively. Mean MCAT scores were 10.08 (2013) and 10.38 (2014). The number of students encountering each case varied slightly because a backup case was deployed in a few instances when the assigned SP was not able to perform the case.

The distribution of scores in the 2014 GCE abdominal pain case is shown in Table 1 as a typical example of student performance. For this case, about a third of the students (34%) poorly documented key history and PE findings; 45% did not correctly identify all three top diagnoses to be considered; 40% had some or many missing or incorrect links between findings and diagnoses; 54% proposed an ineffective workup plan, and 7% recommended tests that would put the patient at unnecessary risk or danger.

Students’ performance varied across cases and tasks; unweighted mean task scores for each case are shown in Table 2. Justification scores were generally higher in 2013 than in 2014, following the change to the rubric collapsing “no justification” (level 1 in 2013) and “many missing or incorrect links” (level 2 in 2013) to a single level in 2014. Students with poor or marginal diagnostic justification scores (justification ratings at level 1 or 2) for three or more cases comprised 9% of the class (N = 16) in 2013, and 30% of the class (N = 53) in 2014 after the 2013 levels were collapsed. Weighted note scores are shown in Table 3, along with case- and exam-level passing rates.

Table 2


Table 3


The diagnostic justification task had the highest item discrimination and lowest average interitem (i.e., intertask) correlation among the clinical reasoning tasks, indicating that performance on justification was the most predictive and unique factor contributing to overall note performance (Table 2). Internal consistency reliability or coefficient alpha across tasks within a case ranged from 0.37 (wrist pain 2013) to 0.77 (dizziness 2014) (Table 2).

The generalizability study showed considerable case and task specificity (Table 4). Aside from the residual error term, the largest source of variance was the person × case interaction (12% in 2013; 21% in 2014), followed by either case difficulty (6.2% in 2013; 11.6% in 2014) or the case × task interaction reflecting differences in task difficulty by case (13% in 2013; 7.9% in 2014). G- and Phi-coefficients based on five cases were 0.35 and 0.28, respectively, in 2013, and 0.46 and 0.38 in 2014. D-studies (not shown) indicated that increasing to 20 cases would result in G-coefficients of 0.69 (2013) or 0.77 (2014). Eliminating the justification score to create a note score corresponding to the note format used prior to the USMLE format change resulted in decreased reliability, with G- and Phi-coefficients of 0.28 and 0.21, respectively, in 2013, and 0.34 and 0.26 in 2014 (Table 4). Person-by-case interaction decreased substantially when justification was excluded (12.2% to 9.6% in 2013; 21.0% to 14.2% in 2014).
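The logic of a D-study projection can be sketched with a short calculation. This is an approximation, not the authors' decision study: it assumes that all relative error variance shrinks in proportion to the number of cases (which holds when cases are the only random facet and tasks are fixed), and it starts from the rounded five-case coefficients rather than the underlying variance components. The function name `project_g` is introduced here for illustration.

```python
def project_g(g_observed, n_observed, n_target):
    """Project a G-coefficient from n_observed cases to n_target cases.

    Assumes relative error variance is inversely proportional to the
    number of cases, so the projection has the Spearman-Brown form.
    """
    if not 0 < g_observed < 1:
        raise ValueError("g_observed must be strictly between 0 and 1")
    if n_observed <= 0 or n_target <= 0:
        raise ValueError("case counts must be positive")
    # Ratio of single-case error variance to universe-score variance.
    error_ratio = n_observed * (1 - g_observed) / g_observed
    return 1 / (1 + error_ratio / n_target)

# Five-case G-coefficients reported in the text, projected to 20 cases.
projected_from_035 = project_g(0.35, n_observed=5, n_target=20)
projected_from_046 = project_g(0.46, n_observed=5, n_target=20)
```

Because the projection increases with the starting coefficient, the cohort with the higher five-case coefficient always has the higher projected coefficient at any fixed number of cases. Small differences from the published projections are expected because the inputs here are rounded.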

Table 4




These results provide additional validity evidence for the UIC patient note scoring rubric based on the updated (2012) USMLE Step2CS note format.10 We found that scoring the new diagnostic justification task of the patient note had benefits for both the student assessment and curriculum evaluation functions of our GCE. The item analysis indicated that of the four tasks in the patient note (Documenting the history and PE, DDx, Justification, and Workup), the justification task was the most predictive of overall GCE performance and provided unique information about students’ clinical reasoning abilities. Including the justification task increased the reliability (G-coefficient) of the note and, hence, of the exam overall. Clinical reasoning was both case specific and task specific, and the justification task increased case specificity; variability in student performance between cases increased when justification was included. These findings emphasize the need to teach and assess reasoning tasks systematically across a broad sample of clinical cases.

From a curricular perspective the patient note allowed us to identify curricular gaps such as clinical reasoning tasks that a majority or substantial minority of students struggled to achieve. Students were challenged by all four clinical reasoning tasks assessed in the patient notes; nearly half the students (47% in 2013 and 48% in 2014) had unsatisfactory (“1”) or marginal scores (“2”) on the core tasks of documenting key clinical history and PE findings, developing a differential diagnosis, and linking key findings to their respective diagnoses. The justification score was the best predictor of clinical reasoning overall, highlighting the critical impact of correctly linking findings with diagnoses and the need for additional curricular attention to this task.

Poor performance on the note rubric implies that students completing their core clerkships may not be well prepared to perform the essential tasks included in three of the Association of American Medical Colleges Core Entrustable Professional Activities for Entering Residency18: gathering a history and performing a PE, prioritizing a differential diagnosis, and recommending and interpreting common diagnostic and screening tests. Our results are consistent with those of Williams et al,9 who found that 23% to 57% of students were judged to have borderline or poor diagnostic justification on more than 50% of the nine cases assessed. We agree with their conclusion that the opportunistic nature of clinical training and the highly variable quality of clinical supervision do not support deliberate practice or mastery in clinical reasoning and can result in uneven and ineffective acquisition of case-specific knowledge. Given the very real obstacles to changing the current structure of clinical training, additional complementary approaches should be explored that would allow systematic blueprinting and deliberate practice through ancillary clinical experiences including simulation-based training with SPs, virtual patients, and mannequins. A recent multicenter study found that up to 50% of the clinical experiences of student nurses could be replaced with simulation-based experiences with no detriment to learning outcomes.19 A systematic mastery learning program of simulation-based experiences within medical students’ clinical rotations could be a powerful tool toward improving clinical reasoning. 
Other curricular interventions that could promote the development of clinical reasoning skills include teaching hypothesis-driven history and PEs in the preclinical years20 and focusing on clinical reasoning tasks during clinical supervision and rounds—for example, by asking students to justify their differential and working diagnoses in write-ups and oral presentations using the one-minute-preceptor21 or SNAPPS22 approach.

Nearly half of our students (44% in 2013 and 41% in 2014) produced a dangerous or ineffective plan for diagnostic workup. Diagnostic workup skills tend to be emphasized more in the fourth year of medical school, subsequent to the timing of this exam, as indicated by our faculty choosing to assign only 10/100 points to this task. However, given the highly variable nature of clinical experiences in the fourth year, our findings suggest that it would be wise to devote more attention to teaching diagnostic workups, especially in determining when to initiate high-risk and invasive diagnostic procedures, during core third-year clinical experiences.

The educational impact of the new note format and its scoring rubric is as yet unknown. Both Durning et al7 and Williams et al8,9 asked students to justify only their final working diagnosis, whereas the USMLE note and the corresponding UIC rubric instruct students to list findings supporting each of the diagnoses listed in their differential. Justifying all their diagnoses may help students avoid confirmation bias and encourage them to attend to those features that clinically discriminate between competing diagnoses. A high-stakes exam such as the USMLE Step2CS is a powerful motivator, and our students see our local GCE as in part a practice test for the clinical skills exam. We expect that explicitly assessing students’ ability to link findings and diagnoses will encourage them to focus their attention on mastering this critical clinical reasoning skill. Exploring this impact will be the topic of a future study.

This study is subject to several limitations. The study was conducted in a single medical school; however, it included over 350 students from two cohorts of one of the most diverse student populations in the United States.23 Individual notes were scored by a single rater; however, interrater reliability was acceptable (weighted kappa 0.79). The study was based on only 10 SP cases. The rubric was developed locally and may not correspond to the rubric or section weights used by the USMLE to score the Step2CS notes.



Validity evidence supports the use of the UIC note rubric to score the updated USMLE patient note format. The new diagnostic justification task of the patient note provides unique and useful information about student achievement and curricular gaps. About half of the students struggled to correctly justify their diagnoses; performance was highly case specific. Curricula should provide students with a wide range of clinical cases, supplemented by simulation-based experiences if necessary, and emphasize recognition and interpretation of clinical findings to promote the development of clinical reasoning skills.

Acknowledgments: The authors wish to thank Drs. Ananya Gangopadhyaya, Olga Garcia-Bedoya, Samuel Grief, Asra Khan, Octavia Kincaid, Heeren Patel, Nimmi Rajagopal, Radhika Sreedhar, Pavan Srivastava, Alexandra Vanmeter, and Samreen Vora for scoring the notes.



1. Elstein AS, Shulman LS, Sprafka SA. Medical Problem Solving: An Analysis of Clinical Reasoning. Cambridge, Mass: Harvard University Press; 1978
2. Elstein AS, Schwarz A. Clinical problem solving and diagnostic decision making: Selective review of the cognitive literature. BMJ. 2002;324:729–732
3. Eva KW. What every teacher needs to know about clinical reasoning. Med Educ. 2005;39:98–106
4. Yudkowsky R, Park YS, Riddle J, Palladino C, Bordage G. Limiting checklist items to clinically-discriminating items: Improved validity of test scores. Acad Med. 2014;89:1057–1062
5. Berger AJ, Gillespie CC, Tewksbury LR, et al. Assessment of medical student clinical reasoning by “lay” vs physician raters: Inter-rater reliability using a scoring guide in a multidisciplinary objective structured clinical examination. Am J Surg. 2012;203:81–86
6. United States Medical Licensing Examination. Step 2 clinical skills (CS): Content description and general information. Updated February 2015. Accessed July 22, 2015
7. Durning SJ, Artino A, Boulet J, et al. The feasibility, reliability, and validity of a post-encounter form for evaluating clinical reasoning. Med Teach. 2012;34:30–37
8. Williams RG, Klamen DL. Examining the diagnostic justification abilities of fourth-year medical students. Acad Med. 2012;87:1008–1014
9. Williams RG, Klamen DL, Markwell SJ, Cianciolo AT, Colliver JA, Verhulst SJ. Variations in senior medical student diagnostic justification ability. Acad Med. 2014;89:790–798
10. Park YS, Lineberry M, Hyderi A, Bordage G, Riddle J, Yudkowsky R. Validity evidence for a patient note scoring rubric based on the new patient note format of the United States Medical Licensing Examination. Acad Med. 2013;88:1552–1557
11. Angoff WH. Scales, norms, and equivalent scores. In: Thorndike RL, ed. Educational Measurement. 2nd ed. Washington, DC: American Council on Education; 1971:508–600
12. Downing SM, Tekian A, Yudkowsky R. Procedures for establishing defensible absolute passing scores on performance examinations in health professions education. Teach Learn Med. 2006;18:50–57
13. CAE Healthcare. Audiovisual solutions—LearningSpace. Accessed July 22, 2015
14. Park YS, Hyderi A, Bordage G, Xing K, Yudkowsky R. Examining the inter-rater reliability and generalizability of patient note scores using a scoring rubric for patient notes based on the USMLE Step-2 CS Format [unpublished manuscript]. Chicago, Ill: University of Illinois at Chicago; 2015
15. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174
16. Hambleton RK, Plake BS. Using an extended Angoff procedure to set standards on complex performance assessments. Appl Meas Educ. 1995;8:41–56
17. Iramaneerat C, Myford CM, Yudkowsky R, Lowenstein T. Evaluating the effectiveness of rating instruments for a communication skills assessment of medical residents. Adv Health Sci Educ Theory Pract. 2009;14:575–594
18. Association of American Medical Colleges. Core entrustable professional activities for entering residency: Curriculum developers’ guide. Accessed July 22, 2015
19. Hayden JK, Smiley RA, Alexander M, Kardong-Edgren S, Jeffries PR. Supplement: The NCSBN National Simulation Study: A longitudinal, randomized, controlled study replacing clinical hours with simulation in prelicensure nursing education. J Nurs Regul. 2014;5:C1–S64
20. Yudkowsky R, Otaki J, Lowenstein T, Riddle J, Nishigori H, Bordage G. A hypothesis-driven physical exam for medical students: initial validity evidence. Med Educ. 2009;43:729–740
21. Neher JO, Gordon KC, Meyer B, Stevens N. A five-step “microskills” model of clinical teaching. J Am Board Fam Pract. 1992;5:419–424
22. Wolpaw TM, Wolpaw DR, Papp KK. SNAPPS: A learner-centered model for outpatient education. Acad Med. 2003;78:893–898
23. University of Illinois at Chicago College of Medicine. Diversity at the College of Medicine. Accessed July 22, 2015
© 2015 by the Association of American Medical Colleges