Validity Evidence for a Patient Note Scoring Rubric Based on the New Patient Note Format of the United States Medical Licensing Examination

Park, Yoon Soo, PhD; Lineberry, Matthew, PhD; Hyderi, Abbas, MD, MPH; Bordage, Georges, MD, PhD; Riddle, Janet, MD; Yudkowsky, Rachel, MD, MHPE

doi: 10.1097/ACM.0b013e3182a34b1e


In June 2012, the United States Medical Licensing Examination (USMLE) introduced a new patient note format for the Step 2 Clinical Skills (CS) exam. One of the main changes from the previous note format is the requirement that examinees provide justification for their differential diagnoses: For each encounter with a standardized patient (SP), the examinee must document up to three focused diagnoses, in order of likelihood, and list the positive and negative findings from the history and physical examination (PE) that support these diagnoses.1 As in the previous patient note format, the examinee must document relevant history and PE findings and also list initial diagnostic studies to be ordered.

The USMLE Step 2 exam is designed to assess the application of “medical knowledge, skills, and understanding of clinical science” necessary for supervised patient care.1 Whereas Step 2 Clinical Knowledge consists of multiple-choice items, Step 2 CS includes interactions with SPs to assess information gathering and PE skills. Step 2 CS also measures the ability to communicate findings both to the patient and in written notes. Each examinee participates in 12 SP encounters, each lasting 15 minutes. After each encounter, the examinee has 10 minutes to complete the patient note. Performance on the patient note (evaluated by a physician rater) is combined with performance on the PE (assessed by the SP using a checklist) to form the Integrated Clinical Encounter subcomponent score, which measures data gathering and interpretation skills. When Step 2 CS was first introduced in June 2004, more than one-third of U.S. medical schools revised their clinical medicine course objectives, content, and curriculum.2 The recent changes to the patient note format may again lead medical schools to change the ways in which they teach and assess clinical skills.

Although it is known that trained physician raters score the patient notes, the USMLE has not disclosed details of the scoring rubric used. Previously, the patient note was scored on a nine-point global scale, but the USMLE has announced that raters will score four different components of the new patient note: documentation, diagnostic impression, justification, and workup.1,3 This change creates a need among medical schools to develop scoring rubrics for their local clinical exams that are appropriate for the new Step 2 CS patient note format.

Prior research on the psychometric properties of the Step 2 CS patient note has shown conflicting results. One study4 showed patient note scores and data gathering scores from SP encounters to have high correlations, providing a basis for combining the two scores; however, another study,5 which compared videotaped reviews of SP encounters with patient note documentation, found that only 4% of notes (8 of 207) matched in complete accuracy with performance during the encounter. Studies have also shown mixed findings in case specificity of patient note scores. For example, although Clauser et al6 found little evidence of case specificity in patient note scores, Williams and Klamen7 noted case-specific differences in pass–fail rates for scores reflecting diagnostic justification. These studies raise questions as to when and how scores using the USMLE’s new scoring rubric support valid inferences.

At the University of Illinois at Chicago College of Medicine (UIC COM), the Patient Note Scoring Rubric was recently developed to assess medical students’ performance on the new USMLE patient note format as implemented in our local graduation competency exam (GCE). In the study reported here, we examine validity evidence for our rubric using Messick’s8 unified validity framework, focusing on internal structure, relationship to other variables, consequences, and response process. Findings from this study are intended to present the psychometric qualities of scores resulting from our rubric and provide practical information for medical schools and students preparing for the new patient note requirements of the Step 2 CS exam. Furthermore, validity evidence gathered from this study will be used to address the association among dimensions of the rubric, relationship between documentation and data gathering from the SP encounter, and case specificity of patient notes, which has implications for how patient note cases are administered and scored.


Patient Note Scoring Rubric

In response to the USMLE’s changes in the formatting and scoring of patient notes in the Step 2 CS exam, a new scoring rubric was developed, based on input and consensus from a committee of UIC COM faculty members, for use in our medical school’s GCE. The resulting Patient Note Scoring Rubric provides detailed and specific criteria to help standardize and facilitate scoring across raters and to provide medical students and faculty with more specific feedback regarding deficiencies. Patient notes are scored on three dimensions: documentation of history and PE findings (Documentation), justified differential diagnosis (DDX; incorporates diagnostic impression and justification), and plan for immediate diagnostic workup (Workup). Each dimension is scored on a four-point scale, where each scale point is anchored by a brief phrase describing the quality of notes that warrant that score (see Table 1). The points assigned to each of the three dimensions (30, 60, and 10 points, respectively) were determined through consensus to reflect the importance of each dimension for students completing their core clinical clerkships. Each scale point for each dimension is associated with a point value worth 25% of the total points available for that dimension (e.g., for Documentation, with 30 possible points, a score of 2 is associated with a point value of 15).
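The weighting arithmetic described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the scoring software used at UIC COM: the names `WEIGHTS`, `SCALE_MAX`, `dimension_points`, and `composite_score` are hypothetical, and the only facts taken from the text are the 30/60/10 point allocation and the rule that each scale point is worth 25% of a dimension's points.

```python
# Illustrative sketch; all names here are hypothetical, not from the study.
# From the text: dimension weights of 30, 60, and 10 points, and a four-point
# scale on which each scale point is worth 25% of the dimension's points.
WEIGHTS = {"Documentation": 30, "DDX": 60, "Workup": 10}
SCALE_MAX = 4  # ratings run from 1 (lowest) to 4 (highest)


def dimension_points(dimension: str, rating: int) -> float:
    """Convert a 1-4 rating on one dimension into weighted points."""
    if dimension not in WEIGHTS:
        raise KeyError(f"unknown dimension: {dimension}")
    if rating not in range(1, SCALE_MAX + 1):
        raise ValueError(f"rating must be 1-{SCALE_MAX}, got {rating}")
    return WEIGHTS[dimension] * rating / SCALE_MAX


def composite_score(ratings: dict) -> float:
    """Sum weighted points over all three dimensions (maximum 100)."""
    if set(ratings) != set(WEIGHTS):
        raise ValueError("ratings must cover exactly the three dimensions")
    return sum(dimension_points(d, r) for d, r in ratings.items())
```

Under these assumptions, a note rated 2 for Documentation, 3 for DDX, and 4 for Workup would receive 15 + 45 + 10 = 70 points, and the lowest possible composite (all ratings of 1) is 25 rather than 0.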

Table 1: Patient Note Scoring Rubric*

Data collection

Data were collected from SP encounters of 170 fourth-year UIC COM medical students in May 2012; these were part of the students’ GCE. During the GCE, students conducted a 15-minute focused history and PE with five SPs, each of whom presented a case developed by local primary care faculty based on the GCE test blueprint. The cases were pediatrics phone call, dysuria, dizziness, blood in stool, and chest pain. The SPs scored students using a case-specific checklist for the history and PE (about 16 items per case); they also assessed students’ communication and interpersonal skills (CIS) using a four-point rating scale for 14 items.9 Following each SP encounter, students had 10 minutes to complete the new USMLE Step 2 CS patient note template at a computer terminal. Students’ patient notes were scored online by five clinical faculty members who had been oriented to the Patient Note Scoring Rubric. Each rater was assigned to score all responses to one case using the rubric. All scoring was completed by June 2012.

Analysis


Descriptive statistics were used to study characteristics of the data collected. A generalizability study (G study) was conducted to estimate the reliability (G and phi coefficients) of the patient note scores under varying measurement conditions. The G coefficient is used for ranking students in norm-referenced exams, whereas the phi coefficient is used for criterion-referenced exams. A fully crossed design was used, p (students) × c (cases) × d (dimensions of the rubric), using unweighted scores from each dimension of the four-point rubric.10 Cases were assumed to be random, sampled from a population (universe) of potential cases; dimensions were assumed to be fixed, as the finite set of components measured. As each patient note was scored by a single rater who was assigned to score all notes for a particular case, a rater facet could not be used to measure the variability in rater effects in the G study. Pairwise correlations were calculated between dimensions and between scores from each case.
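For readers less familiar with generalizability theory, the two coefficients can be written out for this design. The expressions below follow the standard textbook treatment of a p × c × d design with cases random and dimensions fixed (see Brennan10); they are a sketch of the usual definitions, not a reproduction of the authors' exact computations. The notation is introduced here for illustration: σ²(·) denotes a variance component estimated under the random-effects model, and n_c and n_d are the numbers of cases and dimensions.

```latex
\begin{align*}
\sigma^2(\tau)   &= \sigma^2(p) + \frac{\sigma^2(pd)}{n_d}
  && \text{(universe score variance)} \\
\sigma^2(\delta) &= \frac{\sigma^2(pc)}{n_c} + \frac{\sigma^2(pcd,e)}{n_c\, n_d}
  && \text{(relative error variance)} \\
\sigma^2(\Delta) &= \sigma^2(\delta) + \frac{\sigma^2(c)}{n_c} + \frac{\sigma^2(cd)}{n_c\, n_d}
  && \text{(absolute error variance)} \\
G &= \frac{\sigma^2(\tau)}{\sigma^2(\tau) + \sigma^2(\delta)},
\qquad
\Phi = \frac{\sigma^2(\tau)}{\sigma^2(\tau) + \sigma^2(\Delta)}
\end{align*}
```

Because the absolute error variance includes every term of the relative error variance plus the case-related terms, the phi coefficient can never exceed the G coefficient under these definitions.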

To assess for changes in consequences of testing, we compared pass–fail rates on the patient notes scored in the May 2012 GCE using the new rubric with previous pass–fail rates of 155 students who wrote patient notes on five different cases in the May 2011 GCE (which had been scored using a rubric based on a single global four-point rating). Cases administered in 2011 and 2012 differed in their specific scenarios, but both were based on the GCE test specifications and prior history of student performance to achieve approximate comparability of cases during each test administration. A 50% passing standard for the patient note was applied, based on an Angoff standard setting exercise (the details of which are beyond the scope of this study). As evidence for response process, the clinical faculty raters provided information on their experience scoring the 2012 patient notes using the new rubric. Data compilation and analyses were conducted using Stata 12 (StataCorp, College Station, Texas). The institutional review board of the University of Illinois at Chicago approved this study.


Descriptive statistics

The average composite score across the five cases was 68.2 (SD = 8.4). Table 2 presents the mean composite scores for each of the five cases as well as the percentages of students receiving the lowest (1) and highest (4) score for each dimension.

Table 2: Composite Scores and Scores by Dimension of the Patient Note Scoring Rubric, Five Cases, University of Illinois at Chicago College of Medicine Graduation Competency Examination, May 2012

Internal structure

For the G study, we used data from 153 examinees, five cases, and three dimensions of the Patient Note Scoring Rubric; 17 students’ scores were dropped because of partial missing data. The G and phi coefficients were 0.47 and 0.43, respectively. Students accounted for 5.5% of the total variance; the interaction between student and case (p × c) accounted for 19.5% of the variance. A large variance component resulting from the student and case interaction indicates case specificity, where student performance varies depending on the case administered (Table 3). Pairwise correlations between patient note scores from different cases ranged from 0.10 to 0.24. Decision study (D study) results indicated that 15 cases were required to reach a phi coefficient of 0.70; about 25 cases were required for a phi coefficient of 0.80 (Figure 1).
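The D study projection can be approximated from the published coefficients alone. The sketch below is an illustration under stated assumptions, not the authors' analysis: it assumes that, with dimensions fixed, every error term is divided by the number of cases, so the ratio of error variance to universe score variance scales as 1/n (the same logic as the Spearman-Brown formula). The function names are hypothetical, and the inputs are the rounded coefficients reported above rather than the underlying variance components.

```python
# Illustrative sketch; function names are hypothetical and the inputs are
# rounded published coefficients, not the study's variance components.

def _single_case_error_ratio(coef: float, n_cases: int) -> float:
    """Ratio of error to universe score variance for a one-case test."""
    if not 0 < coef < 1:
        raise ValueError("coefficient must lie strictly between 0 and 1")
    if n_cases <= 0:
        raise ValueError("n_cases must be positive")
    # Observed ratio (1 - coef) / coef is the error ratio at n_cases cases;
    # multiplying by n_cases rescales it to a single case.
    return (1 - coef) / coef * n_cases


def projected_coefficient(coef: float, n_cases: int, new_n_cases: float) -> float:
    """Project a G or phi coefficient observed with n_cases to new_n_cases."""
    if new_n_cases <= 0:
        raise ValueError("new_n_cases must be positive")
    ratio = _single_case_error_ratio(coef, n_cases)
    return 1 / (1 + ratio / new_n_cases)


def cases_needed(coef: float, n_cases: int, target: float) -> float:
    """Number of cases (possibly fractional) needed to reach a target coefficient."""
    if not 0 < target < 1:
        raise ValueError("target must lie strictly between 0 and 1")
    ratio = _single_case_error_ratio(coef, n_cases)
    return ratio * target / (1 - target)
```

With the rounded phi coefficient of 0.43 from five cases, `cases_needed(0.43, 5, 0.70)` gives roughly 15.5 cases and `cases_needed(0.43, 5, 0.80)` gives roughly 26.5, close to the 15 and about 25 cases reported. The small differences are presumably due to rounding of the published coefficient.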

Figure 1: Decision study (D study) of the Patient Note Scoring Rubric assuming fixed dimensions: plots of G and phi coefficients. Dotted lines indicate coefficient values of 0.70 and 0.80, as a reference for reliability indices that correspond with sufficient numbers of cases.

Table 3: Variance Components of the Patient Note Scoring Rubric Generalizability Study (G Study)

The overall correlation between Documentation and DDX was 0.44 (P < .001); between DDX and Workup, it was 0.41 (P < .001), and between Documentation and Workup, it was 0.33 (P < .001; Table 4). Correlations between Documentation and DDX and between DDX and Workup were similar, whereas correlations between Documentation and Workup were lower.

Table 4: Correlations Within Dimensions of the Patient Note Scoring Rubric*

Relationship to other variables

The overall correlation between the patient note Documentation score and the SP encounter data-gathering checklist score was 0.47 (P < .001); depending on the case, correlations ranged between 0.13 and 0.30. The overall correlation between the patient note total score and the SP encounter checklist score was 0.38 (P < .001), with wide variability across cases (−0.04 to 0.30). The correlation between the patient note total score and the SP-based CIS score was 0.20 (P < .05); correlations for the various cases ranged between 0.01 and 0.29.

Generalizability indices for the single global rating scores (four-point scale) used in the year before the new scoring rubric was introduced (i.e., for the May 2011 GCE) were about 0.07 to 0.08 higher than those for the new rubric (G coefficient = 0.54 vs. 0.47; phi coefficient = 0.51 vs. 0.43). With the previous rubric based on global ratings, 11 cases were required to reach a phi coefficient of 0.70, and 19 cases were required for a phi coefficient of 0.80. Although the new scoring rubric would require 15 cases for a phi coefficient of 0.70, a direct comparison of reliability indices may not be appropriate, given the additional dimensions measured by the new rubric. Moreover, the five cases were not the same across the two examinations.

Consequences


Comparison of pass–fail rates between the previous global rating and the new rubric found no meaningful differences. Using the 50% passing level determined through an Angoff standard setting procedure, the fail rate with the new rubric was 1.3%. The fail rate with the global rating in 2011, which used a different set of cases and did not require students to justify their differential diagnoses, was 0%.

Response process

Raters reported that it took them about five to seven minutes to score each patient note. They indicated they had greater ease in scoring the Documentation dimension than the DDX or Workup dimensions. Some raters valued thoroughness, whereas others valued concise presentation of key findings. Raters expressed concern that students did not fully understand the differential diagnosis task because students provided both supportive and refuting findings in their justification (as expected in clinical rotations) rather than supportive features alone (as requested in the exam). They noted that some students provided unnecessary details in the Documentation section and then had incomplete DDX and Workup sections. They felt that the rubric favored pertinent positive findings, relative to pertinent negative findings.

Discussion


To the best of our knowledge, this study provides the first attempt by a medical school to develop and validate a scoring rubric for a local, SP-based exam using the new USMLE Step 2 CS patient note format. Our results provide some validity evidence for the use of scores from the Patient Note Scoring Rubric to assess three dimensions: Documentation, DDX, and Workup.

Results indicated that for the dysuria and dizziness cases, 27.7% and 21.5% of students, respectively, received the lowest score (1) for Documentation; however, 96% and 82% of these students received higher scores (2 or 3) for the DDX dimension. There are several possible explanations. First, some students may not have elicited key history and PE findings or may have provided inaccurate findings in the documentation section of the patient note template, but may have been able to make a reasonable differential diagnosis that was at least weakly supported. Second, perhaps faculty raters did not follow the instructions given to the students and the scoring rubric criteria but, instead, gave high scores for thoroughness of documentation rather than for documentation of relevant key findings. Incorrect usage of the scale by raters would contribute construct-irrelevant variance, thereby threatening the response process evidence for validity. Alternatively, faced with instructions asking only for key findings, students may not have understood what was being requested or may have felt compelled to be thorough to garner all possible points for documentation; faculty raters may have penalized them for not being focused.

The G study resulted in variance components that were consistent with estimates reported in Clauser et al6; however, we found stronger evidence of case specificity than in prior studies that used a single global rating. In this study, the person–case interaction accounted for nearly 20% of the total variance compared with the 12% reported by Clauser et al6 for patient notes scored using a single nine-point scale. The low pairwise correlations between patient note scores provide additional evidence for case specificity. Developers of cases for evaluating patient note skills should be aware that differences in student performance may depend on the specific set of cases administered. Our findings underscore the need for a large number of cases (15 based on the phi coefficient) when high-stakes summative decisions are required from patient note scores. Students should practice writing notes for a wide range of cases, clinical problems, or clinical presentations as their performance can vary from case to case.11 They need to practice hypothesis-driven clinical encounters, emphasizing key clinically discriminating findings and considering the differential diagnoses early in the encounters.12

Although it is unclear what specific correlational values constitute a high or low association in this context, pairwise associations between scores on the first two dimensions (Documentation and DDX) and the last two dimensions (DDX and Workup) of the Patient Note Scoring Rubric had higher correlations than did the first and last dimensions (Documentation and Workup). Although correlations were not substantially different, this finding indicates that good Documentation was associated with good DDX, which was associated in turn with good Workup (and vice versa), providing some evidence for sequential dependency of scores. This has implications for effective performance strategies for students: It encourages development of strategies to promote better linkage and integration of clinical reasoning across dimensions. Further research on dependencies among the dimensions and their implications is needed.

Moderate correlations between SP encounter checklist scores and patient note Documentation scores indicate that the two scores may measure similar, yet different skills. It is unclear whether students are being thorough or using a key-findings approach in completing the SP encounter and/or the patient note. Further research should be conducted to examine methods to reinforce the key-findings approach in both settings and to refine methods to combine the PE checklist and patient note scores, considering their implications for validity.

There were no meaningful differences in pass–fail rates between a previously used global rating scale and the new scoring rubric. However, there are educational consequences resulting from the new note format. The new rubric provides more detailed information about student performance, allowing remediation to focus on the specific dimensions of how to identify and document key findings, link findings to specific diagnoses, and link the differential to further workup.

Raters felt the rubric was useful; however, their comments indicate a need for additional rater training focusing on the development of a shared understanding of evaluating key findings using the Documentation dimension. Moreover, training can refine raters’ ability to discriminate between scoring categories of the rubric. Issues pertaining to cognitive overload of raters scoring the patient note (e.g., fatigue or difficulty in processing the scoring task) and changes in scoring behavior (rater drift) are additional concerns that must be addressed in future research.13,14

This study has limitations. It was conducted at a single institution and examined a single cohort of students who completed only five cases. Feasibility considerations prevented a direct analysis of the rater facet in the G study; double-scoring of future patient notes will allow a robust analysis of rater effects. Ongoing topics of inquiry include different wording, different weighting, and different passing standards for the rubric to increase the validity and educational utility of scores. A revised rubric that separates differential diagnosis and justification may be warranted, as recent findings indicate that justification is likely to promote both analytical and nonanalytical aspects of diagnostic reasoning.15 Nonetheless, given the exploratory nature of this study and its timing in relation to recent changes in the testing format of the USMLE Step 2 CS patient note, these results have meaningful implications for the field. Additional studies replicating the procedures in this study using larger and more heterogeneous samples across multiple institutions should be conducted to test the generalizability of the results presented here.

In conclusion, we believe our findings show that the scoring rubric developed at UIC COM for assessing performance on the new patient note format is a useful tool for educational purposes and can contribute to improving clinical training for medical students. When an appropriate number of cases is used, scores from the rubric can yield reliable results. The validity evidence presented here provides initial support for using scores from this rubric as an indicator of students’ clinical skills.

References


1. Federation of State Medical Boards and the National Board of Medical Examiners. 2013 Step 2 Clinical Skills (CS) Content Description and General Information. Philadelphia, Pa: United States Medical Licensing Examination; 2012. Accessed June 20, 2013
2. Gilliland WR, La Rochelle J, Hawkins R, et al. Changes in clinical skills education resulting from the introduction of the USMLE step 2 clinical skills (CS) examination. Med Teach. 2008;30:325–327
3. De Champlain A, Swygert K, Swanson DB, Boulet JR. Assessing the underlying structure of the United States Medical Licensing Examination Step 2 test of clinical skills using confirmatory factor analysis. Acad Med. 2006;81(10 suppl):S17–S20
4. Harik P, Clauser BE, Grabovsky I, Margolis MJ, Dillon GF, Boulet JR. Relationships among subcomponents of the USMLE Step 2 Clinical Skills examination, the Step 1, and the Step 2 Clinical Knowledge examinations. Acad Med. 2006;81(10 suppl):S21–S24
5. Szauter KM, Ainsworth MA, Holden MD, Mercado AC. Do students do what they write and write what they do? The match between the patient encounter and patient note. Acad Med. 2006;81(10 suppl):S44–S47
6. Clauser BE, Harik P, Margolis MJ, Mee J, Swygert K, Rebbecchi T. The generalizability of documentation scores from the USMLE Step 2 Clinical Skills examination. Acad Med. 2008;83(10 suppl):S41–S44
7. Williams RG, Klamen DL. Examining the diagnostic justification abilities of fourth-year medical students. Acad Med. 2012;87:1008–1014
8. Messick S. Standards of validity and the validity of standards in performance assessment. Educ Meas. 1995;14:5–8
9. Iramaneerat C, Yudkowsky R, Myford CM, Downing SM. Quality control of an OSCE using generalizability theory and many-faceted Rasch measurement. Adv Health Sci Educ Theory Pract. 2008;13:479–493
10. Brennan RL. Generalizability Theory. New York, NY: Springer-Verlag; 2001
11. McCarty T, Parkes MV, Anderson TT, Mines J, Skipper BJ, Grebosky J. Improved patient notes from medical students during Web-based teaching using faculty-calibrated peer review and self-assessment. Acad Med. 2005;80(10 suppl):S67–S70
12. Yudkowsky R, Otaki J, Lowenstein T, Riddle J, Nishigori H, Bordage G. A hypothesis-driven physical examination learning and assessment procedure for medical students: initial validity evidence. Med Educ. 2009;43:729–740
13. Bejar I. Rater cognition: Implications for validity. Educ Meas. 2012;31:2–9
14. Harik P, Clauser BE, Grabovsky I, Nungester RJ, Swanson DB, Nandakumar R. An examination of rater drift within a generalizability framework. J Educ Meas. 2009;46:43–58
15. Cianciolo AT, Williams RG, Klamen DL, Roberts NK. Biomedical knowledge, clinical cognition and diagnostic justification: A structural equation model. Med Educ. 2013;47:309–316
© 2013 by the Association of American Medical Colleges