Clinical reasoning is a difficult skill to assess directly, partly because the cognitive processes that clinicians use are only indirectly measurable and because it includes skills in multiple domains, such as history taking, physical examination, and medical knowledge.1 Several methods exist to assess clinical reasoning, such as key features examinations,2 script concordance tests,3,4 and some multiple-choice tests.5 Several of these tools to assess clinical reasoning skills in the context of patient care have been validated.6,7 In these validated tools, the ability to create a concise summary statement is assessed as a marker for clinical reasoning because it requires that students synthesize and prioritize patient information.8 The summary statement, a distinct component of the oral case presentation and the patient note, summarizes in one to three sentences what is known about a patient at that point, prior to the full assessment and plan.9 Summary statements may also replace a formal oral presentation on rounds.10 Writing a cogent, accurate, and complete summary statement is an essential step in clinical problem solving.8,11,12 Creating a high-quality summary statement is a skill, and as with any skill or competency, learners need feedback to improve.13 Despite the summary statement’s potential value in clinical reasoning assessment, no standardized approaches for evaluating it are currently available.
The purpose of this article is to describe the development of, and initial validity evidence for, a rubric to assess summary statements. Although many frameworks to evaluate assessment tools exist,14–16 we used Messick’s unitary theory in which validity evidence is used to support the overarching framework of construct validity (i.e., the degree to which an assessment measures the underlying construct).15,17,18 We sought validity evidence for a rubric to assess summary statements in virtual patient cases in the following areas: (1) content (the relationship between the tool’s content and the construct it intends to measure), (2) internal structure (internal consistency and interrater agreement), and (3) response process (evidence showing that raters have been properly trained).
Virtual patient cases
MedU cases are a collection of virtual patient cases developed primarily for medical students on their core clerkships and used by the majority of medical schools in the United States and Canada (www.med-u.org). They are available for pediatrics, internal medicine, family medicine, and radiology. The cases are developed through a collaborative process that builds on curricula from across the country by authors from a number of institutions.19 One pedagogical element of these cases is that students are prompted to write a free-text, nonevaluated short summary statement capturing the important case elements.
Summary statement selection
We used eight different MedU virtual patient cases (three from pediatrics, three from family medicine, and two from internal medicine) for this study. Between August 2010 and July 2011, MedU medical student users created a total of 83,347 summary statements for these cases. MedU administrative personnel randomly selected 120 of these summary statements for our use: 10 for initial pilot work (all from the pediatrics cases); 50 for rubric development (20 from the internal medicine cases, 20 from the family medicine cases, and 10 from the pediatrics cases); and 60 for establishing validity evidence (20 from the internal medicine cases, 20 from the family medicine cases, and 20 from the pediatrics cases) (see Figure 1). Summary statements had to be at least 10 words long to be selected. The total number of summary statements used was dictated by the number of iterative cycles needed to create the rubric and collect validity evidence.
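The selection step can be illustrated with a short sketch. This is not the procedure MedU personnel actually ran: the function name `select_statements`, the fixed seed, the whitespace-based word count, and the made-up example statements are all assumptions for illustration. Only the random selection and the 10-word minimum come from the description above.

```python
import random


def select_statements(statements, n, min_words=10, seed=0):
    """Randomly pick n statements that contain at least min_words words.

    Hypothetical helper: the name, the seed, and the whitespace-based word
    count are illustrative assumptions, not the study's actual procedure.
    """
    eligible = [s for s in statements if len(s.split()) >= min_words]
    if n > len(eligible):
        raise ValueError("not enough eligible statements to sample from")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(eligible, n)


if __name__ == "__main__":
    # Made-up statements, used only to exercise the filter.
    pool = [
        "Fever and cough.",  # excluded: fewer than 10 words
        "A 4-year-old boy presents with three days of fever, cough, and "
        "decreased oral intake.",
        "A 60-year-old woman with diabetes presents with acute onset chest "
        "pain radiating to the left arm.",
    ]
    for statement in select_statements(pool, n=2):
        print(statement)
```

Filtering before sampling, rather than resampling after a short statement is drawn, keeps every eligible statement equally likely to be selected.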
Summary statement assessment rubric development
Between November 2011 and June 2014, four of us, all experienced physician educators from the fields of pediatrics (S.S., M.S.D., N.B.B.) and internal medicine (J.R.K.), reviewed and discussed definitions and characteristics of summary statements.9,10 We then reviewed 10 summary statements from two pediatrics cases to ensure consistency in how we were reviewing the summary statements and in how we were reconciling discrepancies in coding (see Figure 1, Step 1). For example, we found that we needed to work through each virtual patient case up to the point where students receive the standard set of instructions for creating a summary statement, so that we knew the specific case content that a summary statement could include.
To develop the assessment rubric, we first attempted to apply the Reporter, Interpreter, Manager, Educator (RIME) framework20 to 10 summary statements from an internal medicine case (see Figure 1, Step 2). RIME is a developmental framework that distinguishes between novice and advanced learners on the basis of their performance of increasingly complex clinical reasoning skills. It allows for the assessment of learners along a continuum from Reporter at the lowest level to Interpreter, then Manager, and finally to Educator at the highest level. We found that although students reported information with variable accuracy, the summary statements were too short to assess whether they were interpreting information beyond just choosing what information to report. Using the same 10 summary statements from the internal medicine case, we then evaluated the students’ use of semantic qualifiers (e.g., acute/chronic, intermittent/constant) as a marker of advanced clinical reasoning.11 In general, most summary statements lacked extensive use of semantic qualifiers. Additionally, we felt that if semantic qualifiers were the only metric assessed, a summary statement that was inaccurate but that included semantic qualifiers could be judged to be of high quality.
Given that we could not apply an existing framework to assess the summary statements, we used the same 10 summary statements from the internal medicine case and an additional 10 summary statements from a family medicine case to individually create our own assessment rubric, building on our discussions from the previous rounds of coding (see Figure 1, Step 3). We compared coding structures and scoring iteratively using three additional cases, one each from internal medicine, family medicine, and pediatrics, with 10 summary statements from each case until we agreed on a consensus coding structure. Agreement occurred after we coded a total of 50 summary statements (see Figure 1, Step 4). We then collected validity evidence using 60 additional summary statements (see Figure 1, Steps 5 and 6).
Assessment rubric components
The final assessment rubric included five key components: factual accuracy, appropriate narrowing of the differential diagnosis, transformation of information, use of semantic qualifiers, and a global rating. The need for a global rating emerged as part of our individual coding; through discussion, we determined that the inclusion of such a rating might capture elements not explicitly examined in the rubric already. Table 1 provides definitions for each of these rubric components. Examples of rated summary statements appear in Appendix 1.
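One way to picture a single rating is as a small record with one field per rubric component. The sketch below is an assumption-laden illustration, not the rubric itself: the class name `RubricScore`, the field names, and the integer coding are hypothetical. The only scale encoded here is the binary (accurate/inaccurate) rating for factual accuracy reported in this article; the scales for the other components are defined in Table 1 and are deliberately left unconstrained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricScore:
    """One rater's scores for one summary statement.

    Field names mirror the five rubric components. The integer coding is
    an illustrative assumption; only the binary nature of factual accuracy
    is taken from the article.
    """

    factual_accuracy: int        # assumed coding: 0 = inaccurate, 1 = accurate
    narrowing_differential: int  # scale defined in Table 1, not enforced here
    transformation: int          # scale defined in Table 1, not enforced here
    semantic_qualifiers: int     # scale defined in Table 1, not enforced here
    global_rating: int           # scale defined in Table 1, not enforced here

    def __post_init__(self):
        if self.factual_accuracy not in (0, 1):
            raise ValueError("factual_accuracy must be 0 or 1")

    def total(self) -> int:
        """Sum of all rubric components and the global rating."""
        return (
            self.factual_accuracy
            + self.narrowing_differential
            + self.transformation
            + self.semantic_qualifiers
            + self.global_rating
        )
```

With these hypothetical values, `RubricScore(1, 2, 1, 1, 2).total()` evaluates to 7.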
Validity evidence for the assessment rubric
We established initial content validity evidence by reviewing existing definitions of summary statements9 and assessment frameworks used in clinical settings.6,7,20 Our research group included experienced clinical teachers who used the iterative development process outlined above. This process drew from established educational theories related to clinical decision making21–23 and medical student assessment20 (see Figure 1, Steps 1–3).
We collected validity evidence about internal structure15 using intraclass correlation coefficients (ICC) for all cases, across cases, and across components to assess interrater agreement (see Figure 1, Steps 5 and 6). Cronbach alpha calculations were used to determine internal consistency for the rubric components (factual accuracy, appropriate narrowing of the differential diagnosis, transformation of information, use of semantic qualifiers, and a global rating) across all cases and within individual cases (see Figure 1, Steps 5 and 6). Finally, we used nonparametric Spearman rank–order correlations to explore correlations between these components of the assessment rubric across all cases and within individual cases (see Figure 1, Step 6).
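For readers less familiar with these statistics, the following sketch shows how Cronbach alpha and a Spearman rank-order correlation can be computed from raw scores. It is a plain-Python illustration under stated assumptions, not the software or code we used: the function names are hypothetical, tied scores receive the mean of their ranks, and no significance testing is included.

```python
from statistics import mean, variance


def cronbach_alpha(items):
    """Cronbach alpha; items[i][j] is the score of statement j on component i."""
    k = len(items)
    n = len(items[0])
    totals = [sum(item[j] for item in items) for j in range(n)]
    item_variance_sum = sum(variance(item) for item in items)
    return (k / (k - 1)) * (1 - item_variance_sum / variance(totals))


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1  # mean of 1-based positions in the run
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

For instance, `average_ranks([10, 20, 20, 30])` gives `[1.0, 2.5, 2.5, 4.0]`. Ties are common with short ordinal rating scales, which is why the rank step averages tied positions rather than breaking ties arbitrarily.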
To explore response process, we each were asked about the rationale for the global rating we assigned to each summary statement (reflection-in-action) as we rated students’ work (see Figure 1, Step 5).24 We felt that this step was important because the global rating was the least structured component of the assessment rubric and possibly could be a source of rater variance. Once initial interrater correlations were determined, we each reviewed our scoring decisions (reflection-on-action)24 and completed a three-item post hoc online survey (see Supplemental Digital Appendix 1 at http://links.lww.com/ACADMED/A289). This survey was designed to explore each rater’s general approach both to applying the rubric and to determining the global rating as well as whether general and case-specific factors influenced our rating. An experienced qualitative researcher (L.S.R.), who was not involved in the assessment rubric development, helped create the survey and analyzed the reflection-in-action and reflection-on-action comments using content analysis25 to identify emergent themes. She performed member checks with each of the researchers who coded the summary statements; no modifications to the emergent themes were made.
The institutional review board of the University of Washington reviewed this project and deemed it exempt from ethical approval.
The rubric development process, the inclusion of experienced clinical teachers, and the relationship between the rubric components and previously described concepts in clinical reasoning provide content validity evidence for the assessment rubric. Factual accuracy in the rubric reflected the “R” (i.e., accurate Reporter) in the RIME scheme.20 Appropriate narrowing of the differential diagnosis reflected students’ understanding of the pertinent positive and negative clinical findings, implicit in the “I” (i.e., Interpreter) of the RIME framework.20 The inclusion of critical information (i.e., key features of the primary problem) and exclusion of extraneous information are also critical components of the creation of illness scripts.8 Use of semantic qualifiers was a marker of conciseness and demonstrated that the student had synthesized the patient’s information and was categorizing it into preexisting frameworks for problem solving (e.g., illness scripts or pathophysiology-based problem solving).8 The concept of transformative language builds on the importance of encapsulation, or the packaging of lower-level details and their interrelationships into a smaller number of higher-level concepts with the same meaning.22 Finally, use of semantic qualifiers is a marker of expert clinical reasoning23 and is a skill that can be taught to students.26
Interrater reliability for the entire rubric (i.e., the sum of all rubric components and the global rating) was acceptable with an ICC of 0.891 (95% confidence interval [CI] 0.859–0.917; P < .001). We found variance between cases, with the family medicine case having the highest ICC of 0.915 (95% CI 0.868–0.948), the pediatrics case having an ICC of 0.880 (95% CI 0.815–0.926), and the internal medicine case having the lowest ICC of 0.856 (95% CI 0.778–0.912). We as raters applied the individual components consistently (see Table 2), with the global rating (ICC 0.929; 95% CI 0.877–0.963; P < .001) and appropriate narrowing of the differential diagnosis (ICC 0.934; 95% CI 0.884–0.965; P < .001) having the highest ICC.
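The sketch below shows how an ICC of this kind can be computed from a statements-by-raters table of scores. The specific form is an assumption: this article does not state which ICC model was used, so the example implements ICC(2,1) in the Shrout and Fleiss notation (two-way random effects, absolute agreement, single rater). The function name `icc_2_1` is hypothetical, and confidence intervals and P values are not computed.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings[i][j] is the score that rater j gave to summary statement i.
    The choice of ICC(2,1) is an assumption made for illustration.
    """
    n = len(ratings)     # number of summary statements
    k = len(ratings[0])  # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                # between-statement mean square
    ms_cols = ss_cols / (k - 1)                # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
```

Because this form penalizes systematic differences between raters (one rater scoring consistently higher than another), it is a stricter measure than a consistency-only ICC.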
The internal consistency of the assessment rubric was acceptable across all cases (Cronbach alpha 0.771). Across cases, there was a moderate to high degree of correlation between the components (see Table 3). Appropriate narrowing of the differential diagnosis and transformation of information had the highest correlation with the global rating. Use of semantic qualifiers and transformation of information were not significantly correlated with factual accuracy. Correlations between components of the assessment rubric varied between cases with overall higher correlations in the internal medicine and family medicine cases (see Supplemental Digital Appendix 2 at http://links.lww.com/ACADMED/A289).
Analysis of our reflection-in-action comments (n = 120; 30 summary statements × 4 researchers) explaining our individual approaches to assigning the global rating suggested that we independently focused on different components of the rubric but universally used the language of the rubric to explain our scoring. The most frequent comments described missing information (55/120 comments, 46%) or inaccuracies (10/120 comments, 8%). We then diverged in our rating approaches, focusing on transformation of information (Rater 1), appropriate narrowing of the differential diagnosis (Rater 2), providing feedback to improve the summary statement (Rater 3), and comparing the different summary statements being assessed to each other (Rater 4). Additionally, some of us made no mention of some components of the rubric in our comments (e.g., Rater 2 did not mention semantic qualifiers).
Analysis of the post hoc survey (reflection-on-action comments) showed that we all commented that providing information that accurately shaped the listener’s understanding of the differential diagnosis was the most important component we considered when assigning the global rating. We considered the inclusion of incorrect information or a presentation of the material in a manner that led to an incorrect diagnosis as a “fatal flaw.” We all felt that case-specific factors (e.g., more than one diagnosis per case; content expertise) affected our ratings. In addition, similar to our reflection-in-action comments, we focused on different aspects of the summary statement when assigning the global rating. For instance, Rater 1 focused on the presence of pertinent positive or negative findings, Rater 2 looked for all the key clinical findings in the case, Rater 3 queried how the summary statement would be interpreted, and Rater 4 tried to determine whether the student was creating a coherent story (see Supplemental Digital Appendix 3 at http://links.lww.com/ACADMED/A289).
Summary statements are an important component of oral and written communication in patient care. In this article, we describe the development and preliminary validation of an assessment rubric for evaluating summary statements that medical students write in the context of completing virtual patient cases. The rubric builds on existing models of assessing clinical reasoning.8,11,20,23 To our knowledge, it is the first published approach for evaluating summary statements. Following the tenets outlined in Messick’s unitary theory, we established elements of content validity, internal structure, and response process.
Prior work by Durning et al27 has described more comprehensive assessments of students’ clinical reasoning using encounter forms after a standardized patient experience. In that study, students were assessed on their ability to develop a summary statement, differential diagnosis, and problem list and their ability to use semantic qualifiers. In contrast, we focused on the summary statement as a single marker of clinical reasoning that commonly appears in oral and written communication in both actual patient care and some virtual patient cases, like those developed by MedU. Similar to other assessment tools that evaluate summary statements as one component of students’ work,6,7,27 our rubric provides a global assessment of the student’s ability to create a summary statement. The high interrater reliability for the global rating component of the rubric is similar to that of a rating scale that assesses students’ oral presentations.7
Although global ratings are commonly used in assessing clinical skills, they often do not provide sufficiently detailed feedback to guide improvement.28 The other components of the rubric (factual accuracy, appropriate narrowing of the differential diagnosis, transformation of information, and use of semantic qualifiers) had acceptable interrater reliability and, thus, may provide students with direction for improving their skills. The lowest interrater reliability score for the rubric was for factual accuracy (ICC 0.687; 95% CI 0.454–0.837). This may reflect that the rating scheme was binary (accurate/inaccurate) compared with those of the other rubric components, which included more options. In the next phase of rubric development, we plan to further refine the factual accuracy rating component to allow for more nuanced decision making.
Although the interrater reliability was acceptable, we found significant interrater variance despite the intensive development process. Although rater training has been suggested as a method to improve interrater reliability, studies of its effect have produced mixed findings.29,30 The development of this summary statement assessment rubric was intensive and iterative; thus, additional rater training would have been unlikely to significantly modify our approach as raters. To explain the rater variance (response process), we looked to the findings from the qualitative analysis of our comments (reflection-in-action and reflection-on-action analysis). We focused on different aspects of the assessment rubric when explaining our ratings, but we generally agreed on the most important components to consider. This phenomenon has been described by others31 and may account for a significant portion of the interrater variance. Additionally, we noted that case-specific components affected our ratings; the cross-case analysis of the internal structure of the rubric confirmed this assertion.
Our results also indicate that the elimination of rater variance may not be necessary to use this rubric to provide feedback to students. Although we focused on different components, our reflection-on-action comments suggested that providing information that accurately shaped the listener’s understanding of the differential diagnosis was the most important element of the rubric. The availability of the rubric alone may allow faculty to “unpackage” their feedback to students using a shared mental model, which in turn would help students to improve their skills in a more granular way. The rubric also can serve as a means of “signposting” for students (and to some extent for faculty) the performance expectations for writing a high-quality summary statement. In addition, in the era of self-regulated learning, sharing a feedback rubric can help students monitor and assess their progress as they work toward mastery.32
This pilot work has several important limitations. First, the validity evidence of the internal structure and response process of the rubric was determined by raters who had in-depth knowledge of the rubric and the virtual patient cases. This may have resulted in higher ICC, Cronbach alpha, and correlation coefficient values than would be expected from others using the rubric. In addition, collection of the validity evidence after reviewing initial interrater reliability and response process data likely affected how we as raters applied the rubric. Determining how other raters apply this rubric will be a critical part of the next step in the rubric development process. Lessons learned from the development process described here will guide the further refinement of the rubric and training materials. Next, although the qualitative analysis of our reflections about scoring allowed us to prospectively explore the response process, the questions used in the reflection-in-action and reflection-on-action analyses were not the same and, thus, are not directly comparable. We only applied the assessment rubric to a limited number of virtual patients in three disciplines; thus, we were unable to calculate generalizability coefficients or to perform factor analyses in this pilot work. In addition, we did not gather evidence of the rubric’s relationship to other variables or consequential validity,33 which are important steps when establishing validity evidence, particularly for summative assessments. Next, we coded summary statements in cases outside our discipline, which does not occur in most clinical teaching settings. Given that the students’ summary statements contained no specific identifiable student data, we were unable to explore the relationship between summary statement quality and student-related factors such as medical school curriculum, clerkships previously completed, number of virtual patient cases used, or prior instruction in writing summary statements. Finally, we scored summary statements in virtual patient cases rather than actual patient cases. How the rubric would apply in the context of actual patient care remains unknown.
In conclusion, we developed and established preliminary validity evidence for an assessment rubric to evaluate medical students’ summary statements in virtual patient cases. This rubric holds promise as a tool for structuring feedback to learners; however, more work is needed to use it as a scoring tool for higher-stakes assessment. Future work will focus on examining interrater reliability beyond our research team, applying the rubric to different cases, developing and piloting faculty training tools, correlating this rubric with other assessments of clinical reasoning, and determining whether feedback using this rubric improves students’ skills.
Acknowledgments: The authors wish to acknowledge Jenny Lipfert, MD, who designed Figure 1.
1. Durning SJ, Artino AR Jr, Schuwirth L, van der Vleuten C. Clarifying assumptions to enhance our understanding and assessment of clinical reasoning. Acad Med. 2013;88:442–448
2. Bordage G, Brailovsky C, Carretier H, Page G. Content validation of key features on a national examination of clinical decision-making skills. Acad Med. 1995;70:276–281
3. Charlin B, van der Vleuten C. Standardized assessment of reasoning in contexts of uncertainty: The script concordance approach. Eval Health Prof. 2004;27:304–319
4. Lubarsky S, Dory V, Duggan P, Gagnon R, Charlin B. Script concordance testing: From theory to practice: AMEE guide no. 75. Med Teach. 2013;35:184–193
5. Beullens J, Struyf E, Van Damme B. Do extended matching multiple-choice questions measure clinical reasoning? Med Educ. 2005;39:410–417
6. Kogan JR, Shea JA. Psychometric characteristics of a write-up assessment form in a medicine core clerkship. Teach Learn Med. 2005;17:101–106
7. Lewin LO, Beraho L, Dolan S, Millstein L, Bowman D. Interrater reliability of an oral case presentation rating tool in a pediatric clerkship. Teach Learn Med. 2013;25:31–38
8. Bowen JL. Educational strategies to promote clinical diagnostic reasoning. N Engl J Med. 2006;355:2217–2225
10. Davenport C, Honigman B, Druck J. The 3-minute emergency medicine medical student presentation: A variation on a theme. Acad Emerg Med. 2008;15:683–687
11. Bordage G, Connell KJ, Chang RW, Gecht MR, Sinacore JM. Assessing the semantic content of clinical case presentations: Studies of reliability and concurrent validity. Acad Med. 1997;72(10 suppl 1):S37–S39
12. Schmidt HG, Rikers RM. How expertise develops in medicine: Knowledge encapsulation and illness script formation. Med Educ. 2007;41:1133–1139
13. Ericsson KA. Deliberate practice and the acquisition and maintenance of expert performance in medicine and related domains. Acad Med. 2004;79(10 suppl):S70–S81
14. Kane M, Crooks T, Cohen A. Validating measures of performance. Educ Meas Issues Pract. 1999;18:5–17
15. Messick S. Standards of validity and the validity of standards in performance assessment. Educ Meas Issues Pract. 1995;14:5–8
16. van der Vleuten CP, Schuwirth LW. Assessing professional competence: From methods to programmes. Med Educ. 2005;39:309–317
17. Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments: Theory and application. Am J Med. 2006;119:166.e7–166.e16
18. Downing SM. Validity: On meaningful interpretation of assessment data. Med Educ. 2003;37:830–837
19. Berman NB, Fall LH, Chessman AW, et al. A collaborative model for developing and maintaining virtual patients for medical education. Med Teach. 2011;33:319–324
20. Pangaro L. A new vocabulary and other innovations for improving descriptive in-training evaluations. Acad Med. 1999;74:1203–1207
21. Schmidt HG, Norman GR, Boshuizen HP. A cognitive perspective on medical expertise: Theory and implication. Acad Med. 1990;65:611–621
22. Boshuizen HPA, Schmidt HG. On the role of biomedical knowledge in clinical reasoning by experts, intermediates, and novices. Cogn Sci. 1992;16:153–184
23. Bordage G, Lemieux M. Semantic structures and diagnostic thinking of experts and novices. Acad Med. 1991;66(9 suppl):S70–S72
24. Schön DA. The Reflective Practitioner: How Professionals Think in Action. New York, NY: Basic Books; 1983
25. Hsieh HF, Shannon SE. Three approaches to qualitative content analysis. Qual Health Res. 2005;15:1277–1288
26. Nendaz MR, Bordage G. Promoting diagnostic problem representation. Med Educ. 2002;36:760–766
27. Durning SJ, Artino A, Boulet J, et al. The feasibility, reliability, and validity of a post-encounter form for evaluating clinical reasoning. Med Teach. 2012;34:30–37
28. Boehler ML, Rogers DA, Schwind CJ, et al. An investigation of medical student reactions to feedback: A randomised controlled trial. Med Educ. 2006;40:746–749
29. Holmboe ES, Sherbino J, Long DM, Swing SR, Frank JR. The role of assessment in competency-based medical education. Med Teach. 2010;32:676–682
30. Cook DA, Dupras DM, Beckman TJ, Thomas KG, Pankratz VS. Effect of rater training on reliability and accuracy of mini-CEX scores: A randomized, controlled trial. J Gen Intern Med. 2009;24:74–79
31. Murphy KR, Cleveland JN, Skattebo AL, Kinney TB. Raters who pursue different goals give different ratings. J Appl Psychol. 2004;89:158–164
32. Murad MH, Coto-Yglesias F, Varkey P, Prokop LJ, Murad AL. The effectiveness of self-directed learning in health professions education: A systematic review. Med Educ. 2010;44:1057–1068
33. Messick S. Validity. In: Linn RL, ed. Educational Measurement. 3rd ed. New York, NY: American Council on Education; 1989