No gold standard is available for assessing clinical reasoning, likely because clinical reasoning is a multifaceted construct that includes acquiring and interpreting data; synthesizing a case into a problem representation; generating, prioritizing, and justifying a differential diagnosis; and selecting a management plan.1,2 The context in which clinical reasoning occurs, including clinician factors, patient factors, and setting, also affects the process.3,4 Different assessment approaches target different facets of the clinical reasoning process; for example, standardized patient cases may assess bedside data acquisition, whereas multiple-choice exams may assess foundational knowledge and data interpretation. Thus, clinical reasoning needs to be assessed using multiple strategies.5–7
Clinical decision making represents the output of clinical reasoning. Medical educators have developed a number of methods to assess diagnostic and management decisions. They have used, for example, patient management problems (PMPs) to assess students’ decisions as the learners move through a case—from the chief complaint to follow-up.8 PMPs have lost favor, however, as scores lack sufficient reliability and because the length of each PMP restricts the ability to sample a broad range of content. Another method of assessing clinical reasoning, developed in 1984, is the use of key features exams (KFEs). KFEs, which focus on only the critical or challenging decisions in a situation, were developed to better target critical decision points. This focus allows examiners to sample a greater number of cases, thereby increasing the reliability of exam scores.9 The key features (KFs) may be related to selecting additional information to gather during the history, physical examination, or diagnostic testing; generating a differential diagnosis; prioritizing a working diagnosis; or initiating treatment, counseling, or monitoring.10,11
The development of a KFE case begins with the identification of the critical, challenging steps (i.e., the KFs) in the diagnosis and management of the clinical problem (see Box 1). Then a case vignette is written, followed by a limited number of questions specifically targeting the KFs. The reliability of KFE scores is maximized by testing two to three KFs per case.12 Typically, one KF is tested per question, although one question may test more than one KF.10,11 Answering questions can require either selecting options from a menu or providing a free-text response. Examinees receive points for selecting or writing correct options, but fail to garner points for errors of omission (e.g., not ordering an essential diagnostic test). They may lose points for errors of commission (e.g., ordering a potentially harmful diagnostic test or overordering; see Appendix 1). As often as possible, vignettes use lay language, which better discriminates among examinees.13
KFEs have been part of the Medical Council of Canada Qualifying Exam (MCCQE) since 1992.9,10 The MCCQE is administered at the end of medical school and includes an applied knowledge section (that employs multiple-choice questions), a clinical decision-making section (the KFE), and a clinical performance section (an objective structured clinical exam [OSCE]). Prior research has indicated that the KFE and the communication score on the OSCE best predict future complaints in practice14 and that, compared with scores on the other two components, scores on the KFE most strongly predicted patients’ adherence to antihypertensive regimens in practice.15 KFEs have been developed for medicine clerkships in Canada16 and Germany17; for medical school progress tests in the Netherlands18; for self-assessment in subspecialty surgery in the United States19; and for qualification for fellowship in the Royal Australian College of General Practitioners.20 Despite the correlation with practice and international use, widespread uptake has not occurred in U.S. medical schools—likely because of a lack of familiarity with the format.
A core goal of the internal medicine clerkship is for students to learn to make diagnostic and management decisions about common internal medicine problems.21,22 The majority (93%) of U.S. clerkships use the National Board of Medical Examiners Subject Examination (NBME-SE), a multiple-choice exam, to assess students’ applied knowledge of internal medicine.23 However, many clerkships also administer their own locally developed exams, primarily to assess content relevant to the clerkship, content not sufficiently covered in the NBME-SE, or both.23 In addition, most U.S. internal medicine clerkships use an online virtual patient course (Aquifer Internal Medicine, Hanover, New Hampshire), based on a national clerkship curriculum,24 to teach students clinical reasoning for common internal medicine problems.25
We developed an online KFE26—blueprinted to the national curriculum24 and the online virtual patient course—designed specifically to assess the clinical decision-making abilities of U.S. medical students at the end of their medicine clerkship. Similar to the MCCQE, which combines a KFE with a multiple-choice exam and an OSCE, the medicine clerkship KFE was developed as a potential summative assessment to complement the NBME-SE and clinical evaluations.
If medical schools in the United States are to benefit from the innovation of a KFE, we must establish validity for its use within a U.S. context. The purpose of this study was to gather validity evidence for the use of a KFE to assess clinical decision making in U.S. internal medicine clerkships. Messick’s sources of validity evidence27 served as the conceptual framework for the study. This study focused primarily on examining response process, internal structure, and relationship to other variables as evidence of validity.28,29 Content evidence is described elsewhere.26 We hypothesized that accuracy (response process), acceptable reliability and psychometric characteristics (internal structure), and moderate association between KFE and NBME-SE scores (relationship to other variables) would provide evidence supporting the validity of the KFE.
Blueprinting, test development, and scoring
After exam development, scoring, and pilot testing with a separate cohort of 162 students (described in detail elsewhere26), we selected 60 cases and blueprinted them into four forms (A–D)—each with 15 cases. We balanced the 15 cases according to organ systems, location of care (outpatient–inpatient), and decision focus (diagnosis–management); see Supplemental Digital Appendix 1 at https://links.lww.com/ACADMED/A611. We calculated the total test score for each form by averaging case scores (weighted evenly) across the whole test.
We conducted the study from February 2012 to January 2013. We solicited schools during a national meeting and from test developers’ institutions. Nine schools initially enrolled; these schools included seven public and two private institutions with a range of class sizes (24–220) from across the United States (West, Midwest, Southeast, Northeast). Students were required to complete the KFE, but participation in the study was voluntary. Completion of the posttest survey constituted students’ consent to participate in the study. We offered no incentives to participate in the study. The institutional review boards at all sites and at the University of Illinois at Chicago provided ethical approval for the study (see also the disclosures, below).
The KFE was administered at or near the end of the medicine clerkship, which occurs during the third year of medical school at all the participating schools. The four forms were rotated at each school throughout the year. After a tutorial introducing the KFE format, students had 75 minutes to complete the KFE online in a proctored classroom. Upon completing the exam, students responded to a posttest survey regarding the following: the clarity of the instructions and questions, technical issues, test difficulty, and their opinions regarding the use of the KFE for clerkship assessment (see Supplemental Digital Appendix 2 at https://links.lww.com/ACADMED/A611). Students had 150 minutes to complete the 100-item multiple-choice-question NBME-SE at the end of the clerkship. Directors at individual schools were responsible for the timing and scheduling of the NBME-SE and KFE.
Internal structure and reliability.
We based each form of the KFE on the same overall blueprint, but each contained a different set of 15 cases; therefore, we analyzed each exam form (A, B, C, and D) separately. We calculated descriptive statistics, and using case scores (not individual KFs), we calculated internal-consistency reliability (Cronbach alpha). We determined discrimination using an item–total correlation coefficient, with each case treated as an “item,” and removing each case from the total against which it was compared. A positive discrimination index indicates that students who scored higher on a case also tended to score higher on the rest of the exam; we used 0.2 as a guideline to identify cases with good discrimination. We used generalizability studies to determine the relative contribution of various factors (“facets”), as well as the interaction between factors, to the variance in exam scores. The factors we included in the generalizability studies were students (S), medical schools or universities (U), and cases (C). We determined variance due to the interaction between schools and cases (U × C). Variance due to students is nested within schools (S:U), and the interaction of students and cases (case specificity) included the nesting factor [(S:U) × C], which also serves as the residual error term in this design.
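As a rough illustration of the two case-level statistics described above, the following sketch computes Cronbach alpha and a corrected item–total correlation (each case correlated with the sum of the remaining cases) from a students-by-cases score matrix. The function names and the toy scores are hypothetical; the study itself used SPSS, and this is not the authors' code.

```python
from statistics import mean, variance


def cronbach_alpha(scores):
    """Cronbach alpha for a students x cases matrix (list of equal-length rows)."""
    k = len(scores[0])  # number of cases, each treated as one "item"
    case_variances = [variance([row[j] for row in scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(case_variances) / total_variance)


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def corrected_item_total(scores):
    """Correlate each case with the total of the *other* cases."""
    k = len(scores[0])
    indices = []
    for j in range(k):
        case = [row[j] for row in scores]
        rest = [sum(row) - row[j] for row in scores]  # total with case j removed
        indices.append(pearson(case, rest))
    return indices


if __name__ == "__main__":
    # Hypothetical case scores (fraction correct) for 5 students on 4 cases.
    toy_scores = [
        [0.50, 0.67, 0.33, 0.50],
        [1.00, 0.67, 0.67, 1.00],
        [0.50, 0.33, 0.33, 0.00],
        [1.00, 1.00, 0.67, 0.50],
        [0.00, 0.33, 0.33, 0.50],
    ]
    print("alpha:", round(cronbach_alpha(toy_scores), 2))
    for j, r in enumerate(corrected_item_total(toy_scores), start=1):
        flag = "good" if r > 0.2 else "review"  # 0.2 guideline from the text
        print(f"case {j}: discrimination {r:.2f} ({flag})")
```

Removing the case from its own total keeps a case from inflating its discrimination index by correlating with itself, which matters most on short forms such as these 15-case exams.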
Relationship to other variables.
We conducted correlation analyses to examine the relationship between KFE scores (percent correct) and four other measures: NBME-SE scores, number of clerkships completed, and number of inpatient and outpatient weeks in the medicine clerkship. Because decision making requires applied knowledge, we hypothesized that scores on the KFE would correlate moderately with scores on the NBME-SE. We calculated descriptive and correlation statistics and internal consistency (Cronbach alpha) using SPSS version 22 software (Armonk, New York). We conducted additional psychometric and generalizability theory analyses using urGenova software (Iowa City, Iowa).
We included data from eight of the nine participating schools in the analyses; we excluded one school because the students there did not have access to the survey and could not consent. Because of variations in school schedules, all four forms of the KFEs were administered at six schools, three forms were administered at one school, and two forms were administered at one school. A total of 759 students took the KFE; 515 (67.9%) consented to participate in the study, and we had data from both the KFE and NBME-SE for 501 students (66.0% of 759). We have presented our results below according to Messick’s sources of validity evidence.27
Of the 515 students who responded to the survey, 463 (89.9%) indicated that 75 minutes was sufficient to complete the exam. Students who wanted more time requested a mean of 23.9 additional minutes (range 0 [some students did not answer the question regarding how many additional minutes they would like] to 120). Of the 515 students who completed the survey, 121 (23.5%) reported difficulties with the online technology, 42 (8.2%) reported problems regarding the clarity of instructions, and 59 (11.5%) reported issues with the clarity of questions. Of the 111 students describing the technical issues they experienced, 29 (26%) mentioned problems related to uncertainty over whether the exam was finished, 27 (24%) relayed uncertainty about navigating between cases and locking in answers, 27 (24%) identified problems logging in, 19 (17%) described slow Internet or local computer issues, 7 (6%) identified problems related to exam content, and 4 (3%) described issues unrelated to the exam.
We reviewed test scores manually for accuracy. For questions in which students were instructed to select as many options as they deemed appropriate, and in which the maximum was not revealed to the student, the maximum limit was not initially enforced in the scoring. We detected this error early in the study analysis and recalculated the scores. Subsequent review indicated that limits were correctly enforced.
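The selection-limit problem described above can be illustrated with a small scoring sketch. The rule used here (a response that exceeds the maximum number of selections earns zero for that question; otherwise the score is the fraction of keyed options selected) is an assumption made for illustration, as are the function name and the option labels; the actual scoring key is not reproduced here.

```python
def score_question(selected, keyed, max_selections):
    """Score one select-many question (hypothetical rule, see lead-in).

    selected:       options the examinee chose
    keyed:          options that earn credit
    max_selections: limit on the number of choices allowed
    """
    chosen = set(selected)
    if len(chosen) > max_selections:
        # Enforcing the limit: over-selecting forfeits the question.
        return 0.0
    return len(chosen & set(keyed)) / len(keyed)


if __name__ == "__main__":
    # Hypothetical option labels loosely based on KF2 and KF3 in Box 1.
    keyed = {"ultrasound", "iv_fluids", "analgesia"}
    within_limit = {"ultrasound", "iv_fluids", "analgesia", "antiemetic"}
    over_limit = within_limit | {"ct_abdomen", "antibiotics"}

    print(score_question(within_limit, keyed, max_selections=4))  # full credit
    print(score_question(over_limit, keyed, max_selections=4))    # limit exceeded
```

Without the length check, the second response would also receive full credit under this rule, which is the kind of score inflation an unenforced limit can produce.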
Internal structure and reliability
The mean KFE score across forms was 58.4% (range: 54.6%–60.3% [standard deviation (SD) 8.4%–9.6%]). The mean score for the NBME-SE was 78.5% (range: 76.1%–79.6% [SD 7.7%–14.0%]). Cronbach alpha for the 15-case KFE forms ranged from 0.44 to 0.53 (see Table 1).
Of the 60 cases, 59 (98.3%) had a positive discrimination index, and 32 (53.3%) had a discrimination index greater than 0.20. The mean discrimination index for all cases was 0.27. Removing the least discriminating case from each exam form increased Cronbach alpha to 0.48–0.58.
We calculated reliability in the generalizability analysis, which typically produces a lower value than when reliability is calculated with Cronbach alpha; however, a generalizability analysis can be used to estimate the change in reliability if additional cases are included in the exam. The absolute reliability Phi-coefficient ranged from 0.36 to 0.52 (see Table 2). Adding five cases to the most reliable form (A) would increase the Phi-coefficient to 0.59. The majority of the variance was attributed to cases (16%) and students nested within schools (5%). The smallest proportion of variance came from schools (1%).
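The projection for a longer exam can be approximated with a simple decision-study calculation. The sketch below assumes that every error component in the Phi-coefficient is divided by the number of cases, so that lengthening the exam shrinks the error variance proportionally; the function name is hypothetical, and the study's own estimates came from urGenova rather than from this shortcut.

```python
def project_phi(phi, n_cases, new_n_cases):
    """Project a Phi-coefficient from n_cases to new_n_cases.

    Assumes all absolute-error variance scales as 1 / (number of cases).
    """
    # Ratio of error variance to student variance at the current test length.
    error_ratio = (1.0 - phi) / phi
    # Error variance shrinks in proportion to the added cases.
    new_error_ratio = error_ratio * n_cases / new_n_cases
    return 1.0 / (1.0 + new_error_ratio)


if __name__ == "__main__":
    # Form A: Phi = 0.52 with 15 cases; project to 20 cases.
    print(round(project_phi(0.52, 15, 20), 2))  # 0.59
```

Under this assumption, the calculation reproduces the reported increase from 0.52 to 0.59 when five cases are added to form A.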
Relationship to other variables
Clerkship length ranged from 6 to 12 weeks (the median was 8 weeks). Students spent 50% to 100% of that time in the inpatient setting (median 67%), and 0% to 33% in the outpatient setting (median 33%). Students who responded to the survey had completed a mean of 3.2 clerkships (SD 2.3) before their medicine clerkship, and they had spent 1.3 (SD 1.9) weeks in an outpatient setting and 6.8 weeks (SD 2.1) in an inpatient setting during their medicine clerkship. The correlation coefficient between the number of clerkships completed and exam scores varied from 0.16 (P is not significant [NS]) to 0.27 (P < .01) for the KFE, and from −0.067 to 0.099 (P is NS) for the NBME-SE. Disattenuated correlation coefficients between scores on the KFE and the NBME-SE varied from 0.24 to 0.47 (P < .01; Table 3).
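A disattenuated correlation adjusts an observed correlation for measurement error in both scores by dividing it by the square root of the product of the two reliabilities. The sketch below shows that standard correction; the input values are hypothetical placeholders, since the observed correlations and the reliability estimates used for the correction are not given in this section.

```python
from math import sqrt


def disattenuate(r_observed, reliability_x, reliability_y):
    """Correct an observed correlation for unreliability in both measures."""
    return r_observed / sqrt(reliability_x * reliability_y)


if __name__ == "__main__":
    # Hypothetical inputs: observed r = 0.25, reliabilities 0.50 and 0.80.
    print(round(disattenuate(0.25, 0.50, 0.80), 2))
```

Because the divisor is at most 1, the corrected coefficient is never smaller in magnitude than the observed one, and the adjustment grows as either reliability falls.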
On the survey, 387 of the 515 students (75.1%) reported that the difficulty of the KFE was just right; 127 (24.7%) indicated that it was too hard, and none reported that it was too easy. Three hundred eighty-one (73.9%) students recommended using the KFE for formative purposes. Of the 214 (41.6%) students who stated that the KFE should, or maybe should, be used for part of the clerkship grade, 86 students (40.2%) recommended using it for 5% of the grade, and 81 students (37.9%) recommended using the KFE score for 10%. Interestingly, 352 students (68.3%) reported that they would change their study habits if the KFE were graded.
This is, to our knowledge, the first nationally developed KFE to assess the clinical decision making of U.S. students in the medicine clerkship. The study builds on a previous report regarding the content and development of the KFE.26 Our results, derived following Messick’s validity framework,27 provide strong response-process and relationship-to-other-variables validity evidence, as well as moderate internal structure validity evidence, to support the use of KFEs as complements to other assessments.
The KFE provides a tool for examining an aspect of clinical reasoning that has been very challenging for clerkship directors to assess in a standardized fashion. Within a program of assessment, the KFE complements commonly used tools to assess applied knowledge (e.g., NBME-SE), data gathering (e.g., OSCEs or standardized patient encounters), and clinical decisions in practice (e.g., clerkship clinical evaluations).
Although clinical evaluations provide the advantage of assessing students within the context of real clinical care, scores provided by clinical evaluations are often obscured. This lack of clarity can occur because students’ clinical decision making is influenced by residents or faculty who have given input on a case before the students provide their own assessment and plan. In addition, the idiosyncratic nature of the clinical environment means that students’ clinical decision making may not be assessed on a sufficient number or breadth of clinical problems. Achieving adequate reliability with clinical evaluations is difficult, even with competency-based clerkship assessments.30 The results of the generalizability analysis conducted for this study showed that student interaction with cases (the “error term”) contributed the largest component of variability; this finding supports the concept that clinical reasoning is case specific and not a generalizable skill. Thus, assessing clinical decision making across a broad range of problems is vital. The KFE provides a standardized approach for doing so, as it assesses students’ individual clinical decision making on a breadth of clinical problems drawn from the national curriculum for the medicine clerkship.24,25
The moderate internal reliability of scores for three KFE forms was comparable to the reliability of other medical student KFEs, despite requiring less testing time (75 minutes).16,17,31 For example, medicine clerkship KFEs in Canada and Germany had test score reliabilities (Cronbach alpha) of, respectively, 0.49 with 15 cases over 120 minutes16 and 0.65 with 15 cases over 90 minutes.17 The Canadian KFE included free-text responses, which may account for the increased testing time, and it required manual scoring. The German KFE included a greater number of KFs (4 per case), and students selected responses from menus. For the high-stakes MCCQE, the KFE had a reliability (G coefficient) of 0.622 to 0.639 for 28 to 39 cases over 3.5 hours,12 and the Royal Australian College of General Practitioners Fellowship KFE had a reliability (Cronbach alpha) of 0.64 to 0.83 with 24 cases over 3 hours.20 In an analysis of the MCCQE data, Norman and colleagues12 found that optimal test score reliability is attained with 2 to 3 questions per KF case. In their study, the reliability (G coefficient) was 0.579 with 20 cases with 2 KFs each and 15 cases with 3 KFs each. In general, increasing the number of cases increases the test score reliability for an exam; however, this also lengthens testing time.32 Because the majority of clerkships already use the NBME-SE,23 which takes 2.5 hours, we kept our KFE brief to avoid overburdening students and clerkship directors.
Future investigators should explore the consequences of including the exam in clerkship assessments, such as the effect on students’ study habits if the KFE were to be part of their clerkship grade. Future investigators should also address preconsequence evidence, such as specifying standard-setting procedures to set pass–fail cut points and determining how clerkship directors would weight exam scores relative to other assessments in the clerkship. To our knowledge, no published study has explored the consequences of using KFEs in clerkships; however, consequence data are infrequently reported in the health professions assessment literature.33 Future investigators should also reanalyze case and exam performance over multiple years. One study has indicated that the reliability of the Royal Australian College of General Practitioners Fellowship KFE improves with each year of administration, which the authors attribute to reviewing the cases after including them on the test and the increasing experience of test developers.20
We acknowledge some limitations. We surveyed students immediately following the exam about any response-process issues; however, we may have gleaned additional information by using a talk-aloud protocol as students worked through each case on the exam. A minority of students reported technical issues, the majority of which were related to slow Internet connections or uncertainty about navigation between cases, locking in answers, and whether all questions were completed. These issues did not interfere with either the students’ completion of the exam or our scoring of it, but they may have distracted students. We have since addressed the instructions and navigation issues. Local computer and Internet capacity are essential for students to sit any online exam. The process for scoring KFEs is complex, which increases the likelihood of errors. We did find an error with the original scoring of a subset of questions. We corrected the error, but this experience highlighted the importance of quality control.
By using a national clerkship curriculum24,25 as a blueprint, efficiently assessing performance across a breadth of clinical problems, and targeting one important aspect of clinical reasoning, namely clinical decision making, this nationally developed KFE (with moderate to strong evidence supporting its validity) is well situated to serve as a standardized tool to complement other assessments in the medicine clerkship.
Example of a Clinical Problem and Its Key Features (KFs)a
Clinical problem: Gallstone pancreatitis
Given a patient with risk factors for gallstones and clinical presentation consistent with pancreatitis, the third-year student will:
KF1: Identify pancreatitis as the most likely diagnosis
KF2: Order an ultrasound (to look for gallstones)
KF3: Initiate volume resuscitation and analgesia
aThe information shown here is for those who write KF cases or grade KF examinations and is not seen by students.
The authors wish to thank Regina Kovach, MD, L. James Nixon, MD, MHPE, Janet Jokela, MD, and Debra Stottlemyer, MD, for their contributions to exam development; Saad Alvi, MD, for data gathering; and Audra Bucklin for project management.
1. Gruppen LD. Clinical reasoning: Defining it, teaching it, assessing it, studying it. West J Emerg Med. 2017;18:4–7.
2. Bowen JL. Educational strategies to promote clinical diagnostic reasoning. N Engl J Med. 2006;355:2217–2225.
3. Durning S, Artino AR Jr, Pangaro L, van der Vleuten CP, Schuwirth L. Context and clinical reasoning: Understanding the perspective of the expert’s voice. Med Educ. 2011;45:927–938.
4. Durning SJ, Artino AR Jr, Pangaro LN, van der Vleuten C, Schuwirth L. Perspective: Redefining context in the clinical encounter: Implications for research and training in medical education. Acad Med. 2010;85:894–901.
5. Ilgen JS, Humbert AJ, Kuhn G, et al. Assessing diagnostic reasoning: A consensus statement summarizing theory, practice, and future needs. Acad Emerg Med. 2012;19:1454–1461.
6. van der Vleuten CP, Schuwirth LW. Assessing professional competence: From methods to programmes. Med Educ. 2005;39:309–317.
7. Schuwirth LW, Van der Vleuten CP. Programmatic assessment: From assessment of learning to assessment for learning. Med Teach. 2011;33:478–485.
8. Harden RM. Preparation and presentation of patient-management problems (PMPs). Med Educ. 1983;17:256–276.
9. Page G, Bordage G. The Medical Council of Canada’s key features project: A more valid written examination of clinical decision-making skills. Acad Med. 1995;70:104–110.
10. Medical Council of Canada. Guidelines for the development of key feature problems and test cases, v3. http://mcc.ca/media/CDM-Guidelines.pdf. Published August 2012. Accessed September 11, 2018.
11. Page G, Bordage G, Allen T. Developing key-feature problems and examinations to assess clinical decision-making skills. Acad Med. 1995;70:194–201.
12. Norman G, Bordage G, Page G, Keane D. How specific is case specificity? Med Educ. 2006;40:618–623.
13. Eva KW, Wood TJ, Riddle J, Touchie C, Bordage G. How clinical features are presented matters to weaker diagnosticians. Med Educ. 2010;44:775–785.
14. Tamblyn R, Abrahamowicz M, Dauphinee D, et al. Physician scores on a national clinical skills examination as predictors of complaints to medical regulatory authorities. JAMA. 2007;298:993–1001.
15. Tamblyn R, Abrahamowicz M, Dauphinee D, et al. Influence of physicians’ management and communication ability on patients’ persistence with antihypertensive medication. Arch Intern Med. 2010;170:1064–1072.
16. Hatala R, Norman GR. Adapting the Key Features Examination for a clinical clerkship. Med Educ. 2002;36:160–165.
17. Fischer MR, Kopp V, Holzer M, Ruderich F, Jünger J. A modified electronic key feature examination for undergraduate medical students: Validation threats and opportunities. Med Teach. 2005;27:450–455.
18. Rademakers J, Ten Cate TJ, Bär PR. Progress testing with short answer questions. Med Teach. 2005;27:578–582.
19. Trudel JL, Bordage G, Downing SM. Reliability and validity of key feature cases for the self-assessment of colon and rectal surgeons. Ann Surg. 2008;248:252–258.
20. Farmer EA, Hinchy J. Assessing general practice clinical decision making skills: The key features approach. Aust Fam Physician. 2005;34:1059–1061.
21. Bass EB, Fortin AH 4th, Morrison G, Wills S, Mumford LM, Goroll AH. National survey of Clerkship Directors in Internal Medicine on the competencies that should be addressed in the medicine core clerkship. Am J Med. 1997;102:564–571.
22. Liaison Committee on Medical Education. Functions and structure of a medical school: Standards for accreditation of medical education programs leading to the M.D. degree. http://lcme.org/publications. Revised March 2017. Accessed October 23, 2018.
23. Kelly WF, Papp KK, Torre D, Hemmer PA. How and why internal medicine clerkship directors use locally developed, faculty-written examinations: Results of a national survey. Acad Med. 2012;87:924–930.
24. De Fer TM, Fazio SB. CDIM-SGIM Core Medicine Clerkship Curriculum Guide, v3.0. 2006. Alexandria, VA: Alliance for Academic Internal Medicine; https://www.sgim.org/File%20Library/SGIM/Communities/Education/Resources/OnlineCDIM-SGIM-Core-Media.pdf. Accessed September 7, 2017.
25. Lang VJ, Kogan J, Berman N, Torre D. The evolving role of online virtual patients in internal medicine clerkship education nationally. Acad Med. 2013;88:1713–1718.
26. Bronander KA, Lang VJ, Nixon LJ, et al. How we developed and piloted an electronic key features examination for the internal medicine clerkship based on a US national curriculum. Med Teach. 2015;37:807–812.
27. Messick S. Foundations of Validity: Meaning and Consequences in Psychological Assessment. November 1993. Princeton, NJ: Educational Testing Service.
28. American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. 2014. Washington, DC: American Educational Research Association.
29. Downing SM. Validity: On meaningful interpretation of assessment data. Med Educ. 2003;37:830–837.
30. Zaidi NLB, Kreiter CD, Castaneda PR, et al. Generalizability of competency assessment scores across and within clerkships: How students, assessors, and clerkships matter. Acad Med. 2018;93:1212–1217.
31. Hrynchak P, Takahashi SG, Nayer M. Key-feature questions for assessment of clinical reasoning: A literature review. Med Educ. 2014;48:870–883.
32. Downing SM. Reliability: On the reproducibility of assessment data. Med Educ. 2004;38:1006–1012.
33. Cook DA, Lineberry M. Consequences validity evidence: Evaluating the impact of educational assessments. Acad Med. 2016;91:785–795.
Example of a KF Vignette, Question Stems, Answer Options, and Scoring Key
The history and physical examination are described using lay language as often as possible. The labs are not visible until the answer to question 1 is locked in.
A 43-year-old female presents with two days of abdominal pain and nausea. When it first began, the pain was more on her right side, and it came and went. For the past day, the pain has been more in her middle upper abdomen, and it is now constant and more intense. Her last bowel movement was yesterday and was normal. She did not vomit.
She has a past history of plantar fasciitis for which she takes ibuprofen 400 mg twice a week, and hyperlipidemia for which she takes a statin every now and then. She does not smoke and drinks 1 glass wine/week.
Temp 38.1, pulse 110, BP 130/80, RR 16, SaO2 99% RA.
Her conjunctivae are clear. Her abdomen is obese, soft, and mildly distended. Palpation of her middle upper abdomen causes the most pain. She tenses her abdominal muscles but is able to relax them when distracted. When you press deeply in her right upper abdomen, she is able to take a full breath in. Bowel sounds are quiet.
Q1. What is your leading diagnosis at this time? Select only one.
Q2. Initial labs reveal the following
What action(s) will you take at this time? Select up to 4, or select T if no action is needed at this time.
Abbreviations: KF indicates key feature; BP, blood pressure; RR, respiratory rate; SaO2, oxygen saturation; RA, room air; Q, question; SI, International System units; WBC, white blood cells; Hg, hemoglobin; Hct, hematocrit; AST, aspartate aminotransferase; ALT, alanine aminotransferase; T Bili, total bilirubin; BUN, blood urea nitrogen; B-hCG, beta-human chorionic gonadotropin; L, liters; IV, intravenous; PO, by mouth; mg, milligrams; CT, computed tomography; NaCl, sodium chloride; cc/h, cubic centimeters per hour.