
Clinically Discriminating Checklists Versus Thoroughness Checklists: Improving the Validity of Performance Test Scores

Yudkowsky, Rachel MD, MHPE; Park, Yoon Soo PhD; Riddle, Janet MD; Palladino, Catherine; Bordage, Georges MD, PhD

doi: 10.1097/ACM.0000000000000235
Research Reports

Purpose High-quality checklists are essential to performance test score validity. Prior research found that physical exam checklists of items that clinically discriminated between competing diagnoses provided more generalizable scores than all-encompassing thoroughness checklists. The purpose of this study was to compare validity evidence for clinically discriminating versus thoroughness checklists, hypothesizing that evidence would favor the former.

Method Faculty at four Chicago-area medical schools developed six standardized patient (SP) cases with checklists of about 20 items (“thoroughness [long] checklists”). Four clinicians identified a subset of items that clinically discriminated between competing diagnoses of each case (“clinically discriminating [short] checklists”). Cases were administered to 155 University of Illinois at Chicago fourth-year medical students during their 2011 Clinical Skills Examination (CSE). Validity evidence was compared for CSE scores based on thoroughness versus clinically discriminating checklist items.

Results Validity evidence favoring clinically discriminating checklists included response process: greater SP checklist accuracy (kappa = 0.75 for long and 0.84 for short checklists, P < .05); internal structure: better item discrimination (0.28 long, 0.42 short, P < .001); internal consistency reliability (0.80 long, 0.92 short); standard error of measurement (z score 8.87 long, 8.05 short); and generalizability (G = 0.504 long, 0.533 short). There were no significant differences overall in relevance ratings, item difficulty, or cut scores of long versus short checklist items.

Conclusions Limiting checklist items to those affecting diagnostic decisions resulted in better accuracy and psychometric indices. Thoroughness items performed without thinking do not reflect clinical reasoning ability and contribute construct-irrelevant variance to scores.

Dr. Yudkowsky is associate professor, Department of Medical Education, and director, Dr. Allan L. and Mary L. Graham Clinical Performance Center, University of Illinois at Chicago, Chicago, Illinois.

Dr. Park is assistant professor, Department of Medical Education, University of Illinois at Chicago, Chicago, Illinois.

Dr. Riddle is assistant professor, Department of Medical Education, University of Illinois at Chicago, Chicago, Illinois.

Ms. Palladino is a student, University of Illinois College of Pharmacy. At the time of the study she was a graduate research assistant, Department of Medical Education, University of Illinois at Chicago, Chicago, Illinois.

Dr. Bordage is professor, Department of Medical Education, University of Illinois at Chicago, Chicago, Illinois.

Funding/Support: This study was funded in part by a grant from the National Board of Medical Examiners (NBME), Edward J. Stemmler MD Medical Education Research Fund. The project does not necessarily reflect NBME policy, and NBME support provides no official endorsement.

Other disclosures: None reported.

Ethical approval: Ethical approval was obtained from the institutional review board at the University of Illinois at Chicago (no. 2010-0362).

Previous presentations: Oral abstracts summarizing portions of this study were presented at the Annual Conference of the Association of Standardized Patient Educators, Atlanta, Georgia, June 2013; and at the Annual Clinical Skills Chicago Style Conference, Chicago, Illinois, August 2013.

Correspondence should be addressed to Dr. Yudkowsky, UIC-COM Department of Medical Education 986 CMET, 808 S. Wood MC 591, Chicago, IL 60612; telephone: (312) 996-3598; e-mail:

Performance tests such as standardized patient (SP) encounters are central to medical education, providing unique opportunities for direct observation and documentation of learner behavior. High-quality checklists are essential to the validity of performance test scores. Methods of checklist construction are rarely reported and are generally based on discussion among faculty members,1 resulting in checklists that tend to assess the thoroughness of an examinee’s exploration of a chief complaint. Several problems with this approach could compromise assessment validity: Items are rarely evidence based; checklists do not discriminate between experts and novices2; and there is little consensus on the relevance of the items.3,4 Indeed, it has been argued that clinicians need a more parsimonious approach to the history and physical (H&P) examination5 and that measurement of performance should reflect this parsimonious approach, analogous to the “key features” approach for written tests of clinical decision making that focuses only on essential or critical steps.6,7

Checklists as currently constructed can also have adverse consequences for learning. Medical students preparing for high-stakes clinical exams memorize lists of H&P “thoroughness” items associated with chief complaints. This leads students to gather data by rote, unmindful of the diagnostic implications of the findings; even when they finally do “begin to think,” students may not use their findings effectively to discriminate between diagnoses because they do not appreciate the clinically discriminating nature of the findings. Thoroughness checklists also do not encourage students to learn which elements are discriminating or key, creating the risk that novices base their diagnoses on nonessential or nondiscriminating items. “Fast and frugal” heuristics focusing on essential elements or discriminating features rather than rote thoroughness can lead to better decisions,8 but only if the essential elements are recognized as such.

If thoroughness checklists have shortcomings from both an assessment and instruction perspective, what is the alternative? Our prior study on the Hypothesis Driven Physical Exam (HDPE)9 assessed physical examination (PE) skills through SP encounters in which students were asked to anticipate, elicit, and interpret PE findings in the context of two competing diagnostic hypotheses. The results of that study highlighted the difference between two types of PE checklist items: those that discriminated between the competing diagnoses (henceforth, “clinically discriminating items”), and those that did not. PE checklists limited to clinically discriminating items were more reliable (generalizable) than thoroughness checklists and required fewer test cases per exam: 12 cases to reach a reliability of 0.8 with clinically discriminating checklists versus 22 cases for thoroughness checklists, a 45% saving of testing time. The HDPE also had positive learning consequences, encouraging students to conduct a PE in a purposeful manner while attending to the meaning of the findings obtained.

This study builds on and broadens the results from the HDPE study by asking whether a hypothesis-driven approach to checklist construction can be used with more comprehensive SP encounters such as those used in the United States Medical Licensing Examination Step 2 Clinical Skills exam, and whether this would enhance validity evidence beyond generalizability. Specifically, the purpose of this study was to determine the impact of using checklists that comprised clinically discriminating items only, compared with traditional thoroughness checklists, on the validity of an SP-based Clinical Skills Examination (CSE) for fourth-year medical students. By addressing five sources of validity evidence,10 we asked whether limiting the checklist to clinically discriminating items would provide a more content-relevant sampling of task behaviors (validity evidence: content); allow for more accurate ratings by SPs (response process); result in improved psychometric indices and higher generalizability across cases (internal structure); result in higher correlations with faculty’s global ratings of the encounters and postencounter notes, implying that faculty reward a clinical reasoning approach, or lower correlations if faculty reward thoroughness (relationship to other variables); and improve the quality of cut scores (consequences). We hypothesized that validity evidence would favor clinically discriminating checklists and support the further development of these checklists for assessments of clinical skills.

Method

We recruited six groups of three to five faculty members from four medical schools in the Chicago region in 2011. Each group developed an SP case based on an a priori specification of three or four competing differential diagnoses that should be actively explored during the encounter (see Table 1 for details of the cases and case development process). Data-gathering checklists were limited to a total of 20 items as per our standard protocol to manage SP cognitive load during the encounter.11 We circulated these items (the "preliminary checklist") to four clinician experts who independently identified items that would assist in differentiating or discriminating between the diagnostic options and suggested additional discriminating items. Items that three or more experts identified as discriminating between diagnoses constituted the final clinically discriminating checklists, averaging 8 to 9 items per case. Clinically discriminating items identified by three or more experts that did not appear on the preliminary checklist were added to the preliminary checklist to create the final thoroughness checklist. Thus, the clinically discriminating checklist of 8 to 9 items was a subset of the thoroughness checklist of 19 to 25 items (Table 2).


We administered the cases to 155 fourth-year medical students at the University of Illinois at Chicago during their summative CSE in 2011. Each SP encounter lasted 15 minutes, followed by 10 minutes to document the salient H&P findings, a differential of up to five diagnoses, and up to five items for the immediate workup of the patient. On rare occasions when an SP was sick or otherwise unavailable, a “backup” case was substituted for that round; backup case results are not included in this study. Thus, only 139 and 144 students encountered the dyspnea and fatigue cases, respectively.

We compared CSE scores on the basis of the thoroughness and clinically discriminating checklists on measures relevant to the five types of validity evidence. Ethical approval was obtained from the institutional review board at the University of Illinois at Chicago.


Question 1: Content

Five faculty judges rated each checklist item regarding its relevance to the successful diagnostic outcome of the case, using a four-point scale (0 = not relevant/remove item; 3 = essential).


Question 2: Response process

Each case was portrayed by two to five different SPs (mode = 4 SPs per case). Half of the SPs for each case were trained per standard protocols to complete the thoroughness checklist, and half were trained on the shorter clinically discriminating checklist. Individual SPs worked a variety of days and shifts based on their availability. Students were assigned to exam dates on the basis of student availability. Overall, 417 encounters were scored using thoroughness checklists and 486 using clinically discriminating checklists; see Table 2 for a breakdown by case. “Gold standard” checklist scores were determined by having two SP observers review video recordings and complete thoroughness checklists for all of the encounters in which SPs completed the clinically discriminating checklist, and 40 encounters per case for the encounters in which SPs completed the thoroughness checklist. (We reviewed only 40 of the thoroughness checklist encounters per case for reasons of feasibility and cost.) We resolved disagreements between observer ratings by re-reviewing the relevant portion of the recordings. Agreement between the SP ratings and corresponding gold standard ratings was calculated for both thoroughness and discriminating checklists using absolute (exact percentage) rater agreement and kappa (taking into account agreement by chance).
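
As a worked illustration of the agreement statistics named above (our sketch, not the authors' analysis code), the following computes exact percentage agreement and Cohen's kappa for binary checklist ratings; the ratings shown are invented.

```python
import numpy as np

def agreement_and_kappa(sp_ratings, gold_ratings):
    """Return exact percent agreement and Cohen's kappa for two binary rating vectors."""
    sp = np.asarray(sp_ratings)
    gold = np.asarray(gold_ratings)
    p_observed = np.mean(sp == gold)                       # exact (absolute) agreement
    # Expected agreement by chance, from each rater's marginal proportions of "yes" and "no"
    p_chance = np.mean(sp) * np.mean(gold) + (1 - np.mean(sp)) * (1 - np.mean(gold))
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, kappa

# Example: SP ratings vs. gold standard (video review) ratings for ten checklist items
sp   = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
gold = [1, 1, 0, 1, 1, 1, 1, 0, 1, 0]
print(agreement_and_kappa(sp, gold))   # (0.8, kappa ~ 0.52)
```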


Question 3: Internal structure

We created a full set of both thoroughness and clinically discriminating checklist scores for each case. For encounters originally scored using the clinically discriminating checklists, thoroughness checklist scores were calculated on the basis of gold standard (video review) scores; for encounters originally scored using the thoroughness checklist, clinically discriminating checklist scores were calculated on the basis of the subset of clinically discriminating items within the thoroughness checklist. We conducted psychometric analyses using the full dataset to determine item difficulty, item discrimination, standard error of measurement (SEM), scale reliability (coefficient alpha), and generalizability for scores based on thoroughness and clinically discriminating checklists.
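
For readers unfamiliar with these classical indices, the sketch below (our illustration, not the study's code) computes item difficulty, corrected item-total discrimination, coefficient alpha, and the SEM from a students-by-items matrix of 0/1 checklist scores.

```python
import numpy as np

def item_analysis(scores):
    """scores: 2-D array, rows = examinees, columns = checklist items (0/1)."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    totals = scores.sum(axis=1)

    difficulty = scores.mean(axis=0)   # proportion of examinees credited with each item
    # Corrected item-total (point-biserial) discrimination: item vs. total score minus that item
    discrimination = np.array([
        np.corrcoef(scores[:, j], totals - scores[:, j])[0, 1] for j in range(n_items)
    ])
    # Cronbach's alpha (internal consistency reliability)
    alpha = (n_items / (n_items - 1)) * (1 - scores.var(axis=0, ddof=1).sum() / totals.var(ddof=1))
    # Standard error of measurement on the total-score scale
    sem = np.sqrt(totals.var(ddof=1)) * np.sqrt(1 - alpha)
    return difficulty, discrimination, alpha, sem
```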


Question 4: Relationship to other variables

We recruited six pairs of primary care faculty members to serve as "expert observers" for the six cases. Each pair reviewed video recordings of 15 encounters of one of the six cases and rated the overall quality of each encounter on a five-point scale (definite fail, marginal fail, marginal pass, solid pass, and outstanding). The encounters were selected to include five high, five midrange, and five low checklist scores; raters were blind to checklist scores. All of the postencounter notes for each case were scored by a single faculty member (one faculty member per case) on a four-point behaviorally anchored global rating scale (not acceptable, borderline acceptable, acceptable, and excellent). We provided faculty members with the full case description and benchmark notes to help calibrate their ratings. Trained research assistants (one of whom was C.P.) independently reviewed the notes to identify the checklist items documented in the note and the accuracy of the differential diagnosis.


Question 5: Consequences

Cut scores for the thoroughness checklist were determined by five faculty members using a modified, three-level Angoff method12 with item-level judgments. In our analyses, we compared the results of cut scores, pass/fail rates, and rater agreement indices (intraclass correlations) of the Angoff judgments based on the thoroughness checklist items versus the subset of clinically discriminating checklist items.
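
A minimal sketch of how an item-level Angoff cut score can be aggregated from three-level judgments follows; the three levels and their probability mapping (1.0 / 0.5 / 0.0) are illustrative assumptions and may differ from the cited three-level method.

```python
import numpy as np

# Assumed mapping of three-level borderline-examinee judgments to probabilities (illustrative only)
LEVEL_TO_PROB = {"yes": 1.0, "maybe": 0.5, "no": 0.0}

def angoff_cut_score(judgments):
    """judgments: list of per-judge lists, one three-level rating per checklist item.
    Returns the cut score as a percentage of the checklist total."""
    probs = np.array([[LEVEL_TO_PROB[j] for j in judge] for judge in judgments])
    item_means = probs.mean(axis=0)          # expected borderline performance per item
    return 100 * item_means.sum() / probs.shape[1]

# Example: three judges rating a five-item checklist
judges = [
    ["yes",   "maybe", "no", "yes",   "maybe"],
    ["yes",   "yes",   "no", "maybe", "maybe"],
    ["maybe", "maybe", "no", "yes",   "no"],
]
print(round(angoff_cut_score(judges), 1))   # 53.3
```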

Results

Content

There were no statistically significant differences between relevance ratings of thoroughness and clinically discriminating checklists (all P values > .05). One thoroughness checklist item from the fatigue case had a very low average relevance rating of 0.8/3 and was removed from the exam as per our usual quality assurance protocol, resulting in a total of 127 items that were used in the exam.


Response process

Overall, there was 89% agreement between raters for the thoroughness checklist and 93% agreement for the clinically discriminating checklist. Rater accuracy was significantly greater for the clinically discriminating checklists compared with the thoroughness checklists, with kappa for the clinically discriminating checklists of 0.84 (95% CI, 0.80–0.88) and for the thoroughness checklists of 0.75 (95% CI, 0.73–0.77); see Table 2.


Internal structure

The correlations between scores on the thoroughness and clinically discriminating checklists for individual cases ranged from 0.43 to 0.69; the correlation between thoroughness and clinically discriminating checklist scores overall was 0.79 (95% CI, 0.72–0.84). Thoroughness and clinically discriminating checklists were of equal difficulty both overall and at the case level; overall thoroughness and clinically discriminating checklist difficulties were 0.68 and 0.70, respectively (t test P = .54). Item discrimination was generally higher for the clinically discriminating checklists, reaching statistical significance for three of the six cases and overall: overall item discrimination was 0.28 for thoroughness checklist items and 0.42 for clinically discriminating checklist items (P < .001; see Table 3). Internal consistency reliability (coefficient alpha) was higher for clinically discriminating checklist items than for thoroughness checklist items overall (average thoroughness checklist alpha coefficient across all items: 0.80, 95% CI, 0.76–0.84; clinically discriminating checklist: 0.92, 95% CI, 0.90–0.94), but this was variable at the individual case level. The SEM was lower for the clinically discriminating checklist both at the case level and overall (overall thoroughness checklist SEM 3.94, clinically discriminating checklist SEM 1.10). Although the thoroughness and clinically discriminating checklists had different numbers of items and different total scores, the SEM as a proportion of the total score (and of the z score) was lower for the clinically discriminating checklists (Table 4).


Scores based on the clinically discriminating checklists had higher generalizability coefficients than scores based on the thoroughness checklists; thoroughness checklist G = 0.504, Phi = 0.420; clinically discriminating checklist G = 0.533, Phi = 0.491 (Table 5). For the thoroughness checklist, items nested within cases (i:c) accounted for most of the variance (28.1%) (aside from the residual); for the clinically discriminating checklist, the person-by-cases (p × c) interaction accounted for most of the variability (41.2%)—most of the score variance was due to differences between students in their ability to handle the different cases. D-studies showed that 36 thoroughness-checklist-type cases with 20 items per case would be required to reach a Phi of 0.8, whereas 26 discriminating-checklist-type cases with 8 (clinically discriminating) items per case would be required to reach a Phi of 0.8, a savings of 10 cases (28%).
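
The D-study projections above follow the standard generalizability-theory formulas for a person x (item:case) design. The sketch below illustrates those formulas only; the variance components used here are invented placeholders, not the study's estimates.

```python
def d_study(var, n_cases, n_items):
    """var: dict of variance components for persons (p), cases (c), items nested in
    cases (i:c), person-by-case interaction (pc), and residual (pi:c,e)."""
    rel_error = var["pc"] / n_cases + var["pi:c,e"] / (n_cases * n_items)
    abs_error = rel_error + var["c"] / n_cases + var["i:c"] / (n_cases * n_items)
    g = var["p"] / (var["p"] + rel_error)      # generalizability (relative) coefficient
    phi = var["p"] / (var["p"] + abs_error)    # dependability (absolute) coefficient, Phi
    return g, phi

# Placeholder variance components (not the study's values)
components = {"p": 0.010, "c": 0.006, "i:c": 0.028, "pc": 0.041, "pi:c,e": 0.080}
for n_cases in (6, 12, 26, 36):
    g, phi = d_study(components, n_cases, n_items=8)
    print(n_cases, round(g, 2), round(phi, 2))
```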



Relationship to other variables

The correlation of the thoroughness checklist scores with clinically discriminating checklist scores for the subset of 90 encounters with global ratings was 0.74 (95% CI, 0.59–0.82). The correlations of the faculty global ratings of the encounter with the thoroughness checklist scores (r = 0.46, 95% CI, 0.25–0.63) did not differ significantly from the correlations of the global ratings with the clinically discriminating checklist scores (r = 0.32, 95% CI, 0.09–0.51). The correlations between the SP checklist scores and the documentation of the checklist items in the notes overall were r = 0.59 (95% CI, 0.42–0.72) for the thoroughness checklist and r = 0.47 (95% CI, 0.28–0.63) for the clinically discriminating checklist. Neither SP checklist was a good predictor of the correct differential diagnoses. Correlations between SP checklist scores and global ratings of the notes were r = 0.23 (95% CI, 0.17–0.29) for the thoroughness checklist and r = 0.12 (95% CI, 0.06–0.19) for the clinically discriminating checklist.
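
The article does not state how the confidence intervals for these correlations were computed; as one common approach, the sketch below uses the Fisher z transformation, which yields similar but not necessarily identical intervals.

```python
import math

def correlation_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a Pearson correlation r with sample size n,
    via the Fisher z transformation."""
    z = math.atanh(r)                  # Fisher transform of r
    se = 1 / math.sqrt(n - 3)          # standard error on the z scale
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)

# Example with generic values (not the study's data)
print([round(v, 2) for v in correlation_ci(0.50, 100)])   # [0.34, 0.63]
```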

Consequences

Although case-level cut scores varied across conditions, there was no difference in the final overall passing cut score (58%) between the thoroughness and clinically discriminating checklists. Despite the equivalent cut scores, the thoroughness and clinically discriminating checklists resulted in different passing rates: 88% for the thoroughness checklist versus 91% for the clinically discriminating checklist. The intraclass correlation (ICC) for the Angoff judgments of the thoroughness checklist (ICC = 0.71 [95% CI, 0.64–0.77]) was not statistically different from that of the clinically discriminating checklist (ICC = 0.53 [95% CI, 0.40–0.67]).

Discussion

The purpose of this study was to gather validity evidence for SP encounter scores based on focused, clinically discriminating checklists as compared with evidence for scores based on traditional, thoroughness checklists. Sources of validity evidence in favor of the clinically discriminating checklist included greater SP checklist accuracy, as expected for a shorter checklist, and better item discrimination, internal consistency reliability, SEM, and generalizability. Although there were unsystematic case-level differences throughout, there were no significant differences overall in relevance ratings of clinically discriminating versus thoroughness checklist items, difficulty of clinically discriminating versus thoroughness checklist items, or cut scores derived from clinically discriminating versus thoroughness checklists. Scores derived from clinically discriminating checklists did result in a 91% pass rate versus an 88% pass rate for scores based on thoroughness checklists.

These findings confirm that a clinical reasoning approach to checklist development, which limits items to those relevant to sorting out a differential diagnosis, can result in improved rating accuracy and better psychometric indices for the resulting scores. This was due to the elimination of "thoroughness" items that students learn by rote; because they are performed without thinking, these items do not reflect students' clinical reasoning ability and contribute "noise" or construct-irrelevant variance to the score.

Most of the variance in clinically discriminating checklist scores was due to the person-by-case interaction, consistent with case specificity; test blueprints that systematically sample across cases would be well supported by focusing on clinically discriminating items. In contrast, thoroughness checklist scores mostly reflected variability in the difficulty of items sampled in each case (i:c) with low person variance, indicating poor ability to discriminate student performance, perhaps because all students had memorized the same rote thoroughness approach.

Checklists can serve an important formative function during the development of clinical reasoning skills: The composition of the checklists drives learning and conveys faculty values regarding data gathering. Although we did not examine the educational consequences of studying for a clinical reasoning-oriented exam versus a rote exam, our previous study9 indicated that a checklist focused on the clinical reasoning task rather than thoroughness encouraged students to identify and focus on H&P items that are clinically discriminating, and thus promoted active clinical thinking.

The present findings suggest that faculty tend to take a less productive, thoroughness-oriented approach to checklist development. We were not able to determine whether this approach extends to faculty global ratings of performance based on video-recorded encounters and postencounter notes, because there were no significant differences in the correlations of faculty global ratings with thoroughness versus clinically discriminating checklist scores (the 95% CIs overlapped). Faculty development is needed to support the transition to an assessment strategy that promotes clinical reasoning and encourages students to take a more thoughtful, individually targeted, and problem-solving approach to their patients.

The study was conducted with a single cohort of students at one U.S. medical school. We encourage others to replicate our findings with other groups of students. Clinically discriminating items were identified on the basis of expert opinion and were not necessarily evidence based. For the purpose of our study, we focused only on items that were salient to the task of establishing the diagnosis; additional information may be needed for other aspects of clinical reasoning such as selecting appropriate treatment and management options. Faculty should identify the specific clinical reasoning tasks most salient to each SP encounter and develop a small number of key checklist items that best reflect and support the specific decisions that need to be made. Future studies will explore the educational impact of focusing checklists on the clinical reasoning task, as well as the validity consequences of limiting ratings of communication and interpersonal skills to those items salient to the specific communication challenge.

In summary, validity evidence favored the use of SP checklists that focused closely on the clinical reasoning task rather than on rote thoroughness; the focused checklists showed greater rating accuracy and better psychometric indices and required fewer cases to achieve acceptable test score reliability. Thoroughness items that are learned and performed by rote do not reflect clinical reasoning ability and contribute construct-irrelevant variance to the scores. A hypothesis-driven approach to checklist development can enhance the validity of score interpretation, increase test efficiency, and foster active clinical reasoning during patient encounters.

Acknowledgments: The authors are grateful for the generous participation of faculty members from medical schools across the Chicago area.

References

1. Gorter D, Rethans JJ, Scherpbier A, et al. Developing case-specific checklists for standardized-patient-based assessments in internal medicine: A review of the literature. Acad Med. 2000;75:1130–1137
2. Hodges B, Regehr G, McNaughton N, Tiberius R, Hanson M. OSCE checklists do not capture increasing levels of expertise. Acad Med. 1999;74:1129–1134
3. Nendaz MR, Gut AM, Perrier A, et al. Degree of concurrency among experts in data collection and diagnostic hypothesis generation during clinical encounters. Med Educ. 2004;38:25–31
4. Boulet JR, van Zanten M, de Champlain A, Hawkins RE, Peitzman SJ. Checklist content on a standardized patient assessment: An ex post facto review. Adv Health Sci Educ Theory Pract. 2008;13:59–69
5. Mangione S, Peitzman SJ. Physical diagnosis in the 1990s. Art or artifact? J Gen Intern Med. 1996;11:490–493
6. Bordage G, Brailovsky C, Carretier H, Page G. Content validation of key features on a national examination of clinical decision-making skills. Acad Med. 1995;70:276–281
7. Norman G, Bordage G, Page G, Keane D. How specific is case specificity? Med Educ. 2006;40:618–623
8. Wegwarth O, Gaissmaier W, Gigerenzer G. Smart strategies for doctors and doctors-in-training: Heuristics in medicine. Med Educ. 2009;43:721–728
9. Yudkowsky R, Otaki J, Lowenstein T, Riddle J, Nishigori H, Bordage G. A hypothesis-driven physical examination learning and assessment procedure for medical students: Initial validity evidence. Med Educ. 2009;43:729–740
10. Downing SM. Validity: On meaningful interpretation of assessment data. Med Educ. 2003;37:830–837
11. Vu NV, Marcy MM, Colliver JA, Verhulst SJ, Travis TA, Barrows HS. Standardized (simulated) patients’ accuracy in recording clinical performance check-list items. Med Educ. 1992;26:99–104
12. Yudkowsky R, Downing SM, Popescu M. Setting standards for performance tests: A three-level Angoff method. Acad Med. 2008;83(suppl 10):S13–S16
© 2014 by the Association of American Medical Colleges