The bedside skills of communication and physical examination—while critical for good patient care—have not always been carefully taught or reliably assessed. Starting with the administration of the Medical Council of Canada’s Qualifying Examination Part II (MCC QE II) in 1992, evolving through the administration of the Educational Commission for Foreign Medical Graduates Clinical Skills Assessment from 1998 to 2004, and eventually refined as part of the United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills (CS) and Comprehensive Osteopathic Medical Licensing Examination (COMLEX) Level 2-Performance Evaluation (PE), clinical skills assessment is now a fundamental part of the process of licensing physicians in Canada and the United States.1 Some evidence indicates that performance on national licensing examinations is associated with improved patient safety, enhanced quality of care, and identification of doctors likely to subsequently face disciplinary action, although these relationships are not necessarily causal.2 Given the cost of developing and administering these clinical skills examinations, and the relatively weak associations with outcome measures, students have questioned whether the potential benefits of a high-stakes clinical skills examination are worth the investment. While a large body of evidence concerning the validity of clinical skills assessment already exists, the link between performance on a large-scale licensing examination and practice with “real” patients is weak. Although we do not agree with students’ petition to end the USMLE Step 2 CS or the COMLEX Level 2-PE, we concur with Ecker and colleagues3 that conducting further research to support the validity argument, providing greater feedback on performance to assist institutions and motivate students, and exploring options to reduce costs are important for addressing students’ and other stakeholders’ concerns.
Many of the arguments supporting or rejecting a specific examination rest on the validity of the scores or the decisions that are made based on the scores. Great efforts have been made to ensure that both the Step 2 CS and Level 2-PE are standardized. Individuals can take these examinations at different locations, with different sets of simulated scenarios and standardized patients (SPs), and still have comparable scores. Likewise, the testing time (i.e., number of stations) is sufficient, resulting in an adequate sampling of examinee skills and reasonably reliable assessment scores, at least for the composite domains that the examinees are required to pass. For both the USMLE Step 2 CS and COMLEX Level 2-PE, defensible, criterion-referenced standards are set and updated regularly.4 All of these efforts speak to the validity of the scores and associated pass/fail decisions.
In addition, Ecker and colleagues point to another type of validity evidence—consequential validity—that underscores the soundness of these examinations. One beneficial outcome of high-stakes clinical skills testing has been a dramatic increase in the emphasis on the teaching and assessment of these skills throughout the medical curriculum.5,6 Still, the medical education community needs to recognize that even more evidence can be gathered to support the validity argument for and the benefits of these tests.
Evidence to support the validity of clinical skills assessment scores and associated inferences already exists. What is lacking, and would help further justify the use of these examinations, is more evidence to support the “extrapolation” argument—that is, is performance on these examinations related to actual patient care? While some published studies show that Step 2 CS scores are related to residency performance,7,8 and that MCC QE II scores have been associated with various physician quality metrics,9,10 further work is possible. For example, one could look at the relationship between Step 2 CS performance and various criterion measures, including specialty board certification examination scores and status, disciplinary actions, participation in maintenance of certification (MOC) activities, and, most important, patient outcomes. Longitudinal studies, however, are difficult to conduct, and causal attribution can only be inferred. In addition, if one wanted to look, for example, at the association between communication ability (measured in both Step 2 CS and Level 2-PE) and residency performance, the analysis would necessarily be restricted to those who passed (since those who do not eventually pass do not move on to the residency phase of their medical training). Given that the examinations are specifically constructed to discriminate near the cut score, range restrictions would tend to attenuate any associations. Additionally, numerous variables may confound longitudinal studies of medical skills performance including, among others, residency training, specialty choice, and practice location.
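The attenuating effect of range restriction described above can be illustrated with a small simulation. This sketch uses entirely hypothetical data (a true exam–criterion correlation of 0.50 and a cut score at the mean are assumptions for illustration, not estimates from any actual examination):

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation computed from first principles."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
# Hypothetical standardized exam scores and a criterion (e.g., a residency
# performance rating) constructed so the true correlation is 0.50.
exam = [random.gauss(0, 1) for _ in range(20000)]
criterion = [0.5 * e + random.gauss(0, (1 - 0.25) ** 0.5) for e in exam]

r_full = pearson(exam, criterion)

# Restrict the sample to "passers" (scores above an illustrative cut at the
# mean), mimicking the fact that only passers reach residency.
passers = [(e, c) for e, c in zip(exam, criterion) if e > 0.0]
r_restricted = pearson([e for e, _ in passers], [c for _, c in passers])

# The correlation among passers is markedly smaller than in the full sample,
# even though the underlying exam-criterion relationship is unchanged.
print(round(r_full, 2), round(r_restricted, 2))
```

Because the truncated sample has less variance in exam scores, the observed correlation among passers understates the true relationship, which is why weak observed associations in studies of residents do not by themselves impugn the examination.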
Another type of validity evidence—predictive validity—may not be as convincing in understanding the value of performance assessment. Assessing clinical skills performance, just like taking the road test for a driver’s license, occurs at a specific time point. At the moment an examinee passes the driver’s road test, the community knows, or at least can infer, that he or she can obey the rules of the road. It is unlikely that the driver will always do so in the future, but this does not negate the value of the initial road test. The bottom line is that an individual can possess the necessary skills to drive a car yet, on occasion, choose not to use them. Similarly, requiring students to pass a clinical skills examination does not necessarily eliminate improper conduct in the future, but this requirement should lead to a more able cohort of residents entering postgraduate training.
Feedback on Performance
Feedback is one of the most powerful drivers of learning,11 and undergraduate and graduate medical education programs are clearly responsible for providing ongoing information about clinical skills performance to learners. Although the purpose of licensing examinations is to make decisions at the cut point, major resources have been invested in their development and maintenance, and there is an opportunity to provide additional valuable feedback. The examinations not only offer more reliable measurement than most schools could achieve; they can also allow for normative comparisons across institutions. USMLE Step 2 CS and COMLEX Level 2-PE currently inform examinees only of whether they have achieved the standard for minimal competency in clinical skills. Test takers are primarily motivated to pass, not necessarily to excel. While performance data from licensing examinations should be used judiciously given these tests’ primary purpose,12 additional data would prove helpful to learners and their institutions.
Would USMLE Step 2 CS and COMLEX Level 2-PE be more beneficial to students if the examinations could reliably detect differences in the quality of candidates’ clinical skills across the ability spectrum and across the measured domains? If this reliability criterion could be achieved, then more detailed score reports—allowing for the rank-ordering of candidates—might motivate individuals not just to pass but to excel in the measured domains.13 As Ecker and colleagues point out, the value of preparing for and performing well on the examination would certainly increase. This positive outcome may be particularly relevant in an increasingly competitive residency environment. Medical school deans would be motivated to hone their students’ clinical skills. With more scoring information, schools would be in a better position to remediate poorly performing students. Currently, failing examinees and their faculty have little idea of where to concentrate remediation efforts. In addition, curricular committees could evaluate any weaknesses in their curriculum or in the clinical skills teaching program. Deans would have a better understanding of where best to focus or redirect their limited resources. Finally, the perception of examinees is also important: Given the high cost of the examination, students, especially those who fail, likely feel that receiving only a “Pass” or “Fail” (or even broad categorizations concerning strengths and weaknesses) is an unsatisfying result. Receiving, instead, more discrete information about individual strengths, particular areas of improvement, and standing compared with peers would likely be better received among examinees, even if the cost of the examination remains understandably high.
Is the current scoring model capable of capturing increasing levels of expertise, or could it actually be penalizing more proficient or expert examinees, as some have suggested?14 Correlation of high-stakes clinical skills scores with other measures of ability, particularly for those with more experience, could help answer the question of how well current assessment models differentiate across the ability spectrum. If the inference of graded ability is not currently tenable, exam items or scoring rubrics could be modified to increase the precision of the scores over the entire ability continuum. Case development could be altered to present a higher degree of challenge to examinees. The addition of more patients with real or simulated physical findings may be helpful, although this reform would significantly add to the challenge of standardization across sites. SPs could be trained to portray patients who are somewhat less forthcoming or organized in their responses. The goal should be to create an examination model that will yield scores that can be used to provide meaningful feedback—to individuals and institutions—on performance across the range of student abilities.
With any examination system, particularly one placing a significant financial burden on examinees, the onus is on the testing agency to be as transparent as possible about costs and benefits without jeopardizing exam security. Recent protests about MOC by diplomates of the American Board of Internal Medicine (ABIM) and the subsequent reevaluation of MOC by the ABIM manifest this tension.15
The degree of precision required in the estimate of candidates’ clinical skills has a direct bearing on cost. Reliability of examinee scores is dependent on a number of factors, including testing time (i.e., the number of stations), fidelity of SP portrayals, and training of the people who provide the scores. All of these factors, including the associated quality assurance processes, add to the cost. Importantly, passing examinees practice in supervised settings, which mitigates the potential patient safety risks associated with false-positive outcomes. Given this supervision, could slightly lower score precision and decision consistency be tolerable? If yes, the cost of the examinations could be lowered.
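The trade-off between testing time and reliability can be sketched with the classical Spearman-Brown prophecy formula, which projects the reliability of a test built from parallel stations. The single-station reliability used below is purely illustrative, not a figure from either examination:

```python
def spearman_brown(rel_one_station: float, n_stations: int) -> float:
    """Projected reliability of a composite score from n parallel stations,
    given the reliability of a single station (Spearman-Brown prophecy)."""
    return n_stations * rel_one_station / (1 + (n_stations - 1) * rel_one_station)

# Hypothetical single-station reliability of 0.20; actual values vary by
# examination and score domain.
for n in (6, 9, 12):
    print(n, round(spearman_brown(0.20, n), 2))
```

Under this assumption, cutting an examination from 12 stations to 6 would drop the projected reliability from 0.75 to 0.60, making concrete why shortening the examination to save money directly spends score precision.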
Can the same precision be achieved for lower cost? For example, given the high cost of SP training, can the same fidelity of portrayal and scoring accuracy be achieved with a shorter training period? Other questions about the examination process could also bear scrutiny. Are there ways to develop cases more efficiently? Are quality assurance efforts aligned with degree of needed precision? Are there less expensive ways to score the examination or obtain quality assurance measures (e.g., natural language processing for clinical notes, audio or computer vision analysis for communication skills, a more robust sampling framework for quality assurance activities)? Research addressing these queries may conclude with support for the status quo, but investigation may also lead to cost reduction. That is, examining SP training, case development, and exam scoring is important, as each influences cost.
Evaluation of other recent process changes is equally important. Did recent investment in USMLE Step 2 CS enhancements, including changes to the structure of the patient note and a reworking of the communication and interpersonal skills scoring rubric, improve score reliability? Did these augmentations, which required significant retraining of SPs and note raters at some cost, provide additional evidence to support the validity of the scores? Examinees, their schools, and the public are likely to be more supportive of high fees and the enhancements they support when the effects (e.g., increased reliability and validity) and costs are known.
Another important question about cost has been raised by skeptics of the current clinical skills assessment process: Would costs be reduced if clinical skills examinations were conducted at more sites, and perhaps at medical schools? We agree with Ecker and colleagues that the answer is no. Standardizing the examination across many centers requires allocation of major resources for training, monitoring, and security. The current effort to standardize across the eight centers (six for USMLE Step 2 CS, two for COMLEX Level 2-PE) is already quite substantial. High volume at a limited number of centers reduces costs. It increases the likelihood that exam sessions will be full, minimizes the need to retrain SPs whose skills have lapsed between work assignments, and helps ensure enough performance data at the SP/scenario level to properly calibrate the scores. Additionally, potential bias based on prior exposure to local SPs is more likely if the examination were to be administered at medical schools. As examinee travel costs go down in a decentralized model, administration and measurement costs tend to go up.
USMLE Step 2 CS and COMLEX Level 2-PE are valuable examinations. Providing additional evidence to support the validity of the scores (or decisions based on the scores), motivating students and assisting institutions through enhanced score reporting, and containing cost in proportion to need for precision can help address some of the concerns expressed by students and other stakeholders. Testing organizations can embrace these essential elements of transparency and accountability to address concerns about the value of clinical skills examinations.
1. Boulet JR, Smee SM, Dillon GF, Gimpel JR. The use of standardized patient assessments for certification and licensure decisions. Simul Healthc. 2009;4:35–42.
2. Archer J, Lynn N, Coombes L, et al. The impact of large scale licensing examinations in highly developed countries: A systematic review. BMC Med Educ. 2016;16:212.
3. Ecker DJ, Milan FB, Cassese T, et al. Step up—not on—the Step 2 Clinical Skills exam: Directors of Clinical Skills Courses (DOCS) oppose ending Step 2 CS. Acad Med. 2018;93:693–698.
4. Roberts WL, Boulet J, Sandella J. Comparison study of judged clinical skills competence from standard setting ratings generated under different administration conditions [published online ahead of print February 21, 2017]. Adv Health Sci Educ Theory Pract. doi: 10.1007/s10459-017-9766-1.
5. First LR, Chaudhry HJ, Melnick DE. Quality, cost, and value of clinical skills assessment. N Engl J Med. 2013;368:963–964.
6. Gilliland WR, La Rochelle J, Hawkins R, et al. Changes in clinical skills education resulting from the introduction of the USMLE step 2 clinical skills (CS) examination. Med Teach. 2008;30:325–327.
7. Cuddy MM, Winward ML, Johnston MM, Lipner RS, Clauser BE. Evaluating validity evidence for USMLE Step 2 Clinical Skills data gathering and data interpretation scores: Does performance predict history-taking and physical examination ratings for first-year internal medicine residents? Acad Med. 2016;91:133–139.
8. Winward ML, Lipner RS, Johnston MM, Cuddy MM, Clauser BE. The relationship between communication scores from the USMLE Step 2 Clinical Skills examination and communication ratings for first-year internal medicine residents. Acad Med. 2013;88:693–698.
9. Tamblyn R, Abrahamowicz M, Dauphinee D, et al. Physician scores on a national clinical skills examination as predictors of complaints to medical regulatory authorities. JAMA. 2007;298:993–1001.
10. Wenghofer E, Klass D, Abrahamowicz M, et al. Doctor scores on national qualifying examinations predict quality of care in future practice. Med Educ. 2009;43:1166–1173.
11. Hattie J, Timperley H. The power of feedback. Rev Educ Res. 2007;77:81–112.
12. Prober CG, Kolars JC, First LR, Melnick DE. A plea to reassess the role of United States Medical Licensing Examination Step 1 scores in residency selection. Acad Med. 2016;91:12–15.
13. McIlroy JH, Hodges B, McNaughton N, Regehr G. The effect of candidates’ perceptions of the evaluation method on reliability of checklist and global rating scores in an objective structured clinical examination. Acad Med. 2002;77:725–728.
14. Hodges B, Regehr G, McNaughton N, Tiberius R, Hanson M. OSCE checklists do not capture increasing levels of expertise. Acad Med. 1999;74:1129–1134.