Residency programs have been reporting milestones data to the Accreditation Council for Graduate Medical Education (ACGME) for the past several years.1,2 The milestones project created a framework to guide competency attainment but did not mandate how these changes should be implemented or assessed.3 Early studies suggest milestones can document trainee competence progression over time4–8 but did not report the specific methods training programs use to collect the data. Instead, programs were encouraged to develop their own programs of assessment,9 and published literature in this area consists mostly of single-institution studies.10–22 In general, larger academic programs were able to create robust assessment systems, while many smaller programs either adapted their previous system for milestone use or borrowed approaches from other sources.23,24
Implementing an outside assessment system can be fraught with challenges. Workplace-based assessment (WBA) tools often demonstrate low reliability.25 Educators have proposed various tools, rater training techniques, rating scales, and mathematical modeling or data manipulation (learning analytics) to address this issue, but without great success.25–28 Entrustability scales in the competency-based era have gained favor in aligning judgments with specific anchor scale measures, but WBA tools that include these scales are limited by their inability to account for context.25,29,30 Models of entrustment have been developed that include context as an important element,31,32 but rarely has the same WBA system been deployed and studied in situ in different contexts for extended periods of time.
In this study, we compared and contrasted the results of an internal medicine workplace-based assessment system developed at a university-based medical center, the University of Cincinnati (UC),21,22 and its 4-year deployment within a community-based residency program, the Good Samaritan Hospital (GSH). Our goal was to determine if entrustment varied between programs when a single assessment system was deployed in different contexts over time. Similarities and differences of such efforts may have important validity implications for accrediting bodies such as the ACGME that collect competence ratings from myriad sources.
The UC internal medicine residency program consists of 75 categorical residents rotating through an urban tertiary referral academic medical center. GSH residency has 28 categorical residents rotating through an urban community-based independent hospital. Although the 2 medical centers are separated by only 1.5 miles, there is little overlap between residents, and virtually no faculty members have dual appointments.
In 2011, UC developed and deployed the observable practice activity (OPA) assessment system.21,22 OPAs are discrete WBAs rated on an entrustment scale (1 = critical deficiency/do not trust, 2 = trust with direct supervision, 3 = trust with indirect supervision, 4 = trust with no supervision, 5 = aspirational performance). Faculty members, peers, nurses, social workers, pharmacists, case managers, office staff, and others who are in a position to directly observe resident performance are given the opportunity to provide entrustment ratings. Faculty-based assessment forms contain an average of 10 content-based OPAs, each of which is specific to a given rotation (e.g., manage pneumonia on general medicine or interpret an electrocardiogram on cardiology), and 10 process-based OPAs (e.g., minimize unfamiliar terms during patient encounters or use feedback to improve performance) common to all forms and rotations. Nonattending-physician raters have an average of 4 OPAs to assess per form, specific to the vantage point and expertise of the rater (e.g., perform handovers for peers, return pages in a timely manner for nurses, and demonstrate respectful behavior to all members of the health care team for social workers).33 Program directors and key clinical faculty created new OPAs and forms as rotations were adjusted or designed. OPA entrustment ratings serve as formative feedback at the end of clinical rotations and are mapped to ACGME subcompetencies with entrustment ratings corresponding to milestone levels.21 For example, minimize unfamiliar terms during patient encounters is mapped to the subcompetencies of interpersonal and communication skills 1 (communicates effectively with patients and caregivers) and to professionalism 3 (responds to each patient’s unique characteristics). These data are used for longitudinal tracking and reporting of progression toward competence over time.21,22
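The mapping from OPA ratings to subcompetencies described above can be pictured as a small lookup structure. The following is a minimal sketch, not the MedHub implementation: the short codes "ICS-1" and "PROF-3" are illustrative labels for the two subcompetencies named in the example, the function names are hypothetical, and averaging the ratings per subcompetency is an assumption made only for illustration.

```python
from collections import defaultdict

# Entrustment scale as listed in the text.
ENTRUSTMENT_SCALE = {
    1: "critical deficiency/do not trust",
    2: "trust with direct supervision",
    3: "trust with indirect supervision",
    4: "trust with no supervision",
    5: "aspirational performance",
}

# OPA -> mapped subcompetencies. The codes are hypothetical labels for
# interpersonal and communication skills 1 and professionalism 3.
OPA_MAP = {
    "minimize unfamiliar terms during patient encounters": ["ICS-1", "PROF-3"],
}


def expand_rating(opa, rating):
    """Turn one OPA rating into one rating per mapped subcompetency."""
    if rating not in ENTRUSTMENT_SCALE:
        raise ValueError(f"rating must be 1-5, got {rating!r}")
    return [(sub, rating) for sub in OPA_MAP[opa]]


def mean_by_subcompetency(ratings):
    """Average the expanded ratings per subcompetency (an assumed summary)."""
    collected = defaultdict(list)
    for opa, rating in ratings:
        for sub, value in expand_rating(opa, rating):
            collected[sub].append(value)
    return {sub: sum(vals) / len(vals) for sub, vals in collected.items()}


# Two made-up ratings of the same OPA, for illustration only.
example = mean_by_subcompetency([
    ("minimize unfamiliar terms during patient encounters", 3),
    ("minimize unfamiliar terms during patient encounters", 4),
])
```

Because one OPA can map to several subcompetencies, a single rating yields several subcompetency assessments, which is one way to read the large subcompetency assessment counts reported in the Results.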
At UC, faculty development efforts included in-person instructional meetings, monthly reminder emails, online videos,34 and a formal “feedback on feedback” process,35 although no formal tracking of these efforts was undertaken. Nearly all physician faculty members received teaching practice plan dollars36 for resident instruction, which could be withheld if they did not complete assessments within 2 weeks of the end of the rotation. Nonattending-physician raters received no instructions other than those included in the assessment forms and had no penalty for noncompletion. Initial data collection at UC began in August 2012.
In September 2013, GSH adopted the OPA assessment system. The initial set of forms and maps used by GSH were shared directly within a common residency management system (MedHub)37 and were exact copies of the corresponding rotation forms used at UC. Over time, GSH “mixed and matched” OPAs to fit its context and created several new forms as rotations there dictated. GSH did not create new OPAs nor did it change mapped subcompetencies from the original designations. Faculty development efforts included a formal meeting at the introduction of the system, in-person instructional meetings, emails at regular intervals, and semiannual faculty seminars on effective feedback. Evaluations were expected to be completed within a week of the end of a rotation. All faculty members were on staff at GSH, and half received minimal compensation for teaching (none of it was withheld for noncompletion). As at UC, nonattending-physician raters had no formal instructions other than those on the forms. There were also no penalties for noncompletion of assessment forms for physician or nonphysician raters.
Initial OPA data collection and mapping occurred via the MedHub residency management system. Data collection for both sites ended in June 2017. Data were then imported into SAS statistical software, version 9.4 (SAS Institute, Cary, North Carolina) for all analyses. We calculated summary statistics including means and medians for each program and compared them using t tests, chi-square tests, or the Wilcoxon rank-sum test where appropriate. The distributions of subcompetencies at each site were calculated and compared using the chi-square test. We calculated the mean level of entrustment at each site by month of residency. These were then compared between the 2 programs using linear regression with an interaction term. We calculated the percentage of each entrustment score by postgraduate year (PGY) level for each site. These were compared using the chi-square test. The variation between residents in the 2 programs was compared using the F test. To account for multiple comparisons, a Bonferroni correction to the P value was applied as .05 divided by the number of comparisons. For comparison of mean entrustment ratings for all raters of all residents for the 36 months of residency between the 2 programs, an adjusted P value of <.00139 was considered significant. For comparing the subcompetencies and competencies, adjusted P values of <.00139 and .01 were used. All other comparisons used a P value of <.05 to determine significance. To aid in interpretation, we calculated effect sizes for comparing the assessment data between the 2 programs.
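The Bonferroni threshold for the month-by-month comparison can be checked with simple arithmetic. The sketch below assumes that the 36 monthly comparisons are the divisor, which is consistent with the reported value of .00139; the function names are hypothetical and the analyses themselves were run in SAS, not Python.

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison significance threshold: alpha / number of comparisons."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


def is_significant(p_value, threshold):
    """A comparison counts as significant when its P value is below the threshold."""
    return p_value < threshold


# Assumption: 36 monthly comparisons (one per month of residency).
monthly_threshold = bonferroni_threshold(0.05, 36)
print(f"{monthly_threshold:.5f}")  # .05 / 36, rounded to 5 decimal places
```

Dividing the overall alpha by the number of tests keeps the chance of any false-positive finding across all comparisons at or below .05, at the cost of a stricter bar for each individual month.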
The institutional review boards of both UC and GSH approved this study.
UC had significantly more residents included in the dataset than GSH had (255 versus 57) due to larger class sizes and an additional year of data collection (Table 1). UC residents were less likely to be foreign medical graduates (24/255, 9% versus 50/57, 88%; P < .0001) and had higher baseline United States Medical Licensing Examination (USMLE) scores (median USMLE Step 1 score of 224 versus 218; P = .0011; median USMLE Step 2 score of 237 versus 222; P < .0001). UC had no residents who had already completed a residency program in another country, whereas GSH had 5. UC had more unique faculty evaluators in the study period (394 versus 43), more subcompetency assessments collected (619,179 versus 101,821), more unique rotation forms (231 versus 39), and more unique OPAs assessed (486 versus 290). The median number of subcompetency assessments per PGY was highest in the first year at UC and in the third year at GSH (Table 1). The ratios of faculty member to peer/allied health assessments were similar in the 2 programs (approximately 80% versus 20%, respectively).
Figure 1 shows the progression of entrustment over time for all assessors in both programs for all residents over the 36 months of residency. Every month-to-month comparison was significantly different between the 2 programs with the exception of month 32. Effect sizes were generally in the moderate range, with overall effect size of 0.49. GSH ratings were higher than UC ratings for all but the 11th, 15th, and 36th months. The rate of change for average monthly entrustment (slope of the lines of best fit) was similar between the 2 groups, with GSH having an increase of 0.041 each month versus 0.042 for UC (P = .73) with a coefficient of determination of 0.84 for GSH and 0.95 for UC.
Figure 2 shows the progression of entrustment of individual residents over the course of 36 months. In general, most residents progressed from lower to higher entrustment, but there was variation between residents in each program.
Table 2 shows the percentage of entrustment ratings made for each PGY level for each program. All comparisons between programs were significantly different (P < .0001). GSH residents received proportionally more level 4 or 5 entrustment ratings than UC residents did, most notably in the peer/allied health assessments (e.g., 85.9% of GSH peer/allied health entrustment ratings of PGY-3 residents were at level 5 versus 46.5% of UC residents), and for both programs, there was significant variation in the entrustment ratings given by faculty and nonfaculty assessors. GSH also had more level 1 assessments than UC from both faculty and nonfaculty assessors of PGY-1 residents, although the absolute number of these assessments was very low (<1%) in both programs.
Table 3 shows the number of assessments and ratings of each subcompetency and competency for each program, as well as a comparison of the relative rank of each count and rating for each subcompetency within each program. All mean ratings for subcompetencies and competencies were higher for GSH than for UC (except for subcompetency systems-based practice [SBP]-2, recognizes system error and advocates for system improvement), with P < .0001 for all values except SBP-2. Effect sizes for significant rating differences ranged from 0.29 to 0.69, with most above 0.50. Relative ranks of assessment counts were within 3 for all subcompetencies and within 1 for all competencies. The majority of relative ranks of assessment ratings were within 3, with exceptions including SBP-2 (rated only 6 times by GSH), patient care (PC)-4 (skill in performing procedures), and PC-5 (requests and provides consultative care).
Our study is an early attempt to understand if entrustment varies between programs when a single assessment system is deployed in different contexts (in our case from a university-based internal medicine program to a community program) over time.
Although both programs used similar assessment items, response options, and entrustment scales, there were significant differences in entrustment scores between programs for nearly every month (Figure 1), with GSH being higher in most instances. This finding could be secondary to differences in assessment culture, rater training, rater incentives, intensity of supervision (e.g., less direct supervision could lead to the need to trust more), true performance, rotation sites, differences in residents (9% versus 88% international medical graduates), or other unknown factors. Further research should be added to the existing body of literature in this area38–40 and attempt to delineate the effect of these contextual differences on assessment ratings using the OPA system.
Similarities between programs included the finding that entrustment ratings varied by resident and by PGY within each program (Figure 2 and Table 2, respectively). Despite the differences in absolute entrustment levels between programs, variation between residents within programs could suggest that raters discriminated among the performances of individual learners. In addition, the rate at which entrustment ratings rose was similar between programs (Figure 1). The reasons for this are unclear, but it could mean that no matter the starting point, trust rises at a similar rate when combining many learners in large datasets (even though there may be considerable variation between learners, as seen in Figure 2). Comparable findings have been seen in nationally aggregated and reported milestones data.7,41 A similar rate of rise in entrustment could also be secondary to inherent properties of the OPA system and/or entrustment scales, irrespective of resident performance. For example, raters at both programs could be swayed by resident year of training to determine entrustment, leading to similar increases in trust based on time intervals. Graduate medical education (GME) has been on a time-based paradigm for so long that assessors may rely heavily on presumptive trust due to time in training.30 Perhaps, if GME moves away from time-based training, we might see more interprogram variation in the entrustment slope as we rely less on time-based anchors. On the other hand, if, irrespective of absolute rating values, a similar rate of rise is a true indicator of progression, residents who fall off a given growth curve may be identified as at risk early in training (much as pediatricians monitor growth curves in children),42 and programs could use this information to improve resident performance before it is too late.
Further research should study the rate of rise in entrustment scores across programs and consider entrustment rate (in addition to absolute entrustment values) as a measure of competence by the Clinical Competency Committee.
Effect sizes of the differences between month-to-month ratings and aggregate subcompetency and competency ratings (Figure 1, Table 3) were in the moderate range. Milestone assessments as reported to the ACGME are on a 9-point scale.43 A one-point difference in reporting can mean the difference between being deemed ready for unsupervised practice or not ready. Whether these differences are significant with regard to patient care outcomes or other important metrics should be studied as next steps in this work.
GSH ratings were higher than UC ratings for all but 3 months of the residency cycle (11th, 15th, and 36th). It is not clear why this would be (i.e., why only these 3 months, or why not higher for all months). Interestingly, for UC, each of these months is just before a major transition. The 11th month precedes the transition to the PGY-2 year, the 15th month precedes the transition to an ambulatory long block,44 and the 36th month precedes graduation. In addition, the flattening out of the GSH curve at the end of residency could have been secondary to raters considering impending graduation. It is possible that raters at both UC and GSH were relying heavily on time of year to determine their entrustment rating, and time-based decisions are a possible explanation for why the slope of entrustment gained was almost identical between the 2 programs. The impact of such differences on the confidence we might have in all of the ratings observed (and whether the results are determined by an adherence to lock-step time-based expectations of progression or some other factor rather than true changes in competence/entrustability) is significant and should be studied for the OPA and other entrustment-based WBA tools. That being said, as shown in both Figure 2 and Table 2, the differences in individual progression over time and distribution of entrustment scores at each PGY level suggest that time anchoring is not the only guidepost raters are using to make entrustment decisions. Techniques such as generalizability studies can help determine the components of variance for a given rating, and strategies like qualitative analysis can determine the factors raters use to make assessment decisions.
Isolating a single year of assessments within each program produced virtually the same outcomes as aggregating all years. In addition, comparing single years between programs produced results similar to those obtained by aggregating all years. This might suggest that the OPA system is stable with respect to performance within residency programs, and that the differences between programs are also stable. These findings would need to be reproduced in other programs, but if true, the ACGME and others could measure and understand these stable differences between programs, and either adjust for them or consider interventions to reduce them and monitor the effect.
Table 3 shows that the relative ranking of count and rating of each subcompetency and competency were largely similar even after years of deployment in both programs. This is likely due to the inherent features of the OPA system, underscoring the implication that any given system can only “see what it can see.” By being able to measure output in this way both within and between programs, directors can change curricular experiences to fill gaps and achieve different outcomes. However, despite the overwhelming similarity in rankings, there were several stark differences. For example, GSH assessed SBP-2 only 6 times for all residents during the study period, underscoring the point that users of similar systems can sometimes deploy these systems in different ways. Presumably the ACGME received SBP-2 reports on residents from both GSH and UC, although these ratings were based on significantly different amounts of data.
Our study has several limitations. First, our system is specific to internal medicine and residency programs the size and composition of UC and GSH, and may not be applicable to other types of training programs. Second, we used a residency management system that not all programs have access to, and while OPA development and mapping does not require MedHub, the results may not be applicable for other programs using different platforms. Third, although our data sample included more than one program, it is possible that different results could occur if the system were used more widely. Fourth, we could not determine the reasons for the differences and similarities in the interinstitutional data. Fifth, we did not control for the changes made in each setting (for example, how much GSH changed forms and OPAs). Sixth, we did not include or analyze narrative data that accompanied entrustment ratings to determine if the ratings were justified. Each of these issues deserves further investigation.
We found that transferring an entrustment-based WBA system from one setting (university-based internal medicine residency) into another (community-based internal medicine residency) resulted in important interprogram differences. One might argue that these findings are obvious, in the way that water samples measured from a stream, lake, or ocean would be expected to be different even if the same tool was used. On the other hand, it might be worrisome if an instrument that measured these different water sources produced only uniform results. In any case, once the “water” is collected and inscribed by the ACGME, it all looks the same. Early reports have aggregated this information as a single whole7 despite the likely enormous differences in the way programs collect the data, and important assumptions are being made and fed back to the education world. In addition, variability based on context may have important implications if or when programs attempt to move from time-based to competency-based training systems. It is likely that well-established assessment systems such as the OPA system will be shared between programs in the future. Further research is needed to understand the contributions of WBA tool, context, and other factors on the data these systems produce.
1. Nasca TJ, Philibert I, Brigham T, Flynn TC. The next GME accreditation system—Rationale and benefits. N Engl J Med. 2012;366:1051–1056.
2. Philibert I, Brigham T, Edgar L, Swing S. Organization of the educational milestones for use in the assessment of educational outcomes. J Grad Med Educ. 2014;6:177–182.
3. Holmboe ES. The journey to competency-based medical education: Implementing milestones. Marshall J Med. 2017;3:3–9.
4. Nabors C, Peterson SJ, Forman L, et al. Operationalizing the internal medicine milestones—An early status report. J Grad Med Educ. 2013;5:130–137.
5. Li ST, Tancredi DJ, Schwartz A, et al.; Association of Pediatric Program Directors (APPD) Longitudinal Educational Assessment Research Network (LEARN) Validity of Resident Self-Assessment Group. Competent for unsupervised practice: Use of pediatric residency training milestones to assess readiness. Acad Med. 2017;92:385–393.
6. Beeson MS, Holmboe ES, Korte RC, et al. Initial validity analysis of the emergency medicine milestones. Acad Emerg Med. 2015;22:838–844.
7. Hauer KE, Clauser J, Lipner RS, et al. The internal medicine reporting milestones: Cross-sectional description of initial implementation in U.S. residency programs. Ann Intern Med. 2016;165:356–362.
8. Hauer KE, Vandergrift J, Hess B, et al. Correlations between ratings on the resident annual evaluation summary and the internal medicine milestones and association with ABIM certification examination scores among US internal medicine residents, 2013-2014. JAMA. 2016;316:2253–2262.
9. van der Vleuten CPM, Schuwirth LWT, Driessen EW, Govaerts MJB, Heeneman S. Twelve tips for programmatic assessment. Med Teach. 2015;37:641–646.
10. Choe JH, Knight CL, Stiling R, Corning K, Lock K, Steinberg KP. Shortening the miles to the milestones: Connecting EPA-based evaluations to ACGME milestone reports for internal medicine residency programs. Acad Med. 2016;91:943–950.
11. Yuan CM, Prince LK, Oliver JD 3rd, Abbott KC, Nee R. Implementation of nephrology subspecialty curricular milestones. Am J Kidney Dis. 2015;66:15–22.
12. Gardner AK, Scott DJ, Choti MA, Mansour JC. Developing a comprehensive resident education evaluation system in the era of milestone assessment. J Surg Educ. 2015;72:618–624.
13. Shumway NM, Dacus JJ, Lathrop KI, Hernandez EP, Miller M, Karnad AB. Use of milestones and development of entrustable professional activities in 2 hematology/oncology training programs. J Grad Med Educ. 2015;7:101–104.
14. Jones LK Jr, Dimberg EL, Boes CJ, et al. Milestone-compatible neurology resident assessments: A role for observable practice activities. Neurology. 2015;84:2279–2283.
15. Mallow M, Baer H, Moroz A, Nguyen VQC. Entrustable professional activities for residency training in physical medicine and rehabilitation. Am J Phys Med Rehabil. 2017;96:762–764.
16. Boateng BA, Bass LD, Blaszak RT, Farrar HC. The development of a competency-based assessment rubric to measure resident milestones. J Grad Med Educ. 2009;1:45–48.
17. Goldman RH, Tuomala RE, Bengtson JM, Stagg AR. How effective are new milestones assessments at demonstrating resident growth? 1 year of data. J Surg Educ. 2017;74:68–73.
18. Hart D, Bond W, Siegelman JN, et al. Simulation for assessment of milestones in emergency medicine residents. Acad Emerg Med. 2018;25:205–220.
19. Hartranft TH, Yandle K, Graham T, Holden C, Chambers LW. Evaluating surgical residents quickly and easily against the milestones using electronic formative feedback. J Surg Educ. 2017;74:237–242.
20. Tanaka P, Bereknyei Merrell S, Walker K, et al. Implementation of a needs-based, online feedback tool for anesthesia residents with subsequent mapping of the feedback to the ACGME milestones. Anesth Analg. 2017;124:627–635.
21. Warm EJ, Mathis BR, Held JD, et al. Entrustment and mapping of observable practice activities for resident assessment. J Gen Intern Med. 2014;29:1177–1182.
22. Warm EJ, Held JD, Hellmann M, et al. Entrusting observable practice activities and milestones over the 36 months of an internal medicine residency. Acad Med. 2016;91:1398–1405.
23. Williams F, Wininger D. The Toolkit Series: A Textbook for Internal Medicine Education Programs. Alexandria, VA: Alliance for Academic Internal Medicine; 2017.
24. Alliance for Academic Internal Medicine. UME/GME Program Resources. Curated milestone evaluation exhibit. https://www.im.org/resources/ume-gme-program-resources/evaluation-exhibit. Accessed August 6, 2019.
25. Rekman J, Gofton W, Dudek N, Gofton T, Hamstra SJ. Entrustability scales: Outlining their usefulness for competency-based clinical assessment. Acad Med. 2016;91:186–190.
26. Carline JD, Wenrich M, Ramsey PG. Characteristics of ratings of physician competence by professional associates. Eval Health Prof. 1989;12:409–423.
27. van der Vleuten C, Verhoeven B. In-training assessment developments in postgraduate education in Europe. ANZ J Surg. 2013;83:454–459.
28. Kreiter CD, Ferguson K, Lee WC, Brennan RL, Densen P. A generalizability study of a new standardized rating form used to evaluate students’ clinical clerkship performances. Acad Med. 1998;73:1294–1298.
29. Crossley J, Johnson G, Booth J, Wade W. Good questions, good answers: Construct alignment improves the performance of workplace-based assessment scales. Med Educ. 2011;45:560–569.
30. ten Cate O, Hart D, Ankel F, et al.; International Competency-Based Medical Education Collaborators. Entrustment decision making in clinical training. Acad Med. 2016;91:191–198.
31. Ginsburg S, McIlroy J, Oulanova O, Eva K, Regehr G. Toward authentic clinical evaluation: Pitfalls in the pursuit of competency. Acad Med. 2010;85:780–786.
32. Hauer KE, ten Cate O, Boscardin C, Irby DM, Iobst W, O’Sullivan PS. Understanding trust as an essential element of trainee supervision and learning in the workplace. Adv Health Sci Educ Theory Pract. 2014;19:435–456.
33. University of Cincinnati Internal Medicine Program. Resident Education Curriculum 2017–2018. http://med.uc.edu/docs/default-source/default-document-library/curriculum-2017-18-final.pdf?sfvrsn=0. Accessed August 9, 2019.
34. Warm E. Observable practice activity system. https://www.youtube.com/watch?v=t5JVRZrXBbU. Published 2014. Accessed August 9, 2019.
35. Warm E, Kelleher M, Kinnear B, Sall D. Feedback on feedback as a faculty development tool. J Grad Med Educ. 2018;10:354–355.
36. Luke RG, Wones RG, Galla JH, Rouan GW, Tsevat J, Dorfmeister JW. Development and implementation of a teaching practice plan in a department of medicine (1995-1998): Relative teaching units (RTU’s). Trans Am Clin Climatol Assoc. 1999;110:214–226.
37. MedHub. Healthcare Education Management. http://medhub.com. Accessed August 9, 2019.
38. Kogan JR, Hess BJ, Conforti LN, Holmboe ES. What drives faculty ratings of residents’ clinical skills? The impact of faculty’s own clinical skills. Acad Med. 2010;85(10 suppl):S25–S28.
39. Kogan JR, Conforti L, Bernabeo E, Iobst W, Holmboe E. Opening the black box of clinical skills assessment via observation: A conceptual model. Med Educ. 2011;45:1048–1060.
40. Kogan JR, Conforti LN, Iobst WF, Holmboe ES. Reconceptualizing variable rater assessments as both an educational and clinical care problem. Acad Med. 2014;89:721–727.
41. Hauer KE, Vandergrift J, Lipner RS, Holmboe ES, Hood S, McDonald FS. National internal medicine milestone ratings: Validity evidence from longitudinal three-year follow-up. Acad Med. 2018;93:1189–1204.
42. Céndan JC, Oloruntomi J, Soborowicz MB, Marchand L, Selim BR. Using assessment point accumulation as a guide to identify students at risk for interrupted academic progress. Acad Med. 2018;93:1163–1167.
43. Accreditation Council for Graduate Medical Education; American Board of Internal Medicine. The Internal Medicine Milestone Project. https://acgme.org/acgmeweb/Portals/0/PDFs/Milestones/InternalMedicineMilestones.pdf. Published 2015. Accessed August 9, 2019.
44. Warm EJ, Schauer DP, Diers T, et al. The ambulatory long-block: An Accreditation Council for Graduate Medical Education (ACGME) Educational Innovations Project (EIP). J Gen Intern Med. 2008;23:921–926.