Evidence of residents’ progression through training programs and ability to practice unsupervised is necessary to ensure that graduating medical residents provide optimal patient care as they transition from trainee to practicing physician. With the goal of enhancing “educational experiences and assessment of residents and fellows,”1 the Accreditation Council for Graduate Medical Education (ACGME) implemented milestones, first in 7 specialties in 2013 and then in all remaining specialties and subspecialties in 2015.2 Milestones are competency-specific behavioral descriptions of learner performance along a developmental continuum. Semiannually, programs are required to report milestone ratings on each of their residents to the ACGME. Many programs also use the milestone descriptions in dialogues between faculty and trainees, providing specific, individualized verbal feedback on areas for improvement.
The ACGME has worked diligently to clarify the role of milestones as an evaluation tool for residents’ progress and “to facilitate the improvement of programs and guide more effective professional development.”1 They have explicitly stated that milestones are not intended to be used to make high stakes decisions such as licensure as this could lead to the unintended consequence of inflation of reported milestone scores.1 These statements are important, because the existence of a common learner metric like milestones also creates a temptation to use them as competency-based assessment outcomes in research studies comparing resident performance across programs, and it is unclear whether this is appropriate.
To provide evidence for evaluation of workplace-based competencies such as interpersonal and communication skills, professionalism, patient care, practice-based learning and improvement, and systems-based practice, the Pediatrics Milestones Assessment Collaborative (PMAC) was formed in 2014 by the National Board of Medical Examiners, the American Board of Pediatrics, and the Association of Pediatric Program Directors (APPD). PMAC developed several sets of workplace-based assessment tools to measure and provide feedback on trainees’ readiness to (1) serve as an intern in the inpatient setting with a supervisor present (implemented during “Module D1” of PMAC), (2) care for patients in the inpatient setting with a supervisor nearby (“Module D2”), and (3) serve as a supervisor of a clinical team within a training program (“Module D3”).3 PMAC aggregates ratings of residents’ observable behaviors by faculty, nurses, senior residents, and others using items developed by content experts and mapped to pediatrics competencies. Item ratings are further aggregated into a score for each measured competency for the learner during the rotation.
The PMAC scores provide competency-based assessments that, unlike the milestones, were intended to be standard measures of learner performance for decision making. The purpose of the current study was to shed light on potential uses of both the PMAC and ACGME milestone scores by (1) determining the sources of variance in PMAC and ACGME milestone scores and (2) measuring association between PMAC and ACGME milestone scores for each competency.
A full description of PMAC assessment development and implementation is reported elsewhere.3 In brief, content experts developed workplace-based assessment items that mapped to behaviors described in ACGME milestones, assessment experts constructed instruments from those items, observers at residency programs collected data on residents with the instruments, and PMAC investigators aggregated responses into competency scores and narrative comments to be reported to program directors and residents. PMAC scores are continuous scores ranging from 1 to 5 but are not on the same scale as the pediatrics milestones. For a list of the ACGME competencies used in the study, see Supplemental Digital Appendix 1 at https://links.lww.com/ACADMED/B4.
This study analyzed data collected in PMAC Modules D1 and D2. Table 1 describes the focus of each module and the participants in each module’s data collection; due to the time frames, no individuals would have participated as interns in both modules. All programs were unique with the exception of 1 program that participated in both modules. The programs did not differ significantly from one another in 2012–2014 certifying exam pass rates. Institutional review boards at each program and at the University of Illinois at Chicago approved the studies.
We focused on learners who received enough ratings during at least one PMAC-eligible (e.g., inpatient nonemergency) rotation to generate a report for the rotation that included PMAC competency scores. This typically required a minimum of 3 distinct observers to complete instruments, at least 2 of whom had to be attending faculty or senior residents. For this study, we examined the PMAC scores on reports for each learner generated at the learner’s most recent PMAC rotation before the program’s ACGME milestones reporting date, and the corresponding year-end (Spring) milestones reported by the learner’s Clinical Competency Committee (CCC) to the ACGME. CCCs received copies of all PMAC reports in advance of their deliberations.
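The report-eligibility rule described above can be sketched as a small check (the function name and role labels are hypothetical; only the thresholds come from the text):

```python
def report_eligible(observer_roles, min_observers=3, min_senior=2):
    """Return True if a rotation has enough completed instruments to
    generate a PMAC report: at least min_observers distinct observers,
    of whom at least min_senior are attending faculty or senior residents.
    Thresholds follow the text; the implementation is illustrative."""
    senior = {"attending", "senior_resident"}
    n_senior = sum(role in senior for role in observer_roles)
    return len(observer_roles) >= min_observers and n_senior >= min_senior

print(report_eligible(["attending", "senior_resident", "nurse"]))  # True
print(report_eligible(["attending", "nurse", "nurse"]))            # False
```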
To help understand any preexisting differences in intern skill (i.e., program selection effects), the National Board of Medical Examiners compared the mean scores on the 3 components of the USMLE Step 2 Clinical Skills (CS) examination of the participating residents by program using one-way analysis of variance and post hoc comparisons with Tukey’s honestly significant difference (HSD).
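As a sketch of this comparison, the following uses SciPy and statsmodels on simulated component scores (all values are illustrative, not the study's data):

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Hypothetical Step 2 CS component scores for interns at 3 programs.
scores = {
    "A": rng.normal(75, 5, 15),
    "B": rng.normal(78, 5, 15),
    "C": rng.normal(74, 5, 15),
}

# One-way ANOVA: does the mean score differ by program?
F, p = f_oneway(*scores.values())
print(f"F = {F:.2f}, p = {p:.3f}")

# Post hoc pairwise comparisons with Tukey's HSD.
values = np.concatenate(list(scores.values()))
groups = np.repeat(list(scores.keys()), [len(v) for v in scores.values()])
print(pairwise_tukeyhsd(values, groups))
```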
To determine the sources of variance in ACGME milestone and PMAC score ratings, we fitted 4 linear intercept-only random effects models: one to the reported ACGME milestone levels and one to the PMAC scores for each of the 2 modules. In each model, we included random effects for program, competency, learner, and program × competency sources of variance. We computed the relative variance associated with each random effect and the residual variance.
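Once the variance components are estimated (in the study, with random effects models in R), the relative variance is simply each component's share of the total. A minimal sketch, with hypothetical component values chosen only to mirror the pattern of results:

```python
# Hypothetical variance-component estimates (not the study's values), as a
# random-effects model of the form
#   score ~ 1 + (1|program) + (1|competency) + (1|learner) + (1|program:competency)
# would return. Relative variance is each component over the total.
components = {
    "program": 0.40,
    "competency": 0.05,
    "learner": 0.16,
    "program x competency": 0.08,
    "residual": 0.05,
}
total = sum(components.values())
relative = {source: value / total for source, value in components.items()}
for source, share in relative.items():
    print(f"{source:>22}: {share:.0%}")
```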
Because reported milestones can take on only 9 discrete values (full and half levels between 1 and 5) and PMAC scores are continuous, as a sensitivity analysis we also collapsed PMAC scores into the same number of equal-sized intervals, assigned ordinal values to each interval, and fitted fully Bayesian ordinal random effects models with noninformative priors to the ordinal data. For Module D1, all 9 milestone levels were reported, and we divided PMAC scores into 9 equal-sized, ordered categories for the analysis; for Module D2, only 7 milestone levels were reported (1.5–4.5), and we divided PMAC scores into 7 equal-sized, ordered categories. As the pattern of findings was the same, we report only the results of the linear models.
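The collapsing step can be sketched with pandas (scores are simulated; "equal-sized" is read here as equal-width intervals on the 1-to-5 scale, one per milestone level, whereas pd.qcut would instead give equal-count bins):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical continuous PMAC scores on the 1-5 scale (illustrative only).
pmac = pd.Series(rng.uniform(1.2, 4.8, 200))

# Nine equal-width intervals spanning 1-5, one per discrete milestone
# level, each assigned an ordinal value 1-9.
edges = np.linspace(1, 5, 10)  # 9 intervals require 10 edges
ordinal = pd.cut(pmac, bins=edges, labels=list(range(1, 10)), include_lowest=True)
print(ordinal.value_counts().sort_index())
```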
To examine the relationship between PMAC scores and ACGME milestones within programs and measure association, we fitted separate linear mixed effects models for each competency. In each model, we regressed ACGME milestone levels on corresponding PMAC score and date of the PMAC rotation, with a random effect of program to adjust for program-level variance. Data analyses were conducted using R 3.6 (R Core Team, Vienna, Austria) and the lme4 package.4 We hypothesized that PMAC scores would be significantly associated with milestone levels (a main effect of corresponding PMAC score).
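A sketch of one such per-competency model, using simulated data and Python's statsmodels in place of R's lme4 (the rotation-date covariate is omitted for brevity; all values are illustrative):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Simulated data: 10 programs x 30 learners, with a program-level shift in
# reported milestones and a true PMAC slope of 0.5.
programs = np.repeat(np.arange(10), 30)
program_effect = rng.normal(0, 0.3, 10)[programs]
pmac = rng.uniform(1, 5, 300)
milestone = 1.5 + 0.5 * pmac + program_effect + rng.normal(0, 0.2, 300)
df = pd.DataFrame({"milestone": milestone, "pmac": pmac, "program": programs})

# Milestone level regressed on PMAC score with a random intercept for
# program, analogous to the per-competency models in the study.
fit = smf.mixedlm("milestone ~ pmac", df, groups="program").fit()
print(fit.summary())
```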
Relative variance by source
Figures 1 and 2 display the relative variance by source for each score type (ACGME reported milestone and PMAC score) for data from PMAC Modules D1 and D2, respectively. In each set of competencies, the ACGME milestone model explained a large proportion of the overall score variance, and in each case, most of the explained variance was associated with program (or program × competency) sources, not learners. In contrast, although the residual variance was much larger in the models predicting PMAC scores, learner variance was substantially larger than program variance in PMAC scores. In Module D1, program-related milestone variance was substantial (54%), both in comparison to learner milestone variance (22%) and to program variance in the PMAC scores (12%). On the other hand, learner variance represented 44% of variance in PMAC scores. In Module D2, program-related milestone variance was 68% and learner variance was 14%, compared with PMAC learner variance of 26% and program variance of 10%.
Association between ACGME milestones and PMAC scores within programs
Within programs, PMAC scores were significantly positively correlated with milestones for most nonprofessionalism competencies. The date of the rotation from which the PMAC score was obtained was never associated with milestone level. Table 2 presents the unstandardized regression coefficients from regressing each ACGME competency’s milestone level on the corresponding PMAC score, adjusting for program and rotation date. Figure 3 illustrates the relationship between mean PMAC score and mean reported milestone for the 3 competencies measured in both modules, for 3 specific programs selected to illustrate the varying ways that programs report milestones; other competencies show the same patterns. Each line represents a program. Some programs report high milestone scores, whereas others report medium or low scores. Although all programs use a wide range of PMAC scores (x axis) to describe learners, they use a much narrower range of milestone scores (y axis), so most lines are relatively flat or show a mild positive association.
Preexisting differences in interns by program
Analysis of variance of Step 2 CS score components for interns in Module D1 found an overall difference by program in the Integrated Clinical Encounter component (F[7,106] = 2.2, P = .042), but no program differed from another in post hoc comparisons. There were no differences by program in the Communication and Interpersonal Skills or Spoken English Proficiency components. Mean milestone scores were not correlated with mean Integrated Clinical Encounter scores at the program level (r = .05, P = .91).
For Module D2, there was an overall difference by program in the Communication and Interpersonal Skills component (F[6,54] = 3.0, P = .014), with one pairwise comparison between programs significantly different (P = .03 by Tukey’s HSD). There were no differences by program in the Integrated Clinical Encounter or Spoken English Proficiency components. Mean milestone scores were not correlated with mean Communication and Interpersonal Skills scores at the program level (r = .18, P = .70), and repeating our analyses while leaving out 1 of the significantly different programs did not substantively change the proportions of relative variance.
We found substantial program-related variance in milestone ratings, in contrast with primarily learner-related variance in scores derived from direct observation in the clinical workplace. To the degree that milestones are intended to measure individual learners and be comparable across programs, our data suggest some cause for concern. The program variance in milestones could result from at least 3 nonexclusive sources.
First, some programs may select for or match residents who begin training with greater competence than those in other programs. Although it is difficult to determine whether programs differ in the competence of their incoming residents, particularly given the sample sizes, programs did not differ in their certifying exam pass rates and, with only small exceptions, did not differ in the clinical competence of their entering students as measured by the USMLE Step 2 CS score components; our findings were robust to the small exceptions. Moreover, the low program-related variance in the PMAC workplace-based assessment scores may suggest that learners in different programs show a similar distribution of performance in the eyes of frontline observers. As these PMAC scores were available to the CCCs at the time they assigned milestones, we think it unlikely that milestone variance is solely caused by intended or unobserved differences in competence based on selection.
Second, some programs may place new interns in more challenging rotations than others. There are no standardized measures for comparing the difficulty of inpatient rotations across programs. The PMAC D1 module did include a question for observers about relative rotation workload, but this measure by design was only comparable across rotations within programs, not across them. Again, however, we might expect such differences to be reflected in program-related variance in the PMAC workplace–based assessment scores to a similar degree, and they were not.
Third, CCCs may differ in their assignment and reporting of milestones in ways that are idiosyncratic and construct irrelevant. Because CCCs had access to the PMAC reports, their milestone ratings could easily have reflected the same substantial learner variance. This suggests that although CCCs have the advantage of being able to incorporate and aggregate information about learners from multiple sources, they may not do so, or, in doing so, they may (intentionally or inadvertently) reflect program-specific norms instead. When CCCs can identify and consider such aggregate documented observations, they ought to do so, and proposed learner-specific deviations by the CCC from such aggregates ought to be scrutinized carefully.
This possibility is supported by prior studies in other specialties. Hauer et al5 found that program director tenure over 10 years was associated with higher postgraduate year 3 ratings in patient care, medical knowledge, systems-based practice, and professionalism competencies in internal medicine programs. Peabody et al6 showed that in family medicine “residents have no individual differences other than their year in residency”—that is, all learners received similar milestone levels at similar times in training. However, family medicine milestones have built-in dependencies on year in training that are not in the pediatrics milestones, and Peabody et al did not study interprogram variance. Some program directors acknowledge that milestones may not be sensitive to differences between learners even within a program. Sebesta et al7 found variation in training provided by urology programs to their CCCs and that 44% of surveyed urology residency program directors stated that, “Milestones assessments never or almost never accurately distinguished between residents.”
Within programs, milestone levels and PMAC workplace–based assessment scores were frequently (but not always) significantly positively correlated, as we had hypothesized, even in the presence of substantial variance among programs. Workplace-based assessment scores from direct observation thus may be useful to inform CCCs about milestone levels or other assessment or progress decisions. Because they are driven primarily by learner-related variance, they may also be more sensitive to learner differences across programs than milestones.
PMAC’s scores had larger residuals than ACGME milestone ratings. This may reflect unmeasured differences in rotations not incorporated in these analyses. Park et al8 recently showed that adjusting PMAC D1 scores for observer-reported workload (relative to a typical rotation at the program) could substantially decrease error of measurement, which supports this speculation.
Our findings should not be read as a repudiation of the intended uses of milestone scores, insofar as they are used as within-program measures (e.g., to identify struggling learners), to provide a developmental road map for learners and the residency curriculum,9 or as a source of feedback.10 However, because the same competencies and anchors are used for reporting in each program in a given specialty, medical education researchers may be tempted to treat milestone scores like standardized test scores and compare them across programs. We repudiate this unfortunately common practice: if such an analysis is attempted at all, it should be approached with caution and with suitable adjustment for program variance, which may be construct irrelevant.
A limitation of the present study is that we did not include all learners from PMAC Modules D1 and D2; we used only the subset of learners for whom reported ACGME milestones data were provided. For example, some of the learners in Module D1 were subinterns and as such did not have reported milestones. The programs included in the study volunteered to participate in PMAC and may not be representative of all pediatric residency programs, although we believe these programs are likely attuned to the importance of valid assessment. Furthermore, the study was conducted using the current pediatrics milestones and not the ACGME milestones 2.0 that have been under development since 2018 and which are intended to be less specialty specific for competencies in domains other than medical knowledge and patient care.
Reported milestones may not be sensitive to differences between individual learners and may more directly reflect differences in programs. The ACGME has worked diligently to underscore that milestones are not meant to distinguish between learners but rather to improve programs and guide professional development. They have explicitly stated that using milestones to make high stakes decisions about licensure and residency program graduation is problematic.1 Workplace-based assessments can provide scores with little program-specific variance that are more sensitive to differences in learners compared with ACGME reported milestones. Researchers, program directors, and other stakeholders should exercise caution when using milestones to compare learners across programs. This also may have implications for regulatory bodies that license and certify individuals based on program director reported competence which may be largely based on reported milestones.
The authors gratefully acknowledge the participation of the PMAC Module D1 and D2 Study Groups. Module D1: Nick Potisek, MD, and Allison McBride, MD (Wake Forest University Medical School); Kathleen Donnelly, MD, and Meredith Carter, MD (Inova Fairfax Medical Campus/Inova Children’s Hospital); Teri Turner, MD, MPH, MEd (Baylor College of Medicine—Houston); Renuka Verma, MD (Unterberg Children’s Hospital at Monmouth Medical Center); Su-Ting Li, MD, MPH (UC Davis Health System); Amanda Osta, MD (University of Illinois College of Medicine at Chicago); Hilary Haftel, MD, MHPE (University of Michigan Medical Center); Lynn Thoreson, DO (University of Texas at Austin Dell Medical School); Linda Waggoner-Fountain, MD, and Mark Mendelsohn, MD (University of Virginia Health System); Ann Burke, MD (Wright State University); Brian Clauser, EdD, and Tom Rebbecchi, MD (National Board of Medical Examiners). Module D2: Beatrice Boateng, PhD (University of Arkansas for Medical Sciences); Ann Burke, MD (Wright State University); Su-Ting Li, MD, MPH (UC Davis Health System); Julia Shelburne, MD (University of Texas Health Science Center at Houston); Teri Turner, MD, MPH, MEd (Baylor College of Medicine—Houston); Dorene Balmer, PhD, Jeanine Ronan, MD, MS, Rebecca Tenney-Soeiro, MD, MSEd, and Anna Weiss, MD, MSEd (The Children’s Hospital Of Philadelphia); Vasu Bhavaraju, MD (Phoenix Children’s Hospital/Maricopa Medical Center); Kim Boland, MD, FAAP, and Sara Multerer, MD (University of Louisville); Alan Chin, MD (UCLA Health); Sophia Goslings (USA Health System); Hilary Haftel, MD, MHPE (University of Michigan Medical Center); Nicola Orlov, MD (University of Chicago); Amanda Osta, MD (University of Illinois College of Medicine at Chicago); Sahar Rooholamini, MD, MPH (Seattle Children’s Hospital); Rebecca Wallihan, MD (Nationwide Children’s Hospital).
1. Accreditation Council for Graduate Medical Education. Use of individual milestones data by external entities for high stakes decisions—A function for which they are not designed or intended. https://www.acgme.org/Portals/0/PDFs/Milestones/UseofIndividualMilestonesDatabyExternalEntitiesforHighStakesDecisions.pdf?ver=2018-04-12-110745-440. Published 2018. Accessed December 3, 2019.
2. Holmboe E, Edgar L, Hamstra S. The milestones guidebook. http://www.acgme.org/Portals/0/MilestonesGuidebook.pdf. Published 2016. Accessed August 21, 2016.
3. Hicks PJ, Margolis MJ, Carraccio CL, et al. A novel workplace-based assessment for competency-based decisions and learner feedback. Med Teach. 2018; 40:1143–1150
4. Bates D, Mächler M, Bolker B, Walker S. Fitting Linear Mixed-Effects Models Using lme4. J Stat Softw. 2015; 67:1–48
5. Hauer KE, Vandergrift J, Lipner RS, Holmboe ES, Hood S, McDonald FS. National internal medicine milestone ratings: Validity evidence from longitudinal three-year follow-up. Acad Med. 2018; 93:1189–1204
6. Peabody MR, O’Neill TR, Peterson LE. Examining the functioning and reliability of the family medicine milestones. J Grad Med Educ. 2017; 9:46–53
7. Sebesta EM, Cooper KL, Badalato GM. Program director perceptions of usefulness of the Accreditation Council for Graduate Medical Education milestones system for urology resident evaluation. Urology. 2019; 124:28–32
8. Park YS, Hicks PJ, Carraccio C, Margolis M, Schwartz A; PMAC Module 2 Study Group. Does incorporating a measure of clinical workload improve workplace-based assessment scores? Insights for measurement precision and longitudinal score growth from ten pediatrics residency programs. Acad Med. 2018; 93(suppl 11):S21–S29
9. Schumacher DJ, Lewis KO, Burke AE, et al. The pediatrics milestones: Initial evidence for their use as learning road maps for residents. Acad Pediatr. 2013; 13:40–47
10. Tocco D, Jain AV, Baines H. The pediatrics milestone pilot project: Perspectives of current pediatric residents. Acad Pediatr. 2014; 14(2 suppl):S8–S9