In 1998, the Accreditation Council for Graduate Medical Education (ACGME) began restructuring its accreditation system, focusing on educational outcomes of residency training to ensure quality care following training.1 Measuring training outcomes remained a challenge, given the paucity of valid tools and the absence of a standard assessment approach.2,3 In 2013, the ACGME instituted a new assessment framework based on Educational Milestones: “developmentally based, specialty specific achievements that residents are able to demonstrate as they progress through training.”4 Members of each specialty’s Milestone Working Group created outcomes-based milestones for the specialty.
The Pediatrics Milestone Project, developed by a working group commissioned by the ACGME and the American Board of Pediatrics, outlined 48 subcompetencies within the six ACGME-defined competency domains (Patient Care, Medical Knowledge, Professionalism, Systems-Based Practice, Interpersonal and Communication Skills, Practice-Based Learning and Improvement) and a seventh, pediatrics-specific domain, Personal and Professional Development. Each subcompetency delineates four or five milestones (levels of development) that span the continuum of medical education from medical school to continuing professional development.3,5,6 Despite the publication of several milestone-like assessment tools,7,8 few data have been collected regarding the reliability, validity, and feasibility of milestones for assessing competence and other educational outcomes.2 A qualitative study of learners’ perceptions found initial evidence for the pediatric milestones (PMs) as learner-centered tools for formative feedback and assessment.9 However, data are lacking on whether the PMs differentiate between the performance of medical students and that of postgraduate learners in the pediatric clinical setting. This study adduces evidence (from internal structure and relationships with other variables) for the validity of milestone classification ratings of pediatric interns (first-year residents) and fourth-year medical student subinterns on nine subcompetencies intended to inform the decision about whether a learner is ready to serve as a pediatric intern in the inpatient setting.
The Association of Pediatric Program Directors Longitudinal Educational Assessment Research Network (APPD LEARN) and the National Board of Medical Examiners conducted a pilot study of the utility and feasibility of the PMs as tools for assessing resident performance. The study design and evaluation are reported more fully elsewhere.10,11 Our subinquiry uses a causal-comparative design to assess differences in clinical performance between two levels of learners: interns and subinterns.
Overview of process
Interns and subinterns (assessment participants) were observed by raters during a four-week general pediatric inpatient ward rotation. Data on behaviors during the rotation were collected electronically using three assessment instruments: a multisource feedback (MSF) tool11 and two structured clinical observation (SCO) tools—one for rounds and one for taking a patient history. Data were aggregated by subcompetency into a single spreadsheet, called a “dashboard” (Figure 1). Within one week of the end of the rotation, a faculty member or chief resident at each site used these data to inform judgments about how far the learner had progressed along the continuum of the milestones for each of the nine subcompetencies chosen for this study (see Appendix 1) and completed a milestone classification form (MCF). The faculty member then held a face-to-face formative feedback session with the learner.
Subcompetencies chosen for assessment
The subcompetencies chosen for this study were part of the Pediatrics Milestone Project.12 A survey was sent to all ACGME-accredited training sites asking which pediatric subcompetencies would provide evidence for the decision of readiness to serve in the inpatient setting. The survey respondents (98 program directors and 57 associate program directors) represented 117 of the 198 ACGME-accredited U.S. pediatrics residency programs. Two of the investigators (P.H., A.S.) chose from this survey the nine subcompetencies most frequently identified as informing faculty decisions about learner readiness to serve as interns in the pediatric inpatient setting. Appendix 1 lists these nine subcompetencies and their respective descriptive anchors by milestone level.
We collected data for intern and subintern trainees on pediatric inpatient rotations. Participation was voluntary. Up to two interns and two subinterns per rotation could participate in each study month. Data collection occurred for a six- to nine-month period at each site from June 29, 2012, to June 30, 2013. Eighteen pediatric residency programs from 15 U.S. states participated in the pilot. Subintern trainees were specifically included in the study to directly address the study question of readiness to start internship, rather than readiness to progress through different postgraduate levels; the PMs span the continuum from undergraduate medical education to continuing professional development.
The MSF instrument consisted of 22 items based on all 9 of the subcompetencies we chose to study and included milestone behavioral classification items. MSF surveys began with questions identifying the role of the assessor and the duration of time spent observing the learner. Response choices to the questions varied and included both frequency scales for behavioral items and agree/disagree scales for global items. There were nine free-text boxes to provide specific examples to support the responses and to provide commendation or suggestions for improvement. Example questions included the following: Based on your interactions with the learner over the course of the rotation, please select the frequency with which you observed the trainee follow through on responsibilities (rarely, less than 50% of the time to always, 100% of the time). Did you observe any lapse(s) in behavior meriting feedback (yes, no, unable to assess)?
Two types of SCOs were performed on each learner—a rounding SCO and a history and physical SCO. The rounding SCO had 9 assessment items, and the history and physical SCO had 10 items. Each item had 3 to 5 response choices describing the extent to which the learner demonstrated the behavior described during the observation. These forms also had free-text boxes to provide specific examples. An example item from the rounding SCO was: Based on your observations of the learner during rounds today, please indicate the extent to which the learner stated plan(s) based on assessment, with rationale. An example from the history and physical SCO was: Based on your observations of the learner obtaining a patient history today, please select the most appropriate response to complete the statement: Questioning of the patient/family was … (response choices ranged from not directed to highly directed and efficient).
At the conclusion of the rotation, data from the SCO and MSF forms were aggregated by computer software into a dashboard (Figure 1) for the faculty feedback provider. These data assisted the faculty feedback provider at the training program in completing the MCF, which assessed the learner on the nine pediatric subcompetencies chosen to demonstrate readiness to serve on an inpatient unit (Appendix 1). Although the MSF and SCO questions did not follow the numerical scale of the milestone anchors, they were mapped by subcompetency; this mapping, although not part of the study design, reduced the likelihood that the feedback provider could simply substitute MSF and SCO ratings for more integrative milestone judgments. The faculty rater selected the milestone or milestone transition point that corresponded with his or her overall judgment for each of the subcompetencies evaluated for each learner assessed. Raters used a 1–5 scale for each subcompetency, where 1 represents the earliest developmental level (medical student) and 5 the highest developmental level (aspirational, to be attained in practice). Raters could also choose points in between, which represented progress toward, but not yet attainment of, a specific level. For example, a rating of 1.5 corresponds to a developmental level between the first and second behavioral anchors.
Training for faculty raters was conducted by each individual site/program and varied by site in content and length as well as by assessment role. At the beginning of data collection, each participating site’s principal investigator or delegate completed an orientation to the study, which included training on the specific aspects of completing the assessment forms. Many of these individuals served as the resident milestone raters and feedback providers at their institutions. Programs that used more than one faculty milestone rater and feedback provider held faculty training sessions to enhance interrater reliability. Although the content of these sessions varied, all focused on how faculty milestone raters and feedback providers aggregated and scored the MCF. Faculty training for completion of the MCF also included guidance on the conceptual frameworks used in developing the PMs, such as how each subcompetency delineates four to five milestones spanning the continuum of medical education from medical school to continuing professional development. The goal of these sessions was to develop shared mental models around the subcompetencies being assessed and to instruct faculty in providing feedback to the learner on each of these subcompetencies.
Data were entered through a Web-based front end to a secure, password-protected account under the control of the staff coordinating the project. We analyzed data using Statistical Package for the Social Sciences (SPSS) statistical software, version 20 (IBM Corp., Armonk, New York), and R statistical software, version 3.1 (R Foundation for Statistical Computing, Vienna, Austria). We coded MCF ratings as numbers ranging from 1 to 5, with half-points (e.g., 1.5) possible, rendering a 9-point scale in all. Mixed linear regression models were fitted to the milestone classification ratings for each subcompetency, with the learner’s rank (intern or subintern) as a fixed effect and rater and program as random effects. We evaluated the main effect of the learner’s rank to test the hypothesis that interns would receive higher ratings. In a second mixed model, intended to characterize the relationships among the subcompetencies and to determine whether they were considered differently for learners of different ranks, we fitted ratings for all subcompetencies together, with the learner’s rank, subcompetency, and their interaction as fixed effects; the learner’s training age (months spent in current rank) and its interaction with rank as fixed effects; and rater as a random effect. The study was reviewed and found exempt by the University of Illinois at Chicago institutional review board (IRB), which oversees APPD LEARN’s data repository, as well as by the IRBs of each participating site.
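The coding of MCF ratings described above can be sketched in a few lines of Python. This is a hypothetical reconstruction for clarity (the function name is ours), not the study’s actual analysis code, which used SPSS and R.

```python
# Illustrative sketch of the MCF rating coding described above.
# The helper name is hypothetical; this is not the study's analysis code.

def code_mcf_rating(rating: float) -> int:
    """Map an MCF rating (1.0-5.0 in half-point steps) onto the 9-point
    integer scale used for analysis: 1.0 -> 1, 1.5 -> 2, ..., 5.0 -> 9."""
    if rating < 1.0 or rating > 5.0 or (rating * 2) != int(rating * 2):
        raise ValueError("rating must be 1.0-5.0 in half-point steps")
    return int(2 * rating) - 1

# Example: a learner rated between the 3rd and 4th anchors (3.5)
# falls at point 6 of the 9-point scale.
```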
MCF data were collected for a total of 211 trainees (32 subinterns and 179 pediatric interns) from the 17 sites; because of IRB delays, 1 site did not complete any MCF forms. Figure 2 displays the total number of participants by level and site. Seventy-six percent of the study sites used 1 (n = 7) or 2 (n = 6) faculty members as milestone raters/feedback providers. One site used 3, two used 4, and one used 16 different faculty members in this role.
The mean developmental level for interns ranged from 3.20 (ICS-4, Teamwork) to 3.72 (PROF-3, Humanism) on each subcompetency’s 5-point developmental scale. The range for subinterns was 2.89 (PC-2, Time management) to 3.61 (PROF-1, Professionalization). Figure 3 displays the mean intern and subintern ratings from the regressions on each of the nine subcompetency areas adjusting for rater and program random effects. The mean intern rating was always higher than the mean subintern rating, and significantly so for the subcompetencies Teamwork (ICS-4), Data gathering (PC-1), Time management (PC-2), Help-seeking (PPD-1), Coping with stress (PPD-2), and Professional conduct (PROF-2).
In the model that combined ratings across all domains for all levels of learners, mean milestone ratings were significantly higher for the Professionalism subcompetencies (3.59–3.72; P < .001 for each) than for Patient Care (2.89–3.24) and Personal and Professional Development (3.33–3.51); Personal and Professional Development subcompetencies were, in turn, rated significantly higher than those of Patient Care (P < .001 for each). On average, interns were rated 0.45 points higher than subinterns across all subcompetencies (P = .001), except in Professional conduct (PROF-2), for which interns were rated only 0.21 points higher (P = .03). Differences between subinterns and interns were significant (P < .05) in all subcompetencies studied, with the exception of Professionalization (PROF-1), Humanism (PROF-3), and Trustworthiness (PPD-5). There was no effect of training age (P = .12) or of the interaction between training age and learner rank (P = .24). Results did not differ substantially when the program that used 16 different raters was excluded. All ratings were highly intercorrelated (Cronbach alpha = 0.93).
To our knowledge, this study is one of the first to examine observable behavioral differences in the workplace using a developmental framework grounded in the ACGME competencies. Faculty assessors judged interns to be at a higher developmental level than subinterns in most subcompetencies associated with readiness to serve as a pediatric intern in the inpatient setting. Previously, Boulet et al13 demonstrated differences between medical students and interns on a simulation-based assessment of acute care situations. Other researchers have also shown differences between medical students and postgraduate trainees in communication skills14 and a gradual increase in global skills and competence over time.15
Our findings were consistent with this literature, demonstrating observable differences between subinterns and pediatric interns in all subcompetencies studied, with the exception of Professionalization (PROF-1), Humanism (PROF-3), and Trustworthiness (PPD-5). The developmental scale of the PMs is built on the construct of growth over the course of training.3 Thus, we were surprised by the lack of significant differences in these subcompetencies. In a prior study, Lee et al16 demonstrated higher scores for humanistic qualities among subinterns than among internal medicine residents and attending faculty members. Overall, Professionalism has been notoriously difficult to define and measure as a domain; opinions differ as to whether it is an inherent quality or one that can be learned and developed over time.17 It is also possible that growth in these Professionalism and Personal and Professional Development subcompetencies is too small to yield detectable differences between trainees only one level apart.
In addition, the narratives for the lower levels of Trustworthiness, as well as Professionalization and Humanism, contained language that may be viewed as “negative,” possibly deterring faculty assessors from assigning these levels. For example, levels 1 and 2 of Humanism contain words such as “detached” and “lack of sensitivity.” The Trustworthiness narratives contain phrases such as “may misrepresent data … leaving others uncertain as to the nature of the individual’s truthfulness …” and “demonstrates lapses in follow-up … despite awareness of the importance of these tasks.” Some faculty members may associate this undesirable language with the learner’s character rather than with his or her medical skills. In turn, faculty may fear strong negative emotional reactions from trainees who receive this feedback. By comparison, milestone descriptors from the lower levels of other domains (e.g., “gathers too little information” in the Level 1 narrative for PC-1) may not evoke the same emotions in the assessor or the trainee and thus may be more palatable choices. Overall, these findings raise the question of whether separate assessment tools using items that describe specific, observable behaviors, rather than ratings of “unprofessional” conduct, would allow observers to differentiate learners whose levels of professional competence vary.
Although mean scores on different subcompetencies varied (consistent with the idea that learners develop them at different times or that the scales differ), the high Cronbach alpha indicates that scores also tended to track one another within a given learner (suggesting that there are overall stronger and weaker learners). Because all of the measured subcompetencies are intended to reflect our key construct—readiness to serve as a pediatric intern in the inpatient setting—it is not surprising that the intercorrelations are high.
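For readers unfamiliar with the statistic, Cronbach alpha can be computed directly from the variance of each item (subcompetency) and the variance of learners’ total scores. The sketch below uses made-up ratings for illustration only; it does not reproduce the study data or its SPSS/R analyses.

```python
# Cronbach's alpha from a learners-by-subcompetencies matrix of ratings.
# The data below are invented for illustration; they are not study data.

def cronbach_alpha(ratings):
    """ratings: list of rows, one per learner, one column per subcompetency."""
    k = len(ratings[0])            # number of items (subcompetencies)
    cols = list(zip(*ratings))     # one tuple of scores per item
    item_vars = [pvar(c) for c in cols]
    totals = [sum(row) for row in ratings]
    return (k / (k - 1)) * (1 - sum(item_vars) / pvar(totals))

def pvar(xs):
    """Population variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Hypothetical ratings for four learners on three subcompetencies:
data = [[3.0, 3.5, 3.0],
        [2.5, 2.5, 3.0],
        [4.0, 4.5, 4.0],
        [3.5, 3.5, 3.5]]
alpha = cronbach_alpha(data)   # high alpha: ratings track one another within learners
```

A high alpha here reflects exactly the pattern discussed above: stronger learners score higher across the board, so item scores rise and fall together.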
This study had several limitations. All participants were volunteers; those who chose to participate may differ from those who did not. Although we did not gather data for subjects who declined to participate, anecdotal data from verbal reports by principal investigators at each site suggest the numbers were very low (fewer than five individuals at all sites combined). Participants knew they were part of a study and were being observed. Feedback from focus groups with study participants suggests that learners may not have known exactly when they were being observed, especially for behaviors listed on the MSF, which may minimize concern about this effect.
Faculty development varied widely among the institutions. Although each subcompetency area was accompanied by rich narrative descriptions of the behaviors associated with each level, shared mental models may not have existed across raters or institutions, a possibility we addressed with statistical controls. Additionally, individual learner “dashboard data” were not aggregated by raters’ roles. Therefore, faculty members assigning milestones did not know which observation originated from particular rater roles (e.g., nurses).
Other limitations of this study were the small subintern sample and the lack of assessor blinding to the learner’s level. Only half the sites enrolled and collected data for subinterns, and even at those sites, few participants meeting criteria (a subintern on a general pediatric inpatient service) were available. It is difficult, if not impossible, to blind workplace assessors to the learner’s level, as learners perform tasks specific to their role as intern or subintern. Because of the lack of blinding, workplace assessors may judge a participant against the level of performance expected for the role rather than simply on the behaviors observed, thereby inadvertently creating differences not directly observed. We do not yet know whether learners progress at the same rate between levels of the same subcompetency or whether the rate of progression is parallel across different subcompetencies. These factors may play a role in the differences (or lack thereof) noted between interns and subinterns.
Finally, there are limits to how readily this study may be adapted to other specialties. The PMs are unique in that they were created to provide a benchmark for learners across the continuum from undergraduate to graduate to continuing medical education. This may differ from other specialties’ milestones, which mainly emphasize levels within graduate medical education (e.g., milestone level 3 in pediatrics is often similar to milestone level 4 in other specialties).
Pediatric interns generally were assigned higher developmental levels than were subinterns in all subcompetency areas. The difference was statistically significant for all but three subcompetencies (Professionalization, Humanism, and Trustworthiness). For both interns and subinterns, mean ratings were higher for all three subcompetencies of the Professionalism domain than for the other domains associated with readiness to serve on a pediatric inpatient service. The milestone classification ratings had a coherent internal structure and could distinguish between differing levels of trainees. These findings provide evidence supporting the validity of such ratings for documenting the developmental stage of pediatrics trainees. Further studies with larger subintern samples are needed to determine whether statistical differences also exist for Professionalization, Humanism, and Trustworthiness.
Acknowledgments: The authors gratefully acknowledge the editorial assistance provided by Dr. B. Lee Ligon and the invaluable participation of the Pediatrics Milestones Assessment Group (PMAG) in this research. Authors meeting ICMJE criteria who are not listed individually but included in the group authorship “APPD LEARN–NBME Pediatrics Milestones Assessment Group” are Stephen G. Clyman and member site personnel listed as (I) below.
1. Swing SR. The ACGME outcome project: Retrospective and prospective. Med Teach. 2007;29:648–654.
2. Swing SR, Beeson MS, Carraccio C, et al. Educational milestone development in the first 7 specialties to enter the Next Accreditation System. J Grad Med Educ. 2013;5:98–106.
3. Hicks PJ, Schumacher DJ, Benson BJ, et al. The pediatrics milestones: Conceptual framework, guiding principles, and approach to development. J Grad Med Educ. 2010;2:410–418.
4. Nasca TJ, Philibert I, Brigham T, Flynn TC. The next GME accreditation system—Rationale and benefits. N Engl J Med. 2012;366:1051–1056.
5. Hicks PJ, Englander R, Schumacher DJ, et al. Pediatrics milestone project: Next steps toward meaningful outcomes assessment. J Grad Med Educ. 2010;2:577–584.
6. Carraccio C, Benson B, Burke A, et al. Pediatrics milestones. J Grad Med Educ. 2013;5(1 suppl 1):59–73.
7. Varney A, Todd C, Hingle S, Clark M. Description of a developmental criterion-referenced assessment for promoting competence in internal medicine residents. J Grad Med Educ. 2009;1:73–81.
8. Meade LB, Borden SH, McArdle P, Rosenblum MJ, Picchioni MS, Hinchey KT. From theory to actual practice: Creation and application of milestones in an internal medicine residency program, 2004–2010. Med Teach. 2012;34:717–723.
9. Schumacher DJ, Lewis KO, Burke AE, et al. The pediatrics milestones: Initial evidence for their use as learning road maps for residents. Acad Pediatr. 2013;13:40–47.
10. Hicks PJ, Margolis M, Poynter SE, et al; APPD LEARN–NBME Pediatrics Milestones Assessment Group. The pediatrics milestones assessment pilot: Development of workplace-based assessment content, instruments, and processes. Acad Med. 2016;91:701–709.
11. Schwartz A, Margolis MJ, Multerer S, Haftel HM, Schumacher DJ; APPD LEARN–NBME Pediatrics Milestones Assessment Group. A multi-source feedback tool for measuring a subset of pediatrics milestones. Med Teach. 2016;38:995–1002.
12. Carraccio C, Gusic M, Hicks P. The pediatrics milestone project. Acad Pediatr. 2014;14(2 suppl):S1–S98.
13. Boulet JR, Murray D, Kras J, Woodhouse J, McAllister J, Ziv A. Reliability and validity of a simulation-based acute care skills assessment for medical students and residents. Anesthesiology. 2003;99:1270–1280.
14. Gude T, Vaglum P, Anvik T, et al. Do physicians improve their communication skills between finishing medical school and completing internship? A nationwide prospective observational cohort study. Patient Educ Couns. 2009;76:207–212.
15. Wouda JC, van de Wiel HB. The communication competency of medical students, residents and consultants. Patient Educ Couns. 2012;86:57–62.
16. Lee WN, Langiulli M, Mumtaz A, Peterson SJ. A comparison of humanistic qualities among medical students, residents, and faculty physicians in internal medicine. Heart Dis. 2003;5:380–383.
17. Lesser CS, Lucey CR, Egener B, Braddock CH 3rd, Linas SL, Levinson W. A behavioral and systems view of professionalism. JAMA. 2010;304:2732–2737.