In 2009, the Accreditation Council for Graduate Medical Education (ACGME) initiated the Milestone Project as part of a new accreditation system based on competency-based outcomes.1,2 The ACGME competencies represent important content areas of physician performance. As part of the project, the American Board of Pediatrics (ABP) and the ACGME formed a working group, led by Carol Carraccio, to write the Pediatrics Milestones (PMs)—a framework for assessing a learner’s development of competencies.3–6
Once the PMs had been developed, it became apparent that considerable work was needed to create an assessment system grounded in the PMs; simply placing the long, theoretical, and multifaceted descriptions of milestones onto an assessment instrument was of low utility. Therefore, shortly after publication of the PMs,6 the National Board of Medical Examiners (NBME) and the Association of Pediatric Program Directors (APPD) Longitudinal Educational Assessment Research Network (LEARN) began a collaboration to develop meaningful workplace-based assessment instruments. The first of these efforts was the Pediatrics Milestones Assessment Pilot (PMAP), a study examining the feasibility of assessment based on the PMs in the authentic clinical environment of pediatrics residency programs. In the PMAP study, fourth-year medical students (subinterns) and first-year pediatrics residents (interns) were assessed based on a subset of PMs that may inform residency program faculty decisions about learners’ readiness to serve as interns in the pediatric inpatient setting.
In this article, we describe the PMAP study and report on the processes of item development, data collection, and data management. We examine the utility7,8 of the assessment procedures developed, the resources required for the assessments, and responses to the project by participating learners and program directors. We conclude with a discussion of accomplishments, limitations, and next steps.
Preparation of assessment items, development of instrumentation (forms), and cognitive interviewing took place in 2011–2012. Institutional review board (IRB) application processes, two waves of site selection, and site preparation took place in early 2012. Data collection began for the first 6 sites in June 2012 and for the other 12 sites in fall 2012. Data collection ended in June 2013.
Background and instrument development
Assessment content selection.
Developing the assessment content began with determining which content areas (competencies) among the 48 PMs5,6 would provide the best evidence to support the decision about the readiness of a learner to serve as an intern on an inpatient general pediatrics unit. We aimed to follow Kane’s9 construct for assessment decisions, where the included content was intended to inform the assessment decision rather than being based solely on availability, ease, or historical precedent.
Surveys were distributed in April 2011 to residency program directors and associate program directors who were members of the APPD. Respondents used a four-point scale (from “not at all” to “very much”) to indicate which of the competencies would contribute to their decision about the target inference. The survey respondents (98 program directors and 57 associate program directors) represented 117 of the 198 ACGME-accredited U.S. pediatrics residency programs. Members of the APPD LEARN–NBME Pediatrics Milestones Assessment Group reviewed responses grouped by competency. The top 20 competencies were identified for additional consideration by content experts based on (1) ability to observe behaviors in the authentic clinical inpatient setting, (2) relative meaning or value of data obtained from observations of each competency as it related to the decision, and (3) importance of specific information obtained from direct observation of behaviors as it related to providing the learner with important feedback. Nine competencies10–16 ultimately were chosen for the pilot (see Table 1).
Assessment items relating to each of the nine chosen competencies were written by groups of pediatricians including PM working group members who had authored the PMs (to provide context and continuity) and others who had not (to provide an outside review). Orientation sessions were held with participants to familiarize them with principles of item writing for assessment purposes.17 NBME staff compiled authors’ compositions and created recommended items from the submissions. During a weekly teleconference, participants compared the original PM text, submissions from the item authors, and NBME staff versions, and subsequently reconciled differences in interpretations of content.
Additional items were borrowed from the NBME’s Assessment of Professional Behaviors Survey Instrument.18
Some items assessed behaviors that were observable on one specified occasion, others could only be judged over time, and some asked for global judgments requiring the observer to rate the quality of the learner for a defined task or context (e.g., “I would like to have this person on my team”). A range of item formats, scales, and response types were chosen with the aim of providing efficient and accurate representation of observed behaviors.
Selection of assessment methods.
Content experts suggested use of multisource feedback (MSF) and structured clinical observation (SCO) instruments as methods for the collection of data to inform the chosen inference, given the accepted use of these methods for workplace-based assessment in the inpatient setting. In MSF, multiple observers (peers, supervisors, nurses, etc.) record information about the learner after a series of observations over time.19,20 In a SCO, a rater directly observes relevant behaviors during a particular aspect of clinical performance.21 Content experts suggested that two SCO instruments would be useful—one for observing the learner taking a history and doing a physical examination (SCO-H) and one for observation of skills exhibited by the learner during patient rounds (SCO-R). A competencies/methods matrix was used to identify how competencies could be assessed most efficiently and effectively across these three instruments. Table 1 lists the number of items, by instrument, that were developed to target each of the nine chosen competencies. In addition to the three observation-based assessment instruments, a milestone classification form (MCF) was built that displayed the PM levels for each of the nine competencies (Figure 1).
Instrument delivery and feedback provider report preparation.
Adobe FormsCentral (Adobe Systems Inc., San Jose, California) was used to create and deliver assessment instruments online, primarily for convenience in rapid prototyping for a feasibility study. A master account was created through which the NBME and APPD LEARN PMAP coordinators could create instruments and provide links to participating sites.
Web-based assessment instruments were accessed through a link that could be bookmarked, with the bookmark icon made to appear like an application (app) on mobile devices, and/or through links placed in e-mails, in study protocol instructions, at the end of a patient chart document, or in other locations within the usual clinical workflow. Upon accessing the instruments by clicking on one of these links, observers selected their role and the context for their observation, and they were then presented with appropriate instruments to complete (Figure 2). Once a completed instrument was submitted, the information was stored in a password-protected account. Report forms and raw data were available immediately within this account to site investigators, and completed SCOs were returned immediately and automatically to the observer to facilitate timely formative feedback to learners.
At the end of each rotation, each site generated a feedback “dashboard” report for each learner summarizing the quantitative items by competency and presenting all associated narrative comments. Feedback providers used these dashboard reports to complete the MCF for each learner. The MCF scores for each milestone were determined by each site using a site-chosen method. Sites often needed to modify the dashboard report to suit their needs for feedback and determining MCF scores. Traditional assessment instruments continued to be completed alongside the pilot instruments.
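As an illustration of the kind of summary a dashboard report contains, the following minimal Python sketch groups quantitative ratings by competency and collects the associated narrative comments for one learner. The record layout, field names, competency labels (PC1, ICS1), and rating values are hypothetical. Sites in the study assembled their reports from exported instrument data and were free to choose their own methods, so this is one possible approach rather than the procedure any site used.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical submitted-instrument records; field names and values are
# illustrative assumptions, not the study's actual data layout.
responses = [
    {"learner": "A", "instrument": "MSF", "competency": "PC1",
     "rating": 3, "comment": "Clear handoffs."},
    {"learner": "A", "instrument": "SCO-R", "competency": "PC1",
     "rating": 4, "comment": ""},
    {"learner": "A", "instrument": "SCO-H", "competency": "ICS1",
     "rating": 2, "comment": "Confirm understanding with family."},
]


def build_dashboard(records, learner):
    """Summarize one learner's ratings and comments by competency."""
    ratings = defaultdict(list)
    comments = defaultdict(list)
    for rec in records:
        if rec["learner"] != learner:
            continue
        ratings[rec["competency"]].append(rec["rating"])
        if rec["comment"]:  # keep only non-empty narrative comments
            comments[rec["competency"]].append(rec["comment"])
    return {
        comp: {
            "n": len(vals),
            "mean_rating": mean(vals),
            "comments": comments[comp],
        }
        for comp, vals in sorted(ratings.items())
    }


dashboard = build_dashboard(responses, "A")
for competency, summary in dashboard.items():
    print(competency, summary)
```

A feedback provider could read such a summary alongside the PM level descriptions when completing the MCF. The mean is used here only for simplicity; a site might prefer medians or full rating distributions.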
The purpose of cognitive interviewing22 is to examine the extent to which the instructions for and the items in a draft instrument are easy for likely raters to understand and answer. Telephone interviews lasting 30 to 60 minutes were conducted with three to five individuals at each of the first six participating institutions. During the interviews, interviewers worked through the assessment instruments item by item.
A total of 19 interview participants (faculty with differing educational and clinical roles, chief residents, and nurses) provided feedback on 27 assessment items. The majority of items were judged to be interpretable and acceptable in their current format. Five items were reworded or split into 2 separate items. Three items were eliminated because of problems (e.g., ambiguous wording, scale, or answer-choice descriptors) and disagreement on the ability to observe learners demonstrating the behavior. The final instruments were used at all sites for the entire data collection period, and some of the assessment items were used on more than one of the instruments. The MSF had 26 assessment items, the SCO-R had 9, and the SCO-H had 11; some of these items were free-text response or branching items. Additional demographic items were also included.
PMAP study setting and participants
Site selection and enrollment.
The PMAP study enrolled 18 pediatric residency programs in two waves. The aim of beginning with 6 programs was to identify and resolve any procedural difficulties with implementation before engaging the other 12 programs. The 6 sites were chosen based on their interest and anticipated ability to provide feedback to the study investigators. The 12 second-wave sites were selected based on their response to a call for participation and the ability to complete study procedures, such as IRB applications, prior to data collection.
Participant recruitment and selection.
Eligible observer participants at each of the 18 participating sites were program attending physicians (e.g., faculty, chief residents, hospitalists), nurses, residents, and other observers. Eligible learner participants were first-year pediatric residents (interns) and fourth-year medical students enrolled in pediatrics subinternship rotations (subinterns).
Potential participants were provided with a sheet containing study information sufficient to determine whether they wanted to participate. Sites chose participating learners and observers through independent processes, as approved by their local IRB. At each site, the site investigator recruited one or two interns and one or two subinterns per one-month rotation over the site’s six- to nine-month data collection period. Observers could include other residents, nurses, faculty, and other interprofessional team members (e.g., clinical pharmacists, social workers, case managers). Observers were not individually identified in the research data set but were captured as unique individuals for data analysis purposes.
Participant orientation and education.
Study orientation materials and informational materials regarding the PMs were provided to the sites as resources for participant orientation and education. The study investigators did not prescribe methods or schedules for faculty orientation and faculty development sessions, so that local site leads could use the materials in concert with other local efforts to educate staff about the PMs and the project.
IRB approval and data management
APPD LEARN and each participating site obtained approval or exemption from its IRB. APPD LEARN acted as the coordinating and data management center for the study, following its previously published approaches to data deidentification and data security.23 Narrative comments included in each instrument were reviewed both by automated name-recognition software (Stanford Named Entity Recognizer version 3.2.08) and by a human reviewer to ensure that no identifiable information was included. Any name or other identifying descriptor was replaced with “Learner.”
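The replacement step can be sketched as follows. This minimal Python example assumes that the names to redact have already been identified (in the study, by name-recognition software followed by human review). It does not call the Stanford Named Entity Recognizer, and the function name, sample comment, and sample names are hypothetical.

```python
import re


def redact_names(comment, names, placeholder="Learner"):
    """Replace each listed name in a narrative comment with a placeholder.

    `names` stands in for the output of a name-recognition step plus
    human review; this sketch performs only the substitution.
    """
    if not names:
        return comment
    # Try longer names first so "Jane Doe" is matched before "Jane".
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    return pattern.sub(placeholder, comment)


# Hypothetical comment and name list, for illustration only.
text = "Jane Doe presented clearly; Jane should confirm the plan with families."
print(redact_names(text, ["Jane", "Jane Doe"]))
```

Simple substitution of this kind cannot catch misspelled names or other identifying descriptors, which is one reason a human review step remains useful.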
Assessments conducted for the PMAP study were not to be used to determine student or resident grades.
PMAP study procedures
For each participating learner, sites were instructed to attempt to obtain at least six completed MSF instruments from different observers and six completed SCO instruments (three SCO-R and three SCO-H) from faculty or senior residents immediately following observation of the relevant encounter during the learner’s rotation. Both types of assessment instrument were completed electronically, as described above.
Site study coordinators could choose to have automated e-mail notifications sent to designated site investigators when instruments were completed. These notifications allowed real-time tracking of study progress and often served as a reminder for site investigators to notify their team manually (i.e., via e-mail) when observations were not being completed promptly.
At the end of the rotation, for each learner, the feedback provider reviewed the dashboard report of assessment results, assigned a developmental milestone level for each of the nine competencies, and recorded these levels on the MCF. Each site chose the individual(s) who would serve in the feedback provider role, avoiding individuals who were responsible for decisions about admission of residency program applicants or about learner advancement (i.e., program directors). The feedback provider then provided formative feedback directly to each learner in a face-to-face meeting held for this purpose.
To help characterize the acceptability, feasibility, and utility of the process, participating learners were asked at the end of their rotation (shortly after receiving feedback) to complete an anonymous online survey about their assessment experience. The survey included quantitative rating and free-text response items. Data were collected on learner expectations, frequency and value of feedback received, and utility of feedback for identifying improvement areas and future goals. Survey responses were shared with programs only in the aggregate. For each rotation, reminders were sent to all participating learners at sites where surveys were missing for that rotation, but responses to the survey were not required.
Site lead feedback.
Telephone interviews were conducted with the site leaders/assistant site leaders at the initial six PMAP sites to collect feedback about implementation challenges and successes. The interviews took place during the month following the completion of data collection at each site and ranged in length from 23 to 60 minutes. At least two interviewers were present for each session: one acted as lead interviewer and the other(s) as note taker(s). Additionally, an online survey of all 18 site lead investigators explored the methods used to orient participants to the study and the strategies used for recruitment of raters and learners and for achieving assessment instrument completion.
We examined MSF, SCO-R, and SCO-H ratings of learners by observers at each program to determine variance associated with learner, instrument, observer, and program in these assessments (data not reported). Survey and interview responses were analyzed descriptively to summarize perceptions of (1) utility of the assessments and feedback sessions in helping learners understand developmental milestones and (2) clarity and ease of use of the instruments. Four coders (T.L.T., P.J.H., C.C., S.E.P.) conducted thematic analyses, first independently reviewing all narrative text responses and identifying themes as they emerged (open coding). Coders then compared themes, identified new themes, and in some cases combined content grouped in similar themes into a more general or unifying theme. Using the new list of themes, coders once again analyzed all text (axial coding); differences in assignment of text content to themes were resolved with discussion in an iterative fashion. This process ended when no new themes emerged. After the coding process was completed, an individual outside the group (Leta Rose) used the themes to code a randomly chosen 25% of the narrative text responses. Complete agreement of theme to original narrative text responses was found in this trustworthiness check.
Eighteen pediatric residency programs from 15 U.S. states participated in the pilot. Data collection occurred for a six- to nine-month period at each site from June 29, 2012 to June 30, 2013. The PMAP study accrued a total of 2,338 completed assessment instruments for 239 learners from 630 unique observers: 1,303 (56%) of the instruments were completed by faculty, 707 (30%) by residents, 276 (12%) by nurses, and 52 (2%) by other clinical team members (such as pharmacists and social workers). Unique observers by role included 260 faculty (41%), 275 residents (44%), 85 nurses (13%), and 10 others (2%). The 2,338 completed assessments included 1,086 MSF instruments (46%), 713 SCO-Rs (30%), 328 SCO-Hs (14%), and 211 MCFs (9%).
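The reported percentages follow directly from the counts. As a quick arithmetic check, this short Python sketch recomputes the share of completed instruments by observer role from the figures above; the variable and function names are illustrative.

```python
# Counts of completed instruments by observer role, as reported in the text.
instruments_by_role = {
    "faculty": 1303,
    "residents": 707,
    "nurses": 276,
    "other": 52,
}


def percent_shares(counts):
    """Return each category's share of the total, rounded to a whole percent."""
    total = sum(counts.values())
    return {role: round(100 * n / total) for role, n in counts.items()}


total = sum(instruments_by_role.values())
shares = percent_shares(instruments_by_role)
print(total)
print(shares)
```

The counts sum to 2,338, and the rounded shares are 56%, 30%, 12%, and 2%, matching the reported values. The same function can be applied to the counts of unique observers or of instruments by type.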
The 239 learners assessed during the study included 42 subinterns (18%) and 197 interns (82%). The median number of instruments (MSF, SCO-R, SCO-H, or MCF) completed per learner over the one-month rotation time frame was 10 (interquartile range [IQR], 7–13). For interns, it was 11 (IQR, 8–13); for subinterns, it was 7 (IQR, 5–9). The target number of completed instruments per learner per rotation was 13 for the study: 72 (30%) of the 239 learners had all 13 instruments returned (66 [34%] of the interns and 6 [14%] of the subinterns). In total, the 2,338 instruments amounted to a median of 311 assessment item responses collected per learner (range, 25–364; IQR, 312–401). Observers wrote over 130,000 words of narrative text comments about learners.
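One way to track progress toward the per-learner target is sketched below in Python. The MSF and SCO targets (six MSF, three SCO-R, three SCO-H) come from the study procedures. Counting one MCF to reach the total of 13 is an inference from the text, not a stated breakdown. The function name and the sample learner are hypothetical, and the study may have counted completion by total rather than by instrument type.

```python
# Per-learner targets per rotation. The MSF and SCO targets come from the
# study procedures; counting one MCF to reach 13 is an inference.
TARGETS = {"MSF": 6, "SCO-R": 3, "SCO-H": 3, "MCF": 1}


def completion_status(completed):
    """Compare one learner's completed-instrument counts with the targets."""
    shortfalls = {
        name: target - completed.get(name, 0)
        for name, target in TARGETS.items()
        if completed.get(name, 0) < target
    }
    return {
        "total_completed": sum(completed.get(name, 0) for name in TARGETS),
        "target_total": sum(TARGETS.values()),
        "met_all_targets": not shortfalls,
        "shortfalls": shortfalls,
    }


# Hypothetical learner with ten instruments, short on SCO-H observations.
status = completion_status({"MSF": 6, "SCO-R": 3, "SCO-H": 0, "MCF": 1})
print(status)
```

A per-type check like this shows which instrument is lagging for a given learner, which a simple total would not reveal.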
One hundred fifty (63%) of the 239 learners completed the learner feedback survey (115 [58%] of 197 interns and 35 [83%] of 42 subinterns). Responses indicated that the frequency of real-time feedback varied based on the nature of the observations, with at least some feedback following rounds (reported by 75% of respondents) and some feedback following history taking (reported by 64% of respondents).
Two of the survey items asked learners to think about the feedback they received at the end of their rotation. Of the 137 learners who responded to these items, 128 (93%) agreed or strongly agreed with the statement “It helped me understand how those with whom I work perceive my performance.” Eighty-five percent (n = 117) agreed or strongly agreed with the statement “It was useful for constructing future goals or identifying a developmental path” (Figure 3).
An analysis of major themes developed from the learners’ narrative comments was performed on the initial 80 completed learner surveys (wave one sites). Sixty-three (79%) of the initial 80 respondents provided narrative comments. The themes from these narrative comments are presented in Table 2, with exemplar quotes. Learners’ descriptions of the purpose of the study centered around three main themes: to receive feedback; to explore a new and novel [assessment] system; and to assist learner performance improvement. Themes that emerged from responses to the survey item asking whether the study met the learner’s expectations included descriptions of the type, quality, and utility of the feedback provided and of the unobtrusiveness of the observational assessment process.
Among the 63 respondents who provided narrative comments about constructing plans in response to feedback, 57 (90%) stated one or more specific plan(s) for improvement in response to feedback, whereas 6 (10%) had no specified plan for action, providing responses such as “none” or “unsure.” The 57 respondents’ narrative descriptions of their plans were coded and grouped into themes. Twelve (20%) planned to balance open-ended and direct questions; 21 (36%) planned to improve communication with patients and families, particularly by clarifying and confirming shared understanding; and 23 (40%) planned to improve thoroughness with questioning and building care plans.
Themes for pilot study improvements included the following: Feedback should be specific, with examples of both “positive” and “negative” behaviors (30% of comments); feedback should come from a variety of team members (17% of comments); and feedback sessions should be scheduled regularly (23% of comments).
Overall, learners’ responses indicated that participation in the pilot study allowed more frequent, focused feedback that proved useful in improving their current performance and constructing future improvement goals.
Site lead feedback
As described above, semistructured phone interviews were conducted with site leads at the first six sites. Thirteen (72%) of the 18 site leads completed the online survey. The results reported below reflect a combination of both inquiries.
All respondents described the overall experience of participating in the pilot as positive. Many benefits were cited, including thinking through rater selection, gaining practical experience working with clinical competency committees, inclusion of many feedback providers, engagement of nurses and residents in the assessment and feedback, and upper-level residents’ excitement about teaching as a result of their participation in the study. The benefit cited most frequently was enhanced feedback to learners and faculty development—in particular, greater exposure to the PMs.
Operational challenges reported included difficulty in getting observers to complete instruments and in assembling data from the Adobe FormsCentral database to construct individual learner dashboard reports (both processes often required additional administrative support and manual intervention); competing demands of patient care duties; assessment completion burden (as existing assessments were also in use); and limited observation opportunities, which affected instrument completion rates. Most notably, respondents reported limited availability of raters during history-taking sessions as the primary barrier to obtaining SCO-H data. Administrative burden associated with sending manual e-mail reminders to observers was also noted.
Assessment methods used in graduate medical education should yield results that stimulate positive changes in individual resident performance, knowledge, skills, or attitudes or in the educational curriculum, or that are actionable and perceived as useful.24 The PMAP study demonstrated high perceived utility to learners, who valued receiving feedback that was timely, informed them of the perspectives and observations of their team members, and directed them in how to improve. A median of 10 assessment instruments was completed per learner, with a high volume of content (ratings and comments). Areas for improvement reported by site leads included better administrative support, ideally through improved technology or smart assignment, reminder, and completion processes.
The mobile capability of the PMAP front-end delivery system allowed assessment to occur within, rather than outside of, clinical workflow. The system also provided immediate access to completed SCO assessments for the purpose of providing feedback.
There were several study limitations. This pilot study represented a small sample of learners at only 18 pediatric residency programs, with sites selected based on interest and capability to participate.
Some of the participating learners did not have time to complete the learner survey immediately after receiving end-of-rotation feedback, which may have resulted in a decreased response rate. Noncompletion of learner surveys may also have been related to learner response to the feedback itself, because the survey took place after the feedback session.
The construction and use of the feedback dashboard report was challenging and required additional resources from the study coordinators. Organizing and presenting narrative text comments in the dashboard was particularly burdensome; there was not a process available to automatically categorize or sort rater comments by competency domains, as can be done when using field notes.25,26 Technology solutions are also needed for automating reminder prompts so that assessment completion rates are not dependent on manually constructed e-mails.
This study did not assess the financial costs of study operations. A time-motion study of observers could have informed observer workload more precisely.
No standardized faculty development or orientation was required for observers participating in the study. Of note, many participating programs developed faculty training materials to assist observers in understanding the items and the PM content. For example, one site developed and shared videos to illustrate a learner with high-level performance and a learner with lower-level performance of specific content addressed on the SCO-R instrument.
Based on participant enthusiasm and the success of this pilot, the NBME, APPD LEARN, and the ABP have agreed to pursue further development of the PMs through a Pediatrics Milestones Assessment Collaborative project. This project is aimed at developing an assessment system that provides standardized assessment tools to inform important decisions about learners across the continuum.
The ACGME Next Accreditation System calls for all programs responsible for training physicians to provide evidence of learner performance based on direct observation of learner behaviors in the authentic clinical workplace.1 The PMAP study provides a model for developing a workplace-based assessment system for any specialty. Assessment content will likely differ from specialty to specialty based on the variance in the specialty-specific competency frameworks used. Additional methods may also be applied. The ACGME’s broad competency domains, which organize the more granular competencies, are used by many specialties. Assessment instrument items with content areas that relate to competency domains that are common to several specialties (interpersonal skills and communication, professionalism, systems-based practice, and practice-based learning and improvement) may be repurposed for other specialties without substantial modification.
Sustaining workplace-based assessment of competence will require a cultural change in the approach to assessment within trainees’ workplace environment. Faculty, nurses, other staff, and all learners will be required to integrate assessment-related observations into their daily workflows. Increased training of observers and feedback providers will be necessary to fully realize the impact of any new assessment system based on observations in the workplace. We look forward to collaboration across specialties in the further development of content, methods, and successful strategies for implementation and maintenance of all aspects of competency-based assessment and feedback.
Acknowledgments: The authors gratefully acknowledge the following groups for their invaluable participation in this research:
Pediatrics Milestones Assessment Group (PMAG) members by participating residency program: Members are listed by site as investigator (I) or coordinator (C). University of Michigan: Hilary Haftel, MD, MHPE (I); Kristen Wright, MD (I); Rosalind Moffitt (C). Children’s Hospital of Philadelphia: Patricia J. Hicks, MD, MHPE (I); Jeanine Ronan, MD (I); Rebecca Tenney-Soeiro, MD (I); Dawn Young, MEd (C). Baylor College of Medicine: Teri L. Turner, MD (I); Melodie Allison (I). University of Virginia: Linda Waggoner-Fountain, MD (I); Lisa Morris (C). Phoenix Children’s Hospital: Grace Caputo, MD (I); Sandra Barker (I); Vasu Bhavaraju, MD (I); Ana Velazquez (C). Cincinnati Children’s Hospital: Javier Gonzalez del Rey, MD, MEd (I); Sue E. Poynter, MD, MEd (I). Vermont Children’s Hospital: Ann Guillot, MD (I); Karen Leonard, MD (I). University of Louisville: Sara Multerer, MD (I); Kimberly Boland, MD (I); Olivia Mittel, MD (I). University of Illinois at Chicago: Amanda Osta, MD (I); Michelle Barnes, MD (I); Emri Tas, MD (I); Jen McDonnell, MD (I). University of Florida: Cynthia Powell (I); Nicole Paradise Black, MD (I); Lilly Chang, MD (I). Winthrop University Hospital: Jill Leavens-Maurer, MD (I); Ulka Kothari, MD (I); Robert Lee, MD (I). Boston University: Daniel J. Schumacher, MD, MEd (I). Emory University: Susie Buchter, MD (I). Inova Fairfax Hospital for Children: Meredith Carter, MD (I). State University of New York at Stony Brook: Robyn Blair, MD (I). University of Texas Health Sciences Center: Sandra Arnold, MD (I); Mark C. Bugnitz, MD (I). Wright State University: Ann Burke, MD (I). Children’s National Medical Center: Aisha Barber Davis, MD (I).
PMAG members at the National Board of Medical Examiners: Stephen G. Clyman, MD; Thomas Rebbecchi, MD; Colleen Canavan, MS; Christa Chaffinch, MA; Melissa J. Margolis, PhD; Margaret Richmond, MS; Kathleen M. Rose; Leta Rose; Yelena Spector, MPH.
Association of Pediatric Program Directors Longitudinal Educational Assessment Research Network: Alan Schwartz, PhD; Patricia J. Hicks, MD, MHPE; Robin Lockridge, MS.