
Empirical Investigations

Simulation for Milestone Assessment

Use of a Longitudinal Curriculum for Pediatric Residents

Frey-Vogel, Ariel S. MD, MAT; Scott-Vernaglia, Shannon E. MD; Carter, Lindsay P. MD; Huang, Grace C. MD

Simulation in Healthcare: The Journal of the Society for Simulation in Healthcare: August 2016 - Volume 11 - Issue 4 - p 286-292
doi: 10.1097/SIH.0000000000000162


As of June 2014, pediatric residencies are required to report their residents’ performance along the Pediatric Milestones to the Accreditation Council for Graduate Medical Education (ACGME) every 6 months.1 Before that time, pediatric residency programs had to attest, at the end of a resident’s training, that he or she demonstrated competency to practice independently in 6 areas: patient care, medical knowledge, interpersonal skills and communication, professionalism, systems-based practice, and practice-based learning and improvement. The implementation of these 6 core competencies, which were established in 1999, transformed resident evaluation from a process-based system (completing the required rotations) to a system based on attaining competency. However, it was recognized that the competency-based system did not provide sufficient granularity for program directors to determine competence in any given area. Thus, the milestones were developed in 2009 by the ACGME in partnership with specialty organizations. The idea was that each competency is made up of numerous subcompetencies, which together would define the overall competency. How an individual performed on a given subcompetency would be determined by where he or she fell along a spectrum of skill expectations, or milestones, within that subcompetency, ranging from behaviors expected of beginning medical students (novices) to behaviors expected of seasoned physicians (experts). Pediatrics has identified a subset of the subcompetencies with their corresponding milestones, called “reporting milestones,” that must be reported to the ACGME twice a year for each resident (see Appendix 1, Supplemental Digital Content 1, for a list of the Pediatric Reporting Milestones). However, neither the ACGME’s published Pediatric Milestones nor the ACGME’s Pediatric Program Requirements2,3 provide guidance about how to assess where trainees’ performance lies on these milestones.
The literature in this area is also lacking. Therefore, programs must create their own assessment tools and approaches until the field is better informed by research data or by national standards.4

It is also well established that assessing residents based on competencies requires more direct observation than has previously been part of the training environment.5–7 However, creating opportunities for workplace-based assessment is difficult. By necessity, faculty members are asked to do this in conjunction with their own clinical obligations, which means juggling the duties of high-quality patient care, teaching, and clinical productivity, all while keeping the milestones framework in mind. Furthermore, reaching the ultimate goal of determining when a resident is ready for independent practice requires amassing sufficient observations over time, a task made harder by the lack of standardization in the workplace.

Simulation represents a promising means to address these obstacles to a milestone-based assessment of residents. Simulation sessions have the benefit of a high faculty-trainee ratio, can be arranged such that faculty members can have protected time during the day without clinical responsibilities, and are not dependent on a patient presenting for care. The simulated situations can also be adjusted for different levels of learners to optimize assessment and furthermore to ensure that high-risk, low-frequency scenarios are addressed. The standardization inherent in simulation has facilitated its use for high-stakes assessment.8,9 Other benefits of simulation include the ability to test the reliability of assessment tools10 and to record the simulation scenario for review both with trainees to provide feedback9 and with faculty to improve interrater reliability among faculty assessors.

The use of simulation for Pediatric Milestones assessment has not been studied. Most of the work in pediatrics using simulation for assessment has centered on residents’ performance during codes.11–18 One group has shown that simulation can be a reliable and valid means of assessing pediatric residents across 10 simulated cases,19 thus setting the stage for the use of multiscenario, simulation-based, high-stakes assessment of residents on core topics in pediatrics.

The purpose of our work was to examine the use of simulation in relation to milestone assessment. We developed longitudinal formative and end-of-year summative simulated cases for pediatric interns and assessed the correlation between performance on the simulation cases and their traditional milestone-based global assessments. We specifically sought to determine whether (1) performance on the formative cases predicted performance on the summative cases and (2) performance on the summative cases correlated with placement of the interns on the milestones by the clinical competency committee (CCC), with the hope of providing evidence for the use of simulation for high-stakes assessment in pediatric residency.

Methods

We created the formative cases by convening a group of core pediatric faculty and arriving at consensus on 6 key pediatric topics in which interns should be proficient by the end of intern year (Table 1). Group members included residency program leadership, a pediatric clerkship codirector, and a medicine-pediatrics urgent care physician. These cases were designated as formative because their purpose was largely educational, and we conducted debriefings after the simulation scenarios. We developed the cases individually using a simulation template specific to our institution. We edited the cases in a collaborative and iterative manner and piloted them with junior residents, using their feedback to revise the cases (see Appendix 2, Supplemental Digital Content 2, for a sample case).

Table 1. Topics and Cases Used in the Formative and Summative Cases

The interns participated as pairs in the formative simulation cases, with one intern serving as a leader and the other participating in a consulting role. They participated in 2 formative cases 3 times during the course of the year, for a total of 6 cases, with leadership in 3 cases. A core group of 3 faculty members took turns running the simulation computer and mannequin, acting as the simulated patient’s nurse, and acting as the simulated patient’s parent. A 20-minute debriefing followed each 10-minute case. We videotaped both the simulation and debriefing sessions with resident consent and kept the recordings digitally secure. The interns were asked during their simulation sessions not to divulge the topics of the cases to their colleagues. After each set of cases, the interns were e-mailed full-text articles on the topics covered in the cases to reinforce the learning that had just occurred in the session.

Given the new requirement for mapping residents on the Pediatric Milestones, we decided to base our assessment tool on these milestones. In addition to the 21 ACGME Pediatric Reporting Milestones, our residency program added a medical knowledge subcompetency requirement (MK2: Demonstrate sufficient knowledge of the basic and clinically supportive sciences appropriate to pediatrics2), as the residency program leadership felt that this was an area of knowledge imperative to the independent practice of pediatrics. Our primary measurement instrument for the simulation cases is shown in Figure 1 as a “competency-based simulation assessment.” This instrument was based on the 6 ACGME competencies, and the language for the anchors was developed by the coauthors in an iterative, collaborative fashion based on the language and terminology found in the Pediatric Milestones. The numeric anchor, or milestone, chosen for each of the items represented the score for the competency on that subcompetency or set of subcompetencies. We provided verbal descriptors only at the extreme ends and the middle point of the scales to facilitate assessment by raters; a more extensive rubric with the descriptors at all 5 points would have resulted in a significant cognitive load. The authors, in a collaborative manner, reviewed the Pediatric Reporting Subcompetencies to determine which ones would be easily evaluated in our simulation setting (Table 2). We also developed case-specific performance checklists (eg, “Respiratory Distress,” see Appendix 2, Supplemental Digital Content 2) in an iterative, collaborative fashion to inform the rater about the behaviors expected in the case; however, data from these checklists were not used for the analyses because they were case specific and not mapped to competencies.
Three of the coauthors reviewed all the videos (simulation plus portions of the debriefing) as a group but individually completed the 6-item competency-based simulation assessment, evaluating just the performance of the intern leader in the simulation. The individual assessments allowed us to measure interrater reliability for the assessment tool. Because the correlation analysis required a single score, we then came to consensus on each of the 6 items to determine a final score for each intern. Of note, the debriefing was essential to the completion of the evaluation for 2 reasons: the debriefing best displayed skills of practice-based learning and improvement (eg, self-reflection), and often, the justification for an intern’s differential diagnosis was not revealed until the debriefing.

Table 2. Pediatric Milestones Chosen for Simulation Evaluation
Figure 1. Competency-based simulation assessment.

We then developed summative cases whose purpose was primarily evaluative. Using the 6 core pediatric topics developed for the formative cases, we created a similar set of summative cases in which participants would need to demonstrate similar knowledge, skills, and attitudes. For example, a core topic was “altered mental status;” the etiology in the formative case was “narcotic overdose,” while in the summative case it was “hypoglycemia” (Table 1). Again, each case was written by one individual, and the group revised each case in an iterative, collaborative fashion. We then piloted the cases with junior and senior residents, after which we made subsequent revisions. The summative cases took place at the end of the intern year during a 2-day period; in contrast to the formative cases, each intern progressed through all 6 cases on his or her own during the course of 90 minutes. Each summative case ran for 10 minutes, followed by brief feedback from a faculty evaluator. The feedback consisted of areas in which the intern performed well and areas in which the intern could improve but was not a formal debriefing in which the interns engaged in an interactive discussion and teaching session about the case. The intern also received written teaching points from the case immediately after completing each case.

The summative cases were observed by 3 faculty members who had not worked with the pediatric interns, had not been involved in the formative simulation cases, and were not members of the intern CCC. Each rater evaluated the summative cases in real time to facilitate immediate, relevant feedback to interns and to mitigate the risk of data loss through unforeseen technical mishaps. The same rater evaluated the same case across all interns, with only 1 rater per case. Each rater had undergone formal training as an evaluator by 1 of the authors, which included gaining familiarity with the case content, reviewing 3 videotapes of the cases as piloted by junior and senior residents, completing the competency-based simulation evaluation independently, and then calibrating scores with the author on the criteria for each rating.

The CCC (which included authors A.F.V. and S.S.V.) convened at the end of the year to assign milestones placement for each intern. The CCC based milestones placement on faculty evaluations of interns from their clinical rotations, consensus from key faculty groups who work closely with the interns, and residency administration (the program director, associate program directors, and chief residents) input. Interns received a whole or half-number score on a 1 to 5 scale for each of the 21 reportable Pediatric Milestones plus the additional program-required subcompetency (MK2). Interns’ performance results on the summative cases were not available to the committee members. All data from the simulation and the CCC were deidentified and coded by a residency coordinator.

We received institutional review board approval at Massachusetts General Hospital via the exemption mechanism.

Statistical Analysis

To summarize the data elements, we had scores ranging from 1 to 5 for each of the 6 competencies for each of the 3 formative cases where the resident served as a leader. For the summative cases, we had scores ranging from 1 to 5 for each of the 6 competencies for each of the 6 summative cases for each of the residents. We used individual scores for each competency and did not add up the scores to create a total score. From the CCC assessments, we had available to us scores ranging from 1 to 5 for each of the 22 Pediatric Subcompetencies for each of the residents; however, we identified a priori the 9 subcompetencies most relevant to the simulation cases and used this subset for analysis (Table 2).

To examine whether scores on formative simulation cases (“formative scores”) predicted scores on summative simulation cases (“summative scores”), we performed a series of regression analyses. To create the data set, we performed a one-to-one merge matching on subject identifier and case topic (eg, “altered mental status”). In a similar fashion, we merged the CCC data with the summative scores, matching on subject identifier and competency scores, and analyzed whether summative scores predicted the scores given by the CCC. We performed secondary analyses to investigate the influence of time and also intern type (categorical vs. noncategorical). We also examined the relationship between the formative scores and summative scores by competency.
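The merge-then-regress step can be illustrated with a minimal sketch. It is not the authors' Stata code: the intern identifiers and scores are invented, the helper names `merge_one_to_one` and `ols_slope_intercept` are hypothetical, and a plain single-predictor least-squares fit stands in for whatever regression specification was actually used.

```python
from statistics import mean

def merge_one_to_one(left, right):
    """Join two {(intern_id, case_topic): score} mappings on their shared keys.

    Keys present in only one mapping are dropped, as in an inner one-to-one merge.
    """
    shared = sorted(left.keys() & right.keys())
    return [(left[key], right[key]) for key in shared]

def ols_slope_intercept(pairs):
    """Ordinary least squares fit of y = intercept + slope * x for (x, y) pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Invented scores, keyed by (intern identifier, case topic); not study data.
formative = {
    ("A", "altered mental status"): 2.0,
    ("B", "altered mental status"): 3.0,
    ("C", "respiratory distress"): 2.5,
    ("D", "respiratory distress"): 4.0,  # no summative match, dropped by the merge
}
summative = {
    ("A", "altered mental status"): 3.0,
    ("B", "altered mental status"): 3.5,
    ("C", "respiratory distress"): 3.0,
    ("E", "respiratory distress"): 2.0,  # no formative match, dropped by the merge
}

pairs = merge_one_to_one(formative, summative)
slope, intercept = ols_slope_intercept(pairs)
print(f"matched observations: {len(pairs)}, slope: {slope:.2f}, intercept: {intercept:.2f}")
```

The slope is read the same way as the coefficients reported in the Results: the expected change in the outcome score for a 1-point increase in the predictor score. The sketch omits covariates such as time of year and intern type, and it does not account for repeated observations per intern.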

To determine whether comparatively poor performance on formative cases predicted poor performance on summative cases, we averaged the formative scores across cases by intern and examined the lowest quartile to see whether these were the same individuals at the lowest quartile of the summative scores. Similarly, we compared CCC scores of this lowest quartile of summative scores to those in the other quartiles.
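The quartile comparison can be sketched as follows. The per-intern averages are invented, the helper `lowest_quartile` is hypothetical, and the quartile cutoff uses the inclusive method of Python's `statistics.quantiles`, which may differ from the convention applied in the original analysis.

```python
from statistics import quantiles

def lowest_quartile(avg_by_intern):
    """Return the interns whose average score is at or below the first quartile."""
    scores = list(avg_by_intern.values())
    cutoff = quantiles(scores, n=4, method="inclusive")[0]
    return {intern for intern, score in avg_by_intern.items() if score <= cutoff}

# Invented per-intern averages across cases; not study data.
formative_avg = {"A": 1.5, "B": 2.0, "C": 2.5, "D": 3.0,
                 "E": 2.2, "F": 2.8, "G": 1.8, "H": 2.6}
summative_avg = {"A": 2.5, "B": 3.0, "C": 3.5, "D": 4.0,
                 "E": 2.8, "F": 3.2, "G": 3.6, "H": 3.9}

low_formative = lowest_quartile(formative_avg)
low_summative = lowest_quartile(summative_avg)
overlap = low_formative & low_summative
print("lowest formative quartile:", sorted(low_formative))
print("lowest summative quartile:", sorted(low_summative))
print("in both:", sorted(overlap))
```

If the formative cases identified struggling interns early, most of the lowest formative quartile would reappear in the lowest summative quartile; a small overlap points the other way.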

We also calculated interrater reliability among the 3 raters of the formative cases, using Cohen’s κ. Because only 1 rater observed any given summative case, interrater reliability could not be calculated.
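Cohen's κ is defined for 2 raters, and the text does not state how it was extended to 3. The sketch below assumes one common approach, averaging κ over every pair of raters; this is an assumption rather than the authors' documented method, and the ratings are invented.

```python
from collections import Counter
from itertools import combinations

def cohen_kappa(ratings_1, ratings_2):
    """Cohen's kappa for two raters scoring the same items on a categorical scale.

    Undefined (division by zero) if both raters use a single identical category.
    """
    n = len(ratings_1)
    observed = sum(a == b for a, b in zip(ratings_1, ratings_2)) / n
    counts_1, counts_2 = Counter(ratings_1), Counter(ratings_2)
    categories = counts_1.keys() | counts_2.keys()
    # Agreement expected by chance, from each rater's marginal frequencies.
    expected = sum(counts_1[c] * counts_2[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)

def mean_pairwise_kappa(all_ratings):
    """Average Cohen's kappa over every pair of raters (assumed 3-rater extension)."""
    kappas = [cohen_kappa(a, b) for a, b in combinations(all_ratings, 2)]
    return sum(kappas) / len(kappas)

# Invented 1-to-5 ratings from 3 raters on the same 8 items; not study data.
rater_a = [1, 2, 3, 3, 2, 1, 4, 2]
rater_b = [1, 2, 3, 2, 2, 1, 4, 3]
rater_c = [1, 3, 3, 3, 2, 2, 4, 2]

overall = mean_pairwise_kappa([rater_a, rater_b, rater_c])
print(f"mean pairwise kappa: {overall:.2f}")
```

Unweighted κ treats a 1-point disagreement the same as a 4-point disagreement. On an ordinal 1-to-5 scale, a weighted κ would give partial credit for near misses.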

All analyses were conducted using Stata 13.0 (StataCorp, College Station, TX).

Results

We included simulation data on 24 residents. The number of formative score observations was 305 across 24 residents, with an average of 2.1 formative cases scored per resident; the average formative score was 2.3 (SD, 0.65; range, 1–4). The number of summative score observations was 638 across 20 residents, with each resident who participated in the summative session scored on all 6 cases; the average summative score was 3.1 (SD, 0.83; range, 1–5). Four residents did not participate in the summative simulations because of illness, scheduled vacation, or not being on a pediatric rotation (for combined internal medicine-pediatrics interns) during that period.

The relationship between formative and summative scores was not statistically significant, a finding that persisted when controlling for the time of the year when the cases occurred or the type of intern (eg, categorical) in the regression model. There was also no statistically significant relationship when examining each of the 6 competencies; coefficients ranged from 0.04 to 0.31, and no P values were statistically significant. There was, however, a statistically significant relationship between summative scores and CCC scores; every increase in summative scores by 1 point was associated with an increase of 0.2 CCC “points” (P = 0.029, 95% confidence interval, 0.03–0.43). Stated plainly, 1 resident scoring 4 in medical knowledge on a respiratory distress case would have a CCC score in medical knowledge that was 0.2 points higher compared with another resident scoring 3 on the same item and case. These findings are summarized in Table 3.

Table 3. Results of Regression Analyses

The lowest quartile of formative scores did not predict the lowest quartile of summative scores. The residents in the lowest quartile of average summative scores had a lower mean CCC score average than those in other quartiles (2.87 vs. 2.95); however, this difference was not statistically significant.

The 3 raters observing the formative cases had an overall κ statistic of 0.54, indicating moderate agreement.

Discussion

We developed a process to create simulation cases that could be used for high-stakes assessment and conducted a pilot investigation on the use of these cases to predict performance assessment on the ACGME milestones. We did not identify meaningful correlations between simulations used for formative assessment throughout the year and end-of-year simulations for summative assessment. We detected a modest association between the summative cases and placement by the CCC on the Pediatric Milestones. Given methodological limitations and a small sample size, our findings are exploratory at best; however, to our knowledge, this is the first described use of simulation for an assessment along the Pediatric Milestones.

Our development process adhered to several guiding principles. We identified clinical topics common to and expected of pediatric residents undergoing training. We used multiple cases to avoid the issue of case specificity. We used multiple raters to assess the interrater variability of our instrument. We developed an instrument that was closely aligned with the Pediatric Milestones so that we would assess the same behaviors.

We did not find a correlation between performance on formative cases and performance on summative cases. Methodological limitations likely dominated and are articulated later. From an educational standpoint, the learning curve for intern year is historically steep and nonlinear; even individuals who initially struggle with core pediatric concepts may catch up by the end of the year as a result of the educational process. Thus, one would not expect that performance at the beginning of the year would reflect end-of-year skills.

The relationship between performance on summative simulation cases and milestone placement by the CCC does make sense. Data used to inform the CCC’s decision derive from observation of interns’ clinical skills over the course of the preceding 6 months across a variety of rotations and clinical encounters. Similarly, although the 6 topics for the summative cases do not represent the depth and full breadth of pediatric conditions, they do represent exemplars of clinical scenarios with which interns should be familiar. As such, our findings suggest that a snapshot of representative cases can approximate the types of clinical experiences needed to inform the CCC’s decision. Such performance on standardized cases might also augment the CCC’s data by which to place residents on the milestones.

The role of simulation in pediatrics is evolving; currently, there are few studies examining its use in resident assessment. The majority of published studies on simulation for assessment in pediatrics focus on creating and establishing the validity and reliability of evaluation tools specifically for use in emergent situations, resuscitations, and with the Pediatric Advanced Life Support algorithms.11–13,15,16,20 We identified only 2 studies that examined evaluation tools in more commonly encountered pediatric scenarios.19,21 The first demonstrated construct validity of an assessment tool by showing that checklist evaluations could differentiate first- from second-year residents.21 The second used multiscenario simulations and found that more experienced residents reliably outperform less experienced residents and that data from multiple scenarios can be used to assess resident performance reliably.19 These studies do not evaluate simulation’s ability to discriminate between high- and low-performing residents within the same year of training, as our study attempts to do, and they do not demonstrate criterion validity, that is, the relationship between evaluated performance on simulation and the current standard of CCC placement, which our study addresses.

Recognizing the need for more collaboration and a deeper study of pediatric simulation, an international network, “INSPIRE” (International Network for Simulation-based Pediatric Innovation, Research, and Education), has been established to coordinate and support groups doing research on pediatric simulation. INSPIRE focuses on teamwork, procedural and psychomotor skills, debriefing, acute care and resuscitation, technology, human factors, and patient safety.22

Our study has many limitations that preclude our ability to draw conclusions about simulation for high-stakes, milestone-based assessment. Our pilot study was conducted at a single academic institution with a medium-sized pediatric residency program; thus, our results are not generalizable to other residency programs. The small sample size of resident subjects resulted in an underpowered study. The small number of cases may have threatened measurement validity because of case specificity. We experienced data loss because of technical issues with video capture. Our evaluation tool was based on the Pediatric Milestones and was not a previously validated tool. Our formative case evaluators were all part of the pediatric residency administration, and all had extensive contact with the interns. Therefore, there may have been some unintended bias in scoring the formative cases. The interrater reliability was only moderate in strength. Our 5-point scales lacked the discriminating power of longer scales and were anchored only at the ends and in the middle, which may have caused an aberrant distribution of scores. Raters for the summative cases were inconsistent; some used whole-integer scores, while others did not. We did not assess the comparability of the formative and summative cases in clinical content and did not standardize the testing environment.

Our exploratory work suggests that simulation-based assessments for pediatric interns may play a valuable role in informing placement of pediatric interns on the Pediatric Milestones and furthermore may serve as an early warning system for struggling trainees. However, future research will require extensive rater training, validated measurement instruments, tests of case equivalency, larger sample sizes, and multiple institutions.

Acknowledgments

For their assistance with this study, we would like to acknowledge several people. Drs. Nina Gluchowski, Daniel Hall, and Anne Murray were instrumental as chief residents in helping to run our simulations, participate as confederates, and aid in running the simulation debriefings. Our coordinators, Melissa Lopes, Melissa Nieves, Celeste Radosevich, and Teresa Urena played the role of parents in our simulations. Drs. Kathryn Brigham, Josephine Lok, and Takara Stanley were evaluators for our summative simulation cases. Melissa Lopes also collected and entered the data for all of our evaluations and created Disney-themed code names to protect the identities of our subjects. Lastly, A.F.V. would like to acknowledge and thank the leaders of the Rabkin Fellowship in Medical Education, Dr. Christopher Smith and Lori Newman, for their support and mentorship throughout the implementation of this study, as well as the other Rabkin fellows for their feedback and encouragement.

References

1. Nasca TJ, Philibert I, Brigham T, et al. The next GME accreditation system—rationale and benefits. N Engl J Med 2012;366:1051–1056.
2. The Accreditation Council for Graduate Medical Education and The American Board of Pediatrics. The Pediatrics Milestone Project Website. Available at: Accessed July 28, 2015.
3. The Accreditation Council for Graduate Medical Education. ACGME Program Requirements for Graduate Medical Education in Pediatrics Website. Available at: Accessed March 18, 2015.
4. Schumacher DJ, Spector ND, Calaman S, et al. Putting the pediatrics milestones into practice: a consensus roadmap and resource analysis. Pediatrics 2014;133:898–906.
5. Carraccio C, Wolfsthal WD, Englander R, et al. Shifting paradigms: from Flexner to competencies. Acad Med 2002;77:361–367.
6. Aagaard E, Kane GC, Conforti L, et al. Early feedback on the use of the internal medicine reporting milestones in assessment of resident performance. J Grad Med Educ 2013;5(3):433–438.
7. Norman G, Norcini J, Bordage G. Competency-based education: milestones or millstones? J Grad Med Educ 2014;6(1):1–6.
8. Holmboe E, Rizzolo MA, Sachdeva AK, et al. Simulation-based assessment and the regulation of healthcare professionals. Simul Healthc 2011;6:S58–S62.
9. Levine AI, Schwartz AD, Bryson EO, et al. Role of simulation in U.S. physician licensure and certification. Mt Sinai J Med 2012;79:140–153.
10. Beeson MS, Vozenilek JA. Specialty milestones and the next accreditation system: an opportunity for the simulation community. Simul Healthc 2014;9(3):184–191.
11. Brett-Fleegler MB, Vinci RJ, Weiner DL, et al. A simulator-based tool that assesses pediatric resident resuscitation competency. Pediatrics 2008;121:e597–e603.
12. Donoghue A, Nishisaki A, Sutton R, et al. Reliability and validity of a scoring instrument for clinical performance during pediatric advanced life support simulation scenarios. Resuscitation 2010;81:331–336.
13. Adler MD, Vozenilek JA, Trainor JL, et al. Comparison of checklist and anchored global rating instruments for performance rating of simulated pediatric emergencies. Simul Healthc 2011;6(1):18–24.
14. Andreatta P, Saxton E, Thompson M, et al. Simulation-based mock codes significantly correlate with improved pediatric patient cardiopulmonary arrest survival rates. Pediatr Crit Care Med 2011;12(1):33–38.
15. Calhoun AW, Boone M, Miller KH, et al. A multirater instrument for the assessment of simulated pediatric crises. J Grad Med Educ 2011;3(1):88–94.
16. Grant EC, Grant VJ, Bhanji F, et al. The development and assessment of an evaluation tool for pediatric resident competence in leading simulated pediatric resuscitations. Resuscitation 2012;83:887–893.
17. Cordero L, Hart BJ, Hardin R, et al. Deliberate practice improves pediatric residents’ skills and team behaviors during simulated neonatal resuscitation. Clin Pediatr (Phila) 2013;52(8):747–752.
18. Ross JC, Trainor JL, Eppich WJ, et al. Impact of simulation training on time to initiation of cardiopulmonary resuscitation for first-year pediatrics residents. J Grad Med Educ 2013;5(4):613–619.
19. McBride ME, Waldrop WB, Fehr JJ, et al. Simulation in pediatrics: the reliability and validity of a multiscenario assessment. Pediatrics 2011;128(2):335–343.
20. Donoghue A, Ventre K, Boulet J, et al. Design, implementation, and psychometric analysis of a scoring instrument for simulated pediatric resuscitation: a report from the EXPRESS pediatric investigators. Simul Healthc 2011;6:71–77.
21. Adler MD, Trainor JL, Siddall VJ, et al. Development and evaluation of high-fidelity simulation case scenarios for pediatric resident education. Ambul Pediatr 2007;7(2):182–186.
22. Cheng A, Auerbach M, Hunt E, et al. The international network for simulation-based pediatric innovation, research and education (INSPIRE): collaboration to ensure the impact of simulation-based research [Abstract]. Simul Healthc 2013;8(6):418.

Simulation; Residency education; Pediatric Milestones; Trainee assessment


Copyright © 2016 Society for Simulation in Healthcare