Empirical Investigations

Tools for Assessing the Performance of Pediatric Perioperative Teams During Simulated Crises: A Psychometric Analysis of Clinician Raters' Scores

Watkins, Scott C. MD; de Oliveira Filho, Getulio R. MD, PhD; Furse, Cory M. MD, MPH; Muffly, Matthew K. MD; Ramamurthi, R. J. MD; Redding, Amanda T. MD; Maass, Birgit MD; McEvoy, Matthew D. MD

Simulation in Healthcare: The Journal of the Society for Simulation in Healthcare: February 2021 - Volume 16 - Issue 1 - p 20-28
doi: 10.1097/SIH.0000000000000467


There is growing evidence that children receiving surgical care in specialized pediatric centers have better outcomes and fewer complications than those cared for in nonspecialized centers.1 This has led some governing bodies to call for children to receive surgical care in centers staffed with clinicians who have received advanced training and experience in managing pediatric patients.2 Clinical teams caring for children in the perioperative setting must possess a high level of technical and nontechnical skills (NTS), especially when faced with rare, high-acuity events.3 Technical skills (TS) encompass the psychomotor skills and advanced knowledge required to carry out complex tasks in the medical and/or surgical management of patients.4 Nontechnical skills represent the cognitive and interpersonal skills needed to effectively carry out clinical duties and include aptitudes such as decision-making, teamwork, communication, situational awareness, task management, and leadership.5

Deficits in NTS contribute to most intraoperative errors and likely play a contributing role in all intraoperative morbidity and mortality.6–8 Efforts to improve the NTS of clinicians have led to the development of numerous curricula and assessment tools focusing on skill acquisition.9,10 However, our ability to assess the competency of pediatric clinicians and perioperative teams and their management of life-threatening emergencies in children remains limited.10 Tools have been developed to measure teamwork in the perioperative setting, but none have specifically assessed teamwork in the pediatric perioperative setting.11,12 Most of these tools focus on individual performance (eg, anesthesia, nursing, surgery) and assume that team performance is the sum of individual performances.11 The Team Emergency Assessment Measure (TEAM)13 and the Behavior Anchored Rating Scale (BARS)5 are both designed to measure the performance of teams managing emergencies, and each scale treats the clinical team as the object of measurement. The TEAM tool has proven reliable across a range of clinical settings.14–16 The BARS tool has primarily been used to assess individuals and teams of anesthesia providers but contains language that is broadly applicable to perioperative teams.5 The TEAM and BARS tools have not been used to assess interprofessional, pediatric perioperative teams.

To ensure that children are receiving safe surgical care, we need to regularly assess the training and competency of pediatric perioperative teams and design continuing education according to the identified gaps. To address the lack of studied assessment tools in the literature, the current study was undertaken with a threefold purpose: (1) determine whether clinicians trained on the use of different tools for assessing technical and NTS could consistently produce reliable scores, (2) determine the agreement between scenario-specific performance checklist scores and scores from global rating scales, and (3) determine the agreement between scores assigned by raters and the respective reference scores.

Methods


The human institutional review board at Vanderbilt University Medical Center granted this study exempt status. The current study represents an analysis of trained raters' scores using different instruments for assessing the technical performance of teams. After being trained on the use of the scoring rubrics, raters scored recordings of pediatric perioperative teams managing simulated emergencies. The pediatric perioperative teams consisted of an anesthesia provider (certified nurse anesthetist or resident in anesthesiology), an operating room nurse, a postanesthesia care unit nurse, an allied health provider (certified surgical technologist or anesthesia technician), and, for some scenarios, a resident in general surgery. In the recorded simulations, perioperative teams were tasked with managing one of 4 clinical scenarios: (1) hyperkalemia that progresses to ventricular fibrillation, (2) supraventricular tachycardia (SVT) that progresses to pulseless ventricular tachycardia, (3) anaphylaxis that progresses to pulseless electrical activity, or (4) local anesthetic toxicity that progresses to asystole (see file, Supplemental Digital Content 1, which describes the scenarios in detail).

Technical Skills Rating Instruments

The TS assessment tools consisted of a scenario-specific checklist and a global rating scale (GRS). Figure 1 provides an example of the scenario-specific performance checklists, which consisted of a binary checklist of correct actions and common errors for each clinical scenario tested (see figures, Supplemental Digital Content 2, for the complete Scenario-specific Performance Checklists). The development and preliminary evaluation of the TS checklist were described in detail in an earlier publication.17 Briefly, the assessment tools were iteratively developed using previously described methodology including a modified Delphi process.18,19 Using different raters and scenarios, the TS assessment tool provided highly reliable scores as evidenced by high intrarater and interrater reliability. The GRS for technical performance consisted of a global technical score (GRS 1) assigned on a 9-point scale corresponding to poor (1–3), medium (4–6), or excellent (7–9) team performance (Fig. 2). The GRS was based on an existing BARS that instructs users to first assign a category score (eg, poor, medium, excellent) and then determine where within that category the performance falls.5,10

Figure 1. Example of the scenario-specific performance checklists: hyperkalemia.
Figure 2. Global rating instrument.

Nontechnical Skills Rating Instruments

Two previously published NTS instruments were chosen for reliability analysis: the TEAM13 and a BARS adapted for assessing perioperative teams instead of individual clinicians.5,20 The BARS tool (Fig. 3) includes 4 categories of NTS performance and a global score to be rated on a 9-point scale corresponding to poor, medium, or excellent team performance. The 4 categories of NTS performance are vigilance/awareness, dynamic decision-making and task management, communication, and teamwork. The BARS tool includes a rating matrix with examples of behaviors to be considered in determining each rating. The TEAM tool (Fig. 4) is composed of 11 items scored on a 0 to 4 (never to always) scale and an overall performance item scored on a 1 to 10 scale with 1 = poor and 10 = excellent. The TEAM tool covers 3 categories of NTS performance: leadership (2 items), teamwork and situation awareness (7 items), and task management (2 items).14 Similar to the BARS tool, the TEAM tool includes a rating matrix with example behavior anchors for low and high scores.

Figure 3. Behavior Anchored Rating Scale.
Figure 4. Team Emergency Assessment Measure.


We used the SimBaby (Laerdal Medical Corp, Stavanger, Norway) software and mannequin to program and perform the scenarios. All scenarios were conducted and recorded in situ in an operating room at our institution. We recorded all videos using the B-Line system (SimCapture; B-Line Medical, LLC, Washington, DC).

Rater Training and Video Scoring

Raters were recruited by soliciting individuals from the private listserv of the simulation interest group of the Society for Pediatric Anesthesia. All raters were experienced pediatric anesthesiologists with a minimum of 5 years of experience in the clinical practice of pediatric anesthesiology and simulation-based education and teaching. Five raters volunteered to participate in the study. All 5 raters worked at institutions other than the one where the simulated recordings were created; thus, the raters were blinded to the identity of the clinicians in the scenarios. The methodology for training the raters was based on previously described work and is briefly summarized hereinafter.21 All rater-training sessions were conducted using GoToMeeting (LogMeIn, Inc, Boston, MA) video conferencing because of the geographic separation of raters and the study institution. Detailed rating guides (see files, Supplemental Digital Content 3 and 4, which provide the complete rating guides) were created for the raters to use when training and when scoring videos. The guide included a description of each instrument with specific instructions for scoring, the scoring matrix for the BARS and the TEAM tool, and the original articles describing the tools being evaluated.13,20 The guide also included an objective definition for each item on the scenario-specific performance checklists. Twelve videos from an earlier study were selected for training the raters.17 These training videos included clinical scenarios and team compositions similar to those of the study videos but were not included in the final analysis. Raters met with one of the study authors (S.C.W.) via videoconference on 3 occasions to score and review training videos, with each session lasting approximately 2 hours. Raters received a set of 4 videos for scoring in advance of each of the first 2 videoconferences.
The raters were instructed to observe the entire scenario before assessing the global performance of the recorded team when using the GRS. Raters were told to give equal weight to behaviors at all periods of the scenario and avoid being biased by early or late behaviors.5,10 The raters' scores were reviewed by one of the study authors (S.C.W.), and items of disagreement were discussed during the videoconference. The third set of 4 videos was reviewed and scored item by item as a group during the third videoconference to clarify any remaining questions regarding the scoring instrument and to reach group consensus on the meaning of each assessment item.

Rating of Scenarios to Determine Reliability

Recordings of 16 unique pediatric perioperative teams were randomly selected from 140 recorded simulation events. To have an equal number of videos of each scenario type (eg, anaphylaxis, SVT), 4 videos of each scenario were randomly selected from a pool of 35 recordings (eg, 4 videos from 35 SVT recordings). Five raters scored the videos on 2 separate occasions to assess both interrater and intrarater reliability. Four raters watched and scored half the scenarios twice using the assessment tools. For these raters, scenarios were divided into 2 groups of 8 (scenarios A–H and I–P). The fifth rater scored all 16 scenarios twice with the scoring instruments. This scheme allowed for each scenario to be scored 6 times (3 raters × 2 scores each) while limiting the number of video ratings required of each of the first 4 raters to 16 instead of 32 (Fig. 5). This allowed for the determination of interrater and intrarater agreement on global scores (the average of item scores). In addition, the raters' summative scores were compared with the reference standard score for each video. Videos were presented to each rater using SimCapture (B-Line Medical). This platform ensured that raters scored each video in the order assigned and permitted access to only one video at a time. Raters' scores were recorded on a secure, web-based data collection form (REDCap, Nashville, TN).
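
The assignment scheme can be checked with a short sketch. The rater and scenario labels below are illustrative names chosen to mirror the description (raters 1 and 2 on scenarios A–H, raters 3 and 4 on scenarios I–P, rater 5 on all 16); the code is not part of the study's workflow and only counts how many ratings the design produces.

```python
from collections import Counter

# Sixteen scenarios labeled A..P, split into two blocks of eight.
scenarios = [chr(ord("A") + i) for i in range(16)]
first_block, second_block = scenarios[:8], scenarios[8:]

# Illustrative labels for the five raters and the videos each one scores.
assignments = {
    "rater_1": first_block,
    "rater_2": first_block,
    "rater_3": second_block,
    "rater_4": second_block,
    "rater_5": scenarios,
}
OCCASIONS = 2  # every assigned video is scored twice


def ratings_per_scenario(assignments, occasions):
    """Count how many ratings each scenario receives under the design."""
    counts = Counter()
    for videos in assignments.values():
        for video in videos:
            counts[video] += occasions
    return counts


counts = ratings_per_scenario(assignments, OCCASIONS)
workload = {rater: len(videos) * OCCASIONS for rater, videos in assignments.items()}
total_ratings = sum(counts.values())

print(set(counts.values()))  # ratings received by each scenario
print(workload)              # ratings produced by each rater
print(total_ratings)         # ratings across the whole design
```

Under these assumptions every scenario receives the same number of ratings, while the workload differs between the rater who scores all scenarios and the 4 raters who score one block.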

Figure 5. Rater and scenario assignment scheme.

Creation of Reference Standard

Two of the study authors (S.C.W., M.D.M.) created a reference standard score for each recording. The 2 authors are board-certified anesthesiologists each with greater than 5 years of experience in simulation development, education, training, and research. The authors scored each study video independently and then compared scores. When the authors disagreed on the rating of individual items, the recording was reviewed and discussed until an agreement could be reached. The reference score for each study video represents the consensus of the 2 expert raters.

Statistical Methods

Rater Agreement for the TS Tools

Intrarater reliability of raters' scores from the TS assessment tools was assessed with 2-way random effects model intraclass correlation coefficients (ICCs) computed from the raters' average scores on each of the 2 rating occasions.

Interrater agreement for scores from the GRS (measured on a 9-point scale) and for the overall scores from the scenario-specific performance checklists was assessed by absolute-agreement, 2-way random effects model ICCs. Scores used for these analyses were the average of scores assigned on 2 separate rating occasions. The 2-way random effects model was chosen because each video was rated by the same set of 3 independent raters, who were presumably randomly drawn from the population of similar raters. The absolute-agreement method was chosen because systematic differences among raters were considered relevant. The reliability of average ratings assigned by the 3 independent raters was estimated.22
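
As an illustration of the absolute-agreement, 2-way random effects ICC, the sketch below computes the single-rater and average-rating forms from the usual analysis of variance mean squares (rows = videos, columns = raters). The function name and the example scores are hypothetical; the study itself used statistical software, and this is not a reproduction of that analysis.

```python
def icc_absolute_agreement(scores):
    """Two-way random effects, absolute-agreement ICC.

    scores: list of rows, one per rated video, each row holding one score
    per rater (same raters in every row). Returns (single, average), the
    reliability of one rater's score and of the mean of the k raters.
    """
    n = len(scores)     # number of videos (targets)
    k = len(scores[0])  # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-video mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square

    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average


# Hypothetical scores: the second rater is always 2 points more lenient.
offset_scores = [[1, 3], [2, 4], [3, 5]]
single, average = icc_absolute_agreement(offset_scores)
print(round(single, 3), round(average, 3))
```

In the hypothetical data the two raters rank the videos identically, yet the constant 2-point offset pulls the absolute-agreement ICC well below 1. This is the behavior wanted when systematic differences among raters are considered relevant.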

Correlation Between Measures of Technical Performance

Correlation between the measures of technical performance (the global technical score [GRS 1] and the scenario-specific checklist scores) was quantified by estimating Spearman ρ coefficients.
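
For readers unfamiliar with the statistic, Spearman ρ is the Pearson correlation of the rank-transformed scores, with tied values given their average rank. The following minimal sketch uses only the standard library; the function names and the paired scores are hypothetical and are not study data.

```python
def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical paired scores: a 9-point global rating and a checklist percentage.
global_scores = [3, 5, 5, 7, 8, 4]
checklist_scores = [40, 62, 55, 70, 90, 58]
print(round(spearman_rho(global_scores, checklist_scores), 3))
```

Because the statistic depends only on ranks, it suits an ordinal 9-point scale paired with a checklist total measured on a different scale.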

Rater Agreement for the NTS Tools

Interrater agreement was assessed by ICCs considering the ordinal nature of BARS and TEAM item rating scales. Intrarater agreement was tested by ICCs between measurement occasions. Raters' agreement with reference scores was tested by ICCs between raters' and reference scores. The ICC interpretation was based on criteria suggested by Cicchetti: poor, <0.40; fair, 0.40 to 0.59; good, 0.60 to 0.74; and excellent, 0.75 to 1.00.23
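
The interpretation bands can be written as a small lookup. The function name is hypothetical, and treating the bands as contiguous cutoffs (so that a value such as 0.595 counts as fair) is an assumption made for illustration. The example values are ICCs reported in this article.

```python
def cicchetti_label(icc):
    """Map an ICC to the Cicchetti category used in this article.

    Assumption: the published bands are treated as contiguous, so the
    cutoffs are 0.40, 0.60, and 0.75.
    """
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


# ICC values reported in the Results, labeled with the bands above.
for value in (0.17, 0.49, 0.57, 0.73, 0.74, 0.82, 0.91, 0.95):
    print(value, cicchetti_label(value))
```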

Statistical Software

Analyses were performed using Stata 14 (StataCorp, College Station, TX). Intraclass correlation coefficients for the nontechnical tools were estimated with the VARCOMP procedure using analysis of variance type I sum of squares (SPSS for Windows, Version 12.0; SPSS Inc, Chicago, IL).

Results


The scheme used for assigning scenarios to the raters resulted in each scenario being scored by 3 raters using each rating scale twice (2 occasions). Rater 5 scored all scenarios, whereas raters 1 and 2 scored 8 scenarios (A–H) and raters 3 and 4 scored 8 scenarios (I–P). This resulted in 96 scenario ratings (16 scenarios × 3 raters × 2 occasions) using the TS assessment tools. For purposes of our analysis, the 5 raters were divided into 2 groups of 3 raters with 1 rater (rater 5) overlapping both groups. Rater group 1 (RG1) was composed of raters 1, 2, and 5, and RG2 was composed of raters 3, 4, and 5.

Technical Skills Assessment Tools

There was good intrarater reliability between the 2 rating occasions for the GRS (measured on 1–9 points): ICC = 0.73. The overall ICC between rating occasions for the scenario-specific performance checklists was 0.95, indicating excellent agreement.

Fair interrater agreement was found for average scores from the GRS (ICC = 0.57), and excellent interrater agreement was found for average scores from the scenario-specific checklists (ICC = 0.91).

Excellent agreement was found between scores assigned by raters and the respective reference scores for the scenario-specific checklists (ICC = 0.91), and good agreement was found between scores assigned by raters and the respective reference scores for the GRS (ICC = 0.74).

A significant correlation was found between scenario-specific performance checklist scores and scores from the GRS (r = 0.46; 95% confidence interval [CI] = 0.29–0.62; P < 0.001).

Nontechnical Skills Assessment Tools

The raters' mean score ± standard error was 6.55 ± 0.39 (95% CI = 5.59–7.11) points for the BARS tool (9-point scale) and 4.58 ± 0.22 (95% CI = 4.15–5.0) for the TEAM tool (5-point scale).

The interrater reliability of BARS scores as measured by ICCs was 0.49 (95% CI = 0.2–0.7) for RG1 and 0.82 (95% CI = 0.74–0.89) for RG2. The ICC represents fair agreement for RG1 and excellent agreement for RG2.23

The interrater reliability of TEAM scores as measured by ICCs was 0.79 (95% CI = 0.73–0.83) for RG1 and 0.89 (95% CI = 0.86–0.91) for RG2. The ICC represents excellent agreement for both rater groups.

The overall intrarater reliability of scores from the BARS scale and the TEAM tool as measured by ICCs was 0.86 (95% CI = 0.81–0.89) and 0.89 (95% CI = 0.86–0.91), respectively, representing excellent agreement.

The agreement between individual raters and the reference standard as measured by ICC for the BARS scores ranged from 0.17 to 0.89 (poor to excellent), whereas the ICC ranged from 0.68 to 0.82 (good to excellent) for the TEAM scale.

Discussion


The present study evaluated 4 types of assessment tools for evaluating the technical (a scenario-specific performance checklist and a GRS) and nontechnical (BARS and TEAM) skills of perioperative teams. We report the following key findings. First, there was greater agreement between and within raters' scores when they used the TEAM assessment tool compared with their scores from the BARS tool. Second, there was good agreement between and within raters when they used the TS assessment tools, although the rater scores from the scenario-specific performance checklists were more reliable than scores from the global ratings. Third, raters' scores from all assessment tools had substantial agreement with the reference scores with the exception of raters' scores from the BARS tool. Fourth, the scores from the TS GRS correlated well with the scenario-specific performance checklist scores. Finally, the GRS used in this study was able to measure the performance of teams across a variety of scenarios, suggesting that it may be generalizable for assessing teams in other clinical scenarios.

Raters' scores using the TEAM tool demonstrated greater correlation with the reference standard and less variability between raters than scores from the BARS tool. This is particularly interesting given that both tools seek to measure similar constructs. This may reflect a training differential between the 2 different styles of assessment tools, as there were different degrees of variability between raters' scores obtained from the NTS tools.24 The items that compose the TEAM tool contain specific questions to guide users in assigning scores, whereas the BARS tool contains broad categories of behaviors for users to consider when assigning scores. In this regard, the BARS tool may be used more like a GRS, whereas the TEAM tool has features more like those of a checklist. These features may account for differences in reliability and requirements for rater calibration between the TEAM and BARS tools. The TEAM tool differs from other tools designed to assess perioperative teams by focusing on the team performance as a whole and not as a sum of individual team members' performances. In a dynamic, multidisciplinary clinical environment, such as the perioperative setting, the cumulative performance of clinical teams likely impacts patient outcomes more than individual performances. The current study adds to the growing number of clinical and simulation-based studies that have used the TEAM tool to assess emergency teams with reliable results.

The greater variability associated with raters' scores from the GRS when compared with the scenario-specific performance checklist may reflect the more objective nature of a checklist-style scoring system and/or a training differential required for the calibration of different styles of assessment tools. There is a common belief that scenario-specific checklists are more objective than global rating tools and are therefore easier for raters to use and provide more reliable scores.25 This belief has been challenged over the years.24,26 Although the items of a checklist may seem objective, the selection of items for inclusion in a checklist and the definition for scoring those items are often based on a subjective process, as was the case for the checklist used in this study. The use of a GRS requires that its raters possess a detailed knowledge of the performance being evaluated, which may be advantageous for differentiating between novices and experts. In addition, evaluating a performance with a GRS requires a rater to make decisions and use judgment that is inherently subjective yet remain objective.27 This inherent dependence on rater knowledge and judgment is both an advantage and a disadvantage of a GRS when compared with a performance-specific checklist. These facts emphasize the importance of careful rater selection and training when using a GRS.

Most checklists, including the ones evaluated in this study, are designed to assess a specific skill and are very useful for evaluating the step-by-step performance of such skills.28 However, checklists may not account for subtle variations in performance that differentiate an expert from a novice, such as when an expert skips less important steps in favor of more critical ones, performs tasks in a different order based on clinical judgment, or performs tasks simultaneously. Nor do they account for poor timing of task performance, such as when a novice performs a critical task late, or for all potential errors that a novice might make when performing an unfamiliar task.28,29 This tendency to reward completion of specific tasks over general competency in managing a scenario is a major disadvantage of performance assessment using a checklist-style assessment tool.30 Another limitation of scenario-specific performance checklists is that performance in a given clinical scenario may not reflect performance on other scenarios or tasks.26 Thus, it is difficult to make generalized assessments of team performance using scenario-specific technical checklists, as a team that performs well in one scenario may not perform well in other scenarios and vice versa. Global rating scales have the theoretical advantage of being more generalizable than scenario-specific performance checklists, because they permit assessment of teams across multiple clinical scenarios. Although raters' scores from the GRS were more varied, there was significant correlation among our raters' GRS scores and between the GRS scores and the raters' scenario-specific performance checklist scores. This suggests that the checklist-style assessment tool and the global measures were both assessing similar performance constructs. More importantly, it suggests that the GRS used in this study was able to measure the performance of teams across a variety of scenarios. A combination of the GRS and the scenario-specific checklist would likely provide the most robust assessment of a team's performance.

The ability of a GRS to better differentiate between novice and expert performers may be a reflection of the raters using the GRS, specifically their judgment and ability to make more nuanced evaluations.24,31 When using a GRS, raters are typically instructed to evaluate the complete performance before assigning a score to avoid being swayed by early or late acts. This allows raters to focus on the overall performance and competency of a team without being biased by thoroughness of task completion or individual acts, and it permits the raters to take errors into consideration.24,31,32 This is how raters in our study were instructed. Scores from the raters correlated well with the reference standard scores for all assessment tools, suggesting that the raters were appropriately trained and calibrated to use the assessment tools. Rater training and calibration are an important, and often overlooked, factor in determining the reliability of scores for an assessment tool.33–36 The methodology for training raters in the current study was modeled after “best practices” previously described.21,33 Although our methodology was similar to that previously described, we completed the rater training in considerably less time than previous studies and trained raters in multiple rating instruments simultaneously. The length of rater training sessions was deliberately kept short to accommodate the busy schedules and clinical demands of the raters. In addition, we were interested in the feasibility of using the studied assessment tools in clinical practice. Feasibility refers to an assessment tool's ease of application in clinical practice and is a product of the logistics, cost, and training required to properly use the tool.28,29 For a tool to be relevant and used clinically, ie, outside of a research setting, the barrier to its use must be kept low. Any tool that requires an extensive training intervention to use reliably is impractical and unlikely to be used for routine educational or clinical assessments.

Limitations


The present study has several limitations that must be considered when interpreting our findings. First, each scenario scored by our raters represented the performance of a unique perioperative team managing one clinical scenario. Therefore, we are unable to determine what impact the nature of the scenario (eg, hyperkalemia versus anaphylaxis) might have had on a team's performance. It is likely that teams' performance varied between different scenarios. In addition, the scenarios may have had different discriminatory abilities, with some scenarios better suited to discriminate between high and low performers. To truly assess the performance of a team, we would need to assess the team's performance across multiple scenarios, and additional studies would be needed to determine the specific number of scenarios needed for reliable assessment.37 Furthermore, the validity of scores from an assessment instrument is unique to the learner group and environment in which it is used; thus, the results of the current study may not generalize to other perioperative teams or other rater groups. Second, intrarater and interrater reliability are but one component of validity. Messick's unified theory of validity requires multiple sources of evidence to support or refute scores obtained from a tool.38,39 Third, we did not formally analyze rater reliability between the 3 rater training sessions as recommended by the authors of the rater training protocol we followed.21 It is possible that rater reliability would have been improved had we performed a formal analysis and used the information to guide rater training. Finally, it is possible that scoring of scenarios using one type of tool (eg, TS checklist) influenced the raters' scoring with the other type of tool (eg, the nontechnical scales).

Future Directions

Ongoing, high-stakes simulation-based performance assessment and training are recognized as a core component for improving operational safety in many industries outside of healthcare.40,41 Simulation training and assessment are now required during anesthesiology training and the initial board certification process, and they are strongly encouraged in the maintenance of certification in anesthesiology process.42,43 However, it is clear that additional research is needed to determine the impact of simulation-based training and assessment on improving the performance of practicing clinicians. Furthermore, the role of simulation-based assessment of TS in shaping ongoing professional development, practice improvement, and maintenance of certification/licensure remains largely unexplored. For these lines of research to be undertaken and to have practical application, the development and use of reliable assessment tools, such as those presented in this study, are critical.

Conclusions


Despite the publication of numerous assessment tools and decades of interest in the topic, the assessment of healthcare teams remains a challenge because of the inherent subjectivity of what constitutes good versus poor team performance.4,9 The present study provides evidence to support the use of 2 tools for assessing the technical and nontechnical performance of perioperative teams. We were able to train a group of clinicians to reliably use the tools in a reasonable period with good correlation between the trained clinicians and reference scores. The global rating tool provided reliable scores across a range of scenarios, suggesting that it might be used to assess teams in other clinical scenarios.

References


1. Houck CS, Deshpande JK, Flick RP. The American College of Surgeons Children's surgery verification and quality improvement program: implications for anesthesiologists. Curr Opin Anaesthesiol 2017;30:376–382.
2. Oldham KT. Optimal resources for children's surgical care. J Pediatr Surg 2014;49:667–677.
3. Kohn LT, Corrigan JM, Donaldson MS. To Err Is Human: Building a Safer Health System. Washington, DC: National Academies Press; 2000.
4. Sevdalis N, Hull L, Birnbach D. Improving patient safety in the operating theatre and perioperative care: obstacles, interventions, and priorities for accelerating progress. Br J Anaesth 2012;109:i3–i16.
5. Watkins SC, Roberts DA, Boulet JR, McEvoy MD, Weinger MB. Evaluation of a simpler tool to assess nontechnical skills during simulated critical events. Simul Healthc 2017;12:69–75.
6. Anderson O, Davis R, Hanna GB, Vincent CA. Surgical adverse events: a systematic review. Am J Surg 2013;206:253–262.
7. Mazzocco K, Petitti DB, Fong KT, et al. Surgical team behaviors and patient outcomes. Am J Surg 2009;197:678–685.
8. Catchpole K, Mishra A, Handa A, McCulloch P. Teamwork and error in the operating room: analysis of skills and roles. Ann Surg 2008;247:699–706.
9. Jepsen RM, Østergaard D, Dieckmann P. Development of instruments for assessment of individuals' and teams' non-technical skills in healthcare: a critical review. Cogn Technol Work 2015;17:63–77.
10. Weinger MB, Banerjee A, Burden AR, et al. Simulation-based assessment of the management of critical events by board-certified anesthesiologists. Anesthesiology 2017;127:475–489.
11. Etherington N, Larrigan S, Liu H, et al. Measuring the teamwork performance of operating room teams: a systematic review of assessment tools and their measurement properties. J Interprof Care 2019;1–9.
12. Li N, Marshall D, Sykes M, McCulloch P, Shalhoub J, Maruthappu M. Systematic review of methods for quantifying teamwork in the operating theatre. BJS Open 2018;2:42–51.
13. Cooper S, Cant R, Porter J, et al. Rating medical emergency teamwork performance: development of the Team Emergency Assessment Measure (TEAM). Resuscitation 2010;81:446–452.
14. Cooper S, Cant R, Connell C, et al. Measuring teamwork performance: validity testing of the Team Emergency Assessment Measure (TEAM) with clinical resuscitation teams. Resuscitation 2016;101:97–101.
15. Couto TB, Kerrey BT, Taylor RG, FitzGerald M, Geis GL. Teamwork skills in actual, in situ, and in-center pediatric emergencies: performance levels across settings and perceptions of comparative educational impact. Simul Healthc 2015;10:76–84.
16. Freytag J, Stroben F, Hautz WE, Schauber SK, Kämmer JE. Rating the quality of teamwork-a comparison of novice and expert ratings using the Team Emergency Assessment Measure (TEAM) in simulated emergencies. Scand J Trauma Resusc Emerg Med 2019;27:12.
17. Watkins SC, Nietert PJ, Hughes E, Stickles ET, Wester TE, McEvoy MD. Assessment tools for use during anesthesia-centric pediatric advanced life support training and evaluation. Am J Med Sci 2017;353:516–522.
18. Morgan PJ, Lam-McCulloch J, Herold-McIlroy J, Tarshis J. Simulation performance checklist generation using the Delphi technique. Can J Anaesth 2007;54:992–997.
19. McEvoy MD, Smalley JC, Nietert PJ, et al. Validation of a detailed scoring checklist for use during advanced cardiac life support certification. Simul Healthc 2012;7:222–235.
20. Weinger MB, Burden AR, Steadman RH, Gaba DM. This is not a test!: misconceptions surrounding the maintenance of certification in anesthesiology simulation course. Anesthesiology 2014;121:655–659.
21. Eppich W, Nannicelli AP, Seivert NP, et al. A rater training protocol to assess team performance. J Contin Educ Health Prof 2015;35:83–90.
22. Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med 2016;15:155–163.
23. Cicchetti DV. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychol Assess 1994;6:284–290.
24. Ilgen JS, Ma IW, Hatala R, Cook DA. A systematic review of validity evidence for checklists versus global rating scales in simulation-based assessment. Med Educ 2015;49:161–173.
25. Cohen D, Colliver JA, Robbs RS, Swartz MH. A large-scale study of the reliabilities of checklist scores and ratings of interpersonal and communication skills evaluated on a standardized-patient examination. Adv Health Sci Educ Theory Pract 1996;1:209–213.
26. Van der Vleuten CP, Norman GR, De Graaff E. Pitfalls in the pursuit of objectivity: issues of reliability. Med Educ 1991;25:110–118.
27. Govaerts MJ, Van der Vleuten CP, Schuwirth LW, Muijtjens AM. Broadening perspectives on clinical performance assessment: rethinking the nature of in-training assessment. Adv Health Sci Educ Theory Pract 2007;12:239–260.
28. Chuan A, Wan A, Royse C, Forrest K. Competency-based assessment tools for regional anaesthesia: a narrative review. Br J Anaesth 2017.
29. Bould M, Crabtree N, Naik V. Assessment of procedural skills in anaesthesia. Br J Anaesth 2009;103:472–483.
30. Cunnington JP, Neville AJ, Norman GR. The risks of thoroughness: reliability and validity of global ratings and checklists in an OSCE. Adv Health Sci Educ Theory Pract 1996;1:227–233.
31. Hodges B, McNaughton N, Tiberius R. OSCE checklists do not capture increasing levels of expertise. Acad Med 1999;74:1129–1134.
32. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159–174.
33. Feldman M, Lazzara EH, Vanderbilt AA, DiazGranados D. Rater training to support high-stakes simulation-based assessments. J Contin Educ Health Prof 2012;32:279–286.
34. Woehr DJ, Huffcutt AI. Rater training for performance appraisal: a quantitative review. J Occup Organ Psychol 1994;67:189–205.
35. Borman WC. Format and training effects on rating accuracy and rater errors. J Appl Psychol 1979;64:410–421.
36. Castorr AH, Thompson KO, Ryan JW, Phillips CY, Prescott PA, Soeken KL. The process of rater training for observational instruments: implications for interrater reliability. Res Nurs Health 1990;13:311–318.
37. Van Der Vleuten CP, Schuwirth LW. Assessing professional competence: from methods to programmes. Med Educ 2005;39:309–317.
38. Cook DA, Hatala R. Validation of educational assessments: a primer for simulation and beyond. Adv Simul 2016;1:31.
39. Messick S. Standards of validity and the validity of standards in performance assessment. Educ Meas Issues Pract 1995;14:5–8.
40. Wilson KA, Burke CS, Priest HA, Salas E. Promoting health care safety through training high reliability teams. Qual Saf Health Care 2005;14:303–309.
41. Chassin MR, Loeb JM. High-reliability health care: getting there from here. Milbank Q 2013;91:459–490.
42. Available at: 07--01.pdf. Accessed February 8, 2019.
43. Available at: Accessed February 8, 2019.

Keywords: Checklists; clinical competence; psychometrics; knowledge; reproducibility of results; task performance and analysis

Copyright © 2021 Society for Simulation in Healthcare