A Simulation-Based Acute Skills Performance Assessment for Anesthesia Training

Murray, David J., MD; Boulet, John R., PhD; Kras, Joseph F., MD; McAllister, John D., MD; Cox, Thomas E., MD

Section Editor(s): Miller, Ronald D.

doi: 10.1213/01.ane.0000169335.88763.9a
Economics, Education, and Health Systems Research: Research Report

In an earlier study, trained raters provided reliable scores for a simulation-based anesthesia acute care skill assessment. In this study, we used this acute care skill evaluation to measure the performance of student nurse anesthetists and resident physician trainees. The performance of these trainees was analyzed to provide data about acute care skill acquisition during training. Group comparisons provided information about the validity of the simulated exercises. A set of six simulation-based acute care exercises was used to evaluate 43 anesthesia trainees (28 residents [12 junior and 16 senior] and 15 student nurse anesthetists). Six raters scored the participants on each exercise using a detailed checklist, key-action items, or a global rating. Trainees with the most education and clinical experience (i.e., senior residents) received higher scores on the simulation scenarios, providing some evidence to support the validity of the multi-scenario assessment. Trainees varied markedly in ability depending on the content of the exercise, although the three groups showed a similar pattern of performance across the six simulated events. Most participants effectively managed ventricular tachycardia, but postoperative events such as anaphylaxis and stroke were more difficult for all trainees to promptly recognize and treat. Training programs could use a simulation-based multiple-encounter evaluation to measure provider skill in acute care.

IMPLICATIONS: A trainee’s skill in managing critical events can be assessed using a multiple scenario simulation-based performance evaluation.

Washington University Clinical Simulation Center, Washington University School of Medicine, St. Louis, Missouri

Accepted for publication April 22, 2005.

Supported, in part, by the Foundation for Anesthesia Education and Research: Education Grant.

Address correspondence and reprint requests to David Murray, MD, Washington University Clinical Simulation Center, Washington University School of Medicine, Box 8054, 660 South Euclid, St. Louis, MO 63110. Address e-mail to

Anesthesia providers are expected to be able to manage critical events after training, yet most trainees rarely encounter these crises in patient care settings. Life-size, high-fidelity, electromechanical mannequins provide a method to instruct as well as evaluate a provider’s skill in acute care management (1–12). Historically, simulation training has been used by anesthesia residency and nurse anesthetist training programs as a method to enhance skill acquisition in managing acute care events (2–8,10–12). The similarities in the content of the training exercises used in these studies suggest that a set of exercises could be developed to assess anesthesia-provider skill (5–8,10–12).

In earlier studies, we developed a scoring method for a set of exercises designed to assess the ability of trainees to accomplish key diagnostic and therapeutic actions during a brief, directed simulation encounter (1,9,10). This test-battery, multiple-encounter approach is similar in structure to the standardized patient assessments often used in medical schools and for testing clinical skills as part of licensure examinations (13–16).

Our goals for this study were (a) to evaluate scenario content by determining trainee performance on individual exercises, (b) to provide further validation of a simulation-based acute care assessment, and (c) to compare the acute care skills of anesthesia trainees.

Methods

After obtaining approval for the protocol from our IRB, we recruited anesthesia providers (participants) to manage a set of six acute care exercises. The scenarios included (a) postoperative anaphylaxis, (b) intraoperative myocardial ischemia, (c) intraoperative atelectasis, (d) intraoperative ventricular tachycardia, (e) postoperative stroke with intracranial hypertension, and (f) postoperative respiratory failure. To successfully manage each event, the participant was required to make a rapid diagnosis and acute intervention. Before entering the simulation laboratory, participants spent 5 min reviewing a patient history, the results of a preoperative physical examination, and an anesthetic record. For the three intraoperative scenarios, the simulated event occurred approximately 30 min after the anesthesia induction. The three postoperative simulated situations occurred in the 15- to 20-min period after admission to the recovery room.

The student nurse anesthetists (n = 15) were recruited from two specialty-training programs. These individuals had completed their clinical education and were in the final days of their respective programs. The residents (n = 28) were recruited from a single residency-training program and were individually evaluated during a 2-mo period close to the end of their respective training year (CA-1, CA-2, or CA-3). The residents were categorized into two groups based on clinical experience. The junior residents (CA-1; n = 12) had completed 2 yr of graduate training. The senior residents (CA-2 and CA-3; n = 16) had completed at least 3 yr of training that included more advanced anesthesia subspecialty experiences in major vascular and transplant anesthesia, surgical and cardiovascular intensive care, obstetric anesthesia, pediatric anesthesia, and pain management.

All of the participants had more than 8 h of simulation laboratory experience at our training center. These experiences were primarily in small group (<10 participants) training sessions. Before beginning the individual session, each participant provided informed written consent for videotaped recording and subsequent analyses of their performances. The simulation exercises for each participant were conducted in a single 75- to 90-min individual training session that was supervised by a nurse or physician educator. The six exercises were presented in identical order to each participant. After every two simulation encounters, the supervising faculty member discussed the case management for the preceding exercises.

This study was conducted in a simulation laboratory that contains a sensorized life-size electromechanical patient mannequin developed by MEDSIM-EAGLE® (MedSim USA Inc., Ft. Lauderdale, FL).

Each participant’s performance was recorded on a four-quadrant video screen that included two separate camera views of the provider and the mannequin. Two ceiling-mounted microphones captured audio during the scenarios. A third quadrant simultaneously displayed the full set of patient vital signs (electrocardiogram [ECG], pulse oximetry, inspired and expired gas monitoring, arterial blood pressure [BP], and central venous pressure). In the lower right quadrant, preceptors typed identifying information such as the date, participant, and scenario number; this quadrant could also be used to add information clarifying participant actions.

The general approach to scoring the scenarios included two analytic methods (checklist and key action) and a single global rating scale. For the analytic scoring, three raters scored each participant’s performance using a detailed checklist of diagnostic and therapeutic actions developed individually for each of the six scenarios (Table 1). An additional three raters used an abbreviated checklist system that consisted of three key actions for each scenario. The key action raters also provided a single global rating of the performance using a visual analog scale. The checklist scoring system included 11–16 possible actions for each scenario (Table 1), each weighted based on its importance with respect to overall patient care. Faculty experts assigned item weights after a review of existing patient care practice standards and subsequent deliberations concerning the potential positive (or negative) impact of performing (or not performing) an action in the time allotted. The raters were asked to indicate whether a specific action described on the checklist had been achieved by the participant. Before scoring the exercises, the raters met to develop consistent end-points for each of the checklist and key actions. The highest cumulative weighted score defined the best possible performance in this scoring system. The maximum possible weighted score on the scenarios ranged from 14 to 22 points. To compare participant performances across scenarios, this score was converted to a percentage of the maximum number of attainable points.
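As a concrete illustration, the weighted-checklist conversion can be sketched as follows; the checklist items and weights shown are hypothetical examples, not the study’s actual instrument:

```python
# Sketch of the weighted-checklist scoring described above: each scenario
# checklist contained 11-16 weighted actions (14-22 total points), and a
# rater's score was converted to a percentage of the attainable points.
# The items and weights below are hypothetical, for illustration only.

def checklist_percentage(weights, performed):
    """weights: dict mapping checklist action -> point weight;
    performed: set of actions the rater observed. Returns 0-100."""
    max_points = sum(weights.values())
    earned = sum(pts for action, pts in weights.items() if action in performed)
    return 100.0 * earned / max_points

weights = {"diagnose anaphylaxis": 3, "administer epinephrine": 3,
           "give fluid bolus": 2, "call for help": 1}
score = checklist_percentage(weights, {"administer epinephrine",
                                       "give fluid bolus"})
# 5 of 9 attainable points, i.e., roughly 56%
```

Expressing every scenario on this common percentage scale is what allows performances to be compared across checklists with different total point values.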

Table 1

The abbreviated or key action scoring system included three performance items for each exercise. The raters recorded whether participants performed each of the key actions. If the key action did not occur within 5 min, it was assigned a value of 0. The key action score for any given scenario could range from 0 to 3.
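This rule reduces to a simple count with a 5-min cutoff; a minimal sketch (the timing representation is an assumption, not specified by the study):

```python
# Key-action scoring: one point per key action completed within the
# 5-min exercise; actions performed late, or never, score 0.
CUTOFF_S = 300  # 5 minutes, in seconds

def key_action_score(completion_times_s):
    """completion_times_s: for each of the three key actions, the time
    (in seconds) at which it was completed, or None if never performed.
    Returns an integer score from 0 to 3."""
    return sum(1 for t in completion_times_s
               if t is not None and t <= CUTOFF_S)

# Two actions within 5 min, one after the cutoff:
key_action_score([120, 250, 340])  # -> 2
```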

The three raters who used the abbreviated checklist (key actions) also provided a single global rating of each participant’s performance after each exercise. The raters were instructed to make a mark on a 10-cm horizontal line to indicate the overall level of performance. The rating system was anchored by the lowest value 0 (unsatisfactory) and the highest value 10 (outstanding). In advance, the raters agreed that a score of 7 or more would be considered a standard expected for a provider assuming independent responsibility for patient care.

Five anesthesiologists and one nurse clinician, divided into groups of three, independently rated the participants’ performances using the three scoring methods (checklist, key action, and global rating). Two faculty members and the nurse clinician scored the performances using the checklist; three faculty anesthesiologists used the key action and global scoring systems. All of the raters independently observed and scored the participants’ performances from the videotaped recordings. The psychometric properties of the scores obtained from the analytic and global methods outlined above are provided in a previous set of studies of graduate physicians and anesthesia residents (1,10,11).

For each of the scoring systems (checklist, key action, and global rating), a two-way analysis of variance (ANOVA) was conducted to test the null hypothesis that there were no differences in performance among the senior residents, junior residents, and nurse anesthetists. For the three analyses, the independent variables were group (senior resident, junior resident, and nurse anesthetist) and case (scenarios 1–6). The dependent variable was the score (i.e., weighted checklist, key action, or global rating). Descriptive statistics (mean and standard deviation) were calculated, by study group, for each of the six scenarios and for each scoring modality. In addition, for the key action scoring, the percentage of each participant group that completed all three key actions in the 5-min period was calculated.

The reproducibility of the scoring was investigated using generalizability analysis(17). A trainee’s score can be influenced not only by their ability, but also by specific rater effects (stringency or bias in scoring), familiarity with scenario content, and scoring method. Generalizability analysis provides a method to evaluate rater and task (scenario) effects, including their interactions, and determine the magnitude of potential sources of variability in participant scores. These variance components can then be used to determine the reliability of participant scores as a function of the number of raters and number of scenarios.
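Once the variance components are estimated, generalizability theory lets the projected reliability (G coefficient) of a mean score be computed for any number of scenarios and raters. A minimal sketch for a person × scenario × rater design follows; the variance-component values are illustrative assumptions, not the study’s estimates:

```python
# Generalizability (G) coefficient for a crossed person x scenario x rater
# design: the reliability of a participant's mean score as a function of
# the number of scenarios (n_s) and raters (n_r). The variance components
# passed in below are illustrative values only, not the study's estimates.

def g_coefficient(var_p, var_ps, var_pr, var_res, n_s, n_r):
    """var_p: true person (trainee) variance; var_ps: person-by-scenario;
    var_pr: person-by-rater; var_res: residual/error variance."""
    rel_error = var_ps / n_s + var_pr / n_r + var_res / (n_s * n_r)
    return var_p / (var_p + rel_error)

# When the person-by-scenario component dominates (as reported here),
# adding scenarios improves reliability far more than adding raters:
six_cases = g_coefficient(1.0, 2.0, 0.1, 0.5, n_s=6, n_r=3)
twelve_cases = g_coefficient(1.0, 2.0, 0.1, 0.5, n_s=12, n_r=3)
more_raters = g_coefficient(1.0, 2.0, 0.1, 0.5, n_s=6, n_r=6)
```

With these illustrative components, doubling the number of scenarios raises the coefficient substantially, whereas doubling the number of raters changes it little, mirroring the finding that score reliability depends chiefly on the number of scenarios.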

Results

Scores from the 16 senior residents and 12 junior residents were compared with those from the 15 nurse anesthetists. These comparisons were done separately for each of the three scoring methods. Scores, by method and study cohort, are presented in Table 2 and Figure 1. In general, the senior residents received higher scores than the junior residents and nurse anesthetists, regardless of scoring method. Some scenarios were much more difficult than others. For example, the cerebral hemorrhage scenario was not managed well; none of the three groups, on average, scored more than 50% on the weighted checklist, received a global rating of more than 6, or completed more than two key actions. In contrast, the ventricular tachycardia scenario was handled fairly well, on average, by all three groups (Table 2, Fig. 1).

Table 2

Figure 1.

A variety of performance patterns were recognized for each of the events. In the initial scenario, only half of the participants (21 of 42) recognized that the condition was anaphylaxis (heart rate [HR], 140 bpm; BP, 75/40; bronchospasm; O2 saturation, 85%) during the initial 3 min. After participants received a verbal prompt from the recovery room nurse indicating that the patient had a rash (3 min), the majority (38 of 42) were able to diagnose anaphylaxis. Only 27 of 42 participants treated the condition with epinephrine.

The myocardial ischemia exercise required trainees to recognize that tachycardia (HR max, 130 bpm) and hypertension (BP max, 180/120) were associated with ST elevation on the ECG (lead II ST-T wave increase, 2.7 mm). The diagnosis of myocardial ischemia was established by 29 of the 42 participants during the exercise. Almost all of the senior residents (15 of 16) were able to identify the presence of myocardial ischemia, but fewer than half of the student nurse anesthetists (7 of 15) and CA-1 residents (5 of 12) recognized it. Despite failing to establish the diagnosis, the majority of trainees initiated therapy to treat the tachycardia and hypertension.

In the atelectasis scenario, the intubated mannequin had an O2 saturation of 88%, decreased lung compliance, and reduced tidal volumes; participants were expected to recognize that a definitive step to improve oxygenation was either to suction the airway or to provide larger tidal volumes. Whereas more than half of the senior residents (11 of 16) performed one of these actions, only 7 of 27 student nurse anesthetists and junior residents accomplished either of these definitive steps.

The stroke scenario required participants to recognize an intracerebral event in a postoperative patient. Eighteen of the 42 participants did not recognize that a cerebral vascular event had occurred in this bradycardic, hypertensive, and unresponsive simulated patient who had a dilated pupil. Trainees who recognized the diagnosis were also more likely to indicate the need for consultation and securing the airway.

In the final scenario, most of the trainees (34 of 42) recognized the need for reintubation after examining a tachypneic postoperative patient in respiratory failure. The mannequin was receiving 100% O2 with a nonrebreathing mask and had a respiratory rate of 28 breaths/min and an O2 saturation of 75%. All 16 senior residents, 10 of 15 student nurse anesthetists, and 9 of 12 CA-1 residents reintubated the mannequin during the 5-min exercise.

In terms of overall performance, fewer than 20% of all participants were able to complete all three key actions for the stroke scenario. In contrast, the ventricular tachycardia scenario was managed effectively; more than 75% of all participants were able to complete the key actions, often in much less than the prescribed 5-min time period. Across all six scenarios, a larger percentage of the senior residents were able to complete all three key actions than either of the other two study groups.

ANOVA was used to test for specific differences in performance among groups and across scenarios. For the analysis based on the weighted checklist, the case-by-group interaction was not significant. This indicates that the relative performance of the individuals in each group did not vary as a function of the case: although there were group differences, the pattern of case difficulty was similar across groups. However, there was a significant main effect attributable to group (F = 11.2; P < 0.01). This result reveals that, averaged over the six cases, there was a significant difference in mean scores among the senior residents, junior residents, and nurse anesthetists. A post hoc analysis (Scheffé test for multiple comparisons) revealed that the senior residents outperformed the nurses (mean difference = 11.2; P < 0.05). Although the junior residents also outperformed the nurses (mean difference = 5.0), this difference was not statistically significant. Finally, the senior residents outperformed the junior residents (mean difference = 6.2), but this effect was not statistically significant. There was also a significant main effect attributable to case (F = 17.5; P < 0.01), indicating that, averaged over all study participants, the scenarios were not of equal difficulty.

The results for the other two scoring systems (global rating and key action) were similar to those for the weighted checklists. For these two analyses, there was no significant group-by-case interaction, indicating that the group differences in performance were consistent across cases. The group main effects were both significant (F[global] = 16.8, P < 0.01; F[key action] = 16.6, P < 0.01), revealing differential mean performance by group. Based on post hoc analysis of the global scores, the senior residents significantly (P < 0.05) outperformed the junior residents (mean difference = 1.0) and the nurse anesthetists (mean difference = 1.7). The differences observed between junior residents and nurse anesthetists (mean difference = 0.7) were not statistically significant. Similar results were found for the key action scores: senior residents significantly outperformed junior residents and nurse anesthetists, but there were no statistically significant differences in scores between junior residents and nurse anesthetists. As in the weighted checklist analysis, there were significant main effects attributable to case (F[global] = 8.8, P < 0.01; F[key action] = 13.6, P < 0.01); averaged over study participants, the cases were not of equal difficulty.

Generalizability analysis was used to evaluate the sources of variance in scores and, in particular, to determine how reliable the raters were in assigning scores and how consistent the participants were in managing the scenarios. In this study, the variances attributable to the raters, and the associated interactions, were relatively small. This indicates that the raters identified comparable scoring end-points for each event and were reasonably consistent in their assignment of scores for each exercise (Table 3). Although participants’ abilities varied depending on the content of the exercise, raters rank-ordered trainee performances in a nearly identical manner for each scenario. These rater variances were similar, and relatively small, whether analyzed across the entire participant group or within groups of participants (student nurse anesthetists, junior residents, and senior residents) (Table 3). This indicates that a trainee’s score is unlikely to vary as a function of the number of raters or the scoring method used to quantify the performance (Table 3). The largest variance component for the checklist, key action, and global scoring methods was related to the content of the exercises (trainee × scenario) (Table 3). Therefore, the reliability of participants’ overall scores depends more on the number of scenarios in the assessment than on the number of raters for a given scenario. Overall, whereas the use of six encounters resulted in moderately reliable scores, additional performance samples would be needed if more precise ability estimates were required.

Table 3

Discussion

Based on our brief six-scenario assessment, the senior residents received higher scores than both the student nurse anesthetists and junior residents on the simulation exercises. For most scenarios, the junior residents and student nurse anesthetists had, on average, comparable performances. These findings provide some evidence to support the discriminant validity of the multi-scenario simulation exercise. The senior residents, having both additional training and increased patient management experience, would be expected to be able to handle the acute care scenarios more effectively and efficiently. Likewise, based on the similar duration of training and anesthesia experiences of the junior residents and student nurse anesthetists, one might not expect meaningful performance differences between these groups of trainees.

The scores obtained from individual scenarios provide a way to evaluate how well trainees perform in various types of encounters and to make some inferences about trainee skill in specific domains of practice. For example, most trainees, regardless of group, successfully managed the ventricular tachycardia scenario. Trainees in all three participant groups were able to recognize the arrhythmia and effectively administer the prescribed treatment in less than 5 min from its onset, and frequently in less than 2 min. Although most of our participants had never encountered a patient in ventricular tachycardia in the operating room environment, it would seem that their previous training in advanced cardiac life support prepared them to manage this condition in an intraoperative environment. The algorithms and arrhythmia-recognition skills acquired in advanced cardiac life support training likely translated to enhanced performance in the simulation laboratory.

The comparable performance of the three groups on the ventricular tachycardia scenario also provides some evidence to support the fairness of the scoring models and, at least for this scenario, the content of the exercise. All three groups effectively managed this exercise, and many participants received the highest possible score; both the nurse and physician groups demonstrated the requisite skills to obtain the maximum score. If scenarios were designed to test in-depth knowledge as well as clinical skill, then group comparisons would be expected to favor the residents, who have more extensive knowledge. Our goal in developing the evaluation was to assess requisite skills in acute care management rather than to measure in-depth knowledge of the pathophysiology of disease processes.

Unlike the ventricular tachycardia scenario, participants did not effectively manage some of the exercises. Two postoperative scenarios, stroke and anaphylaxis, were more difficult for all participants, regardless of previous training. Clinical findings in these two simulations (stroke = increased BP, bradycardia, and unresponsiveness in addition to a dilated left pupil; anaphylaxis = bronchospasm, tachycardia, and hypotension) were not subtle but seemed to be more difficult for participants to identify and subsequently manage. These results indicate that simulation-based assessment might be helpful to identify deficits in skill acquisition during training. If training strategies using simulation were available, then participants could potentially manage these conditions as well as the ventricular tachycardia scenario. The complexity of many conditions and non-uniform treatment algorithms may make training strategies more difficult to develop for these events when compared with more straightforward scenarios such as the ventricular tachycardia exercise. An alternative explanation for the performance deficits might be that these two conditions were simply modeled in a manner that made it more difficult for all providers to recognize and treat. More study of these exercises, and additional scenarios of related content, are required to determine if the results are generalizable to other acute care postoperative conditions or if similar performance deficiencies are found in graduates of other training programs.

There were several limitations to our study that warrant discussion. Trainees managed the scenarios in the same order and received feedback about their performance during the study period. If a simulation-based assessment is to be used as a summative evaluation method, then steps to enhance the security of the exercises and standardize feedback during the evaluation would need to be implemented in future studies. The majority of our raters (five of six) were recruited from the faculty at the training site used by the residents and student nurse anesthetists. Trainee scores may be subject to rater bias or “halo” effects, particularly when raters are aware of the training level of participants. This bias might be manifest as differences between the scores recorded by blinded and unblinded raters. The more variation there is among raters in scoring actions, the greater the potential for this type of bias. Conversely, if variances among raters are minimal, then a small number of raters can be used to establish a reliable score. Fortunately, in this study, as in our previous studies, the variance among raters’ scores was small, indicating that regardless of whether we used a single rater (blinded or unblinded) or the mean ratings of multiple raters, the trainee’s overall assessment score would be similar. Simple, unambiguous scoring systems with defined end-points for performance may be important in decreasing the potential for rater bias.

A simulation-based assessment may be a valuable tool for understanding the relationship between a specialist’s training and clinical experiences in developing and maintaining skill. The senior residents with more training and experience performed better than the junior residents and nurse anesthetists. If experience were a key requirement to developing requisite skills, then additional practice experience beyond clinical training would potentially narrow the differences in performance among groups. If training were an essential requirement to develop requisite skill, then differences between the senior residents and nurse anesthetist group would be expected to persist beyond training.

The skills required to manage these modeled situations are relevant for both nurse anesthetists and anesthesiologists, but the content domain of acute care is certainly more expansive than represented by the six scenarios modeled in this study. Therefore, replicating this study with additional scenarios would be valuable. The resident and nurse participants represent trainees from a small number of training programs. As a result, it is unclear whether the results of our investigation will generalize to trainees in other programs. By increasing the number of performance samples for each participant as well as the number of trainees, a more detailed analysis of the content and nature of a simulation-based assessment could be provided. This information could be used to assess skill acquisition during training and to develop training and assessment strategies using life-sized mannequins.

At present, there are few, if any, methods available to determine whether a professional has the skills required to manage complex, high-acuity events (18–24). A simulation-based assessment strategy could be developed for critical events, but additional studies that explore the content domain and fidelity of the exercises are required. A key goal of future investigations will be to explore the relationship between a provider’s skill in managing simulated patients and associated measures of clinical performance.

References

1.Boulet JR, Murray DJ, Kras J, et al. Reliability and validity of a simulation-based acute care skills assessment for medical students and residents. Anesthesiology 2003;99:1270–80.
2.Chopra V, Gesink BJ, DeJong J, et al. Does training on an anaesthesia simulator lead to improvement in performance? Br J Anaesth 1994;73:293–7.
3.Devitt JH, Kurrek MM, Cohen MM, et al. The validity of performance assessments using simulation. Anesthesiology 2001;95:36–42.
4.Gaba DM, Howard SK, Flanagan B, et al. Assessment of clinical performance during simulated crises using both technical and behavioral ratings. Anesthesiology 1998;89:8–18.
5.Holzman RS, Cooper JB, Gaba DM, et al. Anesthesia crisis resource management: real-life simulation training in operating room crises. J Clin Anesth 1995;7:675–87.
6.Jacobsen J, Lindekaer AL, Ostergaard HT, et al. Management of anaphylactic shock using a full scale anaesthesia simulator. Acta Anaesthesiol Scand 2001;45:315–9.
7.Lindekaer AL, Jacobsen J, Andersen G, et al. Treatment of ventricular fibrillation during anaesthesia in an anaesthesia simulator. Acta Anaesthesiol Scand 1997;41:1280–4.
8.Monti EJ, Wren K, Haas R, Lupien AE. The use of an anesthesia simulator in graduate and undergraduate education. CRNA 1998;9:59–66.
9.Murray DJ, Boulet J, Ziv A, et al. An acute care skills evaluation for graduating medical students: a pilot study using clinical simulation. Med Educ 2002;36:833–41.
10.Murray DJ, Boulet JR, Kras JD, et al. Anesthesia acute care skills: a simulation-based anesthesia skills assessment for residents. Anesthesiology 2004;101:1084–95.
11.O’Donnell J, Fletcher J, Dixon B, Palmer L. Planning and implementing an anesthesia crisis resource management course for student nurse anesthetists. CRNA 1998;9:50–8.
12.Schwid HA, Rooke GA, Carline J, et al. Evaluation of anesthesia residents using mannequin-based simulation: a multi-institutional study. Anesthesiology 2002;97:1434–44.
13.Boulet J, McKinley DW, Norcini J, Whelan GP. Assessing the comparability of standardized patient and physician evaluations of clinical skills. Adv Health Sci Educ Theory Pract 2002;7:85–97.
14.Dillon GF, Boulet JR, Hawkins RE, Swanson DB. Simulations in the United States medical licensing examination™ (USMLE™). Qual Saf Health Care. 2004;13:i41–5.
15.Norcini J, Boulet J. Methodological issues in the use of standardized patients for assessment. Teach Learn Med 2003;15:293–7.
16.Rothman AI, Blackmore D, Dauphinee WD, Reznick R. The use of global ratings in OSCE station scores. Adv Health Sci Educ Theory Pract 1997;1:215–9.
17.Brennan RL. Generalizability theory. New York: Springer-Verlag, 2001:1–538.
18.Gaba DM. What makes a ‘Good’ anesthesiologist. Anesthesiology 2004;101:1061–3.
19.Issenberg SB, McGaghie WS, Hart IR, et al. Simulation technology for health care professional skills assessment. JAMA 1999;282:861–6.
20.Jha AK, Duncan BW, Bates DW. Simulator-based training and patient safety. In: Shojania KG, Duncan BW, McDonald KM, Wachter RM, eds. Making health care safer: a critical analysis of patient safety practices (evidence report/technology assessment no. 43; AHRQ publication 01-E058). Rockville, MD: Agency for Healthcare Research and Quality, 2001:510–7.
21.Accreditation Council for Graduate Medical Education Web Site. ACGME Outcome Project. Available at
22.Epstein RM, Hundert EM. Defining and assessing professional competence. JAMA 2002;287:226–35.
23.Leach DC. Competence is a habit. JAMA 2002;287:243–4.
24.Arens JF. Do practitioner credentials help predict safety in anesthesia practice? APSF Newsletter 1997;12:6–8.
© 2005 International Anesthesia Research Society