For millennia, philosophers and scientists have developed systems to organize knowledge. For example, Linnaeus, the Swedish botanist, zoologist, and physician, published Systema Naturae in 1735. The Linnaean taxonomy began with 3 kingdoms, which were subdivided into classes, orders, genera, and species. Today, taxonomies are increasingly important because they offer a means to index content in databases with standardized terms, which in turn enhances information retrieval and analytics. For our purposes, we are most interested in analyzing deviations from best practices in the delivery of high-fidelity simulations (HFSs). To do that well, we need a standardized vocabulary and a means of adding descriptive metadata to simulation records to assess the number and type of deviations. We therefore developed a taxonomy, which is a controlled vocabulary with preferred terms (words or phrases) for concepts in a domain of interest. In general, taxonomies have a hierarchical structure, such that broader classes include narrower subclasses.1 The hierarchical structure represents the domain of a theoretical construct, in this case, simulation deviation, as perceived by potential users.2
Our group performed a multicenter simulation trial to explore and identify challenges in using HFS to assess practicing physicians' performance. To facilitate reproducible delivery, we strove to standardize the conditions under which the simulated clinical encounters were conducted. Detailed scripts were created that described the appearance and content of the simulated clinical environment, specified the simulated patients' physiologic condition and their responses to interventions, outlined the roles for simulation confederates (ie, their scripted actions, verbal cues, and responses to anticipated questions), and established rules to guide responses to participants' actions.
Despite our efforts, the pilot trials revealed deviations in encounter delivery and occasional problems recording participants' performance. Thus, to improve standardization of scenario delivery during the study, and to identify deviations that could affect raters' ability to reliably assess participant performance, we recognized the need to classify the deviations from intended encounter delivery and documentation. Because no existing taxonomy could be found for this domain, we created one to give ourselves and other simulation researchers and educators a tool to identify and categorize aberrations during the recording and delivery of simulation encounters.
In this paper, we briefly describe the creation of the scenarios (the scripted elements of a simulated clinical experience); the creation of the taxonomy of deviations arising while delivering the simulated encounter (performing the simulated scenario according to its script and rules); the testing of whether the taxonomy adequately covers the domain of interest; and the assessment of inter-rater reliability when domain experts applied the taxonomy to categorize deviations during simulated clinical encounters.
Because the taxonomy was created based on deviations during the delivery of simulated clinical encounters during our multicenter trial, a brief review of the scenarios' content and study conditions is needed.
Creating the Scenarios and the Rules of Scenario Delivery
The study occurred during Maintenance of Certification in Anesthesiology (MOCA) simulation courses;3,4 thus, that course's guidelines influenced scenario design. Maintenance of Certification in Anesthesiology simulation courses must: (1) be at least 6 hours long; (2) simulate hypoxemia and/or hemodynamic instability crises; (3) have each course participant be primarily responsible for patient care (called the "Hot Seat" [HS]) in at least 1 encounter; and (4) have participants work in teams, each comprising the HS, another course participant, and simulation confederates who assume the roles of perioperative care providers.
Five perioperative crisis scenarios were created for the study. Four (laparoscopy [LAP], small bowel obstruction [SBO], gynecologic sedation [GYN], and tachycardia in the postanesthesia care unit [PACU]) were used to develop and refine the taxonomy; the fifth (pulseless electrical activity [PEA] during an orthopedic procedure [ORTHO]) was reserved to test its usability. The scenarios were designed to:
- Be approximately 20 minutes in duration,
- Have a designated HS and “First Responder” (FR) participant, and
- Ensure that both the HS' individual performance and teamwork could be assessed; therefore, the FR could only enter the scenario between 9 and 12 minutes after it commenced.
Detailed scenario scripts were created that specified the nature and timing of crucial events, the availability of clinical supplies or drugs, and images or laboratory values that participants might request. Subject matter experts were engaged to validate the content of the scenarios, including the critical elements to be standardized while delivering them.
Establishing Ground Rules for Scenario Delivery
Ground rules for encounter delivery were established within a Scenario Delivery Rule Book (henceforth Rules; see document, Supplemental Digital Content 1, http://links.lww.com/SIH/A290, which contains the Rules). The Rules helped balance the goal of administering a valid performance assessment with the fidelity capabilities and limitations of HFS, the time constraints of the study (ie, the opportunity to do all the Critical Performance Elements within 20-minute scenarios), and MOCA course logistics. For example, to optimize simulation fidelity, the Rules stated that participants would only receive credit for giving a medication if they actually connected the appropriate syringe (prefilled and labeled or drawn up from a vial) to an intravenous access site and injected the desired dose, not by merely announcing or pretending that they had given it.
Establishing Critical Events for Each Scenario
While developing the scenarios, critical delivery events that could affect participants' performance were scripted and standardized. A “time zero”, marking the beginning of each scenario, was established so that changes in the patients' condition and participant interventions could be defined in time-based windows. The patients' vital signs (VS) were carefully scripted, as were their responses to medications. Definitions of a “successful intervention”, like the minimum threshold of energy to produce successful cardioversion, were established. Clinical equipment and supplies, and the time points at which they were to be available during the scenarios, were specified. Verbal cues from simulation confederates were also critical because they conveyed clinical information that was not otherwise available from the mannequin; therefore, the precise wording and timing for cues were scripted. Table 1 contains abbreviated descriptions of the 5 scenarios and the critical events the team sought to standardize while delivering them.
Assessing the Accuracy of Scenario Delivery
To quantify the accuracy of scenario delivery, a "Universal Scenario Assessment Tool (USAT)" was created. The USAT assessed 5 HFS parameters the team surmised could affect encounter delivery: (1) simulation equipment function; (2) clinical equipment function and presence according to the scenario script; (3) confederate performance as scripted; (4) participants' behavior or attitudes during the encounter (eg, not taking the encounter seriously or mocking the simulation); and (5) environment (eg, the simulation theater's appearance relative to its intended clinical location) (see document, Supplemental Digital Content 2, http://links.lww.com/SIH/A291, which contains the USAT).
Creating the Taxonomy
To create the taxonomy, author W.R.M. reviewed the scenarios' scripts, the Rules, and the USAT to identify foreseeable deviations from intended encounter delivery. For example, the scripts revealed that the displayed VS could be inaccurate; the Rules indicated that sites could allow participants to pretend to use imaginary vascular access; and the USAT showed that the simulated equipment could fail and audio or video recordings could be inadequate. All foreseeable encounter delivery or recording deviations were collated for inclusion in the taxonomy.
The review identified 36 types of delivery deviations. Seven members of the research team, all experts in simulation and delivering it for MOCA, categorized the deviations into broad classes and narrower subclasses, according to their mental model of "deviations from optimal encounter delivery and documentation". For this, a modified Delphi technique, facilitated by an open card sorting exercise,5 was used. For the card sorting, 7 sets of 36 index cards were made with an encounter delivery or documentation deviation written on each. The cards were shuffled to mitigate a presentation effect and given to the 7 team members. They were instructed to sort the cards into classes and subclasses and name the categories into which they had placed the cards. After completing that exercise, the experts presented their classifications and described their rationale for the categorizations. The discussion produced consensus on categorizing the deviations. The classes and subclasses were defined and labeled for each category. This produced the first iteration of the taxonomy. It had 2 main classes ("participant deviation" and "simulation center deviation"), which were further subdivided into as many as 6 subcategories. Two information scientists (authors T.B. and E.T.) were then consulted; they checked the logical coherence of the hierarchy and rewrote terms as noun phrases to comply with American National Standards Institute (ANSI) standards.6
The taxonomy was preliminarily tested and refined by categorizing deviations noted during pilot trials of 4 scenarios (LAP, SBO, GYN, and PACU) that preceded the study's formal data collection period. Study participants provided informed consent according to the site's local institutional review board requirements, and encounters were conducted during MOCA courses. Scenario authors reviewed recordings of the pilot encounters, noted deviations, and gave the sites feedback regarding their accuracy of delivering the scenario. A total of 59 pilot encounters (SBO, 4; GYN, 20; LAP, 23; and PACU, 12) were recorded. The unequal encounter distribution reflects the sites trialing the scenarios ad libitum.
Four hundred scenario deviations were noted within the 59 pilot trial encounters. Descriptions of the deviations noted during the authors' review were extracted and transcribed in written form (eg, "FR in at 14 minutes; should be no later than 12"). These were randomized and used by 2 members of the research team (W.R.M. and A.B.) to refine the taxonomy. They calibrated their judgments of the categories by selecting 30 random deviations and classifying them within the taxonomy. Once satisfied that they understood the definitions and the taxonomy, they randomized the entire set of 400 deviations again and independently classified them using the first iteration of the taxonomy. This first round of classifications yielded 63.5% (254/400) agreement. The 146 discrepancies were almost exclusively attributed to ambiguity about the definition of "inaccurate portrayal of confederate role—during scenario" and the need for a new subclass of documentation errors, "audio or video recording deviation". With the confederate-role definition refined and the new subclass added, the reviewers used this second iteration of the taxonomy to categorize the 146 deviations on which they had previously disagreed and obtained nearly complete agreement (141/146).
Two experienced simulation educators and assessors (S.D. and J.R.), not involved in creating or refining the taxonomy, then assessed its comprehensiveness and the inter-rater reliability in applying it. They first reviewed the classification schema and definitions. They practiced using it by classifying 2 sets of 20 deviations chosen randomly from the original 400 deviations, comparing their responses to the taxonomy developers' responses. They then classified, individually and independently, 146 written descriptions of encounter delivery deviations identified from 16 recordings of pilot trial encounters of the fifth scenario (ORTHO), which had not been used to develop or refine the taxonomy.
The 43.8% disagreement (64 of 146 classifications unmatched) stemmed from ambiguity in categorizing "fidelity" deviations. For example, one reviewer felt that participants who wore street clothes in the simulated emergency department rather than "scrubs" should be categorized as "participant deviation—ground rule violation", whereas the other classified them as "simulation center deviations." Considering the raters' feedback, the taxonomy developers recategorized and more thoroughly defined fidelity deviations. To ensure that fidelity deviations were classified appropriately in the third iteration of the taxonomy, the developers categorized those deviations found in the original 400 and found that they fit adequately.
Raters S.D. and J.R. then undertook a second round of categorizing the re-randomized ORTHO encounter deviations using the final version of the taxonomy, 6 weeks after their original attempt.
Each class consists of up to 7 hierarchically arranged subclasses (Fig. 1). For the purposes of evaluating inter-rater agreement, classifications with fewer than 7 levels of hierarchy (eg, "1.a.i.1") were appended with "dummy" subclasses (eg, "1.a.i.18.104.22.168") so that all had 7 levels. Inter-rater agreement was quantified using the percent agreement with 95% Wilson score confidence intervals (CIs)7 and the Cohen kappa statistic with 95% Wald-type CIs.8 Inter-rater agreement was assessed at each level of the taxonomy. For example, at taxonomy level 3, only the first 3 levels were considered, and the remaining 4 were ignored. Thus, percent agreement is a nonincreasing function of taxonomy level.
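The level-padding and per-level agreement computation can be sketched in code. The following is a minimal illustration with hand-rolled statistics and invented, dot-delimited classification codes (it is not the study's analysis code, and it omits the Wald-type CIs for kappa):

```python
from math import sqrt

def wilson_ci(agreements, n, z=1.96):
    """95% Wilson score confidence interval for a proportion."""
    p = agreements / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

def cohen_kappa(a, b):
    """Cohen kappa for two raters' categorical labels."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n                # observed agreement
    pe = sum((a.count(c) / n) * (b.count(c) / n)              # chance agreement
             for c in set(a) | set(b))
    return (po - pe) / (1 - pe)

def truncate(code, level):
    """Keep only the first `level` components of a dot-delimited code."""
    return ".".join(code.split(".")[:level])

# Hypothetical 7-level codes, padded with dummy components ("x") so that
# every classification has the same depth.
rater1 = ["1.a.i.1.x.x.x", "1.b.i.1.x.x.x", "2.a.x.x.x.x.x"]
rater2 = ["1.a.i.1.x.x.x", "1.b.ii.1.x.x.x", "2.a.x.x.x.x.x"]

for level in range(1, 8):
    t1 = [truncate(c, level) for c in rater1]
    t2 = [truncate(c, level) for c in rater2]
    agree = sum(x == y for x, y in zip(t1, t2))
    lo, hi = wilson_ci(agree, len(t1))
    print(f"level {level}: {agree}/{len(t1)} agree, "
          f"95% CI {lo:.2f}-{hi:.2f}, kappa {cohen_kappa(t1, t2):.2f}")
```

Because truncating to a deeper level can only split previously matching codes, the per-level agreement computed this way never increases with level.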
Table 2 shows the inter-rater agreement between the raters who tested the taxonomy with the ORTHO scenario. The percent agreement at the final taxonomy level (7) was 77% (95% CI, 70%–83%). The corresponding Cohen kappa was 0.74 (95% CI, 0.66–0.81), which indicates substantial agreement beyond what is expected due to chance.9 At every level except the first, the inter-rater agreement (as measured by Cohen kappa) was likewise substantial. The apparent discrepancy at level 1 is due to the high frequency of “center” (97%) versus “participant” (3%) classifications; with only 2 choices, and one being so highly represented, kappa cannot exclude that the observed agreement at level 1 (97%), although high, was the result of chance.
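The prevalence effect at level 1 can be demonstrated with a small numeric sketch (the counts below are invented for illustration and are not the study's data): with identical 97% raw agreement, kappa is far lower when one of two categories dominates, because the chance-expected agreement is inflated.

```python
def cohen_kappa(a, b):
    """Cohen kappa for two raters' categorical labels."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n
    pe = sum((a.count(c) / n) * (b.count(c) / n) for c in set(a) | set(b))
    return (po - pe) / (1 - pe)

# Balanced categories: 97/100 agreements.
r1_bal = ["a"] * 50 + ["b"] * 50
r2_bal = ["a"] * 47 + ["b"] * 53                    # 3 disagreements

# Skewed categories (97% in one class), also 97/100 agreements.
r1_skew = ["center"] * 97 + ["participant"] * 3
r2_skew = (["center"] * 95 + ["participant"] * 2
           + ["center"] * 1 + ["participant"] * 2)  # 3 disagreements

# Same observed agreement, but the skewed marginals push chance agreement
# to ~0.93, so kappa collapses relative to the balanced case.
print(round(cohen_kappa(r1_bal, r2_bal), 2))    # 0.94
print(round(cohen_kappa(r1_skew, r2_skew), 2))  # 0.56
```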
In addition to substantial agreement, the raters were able to categorize all the deviations within the taxonomy.
Figure 2 shows the final taxonomy of HFS encounter delivery deviations, with definitions for the classes and subclasses. Table 3 provides examples of the categorization of deviations within the taxonomy.
We iteratively developed and refined a taxonomy of deviations from standardized HFS encounter delivery or recording. It hierarchically arranges 2 broad classes (simulation center deviation and participant deviation), comprising 54 and 2 subclasses, respectively. Up to 6 levels of hierarchy are used to categorize some center deviations, whereas participant deviations have 2.
An adequate taxonomy includes every class needed to model the domain of interest (ie, it is exhaustive). This property was found in our evaluation, as all deviations could be classified. An additional requirement of a complete taxonomy is the property of class exclusivity: each class and subclass is unique, and there is no ambiguity as to where each possible element (in this case, simulated encounter deviation) belongs. This property is generally reflected in the reliability of raters to consistently classify deviations using the taxonomy. Our results suggest that this taxonomy has substantial class exclusivity, as classification agreement was 77% or greater at each level of categorization.
This study has limitations. Although the raters were using the taxonomy for the first time, they were both team members (a site investigator, who had directed several of the scenarios, and a post hoc performance rater); therefore, they may have had an understanding of the taxonomy's definitions and nomenclature that produced an overestimation of categorization agreement, compared to other simulation experts using the taxonomy. The inter-rater reliability was assessed from pilot trials of the scenarios, not from encounters from our fully refined and implemented simulation-based assessment study. Using the taxonomy to assess standardization of study scenarios may reveal the need for its modification or further refinement. Finally, the usability of the taxonomy was not ascertained from the reviewers (S.D. and J.R.). This important feature would need to be determined in future studies.
The hierarchical structure in a taxonomy represents the domain of a theoretical construct as perceived by its intended users.6,10 Classes are organized hierarchically: broader classes include narrower subclasses that have an "Is-a" relationship. In our taxonomy, for example, "Audio only deviation" is a subclass of "audio or video recording deviation", which is a subclass of "simulation center documentation deviation", which is a "deviation attributed to the simulation center". The 7 experts initially proposed that the taxonomy: (1) categorize deviations as related to "script", "rule", or "equipment"; (2) group deviations according to the availability of resources (ie, equipment or information), incorrect timing of events, and inability to rate the encounter; or (3) focus on the source of the deviation (eg, whether the participant or the site was its cause). Thorough discussion produced a consensus that the initial taxonomy covered the domain of interest.
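The "Is-a" chain in the example above can be represented as a simple child-to-parent map. This hypothetical sketch only illustrates the hierarchy traversal; the labels are paraphrased from the text:

```python
# Child -> parent ("Is-a") relationships from the example in the text.
parent = {
    "audio only deviation": "audio or video recording deviation",
    "audio or video recording deviation": "simulation center documentation deviation",
    "simulation center documentation deviation": "simulation center deviation",
}

def ancestors(term):
    """Walk up the hierarchy from a term to the root class."""
    chain = []
    while term in parent:
        term = parent[term]
        chain.append(term)
    return chain

# Every subclass ultimately resolves to one of the 2 root classes.
print(ancestors("audio only deviation"))
```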
Several descriptors of the simulation process have been published: defined simulation terms,11 guidelines for creating scenarios,12 a typology of simulation modalities,13 suggestions for best practices during simulation delivery,14 a method for creating "symmetrical" scenarios (those of approximately equivalent difficulty, but with content and presentation varied enough not to be recognized as identical),15 and a description of sources of error in standardized patient–based performance assessment.16 However, none is a taxonomy of unintended events during simulated-encounter delivery or recording. Such a taxonomy could aid in assessing the quality and accuracy of encounter delivery or documentation, give the simulation community a common vocabulary to describe deviations, and facilitate research into the confounding effects that encounter delivery deviations have on participant performance during simulation-based assessments.
This taxonomy has several strengths. The comprehensiveness of the nomenclature was fostered by reviewing numerous HFS performances from multiple sites and by soliciting the opinions of domain experts. Because there were no prior classes or definitions available for this construct, the use of a modified Delphi technique, facilitated by an open card–sorting exercise, allowed the taxonomic classes to emerge de novo from the mental models of simulation experts. Discussion and consensus regarding its content enhanced the likelihood that the taxonomy encompasses its intended domain. Finally, simulation experts who did not participate in its development tested it using a large number of deviations not used during its refinement.
For commonly used standardized patient examinations, particularly those included as part of the physician licensure process,17,18 there are scripted rules for scenario administration and a need to classify both administrative and documentation errors.16 We believe, but have not formally assessed, that our taxonomy could be adapted for this application. The fundamental nature of medical simulation in our HFS scenarios—assessing performance by simulating clinical challenges that reflect actual patient care and observing participants accessing the resources necessary to manage the situation—is present within many standardized patient, partial-task trainer, virtual reality, or computer-based simulation modalities. Therefore, we believe, but have not yet tested, that the taxonomy could be directly applied, or modified, to meet the needs of a broad cross-section of simulation researchers and educators.
We note this taxonomy's limitations. It does not provide a means for assessing the severity of a deviation. For example, inaccurately representing the patient's VS during the scenario is defined within this taxonomy; however, VS deviations that are 10% higher than scripted are less likely to affect participant performance than are those that are 50% higher. It was created for HFS from the scripts and rules of one performance assessment trial. Therefore, the conditions of other studies, or the use of different modes of simulation-based assessment, could necessitate taxonomic modification or enrichment. Although the scenario authors carefully reviewed the pilot encounters, it is possible that some deviations were missed during that process and will need to be incorporated in future iterations of the taxonomy. Although substantial, the imperfect rater agreement also suggests areas for future modification. This is to be expected, as taxonomies, like other controlled vocabularies, are dynamic resources that require maintenance and regular updates. For example, an updated version of the Medical Subject Headings (MeSH) thesaurus,19 used by the National Library of Medicine to describe articles in MEDLINE, is released annually. As the science of medical simulation evolves, we believe this taxonomy can be adapted and modified to reflect those advancements.
The authors thank Mr. John Lutz for his assistance in data collection for this study.
This study was funded in part by grant R18-HS020415 to Dr Weinger from the Agency for Healthcare Research and Quality (Rockville, MD).
1. Stewart D. Building Enterprise Taxonomies. US: Mokita Press; 2011:116–119.
2. Abbas J. Structures for Organizing Knowledge: Exploring Taxonomies, Ontologies, and Other Schemas. NY: Neal-Schuman Publishers; 2010.
3. McIvor W, Burden A, Weinger MB, Steadman R. Simulation for maintenance of certification in anesthesiology: the first two years. J Contin Educ Health Prof.
4. American Society of Anesthesiologists (ASA). Frequently Asked Questions About Simulation Courses Offered for Part IV MOCA® Credit. Available at: https://education.asahq.org/totara/asa/core/drupal.php?name=sim-endorsed. Accessed August 10, 2014.
5. U.S. Department of Health and Human Services. Available at: http://www.usability.gov/how-to-and-tools/methods/card-sorting.html. Accessed September 19, 2015.
6. National Information Standards Organization (U.S.), American National Standards Institute. Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies: An American National Standard. National Information Standards Series. Bethesda, MD: National Information Standards Organization; 2005 (R2010).
7. Newcombe RG. Two-sided confidence intervals for the single proportion: comparison of seven methods. Stat Med.
8. Fleiss JL, Cohen J, Everitt BS. Large sample standard errors of kappa and weighted kappa. Psychol Bull.
9. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics.
10. Smelser N, Baltes P. International Encyclopedia of the Social & Behavioral Sciences. 1st ed. Amsterdam; NY: Elsevier; 2001.
11. Meakim C, Boese T, Decker S, et al. Standards of best practice: simulation standard I: terminology. Clinical Simulation in Nursing.
12. Alinier G. Developing high-fidelity health care simulation scenarios: a guide for educators and professionals. Simul Gaming.
13. Alinier G. A typology of educationally focused medical simulation tools. Med Teach.
14. Furman GE, Smee S, Wilson C. Quality assurance best practices for simulation-based examinations. Simul Healthc.
15. Bush MC, Jankouskas TS, Sinz EH, Rudy S, Henry J, Murray WB. A method for designing symmetrical simulation scenarios for evaluation of behavioral skills. Simul Healthc.
16. Boulet JR, McKinley DW, Whelan GP, Hambleton RK. Quality assurance methods for performance-based assessments. Adv Health Sci Educ Theory Pract.
17. Dillon GF, Boulet JR, Hawkins RE, Swanson DB. Simulations in the United States medical licensing examination (USMLE). Qual Saf Health Care.
18. Boulet JR, Smee SM, Dillon GF, Gimpel JR. The use of standardized patient assessments for certification and licensure decisions. Simul Healthc.
19. US National Library of Medicine, NIH. MeSH: Medical Subject Headings. Updated September 8, 2014. Available at: http://www.nlm.nih.gov/mesh/. Accessed October 28, 2014.