
Empirical Investigations

Queen’s Simulation Assessment Tool

Development and Validation of an Assessment Tool for Resuscitation Objective Structured Clinical Examination Stations in Emergency Medicine

Hall, Andrew Koch MD; Dagnone, Jeffrey Damon MD, MMEd; Lacroix, Lauren MD; Pickett, William PhD; Klinger, Don Albert PhD

Simulation in Healthcare: The Journal of the Society for Simulation in Healthcare: April 2015 - Volume 10 - Issue 2 - p 98-105
doi: 10.1097/SIH.0000000000000076


Assessment of clinical expertise in postgraduate medical education is moving away from knowledge-based examination toward competency-based assessment.1,2 There has been a broad call for the development of tools to assess competency of postgraduate medical trainees in the United States,3 Canada,4 and internationally.2,5 At the 2010 Ottawa Conference, “Assessment of Competence in Medicine and the Healthcare Professions,” specific recommendations were made for the development and implementation of competency assessments at the “shows how” level of Miller’s pyramid.6 Within the specific context of emergency medicine (EM) training, the demand for validated assessment tools has been highlighted in recent conference proceedings2,7 and specialty-specific reviews.8

In a recent meta-analysis, simulation-based medical education with deliberate practice has been shown to be superior to traditional clinical medical education in achieving specific clinical skill acquisition goals within the realms of resuscitation, surgery, and procedural competence.9 Furthermore, simulation is now a mandatory component of maintenance of certification by the American Board of Anesthesiology,3 and high-fidelity simulation in Objective Structured Clinical Examination (OSCE) format has been incorporated into the Israeli National Board Examination in Anesthesia.10 The direct evaluation of performance through simulation-based assessment provides a unique opportunity for simultaneous evaluation of knowledge, clinical reasoning, and teamwork.11 The standardization, fidelity, and reproducibility of medical simulation make it well suited to the assessment of clinical competence.12

The development of tools and scoring metrics for high-fidelity simulation is happening at a rapid pace.13 A recent systematic review of technology-enhanced simulation in the assessment of health professionals provides a summary of assessment methods, but the evidence for validity is reported to be limited with “room for improvement.”14 Valid and reliable assessment tools that use simulation would be beneficial to postgraduate licensing bodies and training programs in EM.8 The postgraduate EM training program at Queen’s University in Kingston, Ontario, Canada, has integrated high-fidelity simulation through weekly junior and senior resuscitation rounds, trauma rounds, and monthly core curriculum sessions. We have previously reported the validation of a high-fidelity simulation-based OSCE assessment tool in a 3-station resuscitation examination, which used a combination of checklist and global assessment scores (GASs).15 Results from this trial suggested that a GAS demonstrated stronger relational validity and more consistent reliability. GAS has been previously shown to be superior to checklists for complex performance assessment.16

The purpose of the current study was therefore to extend this previous work and validate a novel assessment tool using anchored global assessment scoring for use in the evaluation of simulation-based resuscitation OSCE stations for EM trainees, namely, the Queen’s Simulation Assessment Tool (QSAT). Using the “unified model” of an argument for validity originally proposed by Messick17 and further developed by Downing,18 we designed the QSAT, OSCE stations, and review process with principles of content and response process validity, and we collected data relating specifically to the internal structure validity (interrater reliability, variance component analysis, and generalizability [G] study), relations with other variables (discriminatory capabilities based on the level of training), and consequences (perceived benefit to learning).


Methods

A prospective observational validation study was used to develop and evaluate the QSAT in the evaluation of EM trainees in simulated resuscitation settings. It was conducted between May 2010 and August 2012. The study received institutional review board approval from Queen’s University Health Sciences and Affiliated Teaching Hospitals Research Ethics Board. Participation was voluntary, and all trainees provided written informed consent to participate.

The study took place over 3 years at Queen’s University in Kingston, Ontario, Canada, in the simulation laboratory at the Kingston Resuscitation Institute within Kingston General Hospital. Multiple simulation mannequins were used: Gaumard Hal and Susie (Gaumard Scientific, Miami, FL), and Laerdal SimJunior (Laerdal Medical Canada, Ltd, Toronto, Ontario, Canada). All simulations were run by one of the members of the research team (A.K.H., J.D.D.) and our local simulation laboratory technician, who had several years of experience running medical simulations and vendor-based training in the use of each of the mannequins. The physiologic parameters were adjusted using a predetermined set of palettes, or scenario frames. The Kingston Resuscitation Institute simulation laboratory was set up on each occasion to recreate the physical environment of an emergency department resuscitation bay, and all necessary equipment and tools were available to the trainees. Study participants were EM resident trainees from the Fellowship of the Royal College of Physicians of Canada–Emergency Medicine (FRCP-EM) program and from the College of Family Physicians of Canada–Emergency Medicine Certificant (CCFP-EM) program at Queen’s University.

An outline of the study protocol is provided in Figure 1. The novel QSAT tool was developed as a hybrid scoring tool using an anchored GAS system. In pilot work, we compared GAS and checklist scores in the evaluation of simulated resuscitation scenarios and found that the GAS demonstrated stronger interrater reliability and discriminatory capabilities.15 We subsequently modified this tool, combining aspects of the checklist system into anchored GASs on a 5-point Likert scale (1, inferior; 5, superior) and focusing on 4 domains: primary assessment, diagnostic actions, therapeutic actions, and communication. Within each domain, key anchors are included and may be modified to assist expert evaluators in making scoring judgments. An additional overall GAS was included, resulting in an overall potential score of 25 (5 for each domain and 5 for the overall GAS). The QSAT was conceptually designed to assess competence of a senior-level resident, and the scale descriptors are meant to reflect this. Specific actions that would be expected for individual stations are intentionally not included as listed anchors in the generic version of the tool. This key feature permits the tool to be customized for each individual station while still maintaining the original framework. The QSAT (Fig. 2A) underwent multiple revisions using a modified Delphi technique by a panel of EM physicians from Queen’s University who were not directly involved in the study.

Study protocol flow chart.
A, Generic QSAT. B, Scenario-specific QSAT. BP, blood pressure; CXR, chest x-ray; ECG, electrocardiogram; HPI, history of presenting illness; HR, heart rate; LOC, level of consciousness; PMHx, past medical history; RR, respiratory rate; Sat, saturation.

An expert panel of EM faculty developed each of the 10 scenarios to be used as OSCE stations. The faculty all had received previous training in simulation-based education, including programs offered by the Harvard-Macy Institute and the Center for Medical Simulation in Boston, MA, and the Masters of Medical Education program at the University of Dundee, Scotland. The scenarios were created in blocks of 2 or 3 within 3 months of each examination administration. The choice of scenario topic was guided by an initial blueprinting technique referencing the FRCP-EM curriculum at Queen’s, ensuring a broad sampling of content over 2 years. Table 1 outlines the specific scenarios chosen for each of the examinations.

OSCE Scenarios

The resuscitation-focused scenarios were designed to discriminate between residents of different levels of training by eliciting observable behaviors that traditionally are poorly assessed with typically used forms of assessment such as oral and written examinations. Each scenario included scripted roles and clear instructions for the nurse and respiratory technician actors as well as the mannequin operator. For each scenario, a customized QSAT was used, based on the generic QSAT but including specific anchors relating to the clinical presentation and desired observable behaviors. Each of the anchoring items was derived using a modified Delphi technique by a panel of EM and critical care physicians. Each of the scenarios and assessment tools was piloted and revised in the simulation laboratory through practice sessions with the involved faculty and actors. An example of a specific QSAT for the assessment of our scenario 2, acute subarachnoid hemorrhage, is provided as Figure 2B.

The OSCE stations were administered to the residents with 2 or 3 scenarios per examination, each occurring over a 2-day period. The examinations were administered every 6 months over a 2-year time frame as part of biannual EM resident assessment days. Each resident was scheduled in a 30-minute block, and the scenarios were performed sequentially. Scenarios 1 to 6 were 6 minutes in duration, and scenarios 7 to 10 were 10 minutes in duration, allowing for completion of the examination in the 30-minute period. Each resident was provided with standardized verbal instructions before each scenario and given 1 minute to read a brief introduction to the scenario attached to the door of the room. The nurse and respiratory technician actors were able to provide additional prespecified information. The scenario ended when the allotted time was finished regardless of the actions performed by the resident or the condition of the patient, which is an acceptable practice in the context of an examination. Each resident was the only physician present to manage the patient, and consultants were not available if called. The residents’ performances were video recorded using Kb Port ETC Pro (Kb Port, Allison Park, PA), a 3-camera system with a cardiac monitor video feed and room audio. The video output consisted of a 4-part divided screen showing each camera angle and the cardiac monitor, together with the audio recording from the room. Members of the research team observed each session from behind one-way glass. Each candidate received a 20-minute formative debrief regarding their performance, with the QSAT used as a framework to guide the facilitator-led debriefing. The depth and detail of the debriefing were dependent on the strengths, weaknesses, and needs of each individual trainee but generally followed a typical structure of 3 phases: description, analogy/analysis, and application.19

The audio-video recording of each performance was independently reviewed by 3 EM physicians from a different academic institution who were blinded to the identity and level of training of the resident. All evaluators were attending EM faculty members who participated in simulation-based education programs at their own institutions. To standardize evaluations, the evaluators received a 2-hour orientation training session for each OSCE on the use of the assessment tools, with feedback from one of the investigators (J.D.D., A.K.H.). The evaluators were provided with copies of all written instructions, visual stimuli, and scenario descriptions with expected actions. During the orientation sessions, these evaluators rated a standardized sample of training video recordings of varying levels of performance in each scenario. These were then reviewed with the investigator until consensus scoring was achieved. The evaluators were not the same 3 raters throughout the study but were sampled from a pool of available raters; they were, however, the same across scenarios for each individual examination. The same orientation technique was used for each examination regardless of whether the rater had been involved previously. The evaluators were given a 3-month period to review each block of video-recorded performances and score them independently using the QSAT.

Immediately after examination, each resident completed a questionnaire asking the following questions, with responses indicated on a 5-point scale. (1) How comfortable do you feel being assessed in a simulation environment (1, not comfortable; 5, very comfortable)? (2) How well do you feel this OSCE simulated an ED [emergency department] resuscitation (1, not realistic; 5, very realistic)? (3) How valuable do you feel this OSCE was for your personal learning (1, not valuable; 5, very valuable)?

Statistical Analysis

Data analyses were performed using IBM SPSS version 22.0 (Armonk, NY). On the rare occasion that a component of the QSAT score was not completed by the evaluator, the overall GAS for the trainee in the specific scenario was assigned to the missing domain score. To analyze the discriminatory capabilities of the QSAT, scores were compared across level of training (junior FRCP [postgraduate year (PGY) 1–2], CCFP-EM [PGY 3], and senior FRCP [PGY 3–5]) using analysis of variance. For each examination administration, an omnibus F test was conducted using an α of 0.05, and post hoc analyses were conducted using the Bonferroni method. Interrater reliability for the QSAT in each of the 10 scenarios was estimated using the Spearman rank correlation coefficient and the absolute agreement intraclass correlation coefficient (ICC). Variance components and G coefficients were calculated as well as decision (D) studies, calculating G coefficients under various combinations of raters and stations. Mean responses to each question on the questionnaire for each level of trainee were also calculated.
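The interrater reliability calculation can be illustrated with a short sketch. This is not the authors’ SPSS procedure; it is a minimal numpy reconstruction of the absolute-agreement ICC (two-way random effects, single rater, often labeled ICC[2,1]) computed from a subjects-by-raters score matrix:

```python
import numpy as np

def icc_absolute_agreement(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an n-subjects x k-raters array of scores (e.g., total
    QSAT scores: one row per resident performance, one column per rater).
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)  # per-subject means
    col_means = ratings.mean(axis=0)  # per-rater means

    # Mean squares from the two-way ANOVA decomposition
    ms_rows = k * np.sum((row_means - grand) ** 2) / (n - 1)
    ms_cols = n * np.sum((col_means - grand) ** 2) / (k - 1)
    ss_err = (np.sum((ratings - grand) ** 2)
              - k * np.sum((row_means - grand) ** 2)
              - n * np.sum((col_means - grand) ** 2))
    ms_err = ss_err / ((n - 1) * (k - 1))

    # Absolute agreement penalizes systematic rater differences (ms_cols)
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
```

A systematic offset between raters lowers this absolute-agreement form even when their rankings agree perfectly, which is why it serves as a stricter companion to the Spearman rank correlation.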


Results

A total of 20 to 26 residents were enrolled for each OSCE, representing 69% to 96% of eligible postgraduate trainees in EM at Queen’s University. Those who did not participate were either out of the region on clinical rotations, pursuing a fellowship or advanced degree training, or on maternity leave. The resident study investigator (A.K.H.) was excluded from the study. Two residents enrolled were not included in the final analysis because of technical errors (one incomplete video, one video without audio).

Resident performances are compared by level of training in Table 2. Residents’ scores are a sum of the score given by each of the 3 raters using the QSAT. The maximum QSAT score is 25 (sum of 4 domains and GAS), giving a maximum score for the scenario of 75. The mean score presented in Table 2 is the mean of all residents of the specified level of training in the specified scenario. Figure 3 displays the resident scores in a box and whisker plot by quartile, arranged by scenario and level of training. In all scenarios, senior (SR) FRCP residents outperformed all other residents; the observed differences between all groups achieved statistical significance (analysis of variance) in all but one scenario. Post hoc tests indicated that junior (JR) FRCP and CCFP-EM residents performed similarly across scenarios. The SR-FRCP residents had significantly higher performance than the JR-FRCP residents in 7/10 scenarios and significantly higher performance than the CCFP-EM residents in 7/10 scenarios (Table 2).

Descriptive and Inferential Results for OSCE Scenario Total Scores
QSAT scores by scenario and trainee group.

The interrater reliabilities of the QSAT and each individual component are shown in Table 3. Total QSAT agreement was calculated using the sum score of all components of the QSAT, and the total OSCE agreement was calculated using the sum of QSAT scores on all scenarios in the OSCE. The interrater reliabilities of the total QSAT scores were strong to very strong as measured by Spearman ρ. Of the individual components, the lowest mean interrater reliabilities were found in the domains of diagnostic workup (mean, 0.59) and primary assessment (mean, 0.61). The highest mean interrater reliabilities were found in the therapeutic action domain (mean, 0.72) and overall GAS (mean, 0.70). In addition, the ICCs calculated for absolute agreement confirm moderate-to-substantial agreement across raters, with the lowest correlations in scenarios 6 and 10.

Mean Interrater Reliability of QSAT Scores by 3 Raters in Each Scenario and Combined OSCE

A separate generalizability analysis was conducted using a fully crossed person by scenario by rater (p × s × r) design for each of the 4 examinations administered. The estimated variance components, the relative contributions to score variance, and the G coefficients are provided in Table 4. The largest sources of variance in all 4 runs were trainee and trainee by scenario interaction. The G coefficients are presented as the examinations were actually administered; nr = 3 and ns = 3 for Summer 2010 and Winter 2011; nr = 3 and ns = 2 for Summer 2011 and Winter 2012. D studies were conducted to evaluate the effectiveness of alternative designs with differing numbers of facets for each of the administered examinations. Increasing the number of scenarios per OSCE to between 6 and 9 would produce G coefficients close to 0.90. With 6 scenarios and 1 judge, the G coefficients would be greater than 0.8, with the exception of the Winter 2012 OSCE.
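The D-study projections described above follow from the standard formula for the relative (norm-referenced) G coefficient in a fully crossed person × scenario × rater design. The sketch below uses hypothetical variance components chosen for illustration, not the values from Table 4:

```python
def g_coefficient(var_p, var_ps, var_pr, var_psr, n_s, n_r):
    """Relative G coefficient for a fully crossed p x s x r design:
    person (true-score) variance over itself plus relative error variance,
    with each interaction component averaged over its facet size(s)."""
    rel_error = var_ps / n_s + var_pr / n_r + var_psr / (n_s * n_r)
    return var_p / (var_p + rel_error)

# D study: project reliability under alternative facet sizes
# (variance components here are hypothetical, for illustration only)
var_p, var_ps, var_pr, var_psr = 4.0, 2.0, 0.3, 1.0
for n_s in (2, 3, 6, 9):
    for n_r in (1, 3):
        g = g_coefficient(var_p, var_ps, var_pr, var_psr, n_s, n_r)
        print(f"n_s={n_s}, n_r={n_r}: G={g:.2f}")
```

With illustrative components in which the person-by-scenario term dominates the error, adding scenarios raises G much faster than adding raters, mirroring the pattern reported here.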

Estimated Variance Components From G Study

The results of the postexamination questionnaire are presented in Table 5. Interestingly, for the residents in their final year of training, the mean answers were 4.6 of 5 for the question, “How comfortable do you feel being assessed in a simulation environment?”; 4.4 of 5 for the question, “How well do you feel this OSCE simulated a real resuscitation?”; and 4.8 of 5 for the question, “How valuable do you feel this OSCE was to your learning?”

Responses to Postexamination Questionnaire


Discussion

In general, assessment is intended to quantify an underlying construct that is not easily amenable to direct measurement, in this case, competence in EM resuscitation. In any technologically enabled assessment, it is imperative to ensure that the validity of the assessment tool is demonstrated to avoid assessment of incorrect constructs.5 Cook et al14 argue that the classical framework of different types of validity (face, content, criterion, concurrent) has been surpassed by Messick’s “unified model”17 in which the argument for validity is collected with evidence from 5 sources as follows: content, response process, internal structure, relations with other variables, and consequences. In this validation study, we have structured our case for the validity of this assessment tool around these 5 sources. Content validity, or the extent to which included content is relevant to our defined construct,20 was ensured through the initial development by an expert panel of EM physicians and the subsequent review by an independent group with content expertise in resuscitation medicine. Furthermore, the QSAT was developed as a modification of previously validated tools used at our center.15 Response process validity, or the extent to which the response process of examinees and the evaluative process of raters reflect the defined construct, is more difficult to measure and typically poorly reported in validation studies.14 Evidence here stems from 2 specific components of our process. First, during training, evaluators were specifically asked why they were scoring specific candidates strongly or poorly, and attempts were made to eliminate sources of variance that were irrelevant to the construct of resuscitation competence. Second, trainees were specifically asked in the questionnaire to what degree the examination reflected real resuscitation. Their mean response of 4.4/5 indicated a perception that the examination was very realistic.

Evaluation of internal structure validity came from 2 sources in our analysis, namely, the G study and calculations of interrater reliability. The G study uses variance component analysis to measure the contributions that all relevant factors make to the result.21 This analysis found that the largest source of variance, as expected, was for the examinees (persons), as this represents variability caused by differences in the residents’ performances. Ideally, this would be the only source of variance. The next largest source of variance was the examinee by scenario interaction, implying that trainee performance varied based on the scenario that they completed. Hence, more cases will be required to get more generalizable results. This was also reflected in the D study result, which found that at least 6 scenarios would be needed to achieve a G coefficient greater than 0.8 with only 1 judge. In general, the G coefficients were quite high. One feature contributing to this is the task similarity between scenarios, resulting in the testing of similar constructs. Each station, while using a different clinical scenario, shares the commonalities of general resuscitation principles, namely, the initial assessment of the critically ill patient, communication skills, and crisis resource management. Therefore, excellent performance in one scenario may predict excellent performance in another scenario. An additional explanation for the high G coefficients is the significant heterogeneity of the trainees, ranging from PGY 1 to PGY 5. This heterogeneity is much higher than in usual summative assessments, which would typically examine trainees from the same level of training. The interrater reliability of the QSAT as measured independently was moderate to strong. When the QSAT was used in multiple scenarios in our examination format, the interrater reliabilities of the combined scores for each OSCE were substantially higher, producing more reliable results. This supports the use of an increased number of scenarios for wider sampling to produce a more reliable picture of competence and a more generalizable examination.22

To measure relations with other variables, the discriminatory ability of the QSAT was evaluated. Residents of higher levels of training (SR-FRCP) consistently demonstrated higher scores that were also statistically significant as measured by post hoc analysis in 7 of 10 scenarios. The 2 other levels of trainee, JR-FRCP and CCFP-EM however had very similar scores. At our site, the CCFP-EM and JR-FRCP resident groups receive a similar volume of resuscitation training throughout their initial EM residency, and it would be expected that the scores of the CCFP-EM residents and JR-FRCP residents would be similar in an examination measuring competence in resuscitation. Notably, in OSCEs that took place later in the academic year (scenarios 4, 5, 6, 9, and 10), the CCFP-EM group outperformed the JR-FRCP group, reflecting the rapid learning that typically occurs in the single year of EM training in the CCFP-EM program in Canada. As another measure of the relationship of the QSAT with other variables, a pilot version of the QSAT was demonstrated to correlate with written examination scores on a similar subject matter.15

The final component of the validity assessment was an evaluation of consequences, or the impact of the assessment itself. The postexamination questionnaire specifically asked residents how valuable they felt the examination was to their learning. The mean response of 4.8/5 indicates that the residents felt the examination was very valuable to their learning. Furthermore, residents indicated on the questionnaire that they largely felt comfortable being examined in the simulation laboratory, with final-year trainees indicating that they felt very comfortable, with a mean response of 4.6/5.

Our QSAT is a unique assessment tool whose development was informed by a review of the best available assessment literature. Multiple studies in pediatrics and anesthesia have demonstrated the ability of high-fidelity simulation assessment tools using case scenarios to discriminate between trainees of different levels with good interrater reliability.23–27 Adler et al28 specifically demonstrated excellent interrater reliability using both checklist and anchored global rating instruments in the assessment of simulated pediatric emergencies. There have also been multiple tools developed for the assessment of team competency29,30 and crisis resource management (CRM).31 Specifically linking CRM with CanMEDS criteria, the Generic Integrated Structured Assessment Tool (GIOSAT) developed by Neira et al32 demonstrated excellent discriminatory ability and interrater reliability in the assessment of anesthesia residents in advanced cardiac life support scenarios. There have also been multiple checklist-based scoring systems developed for the assessment of advanced life support courses, namely, advanced cardiac life support, pediatric advanced life support, and the neonatal resuscitation program.33–36 Rosen et al37 demonstrated an excellent approach to linking Accreditation Council for Graduate Medical Education core competencies to simulation-based assessment using their Simulation Module for Assessment of Resident Targeted Event Responses (SMARTER) approach. This methodology, however, like the others mentioned earlier, uses a checklist-based tool for assessment.

The QSAT uses anchored global assessment scoring to assess both specific clinical knowledge and actions and high-level resuscitation principles and behaviors. This use of global assessment scoring harnesses the expertise of the trained assessor to deliver valid and reliable assessments of the trainee. The QSAT has a general structure that may be modified and used to assess a broad range of clinical situations, depending on the needs of the assessor. Many of the earlier mentioned tools in contrast either are broad team or CRM assessments or have been designed to assess a very specific clinical scenario. It is important to note that any modifications to the generic tool would require the same steps used in our study to create a scenario-specific QSAT. The use of a modified Delphi technique by a panel of content experts is imperative, as is the process of piloting and revision. Although any modification to a tool has the potential to change its psychometric properties, we have demonstrated the customization process for 10 scenarios without significant compromise of its performance. The Royal College of Physicians and Surgeons of Canada, like other international licensing bodies, has demanded competency-based assessment programs. The QSAT combined with our system of OSCE assessment answers this call and has the potential for integration into benchmark assessment.

There are some study limitations that are important to note. Despite the use of the OSCEs as formative assessments, there were no pass/fail grades assigned to the QSAT scores. Should the QSAT be used for future summative examination, an appropriate standard setting strategy would need to be used to establish competency thresholds, and potential revisions would need to be made to the scales and descriptors of the current QSAT if it were to be used for summative certification assessments. In addition, the QSAT currently allows compensatory scoring, with high scores in one domain compensating for low scores in other domains. Future studies are needed to assess the implication of this for summative assessment and whether minimum standards would need to be set for each domain. In our study, the QSAT was used as an assessment tool for 10 separate OSCE scenarios, but these were not run as a single comprehensive examination. As a result, there are different residents involved in each block of 2 to 3 scenarios. Each examination did not address a comprehensive or complete selection of potential content that an examination of resuscitation skills would need to cover. As such, we were unable to perform additional measures of internal structure validity of the OSCE itself, such as interstation reliability and internal consistency. Item analysis and factor analysis were also not included in this study. In the future, a larger-scale OSCE that examines residents from multiple sites, with a full complement of stations using the QSAT after a standard setting study, could be used and evaluated as a comprehensive summative assessment. Furthermore, it would be ideal to compare performance in the simulation laboratory as measured by the QSAT with performance in the resuscitation bay of the emergency department with real patients.


Conclusions

In this study, we outline the rigorous development and evaluation of the QSAT, a modifiable tool for the assessment of postgraduate EM trainees in high-fidelity simulation-based resuscitation OSCEs. As demonstrated in 10 OSCE stations, the QSAT has moderate to very strong interrater reliability and excellent discriminatory capabilities. We have additionally built a comprehensive argument for its content validity, response process validity, and consequence validity. Residents in their final year of training reported that they found the experience very realistic and valuable and were very comfortable being examined in the simulation laboratory. We anticipate that simulation-based assessment tools, such as the QSAT, will soon become an integral part of the summative assessment of residents in the form of internal residency examinations, midtraining examinations, and board certification examinations.


References

1. McLaughlin S, Fitch MT, Goyal DG, et al; SAEM Technology in Medical Education Committee and the Simulation Interest Group. Simulation in graduate medical education 2008: a review for emergency medicine. Acad Emerg Med 2008; 15: 1117–1129.
2. Hamstra SJ. Keynote address: the focus on competencies and individual learner assessment as emerging themes in medical education research. Acad Emerg Med 2012; 19: 1336–1343.
3. Steadman RH, Huang YM. Simulation for quality assurance in training, credentialing and maintenance of certification. Best Pract Res Clin Anaesthesiol 2012; 26: 3–15.
4. The Association of Faculties of Medicine of Canada. The Future of Medical Education in Canada: A Collective Vision for Postgraduate Medical Education. 2012. Accessed on May 14, 2014.
5. Amin Z, Boulet JR, Cook DA, et al. Technology-enabled assessment of health professions education: consensus statement and recommendations from the Ottawa 2010 Conference. Med Teach 2011; 33: 364–369.
6. Boursicot K, Etheridge L, Setna Z, et al. Performance in assessment: consensus statement and recommendations from the Ottawa conference. Med Teach 2011; 33: 370–383.
7. Spillane L, Hayden E, Fernandez R, et al. The assessment of individual cognitive expertise and clinical competency: a research agenda. Acad Emerg Med 2008; 15: 1071–1078.
8. Sherbino J, Bandiera G, Frank J. Assessing competence in emergency medicine trainees: an overview of effective methodologies. Can J Emerg Med 2008; 10: 365.
9. McGaghie WC, Issenberg SB, Cohen ER, Barsuk JH, Wayne DB. Does simulation-based medical education with deliberate practice yield better results than traditional clinical education? A meta-analytic comparative review of the evidence. Acad Med 2011; 86: 706–711.
10. Berkenstadt H, Ziv A, Gafni N, Sidi A. Incorporating simulation-based objective structured clinical examination into the Israeli National Board Examination in Anesthesiology. Anesth Analg 2006; 102: 853–858.
11. Epstein RM. Assessment in medical education. N Engl J Med 2007; 356: 387–396.
12. McGaghie WC, Issenberg SB, Petrusa ER, Scalese RJ. A critical review of simulation-based medical education research: 2003–2009. Med Educ 2010; 44: 50–63.
13. Boulet JR, Murray DJ. Simulation-based assessment in anesthesiology: requirements for practical implementation. Anesthesiology 2010; 112: 1041–1052.
14. Cook DA, Brydges R, Zendejas B, Hamstra SJ, Hatala R. Technology-enhanced simulation to assess health professionals: a systematic review of validity evidence, research methods, and reporting quality. Acad Med 2013; 88: 872–883.
15. Hall AK, Pickett W, Dagnone JD. Development and evaluation of a simulation-based resuscitation scenario assessment tool for emergency medicine residents. CJEM 2012; 14: 139–146.
16. Regehr G, MacRae H, Reznick RK, Szalay D. Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCE-format examination. Acad Med 1998; 73: 993–997.
17. Messick S. Validity. In: Linn RL, ed. Educational Measurement. 3rd ed. New York: Macmillan; 1989: 13–103.
18. Downing SM. Validity: on meaningful interpretation of assessment data. Med Educ 2003; 37: 830–837.
19. Fanning RM, Gaba DM. The role of debriefing in simulation-based learning. Simul Healthc 2007; 2: 115–125.
20. Andreatta PB, Gruppen LD. Conceptualising and classifying validity evidence for simulation. Med Educ 2009; 43: 1028–1035.
21. Crossley J, Davies H, Humphris G, Jolly B. Generalisability: a key to unlock professional assessment. Med Educ 2002; 36: 972–978.
22. Boursicot K, Roberts T, Burdick W. Structured assessments of clinical competence. In: Swanwick T, ed. Understanding Medical Education: Evidence, Theory and Practice. Oxford: Wiley-Blackwell; 2010: 246–258.
23. Adler MD, Trainor JL, Siddall VJ, McGaghie WC. Development and evaluation of high-fidelity simulation case scenarios for pediatric resident education. Ambul Pediatr 2007; 7: 182–186.
24. Brett-Fleegler MB, Vinci RJ, Weiner DL, Harris SK, Shih MC, Kleinman ME. A simulator-based tool that assesses pediatric resident resuscitation competency. Pediatrics 2008; 121: e597–e603.
25. Morgan PJ, Cleave-Hogg DM, Guest CB, Herold J. Validity and reliability of undergraduate performance assessments in an anesthesia simulator. Can J Anaesth 2001; 48: 225–233.
26. Murray DJ, Boulet JR, Avidan M, et al. Performance of residents and anesthesiologists in a simulation-based skill assessment. Anesthesiology 2007; 107: 705–713.
27. Weller JM, Bloch M, Young S, et al. Evaluation of high fidelity patient simulator in assessment of performance of anaesthetists. Br J Anaesth 2003; 90: 43–47.
28. Adler MD, Vozenilek JA, Trainor JL, et al. Comparison of checklist and anchored global rating instruments for performance rating of simulated pediatric emergencies. Simul Healthc 2011; 6: 18–24.
29. McKay A, Walker ST, Brett SJ, Vincent C, Sevdalis N. Team performance in resuscitation teams: comparison and critique of two recently developed scoring tools. Resuscitation 2012; 83: 1478–1483.
30. Walker S, Brett S, McKay A, Lambden S, Vincent C, Sevdalis N. Observational Skill-based Clinical Assessment tool for Resuscitation (OSCAR): development and validation. Resuscitation 2011; 82: 835–844.
31. Andersen PO, Jensen MK, Lippert A, Ostergaard D, Klausen TW. Development of a formative assessment tool for measurement of performance in multi-professional resuscitation teams. Resuscitation 2010; 81: 703–711.
32. Neira VM, Bould MD, Nakajima A, et al. “GIOSAT”: a tool to assess CanMEDS competencies during simulated crises. Can J Anaesth 2013; 60: 280–289.
33. McEvoy MD, Smalley JC, Nietert PJ, et al. Validation of a detailed scoring checklist for use during advanced cardiac life support certification. Simul Healthc 2012; 7: 222–235.
34. Napier F, Davies RP, Baldock C, et al. Validation for a scoring system of the ALS cardiac arrest simulation test (CASTest). Resuscitation 2009; 80: 1034–1038.
35. Donoghue A, Nishisaki A, Sutton R, Hales R, Boulet J. Reliability and validity of a scoring instrument for clinical performance during Pediatric Advanced Life Support simulation scenarios. Resuscitation 2010; 81: 331–336.
36. Rovamo L, Mattila MM, Andersson S, Rosenberg P. Assessment of newborn resuscitation skills of physicians with a simulator manikin. Arch Dis Child Fetal Neonatal Ed 2011; 96: F383–F389.
37. Rosen MA, Salas E, Silvestri S, Wu TS, Lazzara EH. A measurement tool for simulation-based training in emergency medicine: the simulation module for assessment of resident targeted event responses (SMARTER) approach. Simul Healthc 2008; 3: 170–179.

Assessment; Emergency medicine; Manikins; Medical education; Resuscitation; Simulation

© 2015 Society for Simulation in Healthcare