Evolving requirements to measure milestones and competencies at all phases of medical training1–6 signal a need to develop and validate new systems of assessment. Although the measurement of patient-related outcomes (i.e., provider behaviors and patient outcomes) in the workplace is desirable, such assessments are limited by costs, patient safety concerns, nonstandardized settings, and infrequent clinical events.7–9 Thus, educators must continue to rely on assessments completed in settings without direct patient contact.10,11 Although we caution against overrelying on such surrogate measures,12 establishing the necessary evidence base will permit educators and researchers to use these surrogates as the primary means of assessment during day-to-day practices. The use of patient-related outcomes then would be reserved for select situations, such as the later stages of training or the culmination of a program of research.7,8
Leaders in medical education have proposed that simulation-based assessments are essential to solving some of these challenges, given that they permit the testing of learners’ performance in safe and standardized environments.8,13,14 However, before an assessment tool is widely implemented, the validity evidence supporting both its intended use and the interpretation of its scores needs to be established.15 A recent systematic review of 417 studies of simulation-based assessments highlights notable limitations in the validity evidence supporting such tools.16 Two other reviews examined the association between simulation-based training and patient-related outcomes, but neither examined the role of simulation in assessment.17,18 To date, we are not aware of any review—for simulation-based assessments specifically or for assessments in general—evaluating the empirical evidence linking educational surrogates with corresponding assessments in the workplace. The purpose of the present study is to examine the association between scores from simulation-based and patient-related assessments and to outline the implications for current assessment practices.
According to guidelines, a proposed assessment tool is considered a valid surrogate if its scores correlate with the target outcome, and change in the proposed surrogate is associated with a corresponding change in the target outcome.19,20 Additional sources of validity evidence should also be sought to provide robust support for the surrogate measure.21–23 Moreover, the research itself should be rigorous and well reported. Hence, our detailed analysis of simulation-based surrogates required attention to broad sources of validity evidence as well as common reporting and methodological issues.
We conducted a systematic review to answer the following questions:
- What are the associations between technology-enhanced simulation-based outcomes and patient-related outcomes?
- What other sources of validity evidence have been reported for these outcomes?
- What is the quality of the methods and reporting in this body of research?
We planned, conducted, and reported this review in accordance with PRISMA standards.24
Data sources and searches
We conducted our search in two stages. First, we examined all the studies identified in our earlier reviews of simulation-based training and assessment.16,25 For these, we searched Ovid MEDLINE, Ovid EMBASE, CINAHL, PsycINFO, ERIC, Web of Science, and Scopus using a search strategy previously reported in full,25 which was last updated on May 11, 2011. Second, we updated our search on February 26, 2013, searching Ovid MEDLINE, Ovid EMBASE, and Scopus using a revised strategy developed by a research librarian to focus on simulation-based assessments. Terms in both searches focused on the topic (e.g., simulat*), learner population (e.g., med*, nurs*, health occupations), and assessment (e.g., assess*, valid*); see Supplemental Digital Appendix 1 at http://links.lww.com/ACADMED/A246 for the full revised search strategy. We sought additional studies by examining the references from several published reviews of simulation-based assessments and training.26–38
We used broad inclusion criteria to identify original research studies published in any language that (a) assessed trainees both using technology-enhanced simulation and in the context of actual patient care, (b) involved health professionals at any stage of training or practice, and (c) reported evidence of the association between simulation-based scores and patient-related scores. We defined technology-enhanced simulation as “an educational tool or device with which the learner physically interacts to mimic an aspect of clinical care for the purpose of teaching or assessment.”25 We included technologies such as mannequins, virtual reality simulators, and part-task models, and excluded computer-based virtual patients and human standardized patients because they have been the topic of previous reviews.39,40 We included self-reported information regarding procedural success or complications but excluded self-assessments of confidence or subjective performance.
We worked independently, then in pairs, to screen titles, abstracts, and full-text articles for inclusion using the same criteria (see above) for both search strategies, with good agreement (intraclass correlation coefficient [ICC] 0.72 for the first search and 0.67 for the second). We resolved all conflicts by consensus.
Data abstraction and quality assessment
Independently and in pairs, we used a data abstraction form to extract information from the included studies on trainees, clinical topic, validity evidence, study quality, and measures of association. We distinguished validity evidence using Messick’s framework22—namely, content (ICC 0.60), response process (ICC 0.50), internal structure (ICC 0.73), relations with other variables (ICC 0.65), and consequences (ICC 1.0)—and further classified this evidence as favorable or unfavorable (i.e., evidence of validity or invalidity). For internal structure evidence, which often includes reliability metrics, we considered interrater reliability > 0.4 (“fair” per Fleiss and Cohen41) and internal consistency reliability > 0.7 as favorable. For relations with other variables, which typically involves correlations, we considered r ≥ 0.5 as favorable, r = 0.3–0.49 as weakly favorable, and r < 0.3 as unfavorable, based on Cohen’s classification of these ranges as large, medium, and small/negligible, respectively.42
We used the Medical Education Research Study Quality Instrument (MERSQI)43 to grade overall study quality. We also abstracted the unit of analysis (patient or trainee; ICC 0.59), presence/absence of a power analysis (ICC 1.0), number of independent analyses (ICC 0.94), reporting of patient demographics (ICC 0.51), blinding of assessment (ICC 0.69), time between simulation and patient assessment (ICC 0.73), assessment of behaviors through direct observation of specific encounters or rotation grades (ICC 0.90), collection of correlational data before or after training (ICC 0.91), and whether evaluating association was a study goal (ICC 0.94).
We classified simulation-based outcomes as time skills (time required to complete the task), process skills (measures of performance such as instructor ratings or minor errors), and product skills (quality of the final product, rate of completion, or major complication). We analogously classified outcomes assessed in the clinical context as time behaviors, provider behaviors (e.g., performance ratings or grades), and patient outcomes (e.g., procedural complications). When a simulation/patient correlation coefficient was not reported, we calculated one from other reported information (e.g., coefficient of determination [R2], P value, or t statistic) using standard methods.44 When necessary, we estimated correlation from linear regression slopes using the approach described by Peterson and Brown.45 For studies reporting insufficient information to calculate a correlation coefficient, we requested additional information from the authors. If more than one simulation or patient-related outcome was reported, we selected the association linking the most similar simulation and patient-related outcomes.
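The conversions described above follow standard formulas. As an illustrative sketch (not the authors' code), the common cases can be written as follows; the beta-to-r conversion uses the approximation commonly attributed to Peterson and Brown, r ≈ .98β + .05λ, with λ = 1 for nonnegative β:

```python
import math

def r_from_t(t, df):
    """Correlation from a reported t statistic and its degrees of freedom."""
    return t / math.sqrt(t ** 2 + df)

def r_from_r_squared(r2, slope_sign=1.0):
    """Correlation from a coefficient of determination (R^2); the sign is
    taken from the direction of the reported slope."""
    return math.copysign(math.sqrt(r2), slope_sign)

def r_from_beta(beta):
    """Peterson and Brown style approximation from a standardized regression
    slope, intended for |beta| < 0.5."""
    lam = 1.0 if beta >= 0 else 0.0
    return 0.98 * beta + 0.05 * lam
```

For example, a study reporting only t(10) = 2.0 would yield r ≈ 0.53 under the first conversion.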
Data synthesis and analysis
We quantitatively pooled z-transformed correlation coefficients (Pearson r or Spearman rho) using random-effects meta-analysis, and then we transformed the pooled result back to the native format. We conducted separate meta-analyses for each outcome. We conducted planned subgroup analyses by topic, trainee, study quality (MERSQI score above or below median), named instrument, the timing of correlation (before or after training), and whether patient outcomes were derived from direct observation or rotation grades. The weighting for all meta-analyses was based on the number of trainees, not the number of patients. We explored possible publication bias using funnel plots and the Egger asymmetry test, although these methods are limited in the presence of high inconsistency.46
We quantified between-study inconsistency (heterogeneity) using the I2 statistic,47 which estimates the percentage of variability across studies attributable to heterogeneity rather than chance. Values of I2 > 50% indicate large inconsistency. We used SAS 9.3 (SAS Institute, Cary, North Carolina) for all analyses. Statistical significance was defined by a two-sided alpha of 0.05.
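The pooling procedure named in this section can be sketched in a few lines. This is a simplified, stdlib-only illustration of DerSimonian and Laird random-effects pooling of Fisher z-transformed correlations (with the usual inverse-variance weight of n − 3 for z), not the SAS analysis the authors ran:

```python
import math

def pool_correlations(rs, ns):
    """Random-effects (DerSimonian-Laird) pooling of correlations.

    rs: per-study correlation coefficients; ns: per-study sample sizes.
    Returns (pooled r, I-squared as a percentage).
    """
    zs = [math.atanh(r) for r in rs]               # Fisher z-transform
    ws = [n - 3 for n in ns]                       # inverse variance of z is 1/(n - 3)
    z_fixed = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    q = sum(w * (z - z_fixed) ** 2 for w, z in zip(ws, zs))  # Cochran's Q
    df = len(rs) - 1
    c = sum(ws) - sum(w * w for w in ws) / sum(ws)
    tau2 = max(0.0, (q - df) / c)                  # between-study variance estimate
    ws_re = [1.0 / (1.0 / w + tau2) for w in ws]   # random-effects weights
    z_pooled = sum(w * z for w, z in zip(ws_re, zs)) / sum(ws_re)
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0    # I2 (%)
    return math.tanh(z_pooled), i2                 # back-transform to native r
```

When the input correlations disagree substantially, tau2 grows, the weights flatten, and I2 rises above the 50% threshold for large inconsistency.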
From 11,628 potentially relevant articles, we identified 59 studies in which assessments included both simulation-based and patient-related outcomes. Of these, 29 reported data to determine the correlation between these outcomes. We attempted to contact the authors of the 30 studies lacking correlation information, successfully contacted 17, and received sufficient data from 4. Hence, 33 studies met our inclusion criteria. Figure 1 is a trial flow diagram showing our literature search and study selection process, and Appendix 1 reports key characteristics.
The 33 included studies enrolled 1,203 trainees (range 8–135, median 27).48–80 Most studies (n = 24) enrolled resident physicians. All of the clinical topics focused on procedural tasks such as surgery, anesthesiology, or endoscopy.
Most studies (n = 27) measured provider behaviors with real patients—namely, procedural ratings by instructors (n = 18),48,49,51–53,55,56,59–62,64,65,67,68,74,77,79 grades on clinical rotations (n = 8),57,58,66,69–72,75 and automated motion analysis (n = 1).50 Seven studies reported time behaviors,50,53,56,64,76,78,79 and five studies reported direct patient outcomes—namely, procedural success (n = 2),63,80 evaluation of a final product (n = 2),54,73 and major complications (n = 1).79 Simulation-based assessments of process skills (n = 24) included checklists (n = 9),49,52,62,63,70–72,74,75 global rating scales (n = 6),51,55,57,61,65,77 simulator-specific scores (n = 4),58,59,67,68 motion analysis (n = 2),50,64 and a visual analogue scale (n = 1).56 Seven studies measured time skills.48,53,56,64,76,78,79 Six studies measured product skills—namely, faculty ratings (n = 2),60,69 global ratings of a dental preparation (n = 2),54,73 procedural success (n = 1),80 number of attempts (n = 1),66 and major complications (n = 1).79 Six studies analyzed correlations between outcomes that we considered conceptually misaligned. Specifically, product skills were correlated with provider behaviors (n = 3),60,66,69 time skills were correlated with provider behaviors (n = 2),48,79 and process skills were correlated with patient outcomes (n = 1).63 Twelve studies reported the time delay between assessments, with a median of 75 days (range 0–180). Two studies reported nontechnical skills60,72; one of these studies also reported a technical skill, and we included that in our meta-analysis.60 A human rater assessed all provider behaviors, three patient outcomes,54,73,79 and two time behaviors.76,78 Two patient outcomes were self-reported by trainees.63,80 The method of measuring time behaviors was not reported explicitly in five studies.50,53,56,64,79
Associations between simulation-based and patient-related outcomes
For the 27 studies reporting a correlation with provider behaviors, the pooled correlation was 0.51 (95% confidence interval [CI], 0.38 to 0.62; P < .0001); see Panel A in Figure 2. On the basis of Cohen’s classification,42 we considered this a large correlation. However, between-study inconsistency was large, with I2 = 79%. For the 7 studies reporting a correlation with time behaviors, the pooled correlation was 0.44 (95% CI, 0.15 to 0.66; P = .0001), a medium correlation, with large inconsistency (I2 = 58%); see Panel B in Figure 2. For the 5 studies reporting a correlation with direct effects on patients, the pooled correlation was 0.24 (95% CI, −0.02 to 0.47; P = .05), a small correlation, with large inconsistency (I2 = 67%); see Panel C in Figure 2. Neither funnel plots nor the Egger asymmetry test suggested publication bias for any analysis.
Subgroup analyses for provider behaviors (see Figure 3) demonstrated that the pooled correlation was highest for physicians in practice and higher for postgraduate trainees than medical students. Correlations also were higher for direct observations of specific clinical encounters than for rotation grades reflecting general impressions, and higher for assessments conducted before training than those conducted after training. The pooled correlation was large (> 0.68) for the three instruments that were used as the simulation-based outcome in more than one study: the Objective Structured Assessment of Technical Skill (OSATS),51,55,57 the Global Operative Assessment of Laparoscopic Skills (GOALS),61,65 and the Fundamentals of Laparoscopic Skills (FLS; or its predecessor the McGill Inanimate System for Training and Evaluation of Laparoscopic Skill).58,59,68
As noted above, using a measure as a surrogate requires evidence of correlation with the target outcome and also evidence that the surrogate changes when the target outcome changes. Whereas all studies reported the first criterion, only one study reported the second.80 This study assessed ventriculostomy cannulation success in patients and in an augmented-reality simulator. Whereas both outcomes improved with training, posttraining simulation success approached 100% in all participants, resulting in restriction of range that in turn attenuated the correlation. As such, pretraining correlation was greater (r = 0.76) than posttraining correlation (r = 0.34).
Appendix 1 reports the sources of validity evidence collected for each study and whether we judged it as favorable or unfavorable. All 33 studies reported a statistical association between two variables. Eleven studies explored additional relations with other variables by evaluating how scores varied by participants’ training level.
For simulation-based outcomes, 13 studies reported internal structure evidence, 12 reported content evidence, 2 reported response process evidence (e.g., comparison of on-site and off-site raters), and 1 reported consequences (rigorous standard-setting method). For patient-related outcomes, 10 studies reported content evidence, 9 reported internal structure evidence, and 1 reported response process evidence (unfavorable: key elements not visible on video). None reported evidence of consequences.
Methodological and reporting quality
We summarize study quality in Supplemental Digital Table 1 available at http://links.lww.com/ACADMED/A246. Of the 33 studies, 10 failed to report the number of learners providing data, and 18 failed to report the number of patients contributing data. Among the 15 studies reporting the number of patients, the average number of patients per trainee ranged from 1 to 5.9, with a median of 1. Only 2 studies reported demographic information on patients. The average MERSQI score was 13.4 (standard deviation, 1.4) from a maximum possible of 18.
Three studies calculated the correlation coefficient using more than one simulation-based data point per trainee (i.e., an inappropriate unit of analysis), and 1 of these studies also used more than one patient-related data point per trainee. Seven studies did not report sufficient information to determine the unit of analysis.
Seventeen studies listed the correlation between simulation-based and patient-related outcomes as the primary study objective, 3 listed it as a secondary objective, 1 listed it as an objective without prioritization, and 12 did not mention it as an objective. Only 3 studies reported a power analysis for a calculation involving a patient-related outcome.
Twenty-five studies reported multiple correlation coefficients, yet only 2 identified one analysis as a primary study objective—hence, most left it to the reader to prioritize among multiple reported analyses.
Our synthesis of 33 studies suggests that properly developed and validated simulation-based assessments can supplement and potentially replace measures of provider behaviors and patient outcomes for select procedural skills. We found that pooled correlations with simulation outcomes were, on average, large for provider behaviors, medium for time behaviors, and small for patient outcomes. Although between-study inconsistency was high, all but two of the individual coefficients were positive, and most were of medium magnitude or higher. Subgroup analyses indicated stronger correlations for participants with greater experience (i.e., practicing physicians > resident physicians), for direct observations of performance, and for assessments conducted before training.
However, available evidence provides only limited support for specific instruments. We identified large pooled correlations (r ≥ 0.68) and generally favorable validity evidence for three commonly used procedural skills assessment instruments: OSATS, FLS, and GOALS. All other instruments appeared only once in our review.
Although most simulation-based assessments demonstrated favorable correlations with provider behaviors and patient outcomes, just one study80 reported evidence indicating how changes in the simulation-based outcome corresponded with changes in the patient-related outcome, an important element in the chain of evidence supporting the use of such measures.19 Moreover, we found relatively sparse validity evidence beyond the correlation data. Thus, prior to using any simulation-based assessment tool, we encourage educators to carefully review all available validity evidence to verify that the evidence supports the intended use in their local curriculum.15
We intentionally included studies reflecting a variety of technology-enhanced simulation modalities, learner populations, clinical topics, and assessment methods. Although this variation likely contributed to the high between-study inconsistency, including more studies ensured a larger sample size and increased statistical power, thus expanding the applicability of our findings. We could not include 26 studies reporting both simulation-based and patient-related outcomes because authors did not report data linking these outcomes. We successfully contacted 57% of these authors, but most did not supply the needed information. We chose not to include standardized patient simulation, which may explain the dominance of procedural skills assessments.
Although our judgments regarding validity evidence (favorable/unfavorable) were grounded in accepted standards, we recognize that validity evidence and validation are far more nuanced than our simple classification scheme and that other schemes could be justified. Our classifications serve the present purpose of identifying broad strengths and weaknesses in this evidence base but may be insufficient to appropriately evaluate the validity of a specific instrument’s scores for a specific application.
We acknowledge that all surrogate outcomes have potential limitations, including noncausal associations, nonuniform response to change, and incomplete representation of the task.12,20 Researchers must address these limitations systematically before educators can confidently use simulation-based assessments to replace workplace-based assessments.
Study strengths include the rigor of our search, abstraction, and analysis process. We found no evidence to suggest publication bias, although that cannot be excluded.
Integration with other research
Recent systematic reviews have summarized the prevalence of validity evidence for simulation-based assessments broadly16 and clarified the specific data elements contributing to each evidence source.22 We extend these findings by detailing the magnitude and quality of the associations between simulation-based outcomes and patient-related outcomes. Our focus on assessment also complements other recent reviews examining the benefits of simulation-based training on patient-related outcomes.17,18
Our review, which included 33 studies focused on procedural tasks, complements a review of patient-related assessments focused on direct observation of nonprocedural behaviors.81 This predominance of procedural tasks is consistent with the findings of our recent systematic reviews of simulation-based assessments,16,17 but it contrasts with those of our reviews of simulation-based training that identified more than 200 studies addressing nonprocedural tasks.25,82,83 The preferential study of procedural tasks when evaluating simulation-based assessments and when measuring patient outcomes may not be related solely to the educational modality (i.e., technology-enhanced simulation), and it might reflect biased topic selection. Alternatively, it could reflect challenges in conducting workplace-based assessments of nonprocedural tasks or indicate that other approaches (e.g., standardized patients)39 have met this need.
Implications for practice and research
A number of factors likely contribute to our finding that the correlations between simulation-based assessments and provider behaviors are highest for practicing physicians and lowest for medical students. First, just 12 studies included either medical students or practicing physicians, and a larger sample may yield different results. Second, we suggest that practicing physicians’ performance is more consistent across contexts and that the higher variability of trainees’ performance may attenuate the correlation. Third, trainees’ workplace-based assessments may be influenced by variables absent from simulated settings that further reduce the correlation, such as stress, worry about patient harm, and cues or assistance from supervisors. No matter the explanation, the weak correlations among trainees suggest a role for dual assessment approaches: Simulation-based assessments might be most appropriate early in training (thus conserving clinical time and resources and protecting patients from potential harm), with workplace-based assessments used in later stages. By contrast, the stronger correlations for physicians suggest that simulation-based assessments may be sufficient for low- or moderate-stakes contexts.
During this review and in our discussions with authors, we identified several methodological concerns related to research evaluating the links between simulation-based assessments and provider behaviors and patient outcomes. First, because training generally reduces between-subject variability, the restricted range of posttraining scores will often cause pretraining correlation to be higher than posttraining correlation (as we confirmed in subgroup analyses and in the one study reporting pre- and posttraining analyses). Second, “a correlate does not a surrogate make.”20 Research seeking to validate surrogate outcomes should consider not only the baseline correlation but also whether scores from the two assessments change in parallel as the construct of measurement (e.g., trainee performance) evolves with training or over time. These two points together illustrate the potential confounding effect of training when establishing a relationship between simulation-based and patient-related assessments. Third, the precision of the assessment improves with multiple repetitions, but so does the participants’ performance, as evidenced by the several studies that showed score improvement over multiple repetitions of an assessment task (data not reported). Thus, educators must consider the intended inference when deciding which data point(s) from a repeating activity to score. Fourth, we noted wide variation in sample size and suspect that many studies were underpowered to yield reliable estimates of the correlation. Investigators could use our pooled correlations to estimate power when designing future studies.
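On the fourth point, the standard Fisher z approximation translates an anticipated correlation into a required sample size. The function below is an illustrative sketch, not a procedure taken from the reviewed studies:

```python
import math
from statistics import NormalDist

def n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate sample size needed to detect a correlation r with a
    two-sided test, using the standard Fisher z approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    return math.ceil(((z_alpha + z_beta) / math.atanh(r)) ** 2 + 3)
```

Under this approximation, detecting a correlation of 0.51 (our pooled estimate for provider behaviors) at 80% power requires roughly 28 trainees, whereas detecting 0.24 (patient outcomes) requires well over 100, more than most of the included studies enrolled.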
In conclusion, correlation alone does not establish validity. Although most of the studies in our sample demonstrated supportive evidence of relations with other variables, we found substantial gaps among all other sources of validity evidence. We encourage researchers to seek a broad variety of evidence22 when evaluating the validity of both simulation-based and workplace-based assessments.
Acknowledgments: The authors wish to thank Stanley J. Hamstra, PhD (Department of Medicine, University of Ottawa, Ottawa, Ontario, Canada), and Jason H. Szostek, MD, and Amy T. Wang, MD (Department of Medicine, Mayo Clinic College of Medicine, Rochester, Minnesota), for their assistance in the initial literature search. They received no compensation for their contributions.
2. Frank JR, Snell LS, Cate OT, et al. Competency-based medical education: Theory to practice. Med Teach. 2010;32:638–645
3. Green ML, Aagaard EM, Caverzagie KJ, et al. Charting the road to competence: Developmental milestones for internal medicine residency training. J Grad Med Educ. 2009;1:5–20
5. Wayne DB, Butter J, Siddall VJ, et al. Mastery learning of advanced cardiac life support skills by internal medicine residents using simulation technology and deliberate practice. J Gen Intern Med. 2006;21:251–256
6. Weinberger SE, Pereira AG, Iobst WF, Mechaber AJ, Bronze MS; Alliance for Academic Internal Medicine Education Redesign Task Force II. Competency-based education and training in internal medicine. Ann Intern Med. 2010;153:751–756
7. Cook DA, West CP. Perspective: Reconsidering the focus on “outcomes research” in medical education: A cautionary note. Acad Med. 2013;88:162–167
8. Norcini JJ, McKinley DW. Assessment methods in medical education. Teach Teach Educ. 2007;23:239–250
9. Shea JA. Mind the gap: Some reasons why medical education research is different from health services research. Med Educ. 2001;35:319–320
10. Dijksterhuis MG, Voorhuis M, Teunissen PW, et al. Assessment of competence and progressive independence in postgraduate clinical training. Med Educ. 2009;43:1156–1165
11. Norcini J, Burch V. Workplace-based assessment as an educational tool: AMEE guide no. 31. Med Teach. 2007;29:855–871
12. Yudkin JS, Lipska KJ, Montori VM. The idolatry of the surrogate. BMJ. 2011;343:d7995
13. Nasca TJ, Philibert I, Brigham T, Flynn TC. The next GME accreditation system—rationale and benefits. N Engl J Med. 2012;366:1051–1056
14. Philibert I, Nasca T, Brigham T, Shapiro J. Duty-hour limits and patient care and resident outcomes: Can high-quality studies offer insight into complex relationships? Annu Rev Med. 2013;64:467–483
15. Kane MT. Validation as a pragmatic, scientific activity. J Educ Meas. 2013;50:115–122
16. Cook DA, Brydges R, Zendejas B, Hamstra SJ, Hatala R. Technology-enhanced simulation to assess health professionals: A systematic review of validity evidence, research methods, and reporting quality. Acad Med. 2013;88:872–883
17. Zendejas B, Brydges R, Wang AT, Cook DA. Patient outcomes in simulation-based medical education: A systematic review. J Gen Intern Med. 2013;28:1078–1089
18. Schmidt E, Goldhaber-Fiebert SN, Ho LA, McDonald KM. Simulation exercises as a patient safety strategy: A systematic review. Ann Intern Med. 2013;158(5 pt 2):426–432
19. Bucher HC, Guyatt GH, Cook DJ, Holbrook A, McAlister FA. Users’ guides to the medical literature: XIX. Applying clinical trial results. A. How to use an article measuring the effect of an intervention on surrogate end points. JAMA. 1999;282:771–778
20. Fleming TR, DeMets DL. Surrogate end points in clinical trials: Are we being misled? Ann Intern Med. 1996;125:605–613
21. Boulet JR, Jeffries PR, Hatala RA, Korndorffer JR Jr, Feinstein DM, Roche JP. Research regarding methods of assessing learning outcomes. Simul Healthc. 2011;6(suppl):S48–S51
22. Cook DA, Zendejas B, Hamstra SJ, Hatala R, Brydges R. What counts as validity evidence? Examples and prevalence in a systematic review of simulation-based assessment. Adv Health Sci Educ Theory Pract. 2014;19:233–250
23. Downing SM. Validity: On meaningful interpretation of assessment data. Med Educ. 2003;37:830–837
24. Moher D, Liberati A, Tetzlaff J, Altman DG; PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. Ann Intern Med. 2009;151:264–269, W64
25. Cook DA, Hatala R, Brydges R, et al. Technology-enhanced simulation for health professions education: A systematic review and meta-analysis. JAMA. 2011;306:978–988
26. Dieckmann P, Phero JC, Issenberg SB, Kardong-Edgren S, Ostergaard D, Ringsted C. The first research consensus summit of the Society for Simulation in Healthcare: Conduction and a synthesis of the results. Simul Healthc. 2011;6(suppl):S1–S9
27. Hamstra SJ, Dubrowski A. Effective training and assessment of surgical skills, and the correlates of performance. Surg Innov. 2005;12:71–77
28. Issenberg SB, Ringsted C, Ostergaard D, Dieckmann P. Setting a research agenda for simulation-based healthcare education: A synthesis of the outcome from an Utstein style meeting. Simul Healthc. 2011;6:155–167
29. McGaghie WC. Medical education research as translational science. Sci Transl Med. 2010;2:19cm8
30. McGaghie WC, Draycott TJ, Dunn WF, Lopez CM, Stefanidis D. Evaluating the impact of simulation on translational patient outcomes. Simul Healthc. 2011;6(suppl):S42–S47
31. McGaghie WC, Issenberg SB, Cohen ER, Barsuk JH, Wayne DB. Translational educational research: A necessity for effective health-care improvement. Chest. 2012;142:1097–1103
32. Wayne DB, McGaghie WC. Use of simulation-based medical education to improve patient care quality. Resuscitation. 2010;81:1455–1456
33. Ahmed K, Jawad M, Abboudi M, et al. Effectiveness of procedural simulation in urology: A systematic review. J Urol. 2011;186:26–34
34. Byrne AJ, Greaves JD. Assessment instruments used during anaesthetic simulation: Review of published studies. Br J Anaesth. 2001;86:445–450
35. Edler AA, Fanning RG, Chen MI, et al. Patient simulation: A literary synthesis of assessment tools in anesthesiology. J Educ Eval Health Prof. 2009;6:3
36. Fitzgerald TN, Duffy AJ, Bell RL, Berman L, Longo WE, Roberts KE. Computer-based endoscopy simulation: Emerging roles in teaching and professional skills assessment. J Surg Educ. 2008;65:229–235
37. Kardong-Edgren S, Adamson KA, Fitzgerald C. A review of currently published evaluation instruments for human patient simulation. Clin Simul Nurs. 2010;6:e25–e35
38. Van Nortwick SS, Lendvay TS, Jensen AR, Wright AS, Horvath KD, Kim S. Methodologies for establishing validity in surgical simulation studies. Surgery. 2010;147:622–630
39. Cleland JA, Abe K, Rethans JJ. The use of simulated patients in medical education: AMEE guide no 42. Med Teach. 2009;31:477–486
40. Cook DA, Levinson AJ, Garside S, Dupras DM, Erwin PJ, Montori VM. Internet-based learning in the health professions: A meta-analysis. JAMA. 2008;300:1181–1196
41. Fleiss JL, Cohen J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educ Psychol Meas. 1973;33:613–619
42. Cohen J. Statistical Power Analysis for the Behavioral Sciences. New York, NY: Routledge Academic; 1988
43. Reed DA, Cook DA, Beckman TJ, Levine RB, Kern DE, Wright SM. Association between funding and quality of published medical education research. JAMA. 2007;298:1002–1009
44. Borenstein M. Effect sizes for continuous data. In: Cooper H, Hedges LV, Valentine JC, eds. The Handbook of Research Synthesis and Meta-Analysis. New York, NY: Russell Sage Foundation; 2009
45. Peterson RA, Brown SP. On the use of beta coefficients in meta-analysis. J Appl Psychol. 2005;90:175–181
46. Terrin N, Schmid CH, Lau J, Olkin I. Adjusting for publication bias in the presence of heterogeneity. Stat Med. 2003;22:2113–2126
47. Higgins JP, Thompson SG, Deeks JJ, Altman DG. Measuring inconsistency in meta-analyses. BMJ. 2003;327:557–560
48. Wohaibi EM, Bush RW, Earle DB, Seymour NE. Surgical resident performance on a virtual reality simulator correlates with operating room performance. J Surg Res. 2010;160:67–72
49. Scerbo MW, Schmidt EA, Bliss JP. Comparison of a virtual reality simulator and simulated limbs for phlebotomy training. J Infus Nurs. 2006;29:214–224
50. Banerjee PP, Edward DP, Liang S, et al. Concurrent and face validity of a capsulorhexis simulation with respect to human patients. Stud Health Technol Inform. 2012;173:35–41
51. Banks EH, Chudnoff S, Karmin I, Wang C, Pardanani S. Does a surgical simulator improve resident operative performance of laparoscopic tubal ligation? Am J Obstet Gynecol. 2007;197:541.e1–541.e5
52. Beard JD, Jolly BC, Newble DI, Thomas WE, Donnelly J, Southgate LJ. Assessing the technical skills of surgical trainees. Br J Surg. 2005;92:778–782
53. Crabtree NA, Chandra DB, Weiss ID, Joo HS, Naik VN. Fibreoptic airway training: Correlation of simulator performance and clinical skill. Can J Anaesth. 2008;55:100–104
54. Curtis DA, Lind SL, Brear S, Finzen FC. The correlation of student performance in preclinical and clinical prosthodontic assessments. J Dent Educ. 2007;71:365–372
55. Datta V, Bann S, Beard J, Mandalia M, Darzi A. Comparison of bench test evaluations of surgical skill with live operating performance assessments. J Am Coll Surg. 2004;199:603–606
56. Ende A, Zopf Y, Konturek P, et al. Strategies for training in diagnostic upper endoscopy: A prospective, randomized trial. Gastrointest Endosc. 2012;75:254–260
57. Faulkner H, Regehr G, Martin J, Reznick R. Validation of an objective structured assessment of technical skill for surgical residents. Acad Med. 1996;71:1363–1365
58. Feldman LS, Hagarty SE, Ghitulescu G, Stanbridge D, Fried GM. Relationship between objective assessment of technical skills and subjective in-training evaluations in surgical residents. J Am Coll Surg. 2004;198:105–110
59. Fried GM, Feldman LS, Vassiliou MC, et al. Proving the value of simulation in laparoscopic surgery. Ann Surg. 2004;240:518–525
60. Gale TC, Roberts MJ, Sice PJ, et al. Predictive validity of a selection centre testing non-technical skills for recruitment to training in anaesthesia. Br J Anaesth. 2010;105:603–609
61. Ghaderi I, Vaillancourt M, Sroka G, et al. Performance of simulated laparoscopic incisional hernia repair correlates with operating room performance. Am J Surg. 2011;201:40–45
62. Jones T, Cason CL, Mancini ME. Evaluating nurse competency: Evidence of validity for a skills recredentialing program. J Prof Nurs. 2002;18:22–28
63. Kessler DO, Auerbach M, Pusic M, Tunik MG, Foltin JC. A randomized trial of simulation-based deliberate practice for infant lumbar puncture skills. Simul Healthc. 2011;6:197–203
64. Kundhal PS, Grantcharov TP. Psychomotor performance measured in a virtual environment correlates with technical skills in the operating room. Surg Endosc. 2009;23:645–649
65. Kurashima Y, Feldman LS, Al-Sabah S, Kaneva PA, Fried GM, Vassiliou MC. A tool for training and evaluation of laparoscopic inguinal hernia repair: The Global Operative Assessment Of Laparoscopic Skills–Groin Hernia (GOALS-GH). Am J Surg. 2011;201:54–61
66. Macmillan AI, Cuschieri A. Assessment of innate ability and skills for endoscopic manipulations by the Advanced Dundee Endoscopic Psychomotor Tester: Predictive and concurrent validity. Am J Surg. 1999;177:274–277
67. Matsuda T, McDougall EM, Ono Y, et al. Positive correlation between motion analysis data on the LapMentor virtual reality laparoscopic surgical simulator and the results from videotape assessment of real laparoscopic surgeries. J Endourol. 2012;26:1506–1511
68. McCluney AL, Vassiliou MC, Kaneva PA, et al. FLS simulator performance predicts intraoperative laparoscopic skill. Surg Endosc. 2007;21:1991–1995
69. Morgan PJ, Cleave-Hogg D. Evaluation of medical students’ performance using the anaesthesia simulator. Med Educ. 2000;34:42–45
70. Morgan PJ, Cleave-Hogg D, DeSousa S, Tarshis J. High-fidelity patient simulation: Validation of performance checklists. Br J Anaesth. 2004;92:388–392
71. Morgan PJ, Cleave-Hogg DM, Guest CB, Herold J. Validity and reliability of undergraduate performance assessments in an anesthesia simulator. Can J Anaesth. 2001;48:225–233
72. Mudumbai SC, Gaba DM, Boulet JR, Howard SK, Davies MF. External validation of simulation-based assessments with other performance measures of third-year anesthesiology residents. Simul Healthc. 2012;7:73–80
73. Nunez DW, Taleghani M, Wathen WF, Abdellatif HM. Typodont versus live patient: Predicting dental students’ clinical performance. J Dent Educ. 2012;76:407–413
74. Paisley AM, Baldwin PJ, Paterson-Brown S. Validity of surgical simulation for the assessment of operative skill. Br J Surg. 2001;88:1525–1532
75. Schwid HA, Rooke GA, Carline J, et al; Anesthesia Simulator Research Consortium. Evaluation of anesthesia residents using mannequin-based simulation: A multiinstitutional study. Anesthesiology. 2002;97:1434–1444
76. Sedlack RE, Baron TH, Downing SM, Schwartz AJ. Validation of a colonoscopy simulation model for skills assessment. Am J Gastroenterol. 2007;102:64–74
77. Stitik TP, Foye PM, Nadler SF, Chen B, Schoenherr L, Von Hagen S. Injections in patients with osteoarthritis and other musculoskeletal disorders: Use of synthetic injection models for teaching physiatry residents. Am J Phys Med Rehabil. 2005;84:550–559
78. Sugiono M, Teber D, Anghel G, et al. Assessing the predictive validity and efficacy of a multimodal training programme for laparoscopic radical prostatectomy (LRP). Eur Urol. 2007;51:1332–1339
79. Wilasrusmee C, Lertsithichai P, Kittur DS. Vascular anastomosis model: Relation between competency in a laboratory-based model and surgical competency. Eur J Vasc Endovasc Surg. 2007;34:405–410
80. Yudkowsky R, Luciano C, Banerjee P, et al. Practice on an augmented reality/haptic simulator and library of virtual brains improves residents’ ability to perform a ventriculostomy. Simul Healthc. 2013;8:25–31
81. Kogan JR, Holmboe ES, Hauer KE. Tools for direct observation and assessment of clinical skills of medical trainees: A systematic review. JAMA. 2009;302:1316–1326
82. Cook DA, Brydges R, Hamstra SJ, et al. Comparative effectiveness of technology-enhanced simulation versus other instructional methods: A systematic review and meta-analysis. Simul Healthc. 2012;7:308–320
83. Cook DA, Hamstra SJ, Brydges R, et al. Comparative effectiveness of instructional design features in simulation-based education: Systematic review and meta-analysis. Med Teach. 2013;35:e867–e898