Three hundred fifty studies (84%) involved physicians at some stage in training, including 281 (67%) that involved postgraduate physician trainees (residents), 208 (50%) that involved practicing physicians, and 115 (28%) that involved medical students (some studies included participants from more than one training stage). Twenty-six studies (6%) involved nurses, and 33 studies (8%) involved other trainees including emergency medical technicians, dentists, and respiratory therapists. We could not quantify precisely how many trainees participated from each category because 115 studies enrolling 4,283 trainees did not clearly define trainee levels (e.g., combining postgraduate and practicing physicians as “experienced”).
Studies evaluated the use of technology-enhanced simulations to assess learners in diverse topics, including laparoscopic and open surgery, gastrointestinal and urological endoscopy, anesthesiology, obstetrics–gynecology, and physical examination of the heart, breast, and prostate (see Table 3). By far the most common outcome was process skills, assessed in 356 studies (85%), followed by time (172 studies, 41%) and product skills (53 studies, 13%). Thirty-one studies (7%) evaluated nontechnical outcomes, such as communication and team leadership.
Table 4 lists the named assessment tools reported five or more times. Aside from the Objective Structured Assessment of Technical Skills (OSATS)31 and the McGill Inanimate System for Training and Evaluation of Laparoscopic Skills (MISTELS),32 these common tools involved computerized virtual reality and/or motion tracking. Consistent with this observation, the most commonly used simulator devices were computer-based virtual reality systems, employed in 171 studies (41%), followed by part-task synthetic models (156 studies, 37%) and mannequins (96 studies, 23%). Live animals were used in 14 studies (3%).
Looking at trends over time, the proportionate use of simulators since 2008 is similar to the overall sample, with 72 (42%), 60 (35%), and 38 (22%) of the 172 studies published in or since 2008 involving virtual reality, models, and mannequins, respectively. Seventeen of the 31 studies of nontechnical skills (55%) were published since 2008.
Table 1 summarizes the validity evidence presented in the 417 articles. By far the most common evidence element was relations with a learner characteristic such as training status (procedural experience or training level), addressed in 306 (73%) studies. One hundred thirty-eight studies (33%) reported no validity evidence other than this. Evidence of content, reliability, and relations with a separately measured variable were each reported in approximately one-third of studies (N = 142, 163, and 128, respectively). The prevalence of content evidence here is lower than that reported for the MERSQI (Table 5) because we required new evidence, whereas the MERSQI credits the presentation of previously published evidence. Response process and consequences evidence were infrequently reported (≤5% each).
Under the category of relations with other variables, 28 studies evaluated associations between simulation-based performance and performance with real patients. These outcomes included measures of procedural time (N = 6), behaviors (instructor ratings of technique, N = 25), and patient effects (rate of procedural success or complications, N = 4). Without exception, these studies showed that higher simulator scores were associated with higher performance in clinical practice.
One hundred eighty-one studies reported only one evidence element, 78 reported two elements, and 139 reported three or more. Nineteen studies reported no substantive evidence, despite having as an aim the evaluation of an assessment tool. Of the 163 studies reporting reliability data, 106 reported one reliability type (most often interrater reliability).
Seventy-five of the 217 studies reviewed in detail (35%) used a “classical” validity framework33 (content, criterion, and construct validity) for planning and interpreting data. Another 85 (39%) used a more limited framework, such as “construct” validity alone, and 51 (24%) reported no validity framework. Only 6 (3%) invoked the currently accepted model.20
Looking at trends over time, the number of validity evidence sources has decreased slightly in recent years: Before 2008, each study reported on average 2.14 (SD 1.36) evidence sources; since 2008, the average was 1.98 (SD 1.41). Considering evidence sources separately, the change pre- to post-2008 was less than ±4 percentage points and not statistically significant (P > .05) except for relations with a separate measure (35% pre-2008, decreasing to 25%; P = .03).
Table 5 summarizes the methodological quality of all 417 articles as evaluated using the MERSQI. Over half (N = 225) were single-group, single-assessment (i.e., cross-sectional) studies, whereas another one-third (N = 152) employed a one-group pretest–posttest or crossover design. The vast majority (N = 398; 95%) employed objective outcome measurements. Thirty-nine studies (9%) made a statistical error in a main analysis.
In our two-level approach, we selected all studies reporting two or more elements of validity evidence for additional coding of methodological and reporting quality. As seen in Table 5, MERSQI scores for these 217 studies were very similar to the full set, except (as would be expected) for the prevalence of validity evidence.
We evaluated these 217 studies using the QUADAS-2. As shown in Table 5, we found only 25 studies (12%) at low risk of bias in participant selection. High bias was typically due to expert–novice (case–control) comparisons, whereas unclear bias was due to failure to describe the population (data not shown). We also had frequent concerns about the conduct of the index test and reference test (N = 85 of 217 [39%] and N = 34 of 102 [33%] judged low risk of bias, respectively). Most reference tests aligned reasonably well with the target condition (N = 83 [81%] judged low concern about applicability), as did nearly all of the index tests (N = 205 [94%] low concern).
Although the STARD criteria reported in Table 2 (discussed below) focus on reporting quality, in abstracting these elements we also coded methodological quality. For example, 86 of 159 studies with human raters reported whether raters were blinded to trainee experience, but blinding was actually done in only 66 (42%); the remaining 20 reported that raters were not blinded. Raters were blinded to one another in 83 of 135 studies with two or more raters (61%) and blinded to the reference test in 12 of 95 studies with a reference test (13%). Twenty-three studies (14%) employed only one rater per observation. Raters completed special training in 67 of 159 studies (42%). Among the 38 studies reporting a sampling strategy, 20 enrolled the entire available sample (e.g., an entire medical school class), 5 enrolled a random sample, and 13 employed defined inclusion/exclusion criteria.
Table 2 summarizes reporting quality as measured by the STARD and GRRAS. Nearly all studies (N = 186; 86%) reported a focused question, but only 135 (62%) offered a critical review of relevant literature (“cites articles relevant to the topic or study design and critically discusses these articles”),22 and 139 (64%) proposed a plan for interpreting the evidence to be presented (validity argument). Trainee flow was sparsely reported, with 58 (27%) studies reporting eligibility criteria, 38 (18%) reporting any sampling method, and 30 (14%) reporting the number eligible. Twelve studies (6%) failed to report the number enrolled, and 52 (24%) failed to define the training level for all trainees.
Only 19 (9%) studies reported sample size calculations. Among 198 studies correlating two variables or comparing two groups, at least one main statistical method was undefined in 8 (4%). Similarly, among 153 studies reporting reliability analyses, statistical methods were undefined in 18 (12%). Confidence intervals were reported for 2% of correlation coefficients (N = 2 of 92 studies) and 10% of reliability coefficients (N = 16 of 153 studies).
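The near-absence of confidence intervals is notable because a CI for a correlation coefficient is simple to compute. As an illustrative sketch (not drawn from any reviewed study), the standard Fisher z-transformation approach for a 95% interval can be written as:

```python
import math

def pearson_r_ci(r, n):
    """95% confidence interval for a Pearson correlation of r
    computed from n paired observations, via Fisher's z-transformation."""
    z = math.atanh(r)            # transform r to an approximately normal scale
    se = 1.0 / math.sqrt(n - 3)  # standard error of z
    half = 1.96 * se             # two-sided 95% critical value
    # back-transform the interval endpoints to the correlation scale
    return math.tanh(z - half), math.tanh(z + half)

# Hypothetical example: r = 0.50 observed in a sample of 100 trainees
lo, hi = pearson_r_ci(0.50, 100)
```

For r = 0.50 and n = 100 this yields an interval of roughly 0.34 to 0.63, illustrating how much uncertainty a point estimate alone conceals at typical sample sizes.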
Discussion and Conclusions
Brennan34(p8) stated, “Validity theory is rich, but the practice of validation is often impoverished.” This systematic review of simulation-based assessment suggests that such is unfortunately the case in this field of medical education. Most of the 417 studies in this sample offered only limited validity evidence, and nearly half reported only one element of new evidence. By far the most commonly reported source of validity evidence—and the sole source for one-third of studies—was the relatively weak design of expert–novice comparison. The average number of validity elements decreased slightly or remained constant in more recent studies, suggesting that conditions are not improving. Fewer than two-thirds of the studies proposed an outline of the validity evidence they expected to accrue, and one-fifth failed to interpret the results of the evidence presented. Only six studies acknowledged the current unified evidence-oriented framework.20
We also evaluated methodological quality using the MERSQI and QUADAS-2. Whereas MERSQI overall scores are somewhat higher than those reported in previous studies,8,23,26,35 QUADAS-2 ratings indicate a high prevalence of selective inclusion (case–control studies), incomplete description of the population, and lack of rater blinding—all of which have been associated with bias in clinical research.36,37 If such associations hold true in education, the findings of such studies may differ from the true properties of the assessment activity.
Reporting quality as appraised using the STARD and GRRAS criteria was also limited. The STARD guidelines were established to ensure reporting of key study features required to appraise the risk of bias (e.g., the information needed to complete instruments such as the QUADAS-2). Indeed, we often found it difficult or impossible to appraise methodological rigor (i.e., using the QUADAS-2 and MERSQI) when key information was reported vaguely or not at all.
Limitations and strengths
This review has limitations. We did not attempt to determine the direction or strength of validity evidence or judge the validity of interpretations for individual tools. However, by focusing on the type of validity evidence reported, we were able to comment on strengths and weaknesses across a diverse field and to prepare a catalog of tools for assessing many different skills (see Table 4, Supplemental Digital Table 1, http://links.lww.com/ACADMED/A130). Although beyond the scope of the present work, further evaluation of validity evidence for specific skills domains and for studies reporting on multiple tools would provide additional insight into these issues.
We found modest interrater agreement for some variables, due at least in part to incomplete or unclear reporting. We addressed this by reaching consensus on all reported data. The low interrater agreement for the QUADAS-2 applicability questions suggests that this instrument may require further refinement or clarification prior to widespread use in education.
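For readers unfamiliar with how such agreement is quantified, chance-corrected interrater agreement for categorical codings of this kind is conventionally expressed as Cohen's kappa (the benchmarks of Landis and Koch30 apply to this statistic). A minimal illustrative sketch, using made-up ratings rather than data from this review:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters' categorical codes of the same items."""
    n = len(rater1)
    # observed proportion of agreement
    po = sum(a == b for a, b in zip(rater1, rater2)) / n
    # expected agreement by chance, from each rater's marginal frequencies
    c1, c2 = Counter(rater1), Counter(rater2)
    pe = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical example: two raters code four studies as low/high risk of bias
kappa = cohens_kappa(["low", "low", "high", "high"],
                     ["low", "high", "high", "high"])
```

In this toy example the raters agree on 3 of 4 items (75%), but kappa is only 0.50 ("moderate" by the Landis and Koch benchmarks) because half of that agreement would be expected by chance alone.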
We cannot exclude the possibility of publication bias, particularly in those studies exploring associations with clinical outcomes.
We included studies regardless of study design or validity evidence presented. However, we abstracted information for only one assessment tool per report, and we applied the QUADAS-2, STARD, and GRRAS criteria to only half the articles. Although we presume that studies reporting less validity evidence would fare less favorably on these measures, we did not confirm this, and the quality of reporting and methods remains unknown for studies not selected for detailed review.
Comparison with previous reviews
The present review agrees with previous reviews of assessment in simulation-based9–16 and non-simulation-based26,38–41 education in concluding that validation research is generally lacking. The present review builds on previous reviews by applying a modern validity framework to the field of simulation-based assessment and providing a detailed summary of validity evidence for currently available tools.
Previous reviews of reporting and methodological quality in medical education have focused on studies of educational interventions23,26,42 and identified significant shortcomings therein. We are not aware of any study applying the QUADAS-2, STARD, or GRRAS guidelines to medical education. The data we present regarding reporting and methodological quality thus constitute a unique contribution.
Implications for research
Our findings suggest that current validation research methods could lead to biased results. Education researchers can minimize potential bias by avoiding selective enrollment (i.e., study designs that separately recruit experienced and inexperienced participants), describing the number and characteristics of the eligible population, and blinding raters to trainee experience, to other raters, and (when present) to the results of any tests used as related measures.
Although this study focused on simulation-based assessment, we suspect that there is room for improvement in the completeness and clarity of reporting for assessment research in health professions education generally. As noted previously, “Rote adherence to guidelines will not compensate for poor-quality research or inferior writing skills, but inclusion of the elements listed in guidelines … will enable a wide range of consumers to understand and apply the study results.”23 It may be useful to further refine the STARD and GRRAS for widespread application to educational assessment research. In the meantime, researchers might use these instruments (or our operational adaptation) to facilitate complete reporting.
Yet guidelines alone will be inadequate, in part because many authors are unaware of them or lack the skills to apply them in practice. True improvement in reporting quality will require the active efforts of journal editors and reviewers who understand, endorse, and enforce relevant standards.22
Implications for practice
This catalog of tools, indexed by clinical topic, will be of great practical value to educators searching for evidence-based assessment instruments. Unfortunately, only a handful of these tools have been subjected to validation across different assessment contexts. For most of the tools listed in Table 4, we see a preponderance of evidence for relations with other variables (especially relations with training status), and a relative lack of evidence for the content, internal structure, response process, and consequences of scores. It seems that a tool’s widespread use often outstrips the accumulation of validity evidence. To resolve this, researchers must do more than employ robust research methods. They will also need to deliberately target key evidence sources. This, in turn, would benefit from a structured agenda or argument, as has been proposed for professionalism.43 In the meantime, educators should ensure that actions based on an assessment’s scores are commensurate with the strength of the available evidence.
It is often said that assessment drives learning, and as medical education evolves toward personalized training and competency-based decisions,5 the role of educational assessment will only enlarge.44 Assessments that rely on self-report, log books, hours of training, or written tests to determine procedural competence will no longer suffice. To this end, we need both innovative tools—many of which will involve simulation—and coherent validity arguments supporting the interpretation of scores. Validity arguments, in turn, require rigorous, well-reported research providing strategically collected evidence. An arsenal of tools thus validated will enable decisions regarding formative feedback, mastery of technical and nontechnical skills, remediation, and credentialing that will streamline training and ensure the quality of health professionals and patient care.
Acknowledgments: The authors thank Jason Szostek, MD, Amy Wang, MD, and Patricia Erwin, MLS, for their efforts in article identification, and Colin West, MD, PhD, for his critical review of the manuscript (all affiliated with Mayo Clinic College of Medicine, Rochester, Minnesota).
Funding/Support: No external funding. This work was supported by an award from the Division of General Internal Medicine, Mayo Clinic.
Other Disclosures: None.
Ethical approval: Not applicable.
Previous presentations: An abstract based on this work was presented at the 2012 Simulation Summit of the Royal College of Physicians and Surgeons of Canada, Ottawa, Ontario, Canada, November 2012.
* Time skills refers to the time required to perform a task, process skills refers to performance during a task (e.g., global ratings or minor errors), and product skills refers to the final result (e.g., successful completion, major complication, or the quality of the final product).8
1. Antiel RM, Thompson SM, Reed DA, et al. ACGME duty-hour recommendations—A national survey of residency program directors. N Engl J Med. 2010;363:e12
2. West CP, Tan AD, Habermann TM, Sloan JA, Shanafelt TD. Association of resident fatigue and distress with perceived medical errors. JAMA. 2009;302:1294–1300
3. Albanese M, Mejicano G, Gruppen L. Perspective: Competency-based medical education: A defense against the four horsemen of the medical education apocalypse. Acad Med. 2008;83:1132–1139
4. Emanuel EJ, Fuchs VR. Shortening medical training by 30%. JAMA. 2012;307:1143–1144
5. Weinberger SE, Pereira AG, Iobst WF, Mechaber AJ, Bronze MS; Alliance for Academic Internal Medicine Education Redesign Task Force II. Competency-based education and training in internal medicine. Ann Intern Med. 2010;153:751–756
6. Albanese MA, Mejicano G, Mullan P, Kokotailo P, Gruppen L. Defining characteristics of educational competencies. Med Educ. 2008;42:248–255
7. Ziv A, Wolpe PR, Small SD, Glick S. Simulation-based medical education: An ethical imperative. Acad Med. 2003;78:783–788
8. Cook DA, Hatala R, Brydges R, et al. Technology-enhanced simulation for health professions education: A systematic review and meta-analysis. JAMA. 2011;306:978–988
9. Kardong-Edgren S, Adamson KA, Fitzgerald C. A review of currently published evaluation instruments for human patient simulation. Clin Simul Nursing. 2010;6:e25–e35
10. Van Nortwick SS, Lendvay TS, Jensen AR, Wright AS, Horvath KD, Kim S. Methodologies for establishing validity in surgical simulation studies. Surgery. 2010;147:622–630
11. Ahmed K, Jawad M, Abboudi M, et al. Effectiveness of procedural simulation in urology: A systematic review. J Urol. 2011;186:26–34
12. Feldman LS, Sherman V, Fried GM. Using simulators to assess laparoscopic competence: Ready for widespread use? Surgery. 2004;135:28–42
13. Schout BM, Hendrikx AJ, Scheele F, Bemelmans BL, Scherpbier AJ. Validation and implementation of surgical simulators: A critical review of present, past, and future. Surg Endosc. 2010;24:536–546
14. Byrne AJ, Greaves JD. Assessment instruments used during anaesthetic simulation: Review of published studies. Br J Anaesth. 2001;86:445–450
15. Edler AA, Fanning RG, Chen MI, et al. Patient simulation: A literary synthesis of assessment tools in anesthesiology. J Educ Eval Health Prof. 2009;6:3
16. Fitzgerald TN, Duffy AJ, Bell RL, Berman L, Longo WE, Roberts KE. Computer-based endoscopy simulation: Emerging roles in teaching and professional skills assessment. J Surg Educ. 2008;65:229–235
17. Downing SM. Validity: On meaningful interpretation of assessment data. Med Educ. 2003;37:830–837
18. Kane MT. Validation. In: Brennan RL, ed. Educational Measurement. 4th ed. Westport, Conn: Praeger; 2006:17–64
19. Messick S. Validity. In: Linn RL, ed. Educational Measurement. 3rd ed. New York, NY: American Council on Education and Macmillan; 1989:13–103
20. American Educational Research Association; American Psychological Association; National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999
21. Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments: Theory and application. Am J Med. 2006;119:166.e7–166.16
22. Cook DA, Beckman TJ, Bordage G. Quality of reporting of experimental studies in medical education: A systematic review. Med Educ. 2007;41:737–745
23. Cook DA, Levinson AJ, Garside S. Method and reporting quality in health professions education research: A systematic review. Med Educ. 2011;45:227–238
24. Price EG, Beach MC, Gary TL, et al. A systematic review of the methodological rigor of studies evaluating cultural competence training of health professionals. Acad Med. 2005;80:578–586
25. Whiting PF, Rutjes AW, Westwood ME, et al.; QUADAS-2 Group. QUADAS-2: A revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155:529–536
26. Reed DA, Cook DA, Beckman TJ, Levine RB, Kern DE, Wright SM. Association between funding and quality of published medical education research. JAMA. 2007;298:1002–1009
27. Bossuyt PM, Reitsma JB, Bruns DE, et al.; Standards for Reporting of Diagnostic Accuracy. Towards complete and accurate reporting of studies of diagnostic accuracy: The STARD initiative. Ann Intern Med. 2003;138:40–44
28. Kottner J, Audigé L, Brorson S, et al. Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed. J Clin Epidemiol. 2011;64:96–106
29. Moher D, Liberati A, Tetzlaff J, Altman DG; PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. Ann Intern Med. 2009;151:264–269, W64
30. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174
31. Martin JA, Regehr G, Reznick R, et al. Objective structured assessment of technical skill (OSATS) for surgical residents. Br J Surg. 1997;84:273–278
32. Derossis AM, Fried GM, Abrahamowicz M, Sigman HH, Barkun JS, Meakins JL. Development of a model for training and evaluation of laparoscopic skills. Am J Surg. 1998;175:482–487
33. American Psychological Association. Standards for Educational and Psychological Tests and Manuals. Washington, DC: American Psychological Association; 1966
34. Brennan RL. Perspectives on the evolution and future of educational measurement. In: Brennan RL, ed. Educational Measurement. 4th ed. Westport, Conn: Praeger; 2006:1–16
35. Reed DA, Beckman TJ, Wright SM, Levine RB, Kern DE, Cook DA. Predictive validity evidence for medical education research study quality instrument scores: Quality of submissions to JGIM’s Medical Education Special Issue. J Gen Intern Med. 2008;23:903–907
36. Lijmer JG, Mol BW, Heisterkamp S, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA. 1999;282:1061–1066
37. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: A systematic review. Ann Intern Med. 2004;140:189–202
38. Beckman TJ, Cook DA, Mandrekar JN. What is the validity evidence for assessments of clinical teaching? J Gen Intern Med. 2005;20:1159–1164
39. Kogan JR, Holmboe ES, Hauer KE. Tools for direct observation and assessment of clinical skills of medical trainees: A systematic review. JAMA. 2009;302:1316–1326
40. Howley L, Szauter K, Perkowski L, Clifton M, McNaughton N; Association of Standardized Patient Educators (ASPE). Quality of standardised patient research reports in the medical education literature: Review and recommendations. Med Educ. 2008;42:350–358
41. Ratanawongsa N, Thomas PA, Marinopoulos SS, et al. The reported validity and reliability of methods for evaluating continuing medical education: A systematic review. Acad Med. 2008;83:274–283
42. Baernstein A, Liss HK, Carney PA, Elmore JG. Trends in study methods used in undergraduate medical education research, 1969–2007. JAMA. 2007;298:1038–1045
43. Clauser BE, Margolis MJ, Holtman MC, Katsufrakis PJ, Hawkins RE. Validity considerations in the assessment of professionalism. Adv Health Sci Educ Theory Pract. 2012;17:165–181
44. Holmboe ES, Sherbino J, Long DM, Swing SR, Frank JR. The role of assessment in competency-based medical education. Med Teach. 2010;32:676–682
45. Duffy AJ, Hogle NJ, McCarthy H, et al. Construct validity for the LAPSIM laparoscopic surgical simulator. Surg Endosc. 2005;19:401–405
46. Ritter EM, McClusky DA 3rd, Lederman AB, Gallagher AG, Smith CD. Objective psychomotor skills assessment of experienced and novice flexible endoscopists with a virtual reality simulator. J Gastrointest Surg. 2003;7:871–877
47. Hsu JH, Younan D, Pandalai S, et al. Use of computer simulation for determining endovascular skill levels in a carotid stenting model. J Vasc Surg. 2004;40:1118–1125
48. Grantcharov TP, Rosenberg J, Pahle E, Funch-Jensen P. Virtual reality computer simulation. Surg Endosc. 2001;15:242–244
49. Botden SM, Buzink SN, Schijven MP, Jakimowicz JJ. ProMIS augmented reality training of laparoscopic procedures face validity. Simul Healthc. 2008;3:97–102
50. Rossi JV, Verma D, Fujii GY, et al. Virtual vitreoretinal surgical simulator as a training tool. Retina (Philadelphia, Pa). 2004;24:231–236
51. Zhang A, Hünerbein M, Dai Y, Schlag PM, Beller S. Construct validity testing of a laparoscopic surgery simulator (Lap Mentor): Evaluation of surgical skill with a virtual laparoscopic training simulator. Surg Endosc. 2008;22:1440–1444
52. Rosenthal R, Gantert WA, Scheidegger D, Oertli D. Can skills assessment on a virtual reality trainer predict a surgical trainee’s talent in laparoscopic surgery? Surg Endosc. 2006;20:1286–1290
53. Datta V, Mackay S, Mandalia M, Darzi A. The use of electromagnetic motion tracking analysis to objectively measure open surgical skill in the laboratory-based model. J Am Coll Surg. 2001;193:479–485
Supplemental Digital Content
© 2013 Association of American Medical Colleges