Technology-Enhanced Simulation to Assess Health Professionals

A Systematic Review of Validity Evidence, Research Methods, and Reporting Quality

Cook, David A. MD, MHPE; Brydges, Ryan PhD; Zendejas, Benjamin MD, MSc; Hamstra, Stanley J. PhD; Hatala, Rose MD, MSc

doi: 10.1097/ACM.0b013e31828ffdcf


The growing complexity of modern health care, together with the imperatives of limiting the time spent training health professionals1 and maximizing patient safety,2 suggests the need for new models of medical education.3 If medical training is to be shortened4 or evolve to a competency-based system,5 valid assessments will be required to ensure that trainees have in fact achieved and maintained required proficiencies.6 Simulation facilitates deliberate practice and targeted assessment in a safe environment.7 Given its potential role in formative and summative assessments, there is a pressing need to understand the quality and limitations of evidence regarding simulation-based assessment and to highlight areas for improvement in future research.

A recent systematic review synthesized the results of hundreds of articles evaluating technology-enhanced simulation for training purposes.8 Far less is known about the validity of simulation-based assessments. Except for one selective review of medical and nursing publications,9 previous reviews of simulation-based assessment have focused on specific medical topics in surgery,10–13 anesthesiology,14,15 and virtual reality endoscopy.16 Although these reviews identified limitations in the quality of validity evidence, none conducted an in-depth evaluation of study methods, and all used an outdated validity framework.17 A comprehensive compilation of tools used for simulation-based assessment, and an evaluation of the study methods and validity evidence for these tools, would provide educators, researchers, and policy makers a foundation from which to advance.

The field of validity of assessments has evolved substantially over the past 60 years. The “classical” framework of different types of validity (face, content, criterion, and concurrent) has given way to a unified model in which validity evidence is collected from five sources—namely, content, response process, internal structure, relations with other variables, and consequences.17 Evidence is systematically collected to test the hypothesis that score interpretations are valid for their intended use.18 This model, first proposed by Messick19 in 1989, was formally adopted by the American Educational Research Association, American Psychological Association, and National Council on Measurement in Education in 1999,20 and it remains the standard approach by which to evaluate the validity of an instrument’s scores.21

Because judgments of evidence rest on clear and transparent reporting, it is also important to understand issues related to the reporting of validity evidence. Reviews in medical education have appraised the reporting quality of studies evaluating educational interventions,22–24 but we are not aware of appraisals of reporting quality for studies evaluating assessment tools.

To address these gaps, we sought to identify and summarize the validity evidence and reporting quality for all studies of technology-enhanced simulation-based assessment involving health professions learners. We did this by conducting a systematic review in which we evaluated the prevalence of validity evidence,20 potential methodological biases and limitations,25,26 and reporting quality.27,28 We sought to answer the following questions:

  1. What are the characteristics of tools to assess learning outcomes (knowledge, skills, attitudes) using technology-enhanced simulation?
  2. What validity evidence has been reported for these assessments?
  3. What is the methodological quality and reporting quality of the studies from which this evidence derives?


This review was planned, conducted, and reported in adherence to PRISMA standards of quality for reporting systematic reviews.29 We conducted this review of simulation-based assessment concurrently with a review of simulation-based training. Although these reviews addressed different questions and employed distinct inclusion criteria, some details of study identification and selection have been published previously.8

Evaluating the validity of education assessments

Before conducting our search of the literature, we determined the criteria and instruments we would use to evaluate the articles. We coded the prevalence of each of the five evidence sources noted above: content, response process, internal structure, relations with other variables, and consequences (see Table 1 for definitions). For internal structure and relations with other variables, we counted separately several distinct elements (see Table 1). Kane18 has extended Messick’s19 framework by emphasizing the importance of a systematic approach to validation, including a carefully articulated validity argument, so we coded for the presence of validity arguments in the planning and interpretation of each study. We focused on assessment and did not include evidence regarding the “validity” of training activities.

Table 1:
Definitions and Prevalence of Validity Evidence in a 2011 Systematic Review of Technology-Enhanced Simulation for Assessment

Evaluating method and reporting quality for assessment studies

We found no instruments for the appraisal of studies evaluating educational assessments. However, we identified three evidence-based instruments for appraising the methodological or reporting quality of studies of clinical diagnostic tests.25,27,28 The paradigm of clinical diagnosis applies readily to educational assessment, inasmuch as the intent of assessment is to make judgments (i.e., diagnoses) about learners for the purpose of making decisions regarding, for example, mastery or needed improvements. One important difference in research design is that studies evaluating clinical tests typically employ an independent gold standard, whereas gold standards are rarely available for educational assessments. However, we found substantial overlap in most reporting and methodological domains.

The Standards for Reporting Diagnostic Accuracy (STARD)27 were published in 2003, for the purpose of “improv[ing] the quality of reporting of studies of diagnostic accuracy.” Whereas some items on this checklist refer to studies making comparison with a reference test (gold standard), most items apply generally to any study of a diagnostic test. The Guidelines for Reporting Reliability and Agreement Studies (GRRAS),28 published in 2011, complement the STARD by emphasizing items specific to reliability studies.

To evaluate study biases, we used the revised Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2).25 The purpose of this seven-question tool is to “assess the quality of primary diagnostic accuracy studies” in the context of a systematic review. Three domains (participant selection, index test, and reference test) each have questions on possible bias (systematic flaws that distort study results) and applicability to the review question. A fourth domain evaluates bias in study flow. We did not code the participant selection-applicability question (i.e., the match between study participants and the review question) because all studies, per inclusion criteria, enrolled health professions learners.

We also evaluated study methods using the Medical Education Research Study Quality Instrument26 (MERSQI), which was developed to appraise the methodological quality of any quantitative research study.

Study eligibility

We included original research studies published in any language that had as a stated purpose the evaluation of technology-enhanced simulation for assessing health professions learners at any stage in training or practice. We made no restrictions based on study design or validity evidence reported. We defined technology-enhanced simulation as an educational tool or device with which the learner physically interacts to mimic an aspect of clinical care for the purpose of teaching or assessment.8

Study identification

Our search strategy has been previously published in full.8 To summarize briefly, we searched MEDLINE, EMBASE, CINAHL, PsycINFO, ERIC, Web of Science, and Scopus for relevant articles using a search strategy developed by an experienced research librarian. The search included terms related to the topic (simulat*, mannequin, virtual, etc.), population (education medical, education nursing, students health occupations, etc.), and assessment (assess*, evaluat*, valid*, reliab*, etc.). We used no beginning date cutoff, and the last date of search was May 11, 2011. We supplemented this search by examining the entire reference list from several published reviews9–11,14–16 and all articles published in two journals devoted to health professions simulation (Simulation in Healthcare and Clinical Simulation in Nursing).

Study selection

We worked independently and in duplicate to screen all candidate studies for inclusion, beginning with titles and abstracts and proceeding to the full text of studies judged eligible or uncertain. We resolved conflicts by consensus. Chance-adjusted interrater agreement for study inclusion, determined using the intraclass correlation coefficient (ICC), was 0.72.

Data extraction and synthesis

We developed a data abstraction form through iterative testing and revision. We abstracted data independently and in duplicate for all variables (see Tables 1 and 2 for definitions), resolving conflicts by consensus. We abstracted data in two levels: basic and detailed. For all studies, we abstracted information on the number and training level of learners, clinical topic, study design, outcomes (reaction, knowledge, skills [distinguished as time, process, and product skills*], behaviors, and patient effects, as previously detailed),8 validity evidence (as above), and methodological quality (using the MERSQI). For studies reporting two or more elements of validity evidence, we abstracted additional details on learners, raters, outcome metrics, validity evidence, study quality (using the QUADAS-2), and reporting quality (using the STARD and GRRAS). We employed the two-level approach for reasons of feasibility, and also to ensure that coded studies had the evaluation of an assessment tool as a central focus (rather than as a small part of a study with a different focus). We would expect that studies reporting more validity evidence are generally better designed and reported overall, and thereby that excluding studies with less evidence most likely overestimates the quality of methods and reporting for the sample as a whole. However, we did not collect empiric data in this regard. ICC for this basic/detailed decision was 0.80.

Table 2:
Criteria and Prevalence of Reporting Quality as Determined by the Standards for Reporting Diagnostic Accuracy (STARD) and the Guidelines for Reporting Reliability and Agreement Studies (GRRAS) in a 2011 Systematic Review of Technology-Enhanced Simulation for Assessment

ICCs for MERSQI variables ranged from 0.51 to 0.84. For validity screening, ICCs ranged from 0.67 to 0.91 except for response process (ICC = 0.34, raw agreement 95%) and consequences (ICC = 0.56), and for QUADAS-2 scores, ICCs ranged from 0.55 to 0.72 except for index test and reference test applicability (ICC = 0.17 and ICC = 0.39, raw agreement 94% and 83%, respectively). For reporting quality, nearly all ICCs were >0.5, and all were >0.3 except prospective/retrospective data collection (ICC = 0.27), rationale for test relationship (ICC = 0.22), and correlation coefficient confidence interval (ICC = 0, raw agreement 99%). ICC values 0.21 to 0.4 are considered “fair,” 0.41 to 0.6 are “moderate,” and 0.61 to 0.8 are “substantial.”30
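The interpretive bands quoted above, and the way a chance-adjusted coefficient can fall far below raw agreement when one code is rare, can be illustrated with a short sketch. This is not the authors' analysis: the review reports ICCs, whereas the example below substitutes Cohen's kappa on a 2×2 table of two raters' yes/no codes, only because kappa is simple to compute by hand. The function names and all counts are hypothetical. The text quotes no label below 0.21 or above 0.8, so the sketch returns None there.

```python
def agreement_label(coefficient):
    """Return the band quoted in the text, after rounding to two decimals."""
    value = round(coefficient, 2)
    if 0.21 <= value <= 0.40:
        return "fair"
    if 0.41 <= value <= 0.60:
        return "moderate"
    if 0.61 <= value <= 0.80:
        return "substantial"
    return None  # no label is quoted in the text outside 0.21 to 0.8


def raw_and_kappa(both_yes, only_a_yes, only_b_yes, both_no):
    """Raw agreement and Cohen's kappa for two raters coding yes/no.

    Kappa is a stand-in for the ICC used in the review (an assumption).
    It is undefined when expected agreement equals 1; not handled here.
    """
    total = both_yes + only_a_yes + only_b_yes + both_no
    raw = (both_yes + both_no) / total
    a_yes = (both_yes + only_a_yes) / total
    b_yes = (both_yes + only_b_yes) / total
    # Agreement expected by chance if the raters coded independently
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    kappa = (raw - expected) / (1 - expected)
    return raw, kappa


# Hypothetical counts: a rarely reported feature coded in 100 studies
raw, kappa = raw_and_kappa(both_yes=1, only_a_yes=2, only_b_yes=3, both_no=94)
print(f"raw agreement = {raw:.2f}, kappa = {kappa:.2f}, label = {agreement_label(kappa)}")
```

With these invented counts the raters agree on 95 of 100 studies, yet almost all of that agreement is expected by chance because the feature is so rarely present. The same mechanism plausibly underlies the low coefficients paired with high raw agreement reported above.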

Most studies employed more than one assessment tool. In such instances, we selected one tool to code on the basis of (in order of priority) (1) the strongest validity evidence, (2) the tool noted in the report title or purpose, (3) a named tool, or (4) the tool measuring the highest outcome.
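The priority ordering above amounts to a lexicographic comparison, which can be sketched as follows. The record fields are hypothetical, and two operational choices are assumptions that the text does not specify: "strongest validity evidence" is approximated by a count of evidence elements, and "highest outcome" by an ordinal rank.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    evidence_elements: int      # assumed proxy for "strongest validity evidence"
    in_title_or_purpose: bool   # tool noted in the report title or purpose
    is_named: bool              # tool has a formal name
    outcome_rank: int           # assumed ordinal; larger means a "higher" outcome


def select_tool(tools):
    """Pick one tool per study using the four criteria in priority order."""
    # Tuples compare element by element, so a later criterion only breaks
    # ties left by the earlier ones. True sorts above False.
    return max(
        tools,
        key=lambda t: (t.evidence_elements, t.in_title_or_purpose,
                       t.is_named, t.outcome_rank),
    )


# Hypothetical study reporting three tools
candidates = [
    Tool("global rating scale", 2, False, False, 3),
    Tool("simulator motion metrics", 2, True, False, 3),
    Tool("procedural checklist", 1, True, True, 4),
]
chosen = select_tool(candidates)
print(chosen.name)
```

In this invented example the checklist is named and measures a higher outcome, but it loses on the first criterion; the tie between the remaining two tools is then broken by appearance in the title or purpose.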

We summarized the data using counts and, where appropriate, means. We used SAS 9.3 (SAS Institute Inc., Cary, North Carolina) to perform t test and chi-square analyses of change over time, with an alpha level of 0.05.


Trial flow and participants

From 10,911 potentially relevant articles, we included 417 studies enrolling 19,075 trainees (median 30 trainees per study, interquartile range 20–50); see Figure 1 for details. Four of these were published in a language other than English. Forty-one percent of the articles (N = 172) were published in or after 2008. Table 3 summarizes study characteristics. Because we found so many studies, we do not reference them all in this report. However, Supplemental Digital Appendix 1 provides a complete list of studies, Supplemental Digital Table 1 provides key coding results, and Supplemental Digital Table 2 lists instruments by clinical topic. We coded all 417 included studies for tool characteristics, validity evidence, and MERSQI criteria. We coded 217 studies (those reporting two or more elements of validity evidence) in greater detail using the QUADAS-2, STARD, and GRRAS criteria.

Table 3:
Description of Studies Included in a 2011 Systematic Review of Technology-Enhanced Simulation for Assessment
Figure 1:
Trial flow for a 2011 systematic review of technology-enhanced simulation for assessment. MERSQI indicates Medical Education Research Study Quality Instrument; QUADAS-2, Quality Assessment of Diagnostic Accuracy Studies version 2; STARD, Standards for Reporting Diagnostic Accuracy; GRRAS, Guidelines for Reporting Reliability and Agreement Studies.

Three hundred fifty studies (84%) involved physicians at some stage in training, including 281 (67%) that involved postgraduate physician trainees (residents), 208 (50%) that involved practicing physicians, and 115 (28%) that involved medical students (some studies included participants from more than one training stage). Twenty-six studies (6%) involved nurses, and 33 studies (8%) involved other trainees including emergency medical technicians, dentists, and respiratory therapists. We could not quantify precisely how many trainees participated from each category because 115 studies enrolling 4,283 trainees did not clearly define trainee levels (e.g., combining postgraduate and practicing physicians as “experienced”).

Tool characteristics

Studies evaluated the use of technology-enhanced simulations to assess learners in diverse topics, including laparoscopic and open surgery, gastrointestinal and urological endoscopy, anesthesiology, obstetrics–gynecology, and physical examination of the heart, breast, and prostate (see Table 3). By far the most common outcome was process skill, assessed in 356 studies (85%), followed by time (172 studies, 41%) and product skills (53 studies, 13%). Thirty-one studies (7%) evaluated nontechnical outcomes, such as communication and team leadership.

Table 4 lists the named assessment tools reported five or more times. Aside from the Objective Structured Assessment of Technical Skills (OSATS)31 and the McGill Inanimate System for Training and Evaluation of Laparoscopic Skills (MISTELS),32 these common tools involved computerized virtual reality and/or motion tracking. Consistent with this observation, the most commonly used simulator devices were computer-based virtual reality systems, employed in 171 studies (41%), followed by part-task synthetic models (156 studies, 37%) and mannequins (96 studies, 23%). Live animals were used in 14 studies (3%).

Table 4:
Commonly Reported Tools for Simulation-Based Assessment

Looking at trends over time, the proportionate use of simulators since 2008 is similar to the overall sample, with 72 (42%), 60 (35%), and 38 (22%) of the 172 studies published in or since 2008 involving virtual reality, models, and mannequins, respectively. Seventeen of the 31 studies of nontechnical skills (55%) were published since 2008.

Validity evidence

Table 1 summarizes the validity evidence presented in the 417 articles. By far the most common evidence element was relations with a learner characteristic such as training status (procedural experience or training level), addressed in 306 (73%) studies. One hundred thirty-eight studies (33%) reported no validity evidence other than this. Evidence of content, reliability, and relations with a separately measured variable were each reported in approximately one-third of studies (N = 142, 163, and 128, respectively). The prevalence of content evidence here is lower than that reported for the MERSQI (Table 5) because we required new evidence, whereas the MERSQI credits the presentation of previously published evidence. Response process and consequences evidence were infrequently reported (≤5% each).

Table 5:
Methodological Quality of Studies Included in a 2011 Systematic Review of Technology-Enhanced Simulation for Assessment

Under the category of relations with other variables, 28 studies evaluated associations between simulation-based performance and performance with real patients. These outcomes included measures of procedural time (N = 6), behaviors (instructor ratings of technique, N = 25), and patient effects (rate of procedural success or complications, N = 4). Without exception, these studies showed that higher simulator scores were associated with higher performance in clinical practice.

One hundred eighty-one studies reported only one evidence element, 78 reported two elements, and 139 reported three or more. Nineteen studies reported no substantive evidence, despite having as an aim the evaluation of an assessment tool. Of the 163 studies reporting reliability data, 106 reported one reliability type (most often interrater reliability).
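As a quick consistency check, the counts in this paragraph can be tallied against the totals given elsewhere in the report (417 included studies, 217 coded in detail). The sketch below uses only numbers stated in the text; the variable names are illustrative.

```python
# Number of studies by count of validity evidence elements, as reported above
elements_reported = {"none": 19, "one": 181, "two": 78, "three_or_more": 139}

total_studies = sum(elements_reported.values())

# Studies with two or more elements were selected for detailed coding
detailed = elements_reported["two"] + elements_reported["three_or_more"]

share_one_element = elements_reported["one"] / total_studies
share_detailed = detailed / total_studies

print(total_studies, detailed)
print(f"one element only: {share_one_element:.0%}, coded in detail: {share_detailed:.0%}")
```

The four categories sum to the 417 included studies, and the two-or-more group matches the 217 studies reviewed in detail.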

Seventy-five of the 217 studies reviewed in detail (35%) used a “classical” validity framework33 (content, criterion, and construct validity) for planning and interpreting data. Another 85 (39%) used a more limited framework, such as “construct” validity alone, and 51 (24%) reported no validity framework. Only 6 (3%) invoked the currently accepted model.20

Looking at trends over time, the number of validity evidence sources has decreased slightly in recent years: Before 2008, each study reported on average 2.14 (SD 1.36) evidence sources; since 2008, the average was 1.98 (SD 1.41). Considering evidence sources separately, the change pre- to post-2008 was less than ±4 percentage points and not statistically significant (P > .05) except for relations with a separate measure (35% pre-2008, decreasing to 25%; P = .03).
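The two comparisons in this paragraph can be approximately reproduced from the summary figures alone. The sketch below is an illustration, not the authors' SAS analysis. It assumes a pooled-variance t test and a Pearson chi-square without continuity correction, and it assumes cell counts of 85 and 43, chosen because they match the rounded percentages (35% of 245 and 25% of 172) and sum to the 128 studies reporting relations with a separate measure. The function names are hypothetical.

```python
import math

N_PRE = 417 - 172   # studies published before 2008
N_POST = 172        # studies published in or after 2008


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with pooled variance, from summary statistics."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / standard_error


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) and its P value."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 df the upper tail equals the two-sided normal tail of sqrt(stat)
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p


# Mean number of evidence sources per study, before versus since 2008
t_stat = pooled_t(2.14, 1.36, N_PRE, 1.98, 1.41, N_POST)

# Relations with a separate measure: assumed counts (see lead-in)
pre_yes, post_yes = 85, 43
chi2, p_value = chi_square_2x2(pre_yes, N_PRE - pre_yes,
                               post_yes, N_POST - post_yes)

print(f"t = {t_stat:.2f}, chi-square = {chi2:.2f}, P = {p_value:.3f}")
```

Under these assumptions the t statistic is well below the conventional two-sided critical value of about 1.96, consistent with a slight and nonsignificant decline, and the chi-square P value falls between .03 and .04, in line with the reported P = .03.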

Methodological quality

Table 5 summarizes the methodological quality of all 417 articles as evaluated using the MERSQI. Over half (N = 225) were single-group, single-assessment (i.e., cross-sectional) studies, whereas another one-third (N = 152) employed a one-group pretest–posttest or crossover design. The vast majority (N = 398; 95%) employed objective outcome measurements. Thirty-nine studies (9%) made a statistical error in a main analysis.

In our two-level approach, we selected all studies reporting two or more elements of validity evidence for additional coding of methodological and reporting quality. As seen in Table 5, MERSQI scores for these 217 studies were very similar to the full set, except (as would be expected) for the prevalence of validity evidence.

We evaluated these 217 studies using the QUADAS-2. As shown in Table 5, we found only 25 studies (12%) at low risk of bias in participant selection. High bias was typically due to expert–novice (case–control) comparisons, whereas unclear bias was due to failure to describe the population (data not shown). We also had frequent concerns about the conduct of the index test and reference test (N = 85 of 217 [39%] and N = 34 of 102 [33%] judged low risk of bias, respectively). Most reference tests aligned reasonably well with the target condition (N = 83 [81%] judged low concern about applicability), as did nearly all of the index tests (N = 205 [94%] low concern).

Although the STARD criteria reported in Table 2 (discussed below) focus on reporting quality, in abstracting these elements we also coded methodological quality. For example, 86 of 159 studies with human raters reported whether or not raters were blinded to trainee experience (i.e., done or not done)—but blinding was actually done in only 66 (42%) (i.e., it was reported as not done in 20). Raters were blinded to one another in 83 of 135 studies with two or more raters (61%) and blinded to the reference test in 12 of 95 studies with a reference test (13%). Twenty-three studies (14%) employed only one rater per observation. Raters completed special training in 67 of 159 studies (42%). Among the 38 studies reporting a sampling strategy, 20 enrolled the entire available sample (e.g., an entire medical school class), 5 enrolled a random sample, and 13 employed defined inclusion/exclusion criteria.

Reporting quality

Table 2 summarizes reporting quality as measured by the STARD and GRRAS. Nearly all studies (N = 186; 86%) reported a focused question, but only 135 (62%) offered a critical review of relevant literature (“cites articles relevant to the topic or study design and critically discusses these articles”),22 and 139 (64%) proposed a plan for interpreting the evidence to be presented (validity argument). Trainee flow was sparsely reported, with 58 (27%) studies reporting eligibility criteria, 38 (18%) reporting any sampling method, and 30 (14%) reporting the number eligible. Twelve studies (6%) failed to report the number enrolled, and 52 (24%) failed to define the training level for all trainees.

Only 19 (9%) studies reported sample size calculations. Among 198 studies correlating two variables or comparing two groups, at least one main statistical method was undefined in 8 (4%). Similarly, among 153 studies reporting reliability analyses, statistical methods were undefined in 18 (12%). Confidence intervals were reported for 2% of correlation coefficients (N = 2 of 92 studies) and 10% of reliability coefficients (N = 16 of 153 studies).
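Because confidence intervals for correlation coefficients were so rarely reported, a brief sketch of one common way to compute them may be useful. It uses the Fisher z transformation, a standard large-sample approximation that assumes roughly bivariate normal data; the review does not prescribe a particular method. The function name and the example correlation of 0.50 are hypothetical, and the sample size of 30 is borrowed from the median study size reported above.

```python
import math


def correlation_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson correlation.

    Uses the Fisher z transformation; z_crit = 1.96 gives a 95% interval.
    """
    if n <= 3:
        raise ValueError("n must exceed 3 for the Fisher z standard error")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)                      # transform r to the z scale
    half_width = z_crit / math.sqrt(n - 3)
    # Transform the interval limits back to the correlation scale
    return math.tanh(z - half_width), math.tanh(z + half_width)


low, high = correlation_ci(r=0.50, n=30)
print(f"r = 0.50, n = 30, 95% CI {low:.2f} to {high:.2f}")
```

For a study of this size the interval spans roughly 0.17 to 0.73, which shows how imprecise a single correlation can be and why reporting the interval matters for interpreting validity evidence.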

Discussion and Conclusions

Brennan34(p8) stated, “Validity theory is rich, but the practice of validation is often impoverished.” This systematic review of simulation-based assessment suggests that such is unfortunately the case in this field of medical education. Most of the 417 studies in this sample offered only limited validity evidence, and nearly half reported only one element of new evidence. By far the most commonly reported source of validity evidence—and the sole source for one-third of studies—was the relatively weak design of expert–novice comparison. The average number of validity elements decreased slightly or remained constant in more recent studies, suggesting that conditions are not improving. Fewer than two-thirds of the studies proposed an outline of the validity evidence they expected to accrue, and one-fifth failed to interpret the results of the evidence presented. Only six studies acknowledged the current unified evidence-oriented framework.20

We also evaluated methodological quality using the MERSQI and QUADAS-2. Whereas MERSQI overall scores are somewhat higher than those reported in previous studies,8,23,26,35 QUADAS-2 ratings indicate a high prevalence of selective inclusion (case–control studies), incomplete description of the population, and lack of rater blinding—all of which have been associated with bias in clinical research.36,37 If such associations hold true in education, the findings of such studies may differ from the true properties of the assessment activity.

Reporting quality as appraised using the STARD and GRRAS criteria was also limited. The STARD guidelines were established to ensure reporting of key study features required to appraise the risk of bias (e.g., the information needed to complete instruments such as the QUADAS-2). Indeed, we often found it difficult or impossible to appraise methodological rigor (i.e., using the QUADAS-2 and MERSQI) when key information was reported vaguely or not at all.

Limitations and strengths

This review has limitations. We did not attempt to determine the direction or strength of validity evidence or judge the validity of interpretations for individual tools. However, by focusing on the type of validity evidence reported, we were able to comment on strengths and weaknesses across a diverse field and to prepare a catalog of tools for assessing many different skills (see Table 4 and Supplemental Digital Table 1). Although beyond the scope of the present work, further evaluation of validity evidence for specific skills domains and for studies reporting on multiple tools would provide additional insight into these issues.

We found modest interrater agreement for some variables, due at least in part to incomplete or unclear reporting. We addressed this by reaching consensus on all reported data. The low interrater agreement for the QUADAS-2 applicability questions suggests that this instrument may require further refinement or clarification prior to widespread use in education.

We cannot exclude the possibility of publication bias, particularly in those studies exploring associations with clinical outcomes.

We included studies regardless of study design or validity evidence presented. However, we abstracted information for only one assessment tool per report, and we applied the QUADAS-2, STARD, and GRRAS criteria to only half the articles. Although we presume that studies reporting less validity evidence would fare less favorably on these measures, we did not confirm this, and the quality of reporting and methods remains unknown for studies not selected for detailed review.

Comparison with previous reviews

The present review agrees with previous reviews of assessment in simulation-based9–16 and non-simulation-based26,38–41 education in concluding that validation research is generally lacking. The present review builds on previous reviews by applying a modern validity framework to the field of simulation-based assessment and providing a detailed summary of validity evidence for currently available tools.

Previous reviews of reporting and methodological quality in medical education have focused on studies of educational interventions23,26,42 and identified significant shortcomings therein. We are not aware of any study applying the QUADAS-2, STARD, or GRRAS guidelines to medical education. The data we present regarding reporting and methodological quality thus constitute a unique contribution.

Implications for research

Our findings suggest that current validation research methods could lead to biased results. Education researchers can minimize potential bias by avoiding selective inclusion (i.e., selectively enrolling experienced and inexperienced participants), describing the number and characteristics of the eligible population, and blinding raters to trainee experience, other raters, and (when present) the results of any tests used as related measures.

Although this study focused on simulation-based assessment, we suspect that there is room for improvement in the completeness and clarity of reporting for assessment research in health professions education generally. As noted previously, “Rote adherence to guidelines will not compensate for poor-quality research or inferior writing skills, but inclusion of the elements listed in guidelines … will enable a wide range of consumers to understand and apply the study results.”23 It may be useful to further refine the STARD and GRRAS for widespread application to educational assessment research. In the meantime, researchers might use these instruments (or our operational adaptation) to facilitate complete reporting.

Yet guidelines alone will be inadequate, in part because many authors are unaware of them or lack the skills to apply them in practice. True improvement in reporting quality will require the active efforts of journal editors and reviewers who understand, endorse, and enforce relevant standards.22

Implications for practice

This catalog of tools, indexed by clinical topic, will be of great practical value to educators searching for evidence-based assessment instruments. Unfortunately, only a handful of these tools have been subjected to validation across different assessment contexts. For most of the tools listed in Table 4, we see a preponderance of evidence for relations with other variables (especially relations with training status), and a relative lack of evidence for the content, internal structure, response process, and consequences of scores. It seems that a tool’s widespread use often outstrips the accumulation of validity evidence. To resolve this, researchers must do more than employ robust research methods. They will also need to deliberately target key evidence sources. This, in turn, would benefit from a structured agenda or argument, as has been proposed for professionalism.43 In the meantime, educators should ensure that actions based on an assessment’s scores are commensurate with the strength of the available evidence.

It is often said that assessment drives learning, and as medical education evolves toward personalized training and competency-based decisions,5 the role of educational assessment will only enlarge.44 Assessments that rely on self-report, log books, hours of training, or written tests to determine procedural competence will no longer suffice. To this end, we need both innovative tools—many of which will involve simulation—and coherent validity arguments supporting the interpretation of scores. Validity arguments, in turn, require rigorous, well-reported research providing strategically collected evidence. An arsenal of tools thus validated will enable decisions regarding formative feedback, mastery of technical and nontechnical skills, remediation, and credentialing that will streamline training and ensure the quality of health professionals and patient care.

Acknowledgments: The authors thank Jason Szostek, MD, Amy Wang, MD, and Patricia Erwin, MLS, for their efforts in article identification, and Colin West, MD, PhD, for his critical review of the manuscript (all affiliated with Mayo Clinic College of Medicine, Rochester, Minnesota).

Funding/Support: No external funding. This work was supported by an award from the Division of General Internal Medicine, Mayo Clinic.

Other Disclosures: None.

Ethical approval: Not applicable.

Previous presentations: An abstract based on this work was presented at the 2012 Simulation Summit of the Royal College of Physicians and Surgeons of Canada, Ottawa, Ontario, Canada, November 2012.

* Time skills refers to the time required to perform a task, process skills refers to performance during a task (e.g., global ratings or minor errors), and product skills refers to the final result (e.g., successful completion, major complication, or the quality of the final product).8


1. Antiel RM, Thompson SM, Reed DA, et al. ACGME duty-hour recommendations—A national survey of residency program directors. N Engl J Med. 2010;363:e12
2. West CP, Tan AD, Habermann TM, Sloan JA, Shanafelt TD. Association of resident fatigue and distress with perceived medical errors. JAMA. 2009;302:1294–1300
3. Albanese M, Mejicano G, Gruppen L. Perspective: Competency-based medical education: A defense against the four horsemen of the medical education apocalypse. Acad Med. 2008;83:1132–1139
4. Emanuel EJ, Fuchs VR. Shortening medical training by 30%. JAMA. 2012;307:1143–1144
5. Weinberger SE, Pereira AG, Iobst WF, Mechaber AJ, Bronze MS; Alliance for Academic Internal Medicine Education Redesign Task Force II. Competency-based education and training in internal medicine. Ann Intern Med. 2010;153:751–756
6. Albanese MA, Mejicano G, Mullan P, Kokotailo P, Gruppen L. Defining characteristics of educational competencies. Med Educ. 2008;42:248–255
7. Ziv A, Wolpe PR, Small SD, Glick S. Simulation-based medical education: An ethical imperative. Acad Med. 2003;78:783–788
8. Cook DA, Hatala R, Brydges R, et al. Technology-enhanced simulation for health professions education: A systematic review and meta-analysis. JAMA. 2011;306:978–988
9. Kardong-Edgren S, Adamson KA, Fitzgerald C. A review of currently published evaluation instruments for human patient simulation. Clin Simul Nursing. 2010;6:e25–e35
10. Van Nortwick SS, Lendvay TS, Jensen AR, Wright AS, Horvath KD, Kim S. Methodologies for establishing validity in surgical simulation studies. Surgery. 2010;147:622–630
11. Ahmed K, Jawad M, Abboudi M, et al. Effectiveness of procedural simulation in urology: A systematic review. J Urol. 2011;186:26–34
12. Feldman LS, Sherman V, Fried GM. Using simulators to assess laparoscopic competence: Ready for widespread use? Surgery. 2004;135:28–42
13. Schout BM, Hendrikx AJ, Scheele F, Bemelmans BL, Scherpbier AJ. Validation and implementation of surgical simulators: A critical review of present, past, and future. Surg Endosc. 2010;24:536–546
14. Byrne AJ, Greaves JD. Assessment instruments used during anaesthetic simulation: Review of published studies. Br J Anaesth. 2001;86:445–450
15. Edler AA, Fanning RG, Chen MI, et al. Patient simulation: A literary synthesis of assessment tools in anesthesiology. J Educ Eval Health Prof. 2009;6:3
16. Fitzgerald TN, Duffy AJ, Bell RL, Berman L, Longo WE, Roberts KE. Computer-based endoscopy simulation: Emerging roles in teaching and professional skills assessment. J Surg Educ. 2008;65:229–235
17. Downing SM. Validity: On meaningful interpretation of assessment data. Med Educ. 2003;37:830–837
18. Kane MT. Validation. In: Brennan RL, ed. Educational Measurement. 4th ed. Westport, Conn: Praeger; 2006:17–64
19. Messick S. Validity. In: Linn RL, ed. Educational Measurement. 3rd ed. New York, NY: American Council on Education and Macmillan; 1989:13–103
20. American Educational Research Association; American Psychological Association; National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999
21. Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments: Theory and application. Am J Med. 2006;119:166.e7–166.e16
22. Cook DA, Beckman TJ, Bordage G. Quality of reporting of experimental studies in medical education: A systematic review. Med Educ. 2007;41:737–745
23. Cook DA, Levinson AJ, Garside S. Method and reporting quality in health professions education research: A systematic review. Med Educ. 2011;45:227–238
24. Price EG, Beach MC, Gary TL, et al. A systematic review of the methodological rigor of studies evaluating cultural competence training of health professionals. Acad Med. 2005;80:578–586
25. Whiting PF, Rutjes AW, Westwood ME, et al; QUADAS-2 Group. QUADAS-2: A revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155:529–536
26. Reed DA, Cook DA, Beckman TJ, Levine RB, Kern DE, Wright SM. Association between funding and quality of published medical education research. JAMA. 2007;298:1002–1009
27. Bossuyt PM, Reitsma JB, Bruns DE, et al; Standards for Reporting of Diagnostic Accuracy. Towards complete and accurate reporting of studies of diagnostic accuracy: The STARD initiative. Ann Intern Med. 2003;138:40–44
28. Kottner J, Audigé L, Brorson S, et al. Guidelines for Reporting Reliability and Agreement Studies (GRRAS) were proposed. J Clin Epidemiol. 2011;64:96–106
29. Moher D, Liberati A, Tetzlaff J, Altman DG; PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. Ann Intern Med. 2009;151:264–269, W64
30. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–174
31. Martin JA, Regehr G, Reznick R, et al. Objective structured assessment of technical skill (OSATS) for surgical residents. Br J Surg. 1997;84:273–278
32. Derossis AM, Fried GM, Abrahamowicz M, Sigman HH, Barkun JS, Meakins JL. Development of a model for training and evaluation of laparoscopic skills. Am J Surg. 1998;175:482–487
33. American Psychological Association. Standards for Educational and Psychological Tests and Manuals. Washington, DC: American Psychological Association; 1966
34. Brennan RL. Perspectives on the evolution and future of educational measurement. In: Brennan RL, ed. Educational Measurement. 4th ed. Westport, Conn: Praeger; 2006:1–16
35. Reed DA, Beckman TJ, Wright SM, Levine RB, Kern DE, Cook DA. Predictive validity evidence for medical education research study quality instrument scores: Quality of submissions to JGIM’s Medical Education Special Issue. J Gen Intern Med. 2008;23:903–907
36. Lijmer JG, Mol BW, Heisterkamp S, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA. 1999;282:1061–1066
37. Whiting P, Rutjes AW, Reitsma JB, Glas AS, Bossuyt PM, Kleijnen J. Sources of variation and bias in studies of diagnostic accuracy: A systematic review. Ann Intern Med. 2004;140:189–202
38. Beckman TJ, Cook DA, Mandrekar JN. What is the validity evidence for assessments of clinical teaching? J Gen Intern Med. 2005;20:1159–1164
39. Kogan JR, Holmboe ES, Hauer KE. Tools for direct observation and assessment of clinical skills of medical trainees: A systematic review. JAMA. 2009;302:1316–1326
40. Howley L, Szauter K, Perkowski L, Clifton M, McNaughton N; Association of Standardized Patient Educators (ASPE). Quality of standardised patient research reports in the medical education literature: Review and recommendations. Med Educ. 2008;42:350–358
41. Ratanawongsa N, Thomas PA, Marinopoulos SS, et al. The reported validity and reliability of methods for evaluating continuing medical education: A systematic review. Acad Med. 2008;83:274–283
42. Baernstein A, Liss HK, Carney PA, Elmore JG. Trends in study methods used in undergraduate medical education research, 1969–2007. JAMA. 2007;298:1038–1045
43. Clauser BE, Margolis MJ, Holtman MC, Katsufrakis PJ, Hawkins RE. Validity considerations in the assessment of professionalism. Adv Health Sci Educ Theory Pract. 2012;17:165–181
44. Holmboe ES, Sherbino J, Long DM, Swing SR, Frank JR. The role of assessment in competency-based medical education. Med Teach. 2010;32:676–682
45. Duffy AJ, Hogle NJ, McCarthy H, et al. Construct validity for the LAPSIM laparoscopic surgical simulator. Surg Endosc. 2005;19:401–405
46. Ritter EM, McClusky DA 3rd, Lederman AB, Gallagher AG, Smith CD. Objective psychomotor skills assessment of experienced and novice flexible endoscopists with a virtual reality simulator. J Gastrointest Surg. 2003;7:871–877
47. Hsu JH, Younan D, Pandalai S, et al. Use of computer simulation for determining endovascular skill levels in a carotid stenting model. J Vasc Surg. 2004;40:1118–1125
48. Grantcharov TP, Rosenberg J, Pahle E, Funch-Jensen P. Virtual reality computer simulation. Surg Endosc. 2001;15:242–244
49. Botden SM, Buzink SN, Schijven MP, Jakimowicz JJ. ProMIS augmented reality training of laparoscopic procedures face validity. Simul Healthc. 2008;3:97–102
50. Rossi JV, Verma D, Fujii GY, et al. Virtual vitreoretinal surgical simulator as a training tool. Retina (Philadelphia, Pa). 2004;24:231–236
51. Zhang A, Hünerbein M, Dai Y, Schlag PM, Beller S. Construct validity testing of a laparoscopic surgery simulator (Lap Mentor): Evaluation of surgical skill with a virtual laparoscopic training simulator. Surg Endosc. 2008;22:1440–1444
52. Rosenthal R, Gantert WA, Scheidegger D, Oertli D. Can skills assessment on a virtual reality trainer predict a surgical trainee’s talent in laparoscopic surgery? Surg Endosc. 2006;20:1286–1290
53. Datta V, Mackay S, Mandalia M, Darzi A. The use of electromagnetic motion tracking analysis to objectively measure open surgical skill in the laboratory-based model. J Am Coll Surg. 2001;193:479–485


© 2013 Association of American Medical Colleges