The Association of American Medical Colleges recommends that students graduating from U.S. MD-granting medical schools master basic skills, such as knot-tying and suturing, and tasks, such as Foley catheter insertion and venipuncture.1 Residents graduating from training programs in procedurally oriented disciplines, like surgical specialties and emergency medicine, are expected to master a broader array of technical skills, tasks, and procedures.2–4 In surgery, the need for reliable and valid assessment of technical competency has always been of interest, but in the late 1990s, it became a topic of considerable concern as well as an area of explosive creative activity. The demand for more formalized testing of technical skills has been fueled by concerns for patient safety, the advent of new technologies, the rise of simulation-based training, and increased public scrutiny and accountability. To assess technical performance, educators need to decide which aspects of performance to measure. To date, the development of assessment metrics and tools has been driven by individual researchers,5 special task forces,6–9 accreditation agencies,10 and professional organizations.11
Researchers have used different measures for determining “competency,” “proficiency,” or “expertise.” For example, some assess wound closure skills according to the speed with which students can tie knots, whereas others measure the distance between the placed sutures, or the tensile strength of the knots. A researcher studying a simulator’s utility in training residents to perform a laparoscopic (Lap) cholecystectomy may decide to measure the amount of manual force applied to delicate tissues or the number of hand motions needed to reach the gallbladder. The aspect of performance that is being measured is defined as a “metric.”
Several theoretical articles have proposed metric definitions and frameworks,7,8 but the selection and development of specific metrics has evolved unevenly across the procedurally oriented disciplines—general surgery, orthopedic surgery, urology, neurosurgery, otolaryngology, emergency medicine, obstetrics–gynecology—for which technical competence is critical. Collectively, educators and researchers lack a clear taxonomy of metric categories to guide their decisions on what to measure when assessing trainees’ technical surgical skills.
“Taxonomy” has been defined as a “field of science that encompasses description, identification, nomenclature, and classification.”12 Establishing a defensible taxonomy is valuable because it organizes and clarifies phenomena within a system, promotes a common language within a community of researchers, leads to greater understanding of patterns and gaps, and provides a general foundation for more advanced study.
The purpose of this study is twofold: first, to create and empirically test a proposed taxonomy of assessment metrics based on the research literature in psychomotor skills training across the surgical specialties; and second, to better understand the frequency and context with which these metrics were being used.
Our study questions are (1) Which metrics should be included in a taxonomy of technical skills assessment, and how should they be defined and organized? (2) To what extent is our proposed taxonomy comprehensive and verifiable? (3) Which metric categories are most and least commonly reported? (4) Which types of skills, settings, learners, models, and instruments are most common?
Phase 1: Identifying and organizing metrics into a taxonomy
To establish a preliminary draft of the metrics taxonomy, in July–August 2011, the first author (C.C.S.) reviewed seminal articles on skills training, simulation, and assessment metrics for technical or procedural skills.13–25 These articles were selected on the basis of author reputation in surgical assessment or simulation, and their focus on defining or measuring technical performance. A taxonomy of “abilities, skills, tasks, and procedures” proposed by the 2001 International Consensus Group7,8 formed the basis for this draft, but the taxonomy grew and changed shape as metrics from other articles were identified. Three of us (C.C.S., D.D., J.K.) sorted, grouped, and organized these metrics on the basis of their similarity, number, and type. We also added examples of how each metric could be operationalized for measurement purposes, and identified the underlying construct the metric was meant to represent. For example, the metric “time” can be operationalized as the number of minutes it takes to sew two segments of bowel together (with less time being the goal). The underlying construct being represented is “automaticity,” meaning that basic skills have been so well learned that they require little conscious attention.26 The metric “time” can also be operationalized as completing a procedure within a time limit to achieve a patient safety standard (e.g., establishing a patent airway within three minutes of beginning a cricothyrotomy).
Phase 2: Developing the coding structure and establishing reliability
Three authors (C.C.S., D.D., J.K.) developed a coding structure that would enable us to classify 16 types of metrics reported in a larger literature review according to our proposed taxonomy (see Table 1). The structure also supported documentation of five contextual variables related to our fourth research question. We wanted to determine the types of skill domains that are assessed most often (e.g., basic motor skills versus tasks or procedures); the settings involved (e.g., bedside versus Lap surgery); the types of learners being tested (e.g., novices versus intermediates or experts); the physical models used for assessment (e.g., box trainer versus virtual reality or live patient); and the types of instruments used to collect the scores (e.g., simulator-embedded metric calculations versus checklists or recording sheets). Data on these contextual variables indicate where the field is concentrating its assessment energy and where there are gaps to fill.
To pilot the coding structure in September–October 2011, we each read and coded 10 articles,17,24,27–34 which we had purposefully selected to represent different types of skills, settings, learners, models, and instruments. After comparing our coding results, we revised the metrics taxonomy and coding structure, separating subjective from objective metrics, and subdividing metrics into “inputs,” “processes,” and “outcomes.” “Inputs” refers to measurement of innate abilities prior to their application, “processes” refers to actions during the execution of surgical tasks or procedures, and “outcomes” refers to the results of executed tasks or procedures.
We recruited additional authors (M.E.S., S.M., K.Y.) to ensure timely completion of the study. To promote coding consistency among authors, in February–March 2012 we drew a random sample of five articles from the large database of articles that we assembled (see Phase 3). Each author read and independently coded the metrics and related contextual variables from these five articles. After discussing areas of agreement and disagreement, we again refined the metrics taxonomy and coding structure. Additionally, we developed more detailed coding guidelines and decision rules. For example, we decided to code no more than four metrics per article. We set this limit because we did not expect many articles to report more than four metrics per study, and because it simplified the database by removing empty cells.
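The coding record described above (up to four metrics plus five contextual variables per article) can be sketched as a small data structure. This is only an illustration: the class name `ArticleCoding`, its field names, and the article identifier are hypothetical and do not reproduce the actual coding form used in the study.

```python
from dataclasses import dataclass, field
from typing import List

# Decision rule stated in the text: code no more than four metrics per article.
MAX_METRICS_PER_ARTICLE = 4


@dataclass
class ArticleCoding:
    """One coded article (hypothetical layout, for illustration only)."""
    article_id: str
    metrics: List[str] = field(default_factory=list)
    # The five contextual variables named in the text.
    skills: List[str] = field(default_factory=list)
    settings: List[str] = field(default_factory=list)
    learners: List[str] = field(default_factory=list)
    models: List[str] = field(default_factory=list)
    instruments: List[str] = field(default_factory=list)

    def add_metric(self, category: str) -> bool:
        """Record a metric; return False once the four-metric limit is reached."""
        if len(self.metrics) >= MAX_METRICS_PER_ARTICLE:
            return False
        self.metrics.append(category)
        return True


# Hypothetical example: a fifth metric is not recorded.
record = ArticleCoding(article_id="example-001")
accepted = [
    record.add_metric(m)
    for m in ["time", "errors", "procedural steps", "automaticity", "decisions and choices"]
]
```

Capping the list at four keeps every record the same width, which is the simplification the decision rule was meant to achieve.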
In a final test of coding agreement, we reread the same five articles and independently recoded the metrics reported in the articles. We defined “agreement” as the percentage of authors (C.C.S., D.D., M.E.S., S.M., J.K.) who identified the same metric from the 16 possible metric categories (see Table 2). We averaged 96% agreement for the first (or main) metric reported in these articles, 80% agreement for a second metric, and 55% agreement for a third metric. For two articles, our coding agreement across all three metrics was 100%. For the other three articles, our coding agreement for all three metrics ranged from 60% to 73%.
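As a minimal sketch of the agreement calculation, the snippet below assumes that “the same metric” means the category chosen by the largest number of coders for a given article and metric position. That reading is our assumption for illustration; the function names and the example codes are hypothetical, not data from the study.

```python
from collections import Counter


def percent_agreement(codes):
    """Percentage of coders who chose the most common category."""
    if not codes:
        raise ValueError("at least one coder's code is required")
    modal_count = Counter(codes).most_common(1)[0][1]
    return 100.0 * modal_count / len(codes)


def mean_agreement(per_article_codes):
    """Average percent agreement across articles for one metric position."""
    values = [percent_agreement(codes) for codes in per_article_codes]
    return sum(values) / len(values)


# Hypothetical codes from five coders for the first metric of two articles.
article_a = ["time", "time", "time", "time", "time"]
article_b = ["time", "time", "errors", "time", "time"]

agreement_a = percent_agreement(article_a)
agreement_b = percent_agreement(article_b)
average = mean_agreement([article_a, article_b])
```

With five coders, each dissenting coder lowers agreement for that article by 20 percentage points under this definition.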
Phase 3: Searching the literature
To compile a database of articles with which to test the taxonomy, we conducted a systematic literature search in December 2011. Working with a library assistant, we used the advanced search feature on PubMed to obtain original articles (no letters or editorials) during a 10-year period (December 2001–December 2011). We used the following search strategy: [measurement of technical skills], OR [measurement of procedural skills], OR [measurement of laparoscopic skills], OR [measurement of endoscopic skills], AND [surgical specialties]. Titles and abstracts were read (C.C.S.), and articles were excluded if (1) no psychomotor (technical/procedural) skills were involved, (2) no learner assessment metrics were proposed, described, or tested, (3) the skills in question were not sufficiently relevant to surgery, or (4) study designs were not psychometric, preexperimental, or experimental.
Following the PubMed search, we conducted a hand search to find additional articles meeting our criteria. Specifically, we identified studies from the reference sections of six publications that reviewed the literature in technical skills assessment and simulation skills training.21,24,35–38
Phase 4: Testing the taxonomy and collecting contextual data
We determined a priori that we would consider the taxonomy sufficiently comprehensive and verifiable if 95% of the measures being reported in the literature could be classified according to our taxonomy, and that every metric category would have at least one entry. We transferred the coding structure and guidelines to an online database (SurveyMonkey Inc., Palo Alto, California) and randomly assigned articles to authors to review and code (April–July 2012). Prior to data analysis, we downloaded all records from SurveyMonkey into SPSS Statistics (Version 20) (IBM Software), checked for duplicates, and corrected obvious coding errors. In August 2012, we analyzed the data by computing the percentage of articles in which a metric was found, as well as the total number and percentage of times each metric category was recorded. We similarly computed the number and percentages of contextual variables recorded.
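The a priori criterion and the frequency calculations described above can be illustrated with a short sketch. It assumes each article is stored as a list of coded metric labels, with unclassifiable metrics labeled “other”. The function name, the data layout, and the toy data are hypothetical; the study's analysis was run in SPSS, not with this code.

```python
from collections import Counter


def summarize_coding(article_codes, categories, threshold=0.95):
    """Tabulate metric frequencies and check the comprehensiveness criterion.

    article_codes: dict mapping article id -> list of coded metric labels.
    categories: list of unique taxonomy category names.
    """
    records = [m for metrics in article_codes.values() for m in metrics]
    if not records:
        raise ValueError("no metric records to summarize")
    n_articles = len(article_codes)
    record_counts = Counter(records)
    # Count each category at most once per article.
    article_counts = Counter(
        m for metrics in article_codes.values() for m in set(metrics)
    )
    classified = sum(record_counts[c] for c in categories)
    share_classified = classified / len(records)
    empty_categories = [c for c in categories if record_counts[c] == 0]
    per_category = {
        c: {
            "pct_articles": 100.0 * article_counts[c] / n_articles,
            "n_records": record_counts[c],
            "pct_records": 100.0 * record_counts[c] / len(records),
        }
        for c in categories
    }
    return {
        "share_classified": share_classified,
        "empty_categories": empty_categories,
        "criterion_met": share_classified >= threshold and not empty_categories,
        "per_category": per_category,
    }


# Toy data: three articles, four categories, one unclassifiable metric.
toy_codes = {
    "a1": ["time", "errors"],
    "a2": ["time", "other"],
    "a3": ["procedural steps"],
}
toy_categories = ["time", "errors", "procedural steps", "automaticity"]
summary = summarize_coding(toy_codes, toy_categories)
```

In the toy data, 4 of 5 records are classified and one category is empty, so neither part of the criterion is met.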
The systematic search via PubMed resulted in 185 articles, 71 of which did not meet our study criteria on the basis of information provided in abstracts (see Figure 1). This left 114 articles in our database. The follow-up hand search resulted in 114 additional articles, bringing our total to 228. During the coding phase, we excluded 26 articles that did not meet study criteria, lacked sufficient information on their methods and metrics to code, or could not be obtained in English or online. The analysis was conducted on 202 articles.
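The article flow reported above can be checked with simple arithmetic. The variable names are ours; the counts are those given in the text.

```python
# Counts reported in the text.
pubmed_hits = 185
excluded_on_abstract = 71
hand_search_added = 114
excluded_during_coding = 26

# Articles remaining after abstract screening of the PubMed results.
after_screening = pubmed_hits - excluded_on_abstract

# Total after adding the hand-searched articles.
total_before_coding = after_screening + hand_search_added

# Articles included in the final analysis.
analyzed = total_before_coding - excluded_during_coding

print(after_screening, total_before_coding, analyzed)
```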
As shown in Table 2, the proposed taxonomy contains 12 objective metric categories (subdivided into “inputs,” “processes,” and “outcomes”) and 4 subjective metric categories (subdivided into “processes” and “outcomes”).
We coded a total of 567 metrics. Of that number, 520 (92%) were successfully coded according to the newly developed metrics taxonomy, and 47 (8%) were recorded as “other.” In the majority of instances, the “other” metric was a composite of several individual metrics that were in the taxonomy (e.g., scores based on some combination of “time,” “errors,” “efficiency,” and “precision”). “Learning curves,” cited three times, was the only new metric to emerge from the “other” category.
Frequency of metric categories
Out of 202 articles and 567 metric records, “time” was the most frequently reported metric (see Figure 2). Nearly 70% of the articles (n = 141) used time as a performance metric, and it accounted for 25% of all metric records. The next most frequent metrics were “manual techniques: objective” (n = 81; 40% of articles, 14% of records); “manual techniques: subjective” (n = 76; 38% of articles, 13% of records); “errors” (n = 62; 31% of articles, 11% of records); and “procedural steps” (n = 38; 19% of articles, 7% of records). On the basis of the articles in our review, we found evidence of use for all but one metric category (human factors: objective). Additionally, the use of “automaticity” (n = 1; 0.2% of records), “decisions and choices” (n = 2; 0.4%), and “clinical results of end products: objective” (n = 3; 0.5%) was very rare. Overall, process-oriented metrics outnumbered outcome-oriented metrics by a ratio of 8:1.
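The percentages for the five most frequent metrics follow directly from the reported counts and the two denominators (202 articles, 567 metric records). The sketch below assumes that each n counts a category at most once per article, so that the same n serves as both the article count and the record count. The function name is ours.

```python
N_ARTICLES = 202
N_RECORDS = 567

# Counts reported in the text for the five most frequent metric categories.
reported_counts = {
    "time": 141,
    "manual techniques: objective": 81,
    "manual techniques: subjective": 76,
    "errors": 62,
    "procedural steps": 38,
}


def shares(n):
    """Return (percent of articles, percent of records), rounded to whole numbers."""
    return round(100 * n / N_ARTICLES), round(100 * n / N_RECORDS)


percentages = {name: shares(n) for name, n in reported_counts.items()}

for name, (pct_articles, pct_records) in percentages.items():
    print(f"{name}: {pct_articles}% of articles, {pct_records}% of records")
```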
Frequency of contextual variables
Some types of skills (n = 242 records), settings (n = 241 records), learners (n = 380 records), models (n = 248 records), and instruments (n = 316 records) appeared more frequently than others (see Figure 3).
Most metric data are being collected on basic motor skills. By comparison, assessment of tasks and surgical procedures was less frequent. Basic skills (n = 124) were represented in 61% of articles and 51% of all skills recorded. A few rare articles (n = 7; 3% of articles, 1% of records) focused on aptitudes as a way to more objectively rank candidates for residency.
More studies focused on assessment as it relates to Lap and minimally invasive surgery (MIS) than open surgery procedures, bedside procedures, endovascular, or microvascular surgery. The frequency of Lap/MIS setting (n = 144; 71% of articles and 60% of settings recorded) was three to four times higher than other settings.
Metric data are being collected mostly on learners who represent the early end of the training continuum. This includes novices (e.g., medical students), beginners (e.g., postgraduate year [PGY] 1–2 residents), and intermediate learners (e.g., PGY 3–5 residents) (n = 272, combined; 72% of learner records). These learner groups were not always aligned with a training year, however, and in some articles the subjects were grouped according to specific types of experience (e.g., number of Lap cases performed) or demonstrated proficiency. Practicing surgeons were cited most often in studies of construct validation and in studies of Lap skills training, in which they were often grouped with novices or intermediate learners.
A wide variety of models is reported in the literature, from low-tech equipment and part-task trainers to animals and actual patients. The use of virtual reality simulators (n = 78; 39% of articles, 32% of models recorded) and box trainers (n = 64; 32% of articles, 26% of records) dominated this literature. Surprisingly, reports of full-body manikins were extremely rare (n = 1).
Checklists and rating scales were common, but hand-recorded measurements of work processes or products (n = 95; 47% of articles, 30% of records) were the major “instruments” category cited. Measurements were based on visual inspection (either live or via video recording) and usually involved simple counts (e.g., seconds to complete a task, recorded with a stopwatch; number of targets hit; number of objects dropped). This category also included recording measurements related to surgical success (e.g., positive margins from a lymph node biopsy; low leakage rate for a hand-sewn bowel anastomosis; low estimated blood loss during surgery).
Discussion and Conclusions
Assessment metrics, like test scores and numeric indicators of all kinds, are contrived measures, selected (or intentionally designed) to represent various phenomena of interest. The metrics we choose tell us something about what we value in human performance. Clearly, the most common metrics used for technical surgical skills during the decade under review were (1) time (with faster being presumed to be better); (2) objective measures of manual technique (several of which focused on economy of motion, efficiency); (3) subjective measures of manual technique (most of which were based on the Objective Structured Assessment of Technical Skills rating scales developed by Reznick et al)5; (4) errors; and (5) completion of task-specific procedural steps.
The value of having an established taxonomy can be considerable. In the last 60 years, for example, Bloom’s39 taxonomy on educational objectives has had an impact on curriculum development, learner assessment, and program evaluation. In this study, we employed the literature as a source of data to test the utility of the proposed surgical skills assessment metrics taxonomy. This research method, although it involves a systematic approach to reading a carefully drawn sample of articles, differs from a traditional literature review, which synthesizes actual results of similar studies according to questions or themes, or quantitatively via meta-analysis.
The proposed taxonomy represents an improvement in terms of definition, comprehensiveness, and verification, but it is not perfect. We spent a significant amount of time reading, discussing, and rereading the articles in Phases 1 to 2, trying to clarify what was being measured. Some articles were easy to code; others proved difficult. We believe the difficulty reflects two things: (1) the conceptual challenge all taxonomists face in identifying the defining properties of an object or construct, and sorting them into clearly different, mutually exclusive categories; and (2) the lack of consistency in the research communities in their definition, use, and reporting of assessment metrics. This lack of consistency, in itself, exposed the need for a taxonomy such as the one we propose. Additionally, the metrics alluded to in abstracts and the method sections of papers did not always match the type of data reported in the tables, making classification difficult.
Despite these challenges, we were able to classify 92% of the measures reported in the study literature according to the taxonomy. The metrics that were not coded were largely composite metrics that combined individual metrics that were already in our taxonomy. Although composite metrics pose a problem for coding, their presence does not alter the taxonomy per se. The next iteration of the taxonomy should include a better way to reflect composite metrics. The only “new” metric (learning curves) that did not easily fit the taxonomy is an important one. It could be argued that it represents a type of composite metric, as it is based on proficiency scores and time, cases, or trials. But the underlying construct it represents feels somewhat different from a composite measure of competence as measured in a single performance. Learning curves reflect a rate of learning, and as such, this metric brings something new to the taxonomy.
Not all of the proposed metric categories were well populated. This may reflect a limitation of the study in terms of the articles selected to verify the taxonomy, the time period we surveyed, and our decision to only code four metrics per article. We believe, however, that the most used metrics—time, manual techniques, errors, and procedural steps—are used for a reason: They are easier to execute. Assessing learning in a simulated environment where extraneous variables can be controlled is easier than at the bedside or in the operating room. Assessing a trainee’s ability to follow a known set of procedural steps is easier than assessing the trainee’s decision making. Subjectively assessing the overall quality of a performance is easier than objectively testing its immediate result. Attributing clinical outcomes from live patients in the operating room to a single trainee is inappropriate because patient outcomes result from a team and a system in which trainees are closely supervised.
Despite the challenges, we believe that more standardized assessments of surgical tasks and procedures (both at the bedside and in the operating room) need to be conducted, studied, and reported. This is not only because of their low frequency in the literature we reviewed but because they relate to the Surgery Residency Review Committee requirements for patient management and technical competency and to the American Board of Surgery mandates for board certification. The significant dominance of process over outcome metrics in the literature we reviewed may be appropriate, given the “early” learner groups involved. Even for these trainees, however, and certainly for more advanced learners, educators and researchers need to give greater attention to visible, testable characteristics of end products. For example, the security of a wound dressing may be tested by pulling slightly on the dressing at the end of an exam. Estimates of blood loss from a porcine model in a lab, or evidence of airway patency from a manikin, are obtainable.
The old adage “not everything that is measurable counts, and not everything that counts is measurable” is applicable to technical skills assessment. Metrics chosen by researchers may not be based on careful consideration of all metric possibilities but, rather, on what a simulator can count and report (like total time, correct hits, instrument path, speed, or dropped objects).40 These metrics of convenience require careful scrutiny. It is our hope that our taxonomy may spark interest in thinking about performance and its measurement in ways readers may not have considered, or provoke them to reconsider the basis for how they choose their metrics.
5. Reznick R, Regehr G, MacRae H, Martin J, McCulloch W. Testing technical skill via an innovative “bench station” examination. Am J Surg. 1997;173:226–230
6. Ritchie WP Jr. The measurement of competence. Current plans and future initiatives of the American Board of Surgery. Bull Am Coll Surg. 2001;86:10–15
7. Satava RM, Cuschieri A, Hamdorf J; Metrics for Objective Assessment of Surgical Skills Workshop. Metrics for objective assessment. Surg Endosc. 2003;17:220–226
8. Satava RM, Gallagher AG, Pellegrini CA. Surgical competence and surgical proficiency: Definitions, taxonomy, and metrics. J Am Coll Surg. 2003;196:933–937
9. Satava RM. Disruptive visions: Surgical education. Surg Endosc. 2004;18:779–781
10. Swing SR. Assessing the ACGME general competencies: General considerations and assessment methods. Acad Emerg Med. 2002;9:1278–1288
11. Pellegrini CA, Sachdeva AK, Johnson KA. Accreditation of education institutes by the American College of Surgeons: A new program following an old tradition. Bull Am Coll Surg. 2006;91:8–12
12. Simpson MG. Plant systematics: An overview. In: Plant Systematics. 2nd ed. Academic Press; 2010
13. van der Vleuten CP, Schuwirth LW. Assessing professional competence: From methods to programmes. Med Educ. 2005;39:309–317
14. Michelson JD, Manning L. Competency assessment in simulation-based procedural education. Am J Surg. 2008;196:609–615
15. Sanfey H, Ketchum J, Bartlett J, et al. Verification of proficiency in basic skills for postgraduate year 1 residents. Surgery. 2010;148:759–766
16. Boehler ML, Schwind CJ, Rogers DA, et al. A theory-based curriculum for enhancing surgical skillfulness. J Am Coll Surg. 2007;205:492–497
17. Brydges R, Kurahashi A, Brümmer V, Satterthwaite L, Classen R, Dubrowski A. Developing criteria for proficiency-based training of surgical technical skills using simulation: Changes in performances as a function of training year. J Am Coll Surg. 2008;206:205–211
18. Hodges B, McIlroy JH. Analytic global OSCE ratings are sensitive to level of training. Med Educ. 2003;37:1010–1016
19. Botden SM, de Hingh IH, Jakimowicz JJ. Meaningful assessment method for laparoscopic suturing training in augmented reality. Surg Endosc. 2009;23:2221–2228
20. Hamstra SJ, Dubrowski A, Backstein D. Teaching technical skills to surgical residents: A survey of empirical research. Clin Orthop Relat Res. 2006;449:108–115
21. Thijssen AS, Schijven MP. Contemporary virtual reality laparoscopy simulators: Quicksand or solid grounds for assessing surgical trainees? Am J Surg. 2010;199:529–541
22. Moorthy K, Munz Y, Sarker SK, Darzi A. Objective assessment of technical skills in surgery. BMJ. 2003;327:1032–1037
23. Stefanidis D. Optimal acquisition and assessment of proficiency on simulators in surgery. Surg Clin North Am. 2010;90:475–489
24. Aggarwal R, Grantcharov TP, Darzi A. Framework for systematic training and assessment of technical skills. J Am Coll Surg. 2007;204:697–705
25. Fried GM. Lessons from the surgical experience with simulators: Incorporation into training and utilization in determining competency. Gastrointest Endosc Clin N Am. 2006;16:425–434
26. Shiffrin RM, Schneider W. Controlled and automatic human information processing: Perceptual learning, automatic attending, and a general theory. Psychol Rev. 1977;84:127–190
27. Datta V, Bann S, Beard J, Mandalia M, Darzi A. Comparison of bench test evaluations of surgical skill with live operating performance assessments. J Am Coll Surg. 2004;199:603–606
28. Gallagher AG, Leonard G, Traynor OJ. Role and feasibility of psychomotor and dexterity testing in selection for surgical training. ANZ J Surg. 2009;79:108–113
29. Goff B, Mandel L, Lentz G, et al. Assessment of resident surgical skills: Is testing feasible? Am J Obstet Gynecol. 2005;192:1331–1338
30. Guru KA, Kuvshinoff BW, Pavlov-Shapiro S, et al. Impact of robotics and laparoscopy on surgical skills: A comparative study. J Am Coll Surg. 2007;204:96–101
31. Källström R, Hjertberg H, Svanvik J. Impact of virtual reality-simulated training on urology residents’ performance of transurethral resection of the prostate. J Endourol. 2010;24:1521–1528
32. Knudson MM, Khaw L, Bullard MK, et al. Trauma training in simulation: Translating skills from SIM time to real time. J Trauma. 2008;64:255–263
33. Okrainec A, Soper NJ, Swanstrom LL, Fried GM. Trends and results of the first 5 years of Fundamentals of Laparoscopic Surgery (FLS) certification testing. Surg Endosc. 2011;25:1192–1198
34. Parent RJ, Plerhoples TA, Long EE, et al. Early, intermediate, and late effects of a surgical skills “boot camp” on an objective structured assessment of technical skills: A randomized controlled study. J Am Coll Surg. 2010;210:984–989
35. Aucar JA, Groch NR, Troxel SA, Eubanks SW. A review of surgical simulation with attention to validation methodology. Surg Laparosc Endosc Percutan Tech. 2005;15:82–89
36. Sutherland LM, Middleton PF, Anthony A, et al. Surgical simulation: A systematic review. Ann Surg. 2006;243:291–300
37. Sturm LP, Windsor JA, Cosman PH, Cregan P, Hewett PJ, Maddern GJ. A systematic review of skills transfer after surgical simulation training. Ann Surg. 2008;248:166–179
38. Van Nortwick SS, Lendvay TS, Jensen AR, Wright AS, Horvath KD, Kim S. Methodologies for establishing validity in surgical simulation studies. Surgery. 2010;147:622–630
39. Bloom BS, Krathwohl DR. Taxonomy of Educational Objectives, Book I: Cognitive Domain. White Plains, NY: Longman Publishing; 1956
40. Andreatta PB, Woodrum DT, Gauger PG, Minter RM. LapMentor metrics possess limited construct validity. Simul Healthc. 2008;3:16–25