Validity and Reliability of the Robotic Objective Structured Assessment of Technical Skills

Siddiqui, Nazema Y. MD, MHSc; Galloway, Michael L. DO; Geller, Elizabeth J. MD; Green, Isabel C. MD; Hur, Hye-Chun MD; Langston, Kyle PA; Pitter, Michael C. MD; Tarr, Megan E. MD, MS; Martino, Martin A. MD

Obstetrics & Gynecology:
doi: 10.1097/AOG.0000000000000288
Contents: Original Research

OBJECTIVE: Objective Structured Assessments of Technical Skills have been developed to measure the skill of surgical trainees. Our aim was to develop an Objective Structured Assessment of Technical Skills specifically for trainees learning robotic surgery.

METHODS: This is a multiinstitutional study conducted in eight academic training programs. We created an assessment form to evaluate robotic surgical skill through five inanimate exercises. Gynecology, general surgery, and urology residents, Fellows, and faculty completed five robotic exercises on a standard training model. Study sessions were recorded and randomly assigned to three blinded judges who scored performance using the assessment form. Construct validity was evaluated by comparing scores between participants with different levels of surgical experience; interrater and intrarater reliability were also assessed.

RESULTS: We evaluated 83 residents, nine Fellows, and 13 faculty totaling 105 participants; 88 (84%) were from gynecology. Our assessment form demonstrated construct validity with faculty and Fellows performing significantly better than residents (mean scores 89±8 faculty, 74±17 Fellows, 59±22 residents; P<.01). In addition, participants with more robotic console experience scored significantly higher than those with fewer prior console surgeries (P<.01). Robotic Objective Structured Assessments of Technical Skills demonstrated good interrater reliability across all five drills (mean Cronbach's α 0.79±0.02). Intrarater reliability was also high (mean Spearman's correlation 0.91±0.11).

CONCLUSION: We developed a valid and reliable assessment form for robotic surgical skill. When paired with standardized robotic skill drills, this form may be useful to distinguish between levels of trainee performance.


In Brief

An assessment form for robotic surgical skill that demonstrates construct validity and interrater and intrarater reliability may be useful to distinguish between levels of trainee performance.

Author Information

Departments of Obstetrics and Gynecology, Duke University, Durham, North Carolina; Wright State University, Dayton, Ohio; the University of North Carolina, Chapel Hill, North Carolina; Johns Hopkins University, Baltimore, Maryland; Beth Israel Deaconess Medical Center, Boston, Massachusetts; Lehigh Valley Health Network, Allentown, Pennsylvania; Newark Beth Israel Medical Center, Newark, New Jersey; and Cleveland Clinic, Cleveland, Ohio.

Corresponding author: Nazema Y. Siddiqui, MD, MHSc, Duke University Medical Center, DUMC 3192, Durham, NC 27710.

Supported by the National Center For Advancing Translational Sciences under award number UL1TR001117. Dr Siddiqui is supported by award number K12-DK100024 from the National Institute of Diabetes and Digestive and Kidney Diseases.

The authors thank Grace Fulton, BS (research coordinator), Duke University Medical Center, for study coordination; Samantha Thomas, MB, Duke University, for assistance with statistical analyses; and Jon Kost and Hubert Huang, Lehigh Valley Health Network, for assistance with video editing and data coordination.

Presented at the 2013 APGO/CREOG Meeting, February 27–March 2, 2013, Phoenix, Arizona.

The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Financial Disclosure: All authors received reimbursement for travel from Intuitive Surgical, Inc., for a total of three curriculum development/research meetings. Dr. Pitter is a paid consultant for Intuitive Surgical, Inc.; Dr. Geller has received speaking honoraria.

Robotic-assisted surgery has emerged as an alternative minimally invasive approach for gynecologic procedures including hysterectomy, myomectomy, cancer staging, and prolapse repair. The robotic approach offers increased dexterity with wristed laparoscopic instruments but also requires technical training that is exclusive to robotic surgery.1,2 Technical skills may be assessed using Objective Structured Assessments of Technical Skills.3,4 Objective Structured Assessments of Technical Skills are typically standard assessment forms with predefined criteria indicating how to score performance on a technical skill. Compared with traditional surgical evaluations, Objective Structured Assessments of Technical Skills allow for less biased assessments of technical performance, have demonstrated validity and reliability, and thus have been adopted broadly in gynecology5–7 and other surgical training programs.8–11 As surgical techniques have evolved, modified versions of the original Objective Structured Assessments of Technical Skills form have been validated for assessment of laparoscopic and endoscopic procedures.12

Robotic surgical training relies heavily on simulation to develop technical skill with the instrumentation. Surgical trainees are often required to practice in a simulation environment before their involvement in live robotic procedures. However, when trainees perform drills using an inanimate “dry-lab” simulator, it is difficult to know how to assess performance. Existing validated assessment tools such as Objective Structured Assessments of Technical Skills4 and the Global Operative Assessment of Laparoscopic Skills12 assess parameters such as “Use of Assistants” and “Autonomy,” which are not relevant in robotic technical skills training. Thus, we further modified the Objective Structured Assessments of Technical Skills and Global Operative Assessment of Laparoscopic Skills forms to be more useful for robotic simulation training and titled the resulting form the Robotic Objective Structured Assessment of Technical Skills. Our primary aim was to assess the construct validity of this modified assessment form. Construct validity refers to the degree to which a test measures what it claims to be measuring.13 In educational research, construct validity is often demonstrated by examining whether an assessment tool can distinguish between levels of education (ie, Postgraduate Year),3,6 which is the same paradigm we used. We further aimed to assess interrater and intrarater reliability of Robotic Objective Structured Assessments of Technical Skills.

Methods

This is a multiinstitutional study conducted in eight academic training programs. It was approved by the institutional review board at each participating center. Through consensus meetings, the investigators modified the Objective Structured Assessments of Technical Skills form and developed the “Robotic Objective Structured Assessment of Technical Skills” (see the Appendix online). The Robotic Objective Structured Assessments of Technical Skills form is completed by directly observing performance on robotic simulation drills. Performance for each simulation drill is assessed across four categories: 1) depth perception and accuracy; 2) force and tissue handling; 3) dexterity; and 4) efficiency. Each category is scored from 1 to 5 with higher scores indicating more proficiency. Scores are summed across categories, giving a maximum score of 20 per drill. Notably, we initially included a fifth category, called “instrument awareness,” that was designed to assess robotic arm collisions and whether the surgeon was moving an instrument outside the field of view. During initial pilot testing, we identified a large amount of variability in how raters interpreted this category: some raters downgraded scores only for large, “clinically relevant” collisions, whereas other raters incorporated every collision in their score. Thus, this category was removed after pilot testing.
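The per-drill scoring rule described above (four categories, each scored 1 to 5, summed to a maximum of 20) can be sketched in a few lines of Python. This is only an illustration: the class name `DrillRating` and its field names are hypothetical and do not come from the published assessment form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DrillRating:
    """One judge's rating of one drill (hypothetical structure, for illustration)."""
    depth_perception_accuracy: int
    force_tissue_handling: int
    dexterity: int
    efficiency: int

    def __post_init__(self):
        # Each category is scored from 1 to 5; higher means more proficient.
        for name, value in vars(self).items():
            if not 1 <= value <= 5:
                raise ValueError(f"{name} must be between 1 and 5, got {value}")

    def total(self) -> int:
        """Sum of the four category scores (minimum 4, maximum 20)."""
        return (self.depth_perception_accuracy + self.force_tissue_handling
                + self.dexterity + self.efficiency)


# Made-up example rating, not study data.
rating = DrillRating(depth_perception_accuracy=4, force_tissue_handling=3,
                     dexterity=5, efficiency=4)
print(rating.total())
```

Rejecting out-of-range values at construction time is a design choice of this sketch; the source does not describe how invalid scores were handled.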

To test this modified assessment form, we created a standardized series of simulation exercises. These included five drills in sequence (Fig. 1, or see the video online). The first drill is “tower transfer,” in which the participant picks up rubber bands and transfers them to towers of varying heights. The second is “roller coaster,” in which a rubber band is moved around a series of wire loops. The third drill is “big dipper,” in which a needle is placed into a sponge in various prespecified directions. The fourth is “train tracks,” which involves placing a running suture. The final drill is “figure-of-eight,” in which the participant places a figure-of-eight suture and ties it using square knots. For the study, participating sites received a simulation training kit including all materials needed for the first two drills. For the needle and suturing drills (drills 3–5), sites were instructed to purchase 1-inch thick high-density foam, available at local craft stores. Foam was cut and marked using a template that was sent to each site.

We recruited residents, Fellows, and faculty to participate in a multiinstitutional study from January to May 2012. Participants were from gynecology (including gynecologic oncology, female pelvic medicine and reconstructive surgery, and advanced laparoscopy fellowships), urology, and general surgery. Participants attended study sessions where they completed baseline questionnaires assessing demographics, surgical experience, and prior robotic experience (Table 1). They then reviewed an orientation video demonstrating the five robotic simulation drills followed by completion of the drills. For each drill, participants were given 1 minute to practice and permitted up to 6 minutes to complete the exercise. A different piece of foam was used for practice than for skill assessment. Drills were digitally recorded using the robotic camera. All materials (questionnaires and videos) were deidentified and labeled with study numbers. Deidentified recordings were uploaded to a central data repository, which was electronically accessible by each investigator.

Each investigator was assigned recordings to score using the Robotic Objective Structured Assessments of Technical Skills form. Before scoring, all investigators participated in an orientation to the assessment form and video review process. Each deidentified study recording was then randomly assigned to three blinded investigators. Notably, investigators were not assigned video recordings from their own site.
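One way such an assignment could be carried out is sketched below: each recording is given three randomly chosen judges, excluding any judge from the recording's own site. The function name `assign_judges`, the identifiers, and the site labels are hypothetical, and the sketch makes no attempt to balance workload across judges because the source does not say whether that was done.

```python
import random


def assign_judges(recordings, judges, n_judges=3, seed=None):
    """Randomly assign judges to recordings, excluding judges from the same site.

    recordings: dict mapping recording id -> site of the participant
    judges:     dict mapping judge id -> site of the judge
    Returns a dict mapping recording id -> list of n_judges judge ids.
    """
    rng = random.Random(seed)
    assignments = {}
    for rec_id, rec_site in recordings.items():
        # Only judges from other sites are eligible for this recording.
        eligible = sorted(j for j, site in judges.items() if site != rec_site)
        if len(eligible) < n_judges:
            raise ValueError(f"not enough eligible judges for recording {rec_id}")
        assignments[rec_id] = rng.sample(eligible, n_judges)
    return assignments


# Made-up identifiers and sites, for illustration only.
example_recordings = {"rec-001": "site-A", "rec-002": "site-B", "rec-003": "site-C"}
example_judges = {"judge-1": "site-A", "judge-2": "site-B", "judge-3": "site-C",
                  "judge-4": "site-D", "judge-5": "site-E"}
example_assignments = assign_judges(example_recordings, example_judges, seed=0)
```

Because the recordings were deidentified, the site lookup in a real workflow would have to be held by someone other than the judges; this sketch ignores that separation.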

For our analysis, we first assessed construct validity by evaluating the ability of the assessment form to distinguish participants based on robotic surgical experience. This was evaluated by comparing scores based on 1) level of training and 2) reported number of prior robotic console surgeries. Using the three judges' Robotic Objective Structured Assessments of Technical Skills scores, we calculated the median score for each exercise (maximum 20 points per exercise) and summed these medians to calculate a summary score (maximum 100 points) for each participant. Three-way comparisons were performed with one-way analysis of variance using the Tukey test for post hoc comparisons. Two-way comparisons were performed using Student’s t test.
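The summary score described above (the median of the three judges' scores for each drill, summed over the five drills) can be illustrated with a short Python sketch. The function name and the example numbers are invented for illustration and are not study data.

```python
from statistics import median

# Drill names as given in the text.
DRILLS = ("tower transfer", "roller coaster", "big dipper",
          "train tracks", "figure-of-eight")


def summary_score(judge_scores):
    """Sum, over the five drills, of the median of the three judges' scores.

    judge_scores: dict mapping drill name -> list of three per-drill totals,
    each between 4 and 20. The result is at most 100.
    """
    total = 0
    for drill in DRILLS:
        scores = judge_scores[drill]
        if len(scores) != 3:
            raise ValueError(f"expected three judges' scores for {drill}")
        total += median(scores)
    return total


# Hypothetical participant.
example_scores = {
    "tower transfer": [15, 16, 18],   # median 16
    "roller coaster": [12, 14, 14],   # median 14
    "big dipper": [10, 13, 17],       # median 13
    "train tracks": [18, 18, 20],     # median 18
    "figure-of-eight": [9, 11, 12],   # median 11
}
print(summary_score(example_scores))  # 16 + 14 + 13 + 18 + 11 = 72
```

Using the median rather than the mean means a single outlying judge cannot move a drill's score, which fits the three-judge design.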

We also assessed interrater and intrarater reliability. For interrater reliability, we compared the three judges' scores for each drill using intraclass correlation coefficients, which are reported using Cronbach's α. To assess intrarater reliability, multiple judges were reassigned 10 of their original deidentified recordings for repeat review 2 months after the initial scoring process. Scores from the first and second reviews were compared using Spearman's correlation. Summarized intrarater reliability was calculated using the means of the multiple judges' correlation coefficients. Correlation coefficients were compared between drills using analysis of variance. Statistical comparisons of the Cronbach's α coefficients were performed with an extended Feldt's F-test using R 3.0.1, and comparisons of the Spearman correlation coefficients were performed with an F test using SAS 9.3. The remaining data were analyzed using SPSS 20.0 with P<.05 considered statistically significant.
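For readers unfamiliar with the two reliability statistics named above, the following Python sketch shows the standard textbook formulas: Cronbach's α with judges treated as items, and Spearman's correlation computed as the Pearson correlation of average ranks. It is not the authors' code (the study used R, SAS, and SPSS), and the function names are illustrative.

```python
from statistics import mean, variance


def cronbach_alpha(ratings):
    """Cronbach's alpha with judges treated as items.

    ratings: list of rows, one per recording, each row holding one score per judge.
    alpha = k/(k-1) * (1 - sum of per-judge variances / variance of row sums)
    """
    k = len(ratings[0])
    judge_vars = [variance([row[j] for row in ratings]) for j in range(k)]
    total_var = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(judge_vars) / total_var)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def spearman(first_review, second_review):
    """Spearman's rank correlation between two sets of scores."""
    return pearson(average_ranks(first_review), average_ranks(second_review))
```

In this sketch, `cronbach_alpha` would be called once per drill on the three judges' scores, and `spearman` once per judge on that judge's first and second reviews of the same recordings. Both functions assume the scores actually vary; constant input leads to a division by zero.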

Results

There were a total of 105 participants, including 83 residents, nine Fellows, and 13 faculty members. The majority (84%) were from gynecology with the remainder from general surgery and urology. Of the 83 residents, there were 13 Postgraduate Year (PGY) 1, 22 PGY 2, 29 PGY 3, 19 PGY 4, and no PGY 5 or 6. Postgraduate Years 1 and 2 were considered junior residents, whereas PGY 3 and 4 were considered senior residents. Of all residents, 10% had assisted at the console for more than 10 procedures. Fellows were in various stages of their training with five of nine (56%) having participated in more than 10 procedures as a console surgeon. All faculty members had significant robotic console experience with a median of 108 robotic procedures (range 50–500).

Construct validity was demonstrated by the ability of our assessment form to distinguish participants based on prior robotic surgical experience. Faculty performed significantly better than Fellows, and Fellows performed significantly better than residents (Table 2). Even among residents, our tool was able to distinguish between junior- and senior-level residents. Furthermore, participants with more robotic console experience scored significantly higher than those with less robotic console experience regardless of PGY (Table 3).

Interrater reliability was assessed for each drill. By convention, correlation coefficients above 0.7 are considered sufficiently high and coefficients above 0.8 are considered very high agreement.14 Interrater reliability was consistently high for all five drills with a mean Cronbach's α of 0.79±0.03 (Table 4). There were no significant differences in Cronbach's α coefficients between drills (P=.57). Intrarater reliability was also consistently high for all five drills with a mean correlation coefficient among all raters of 0.91±0.11 (Table 5). The overall analysis of variance indicates no significant differences in mean correlation coefficient between drills (P=.96). There were also no differences in individual pairwise comparisons.

Discussion

We created a modified Objective Structured Assessments of Technical Skills that can be used for objective assessment of performance during robotic training. This assessment form demonstrates construct validity and interrater and intrarater reliability in inanimate simulation training.

There have been numerous recent reports focusing on robotic technical skills training.15–17 To date, the more comprehensive studies have been performed using a series of nine inanimate technical skills—five of these based on prior Fundamentals of Laparoscopic Surgery exercises with four newly developed tasks for robotic surgery.15,18,19 For these drills, investigators chose to assess performance using a combination of time and errors. Using this assessment technique, construct validity was established in a small study involving eight faculty and Fellows and four medical students.15 Thus, prior studies have shown that based on time and errors, trained surgeons perform better than medical students. We were less interested in the ability of an assessment technique to distinguish between novice and expert levels of proficiency and more interested in the ability to distinguish performance among various levels of resident and Fellow trainees. Furthermore, we wanted to offer a competency-based instead of time-based approach to assessment, which is currently lacking in robotic simulation training.20 In our current study, we used a modified version of Objective Structured Assessments of Technical Skills, because this method of assessment has demonstrated validity and reliability across numerous types of technical skills.3,4,6,12 Also, because our goal is to assess trainee performance during simulation training, we performed our validation study in a resident and Fellow trainee population.

The strengths of our study lie in the methodology that we used. Because of the multiinstitutional design, we were able to include a larger number of resident and Fellow trainees than typical for an educational study. We used rigorous methods for standardization of study sessions and blinding of scoring judges to reduce bias. Furthermore, our participants encompassed multiple surgical disciplines, and the judges represented many gynecologic subspecialties, allowing greater generalizability of our findings. Through our study design, we also demonstrated reproducibility of our results. This is important because, even if an educational tool is valid, it must also be reliable among different judges to be useful for a training program.

Our study also has some notable limitations. Although there are multiple types of validity, we focused only on construct validity. We indeed demonstrated that our assessment form could discriminate between levels of training and, furthermore, between levels of prior robotic surgical experience. We did not, however, include experienced nonrobotic surgeons in our study. Inclusion of nonrobotic faculty surgeons as a control group would have allowed us to assess whether surgical experience alone, rather than robotic console experience, contributes to higher scores. Another limitation of our study is that we were unable to assess predictive validity—the ability of a tool to predict future performance (eg, in the live operating room). Lastly, although our intrarater reliability coefficients were excellent, some of the interrater reliability scores were only moderately high (α of 0.78 for tower transfer, big dipper, and train tracks drills). This suggests that subtle differences in interpretation may exist, although our intraclass correlation coefficient values were well within the 0.7 to 0.95 range reported in other Objective Structured Assessments of Technical Skills studies.11,21–23

Residency programs are continually being challenged to improve surgical training, and simulation training is becoming an increasingly popular method. Trainees wishing to learn robotic surgery often spend time practicing simulation drills but lack an objective method to demonstrate their performance. We offer an assessment tool that can be used to assess trainees during simulation training. It is important to note that to use these drills and assessment form as a “test of readiness” for live surgery, rigorous methodology should be used to establish benchmark scores for resident performance. Importantly, for residents, who are still in a training environment, these benchmarks may be very different from those that are set for faculty surgeons. Furthermore, we recognize that robotic surgery not only involves technical console skills, but also presents unique challenges for the bedside assistant, for communication in the operating room, and for cost containment in a training environment. Thus, we would consider incorporating technical skill training as one component of a comprehensive robotic training curriculum. One such curriculum exists and is currently undergoing further study.24

In summary, we present an assessment form for robotic simulation training that demonstrates validity and reliability in resident and Fellow trainees. This tool, when paired with inanimate robotic skill drills, may prove useful for competency-based training and assessment.

References

1. Lenihan JP Jr, Kovanda C, Seshadri-Kreaden U. What is the learning curve for robotic assisted gynecologic surgery? J Minim Invasive Gynecol 2008;15:589–94.
2. Woelk JL, Casiano ER, Weaver AL, Gostout BS, Trabuco EC, Gebhart JB. The learning curve of robotic hysterectomy. Obstet Gynecol 2013;121:87–95.
3. Martin JA, Regehr G, Reznick R, MacRae H, Murnaghan J, Hutchison C, et al. Objective structured assessment of technical skill (OSATS) for surgical residents. Br J Surg 1997;84:273–8.
4. Reznick R, Regehr G, MacRae H, Martin J, McCulloch W. Testing technical skill via an innovative ‘bench station’ examination. Am J Surg 1997;173:226–30.
5. Buerkle B, Rueter K, Hefler LA, Tempfer-Bentz EK, Tempfer CB. Objective Structured Assessment of Technical Skills (OSATS) evaluation of theoretical versus hands-on training of vaginal breech delivery management: a randomized trial. Eur J Obstet Gynecol Reprod Biol 2013;171:252–6.
6. Goff BA, Nielsen PE, Lentz GM, Chow GE, Chalmers RW, Fenner D, et al. Surgical skills assessment: a blinded examination of obstetrics and gynecology residents. Am J Obstet Gynecol 2002;186:613–7.
7. Rackow BW, Solnik MJ, Tu FF, Senapati S, Pozolo KE, Du H. Deliberate practice improves obstetrics and gynecology residents' hysteroscopy skills. J Grad Med Educ 2012;4:329–34.
8. Gélinas-Phaneuf N, Del Maestro RF. Surgical expertise in neurosurgery: integrating theory into practice. Neurosurgery 2013;73(suppl 1):S30–8.
9. Steehler MK, Chu EE, Na H, Pfisterer MJ, Hesham HN, Malekzadeh S. Teaching and assessing endoscopic sinus surgery skills on a validated low-cost task trainer. Laryngoscope 2013;123:841–4.
10. Tsagkataki M, Choudhary A. Mersey deanery ophthalmology trainees' views of the objective assessment of surgical and technical skills (OSATS) workplace-based assessment tool. Perspect Med Educ 2013;2:21–7.
11. Zevin B, Bonrath EM, Aggarwal R, Dedy NJ, Ahmed N, Grantcharov TP, et al. Development, feasibility, validity, and reliability of a scale for objective assessment of operative performance in laparoscopic gastric bypass surgery. J Am Coll Surg 2013;216:955–965.e8.
12. Vassiliou MC, Feldman LS, Andrew CG, Bergman S, Leffondré K, Stanbridge D, et al. A global assessment tool for evaluation of intraoperative laparoscopic skills. Am J Surg 2005;190:107–13.
13. Messick S. Standards of validity and the validity of standards in performance assessment. Educ Meas Issues Pract 1995;14:5–8.
14. Gallagher AG, Ritter EM, Satava RM. Fundamental principles of validation, and reliability: rigorous science for the assessment of surgical education and training. Surg Endosc 2003;17:1525–9.
15. Dulan G, Rege RV, Hogg DC, Gilberg-Fisher KM, Arain NA, Tesfay ST, et al. Proficiency-based training for robotic surgery: construct validity, workload, and expert levels for nine inanimate exercises. Surg Endosc 2012;26:1516–21.
16. Hung AJ, Jayaratna IS, Teruya K, Desai MM, Gill IS, Goh AC. Comparative assessment of three standardized robotic surgery training methods. BJU Int 2013;112:864–71.
17. Lyons C, Goldfarb D, Jones SL, Badhiwala N, Miles B, Link R, et al. Which skills really matter? Proving face, content, and construct validity for a commercial robotic simulator. Surg Endosc 2013;27:2020–30.
18. Arain NA, Dulan G, Hogg DC, Rege RV, Powers CE, Tesfay ST, et al. Comprehensive proficiency-based inanimate training for robotic surgery: reliability, feasibility, and educational benefit. Surg Endosc 2012;26:2740–5.
19. Dulan G, Rege RV, Hogg DC, Gilberg-Fisher KM, Arain NA, Tesfay ST, et al. Developing a comprehensive, proficiency-based training program for robotic surgery. Surgery 2012;152:477–88.
20. Schreuder HW, Wolswijk R, Zweemer RP, Schijven MP, Verheijen RH. Training and learning robotic surgery, time for a more structured approach: a systematic review. BJOG 2012;119:137–49.
21. Jabbour N, Reihsen T, Payne NR, Finkelstein M, Sweet RM, Sidman JD. Validated assessment tools for pediatric airway endoscopy simulation. Otolaryngol Head Neck Surg 2012;147:1131–5.
22. Marglani O, Alherabi A, Al-Andejani T, Javer A, Al-Zalabani A, Chalmers A. Development of a tool for Global Rating of Endoscopic Surgical Skills (GRESS) for assessment of otolaryngology residents. B-ENT 2012;8:191–5.
23. Nimmons GL, Chang KE, Funk GF, Shonka DC, Pagedar NA. Validation of a task-specific scoring system for a microvascular surgery simulation model. Laryngoscope 2012;122:2164–8.
24. Robotic Training Network. Available at: Retrieved January 12, 2014.

© 2014 by The American College of Obstetricians and Gynecologists.