
Evaluating teaching methods: Validation of an evaluation tool for hydrodissection and phacoemulsification portions of cataract surgery

Smith, Ronald J. MD, MPH*; McCannel, Colin A. MD; Gordon, Lynn K. MD, PhD; Hollander, David A. MD; Giaconi, JoAnn A. MD; Stelzner, Sadiqa K. MD; Devgan, Uday MD; Bartlett, John MD; Mondino, Bartly J. MD

Author Information
doi: 10.1016/j.jcrs.2013.11.048


Surgical interventions are expected to be scientifically based. In contrast, teaching methods used to instruct surgical technique have not been. More than 3 million cataract surgeries are performed each year in the United States1 and the reported rate of major complications among trainees is between 5% and 7%,2–4 which is 3 to 50 times greater than the rate seen with experienced and high-volume surgeons.4,5 Both the Accreditation Council for Graduate Medical Education and the American Board of Ophthalmology have recognized a fundamental lack of scientific objectivity for teaching and evaluating surgical technique and established an imperative to develop a scientific method for evaluating and improving teaching methods.6–8 They have mandated a paradigm shift in assessing resident education from counting cases to measuring training outcomes.9

In a 10-year retrospective review, Rogers et al.10 found that by instituting fundamental structural changes to their surgical training curriculum, they significantly reduced the complication rate for subsequent residents. To improve teaching and reduce resident complication rates for the current resident, a valid and reliable set of measures of surgical skills is needed. These measures should not rely on sentinel event complications or summative feedback alone but rather should be based on surgical performance itself and provide immediate formative feedback to residents that they can use as they prepare for their next case. Furthermore, such a set of measures could be applied before and after a particular teaching intervention to ultimately identify the most effective teaching methods.

We study a small portion of cataract surgery to more easily identify specific actionable improvements for a surgeon. Short video clips also reduce the potential for evaluator fatigue and may eliminate the bias that viewing an earlier or later portion of the surgery could introduce into the evaluation of the portion being studied. We previously studied the capsulorhexis portion of the procedure.11 In the current study, we studied the hydrodissection and phacoemulsification steps because they are challenging and fundamental portions of cataract surgery. We developed a method for measuring surgical performance for hydrodissection and phacoemulsification and report the method’s validity and reliability.

Subjects and methods

An expert panel of University of California Los Angeles ophthalmologists from the Veterans Administration (VA) Medical Center of West Los Angeles, Olive View Medical Center, Jules Stein Eye Institute, and Harbor-UCLA Medical Center was surveyed and the literature was reviewed to develop an evaluation tool for measuring the performance of the hydrodissection and phacoemulsification steps of cataract surgery (Appendix). The evaluation tool consisted of 15 questions (items to be graded); questions were drawn from the Global Rating Assessment of Skills in Intraocular Surgery (GRASIS)12 and modified from the International Council of Ophthalmology-approved Ophthalmology Surgical Competency Assessment Rubric (OSCAR) for phacoemulsification.13 The principal modification was to divide compound questions from the OSCAR rubric into 2 or more separate questions to more specifically test each surgical task. Slight changes were also made to eliminate internal inconsistency within questions and to keep questions brief. Questions in the analysis tool were placed on a continuous visual Likert scale to enable graders to measure more accurately between the defined categories.

DVD video recordings of the hydrodissection and phacoemulsification portions of cataract surgery from residents at different levels of training and an expert cataract surgeon were submitted for assessment with the evaluation tool. Cases were collected from surgery performed at 3 centers: the VA Medical Center of West Los Angeles, Olive View Medical Center, and Jules Stein Eye Institute. There was 1 case from a postgraduate year 2 resident with no previous phacoemulsification cases (PGY2-0), 1 case from a PGY 3 resident with 3 previous cases (PGY3-3), a case from a PGY 4 resident with 39 prior cases (PGY4-39) and the next case from that resident (PGY4-40), a case from a PGY 4 resident with 111 previous cases (PGY4-111), and a case from a highly experienced cataract surgeon with 9300 previous phacoemulsification cases.

Each video clip was independently graded by 7 or 8 experienced reviewers in a masked fashion. The graders had developed and used a similar type of evaluation tool for studying capsulorhexis11 and participated in developing the questions of the current tool for the hydrodissection and phacoemulsification steps. No additional instructions were given beyond what was written on the tool. Masking was performed by assigning a random number to each video clip and a random number to each reviewer. The interobserver variability in the responses to each question on the evaluation tool for each video was assessed to identify the most reliable questions. If a grader was also the attending surgeon for a case, the grader was excluded from grading that case to avoid introducing bias; thus, some videos were graded by 7 reviewers and others by 8. The principal investigator (R.J.S.), who was responsible for keeping the randomized labeling code of the videos masked, did not participate in the grading used for the analysis.
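As an illustration, the masked-assignment scheme described above (random codes for each video and each reviewer, with attending surgeons excluded from grading their own cases) might be sketched as follows. This is a hypothetical reconstruction; the function name, data structures, and code format are assumptions, not details from the study protocol:

```python
import random

def build_masked_assignments(videos, reviewers, attending_of, seed=0):
    """Assign random numeric codes to videos and reviewers, and exclude
    any reviewer from grading a case on which they were the attending
    surgeon. Illustrative sketch only; all names are hypothetical."""
    rng = random.Random(seed)
    # Distinct random codes replace identifying labels on videos and graders.
    video_codes = dict(zip(videos, rng.sample(range(100, 1000), len(videos))))
    reviewer_codes = dict(zip(reviewers, rng.sample(range(100, 1000), len(reviewers))))
    assignments = {}
    for v in videos:
        # A grader who was the attending surgeon for this case is excluded.
        eligible = [r for r in reviewers if attending_of.get(v) != r]
        assignments[video_codes[v]] = [reviewer_codes[r] for r in eligible]
    return video_codes, reviewer_codes, assignments
```

With 8 reviewers, a case whose attending surgeon is on the panel ends up with 7 eligible graders, matching the 7-or-8 reviewer counts reported above.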

Exclusion Criteria

For the resident surgery, if any part of the hydrodissection or phacoemulsification was performed by the attending surgeon, the case was not eligible for inclusion in the study. Cases in which iris hooks or a Malyugin ring were used were also not eligible for inclusion.

Safeguards for Patient Safety

There was no change in usual patient care other than that a copy of each surgery’s DVD recording was sent to the independent panel of experts for review. In compliance with the U.S. Health Insurance Portability and Accountability Act, there were no patient-identifying markers on the video recording.

Safeguards for the Surgeons

An information sheet describing the study was given to each surgeon before the scheduled cataract cases, assuring them that their names would not be used in conjunction with any presentation of the results. Participation was voluntary, and privacy of the surgeon as well as patient was maintained. The study design was approved by the UCLA Institutional Review Board.

Statistical Analysis

For each question, the intergrader variability was assessed using the intraclass correlation coefficient (ICC). The ICC is defined as the ratio of between-video variance to the total variance. The total variance is the sum of between-video variance and within-video variance. The ICC is between 0 and 1; a higher ICC value indicates better agreement between graders (lower intergrader variability). The mean difference in measurements between all graders for each question of each surgeon’s video was compared using repeated-measures analysis of variance (ANOVA). For analysis purposes, “n” refers to the number of observations for each question. All statistical analysis was performed using SAS software (version 9.1, SAS Institute, Inc.).
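For readers who want to reproduce the variance-components arithmetic, a one-way single-rating ICC can be estimated from ANOVA mean squares as sketched below. This is an illustrative implementation under a balanced-design assumption (equal graders per video), not the SAS code used in the study:

```python
def icc_one_way(ratings):
    """Estimate the one-way intraclass correlation coefficient, ICC(1):
    between-video variance / (between-video + within-video variance),
    from one-way ANOVA mean squares. `ratings` holds one inner list of
    grader scores per video and assumes the same number of graders per
    video. Illustrative sketch only (the study used SAS)."""
    k = len(ratings)                  # number of videos
    n = len(ratings[0])               # graders per video
    grand = sum(sum(r) for r in ratings) / (k * n)
    means = [sum(r) / n for r in ratings]
    ms_between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    ms_within = sum((x - m) ** 2
                    for r, m in zip(ratings, means) for x in r) / (k * (n - 1))
    var_between = (ms_between - ms_within) / n  # between-video variance component
    icc = var_between / (var_between + ms_within)
    return max(0.0, icc)              # negative estimates are reported as 0
```

Perfect grader agreement yields an ICC of 1; when grader disagreement dominates the differences between videos, the estimate falls toward 0.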


Results

Six video clips were evaluated, 3 by 7 reviewers and 3 by 8 reviewers, providing 45 observations for assessment of the interobserver variability for most questions. The duration of the combined hydrodissection and phacoemulsification portions of the procedure correlated inversely with surgical experience, ranging from 4.0 minutes for the most experienced surgeon’s case to 23.1 minutes for the least experienced surgeon’s case (PGY2-0, no previous experience) (Spearman correlation coefficient 0.83, P=.042). The time the graders required to assess the surgical technique was usually the same as the video’s run time or slightly longer, ranging from a mean of 5.6 minutes for the experienced surgeon’s video to 23.9 minutes for the beginning surgeon’s video (Table 1).
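A Spearman rank correlation of the kind reported above can be computed by applying Pearson’s formula to the ranks of the two variables. The sketch below uses pure Python; note that the intermediate durations are hypothetical, since only the 4.0- and 23.1-minute extremes are given in the text:

```python
def spearman(x, y):
    """Spearman rank correlation for samples without ties:
    the Pearson correlation computed on the ranks of x and y."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Prior case counts from the study; durations (minutes) are hypothetical
# except for the 23.1- and 4.0-minute extremes given in the text.
experience = [0, 3, 39, 40, 111, 9300]
duration = [23.1, 18.0, 12.5, 11.0, 8.2, 4.0]
rho = spearman(experience, duration)  # -1.0 for perfectly monotone data
```

The sign of the coefficient depends on how the variables are oriented; a strictly decreasing duration with increasing experience gives a negative rank correlation.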

Table 1. Interobserver variability for each question and comparison of mean grades of each question between surgeons with different levels of experience. Scores for each video are listed in order of the experience of the surgeon, from the most experienced surgeon with 9300 prior cases (ES-9300) to the least experienced surgeon with no prior case experience (PGY2-0).

Table 1 lists the mean rating and standard deviation of the grades for each question for each of the 6 surgical videos. Eleven questions resulted in significant mean differences in the surgical techniques between the surgical videos (P<.05, ANOVA). Of those, the questions with the lowest interobserver variability were question 1: Instrument handling during hydrodissection; question 2: Flow of operation: Time and motion during hydrodissection; question 3: Hydrodissection and nucleus rotation during hydrodissection; question 6: Nucleus rotation and manipulation in the phacoemulsification portion; and question 8: Nucleus cracking (ICC = 0.62).

Validity of Selected Questions on Detail

Question 1. Instrument Handling During Hydrodissection

This question had low interobserver variability, and the mean scores were higher for more experienced surgeons and tended to be lower for the least experienced surgeons; the association was statistically significant (P<.0001, ANOVA) (Table 1).

Question 2. Flow of Operation: Time and Motion During Hydrodissection

This question had the lowest interobserver variability, and the mean scores were higher for more experienced surgeons and tended to be lower for the least experienced surgeons; the association was statistically significant (P<.0001, ANOVA).

Question 3. Hydrodissection and Nucleus Rotation

This question was also associated with surgical skill (P<.0001, ANOVA) but had somewhat more variability than the other 2 questions on hydrodissection.

Question 6. Nucleus Rotation and Manipulation

Nucleus rotation and manipulation were significantly associated with surgical experience and had the third lowest interobserver variability.

Question 8. Nucleus Cracking

This question was significantly associated with surgical experience. The interobserver variability in grading yielded an ICC of 0.62.

The questions with the greatest interobserver variability were from the phacoemulsification portion. These were question 9: Nucleus chopping; question 10: Anterior chamber stability; and question 12: Segment removal bringing the segments to the tip while protecting the capsule.

Complication and Analysis of the Complicated Case

Case PGY4-39 was complicated by a posterior capsule tear, which occurred during phacoemulsification of the last pieces of nucleus. An ophthalmic viscosurgical device (OVD) was injected, and the remaining lens fragments were flushed out with the OVD. The grade on the management of the complication question and the standard deviation were on the order of the other questions in the evaluation tool. The grades on the questions of the complicated case were analyzed further by comparing those grades to the grades on the next case of the same resident (PGY4-40), which was uneventful. Two questions resulted in significant mean differences between the 2 surgical videos, question 5 (Effective use and stability of phaco probe and second instrument), which improved from 3.77 ± 0.53 (SD) for PGY4-39 to 4.44 ± 0.45 for PGY4-40 (P=.031), and question 12 (Segment removal bringing the segments to the tip while protecting the capsule), which improved from 3.24 ± 0.81 for the PGY4-39 to 4.04 ± 0.69 for PGY4-40 (P=.031).


Discussion

This study shows that surgical technique itself is measurable but that not all questions on an evaluation tool give accurate measurements. Some questions elicit high interobserver variability, while others elicit low interobserver variability; questions that elicit high interobserver variation should be avoided. Furthermore, we showed a potential application of an evaluation tool: in a subanalysis of a complicated case, we showed how the tool could be applied before and after a complication to study changes in the surgeon’s technique. Our study supports the view that assessment-tool questions can accurately assess components of cataract surgery skill; however, each question must be formally validated when it is used.

The surgical experience captured in the videos spanned the range from beginner to expert, from a PGY2’s first cataract surgery to a highly experienced surgeon with 9300 previous cataract surgeries. For 11 of the 15 questions, the assessments of the reviewers were significantly associated with experience level, with the most experienced surgeon scoring highest and the least experienced surgeons scoring lowest. Saleh et al.14 studied an evaluation tool consisting of 20 Likert-scale (5-point) questions, including a question on hydrodissection and 5 questions on phacoemulsification as well as questions on global technique over the whole case, such as centration of the microscope, handling of tissues, and the overall speed and fluidity of the procedure. They tested the tool on 38 videos from 38 surgeons of various levels of experience and found that total scores increased with the level of surgical training up to the group with 250 cases or more. They also found surgeon-to-surgeon variability; some novice surgeons with less experience scored higher than novices with more experience. Our study’s results are consistent with both the general trend and the variability between surgeons. For nearly all questions, the video from the experienced surgeon scored higher than those of the PGY4 surgeons, which scored higher than that of the PGY3 surgeon, which scored higher than that of the PGY2 surgeon. Within the PGY4 level, however, the grades for the video of the surgeon with less case experience were higher for most questions. We found similar results in our previous study of capsulorhexis,11 in which PGY4 surgeon videos scored higher than PGY3 surgeon videos; however, within the PGY level, the video of the resident with fewer cases received higher scores on some questions. Having reviewed the videos with the principal investigator (R.J.S.), we believe that the questions correctly assessed the quality of surgical technique shown on the videos and suspect that the differences reflect variability between residents or an individual resident’s case-to-case variability in surgical performance.

Our questions had a higher level of interobserver variability than in our study of an evaluation tool for capsulorhexis technique,11 in which the ICC for the best question was 0.87. The ICC provides an assessment of interobserver variability for each question and is the ratio of the between-video variance to the total variance. If there were perfect agreement between graders, the only variance would be the between-video variance and the ICC for that question would be 1. In contrast, large interobserver variability inflates the within-video variance relative to the between-video variance, and the ICC approaches 0.

Rootman et al.15 reported the interobserver variability in a study of an evaluation tool consisting of 15 Likert scale (5-point) questions on cataract surgery, including task-specific and global questions. Fourteen observers performed a masked review of 1 video of an experienced surgeon and 1 video of a novice surgeon. The mean total score was 30.3 ± 6.1 for the novice and 48.3 ± 7.2 (P<.001) for the experienced surgeon, and the interobserver variability between the 14 observers resulted in an ICC of 0.81. On average, they found a difference of 1.4 grading units for task-specific questions between the novice and the expert. Our results compare favorably, with a difference of 2.6 grading units between the expert and the beginner for the best hydrodissection question and a difference of 1.8 grading units for the best phacoemulsification question.

The most reliable questions assessed tasks that could be seen well in 2 dimensions. Hydrodissection and the related ability to move the nucleus after hydrodissection were the most reliable questions, followed by nucleus cracking. In our previous study of capsulorhexis,11 we found less interobserver variability using a similar evaluation tool. In contrast, questions assessing tasks that require a 3-dimensional (3-D) perspective, such as phacoemulsification while protecting the posterior capsule, for which the grader must ideally see the instrument tip’s distance from the posterior capsule, elicited high interobserver variation. Similarly, anterior chamber stability had high interobserver variability, presumably because anterior chamber depth is also difficult to appreciate in a 2-dimensional (2-D) view. Feudner et al.16 reported an overall ICC of 0.91 and the interobserver variability for each question of a 5-question evaluation tool for capsulorhexis, tested on 10 randomly selected masked videotapes of porcine surgery by residents and medical students, each reviewed by 3 observers. The least variable measure was time, with an ICC of 1.0; the highest interobserver variability, an ICC of 0.65, was for the question on tissue protection. Feudner et al. speculate, as we do, that the increased variability in assessing tissue protection stems from the difficulty of judging the 3-D aspects of the task from a 2-D video recording. Tasks that require subtle appreciation of depth for optimum surgical performance may also require the grader to have a 3-D view to grade them most consistently. Improved videography allowing a second camera and an overlay of phacoemulsification parameters may improve the consistency of measuring 3-D skills in phacoemulsification surgery.

Other potential sources of interobserver variability are that the phacoemulsification portion of the procedure is longer than the hydrodissection and capsulorhexis steps and that the grader had to formulate a composite of multiple repeated events spread out in time over the phacoemulsification step to make an assessment.

We reduced the number of questions that required counting in our evaluation tool to 1 because those requiring counting were not reliable in our previous study and our question on number of phaco-tip withdrawals likewise was unreliable. Simultaneously grading and keeping count may be difficult to do in a single viewing. A better way to assess repetitive insertion of an instrument may be through developing image-processing software or to assign a separate grader to keep count of repetitive motions, including instrument insertions.

There were a small number of cases and a small number of observers. This enabled a consistent group of graders in a defined time period to assess the technique of surgeons with different skill levels but similar surgical instrumentation and technique. Some questions were not applicable. For example, most surgeons in the study were predominantly using a sculpting technique and as a result, many graders marked the question on chopping as not applicable. Further validity testing of this question, especially in centers where chopping is the predominant technique, would be required.

One approach to reducing interobserver variability would be to develop objective metrics of surgical technique that could be assessed by video image processing, without an observer grading the videotape. Smith et al.17 developed such an objective evaluation tool by applying motion analysis and video image processing to cataract surgical technique. They compared 10 videos (1 each from 10 junior surgeons with fewer than 200 cases) with 10 videos (1 each from 10 experienced surgeons with more than 1000 cases each) and found significant differences in total path length, number of movements, and time between the novice and experienced surgeons. We also found, as expected, that the duration of the procedure was shorter for more experienced surgeons; however, Smith et al. further determined that path length discriminated between the groups better than time alone, suggesting that further efforts to develop objective assessment by computer-aided video image processing may be fruitful.
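To give a sense of the kind of objective metric Smith et al. describe, the sketch below computes total path length and a crude movement count from a sequence of tracked instrument-tip positions. The data format, movement threshold, and function name are assumptions for illustration, not details of their method:

```python
import math

def path_metrics(points, move_threshold=0.5):
    """Given a sequence of (x, y) instrument-tip positions (e.g. from
    frame-by-frame video tracking), return the total path length and a
    crude count of discrete movements, where a movement is a run of
    frame-to-frame displacements above `move_threshold`. Hypothetical
    sketch of the kind of motion metric described by Smith et al."""
    total = 0.0
    moves = 0
    moving = False
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        d = math.hypot(x1 - x0, y1 - y0)
        total += d
        if d > move_threshold:
            if not moving:
                moves += 1   # a new discrete movement begins
            moving = True
        else:
            moving = False
    return total, moves
```

On such metrics, a shorter total path and fewer discrete movements would be expected from the more experienced surgeon, consistent with the group differences Smith et al. report.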

One case in our study was complicated by a posterior capsule tear, which occurred during phacoemulsification of the last pieces of nucleus. The masked review of the resident’s case with the complication and the following case from the same day and same resident showed no decline in technique and in 2 questions, there was significant improvement. In the case performed after the complication, the resident showed significantly better use and stability of the phacoemulsification probe and second instrument and significant improvement in the ability to bring segments to the tip while protecting the capsule. Evaluating the video of a case before and after a surgical complication may provide insights into the effect of a complication on the surgeon’s subsequent performance and how learning may occur in response to a complication.

In their study of capsulorhexis, Feudner et al.16 further showed the application of an evaluation tool to the prospective randomized evaluation of a teaching intervention. Sixty-three trainees were randomized to receive Eyesi simulator (VR Magic) training or not to receive such training. For each trainee, 3 wet-lab capsulorhexis surgeries were recorded before training and 3 weeks after training; surgeries by the control group that did not receive training were also recorded. The scores improved significantly in the group that received simulator training compared with the control group.

Medical education is undergoing a paradigm shift. We are beginning to understand that counting cases to measure training outcomes is no longer optimum. Training or teaching outcomes must be assessed more objectively using validated measurement tools. Only with reliable tools can learners be assessed accurately and fairly. However, the development of such tools hinges on accurate and reliable measures of surgical technique, and we believe that validation data assessing accuracy and variability are essential. We plan to further validate the questions of evaluation tools with improved videography and to develop video-image-processing algorithms. We will use these evaluation tools to study performance before and after teaching interventions involving microsurgical laboratory simulation and computer-based surgical simulation to identify the best teaching interventions and improve training for residents and patient care for those having eye surgery at teaching centers.

What Was Known

  • Evaluation tools for assessing cataract surgical technique have been proposed as a method for evaluating components of surgical technique. The accuracy and reliability of such tools have rarely been reported.

What This Paper Adds

  • Masked video review using GRASIS and OSCAR evaluation questions was reliable enough to detect differences in cataract surgical technique between complicated and uncomplicated cases and between surgeons at different levels of training.
  • The difference in reliability between questions of an evaluation tool necessitates the reporting of validation data on each question when evaluation tools are used.


1. Cullen KA, Hall MJ, Golosinskiy A. Ambulatory surgery in the United States, 2006. National Health Statistics Reports No. 11, January 28, 2009–revised September 4, 2009. Hyattsville, MD, U.S. Department of Health and Human Services, Centers for Disease Control and Prevention, National Center for Health Statistics. Available at: Accessed March 22, 2014
2. Randleman JB, Wolfe JD, Woodward M, Lynn MJ, Cherwek DH, Srivastava SK. The resident surgeon phacoemulsification learning curve. Arch Ophthalmol. 125, 2007, p. 1215-1219, Available at: Accessed March 22, 2014.
3. Bhagat N, Nissirios N, Potdevin L, Chung J, Lama P, Zarbin MA, Fechtner R, Guo S, Chu D, Langer P. Complications in resident-performed phacoemulsification cataract surgery at New Jersey Medical School. Br J Ophthalmol. 91, 2007, p. 1315-1317, Available at: Accessed March 22, 2014.
4. Johnston RL, Taylor H, Smith R, Sparrow JM. The Cataract National Dataset electronic multi-centre audit of 55 567 operations: variation in posterior capsule rupture rates between surgeons. Eye. 24, 2010, p. 888-893, Available at: Accessed March 22, 2014.
5. Bell CM, Hatch WV, Cernat G, Urbach DR. Surgeon volumes and selected patient outcomes in cataract surgery: a population-based analysis. Ophthalmology. 2007;114:405-410.
6. Accreditation Council for Graduate Medical Education. Educating Physicians for the 21st Century: Systems-Based Practice. Chicago, IL: ACGME; 2006.
7. Mills RP, Mannis MJ. American Board of Ophthalmology Program Directors’ Task Force on Competencies. Report of the American Board of Ophthalmology Task Force on Competencies [guest editorial]. Ophthalmology. 2004;111:1267-1268.
8. Lee AG, Carter KD. Managing the new mandate in resident education; a blueprint for translating a national mandate into local compliance. Ophthalmology. 2004;111:1807-1812.
9. Henderson BA, Rasha A. Teaching and assessing competence in cataract surgery. Curr Opin Ophthalmol. 2007;18:27-31.
10. Rogers GM, Oetting TA, Lee AG, Grignon C, Greenlee E, Johnson AT, Beaver HA, Carter K. Impact of a structured surgical curriculum on ophthalmic resident cataract surgery complication rates. J Cataract Refract Surg. 2009;35:1956-1960.
11. Smith RJ, McCannel CA, Gordon LK, Hollander DA, Giaconi JA, Stelzner SK, Devgan U, Bartlett J, Mondino BJ. Evaluating teaching methods of cataract surgery: validation of an evaluation tool for assessing surgical technique of capsulorhexis. J Cataract Refract Surg. 2012;38:799-806.
12. Cremers SL, Nereida Lora A, Ferrufino-Ponce ZK. Global Rating Assessment of Skills in Intraocular Surgery (GRASIS). Ophthalmology. 2005;112:1655-1660.
13. Golnik KC, Beaver H, Gauba V, Lee AG, Mayorga E, Palis G, Saleh GM. Cataract surgical skill assessment [letter]. Ophthalmology. 2011;118:427-428.
14. Saleh GM, Gauba V, Mitra A, Litwin AS, Chung AKK, Benjamin L. Objective structured assessment of cataract surgical skill. Arch Ophthalmol. 125, 2007, p. 363-366, Available at: Accessed March 22, 2014.
15. Rootman DB, Lam K, Sit M, Liu E, Dubrowski A, Lam W-C. Psychometric properties of a new tool to assess task-specific and global competency in cataract surgery. Ophthalmic Surg Lasers Imaging. 2012;43:229-234.
16. Feudner EM, Engel C, Neuhann IM, Petermeier K, Bartz-Schmidt K-U, Szurman P. Virtual reality training improves wet-lab performance of capsulorhexis: results of a randomized, controlled study. Graefes Arch Clin Exp Ophthalmol. 2009;247:955-963.
17. Smith P, Tang L, Balntas V, Young K, Athanasiadis Y, Sullivan P, Hussain B, Saleh GM. “PhacoTracking”; an evolving paradigm in ophthalmic surgical training. JAMA Ophthalmol. 2013;131:659-661.


Evaluation Tool

© 2014 by Lippincott Williams & Wilkins, Inc.