Education in the field of obstetric ultrasound has shifted over the past decade from exclusively observation-based training to simulation-based training.1 Multiple factors may have contributed to the more widespread use of simulation: concerns about diagnostic errors, the limited availability of patient volunteers and teachers, and technological advances based on virtual reality (VR).2,3 Obstetric ultrasound simulators (OUS) may revolutionize the obstetric and gynecologic (OB/GYN) residents' curriculum for both training and evaluation.4
The utility of OUS for ultrasound training is now well supported by evidence.5–8 In a series of 8 Danish studies, Tolsgaard et al9 successively explored the learning curves for novices, examined how to improve the efficiency of training with the use of dyad practice, and explored whether improvements were sustained over time. They also demonstrated skill transfer to subsequent clinical training.10 Finally, they demonstrated benefits for patients after simulation training, such as a decrease in discomfort and improvements in perceived safety and confidence.10
Evaluation of competency in obstetric ultrasound is a time-consuming process and requires pregnant volunteers who are willing to be used for training. One study has shown the potential of OUS as a substitute for evaluating trainees.11 In this study, dexterity and quality of images obtained during evaluation were assessed by 2 independent examiners. However, dexterity was subjectively scored between 0 and 10 based on observation.
A fundamental issue for evaluation with OUS is the need for metrics. Virtual reality–based simulators allow automatic recording of probe trajectories during training and evaluation. An OUS could therefore be a suitable tool for extracting data from these probe trajectories for objective evaluation and for identifying metrics that discriminate between participants' levels of expertise during training. The aim of our study was to assess the potential of OUS to identify objective metrics to measure expertise in ultrasonography.
The study was conducted at the Department of Gynecology and Obstetrics at the University Hospital of Rennes. All training and assessments were carried out in an undisturbed environment. The participants were recruited in June 2018, and the study was conducted from July to September 2018. Approval was obtained from our local institutional review board.
The participants were divided into 3 groups based on their levels of experience: medical students (novices), OB/GYN residents (intermediates), and OB/GYN consultants (experts). All participants were recruited locally at the Department of Gynecology and Obstetrics and provided oral and written informed consent. The experts were professionals who practiced daily on real patients (approximately 25 morphologic scans a week). The intermediates were interns who were familiar with routine examination scans (fetal movements, amniotic fluid, estimation of fetal weight; approximately 5 scans a week) but not with morphologic examination. The novices had no ultrasound experience. None of the participants in the 3 groups had practiced on the simulator before the study, as the simulator was purchased just before the study began.
The medical program at the University of Rennes is a 6-year traditional curriculum, and the gynecology rotations are completed during the 2 final years. The novices were recruited during their gynecology rotations. During their prior medical training, the students had completed courses in pelvic anatomy and ultrasound theory but had no hands-on training. The intermediates included OB/GYN residents who had not graduated in ultrasound but who were familiar with ultrasound equipment. The experts were OB/GYN consultants who used ultrasound on a daily basis.
Virtual Reality Simulator
Training and assessments were performed using a high-fidelity VR-based simulator (Scantrainer; Medaphor, Cardiff, UK) designed for obstetric ultrasound training. It is composed of a monitor, an abdominal probe similar to a real one and docked into a Sigma haptic device, 2 screens, and a computer (Fig. 1). The haptic device provides realistic force feedback when the operator applies pressure on the abdominal probe. One monitor displays B-mode 2D ultrasound pictures provided by the system and obtained from real patients. The second monitor displays an animated illustration of the probe position in a virtual patient.
Step 1: Identification of Tasks
Most fetal malformations are detected during the second trimester scan. This scan examines the fetus in 11 planes as requested by the National Committee for Prenatal Ultrasound.12
Tasks reflecting the standardized second trimester examination of the fetus were selected to obtain a large sample of fetal anatomy screening, in particular tasks that involved switching from transverse to coronal or sagittal planes. We selected the tasks that would potentially reflect differences in ultrasound competence and provided the most realistic view of the fetus in 3 consecutive standardized planes. For task 1 (brain at 24 weeks of gestation), participants were asked to successively obtain the standardized view of the fetal head circumference, then the standardized plane of the cerebellum, and finally a coronal view of the brain through the cavum of septum pellucidum. For task 2 (heart at 22 weeks of gestation), participants were asked to successively obtain the standardized view of the 4 chambers, the view of the left ventricle and the aorta, and the view of the right ventricle and the pulmonary artery. For task 3 (spine at 22 weeks of gestation), participants were asked to successively obtain the standardized frontal view of the kidneys, a sagittal view of the spine, and the parasagittal view of the left diaphragm. Figure 2 shows a pictorial representation of the 3 tasks. Participants judged for themselves when the view was correct and froze the image to end the exercise.
All participants were asked to provide their age, sex, and years of clinical experience and were assigned an identifier to anonymize the data. They then received a short introduction to the simulated setting, including how to operate the simulator and its functions. They also received a course about the recommended fetal planes. All trainees practiced for 10 minutes on a “probe manipulation exercise” to familiarize themselves with the device and to check the difference in expertise between the groups. The score of this exercise was calculated by adding the scores of 6 multiple choice questions (pass: 1; fail: 0) designed to test recognition of 6 different 3-dimensional geometric solids (cone, pyramid, spiral, spheres, other complex geometric solids) by scanning them with the 2D probe. Technical assistance was provided during the simulator test, but no instructions or feedback were given and no time limit was imposed. Each participant had to perform the 3 tasks 3 times.
The data were logged and recorded by custom data acquisition software developed with Scantrainer (Medaphor). The probe position was sampled at 20 Hz. Computation of all metrics and trajectory analysis were run on a dedicated computer (Xeon E5-1650V4 @ 3.60 GHz with 32 GB RAM) using MATLAB R2017b. The analysis followed a previously published technique.13 A total of 6 selected metrics were analyzed:
- Duration (D) corresponds to the execution time from the first movement of the probe by the hand until the probe is released at the end of the task. It is measured in seconds.
- Path length (PL) represents the total distance traveled by the probe during the execution of the task. It is measured in millimeters.
- Average velocity (AV) corresponds to the average linear speed of the probe during the task. It is measured in millimeters per second.
- Average acceleration (AA) corresponds to the average instantaneous acceleration of the probe during the task. It is measured in millimeters per second squared.
- Average jerk (AJ) corresponds to the average jerk (the time derivative of acceleration) during the task, also known as a “smoothness” measure. It is measured in millimeters per second cubed.
- Working volume (WV) represents the volume of the convex hull of each trajectory. The convex hull of a trajectory is the smallest convex volume that contains it. It is measured in cubic millimeters.
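As an illustration of the definitions above, the following Python sketch computes D, PL, AV, AA, and AJ from a list of probe positions. It is not the MATLAB implementation used in the study: the function name `trajectory_metrics`, the use of forward finite differences, and the choice to average the magnitude of the acceleration and jerk vectors are assumptions made for this example. Only the 20 Hz sampling rate and the millimeter units come from the text.

```python
import math


def trajectory_metrics(points, sample_rate_hz=20.0):
    """Compute kinematic metrics from (x, y, z) probe positions in mm.

    Hypothetical helper: assumes a fixed sampling rate and estimates
    derivatives with forward finite differences.
    """
    if len(points) < 4:
        raise ValueError("need at least 4 samples to estimate jerk")
    dt = 1.0 / sample_rate_hz

    def diff(vectors):
        # Forward difference of consecutive 3D vectors, divided by dt.
        return [tuple((b[i] - a[i]) / dt for i in range(3))
                for a, b in zip(vectors, vectors[1:])]

    def mean_norm(vectors):
        # Mean Euclidean norm of a list of 3D vectors.
        return sum(math.sqrt(sum(c * c for c in v)) for v in vectors) / len(vectors)

    velocity = diff(points)        # mm/s
    acceleration = diff(velocity)  # mm/s^2
    jerk = diff(acceleration)      # mm/s^3

    duration = (len(points) - 1) * dt
    path_length = sum(
        math.sqrt(sum((b[i] - a[i]) ** 2 for i in range(3)))
        for a, b in zip(points, points[1:])
    )
    return {
        "D": duration,                 # s
        "PL": path_length,             # mm
        "AV": path_length / duration,  # mm/s
        "AA": mean_norm(acceleration), # mm/s^2
        "AJ": mean_norm(jerk),         # mm/s^3
    }


# A probe moving 1 mm per sample along x for 1 second at 20 Hz.
straight_line = [(float(i), 0.0, 0.0) for i in range(21)]
metrics = trajectory_metrics(straight_line)
```

The working volume is left out of this sketch because it needs a convex hull routine. In Python, one option would be `scipy.spatial.ConvexHull`, whose `volume` attribute gives the hull volume for 3D points.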
One-way analysis of variance tests were conducted to detect differences between the 3 groups of participants. The Bonferroni correction was applied to the results to control α inflation (P < 0.008). The participants were not told which metrics were being measured.
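The statistical approach can be sketched in Python as follows. This is a minimal illustration, not the authors' analysis code: the function names are hypothetical, the sample values are invented, and the number of comparisons passed to the Bonferroni correction (6, one per metric) is an assumption, since the text does not state how many comparisons were corrected for. The sketch computes only the F statistic. Turning it into a P value requires the F distribution, for example through a statistics library.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups: list of lists, one list of measurements per group.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variability of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variability of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Invented example data for 3 groups (not study data).
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
threshold = bonferroni_alpha(0.05, 6)  # assumed: 6 comparisons
```

With these invented values the between-group mean square is 13 and the within-group mean square is 1, so F is 13 on 2 and 6 degrees of freedom, and the corrected threshold is 0.05/6.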
Participant demographics according to expertise level are presented in Table 1.
TABLE 1 -
Demographics and Ultrasound Experience of Participants in the 3 Groups
| |Novices|Intermediates|Experts|
|---|---|---|---|
|Age, mean ± SD|24.8 ± 5.1|28.3 ± 0.7|44.4 ± 9.4|
|Clinical experience, yr| |3.3 ± 0.7|19.8 ± 10.2|
|Manipulation probe test score, 6 questions (1: pass; 0: fail), mean ± SD|3.4 ± 1.2|4.7 ± 1.2|5.6 ± 0.5|
Manipulation Probe Exercise
For the manipulation probe exercise, out of a total of 6 points (6 multiple choice questions, pass: 1; fail: 0), the experts' mean score was 5.6 ± 0.5, the intermediates' mean score was 4.7 ± 1.2, and the novices' mean score was 3.4 ± 1.2 points (P = 0.002; Table 1).
Analysis of the probe trajectory metrics revealed significant differences for tasks 1 to 3 between the expert, intermediate, and novice groups (Table 2).
TABLE 2 -
Six Metrics Analyzed on Probe Trajectory During the 3 Tasks According to Level of Expertise
|Metric|Task|Novices (n = 16)|Intermediates (n = 12)|Experts (n = 5)|
|---|---|---|---|---|
|D, s|1|157.5 ± 176.2|111.2 ± 110.3|35.6 ± 16.3|
|D, s|2|109.9 ± 69.7|103.7 ± 61.1|51.7 ± 47.4|
|D, s|3|112.8 ± 64.7|82.3 ± 50.6|35.5 ± 11.4|
|PL, cm|1|144.2 ± 113.4|179.5 ± 107.1|100.5 ± 40.9|
|PL, cm|2|142.8 ± 50.8|126.6 ± 51.7|102.7 ± 113.8|
|PL, cm|3|151.5 ± 56.9|159.0 ± 58.9|107.1 ± 63.8|
|AV, cm/s|1|1.9 ± 0.5|1.9 ± 1.3|2.9 ± 0.8|
|AV, cm/s|2|1.5 ± 0.6|1.4 ± 0.7|2.0 ± 0.6|
|AV, cm/s|3|1.5 ± 0.5|2.2 ± 0.7|2.9 ± 0.8|
|AA, cm/s²|1|13.7 ± 4.8|13.8 ± 10.5|22.8 ± 5.9|
|AA, cm/s²|2|11.5 ± 6.2|10.1 ± 4.0|14.9 ± 3.7|
|AA, cm/s²|3|10.9 ± 3.7|15.7 ± 5.4|22.5 ± 8.0|
|AJ, cm/s³|1|275.5 ± 135.7|272.8 ± 228.3|452.8 ± 135.8|
|AJ, cm/s³|2|242.6 ± 188.6|201.2 ± 71.6|292.3 ± 76.1|
|AJ, cm/s³|3|215.4 ± 84.5|309.9 ± 130.0|437.8 ± 160.5|
|WV, cm³|1|1 118.6 ± 622.4|1 433.6 ± 716.9|761.1 ± 603.9|
|WV, cm³|2|1 755.8 ± 1 318.4|1 275.1 ± 674.3|825.7 ± 962.9|
|WV, cm³|3|1 618.6 ± 624.1|1 759.2 ± 1 250.3|867.5 ± 717.5|
Duration of the Exercise (s)
The mean D (seconds) of the exercise differed between novices, intermediates, and experts for task 1 (respectively, 157.5 s ± 176.2 vs. 111.2 s ± 110.3 vs. 35.6 s ± 16.3, P < 0.001) and task 3 (112.8 s ± 64.7 vs. 82.3 s ± 50.6 vs. 35.5 s ± 11.4, P < 0.001). For task 2, the difference was not significant (109.9 s ± 69.7 vs. 103.7 s ± 61.1 vs. 51.7 s ± 47.4, P = 0.009).
The PL (millimeters) differed between the 3 groups and was shortest for the experts for task 1 (novices: 1 442.3 ± 1 134.4 vs. intermediates: 1 795.4 ± 1 071.9 vs. experts: 1 005.7 ± 409.7, P < 0.001) and for task 3 (respectively, 1 515.3 ± 569.4 vs. 1 590.8 ± 589.0 vs. 1 071.2 ± 638.3, P = 0.013).
For task 2 (the “heart” exercise), the PL was not statistically different between the 3 groups (1 428.7 ± 508.2 vs. 1 266.5 ± 517.5 vs. 1 027.0 ± 1 138.3, NS).
The AV (millimeters per second) was significantly higher with increasing level of expertise for tasks 1 and 3. For task 1, AV was 19.2 mm/s ± 5.7 for novices versus 19.5 mm/s ± 13.8 for intermediates and 29.5 mm/s ± 8.7 for experts (P < 0.001), and for task 3, it was respectively 15.6 mm/s ± 5.2 versus 22.2 ± 7.5 versus 29.1 ± 8.6 (P < 0.001). For task 2, it was respectively 15.7 mm/s ± 6.4 versus 14.5 ± 7.0 versus 20.1 ± 6.4 (P = 0.024, NS).
The AA was statistically different between the 3 groups for task 1 (137.0 mm.s−2 ± 48.2 vs. 138.6 ± 105.1 vs. 228.9 ± 59.8, P < 0.001), and for task 3 (109.5 mm.s−2 ± 37.0 vs. 157.3 ± 54.6 vs. 225.7 ± 80.6, P < 0.001), but not for task 2 (115.8 mm.s−2 ± 62.8 vs. 101.7 ± 40.1 vs. 149.6 ± 37.5, NS).
Jerk was statistically different between the 3 groups for task 1 (2 755.7 mm.s−3 ± 1 357.0 vs. 2 728.5 ± 2 283.5 vs. 4 528.0 ± 1 358.4, P < 0.001) and for task 3 (2 154.7 mm.s−3 ± 845.5 vs. 3 099.7 ± 1 300.4 vs. 4 378.1 ± 1 605.7, P < 0.001) increasing with level of expertise. For task 2, there was no significant difference (2 426.8 mm.s−3 ± 1 886.4 vs. 2 012.1 ± 716.7 vs. 2 929.7 ± 761.1, NS).
The total WV covered was statistically different between the 3 groups for task 1 (1 118 661.6 mm3 ± 622 487.1 vs. 1 438 677.7 ± 716 948.1 vs. 761 105.7 ± 603 917.3, P = 0.002).
The difference was not significant for task 2 (1 755 821.8 ± 1 318 436.4 vs. 1 275 117.0 ± 674 373.0 vs. 825 782.0 ± 962 957.2, NS) or for task 3 (1 618 668.3 ± 624 129.9 vs. 1 759 287.7 ± 1 250 355.0 vs. 867 504.4 ± 717 940.5, NS).
This article presents for the first time the use of metrics computed on probe trajectory during simulated fetal ultrasound with OUS to objectively assess the expertise of the users and their dexterity. The main result was that objective metrics (D, acceleration, velocity, and jerk) differed statistically according to the level of expertise in 2 of the 3 tasks.
The time for each task (D) decreased as the level of expertise increased ranging from 35 to 52 seconds for experts and from 109 to 157 seconds for novices. The PL, although significantly shorter for experts in the “brain” and “spine” exercise, was not significantly different for the “heart” exercise. This might be explained by the fact that the planes in the heart exercise are much more tightly bunched in a very small volume.
There was probably a task effect. The “heart” exercise was probably less discriminating with trajectory metrics because it involved only small probe movements. The same could be argued for the “brain” exercise, because little movement is needed between the head circumference and the transcerebellar views, but substantial movement is necessary to obtain the coronal view. It should also be noted that some of the initial movement among experts was probably due to the need to translate the positioning and movements recalled from real patients to the model.
Velocity and acceleration were significantly higher with increasing level of expertise, suggesting that experienced sonographers move from one plane to another with faster movements. Jerk, which is the derivative of acceleration, may be interpreted as the variation of acceleration, or how sudden the variations are. One might have expected jerk to be lower and the gesture smoother for the experts, as has been shown for surgical procedures.14 However, during ultrasound examination, jerk was higher for experts because they most often combined a quick translation with a quick 90-degree rotation of the probe to switch, for example, from a transverse view of the kidney to a sagittal view of the spine. For the “heart” exercise, AJ did not differ between the 3 groups, which may again be explained by the small amount of movement between the different views. The WV was smaller for the experts than for the intermediates and novices, implying that the experts remained within the area of interest. These results are similar to those observed by Zago et al15 for the FAST (Focused Assessment with Sonography in Trauma) examination. In that study, hand motion analysis was also used to discriminate expertise, with similar results: longer hand path and higher WV for the novices.15 In our study, intermediates performed more like novices than experts on many measures, possibly because they were not comfortable with morphologic examination.
These data are all the more important in France where, for the first time next year, the practical examination for students will be by OUS rather than on volunteers. Other developed countries are following the same trend, ie, relying more on assessment by OUS. Objective metrics are thus required to respond to this move toward automatic and objective evaluation. Recording the trajectories of the probe and comparing them with trajectories obtained with experts are interesting ways of evaluating students' level. In a clinical diagnostic perspective, trajectory metrics, taken separately, may not be relevant to measure performance. However, these metrics are a first step. We tested some metrics, and not all of them will have clinical meaning, but they do have a kinematic meaning. Some are obvious (D) and some are not but may be discriminant (jerk, volume). It is part of the method to go from the identification of differences to explanation of the differences observed.
Madsen et al6 also analyzed simulator-generated metrics on a high-fidelity transvaginal ultrasound simulator. They evaluated a group of 16 ultrasound novices along with a group of 12 OB/GYN consultants. The score was calculated by adding the scores of the 7 modules (0, fail; 1, pass) for each participant. Of the 153 metrics, 48 reliably discriminated between levels of competence and demonstrated evidence of construct validity. However, in that study, simulator-generated metrics were dichotomous, marked either pass or fail, whereas the present study used continuous variables.
Few other studies have analyzed simulator-generated metrics, and none have analyzed trajectories. Furthermore, most publications on simulator-generated metrics focus on laparoscopy training.16,17 Jones et al18 conducted a study extracting data from a laparoscopic simulator. They suggested a relationship between the training level of the surgeon and the forces imparted on the tissue during a laparoscopic simulation.18 In another study about laparoscopic training, Rivard et al16 selected 36 individual metrics on 4 tasks, including speed, motion PL, respect for tissue, accuracy, task-specific errors, and successful task completion. Time and motion PL were significantly different for all 4 tasks, and the other metrics differed for some of the tasks. They then used the validated metrics to create summary equations for each task, which successfully distinguished between the different experience levels.16 Another study about laparoscopic procedures explored the correlation between PL or smoothness and outcome measures such as accuracy error, knot slippage, leakage, tissue damage, and operating time.19 In that study, no correlation was found between the metrics and surgery outcomes, except for operative time. Finally, in a study by Sánchez-Margallo et al,20 suturing performance was successfully assessed by motion analysis. They demonstrated construct validity for the execution time and PL.20
To integrate a simulator into a training and assessment program, it is necessary to demonstrate the face, content, and construct validity of that simulator. Construct validity means the ability to discriminate between different levels of expertise. In contrast, face and content validity refer, more subjectively, to how convincing or realistic the simulator is according to experts.21 It is interesting to note how objective measures are used to inform construct validity. A study by van Dongen et al,22 which aimed to demonstrate construct validity for a laparoscopic VR simulator, used clinical experience as the definition of expertise. Simulator metrics were tested in 16 novices, 16 residents, and 16 experts to establish the construct validity of a laparoscopic simulator.22 Performance of the various tasks on the simulator appeared to correspond to the respective level of laparoscopic clinical experience.
In a study by Ramos et al,23 participants completed 3 VR exercises using the Da Vinci Skills Simulator, as well as corresponding dry laboratory versions of each exercise. Simulator performance was assessed by metrics measured on the simulator. Dry laboratory performance was blindly video evaluated by expert review using the 6-metric GEARS tool. This study is interesting because their definition of expertise was based on an exercise and not just the years of clinical experience. In addition, participants were asked to complete a questionnaire to evaluate face and content validity.23 In another study by Kenney et al,24 construct validity of a robotic surgery VR trainer was assessed. The performance was recorded using a built-in scoring algorithm including total task time, total instrument motion, and number of instrument collisions. Experienced robotic surgeons outperformed novices in nearly all variables. Again, each subject completed a questionnaire after finishing the modules to assess face and content validity. All experienced surgeons ranked the simulator as useful for training and agreed with incorporating the simulator into a residency curriculum.
Strengths and Limitations
One limitation of our study is linked to the definition of expertise. The 3 expertise levels were defined by clinical experience, not by the quality of the scans or images that the participants actually produced. However, clinical experience levels were clearly defined (experts who had everyday practice of fetal ultrasound, residents who were much less experienced, and novices who were medical students with no experience at all). Moreover, to address the potential bias in how an expert is defined, every participant completed a probe manipulation exercise, which confirmed the levels of the 3 groups. Another limitation is that we did not assess face and content validity, because we focused on objective metrics.
Implications of Interpretation
One challenge when teaching ultrasound is to explain how to obtain the view of the fetus and how to switch from one plane to another, especially to sagittal views. Further exploration is required to approach the optimal trajectory of the probe. This could help to teach the technique better, optimize scan duration, and assess the dynamic quality of the exploration. Future work should also assess skill transfer to clinical practice and trajectories on actual patients.
This study shows that objective trajectory metrics differ according to level of expertise in 2 OUS tasks. The connected OUS interface between the operator's hand and the patient provides numerical data that can help better understand and assess skill acquisition. It is the responsibility of the clinicians to let the developers know what data they are interested in that would make the simulators suited for training.
The authors would like to thank all participants in this study, Felicity Neilson for English editing, and Stephen Thucker for technical support on the simulation system.
1. Cook DA. How much evidence does it take? A cumulative meta-analysis of outcomes of simulation-based education. Med Educ
2. Scalese RJ, Obeso VT, Issenberg SB. Simulation technology for skills training and competency assessment in medical education. J Gen Intern Med
3. Scott DJ, Dunnington GL. The new ACS/APDS skills curriculum: moving the learning curve out of the operating room. J Gastrointest Surg
4. Tolsgaard MG. A multiple-perspective approach for the assessment and learning of ultrasound skills. Perspect Med Educ
5. Chalouhi GE, Quibel T, Lamourdedieu C, et al. Obstetrical ultrasound simulator as a tool for improving teaching strategies for beginners: pilot study and review of the literature [in French]. J Gynecol Obstet Biol Reprod (Paris)
6. Madsen ME, Konge L, Nørgaard LN, et al. Assessment of performance measures and learning curves for use of a virtual-reality ultrasound simulator in transvaginal ultrasound examination. Ultrasound Obstet Gynecol
7. Tolsgaard MG, Rasmussen MB, Tappert C, et al. Which factors are associated with trainees' confidence in performing obstetric and gynecological ultrasound examinations? Ultrasound Obstet Gynecol
8. Tolsgaard MG, Chalouhi GE. Use of simulators for the assessment of trainees' competence: trendy toys or valuable instruments? Ultrasound Obstet Gynecol
9. Tolsgaard MG, Ringsted C, Dreisler E, et al. Sustained effect of simulation-based ultrasound training on clinical performance: a randomized trial. Ultrasound Obstet Gynecol
10. Tolsgaard MG, Ringsted C, Rosthøj S, et al. The effects of simulation-based transvaginal ultrasound training on quality and efficiency of care: a multicenter single-blind randomized trial. Ann Surg
11. Chalouhi GE, Bernardi V, Gueneuc A, et al. Evaluation of trainees' ability to perform obstetrical ultrasound using simulation: challenges and opportunities. Am J Obstet Gynecol
12. cneof 2016 - Recherche Google. Available at: https://www.google.com/search?client=firefox-b&q=cneof+2016. Accessed January 7, 2019.
13. Despinoy F, Zemiti N, Forestier G, et al. Evaluation of contactless human-machine interface for robotic surgical training. Int J Comput Assist Radiol Surg
14. Ghasemloonia A, Maddahi Y, Zareinia K, et al. Surgical skill assessment using motion quality and smoothness. J Surg Educ
15. Zago M, Sforza C, Mariani D, et al. Educational impact of hand motion analysis in the evaluation of FAST examination skills. Eur J Trauma Emerg Surg 2019. Epub ahead of print 15 March 2019. DOI: 10.1007/s00068-019-01112-6.
16. Rivard JD, Vergis AS, Unger BJ, et al. Construct validity of individual and summary performance metrics associated with a computer-based laparoscopic simulator. Surg Endosc
17. Shanmugan S, Leblanc F, Senagore AJ, et al. Virtual reality simulator training for laparoscopic colectomy: what metrics have construct validity? Dis Colon Rectum
18. Jones D, Jaffer A, Nodeh AA, et al. Analysis of mechanical forces used during laparoscopic training procedures. J Endourol
19. Cesanek P, Uchal M, Uranues S, et al. Do hybrid simulator-generated metrics correlate with content-valid outcome measures? Surg Endosc
20. Sánchez-Margallo JA, Sánchez-Margallo FM, Oropesa I, et al. Objective assessment based on motion-related metrics and technical performance in laparoscopic suturing. Int J Comput Assist Radiol Surg
21. Koch AD, Buzink SN, Heemskerk J, et al. Expert and construct validity of the Simbionix GI Mentor II endoscopy simulator for colonoscopy. Surg Endosc
22. van Dongen KW, Tournoij E, van der Zee DC, et al. Construct validity of the LapSim: can the LapSim virtual reality simulator distinguish between novices and experts? Surg Endosc
23. Ramos P, Montez J, Tripp A, et al. Face, content, construct and concurrent validity of dry laboratory exercises for robotic training using a global assessment tool. BJU Int
24. Kenney PA, Wszolek MF, Gould JJ, et al. Face, content, and construct validity of dV-trainer, a novel virtual reality simulator for robotic surgery. Urology