Surgical Sabermetrics

Objective: To evaluate the current evidence for surgical sabermetrics: digital methods of assessing surgical nontechnical skills and investigate the implications for enhancing surgical performance. Background: Surgeons need high-quality, objective, and timely feedback to optimize performance and patient safety. Digital tools to assess nontechnical skills have the potential to reduce human bias and aid scalability. However, we do not fully understand which of the myriad of digital metrics of performance assessment have efficacy for surgeons. Methods: A systematic review was conducted by searching PubMed, EMBASE, CINAHL, and PSYCINFO databases following PRISMA-ScR guidelines. MeSH terms and keywords included “Assessment,” “Surgeons,” and “Technology”. Eligible studies included a digital assessment of nontechnical skills for surgeons, residents, and/or medical students within an operative context. Results: From 19,229 articles screened, 81 articles met the inclusion criteria. The studies varied in surgical specialties, settings, and outcome measurements. A total of 122 distinct objective, digital metrics were utilized. Studies digitally measured at least 1 category of surgical nontechnical skill using a single (n=54) or multiple objective measures (n=27). The majority of studies utilized simulation (n=48) over live operative settings (n=32). Surgical Sabermetrics has been demonstrated to be beneficial in measuring cognitive load (n=57), situation awareness (n=24), communication (n=3), teamwork (n=13), and leadership (n=2). No studies measured intraoperative decision-making. Conclusions: The literature detailing the intersection between surgical data science and operative nontechnical skills is diverse and growing rapidly. Surgical Sabermetrics may provide a promising modifiable technique to achieve desirable outcomes for both the surgeon and the patient. 
This study identifies a diverse array of measurements possible with sensor devices and highlights research gaps, including the need for objective assessment of decision-making. Future studies may advance the integration of physiological sensors to provide a holistic assessment of surgical performance.

This review introduces Cognitive Load (CogL) as an additional nontechnical construct. CogL is a concept that underpins expert performance in surgery. Defined as the amount of finite working memory resources an individual must allocate to meet the cognitive demands of a task, 9-11 CogL is a multidimensional concept 12,13 and excessively high levels may have negative consequences for individual learning, team performance, and patient safety. 6,14 Despite being different concepts, there is overlap in the literature between CogL and stress, 15 the feeling of strain or threat, 16,17 and the same methods have been used to measure these different cognitive states. 18,19 Increased stress and CogL have both been shown to have a negative effect on surgeons' nontechnical skills. 20
Objective performance metrics can be broadly classified as either physiological (eg, cardiovascular, respiratory, dermatological, neurological, optical, or energy expenditure) or nonphysiological (eg, movement and acoustic analysis). Changes in CogL can be detected in an individual's physiology due to activation of the autonomic nervous system (ANS). 21 Sensor-based objective metrics can measure the ANS directly and are used as surrogate measures of CogL or proxy assessments of nontechnical skills. 7,22 Examples of physiological indicators include electrodermal activity (EDA), electroencephalography (EEG), and heart rate variability (HRV). 23,24
Previous reports on nontechnical skills in surgery have been conducted in the context of human assessments of behavior, 25 surgical education, 25 human factors considerations, 26 and cognitive load measurement. 26 The rise of technology presents a remarkable opportunity to explore the application of measuring nontechnical skills in a surgical context, yet the extent of its utilization remains unclear. The aim of this scoping review is to evaluate the current technological advances in measuring surgeons' nontechnical skills using objective metrics. The present study also evaluates the various biomarkers used to measure nontechnical skills, and the specific surgical contexts in which surgical sabermetrics tools have been implemented. Furthermore, we intend to discuss the interventions included in these studies aimed at improving surgical performance.

METHODS
A scoping review was conducted in August 2022, following the Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) checklist as guidance 27 (Supplemental Digital Content, http://links.lww.com/SLA/E999). The Population Intervention Context framework 28 was utilized to develop the search strategy with the expertise of a professional medical librarian. This review is registered by submission to BMJ Open. 29 Systematic searches of PubMed, Ovid MEDLINE, Embase, PsycINFO, IEEE Xplore, Web of Science, and ACM Digital Library databases were performed from inception to August 2022, using Covidence software (Veritas Health Innovation) to collate manuscripts. MeSH terms and keywords included, but were not limited to, "Assessment," "Surgeons," and "Technology," in addition to terms relating to nontechnical skills. A full example of search terms including MeSH terms and keywords is included in Supplemental Digital Content (Appendix A, Supplemental Digital Content 1, http://links.lww.com/SLA/E999). Reference lists were not cross-searched.

Selection Criteria and Screening Process
We aimed to include all original research studies published since 2010 to capture technology advances over the past decade. Included studies involved the digital measurement of nontechnical skills, following the NOTSS taxonomy and including CogL, in a surgical context. The target population included surgeons from all surgical specialties and training levels (trainees, residents, and fellows), in addition to attending (consultant) surgeons. Studies utilizing real-life and simulated surgical environments were included. We excluded studies where full text was not available, that were not written in English, that did not measure nontechnical skills, that targeted the incorrect population, or that did not report surgeon data, along with the following article types: review articles, conference abstracts, and letters to the editor. Following explicit inclusion and exclusion criteria, abstracts were screened by 3 reviewers (E.E.H., O.A., and E.G.M.G.) to identify articles for full-text review. The same authors conducted a full-text review of included papers against the inclusion/exclusion criteria. Conflicts that arose during either the title/abstract screen or the full-text review were resolved by a fourth author (S.J.Y.). The PRISMA flow diagram detailing the number of articles screened and included is shown in Figure 1.

Data Extraction
Data extraction was conducted by 3 authors (E.E.H./O.A./E.G.M.G.) after calibration with a fourth author (S.J.Y.) on extraction criteria. During this process, the fourth author was consulted on study ambiguities or study data that did not clearly fit the predetermined extraction criteria. Data extraction was cross-checked by 2 authors (E.E.H./O.A.). Detailed information was extracted from each study under several categories including study design, participant cohort, and nontechnical skill assessment. Extracted data were captured using a dedicated spreadsheet designed for this review.

Data Synthesis and Quality Assessment
Data were analyzed using quantitative and descriptive statistics. A qualitative narrative synthesis was conducted to identify themes demonstrating how sabermetrics has been applied in the surgical context, presented using visualizations and flow diagrams. These include tables focusing on the scope and aims of included studies; quality of methods; participant analysis; contextualization; specific nontechnical skills assessed; metrics used; and study conclusions. Quality assessment of individual papers was undertaken using the Quality Assessment Tool for Studies with Diverse Designs (QATSDD). 30 The QATSDD score was calculated by 2 authors (E.E.H./O.A.) based on the application of standardized and validated criteria.

RESULTS
The results of the database searches are outlined in a PRISMA diagram (Fig. 1). The overall summary of the 81 included studies is available in Supplemental

Objective Metrics
In this comprehensive review of 81 studies, a diverse array of measurements was implemented, with a total of 122 distinct objective, digital metrics utilized across 16 separate categories. The automated, objective measurements of nontechnical skills (NOTSS) in this study were grouped into 2 categories: physiological and nonphysiological metrics (Figs. 2, 3). Surgeon physiological measurements (n = 115) using noninvasive sensors were the preferred measurement, but technology such as acoustic analysis (n = 3) 32-34 and movement analysis (n = 4) 33,35-37 was also employed. Cardiovascular measurements were the most common physiological metrics employed (n = 46), which include HRV, heart rate (HR), and blood pressure. HRV was the most frequently used measurement across all included studies (n = 25). The next most common categories were neurological (n = 24) with EEG (n = 18) and functional near-infrared spectroscopy (fNIRS) (n = 6), then optical (n = 21). Figure 2 compares the number of studies using a physiological metric (x-axis) with the number of study participants on whom that metric was used (y-axis), where the sample size was known. This figure shows that HRV is also the most common metric in terms of participants (n = 333), but HR (n = 373), eye-tracking (n = 262), and EEG (n = 256) have also been deployed frequently.
Twenty-seven studies (34%) combined 2 or more objective metrics. HRV was the most frequently selected metric to use in combination (n = 12). Curiously, fNIRS was always used with another metric, and typically with a subjective measurement (n = 5). The less frequently utilized metrics also tended to be used alongside others. For example, electromyography (n = 3) was only used in combination with other metrics. Although 39% (n = 31) of studies involved live operating, metrics were used disproportionately less often in live surgery than in simulation. For example, eye metrics were used in only 4 (5%) live-setting studies compared with 21 studies overall. Similarly, EEG was used only 5 times in live surgery versus 18 overall. The simulation studies demonstrate a clear need for these metrics, which must now be translated into clinical practice.

Subjective Metrics
Forty-one studies combined objective and subjective, nondigital measurements. Three studies concurrently utilized human raters and the NOTSS taxonomy alongside a digital measurement. Thirty-two studies utilized subjective CogL measurements with tools including the Surgical Task Load Index (SURG-TLX) (n = 8) 38 and NASA Task Load Index (NASA-TLX) (n = 23). 31,38 One study used a behavior marking system by Seelandt and colleagues to assess distractions. 32,39 Eight studies used the subjective State-Trait Anxiety Inventory (STAI) to assess perceived stress. Objective measures such as EDA were significantly related to these subjective, self-rated measures. 43,44

Outcomes
Improving outcomes and reducing error are common aims of sabermetrics studies. Outcomes measured alongside surgical applications in included studies can be broadly classified as those that evaluate (i) surgical performance and patient safety, 44 or (ii) surgeon well-being 45 :
(i) Surgical performance and patient safety: The use of sabermetrics may reassure patients that surgeons are working in optimal conditions and that a new technology or technique does not place unnecessary additional CogL on the operating surgeon that may negatively affect performance.
(ii) Surgeon well-being: Physiological measurements can also be used to support well-being. HRV measurements have been used to identify daily stressors and periods of increased CogL associated with physiological strain. 51 Issues such as fatigue, burnout, anxiety, and depression have been associated with EEG and EDA changes. 43,52,53 Poor performance or being involved in an error contributing to patient harm can cause moral injury, 54 and continual monitoring of performance through sabermetrics could reduce this risk.

Practical Application of Sabermetrics Implementation
A breakdown of the practical application of sabermetrics implementation in included studies is shown in Table 1. The practical applications can be grouped into 13 overarching categories. The most common application was measuring task demands (n = 24), for example, a surgeon adjusting to the demands of the surgical task and environment or increasing case difficulty. Sabermetrics were also used to examine performance modulators such as noise, 32,68,69 or the effect of an intervention, such as intraoperative breaks or a training technique. 41,53,55,69,70

Quality of Studies
The quality of studies was highly variable according to the Quality Assessment Tool for Studies with Diverse Designs (QATSDD). QATSDD scores ranged from 25% to 81% (median = 57%, interquartile range = 17%, where Q1 = 48%, Q3 = 64%) with a higher score indicating higher quality. The median quality score for studies that utilized subjective and objective measures was 57% (IQR: 50%-64%). Fewer than half of the included studies (n = 37) examined the reliability and validity of the measurements selected. In addition, the median score given for the assessment of reliability and validity was 0 (out of 3).
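The summary statistics reported above (median, quartiles, and interquartile range) can be reproduced for any list of percentage scores with a few lines of Python. The sketch below is illustrative only: the function name `summarize_scores` and the nine scores are hypothetical and are not the review's data, and the "inclusive" quartile method is an assumption, since the review does not state which method was used.

```python
import statistics


def summarize_scores(scores):
    """Return the median, quartiles, and interquartile range of percentage scores."""
    # "inclusive" treats the data as the whole population of interest;
    # other quartile conventions can give slightly different cut points.
    q1, median, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {"median": median, "q1": q1, "q3": q3, "iqr": q3 - q1}


# Hypothetical QATSDD percentages for nine studies (not taken from the review).
hypothetical_scores = [25, 40, 48, 50, 57, 60, 64, 70, 81]
summary = summarize_scores(hypothetical_scores)
print(summary)
```

Because quartiles are usually rounded for reporting, a published IQR need not equal the difference between the rounded Q3 and Q1 exactly.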

Overview
The literature demonstrates that surgical sabermetrics has been assessed in both real operative and simulated surgical settings, with measurements from a range of physiological sensors, combined with observations, video, and surveys to add context. The included studies highlight the importance of CogL in the assessment of surgical performance, with clear evidence of its influence on surgeons' behavioral and cognitive skills across applications. There was substantial heterogeneity among the 81 studies that met the inclusion criteria, especially around study designs and measured outcomes. The current sabermetric monitoring and interventions that have been applied in surgery lack standardization; however, the included studies provide evidence of feasibility and direction on research gaps in this rapidly evolving field.

Metrics
Cardiovascular metrics were the most utilized objective measurement and were likely selected due to their evidence base and relative ease of measurement via consumer devices. 84 Studies utilizing eye-tracking metrics opted for wearable glasses-style or head-mounted devices. Eye-tracking bar-type devices are also available, which have been used in nonsurgical settings but not within surgery to our knowledge. 85 It is possible to attach these devices to a laparoscopic monitor, but this would not be suitable for robotic or open surgery. In addition, neurological measurements such as EEG and fNIRS often require participants to wear a full "cap," or smaller headband-like EEG device. Specifically, fNIRS gives an indication of energy use for cognitive tasks by measuring blood oxygenation levels in areas of the brain, typically the prefrontal cortex, but has only been used thus far in combination with cardiovascular metrics. 18,86 Dermatological measures such as EDA are markers of ANS activity measured through subtle changes to electrical activity within the skin as detected by electrodes. 87 Energy expenditure methods involve using temperature devices such as thermal cameras, or estimated calorie expenditure through HR devices. Assessment of movement within the operating room (OR) involved sensors or machine-learning techniques with video recordings, utilizing commercial recording equipment and open-source algorithms, demonstrating that low-fidelity methods of assessment are possible. 37

Cognitive Load
Real-time measurement, modulation, and optimization of CogL can enhance surgical performance. Measured through ANS changes or via the metabolic demands of carrying out a task, as evidenced by the metrics of the included studies, CogL is a proxy for NOTSS measurements and has a variety of wider implications and applications. Measuring CogL indicates the current cognitive effort being expended and suggests the residual capacity available. 49 Increasing CogL can impair individual and team technical and nontechnical performance, putting patient safety at risk. None of the reviewed studies identified specific absolute levels of "overload." Only 1 study attempted to quantify optimal levels of CogL, using the index of pupillary activity, obtained through eye-tracking with an index of cognitive activity. 47 Conversely, performing a task at a lower CogL increases the chance of proficiency and allows the residual capacity and mental effort to be redistributed to deal with other tasks or saved for periods of increased task demands. 80 There is an association between increased CogL and muscle activity, leading to muscle fatigue and subsequently to pain and discomfort. 96,97 Monitoring and optimizing CogL may aid in the reduction of pain and discomfort, which is essential because surgeons are at risk of work-related musculoskeletal injury that can impact individual well-being and the overall workforce. 98,99 Further applications of CogL are detailed within the outcomes and applications of sabermetrics sections of this review.

Nontechnical Skills
The NOTSS taxonomy 51,56,88 was used to classify objective metrics according to the main categories of surgical nontechnical skill in Supplemental Table 2, Supplemental Digital Content 1, http://links.lww.com/SLA/E999. Neurological measurements, such as EEG and fNIRS, were used to provide a direct indication of situation awareness (SA) through measuring attention and engagement. 52,89,91 Objective measurements support the hypothesis that acute stress can impair SA and surgical performance through the mechanism of increasing CogL. 24,56,90,100,101 Increased surgeon reactivity to intraoperative stressors, demonstrated through CogL measurement, was found to indicate a loss of SA. 102 High CogL decreases SA by affecting attention, increasing reaction time, and negatively impacting recognition skills. 24,103,104 Measuring overall team CogL through individual team member measurements can show psychophysiological mirroring and dynamic CogL changes occurring during unexpected or expected, task-specific events and provide evidence of effective teamwork. 7,24,64,105,106 Social proximity is a predictor of behavior as it influences the opportunities available for effective teamwork and communication through nonverbal interactions and is suggestive of a shared mental model. 33,37 Studies have shown that computer vision techniques and proximity sensors for movement analysis can demonstrate an association between movement and teams with "poor" and "good" SA in the OR. Proximity sensors demonstrated the movement and closeness of the team around each other, with "good SA" associated with restricted movements. The time a team member spent close to the primary operating surgeon was a predictor of their NOTSS score. 33,37 Although no studies assessed decision-making directly, it is a complex and dynamic cognitive process that is intertwined with other nontechnical skills. 2,107 The effects of SA, stress, and increased CogL may be detrimental to the human memory system, resulting in decision fatigue, tunnel-vision, and premature decision-making. 2,108,109

Applications of Surgical Sabermetrics
Sabermetrics methods are beneficial as they reduce reliance on human observers, which increases the speed, volume, objectivity, and value of assessments. Digital measurements also reduce the risk of biased judgments often associated with subjective assessments. Automated assessments also have the benefit of discreetly measuring performance in real time for analysis and interpretation at a later date without interrupting the target procedure or surgical workflow.
A number of applications emerged from the literature, covering topics as diverse as surgical training, systems design, modality, and feedback. These have implications for the wider surgical team, including trainees and patients. Training applications include the optimization of CogL for surgical residents and assessment of the efficacy of training interventions. For example, Wu et al found that cognitive and behavioral metrics from EEG and eye metrics, combined with machine learning, can predict training outcomes with 72.5% accuracy. 110 In one example, Maimon et al (2022) used CogL levels via EEG in simulation to assess if a trainee was "ready" to operate on a live patient. 117 This application offers immense utility in allowing for a tangible demonstration of surgical skill progression, as training practices increasingly move toward the use of simulation to supplement training. 118 Sabermetric measures have also explored the role of the trainers' presence on trainee performance, and provide indications of trainer-trainee trust. 63,119 Objective measures have the potential to contribute to formative and summative assessment, especially within a personalized education lens, for example, the longitudinal use of metrics to measure changes relative to previous measurements. 112,116,120 Measuring task demands was the most prevalent application. Surgical task characteristics can impose a high level of demand, driven by factors such as case difficulty, the need for precision, multitasking, and the use of adjuncts such as virtual reality. 58,103,123 Objective measures can quantify the effect of these demands and determine if a new surgical tool, technique, or modality results in a substantial increase in CogL. Although there are currently no set criteria for cognitive overload in surgery, sabermetrics can be used to monitor CogL and optimize success by identifying contexts and trends that negatively impact operative performance.

Concurrent Assessment of Nontechnical Skills Using Objective and Subjective Metrics
The current gold standard tools for assessing nontechnical skills and cognitive load in surgery are subjective and observational measures. The use of these validated, nondigital tools is widespread within surgery. Dominant examples include ratings provided by a trained observer using the NOTSS tool, and self-reported questionnaires in the form of the NASA-TLX. Although the aim of the present scoping review was to evaluate technological advances in measuring surgeons' nontechnical skills using objective metrics, we anticipated that many included studies would also incorporate subjective and observational measurements. However, the majority of studies retrieved did not concurrently use subjective and observational methods alongside digital tools. When utilized, these measures were used for a variety of reasons. For example, Dias et al 37 used human NOTSS assessments to assign operative teams to "low" and "high" SA, and then investigated digital movement metrics, comparing these groups, finding that teams with higher rated SA had lower entropy and therefore less movement during the surgical time out. Cha et al 33 also gathered human NOTSS assessments for the purpose of correlating nontechnical skills scores with the speech and proximity metrics of participants. A number of human observation tools, including the Observational Teamwork Assessment for Surgery (OTAS), Situation Awareness Global Assessment Technique (SAGAT), and NOTSS have also been used as outcome measures, to investigate the impact of operative stress on performance and teamwork. 31,42,69,104,124 Workload measures, such as NASA-TLX and the surgical variant called SURG-TLX, were used to provide essential subjective data and support the evidence for objective measures. Subjective, self-report measures may provide context to objective metrics, and enhance our understanding of surgeons' personal interpretation of operative performance. It is crucial therefore to combine objective and subjective measures to provide a complete picture of operative events. 125
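To make the link between entropy and movement concrete, the sketch below computes the Shannon entropy of the floor-grid cells occupied by a tracked person. This is an illustrative assumption, not the method used by Dias et al: the function name `occupancy_entropy`, the 0.5 m cell size, and the sample coordinates are all hypothetical. A person who stays in one place occupies a single cell and scores 0 bits, while a person who spreads time evenly across many cells scores higher.

```python
import math
from collections import Counter


def occupancy_entropy(positions, cell_size=0.5):
    """Shannon entropy (in bits) of the grid cells visited by a tracked point.

    positions: iterable of (x, y) coordinates in meters (hypothetical tracking output).
    cell_size: edge length of each square grid cell in meters (assumed value).
    """
    cells = Counter(
        (math.floor(x / cell_size), math.floor(y / cell_size)) for x, y in positions
    )
    total = sum(cells.values())
    # Sum p * log2(1 / p) over the occupied cells.
    return sum((n / total) * math.log2(total / n) for n in cells.values())


# Hypothetical samples: one person standing still, one moving between four cells.
still = [(1.0, 1.0)] * 8
moving = [(0.1, 0.1), (0.6, 0.1), (0.1, 0.6), (0.6, 0.6)] * 2

print(occupancy_entropy(still))
print(occupancy_entropy(moving))
```

The choice of cell size changes the absolute values, so entropy figures computed this way are only comparable between recordings that share the same grid.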

Digital Measurement of Technical Skills
Good technical surgical skills are reliant on competence in both cognitive and psychomotor skills. 52 Although the focus of this review was on nontechnical skills, some studies concurrently investigated technical skills and overall performance, through standard, human-rater-dependent measures such as Objective Structured Assessment of Technical Skill (OSATS) or the Generic Error Rating Tool (GERT). 23,69 According to the inclusion criteria, studies had to measure some aspects of nontechnical skills (SA, decision-making, communication, and leadership) to be included, so studies that utilized sensors or digital technology to solely measure technical skills were excluded. However, the lines between technical and nontechnical surgical skills are blurred and it is possible that some of the assessments in the included studies are also reflective of the technical skills being performed. For example, economy of motion is generally accepted to be reflective of surgical technique, gained through experience, and described variously as "fluidity" and "efficiency" by surgeons. It features in the dominant surgery assessment tools, including OSATS, and is the basis of thousands of assessments in the surgical literature. However, effective economy of motion, fluidity, and efficiency are critically dependent on SA, a core nontechnical skill that is associated with higher-order thinking and the ability to gather, understand, and predict future states in dynamic situations such as operative surgery. In addition, a hallmark of expertise is the ability to leverage automaticity in surgical practice; what may appear effortless, smooth, and precise hand motion is a function of CogL management. Experienced surgeons are able to "free up" cognitive resources at the moment in what is labeled System 2 thinking. 125
This scoping review reveals emerging evidence supporting the objective measurement of CogL in the operative setting. Notably, while cognitive load is a prerequisite for technical proficiency, its influence remains distinct and independent in nature. Several studies assessed the impact of stress and CogL on technical performance as an outcome. For example, Grantcharov et al 23 found that high cognitive load and acute stress levels measured via HRV negatively impacted technical performance. Similarly, eye-tracking can be used to measure attention and other cognitive processes. Although none of the studies in the present review used eye-tracking to measure technical skills, gaze patterns vary between novice and expert surgeons, 126 and therefore can be used as a marker of technical skill acquisition. However, such studies fall outside the scope of this review.

Limitations
Scoping reviews are ideally suited to capturing the breadth of novel topics, but are limited in the depth of evidence they can cover. 86 Several limitations must be considered when interpreting the results of this review and future utilization of surgical sabermetrics. While the theoretical basis and use of these metrics have been established in non-health care industries, 127 the studies included in the present review lacked assessment of the practical usage of these sensors in live surgery. Simulated environments were more commonly utilized to evaluate the reliability and feasibility of devices. 36,44,72,128 To protect patient safety, newly introduced technological devices must not interfere with the physical OR environment or cause interference with patient monitors or surgical equipment. 74,129 Furthermore, it is essential to evaluate the wearability and comfort of sensors according to end users (surgeons) to deploy them successfully during surgery. 36 Wearable technology can be intrusive and unsuitable for long procedures, 130 and may interfere with concurrent equipment such as headlamps and loupes. 74,131 In the present review, eye-tracking glasses were reported not to cause discomfort, impair the visual field, or affect surgical performance. 132 However, no assessment was made regarding likely interaction with loupes, headlamps, or eye protection, and most studies excluded participants who required vision correction, such as glasses, limiting the scalability of eye-tracking devices. Physiological measurements can be influenced by the OR environment, such as room temperature, and individual differences between surgeons. 127,133,134 Physiological measures are also affected by the individual's overall state, including physical activity levels, stress, sleep, digestion, caffeine intake, circadian rhythms, and medical conditions. 44,135 These confounding factors were not well-controlled across studies. Furthermore, many studies only assessed male surgeons, or excluded females to control for the potential effect of stress response during menstruation, 136 which is not representative of the actual surgical landscape.
The objective metrics identified in the present review are directly measured continuous variables, characterized as proxy measures of the performance variables of interest (eg, cognitive load, SA, attention, stress, communication, and leadership). Interpretation of the surgical context is required to derive meaning, and ground truth is required to determine the validity. For example, eye-tracking metrics such as gaze pattern and fixation rate provide an indication of the psychological variable "attention," but the attention may be misplaced. Specific, narrow gaze patterns can be variously interpreted as (i) deliberate focus during a surgical task, with effortful deprioritization of distractors, or (ii) "tunnel-vision," a potentially dangerous state of cognitive bias reflecting loss of global SA. Alternatively, gazing around the OR, and directing attention from the surgical field to the patient monitor or anesthesiologist may be characterized as maintaining good awareness of the overall situation in the OR, or a sign of distraction and lack of focus. An integrative review combining objective metrics and audio-visual recordings offers a comprehensive perspective by providing contextual information and addressing these variations in interpretation.
In addition, measurements such as EDA can indicate levels of task engagement but cannot distinguish whether the engagement was appropriate and task-relevant or inappropriate and at risk of degrading SA and subsequent performance. 88 Acoustic analysis can measure communication metrics relating to noise, including speech, 33,34 with Cha et al 33 noting a correlation between NOTSS scores and pitch, suggesting a link between speech metrics and leadership and teamwork behaviors. However, acoustic analysis cannot assess the content of speech. Noise peaks are associated with an increase in conversation that may be positively attributed to increased communication between team members. 33 However, that conversation may be task-irrelevant, impacting teamwork and increasing CogL. 32 Despite these illustrated limitations in interpretation, metrics such as HRV and EDA may offer unique insights as adjuncts to current performance metrics.

Recommendations for Future Research
The heterogeneity of the study designs, data analysis, and outcomes means that a meta-analysis would be premature at this stage. The emerging evidence base and rapid development of sensor technology also make it difficult to draw firm guidance on practice changes for surgeons or to recommend specific metrics to capture during live surgery. In line with previous reviews in surgery and other safety-critical industries, we did not find a consensus to recommend a single, objective physiological measurement of CogL. 127,137 However, the present study advances knowledge beyond prior reviews by synthesizing the literature regarding automated, digital, objective measures for the assessment of nontechnical skills in surgery. 18,26,138 On the basis of this synthesis, we make several recommendations regarding the future research agenda in surgical sabermetrics:
(1) The use of a concurrent, validated subjective tool such as NOTSS to provide further evidence of the validity of objective metrics.
(2) The concurrent use of audio-visual recording technology to provide context for data interpretation, rather than relying on human observation. This would enrich the data and aid further research into metrics with conflicting interpretations. For example, acoustic analysis can detect noise peaks as a sign of increased team communication, and combining audio-visual recording with acoustic analysis may provide insight into the content of this communication.
(3) In addition, combining the metrics of CogL with other objective measures of nontechnical skills, such as acoustic analysis for communication, may aid in the assessment of communication content and its relevance during cognitively challenging tasks.
(4) The current heterogeneity of study designs limits the ability to directly compare studies or make recommendations; therefore, further studies with consistent reporting, measurement, and interpretation of data are required. For example, although several studies utilize HRV, it can be assessed in the time domain or the frequency domain, and therefore not all HRV studies are directly comparable.
(5) Studies should include information on device usability, including comfort, to assess the impact on the wearer while operating.
(6) Investigation into the acceptable range for optimal CogL, as there is currently no established upper or lower limit.
The studies in this review demonstrate the wide range of benefits of these measurements and the applicability of sabermetrics within the OR.
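To illustrate why time-domain and frequency-domain HRV results are not interchangeable, the following minimal Python sketch computes two common time-domain indices (SDNN and RMSSD) and one frequency-domain index (the LF/HF power ratio) from the same series of RR intervals. It is an illustration only and is not drawn from any included study. The band limits (LF 0.04 to 0.15 Hz, HF 0.15 to 0.40 Hz), the 4 Hz resampling rate, the unnormalized DFT power estimate, and the helper `synthetic_rr` with its parameters are all assumptions chosen for the example.

```python
import cmath
import math
import statistics


def sdnn(rr_ms):
    """Time domain: sample standard deviation of RR intervals (ms)."""
    return statistics.stdev(rr_ms)


def rmssd(rr_ms):
    """Time domain: root mean square of successive RR differences (ms)."""
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def resample(rr_ms, fs=4.0):
    """Linearly interpolate the RR series onto an even time grid (fs in Hz)."""
    times, t = [], 0.0
    for rr in rr_ms:
        t += rr / 1000.0          # beat time in seconds
        times.append(t)
    n = int((times[-1] - times[0]) * fs) + 1
    out, j = [], 0
    for i in range(n):
        tg = times[0] + i / fs
        # advance to the pair of beats that brackets the grid time
        while j < len(times) - 2 and times[j + 1] < tg:
            j += 1
        w = (tg - times[j]) / (times[j + 1] - times[j])
        out.append(rr_ms[j] + w * (rr_ms[j + 1] - rr_ms[j]))
    return out


def band_power(x, fs, lo, hi):
    """Unnormalized DFT power of the mean-removed signal in [lo, hi) Hz."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    power = 0.0
    for k in range(1, n // 2 + 1):
        if lo <= k * fs / n < hi:
            coeff = sum(v * cmath.exp(-2j * math.pi * k * i / n)
                        for i, v in enumerate(centered))
            power += abs(coeff) ** 2
    return power


def lf_hf_ratio(rr_ms, fs=4.0):
    """Frequency domain: LF/HF ratio using the assumed band limits."""
    x = resample(rr_ms, fs)
    return band_power(x, fs, 0.04, 0.15) / band_power(x, fs, 0.15, 0.40)


def synthetic_rr(freq_hz, n_beats=300, base_ms=800.0, amp_ms=50.0):
    """Hypothetical test signal: RR intervals oscillating at freq_hz."""
    rr, t = [], 0.0
    for _ in range(n_beats):
        interval = base_ms + amp_ms * math.sin(2 * math.pi * freq_hz * t)
        rr.append(interval)
        t += interval / 1000.0
    return rr


# Two synthetic recordings with the same oscillation amplitude but
# different oscillation frequencies (one in the LF band, one in the HF band).
for label, freq in (("0.10 Hz", 0.10), ("0.25 Hz", 0.25)):
    rr = synthetic_rr(freq)
    print(label, "SDNN:", round(sdnn(rr), 1), "RMSSD:", round(rmssd(rr), 1),
          "LF/HF:", round(lf_hf_ratio(rr), 3))
```

In this sketch the two synthetic recordings are built with identical oscillation amplitude, yet their LF/HF ratios fall on opposite sides of 1 because the oscillation sits in a different frequency band. A study reporting only SDNN and a study reporting only LF/HF are therefore describing different properties of the same signal, which is one reason consistent reporting is needed before results can be pooled.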

CONCLUSIONS
Surgical sabermetrics is a novel and innovative area focusing on optimizing surgical performance to benefit both surgical teams and patients. The reach of sabermetrics is evident through the 81 studies in this scoping review, with clear merit in enhancing surgical well-being, performance, patient safety, and training. Although we were unable to draw a firm conclusion on which metric is "best," there are still clear clinical benefits. This review shows that objective assessment of CogL is an established area of interest for surgeons; however, more research is required to investigate the objective assessment of nontechnical skills. The importance of nontechnical skills in surgery is well established; however, the measurement of CogL as an indicator of mental processes has yet to be integrated into surgical practice. Although CogL is a concept rather than a skill in itself, the ability to effectively manage and optimize CogL is a skill that can lead to enhanced surgical performance and well-being. 139 Further areas of research in this evolving field should investigate objective NOTSS assessment in the real OR and assess factors affecting uptake and usability in the OR with real usability data. The integration of current subjective assessments such as NOTSS with objective measures of performance, audiovisual operative recordings, postoperative debriefing, and cognitive task analyses will provide rich performance assessment to improve surgical care. 107,140 The studies included in this review demonstrate the importance of assessing cognitive load, 139 and the feasibility of measuring several nontechnical skills indirectly using biomarkers of individual and team performance.

FIGURE 3. Summary of objective and subjective assessment tools.

Table, Supplemental Digital Content 1, http://links.lww.com/SLA/E999. The details of the included studies (eg, participants, study design, aims, and outcomes) were significantly heterogeneous. The proportionate inter-rater reliability for full-text review was 0.94.

TABLE 1. Outline of the Practical Applications of Sabermetrics Implementation in Included Studies
BP indicates blood pressure; METs, metabolic equivalents; RR, respiratory rate.