

Original Study

Dunning-Kruger Effect Between Self-Peer Ratings of Surgical Performance During a MASCAL Event and Pre-Event Assessed Trauma Procedural Capabilities

Andreatta, Pamela B. PhD, EdD*,†; Patel, Jigarkumar A. MD; Buzzelli, Mark D. MD§; Nelson, Kenneth J. MD; Graybill, John Christopher MD∥,¶; Jensen, Shane D. MD; Remick, Kyle N. MD*; Bowyer, Mark W. MD*; Gurney, Jennifer M. MD¶,#

doi: 10.1097/AS9.0000000000000180


INTRODUCTION

The implementation of self-assessment using performance rating scales is ubiquitous in medical and surgical education as a proxy for direct measurement of learning and performance.1 However, the validity and reliability of self-assessment measures are inadequate for predictive requirements, such as determination of competency, because of persistent range restrictions in instrumentation and the cognitive biases of self-raters, such as the Dunning-Kruger effect.2–4 Cognitive bias refers to the systematic tendency to engage in erroneous forms of thinking and judging.5 The Dunning-Kruger effect describes a consistent phenomenon in which less adept individuals perceive their abilities as significantly greater than measured performance assessment indicates, and highly adept individuals perceive their abilities as significantly less than what is independently measured.3,4,6,7 There are multiple explanations for this effect, but evidence suggests it is largely associated with metacognitive abilities, including the ability to differentiate individual expertise from the full scope of possible domain expertise. Individuals with low ability in a specific domain do not understand the complete scope of possible domain expertise and are unable to process the qualitative difference between their performance and the performance of others because they lack awareness of their relative incompetence. This leads to systematic error in the outcomes of self-assessed performance because less adept individuals see themselves as more skilled than they are and greatly overestimate their competence.

For highly skilled individuals, the Dunning-Kruger effect describes the tendency to underestimate their abilities relative to the abilities of others because their deep expertise facilitates an understanding of the full scope of possibilities and an awareness of what they do not know within the universe of possibilities. The tendency for experts to underestimate their abilities has also been attributed to a form of false-consensus effect, where high-performing individuals view the domain capabilities of others in an overly positive way and under-rate their own performance compared to inflated perceptions of peer performance.8,9 Other explanations for the Dunning-Kruger effect suggest that it is the result of statistical artifacts in combination with other forms of cognitive bias; nonetheless, its consistency challenges the validity and reliability of self-assessment measures, including those of physicians.10–12

Cumulative evidence suggests that compared to self-assessment, supervisors have a more accurate ability to discriminately assess the performance of others, as do peers with expertise in the performance domain.13–23 These findings suggest that peers with comparable expertise are capable of accurately assessing the clinical, communication, and humanistic performance of physician colleagues.23–25 Nonetheless, there are limitations to the validity and reliability of peer performance assessments tied to both instrumentation and other cognitive biases, such as the halo effect and false-consensus effect, that limit their usefulness for predictive analyses and critical applications, such as competency assessment.9,11,21,26,27 The validity and reliability of rating scales are often adversely impacted by range restriction, with most raters selectively restricting the scoring range to the middle or slightly above the middle ratings of the scale (eg, on a scale of 1–5, a mode of 3.5).2,27,28 Although there is evidence that multisource feedback using rating scales is feasible and stimulates beneficial behavior changes in medicine and surgery, these improvements are largely associated with relational competencies such as communication skills, interpersonal behaviors, collegiality, humanism, and professionalism.29 Assessments of procedural competencies are not widely implemented in surgery, even though they are desirable for measuring surgical capabilities associated with operative performance.30 Consequently, the use of self-assessment as a proxy measure of procedural competence remains pervasive, even as the ability of surgeons to accurately self-assess their procedural performance remains undetermined at best, unsupported at worst.

The clinical readiness program aims to assure that all military surgeons maintain procedural competencies for expeditionary contexts, including the management of mass casualty (MASCAL) events.30–34 The purpose of this study was to examine the extent to which individual surgeons’ self-rated procedural performance scores corresponded to objective measures of performance (assessed procedural abilities) and to peer ratings of their procedural performance during a MASCAL event.

METHODS

Study Design

This study was reviewed and approved by the Institutional Review Board at our institution. We implemented a mixed-methods study design that included survey data captured directly from the sample and extant data collected as part of a program designed to assure the clinical readiness of military surgeons for a broad scope of procedural requirements in expeditionary contexts.30,33,34

Sample and Situational Context

The study question necessitated the use of a purposive sample of surgeons who collectively cared for trauma patients during a MASCAL event. Thirteen military surgeons who were among the providers in Kabul, Afghanistan, when a terrorist attack occurred at one of the airport entry gates voluntarily participated in the study. The surgeons were divided among 4 units at a role 2 military treatment facility and, along with their teams, completed triage assessments for over 60 casualties. These medical personnel worked under the highest threat level over the course of the 15-hour event, providing surgical care for critically wounded adult and pediatric patients who suffered multiple, significant traumatic injuries. Post-event analyses of all cases were performed by military surgical leadership as part of routine Joint Trauma System Combat Casualty Care Conferences (JTSCCCC), including identification of what went well and what could be improved. Prior to participating in the study activities, the surgeons took part in these case reviews and associated discussions and were therefore aware of the consensus analysis by the military surgical community about how their respective cases were managed.

Measures and Analyses

Pre-Deployment Assessed Procedural Capabilities

Prior to deployment, the participating surgeons completed procedural performance assessments as part of the Department of Defense clinical readiness program. The program includes assessment of procedural skills associated with 32 critical trauma procedures and is performed through 1-to-1 interaction with 4 different expert trauma surgeons. The essential operative abilities for each procedure are assessed using precise instrumentation that meets the psychometric standards for performance-based assessment, including rigorous assessor training to assure the validity and reliability of procedural performance measures.30,33,34 Performance scores range from 1 (dangerously incompetent) to 100 (independent, efficient, fully accurate performance). The clinical readiness program benchmark score for procedural skills performance is 90/100, which reflects accurate and independent performance with allowance for up to 3 minor performance limitations or inefficiencies.30,33,34 Of the sample, 2 trauma surgeons and 2 orthopedic trauma surgeons served as expert assessors in the program after scoring 100/100 on all procedures. For the other participants, pre-deployment performance data from the assessment point closest to their date of deployment to Afghanistan were incorporated into the analysis. All data were considered current because they were captured within 1 year of the time of deployment and provided information about the baseline procedural competencies of the participating surgeons at the time of the MASCAL event.

Self-Rating Data

Participants rated their personal performance during the MASCAL event using a scale that ranged from 1 (dangerously incompetent) to 100 (efficient, fully accurate performance). All responses were considered confidential and sent to one of the study investigators (P.A.), who was neither a surgeon nor present during the event. All data were de-identified after responses were assembled for analysis.

Peer-Rating Data

Participants rated the performance of their peers during the MASCAL event using a scale that ranged from 1 (dangerously incompetent) to 100 (efficient, fully accurate performance). All responses were considered confidential and sent to the same study investigator (P.A.) as for the self-ratings. All data were de-identified after responses were assembled for analysis.

To control for potential peer-rating biases within the MASCAL team itself, data captured as part of the JTSCCCC review processes were used to compare the team’s peer performance ratings with those of military trauma surgical peers who were not part of the MASCAL team. The case management for each MASCAL patient was used as a proxy for surgeon performance and was captured through detailed case reviews by the JTSCCCC participants. Case reviews included identification and discussion of performance accuracy and efficiency, and any concerns were noted for analysis. JTSCCCC peer ratings were calculated for each case by multiplying the number of peer-identified procedural performance concerns by 3 and subtracting that product from the total possible score of 100 (efficient, fully accurate performance). The multiplier of 3 was derived from the pre-established clinical readiness program benchmark for procedural performance (90/100), which allows no more than 3 performance factors that were either inefficient or partially accurate.30,33,34
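To make this scoring rule concrete, the minimal sketch below (in Python; not part of the original program materials) expresses the calculation as a function. Flooring the result at the scale minimum of 1 is an added assumption for hypothetical cases with very many concerns and is not stated in the article.

```python
def jtscccc_peer_rating(num_concerns: int) -> int:
    """JTSCCCC peer rating for a case: start from 100 (efficient, fully
    accurate performance) and subtract 3 points for each peer-identified
    procedural performance concern. Flooring at 1, the scale minimum, is
    an assumption added here and is not stated in the article."""
    return max(1, 100 - 3 * num_concerns)


# A case with 3 identified concerns scores 91, which still meets the
# clinical readiness benchmark of 90/100; a fourth concern drops the
# rating to 88, below the benchmark.
assert jtscccc_peer_rating(0) == 100
assert jtscccc_peer_rating(3) == 91
assert jtscccc_peer_rating(4) == 88
```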

Statistical Analyses

Correlations were calculated between the MASCAL team peer ratings and the JTSCCCC peer ratings to determine whether there were evident biases within the team peer ratings. We analyzed the differences between self-ratings and peer ratings for each individual surgeon using t tests and effect sizes (Cohen’s d).35 We examined the differences between self-ratings, peer ratings, and pre-deployment procedural assessment scores using F tests. Statistical significance was set at P < 0.05.

RESULTS

The 2 forms of peer review ratings were significantly correlated (P < 0.05), confirming that the peer ratings within the MASCAL team itself were not biased by team allegiances. The descriptive data for the sample are presented in Table 1 for each subject.

TABLE 1. Descriptive Data for the Sample, Including Self-Rating, Mean Peer-Rating, and Assessed Performance Scores
Surgeon ID Self-Rated Performance Peer-Rated Performance Pre-Deployment Assessed Performance
S1 80 95 98
S2 85 98 96
S3 90 95 94
S4 80 98 98
S5 70 98 100
S6 80 98 97
S7 70 99 100
S8 93 98 100
S9 92 98 100
S10 80 98 100
S11 83 95 100
S12 85 95 100
S13 85 95 100
Statistically significant differences were found between self-ratings and peer ratings (P < 0.001) and between self-ratings and pre-deployment assessment scores (P < 0.001). No significant difference between peer ratings and pre-deployment assessment scores.
ID indicates surgeon identifier.

The mean MASCAL procedural performance self-ratings and peer ratings are shown in Figure 1, along with the mean pre-deployment procedural assessment scores for the individuals caring for the MASCAL trauma patients. There was a significant difference between the self-ratings and the peer ratings of MASCAL performance for all individual surgeons (P < 0.001), with a very large effect size of Cohen’s d = 2.77. Likewise, there was a significant difference between the self-ratings of MASCAL performance and the pre-deployment procedural assessment scores for all individual surgeons (P < 0.001), with a very large effect size of Cohen’s d = 2.34. There was no significant difference between the peer ratings of MASCAL performance and the pre-deployment procedural assessment scores for any individual surgeon, suggesting that peer ratings and assessed performance scores were more accurate than self-ratings.
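As an illustration of the comparison described under Statistical Analyses, the sketch below recomputes the self- versus peer-rating contrast from the Table 1 values. The paired t test and the pooled-standard-deviation form of Cohen’s d are assumptions, since the article does not specify the exact variants used, but they yield an effect size close to the reported 2.77.

```python
# Sketch reproducing the self- vs peer-rating contrast from the Table 1
# values. The paired t test across the 13 surgeons and the pooled-
# standard-deviation Cohen's d are assumptions; the article does not
# state which variants were used.
import numpy as np
from scipy import stats

self_rated = np.array([80, 85, 90, 80, 70, 80, 70, 93, 92, 80, 83, 85, 85])
peer_rated = np.array([95, 98, 95, 98, 98, 98, 99, 98, 98, 98, 95, 95, 95])

# Paired comparison: each surgeon has a self-rating and a peer rating.
t_stat, p_value = stats.ttest_rel(peer_rated, self_rated)

# Cohen's d with a pooled standard deviation (assumption).
pooled_sd = np.sqrt((self_rated.var(ddof=1) + peer_rated.var(ddof=1)) / 2)
cohens_d = (peer_rated.mean() - self_rated.mean()) / pooled_sd

print(f"t = {t_stat:.2f}, P = {p_value:.2g}, d = {cohens_d:.2f}")
# Prints a d close to the reported 2.77, with P < 0.001.
```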

FIGURE 1. Comparison of procedural performance scores by data source. MASCAL indicates mass casualty.

DISCUSSION

We examined the extent to which the self-rated procedural performance scores of individual surgeons corresponded to assessed measures of performance (procedural abilities) and peer ratings of procedural performance during a MASCAL event. The outcomes demonstrate that self-rated scores were significantly lower than both peer-rated scores and procedural assessment scores for the participating surgeons. However, there was no significant difference between the peer-rated scores and procedural assessment scores. This confirms that self-ratings of procedural abilities are unreliable when compared with independent methods of assessment.

The surgeons who participated in this study were all highly skilled, as evidenced by their pre-deployment procedural assessment scores. We propose that the study outcomes align with the Dunning-Kruger effect, whereby highly skilled individuals tend to underestimate their abilities relative to the abilities of others. Although the study sample was small, all of the participants rated their personal performance lower than their peers rated their performance, and all participants rated their peers at or above their own performance level. The Dunning-Kruger effect proposes that individuals with deep expertise are able to comprehend the full scope of possibilities associated with a particular performance area and are therefore aware of what could possibly go wrong, as well as what could go well.3,4,6,7 Expert surgeons understand the potential impact of small errors and unknowns that await their patients after they complete their initial procedures. Therefore, they consider their performance within the universe of possibilities, not necessarily within the known; in this case, actual procedural performance in theater. Surgeons caring for trauma patients during a MASCAL event are aware of the challenges ahead for their patients, all of whom will be transported to subsequent levels of care. This understanding may influence their personal perceptions of their performance, even if they perform accurately and efficiently while directly caring for their patients. The surgeons might also have viewed the performance of their peers in an overly positive way compared to their own performance, similar to the false-consensus effect.8,9

These outcomes underscore the challenges of using self-assessment as a proxy for direct measurement of performance.1 In addition to weak psychometric support for measures of self-assessment, clear cognitive biases routinely confound self-assessed outcomes, despite their convenience of use.2–4 For expert military surgeons, under-rating their performance yields an unreliable measure of the medical force’s capability to provide accurate and efficient care. For less expert military surgeons, over-rating their performance is not only an unreliable measure, it provides a false perception of competence that is not supported by evidence and will lead to higher rates of morbidity and mortality for patients. Although assessments of procedural competencies are not widely implemented in medicine and surgery, they are desirable for measuring surgical capabilities associated with operative performance and assuring the clinical readiness of the medical force.30–34 The clinical readiness program is committed to assuring that all clinicians are able to perform accurately and independently in their respective clinical domains. Performance assessment is essential for confirming readiness at all levels of expertise by assuring that experts establish evidence-based performance benchmarks and that those who are developing their abilities know the standards they need to achieve.

Limitations

As with any study of this nature, there are limitations associated with interpreting the outcomes from a small purposive sample, especially when the sample is composed of a highly trained group of surgeons. Notwithstanding the limitations on the generalizability of the outcomes, their consistency across the sample is noteworthy and suggests conceptual congruence. Additionally, the outcomes for self-ratings may be perceived as an artifact of the extreme stress environment in which the surgeons performed. They may have been consciously or subconsciously aware of their own feelings of vulnerability while operating under a high threat level and perceived those feelings as concern about their surgical abilities rather than about their ability to effectively manage their stress responses.

CONCLUSIONS

The outcomes from this study demonstrate that expert surgeons caring for trauma patients during a MASCAL event rated their own performance significantly below that of their peers, and their self-rated performance scores were significantly lower than both the ratings of their performance by their peers and their pre-deployment assessments of procedural ability. Peer ratings and assessed performance measures aligned, demonstrating the value of independent assessment of performance. These outcomes align with assessment literature across multiple domains and underscore the limitations of self-assessment as a valid and reliable measure of competence. Additionally, the outcomes demonstrate that pre-deployment skills assessment provides an accurate assessment of the performance of military surgeons when it counts: in a highly stressful, real-time, life-and-death situation. This is the ultimate objective of pre-deployment skills training.

REFERENCES

1. Nayar SK, Musto L, Baruah G, et al. Self-assessment of surgical skills: a systematic review. J Surg Educ. 2020;77:348–361.
2. Salkind NJ. Encyclopedia of Research Design. Vol. 1. Sage Publications, Inc.; 2010.
3. Kruger J, Dunning D. Unskilled and unaware of it: how difficulties in recognizing one’s own incompetence lead to inflated self-assessments. J Pers Soc Psychol. 1999;77:1121–1134.
4. Mazor M, Fleming SM. The Dunning-Kruger effect revisited. Nat Hum Behav. 2021;5:677–678.
5. Litvak P, Lerner JS. Cognitive bias. In: Sander D, Scherer KR, (eds). The Oxford Companion to Emotion and the Affective Sciences. Oxford University Press; 2009:89–90.
6. Dunning D. The Dunning–Kruger effect: on being ignorant of one’s own ignorance. Adv Exp Soc Psychol. 2011;44:247–296.
7. TenEyck L. Dunning-Kruger effect. In: Raz M, Pouryahya P, (eds). Decision Making in Emergency Medicine: Biases, Errors and Solutions. Springer; 2021:123–128.
8. McIntosh RD, Fowler EA, Lyu T, et al. Wise up: clarifying the role of metacognition in the Dunning-Kruger effect. J Exp Psychol Gen. 2019;148:1882–1897.
9. Ross L, Greene D, House P. The false consensus effect: an egocentric bias in social perception and attribution processes. J Exp Soc Psychol. 1977;13:279–301.
10. Mabe PA, West SG. Validity of self-evaluation of ability: a review and meta-analysis. J Appl Psychol. 1982;67:280–296.
11. Holzbach RL. Rater bias in performance ratings: superior, self, and peer ratings. J Appl Psychol. 1978;63:579–588.
12. Violato C, Lockyer J. Self and peer assessment of pediatricians, psychiatrists and medicine specialists: implications for self-directed learning. Adv Health Sci Educ Theory Pract. 2006;11:235–244.
13. Conway JM, Huffcutt AI. Psychometric properties of multisource performance ratings: a meta-analysis of subordinate, supervisor, peer, and self-ratings. Hum Perform. 1997;10:331–360.
14. Viswesvaran C, Ones DS, Schmidt FL. Comparative analysis of the reliability of job performance ratings. J Appl Psychol. 1996;81:557–574.
15. Atkins PWB, Wood RE, Rutgers PJ. The effects of feedback format on dynamic decision making. Organ Behav Hum Decis Process. 2002;88:587–604.
16. Becker TE, Klimoski RJ. A field study of the relationship between the organizational feedback environment and performance. Pers Psychol. 1989;42:343–358.
17. Beehr TA, Ivanitskaya L, Hansen CP, et al. Evaluation of 360 degree feedback ratings: relationships with each other and with performance and selection predictors. J Organ Behav. 2001;22:775–788.
18. Saavedra R, Kwun SK. Peer evaluation in self-managing work groups. J Appl Psychol. 1993;78:450–462.
19. Shore TH, Shore LM, Thornton GC. Construct validity of self- and peer evaluations of performance dimensions in an assessment center. J Appl Psychol. 1992;77:42–54.
20. Davis JD. Comparison of faculty, peer, self, and nurse assessment of obstetrics and gynecology residents. Obstet Gynecol. 2002;99:647–651.
21. Norcini JJ. Peer assessment of competence. Med Educ. 2003;37:539–543.
22. Van Rosendaal GM, Jennett PA. Comparing peer and faculty evaluations in an internal medicine residency. Acad Med. 1994;69:299–303.
23. Ramsey PG, Wenrich MD, Carline JD, et al. Use of peer ratings to evaluate physician performance. JAMA. 1993;269:1655–1660.
24. Nurudeen SM, Kwakye G, Berry WR, et al. Can 360-degree reviews help surgeons? Evaluation of multisource feedback for surgeons in a multi-institutional quality improvement project. J Am Coll Surg. 2015;221:837–844.
25. Lockyer J. Multisource feedback in the assessment of physician competencies. J Contin Educ Health Prof. 2003;23:4–12.
26. Nisbett RE, Wilson TD. The halo effect: evidence for unconscious alteration of judgments. J Pers Soc Psychol. 1977;35:250–256.
27. Risucci DA, Tortolani AJ, Ward RJ. Ratings of surgical residents by self, supervisors and peers. Surg Gynecol Obstet. 1989;169:519–526.
28. Evans R, Elwyn G, Edwards A. Review of instruments for peer assessment of physicians. BMJ. 2004;328:1240.
29. Al Khalifa K, Al Ansari A, Violato C, et al. Multisource feedback to assess surgical practice: a systematic review. J Surg Educ. 2013;70:475–486.
30. Andreatta P, Bowyer MW, Ritter EM, et al. Evidence-based surgical competency outcomes from the Clinical Readiness Program [published online ahead of print December 7, 2021]. Ann Surg. doi: 10.1097/SLA.0000000000005324.
31. Department of Defense. Department of Defense Instruction. DoDI 6025.19 (IMR). June 9, 2014. Available at: https://www.esd.whs.mil/Portals/54/Documents/DD/issuances/dodi/602519p.pdf. Accessed February 2, 2021.
32. Remick KN, Andreatta PB, Bowyer MW. Sustaining clinical readiness for combat casualty care. Mil Med. 2021;186:152–154.
33. Bowyer MW, Andreatta PB, Armstrong JH, et al. A novel paradigm for surgical skills training and assessment of competency. JAMA Surg. 2021;156:1103–1109.
34. Bradley MJ, Franklin BR, Renninger CH, et al. Upper-extremity vascular exposures for trauma: comparative performance outcomes for general surgeons and orthopedic surgeons [published online ahead of print February 8, 2022]. Mil Med. doi: 10.1093/milmed/usac024.
35. Cohen J. The statistical power of abnormal-social psychological research: a review. J Abnorm Soc Psychol. 1962;65:145–153.
Keywords:

Dunning-Kruger effect; peer-assessment; performance assessment; procedural assessment; self-assessment; self-rating; surgeons; surgical assessment

Copyright © 2022 The Author(s). Published by Wolters Kluwer Health, Inc.