
In-Training Clinical Performance Scores Explaining American Board of Anesthesiology Certification: A Step Further

de Oliveira Filho, Getúlio R. MD, PhD

doi: 10.1213/ANE.0000000000001306
Editorials: Editorial

From the Department of Surgery, Federal University of Santa Catarina, Florianópolis, Brazil.

Accepted for publication February 29, 2016.

Funding: None.

The author declares no conflicts of interest.

Reprints will not be available from the author.

Address correspondence to Getúlio R. de Oliveira Filho, MD, PhD, Department of Surgery, Federal University of Santa Catarina, Rua Luiz Delfino 111/902 Florianópolis, SC 88015360, Brazil. Address e-mail to

Certification by the American Board of Anesthesiology (ABA) was established under the assumption that physicians who fulfill the examination requirements can be considered better anesthesiologists than those who fail the examination.1 To test the validity of this assumption, Slogoff et al.1 conducted a simple and elegant study involving candidates for ABA certification during the years 1991 (written examination) and 1992 (oral examination). To assess how faculty closely involved in resident training perceived each resident’s clinical performance, they posed a simple question to the program director, or a designated faculty member, of each of the 154 training programs in the United States at that time. The question, asked about each resident entering the certification process for the first time in 1991, was: “For each of the operation below indicate whether you would permit this anesthesiologist to provide anesthesia care for you.” The following procedures, of increasing complexity, were listed: elective cholecystectomy, laparotomy for acute bowel obstruction, and sitting posterior fossa craniotomy. Possible responses were “yes,” “maybe,” or “no.” Competence was assumed to increase as the number of procedures receiving “yes” responses approached 3. In addition, they asked program directors to identify, from a list, the characteristics they considered deficient (less than average) in each resident. The list comprised 26 attributes considered relevant by ABA members, grouped under 5 major categories: knowledge, character, response to stress, clinical performance, and work habits. The main outcome measure was the success rate in the 1992 ABA certification process.
The authors found that 74.6% of residents considered apt to provide personal anesthesia care for their program directors on all 3 procedures passed the certification process; the corresponding success rates for residents considered apt to provide anesthesia care for just the first and the second procedures or for just the first procedure were 53.8% and 44.9%, respectively. The pass rate for those considered inadequate to provide personal anesthesia care for their program directors was 49%. The intergroup differences were statistically significant. Furthermore, deficiencies in several personal characteristics were significantly associated with fewer than 3 “yes” responses in the clinical skills rating: knowledge traits (basic science/general medical knowledge, anesthesia cognitive knowledge, knowledge utilization, and application); character (motivation and response to criticism); response to stress (ability to handle stress); clinical performance (clinical judgment and manual dexterity); and work habits (industriousness, reliability, and responsibility; multiple r = 0.73; r2 = 0.53). This study provided the first evidence that board certification in anesthesiology might be considered a sensitive marker of superior performance in an anesthesiologist.

In this issue of Anesthesia & Analgesia, Baker et al.2 report the correlations between the clinical performance scores (Zrel) of residents in their last year of residency and their scores on the ABA written (ZPart 1) and oral (ZPart 2) examinations.

The clinical performance score encompasses the 6 Accreditation Council for Graduate Medical Education core competencies: patient care, medical knowledge, practice-based learning and improvement, systems-based practice, interpersonal and communication skills, and professionalism. The score metric is on a “relative-to-peers” scale. Zrel is measured in units of SD of the respective evaluator’s average scores to minimize leniency/severity biases.2 As a z-score, it has a mean of 0 and an SD of 1.
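The idea of scoring each resident relative to an evaluator's own habits can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `relative_scores`, the evaluator labels, and the raw ratings are hypothetical, and the exact normalization used by Baker et al. may differ from the plain per-evaluator z-score shown here.

```python
from statistics import mean, pstdev

def relative_scores(ratings_by_evaluator):
    """Average, per resident, the z-scores computed against each evaluator's own mean and SD."""
    z_by_resident = {}
    for evaluator, ratings in ratings_by_evaluator.items():
        values = list(ratings.values())
        mu, sd = mean(values), pstdev(values)
        if sd == 0:
            continue  # an evaluator who rates everyone identically carries no relative information
        for resident, raw in ratings.items():
            z_by_resident.setdefault(resident, []).append((raw - mu) / sd)
    return {resident: mean(zs) for resident, zs in z_by_resident.items()}

# Hypothetical raw ratings: one lenient and one severe evaluator rank the residents identically.
ratings = {
    "lenient_evaluator": {"A": 9, "B": 8, "C": 7},
    "severe_evaluator": {"A": 5, "B": 4, "C": 3},
}
z_rel = relative_scores(ratings)
```

In this toy example the two evaluators differ by 4 raw points on every resident, yet each resident receives the same relative score from both, which is the sense in which leniency and severity are neutralized.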

For the study, each resident was ascribed a clinical performance score that corresponded to the mean of all scores attributed to the individual during his or her last year in training. The median number of evaluations per resident was 87 (minimum = 36; maximum = 142)!

Scores from ABA examinations were standardized to the national average. Statistics were straightforward and consisted of calculation of Pearson product-moment r coefficients for bivariate correlations and multiple R for a multiple regression model, estimating the predictive ability of the clinical performance score and the ABA written examination score on the ABA oral examination score. Determination coefficients r2 estimated the amount of variance explained by regressions.

The main result of the study was that the mean clinical performance score obtained during the last year of residency correlated significantly with first-time ABA written (r = 0.27; P < 0.01) and oral (r = 0.33; P < 0.01) standardized scores. Accordingly, Zrel accounted for (explained) 7% of the total variance of ZPart 1 and 11% of the total variance of ZPart 2. In addition, multiple regression with Zrel and ZPart 1 as independent variables and ZPart 2 as the dependent variable showed that both predictor variables were independently associated with ZPart 2. The percentages of variance of ZPart 2 explained by each predictor were 4.5% (Zrel) and 20.8% (ZPart 1), for a total explained variance of 25.3%.
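The percentages quoted above follow directly from squaring the reported correlation coefficients and from adding the two reported contributions. A short check, using only the figures given in the text (the variable names are illustrative):

```python
# Bivariate correlations reported for Zrel against each examination score.
correlations = {"written (Part 1)": 0.27, "oral (Part 2)": 0.33}

# Percentage of variance explained is 100 * r squared.
explained = {exam: round(100 * r ** 2, 1) for exam, r in correlations.items()}

# Reported contributions of each predictor to the variance of the oral score.
contributions = {"Zrel": 4.5, "ZPart 1": 20.8}
total_explained = round(sum(contributions.values()), 1)
```

Squaring gives 7.3% and 10.9%, which the text rounds to 7% and 11%; the two contributions sum to 25.3%.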

The percentage of variation explained by a regression model (r2) is the ratio between the regression (explained) sum of squares and the total sum of squares; equivalently, it is 1 minus the ratio between the sum of squared residual errors and the total sum of squares. It is a measure of goodness of fit of a regression model. As r2 increases toward unity, the difference between predicted and observed values of y, given x, decreases toward 0.
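A minimal sketch may help make the sums of squares concrete. It fits a simple least-squares line and computes r2 both as the regression share of the total sum of squares and as 1 minus the residual share; the function name `r_squared` and the five data points are invented for illustration and have nothing to do with the study data.

```python
from statistics import mean

def r_squared(x, y):
    """Return r2 computed two ways for a simple least-squares line fitted to (x, y)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    predicted = [intercept + slope * xi for xi in x]

    ss_total = sum((yi - y_bar) ** 2 for yi in y)                      # total variation of y
    ss_regression = sum((pi - y_bar) ** 2 for pi in predicted)         # variation captured by the line
    ss_residual = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))  # variation left unexplained
    return ss_regression / ss_total, 1 - ss_residual / ss_total

# Invented data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
by_regression, by_residual = r_squared(x, y)
```

Both expressions give 0.64 for these points, so the line accounts for 64% of the variation in y and leaves 36% in the residuals.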

In social sciences, the amounts of variance explained by regression models are typically small. For example, in studies addressing the association of intention and behavior, the percentages of variance explained by regression models vary between 19% and 38%.3 Similarly, faculty evaluations of anesthesia residents have shown only moderate correlation with resident performance at in-training mock oral practice examinations (r = 0.41; r2 = 17%).4

In 1985, Abelson,5 at Yale, posed a very simple question: What percentage of variance in athletic outcomes can be ascribed to players’ skills, as measured by past performance indexes? Using historical data on baseball league players’ percentages of successful hits, he estimated the percentage of variance in the event of getting or not getting a hit, when at bat, that is explained by skill. Contrary to common intuition, the variance explained by players’ batting skills was estimated as 0.31%! That is it: one third of 1%!
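The order of magnitude is easy to reproduce. A single at bat is a hit-or-miss event, so its total variance is p(1 - p), where p is the overall probability of a hit; the part attributable to skill is the variance of players' true batting averages around p. The two input values below are illustrative assumptions chosen for this sketch, not figures given in this editorial.

```python
# Illustrative assumptions (not taken from this editorial):
mean_batting_average = 0.270  # assumed overall probability of a hit in one at bat
sd_true_skill = 0.025         # assumed SD of players' true batting averages

# Total variance of a single hit-or-miss outcome.
total_variance = mean_batting_average * (1 - mean_batting_average)

# Variance attributable to differences in skill between players.
skill_variance = sd_true_skill ** 2

share_explained = skill_variance / total_variance
percent_explained = 100 * share_explained
```

Under these assumptions, skill explains roughly 0.3% of the variance of a single at bat, the same order as the figure quoted above, even though differences of this size separate ordinary from outstanding hitters over a season.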

Abelson explained his findings on the basis of the cumulative nature of the measures of performance. First, an individual player’s success is measured over the long run, across a season or a lifetime. Second, teams composed of above-average players tend to play more frequently, which increases the amount of data on players’ performances over time. Even for a player who accumulates a large number of successful hits, each at bat is unique, and its outcome depends on many factors: the wind direction, the bat itself, the mental and physical status of the player at that particular time, and so on.

Correlation coefficients are simply a mathematical way to express the degree of covariance between variables. The term “significant” in the context of correlations refers to the fact that the coefficient is statistically different from 0 at a given level of confidence.6
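One common way to test whether a Pearson coefficient differs from 0 is the t statistic with n - 2 degrees of freedom. The sketch below applies it to the two correlations discussed in this editorial; the sample size is a hypothetical placeholder, because the number of residents is not stated here, and the function name is illustrative.

```python
from math import sqrt

def correlation_t_statistic(r, n):
    """t statistic for the null hypothesis of zero correlation, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

n_hypothetical = 100  # hypothetical sample size, not taken from the study

t_written = correlation_t_statistic(0.27, n_hypothetical)
t_oral = correlation_t_statistic(0.33, n_hypothetical)
```

With a sample of this assumed size, both statistics exceed the value of about 2 usually required at the two-sided 0.05 level, which shows how a modest correlation can be statistically different from 0 while explaining little variance.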

Regression can be used to explain and/or to predict. Prediction and explanation are distinct concepts. Explanation aims at identifying deterministic associations between a set of variables. Prediction means estimating the value of a given variable taking into consideration its associations with independent factors. If a correlation coefficient is significantly different from 0, it is a potential candidate for explanatory purposes. Conversely, high correlation coefficients and percentages of explained variance are necessary for the accuracy of predictions.3

Abelson concluded that relatively small percentages of explained variance do not invalidate the explanatory role of correlations between independent and outcome variables, provided the independent predictors are cumulative in nature, or in other words, collected repeatedly over a long period of time.5

In a manner analogous to baseball players, residents at the authors’ institution were subjected to multiple in-training evaluations. In fact, aggregating behavioral measures across multiple situations and occasions increases the reliability of measures by canceling out incidental, uncontrollable factors that act on the individual in each particular measurement occasion.7 Similar to the case of baseball players, low to moderate correlations were found between clinical performance scores and scores on ABA examinations. As a consequence, only small percentages of ABA score variation were explained by predictions based on clinical competence.
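The gain in reliability from aggregation can be illustrated with the Spearman-Brown formula, a standard psychometric result that is brought in here for illustration and is not necessarily the method used by Baker et al. It gives the reliability of the mean of k parallel measurements from the reliability of a single one. The single-evaluation reliability of 0.10 is a hypothetical value; 87 is the median number of evaluations per resident reported above.

```python
def aggregated_reliability(single_reliability, k):
    """Spearman-Brown: reliability of the mean of k parallel measurements."""
    return k * single_reliability / (1 + (k - 1) * single_reliability)

single_evaluation_reliability = 0.10  # hypothetical reliability of one evaluation
median_evaluations = 87               # median number of evaluations per resident

reliability_of_mean = aggregated_reliability(single_evaluation_reliability, median_evaluations)
```

Under this assumption, a measure that is very noisy on any single occasion reaches a reliability of about 0.91 once 87 evaluations are averaged, which is the sense in which aggregation cancels out incidental factors.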

Intuitively, one would expect that highly clinically competent residents would certainly succeed at any high-stakes examination addressing clinical competence. That is, clinical competence would be the main, if not the only, determinant of success. ABA examinations are complex, multifaceted, and highly reliable assessments. During ABA written examinations, candidates are scored on 225 high-quality items; the number of questions asked during the oral examination is also high, rather varied in nature, and scored by 4 independent raters. Given their complexity, ABA examinations cannot be compared with a single batting event. However, like a single at bat, ABA examinations are isolated measurement occasions, which are expected to be susceptible to the influence of extraneous factors, such as the emotional and physical status of the examinee, the perceptions about the environment of the examination, and so on. These factors may disrupt the expected relationship between clinical competence and performance on the examination and illustrate the extreme difficulty social researchers have in forecasting human behaviors in unique situations based on past behaviors, intentions, attributes, or any other kind of imperfect measure.3

In spite of the abovementioned limitations of educational research, Baker et al.’s study provides sound evidence that clinical performance objectively rated during the last year of residency significantly relates to the size of the scores on ABA examinations.

Finally, a single dimension characterized the “clinical performance” construct assessed by the evaluation instrument used in Baker et al.’s study. This finding suggests that evaluators perceived clinical performance as a single concept and did not discern among the multiple Accreditation Council for Graduate Medical Education competencies the instrument was supposed to assess. The same single-dimension construct was elicited by the question posed by Slogoff et al. in their seminal study. By being asked about the perceived aptitude of each resident to provide anesthesia care for themselves, faculty were exposed to a single construct of clinical competence, which was further demonstrated to have been based on knowledge traits, motivation, clinical judgment, manual dexterity, industriousness, reliability, responsibility, response to criticism, and ability to handle stress.1 Baker et al. and Slogoff et al. used different approaches to estimating clinical competence. However, both studies produced evidence supporting the unidimensionality of the construct of clinical competence when measured by practicing anesthesiologists. More studies may still be necessary to confirm such findings.

In conclusion, Baker et al. merit congratulations for digging deeper into the hard soil of educational measurement so brilliantly. The evidence they share in their article is an important step toward increasing our knowledge on the subject and stimulating further research.



Name: Getúlio R. de Oliveira Filho, MD, PhD.

Contribution: This author wrote the manuscript.

Attestation: Getúlio R. de Oliveira Filho approved the final version of the manuscript.

This manuscript was handled by: Franklin Dexter, MD, PhD.



1. Slogoff S, Hughes FP, Hug CC Jr, Longnecker DE, Saidman LJ. A demonstration of validity for certification by the American Board of Anesthesiology. Acad Med 1994;69:740–6.
2. Baker K, Sun H, Harmann A, Poon KT, Rathmell JP. Clinical performance scores are independently associated with the American Board of Anesthesiology Certification Examination Scores. Anesth Analg 2016;122:1992–9.
3. Sutton S. Predicting and explaining intentions and behavior: how well are we doing? J Appl Soc Psychol 1998;28:1317–38.
4. Schubert A, Tetzlaff JE, Tan M, Ryckman JV, Mascha E. Consistency, inter-rater reliability, and validity of 441 consecutive mock oral examinations in anesthesiology: implications for use as a tool for assessment of residents. Anesthesiology 1999;91:288–98.
5. Abelson RP. A variance explanation paradox: when a little is a lot. Psychol Bull 1985;97:129–33.
6. McMillan JH, Schumacher S. Research in Education. A Conceptual Introduction. 5th ed. New York: Longman, 2001.
7. Epstein S. Stability of behavior. II. Implications for psychological research. Am Psychol 1980;35:790–806.
© 2016 International Anesthesia Research Society