Margolis, Melissa J.; Clauser, Brian E.; Cuddy, Monica M.; Ciccone, Andrea; Mee, Janet; Harik, Polina; Hawkins, Richard E.
Background: Multivariate generalizability analysis was used to investigate the performance of a commonly used clinical evaluation tool.
Method: Practicing physicians were trained to use the mini-Clinical Evaluation Exercise (mini-CEX) rating form to rate performances from the United States Medical Licensing Examination Step 2 Clinical Skills examination.
Results: Differences in rater stringency made the greatest contribution to measurement error; more raters rating each examinee, even on fewer occasions, could enhance score stability. Substantial correlated error across the competencies suggests that decisions about one scale unduly influence those on others.
Conclusions: Given the appearance of a halo effect across competencies, score interpretations that assume assessment of distinct dimensions of clinical performance should be made with caution. If the intention is to produce a single composite score by combining results across competencies, the presence of these effects may be less critical.
The mini-Clinical Evaluation Exercise (mini-CEX) is a rating tool developed at the American Board of Internal Medicine for use in evaluating the clinical skills of residents.1 The instrument is designed to be used by supervising clinicians, and it allows for efficient rating of seven competencies: medical interviewing, physical examination, humanistic qualities/professionalism, clinical judgment, organization/efficiency, counseling, and overall clinical competence.
Although the mini-CEX has been the focus of a number of research projects,1–4 a nontrivial limitation of this work has been the structure of the data collection. In a typical study, patient encounters are nested in (i.e., confounded with) raters. This limitation has prevented examination of whether the number of raters or patient interactions has a greater impact on the reproducibility of results. Research in other evaluation formats suggests that the number of encounters is the more critical factor, and Norcini et al. suggest that it likely is the more important feature of mini-CEX ratings as well.1 This is a particularly important issue, because with the mini-CEX the difficulty of the assessment should vary with the patients that are encountered.2
One aspect of the mini-CEX that has been viewed as important is the potential for efficient assessment across a wide range of examinee proficiencies.4 There has, however, been limited research to examine the relationships between the independent competencies within examinees. Norcini and colleagues reported relatively high observed correlations between the competencies, but what is lacking is evidence that they validly assess distinguishable clinical proficiencies.2 The presence of significant halo effects (evidenced by high correlations in rater-related error) would call into question the extent to which raters discriminated between the competencies, and this in turn would call into question the validity of interpretations that assume that: (1) the competencies are separate and distinguishable; and (2) the instrument assesses a wide range of proficiencies.
The purpose of the present research therefore was to evaluate the mini-CEX using a highly structured data collection design that would support a more detailed analytic examination of the scores. Use of a multivariate generalizability analysis framework expands the potential to assess the performance of the rating instrument by providing information not only about the relationships between the competencies but also about the relationships between the various sources of error in the resulting scores. This analytic procedure specifically provides a means of assessing the extent to which raters treat the competencies within the mini-CEX as representing distinct and distinguishable aspects of examinee performance.
Three meetings were held at the offices of the National Board of Medical Examiners during the spring of 2005. At each meeting, a different group of eight experts was trained to rate performances from the United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills (CS) Examination using the mini-CEX rating form. All data used for the present analyses resulted from the three meetings.
The content experts were recruited from around the country. The majority specialized in internal medicine or family medicine (although other specialty areas were represented); all had prior experience with the mini-CEX, and all were involved in medical education. At each meeting, the experts first were introduced to the task with an orientation to and description of the project. An iterative process then was implemented to train the experts in the use of the rating instrument. They viewed a video of examinee performance, read the patient note associated with that performance, and individually rated the examinee on all competencies. The group then reviewed each competency to: (1) discuss their individual ratings; and (2) identify specific observed behaviors that contributed to the rating decisions. This resulted in a list of guidelines for each of the mini-CEX competencies that all raters could use to make subsequent rating decisions. After all competencies had been reviewed for a given performance, the process was repeated with a new performance. Guidelines were refined until the raters were satisfied with their content. At each meeting, the training session lasted approximately three hours.
Following the training, raters rated a final set of encounters. The encounters came from eight independent testing sessions; from each session, the performances of 10 examinees on an independent (nonoverlapping) set of six cases were selected. (USMLE examinees are given an option to decline use of their score data for research purposes; only those who did not decline this use of their data were included in this research.) The testing sessions were drawn from all five USMLE Step 2 CS test sites, and the case sets were selected to provide a range of case presentations approximating the test specifications for the full examination. Each rater at each meeting reviewed a total of 60 performances: 10 examinees on six different cases. The performances were presented in random order with the stipulation that no two performances from the same examinee were ever presented consecutively. Within each meeting, each rater viewed a different (nonoverlapping) sample of examinees on a different (nonoverlapping) set of cases. The same case sets were reviewed at all meetings, but the order of performances was randomized so that raters who viewed the same sets of performances viewed them in a different order. A total sample of 80 examinees and 48 cases was represented in this data set, and the process yielded 480 ratings for each of the mini-CEX competencies and a total of 3,360 ratings.
Multivariate generalizability analysis, like the univariate procedure, uses an analysis of variance framework to assess the contribution to observed score variance attributable to various influences. This approach also uses covariances to assess the extent to which sources of true-score variance and error variance are correlated across levels of the fixed facet. In this case, the fixed facet represents the seven distinct competencies that make up the mini-CEX. The resulting correlations provide evidence about the extent to which the proficiencies measured by each of these competencies are related; they also indicate the extent to which various kinds of halo effects may be present. For example, correlations between competencies for the person-by-rater effect indicate the extent to which a rater is likely to rate an examinee unexpectedly high or low (given the rater’s stringency and the examinee’s proficiency) on the remaining competencies when the examinee is rated unexpectedly high or low on one competency. A similar correlation between competencies for the person-by-rater-by-case effect would indicate the extent to which a rater is likely to rate an examinee unexpectedly high or low on a competency (given the rater’s stringency, the examinee’s proficiency, and the case difficulty) if the examinee is rated unexpectedly high or low on another competency.
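The step from estimated covariance components to the correlations interpreted here is mechanical; mGENOVA reports these quantities directly, but a minimal sketch of the conversion, using illustrative numbers rather than values from this study, may clarify what the reported correlations are:

```python
import numpy as np

def cov_to_corr(cov):
    """Convert a variance-covariance matrix for one score effect
    (diagonal = variance components for each competency,
    off-diagonal = covariance components between competencies)
    into a correlation matrix."""
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)

# Hypothetical 2x2 example for the examinee-by-rater effect on two
# competencies (illustrative placeholders, not Table 1 values).
cov_pr = np.array([[0.20, 0.12],
                   [0.12, 0.30]])
print(cov_to_corr(cov_pr))
```

A large off-diagonal correlation for a rater-related effect is the formal signature of the halo effect discussed above: rater error on one competency moves with rater error on another.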
The mGENOVA software package was used for all analyses.5 Because the software requires a crossed design, each examinee-case sample was analyzed separately. With this approach, examinees were crossed with cases and with raters both within and across levels of the fixed facet (i.e., within each of the separately analyzed data sets, ratings on the seven competencies were completed for 10 examinees, each of whom was rated by the same three raters on the same six patient cases). Examinees from different sessions saw different, nonoverlapping sets of patients. To account for this design, variance components were estimated for each of these data sets separately and then averaged.
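The final pooling step is a simple average across the separately analyzed samples; a sketch, using hypothetical per-sample estimates rather than values from this study:

```python
import numpy as np

# Hypothetical estimates of a single variance component (e.g., the
# examinee component for one competency), one value from each of the
# eight separately analyzed examinee-case samples.
estimates = np.array([0.28, 0.31, 0.22, 0.26, 0.30, 0.25, 0.27, 0.24])

# The design-wide estimate is the mean of the per-sample estimates.
pooled = estimates.mean()
print(pooled)
```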
Table 1 presents the results of the multivariate generalizability analysis. Because the design is both multivariate and includes examinees, cases, and raters, the table is complex and requires careful description. Within each score effect (e.g., examinee, case, rater, examinee-by-case, etc.), the values on the diagonal (in bold) represent the variance components that would be estimated for that level of the fixed facet in a univariate generalizability analysis. For example, the second column of the table represents the results for physical examination skills; within each score effect, the second row in that column presents the univariate variance components for that scale. The person variance for physical examination skills is .264; the case variance is .056, and so on.
Within the univariate framework, several results are of interest. First, the examinee-by-case variance is relatively small, which indicates that the level of case specificity is relatively modest. Second, the rater variance is relatively large and is consistently larger than the examinee variance. This suggests that differences in rater stringency contribute considerably more to measurement error than does case specificity. In fact, rater stringency contributes more to measurement error than all other sources of error combined (except for the three-way interaction), which suggests that a reliable measure based on the mini-CEX requires input from numerous different raters; multiple ratings from a small number of raters will be much less efficient. For example, the reproducibility (phi coefficient) for an assessment based on one rater rating each examinee on ten occasions would be only .39 for the overall clinical competence scale; the reproducibility for an assessment with the same number of observations in which each of ten raters rated the examinee once would be .83. A similar pattern of results exists for the other competencies. It is worth noting that the person-by-rater effects are reasonably small, suggesting that raters rank-order examinees similarly, and the case effects, indicating variability in case difficulty, also are very small.
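The rater-versus-occasion trade-off described above can be illustrated with a small D-study sketch. The variance components below are illustrative placeholders (not the Table 1 estimates), and the ten-rater design is assumed to nest a different rater within each occasion:

```python
# Phi (index of dependability) for two D-study designs in a
# person x case x rater framework. All components are hypothetical.

def phi_one_rater(vc, n_cases):
    """One rater rates the examinee on n_cases occasions: the
    rater-related error terms are not reduced by averaging."""
    error = ((vc["c"] + vc["pc"] + vc["cr"] + vc["pcr"]) / n_cases
             + vc["r"] + vc["pr"])
    return vc["p"] / (vc["p"] + error)

def phi_nested_raters(vc, n_cases):
    """A different rater rates each of n_cases occasions (raters
    nested within cases): every error term is divided by n_cases."""
    error = (vc["c"] + vc["r"] + vc["cr"]
             + vc["pc"] + vc["pr"] + vc["pcr"]) / n_cases
    return vc["p"] / (vc["p"] + error)

# Illustrative components with a dominant rater (stringency) term.
vc = {"p": 0.25, "c": 0.03, "r": 0.45, "pc": 0.08,
      "pr": 0.05, "cr": 0.02, "pcr": 0.60}

print(phi_one_rater(vc, 10))      # rater stringency error persists
print(phi_nested_raters(vc, 10))  # stringency error averages out
```

Because the rater and person-by-rater components dominate the error, spreading the same ten observations across ten raters yields a markedly higher phi than ten ratings from a single rater, mirroring the .39 versus .83 comparison reported above.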
The values below the diagonal in Table 1 are covariance terms; those above the diagonal (in italics) are correlations. These correlations represent the strength of the relationship between the seven scores representing levels of the fixed facet in the analysis. The correlations for the examinee effect are referred to as universe-score correlations. These are similar to true-score correlations in classical test theory and represent the strength of the relationship between the proficiencies measured by each of the seven scales. These proficiencies are all highly correlated, suggesting that there may be relatively little distinction between the proficiencies measured by the different competencies (e.g., the correlation between the proficiencies assessed by physical examination skills and clinical judgment is .96). The correlations for several of the other score effects also are moderate to high. The relatively high correlations for the case effect indicate that when a given case is relatively difficult (or easy) on one of the competencies, it tends to be similarly difficult (or easy) on the other competencies. The high correlations between competencies for the rater effect indicate that when a rater is relatively stringent on one competency, that rater tends to be similarly stringent on the other competencies (for example, the correlation between rater stringencies for medical interviewing skills and organization/efficiency is .95). The moderately high correlations for the examinee-by-case effect indicate that when an examinee does unexpectedly well on one competency for a given case, that examinee is likely to do unexpectedly well on the other competencies for that case. (“Unexpectedly well” describes an examinee who performs better on a given case than would be expected given the examinee’s proficiency [based on other cases] and the case difficulty [based on the performance of other examinees on that case].)
The moderate correlations for the examinee-by-rater effect indicate that when a rater rates an examinee high or low on one competency (given the examinee’s proficiency and the rater’s stringency), that rater will rate the examinee similarly high or low on other competencies.
Although the mini-CEX has been evaluated in previous studies, the results presented in this paper provide a new perspective. The present analyses point to two major conclusions. First, differences in rater stringency contribute considerably to measurement error, while case specificity contributes relatively little; this is significant because it is contrary to expectations.1 This result also has the practical implication that in order to produce stable scores, numerous raters will be required; having each of a small number of raters rate an examinee multiple times will not be as effective as having a larger number of raters rate the examinee on a smaller number of occasions. An additional implication of this finding is that the mini-CEX is primarily sensitive to aspects of examinee performance that are not case specific and, by extension, that the mini-CEX may not be assessing aspects of clinical skills that have been shown to be case specific (e.g., data gathering and diagnostic decision making). One recent paper reported results of a generalizability analysis for scores from the Step 2 CS Examination; results indicated that, for the data gathering score, the variance component for the person-by-case effect was four times that for the person effect.6 Similar results were reported in a paper based on data collected as part of the pilot examination administered in preparation for Step 2 CS.7 To the extent that the residual term including the person-by-case effect can be taken as evidence of case specificity, these studies provide evidence of such specificity in the context of the same examination format (and, in fact, the same case pool) that was used in the current study. The absence of such an effect in the present results may raise questions about the specifics of what the mini-CEX is assessing.
The multivariate results call into question the extent to which raters provide independent ratings on each of the mini-CEX competencies and therefore argue for considerable caution in interpreting the individual competencies as distinct dimensions of clinical performance. The substantial correlated error terms across the competencies suggest that ratings are subject to a halo effect in which decisions about one scale unduly influence those on another. It is important to note that this halo effect is distinct from a correlation between competency scores that results from measuring proficiencies that are related; the correlated error terms represent an effect that goes beyond the actual interrelationship of the proficiencies. Most importantly, the magnitude of the present error-term correlations is substantially greater than what has been reported for multicomponent clinical skills assessments in previous literature. For example, Margolis and colleagues reported an examinee-by-case correlation of .18 between the data gathering and interpersonal skills components of a standardized patient-based examination where both components were scored by the patient.7 In the present results, the examinee-by-case correlation between medical interviewing and humanistic qualities/professionalism is .89. If the intention is to produce a single composite score by combining results across competencies, the presence of these effects may be less critical. Taken together, however, the results of this study argue that the resulting composite may be heavily weighted toward representing the examinee’s professional manner and away from issues of clinical judgment.
The results and conclusions reported in this paper are limited by specific characteristics of the data set. Most notably, generalizations to the typical operational application of the mini-CEX are limited by the fact that: (1) examinee performances were viewed on videotape rather than directly observed; and (2) information about diagnosis and plan typically is obtained through a face-to-face interview rather than the written format used in this study. Similarly, the mini-CEX is designed for use with actual patient encounters rather than the standardized-patient encounters used in this study. These modifications were necessary to produce the highly structured data collection design used in this study, but their implications are unknown (it should be noted that ratings based on video review were used in at least one previous study to provide evidence in support of the effectiveness of the mini-CEX instrument8). Additionally, in the present study each rater observed and rated each examinee six times within a relatively short interval. Although videos of the same examinee were not observed back-to-back, a rater may have observed the same examinee twice within the same hour. This may have created a level of carryover across observations that acted to reduce the effects of case specificity. Finally, it is important to note that the strong relationship between component scores may in part reflect nontrivial overlap in the language used to describe the competencies. Future research will seek to assess the impact of clarifying competency descriptors to reduce the amount of overlap.
Despite the practical limitations of the present research, the methodology represents a novel approach to systematically investigating critical aspects of the performance of the mini-CEX. Although no recommendations are being made for or against use of this instrument in general, the present results do raise several issues that should be considered when implementing and interpreting results from this clinical evaluation tool.
1 Norcini JJ, Blank LL, Arnold GK, Kimball HR. The mini-CEX (clinical evaluation exercise): A preliminary investigation. Ann Intern Med. 1995;123:795–9.
2 Norcini JJ, Blank LL, Duffy FD, Fortna GS. The mini-CEX: A method for assessing clinical skills. Ann Intern Med. 2003;138:476–81.
3 Norcini JJ, Blank LL, Arnold GK, Kimball HR. Examiner differences in the mini-CEX. Adv Health Sci Educ. 1997;2:27–33.
4 Hauer KE. Enhancing feedback to students using the mini-CEX (clinical evaluation exercise). Acad Med. 2000;75:524.
5 Brennan RL. Generalizability theory. New York: Springer-Verlag; 2001.
6 Clauser BE, Harik P, Margolis MJ. A multivariate generalizability analysis of data from a performance assessment of physicians’ clinical skills. J Educ Meas. In press.
7 Margolis MJ, Clauser BE, Swanson DB, Boulet JR. Analysis of the relationship between score components on a standardized patient clinical skills examination. Acad Med. 2003;78:S68–S71.
8 Holmboe ES, Huot S, Chung J, Norcini J, Hawkins RE. Construct validity of the miniclinical evaluation exercise (mini-CEX). Acad Med. 2003;78:826–30.