The performance assessment literature agrees that human ratings are often contaminated by construct-irrelevant variance.1 It is the responsibility of testing agencies to minimize such effects by ensuring that raters behave in a standardized manner that yields consistent and meaningful scores. A successful strategy for meeting this responsibility includes building well-designed, detailed scoring rubrics and providing rigorous training to all raters.
Of course, good intentions do not guarantee success. No agency deliberately builds poorly designed rubrics or intentionally provides inadequate training; nevertheless, these unfortunate outcomes are not uncommon. Thus, validity evidence must be collected to support the adequacy of rubrics and the efficacy of training protocols. The Standards for Educational and Psychological Testing define the process of validation as “accumulating evidence to provide a sound scientific basis for the proposed score interpretations… . Validation can be viewed as developing a scientifically sound validity argument to support the intended interpretation of test scores and their relevance to the proposed use.”2 This view is consistent with Kane’s3 well-known conceptualization of validity. He suggests a framework in which the validity argument can be viewed as containing four types of evidence which logically link the data collected for the assessment with the proposed interpretations: scoring, generalization, extrapolation, and decision or interpretation. The current study presents validity evidence for the documentation scores from the United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills Examination. The documentation component, also known as the patient note, requires examinees to document their findings in a structured format after each patient encounter. The validity evidence presented here is related to two of Kane’s four categories: (1) scoring evidence related to the elements of the performance the raters score and the relationship between the scoring rubric and the construct of interest, and (2) generalizability evidence related to the stability of ratings across raters and cases.
Because it is both high-stakes and relatively new to the USMLE sequence, potential validity evidence for the scores from the Step 2 Clinical Skills Examination has been a matter of considerable interest. In a recent study, Clauser and colleagues4 showed that lack of rater consistency made a substantially greater contribution to measurement error in the documentation scores than the other sources of variance included in the study. One potential explanation for the low generalizability of documentation scores is that raters were not using the full score scale appropriately. Indeed, central tendency and restriction of range are two of the most commonly reported rater biases found in scores based on expert judgment.5 These findings led to an extensive research effort designed to improve the scoring rubrics for the documentation scores. The results of these efforts were introduced in stages during large-scale annual rater training exercises conducted in 2007 and 2008. The first phase provided increased specificity about how the components of the note should be weighted in producing the global rating. The second phase introduced more detailed case-specific guidelines about the content requirements for each component of the note. These enhancements to the scoring rubrics had proven successful in pilot studies; their introduction into operational scoring called for evidence demonstrating that the improvements generalized across raters and cases.
The USMLE Step 2 Clinical Skills Examination requires each examinee to interact with 12 standardized patients. Scores are typically based on 11 of these cases; one encounter may be used for pretesting new cases or training patients. Examinees are given 15 minutes to collect a patient history and complete a focused physical examination and 10 minutes to document their findings in a structured patient note. The notes use a SOAP format requiring the examinee to summarize the history, record pertinent findings from the physical examination, list up to five possible diagnoses, and list appropriate follow-up tests to confirm the diagnosis. Standardized patients complete three postencounter instruments assessing the examinees’ data-gathering skills, spoken English proficiency, and communication and interpersonal skills. Each patient note is scored by a physician rater using a nine-point scale, ranging from unacceptable to superior. Typically, each rater is trained to rate several cases. Ultimately, the documentation score is combined with the data-gathering score to produce a composite score called the Integrated Clinical Encounter (ICE) score.
As noted, in response to concerns about a lack of agreement in how raters used the score scale and evidence that some raters failed to use the entire scale, changes were made to the scoring procedures for the patient note. Changes to the scoring rubrics and modifications in the rater training procedures were introduced in stages. In 2007, specific rules were introduced defining the number of points allocated to each of the four components of the note. Concurrently, the training protocol was also expanded. The new protocol provided raters with exemplary patient notes illustrating typical performances at various score points and explanations justifying each score. After studying these benchmark notes, raters practiced scoring on notes that were prescored by expert raters. Discussions followed this practice session to resolve any misunderstandings and disagreements.
The second rubric modification occurred in 2008. Expanded case-specific rubrics were developed by experts who were familiar with both the case development and scoring processes. The previously used guidelines listed essential aspects of each of the four components of the note. These new rubrics expanded the level of specificity and provided a more detailed link between the identified content and expected scores. It should be noted that although both of these enhancements were designed to provide increased structure for the rating process, the rubrics remained mere guidelines—raters continued to use their professional judgment in producing the final rating.
The data for the present study are based on the cohorts of examinees who tested during three-month periods immediately following the annual rater trainings in 2006, 2007, and 2008. Study subjects acknowledged that their examination data could be used for research purposes. All data were deidentified and grouped, and data sets and derivative analyses have been stored securely under conditions that preserve privacy and prevent release to any third party. Because the risk of harm to any individual is therefore negligible, this study was not submitted for IRB review.
The structure of the data was both multifaceted (raters, examinees, cases) and nested (raters nested in cases), making it ideal for the generalizability theory framework. Multivariate generalizability theory is an extension of G-theory that partitions variances as well as covariances, allowing for appropriate modeling of relationships between multiple measures.6 The multiple measures of interest here are the data-gathering and documentation scores.
For the Step 2 Clinical Skills administration, encounters are grouped into sessions comprising 12 cases seen by 12 examinees. As mentioned, one case per session may be unscored; further, in practice, some sessions may have fewer than 12 examinees. Thus, the present analyses were implemented using an 11 × 11 persons-crossed-with-raters-nested-in-cases design. This resulted in more than 96% of sessions being included in the analysis. Variance and covariance components were estimated for 423 sessions from the 2006 cohort, 402 sessions from the 2007 cohort, and 407 sessions from the 2008 cohort using the mGENOVA software.6 The variance and covariance estimates for each session were then averaged within cohorts. Because each session had the same number of cases and examinees, unweighted averages were appropriate. Using these values, generalizability and phi coefficients were computed for both the documentation and data-gathering components for each cohort. Additionally, the composite generalizability and phi coefficients were computed using effective weights as described by Brennan6 and by Clauser, Harik, and Margolis.7 Empirical standard errors for the variance and covariance components were also estimated.
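The coefficient computations described above can be sketched as follows. The formulas are the standard generalizability theory definitions for a persons-crossed-with-raters-nested-in-cases design with one rating per note (so rater variance is confounded with the residual); the numeric components in the example are hypothetical illustrations, not the study's estimates.

```python
# Sketch: generalizability (E-rho^2) and phi coefficients for a
# p x (r:c) design with one rating per note. All numbers are hypothetical.

def g_and_phi(var_p, var_c, var_res, n_cases):
    """Return (E-rho^2, phi) for a test of n_cases scored cases.

    var_p   -- examinee (universe-score) variance
    var_c   -- case variance (contributes to absolute error only)
    var_res -- residual variance (examinee-by-case plus rater effects)
    """
    rel_err = var_res / n_cases            # relative error, sigma^2(delta)
    abs_err = (var_c + var_res) / n_cases  # absolute error, sigma^2(Delta)
    e_rho2 = var_p / (var_p + rel_err)
    phi = var_p / (var_p + abs_err)
    return e_rho2, phi

# Hypothetical components for an 11-case form:
e_rho2, phi = g_and_phi(var_p=0.32, var_c=0.10, var_res=1.20, n_cases=11)
```

Because the absolute error term also includes case variance, the phi coefficient can never exceed the generalizability coefficient, which matches the pattern of the coefficients reported in Table 1.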
Table 1 presents the estimated variance and covariance components and correlations for each effect (examinee, case, and examinee by case) for each cohort. Within the table, each 2 × 2 matrix is associated with an effect–cohort combination; the values in the diagonal represent the variance components, and the lower and upper off-diagonals represent the covariance and correlation, respectively, between documentation and data gathering. The empirical standard errors of each variance and covariance component are presented in parentheses. Generalizability and phi coefficients for each component/cohort combination and the composite scores are also shown at the bottom of Table 1.
In 2007, both the scoring rubric and the rater training protocol were enhanced; the rubric was then refined to be case-specific in 2008. The impact of these enhancements is evident in Table 1, which shows an overall increase in the examinee variance for the documentation score across cohorts (.21 in 2006, .20 in 2007, and .32 in 2008). The residual term (which includes examinee-by-case variance) displays a similar trend; however, the improvements in the generalizability and phi coefficients demonstrate that the increases in error variance are small compared with the increase in examinee variance. Changes in reliability can be difficult to interpret; for this reason, they are sometimes expressed in terms of their hypothetical effects on test length. From 2006 to 2007, the phi coefficient for the documentation score increased by 0.02, an improvement that corresponds to lengthening the 2006 CS examination by one case: to achieve the reliability of the 2007 scores, the 2006 CS examination would need 12 scored cases instead of 11. A similar improvement was observed from 2007 to 2008. Thus, to achieve the reliability of the 2008 scores, the 2006 CS examination would need 13 scored cases instead of 11, an increase of almost 20%.
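These test-length equivalences follow from the decision-study relationship Φ(n) = σ²p / (σ²p + σ²Δ/n), which can be solved for the number of cases needed to reach a target phi. A minimal sketch, using hypothetical variance components rather than the study's estimates:

```python
import math

def cases_needed(var_p, abs_err_per_case, phi_target):
    """Smallest n with var_p / (var_p + abs_err_per_case / n) >= phi_target."""
    n = abs_err_per_case * phi_target / (var_p * (1.0 - phi_target))
    return math.ceil(n)

# Hypothetical components: raising the target phi lengthens the test.
n_low = cases_needed(var_p=0.21, abs_err_per_case=1.5, phi_target=0.60)
n_high = cases_needed(var_p=0.21, abs_err_per_case=1.5, phi_target=0.70)
```

Under these illustrative numbers, moving the target phi from .60 to .70 raises the required test length from 11 to 17 cases, showing why even a small gain in phi corresponds to a meaningful number of additional cases.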
These results suggest that the modifications to the rubric and rater training procedures had a positive impact on the generalizability of the documentation scores. As mentioned, operationally, the documentation score is combined with the data-gathering score to produce a composite score. Pass–fail decisions are made based on the composite score. It is therefore a matter of interest to see whether the increased generalizability was associated with a similar improvement in the composite score. Table 1 displays the composite generalizability and phi coefficients when equal weights were assigned to each component. These also show a positive trend across cohorts. The increase in reliability of the composite score (i.e., ICE score) from 2006 to 2008 corresponds to lengthening the CS examination by two cases.
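The composite computation can be sketched in simplified form: the composite universe-score and error variances are quadratic forms of the weight vector with the multivariate (co)variance matrices. The 2 × 2 matrices below are hypothetical, not the study's estimates, and this sketch omits the details of the effective-weights procedure described in the cited reference.

```python
# Simplified sketch of a composite phi coefficient from multivariate
# variance and covariance components, with equal nominal weights for the
# documentation and data-gathering scores. All matrices are hypothetical.

def composite_phi(cov_p, cov_err, weights, n_cases):
    """Composite phi = w'Vp w / (w'Vp w + w'Verr w / n_cases)."""
    def quad(m, w):
        return sum(w[i] * m[i][j] * w[j]
                   for i in range(len(w)) for j in range(len(w)))
    true_var = quad(cov_p, weights)
    err_var = quad(cov_err, weights) / n_cases
    return true_var / (true_var + err_var)

phi_c = composite_phi(
    cov_p=[[0.32, 0.10], [0.10, 0.25]],    # examinee (co)variances
    cov_err=[[1.30, 0.20], [0.20, 1.00]],  # per-case absolute-error (co)variances
    weights=(0.5, 0.5),
    n_cases=11,
)
```

The off-diagonal entries matter here: the covariance between the documentation and data-gathering components contributes to both the composite universe-score variance and the composite error variance.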
It is worth noting that the generalizability and phi coefficients for the data-gathering component also increased between 2007 and 2008 (Eρ² = .70 and Φ = .60 in 2007 versus Eρ² = .73 and Φ = .65 in 2008). These gains may be due to concomitant changes in the scoring rubrics and training protocols for the data-gathering scores. The details of these improvements are beyond the scope of this paper; in brief, they included updating numerous physical examination items, retiring others on the basis of expert committee discussions, and introducing more explicit rubrics with new video examples. As with the documentation scores, the observed gains from 2007 to 2008 correspond to an increase in test length of two cases.
The main finding of the present study is that the modifications to the scoring rubric and rater training procedures for the Step 2 CS documentation score were associated with an increase in the generalizability of the resulting scores. Before the changes were implemented, there was concern that raters were using a restricted range of the score scale and that the ratings were unstable across cases and raters. The results in Table 1 show that the overall score variances increased following the modifications, a pattern consistent with raters using a wider range of the score scale. The largest part of this increase is associated with the examinee component of the scores. Examinee variance is analogous to true-score variance in classical test theory: it is the portion of the score variance that generalizes across cases and raters, reflecting agreement among raters about the examinees' proficiency levels. The examinee variance increased by approximately 50% between 2006 and 2008, suggesting that raters used a wider range of scores, discriminated more effectively among examinee performances, and agreed with one another more often. In terms of Kane's conceptualization of validity outlined above, the increased examinee variance represents compelling validity evidence with respect to the scoring and generalizability of the documentation component.
Examinee variance is not the only component that increased during this period. Case variance also increased by approximately 15%. This effect reflects the variability in the difficulty of cases. It is not surprising that as the score scale increases, the variability in difficulty across cases would also increase. Operationally, differences in case difficulty and rater stringency are accounted for through a statistical adjustment to the observed scores. The intention of this adjustment is to make scores from different forms of the test equivalent.8 A modest increase in this source of variance is certainly not a matter of concern.
The examinee-by-case variance component (or residual term) also increased by approximately 25%. This component represents inconsistencies in how examinees perform across cases and how different raters rate the same examinee (or how different raters would rate the same performance). Again, it is not surprising that as raters use more of the score scale, the absolute magnitude of the resulting measurement error would increase. The important result is that the error variance only increased at about half the rate of the examinee variance.
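The arithmetic behind this point can be checked directly: when true-score variance grows faster than error variance, reliability rises even though the absolute magnitude of measurement error is larger. The numbers below are purely illustrative, not the study's estimates.

```python
# Hypothetical check: examinee variance up 50%, per-case error up 25%.
n_cases = 11
var_p_old, err_old = 0.20, 1.20
var_p_new, err_new = var_p_old * 1.50, err_old * 1.25

phi_old = var_p_old / (var_p_old + err_old / n_cases)
phi_new = var_p_new / (var_p_new + err_new / n_cases)
```

Despite the larger error variance in the second scenario, phi increases, because the ratio of universe-score variance to total variance improves.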
Of course, improvement in reliability is only a small part of the picture; it does not guarantee improvement in validity, and at times increased reliability may even work against certain types of validity evidence.9 As such, further evidence regarding extrapolation and decision or interpretation3 is warranted. That said, given the straightforward nature of the modifications, it is not unreasonable to attribute at least some of the improvement in reliability to increases in accuracy. When accompanied by accuracy, consistency is a very desirable property.
In general, the findings suggest that the steps taken to enhance the scoring rubric and rater training procedures of the documentation component were in the right direction, although there remains room for improvement in the generalizability of the documentation scores. Efforts are currently under way to further improve both the scoring rubrics and rater training.
The authors would like to thank Janet Mee for assisting with data analysis.
1 Clauser BE. Recurrent issues and recent advances in scoring performance assessments. Appl Psychol Meas. 2000;24:310–324.
2 American Educational Research Association; American Psychological Association; National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999.
3 Kane M. Validation. In: Brennan RL, ed. Educational Measurement. 4th ed. Westport, CT: American Council on Education/Praeger; 2006:17–64.
4 Clauser BE, Harik P, Margolis MJ, Mee J, Swygert K, Rebbecchi T. The generalizability of documentation scores from the USMLE Step 2 Clinical Skills Examination. Acad Med. 2008;83:S68–S71.
5 Johnson RL, Penny JA, Gordon B. Assessing Performance: Designing, Scoring, and Validating Performance Tasks. New York, NY: The Guilford Press; 2009.
6 Brennan RL. Generalizability Theory. New York, NY: Springer-Verlag; 2001.
7 Clauser BE, Harik P, Margolis MJ. A multivariate generalizability analysis of data from a performance assessment of physicians’ clinical skills. J Educ Meas. 2006;43:173–191.
8 Harik P, Clauser BE, Grabovsky I, Nungester RJ, Swanson DB, Nandakumar R. An examination of rater drift within a generalizability theory framework. J Educ Meas. (in press).
9 Kane M, Case SM. The reliability and validity of weighted composite scores. Appl Meas Educ. 2004;17:221–240.