Purpose. Operational USMLE™ computer-based case simulation results were examined to determine the extent to which rater reliability and regression model performance met expectations based on preoperational data.
Method. Operational data resulted from Step 3 examinations given between 1999 and 2004. Plots were produced using reliability and multiple correlation coefficients.
Results. Operational testing reliabilities increased over the four years but were lower than the preoperational reliability. Multiple correlation coefficients were somewhat higher than those reported during the preoperational period and suggest that the operational scoring algorithms have been relatively consistent.
Conclusions. Changes in the rater population, changes in the rating task, and enhancements to the training procedures are several factors that can explain the identified differences between preoperational and operational results. The present findings have important implications for test development and test validity.
In November of 1999, the USMLE™ Step 3 became a computer-administered examination. Among the motivations for moving USMLE to computer was the potential for administering the computer-based case simulation (CCS)1–3 item format that could not be used with paper-and-pencil administrations. The inclusion of CCSs in the USMLE Step 3 examination marked the culmination of three decades of research and development effort by the National Board of Medical Examiners.
With the CCS software, examinees manage patients in a dynamic simulated patient–care environment. Each simulation begins with a short opening scenario describing the patient's appearance and location, and examinees are provided with a brief initial patient history. After this initial information, examinees are left unprompted; they can request specific physical examination information or go to an order sheet to order tests, treatments, and consultations. Examinees may advance the case through simulated time and make decisions about changing the patient's location (e.g., move the patient from the emergency department to the intensive care unit). As simulated time passes, test results become available and the patient's condition changes (based both on the underlying problem and the examinee's actions).
The simulations used in USMLE Step 3 are scored using case-specific regression-based scoring algorithms; quantifiable components of the examinee performance are used as the independent measures in a regression equation to predict the rating that an examinee would have received if the performance had been reviewed and rated by a group of trained clinician–raters.1,4–6 The scoring process is elaborated in the following paragraphs.
A listing of all examinee actions results from each completed case simulation. This “transaction” list includes not only the orders entered on the sheet but also (1) the sequence of the orders, (2) the simulated time at which each action was ordered, and (3) the simulated time at which results were seen or the orders were completed. Guided by clinical expertise and empirical information about examinee performance, experts develop scoring keys that identify specific aspects of the performance that provide essential evidence. This requires considering each of the several thousand unique actions that an examinee might order and making a judgment about whether that action should be considered appropriate in the context of the case. For actions that are considered appropriate (or potentially beneficial), additional consideration is given to level of importance. Three levels of beneficial actions typically are used: those considered (1) essential for adequate treatment, (2) important for optimal treatment, and (3) desirable for optimal treatment. On a case-specific basis, certain actions are considered neutral because they do not provide useful evidence about proficiency in that case. Finally, actions that are not considered appropriate for a case are categorized by the associated level of risk or intrusiveness. As with the beneficial actions, three levels of nonindicated actions representing different levels of risk and/or intrusiveness typically are used.
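The step from a transaction list to countable evidence can be sketched as follows. This is a minimal illustration, not the operational scoring key: the action names, the category labels, the `SCORING_KEY` mapping, and the `count_keyed_actions` helper are all hypothetical, and the sketch assumes that repeated orders are counted each time and that unkeyed actions are treated as neutral.

```python
from collections import Counter

# Hypothetical scoring key for one case: action -> category.
# Actions and category labels are illustrative assumptions only.
SCORING_KEY = {
    "chest_xray": "benefit_essential",
    "antibiotic_iv": "benefit_essential",
    "blood_culture": "benefit_important",
    "pulse_oximetry": "benefit_desirable",
    "abdominal_ct": "risk_low",
    "bronchoscopy": "risk_moderate",
    "exploratory_surgery": "risk_high",
    "diet_regular": "neutral",
}

# Three levels of beneficial actions, then three levels of nonindicated actions.
CATEGORIES = [
    "benefit_essential", "benefit_important", "benefit_desirable",
    "risk_low", "risk_moderate", "risk_high",
]


def count_keyed_actions(transactions, key=SCORING_KEY):
    """Count actions in a transaction list by scoring-key category.

    Each transaction is a pair (action, simulated_minute_ordered).
    Neutral and unkeyed actions contribute nothing to the counts.
    """
    counts = Counter()
    for action, _ordered_at in transactions:
        category = key.get(action, "neutral")
        if category != "neutral":
            counts[category] += 1
    return [counts[c] for c in CATEGORIES]


# One hypothetical examinee performance.
transactions = [
    ("chest_xray", 5),
    ("antibiotic_iv", 20),
    ("diet_regular", 25),
    ("abdominal_ct", 40),
    ("pulse_oximetry", 45),
]
features = count_keyed_actions(transactions)
```

The resulting list holds one count per category, in the order given by `CATEGORIES`; a timing-based measure would have to be derived separately from the simulated times.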
For each case, actions specified in the key are used as independent measures in a regression equation to predict expert ratings based on review of the full transaction list. The specific independent measures used in the regression equation may vary from case to case; in the simplest form of the model, seven independent variables are used (three represent the counts of beneficial actions, three represent the counts of nonindicated actions, and the seventh variable represents the timeliness of treatment).
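A minimal sketch of the simplest seven-variable model is given below, assuming ordinary least squares with an intercept and using synthetic data in place of real examinee performances. The function name `fit_case_model`, the simulated weights, and the coding of the timeliness variable are assumptions made for illustration; the multiple correlation coefficient is computed here as the correlation between predicted and observed mean ratings.

```python
import numpy as np


def fit_case_model(X, mean_ratings):
    """Fit ordinary least squares with an intercept.

    X: (n_performances, n_measures) array of independent measures.
    mean_ratings: length-n array of mean expert ratings.
    Returns (weights, multiple_r); weights[0] is the intercept.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(mean_ratings, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    predicted = design @ weights
    # Multiple R: correlation between predicted and observed ratings.
    multiple_r = np.corrcoef(predicted, y)[0, 1]
    return weights, multiple_r


# Synthetic data (an assumption, not study data): 200 performances,
# three beneficial counts, three nonindicated counts, one timeliness score.
rng = np.random.default_rng(0)
n = 200
X = rng.integers(0, 5, size=(n, 7)).astype(float)
true_w = np.array([1.0, 0.6, 0.3, -0.3, -0.6, -1.0, 0.5])
y = 4.0 + X @ true_w + rng.normal(0.0, 1.0, n)

weights, multiple_r = fit_case_model(X, y)
```

Once estimated for a case, the weights can be applied to the independent measures of any new performance to produce a predicted rating without further expert review.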
Considerable research examining both the reliability of the ratings used as the dependent measures in the regression and the usefulness of the regression in capturing the information in those ratings was conducted prior to implementation.1,5–8 In two studies, multiple correlation coefficients were reported for 15 cases. These values ranged from .69 to .95 with a mean of .84, and illustrate the relationship between the predicted scores (based on the regression) and the observed scores.5,6 Another study reported reliabilities ranging from .93 to .99 with a mean of .97.6 The reliabilities represent the expected correlation between the mean ratings produced by the group of raters and the mean produced by another randomly equivalent group. These results therefore provided evidence that the regression-based procedure represented a useful approach to quantifying examinee performance.
While the results cited in the above studies argue that the regression-based approach to scoring allowed for an effective approximation of expert ratings of performance, other researchers have warned that what can be achieved on a small scale in research mode may be very different from what will be achieved in a large-scale operational setting.9,10 During the research and development phase, relatively few raters are needed; much of the initial work for this project therefore was based on ratings produced by a small group of highly experienced raters. For operational purposes, a much larger pool of raters was recruited; because these clinicians were newly recruited, they were necessarily inexperienced with the rating task. Additionally, work during the developmental phase focused on a narrower range of case content than that used operationally. Given these factors, it is important to examine the extent to which operational results met research-based expectations when scoring algorithm development was (1) generalized to a different and larger group of raters, and (2) applied to a wider range of case presentations.
Data reported in this study are based on operational case development results produced between November 1999 (when the first operational data were available) and February 2004. These results represent all scoring algorithms used in operational testing as part of the Step 3 examination. To produce the ratings, a total of 172 geographically diverse clinician–raters were recruited over the four operational years; an average of 43 clinicians participated each year. Raters were recruited based on recommendations from previous committee members and a review of their qualifications. All raters were practicing primary care physicians who had teaching and research experience, and most had a present affiliation with a medical school. Consideration was also given to ensuring a balance of primary care specialties. The clinicians were brought to Philadelphia for group orientation on the case simulation format, case development and scoring procedures, and for training in the rating process (including case-specific discussion for each case that an individual was to rate). The orientation and training procedures were scheduled across three days for each group, and the actual operational ratings were completed after the raters returned home. Secure Internet connections allowed raters to access the sampled transaction lists and to enter their ratings.
When the examination was first implemented, it was necessary to collect operational data on all cases that were targeted to contribute to examinee scores. This was done during an initial period in which scores were withheld from examinees. Subsequent data collection for algorithm development was based on pretest data collected as part of ongoing operational testing.
The dependent measure used in each of the case-specific regressions was a mean rating for each of 200 to 250 examinee performances averaged across four or five raters. These data were used to estimate regression weights and multiple correlation coefficients. The simple model with seven independent measures (described earlier) was used most frequently. Other alternative models included variations in which (1) specific components of the category representing most beneficial actions were used as individual independent measures, and (2) actions were aggregated based on those contributing to diagnosis, treatment, and follow-up.
Figure 1 presents box and whisker plots indicating the rater reliability results for the preoperational work and for each of the four years of operational case development. The line within the boxes indicates the median value, and the width of the box contains the second and third quartiles. The whiskers represent the range that is the inside three quartiles of the data (i.e., 1.5 quartiles on either side of the median). The reliabilities were calculated as alpha coefficients from the raters-by-scores matrix, and these values can be interpreted as the correlation between the mean rating produced by these raters and a mean resulting from a randomly equivalent group of ratings. The results indicated that the median rater reliability for the four years of operational scoring was lower than that for the preoperational development period. The results also suggested that rater reliability increased over the four years of operational testing.
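The alpha coefficient described above can be sketched as follows, treating each rater as an "item" and each examinee performance as an observation. This is a minimal illustration using the standard coefficient alpha formula; the function name `coefficient_alpha` and the small ratings matrix are hypothetical and are not study data.

```python
from statistics import variance


def coefficient_alpha(ratings):
    """Coefficient alpha for a performances-by-raters matrix.

    ratings: list of rows, one per examinee performance; each row
    holds one rating per rater (same raters in every row).
    """
    k = len(ratings[0])  # number of raters
    # Variance of each rater's ratings across performances.
    rater_vars = [variance([row[j] for row in ratings]) for j in range(k)]
    # Variance of the summed rating across performances.
    total_var = variance([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - sum(rater_vars) / total_var)


# Hypothetical ratings: six performances, four raters.
ratings = [
    [7, 8, 7, 8],
    [3, 4, 3, 3],
    [5, 5, 6, 5],
    [9, 8, 9, 9],
    [2, 3, 2, 2],
    [6, 6, 5, 7],
]
alpha = coefficient_alpha(ratings)
```

Because alpha is unchanged when every summed rating is divided by the number of raters, the same value applies to the reliability of the mean rating across raters.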
The multiple correlation coefficients for algorithms developed during the research phase and during each year of operational testing are presented in Figure 2. Again, the box and whisker plots represent the medians and interquartile ranges. These results suggested that the operational algorithms were relatively consistent across the four years and that they were somewhat superior to the results reported during the preoperational period. (The pattern of results in Figures 1 and 2 did not change if mean rather than median values were used.)
In general, the results reported in this study support the notion that transition from a research project to a large-scale operational project did not result in a substantial loss in either the reliability with which raters assess examinee performance or the efficiency with which those ratings are approximated by the regression algorithm. There are several possible explanations for the present findings.
First, it is possible that both experience with running the rating process and changes in the actual training of raters have contributed to the improved reliability of ratings over the four years of operational testing. As part of the training process, each rater individually rates a sample of examinee performances and then discusses those ratings with the group. This reconciliation process yields information about how the other raters are approaching the rating task and allows raters to determine if they are all using the rating guidelines in the same way. In addition, this exercise often leads to improved quality control. For example, if all raters but one had given a low rating to an examinee, discussing the specifics of the performance with the other raters might inform the discrepant rater of the failure to notice a potentially harmful action that had been ordered by the examinee. This process of reconciliation was not used extensively during the preoperational period; the focus on this aspect of training began in earnest during Year 1 of operational development. While raters are not required to reach consensus, the reconciliation process encourages them to pay careful attention to how they are rating the performances and provides reassurance that they are rating similarly. This increased focus on the quality of the rating process is therefore a likely and important contributor to the improved reliabilities over the four years of operational case development. The greater interquartile range for the second operational year likely reflects a transition during which individual raters varied in the extent to which their rating proficiency developed. By Year 3, it appears that continued refinement of the rating process along with the increased experience of the individual raters led to substantial improvement in reliability relative to performance in Year 1.
The finding of a lower reliability for the operational years in comparison to the preoperational period was not surprising. As mentioned earlier, most ratings during the preoperational period were produced by a small group of raters who had extensive experience with the rating process and who rated a much smaller range of case content than is necessary operationally. Operational use requires recruitment of a much larger pool of raters who were not initially experienced with the rating task and who are responsible for the review of a wider variety of case content. The differences in the heterogeneity of the task and rater population between preoperational and operational use are therefore likely explanations for the decrease in reliability following implementation.
One possible reason for the increase in multiple correlation coefficients over the years of operational testing is that there has been increased focus on selecting regression models. While earlier model-development efforts may have been limited to the basic model described previously, experience with the process has yielded important insight into how physicians actually produce their ratings. This insight has led to the use of alternative models in which different combinations of actions lead to a more precise approximation of the expert rating policies. Increased use of these alternative models may therefore at least partially explain the improved regression results over time.
Finally, it is important to note that the reported results may be influenced by the range of examinee performances. While an increased range of performance would tend to inflate both of the described indices (reliability and multiple correlation), the finding that reliabilities decreased compared to research results and correlations increased compared to research results cannot entirely be explained by the range of performances.
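The direction of the range effect can be illustrated with a small simulation. The sketch below uses entirely synthetic data under assumed values (a seed of 0, 2,000 simulated performances, a noise standard deviation of 0.75, and a cutoff of 0.67); it compares the correlation between two measures across the full range of performances with the correlation within a narrower central range.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Synthetic data (assumption): predicted scores plus independent noise
# stand in for observed mean ratings.
rng = random.Random(0)
predicted = [rng.gauss(0.0, 1.0) for _ in range(2000)]
observed = [p + rng.gauss(0.0, 0.75) for p in predicted]

r_full = pearson(predicted, observed)

# Keep only performances near the middle of the predicted-score range.
narrow = [(p, o) for p, o in zip(predicted, observed) if abs(p) < 0.67]
r_narrow = pearson([p for p, _ in narrow], [o for _, o in narrow])
```

With the same measurement noise, the correlation in the narrow group is lower than in the full group, which is why a change in the range of performances would be expected to move reliability and multiple correlation in the same direction rather than in opposite directions.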
Overall, this investigation of the expert ratings used in USMLE CCS scoring algorithm development suggests that preoperational results provide a reasonable indication of subsequent operational findings. Some of the potential concerns inherent in moving from a research to an operational program can be addressed by attention to issues of rater recruitment and training enhancement. Similarly, the ongoing dedication to refinement of algorithm development procedures provides additional explanation for the performance of the models over time.
Although the initial decrease in operational rater reliability was modest, any decrease may be important because scores resulting from the ratings will be a function of the reliability of the ratings on which they were modeled. While the reliability of the modeled scores does not ensure validity, reliability is a requirement if valid inferences are to be made based on these scores. Given the above implications of suboptimal reliability results, several practical issues warrant consideration. First, practitioners developing assessments that require expert ratings of performance should be aware that a decrease in reliability when moving from research to operational use is not unexpected. Second, it is clear that rater experience is critical; although it may be unavoidable to have an inexperienced group of raters at the outset, staggering the subsequent introduction of new raters and controlling turnover will optimize rater experience at any given time. Finally, the results suggest that the training procedures currently in use (e.g., rater reconciliation) appear to be effective in improving the performance of raters over time. The above implications for validity and test development suggest that continuing efforts to increase the quality of all aspects of the scoring process are critical.
1. Margolis MJ, Clauser BE. A regression-based procedure for automated scoring of a complex medical performance assessment. In: Williamson D, Bejar I, Mislevy R (eds). Automated Scoring of Complex Tasks in Computer-based Testing. New Jersey: Lawrence Erlbaum Associates, in press.
2. Clyman SG, Melnick DE, Clauser BE. Computer-based case simulations from medicine: assessing skills in patient management. In: Tekian A, McGuire CH, McGaghie WC (eds). Innovative Simulations for Assessing Professional Competence. Chicago: University of Illinois, Department of Medical Education, 1999.
3. Dillon GF, Clyman SG, Clauser BE, Margolis MJ. The introduction of computer-based case simulations into the United States Medical Licensing Examination. Acad Med. 2002;77(suppl 10):S94–6.
4. Clauser BE, Subhiyah R, Piemme TE, et al. Using clinician ratings to model score weights for a computer simulation performance assessment. Acad Med. 1993;68(suppl 10):S64–7.
5. Clauser BE, Margolis MJ, Clyman SG, Ross LP. Development of automated scoring algorithms for complex performance assessments: a comparison of two approaches. J Educ Meas. 1997;34:141–61.
6. Clauser BE, Subhiyah R, Nungester RJ, Ripkey DR, Clyman SG, McKinley D. Scoring a performance-based assessment by modeling the judgements of experts. J Educ Meas. 1995;32:397–415.
7. Clauser BE, Swanson DB, Clyman SG. A comparison of the generalizability of scores produced by expert raters and automated scoring systems. Appl Meas Educ. 1999;12:281–99.
8. Clauser BE, Clyman SG, Swanson DB. Components of rater error in a complex performance assessment. J Educ Meas. 1999;36:29–45.
9. Clauser BE. Recurrent issues and recent advances in scoring performance assessments. Appl Psych Meas. 2000;24:310–24.
10. Dunbar SB, Koretz DM, Hoover HD. Quality control in the development and use of performance assessments. Appl Meas Educ. 1991;4:289–303.