In the early to mid-1990s, the National Board of Medical Examiners (NBME) examinations were replaced by the United States Medical Licensing Examination (USMLE). The USMLE, which was designed to have three components or Steps, was administered as a paper-and-pencil test until the late 1990s, when it moved to a computer-based testing (CBT) format. The CBT format provided the opportunity to realize the results of simulation research and development that had occurred during the prior two decades. A milestone in this effort occurred in November 1999 when, with the implementation of the computer-delivered USMLE Step 3 examination, the Primum® Computer-based Case Simulations (CCSs) were introduced. In the year preceding this introduction and the more than two years of operational use since the introduction, numerous challenges have been addressed. Preliminary results of this initial experience have been promising. This paper introduces the relevant issues, describes some pertinent research findings, and identifies next steps for research.
History of medical licensing examinations. Nationally developed and administered examinations used for granting the initial license to practice medicine in the United States have existed since the early part of the 20th century. The NBME, which was founded in 1915, has been the primary provider of these tools. Up until the early 1950s, these examinations included essay questions and bedside orals. In the early 1950s, support began to grow for the replacement of essay questions with the newer, objective format known as multiple-choice questions (MCQs). Despite the arguments of some that the MCQs might not address some of the competencies that were tapped by essay questions, the psychometric appeal of MCQs resulted in the discontinuation of essay testing and the adoption of an MCQ format for the NBME Parts I and II.1 The bedside oral exam of the NBME Part III continued as part of the examination program until the mid-1960s. There was a brief attempt to standardize the bedside oral examination and to supplement it with a segment that required the examinee to interpret a film clip that presented physician-patient interactions. However, the costs of organizing the orals and producing the film clips were so great, and the advantages of MCQs so appealing, that the bedside orals and film clips were discontinued. At the same time, a new format was introduced into Part III, the patient management problem (PMP). The PMP was a paper format that presented a patient case with certain signs and symptoms. The examinee made decisions about managing the patient by selecting from a list of actions and received feedback by uncovering information reflecting the results of the indicated actions. The PMP format was used in Part III until the late 1980s, when it was dropped because of concerns about score reliability and interpretation.
While practical constraints led to the exclusion of performance-based examination components, the belief in the importance of high fidelity and interactivity in physician assessment persisted. Starting around the time that bedside orals were discontinued, emerging computer technology was beginning to offer the promise of a testing format that could simulate the physician-patient encounter. Simulation research in the ensuing years, coupled with technologic advances made elsewhere, contributed to the evolution of the format. At the same time, research into new psychometric approaches led to the development of useful scoring models. An opportunity to capitalize on this format in a national, standardized examination occurred with the computerization of the USMLE Step 3.
Overview of the CCS. The CCS format has been described elsewhere.2 Briefly, the examinee is presented with a short textual description of a patient scenario in a defined health care setting. The examinee uses free-text entry to request diagnostic studies, consultations, procedures, therapies, and the like. Also, the examinee is responsible for other decisions, including when to admit the patient to the appropriate health care facility and when to re-evaluate the patient. As the examinee moves through simulated time, the patient's condition changes based on the underlying problem and based on the examinee's management plan revealed through order writing. Thus, the examinee has to monitor the patient's changing status, and modify the management plan as needed. When the examinee has finished managing the case, the software produces a detailed record of the examinee's actions, referred to as a transaction list. The transaction list contains all actions ordered by the examinee along with the simulated time at which each action was ordered.
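As a rough illustration of the transaction list described above, the following Python sketch models it as a list of time-stamped actions. The class names, field names, the use of minutes as the unit of simulated time, and the example orders are all hypothetical; the actual Primum record layout is not described here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Transaction:
    """One examinee action and the simulated time at which it was ordered."""
    simulated_minutes: int  # elapsed simulated time (hypothetical unit)
    action: str             # order as recognized from free-text entry


@dataclass
class TransactionList:
    """Record of every action an examinee ordered while managing one case."""
    case_id: str
    entries: List[Transaction] = field(default_factory=list)

    def record(self, simulated_minutes: int, action: str) -> None:
        self.entries.append(Transaction(simulated_minutes, action))

    def in_time_order(self) -> List[Transaction]:
        # sorted() is stable, so actions ordered at the same simulated
        # time keep the order in which they were entered.
        return sorted(self.entries, key=lambda t: t.simulated_minutes)

    def first_time_of(self, action: str) -> Optional[int]:
        """Earliest simulated time at which an action was ordered, if ever."""
        times = [t.simulated_minutes for t in self.entries if t.action == action]
        return min(times) if times else None


# Made-up example of an examinee's orders on one case.
log = TransactionList(case_id="example-case")
log.record(0, "physical examination")
log.record(5, "chest x-ray")
log.record(5, "oxygen")
log.record(60, "re-evaluate patient")
```

Keeping the simulated time with each action is what allows later scoring to consider the sequence and relative timing of actions, not only whether an action was ordered.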
CCS cases are developed by committees of expert physicians who are identified based on their experience in patient care and in medical education. The physician-author first defines a case architecture, which includes the patient's initial state, a model of interventions that will alter the patient's status, the amount of simulated time to depict, the types of prompts (e.g., notes from nurses, family members) that might happen in real life, and other interactions that will make the simulation appear more realistic. NBME staff and consulting physicians refine the model for the case-development committee's ultimate approval.
Scoring keys, or the criteria for performance on a case, are developed by a second committee of physicians who encounter the case as would an examinee (without any prior knowledge of the case). Their actions are then discussed in a group, and consensus is reached about the types of actions needed, the relative importance of different actions (i.e., some are essential, some less important), and scoring based on the sequence and relative timing of actions.
Ratings of transaction lists are completed by a third group of physicians. This group discusses criteria for evaluating examinees' performances and agrees on guidelines that will be used. Each physician then individually reviews the performances of several hundred examinees on a particular case and assigns a rating to the examinee-case performance. Ratings are on a nine-point scale from unacceptable to optimal. These ratings are used, in combination with the scoring keys, to build a regression-based scoring algorithm for each case. The algorithm is used operationally to calculate the case score that would most likely have resulted had there been an opportunity to subject the performance of every examinee to the rating process.3,4
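The general idea of a regression-based scoring algorithm can be sketched as follows. This is a minimal illustration, not the operational procedure: it assumes that each rated performance is summarized by counts of actions in three key-defined categories (essential, less important, and risky), and it fits ordinary least-squares weights that predict the mean expert rating from those counts. The category names, the data, and the function names are hypothetical, and the toy ratings were constructed to be exactly linear in the counts so the fitted weights are easy to check.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_case_model(features, mean_ratings):
    """Least-squares weights (intercept first) via the normal equations."""
    design = [[1.0] + list(row) for row in features]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, mean_ratings)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def score(weights, feature_row):
    """Predicted rating for one performance, given fitted weights."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], feature_row))


# Hypothetical rated sample: (essential, less important, risky) action counts.
features = [(3, 2, 0), (3, 0, 0), (2, 1, 0), (2, 2, 1), (1, 0, 0),
            (1, 1, 1), (0, 0, 0), (0, 2, 1), (3, 1, 1)]
# Hypothetical mean expert ratings on the nine-point scale.
mean_ratings = [8.0, 7.0, 5.5, 5.0, 3.0, 2.5, 1.0, 1.0, 6.5]

weights = fit_case_model(features, mean_ratings)
# Score an unrated performance: two essential actions, nothing else.
predicted = score(weights, (2, 0, 0))
```

Once the weights are estimated from the rated sample, `score` can be applied to every examinee's performance on that case, which is the sense in which the algorithm approximates the rating an examinee would have received from the expert raters.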
Examinee preparation and test delivery. USMLE Step 3 registrants are provided access to MCQ and CCS practice materials. They are strongly encouraged to practice on the CCS format before arriving at the test site. The examination is administered over two days. The first day and a half are dedicated to delivery of approximately 500 multiple-choice items. The last half-day is dedicated to CCSs; examinees see a total of nine simulations. Examinees are given a brief tutorial and then up to 25 minutes for each simulation.
Operational challenges. Although some of the difficulties of introducing the CCS format into the live examination were similar to those associated with the computerization of MCQs, the simulations also offered some unique challenges. For instance, orientation software had to be more elaborate to train examinees for a virtual environment rather than an item format in which text on the screen is mostly self-explanatory. Also, vendor software, which was designed to deliver MCQs rather than more complex simulations, had to be customized in order for simulations to run effectively.
Research and Findings
In order to assess the impact and the success of the introduction of the CCS, several issues were examined: (a) the process for developing the scoring algorithms was evaluated by examining rater reliability and the rating process, (b) the observed relationship between MCQs and CCS was examined through correlational analysis and by the identification of examinees with differing performances, and (c) test administration records were examined to see whether the complicated simulation format produced an excessive number of problems in delivery.
Scoring. During the decade preceding operational implementation of the CCS, numerous studies were published describing the potential utility of a computer-automated scoring algorithm designed to produce a score that approximated that which would have resulted if expert clinicians had reviewed and rated a CCS transaction list.3,4 This work was based on the responses of medical students who managed cases under low-stakes conditions. Additionally, the expert ratings used to construct these scoring algorithms were produced by a select group of raters who had extensive prior experience. Although these results were encouraging, there was reason to question whether they could be replicated when implementation moved from a field test to an operational setting. Operational development of the scoring algorithms would require recruiting a substantially larger group of expert raters who would inevitably have substantially less experience than those raters who were involved in the initial process. Also, it was sensible to question whether variation in performance that was evident across medical students who took the test under low-stakes conditions would be the same as that for Step 3 candidates seeking credentials for licensure.
To examine this transition, two indices were considered. The first was the reliability of the mean ratings used in developing the scoring algorithms for a sample of cases. The second was the correlation between that mean rating and the regression-based scores for those same cases. For this analysis, 18 different case simulations were examined. These cases represented a variety of simulated patients under a variety of clinical situations and were typical of the kinds of cases presented during the first year of operational CCS implementation. Across those cases, the mean rater reliability was .87 (SD = .05) and the mean correlation between the ratings and the regression-based scores was .86 (SD = .05). These values compare favorably with those reported from prior investigations.5 However, in general, the reliability of the individual ratings was slightly lower than that previously observed. To achieve these equivalent results, it was necessary to increase the number of raters per case. It remains to be seen whether the ratings will become more reliable as the raters become more experienced.
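The trade-off between individual-rater reliability and the number of raters can be illustrated with the Spearman-Brown formula, a standard psychometric result that gives the reliability of a mean of n ratings when the raters are treated as interchangeable. The sketch below is an illustration only: the single-rater reliabilities (.60 and .50) are hypothetical values, not figures from this study, and the analysis reported above may have used a different estimation approach.

```python
import math


def reliability_of_mean(single_rater_reliability: float, n_raters: int) -> float:
    """Spearman-Brown: reliability of the mean of n interchangeable ratings."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


def raters_needed(single_rater_reliability: float, target: float) -> int:
    """Smallest number of raters whose mean rating reaches the target reliability."""
    r = single_rater_reliability
    if not (0 < r < 1 and 0 < target < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    # Spearman-Brown solved for n, then rounded up to a whole rater.
    n = target * (1 - r) / (r * (1 - target))
    return max(1, math.ceil(n - 1e-9))


# Hypothetical single-rater reliabilities; .87 is the mean reliability reported above.
raters_if_060 = raters_needed(0.60, 0.87)
raters_if_050 = raters_needed(0.50, 0.87)
```

Under these assumed values, a drop in single-rater reliability from .60 to .50 raises the number of raters required to reach .87 from five to seven, which is the kind of adjustment described above.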
Relationship between CCS and MCQ scores. One question of central importance is how closely CCS scores relate to the scores based on MCQs. Studies have examined this relationship and produced generally similar results, showing a moderate relationship between the proficiencies measured by the two formats.6 Correlations corrected for unreliability in the scores have ranged from .66 to .69. Figure 1 is a scatterplot of scaled proficiencies for a sample of examinees randomly selected from the full group that completed the Step 3 examination during the first year of computer administration. The scores represented in the plot are on the logit scale that is used for operational equating; the identity line is plotted for reference. The equating process allows scores from different test forms to be placed on the same scale and scores from MCQs and CCSs to be combined. The plot reflects a modest observed-score correlation (approximately .45). One interesting aspect of the plot is that a number of examinees did relatively well on one score while doing relatively poorly on the other; the largest difference reflects a high CCS performance and a low MCQ performance. Work has begun to identify the characteristics of examinees with these patterns of scores.
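The gap between the observed correlation and the corrected correlations follows from the classical correction for attenuation, in which the observed correlation is divided by the square root of the product of the two score reliabilities. The sketch below shows the arithmetic only; the two reliability values are hypothetical placeholders chosen for illustration, since the reliabilities underlying the corrected values are not given here.

```python
import math


def disattenuated_correlation(r_observed: float,
                              reliability_x: float,
                              reliability_y: float) -> float:
    """Classical correction for attenuation: r_xy / sqrt(r_xx * r_yy)."""
    if not (0 < reliability_x <= 1 and 0 < reliability_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / math.sqrt(reliability_x * reliability_y)


# Observed correlation of about .45, as reported above.
# Reliabilities of .90 (MCQ score) and .50 (CCS score) are assumed values.
corrected = disattenuated_correlation(0.45, reliability_x=0.90, reliability_y=0.50)
```

With these assumed reliabilities the corrected value is roughly .67. The point of the example is only that a modest observed correlation is compatible with a noticeably higher correlation between the underlying proficiencies when one of the scores is based on relatively few observations.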
Test-administration issues. The first two years of computer-based testing for USMLE Step 3 have been completed without significant incident. Approximately 50,000 examinations were administered and, except for some delay at startup, there were no large-scale problems or disruptions in examinee testing or scoring. Still, there were instances in which operations could be improved. For example, during the first two years of testing, a small percentage (2–3%) of examinees experienced some interruptions during the CCSs. Although most resumed the examination without consequence to testing time or data, these rates are considered too high, and ways of reducing the likelihood of these potentially disturbing events are being investigated.
The generally successful transition to operational use of CBT supports the contention that CCSs can be used for large-scale high-stakes testing. Nevertheless, the complexity of the format and the size of the related computer files continue to present challenges to a smooth and seamless administration. In addition, the CCS format is a relatively expensive format to develop and maintain. Like some of the formats that have gone before, the costs and benefits of CCS will have to be carefully monitored in the coming years.
Despite the operational challenges, data collected thus far on the relationship between MCQs and CCSs seem to support the belief, held by many physician-experts who have been involved in the development and design of the CCS, that this format can make a unique contribution to the overall assessment of physicians' competence. Research over the last two decades has been key to the successful introduction of the CCS and it will be key to its continued growth and development as an essential component of the USMLE. However, there is still work to be done to understand this contribution and to expand the body of evidence that supports the contention that the CCS enhances the validity of the Step 3 program. Some studies that are particularly central to this effort have already been completed and reported elsewhere.6,7,8
Several additional issues are currently under investigation, including the role of the CCS format in the standard-setting process, the identification of optimal time allocations for case management, and the development of methods to reduce the likelihood of overexposure of CCS content.
The introduction of the CCS into the USMLE program marked the first major change in testing format in more than ten years. This addition has brought the potential of filling some of the perceived gaps that were the target of many of the formats that came and went in the last century. The CCS, as an operational format, is still in its infancy. It shows great promise, but there is much to learn.
1. Hubbard JP, Levit EJ. The National Board of Medical Examiners: The First Seventy Years. Philadelphia, PA: National Board of Medical Examiners, 1985.
2. Clyman SG, Melnick DE, Clauser BE. Computer-based case simulations from medicine: assessing skills in patient management. In: Tekian A, McGuire CH, McGaghie WC (eds). Innovative Simulations for Assessing Professional Competence. Chicago, IL: University of Illinois, Department of Medical Education, 1999:29–41.
3. Clauser BE, Margolis MJ, Clyman SG, Ross LP. Development of automated scoring algorithms for complex performance assessments: a comparison of two approaches. J Educ Meas. 1997;34:141–61.
4. Clauser BE, Subhiyah R, Piemme TE, et al. Using clinician ratings to model score weights for a computer simulation performance assessment. Acad Med. 1993;68(10 suppl):S64–S67.
5. Clauser BE, Subhiyah R, Nungester RJ, Clyman SG, McKinley D. Scoring a performance-based assessment by modeling the judgements of experts. J Educ Meas. 1995;32:397–415.
6. Clauser BE, Margolis MJ, Swanson DB. A multivariate generalizability analysis of a test comprised of constructed-response and fixed-format items. Paper presented at the meeting of the National Council on Measurement in Education, New Orleans, LA, 2002.
7. Margolis MJ, Harik P, Clauser BE. Examining subgroup difference on a high-stakes performance-based assessment. Paper presented at the meeting of the National Council on Measurement in Education, New Orleans, LA, 2002.
8. Floreck LM, Guernsey MJ, Clyman SG, Clauser BE. Examinee performance on computer-based case simulations as part of the USMLE Step 3 examination: are examinees ordering dangerous actions? Acad Med. 2002;77(10 suppl):S77–S79.