Swygert, Kimberly; Muller, Eric; Clauser, Brian E.; Dillon, Gerard F.; Swanson, David B.
The United States Medical Licensing Examination (USMLE™) is composed of three parts: Step 1 measures understanding and application of concepts basic to the practice of medicine, Step 2 measures clinical knowledge and skills important to patient care under supervision, and Step 3 measures clinical knowledge essential for independent practice. Step 1 consists solely of multiple-choice items; Step 2 and Step 3 have performance assessments and multiple-choice items. Except for the performance-assessment component of Step 2, all of the USMLE examinations are computerized.1
On the USMLE Step examinations, as on many large-scale, high-stakes standardized examinations, it is important that examinees have the opportunity to establish a reasonably comfortable testing pace. For some large-scale testing programs, time limits have been shown to negatively impact examinee test-completion behavior; often, time limits either prevent examinees from finishing the examination or require examinees to work quickly at the expense of accuracy.1–5 Examinees who do not complete the examination on time may be using a maladaptive pacing strategy.
In Spring 2001, the USMLE Step 2 Committee reviewed the issue of examination timing for the multiple-choice component of Step 2. Committee members were provided with data on examinee completion rates, between- and within-section performance patterns, and examinee perceptions of time sufficiency. Based on the information provided, the Committee concluded that the Step 2 time limits might be having an unintended effect on examinee performance; additionally, it appeared that this effect varied across examinee subgroups. As a result, the Committee reduced the number of items per one-hour test section from 50 to 46. This was achieved by reducing the number of pretest items, which are not used in live scoring; the number of operational items per section remained constant. This change resulted in a total of 368 items across the eight hours of testing time, as opposed to the previous total of 400 items, and an average of approximately 78 seconds per item, as opposed to 72 seconds per item.
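The per-item figures follow directly from the section length and item counts reported above; a short arithmetic sketch (variable names are ours, not part of the examination specification):

```python
# Sketch: average time available per item under the old and new Step 2 formats.
# Item counts (50 vs. 46 per one-hour section) are those reported in the text.
SECONDS_PER_SECTION = 60 * 60  # each section is one hour

old_seconds_per_item = SECONDS_PER_SECTION / 50  # 72.0 seconds per item
new_seconds_per_item = SECONDS_PER_SECTION / 46  # ~78.26 seconds per item

print(old_seconds_per_item, round(new_seconds_per_item, 2))
```

Across eight sections this yields the totals cited in the text: 8 × 50 = 400 items before the change and 8 × 46 = 368 items after it.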
The purpose of this article is to examine Step 2 examinee pacing and performance under the original time limits, and to examine the changes that occurred when the time per item was lengthened.
Two samples of Step 2 examinees were drawn from the USMLE databases. In both cases, the samples contained examinees taking the examination under standard timing conditions. The first sample completed the examination prior to the timing change. The majority of these examinees tested between August 2001 and August 2002. This cohort comprised 29,796 total-takers (25,950 first-takers) for whom biographic information was available. The second sample completed the examination subsequent to the timing change. The majority of these examinees tested between August 2002 and August 2003, and this second cohort comprised 28,373 total-takers (24,809 first-takers) for whom biographic information was available.
In both sets, four subgroups of examinees were identified: (1) U.S. and Canadian medical school students and graduates whose native language is English (U.S.-eng), (2) U.S. and Canadian medical school students and graduates with English as a second language (U.S.-esl), (3) international medical school students and graduates whose native language is English (IMG-eng), and (4) international medical school students and graduates with English as a second language (IMG-esl).
The Step 2 multiple-choice examination consists of eight one-hour sections. Sections are designed to be comparable in terms of difficulty and content. The order of sections and items within sections is randomized; examinees can skip and return to items within sections, but may not return to a section once leaving it. An optional survey follows the examination, and approximately 90% of examinees provide feedback. One survey question asks examinees whether they felt they had sufficient time to complete the examination; if not, examinees are asked to indicate how much more time would have been necessary.
The four subgroups were compared with one another, and across the pre– and post–timing-change conditions, with respect to the percentage responding to the survey question about the sufficiency of the time limit; trends in the mean proportion of items answered correctly across each of the eight sections; trends in the mean proportion of items answered correctly, within sections, over successive five-item blocks; and trends in the mean item response time, within sections, over successive five-item blocks. For the survey analyses, both first-time and repeater examinees were used; for the other analyses, only first-time examinees were used. Because the post–timing-change condition included 46 items per section, a complete set of five-item blocks could not be defined; to maintain the same number of blocks for comparison purposes, the initial two blocks for the post-change analysis were composed of three items each.
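The block scheme described above can be verified with a small sketch (block sizes are those stated in the text; the list names are ours):

```python
# Sketch: within-section item blocks before and after the timing change.
# Pre-change sections had 50 items; post-change sections have 46.
blocks_pre = [5] * 10            # ten 5-item blocks cover all 50 items
blocks_post = [3, 3] + [5] * 8   # two 3-item blocks plus eight 5-item blocks

assert sum(blocks_pre) == 50
assert sum(blocks_post) == 46
assert len(blocks_pre) == len(blocks_post) == 10  # same number of blocks
```

Keeping ten blocks in both conditions allows block-by-block comparison of performance and response-time trends across the two data sets.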
When responses from the different subgroups to the survey item about time sufficiency were compared, it was apparent that, on average, the opinions of all subgroups shifted after the timing change, although the size of the shift varied by subgroup. For the 2001–02 data, 54% of U.S.-eng examinees reported having sufficient time to complete each section, as compared to 39%, 32%, and 24% for the U.S.-esl, IMG-eng, and IMG-esl examinees, respectively. The percentages of examinees who indicated that only five more minutes would be needed to make the time sufficient were 23%, 28%, 29%, and 26% for the U.S.-eng, U.S.-esl, IMG-eng, and IMG-esl groups, respectively.
For the 2002–03 data (post–timing change), the percentages increased for every subgroup: 64% of U.S.-eng examinees now reported having sufficient time to complete each section, as did 45% of the U.S.-esl examinees and 40% of the IMG-eng examinees. After the timing change, only 19% of U.S.-eng examinees reported needing five more minutes, but the percentages in this category for the other three groups were comparable to those for the 2001–02 data. The IMG-esl examinees thus continued to be the least likely to report satisfaction with the time limit (29%), but the gap between the IMG-eng and U.S.-esl examinees narrowed.
Proportion of Items Correct for Each Section
Changes in the mean proportion of items answered correctly across the eight test sections, if present, might in part reflect the impact of changes in examinee pacing as the testing day unfolds. When the mean proportion of items answered correctly was calculated for the four subgroups, it became apparent that the largest differences occurred between the U.S. and IMG subgroups. The two U.S. groups had mean proportion-correct rates of approximately 73% (U.S.-eng) and 70% (U.S.-esl), while the rate was approximately 67% for both IMG subgroups. The variability of the mean proportion of items correct was very small for each subgroup across the eight sections. This is not unexpected, given that the clock starts anew at the beginning of each section; this result suggests that, from a timing perspective, there is no large impact on performance due to changes in examinee pacing as the testing day progresses.
When the time limit was extended for each section, all four subgroups displayed an increase in the mean proportion of items answered correctly per section (U.S.-eng = 76%; U.S.-esl = 73%; IMG-eng = 69%; IMG-esl = 70%). The gap in mean proportion of items answered correctly widened between the two IMG groups, with the non–native-English speakers appearing to have a very slight advantage under the new timing conditions.
Performance within Sections
Figures 1a and 1b show the pre– and post–timing-change data for the performance of each subgroup within sections. Because the mean proportion of items correct did not appear to vary significantly across sections for any subgroup, the proportions reported here are averages across the eight sections. The vertical axis shows the mean proportion of items answered correctly, averaged across sections. The horizontal axis displays the ten successive blocks of items; again, the first two blocks in the post-change condition contain three items each and the other eight blocks contain five items each.
In Figure 1a, it can be seen that average performance declined near the end of the section for all four subgroups. For the U.S. examinees, performance did not decline until the final five items; for the IMG subgroups, performance declined during the final ten items. The extent of the decline also appears to be least severe for the U.S.-eng examinees and most severe for the IMG-esl examinees. The performance of U.S.-eng examinees declined from 73% correct for the penultimate item block to 72% for the final item block, whereas IMG-esl examinees declined from 68% to 63% across the same two blocks.
The same performance patterns seen in Figure 1a are apparent in Figure 1b. The difference is that all four subgroups now have higher mean proportion-correct values across the items, although a decline in performance across item blocks persists for all subgroups. However, the decline is smaller than that before the timing change. The IMG-esl examinees now answer approximately 69% of the items correctly at the beginning of a section, declining to 67% at the end. This suggests that, while all examinees are performing slightly better on the items overall, pacing remains a problem.
Response Times within Sections
Figures 2a and 2b show the pre– and post–timing-change data for the mean response time per item, in seconds. The vertical axis shows the mean response time, averaged across all sections. The horizontal axis again consists of successive five-item blocks.
Figure 2a shows a much larger overall decline in the average amount of time used across successive item blocks than there was in performance on those items, and the decline is much more gradual for time than it was for performance. Moreover, the declines appear steepest for the two IMG subgroups, who used the most time, on average, at the beginning of the section. The IMG-esl examinees had the most severe decline in item response time, moving from an average of 76 seconds per item within the first block to an average of 60 seconds per item on the last block; by comparison, the U.S.-eng examinees began by using less time, on average, per item, and showed less decline, changing from 70 seconds per item to 62 seconds per item.
These disparities suggest that the U.S. examinees are pacing themselves more effectively across items. IMG examinees are more likely to run out of time, with much shorter average response times at the section end, because they are taking too long on the initial items within each section. For the pre–timing-change data set, the amount of time available per item was 72 seconds, but the IMG-eng examinees used approximately 74 seconds per item at the beginning of the sections, and the IMG-esl examinees started out even more slowly.
Several interesting results are apparent for the post–timing-change data in Figure 2b. First, the drop-off in mean item response time persisted for all subgroups, and the drop-off continued to be most severe for the IMG subgroups. Once again, the IMG examinees were less consistent in their pacing, as evidenced by the larger discrepancy between their response times for the initial items and those for the final item block.
However, all four subgroups used more time, post–timing change, on the initial block of items than they did in the previous data set. In the pre–timing-change data set, the IMG-esl examinees used 76 seconds per item at the beginning, which was four seconds more than the available average time per item. When the average time available rose to 78.26 seconds per item, the IMG-esl group used an average of 82 seconds per item on the initial items, falling to 66 seconds, resulting in exactly the same difference (16 seconds) as before. Had examinees maintained their original pacing once the new timing condition was instituted, no drop-off would have been evident for three of the four subgroups, and only a modest drop-off would have occurred for the IMG-esl examinees.
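The pacing comparison above reduces to simple arithmetic on the first-block and last-block response times; a sketch using the IMG-esl figures from the text (variable names are ours):

```python
# Sketch: IMG-esl first-block vs. last-block mean response times (seconds/item),
# taken from the pre- and post-timing-change figures described in the text.
pre_first, pre_last = 76, 60      # pre-change: 72 s/item available on average
post_first, post_last = 82, 66    # post-change: ~78.26 s/item available

pre_gap = pre_first - pre_last    # 16-second drop-off within sections
post_gap = post_first - post_last # also a 16-second drop-off

assert pre_gap == post_gap == 16
# Initial-block overshoot relative to the available average time per item:
assert pre_first - 72 == 4                    # 4 seconds over, pre-change
assert round(post_first - 3600 / 46, 2) == 3.74  # ~3.74 seconds over, post-change
```

In other words, the extra time was absorbed at the start of each section rather than eliminating the end-of-section drop-off.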
An increase in the amount of time available per item on the USMLE™ Step 2 examination resulted in some changes that were expected and positive. Examinees appear to be more satisfied with the new timing constraints, differences in performance between the beginning and end of sections seem to have diminished, and overall examinee performance as measured by percent of items correct has improved.* However, time limits and their impact on examinee pacing continue to have some effect on both performance and response times at the end of test sections.
One unexpected change was the shift in examinee pacing for the 2002–03 data. Examinees did not retain the same time-per-item pacing as for the 2001–02 data, but continued to use more than the average amount of time available per item at the beginning of the sections. This raises the question of whether this is a pervasive test-taking behavior that will produce similar patterns regardless of the amount of time allowed. Future analyses will need to include consideration of such examinee behaviors. In addition, it might be helpful to examine pacing further through an examination of the conditional distributions of item response times. It is possible that, within subgroups, examinee response times are bimodally distributed, with some examinees using far too much time at the beginning and requiring per-item times close to zero for the final items.
The USMLE Step 2 Committee will continue to monitor these data and, as future analyses become available, will consider further timing adjustments as needed.
1. Federation of State Medical Boards and National Board of Medical Examiners. Bulletin of Information for the United States Medical Licensing Examination. Philadelphia: NBME, 2004.
2. Schaeffer G, Reese CM, Steffen M, McKinley RL, Mills CN. Field test of a computer-based GRE general test. Educational Testing Service Research Report No. ETS-RR-93-07. 1993.
3. Schaeffer G, Steffen M, Golub-Smith ML, Mills CN, Durso R. The introduction and comparability of the computer adaptive GRE General Test. Educational Testing Service Research Report No. ETS-RR-95-20. 1995.
4. Schnipke DL. Assessing speededness in computer-based tests using item response times. Paper presented at the annual meeting of the National Council on Measurement in Education, San Francisco, CA, 1995.
5. Slater SC, Schaeffer G. Computing scores for incomplete GRE general computer adaptive tests. Paper presented at the annual meeting of the National Council on Measurement in Education, New York, NY, 1996.
*Although the 2001–02 and 2002–03 data sets were both composed of first-time Step 2 examinees, there is no guarantee that the cohorts of examinees were identical across years, and no guarantee that the test forms were of equal difficulty across years. However, it is reasonable to assume that both cohorts and test forms were relatively equivalent across years.