Swanson, David B.; Holtzman, Kathleen Z.; Clauser, Brian E.; Sawhill, Amy J.
Several analytic approaches have been used to derive the optimal number of response options for multiple-choice items.1,5–9,13 Typically, these analyses found that the optimal number of options is two or three, assuming that testing time is proportional to the number of options. From a different perspective, Haladyna and Downing7 have recommended using two- or three-option items because they observed that, for most items, several distractors are rarely selected and that these “nonfunctional” distractors waste testing time.
Despite these recommendations, an item format with large numbers of options has been used increasingly in medical education. This format, termed “extended matching questions” (EMQs), is used on licensing and certification examinations nationally and internationally, as well as on intramural tests at many medical schools. A typical set of EMQs begins with an option list including from four to 26 options, with more than ten options commonly used. The option list is followed by two or more patient-based items requiring the examinee to indicate a clinical decision for each item, with all items in a set using the same option list. Over the past 20 years, multiple studies have found that EMQ-based tests are more reproducible (reliable) than other multiple-choice question (MCQ) formats aimed at assessment of medical decision making.2–4,11,12
The reason for the superior reproducibility of EMQ-based tests has not been clear. In most studies, item format has been confounded with the number of options. The purpose of the research reported here was to conduct a controlled investigation of the impact of the number and source of options. Two studies are reported. In the first, EMQs used previously on Step 2 of the United States Medical Licensing Examination (USMLE) were converted to one-best-answer format, and the number and source of options were systematically manipulated. The second study was similar in design, except the test material consisted of items without any prior use. Both studies were embedded in Step 2 test forms used during 2003–04; because these examinations are computer administered, information about examinee response times could also be collected, making it possible to investigate measurement precision as a function of testing time.
Test material for both studies was drawn from the USMLE Step 2 item pool. The 40 items used for Study 1 were derived from EMQs that had been used previously on Step 2. Each EMQ was rewritten as four one-best-answer (A-type) items, all of which had identical patient-based stems but varied in the type of option set used; a sample item is shown in the upper portion of Figure 1. The Base version of each item included all of the EMQ options (number of options ranged from 11 to 25, with a median of 14). The other three versions of each item had a smaller number of options. The Com 5 version included the five options viewed as most appropriate by a Step 2 item-writing committee that did not have access to information about examinee responses to the EMQ version. The Stats 5 version included five options selected by the authors based on examinees’ responses to the EMQ version when it was previously used on Step 2; these generally included the most popular response options from the EMQ format, though the selected options typically included the most discriminating distractors as well. The Stats 8 version included eight options, also selected by the authors based on examinees’ responses to the EMQ version that previously appeared on Step 2. The lower portion of Figure 1 indicates the percentage of examinees selecting each response in the sample item; cells with blank entries indicate options that were not presented in the associated version of the sample item.
To guard against any selection bias that could occur as a result of using test material that had previously appeared on Step 2,10 the 50 items selected for Study 2 had no previous use on Step 2. Each item also had a large number of options (range of 11 to 18 options, with a median of 13). Because these items had not been used previously, no information was available concerning examinee responses, and only the Base and Com 5 versions of items were used in Study 2.
All versions of items for both studies were embedded in unscored “slots” in 2003–04 Step 2 test forms; these slots were randomly interspersed among scored items. For each examinee, a small number of study items were randomly selected for presentation by test administration software; no more than one item with the same stem was presented to any examinee. Examinees’ item responses and response times were collected and returned to the National Board of Medical Examiners. Responses to study items from the 15,000 examinees attending U.S. or Canadian medical schools and sitting for Step 2 for the first time during the 2003–04 time period were retained for analysis.
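The selection constraint described above (a few study items per examinee, never two versions of the same stem) can be illustrated with a short sketch. This is not the actual test administration software, which the text does not describe; the function name `assign_study_items`, the pool layout, and the stem labels are hypothetical and chosen only for illustration.

```python
import random


def assign_study_items(versions_by_stem, n_items, rng):
    """Pick n_items distinct stems, then one version of each stem at random.

    versions_by_stem: dict mapping a stem id to its list of option-set versions
    (hypothetical layout). Sampling stems without replacement guarantees that
    no examinee sees two versions of the same stem.
    """
    stems = rng.sample(sorted(versions_by_stem), n_items)
    return [(stem, rng.choice(versions_by_stem[stem])) for stem in stems]


# Illustrative pool: three Study 1 style stems with four versions each.
pool = {
    "stem_01": ["Base", "Com 5", "Stats 5", "Stats 8"],
    "stem_02": ["Base", "Com 5", "Stats 5", "Stats 8"],
    "stem_03": ["Base", "Com 5", "Stats 5", "Stats 8"],
}
example = assign_study_items(pool, n_items=2, rng=random.Random(2003))
```

Sampling stems first and versions second is one simple way to satisfy the constraint; other schemes (for example, balancing version counts across examinees) would also be compatible with the description.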
For each version of each study item, three indices were calculated. The first was the item difficulty (p value) calculated as the percentage of examinees who responded to the item correctly. The second was an index of item discrimination: the corrected item-total (biserial) correlation, calculated as the correlation between the item (scored 0/1 for incorrect/correct) and the total test score with the item omitted from that score. The last index was the mean response time in seconds. All versions of four items (two in each study) were dropped from analysis, because review of the items and item statistics by a content expert indicated that there were two correct answers. Means and standard deviations (SDs) by option set type were then calculated for each index for the remaining 38 and 48 items in Study 1 and Study 2, respectively.
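As a minimal sketch of the three indices, the function below computes them for one item version. It assumes the supplied total scores include the item (so the item is subtracted to "correct" the total) and converts the point-biserial to a biserial with the standard normal-ordinate formula; the function and argument names are illustrative, and the authors' actual software is not described in the text.

```python
from math import sqrt
from statistics import NormalDist, mean


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_indices(item_scores, totals_including_item, response_times):
    """Return (p value in percent, corrected biserial, mean response time).

    item_scores: 0/1 scores on the item, one per examinee.
    totals_including_item: total test scores that still contain the item
    (an assumption of this sketch).
    response_times: response times in seconds, one per examinee.
    Requires 0 < proportion correct < 1, otherwise the biserial is undefined.
    """
    p = mean(item_scores)
    # Corrected total: remove the item's own contribution from each total.
    rest = [t - s for s, t in zip(item_scores, totals_including_item)]
    r_pb = pearson(item_scores, rest)
    # Ordinate of the standard normal density at the point that splits
    # the distribution into proportions p and 1 - p.
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    r_bis = r_pb * sqrt(p * (1 - p)) / y
    return 100 * p, r_bis, mean(response_times)


# Small made-up data set for eight examinees.
scores = [1, 0, 1, 1, 0, 1, 0, 1]
totals = [30, 20, 28, 27, 22, 31, 19, 29]
times = [80, 95, 70, 85, 100, 60, 90, 80]
p_value, biserial, mean_time = item_indices(scores, totals, times)
```

The biserial is always larger in magnitude than the point-biserial computed from the same data, which is worth keeping in mind when comparing discrimination values across studies that report different coefficients.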
The left side of Table 1 summarizes the results for Study 1. On average, the Base version of Study 1 items (mean p value of 73) was somewhat more difficult than the Com 5, Stats 5, and Stats 8 versions of the items (mean p values of 79, 78, and 75, respectively; p < .01 on a repeated-measures analysis of variance [ANOVA] in which stems were crossed with option set type). For item discrimination, no statistically significant differences were observed across the types of option sets; in fact, the mean biserial for the Stats 5 version was the largest (0.28 versus 0.26 for the other three versions). Mean response times also differed significantly (p < .01 on an ANOVA), with the Base version requiring an average of 83 seconds for a response and the Stats 8, Com 5, and Stats 5 versions requiring 75, 67, and 66 seconds, respectively. Interestingly, while the mean item difficulty for the EMQ version of the item (on previous use) was the same as for the Base version, mean item discrimination was significantly higher for the EMQ version of the item (0.30 versus 0.26).
The right side of Table 1 summarizes the results for Study 2. On average, the Base version of Study 2 items (mean p value of 68) was somewhat more difficult (p < .01 on an ANOVA) than the Com 5 version (mean p value of 75). Surprisingly, mean item discrimination was lower for the Base version than for the Com 5 version of the items (0.21 versus 0.24), though the difference was not statistically significant. As in Study 1, the mean response time was greater for the Base version of the items (91 versus 77 seconds; p < .01 on an ANOVA).
Consistent with previous research2–4,11,12 and expectations, items with larger numbers of options are more difficult than items with smaller numbers of options. This finding was observed in both Study 1 and Study 2, with p values for Base items consistently six or seven points lower than for five-option items, with eight-option items in between.
Inconsistent with previous research, however, was the finding that there was little change in item discrimination as the number of options decreased. In most previous research, items with larger numbers of options were more discriminating than items with smaller numbers of options.2–4,11,12 However, most of this work was done using the EMQ format, with the items placed at the end of paper-and-pencil test books, potentially resulting in somewhat speeded test administration, an issue to which we return below.
Also consistent with expectations, response times were longer for items with larger numbers of options. Across both studies, Base versions of items required approximately 17 seconds more to complete than five-option versions. As far as we know, this is the first experimental study (at least in medical education) using computer-based test administration to systematically manipulate the number of options to obtain estimates of differences in response time.
This is potentially important because it enables straightforward analysis of the relationship between test reliability and the number of options, taking response times into account. In general terms, the study results suggest that reducing the number of options has little impact on item discrimination and reduces testing time requirements by 15% to 20%. Using response times from Study 1 as a basis for estimation, approximately 43 Base-version items can be administered per hour (3,600 seconds/hour divided by 83 seconds/item); the analogous figure for Com 5 items is 53 items/hour. The impact on the standard error of measurement (SEM) can be estimated from these item counts: use of the Com 5 rather than the Base version of the item reduces the SEM by roughly 10% (1 minus the square root of 43/53), a value that applies regardless of test length. Similar calculations for Study 2 yield an estimated 8% decrease in the SEM. These results are consistent with the advice of some authors that nonfunctional distractors be eliminated because they waste testing time.7
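The arithmetic in this paragraph can be reproduced with a few lines. The sketch assumes, as the paragraph does, that discrimination is equal across versions so that the SEM for a fixed testing time scales with one over the square root of the number of items administered; the function names are illustrative only.

```python
from math import sqrt

SECONDS_PER_HOUR = 3600


def items_per_hour(mean_seconds_per_item):
    """Number of items that fit in one hour at the given mean response time."""
    return SECONDS_PER_HOUR / mean_seconds_per_item


def sem_reduction(seconds_slow, seconds_fast):
    """Proportional SEM decrease from using the faster item version.

    Assumes SEM is proportional to 1 / sqrt(number of items) for a fixed
    amount of testing time (equal discrimination across versions).
    """
    n_slow = items_per_hour(seconds_slow)
    n_fast = items_per_hour(seconds_fast)
    return 1 - sqrt(n_slow / n_fast)


# Mean response times reported in the text (Base versus Com 5).
study1 = sem_reduction(83, 67)  # about 0.10
study2 = sem_reduction(91, 77)  # about 0.08

# Whole items per hour, matching the 43 and 53 quoted for Study 1.
base_items = int(items_per_hour(83))
com5_items = int(items_per_hour(67))
```

Because the ratio of item counts equals the inverse ratio of response times, the hour cancels out of the calculation, which is why the estimated reduction does not depend on test length.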
Somewhat surprisingly, the same items and option sets in Study 1 were more discriminating when presented in EMQ than single-best-answer format, though these results must be inferred from earlier use of the EMQ versions of the items. It may be that presentation of items in sets has an impact on item statistics: for example, examinees who answer the first item in a set correctly may rule that response out in selecting an option for the second item in the set, which could, in turn, affect item discrimination. Alternatively, the pacing of Step 2 was recently shifted from 50 items/hour to 46 items/hour, and the change in pacing is another potential explanation for the lower discrimination indices for the items in Study 1 relative to their EMQ predecessors. Regardless of the reason, though, the mean discrimination index for the EMQ version of the items is higher than the analogous values observed in Study 1—perhaps enough higher to offset response time differences. This will be investigated in a subsequent study.
There are several limitations to the study that should be acknowledged. First, only a limited sample of items was included in each study, and all of the items came from the same examination program, so it is unclear if the major study results will generalize to other contexts. Second, the results were somewhat surprising: though results for item difficulty and response time were consistent with expectations, item discrimination was basically unrelated to the number of options, a departure from previous work. Future work will investigate this further in the context of both one-best-answer questions and EMQs.
From a practical perspective, we are still encouraging USMLE item writers to use the EMQ format with large numbers of options because of the efficiencies this approach affords in item preparation.2,3 However, pending the results of future research, we plan to begin advising attendees during USMLE item review sessions to reduce the number of options included on option lists in order to make more efficient use of testing time.
1 Budescu DV, Nevo B. Optimal number of options: an investigation of the assumption of proportionality. J Educ Meas. 1985;22:183–96.
2 Case SM, Swanson DB. Extended matching items: a practical alternative to free response questions. Teach Learn Med. 1993;5:107–15.
3 Case SM, Swanson DB. Constructing Written Test Questions for the Basic and Clinical Sciences, 3rd ed. Philadelphia: National Board of Medical Examiners, 2001.
4 Case SM, Swanson DB, Ripkey DR. Comparison of items in five-option and extended-matching formats for assessment of diagnostic skills. Acad Med. 1994;69(10 suppl):S1–S3.
5 Ebel RL. Expected reliability as a function of choices per item. Educ Psychol Meas. 1969;29:565–70.
6 Grier JB. The number of alternatives for optimum test reliability. J Educ Meas. 1975;12:109–13.
7 Haladyna TM, Downing SM. How many options is enough for a multiple-choice test item? Educ Psychol Meas. 1993;53:999–1009.
8 Lord FM. Reliability of multiple-choice tests as a function of number of choices per item. J Educ Psychol. 1944;35:175–80.
9 Lord FM. Optimal number of choices per item: a comparison of four approaches. J Educ Meas. 1977;14:33–38.
10 Norman GR, Swanson DB, Case SM. Conceptual and methodological issues in studies comparing assessment formats. Teach Learn Med. 1996;8:208–16.
11 Swanson DB, Case SM. Trends in written assessment: a strangely biased perspective. In: Harden R, Hart I, Mulholland H (eds). Approaches to Assessment of Clinical Competence. Norwich: Page Brothers, 1992:38–53.
12 Swanson DB, Case SM. Variation in item difficulty and discrimination by item format on Part I (basic sciences) and Part II (clinical sciences) of U.S. licensing examinations. In: Rothman A, Cohen R (eds). Proceedings of the Sixth Ottawa Conference on Medical Education. Toronto: University of Toronto Bookstore Custom Publishing, 1995:285–87.
13 Tversky A. On the optimal number of alternatives at a choice point. J Math Psychol. 1964;1:386–91.
Moderator: Sandy Cook, PhD
Discussant: Rebecca Lipner, PhD