Several analytic approaches have been used to derive the optimal number of response options for multiple-choice items.1–7 Typically, these have found that the ideal number of options is two or three, assuming that testing time is proportional to the number of options. From a different perspective, Haladyna and Downing4 have recommended using two- or three-option items, because, for most items, several distractors are rarely selected and these “nonfunctional” distractors waste testing time.
Despite these recommendations, an item format with large numbers of options has been used increasingly in medical education. This format, termed “extended-matching questions” (EMQs), is used on licensing and certification examinations nationally and internationally, as well as on intramural tests at many medical schools. A typical EMQ set begins with an option list including from 10 to 26 options. The option list is followed by two or more patient-based items requiring the examinee to indicate a clinical decision for each item, with all items in a set using the same option list. Over the past 20 years, multiple studies have found that EMQ-based tests are more reproducible (reliable) than other MCQ formats aimed at assessment of medical decision making.8–11,14
The reason for the superior reproducibility of EMQ-based tests has not been clear. In most studies, item format has been confounded with the number of options. In addition, for research done in conjunction with paper-and-pencil administrations of Step 2 of the United States Medical Licensing Examination (USMLE), EMQs for paper-and-pencil tests were presented at the end of test books, so that item format was confounded with the position of the items in the test.
The purpose of the research reported here was to conduct a controlled investigation of the impact of item format and number of options. In this study, all items were presented both in sets (EMQ format) and as independent single items (A-type format), and the numbers of options in the EMQ and A-type formats were systematically manipulated. Both the EMQ and A-type versions of items were embedded in unscored portions of the Step 2 test forms used during 2005–06. Because these are computer-administered examinations, information about examinee response times could also be collected, making it possible to investigate measurement precision as a function of testing time.12
Test material was drawn from EMQ sets prepared for USMLE Step 2 in the late 1990s. Though many of the items had been used on Step 2 in the intervening years, item statistics were ignored to avoid biasing the results by selecting items with good statistical characteristics for use in the study.13 Members of a Step 2 Test Material Development Committee (TMDC) reviewed approximately 150 EMQ sets and selected 96 two-item sets that they viewed as well written and appropriate for Step 2. For each item in each set, TMDC members first identified the five and eight options that they viewed as most attractive for presentation in independent 5- and 8-option items (A-types); no information regarding examinee performance was provided in making these selections. Next, National Board of Medical Examiners (NBME) staff members tallied the selected distractors and identified the eight most popular options for use in 8-option EMQ versions of items.
Five versions of each of the 192 study items were then prepared, as summarized below:
1. A base EMQ set that included two study items and all of the options.
2. An 8-option EMQ set that included two study items and the eight most commonly selected options from the two A-type items included in the set (versions 4 and 5 below).
3. A base A-type item that included all of the options.
4. An 8-option A-type item including the 8 options viewed as most attractive by the TMDC.
5. A 5-option A-type item including the 5 options viewed as most attractive by the TMDC. (Illustrative examples of base EMQ and 5-option A-type items are provided in the Appendix.)
All versions of the items were embedded in unscored “slots” in 2005–06 Step 2 test forms; these slots were randomly interspersed among scored items. For each examinee, a small number of study items were randomly selected for presentation by test administration software, subject to the constraint that the same examinee never saw more than one version of any item/set. Examinees’ item responses and response times were collected and returned to the NBME. Responses to study items from 10,000 examinees attending U.S./Canadian medical schools and sitting for Step 2 for the first time during the 2005–06 time period were retained for analysis.* Responses from approximately 150 examinees were available for each version of each item.
For each version of each study item, five indices were calculated. The first was the item difficulty (p value), calculated as the proportion of examinees who responded to the item correctly. The second index was the logit transform of the p value; this transformation is commonly used because p values are nonlinearly related to the underlying difficulty scale: the “distance” from 0.50 to 0.60 is much smaller than the distance from 0.85 to 0.95. The third was an index of item discrimination: the item-total (biserial) correlation, calculated as the correlation between the item (scored 0/1 for incorrect/correct) and the reported Step 2 score. The fourth was an r-to-z transformation of the biserial correlation, also commonly used to correct for nonlinearity in the scaling of correlation coefficients. The final index was the mean response time in seconds.
All versions of six items were dropped from analysis, because: (1) review of the items (with item statistics) by a content expert indicated that there were multiple correct answers or (2) one or more versions were answered correctly by all examinees, making some of the indices undefined. Means and standard deviations (SDs) by format, item position within set, and number of options were then calculated for each index for all five versions of the remaining 186 items in the study.
Table 1 summarizes the results for all five indices. The mean value of each index for each item position and overall is shown for each combination of format and number of options.
Overall, for both the EMQ and A-type formats, base versions of the items were more difficult than versions with fewer options presented in the same format (Student-Newman-Keuls posthoc test; p < .05). For A-types, the mean p value was 0.708 for the base version; 8- and 5-option versions of A-types were somewhat easier (p values of 0.725 and 0.754, respectively). The same overall pattern of results was obtained for logit-transformed item difficulties. In contrast, no statistically significant differences were observed for item discrimination for the biserial index or for r-to-z transforms of the biserial.
Large, practically meaningful differences were observed in mean response times: on average, examinees required substantially more time to respond to content-matched items with larger numbers of options. For base versus 8-option versions of EMQs, this difference was close to 10 seconds (75.9 and 66.1 seconds, averaging across item position within the set); analogous values for base, 8-option, and 5-option versions of A-types were 84.7, 76.0, and 68.1 seconds, respectively. A statistically significant difference was observed for each combination of format and number of options except for the base EMQ and 8-option A-type versions (Student-Newman-Keuls posthoc test; p < .05). Examinees responded to EMQs more quickly than A-types with corresponding numbers of options; the primary locus of this effect appears to be a reduction in the time required to respond to the item presented second in a set.
Consistent with previous research8–12,14 and expectations, items with larger numbers of options were more difficult than items with smaller numbers of options. This finding was observed for both A-types and EMQs. The magnitude of the effect ranged from roughly 0.02 to 0.05 on a proportion-correct scale; because the standard deviation of Step 2 examinee scores on this scale is roughly 0.08, the effect should be viewed as small to moderate in size.
Inconsistent with previous research, however, was the finding that there was little difference in mean item discrimination as a function of item format or number of options. In most previous research, items with larger numbers of options were more discriminating than items with smaller numbers of options.8–11,14 However, most of this work was done using the EMQ format with items placed at the end of paper-and-pencil test books, confounding format, numbers of options, and pacing/speededness effects. Because item order was randomized for each examinee in this study, this confounding was not present.
Also consistent with expectations, response times were longer for items with larger numbers of options. Base versions of items required 8 to 16 more seconds to complete than 8- and 5-option versions. In addition, examinee responses to the second item in a set were roughly 8 seconds faster than for the first item in a set, contributing to differences in mean response times between the formats.
Accurate estimation of response time differences is potentially important because it enables straightforward analysis of the relationship between test reliability and the number of options, taking response times into account. In general terms, study results suggest that reducing the number of options has little impact on item discrimination and can reduce testing time requirements by 20%. Using response times for A-types as a basis for estimation, approximately 42 base-version items can be administered per hour (3,600 seconds/hour divided by 84.7 seconds/item); the analogous figure for 5-option A-type items is 52 items/hour. The impact on the standard error of measurement (SEM) can be estimated from this information: for comparable testing time, the use of A-types with five options rather than large numbers of options reduces the SEM by roughly 10% (1 – square root of 42/52). To achieve a similar improvement in precision using base versions of the items would require (roughly) a 25% (52/42) increase in test length.
Using the same approach to calculation for EMQs, 47 and 54 items/hour can be administered as base and 8-option versions, respectively. Approximately 14% more testing time (54/47 times the test length) would be required to achieve comparable levels of precision if larger numbers of options are used.
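The arithmetic in the two preceding paragraphs can be re-derived in a few lines. The helper names below are hypothetical; the response times are the means reported above, and the calculation assumes, as in the text, that the SEM is proportional to 1/√(test length).

```python
import math

# Mean response times in seconds per item, as reported above.
SECONDS_PER_ITEM = {
    ("A-type", "base"): 84.7,
    ("A-type", "8-option"): 76.0,
    ("A-type", "5-option"): 68.1,
    ("EMQ", "base"): 75.9,
    ("EMQ", "8-option"): 66.1,
}

def items_per_hour(seconds_per_item):
    # Whole items that fit into one hour (3,600 seconds) of testing time.
    return int(3600 / seconds_per_item)

def sem_reduction(slow_sec, fast_sec):
    # SEM is proportional to 1/sqrt(test length), so for a fixed amount of
    # testing time the faster format's SEM shrinks by 1 - sqrt(n_slow/n_fast).
    return 1 - math.sqrt(items_per_hour(slow_sec) / items_per_hour(fast_sec))
```

With these values, base A-types yield 42 items/hour versus 52 for the 5-option versions, an SEM reduction of just over 10%; the analogous counts for base and 8-option EMQs are 47 and 54 items/hour.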
This study had several limitations that should be acknowledged. First, only a limited sample of items was included in the study. Second, all of the items came from the same examination program, and, consequently, it is unclear if study results would generalize to other contexts. Third, item statistics were calculated on relatively small samples (approximately 150/item), and only first-time examinees from U.S./Canadian medical schools were included in analysis; different results might be observed for examinees with different training or with English as a second language. (Additional analyses using larger, more diverse samples are planned as additional examinees complete 2005–06 Step 2 test forms.) Finally, the impact of using less than five options on item statistics and response times is unknown; this will also be addressed in future research.
From a practical perspective, we are still encouraging item writers for USMLE (and medical school exams) to use the EMQ format with large numbers of options, because of the efficiency this approach affords in item preparation.8,14 However, we plan to begin advising participants in item-review sessions to reduce the number of options included in order to make more efficient use of testing time.
1 Budescu DV, Nevo B. Optimal number of options: An investigation of the assumption of proportionality. J Educ Meas. 1985;22:183–96.
2 Ebel RL. Expected reliability as a function of choices per item. Educ Psychol Meas. 1969;29:565–70.
3 Grier JB. The number of alternatives for optimum test reliability. J Educ Meas. 1975;12:109–13.
4 Haladyna TM, Downing SM. How many options is enough for a multiple-choice test item? Educ Psychol Meas. 1993;53:999–1009.
5 Lord FM. Reliability of multiple-choice tests as a function of number of choices per item. J Educ Psychol. 1944;35:175–80.
6 Lord FM. Optimal number of choices per item – A comparison of four approaches. J Educ Meas. 1977;14:33–38.
7 Tversky A. On the optimal number of alternatives at a choice point. J Math Psychol. 1964;1:386–91.
8 Case SM, Swanson DB. Extended matching items: a practical alternative to free-response questions. Teach Learn Med. 1993;5:107–15.
9 Case SM, Swanson DB, Ripkey DR. Comparison of items in five-option and extended-matching formats for assessment of diagnostic skills. Acad Med. 1994;69:S1–S3.
10 Swanson DB, Case SM. Trends in written assessment: A strangely biased perspective. In: Harden R, Hart I, Mulholland H, eds. Approaches to Assessment of Clinical Competence. Norwich: Page Brothers, 1992:38–53.
11 Swanson DB, Case SM. Variation in item difficulty and discrimination by item format on Part I (basic sciences) and Part II (clinical sciences) of U.S. licensing examinations. Proceedings of the Sixth Ottawa Conference on Medical Education. In: Rothman A, Cohen R, eds. Toronto: University of Toronto Bookstore Custom Publishing, 1995:285–87.
12 Swanson DB, Holtzman KZ, Clauser BE, Sawhill AJ. Psychometric characteristics and response times for one-best-answer questions in relation to number and source of options. Acad Med. 2005;80:S93–S96.
13 Norman GR, Swanson DB, Case SM. Conceptual and methodological issues in studies comparing assessment formats. Teach Learn Med. 1996;8:208–16.
14 Case SM, Swanson DB. Constructing Written Test Questions for the Basic and Clinical Sciences. 3rd ed. Philadelphia: National Board of Medical Examiners, 2001.
*NBME obtains consent from examinees for research use of test data, and all reported analyses were conducted using de-identified item responses.