Threats to the construct-validity evidence associated with locally developed achievement tests in medical education abound. Construct validity refers to multiple sources of evidence supporting or refuting the meaningful and accurate interpretations of test scores, the inferences derived from assessments, and the pass-fail (or grade) decisions made about examinees from those assessments.1 Messick defines construct-irrelevant variance (CIV) as “… excess reliable variance that is irrelevant to the interpreted construct.”2
Testwiseness, teaching to the test, and test irregularities (cheating) are all examples of CIV that tend to inflate test scores by adding measurement error to scores. Other sources of CIV, such as testing trivial content at low levels of the cognitive domain, may lower test scores by adding measurement error.
The purpose of this study was to evaluate the CIV associated with using flawed test items (items that violate standard item-writing principles) on a classroom achievement test in medical education. Specifically, the contribution to CIV from flawed test items was evaluated with respect to item and examination difficulty and the pass-fail or grade decisions made from tests composed of flawed questions.
Review of the Literature
The principles of writing effective multiple-choice test questions (MCQs) are well documented in educational measurement textbooks, the research literature, and test-item construction manuals designed for medical educators.3,4,5 Yet a recent study from the National Board of Medical Examiners (NBME)6 shows that violations of the most basic item-writing principles are very common in achievement tests used in medical education.
Several item-writing principles have been investigated for their effects on test psychometric indices, such as scale reproducibility and item difficulty and discrimination. In their review, Haladyna, Downing, and Rodriguez4 summarize the current empirical studies of common item flaws. Most such studies evaluate the effect of a single item flaw (such as using an unfocused item stem) on the psychometric characteristics of items and tests.
While several individual item flaws have been studied (negative stems,6 multiple true-false items,7 the “none of the above” option8), the cumulative effect of grouping flawed items together as scales measuring the same ability has not been investigated.
Method
A year-one basic science test was selected for study from among approximately 15 examinations administered near the end of a semester. (This test was selected, together with other tests, as part of a larger study.) The test consisted of 33 total MCQs, covering approximately six weeks of instruction in the discipline, and all test items sampled a single content category. Criteria for test selection were that some test items had to have item flaws and that the total score be sufficiently reliable to support decisions about individual students (r > .50).
Items were classified as either standard (without flaws) or flawed (containing one or more item flaws). If the item was classified as flawed, the type of rule violation was recorded. Three independent raters, blinded to item-performance data, classified the items using the standard principles of effective item writing as the universe of item-writing principles.4 Additionally, any unique item form that departed from standard practice was classified as flawed. There were few disagreements among the raters with regard to item classification; all disagreements were resolved by consensus discussion.
Absolute passing standards were established for this test by the faculty responsible for teaching this instructional unit using a modified Nedelsky method.9 Faculty established passing scores in the absence of item and test performance data.
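As a rough illustration of how a Nedelsky-type absolute standard is computed, the sketch below implements the classic Nedelsky procedure, not the modified variant used in this study, whose details are not described here. For each item, a judge counts the options a borderline examinee could rule out; the item's minimum pass level is the reciprocal of the options that remain, and the passing score is the mean of those values. The function names and the ratings are hypothetical.

```python
def nedelsky_item_mpl(n_options, n_eliminated):
    """Minimum pass level for one item: the chance that a borderline examinee
    guesses correctly among the options he or she cannot rule out."""
    remaining = n_options - n_eliminated
    if remaining < 1:
        raise ValueError("at least the keyed option must remain")
    return 1 / remaining


def nedelsky_passing_score(judgments):
    """Percent-correct passing score from (n_options, n_eliminated) pairs,
    one pair per item."""
    mpls = [nedelsky_item_mpl(n, e) for n, e in judgments]
    return 100 * sum(mpls) / len(mpls)


# Hypothetical ratings for a four-item test (not data from the study).
ratings = [(4, 2), (4, 3), (5, 1), (4, 0)]
cut = nedelsky_passing_score(ratings)
print(cut)  # 50.0
```

With several judges, one common choice is to average the judges' passing scores; that step is omitted here.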
Three separate scales were scored and item-analyzed: the total scale (all 33 items), the standard-item scale, and the flawed scale. Typical item-analysis data were computed for each scale: means, standard deviations, mean item difficulty, mean biserial discrimination indices, and Kuder-Richardson 20 reliability coefficients, together with the absolute passing score and the passing rate (proportion of students passing).
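The item-analysis quantities named above can be sketched in a few lines. The example below computes item difficulty (the p value, or proportion correct) and the Kuder-Richardson 20 coefficient for a matrix of 0/1 item scores. It assumes the population variance of total scores (some software uses the sample variance instead), and the toy response matrix is hypothetical, not data from the study.

```python
from statistics import pvariance


def item_difficulties(responses):
    """Proportion of examinees answering each item correctly (item p value)."""
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n_examinees
            for j in range(n_items)]


def kr20(responses):
    """Kuder-Richardson 20 reliability for dichotomously scored (0/1) items."""
    k = len(responses[0])
    p = item_difficulties(responses)
    total_scores = [sum(row) for row in responses]
    var_total = pvariance(total_scores)  # assumption: population variance
    sum_pq = sum(pj * (1 - pj) for pj in p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical toy data: 5 examinees (rows) by 4 items (columns).
toy = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
difficulties = item_difficulties(toy)
reliability = kr20(toy)
print(difficulties)           # [0.8, 0.6, 0.4, 0.2]
print(round(reliability, 2))  # 0.8
```

Scoring a subscale, such as the standard-item or flawed-item scale, amounts to passing only that subscale's columns to the same functions.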
Results
Eleven of the 33 items (33%) were classified as flawed. Five items used unfocused item stems (e.g., “Which of the following statements is true?”). Such item stems are almost always followed by a heterogeneous option set from which the examinee must choose the single correct statement, much like a multiple true-false item. The other flawed item forms were “none of the above” (three items, one also with a negative stem), “all of the above” (two items), and “partial K-type” (one item; a combination complex item form similar to the K-type item).
The total scale reliability was .71; the mean item difficulty was .68, and the mean biserial discrimination index was .33. The absolute passing score (57%) passed 81% of the n = 198 students taking the test.
Comparing the standard (22 items) and the flawed (11 items) scales, the observed K-R 20 reliabilities were .62 and .44, respectively. When the shorter flawed scale was adjusted to 22 items with the Spearman-Brown formula, its estimated reliability was .61, essentially the same as that of the standard scale. The standard-scale mean p value was .70; the flawed-scale mean p value was .63 (t(197) = 6.274, p < .0001). The standard-scale items were slightly more discriminating than the flawed items, r_bis = .34 versus .30 (using the total test score as criterion). The flawed and the standard scales were correlated at r = .52 (p < .0001).
The absolute passing score for the standard scale was 57%; the passing score for the flawed scale was 55%. The passing rates were 82% for the standard scale and 60% for the flawed scale (t(197) = 5.752, p < .0001).
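The Spearman-Brown length adjustment reported above can be checked directly. The sketch below applies the standard prophecy formula, r' = kr / (1 + (k - 1)r), to the flawed scale's observed reliability of .44 with a length factor of k = 22/11 = 2; the function and variable names are illustrative only.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor
    with items comparable to the existing ones."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


flawed_kr20 = 0.44       # observed K-R 20 of the 11-item flawed scale
length_factor = 22 / 11  # lengthen from 11 to 22 items
projected = spearman_brown(flawed_kr20, length_factor)
print(round(projected, 2))  # 0.61
```

The projection is 0.88 / 1.44, or about .61, which matches the reported estimate and sits just below the standard scale's observed .62.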
An additional passing-rate analysis was carried out to evaluate the (theoretical) effect of various passing score levels on passing rates, for both the standard and the flawed scales. Passing rates for the standard and flawed scales are presented in Table 1 for different levels of passing scores, which were arbitrarily selected to represent various increments of the mean and standard deviation of the total scale distribution. The range of differences in passing rates is 10 to 25 percentage points, for the standard versus the flawed scale. All differences are statistically significant.
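One way to carry out this kind of passing-rate comparison is sketched below. It assumes that an examinee passes when his or her percent-correct score is at or above the cut score (the study does not state how ties at the cut were handled), and the score lists are hypothetical toy values rather than the study's data or the contents of Table 1.

```python
def passing_rate(percent_scores, cut_score):
    """Proportion of examinees scoring at or above the cut score."""
    return sum(s >= cut_score for s in percent_scores) / len(percent_scores)


def passing_rate_table(standard_scores, flawed_scores, cut_scores):
    """One row per cut score: (cut, standard rate, flawed rate, difference)."""
    rows = []
    for cut in cut_scores:
        std = passing_rate(standard_scores, cut)
        flw = passing_rate(flawed_scores, cut)
        rows.append((cut, std, flw, std - flw))
    return rows


# Hypothetical percent-correct scores for the same 8 examinees on each scale.
standard = [90, 80, 75, 70, 65, 60, 55, 50]
flawed = [85, 70, 72, 60, 55, 58, 45, 40]

table = passing_rate_table(standard, flawed, [55, 65, 75])
for cut, std, flw, diff in table:
    print(f"cut {cut}%: standard {std:.3f}, flawed {flw:.3f}, difference {diff:.3f}")
```

In this toy example the flawed scale passes fewer examinees at every cut score, which is the pattern the analysis is designed to detect.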
Discussion and Conclusions
The generalizability of this study is limited by its non-experimental design and the possibility of selection bias. Conclusions may be further limited by the atypical nature of the particular examination chosen for study: a short basic science test with a fairly high proportion of flawed item forms.
One third of the questions in this test have at least one item flaw. This finding is somewhat surprising given the research attention paid to objective item writing, frequent faculty development workshops on the topic, and the availability of high-quality training materials.5
In this study, flawed item forms are seven percentage points more difficult than items measuring the same content using standard item forms; this difference represents nearly half a standard deviation. Consonant with the increased difficulty of flawed items, the flawed-item scale fails 22 percentage points more students than the standard-item scale (passing rates of 60% versus 82%), a difference of nearly one fourth of all examinees, despite a two-percentage-point lower passing score for the flawed-item scale.
When various passing score levels are arbitrarily established, the differences in passing rates range from 10 to 25 percentage points, with the flawed items always failing more students. These results could be a proxy for a graded curriculum, showing the effect of flawed items on grades that are assigned to different levels of test performance.
The increased test and item difficulty associated with the use of flawed item forms is an example of CIV, because poorly crafted test questions add artificial difficulty to the test scores. This CIV interferes with the accurate and meaningful interpretation of test scores and negatively impacts students' passing rates, particularly for passing scores at or just above the mean of the test score distribution. Thus, flawed test items reduce the construct-validity evidence for the assessment.
Further research is needed to establish the generalizability of these results. Experimental studies in which flawed and standard items (designed to test the identical content) are randomly assigned to examinees are desirable from a research-design perspective, but the ethics of such a study are questionable. Replications of this non-experimental study with different populations of students and different content domains are needed.
The use of test questions that violate the standard principles of effective item writing has an impact—a negative impact—on tests and students. The results of this study suggest that medical educators should consider increasing their faculty development efforts with respect to teaching item-writing skills and provide medical school faculty with the resources and support needed to eliminate flawed MCQs from tests. Successful faculty development efforts will emphasize effective training, practice, and feedback to item authors, the use of students' item-analysis data as feedback to item authors, and the continued follow-up and involvement of the faculty development staff with item authors.
References
1. American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association, 1999.
2. Messick S. Validity. In: Linn RL (ed). Educational Measurement. New York: American Council on Education and Macmillan, 1989:13–103.
3. Gronlund NE. Assessment of Student Achievement. Boston, MA: Allyn and Bacon, 1998.
4. Haladyna TM, Downing SM, Rodriguez MC. A review of multiple-choice item-writing guidelines. Appl Meas Educ. 2002;15:309–33.
5. Case SM, Swanson DB. Constructing Written Test Questions for the Basic and Clinical Sciences. Philadelphia, PA: National Board of Medical Examiners, 1998.
6. Jozefowicz RF, Koeppen BM, Case S, Galbraith R, Swanson D, Glew H. The quality of in-house medical school examinations. Acad Med. 2002;77:156–61.
7. Tamir P. Positive and negative multiple choice items: how difficult are they? Stud Educ Eval. 1993;19(3):311–25.
8. Downing SM, Baranowski RA, Grosso LJ, Norcini JJ. Item type and cognitive ability measured: the validity evidence for multiple true-false items in medical specialty certification. Appl Meas Educ. 1995;8:189–99.
9. Nedelsky L. Absolute grading standards for objective tests. Educ Psychol Meas. 1954;14:3–19.