Most medical students will tell you that few aspects of their undergraduate medical education are as stressful as the many periodic examinations that constitute the gauntlet they must run from entry into medical school until just before they receive their diplomas and recite the Hippocratic oath. These examinations not only determine whether a student passes a unit, course, or clerkship; at many medical schools they also determine class rank and influence the quality and strength of the dean's letter that goes out to directors of residency programs.
Usually two kinds of tests are used to assess medical students' knowledge and competency. The majority are in-house multiple-choice examinations written by the faculty who teach the courses; the remainder are extramural, national examinations prepared by discipline-based societies or licensing organizations. While much has been written about the advantages and disadvantages of multiple-choice questions [1–7], at the moment and probably for some time to come, this is the kind of question medical students will face on both in-house and national licensing examinations.
Commonly, every four to six weeks throughout the four years of medical school, a student will sit for a high-stakes one- or two-hour examination composed of 50 or more questions. Paradoxically, all too often these critical examinations are written and assembled at the last minute. Furthermore, by the time the questions have been collected from the half dozen or more instructors who taught the course, there is inadequate opportunity to review the submitted questions or evaluate the examination for balance and overall quality. In addition, while most faculty who lecture medical students would argue that they write reasonably good exam questions, the truth is that few of them have benefited from formal training in the art of question writing. Often, the result is a heterogeneous collection of single-best-answer, fill-in-the-blank, and true-or-false questions of highly variable quality. Thus, while many faculty take seriously their formal lecturing duties and invest considerable thought in the preparation of lectures, it has been our experience that few commit commensurate effort to the task of preparing examination questions on the material they teach. The end result is that many excellent teachers assess their students using examinations of questionable quality. This problem is exacerbated by the fact that instructors often team-teach the particular section of the curriculum upon which the medical students are examined.
The literature contains few studies that evaluate the quality of in-house basic science examinations at medical schools. Also lacking are studies of the effectiveness of development programs aimed at improving the quality of multiple-choice examinations at medical schools. Our central hypothesis was that in-house examinations currently being administered to U.S. medical students are of variable and generally low quality. To test this hypothesis, we gathered nine examinations from three U.S. medical schools and subjected the 555 questions to blinded quality assessment by three experts in educational measurement in medicine, each of whom had extensive experience in item review for Steps 1 and 2 of the United States Medical Licensing Examination (USMLE) and the National Board of Medical Examiners' (NBME's) Subject Examinations (shelf examinations) program. This paper reports their ratings of the quality of the items on these nine in-house examinations.
In 1998, nine basic science examinations from three different medical schools were gathered, for a total of 555 questions. Of these, 92 had been written by individuals who had been trained to write items for USMLE Step examinations.
The quality of each question was rated by three expert biomedical test developers with extensive experience reviewing items for USMLE Steps 1 and 2. Each rated the items independently, and they were not informed of the purpose of the study. In addition, they did not have knowledge of the medical schools where the questions had originated or the item writers' identities or exam-writing training.
The rating scheme was developed to reflect accepted item-writing principles [8–11] as well as the results of empirical studies of factors influencing item quality [12,13]. Items were rated on a five-point scale. A score of 5 was awarded to any item that included a vignette (either patient or laboratory), a one-best-answer format (i.e., not true-or-false), and no item flaw [11]. A score of 4 was given to an item that failed to completely satisfy one of these three conditions. The rating scheme required vignette items to be rated lower if they had significant item flaws, or if the vignette was not necessary to answer the question. If an item did not involve a vignette, it was awarded a maximum of 3; to receive this rating, the item had to be focused, contain no item flaw, and involve more than recall of an isolated fact. An item was rated 2 if it failed to satisfy one of the criteria for a 3. An item was rated 1 if it failed to satisfy two or more of the criteria for a 3. Because true-or-false items generally had item flaws, they were awarded low ratings (i.e., 1 or 2).
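The rating rules just described can be sketched as a small decision function. This is only one possible reading of the rubric, not the instrument the raters actually used: the field names are hypothetical, an unnecessary vignette is treated as an incompletely satisfied vignette condition, and the handling of vignette items with two or more problems is an assumption, since the text does not specify it.

```python
from dataclasses import dataclass


@dataclass
class ItemJudgment:
    """A reviewer's yes/no judgments about one item (hypothetical field names)."""
    has_vignette: bool      # patient or laboratory vignette present
    vignette_needed: bool   # the vignette is required to answer the question
    one_best_answer: bool   # single-best-answer format (not true-or-false)
    has_flaw: bool          # any recognized item-writing flaw
    focused: bool           # the question is focused
    beyond_recall: bool     # asks for more than recall of an isolated fact


def rate_item(item: ItemJudgment) -> int:
    """Return a rating from 1 to 5 under one reading of the rubric."""
    if item.has_vignette:
        misses = sum([not item.vignette_needed,
                      not item.one_best_answer,
                      item.has_flaw])
        if misses == 0:
            return 5
        if misses == 1:
            return 4
        # Assumption: the text does not say how a vignette item with two or
        # more problems is scored; here it falls through to the rules below.
    misses = sum([not item.focused, item.has_flaw, not item.beyond_recall])
    if misses == 0:
        return 3
    if misses == 1:
        return 2
    return 1


# Illustrative judgments (invented for this sketch, not taken from the study)
ideal_vignette = ItemJudgment(True, True, True, False, True, True)
padded_vignette = ItemJudgment(True, False, True, False, True, True)
focused_no_vignette = ItemJudgment(False, False, True, False, True, True)
flawed_recall = ItemJudgment(False, False, False, True, False, False)
```

Under these assumptions the four illustrative items rate 5, 4, 3, and 1, respectively.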
The average quality assessment score (QAS)—the mean of the three assessors' ratings—was then calculated for all test items, and for subsets of items based on school of origin and the authors' training in question writing. Examples of the items (quoted and unedited) and the QAS scores they were assigned are shown in the Appendix.
To evaluate inter-rater agreement, a random-effects items-by-raters analysis of variance was performed to obtain variance components, and a generalizability coefficient was calculated, with differences in rater stringency included in measurement error.
Variance components for items, raters, and error were 1.3174, 0.0641, and 0.4287, respectively. Pearson correlation coefficients between pairs of raters were .70, .78, and .79. The generalizability of the QAS for each item was .89, indicating a high level of inter-rater agreement for subsequent analyses.
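The reported coefficient can be reproduced from the variance components. The sketch below assumes the standard formula for the generalizability of a mean over n raters in a crossed items-by-raters design, with the rater component counted as error (as the text states); the variable and function names are ours.

```python
# Variance components reported in the text
var_items, var_raters, var_error = 1.3174, 0.0641, 0.4287


def g_with_stringency(n_raters: int) -> float:
    """Generalizability of the mean of n ratings, rater stringency counted as error."""
    return var_items / (var_items + (var_raters + var_error) / n_raters)


def g_without_stringency(n_raters: int) -> float:
    """Same coefficient when rater stringency is left out of the error term."""
    return var_items / (var_items + var_error / n_raters)


g_mean_of_three = g_with_stringency(3)      # about 0.89, matching the text
g_single_rater = g_without_stringency(1)    # about 0.75
```

The single-rater value that ignores stringency (about .75) is in line with the pairwise Pearson correlations of .70 to .79, which are likewise insensitive to differences in rater stringency.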
As shown in Table 1, the mean QAS for all 555 questions was 2.39 ± 1.21. The quality of the examinations varied among the three schools: the 222 questions from School A had a mean QAS of 1.94 ± 0.90, the 180 questions from School B had a mean QAS of 3.26 ± 1.28, and the 153 questions from School C had a mean QAS of 2.03 ± 0.94. The significantly higher mean QAS for School B likely reflects the fact that 44% of the questions submitted by School B had been written by an NBME-trained item writer.
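As a consistency check, the overall mean and standard deviation can be recovered from the three school-level summaries. The sketch assumes the reported ± values are sample standard deviations (n − 1 denominator); the helper name is ours.

```python
import math

# (number of items, mean QAS, SD) for each school, as reported in the text
schools = {"A": (222, 1.94, 0.90), "B": (180, 3.26, 1.28), "C": (153, 2.03, 0.94)}


def pool(groups):
    """Combine group summaries into an overall n, mean, and sample SD."""
    groups = list(groups)
    n_total = sum(n for n, _, _ in groups)
    mean = sum(n * m for n, m, _ in groups) / n_total
    # Total sum of squares = within-group part + between-group part
    within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    between = sum(n * (m - mean) ** 2 for n, m, _ in groups)
    return n_total, mean, math.sqrt((within + between) / (n_total - 1))


n_total, pooled_mean, pooled_sd = pool(schools.values())
```

Rounded to two decimals, the pooled values agree with the overall 2.39 ± 1.21 reported for the 555 items.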
When QAS scores for the 555 items were analyzed according to whether the item writers had been trained by the NBME, the 92 questions written by trained writers had a mean QAS of 4.24 ± 0.85, compared with a mean of 2.03 ± 0.90 for the 463 questions drafted by faculty without NBME training. The difference between these two means was highly significant (p < .001).
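The size of this difference can be illustrated from the summary statistics alone. The text does not say which significance test was used; the sketch below assumes Welch's unequal-variance t statistic and sample standard deviations, and is meant only to show the order of magnitude.

```python
import math

# Summary statistics reported in the text: (n, mean QAS, SD)
trained = (92, 4.24, 0.85)
untrained = (463, 2.03, 0.90)


def welch_t(group1, group2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, m1, s1 = group1
    n2, m2, s2 = group2
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2   # squared standard errors
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


t_stat, dof = welch_t(trained, untrained)
```

Under these assumptions t is roughly 22.6 on about 135 degrees of freedom, far beyond the two-sided critical value of about 3.3 to 3.4 for p = .001, which is consistent with the reported p < .001.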
The overall distribution of QAS for all 555 questions gathered from the three medical schools is shown in Figure 1. The distribution curves for the three schools and for NBME-trained and non-NBME-trained exam writers are shown in Figure 2.
A well-written examination reflects positively on a course. It demonstrates to the students that the course's director and faculty take pride in all aspects of the course. Students perceive, sometimes inappropriately, that material asked on an examination is important. Therefore, examinations should be carefully constructed to emphasize important points.
Our results provide evidence that the quality of in-house medical school examinations is substantially different from that of standardized national examinations. We believe the results probably generalize beyond the three schools and nine examinations in our study to course examinations at many medical schools. We are unaware, however, of other studies that may have assessed the quality of multiple-choice questions at other medical schools in the United States, or of the outcomes of programs intended to improve the quality of the writing skills of faculty for multiple-choice questions.
Several factors may underlie the relatively variable quality of the in-house examinations. First, the faculty may have devoted insufficient time to preparing examination questions. This contrasts with the extensive preparation time faculty usually expend researching and developing the lectures on which the questions are based. Second, there may be little or no peer review (individually or by committee) of questions submitted for an examination. Faculty usually avoid this step because of time pressures and their reluctance to criticize their colleagues, especially when the individual submitting the questions is considered an expert on the topic. Third, few formal faculty development activities are directed at question writing and examination preparation. As with many expectations of faculty in academic medicine, the assumption is that all possess this skill set, which may not be the case. Fourth, faculty may not agree on uniform standards and formats for questions, or may not have developed question-writing guidelines to assist in this process. Finally, examinations are usually assembled at the last minute, which constrains the faculty charged with assembling the examination and precludes developing and using strategies that could improve the overall quality of the examinations.
A limitation of our study is that the 555 questions were assessed by raters who were NBME staff, and the item writers whose questions received high QAS ratings were ones trained by these staff. Although the NBME-staff raters were blinded to the sources of the items, a reasonable argument could be made that we have simply shown that the raters recognized items prepared in accord with NBME training. One way to address this criticism would be to have the same pool of examination questions evaluated by raters who are neither NBME staff nor NBME-trained. For example, the questions could be assessed by faculty at the three medical schools (or other U.S. medical schools) who are highly regarded for the quality of the examinations they write, but who were not trained by NBME staff. While the results of our study do suggest that providing training to examination writers improves the quality of the multiple-choice questions they write, our study did not specifically assess the effects such a training program might have on the test development skills of a particular group of exam writers. We acknowledge that faculty who have not been NBME-trained may produce multiple-choice questions of high quality and that other kinds of training may be beneficial in this regard.
Another pertinent question raised by this study is whether the quality of the examination questions matters. That is, do the best students do well regardless of the quality of the examination's questions? We will address this question in a future study.
Since modes of instruction vary across medical schools, with some institutions emphasizing traditional basic science disciplines in the first two years and others integrating the teaching of basic science and clinical material from the outset of the curriculum, some medical educators might argue that vignette-type questions are inappropriate for assessing students' performances in courses presented in the first one or two semesters and that there is a place for recall- or identification-type questions in the first few semesters. While we concede that it is possible to write multiple-choice questions that perform well from the standpoint of discriminating students' performances in a course or block of lectures, we nevertheless maintain that testing knowledge through contextual, vignette, or problem-solving questions that require reasoning skills is preferable to testing the recall of isolated facts. We therefore acknowledge that if in the present study we had not used a rating system that downgraded questions that lacked a vignette, the results of the assessments of the questions might well have turned out differently. Furthermore, we concede that in some settings, particularly where the medical curriculum is one in which core basic science subject matter is taught in the first two years, it may be appropriate to write multiple-choice questions that consist of a mix of vignette and recall-type items.
What can be done to improve the quality of “in-house” medical school examinations? First, faculty responsible for writing examinations should be trained; writing good examination questions is a skill that can be learned. Second, at the outset of the examination writing process, all participants should agree on the format and guidelines for writing items. For example, those types of items that involve “not” and “except” formats should be disallowed. Third, the examination should be prepared well in advance (e.g., two to three weeks) of the date the students will sit for the examination. Fourth, a committee should review, critique, and approve the content and format of the final draft of the examination.
1. Frederiksen N. The real test bias: influences of testing on teaching and learning. Am Psychologist. 1984;39:193–202.
2. Van der Vleuten CPM, Norman GR, De Graaff E. Pitfalls in the pursuit of objectivity: issues of reliability. Med Educ. 1991;25:110–8.
3. Godfrey RC. Undergraduate examinations—a continuous tyranny. Lancet. 1995;345:765–7.
4. Renger R, Meadows LM. Testing for predictive validity in health care education research: a critical review. Acad Med. 1994;69:685–7.
5. Barth P, Mitchell R. Smart Start: Elementary Education for the 21st Century. Golden, CO: North American Press, 1992.
6. Swanson DB, Case SM. Trends in written assessment: a strangely biased perspective. In: Harden R, Hart I, Mulholland H (eds). Approaches to the Assessment of Clinical Competence: Part 1. Norwich, U.K.: Page Brothers, 1992:38–53.
7. Simmons W, Resnick L. Assessment as the catalyst of school reform. Educ Leadership. 1993;50(5):11–5.
8. Haladyna TM, Downing SM. A taxonomy of multiple-choice item-writing rules. Appl Meas Educ. 1989;2:37–50.
9. Haladyna TM, Downing SM. Validity of a taxonomy of multiple-choice item-writing rules. Appl Meas Educ. 1989;2:52–78.
10. Swanson DB, Case SM. Assessment in basic science instruction: directions for practice and research. Adv Health Sci Educ. 1997;2:71–84.
11. Case SM, Swanson DB. Constructing Written Test Questions for the Basic and Clinical Sciences. 2nd ed. Philadelphia, PA: National Board of Medical Examiners, 1998.
12. Case SM, Downing SM. Performance of various multiple-choice item types on medical specialty examinations: types A, B, C, K, and X. Proceedings of the 28th Annual Conference on Research in Medical Education, October 1989. Washington, DC: Association of American Medical Colleges, 1989:167–72.
13. Swanson DB, Case SM. Variation in item writing difficulty and discrimination by item format on Part I (basic sciences) and Part II (clinical sciences) of U.S. licensing examinations. In: Rothman A, Cohen R (eds). Proceedings of the Sixth Ottawa Conference on Medical Education, Toronto, ON, Canada: University of Toronto Bookstore Custom Publishing, 1995:285–7.
Examples of In-house Examination Questions (Quoted and Unedited) from Three U.S. Medical Schools
Questions 1–3 are examples of questions that received QAS scores of 4.0–5.0; questions 4–6 are examples of questions that scored in the intermediate range (QAS 3.0–4.0), and questions 7–9 are examples of questions that scored poorly (QAS 1.0–2.0). The mean QAS score for each question is indicated in the parentheses and the correct answer is marked by an asterisk.
Question 1 (QAS = 4.3)
A cell has channels for Na+, K+, and Cl− in its plasma membrane. The resting membrane potential is −60 mV (cell interior negative). The intracellular and extracellular concentrations for these ions are (note: the values of Ei reflect the convention of placing the extracellular ion concentration in the numerator of the Nernst equation):
- A. Net Cl− movement out of the cell will be increased
- B. Net Cl− movement into the cell will be increased
- *C. There will be no change in the net movement of Cl−
Question 2 (QAS = 4.7)
A patient is being mechanically ventilated (on a respirator). His arterial blood gas and serum electrolyte analysis shows:
The acid-base abnormality is:
- A. Acute (uncompensated) respiratory acidosis
- B. Acute (uncompensated) respiratory alkalosis
- C. Acute (uncompensated) metabolic acidosis
- *D. Acute (uncompensated) metabolic alkalosis
Question 3 (QAS = 5.0)
A middle-aged male complains of difficulty climbing stairs. He describes weakness without pain in his right lower limb. He is able to place his right leg on each step without experiencing any problem, but has difficulty climbing the step, and must grasp the handrail to pull himself up. Climbing the next step with his left leg occurs normally. You also notice that his gait on a flat surface appears nearly normal, there is no weakness in extending the right knee against a considerable load. You suspect damage and/or malfunction in the:
- A. Obturator nerve
- B. Tibial nerve
- C. Superior gluteal nerve
- D. Femoral nerve
- *E. Inferior gluteal nerve
Question 4 (QAS = 3.0)
The concentration of glucose in the plasma is 100 mg/100 ml. GFR is 120 ml/min. How much glucose is filtered each minute?
- 1. 100 mg
- 2. 1200 mg
- *3. 120 mg
- 4. 1.2 mg
- 5. 1 mg
Question 5 (QAS = 3.7)
A patient presents with megaloblastic anemia. Serum vitamin B12 levels are below normal although a dietary assessment indicates the dietary intake of vitamin B12 is adequate. The Schilling test indicates normal production of intrinsic factor. Which one of the following conditions could contribute to vitamin B12 deficiency in this patient?
- A. Excess acid production in the stomach
- B. Consumption of a high-fiber diet
- C. Excess intake of folic acid
- *D. Pancreatic insufficiency
- E. Excess excretion of vitamin B12 in the urine
Question 6 (QAS = 3.0)
An 18-year-old girl awakens at 7:50 A.M. for her 8:00 A.M. Monday class, dresses quickly, and starts running across the quadrangle to the Biology Building. She feels shaky and the next thing she knows she is lying on the path with two students standing over her telling her that an ambulance is on the way. She is brought to the emergency department but no students accompany her to describe what they saw. Past history is negative and there is no family history of seizures. Which of the following questions will LEAST likely help you determine what happened to her?
- 1. Does she experience jerkiness when she awakens in the morning?
- *2. Did she use marijuana the night before?
- 3. How much sleep did she get over the past weekend?
- 4. Was she drinking alcohol the night before?
- 5. Has she been having recent headaches?
Question 7 (QAS = 1.0)
The glycolytic conversation of glucose to lactate: (single best answer).
- A. Generates a net gain of two NADH's for each glucose consumed
- B. Requires the direct participation of molecular oxygen
- C. Cannot take place in cells lacking mitochondria
- *D. Is stimulated by a high intracellular concentration of fructose-2,6-bisphosphate
- E. Involves a single dehydrogenase
Question 8 (QAS = 1.3)
Louder sound causes:
- A. Increased amplitude of action potential
- B. Increased frequency of action potential
- C. Increased receptor potential amplitude
- D. Recruitment of different hair cells
- *E. All of the above
Question 9 (QAS = 1.0)
In most cases of sensorineural (inner ear damage) hearing loss:
- A. Hearing improves with time
- B. Surgery can correct the loss
- C. Removal of the cochlea is recommended
- D. High-frequency hearing is lost before low-frequency hearing
- *E. Amplifying incoming sounds will correct all perceptual problems
© 2002 Association of American Medical Colleges