Purpose: To evaluate the reliability, efficiency, and cost of administering open-ended test questions by computer.
Methods: A total of 1,194 students in groups of approximately 30 were tested at the end of a required surgical clerkship from 1993 through 1998. During the academic years 1993–94 and 1994–95, the administration of open-ended test questions by computer was compared experimentally with administration by paper-and-pencil. The paper-and-pencil mode of the test was discontinued in 1995, and the administration of the test by computer was evaluated for all students through 1998. Computerized item analysis of responses was added to the students' post-examination review session in 1996.
Results: There was no significant difference in the performances of 440 students (1993–94 and 1994–95) on the different modes of test administration. Alpha reliability estimates were comparable. Most students preferred the computer administration, which the faculty judged to be efficient and cost-effective. The immediate availability of item-analysis data strengthened the post-examination review sessions.
Conclusion: Routine administration of open-ended test questions by computer is practical, and it enables faculty to provide feedback to students immediately after the examination.
Dr. Wolfson is professor of surgery, Department of Surgery, Jefferson Medical College of Thomas Jefferson University, Philadelphia, Pennsylvania. Mr. Veloski is director, medical education research, Mrs. Robeson is project coordinator, and Mrs. Maxwell is research coordinator, all at the Center for Research in Medical Education and Health Care, Jefferson Medical College.
Correspondence and requests for reprints should be addressed to Dr. Wolfson, Jefferson Medical College, 1025 Walnut Street, Suite 604 College Building, Philadelphia, PA 19107-5083; telephone: (215) 955-6879; fax: (302) 651-5990; e-mail: <pwolfson@NEMOURS.ORG>.
Previous studies have shown that computerized tests offer greater security than do paper-and-pencil tests, and that it is practical to administer computerized multiple-choice questions (MCQs) to large groups of examinees.1,2 Computers also present new opportunities to examiners. For example, computerized adaptive testing streamlines testing and enhances test security by making it possible to present different sets of test items to individual examinees.3 Computers allow items to be selected from a pool on the basis of an estimate of each examinee's ability, and they allow examiners to reevaluate test-item formats that large testing organizations have not emphasized because those formats were too expensive to use with large groups of examinees in paper-and-pencil form.4 The open-ended short-answer format is one approach that, although largely driven out of use in paper-and-pencil form by more economical MCQs, can be administered by computer and may warrant reevaluation.
The literature describes three important advantages of open-ended, short-answer test questions in relation to MCQs. The first is the opportunity to enhance content validity. While MCQs can measure certain content very efficiently,5 it has been argued that they limit the range of abilities that can be assessed,6 and studies have shown that MCQs distort the behaviors of faculty and students.7 The second is greater measurement precision: offering examinees more than five choices has been shown to improve test reliability.8–11 The third is face validity. The credibility of licensure and certification examinations in the eyes of the public and the profession is acutely important in medicine.12 Open-ended test questions often appear to represent more faithfully the real situations encountered by physicians in the clinical environment, where short lists of five choices are not available. Therefore, it has been argued that open-ended questions are more likely to evoke the kinds of behavior to be expected of the examinees when they are physicians making clinical decisions.13
One type of open-ended, short-answer format, referred to as the uncued (UnQ) format, has been studied for nearly a decade.14–18 Examinees answer open-ended test questions in the paper-and-pencil mode by locating their responses in a structured list of thousands of one- or two-word responses. This comprehensive reference list is assembled by alphabetizing and cross-referencing plausible responses in multiple domains that include diagnoses, disease manifestations, procedures, drugs, and laboratory tests, together with appropriate synonyms and acronyms. A unique code number is assigned to each term. An examinee answers each test question by locating a preferred response in the alphabetic list, then entering the corresponding code number on an answer sheet, which is subsequently read by an optical mark scanner and scored by computer.
The present study was designed to evaluate the computer administration of this open-ended format. While previous studies suggested that students' performances would be unaffected by the administration of the test by computer, we were unsure how computer administration would affect the reliability of test scores, the amount of time needed for test administration, students' satisfaction, and cost.
Every six weeks since 1990, a comprehensive final examination of 100 open-ended test questions has been administered to students at Jefferson Medical College at the end of the general surgery clerkship rotation. A new examination is constructed every six weeks by randomly selecting questions from a pool of approximately 700 questions using predefined content specifications in 14 categories. Although a different set of 100 test questions is selected for each teaching block, the examinations have produced comparable mean scores across blocks of students.
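The assembly of each examination can be sketched as a quota-based random draw from a category-keyed pool. The function name, category labels, and quota sizes below are our own illustration, not the actual selection software:

```python
import random

def build_exam(pool, quotas, seed=None):
    """Assemble an exam by randomly sampling questions per content category.

    pool:   dict mapping category name -> list of question ids
    quotas: dict mapping category name -> number of questions to draw
    (Category names and quota sizes are hypothetical.)
    """
    rng = random.Random(seed)
    exam = []
    for category, n in quotas.items():
        # Draw n distinct questions from this category's sub-pool.
        exam.extend(rng.sample(pool[category], n))
    return exam

# Hypothetical usage: two categories standing in for the 14 described above.
pool = {"trauma": list(range(10)), "oncology": list(range(10, 20))}
exam = build_exam(pool, {"trauma": 3, "oncology": 2}, seed=1)
```

Fixing the per-category quotas is what keeps the content specifications, and hence the mean scores, comparable across teaching blocks even though the specific questions differ.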
An interactive test-administration computer program to support the UnQ format was developed by three of the authors (JJV, MRR, and KSM) with the support of a professional computer programming consultant. After reading each test question, the student strikes the keys for the first two letters of a preferred response. The program locates the closest matching term and displays it, followed by the 18 succeeding terms in the alphabetic list. The student may either continue typing a response and narrow the selection or use the mouse to move the cursor on the computer screen to pinpoint the preferred response. The student confirms a response by pressing the “Enter” key or by double clicking the mouse. The computer stores the code number for that response. Throughout the examination the student may skip or tag particular questions for later review, go back to previous questions, or change responses. A student who is unable to locate a preferred response on the reference list may record a written response for independent evaluation by the faculty. On average, the latter option accounts for less than 1.7% of all students' responses.
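The incremental lookup the program performs amounts to a prefix search over the sorted reference list, returning the closest match plus the next 18 terms (a window of 19). This is a minimal sketch under that assumption; the function name and sample terms are hypothetical, not the actual program:

```python
import bisect

def lookup(prefix, terms, window=19):
    """Return the closest alphabetic match to `prefix` plus the terms
    that follow it, mimicking the on-screen pick list of 19 entries."""
    terms = sorted(terms)
    # bisect_left finds the first term >= prefix, i.e., the closest match.
    i = bisect.bisect_left(terms, prefix.lower())
    return terms[i:i + window]

# Hypothetical usage: typing "ch" scrolls the list to cholecystitis.
terms = ["appendectomy", "cholecystitis", "hernia", "laparotomy"]
matches = lookup("ch", terms)
```

Each additional keystroke simply repeats the search with a longer prefix, which is why typing more letters narrows the selection as described above.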
Eight students participated in a pilot study involving a 100-question UnQ examination administered by computer in early 1993. Subsequently, 440 third-year students in 15 teaching blocks of the surgery clerkship were entered into the study during the 1993–94 and 1994–95 academic years. The final examination for each block was randomly divided into forms A and B, each consisting of 50 items. Each form was formatted for administration by either computer or paper-and-pencil. The students were randomly assigned to two groups in a crossover design that exposed each student to both modes, one for each form. Group 1 answered form A using the computer and form B using paper-and-pencil; Group 2 did the reverse, answering form B using the computer and form A using paper-and-pencil. The time required for each student to complete each component of the examination was recorded for two blocks. Immediately after each examination, the student completed a written evaluation of his or her preference and any suggestions for improving the computer test.
Because the results were positive, the experimental comparison between computer and paper-and-pencil examinations ended with the final block of 1994–95, after which the test was administered exclusively by computer. A procedure was developed to process all students' responses at the end of the test, which made it possible to prepare a key-validation item analysis for the post-examination review session conducted half an hour after the end of the testing session.
There was no significant difference between the students' mean scores with the two modes of test administration. For example, for the 440 students who were administered half of the examination on paper and half on computer in 1993–94 and 1994–95, the overall mean score on the paper examinations was 83.69 (SD = 8.18). The overall mean for the same students on the computer-administered examination was 83.28 (SD = 8.21). A paired t-test showed no significant difference between the two means (t = 0.93). The alpha reliabilities for the 50-item tests administered during the two-year period ranged from .60 to .80, but did not differ significantly by mode of administration.
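The paired t statistic reported above can be computed from each student's two half-test scores. The following is a minimal sketch using only the standard formula for paired differences; the function name and the scores in the example are hypothetical:

```python
import math

def paired_t(scores_a, scores_b):
    """Paired t statistic for two score lists from the same examinees:
    t = mean(d) / (sd(d) / sqrt(n)), where d are the paired differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Sample variance of the differences (n - 1 in the denominator).
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    return mean_d / math.sqrt(var_d / n)

# Hypothetical usage with three students' paper vs. computer half-scores.
t = paired_t([85, 90, 80], [84, 88, 81])
```

With 439 degrees of freedom, a t of 0.93 as reported above falls well short of conventional significance thresholds, consistent with the conclusion of no mode effect.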
The percentages of the students in two blocks (n = 60) who remained in the room at five-minute intervals during the 90-minute testing period (for the partial test of 50 questions) are shown in Figure 1. Although there was little difference between the two modes' completion rates during the first hour, all of the students using the computer finished the test within 75 minutes, whereas about 40% of the examinees using paper-and-pencil were still working at 75 minutes. Observations in subsequent years confirmed that 95% of students were able to complete the entire 100-question test by computer in 150 minutes or less.
Of 400 students who returned the evaluation forms during the first two years of the study (91% response rate), 285 (71%) preferred the computer administration (p < .001). Their reasons included accuracy, ease of use, novelty, and speed. On the other hand, those who preferred the paper-and-pencil format cited familiarity and the opportunity to browse the long list of options, which was impractical on the computer. Interestingly, some students found the clicking of computer keyboards objectionable, while others found the shuffling of papers more distracting.
The additional cost of administering the test by computer included the cost of development and maintenance of the computer program, which was about $1,000 per year amortized over the five years of the study. The annual cost of using the test-administration facility was approximately $5,500, based on an estimated cost of about $25 per student per session. One additional proctor was required for the eight half-day test sessions, at a cost of about $700 per year.
Table 1 summarizes the item-analysis data for one representative test question administered to three groups of students (n = 97) over three different years. This example demonstrates how the open-ended format accommodates multiple correct responses and also evokes a wider array of incorrect responses than does a five-choice MCQ. During the post-examination review session, the wide range of responses evoked by the UnQ format provides an empirical framework for discussing cost-effectiveness, safety, and practice variation.
Although the marriage of the computer and open-ended test questions is not new,19,20 our study addressed not only the technical feasibility of using the computer, but also its impact on important psychometric characteristics, testing time, students' acceptance, and operating cost over a five-year period. The early years of the study, which included a two-year experimental comparison of computer with paper-and-pencil examinations with over 400 subjects, revealed no effect on students' mean performance or test reliability. Computer administration was faster, students' acceptance was very high, and the cost was small in relation to the scale of other annual costs at the medical school. Over 600 students were examined by computer for three more years, and the data corroborated our findings. Although a separate evaluation form was not used to monitor these students' opinions about computerized examinations during the latter three years of the study, no systematic complaints were recorded on the standard clerkship evaluation form, which provides space for general comments. In summary, the five-year study yielded positive findings with no measurable negative effect.
There was an important, unanticipated benefit of administering the test by computer. It had been a longstanding custom for the surgical clerkship coordinator (PJW) to review each test question with the students soon after the testing session.21 The computer made it possible to compile a detailed item analysis, including a frequency count of student responses to each test question, soon after the last student had completed the test. By using a projector connected to the computer, the detailed item analysis for each test question was shown to the audience of students during the review sessions. This provided immediate feedback to each student about the credibility of his or her responses in relation to the overall pattern of the group. It also provided an empirical base in a forum where students could inquire about incorrect responses or, in some cases, challenge the keyed (correct) responses. The review session became an extension of the clerkship and a valuable learning experience for the students as well as for the faculty. As individual test items were discussed and all responses scrutinized, ambiguous or poorly worded items were identified and either deleted or revised before they were used in subsequent examinations. This continuous validation of the database assured that items would be consistent with current practices.
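The frequency count underlying the projected item analysis amounts to tallying the coded responses to each question. A minimal sketch, with hypothetical function name and response codes:

```python
from collections import Counter

def item_analysis(responses):
    """Tally the coded responses to one test question and return
    (code, count) pairs sorted from most to least frequent."""
    return Counter(responses).most_common()

# Hypothetical usage: six students' response codes for one question.
counts = item_analysis([101, 101, 204, 101, 307, 204])
```

Sorting by frequency is what lets the review audience see at a glance whether an unexpected response was an isolated slip or a pattern worth discussing, and whether a keyed answer should be challenged.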
It is important to keep in mind that this study was not designed to compare different item formats. The faculty in the department of surgery selected the UnQ format a decade ago, not only because it strengthened the content validity of their own examination, but also because this format complemented the MCQ tests that were being so widely used in other clerkship examinations. However, the relative advantages and disadvantages of different item formats cannot be overemphasized. These have been documented empirically in studies of paper-and-pencil test administration.5,8,14 Nevertheless, future studies of administering tests by computer might address the costs and benefits of different formats. For example, one limitation of earlier uses of the UnQ pencil-and-paper format had been that the students needed more time to locate their responses because the alphabetic reference list had been expanded to accommodate additions to the question pool. The computer streamlined this process, but it may have different effects on other formats.
The results of this study underscore the opportunity that administering tests by computer offers to medical school faculty responsible for individual courses or clerkships. While faculty may not always have access to the resources needed to develop sophisticated computer-adaptive tests or complex patient simulations, their options for computer-administered tests are not necessarily limited to MCQs. Test-question formats such as the open-ended format can offer certain advantages over MCQs for a modest cost.
Our study showed that the computerized administration significantly enhanced the quality of the clerkship final examination. The open-ended format, which can be administered more effectively by computer than by paper-and-pencil, assured the faculty that students were not answering questions by sight recognition or random guessing. Furthermore, administering the test by computer made it possible to prepare a key-validation item analysis at the end of the test and review these data with the students to enhance the educational value of the final clerkship examination.
1. Luecht RM, Hadadi A, Swanson DB, Case SM. A comparative study of a comprehensive basic sciences test using paper-and-pencil and computerized formats. Acad Med. 1998;73(10 suppl):S51–S53.
2. Clyman SG, Orr NA. Status report on the NBME's computer-based testing. Acad Med. 1990;65:235–41.
3. Luecht RM, Nungester RJ. Some practical examples of computer-adaptive sequential testing. J Educ Meas. 1998;35:229–49.
4. Norman GR, Swanson DB, Case SM. Conceptual and methodological issues in studies comparing assessment formats. Teach Learn Med. 1996;8:208–16.
5. Norcini JJ. Reliability, validity, and efficiency of multiple choice question and patient management problem item formats in assessment of clinical competence. Med Educ. 1985;19:238–47.
6. Elstein A. Beyond multiple-choice questions and essays: the need for a new way to assess clinical competence. Acad Med. 1993;68:244–9.
7. Frederiksen N. The real test bias—influences of testing on teaching and learning. Am Psychologist. 1984;39:193–202.
8. Case SM, Swanson DB. Extended-matching items. A practical alternative to free-response questions. Teach Learn Med. 1993;5:107–15.
9. Fenderson BA, Damjanov I, Maxwell K, Veloski JJ, Rubin E. Teaching and evaluation based on keywords and extended matching questions. Pathol Educ. 1999;24:17–29.
10. Page G, Bordage G, Allen T. Developing key-feature problems and examinations to assess clinical decision-making skills. Acad Med. 1995;70:194–201.
11. Page G, Bordage G. The Medical Council of Canada's key features project: a more valid written examination of clinical decision-making skills. Acad Med. 1995;70:104–10.
12. Norcini JJ. Examining the examinations for licensure and certification in medicine. JAMA. 1994;272:713–4.
13. Newble D, Baxter A, Elmslie RG. A comparison of multiple choice tests and free-response tests in examinations of clinical competence. Med Educ. 1979;13:263–8.
14. Veloski JJ, Rabinowitz HK, Robeson MR. A solution to the cueing effects of multiple choice questions: the Un-Q format. Med Educ. 1993;27:371–5.
15. Young PR. Board news. J Am Board Fam Pract. 1990;3:310.
16. Damjanov I, Fenderson BA, Veloski JJ, Rubin E. Testing of medical students with open-ended, uncued questions. Hum Pathol. 1995;26:362–5.
17. Veloski JJ, Rabinowitz HK, Robeson MR, Young PR. Patients don't present with five choices: an alternative to multiple-choice tests in assessing physician's competence. Acad Med. 1999;74:67–74.
18. Chang K, Sauereisen S, Dlutowski M, Veloski JJ, Nash DB. A cost-effective method to characterize variation in clinical practice. Eval Health Prof. 1999;22:184–96.
19. Anbar M, Loonsk JW. Computer emulated oral exams: rationale and implementation of cue-free interactive computerized tests. Med Teach. 1988;10:175–80.
20. Anbar M. Comparing assessments of students' knowledge by computerized open-ended and multiple-choice tests. Acad Med. 1991;66:420–2.
21. McLeod PJ. Immediate review of multiple-choice questions. Teach Learn Med. 1995;7:67–70.