Kelly, William F. MD; Papp, Klara K. PhD; Torre, Dario MD, PhD; Hemmer, Paul A. MD, MPH
Clerkship directors (CDs) use a variety of methods to evaluate medical students’ performance during clinical clerkships. Teachers’ descriptive evaluations remain the predominant method across clerkships, with growing use of the Reporter-Interpreter-Manager-Educator (RIME) framework, formal evaluation sessions, and competency frameworks such as those defined by the Accreditation Council for Graduate Medical Education.1,2 Other methods include “workplace” assessment such as direct observation.1,3,4 Still, quantifiable examinations have long played a role in assessment during clerkships, the most often used being the subject examinations of the National Board of Medical Examiners (NBME).1,5 Examinations (quantifiable and otherwise) developed locally by faculty were recently reported to be used in one out of three internal medicine clerkships (34%) and contributed on average 14% of a student’s grade.1 However, details about how local, faculty-written (LFW) examinations are administered and, more important, their perceived value added, have not been described.
Additionally, prior publications illustrate the problems associated with LFW examinations at medical schools.6–10 Basic science LFW examination items have been shown to be poorly constructed,10 but faculty who have participated in item-writing workshops or who have taken part in a committee review process have been shown to write better ones.10,11 Furthermore, the evidence for the validity of LFW examinations can be adversely affected by construct-irrelevant variance (i.e., sampling items that are not relevant to the construct being tested) or construct underrepresentation (i.e., insufficient sampling of important constructs), both of which point to the need for adequate examination development, blueprinting, and review.6–9,12,13 Such potentially flawed approaches to test construction are not unique to medical schools.12
Given the ongoing use of LFW examinations within clinical clerkships, we investigated examination practices of internal medicine CDs, the reasons why LFW examinations are used, and the factors associated with the quality and validity of those examinations. We report our findings below.
In June 2009, working through the Clerkship Directors in Internal Medicine (CDIM), we surveyed CDIM’s 107 institutional members from U.S. and Canadian medical schools (not all medical schools in those countries are institutional members of CDIM). We sent each of the CDIM institutional members (typically the CD) an e-mail message with a cover letter linked to our online, confidential survey. Nonrespondents were contacted up to three additional times by e-mail and once by telephone. Participants were blinded to any specific hypothesis of the study. The institutional review board (IRB) at Case Western Reserve University determined that the CDIM survey research protocol does not fit the definition of human subjects research per 45 CFR 46.102, and therefore the protocol did not require exemption status, further IRB review, or IRB approval.
Questions regarding LFW examinations were drafted by two of us (W.F.K., P.A.H.) and were reviewed and edited by members of the CDIM Research Committee (D.T., K.K.P.), piloted, and finally reviewed and approved by the CDIM Council.
Demographic items included respondent’s age, type of medical school (public, private), current academic position, and the number of years in his or her current position.
In the section on LFW examinations, respondents were asked whether they used the NBME subject examination in medicine (yes/no) and whether they used LFW examinations (yes/no). If they used LFW examinations, we asked how many of these examinations they used. Respondents could provide information on up to 4 faculty examinations, resulting in a total of 68 possible items. For each LFW examination, we asked respondents to address timing of administration (whether at the beginning, during, midpoint, or end of the clerkship, or “other” with a free-text option); how the examination was administered (paper, online, both paper and online, or “other” with a free-text option); the examination format (essay, short-answer, multiple-choice, computer simulation, or “other” with a free-text option); how many items were on the examination; time given to complete the examination; whether the examinations were open-book (yes/no) and, if yes, what resources were allowed (free-text); whether the examination was given the same day as the NBME subject examination or other examinations (yes/no); how much the examination contributed to the grade (0%–100%); whether the examination was norm referenced, criterion referenced, or “not sure”; and whether students were required to pass the examination (yes/no) and, if yes, what the passing standard was.
To explore the rigor of the development and monitoring processes for the LFW examination(s), we asked who developed the examination(s) (CD, site director, other faculty, or “other” with a free-text option); whether parallel forms of the examination were given and, if yes, how many in an academic year; whether reliability figures were known about the examination(s) used (yes, no, not sure); and whether the examination(s) had been given to a control group. Finally, respondents were asked to comment on the reason(s) that they use LFW examinations (free-text response).
Descriptive statistics were used to summarize responses to all questions. We used the chi-square test or Fisher exact test to determine whether an LFW examination’s contributing to a grade (yes/no), or students’ being required to pass the examination (yes/no), was associated with whether the CD knew reliability estimates (yes/no or not sure), whether parallel forms of the examination were used (yes/no or not sure), and whether the examination was given to a control group (yes/no or not sure).
We used Mann–Whitney U to compare the percentage that an examination contributed to the final grade with whether the CD knew reliability estimates, used parallel forms of the LFW examination, or administered the examination to a control group. For purposes of comparison, we combined all of the forms of the examinations given as the first, second, or third examinations (e.g., all first-examination responses, regardless of the examination format, were considered together). Statistical analysis was conducted using SPSS (version 16; Chicago, Illinois).
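The two nonparametric comparisons described above can be illustrated with a minimal Python sketch. It is not the SPSS analysis used in the study: the function names `fisher_exact_two_sided` and `mann_whitney_u` are hypothetical, the counts and percentages are invented for illustration, and only the U statistic (not its p value) is computed.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_obs * (1 + 1e-9)
    )


def mann_whitney_u(xs, ys):
    """Mann-Whitney U statistic for xs versus ys, counting each tie as one half."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Hypothetical 2x2 table (NOT survey data):
# rows    = examination counts toward the grade (yes / no)
# columns = CD knows reliability estimates (yes / no or not sure)
p_value = fisher_exact_two_sided(3, 1, 1, 3)

# Hypothetical grade contributions (%) for examinations whose CD did or
# did not know reliability estimates (NOT survey data)
known = [10.0, 20.0, 30.0]
unknown = [5.0, 15.0, 25.0]
u_stat = mann_whitney_u(known, unknown)
```

With cell counts as small as those in this survey, the exact test is the safer choice because the chi-square approximation becomes unreliable when expected cell counts are low.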
Free-text responses were reviewed independently by two of us (W.F.K., P.A.H.); we developed themes based on the comments, met to review the preliminary coding structure, coded all of the responses, and resolved differences by consensus.
Sixty-nine of 107 (64.5%) institutional members responded to the survey (although not to every question). Just over half of the respondents (54.5%; 36/67) reported being at public or state medical schools. Respondents identified their roles as CD (94.2%; 65/69), subinternship director (43.5%; 30/69), undergraduate medical education director (14.5%; 10/69), vice chair of education (10.1%; 7/69), and “other” (8.7%; 6/69). A majority (59.7%; 40/67) noted six or more years of experience in their current role.
Use of exams
The NBME subject exam was used by 92.6% (63/69) of responding CDs, an increase from 83% in the 1999 survey14 and 85% in 2005.1 Twenty-two institutions (32.8%; 22/69) reported using one or more LFW examinations, up from 27% in the 1999 survey14 but down from the 2005 value (36%).1 Twenty-two percent (17/69) of clerkships used both types of examinations, up from 12% in 2005.1 All clerkships that did not use the NBME subject examination reported using at least one LFW exam, and, conversely, all respondents not using an LFW exam used the NBME subject examination.
Authorship of exams
These LFW internal medicine examinations were written by the CD alone (9/22; 41%), the CD and site director (2/22; 9.1%), other faculty members (4/22; 18.2%), or a combination of these (7/22; 31.8%). These other faculty members included a chief medicine resident, faculty representing divisions, and even noninternists such as radiologists.
Administration and format of exams
See Table 1 for LFW examination characteristics. Half of the clerkships (50%; 11/22) administered one LFW examination, but six (27.3%) gave two, four (18.2%) gave three or four, and one (4.5%) gave seven tests. Exam(s) were given most often at the end of the clerkship (72.7%; 16/22). Three (13.6%) had examinations only midclerkship, which, on the basis of our review of comments, served either as a midpoint exam to help prepare for the NBME subject examination or as a test of specific knowledge areas like geriatrics, chest X-ray, or ECG interpretation. The remaining three clerkships spaced examinations weekly or monthly. The majority of examinations were traditional, paper-based examinations (70%; 14/22), but one-quarter were online, including one taken across a three-day period. The most common format was multiple-choice questions (77.3%; 17/22), followed by short-answer (36.4%; 8/22) and essay (18.2%; 4/22). However, there were some interesting formats: One clinical reasoning test was administered orally; one used actual ECGs, chest X-rays, and blood smears; one assessed geriatrics skills with a written paper; one employed review of a videotaped patient encounter; and another clerkship reported using script-concordance testing.
Grading of exams
Most of the time (85%), exams were graded according to a preestablished criterion and not normatively. Half (11/22) of the clerkships required a passing performance on the LFW examinations in order to pass the clerkship. Passing performance was set at two standard deviations below the mean (18%; 2/11), 60% correct (36%; 4/11), 65% correct (18%; 2/11), or a combination of two standard deviations below the mean plus minimum 65% correct (9%; 1/11). The highest passing standard reported was 72.5% correct. One oral examination was graded “pass or fail” by the examiner.
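To make the combined passing standard concrete, the following is a minimal Python sketch. It assumes, as one plausible reading of the survey response, that a student must clear both the normative cutoff (two standard deviations below the class mean) and the fixed 65% floor, so the effective cutoff is the higher of the two. The function name `passing_cutoff`, the use of the sample standard deviation, and the score lists are all hypothetical.

```python
from statistics import mean, stdev


def passing_cutoff(scores, sd_below=2.0, floor=65.0):
    """Return the stricter of (mean - sd_below * SD) and a fixed floor.

    Assumption: 'combination' means a student must satisfy both rules.
    The sample standard deviation is used; respondents did not say
    which form they applied.
    """
    normative = mean(scores) - sd_below * stdev(scores)
    return max(normative, floor)


# Hypothetical percent-correct scores (NOT survey data)
wide_spread = [70.0, 75.0, 80.0, 85.0, 90.0]    # normative cutoff falls below 65
narrow_spread = [80.0, 82.0, 84.0, 86.0, 88.0]  # normative cutoff exceeds 65

cutoff_wide = passing_cutoff(wide_spread)      # the 65% floor applies
cutoff_narrow = passing_cutoff(narrow_spread)  # mean minus 2 SD applies
```

The two score lists show why the rule used matters: in a class with widely spread scores the fixed floor governs, whereas in a tightly clustered class the normative rule sets a noticeably higher bar.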
Of note (see Table 1), CDs did not consistently have reliability estimates from the examinations, nor had they given them to control groups; however, they were likely to report use of parallel forms of the examinations.
There was no association between (1) whether the examination counted toward the final grade or (2) whether passing was required, and whether the CD knew reliability figures for the examinations, whether parallel exam forms were used, and whether the examination had ever been given to a control group. Furthermore, if an examination did contribute to a student’s final grade, the amount it contributed did not differ depending on whether the CD knew reliability figures, whether parallel exam forms were used, or whether the examinations had been given to a control group (see Table 2).
Reasons exams were used
All 22 respondents who used LFW exams answered the free-text question concerning the reason(s) they had such an examination. The themes that emerged were the need to test content relevant to the clerkship, dissatisfaction with the ability of the NBME subject examination to cover all topics taught during the internal medicine clerkship, clerkship composition not amenable to subject examination testing, and using faculty examinations to prepare for final examinations. Over 70% of respondents addressed the first and second themes.
Regarding the second theme, the underrepresented areas included specific interpretation skills such as ECG, chest X-ray, and blood smear reading, and also specific clinical areas such as clinical reasoning, hospice, dermatology, ophthalmology, geriatrics, and communication. Also, respondents cited lack of local control over the content, the need for more feedback on performance, and a preference for a testing format other than single-best-answer questions. Four clerkships that had only LFW examinations used them as an “NBME substitute” because of administrative issues, such as splitting their clerkship into inpatient and outpatient portions or giving junior students the LFW exam and their seniors the NBME subject examination. Only one respondent indicated cost as a concern about the NBME examination. Of the 17 clerkships that had both the NBME subject examination and their own LFW exams, 5 (22.7%) used LFW exams as a form of study preparation for the NBME, including interspersed (weekly or monthly) testing to help students budget their study time.
Assessment in medical education is a complex process, and no single method can hope to encompass the breadth and complexity of the many competencies involved. In fact, it has been advocated that we move away from a “trait-based” model of assessment, which fosters selecting one assessment method for one trait.15 Even so, examinations are likely to remain a key part of any program of assessment. LFW examinations should not be continued out of institutional habit, nor should they be ended solely because other forms of evaluation may be more convenient. They should be continued if they provide added value.
Although the NBME subject examination in medicine remains the dominant form of examination assessment of the internal medicine clerkship, with 93% of respondents indicating use of this exam, LFW examinations continued to play a notable role for approximately one-third of the respondents. In fact, the trend was for greater use of the combination of the NBME subject examination with at least one LFW examination, with 22% reporting such a combination compared with 10.1% in the 1999 survey14 and 11.9% in 2005.1 This trend of using multiple examinations may reflect the recognition that assessment should be designed to look at a student from a variety of angles. A 1986 study noted poor correlation of global clinical performance ratings and internal medicine clerkship examination performance, arguing that the two are complementary.16 More recently, performance on the NBME subject examination has been associated with intern and resident (but not attending) determinations of student progress on the RIME framework.17 In another review, correlations between examination score and clinical performance varied with clerkship specialty, and those with the lowest correlations weighted examination performance more heavily.18
Respondents reported that most LFW clerkship exams are administered on paper, although Internet-based testing has been shown to reduce departmental costs and be preferred by students in a pediatric clerkship.19 Findings also show that exams were rarely given as “open-book” tests, even though in one study such a format did not seem to advantage a student in terms of test scores and had important other effects such as reduced student anxiety, increased use of a textbook during the clerkship, and more time spent on understanding principles rather than memorization.20
Interestingly, according to the respondents’ free-text entries, the reasons for using local faculty examinations centered on issues about the NBME subject examination, such as the desire to test constructs and content that were perceived to not be well covered by the NBME examination, dissatisfaction with one or more elements of the NBME examination (content, feedback), administrative issues that preclude using the NBME examination, or use of LFW exams to help prepare for the NBME examination. The concerns about the NBME examination persist even after concerted efforts to align the NBME subject examination in medicine with national curricula21 and highlight that assessment should be tailored to what the students are expected to “know,” “know-how,” or “show-how” (such as in a clinical reasoning examination).22 Furthermore, the use of LFW examinations as preparation is supported by the finding that students who received regular, quantitative, formative feedback in the form of weekly quizzes scored higher on, and less often failed, their NBME subject examination,23 and by further evidence that assessment can be an educational tool that promotes learning.24
However, it is necessary but not sufficient to have legitimate reasons for using LFW examinations in a clinical clerkship; CDs also have a responsibility to ensure the currency, reliability, and validity of these assessments.25
Validity is not an inherent, unchanging property of an assessment method but an interpretation of the result for which one must continually seek supporting sources of evidence.26 Cook and Beckman13 note that this evidence comes from five sources: content, response process, internal structure, relations to other variables, and consequences.
Many of the respondents indicated that their LFW examinations were tailored to meet specific content needs not addressed by the NBME subject examination, and this would address one of the sources of evidence for validity. Unfortunately, our survey results suggest that internal medicine CDs are falling short on seeking other evidence for the validity of LFW examinations. A minority of CDs reported having reliability estimates of their test instruments; the lack of such information jeopardizes judgments about students that are made on the basis of those instruments’ results. Furthermore, less than one-third of respondents indicated that they had administered their examination(s) to a “control group,” even though this may help assess quality and relevance.27 Fortunately, many clerkships did have parallel forms of exams (different versions testing the same material). Parallel forms support validity in terms of “response process” because they may aid test security as well as internal structure (alternate forms reliability), provided the different versions of the same instrument produce similar results.
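As an illustration of what a reliability estimate involves for a multiple-choice LFW examination, the sketch below computes the Kuder-Richardson formula 20 (KR-20) coefficient, one common internal-consistency estimate for items scored right or wrong. The survey did not ask which reliability statistic CDs used, so the choice of KR-20, the function name `kr20`, and the response matrix are assumptions made for illustration only.

```python
def kr20(responses):
    """KR-20 for a list of per-student lists of 0/1 item scores.

    Assumes at least two items and at least some spread in total scores
    (a zero total-score variance would make the coefficient undefined).
    """
    n_students = len(responses)
    k = len(responses[0])
    totals = [sum(r) for r in responses]
    mean_total = sum(totals) / n_students
    # Population variance of students' total scores
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students
    # Sum over items of p * (1 - p), where p is the proportion answering correctly
    pq = 0.0
    for j in range(k):
        p = sum(r[j] for r in responses) / n_students
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)


# Hypothetical responses: 5 students x 4 items, 1 = correct (NOT survey data)
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
reliability = kr20(responses)
```

A clerkship would need far more students and items than this toy matrix for the estimate to be stable; the point is only that the calculation requires nothing beyond item-level scores that most CDs already collect.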
One must also consider the consequences of the examinations as a source of validity evidence, both immediate (student’s grade given and pass determination) and perhaps longer-term (overall medical school record and thus residency application). Our low numbers of responses limit the power to reach statistical significance in the comparisons, but simply reviewing the data reveals that grade contribution and passing requirements appear not to be influenced by available sources of evidence, jeopardizing the interpretation of the results (validity) for the students taking the examinations. Without reliability estimates, it could be argued that CDs should not use their examinations to determine passing performance and/or a significant percentage of students’ grades. Although an LFW examination’s contribution to the final clerkship grade appears small, it may still affect a student’s overall grade—producing either grade “inflation” or “deflation.”
So, what is a CD to do about using LFW examinations? We have provided some recommendations in List 1. The development and review process should be deliberate, conducted by individuals with training in writing examinations, and involve a group process because this has been shown to improve the quality of faculty examinations.11 Such a deliberate, and group, process may help avoid threats to validity possible in the design of examinations.6,7,9 Our survey shows that such a group process is needed because test development was reported to be the responsibility of the CD alone by 41% of respondents. Effective blueprints for internal medicine clerkship examinations have been proposed.28 Some family medicine clerkships have effectively pooled resources to create their own reliable clerkship examinations with less burden while maintaining content and feedback control.29
There are limitations to this study. It is a survey of internal medicine CDs and their opinions; however, the findings should help inform other CDs who use LFW examinations.1 The overall survey response rate for purposes of analysis is adequate, but the number reporting use of LFW examinations is relatively low, limiting statistical power to detect differences and also raising the possibility of nonresponder bias. Question wording could have biased respondents toward describing summative final examinations, decreasing discussions of other in-clerkship and/or formative assessments. Also, although requested, we did not have actual versions of the exams administered, so we cannot comment on the content component of validity, flawed items, or factors such as construct-irrelevant variance or construct underrepresentation. We also did not ask about other review process items, such as receiving item difficulty and discrimination indices, review of student performance, and the frequency of LFW examination modification. Given the responses to the questions we did ask, it would be reasonable to assume that such processes are likely to be in place for only a few respondents, yet we recognize it is still an assumption on our part. Finally, our intent is not to establish a national norm for internal medicine CDs’ use of LFW examinations but, rather, to provide a needed reference for programmatic evaluation.
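For readers unfamiliar with the item difficulty and discrimination indices mentioned among the limitations, a minimal Python sketch follows. It uses the classical definitions: difficulty as the proportion of students answering an item correctly, and discrimination as the difference in proportion correct between the highest-scoring and lowest-scoring groups. The function name `item_indices`, the default 27% group size (a common convention, assumed here), and the response matrix are hypothetical.

```python
def item_indices(responses, item, fraction=0.27):
    """Return (difficulty, discrimination) for one item.

    responses: per-student lists of 0/1 item scores.
    fraction: share of students placed in each of the upper and lower groups.
    """
    # Rank students from highest to lowest total score
    ranked = sorted(responses, key=sum, reverse=True)
    n_group = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:n_group], ranked[-n_group:]
    # Difficulty: proportion of all students answering the item correctly
    difficulty = sum(r[item] for r in responses) / len(responses)
    # Discrimination: upper-group minus lower-group proportion correct
    discrimination = (
        sum(r[item] for r in upper) - sum(r[item] for r in lower)
    ) / n_group
    return difficulty, discrimination


# Hypothetical responses: 5 students x 4 items, 1 = correct (NOT survey data)
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
# Group size of 2 students (40%) chosen only because the sample is tiny
first_item = item_indices(responses, item=0, fraction=0.4)
second_item = item_indices(responses, item=1, fraction=0.4)
```

Items with very high difficulty values (nearly everyone correct) or discrimination values near zero are the usual candidates for review or replacement.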
LFW Examinations: Challenge and Opportunity
Our report represents the first effort to better characterize the use of LFW examinations within internal medicine clerkships and highlights important issues about the need for a deliberate and deliberative approach to the use of such assessments. It makes clear that internal medicine CDs need to attend to one of their core responsibilities about examination use25,26: They must gather evidence in support of the validity of the examinations and determine whether, and to what degree, the examinations should contribute to a grade. CDs should also provide information about their examination processes that national CD organizations can use not only to design faculty development initiatives for their members but also to guide ongoing collaboration with organizations such as the NBME.
Acknowledgments: The authors acknowledge the Alliance for Academic Internal Medicine staff for their help in creating the online survey used in this study as well as in survey distribution, collection, and data entry.
Other disclosures: The data used in this survey are the property of the Clerkship Directors in Internal Medicine and are used with permission.
Ethical approval: The institutional review board at Case Western Reserve University determined that the CDIM Survey research protocol does not fit the definition of human subjects research per 45 CFR 46.102, and therefore the protocol did not require exemption status, further IRB review, or IRB approval.
Disclaimer: The views expressed in this report are those of the authors and do not represent the official views of the Uniformed Services University of the Health Sciences, the United States Army, the United States Air Force, the Department of Defense, or other federal agencies.
1. Hemmer PA, Papp KK, Mechaber AJ, Durning SJ. Evaluation, grading, and use of the RIME vocabulary on internal medicine clerkships: Results of a national survey and comparison to other clinical clerkships. Teach Learn Med. 2008;20:118–126
3. Norcini J, Burch V. Workplace-based assessment as an educational tool: AMEE guide no. 31. Med Teach. 2007;29:855–871
4. Norcini JJ. Work based assessment. BMJ. 2003;326:753–755
5. Torre D, Papp K, Elnicki M, Durning S. Clerkship directors’ practices with respect to preparing students for and using the National Board of Medical Examiners subject exam in medicine: Results of a United States and Canadian survey. Acad Med. 2009;84:867–871
6. Downing SM. Threats to the validity of locally developed multiple-choice tests in medical education: Construct-irrelevant variance and construct underrepresentation. Adv Health Sci Educ Theory Pract. 2002;7:235–241
7. Downing SM. Construct-irrelevant variance and flawed test questions: Do multiple-choice item-writing principles make any difference? Acad Med. 2002;77(10 suppl):S103–S104
8. Downing SM. The effects of violating standard item writing principles on tests and students: The consequences of using flawed test items on achievement examinations in medical education. Adv Health Sci Educ Theory Pract. 2005;10:133–143
9. Downing SM, Haladyna TM. Validity threats: Overcoming interference with proposed interpretations of assessment data. Med Educ. 2004;38:327–333
10. Jozefowicz RF, Koeppen BM, Case S, Galbraith R, Swanson D, Glew RH. The quality of in-house medical school examinations. Acad Med. 2002;77:156–161
11. Wallach PM, Crespo LM, Holtzman KZ, Galbraith RM, Swanson DB. Use of a committee review process to improve the quality of course examinations. Adv Health Sci Educ Theory Pract. 2006;11:61–68
12. Tarrant M, Knierim A, Hayes SK, Ware J. The frequency of item writing flaws in multiple-choice questions used in high stakes nursing assessments. Nurse Educ Today. 2006;26:662–671
13. Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments: Theory and application. Am J Med. 2006;119:166.e7–166.e16
14. Hemmer PA, Szauter K, Allbritton TA, Elnicki DM. Internal medicine clerkship directors’ use of and opinions about clerkship examinations. Teach Learn Med. 2002;14:229–235
15. van der Vleuten CP, Schuwirth LW. Assessing professional competence: From methods to programmes. Med Educ. 2005;39:309–317
16. Ramsey PG, Shannon NF, Fleming L, Wenrich M, Peckham PD, Dale DC. Use of objective examinations in medicine clerkships. Ten-year experience. Am J Med. 1986;81:669–674
17. Griffith CH 3rd, Wilson JF. The association of student examination performance with faculty and resident ratings using a modified RIME process. J Gen Intern Med. 2008;23:1020–1023
18. Lurie SJ, Mooney CJ. Relationship between clinical assessment and examination scores in determining clerkship grade. Med Educ. 2010;44:177–183
19. Ogershok PR, Moore RS, Ferrari ND, Miller LA. An Internet-based pediatric clerkship examination. Med Teach. 2003;25:381–384
20. Broyles IL, Cyr PR, Korsen N. Open book tests: Assessment of academic learning in clerkships. Med Teach. 2005;27:456–462
21. Elnicki DM, Lescisin DA, Case S. Improving the National Board of Medical Examiners internal medicine subject exam for use in clerkship evaluation. J Gen Intern Med. 2002;17:435–440
22. Miller GE. The assessment of clinical skills/competence/performance. Acad Med. 1990;65(9 suppl):S63–S67
23. Brar MK, Laube DW, Bett GC. Effect of quantitative feedback on student performance on the National Board Medical Examination in an obstetrics and gynecology clerkship. Am J Obstet Gynecol. 2007;197:530.e1–530.e5
24. Larsen DP, Butler AC, Roediger HL 3rd. Test-enhanced learning in medical education. Med Educ. 2008;42:959–966
25. Pangaro L, Bachicha J, Brodkey A, et al. Expectations of and for clerkship directors: A collaborative statement from the Alliance for Clinical Education. Teach Learn Med. 2003;15:217–222
26. Downing SM. Validity: On meaningful interpretation of assessment data. Med Educ. 2003;37:830–837
27. Harris JA, Heneghan HC, McKay DW. The rating of pre-clerkship examination questions by postgraduate medical students: An assessment of quality and relevancy to medical practice. Med Educ. 2003;37:105–109
28. McLaughlin K, Lemaire J, Coderre S. Creating a reliable and valid blueprint for the internal medicine clerkship evaluation. Med Teach. 2005;27:544–547
29. Slatt LM, Steiner BD, Hollar DW, Chessman AW, Xin J, Hedgpeth MW. Creating a multi-institutional family medicine clerkship examination: Lessons learned. Fam Med. 2011;43:235–239