Variability in the quality of clerkship assessments is attributed to limited observation of students’ performance, time and productivity demands on faculty, and differing opinions among faculty about what constitutes competent performance.1 This variability persists despite efforts to develop and use explicit criteria and descriptive anchors. Objective structured clinical examinations (OSCEs) therefore seem a reasonable way to assess students’ achievement of clinical learning outcomes within or across clinical disciplines in a controlled, standardized environment. OSCEs are popular among U.S. medical schools as a valid and reliable method for measuring students’ knowledge and skills in a controlled clinical setting before graduation.2,3 A recent survey of U.S. medical schools revealed that in the last year, the proportion of schools using one or more OSCEs increased from one third to one half (G. March, vice president for testing services, National Board of Medical Examiners, Philadelphia, Pennsylvania: personal e-mail communication to Larry Gruppen, PhD, chair, Department of Medical Education, University of Michigan Medical School, January 30, 2007).
Medical schools with OSCEs generally conduct some level of content analysis to ensure that their assessments are in fact measuring what they intend to measure. However, medical education researchers have recently questioned the appropriateness of applying standard psychometric tools that measure content validity to assessments such as those made by OSCEs.4 Strict psychometric interpretation, for example, might lead to eliminating items from an assessment because of poor item statistics, when in fact the item(s) might be relevant and need more rather than less (or no) representation in order to provide assessors with useful information. There is also a tendency to sum or average individual item scores, rather than to view each item as measuring something unique.4 For example, does it make sense to set the standard on a summated score that measures different clinical skills such as history taking, physical examination, and communication?
Outside of medical education, researchers have framed determination of a test’s validity as having practical and theoretical perspectives, as well as internal and external foci. In particular, Lissitz and Samuelsen5 have suggested that content and reliability should not be considered separately in evaluation of a test’s content validity. Within the four-way contexts of practical/theoretical and internal/external, they suggest specific questions and potential sources of evidence that can be helpful in making validity determinations.
At our institution, the University of Michigan Medical School, annual quality assurance (QA) methods informed us that the fourth-year (M4) OSCE checklists were meeting the requirements we had set for construct validity and reliability. As we prepared to adopt an end-of-second-year (M2) OSCE several years ago, and discussed links between expectations for that assessment and the M4 OSCE, we reexamined the domains and levels we were measuring on the M4 OSCE. We discussed whether the level of performance we had defined was actually more appropriate for students entering the clinical phase than for students less than a year from graduation. That is, although we had determined that the OSCE was assessing what we expected it to assess, we suspected that our expectations for student performance on the M4 OSCE were too basic.
We turned for guidance to our school’s Goals for Medical Student Education,6 which included a number of advanced cognitive skills that are challenging to measure reliably on the clinical clerkships. Again, this is because of inconsistencies in physician assessment and limitations in the time faculty and house officers have to observe medical students,7,8 particularly since adoption of restricted duty hours for house officers.9 We expect our graduates to possess abilities that include critical assessment of evidence, assimilation of new scientific information, independent clinical decision making, and self-assessment. We also expect that graduates will possess professional attributes (e.g., integrity) that will guide their interactions with patients in an increasingly diverse society.
Yet, we had no metric against which to judge whether our summative assessment was indeed achieving these goals. We therefore searched for available taxonomies that could be modified to achieve this purpose. In the mid-20th century, Benjamin Bloom10 led a committee of educational psychologists that developed a system categorizing learning behaviors. The committee grouped behaviors across three domains: cognition (knowledge—mental skills), affective (attitudes—feelings or emotions), and psychomotor (skills—manual or physical). They developed taxonomies for the cognition and affective domains but none for the psychomotor domain, claiming they had no experience teaching manual skills to college students. The psychomotor domain was originally created to recognize manual tasks and physical movement; however, this domain was reconceptualized more broadly by Simpson,11 who created the third taxonomy. On each of the taxonomies, the levels advance from the simplest to the most complex, and each level must be mastered before progressing to the next. Thus, the underlying (or lower level) stages can be assumed in applying the taxonomies to learning and assessment, and they can also be useful in charting development. For example, when assessing a higher-order ability such as synthesis (using Bloom’s taxonomy for cognition), we are also assessing whether our learners have basic knowledge about the problem or complaint presented by the patient (Level 1, knowledge; Level 2, comprehension) and whether they know how to apply that knowledge in a clinical situation (Level 3, application).
In this report, we describe the process through which we modified Bloom’s and Simpson’s taxonomies to be used for developmental staging of physicians. We mapped the modified taxonomies to our summative M4 OSCE to address our growing concern that we were not assessing our senior students at appropriately advanced stages, given the proximity of the OSCE to graduation and our expectations and goals for our graduates.
Our summative M4 OSCE is a 13-station, half-day examination that all of our medical students take at the beginning of their fourth year. The stations include encounters with standardized patients (SPs) presenting with abdominal pain, back pain, breast lump, chest pain, depression, uncontrolled diabetes, memory loss, and mobility limitations, and with an SP portraying the parent of a baby with fever. Passing the OSCE overall and passing each OSCE station are requirements for graduation.
Two senior physician–educators (including H.M.H.) who directed and codirected the M4 OSCE at the University of Michigan Medical School, and who were also curriculum leaders in undergraduate medical student education, were asked to review Bloom’s and Simpson’s taxonomies and adapt them to medical education. Both physicians worked independently at first and then met to discuss differences and reach consensus. Two members of the M2 OSCE Committee were then asked to review the adapted taxonomies; a few additional changes were made after that review.
The same two physicians who adapted the taxonomies then reviewed the checklists for each of the 2006 M4 OSCE stations and categorized (domain) and ranked (level) each item on each checklist, based on the taxonomies. After completing that task, they met to discuss differences in their results. This discussion achieved domain consensus across all the checklist items. Cohen’s weighted kappa indicated moderate to high levels of interrater reliability (P < .0001); results were 0.75 in the cognitive domain (maximum possible = 0.93), 0.88 in the skills domain (maximum possible = 0.91), and 0.92 in the attitude domain (maximum possible = 0.92). These results provide strong evidence of interrater reliability, but occasional differences in exact ranking remained after the second round. Based on the disciplines and experience of the evaluators, small variations in agreement on the difficulty of a task might persist. To address the few variations we found, the mean station scores (to be discussed in the Results section) were computed by first computing each rater’s item ranking average for the station. Next, the two raters’ averages for that station were averaged to produce the mean station score. As shown in Table 1, these procedures led us to generally map the more advanced goals for our graduating students to Bloom’s cognitive and attitudes taxonomies10 and Simpson’s psychomotor skills taxonomy.11
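For readers who wish to run a similar agreement check on their own checklists, the following is a minimal sketch of Cohen’s weighted kappa for two raters. It is an illustration under stated assumptions, not the analysis code used in this study: the function name `weighted_kappa`, the choice between linear and quadratic weights, and the example rankings are all hypothetical, and the sketch does not compute the maximum possible kappa reported above.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, n_levels, weighting="linear"):
    """Cohen's weighted kappa for two raters assigning ordinal levels 1..n_levels.

    `weighting` is "linear" or "quadratic"; which one a given study used
    is an assumption the analyst must settle.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rank the same non-empty set of items")
    n = len(rater_a)
    power = 1 if weighting == "linear" else 2

    def disagreement(i, j):
        # 0 for identical levels, 1 for levels at opposite ends of the taxonomy
        return (abs(i - j) / (n_levels - 1)) ** power

    # Mean weighted disagreement actually observed across items
    observed = sum(disagreement(a, b) for a, b in zip(rater_a, rater_b)) / n

    # Mean weighted disagreement expected by chance from each rater's marginals
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        freq_a[i] * freq_b[j] * disagreement(i, j)
        for i in freq_a
        for j in freq_b
    ) / (n * n)
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one identical level")
    return 1 - observed / expected


# Hypothetical cognition-domain levels (1-6) assigned by two raters to six items
rater_1 = [4, 5, 3, 6, 4, 2]
rater_2 = [4, 4, 3, 6, 5, 2]
print(round(weighted_kappa(rater_1, rater_2, n_levels=6), 2))
```

Weighted kappa gives partial credit when raters differ by one level, which suits ordinal taxonomy rankings better than the unweighted statistic.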
We then applied the modified taxonomies to our summative OSCE. Items for each station were grouped by cognition, skills, and attitudes, and means and medians were calculated (Table 2). The taxonomies range from low or simple levels (e.g., “knowledge” on the cognition taxonomy) to advanced or complex levels (e.g., “evaluation” on the cognition taxonomy). By averaging the taxonomy levels within and across the OSCE stations, we treated the taxonomies as interval scales when, in fact, they are ordinal. However, treating the taxonomies as interval (i.e., computing means and standard deviations) was the most effective method for providing meaningful information back to the faculty about their concern (i.e., the complexity of levels we were assessing on the M4 OSCE stations).12 We also calculated medians within and across the stations. The means and medians were comparable.
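The two-step averaging described in the Method (each rater’s average item ranking for a station, then the average of the two raters’ averages) can be sketched as follows, grouped by domain. This is an illustration only: the item rankings are hypothetical, `station_summary` is an invented name, and handling the median in the same two-step way is an assumption, because the text does not specify how the two raters’ rankings were combined for medians.

```python
from statistics import mean, median

# Hypothetical checklist rankings for one station:
# (domain, level assigned by rater 1, level assigned by rater 2)
ITEMS = [
    ("cognition", 4, 4),
    ("cognition", 5, 4),
    ("cognition", 3, 3),
    ("skills", 4, 5),
    ("skills", 5, 5),
    ("attitudes", 2, 3),
]


def station_summary(items):
    """Return {domain: {"mean": ..., "median": ...}} for one station."""
    by_domain = {}
    for domain, level_1, level_2 in items:
        rater_1_levels, rater_2_levels = by_domain.setdefault(domain, ([], []))
        rater_1_levels.append(level_1)
        rater_2_levels.append(level_2)

    summary = {}
    for domain, (rater_1_levels, rater_2_levels) in by_domain.items():
        summary[domain] = {
            # Step 1: each rater's average; step 2: average of the two raters
            "mean": mean([mean(rater_1_levels), mean(rater_2_levels)]),
            # Assumption: medians are combined across raters the same way
            "median": mean([median(rater_1_levels), median(rater_2_levels)]),
        }
    return summary


summary = station_summary(ITEMS)
for domain, stats in summary.items():
    print(domain, round(stats["mean"], 2), stats["median"])
```

Reporting the median next to the mean offers a check on the interval-scale assumption: if the two diverge sharply for a station, the mean may be misleading for that ordinal ranking.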
The total numbers of items across the 13 OSCE stations included in this study were 202, 68, and 16 for cognition, skills, and attitudes, respectively.
On the basis of the analysis described above, on average, none of the stations achieved the higher or highest levels of the taxonomies across all three domains. The level at which the cognition items were ranked averaged 4.1 (averaged median = 4.0). On Bloom’s taxonomy, 4.1 falls within the analysis stage, below the synthesis and evaluation stages. The level at which the skills items were ranked averaged 4.4 (averaged median = 5.0). On Simpson’s taxonomy, 4.4 falls within the “Mechanism: Sequence of steps without direct observation” stage, below the stages where complex movements are performed automatically, where movements are adapted on the basis of findings, and where new patterns are created. Finally, the level at which the attitudes items were ranked averaged 2.5 (averaged median = 3.0). On Bloom’s attitudes taxonomy, 2.5 falls within the “Active participation/listening skills” stage, below the stages where individual and cultural differences are recognized and valued, where individuals accept responsibility for their own behavior, and where a self-constructed set of values guides an individual’s behavior.
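The way an averaged level is read against a taxonomy can be made explicit with a small lookup, shown here for the cognition domain only. The six stage names come from the text; the rule that a mean “falls within” the stage given by its whole-number part is inferred from how 4.1, 4.4, and 2.5 are interpreted above, and the function names are hypothetical.

```python
import math

# Bloom's cognition stages, lowest to highest, as named in this report
COGNITION_STAGES = [
    "knowledge",
    "comprehension",
    "application",
    "analysis",
    "synthesis",
    "evaluation",
]


def stage_for(mean_level, stages):
    """Stage that a mean level falls within (assumed rule: whole-number part)."""
    if not 1 <= mean_level <= len(stages):
        raise ValueError("mean level lies outside the taxonomy")
    return stages[math.floor(mean_level) - 1]


def stages_above(mean_level, stages):
    """Stages the mean level has not yet reached."""
    return stages[math.floor(mean_level):]


print(stage_for(4.1, COGNITION_STAGES))     # analysis
print(stages_above(4.1, COGNITION_STAGES))  # ['synthesis', 'evaluation']
```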
We described above how we adapted three taxonomies, Bloom’s for the cognitive and attitudes domains and Simpson’s for the psychomotor domain, to the medical education setting. We used these taxonomies to examine the construct validity of our summative M4 OSCE against our school’s Goals for Medical Student Education.6 We found that with a few exceptions (i.e., the cognitive domain on the communication skills station, and the skills domain on the health beliefs station), on average our OSCE stations were measuring only moderately high orders in all three domains of knowledge, skills, and attitudes. The taxonomies provided evidence that the M4 OSCE did not comprehensively measure students’ ability to synthesize and evaluate or to perform more complex physical exam skills, nor did it measure self-constructed, internalized values.
It is important to note that in specific instances within the M4 OSCE stations, we were assessing students’ knowledge, skills, and attitudes at higher levels. For instance, the item “The student discussed medication as a treatment option” (diabetes station) involves making judgments regarding therapy and aligning with the patient’s beliefs; the raters ranked it at the most advanced knowledge level (six out of six). “The student started palpating by placing his/her hand flat on the abdomen in the area farthest from the pain” (abdominal pain station) involves modifying the routine exam, based on observation and evaluation, to take into account that the patient has pain; raters ranked it just below the most advanced skills level (six out of seven). There were no items ranked above three in the attitudes domain, which has five levels. Thus, our checklists to some degree are assessing a range of knowledge, skills, and attitudes (see SDs in Table 2). As noted by Pangaro13 (whose Reporter-Interpreter-Manager-Educator [RIME] framework is a very useful tool for measuring student development of clinical acumen and professionalism over time), advanced learners typically move from lower stages to higher stages in accomplishing a clinical task (i.e., gathering information, analyzing and synthesizing across sources of information, weighing evidence, and finding solutions, reflecting and self-correcting along the way). Thus, it makes sense that checklists such as those we use on our M4 OSCE would have a range of tasks. The question for the faculty is whether the higher levels are adequately represented on the checklists (or whatever instruments are used), to determine in a valid and reliable way whether students have in fact achieved the goals that represent higher orders across domains.
Although the small number of items at the higher or highest levels of the taxonomies limits the observations available, we can review student performance at these higher levels (i.e., on these particular items) and compare it with their performance at the lower and moderate levels (using those checklist items ranked as such). For additional checks on validity and reliability, we can also compare performances with clinical clerkship assessments.
Before this systematic analysis, we had suspected that the average level of the OSCE assessment was not sufficiently advanced, and since then we have developed and piloted several new stations. These include a station requiring the use of evidence-based medicine skills and the addition of postencounter notes (similar to those required in the United States Medical Licensing Examination, Step 2 CS exam) on two of the OSCE stations (chest pain and back pain). Scoring these stations is complex, and we are still working toward standardization. Our next step is to apply the adapted taxonomy to these newer OSCE components to determine whether we have succeeded in measuring the upper levels of the three domains. We suspect that even with these additions, we might still have gaps. However, we can use the adapted taxonomies not only to assess the OSCE and to compare our students’ performance on the different levels represented on the checklists (i.e., highest versus moderate versus lowest), but also to modify existing checklists or design new assessment instruments that measure students’ knowledge, skills, and attitudes at sufficiently high levels.
Our experience has shown us that to ensure construct validity, medical schools should assess whether their summative OSCE is actually measuring students’ abilities across the cognitive, psychomotor, and affective domains at the advanced levels the schools expect. The application of taxonomies that assess these domains provides a methodological framework to analyze the validity and reliability of an OSCE against specific medical education goals across a spectrum of development.
Our analysis revealed a gap between our goals for students and our summative measurement of their knowledge, skills, and attitudes before graduation. We are still examining and addressing this gap through a variety of mechanisms, to ensure that our students are graduating as physicians who are competent to deliver excellent medical care.
The authors wish to acknowledge the assistance of Joel A. Purkiss, PhD, in conducting and describing methods for determining interrater reliability, and the guidance and support of Joseph C. Fantone, MD.