Remediating Students’ Failed OSCE Performances at One School: The Effects of Self-Assessment, Reflection, and Feedback

White, Casey B. PhD; Ross, Paula T. MA; Gruppen, Larry D. PhD

doi: 10.1097/ACM.0b013e31819fb9de
Students’ Academic Performance

Purpose To investigate whether and how use of an online remediation system requiring reflective review of performance and self-assessment influenced students’ performance on objective structured clinical examination (OSCE) station repeats (subsequent to failure on the first attempt) and their self-assessments of their performance (between the first and second attempts).

Method Fourth-year medical students’ performances on seven OSCE stations were videotaped at University of Michigan Medical School in 2006. Students who failed a station repeated it; remediation included self-assessment and review, plus faculty guidance for failures that fell more than one standard error of measurement below the passing cut point. A total of 1,197 possible observations of students’ actual performance and performance self-assessments were analyzed using independent and dependent t tests and within-subjects ANOVA.

Results Students’ performance improved to a statistically significant degree between first and second attempts, as did the accuracy of their self-assessments. No significant differences were found between self-directed and faculty-guided remediation.

Conclusions This study provides evidence that OSCE remediation combining review, reflection, and self-assessment has a salutary effect on (subsequent) performance and self-assessment of performance.

Dr. White is assistant dean for medical education and assistant professor of medical education, University of Michigan Medical School, Ann Arbor, Michigan.

Ms. Ross is research associate, University of Michigan Medical School, Ann Arbor, Michigan.

Dr. Gruppen is Josiah Macy, Jr. Professor of Medical Education and chair, Department of Medical Education, University of Michigan Medical School, Ann Arbor, Michigan.

Correspondence should be addressed to Dr. White, University of Michigan Medical School, 1135 E. Catherine St., Ann Arbor, MI 48109-5726; telephone: (734) 763-6770; fax: (734) 763-6771.

Editor’s Note: Commentaries on this article appear on pages 545 and 548.

An increasing number of medical schools require that their clinical students pass a summative objective structured clinical examination (OSCE), as revealed by a survey conducted by faculty at Boston University School of Medicine in 2005 (approximately one third of responding schools) and again in 2006 (roughly one half).1 Moreover, of more than 100 schools responding, OSCE remediation was required by 74% in 2005 and by 86% in 2006.1 Remediation is, in essence, feedback. And although these OSCEs are summative in nature, OSCE remediation programs have formative characteristics in that they are aimed at helping students improve their performance when they repeat (or otherwise make up) the station(s) they failed.

Over the last decade or more, U.S. medical schools have recognized and reported a growing problem with finding faculty time to provide effective, formative feedback to students as they progress through the clerkships.2–4 Under pressure to increase patient throughput that has direct ties to salary, faculty have less and less time to devote to teaching and feedback. And although students’ failure to master clinical judgment and skills in the clerkships can be identified on final clerkship assessments, faculty members’ and house officers’ limited time and competing activities (patient care) have led to limitations in their observations of students.5 Thus, summative OSCEs are important complements to physician-educator evaluations for ensuring that all students have achieved mastery of clinical acumen as defined by a school’s learning outcomes and goals.

An important component of clinical acumen is the ability to recognize one’s own limitations, particularly in terms of knowledge and skills. Because of this, medical educators and researchers are interested in studying self-assessment in medical students. These studies are an attempt to understand dimensions of self-assessment in the context of student learning and to identify factors or characteristics that predict self-assessment ability.6–10 Opportunities to practice self-assessment with feedback are key to achieving this outcome.11,12

The similarity of an OSCE to medical practice makes it a natural context for investigating students’ ability to self-assess. On the OSCE, students are engaged in clinical activities designed to measure their underlying knowledge, physical examination skills, and complex communication skills. Especially on the summative OSCEs, students are challenged to perform at advanced levels of cognitive functioning in activities requiring analysis, synthesis, and evaluation.13

Given limitations on faculty time for remediation and feedback and our interest in the OSCE as a naturalistic but controlled environment for fostering and measuring effective self-assessment, we were interested in exploring the following questions:

  1. For students who fail an OSCE station(s), does subsequent performance differ between one group of students who completed a required remediation program only and another group of students who completed a required remediation program and received specific faculty feedback?
  2. For students who fail an OSCE station(s), is there an improvement in self-assessment ability between original performance and subsequent performance that:
    1. results from remediation in general?
    2. differs between a group of students who completed a required remediation program only and another group of students who completed a required remediation program and received specific faculty feedback?

Method

In 2006, 173 medical students at the University of Michigan Medical School completed a summative 13-station OSCE at the beginning of their fourth year of medical school.

OSCE description

Among the stations were standardized patients (SPs) presenting with abdominal pain, back pain, breast mass, chest pain, uncontrolled diabetes, memory loss/mobility limitations, and as a parent of an infant with fever. Each of these seven stations was videotaped for subsequent analysis. (Other stations not included in this study required students to interpret a broad, foundational array of X-ray and MRI images and EKGs, to access and evaluate the literature in defined clinical cases, and to assess an SP with depression.) Passing the OSCE overall and passing each OSCE station are requirements for graduation.

OSCE performance assessment

Each of the stations in this study was 15 minutes; in each of them, the SP completed a checklist assessing the student’s performance. History and/or physical examination items on the checklists ranged from 10 to 36. There were 16 communication skills items on each checklist; communication was scored separately and across all of the stations with SPs. History-taking items (on all stations in this study) were scored as “done” or “not done,” and physical examination items were scored as “done,” “needs improvement,” or “not done.” Student failures were based on a combination of “not done” and “needs improvement.” Scores derived from each checklist were reported on a scale of 0% to 100%.

Faculty with appropriate expertise developed the content and tasks on the OSCE stations and the checklists; the OSCE committee faculty review and give final approval for all stations and checklists. Professional staff members trained the SPs and provided quality assurance of SP performance and assessments (they monitored the stations during the OSCE and reviewed every performance that was not passing). SPs underwent at least 12 hours of training for the station(s) in which they worked; the training included watching student performances and completing the checklists; these assessments were compared with professional trainers’ assessments by using the checklists.

As they completed each station, students estimated how they performed on that station, using a percent score, 0% to 100%. The score represented the percentage of items correct on the checklist for the station. There was no performance feedback given at any of the stations during the examination.

Data analysis

The dataset for this study was constructed to define the individual station performance as the unit of analysis. Thus, there were 1,197 possible observations (number of students with complete data multiplied by seven stations in which interactions with SPs were videotaped). The rationale for using the station rather than the student as the unit of analysis was that remediation was done on a station-by-station basis; that is, students would remediate only stations they had failed. This analytic strategy necessarily ignores the possibility that student-by-station-level failures may overrepresent some students or some stations. Because the number of failures was too small to make a hierarchical analysis feasible, the results may contain biases that arise from these student or station dependencies.

Pass/fail cut points were defined by a group of clinical faculty using the Hofstee method.14 All students failing a station were required to complete remedial activities before repeating the station (second attempt). Several of the stations (abdominal pain, diabetes, and pediatrics) had an “A” case and a “B” case. If a student failed one of these stations, he or she was given the case not done on the first attempt. Differences in performance on a station’s “A” case versus “B” case were analyzed and factored into the passing standards for the cases (i.e., there were different passing standards calculated for different cases within a station).

In this study, we focused on the remediation program for stations involving SPs, where tasks included history taking, physical examination, communication, and collaboration/negotiation. There were two levels of remediation. Students who failed one of these stations but performed within one standard error of measurement below the passing cut point were required to review station-specific resources (e.g., relevant journal articles), view a video recording of a student demonstrating excellent performance on the station, view a video recording of their own performance on the station, reflect on the differences between the two performances, and then compare the two performances (in writing). A second group of students who failed one of these stations and performed more than one standard error of measurement below the passing cut point were required to meet the same requirements, but they also received feedback from the faculty member responsible for the station, clarifying the student’s deficiencies and suggesting areas of needed improvement. Feedback from the faculty was entered in writing, directly into the remediation program Web site. Although faculty were willing to make themselves available if requested, none of the remediating students asked for a meeting.
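The two-level rule described above can be sketched as a small function. This is an illustration only: the function name and the example cut point and standard error of measurement used in the usage note are hypothetical, not values from the study. The text does not say how a score exactly one standard error below the cut point was handled, so the sketch assumes it falls in the self-directed tier.

```python
def remediation_tier(score, cut_point, sem):
    """Return the remediation level for one station score (0-100 scale).

    Assumption: a score exactly one SEM below the cut point is treated
    as self-directed remediation; the source does not specify this.
    """
    if score >= cut_point:
        return "pass"
    if score >= cut_point - sem:
        # Failed, but within one SEM below the passing cut point
        return "self-directed"
    # Failed by more than one SEM below the passing cut point
    return "self-directed plus faculty feedback"
```

With a hypothetical cut point of 65 and a standard error of measurement of 5, a score of 62 would map to self-directed remediation and a score of 55 to remediation with faculty feedback.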

The resources and videos were all digitized and integrated into a computer-based remediation program made available to students via a secure Internet connection; students used online text boxes to enter their self-assessment comments. Faculty accessed the videos and returned comments to students via the Internet using the same program.

Our data analysis consisted of independent and dependent t tests and within-subjects ANOVA. Only results that attained statistical significance (P < .05) are reported, and effect size measures are used to quantify the magnitude of these statistically significant relationships. Statistical analyses were performed using SPSS version 13 (SPSS Inc., Chicago, Illinois).
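For readers unfamiliar with these statistics, the following is a minimal sketch of a dependent-samples t statistic and a standardized mean difference, written with the Python standard library rather than SPSS. The scores are hypothetical, and the pooled-standard-deviation form of d is an assumption; the article does not state which variant of d was computed.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d_independent(a, b):
    """Standardized mean difference (a minus b) using the pooled SD."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def paired_t(first, second):
    """Dependent-samples t statistic and degrees of freedom (second minus first)."""
    diffs = [s - f for f, s in zip(first, second)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1


# Hypothetical scores for five failed stations (not study data)
first_try = [50.0, 55.0, 48.0, 60.0, 52.0]
second_try = [78.0, 80.0, 75.0, 85.0, 79.0]

t_stat, df = paired_t(first_try, second_try)
d = cohens_d_independent(second_try, first_try)
```

The paired form is the natural choice for first-versus-second-attempt comparisons on the same failed stations, whereas the independent form fits comparisons between failing and passing performances.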

Results

Of 173 students, 171 completed all seven videotaped stations, for a total of 1,197 analyzable observations. There were 57 station failures by first-time takers on the OSCE. Thirty students failed one station, nine students failed two stations, and three students failed three stations. In descending order, there were 12 failures on the chest pain station, 9 failures on the diabetes “A” case station, 8 failures on the memory loss/mobility limitation station, 7 failures on the breast mass station, 6 failures on the pediatric parent “B” case station, and 5 failures each on the back pain, diabetes “B” case, and pediatric parent “A” case stations.
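The counts in the preceding paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures reported in the text; the variable names are illustrative.

```python
# Students grouped by number of stations failed (from the text)
students_by_failures = {1: 30, 2: 9, 3: 3}

# Failures per station or case (from the text)
failures_by_station = {
    "chest pain": 12,
    "diabetes A": 9,
    "memory loss/mobility limitation": 8,
    "breast mass": 7,
    "pediatric parent B": 6,
    "back pain": 5,
    "diabetes B": 5,
    "pediatric parent A": 5,
}

# Total failures counted two ways, plus the number of students with any failure
total_from_students = sum(k * n for k, n in students_by_failures.items())
total_from_stations = sum(failures_by_station.values())
students_with_failure = sum(students_by_failures.values())

# Students with complete data multiplied by the seven videotaped stations
possible_observations = 171 * 7
```

Both tallies give 57 station failures, spread across 42 students, out of 1,197 possible observations.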

Overall performance as rated by the SPs for all station-student pairs was 76.3. Students’ self-assessed scores for these same stations were very close to their actual performance, with a difference score averaging around 6.2 points (standardized mean difference [d] = 0.6).15

As expected, the average score on the failing performances (53.4) was substantially lower (24.1 points) than the score for the passing performances (77.5, d = 2.6) (see Table 1). Between these two groups, the self-assessments (students’ estimations of their performance on the first try) were nearly identical (83.4 and 82.4, respectively). It was the difference between actual performance scores and self-assessed scores that resulted in a large difference in self-assessment accuracy between the failing performances (mean = 30.0) and the passing performances (mean = 4.9, d = 2.0).

Table 1

For the failing performances, the difference between actual performance and self-assessments (Table 1) decreased from 30.0 points on the first attempt (a substantial overestimation) to 3.1 on the second attempt (d = 2.6). Within the subgroups of the failing performances, this pattern held true for both the group that self-remediated (difference decreased from 29.1 to 3.5, d = 2.8) and the group that remediated with faculty feedback (difference decreased from 31.8 to 2.4, d = 2.4).

These changes are attributable to changes in actual performance, which rose from a mean of 53.4 on the first try to 79.4 on the second try (26.0) for all failed stations (d = 3.9) (Table 1). The increase in actual performance was 24.6 points for students who self-remediated (mean of 55.9 on the first try to 80.5 on the second try, d = 5.1) and 28.8 points for stations remediated with faculty feedback (mean of 48.6 on the first try to 77.4 on the second try, d = 3.9).
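The performance gains quoted above follow directly from the reported means, as this short check shows. It uses only the means given in the text; the dictionary keys are illustrative labels.

```python
# Mean scores (0-100 scale) reported in the text: (first try, second try)
reported_means = {
    "all failed stations": (53.4, 79.4),
    "self-remediated": (55.9, 80.5),
    "faculty feedback": (48.6, 77.4),
}

# Gain = second-try mean minus first-try mean, rounded to one decimal place
gains = {
    group: round(second - first, 1)
    for group, (first, second) in reported_means.items()
}
```

The computed gains are 26.0, 24.6, and 28.8 points, matching the values stated in the paragraph.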

This study provides evidence that remediation combining review, reflection, and self-assessment has a salutary effect on students’ performance on failed OSCE stations. Both groups had gains in performance averaging about 26 points; however, a repeated-measures ANOVA indicated that there was no statistically significant (P < .28) difference between the two remediation conditions in the magnitude of this change.

Discussion

Our first research question was whether the remediation program we designed would result in improved performance. Our findings provide evidence that remediation combining review, reflection, and self-assessment has a beneficial effect on students’ performance on failed OSCE stations. Both self-remediating and faculty feedback groups had gains in performance; however, a repeated-measures ANOVA indicated that there was no statistically significant difference between the two remediation conditions in the magnitude of this change.

This study indicates a substantial improvement in self-assessment accuracy between the first and second attempts for both groups of students—an average overestimate of about 30 points essentially disappeared for all failing performances and for both subgroups (self-remediation and remediation with feedback). There is evidence that once individuals become aware of what they don’t know or can’t do, they are able to assess themselves more accurately.16 However, a plausible alternative explanation is that student self-assessments were largely unaffected by the remediation (for both groups, self-assessments stayed at an average of about 80 points) and that the increase in accuracy was a result of actual performance catching up with the students’ self-assessments. The fact that the self-assessments (on average) did not change from the first to the second attempts leaves us uncertain about the underlying dynamics. If self-assessments had changed, either up or down, a specific interpretation would be easier to make, but the lack of change leaves us wondering whether the students understood more after remediation about their performance in relation to their expectations and could, thus, more accurately assess it, or whether they were just as unaware after remediation as before and the similarity between actual and self-assessed performance after the second try was coincidental. To understand this issue better, next year we will add a “retrospective-pre” item to the self-assessment. We will ask students to indicate whether, after they have completed the remediation program, they would change their self-assessment of their first try.

We undertook this study to determine the effectiveness of an online remediation program for students’ failing performances on an OSCE. To protect faculty time previously used in mostly one-on-one remediation meetings with students, we opted for a computer-based method of providing information resources to students and of challenging them to reflect on and assess their performance as it compared with an excellent performance under identical expectations. Faculty members needed to review and respond only to those students whose performances were at the lower end of the distribution (more than one standard error of measurement below the passing cut point), which significantly reduced the time spent on remediation by the faculty. The similarities between the two groups in performance gains on the second attempt suggest that faculty feedback on students’ self-assessed performances may not be necessary.

We also wanted to determine how the online remediation program influenced self-assessment ability. We believe that both reflection and feedback on performance are key to effective self-assessment. One group of students used the program to reflect on and self-assess their performance; another group did the same but also received faculty feedback. In this study, inclusion of faculty feedback did not seem to improve subsequent student self-assessment when compared with exclusion of feedback. However, this finding might be attributable to the lack of a comprehensive effort focused on engaging students in self-assessment throughout medical school.

We believe that summative OSCEs are key to identifying gaps in students’ clinical acumen before graduation, and—on the basis of this study’s findings—that we have developed an effective method for remediating students’ deficiencies that does not require as much faculty time as standard formative feedback. We also believe that, when integrated into a longitudinal program focused on student achievement of self-assessment outcomes, this remediation program provides an excellent opportunity for students to reflect on and assess their performance of advanced cognitive and clinical skills.

References

1 March G. Three Short Questions (OSCE Survey). Electronic Mail. Message to Larry Gruppen, January 30, 2007. [unpublished].
2 Ludmerer KM. Time and medical education. Ann Intern Med. 2000;132:25–27.
3 Sanson-Fisher RW, Rolfe IE, Williams N. Competency based teaching: The need for a new approach to teaching clinical skills in the undergraduate medical education course. Med Teach. 2005;27:29–36.
4 Seabrook MA. Medical teachers’ concerns about the clinical teaching context. Med Educ. 2003;37:213–222.
5 White CB, Haftel HM, Purkiss JA, Schigelone AS, Hammoud MM. Multidimensional effects of the 80-hour work week. Acad Med. 2006;81:57–62.
6 Violato C, Lockyer J. Self and peer assessment of pediatricians, psychiatrists and medicine specialists: Implications for self-directed learning. Adv Health Sci Educ Theory Pract. 2006;11:235–244.
7 Fitzgerald JT, White CB, Gruppen LD. A longitudinal study of self-assessment accuracy. Med Educ. 2003;37:645–649.
8 Fitzgerald JT, Gruppen LD, White C. The influence of task formats on the accuracy of medical student self-assessment. Acad Med. 2000;75:737–741.
9 Coutts L, Rogers J. Predictors of student self-assessment accuracy during a clinical performance exam: Comparisons between over-estimators and under-estimators of SP-evaluated performance. Acad Med. 1999;74(10 suppl):S128–S130.
10 Fincher RE, Lewis LA. Learning, experience, and self-assessment of competence of third-year medical students in performing bedside procedures. Acad Med. 1994;69:291–295.
11 Boud D, Falchikov N. Quantitative studies of student self-assessment in higher education: A critical analysis of findings. Higher Educ. 1989;18:529–549.
12 Gordon MJ. Self-assessment programs and their implications for health professions training. Acad Med. 1992;67:672–679.
13 Bloom BS. Taxonomy of Educational Objectives, Handbook I: The Cognitive Domain. New York, NY: David McKay, Inc.; 1956.
14 Norcini J. Setting standards on educational tests. Med Educ. 2003;37:464–469.
15 Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Erlbaum; 1988.
16 Kruger J, Dunning D. Unskilled and unaware of it: How difficulties in recognizing one’s own incompetence lead to inflated self-assessments. J Pers Soc Psychol. 1999;77:1121–1134.
© 2009 Association of American Medical Colleges