The objective structured clinical examination (OSCE) is considered a state-of-the-art method of testing students' clinical skills in medicine. Large, high-stakes examinations use OSCEs to determine competency and, ultimately, licensure. Standard setting for OSCEs involves choosing a pass/fail or “cut” score that represents the level of competence students should possess for a skill or purpose assessed by the OSCE. Despite the importance of establishing cut scores, the methods for determining them for OSCEs are not well established. Many favor examinee-centered methods using collective assessments from expert judges. The modified borderline-group (MBG) method is a criterion-referenced, examinee-centered method of setting cut scores that is based on the concept that the cut score is best represented by the score obtained by a borderline test taker. Using the MBG method, judges identify test takers who are borderline (i.e., their performances are right around the performance standard), and the cut score is calculated as the mean score of this borderline group. The Medical Council of Canada presently uses this method in the high-stakes OSCE examinations required for medical licensure.
In our institution, we have used the case-author method of standard setting for undergraduate OSCEs. It has proved easy to implement, but often has led to cut scores close to the 60% mark, a faculty-wide standard. The MBG method may be helpful in the undergraduate setting because of its simplicity and ability to compare a student's performance with a standard contributed by many observers, rather than the standard established by a single case author who does not actually observe the OSCE encounter.
This study compared the cut scores derived from the MBG method and the case-author method of standard setting in an undergraduate OSCE, along with their effects on subsequent pass/fail decisions. We review the feasibility of using the MBG method of standard setting in a small-scale OSCE and add to the sparse literature on comparing standard-setting methods for the OSCE.
A formative OSCE examination was administered to 61 fourth-year medical students at the University of Ottawa Faculty of Medicine in February 1998. This examination was mandatory and has been run annually for a number of years. Twenty-seven students were not available for the examination because they were out of town on electives.
The OSCE examination was composed of ten stations. Eight involved encounters with standardized patients and two were written stations. A priori, we decided that the two written stations would not be included in this study because they were not part of the standard-setting process. The clerkship coordinators acted as case authors and developed the scenarios, including patients' scripts and item checklists. Stations with history taking were assessed for content as well as communication skills. Communication checklists were included for four of the stations. Case authors were asked to establish a cut score (out of a maximum score of 10) based on their estimates of the minimally acceptable level of performance for a graduating medical student.
The 27 physician-examiners (20 faculty and seven senior medical residents) were given a 15-minute orientation to the scoring principles of the exam. In each station, one physician-examiner observed the students interacting with the standardized patient and completed the checklists. At the end of each encounter, the physician-examiners completed a global rating scale, originally developed by the Medical Council of Canada, and responded to the question, “Did the candidate respond satisfactorily to the needs/problems presented by this patient?” Responses to the six-level rating scale were categorized as satisfactory (excellent, good, or borderline) or unsatisfactory (borderline, poor, or inferior).1 The physician-examiners were instructed to judge the students “at the level of competency necessary for a graduating medical student.”
Cut scores for the individual stations also were established using the MBG method. Each station's cut score was the mean of the case scores of students who were rated borderline satisfactory or borderline unsatisfactory on the global ratings. The score for each student was the sum of the checklist items identified as correct. When present, the communication checklists comprised 20% of the score. The overall examination's cut score was the sum of the scores of each station.
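The MBG calculation for a single station can be illustrated with a minimal sketch. It assumes each encounter is recorded as a (station score, global rating) pair; the function name, the rating labels, and the sample data are hypothetical and are not taken from the study.

```python
from statistics import mean

# Global ratings that place an examinee in the borderline group
# (label strings are illustrative assumptions).
BORDERLINE = {"borderline satisfactory", "borderline unsatisfactory"}


def mbg_cut_score(encounters):
    """Return the mean station score of examinees rated borderline.

    `encounters` is an iterable of (score, global_rating) pairs.
    """
    scores = [score for score, rating in encounters if rating in BORDERLINE]
    if not scores:
        # With no borderline examinees the MBG cut score is undefined.
        raise ValueError("no borderline examinees; cut score undefined")
    return mean(scores)


# Hypothetical encounters for one station, scored out of 10.
station = [
    (8.0, "good"),
    (6.0, "borderline satisfactory"),
    (5.0, "borderline unsatisfactory"),
    (3.0, "poor"),
    (5.5, "borderline satisfactory"),
]

print(mbg_cut_score(station))  # mean of 6.0, 5.0, and 5.5 -> 5.5
```

The sketch raises an error when no examinee is rated borderline, which is the small-sample risk this kind of method carries: the cut score rests entirely on the borderline group.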
For data analysis, each encounter rated by a physician-examiner was considered an independent event. Statistical analyses were carried out using a standard statistical software package.
The cut scores from both methods and the mean case scores for the individual stations and for the overall exam are presented in Table 1. The mean case score reflects the level of difficulty for each station. The overall cut score for the case-author method (5.77) was higher than was the cut score for the MBG method (5.31). The case-author cut scores for individual stations were higher than were the MBG cut scores for all stations except the pediatrics case.
The numbers of borderline ratings for the stations are also represented in Table 1. Of 61 students, 12 (19.6%) to 39 (63.9%) were identified as borderline on a given station. Table 2 presents the numbers of students failing each station and the entire examination based on the cut-score method used. A total of 42.2% failed the entire exam when the case-author cut score was used, and 15.25% failed the exam when the MBG cut score was used.
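How a change in cut score translates into a change in failure rate can be sketched as follows. The two overall cut scores (5.77 and 5.31) are those reported above; the student scores are invented for illustration, and the rule that a score strictly below the cut is a failure is an assumption, since the report does not state how ties were handled.

```python
def failure_rate(scores, cut_score):
    """Fraction of examinees scoring strictly below the cut score.

    Treating "below the cut" as a failure is an assumption of this sketch.
    """
    failed = sum(1 for s in scores if s < cut_score)
    return failed / len(scores)


# Hypothetical overall scores (out of 10) for ten students.
overall_scores = [4.9, 5.2, 5.4, 5.6, 5.8, 6.0, 6.3, 6.7, 7.1, 7.4]

case_author_rate = failure_rate(overall_scores, 5.77)  # case-author cut score
mbg_rate = failure_rate(overall_scores, 5.31)          # MBG cut score

print(f"case-author: {case_author_rate:.0%}, MBG: {mbg_rate:.0%}")
```

In this invented cohort, a difference of less than half a point in the cut score doubles the failure rate, because several scores cluster between the two cut points.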
In this study, the case-author and MBG methods of determining cut scores yielded different cut scores and resulting failure rates. This finding agrees with other comparative studies, which have found significant differences in OSCEs' pass/fail scores that are based on different methods of standard setting.2
Cusimano et al. compared the Angoff and Ebel methods of standard setting.3 The Angoff method's cutoff scores were lower, but their mean standard errors were higher than were those of the Ebel method. The authors concluded that the Angoff method performed poorly and should not be used to set OSCE cut scores. In another study, the norm-referenced method was compared with the Ebel and contrasting-groups methods in an OSCE involving 310 internal medicine residents. The Ebel method gave higher pass rates than did the norm-referenced method, but the contrasting-groups method gave unrealistically low pass rates.4 Similarly, Travis and colleagues noted disappointing results during a medicine OSCE examination for fourth-year students.5 They compared the case-author cut-score decisions with a “master/non-master” decision. The case authors, who reviewed printouts of students' performances for the cases they had written, determined the contrasting groups. The mean kappa was only 0.26, indicating little agreement beyond chance. Travis and colleagues had expected much better agreement between results of standard setting developed by the case author and the case author's global ratings of performances on those cases, given that the case author might recall the checklist and assign a weight to each item.
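For readers unfamiliar with the statistic, the sketch below shows how agreement beyond chance is computed, assuming the kappa reported is Cohen's kappa for two sets of pass/fail decisions. The function and the decision lists are hypothetical and do not reproduce data from any of the cited studies.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical decisions."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    # Observed proportion of cases on which the two decisions agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in set(a) | set(b)) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical decisions for ten students: cut-score decision versus
# master/non-master judgment.
cut_score_decision = ["pass"] * 6 + ["fail"] * 4
master_decision = ["pass"] * 4 + ["fail"] * 2 + ["pass"] + ["fail"] * 3

print(round(cohens_kappa(cut_score_decision, master_decision), 2))
```

Here the two sets of decisions agree on 7 of 10 students, but half of that agreement would be expected by chance alone, so kappa is only about 0.4.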
To date, there is only one study comparing the MBG method with others in an undergraduate OSCE: Kaufman and colleagues6 compared the MBG method with three other standard-setting methods (Angoff, relative, and holistic). The Angoff and MBG methods produced similar passing scores, but varied significantly from the other methods. The authors concluded that the Angoff and MBG methods yielded reasonable and defensible approaches.6 Clearly, research into standard setting for the OSCE, which Cusimano has summarized in a recent review,4 remains impoverished.
In our study, as in many others, the lack of agreement of pass/fail decisions based on methods that set out, presumably, to measure the same outcome is disturbing. The obvious question is why there is such a lack of agreement. One reason is that the case-author and MBG methods have fundamentally different constructs. In the case-author method, the case author establishes a cut score based on his or her personal perspective of what constitutes an adequate performance prior to the exam. The score represents the minimally acceptable level of competency for the case. The MBG method, on the other hand, is examinee-centered, establishing a cut score by using the mean score of all test takers identified as borderline. The individual performances that are close to the minimally acceptable standard, either borderline satisfactory or borderline unsatisfactory, determine the case's cut score. This method differs from the case-author, test-centered method in which the case's cut score defines a borderline performance. Another reason may be that case authors develop cut scores in the absence of any performance data and often have unrealistically high expectations. Other studies have shown this may be the case. Morrison et al.7 and Norcini et al.,8 in two independent studies, found that judges who were given performance data developed lower cut scores. Performance data provide judges with insight into the capabilities of real students. Finally, case authors, as in this study, are often content experts and may expect more from students and, therefore, elevate the cut score.
How can we reconcile differences in standard setting when all of the methods used set out to measure the same thing? The differences are certainly disturbing and do call into question the validity of our methods. It has generally been conceded that any pass/fail point is arbitrary, and that the arbitrariness involves two assumptions: that the passing score corresponds to a specified performance level and that the specified standard is appropriate. Attempting to establish the correctness of the standard may not be possible, but collecting evidence that supports the credibility of the standard is appropriate. Norcini and Shea have proposed criteria to assess the credibility of standards.9 They support the use of absolute standards based on informed judgment by qualified standard setters, with a demonstration of due diligence. Last, they stress that standards should be realistic. If these criteria are applied to our study, an argument can be made in favor of using the MBG method over the case-author method. Both methods yielded absolute standards and had qualified judges. The judges in the MBG method, however, based their judgments on the entire spectrum of skills observed, which produced better face validity. Also, the MBG method produced the more realistic cut score, with failure rates considered more “reasonable” for an undergraduate OSCE. The cost in terms of physician-examiners' hours was certainly higher for the MBG method, but the added benefits and the potential to provide immediate feedback to students warrant the added expense.
The Medical Council of Canada has used the MBG method for several years in large examinations (over 1,500 test takers). In our study, despite the small number of test takers, the MBG method appeared to perform well. Our main concern was that a limited number of test takers would be identified as borderline, thus establishing cut scores based on very few individuals. In this study, anywhere from 12 to 39 test takers were identified as borderline and appeared to provide an adequate sampling. Just how many borderline test takers are required is unclear, but the study by Kaufman and colleagues6 also supports the use of the MBG method in a smaller OSCE.
The conclusions of this study are of limited generalizability. They apply to performance-based testing, specifically an undergraduate OSCE. Also, because the examination was formative, the students may not have prepared as diligently, which may have inflated the failure rates.
Although the MBG method appears to be the more credible of the two, several aspects of this method could be refined. Improvements could begin with defining the goals of the process with the physician-examiners and discussing their individual expectations. Judges for standard setting should be trained by reviewing actual cases to allow for a collective agreement on performance expectations and to clarify items on the checklist. Future studies examining the impact of training for judges would be useful.
The case-author and MBG methods of setting standards produced different cut scores in an undergraduate OSCE. The case-author method produced unrealistically high failure rates, and its ongoing use is neither justified nor defensible for medical school OSCEs. Overall, the MBG method was the more credible and defensible method of standard setting in this study, and appears well suited to even a small-scale OSCE.
1. Dauphinee WD, Blackmore DE, Smee S, et al. Using the judgments of physician examiners in setting standards for a national multicenter high stakes OSCE. Adv Health Sci Educ. 1997;2:204.
2. Mills CN. A comparison of three methods of establishing cut-off scores on criterion-referenced tests. J Educ Meas. 1983;20:283–92.
3. Cusimano M, Cohen R, Hutchison C, et al. A Study of Standard Setting Methods in an OSCE. Proceedings from the 7th Ottawa International Conference on Medical Education and Assessment, June 1996, Maastricht, The Netherlands:93.
4. Cusimano MD. Standard setting in medical education. Acad Med. 1996;71(10 suppl):S112–S120.
5. Travis TA, Colliver JA, Robbs RS, et al. Validity of a simple approach to scoring and standard setting for standardized-patient cases in an examination of clinical competence. Acad Med. 1996;71(1 suppl):S84–S86.
6. Kaufman DM, Mann KV, Muijtjens AM, et al. A comparison of standard-setting procedures for an OSCE in undergraduate medical education. Acad Med. 2000;75:267–71.
7. Morrison H, McNally H, Wylie C, et al. The passing score in the objective structured clinical examination. Med Educ. 1996;30:345–8.
8. Norcini JJ, Shea JA, Kanya DT. The effect of various factors on standard setting. J Educ Meas. 1988;25:56–65.
9. Norcini JJ, Shea JA. The credibility and comparability of standards. Appl Meas Educ. 1997;10:35–59.