Cusimano, Michael D.; Rothman, Arthur I.
Since performance-based tests in medicine, such as the objective structured clinical examination (OSCE), are designed to measure specific, tightly defined competencies, the adoption of criterion-referenced cut-scores is ideally suited to the task of setting standards for the OSCE format, particularly if the results of these examinations will be used for credentialing.1,2 A number of authors3–9 have reported on the use of criterion-referenced approaches to setting standards for examinations in medicine. Kaufman et al.6 showed that if the Angoff procedure were used, it might give rise to a failure rate close to zero for an OSCE, whereas Jolly5 believes that item-based procedures like Angoff might lead to unrealistically high failure rates. Consequently, standards are best set in relation to actual examinee performance. We demonstrated that the Hofstee method could incorporate examinee-based data into a criterion-referencing process.9 This study extends our previous work on test-centered means of standard setting by focusing on examinee-based methods of standard setting.
An alternative to test-centered models of standard setting is to define the standard from a review of the examinees’ performance. There are a number of techniques that achieve this.10 These methods focus on an assessment of the examinees’ performance, either as individual participants or as a group, to define the acceptable standard. Judgments of test content do not enter directly into this group of methods. The borderline-group and contrasting-groups methods are the most practical and widely used methods of this category.8
In the borderline-group method, it is essential that the judges know or are able to determine the level of knowledge or skills of the examinees in the domain sampled by the competency test they will be asked to judge. The method requires that judges be able to identify what they view as minimally acceptable performance. The mean or median score of performances that are identified as minimally acceptable, or borderline, is defined as the cutoff score. Various modifications of the procedure exist,11 but the central act of judges identifying the borderline performance of examinees remains the identifying characteristic of this method.
Like the borderline-group method, the principal focus of judgment in the contrasting-groups method is on the competence of examinees rather than on factors such as the difficulty or relevance of items on a test. In this method, groups of examinees previously identified as certain masters or nonmasters are given the examination. Their score distributions are plotted and the point of intersection is taken as the initial standard.8
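As a rough illustration of the contrasting-groups idea, the following Python sketch fits a normal curve to each group's scores and searches between the two group means for the point where the fitted curves cross. The normal approximation, the bisection search, the function name `contrasting_groups_cutoff`, and the sample scores are all assumptions made for illustration. The method as described above specifies only that the score distributions are plotted and their point of intersection taken.

```python
from statistics import NormalDist


def contrasting_groups_cutoff(nonmaster_scores, master_scores, iterations=100):
    """Approximate the score at which two fitted normal curves intersect.

    Assumption: each group's scores are summarized by a normal density.
    """
    nonmasters = NormalDist.from_samples(nonmaster_scores)
    masters = NormalDist.from_samples(master_scores)
    if nonmasters.mean >= masters.mean:
        raise ValueError("nonmasters are expected to score lower on average")

    def gap(x):
        # Positive where the nonmaster curve lies above the master curve.
        return nonmasters.pdf(x) - masters.pdf(x)

    low, high = nonmasters.mean, masters.mean
    if gap(low) <= 0 or gap(high) >= 0:
        raise ValueError("fitted curves do not cross between the group means")

    # Bisection: keep the crossing point bracketed between low and high.
    for _ in range(iterations):
        mid = (low + high) / 2
        if gap(mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2


# Hypothetical station scores (percent of checklist items done correctly).
nonmaster_scores = [40, 50, 60]
master_scores = [60, 70, 80]
cutoff = contrasting_groups_cutoff(nonmaster_scores, master_scores)
```

Because the two hypothetical groups have the same spread, the fitted curves cross midway between the group means. With unequal spreads the crossing point shifts toward the group with the narrower distribution.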
Rothman and colleagues12,13 have described their experience using the contrasting-groups and borderline-group methods to define cutoff scores for the large-scale Part II Medical Council of Canada licensing examination that involved hundreds of examiners and more than a thousand candidates. Kaufman et al.6 and Searle14 evaluated a modified borderline-group method for a medical school–sized OSCE, but measures of the consistency of the standards set were lacking. The apparent feasibility and the face validity of the examinee-based methods such as the borderline-group method provide support for their increasing adoption by health profession schools. Before that can occur, however, more information on the quality of the standards produced by these techniques is required. The purpose of the present study was to assess the quality of the standards produced on a small-scale OSCE by the borderline-group and contrasting-groups examinee-based standard-setting procedures.
The model chosen to evaluate the standard-setting procedures was an OSCE that is used to assess students at the end of their fourth-year clerkship (final undergraduate year) surgical rotation (55–60 students at a time). The examinations consisted of three couplets (a five-minute patient encounter and a five-minute postencounter probe), which were considered as six stations for the purposes of this study, and three patient-encounter ten-minute stations. Students were given a station score on the basis of the percent of checklist items done correctly at each station. Physician examiners (one per station) were also asked to rate observed student performance as honors, pass, borderline, or failing according to the level of performance expected of a student completing his or her fourth-year surgical clerkship. Performance data for the 1992–93 and 1993–94 administrations of this examination were used for analyses. The students who were rated “borderline” were called “the borderline group” for that station. For the borderline-group method, the cutoff score for that station was set at the mean score of students who comprised the “borderline group.”
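The borderline-group calculation described above can be sketched in a few lines of Python. The function names, the 20-item checklist, and the sample data are hypothetical; only the two rules (station score as percent of checklist items done correctly, and cutoff as the mean score of students rated borderline) come from the text.

```python
from statistics import mean


def station_score(items_done_correctly, total_items):
    """Station score as the percent of checklist items done correctly."""
    return 100.0 * items_done_correctly / total_items


def borderline_group_cutoff(scores, ratings):
    """Mean station score of the students whose global rating was 'borderline'."""
    borderline = [s for s, r in zip(scores, ratings) if r == "borderline"]
    if not borderline:
        raise ValueError("no student was rated borderline at this station")
    return mean(borderline)


# Hypothetical data for one station with a 20-item checklist.
items_correct = [18, 15, 11, 12, 13, 8]
ratings = ["honors", "pass", "borderline", "borderline", "borderline", "failing"]
scores = [station_score(n, 20) for n in items_correct]
cutoff = borderline_group_cutoff(scores, ratings)
```

The sketch raises an error when no student is rated borderline, since the cutoff is undefined for such a station.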
The contrasting groups consisted of 27 randomly selected third-year medical student volunteers who had no clerkship experience and 27 surgical residents who had graduated from medical school six to 30 months before the OSCE. Each group took an OSCE identical to the OSCE for fourth-year medical students. The examiners of the OSCE administered to the contrasting groups were not told to which group (third-year student or resident) each examinee belonged.
All examiners were licensed physicians. The examiners for the contrasting-groups OSCE included both surgeons and nonsurgeons, whereas for the regular surgical examinations for year-four medical students (clinical clerks), all examiners (n = 189) were surgical faculty members. All participants in the contrasting-groups experiment received a small honorarium for their participation; an incentive was offered to those participants achieving one of the highest three scores in each group.
The station was used as the central focus of analysis. The resultant cutoff scores and the decisions (pass/fail) arising from the cutoff scores of both standard-setting methods were assessed on a station-by-station basis. The means, ranges, standard errors, and percentage of students who would have failed if the mean standard were applied to that station were calculated. The indices proposed by Meskauskas1 and Jaeger15 and a modification of the Jaeger2 index (using third and first quartiles) were used to assess consistency of judges’ standards and of decisions arising from judges’ standards for each station for each of the standard-setting procedures.9 The goal here was to determine whether the judges provided consistent judgments of the pass/fail point for a particular station using the same method and what the effects of these judgments would be on pass/fail rates for that station. Results for the contrasting-groups method were based on nine stations and 54 candidates. Since stations were the focus of analysis, results for the borderline-group method were based on a total of 16 stations administered in eight examinations (not all identical) with a total of 464 students (average 58 students per examination) and a total of 189 examiners.
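The descriptive statistics named above (mean, range, standard error, and percentage failing at the mean standard) might be computed as in the following sketch. The function names and data are hypothetical, and treating a score strictly below the cutoff as a failure is an assumption. The Meskauskas and Jaeger indices are not reproduced here because their formulas are not given in this text.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_cutoffs(cutoffs):
    """Mean, range, and standard error of the mean for a set of cutoff scores."""
    return {
        "mean": mean(cutoffs),
        "range": (min(cutoffs), max(cutoffs)),
        "se_mean": stdev(cutoffs) / sqrt(len(cutoffs)),
    }


def failure_rate(student_scores, cutoff):
    """Percent of students scoring below the cutoff (assumed failing rule)."""
    failed = sum(1 for s in student_scores if s < cutoff)
    return 100.0 * failed / len(student_scores)


# Hypothetical cutoff scores and student station scores (percent).
cutoffs = [52.0, 56.0, 60.0, 64.0]
student_scores = [45, 50, 57, 59, 62, 66, 70, 74, 81, 90]
summary = summarize_cutoffs(cutoffs)
rate = failure_rate(student_scores, summary["mean"])
```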
Consistency of Judges’ Standards
Overall, 5.8% of station performances were rated as borderline (range, 0.7–8.7%). The borderline-group method had a lower cutoff score (55.2%) than both the contrasting-groups technique (61.7%) and the traditional holistic station pass score (60%). Student-performance station scores for the borderline group ranged from 32% to 100%, but clustered around the holistic cut point of 60%. Although the borderline-group technique gave a broader range of passing scores, its standard error of the mean (SEMean) was smaller (1.2 versus 5.8) than that of the contrasting-groups technique. The borderline-group technique yielded a better stability index, SIS1 (> 1), than the contrasting-groups technique (3.0 versus 0.36), indicating that the interjudge variability was less than the intercandidate variability with the borderline-group technique.
Consistency of the Decision Arising from the Standard
The lower failure rate of the borderline-group method (23.6% versus 32.9% for the contrasting-groups method and 32.9% for the traditional holistic cutoff score) did not result in a cutoff score that would be unacceptable to faculty or students. The indices of consistency of the pass/fail decision were better for the borderline-group technique (Jaeger1 = 1.5; Jaeger2 = 10.4; Jaeger3 = 1.7) than the contrasting-groups method (Jaeger1 = 2.6; Jaeger2, not applicable; Jaeger3 = 83.5).
Since the initial work of Cohen et al.,16 Norcini and Shea,17 and Rothman et al.,18 a number of authors have applied the borderline-group method to SP or OSCE stations. Kaufman and co-workers6 applied the borderline-group method to the OSCE and found that failure rates would have been substantially lower than those with their holistic standard (i.e., 60%). Searle14 and Wilkinson et al.19 applied the borderline-group method to undergraduate medical school examinations and found that the method was practical and showed some evidence of validity. However, neither Searle14 nor Kaufman et al.6 provided estimates of the consistency of the standards or the decisions arising from those standards. Wilkinson et al.19 used a compensatory test standard and showed that aggregating passing scores from a minimum of three stations was required to achieve reliable definitions of borderline students. Humphrey-Murto and MacFadyen20 showed significant differences in pass rates obtained with a case–author item-judgment technique and a modified borderline-group technique. They concluded that the modified borderline-group technique was more costly in terms of physician–examiners’ hours, but provided the potential for feedback to students. Since our school routinely has faculty act as examiners for the OSCE in surgery, application of the borderline-group technique was seamless. Assembling a group of faculty for the sole purpose of an item-judgment standard-setting exercise would have been much more difficult and costly.
A shortcoming of the borderline-group technique, as conceived of in our study and others in the literature, is that it is a fully compensatory model still coupled to checklist scores. In this type of model, the test taker who correctly does the checklist items the judge considers essential may get a passing score similar to that of the test taker who correctly does those checklist items the judge considers less important. The overlap of distribution of passing, failing, and borderline scores in our study and others attests to this.
A noncompensatory model that allows judges to use their full expertise can at least partially address this problem. Clauser et al.21 showed that judges never use a fully compensatory model when judging students’ pass/fail performance. Working with the National Board of Medical Examiners’ computer-based case simulations and using all actions taken by the examinee, Clauser and colleagues22,23 also showed that such an approach can produce results that are equivalent to the ratings of highly trained experts.
McIlhenny and Orr24 have argued that the passing score set with the borderline-group technique is sample dependent, is not based on a rigorous criterion of competence, and should not be used in medicine. However, our standard-setting process required that judges make ratings of examinee performance. These ratings arose from the internal heuristic of competence that each examiner brought to the assessment process. That heuristic is no different from the one brought to bear in an item-centered standard-setting process like the Angoff procedure, which asks judges to examine each test item to determine the probability that the “minimally acceptable” person would answer each item correctly. We made no attempt at training examiners to identify the borderline candidate. Others have shown that ratings achieved in this way are actually quite reproducible.19,25 We would argue that the principal focus of judgment in both the borderline-group and contrasting-groups methods is on the competence of the examinees, rather than on factors such as the difficulty or relevance of the items on a test. In this way, these techniques maintain a criterion frame of reference since they depend on expert examiners observing and judging performance as acceptable or not.
Relevant to the discussion of these techniques is the issue of where to draw the line. In this study's data set, the use of the third rather than the first quartile of the range of borderline scores would have nearly doubled the failure rate, from 16.3% to 27.7%. A variety of other methods that use likelihood ratios and linear and quadratic discriminant functions have been used to set the standard in examinee-based methods.26,27 These statistical techniques require large sample sizes to be valid. It is unlikely that the necessary sample sizes could be achieved in anything but the largest OSCEs; averaging remains simple and easily understood among stakeholders.
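The sensitivity of the failure rate to where the line is drawn can be illustrated with a small sketch. It reads "quartile" as a quartile of the distribution of borderline-group scores, which is an assumption, as are the function names, the inclusive quantile method, the rule that a score below the cutoff fails, and the sample data. The figures it produces are illustrative only and are unrelated to the 16.3% and 27.7% reported above.

```python
from statistics import quantiles


def quartile_cutoffs(borderline_scores):
    """First and third quartiles of the borderline-group scores."""
    q1, _median, q3 = quantiles(borderline_scores, n=4, method="inclusive")
    return q1, q3


def failure_rate(student_scores, cutoff):
    """Percent of students scoring below the cutoff (assumed failing rule)."""
    failed = sum(1 for s in student_scores if s < cutoff)
    return 100.0 * failed / len(student_scores)


# Hypothetical borderline-group scores and scores of all students (percent).
borderline_scores = [50, 54, 58, 62, 66]
all_scores = [48, 52, 55, 57, 60, 61, 63, 65, 70, 80]

q1, q3 = quartile_cutoffs(borderline_scores)
rate_at_q1 = failure_rate(all_scores, q1)
rate_at_q3 = failure_rate(all_scores, q3)
```

In this made-up data set, moving the cutoff from the first to the third quartile raises the failure rate, which is the direction of the effect described above.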
The strength of a focus on examinee competence in the methods reported in the present study may also be one of their weaknesses. In an OSCE or SP-based medical examination, judging the competence of a particular performance requires expertise that is unlikely to be found in anyone but an expert such as a physician. Since even most experts have a limited range of practice, it is extremely unlikely that nonexperts would be well suited to make judgments about the relative weighting of different behaviors in the overall ratings of an SP scenario. Furthermore, a number of expert examiners are desirable so that ratings of performance can be averaged over a number of observers. These limitations may make these techniques unfeasible in certain circumstances. Alternatively, for schools that routinely use expert physician-examiners for medical OSCEs, assembling a standard-setting panel for an item-centered standard-setting exercise may be impossible. For these types of schools, continuously running a borderline-group method would be very feasible and desirable. However, for schools with nonphysician examiners, one might only perform the standard-setting procedure when significant changes occur in the test, its purpose, its raters, or its examinees.
Since cutoff scores vary as a function of test-score and criterion-performance distributions, the indices used here to assess the quality of a standard-setting method can be limited. The G coefficient, developed within the framework of generalizability theory, depends on ratios of variances and addresses this limitation of the Meskauskas and Jaeger indices. The approach allows the researcher to assess the relative magnitudes of variance components across different methods. This approach is the subject of another study.
Contrasting-groups and borderline-group standard-setting procedures have a sufficient theoretical and empirical basis for use in small-scale OSCEs that use a sound test-development strategy and experts such as physicians as examiners.
1. Meskauskas JA. Setting standards for credentialing examinations. Eval Health Prof. 1986;9:187–203.
2. Hambleton RK, Rogers HJ. Technical advances in credentialing examinations. Eval Health Prof. 1986;9:205–29.
3. Stillman P, Swanson D, Regan MB, et al. Assessment of clinical skills of residents utilizing standardized patients. Ann Intern Med. 1991;114:393–401.
4. Dauphinee D, Blackmore D. Setting minimal passing standards for the qualifying examination of the Medical Council of Canada: the transition from norm-referencing to criterion referencing. In: Rothman AI (ed). Proceedings from the Sixth Ottawa Conference on Medical Education. Toronto, ON: University of Toronto Bookstore Custom Publishing, May 1995:245–7.
5. Jolly B. Setting standards for tomorrow's doctors. Med Educ. 1999;33:792–3.
6. Kaufman DM, Mann KV, Muijtjens AMM, van der Vleuten CPM. A comparison of standard setting procedures for an OSCE in undergraduate medical education. Acad Med. 2000;75:267–71.
7. Verhoeven BH, Verwijnen GM, Muijtjens AMM, Scherpbier AJJA, van der Vleuten CPM. Panel expertise for an Angoff standard setting procedure in progress testing: item writers compared to recently graduated students. Med Educ. 2002;36:860–7.
8. Cusimano MD. A study of standard setting methods in an objective structured clinical examination [PhD thesis]. Toronto, ON: Ontario Institute for Studies in Education, University of Toronto, 1997.
9. Cusimano MD, Rothman A. The effect of incorporating normative data into a criterion-referenced standard setting in medical education. Acad Med. 2003;78(10 suppl):388–91.
10. Cusimano MD. Standard-setting in medical education. Acad Med. 1996;71(10 suppl):S112–20.
11. Zieky MJ, Livingston SA. Manual for Setting Standards on the Basic Skills Assessment Tests. Princeton, NJ: Educational Testing Service, 1977.
12. Rothman A, Blackmore D, Cohen R, Reznick R. The consistency and uncertainty in examiners’ definitions of pass/fail performance on OSCE stations. In: The Annual Meeting of the American Educational Research Association. San Francisco: AERA, April 1995.
13. Rothman AI, Blackmore D, Dauphinee WD, Reznick R. The use of global ratings in OSCE station scores. Adv Health Sci Educ. 1996;1:215–9.
14. Searle J. Defining competency—the role of standard setting. Med Educ. 2000;34:363–6.
15. Jaeger RM. Certification of student competence. In: Linn RL (ed). Educational Measurement. New York: Macmillan, 1989;485–515.
16. Cohen R, Rothman AI, Poldre P, Dirks F, Ross J. Defining test cutting scores in a criterion referenced multistation clinical skills examination. In: Hart GR (ed). Current Developments in Assessing Clinical Competence. Montreal: Can-Heal Publications, 1992:137.
17. Norcini J, Shea J. Equivalent estimates of borderline group performance in standard setting. J Educ Meas. 1992;29:19–24.
18. Rothman AI, Poldre P, Cohen R, Ross J. Standard setting in a multiple station test of clinical skills. Paper presented at the Annual Meeting of the American Educational Research Association, Atlanta, GA, April 1993.
19. Wilkinson TJ, Newble DI, Frampton CM. Standard setting in an objective structured clinical examination: use of global ratings of borderline performance to determine the passing score. Med Educ. 2001;35:1043–9.
20. Humphrey-Murto S, MacFadyen JC. Standard setting: a comparison of case author and modified borderline-groups methods in a small-scale OSCE. Acad Med. 2002;77:729–32.
21. Clauser B, Orr N, Clyman S. Models for making pass/fail decisions for performance assessment involving multiple cases. In: Rothman AI (ed). Proceedings of the Sixth Ottawa Conference on Medical Education, Ottawa, ON. Toronto: University of Toronto Bookstore Custom Publishing, May 1995:239–42.
22. Clauser BE, Margolis MJ, Clyman SG, Ross LP. Development of automated scoring algorithm for complex performance assessments: a comparison of two approaches. J Educ Meas. 1997;34:141–61.
23. Clauser BE, Swanson DB, Clyman SG. A comparison of the generalizability of scores produced by expert raters and automated scoring systems. Appl Meas Educ. 1999;12:281–99.
24. McIlhenny C, Orr G. Standard setting in an objective structured clinical examination: use of global ratings of borderline performance to determine the passing score. Med Educ. 2002;36:388–95.
25. Cunnington JPW, Neville AJ, Norman GR. The risks of thoroughness: reliability and validity of global ratings and checklists in an OSCE. Adv Health Sci Educ. 1997;1:227–33.
26. Koffler SL. A comparison of approaches for setting proficiency standards. J Educ Meas. 1980;17:167–78.
27. Livingston S, Zieky M. Passing Scores: A Manual for Setting Standards of Performance on Educational and Occupational Tests. Princeton, NJ: Educational Testing Service, 1982.