In a 1990 invited review, the late George Miller, one of the pioneers in the field of medical education, issued a stern warning “to abandon the comfortable camouflage of normative procedures and adopt criterion-referenced testing.”1 Miller saw the practice of norm-referenced scoring as detrimental not only to medical training but also, more broadly, to patient care. Despite his impassioned plea more than two decades ago, medical education researchers still grapple today with the same challenge—how to develop and apply appropriate performance standards for clinical performance assessment. Although researchers have focused on the development and use of performance-based assessments to evaluate clinical competence, few have investigated how such assessments should be scored and the impact of that scoring on educators’ decisions about students’ clinical competence.
For many faculty, developing defensible, appropriate standards of clinical performance is a daunting task, often misunderstood within medical education.2 In standardized patient (SP) examinations, such as an objective structured clinical examination (OSCE), an SP or other trained observer evaluates the student’s performance based on his or her completion of the items on a behaviorally grounded checklist. How to construct scores from these checklists and how to use those scores to make pass/fail judgments about students remain topics of concern for medical educators.3–5 Although a vast amount of literature exists on the topic of standard-setting, there is no consensus on the best methods and strategies for developing such standards.6
Standard-setting methods fall into two categories—norm-referenced, or relative methods; and criterion-referenced, or absolute methods. Norm-referenced standards involve the comparison of a student’s performance with that of his or her peers. The outcome (pass/fail) of the student’s performance is therefore dependent on the performance of the other students who took the same examination. Criterion-referenced standards compare a student’s performance with a predetermined standard; the outcome (pass/fail) of the student’s performance is independent of that of his or her peers. This fundamental difference between relative and absolute methods emphasizes standard-setting as an important factor in the evaluation of student performance. If the objective of the examination is to rank students, then a norm-referenced standard is warranted. However, if the objective of the examination is to determine whether or not a student has mastered the minimum requirements of a particular domain, such as clinical competence, then a criterion-referenced standard is needed. Though most medical educators recognize that OSCEs merit the use of criterion-referenced standards, they continue to rely on norm-referenced standards simply because these standards are well known and widely understood within the field and because of the time-consuming, labor-intensive, and often costly nature of developing criterion-referenced standards.2,7,8
Developing criterion-referenced standards often involves convening a panel of experts to determine what constitutes a minimally acceptable performance. This process may entail rating or ranking checklist items to achieve consensus on a performance standard. Item- or test-centered strategies for standard-setting on written tests like the Angoff and modified Angoff methods9 are often applied to OSCEs,10 though some person-centered methods have also been developed specifically for clinical assessment in medicine.11–14 Of note, no one strategy is considered the gold standard in standard-setting, and the use of different strategies can often yield different performance standards for an examination.15,16 In addition, the use of the same strategy by different expert panelists has been shown to yield different performance standards for the same clinical competency examination.17 Therefore, a standard is often judged by the very process used to set that standard, including the credibility of the experts who set it as well as their strategy to codify their ratings. A defensible process of standard-setting is reproducible and unbiased.18 What such a process looks like, however, can vary widely depending on the context and content of the examination and the resources that are available to the institution.
In recent years, some institutions have applied one such method of standard-setting, named the “critical element” or “critical action” approach,19 designed in part to better suit the objectives of clinical assessment. This approach to absolute standard-setting attempts to resolve the difficulties found in adhering to a complex and strict item-selection process that does not lend itself well to clinical assessment. A two-step process, the critical element approach first asks faculty to identify the behaviors from an SP checklist that are critical to successfully passing the station, followed by a second rating process designed to achieve consensus among faculty on the inclusion of these behaviors. Payne and colleagues20 applied a similar approach to the identification of a key set of checklist items, called critical actions, the performance of which is “critical to ensure an optimal patient outcome and avoid medical error.” Each student must perform the critical behaviors included in the resulting list to pass the examination.
In 2008, the David Geffen School of Medicine at UCLA (hereafter, UCLA) developed and piloted locally a new criterion-referenced standard, based on the critical actions approach, for a high-stakes clinical performance examination administered to all medical students at the end of their third year. The examination has important consequences in that all students must pass it to graduate from medical school. Using this new standard, intended to eventually replace the original, norm-referenced standard, resulted in a new construction and interpretation of scores and subsequently yielded a different group of failing students from that of the norm-referenced standard, which had been in place for 15 years. Our study of student performance data from three medical schools—UCLA; the University of California, San Francisco, School of Medicine (UCSF); and the Keck School of Medicine of the University of Southern California (USC)—uses generalizability theory (G theory) to determine the reliability of students’ scores based on a criterion-referenced standard, comparing the generalizability of their scores constructed using the current norm-referenced standard with those constructed using the newly developed critical actions criterion-referenced approach.
Developed by a consortium of eight medical schools in California, the multistation clinical performance examination (CPX) is administered to all consortium medical students at the end of their third year.
We obtained performance data for 477 students from one private (USC) and two public (UCLA and UCSF) MD-degree-granting consortium medical schools who completed the same six CPX cases in 2008. USC used the data management system Educational Management Solutions to capture performance data, whereas UCLA and UCSF used METI Learning Space. We received deidentified performance data for all students who completed the CPX in 2008: 175 from USC, 152 from UCLA, and 150 from UCSF. We did not obtain demographic data as part of this study.
The CPX consists of eight 15-minute OSCE stations. Of those eight stations, six are common across all schools; each school individually selects an additional two stations to administer to its students. During each 15-minute station, students are instructed to interview the patient, perform an appropriate physical examination, and offer a differential diagnosis and plan while establishing and maintaining good patient–physician rapport. At the end of each station, the student exits the room, and the SP completes a checklist, indicating “done” or “not done” for a series of performance items, and also rates several communication items on a Likert-style scale.
An expert panel of five primary care faculty members was charged with the task of creating a new absolute standard for evaluating students’ performance on the CPX using the critical actions approach. The panel members were asked to determine, “What is the minimal set of actions and responses to questions that any practicing physicians would be expected to make about this patient in order to avoid malpractice or to miss the essence of the problem?” They automatically included items with 100% agreement as critical action items and discussed further those with low agreement until they reached consensus. This process resulted in an abbreviated checklist of three to six items for each station. The panel calculated a passing score for each station by averaging across panelists the percentage of critical action items in a given station that they expected a minimally competent student to perform, yielding individual station passing scores ranging from 67% to 100% of the station’s critical action items. The panel determined that the criterion-referenced cutoff score for the eight-station examination, based on the average of these individual station passing scores, was 80%.
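The panel's arithmetic can be sketched in a few lines of Python. This is only an illustration: the panelist ratings and station names below are hypothetical placeholders (the actual ratings are not reported here), and the function names are ours. Only the averaging procedure follows the description above.

```python
from statistics import mean

def station_passing_score(expected_counts, n_critical_items):
    """Average, across panelists, of the share of critical action items
    each panelist expects a minimally competent student to perform."""
    return mean(count / n_critical_items for count in expected_counts)

def exam_cutoff(station_scores):
    """Examination cutoff: the plain average of the station passing scores."""
    return mean(list(station_scores))

# Hypothetical ratings from five panelists (illustrative only, not study data).
# Each number is how many critical items a panelist expects to be performed;
# the second element is the number of critical items in that station.
hypothetical_panel = {
    "station_A": ([3, 3, 3, 3, 3], 3),  # all five expect 3 of 3 items
    "station_B": ([4, 4, 4, 4, 4], 6),  # all five expect 4 of 6 items
    "station_C": ([4, 3, 4, 4, 5], 5),  # panelists disagree
}

station_scores = {
    name: station_passing_score(counts, n_items)
    for name, (counts, n_items) in hypothetical_panel.items()
}
cutoff = exam_cutoff(station_scores.values())
```

Because the cutoff is an unweighted mean of station passing scores, dropping or adding stations shifts it, which is why the six-station cutoff used later differs slightly from the eight-station value.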
For our study, we modified this calculation using only the six common stations administered at all consortium schools in 2008 (see Table 1). A lack of independence of items within a station precluded analysis at the item level; therefore, we constructed individual station scores. In addition, because the panel identified only history taking, physical examination, and patient education items as critical action items, we excluded communication items from our analysis. According to the procedures described above, two of us (R.A.R.L., C.F.) constructed students’ scores as follows:
a. The norm-referenced or relative standard concerned the percentage of all items (excluding communication items) performed by the student in each station averaged across the six common stations. Students scoring more than one standard deviation below the mean failed the standard.
b. The criterion-referenced or absolute standard concerned the percentage of critical action items performed by the student in each station averaged across the six common stations. Students who completed above an average of 81% of the critical action items (the passing score based on only the six stations) across the six stations were considered proficient.
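The two scoring rules can be made concrete with a minimal Python sketch. The checklist data are a hypothetical toy cohort (three students, two stations) rather than CPX data, and the function names are ours. Two further assumptions are labeled in the code: the relative cutoff uses the sample standard deviation, and a score exactly at that cutoff counts as passing.

```python
from statistics import mean, stdev

def station_percent(done_flags):
    """Share of checklist items marked "done" (1) rather than "not done" (0)."""
    return sum(done_flags) / len(done_flags)

def exam_score(stations):
    """Average of the station percentages across a student's stations."""
    return mean(station_percent(flags) for flags in stations)

def norm_referenced_pass(scores):
    """Fail students scoring below (cohort mean - 1 SD).
    Assumptions: sample SD; a score exactly at the cutoff passes."""
    values = list(scores.values())
    cutoff = mean(values) - stdev(values)
    return {student: score >= cutoff for student, score in scores.items()}

def criterion_referenced_pass(scores, cutoff=0.81):
    """Pass students whose average exceeds the fixed cutoff (0.81 in the text)."""
    return {student: score > cutoff for student, score in scores.items()}

# Hypothetical toy data (not study data): one 0/1 list per station.
all_items = {
    "s1": [[1, 1, 1, 1, 0], [1, 1, 1, 0, 0]],
    "s2": [[1, 1, 0, 0, 0], [1, 0, 0, 0, 0]],
    "s3": [[1, 1, 1, 1, 1], [1, 1, 1, 1, 0]],
}
critical_items = {
    "s1": [[1, 1, 1], [1, 0, 0]],
    "s2": [[1, 0, 0], [1, 1, 1]],
    "s3": [[1, 1, 1], [1, 1, 1]],
}

norm_pass = norm_referenced_pass({s: exam_score(v) for s, v in all_items.items()})
crit_pass = criterion_referenced_pass(
    {s: exam_score(v) for s, v in critical_items.items()}
)
```

In this toy cohort, student s1 passes the relative standard but not the absolute one, which is the kind of reclassification the two standards can produce.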
Unlike conventional measures of reliability, which only consider one source of error, G theory (1) provides mechanisms for disentangling simultaneous sources of measurement error to better understand and thereby improve measurement design, and (2) enables the estimation of reliability for criterion-referenced measures. In G theory, the reliability of performance scores, like those on the CPX, is defined as the accuracy of generalizing from a student’s observed score, or the small sample of his or her performance, to his or her universe score (comparable in classical test theory to his or her true score).21 The reliability of a student’s performance score therefore indicates the variability in the universe score, or the proportion of variance in a student’s CPX score that can be attributed to person variation rather than to sources of error variation, known as facets.
Our G study investigated the relative influence of the station (s) facet and the interaction of person-by-station (ps), an indicator of case specificity. Of particular interest to us in these calculations was the variation in measurement scores attributable to each of these facets, or variance components. Variation in clinical competence scores can be partitioned into several variance components attributed to differences between persons, stations, and the interaction between them. Of note, we could not estimate the effect of the rater in this study because each student in each station encountered only one rater (the SP) on one occasion, meaning the effect of the rater was confounded with the effect of the station (as with many performance examinations).
The generalizability coefficient (ρ²), or the reliability coefficient for relative decisions, considers measurement error comprising only those estimated variance components that contribute to a student’s relative ranking (e.g., the interaction of person and station). The index of dependability (φ), the generalizability coefficient for absolute decisions, however, considers measurement error associated with all facets.
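For a single-facet design in which persons are crossed with stations, the two coefficients take the standard textbook forms below. The notation is ours and is offered only as a reading aid: σ²_p is the person variance, σ²_s the station variance, σ²_{ps,e} the person-by-station interaction confounded with random error, and n_s the number of stations over which scores are averaged.

```latex
\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{ps,e}/n_s},
\qquad
\phi = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_s/n_s + \sigma^2_{ps,e}/n_s}
```

The only difference is the station term in the denominator of φ: station difficulty does not change how students rank against one another, but it does change whether a given student clears a fixed cutoff.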
The results from our initial analysis of data from each school revealed no meaningful difference between schools; therefore, we report our subsequent analyses and results in the aggregate. Our study was approved by the institutional review boards at the participating institutions. We conducted our analyses using GENOVA software (Iowa City, Iowa).22
On the basis of the norm-referenced standard described above, we calculated the passing score based on average performance across all six stations to be 0.65 (on a scale of 0–1), one standard deviation (SD) below the mean (M) station performance (M = 0.71, SD = 0.06). As determined previously by the panel, the cutoff score for the overall criterion-referenced standard was 0.81. On the basis of the norm-referenced standard, 84% (399 of 477) of students passed the CPX. On the basis of the criterion-referenced standard, 62% (297 of 477) of students passed. These two approaches then classified different students as passing and failing the examination, meaning that a student’s performance according to one standard was not necessarily tied to his or her performance according to the other, which is indicated by the significant relationship of pass/fail decisions between standards (χ² = 87.2, df = 1, P < .05).
Table 2 provides our G study results according to the norm-referenced standard. In our six-station calculations, the variance component for person (0.00204) indicates that the level of student performance differed between students, with about 41% of the total variance attributable to systematic differences between examinees. We attributed a moderate percentage (about 21%) of variation to station (0.00105), indicating that some stations were more difficult than others. We also attributed a large percentage (about 38%) of the variance to the interaction between person and station commingled with random error (0.00193). Though we cannot disentangle this variance component further, our calculations do suggest that the relative standing of students may vary from station to station.
The generalizability coefficient (ρ²), useful for predicting the reproducibility of examinee rankings, represents the proportion of the total variation attributed to person. We calculated the six-station examination generalizability coefficient to be 0.51, which is relatively low for a high-stakes examination. Though estimated differently, the generalizability coefficient is analogous conceptually to the coefficient alpha, for which 0.80 is viewed as an acceptable level of reliability for high-stakes examinations; however, this level of reliability is often difficult to achieve with OSCE performance scores.23,24 Should educators use these performance scores, based on all checklist items, to make pass/fail decisions, the phi dependability coefficient (φ), which factors station difficulty into the estimation of score error, would be only 0.41.
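The percentages and coefficients quoted for the norm-referenced score can be reproduced from the three reported variance components with a short Python sketch. It assumes that the reported components are already expressed on the scale of the six-station mean score; this is an assumption, adopted here because it returns the reported values of 0.51 and 0.41. The function and variable names are illustrative and are not part of GENOVA.

```python
def variance_shares(person, station, residual):
    """Percentage of total variance attributable to each component."""
    total = person + station + residual
    return {
        "person": 100 * person / total,
        "station": 100 * station / total,
        "person_x_station": 100 * residual / total,
    }

def g_coefficient(person, residual):
    """Relative (rank-order) coefficient: only the person-by-station
    interaction, confounded with random error, counts as error."""
    return person / (person + residual)

def phi_coefficient(person, station, residual):
    """Absolute (dependability) coefficient: station difficulty also
    counts as error."""
    return person / (person + station + residual)

# Norm-referenced components as reported in the text (six-station examination).
PERSON, STATION, RESIDUAL = 0.00204, 0.00105, 0.00193

shares = variance_shares(PERSON, STATION, RESIDUAL)
rho2 = g_coefficient(PERSON, RESIDUAL)
phi = phi_coefficient(PERSON, STATION, RESIDUAL)

print({name: round(value) for name, value in shares.items()})
# {'person': 41, 'station': 21, 'person_x_station': 38}
print(round(rho2, 2), round(phi, 2))
# 0.51 0.41
```

The gap between the two coefficients comes entirely from the station component, which enters the error term only when decisions are absolute.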
We include in Table 2 the estimated variance components and generalizability coefficients for an eight-station examination (the true length of the CPX) as well as for a 24-station examination, illustrating an increase in generalizability with an increase in the number of stations. These findings indicate that a student’s norm-referenced score is moderately consistent when ranking students and moderately dependable when making absolute decisions should the examination be administered again using a different or expanded set of stations.
Table 2 also provides our G study results for the criterion-referenced standard. For the six-station examination, we calculated the amount of variance attributed to person to be 0.00141, which is small, suggesting that the level of student performance did not differ much between students. This facet accounts for a much smaller percentage of the total variance (about 20%) than person does in the norm-referenced score (about 41%). The variance attributed to station (0.00125) is also low (about 17%). Just over 62% of the variance in this model is attributed to the interaction between person and station commingled with random error (0.00451), indicating that the difficulty of a station for a particular student very much depended on that individual student’s ability. For the six-station examination, we calculated the phi dependability coefficient to be 0.20, which is low for a high-stakes examination and considerably lower than the phi dependability coefficient associated with norm-referenced performance scores for a six-station examination (0.41). If we were to select a different sample of six stations, then, we would likely identify different students as passing and failing. It would take a considerable number of stations to achieve a more dependable measure of student performance.
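To put "a considerable number of stations" in rough numerical terms, the Python sketch below projects the dependability coefficient to other examination lengths. It assumes that the reported criterion-referenced components are expressed on the scale of the six-station mean, so that the error components shrink in proportion to the number of stations. That reading is an assumption and is not stated in the text. The 0.80 target is borrowed from the benchmark quoted earlier for coefficient alpha, and the function names are illustrative.

```python
import math

BASE_STATIONS = 6
# Criterion-referenced components as reported in the text.
PERSON, STATION, RESIDUAL = 0.00141, 0.00125, 0.00451

def projected_phi(n_stations):
    """Dependability for an examination of n_stations, rescaling the
    six-station error components by 6 / n_stations (assumed scaling)."""
    scale = BASE_STATIONS / n_stations
    return PERSON / (PERSON + scale * (STATION + RESIDUAL))

def stations_needed(target_phi):
    """Smallest number of stations whose projected phi reaches target_phi."""
    single_station_error = BASE_STATIONS * (STATION + RESIDUAL)
    n = target_phi / (1 - target_phi) * single_station_error / PERSON
    return math.ceil(n)

for n in (6, 8, 24):
    print(n, round(projected_phi(n), 2))
print(stations_needed(0.80))
```

Under these assumptions the projection calls for roughly a hundred stations to reach 0.80. This is a back-of-the-envelope illustration rather than a study result, but it is consistent with the conclusion that adding stations alone is an impractical route to a dependable criterion-referenced score.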
Developing a criterion-referenced standard for a multi-institutional clinical performance examination, such as the CPX, poses unique challenges. Our study investigated medical educators’ interpretation of proficiency under a new critical actions-based approach to absolute standard-setting, demonstrating that different approaches to scoring an examination can, in fact, impact decisions regarding proficiency. Considering the desire at many medical institutions to move toward absolute standards in scoring performance-based clinical assessments, medical educators at these institutions must consider how their approaches to absolute standard-setting impact the reliability of their students’ performance scores and, subsequently, how they can improve the reliability of those scores.
First and foremost, we recommend that medical educators consider a variety of potential sources of error variance in developing absolute standards. Common practice is to consider the effects of the rater25; however, given the structure of the examination, it is not possible to distinguish error associated with the rater from error attributed to the station. In light of the challenges of adjusting the examination administration (e.g., adding stations), medical educators should identify common, systematic differences between stations to gain insight into how best to adjust the CPX to create a more reliable measure of clinical performance.
Another important consideration for medical educators is the method of standard-setting and whether or not the processes employed to set the absolute standard, here the critical action approach, reliably measure clinical competence. Results from our G study indicate that student performance based on the criterion-referenced standards is very much the product of a given station. The process of absolute standard-setting that we described earlier relied on a small group of faculty at one institution who selected specific items on each station checklist. These items were so essential that to not perform them had the potential for gross medical error, including harm to the patient. Items varied substantially depending on the context of the station and the patient’s chief complaint. Conversely, the full checklist used in calculating the norm-referenced score included not only items unique to each station but also more generic items, common to most clinical scenarios. The critical action items list, however, is a reduced, station-dependent list, which heightens the effect of case specificity. If most critical action items are station-specific, then a station score may represent the measure of students’ content-specific knowledge (e.g., treatment of chest pain, abdominal pain, or fever). We are not surprised, then, that multiple station scores, each indicating a student’s knowledge of a particular content area, are needed to capture his or her general ability, like clinical competency, a finding supported elsewhere in the literature.26
Though criterion-referenced standard-setting processes have been described in the literature,27,28 implementing such processes poses unique challenges for medical educators at individual institutions. Our faculty chose a process that took into consideration constraints such as faculty availability and training time. The results of our study indicate the need, however, to continue developing and refining our criterion-referenced standard to ensure the reliable, accurate identification of skill deficiencies in our students. Areas of future research include examining, for instance, how modifications to the critical action standard-setting process, like expanding the expert panel to include faculty from across the consortium or altering selection criteria for checklist items to capture behaviors expected not of a practicing physician but of a competent intern, could produce a different set of critical items or a new standard altogether and, subsequently, could yield different levels of generalizability. The investigation of different standard-setting processes, such as a different method of constructing the criterion-referenced score (e.g., weighting critical action items), may lead to a better, more dependable standard. It is imperative, then, to consider how educators’ choices about developing a standard can impact the classification of student proficiency and clinical competence. We recognize that our new, criterion-referenced standard measured not a generalized skill but, rather, station-specific skills. This standard identified students with potential content deficiencies, and we remediated accordingly by administering content-specific review activities (e.g., the medical interview and physical examination of a patient complaining of chest pain) rather than a more general skills-based review (e.g., communication skills, medical interviewing skills).
Our study has implications for programs at both the undergraduate and graduate medical education level that employ performance-based competency assessments. Although medical educators may welcome the transition to criterion-referenced scoring as a more valid approach to measuring clinical competence, our study emphasizes the need for careful consideration of examination generalizability and the dependability of the pass/fail decision-making process. The process for setting such standards must be deliberate and appropriate to both the student’s level of training and the examination’s purpose. Improving the reliability of examination scores is not simply a question of adjusting the cutoff score or including additional stations but of setting a standard and determining what behavior is indicative of that standard. In medical education, a false-positive, or naming someone proficient who is in fact not, is of grave concern because it has the potential for serious adverse effects on future patient care. Thus, the development of reliable and valid criterion-referenced standards—as Miller noted more than two decades ago—is of the utmost importance.
Funding/Support: Part of this research was made possible by a predoctoral advanced quantitative methodology training grant (#R305B080016) awarded to the UCLA Graduate School of Education and Information Studies by the Institute of Education Sciences of the U.S. Department of Education.
Other disclosures: None.
Ethical approval: The institutional review boards at UCLA, UCSF, and USC approved the use of student performance data.
Disclaimer: The views expressed in this report are the authors’ alone and do not reflect the views/policies of the funding agencies or grantees.
1. Miller GE. The assessment of clinical skills/competence/performance. Acad Med. 1990;65(9 suppl):S63–S67
2. Ricketts C. A plea for the proper use of criterion-referenced tests in medical assessment. Med Educ. 2009;43:1141–1146
3. Boscardin CK, Fung C. Identification of students with clinical deficits using latent class analysis. Paper presented at: Annual Meeting of the American Educational Research Association; April 13, 2009; San Diego, Calif
4. Clauser BE. Recurrent issues and recent advances in scoring performance assessments. Appl Psychol Meas. 2000;24:310–324
5. Hambleton RK, Slater SC. Reliability of credentialing examinations and the impact of scoring models and standard-setting policies. Appl Meas Educ. 1997;10:19–38
6. Barman A. Standard setting in student assessment: Is a defensible method yet to come? Ann Acad Med Singap. 2008;37:957–963
7. George S, Haque MS, Oyebode F. Standard setting: Comparison of two methods. BMC Med Educ. 2006;6:46
8. Cohen-Schotanus J, van der Vleuten CP. A standard setting method with the best performing students as point of reference: Practical and affordable. Med Teach. 2010;32:154–160
9. Angoff WH. Scales, norms, and equivalent scores. In: Thorndike RL, ed. Educational Measurement. 2nd ed. Washington, DC: American Council on Education; 1971
10. Cusimano MD. Standard setting in medical education. Acad Med. 1996;71(10 suppl):S112–S120
11. Boulet JR, Murray D, Kras J, Woodhouse J. Setting performance standards for mannequin-based acute-care scenarios: An examinee-centered approach. Simul Healthc. 2008;3:72–81
12. McKinley DW, Boulet JR, Hambleton RK. A work-centered approach for setting passing scores on performance-based assessments. Eval Health Prof. 2005;28:349–369
13. Clauser BE, Clyman SG. A contrasting-groups approach to standard setting for performance assessments of clinical skills. Acad Med. 1994;69(10 suppl):S42–S44
14. Smee SM, Blackmore DE. Setting standards for an objective structured clinical examination: The borderline group method gains ground on Angoff. Med Educ. 2001;35:1009–1010
15. Downing SM, Tekian A, Yudkowsky R. Procedures for establishing defensible absolute passing scores on performance examinations in health professions education. Teach Learn Med. 2006;18:50–57
16. Hambleton RK, Jaeger RM, Plake BS, Mills C. Setting performance standards on complex educational assessments. Appl Psychol Meas. 2000;24:355–366
17. Boursicot KA, Roberts TE, Pell G. Standard setting for clinical competence at graduation from medical school: A comparison of passing scores across five medical schools. Adv Health Sci Educ Theory Pract. 2006;11:173–183
18. Norcini JJ. Setting standards on educational tests. Med Educ. 2003;37:464–469
19. Ferrell BG. A critical elements approach to developing checklists for a clinical performance examination. Med Educ Online. 1996;1:5
20. Payne NJ, Bradley EB, Heald EB, et al. Sharpening the eye of the OSCE with critical action analysis. Acad Med. 2008;83:900–905
21. Brennan RL. Generalizability Theory. New York, NY: Springer; 2001
22. Crick GE, Brennan RL. GENOVA: A Generalized Analysis of Variance System [FORTRAN IV computer program and manual]. Dorchester, Mass: University of Massachusetts at Boston, Computer Facilities; 1982
23. Newble DI, Swanson DB. Psychometric characteristics of the objective structured clinical examination. Med Educ. 1988;22:325–334
24. Brannick MT, Erol-Korkmaz HT, Prewett M. A systematic review of the reliability of objective structured clinical examination scores. Med Educ. 2011;45:1181–1189
25. Clauser BE, Harik P, Margolis MJ, Mee J, Swygert K, Rebbecchi T. The generalizability of documentation scores from the USMLE Step 2 Clinical Skills examination. Acad Med. 2008;83(10 suppl):S41–S44
26. van der Vleuten CPM, Swanson DB. Assessment of clinical skills with standardized patients: State of the art. Teach Learn Med. 1990;2:58–76
27. Norcini JJ, Shea JA. The credibility and comparability of standards. Appl Meas Educ. 1997;10:39–59
28. Boulet JR, De Champlain AF, McKinley DW. Setting defensible performance standards on OSCEs and standardized patient examinations. Med Teach. 2003;25:245–249