Curricula focusing solely on health and disease are no longer adequate for the education of well-rounded physicians. The current milieu requires physicians to pay attention to other dimensions of health care, including medical errors, system shortcomings, and practice scorecards. Responding to this need, the Association of American Medical Colleges,1 the Accreditation Council for Graduate Medical Education,2 and the Institute of Medicine3 have endorsed competencies related to quality improvement (QI). The competencies related to QI in graduate medical education are Practice-Based Learning and Improvement (PBLI) and Systems-Based Practice (SBP).
As these competencies are being introduced in medical education across the nation,4–6 the choice of appropriate methods for teaching and assessing them has raised significant concern among educators. Because these competencies require multidimensional complex behaviors and are skill based, performance-based assessment is of critical value.7 There is little published literature on the assessment of SBP and PBLI. Current methods to assess these competencies consist predominantly of end-of-rotation global ratings; 360-degree evaluations from peers, allied health staff, and patients; and performance evaluations of the design or implementation of QI projects.8–11 There is little validity evidence to support the usefulness of these assessment methods in the context of SBP and PBLI, and it is unclear whether currently used assessment tools are capable of predicting performance ability in these domains.
Since its introduction in 1972, the Objective Structured Clinical Examination (OSCE) has been widely used in medical education to assess history-taking, physical-exam, and communication skills.12–14 More recently, medical educators have developed OSCEs that address evidence-based medicine skills,15 cultural awareness,16–18 teaching skills,19 and prescribing skills.20 Noncognitive skills have also been assessed with OSCEs during admissions interviews.21 The only reported use of OSCEs to measure QI-related competencies is of a single OSCE station for medical students designed to assess root cause analysis and communication of a prescription error (Varkey and Natt22).
Because of its high predictive validity and its influence on physician training, practice, and behaviors in non-QI curricular settings,23 the OSCE is a potentially powerful assessment tool for SBP and PBLI. The OSCE format provides an opportunity to systematically and sufficiently sample the different subdomains of QI, a key element of validity. Furthermore, the OSCE provides an opportunity for the demonstration of skills at Miller’s assessment level of “shows how”24 rather than the testing of knowledge alone.
The purpose of this pilot study was to gather validity, feasibility, and acceptability evidence for an OSCE to assess competency in PBLI and SBP in graduate medical education. We describe an OSCE that was implemented as part of a three-week QI elective for preventive medicine and endocrinology fellows at the Mayo Clinic, Rochester, Minnesota. The results of this pilot study provide the foundation for further work in using OSCEs for the assessment of competency in SBP and PBLI.
Development of the OSCE Stations
The curriculum content of the three-week QI elective rotation is summarized in Table 1, columns one and two. This curriculum was modeled on Dewey’s25 philosophy of experiential education. Instructional methods consisted of case-based exercises, role-play, simulations, workshops, and didactics. In addition, as part of the curriculum, all fellows participated as a team in a QI project relevant to their clinical practice.
The director of the QI rotation (who is also the program director for the preventive medicine fellowship) led the development of the OSCE. The OSCE committee consisted of three educational assessment experts (including the preventive medicine and endocrinology fellowship program directors) and three other local institutional content experts. The committee created a blueprint for the OSCE based on the eight main curriculum objectives of the rotation: prescription errors, negotiation, evidence-based medicine, team collaboration, root cause analysis, quality measurement, Nolan’s three-question model of QI, and insurance systems (Table 1). Each station was designed to assess one of these components of the curriculum. Three stations involved individual interactions with an SP; the team collaboration station involved interaction of all the fellows in one room with a high-fidelity mannequin and an SP. All the other stations consisted of written challenges—short-answer and multiple-choice questions (MCQs). All stations were 15 minutes in length except for the station on evidence-based medicine, which was a two-part station and lasted 30 minutes. Details of each station are described in Table 1, columns three and four.
The stations were created and scripted by the OSCE committee, based largely on real-life situations that occurred during prior improvement projects. The OSCE was pilot-tested with one endocrinology research fellow and three medicine chief residents who were not part of the study. Based on their feedback, changes were made in the content and presentation of the stations before final implementation. SPs were recruited and trained per Mayo simulation center protocols.
This study was considered exempt by the Mayo Clinic institutional review board and approved by the University of Illinois Chicago institutional review board. Written informed consent was obtained from all participants before the start of the study.
All nine fellows in nonresearch rotations in preventive medicine and endocrinology training programs participated together in a three-week QI elective that focused on the basics of SBP and PBLI at the Mayo Clinic in Rochester, Minnesota, in November 2006. At the end of the rotation, all fellows were assessed using an eight-station OSCE conducted in the Mayo Clinic Simulation Center.
Assessment of fellows
A 15-minute orientation to the OSCE format was given to the fellows on the day of the examination. Performance on each of the eight stations was rated independently by three faculty experts. A five-point global competency rating (1 = unsatisfactory, 2 = marginal, 3 = pass, 4 = good, 5 = excellent) was assigned for the fellows’ overall competence in each station, taking into consideration efficiency of problem solving, interpersonal skills, safety of management, and discretionary use of resources. Fellows were also assessed using checklist scoring in all stations except the health care finance station, which used MCQs. In the stations related to prescription error and negotiation, the fellows’ performance also was rated by the SP and by the faculty, using an additional interpersonal skills checklist. On the day of the OSCE, each SP station had a dedicated faculty member who monitored the performance of the SP to ensure consistency throughout the administration of the OSCE.
A modified Angoff procedure was used to set pass/fail standards.26,27 Three faculty judges independently estimated, for each test question, the probability that a borderline-competent fellow would answer it correctly. No performance data were provided to the judges. The judges then reviewed each test question as a group and discussed it until they reached consensus on its probability estimate. Any test question that was judged to be ambiguous was eliminated from the scoring process for that station. The resulting checklist cut scores are shown in Table 2.
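As a rough illustration of how a cut score can be derived from the judges' consensus estimates in a modified Angoff procedure, the following Python sketch averages the per-item probabilities for one station after dropping items flagged as ambiguous. The item names, the probability values, and the function name `angoff_cut_score` are hypothetical and are not taken from the study; the study's actual cut scores are those reported in Table 2.

```python
from statistics import mean


def angoff_cut_score(item_probabilities, ambiguous_items=()):
    """Return an Angoff cut score as a percentage of the scored items.

    item_probabilities maps each item to the consensus probability that a
    borderline-competent examinee answers it correctly. Items listed in
    ambiguous_items are excluded before averaging.
    """
    scored = {
        item: p
        for item, p in item_probabilities.items()
        if item not in ambiguous_items
    }
    if not scored:
        raise ValueError("no scorable items remain")
    for item, p in scored.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability for {item!r} must lie in [0, 1]")
    # The expected score of a borderline examinee is the mean probability.
    return 100.0 * mean(scored.values())


# Hypothetical consensus estimates for one station (not study data).
station_estimates = {
    "identifies error": 0.90,
    "discloses error to patient": 0.70,
    "apologizes": 0.60,
    "outlines follow-up plan": 0.50,
    "ambiguous item": 0.40,
}

cut = angoff_cut_score(station_estimates, ambiguous_items={"ambiguous item"})
print(f"Station cut score: {cut:.1f}%")
```

A fellow's checklist percentage for the station would then be compared with this cut score to give a pass/fail decision.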
To evaluate the validity of the OSCE as an effective assessment tool for SBP and PBLI, we collected evidence concerning the five aspects of construct validity—content evidence, response process, internal structure, relationship with other variables, and consequences—as described by Downing et al.28,29
Seven institutional experts in the content areas of SBP and PBLI assessed the adequacy of the OSCE content by comparing the curriculum objectives with the exam blueprint. They also assessed the content for emphasis on important curriculum concepts and demonstration of skills in the appropriate context. Other content evidence examined included author expertise, clarity of instructions, and monitoring of SP portrayals to ensure consistency of case presentation. Clarity of instruction was assessed through a survey of the fellows.
The OSCE committee gathered response process evidence from the description and evaluation of score calculation and reporting methods, explanatory materials discussing appropriate interpretation of the performance scores, rationale for the use of different types of scoring methods, and familiarity of the fellows with the format of the examination.
Interrater reliability of the ratings from the three faculty raters in each station was estimated using the intraclass correlation coefficient for the average ratings of the three raters. A coefficient of 0.41 to 0.60 was considered moderate, 0.61 to 0.80 excellent, and 0.81 to 1.00 outstanding.30 Because this was a formative examination, the OSCE committee set the minimum acceptable level at 0.61. Pearson correlation coefficients were used to determine the correlation of interpersonal skills ratings between the faculty raters and the SPs; a P value of <0.05 was considered statistically significant. Interstation correlation was evaluated using the Pearson correlation coefficient. Intrastation reliability (internal consistency) was reported using Cronbach alpha on the items within a case, again with a minimum acceptable level of 0.61 set by the OSCE committee. The reliability of the OSCE global scores across the eight stations was calculated using Cronbach alpha. Because this was a completely crossed design with all fellows completing the same eight stations, alpha is equivalent to the generalizability coefficient. The estimated number of stations needed to reach a generalizability of 0.80 was then calculated using the Spearman–Brown formula.
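The two reliability statistics named above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the text does not specify which intraclass correlation model was fitted, so the sketch assumes the two-way consistency form for averaged ratings, often written ICC(3,k), which is algebraically identical to Cronbach alpha computed with raters treated as items. The ratings matrix is invented for illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach alpha for a complete matrix (rows = examinees, columns = items).

    Undefined (division by zero) if every examinee has the same total score.
    """
    n_items = len(scores[0])
    if len(scores) < 2 or n_items < 2:
        raise ValueError("need at least two examinees and two items")
    if any(len(row) != n_items for row in scores):
        raise ValueError("score matrix must be complete")
    item_variances = [variance(column) for column in zip(*scores)]
    total_variance = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


def icc_average_consistency(ratings):
    """ICC for the mean of k raters, two-way consistency form (assumed model).

    Computed from two-way ANOVA mean squares: (MS_rows - MS_error) / MS_rows.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(col) / n for col in zip(*ratings)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / ms_rows


# Hypothetical global ratings: five examinees (rows) by three raters (columns).
ratings = [
    [4, 4, 5],
    [3, 3, 3],
    [5, 4, 5],
    [2, 3, 2],
    [4, 5, 4],
]

alpha = cronbach_alpha(ratings)
icc = icc_average_consistency(ratings)
print(f"alpha = {alpha:.3f}, ICC(3,k) = {icc:.3f}")
```

The same `cronbach_alpha` function applies to a fellows-by-checklist-items matrix for intrastation reliability, or to a fellows-by-stations matrix for the reliability of the global scores across stations.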
Relationship to other variables
The total checklist score across stations was calculated weighting each item equally. This total score was compared with the end-of-rotation Quality Improvement Knowledge Application Tool (QIKAT) score as well as the end-of-rotation SBP–PBLI score. Pre- and postintervention assessments of knowledge, skills, and attitudes toward QI were obtained at the beginning and end of the rotation using the paper-and-pencil QIKAT developed at Dartmouth.31 The QIKAT contains a total of six scenarios (three each for the pre- and postintervention tests) in which learners are asked to provide short answers to describe the aim, measures, and recommended changes for an improvement case provided to them. In other studies, residents who have undergone QI training have shown significant improvement of their QIKAT scores after an educational intervention.10 The psychometric properties of this tool have not been reported. At the end of the rotation and before the OSCE, the QI curriculum director provided overall ratings of each fellow’s competence in SBP and PBLI on a nine-point scale (1 to 3 = unsatisfactory, 4 to 6 = satisfactory, 7 to 8 = superior, and 9 = outstanding). This rating scale is one of the assessments the program has used for the past five years to evaluate competency in SBP and PBLI; its psychometric properties have not been reported. The Pearson correlation coefficient was used to assess the relationship of total checklist score to the QIKAT posttest and the end-of-rotation faculty ratings.
This pilot OSCE was used for research purposes only and not for summative evaluation. We determined the theoretical pass rate on the basis of the results of the standard setting exercise and global scores. To study the consequence of the results on the curriculum, we interviewed appropriate program directors. We did not study the consequence on fellows.
We assessed feasibility through cost accounting and evaluating the logistics of conducting the OSCE. Costs related to the training and use of SPs, the cost of using the OSCE testing site, the time spent on grading the stations, and the time spent on providing feedback to the students were included in the cost analysis. The time spent on creating the OSCE for the study was also determined.
Fellows and faculty preceptors completed a survey at the end of the OSCE to determine the acceptability of the OSCE to assess SBP and PBLI competencies. The survey contained closed- and open-ended questions and elicited the perceptions of the fellows and faculty about the usefulness of the OSCE, its enhancement of learning, and its ability to accurately assess levels of SBP and PBLI skills on a five-point Likert scale (one = strongly agree, five = strongly disagree).
Two preventive medicine and seven endocrinology fellows, four men and five women, participated in all eight OSCE stations. Study participants included four first-year and five second-year fellows. All but one fellow completed the survey conducted at the end of the OSCE. Table 2 shows that most fellows did well on the OSCE. The validity evidence for the OSCE is described below.
The team of institutional experts concluded that the content of the stations matched the curriculum blueprint, emphasized significant curriculum concepts, and allowed demonstration of skills in the appropriate context. The majority of fellows who participated in the OSCE survey (6/8; 75%) agreed that the instructions for all the stations were clear.
A statistician reviewed a random sample of 10% of the paper forms to ensure that data entry was accurate. The OSCE committee ensured the response process by using the standard quality assurance process at the simulation center. One fellow expressed that (s)he would have liked a more detailed orientation to the OSCE because (s)he had never participated in one before.
The interrater reliability of the global competency and checklist scores from the three faculty raters for five of the eight stations (evidence-based medicine, quality measurement, root cause analysis, Nolan’s model, and insurance systems) varied from 0.85 to 1. Two scores had intraclass correlations less than 0.61—the prescription error global competency score (ICC = 0), and the negotiation checklist score (ICC = 0.53). Because the team collaboration station assessed the performance of the team, not individual fellows, we were unable to calculate a variability estimate within faculty, and, hence, we were unable to perform tests of intraclass correlation for this station. The mean faculty-assigned checklist score was 17.67 (95% CI: 5.16–26). There was no significant difference between the faculty and SP interpersonal skills scores in the negotiation OSCE (P = .58) or in the prescription error case (P = .16).
The mean interstation correlation was 0.13 (range = −0.62 to 0.99). The generalizability coefficient for the OSCE was G = 0.34. Sixty-one cases would be needed to reach a generalizability of 0.8. Internal consistency reliability of the station rating scales varied from 0.32 to 0.83; the majority of the stations had a Cronbach alpha of more than 0.61, except for the evidence-based medicine station (0.44) and the quality-measurement station (0.32).
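The station estimate above follows from the Spearman–Brown formula, sketched here in Python using the rounded values reported in the text (eight stations, G = 0.34, target 0.80). With the rounded coefficient the sketch gives roughly 62 stations, close to the reported figure of 61; the small difference plausibly reflects rounding of the published coefficient, although that is an assumption. The function names are illustrative.

```python
def spearman_brown_length_factor(current_reliability, target_reliability):
    """Factor by which a test must be lengthened to reach the target reliability."""
    for value in (current_reliability, target_reliability):
        if not 0.0 < value < 1.0:
            raise ValueError("reliabilities must lie strictly between 0 and 1")
    return (target_reliability * (1.0 - current_reliability)) / (
        current_reliability * (1.0 - target_reliability)
    )


def predicted_reliability(current_reliability, length_factor):
    """Spearman-Brown prediction of reliability after lengthening the test."""
    return (length_factor * current_reliability) / (
        1.0 + (length_factor - 1.0) * current_reliability
    )


n_stations = 8           # stations in the pilot OSCE
g_observed = 0.34        # generalizability coefficient as reported (rounded)
g_target = 0.80

factor = spearman_brown_length_factor(g_observed, g_target)
stations_needed = n_stations * factor
print(f"Lengthening factor: {factor:.2f}")
print(f"Stations needed: {stations_needed:.1f}")
```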
Relationship to other variables
The correlation between the total OSCE checklist score and the final QIKAT scores was not statistically significant (r = 0.19; P = .63). There was a nonsignificant negative correlation between the total checklist OSCE score and the end-of-rotation faculty SBP–PBLI score (r = −0.61; P = .08). All fellows received a score between seven and nine on the end-of-rotation, faculty-rated, nine-point scale.
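With so few examinees, a Pearson coefficient must be large before it reaches statistical significance. The Python sketch below converts the reported coefficients to t statistics with n − 2 degrees of freedom and compares them with the two-sided 5% critical value of Student's t for 7 degrees of freedom (2.365). It assumes that all nine fellows contributed to each correlation, which the text does not state explicitly; the function names are illustrative.

```python
import math

T_CRIT_DF7 = 2.365  # two-sided 5% critical value of Student's t, 7 df


def t_from_r(r, n):
    """t statistic for testing a Pearson correlation r from n pairs against zero."""
    if n < 3 or not -1.0 < r < 1.0:
        raise ValueError("need n >= 3 and -1 < r < 1")
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def critical_r(t_crit, n):
    """Smallest absolute correlation that reaches the given critical t."""
    return t_crit / math.sqrt(t_crit ** 2 + (n - 2))


n_fellows = 9  # assumed number of paired observations
reported = {
    "OSCE checklist vs QIKAT": 0.19,
    "OSCE checklist vs faculty SBP-PBLI rating": -0.61,
}

for label, r in reported.items():
    t = t_from_r(r, n_fellows)
    verdict = "significant" if abs(t) > T_CRIT_DF7 else "not significant"
    print(f"{label}: r = {r:+.2f}, t = {t:+.2f}, {verdict} at the 5% level")

r_needed = critical_r(T_CRIT_DF7, n_fellows)
print(f"|r| needed for significance with n = {n_fellows}: {r_needed:.3f}")
```

Under these assumptions, an absolute correlation of about 0.67 or more would be required for significance, which is consistent with both reported coefficients being nonsignificant.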
If the examination had been summative, one fellow would not have passed the insurance systems station or the Nolan station, and another fellow would not have passed the insurance systems station or the quality-measurement station. In addition, two other fellows would not have passed the Nolan station (see Table 2). Because this OSCE was used for research purposes only, there was no impact of the OSCE on the fellows’ summative assessments. Fellows were not surveyed for their impressions regarding consequences of the experience of taking the OSCE. The two program directors indicated that they planned a curricular change based on the results of the OSCE. The session on insurance systems will be made more interactive in future curricula, and preparatory reading material will be provided to the fellows before the session.
The cost to conduct the OSCE was $255 per fellow. This includes all costs related to SP training, hiring of SPs, renting the simulation center, and faculty and coordinator time for running the OSCE. This does not include the 45 hours of faculty time for the development of the OSCE.
All eight faculty participants agreed or strongly agreed that the OSCE was realistic, that the SP performance was realistic, and that they were able to accurately assess the fellows. All the fellows agreed or strongly agreed that the SP performance was realistic and that the OSCE was able to accurately assess their performance.
We describe the psychometric properties of an eight-station OSCE designed to assess competency in SBP and PBLI in a study piloted with nine fellows in preventive medicine and endocrinology. A combination of written, standardized patient, and simulation-based stations tested the knowledge, skills, and attitudes required for SBP and PBLI. Not all stations used SPs because some of the domains, such as the creation of quality measures and conducting a root cause analysis, require demonstration of knowledge and skills through written or graphical representations.
To gather evidence for the validity of the OSCE, we evaluated each of the five aspects of construct validity.28,29 The evidence for content and response process validity was excellent. Interrater reliability was excellent, with reliability coefficients greater than 0.85 for both global competency and checklist scores for the majority of the stations. As compared with other stations, interrater reliability coefficients for the checklist scores were lower than 0.8 for the negotiation and prescription error stations, perhaps reflecting that the small study size could lead to unstable estimates. Similarly, it is also possible that this reflects the subjectivity in interpreting certain checklist items in these two stations (e.g., reaching a mutual agreement for initiating a QI pilot in the negotiation station, or apologizing for the error in the prescription error station) and/or the need to better train raters for scoring these items. The interrater reliabilities for these two stations were moderate and were considered acceptable for skills assessment because this was a low-stakes, formative examination. The intraclass correlation of the global competency rating of the prescription error station was considered indeterminate at zero. Future iterations of this OSCE station will need to provide more detailed specifications for scoring the checklist items and global competency ratings.
The generalizability coefficient for the OSCE was low at 0.34, indicating that this OSCE should be used in conjunction with other methods for any high-stakes assessment. The low G coefficient likely reflects the high degree of case specificity because the different stations assessed different subcompetencies in SBP and PBLI. Increasing the number of stations or separating the two competencies and conducting a separate OSCE for each competency might provide a more generalizable estimate. The low variability of scores observed across this small, highly selected group of fellows also contributed to the low generalizability of this pilot administration, indicating that the stations were relatively easy or that all of the fellows were fairly competent after the elective. Administration of the OSCE to a larger, more heterogeneous group of learners would likely result in a higher generalizability coefficient. Modifying the cases to increase their difficulty would also provide a greater spread of scores and a correspondingly higher G.
There was no significant relationship between the total OSCE score and the end-of-rotation QIKAT score. There are several possible explanations for this. The reliability and validity of the end-of-rotation SBP–PBLI scores and QIKAT scores have not been studied, and they may be poor. Furthermore, the QIKAT was not originally designed to test competency or skills in SBP and PBLI but to evaluate the impact of a QI curriculum on knowledge of the learner. There was a nonsignificant negative relationship between the total OSCE score and the end-of-rotation faculty rating of SBP–PBLI score. This result was unexpected. One possibility is that the end-of-rotation SBP–PBLI score traditionally determined by the director of the rotation is a composite score that may have inherent subjective bias based on likeability and may be influenced by social, cognitive, and environmental factors.32 In addition, because all the fellows were rated in the superior range (seven to nine on a nine-point scale) in the composite SBP–PBLI scores, the correlations were probably restricted by the limited range of scores. A broader range of candidate scores and larger sample size might lead to more useful results. These frequently observed shortcomings of faculty ratings highlight the need for complementary assessment methods such as this OSCE.
The OSCE was well accepted by the faculty and the fellows according to survey ratings and comments. Both groups commented that the stations were realistic and useful as assessment tools. Although the OSCE is likely to be more resource intensive than traditional assessment tools, it was feasible to implement an OSCE to test competency in SBP and PBLI. The cost to conduct the OSCE was $255 per fellow and is comparable with that noted in other studies involving the OSCE.33
On the basis of the results of our study, we think that the OSCE is an appealing assessment tool for SBP and PBLI because it tests situational awareness and problem solving, and it may protect against a possible halo rater effect and other subjective bias inherent in end-of-rotation evaluations as described by other investigators.34 It also lends itself to the identification of students with deficits in specific areas within the competencies, as well as curricular gaps with a potential to promote program evaluation and improvement. Although not conducted in the realm of SBP or PBLI, other studies suggest that the opportunity for formative as well as summative feedback makes the OSCE an excellent teaching tool as well.35,36 Future studies should explore the utility of the OSCE for SBP and PBLI education in addition to its value in assessment.
This study was not without limitations. Although not explicitly studied, the intrinsic motivation of fellows in this no-stakes examination may have influenced their performance in the OSCE. Because of the small study sample size, single institution site of study, and advanced level of trainees, the results of this pilot study cannot be generalized to other training programs. It is difficult to determine the predictive validity of the OSCE because a longer follow-up period would be needed to compare learner outcomes in practice or in the setting of QI projects versus their performance in the OSCE. Although the choice of preventive medicine and endocrinology fellows was a sample of convenience, we think that the diversity of specialty training programs sets the stage for further testing among varied specialties because the stations are not specialty specific.
This pilot study, conducted in a graduate medical education setting, provides promising evidence for validity, feasibility, and acceptability of an OSCE for the assessment of competency in SBP and PBLI. Future studies, based on the validity framework used in this study, are needed with larger sample sizes to gain further understanding of the potential to enhance the teaching and assessment of SBP and PBLI.
The authors would like to acknowledge Ilene B. Harris, PhD, and Georges Bordage, MD, PhD, for their review of prior versions of the manuscript, and Kevin Bennet, BSChE, MBA, for participating in the OSCE as a rater.
3 Board on Health Care Services, Institute of Medicine. Health Professions Education: A Bridge to Quality. Washington, DC: The National Academies Press; 2003.
4 Ziegelstein RC, Fiebach NH. “The mirror” and “the village”: A new method for teaching practice-based learning and improvement and systems-based practice. Acad Med. 2004;79:83–88.
5 Coleman MT, Nasraty S, Ostapchuk M, et al. Introducing practice-based learning and improvement ACGME core competencies into a family medicine residency curriculum. Jt Comm J Qual Saf. 2003;29:238–247.
6 Paukert JL, Chumley-Jones HS, Littlefield JH. Do peer chart audits improve residents’ performance in providing preventive care? Acad Med. 2003;78:S39–S41.
7 Stone S, Ferguson W, Mazor K, et al. Development and implementation of an objective structured teaching exercise (OSTE) to evaluate improvement in feedback skills following a faculty development workshop. Teach Learn Med. 2003;15:7–13.
8 Epstein R, Hundert E. Defining and assessing professional competence. JAMA. 2002;287:226–235.
9 Batalden P, Berwick D, Bisognano M, et al. Knowledge Domains for Health Professional Students Seeking Competency in the Continual Improvement and Innovation of Health Care. Boston, Mass: Institute for Healthcare Improvement; 1998.
10 Ogrinc G, Headrick LA, Mutha S, et al. A framework for teaching medical students and residents about practice-based learning and improvement synthesized from a literature review. Acad Med. 2003;78:1–9.
11 Varkey P, Reller MK, Smith A, Ponto J, Osborn M. An experiential interdisciplinary quality improvement education initiative. Am J Med Qual. 2006;21:317–322.
12 Harden RM, Gleeson FA. Assessment of clinical competence using an objective structured clinical examination. Med Educ. 1979;13:41–47.
13 Hodges B. OSCE! Variations on a theme by Harden. Med Educ. 2003;37:1134–1140.
14 Newble D. Techniques for measuring clinical competence: Objective structured clinical examinations. Med Educ. 2004;38:199–203.
15 Fliegel JE, Frohna JG, Mangrulkar RS. A computer-based OSCE station to measure competence in evidence-based medicine skills in medical students. Acad Med. 2002;77:1157–1158.
16 Altshuler L, Kachur E. A culture OSCE: Teaching residents to bridge different worlds. Acad Med. 2001;76:514.
17 Altshuler L, Kachur E, Aeder L, et al. Culture Objective Structured Clinical Exams to Assess Cultural Competence of Pediatrics Residents. Chicago, Ill: Accreditation Council for Graduate Medical Education; 2002.
18 Altshuler L, Kachur E, Aeder L, et al. Enhancing Resident’s Cultural Competence Through Culture Objective Structured Clinical Exams/Exercises (OSCEs). Washington, DC: Innovations in Medical Education; 2001.
19 Roberts C, Wass V, Jones R, et al. A discourse analysis study of ‘good’ and ‘poor’ communication in an OSCE: A proposed new framework for teaching students. Med Educ. 2003;37:192–201.
20 Scobie SD, Lawson M, Cavell G, et al. Meeting the challenge of prescribing and administering medicines safely: Structured teaching and assessment for final year medical students. Med Educ. 2003;37:434–437.
21 Eva KW, Rosenfeld J, Reiter HJ, Norman G. An admissions OSCE: The multiple mini-interview. Med Educ. 2004;38:314–326.
22 Varkey P, Natt N. Novel use of the OSCE as a teaching and assessment tool of medical students’ quality improvement skills. Jt Comm J Qual Improv. 2007;33:48–53.
23 Martin I, Jolly B. Predictive validity and estimated cut score of an objective structured clinical examination (OSCE) used as an assessment of clinical skills at the end of the first clinical year. Med Educ. 2002;36:418–425.
24 Miller GE. The assessment of clinical skills/competence/performance. Acad Med. 1990;65:563–567.
25 Dewey J. Experience and Education. New York, NY: Collier Books; 1938.
26 Kaufman DM, Mann KV, Muijtjens AMM, et al. A comparison of standard-setting procedures for an OSCE in undergraduate medical education. Acad Med. 2000;75:267–271.
27 Humphrey-Murto S, MacFadyen JC. Standard setting: A comparison of case-author and modified borderline-group methods in a small-scale OSCE. Acad Med. 2002;77:729–732.
28 Downing SM, Haladyna TM. Validity threats: Overcoming interference with proposed interpretations of assessment data. Med Educ. 2004;38:327–333.
29 Downing SM. Validity: On meaningful interpretation of assessment data. Med Educ. 2003;37:830–837.
30 Landis J, Koch G. The measurement of observer agreement for categorical data. Biometrics. 1977;33:671–679.
31 Morrison LJ, Headrick LA, Ogrinc G, Foster T. The quality improvement knowledge application tool: An instrument to assess knowledge application in practice-based learning and improvement. J Gen Intern Med. 2003;18:250.
32 Williams RG, Klamen DA, McGaghie WC. Cognitive, social and environmental sources of bias in clinical performance ratings. Teach Learn Med. 2003;15:270–292.
33 Yudkowsky R, Downing SM, Sandlow LJ. Developing an institution-based assessment of resident communication and interpersonal skills. Acad Med. 2006;81:1115–1122.
34 Searle J. Defining competency—The role of standard setting. Med Educ. 2000;34: 363–366.
35 Brazeau C, Boyd L, Crosson J. Changing an existing OSCE to a teaching tool: The making of a teaching OSCE. Acad Med. 2002;77:932.
36 Holmboe E. Faculty and the observation of trainees’ clinical skills: Problems and opportunities. Acad Med. 2004;79:16–22.