For more than a decade, medical educators have employed standardized students and objective structured teaching examinations (OSTEs) to evaluate the clinical teaching skills of medical faculty.1,2,3,4,5 Recent studies have set more rigorous standards of validity and reliability for these performance-based assessments.5 Some have begun using OSTEs for resident physicians,6 whom the Liaison Committee on Medical Education (LCME) and others increasingly recognize as critically important teachers for medical students and peers.7
OSTEs hold great promise for rapid and rigorous evaluation of clinical teaching skills and of new approaches to teacher training. For resident teachers and clinician-educators, accumulating enough "real life" teaching evaluations for reliable teaching assessment may take years. OSTEs can truncate this timeline, producing prompt, meaningful teaching assessments for important decisions such as resident evaluations or faculty promotions. OSTEs also facilitate outcomes-based educational research, as well as program evaluation of novel initiatives to improve teaching skills.
A major challenge in OSTE practice lies in developing accurate rating scale or checklist instruments appropriate for carefully assessing teaching performance on OSTE stations. Although earlier research has delineated characteristics of exemplary clinical teachers,8,9,10,11 it remains a challenge to translate this body of knowledge into sensitive and specific assessment instruments. Educational researchers have developed and studied numerous instruments,12 some tailored to evaluating residents' teaching skills.13 The SFDP-26, a 26-item rating scale based on the seven teaching constructs of the Stanford Faculty Development Program (SFDP),14 is one of the best-validated rating scales available to evaluate clinical teachers.15,16
The emerging OSTE literature has yet to address definitively the issue of selecting between dichotomously scored checklists and multi-point rating scales for best assessment of teaching performance. Research offers clearer support for using standardized students portrayed by senior medical students, who in non-OSTE studies have shown themselves to be capable evaluators of teaching.17 The related literature on objective structured clinical examinations (OSCEs) sheds light on some of these issues. Senior medical students who act as standardized patient examiners for learners18 may benefit by improving their own communication skills,19 suggesting that standardized students may improve their own teaching skills. The OSCE literature shows more controversy over the choice between checklists and rating scales, with a minority of OSCEs featuring multi-point rating scales,20 although both formats can be used successfully.21
The purpose of our study was to develop and assess reliability and validity for an eight-station OSTE with case-specific, behaviorally-anchored rating scales, all developed specifically for resident teachers. This OSTE is the primary outcome measure for Bringing Education & Service Together (BEST), an ongoing randomized, controlled trial of a longitudinal residents-as-teachers curriculum at the University of California, Irvine (UCI). We hypothesized that our OSTE would demonstrate acceptable reliability and validity when used to evaluate generalist residents' clinical teaching skills before and after a pilot administration of the BEST curriculum.
We reviewed the OSTE literature1,2,3,4,5,6 to guide development of a 3.5-hour, eight-station OSTE for generalist resident physicians, 〈www.residentteachers.com/oste.htm〉. Its stations (Table 1) each last 15 minutes. The stations reflect residents' learning needs for teaching skills development as reflected in the educational literature,22 and particularly as reported in a recent focus-group study of 100 medical students, generalist residents, and faculty23 completed for the BEST study's needs assessment.
Because our literature review revealed no single instrument suitably specific for evaluating performance on each OSTE station, we adapted the published SFDP-26 rating scale15 to our OSTE. We selected by group consensus the SFDP items that fit the objectives and content of each OSTE station, retaining each item stem's original wording and five-point Likert-type rating scale. We added only two additional item stems from Lesky and Wilkerson's previous OSTE study.2 Participants in the consensus process included three attending generalist physicians experienced in clinical teaching and an education specialist. The product was a set of eight case-specific rating scales,20 each featuring 14–24 items.
Since the SFDP-26 was originally designed to evaluate clinical teachers after longitudinal exposure, we needed to hone our rating scales so that raters could accurately complete them after a single 15-minute exposure to each resident teacher. Each individual rating scale also needed to measure the unique, case-specific competencies that its station was designed to test. We wrote detailed, case-specific behavioral anchors for all item stems so that the anchored items match their stations' unique objectives while mirroring the appropriate underlying teaching construct of the SFDP24: learning climate, control of session, communication of goals, promoting understanding and retention, evaluation, feedback, and promoting self-directed learning. The anchors allow trained raters to determine whether they “strongly agree” that a resident demonstrates a given teaching behavior (rating of 5), “strongly disagree” (1), or prefer one of three intermediate levels (2–4). Each station's rating scale ends with the same SFDP-26 global item for rating “overall teaching effectiveness.” See Figure 1 for sample items. On all items, higher scores consistently indicate better teaching skills.
Fifteen fourth-year medical students staffed the OSTEs after completing 30 hours of training as standardized students and raters through a longitudinal clinical teaching elective. Family medicine PGY1s, PGY2s, and PGY3s not enrolled in the BEST study underwent practice OSTE stations during the students' training. For the 59 second-year generalist residents enrolled in the four university-based UCI residency programs participating in the BEST study, the residency directors offered enrollment to 31 residents whose rotation schedules favored participation in the study. A total of 23 PGY2s enrolled: 13 in internal medicine, five in pediatrics, and five in family medicine. Eighteen of these residents underwent a pretest OSTE in August 2001. All 23 residents undertook a posttest OSTE in February 2002. Between the OSTEs, those residents randomly assigned to the intervention group (n = 13) attended a 13-hour teaching skills curriculum, the results of which will be reported separately.
For most stations, one or two students enacted each case while another student watched by remote camera, with all students completing the rating scales for their stations. After the OSTEs, a rater-trained research psychologist [JH] rated occasional stations that did not already have two student ratings, so that all encounters within all stations were independently rated by at least two trained raters. An attending physician with medical education training [EM] completed additional ratings both to corroborate the students' ratings and to test the rating scales' validity, including rating one encounter from each pretest station.
For each participating resident teacher, we calculated each station's summary score using the mean of all raters' scores and including the final item, “overall teaching effectiveness,” within the summary score. We also computed a grand total OSTE score for each resident, summed across the eight stations. We conducted an item analysis and calculated the reliability of the summary score from each OSTE station's rating scale using Cronbach's coefficient alpha. Intra-class correlations measured the inter-rater reliability for each station.
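For readers who wish to reproduce this style of analysis, the two computations above can be sketched as follows. This is a minimal illustration using the standard Cronbach's alpha formula and a simple rater-averaging scheme, not the study's actual analysis code; the array layout is an assumption for illustration.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (subjects x items) matrix of scores.

    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_variances / total_variance)

def station_summary_scores(ratings):
    """Summary score per resident: each rater's rating-scale total,
    averaged across raters (illustrative; assumes a residents x items
    x raters array with the global item included among the items)."""
    ratings = np.asarray(ratings, dtype=float)
    return ratings.sum(axis=1).mean(axis=1)
```

A resident's grand total OSTE score would then simply sum the eight station summary scores.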
Descriptive statistics are listed in Table 1. The standard error of measurement across all stations was 9.75.
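The standard error of measurement follows from a score distribution's spread and its reliability; a minimal sketch of the conventional formula is below. The numeric values in the test are hypothetical and do not correspond to the study's data.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability): the expected spread of observed
    scores around a resident's true score."""
    return sd * math.sqrt(1.0 - reliability)
```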
We evaluated the internal consistency reliability of our OSTE rating scales with Cronbach's coefficient alpha. This rating scale reliability, indicating the degree to which a resident's score on each rating scale item reflects a common underlying teaching construct, exceeded .90 (range = .91–.94) for all eight OSTE stations. The OSTE's mean overall reliability (Cronbach's alpha) was .96 across all stations and test administrations. In our item analysis, we correlated the mean score on each item with the mean summary score for that item's entire rating scale. Of the 160 total items, these item-total correlations were low enough for only three items (<2%) that deleting those items would even minimally increase the alpha coefficients for their rating scales. Thus, virtually every individual item contributes uniquely meaningful information to total scores. Only the global item (“overall teaching effectiveness”) repeats itself across all stations, and only 14 of the 160 items (<10%) cross seven stations. We did not test the effect of these few repeated items on OSTE scores or on reliability studies.
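The item analysis described above, pairing corrected item-total correlations with alpha-if-item-deleted, can be sketched as follows. This is an illustrative implementation of the standard procedure, not the study's code.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (subjects x items) score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    return (k / (k - 1)) * (
        1.0 - scores.var(axis=0, ddof=1).sum() / scores.sum(axis=1).var(ddof=1)
    )

def item_analysis(scores):
    """For each item: corrected item-total correlation (item vs. total of
    the remaining items) and Cronbach's alpha with that item deleted.
    Requires at least three items so the deleted-item alpha is defined."""
    scores = np.asarray(scores, dtype=float)
    total = scores.sum(axis=1)
    results = []
    for j in range(scores.shape[1]):
        rest = total - scores[:, j]  # total excluding item j
        r_item_total = np.corrcoef(scores[:, j], rest)[0, 1]
        alpha_deleted = cronbach_alpha(np.delete(scores, j, axis=1))
        results.append((r_item_total, alpha_deleted))
    return results
```

An item whose deletion raises the deleted-item alpha above the full scale's alpha is contributing little unique information, which is the criterion applied to the three flagged items above.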
Inter-rater reliabilities, calculated with intra-class correlations, exceeded .75 for seven of the eight OSTE stations (Table 1). One station (Station 7, teaching a procedure) had an overall inter-rater reliability of .54. Its inter-rater reliability was .19 for the OSTE's first pretest administration, .62 for the second pretest administration, and .78 across all posttest encounters. Had a single student rated each station, the average reliability would have dropped to an estimated intra-class correlation of .61. Including or excluding the research psychologist's ratings did not appreciably alter inter-rater reliabilities. The overall correlation between summary scores of the physician faculty rater and the medical student raters was .62, sampled across all eight stations.
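The single-rater estimate quoted above follows from applying the Spearman-Brown formula in reverse; a minimal sketch, with illustrative values assuming two raters:

```python
def single_rater_reliability(r_k, k):
    """Step the Spearman-Brown formula down from the reliability of the
    mean of k raters (r_k) to the expected reliability of one rater."""
    return r_k / (k - (k - 1) * r_k)
```

For example, a two-rater intra-class correlation of .75 steps down to .75 / (2 - .75) = .60 for a single rater, consistent in magnitude with the .61 average estimate reported above.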
We assessed the instruments' content validity by several means, including the detailed literature review. A large focus-group study involving learners and faculty23 further informed the objectives and content of the OSTE stations. Immediately following each posttest OSTE station, residents also completed an anonymous written evaluation. Among these 184 evaluations (23 residents × 8 stations), 92% indicated that each OSTE station realistically represented important teaching skills for generalist resident teachers. Five residents (22%) argued that the feedback station (Station 4) featured a student with an unrealistically difficult attitude, and four residents (17%) felt that the mini-lecture station (Station 8) needed either more or less time.
We evaluated the predictive validity of our instruments with three methods. First, we assessed the incremental validity of the OSTE by examining pretest-to-posttest changes in overall OSTE scores for (1) the residents who received teaching skills instruction and (2) the residents who did not receive instruction. The instructed residents' scores improved by more than two standard deviations, while the noninstructed residents' scores did not improve, substantiating the instruments' sensitivity to instruction. Second, we calculated the reproducibility of the SFDP's seven teaching constructs across all OSTE stations and test administrations. These intra-class correlations ranged from .57 to .80. Third, as expected, the experienced second-year residents in the BEST study performed better during the OSTE than did incoming family medicine interns during the practice OSTE training.
Our results support the study's hypothesis that an OSTE tailored to generalist residents can ensure valid and reliable assessment of their clinical teaching skills. Inter-rater and rating scale reliabilities were high for our instruments, meeting experts' expectations for evaluation measures.25 Because the alpha coefficient for the combined score across all eight stations slightly exceeds the individual coefficient for each station, the overall pool of 160 items meaningfully reflects a single underlying construct, which we call "teaching skills." While Station 7 (teaching a procedure) had an overall inter-rater reliability of only .54, we believe that this problem stemmed from a training issue, because its medical student raters achieved an inter-rater reliability of .78 on the posttest after an additional 90 minutes of training.
We also believe our analyses showed the OSTE and its rating scales to be valid instruments for assessing generalist residents' teaching skills. Content validity was strong, as assessed by residents and faculty before, during, and after the pilot study. We believe each rating scale successfully measures its station's unique teaching content; we opted to use detailed behavioral anchors to achieve this case specificity so that the rating scale item stems themselves retain fidelity to the previously-validated SFDP-26 instrument. Good incremental validity and reproducibility of the SFDP's seven clinical teaching constructs supported acceptable predictive validity. Because there were few “real life” teaching assessments of our residents during the study period, we could not conduct extensive construct validation of our measures.
Our results support prior research showing that medical students can provide reliable and valid assessments of their clinical teachers.17 The standardized students in our OSTE, with adequate training and clear guidelines, used detailed rating scales consistently and competently. Checklists would have been simpler to use but might not have permitted the fine discrimination among multiple levels of teaching performance required by outcome studies such as our ongoing BEST trial. Correlations between students' ratings and those of the physician faculty rater were strong.
Limitations in our study should be considered. Our sample was small. Even though we included residents from three specialties, an OSTE designed for generalist resident teachers might not apply well to other specialties. Our participating medical students spent many hours in training, mainly to ensure that all students were consistently using the rating scales' behavioral anchors to assess each station's unique set of teaching behaviors. We believe this effort was justified in helping the students prepare for their own future roles as resident teachers.
Future research needs to continue exploring how OSTEs can best be used to help resident physicians achieve their goals as clinical teachers. Our OSTE would benefit from generalizability studies that analyze sources of variation in ratings. While our sample size in this pilot curricular study did not permit such analyses, we are currently undertaking a larger randomized, controlled trial that includes a generalizability study of the present OSTE, its primary outcome measure. Other questions deserve additional study: do OSTEs require three to four hours of testing time for acceptable reproducibility (as high-stakes OSCEs do),21 or can shorter examination formats offer adequate reliability? Can a single rater for each station achieve reproducibility comparable to that of multiple raters, as is the case with OSCEs?21 Can residents effectively use intra-OSTE feedback to improve clinical teaching skills, as they have done using student evaluations from actual teaching situations?26
Trained senior medical students competently enacted and rated an OSTE for generalist resident teachers, achieving high inter-rater reliabilities with validated, case-specific rating scales whose alpha coefficients were high. Future research should clarify how OSTEs can best be used to help resident teachers and others to improve their clinical teaching skills.
1. Simpson DE, Lawrence SL, Krogull SR. Using standardized ambulatory teaching situations for faculty development. Teach Learn Med. 1992;4:58–61.
2. Lesky LG, Wilkerson LA. Using “standardized students” to teach a learner-centered approach to ambulatory precepting. Acad Med. 1994;69:955–7.
3. Prislin MD, Fitzpatrick C, Giglio M, Lie D, Radecki S. Initial experience with a multi-station objective structured teaching skills evaluation. Acad Med. 1998;73:1116–8.
4. Gelula MH. Using standardized medical students to improve junior faculty teaching. Acad Med. 1998;73:611–2.
5. Schol S. A multiple-station test of the teaching skills of general practice preceptors in Flanders, Belgium. Acad Med. 2001;76:176–80.
6. Dunnington GL, DaRosa D. A prospective randomized trial of a residents-as-teachers training program. Acad Med. 1998;73:696–700.
7. Liaison Committee on Medical Education. Functions and Structure of a Medical School: Accreditation and the Liaison Committee on Medical Education, Standards for the Accreditation of Medical Education Programs Leading to the M.D. Degree. Washington, DC, and Chicago, IL: The Association of American Medical Colleges and the American Medical Association, 2000.
8. Jacobson MD. Effective and ineffective behavior of teachers of nursing as determined by their students. Nurs Res. 1966;15:218–24.
9. Stritter FT, Hain JD, Grimes DA. Clinical teaching reexamined. J Med Educ. 1975;50:876–82.
10. Irby DM. Clinical teacher effectiveness in medicine. J Med Educ. 1978;53:808–15.
11. Gjerde CL, Coble RJ. Resident and faculty perceptions of effective clinical teaching in family practice. J Fam Pract. 1982;14:323–7.
12. Irby DM. Evaluating resident teaching. In: Edwards JC, Marier RL (eds). Clinical Teaching for Medical Residents: Roles, Techniques, and Programs. New York: Springer, 1988:121–8.
13. DaRosa DA. Residents as teachers: evaluating programs and performance. In: Edwards JC, Friedland JA, Bing-You R (eds). Residents' Teaching Skills. New York: Springer, 2002:100–14.
14. Skeff KM, Stratos GA, Berman J, Bergen MR. Improving clinical teaching: evaluation of a national dissemination program. Arch Intern Med. 1992;152:1156–61.
15. Litzelman DK, Stratos GA, Marriott DJ, Skeff KM. Factorial evaluation of a widely disseminated educational framework for evaluating clinical teachers. Acad Med. 1998;73:688–95.
16. Litzelman DK, Stratos GA, Skeff KM. The effect of a clinical teaching retreat on residents' teaching skills. Acad Med. 1994;69:433–4.
17. Marriott DJ, Litzelman DK. Students' global assessments of clinical teachers: a reliable and valid measure of teaching effectiveness. Acad Med. 1998;73(10 suppl):S72–S74.
18. Mavis BE, Ogle KS, Lovell KL, Madden LM. Medical students as standardized patients to assess interviewing skills for pain evaluation. Med Educ. 2002;36:135–40.
19. Sasson VA, Blatt B, Kallenberg G, Delaney M, White FS. 'Teach 1, do 1… better': superior communication skills in senior medical students serving as standardized patient-examiners for their junior peers. Acad Med. 1999;74:932–7.
20. Gorter S, Rethans JJ, Scherpbier A, et al. Developing case-specific checklists for standardized-patient-based assessments in internal medicine: a review of the literature. Acad Med. 2000;75:1130–7.
21. van der Vleuten CPM, Swanson DB. Assessment of clinical skills with standardized patients: state of the art. Teach Learn Med. 1990;2:58–76.
22. Edwards JC, Friedland JA, Bing-You R. Residents' Teaching Skills. New York: Springer, 2002.
23. Morrison EH, Hollingshead J, Hubbell FA, Hitchcock MA, Rucker L, Prislin MD. Reach out and teach someone: generalist residents' needs for teaching skills development. Fam Med. 2002;34:358–63.
24. Skeff KM, Berman J, Stratos G. A review of clinical teaching improvement methods and a theoretical framework for their evaluation. In: Edwards JC, Marier RL (eds). Clinical Teaching for Medical Residents: Roles, Techniques, and Programs. New York: Springer, 1988:92–120.
25. Colliver JA, Williams RG. Technical issues: test application. Acad Med. 1993;68:454–60.
26. Bing-You RG, Greenberg LW, Wiederman BL, Smith CS. A randomized multi-center trial to improve resident teaching with written feedback. Teach Learn Med. 1997;9:10–3.