Validation of a Global Measure of Faculty's Clinical Teaching Performance

Williams, Brent C. MD, MPH; Litzelman, Debra K. MD; Babbott, Stewart F. MD; Lubitz, Robert M. MD, MPH; Hofer, Tim P. MD


Although many methods of evaluating clinical teaching are possible, the mainstay of teaching evaluation is assessments completed by learners. Instruments have been developed to measure learners' assessments of several domains of clinical teaching 1–3; however, these relatively lengthy instruments can be difficult to administer to all learners for all teachers, especially when the assessments are repeated over time. For some purposes—e.g., identifying candidates for academic promotion, rewarding good teachers, linking resources to teaching quality, and identifying less effective teachers for further assessment or faculty development—representative comparisons among teachers should be ensured. To achieve this, the learners' assessments must produce information about most teachers as rated by most learners. Fortunately, the information necessary to make these types of administrative decisions (promotions, rewards, etc.) can be global, and need not include detailed information about a particular faculty member's strengths and weaknesses as a teacher.

We have developed a simple, single-item assessment instrument to measure the quality of clinical teaching among faculty. The reliability of the University of Michigan Global Rating Scale (GRS) has been demonstrated, with reliability coefficients greater than .7 when there are eight or more evaluations by learners. 4 This report describes the validity of the GRS, when compared with a more detailed teaching assessment instrument.
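The reported relationship between the number of learner ratings and reliability follows the standard Spearman–Brown result for the reliability of a mean of k ratings. A minimal sketch of that calculation (the single-rating intraclass correlation of 0.25 is a hypothetical illustration, not a value from the study):

```python
def mean_score_reliability(icc: float, k: int) -> float:
    """Spearman-Brown: reliability of the mean of k ratings,
    given the intraclass correlation (ICC) of a single rating."""
    return k * icc / (1 + (k - 1) * icc)

# With a hypothetical single-rating ICC of 0.25, the reliability of a
# faculty member's mean score crosses the .7 threshold at eight raters.
for k in (4, 8, 16):
    print(k, round(mean_score_reliability(0.25, k), 3))
```

This is why the GRS is recommended only for faculty with roughly eight or more evaluations: with fewer raters, the mean score is too noisy to support comparisons.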


The GRS was administered in June 1998 to all senior residents of three residency programs and half of the residents at a fourth program. The programs were located in two large academic medical centers, one community-based teaching hospital associated with an academic medical center, and an independent academic medical center affiliated with a medical school. In the four programs, there were 36, 38, 15, and 11 eligible graduating residents. Responses from two residents at the first program were incomplete, yielding 98 usable responses.

When completing the GRS, residents were asked to rate the educational value of time with an attending physician using a five-point Likert-type scale [1 = poor (bottom 10–20% of teachers), 2 = below average (15–25% of teachers), 3 = average (middle 25–35% of teachers), 4 = above average (15–25% of teachers), and 5 = excellent (top 10–20% of teachers)]. The residents were asked to rate each attending physician with whom they had any teaching contact during their residency, using a roster of all clinical teaching faculty. They also rated the amount of teaching contact they had with each faculty member along a five-point scale (1 = no contact, 5 = one or more months as the resident's inpatient attending physician).

At the same sitting, the residents were asked to complete the 26-item Stanford Faculty Development Program questionnaire (SFDP26) for a subsample of ten of the teaching faculty with whom they had some teaching contact. The subsample of faculty for each resident was chosen from a pool of faculty equal to the number of residents responding. The pool was selected by each residency program director to represent approximately equal proportions of teachers whom they believed to be poor, average, and good teachers. To maximize the likelihood of a different subsample of ten faculty for each resident, each resident began at a different point on the faculty roster and selected ten consecutive names of faculty with whom he or she had had some teaching contact. Thus the number of faculty rated by each resident was fixed at ten, while the number of residents rating each faculty member depended on the number of residents who had had teaching contact with that faculty member.

The SFDP26 was developed in conjunction with the Stanford Faculty Development Program for Clinical Teaching, and contains 25 items plus an overall measure of teaching performance. All the SFDP26's items are measured along a five-point Likert-type scale. The 25 items measure teaching performance in each of seven domains—learning climate, control of session, communication of goals, understanding and retention, evaluation, feedback, and self-directed learning. The factorial validity of the SFDP26 has been demonstrated among medical students 3 and residents, 5 and has been shown to correspond highly to medical students' global assessments of clinical teaching. 6

To examine the relationships between GRS scores and the seven subscales of the SFDP26, we used a multivariate random intercept model to decompose the variances and covariances between the GRS score and the seven SFDP26 subscales into a within-faculty covariance matrix and a between-faculty covariance matrix. 7,8 This allowed us to simultaneously estimate the intraclass correlation coefficients for each measure as well as the intercorrelations of the faculty-level scores for each measure. The primary advantage of using multilevel statistical methods for this analysis was that the faculty-level covariance matrix yields the population correlations between the faculty-level scale scores. 7 As such, they are corrected for the reliability of the faculty score for each measure. The usual correlation analysis of the observed faculty mean scores will tend to underestimate the true population correlation to the extent that the scores have a reliability less than one.
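The correction for reliability described above has a simple closed form in the two-variable case: an observed correlation between two unreliable mean scores is divided by the square root of the product of their reliabilities. A minimal sketch with hypothetical numbers (the multilevel model the authors fit performs this correction jointly for all eight measures rather than pairwise):

```python
import math

def disattenuate(r_observed: float, rel_x: float, rel_y: float) -> float:
    """Correct an observed correlation between two faculty-level mean
    scores for measurement unreliability (attenuation correction)."""
    return r_observed / math.sqrt(rel_x * rel_y)

# Hypothetical values: an observed correlation of .80 between two scale
# means, each with reliability .85, implies a population correlation of
# about .94 between the true faculty-level scores.
corrected = disattenuate(0.80, 0.85, 0.85)
```

The corrected value is always at least as large as the observed one, which is why ordinary correlations of observed faculty means understate the true population correlation.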

We examined the effect of contact time on the reliability of faculty scores and the correlations between measures of faculty performance in the multivariate random-intercept models by introducing contact time and terms for the interaction of contact time and each measure into the models. Due to the large number of parameters in the multivariate random-intercept models, intercorrelations of faculty-level scores could not be estimated separately for each institution. However, to obtain an estimate of the stability of the results across institutions, we qualitatively compared the correlations between empirical Bayes' estimates of GRS and SFDP26 scores within each institution.
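Empirical Bayes estimates of the kind compared across institutions shrink each faculty member's observed mean toward the grand mean in proportion to the reliability of that faculty member's score. A sketch under assumed variance components (the between- and within-faculty variances here are hypothetical; the grand mean of 3.83 is the study's reported mean GRS score):

```python
def eb_estimate(faculty_mean: float, grand_mean: float,
                between_var: float, within_var: float, n_raters: int) -> float:
    """Empirical Bayes (shrinkage) estimate under a random-intercept model:
    the observed faculty mean is pulled toward the grand mean by a factor
    equal to the reliability of that faculty member's mean score."""
    reliability = between_var / (between_var + within_var / n_raters)
    return grand_mean + reliability * (faculty_mean - grand_mean)

# Hypothetical variance components: a faculty member rated 4.5 by
# 10 residents is shrunk part of the way toward the grand mean of 3.83.
shrunk = eb_estimate(4.5, 3.83, between_var=0.30, within_var=0.70, n_raters=10)
```

Faculty rated by few residents are shrunk more strongly toward the grand mean, which stabilizes comparisons when rating counts vary widely, as they did here.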


In all, 98 residents (98%) completed GRS and SFDP26 questionnaires. Each resident was required to complete the GRS and SFDP26 for ten faculty members with whom he or she had had some contact; complete data were available for 96 faculty.

The mean number of residents rating individual faculty members on the GRS and the SFDP26 was 18 (range, 3–34). The mean GRS score (SE) was 3.83 (.07); SFDP26 subscale scores are shown in Table 1.

Table 1: Mean Scores (SE) on the University of Michigan Global Rating Scale and the Stanford Faculty Development Program (SFDP26) Subscale for 96 Faculty Rated by Senior Medical Residents, 1998*

Correlations between the mean GRS global scores and mean SFDP26 subscale scores for all faculty members ranged from .86 to .98 (see Table 2). The correlations of these scores within the individual institutions showed a similar pattern, although they were slightly lower at one institution. The amount of teaching-contact time did not affect the correlations between GRS global scores and SFDP26 subscale scores.

Table 2: Correlation of Average University of Michigan Global Rating Scale Scores and Stanford Faculty Development Program (SFDP26) Subscale Ratings between Faculty, 1998


The clinical teaching enterprise is being influenced by powerful forces in contemporary medical education. Chief among these forces are competing demands for clinical and research productivity from faculty, and reductions in funding and resources for medical education. To protect the teaching mission, many teaching programs are devising new ways of organizing and providing resources for teaching programs. These include explicitly assigning teaching relative-value units as an index for resource allocation, 9 providing bonuses for excellence in teaching, and refining methods for including teaching performance in promotion criteria. 10 Ideally, a method of measuring teaching quality should be administrable uniformly and reliably, with minimal administrative cost, to large groups of learners evaluating large numbers of teachers.

Detailed learners' assessments of clinical teaching performances across a number of domains are essential to identify the teaching strengths and weaknesses of individuals or groups of faculty, for example, to structure faculty development efforts. 11 However, limited response rates and the difficulty of maintaining uniform, consistent administration of ongoing evaluations limit the usefulness of such detailed assessments for administrative decision making and for rewards and incentive programs. The GRS was developed to provide a global measure of teaching performance for over 200 faculty. It was essential that the measurement be uniformly applied, to allow reliable and representative comparisons across faculty members.

While the objectives of uniformity and brevity have been accomplished in the GRS, and evidence supports its reliability and validity, several limitations should be considered when using it to measure faculty's performances. First, to achieve adequate reliability it should be applied only to faculty for whom at least seven or eight ratings are available. Second, it lacks the precision needed to make fine distinctions among faculty based on small differences in average scores. Rather, the GRS may serve as the basis for relative rankings among faculty to distinguish high-, middle-, and low-performing teachers. Third, the GRS should be used as one component of a teaching-evaluation system in conjunction with other modalities such as rotation evaluations, lecture evaluations, and qualitative information about teaching performances.

These considerations are reflected at the University of Michigan Department of Internal Medicine, where faculty with eight or more evaluations are ranked by quintile based on the GRS each year. The application of GRS scores using quintiles has been demonstrated to yield stable year-to-year faculty rankings. 4 Faculty in the top quintile receive a teaching bonus. For decisions related to the allocation of teaching assignments and promotions, the GRS serves as one component in an overall evaluation that considers other types of evaluation data and information about that particular faculty member.

Given the high-stakes nature of the allocation of rewards and incentives for clinical teaching, it is essential that each component of a system to measure the quality of teaching be valid. The GRS meets this criterion by strongly correlating with residents' assessments of faculty's teaching performances in each domain of teaching identified in a widely disseminated educational framework. The results were consistent across a variety of types of teaching programs, suggesting that the GRS can be used in small as well as large programs, and in community-based as well as university-based programs. This instrument, or any other simple and easily administered global measure of clinical teaching quality, can serve as one component of a system to appropriately and fairly recognize, compare, and reward teaching faculty.


1. Irby D, Rakestraw P. Evaluating clinical teaching in medicine. J Med Educ. 1981;56:181–6.
2. Hayward RA, Williams BC, Gruppen LD. Rosenbaum D. Measuring attending physician performance in a general medicine outpatient clinic. J Gen Intern Med. 1995;10:504–10.
3. Litzelman DK, Stratos GA, Marriott DJ, Skeff KM. Factorial validation of a widely disseminated educational framework for evaluating clinical teachers. Acad Med. 1998;73:688–95.
4. Williams BC, Pillsbury MS, Kolars JC, Grum CM, Hayward RA. Reliability of a global measure of faculty teaching performance. J Gen Intern Med. 1996;12(suppl):S100.
5. Litzelman DK, Westmoreland GR, Skeff KM, Stratos GA. Factorial validation of an educational framework using residents' evaluations of clinician-educators. Acad Med. 1999;74(10 suppl):S25–S27.
6. Marriott DJ, Litzelman DK. Students' global assessments of clinical teachers: a reliable and valid measure of teaching effectiveness. Acad Med. 1998;73(10 suppl):S72–S74.
7. Snijders TAB, Bosker RJ. Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling. Thousand Oaks, CA: Sage, 1999.
8. Goldstein H. Multilevel Statistical Models. 2nd ed. New York: Halstead Press, 1995.
9. Bardes CL. Teaching counts: the relative-value scale in teaching. Acad Med. 1999;74:1261–3.
10. Lubitz RM. Guidelines for promotion of clinician-educators. J Gen Intern Med. 1997;12(2 suppl):S71–S78.
11. Vu TR, Marriott DJ, Skeff KM, Stratos GA, Litzelman DK. Prioritizing areas for faculty development of clinical teachers by using student evaluations for evidence-based decisions. Acad Med. 1997;72(10 suppl):S7–S9.
© 2002 Association of American Medical Colleges