Institutional Issues

The Development and Implementation of a Health-System-Wide Evaluation System for Education Activities: Build It and They Will Come

McOwen, Katherine S.; Bellini, Lisa M. MD; Morrison, Gail MD; Shea, Judy A. PhD

doi: 10.1097/ACM.0b013e3181b6c996


Academic health centers (AHCs) use education evaluation data for multiple purposes. On the organizational level, program evaluation is important for internal monitoring as well as fulfilling regulatory requirements from external organizations, such as the Accreditation Council for Graduate Medical Education (ACGME) and the Liaison Committee on Medical Education.1,2 Internally, AHCs must enable the collection of education evaluation data about faculty, learners, courses, rotations, and programs and use the resulting data in conjunction with other metrics to inform institutional processes such as promotion and tenure.3–5 Most medical schools employ a wide range of methods to collect such data. Developing an evaluation system to adequately address such a variety of objectives is a daunting yet necessary task. It is particularly challenging to reconcile the need for a robust evaluation system with the nonsystematic collection and analysis of data within and across departments and programs that span the continuum of undergraduate to graduate medical education.6 Collecting evaluation data in a standardized manner, enabling collation and subsequent assessment and interpretation, is critically important if the data are to be maximally useful.

Models of student and program evaluation exist in higher education, but most are based on settings with one teacher per course in one location.7 Two unique features of AHCs add complexity and thus require new conceptualizations of evaluation. First, in medical education, courses are composed of events—lectures, labs, small groups—through which one group of students interacts with multiple faculty for the duration of the course. A single course has many teachers, each covering a single subject reflecting his or her personal expertise. Second, as learners move along the continuum of medical education, they spend progressively less time in classroom settings and more time in clinical settings, seeing patients and watching instructors in a model more akin to apprenticeship. Moreover, their time in clinical settings varies from days to months, and they may interact with multiple faculty, as well as other health care professionals.

To begin to address these issues, we asked a very broad question: Is it possible to bring all of the education evaluation data collection and reporting needs of an AHC together into one cohesive system? In answer, we present a case study of one AHC’s evaluation program and suggest that the complicated mission of developing a multiprogram, multipurpose evaluation system is not only possible but also has many benefits. The proposed solution is generalizable to other AHCs. We begin with a structured analysis of our needs and then detail the conceptual evaluation model that guided our system. Next, we briefly summarize the amounts and types of data collected in the years leading to full implementation. We conclude with a brief list of needs that emerged during implementation and suggest directions for future growth. Throughout our discussion, much of the focus is on the process and product of evaluation of clinical teaching, but our view of the system we created is wider and includes evaluation of learners (e.g., students and residents) as well as courses and rotations.

Structured Analysis of Needs

The integrated evaluation system presented herein is a comprehensive venue for evaluation across several stages of medical training. It was developed and implemented over four years, from 2003 to 2007, in response to a mandate from the executive vice president of the University of Pennsylvania Health System/dean of the School of Medicine (SOM). The first priority was to implement a standardized evaluation of several hundred faculty across multiple clinical departments and about 60 training programs, thereby enabling the collection and presentation of faculty teaching data in a Web-based dossier. Education sites were geographically distributed among five major hospital training locations: the Hospital of the University of Pennsylvania, Pennsylvania Hospital, the Veterans Affairs Medical Center, the Penn Presbyterian Medical Center, and the Children’s Hospital of Philadelphia, as well as numerous smaller outpatient sites. Before the current system, evaluations were collected, but processes and products were decentralized and not standard across the education curriculum. There was no system to track, aggregate, or report collected data across training programs, though anecdotal reports indicated that response rates were low, particularly for paper systems. This original mandate focused on evaluation of clinical faculty; however, we sought a solution that would extend to other types of education evaluation, such as evaluation of courses, rotations, students, and residents.

In 2003, at the beginning of the project, there was a single Web-based undergraduate medical education (UME) evaluation system, developed in-house, that met most evaluation needs for preclinical teaching. Students were asked to evaluate each lecture, small-group session, and laboratory as these events occurred, and to complete an overall course evaluation at the conclusion of each course. During clinical rotations, students were asked to evaluate didactics, the overall course or rotation, and clinical preceptors. Assessment of students by faculty (and sometimes residents and fellows) was typically handled using paper forms. Not all types of evaluation happened within each course. In graduate medical education (GME), the level of oversight and administration was generally based within a single residency program. Some programs had their own Web-based evaluation systems, but most used paper forms. Within a program, residents and fellows were asked to evaluate clinical experiences or rotations, supervising faculty, and sometimes the overall program. Supervising faculty and instructors were usually asked to evaluate residents and fellows. The multiple stakeholders for the UME and GME evaluation processes and results made consistency impossible.

The development of a unified evaluation system required a commitment to five key principles. First, the new system needed to evolve without creating disruption and discontinuity of existing systems. Numerous existing instruments, processes and systems ultimately needed accommodation. Second, the system had to bring together data for multiple types of teaching. Third, the system needed to enable multidirectional evaluation (e.g., students evaluate the residents and faculty on their clinical team, and faculty evaluate the residents and students). Fourth, the system needed to accommodate multiple types of evaluation forms to promote buy-in from the end users. Finally, and absolutely crucial to the success of the project, the new system required support from the Dean’s Office for the expansion and ongoing support of the Office of Evaluation and Assessment (OEA) within the Academic Programs Office of the SOM. Serving not just as a source of manpower but also as an important buffer between stakeholders and administrators, the OEA provides necessary innovation and oversight for evaluation practices and processes across the SOM. GME and UME are represented in the OEA by directors who administer the evaluation systems and are overseen by the associate dean for medical education research. The OEA houses all evaluation data and protects system access as well as the identity of evaluators. Maintaining the OEA separate from any academic department or program helped to build the credibility of the system and enforce common policies across programs.

As a result of the dean’s original mandate for a single evaluation of clinical faculty and within the context of the five needs listed above, a new evaluation system was developed. A precise timeline for the entire project is difficult to construct because of the organic evolution of the systems and processes, but development followed approximately this order:

  • 2003-2004: Planning phase and pilot testing of data collection options in GME; extending UME data collection to include electives and other clinical experiences.
  • 2004-2005: Choosing a GME data collection option and implementing it in five pilot programs; automating UME report generation and dissemination.
  • 2005-2006: Rolling out GME data collection to about five programs per month and developing the faculty report.
  • 2006-2007: Final implementation and ongoing maintenance and development of the system.

In actuality, two systems run in parallel—one primarily for UME and one primarily for GME. However, all SOM courses and GME programs use the same items to assess clinical faculty. Subsequently, all educational data for a single faculty member are compiled to create a teaching dossier. We expect that standardization will increase over time in terms of common items for evaluating residents and fellows, rotations, and training programs.

The Conceptual Evaluation Model

A factor central to the success of our system is adherence to an evaluation model similar to those suggested by Musick8 and Kogan and Shea.9 Our model adds structure and detail to the issues surrounding each new type of evaluation brought into the system. The model consists of a series of questions that are applicable to any new program, course, or intervention addressed within the evaluation system. The questions provide the systematic guide to developing the system-wide evaluation: (1) Where does the evaluation occur, and in what format? (2) What or who is evaluated and what questions are asked? (3) Who is assigned an evaluation to complete? (4) When do evaluations happen? (5) How is the evaluation system administered?

Where does the evaluation occur, and in what format?

We searched for a cost-effective and efficient way to capture thousands of faculty-student and faculty-resident/fellow evaluation interactions. Paper forms were too inefficient to process, so we decided to use an electronic system capable of addressing many of our priorities: the ability to easily create new evaluation forms and assign them in multiple directions to as many or as few trainees as necessary; scalability, or the ease with which multiple courses and programs can use the system; and scheduling features that allow us to assign evaluations on varied schedules. Ultimately, our choice reflected price, flexibility, and ease of use. For UME evaluation, we currently use an ever-adapting electronic system, custom built in-house. For GME, we licensed a system called Oasis from the company Schilling Consulting.

What or who is evaluated and what questions are asked?

Stakeholders in the evaluation process—program and course directors, for example—are offered a series of choices about what to evaluate within their course or program. Per the dean’s mandate, GME programs are required to evaluate faculty within the shared system, but they are also encouraged to use Oasis to conduct additional evaluations. Generally, accreditation requirements dictate other evaluation needs; for instance, GME programs are usually required by the ACGME to evaluate faculty, residents, fellows, rotations, and the overall program. In UME, the SOM administration determines what is evaluated and has developed standardized instruments to collect data on students, instructors, clinical rotations, events (lectures, labs, and small groups), and courses.

The dean’s mandate of a single evaluation form to evaluate all clinical faculty required that all GME and UME programs and departments agree to use a common set of items. Because the literature does not support the superiority of one existing faculty evaluation instrument over any other,10 a new 10-item instrument was developed. Chairs of each clinical department appointed one faculty member to the Faculty Teaching Dossier committee. This committee developed the processes, systems, forms, and policies to support the common evaluation of faculty. Each of 18 departments submitted their evaluation form(s) for comparison and a master form was created, synthesizing content and response options. Three smaller working-group meetings, as well as three larger consensus-building meetings, were held over a period of several months. The working group reached consensus that

  • The content of the form should focus on teaching skills rather than clinical effectiveness.
  • All items should use a five-point scale.
  • Each item should include behavioral anchors describing effective and ineffective teaching.

Content domains for nine common items were organized by faculty competency areas, with the final question being a global rating. Item stems were presented in two statements representing the effective and ineffective teacher. Items were rated on a scale where 1 = poor, 2 = fair, 3 = good, 4 = very good, 5 = excellent. A complete list of items appears in Appendix 1. We began pilot testing the form in a few departments in summer 2004, and planned to gradually include other programs throughout the 2005-2006 academic year. Ongoing research supports the reproducibility and validity of the evaluation score patterns.11–14 As a result of the system requirement to handle bidirectional evaluation, as programs implemented the required evaluation of clinical faculty form, we were able to set up the collection of learner assessment at the same time without needing additional resources.

Other standard instruments, such as the clinical evaluation of a student and course evaluation used by clerkship and elective courses in UME, were developed in a similar fashion with consensus of a committee appointed by department chairs. The system was designed to allow for a “common plus unique” strategy of evaluation form development, such that course-specific questions could be added to existing standardized forms. For example, the course evaluation form within the medical school curriculum uses the same 11 items for all courses. At the course director’s discretion, unique items can be added. In GME, program directors have discretion in how to ask questions about their residents and fellows, but they are mandated to evaluate a set of six competencies deemed important by the ACGME.2
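
As an illustration of the “common plus unique” strategy, the following minimal Python sketch assembles a form from a fixed common core plus optional course-specific items. It is not the code of the in-house UME system or of Oasis; the function name `build_form`, the duplicate check, and the placeholder item labels are assumptions made for the example.

```python
from typing import Iterable, List


def build_form(common_items: Iterable[str], unique_items: Iterable[str] = ()) -> List[str]:
    """Return the common items first, followed by any course-specific items.

    Hypothetical helper: unique items that repeat an item already on the
    form (ignoring case and surrounding spaces) are skipped, so the common
    core stays identical across courses.
    """
    form = list(common_items)
    seen = {item.strip().lower() for item in form}
    for item in unique_items:
        key = item.strip().lower()
        if key in seen:
            continue  # already covered by the form
        seen.add(key)
        form.append(item)
    return form


# Placeholder labels only; the real item wording is not reproduced here.
COMMON_COURSE_ITEMS = [f"Common item {i}" for i in range(1, 12)]  # 11 shared items
course_form = build_form(COMMON_COURSE_ITEMS, ["Course-specific item A"])
```

Keeping the common items in a fixed position at the start of every form is one way to make cross-course aggregation straightforward, because the shared items can be reported without knowing which unique items a course director added.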

Who is assigned an evaluation to complete?

The first step in implementing the system was to identify administrative staff who understood which faculty, students, residents, and fellows worked together. A one-size-fits-all approach with, for example, monthly evaluations pairing one student and one faculty person was not a practical solution given the diverse range of programs and courses. Assignment of evaluations needed to be flexible. Students in lecture-based courses are asked to evaluate every faculty person they encounter, and the evaluation is available on the day the event occurs. Students, residents, and fellows in clinical rotations are asked to evaluate their faculty, resident, or fellow instructors at intervals ranging from every two weeks to semiannually, according to the existing clinical schedule. Moreover, some programs prefer very specific team-based assignments, such that student A evaluates faculty B. Other programs prefer a more open model where students, residents, and fellows are asked to select from a list the faculty, residents, or fellows with whom they worked during a specific time frame or while at a particular location.

When do evaluations happen?

The intricacies of UME and GME require a system able to handle continuous evaluation. The frequency of evaluation requests is determined by each program, course, or department. At the UME level, most evaluations are determined by the length of a course and its lecture and lab components. In GME, residents and fellows can participate in block rotations with discernible start and end dates; they may have longitudinal experiences one day a week for a year, or they may do shift work with a different team every 24 hours. To address this variability, assigned coordinators within each program define the evaluation events for evaluators. On the last day of the rotation (or predetermined evaluation period), system-generated e-mails are sent to the evaluators to remind them to complete all assigned evaluation(s). Reminder e-mails are sent each week from that time until the evaluation is completed or “inactivated.”
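
The reminder schedule described above can be sketched in a few lines of Python. This is an illustrative assumption about the logic, not the actual Oasis or in-house implementation: the function name `reminder_dates`, the `horizon` cutoff, and the rule that no e-mail is sent on or after the day an evaluation is closed are all hypothetical.

```python
from datetime import date, timedelta
from typing import List, Optional


def reminder_dates(period_end: date, closed_on: Optional[date], horizon: date) -> List[date]:
    """Dates on which an e-mail would be sent for one assigned evaluation.

    period_end: last day of the rotation or evaluation period (first e-mail).
    closed_on:  day the evaluation was completed or inactivated, or None if open.
    horizon:    hypothetical cutoff so an open evaluation yields a finite list.
    """
    dates: List[date] = []
    current = period_end
    # Send weekly until the horizon is passed or the evaluation is closed.
    while current <= horizon and (closed_on is None or current < closed_on):
        dates.append(current)
        current += timedelta(weeks=1)
    return dates


# Rotation ends June 30; the evaluation is completed on July 15.
schedule = reminder_dates(date(2006, 6, 30), date(2006, 7, 15), date(2006, 12, 31))
```

Because each program defines its own evaluation periods, only `period_end` changes between a block rotation, a longitudinal experience, and shift work; the weekly follow-up rule stays the same.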

How is the evaluation system administered?

The OEA monitors use of the evaluation system and is proactive in identifying noncompliant respondents. Over the course of its development, the OEA has also led efforts to develop a set of policies designed to protect confidentiality, deal with unprofessional use of the system, and provide adequate training to coordinators and feedback to stakeholders across all instruments and programs.

Access to the evaluation system is controlled by a university authentication system requiring a username and password. Almost all users are at the lowest level of the three-tiered system of users and have access only to their own assigned evaluations. On the second level are staff within programs and courses who create the evaluation assignments and have access to some reporting functions. These second-tier users have the limited ability to link evaluations to the identity of the evaluator. Thus, to maintain the confidentiality necessary for honest evaluation, it is imperative that the number of second-tier users be limited and that they be well trained. Whereas policies on the confidentiality of student, resident, fellow, and faculty evaluation data were initially handled within each course and training program, our policy has evolved to more limited access, so that the source of comments and ratings cannot be discerned. At the third tier are a small number of “super” users in the OEA who have access to all of the data.
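
A minimal sketch of the three-tiered access model follows. It is offered only to make the tiers concrete and does not describe the actual system: the names `Tier`, `User`, `Evaluation`, and `can_view` are hypothetical, and scoping second-tier users to their own program is an assumption.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    EVALUATOR = 1    # most users: only their own assigned evaluations
    COORDINATOR = 2  # program/course staff: assignments and some reporting
    SUPER = 3        # small number of OEA users: all data


@dataclass(frozen=True)
class User:
    name: str
    tier: Tier
    program: str


@dataclass(frozen=True)
class Evaluation:
    evaluator: str  # user name of the person assigned to complete it
    program: str


def can_view(user: User, evaluation: Evaluation) -> bool:
    """Return True if the user may open this evaluation record."""
    if user.tier == Tier.SUPER:
        return True
    if user.tier == Tier.COORDINATOR:
        # Assumption: coordinators are limited to their own program.
        return user.program == evaluation.program
    return user.name == evaluation.evaluator
```

In this sketch the confidentiality risk discussed above sits entirely in the second tier, which is why limiting and training those users matters more than any rule applied to first-tier users.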

The completion of evaluations is monitored by program coordinators and the OEA. In GME, program coordinators are responsible for regularly checking to make sure residents, fellows, and faculty complete what has been assigned. In UME, all compliance monitoring is handled by the OEA. Unprofessional use of the system, narrowly defined as inappropriate language and/or highly unconstructive feedback, is brought to the attention of the associate dean for medical education research.

Summary of the Amounts and Types of Data Collected

There is little doubt that the integrated evaluation system has been a successful undertaking. At the end of the 2005-2006 academic year, more than 700 students, approximately 700 residents and fellows, and more than 1,000 faculty had used the system in UME or in 1 of 52 GME programs. Throughout the 2006-2007 year, we added 13 GME programs not previously included and enhanced evaluations for programs already in the system by adding evaluation of trainees and programs and/or increasing the frequency of evaluation. At the conclusion of that year we had 30,243 evaluations of 2,562 clinical faculty (including many outside the AHC), 19,383 evaluations of 1,100 residents and fellows, 14,594 evaluations of 65 GME training programs, 180,990 student evaluations of lectures and labs, approximately 10,000 evaluations of student clinical performance, and 8,191 evaluations of student courses.

Needs That Emerged During Implementation

Despite the widespread support for the project and length of the planning cycle, the first academic year of full implementation (2006-2007) highlighted many challenges.

Better reporting

The first faculty evaluation summary report was developed in January 2006 and was called HAMSTER (Housestaff and Medical Student Teaching Evaluation Report). Appendix 2 is an example of one available report: the three-year summary of the teaching activity and evaluations of a single faculty member. It details the number of evaluations and mean rating for clinical teaching (separating UME and GME teaching) and classroom teaching (separating lectures from small-group teaching). Comparison data with percentile ranks are provided for the SOM and on a department level for the larger departments. Summary data are represented in text and graphically in a box-and-whisker plot. Qualitative comments are also included, although not shown in the appendix. This report is available online to individual faculty members, their chair or division chief, educational officers (faculty within each department in charge of collating data for appointment and promotions), and faculty coordinators (staff who work with individual faculty to prepare reappointment and promotion packages). Within the online system, faculty can interactively move through annual data and view item-level data and course-specific comments. The entire report can be downloaded should users need a paper copy.
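
The text does not specify how the percentile ranks in the report are calculated. As a hedged illustration only, the sketch below uses the common mid-rank definition (the percentage of comparison scores below the faculty member's mean, counting ties as half); the function name `percentile_rank` and the sample values are hypothetical.

```python
from typing import Sequence


def percentile_rank(score: float, comparison: Sequence[float]) -> float:
    """Percentile rank of `score` within a comparison group (mid-rank method).

    Assumption: ties count as half. The HAMSTER report may use a different rule.
    """
    if not comparison:
        raise ValueError("comparison group must not be empty")
    below = sum(1 for s in comparison if s < score)
    equal = sum(1 for s in comparison if s == score)
    return 100.0 * (below + 0.5 * equal) / len(comparison)


# Made-up mean ratings for a small comparison group on the five-point scale.
department_means = [3.0, 3.5, 4.0, 4.5, 5.0]
rank = percentile_rank(4.0, department_means)
```

A department-level comparison is only meaningful when the comparison group is reasonably large, which is consistent with the report offering it for the larger departments only.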

We have learned that developing a single reporting format for all of the data collected in the system is not feasible. The compromise solution is to create a small number of standardized reports that are available to program coordinators on an as-needed basis. Whereas UME has already developed a routine schedule and standardized reporting format, GME is currently in the process of doing so. The general format consists of a simple summary showing the item stem, the number of respondents, the mean, and the standard deviation, in addition to qualitative comments. When applicable, the data for a comparison group are shown, usually the course or program, although sometimes the department. For more specialized or extensive needs, program and course directors work with the OEA.
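
The general summary format (item stem, number of respondents, mean, and standard deviation) can be sketched as follows. This is a minimal example rather than the report code itself: the function name `summarize_item`, the placeholder stem, and the use of the sample (n - 1) standard deviation are assumptions.

```python
from statistics import mean, stdev
from typing import Dict, Optional, Sequence, Union

SummaryValue = Union[str, int, Optional[float]]


def summarize_item(stem: str, ratings: Sequence[int]) -> Dict[str, SummaryValue]:
    """Summarize the five-point ratings for one item.

    Assumption: the standard deviation is the sample (n - 1) version; it is
    reported as None when fewer than two ratings are available.
    """
    n = len(ratings)
    return {
        "item": stem,
        "n": n,
        "mean": round(mean(ratings), 2) if n >= 1 else None,
        "sd": round(stdev(ratings), 2) if n >= 2 else None,
    }


# Placeholder stem and made-up ratings on the 1 (poor) to 5 (excellent) scale.
row = summarize_item("Placeholder item stem", [5, 4, 4, 3, 5])
```

The same row structure can be produced for a comparison group (course, program, or department) and placed alongside the individual summary.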

More manpower

If restricted to evaluation of clinical faculty, the dual systems could probably be administered by the 1.5 to 2.0 FTE dedicated to the task. However, the system has been so popular that it is used to conduct multiple types of evaluations. Within the 65 GME training programs, the Oasis system is also used to evaluate residents, rotations, and programs, as well as several unique activities defined by the GME Office. Program coordinators of varying skill and ability manage the evaluation system at the program level, thus creating a need for ongoing support and training. Finally, as use within programs has grown more complex and spread to other clinical sites, more personnel have been added, increasing the OEA to 3.5 FTE.

Policy and oversight

A Faculty Teaching Dossier committee was created and charged with overseeing the evaluation process. Education coordinators were tasked with the responsibility to set policies and determine data accessibility. Early efforts focused on developing items and forms, rolling out the HAMSTER dossier, and exploring the feasibility of setting standards to add data interpretation.15

Directions for Future Growth

Revisit the fee program.

Heretofore, the development and maintenance of the system has been mostly subsidized by the Dean’s Office, and programs have not been charged for access to the system or user accounts. A new fee program is being developed in GME to provide additional administrative assistance to programs with limited personnel resources. The fees collected will offset some of the development costs for additional technical support for reporting improvements.

Develop policies.

Current policy issues include optimal reporting structures (defining what is routine and regularly provided by the OEA versus what is the responsibility of the program), editing (who, if anyone, edits the data and removes unnecessary comments?), inclusion criteria (should published reports be generated from live data or a predefined dataset?), and access (who has access to the raw data?). Additionally, the role of the HAMSTER report in risk assessment among clinicians and incentive plans is yet to be determined. Surely more issues will emerge as use grows.

Streamline forms.

What started with a mandate for creating a single system for evaluating clinical faculty has matured into a multipurpose system currently housing more than 120 evaluation forms, including 50 different forms for evaluating students, residents, and fellows, 20 evaluating courses and programs, and 50 evaluating rotations. Clearly, the forms are crafted to meet programmatic if not regulatory needs; still, in many cases, there is significant overlap. For example, almost all forms for evaluating trainees contain language mirroring the ACGME competencies.2 Standardizing the forms would greatly simplify the reporting structure of the data and improve the ability of stakeholders to synthesize evaluative data and continually assess and improve the system.

Include other types of teaching.

In addition to fulfilling the goal of more comprehensive and improved reporting, our vision includes expanding the evaluation system—that is, the actual collection of data—as well as HAMSTER reporting to include multiple types of teaching. For example, many faculty teach in the biomedical graduate studies programs that offer multiple doctoral and master degrees. Although the infrastructure described here is capable of expanding to encompass these programs, the political negotiation required to achieve this goal is ongoing. We also have requests from several programs to include in the system evaluations for lectures and other didactic formats used in residencies.


Planning and implementing a multiuser, multipurpose evaluation system was a lengthy yet energizing process. The integrated approach brought many benefits to the school. We now have one central repository and a one-look, one-style report for aggregated faculty teaching data regardless of teaching venue. A secondary, anticipated benefit of a system-wide evaluation model was the ability to promote feedback between all levels of students, residents, fellows, and faculty. The regular and consistent use of an evaluation system supports the development of those needed skills.16,17 Third, the infrastructure is in place that will allow us to grow beyond program process assessment to studies of educational intervention effectiveness and efficacy. Fourth, we are generating a state-of-the-art evaluation database that has already fueled multiple research projects.9,11–15

Along the way, we learned several lessons generalizable to the broader field of medical education. First and foremost, change is difficult. Necessity dictated the involvement of dozens of programs in the development process. Some viewed this new process as a welcome change, whereas many more viewed migration to the system as just one more annoying task—especially among those programs that had already invested significant resources in their own electronic systems. Indeed, significant effort was required in the start-up year, but each year is progressively easier. Second, and closely related, we learned that change is slow and patience is necessary. What was initially viewed as a two-year process really took about four years to implement fully. Third, although nearly all stakeholders agree that evaluations are important, program coordinators have many competing time demands related to curriculum, scheduling, and recruitment.

Could this very broad, multipurpose education evaluation system have been built without the dean’s specific mandate to evaluate all clinical faculty in the same way? In our setting, the answer is probably not. Certainly, we could have developed a common set of items and metrics that were used throughout programs but implemented in a multitude of ways and still technically met his mandate, assuming we could find a way to collate the data. His vision of a single, Web-based faculty dossier proved to be the driving force that moved the project forward. Arguably, the vision was more important than the financial support.

We return to our original question: Is it possible to bring all of the education evaluation data collection and reporting needs of an AHC together into one cohesive system? The short answer is theoretically yes, but practically not yet. We now understand what perhaps should have been obvious at the outset—the system needs to constantly evolve and will never be “complete.” The credibility of the system has increased to the point where additional projects may be included relatively easily. As expressed by others, we propose that in addition to informing our understanding of the teaching and learning occurring in our courses and programs and providing measures of effectiveness, the ultimate goal of any such system is to “maximize the size and breadth of data on assessment of competence” and to increase our understanding of the field as a whole.6,8,9


The authors would like to recognize Arthur H. Rubenstein, MBBCh, executive vice president of the University of Pennsylvania Health System and dean of the School of Medicine, for his leadership and ongoing support of this effort.


1 LCME. Current LCME Accreditation Standards. Available at: ( Accessed July 2, 2009.
2 ACGME. Common Program Requirements. Available at: (—dutyhourscommonpr.pdf). Accessed June 29, 2009.
3 Leite D, Santiago RA, Sarrico CS, Leite CL, Polidori M. Students’ perceptions of the influence of institutional evaluation on universities. Assess High Educ. 2006;31:625–638.
4 Shephard K, Warburton B, Maier P, Warren A. Development and evaluation of computer-assisted assessment in higher education. Assess High Educ. 2006;31:583–595.
5 Boud D, Falchikov N. Aligning assessment with long-term learning. Special issue: Learning-oriented assessment: principles and practice. Assess High Educ. 2006;31:399–413.
6 Rossi P, Lipsey M, Freeman H. Evaluation: A Systematic Approach. Thousand Oaks, Calif: Sage; 2004.
7 Avery JA, Bryant WK, Mathios A, Kang H, Bell D. Electronic course evaluations: Does an online delivery system influence student evaluations? J Econ Educ. 2006;37:21–37.
8 Musick DW. A conceptual model for program evaluation in graduate medical education. Acad Med. 2006;81:759–768.
9 Kogan JR, Shea JA. Course evaluation in medical education. Teach Teach Educ. 2007;23:251–264.
10 Beckman T, Ghosh AK, Cook DA, Erwin PJ, Mandrekar JN. How reliable are assessments of clinical teaching? A review of the published instruments. J Gen Intern Med. 2004;19:971–977.
11 McOwen KS, Bellini LM, Shea JA. Residents’ rating of clinical excellence and teaching effectiveness: Is there a relationship? Teach Learn Med. 2007;19:372–377.
12 McOwen KS, Bellini LM, Shea JA. Evaluation of clinical faculty: Gender and minority implications. Acad Med. 2007;82(10 suppl):S94–S96.
13 Shea JA, Bellini LM. Evaluation of clinical faculty: The impact of level of learner and time of year. Teach Learn Med. 2002;14:87–91.
14 McOwen KS, Kogan JR, Shea JA. Elapsed time between teaching and evaluation: Does it matter? Acad Med. 2008;83:S29–S32.
15 Shea JA, Bellini LM, McOwen KS, Norcini JJ. Setting standards for teaching evaluation data: An application of the contrasting groups method. Teach Learn Med. 2009;21:82–86.
16 Ende J. Feedback in clinical medical education. JAMA. 1983;250:777–781.
17 Shaw I, Falkner A. Practitioner evaluation at work. Am J Eval. 2006;27:44–63.

Appendix 1

Evaluation Form for Clinical Faculty Teaching for PENN Medicine Faculty, Developed and Rolled out 2004-2007*

* Items are rated on a scale where 1 = poor, 2 = fair, 3 = good, 4 = very good, and 5 = excellent.

Appendix 2

Example of a Report Available in the Web-Based Housestaff and Medical Student Teaching Evaluation Report (HAMSTER) System for PENN Medicine Faculty, Developed and Rolled out 2004-2007

Three-Year Evaluation Summary 2006-2008 (7/1/2005-6/30/2008)
© 2009 Association of American Medical Colleges