Traditional promotion standards at academic institutions may not give scholarly credit to the educational activities of faculty, with the consequence that excellent teachers often compare unfavorably with excellent researchers in decision making about academic advancement. In the 1990s, the Carnegie Foundation for the Advancement of Teaching took the lead in broadening the definition of scholarship to include education,1,2 seeking both to promote a scholarly approach to education and to stimulate productive discussion of criteria for assessing educational scholarship. These writings and others that followed3,4 have attempted to build a sound foundation for the advancement of medical educators. What is at stake is not simply salaries and promotions, but the recruitment and retention of high-quality, energetic educators to teach the next generation of physicians.
In the same period, financially challenged academic health centers and medical schools have expanded the number of clinical faculty to increase clinical revenue. Several medical schools have added new promotion tracks, in part to enhance career satisfaction and retention among clinician educators.5,6 A longitudinal review of the evolution of faculty-promotions policies in U.S. medical schools, conducted by the Association of American Medical Colleges (AAMC), indicated that from 1985 to 2004, the number of clinician educators increased by more than 315%, whereas the number of tenure-eligible faculty increased by only 50%.7 Economic challenges have also prompted a more rigorous approach to the measurement of faculty performance, including that of clinical educators. Even though the academic medicine community has refined the analysis of evidence for educational scholarship and academic recognition,8–11 serious barriers to the advancement of clinician educators remain.12,13 A study from Johns Hopkins University showed that clinician educators were 69% less likely to hold a higher rank than were basic researchers.12
One obstacle to the promotion of educators is that the grant dollars and peer-reviewed publications traditionally used to demonstrate the productivity of research faculty provide inadequate documentation of educational efforts. The conventional curriculum vitae format at most institutions does not provide enough flexibility to document the quality and impact of an educator’s performance. Educator portfolios (EPs) are far more informative, and the number of medical schools using EP documentation in their promotion dossiers increased 400% from 1993 to 2002.14 However, providing evidence to document educational quality has remained challenging; for example, most schools still use subjective student satisfaction ratings as the primary evidence of teaching quality.14 EPs are difficult to evaluate in the absence of recognized standards for documentation and evaluation of the wide array of educational activities and contributions. Without such standards, educators will continue to be disadvantaged in academic recognition. An effort to address this problem prompted our current project.
In a national attempt to define the components of and sound documentation methods for educational scholarship, the Group on Educational Affairs of the AAMC organized a consensus conference of educators from all over the United States and Canada in February 2006. The report from this conference defines five categories for documenting the quantity and quality of scholarly engagement in educational activities: teaching, curriculum, advising/mentoring, educational leadership/administration, and learner assessment.15 These five categories provided a useful framework for our development of an EP template for the Educational Scholars Program (ESP)* of the Academic Pediatric Association.16 ESP faculty created the template for two purposes: to measure the outcomes of this faculty development program, and to help the scholars plan and document their educational careers. The EP template provides a structure for systematically presenting both quantitative and qualitative data. It has been peer-reviewed and approved by MedEdPORTAL, and it is available online.16 Also, an article on how to create an EP, recently published in Faculty Vitae,17 highlights our template.
To meet the need for a widely applicable process for the evaluation of educators, our goal was twofold: (1) to define a set of measurable outcomes that would demonstrate the quality and impact of a faculty member’s educational activities in a format amenable to reproducible analysis, and (2) to develop an analysis tool that educators and administrators could apply across institutions and disciplines. We sought to create a tool that was detailed enough to capture the quantity, quality, and impact of the wide array of educational activities, yet simple enough for practical use by promotion committee members who seek to evaluate a faculty member’s educational performance.
Overview of methodological approach
Ten of us (all except L.G.), all educational experts, built the analysis tool. First, we separated into two teams to develop quantitative and qualitative items (i.e., individual evaluation criteria). These teams worked independently in the first phase of the project. In multiple rounds of item development and selection (see Figure 1), each of the two teams conducted a formal consensus-building process to test and refine items, using data from actual, blinded EPs (n = 27) completed by scholars participating in the ESP. We selected items that demonstrated the quality and impact of an educational activity, and we wrote verbal specifications to define for raters exactly what each item measured. For example, the item Variety of teaching strategies was specified at the intermediate level as “Teaching incorporates three or more strategies that go beyond passive transfer of knowledge (e.g., interactive lectures, small-group sessions, workshops, clinical precepting).” We either defined more explicitly or eliminated discordant items. In parallel, we modified the EP template to improve the quality of the information available for measurement. We created a set of detailed instructions for raters to enhance reliable use of the analysis tool.
In the second stage of the process, the quantitative and qualitative rating teams merged, and we combined all items into a single, briefer tool. Item selection and refinement continued in multiple rounds of testing of the combined tool, again using ratings of data from real EPs.
In summary, educational experts built and validated the analysis tool through an iterative consensus development process and modified the EP template in parallel to provide high-quality data for this analysis. Interrater reliability testing of the analysis tool, using a new set of EPs built on the revised template, will be described in a later paper. The IRB of the Penn State College of Medicine approved this project.
Development of quantitative items
Six of us (L.C., M.G., M.B., D.R., D.I., T.T.) set out to create objective, quantitative items. Reviewing the EP template, we compiled a list of more than 100 potential quantitative items. We divided into three subteams of two raters each and tested items in three different sections of the EP, using 27 blinded EPs (Figure 1, EP Test 1A). Each subgroup identified the critical items and wrote item specifications in the assigned section. For example, Total teaching hours/year was specified as “For each activity, count the number of hours taught per year to obtain teaching hours subscores; total teaching hours/year = total of these subscores.” If EP data were insufficient for measurement, the subteam either eliminated or modified the item to provide more reliable and interpretable data. Then, through consensus, we selected items from the initial list, based on importance and measurability.
To limit the analysis tool to a practical size, we used index scores when appropriate to combine several related measures. Each index captures in a single score the depth, breadth, and impact of a set of similar educational activities. For example, the Curriculum Impact Index totals the scores for all curricula documented in the EP. The score for each curriculum is the product of
* Number of learners taught (<50 = 1; 50–150 = 2; >150 = 3)
* Implementation score (curriculum not yet implemented = 1; curriculum implemented = 2)
* Geographic impact score (department = 1; institution = 3; regional = 5; national = 10; international = 10)
The Curriculum Impact Index for an educator’s EP provides a single, objective, quantitative measure of the impact of all the curricula that an individual educator develops. Weights used in the construction of the index scores were calibrated across the analysis tool to ensure that all indexed items were equivalent in the scores they generated.
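As a minimal sketch, the index arithmetic described above can be expressed in code. The score bands for learners, implementation, and geographic reach come from the bullets above; the function and variable names are ours for illustration and are not part of the published tool:

```python
# Illustrative sketch of the Curriculum Impact Index arithmetic; names are
# hypothetical, score bands are from the text.

def learner_score(n_learners):
    """Number of learners taught: <50 = 1; 50-150 = 2; >150 = 3."""
    if n_learners < 50:
        return 1
    if n_learners <= 150:
        return 2
    return 3

GEOGRAPHIC_SCORE = {"department": 1, "institution": 3,
                    "regional": 5, "national": 10, "international": 10}

def curriculum_score(n_learners, implemented, reach):
    """Score for one curriculum: the product of its three component scores."""
    implementation = 2 if implemented else 1  # not yet implemented = 1; implemented = 2
    return learner_score(n_learners) * implementation * GEOGRAPHIC_SCORE[reach]

def curriculum_impact_index(curricula):
    """Total the per-curriculum scores across all curricula documented in the EP."""
    return sum(curriculum_score(*c) for c in curricula)

# Example: an implemented national curriculum taught to 120 learners, plus a
# departmental curriculum not yet implemented, taught to 30 learners.
index = curriculum_impact_index([(120, True, "national"), (30, False, "department")])
# 2 * 2 * 10 + 1 * 1 * 1 = 41
```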
Development of qualitative items
Working independently, those of us who composed the qualitative team (C.B., L.L., E.Z., D.B.) developed 52 qualitative items, all derived from information provided in the EP template. We selected a three-point rating scale (one = novice, two = intermediate, three = expert) and wrote verbal specifications for an intermediate rating. For example, under Curriculum Development, Quality of Evaluation was specified at the intermediate level as “Curriculum design includes sound learner evaluation methods (valid, reliable, feasible); curriculum is modified periodically using results of learner and program evaluations.” The team tested qualitative items iteratively, using first one and then five more blinded EPs, and refined and distilled items through group consensus (Figure 1, EP Test 1B). We eliminated discordant items that could not be specified precisely enough to achieve consensus among the raters. Like the quantitative group, the qualitative group modified the EP template to enhance the quality of data for analysis.
Compilation of the combined quantitative and qualitative analysis tool
Using a consensus process, the three leaders of the qualitative and quantitative groups (C.B., L.C., M.G.) reduced the 100-plus quantitative and 52 qualitative items down to a list of 43 items (Figure 1, Version 1.1). The first column of the tool (Appendix 1) lists all the sections and subsections of the EP template, the second column shows corresponding evaluative items, and the third and fourth columns show the scoring method selected for each item in the second column.
Within each EP section, we decided on the best combination of quantitative and qualitative items that demonstrated educational quality and impact. We chose a quantitative item if it was valid, important, and could be reliably measured from the data in an EP; we chose a qualitative item if it captured important additional information that was not readily quantifiable. We sought a balance between quantitative and qualitative measures in a tool short enough for practical use by a promotions committee.
Periodically, we invited an external evaluation expert (L.G.) to review our development process and our product. We tested the combined analysis tool in several iterations (Figure 1, EP Tests 2 and 3) and discussed discordance in detail. We achieved consensus on item selection and specification, creating Version 1.2 of the analysis tool (see Figure 1). We modified the EP template for a final time to improve the quality of future data available for analyses using the tool.
Balance between quantitative and qualitative items in the analysis tool
The analysis tool (Appendix 1) comprises seven sections that match those in the EP template and, in addition, a section called Measures of Educational Scholarship that evaluates data across the whole EP. In the tool, 18 items are quantitative, including 8 index scores that combine related data, and 25 are qualitative.
Developing quantitative items to evaluate teaching activities was particularly challenging because the EPs included many permutations of teaching venues, time commitment, and levels and numbers of learners. In our efforts to quantify teaching, we found that our original EP template did not generate accurate, unambiguous data. Reliable analysis of teaching demands very specific formats for documentation, so we expanded the EP documentation grid for teaching activities to include number of years teaching, teaching topics or activities, teaching strategies, number of hours spent teaching, and the number and level of learners taught (medical students, residents, or other learners) (Appendix 2a). We included similar grids in all sections of the revised EP template that call for quantitative evaluation.
Qualitative analysis of the EP comprises 25 items. Two sections of the EP, the educational philosophy statement and the five-year goals, are evaluated by qualitative measures only. Refinement of the qualitative items in the combined tool usually focused on providing sufficient definition of the item and careful, descriptive wording of its middle anchor (two = intermediate) to facilitate reliable evaluation. Data for qualitative items come both from the documentation grids in the EP template and from responses to open-ended, reflective questions included at the end of every EP section.
Appendix 2 demonstrates how the quantitative and qualitative data combine for evaluation of one section of an EP. The teaching activities section is used as an example. Appendix 2a is a blank teaching activities data grid, and Appendix 2b provides an example of a completed grid. Appendix 2c shows how the EP analysis tool summarizes the teaching activities ratings shown in Appendix 2b. This section of the EP evaluation includes a total of three quantitative items (total number of students taught each year, total number of hours taught each year, and geographic [department–international] impact score), one rated qualitative item (variety of teaching strategies), and written comments to clarify the basis for the rating.
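To make the combination concrete, the ratings for one EP section could be held in a simple record like the following sketch. The class and field names are hypothetical, not part of the published tool; the item list and the three-point qualitative scale come from the text:

```python
from dataclasses import dataclass

@dataclass
class TeachingActivitiesRating:
    """Hypothetical record of one EP section's combined ratings (names are ours)."""
    # Quantitative items drawn from the teaching documentation grid
    total_learners_per_year: int
    total_teaching_hours_per_year: int
    geographic_impact_score: int      # department = 1 ... international = 10
    # Qualitative item on the three-point scale: 1 = novice, 2 = intermediate, 3 = expert
    variety_of_teaching_strategies: int
    # Written comments clarifying the basis for the qualitative rating
    comments: str

rating = TeachingActivitiesRating(
    total_learners_per_year=85,
    total_teaching_hours_per_year=120,
    geographic_impact_score=3,
    variety_of_teaching_strategies=2,
    comments="Interactive lectures, small-group sessions, and clinical precepting.",
)
```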
Reciprocal relationship of EP template and EP analysis tool
We made changes in the EP template in parallel with development of the analysis tool. Whereas the quantitative items required improved documentation grids in the template, the qualitative items required structured reporting formats in place of open-ended narrative responses. The new EP template reserves open-ended responses mainly for reflective questions.
For example, in the Curriculum Development section, the original version of the EP template asked for only narrative responses, which proved challenging to evaluate reliably. Therefore, we added to this section a quantitative grid to request data on the type and number of learners for each curriculum, on whether and where it was implemented, and on the author’s degree of responsibility for the development process. These data feed into the quantitative Curriculum Impact Index score (described in the Methods section). To improve qualitative evaluation of curriculum development, we created a second grid, based on the widely used GNOME model (G = Goals, N = Needs, O = Objectives, M = Methods, E = Evaluation).18 In sum, our efforts to create an analysis tool that could be used easily and consistently led to improvements in the structure of the EP template.
Because scholarship measures are particularly important for EP evaluators, the analysis tool includes a separate section called “Measures of Educational Scholarship” (Appendix 1, items 34–43) that evaluates data from all sections of the EP template. This section includes two components: Scholarly Approach to Education and Products of Educational Scholarship.
A scholarly approach to education encompasses self-analysis (or reflective practice) in teaching and educational development activities, and use of “best practices” from the literature or recognized experts.15 Because these activities are usually difficult to quantify, we chose qualitative items and structured them using Glassick’s criteria for educational excellence2 as measurement specifications (see Appendix 1, items 34–40). We found that the most informative parts of the EP for measurement of a scholarly approach to education are the educational philosophy statement (viewed in parallel with the five-year goals) and the open-ended, reflective comments that conclude each section of the EP.
Educational scholarship requires fulfillment of three criteria: public dissemination, peer review, and creating a platform for others to build on.15 We based this section of the analysis tool on quantitative evaluation of products that resemble those used in conventional evaluation of research performance: number of publications and grants. However, because educational products are more diverse than research publications, our Scholarly Productivity Index includes not only peer-reviewed publications but also peer-reviewed and invited presentations and workshops, and disseminated products such as curricula, teaching modules, and evaluation tools that are adopted by other programs.
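The weights used to calibrate the Scholarly Productivity Index are not given here, so the sketch below simply tallies the product categories named above under an assumed, purely illustrative equal weighting; all names are hypothetical:

```python
# Hypothetical tally for a Scholarly Productivity Index. The product
# categories come from the text; the equal default weighting is an
# assumption, not the published calibration.

PRODUCT_CATEGORIES = (
    "peer_reviewed_publications",
    "peer_reviewed_presentations",
    "invited_presentations",
    "workshops",
    "disseminated_products",  # curricula, teaching modules, evaluation tools
)

def scholarly_productivity_index(counts, weights=None):
    """Weighted sum of product counts across the categories."""
    weights = weights or {c: 1 for c in PRODUCT_CATEGORIES}
    return sum(weights[c] * counts.get(c, 0) for c in PRODUCT_CATEGORIES)

counts = {"peer_reviewed_publications": 4, "workshops": 2, "disseminated_products": 1}
index = scholarly_productivity_index(counts)  # 7 under equal weights
```

Different weights could be passed in once a calibration is agreed on, which mirrors the open weighting question the authors raise for their index scores.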
We report the development of a novel tool to measure the quality and impact of the educational activities of medical educators. Our goal was to provide a process and instrument worthy of widespread acceptance that could help promotions committees evaluate EPs objectively and thereby recognize and reward academic excellence in medical educators. The lack of such a valid, reliable, and practical tool limits the utility of EPs. Given the growing use of EPs and the difficulty promotions committees have in analyzing them, our project was, in part, an effort to provide educators with an evaluation process that used methods familiar to researchers. We aimed to achieve a balance between quantitative and qualitative measures: quantitative measures were essential to give the analysis objectivity and reproducibility, but qualitative measures were also necessary to allow a more nuanced assessment of an educator’s performance when numbers fail to tell the whole story. To increase the dependability of the qualitative items, we developed rich descriptions of each. We also used constructs such as Glassick’s criteria for educational excellence,2,3 Miller’s criteria for analysis of learner assessment methods,19 and the GNOME model of curriculum design18 to enhance the credibility of the qualitative items. Finally, we rated the qualitative items on a numeric scale to make them more reproducible and comparable with the quantitative ratings.
An objective EP analysis is useful not only for promotion. There is a national need for more standardized methods to evaluate educators, in order to provide more meaningful quality assurance for the educational mission. Educators need better ways to measure their own performance so that they can build their skills in an organized fashion. Program directors need better ways to assess the quality of their teachers so that they can provide feedback to developing educators and use their limited resources efficiently to educate future physicians. Any discipline that lacks good evaluation tools is prone to mediocrity because good measurement of quality creates a scaffold for ongoing growth and development.
The process of developing the EP analysis tool was instructive. The lessons we learned as a consensus development team may prove useful to other raters who use or adapt this tool:
1. Evaluation of quantitative items taught us (repeatedly) to “Keep it simple!” Proliferation of items is tempting, but a focused selection of essential items makes the tool practical.
2. Good quantitative items are based on judgments of quality; numbers are useful only if their educational value is explicit. Clear instructions help the rater who interprets a quantitative item to measure quality, not just quantity.
3. Evaluation of qualitative items needs to be grounded in sound principles that are clear to the rater. Specification of ratings for each item was critical to achieve concordance in our qualitative ratings.
4. Qualitative items need to be recorded in a numeric (e.g., ranked category) scoring system to give them “competitive equivalence” with quantitative measures. A simple three-level scoring system (one = novice, two = intermediate, three = expert) helped us to achieve concordance.
5. The quality of our analysis tool was highly dependent on the quality of our data source: creation of a structured analysis tool without a similarly structured EP proved futile. The corollary of this lesson is that educators need to learn how to document their work meticulously so that evaluators can rate it through a valid and reliable process.
6. We found the use of real EP examples in both the development of the tool and the training of raters to be essential. Constant reference to the “real world” of EPs was indispensable to testing and achieving consensus.
7. To judge excellence in educational performance, we needed a combination of quantitative items that judge quality, and qualitative items that can be expressed numerically, so that all ratings are amenable to parallel review and, potentially, some kind of holistic summation.
As with any innovative enterprise, our project has left a number of issues for further debate:
1. In developing the analysis tool, we made many decisions based on the literature, but when other models were lacking, we ventured out on our own. We believe that our panel of expert educators and a nationally respected evaluator provided the project with a good grounding in experience and expertise, but we welcome review of our product by a broader, national panel of experts.
2. We used index scores to keep the analysis tool as brief as possible. These scores are weighted by a reasonable scoring system, but different weights could be applied; the index scores may evolve with further discussion.
3. The balance between quantitative and qualitative items in the tool may also engender debate. Our compromise between these two approaches to evaluation satisfied our team of expert educators, but others might create a different kind of compromise. We firmly believe, however, that some combination of both quantitative and qualitative evaluation is needed to capture the complexity of educational performance and to demonstrate its impact.
4. Creation of a summary score for an entire EP was not a goal of this project. However, a composite score that captures the multifaceted components of an educator’s creativity and productivity may be necessary for institutional promotion committees to establish standard expectations of performance.
We hope that this project will contribute to a productive debate that leads toward achievement of a national consensus on EP standards. Medical educators nationwide need to work in a professional environment in which their efforts are valued and rewarded. Their hard work to shape the character and quality of future physicians deserves no less.
Dr. Chandran is associate dean of academic and faculty affairs, vice chair for education, and associate professor of pediatrics, Stony Brook University Medical Center, Stony Brook, New York.
Dr. Gusic is associate dean of clinical education and professor of pediatrics, Penn State College of Medicine, Hershey, Pennsylvania.
Dr. Baldwin is director, Academic Pediatric Association Educational Scholars Program, and professor of pediatrics, University of Rochester Medical Center, Rochester, New York.
Dr. Turner is associate director of housestaff education, codirector of the General Academic Pediatrics Fellowship Program, and associate professor of pediatrics, Baylor College of Medicine, Houston, Texas.
Dr. Zenni is assistant dean of educational affairs and associate professor of pediatrics, University of Florida College of Medicine, Jacksonville, Florida.
Dr. Lane is associate professor of pediatrics, Jefferson Medical College, Philadelphia, Pennsylvania.
Dr. Balmer is director of evaluation, Center for Education Research and Evaluation, Columbia University Medical Center, New York, New York.
Dr. Bar-on is associate dean of graduate medical education and professor of pediatrics, University of Nevada School of Medicine, Reno, Nevada.
Dr. Rauch is director, Pediatric Hospitalist Program, New York University School of Medicine, New York, New York.
Dr. Indyk is director of faculty development and assistant clinical professor, Department of Pediatrics, Jacobi Medical Center, Albert Einstein College of Medicine, Bronx, New York.
Dr. Gruppen is chair, Department of Medical Education, and professor of medical education, University of Michigan Medical School, Ann Arbor, Michigan.
This project was funded by the Academic Pediatric Association (APA) and by a grant from the American Academy of Pediatrics Dyson–Community Pediatrics Training Initiative. Funding from both sources supported operating costs of the APA Educational Scholars Program. The funding organizations had no role in the design and conduct of the study; in the collection, management, analysis, and interpretation of the data; or in the preparation, review, or approval of the manuscript. The corresponding author had full access to all the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
The authors thank the scholars in their program, whose creativity, enthusiasm, and diligence in teaching the next generation of physicians are an inspiration to their teachers and mentors.