Murray, Peter M. MD; Valdivia, Jennifer H.; Berquist, Mary R.
Much has changed during the past quarter century with respect to medical economics. Declining reimbursements, the growing ranks of uninsured patients, and uncertainty over the solvency of federal programs all create challenges for academic medicine. The institutional costs of graduate medical education (GME) center on trainee stipends and benefits, but other expenditures are required to ensure a high-quality learning environment. To this end, it has become important for institutions to monitor resources so that they can be used to best support the strategic mission of the practice, the profession as a whole, and the needs of society. In the future, institutions may be faced with difficult decisions concerning GME, particularly as it relates to the growth of existing programs and the creation of new programs.
Evaluating an institution's return on investment in GME can be difficult, if not impossible. Little has been published on internal program assessment, annual institutional program assessment, or comparative program performance.1–4 Although metrics in medicine are not new and are now used widely, a metric devised specifically to evaluate the comparative performance of an institution's GME programs is a new concept. It was our aim to devise a tool that can consistently rate the value of our individual GME programs while aligning with the Mayo Clinic's enterprise-wide strategic priorities (outcomes, safety, and service) as well as the designated clinical priorities of Mayo Clinic in Jacksonville, Florida (transplantation, the neurosciences, cardiovascular medicine, and cancer). Herein, we report our results with the use of our assessment metric scorecard during a two-year period, academic years 2004–2005 and 2005–2006. The scorecard has played an integral part in our institution's prioritization of proposed and existing GME programs. It also provides program directors with a yearly analysis of their programs as well as a perspective on how each program compares with others in our institution.
The Assessment Metric Scorecard
Our assessment metric scorecard seeks to evaluate GME programs at our institution on the basis of four criteria: Research, Teaching, Patient Service, and a General category (Table 1). Also included in the metrics are a Basic Requirements category and a “Bonus Round” category. The first three areas—Research, Teaching, and Patient Service—mirror the primary integrated mission of our institution. Successful training programs must perform at a high level in these areas. Also, the Patient Service, Research, and Basic Requirements portions of the scorecard align with the Mayo Clinic enterprise-wide strategic priorities of patient outcomes, safety, and service by assessing individual quality initiatives of a residency program, specialty board passage rates, program accreditation cycles, and overall patient satisfaction. Further, individual program alignment with Mayo Clinic Florida's designated clinical priorities of transplantation, the neurosciences, cardiovascular medicine, and cancer is assessed. The General category reflects the program's performance in externally applied metrics such as the United States Medical Licensing Examination (USMLE) and accreditation status conveyed by the Accreditation Council for Graduate Medical Education (ACGME). The Basic Requirements are areas that form the foundation of a good residency or fellowship training program. The Bonus Round is a reward to training programs that have distinguished themselves in unique ways and helps programs that are “disadvantaged” in other areas of the scorecard. Individual performance measures have been vetted through our education committee and campus leadership, including our executive board. We consider every metric within this scorecard a measure of one or more of the Mayo Clinic's enterprise priorities of outcomes, safety, and service.
The Research and Teaching categories each constitute 25% of the metric total, Patient Service represents 20%, and the General category accounts for 30% of the score. The final score ranges from 0% to 100% (the percentages equate to 0–100 points; scores >100 and <0 are not recognized) and is used to assess the success or shortcomings of a GME program. The score may also provide a mechanism for internal comparative analysis of training programs to prioritize institutional GME resources. The points given to the four broad categories may be modified to align with an institution's particular mission. The points given to the Basic Requirements and in the Bonus Round may also be modified to complement an institution's mission and goals.
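The way the four category weights combine into a final score can be sketched as follows. This is only an illustrative sketch: the function name, the dictionary keys, and the decision to cap each category at its weight are our assumptions for the example, and the deduction and bonus amounts are passed in rather than derived.

```python
# Category caps taken from the text: Research 25, Teaching 25,
# Patient Service 20, General 30 (100 points in total).
CATEGORY_CAPS = {
    "research": 25.0,
    "teaching": 25.0,
    "patient_service": 20.0,
    "general": 30.0,
}


def final_score(category_points, deductions=0.0, bonus=0.0):
    """Combine category points, deductions, and bonus into a 0-100 score.

    Assumption: each category is capped at its weight before summing.
    A category may be negative (for example, after a probation penalty).
    """
    total = 0.0
    for name, cap in CATEGORY_CAPS.items():
        total += min(category_points.get(name, 0.0), cap)
    total = total - deductions + bonus
    # Scores above 100 or below 0 are not recognized, so clamp the result.
    return max(0.0, min(100.0, total))


# Hypothetical program: 90 category points, one 10-point deduction, 5 bonus points.
example = final_score(
    {"research": 20, "teaching": 25, "patient_service": 15, "general": 30},
    deductions=10,
    bonus=5,
)
```

Because the weights live in a single table, an institution adapting the scorecard would only need to change the caps to reflect its own mission.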
Our scorecard metric seeks to determine the value of our individual training programs at Mayo Clinic Florida. Our previous attempts to accurately and equitably evaluate the return to our institution on dollars allocated for GME have been unsuccessful. We can, however, track the overall financial performance of the individual departments that house residency and fellowship training programs. This financial performance can be normalized for the size of the department by dividing departmental net operating income by total departmental revenue. The resulting operating margin provides a high-level view of the financial position of the individual department supporting the training program and an idea of whether that department can “afford” its training program. We believe that scoring the value of our programs using the traditional definition of quality divided by cost lends too much financial influence to the metric. It is our view that cost should be one of the factors to assess value, but this is a difficult measure to apply, given the difficulties in measuring the financial return on dollars invested in GME. Therefore, we have substituted “financial stability” of the departments housing individual training programs in place of program cost. Financial stability receives weight along with other criteria we consider important.
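The normalization described above is a simple ratio. A minimal sketch, using hypothetical dollar figures and a hypothetical function name, shows how two departments of very different size can be compared on the same footing:

```python
def operating_margin(net_operating_income, total_revenue):
    """Return departmental net operating income divided by total revenue."""
    if total_revenue <= 0:
        raise ValueError("total revenue must be positive")
    return net_operating_income / total_revenue


# Hypothetical figures, for illustration only.
large_dept = operating_margin(1_200_000, 20_000_000)
small_dept = operating_margin(150_000, 2_000_000)
```

In this example the smaller department has the lower net income in absolute terms but the higher margin, which is the comparison the normalization is meant to allow.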
The research portion of the metric examines the program's ability to expose trainees (residents or fellows) to either basic science or clinical research and accounts for 25% of the metric scoring. We consider peer-reviewed publications with residents or fellows as authors the best measure of GME research activity, representing a trainee's participation at all levels of the endeavor, including receiving oversight and mentoring by a faculty member. We also believe that peer-reviewed publications are a direct reflection of outcomes from clinical and basic science research, which positively reflects on best practices at our institution. Although abstracts, letters to the editor, and book chapters are encouraged, we do not provide points for these works.
A 15% weight (maximum of 15 points) is placed on publications. A publication is defined as any article published in a peer-reviewed journal that is referenced in Index Medicus for the academic year in question. Recognizing the time lag that occurs between a resident's or fellow's graduation and manuscript publication, publications by all current residents or fellows in a particular program are included, as well as those by residents or fellows from the previous two graduating classes. In our experience, most publications are captured using this strategy. The score is determined on the basis of the number of peer-reviewed publications per resident or fellow full-time equivalent (FTE) per year (Table 2).
Residents' or fellows' oral presentations at national or international meetings require a level of peer-reviewed scrutiny that implies academic achievement worthy of distinction. Even so, presentations have less weight than publications, constituting 5% of the metric, or a maximum of 5 points (Table 2). To further capture departmental research activity and the opportunity for resident involvement in research studies, points are given for the number of active protocols registered with the institutional review board (IRB) (2.5%, or 2.5 points) and dedicated faculty research time (2.5%, or 2.5 points—see Table 2). The scale of FTE time listed in Table 2 reflects how our institution apportions FTE research time. This scale can be modified to fit the standards adopted at any institution.
The teaching portion of the metric examines three areas—faculty-to-trainee ratio, academic rank, and conferences—and constitutes 25% (25 of 100 possible points) of the assessment metric scorecard (Table 1). Faculty-to-trainee ratio and conferences account for 5% (5 points) each, and academic rank accounts for 15% of the assessment (15 points). The scale can be modified to suit the standards at any institution. However, we believe that faculty academic rank tells a lot about the quality of a training program. Faculty members who have satisfied their institution's requirements for full professorship have undoubtedly made and continue to make notable contributions to that department's training program or programs through research, teaching, and patient service. Such individuals also bring national and international recognition to the program, potentially helping to attract high-quality applicants. Advancement in academic rank serves as an internal quality-control measure specific to the institution, and it therefore is heavily weighted within the context of the teaching category. In contrast, although important, the faculty-to-trainee ratio and number of conferences lend themselves less to comparison among programs because needs arguably vary. For this reason, more easily obtainable standards were established in these categories.
At our institution, a faculty-to-trainee ratio of one to one is considered the standard. A lower ratio may imply a training program that is too service oriented and possibly lacking in oversight. A ratio of two to one is considered exceptional in our metric, potentially enhancing the trainee's learning experience. However, a ratio exceeding that may have diminishing returns because the trainee may be overextended in clinical service areas (Table 3).
Considering faculty academic rank, it must also be recognized that individuals at the level of full professor may be committed to other duties within national organizations or have administrative roles within the institution. This calls into question the optimal academic rank distribution for an effective training program. Thus, the ideal education milieu would not have 100% full professors. Rather, the ideal milieu should include faculty having academic ranks at the assistant and associate professor levels because these individuals may be active clinically and heavily engaged in the day-to-day functioning of the training program (Table 3). We believe that the ideal apportionment at our institution is that 75% of a department's faculty should have the rank of either full professor or associate professor. Programs with this percentage of associate and full professors are scored with 15 points in this category. For the reasons stated above, programs with more than 75% of their faculty with the rank of associate or full professor are not awarded additional points because we recognize that there are possibly diminishing returns (Table 3). Certainly, the requirements for academic promotion vary from institution to institution, but we assume that promotion criteria parallel an institution's priorities. This scale reflects what we feel best suits the needs of the training programs at our institution.
The third area of the teaching evaluation is the number of weekly teaching conferences. To qualify for scoring inclusion, a conference must have a predetermined outline and scheduled reading assignments. At our institution, the mean number of weekly conferences meeting these criteria is two. Programs that have two conferences per week receive 2.5 points, and programs with more than two weekly conferences receive 5 points. Although we recognize that additional conferences may be valuable, more points are not awarded to programs with more than two formal conferences per week because of the possibility of diminishing returns and interference with learning in the clinical arena (Table 3). Our scale may not be relevant for every institution and can be modified on the basis of the dynamics of the training programs at any institution.
Above all else, it is important that the needs of the patient come first. Irrespective of the importance of GME, patient care must be delivered at the highest level. It is critical that education programs be fully integrated with the patient-care activities of the department. Three important areas are assessed in determining the contribution of a training program to patient service: data derived from a telephone interview after a patient visits the clinic or is discharged from the hospital, the designated strategic priority for a department, and trends in a department's patient volumes (Table 4). As more patient outcome data become available, we plan to update this section to include additional program-specific performance metrics.
At our institution, an independent agency, Professional Research Consultants, Inc. (PRC, Omaha, Nebraska), routinely conducts telephone interviews of recent patients. A number of clinically relevant questions are asked, and the responses are analyzed. We have selected the scoring of one of these questions to populate the Patient Service portion of this metric: “What was the overall impression of your doctor?” The responses are scored on a scale of 0 to 100, and a mean with standard deviation is determined. The department is graded on the basis of its score relative to the mean (Table 4). If the department scores within one standard deviation of the mean, it receives 3 points. If it scores more than one standard deviation above the mean, it receives 5 points. If it scores more than one standard deviation below the mean, it receives 0 points.
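The three-tier rule for the patient interview score can be written as a short function. This is a sketch only: the function name and sample values are hypothetical, and we assume that a score falling exactly one standard deviation from the mean counts as "within" one standard deviation.

```python
def patient_satisfaction_points(dept_score, mean, sd):
    """Points for the patient interview item (maximum of 5).

    Assumption: a score exactly one standard deviation from the mean
    is treated as within one standard deviation.
    """
    if dept_score > mean + sd:
        return 5.0
    if dept_score < mean - sd:
        return 0.0
    return 3.0


# Hypothetical institutional mean of 88 with a standard deviation of 4.
high = patient_satisfaction_points(95, mean=88, sd=4)
typical = patient_satisfaction_points(90, mean=88, sd=4)
low = patient_satisfaction_points(83, mean=88, sd=4)
```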
Departments represented within institutionally directed high-priority initiatives receive a maximum of 10 points (Table 4). Departments within medium- or low-priority initiatives receive points accordingly. This category is closely aligned with the strategic plan of our institution.
A steady flow of new patients with new pathologies is essential for GME. Although many factors could be measured, heightening the complexity of this assessment, we believe that the number of new patients and new diagnoses seen is a reasonable measure of a department's clinical vitality. A maximum of 5 points is obtained in this area (Table 4). Assessment of departments such as anesthesiology and radiology is difficult because their volumes depend on other “feeder” areas. Therefore, to be fair, their volumes must be comparatively or historically scrutinized. Training programs within departments that saw 5% average increases in new patients during the previous year are given 5 points. Programs with 0% to 5% increases in new patients receive 2.5 points, and programs with declines in the number of new patients from the previous year receive no points.
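The volume tiers can be sketched as below. The function name and patient counts are hypothetical. We assume that an increase of exactly 5% earns the full 5 points and that no change (0%) earns 2.5 points; the comparative or historical scrutiny applied to "feeder"-dependent departments is not modeled.

```python
def new_patient_points(previous_year, current_year):
    """Points for the year-over-year change in new patients (maximum of 5).

    Assumptions: an increase of 5% or more earns 5 points, an increase
    from 0% up to (but not including) 5% earns 2.5, and a decline earns 0.
    """
    if previous_year <= 0:
        raise ValueError("previous-year volume must be positive")
    change_pct = (current_year - previous_year) * 100 / previous_year
    if change_pct >= 5:
        return 5.0
    if change_pct >= 0:
        return 2.5
    return 0.0


# Hypothetical volumes, for illustration only.
growing = new_patient_points(1000, 1060)
flat = new_patient_points(1000, 1020)
declining = new_patient_points(1000, 950)
```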
The General category contains three important areas: accreditation status, the number of graduates who pursue academic careers, and scores on the USMLE. This category constitutes 30% of the metric score (Table 5). Like the publications and academic rank categories, program accreditation status is most important and is given the greatest weight (15%, or 15 points). Programs that achieve the longest accreditation cycle of five years receive 15 points. Programs are scored 10 points and 5 points for four years and three years of accreditation, respectively. We consider a three-year accreditation cycle a minimum acceptable standard. No points are given to programs with two-year accreditation, and a program placed on probation has 15 points deducted from its score. Although we recognize that certain programs—for example, the foot and ankle surgery fellowship in the department of orthopedic surgery—cannot apply for accreditation through the ACGME, we place a priority on accredited programs, and those programs without the possibility of ACGME accreditation do not receive points. However, a program without accreditation can improve its score in other ways—for example, by obtaining additional points in the Bonus Round.
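The accreditation rules amount to a small lookup. The sketch below uses hypothetical parameter names and assumes that any cycle shorter than three years is treated like a two-year cycle (no points).

```python
def accreditation_points(cycle_years, on_probation=False, acgme_eligible=True):
    """Points for accreditation status (maximum of 15).

    Assumption: any cycle shorter than three years earns no points.
    """
    if not acgme_eligible:
        # Programs that cannot apply for ACGME accreditation receive no points.
        return 0
    if on_probation:
        # Probation results in a 15-point deduction.
        return -15
    return {5: 15, 4: 10, 3: 5}.get(cycle_years, 0)


full_cycle = accreditation_points(5)
short_cycle = accreditation_points(2)
probation = accreditation_points(3, on_probation=True)
not_eligible = accreditation_points(0, acgme_eligible=False)
```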
The number of alumni in academic medicine is a measure of the academic strength of a program. However, some programs, such as family medicine, often choose to prepare physicians for community practice rather than academic practice. For this reason, less overall weight was attached to this factor so that certain programs are not severely disadvantaged. At our institution, however, physicians are prepared for an academic practice, regardless of their discipline. This metric's weightings can be modified to satisfy the strategic priorities of any institution. A running three-year average, recalculated annually, is determined; programs with more than 75% of graduates pursuing careers in academic medicine are given 5 points, and programs with between 25% and 75% of graduates pursuing academic careers are given 2.5 points. Programs with fewer than 25% of graduates pursuing academic careers are given 0 points in this category (Table 5).
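One way to compute the running three-year figure is sketched below. The text does not say whether the yearly percentages are averaged or the graduates are pooled across years; this example pools them, which is our assumption, and the function name and counts are hypothetical.

```python
def academic_career_points(yearly_counts):
    """Points for graduates entering academic medicine (maximum of 5).

    yearly_counts is a list of (academic_graduates, total_graduates)
    tuples, oldest first. Assumption: the three most recent years are
    pooled rather than averaging the yearly percentages.
    """
    recent = yearly_counts[-3:]
    total = sum(graduates for _, graduates in recent)
    if total == 0:
        return 0.0
    pct = 100 * sum(academic for academic, _ in recent) / total
    if pct > 75:
        return 5.0
    if pct >= 25:
        return 2.5
    return 0.0


# Hypothetical program: 11 of 13 graduates over three years in academics.
strong = academic_career_points([(3, 4), (4, 4), (4, 5)])
```

Recalculating annually then only requires appending the newest year's counts; older years drop out of the window automatically.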
USMLE scores are helpful because many trainees take this examination. Programs whose scores average less than 190 have 5 points subtracted from their total scores. For programs whose trainees have scores between 190 and 210, no points are given. Programs with average scores between 211 and 220 and between 221 and 230 receive 3 points and 5 points, respectively. Programs with trainees whose average score is higher than 230 receive 10 points (Table 5).
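The USMLE bands translate directly into a step function. In this sketch the function name is hypothetical, and the handling of fractional averages that fall between the stated bands (for example, 210.5) is our assumption: they are assigned to the lower band.

```python
def usmle_points(average_score):
    """Points for a program's average USMLE score (from -5 to 10).

    Assumption: fractional averages between stated bands (for example,
    210.5 or 220.5) fall into the lower band.
    """
    if average_score < 190:
        return -5
    if average_score < 211:
        return 0
    if average_score < 221:
        return 3
    if average_score <= 230:
        return 5
    return 10


band_points = [usmle_points(s) for s in (185, 200, 215, 225, 235)]
```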
At our institution, we recognize several areas as minimum requirements that all programs should meet. If a program does not meet these basic standards, then 10 points are deducted from the total score for each requirement not fulfilled (Table 1). This method of scoring was adopted for two reasons. First, we expect that our programs meet each of these Basic Requirements. Second, including these areas in the scoring portion of the metric would dilute it to the point that performance in individual categories would carry little meaning.
As an institution, we look for departments and training programs to create and maintain a culturally diverse working environment, which should be reflected in our residents and fellows. We expect every program and every fellowship, where appropriate, to have a 100% specialty board passage rate. We also require that all residency programs be 100% compliant with resident duty hours restrictions and have 100% participation in the resident core lectureship series, which covers multidisciplinary topics common to all programs (this does not apply to fellows). As previously mentioned, all departments that have training programs are expected to be financially stable. This is assessed annually by comparing the department's operating margin with the institution's mean operating margin. Programs with operating margins more than one standard deviation below the mean have points deducted. Every department with a training program is expected to participate in a resident patient quality initiative and a resident patient safety initiative. These quality and safety initiatives have been vetted and approved at an institutional level by the Quality and Safety Subcommittee of the Mayo Clinic Florida Clinical Practice Committee. Finally, programs lose points if they do not fill all their available positions. However, we recognize that many factors affect the filling of residency and fellowship programs, either with or without the National Resident Matching Program or other match processes. For instance, periodically the interviewed applicant pool in a particular specialty or subspecialty may be subpar, and we recognize that a program may choose to rank a limited number of individuals. Furthermore, various personal or geographic factors influence a candidate's decision to rank programs. These factors may have little bearing on the overall quality of the program and the program's ability to attract quality applicants. We do not consider the number of applicants to a program a particularly useful quality indicator. For these reasons, we give each program a “pass” once every four years from having points deducted for not filling.
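The Basic Requirements deduction, including the once-in-four-years "pass" for an unfilled position, can be sketched as follows. The requirement keys and the function name are hypothetical labels chosen for this example, and whether a pass is still available is supplied by the caller rather than tracked here.

```python
DEDUCTION_PER_REQUIREMENT = 10  # points deducted per unmet requirement


def basic_requirement_deduction(requirements_met, fill_pass_available=False):
    """Total points to deduct for unmet Basic Requirements.

    requirements_met maps a requirement label to True (met) or False.
    Assumption: the hypothetical key "positions_filled" marks the fill
    requirement, which is waived when a pass is still available.
    """
    deduction = 0
    for name, met in requirements_met.items():
        if met:
            continue
        if name == "positions_filled" and fill_pass_available:
            continue
        deduction += DEDUCTION_PER_REQUIREMENT
    return deduction


# Hypothetical program that missed its fill and its duty hours requirement.
status = {
    "board_passage": True,
    "duty_hours": False,
    "financial_stability": True,
    "positions_filled": False,
}
with_pass = basic_requirement_deduction(status, fill_pass_available=True)
without_pass = basic_requirement_deduction(status)
```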
In the development of the Bonus Round assessment tool, we recognized that certain areas may disadvantage particular disciplines. Attempts were made either to minimize the impact of these measures or to remove them from the metric altogether. We also recognized that many programs have profoundly unique qualities and strengths. The Bonus Round was devised to help programs capture additional points for areas that we consider indicative of a strong educational program. These areas are not credited in the four primary categories of assessment. Bonus points are given for achievement in the following areas: faculty receiving national or institutional teaching awards, more than 25% of faculty in a particular program receiving teaching awards, residents or fellows receiving national or institutional awards, visiting professor lectureships, residents or fellows holding leadership positions in national organizations, extramural funding for resident or fellow FTE salary and benefits, and programs creating areas of teaching innovation (Table 1).
Scoring the Scorecard
The purpose of the assessment metric scorecard is to provide program directors an objective yearly evaluation of their programs. Each program is provided with an itemized score and a graph that shows how the program relates to the other programs in the institution. The scoring assessment also allows the institution's leadership to evaluate the strengths and weaknesses of each program and offers transparency within the institution's GME activities for the purposes of yearly prioritization of existing programs and planning for new programs. At our institution, a yearly review of all GME programs is conducted in January. Information from the previous academic year is gathered and the metric is scored. The assessment metric scorecard is only a part of our annual review process and serves as a standard for comparison among the programs so that each program can evaluate its individual progress from year to year among the major categories while gaining some perspective on how it compares with the other programs in the institution.
We have included the results of our annual scorecard assessment for a two-year period with aggregate mean scores and standard deviations. Figure 1 compares the 2006 scoring (based on the academic year 2005–2006) with the 2005 scoring (based on the academic year 2004–2005). The names of the individual residency and fellowship programs have been omitted from this diagram. Mayo Clinic Florida has nine ACGME-accredited residency programs: anesthesiology, dermatology, family medicine, general surgery, internal medicine, neurology, radiation oncology, radiology, and urology. There are 16 postresidency fellowship programs: adult reconstructive surgery, behavioral neurology, clinical neurophysiology, cross-sectional imaging, endocrinology, foot and ankle surgery, gastroenterology, hematology/oncology, hepatology, musculoskeletal imaging, neuropsychology, pain management, palliative medicine, pulmonary medicine, sleep medicine, and sports medicine (primary care). The mean aggregate score for our programs increased from 2005 to 2006 (mean, 64 in 2005, 67 in 2006), which we view as a positive trend. The standard deviation from the mean is also indicated. Fellowship programs B, E, and G made considerable gains in their scores, whereas residency program E and fellowship program O showed declines. Fellowship programs I and K were new programs and therefore did not have comparison scores from 2005. Fellowship programs A, B, and C and residency programs A and B scored more than one standard deviation above the mean and are considered outstanding programs. Conversely, fellowship programs N, O, and P, as well as residency program I, scored more than one standard deviation below the mean. Initiatives are under way to improve these programs. A sustained subpar performance will cause our education division to further scrutinize these programs.
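The grouping of programs by their distance from the institutional mean can be sketched as below. The scores are invented for illustration and are not our data; the group labels other than "outstanding" are hypothetical, and the use of the sample (rather than population) standard deviation is an assumption.

```python
from statistics import mean, stdev


def classify_programs(scores):
    """Group programs by their distance from the mean final score.

    Assumption: the sample standard deviation is used, and a program
    must be strictly more than one standard deviation from the mean
    to leave the middle group.
    """
    values = list(scores.values())
    avg = mean(values)
    sd = stdev(values)
    groups = {}
    for program, score in scores.items():
        if score > avg + sd:
            groups[program] = "outstanding"
        elif score < avg - sd:
            groups[program] = "needs_improvement"
        else:
            groups[program] = "within_one_sd"
    return groups


# Invented scores for five unnamed programs.
groups = classify_programs({"A": 90, "B": 70, "C": 65, "D": 68, "E": 40})
```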
Causes for improvements and declines in individual program scores can be attributed to multiple factors. For example, fellowship program B, a relatively new fellowship, graduated its second fellow in 2006 and realized notable gains in its peer-reviewed publications and national presentations. Fellowship program G, another new fellowship, also made considerable improvements in its research endeavors as well as improving its USMLE average of incoming trainees and the percentage of graduates in academics. Fellowship program E made considerable improvements in its program largely on the basis of a recommendation made by the supervising GME committee during an internal review of the program as well as gaining commendation and accreditation by the ACGME. Unfortunately, residency program E had a decrease in peer-reviewed publications along with a decrease in faculty and new patient volumes, all of which affected its score. Fellowship program O had a decline in peer-reviewed publications, among other factors. It is also interesting to note that the programs scoring at the bottom of the scorecard tended to have overall declining scores from 2005 to 2006.
Our assessment metric scorecard is designed to evaluate programs on the basis of parameters that align with the mission of our institution. The metrics are adaptable, and the relative weight of each item can be modified depending on an institution's education mission and strategic plan. Missing are several academic benchmarks such as published abstracts, book chapters, and presentations at regional or local meetings. Although we consider these academic activities important, to include those items would diminish the relative impact of other parameters. All items simply cannot be included because doing so would dilute the weight of more important metrics. Furthermore, 15% of a program's score is determined by faculty academic rank. To obtain all 15 points in this category, 75% of the staff of the department must be at the rank of associate professor or full professor. Those with senior academic rank have arguably demonstrated broad academic activity, which implies participation in nonmeasured activities such as book chapters and published abstracts as well as presentations at regional or local meetings. At the same time, we believe that a staff mix with more than 75% of staff members having senior academic rank yields diminishing returns, because staff are needed at the assistant and instructor levels as well to round out a particular program.
We recognize that considerable emphasis is placed on residents' peer-reviewed publications (15% of our assessment metric scorecard directly, plus indirect emphasis through the faculty academic rank score). We believe that this is justified because the publication of original articles demonstrates sustained academic activity over time and also indicates a predictable level of trainee–teacher interaction. Further, peer-reviewed publications are strongly emphasized by the leadership of Mayo Clinic Florida. Although it is unrealistic to think that the majority of residents or fellows will see their research project through to publication during their training, the accounting of publications in our scorecard includes the last two graduating classes, allowing for the anticipated lag time in publication. Tracking and verification of peer-reviewed publications are also easier than they are for book chapters, published abstracts, or regional presentations. Also, it can be difficult to gain consensus on what constitutes a book chapter, presentation, or published abstract and to verify that they have been done. We consider peer-reviewed publications to be articles that appear in journals indexed by Index Medicus. This definition enables easy acquisition of those data. Furthermore, time devoted to research and the number of IRB-approved active research protocols are counted, which indicates the current opportunities for trainees to do research and also indicates a program's dedication to research.
There is overlap in some metric areas such that some categories are pervasive throughout the assessment tool. For instance, at most institutions academic rank cannot be obtained without success in peer-reviewed publishing. Gaining points in academic rank, then, also means that publication has occurred. Therefore, gaining points in the academic rank area further rewards research in an additive fashion.
Absent in our assessment metric scorecard is scrutiny of the number of applicants to a program and how far down the match list a particular program goes each year to fill its program. We do not believe these match data serve as valuable assessment metrics, and these data have little bearing on the overall quality of a program. So that the importance of the four primary categories—Research, Teaching, Patient Service, and General—is not diluted, Basic Requirements were established that every program must meet; otherwise, points are deducted from the overall score.
With increasing constraints on the health care dollar, institutions may scrutinize resources allocated to GME. In our institution, we consider it important to monitor the financial stability of departments that train residents and fellows. We have found it challenging to accurately assess the downstream revenue of GME programs at our institution, thereby making fair cost accounting difficult. We can, however, determine which departments pass on operating costs to other departments in our institution in the form of negative operating income. Therefore, in our scorecard, we track operating margins, because we consider this the fairest method to assess comparative departmental financial stability. We believe that the sustainment or growth of individual training programs must be evaluated in the context of each department's budget because all related training program costs are a departmental charge-back at our institution. Even though we operate as an integrated group practice, each department is expected to be financially viable.
ACGME requirements have long served as a quality indicator for GME programs. However, some educators have begun to question whether this accreditation is a valid stand-alone quality metric. Programs are sufficiently dynamic that programs with five-year accreditation could be aided by the annual review of the data collected by our scorecard. Also, there have been indications that the ACGME is moving toward an outcomes-based accreditation process.2–6 Kahn and Blonde2 have indicated that accreditation cycle length was not a contributing factor to overall fellowship performance. Berquist et al7 found that accredited and nonaccredited musculoskeletal radiology fellowships provided similar training opportunities. Few, if any, quality indicators exist to help institutions prioritize their GME initiatives. On the basis of surveys of program directors and residents, Klessig et al6 developed a group of quality indicators for internal medicine programs. The most highly rated quality indicators in that study were attributed to processes, and the lowest-rated quality indicators were outcome measures such as graduates in academics, faculty research, and training of minorities and women.6 Elliott et al5 developed a broader framework to define quality in GME programs. They rated quality on the basis of characteristics of trainees produced by the program, cost-effectiveness, the program's ability to meet society's needs, and how fairly and ethically the program treated its trainees.5 Yudkowsky et al4 found that psychiatry residents and program directors had differing opinions concerning the best quality indicators for a training program. Kahn and Blonde2 defined several quality program indicators for internal medicine subspecialty fellowship programs, including the ratio of fellows to teaching faculty and the program's research activity. These two indicators are also used in our assessment metric scorecard.
A comprehensive assessment metric for GME programs is overdue. Iverson8 called for implementation of a report card for GME programs and for specialty societies to make that report card available to the public. In his report card and in alignment with our scorecard, Iverson8 proposed that fill rates, entrance examination scores, board scores, and accreditation information be tallied.
We recognize that our scorecard has some shortcomings. In developing the scorecard, it became apparent that some departments may be disadvantaged by certain metrics. For instance, the PRC survey of patients' satisfaction with the “overall physician experience” in primary care areas could be negatively skewed if patients came to our institution with the expectation of seeing a subspecialist. For this reason, we have sought to balance the metrics included in the scorecard and to include the potential of bonus points where unique aspects of certain programs can be highlighted. Additionally, the seniority of individual residents was not taken into account when considering specific metrics and may have been a factor when comparisons among programs were made. The quantification of publications and presentations, however, was normalized on a yearly basis to take into account variations in program size.
In the first two years of using our assessment metric scorecard, we have seen an increase in the overall mean score, which we view as a positive trend. Valuable data regarding the strengths and weaknesses of individual programs have also been identified and provided to program directors. One disturbing trend identified is the decline in scores from 2005 to 2006 of our lowest-performing programs (fellowships P and O and residency I in Figure 1). Further, the departments housing these programs had an operating margin one standard deviation below the mean for our institution. In response to these issues, our Division of Education at Mayo Clinic Florida has resolved that we must enhance the sharing of “best practices” among all our training programs.
We have used the final scores on our assessment metric scorecard to help make strategic decisions regarding GME program growth and program prioritization at our institution. As the financing of medicine becomes more labile and the future of government subsidization of GME becomes uncertain, resources available for GME may diminish. To that end, it may become necessary for institutions to apportion funding for existing and future GME efforts on the basis of a prioritization process.
We have found our assessment metric scorecard to be a useful tool for yearly GME program evaluation at our institution, not only for our program directors but also for our governing board. The scorecard is flexible: the weight of the component metrics can be adjusted to align with the strategic priorities of any institution. The resources necessary to implement this scorecard are minimal.