Recertification of practicing physicians, also termed “maintenance of certification,” is now at the forefront of activities for virtually all member boards of the American Board of Medical Specialties.1,2 The goal of recertification is to maintain high standards of medical practice that protect the public by using fair, valid, and reliable methods to assess professional competence.3,4,5 To fulfill this goal, a comprehensive framework that integrates self-evaluation and practice improvement with a secure, proctored examination defines the recertification process for the 21st century.6 Concern about the inability of proctored examinations to assess the full spectrum of clinical competence, including humanistic qualities, professionalism, and communication skills, stimulated the American Board of Internal Medicine (ABIM) to introduce the “patient and peer assessment module,” a practice-based assessment tool, into its new recertification program called Continuous Professional Development (CPD). Whereas residents seeking initial Board certification are required by the ABIM to achieve satisfactory ratings of the core components of clinical competence from their program directors, there is no parallel method for practicing physicians who seek recertification.7
The ABIM's CPD program is composed of three components: self-evaluation, a secure examination, and verification of credentials; the physician pays the fee for the program. The first component, self-evaluation, comprises a series of modular examinations taken at home. Its purpose is both to stimulate study in the disciplines of internal medicine and to encourage improvement of one's practice. The second component, the secure examination, is a traditional proctored examination featuring single-best-answer questions designed to evaluate clinical knowledge and judgment about essential aspects of patient care that a physician should have without reference to medical resources. The third component, verification of credentials, requires both good standing in a hospital or health care delivery system and maintenance of an unchallenged, unrestricted license to practice medicine.
As part of the self-evaluation component, physicians may select, as an elective, the patient and peer assessment module, which incorporates confidential, anonymous surveys of patient and peer ratings pertaining to physician-patient communication and peer assessment of clinical performance. The module also requires completing self-rating surveys and a quality improvement plan (QUIP). The ratings are collected by touch-tone telephone, using a toll-free number and an automated voice-response system. Once the self-ratings and the required number of patient and peer ratings have been received, ABIM provides performance feedback; there is no passing standard associated with this module. After submitting the QUIP, the diplomate receives credit for the module. The feedback and QUIP are intended to stimulate diplomates to self-reflect and improve the quality of the medical care they provide.
Prior to implementation, a pilot study assessed the feasibility of the module using 100 volunteers.8 Participants highly approved of the survey questions, and more than two thirds agreed that the module was a valuable learning experience. The technology used to record the survey ratings performed well.
The purpose of this study was to assess the value of the patient and peer assessment module. Specifically, we raised four measurement questions:
1. Are the ratings reliable? Is the variability in ratings mostly error variance?
2. Are the ratings related to demographic variables (e.g., practice characteristics, health, age, and gender), prior test performance, and program directors' ratings?
3. Are the results consistent with findings in the literature on patient and peer ratings?
4. Do ABIM diplomates find the module valuable?
Although the module does not use a passing standard, if the ratings were to provide reproducible estimates of diplomates' skills, we would be able to incorporate normative data in the feedback that would inform diplomates of their relative performance. To address these issues, the peer and patient ratings of candidates for recertification in all disciplines of internal medicine were analyzed, along with results from their QUIP and demographic data describing practice characteristics collected during the recertification application process.
The required elements of the module include 25 patient surveys, ten peer surveys, two self-rating surveys, and a QUIP. Diplomates received coded forms and brochures explaining the surveys for distribution to 40 patients and 20 physician peers. Research has shown that this sample size is sufficient to obtain the minimum number of completed surveys needed to generate reproducible results.9,10 Currently, distribution guidelines suggest that patients be selected randomly and physician peers be selected from practice colleagues or referring/consulting physicians. The patients and peers use a touch-tone telephone to answer the survey using a coded number, which identifies the diplomate. Each survey takes about eight minutes to complete. Diplomates may monitor their completion rates through the phone system. The diplomate is required to complete two self-ratings, based upon the patient and peer survey questions. When the self-ratings and surveys from 25 patients and ten peers are complete, the ABIM sends the diplomate a confidential, aggregated performance feedback report and the QUIP to complete. The QUIP is designed to elicit 13 discrete responses from participants regarding their insights and reactions to their performance feedback. Also, two open-ended questions probe their planned efforts to improve patient care and invite suggestions for module enhancement. Credit for the module is received once the required elements of the module have been completed.
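The completion rules described above can be sketched as a small check. This is only an illustration: the class and field names are hypothetical and do not describe ABIM's actual system; only the required counts (25 patient surveys, ten peer surveys, two self-ratings, and a QUIP) come from the text.

```python
from dataclasses import dataclass

# Required counts as stated in the text.
REQUIRED_PATIENT_SURVEYS = 25
REQUIRED_PEER_SURVEYS = 10
REQUIRED_SELF_RATINGS = 2


@dataclass
class ModuleProgress:
    """Hypothetical tally of one diplomate's completed module elements."""
    patient_surveys: int = 0
    peer_surveys: int = 0
    self_ratings: int = 0
    quip_submitted: bool = False

    def feedback_ready(self) -> bool:
        # The feedback report is sent once the self-ratings and the
        # required numbers of patient and peer surveys are complete.
        return (self.patient_surveys >= REQUIRED_PATIENT_SURVEYS
                and self.peer_surveys >= REQUIRED_PEER_SURVEYS
                and self.self_ratings >= REQUIRED_SELF_RATINGS)

    def credit_earned(self) -> bool:
        # Credit requires every element, including the QUIP.
        return self.feedback_ready() and self.quip_submitted
```

A diplomate with 25 patient surveys, nine peer surveys, and both self-ratings would not yet receive feedback under this sketch; the tenth peer survey triggers the feedback report, and the QUIP then completes the module.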
Module content and feedback. Survey development was based on extensive research conducted over a 15-year period.9,10,11,12,13 The items on both surveys incorporate aspects of communication skills, humanistic qualities, and professionalism. Table 1 lists the specific items on the surveys. The patient survey is composed of ten items using a five-point rating scale ranging from poor (1) to excellent (5). Patient demographic data are collected and include age, gender, health, kind of doctor, time under doctor's care, number of visits, and whether the patient would recommend the physician to others. The peer survey is composed of 11 items to be rated on a nine-point scale ranging from unresponsive to patients (1) to responsive to patients (9). Peer demographic data are collected and include specialty, gender, practice type, type of patient care, professional relationship, length of time patients have been shared, and whether the peer would recommend the physician to others. The demographic data are not mandatory, but no more than three omissions are permitted for a rating to count as valid. The self-rating surveys address each of the patient and peer items. The performance feedback includes a mean rating for each item, a range of ratings, and the diplomate's self-rating. Mean and range of overall ratings and frequencies for each demographic variable are reported.
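As a rough illustration of the item-level feedback just described (mean, range, and self-rating per item), the aggregation might look like the following. The function name, item labels, and ratings are invented for the example and are not drawn from the study data or from ABIM's reporting software.

```python
from statistics import mean


def item_feedback(ratings_by_item, self_ratings):
    """Summarize each survey item as mean, range, and self-rating.

    ratings_by_item: item label -> list of ratings from patients or peers
    self_ratings:    item label -> the diplomate's own rating for that item
    """
    report = {}
    for item, ratings in ratings_by_item.items():
        report[item] = {
            "mean": round(mean(ratings), 2),
            "range": (min(ratings), max(ratings)),
            "self": self_ratings.get(item),
        }
    return report


# Made-up ratings on the five-point patient scale; labels are placeholders.
patient_ratings = {
    "item_a": [5, 5, 4, 5, 4],
    "item_b": [5, 4, 4, 3, 5],
}
self_ratings = {"item_a": 4, "item_b": 4}
report = item_feedback(patient_ratings, self_ratings)
```

Placing the self-rating next to the mean and range is what lets a diplomate see at a glance where self-perception departs from the raters' view.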
Participants. The data are based on 356 diplomates who elected to take the patient and peer assessment module between 1999 and 2002 in order to fulfill one self-evaluation module requirement of the CPD program. ABIM's database was used to obtain each participant's age, gender, certification status, examination results, and ratings received from the physician's program director at the end of the third year of internal medicine residency training. The program directors' ratings were available for only 100 participants because of missing data and noncomparable rating scales.
The majority of participants (98%) had ten-year, time-limited certificates in either internal medicine or a subspecialty/added qualifications area; 49% had time-limited certificates in internal medicine, and 65% had time-limited certificates in a subspecialty/added qualifications area of internal medicine. The average participant had been certified in internal medicine in 1989 (SD = 3.7). The majority (81%) achieved initial internal medicine certification on their first attempts; 19% had required two or more attempts. About two thirds had certification in a subspecialty/added qualifications area: 55% in one area and 12% in two areas. Twenty-one percent were certified in cardiovascular disease, 21% in gastroenterology, 6% in medical oncology, and 5% or less in each of the other designated internal medicine subspecialties. The average age was 44 years (SD = 4.8), and 73% were men. Participants reported spending an average of 83% of their professional time in direct patient care, 7% in administration, 6% in teaching, and 3% in research.
Patient ratings. Data collected from the 25 patients' ratings of the 356 participants (n = 8,900) revealed 57% of the patients were women, the average patient age was 59 years (SD = 15), and 70% of the patients were over 50 years old. The majority of patients (79%) had been under the doctor's care for more than one year. The health of patients was normally distributed from poor to excellent (4% poor, 20% fair, 44% good, 26% very good, and 7% excellent). Practically all patients (99%) would have recommended their physician to others (data based on only 104 participants since this question was added after the module's initial implementation).
The average overall patient assessment rating was 4.8 (SD = .13) on a five-point scale with a range between 4.2 and 5.0. Table 1 displays item means and standard deviations. For the patient assessment, item means were all very high, 4.7 or higher. Although the variability is small, 14 participants were rated two or more standard deviations below the overall mean; no participant was rated more than one standard deviation above the mean. Participants generally rated themselves lower than did their patients, with the lowest self-rating on the item “Letting you tell your story; listening carefully….”
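The count of participants rated two or more standard deviations below the overall mean can be illustrated with a short sketch. The identifiers and ratings below are fabricated purely for illustration, and the use of the sample standard deviation is an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev


def flag_low_outliers(overall_ratings, n_sd=2.0):
    """Return ids whose overall rating is at least n_sd SDs below the group mean."""
    values = list(overall_ratings.values())
    group_mean = mean(values)
    group_sd = stdev(values)  # sample SD (assumption; not stated in the source)
    cutoff = group_mean - n_sd * group_sd
    return [pid for pid, rating in overall_ratings.items() if rating <= cutoff]


# Fabricated overall patient ratings for ten hypothetical diplomates.
ratings = {
    "D01": 4.8, "D02": 4.9, "D03": 4.7, "D04": 4.8, "D05": 4.9,
    "D06": 4.8, "D07": 4.85, "D08": 4.75, "D09": 4.8, "D10": 4.2,
}
flagged = flag_low_outliers(ratings)
```

In this toy sample only the last diplomate falls below the cutoff. As in the study, a flagged rating can still sit well above the midpoint of the five-point scale.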
Intercorrelations between survey items ranged from a low of .43 (“Being truthful” with “Warning you during the physical exam about what he/she is doing”) to a high of .62 (“Discussing options” and “Encouraging questions”). All intercorrelations were significant at the .001 level or better. There was no significant difference in ratings between older and younger patients or between male and female patients. However, the health of the patient was significantly correlated with overall rating (r = .11, p < .001); those in better health tended to rate their doctors higher. There was a positive relationship between time spent under a doctor's care and overall rating (r = .08, p < .001); patients who had spent more time under the doctor's care tended to rate the doctor higher. The doctor's gender was significantly correlated with overall rating (r = .17, p < .001); female doctors received higher ratings. For the 100 participants who had program directors' ratings, findings showed patient ratings significantly correlated with internal medicine program directors' ratings of participants' humanistic qualities (r = .20, p ≤ .05) but not with the program directors' overall clinical competence ratings. Generalizability theory was applied to the patient survey.14 The variance component for participants was .01 (SE = .07). For norm-referenced score interpretation, the generalizability coefficient was .67 and the 95% confidence interval was small, ± .14.
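The reported patient-survey coefficient and confidence interval can be recovered from the two reported quantities under one assumption: that the SE of .07 is the standard error for relative (norm-referenced) decisions, so that its square is the relative error variance. The function names below are illustrative, and the sketch is not the authors' analysis code.

```python
def generalizability_coefficient(var_person, se_relative):
    """Person variance divided by person variance plus relative error variance."""
    return var_person / (var_person + se_relative ** 2)


def ci_half_width(se_relative, z=1.96):
    """Half-width of a normal-theory 95% confidence interval around a mean rating."""
    return z * se_relative


# Values reported for the patient survey: variance component .01, SE .07.
patient_g = generalizability_coefficient(0.01, 0.07)
patient_ci = ci_half_width(0.07)
print(round(patient_g, 2), round(patient_ci, 2))  # prints: 0.67 0.14
```

Under this reading the error variance is .0049, which gives .01 / .0149, or about .67, and 1.96 × .07, or about .14. Both match the figures reported for the patient survey.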
Peer ratings. Data collected from the ten physician peers' ratings of the 356 participants (n = 3,560) showed that 80% of peers were men, 64% of their professional time was spent in outpatient care, and 93% reported sharing patients with participants for more than one year. Sixty-five percent of the peers were practicing internal medicine or a medical specialty and 14% were in a surgical discipline. The professional relationship between peer and participant was largely referral (63%); 21% were business partners. Nearly all (99%) peers would have recommended the doctor to others (data are based on only 104 participants, since this question was added after the module's initial implementation).
The average overall peer assessment rating was 7.9 (SD = .34) on a nine-point scale, with a range between 6.9 and 8.8. Table 1 displays item means and standard deviations. Consistent with the literature, the mean for the “Integrity” item was the highest (8.1, SD = .35), while the mean for the “Psychosocial aspects of illness” item was the lowest (7.7, SD = .42).9,10,11 Nine participants were rated two or more standard deviations below the overall mean and three participants were rated two standard deviations above the overall mean. Participants rated their medical knowledge and management of hospitalized patients lower than did their peers, while the other items were rated nearly the same by both.
Intercorrelations between survey items ranged from .54 (medical knowledge with compassion) to .81 (problem solving with overall clinical skills). All intercorrelations were significant at the .001 level or better. Peers who had shared patients with participants for longer time periods tended to rate them higher (r = .06, p < .001), but peers in larger group practices tended to rate participants lower (r = -.08, p < .001). For the 100 participants who had complete data, findings showed a significant correlation between peer ratings and program directors' overall clinical competence ratings (r = .25, p < .01). Generalizability theory was applied to the peer survey.14 The variance component for participants was .07 (SE = .21). For norm-referenced score interpretation, the generalizability coefficient was .61 and the 95% confidence interval was ± .41.
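To show how the number of peer raters bears on reproducibility, the sketch below projects the peer coefficient to other numbers of raters. It rests on assumptions that go beyond the text: that the SE of .21 is the relative standard error for the mean of ten raters, and that error variance shrinks in proportion to the number of raters, as in a simple design with raters nested within physicians. The .80 target is hypothetical and is not a standard stated by the authors.

```python
import math


def projected_coefficient(var_person, se_observed, n_observed, n_new):
    """Project the generalizability coefficient to a different number of raters."""
    # Error variance attributable to a single rater (assumed to scale as 1/n).
    single_rater_error = se_observed ** 2 * n_observed
    return var_person / (var_person + single_rater_error / n_new)


def raters_needed(var_person, se_observed, n_observed, target):
    """Smallest number of raters whose projected coefficient reaches the target."""
    single_rater_error = se_observed ** 2 * n_observed
    return math.ceil(single_rater_error * target / (var_person * (1 - target)))


# Values reported for the peer survey: variance component .07, SE .21, ten raters.
current = projected_coefficient(0.07, 0.21, 10, 10)
needed_for_080 = raters_needed(0.07, 0.21, 10, 0.80)
```

With ten raters the projection reproduces the reported coefficient of about .61. Under the stated assumptions, reaching a coefficient of .80 would take roughly 26 peer raters per physician.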
Quality improvement plan. Since the QUIP was modified following the pilot study and some plans were not returned, the QUIP results are based on only 83 participants. Sixty-five percent of these participants thought the feedback would help them improve the quality of the medical care they provided, and 61% thought the module had provided them with a valuable learning experience. The majority (80%) reported that they would routinely participate in self-reflection, and 82% would continue to seek feedback from patients and peers. Some participants (42%) reported their intent to change communication strategies with their patients, while only 28% reported they would change their communication strategies with their peers. Most participants identified specific areas for improvement, such as provide more complete and understandable explanations to patients, discuss options more fully with patients, and improve psychosocial skills.
The purpose of this study was to assess, from a measurement perspective, the value of the patient and peer assessment module in the recertification program. Reliabilities of the measures were comparable to those reported in other research, and the confidence intervals were small.10 Ratings may have been atypically high in this study because the physicians were volunteers. Although the ratings were quite high, there was some variability, allowing some distinction between levels of performance. Interpretation of this distinction is delicate, since even physicians scoring two standard deviations below the mean scored relatively high on the rating scale. However, the purpose of the module is not to identify “problem physicians” in a public sense, but to encourage physicians to improve their practices regardless of their current level of performance. This could be realized by providing normative feedback that would show each physician's performance relative to that of other physicians in similar practices.
The ratings had small but significant correlations with health of the patient, time spent under a doctor's care, participant's gender, number of years sharing patients, and size of group practice. In addition, program directors' ratings were correlated with both the patient and the peer ratings, albeit on different items. This is particularly reassuring since this relationship loosely supports the validity of the assessment measure. Participants' reactions to the module were generally positive and their QUIPs identified specific areas for practice improvement.
These results should be interpreted with some care. First, the participants in this study voluntarily took this module and therefore may not be representative of others in the program. Second, considerable amounts of data were missing for both the program directors' ratings and the QUIP, so findings involving those measures rest on subsamples of 100 and 83 participants, respectively. Third, there are no means to ensure that the rating forms were distributed randomly, although research shows that the ratings are not substantially biased by the method of selection.10 Fourth, the patient survey appears to have a ceiling effect, which prevents meaningful interpretation at the upper end of the scale.
Despite these limitations, these initial findings are heartening. Participants generally felt that the module provided a valuable learning experience and that the results could be used to improve their practices. The ratings are reliable and appear to have enough variability to distinguish levels of performance. As more data are collected, future research can explore the presentation of normative information to assess one's relative performance. However, before this work can be done, variations in ratings across specialties and types of practice must be examined. Research is also needed to examine whether performance is actually improved by the changes made in one's practice. Using this module, a longitudinal study of participants' ratings could be conducted to determine the ultimate value of assessment and self-reflection.