Wright, Christine PhD; Richards, Suzanne H. PhD; Hill, Jacqueline J.; Roberts, Martin J.; Norman, Geoff R. PhD; Greco, Michael PhD; Taylor, Matthew R.S.; Campbell, John L. MD
Editor’s Note: Commentaries by P. Rubin; E. Holmboe and K. Ross; and K. Weiss appear on pages 1654, 1657, and 1660.
Internationally, there is increasing interest in monitoring and evaluating the professional performance and behavior of doctors. Such regulation seeks to protect patients by ensuring that all doctors are fit to practice medicine and deliver good-quality care. However, there is no consensus about the best approach to evaluating doctors’ performance, and many countries have developed their own systems.1,2
In the United Kingdom, all doctors who wish to practice medicine must be registered with and licensed by the General Medical Council (GMC). A process of “revalidation” is expected to be introduced across the UK beginning in December 2012; to retain their license, doctors will be required to collect, over a five-year cycle, a range of “supporting information”3 to demonstrate that they continue to meet the principles and values of good medical practice as set out in the GMC’s guidance.4 Supporting information includes evidence of doctors’ continuing professional development and quality improvement activity, information about significant events (untoward or critical incidents), structured feedback from colleagues and patients, and a review of any complaints and compliments received. Doctors will be expected to reflect on the supporting information and discuss it as part of their appraisal process.3,5 Ultimately, the appointed responsible officer within the doctor’s organization (a senior licensed doctor with more than five years of practice) will make a recommendation to the GMC about the doctor’s suitability for revalidation. The supporting information itself will not, however, be submitted to the GMC for consideration.
An important component of this process is the collection of feedback from patients and colleagues, using structured questionnaires.3,6 Such feedback is intended to be formative in nature. Doctors are expected to collect feedback at least once every five years, reflect on the feedback they obtain, and use it to inform their further professional development, where appropriate.3
Over recent years, multisource feedback (MSF) questionnaires have emerged as a method of evaluating the performance of practicing doctors and postgraduate trainees. Such questionnaires must be carefully developed to ensure that they are reliable, valid, and acceptable to respondents.7 A number of survey instruments have been developed to support this agenda,8–19 and their theoretical basis and psychometric properties have been critically evaluated.11,12,15 MSF is not without its problems, however. A number of studies have shown that responses tend to be skewed toward positive assessments of doctor performance.9,10,13–18,20–23 Others have questioned the ability of MSF (particularly patient feedback) to identify poorly performing doctors.24
When patients evaluate individual doctors or wider health care systems, the characteristics of the patient sample (e.g., age, ethnicity) and the questionnaire administration methods (e.g., postal, telephone or “exit” surveys; use of proxy respondents) can influence ratings.25–41 When colleagues assess individual doctors, the rater’s professional group, the length and context of the rater’s working relationship with the doctor, and the rater’s familiarity with the doctor’s practice can influence responses.9,13,16,41 However, for many questionnaires, the potential biases arising from sampling have not yet been fully investigated.
In this study, we use the example of the GMC Patient Questionnaire (PQ) and Colleague Questionnaire (CQ)42 to stimulate discussion about the role of MSF in medical regulation. A recent international review15 highlighted the PQ and CQ as mapping well onto the principles of “Good Medical Practice”4 and having undergone comprehensive development and psychometric testing.10,35,43–45 However, their test–retest reliability and convergent validity have not yet been established, nor has the potential for bias in PQ and CQ ratings been assessed. The questionnaires recently underwent further development, and the purpose of this study is to perform a critical review of their suitability for use in the evaluation of doctors’ professional performance.
Method
Doctor recruitment and data collection
We invited practicing doctors from 10 National Health Service organizations and 1 private-sector organization in England and Wales to participate between April 2008 and January 2011. The 11 organizations represented a range of health care settings (acute hospital, mental health, family practice). All practicing (nontraining) doctors working for the organization were eligible to participate. Initially, doctors received an internal communication from their medical director or chief executive informing them about the study and encouraging their participation. At this stage, doctors could decline permission for their contact details to be passed to the survey organization (Client Focused Evaluation Programme [CFEP]-UK). Organizations provided CFEP-UK with contact details and demographic information (age, gender, and specialty) for all doctors on the final mailing list. CFEP-UK allocated a unique identifying number to each doctor to enable his or her survey data to be collated and matched anonymously to his or her demographic information. Doctors received a detailed study information pack via e-mail or letter from CFEP-UK. Those who wished to participate returned a reply slip; up to two reminders were sent to nonresponders.
Participating doctors used the PQ as a postconsultation (“exit”) survey of 45 consecutive patients and nominated up to 20 colleagues (10 medical; 10 nonmedical) to provide feedback using the CQ. Most doctors arranged for clinic staff to distribute their PQs as patients arrived for their appointments; however, in some settings, the doctor distributed PQs at the end of the consultation. Patients were instructed to complete the PQ after their consultation with the doctor and, if possible, before leaving the clinic. If a patient was unable to complete the questionnaire, a carer (proxy) could do so. Patients deposited completed PQs in a box in the clinic, or mailed them directly to CFEP-UK. Nominated colleagues could complete a Web-based CQ or request a paper version. Once the patient and colleague surveys were completed, doctors received a personalized report from CFEP-UK, which summarized their patient and colleague feedback.
Questionnaire content and scoring
The questionnaires were designed for use within the UK appraisal and revalidation process. Their content was originally developed by the GMC,43,44 based on the principles of “Good Medical Practice.”4 Early versions of the questionnaires were released to the research team for an initial phase of piloting in 2005–2006,10 and a small number of revisions were made to their content prior to the current study.
Each questionnaire takes 5 to 10 minutes to complete and includes core items (rated on a five-point scale) relating to the doctor’s performance, global assessment items (rated on a binary scale), a free-text comment box, and items that collect contextual/demographic information about the respondent (Chart 1).
For every patient, we obtained scores for each of nine core items (PQ4a–g, 5a–b)42 and computed an overall mean score across these items where five or more valid responses (range 1–5, excluding “Does not apply”) were given. For every colleague, we obtained scores for each of 18 core items (CQ1-18),42 and we calculated an overall mean score (range 1–5) across these items where 10 or more valid responses (excluding “Don’t know” responses) were given.
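The scoring rule described above can be sketched in a few lines. This is an illustrative sketch only, not the scoring code used in the study: the function name `overall_mean_score` and the use of `None` to encode "Does not apply," "Don't know," or missing responses are assumptions made for the example.

```python
from statistics import mean
from typing import Optional, Sequence


def overall_mean_score(responses: Sequence[Optional[int]],
                       min_valid: int) -> Optional[float]:
    """Mean of the valid 1-5 ratings, or None if too few are present.

    None is an assumed encoding for "Does not apply" / "Don't know" /
    missing responses; it is not taken from the questionnaires.
    """
    valid = [r for r in responses if r is not None and 1 <= r <= 5]
    if len(valid) < min_valid:
        return None  # too few valid items to compute an overall score
    return mean(valid)


# PQ: nine core items, at least five valid responses required.
pq_example = [5, 5, 4, None, 5, 4, None, 5, 5]
pq_score = overall_mean_score(pq_example, min_valid=5)

# CQ: 18 core items, at least 10 valid responses required.
cq_example = [4] * 9 + [None] * 9
cq_score = overall_mean_score(cq_example, min_valid=10)

print(pq_score, cq_score)
```

In this example the patient questionnaire has seven valid responses and receives an overall score, whereas the colleague questionnaire has only nine valid responses and is excluded.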
Test–retest reliability substudies
To explore the temporal stability of responses on the questionnaires, we conducted separate PQ and CQ substudies, in which respondents were invited to rate the doctor’s performance on two occasions, separated by approximately two weeks.46 We sought to obtain sufficient test and retest questionnaires, in line with previously published studies.47–49
For the PQ substudy, patients of 11 doctors received and submitted PQs via post for both rounds of questionnaires (“test” and “retest”). Doctors collected test–retest data after they had completed their patient exit survey for the main study. Each doctor generated a list of patients who had consulted him or her in the previous seven days. Because postal surveys produce lower response rates than exit surveys,35 we approached up to 80 of the most-recently consulting patients (or their carers).
For the CQ substudy, colleagues of 25 doctors completed CQs online for both rounds of questionnaires. Test–retest data collection took place in parallel with the doctors’ colleague surveys for the main study. Each doctor nominated up to 20 colleagues (10 medical; 10 nonmedical) who could be approached to provide feedback in the first round (for the main study as well as for the substudy “test” phase) and in the second round (for the substudy “retest” phase).
Convergent validity substudy
To investigate the association between scores on the GMC questionnaires and other measures that are known to assess similar aspects of a doctor’s performance, we invited the patients and colleagues of a subsample of doctors to complete an extended questionnaire. Following an approach to all doctors in three geographical areas who were willing to contribute to our main study, 136 doctors agreed to use the extended questionnaires.
For patients, the extended questionnaire included the PQ, the six-item Patient Enablement Instrument (PEI),50 and the 12-item Doctors’ Interpersonal Skills Questionnaire (DISQ).51,52 Like the PQ, the DISQ evaluates the doctor’s interpersonal skills during a specific consultation. The PEI measures the patient’s sense of empowerment and ability to understand and cope with his or her health condition after the consultation. We calculated overall mean scores on the PEI (ranging from 1 to 3) and DISQ (ranging from 1 to 5) where valid responses on >50% of items were available, and we correlated these with the overall PQ scores.
The extended colleague questionnaire comprised the CQ and the 18-item Colleague Feedback Evaluation Tool (CFET),15,22,53 which evaluates a range of doctor skills and qualities, similar to those assessed by the CQ. We calculated overall mean scores on the CFET (ranging from 1 to 5) where valid responses on >50% of items were available, and we correlated these with the overall CQ scores.
Sample size considerations
Our target doctor sample sizes for the main study and for the test–retest and convergent validity substudies were not based on a formal sample size calculation; rather, our estimates drew on existing scientific literature, pragmatic considerations, and our previous pilot work experience.10 We aimed to obtain an appropriately large sample that would generate sufficient data to permit an assessment of the performance of the questionnaires in line with recognized best practice.46
Data management and analysis
CFEP-UK undertook initial data processing, including transferring online colleague feedback directly to a database designed for this study and scanning paper questionnaires (PQs or CQs) into the database. At the end of the study, CFEP-UK provided the research team with anonymized data, which included coded responses to all questionnaire items and anonymized free-text comments. CFEP-UK also provided anonymized demographic information for participating and nonparticipating doctors. The chair of the Devon and Torbay National Health Service research ethics committee reviewed the project and concluded that a full ethics committee opinion was not required.
Unless otherwise stated, we used PASW Statistics software version 18 (IBM SPSS Inc., Armonk, New York) and set an alpha level of P < .05.
We characterized participating doctors, patients, and colleagues and compared the age, gender, and clinical specialty of participating and nonparticipating doctors.
Data analysis focused on the core PQ and CQ items and was conducted at the respondent level. Although free-text comments were collated, anonymized, and reported back to participating doctors along with their MSF scores, we did not conduct a qualitative analysis of the comments.
Psychometric properties of the PQ and CQ. We examined core item response distributions and calculated completion rates, proportions of “missing/spoilt” responses, and mean scores and standard deviations of patient and colleague ratings.
We examined mean core item scores, interitem correlations, and item–total correlations for the PQ and CQ. We calculated Cronbach’s alpha and accepted a value of ≥0.70 as evidence of adequate internal consistency.46 To explore test–retest reliability, we calculated intraclass correlation coefficients (ICCs) with 95% confidence intervals (CIs) for the patient and colleague overall scores and for each core PQ or CQ item (two-way random, single measures). We regarded coefficients of ≥0.6 as acceptable.54 For global assessment items, we examined the distribution of responses at both time points.
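The two reliability statistics named above can be illustrated with a short, dependency-free sketch. It is not the PASW/SPSS procedure used in the study. Two assumptions are made for the example: only complete cases are analyzed, and the ICC is the absolute-agreement form of the two-way random, single-measures coefficient (often written ICC(2,1)); the text does not state whether an absolute-agreement or consistency definition was used.

```python
from statistics import mean, variance


def cronbach_alpha(rows):
    """Cronbach's alpha. rows: respondents, each a list of k item scores."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def icc_2_1(rows):
    """Two-way random, single-measures ICC (absolute agreement, assumed).

    rows: subjects, each a list of k repeated scores (e.g., test, retest).
    """
    n, k = len(rows), len(rows[0])
    grand = mean([x for row in rows for x in row])
    row_means = [mean(row) for row in rows]
    col_means = [mean([row[j] for row in rows]) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in rows for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)                # between-subject mean square
    ms_cols = ss_cols / (k - 1)                # between-occasion mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Toy data: retest scores are uniformly one point higher than test scores.
shifted = [[1, 2], [2, 3], [3, 4]]
print(cronbach_alpha(shifted), icc_2_1(shifted))
```

With the toy data, the two columns are perfectly correlated, so alpha is 1, but the absolute-agreement ICC is lower because it penalizes the systematic one-point shift between occasions.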
Using generalizability theory,55–58 for the PQ and CQ, we assessed the overall reliability of doctors’ scores averaged across core items and across respondents. We used G_String III software (R. Bloch and G. Norman, Hamilton, Ontario, Canada) to calculate variance components, the generalizability (G) coefficient, and the standard error of measurement (SEM) for a random effects (R:D)*I design (raters nested within doctors, crossed with items).
A threshold of G = 0.80 is recommended for “high-stakes” judgments,57 but a less stringent threshold of G = 0.70 has been accepted in real-world settings involving untrained assessors.18,59 Because the GMC questionnaires were explicitly developed for formative purposes, and the study was conducted in a highly naturalistic setting with untrained raters, we accepted the lower threshold of G = 0.70 for our analysis. Decision (D) studies examined how varying the numbers of items or raters per doctor would affect the G coefficient for each questionnaire.
We explored construct validity via principal components analysis (Varimax rotation). To explore convergent validity, we hypothesized that PQ scores would show stronger correlation (Spearman rho) with DISQ scores (assessing similar attributes) than with PEI scores (assessing a related but conceptually different construct). For the CQ, we hypothesized that we would find a strong correlation between the overall CQ and CFET scores because they assess similar attributes. We regarded correlation coefficients between 0.4 and 0.8 as demonstrating acceptable convergent validity.46
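As an illustration of the convergent validity check, the sketch below computes Spearman's rho as the Pearson correlation of average ranks and tests it against the 0.4 to 0.8 band described above. The helper names and the toy scores are hypothetical, and the sketch is not the statistical software routine used in the study.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def in_acceptable_band(rho, low=0.4, high=0.8):
    """True if rho lies in the band regarded as acceptable in the text."""
    return low <= rho <= high


# Hypothetical overall scores from two instruments for four respondents.
rho = spearman_rho([1, 2, 2, 3], [1, 3, 2, 4])
print(rho, in_acceptable_band(rho))
```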
Effect of respondent characteristics on core item responses. We used regression analysis to investigate (1) the effect of seven patient characteristics (gender, age, ethnic group, respondent type, consultation with “usual” doctor, importance of visit, and questionnaire return method) on core PQ item responses, and (2) the effect of seven colleague characteristics (gender, age, ethnic group, professional group, recency of familiarity with doctor’s practice, frequency of contact with doctor, and questionnaire return method) on core CQ item responses.
Initial nonparametric tests (Mann–Whitney; Kruskal–Wallis) revealed that all predictor variables affected responses on at least some core items (data not presented). Using Stata SE software Version 10 (Stata Corporation, College Station, Texas), we entered the predictor variables into multilevel logistic regression models, with a random effect to account for the clustering of questionnaires by doctor. For each PQ/CQ item, we dichotomized responses: scores of 1 to 3 (“satisfactory” or poorer) versus scores of 4 or 5 (“good” or better). In view of the large sample size and the testing of multiple items, we set an alpha level of P < .001 to interpret these data.
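The dichotomization step can be sketched as follows. The function name and the use of `None` for missing or nonapplicable responses are assumptions made for the example; the multilevel logistic regression itself was fitted in Stata and is not reproduced here.

```python
from typing import Optional


def dichotomize(score: Optional[int]) -> Optional[int]:
    """Return 1 for a score of 4 or 5 ("good" or better), 0 for 1 to 3.

    None (an assumed encoding for missing or nonapplicable responses)
    and out-of-range values are passed through as None.
    """
    if score is None or not 1 <= score <= 5:
        return None
    return 1 if score >= 4 else 0


ratings = [5, 4, 3, None, 1, 5]
coded = [dichotomize(r) for r in ratings]
print(coded)
```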
Results
Across all organizations, 2,454 doctors were eligible to participate. Of these, 1,067 doctors (43%) agreed to participate; 541 (22%) explicitly declined, and 846 (34%) did not respond after two reminders. Participation rates (30%–66%) varied across clinical specialties (chi-square = 48.4, P < .001). Participating and nonparticipating doctors were similar in age (t = 1.63, P = .10) and gender (chi-square = 0.16, P = .69).
Two doctors failed to return any patient or colleague data, leaving an effective sample of 1,065 doctors. Of these, 1,057 (99%) returned some colleague data. Because of insufficient patient contact, 74 doctors did not attempt a patient survey, leaving an effective sample of 991 doctors, of whom 922 (93%) returned some patient data.
We obtained responses from 30,333 patients (median: 36 PQs per doctor; lower quartile [LQ]: 28; upper quartile [UQ]: 41) and 17,012 colleagues (median: 17 CQs per doctor; LQ: 15; UQ: 18). Most CQs were completed online (14,351/17,012; 84%). Table 1 summarizes the characteristics of patient and colleague samples.
Psychometric properties of the PQ and CQ
The questionnaires appeared acceptable to respondents, with only 365/30,333 (1%) patients and 829/17,012 (5%) colleagues completing <50% of the core items. Patient and colleague ratings were skewed toward favorable impressions of doctor performance (see Tables 2 and 3). On the PQ, mean item scores ranged from 4.69 to 4.89, and 98% endorsed the global assessment statements. On the CQ, mean item scores ranged from 4.41 to 4.86, and 97% of colleagues endorsed the global assessment statement.
Missing data on both questionnaires were minimal. Use of the “Does not apply” response option on the PQ and the “Don’t know” response option on the CQ varied across items. Colleagues from all professional groups (particularly administrative/managerial) selected the “Don’t know” option for aspects of performance they were unlikely to observe or have sufficient expertise to assess (data not presented).
The internal consistency was high, with Cronbach’s alphas of 0.865 (PQ) and 0.938 (CQ). In the test–retest substudies, 263/720 (36%) patients completed a PQ at both time points (median test–retest interval: 14 days; range 7–30 days), and 184/490 (37%) colleagues completed a CQ at both time points (median interval: 16 days; range 5–97 days). The ICCs for the patient and colleague overall scores were 0.834 (95% CI: 0.792–0.868) and 0.851 (95% CI: 0.803–0.888), respectively. ICCs were lower for items assessing aspects of the doctor’s integrity (PQ5a–5b: range 0.582–0.629; CQ16–18: range 0.450–0.596) than they were for the items that assessed the doctor’s skills (PQ4a–4g: range 0.627–0.732; CQ1–15: range 0.602–0.768). On the global assessment items, the majority of patients (257/258) and colleagues (181/184) selected the same response at both time points.
The G studies estimated the proportion of variance in core item scores due to doctors (PQ: 4%; CQ: 6%), raters (PQ: 42%; CQ: 32%), items (PQ: 1%; CQ: 4%), the doctor-by-item interaction (PQ: 1%; CQ: 5%), and unattributed error (PQ: 53%; CQ: 53%). The harmonic (arithmetic) mean number of raters per doctor was 21.3 (32.9) for the PQ and 15.5 (16.2) for the CQ, with G coefficients of 0.603 (SEM = 0.079) and 0.716 (SEM = 0.086), respectively, for the mean of 21.3 and 15.5 ratings (i.e., 60% of the variance in doctors’ mean scores on the PQ and 72% on the CQ was due to differences in doctor performance).
The D studies showed that ≥34 PQs (9 items) and ≥15 CQs (18 items) are required to achieve G coefficients >0.70. Changing the number of items had minimal effect compared with changing the number of raters.
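The logic of a D study can be sketched with the standard relative G coefficient for a design in which raters are nested within doctors and crossed with items. The sketch is illustrative and is not the G_String analysis. It takes as input the rounded variance percentages reported above (an approximation, because the unrounded components are not given), so it should not be expected to reproduce the reported coefficients or rater thresholds exactly.

```python
def g_coefficient(var_doctor, var_rater, var_doctor_item, var_residual,
                  n_raters, n_items):
    """Relative G coefficient for an (R:D) x I design.

    The item main effect is omitted because it does not contribute to
    relative error when every doctor is rated on the same items.
    """
    relative_error = (var_rater / n_raters
                      + var_doctor_item / n_items
                      + var_residual / (n_raters * n_items))
    return var_doctor / (var_doctor + relative_error)


def min_raters(components, n_items, threshold=0.70, max_raters=200):
    """Smallest number of raters for which G exceeds the threshold."""
    for n in range(1, max_raters + 1):
        if g_coefficient(*components, n, n_items) > threshold:
            return n
    return None


# Rounded percentages from the text: doctor, rater, doctor-by-item, residual.
PQ = (4, 42, 1, 53)
CQ = (6, 32, 5, 53)

g_pq = g_coefficient(*PQ, n_raters=21.3, n_items=9)
g_cq = g_coefficient(*CQ, n_raters=15.5, n_items=18)

# Compare doubling the raters with doubling the items on the PQ.
g_more_raters = g_coefficient(*PQ, n_raters=42.6, n_items=9)
g_more_items = g_coefficient(*PQ, n_raters=21.3, n_items=18)

print(round(g_pq, 3), round(g_cq, 3))
print(min_raters(PQ, 9), min_raters(CQ, 18))
print(round(g_more_raters, 3), round(g_more_items, 3))
```

With these rounded inputs, the coefficients and rater thresholds come out close to, but not identical with, the reported values. The comparison in the last line shows the same pattern as the D studies: adding raters raises G far more than adding items.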
Principal components analysis provided evidence of construct validity. For both questionnaires, we identified two components with eigenvalues > 1. For the PQ, items Q4a–4g loaded (0.813–0.871) onto the first component (“interpersonal, clinical, and organizational skills”: eigenvalue 5.115; 57% variance), and items Q5a–5b loaded (0.885 and 0.865) onto the second component (“integrity”: eigenvalue 1.938; 22% variance). For the CQ, items 1–15 loaded (0.584–0.800) onto the first component (“interpersonal, clinical, and organizational skills”: eigenvalue 9.018; 42% variance), and items 16–18 loaded (0.773–0.851) onto the second component (“integrity and health”: eigenvalue 1.536; 16% variance).
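As a brief arithmetic illustration, and assuming the components were extracted from a correlation matrix of standardized items (which the text does not state), the total variance equals the number of items, so the share of variance for an unrotated component is its eigenvalue divided by the number of items. The PQ figures are consistent with this ratio. Percentages reported after rotation need not follow it, so the ratio should not be expected to reproduce every percentage given above.

```python
def proportion_of_variance(eigenvalue, n_items):
    """Share of total variance for one component of a correlation-matrix PCA."""
    return eigenvalue / n_items


# PQ: nine core items, eigenvalues as reported in the text.
pq_first = proportion_of_variance(5.115, 9)
pq_second = proportion_of_variance(1.938, 9)

print(round(pq_first * 100), round(pq_second * 100))
```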
In the convergent validity substudy, 3,571 patients (of 113 doctors) and 2,004 colleagues (of 133 doctors) returned extended questionnaires. As hypothesized, the correlation between PQ and DISQ scores was stronger (rho = 0.629, P < .001) than that between PQ and PEI scores (rho = 0.314, P < .001). The CFET and CQ scores were strongly correlated (rho = 0.808, P < .001), suggesting acceptable convergent validity.46
Effect of respondent characteristics on core item scores
A complex pattern of patient and colleague predictors of response emerged across the core items (see Supplemental Digital Tables 1 and 2, http://links.lww.com/ACADMED/A109). Table 4 summarizes the variables that predicted responses, the direction of these effects, and the items for which the variable was an independent predictor of responses.
Five variables predicted patient responses. Patients who identified their visit as “very important,” and those who reported seeing their “usual doctor,” were more likely than other patients to give favorable ratings on all nine items. Exit survey respondents (six items), older patients (two items), and white patients (two items) were also more likely to provide favorable ratings than their counterparts. Neither gender nor respondent type (patient versus proxy) independently predicted responses for any PQ items.
Three variables predicted colleague responses. Managers, administrative staff, and nonmedical health professionals gave more favorable ratings (nine items) than medical peers. Colleagues reporting more frequent contact with the doctor gave more favorable ratings (five items) than those reporting less frequent contact. Questionnaire return method predicted response on only one item, whereas colleague age, gender, ethnicity, and recency of contact were not independent predictors of responses for any CQ items.
Discussion
Globally, there is increasing interest in evaluating doctors’ professional behavior and practice.59–63 In this study, we evaluated the psychometric properties of the GMC PQ and CQ. We also documented biases in ratings that may occur when such instruments are used to obtain feedback on the professional performance of doctors from untrained patient and colleague assessors.
Our findings suggest that the GMC questionnaires were acceptable to our respondents and meet recognized standards of reliability and validity for formative workplace assessments.18,46,54,59,64 We estimated that at least 34 PQs and 15 CQs per doctor are required to achieve acceptable reliability (G > 0.70). Ratings on the core items were highly skewed toward positive assessments of doctor performance and were influenced by a range of respondent characteristics.
Strengths and weaknesses of the study
A large sample of UK doctors from a range of clinical settings contributed data to the study. The questionnaires underwent comprehensive psychometric testing using both classical test theory and generalizability theory approaches, and we addressed some of the gaps in the existing evidence on their performance.
Although participating doctors were representative of their peers in terms of age and gender, the overall participation rate was 43%, and uptake varied across clinical specialties. Our participation rate is notably higher than that in previous pilot work10 (17%); however, because MSF was not mandatory at the time, our analysis is not based on a census sample of doctors. Thus, the true range of professional performance may not be represented in our data.
To reduce the potential for selection bias, we instructed participating doctors to obtain feedback from consecutively consulting patients or their carers. Because we were unable to monitor compliance with this sampling procedure, we cannot exclude the possibility that some doctors may have selected patients or carers who they believed would provide more positive feedback. For the colleague survey, doctors were asked to nominate 10 medical and 10 nonmedical colleagues who were sufficiently familiar with their practice to provide feedback. There is limited and conflicting evidence as to whether feedback obtained from nominated colleagues is more positive than that obtained from colleagues selected by a third party.17,20,24
The high alpha coefficients observed for both questionnaires suggest some item redundancy, raising the possibility that similar information might be obtained using fewer items.
We did not attempt to validate doctors’ PQ/CQ item scores against directly observed examples of their practice, skills, or knowledge, and the predictive validity of the questionnaires remains unknown.
Because of the volume of feedback collected, we did not attempt an in-depth analysis of respondents’ free-text comments. A previous content analysis of free-text comments on the CQ45 has raised questions about the value of routine formal analysis of such comments for the purposes of revalidation. However, a recent qualitative study65 indicates that some doctors favor the inclusion of free-text comments in their feedback report and use this information to contextualize their numerical scores and identify ways in which they could improve their practice.
The costs, in terms of both time and resources, of collecting patient and colleague feedback via questionnaires have not been investigated in this study, but, given the number of responses required to obtain reliable results, these need to be quantified and balanced against judgments about the utility of the information obtained.
Our findings have implications for the use of patient and colleague questionnaires to assess doctors’ professional performance, and the interpretation of feedback obtained via such instruments.
The G coefficient threshold used in this study meets that required for “real-world” assessments18,59 but is lower than that suggested for summative assessments of a doctor’s professional performance.64 The questionnaires should not be used as stand-alone tools for making judgments about a doctor’s fitness to continue to practice medicine. In the United Kingdom, patient and colleague feedback is one element of a portfolio of evidence to be collected by doctors for discussion within the appraisal process.3,5 In this context, MSF has a potentially useful formative purpose, in helping to identify areas of strength and weakness in a doctor’s performance.
Although improved generalizability of the GMC questionnaires would be desirable, we believe the resulting data provide sufficiently robust feedback for doctors to reflect on their performance. Future work may inform the development of “better” questionnaires, but the current PQ and CQ do offer a useful initial platform for supporting revalidation in the United Kingdom.
The required minimum number of PQs (≥34) is larger than for some other tools of similar intent15; for the CQ (≥15), it is comparable to some tools but larger than others.14–18 Achieving these targets should not be problematic in clinical settings where doctors routinely see large numbers of patients and work in large teams. However, some doctors may be disadvantaged—for example, those practicing in settings where patients are unable to respond because of the nature of their illness or treatment (e.g., emergency medicine, anaesthetics), those specializing in the intensive management of relatively small patient populations (e.g., forensic psychiatry), and those working in smaller teams or as locums.
Consistent with other MSF tools,9,10,13–18,20–23 patient and colleague ratings were highly skewed toward favorable assessments of doctor performance. On all core items, a small proportion of patients and colleagues responded at or below the midpoint of the scale, suggesting that the questionnaires are capable of capturing a range of views about a doctor’s performance. However, given the predominance of very high ratings, the modest reliability of the questionnaires, and the volunteer nature of the sample of participants, we suggest that caution is required in interpreting and responding to doctors’ scores. Future research aiming to reduce the skewness of data that result from MSF might investigate the use of different scale descriptors or attempt to provide raters with more detailed information on the purpose and use of the questionnaires.
Respondent characteristics and the survey process may affect core item ratings. For the PQ, favorable assessments were more likely from respondents who rated the reason for visiting the doctor as “very important,” who were consulting their “usual” doctor, who were from white ethnic backgrounds, or who were over 40 years of age. Questionnaires returned to clinic boxes contained more positive ratings than did postal returns. However, we found no evidence that patient gender or the use of proxy respondents influenced PQ ratings.
In the colleague survey, favorable assessments were more likely from nonmedical professional groups and colleagues who had more frequent contact with the doctor. Clear guidance is required to ensure that doctors nominate a balanced mix of colleagues. Doctors who work in smaller teams may be doubly disadvantaged if they are unable to nominate sufficient nonmedical colleagues and, to achieve the minimum sample size, have to seek feedback from colleagues with whom they have less frequent contact.
Overall, our findings confirm the GMC’s view3,6 that these patient and colleague surveys should be viewed as essentially formative assessments, until further data based on census sampling become available. When interpreting MSF, the characteristics of the individuals who have provided the feedback need to be considered. To the best of our knowledge, the potential for sampling bias has been explored for only a relatively small number of MSF questionnaires.9,13,16,32 Given the growing interest in this form of workplace assessment, a better understanding is required of precisely how patient and colleague ratings on these instruments might be affected by respondent characteristics and the context in which feedback is provided. Characteristics of the doctor, as well as the survey respondents, may also affect aggregated scores derived from questionnaires, and we explore this theme in more detail elsewhere.41
Finally, clear guidance is necessary to support doctors and appraisers wishing to disentangle the effects of sampling and other biases from true strengths and weaknesses in the doctor’s professional practice that could form a legitimate focus for continuing professional development.
Acknowledgments: The authors wish to thank all of the doctors, patients, and colleagues who contributed to this study, as well as the senior management teams at hosting organizations who supported the work. Professor Ajit Narayanan (Auckland University of Technology, New Zealand) and Dr. Gominda Ponnaperuma (University of Colombo, Sri Lanka) provided feedback and statistical advice regarding data interpretation and analysis. Ms. Louise Coleman (Client-Focused Evaluation Programme–UK) provided assistance with the recruitment and data collection aspects of the main survey and the substudies.
Funding/Support: The study was funded by the UK General Medical Council (GMC).
Other disclosures: Dr. Campbell is an advisor to the GMC and has received only direct costs associated with presentation of this work. Dr. Greco is a director of Client-Focused Evaluation Programme (UK) Surveys (CFEP-UK), which provided survey administration for this study. Mr. Taylor was an employee of CFEP-UK at the time the project was undertaken.
Ethical approval: The chair of the Devon and Torbay National Health Service research ethics committee reviewed the project and concluded that a full ethics committee opinion was not required.
Previous presentations: Oral presentation (Colleague Questionnaire data only) at the International Revalidation Symposium, Leeds Castle, Kent, United Kingdom (December 2010); closed presentation to Council of the UK General Medical Council in London, United Kingdom (February 2011); oral presentation at the Society for Academic Primary Care Annual Meeting in Bristol, United Kingdom (July 2011); oral presentation at the North American Primary Care Research Group Annual Meeting in Banff, Canada (November 2011).
Supplemental digital content for this article is available at http://links.lww.com/ACADMED/A109.
1. Allsop J, Jones K. Quality Assurance in Medical Regulation in an International Context: Final Report for the Chief Medical Officer. 2005 Lincoln, UK University of Lincoln
2. de Vries H, Sanderson P, Janta B, et al. International comparison of ten medical regulatory systems: Egypt, Germany, Greece, India, Italy, Nigeria, Pakistan, Poland, South Africa and Spain. 2009 Santa Monica, Calif RAND Corporation
7. Schuwirth LW, Southgate L, Page GG, et al. When enough is enough: A conceptual basis for fair and defensible practice performance assessment. Med Educ. 2002;36:925–930
8. Archer J, Norcini J, Southgate L, Heard S, Davies H. Mini-PAT (peer assessment tool): A valid component of a national assessment programme in the UK? Adv Health Sci Educ. 2006;13:181–192
9. Archer JC, Norcini J, Davies HA. Use of SPRAT for peer review of paediatricians in training. BMJ. 2005;330:1251–1253
10. Campbell JL, Richards SH, Dickens A, Greco M, Narayanan A, Brearley S. Assessing the professional performance of UK doctors: An evaluation of the utility of the General Medical Council patient and colleague questionnaires. Qual Saf Health Care. 2008;17:187–193
11. Evans R, Elwyn G, Edwards A. Review of instruments for peer assessment of physicians. BMJ. 2004;328:1240
12. Evans RG, Edwards A, Evans S, Elwyn B, Elwyn G. Assessing the practising physician using patient surveys: A systematic review of instruments and feedback methods. Fam Pract. 2007;24:117–127
13. Hall W, Violato C, Lewkonia R, et al. Assessment of physician performance in Alberta: The physician achievement review. CMAJ. 1999;161:52–57
14. Lelliott P, Williams R, Mears A, et al. Questionnaires for 360-degree assessment of consultant psychiatrists: Development and psychometric properties. Br J Psychiatry. 2008;193:156–160
16. Mackillop LH, Crossley J, Vivekananda-Schmidt P, Wade W, Armitage M. A single generic multi-source feedback tool for revalidation of all UK career-grade doctors: Does one size fit all? Med Teach. 2011;33:e75–e83
17. Ramsey PG, Wenrich MD, Carline JD, Inui TS, Larson EB, LoGerfo JP. Use of peer ratings to evaluate physician performance. JAMA. 1993;269:1655–1660
18. Violato C, Lockyer J, Fidler H. Multisource feedback: A method of assessing surgical practice. BMJ. 2003;326:546–548
19. Whitehouse A, Hassell A, Wood L, Wall D, Walzman M, Campbell I. Development and reliability testing of TAB a form for 360 degrees assessment of senior house officers’ professional behaviour, as specified by the General Medical Council. Med Teach. 2005;27:252–258
20. Archer J, McGraw M, Davies H. Assuring validity of multisource feedback in a national programme. Arch Dis Child. 2010;95:330–335
21. Baker R, Smith A, Tarrant C, McKinley RK, Taub N. Patient feedback in revalidation: An exploratory study using the consultation satisfaction questionnaire. Br J Gen Pract. 2011;61:e638–e644
22. Campbell J, Narayanan A, Burford B, Greco M. Validation of a multi-source feedback tool for use in general practice. Educ Prim Care. 2010;21:165–179
23. Elwyn G, Lewis M, Evans R, Hutchings H. Using a ‘peer assessment questionnaire’ in primary medical care. Br J Gen Pract. 2005;55:690–695
24. Archer JC, McAvoy P. Factors that might undermine the validity of patient and multi-source feedback. Med Educ. 2011;45:886–893
25. de Vries H, Elliott MN, Hepner KA, Keller SD, Hays RD. Equivalence of mail and telephone responses to the CAHPS hospital survey. Health Serv Res. 2005;40(6 pt 2):2120–2139
26. Duberstein P, Meldrum S, Fiscella K, Shields CG, Epstein RM. Influences on patients’ ratings of physicians: Physicians demographics and personality. Patient Educ Couns. 2007;65:270–274
27. Elliott MN, Zaslavsky AM, Goldstein E, et al. Effects of survey mode, patient mix, and nonresponse on CAHPS hospital survey scores. Health Serv Res. 2009;44(2 pt 1):501–518
28. Elliott MN, Beckett MK, Chong K, Hambarsoomians K, Hays RD. How do proxy responses and proxy-assisted responses differ from what Medicare beneficiaries might have reported about their health care? Health Serv Res. 2008;43:833–848
29. Goldstein E, Elliott MN, Lehrman WG, Hambarsoomian K, Giordano LA. Racial/ethnic differences in patients’ perceptions of inpatient care using the HCAHPS survey. Med Care Res Rev. 2010;67:74–92
30. Harmsen JA, Bernsen RM, Bruijnzeels MA, Meeuwesen L. Patients’ evaluation of quality of care in general practice: What are the cultural and linguistic barriers? Patient Educ Couns. 2008;72:155–162
31. Haviland MG, Morales LS, Dial TH, Pincus HA. Race/ethnicity, socioeconomic status, and satisfaction with health care. Am J Med Qual. 2005;20:195–203
32. Lipner RS, Blank LL, Leas BF, Fortna GS. The value of patient and peer ratings in recertification. Acad Med. 2002;77(10 suppl):S64–S66
33. Mead N, Roland M. Understanding why some ethnic minority patients evaluate medical care more negatively than white patients: A cross sectional analysis of a routine patient survey in English general practices. BMJ. 2009;339:b3450
34. O’Malley AJ, Zaslavsky AM, Elliott MN, Zaborski L, Cleary PD. Case-mix adjustment of the CAHPS hospital survey. Health Serv Res. 2005;40(6 pt 2):2162–2181
35. Richards SH, Campbell JL, Dickens A. Does the method of administration influence the UK GMC patient questionnaire ratings? Prim Health Care Res Dev. 2011;12:68–78
36. Rubio RN, Pearson HC, Clark AA, Breitkopf CR. Satisfaction with care among low-income female outpatients. Psychol Health Med. 2007;12:334–345
37. Taira DA, Safran DG, Seto TB, et al. Asian-American patient ratings of physician primary care performance. J Gen Intern Med. 1997;12:237–242
38. Weech-Maldonado R, Morales LS, Spritzer K, Elliott M, Hays RD. Racial and ethnic differences in parents’ assessments of pediatric care in Medicaid managed care. Health Serv Res. 2001;36:575–594
39. Weech-Maldonado R, Morales LS, Elliott M, Spritzer K, Marshall G, Hays RD. Race/ethnicity, language, and patients’ assessments of care in Medicaid managed care. Health Serv Res. 2003;38:789–808
40. Woods SE, Bivins R, Oteng K, Engel A. The influence of ethnicity on patient satisfaction. Ethn Health. 2005;10:235–242
41. Campbell JL, Roberts M, Wright C, et al. Factors associated with variability in the assessment of UK doctors’ professionalism: Analysis of survey results. BMJ. 2011;343:d6212
43. Kilminster S, Pell G, Roberts T. Patient and Colleague Questionnaires: Validation Report to the GMC. 2005 Leeds, UK University of Leeds, Medical Education Unit
44. MORI Social Research Institute. Revalidation Questionnaires Testing: Qualitative Research Findings. 2004 London, UK General Medical Council
45. Richards SH, Campbell JL, Walshaw E, Dickens A, Greco M. A multi-method analysis of free-text comments from the UK General Medical Council colleague questionnaires. Med Educ. 2009;43:757–766
46. Streiner DL, Norman G. Health Measurement Scales: A Practical Guide to Their Development and Use. 3rd ed. 2003 Oxford, UK Oxford University Press
47. Brazier JE, Harper R, Jones NM, et al. Validating the SF-36 health survey questionnaire: New outcome measure for primary care. BMJ. 1992;305:160–164
48. Dorman P, Slattery J, Farrell B, Dennis M, Sandercock P. Qualitative comparison of the reliability of health status assessments with the EuroQol and SF-36 questionnaires after stroke. United Kingdom Collaborators in the International Stroke Trial. Stroke. 1998;29:63–68
49. Ramsay J, Campbell JL, Schroter S, Green J, Roland M. The General Practice Assessment Survey (GPAS): Tests of data quality and measurement properties. Fam Pract. 2000;17:372–379
50. Howie JGR, Heaney DJ, Maxwell M. Measuring Quality in General Practice. Occasional paper no. 75. 1997 London, UK Royal College of General Practitioners
51. Al-Shawi AK, MacEachern AG, Greco MJ. Patient assessment of surgeons’ interpersonal skills: A tool for appraisal and revalidation. Clin Gov. 2005;10:212–216
52. Greco M, Cavanagh M, Brownlea A, McGovern J. The Doctors’ Interpersonal Skills Questionnaire (DISQ): A validated instrument for use in GP training. Educ Gen Pract. 1999;10:256–264
53. Narayanan A, Greco M. What distinguishes general practitioners from consultants, according to colleagues? J Manag Marketing Healthc. 2007;1:80–87
54. Doros G, Lew R. Design based on intra-class correlation coefficients. Am J Biostatistics. 2010;1:1–8
55. Brennan RL. Generalizability Theory. 2001 New York, NY Springer-Verlag
56. Cronbach LJ, Gleser GC, Nanda H, Rajaratnam N. The dependability of behavioural measurements: Theory of generalizability for scores and profiles. 1972 New York, NY Wiley
57. Crossley J, Davies H, Humphris G, Jolly B. Generalisability: A key to unlock professional assessment. Med Educ. 2002;36:972–978
58. Downing SM. Reliability: On the reproducibility of assessment data. Med Educ. 2004;38:1006–1012
59. Lockyer J. Multisource feedback in the assessment of physician competencies. J Contin Educ Health Prof. 2003;23:4–12
61. Lockyer JM, Clyman SG. Multi source feedback (360 degree evaluation). In: Holmboe E, Hawkins R, eds. Practical Guide to the Evaluation of Clinical Competence. 2008 Philadelphia, Pa Mosby Elsevier
62. Norcini JJ. Current perspectives in assessment: The assessment of performance at work. Med Educ. 2005;39:880–889
63. Violato C, Lockyer JM, Fidler H. Assessment of pediatricians by a regulatory authority. Pediatrics. 2006;117:796–802
64. Postgraduate Medical Education and Training Board. Developing and Maintaining an Assessment System: PMETB Guide to Good Practice. 2007 London, UK Postgraduate Medical Education and Training Board
65. Hill JJ, Asprey A, Richards SH, Campbell JL. Multisource feedback questionnaires in appraisal and for revalidation: A qualitative study in UK general practice. Br J Gen Pract. 2012;62:e314–e321