Embracing the Complexity of Valid Assessments of Clinicians’ Performance

A Call for In-Depth Examination of Methodological and Statistical Contexts That Affect the Measurement of Change

Boerebach, Benjamin C.M. MSc, PhD; Arah, Onyebuchi A. MD, PhD; Heineman, Maas Jan MD, PhD; Lombarts, Kiki M.J.M.H. MSc, PhD

doi: 10.1097/ACM.0000000000000840


Clinicians are involved in multiple initiatives to chart their professional performance.1,2 Systems have been developed to assess their performance when interacting with patients, colleagues, other health care professionals, and trainees.3–6 Clinicians’ professional performance is defined as their work-related actions and the resulting outcomes. Performance is multidimensional and can be categorized broadly into task and contextual dimensions, each of which consists of several subdimensions.7 Measures of performance are often classified as “objective” or “subjective,” and objective measures have traditionally been favored.8 However, in most health care settings, the interaction of several performance attributes within a specific context, culture, and setting will determine the actual performance of clinicians.8,9 Therefore, Boulet and Murray8 state that subjective performance measures are often less biased, more generalizable, and appropriately valid to assess performance because they can take the complexities of and interrelations among performance attributes into account. Although subjective measures of performance now seem to be included in assessments more frequently than before, the psychometric properties of assessment tools are often solely based on, or tested through, methods and evidence assuming objective, knowledge-based measures of performance.10–12 A knowledge-based performance measure such as “puts on gloves before a procedure” or “checks blood pressure” can be rated rather objectively in two or three (answer) categories by a limited number of raters.
In contrast, more subjective attributes such as “is respectful towards coworkers” or “keeps medical knowledge up-to-date” may be harder to categorize into “yes versus no” or “good versus poor.” More raters and more nuances are often required to obtain reliable and valid results.13,14 Therefore, assessment tools should take differences between subjective and objective measures into account.

Although modern techniques to overcome some of the issues faced in assessing performance are suggested in methodological literature, new insights are rarely translated into assessment tools in practice.15,16 In this article, we address five factors that impact the complexity of performance assessments and provide suggestions on how to deal with the issues they raise. The factors addressed are as follows:

  • The characteristics of a measurement scale can affect the performance data yielded by an assessment tool.
  • Different summary statistics of the same data can lead to opposing conclusions regarding performance and performance change.
  • Performance at the item level does not easily translate to overall performance.
  • Estimating performance change from two time-indexed measurements and assessing change retrospectively can yield different results.
  • The context can impact performance and performance assessments.

In this article, we use empirical data from the System for Evaluation of Teaching Qualities measurements (Box 1) to provide examples.

Box 1: Description of the System for Evaluation of Teaching Qualities and Data Derived Using This System, Used as the Basis for Examining Methodological and Statistical Contexts That Affect Measurement of Change

The empirical data used in this article’s illustrative examples are teaching performance data gathered through the System for Evaluation of Teaching Qualities (SETQ).19–22,33 This system was developed to evaluate clinicians’ teaching performance in residency training. The SETQ gathers residents’ assessments of their clinician teachers using a tool (questionnaire) that was based on an extensive literature review and discussions with several stakeholders. The system was successfully implemented institution-wide and later nationwide, and it is currently being implemented in several countries across Europe. Evidence on the validity and reliability of the SETQ tool has been published extensively.19–22 Across settings, the SETQ tools had satisfactory psychometric properties. Later, a follow-up study using confirmatory psychometric techniques updated the validity evidence of the SETQ tool and confirmed the high validity and reliability of the SETQ data.23 The core tool comprises 20 items that can be categorized into five subscales labeled learning climate, professional attitude towards residents, communication of learning goals, evaluation of residents, and feedback. Examples of items are “this clinician listens attentively to residents” and “this clinician gives corrective feedback to residents.” Each subscale contains three to five items, rated on a five-point Likert scale (1 = “strongly disagree,” 2 = “disagree,” 3 = “neutral,” 4 = “agree,” 5 = “strongly agree”). In addition to rating the numerical items, residents are asked to describe clinician teachers’ strengths and to give suggestions for improvement through narrative descriptions.33 Because the SETQ has shown satisfactory validity for pointwise assessments, but change assessment using the SETQ has not been studied, the SETQ is an excellent tool to illustrate the examples in the current article. Below is a scheme of the assessment process that is repeated annually.


Measurement Scale Characteristics

As noted, professional performance assessments often rely on reports, if not perceptions, of observers about performance. Many subjective attributes of performance, such as “respect towards coworkers” or “clear communication,” may not be appropriately rated by a small number of categories. It is therefore surprising that most attributes are assessed using tools with a four-, five-, or seven-point Likert scale.4,6 The validity and reliability of data yielded by a measurement scale depend on several characteristics and can differ across populations and tools. The tendency to use short measurement scales (such as four-, five-, or seven-point Likert scales) is based on evidence showing that traditional reliability (interrater, test–retest) and validity (convergent versus divergent) measures tend to be higher for shorter scales.10,11 However, this evidence is dated and was mainly based on educational tests that included questions that were knowledge based.10,11 Also, frequently used reliability measures such as Cronbach α and Cohen κ do not appropriately adjust for the higher probability of obtaining agreement or stability by chance, thus favoring shorter scales.12 The sensitivity of a scale to rate performance appropriately on a performance continuum is important and not always appropriately reflected in the statistical reliability and validity of a scale.13 Research on the sensitivity of measurement scales in professional performance assessments is scarce. For illustrative purposes, we address two issues regarding sensitivity of measurement scales—namely, the ability to detect small performance changes and the ability to detect change when the performance distributions are skewed.

Ability to detect small performance changes

Assessing performance change requires a measurement scale that is sensitive enough to capture changes of the magnitude that can be expected. In performance assessments of students, performance changes can be large even over a limited time because the learning curve of students is usually steep. For student assessments, even shorter scales that are less sensitive to small performance changes may capture changes appropriately for most students. In contrast, performance changes in experienced professionals are expected to be small (though relevant) because professionals tend to build up and, sometimes, top their education and clinical experience. For professional performance assessments, it is thus even more important that measurement scales can capture small changes in performance. The ability of a short Likert scale to measure small performance changes is largely unknown, although some questions about this have been posed before.13,14 A review about measuring quality of life, for example, found that studies using longer scales (10- or 15-point Likert scale) reported effect sizes that were up to twice as large as those reported by studies using a 5-point Likert scale.13 The authors argued that these differences were largely caused by the limited sensitivity of the 5-point scale to detect the relatively small changes in quality of life that were observed in most studies.13 Another study that explored the reliability and validity of subjective data also suggests that longer scales may be more valid and reliable.14 Two studies in the context of medical education found that the length of a rating scale impacted the outcomes yielded by the rating tool, favoring the use of longer rating scales.17,18 Intuitively, we may understand that short scales (such as a 5-point Likert scale) are less suitable when changes in performance are small.
However, because longer scales are not (yet) used on a large scale in practice, studies that experiment with longer rating scales should be encouraged, especially because two studies found that raters experience no difficulties with longer rating scales.14,18
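The intuition that short scales miss small changes can be made concrete with a minimal sketch. It is not an analysis from this article: it assumes a hypothetical latent performance level between 0 and 1, equal-width rating categories, and a single rater without measurement error. The function names `rate` and `share_detecting_change` are illustrative only.

```python
def rate(latent: float, points: int) -> int:
    """Map a latent performance level in [0, 1] to a category 1..points.

    Assumes equal-width categories, which is a simplification.
    """
    latent = min(max(latent, 0.0), 1.0)  # clamp to guard against float drift
    return min(int(latent * points) + 1, points)


def share_detecting_change(points: int, true_change: float, grid: int = 1000) -> float:
    """Share of evenly spaced starting levels whose rating changes after the shift."""
    starts = [i / grid * (1.0 - true_change) for i in range(grid + 1)]
    changed = sum(rate(s, points) != rate(s + true_change, points) for s in starts)
    return changed / len(starts)


if __name__ == "__main__":
    # Hypothetical improvement equal to 3% of the latent range.
    for points in (5, 10, 15):
        share = share_detecting_change(points, true_change=0.03)
        print(f"{points:>2}-point scale: rating changes for {share:.1%} of starting levels")
```

Under these assumptions, a small true improvement alters the recorded rating only when it happens to cross a category boundary, and a scale with more categories has more boundaries to cross. Real ratings involve several raters and rater error, so the sketch illustrates the mechanism rather than estimating its size.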

Ability to detect change in skewed distributions

In addition to the ability of measurement scales to detect small performance changes, scales have to provide clinicians room to change their performance scores. Any scale with a prespecified maximum and minimum has fixed boundaries that can impair the measurement of positive or negative change. Clinicians who are able to improve their performance in practice should be able to improve their performance scores on an assessment tool as well. Most of the published professional performance assessment tools report assessment scores that are skewed toward the top of the scale (ceiling effect), with average scores often around or above 4.0 on a 5-point scale.3,4,6 This implies that a large proportion of the clinicians scored well above 4.0, and, for them, the 5-point scale simply allows very little positive change in future assessments. For some performance attributes, such as “washes hands before procedures,” clinicians are required to achieve a maximum score, and in these cases the ceiling effect is not surprising (nor is it a problem). For other, more subjectively rated attributes such as “displaying a professional attitude towards colleagues” and “clear (verbal or written) communication,” it remains unclear whether the limited ability to change performance on the measurement scale corresponds with a limited ability to change actual performance.

To explore whether the ceiling effect impairs the ability to detect performance improvement, we analyzed empirical data on clinicians’ teaching performance. We created two subgroups of clinician teachers. The first subgroup scored above 4.0 on their initial performance measurement and therefore had limited room to change scores positively. The second subgroup scored below 4.0 on their initial measurement and had sufficient room to improve their scores. All clinicians were rated by at least six residents on two consecutive occasions, ensuring high reliability of both measurements.19–23 For both groups, the average performance change across all performance items (between the initial and subsequent year) was calculated. To obtain an “overall indication of clinicians’ performance change,” we also asked residents who had worked with a clinician for over a year to indicate whether the clinician had changed his/her teaching performance in the last year (possible answers were 1 = worsened, 2 = same, 3 = somewhat improved, 4 = strongly improved). The average performance change was calculated for both subgroups (Table 1).

Table 1:
Average Item and Overall Performance Change for Clinician Teachers Scoring Above 4.0 and Below 4.0 on an Initial Measurement, From an Exploration of the Ceiling Effect of a Short Measurement Scale on Clinician Performance Assessment

Table 1 indicates that the average score of the subgroup of clinicians who scored below 4.0 on the initial measurement increased by 0.11 points. In contrast, the average score of the subgroup of teachers who scored above 4.0 on their initial measurement dropped by 0.09 points. Further, residents indicated comparable overall performance change for both subgroups. The overall improvement was expected for the subgroup below 4.0 (whose performance scores increased) but was surprising for the subgroup above 4.0, whose performance scores dropped. Although these results may appear to be due to regression to the mean, the measures of variance (standard deviation, standard error of the mean, and variance components) were unaltered between the first and second measurements, which contradicts this explanation. Therefore, the results could be seen as an indication that the upper limit of the 5-point scale actually impairs the ability to capture positive performance change for the high-scoring clinicians. When performance data are skewed, as in most performance assessment systems, extreme scores (those close to the positive or negative boundary) apply not only to a few outliers but also to a large proportion of the assessed clinicians. Future assessments should aim to limit ceiling effects by making measurement scales longer, providing better anchoring points, placing the center point differently, or formulating the performance items differently.
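The subgroup comparison described above can be sketched in a few lines. This is a minimal illustration, not the analysis code used for Table 1. The function name `change_by_baseline`, the input layout, and the handling of scores exactly at 4.0 are assumptions, and any data passed to it here would be hypothetical.

```python
from statistics import mean


def change_by_baseline(records, cutoff=4.0, scale_max=5.0):
    """Summarize score change for clinicians above and below a baseline cutoff.

    records: iterable of (initial_score, followup_score) pairs, one per clinician.
    Scores exactly at the cutoff are skipped (an assumption; the text only
    describes groups above and below 4.0).
    """
    groups = {"above": [], "below": []}
    for initial, followup in records:
        if initial > cutoff:
            groups["above"].append((initial, followup))
        elif initial < cutoff:
            groups["below"].append((initial, followup))

    summary = {}
    for name, pairs in groups.items():
        if not pairs:
            summary[name] = None  # no clinicians in this subgroup
            continue
        summary[name] = {
            "n": len(pairs),
            "mean_change": mean([f - i for i, f in pairs]),
            # Headroom: how far the initial score sat below the scale maximum.
            "mean_headroom": mean([scale_max - i for i, _ in pairs]),
        }
    return summary
```

Reporting the mean headroom next to the mean change makes the ceiling visible: a subgroup with little headroom cannot show much positive change on the scale, whatever happens in practice.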

Summarizing Performance Scores

In reporting performance data such as measurements of change to clinicians, the data often need to be summarized. Several summary statistics can be used for a Likert scale. By far, the most commonly used statistics are the median and mean. Although some argue that mean scores are not appropriate to summarize ordinal or skewed data, they are frequently used.24,25 For many performance attributes, such as “clear communication,” the proportion of cases in which a clinician performed below standard will be more informative than the median or mean performance. Because poor communication can impact patient safety, the proportion of cases in which the communication was below standard reflects the proportion of cases in which patient safety was at stake. This is not (fully) reflected in the mean or median. The kind of information that is desired by clinicians, researchers, directors, or policy makers will determine which (combination of) statistics or metrics will provide the most appropriate summary of clinicians’ performance. Table 2 summarizes, using four summary statistics, the performance of two clinicians (clinicians A and B) who were evaluated by 10 residents on the performance item “regularly provides trainees with constructive feedback” in two subsequent years. For clinician B, the four change statistics lead to completely different conclusions regarding performance improvement. For clinician A, most of the statistics lead to the conclusion that this clinician’s performance improved; however, the magnitude of performance change (which is important as an effect size measure) differs considerably across the statistics or metrics.

Table 2:
Performance Scores and Summary Statistics for Clinician A and B, Illustrating Differing Conclusions About Performance Improvement as Generated by Different Types of Summarizing Metric
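How different summaries of the same ratings can point in opposite directions is easy to reproduce. The sketch below is illustrative only. The ratings are hypothetical and are not the Table 2 data, and the four statistics chosen here (mean, median, share below a standard, share at the top score) are assumptions that need not match the four statistics reported in Table 2.

```python
from statistics import mean, median


def summarize(ratings, standard=3, scale_max=5):
    """Return several summaries of one clinician's ratings on a single item."""
    n = len(ratings)
    return {
        "mean": mean(ratings),
        "median": median(ratings),
        # Share of raters scoring the clinician below the assumed standard.
        "share_below_standard": sum(r < standard for r in ratings) / n,
        "share_top_score": sum(r == scale_max for r in ratings) / n,
    }


def change(before, after):
    """Difference (after minus before) for every summary statistic."""
    first, second = summarize(before), summarize(after)
    return {name: second[name] - first[name] for name in first}


if __name__ == "__main__":
    # Hypothetical ratings from 10 residents in two subsequent years.
    year_1 = [4, 4, 4, 4, 4, 4, 4, 4, 4, 4]
    year_2 = [5, 5, 5, 5, 5, 5, 5, 2, 2, 2]
    for name, delta in change(year_1, year_2).items():
        print(f"{name}: {delta:+.2f}")
```

In this constructed case the mean and median both rise while the share of below-standard ratings also rises, so the choice of statistic decides whether the clinician appears to have improved.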

Performance Change in Item, Scale, and Overall Scores

Performance domains, such as professionalism, communication, interpersonal relationships, management, and teaching, are usually captured by one or more performance scales of an assessment tool. These performance scales are measured using multiple items. Because only the items of an assessment tool are measured (domains and scales are latent), and only the items contain concrete behaviors, actions, or attitudes, it could be argued that the actual performance and the subsequent change of performance happen at the item level. Most assessment tools contain many items, so it may take some time to interpret performance and performance change for all items. Therefore, scales, domains, or overall performance are often the levels of performance assessment interpretation. However, estimating domain or overall performance scores can be challenging.

By far, the most frequently applied method to obtain performance domain scores is averaging or summing the performance items that belong to a certain performance domain.9 This method assumes that all performance items are of equal importance for a performance domain, an assumption that seems implausible in practice. In previous studies, students, residents, coworkers, and patients could easily prioritize the performance items that belonged to a domain.26–28 Therefore, calculating domain or overall scores by averaging or summing all items seems too simplistic an approach.9,29,30 For research purposes, the use of item response theory (IRT) may provide a solution to this problem because it does not assume equal importance of all items (as classical test theory does). IRT can adjust for the inequality of importance across items in a scale. However, IRT has not been widely adopted and cannot be handled by popular general statistical software packages such as SPSS.16,24 For practice, carefully looking at data at the item level and weighting the importance of the items for a specific domain are crucial for appropriate interpretation of performance assessments.
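The difference between a simple average and an importance-weighted domain score can be shown with a short sketch. This is a plain weighted mean, not IRT, and it is offered only as one possible way to act on the advice to weight items. The function name `domain_score`, the item labels, and the weights are hypothetical. In practice the weights would have to come from stakeholders or from a fitted model.

```python
from typing import Dict, Optional


def domain_score(item_scores: Dict[str, float],
                 weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted mean of item scores; equal weights reproduce the simple average."""
    if not item_scores:
        raise ValueError("item_scores must not be empty")
    if weights is None:
        weights = {item: 1.0 for item in item_scores}
    missing = set(item_scores) - set(weights)
    if missing:
        raise ValueError(f"no weight given for: {sorted(missing)}")
    total_weight = sum(weights[item] for item in item_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[item] * score for item, score in item_scores.items())
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical item scores and importance weights for one domain.
    scores = {"item_a": 5, "item_b": 4, "item_c": 2}
    importance = {"item_a": 1, "item_b": 1, "item_c": 3}
    print("simple average:", round(domain_score(scores), 2))
    print("weighted score:", round(domain_score(scores, importance), 2))
```

When the lowest-scoring item is also the one judged most important, the weighted score falls well below the simple average, which is the kind of information that equal weighting hides.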

Estimating Change or Assessing Change Retrospectively

Apart from calculating change from two time-indexed performance assessments, it is also possible to assess clinicians’ performance change retrospectively, by asking raters to indicate whether a clinician improved his/her performance. We are aware that assessing change retrospectively can suffer from recall bias among raters; however, we still expected a substantial correlation with the average performance change (at least in the same direction), as both measures are intended to capture changed performance.24 As shown in Table 1, these two methods yielded different change results for high- and lower-scoring clinicians. Moreover, the correlation between the scores yielded by these two methods is low, even for the group who scored below 4.0 (Table 3). One explanation for this surprising finding may be that changes on certain performance items are more important than changes on other performance items. Therefore, the average change over all items may be a poor indicator of the perceived overall performance change. This again highlights that the averaging or summing method of calculating overall performance scores is probably too simplistic to assess performance and performance change appropriately. Previous studies of performance measurement at one point in time showed that overall indications of performance had high overlap with average performance scores (correlations of 0.50 up to 0.98).19,20,22,31 Our analysis shows that this does not hold for estimates of performance change. When assessing performance change, it is thus very important to consider which (combination of) measure(s) of change is of most interest. Robust assessments of change should include multiple change measures at the item and domain levels, assessed over time and retrospectively. The combination of multiple measures will provide the most valid overview of clinicians’ performance change.

Table 3:
Correlations Between the Average Performance Change (Calculated from Two Time-Indexed Performance Evaluations) and Overall Performance Change Assessed Retrospectively, Separately for Clinician Teachers Scoring Above 4.0 and Below 4.0 on an Initial Measurement
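A comparison of the two change measures within each subgroup might be computed along the following lines. The sketch assumes Pearson correlation, which the text does not specify. Because the retrospective rating has only four ordered categories, a rank-based coefficient could be a reasonable alternative. The function names and the record layout are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def correlation_by_subgroup(records, cutoff=4.0):
    """Correlate calculated and retrospective change above and below the cutoff.

    records: iterable of (initial_score, calculated_change, retrospective_rating).
    Clinicians scoring exactly at the cutoff are skipped (an assumption).
    """
    groups = {"above": ([], []), "below": ([], [])}
    for initial, calculated, retrospective in records:
        if initial == cutoff:
            continue
        xs, ys = groups["above" if initial > cutoff else "below"]
        xs.append(calculated)
        ys.append(retrospective)
    return {name: pearson(xs, ys) for name, (xs, ys) in groups.items()}
```

Each subgroup needs at least two clinicians and some variation in both measures; otherwise the function raises an error rather than returning a misleading value.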

Performance Within Context

A final issue to address is the role of context in assessing performance. Differences between specialties, departments, hospitals, or countries can result in different perceptions of performance. This can impact performance assessment in three different ways. First, the performance scores may be impacted. For example, in hospitals with a focus on teaching, the professional infrastructure and learning climate can help clinician teachers to excel in teaching, leading to overall high teaching scores among most teachers.32 In hospitals with limited teaching facilities it can be harder for clinician teachers to obtain high teaching performance scores. Therefore, it is important to interpret performance within its context; that is, to assess how far the item, scale, and overall scores of clinicians are from their group means or medians. Second, the relative importance of performance items or domains may differ between settings. For example, doctor–patient communication is important in a large proportion of the interactions that primary care clinicians have, but less frequently important for pathologists. Also, in centers with many trainees, supervising clinicians will have less direct patient contact because of their role as trainers. Consequently, the relative importance of clinicians’ interpersonal behaviors toward patients will be affected by the proportion of time they spend on patient interactions and by the quality of those interactions.

Third, the context can influence the operationalization of the performance items themselves. If we consider doctor–patient communication again, the specific behaviors appropriate for communicating with patients will be different for pediatricians, internists, and psychiatrists. Also, because the work and the types of coworkers differ between surgical and nonsurgical specialties, effective collaboration could depend on different attributes. The second and third categories of contextual influences suggest that the validity of assessment tools should always be reevaluated when tools are used in different contexts and settings. In some cases, performance items need to be adjusted, added, or removed to make them appropriately valid for context-specific performance assessment. When performance assessments still have to be compared across contexts, appropriate equivalence methods should be applied.
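The first contextual point, reading a score relative to the clinician's own group, can be sketched as follows. This is a minimal illustration with hypothetical hospital labels, clinician identifiers, and scores. It assumes that each clinician appears in exactly one context and that the group mean and median are computed from the scores supplied.

```python
from statistics import mean, median


def deviations_from_context(scores_by_context):
    """Express each clinician's score relative to the mean and median of the group.

    scores_by_context: {context_name: {clinician_id: score}}.
    Assumes clinician identifiers are unique across contexts.
    """
    result = {}
    for context, scores in scores_by_context.items():
        values = list(scores.values())
        group_mean, group_median = mean(values), median(values)
        for clinician, score in scores.items():
            result[clinician] = {
                "context": context,
                "from_mean": score - group_mean,
                "from_median": score - group_median,
            }
    return result


if __name__ == "__main__":
    # Hypothetical overall teaching scores in two settings.
    data = {
        "teaching_focused_hospital": {"A": 4.2, "B": 4.6, "C": 4.7},
        "hospital_with_limited_facilities": {"D": 3.9, "E": 3.4, "F": 3.5},
    }
    for clinician, info in deviations_from_context(data).items():
        print(clinician, info["context"], f"{info['from_mean']:+.2f}")
```

In this constructed example, clinician A has the higher raw score but sits below the group mean, whereas clinician D has the lower raw score but sits above it. Such a comparison addresses only the first of the three contextual influences; it does not adjust for differences in item importance or item meaning across settings.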


Critically appraising the complexity of performance assessment data is crucial when using performance assessments and assessing change. Faculty developers and policy makers should check whether the performance data generated by an assessment tool are interpreted correctly by the clinicians in practice. They can provide guidance to fully inform clinicians about their performance data, and they can help to guide clinicians in taking appropriate improvement actions. Future performance assessment tools and checklists should carefully consider measurement scale characteristics to avoid skewed data and to ensure the appropriate sensitivity of a scale. Different types of scales for different items can be used to ensure the most appropriate scale for each type of item, provided appropriate methods (such as from IRT) are subsequently used. It is very important that clinicians and other practitioners are able to interpret the assessment scores easily. This can only be achieved when the formulation of each item and the appropriate scale for each item are thoroughly considered, so that the value of the performance data yielded by assessment tools for both research and practice is enhanced.


References

1. Lanier DC, Roland M, Burstin H, Knottnerus JA. Doctor performance and public accountability. Lancet. 2003;362:1404–1408
2. Grol R, Wensing M, Eccles M, Davis D. Improving Patient Care: The Implementation of Change in Health Care. 2nd ed. Chichester, UK: Wiley Blackwell; 2013
3. Beckman TJ, Ghosh AK, Cook DA, Erwin PJ, Mandrekar JN. How reliable are assessments of clinical teaching? A review of the published instruments. J Gen Intern Med. 2004;19:971–977
4. Fluit CR, Bolhuis S, Grol R, Laan R, Wensing M. Assessing the quality of clinical teachers: A systematic review of content and quality of questionnaires for assessing clinical teachers. J Gen Intern Med. 2010;25:1337–1345
5. Reinders ME, Ryan BL, Blankenstein AH, van der Horst HE, Stewart MA, van Marwijk HW. The effect of patient feedback on physicians’ consultation skills: A systematic review. Acad Med. 2011;86:1426–1436
6. Donnon T, Al Ansari A, Al Alawi S, Violato C. The reliability, validity, and feasibility of multisource feedback physician assessment: A systematic review. Acad Med. 2014;89:511–516
7. Sonnentag S, Frese M. Performance concepts and performance theory. In: Sonnentag S, ed. Psychological Management of Individual Performance. West Sussex, UK: John Wiley & Sons; 2002:1–25
8. Boulet JR, Murray D. Review article: Assessment in anesthesiology education. Can J Anaesth. 2012;59:182–192
9. Rosen MA, Pronovost PJ. Advancing the use of checklists for evaluating performance in health care. Acad Med. 2014;89:963–965
10. Cronbach LJ. Response sets and test validity. Educ Psychol Meas. 1946;6:475–494
11. Cronbach LJ. Further evidence on response sets and test design. Educ Psychol Meas. 1950;10:3–31
12. Hayes AF, Krippendorff K. Answering the call for a standard reliability measure for coding data. Com Meth Meas. 2007;1:77–89
13. Cummins RA, Gullone E. Why we should not use 5-point Likert scales: The case for subjective quality of life measurement. In: Second International Conference on Quality of Life in Cities. Singapore: National University of Singapore; 2000
14. Preston CC, Colman AM. Optimal number of response categories in rating scales: Reliability, validity, discriminating power, and respondent preferences. Acta Psychol (Amst). 2000;104:1–15
15. Reise SP, Henson JM. A discussion of modern versus traditional psychometrics as applied to personality assessment scales. J Pers Assess. 2003;81:93–103
16. Streiner DL. Measure for measure: New developments in measurement and item response theory. Can J Psychiatry. 2010;55:180–186
17. Albanese M, Prucha C, Barnet JH. Labeling each response option and the direction of the positive options impacts student course ratings. Acad Med. 1997;72(10 suppl 1):S4–S6
18. Hassell A, Bullock A, Whitehouse A, Wood L, Jones P, Wall D. Effect of rating scales on scores given to junior doctors in multi-source feedback. Postgrad Med J. 2012;88:10–14
19. Arah OA, Hoekstra JB, Bos AP, Lombarts KM. New tools for systematic evaluation of teaching qualities of medical faculty: Results of an ongoing multi-center survey. PLoS One. 2011;6:e25983
20. Boerebach BC, Arah OA, Busch OR, Lombarts KM. Reliable and valid tools for measuring surgeons’ teaching performance: Residents’ vs. self evaluation. J Surg Educ. 2012;69:511–520
21. Lombarts KM, Bucx MJ, Arah OA. Development of a system for the evaluation of the teaching qualities of anesthesiology faculty. Anesthesiology. 2009;111:709–716
22. van der Leeuw R, Lombarts K, Heineman MJ, Arah O. Systematic evaluation of the teaching qualities of obstetrics and gynecology faculty: Reliability and validity of the SETQ tools. PLoS One. 2011;6:e19142
23. Boerebach BC, Lombarts K, Arah O. Confirmatory factor analysis of the System for Evaluation of Teaching Qualities (SETQ) in graduate medical training [published online ahead of print October 2, 2014]. Eval Health Prof. Accessed June 1, 2015
24. Streiner DL, Norman GR. Health Measurement Scales: A Practical Guide to Their Development and Use. Oxford, UK: Oxford University Press; 2008
25. Tabachnick BG, Fidell LS. Using Multivariate Statistics. Pearson International Edition, 5th ed. Boston, Mass: Allyn & Bacon; 2007
26. Schmutz J, Eppich WJ, Hoffmann F, Heimberg E, Manser T. Five steps to develop checklists for evaluating clinical performance: An integrative approach. Acad Med. 2014;89:996–1005
27. Shaw KL, Brook L, Cuddeford L, et al. Prognostic indicators for children and young people at the end of life: A Delphi study. Palliat Med. 2014;28:501–512
28. Smith KS, Simpson RD. Validating teaching competencies for faculty members in higher education: A national study using the Delphi method. Inn Higher Educ. 1995;19:223–234
29. Boerebach BC, Lombarts KM, Keijzer C, Heineman MJ, Arah OA. The teacher, the physician and the person: How faculty’s teaching performance influences their role modelling. PLoS One. 2012;7:e32089
30. Lombarts KM, Heineman MJ, Arah OA. Good clinical teachers likely to be specialist role models: Results from a multicenter cross-sectional survey. PLoS One. 2010;5:e15202
31. Williams BC, Litzelman DK, Babbott SF, Lubitz RM, Hofer TP. Validation of a global measure of faculty’s clinical teaching performance. Acad Med. 2002;77:177–180
32. Lombarts KM, Heineman MJ, Scherpbier AJ, Arah OA. Effect of the learning climate of residency programs on faculty’s teaching performance as evaluated by residents. PLoS One. 2014;9:e86512

Reference cited only in Box 1

33. van der Leeuw RM, Overeem K, Arah OA, Heineman MJ, Lombarts KM. Frequency and determinants of residents’ narrative feedback on the teaching performance of faculty: Narratives in numbers. Acad Med. 2013;88:1324–1331
© 2016 by the Association of American Medical Colleges