Establishing assessments that not only promote trustworthy decisions regarding clinical competence but also support learner development is a priority in health professions education.1 Fulfilling either goal requires the judgment of observers. This in turn requires awareness of the difficulty inherent in attending to, processing, and translating observed performance into ratings that reference a standard and feedback that guides learners.2,3 The task is made harder by modern competence frameworks that lead assessment designers to require raters to focus on multiple dimensions of performance simultaneously.4–6 Each dimension of competence is important, but cognitive capacity is limited and complexity can negatively affect rating quality, causing raters to apply mental shortcuts that result in suboptimal observation.7 Determining how best to manage this tension requires an understanding of the factors that influence raters’ cognition.
DeNisi8 has studied this issue, as have we,9 arguing that broadening a rater’s focus to the point that cognitive resources are exceeded harms appraisal/assessment processes.8,9 We have previously suggested that appraisals only have utility to the extent that information (i.e., candidate behaviors, stimuli details, contextual cues and their relevance) is attended to and recalled. Focusing attention sufficiently, however, and recalling information appropriately may be particularly challenging in modern educational practice.10 Representing the varied competencies expected of modern-day practitioners provides assessment processes with stronger claims to construct validity, but may exceed an observer’s ability to adequately consider relevant dimensions of performance.11 Under high-demand conditions, people spontaneously engage cognitive behaviors aimed at reducing memory load.12 Such strategies can include “degraded concurrent processing,” where two or more tasks are completed concurrently but one or more suffers relative to when the task is performed in isolation; “strict serial processing,” where multiple tasks are performed, but only one is completed at a time, leaving some tasks ignored at any given moment; engaging heuristics; depending on schemas; and avoidant behaviors.7,12–15 These tendencies challenge the assumption that observable performance elements are actively and appropriately considered during rater-based assessment.9,10 They also offer an explanation for commonly reported issues including idiosyncrasy of focus, poor interrater reliability, and low-quality feedback.
Efforts to improve feedback have focused mainly on the structure and process of how feedback is delivered. These have included targeted postobservation strategies such as guidance to explore learners’ understanding of the content and/or inclusion of specific information set against a criterion.16 We do not, however, fully understand what influences the information faculty have mentally available to them to generate feedback in the first place. It is typically assumed that raters see the same things, but prioritize issues differently or address them inadequately. However, recent research on the influence of broadening raters’ focus of assessment casts doubt on the sufficiency of this interpretation by illustrating that greater variability in perception exists when raters’ assessment demands are broadened.7,9
Such findings demand a strategy for assessment and feedback that considers all constructs included in modern competency frameworks while recognizing that asking raters to do more might reduce their effectiveness. In this study, we sought to test whether asking raters to sequentially assess a subset of a candidate’s competencies altered the generation of feedback and performance ratings relative to having raters evaluate more competencies simultaneously. Given the conceptual framework outlined above, and elaborated more fully elsewhere, we hypothesized that distributing the assessment of distinct but interrelated dimensions of competence across raters would result in improved assessment outcomes without compromising the extent to which all dimensions are taken into account.9 Our goal was not to evaluate the utility of a particular assessment protocol but, rather, to understand how raters’ impressions differed in the two experimental conditions while anticipating that the results could inform the structure of a variety of formative and summative assessment strategies.
Using a randomized between-subjects experimental design, we manipulated whether students were evaluated by an observer directed to attend to six dimensions of clinical competence (the simultaneous condition) or by three observers directed to each consider only two dimensions of competence (the sequential conditions). The dimensions listed on a previously developed global rating scale (GRS) served as the intervention.17 Participants were asked to rate four recorded and unscripted (i.e., spontaneously generated) clinical performances and to indicate the feedback they would provide to the students in each video. Primary outcomes were indicators of the amount and type of feedback provided and the reliability of the scores observed. We tested the hypothesis that broadening a rater’s focus would result in adverse effects on the generation of feedback and reliability. This study took place in Toronto, Ontario, and Halifax, Nova Scotia, Canada, with participants from multiple eligible organizations and colleges, in 2016–2017. Research ethics board approval for this study was provided by Centennial College (REB no. 191).
Using existing e-mail distribution lists (convenience sampling), we recruited paramedic educators who were functioning, or who had functioned, as observers or raters in work- and/or simulation-based training or assessment. Participants were required to have clinical experience and experience with mannequin- or standardized-patient-based simulations for the purposes of assessment. Our recruitment letter indicated a time commitment of two hours and an honorarium for participation, which took place independent of work responsibilities.
We asked participants randomized to the simultaneous condition to consider six domains of competence (history gathering, patient assessment, decision making, communication, resource utilization, and procedural skill). Participants randomized to the sequential condition performed the same task using one of three unique versions of the GRS that contained only two of the original six dimensions, which we preassigned into three pairs (version 1 = history gathering and procedural skill; version 2 = decision making and communication; version 3 = patient assessment and resource utilization). Previous work showed these pairings to have the lowest interitem correlations and/or greatest conceptual differences, thereby ensuring that raters in both conditions had to focus on multiple dimensions of competence.17 Rater training consisted of orientation to the rating tool and clinical scenario, a description of performance expectations, and frame-of-reference information with generic guidelines (e.g., treat dimensions independently, review rating label definitions). We used a random number generator to assign participants in a 1:3 ratio, with the latter group also being randomly assigned to one of the three 2-dimensional rating tools. All participants observed the same four videos (in random order), and each used only one rating form. They were not allowed to pause or review the video at any point, in order to replicate the naturalistic rating demands inherent in work-based assessment or simulation activities. They were permitted to take notes.
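The allocation described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the text states only that a random number generator assigned participants in a 1:3 ratio and that the sequential group was then randomly assigned to a rating tool. The function name, the seed, and the choice of an independent draw per participant are assumptions made for the example.

```python
import random

# Dimension pairs on the three two-dimensional GRS versions (from the text).
SEQUENTIAL_VERSIONS = {
    1: ("history gathering", "procedural skill"),
    2: ("decision making", "communication"),
    3: ("patient assessment", "resource utilization"),
}

def assign_participant(rng):
    """Return (condition, version) for one participant.

    Hypothetical helper: a 1-in-4 chance of the simultaneous condition,
    otherwise a uniformly chosen sequential GRS version.
    """
    if rng.randrange(4) == 0:
        return ("simultaneous", None)
    return ("sequential", rng.choice(sorted(SEQUENTIAL_VERSIONS)))

rng = random.Random(2016)  # arbitrary seed so the sketch is reproducible
allocations = [assign_participant(rng) for _ in range(88)]
```

Independent draws of this kind approximate a 1:3 split only on average, which is consistent with unequal group sizes arising in practice; a blocked scheme would be needed to force exact proportions.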
A different candidate was portrayed in each video. Each was a paramedic candidate responding to a deteriorating cardiac patient who, at a predetermined point, and regardless of intervention, suffered a cardiac arrest. The candidate responded to the case alone but had two first responders available who were trained to portray differences in their abilities and willingness to assist. The two first responders were instructed to be disruptive by conflicting with one another but not obstructive of the candidate’s efforts. We randomly selected four videos (of two male and two female candidates) from a pool of 80 that included performances from students currently enrolled in a paramedic training program; individuals who had just completed their training program; and working, experienced paramedics. We did not attempt to control for specific aspects of performance. Each video, created as part of a larger program of research, was nine minutes in length. Only one camera view was available (from the foot end of the stretcher).
Immediately following each video, we instructed participants to provide formative feedback verbally and to assign numerical scores reflecting the candidate’s performance on the dimensions assigned. We provided all participants in both groups with the following instructions: “Address specific and/or overall performance areas, specific to the dimensions included on your rating tool, with the intention of improving the candidate’s performance in future similar cases or with any patient they might encounter in the future.” Otherwise they received no prompting, directing, or leading, apart from being asked upon completion whether there was anything else they would like to add before moving on. Participants completed the study with the help of a research assistant either in person or remotely using online video and audio recording technology. Recorded feedback was transcribed verbatim for analysis. We reviewed approximately 20% of the transcriptions for accuracy, observed no concerns, and proceeded with analyses.
Outcome measures and analysis
With no agreed-upon method of evaluating feedback, we coded the transcripts in a number of ways that were informed by earlier work exploring factors influencing rater feedback, characteristics of feedback effectiveness, and the concept of content validity.2,18,19 Our coding focused on quantity, characteristics (accuracy, false claims, statements of uncertainty, recommendations, subjective evaluations), and content (breadth and depth of dimension coverage, feedback type). Our intention was not to make firm conclusions regarding feedback quality, recognizing that feedback is better conceptualized as a conversation, but to look for indicators and surrogates that could help determine whether the focus of feedback changed as a result of simultaneous versus sequential competency assessment.16
Quantity. We identified and segmented feedback into individual statements representing unique ideas (e.g., “the steps in procedure X were properly sequenced”). These were counted and compared between groups.
Characteristics. Each statement that could be confirmed as clearly linked to an observable behavior in the associated video was coded as accurate. Statements that were clearly not observable in the video were labeled as false claims. We also coded statements that were subjective evaluations (e.g., “the time spent on procedure X was appropriate”), those that described uncertainty (e.g., “I am not sure if step 2 in procedure X was done or not”), and those that were recommendations (e.g., “next time complete procedure X using your dominant hand”). False claims and statements of uncertainty were not included in the subsequent content analyses because they were considered construct-irrelevant data.
Content. We operationalized content validity in three ways. First, as breadth of coverage, by determining whether dimensions of performance included on the rating tool were omitted in the feedback. Second, as depth of coverage, by counting the number of unique statements included within the dimensions of performance included on the rating tool. Third, as feedback type, by coding the focus of feedback statements as describing a specific behavior or task performed, a dimension of performance, the individual, the context, directions or recommendations, and/or encouragement of reflection.
During all coding, researchers were blinded to the group condition. Three research assistants completed the coding, with disagreements resolved by the principal investigator (W.T.).
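The breadth and depth indicators described above can be illustrated with a short sketch. It assumes that each retained feedback statement has already been coded with a single dimension label; the function name, the returned keys, and the example data are invented for illustration and do not come from the study.

```python
from collections import Counter

def coverage_indicators(statement_dimensions, tool_dimensions):
    """Summarize breadth and depth of feedback for one rater and one video.

    statement_dimensions: dimension label coded for each retained statement.
    tool_dimensions: dimensions listed on the rater's version of the GRS.
    """
    counts = Counter(d for d in statement_dimensions if d in tool_dimensions)
    # Depth: number of unique statements within each dimension on the tool.
    depth = {d: counts.get(d, 0) for d in tool_dimensions}
    # Breadth: dimensions on the tool that received no feedback at all.
    omitted = [d for d in tool_dimensions if depth[d] == 0]
    # Dimensions covered by exactly one statement (reduced depth).
    single = [d for d in tool_dimensions if depth[d] == 1]
    return {"depth": depth, "omitted": omitted, "single_statement": single}

# Invented example for a rater using GRS version 2.
coded = ["communication", "communication", "decision making"]
result = coverage_indicators(coded, ["decision making", "communication"])
```

Here no dimension is omitted, and "decision making" is flagged as covered by a single statement.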
Scores and reliability. We explored the extent to which the scores assigned consistently differentiated between candidates’ performances using generalizability theory. For the simultaneous rating condition, this amounted to a fully crossed design with four videos assessed on six dimensions of performance crossed with a series of raters. Videos were set as the facet of differentiation, and three forms of reliability were calculated: internal consistency (the overall correlation between dimensions on the rating form), interrater reliability (the extent to which two raters’ ratings correlate with one another), and an overall reliability coefficient that took into account both item and rater variance as sources of measurement error.
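For readers less familiar with generalizability theory, the overall coefficient for a fully crossed video × rater × item design is commonly written in the textbook form below. The exact formulation used in this study is not stated, so this should be read as an assumption; the symbols (v for videos, r for raters, i for items or dimensions, and the primed n for the numbers of raters and items over which scores are averaged) are introduced here for illustration only.

```latex
% Relative G coefficient with videos (v) as the facet of differentiation
% and raters (r) and items (i) both treated as sources of error.
% This is the standard textbook form, not necessarily the authors' formula.
E\rho^2 =
  \frac{\sigma^2_{v}}
       {\sigma^2_{v}
        + \dfrac{\sigma^2_{vr}}{n'_r}
        + \dfrac{\sigma^2_{vi}}{n'_i}
        + \dfrac{\sigma^2_{vri,e}}{n'_r\, n'_i}}
```

Under the same assumption, treating only the rater terms as error gives an interrater-style coefficient, and treating only the item terms as error gives an internal-consistency-style coefficient.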
For the sake of comparison, we conducted reliability analyses on the sequential condition scores in a way that takes into account the reality that would exist if this assessment strategy were enacted. Given that the goal would be to gather a full set of competency ratings for each candidate, we combined ratings from different raters in the sequential condition to create sets that covered all six dimensions. To do so, we assigned a random number to each rater and rank-ordered them within each two-dimensional GRS version based on that random number. We then combined the dimensional ratings of each rater who possessed the same rank. Doing so resulted in a set of ratings equivalent to those collected in the simultaneous rating condition on which the same reliability analyses were conducted. To minimize the risk of randomization failure, this process was repeated three times to estimate the mean and standard deviation of the reliabilities that would be observed. Doing so allowed a sample size calculation to determine how many observations were required to allow a sufficiently powered z test of whether the reliabilities in the sequential condition were statistically different from the reliability observed in the simultaneous condition. That analysis suggested that 5 replications would yield a power of 0.81. To be conservative, we conducted 15 replications and report the mean and 95% confidence interval (CI) of those permutations as our best estimate of the reliabilities generated through the sequential rating intervention.
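The aggregation procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: shuffling raters within each GRS version stands in for ranking on a random number, the rater identifiers are invented, and the confidence interval uses a normal approximation based on the standard error across replications, which the text does not specify.

```python
import random
import statistics

def build_composite_sets(raters_by_version, rng):
    """Combine one rater from each GRS version into six-dimension sets.

    raters_by_version: dict mapping GRS version -> list of rater IDs.
    Raters are shuffled within each version, and raters sharing the same
    rank are grouped; surplus raters in the larger groups are dropped.
    """
    shuffled = {}
    for version, raters in raters_by_version.items():
        order = list(raters)
        rng.shuffle(order)
        shuffled[version] = order
    n_sets = min(len(order) for order in shuffled.values())
    versions = sorted(shuffled)
    return [tuple(shuffled[v][rank] for v in versions) for rank in range(n_sets)]

def mean_and_ci95(values):
    """Mean and normal-approximation 95% CI across replications (assumed)."""
    mean = statistics.mean(values)
    half_width = 1.96 * statistics.stdev(values) / len(values) ** 0.5
    return mean, (mean - half_width, mean + half_width)

rng = random.Random(0)  # arbitrary seed for reproducibility
raters = {
    1: [f"v1_r{k}" for k in range(22)],  # hypothetical rater IDs
    2: [f"v2_r{k}" for k in range(21)],
    3: [f"v3_r{k}" for k in range(22)],
}
# Fifteen random sorts, each yielding a list of composite rater sets.
replications = [build_composite_sets(raters, rng) for _ in range(15)]
```

In a full analysis, a reliability coefficient would be computed for each replication's composite sets and the resulting 15 values passed to `mean_and_ci95`.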
We conducted all comparisons between groups using descriptive and inferential statistics (i.e., ANOVA, chi-square) as appropriate, with P = .05 set as our level of statistical significance. These analyses were conducted using IBM SPSS statistical software, version 21 (IBM Corp., Armonk, New York). We used ANOVA to calculate variance components, which were then used to calculate our reliability coefficients using generalizability theory.
Of the educators we invited, 88 participants were enrolled: 23 in the simultaneous condition, and 65 across the three sequential conditions. In the latter group, 21 completed the decision-making and communication tool, 22 used the history-gathering and procedural skills tool, and 22 used the patient assessment and resource utilization tool. Two participants in the simultaneous condition did not provide a complete set of ratings and were excluded from reliability analyses (resulting in analyses for 21 raters). Because the minimum number of participants in the sequential condition group was 21, the 22nd-ranked individual in each of the other two groups was excluded from the data aggregation after each random sort. See Table 1 for rater demographic characteristics.
Summing across all six dimensions, raters in the simultaneous condition offered 27.7 (95% CI: 22.3, 33.1) pieces of feedback, on average, compared with 42.9 (95% CI: 36.4, 49.4) pieces of feedback when sets of raters in the sequential condition were created (P < .05 for all four videos, with F values ranging from 7.4 to 20.0). When examined by dimension, raters in the simultaneous condition still offered less feedback than those in the sequential condition (Table 2).
We found no significant differences in the proportion of feedback statements that were accurate, false, subjective, indicative of uncertainty, or recommendations (Table 3).
Participants in the simultaneous rating condition were more likely to give feedback in which at least one dimension of performance was not represented (less breadth), in which a dimension of performance was covered by only one feedback segment (reduced depth), and in which some types of feedback were absent (Table 4).
Scores and reliability
The mean scores assigned by raters in the simultaneous condition were 5.4, 2.9, 4.3, and 2.5 for videos one through four, respectively, almost identical to those assigned by raters in the sequential condition (5.2, 2.8, 4.1, and 2.7, respectively).
Generalizability analyses performed on the ratings assigned by raters in the simultaneous condition demonstrated interrater reliability = 0.58, internal consistency (Cronbach alpha) = 0.74, and overall reliability = 0.56. These served as the point estimates against which the reliability of the aggregated ratings provided by raters in the sequential condition was compared. Following 15 random sorts of raters in that condition, the mean reliabilities observed were interrater = 0.74, internal consistency = 0.78, and overall = 0.70. The 95% CIs surrounding these means did not include the point estimates of reliability calculated for the simultaneous condition (Table 5).
Given the dependence of health professions education on observer judgment for performance assessment,20 it is important to understand the factors that influence rater-based appraisals.10,21 Models of clinical competence are broadening, and greater emphasis is being placed on using assessment practices to improve future performance (assessment for learning), both of which place additional demands on observers. Although previous research has demonstrated problematic outcomes when rating demands are high, the influence on feedback provision has been less clear. Further, that prior research has not offered a solution to the problem, given that reducing rater burden often involves limiting the scope of practice assessed.7 The purpose of this study was to explore that tension by examining the influence of asking a sequence of raters to assess a subset of competency domains relative to the norm of asking raters to assess all competencies simultaneously. Consistent with our conceptual framework and hypotheses, our findings suggest that broadening raters’ focus has potentially deleterious effects. When asked to consider six dimensions of performance simultaneously, raters offered less feedback (overall and by dimension), were more likely to ignore some dimensions of performance, and limited the variety of feedback provided relative to observers who were asked to consider a subset of dimensions. The intervention, however, did not appear to affect their ability to generate true and false memories, or the rate at which statements were described as subjective, as uncertain, or as recommendations. When scores from the sequential condition were aggregated, the reliability of the scores assigned increased as well. These findings suggest that asking raters to pay attention to fewer aspects of performance can lead to formative and summative assessments with greater utility, which has theoretical and practical implications.
Seminal work on the provision of feedback emphasizes basing feedback on “firsthand data” and observable “decisions and actions” that are considered in relation to “performance standards” and goals.21 This assumes that faculty have the capacity to detect and select meaningful information, process it in relation to learner context, ignore irrelevant data, and then translate observations into coherent feedback. Researchers have subsequently challenged this assumption, arguing that these are complex cognitive activities that are dependent on capacity-limited structures (e.g., attention, working memory).8,12,22–24 Our own research subsequently showed that, when raters broaden their focus, they tend to mentally encode only a portion of learners’ behaviors.7 Such cognitive limitations, thereby, reduce the opportunity to provide well-rounded feedback and increase the likelihood of disconnects of perception between raters and between raters and candidates. Limitations in the feedback provided in natural circumstances are hard to detect because observers can always provide some feedback and remain unaware of overlooked aspects of performance. This suggests that efforts to improve feedback through observer training will be insufficient because overcoming limitations induced by working memory capacity requires restructuring the tasks we impose on observers. While problematic or ineffective feedback has been attributed to raters’ limited skill, emotions, or poor insight, one additional and potentially causal mechanism may be the complexity of the demands placed on our raters when they are asked to evaluate many aspects of competence simultaneously.23,24
That said, there is a need to ensure that candidates satisfactorily achieve all competencies regardless of the cognitive limitations of their assessors. Validity frameworks require that all relevant constructs be adequately represented, thereby creating a new challenge if we must reduce what we ask raters to consider at a point in time.19,25 It makes no more sense, however, to emphasize construct representation over raters’ inherent capacity than it does to suggest that an invalid clinical procedure should be used because it is easier to apply than a more sensitive and specific diagnostic tool. If performance declines by requiring raters to consider all dimensions of performance simultaneously in ways that may diminish the value or utility of the assessment activity, then the fundamental goals of the assessment will not be fulfilled regardless of how comprehensive the construct representation.
To explore this issue further, researchers will need to determine how to operationalize assessment activities that distribute focus across raters and faculty without fundamentally undermining the feasibility of assessment practices. Simulation-based settings offer opportunity for raters to consider only segments at a time, but the same is not true in work-based settings. In real clinical contexts, asking raters to consider the variety of competencies, one portion at a time, in a sequential model provides one avenue of exploration through which this tension might be resolved. Whether it is important to have multiple assessors involved in the sequential observations or to prespecify which competencies observers should focus on, as we did in this study, remains to be determined. A further area in need of exploration is whether different people need to be involved or if the sampling strategies outlined here can be operationalized by one individual over several points in time.
Our findings should be considered in the context of the study’s limitations. First, we chose a common construct and case stimulus to experimentally test our hypothesis, but it is possible that other combinations of tools, item pairings, and stimuli may lead to better or worse outcomes. Similarly, using more than six or fewer than two dimensions may lead to different findings, although we presume these two levels to represent points on a continuum. Second, this study was completed using a simulation-based assessment with videos of performances as stimuli. This eliminates many of the factors that would exist in workplace-based assessments, where contexts are generally more complex but can also be more authentic. Third, in this study we asked multiple raters to assess the same performance. That was important to enable exploration of the focal issue, but may not be practical in simulation- or workplace-based settings. Our intention was not to replicate those environments precisely but, rather, to test the hypotheses we generated based on a program of research and the conceptual framework considered. Whether it matters that different raters completed the three versions of the two-dimensional rating form should be determined before interventions are designed on the basis of our results. Fourth, as is the case in all rater studies, there may have been unidentified rater variables that influenced the outcomes observed. Finally, as we described above, there is no standard for the evaluation of feedback, preventing us from claiming with certainty that the differences observed here necessarily translate into better learning outcomes on the part of the feedback recipients. We considered as many codes of feedback as we could with the intention of determining what changes, rather than proving that the feedback from one condition or the other would be more fitting for that purpose.
Crossley and colleagues have advocated for assessment processes that “reflect cognitive structuring.”26 In doing so, they mount an argument that the interaction between the rater and the performance observed has largely been ignored by noting that “the most remarkable observation might be in how irrational we have been to date with work based assessment instruments and processes.” Asking raters to consider broad constructs without taking into consideration their inherent limitations (which are often masked) may be yet another example of how irrational we have been. As health professions education increasingly advocates for meaningful observer contributions as part of assessment practices, a continued emphasis on understanding faculty behaviors and limitations is needed. In many assessment contexts, raters have been asked to deliberately expand their focus to ensure appropriate assessment coverage of all competencies expected of modern practitioners while promoting a degree of efficiency. On the surface, such changes might appear to be beneficial or innocuous. However, the results of this study suggest that what raters contribute when having to consider many dimensions of performance simultaneously is feedback that is lessened in quantity, breadth, depth, and diversity. Relative to aggregating the ratings of multiple raters who are asked to consider fewer dimensions of performance, simultaneous assessment of many dimensions generated scores with lower reliability. Therefore, sequential or distributed assessment strategies, where raters are either asked or allowed to limit their focus, may optimize formative and summative assessment efforts. In other words, when planning rater-based assessments of clinical competence, asking for less may get you more.
The authors wish to thank Centennial College for supporting this study, as well as Karen McIntyre and Fontana Lim for their assistance in completing the data collection.
1. Eva KW, Bordage G, Campbell C, et al. Towards a program of assessment for health professionals: From training into practice. Adv Health Sci Educ Theory Pract. 2016;21:897–913.
2. van de Ridder JM, Stokking KM, McGaghie WC, ten Cate OT. What is feedback in clinical education? Med Educ. 2008;42:189–197.
3. Govaerts MJ, Van de Wiel MW, Schuwirth LW, Van der Vleuten CP, Muijtjens AM. Workplace-based assessment: Raters’ performance theories and constructs. Adv Health Sci Educ Theory Pract. 2013;18:375–396.
4. Gofton WT, Dudek NL, Wood TJ, Balaa F, Hamstra SJ. The Ottawa Surgical Competency Operating Room Evaluation (O-SCORE): A tool to assess surgical competence. Acad Med. 2012;87:1401–1407.
5. Bandiera G, Sherbino J, Frank J. The CanMEDS Assessment Tools Handbook: An Introductory Guide to Assessment Methods for the CanMEDS Competencies. Ottawa, Ontario, Canada: Royal College of Physicians & Surgeons of Canada; 2006.
6. Kogan JR, Holmboe ES, Hauer KE. Tools for direct observation and assessment of clinical skills of medical trainees: A systematic review. JAMA. 2009;302:1316–1326.
7. Tavares W, Ginsburg S, Eva KW. Selecting and simplifying: Rater behavior when considering multiple competencies. Teach Learn Med. 2016;28:41–51.
8. DeNisi A. A Cognitive Approach to Performance Appraisal: A Program of Research. New York, NY: Routledge; 1996.
9. Tavares W, Eva KW. Impact of rating demands on rater-based assessments of clinical competence. Educ Prim Care. 2014;25:308–318.
10. Tavares W, Eva KW. Exploring the impact of mental workload on rater-based assessments. Adv Health Sci Educ Theory Pract. 2013;18:291–303.
11. Kane MT. Validity. In: Brennan BL, ed. Educational Measurement. Westport, CT: Praeger Publishers; 2006.
12. Wickens CD. Multiple resources and mental workload. Hum Factors. 2008;50:449–455.
13. Kool W, McGuire JT, Rosen ZB, Botvinick MM. Decision making and the avoidance of cognitive demand. J Exp Psychol Gen. 2010;139:665–682.
14. Botvinick MM, Rosen ZB. Anticipation of cognitive demand during decision-making. Psychol Res. 2009;73:835–842.
15. Shah AK, Oppenheimer DM. Heuristics made easy: An effort-reduction framework. Psychol Bull. 2008;134:207–222.
16. Sargeant J, Lockyer J, Mann K, et al. Facilitated reflective performance feedback: Developing an evidence- and theory-based model that builds relationship, explores reactions and content, and coaches for performance change (R2C2). Acad Med. 2015;90:1698–1706.
17. Tavares W, Boet S, Theriault R, Mallette T, Eva KW. Global rating scale for the assessment of paramedic clinical competence. Prehosp Emerg Care. 2013;17:57–67.
18. Govaerts MJ, van de Wiel MW, van der Vleuten CP. Quality of feedback following performance assessments: Does assessor expertise matter? Eur J Train Dev. 2013;37:105–125.
19. Kane M. Validating score interpretations and uses. Lang Test. 2012;29:3–17.
20. van der Vleuten CP, Schuwirth LW, Scheele F, Driessen EW, Hodges B. The assessment of professional competence: Building blocks for theory development. Best Pract Res Clin Obstet Gynaecol. 2010;24:703–719.
21. Ende J. Feedback in clinical medical education. JAMA. 1983;250:777–781.
22. Wickens CD, Carswell CM. Information processing. In: Salvendy G, ed. Handbook of Human Factors and Ergonomics. Hoboken, NJ: Wiley and Sons Inc.; 2012.
23. Kogan JR, Conforti LN, Bernabeo EC, Durning SJ, Hauer KE, Holmboe ES. Faculty staff perceptions of feedback to residents after direct observation of clinical skills. Med Educ. 2012;46:201–215.
24. Telio S, Ajjawi R, Regehr G. The “educational alliance” as a framework for reconceptualizing feedback in medical education. Acad Med. 2015;90:609–614.
25. Messick S. The interplay of evidence and consequences in the validation of performance assessments. Educ Res. 1994;23:13–23.
26. Crossley J, Johnson G, Booth J, Wade W. Good questions, good answers: Construct alignment improves the performance of workplace-based assessment scales. Med Educ. 2011;45:560–569.