Relying on clinical supervisors to observe and assess trainees’ clinical performances has been an essential component of assessment in medical education. One of the limitations of this form of assessment is excessive variability in the ratings and judgments provided by supervisors. In fact, variance attributed to the idiosyncratic ratings of the supervisors often exceeds the variance accounted for by differences in trainees.1–3 From a psychometric perspective, this can be problematic, as it is not unusual for the majority of clinical rating variances to be considered unusable “error.”4–6
We have raised the possibility in a previous article7 that variability in clinical ratings could be introduced through the cognitive processes used in making social judgments (i.e., inferences about the performer and the underlying reasons for the performance observed). Recent research8,9 suggests that supervisors do make such inferences during clinical assessments. However, the relationship between the formation of these social judgments about a trainee and supervisors’ impressions regarding that trainee’s clinical competence has not yet been specified.10
There are an infinite number of social judgments that could be made,11 raising concerns that as many social judgments could be formed about an individual as there are people providing judgments. Mohr and Kenny12 have found, however, that social judgments are much less idiosyncratic than that statement implies. They asked 69 study participants to view a brief video recording of a target person and to offer descriptions of that target person. The resulting descriptions included a wide variety of social judgments, yet that array of impressions could be easily organized into three coherent categories of description that were not only distinct from each other but actually represented conflicting social inferences and personality attributions. This finding replicated across six different targets, although for each target there was a different set of distinct categories of social judgments. This finding was supportive of earlier work13,14 that proposed that people spontaneously create a narrative account of a target person that typically includes causal explanations for the observed behaviors. Importantly, these causal explanations and attributions were able to account for variance in numeric ratings of the target individual that would normally be attributed to rater idiosyncrasy.12,13
Mohr and Kenny’s12 method, therefore, offers an opportunity to explore variability in clinical performance assessments and to examine the extent to which such variability reflects different social judgments. If such clustering of disparate social judgments does exist within subgroups of raters observing the same performance, it raises the possibility that there may be multiple “signals” in the “noise” of rater variability that are not currently being recognized, much less understood. This would, in turn, raise questions about whether these multiple signals should be treated as error to be eliminated or information to be exploited as we construct our assessment of a trainee’s competence. The first step, however, is determining the extent to which these multiple signals do exist and to establish methods to effectively document their impact on rater assessments.
The primary purpose of our study, therefore, was to explore the utility of Mohr and Kenny’s12 methodology to understand rater idiosyncrasy in clinical performance assessment. The specific research question was, What proportion of variance in clinical competence ratings can be explained by raters’ development of one of a few distinct social impressions about the performer? We attempted to match Mohr and Kenny’s12 conditions as closely as possible in the context of clinical performance assessment. It was necessary, however, to make some modifications to their methodology by using latent partition analysis (LPA) as a statistical method to identify the common categorizations of social judgments made by physician raters, as will be explained below. Thus, a secondary purpose of the study was to introduce LPA as a statistical method to the medical education literature.
The first component of this study involved the selection of videos of clinical performances and the development of an online system to present the videos and collect data from physicians. Data collection and analysis then proceeded in four phases.
- In Phase 1, physicians assessed the video-recorded clinical performances by providing Mini-CEX ratings and narrative responses in the online data collection system.
- In Phase 2, additional participants (who were naïve to the videos, the clinical ratings, and the purpose of the study) reviewed and sorted the physicians’ narrative responses for each clinical performance into piles based on the similarity of the social judgments being described.
- Phase 3 involved statistical analyses of the sorted piles using LPA to determine whether the narrative responses could sensibly be grouped into categories and then to help define the number and composition of categories that best grouped the descriptions of social judgments for each clinical performance.
- In Phase 4, for each clinical performance, we treated the categories generated by LPA as the independent variable in a one-way ANOVA to determine the amount of variance that could be explained in the overall clinical competence ratings on the Mini-CEX.
To identify an appropriate set of video performances to be assessed by raters in the study, we collected video recordings of clinical trainees interacting with patients that had been developed and used by other medical education researchers,8 along with videos that had been posted to YouTube for educational purposes. From these, we selected for pilot testing 11 videos representing a range of primary-care-related topics. Each video depicted one trainee and one patient (no examiner present) and plausibly represented a second-year internal medicine resident (6 were scripted at this level,8 4 were produced as study aids for medical licensing exams, and 1 was a practice session with a standardized patient for an undergraduate OSCE).
We then piloted the 11 videos to select the most appropriate set of stimuli for the study. Pilot participants were known by us to have an interest in medical education research along with experience in assessment administration and design and/or with assessing residents’ clinical skills. Eighteen participants, consisting of 15 physicians, a standardized patient trainer, an experienced standardized patient, and a clinical psychologist, were asked to review, score, and comment on the videos and provide feedback on their experience with these three tasks during four rounds of pilot testing. On the basis of their responses, seven videos (3.5–7 minutes in length) representing a spectrum of competencies and interpersonal skill levels were selected as stimuli for Phase 1 (see Table 1). Participants from the pilot were not eligible to contribute responses in the subsequent phases of this study.
Phase 1: Data collection from physicians
Emergency, internal, and family medicine physicians associated with the University of British Columbia and the University of Toronto Faculties of Medicine who had experience assessing residents were invited to participate after we received approval from the respective research ethics boards.
Physicians who responded to our recruitment requests were directed to the online data collection system. After they provided informed consent to participate, they received this prompt for the first video:
In the following video you will be shown a portion of a clinical encounter between a second-year internal medicine resident and a patient. Please watch the video a single time and, based on the information it contains, answer the questions that follow.
The same template was used for all seven videos.
After viewing each video, physicians completed Mini-CEX ratings using the form commonly applied.15 It consisted of seven dimensions (medical interviewing skills, physical exam skills, humanistic qualities/professionalism, clinical judgment, counseling skills, organization/efficiency, and overall clinical competence) with nine-point scales (1–3 unsatisfactory; 4–6 satisfactory; 7–9 superior). This was followed by three open-text questions:
- Please comment on this resident’s clinical competence (about one paragraph). Include any specific behaviours, clinical skills, errors, omissions or other criteria that influenced your ratings.
- Based on your experience, how would you complete this statement? “Oh, I know this type of resident. They’re the type that [fill in the blank] …”
- Now we would like you to be subjective and speculate about what type of person this resident may be. Please take a moment to imagine how someone might perceive this resident’s personality, state of mind, intentions, or beliefs. Feel free to include your own first impressions in this description.
The three questions were designed to elicit responses consistent with each of three different conceptualizations of social categorization outlined in a previous article.7 Only responses to the third question were presented to sorters in Phase 2 (see below) because that is the question (after modifications made through pilot testing to ensure that participants’ focus remained on the social judgments) used by Mohr and Kenny to elicit social judgments.12,16
Phase 2: Sorting of physicians’ social judgment descriptions
We posted notices in the medical building at the University of Northern British Columbia to recruit 14 research participants to be sorters after receiving approval from the research ethics board. Because the sorting task required only reading comprehension skill, participants were required to be over the age of 18 and fluent in English, but clinical knowledge was not necessary.
Every open-text response to Question 3 was printed on a separate slip of paper. For each video, the slips of paper were randomly compiled into stacks.
Each participant was given the stack of responses for a single video and asked to freely sort them into piles according to these sorting instructions:
Take a slip of paper from the stack, read the description, and place it on the table. Now read the description on the next slip of paper. If this description is part of the same story about the resident as that first description, then place it in the same pile. If this description is part of a different story about the resident, then start a new pile. Continue doing this for all the descriptions. You can use as many piles as you like and you can rearrange them as often as you’d like.
This process was repeated for each video. To counteract learning and order effects, half of the sorters worked through Videos 1 to 7 and the other half worked through Videos 7 to 1.
In the Mohr and Kenny12 report, a pair of sorters independently created piles and summarized the descriptions represented by each pile before meeting to reach consensus about the ideal number of piles and their respective meanings. It was necessary for us to deviate from this procedure when our first pair of sorters could not reach consensus. Each sorter had carefully constructed a different number of piles, and each was committed to those constructions’ being the best representations of the responses. As a result, the decision was made not to force participants to reach an agreement on the ideal number of piles. Instead, the remaining sorters completed the task without any interaction with other participants, and the composition of each of the 14 sorters’ piles and their accompanying summaries were recorded. This technique is known as an “F-sort” in LPA methodology.17,18
Phase 3: LPA
The decision to record sorters’ independently generated piles meant that we needed a data summarization technique to identify any common and underlying division points. LPA is a categorization methodology developed to study classes of qualitative information.17 It hypothesizes that there exists a set of latent partitions, or common underlying categorizations, for a group of items. It is assumed to be more probable for individual sorters to place items from the same latent (or underlying) category into the same pile.18 LPA allows empirical investigation of the content elements, number, and size of latent categories, as well as quantification of the relationships among the latent categories.18 It is important to note that in LPA methodology, disagreement between sorters regarding the ideal number or composition of piles in the F-sort is expected. As such, the lack of consensus we observed is not considered to be an indication of flawed sorting but representative of the multiple ways in which items can be categorized. For example, if sorters were asked to divide a set of objects into piles, some might group them only by size, others only by function, and others by both size and function. Although the individual F-sorts might look very different from one another, LPA calculations can still reveal what items were more often grouped together.
The first step in LPA is to tabulate each sorter’s piles by creating a matrix with the items (the physicians’ responses to Question 3 in this instance) listed as both row and column headings. Each cell is filled in by asking, “Did this sorter put these two responses into the same pile?” (Yes = 1, No = 0). LPA calculates averages across all of the sorter matrices to determine the proportion of sorters who placed each pair of responses into the same pile and determine which responses are consistently sorted together. Subsequent LPA calculations17,18 use these values to detect patterns in the sorting behavior across the participants. Responses that are consistently combined into the same pile are considered to form a latent category.18
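The tabulation step described above can be sketched in a few lines of Python. This is a minimal illustration only, not the R program used in the study: the function names (`cooccurrence_matrix`, `mean_cooccurrence`) and the toy sorts are hypothetical, and responses are assumed to be identified by integer indices.

```python
def cooccurrence_matrix(piles, n_items):
    """Build one sorter's 0/1 matrix: 1 if two responses share a pile."""
    matrix = [[0] * n_items for _ in range(n_items)]
    for pile in piles:
        for i in pile:
            for j in pile:
                matrix[i][j] = 1
    return matrix


def mean_cooccurrence(all_sorts, n_items):
    """Average the sorter matrices to get, for each pair of responses,
    the proportion of sorters who placed the pair in the same pile."""
    total = [[0.0] * n_items for _ in range(n_items)]
    for piles in all_sorts:
        matrix = cooccurrence_matrix(piles, n_items)
        for i in range(n_items):
            for j in range(n_items):
                total[i][j] += matrix[i][j]
    n_sorters = len(all_sorts)
    return [[value / n_sorters for value in row] for row in total]


# Hypothetical toy data: three sorters, five responses (indices 0 to 4).
# Each sorter's F-sort is a list of piles; sorters may use different
# numbers of piles.
toy_sorts = [
    [[0, 1], [2, 3, 4]],
    [[0, 1, 2], [3, 4]],
    [[0, 1], [2], [3, 4]],
]
proportions = mean_cooccurrence(toy_sorts, 5)
```

In this toy example, responses 0 and 1 are grouped together by every sorter, whereas response 2 is grouped with response 3 by only one of the three sorters. The later LPA calculations operate on proportions of this kind.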
The mathematical technique of LPA has some computational similarities to factor analysis, although LPA is appropriate for categorical data and produces categorical structures, whereas factor analysis is used for scale data and produces dimensional structures.18 Analogous to factor analysis, the researcher can specify a range of partitions to be made. As part of determining the ideal number of partitions (i.e., categories) that summarize the responses, two output matrices are produced to indicate how well each requested number of categories fit the data. A “phi matrix” specifies the content or composition of the latent categories. It provides values to specify how strongly each item belongs to each latent category. The “omega matrix” quantifies how cohesive each category is and how much it overlaps or gets confused with all the other categories in a given set. This matrix shows the probability of items being placed into their assigned category and the probability of them also being placed into another category within the set (i.e., the “confusion” probability).18
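To give a feel for what the omega matrix summarizes, the sketch below computes a simple descriptive analogue: the average same-pile proportion within and between categories, given a fixed assignment of responses to categories. This is an assumption-laden illustration, not the LPA estimator itself (which estimates the latent categories from the sorting data). The function name `block_summary`, the toy proportions, and the assignment are all hypothetical.

```python
def block_summary(proportions, assignment):
    """Average pairwise same-pile proportion for each pair of categories.

    proportions[i][j] is the share of sorters who put responses i and j
    in the same pile; assignment[i] is the category label of response i.
    Diagonal entries (a, a) indicate cohesion; off-diagonal entries
    (a, b) indicate how often the two categories are confused.
    """
    labels = sorted(set(assignment))
    n_items = len(assignment)
    summary = {}
    for a in labels:
        for b in labels:
            values = [
                proportions[i][j]
                for i in range(n_items)
                for j in range(n_items)
                if i != j and assignment[i] == a and assignment[j] == b
            ]
            # A category with a single response has no within-category pairs.
            summary[(a, b)] = sum(values) / len(values) if values else float("nan")
    return summary


# Hypothetical toy proportions for four responses in two categories.
toy_proportions = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.6],
    [0.1, 0.2, 0.6, 1.0],
]
toy_assignment = [0, 0, 1, 1]
omega_like = block_summary(toy_proportions, toy_assignment)
```

High diagonal values with low off-diagonal values would correspond to cohesive, distinct categories; a large off-diagonal value would correspond to the kind of overlap described in the text.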
As with factor analysis, LPA does not indicate the set of categories that provides the best model. The phi and omega matrices provide numerical parameters to indicate how well each set of categories fits the data, but often there are multiple sets that fit reasonably well. Ultimately, as in factor analysis, the researcher must review the content of the items to determine which set of categories offers the most meaningful groupings.18 We used these procedures to determine the best-fitting set of categories for each video, with each physician’s response thereby being assigned to one of the categories. RStudio was used to interface with R version 3.0.1 (Boston, Massachusetts)19 to conduct the LPA. Throughout Phases 2 and 3, the sorters and the researchers (who made the final LPA decisions) were blinded to the Mini-CEX ratings provided by the physician participants.
Phase 4: ANOVA
For each video, the assignment of each response to a category was then used as an independent variable in a one-way ANOVA to determine the proportion of variance explained (partial eta-squared) in the “overall clinical competence” ratings assigned using the Mini-CEX scale. IBM SPSS Statistics 21 (Armonk, New York) was used for ANOVA calculations.
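The calculation in this phase can be illustrated with a short standard-library Python sketch. It is not the SPSS procedure used in the study; the function name `one_way_anova` and the ratings are hypothetical. Note that in a one-way design there is only one effect, so partial eta-squared equals eta-squared (between-groups sum of squares divided by the total sum of squares).

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within, eta_squared) for a one-way ANOVA.

    groups is a list of lists; each inner list holds the ratings given
    by the physicians whose responses fell into one category.
    """
    all_values = [x for group in groups for x in group]
    n_total = len(all_values)
    grand_mean = sum(all_values) / n_total

    ss_between = 0.0
    ss_within = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        ss_between += len(group) * (group_mean - grand_mean) ** 2
        ss_within += sum((x - group_mean) ** 2 for x in group)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    # With a single factor, partial eta-squared and eta-squared coincide.
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, df_between, df_within, eta_squared


# Hypothetical overall competence ratings (1-9 scale) for three categories.
toy_ratings = [
    [2, 3, 2, 3],
    [5, 6, 5, 6],
    [7, 8, 8, 7],
]
f_stat, df_between, df_within, eta_squared = one_way_anova(toy_ratings)
```

The toy ratings are deliberately well separated, so the proportion of variance explained is far higher than anything reported in this study.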
This study was approved by the behavioral research ethics board at the University of British Columbia, the health sciences research ethics board at the University of Toronto, and the research ethics board at the University of Northern British Columbia.
Phase 1: Data collection from physicians
A total of 48 physicians reviewed, scored, and commented on at least one video, and 34 physicians completed these tasks for all seven videos. They received (or had donated to a charity on their behalf) a $100 honorarium. Considerable variability in Mini-CEX ratings was observed (see Table 2), consistent with findings of a previous study20 that used four of the same videos. Seven responses (2.6% of all responses) to the social judgment question were blank (0–2/video) and, hence, could not be used in the F-sort. There was no obvious systematic pattern to the missing data.
Phase 2: F-sort outcomes
As planned, 14 individuals with diverse backgrounds, including university employees and assorted students (undergraduate, graduate, medical), were recruited to be sorters. Each sorter performed the seven F-sorts in two to three hours and received a $75 honorarium. The sorting task was completed in groups of two to four people working independently at a large table. Dialogue between sorters was minimal and did not influence the sorting task. No relationship between the sorter’s background and proficiency in completing the F-sort was observed.
As shown in Table 3, most sorters used 3 to 4 similarly labeled piles to group the responses for most videos (range 2–11 piles). The resemblance between piles constructed by different sorters can be seen in Chart 1 where, using Video 6 as an example, the abbreviated summaries provided by each sorter for their own piles of physicians’ descriptions of social judgments are shown.
Sorters used piles to group together responses containing similar descriptor words, such as words describing personality traits. Responses using the words “arrogant” or “overconfident” to describe the resident in Video 6 were often put into the same pile, but a different pile was used for responses using the descriptor “lazy.” Notably, the content of the descriptions in one pile often conflicted with the content of the descriptions included in another pile within the same set (e.g., “dismissive” versus “good communicator”).
Phase 3: LPA
Physician participants did not describe a single common social impression for any of the seven residents’ performances. However, even though physicians were asked to provide very subjective and potentially idiosyncratic information, the number of distinct social judgments was small relative to the number of raters contributing descriptions. In other words, there was more consensus in the social judgments and impressions than there was idiosyncrasy.
Because LPA results will be unfamiliar to many readers, we describe below the interpretation of the key output for one video in detail (see Figure 1 for an overview) before providing a summary of the findings for the remaining videos.
In Video 6, the resident presented with poor clinical skills and poor interpersonal skills. The young male patient described a three-week history of low back pain radiating down one leg that was unresolved with pain medication. The resident’s history was not thorough; he displayed closed body language and assured the patient that this was a straightforward case of a common problem that would improve on its own with time. Although this performance received a large range of ratings, from 1 to 8, over 75% of physicians rated it as unsatisfactory.
When the physicians’ social judgment descriptions were divided into two categories, the phi matrix (not shown) indicated that four responses did not belong to either of the resulting categories, and one barely fit into both. Again, by way of analogy, this is equivalent to a factor analysis illustrating two factors with five items that load suboptimally. The omega matrix indicated that both categories were somewhat cohesive (shown in Figure 1, near the bottom of the two-partition column, as the bolded values 0.54 and 0.61 on the diagonal of the omega matrix). Review of the descriptive content of the responses, when grouped together in a two-partition model, revealed that there were contradictory statements included within each of the two categories. For example, Category 2 contained descriptions of the resident going “beyond expectations to help” the patient as well as “not attempt[ing] the most basic of medical skills.” This indicates a poor fit.
When a three-partition model was specified (see Figure 1, column 2), the cohesion of the categories (again, illustrated by the bolded numbers on the diagonal of the omega matrix) was largely unchanged. Categories 1 and 2 overlapped with one another (shown in Figure 1 by the off-diagonal value of 0.28 in the second matrix), while Category 3 was distinct. Category 3 was composed of six responses, five being those that did not fit well in the two-partition model. These six responses described the resident as friendly, helpful, and competent. Category 1 items described the resident as lazy and dismissive, and Category 2 items described the resident as arrogant and careless.
When specifying a four-partition model (see Figure 1, column 3), cohesion generally increased. Category 1 overlapped with Categories 2 and 4. Within each category, some responses loaded strongly only onto that one category. On the basis of these responses, it was determined that Categories 1, 2, and 4 all described an element of laziness, but each provided a different explanation for it. More specifically, Category 1 suggested that the laziness was an inherent characteristic of the resident or a deliberate choice of action, Category 2 inferred arrogance or overconfidence as a reason for the resident’s dismissing the concerns, and Category 4 described the resident’s behavior as a habit of using a superficial approach to patient care (see Table 4).
Within each of these categories there were also responses that loaded onto a secondary category in addition to their primary category. By reviewing the content of these items, we could identify the concept overlapping both categories. For example, the concept of being “disinterested” overlapped between Categories 1 and 2. Between Categories 1 and 4, the shared concept was “not being diligent.” Category 3 remained distinct and contained four of the responses that did not fit in the two-partition model. These responses described a very different resident: one who was reassuring and helpful and comfortable communicating with people. Without going into detail, the five-partition model (see Figure 1, column 4) did not fit well. Category 5 (careless and dangerous) was redundant with Category 1 (incompetent because lazy). As such, the four-partition model was determined to have the best fit.
The same process was used to interpret the LPA outputs for all seven videos. The best-fitting models are summarized in Table 4. We will briefly discuss a few findings that were shared across multiple videos. In describing Video 6 above, we saw how LPA revealed three different explanations for perceived laziness. This finding of categories containing different explanations or inferred reasons for the resident’s performance occurred in multiple videos. For example, in Video 3, three of the five categories were composed of responses describing the resident and his performance as awkward or hesitant, but each of the three categories had a distinct explanation for it: (1) not preparing well enough for the task, (2) feeling uncomfortable because of inexperience with the difficult task, or (3) being a distant person.
Again in Video 4, nearly all of the physicians commented that the resident did not fully connect with or understand the patient. But, when grouped into the four categories specified by LPA, there are four different proposed reasons for it: (1) uncomfortable with this sensitive topic, (2) a judgmental and overconfident person, (3) focused on task efficiency, or (4) a distant person.
Comparable explanations appear in the categories for Video 5 as reasons for a suboptimal performance by an unempathetic resident: (1) a lazy person, (2) lacks training in this difficult task, (3) not developmentally ready for this task, or (4) a distant person. Of note, although all three videos had the same category “distant person,” these responses were not provided by the same subgroup of raters (17 such responses were provided by 13 different physicians).
In summary, inspection of all seven sets of categories identified by LPA showed that each category is a coherent collection of descriptions of social judgments. The categories resemble impressions and often contain inferred reasons for the resident’s performance, known as causal explanations. More than one impression was described for every resident, and the descriptions of the social judgments for the same resident often contained conflicting information. Although over 34 physicians were asked to provide descriptions of social judgments, LPA helped determine that there were only two to five social impressions described for each resident.
Phase 4: ANOVA
The categorical (i.e., partition) assignments of each physician’s responses resulting from Phase 3 were used as the independent variable in a one-way ANOVA to determine the amount of variance in the “overall clinical competence” rating from the Mini-CEX that was accounted for by the categories. As shown in Table 4, the partial eta-squared ranged widely from 9% to 57% with a mean of 32% across videos. Using Video 6 as the example again, grouping physicians’ ratings into the four categories explained 53% of the variance in overall clinical competence ratings [F(3, 30) = 11.34, P < .05, partial η² = .531]. The set of categories for five of the seven videos had mean ratings that were significantly different from each other (see Table 4), and a Bonferroni correction was used to make post hoc comparisons. For Video 7, the content of significantly different categories seems to differ in the description of the resident’s competence (Category 1 compared with Category 2). However, for Videos 1 and 3, the significantly different categories differed in terms of the social judgments being made about the resident. For example, in Video 3, the physicians who described the resident’s personality as distant and detached (Category 2) gave significantly lower ratings than did the physicians who described him as a warm person (Category 1) or as a good person (Category 5). For Video 1, every rater described the resident’s performance as coldly efficient. The category where her perceived motive was described as a goal of productivity (Category 1) had significantly higher ratings than the category where her behavior was attributed to loss of interest in the task (Category 2). Videos 2 and 6 had categories with content that differed in both descriptions of competence and social judgments.
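As a consistency check on the Video 6 result, the proportion of variance explained can be recovered from the reported F statistic and its degrees of freedom. The identity below is the standard relationship between F and partial eta-squared for a single effect; the symbols df_1 (between-groups) and df_2 (within-groups) are introduced here for illustration only.

```latex
\eta_p^2 \;=\; \frac{F \cdot df_1}{F \cdot df_1 + df_2}
        \;=\; \frac{11.34 \times 3}{11.34 \times 3 + 30}
        \;=\; \frac{34.02}{64.02}
        \;\approx\; 0.531
```

This agrees with the 53% of variance reported for the four-category model of Video 6.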
In our study, dozens of physicians were asked to perform a potentially idiosyncratic task: describe the social judgments that could be made while assessing a video-recorded clinical performance. The social judgments that were described could be grouped on the basis of their similarity into a discrete set of categories resembling social impressions. Consistent with the social cognition literature, more than one impression was described for every resident, and the content across the various categories for the same resident contained not merely different but, in fact, conflicting social judgments and causal explanations. Importantly, however, the social judgment descriptions were not unique to individual raters and were replicated across many raters. Thus, despite a possibility for each physician participant to describe unique social judgments, LPA helped determine there were as few as two and no more than five distinct social impressions described for each resident. Thus, we were able to conceptually replicate the main finding from Mohr and Kenny.12 The difference between the categories within the set for each performance often focused on a different inferred reason for the resident’s performance, known as a causal explanation, and this finding is also consistent with those of past research.13,14
More important, in terms of assessment implications, there was a tendency for subgroups of physicians who had described similar social judgments to have also given more similar performance ratings. Accounting for these different social judgments for the same resident often explained significant variance in Mini-CEX ratings across the seven performances (9%–57%). Given that multiple physicians collectively described only a small but distinct number of social judgments and that those differing judgments were often associated with different ratings, perhaps some of the “error” variance in ratings is systematic and relevant. In other words, if multiple physicians describe the same social judgments, maybe there is something within the performance that could be noticed by others, such as patients. If multiple, distinct, and often-conflicting judgments are described for the same performance but such judgments are described by multiple people, could that consensus possibly represent multiple “signals” about the resident rather than “noise” from the rater?
These descriptions came in response to an explicit request for social judgments, a request that runs counter to pro forma Mini-CEX procedures. Previous research,8,9 however, has also found social inferences mixed with clinical assessment judgments, suggesting that they do naturally co-occur. It must also be noted that the design of this study does not allow causation to be determined. For example, although responses can be grouped together, and accounting for these different groupings can explain some rating variance, we cannot determine whether anything within those groupings caused the different ratings.
Similarly, we cannot determine the accuracy of any of the social judgments, categories, and/or impressions because of the lack of a comparable standard. As noted previously, some modifications were needed to transfer the methodological procedures from a social cognition context to a clinical assessment context. In particular, we required the methodology of LPA because the sorters could not reach consensus on the ideal number of categories describing each resident. However, when used, this analysis did enable the identification and discrimination of internally consistent groupings of social judgments with statistical and conceptual coherence. In contrast to Mohr and Kenny’s12 finding that three categories were consistently the ideal number to best represent the impressions being made, the exact number of categories varied between performances in our study. This could be due to the conflation of clinical competency judgments with social judgments, the use of a smaller dataset, or the modifications made to the original methodology. Despite these limitations, our results support previous findings that people form one of a finite number of impressions when perceiving others.12,13
It has been common to assume that different judgments of the same performance reflect rater biases and thus should be treated as error variance. If consensus within multiple divergent judgments is consistently found, it will be important to investigate the legitimacy of the multiple judgments. If multiple legitimate judgments are possible, it may be necessary to support trainees to critically reflect on and integrate divergent pieces of feedback and to be aware that their performance can justifiably be perceived differently by different subgroups of people. In addition, this study investigated only one conceptualization of social categorization, but three have been previously described.7 Further analysis of all three must be done to directly compare their capacity to explain variability in ratings.
In conclusion, social judgments and impressions made by raters are typically viewed as sources of idiosyncrasy and, therefore, construct-irrelevant variability in performance ratings contributing to the “noise” in the measurement. However, our findings that idiosyncratic judgments tend to be finite in number and replicable across multiple raters suggest that multiple “signals” do exist within the “noise” of interrater variability in performance-based assessment. It may be valuable to understand and exploit these multiple signals rather than try to eliminate them.
Acknowledgments: The authors wish to thank Jennifer Kogan, MD, for her generosity in sharing her videos; the authors of the YouTube videos for giving permission to use their posts; Richard Wolfe, PhD, for sharing his LPA program for R; Scott Allen, PhD, for his instructions on how to use R; Jimmie Leppink, PhD, MSc, MSc, LLM, for his methodological guidance; Shiphra Ginsburg, MD, MEd, for her assistance with the data collection from Toronto; and the Survey Research Lab at the University of Northern British Columbia for their persistence in developing the data collection system.