Medical education has long struggled to find a way to take advantage of human observation to inform assessment of its professionals and trainees. Rater-based assessments are used because they allow students to be observed performing complex tasks corresponding to higher levels of competency.1,2 Common examples include objective structured clinical examinations (OSCEs),3 small-group tutorial assessments,4 and workplace assessments.5 Unfortunately, rater-based assessments generally demonstrate psychometric weaknesses6–9 including measurement errors of leniency,10 undifferentiation,11 range restriction,12 bias,13 and unreliability.14 One of the biggest threats to the reproducibility of clinical ratings, low interrater reliability,15,16 has been found to occur even when different raters view the same performance.17–20 In a dramatic example, 19 of 20 OSCE stations each had one to eight discrepancies where at least one rater made a positive evaluative comment about the presence or absence of a specific observable behavior, while another rater made a negative evaluative comment regarding the exact same behavior.21
While actual ratee performance differences attributable to context or case specificity are acknowledged to play a critical role in the complexities of rater-based assessment,22 their effects are well understood and accounted for in current assessment systems. Causes of variability in ratings, given by multiple raters for the same performance within the same context, are more uncertain, with considerable debate currently taking place about whether such variability can be overcome.23–25 The challenge is illustrated well by Marshall and Ludbrook,26(p215) who stated that “the judgment that an examiner makes of a candidate in the setting of the conventional test of clinical skills is an entirely personal one.” This assumption that raters are idiosyncratic has led to the development of solutions to help raters be more objective. Medical education researchers have redesigned rating scales,27 forms,11 and systems28 to help deter subjective biases and support rater judgments during assessments, but these solutions have had limited success.6,29,30 With raters identified as the problem, rater training has been the most persistently proposed solution.31 Rater training's meager improvement of measurement outcomes, however, has provoked some researchers to suspect that medical raters are impervious to training,7,32 suggesting that “some examiners are inherently consistent raters and others less so. The former do not need training and the latter are not improved by training.”33(p349)
Given the apparent intractability of this problem using our standard frameworks, it might be worth exploring other approaches to understand the manner in which people represent and make determinations about others. For example, a handful of medical education researchers have called attention to the importance of considering raters' social cognitive processes and corresponding implications concerning measurement of performance assessments. These authors have stressed the need to see raters as active information processors using judgment, reasoning, and decision-making strategies to assess ratees.34 They have also highlighted a complex interaction of impression formation, interpretation, memory recall, and judgment in assigning ratings.21 And several have described potential incongruence between assessment procedures, psychometric measurement principles, and human rater capabilities.2,35,36
The approach being explored by these authors is highly reminiscent of the impression formation literature, a large research domain within social cognition focused on understanding how individuals make judgments of others in social settings.37 Impressions are formed as part of knowing another person. They are constructed from factual information, inferences, and evaluative reactions regarding the target person.37 It has been suggested that impressions are used to organize information into a structure of knowledge about the person38 in order to interact with him or her.39 Social cognition researchers are interested in the specific cognitive processes used by people to think about the social world. They investigate how social information is encoded, stored and retrieved from memory, and structured and represented as knowledge; they also study the processes used to form judgments and make decisions.40
Interestingly, the idiosyncrasy of raters has also been of interest to impression formation researchers.41 In that literature, it is well established that different raters will often form different impressions of the same ratee even when given the exact same information.42,43 In fact, the descriptions made by a single rater about multiple others have been found to be more similar than the descriptions made by multiple raters about a single ratee.44 Typically, the largest portion of variance in personality trait ratings is not attributable to differences perceived between the ratees but to differences uniquely contained within the relationship between each rater and ratee.42,45 These parallel findings between the rater-based assessment literature and the impression formation literature suggest that social cognitive explorations of this phenomenon could be informative in better understanding the cognitive processes used by raters within the social context of rater-based assessments.21 In turn, such better understanding could inform new solutions for the limitations of these techniques.35
This paper represents a synthesis of related research domains focused on understanding the source of variance in social judgments. Although the measurement limitations in rater-based assessments undoubtedly stem from many complex factors, this paper explores the perplexing origins of rater variance when raters observe the same act. MEDLINE, ERIC, and PsycINFO were used to search for articles investigating social judgment processes including impression formation and associated sociocognitive processes. This paper is necessarily nonsystematic and nonexhaustive in order to present a preliminary understanding of vast literatures investigating problems analogous to those with rater-based assessments. Accordingly, the intent is to stimulate different ways of asking questions about the limitations of rater-based assessments prior to negotiating potential solutions. Because of space restrictions, the papers cited are a representative sample of larger bodies of research, and interested readers may want to consult their respective reference lists.
Within psychology literatures, the act of perceiving other people (i.e., forming impressions) is commonly described as a categorization task, though differences exist in the way in which these cognitive processes are thought to be enacted.46,47 Based on iterative readings of the social cognition literature, three themes emerged that encapsulate the differing conceptions of categorization as used in forming impressions of other people. These themes included the conceptualization of impression formation as the construction of Person Models, impression formation as a nominal categorization process, and impression formation as a dimensionally based categorization process. Each of these concepts will be elaborated, and potential implications for rater-based assessment in medical education will be highlighted, in the following sections.
Impression formation as idiosyncratic yet convergent Person Models
Social judgments have been found to be idiosyncratic and fallible under certain conditions.48 Psychology researchers have studied numerous variables that provide some understanding of why this is the case. For example, raters' mood and emotions at the time of the judgment can have an influence.49 If the ratee reminds the rater of a significant other, the ratee can be perceived to share similar characteristics.50 If the rater has recently been exposed to a description of the ratee, ambiguous behavior can be interpreted as being consistent with that description.51,52 Thus, there exists an implicit understanding that impressions are subject to variables and contextual factors beyond the ratee himself or herself.
Despite this expectation of rater idiosyncrasy in impression formation, however, there exists evidence that impressions will often be quite consistent across raters. One line of research, for example, has demonstrated that when raters were asked to write descriptions of a ratee based on their impressions, all descriptions for that ratee could be grouped into three representative stories (or “Person Models”) about that individual.42,45 The models are ad hoc descriptions of the ratee based on the rater's impressions formed from the information available. Importantly, although many stories can be generated, stories pertaining to any one individual tend to fall into one of three models, though the same three models are not relevant to every individual. To elaborate, in one study,45 69 participants viewed the same four-minute video of a ratee having a conversation with a friend and then with a family member. Participants provided written descriptions of what they thought about the ratee. Naïve participants subsequently reviewed all the descriptions and independently sorted them into groups based on similarity or shared meaning. Their groupings showed high agreement, and cluster analyses confirmed that, for each ratee, there were three distinct ways in which (s)he was described. Consider these three descriptions45(p341):
Model 1 (67.6% of descriptions): [Ratee E] is energetic, friendly, and expressive, although she is more outgoing with her friend than her mother. She seems to be a kind and considerate person who enjoys talking to others. She laughs a lot and has many ideas.
Model 2 (15.5% of descriptions): [Ratee E] is insecure and nervous. She seems distracted at times, and she has trouble making decisions. She plays with her pen a lot and keeps bringing up a trip she was supposed to go on last year.
Model 3 (16.9% of descriptions): [Ratee E] has to dominate the conversation. She is rude and obnoxious and seems insensitive to other people. She doesn't even say bless you when her friend sneezes. She seems self-centred and barely lets her friend talk.
Consistent with this example, for each ratee in the study the majority of participants had a tendency to describe the ratee using a particular Person Model. In each case, however, two other, sometimes vastly different, descriptions were also consistently given. Thus, although judgments are idiosyncratic, they are not infinitely so. It has been suggested that different combinations and prioritization of the pieces of information resulted in the different explanatory stories.42 In a follow-up study,45 the Person Models corresponded with ratings of liking and positive–negative evaluation such that raters using Model 1 viewed the ratee positively and liked her, whereas raters using Models 2 or 3 viewed her negatively and disliked her. The Person Models, therefore, were found to account for a substantial portion of the variance in impressions attributed to the unique relationship between the rater and the ratee—the variance often described as noise resulting from the idiosyncrasy of the rater.45
Impressions, and ratings, have often been regarded as personal to the rater and easily biased by various factors.7,53 If raters are forming Person Models as part of constructing a coherent impression about a ratee from the information they are receiving, and if there generally exist about three Person Models that are used for every ratee, this could help explain decreased interrater reliability in rater-based assessments while still yielding a sense of relative cohesion and coherence for each rater. And, it would lead to questioning whether and how the three possible, but highly divergent, models could ever be reconciled into a uniform set of ratings for an individual student. Before exploring these questions, however, it may be useful to examine some other social judgment conceptualizations.
Impression formation as a nominal categorization process
The Person Model shares many characteristics with theories that focus on the use of social categories as a way to decipher and integrate information about a ratee.47,53,54 Here, the focus is not on the ad hoc construction of narratives around a ratee's behavior; rather, the focus is on raters' tendencies to lump ratees into preexisting schemas. Categories are thought to be valuable in that they enable raters to apply preexisting knowledge to help understand incoming information about a person. Although there are clear and readily recognized dangers in overgeneralization (such as stereotypes), there are apparent benefits to categorization as well.46 With the use of categories, cognitive resources do not need to be used to monitor a ratee's category-consistent behavior. Instead, the rater only needs to note any category-inconsistent behaviors.55 Categorization of the ratee also allows the rater to go beyond the given information to infer other expected details consistent with typical category members.56 This can be useful to better understand the individual ratees, to make predictions about how they will behave, and to decide how best to behave when interacting with them.47 Consistent with the Person Model theories of impression formation, category-based knowledge is thought to act as a framework to provide possible explanations for why a ratee might display particular behaviors in a given situation. Accordingly, it has been suggested that categories could be thought of as a type of shorthand to explain what a group of people are like and why.57
Although the social categorization literature suggests that these categories can exist preformed in long-term memory,46 social categorizations of a person are thought to be flexible because any individual can be categorized in multiple ways.58 Consistent with the findings described above, this literature has found context to be important in determining which category of the many possibilities will be applied to the person.51,59 For example, a man carrying a baby in a grocery store may be categorized as a dad but in a hospital as a nurse. Researchers in this area have been particularly concerned with the question of how controllable category activation is. Some researchers argue that it is automatic and not controllable60; others have suggested that it is “conditionally automatic”61,62 or consciously controllable.63 Interestingly, there is evidence to suggest that intentionally trying to adjust social judgments to counteract categorization-based assumptions or trying to suppress categorical thinking can cause the categorizations to have more adverse influence on impressions.64 This has been repeatedly demonstrated, for example, with studies where raters who were trying to avoid the use of stereotypes ended up demonstrating more stereotypic thinking in subsequent trials65 and more stereotyped memories of the ratee.66 This suggests that, despite good intentions and the motivation to avoid categorizing people, doing so may not be completely possible and, when attempted, may not result in improved judgments.
If we were to accept that raters may be categorizing ratees as part of perceiving and forming an impression of them, this could have important implications for rater-based assessment. Perhaps the most intriguing implication is the resemblance of categories to nominal rather than ordinal or interval data. As a level of measurement, the nominal scale “classifies objects into categories based on some characteristic of the object.”67(p15) Nominal variables have categories but do not have an inherent, logical order, a true zero, or an equal interval between the categories. Assessment forms often require ordinal responses such as the selection of an ordered descriptive value on a behaviorally anchored scale, or interval responses such as the selection of a numerical value on a Likert-type rating scale. If raters are judging ratees by perceiving them as belonging to a particular category, then how do they translate that categorical judgment into a rating scale value?
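This translation problem can be made concrete with a small sketch. The scenario below is purely hypothetical: the category labels and the scale values each rater assigns to them are invented for illustration, not drawn from any of the cited studies. It shows how two raters who form the identical nominal judgment can nonetheless produce different scores once forced to convert that judgment onto a numerical scale.

```python
# Hypothetical illustration of the nominal-to-scale conversion problem.
# Two raters reach the SAME categorical judgment of a ratee, but each uses
# a different personal mapping from category to the 1-5 scale value that
# the assessment form demands. (Categories and mappings are invented.)
CATEGORY = "struggling learner"

# Each rater's idiosyncratic category-to-scale conversion table.
rater_a_mapping = {"star": 5, "solid": 4, "struggling learner": 3}
rater_b_mapping = {"star": 5, "solid": 3, "struggling learner": 2}

rating_a = rater_a_mapping[CATEGORY]  # 3
rating_b = rater_b_mapping[CATEGORY]  # 2

# Identical nominal judgments, divergent scale scores: the disagreement
# arises in the conversion step, not in the underlying judgment.
print(rating_a, rating_b, rating_a == rating_b)
```

Under this sketch, the interrater "error" recorded by the assessment system is entirely an artifact of the conversion step, even though the raters' categorical judgments agree perfectly.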
Impression formation as dimensionally based categorizations
In contrast to the literature focusing on nominal categorizations of individuals, a third conceptualization of categorization counterintuitively involves judgments made on dimensional scales. As is described more thoroughly in the following, people can appear to be placed into categories based on dichotomized judgments on two underlying dimensions. An extensive literature consistently identifies two orthogonal dimensions underlying social judgments that can account for the majority of variance in impression formation. In all studies, one of the dimensions refers to socially desirable or undesirable traits that directly impact on others. It includes positive traits such as friendly or honest and negative traits such as cold or deceitful. The second dimension has more variability across studies and refers to traits that tend to more directly influence the individual's success.68,69 It tends to include positive traits such as intelligent or ambitious and negative traits such as indecisive or inefficient. These dimensions have been given various labels, likely attributable, in part, to differing domains having been studied: warmth/competence,69,70 communion/agency,68,71 social/intellectual,72 other-profitability/self-profitability,73 morality/competence,74 and social desirability/social utility.75 Although the choice of labels for each of the dimensions may imply that researchers from different domains have identified very different dimensions, the researchers agree there is a common overlap of traits and behaviors.68–70,75,76
Interestingly, despite the speculation that there are two continuous, scaled dimensions underlying the process of social judgment, many researchers in the social judgment literature suggest that these two orthogonal dimensions are dichotomized into high- versus low-value judgments. When the two dimensions are crossed, therefore, the result is four potential combinations, and it has been proposed that individuals and groups are categorized in one of these four clusters.77 Researchers have shown that the stereotyped groups described in the preceding section can be categorized into each cluster based on rater judgments of warmth/competence dimensions and that each cluster is associated with emotional and behavioral responses in the rater.78 More specifically, in North America, groups judged high on warmth and competence, such as the middle class, invoke the emotions of pride and admiration and lead to behaviors of wanting to help and associate with them. Groups judged low on warmth and high on competence, such as the stereotypically gluttonous rich, elicit envy and willingness to associate but also to attack under certain conditions. Groups judged high on warmth and low on competence, including stereotypes for the elderly and disabled, elicit pity and willingness to help but also to avoid. Low judgments of both warmth and competence, including stereotypes for the homeless and drug-addicted, invoke the emotions of disgust and contempt and lead to behaviors of wanting to attack and to avoid.
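The clustering process described above can be sketched as a minimal model: two continuous dimensional judgments are dichotomized into high versus low, and the crossing yields one of four clusters, each paired with the emotional and behavioral responses reported for North American groups.78 The cutoff value and the function name are illustrative assumptions; the source studies do not specify a numerical threshold.

```python
# Minimal sketch of dichotomized warmth x competence clustering.
# The 0.5 cutoff is an assumed, illustrative threshold.
def classify(warmth: float, competence: float, cutoff: float = 0.5) -> str:
    high_warmth = warmth >= cutoff
    high_competence = competence >= cutoff
    # Four clusters from crossing the two dichotomized judgments,
    # with the rater emotions/behaviors reported in the literature.
    clusters = {
        (True, True): "pride/admiration: help and associate",
        (False, True): "envy: associate, but attack under some conditions",
        (True, False): "pity: help, but also avoid",
        (False, False): "disgust/contempt: attack and avoid",
    }
    return clusters[(high_warmth, high_competence)]

print(classify(0.8, 0.9))  # judged high on both dimensions
print(classify(0.2, 0.9))  # low warmth, high competence
```

Note that this sketch deliberately discards the continuous values after dichotomization, which is exactly the feature of the proposed process that makes it poorly matched to interval rating scales.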
The fundamental nature of two dimensions underlying social judgments has been explained using an evolutionary perspective. It has been proposed that successfully determining whether strangers are potential friends or enemies, based on their perceived intentions and also on whether they are capable of achieving those intentions, would provide a survival advantage.79 As such, persons categorized as having cold or immoral intentions and high competence receive more strongly negative impression ratings than those categorized as having immoral intentions and low competence.80 This occurs despite the immoral–incompetent categorization resulting from two negative dimensional judgments and the immoral–competent categorization resulting from the combination of a positive and a negative dimensional judgment. Categorizations based on dimensional judgments, therefore, do not purely reflect an algebraic combination of values judged on two orthogonal dimensions.
The finding that two dimensions can account for variance in impression formation is especially intriguing because two dimensions have also been found to underlie rater-based assessments in medical education.11,81,82 Factor analysis of rating forms designed to assess clinical competence often identifies two underlying factors regardless of the number of items or the number of dimensions included on the form. Of the two factors that explain the majority of variance in ratings, one tends to refer to knowledge and the other to interpersonal skills. The knowledge dimension seems analogous to the competence dimension in social judgments, and the interpersonal skills dimension seems comparable to the warmth dimension. As such, medical raters could be using the cognitive processes, previously described using the example of stereotyped groups in North America, to classify ratees into one of the four clusters with consequent emotions and reactions. If two fundamental dimensions reflect the cognitive judgments made by people in forming impressions of others, it may be useful to better understand how raters make judgments on these two dimensions and what factors influence judgments on either dimension. It may also be important to examine the dimensions more closely to determine whether they are in fact continuous or dichotomized; to look for additional dimensional axes; and to confirm that these two dimensions are truly orthogonal.
On the other hand, it is worth noting that the two dimensions were revealed through the use of rating scales and factor analysis or multidimensional scaling. It is not clear, therefore, whether these dimensions represent the actual cognitive processes used by raters or are artifacts of the rating process used to capture the judgments. Thus, although these dimensions are potentially useful in understanding the judgments reported on rating scales, it remains to be seen whether they meaningfully reflect the underlying cognitions that generated the ratings in the first place or whether they emerge from the data because rating scales were used to record the judgments.
The need for medical education to use rater-based assessments in determining the competence of its trainees and professionals, combined with difficulty in resolving the psychometric limitations of these ratings, has resulted in raters commonly being blamed for the limitations of this assessment approach.31 Although case specificity has been shown to play a very important role, rater variability (based on idiosyncrasies of opinion, defiance, or ineptitude) has also been seen as a source of construct-irrelevant error16,25 with less clear understanding of how to overcome the challenge it creates. Solutions targeted at bolstering rater objectivity and ability have had little impact on reducing these measurement errors,7 and hence, perhaps the time has come to consider an alternate conception of rater “error.” Through better understanding of how raters make judgments during the assessment process, we may be able to tease apart error attributable to human biases and error unintentionally imposed by assessment systems that are incongruent with innate human cognition. If we were to start with the premise that raters in rater-based assessments use the same cognitive processes as raters in social judgments, then what would the implications be for assessment and how would it change the way we talk about assessment?
Psychologists have shown that, in making social judgments, people have a propensity to categorize other people. In the impression formation literature, there seem to be at least three different conceptualizations of this categorization process. The Person Model literature presents an adaptable type of categorization based on the construction of stories, as needed, to describe specific individuals.42,45 In contrast, the categorization literature suggests that categories can be preformed constructs that exist in the long-term memory and are applied when activated.46 And a third conceptualization is the concept of cluster-based categorization that results from dichotomous judgments on two dimensions.77,78 Regardless of these differences in conceptualization, there is general agreement in the impression formation literature that such categorizations allow information about a typical category member to be applied to the new person, thereby reducing the cognitive resources needed to monitor the person's behavior, allowing for predictions of how he or she will behave, and providing options for how best to interact with him or her.46 If this is the basic process underlying raters' decisions in medical education, it has several implications for our conceptualization of rater error.
First, the categorization of the person can happen spontaneously and without awareness,60 and there may be poor control over these processes even when they are made explicit.64 This could directly impede efforts to modify the influence of categorization on assessments through rater training. Further, although there is evidence of these categorizations being surprisingly consistent across raters, there is nonetheless room for rater idiosyncrasy, or at least subgroups that consistently use a different Person Model in understanding a particular individual's behavior.45 If we were to take this categorization model as the underlying process by which raters were assessing ratees, therefore, it would radically alter our understanding of the source of rater differences and the ways in which we might imagine trying to address them. It is not that raters are scaling the behaviors differently but, rather, that they are placing ratees in different nominal categories.
Second, in the vast majority of rater-based assessments in medical education, the standard forms require ratings on a predetermined list of performance domains, roles, and/or competencies. These theoretically constructed assessment dimensions may not correspond with the categorizations that result from our innate cognitive processes, and they may not be universally applicable to all ratee categorizations. It is possible, therefore, that rater error might stem from an assessment system that asks raters to carry out judgment tasks that are incongruent with the cognitive processes used by humans to perform judgments. If we were to accept the process of categorization of ratees during assessments, then what are the potential ramifications for analyzing rater-based assessments? If raters are forming nominal judgments but assessment forms require ordinal or interval ratings, how do they translate that categorical judgment into a rating scale value? Could raters using different conversion systems explain a portion of rater error? Could conversion miscalculations be a source of rater error? How much influence could an unreliable or idiosyncratic rater judgment to rating scale conversion process have on the measurement outcomes of rater-based assessments?
The idea that people categorize others as a way of perceiving and understanding who they are and what their significance may be in a social environment is, in and of itself, not a radical concept. Its potential implications for rater-based assessments, on the other hand, are profound. First, it suggests that measurement errors may be partially a function of raters making somewhat consistent, but different, categorizations. If this is true, assessment systems may need to accommodate the categorization process, and faculty development efforts to improve the quality of assessments would look very different. Second, measurement errors may reflect conversion errors stemming from idiosyncratic or erroneous translation of these nominal judgments into the ordinal or interval judgments we demand of our raters. The statistical benefits of interval variables over nominal variables are enormous. But if this is how human raters form judgments and make assessments, then this inconvenient reality may need to be faced head-on. Thus, the third implication is that there may exist a more efficient, accurate, and reliable rater-based assessment system that incorporates categorical judgment processes.
As we consider where to go from here, it is clear that, although immediate solutions are not available, a research agenda informed by the concept of an innate human inclination toward categorization of people during impression formation would lead to a very different set of questions regarding our assessment systems. Are rater-based assessments suffering from a “lost in translation” problem as they require ordinal judgments to be derived from a nominal categorization process? If raters are trying to provide nominal data, how might we codify these categorizations directly rather than asking our raters to translate their categorical assessments into universally applied scaled dimensions with ordered degrees of competence? How could such a categorical assessment system be analyzed and compared across raters? What would be required to compile and interpret various nominal judgments to determine the competency of individual students? How would the resulting assessment be communicated to students in a comprehensible and usable form? How could assessment decisions based on nominal data be defended during appeals and litigation?
It is with good intentions that steps have been taken to make rater-based assessments more consistent through increasingly structured dimensional assessment tools. Changes to rating scales, assessment procedures, and rater training have been based on solid reasoning and rigorous study. It is important to have psychometrically sound assessments that are defensible, useful, and meaningful. But the outcomes from this dedicated work have not entirely met expectations. It may be time to take a completely different look at what raters have been asked to do. The skills of observation, perception, judgment, and decision making have evolved in humans to benefit social interactions and, ultimately, survival. It is highly likely that the rater-based assessment environment triggers the use of these innate social cognitive processes. An assessment process that best utilizes the advantages of social cognitive processes while minimizing the disadvantages may provide improved results.
Financial support for the preparation of this paper was generously provided by a grant from the Society of Directors of Research in Medical Education (a nonprofit organization [501c3]): www.sdrme.org.
1Fromme HB, Karani R, Downing SM. Direct observation in medical education: Review of the literature and evidence for validity. Mt Sinai J Med. 2009;76:365–371.
2van der Vleuten CPM, Schuwirth LWT. Assessing professional competence: From methods to programmes. Med Educ. 2005;39:309–317.
3Turner JL, Dankoski ME. Objective structured clinical exams: A critical review. Fam Med. 2008;40:574–578.
4Eva KW. Assessing tutorial-based assessment. Adv Health Sci Educ. 2001;6:243–257.
5Norcini J, Burch V. Workplace-based assessment as an educational tool: AMEE guide no. 31. Med Teach. 2007;29:855–871.
6Lurie SJ, Mooney CJ, Lyness JM. Measurement of the general competencies of the Accreditation Council for Graduate Medical Education: A systematic review. Acad Med. 2009;84:301–309.
7Williams RG, Klamen DA, McGaghie WC. Cognitive, social and environmental sources of bias in clinical performance ratings. Teach Learn Med. 2003;15:270–292.
8Kassebaum DG, Eaglen RH. Shortcomings in the evaluation of students' clinical skills and behaviors in medical school. Acad Med. 1999;74:841–849.
9Albanese M. Rating education quality: Factors in the erosion of professional standards. Acad Med. 1999;74:652–658.
10Cacamese SM, Elnicki M, Speer AJ. Grade inflation and the internal medicine subinternship: A national survey of clerkship directors. Teach Learn Med. 2007;19:343–346.
11Silber CG, Nasca TJ, Paskin DL, Eiger G, Robeson M, Veloski JJ. Do global rating forms enable program directors to assess the ACGME competencies? Acad Med. 2004;79:549–556.
12Hatala R, Norman G. In-training evaluation during an internal medicine clerkship. Acad Med. 1999;74(10 suppl):S118–S120.
13van Barneveld C. The dependability of medical students' performance ratings as documented on in-training evaluations. Acad Med. 2005;80:309–312.
14Clauser BE, Clyman SG. Components of rater error in a complex performance assessment. J Educ Meas. 1999;36:29–45.
15Downing SM. Reliability: On the reproducibility of assessment data. Med Educ. 2004;38:1006–1012.
16Downing SM. Threats to the validity of clinical teaching assessments: What about rater error? Med Educ. 2005;39:353–355.
17Clauser BE, Subhiyah RG, Nungester RJ, Ripkey DR, Clyman SG, McKinley D. Scoring a performance-based assessment by modeling the judgments of experts. J Educ Meas. 1995;32:397–415.
18Elliot DL, Hickam DH. Evaluation of physical examination skills. Reliability of faculty observers and patient instructors. JAMA. 1987;258:3405–3408.
19Noel GL, Herbers JE, Caplow MP, Cooper GS, Pangaro LN, Harvey J. How well do internal medicine faculty members evaluate the clinical skills of residents? Ann Intern Med. 1992;117:757–765.
20Clauser BE, Harik P, Clyman SG. The generalizability of scores for a performance assessment scored with a computer-automated scoring system. J Educ Meas. 2000;37:245–261.
21Mazor KM, Zanetti ML, Alper EJ, et al. Assessing professionalism in the context of an objective structured clinical examination: An in-depth study of the rating process. Med Educ. 2007;41:331–340.
22Eva KW. On the generality of specificity. Med Educ. 2003;37:587–588.
23Holmboe ES, Ward DS, Reznick RK, et al. Faculty development in assessment: The missing link in competency-based medical education. Acad Med. 2011;86:460–467.
24Lurie SJ, Mooney CJ, Lyness JM. Commentary: Pitfalls in assessment of competency-based educational objectives. Acad Med. 2011;86:412–414.
25Clauser BE, Harik P, Margolis MJ, Mee J, Swygert K, Rebbecchi T. The generalizability of documentation scores from the USMLE Step 2 Clinical Skills examination. Acad Med. 2008;83(10 suppl):S41–S44.
26Marshall VR, Ludbrook J. The relative importance of patient and examiner variability in a test of clinical skills. Br J Med Educ. 1972;6:212–217.
27Gray JD. Global rating scales in residency education. Acad Med. 1996;71(1 suppl):S55–S63.
28Littlefield JH, DaRosa DA, Paukert J, Williams RG, Klamen DL, Schoolfield JD. Improving resident performance assessment data: Numeric precision and narrative specificity. Acad Med. 2005;80:489–495.
29Wood L, Hassell A, Whitehouse A, Bullock A, Wall D. A literature review of multi-source feedback systems within and without health services, leading to 10 tips for their successful design. Med Teach. 2006;28:185–191.
30Kogan JR, Holmboe ES, Hauer KE. Tools for direct observation and assessment of clinical skills of medical trainees: A systematic review. JAMA. 2009;302:1316–1326.
31Green ML, Holmboe E. Perspective: The ACGME toolbox: Half empty or half full? Acad Med. 2010;85:787–790.
32Cook DA, Dupras DM, Beckman TJ, Thomas KG, Pankratz VS. Effect of rater training on reliability and accuracy of mini-CEX scores: A randomized, controlled trial. J Gen Intern Med. 2009;24:74–79.
33Newble DI, Hoare J, Sheldrake PF. The selection and training of examiners for clinical examinations. Med Educ. 1980;14:345–349.
34Govaerts MJB, Schuwirth L, Van der Vleuten CP, Muijtjens AMM. Workplace-based assessment: Effects of rater expertise. Adv Health Sci Educ Theory Pract. 2011;16:151–165. Online First (16 September 2010).
35Govaerts MJB, van der Vleuten CPM, Schuwirth LW, Muijtjens AMM. Broadening perspectives on clinical performance assessment: Rethinking the nature of in-training assessment. Adv Health Sci Educ. 2007;12:239–260.
36Lurie SJ, Mooney CJ, Lyness JM. In reply: Letters to the editor: How should the ACGME core competencies be measured? Acad Med. 2009;84:1173.
37Hamilton DL, Driscoll DM, Worth LT. Cognitive organization of impressions: Effects of incongruency in complex representations. J Pers Soc Psychol. 1989;57:925–939.
38Lingle JH, Geva N, Ostrom TM, Leippe MR. Thematic effects of person judgments on impression organization. J Pers Soc Psychol. 1979;37:674–687.
39Leyens J-P, Fiske ST. Impression formation: From recitals to symphonie fantastique. In: Devine PG, Hamilton DL, Ostrom TM, eds. Social Cognition: Impact on Social Psychology. San Diego, Calif: Academic Press; 1994:39–75.
40Bless H, Fiedler K, Strack F. Social Cognition: How Individuals Construct Social Reality. New York, NY: Psychology Press; 2004.
41Kenny DA. Person: A general model of interpersonal perception. Pers Soc Psychol Rev. 2004;8:265–280.
42Park B, DeKay ML, Kraus S. Aggregating social behavior into person models: Perceiver-induced consistency. J Pers Soc Psychol. 1994;66:437–459.
43Kenny DA. Interpersonal Perception: A Social Relations Analysis. New York, NY: Guilford Press; 1994.
44Bourne E. Can we describe an individual's personality? Agreement on stereotype versus individual attributes. J Pers Soc Psychol. 1977;35:863–872.
45Mohr CD, Kenny DA. The how and why of disagreement among perceivers: An exploration of person models. J Exp Soc Psychol. 2006;42:337–349.
46Macrae CN, Bodenhausen GV. Social cognition: Thinking categorically about others. Annu Rev Psychol. 2000;51:93–120.
47Fiske ST. Social cognition and social perception. Annu Rev Psychol. 1993;44:155–194.
48Nisbett RE, Ross L. Human Inference: Strategies and Shortcomings of Social Judgment. Englewood Cliffs, NJ: Prentice-Hall; 1980.
49Forgas JP. The role of emotion in social judgments: An introductory review and an affect infusion model (AIM). Eur J Soc Psychol. 1994;24:1–24.
50Andersen SM, Cole SW. “Do I know you?”: The role of significant others in general social perception. J Pers Soc Psychol. 1990;59:384–399.
51Smith ER, Collins EC. Contextualizing person perception: Distributed social cognition. Psychol Rev. 2009;116:343–364.
52Higgins ET, Rholes WS, Jones CR. Category accessibility and impression formation. J Exp Soc Psychol. 1977;13:141–154.
53Skowronski JJ, Carlston DE. Negativity and extremity biases in impression formation: A review of explanations. Psychol Bull. 1989;105:131–142.
54Kunda Z, Thagard P. Forming impressions from stereotypes, traits, and behaviors: A parallel-constraint-satisfaction theory. Psychol Rev. 1996;103:284–308.
55Macrae CN, Milne AB, Bodenhausen GV. Stereotypes as energy-saving devices: A peek inside the cognitive toolbox. J Pers Soc Psychol. 1994;66:37–47.
56Sherman JW, Lee AY, Bessenoff GR, Frost LA. Stereotype efficiency reconsidered: Encoding flexibility under cognitive load. J Pers Soc Psychol. 1998;75:589–606.
57Wittenbrink B, Park B, Judd CM. The role of stereotypic knowledge in the construal of person models. In: Sedikides C, Schopler J, Insko CA, eds. Intergroup Cognition and Intergroup Behavior. Mahwah, NJ: Lawrence Erlbaum Associates Publishers; 1998:177–202.
58Stangor C, Lynch L, Duan C, Glas B. Categorization of individuals on the basis of multiple social features. J Pers Soc Psychol. 1992;62:207–218.
59Stapel DA, Koomen W. Social categorization and perceptual judgment of size: When perception is social. J Pers Soc Psychol. 1997;73:1177–1190.
60Bargh JA, Ferguson MJ. Beyond behaviorism: On the automaticity of higher mental processes. Psychol Bull. 2000;126:925–945.
61Monteith MJ, Sherman JW, Devine PG. Suppression as a stereotype control strategy. Pers Soc Psychol Rev. 1998;2:63–82.
62Gilbert DT, Hixon JG. The trouble of thinking: Activation and application of stereotypic beliefs. J Pers Soc Psychol. 1991;60:509–517.
63Blair IV, Banaji MR. Automatic and controlled processes in stereotype priming. J Pers Soc Psychol. 1996;70:1142–1163.
64Wegner DM. Ironic processes of mental control. Psychol Rev. 1994;101:34–52.
65Macrae CN, Bodenhausen GV, Milne AB, Jetten J. Out of mind but back in sight: Stereotypes on the rebound. J Pers Soc Psychol. 1994;67:808–817.
66Sherman JW, Stroessner SJ. Stereotype suppression and recognition memory for stereotypical and nonstereotypical information. Soc Cogn. 1997;15:205.
67Hurlburt RT. Comprehending Behavioral Statistics. 4th ed. Toronto, Ontario, Canada: Thomson Wadsworth; 2006.
68Abele AE, Wojciszke B. Agency and communion from the perspective of self versus others. J Pers Soc Psychol. 2007;93:751–763.
69Fiske ST, Cuddy AJC, Glick P. Universal dimensions of social cognition: Warmth and competence. Trends Cogn Sci. 2007;11:77–83.
70Judd CM, James-Hawkins L, Yzerbyt V, Kashima Y. Fundamental dimensions of social judgment: Understanding the relations between judgments of competence and warmth. J Pers Soc Psychol. 2005;89:899–913.
71Ybarra O, Chan E, Park H, Burnstein E, Monin B, Stanik C. Life's recurring challenges and the fundamental dimensions: An integration and its implications for cultural differences and similarities. Eur J Soc Psychol. 2008;38:1083–1092.
72Rosenberg S, Nelson C, Vivekananthan PS. A multidimensional approach to the structure of personality impressions. J Pers Soc Psychol. 1968;9:283–294.
73Peeters G. Evaluative meanings of adjectives in vitro and in context: Some theoretical implications and practical consequences of positive-negative asymmetry and behavioral-adaptive concepts of evaluation. Psychol Belg. 1992;32:211–231.
74Wojciszke B. Affective concomitants of information on morality and competence. Eur Psychol. 2005;10:60–70.
75Beauvois J-L, Dubois N. Lay psychology and the social value of persons. Soc Personal Psychol Compass. 2009;3:1082–1095.
76Abele AE, Cuddy AJC, Judd CM, Yzerbyt VY. Fundamental dimensions of social judgment. Eur J Soc Psychol. 2008;38:1063–1065.
77Cuddy AJC, Fiske ST, Glick P. The bias map: Behaviors from intergroup affect and stereotypes. J Pers Soc Psychol. 2007;92:631–648.
78Fiske ST, Cuddy AJC, Glick P, Xu J. A model of (often mixed) stereotype content: Competence and warmth respectively follow from perceived status and competition. J Pers Soc Psychol. 2002;82:878–902.
79Wojciszke B, Bazinska R, Jaworski M. On the dominance of moral categories in impression formation. Pers Soc Psychol Bull. 1998;24:1251.
80Wojciszke B. Multiple meanings of behavior: Construing actions in terms of competence or morality. J Pers Soc Psychol. 1994;67:222–232.
81Ramsey PG, Wenrich MD. Use of peer ratings to evaluate physician performance. JAMA. 1993;269:1655–1660.
82Nasca TJ, Gonnella JS, Hojat M, et al. Conceptualization and measurement of clinical competence of residents: A brief rating form and its psychometric properties. Med Teach. 2002;24:299–303.