The evaluation of clinical competence in the practice setting continues to be a cornerstone of the process by which the health professions determine trainees' preparedness to enter into professional practice. In medical residency programs, the end-of-rotation clinical evaluation (often called the “In-training Evaluation Report,” or ITER) has been one of the main mechanisms by which this clinical evaluation process is enacted.1 In general, ITERs consist of a set of rating scales that clinical supervisors are expected to use to indicate how well a resident is meeting the expectations of the training program across multiple domains of competence. In principle, this type of clinical performance evaluation has many characteristics that should make it an excellent tool for clinical assessment: It is based on the observation of performance, it is embedded in the real practice setting, it represents an extended observation period, and it is completed by experts in the domain.2 In practice, however, the ITER has been problematic as a mechanism to discriminate among residents,2,3 and in particular it has been a weak tool for identifying learners who are experiencing clinical difficulties.4,5 Despite efforts to improve the scales on which the ITERs are based,6 and despite various efforts to train faculty to use the scales more effectively,7 the ITER continues to be problematic as a tool to describe resident performance and discriminate among residents.
Increasingly, it is being suggested that the difficulties in developing effective scales to evaluate clinical performance in the field may have less to do with the specific details of the tools and more to do with the fundamental set of assumptions (the epistemology) that underlies the use of these tools.8,9 That is, for the last several decades, the approach to assessment in medical education has been dominated by a psychometric epistemology in which it is presumed that psychological constructs can be deconstructed and assigned numerical values according to definable rules to obtain an accurate and concise description of an individual's ability that will be objective, replicable, easily communicable, and comparable. This model has served the field well in recent years by guiding refinements to the assessment of knowledge and spurring the development of tools such as the objective structured clinical examination.
Yet, our measurement instruments do not merely allow us to quantify a construct; they shape how we think about, evolve, and ultimately teach that construct. Thus, as Hodges10 has warned, the psychometric construction of “competence as a reliable test score” opens the door for producing forms of “hidden incompetence.” For example, our ability to measure knowledge with high reliability might lead to an overemphasis of knowledge and prevent the medical community from noticing when individual practitioners do not maintain adequate interpersonal skills. Similarly, our effort to assess each competency on a separate scale might direct our focus away from the trainee's ability to integrate these competencies into a coherent understanding of effective clinical practice. This warning seems particularly important as the field strives to expand the definition of competence to explicitly include the more social and relational aspects of professional expertise (such as communication, collaboration, and professionalism).11,12 Understanding these social aspects of professional practice may require a more constructivist epistemology, which is based in the recognition that competent performance is always embedded in situated, relational contexts that are rich with information.13 Competence, from this perspective, is recognized as being constantly constructed and reconstructed and is acknowledged as inherently subjective and integrative in nature. Although context has been recognized to be important even with scale-based evaluations (e.g., observations by multiple raters in multiple situations are known to enhance reliability and validity), variations in performance across these contexts are generally interpreted as sources of noise that hide the “true,” stable score that properly represents the individual. Thus, these strategies can be seen as an attempt to “extract” the individual from the situation. 
In addition, scale-based strategies are explicitly designed to isolate different competencies independently, rather than asking the rater to assess the individual within a full context of performance. The clash of epistemologies that arises from the application of the psychometric approach to the complex social and relational aspects of clinical expertise was well articulated by Leach,14 who, in describing the development and evaluation of the competencies mandated by the Accreditation Council for Graduate Medical Education, stated:
The relevance of the work is dependent on an integrated version of the competencies, whereas measurement relies on a speciated version of the competencies. The paradox cannot be resolved easily. The more the competencies are specified the less relevant to the whole they become.
In fact, we would argue that it is the very nature of paradox that it cannot be resolved at all using the thinking that generated the paradox in the first place. Thus, as a community, we might do well to reconsider Leach's assertion that measurement necessarily “relies on a speciated version of the competencies.”14 As Schuwirth and van der Vleuten8 suggest, it may be worthwhile instead to develop a better understanding of how teachers process (and represent) large bodies of rich information, and to develop evaluation approaches that more authentically reflect this richness of information while keeping it manageable.
One potential starting place for this process might be the recognition that narrative (i.e., the stories that individuals construct about their experiences) “is the most compelling form by which we recount our reality, understand events, and through which we make sense of our experiences and ourselves.”15 Thus, narrative (in this context, the stories supervisors tell about their residents) has the potential to authentically reflect the richness of information suggested by Schuwirth and van der Vleuten.8 However, we would also note that unfettered narrative has the potential to violate Schuwirth and van der Vleuten's criterion of manageability. Thus, one way to address the challenge of creating evaluations that are rich, meaningful, and authentic but at the same time concise, communicable, and comparable is to formulate a set of “standardized narratives” that effectively represent the types of resident stories commonly described by experienced staff. Such an effort has been elaborated in a series of studies by Bogo and colleagues9,16,17 in the context of evaluating social work students in the field. Although promising, their work has not been replicated, nor has it been extended into the context of residency education in medicine. Therefore, in this report, we describe the insights we gained in trying to create, rank-order, and categorize a set of resident profiles that would characterize a representative range of the ways in which residents present themselves to staff physicians in the clinical teaching context.
The method for this research was based heavily on the work of Bogo et al16,17 in the field of social work. It involved our interviewing a set of attending physicians to collect their stories about residents, generating from these stories a set of standardized narratives, or “profiles,” then establishing a ranking and scaling of these profiles based on the collective opinions of a new set of attending physicians. Details of the method are elaborated below. For all aspects of this study, IRB approval was obtained from all institutions involved.
Creation of the narratives
To create the standardized resident narratives, or “profiles,” we interviewed 19 attending physicians from the departments of medicine at two participating institutions (the Faculty of Medicine, University of Toronto, and the Faculty of Health Sciences, McMaster University) in 30- to 60-minute interviews. As described more elaborately elsewhere,18 each attending physician was asked to describe (without mentioning names) first a specific outstanding resident they had supervised, then a problematic resident, and finally an average resident. These descriptions could be about any aspect of performance, and there was no attempt to encourage discussion of any particular area or dimension of competence. However, descriptions had to be of actual residents rather than generalized opinions. Where needed, the research assistant probed participants to describe specific behaviors their resident(s) displayed. The interviews were audiotaped and transcribed verbatim, but with any potentially identifying features removed. We conducted a grounded theory analysis to uncover the underlying themes (e.g., knowledge base, work ethic) that the physicians appeared to be using in framing their discussions of the residents (see Ginsburg et al18 for an elaboration of these dimensions).
From the 57 actual resident descriptions generated by the 19 physicians, we created 16 standardized profiles of residents, each about one-half to three-quarters of a page long. These profiles were designed to represent the full range of residents described by the supervisors by strategically combining various features and descriptions from different supervisors' stories while maintaining the language of the interviewed supervisors (but with any uniquely identifying information removed or altered to maintain anonymity of the residents discussed). All of the profiles were informed by the themes of performance identified in the grounded theory analysis, but no attempt was made to include each possible theme of performance in each profile. Rather, in an effort to maintain the narrative style of the 57 spontaneous descriptions offered and to authentically represent the way attendings discuss and describe residents, each profile is unique; each presents certain aspects of performance that are often different (and/or presented in a different order) from those presented in the other profiles.18 Examples of two profiles can be seen in Box 1. The full set of 16 profiles is in Supplemental Digital Box 1 (http://links.lww.com/ACADMED/A77).
Ranking, sorting, and scaling the narratives
The 16 resulting profiles were read and reviewed critically by two or three attending faculty at each of three participating schools: the original two schools plus the Faculty of Medicine, University of British Columbia. This process was designed to ensure that the style and language of the narratives felt authentic to the participants at all three institutions (each of which has its own culture of residency education) and to identify any potential gaps in the profiles' ability to represent any particular residents that these attendings had interacted with. A number of minor changes were made following this pilot review.
To identify the ranking and to establish a score for each of the 16 resident profiles, four groups of internal medicine (IM) attending physicians were recruited as participants. Recruitment took place via e-mail announcements sent out to all eligible IM attending faculty at the three institutions. Eligibility was based on having at least two years of attending experience requiring the evaluation of residents. We gave priority to faculty who taught on general IM teaching units, but we also included attendings from other primarily inpatient-oriented medical services. The 14 participants of this phase of our study were 2 groups of 4 faculty each from the University of Toronto, 1 group of 3 from McMaster University, and 1 group of 3 from the University of British Columbia.
All four groups followed the same procedure. The first phase of the procedure took approximately 45 minutes to complete. Following introductions and instructions, each participant was given a set of the 16 resident profiles, each profile on its own page and placed in a random order in the set. The participants read through all 16 profiles, making any notes on the pages that they wished. Highlighters were provided to allow participants to highlight relevant parts of the descriptions as they saw fit. Each then sorted the 16 profiles into as many groups or categories as he or she felt was necessary to represent the various levels of competence expressed in the profiles. Participants were asked to provide words or phrases that best described the level of performance represented by each group they created. Each participant was then asked to rank the profiles within each group from highest to lowest. Thus, each participant generated two “scores” for each profile. The first score was assigned based on its grouping, with a value of 1 assigned to profiles in the “best” group, a value of 2 assigned to profiles in the “next-best” group, and so on, for as many levels of competence as the individual produced. The second score was generated based on how each participant ranked each profile, from 1 (highest) to 16 (lowest).
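The two scores described above can be illustrated with a short sketch. This is not code from the study: the function name `score_sort` and the profile labels are hypothetical, and the sketch assumes that a participant's sort is recorded as a list of groups ordered from best to worst, with the profiles in each group listed from highest to lowest.

```python
def score_sort(groups):
    """Turn one participant's sort into two scores per profile.

    `groups` is ordered from the "best" group to the lowest group; within
    each group, profiles are ordered from highest to lowest ranked.
    Returns (category_score, rank_score), both dicts keyed by profile.
    """
    category_score = {}
    rank_score = {}
    next_rank = 1
    for level, group in enumerate(groups, start=1):
        for profile in group:
            category_score[profile] = level  # 1 = "best" group, 2 = "next-best", ...
            rank_score[profile] = next_rank  # 1 = highest ranked overall
            next_rank += 1
    return category_score, rank_score


# Hypothetical sort of five profiles into three groups.
example = [["P07", "P02"], ["P11"], ["P04", "P09"]]
categories, ranks = score_sort(example)
```

In this sketch the overall rank simply follows the group order, so a profile in a lower group can never outrank one in a higher group; that reading of the procedure is an assumption.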
Following a brief break, the participants in each group were brought together and shown each member's categorizations and rankings. They were then, as one large group, given access again to the set of 16 profiles, now laid out from highest ranked to lowest ranked based on the average of each individual's rankings. They were asked to negotiate these new rankings (moving profiles up or down the line as needed) and to collectively determine the cut points for different levels of competence by whatever criteria the group chose to use. All discussions occurring during these negotiations were audiotaped and later transcribed.
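As a minimal sketch of how the starting order for the negotiation could be derived, the snippet below averages each profile's rank across group members and sorts on that average. The function name and the tie-breaking rule (alphabetical by profile label) are assumptions made for illustration; the study does not describe how ties were handled.

```python
def order_by_mean_rank(rankings):
    """Order profiles by their mean rank across participants.

    `rankings` is a list with one dict per participant, mapping each
    profile label to that participant's rank (1 = highest).
    Returns (ordered_profiles, mean_rank).
    """
    profiles = sorted(rankings[0])
    mean_rank = {
        p: sum(r[p] for r in rankings) / len(rankings) for p in profiles
    }
    # Lowest mean rank first; ties broken by label (an assumption).
    ordered = sorted(profiles, key=lambda p: (mean_rank[p], p))
    return ordered, mean_rank


# Hypothetical rankings of three profiles by three participants.
group = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 2, "B": 1, "C": 3},
    {"A": 1, "B": 3, "C": 2},
]
ordered, means = order_by_mean_rank(group)
```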
Following this process, each group was debriefed regarding their experiences of the process and their sense of the authenticity and comprehensiveness of the profiles in representing the range of residents they had encountered.
The interrater reliability of the 14 participants across the four groups was calculated as both an average-rater intraclass correlation coefficient (Cronbach alpha) and a single-rater intraclass correlation (ICC) for both the categories generated and the rankings (1–16). The intergroup reliability was also calculated as both an average-group and single-group ICC to determine whether there was evidence of differences in institutional culture.
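For readers less familiar with these coefficients, the following sketch shows one common way to obtain them from a profiles-by-raters table using two-way ANOVA mean squares. It assumes a consistency-type ICC, for which the average-rater value equals Cronbach alpha; the exact ICC model used in the study is not stated here, so this choice, the function name, and the toy data are assumptions.

```python
def reliability(scores):
    """Return (average_rater_icc, single_rater_icc) for a score table.

    scores[i][j] is the score (category or rank) that rater j gave to
    profile i. Uses consistency-type ICCs from two-way mean squares.
    """
    n = len(scores)      # number of profiles
    k = len(scores[0])   # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between profiles
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols                  # residual

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    average_rater = (ms_rows - ms_error) / ms_rows           # Cronbach alpha
    single_rater = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    return average_rater, single_rater


# Hypothetical ranks given to four profiles by three raters.
toy = [[1, 2, 1], [2, 1, 2], [3, 3, 4], [4, 4, 3]]
alpha, icc1 = reliability(toy)
```

Under this model the two values are linked by the Spearman-Brown relation, alpha = k × ICC / (1 + (k − 1) × ICC), which is why the average-rater coefficient is always at least as high as the single-rater one when agreement is positive.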
Because participants were not limited in the number of categories they could create, the “average” group assignment for each profile across the 14 participants was generated using latent partition analysis19 as enacted by Miller et al.20 As described by Wiley,19 latent partition analysis is a statistical procedure designed to combine several participants' “partitions” of a set of items into categories to generate a description of the underlying (or “latent”) category structure that is common across participants. In short, by applying latent partition analysis to the categorical decisions made by each of our participants, we can estimate the “average,” or common, categorical structure that is latent in those collective partitions.
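As a rough illustration only, the sketch below computes the proportion of participants who placed each pair of profiles in the same category, which is one simple way to summarize what several partitions have in common. It does not implement the latent partition analysis of Wiley19 or Miller et al20; the function name and the toy partitions are hypothetical.

```python
def co_assignment(partitions, profiles):
    """Proportion of participants who put each pair of profiles together.

    `partitions` holds one partition per participant; each partition is a
    list of groups, and each group is a list of profile labels.
    Returns a square matrix (list of lists) indexed in `profiles` order.
    """
    index = {p: i for i, p in enumerate(profiles)}
    n = len(profiles)
    counts = [[0] * n for _ in range(n)]
    for groups in partitions:
        for group in groups:
            for a in group:
                for b in group:
                    counts[index[a]][index[b]] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


# Two hypothetical participants partitioning three profiles.
labels = ["A", "B", "C"]
sorts = [
    [["A", "B"], ["C"]],
    [["A"], ["B", "C"]],
]
matrix = co_assignment(sorts, labels)
```

Pairs with proportions near 1 were almost always grouped together, and pairs near 0 almost never were; a full latent partition analysis goes further and estimates the underlying categories themselves.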
Finally, discussion transcripts and field notes were analyzed by two of us (S.G., O.O.). A formal thematic analysis was not undertaken because one group's tape was lost and only field notes were available. We did, however, use the discussion notes to help understand and explore each group's process during the exercises, and we have included quotations where appropriate to support these explanations.
Table 1 presents the data for the unnegotiated categories and ranking within each category made by all 14 participants (the profiles are sorted from highest average rank to lowest average rank across the 14 participants). As can be seen in the table, the number of categories used by participants ranged from a minimum of three (participants 2 and 3 in Group Four) to a maximum of eight (participant 4 in Group One). However, the modal response was five categories. Despite this range of categories used, the 14-rater alpha for the category assignment was 0.97 with a single-rater ICC of 0.81, suggesting that there was very high agreement on category level “scores” among the 14 participants. Similarly, the 14-rater alpha for the rankings themselves was 0.98 with a single-rater ICC of 0.86. To guard against the potential that these ICCs are inflated by inclusion of a wide range of profiles, we restricted the range systematically and recalculated each reliability coefficient. Whereas the single-rater ICC values were sensibly lower when we looked at restricted ranges of the profiles, the ICCs were generally similar (and still reasonably high) when looking at faculty rankings of the top eight profiles (ICC = 0.63), the bottom eight profiles (ICC = 0.64), and the middle eight profiles (ICC = 0.55).
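The range restriction described above amounts to recomputing reliability on subsets of the ordered profiles. A minimal sketch of selecting those subsets follows; the function name is hypothetical, and taking the "middle eight" as the eight profiles centered in the ordering is an assumption, since the exact profiles used are not specified here.

```python
def restricted_ranges(ordered_profiles, size=8):
    """Split an ordered list of profiles into top, middle, and bottom windows.

    `ordered_profiles` runs from highest to lowest average rank.
    The middle window is centered in the list (an assumption).
    """
    n = len(ordered_profiles)
    start_mid = (n - size) // 2
    return {
        "top": ordered_profiles[:size],
        "middle": ordered_profiles[start_mid:start_mid + size],
        "bottom": ordered_profiles[n - size:],
    }


# Sixteen profiles labeled by their overall position, 1 (highest) to 16.
windows = restricted_ranges(list(range(1, 17)))
```

Each window would then be passed to whatever reliability calculation was used for the full set of 16 profiles.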
Chart 1 presents the final negotiated ranking of the 16 profiles by each of the four groups, the categories generated during the group discussions, and the descriptions of each category offered by each group. As can be seen, there were some discrepancies in overall ranking and in the overall number of categories generated, with two groups generating five categories, one group generating four categories, and one group generating only three categories. There were many consistencies, however, in the language participants used in their discussion and categorization of the resident profiles. For example, safety was a common theme in the lowest-ranked profiles, as were professionalism issues and presumed personality defects. The issue of “remediability” and response to feedback arose as important distinguishing features between the lowest-ranked profiles and those ranked slightly higher, as did the degree of supervision required. On the other hand, in the higher-ranked profiles, participants commented on readiness for practice and suitability as a colleague/consultant. There were interesting idiosyncrasies noted in some instances as well; for example, for one participant the issue of improvement took on the greatest importance and weight, so that any profile in which a resident showed evidence of improvement or response to feedback was rated relatively higher, and those that indicated no response were ranked much lower. For another participant, the issue of treating work as a “9-to-5 job” seemed to be a major issue, and those profiles were ranked relatively lower.
Table 2 presents the final negotiated rankings and categorizations for the four groups, again ordered by overall rank of the profile. Again, despite some differences in the rankings and categorizations across groups, the intergroup reliability was very high, with a single-group ICC of 0.87 (four-group alpha = 0.96) for category membership scores and a single-group ICC of 0.91 (four-group alpha = 0.98) for the overall rankings, suggesting that there is more consistency than inconsistency in the decisions made. Table 2 also presents the results of the latent partition analysis, which generated four categories overall with two profiles in the top category, five in each of the middle categories, and four in the lowest-rated category. See Chart 1 for a presentation of the profiles, categories, and descriptions.
Finally, when asked about their experience in reading and sorting the profiles, participants in each group felt that the profiles were realistic and authentically captured most, if not all, of the characteristics of the residents whom they typically supervise. There was a sense from two groups that the profiles were skewed a bit to the negative (i.e., there was a disproportionate number of profiles reflecting problematic performance). Others noted that these residents were more difficult to evaluate, so having more options in those categories was helpful. Similarly, in one group there was concern that there was no “perfect” profile—that is, a resident with no flaws or weaknesses. However, in a subsequent group this was explicitly probed, and those participants felt that there is no such thing as a resident without any deficiencies. Another issue raised was that the profiles focus on the actual behavior observed, without any hint as to the cause or context of that behavior. For example, some participants recalled residents who cause “95% of the grief” because “they have personal problems, their dog died,” or “their wife was sick for two weeks, but what you're seeing is the end result of that ‘background noise’ that you may not be aware of.” In the context of this discussion, one participant questioned whether they should mark someone differently based on the reasons for that person's deficiencies.
Despite these minor issues, faculty felt they could readily “see” or “find” their residents in the profiles provided. Several commented that the profiles “nicely captured things that are hard to evaluate.” In one group, faculty discussed the idea of rank-ordering the profiles and concluded that figuring out the category a resident belonged to was more important than the rank-order, and that it was the categories that were probably the most meaningful in terms of assessment.
Much of the effort in improving the evaluation of clinical competence over the last few decades has focused on deconstructing competence into a list of “speciated” competencies that are believed to be separately evaluable on a corresponding evaluation instrument. Embedded in this activity is the assumption that by deconstructing competence into separate, behaviorally anchored competencies, we will be able to achieve greater precision in the evaluation of each, and that the aggregation of these separate, precise evaluations will more accurately and objectively represent the overall competence of the individual being evaluated. These assumptions have been questioned on both a theoretical basis8,14,17,18,21 and an empirical basis,9,22 and some researchers have begun to search for approaches that more effectively capture the clinical supervisor's integrated, subjective clinical impression of a trainee in a way that offers standardization and meaningful comparison across trainees.
The work we have described here is another effort in this direction. We created a set of standardized narratives, or “profiles,” of residents representing various levels of competence, using the language and descriptive style of experienced faculty telling stories about actual residents they had supervised. We then “scaled” these integrated representations of resident performance based on a consensus of clinical education experts. We would note that the scaling process in which we have engaged is strongly reminiscent of a multiple-cut-point, standard-setting process often used in more classic testing formats. In particular, the Angoff method, which asks experts to define the characteristics of the borderline performer, is, in essence, asking those experts to create a “profile” at one point on the performance scale. This method has been used previously in the context of performance-based assessments.23 Further, the “contrasting groups method” asks expert judges to make a (usually dichotomous) pass/fail decision about a number of candidates on the basis of an overall understanding of each candidate's actual performance on the test. It, too, has been applied to performance-based assessments.24 Thus, there is some precedent for our procedure, which, in a sense, combines these two approaches by asking experts to reflect on hypothetical performances and make (in our case multiple) cut-point categorizations. Unlike typical standard-setting situations, which eventually abstract the expert categorizations into a cut point for numeric scores produced by the test, our profiles would themselves be the summative representation of the resident being assessed. The goal of our research was to establish the feasibility of applying this modified “standard-setting” procedure to this unique form of “scale.”
Our first important finding was that faculty participants were quite content with the descriptions of residents provided in the 16 profiles. In no case was a profile identified as unrealistic or unrepresentative of a “real” resident. Further, several participants spontaneously noted a sense that they were reading about actual residents they had worked with, some with groans because they felt they had identified “their” resident in one of the lower-rated profiles. The only noted absence in the set of profiles was “the perfect resident” who has no foibles at all. Thus, clearly, the language, style, and range of descriptions resonated well with this group of experienced clinical faculty, regardless of the medical school at which they were supervising residents.
Further, we found that when clinical faculty were asked to review this set of 16 profiles representing residents across a range of competence and to rank these narrative representations relative to each other, the faculty were quite comfortable with the task and highly consistent in the resulting rankings. Remarkably, this was true not only among faculty within a given institution but also across institutions that might be said to have quite different institutional cultures. As an interesting additional note, the interrater reliability was fairly consistent throughout the various levels of performance, as indicated by the ICCs calculated when the range of profiles was restricted.
This is not to say that there was unfailing uniformity of opinion in our participants. In particular, three participants showed slightly more idiosyncratic sort patterns, with each using a greater number of categories to distinguish levels of competence, and each demonstrating slightly lower (though still impressive) corrected item–total correlations for their rankings (r = 0.74, 0.78, and 0.72, respectively). Notably, however, this group did not represent a coherent alternate pattern of sorting because their sorts also correlated lower with each other than with those of participants demonstrating a more standard pattern. So, although some idiosyncratic patterns of sort were observed (e.g., one participant placed the highest value on evidence of improvement, whereas another downgraded any profile where the resident seemed to be treating the rotation like a 9-to-5 job), the pattern of responses we observed suggests that, similar to the findings of Bogo et al,16,17 experienced clinical faculty have fairly consistent constructions of what the continuum of performance looks like even when their focus on what constitutes exemplary/poor performance might be somewhat variable. Interestingly, these opinions and constructions were present without requiring specific training or instruction. That is, attending physicians' experience in the field was sufficient to allow them to judge residents' level of performance, at least when the full range of performance levels was presented to them at once.
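The corrected item–total correlation mentioned above compares each participant's rankings with the combined rankings of all the other participants. The sketch below shows one standard way to compute it with a Pearson correlation; it is illustrative only, and the function names and toy data are hypothetical rather than taken from the study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def corrected_item_total(scores):
    """One correlation per rater: own ranks vs. the sum of all other raters.

    scores[i][j] is the rank that rater j gave to profile i.
    """
    k = len(scores[0])
    result = []
    for j in range(k):
        own = [row[j] for row in scores]
        others = [sum(row) - row[j] for row in scores]  # total excluding rater j
        result.append(pearson(own, others))
    return result


# Hypothetical ranks for four profiles from three raters.
toy = [[1, 1, 2], [2, 2, 1], [3, 3, 3], [4, 4, 4]]
correlations = corrected_item_total(toy)
```

Removing the rater's own ranks from the total is what makes the correlation "corrected": it prevents a rater from appearing to agree with the group merely because his or her own data are part of it.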
This finding is promising for future efforts to document meaningful evaluations of clinical competence because it implies that the problem of inconsistent evaluation may have less to do with individual faculty members' presumed idiosyncratic (or uninformed) understanding of competence and more to do with the manner in which our faculty are expected to represent this overall clinical impression using the evaluation tools with which they are provided: evaluation forms based on distinct, individual competencies, each of which must be separately rated. Faculty in our study, from three institutions, had remarkably similar conceptualizations of different levels of performance, and what those levels mean, despite having had no specific training for this task. This suggests that the solution to improving evaluations may not lie in training faculty to observe and document better22 or to make minor modifications to existing tools and scales. Rather, consistent with Schuwirth and van der Vleuten's8 suggestion, our findings suggest that efforts at improving clinical performance measures might more profitably focus on fundamentally rethinking the structure of the tools we are using, to ensure that the instruments authentically represent the way in which faculty functionally conceptualize their residents' clinical competence on a day-to-day basis. What is needed now is the development of methods that will allow faculty members' subjective representations of their residents' performance to be smoothly translated into some form of documentation.
We should note that this finding of high consistency among faculty in the rankings they assigned in our current data set is an interesting contrast to previous findings from our own work that found strong idiosyncrasies among faculty in interpreting individual behaviors, particularly in the context of professionalism.25–27 One explanation for this discrepancy is that our earlier studies explored responses to single challenging scenarios without giving the raters a larger perspective on the student with which to interpret the performance they were seeing. In the current study, the profiles that faculty were reviewing represented a summary of an entire rotation's worth of behaviors and encompassed a more comprehensive range of clinical performance. We take this to be further evidence that purely behavior-based descriptions of performance are unlikely to be the solution to the pitfalls of “objective” evaluation. Single observations of behavior are always interpreted in light of a larger set of contextual factors, and competence is most consistently understood by faculty through patterns of behavior rather than on the basis of any single observation. Thus, finding ways to represent this more integrated, synthetic, “pattern-based” interpretation of competence will be an important consideration in the development of future evaluation instruments.
It would appear, therefore, that this approach of creating authentic-sounding standardized narrative profiles of residents at various levels of performance is possible. These profiles resonated strongly with faculty attendings, who felt that the profiles captured areas of competence that are otherwise difficult to evaluate. Further, they seemed to “scale” effectively and with high reliability using techniques reminiscent of other standard-setting procedures. How (and whether) the resulting “narrative-based scale” could be used in actual practice remains to be seen. Our participants certainly seemed to indicate (at least anecdotally) that they could see clear correspondences between particular profiles and actual residents they had supervised. This suggests some possibility for using such a tool as a form of summative assessment whereby faculty match their residents to one (or more) scaled profile. However, additional work would clearly be needed to assess the reliability, validity, generalizability, and feasibility issues that would have to be addressed if such an assessment were to be performed on every resident supervised. Alternatively, this set of profiles might be a mechanism to enable a supervisor who is struggling with a difficult resident to more effectively articulate some of the nature of that difficulty by finding in the matching profile some language to express the manner in which the resident is struggling. Or, perhaps this set of profiles might simply be another tool in the faculty development armamentarium that would better prepare supervisors for interacting with (and perhaps evaluating) residents who are performing problematically. Thus, we are not promoting the use of profiles as the “solution” to the evaluation problem. There are clearly several issues remaining and hurdles to address before any such system might maximally benefit evaluation procedures in the clinical setting.
However, although the profiles are perhaps not a solution, we feel that the results of the current study offer a richer understanding of the problem of codifying clinical performance.
Supplemental digital content for this article is available at http://links.lww.com/ACADMED/A77.
1. Chaudhry SI, Holmboe ES, Beasley BW. The state of evaluation in internal medicine residency. J Gen Intern Med. 2008;23:1010–1015.
2. Turnbull J, van Barneveld C. Assessment of clinical performance: In-training evaluation. In: Norman GR, van der Vleuten CPM, Newble DI, eds. International Handbook of Research in Medical Education. London, UK: Kluwer Academic Publishing; 2002:793–810.
3. Gray JD. Global rating scales in residency education. Acad Med. 1996;71(10 suppl):S55–S63.
4. Dudek NL, Marks MB, Regehr G. Failure to fail: The perspectives of clinical supervisors. Acad Med. 2005;80(10 suppl):S84–S87.
5. Cohen G, Blumberg P, Ryan N, Sullivan P. Do final grades reflect written qualitative evaluations of student performance? Teach Learn Med. 1993;5:10–15.
6. Speer AJ, Solomon DJ, Ainsworth MA. An innovative evaluation method in an internal medicine clerkship. Acad Med. 1996;71(10 suppl):S76–S78.
7. Holmboe ES, Hawkins RE, Huot SJ. Effects of training in direct observation of medical residents' clinical competence: A randomized trial. Ann Intern Med. 2004;140:874–881.
8. Schuwirth L, van der Vleuten CPM. Merging views on assessment. Med Educ. 2004;38:1208–1210.
9. Regehr G, Bogo M, Regehr C, Power R. Can we build a better mousetrap? Improving the measures of practice performance in the field practicum. J Soc Work Educ. 2007;43:327–343.
10. Hodges B. Medical education and the maintenance of incompetence. Med Teach. 2006;28:690–696.
11. Frank JR. The CanMEDS 2005 Physician Competency Framework. Ottawa, Ontario, Canada: Royal College of Physicians and Surgeons of Canada; 2005.
13. Jones MD Jr, Rosenberg AA, Gilhooly JT, Carraccio CL. Perspective: Competencies, outcomes, and controversy—Linking professional activities to competencies to improve resident education and practice. Acad Med. 2011;86:161–165.
15. Hurwitz B. Narrative and the practice of medicine. Lancet. 2000;356:2086–2089.
16. Bogo M, Regehr C, Power R, Hughes J, Woodford M, Regehr G. Toward new approaches for evaluating student field performance: Tapping the implicit criteria used by experienced field instructors. J Soc Work Educ. 2004;40:417–426.
17. Bogo M, Regehr C, Woodford M, Hughes J, Power R, Regehr G. Beyond competencies: Field instructors' descriptions of student performance. J Soc Work Educ. 2006;42:579–593.
18. Ginsburg S, McIlroy J, Oulanova O, Eva KW, Regehr G. Toward authentic clinical evaluation: Pitfalls in the pursuit of competency. Acad Med. 2010;85:780–786.
19. Wiley DE. Latent partition analysis. Psychometrika. 1967;32:183–193.
20. Miller D, Wiley DE, Wolfe R. Categorization methodology: An approach to the collection and analysis of certain classes of qualitative information. Multivariate Behav Res. 1986;21:135–167.
21. van der Vleuten CP, Norman GR, Graaff E. Pitfalls in the pursuit of objectivity: Issues of reliability. Med Educ. 1991;25:110–118.
22. Lurie SJ, Mooney CJ, Lyness JM. Measurement of the general competencies of the Accreditation Council for Graduate Medical Education: A systematic review. Acad Med. 2009;84:301–309.
23. Norcini J, Stillman P, Sutnick A, et al. Scoring and standard setting with standardized patients. Eval Health Prof. 1993;16:322–332.
24. Clauser BE, Clyman SG. A contrasting-groups approach to standard setting for performance assessments of clinical skills. Acad Med. 1994;69(10 suppl):S42–S44.
25. Ginsburg S, Regehr G, Lingard L. Basing the evaluation of professionalism on observable behaviors: A cautionary tale. Acad Med. 2004;79(10 suppl):S1–S4.
26. Ginsburg S, Lingard L, Regehr G, Underwood K. Know when to rock the boat: How faculty rationalize students' behaviors. J Gen Intern Med. 2008;23:942–947.
27. Ginsburg S, Regehr G, Mylopoulos M. From behaviours to attributions: Further concerns regarding the evaluation of professionalism. Med Educ. 2009;43:414–425.
Funding for this study was provided through a peer-reviewed grant from the Medical Council of Canada, Ottawa, Ontario, Canada.
For all aspects of this study, IRB approval was obtained from all institutions involved.