It is the nature of professions that the individual practitioner conducts his or her practice in a dyadic relationship with the client, with little regulation from bodies external to the profession. Medical professionals may be subject to informal peer review from colleagues or consultants, but many, particularly those in primary care, have little opportunity to judge and be judged. As a consequence, assessing oneself and others is a central skill in maintaining professional competence as a physician.
Problem-based schools have been in the forefront of acknowledging the importance of self and peer assessments. In part, this is a historical coincidence—problem-based learning (PBL) began during the 1960s at the same time that Knowles was popularizing the notion of self-directed learning.1 In part, however, the emphasis on self and peer assessment is a consequence of the PBL philosophy, which stresses collaboration and cooperation while disparaging competition. In addition, PBL traditionally uses tutorial groups, which provide a natural arena for the refinement of assessment skills. Despite this emphasis, the development of assessment tools that can be used in tutorials has proven to be one of the most difficult challenges for educators working within the PBL philosophy.2
Furthermore, while self and peer assessments are part of the rhetoric of PBL, rarely are they translated into action. While many PBL curricula list self- and peer-assessment skills as explicit objectives, it is usually assumed that formal training is not required so long as students are encouraged to assess themselves and others. That is, the skill of self assessment is expected to emerge naturally as a consequence of practice.
While this may be a logical position, it differs from an extensive literature. Gordon reviewed 18 studies performed prior to 1990 and found that, in general, the validity of self assessments was low to moderate and did not improve with time.3 More recent findings appear consistent with this conclusion. For example, Sullivan, Hitchcock, and Dunnington found a moderate correlation between peers' and tutors' ratings in a PBL group but no correlation between self and tutors' ratings.4 Similarly, Woolliscroft et al. found only weak-to-absent correlations between self and faculty's ratings.5
Furthermore, it is not clear that asking students to provide self-incriminating information is ethically tenable. As an anonymous clinician once noted, “even the Church never asked parishioners to make their confessions in public.”2 Similarly, the legal system protects defendants from testifying against themselves. Yet self-assessment protocols are typically put in place with the expectation that students will attempt to provide valid records of their limitations—an expectation against which students might rebel, either vocally or simply by inflating the evaluations they provide.
In an attempt to improve upon this status quo, Regehr et al. proposed an innovative self-assessment method that relies on individuals' ranking their own abilities relative to one another.6 They argued that our goal, as educators, should be to enable students to rank their own relative strengths and weaknesses rather than to rate their own performances relative to those of their peers. This intra-individual comparison might provide students with better feedback regarding which skills require work than does the more commonly used inter-individual comparison. A relative-ranking exercise might also eliminate any threat created by the act of self assessment, because all students have the same series of labels applied to the same series of skills even though the ordering of the skills will likely differ.
Students who participated in Regehr et al.'s study interviewed a standardized patient and then ranked (from strongest to weakest) their performances on ten communication skills, including organization, nonverbal skills, and assessment. The authors found a moderate correlation between students' rankings and tutors' rankings (.58 when corrected for imperfect reliability of tutor rankings), thereby indicating that students might be better at evaluating the relative strengths and weaknesses of their own skills than they are at evaluating their skills relative to those of their peers.
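The correction for imperfect reliability mentioned above is the classical disattenuation of a correlation coefficient. A minimal sketch, with purely illustrative numbers (the reliability value below is not taken from the study):

```python
import math

def disattenuate(r_observed, reliability):
    """Correct an observed correlation for measurement unreliability
    in one of the two rating sources (Spearman's disattenuation
    formula, dividing by the square root of that source's reliability)."""
    return r_observed / math.sqrt(reliability)

# Illustrative numbers only: an observed correlation of .45 and a
# tutor-ranking reliability of .60 disattenuate to roughly .58.
print(round(disattenuate(0.45, 0.60), 2))  # 0.58
```

Because the observed correlation is divided by a quantity less than one, the corrected value is always at least as large as the raw value.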
While these results are promising, it remains to be seen whether the relative-ranking procedure provides a useful self-assessment tool for students in PBL tutorials. On one hand, the tutorial is an environment rich with opportunities to discover the relative strengths of personal characteristics and, therefore, should provide students with a chance to develop an understanding of their own abilities. Furthermore, PBL tutorial groups allow for prolonged interactions between students, peers, and tutorial leaders—a resource replete with information to guide decisions about a student's ability. On the other hand, past work suggests that the less constrained the context within which evaluation takes place, the less reliable the assessments tend to be. Global assessments of performance judged over intervals of weeks are usually unreliable and, therefore, unlikely to correlate with other measures.7 Harrington, Murnaghan, and Regehr have also supported this argument with self-assessment data collected after a long-term orthopedics rotation.8 They reported a corrected correlation of .38 between self assessments and faculty ratings despite using a relative-ranking model. They hypothesized that the difference between this result and that of Regehr et al.6 may have been due to the assessment of an unstructured three- to six-month rotation rather than of a single encounter with a standardized patient.
Still, the potential for the development of an evaluation exercise that provides a rich source of data to guide discussions between student and tutor is very attractive in tutorial settings that are likely to be more structured than clinical rotations. As a result, the present study was conducted to test the effectiveness of the relative-ranking model in undergraduate PBL tutorials. The characteristics of first-year students' tutorial performances were assessed on three occasions over a six-week period by (1) the students themselves, (2) two of each student's peers, and (3) the student's tutorial leader. In all cases, assessors were asked to rank order a series of seven skills, from strongest to weakest, rather than to provide absolute ratings.
The McMaster University Undergraduate Medical Programme uses a PBL curriculum. Unit-1 students are commencing their first year of training. Eighteen groups of five or, more commonly, six students meet several times weekly. The students have large-group sessions, clinical skills tutorials, clinical pathology conferences, and biweekly tutorials. Each tutorial is conducted with a tutor present, lasts two to three hours, and deals with a system-oriented problem. Normally, a tutorial concludes with a group discussion entitled “tutorial evaluation.” Tutors in McMaster's medical program complete workshops before tutoring to develop their skill in guiding these evaluation discussions. During the seventh week of Unit 1, a more prolonged discussion occurs, dealing with students' and tutors' characteristics and performances. This session often lasts two to three hours and results in a mid-unit evaluation for each student.
During the first two weeks of Unit 1 in the fall of 2000, nine of the 18 tutorial groups were randomly selected to participate in the current study. One of the authors (HR) met with each group to describe the rationale and the logistics of the study. All individuals were told that the goal of the evaluation exercise was to facilitate identification of areas that require greatest attention. It was also stressed that no summative evaluation would arise from this exercise because it would not allow a comparison across individuals—even the strongest students would be assigned a “least strong” characteristic. The students were told that mid-unit evaluations performed by the tutor might incorporate any helpful information from the completed forms, but that the final evaluation would not. In contrast to final evaluations, mid-unit evaluations cannot be used as fodder for the students' final transcripts. By limiting any possible use of the information to low-stakes mid-unit evaluations, we anticipated that the responses would be less biased and that compliance would be enhanced. Any questions or concerns on the part of the tutor or students were open for discussion.
The assessment form designated and defined seven domains for relative comparison. The domains' names and definitions are given in Table 1. The domains were derived from Tutotest, a tutorial-based evaluation instrument that has been shown to be internally consistent (alpha = .98) and correlate highly with global evaluations assigned by tutors (r = .64).9 At the top of each form were the instructions first to identify the student's strongest domain, then to identify the student's least strong domain and, finally, to identify the “middle of the road” domain. Finally, assessors were asked to rank the remaining four domains in relative order between the two extremes.
Ultimately, six of the nine groups (36 students) chose to participate. Ranking instruments were distributed to each student, two of the student's peers from within the tutorial group, and the student's tutorial leader. The number of peer raters was limited to two per student to keep the ranking sheet from becoming too long (a factor that might militate against compliance) while still allowing us to address the issue of interassessor reliability within peer assessments. All assessors were asked to complete the instrument at the conclusion of the fourth, eighth, and twelfth tutorials (i.e., at the end of the last tutorials in weeks two, four, and six). The assessments were made anonymously on paper prior to the group's evaluation discussion to eliminate any biases that might arise from that discussion. For the same reason, these assessments were not requested beyond the mid-unit evaluation performed in the seventh week.
Students were asked to rank their own characteristics in the manner described above. The tutor was asked to rank all six students in his or her tutorial group across the same seven domains. Peer assessors were chosen sequentially in alphabetic order: Student A was peer-assessed by students B and C, student B was peer-assessed by students C and D, and so on through student F, who was peer-assessed by students A and B. Completed evaluation forms were collected by the tutor and returned to the authors by courier. Feedback on the use of the instrument was obtained from tutors and students subsequent to the completion of Unit 1.
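The cyclic peer-assignment rule described above (each student is assessed by the next two students in alphabetical order, wrapping around at the end of the roster) can be sketched as follows; the function name is ours:

```python
def assign_peer_assessors(students):
    """Assign each student the next two students in alphabetical
    order as peer assessors, wrapping around at the end of the
    list (the cyclic scheme described in the text)."""
    ordered = sorted(students)
    n = len(ordered)
    return {student: (ordered[(i + 1) % n], ordered[(i + 2) % n])
            for i, student in enumerate(ordered)}

# For a six-student group: A is assessed by B and C, ...,
# E by F and A, and F by A and B.
print(assign_peer_assessors(["A", "B", "C", "D", "E", "F"]))
```

A convenient property of this scheme is that every student both gives and receives exactly two peer assessments.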
In weeks two, four, and six, a relative-ranking exercise was completed for each student by four raters (the student him- or herself, two peers from the student's tutorial group, and the student's tutor). Crossing the three weeks with the four raters yielded 12 rankings and, therefore, 66 pairwise correlation coefficients for each student with a full complement of ranking exercises. The reliability of each pair of rankings was computed using Spearman rank-order correlation coefficients for each student, and the resulting coefficients were then averaged across students. Table 2 illustrates the resulting matrix of correlations as well as the number of correlations that were used to compute each mean score. The number of correlations does not equal the number of participants in every cell because some individuals did not complete all of the requested rankings.
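The computation described above can be sketched as follows; the helper names and example data are ours, and the Spearman coefficient is computed from the sum of squared rank differences, which is valid for untied rankings of the seven domains:

```python
from itertools import combinations

def spearman(x, y):
    """Spearman rank-order correlation for two untied rankings of
    the same items, via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n * n - 1))

def mean_pairwise_correlation(rankings):
    """Average the Spearman coefficient over every pair of a
    student's rankings; twelve (week, rater) rankings give 66 pairs."""
    keys = sorted(rankings)
    pairs = list(combinations(keys, 2))
    return sum(spearman(rankings[a], rankings[b])
               for a, b in pairs) / len(pairs)

# Identical rankings correlate at 1.0; a complete reversal at -1.0.
print(spearman([1, 2, 3, 4, 5, 6, 7], [1, 2, 3, 4, 5, 6, 7]))  # 1.0
print(spearman([1, 2, 3, 4, 5, 6, 7], [7, 6, 5, 4, 3, 2, 1]))  # -1.0
```

Averaging the per-student coefficients within each cell of the week-by-rater design then yields the matrix of mean correlations reported in Table 2.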
The average interrater reliability between self assessments and peer or tutor assessments mimicked the poor correlations commonly found in the self-assessment literature (self–peer assessments, r = .003; self–tutor assessments, r = .037). In contrast to past work, the interrater reliability was also poor for tutor–peer assessments, r = −.007. The average intrarater correlation was highest for tutors (r = .217) and lowest for self rankings (r = .083; for peers, r = .141). However, these correlations are not reliably different from one another. There was also no effect of time; the average intraweek correlations did not differ from week two (r = .042) to week six (r = −.017; for week four, r = −.023). In fact, none of the 66 correlations illustrated in Table 2 differs significantly from zero. The highest correlation observed was between the rankings assigned by tutors in week two and those assigned by tutors in week four, r = .314 (t(23) = .715, p > .3). The average correlation overall equaled .03.
One possible explanation for these poor correlations is that individuals are capable of providing reliable rankings for extreme characteristics (i.e., those that received rankings of 1, 2, 6, or 7), but that uncertainty in the middle of the range led to the poor correlations observed. Feedback from students supported this contention; several comments resembled those of one student, who said, “I could frequently select the highest and lowest items, but for the order of the middle items, it really became a random selection of the items without much consideration.” To test this hypothesis, we transformed all rankings of three, four, or five into four and reanalyzed the data. If extreme traits were ranked consistently, then the correlations should have improved after this transformation. They did not; every mean correlation remained below .32, centered around an overall mean of r = .03. This result does not necessarily refute the students' suspicion that ranking the middle items was harder than ranking the extremes, but it does suggest that the perceived strengths of even the extreme characteristics were judged inconsistently across individuals and across time.
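The reanalysis described above can be sketched as follows; function names are ours. Collapsing ranks 3, 4, and 5 into 4 assigns the tied middle items their midrank, so the tie-adjusted Spearman coefficient can be obtained as a Pearson correlation computed on the collapsed ranks:

```python
import math

def collapse_middle(ranks):
    """Replace ranks 3, 4, and 5 with their midrank, 4, so that
    only the extreme ranks (1, 2, 6, and 7) remain distinct."""
    return [4 if r in (3, 4, 5) else r for r in ranks]

def pearson(x, y):
    """Pearson correlation; applied to tied ranks this equals the
    tie-adjusted Spearman rank correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Two assessors who agree on the extremes but shuffle the middle
# three items agree perfectly after the collapse.
a = collapse_middle([1, 2, 3, 4, 5, 6, 7])
b = collapse_middle([1, 2, 5, 3, 4, 6, 7])
print(round(pearson(a, b), 2))  # 1.0
```

Under this transformation, two raters who disagree only on the middle three items correlate perfectly, so any remaining disagreement must involve the extreme characteristics.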
In short, the relative rankings assigned by our participants were not reliable. This was true despite the fact that most of the items included on the assessment instrument have been shown to be separable measures of tutorial performance, thereby supporting their content validity.9
Previous work has shown that the relation between self assessments and tutors' or peers' assessments tends to be poor.3 The lack of interrater reliability we observed suggests that using a relative-ranking system does not improve upon this status quo in tutorials. In fact, even the externally calibrated observers (tutors and peers) provided unreliable rankings compared with one another, whereas past work has shown peers and tutors capable of providing similar ratings.4 It is possible that the poor correlations observed across individuals were due to raters' insensitivity to performances that were specific to a given tutorial, because people tend to overestimate the extent to which individuals' characteristics are stable across time.10 However, had this intuition influenced our assessors' judgments, the rankings assigned by an individual judge should have been consistent from week to week. This was not observed.
The consistently low correlations observed in this study suggest very strongly that the instrument is not a useful tool for allowing individuals to assess tutorial performance. The rankings that assessors assigned appear to have been influenced more by random error than by any of the other predictable sources of variance. More generally, these results support a growing body of evidence that suggests that the assessment of broadly defined ability may be the educational equivalent of the emperor's new clothes. In reviewing the literature on self-assessment, Harrington, Murnaghan, and Regehr separated research that was based on long-term performance (e.g., post-clinical rotation) from that based on performance on a single task (e.g., an OSCE).8 In general, self-assessment has been better correlated with other accepted measures of performance for short-term evaluations relative to long-term evaluations. This dissociation has also been observed within the domain of relative-ranking exercises. Regehr et al. observed a moderate correlation between self assessments and observer assessments when psychiatry residents were asked to rank-order their performances on a single OSCE station.6 In contrast, Harrington, Murnaghan, and Regehr's study of performance evaluation at the end of an orthopedics rotation did not yield such promising results.8 To our knowledge, no one has examined this distinction within a single study, but the trend is quite consistent and is supported by our current work.
Two problems might limit the generalizability of our study's results. First, the participants chosen for study were first-year medical students, who, it could be argued, have not yet had sufficient time in medical school to develop the ability to accurately and reliably assess their own abilities. It should be noted, however, that longitudinal studies of the self-assessment abilities of medical students tend to suggest that first-year students' self evaluations are more similar to tutors' assessments than are self evaluations of more senior students. Two studies have found that the strength of the relationship between self assessments and tutor assessments decreased with time in medical school.11,12 In contrast, a third study showed a positive relationship between time in medical school and self—tutor correlations, but, unfortunately, the “time” variable in this study was confounded with other factors—this study was not a true longitudinal study in the sense that self assessment did not continue across time; the follow-up data were based on assessment of a performance that had taken place three years earlier.13
A second potential limitation is the use of the seven domains selected for the ranking exercise. Some (e.g., positive participation) were much more value-laden than others (e.g., knowledge). This divergence might have limited the usefulness of the instrument, but, again, had students simply been assigning high rankings to value-laden items, we would expect higher correlations than those observed. Using only seven domains could also have made the correlations inherently unstable simply as an artifact of the mathematical properties of correlations computed on so few observations. However, although this instability would account for the large variance in the correlations we observed, it does not readily explain an average correlation of zero.
Finally, one might argue that an examination of the correlations among rankings is not the most important outcome measure when considering self assessment. Rather, it could be said, we should focus our attention on the quality of the feedback provided to the student. While we agree that one of the great promises of the relative-ranking procedure is its potential to direct tutors' feedback and students' efforts, it is not clear what such feedback would address if a student's relative weaknesses cannot be identified reliably.
In summary, tutorial evaluation has proven to be one of the more elusive aspects of PBL curricula. This is surprising because, intuitively, the tutorial is a setting rich with opportunities to learn many of the skills that are valued within the PBL philosophy and to allow students to recognize which aspects of their performances most require improvement. While the relative-ranking model proposed by Regehr et al.6 maintains promise as an evaluation tool that can provide appropriate and timely formative feedback, it did not prove to be a reliable method of assessment in the tutorial context during this study.