Ongoing self-assessment of knowledge and skills is essential for physicians, who in practice must self-direct their educational activities so they can maintain or improve their medical proficiency and know when it is appropriate to refer a patient to another physician. Physicians who cannot accurately assess their own knowledge and skills may provide compromised care to patients.
We believe that medical students' abilities to accurately self-assess predict their self-assessment abilities as physicians. By studying medical students, one can identify personal or task characteristics that influence self-assessment accuracy; that information can then be used to develop effective curricular interventions.
The limited research into medical students' self-assessment has yielded several findings. There is evidence of modest improvement in self-assessment skills.1 While some research has found that self-assessment may be modifiable by education (students' self-assessment skills increased slightly over the course of education2 and students' evaluation criteria became more stringent with experience3), other research suggests that, even if self-assessment is a learnable skill, much of that learning takes place in childhood and is rather fixed by the time students enter medical school.4 The evidence of limited improvement in self-assessment skills during medical training may reflect the relatively fixed character of adult self-assessment, or it may reflect the fact that students receive little practice in self-assessment.
Since 1995, we have conducted a series of studies to establish measures of self-assessment; we have used those measures to better understand the components of self-assessment.5–8 In these studies, we have found that individual differences in self-assessment accuracy are not related to demographic variables (gender and ethnicity) or academic variables (performance and preparation). Having found no relationship between self-assessment accuracy and students' characteristics, we wondered whether different types of clinical tasks might affect self-assessment accuracy. For example, do students more accurately assess themselves when they are interpreting visual information, such as EKGs, or when they are conducting a physical examination on a standardized patient?
In this study, we examined the relationships between self-assessment accuracy and task characteristics. We compared the abilities of medical students to accurately self-assess their performances on two test formats of clinical skills. The first was a standardized patient examination, scored using a checklist, on basic history and physical examination skills. The second was a paper-and-pencil examination that tested the students' medical knowledge. We sought the answers to two questions: (1) does the accuracy of students' self-assessments vary across task formats? and (2) are students' self-assessments consistent within task formats?
METHOD
The study participants were fourth-year University of Michigan medical students who completed a comprehensive clinical assessment examination in either 1997 (n = 141) or 1998 (n = 163).
All University of Michigan medical students take a comprehensive clinical assessment examination (CCA) at the beginning of their fourth year. This multiple-station objective structured clinical examination is designed to ensure that students have the knowledge and skills expected at this point in their medical education. The CCA assesses students' skills in history taking, communication, physical examination, and interpretation of test and laboratory results. The stations fall into one of two formats: a performance task or a cognitive task. In the performance-task stations, students perform a physical examination (e.g., a breast examination) and/or obtain a patient's history and are evaluated on checklist items by the standardized patient and/or faculty members. In the cognitive-task stations, students answer questions concerning the diagnoses of clinical vignettes, the interpretation of x-rays, and the interpretation of EKGs. Each student's performance is evaluated at each station and reported as a score ranging from 0% to 100%.
Immediately following the CCA, the students were asked to estimate their scores (ranging from 0% to 100%) on each station. Ten CCA stations were used to evaluate the students' self-assessment accuracy (Table 1): seven performance-task stations and three cognitive-task stations.
Table 1: Cognitive and Performance Tasks on the Comprehensive Clinical Assessment Examination, University of Michigan Medical School, 1997 and 1998
Self-assessment accuracy was determined by calculating the three indices developed in our previous work. The first of these is the bias index, the average difference between the student's estimated performance (x_e) and actual score (x_a) on a set of n observations: Σ(x_e − x_a)/n. This index indicates whether, on average, a student over- or underestimated his or her performance, and by how much.
The deviation index is calculated as the average absolute deviation of the estimated score (x_e) from the actual score (x_a) over n observations: Σ|x_e − x_a|/n. In contrast to the bias index, in which over- and underestimates can cancel each other out, the deviation index summarizes how far the student's estimates deviated from his or her actual performance.
The third index, the actual-estimated correlation, assesses the correlation between estimated and actual performances over multiple observations, i.e., the extent to which variations in a student's estimates paralleled variations in his or her actual performance. Note that the correlation is not influenced by differences between the values of the estimated and actual scores (i.e., bias or deviation) but reflects only covariation.
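To make the three indices concrete, here is a minimal sketch in Python. The score values are hypothetical illustrations, not study data.

```python
from math import sqrt
from statistics import mean

def bias_index(estimated, actual):
    """Average signed difference: positive means overestimation."""
    return mean(e - a for e, a in zip(estimated, actual))

def deviation_index(estimated, actual):
    """Average absolute difference, so over- and underestimates
    cannot cancel each other out."""
    return mean(abs(e - a) for e, a in zip(estimated, actual))

def actual_estimated_correlation(estimated, actual):
    """Pearson correlation: reflects covariation only, and is
    unaffected by a constant offset between estimates and scores."""
    me, ma = mean(estimated), mean(actual)
    cov = sum((e - me) * (a - ma) for e, a in zip(estimated, actual))
    var_e = sum((e - me) ** 2 for e in estimated)
    var_a = sum((a - ma) ** 2 for a in actual)
    return cov / sqrt(var_e * var_a)

# One student's estimated vs. actual station scores (hypothetical)
estimated = [80, 75, 90, 70, 85]
actual    = [78, 80, 85, 72, 80]
print(round(bias_index(estimated, actual), 2))       # 1.0
print(round(deviation_index(estimated, actual), 2))  # 3.8
print(round(actual_estimated_correlation(estimated, actual), 2))
```

Note that this student's small bias (1.0) coexists with a larger deviation (3.8): the signed errors largely cancel, which is exactly why the two indices are reported separately.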
The actual-estimated correlation index was calculated using two methods. In the first method, actual and estimated scores were correlated across students for each CCA station. In the second method, actual and estimated scores were correlated, for each student, across the seven performance tasks or the three cognitive tasks within each format. The means and standard deviations of the resulting correlations for both methods are presented in Table 3. The first method provides information about the variation of the correlations among tasks within a format as well as between the two formats; the second provides information about the variation among students within a format as well as between formats.
Table 3: Actual-Estimated Correlations of Cognitive and Performance Tasks on the Comprehensive Clinical Assessment Examination, University of Michigan Medical School, 1997 and 1998
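The two correlation methods can be sketched as follows, again with hypothetical scores (rows are students, columns are stations):

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x)
                      * sum((b - my) ** 2 for b in y))

# Hypothetical score matrices: rows = students, columns = stations
actual    = [[78, 85, 70], [82, 80, 75], [90, 88, 80]]
estimated = [[80, 83, 72], [79, 82, 74], [88, 90, 85]]

# Method 1: correlate across students, yielding one correlation
# per station (how well estimates track scores at each task)
station_r = [pearson([row[j] for row in actual],
                     [row[j] for row in estimated])
             for j in range(3)]

# Method 2: correlate across stations, yielding one correlation
# per student (how well each student's estimates track performance)
student_r = [pearson(a, e) for a, e in zip(actual, estimated)]

print([round(r, 2) for r in station_r])
print([round(r, 2) for r in student_r])
```

The same estimate-score pairs feed both computations; only the direction of aggregation differs, which is why the two methods answer different questions about variation.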
RESULTS
Table 2 presents the actual and estimated scores for each task, averaged across students. For the three cognitive tasks, the actual and estimated scores were similar: actual scores ranged from 77.1 to 80.4, and estimated scores from 74.1 to 80.8. The scores for the performance tasks had wider ranges: actual scores ranged from 65.5 to 87.6, and estimated scores from 74.5 to 83.5. The actual scores of the cognitive and performance tasks did not differ significantly (p = .12). Although the estimated scores differed significantly between the cognitive and performance tasks (p < .01), the difference was less than 2.5 percentage points (77.8 versus 80.2).
Table 2: Scores and Accuracy Measures of Cognitive and Performance Tasks on the Comprehensive Clinical Assessment Examination, University of Michigan Medical School, 1997 and 1998
The bias scores and deviation scores are also presented in Table 2. The bias scores for the three cognitive tasks indicate that all of the estimates were within three percentage points of the actual scores. The bias scores for the performance tasks are much more varied, ranging from an average of 0.6 (breast examination) to an average of 15.2 (communication skills). The bias scores for the cognitive tasks and the performance tasks differ significantly (p < .01), although by less than three percentage points. Interestingly, the students underestimated their cognitive-task scores (−1.0) and overestimated their performance-task scores (+1.6).
The deviation scores are larger than the corresponding bias scores for both cognitive and performance tasks. These scores also appear to be more consistent for the cognitive tasks (scores ranging from 11.0 to 11.6) than they are for the performance tasks (scores ranging from 8.1 to 17.9). No difference between the two task formats was evident for the average deviation scores (p = .12).
The actual-estimated correlations by CCA task and by student are presented in Table 3. The task-based correlations for the cognitive format ranged from a low of .19 (chest pain) to a high of .41 (EKGs). The task-based correlations for the performance format ranged from a non-significant correlation of .10 (pediatric patient) to a high of .57 (breast examination). Because of the small number of tasks, no formal test was performed to determine the statistical difference between the two formats. However, the mean correlations are similar: .32 and .35.
The student-based actual-estimated correlations indicate no difference between the two task formats (p = .66). There is a wide range in student correlations: −1.00 to +1.00 for the cognitive tasks and −.91 to +.99 for the performance tasks.
DISCUSSION
Generalizations from this study must take into consideration two limitations. First, the number of tasks the students completed was limited (ten tasks). Second, the two types of tasks (performance and cognitive) are not independent; i.e., cognitive knowledge is necessary when conducting a physical examination and clinical experience is applied when interpreting EKGs and x-rays.
It appears that the nature of the task (patient interaction versus application of knowledge) does not make a practical difference in students' self-assessments. If this finding is replicated in other studies, it would suggest that self-assessment accuracy is a generalizable skill, i.e., a skill that operates similarly across a variety of tasks and contexts. This would contrast with studies of clinical performance that have shown limited transferability of skills from one task to another.9 It may be that self-assessment is more analogous to a personality characteristic than it is to problem-solving behavior.
The study results also emphasize the importance of sampling tasks when conducting self-assessment research. Researchers commonly sample broadly from the population of study participants, but for results to be generalizable, they must also sample broadly from the population of tasks or contexts. If we had evaluated self-assessment accuracy using only one task—the communication skills task, for example—we might have concluded that self-assessment accuracy is relatively poor, as indicated by an average bias score of 15.2 and a deviation score of 17.9 (Table 2). In contrast, if we had selected the breast examination task, we might have concluded that this group of students accurately assessed themselves, based on their mean bias score of 0.6 and deviation score of 8.1. Investigators should view self-assessment like any other skill, one that cannot be reliably evaluated with a single-item test, i.e., a single task.
Results of our self-assessment studies indicate that students are fairly accurate in estimating their performances against an objective standard. However, this ability is much less relevant to medical practice than the ability to gauge one's own strengths and weaknesses, an ability that is crucial to self-directed learning. Seldom in actual practice do physicians have any objective standard of performance they can use as a yardstick for judging their strengths and weaknesses, particularly in the day-to-day activities of patient care. Instead, self-directed learning is much more likely to be motivated primarily by specific patient problems, and secondarily by a process of identifying relative weaknesses in more general areas of medical knowledge.
This study examined the consistency of self-assessment accuracy between two different task formats. Our next step will be to determine whether self-assessment accuracy is stable over the four years of medical school. Future studies will also need to expand the examination of self-assessment from the classroom to the clinical arena during the clerkships, i.e., to examine the stability of self-assessment over task and context. Additionally, studies will need to examine the impact (if any) of differences in curriculum formats (e.g., problem-based versus more traditional) on students' self-assessment.
REFERENCES
1. Calhoun JG, Woolliscroft JO, Hockman EM, Wolf FM, Davis WK. Evaluating medical student clinical skill performance: relationships among self, peer, and expert ratings. Proceedings 23rd Annual Conference on Research in Medical Education. 1984;23:205–10.
2. Arnold L, Willoughby TL, Calkins EV. Self-evaluation in undergraduate medical education: a longitudinal perspective. J Med Educ. 1985;60:21–8.
3. Calhoun JG, Ten Haken JD, Woolliscroft JO. Medical students' development of self- and peer-assessment skills: a longitudinal study. Teach Learn Med. 1990;2:25–9.
4. Woolliscroft JO, Ten Haken J, Smith J, Calhoun JG. Medical students' clinical self-assessments: comparisons with external measures of performance and the students' self-assessments of overall performance and effort. Acad Med. 1993;68:285–94.
5. Fitzgerald JT, Gruppen LD, White BA, Davis WK. Medical student self-assessment abilities: accuracy and calibration. Presented at the Annual Meeting of the American Educational Research Association, Chicago, IL, April 1997.
6. Fitzgerald JT, Gruppen LD, White C. The stability of student self-assessment accuracy. Presented at the 37th Annual Conference on Research in Medical Education, New Orleans, LA, November 1998.
7. Gruppen LD, Garcia J, Grum CM, et al. Medical students' self-assessment accuracy in communication skills. Acad Med. 1997;72(10 suppl):S57–S59.
8. Gruppen LD, White C, Fitzgerald JT, et al. Medical students' self-assessments and their allocations of learning time. Acad Med. 2000; 75:374–9.
9. Elstein AS, Shulman LS, Sprafka SA. Medical problem solving: a ten-year retrospective. Eval Health Prof. 1990;13:5–36.