It is widely accepted that effective self-assessment is a critical skill for any health professional. It is one of the basic skills implicit in our current models of self-directed learning, continuing education, and self-regulation.1 It ought to be disconcerting to all, therefore, that the literature on self-assessment ability is bleak; after hundreds of articles and many literature reviews, there is only one sensible conclusion: self-assessment is poor.2 Although Ward et al3 have called into question this conclusion, criticizing the literature on several methodological grounds, even studies that effectively rectify these methodological weaknesses have failed to salvage the case for effective self-assessment.4 Three broad findings seem to dominate the literature: (1) there is little or no correlation between a group of individuals’ self-assessments and externally generated assessments of those individuals, (2) all but the highest performers tend to overestimate their performance and ability, and (3) the worst offenders are in the lowest quartile of performance, with most of these individuals nonetheless believing that they are above average in performance.5 Such findings place the notion of self-directed learning and self-regulation in jeopardy.
More recently, we have argued that the construction of self-assessment implicit in the current models of self-regulation may be conceptually flawed.2 Although seldom explicit, the construction of self-assessment in current conceptualizations might best be defined as “a process of personal reflection based on an unguided review of practice and experience for the purposes of making judgments regarding one’s own current level of knowledge, skills, and understanding as a prelude to self-directed learning activities that will improve overall performance and thereby maintain competence.” Operationally, this has manifested in studies as a question of whether individuals are able to rate themselves relative to their peers, or to rate their own strengths and weaknesses relative to one another, or to accurately guess the percentage of items correctly answered on a test. Findings suggesting that this form of self-assessment is poor, however, may be less problematic for self-regulation than might be imagined. That is, when addressing the value of self-assessment in the context of a physician’s day-to-day performance, what likely matters is having an awareness of when one lacks the specific knowledge or skill to make a good decision regarding a particular patient (e.g., when more information and/or a consultation is required). Here, self-assessment is not being used as a general, personal “reflection on practice,”6 leading to a leisurely decision to learn more about a particular domain. Rather, self-assessment is being used for “reflection in practice,”6 addressing emergent problems and continuously assessing whether one has sufficient skills and knowledge to effectively solve the current problem. As an analogy, we assume that most individuals do not read the dictionary out of the recognition that they need to improve their vocabulary. Rather, they look up specific words that they encounter when they are unsure of the definition.
If this new conception of self-assessment is to be pursued, researchers attempting to study self-assessment must ask whether physicians and/or medical students know when to slow down and when to look it up.
Viewing competence in self-assessment as situation-specific self-awareness rather than an ability to rate one’s strengths and weaknesses renders irrelevant the poor correlations between perceptions and reality that have commonly been reported in this literature. Furthermore, this new perspective yields a new consideration of various findings that suggest people might actually be self-aware despite the robustly reported weakness of self-assessment. For example, Norman et al,7 in asking clinicians to diagnose dermatological slides, observed that participants took longer to make a diagnosis on slides that were eventually diagnosed incorrectly relative to those that the participants diagnosed correctly. That is, their behavior suggests that, either consciously or unconsciously, they were acting differently when they were having difficulty.
Findings like these are sprinkled throughout the literature, but they have not yet been explored with systematic study, nor with the explicit intent to address the construct of self-assessment.2 The purpose of the current study, therefore, was to develop a methodology that systematically captures these phenomena to test this new conception of self-assessment ability in a more direct and methodical manner. In doing so, we were trying to establish a technique for evaluating the extent to which individuals show behavioral evidence of being aware in the moment of the boundaries of their knowledge. In short, we were looking at the extent to which individuals slow down and defer when they are at the edges of their competence.
Participants were recruited from an undergraduate psychology course at McMaster University. Students signed up for available experimental sessions using a Web-based system operated by the department of psychology, informed consent was obtained, and participants received bonus credits toward their course for participation. Ethics approval was granted by the Hamilton Health Sciences Research Ethics Board.
A computerized delivery platform was created to deliver trivia questions to respondents and to collect participants’ responses and reaction times. Sixty general knowledge questions, 10 from each of six domains—history, science, pre-1990s entertainment, sports, geography, and literature—were selected from the questions normed by Nelson and Narens.8 Questions were selected to represent a reasonable spread of difficulty; in each category, the probability of being answered correctly ranged between 0.2 and 0.8 (average = 0.49; SD = 0.13).
To begin, each participant was shown the six category labels and asked to select the category in which (s)he was most confident. On selection of a category, the computer prompted the participant to predict the number of items out of 10 that (s)he anticipated answering correctly. The participant was then presented with each of the 10 questions in the selected category, one at a time.
Participants were told that a correction factor would be imposed, so they should answer questions for which they felt confident in their answer and skip over questions for which they were unsure of the correct response. Thus, on presentation of a question, the participant first had to press one of two buttons indicating that they would answer or pass on the question. The time elapsing between being shown the question and depression of the pass or answer button was recorded for each question as the decision time. If the respondent chose to answer the question, a free-text box appeared in which the participant was required to type in a response and press a submit button. Participants were told at the beginning of the study that they would have only 10 seconds to respond once they had chosen to answer a question; a timer was shown to make it clear how much time remained. The elapsed time between the opening of the free-text box and depression of the submit button was recorded for each question as the response time. If the respondent passed, the question disappeared without requiring an answer. Subsequently, the next question was presented, and the same process was repeated until the participant had responded to (answering or passing) all 10 questions in the category.
After presentation of the tenth question, the remaining five categories were presented, and the procedure began anew: the participant selected the category in which (s)he was most confident, predicted the number of questions (s)he would answer correctly, and was then presented 10 questions, each requiring a pass/answer decision before a response box (or the next question) was presented.
After completing this procedure for all six categories, participants were reshown the questions on which they passed. They were told that the correction factor had been removed and that they should, therefore, make their best guess in trying to answer each of the questions that were left blank.
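The trial sequence described above (a timed pass/answer decision, followed by a timed free-text response for answered items) can be sketched in outline. This is a minimal illustration of the timing logic only; the function and field names are hypothetical and are not those of the actual delivery platform, and the 10-second response limit is not modeled:

```python
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrialRecord:
    question: str
    decision: str                          # "answer" or "pass"
    decision_time: float                   # seconds from question onset to button press
    response: Optional[str] = None         # typed answer, if any
    response_time: Optional[float] = None  # seconds from text box opening to submit

def run_category(questions, get_decision, get_response, clock=time.monotonic):
    """Present each question in turn, recording the pass/answer decision
    time and, for answered items, the response time."""
    records = []
    for q in questions:
        t0 = clock()
        decision = get_decision(q)                   # participant presses pass or answer
        rec = TrialRecord(q, decision, clock() - t0)
        if decision == "answer":
            t1 = clock()
            rec.response = get_response(q)           # free-text entry appears only after "answer"
            rec.response_time = clock() - t1
        records.append(rec)
    return records
```

Injecting a fake clock (rather than `time.monotonic`) makes the timing logic deterministic and testable, which is also how one would validate such a platform before data collection.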
Forty-one undergraduate psychology students participated in the study. One participant’s data were eliminated from the analysis because of a programming error. Two participants were subsequently removed from the data set because of concern regarding their understanding of the instructions (having chosen to answer all or all but four questions on the first pass, then responding with don’t know in the response box). Thus, 38 subjects (30 female) are included in the final analysis. The age of participants ranged from 18 to 30 (mean = 19.5; SD = 2.5).
Using the traditional conceptualization and analysis of self-assessment, we compared participants’ anticipated performance (predicted score for each of the six categories) and actual performance (sum total of accurate responses across 10 items within each category). Across the six response categories, the correlation between predicted and actual score ranged between 0.35 and 0.57, with a mean correlation of 0.46. When actual performance was defined more stringently as correct responses in pass one (i.e., the questions participants felt confident enough in their response to answer), the correlations between predicted and actual score ranged between −0.18 and 0.32, with a mean correlation of 0.10. A comparison of the means for anticipated scores (37.5%) and actual scores (31.4%) showed that, as a group, participants overestimated their performance even when accuracy was summed across both pass one and pass two (paired t37 = 2.85; P < .01).
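The traditional analysis above (correlating predicted with actual scores and comparing their means with a paired t test) can be reproduced on synthetic data. The numbers below are made up for illustration only and do not come from the study; they are constructed so that actual scores sit at or below predictions, mimicking the overestimation pattern reported:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 38                                                  # participants, as in the study
predicted = rng.integers(2, 9, size=n).astype(float)    # predicted score out of 10 (synthetic)
# synthetic "actual" scores at or below predictions, mimicking overestimation
actual = np.clip(predicted - rng.integers(0, 4, size=n), 0, 10)

r, _ = stats.pearsonr(predicted, actual)                # predicted-vs-actual correlation
t, p = stats.ttest_rel(predicted, actual)               # paired t test for overestimation
print(f"r = {r:.2f}, t({n - 1}) = {t:.2f}, p = {p:.4f}")
```

A positive t here indicates group-level overestimation, the same direction as the paired t37 = 2.85 reported above.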
Knowing general areas of strength and weakness
Because participants selected category order on the basis of anticipated success, a systematic decline in performance from the first to last category selected would indicate that participants were generally aware of their broad areas of strength and weakness. Consistent with this, the mean percent correct for first, second, third, fourth, fifth, and sixth selected categories, respectively, were 44.7, 35.5, 31.6, 30.3, 23.9, and 22.4, representing a statistically significant trend in the repeated-measures ANOVA (F5,185 = 16.92; P < .001). No difference was observed in the response time (i.e., the time taken to answer each question) as a function of category order (F5,185 = 0.68; P > .6).
Knowing when to defer
Because participants were given the option to pass during the first presentation of the questions and were then asked to answer all remaining questions on the second presentation, a higher accuracy rate on questions attempted on the first presentation compared with questions answered during the second presentation would suggest that participants were able to assess, in the moment, whether they were likely to get a particular question correct. Consistent with this, when participants were sufficiently confident that they chose to answer the question on the first presentation, they were correct an average of 65.9% of the time, whereas their average success rate for questions on which they chose to pass was only 4.3%. This difference was statistically significant (paired t37 = 24.81; P < .001).
Slowing down at the borders of competence
Because participants were timed in their decisions regarding whether to answer or pass on each question during the first presentation, we were able to assess the extent to which reaction times predicted the accuracy of their decisions. If participants were showing appropriate caution when at the borders of competence, then fast decisions to answer on the first presentation of a question should be associated with greater accuracy than delayed decisions to answer on the first presentation. Consistent with this pattern, when participants decided to answer on the first attempt, the decision was faster for questions that were answered correctly (7.2 seconds) than for questions that were attempted on the first presentation but answered incorrectly (11.4 seconds; paired t37 = 5.55; P < .001). Interestingly, if participants chose to pass during the first presentation, the mean decision time to pass was 7.8 seconds for questions on which their eventual answer (in the second, forcing round) was incorrect, whereas the mean decision time to pass was 13.4 seconds when their eventual answer was correct at the second presentation (paired t22 = 3.97; P < .01). Figure 1 illustrates the interaction between the time taken to answer and the eventual accuracy of the response for round one (items on which the participant initially chose to answer) and round two (items on which the participant initially chose to pass). Time intervals were selected to ensure that a large enough number of observations were considered per interval to have stability in the proportion estimates. Finally, it is worth noting that the same pattern seen in the time taken to decide to answer was also observed for the length of time participants required to type in their answer to the question: When participants chose to answer, they entered their response more quickly (4.5 seconds) if their answer was correct relative to when it was incorrect (5.0 seconds; paired t37 = 2.7; P < .05).
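The interaction illustrated in Figure 1 rests on binning decision times and computing the proportion of correct responses within each time interval. A minimal sketch of that binning step follows; the function name and the example bin edges are illustrative, not taken from the study:

```python
import numpy as np

def accuracy_by_decision_time(decision_times, correct, edges):
    """Group trials into decision-time intervals and return the
    proportion answered correctly within each interval."""
    decision_times = np.asarray(decision_times, dtype=float)
    correct = np.asarray(correct, dtype=float)
    idx = np.digitize(decision_times, edges)    # interval index for each trial
    return {int(i): float(correct[idx == i].mean()) for i in np.unique(idx)}

# e.g., fast decisions (<5 s), middling (5-10 s), slow (>=10 s)
props = accuracy_by_decision_time([1, 2, 6, 7, 12], [1, 1, 1, 0, 0], edges=[5, 10])
```

As noted above, the interval edges must be chosen so that each bin holds enough observations for the proportion estimates to be stable.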
It would seem from these findings that reflection-in-practice is importantly different from reflection-on-practice. When we used the traditional measures of self-assessment to interpret our data, our results suggested that the overall summative, “guess your grade” self-assessments of our participants were moderate at best. However, the new behavioral measures used in this study were aimed more directly at capturing self-assessment as an ongoing monitoring process, and these results support a more optimistic outlook for self-assessment in daily practice. Participants did seem to be sensitive (whether consciously or unconsciously), in the moment, to whether they were likely to make an error. It seems that they could prioritize tasks in an order consistent with their overall accuracy on those tasks. They tended to defer answering specific questions for which their responses were likely to be incorrect. And, looking at the time taken to decide whether to answer or not, it seems that on a question-by-question basis, respondents knew what they knew (quickly deciding to answer questions on which they were very likely to be correct), knew what they did not know (quickly deciding to defer questions on which they were very likely to be incorrect), and appropriately considered more slowly whether to answer questions on which their eventual level of accuracy proved to be less certain (in the middle ranges).
We hypothesize that this in-the-moment capacity to recognize when things are going well, when one should slow down, and when to stop and look it up is, in part, the phenomenon that gives rise to the intuition that we are able to self-assess our strengths and weaknesses more broadly defined. The psychological sum of many accurate moment-to-moment judgments is likely to create a global sense that one has an accurate impression of one’s ability in a particular domain. It is widely recognized in other areas of psychology, however, that we actually are not very good at mentally aggregating across past experiences—we tend to recall particularly salient events rather than typical events.9 As a result, it would not be surprising if self-assessments provide yet another example of the aggregation problem in that we likely cannot expect humans to accurately tally mental averages of successes and failures. Compounding that problem is the work of social psychologists which suggests that we each have a psychological immune system that leads us to unintentionally and unconsciously interpret events in the best light for our self-perceptions.10 The implication of this, however, is that a person might be quite poor at a task, self-assess as being generally above average in the domain, yet still be safe in daily practice by self-limiting behavior on a case-by-case basis. Of course, this is still speculative, but it does point to interesting questions regarding the relative value of self-assessment as reflection-on-practice and self-assessment as reflection-in-practice for the self-regulation of safe practice.
That said, it should be noted that more work is required to overcome some of the limitations of this study. We cannot say with certainty that this participant pool or study task generalizes perfectly to the context of clinicians working in real-world environments. Work is ongoing to address this issue, but in the meantime we are comforted by the fact that the broader self-assessment literature has revealed consistent patterns of self-delusion across lay and professional groups of participants.2 An additional limitation that should be noted is that we cannot be absolutely certain that the decline in accuracy observed across the order in which categories were selected is driven by awareness of one’s strengths rather than fatigue. The latter hypothesis, however, is not supported by the equivalence in response times when compared across category order.
Finally, we wish to emphasize that the apparent accuracy of these moment-to-moment judgments of ability should not be construed as enabling these measures to be used to differentiate “good self-assessors” from “poor self-assessors” (or whether one is maintaining the skills necessary for practicing physicians). On the contrary, we continue to believe, on the basis of a large literature, that self-assessment in any form is not very “skill-like” (i.e., its successes and failures are context specific).1,2 However, continued refinement of the operational definitions of self-assessment measures will hopefully clarify the concepts of self-directed learning and continuing competence, thereby providing insight into mechanisms whereby the promotion of self-directed learning and professional self-regulation can be enhanced. To further develop this line of reasoning, continued research must be performed to determine whether similar patterns exist (a) in more ecologically valid environments, and (b) when the focus is turned from concrete knowledge-based tasks to more broadly defined skills such as communication skills and professionalism.
Dr. Regehr is supported as the Richard and Elizabeth Currie Chair in Health Professions Education Research. The work described in this report was made possible by the generous funding support received through a grant from The Royal College of Physicians and Surgeons of Canada and Associated Medical Services, Inc. The authors would also like to thank Yaw Otchere for his software programming efforts and Tavinder Ark for her research assistance.