How Do Gender and Anxiety Affect Students' Self-Assessment and Actual Performance on a High-Stakes Clinical Skills Examination?

Colbert-Getz, Jorie M. PhD; Fleishman, Carol MS; Jung, Julianna MD; Shilkofski, Nicole MD, MEd

Academic Medicine
doi: 10.1097/ACM.0b013e318276bcc4
Clinical Education

Purpose: Research suggests that medical students are not accurate in self-assessment, but it is not clear whether students over- or underestimate their skills or how certain characteristics correlate with accuracy in self-assessment. The goal of this study was to determine the effect of gender and anxiety on accuracy of students’ self-assessment and on actual performance in the context of a high-stakes assessment.

Method: Prior to their fourth year of medical school, two classes of medical students at Johns Hopkins University School of Medicine completed a required clinical skills exam in fall 2010 and 2011, respectively. Two hundred two students rated their anxiety in anticipation of the exam and predicted their overall scores in the history taking and physical examination performance domains. A self-assessment deviation score was calculated by subtracting each student's score as rated by standardized patients from his or her predicted score.

Results: When students self-assessed their data gathering performance, there was a weak negative correlation between their predicted scores and their actual scores on the examination. Additionally, there was an interaction effect of anxiety and gender on both self-assessment deviation scores and actual performance. Specifically, females with high anxiety were more accurate in self-assessment and achieved higher actual scores compared with males with high anxiety. No differences by gender emerged for students with moderate or low anxiety.

Conclusions: Educators should take into account not only gender but also the role of emotion, in this case anxiety, when planning interventions to help improve accuracy of students’ self-assessment.

Author Information

Dr. Colbert-Getz is director, Office of Assessment and Evaluation, Johns Hopkins University School of Medicine, Baltimore, Maryland.

Ms. Fleishman is academic program manager, Johns Hopkins Medicine Simulation Center, Baltimore, Maryland.

Dr. Jung is assistant professor, Department of Emergency Medicine, Johns Hopkins University School of Medicine, and associate director, Johns Hopkins Medicine Simulation Center, Baltimore, Maryland.

Dr. Shilkofski is vice dean of education, Perdana University Graduate School of Medicine, Kuala Lumpur, Malaysia, in affiliation with Johns Hopkins School of Medicine.

Correspondence should be addressed to Dr. Colbert-Getz, 1600 McElderry St., Room 334, Armstrong Medical Education Building, Baltimore, MD 21205; telephone: (443) 287-4421; fax: (410) 614-6285; e-mail:

Lifelong learning in medicine requires continuous development of knowledge, skills, and attitudes throughout all stages of training and practice. Medical students must establish lifelong learning habits early on to ensure developmental and professional progress during the transition from student to physician.1 To promote lifelong learning, educators need to be able to assist students in the development of an internal locus of control for learning. For example, learning to correctly identify strengths and weaknesses allows students to select appropriate goals to enhance learning.2 Conversely, an inability to accurately identify personal strengths and weaknesses decreases students' effectiveness as lifelong learners. It is therefore important to investigate which subpopulations of students are more likely to self-assess inaccurately, and under what conditions, so that appropriate interventions can be planned to improve students' ability to engage in accurate self-assessment.

Prior meta-analyses have shown a low to moderate correlation between medical students’ self-assessment and actual performance.3,4 However, there has not been a consistent finding of overprediction or underprediction, suggesting that prediction accuracy may vary by type of assessment and characteristics of students.3 Gender is one characteristic that has received considerable attention in research on medical student self-assessment. Female medical students typically perform better than males on standardized patient (SP) examinations or clinical skills examinations,5,6 but results are not consistent for accuracy of self-assessment by gender. Lind and colleagues7 reported that female students rated themselves lower on a self-assessment of confidence in skills during a surgery clerkship relative to faculty ratings of the students’ skills. Another study by Coutts and Rogers8 demonstrated that, when students’ ratings of their own performance on a five-point scale were compared with their actual performance on a 0% to 100% scale, females underpredicted and males overpredicted their performance. Vivekananda-Schmidt and colleagues9 showed that female students rated their confidence lower than did male students on an 11-point scale, but there was no difference in actual performance (measured on a 0%–100% scale) between female and male students. On the other hand, when students rated their performance on the same checklist as SPs, there was no difference between students’ and SPs’ overall ratings for history gathering or physical exam items.10,11 Another study by Gruppen and colleagues12 comparing students’ self-assessments with SPs’ assessments using the same instrument suggested that females overestimated performance on communication-related items, though the difference was only statistically significant for one item—avoiding medical jargon.12

Thus, previous work demonstrates that when self-assessment and actual performance measures are matched (e.g., self-assessment and actual performance using the same checklist), there is little or no difference in accuracy of self-assessment by gender. However, when measures are not matched (e.g., students' ratings of self-confidence or performance on a different scale compared with actual performance ratings), there seems to be a difference by gender, with females demonstrating lower self-confidence than males relative to actual performance.

Whether an assessment is low- or high-stakes is another factor that could explain the inconsistent results of self-assessment studies. Female medical students have significantly more general anxiety and test-related anxiety than male students.13 Because test-related anxiety may be expected to influence both self-assessment and actual performance, a high-stakes test would likely provoke more anxiety and thereby more readily reveal gender differences in accuracy of self-assessment. To date, no studies have addressed the interaction between gender and anxiety on self-assessment or actual test performance. Thus, the goal of this study was to determine the effects of gender and anxiety on self-assessment as well as on actual performance in the context of a high-stakes assessment.

Although research has not shown how anxiety and gender together relate to self-assessment, we suspect that previously reported findings of gender-based differences in self-assessment accuracy may be due in large part to differences in anxiety levels, with high-anxiety testing situations producing a gender difference relative to lower-anxiety testing situations. Because we measured performance with a high-stakes assessment in this study, we hypothesized that there would be interaction (or moderator) effects of anxiety and gender on self-assessment accuracy and actual performance. As a secondary hypothesis, we expected a weak correlation between self-assessment scores and actual performance, similar to the results of previous meta-analyses.3,4

Method

Prior to their fourth year of medical school, two classes of students at the Johns Hopkins University School of Medicine (JHUSOM) completed a high-stakes comprehensive clinical skills examination (CCSE) in fall 2010 and 2011, respectively. Students could also complete an optional survey as part of the CCSE. Two hundred twenty-two students completed the CCSE in this time period; 202 of these completed the optional survey (91% response rate) and were included in the analyses. This study was determined exempt by the JHUSOM institutional review board.

The CCSE is a high-stakes exam because a passing score is required for graduation. Content is based on domains and specifications defined by the National Board of Medical Examiners and used for the United States Medical Licensing Examination Step 2 Clinical Skills exam. All cases and checklists undergo extensive pilot testing and psychometric analysis before contributing to a student’s overall score.

To complete the CCSE, students rotate through 10 cases (plus 1 pilot case) involving an SP encounter, during which students are expected to take a patient history, perform a focused physical examination based on the patient’s presenting complaint and history, and generate a differential diagnosis and management plan. The entire exam takes approximately seven hours to complete.

SPs assess students according to completion of items on a predefined checklist (scored “done correctly” = 1, or “not done” = 0), which SPs submit electronically immediately after each encounter. The SPs score students on three major domains: history taking, physical examination, and interpersonal/communication skills. For the purposes of this study, we only investigated history taking and physical examination performance, which together we refer to as “data gathering performance.” The overall reliability coefficient (Cronbach alpha) for the means of the 10 data gathering cases from 2010 and 2011 was 0.64. The interpersonal/communication skills domain was not included because the scale was revised from 2010 to 2011, and reliability was not sufficiently high to allow accurate estimation of interaction effects. Each case checklist has 12 to 20 data gathering items, so a student’s mean data gathering score is the percentage of history gathering and physical examination items for which he or she received credit on the SP’s assessment.

Standardized patients.

Two to four SPs are assigned to each CCSE case and are trained annually by faculty and SP educators. CCSE SPs ranged in age from 26 to 77 years, with a mean of 54 years. Approximately half were female (51%, 21) and half were male (49%, 20). Eighty percent (33) had acting experience, and the average number of years working as an SP was 8, with a range of 2 to 14 years of experience.

The last station of the CCSE included an optional survey asking for responses to various items about the exam experience, including students’ anxiety prior to the examination and their predictions of their performance on each domain of the exam. Similar to an OSCE anxiety scale used in prior research,14 students rated their test anxiety in anticipation of the CCSE on a six-point Likert scale where 1 = no anxiety and 6 = extreme anxiety. Students predicted their history taking and physical examination scores on a 0% to 100% scale.

Data analysis

We grouped students into one of three anxiety categories (low, moderate, high). The low anxiety category included students who rated their anxiety at a 1 or 2, the moderate anxiety category included students who rated their anxiety at a 3 or 4, and the high anxiety category included students who rated their anxiety at a 5 or 6. We computed the percentage of males and females for each anxiety category. We calculated an “actual data gathering score” for each student by averaging the percentage of data gathering items scored as “done correctly” for each individual case across all 10 cases. We calculated a “self-assessment deviation data gathering score” for each student by subtracting students’ actual data gathering score from the average of their predicted history taking and physical examination scores. Higher self-assessment deviation values reflect a larger difference between predicted and actual scores. Positive values reflect overprediction, and negative values reflect underprediction.
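The scoring steps above can be sketched in a few lines. This is an illustrative reimplementation, not the authors' code; all function and variable names are our own.

```python
def anxiety_category(rating):
    """Map a 1-6 anxiety rating to the study's three categories."""
    if rating in (1, 2):
        return "low"
    if rating in (3, 4):
        return "moderate"
    if rating in (5, 6):
        return "high"
    raise ValueError("anxiety rating must be an integer from 1 to 6")

def actual_data_gathering_score(case_item_scores):
    """Mean percentage of checklist items scored 'done correctly' (1)
    across all cases. `case_item_scores` is a list with one inner list
    of 0/1 checklist marks per case."""
    case_percentages = [100.0 * sum(items) / len(items)
                        for items in case_item_scores]
    return sum(case_percentages) / len(case_percentages)

def self_assessment_deviation(predicted_history, predicted_physical, actual):
    """Average of the two predicted scores minus the actual score.
    Positive values reflect overprediction; negative, underprediction."""
    predicted = (predicted_history + predicted_physical) / 2.0
    return predicted - actual
```

For example, a student who predicts 80% for history taking and 70% for physical examination but earns an actual score of 68% has a deviation score of +7 percentage points, an overprediction.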

We computed normality statistics (skew, kurtosis) for self-assessment and actual data gathering scores. To determine the relationship between self-assessment and actual performance, we calculated the Pearson product–moment correlation coefficient (r). We also calculated the coefficient of determination (r²) to determine the amount of shared variance. We conducted separate 2 (gender: female, male) × 3 (anxiety level: low, moderate, high) between-subjects ANOVAs on (1) actual data gathering scores and (2) self-assessment deviation data gathering scores. We assessed simple effects of gender across each level of anxiety for the ANOVAs with independent samples t tests. We set statistical significance at 0.05 for the ANOVAs and adjusted it down to 0.017 (alpha/number of simple effects tests) for the simple effects tests. To determine whether females and males significantly overpredicted or underpredicted their performance, we conducted one-sample t tests comparing self-assessment deviation scores to zero at each level of anxiety for females and males. Again, we adjusted alpha down to 0.017 for each anxiety-level one-sample t test.
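A minimal sketch of the inferential steps, using SciPy and invented example data (the study's raw data are not shown here); the two-way ANOVAs themselves would be run in a full statistics package, so only the correlation, a Bonferroni-adjusted simple-effects test, and a one-sample test against zero are illustrated.

```python
from scipy import stats

ALPHA = 0.05
N_SIMPLE_EFFECTS = 3                   # one gender comparison per anxiety level
ADJ_ALPHA = ALPHA / N_SIMPLE_EFFECTS   # Bonferroni: 0.05 / 3 ~ 0.017

# Illustrative data only (not the study's): deviation scores in percentage
# points for high-anxiety females and males.
female_high = [-5, 0, -3, 2, -4, -2]
male_high = [10, 6, 12, 7, 9, 4]

# Relationship between predicted and actual scores: Pearson r and r^2
# (shared variance).
predicted = [75, 80, 70, 85, 78, 82]
actual = [70, 72, 74, 69, 71, 73]
r, p_r = stats.pearsonr(predicted, actual)
shared_variance = r ** 2

# Simple effect of gender within the high-anxiety group, tested at the
# Bonferroni-adjusted alpha.
t_ind, p_ind = stats.ttest_ind(female_high, male_high)
gender_diff_significant = p_ind < ADJ_ALPHA

# Does a group significantly overpredict? One-sample t test of deviation
# scores against zero, with a positive t indicating overprediction.
t_one, p_one = stats.ttest_1samp(male_high, 0.0)
males_overpredict = (p_one < ADJ_ALPHA) and (t_one > 0)
```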

Results

Of the 202 students who completed the CCSE and the optional survey in fall 2010 or 2011, 53% (107) were male and 47% (95) were female. The racial background of the students was 51% (103) white or Caucasian, 35% (71) Asian, 9% (18) black or African American, 3% (6) Hispanic, and 2% (4) American Indian/Alaska Native or Native Hawaiian/Pacific Islander. Because over 90% (202/222) of students from both classes were included in the data analysis, we do not believe the study had issues with selection bias.

Table 1 provides frequencies of students who rated their anxiety in anticipation of the CCSE in the low, moderate, or high category, organized by gender. The majority of students self-reported their anxiety levels as moderate regardless of gender. The t values for actual data gathering scores were –1.60 for skew and –0.91 for kurtosis. The t values for self-assessment deviation data gathering scores were 0.65 for skew and 0.61 for kurtosis. Thus, neither score type violated assumptions of normality (t values fell between –2.00 and 2.00), meaning that parametric tests could be conducted on the data. The correlation between actual data gathering scores and self-assessment scores was –0.08 with an r² of 0.01, indicating a weak negative relationship and very little shared variance between the two score types.

Figure 1 shows actual mean data gathering scores by gender and anxiety group. The main effect of gender was significant (F(1,196) = 8.40, P = .004, partial η² = 0.04). The main effect of anxiety was not significant (F(2,196) = 1.45, P = .237). The interaction of gender and anxiety was significant (F(2,196) = 7.23, P = .001, partial η² = 0.07). Specifically, females with high anxiety had better actual data gathering performance (mean = 71%, standard deviation [SD] = 4%) than did males with high anxiety (mean = 64%, SD = 5%) (t(42) = 4.74, P < .001, η² = 0.30). There was no significant difference in actual data gathering scores between females and males with low anxiety (t(43) = 0.49, P = .629) or moderate anxiety (t(111) = –0.98, P = .328).

Figure 2 shows self-assessment deviation data gathering scores by gender and anxiety group. The main effect of gender was not significant (F(1,196) = 3.60, P = .059), nor was the main effect of anxiety (F(2,196) = 2.15, P = .120), but the interaction of gender by anxiety was significant (F(2,196) = 3.13, P = .046, partial η² = 0.03). Specifically, there was a significant difference in self-assessment deviation scores between female (mean = –2%, SD = 10%) and male (mean = 8%, SD = 9%) students with high anxiety (t(42) = 3.22, P = .002, η² = 0.19), with females being more accurate than males. There were no significant differences in self-assessment deviation scores between females and males with low anxiety (t(43) = –0.55, P = .587) or moderate anxiety (t(111) = 1.05, P = .298).

Figure 2 also shows that females with high anxiety were the only students who underpredicted their performance; all other students overpredicted their data gathering performance. We tested the significance of the amount of underprediction or overprediction for each anxiety and gender group with one-sample t tests. The –2% (SD = 10%) difference between predicted and actual data gathering performance for females with high anxiety was not significantly different from zero (t(25) = 1.00, P = .33). The 4% (SD = 11%) difference between predicted and actual data gathering performance for females with moderate anxiety was also not significantly different from zero at the adjusted alpha (t(49) = 2.40, P = .020). However, the 9% (SD = 13%) overprediction by females with low anxiety was significantly different from zero (t(18) = 2.94, P = .009). Males significantly overpredicted by 7% (SD = 12%) if they had low anxiety (t(25) = 2.96, P = .007), by 6% (SD = 11%) if they had moderate anxiety (t(62) = 4.07, P < .001), and by 8% (SD = 9%) if they had high anxiety (t(17) = 3.61, P = .002).

Discussion and Conclusions

When students self-assessed their data gathering performance, there was a weak negative correlation between their predicted score and their actual score on a required clinical skills examination, supporting the results of prior research.3,4

Research on the relationship between anxiety and performance has offered mixed results. Some social science studies suggest that students with low anxiety will perform better on examinations than will students with moderate or high anxiety.15,16 Others have found that moderate levels of physiological arousal can be associated with higher exam performance.17 Medical education studies have found that high-anxiety students perform as well as or better than students with moderate anxiety on high-stakes examinations.17–19 Thus, the low- or high-stakes nature of an examination could affect the relationship between anxiety and performance. Gender could also be a moderating factor in this relationship. Chapell and colleagues20 found that even though female college students had more anxiety than males, they also had higher GPAs than males. Similarly, female medical students have more anxiety13 and less confidence21 than male medical students, even though they perform as well as, and in some cases better than, males.

Strickler and colleagues22 suggested that females perform better relative to males because females study more and thus are more prepared. Additionally, Mavis14 found that students with more clinical skills experience had more anxiety about a clinical examination. On the basis of prior research,23 Mavis suggested that as students gain more clinical experience, they become more knowledgeable about what they know and do not know, and the latter can increase anxiety relative to students with less clinical experience. Because females with high anxiety performed better than females with low and moderate anxiety and better than males in this study, we suspect that the high-anxiety females were more prepared for the exam. Future work should explore how examination preparation relates to anxiety and gender.

Being more aware of limitations in knowledge could also affect how accurately students are able to assess their performance. Research has found that high performers are more likely to underpredict performance, and low performers are more likely to overpredict performance because they are “unskilled and unaware.”24 Additionally, females are more likely than males to underestimate performance25,26 but estimation can vary by task. In this study, females with high anxiety had the best performance and also were closest in their prediction, although their direction of estimation trended toward underprediction. Males with high anxiety had the poorest performance and overpredicted their performance.

This study examined the effect of anxiety and gender on a high-stakes examination at one medical school, which may limit the generalizability of the results to medical schools with similar demographics. Anxiety was measured by students' self-report on one survey item, so we were not able to determine the psychometric properties of anxiety ratings. Additionally, anxiety ratings and performance predictions were collected retrospectively, so they could have been confounded by relief at finishing the exam. However, because this was a high-stakes examination, we did not want to add stress for students by having them complete a survey prior to the examination, which could have caused them to report anxiety based on annoyance with the extra task rather than anxiety related entirely to the examination.

Finally, the effect size for the gender-by-anxiety interaction was small for self-assessment deviation scores and medium for actual performance. When looking at just high-anxiety students, the effect size for differences by gender was medium for self-assessment deviation and large for actual performance. Although we suspect that the effect sizes would increase if we gathered data for another class of students, the CCSE case content and scoring changed in 2012 to match recent updates to the USMLE Step 2 Clinical Skills examination. Thus, we thought it more important to gather scores from students who underwent CCSE experiences that were as similar as possible, despite having a smaller sample.

Accurate self-assessment skills are essential for lifelong learning. Educators can facilitate the development of such skills through targeted interventions for students who demonstrate inaccuracy in self-assessment. Additionally, our research indicates that students’ anxiety levels may help guide educators’ efforts to teach accurate self-assessment. For female students, high anxiety may be an indication that they are more aware of their knowledge limitations and, thus, able to self-assess more accurately; for male students, high anxiety may be an indication that they are less aware of their knowledge limitations. Research suggests that overpredicting performance or overconfidence in ability is a result of low-quality feedback or lack of feedback from educators.27 Thus, educators and administrative leadership can promote a successful start in medical school by ensuring that feedback is consistent and frequent, that evaluators receive proper training, and that students spend time reflecting on instances when they grossly overpredict or underpredict their performance, taking into account their perceived anxiety. Only with guidance from informed educators will students develop the ability to correctly identify strengths and weaknesses and, thus, become proficient lifelong learners.

Acknowledgments: The authors wish to thank Neva Krauss for her work in training the standardized patients.

Funding/Support: None.

Other disclosures: None.

Ethical approval: This study was deemed exempt by the institutional review board of Johns Hopkins University School of Medicine.

References

1. Miflin BM, Campbell CB, Price DA. A conceptual framework to guide the development of self-directed, lifelong learning in problem-based medical curricula. Med Educ. 2000;34:299–306
2. Eva KW, Regehr G. Self-assessment in the health professions: A reformulation and research agenda. Acad Med. 2005;80(10 suppl):S46–S54
3. Blanch-Hartigan D. Medical students’ self-assessment of performance: Results from three meta-analyses. Patient Educ Couns. 2011;84:3–9
4. Gordon MJ. A review of the validity and accuracy of self-assessments in health professions training. Acad Med. 1991;66:762–769
5. Haist SA, Wilson JF, Elam CL, Blue AV, Fosson SE. The effect of gender and age on medical school performance: An important interaction. Adv Health Sci Educ Theory Pract. 2000;5:197–205
6. Wiskin CM, Allan TF, Skelton JR. Gender as a variable in the assessment of final year degree-level communication skills. Med Educ. 2004;38:129–137
7. Lind DS, Rekkas S, Bui V, Lam T, Beierle E, Copeland EM 3rd. Competency-based student self-assessment on a surgery rotation. J Surg Res. 2002;105:31–34
8. Coutts L, Rogers J. Predictors of student self-assessment accuracy during a clinical performance exam: Comparisons between over-estimators and under-estimators of SP-evaluated performance. Acad Med. 1999;74(10 suppl):S128–S130
9. Vivekananda-Schmidt P, Lewis M, Hassell AB, et al. Validation of MSAT: An instrument to measure medical students’ self-assessed confidence in musculoskeletal examination skills. Med Educ. 2007;41:402–410
10. Antonelli MA. Accuracy of second-year medical students’ self-assessment of clinical skills. Acad Med. 1997;72(10 suppl 1):S63–S65
11. Kaiser S, Bauer JJ. Checklist self-evaluation in a standardized patient exercise. Am J Surg. 1995;169:418–420
12. Gruppen LD, Garcia J, Grum CM, et al. Medical students’ self-assessment accuracy in communication skills. Acad Med. 1997;72(10 suppl 1):S57–S59
13. Hojat M, Glaser K, Xu G, Veloski JJ, Christian EB. Gender comparisons of medical students’ psychosocial profiles. Med Educ. 1999;33:342–349
14. Mavis B. Self-efficacy and OSCE performance among second year medical students. Adv Health Sci Educ Theory Pract. 2001;6:93–102
15. Zeidner M. Does test anxiety bias scholastic aptitude test performance by gender and sociocultural group? J Pers Assess. 1990;55:145–160
16. Naveh-Benjamin M, McKeachie WJ, Lin YG, Holinger D. Test anxiety: Deficits in information processing. J Educ Psychol. 1981;73:816–824
17. Cassady JC, Johnson RE. Cognitive test anxiety and academic performance. Contemp Educ Psychol. 2002;27:270–295
18. Frierson HT Jr, Hoban D. Effects of test anxiety on performance on the NBME Part I examination. J Med Educ. 1987;62:431–433
19. Frierson HT Jr, Hoban JD. The effects of acute test anxiety on NBME Part I performance. J Natl Med Assoc. 1992;84:686–689
20. Chapell MS, Blanding B, Silverstein ME, et al. Test anxiety and academic performance in undergraduate and graduate students. J Educ Psychol. 2005;97:268–274
21. Blanch DC, Hall JA, Roter DL, Frankel RM. Medical student gender and issues of confidence. Patient Educ Couns. 2008;72:374–381
22. Strickler LJ, Rock DA, Burton NW. Sex differences in predictions of college grades from scholastic aptitude test scores. J Educ Psychol. 1993;85:710–718
23. Hoppe RB, Farquhar LJ, Henry RC, Stoffelmayr BE, Helfer ME. A course component to teach interviewing skills in informing and motivating patients. J Med Educ. 1988;63:176–181
24. Kruger J, Dunning D. Unskilled and unaware of it: How difficulties in recognizing one’s own incompetence lead to inflated self-assessments. J Pers Soc Psychol. 1999;77:1121–1134
25. Beyer S. Gender differences in the accuracy of grade expectancies and evaluations. Sex Roles. 1999;41:279–296
26. Beyer S, Bowden EM. Gender differences in self-perceptions: Convergent evidence from three measures of accuracy and bias. Pers Soc Psychol Bull. 1997;23:157–172
27. Schwartz S, Griffin T. Comparing different types of performance feedback and computer-based instruction in teaching medical students how to diagnose acute abdominal pain. Acad Med. 1993;68:862–864
© 2013 Association of American Medical Colleges