Educators are reconsidering traditional methods of assessment as competency-based medical education becomes more commonplace. Measurement of competence requires a systematic approach and multiple assessments collected longitudinally, which, collectively, allow for an authentic portrayal of performance to emerge.1,2 This longitudinal, multiassessment approach is especially important for assessing complex behaviors such as communication, professionalism, and teamwork.3,4
Faculty at Northwestern University Feinberg School of Medicine (NUFSM) created an electronic portfolio to collect all assessments of students’ behavioral competencies.5 Portfolio assessments incorporate both narrative feedback and quantitative performance ratings. The portfolio provides an organizational platform to collect multiple assessments over time that students and faculty members alike can review to evaluate progress. It also allows for students to engage in reflection and self-directed learning as they review their feedback and identify areas for improvement. Just before students begin their first clerkship year (Year 3), a faculty committee conducts a summative review of the full content of each student’s portfolio. This review has two goals: (1) to identify students with deficiencies in certain competencies, and (2) to provide feedback to students concerning their strengths and weaknesses.
The summative portfolio review assesses behaviors that are essential for success on the clerkships.6,7 These include professionalism, interpersonal skills, and motivation. Difficulty in these domains can be persistent, can influence later performance if not addressed,8,9 and can negatively affect patient care.10,11 Regrettably, problematic behavior often goes undocumented in the preclerkship years.12 Moreover, clerkship grades are high-stakes assessments that influence acceptance into a residency program13; thus—even though remediation of these skills is challenging9,12—it is important to identify deficiencies early and create opportunities for skill development before clerkships begin.
Descriptive accounts of comprehensive portfolio assessment are now prevalent, and scholars have noted that portfolios have “face validity” among educators and students14,15; however, studies that empirically collect validity evidence of portfolio assessment in medical education are rare, with a few notable exceptions.16,17 In particular, little is known about the relationship between portfolio assessment and other variables in medical education, including later performance outcomes.
Recent publications have focused on using Kane’s18 argument-based approach to validating programs of assessment in medical education.19,20 This framework involves the empirical testing of assumptions behind how assessment decisions are interpreted and used. The approach entails, first, stating a claim or argument about an assessment and, then, collecting evidence to evaluate whether this argument is acceptable and defensible.18
In our current study, we test one of the main assumptions of our portfolio assessment system: that trained faculty can review qualitative evidence in a preclerkship portfolio and identify behavior, specifically problematic behavior, that will persist and negatively affect clerkship performance. We argue that trained faculty can discern observable patterns of behavior in the preclerkship setting by reviewing narrative feedback in the portfolios, and that observations of this behavior can be extrapolated to the clerkship setting. To evaluate this claim, we have analyzed the relationship between the results of a preclerkship portfolio review and later performance in clerkships. To our knowledge, this is the first study to report the association between a preclerkship portfolio review of behavioral competencies and clerkship performance, and as such, we believe it has the potential to add validity evidence to the use of portfolio review as an assessment method.
The NUFSM portfolio is one of two parallel preclerkship assessment systems at our medical school, which is a private medical school in a midwestern U.S. state with an average class size of 165. The other is a pass/fail grading scheme organized around six curricular elements called “blocks.” Block grades derive from quantitative assessments including written examinations, the majority of which assess knowledge acquisition and application. Notably, the block quantitative assessments also include faculty ratings of small-group and clinical performance. Students must pass all blocks before proceeding to the preclerkship portfolio review.
The portfolio is organized around the domains in which students are expected to demonstrate competency before graduation. Five domains are subject to portfolio review: (1) continuous learning, (2) communication, (3) patient care, (4) professionalism, and (5) teamwork. Medical knowledge is not subject to portfolio review because it is sufficiently assessed via other methods. Each competency domain includes subcompetencies that represent specific behaviors and skills medical students must demonstrate before graduation; for example, the communication competency domain includes “Listen empathically and effectively to patients, colleagues, and teachers.” Every assessment in the portfolio maps to one subcompetency, allowing students and faculty to easily view performance in each domain. Portfolio assessments include faculty and peer evaluations of small-group performance, work-based clinical assessments, and objective structured clinical examinations (OSCEs). The portfolio does not include results of written examinations.
NUFSM has a tracking system that is designed to identify significant lapses in professionalism, such as cheating. In contrast, the portfolio system assists us in recognizing behavioral issues that may not garner attention at a single point in time; rather, it allows us to identify troublesome behavior that reemerges over time across multiple contexts (e.g., disrespect to peers or a lack of accountability), which would likely be problematic in a clinical setting.
We expect students to reflect on their progress and to identify learning goals for future achievement twice annually. They meet regularly with their mentor for formative review of both their progress and their reflections.
At the end of the preclerkship phase (Year 2) and again at the end of the first clerkship year (Year 3), a trained committee composed of faculty clinicians, all of whom have extensive medical education experience, formally reviews each student’s portfolio to determine whether the student is progressing toward competence. At least two reviewers independently review each portfolio, and they meet to reconcile any differences through discussion and debate.21 The interrater agreement before reconciliation is typically high; in 2014, it ranged from 77% for professionalism to 96% for continuous learning. The faculty reviewers read through the complete collection of assessment data, including all narrative feedback. The reviewers decide whether the medical student can progress to the next level of training, or undergo remediation before advancing.5
NUFSM students must complete seven required third-year clerkships before graduating. Clerkship grades are as follows: Honors, High Pass, Pass, or Fail. Students who require extra time to meet the clerkship objectives receive a grade of Pass after Remediation. Most clerkships derive final grades, using different formulas, from National Board of Medical Examiners (NBME) shelf exams, clinical evaluations from faculty and residents, and clerkship-specific OSCE performances. Some clerkships also grade students on professionalism and participation.
In the spring of 2014, 156 second-year NUFSM students completed the preclerkship portfolio review. Of these, faculty reviewers identified 31 (20%) whose portfolios indicated some concerning behavior. Four of these students had significant deficiencies that warranted completion of a formal remediation program before starting clerkships. Reviewers noted less serious performance concerns in the other 27 students, and these students were instructed to develop and participate in self-directed remediation by meeting with their mentor and creating an individualized improvement plan. The faculty reviewers based the decision to require formal remediation versus self-directed remediation on the frequency and gravity of deficiencies in one or more competency domains.5
For the purposes of this study, we divided the students into two groups: (1) students whose portfolios indicated some concerning behavior (both those requiring formal remediation and those requiring self-directed remediation), and (2) students progressing satisfactorily. We excluded 18 students from this study because they did not progress immediately to the clerkship phase (12 of these students took a leave of absence to pursue their PhD as a part of a dual-degree program, and 6 students took a year off for independent study). We also excluded 3 students who had not received final grades for all seven required clerkships by the end of October 2015. Therefore, the final sample consisted of 135 students who met the inclusion criteria for this study, 24 (18%) of whom were identified as demonstrating concerning behavior.
We measured performance in clerkships in two ways. First, we looked at the final grades (Honors, High Pass, etc.) for all seven required clerkships to detect differences between the two groups. Second, for the purposes of this study, we created a clinical composite score. We calculated this score by combining clinical evaluations, OSCE scores, and professionalism points (when available) across four clerkships: medicine, neurology, obstetrics–gynecology, and surgery. We chose these clerkships because all use similar elements in the calculation of the final grade. Statistical analysis demonstrated that elements in the composite score measure similar constructs and correlate moderately with one another. We omitted knowledge-based NBME shelf scores from the composite to better isolate the behavioral attributes that are the focus of the portfolio.
Faculty members and residents who supervise students in the clerkships complete a clinical evaluation for each student to communicate the student’s performance in inpatient and outpatient settings. The assessment form covers attributes and abilities such as clinical reasoning, presentation skills, communication with patients and the health care team, reliability, and initiative. The form includes a description of expected performance, and students are scored against this benchmark on a scale of 1 to 9 (1 = “target for improvement” and 9 = “exceeds this expectation”). Clerkship-specific OSCEs measure abilities such as communication with patients and families, respect, and clinical reasoning. Professionalism and participation scores measure domains such as accountability and participation in small-group activities.
The analyses focused on detecting group differences in clerkship performance between students identified as demonstrating concerning behavior through the preclerkship portfolio review and those progressing satisfactorily. To represent overall performance in clerkships, we calculated a pseudo–grade point average (GPA) on a scale of 0 to 4 (0 = fail; 1 = pass after remediation; 2 = pass; 3 = high pass; 4 = honors). We used a Mann–Whitney analysis to test for significant differences in overall clerkship grades between the two groups. We measured the magnitude of the difference using the probabilistic index, or common language effect size, appropriate for use with nonparametric data.22
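As a rough sketch of this first analysis, the Mann–Whitney U test and the probabilistic index can be computed with SciPy; the pseudo-GPA values below are made-up illustrative numbers, not the study data, and the probabilistic index is recovered directly from the U statistic of the first group.

```python
import numpy as np
from scipy import stats

# Hypothetical pseudo-GPAs (0 = fail; 1 = pass after remediation; 2 = pass;
# 3 = high pass; 4 = honors) -- illustrative values, not the study data.
satisfactory = np.array([4, 3, 3, 4, 3, 2, 4, 3, 3, 4])
concerning = np.array([2, 3, 2, 1, 3, 2])

# Mann-Whitney U test for a group difference in ordinal clerkship grades.
u_stat, p_value = stats.mannwhitneyu(satisfactory, concerning,
                                     alternative="two-sided")

# Probabilistic index (common language effect size / probability of
# superiority): P(X > Y) + 0.5 * P(X = Y), obtained from U directly.
n1, n2 = len(satisfactory), len(concerning)
prob_index = u_stat / (n1 * n2)
```

A value of 0.5 would indicate no separation between the groups; values approaching 1 indicate that a randomly chosen satisfactory-group score is very likely to exceed a randomly chosen concerning-group score.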
To calculate the clinical composite score, we standardized the components as a percentage of possible points, averaged across the four clerkships. We gave all components equal weight. We used t tests to test for significant differences between groups, and we measured effect size using Cohen’s d.
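The group comparison on the composite score can be sketched in the same way; again the percentages below are fabricated for illustration, and Cohen's d uses the pooled standard deviation, consistent with the equal-weight, percentage-of-possible-points composite described above.

```python
import numpy as np
from scipy import stats

# Hypothetical composite scores (% of possible points, averaged across four
# clerkships) -- illustrative values, not the study data.
satisfactory = np.array([82.0, 84.5, 81.0, 83.2, 80.5, 85.1, 82.8, 81.9])
concerning = np.array([79.0, 77.5, 80.2, 78.8, 76.9, 79.6])

# Independent-samples t test for a group difference in composite scores.
t_stat, p_value = stats.ttest_ind(satisfactory, concerning)

# Cohen's d using the pooled standard deviation.
n1, n2 = len(satisfactory), len(concerning)
pooled_sd = np.sqrt(((n1 - 1) * satisfactory.var(ddof=1) +
                     (n2 - 1) * concerning.var(ddof=1)) / (n1 + n2 - 2))
cohens_d = (satisfactory.mean() - concerning.mean()) / pooled_sd
```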
We performed multiple linear regression using the clinical composite score as a dependent variable and the results of the portfolio review as an independent variable. We further controlled for medical knowledge by using United States Medical Licensing Examination (USMLE) Step 1 scores as an additional independent variable. We entered the portfolio results as a dummy variable (concerning behavior = 1; progressing satisfactorily = 0). We also tested for an interaction effect between concerning behavior and Step 1 scores.
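The regression model can be sketched with statsmodels on simulated data; the variable names, sample sizes, and coefficients below are assumptions chosen for illustration only. The dummy coding and the interaction term mirror the specification described above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data frame (illustrative only): Step 1 scores, a dummy for the
# portfolio result (1 = concerning behavior, 0 = progressing satisfactorily),
# and a composite score built with an assumed adjusted group effect.
rng = np.random.default_rng(0)
n = 135
df = pd.DataFrame({
    "step1": rng.normal(230, 15, n).round(),
    "concern": (rng.random(n) < 0.18).astype(int),
})
df["composite"] = (70 + 0.05 * df["step1"] - 2.4 * df["concern"]
                   + rng.normal(0, 3, n))

# Main-effects model: composite regressed on the dummy, controlling for Step 1.
main = smf.ols("composite ~ concern + step1", data=df).fit()

# Adding the concern x Step 1 interaction tests whether the group effect
# varies with medical knowledge.
interaction = smf.ols("composite ~ concern * step1", data=df).fit()
adjusted_group_effect = main.params["concern"]
interaction_p = interaction.pvalues["concern:step1"]
```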
We performed all analyses using SPSS version 23 (IBM Corp., Armonk, New York). The Institutional Review Board at Northwestern University deemed this study exempt.
For the 24 students in the final sample who had concerning behavior, the majority of problems were in the professionalism and communication domains (see Figure 1).
The first analysis examined final clerkship grades. Students with concerning behavior received significantly lower grades in the seven required clerkships compared with those progressing satisfactorily. The pseudo-GPA of students progressing satisfactorily was 3.29, compared with 2.71 for the students with concerns (Mann–Whitney U = 786.5; P = .002). The probabilistic index, also called “the common language effect size” or “probability of superiority” (Â = 0.70), indicated a medium effect size.23 This statistic represents the probability that a randomly selected score from one sample group is larger than a randomly selected score from a second sample, where 0.5 represents no difference between the two samples. In this case, Â = 0.70 indicates a 70% probability that a GPA randomly selected from the sample of students progressing satisfactorily will be higher than a GPA randomly selected from students with concerning behavior.
The second analysis examined how performance on the clinical composite score differed between the two groups. Figure 2 displays an overlapping distribution of composite scores across both groups of students. The results from the t test show that students with concerning behavior averaged 79.4 (standard deviation [SD] = 3.9) on the composite score, significantly lower than students progressing satisfactorily (mean = 82.3, SD = 3.0, P < .001). Cohen’s d (0.83) indicates a large effect size.
Results of the multiple linear regression show that after holding the USMLE Step 1 score constant, demonstrating concerning behaviors as found in the preclerkship portfolio review was significantly associated with a lower composite score. The interaction effect was not significant. The final model accounted for 32% of the variance (Table 1).
Discussion and Conclusions
We studied the associations of preclerkship portfolio assessments, particularly behavioral competencies, and later performance in clerkships. The results suggest that portfolio reviewers are able to identify students whose concerning behavior before clerkship affected their future performance during clerkships. These students received not only lower clerkship grades but also significantly lower scores on clerkship measurements of behavioral competencies in communication, professionalism, patient care, and teamwork (which constituted the clinical composite score).
The results persist even after holding medical knowledge constant. To illustrate, imagine two students with the same USMLE Step 1 score. The portfolio review indicates that one has demonstrated concerning behavior and that the other is progressing satisfactorily. Our results suggest that the student with the concerning behavior will score 2.4 points lower on the clinical composite score than his/her counterpart (Table 1), suggesting that the portfolio review does indeed measure something that is independent of medical knowledge.
Others have found that preclerkship behaviors can persist across time. Klamen and Borgia24 have found that a preclerkship OSCE effectively detected poor communication skills and professionalism lapses that negatively affected future performance on a clerkship OSCE. Our study is the first to demonstrate that a portfolio review is an assessment method that can successfully identify subsequent behavior associated with lower clerkship grades and clinical performance scores. These results add validity evidence to the use of portfolios in assessment.
We attribute our findings to the strength of the NUFSM portfolio assessment system. This system is designed to be more comprehensive than traditional quantitative measures used to identify at-risk students.5 Instead of relying on a single measurement, the portfolio collects multiple data points across time, allowing for behavioral patterns to emerge.1 The portfolio also captures observational performance data in real-life situations, leading to more authentic assessment, which may better predict future clinical performance.25
The most authentic data we have collected are the narrative comments students receive from faculty members and their peers. We have found these narrative descriptions to be superior to quantitative ratings for measuring behavioral competencies.2,26 We depend on faculty members’ professional judgment to interpret the data. The portfolio reviewers rely on their experience working and teaching in clinical settings to decide which types of behavior are most likely to be problematic. The popularity and growth of entrustable professional activities highlight the increased acceptance and value of professional judgment in the assessment of a learner.27 Our results reinforce the use of professional judgment as a formal assessment method.
The findings have other important implications. A portfolio that serves as a repository of all assessments gives reviewers access to all recorded data points. This comprehensive view allows for longitudinal themes to emerge, increasing the validity of the decisions resulting from portfolio review. We believe the results of this study will help to increase student acceptance of portfolios. Research on informed self-assessment suggests that medical students will rely on external sources to guide self-directed learning only when they consider the data sources credible and trustworthy.28 A goal of the portfolio-based assessment system at NUFSM is to develop students’ self-reflective capacity and to encourage their self-directed learning.5 The findings we report here enhance the credibility of our portfolio as a method of assessment, particularly for behavioral competencies that are not easily measured. Ideally, students’ use of the preclerkship portfolio to develop self-reflection will drive self-directed learning and performance improvement throughout their careers.
Our study has a few limitations. First, the number of students demonstrating concerning behavior in the sample (n = 24) is relatively small. Second, our findings represent one cohort at one medical school and may not be generalizable to other settings. Third, we have attempted to isolate behavioral competencies in the clerkships by excluding NBME performance from our clinical composite score, but we acknowledge that knowledge acquisition is interrelated with these behaviors—particularly clinical reasoning—and can never be fully excluded.
We are carefully reevaluating our remediation efforts as a result of this study. One major goal of the NUFSM preclerkship summative portfolio review is to provide early educational support to struggling students, so they can perform well during clerkships. We are exploring additional interventions that will best position our students for success.
Future research will use qualitative methodologies to more closely examine which specific types of persistent behaviors most impact clinical performance. We will also explore whether the portfolio assessment process increases students’ reflective abilities and capacity for self-directed learning.
In conclusion, this study demonstrates that a preclerkship summative portfolio review can identify problematic behavior that negatively affects later clerkship performance. This study provides strong evidence that the decisions made from the portfolio review are valid. Our experience suggests that a comprehensive portfolio system is an effective tool for use in competency-based medical education, particularly for the assessment of behavioral competencies. Further study is needed to identify best practices for remediation of behavioral concerns identified in the preclerkship period.
The authors wish to thank Joseph Feinglass, PhD, William C. McGaghie, PhD, Diane B. Wayne, MD, and Julia Lee, PhD, for their helpful feedback on this report.
1. van der Vleuten CP, Schuwirth LW, Driessen EW, et al. A model for programmatic assessment fit for purpose. Med Teach. 2012;34:205–214.
2. Holmboe ES, Sherbino J, Long DM, Swing SR, Frank JR. The role of assessment in competency-based medical education. Med Teach. 2010;32:676–682.
3. Epstein RM, Hundert EM. Defining and assessing professional competence. JAMA. 2002;287:226–235.
4. Hodges BD, Ginsburg S, Cruess R, et al. Assessment of professionalism: Recommendations from the Ottawa 2010 Conference. Med Teach. 2011;33:354–363.
5. O’Brien CL, Sanguino SM, Thomas JX, Green MM. Feasibility and outcomes of implementing a portfolio assessment system alongside a traditional grading system. Acad Med. 2016;91:1554–1560.
6. Wimmers PF, Kanter SL, Splinter TA, Schmidt HG. Is clinical competence perceived differently for student daily performance on the wards versus clerkship grading? Adv Health Sci Educ Theory Pract. 2008;13:693–707.
7. Hoffman K, Hosokawa M, Donaldson J. What criteria do faculty use when rating students as potential house officers? Med Teach. 2009;31:e412–e417.
8. Papadakis MA, Teherani A, Banach MA, et al. Disciplinary action by medical boards and prior behavior in medical school. N Engl J Med. 2005;353:2673–2682.
9. Chang A, Boscardin C, Chou CL, Loeser H, Hauer KE. Predicting failing performance on a standardized patient clinical performance examination: The importance of communication and professionalism skills deficits. Acad Med. 2009;84(10 suppl):S101–S104.
10. Stewart MA. Effective physician–patient communication and health outcomes: A review. CMAJ. 1995;152:1423–1433.
11. Dimatteo MR. The role of effective communication with children and their families in fostering adherence to pediatric regimens. Patient Educ Couns. 2004;55:339–344.
12. Hauer KE, Teherani A, Kerr KM, O’Sullivan PS, Irby DM. Student performance problems in medical school clinical skills assessments. Acad Med. 2007;82(10 suppl):S69–S72.
13. Green M, Jones P, Thomas JX Jr. Selection criteria for residency: Results of a national program directors survey. Acad Med. 2009;84:362–367.
14. Friedman Ben David M, Davis MH, Harden RM, Howie PW, Ker J, Pippard MJ. AMEE medical education guide no. 24: Portfolios as a method of student assessment. Med Teach. 2001;23:535–551.
15. Roberts C, Newble DI, O’Rourke AJ. Portfolio-based assessments in medical education: Are they valid and reliable for summative purposes? Med Educ. 2002;36:899–900.
16. Driessen EW, Overeem K, van Tartwijk J, van der Vleuten CP, Muijtjens AM. Validity of portfolio assessment: Which qualities determine ratings? Med Educ. 2006;40:862–866.
17. Roberts C, Shadbolt N, Clark T, Simpson P. The reliability and validity of a portfolio designed as a programmatic assessment of performance in an integrated clinical placement. BMC Med Educ. 2014;14:197.
18. Kane MT. Validating the interpretations and uses of test scores. J Educ Meas. 2013;50:1–73.
19. Schuwirth LW, van der Vleuten CP. Programmatic assessment and Kane’s validity perspective. Med Educ. 2012;46:38–48.
20. Cook DA, Brydges R, Ginsburg S, Hatala R. A contemporary approach to validity arguments: A practical guide to Kane’s framework. Med Educ. 2015;49:560–575.
21. Tigelaar DEH, Dolmans DHJM, Wolfhagen IHAP, van der Vleuten CPM. Quality issues in judging portfolios: Implications for organizing teaching portfolio assessment procedures. Stud Higher Educ. 2005;30:595–610.
22. Grissom RJ, Kim JJ. Effect Sizes for Research: Univariate and Multivariate Applications. 2nd ed. New York, NY: Routledge/Taylor & Francis Group; 2012.
23. Grissom RJ. Probability of the superior outcome of one treatment over another. J Appl Psychol. 1994;79:314–316.
24. Klamen DL, Borgia PT. Can students’ scores on preclerkship clinical performance examinations predict that they will fail a senior clinical performance examination? Acad Med. 2011;86:516–520.
25. Wilkinson TJ, Frampton CM. Comprehensive undergraduate medical assessments improve prediction of clinical performance. Med Educ. 2004;38:1111–1116.
26. Hanson JL, Rosenberg AA, Lane JL. Narrative descriptions should replace grades and numerical ratings for clinical performance in medical education in the United States. Front Psychol. 2013;4:668.
27. Ten Cate O, Chen HC, Hoff RG, Peters H, Bok H, van der Schaaf M. Curriculum development for the workplace using entrustable professional activities (EPAs): AMEE guide no. 99. Med Teach. 2015;37:983–1002.
28. Sargeant J, Armson H, Chesluk B, et al. The processes and dimensions of informed self-assessment: A conceptual model. Acad Med. 2010;85:1212–1220.