Portfolios have gained popularity in medical education as a pedagogical tool with multiple uses for different goals.1,2 For example, some educational programs report using formative portfolios to encourage trainees to reflect on their learning,3,4 whereas others use summative portfolios to inform decisions about promoting students to the next level of study.5,6 Nevertheless, concerns remain about the psychometric properties of portfolios, especially when used for “high stakes” performance decisions affecting promotion or certification outcomes.7
Previous research has examined the reliability and, to a lesser extent, the validity of portfolios as an assessment tool and offered recommendations for practice. Interrater reliability improves when trained assessors discuss or negotiate their ratings of portfolios using global criteria to inform judgments about trainee performance.8–12 Validity is enhanced when judges have sufficient evidence of trainee performance that aligns closely with developmentally appropriate criteria, thereby increasing the likelihood for accurate interpretations about trainee competence.8,10 Even with these developments, the difficult question remains: Is it “fair” to use portfolios for promotion decisions?
Fairness remains challenging to examine empirically.13 Indeed, the very definition of fairness is elusive, subject to interpretation depending on one's point of view. Furthermore, fairness is difficult to ensure for structured testing formats, let alone for more variable performance assessments like portfolios.14,15 When considering portfolio fairness, we can, however, benefit from the attention that fairness in testing has received in recent decades.15,16 Currently, measurement experts use four general criteria to define and judge the fairness of tests: (1) equitable treatment for examinees, (2) equal outcomes for subgroups, (3) absence of biased items and formats, and (4) equal opportunity to learn.14–16 These criteria apply to portfolio assessment just as they do to other testing formats.14 In this report, we explore the criterion of equal outcomes for subgroups by examining the relationship between overall portfolio-based promotion decisions and medical students' gender, citizenship status, and verbal ability (including English language fluency).
Most research on fairness over the past 30 years has focused on structured testing formats, with much of this work examining differential performance outcomes linked to examinees' gender or ethnicity.15,16 This focus is understandable given the potential risk of systematically advantaging or disadvantaging groups in addition to individuals. These fairness concerns, of course, remain relevant to portfolio-based assessment.17 In fact, portfolios raise additional concerns because constructing a portfolio requires verbal skills and metacognitive abilities often associated with gender, language fluency, and/or socioeconomic status.18
We initiated this study to monitor fairness in portfolio assessment systematically. We also wanted to explore concerns some of our faculty had voiced about the fairness of implementing a high-stakes portfolio assessment that rested heavily on each student's ability to provide a written summary of performance to the school's promotion committee, whose members decide whether students are promoted to the next level of study. Accordingly, we explore the following questions: (1) To what extent are students' gender and citizenship status related to promotion decisions based on their portfolios, and (2) to what extent are students' writing ability and language fluency related to promotion decisions based on their portfolios? We first describe our use of portfolios for student assessment and then report how we examined one aspect of fairness.
The Cleveland Clinic Lerner College of Medicine (CCLCM) started in 2002 as a five-year program affiliated with Case Western Reserve University. The curriculum, described in detail elsewhere,19 was designed to prepare medical students for careers as future physician–investigators. Periodically, students write essays summarizing their progress in meeting the school's nine competencies (health care systems, clinical skills, clinical reasoning, professionalism, personal development, medical knowledge, research, reflective practice, and communication skills) and select evidence to document their achievement.5 The promotion committee bases decisions for students to advance throughout the program on summative portfolios rather than on grades or comprehensive examinations.
Summative portfolio review process
Each student submits a summative portfolio at the end of Years 1, 2, and 4 to a promotion committee composed of senior-level basic scientists and clinicians. This committee includes department chairs and residency program directors with experience in making judgments about the performance of practicing physicians and researchers. Their experience should contribute to committee members' ability to evaluate whether students are making appropriate progress along their career trajectories. Approximately 16 to 18 faculty of the 21-member committee participate in each of the three separate panels to rate the summative portfolios submitted by each class (32 students per class). Each review panel takes two days to rate summative portfolios for a given class. Portfolio review panels for the three classes are scheduled in May, June, and August to reduce committee fatigue.
During each of a class's three reviews in Years 1, 2, and 4, all review panel members individually review and discuss as a committee a sample of four summative portfolios to reach consensus about expected student performance. Then, at least two members review a subset of three to four student portfolios individually before pairing up to reach consensus on their ratings. A third member reviews a portfolio if a rater-pair disagrees about student performance for a given competency or an overall promotion decision. If consensus is not reached at this stage, all panel members read the summative portfolio in question and agree on a decision. For portfolios not requiring extra input, the rater-pairs present their ratings to the review panel to make one of five possible decisions about overall student performance: “pass,” “pass with concerns,” “pass with remediation,” “repeat the year,” or “dismissal from medical school.” A “pass with concerns” decision requires a student to develop a learning plan with faculty input, and a “pass with remediation” decision requires a student to develop a formal plan approved by the promotion committee. The committee chair sends each student a personalized letter detailing the committee's judgments for the nine competencies and overall promotion decision. A copy of this letter is placed in each student's permanent file.
This study explores the fairness of promotion decisions based on students' Year 1 summative portfolios. Specifically, we examine relationships among promotion decisions and students' demographic characteristics and writing ability/verbal fluency.
We included data from 182 first-year CCLCM medical students (97 men, 85 women) from six classes (2004–2009) who consented to release their data for research purposes (182 of 192 CCLCM matriculants).
We recorded the following data for each participant: gender (male/female), U.S. citizenship status (yes/no), English as primary language (yes/no), language fluency (English only/English and one or more other languages), and MCAT Writing Sample score from the participant's American Medical College Application Service (AMCAS) application. We used the MCAT Writing Sample score as a proxy for writing ability.20 The 11 possible letter-score designations (J–T) of the MCAT Writing Sample were recoded into three performance groups (bottom = J–M, middle = N–Q, and top = R–T).
A trained research assistant pulled promotion letters from students' files to obtain Year 1 summative portfolio performance data and recorded each student's performance for the nine competencies (coded as 1 “pass,” 2 “pass with concerns,” or 3 “requires remediation”) and overall promotion decision (coded as 1 “pass,” 2 “pass with concerns,” 3 “pass with remediation,” 4 “repeat year,” or 5 “dismissal from medical school”) into an Excel spreadsheet. These ratings were then recoded into two categories (coded as 1 “pass” and 0 “concerns, remediate, fail”) for subsequent analyses. We justified collapsing categories into two levels to distinguish between students who met overall performance criteria and those with notable performance deficiencies. The second author (E.F.D.) confirmed the accuracy of data entry.
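The two recoding steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `mcat_group` and `dichotomize_decision` are hypothetical, and the study itself recorded the data in an Excel spreadsheet and analyzed them in SPSS.

```python
# Hypothetical helpers illustrating the recoding described in the text.
# Names are illustrative and do not come from the study's data files.

MCAT_LETTERS = "JKLMNOPQRST"  # the 11 Writing Sample letter scores, lowest to highest


def mcat_group(letter: str) -> str:
    """Map a letter score to 'bottom' (J-M), 'middle' (N-Q), or 'top' (R-T)."""
    letter = letter.strip().upper()
    if len(letter) != 1 or letter not in MCAT_LETTERS:
        raise ValueError(f"not an MCAT Writing Sample letter score: {letter!r}")
    if letter <= "M":
        return "bottom"
    if letter <= "Q":
        return "middle"
    return "top"


def dichotomize_decision(code: int) -> int:
    """Collapse the five promotion codes: 1 (pass) stays 1; codes 2-5 become 0."""
    if code not in range(1, 6):
        raise ValueError(f"promotion decision code must be 1-5, got {code!r}")
    return 1 if code == 1 else 0
```

For example, `mcat_group("P")` returns `"middle"`, and `dichotomize_decision(3)` ("pass with remediation") returns `0`.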
All variables were extracted from a data registry for the CCLCM program that the Cleveland Clinic's Office of Institutional Research approves annually. Medical students signed a consent form at matriculation to release their deidentified registry data for research purposes.
We used descriptive statistics to examine the promotion committee's judgments of students' summative portfolios (competency-specific ratings and overall promotion decisions). We used chi-square statistics, with Yates continuity correction for comparisons involving 2 × 2 tables, to compare overall promotion decisions with students' gender, U.S. citizenship status, self-reports of language fluency, and MCAT Writing Sample score. The Cramér V statistic served as an effect size (ES) index. We used Cohen's21 criteria (i.e., 0.10 = small ES, 0.30 = medium ES, 0.50 = large ES) to judge the practical significance of ES indices. SPSS 16.0 (IBM Corporation, Somers, New York) was used for the above analyses, with the significance level set at .05 for all hypothesis tests. Finally, the first author (S.B.B.) conducted post hoc power analyses of contingency tables to estimate the power of chi-square statistics and to identify the minimum sample size to obtain a desired power of 0.80.21
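For readers who want to see the arithmetic behind a Yates-corrected chi-square and the Cramér V index for a 2 × 2 table, the following is a minimal sketch in plain Python. It is not the SPSS procedure used in the study; the function names are hypothetical, the counts in the usage note are invented purely for illustration, and the sketch assumes that V is computed from the uncorrected chi-square and that no row or column total is zero.

```python
import math


def chi_square_2x2(table, yates=True):
    """Chi-square statistic and p value (df = 1) for a 2 x 2 table of counts.

    With yates=True, 0.5 is subtracted from each |observed - expected|
    (floored at zero) before squaring. Assumes nonzero row and column totals.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                diff = max(0.0, diff - 0.5)
            chi2 += diff ** 2 / expected
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def cramers_v_2x2(table):
    """Cramér V for a 2 x 2 table, assumed here to use the uncorrected chi-square."""
    chi2, _ = chi_square_2x2(table, yates=False)
    n = sum(sum(row) for row in table)
    return math.sqrt(chi2 / n)  # min(rows - 1, cols - 1) = 1 for a 2 x 2 table
```

With the invented table `[[40, 10], [30, 20]]` (rows are two subgroups; columns are "pass" and "not pass"), `chi_square_2x2` gives a corrected chi-square of about 3.86 and `cramers_v_2x2` gives about 0.22, which would fall between Cohen's small and medium benchmarks.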
The 182 participating students had a mean age of 23 years (SD = 2.4) at matriculation and a mean undergraduate GPA of 3.6 (SD = 0.25). Approximately 86% (n = 156) were U.S. citizens. A small proportion reported disadvantaged status (5%, n = 9) or minority membership (11%, n = 20) on their AMCAS applications. For verbal ability, over 95% (n = 175) self-reported English as their primary language, whereas 60% (n = 109) claimed fluency in languages other than English (range = 1–4 languages). The majority of students also obtained MCAT Writing Sample scores at the middle (49%, n = 90) or top (36%, n = 65) of the performance continuum.
The promotion committee decided that 154 of 182 students demonstrated sufficient performance in the nine competencies to advance to Year 2 of the curriculum without concerns. The remaining students passed with concerns (n = 17), passed with remediation (n = 10), or withdrew from the program (n = 1).
Gender, U.S. citizenship, self-reports of language fluency, and MCAT Writing Sample score were not significantly related to overall promotion decisions (Table 1). Readers should weigh these findings cautiously because the chi-square tests reported in Table 1 were underpowered (power ranged from 0.37 to 0.52) given the available sample size. For instance, the post hoc estimate for the Gender × Promotion Outcome analysis indicated that 349 participants were needed to achieve a conventional power of 0.80 given study parameters (alpha = 0.05, ES = 0.15, N = 180, degrees of freedom = 1). On the other hand, ESs were small (ranging from 0.01 to 0.15) for all contingency tables, suggesting weak associations between overall promotion decisions and students' group characteristics. For competency-specific decisions, over 90% of students received a “pass” for each of CCLCM's nine competencies (Table 2).
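The post hoc power figures above can be approximated with a short calculation. The sketch below is an assumption-laden illustration, not the authors' procedure (they cite Cohen's tables): it treats the ES as Cohen's w, restricts itself to 1 degree of freedom, and uses the fact that a noncentral chi-square with 1 degree of freedom is the square of a shifted standard normal variable. The function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, SD 1)


def chi_square_power_df1(w, n, alpha=0.05):
    """Approximate power of a 1-df chi-square test for effect size w and sample size n.

    The noncentrality parameter is n * w**2, so the test statistic behaves like
    (Z + w * sqrt(n))**2, where Z is a standard normal variable.
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = w * math.sqrt(n)
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def min_n_for_power(w, target=0.80, alpha=0.05):
    """Smallest n (simple upward search) whose power reaches the target."""
    n = 2
    while chi_square_power_df1(w, n, alpha) < target:
        n += 1
    return n
```

Under these assumptions, `chi_square_power_df1(0.15, 180)` is approximately 0.52 and `min_n_for_power(0.15)` returns 349, in line with the values reported for the Gender × Promotion Outcome analysis.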
Our study explored an approach to monitor portfolio outcomes systematically. We applied fairness standards to portfolio assessment. Specifically, we examined whether medical students' demographic characteristics or writing ability/verbal fluency were related to a promotion committee's decisions of overall student performance. Examining fairness, although challenging for high-stakes performance assessments, is essential to maintain professional measurement standards14,16 and avoid potential liability.13 We have preliminary evidence that students' gender, citizenship, and writing ability/verbal fluency were not related to overall promotion decisions based on their portfolios. This finding does not prove or disprove the fairness of portfolios across subgroups and would not, even if we had obtained statistically significant results. According to the Standards for Educational and Psychological Testing,14
“While group differences in outcomes should in many cases trigger heightened scrutiny for possible sources of test bias, outcome differences across groups do not in themselves indicate that a testing application is biased or unfair.”
This study represents only one component of our overall approach to examine fairness systematically. We continue to monitor processes that may affect the fairness of our portfolio assessment system, such as giving stakeholders explicit portfolio guidelines and providing transparent student-appeal procedures. We also recommend periodically reviewing the curriculum to ensure that it provides all students with ample opportunities to demonstrate their competence, which is another aspect of fairness. In this study, most students met performance criteria for our nine competencies. We plan a follow-up study to examine the alignment of the different types (e.g., performance ratings, OSCE checklists, manuscript critiques) and sources (e.g., faculty, peer, self) of assessment evidence that students cite in summative portfolios to document their performance of each competency. Doing so should help identify areas where students receive insufficient or poorly aligned assessment evidence.
Focusing on fairness has exposed to us the fundamental importance of preparing students for this less familiar, portfolio-based approach to assessment. Students enter medical school with varying levels of reflective ability. Whereas some possess the ability to self-assess accurately, others struggle when attempting to analyze and interpret their performance. To address this potential threat to fairness, we recommend providing students with multiple opportunities to refine their metacognitive skills with input from trained portfolio advisors.18
This study has several limitations. First, we did not explore the extent to which the promotion committee weighed sources of irrelevant variance (e.g., grammar, following directions) when making overall or competency-specific promotion decisions about student performance. However, previous research has determined that assessors were able to differentiate between writing style and performance.8 Second, the MCAT Writing Sample score and students' self-reports of language fluency used in this investigation may be poor proxies for actual writing ability/verbal fluency. Third, we combined some categories of overall promotion decisions to form a dichotomous variable (i.e., “pass” and “concerns, remediation, fail”) for statistical analyses. Though we recognize that “pass with concerns” and “pass with remediation” represent different performance levels, we believe the collapsed category of “concerns, remediation, fail” retains the meaning of deficient student performance. Finally, the small class size at CCLCM requires collecting several more years of data before analyzing race/ethnicity, an important subgroup of concern when studying fairness issues. Post hoc power estimates suggest that we should have data from seven additional classes (with 32 students per class) to attain a power of 0.80 for the chi-square statistics reported in this investigation.
To our knowledge, this is the first study to report the relationship between portfolio-based performance decisions and medical students' demographic characteristics and writing ability/verbal fluency. Current trends suggest that portfolios will become more prevalent in the health professions as a high-stakes assessment tool. We encourage faculties to pay attention to fairness because violations can undermine the acceptability of portfolios as an assessment tool17,22,23 and place schools at risk for future lawsuits.13
The authors thank Ms. Ann Honroth for extracting the student promotion data used in this project.
All variables were extracted from a data registry that the Cleveland Clinic's Office of Institutional Research in Cleveland, Ohio, approves annually. Medical students sign a consent form at matriculation to release their deidentified registry data for research purposes.
Preliminary outcomes were presented at the 47th Research in Medical Education Conference, Association of American Medical Colleges, November 2008, San Antonio, Texas.
1 Buckley S, Coleman J, Davison I, et al. The educational effects of portfolios on undergraduate student learning: A Best Evidence Medical Education (BEME) systematic review. BEME Guide No. 11. Med Teach. 2009;31:340–355.
2 Tochel C, Haig A, Hesketh A, et al. The effectiveness of portfolios for post-graduate assessment and education: BEME Guide No 12. Med Teach. 2009;31:320–339.
3 Driessen EW, van Tartwijk J, Overeem K, Vermunt JD, van der Vleuten CPM. Conditions for successful reflective use of portfolios in undergraduate medical education. Med Educ. 2005;39:1230–1235.
4 Rees CE, Sheard CE. The reliability of assessment criteria for undergraduate medical students' communication skills portfolios: The Nottingham experience. Med Educ. 2004;38:138–144.
6 Davis MH, Friedman Ben-David M, Harden RM, et al. Portfolio assessment in medical students' final examinations. Med Teach. 2001;23:357–366.
7 Roberts C. Portfolio-based assessments in medical education: Are they valid and reliable for summative purposes? Med Educ. 2002;36:899–900.
8 Driessen EW, van Tartwijk J, Overeem K, Vermunt JD, van der Vleuten CPM. Validity of portfolio assessment: Which qualities determine ratings? Med Educ. 2006;40:862–866.
9 Driessen EW, van Tartwijk J, van der Vleuten CPM, Wass V. Portfolios in medical education: Why do they meet with mixed success? A systematic review. Med Educ. 2007;41:1224–1233.
10 Koretz D. Large-scale portfolio assessments in the US: Evidence pertaining to the quality of measurement. Assess Educ Princ Policy Pract. 1998;5:309–334.
11 O'Sullivan PS, Cogbill KK, McClain T, Reckase MD, Clardy JA. Portfolios as a novel approach for residency evaluation. Acad Psychiatry. 2002;26:173–179.
12 Pitts J, Coles C, Thomas P, Smith F. Enhancing reliability in portfolio assessment: Discussion between assessors. Med Teach. 2002;24:197–201.
13 Wilkerson JR, Lang WS. Portfolios, the pied piper of teacher certification assessments: Legal and psychometric issues. Educ Policy Anal Arch. 2003;11(45). http://epaa.asu.edu/ojs/article/view/273/399. Accessed February 22, 2011.
14 American Educational Research Association; American Psychological Association; National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 1999.
15 Cole NS, Zieky MJ. The new faces of fairness. J Educ Meas. 2001;38:369–382.
16 Camara WJ, Lane S. A historical perspective and current views on the Standards for Educational and Psychological Testing. Educ Meas Issues Pract. 2006;25:35–41.
17 Linn RL, Baker EL, Dunbar SB. Complex, performance-based assessment: Expectations and validation criteria. Educ Res. 1991;20:15–21.
18 Lam TCM. Fairness in performance assessment. ERIC Digest [Online]. ERIC Document Reproduction Service No. ED 391982; 1995. http://ericae.net/db/edo/ED391982.htm. Accessed February 22, 2011.
21 Cohen J. Chi-square tests for goodness of fit and contingency tables. In: Statistical Power Analysis for the Behavioral Sciences. Orlando, Fla: Academic Press, Inc.; 1977:215–271.
22 Baartman LKJ, Bastiaens TJ, Kirschner PA, van der Vleuten CPM. Evaluating quality in competence-based education: A qualitative comparison of two frameworks. Educ Res Rev. 2007;2:114–129.
23 Schuwirth LW, van der Vleuten CP. Changing education, changing assessment, changing research? Med Educ. 2004;38:805–812.
© 2011 Association of American Medical Colleges