Durning, Steven J. MD, PhD; Hemmer, Paul A. MD, MPH
Grading students is an administrative action that can challenge academic managers for a number of reasons, including concerns regarding standard evaluation criteria for performance, resource and faculty development needs, the difficulty that arises when faculty view themselves as responsible for grading rather than evaluation, and the inherent variability in educational settings and teacher–learner interactions. Grades succinctly summarize overall performance and, by distinguishing levels of performance, can help identify those likely to succeed in progressively more challenging educational environments, such as medical school. Grades can assure educators and consumers that a trainee has met the requirements to progress to the next level of education. The pursuit of grades can also foster competition, which can promote achievement or lead to destructive, unprofessional behaviors (e.g., cheating, plagiarism, bullying).
Clerkship education is, by its nature, less organized and prescriptive than classroom education. Students will see different numbers of patients with different conditions and comorbidities, and no two patients (or two students for that matter) are the same. This is true to the nature of the practice of medicine, but it also makes grading students’ clerkship performance more complex. Alexander and colleagues1 present compelling national data from a majority of MD-granting medical schools in the United States as to the variability in required clerkship grading practices. On the basis of these data, the authors suggest that the Liaison Committee on Medical Education (LCME) should take a greater regulatory role in defining—and imposing—clerkship grading policies and practices. As such, this study raises several important larger questions. Should clerkship grades (i.e., grading system, metrics, and distribution) be standardized across U.S. medical schools? And, if so, at what point does an academic institution lose the ability to govern and make decisions for itself?
Our purpose in this commentary is to examine how our medical education community might perceive these findings and suggest ways to address them. We discuss several related issues: normative versus criterion-based grading, reliability, and educational theory. We hope that this commentary will lead to thoughtful discussion and debate as well as compel future research about variability in clerkship grading practices.
Normative Versus Criterion-Based Grading
Normative grading, as the name suggests, is based on the notion that there should be a set distribution or norm (like a “bell curve”) to performance.2 For example, everyone cannot and should not receive an A, and there should be a distribution of less-than-passing grades that is fairly consistent (or mandated) from year to year. From the normative perspective, grade distributions should be roughly equivalent within and across clerkships. This perspective would view the distribution of grades toward the A or honors range and the few failures reported by Alexander and colleagues as nothing short of alarming.
Criterion-based grading does not mandate percentages of grades for each level (i.e., A, B, C); rather, grades should be awarded on the basis of meeting set criteria, which may be met by some, many, or all students.2 From this perspective, every student could receive an A or honors, and skewed grade distributions, in and of themselves, are not concerning. Indeed, having grades skewed to the high end (i.e., mostly honors or A and B grades) could be seen as success, and a norm-shaped grade distribution in a criterion-based system could actually raise questions about the effectiveness of the curriculum.
From the normative versus criterion-based grading perspective, the study by Alexander and colleagues raises a number of questions. Which clerkships—across and within medical schools—are using criterion-based or normative systems? Should the LCME adopt requirements for competency-based assessment and subsequent grading? If such a step were to occur, we suggest that the LCME should not require a normative standard by limiting the percentage of students who receive the highest grade. Finally, if the type of grading is to be mandated by the LCME, would the LCME also then require the same curriculum so that uniform grading standards can be applied?
We do not believe that the medical education community should endorse uniform grading or curricula, nor should the LCME take a prescriptive role in determining uniform evaluation, grading, or curricular content across medical schools. We certainly agree that any grading system should be consistently interpreted in terms of what the grades mean rather than applied according to a uniform distribution.3 For grades to be fair and defensible to students and academic managers, consistency and complete transparency in how grades are interpreted should be a priority for the medical education community. We believe this is a worthy call for action and long overdue.
Although the data reveal significant variation in grading systems, they do not provide details about the ways in which these systems are applied. Still, Alexander and colleagues suggest that one explanation for the large number of students receiving high grades could be that the widely used RIME (Reporter, Interpreter, Manager, Educator) evaluation framework has not addressed the problems with grading. However, one cannot identify which specific clerkships use RIME and/or evaluation sessions, the experience within these given clerkships, or how these factors have affected the evaluation process. The published experience within individual clerkships would suggest that RIME and/or evaluation sessions have led to important changes in evaluation and grades.4–6 More specific analyses are required before we can determine whether problems with grading are associated with specific types of evaluation and/or grading frameworks.
Alexander and colleagues also call for evaluation systems with greater reliability. We agree, but we wish to raise an important clarification. Reliability is largely an issue of sampling.3 Two ways to increase reliability are (1) to increase the number of observations of trainees (i.e., increase the sample) or (2) to reduce the number of potential categories for assessing complex outcomes, such as how a student performs on a clinical clerkship (i.e., raters agree more readily when there are fewer categories to choose among).
The authors' data bear on the latter approach: they report that more students were placed in the lowest category when fewer performance categories were used. Because reliability is an important characteristic of fairness in grading, we agree that considering fewer as opposed to more rating categories for a complex assessment, such as a student's competence, is a welcome recommendation for clerkships as well as other stages in the medical education continuum.
There is another way, however, to achieve defensible reliability in grading systems with a larger number of categories—improving the clarity of what the categories mean to the rater and to other consumers, like the student. For example, simply giving over 90% of students honors is not necessarily transparent and may lack credibility. However, making an effort to communicate clearly how the evaluation and grading system works at an institution would enhance consumers’ ability to interpret and understand the grade.
From this standpoint, what is an acceptable or credible distribution of grades in a normative grading system? Should such a normative system be applied to improve fairness and transparency if an institution refuses to provide a credible and well-defined system of grading (be it normative or criterion-based)? Would implementing such a system without transparency risk misclassifying students? If academic managers and institutional leaders are unwilling to provide credible evidence for their grading decisions, an imposed grading distribution could address this issue.
Educational theory can improve our understanding of evaluation and grading. Alexander and colleagues refer to the analytic framework—performance captured by achieving requisite knowledge, skills, and attitudes.2 Such an approach often assumes that the student is the main (or only) “actor” in the educational system and that other potential mitigating factors (such as patients, teachers, curriculum, etc.) should play a minor role, if any. These stances are characteristic of information processing theory7 that has pervaded the medical education literature over the past several decades.
We believe that the learning environment and other individuals in the educational encounter (patients, teachers, peers) are not “noise” but are actually an important signal for learning and performance, and they should not be discarded.7 This perspective is consistent with contemporary educational theories, such as situated cognition, situated learning, and ecological psychology, which argue that knowledge construction and demonstration are inherently social phenomena, not solely in the purview of the individual actor (i.e., the student).7 As such, the normative grading model would not be likely to capture learning and performance with all the various inputs and interactions between these social components—student factors, school or setting factors, or teacher factors, to name just a few. Indeed, these theories suggest that the specifics of the situation are pivotal to the outcome. Given this stance, we would argue that heterogeneity of grades across institutions—and even within an institution—is not necessarily a problem and, instead, could be expected. From this perspective, it may be a mistake to try to enforce uniform grading distributions nationally.
There is also emerging educational literature on goal theory8 (mastery-based versus performance-based) that may help illuminate the issue of grades. Mastery-based goals reflect learning the material because it is believed to be important and worth mastering, whereas performance-based goals reflect learning the material to achieve a specific grade (e.g., "I want an A"). Mastery-based goals have been shown consistently to lead to improved learning over performance-based goals. Furthermore, studies are emerging that argue that letter-grade-based systems may lead to more burnout and depression than pass–fail systems.9 This theoretical perspective, beyond the issue of whether letter grades are given out in a uniform distribution with transparency in meaning, would question the utility of grades for clerkship performance in the first place.
Using the Data to Improve Grading Systems
We agree with Alexander and colleagues that using the same terminology (e.g., “honors”) to mean different things at different medical schools is problematic and must be changed. This should be a part of a long-overdue national agenda. We agree with the authors that undergraduate medical education leaders must be honest and forthright in their narratives of performance contained in medical student performance evaluations (MSPEs), including descriptions of the evaluation and grading processes used in all clerkships. It is time to end the use of terminology whose definitions vary from institution to institution and undermine confidence in students’ competence.
We propose that descriptive evaluation systems, such as RIME coupled with formal evaluation sessions, address this issue by asking teachers to evaluate a trainee’s performance on the basis of set criteria rather than simply to give a grade. There is also a clear role and need for faculty development and frame-of-reference training. We need to help teachers appropriately apply evaluation standards.10
Perhaps, on the basis of mastery-based goal orientation, we should eliminate letter grades altogether. Further, we could consider removing traditional time constraints from clerkship rotations. As in continuity clerkship experiences, students would be on a given rotation until they demonstrate needed mastery of core elements of that clerkship, whether in a traditional model or longitudinal integrated clerkship. Such an approach could also be accomplished by using criteria such as entrustable professional activities.11 We would argue that with a pass–fail grading system and mastery-based approach, one could discriminate among trainees on the basis of descriptive comments.
This being said, the "elephant in the room" of this discussion is graduate medical education (GME). For program directors, who rely on transcripts and the MSPE, grades are one measure to consider in the current difficult and competitive residency selection process. This undoubtedly pushes schools using pass–fail systems to find some other means of stratifying student performance. Additionally, clinical teachers (hassled and harried) know all too well the consequences of their "grade" recommendations for a student's GME hopes. Not only might they substitute a high grade for comments, but they may also be trying to recruit students while evaluating them. Furthermore, we recognize that eliminating letter grades by moving to pass–fail systems would likely lead to unintended consequences, such as greater (and, in our view, more inappropriate) emphasis on licensure examination results during the resident selection process. Given such external and consequential concerns, we recognize that grades in medical school will likely remain for the foreseeable future.
What Is It Good for?
So, what is grading good for? Do the wide variations in clerkship grading reported mean that grading is good for “absolutely nothing”? Or, is such variability inherent, and perhaps desirable, in a complex educational environment in which individual medical school missions must be respected? We must continue to grapple with this issue, and Alexander and colleagues have set us on this course. But, we do not believe that stricter regulatory restrictions are a good answer to this question. We believe that what is needed is more credible and transparent interpretation of what grades mean, not inflexible regulatory grade distributions.
Other disclosures: None.
Ethical approval: Not applicable.
Disclaimer: The opinions expressed in this commentary are those of the authors and do not reflect the views of the Department of Defense or other government agencies.