Mastery learning systems aim to help learners reach consistently high levels of achievement.1 For each learning objective in a mastery system, learners undertake suitable educational and practice activities, sometimes preceded by an initial preassessment, and then complete a follow-up learning assessment. Learners who meet or exceed the objective’s mastery standard on this follow-up assessment advance to the next educational unit. Those who do not pass continue studying and practicing, then retest until they meet the standard. Ideally, learners advance as soon as they are ready and no sooner, which keeps them in an optimal educational zone—challenged, but not beyond their abilities.2 Consistent with this practice, meta-analyses in education broadly3 and in medical education in particular4 have shown that mastery learning leads to higher achievement, though at the cost of additional learning time. With its emphasis on accountability for learner progress toward key objectives, mastery learning may be used as a concrete, educational-unit-level tactic for implementing broad, curriculum-level competency-based education frameworks in undergraduate and graduate medical education, such as core competencies and entrustable professional activities.5 As such, mastery learning approaches can serve as one part of educators’ overall drive to be more accountable and effective in preparing physicians to provide safe, high-quality patient care.
Sound assessment is the cornerstone of mastery learning systems. The inaccurate assessment of learners’ mastery or the poor use of such assessments for decision making would lead to the premature advancement or unnecessary holding back of learners. As such, assessment administrators first must be clear about how they intend to interpret and use scores from a mastery learning assessment, then they must collect evidence to test whether those inferences are valid and the resulting uses are justified, keeping in mind that the most compelling evidence is that which threatens to challenge our assumptions.6,7 Validation and justification are important activities in the educational research enterprise; new evidence may show long-standing assessment practices to be invalid,8 and controversies about interpretations and uses of test scores have risen to the highest courts of law.9
Of course, sound assessment is key not only in mastery learning but also in all of education, and a sophisticated body of research and practice in assessment already exists.10 However, mastery learning assessments entail interpretations and uses of scores that differ from those of standard assessments, requiring changes in validation and justification practices. In this article, we aim to thoroughly and accessibly outline key issues in the validation and justification of mastery learning assessments (see List 1 for an overview of these key issues). We organize our ideas by the key tenets of modern assessment validity theory—namely, that validity is a function of how assessment scores are interpreted and used, and that five broad categories of evidence may be collected to test whether a particular interpretation or use of an assessment is valid and justified.6,7 We assume that the reader has basic familiarity with this conceptualization of validity, for which accessible guides are available.11,12 We focus on clearly identifying key challenges, and where possible we suggest solutions to those challenges. We hope that this article will offer guidance to professionals responsible for curriculum and assessment design in both undergraduate and graduate medical education, clarify the key theoretical and methodological issues for researchers in mastery learning, and illustrate for medical education policy makers that “business-as-usual” assessments must be revised for the mastery learning context.
Interpretations of and Uses for Mastery Learning Assessments
What does “mastery” mean? Colloquially, it suggests a high level of expertise. However, for mastery learning, it only means readiness to proceed to the next phase of instruction. A medical student who understands mutagenesis enough to learn about genetic transmission has almost certainly not “mastered” mutagenesis in the lay sense but may have mastered it enough to move on to the next educational unit. Similarly, a resident who has “mastered” central venous catheter insertion in a simulation lab might be ready for supervised performance on patients but will of course still have much to learn. This difference in meaning could lead to problems. Learners who advance through a unit may believe they have “mastered” its content in the lay sense when they have only done so in the mastery learning sense. Conversely, educators asked to set mastery standards may set unnecessarily high standards, letting the lay connotation of “mastery” color their judgments.
How long are learners expected to retain “mastery”? In mastery learning models, achievement is often assessed immediately after the completion of training. Yet most learning units in medical education are connected to many later units, and achievement often decays rapidly following training.13 Moreover, many learning activities that maximize short-term mastery are precisely the opposite of those that support long-term retention and generalization of mastery.14 Although rigorous delayed testing is logistically challenging, particularly as learners rotate through different educational sites and assume time-consuming clinical commitments, limiting mastery learning assessment to the period immediately following training could subvert the intent of the mastery system, which is to ensure uniform, enduring competence.15
Mastery also may connote a completeness of knowledge or skill. In some contexts, mastery means that a learner has achieved sufficient competence in all the subunits of a content area (e.g., a learner cannot master “genetic transmission” without understanding each of the modes of transmission), or that the learner is sufficiently competent at all aspects of a procedure. In such situations, for example, if a learner scores 90% on a procedural task but the missed 10% reflect a serious error, the designation of mastery would be inappropriate.16 In such noncompensatory (i.e., conjunctive) scoring, learners’ performance on each subunit would be evaluated against a minimum standard, and mastery would be achieved only when the learner passes all subunits.
We usually want assessments to discriminate between learners of varying ability levels. However, the central inference in the mastery model is pass and advance or fail and repeat; there is no middle ground. Thus, the passing standard must be established with great rigor. This tenet also has implications for how assessments are designed. Figure 1 depicts a hypothetical distribution of learners’ true scores—their actual knowledge or skill levels—on a mastery learning assessment. For learners whose true scores are far below the mastery cut score, precision of measurement is not terribly important; the learner has clearly not mastered the content. For learners in this range, the assessment will be most useful if it generates rich, item-specific feedback to help them make efficient progress toward the standard. It is essentially an assessment for promoting learning, not for precise measurement. By contrast, for learners whose true scores are within range of the mastery standard, perhaps within one standard error of measurement of the cut score, precise measurement becomes the priority. Assessment items that discriminate well in this range should be oversampled, though identifying such items may require sophisticated psychometric approaches, such as item response theory.17 For high-stakes examinations, such items also need to be kept secure from inappropriate disclosure to examinees (e.g., senior students sharing test items from previous years with junior students), which would compromise the measurement precision of those items. Preventing such disclosure likely requires that, for any given item, educators divulge neither which answers are correct versus incorrect nor the reasons they are so scored; instead, examinees will likely be told only their total score across many items. As such, for items in this range, beneficial assessment-based feedback will often need to be sacrificed to maintain measurement precision.
Finally, for learners whose true scores are well beyond the mastery standard, neither precision of measurement nor the provision of rich feedback are of primary concern, so less effort could be spent on items that discriminate only in this range.
Figure 1: Hypothetical distribution of learners’ true scores, relative to the mastery cut score, on a mastery learning assessment. The location and shape of the distribution of the learners’ true scores, the location of the mastery cut score, and the width of the region of measurement focus were chosen for illustration purposes only; all will vary depending on the particular context of the assessment.
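The logic of oversampling items that measure precisely near the cut score can be sketched with a simple item response theory calculation. In the two-parameter logistic (2PL) model, an item’s Fisher information at a given ability level is a²p(1 − p), where a is the item’s discrimination and p its probability of a correct response at that ability. The sketch below ranks the items in a small, entirely hypothetical bank by their information at a hypothetical cut score; the parameter values are illustrative only.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta,
    given item discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item: a^2 * p * (1 - p),
    which peaks where the item is least predictable (p = 0.5)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical item bank: (discrimination a, difficulty b) pairs
bank = [(0.8, -1.0), (1.5, 0.9), (1.2, 1.0), (0.6, 2.5)]
cut_score = 1.0  # hypothetical mastery cut score on the theta scale

# Oversample the items that measure most precisely near the cut score
ranked = sorted(bank, key=lambda ab: item_information(cut_score, *ab),
                reverse=True)
for a, b in ranked:
    info = item_information(cut_score, a, b)
    print(f"a={a:.1f}, b={b:+.1f}: information at cut score = {info:.3f}")
```

Items whose difficulty lies near the cut score and whose discrimination is high contribute the most information there, so under this model they would be the ones to oversample.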
Along with specifying the inferences to be drawn from mastery learning assessment scores, the ways those scores are used to make decisions must be clearly delineated. The most obvious use of such assessments is for deciding when to advance learners in the curriculum. Two key details related to this decision are (1) the resources and policies in place for learners who do not pass, and (2) any special consequences for learners who fail persistently to meet mastery standards. Other uses of mastery scores exist but may have unintended consequences. For instance, a dean’s letter to a residency program that extols a medical student who quickly mastered the curriculum inadvertently makes “time to mastery” a new achievement indicator, perhaps encouraging learners to rush through the curriculum rather than truly mastering it.
To summarize, adopting a mastery learning system requires clear specification about how assessment scores will be interpreted and used. To interpret scores, the following should be clear: what level of achievement is meant by “mastery,” how long learners are expected to retain that mastery, and how complete that achievement must be. Intended uses of scores should be specified, with particular detail provided regarding remediation and retesting.
Sources of Validity Evidence: Content
One important source of validity evidence is the suitability of the assessment content, defined by the Standards for Educational and Psychological Testing18 as the “themes, wording, and format of the items, tasks, or questions on a test” as well as the “administration and scoring.” In mastery systems, learners who struggle will retest one or more times. Mastery systems may also include pretests before instruction begins, possibly enhancing learning via the testing effect19 and allowing some learners to skip already-mastered units entirely. In such systems, most learners complete at least two assessments—a pretest and at least one posttest. Thus, having a large enough bank of content and sound methods for generating multiple equivalent test forms is important.20 Additionally, depending on one’s definition of mastery, certain aspects of how learners perform may be key criteria, beyond simply the products of their performance (e.g., correct answers or completed procedural tasks). For instance, if one defines mastery of suturing skill as the ability to suture automatically, with minimal to no conscious thought, a suitable assessment must detect whether learners can suture even while they are distracted.21
Sources of Validity Evidence: Response Process
How learners respond to assessment items (e.g., how they interpret each question, how they enter answers) should be consistent with the intended score interpretations. Retesting in mastery learning systems could in some cases create a content security threat that may be evident in how learners respond to assessment items. Specifically, if retests recycle the same or similar content from prior tests, learners’ retest responses may reflect only their memorization of the surface details of the assessment content rather than their true mastery of the domain. Savvy learners might deliberately take a mastery examination for which they are unprepared in order to become “test-wise,” and then study only enough to briefly regurgitate the required information on a retest. At the same time, learning gains due to superficial memorization may not be obviously different from learning gains due to the genuine educational benefits of early testing, and both phenomena may be at play to varying extents in any given learner. Research illustrating how to analyze separately the superficial and the enduring educational effects of pretesting would be of value.
The most straightforward solution to the problem of learners memorizing answers is to build larger content banks (e.g., more items, more scenarios), though this is admittedly resource intensive. Probing learners’ reasoning for the answers they select to detect superficial memorization also may be possible; for instance, one may ask not only what the correct answer is on a multiple-choice examination but also why it is correct. However, such deeper understanding is a demonstrably different construct than the ability to recognize correct answers.22 Fortunately, content security is not a concern for some types of content; for instance, procedural checklists are given freely to learners with the expectation that they will be able to demonstrate all procedural steps satisfactorily.
Sources of Validity Evidence: Internal Structure and Reliability
Internal structure evidence reflects how well associations between test items and test components support that the intended construct or constructs are being measured. For example, if an assessment is intended to measure three distinct dimensions of performance, then items reflecting the same dimension should receive similar scores and should differ from items intended to measure different dimensions. The same principle applies to other aspects of the test, such as stations, raters, or testing occasions; namely, scores reflecting the same dimension of performance should ideally be reliable across each test condition.
Strictly speaking, reliability in mastery learning assessments is defined only in terms of how consistently the mastery versus nonmastery distinction is made. Common reliability statistics, such as coefficient alpha and test–retest correlations, refer to the reliability of discriminations between learners across the full range of their true scores. However, the reliability of a pass/fail decision at a particular cut score can differ dramatically from the average reliability of the same assessment across the range of possible scores. Generally, cut scores at or near the average learner performance level will be the least reliable, whereas extremely high or low cut scores are often highly reliable.23 Suitably modified reliability equations are available and should be used for mastery learning assessments, including the conditional-error-variance absolute-decision generalizability coefficient24 and decision-consistency reliability indices.25,26 Both may be complex enough to require psychometric consultation to properly estimate and interpret.
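The intuition behind decision-consistency indices can be sketched with a simple two-administration agreement calculation: the proportion of learners classified the same way (master versus nonmaster) on two parallel forms, plus a kappa statistic that corrects that agreement for chance. The scores and cut score below are hypothetical, and the published indices cited above include more sophisticated estimators, including single-administration versions.

```python
def decision_consistency(form1_scores, form2_scores, cut):
    """Two-administration agreement sketch: p0 is the proportion of
    learners given the same mastery classification on both forms;
    kappa corrects that agreement for chance, using the marginal
    pass rates of each form."""
    n = len(form1_scores)
    pass1 = [s >= cut for s in form1_scores]
    pass2 = [s >= cut for s in form2_scores]
    p0 = sum(a == b for a, b in zip(pass1, pass2)) / n
    p_pass1, p_pass2 = sum(pass1) / n, sum(pass2) / n
    p_chance = p_pass1 * p_pass2 + (1 - p_pass1) * (1 - p_pass2)
    kappa = (p0 - p_chance) / (1 - p_chance)
    return p0, kappa

# Hypothetical scores for 8 learners on two parallel forms; cut score 70
form1 = [55, 62, 68, 71, 75, 80, 85, 90]
form2 = [58, 65, 72, 69, 77, 79, 88, 93]
p0, kappa = decision_consistency(form1, form2, cut=70)
print(f"p0 = {p0:.2f}, kappa = {kappa:.2f}")  # p0 = 0.75, kappa = 0.47
```

Note that the two learners reclassified near the cut score drive all of the disagreement, consistent with the point above that decision reliability depends heavily on where the cut score falls in the score distribution.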
Other unique aspects of mastery learning systems also can affect reliability estimates. If learners can choose when to take the mastery assessment and have a good sense of when they are sufficiently prepared to pass, their total test scores will be very similar (i.e., very near the passing score). In situations of such reduced score variance (i.e., restriction in range), reliability estimates will be attenuated. The very goal of mastery learning systems—uniform achievement from all learners—is thus at odds with classical reliability estimation. At the same time, remediation and retraining can affect item-level score variation and may actually increase reliability. Therefore, depending on the frequency of retesting, mastery learning assessments can show unstable reliability estimates. By extension, these issues may limit one’s ability to assess internal structure using methods such as factor analysis, which also requires a reasonable degree of variance between subjects and items. One solution is to use scores from the baseline, preinstruction assessment—at which time, we would expect learners to still vary in ability—to estimate the reliability and factor structure of later mastery learning assessments.
Finally, as with credentialing examinations generally, administrators may choose to score mastery learning assessments in a noncompensatory fashion, whereby learners must demonstrate mastery on many different subunits before progressing.27 Although such scoring might be most consistent with one’s intended interpretation and use of scores, the practice does affect measurement reliability. In noncompensatory scoring, measurement error compounds across subunits, so the reliability of the overall pass/fail decision (roughly the product of the subunit decision reliabilities) can “balloon” into very unreliable territory. For instance, if learners must pass each of five procedural skill stations, each with a pass/fail reliability of 0.8, overall pass/fail decision reliability would be only 0.8 × 0.8 × 0.8 × 0.8 × 0.8 ≈ 0.33, an abysmally low reliability coefficient.28
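The five-station arithmetic can be checked directly. Under the illustration’s simplifying assumption that the station-level pass/fail decisions are independent, the overall decision reliability is simply the product of the per-station reliabilities:

```python
def conjunctive_decision_reliability(station_reliabilities):
    """Overall pass/fail decision reliability when learners must pass
    every station, assuming station-level decisions are independent:
    the product of the per-station decision reliabilities."""
    overall = 1.0
    for r in station_reliabilities:
        overall *= r
    return overall

# Five stations, each with a pass/fail decision reliability of 0.8
overall = conjunctive_decision_reliability([0.8] * 5)
print(round(overall, 2))  # 0.33
```

Because each additional subunit multiplies in another factor below 1.0, overall decision reliability falls off quickly as subunits accumulate.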
Sources of Validity Evidence: Relationships to Other Variables
Test scores should relate positively to measures of similar constructs and to outcomes they are meant to predict or measure concurrently (“convergent validity evidence”) and should correlate only weakly with conceptually dissimilar constructs (“discriminant validity evidence”). Many forms of convergent and discriminant validity evidence for mastery assessments will be similar to those for nonmastery assessments. However, the most important relationship to evaluate in a mastery learning system is whether assessment scores relate to learners’ success in their subsequent educational unit(s), including their eventual transition to practice.
Just as it impairs the estimation of reliability, restriction of range in mastery learning assessment scores makes estimating relationships with other variables difficult. However, it is possible to correlate relatively unrestricted assessment data obtained before implementing a mastery learning system with other variables. For instance, if residents were assessed on chest tube insertion using simulation and then performed chest tube insertions on patients regardless of their assessment scores, outcomes data such as patient complication rates could be correlated with the assessment scores. Of course, in this example, where the consequences of allowing some failure in the criterion measure are dire, there is likely an ethical imperative not to allow low-scoring residents to proceed to patient care in the first place,29 thus disallowing estimation of this relationship.
Sources of Validity and Justification Evidence: Consequences of Assessment Use
In contrast to validity evidence that focuses on whether the assessment can support desired inferences, consequences evidence seeks to justify the uses or applications of scores by considering the intended and unintended consequences of the assessment and whether implementation of the assessment is reasonable and desirable.6,7 Consequences evidence includes information about the process of setting standards and the impact of the assessment on the learning process, learning outcomes, and the practice of health care.12
Standard setting is a central component of mastery learning, and the retraining of judges who are accustomed to setting standards in traditional assessments is critical. For example, in item-based standard setting methods, such as Angoff,30 the broad definition of the “borderline learner” may change from “a minimally competent learner who might pass one day and fail the next” to “a solidly competent learner who can be relied upon to perform consistently at an acceptable level over time.” For an in-depth exploration of standard setting in the context of mastery learning, see the article by Yudkowsky et al31 in this issue.
The mastery model potentially could widely influence curricula and training programs. Mastery standards mandate sufficient curricular time and resources for repeated practice, remediation, and retesting, thus reinforcing a competency-based approach to education.5 Poor initial performance and the need for repeated retests may highlight a gap between the curriculum that learners experience and faculty expectations, spurring curricular efforts to close this gap. Conversely, mastery standards may disrupt scheduling within the curriculum, and remediation and retesting may consume limited faculty and material resources. Additionally, instructors in a mastery learning context may identify common test errors and emphasize these points in subsequent sessions. Although not inappropriate if these errors reflect key aspects of the task, such “teaching to the test” could undermine the validity of scores and inferences if it improves test performance without a concomitant improvement in true skill.
On an individual learner level, one can seek evidence of increased efficiency and effectiveness of study and practice strategies, increased attention to the critical elements of the assessed domain, more functional motivational orientations,32 and improved self-regulation of learning. The mastery learning approach—setting a high bar and practicing until it is achieved—is consistent with the deliberate practice approach to attaining expertise33 and may encourage deliberate practice as a lifelong learning strategy. However, mastery learning systems that do not periodically reassess mastery may lead learners to focus on demonstrating mastery in the short term rather than maintaining mastery throughout their careers.
Mastery learning systems are meant to ensure that learners progress only when they are ready to do so; thus, learner outcomes in subsequent educational units are a primary consequence of interest. However, drawing inferences about mastery learning assessments from learners’ later progress can be challenging. If learners’ progress in later educational units is found to be subpar, it may be that one or more of the previous mastery standards were too lenient. However, other factors also could be at play. If learners’ subsequent progress is satisfactory, the preceding mastery standards were arguably stringent enough, though more lenient standards may have yielded comparable results in less time. The most powerful way to determine whether mastery learning assessment standards lead to desired outcomes is to systematically experiment with the standards and observe how later outcomes are affected, though this can be logistically and sometimes ethically challenging.
Finally, one can seek evidence of an impact on outcomes for patients, the health care system, and society as a whole. For instance, Barsuk and colleagues34 found that a natural lapse in the provision of simulation-based mastery training of central line insertion corresponded with an increase in patient complications, providing evidence that adhering to the previous mastery standard had helped control complications.
Conclusion
Mastery learning systems have the potential to refocus health professions education on learners’ achievement of consistently high levels of performance, and as such they fit well within current competency-based education efforts in medical education more broadly. The assessment of learners’ mastery and the subsequent decision making about their progress come with conceptual and methodological challenges that are not necessarily more onerous than those that arise from conducting “traditional” assessments, but they do require different approaches. List 1 provides a summary of the key considerations we outlined here for the validation and justification of mastery learning assessments.
List 1 Key Considerations for the Validation and Justification of Mastery Learning Assessments Cited Here
Interpretations of and uses for mastery learning assessments
Specify what degree of achievement or readiness to progress is meant by mastery
Specify how long learners are meant to retain mastery
Specify how complete mastery within a particular content area must be (compensatory versus noncompensatory scoring)
Specify how scores will be used to make decisions and actions about learners
Sources of validity evidence: Content
Develop sufficient assessment content to allow for high-volume retesting as needed
Use best practices for generating multiple equivalent retests
When appropriate, assess aspects of performance beyond achievement of content (e.g., automaticity of performance)
Sources of validity evidence: Response process
Examine whether learners’ response processes on retests are consistent with true mastery, rather than with memorization of the particulars of the assessment content
Sources of validity evidence: Internal structure and reliability
Use adjusted reliability estimates for the mastery versus nonmastery distinction
Carefully consider how to derive estimates of reliability and internal structure for mastery posttests, when learner performance is likely to be restricted in range
If noncompensatory scoring is used, adjust reliability estimates accordingly
Sources of validity evidence: Relationships to other variables
Carefully consider how to derive estimates of relationships to other variables for mastery posttests, when learner performance is likely to be restricted in range
Collect evidence as to whether a given mastery assessment relates to satisfactory versus unsatisfactory progress in later educational units and/or subsequent patient care
Sources of validity and justification evidence: Consequences of assessment use
Examine potential positive and negative effects of mastery assessment for curriculum and instruction, individual learners, patient outcomes, and society
Given the theoretical and empirical support for mastery learning, we look forward to acceleration in research and practice in this area. We hope that careful attention to the issues we raised here will lead educators and researchers to greater insights into proper mastery learning assessment, toward the ultimate goal of transforming medical education and health care.
Acknowledgments: The authors thank Drs. Georges Bordage, Clarence Kreiter, and William McGaghie for their critical review of this article.
References
1. Bloom BS. Mastery learning. In: Block JH, Airasian PW, eds. Mastery Learning: Theory and Practice. New York, NY: Holt, Rinehart and Winston; 1971
2. Guadagnoli M, Morin MP, Dubrowski A. The application of the challenge point framework in medical education. Med Educ. 2012;46:447–453
3. Kulik C-LC, Kulik JA, Bangert-Drowns RL. Effectiveness of mastery learning programs: A meta-analysis. Rev Educ Res. 1990;60:265–299
4. Cook DA, Brydges R, Zendejas B, Hamstra SJ, Hatala R. Mastery learning for health professionals using technology-enhanced simulation: A systematic review and meta-analysis. Acad Med. 2013;88:1178–1186
5. Frank JR, Snell LS, ten Cate O, et al. Competency-based medical education: Theory to practice. Med Teach. 2010;32:638–645
6. Kane MT. Validating the interpretations and uses of test scores. J Educ Meas. 2013;50:1–73
7. Cizek GJ. Defining and distinguishing validity: Interpretations of score meaning and justifications of test use. Psychol Methods. 2012;17:31–43
8. Lineberry M, Kreiter CD, Bordage G. Threats to validity in the use and interpretation of script concordance test scores. Med Educ. 2013;47:1175–1183
9. Ricci v DeStefano, 557 US 557 (2009)
10. Downing SM, Yudkowsky R, eds. Assessment in Health Professions Education. New York, NY: Routledge; 2009
11. Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments: Theory and application. Am J Med. 2006;119:166.e7–166.e16
12. Downing SM. Validity: On meaningful interpretation of assessment data. Med Educ. 2003;37:830–837
13. Arthur W, Bennett W, Stanush PL, McNelly TL. Factors that influence skill decay and retention: A quantitative review and analysis. Hum Perf. 1998;11:57–101
14. Rohrer D, Pashler H. Recent research on human learning challenges conventional instructional strategies. Educ Res. 2010;39:406–412
15. Norman G, Norcini J, Bordage G. Competency-based education: Milestones or millstones? J Grad Med Educ. 2014;6:1–6
16. Yudkowsky R, Tumuluru S, Casey P, Herlich N, Ledonne C. A patient safety approach to setting pass/fail standards for basic procedural skills checklists. Simul Healthc. 2014;9:277–282
17. Embretson SE, Reise SP. Item Response Theory for Psychologists. Mahwah, NJ: L. Erlbaum Associates; 2000
18. American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Validity. In: Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association; 2014
19. Richland LE, Kornell N, Kao LS. The pretesting effect: Do unsuccessful retrieval attempts enhance learning? J Exp Psychol Appl. 2009;15:243–257
20. Crocker L, Algina J. Equating test scores from different tests. In: Introduction to Classical and Modern Test Theory. Belmont, Calif: Wadsworth; 1986
21. Stefanidis D, Scerbo MW, Montero PN, Acker CE, Smith WD. Simulator training to automaticity leads to improved skill transfer compared with traditional proficiency-based training: A randomized controlled trial. Ann Surg. 2012;255:30–37
22. Williams RG, Klamen DL, Markwell SJ, Cianciolo AT, Colliver JA, Verhulst SJ. Variations in senior medical student diagnostic justification ability. Acad Med. 2014;89:790–798
23. Stansfield RB, Kreiter CD. Conditional reliability of admissions interview ratings: Extreme ratings are the most informative. Med Educ. 2007;41:32–38
24. Webb NM, Shavelson RJ, Haertel EH. Reliability coefficients and generalizability theory. In: Rao CR, Sinharay S, eds. Handbook of Statistics. Vol 26. Amsterdam, the Netherlands: Elsevier; 2006:81–124
25. Livingston SA, Lewis C. Estimating the consistency and accuracy of classifications based on test scores. J Educ Meas. 1995;32:179–197
26. Subkoviak MJ. A practitioner’s guide to computation and interpretation of reliability indices for mastery tests. J Educ Meas. 1988;25:47–55
27. Norcini JJ, Stillman PL, Sutnick A, et al. Scoring and standard-setting with standardized patients. Eval Health Prof. 1993;16:322–332
28. Hambleton RK, Slater SC. Reliability of credentialing examinations and the impact of scoring models and standard setting policies. Appl Meas Educ. 1997;10:19–38
29. Ziv A, Wolpe PR, Small SD, Glick S. Simulation-based medical education: An ethical imperative. Acad Med. 2003;78:783–788
30. Angoff WH. Scales, norms, and equivalent scores. In: Thorndike RL, ed. Educational Measurement. 2nd ed. Washington, DC: American Council on Education; 1971
31. Yudkowsky R, Park YS, Lineberry M, Knox A, Ritter EM. Setting mastery learning standards. Acad Med. 2015;90:1495–1500
32. Payne SC, Youngcourt SS, Beaubien JM. A meta-analytic examination of the goal orientation nomological net. J Appl Psychol. 2007;92:128–150
33. McGaghie WC, Issenberg SB, Cohen ER, Barsuk JH, Wayne DB. Does simulation-based medical education with deliberate practice yield better results than traditional clinical education? A meta-analytic comparative review of the evidence. Acad Med. 2011;86:706–711
34. Barsuk JH, Cohen ER, Potts S, et al. Dissemination of a simulation-based mastery learning intervention reduces central line-associated bloodstream infections. BMJ Qual Saf. 2014;23:749–756