Impact of Assessment

Testing Test-Enhanced Continuing Medical Education: A Randomized Controlled Trial

Feldman, Mark MD, FRCPC; Fernando, Oshan PhD; Wan, Michelle MA; Martimianakis, Maria Athina MA, MEd, PhD; Kulasegaram, Kulamakan PhD

doi: 10.1097/ACM.0000000000002377


Advancing the practice of continuing professional development (CPD) of physicians is a priority for licensing bodies.1 Effective CPD is critical to an effective clinical workforce, and yet it is challenging to balance clinicians’ learning preferences and time constraints with the need to provide comprehensive and memorable education. Physicians prefer learning in group settings for their formal CPD.2–8 This preference has entrenched commonly used CPD formats such as didactic teaching, which address preference and efficiency but not necessarily effectiveness.9

This challenge for CPD mirrors similar challenges faced in undergraduate and postgraduate education, and hence, similar solutions may apply.10 Recently, medical education has begun to embrace principles and practices of learning developed through rigorous scientific testing from the field of the learning sciences.10,11 Interventions such as interleaved and distributed practice, managing cognitive load, and using test-enhanced learning (TEL) are becoming more common in health professions education scholarship and practice.

TEL has been shown to be effective for increasing knowledge retention.12,13 In fact, among postgraduate medical trainees receiving didactic teaching, repeated testing (without studying) better enhances long-term retention than does repeated studying without testing.14 From a cognitive perspective, repeated testing is a form of “retrieval practice.”10,11 The more often a concept or idea is retrieved from memory, the more strongly it becomes encoded. Thus, simply incorporating testing after learning becomes a powerful tool to help learners retain knowledge. Additionally, repeated feedback through formative testing provides real-time information to learners about their progress, and this can be a powerful motivator for learners at all levels of experience. Instructors can use testing results to modify their teaching to accommodate emergent learner needs. Posttest feedback can also correct misconceptions, provide tailored guidance, and afford opportunities for informed self-assessment—essential for learners in the CPD context.

TEL is also well aligned with practical trends in CPD. Professional recertification of physicians may require documentation of CPD that includes assessment and feedback. Formative assessment is an underpinning of competency-based continuing education, an evolving priority for the Royal College of Physicians and Surgeons of Canada.1

Still, effective incorporation of TEL in CPD must address practical and conceptual challenges. Larsen et al15 argue that for TEL to be effective, it must incorporate five key principles: (1) Tests should arise from educational objectives, (2) questions should require generation of answers rather than recognition, (3) testing should incorporate repeated retrieval, (4) tests should incorporate spaced repetition, and (5) feedback should be provided after tests. Some of these criteria encounter practical challenges in the CPD context. First, physician time and energy are at a premium; unlike trainees in formal education programs, physician learners are not a captive audience. Any TEL strategy must fit within the busy schedules of active clinicians and must provide a sufficient incentive for participation. Second, TEL interventions must be reconciled with physician preferences for learning in face-to-face and group settings. The mismatch between what physicians prefer and what they require might best be addressed by integrating TEL into preferred learning settings. Perhaps most important, given the sophisticated and advanced knowledge of practicing clinicians, TEL interventions still require testing for efficacy and practical impact in this population.

Research on the incorporation of TEL in CPD has been equivocal. Early data suggest that various TEL strategies used for CPD may be effective but inefficient15 or efficiently incorporated but ineffective.16,17 Larsen et al15 reported a benefit for knowledge retention using short-answer questions (SAQs) for TEL in a neurology conference context. McConnell et al16 found limited benefit for the same testing format following a brief educational session at a single institution for a wide range of medical specialties. The mixed data on knowledge retention suggest that further work on optimizing efficiency and efficacy is still needed. Moreover, other test formats—ones that are likely to be more efficient to write and score—should also be evaluated. One promising format is the multiple-choice question (MCQ). Although MCQs do not meet the criterion of “generation of answers,” they are likely easier to deploy for the purposes of retrieval practice, providing feedback, and measuring final knowledge acquisition.

Most studies of TEL in CPD, and of TEL in general, focus on postinstructional testing. However, there is both an older and a newer literature on the relevance of pretesting prior to instruction. Pretesting likely has several benefits, including motivating learners to prepare prior to instruction. Further, there is some evidence that even unsuccessful retrieval of knowledge in pretesting helps potentiate future learning from instruction.18,19

Given the mixed data on TEL in CPD and the lack of testing of the MCQ format in this context, “justification research” and randomized tests are still required. There is an increasing call for experimental interventions in education to be tested for effectiveness in typical and pragmatic educational settings to further the generalizability of such interventions.20,21 Thus, in this study, we investigated the use of an MCQ-based TEL intervention that incorporated both pre- and posttesting at a national medical conference. We measured the impact of TEL on knowledge retention as well as on self-reported learning behaviors using a randomized controlled study design.


This study was a pragmatic randomized controlled trial21,22 with pediatricians who registered for a four-day CPD conference—the SickKids Paediatric Update, an annual conference led by the Hospital for Sick Children and University of Toronto. Workshops and keynotes include a mix of didactic, interactive, and group-based learning activities. Concurrent workshops include between 12 and 80 learners each, with typically over 300 pediatricians participating from across Canada. Approval for this study was provided by the SickKids Hospital Research Ethics Board.


The participants in this study were pediatricians from across Canada. During online conference registration, registrants received electronically linked information about the study, and consent was sought; alternatives to participation (including options granting full CME credit) were offered to all registrants. Participants were able to choose particular workshops to attend during the conference.


Upon electronic conference registration and workshop selection, consenting pediatricians were randomly assigned to the intervention or control group. Random assignment was based on a preset randomization sequence. Pediatricians in the intervention groups received the MCQ-based TEL interventions, while pediatricians randomized to control groups did not receive any testing or intervention. Because participants could choose multiple workshops, they were informed that they might be in the intervention group (receiving TEL) for some workshops and in the control group for others. See Figure 1 for a summary of the design.

Figure 1
Figure 1:
Schematic of the study design and sample size. Individuals may be randomized to the intervention group for one workshop and to the control group for another; thus, overall sample size is depicted in the schematic.

We endeavored to design the TEL package based on the best principles described by Larsen et al15 and on best practices in MCQ writing. The faculty leads for each of the conference’s 16 workshops were asked to identify five take-home key points of learning for their respective sessions. For each of the five key messages, we asked for two corresponding MCQs: a stand-alone item (the first set) and a clinical vignette item (the second set).

Faculty leads were invited to attend a best-practice MCQ development session based on MCQ writing principles.23,24 Later in the year, the MCQs submitted were reviewed and, if necessary, edited by members of the research team experienced in test development to avoid cues that might render the MCQs less discriminating (e.g., grammatical cues, distracter length cues, logical cues).

Only the first set of MCQs was used as the TEL intervention, for both pre- and posttesting, whereas the second set was used as an outcome measure. The second set of clinical vignette MCQs allowed us to assess application of learning and to rule out gains from simply learning the test items and answers.

The TEL “intervention” for each workshop consisted of two administrations of the same test, each delivered electronically through the Qualtrics survey platform. First, a pretest of five MCQs was delivered one week before the CPD conference workshop. MCQs were single-best-answer items with five options each and were framed as stand-alone items focused on the five key objectives for each workshop. Participants completing pretest MCQs received no feedback. However, results from the pretests were given to workshop leads to supplement learning needs assessments prior to leading workshops. Two weeks after the workshop, participants in the intervention group received the same five MCQs as a posttest, but this time each item came with feedback. For each incorrect option, an explanation as to why the choice was incorrect was given automatically, and participants were asked to make another choice. This step was repeated until the correct choice was made. We chose to limit the number of items to five based on a previous feasibility pilot.
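
The answer-until-correct feedback loop described above can be sketched in a few lines. This is an illustrative toy only; the real intervention was delivered through Qualtrics, and the item structure and function name below are our assumptions:

```python
def answer_until_correct(item, attempts):
    """Return the feedback shown for each incorrect choice, in order,
    stopping once the correct choice is made (illustrative sketch,
    not the study's Qualtrics implementation)."""
    feedback_shown = []
    for choice in attempts:
        if choice == item["correct"]:
            return feedback_shown  # learner has reached the correct answer
        # show the explanation for the incorrect option and re-prompt
        feedback_shown.append(item["feedback"][choice])
    raise ValueError("correct answer never selected")

# Hypothetical five-option, single-best-answer item
item = {
    "correct": "C",
    "feedback": {
        "A": "why not A",
        "B": "why not B",
        "D": "why not D",
        "E": "why not E",
    },
}
```

A learner who answers B and then C sees one explanation before completing the item.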

Neither pre- nor posttests were timed or modified in any way. Each test was accompanied by a brief survey on the quality of questions, the impact on learning behavior, and an opportunity to provide feedback on the intervention package.


Our primary outcome measure was performance on the first attempt of the knowledge retention test for each workshop, delivered four weeks after the conference. All participants received a common retention test for each workshop attended. The test consisted of five new MCQs in a clinical vignette format (the second set of questions created by workshop leads). These questions were intended to evaluate retention and the ability to use knowledge of the major learning objectives of each workshop through questions that would be unfamiliar to the participants. This approach can be classified as knowledge retention testing for both groups because the content was tested after a time delay and the test content was closely aligned to the key points emphasized in each workshop for all participants. We chose clinical vignette-based items to increase the novelty of the primary outcome measure and reduce any practice effects from familiarity with the stand-alone pre/post questions for the intervention group. We also calculated an aggregate performance score for each participant by summing performance across all tests and dividing by the number of workshops attended.

Secondary outcomes included self-reported changes in the behavior of learners and teachers and perceptions of the utility of the TEL package, including efficiency, effectiveness, and satisfaction. These data were triangulated using objective data such as time taken and test completion. As this was a pragmatic trial, we could not control for the amount of studying or review between sessions, or for other factors such as the varying quality of workshop material and experience of presenters.

Data analysis

Because workshop participation size varied greatly, and given that participants were randomly assigned to TEL or control across multiple workshops, we could not analyze our data using a traditional factorial analysis of variance (ANOVA). Instead, we analyzed the data using a multilevel regression to account for the variable effect of the TEL intervention on participants and across workshops. Thus, we analyzed with intervention as a fixed factor and participant (level 1) and workshop topic (level 2) as random factors. Additional covariates included posttest performance, interitem reliability of the outcome test for the workshop, number of workshops attended, and number of workshops attended in the intervention group. All analyses set the alpha threshold at 0.05, two tailed. Where multiple analyses were performed, the Bonferroni correction was used to correct for the inflation of type I error rate.
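
As a compact sketch (our notation, not drawn from the original analysis output), the two-level model with crossed random intercepts can be written as:

```latex
Y_{ij} = \beta_0 + \beta_1\,\mathrm{TEL}_{ij} + \sum_{m}\gamma_m X_{mij}
       + u_i + v_j + \varepsilon_{ij},
\qquad u_i \sim N(0,\sigma^2_{\text{participant}}),\quad
       v_j \sim N(0,\sigma^2_{\text{workshop}})
```

where $Y_{ij}$ is the retention score of participant $i$ for workshop $j$, $\mathrm{TEL}_{ij}$ indicates random assignment to the intervention for that workshop, and the $X_{mij}$ are the covariates listed above.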

In a pragmatic trial, statistical significance is secondary to the size of the effect of the intervention.19 To summarize the effect size, we also analyzed the primary outcome for each workshop measured as an individual “study” and cumulatively compared performance for the intervention across workshops using a random-effects meta-analysis.
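
As an illustration of how such a pooled estimate is computed, the sketch below implements the standard Hedges g and DerSimonian-Laird random-effects pooling in plain Python. The function names are ours, and the inputs would be per-workshop summary statistics; no values from this study are assumed:

```python
import math

def hedges_g(m1, m2, sd1, sd2, n1, n2):
    """Standardized mean difference with Hedges' small-sample correction."""
    df = n1 + n2 - 2
    s_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    d = (m1 - m2) / s_pooled
    j = 1 - 3 / (4 * df - 1)  # small-sample correction factor
    return j * d

def g_variance(g, n1, n2):
    """Approximate sampling variance of Hedges g."""
    return (n1 + n2) / (n1 * n2) + g**2 / (2 * (n1 + n2))

def random_effects_pool(effects, variances):
    """DerSimonian-Laird random-effects pooled estimate with 95% CI."""
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed)**2 for wi, yi in zip(w, effects))
    k = len(effects)
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)  # between-study variance estimate
    w_re = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se)
```

Calling `random_effects_pool` on the per-workshop effect sizes and their variances yields a pooled g and a 95% CI of the kind summarized in Figure 3.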

Secondary outcomes included performance on pre- and posttest and responses to three satisfaction surveys at the end of the pre-, post-, and retention tests inquiring about participants’ perspectives on the utility of TEL. Results are reported as descriptive statistics or summaries of comments. All analyses were performed on SPSS 21 and SAS 9.2.


Fifteen of 16 workshop leads agreed to provide MCQs. Overall, 308 pediatricians registered for the conference and 186 individuals consented to participate across 15 workshops for which randomization to TEL or control occurred. For our primary outcome, complete data were available for 126 participants (68%) (445 individual data points). Workshops varied in participant size from 13 individuals to more than 80 individuals. One workshop, with only 10 participants, was not included in the study because the workshop lead declined to contribute MCQs. The final sample size available for outcome testing ranged from 13 to 50 per workshop.

The median number of different topic workshops that participants completed was 4, though the mode was 6. The range extended from 1 workshop to 8, with the interquartile range from 2 to 6 workshops attended by participants. Within our participant group, 84% were randomized to receive the intervention for at least 1 workshop, with 19% receiving it for 2 workshops, 27% for 3 workshops, and 19% receiving the intervention for 4 or more workshops. The number of workshops attended and number of workshops attended randomized to the intervention were not significant correlates of participants’ aggregate performance on the retention tests (r = −0.15, P > .05 and r = −0.14, P > .05).

Average scores for the pre- and posttests across each workshop are presented in Figure 2. Repeated-measures ANOVA showed the aggregate increase from pre to post to be significant (F[2,174] = 12.12, P < .0001, partial eta squared = 0.14). Pre- and posttest performance were correlated at 0.68 (P < .0001), suggesting that preworkshop knowledge was somewhat predictive of how much postworkshop knowledge participants acquired as measured by the pre-/posttesting. However, posttest and retention scores were only weakly correlated at 0.21 (P < .0001).

Figure 2
Figure 2:
Average score on pre- and posttesting across multiple workshops.

The primary analysis of the data showed a significant effect of TEL on the primary outcome measure of knowledge retention for those in the intervention group compared with controls (beta = 0.434, t = 4.01, P < .0001 [95% CI = 0.22–0.64]). Summarizing retention scores across participants and workshops, scores were significantly higher for TEL participants (71.2% correct choices) than for those randomized to workshop only (60% correct choices). Covariates including number of workshops attended, number of workshops attended with intervention, number of workshops attended with control, and posttest performance were not significant. Retention performance showed significant clustering (Wald test Z = 2.25, P < .025) by individual performance (intraclass correlation coefficient = 0.09) and by workshop topic (Wald Z = 3.54, P < .003).

The random-effects meta-analysis allowed us to calculate a pooled effect size indicating a moderate effect of TEL (Hedges g of 0.46, 95% CI = 0.26–0.67) (see Figure 3), though the effect varied depending on the workshop topic, as suggested by regression analysis.

Figure 3
Figure 3:
Summary of effect size across workshops (Hedges g = 0.46; 95% CI: 0.26–0.67). Dots represent the effect size of the test-enhanced learning (TEL) (vs. no TEL) for each workshop; the bottom dot represents overall effect size. Horizontal lines represent 95% CIs.

The majority of participants (> 80%) agreed that pretests helped identify knowledge gaps and enhanced learning at the workshop and that posttests with feedback helped verify learning. Aggregate satisfaction data are reported in Figure 4.

Figure 4
Figure 4:
The proportion of participants endorsing the utility of test-enhanced learning (TEL). Items are identified by when they were administered during the course of the study, after pretest, after posttest, or after retention.


In this pragmatic randomized controlled trial of TEL at a national CPD conference for pediatricians, our TEL strategy demonstrated benefit for both perceived and objective measures of knowledge retention, with moderate effect size with new assessment items. Although this effect varied across workshop topic, overall it suggests that TEL can be implemented to meaningfully improve knowledge retention in a wide variety of content areas for CPD.

Our findings are consistent with those of Larsen et al,15 who demonstrated enhanced knowledge retention using repeated quizzing among practicing neurologists following annual CPD courses. In that study, four quizzes were delivered to participants, with the latter three completed a week apart. All quizzes were delivered five months prior to the retention test at study end. The intervention’s effect size was similar to the results of our study. Larsen et al15 used SAQ-type assessments as opposed to MCQs. The utility of MCQs in TEL is debated because they do not provide an opportunity for generation of answers. However, the content richness of MCQs (e.g., vignette vs. recall items) may increase their effect and could perhaps address some of the concerns raised about their efficacy.25 MCQs are easier to deliver and score for both feedback and evaluation purposes. SAQs may be a more involved and perhaps more difficult format to generalize to all CPD contexts.

A corresponding strength of our study was the simplicity and portability of our TEL strategy, which used one set of only five MCQs per workshop for both pre- and posttesting. Our secondary outcomes (i.e., participants’ perspectives) indicated that our TEL strategy was efficient and feasible to implement widely at medical conferences. The majority of our respondents thought the length, difficulty, and utility of testing were appropriate for their learning. A large proportion (> 80%) also noted that testing helped identify their knowledge gaps and enhanced their learning during the conference. The role of pretesting as an aid to focus attention and learning is still being investigated, though behavioral scientists have long identified pretesting as a “contaminant” of research studies. In an educational setting, however, it can be a powerful addition to posttest-enhanced learning.

In contrast to our findings, McConnell et al16 failed to detect an effect of TEL in their study. One potential reason is the strong knowledge base of the participants in that study. Inspection of pretest scores across our workshop topic areas, however, showed that average performance was typically below 50%, perhaps indicating that there was still learning that could be enhanced by TEL. Interestingly, pre- and posttest performance were moderately correlated within our data (0.68), suggesting that prior knowledge of the topic is a good indicator of learning and that the TEL workshop further consolidated knowledge for those participants who already knew something about the topic. However, this was not necessarily predictive of how much they would retain when measured using new items in a vignette format, as evidenced by the low correlation between posttest and knowledge retention testing. This may be a function of the increased difficulty of the retention test as well as of forgetting.

Our study also confirms findings from other work suggesting an interaction between the topic or content area and the effect of TEL. Mathes and colleagues’26 study in the context of residency training found dramatically different effect sizes for TEL across two courses. This suggests that topic content and the context of learning likely have an effect on overall efficacy. As stated by Larsen,27 “one size does not fit all topics in applying spaced testing.” Thus, a strength of this study design is that we were able to identify an overall effect of TEL in pragmatic settings across multiple topic areas. However, we were not able to conclusively establish why TEL was more effective in one workshop than another. One underappreciated factor may be the practices of teachers. We provided the results of pretests to teachers, which may have spurred changes in teaching around specific assessment gaps. Some teachers in our study perceived that receiving the scores informed their preparation or delivery of the sessions, whereas other teachers did not. Further data are required on how teachers use pretest information. For example, we do not have a good appreciation of the degree to which teachers routinely modify their instruction based on assessment data on learners’ gaps. Also, assuming that teachers were motivated to change their routine, it is unclear whether small modifications to the instructional design would impact overall learning effects in a CPD context.

Objective measures of physician behavior change and patient outcomes were beyond the scope of our inquiry. Our study was limited to measuring test-enhanced knowledge retention. Though our results are in keeping with other studies of TEL, the next step is to objectively document how changes in knowledge potentiate further CPD and influence change in practice. Even so, our results indicate that this portable model of testing with feedback can be broadly leveraged to efficiently and effectively improve outcomes of CPD.


A critique of this study is the potential for a “practice effect” as a result of format similarity between the intervention and the primary outcome. Although format similarity may inflate effect size, our outcome test was designed to go beyond simple recall and involved clinical vignette-based MCQs which differed (in form, questions, and choices) from the pre- and postintervention MCQs (which were stand-alone or fact based). Further, given that TEL is still relatively new in CPD, justification research in the fashion of this study is necessary for the MCQ format before the clarification studies that are clearly needed can be conducted. An additional critique of our study may be the small number of MCQs used to operationalize the intervention. Although this may be a limitation, it was necessary to achieve adherence by busy clinicians: previous experience in the CPD context suggests that longer tests will likely reduce participation. Given the large number of potential sources of variance operating in our study, we still managed to detect a moderate effect in favor of TEL with only five MCQs. The control group in our study received no intervention, whereas in other studies the typical control is additional education or training.14,16 We could not include an additional-education control group in this design because of feasibility and participant acceptability; additional instruction may attenuate the effect of TEL, though in previous studies, additional instruction rarely outperforms the TEL condition. Further research could explore an active control group or other forms of testing.

Implications for research

Although TEL is increasingly accepted across the spectrum of medical education, more research is needed on the mediators of its effectiveness. The role of the topic area, and of the activities of teachers who receive test information, requires further exploration in the CPD context. More broadly, the format and timing of testing also require further study to truly understand when and where TEL can be delivered effectively. Of course, whether retention of knowledge translates to changes in practice requires further investigation.

Implications for practice

For those designing and delivering CPD, incorporation of TEL can be a simple and effective method of increasing the retention of knowledge. It may increase the motivation to participate in CPD as TEL activities may qualify for additional CPD credits. Intrinsic motivation of participants to receive feedback on their learning may also contribute to more sustained participation in CPD offerings.


Our study adds to the growing body of literature on the benefits of TEL in the health professions and in CPD for physicians in particular. Evaluating clinicians not only enhances learning but may also help gauge the knowledge that is requisite for competency-based CPD recertification programs. These findings will improve practice for CPD in Canada and fill a gap in the continuing education literature internationally. Testing remains an underused educational intervention in CPD, and the use of formative assessment to enhance professional development should be a continued avenue for research.


1. Campbell CM, Parboosingh J. The Royal College experience and plans for the maintenance of certification program. J Contin Educ Health Prof. 2013;33(suppl 1):S36–S47.
2. Yee M, Simpson-Young V, Paton R, Zuo Y. How do GPs want to learn in the digital era? Aust Fam Physician. 2014;43:399–402.
3. Salinas GD. Engaging learners in CME in 2016: Understanding CME preferences to increase future participation. J Contin Educ Health Prof. 2016;36(suppl 1):S46–S47.
4. Koczka CP, Geraldino-Pardilla LB, Goodman AJ, Gress FG. A nationwide survey of gastroenterologists and their acquisition of knowledge. Am J Gastroenterol. 2013;108:1033–1035.
5. Nasir A, Khader A, Nasir L, Abuzayed I, Seita A. Paediatric continuing medical education needs and preferences of UNRWA physicians in Jordan. East Mediterr Health J. 2016;22:47–51.
6. Cragun D, Besharat AD, Lewis C, Vadaparampil ST, Pal T. Educational needs and preferred methods of learning among Florida practitioners who order genetic testing for hereditary breast and ovarian cancer. J Cancer Educ. 2013;28:690–697.
7. Riley J, McGowan M, Rozmovits L. Exploring the value of technology to stimulate interprofessional discussion and education: A needs assessment of emergency medicine professionals. J Med Internet Res. 2014;16:e162.
8. Lindsay E, Wooltorton E, Hendry P, Williams K, Wells G. Family physicians’ continuing professional development activities: Current practices and potential for new options. Can Med Educ J. 2016;7:e38–e46.
9. Mazmanian PE, Davis DA. Continuing medical education and the physician as a learner: Guide to the evidence. JAMA. 2002;288:1057–1060.
10. Norman G. The third wave in health sciences education. Adv Health Sci Educ Theory Pract. 2013;18:319–322.
11. Mayer RE. Applying the science of learning to medical education. Med Educ. 2010;44:543–549.
12. Larsen DP, Butler AC, Roediger HL 3rd. Test-enhanced learning in medical education. Med Educ. 2008;42:959–966.
13. Kromann CB, Jensen ML, Ringsted C. The effect of testing on skills learning. Med Educ. 2009;43:21–27.
14. Larsen DP, Butler AC, Roediger HL 3rd. Repeated testing improves long-term retention relative to repeated study: A randomised controlled trial. Med Educ. 2009;43:1174–1181.
15. Larsen DP, Butler AC, Aung WY, Corboy JR, Friedman DI, Sperling MR. The effects of test-enhanced learning on long-term retention in AAN annual meeting courses. Neurology. 2015;84:748–754.
16. McConnell MM, Azzam K, Xenodemetropoulos T, Panju A. Effectiveness of test-enhanced learning in continuing health sciences education: A randomized controlled trial. J Contin Educ Health Prof. 2015;35:119–122.
17. Grzeskowiak LE, Thomas AE, To J, Reeve E, Phillips AJ. Enhancing continuing education activities using audience response systems: A single-blind controlled trial. J Contin Educ Health Prof. 2015;35:38–45.
18. Richland LE, Kornell N, Kao LS. The pretesting effect: Do unsuccessful retrieval attempts enhance learning? J Exp Psychol Appl. 2009;15:243–257.
19. Kornell N, Hays MJ, Bjork RA. Unsuccessful retrieval attempts enhance subsequent learning. J Exp Psychol Learn Mem Cogn. 2009;35:989–998.
20. Cook DA, Bordage G, Schmidt HG. Description, justification and clarification: A framework for classifying the purposes of research in medical education. Med Educ. 2008;42:128–133.
21. Tolsgaard MG, Kulasegaram KM, Ringsted C. Practical trials in medical education: Linking theory, practice and decision making. Med Educ. 2017;51:22–30.
22. Roland M, Torgerson DJ. Understanding controlled trials: What are pragmatic trials? BMJ. 1998;316:285.
23. Case S, Swanson D. Writing Written Test Questions for the Basic and Clinical Sciences. Philadelphia, PA: National Board of Medical Examiners; 1998.
24. Medical Council of Canada. Guidelines for the development of multiple-choice questions. Published February 2010. Accessed July 20, 2018.
25. McConnell MM, St-Onge C, Young ME. The benefits of testing for learning on later performance. Adv Health Sci Educ Theory Pract. 2015;20:305–320.
26. Mathes EF, Frieden IJ, Cho CS, Boscardin CK. Randomized controlled trial of spaced education for pediatric residency education. J Grad Med Educ. 2014;6:270–274.
27. Larsen DP. Picking the right dose: The challenges of applying spaced testing to education. J Grad Med Educ. 2014;6:349–350.
Copyright © 2018 by the Association of American Medical Colleges