Research Reports

Test-Enhanced Web-Based Learning

Optimizing the Number of Questions (a Randomized Crossover Trial)

Cook, David A. MD, MHPE; Thompson, Warren G. MD; Thomas, Kris G. MD

doi: 10.1097/ACM.0000000000000084


Research has shown that Web-based learning (WBL) can be effective,1 but differences in instructional methods influence learning in Web-based courses just as they do in face-to-face settings. A recent review, for example, found that increased cognitive interactivity and the presence of practice exercises improve WBL outcomes.2 Health professions educators would benefit from further research to clarify how to optimize the effectiveness and efficiency of WBL.

It is well established3,4 that the act of testing (i.e., asking questions with or without providing feedback) directly enhances learning, most likely by encouraging more robust elaboration and perhaps also by strengthening the memory retrieval pathway.3,5 Testing appears to enhance retention better, and often in less time, than do other instructional methods such as repeated exposure to learning materials. Investigators in health professions education have demonstrated that the inclusion of self-assessment questions can significantly improve learning in WBL.6–10 However, our understanding is incomplete regarding how to effectively implement test-enhanced learning in educational practice. As McDaniel et al11 put it, “From a practical standpoint, educators would likely prefer to implement the most efficacious quizzing scheme.” This question, and the model of test-enhanced learning that underlies it, constitute the conceptual framework for this study. Our purpose is to clarify one aspect of test-enhanced learning implementation, namely: What is the impact of changing the intensity of testing (number of questions) in a Web-based course?

At least two studies in medical WBL indicate that including more questions does not always lead to better results. In one study,12 residents who completed a pretest comprising case-based questions and then answered case-based practice questions embedded in a Web-based module had lower posttest knowledge scores than residents who completed the pretest but answered non-case-based practice questions. Conversely, residents who did not take the pretest did better with case-based practice questions than with non-case-based questions (i.e., an interaction between pretest and question format). One possible explanation is that the number of case-based self-assessment questions required to learn a particular topic is finite. In the second study,13 residents completed a WBL module with embedded multiple-choice questions; some of them went on to answer additional open-ended questions. Those who answered the additional questions had slightly lower posttest knowledge scores than those who did not. Although these studies suggest that adding more questions does not necessarily enhance learning, the interpretations are confounded by the simultaneous variation in the types of questions, which in turn suggests the need for further research. We are not aware of previously published research evaluating the impact on learning of varying the number of practice questions in a Web-based course.

In the present study, we therefore sought to answer the question: Does varying the number of self-assessment questions affect knowledge outcomes in WBL for medical residents? On the basis of the model and empiric research cited above, we anticipated that adding questions would initially enhance learning, but that there would be a point of diminishing returns (i.e., a plateau) beyond which additional questions no longer improved knowledge. We also sought to replicate the previously identified interaction with pretest versus no pretest study conditions.12

As a secondary aim, we evaluated the relationship between the number of self-assessment questions and residents’ ratings of motivation and mental effort. Theory and empiric research suggest that motivation and mental effort may influence learning outcomes,14,15 but to our knowledge this has not been studied in test-enhanced learning. We anticipated that, when analyzed as outcomes, motivation and mental effort ratings would follow a pattern similar to that of knowledge scores (as described above).

To address these aims, we conducted a randomized crossover trial using a primary outcome of applied knowledge and secondary outcomes of motivation, mental effort, time to complete modules, and format preference.


Setting and sample

This study took place in an academic medical center between January 2009 and July 2010. We used e-mail and telephone to extend invitations to participate to all 229 categorical internal medicine and family medicine residents enrolled in the Mayo School of Graduate Medical Education in Rochester, Minnesota, during this two-academic-year period. No incentives were offered for participation. All participants gave consent. The Mayo Clinic institutional review board exempted this study from full review.


Ambulatory medicine WBL modules are part of the regular continuity clinic curriculum for our internal medicine and family medicine residency programs. For the present study, we updated the content of previously described WBL modules12,16 on hyperlipidemia, asthma, diabetes, and depression for residents to complete in 2009, and on osteoporosis, nicotine dependence, cervical cancer screening, and dementia for residents to complete in 2010. Each evidence-based module consisted of text, images, hyperlinked resources, and case-based self-assessment questions (patient scenario followed by a single-best-answer multiple-choice question) with feedback (correct answer and brief rationale).

Each updated module was developed in four formats, each with a different number of questions. We first developed the “full” module with 15 self-assessment questions interspersed throughout the module. We then developed the three shorter formats (10, 5, and 1 questions each) by successively eliminating the questions that we felt would have the lowest instructional value. Using this approach, we selected the most comprehensive question (addressing the greatest breadth or most critical concepts) for the 1-question format. For examples of the case-based self-assessment questions, see Supplemental Digital Appendix 1. In the 1-question format, the question appeared at the beginning of the module, before any information. In the multiple-question formats, all other questions followed immediately after relevant content. The location of a given question remained fixed relative to content and other questions in all formats of a given module.

After our preliminary analysis of 2009 data revealed essentially no difference in study outcomes between the 1-question and 5-question formats, we dropped the 5-question format in 2010 and substituted a no-question format. (In other words, each 2009 module had 15-, 10-, 5-, and 1-question formats, whereas each 2010 module had 15-, 10-, 1-, and 0-question formats). The instructional design was otherwise kept consistent from module to module, and from year to year.

We released four modules per year (topics listed above) at approximately one-month intervals. All internal medicine and family medicine residents were expected to complete all four modules as well as the associated knowledge posttests (described below) regardless of study participation, although not all residents met this expectation. They could complete available modules in any order, at a time and location of their choosing, with no limit to how long they spent on the module or posttest.


We used a randomized crossover design in which study participants completed one module using each format per year. We did not assume consent for year 2 after participation in year 1 but, rather, contacted and enrolled participants afresh in year 2. In each year, we randomly assigned residents to four groups, and each group followed a different pattern of module formats (see Figure 1). We also employed a variation of the Solomon four-group design17 in which study participants completed a pretest for two of the four modules, with randomized assignment (i.e., for any given module, half the participants completed a pretest and the other half did not; not shown). One of us (D.A.C.) performed randomization after consent (i.e., allocation concealed) using MINIM (version 1.5, London Hospital Medical College, London, United Kingdom), with stratification by residency program and postgraduate year.
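The text states only that MINIM was used with stratification by residency program and postgraduate year. As a rough illustration of how allocation by minimization can keep groups balanced on such factors, here is a minimal Python sketch. The group labels, the range-based imbalance score, and the random tie-breaking rule are all assumptions made for illustration; this is not MINIM's actual algorithm or the settings used in the study.

```python
import random
from collections import defaultdict

GROUPS = ["group_1", "group_2", "group_3", "group_4"]  # hypothetical labels
FACTORS = ["program", "pgy"]  # stratification factors named in the text


def new_counts():
    """counts[(factor, level)][group] -> number of participants so far."""
    return defaultdict(lambda: defaultdict(int))


def assign_by_minimization(participant, counts, rng):
    """Place a participant in the group that minimizes marginal imbalance."""
    scores = {}
    for group in GROUPS:
        total = 0
        for factor in FACTORS:
            level_counts = counts[(factor, participant[factor])]
            # Group counts at this factor level if the participant joined `group`.
            trial = [level_counts[g] + (1 if g == group else 0) for g in GROUPS]
            total += max(trial) - min(trial)  # range used as the imbalance score
        scores[group] = total
    best = min(scores.values())
    # Ties are broken at random so the next allocation is not predictable.
    choice = rng.choice([g for g in GROUPS if scores[g] == best])
    for factor in FACTORS:
        counts[(factor, participant[factor])][choice] += 1
    return choice


# Hypothetical roster: 2 programs x 3 postgraduate years x 4 residents each.
rng = random.Random(0)
counts = new_counts()
roster = [{"program": p, "pgy": y}
          for p in ("IM", "FM") for y in (1, 2, 3) for _ in range(4)]
assignments = [assign_by_minimization(r, counts, rng) for r in roster]
group_sizes = {g: assignments.count(g) for g in GROUPS}
print(group_sizes)
```

Because the score is computed from marginal counts, the procedure balances each factor separately rather than every program-by-year cell, which is the usual trade-off of minimization against fully stratified block randomization.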

Figure 1:
Flow of participants in a study of test-enhanced Web-based learning, Mayo School of Graduate Medical Education, January 2009 to July 2010. Internal medicine and family medicine residents were enrolled and randomized separately in each academic year. Each participant was assigned to complete four modules per year; each module included a different number of self-assessment questions (Q). Module sequence was randomly assigned. For each module, half the participants in each group were randomly assigned to complete a pretest (not shown). All available data for each participant at each time point were included in analyses. Module topics were as follows: Module A, hyperlipidemia; Module B, asthma; Module C, diabetes mellitus type 2; Module D, depression; Module E, osteoporosis; Module F, tobacco dependence; Module G, cervical cancer screening; Module H, dementia. Modules A through D were completed in 2009; Modules E through H were completed in 2010.
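The crossover layout described above (four groups, four modules, each participant seeing each format once per year) has the structure of a Latin square. The sketch below builds one such square by cyclic rotation. The specific group-to-format patterns used in the study appear only in Figure 1 and are not reproduced here, so the rotation order is an assumption made for illustration.

```python
def cyclic_latin_square(treatments):
    """Row g is the treatment list rotated left by g positions."""
    n = len(treatments)
    return [[treatments[(g + m) % n] for m in range(n)] for g in range(n)]


# Number of self-assessment questions per format, as stated in the text.
formats_2009 = [15, 10, 5, 1]
formats_2010 = [15, 10, 1, 0]
modules_2009 = ["hyperlipidemia", "asthma", "diabetes", "depression"]

square_2009 = cyclic_latin_square(formats_2009)
for group_number, row in enumerate(square_2009, start=1):
    # Each group completes every module, each in a different format.
    print(f"Group {group_number}:", dict(zip(modules_2009, row)))
```

In a square like this, every format appears exactly once in each row (group) and once in each column (module), so format effects are not confounded with module topic.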

Instruments and outcomes

We assessed the primary study outcome, applied knowledge, using a posttest completed at the end of each module. We used a test blueprint to develop 17 or 18 case-based, single-best-answer, multiple-choice questions designed to assess application of knowledge.18 We reviewed questions from previous tests,12,16 in conjunction with test-item statistics, and retained, revised, or removed questions as appropriate. We developed new questions as needed; all new or substantially revised questions were reviewed by experts and were pilot-tested on internal medicine faculty. As indicated above, for any given module, half the participating residents also completed a pretest; this pretest was identical to the posttest, but no answers were provided. Residents were instructed to treat tests as “closed book.”

At the end of each module, residents reported the time required to complete the module. They also rated the mental effort invested and the module’s difficulty and motivational properties (using questions we defined previously14 on the basis of Paas and van Merrienboer’s19 constructs) and provided an overall appraisal, using a scale ranging from 1 (low) to 7 (high). In 2010 only, we asked residents to report how long they waited between completing the module and taking the posttest. We found good correlation between self-reported time and actual time in an earlier study in the same setting.16

Prior to the first module, residents completed a baseline questionnaire which collected information about their experience with WBL and comfort using the Internet. Following the fourth module, we asked residents to indicate on their course evaluation the number of self-assessment questions they felt were most effective and most efficient.

All instruments were administered using Blackboard Vista (Blackboard Inc., Washington, DC).

Statistical analysis

The primary analysis compared posttest knowledge scores across the module formats (i.e., the varying number of self-assessment questions) using mixed linear models accounting for repeated measurements on each subject and for differences among modules. We planned further adjustments for residency program, postgraduate year, and gender. For statistically significant models, we performed pairwise comparisons according to the Fisher method of least significant difference. We performed similar analyses for secondary outcomes of postmodule ratings of motivation, mental effort, and difficulty, and time spent learning. In a secondary analysis, we repeated the knowledge score analyses, using pretest scores as a covariate, for the subset of posttest observations with a preceding pretest. We also used mixed models in exploratory analyses to investigate potential associations between module ratings (motivation, mental effort, difficulty, and overall), time spent learning, and posttest knowledge scores.
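One way to write down a model of the kind described above is the following sketch. The text specifies only that the mixed linear models accounted for repeated measurements on each subject and for differences among modules; the random-intercept form and all symbols below are assumptions made for illustration.

```latex
\[
  y_{ij} = \mu + \alpha_{f(i,j)} + \beta_{j} + u_{i} + \varepsilon_{ij},
  \qquad
  u_{i} \sim \mathcal{N}(0, \sigma_{u}^{2}),
  \quad
  \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^{2})
\]
```

Here \(y_{ij}\) is the posttest knowledge score (percent correct) of resident \(i\) on module \(j\), \(f(i,j)\) is the format (number of self-assessment questions) assigned to that resident for that module, \(\alpha\) is the fixed effect of format, \(\beta_{j}\) is the fixed effect of module, and \(u_{i}\) is a per-resident random intercept capturing the repeated measurements. Under this form, the test of whether scores vary by number of questions corresponds to testing whether all \(\alpha\) terms are equal.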

All participants were analyzed in the groups to which they were assigned, and all data available for each participant at each time point were included. Prior to analysis, we removed residents’ names from the dataset and used unique numeric codes to link data across years and modules. Analyses used a two-sided alpha level of 0.05 and were performed using SAS version 9.1 (SAS Institute Inc., Cary, North Carolina). We expected a sample size of 120 participants per year to provide > 90% power to detect differences of 3% in knowledge scores. We report knowledge scores as percent correct.
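To illustrate how a power statement of this kind depends on sample size, the sketch below uses a normal approximation for a two-sided test on paired differences. The text does not report the variance assumptions behind the authors' calculation, so the standard deviation used here (`HYPOTHETICAL_SD_DIFF`) is a placeholder chosen only for illustration, and the function is not the authors' method.

```python
from math import sqrt
from statistics import NormalDist


def paired_power(delta, sd_diff, n, alpha=0.05):
    """Approximate power of a two-sided z-test on n paired differences."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = delta / (sd_diff / sqrt(n))  # true difference in standard-error units
    # Probability of rejecting in either tail when the true difference is delta.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


HYPOTHETICAL_SD_DIFF = 10.0  # assumed SD of within-resident score differences

for n in (30, 60, 120):
    print(n, round(paired_power(3.0, HYPOTHETICAL_SD_DIFF, n), 3))
```

Power rises with the number of participants and falls as the assumed variability grows, so the same 3-point difference can be well or poorly powered depending on the variance that is plugged in.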


Of the 229 eligible internal medicine and family medicine residents, 197 (86%) consented to participate and were randomized. Of these 197, 180 (91%) completed at least one knowledge posttest. Participant characteristics are reported in Table 1, and the flow of participants is shown in Figure 1. Cronbach's alpha for posttest knowledge scores was 0.88. The mean (standard error [SE]) percent correct pretest knowledge score (all modules and formats, unadjusted; n = 461 pretests) was 53.2 (0.8). The mean posttest knowledge score (n = 919 posttests) was 73.7 (0.5).
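For readers unfamiliar with the reliability coefficient reported above, the following sketch computes Cronbach's alpha from an item-score matrix using the standard formula. The toy data are invented for illustration and have no relation to the study's test items; the text does not describe how item scores were pooled across modules.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(scores[0])  # number of items
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Invented example: 5 respondents, 4 items scored 0 (wrong) or 1 (right).
toy_scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(toy_scores), 2))
```

Alpha approaches 1 when items rise and fall together across respondents, which is why it is read as a measure of internal consistency.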

Table 1:
Characteristics of Participants at Enrollment in a Study of Test-Enhanced Web-Based Learning, Mayo School of Graduate Medical Education, January 2009 to July 2010*

Individual and simultaneous adjustment for residency program, postgraduate year, and gender did not alter the results of any reported analyses.

Effect of number of self-assessment questions in module

We found that mean posttest knowledge scores were the highest for the 10-question format, followed by the 15-question format (see Figure 2). Mean scores for the 0-, 1-, and 5-question formats were lower and very similar. Analysis of variance indicated that scores varied significantly according to the number of questions (P = .04). In pairwise post hoc comparisons, only the differences in scores between 10-question and 0- or 1-question formats were significant at P < .05. We report the results for each module separately in Supplemental Digital Appendix 2.

Figure 2:
Residents’ mean posttest knowledge scores by number of self-assessment questions in a study of test-enhanced Web-based learning, Mayo School of Graduate Medical Education, January 2009 to July 2010. Error bars reflect standard error of the mean. Data available for 107 residents in 2009 and 138 residents in 2010.

Residents rated each module immediately on completion in terms of motivating properties, mental effort invested, perceived difficulty, and overall. Results of pooled analyses are shown in Table 2. The overall module rating varied significantly depending on the number of self-assessment questions (P = .02): Modules with 5 or more questions received higher ratings than those with 0 or 1 question. By contrast, the number of questions did not influence motivation, mental effort, or perceived difficulty ratings (P ≥ .39).

Table 2:
Residents’ Evaluations of Web-Based Modules in a Study of Test-Enhanced Web-Based Learning, Mayo School of Graduate Medical Education, January 2009 to July 2010*

As would be expected, time required to complete a module varied according to the number of questions (P < .0001; see Figure 3). In pairwise post hoc comparisons, we found statistically significant differences for 15 questions versus 0, 1, or 5 questions, and for 10 questions versus 0 or 1 question. Module-specific results are provided in Supplemental Digital Appendix 2.

Figure 3:
Residents’ mean self-reported time for module completion by number of self-assessment questions in a study of test-enhanced Web-based learning, Mayo School of Graduate Medical Education, January 2009 to July 2010. Error bars reflect standard error of the mean. Data available for 106 residents in 2009 and 131 residents in 2010.

Response rates for the final course evaluation (completed after the fourth module) were low: Only 36 residents (28%) responded in 2009 and 58 (41%) in 2010. Because we asked the residents identical questions in both years—how many questions per module helped them learn most effectively and how many helped them learn most efficiently—we merged the responses for analysis. Residents most often identified 10 as the ideal number of questions both in terms of efficiency (45/94 [48%]) and effectiveness (41/93 [44%]), followed by 5 (40 [43%] and 35 [38%], respectively) and 15 questions (5 [5%] and 13 [14%], respectively).
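The preference percentages above can be checked directly from the reported counts. The short sketch below does that arithmetic; the dictionary names are ours, and the counts and denominators are taken from the paragraph above.

```python
def pct(count, total):
    """Percentage rounded to the nearest whole number."""
    return round(100 * count / total)


# Keys are the number of questions per module; values are response counts.
efficiency_counts = {10: 45, 5: 40, 15: 5}      # out of 94 responses
effectiveness_counts = {10: 41, 5: 35, 15: 13}  # out of 93 responses

efficiency_pct = {q: pct(c, 94) for q, c in efficiency_counts.items()}
effectiveness_pct = {q: pct(c, 93) for q, c in effectiveness_counts.items()}

print("efficiency:", efficiency_pct)
print("effectiveness:", effectiveness_pct)
# Responses not itemized in the text (other answer options).
print("unlisted:", 94 - sum(efficiency_counts.values()),
      93 - sum(effectiveness_counts.values()))
```

The three listed options account for most but not all responses in each column; the text does not itemize the remainder.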

Effect of pretest on posttest knowledge score

Residents completed a pretest for two of the four modules each year. After adjusting for differences between modules and number of self-assessment questions, we found that mean (SE) posttest knowledge scores were higher for modules for which residents completed a pretest than for those for which they did not: 75.4 (0.9) versus 72.2 (0.9), respectively; difference: 3.1 (95% CI 1.5–4.8), P = .0002. However, the interaction between the number of self-assessment questions and the presence/absence of a pretest was not statistically significant (P = .49), indicating that the effect of varying the number of questions was similar regardless of pretest completion.
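As a consistency check on the pretest effect, the reported confidence interval can be converted back into an approximate standard error and P value. The sketch below assumes a normal approximation, whereas the mixed model would have used its own reference distribution, so the result is expected to agree with the reported P = .0002 only roughly.

```python
from statistics import NormalDist


def p_from_ci(estimate, lower, upper, level=0.95):
    """Back out an approximate SE and two-sided P value from a confidence interval."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - (1 - level) / 2)
    se = (upper - lower) / (2 * z_crit)  # CI half-width divided by critical value
    z_stat = estimate / se
    p_value = 2 * (1 - z.cdf(abs(z_stat)))
    return se, p_value


# Difference in posttest score with versus without a pretest, from the text.
se, p_value = p_from_ci(3.1, 1.5, 4.8)
print(f"approx. SE = {se:.2f}, approx. P = {p_value:.4f}")
```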

Associations between module ratings, posttest knowledge scores, and time

After adjusting for the number of self-assessment questions, we found a significant association between posttest knowledge scores and postmodule ratings of motivation (regression slope b = +1.2 [i.e., a 1-point increase in motivation rating was associated with a 1.2% increase in knowledge score]; P = .002), mental effort (b = +1.9; P < .0001), perceived difficulty (b = −1.6; P = .001), and overall appraisal (b = +2.4; P < .0001). We also found a small but statistically significant association between knowledge scores and time required to complete the module, with knowledge scores increasing by 0.8 points for every 10 minutes of additional time spent (P = .0001).

Effect of timing of posttest completion

In 2010, we asked residents to report the delay between finishing each module and taking its knowledge posttest. We found a significant performance decay over time, with average scores of 75.4 if the test was taken immediately, 74.3 if it was delayed 5 to 30 minutes, 70.5 if it was delayed 30 minutes to one day or one to three days, and 64.3 if it was delayed by more than three days (P < .0001). However, the module-to-test-taking delay did not alter our findings regarding the number of self-assessment questions (P for interaction = .14).


In accord with our prediction, we confirmed small but statistically significant differences in posttest knowledge scores as the number of self-assessment questions varied in our Web-based modules, with the highest scores found in the 10-question format. Although the absolute improvements in scores were small for 10 versus fewer questions (2.6%–2.9%), these differences are more noteworthy in relation to the 20-percentage-point improvement from pretest scores. Moreover, we contend that large effect sizes should not be expected in education studies conducted in real educational settings and comparing two active interventions.20 We believe that even differences of small magnitude can have important educational implications if they align with theory-based predictions, can be achieved without substantial cost to teachers or learners, and derive from studies with adequate power. Although answering more self-assessment questions required more time, residents most often recommended 10 questions as the most effective and efficient number (albeit with a low response rate for these items).

However, we found substantial knowledge score variability from module to module, even though all modules adhered to the same basic structure and were of similar length. The one-question format had the highest posttest knowledge scores in two instances (hyperlipidemia and asthma; see Supplemental Digital Appendix 2), and the absence of a consistent dose response warrants discussion. One explanation is that some topics (such as hyperlipidemia) are conceptually more cohesive than others (such as diabetes mellitus) that have a more disparate set of principles to learn and apply. With the former type of topic, a question regarding one principle may facilitate learning of conceptually related (but untested) principles, a phenomenon that has been described in previous research.21 Another explanation is that the self-assessment questions were not uniform in the breadth of the applied knowledge required to answer them. The questions in the 1- and 5-question formats were qualitatively different from those in the 10- and 15-question formats, in that we intentionally selected the most comprehensive question for the 1-question format and selected for the 5-question format those questions that we felt reflected, as much as possible, the full breadth of module content. Further, a question placed at the start of a module could have greater impact than subsequent questions or could induce a change in study approach that persists for the entire module. However, our earlier work suggests that question placement may not matter,22 and a study in an eighth-grade classroom found that questions preceding instruction were less effective than questions following instruction.11

Limitations and strengths

The greatest limitations of this study are the between-module variability and the small magnitude of effect, as discussed above. The low response and delayed timing of the end-of-course evaluation limit the interpretation of preference ratings, which could be biased. Self-reported time is not precise, but in a previous study the correlation between measured and self-reported time was good.16 The single-institution setting limits the confidence with which results can be generalized to different contexts, learners, or topics; however, we enrolled participants from both internal medicine and family medicine residency programs. Different results might be observed for learners with less clinical experience, such as junior medical students.

In addition, although we created all self-assessment questions to reflect the same quality standard, their instructional utility varied as noted above. The impact of additional questions might have been greater had question comprehensiveness been constant. Finally, residents completed the knowledge posttests without supervision, and we do not know whether they observed the instruction to treat them as closed-book tests. However, because test-taking practices likely remain stable within individuals and residents served as their own controls, the impact of this threat is probably minimal.

The strengths of this study include the randomized crossover study design, ample sample size, and high follow-up for the primary outcomes.

Integration with prior work

These results are consistent with previous research in health professions education showing that self-assessment questions improve WBL6–10 as well as non-Web instruction,23,24 and they also agree with theory and empiric evidence in education broadly.3,4 Our findings build on that earlier work by showing that additional questions improve learning, but that this effect seems to peak—in this study, at 10 questions—beyond which additional questions only increase module completion time.

As in previous research,12 we found that posttest knowledge scores were higher for participants who completed the pretest than for those who did not. However, we cannot determine whether the higher scores represent improved learning (i.e., test-enhanced learning) or an artifact of recalling the questions from the pretest (although residents did not receive answers to pretest questions). We did not confirm our previously observed interaction between pretest (present/absent) and question format,12 suggesting that in the earlier study it was the format rather than number of questions that influenced this interaction. This issue may merit further study.

We expected that motivation and mental effort would also vary with the number of questions,14,15 yet we did not observe any between-format differences for these outcomes. However, we did find a relationship between higher posttest knowledge scores and higher motivation, higher mental effort, and lower perceived difficulty, suggesting that these cognitive constructs may yet offer potential targets for improved Web-based instructional design.14,15,25


Test-enhanced learning improves educational outcomes. Educators implementing this approach require practical guidance, and this study provides evidence to that end. Naturally, we do not expect 10 self-assessment questions (or any other number) to be the right number for all Web-based courses; the number will depend on the course content, context, length, and learners. Yet, teachers rarely have the luxury of testing the “ideal” number of questions for each module they might propose. Although we cannot state that 10 questions are better than 8 or 12 questions, it appears that for a 45-minute WBL module, 10 questions are better than 0 or 1 question and are at least as good as 15 questions.


The authors thank Kara Kuisle (Mayo Clinic College of Medicine) for administrative assistance.


1. Cook DA, Levinson AJ, Garside S, Dupras DM, Erwin PJ, Montori VM. Internet-based learning in the health professions: A meta-analysis. JAMA. 2008;300:1181–1196
2. Cook DA, Levinson AJ, Garside S, Dupras DM, Erwin PJ, Montori VM. Instructional design variations in internet-based learning for health professions education: A systematic review and meta-analysis. Acad Med. 2010;85:909–922
3. Roediger HL, Karpicke JD. The power of testing memory: Basic research and implications for educational practice. Perspect Psychol Sci. 2006;1:181–210
4. Larsen DP, Butler AC, Roediger HL 3rd. Test-enhanced learning in medical education. Med Educ. 2008;42:959–966
5. Agrawal S, Norman GR, Eva KW. Influences on medical students’ self-regulated learning after test completion. Med Educ. 2012;46:326–335
6. Cook DA, Thompson WG, Thomas KG, Thomas MR, Pankratz VS. Impact of self-assessment questions and learning styles in Web-based learning: A randomized, controlled, crossover trial. Acad Med. 2006;81:231–238
7. Friedl R, Höppler H, Ecard K, et al. Comparative evaluation of multimedia driven, interactive, and case-based teaching in heart surgery. Ann Thorac Surg. 2006;82:1790–1795
8. Maag M. The effectiveness of an interactive multimedia learning tool on nursing students’ math knowledge and self-efficacy. Comput Inform Nurs. 2004;22:26–33
9. Kerfoot BP, DeWolf WC, Masser BA, Church PA, Federman DD. Spaced education improves the retention of clinical knowledge by medical students: A randomised controlled trial. Med Educ. 2007;41:23–31
10. Schmidmaier R, Ebersbach R, Schiller M, Hege I, Holzer M, Fischer MR. Using electronic flashcards to promote learning in medical students: Retesting versus restudying. Med Educ. 2011;45:1101–1110
11. McDaniel MA, Agarwal PK, Huelser BJ, McDermott KB, Roediger HL III. Test-enhanced learning in a middle school science classroom: The effects of quiz frequency and placement. J Educ Psychol. 2011;103:399–414
12. Cook DA, Thompson WG, Thomas KG. Case-based or non-case-based questions for teaching postgraduate physicians: A randomized crossover trial. Acad Med. 2009;84:1419–1425
13. Cook DA, Beckman TJ, Thomas KG, Thompson WG. Introducing resident doctors to complexity in ambulatory medicine. Med Educ. 2008;42:838–848
14. Cook DA, Thompson WG, Thomas KG. The Motivated Strategies for Learning Questionnaire: Score validity among medicine residents. Med Educ. 2011;45:1230–1240
15. van Merriënboer JJG, Sweller J. Cognitive load theory and complex learning: Recent developments and future directions. Educ Psychol Rev. 2005;17:147–177
16. Cook DA, Beckman TJ, Thomas KG, Thompson WG. Adapting Web-based instruction to residents’ knowledge improves learning efficiency: A randomized controlled trial. J Gen Intern Med. 2008;23:985–990
17. Campbell DT, Stanley JC. Experimental and Quasi-Experimental Designs for Research. Chicago, Ill: Rand McNally; 1963
18. Case SM, Swanson DB. Constructing Written Test Questions for the Basic and Clinical Sciences. 3rd ed. Philadelphia, Pa: National Board of Medical Examiners; 2001
19. Paas FGWC, van Merrienboer JJG. The efficiency of instructional conditions: An approach to combine mental effort and performance measures. Hum Factors. 1993;35:737–743
20. Cook DA. If you teach them, they will learn: Why medical education needs comparative effectiveness research. Adv Health Sci Educ Theory Pract. 2012;17:305–310
21. Chan JC, McDermott KB, Roediger HL 3rd. Retrieval-induced facilitation: Initially nontested material can benefit from prior testing of related material. J Exp Psychol Gen. 2006;135:553–571
22. Cook DA, Thompson WG, Thomas KG, Thomas MR. Lack of interaction between sensing-intuitive learning styles and problem-first versus information-first instruction: A randomized crossover trial. Adv Health Sci Educ Theory Pract. 2009;14:79–90
23. Larsen DP, Butler AC, Lawson AL, Roediger HL 3rd. The importance of seeing the patient: Test-enhanced learning with standardized patients and written tests improves clinical application of knowledge. Adv Health Sci Educ Theory Pract. 2013;18:409–425
24. Larsen DP, Butler AC, Roediger HL 3rd. Repeated testing improves long-term retention relative to repeated study: A randomised controlled trial. Med Educ. 2009;43:1174–1181
25. Kusurkar RA, Croiset G, Mann KV, Custers E, Ten Cate O. Have motivation theories guided the development and reform of medical education curricula? A review of the literature. Acad Med. 2012;87:735–743

© 2014 by the Association of American Medical Colleges