The evaluation of educational technologies in medical education is becoming increasingly important as traditional methods of instruction are supplemented or replaced by these technologies. Educational technologies that simulate biologic processes, model clinical problem solving, or teach diagnostic decision making are being integrated into basic science courses and clinical clerkships, as was shown in a recent survey of 125 medical schools.1 Furthermore, the World Wide Web has accelerated the development of educational resources in nearly every medical domain (see, for example, Martindale's Health Science Guide, 〈http://www-sci.lib.uci.edu/HSG/HSGuide.html〉, or Organizing Medical Networked Information, 〈http://omni.ac.uk〉).
Several authors have called for new approaches to evaluating computer-based instruction that extend the research focus beyond comparative media studies,2,3 which examine the instructional quality of computer-based versus traditional instructional methods on the basis of students' performances on selected outcome measures. New approaches to evaluating computer-based learning recognize the need to expand our understanding about the types of learning that occur while students interact with educational software4 and the effects of computer-based design features on students' learning.5 Typically, evaluation studies that find computer-based instruction effective base their claims on students' cognitive gains as measured by pre- and post-tests. Few studies analyze the conceptual difficulties students experience in a domain of knowledge covered by the computer program and explore the ways the design of educational software may reduce those difficulties. In a study by Pradhan and Dev,4 first-year medical students used a computer program in neuroanatomy, Brain Storm™, that helped them form biomedical concepts. Based on written and verbal responses to a questionnaire, as well as students' pathways tracked by the program, the authors found that the students' approaches to learning and using new information did not change with the use of the computer program. However, patterns of incorrect answers pointed to specific conceptual errors the students made. Based on this result, the authors urged designers of computer-based instruction to identify the common conceptual errors students commit in a particular domain and redesign programs to improve learning.
Studies that find computer-based learning to be superior to traditional formats often fail to explore design components that affect learning in a domain.2 This has led several researchers to examine the instructional value of specific design features based on how students use the features and how the usage patterns correlate with students' performances. Gurushanthaiah, Weinger, and Englund6 showed how three different visual-display formats of a computer clinical simulator affected the ability of anesthesiologists to detect acute physiologic changes. Based on the patterns of subjects' responses and accuracy rates collected while participants used the program, the study found that graphic displays were more effective than were numeric displays in signal-detection tasks. The authors discussed how their study could lead to the development and implementation of effective visual displays in computer simulators in anesthesiology. In a recent study of usage patterns of Web-based applications in a basic science course,7 students' responses to a survey regarding their use of the courseware were correlated with final examination scores as well as with server log files that recorded the numbers and hours of logins to the programs. The study found that the upper third of the class logged in more frequently to the Web forum compared with the lower third of the class. The usage patterns extracted from the server data helped the authors identify computer-based tools that were frequently used by students, a finding that could enhance the future development of those instructional tools.
This study explored three questions concerning computer-based instruction:
* Students' errors: what types of conceptual errors emerge across all learning sections?
* Interface design: what areas of interface design (e.g., navigation, presentation of examples, text, interactive features) can be modified to improve learning?
* Usage patterns: what do navigational pathways suggest about how students use an interactive feature that facilitates image comparisons?
Since 1992, the University of Washington School of Medicine's Department of Laboratory Medicine has been developing computer tutorials that teach the interpretation of image-based laboratory tests to medical students, medical technology students, physicians, and other health care workers.8 One of the tutorials, Urinalysis-Tutor (UAT), is a CD-ROM designed to teach the microscopic interpretation of urinary sediment structures. (For more information about UAT, visit 〈http://www.labmed.washington.edu/tutor/products/prod5/default.asp〉.) The microscopic examination of urinary sediment structures is one of the most important microscope-based clinical laboratory tests, along with the microscopic examinations of Gram-stained specimens and peripheral blood smears. Because urine samples can contain signs of disease, it is important that abnormal sediment structures and their clinical implications are correctly interpreted. UAT teaches the visual attributes and clinical implications of urinary sediment structures based on digitized microscopic images, descriptive text, simulated microscopic techniques, and other interactive features. A total of 31 concepts of urinary sediment structures plus four diseases are covered in sections dedicated to cells, casts, crystals, and organisms/artifacts.
In the UAT program, each concept is presented with a representative image (best example) of a sediment structure and two to three other example images. Example images illustrate variations in the shapes and sizes of sediment structures to help students generalize an instance to the same conceptual category and discriminate an example as an instance of a different category. Simulated microscopic techniques enable students to move the microscope stage or enhance visual identification and discrimination of sediment structures by using phase-contrast or polarization microscopy. These features are intended to help users focus on the relevant attributes necessary for the successful identification of sediment structures. Learners' attention is directed to the relevant visual attributes by a “highlight button” that draws outlines around sediment structures. A “focus button” is provided for making sediment structures visually distinct, and a “split-screen” feature allows students to select two images from the same or different categories of sediment structures and display the images side by side. The latter feature is considered instructionally useful because: (1) comparing and contrasting examples helps students view similarities and differences in one glance; and (2) comparing and contrasting images simultaneously provides multiple views of a concept, leading to reduction in memory overload.
We evaluated UAT in three phases between 1996 and 1998.
A total of 312 second-year medical students at the University of Washington School of Medicine used UAT in 1996 (n = 148) and 1997 (n = 164) as a requirement in the urinary system course. Students completed the UAT, including the pre- and post-tests embedded in the program. Given the number of students in each group, we had an estimated statistical power of 0.99 to detect differences of a medium effect size at p = .05, using Cohen's method for power analysis.9 The matriculation data (undergraduate GPAs, MCAT scores, and numbers of hours in science courses) of the two cohorts did not differ.
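The power figure above can be approximated with a short calculation. The following is a minimal sketch, assuming a two-sided, two-sample comparison of means and a normal approximation; the function name and these assumptions are illustrative, and Cohen's tables may give slightly different values.

```python
from statistics import NormalDist


def two_sample_power(d, n1, n2, alpha=0.05):
    """Approximate power to detect a standardized mean difference d
    between two independent groups (normal approximation, two-sided)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Noncentrality for two independent groups of sizes n1 and n2.
    noncentrality = d * (n1 * n2 / (n1 + n2)) ** 0.5
    # The contribution of the opposite tail is negligible and is ignored.
    return z.cdf(noncentrality - z_crit)


# Medium effect size (d = 0.5) with the 1996 and 1997 cohort sizes.
power_phase1 = two_sample_power(0.5, 148, 164)
print(round(power_phase1, 2))
```

Under these assumptions the result rounds to 0.99, in line with the reported estimate.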
Students' identification data and test scores were collected over the network server and recorded in a database. In addition, we conducted an observational study of four students in the spring of 1997 to learn whether students viewed all examples of urinary sediment structures and whether they used the visual discrimination features, such as phase-contrast and polarization microscopy.
Gain scores between pre- and post-tests were calculated. Effect sizes were also derived to examine whether learning in both years was educationally significant as a result of students' interactions with UAT. The guidelines suggested by Cohen9 were followed in interpreting the meaning of the effect sizes: d = 0.2 (small effect), d = 0.5 (medium effect), and d = 0.8 (large effect). We conducted analyses of students' errors on individual urinary sediment structures based on the percentage of students who failed to choose a concept when it was a correct item (omission) and the percentage of students who chose a concept when it was a distractor (commission).
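The omission and commission percentages described above could be tallied as in the following sketch. The response format (one tuple per student answer, holding the correct concept, the chosen concept, and the options offered) and the sample data are hypothetical; the article does not describe how the database stored responses.

```python
from collections import defaultdict


def error_rates(responses):
    """Return (omission, commission) percentages per concept.

    responses: iterable of (correct, chosen, options) tuples
    (hypothetical format, one tuple per student answer).
    """
    omit = defaultdict(lambda: [0, 0])    # concept -> [errors, opportunities]
    commit = defaultdict(lambda: [0, 0])
    for correct, chosen, options in responses:
        # Omission: the concept was correct but was not chosen.
        omit[correct][1] += 1
        if chosen != correct:
            omit[correct][0] += 1
        # Commission: the concept was a distractor but was chosen.
        for option in options:
            if option != correct:
                commit[option][1] += 1
                if chosen == option:
                    commit[option][0] += 1

    def pct(table):
        return {c: 100.0 * e / n for c, (e, n) in table.items()}

    return pct(omit), pct(commit)


# Hypothetical answers to one item whose correct concept is "renal".
opts = ("renal", "squamous", "transitional")
sample = [
    ("renal", "squamous", opts),
    ("renal", "renal", opts),
    ("renal", "transitional", opts),
    ("renal", "squamous", opts),
]
omission, commission = error_rates(sample)
print(omission["renal"], commission["squamous"], commission["transitional"])
```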
Based on results obtained during Phase 1, UAT was revised in an attempt to improve its interface design elements, which we hypothesized had affected students' learning and program use. The revised design elements were screen layout, navigational tools, order of examples, organization of text, and interactive features. The instructional materials from Phase 1, including text and images, were retained in the modified program. The modification of the interface design was guided by instructional design principles pertaining to visual learning and concept acquisition. Two authors of the original program, who were content experts in laboratory medicine, reviewed each area of modification.
In this phase of the study, results from 148 students who used the original version of UAT in 1996 were compared with results from 154 students who used the revised version of UAT in the spring of 1998 (the experimental group). Cohen's method of power analysis showed that we had 0.99 power to detect differences of a medium effect size at p = .05.9 The two cohorts were again similar in their demographic backgrounds and academic scores and preparation at matriculation. The 1996 cohort was used in the analysis because the order of their pre- and post-tests was identical to that of those administered in 1998.
As in Phase 1, students in the 1998 cohort used UAT installed at the library. In addition to students' identification and test scores, data about their navigation through the program were collected from individual students and recorded in the database. These data included the time a student opened and closed a screen and identification numbers of images that each student viewed. During this phase, observations of eight students were made while they used the program.
The same types of data analyses were conducted as in Phase 1. Navigational data were analyzed for the 1998 cohort to examine the patterns of using an interactive feature that facilitated image comparisons.
The results of the three evaluation phases are presented below.
The KR-20 reliability measuring the internal consistency of the tests was 0.52, which is considered to be moderately low.10
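For readers unfamiliar with the statistic, KR-20 for dichotomously scored items is k/(k − 1) × (1 − Σ p_i q_i / σ²), where k is the number of items, p_i is the proportion answering item i correctly, q_i = 1 − p_i, and σ² is the variance of total scores. The sketch below uses the population variance (some texts use the sample variance) and a small hypothetical score matrix, not the study's data.

```python
from statistics import pvariance


def kr20(scores):
    """KR-20 for a list of per-student lists of 0/1 item scores."""
    n = len(scores)
    k = len(scores[0])
    totals = [sum(row) for row in scores]
    var_total = pvariance(totals)  # population variance of total scores
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in scores) / n  # proportion correct on item i
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical responses: five students, four items.
example = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
reliability = kr20(example)
print(round(reliability, 2))
```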
Students' performances. Students in the 1996 cohort scored 32.3 (SD = 12.4) on the pre-test and 70.8 (SD = 12.6) on the post-test. In comparison, students in the 1997 cohort scored 41.6 (SD = 10.3) on the pre-test and 70.3 (SD = 13.3) on the post-test. The 1996 cohort attained a gain score of 38.5 between the pre- and post-tests and the 1997 cohort attained 28.7. The comparison of pre- and post-test mean scores using a t-test showed that the mean difference within each cohort was statistically significant (p < .001). The effect sizes for both cohorts were large: the 1996 cohort's was 3.08 and the 1997 cohort's was 2.43.9
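The text does not state the formula used for the effect sizes. As an inference only, the reported values are consistent with dividing each gain score by the mean of the pre- and post-test standard deviations, which the following sketch illustrates; the function name is illustrative.

```python
def gain_effect_size(pre_mean, pre_sd, post_mean, post_sd):
    """Gain score divided by the mean of the pre- and post-test SDs.

    Assumption: this standardizer is inferred from the reported numbers,
    not taken from a stated method.
    """
    gain = post_mean - pre_mean
    return gain / ((pre_sd + post_sd) / 2)


# (pre mean, pre SD, post mean, post SD) as reported for each cohort.
cohorts = {
    1996: (32.3, 12.4, 70.8, 12.6),
    1997: (41.6, 10.3, 70.3, 13.3),
}
effect_sizes = {year: round(gain_effect_size(*v), 2) for year, v in cohorts.items()}
print(effect_sizes)
```

Both values are well above Cohen's d = 0.8 threshold for a large effect.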
The analyses of students' errors showed that students committed conceptual errors in 11 subject areas. For example, more than 70% of the students failed to choose the renal epithelial cell as the correct item on the pre-test. On the post-test, approximately 50% of the students had difficulty identifying the renal epithelial cell when it was the correct item. Students tended to select one of the other types of epithelial cells (squamous or transitional) instead. These three types of epithelial cells were visually similar and, therefore, difficult to distinguish from one another. Because the identification and discrimination of epithelial cells can have significant clinical implications, this result raised concerns. Based on students' errors in identifying 11 urinary sediment structures, we generated the possible sources of errors by hypothesizing how students might have used the UAT.
Observations of students. Observations of four students while they used UAT showed that students viewed all examples accompanying the concepts, and, in most cases, they viewed the same examples more than once. The students also used all the visual features: animation, the highlight buttons, microscope focus, polarization microscopy, and phase-contrast microscopy. The students, however, overlooked the split-screen feature in the Image Atlas section and proceeded directly to the post-test. In addition, the students reported that it was not easy to navigate between sections and that they had to retrace several steps to return to the main menu.
The modification of the UAT interface was based on students' comments and hypotheses about why students did not correctly identify selected urinary sediment structures. The revision involved designing a content map, improving navigational tools, revising text, reordering examples, and redesigning the split-screen feature.
A content map was designed and made available to users throughout the program. The added content map was intended to more clearly show both the relationships between and among concepts, and the amount of material students needed to cover in each learning section.11
Three modifications were made in the UAT to improve the learners' navigation in the program: (1) horizontal and vertical bars were added, the former displaying concepts and the latter examples of concepts (see Figure 1), to make navigation more efficient; (2) text command buttons were revised with simple picture icons to reduce confusion; and (3) a “main menu” button was added for students to return to the main menu without taking extraneous steps.
The layout and organization of the text were revised to shorten it, reorganize it under headers (such as “appearance,” “clinical implications,” “can be confused with”), and emphasize the visual cues necessary for identifying and discriminating sediment structures.12 The epithelial-cell section was radically redesigned. Previously this section had consisted of one screen containing all the relevant information for distinguishing the three main cell types, squamous epithelial cells, transitional epithelial cells, and renal epithelial cells. Using the same instructional materials, three separate screens for each of the cells were created in the new version, highlighting the visual attributes using short headers.
Examples of selected urinary sediment structures were placed in an order according to the degrees of their resemblance to the representative images, as judged by the content experts.13 Where appropriate, images of visually similar sediment structures were contrasted in a selected learning section. In addition, command buttons for simulated microscopic techniques and interactive features were placed directly under each example to draw the learner's attention.
The split-screen feature of the Image Atlas in the original version of UAT was replaced by the “compare-and-contrast” feature to address problems seen during the previous phase. Students were not likely to use the feature because it was not part of the main learning module and the original version did not provide a meaningful structure for image comparisons. In the revised program, compare-and-contrast was integrated in three learning sections, making available only the images pertaining to each section.12 As shown in Figure 2, the new interface included image panels and a list of image names in one screen to reduce extra steps in making multiple comparisons.
In consultation with a measurement specialist, two steps were taken to improve the reliability of the instrument. First, test items that did not discriminate well between high and low scorers were deleted based on the item discrimination index of .20. Second, a new scoring method, which treated multiple-response items as individual questions, increased the number of the test items. As a result, the reliability of the outcome measure (KR-20) improved from 0.52 to 0.83.
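The first step can be illustrated with the classical item discrimination index, D = p_upper − p_lower, the difference in the proportion answering an item correctly between high and low scorers. The sketch below assumes the conventional upper and lower 27% groups and hypothetical data; the article does not state which grouping the authors used.

```python
def discrimination_indices(scores, fraction=0.27):
    """Return D = p_upper - p_lower for each item.

    scores: list of per-student lists of 0/1 item scores.
    fraction: share of students in each extreme group (27% is a
    common convention, assumed here).
    """
    ranked = sorted(scores, key=sum, reverse=True)
    g = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:g], ranked[-g:]
    k = len(scores[0])
    return [
        (sum(row[i] for row in upper) - sum(row[i] for row in lower)) / g
        for i in range(k)
    ]


def items_to_keep(scores, threshold=0.20):
    """Indices of items whose discrimination index meets the threshold."""
    return [i for i, d in enumerate(discrimination_indices(scores)) if d >= threshold]


# Hypothetical data: ten students, three items; the third item is
# answered correctly by high and low scorers alike.
data = [[1, 1, 1]] * 3 + [[1, 1, 0]] * 4 + [[0, 0, 1]] * 3
indices = discrimination_indices(data)
kept = items_to_keep(data)
print(indices, kept)
```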
Students' performances. Using the revised scoring method, students in the 1996 cohort scored 65.3 (SD = 7.6) on the pre-test and 85.7 (SD = 8.0) on the post-test, and students in the 1998 cohort scored 67.4 (SD = 7.9) on the pre-test and 85.0 (SD = 9.0) on the post-test (1996 gain score = 20.4 and 1998 gain score = 17.6). The ANCOVA result (F (1,299) = 2.49, p > .05) showed that the difference in the post-test scores between the two groups was not statistically significant when controlling for their pre-test scores. The effect sizes for both years were large: 2.62 for the 1996 cohort and 2.08 for the 1998 cohort.9 We compared error rates for 11 concepts between the cohorts. Table 1 lists percentages of students committing errors in both cohorts, differences between the cohorts, and statistical significance levels.
The conceptual errors by the 1998 cohort were reduced in six areas compared with the 1996 cohort (as denoted by negative figures). When adjusted for multiple comparisons within each sediment category, statistical significance was found in two areas, squamous cell (χ² = 6.3, p < .012) and ammonium biurate crystal (χ² = 6.1, p < .01).
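A between-cohort comparison of error rates for a single concept can be sketched as a Pearson chi-square test on a 2 × 2 table. The error counts below are hypothetical (only the cohort sizes come from the study), no continuity correction is applied, and the adjustment for multiple comparisons is not shown because the method used is not described.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (df = 1, no continuity correction) for the table
    [[a, b], [c, d]]; returns (statistic, p_value)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For df = 1 the upper-tail probability equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows are cohorts, columns are (error, no error).
# 1996: 60 of 148 students erred; 1998: 40 of 154 students erred.
chi2, p = chi_square_2x2(60, 88, 40, 114)
print(round(chi2, 2), round(p, 4))
```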
Usage patterns of compare-and-contrast. The server recorded navigational pathways as students used the program. These data were analyzed with a focus on whether students used the compare-and-contrast feature and whether their image comparison patterns correlated with their performances. A detailed description of the results has been published previously.14 Some of the key findings were that 106 students in the 1998 cohort (69% of the cohort) who used the feature at least once viewed 23 images in 121 seconds on average. Three predominant patterns of image comparison emerged: (1) 22% of the students viewed single images without making any comparisons (single viewing); (2) 11% of the students viewed pairs of images by selecting two images and replacing them with a new pair of images (pair viewing); and (3) 41% of the students viewed paired images, but retained the same image in one image panel and selected new images in the other panel (anchored viewing). Figure 3 compares sectional test scores for cells, casts, and crystals across students who engaged in single or anchored viewing. These two groups of students were compared with students who did not use the feature at all. The pair-viewing group was not included because of the small number of students who viewed images in this mode.
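The three viewing patterns can be made concrete with a small classifier over navigational data. This is a sketch under stated assumptions: the log format (an ordered list of the image identifiers shown in the two panels, with None for an empty panel), the labels, and the majority rule are hypothetical and are not the procedure reported in the earlier publication.

```python
def classify_viewing(states):
    """Label one student's use of the comparison feature.

    states: ordered list of (left, right) image ids; None marks an
    empty panel (hypothetical log format).
    """
    if not states:
        return "none"  # the student never opened the feature
    if all(left is None or right is None for left, right in states):
        return "single"  # images viewed one at a time, no comparisons
    paired = [s for s in states if None not in s]
    anchored = pair = 0
    for (l0, r0), (l1, r1) in zip(paired, paired[1:]):
        changed = (l0 != l1) + (r0 != r1)
        if changed == 1:
            anchored += 1  # one panel kept, the other replaced
        elif changed == 2:
            pair += 1      # both panels replaced with a new pair
    if anchored > pair:
        return "anchored"
    if pair > anchored:
        return "pair"
    return "unclassified"  # ties, or a single pair with no transitions


single_log = [("cast_1", None), ("cast_2", None)]
anchored_log = [("cast_1", "cast_2"), ("cast_1", "cast_3"), ("cast_1", "cast_4")]
pair_log = [("cast_1", "cast_2"), ("cast_3", "cast_4")]
print(classify_viewing(single_log), classify_viewing(anchored_log), classify_viewing(pair_log))
```

A majority rule is only one possible design choice; a student who mixes modes evenly is left unclassified rather than forced into a category.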
The post-test scores of the anchored-viewing group were the highest. Medium effect sizes (d = 0.5) were detected between the group that did not use the feature and the anchored-viewing group for casts and crystals, as well as between the single-viewing and anchored-viewing groups for casts. This result suggests that viewing cast and crystal images in an anchored mode led to educationally significant differences in test performance. All eight students who were observed used the compare-and-contrast feature and reported that the feature helped them make multiple comparisons of similar sediment structures.
We examined students' learning before and after revising an educational software program, Urinalysis-Tutor, and explored patterns of how students compared images of urinary sediment structures using an interactive feature. Phase 1 of the study, using the original version of the program, showed that students in two cohorts did not correctly identify sediment structures that were visually similar, particularly in 11 conceptual areas. After a critical review of the interface design based on the visual learning and concept acquisition literature, the software was revised. Comparison of the overall performances of the two cohorts of students who used the original and revised program showed little difference between the cohorts. Error analysis focusing on the 11 conceptual areas showed that reductions in errors were observed for six of 11 concepts, with statistical significance found for only two concepts. Navigational data collected from the 1998 cohort on the students' use of an interactive feature for image comparisons illustrated that students who viewed images in a particular mode (anchored viewing) attained the highest test scores in three learning sections.
The lack of a significant improvement in performance by the 1998 cohort has several plausible explanations. First, the revision of the program might have affected the way students in the 1998 cohort allocated their time and attention compared with the previous cohort. For example, the split-screen feature that facilitated image comparisons in the original design was available outside the main learning section. In the revised version, we made this feature available in three sections within the main learning module. Our intent was to help students attain concepts of visually similar sediment structures by comparing examples as they learned about the critical attributes of these concepts. Our data showed that students spent 43 seconds in the cell section comparing images, 54 seconds in the cast section, and 63 seconds in the crystal section. The increase in the amount of time students spent as they progressed through the main learning section suggests that students found the feature to be useful. However, in the absence of comparable data for the previous cohort, it is difficult to speculate how different patterns of time allocation might have affected students' learning as a result of this interface change.
The second explanation points to a constraint we experienced in evaluating computer-based instructional materials in a real-world context. Many interface changes were made in the program, including changes in text organization, order of examples, and navigational structure. We were not able to test the effectiveness of individual interface changes, however, given the evaluation setting involving cohorts of medical students who used the program as a part of their course requirement. In addition, there were other instructional variables we did not revise in the modified program, such as the quality of the digitized images.
While our study cannot point to specific design components that facilitated or hindered learning, we have shown a potential benefit of linking usage-pattern data and performance. We plan to explore in future studies the design factors that affect usage patterns and performances based on navigational data collected while students interact with software programs.
We fully support the points made by Adler and Johnson,15 who stressed the importance of conducting evaluation studies that compare different computer-based interfaces delivering the same content. Our study attempted to present an evaluation model that targets both improvement of software design and student learning.