The practice of medicine requires competencies beyond medical knowledge. The challenge of incorporating this multifaceted notion of medical competencies into the osteopathic medical licensure process is determining how to measure other competencies in addition to osteopathic medical knowledge. Currently, other than a performance evaluation component, a large portion of osteopathic licensure examinations still consists of text- and static-image-based multiple-choice items. Such test items are essentially narrations of practice situations, not direct physician-patient encounters. Many skills and procedures have been “blind spots” in osteopathic medical licensure testing.
To explore the possibility of testing additional types of knowledge and skills, we recently conducted a trial using experimental items with heart sounds and videos. The objectives of this trial were to determine (1) whether multimedia items were able to test additional elements of medical knowledge and skills and (2) how to develop meaningful and effective multimedia items in the future.
This study assumed that multimedia items might test different constructs if they performed differently from their matched text items. An earlier study found that items using cardiovascular videos and items using cardiovascular prints were equivalent, although video subtests were easier and print subtests were slightly more reproducible.1 Another earlier study found that no reliability was lost in a health profession test by changing test items to interactive audiovisual items.2 A recent study, on the other hand, compared item characteristics between audio and text presentations of cardiac auscultation findings in medical licensure examinations and found that audio items were consistently and significantly harder and, at the same time, less discriminating.3 Some studies also suggest that certain types of innovative items may assess a different dimension of reasoning than typical written items.4 Most existing studies concentrated on group-level comparisons between the text format and the innovative format; detailed analysis at the level of matched item pairs was scarce. Although examining group-level differences was a necessary step in the analysis, the focus of this study was the comparison of individual pairs. We also combined the statistical findings with content analysis in order to understand why different types of multimedia items performed in certain ways.
This study was approved by the institutional review board of the Center for the Advancement of Healthcare Education and Delivery in Colorado Springs, Colorado. The trial was embedded in computer-delivered Level 3 of Comprehensive Osteopathic Medical Licensing Examinations (COMLEX) in the 2008-2009 testing cycle. We piloted two multimedia formats: heart sounds and video presentations. All sound clips were collected by clinicians from their daily practices. When a heart sound clip was used, the auscultatory site on the anterior chest wall where the sound was recorded was described in the item stem (e.g., second intercostal space on the right). Most video clips, other than three cardiac images, were short recordings, made with home camcorders, of physical examinations or physician-patient encounters portrayed by physicians and standardized patients.
The multimedia clips were sent to item writers, who were asked to write a pair of multiple-choice items for each clip: one using the media clip in the stem and the other without the clip but using text to narrate its content. Each item in a matched pair was to cover the same clinical topic, provide the same amount of clinical information (as much as possible), and have the same testing objective. The content and wording were also to be matched as closely as possible. The Level 3 Item Review Committee reviewed the paired items, including the media clips. The committee consisted of 14 osteopathic physicians who were current postgraduate educators from different regions of the country. The committee included at least one member from each major clinical discipline: surgery, obstetrics-gynecology, psychiatry, family medicine, pediatrics, internal medicine, and emergency medicine. The committee's overall responsibility was to conduct content review for newly written COMLEX Level 3 items. The committee's review of multimedia trial items focused on (1) the necessity of multimedia materials for the items, that is, whether the media materials would test what text items could not test, (2) the quality of the media, such as fidelity, clarity, length, and size, and (3) the appropriateness of the difficulty level of the media materials. The committee approved 23 sound clips and 21 video clips for inclusion in the pilot study.
The experimental items were embedded in multiple unscored test sessions. Each session contained three to five media items and one to five text items that were not paired with any media item in the same session. Each candidate received one randomly assigned unscored session. This design ensured that no candidate saw more than one item on any given topic. A total of 3,560 candidates took Level 3 during the 2008-2009 testing cycle. The gender distribution was 49.6% male and 50.4% female. English was the native language of 90.9% of the candidates. The majority (75.9%) of the group were white, 17.4% were Asian, 5.7% were African American, and 1% were other. The multimedia items had sample sizes ranging from 231 to 291, with an average of 253. Text items had sample sizes ranging from 222 to 291, with an average of 251. The proportions in gender, language, and race of text-item takers and multimedia-item takers were compared pair by pair using the chi-square test. Except for two pairs, no significant differences existed for any of the variables. The two exceptions differed significantly in the proportion of native English speakers. Neither of these pairs showed a significant difference in item difficulty or item discrimination in the subsequent logistic regression analysis.
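As a rough illustration of the pairwise demographic check described above, the following Python sketch computes a Pearson chi-square test for a 2 × 2 table (for example, item format by native-language status). The function name and the counts are hypothetical; the sketch applies no continuity correction, assumes no empty row or column, and does not cover the multi-category race comparison. The software used in the original analysis is not stated in the text.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) for a 2x2 table of counts.

    Returns the statistic and its p value on 1 degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts for one item pair:
# rows = text-item takers, multimedia-item takers
# columns = native English speakers, non-native English speakers
chi2, p = chi_square_2x2([[230, 21], [228, 25]])
print(f"chi-square = {chi2:.3f}, p = {p:.3f}")
```

Repeating such a test for each of the 44 pairs and each demographic variable would reproduce the structure of the comparison, although not necessarily its exact procedure.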
For each study item, three indices of item characteristics were calculated: item difficulty (the proportion of correct responses to the item, also called the item P value), item discrimination (the item-total biserial correlation, that is, the correlation between the individual item response and the reported total score), and item response time (after a log transformation). For all three indices, paired-samples t tests were used to test the overall difference between the group of 44 multimedia items and the group of 44 matched text items.
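The indices and the group-level test described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' actual computation: the function names are hypothetical, the biserial coefficient uses the common textbook formula with the population standard deviation of the total score, and each item is assumed to have at least one correct and one incorrect response.

```python
import math
from statistics import NormalDist, mean, pstdev


def item_difficulty(responses):
    """Proportion of correct (1) responses, i.e., the item P value."""
    return sum(responses) / len(responses)


def biserial(responses, totals):
    """Item-total biserial correlation for a 0/1 item and a continuous total score."""
    p = item_difficulty(responses)
    q = 1.0 - p
    m1 = mean(t for r, t in zip(responses, totals) if r == 1)
    m0 = mean(t for r, t in zip(responses, totals) if r == 0)
    sd = pstdev(totals)
    # Height of the standard normal density at the point that splits p from q.
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    return (m1 - m0) / sd * p * q / y


def paired_t(xs, ys):
    """Paired-samples t statistic and degrees of freedom for matched item indices."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    d_bar = mean(diffs)
    sd = math.sqrt(sum((d - d_bar) ** 2 for d in diffs) / (n - 1))
    return d_bar / (sd / math.sqrt(n)), n - 1


# Response times would be log-transformed before averaging, for example:
# log_times = [math.log(seconds) for seconds in response_times]
```

In this setup, `paired_t` would be called once per index, with the 44 multimedia values in `xs` and the 44 matched text values in `ys`.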
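A logistic regression approach was adopted for pairwise comparisons of examinees' responses to the matched items, as follows:
$$\ln\frac{P(r = 1)}{1 - P(r = 1)} = \beta_0 + \beta_1 g + \beta_2 \theta + \beta_3 (g \times \theta)$$
where r is the response to the item (1 for a correct response and 0 for an incorrect response).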
g is an indicator of item format (1 for multimedia items and 0 for text items).
θ is an equated ability score estimated by the Rasch model based on all the scored items in the Level 3 examination. In this model, θ served as a covariate controlling the effect of ability level on the performance of items.
β0 is the intercept parameter of the regression model; it represents the baseline log-odds (and hence the base probability) of a correct response to the text item when θ equals 0.
β1 is the group difference in performance on the item. A positive β1 indicates that a multimedia item was easier, and vice versa. In this sense, a significant β1 reflects a significant difference in item difficulty between the two matched items.
β2 is the coefficient of θ or the effect of person-ability, and it reflects the relationship between the person-ability and the probability of answering a text item correctly.
β3 is the coefficient for the interaction between item format in a matched pair and person-ability. A significant β3 would suggest that the relationship between person-ability and the likelihood of answering an item correctly differed significantly between the two formats. Conceptually, this reflects the difference in item discrimination between the two paired items. One advantage of this model was that it provided a statistical test of the difference between the two items' discriminations.
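To make the model concrete, the sketch below fits a logistic regression with the four terms defined above (intercept, format indicator g, ability θ, and their interaction) by Newton-Raphson, using only the Python standard library. It is an illustration under stated assumptions, not the authors' implementation: the function names `solve` and `fit_pair` are hypothetical, the data are assumed not to be perfectly separable, and the standard errors come from the inverse of the final information matrix, from which Wald z statistics (estimate divided by standard error) could be formed.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_pair(r, g, theta, iterations=25):
    """Fit logit P(r=1) = b0 + b1*g + b2*theta + b3*g*theta by Newton-Raphson.

    Returns the coefficient estimates [b0, b1, b2, b3] and their standard errors.
    """
    rows = [[1.0, gi, ti, gi * ti] for gi, ti in zip(g, theta)]
    beta = [0.0] * 4
    for _ in range(iterations):
        grad = [0.0] * 4
        hess = [[0.0] * 4 for _ in range(4)]
        for x, ri in zip(rows, r):
            eta = sum(b * v for b, v in zip(beta, x))
            p = 1.0 / (1.0 + math.exp(-eta))
            for j in range(4):
                grad[j] += (ri - p) * x[j]
                for k in range(4):
                    hess[j][k] += p * (1.0 - p) * x[j] * x[k]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    # Information matrix from the last iteration; at convergence it matches the final fit.
    se = [math.sqrt(solve(hess, [float(j == k) for k in range(4)])[j]) for j in range(4)]
    return beta, se
```

Here `r`, `g`, and `theta` would hold, for one matched pair, the 0/1 responses, the format indicator, and the Rasch ability estimates of all examinees who saw either item of the pair.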
Logistic regression is widely used to investigate differential item functioning, in which a single item is tested for possible bias, usually associated with certain demographic factors. In this study, the two groups were defined by whether examinees randomly received the multimedia or the text format of an item. Each matched pair was treated as a “single item.” Thus, any detected differential functioning was attributed to item format, since the two examinee groups for those pairs showed no sign of demographic differences.
After the analysis was completed, the Level 3 Item Review Committee analyzed the content of the pairs with significant differences in order to interpret the observed differences from two perspectives: (1) why the media changed the amount of information candidates needed to answer the questions and (2) whether the contexts changed so substantially that the two formats actually measured different constructs. This procedure provided more insight into the behavior of multimedia items.
Paired-samples t tests found that, overall, item discrimination was slightly lower for multimedia items than for text items (0.136 versus 0.150), but the difference was not significant. The mean item P value was slightly but significantly higher for multimedia items (.672 versus .617, t = 2.390, P < .05). The tests also found that examinees needed significantly more time to respond to the multimedia items (107.7 seconds versus 71.5 seconds, t = 5.81, P < .05).
Logistic regression for paired-item comparisons revealed mixed results. Of the 44 pairs, nine showed significant differences in difficulty, discrimination, or both, and 35 showed no difference in either. Of the nine pairs with significant differences, six differed significantly in item difficulty only: in three the multimedia item was more difficult, and in the other three it was easier. In two pairs, the multimedia item had significantly lower discrimination. One pair had significant β1 and β3, meaning that the items differed in both difficulty and discrimination. As Figure 1 shows, the response patterns were different for this pair of items. The multimedia item was not uniformly easier for candidates at different ability levels. Candidates whose ability score θ was below 1.1 were more likely to answer the multimedia item correctly than the text item. Candidates with an ability score θ higher than 1.1 were less likely to answer the multimedia item correctly than the text item. Although such an interaction between item format and ability is difficult to explain, it is valuable information: it implies that a particular multimedia item might test a different construct than its matched text item.
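The crossover described for this pair follows directly from the interaction model. As a short derivation (restating the model from the definitions of β0 through β3 given earlier; θ* is a label introduced here for the crossing point), the two formats have the same probability of a correct response where the format terms cancel:

```latex
\begin{align*}
\operatorname{logit} P(r = 1 \mid g, \theta)
  &= \beta_0 + \beta_1 g + \beta_2 \theta + \beta_3 g \theta \\
\operatorname{logit} P(r = 1 \mid g = 1, \theta) - \operatorname{logit} P(r = 1 \mid g = 0, \theta)
  &= \beta_1 + \beta_3 \theta \\
\beta_1 + \beta_3 \theta^{*} = 0 \quad
  &\Longrightarrow \quad \theta^{*} = -\frac{\beta_1}{\beta_3}
\end{align*}
```

A multimedia item that is easier below the crossing point and harder above it corresponds to β1 > 0 and β3 < 0, and a crossing near θ = 1.1 would imply β1 ≈ -1.1 × β3. The fitted coefficients are not reported in the text, so this serves only as a consistency check on the description of Figure 1.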
Significant pairs were submitted to the Level 3 Item Review Committee for further content analysis. The content experts believed that when the narrations of situations or movements in the text items were less direct, multimedia items could provide richer information by giving candidates a certain “feel” and “look,” thereby making the items easier to answer. For example, instead of a description of “70 degrees of hip flexion with the knee extended,” a video portraying the hip flexion would make the item easier. On the other hand, replacing standard textbook terminology with a sound or image could make media items more difficult. The reason is that media items require at least one more cognitive step than their matched text items in answering the same question: establishing a connection between the sound or image and the correct textbook terminology. If candidates were not able to recognize the sound in an audio clip as “a grade 3 holosystolic murmur,” they still could not answer the question correctly, even if they knew the implications of the murmur.
Interpretation of the differences of discrimination was more difficult. Although many factors could affect discrimination, the content experts most often felt that multimedia items might measure something different from what text items typically measure.
Multimedia items can add authenticity to the tests. Nevertheless, authenticity itself without additional measurement value does not provide adequate support for adding multimedia content to licensing tests. The significantly different statistical properties of some multimedia items in this study suggest that multimedia items behaved differently and might actually test something different. As this study showed, multimedia items could be either more difficult or easier. The differences in difficulty were likely related to the amount of information that the multimedia content added to or took away from the test items relative to their text-based counterparts. This trial also found that multimedia content could change examinees' response patterns for items on the same topic. Such differences can be interpreted as differences in item discrimination. Many factors can change item discrimination. While the introduction of multimedia content to text-dominated tests could unintentionally distort the intended construct, as when the multimedia stimulus fails to capture an essential element of the criterion situation or introduces irrelevant factors, it is equally possible that multimedia tasks measured important elements that the conventional tasks could not assess. From this perspective, this study provided some encouraging hints that multimedia content can test different elements of medical knowledge and skills.
Content analysis of the pairs with significant differences suggests that meaningful and effective multimedia items could be deliberately developed. The content experts found that, in order to develop effective multimedia items, the content of the media must be carefully selected. If the content can be described adequately by text, it probably would not be worthwhile to develop multimedia content to replace the text. If the content is difficult to describe adequately by text, a multimedia item would likely be easier and more discriminating. On the other hand, if the content is easily labeled by textbook terminology, replacement with multimedia material would likely make the item more difficult and less discriminating. At the same time, on the technical side, quality multimedia content requires a thoughtful design so that the length, size, volume, and visual and/or audio clarity do not become a distraction.
Analytically, this study focused on item analysis instead of group comparison. For the purposes of this study, this approach provided more insight into the behavior of multimedia items. The conventional logistic regression employed by this study proved to be a useful analytical approach for comparing multimedia and traditional items at item level. It did not directly test the differences of the indices of item difficulty and discrimination, but by modeling examinee response and ability for items of two formats, it effectively tested and demonstrated the difficulty and discrimination differences.
This study has a number of limitations. The number of multimedia-text item pairs was relatively small. Further, out of 44 pairs, only 9 had any type of significant difference. Practically, the majority of the media items were not effective. The group comparisons between the two format types were merely a necessary initial step in the analysis; the results of such group comparisons have limited meaning. Additional content analysis of the nonsignificant pairs is needed to gain more insight into the nature of noneffective multimedia items. Such knowledge would be equally valuable for future development of truly effective multimedia items. Multimedia items are new to examinees, and it will be necessary to survey examinees for their input. The current computer-delivered testing allows examinees to provide comments. A few examinees did comment on multimedia items, but for a study of fundamental importance, a formal and systematic survey is necessary.
Overall, the results of this study are promising in that multimedia items may be capable of measuring some constructs differently than what text items can measure. Further, effective multimedia items can be intentionally developed, and they can have reasonable psychometric properties.
The authors wish to extend their appreciation to the osteopathic physicians who assisted in the production and review of the audio and video materials used in this research.