The United States Medical Licensing Examination (USMLE) consists of four examinations that assess the ability of candidates for medical licensure to apply their knowledge of medical concepts and principles to the provision of effective patient care. Three of these examinations use multiple-choice questions (MCQs) consisting of patient vignettes to assess an examinee’s ability to interpret text descriptions of clinical findings and make clinical decisions. Although patient vignettes depict realistic cases, they are considered low-fidelity simulations because they rely primarily on text and static graphics (e.g., photographs).1
In 2007, the USMLE introduced high-fidelity simulations of heart sounds. The USMLE embedded the sounds in a portion of MCQs, such that both an audio recording of the heart sounds (as heard through a stethoscope) and an animated avatar (i.e., a drawing of a human) accompanied each multimedia test item. The avatar provides simulated respiration, including movement of the chest and pulsations from the carotid arteries, which can aid in interpreting the heart sounds or in discerning the patient’s overall condition. The examinee’s task is to place the stethoscope on one of four or six designated auscultation sites using the computer mouse, and then to listen to the heart sounds through headphones. The examinee is required to integrate the multimedia information with other patient data from the item’s text to answer the MCQ. The aim of the multimedia is to enhance fidelity by presenting auditory and visual information that is consistent with the presentation of that information in the clinical setting, thereby providing a more valid measure of an examinee’s skill at interpreting heart sounds.
Initial research on MCQs embedded with multimedia indicated that they exhibited slightly lower correlations with total test scores and required substantially more time for examinees to complete.2,3 Of particular interest was the finding that MCQs with multimedia heart sounds were more difficult for examinees than text-based MCQs. The USMLE typically represents test item difficulty as a p value, indicating the percentage of examinees who answer a question correctly (please note that here p is distinct from the P value used to indicate statistical significance). The 2009 study of 51 MCQs with heart sounds on the USMLE Step 2 Clinical Knowledge (CK) exam reported mean p values that were approximately 12% lower than p values for comparable text-based MCQs (e.g., a mean p value of 56% vs. 68%).2 When 43 different items were tested on the Step 1 exam, differences in p values between multimedia and text-based items were even larger (e.g., a mean p value of 52% vs. 74%).2 The study also compared the performance of graduates of United States and Canadian medical schools accredited by the Liaison Committee on Medical Education (LCME) with the performance of graduates of non-LCME-accredited international medical schools. The increase in difficulty for multimedia MCQs was comparable for the two groups of examinees.2
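The p value described above is the percentage of examinees who answer an item correctly. A minimal sketch of that calculation, assuming responses are scored 1 (correct) or 0 (incorrect); the function name and the sample responses are hypothetical and are not USMLE data:

```python
def item_p_value(responses):
    """Return item difficulty as the percentage of correct responses.

    `responses` is an iterable of 1/0 (or True/False) scores, one per examinee.
    """
    scores = list(responses)
    if not scores:
        raise ValueError("at least one response is required")
    n_correct = sum(1 for score in scores if score)
    return 100.0 * n_correct / len(scores)


# Hypothetical item answered correctly by 14 of 25 examinees.
sample = [1] * 14 + [0] * 11
print(item_p_value(sample))  # 56.0
```

A higher p value means an easier item, so a drop from 68% to 56% corresponds to a 12-percentage-point increase in difficulty.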
Findings from the initial (2009) study are somewhat disconcerting because they suggest that medical students may recognize textbook descriptions of abnormal heart sounds (e.g., “midsystolic click” for mitral valve prolapse) but may not be able to discern an abnormal heart sound as heard through a stethoscope. If assessment truly drives the curriculum,4 then it seems reasonable to expect that the inclusion of multimedia items on the USMLE would result in increased efforts not only by medical schools to teach cardiac auscultation but also by students to learn this new material.
Although the introduction of multimedia cardiac MCQs may motivate learning, it presents a psychometric challenge that could affect test scores and their interpretation. The ability to change test content and still measure examinee performance on a common scale depends on accurate item calibration, that is, the process of determining the difficulty parameter of each test item. A key assumption of the statistical models used for calibration is that item parameters remain invariant over time.5,6 Item parameter drift (IPD) occurs when items become systematically easier or more difficult over time. On the basis of our previous experiences with introducing new test content, we suspected that MCQs with multimedia heart sounds would become easier over time. If left undetected, IPD can result in underestimating or overestimating examinee proficiency; however, the presence of IPD will not compromise accurate measurement as long as one implements procedures to correct for it.
The main purpose of this study is to investigate changes in item difficulty of multimedia MCQs over their first six years of use on USMLE Step 2 CK. A secondary purpose is to evaluate changes in item difficulty as a function of LCME accreditation status. The previous finding that MCQs with multimedia heart sounds were equally more difficult for LCME and non-LCME examinees suggests that both groups of students have room to improve their auscultation skills. Importantly, unequal gains might occur if the groups have differing access to instructional materials. Thus, this study seeks to answer two research questions: (1) Are changes in p values for MCQs with multimedia heart sounds equal to those for other (primarily text-based) MCQs assessing similar content? (2) Are the changes in p values the same for graduates of LCME-accredited and non-LCME-accredited medical schools? The findings have important implications for item calibration and test scoring to the extent that they identify the presence of IPD for new test content. Furthermore, the results should be of interest to the medical education community in terms of documenting possible improvements in examinee cardiac auscultation skills over time.
The study group consisted of 233,157 first-time examinees who completed the USMLE Step 2 CK from 2007 through 2012 (after which a new testing interface was implemented). This sample included 136,387 graduates of LCME-accredited medical schools and 96,770 graduates of non-LCME-accredited international medical schools. We evaluated a total of 1,306 MCQs, each of which had, on average, responses from 750 examinees. The items of primary interest were 138 multimedia-based MCQs, each of which was accompanied by a recording of simulated heart sounds detected at multiple standard locations on the patient avatar’s torso. An additional set of 1,168 text-based cardiology MCQs that did not contain multimedia heart sounds served as a control set. Although the control items did not include multimedia heart sounds, some contained other graphics. We included the control set in the study because, conceivably, examinee performance may have improved on all cardiac MCQs during the six-year period, not just on the MCQs with multimedia heart sounds. This study received Institutional Review Board approval from the American Institutes for Research.
We used a regression model to analyze differences in item difficulty over time. We determined the change in difficulty of each MCQ by comparing its p value on its initial use (e.g., Year 0 = 2008) with its p value on the next use, which may not be the next test administration year (e.g., Year 1 = 2010, Year 2 = 2011). Next, we compared changes in p values for items with multimedia heart sounds with changes in p values for the control items. Because p values are constrained by floor and ceiling effects, we subjected them to a log transformation.7 We evaluated the change in the transformed p values between the year of first use (Year 0) and each subsequent year (Year 1, … Year 5) using the following regression model:
where: Δ log(p / (1 − p)) is the change in the log-transformed p value between the initial year and any subsequent year (Year 1 through Year 5); that is, log(p / (1 − p)) for the subsequent year minus log(p / (1 − p)) for the initial year. Δ log(p / (1 − p)) is the dependent variable.
The terms y, h, and L correspond to the independent variables, where:
y specifies the year the item was administered and ranges from 0 to 5;
h is an indicator of test item format (multimedia = 1; text based = 0); and
L is an indicator of LCME accreditation status (LCME-accredited = 1; non-LCME-accredited = 0).
The subscripted values of B correspond to the regression coefficients for each modeled effect, where:
B0 represents the overall change in the transformed p value for all cardiology items (multimedia and text based).
B1 represents the main effect associated with item format.
B2 represents the main effect associated with LCME accreditation status.
B3 corresponds to the interaction between the year and test item format; that is, the additional change in p value each year for multimedia items over and above any increase for traditional, text-based cardiology items. This effect corresponds to the primary research question.
B4 corresponds to the interaction between the year and LCME accreditation status; that is, the yearly change in p value for graduates of LCME-accredited schools over and above any change for graduates of non-LCME-accredited schools.
B5 corresponds to the three-way interaction between year, test item format, and LCME accreditation status. A significant coefficient would indicate that one LCME group exhibits more (or less) change over time on one of the item formats than the other group. This effect addresses the second research question.
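Read together, the term definitions above are consistent with a model of roughly the following form. This is a reconstruction under assumptions rather than the typeset equation: it treats B0 as the coefficient on year (matching its description as the overall change for all cardiology items), it adds an error term ε that the definitions do not name, and it omits any intercept because none is defined.

```latex
\Delta \log\!\left(\frac{p}{1-p}\right)
  = B_0\,y + B_1\,h + B_2\,L
  + B_3\,(y \times h) + B_4\,(y \times L)
  + B_5\,(y \times h \times L) + \varepsilon
```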
The model does not include a term for the two-way interaction between test item format and LCME accreditation status because preliminary analyses indicated collinearity between that effect and the three-way interaction. Because we were most interested in the three-way interaction associated with B5, we removed the two-way interaction for test item format and LCME accreditation status, which resolved the collinearity problem. For all statistical tests, we set the alpha level at .01 (i.e., we required P < .01 for statistical significance). We chose this conservative level of significance to adjust for the fact that the analysis required six statistical tests. We completed all analyses using IBM SPSS Version 23 (Armonk, New York).
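The transformation and predictors described in this section can be sketched as follows. This is an illustrative sketch, not the SPSS analysis itself. It assumes a natural logarithm, p values expressed as percentages, and a predictor order of year, item format, LCME status, and the three retained interactions. The function names and example values are hypothetical.

```python
import math


def logit(p_percent):
    """Log-odds of a p value given as a percentage strictly between 0 and 100."""
    p = p_percent / 100.0
    if not 0.0 < p < 1.0:
        raise ValueError("p value must be strictly between 0 and 100")
    return math.log(p / (1.0 - p))


def delta_logit(p_initial, p_later):
    """Change in transformed p value: later use minus initial use (Year 0)."""
    return logit(p_later) - logit(p_initial)


def design_row(y, h, L):
    """Predictors for one observation.

    y: years since first use (0 to 5)
    h: 1 for a multimedia item, 0 for a text-based item
    L: 1 for LCME-accredited, 0 for non-LCME-accredited
    The two-way h-by-L term is deliberately left out, as in the text.
    """
    return [y, h, L, y * h, y * L, y * h * L]


# The same 10-point gain is a larger change near the ceiling than mid-scale,
# which is the reason for transforming before modeling.
print(delta_logit(50.0, 60.0) < delta_logit(85.0, 95.0))  # True
print(design_row(2, 1, 1))  # [2, 1, 1, 2, 2, 2]
```

Rows built this way, paired with the corresponding `delta_logit` values, could be passed to any ordinary least squares routine.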
Table 1 presents the change in p value over time for the multimedia MCQs and the text-based cardiology items. To facilitate interpretation, the table shows p values prior to application of the log transformation. Table 1 indicates that the multimedia items exhibit a much larger positive change in p value over time than the text-based cardiology items (12.4% vs. 1.4%). Note that multimedia and text-based cardiology items are of approximately the same difficulty (i.e., p values in the 70s), which supports the use of the text-based cardiology items as a comparison set.
Figure 1 depicts the score gains in graphic form for examinees from LCME-accredited and non-LCME-accredited schools. Although the data in the figure suggest that the increase in p value is larger for multimedia items, any formal inferences should be based on transformed p values and statistical tests for the terms included in the regression model (see below).
Notably, there appears to be an inconsistency between Table 1 and Figure 1. Table 1 suggests a slight decrease in p value for the text-based cardiology items for Year 1 and Year 2, whereas Figure 1 shows a slight increase for both groups of examinees. This apparent anomaly is an example of Simpson’s paradox.8 It is due to a notable change in the proportion of examinees from non-LCME-accredited schools during that time period (2007–2008); aggregating the results across the two groups creates the appearance that p values declined. Figure 1 disentangles that effect.
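The mechanism behind this apparent inconsistency can be shown with a small worked example. The numbers below are hypothetical, chosen only to make the arithmetic easy, and are not taken from Table 1 or Figure 1. They assume that one group has a lower p value and that its share of examinees grows between the two years.

```python
def pooled_p(groups):
    """Pooled p value (%) from a list of (p_value_percent, n_examinees) pairs."""
    total = sum(n for _, n in groups)
    return sum(p * n for p, n in groups) / total


# (p value, number of examinees) for group A and group B; hypothetical values.
year0 = [(80.0, 700), (60.0, 300)]
year1 = [(81.0, 500), (61.0, 500)]  # both groups improve by one point

print(pooled_p(year0))  # 74.0
print(pooled_p(year1))  # 71.0
```

Each group's p value rises by one point, yet the pooled p value falls by three points because the lower-scoring group makes up a larger share of the second year's examinees.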
Table 2 summarizes the results of the regression analysis. Because the model involves repeated measures, we evaluated the Durbin–Watson statistic, which was 1.13. Values greater than 1 indicate that autocorrelation among the error terms is not a problem. The statistically significant regression coefficient for year indicates that the transformed p value of all MCQs increased over the six years (P < .001), regardless of item format. The significant regression coefficient for item format indicates that score gains, collapsed across years and LCME accreditation status, were larger for multimedia items (P < .001). This outcome partially addresses the first research question. Meanwhile, the nonsignificant effect for LCME accreditation status (P = .296) indicates that score gains, collapsed across all years and both item formats, were not different for U.S. and international graduates. Note that these main effects should not be interpreted without also considering the interaction effects. The two-way interaction of item format and year (B3) was statistically significant (P < .001), which confirms that increases in p values each year were greater for multimedia MCQs. This effect is illustrated in Figure 1 by the steeper slopes for the two lines corresponding to multimedia items. Finally, the three-way interaction (B5) approached, but did not reach, statistical significance (P = .028), indicating that the difference in the rate of score gains between multimedia and text-based items was similar for examinees from LCME-accredited (U.S.) and non-LCME-accredited (international) schools. This last outcome addresses the second research question.
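For readers unfamiliar with the Durbin–Watson statistic mentioned above, it is the sum of squared differences between successive residuals divided by the sum of squared residuals. It ranges from 0 to 4, with values near 2 indicating no first-order autocorrelation. A minimal sketch of the textbook formula, assuming residuals are supplied in their time order; the function name and the sample residuals are hypothetical and do not reproduce the study's value of 1.13:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for an ordered sequence of regression residuals."""
    e = list(residuals)
    if len(e) < 2:
        raise ValueError("at least two residuals are required")
    denominator = sum(value * value for value in e)
    if denominator == 0:
        raise ValueError("residuals must not all be zero")
    numerator = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    return numerator / denominator


# Residuals that alternate in sign (negative autocorrelation) push the value up.
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))  # 3.0
# Identical residuals (extreme positive autocorrelation) give the minimum.
print(durbin_watson([1.0, 1.0, 1.0, 1.0]))  # 0.0
```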
Previous studies have demonstrated that multimedia cardiac auscultation items that require students to listen to and interpret heart sounds are more difficult than items in which heart sounds are described by text.2,3 Results of the present study suggest that medical students have become more adept at interpreting auscultation findings and that MCQs embedded with multimedia have been getting progressively easier since their introduction on the USMLE in 2007. Specifically, while performance on text-based cardiology test content improved by about 1.4%, p values on multimedia cardiology items increased by 12.4%. This trend is apparent regardless of the type of medical school (LCME-accredited vs. non-LCME-accredited). For both groups, the rise in p values shows signs of leveling off, but additional research is needed to verify this trend.
One possible explanation for this positive trend in performance is that more examinees saw these multimedia items, and this greater exposure facilitated the memorization and sharing of those items with future examinees. To investigate this possibility, we tabulated the percentage of items used on multiple test forms (two forms, three forms, etc.) and found that the rates of exposure for multimedia items and text-based items were nearly identical, which indicates that this explanation is unlikely.
A second possible explanation for the gains in performance is that the multimedia items were more memorable because of their novelty, which may have allowed examinees who tested early to memorize important features and later provide clues to subsequent examinees (even though such conduct is prohibited and may constitute copyright violation). This dissemination might have occurred at a superficial level when, for example, an examinee later tells a friend something like, “One of the heart sound items was about a high school wrestler, and I am pretty sure the answer was mitral valve prolapse.” It also might have occurred at a deeper level if the heart sound itself is memorable and subsequently communicated. This latter explanation seems unlikely given the difficulty involved in accurately describing a simulated heart sound to future examinees; communicating text descriptions seems far easier.
A more likely reason for the improved performance on multimedia MCQs is the increased availability of formal and informal instruction on auscultation skills. Although teaching materials with heart sounds have existed for many years, technological advances that facilitate the reproduction of heart sounds have simplified production of educational cases, and material with recorded auscultation findings is available on various Internet sites. Examinees may be using these resources to practice listening to and recognizing heart sounds. Examinees also may be receiving increased formal instruction on auscultation skills at their medical schools. Although either explanation seems plausible, changes in formal instruction seem more likely to lead to the improved performance, particularly for graduates of LCME-accredited medical schools. Our findings do not show, however, performance differences between the graduates of LCME (U.S.) and non-LCME (international) schools. Given the finite number of heart sounds to learn, we would expect improved performance as medical students and the educational community begin to recognize that cardiac auscultation is a skill for which students will be held accountable.
The finding that performance improved on only an isolated subset of the cardiology domain (MCQs with multimedia heart sounds) has implications for test construction, scoring, and score interpretation for the assessments given by medical schools. First, the difficulty of new content should be considered during test construction; for example, tests should feature a balance of more difficult new items and easier, more familiar ones. Second, faculty should consider the impact of new content on pass–fail outcomes and consider whether adjusting the performance standard is appropriate. Third, programs that use psychometric models to place scores from different test forms onto the same scale should also employ techniques to detect and compensate for IPD.5,6 For example, the present results indicate that the introduction of multimedia cardiac items would make the first-year cohort appear less proficient in cardiology than cohorts from previous years. Likewise, cohorts in subsequent years would appear to be more proficient, even though their improved performance was limited to a small segment of the cardiology domain. Testing organizations typically apply strategies to detect such problems (e.g., a scatterplot of p values from two testing occasions, displacement analyses). Medical schools could use similar strategies to ensure that student scores maintain some level of comparability over time.
This study has a number of limitations. First, the study included a small sample of test items for the comparisons at Year 5. A second and related concern is that the study spanned only six years. Although examining whether improvement in performance levels off over longer periods would be interesting, the multimedia presentation of the USMLE has undergone further change, which would interfere with comparisons with the items we studied here. Third, the study focused on cardiac auscultation; whether similar results would be obtained for other skill domains is unknown. Fourth, unlike in the earlier studies,2,3 we did not include a comparison of MCQs matched on all characteristics except the use of multimedia versus text; however, this is not a serious limitation given the purpose of this study. Previous studies sought to determine the statistical equivalence of multimedia and text-based auscultation items with text-based items serving as the control. In contrast, we sought to determine whether multimedia MCQs became easier over time, and each item from the initial year (Year 0) served as its own control. Further, we included the text-based cardiology MCQs as an additional control. Finally, this study focused exclusively on item difficulty and whether examinees showed evidence of learning over time. Additional studies of response time and discrimination are needed to answer questions related to measurement efficiency, reliability, and test speededness. As technology-enhanced item formats become more common, research will be needed to determine the effects on student learning and item performance.
1. Dillon GF, Boulet JR, Hawkins RE, Swanson DB. Simulations in the United States Medical Licensing Examination. Qual Saf Health Care. 2004;13:41–45.
2. Holtzman KZ, Swanson DB, Ouyang W, Hussie K, Allbee K. Use of multimedia on the Step 1 and Step 2 Clinical Knowledge components of USMLE: A controlled trial of the impact on item characteristics. Acad Med. 2009;84(10 suppl):S90–S93.
3. Shen L, Li F, Wattleworth R, Filipetto F. The promise and challenge of including multimedia items in medical licensure examinations: Some insights from an empirical trial. Acad Med. 2010;85(10 suppl):S56–S59.
4. Shumway JM, Harden RM; Association for Medical Education in Europe. AMEE guide no. 25: The assessment of learning outcomes for the competent and reflective physician. Med Teach. 2003;25:569–584.
5. Babcock B, Albano AD. Rasch scale stability in the presence of item parameter and trait drift. Appl Psychol Meas. 2012;36:565–580.
6. Wells CS, Subkoviak MJ, Serlin RC. The effect of item parameter drift on examinee ability estimates. Appl Psychol Meas. 2002;26:77–87.
7. Cohen J, Cohen P, West SG, Aiken LS. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences. 3rd ed. Mahwah, NJ: Lawrence Erlbaum Associates; 2003.
8. Wagner CH. Simpson’s paradox in real life. Am Stat. 1982;36:46–48.