In November 1999 the USMLE Step 3 was implemented as a computer-administered examination that includes both MCQs and computer-based case simulations (CCSs) developed to assess physicians' patient-management skills. The CCS format has been described in detail previously.1 Briefly, CCSs produce a simulation of the patient-management environment. Each case begins with an opening scenario describing the patient's location and presentation. Using free-text entry, the examinee then orders tests, treatments, and consultations while advancing the case through simulated time. The system recognizes over 12,000 abbreviations, brand names, and other terms that represent more than 2,500 unique actions. Within the dynamic simulation framework, the patient's condition changes based both on the actions taken by the examinee and on the underlying problem. The simulations are scored using a computer-automated algorithm designed to approximate the score that would have been produced if the examinee's performance had been reviewed and rated by a group of expert clinicians.2
CCSs are the result of a major development effort on the part of the National Board of Medical Examiners, and operational use of the CCS represents the culmination of 30 years of research. Throughout this effort, it was believed that this format had the potential to add a new dimension to physician licensure assessment. The purpose of the present research was to begin to assess the validity of that belief by examining the relationship between the scores produced by CCSs and those produced with MCQs. Two types of results are presented. First, multivariate generalizability analysis was used to examine the generalizability of the individual components of the test (CCS and MCQ), as well as the relationship between the proficiencies measured by those components (i.e., the true-score correlations termed universe-score correlations in generalizability theory). The resulting true-score correlation and reliability estimates were then used to examine the extent to which the addition of CCS improves measurement of the proficiency of interest, when that proficiency is defined as a weighted composite of the true scores measured by CCSs and MCQs. This latter assessment is of particular interest because, as Wainer and Thissen3 have pointed out, when a constructed-response item format has relatively low reliability and a reasonably high correlation with the proficiency measured by MCQs, allocating testing time to the constructed-response items may be counterproductive, even when the true score of interest is that measured by the constructed-response items. In this context, it becomes a matter of interest to know how much (if any) of the available testing time should be devoted to CCSs.
In its current configuration, USMLE Step 3 is a two-day examination. Each test form includes nine CCS cases and 500 MCQs. Multiple forms are used to enhance test security, and each form is constructed to meet complex content specifications with respect to sampling of MCQs and case simulations. The simulations are completed in the afternoon of the second day, with a maximum of 25 minutes of testing time allocated for each case. The current study is based on examinee responses collected during the first year of implementation of the computerized Step 3.
The first step in this research was to estimate variance and covariance components within a multivariate generalizability theory framework. For these estimates, data were sampled to produce a crossed design in which each examinee completed the same four CCS cases and the same 180 MCQs. (Sampling was used both to produce a crossed design and because of software limitations.) This design was replicated ten times with a sample of 250 examinees for each replication. The individual data sets were then analyzed using the mGENOVA software.4,5 For this analysis, item format (CCS and MCQ) was considered a fixed (multivariate) facet. Within each fixed facet, items were crossed with examinees. The ten replications provided a more stable (mean) estimate of these components. Variability across replications also provided a basis for estimating the standard error of the mean variance and covariance components. The multivariate analysis also produces an estimate of the universe score correlation between the MCQ items and the CCS cases.
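The variance-component estimation for a single fixed facet of the crossed design can be sketched in code. The following is a minimal illustration, not mGENOVA: it recovers the components of a p × i (persons crossed with items) design from the ANOVA expected mean squares. The simulated sample sizes and component magnitudes are hypothetical.

```python
# Illustrative sketch (not mGENOVA): variance components for a single
# p x i crossed design, estimated from ANOVA expected mean squares.
import numpy as np

def variance_components(scores):
    """scores: persons x items matrix for a fully crossed p x i design."""
    n_p, n_i = scores.shape
    grand = scores.mean()
    person_means = scores.mean(axis=1)
    item_means = scores.mean(axis=0)

    ms_p = n_i * np.sum((person_means - grand) ** 2) / (n_p - 1)
    ms_i = n_p * np.sum((item_means - grand) ** 2) / (n_i - 1)
    resid = scores - person_means[:, None] - item_means[None, :] + grand
    ms_res = np.sum(resid ** 2) / ((n_p - 1) * (n_i - 1))

    # Expected-mean-square solutions for the p x i design
    var_pi = ms_res                   # person-by-item interaction (with error)
    var_p = (ms_p - ms_res) / n_i     # universe-score (person) variance
    var_i = (ms_i - ms_res) / n_p     # item-difficulty variance
    return var_p, var_i, var_pi

# Hypothetical simulated data: 250 persons x 4 tasks, as in one replication
rng = np.random.default_rng(0)
true_p = rng.normal(0.0, 1.0, size=(250, 1))
diff_i = rng.normal(0.0, 0.5, size=(1, 4))
scores = true_p + diff_i + rng.normal(0.0, 1.5, size=(250, 4))
var_p, var_i, var_pi = variance_components(scores)
```

In the full multivariate analysis, the person components for the two fixed facets would additionally yield the universe-score covariance; this sketch covers only the univariate pieces.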
Within a classical test-theory framework, test reliability is commonly estimated as the correlation between scores on parallel forms of a test. The square root of that value is an estimate of the correlation between the observed test score and the true score measured by the test. For the design used in this study (in which, within each fixed facet, examinees were crossed with items), both classical test theory and generalizability theory provide a basis for estimating the reliability of the test as a function of the number of items on the test. With these values and the correlation between the true scores measured by the MCQs and the CCS cases, it is possible to estimate the correlation between the observed composite score for a test containing varying numbers of MCQs and CCS cases and a true-score composite defined as a weighted sum of the true scores from the two formats. This correlation coefficient provides an index of the precision of measurement for tests constructed to produce a composite score.
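The dependence of reliability on test length can be made concrete. The sketch below uses hypothetical variance components (not the Table 1 values) to compute the generalizability coefficient of a p × i design at several test lengths, together with its square root, which estimates the correlation between observed and true scores.

```python
# Illustrative sketch: reliability (generalizability) as a function of
# test length for a p x i design, and the index of reliability.
# The variance components below are hypothetical, not the Table 1 values.
import math

def g_coefficient(var_p, var_pi, n_items):
    """E-rho^2 for a test of n_items (relative error, p x i design)."""
    return var_p / (var_p + var_pi / n_items)

var_p, var_pi = 1.0, 6.0   # hypothetical universe-score / interaction variance
for n in (4, 9, 18):
    rel = g_coefficient(var_p, var_pi, n)
    # sqrt(reliability) estimates the observed-score/true-score correlation
    print(n, round(rel, 3), round(math.sqrt(rel), 3))
```

Doubling the number of items shrinks the error term var_pi / n but leaves the universe-score variance untouched, which is why reliability rises with length at a diminishing rate.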
When tests are constructed as a composite of constructed-response and fixed-format items, the test developer likely assumes either (1) that the two formats measure different proficiencies and the test score should be interpreted as a composite of performance across these proficiencies, or (2) that the proficiency of interest is that measured by the constructed-response items. In the latter circumstance, the fixed-format items are included because available testing time does not allow for administration of a sufficient number of constructed-response items to achieve an acceptable level of reliability. Again, as Wainer and Thissen3 pointed out, with limited testing time it may be possible for a highly reliable test (constructed of fixed-format items) to measure the proficiency associated with less reliable constructed-response items more precisely than the constructed-response items measure that same proficiency. This will occur when the correlation between full-length tests based on constructed-response items and fixed-format items is greater than the reliability of the test composed of constructed-response items. This logic is evident in tests of writing skills constructed of essay questions and MCQs. The same logic may be implicit in decisions by certification and licensure groups to replace oral examinations (constructed-response items) with tests composed of MCQs. This result is particularly likely to occur when there is a substantial discrepancy between the reliability of the constructed-response items and the MCQs.
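The Wainer and Thissen condition can be checked numerically. The snippet below uses the values reported later in this paper (a universe-score correlation of .69 and full-length reliabilities of approximately .93 for the MCQs and .61 for the CCSs); it assumes standard classical-test-theory relations and is purely illustrative.

```python
# Numeric check of the Wainer-Thissen condition using values reported
# in this paper: true-score correlation .69, full-length reliabilities
# .93 (MCQ) and .61 (CCS).
import math

rho_true = 0.69                  # universe-score correlation between formats
rel_mcq, rel_ccs = 0.93, 0.61    # full-length reliabilities

# Correlation of each observed score with the CCS true score
r_ccs_obs = math.sqrt(rel_ccs)               # observed CCS vs. CCS true score
r_mcq_obs = rho_true * math.sqrt(rel_mcq)    # observed MCQ vs. CCS true score

# Equivalently: the observed correlation between the full-length tests,
# rho * sqrt(rel_mcq * rel_ccs), exceeds rel_ccs exactly when
# rho * sqrt(rel_mcq) > sqrt(rel_ccs).
print(r_mcq_obs > r_ccs_obs)   # -> False: at full length, CCS measures
                               #    its own true score more precisely
```

At these full-length values the condition is not met; it is at sharply reduced test lengths, where the CCS reliability drops much faster than the MCQ reliability, that the MCQ-only test can become the better measure of the CCS proficiency.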
Consistent with these alternative rationales for building tests using a combination of constructed-response and fixed-format items, the precision of tests composed of CCS cases and MCQs was evaluated using two different true-score composites as criteria. First, the correlations were estimated between the observed composite scores and the true scores associated with the CCS cases (a composite in which the MCQ true score is given no weight). Then, analyses were repeated in which the criterion was a composite produced by standardizing the true scores associated with the CCS cases and those associated with the MCQs and weighting each by .50 before summing. (These criteria were selected because they provide an interesting basis for examining the problem at hand. They do not reflect USMLE policy.)
The purpose of this correlational analysis was to assess the extent to which introducing CCS cases improved the correspondence between the true-score composite of interest and the observed-score composite. Results were estimated based on an initial condition in which all testing time was allocated to MCQ items, followed by conditions in which CCS cases were incrementally added to the test and MCQs were removed, keeping the total amount of testing time fixed. Results were produced for conditions consistent with the total amount of testing time in the current Step 3 examination (approximately 14.5 hours). Because the pattern of results will vary as a function of total testing time, the analyses were repeated for test lengths of eight hours and two hours.
The relationship between the observed-score composite and the criterion will be impacted by the reliability of the components of that composite. That relationship will also be influenced by how those components are weighted to form the observed-score composite. A full examination of the impact of this weighting is beyond the scope of the present paper. For comparison purposes, two alternative weighting procedures are presented: (1) a procedure in which the mean scores for MCQs and CCS cases are weighted as a function of the proportion of testing time allocated to each measure, and (2) a procedure in which weighting is optimized in the sense that it produces the highest possible correlation between the observed scores and the criterion.
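As a sketch of how such an analysis can be carried out, the code below sweeps the allocation of a fixed amount of testing time across the two formats and compares the two weighting procedures. Only the .69 universe-score correlation is taken from this study; the error-variance-per-minute figures are hypothetical.

```python
# Illustrative allocation sweep (hypothetical inputs, not Table 1 values).
# Observed composite O = w' X; criterion T = v' (true scores).
import numpy as np

def composite_corr(w, v, S_t, S_x):
    """Correlation between observed composite w'X and true composite v'T.
    S_t: true-score covariance matrix; S_x: observed-score covariance.
    Cov(X, T) = S_t because measurement errors are independent of T."""
    return (w @ S_t @ v) / np.sqrt((w @ S_x @ w) * (v @ S_t @ v))

rho = 0.69                                  # universe-score correlation
S_t = np.array([[1.0, rho], [rho, 1.0]])    # standardized true scores
v = np.array([0.5, 0.5])                    # equally weighted criterion

total = 120.0                               # total testing time in minutes
for frac_ccs in (0.1, 0.25, 0.5, 0.75, 0.9):
    # Hypothetical: error variance inversely proportional to allocated time,
    # with CCS assumed noisier per minute than MCQ
    err = np.diag([60.0 / (frac_ccs * total),
                   15.0 / ((1.0 - frac_ccs) * total)])
    S_x = S_t + err
    w_prop = np.array([frac_ccs, 1.0 - frac_ccs])   # time-proportional weights
    w_opt = np.linalg.solve(S_x, S_t @ v)           # regression (optimal) weights
    print(round(frac_ccs, 2),
          round(composite_corr(w_prop, v, S_t, S_x), 3),
          round(composite_corr(w_opt, v, S_t, S_x), 3))
```

The optimal weights are the usual regression weights, proportional to the inverse observed-score covariance matrix times the covariance between the observed scores and the criterion; by construction they can never produce a lower correlation than the time-proportional weights.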
Table 1 provides the mean variance and covariance components from the ten replications of the G-study. The values in parentheses represent approximate empirical standard errors for these mean values.* The variance components in the left column are those that would have been produced with a univariate G-study of the CCS cases. Those in the right column are equivalent values for the MCQs. The value below the diagonal in the matrix of person components is the covariance between the universe scores for CCS cases and MCQs. The value above the diagonal is the universe-score correlation (the equivalent of a true-score correlation in classical test theory). For the task and person-by-task components, it is assumed that the covariances are zero because the items comprising the fixed facets are independent.
The universe-score correlation of .69 between the CCS and MCQ scores reflects a reasonably strong relationship between the two scores but indicates that CCSs are not measuring the same dimension as the MCQs. The universe score for the MCQs accounts for slightly less than 50% of the variance of the CCS universe scores (.69² ≈ .48). The remaining results in Table 1 are not surprising. For both formats, the examinee-by-task interaction represents the largest source of variance. If it is assumed that the task (case or item difficulty) variance can be largely eliminated through equating of test forms, this produces a generalizability for the MCQ score of approximately .93 (based on 500 items and ten hours) and a generalizability for the CCS score of .61 (based on nine cases and four hours).†
Figure 1 presents separate graphs for the three test lengths. Within each graph, separate lines representing the two observed-score weighting procedures crossed with the two composite criterion definitions show how the correlation with the criterion changes as a function of the allocation of testing time to the two item formats. In the graphs representing the two longer test lengths (the middle and top panels), when the CCS true score is the criterion, the highest correlation is achieved when all testing time is allocated to CCS cases. This result holds for both of the observed score-weighting procedures. When the criterion is a composite in which the two true scores are weighted equally, for the longer tests the correlation is close to its maximum value over a wide range of test-construction designs.
By contrast, for the shorter test length (the bottom panel in Figure 1), when the CCS true score is the criterion, the maximum correlation is achieved when 25% of the testing time is allocated to the MCQs. For this two-hour test, when the criterion is an equally weighted composite, the maximum correlation is achieved when MCQs are used to the exclusion of CCS cases.
The results presented in this paper suggest that CCSs assess a proficiency that is related to but distinct from that measured by MCQs. This result is a necessary but not sufficient condition for justifying the use of this item format. Clearly, if the true scores from CCSs and MCQs correlated at a near-perfect level, it would be difficult to justify including the less efficient (reliability per unit of testing time) CCS cases in the examination. The reported true-score correlation provides evidence that CCSs measure a different dimension than that measured by MCQs. The correlation does not provide evidence that inclusion of this new dimension enhances the validity of the Step 3 examination. Evidence to support the usefulness of CCS for licensure assessment has been discussed previously1 and remains an important priority of research at the NBME.
In addition to showing a moderate correlation between the proficiencies measured by CCSs and MCQs, the results presented in Table 1 confirm those of previous studies, conducted using data from CCS field tests,6 showing that CCS scores are less reliable than those for MCQs (holding testing time constant). Given the moderate correlation between CCS and MCQ true scores and the low reliability of CCS cases, it is sensible to give careful consideration to the extent to which testing resources should be allocated to CCSs. The results presented in this paper suggest that if the intention is to measure a composite proficiency that gives even 50% of the weight to CCSs and 50% to MCQs, CCS cases make a useful contribution at test lengths similar to those used for USMLE examinations. If the intention is to assess the proficiency measured by CCSs, CCS case scores make an important contribution to the observed-score composite even when the total testing time is only two hours.
It is important to reiterate that this correlational evidence does not directly support the validity of the scores associated with CCS. The overall validity argument for CCS scores must include evidence that these scores measure some construct of interest more directly than MCQ scores measure that construct and also that CCS does not measure irrelevant constructs. Much as the true-score correlation between CCS and MCQ scores provides a necessary but not sufficient condition for the usefulness of these scores, the subsequent results reported in this paper represent a necessary but not sufficient condition to justify the use of CCSs. The chain of evidence required to produce a fully sufficient argument is still being collected.
1. Clyman SG, Melnick DE, Clauser BE. Computer-based case simulations from medicine: assessing skills in patient management. In: Tekian A, McGuire CH, McGaghie WC (eds). Innovative Simulations for Assessing Professional Competence. Chicago, IL: University of Illinois, Department of Medical Education, 1999.
2. Clauser BE, Margolis MJ, Clyman SG, Ross LP. Development of automated scoring algorithms for complex performance assessments: a comparison of two approaches. J Educ Meas. 1997;34:141–61.
3. Wainer H, Thissen D. Combining multiple-choice and constructed-response test scores: toward a Marxist theory of test construction. Appl Meas Educ. 1993;6:103–18.
4. Brennan RL. Manual for mGENOVA. Iowa Testing Programs Occasional Papers Number 47, 1999.
5. Brennan RL. Generalizability Theory. New York: Springer Verlag, 2001.
6. Clauser BE, Swanson DB, Clyman SG. The generalizability of scores from a performance assessment of physicians' patient-management skills. Acad Med. 1996;71(10 suppl):S109–S111.