During recent decades, consequential validity has become an increasingly important standard by which tests are evaluated.1 Consequential validity is that component of the overall validity argument that deals with the social consequences of testing. One aspect of consequential validity that should be addressed is the question of how a test may impact examinees from different definable groups (e.g., gender, ethnicity, English-language status). Research into gender differences on achievement tests and undergraduate, graduate, and professional school admission tests documents a long-standing history of performance differences between men and women.2,3,4 Additional evidence suggesting that essay tests tend to yield smaller gender differences than do multiple-choice assessments provides valuable information for test developers wanting to address potential gender-subgroup differences.3,5
Investigations of performance in a high-stakes, medical licensure context also report differences among gender subgroups6,7,8 and native/non-native English speakers.8,9 On Step 1 of the United States Medical Licensing Examination (USMLE), performance advantages were observed for men,7,8 while on USMLE Step 2, performances of men and women tended to be quite similar.6,7 For both Steps 1 and 2, native English speakers performed better than examinees for whom English is not the native language.8,9
Although a majority of the available research deals with the impact of subgroup differences on scores from traditional multiple-choice question (MCQ) and essay examinations, the growing inclusion of performance assessments in high-stakes tests makes investigating the consequential validity of these components an important step in validating the use of these alternative testing formats in a high-stakes context. Beginning in 1999, a dynamic computer simulation was added to the USMLE Step 3 examination. This test had previously been composed entirely of MCQ items. This paper reports efforts to collect consequential validity evidence regarding gender, English as a second language (ESL) status, and Liaison Committee on Medical Education (LCME) status relevant to scores produced using this simulation.
The National Board of Medical Examiners has developed a dynamic simulation of the patient-care environment. This simulation comprises individual computer-based case simulations (CCSs), and it was developed to assess physicians' patient-management skills (the format has been described in detail previously).10 Briefly, each case begins with an opening scenario describing the patient's location and presentation. The examinee then has the opportunity to (1) access history and physical examination information; (2) order tests, treatments, and consultations by making free-text entries; (3) advance the case through simulated time; and (4) change the patient's location. The system recognizes over 12,000 abbreviations, brand names, and other terms that represent more than 2,500 unique actions. Within the dynamic framework, the patient's condition changes based both on the actions taken by the examinee and the underlying problem. The simulations are scored using a computer-automated scoring algorithm that produces a score designed to approximate that which would have been received if the examinee's performance had been reviewed and rated by a group of experts.11
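The logic of regression-based automated scoring can be illustrated with a deliberately simplified sketch. In the fragment below, all feature names, counts, and ratings are hypothetical, and the operational algorithm described in reference 11 is considerably more sophisticated; the sketch only shows how a model fit to expert ratings can score new performances without reconvening the expert panel.

```python
import numpy as np

# Toy training data (hypothetical): each row counts the beneficial,
# neutral, and risky actions taken in a performance that a panel of
# experts has already rated.
features = np.array([
    [8, 2, 0],
    [6, 3, 1],
    [3, 4, 3],
    [1, 2, 6],
], dtype=float)
expert_ratings = np.array([8.5, 7.0, 4.5, 1.5])

# Fit least-squares weights, including an intercept term.
X = np.column_stack([np.ones(len(features)), features])
weights, *_ = np.linalg.lstsq(X, expert_ratings, rcond=None)

def automated_score(beneficial, neutral, risky):
    """Approximate the rating an expert panel would have assigned."""
    x = np.array([1.0, beneficial, neutral, risky])
    return float(x @ weights)

# A new, unrated performance is scored by the fitted model.
score = automated_score(7, 2, 1)
```

The design choice this illustrates is the central one: expert judgment is expensive, so it is collected once for a sample of performances and then generalized by the fitted model.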
Step 3 is usually taken during the first or second year of post-graduate training. In its current configuration, the examination requires two days of testing. Each form of the examination includes nine case simulations and 500 MCQ items. Multiple forms are used to enhance test security, and each form is constructed to meet complex content specifications with respect to sampling of MCQs and case simulations. The simulations are completed in the afternoon of the second day, with a maximum of 25 minutes of testing time allocated for each case.
The sample used in this research included the responses of over 20,000 MDs who completed the Step 3 examination during the first year of computer administration. Approximately 60% of the examinees were men. English was identified as the native language for 43% of examinees; 30% identified some other language as their native language, and 27% did not respond. Approximately 60% of the examinees came from LCME-accredited schools. Self-reported information about the residency programs in which the examinees had trained (i.e., discipline choice) was also available.
The first step in the analysis was to calculate descriptive statistics for the MCQ and CCS scores for the groups of interest. Scaling and equating for the examination are implemented using the Rasch model. The Rasch model allows scores from different examinees to be put on the same scale, even when the examinees have taken different sets of test items. To put scores from alternate forms on the same scale, ability estimates were produced for the MCQ and CCS components of the test using the difficulty estimates from operational scaling. Analysis of variance (ANOVA) was used to examine the variability of scores across examinee subgroups. Because it was assumed that examinee groups would differ in proficiency, the scaled MCQ score was used as an examinee-background variable in a model designed to study performance differences across groups defined by gender, ESL status, and whether the examinee attended an LCME-accredited school. Additionally, previous research examining performance on MCQ items suggested that performance may vary significantly as a function of examinees' residency training. Dillon et al.12 reported results showing substantial performance differences across residency-based subgroups of examinees. Clauser, Nungester, and Swaminathan13 also showed that residency training was a significant explanatory variable with respect to differences in item-level performance by gender groups. Consistent with these previous findings, residency was dummy coded as an additional examinee background variable.
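As a rough illustration of how fixed difficulty estimates place examinees who took different forms on a common scale, the sketch below (item difficulties and the response pattern are hypothetical, not operational values) computes a maximum-likelihood Rasch ability estimate by Newton-Raphson, treating the difficulties from operational scaling as known.

```python
import math

def rasch_ability(responses, difficulties, n_iter=50):
    """Newton-Raphson MLE of ability theta for a 0/1 response vector,
    given fixed Rasch item difficulties. Under the Rasch model,
    P(correct) = 1 / (1 + exp(-(theta - b_i)))."""
    theta = 0.0
    for _ in range(n_iter):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        # First derivative of the log-likelihood, and the information
        # (the negative of the second derivative).
        grad = sum(r - p for r, p in zip(responses, probs))
        info = sum(p * (1 - p) for p in probs)
        step = grad / info
        theta += step
        if abs(step) < 1e-10:
            break
    return theta

# Because the item difficulties anchor the metric, ability estimates
# from different forms land on the same scale.
form_a = [-1.0, -0.5, 0.0, 0.5, 1.0]   # hypothetical difficulties
theta_a = rasch_ability([1, 1, 1, 0, 0], form_a)
```

(The maximum-likelihood estimate is undefined for all-correct or all-incorrect patterns; operational programs handle such cases separately.)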
To examine subgroup performance differences on CCSs, the resulting ANOVA model was used to estimate expected scores for subgroup levels after accounting for other differences in group status. In addition to test-score-level analyses, analyses were run at the individual case level for 18 CCS cases.
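The logic of these expected (covariate-adjusted) subgroup means can be sketched with simulated data. Every value below is fabricated for illustration: a CCS-like score depends on the MCQ covariate but not on group membership, and the fitted model is evaluated for each group at the overall mean of the covariate, so the comparison reflects group status after accounting for proficiency differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400

# Simulated examinees (hypothetical): a background covariate (scaled
# MCQ score), a 0/1 group indicator, and a CCS score that depends on
# MCQ proficiency but not on group membership.
mcq = rng.normal(0.0, 1.0, n)
group = rng.integers(0, 2, n)
ccs = 1.10 + 0.40 * mcq + rng.normal(0.0, 0.3, n)

# Fit the model CCS ~ intercept + MCQ + group by ordinary least squares.
X = np.column_stack([np.ones(n), mcq, group])
beta, *_ = np.linalg.lstsq(X, ccs, rcond=None)

# Expected (adjusted) subgroup means: evaluate the fitted model for each
# group at the overall mean MCQ score, removing the covariate's effect.
expected_group0 = beta[0] + beta[1] * mcq.mean()
expected_group1 = expected_group0 + beta[2]
```

Because the simulated group effect is zero, the two adjusted means differ only by sampling noise, which is the pattern the test-level results below exhibit.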
Mean scores for all subgroups and for the total test are shown in Table 1. The mean CCS test score was 1.10 (SD = .76). The mean score was 1.09 for men and 1.12 for women. This difference was generally consistent with the performances on the MCQ component of the test (but slightly smaller in terms of standardized score units). Mean scores were .93 for candidates reporting that English was not their native language and 1.27 for those reporting that English was their native language. For LCME status, mean CCS scores were 1.26 for LCME graduates and .87 for non-LCME graduates. Mean scores by residency program ranged from .51 to 1.38.
Results of the ANOVA for the test-level analysis show nonsignificant effects for gender, LCME status, and English-language status.* Expected mean scores were 1.12 for men and 1.16 for women. The expected mean for graduates of both LCME-accredited and non-accredited schools was 1.14. Expected means were 1.13 for native English speakers and 1.15 for non-native speakers. Again, these modeled score means were not significantly different, suggesting that, after accounting for group differences, CCS performance was not significantly related to gender, LCME, or ESL status. There was a significant effect for residency training program; this result was anticipated, because residency was included in the model as an examinee-background variable rather than as a variable of study.
The results for 18 CCS cases indicated that the performances of men and women differed significantly on two cases. Mean scores favored women for both cases, while nonsignificant differences on other cases favored women in some instances and men in others. As noted previously, the dependent measure in these case-level analyses was a continuous raw case score. In the first case displaying a significant gender difference, the estimated expected mean scores for men and women were 2.81 and 2.91, respectively.† For the second case, the estimated expected mean scores were 3.58 and 3.79, respectively. In both cases the expected difference across gender groups was modest relative to the variability between examinees.
A significant difference related to LCME status was found for one case; estimated expected mean scores for LCME and non-LCME examinees were 3.02 and 2.71, respectively. A significant difference related to ESL status was also found for one case; expected mean scores were 3.93 for examinees who indicated that English was not their native language and 3.65 for those who indicated that it was.
The results reported in this paper have important implications for the use of CCSs as part of the Step 3 examination. Of central importance to the purpose of this paper is the finding that after accounting for examinee characteristics represented by the MCQ score and choice of residency training, there was no significant difference in test-level performances across subgroups defined by gender, ESL status, or LCME accreditation of the examinee's medical school. This is an important finding because examinees within these subgroups may differ in reading proficiency, English-language proficiency, or access to computers. If these classifications were associated with performance differences, those differences could be evidence of construct-irrelevant influences on the CCS scores. Such influences would be a threat to validity. The absence of evidence of such effects in this study must be considered supportive evidence of the validity of CCS scores. While evidence of this type does not shed light on what proficiency is measured by CCSs, it does set aside some important competing hypotheses that could undermine the interpretation of CCS scores.
When analysis was implemented at the case level across the 54 significance tests (18 cases each tested for the effects of gender, ESL, and LCME status), significant group differences were found for a total of four cases. This type of case-level analysis is conceptually equivalent to the differential item functioning (DIF) analysis that is routinely performed in many standardized testing contexts.14 DIF is generally considered a necessary (but not sufficient) condition for determining that an item unfairly disadvantages some subgroup of examinees. That is, there are clearly circumstances in which subgroups differ in their performances on items that validly measure appropriate content material. For example, the case identified as favoring LCME graduates dealt with a diagnosis that may be more prevalent in the United States than in many other countries and is likely to receive more emphasis within U.S. medical curricula. Inclusion of such a diagnosis is clearly appropriate for a U.S. licensure examination. Similarly, one of the cases that favored women over men focused on a diagnosis that is more prevalent in women. It is not surprising that these cases might favor one sub-group over another and yet both focus on material that is appropriately represented on the USMLE.
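A rough benchmark helps put four significant results among 54 tests in perspective. Under the idealized assumption that the 54 tests are independent and every null hypothesis is true (an assumption made only for illustration), the expected number of chance-significant results and the probability of observing at least four follow directly from the binomial distribution:

```python
from math import comb

n, alpha = 54, 0.05

# Expected number of tests significant by chance alone at alpha = .05.
expected_by_chance = n * alpha   # 2.7 cases

# Binomial upper tail: probability of 4 or more significant results
# when all 54 null hypotheses are true.
p_at_least_4 = sum(
    comb(n, k) * alpha**k * (1 - alpha)**(n - k) for k in range(4, n + 1)
)
```

By this benchmark, four significant results are not far from what chance alone would produce, which is consistent with treating the case-level findings as flags for content review rather than as evidence of flawed cases.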
Although differential performance across groups is not necessarily evidence that a case is flawed, the potential for cases to perform in this way (based on content) highlights the importance of carefully developed test specifications. Cases on which some subgroups perform better because the content is more familiar to them are not inherently inappropriate. However, a test form that disproportionately represented such cases would be problematic.
This conclusion is even more critical when consideration is given to the results for subgroups defined by residency training. In this study, residency training was included as an examinee-background variable rather than as a focus of study because it was assumed that a close match between the content of a case and the focus of training within a specific residency would give examinees within that residency an advantage on that specific case. In fact, in 14 of the 18 cases studied, residency proved to be a significant factor in explaining examinee performance. This result is consistent with previous studies examining performance on MCQs. The point is that even when a test is developed to measure competence for general practice, an examinee's background and training will inevitably make him or her more familiar with some content areas than others.
Evidence regarding examinee-group differences presented in this paper represents an important part of the overall validity argument. It is, however, only one part of that argument. In fact, within the context of the present study even the interpretation of the expected CCS group means (i.e., expected marginal means within the ANOVA framework) requires the assumption that the MCQs are measuring a relevant proficiency.
The present research was an important step in providing support for the validity of the CCS component of the USMLE. In the context of a high-stakes examination, ongoing investigation of test validity is a critical part of the overall testing process. Although the results presented in this paper are part of a growing body of research to support the validity of CCSs, other aspects of the validity argument are still, and will continue to be, under study. For example, two recent studies15,16 examined the extent to which CCS scores can be shown to provide information not available from the MCQ component of Step 3. Of course, the ultimate and elusive piece of validity evidence would show a direct relationship between CCS scores and subsequent outcomes for the examinees' patients. Studies to provide this sort of evidence remain in the planning stage.
1. Messick S. Validity. In: Linn RL (ed). Educational Measurement. 3rd ed. New York: American Council on Education/Macmillan, 1989:13–103.
2. Doolittle A. Gender differences in performance on a college-level achievement test. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, CA, 1989.
3. Bolger N. Gender difference in academic achievement according to method of measurement. Paper presented at the Annual Convention of the American Psychological Association, Toronto, ON, Canada, 1984.
4. Wilder GZ, Powell K. Sex differences in test performance: a survey of the literature. College Board Report Number 89-3, 1989.
5. Bridgeman B, Lewis C. Sex differences in the relationship of advanced placement essay and multiple-choice scores to grades in college courses. Educational Testing Service. Unpublished Research Report No. RR-91-48, 1994.
6. Case S, Swanson D, Ripkey D. Differential performance of gender and ethnic groups on USMLE Step 2. Paper presented at the Annual Meeting of the American Educational Research Association, Chicago, IL, 1997.
7. Case S, Swanson D, Ripkey D, Bowles T, Melnick D. Performance of the class of 1994 in the new era of USMLE. Paper presented at the 35th Annual Research in Medical Education (RIME) Conference at the Annual Meeting of the Association of American Medical Colleges, San Francisco, CA, 1996.
8. Swanson DB, Case SM, Ripkey DR, Melnick DE, Bowles LT, Gary NE. Performance of examinees from foreign schools on the basic science component of the United States Medical Licensing Examination. Paper presented at the Seventh Ottawa International Conference on Medical Education and Assessment, Maastricht, The Netherlands, 1996.
9. Ripkey DR, Case SM, Swanson DB, Melnick DE, Bowles LT, Gary N. Performance of examinees from foreign schools on the clinical science component of the United States Medical Licensing Examination. Paper presented at the Seventh Ottawa International Conference on Medical Education and Assessment, Maastricht, The Netherlands, 1996.
10. Clyman SG, Melnick DE, Clauser BE. Computer-based case simulations from medicine: assessing skills in patient management. In: Tekian A, McGuire CH, McGaghie WC (eds). Innovative Simulations for Assessing Professional Competence. Chicago, IL: University of Illinois, Department of Medical Education, 1999:29–41.
11. Clauser BE, Margolis MJ, Clyman SG, Ross LP. Development of automated scoring algorithms for complex performance assessments: a comparison of two approaches. J Educ Meas. 1997;34:141–61.
12. Dillon GF, Henzel TR, Walsh WP. The impact of postgraduate training on an examination for medical licensure. In: Scherpbier AJJA, van der Vleuten CPM, Rethans JJ, van der Steeg AFW (eds). Advances in Medical Education. Dordrecht, The Netherlands: Kluwer Academic Publishers, 1997:146–8.
13. Clauser BE, Nungester RJ, Swaminathan H. Improving the matching for DIF analysis by conditioning on both test score and an educational background variable. J Educ Meas. 1996;33:453–64.
14. Clauser BE, Mazor KM. Using statistical procedures to identify differentially functioning test items (ITEMS Module). Educ Meas: Issues and Pract. 1998;17:31–44.
15. Clauser BE, Margolis MJ, Swanson DB. An examination of the contribution of computer-based case simulations to the USMLE Step 3 examination. Acad Med. 2002;77(10 suppl):S80–S82.
16. Floreck LM, Guernsey MJ, Clyman SG, Clauser BE. Examinee performance on computer-based case simulations as part of the USMLE Step 3 examination: are examinees ordering dangerous actions? Acad Med. 2002;77(10 suppl):S77–S79.