In the Standards for Educational and Psychological Testing,1 test validity refers to the degree to which evidence and theory support the interpretation of test scores. When assessment outcomes are intended to signal readiness for professional activity, an interpretive leap occurs between test administrations and final assessment decisions. Kane2 has advocated that test developers and score users conceptualize this interpretive leap as a chain of inferences or links that can be examined individually for their strengths and weaknesses. Further, it has been suggested that priorities for validation can be established based upon identification of the weakest links in the inferential chain, and that these weak links are often related to the particular testing format being used.3,4
For the United States Medical Licensing Examination (USMLE™), the primary validity issue is whether it is appropriate to use Step 1, 2, and 3 data to determine an individual's readiness for the safe and effective care of patients. The decision makers ultimately using USMLE scores make an interpretive leap from an individual's encounter with multiple-choice questions, standardized patients, and computer simulations to a final decision regarding that individual's ability to successfully apply the knowledge and skills necessary for the safe and effective practice of medicine. To examine this inference, this article focuses on the accumulation of validity evidence for the multiple-choice question (MCQ) Clinical Knowledge (CK) component of USMLE Step 2. It addresses the degree to which experts view the USMLE Step 2 CK content as relevant to an individual's ability to practice safe and effective medicine. This article also investigates the frequency with which USMLE Step 2 CK item content is used in clinical practice and the appropriateness of the content for USMLE Step 2 CK.
Typically taken during one's senior year of medical school, the USMLE Step 2 is intended to assess whether an individual can apply medical knowledge, skills, and an understanding of clinical science essential for the provision of safe and effective patient care under supervision. The Step 2 CK component is composed of approximately 370 MCQs. Using an MCQ testing format allows for fairly extensive and reliable content sampling, so that the inferences associated with generalizing from a specific set of items to the larger content domain are relatively strong. More of a challenge, however, is the link between performance on the MCQs and performance in medical practice. Because developers of licensure tests rarely have access to well-defined and established measures of professional competence, this study utilized the expert opinions of a physician panel who had both an understanding of the examination content and experience with the knowledge and skills required to treat and interact effectively with patients.
The data used in this article come from two sources: a USMLE test item database, and a survey administered to an expert panel of physicians after they completed the 2003 USMLE Step 2 standard-setting exercises. The survey included three questions. The first assessed the relevance of the content presented in 150 selected Step 2 CK items to the practice of first year postgraduate trainees (PGY-1s) and included six options ranging from “currently relevant to the practice of PGY-1s” to “not relevant for anyone.” The second survey question addressed the frequency with which the content of these items is used by PGY-1s and included six options ranging from “never” to “once a week or more.” The third question assessed the appropriateness of the item content for the purpose of the Step 2 CK examination and included options for “very appropriate,” “somewhat appropriate,” and “inappropriate” choices differentiated by reason. The first two questions included “unable to judge” options.
Twenty-seven experts were asked to use their overall knowledge to complete the three survey questions for each of the 150 MCQs. To obtain a wide range of perspectives and to minimize potential bias stemming from the experts’ backgrounds, a diverse physician group was selected from an extensive National Board of Medical Examiners (NBME®) database. The experts represented a range of medical schools from different regions of the United States and a range of medical specialties. The specialties included family medicine, genetics, internal medicine, neurology, obstetrics–gynecology, pediatrics, psychiatry, and surgery. The group included 15 men and 12 women and a range of racial and ethnic backgrounds. Twenty-four of the experts had not worked with the NBME before, whereas three of them were current USMLE committee members.
The content relevance exercise occurred at the end of a two-day Step 2 CK standard-setting meeting held at the NBME including the 27 experts and relevant NBME staff. In the context of a series of standard-setting exercises, the experts were exposed to Step 2 CK item content and were oriented to the general purpose, target population, and content focus of the Step 2 CK examination. The content relevance exercise was not explicitly addressed, however, until the end of the two-day meeting after the experts completed the standard-setting exercises.
Based on the survey results, three scales were computed (when necessary, similar response options were collapsed): a three-point scale for relevance, a five-point scale for frequency of use, and a three-point scale for appropriateness for the purpose of the Step 2 CK examination. Because of missing data, the relevance, frequency, and appropriateness scales were based on 3,630, 3,380, and 3,642 expert/item judgments, respectively. Using a multivariate generalizability analysis, which provides variance and covariance components,5 reliability estimates, observed correlation coefficients, and correlation coefficients corrected for attenuation6,7 were calculated for each scale.
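As an illustration of the correction for attenuation mentioned above, the following minimal sketch applies the classical formula, in which an observed correlation is divided by the square root of the product of the two scales' reliabilities. The function name and the numeric inputs are hypothetical and are not taken from Table 1. The study itself derived these quantities from a multivariate generalizability analysis, which this sketch does not reproduce.

```python
import math


def disattenuated_correlation(r_xy, rel_x, rel_y):
    """Classical correction for attenuation.

    r_xy  : observed correlation between two scales
    rel_x : reliability estimate of the first scale (0 < rel_x <= 1)
    rel_y : reliability estimate of the second scale (0 < rel_y <= 1)
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical inputs (not the study's values): a modest observed
# correlation between two scales with low-to-moderate reliability.
corrected = disattenuated_correlation(0.4, 0.64, 0.25)
```

With low reliabilities, a modest observed correlation can correspond to a much larger corrected correlation, which is the pattern the corrected coefficients are meant to reveal.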
To better understand the relationships among variables, ordinary least squares (OLS) regression was used to estimate the effects of the frequency scale and related item characteristics on the appropriateness scale. Item characteristics included item difficulty and item discrimination. Item difficulty was measured by a Rasch difficulty estimate. Item discrimination, how well items differentiate between high-scoring and low-scoring examinees, was measured by the point-biserial correlation coefficient. To evaluate the potential conditioning impact of frequency of use on the relationship between item difficulty and Step 2 CK appropriateness, the regression model also included an interaction term between item difficulty and frequency of use. Frequency and item difficulty were centered (their means were set to zero) before computing the interaction term.8
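The centering step and the interaction term can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names `center` and `build_design_rows` and the toy values are hypothetical, and the sketch only constructs the predictor columns. It does not fit the regression or reproduce the study's data.

```python
def center(values):
    """Subtract the mean so the centered variable has a mean of zero."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def build_design_rows(difficulty, frequency):
    """Return one row per item: (intercept, centered difficulty,
    centered frequency, difficulty x frequency interaction)."""
    d_c = center(difficulty)
    f_c = center(frequency)
    return [(1.0, d, f, d * f) for d, f in zip(d_c, f_c)]


# Toy item-level values (hypothetical, not from the USMLE item database).
difficulty = [-1.0, 0.0, 1.0, 2.0]  # e.g., Rasch difficulty estimates
frequency = [1.0, 2.0, 3.0, 4.0]    # e.g., mean frequency-of-use ratings
rows = build_design_rows(difficulty, frequency)
```

Centering before forming the product means that each main-effect coefficient can be read as the effect of that variable at the average level of the other.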
To assess whether item content related to specific core areas of medical knowledge influenced Step 2 CK appropriateness, item content was dummy coded to include four areas: pediatric item content, surgery item content, psychiatric item content, and obstetrics and gynecologic item content. In order to evaluate the effect of item content related to specific sites of medical care on the appropriateness scale, item content was dummy coded to include two levels: emergency care and inpatient hospital care. The relevance scale was not included in the model because of its very clear relationship to the appropriateness scale as identified by the correlation coefficients.
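The dummy coding described above might look like the following sketch. The category labels and the function name are hypothetical. The reference groups (internal medicine for discipline, ambulatory medicine for site of care) are the ones the results section names, and an item in a reference group receives zeros on every indicator.

```python
# Hypothetical labels for the coded categories; reference groups are omitted
# from the indicator lists so they are represented by all-zero codes.
DISCIPLINES = ["pediatrics", "surgery", "psychiatry", "obgyn"]
SITES = ["emergency", "inpatient"]


def dummy_code(category, levels, reference):
    """Return one 0/1 indicator per level; the reference maps to all zeros."""
    if category != reference and category not in levels:
        raise ValueError(f"unknown category: {category!r}")
    return [1 if category == level else 0 for level in levels]


# A psychiatry item seen in an ambulatory setting (the site reference group).
discipline_codes = dummy_code("psychiatry", DISCIPLINES, "internal_medicine")
site_codes = dummy_code("ambulatory", SITES, "ambulatory")
```

Each regression coefficient on an indicator is then interpreted relative to the omitted reference group.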
Potential violations of the assumptions of OLS regression were investigated. All of the variance inflation factors were less than 4, and the bivariate correlation coefficients between the independent variables were all less than .5. An examination of the residuals suggested that the errors were approximately normally distributed, and the Cook's D influence statistics were all below 1. Thus, the data are appropriate for OLS regression.
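One way to compute variance inflation factors of the kind reported above is to take the diagonal of the inverse of the predictors' correlation matrix. The sketch below does this in plain Python. The function names and the toy predictor values are hypothetical, and the residual and Cook's D checks are not covered here.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(columns):
    """VIF for each predictor: diagonal of the inverse correlation matrix."""
    k = len(columns)
    corr = [[pearson(columns[i], columns[j]) for j in range(k)]
            for i in range(k)]
    inverse = invert(corr)
    return [inverse[j][j] for j in range(k)]


# Toy predictors (hypothetical values); their correlation is 0.8.
vifs = variance_inflation_factors([[1, 2, 3, 4], [1, 3, 2, 4]])
```

For two predictors this reduces to 1 / (1 - r²), so a bivariate correlation below .5 corresponds to a VIF below about 1.33 in that simple case.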
Ninety-two percent of expert judgments indicated that the reviewed item content was relevant to clinical practice, somewhat relevant to clinical practice, or likely to be relevant to clinical practice in the future. Ninety percent of the judgments indicated that the item content was moderately to very appropriate for the purpose of the Step 2 CK examination. Approximately 85% of the judgments indicated that the content was used in clinical practice, but the perceived frequency of use of the item content varied considerably.
Table 1 summarizes the results of the reliability estimates, observed correlation coefficients, and correlation coefficients corrected for attenuation for the three scales. The reliability estimates indicate strong to moderate internal consistency among the cases included in the scales. The observed correlation coefficients between the three scales indicate a modest association among the three constructs. The correlation coefficients corrected for attenuation reflect the correlations among the three scales if the three scales were perfectly reliable. They indicate an extremely large degree of association between the scales.
Table 2 summarizes the results of the OLS multiple regression. The adjusted R² of .66 indicates that the independent variables explained 66% of the variance in the appropriateness scale. Item difficulty and frequency of use were both positively related to the appropriateness scale. In other words, the more frequently the item content was perceived to be used in clinical practice, the more appropriate the experts viewed the item content for the purpose of the Step 2 CK examination. Further, more difficult items were also considered more appropriate for the purpose of the Step 2 CK examination.
Interestingly, the interaction term between item difficulty and frequency of use was also significantly positively related to Step 2 CK appropriateness. This indicates that the experts considered very difficult items frequently used in clinical practice to be especially appropriate for the purpose of the Step 2 CK examination. Moreover, the experts considered very difficult items rarely used in clinical practice to be less appropriate for the Step 2 CK examination. In order to demonstrate this conditioning effect of frequency of use in clinical practice on the relationship between item difficulty and Step 2 CK appropriateness, predicted effects were calculated for item difficulty at three levels of frequency: the maximum reported frequency, the average reported frequency, and the minimum reported frequency. Based on these predicted coefficients, the impact of item difficulty on Step 2 CK appropriateness is greatest at the highest level of frequency (.176) and actually negative at the lowest level of frequency (−.038).
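The predicted effects described above follow from the form of a regression with a product term: the slope of item difficulty at a given centered frequency is the difficulty coefficient plus the interaction coefficient times that frequency. The sketch below illustrates the calculation. The coefficients and frequency values are hypothetical and are not the Table 2 estimates; they were chosen only to show how the slope can change sign.

```python
def conditional_slope(b_difficulty, b_interaction, centered_frequency):
    """Slope of item difficulty at a given (centered) frequency of use."""
    return b_difficulty + b_interaction * centered_frequency


# Hypothetical coefficients, for illustration only.
B_DIFFICULTY = 0.5    # effect of difficulty at the average frequency
B_INTERACTION = 0.25  # change in that effect per unit of frequency

slope_at_max = conditional_slope(B_DIFFICULTY, B_INTERACTION, 2.0)
slope_at_mean = conditional_slope(B_DIFFICULTY, B_INTERACTION, 0.0)
slope_at_min = conditional_slope(B_DIFFICULTY, B_INTERACTION, -4.0)
```

Because frequency was centered, the slope at the average frequency equals the difficulty coefficient itself, and a positive interaction coefficient makes the slope larger at high frequency and smaller, possibly negative, at low frequency.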
In terms of the specific core areas of medical knowledge, only psychiatric content was significantly related to the appropriateness scale. Items addressing psychiatry were more likely to be deemed appropriate for the Step 2 CK examination than typical internal medicine items (the reference group). In terms of item content regarding specific sites of medical care, only emergency care was significantly related to the appropriateness scale. Compared to ambulatory medicine items (the reference group), emergency items were less likely to be deemed appropriate for the Step 2 CK examination.
This study focused on the collection of data to support the interpretation of scores for the USMLE Step 2 CK examination. In the opinion of physicians familiar with the knowledge required to effectively care for patients, the overwhelming majority of the reviewed Step 2 CK content was deemed relevant to clinical practice and appropriate for the examination. This outcome supports the inferences made by Step 2 CK score users. Perceptions about the frequency of use of the reviewed content varied. Upon further analysis, frequency of use, item difficulty, and Step 2 CK appropriateness were related in a way that made intuitive sense. It appears that difficult material used frequently in clinical practice was likely to be rated especially appropriate for the Step 2 CK examination, whereas difficult material used infrequently in practice was likely to be rated less appropriate. Additional analysis established interesting patterns related to the clinical discipline of the test content and the site of care for an item.
This study was limited to some degree by the small number of physician judges. Furthermore, the expert judges’ participation in this exercise occurred after completion of an extensive standard-setting meeting. While this meeting potentially provided them with additional relevant insight, the unintended consequences of this schedule are unknown. Both the limited sample size and the schedule may affect the generalizability of these findings. Nonetheless, the expert judgments revealed strong support for the relevance of USMLE Step 2 CK item content to clinical practice and the appropriateness of the item content for the Step 2 CK examination. These results and the findings indicating sensible relationships between expert judgments and item characteristics appear to provide validation support for the Step 2 CK program.