The recent addition of a clinical skills component to the United States Medical Licensing Examination (USMLE), the Step 2 Clinical Skills (CS) examination, underscores the need to measure competencies deemed important in medical practice that do not readily lend themselves to more conventional (e.g., multiple-choice item) assessment modalities.1
Although alternative forms of assessment hold a great deal of promise in high-stakes contexts, they nonetheless need to be evaluated according to the same rigorous psychometric standards that are commonplace with more traditional examinations. The need to gather evidence to support various aspects of validity is particularly important in a high-stakes context given implications for scoring, score reporting, and fairness, among other issues. Validation is the process by which empirical evidence is gathered to substantiate inferences that a user might like to make based on test scores.2 Messick, in his unified validity framework, argues that all aspects of validity are intertwined and as such reflect various aspects of construct validity.3 Nevertheless, the researcher may wish to emphasize one, or more than one, of these aspects depending upon intended score use. In particular, the structural aspect of validity is important given broad implications for test development and scoring activities. The structural aspect refers to the extent to which the underlying dimensional structure of an assessment task is consistent with the hypothesized construct domain. In other words, can we gather evidence to support that examinees, when performing on CS cases, use knowledge sets and skills that correspond to domains outlined by experts in the test blueprint and as reflected in the scoring rubric? Factor analysis provides a useful framework to assess the extent to which hypothesized construct domains do indeed account for actual examinee performances.
Much of the earlier factor analytic work undertaken with CS examinations was exploratory in nature and suffered from a number of methodological limitations.4,5 First, most of the studies were not based on factor analysis but rather principal component analysis (PCA), which is a distinct data reduction technique. In PCA, the aim is to maximize item variances (both common and error) across a smaller number of orthogonal (i.e., uncorrelated) components. In contrast, the goal of factor analysis (FA) is to identify a set of unobservable variables (factors) common to all items that best account for the correlations among items or cases. More importantly, in FA, error variance is partialled out. Thus, PCA and FA can lead to the identification of very different solutions. Second, most published CS factor analytic research is based on very limited samples, thus precluding the use of estimation procedures that are better suited to the categorical nature of checklist item responses and expert ratings. Finally, much of the earlier validation work undertaken using FA with CS examinations is exploratory in nature. The interpretation of solutions in exploratory FA is often problematic in that conclusions are highly linked to the fit indices employed, many of which are arbitrary and potentially imprecise indicators of the true underlying dimensionality.6 Confirmatory factor analysis (CFA), on the other hand, entails assessing the fit of a prespecified model, based on theoretical and substantive considerations, to a correlation or covariance matrix. In CFA, the researcher posits a priori the structural relationships among variables and factors that are of interest to assess, including factor loadings and factor intercorrelations. This approach differs from exploratory factor analysis, where no a priori hypotheses are put forth. Exploratory analysis also typically involves rotating the factors to improve interpretability of loadings, either via orthogonal (produces uncorrelated factors) or oblique (produces correlated factors) methods. These steps are unnecessary in CFA as the exact relationships among variables and factors are postulated at the outset and tested empirically.
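The distinction between PCA and FA drawn above can be illustrated with a small numerical sketch. It is not taken from any of the studies cited. It assumes a hypothetical one-factor population with four items that each load 0.6, and it uses iterated principal axis factoring as one simple stand-in for FA. All names and values in the code are illustrative assumptions.

```python
import numpy as np

# Hypothetical one-factor population: four items, each with a true loading of 0.6.
true_loading = 0.6
n_items = 4
lam = np.full(n_items, true_loading)

# Implied correlation matrix: common variance off the diagonal, 1s on the diagonal.
R = np.outer(lam, lam)
np.fill_diagonal(R, 1.0)


def first_loadings(matrix):
    """Return loadings on the leading eigenvector of a symmetric matrix."""
    eigvals, eigvecs = np.linalg.eigh(matrix)  # eigenvalues in ascending order
    vec = eigvecs[:, -1]
    vec = vec * np.sign(vec.sum())  # resolve the arbitrary sign of the eigenvector
    return np.sqrt(eigvals[-1]) * vec


# PCA: decompose R as is, so the unique (error) variance on the diagonal is analyzed too.
pca_loadings = first_loadings(R)

# Principal axis factoring: put communality estimates on the diagonal and iterate,
# so that only common variance is analyzed.
communalities = 1.0 - 1.0 / np.diag(np.linalg.inv(R))  # squared multiple correlations
for _ in range(200):
    reduced = R.copy()
    np.fill_diagonal(reduced, communalities)
    paf_loadings = first_loadings(reduced)
    communalities = paf_loadings ** 2

print("PCA loadings:", np.round(pca_loadings, 3))
print("PAF loadings:", np.round(paf_loadings, 3))
```

In this constructed example the factoring step recovers the loading of 0.6 that generated the matrix, whereas the component loadings are larger because the component also absorbs unique variance. With real checklist and rating data the two solutions can diverge in less predictable ways.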
Some investigations did attempt to address these limitations by using confirmatory factor analytic modeling as well as estimation methods appropriate to the nature of CS data. De Champlain and Klass7 examined the fit of several FA models with older research prototypes of the USMLE Step 2 examination. They reported that data gathering, communication, and interpersonal skills tended to vary as a function of the type of challenge posed in the case. Although useful, the latter study was limited by the low-stakes nature of the examination. Recently, McKinley and Boulet8 examined the fit of case models, specifically as they related to the data gathering construct. Their findings suggest that a two-factor model, reflecting medical history taking and physical examination maneuvers, best accounts for examinee case performances. However, their analyses were restricted to a single case and as such are difficult to generalize.
Clearly, additional factor analytic studies need to be undertaken to better understand the underlying structure of CS examinations. The present research provides evidence that benefits the Step 2 CS examination program in several arenas, including, but not limited to, a better understanding of student response processes and improved scoring rubrics.
The purpose of this study was to gather evidence to support the structural aspect of validity of the USMLE Step 2 CS, using confirmatory factor analysis. Specifically, the fit of several confirmatory factor analytic models was assessed for a representative sample of USMLE Step 2 CS cases and examinees, with an emphasis on the generic skills configuration that reflects the current structure of the examination.
Measurement instrument and sample
The USMLE Step 2 CS examination measures test takers' data gathering (DG) skills, documentation skills, communication and interpersonal skills (CIS), and spoken English proficiency (SEP). For all cases, DG is assessed via a case-specific checklist, completed by the standardized patient (SP), that lists relevant history taking items and physical examination maneuvers. Documentation skills are measured using a patient note (PN), which is completed by the examinee after each encounter. In the PN, the examinee is expected to provide appropriate history and note significant physical findings as well as provide up to 5 differential diagnoses and options for a patient management plan. The PN is rated by a trained physician assessor using a 1–9 Likert scale. DG and documentation skills are further combined into a component labeled Integrated Clinical Encounter (ICE). CIS are measured using three case-invariant 1–9 Likert scale items, whereas SEP is assessed using a single 1–9 Likert scale item per case. Both sets of scales are completed by the SP after each encounter. Currently, examinees must pass the ICE, CIS, and SEP components to pass the overall examination.
The checklist responses and PN, CIS, and SEP ratings of 387 examinees who completed a form of the USMLE Step 2 CS between August 31, 2004 and October 7, 2005 were analyzed in this study. The sample was composed of 184 female examinees (47.5%) and 203 male examinees (52.5%). Additionally, the sample included 202 (52.2%) International Medical Graduates and 185 (47.8%) United States Medical Graduates. These breakdowns closely mirror those of the entire examinee population, differing by less than 3% in every instance. A subset of four cases was used in this study. These four cases constituted a broad sample of all cases outlined in the test blueprint, with respect to the challenge posed, chief complaint, and clinical presentation. From a clinical presentation perspective, one case was musculoskeletal in nature, whereas the remaining three cases focused on cardiovascular, gastrointestinal, and constitutional issues, respectively. For security reasons, it is not possible to provide additional information on these four cases.
Four scores computed for each case were analyzed in the present study: percent-correct checklist scores, PN ratings (1–9), CIS ratings (3–27, i.e., the sum of three 1–9 Likert scale items), and SEP ratings (1–9).
The three factor analytic models that were examined in this investigation are outlined in Table 1. In the Skills Model, it was hypothesized that the performance on the four USMLE Step 2 CS cases was a function of three underlying clinical competencies, i.e., ICE, CIS, and SEP. This currently corresponds to the scoring model for the USMLE Step 2 CS. In the Cases Model, it was postulated that the clinical scenario per se accounted for performance of examinees, i.e., each case was a factor loading exclusively on its own components (ICE, CIS, and SEP). This model clearly illustrates case specificity, where the content of the case accounts for examinee performance.9 The Hybrid Model is nearly identical to the Cases Model, with the exception that SEP was hypothesized to be a stand-alone factor unrelated to case content, based on past analyses. That is, we considered that level of spoken English proficiency did not depend upon clinical presentation, whereas ICE and CIS did. It is also important to note that the scores analyzed in the present investigation were equated to take into account case-SP stringency effects, with the exception of SEP, which was found not to vary as a function of the SP portraying the case. Thus, all case scores are comparable, regardless of which SP may have portrayed a given encounter.
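The loading patterns implied by the three models described above can be written out explicitly. The sketch below is a reading of the text and not a reproduction of Table 1. The indicator names are hypothetical, and treating the checklist score and the PN rating as separate indicators of ICE is an assumption based on the four scores reported per case.

```python
# Hypothetical labels for the four cases and the four scores analyzed per case.
CASES = ["case1", "case2", "case3", "case4"]
SCORES = ["checklist", "pn", "cis", "sep"]

# Assumption: checklist (DG) and PN (documentation) both serve as indicators of ICE.
SKILL_OF_SCORE = {"checklist": "ICE", "pn": "ICE", "cis": "CIS", "sep": "SEP"}


def skills_model():
    """Each score loads on its generic skill factor, regardless of case."""
    return {f"{c}_{s}": SKILL_OF_SCORE[s] for c in CASES for s in SCORES}


def cases_model():
    """Each score loads only on the factor for the case it came from."""
    return {f"{c}_{s}": c for c in CASES for s in SCORES}


def hybrid_model():
    """As the Cases Model, except SEP scores load on a stand-alone SEP factor."""
    return {
        f"{c}_{s}": ("SEP" if s == "sep" else c)
        for c in CASES
        for s in SCORES
    }


def n_factors(model):
    """Count the distinct factors in an indicator-to-factor mapping."""
    return len(set(model.values()))


for name, build in [("Skills", skills_model),
                    ("Cases", cases_model),
                    ("Hybrid", hybrid_model)]:
    model = build()
    print(f"{name} Model: {len(model)} indicators, {n_factors(model)} factors")
```

Each indicator loads on exactly one factor in every model, which is what makes the configurations restrictive and therefore testable. A mapping of this kind would still have to be translated into the model syntax of whatever estimation software is used.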
Factor analyses were run using the software package Mplus.10 The parameters of the models outlined in Table 1 were estimated using diagonally weighted least-squares estimation. This method is appropriate when at least one variable in the model is binary or ordered categorical, as is the case with the PN and SEP ratings. The fit of each model outlined in Table 1 was assessed using a chi-square goodness-of-fit statistic that is derived from the fit function minimized in the estimation of model parameters. It is important to point out, however, that chi-square distributed statistics often suffer from an inflated Type I error rate when based on large samples.11 That is, it is possible to incorrectly conclude that the fit of a given model is inappropriate for a data set. As a means of addressing this shortcoming, a descriptive index was also selected to assess model fit. McDonald’s Mk index is especially appealing as past research has shown it to be insensitive to sample size and estimation method.12 In addition, the Mk index ranges from 0 to 1 in value, which facilitates interpretation. As a practical guide, values of .9 or more are indicative of “acceptable” fit.12 Nevertheless, it is important to underscore that the relative fit of the three factor models will be compared as opposed to the absolute fit of any given solution. Practically speaking, it is of greater interest to compare the relative fit of the alternative models outlined in Table 1 rather than attempting to identify an “optimal” configuration from a statistical point of view. Adopting this relative approach is also congruent with views put forth by several factor analysts who maintain that no restrictive model fits the population and that all restrictive models are merely approximations.11 Consequently, our analyses were aimed at identifying the best fitting model among those examined, all of which were posited based on substantive considerations, rather than attempting to accept or reject an a priori false hypothesis.
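For readers unfamiliar with this type of descriptive index, the sketch below shows one common way a McDonald-type noncentrality index is computed from a chi-square statistic, its degrees of freedom, and the sample size. The formula is the widely cited form exp(-0.5 × (chi-square − df) / (N − 1)). Whether the present analyses used N or N − 1 in the denominator, and how values below the degrees of freedom were handled, is not stated here, so both choices in the code are assumptions. The inputs in the usage lines are made up for illustration and are not the values reported in Table 2.

```python
import math


def mcdonald_index(chi_square, df, n):
    """McDonald-type noncentrality fit index (a sketch under stated assumptions).

    chi_square: model chi-square statistic
    df: model degrees of freedom
    n: sample size
    """
    if n < 2:
        raise ValueError("sample size must be at least 2")
    # Assumption: negative noncentrality estimates are truncated at zero,
    # which keeps the index within the 0 to 1 range.
    noncentrality = max(chi_square - df, 0.0) / (n - 1)
    return math.exp(-0.5 * noncentrality)


# Illustrative, made-up inputs (N = 387 matches the sample size of this study only).
for chi_square in (100.0, 150.0, 200.0):
    value = mcdonald_index(chi_square, df=100, n=387)
    print(f"chi-square = {chi_square:.0f}, df = 100: index = {value:.3f}")
```

Because the index depends on the chi-square statistic only through its excess over the degrees of freedom, scaled by sample size, it can be used to rank competing models fitted to the same data, which is the comparative use made of it in this study.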
Goodness-of-fit results obtained for the three factor models that were examined are shown in Table 2. Using the rule of thumb laid out by McDonald and Mok, none of the models investigated would be judged to yield an “acceptable” fit. The Skills Model, which reflects the current Step 2 CS examination configuration, yielded the best fit among the three models that were examined. Conversely, the worst fitting model was the four-factor Cases Model. The fit of the five-factor Hybrid Model was better than that of the Cases Model, but worse than that obtained with the Skills Model.
Results obtained in the present investigation suggest that a generic skills-based model best accounts for the performance of examinees on the USMLE Step 2 CS examination. That is, examinees did appear to make use of data gathering and documentation skills, communication and interpersonal skills, as well as spoken English proficiency when completing the four USMLE Step 2 CS cases that were the focus of our study, irrespective of the clinical scenario depicted in each instance. As such, these findings provide strong evidence to support the structural aspect of validity of the USMLE Step 2 CS examination. Practically speaking, our results indicate that the scoring rubric and standard setting model currently in place with the USMLE Step 2 CS examination are appropriate and reflect the underlying structure of the examinee-case correlation matrix. Additionally, the most parsimonious model was the best fitting among those examined. Overparameterization or “overfitting” is a common pitfall that practitioners fall prey to when examining the fit of a number of alternative models. This is especially pervasive in exploratory analyses, where the desire to minimize or maximize various statistical fit criteria leads to the inclusion of factors that make little sense from a substantive perspective. It was encouraging to note in our research that a simpler model, reflecting substantive program considerations, yielded the best fit.
Surprisingly, the findings reported in our study contradict much of the content specificity literature published in the past.9 Past studies had indicated that the nature of the clinical scenario portrayed in a case accounted for performance rather than generic clinical skills. For example, an examinee might display strong communication skills in patient education centered stations but not in emotionally charged cases. What might account for the differences noted between our study and this past literature? One major difference may relate to motivational level of examinees. Past studies were almost entirely based on intact medical school classes (samples of convenience) completing CS examinations with widely diverging stakes, ranging from voluntary participation to graduation requirements. It is important to underscore that the dimensionality of an examinee-case matrix is not determined by the test itself but rather as a function of the interaction between a participating cohort of candidates and the examination.13 Thus, it is plausible to think that highly motivated examinees (as is the case with a licensing examination) might prepare for and approach CS examinations using more structured strategies. Relating to this point, it is also conceivable that the introduction of a high-stakes test of clinical skills as part of the licensure process may have led to more uniformity in the teaching of clinical skills at medical schools and the way in which students are prepared to take the examination. This should be explored in further research.
Although informative, it is important to interpret the results of our research in light of a number of caveats. First and foremost, our analyses were restricted to a relatively modest sample size and case set. Hence, future studies should be aimed at replicating our work with broader case sets and samples. Also, the cases examined in the present investigation did not reflect an intact form per se, which is composed of 11 operational cases. As such, it would be important to assess whether order effects, if any, might lead to different findings with respect to underlying structure. Additionally, future investigations should also be aimed at assessing the fit of alternative models not considered in the present study. As research efforts relating to the Step 2 CS examination increase, and consequently better inform our understanding of various aspects of the program, it may be possible to identify other models worthy of further consideration.
Despite these limitations, it is important to reiterate that the present study is one of the few that have closely examined the structure of a high-stakes, nationally administered test of clinical skills, using appropriate estimation methods and based on careful substantive considerations. Although preliminary, it is hoped that the findings reported in this investigation will not only provide empirical evidence to support the use of current scoring and standard setting models, but also lead to a better understanding of the way in which medical students perform on high-stakes CS examinations.