In medicine, the use of standardized patients (SPs) for instruction, evaluation, licensure, and certification is now widespread. History-taking, physical examination, and communication skills are generally evaluated by SPs or physician observers following the encounter.1,2 In either instance, detailed scoring criteria are developed to ensure that evaluation is consistent and objective. This is commonly accomplished through use of checklists developed specifically for the case. These checklists often specify the tasks (i.e., questions and maneuvers) a physician should normally complete given the reason for the patient visit. Unfortunately, little research has been conducted to determine the psychometric structure of checklists, although stations are frequently designed to measure skill in both medical history taking and physical examination.3
Research on the underlying structure of standardized patient examinations has typically been conducted at the station (or case) level.4,5 One study, although conducted at the examination level, tested the dimensionality of the checklist scores and focused on history-taking skills.6 The authors reported that history-taking skills could comprise as many as five different dimensions, suggesting that the latent traits associated with clinical competence should be taken into account when the checklists are initially constructed.
Various statistical techniques can be used to test hypotheses regarding the constructs normally associated with clinical competence. For example, structural equation modeling (SEM) can be used to test hypotheses regarding the relationship between measurable, observed variables (e.g., medical interviewing) and unobserved variables (e.g., patient care). The purpose of the current investigation is to illustrate the potential use of SEM with scores from a standardized patient examination administered under high-stakes conditions. Checklist items for a single case were examined to compare two models within an SEM framework: a single, underlying “data gathering” construct, and a two-factor model consisting of medical history taking and physical examination. Consideration of the estimated parameters (slope, intercept) for each checklist item, in addition to the overall fit of the model, can be useful to guide the development of cases that match particular psychometric needs. As a result, a more valid and reliable assessment of candidate proficiencies can be expected.
CSA. The Educational Commission for Foreign Medical Graduates (ECFMG®) is responsible for certifying that graduates of international medical schools are qualified to enter accredited graduate medical education programs in the United States. To be awarded an ECFMG certificate, candidates must have their medical school diploma directly verified and pass the United States Medical Licensing Examination (USMLE™) Steps 1 (basic science) and 2 (clinical science). Until recently, candidates for ECFMG certification were also required to obtain an acceptable score on an English comprehension test (Test of English as a Foreign Language – TOEFL) and pass the Clinical Skills Assessment (CSA®).* Candidates who meet all these requirements can apply to residency programs in the United States.
Case checklists for CSA were developed by a committee of 11 practicing physicians and medical educators. Once the case scenarios were created, the committee determined the pertinent scoring criteria for the simulation. The checklists included the relevant history-taking questions and physical examination maneuvers that would be expected to be asked or performed in the process of providing adequate patient care. Checklist items were approved by the entire committee. For CSA administrations, case checklist scores were calculated as the percentage of the predetermined medical history questions asked and physical examination maneuvers performed. All checklist items were weighted equally. Cases were assigned to test forms based on content characteristics, including primary category (e.g., abdominal, constitutional, neurological), acuity, patient age, and patient gender. For the current investigation, a checklist for an 80-year-old woman who complains of pain in the right hip after a fall was used in the analysis. There were 2,793 candidate–SP encounters available for analysis.
Using factor-analytic techniques, the checklist data were analyzed to determine the dimensional structure of the case—unidimensional (data gathering) or multidimensional (history taking, physical examination). The AMOS structural equation modeling software package was employed.7 In SEM, several statistics are provided to determine whether the model adequately accounts for covariances among the variables.8 The chi-square statistic can be used as a measure of congruence between the data and the hypothesized structure. Other fit indices were obtained to further investigate model-data fit. The root mean square error of approximation (RMSEA) can be used to test the hypothesis that the model fits the data closely.9 An RMSEA of zero indicates perfect model-data fit, values less than 0.05 indicate “close fit,” and values between 0.05 and 0.08 indicate “fair fit.”10 Based on the application of these models, and the choice of dimensional structure, the psychometric characteristics of checklist items (intercepts, analogous to difficulty; slopes, analogous to discrimination) can also be investigated. The parameter estimates for the two models were calculated using maximum likelihood procedures.
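The RMSEA point estimate can be recovered directly from a model's chi-square, its degrees of freedom, and the sample size. The sketch below uses the standard formula, sqrt(max(χ² − df, 0) / (df × (N − 1))), with the sample size of 2,793 encounters reported for this case; the function name is illustrative.

```python
import math

def rmsea(chi2, df, n):
    """Point estimate of the root mean square error of approximation.

    Standard formula: sqrt(max(chi2 - df, 0) / (df * (n - 1))).
    A value of 0 indicates perfect fit; < 0.05 "close" fit; 0.05-0.08 "fair" fit.
    """
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# Fit statistics reported for this case (N = 2,793 candidate-SP encounters)
one_factor = rmsea(1645.8, 172, 2793)   # approximately 0.055, rounds to 0.06
two_factor = rmsea(1373.1, 170, 2793)   # approximately 0.050
```

Applying the formula to the two models' reported chi-square values reproduces the RMSEA values of 0.06 and 0.05 discussed in the results.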
The χ2 statistic obtained for the single-factor model was statistically significant (χ2 = 1645.8, p < .001; df = 172), indicating that there were statistically significant discrepancies between the sample covariance matrix and the estimated covariance matrix. The RMSEA was 0.06 (p < .001), indicating that model-data fit was fair. For the model in which two factors (medical history and physical examination) were hypothesized, the chi-square was also statistically significant (χ2 = 1373.1; p < .001; df = 170), indicating that there were still significant discrepancies between the sample covariance structure and estimated covariance structure. The RMSEA obtained for this model was 0.05 (p = .40), indicating that, despite the significance of the χ2 obtained, the fit was adequate. In order to evaluate the relative fit of each of the models, a χ2 difference statistic comparing the models was calculated. The result was statistically significant (χ2 = 320.9; p < .001; df = 2), suggesting that the two-factor model is preferred over the one-factor model. Parameter estimates for the checklist items for the one-factor (data gathering) and two-factor (medical history and physical examination) models are provided in Table 1.
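Because the one-factor model is nested within the two-factor model, the difference between their chi-square statistics is itself chi-square distributed, with degrees of freedom equal to the difference in model df (172 − 170 = 2). A minimal sketch of the significance computation, using the difference statistic reported above and the closed-form survival function for a chi-square variable with 2 df:

```python
import math

# For a chi-square variable with 2 degrees of freedom, the survival
# function has a closed form: P(X > x) = exp(-x / 2).
delta_chi2 = 320.9          # reported chi-square difference statistic
delta_df = 172 - 170        # one-factor df minus two-factor df
p_value = math.exp(-delta_chi2 / 2)

# p_value is far below .001: the two-factor model fits significantly better
```

For df other than 2 there is no simple closed form, and a general chi-square survival function (e.g., from a statistics library) would be used instead.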
The results presented are analogous to a confirmatory factor analysis, except that structured means are employed. The table provides the intercepts (mean proportion receiving credit), which are the same, regardless of model specified. The intercept is analogous to the proportion of candidates awarded credit (p value) in classical item analysis. The data provided show that candidates seldom received credit for asking about recent changes in medication (item 11, intercept = 0.04). In contrast, almost all candidates received credit for asking questions concerning past medical history (history item 9, intercept = 0.98).
The discrimination indices (standardized regression weights) are presented by model (one factor, two factors) for each of the checklist items in the case. The standardized regression weights indicate the degree to which the theorized scores increase (or decrease) in relation to item performance, in standard deviation units. Item 7 [“What makes pain worse?”] contributed little to the overall score (slope = −0.08). Similarly, physical examination items 14 [“Checks range of motion of at least right hip in at least one direction”] and 17 [“Palpates over area(s) of pain and trochanteric area of hip”] were negatively correlated with the hypothesized “data gathering” trait.
For the two-factor model, although overall fit was moderately improved, the same items were negatively correlated with the hypothesized factors. The results of this analysis suggest that item 7 is not contributing much to the overall history score (slope = –0.03). For the physical examination factor, the items that show a negative relationship with the case score appear to be those related to hip pain: item 14 (slope = –0.24) and item 17 (slope = –0.26). As physical examination scores increase by one standard deviation, performance on these items would decrease by one-quarter of a standard deviation. These parameter estimates can be interpreted as discrimination indices for the checklist items as well. The correlation between the history-taking and physical examination traits was –0.32. This result provides additional evidence that, although model-data fit was improved with the two-factor model, the hypothesized components (i.e., medical history and physical examination) may not be totally indicative of the underlying psychometric structure for this particular case.
The purpose of the current investigation was to explore the dimensional structure of data derived from a SP case and to illustrate how the parameters that were estimated from an SEM model could be used to determine the utility of specific checklist content. Two models were applied to the data for a single case; the first model hypothesized a single factor, data gathering, and the other hypothesized two factors, history taking and physical examination. Although other models could be entertained and evaluated, particularly since model-data fit was fair, at best, there was little theoretical basis to support them. Perhaps history taking could be split into social history, family history, review of systems, etc., but this would necessarily result in few indicator variables per latent construct. Moreover, it is not clear that history-taking questions are independent, further confounding this type of modeling. Finally, while the models considered may not adequately account for the underlying construct, this may be idiosyncratic to the particular case studied. Replication of this type of analysis with other cases and across cases is certainly warranted.
The parameters estimated by each model can be used to evaluate checklist item performance. Items with low or negative discrimination (standardized regression) parameters can be reviewed for content and considered for deletion in scoring. Alternatively, the parameters could be used as empirical weights for scoring checklists. This process is somewhat different from traditional item analysis in that the criterion measure, or measures, is the estimated factor scores. Through estimation procedures similar to those in an item response theory framework, the checklist item parameters are less likely to vary as a function of the abilities of the examinees being tested. More important, if the trait (e.g., data gathering) is multidimensional, and one can confirm the structure, then the ability of specific items to discriminate along specific dimensions can be ascertained. Traditional item analyses most commonly presuppose a unidimensional structure. If this assumption is not correct, and checklist items are being revised or deleted based on item discrimination statistics, the resulting case scores may be less reliable and, most likely, less valid.
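The empirical-weighting idea can be sketched as follows. The loadings and response pattern below are hypothetical, not the values from Table 1; the sketch simply shows one way estimated standardized loadings could replace the equal weights used in the original percent-correct scoring, with negatively discriminating items dropped.

```python
# Hypothetical 5-item checklist: standardized loadings used as empirical
# weights instead of equal weights. Values are illustrative only.
loadings = [0.41, 0.35, -0.08, 0.52, 0.27]
responses = [1, 1, 0, 1, 0]   # 1 = credit awarded on the item

# Drop negatively discriminating items (weight them zero)
weights = [max(load, 0.0) for load in loadings]

# Proportion of attainable weighted credit earned by this candidate
earned = sum(w * r for w, r in zip(weights, responses))
attainable = sum(weights)
score = earned / attainable
```

Under equal weighting this candidate would score 3/5 = 0.60; under the hypothetical loadings the score rises, because the items answered correctly happen to be the more discriminating ones.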
The SEM approach described here may also be useful for exploring other potential sources of invalidity in checklist scores. Often, multiple SPs portray the same case role. Although these individuals can be standardized to some extent, differences in portrayal, or in scoring, could manifest themselves in changes to the underlying structure of what is being measured. Here, multigroup models,11 in which groups are defined by which SP portrayed the case, could be contrasted, both in terms of fit and the resulting item parameters. These types of analyses, in which the abilities of the examinees seen by each of the SPs can be controlled for, will yield information concerning SP bias.
Although the application of factor analytic techniques with SP examinations can provide useful information that could be used to enhance case development and scoring, there are certain limitations. First, to obtain stable parameter estimates, relatively large examinee samples are required. This may not be possible for some medical school programs. Second, depending on the type of case and the choice of SP, the dimensional structure may vary, complicating the modeling process. Third, the evaluation of model-data fit can be influenced by the choice of statistical criteria. Finally, for poorly constructed checklists, item dependence may confound interpretation of the data. Here, linked checklist items, especially for physical examination, may result in observed item-level variability in candidate performance, resulting in parameter estimates that are difficult to interpret.
Although the application of factor-analytic techniques to performance-based SP examinations is not without difficulties, both statistical and interpretative, the results from this investigation suggest that valuable information can be gained. These data can be used both to inform the case-development process and to modify existing scoring systems.
1 Whelan GP. Educational Commission for Foreign Medical Graduates: Clinical Skills Assessment prototype. Med Teach. 1999;21:156–60.
2 Reznick RK, Blackmore D, Dauphinee WD, Rothman AI, Smee S. Large-scale high-stakes testing with an OSCE: report from the Medical Council of Canada. Acad Med. 1996;71:S19–S21.
3 Hamann C, Volkan K, Fishman MB, Silvestri RC, Simon SR, Fletcher SW. How well do second-year students learn physical diagnosis? Observational study of an objective structured clinical examination (OSCE). BMC Med Educ. 2002;2 〈http://www.biomedcentral.com/1472-6920/2/1〉.
4 Volkan K, Simon SR, Baker H, Todres ID. Psychometric structure of a comprehensive objective structured clinical examination: a factor analytic approach. Adv Health Sci Educ. 2004;9:83–92.
5 Chesser AMS, Laing MR, Miedzybrodzka ZH, Brittenden J, Heys SD. Factor analysis can be a useful standard setting tool in a high stakes OSCE assessment. Med Educ. 2004;38:825–31.
6 Jacobs JC, Denessen E, Postma CT. The structure of medical competence and results of an OSCE. Netherlands J Med. 2004;62:397–403.
7 Arbuckle JL. AMOS Version 5.0 [computer software]. Chicago: SmallWaters Corporation, 2003.
8 Bagoglu M. A brief introduction to structural equation modeling techniques: theory and application. Texas A&M University Commerce Department of Psychology, 1999.
9 Smith TD, McMillan BP. A primer of model fit indices in structural equation modeling. Paper presented at the Annual Meeting of the Southwest Educational Research Association, New Orleans, LA, February 1–3, 2001.
10 Byrne BM. Structural Equation Modeling with LISREL, PRELIS, and SIMPLIS: Basic Concepts, Applications, and Programming. Mahwah, NJ: Erlbaum, 1998.
11 Grosset JM. A multigroup structural equation modeling approach to test for differences in the educational outcomes process for African American students from different socioeconomic backgrounds. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, CA, April 18–22, 1995.
*The CSA requirement has been replaced by the USMLE Step 2 CS, a similar examination.
Moderator: Sandy Cook, PhD
Discussant: Rebecca Lipner, PhD