Exploring Psychometric Models to Enhance Standardized Patient Quality Assurance: Evaluating Standardized Patient Performance Over Time : Academic Medicine

Research Reports

Exploring Psychometric Models to Enhance Standardized Patient Quality Assurance

Evaluating Standardized Patient Performance Over Time

Brown, Crystal B. MHS; Kahraman, Nilufer PhD

Academic Medicine 88(6):p 866-871, June 2013. | DOI: 10.1097/ACM.0b013e3182901647


The United States Medical Licensing Examination–Step 2 Clinical Skills (CS) assesses examinees’ proficiency in communicating with patients, as well as their ability to gather and document relevant patient information. Standardized patients (SPs), who are trained to portray real patients in a uniform fashion, play an integral part in this high-stakes assessment of clinical competence and interpersonal skills.1 The SP’s role in the clinical skills examination is instrumental because SPs rate examinees’ proficiency in speaking English, using interpersonal communication skills, obtaining a comprehensive medical history, and performing a relevant physical examination. An SP’s ability to objectively record the events of the clinical encounter is paramount. Timely and effective quality assurance necessitates training, monitoring, and evaluating SPs on a regular basis.

A considerable amount of research has shown that some SPs are more lenient than others and that SP leniencies may vary depending on the clinical aspect being rated.2–4 Equating procedures ensure that examinees have an equal opportunity to excel when presented with identical clinical content across SPs. These psychometric adjustments occur after the administration of the exam, prior to score reporting. However, during the clinical encounter, SPs, serving as the instrument of assessment, are the first point of contact with examinees. Therefore, before any data can be generated or psychometrically adjusted to correct for error in determining an examinee’s true performance, SPs make the first contribution to an equitable exam experience. Thus, the process of ensuring fairness and standardization essentially begins at the start of the exam, with the SP.5

Even though measurement models are generally used to monitor rater effects over time in many high-stakes performance assessments, to the authors’ knowledge, no studies have explored such effects for SP-based clinical skills examinations. In this study we discuss the usefulness of psychometric models for evaluating the quality of SP ratings and performance monitoring over time. More specifically, we evaluate the usefulness of difficulty and discrimination parameters computed by two alternative measurement models in achieving this goal. The difficulty parameter measures the leniency of an SP, whereas the discrimination parameter measures how accurately any one SP’s ratings can distinguish among examinees of differing ability. (Although the terms “difficulty” and “leniency” are generally interchangeable in this context, we have used “leniency” to describe the SP and/or SP rating behavior, and we have used “difficulty” to describe SP performance estimates on a medical case.) Before presenting this measurement methodology, we briefly describe the CS examination and current practices for assuring SP quality.

The Step 2 CS Exam

The CS examination is an SP examination (i.e., an exam in which examinees encounter people trained to portray real patients with medical problems). The National Board of Medical Examiners administers the examination at five test centers across the United States. During the examination, examinees encounter 12 medical cases and have up to 15 minutes to interact with the SP in the clinical encounter. SPs are responsible for rating examinees in three areas of clinical skills, and immediately after each encounter, the SP completes three component measures of those skills: (1) a checklist to document whether the examinee elicited responses relevant to patient history and performed the essential components of the physical examination (data gathering), (2) a 9-point rating scale to evaluate the examinee’s spoken English proficiency (spoken English), and (3) a set of three 9-point rating scales to evaluate the examinee’s communication and interpersonal skills (communication). To provide an in-depth demonstration of the methodology, our study focuses on the rating performance of the SP on the communication component only, which yields a score ranging from 3 to 27 (reflecting a sum of the three communication scales); higher scores indicate better performance.

Often, SPs are matched and trained to more than one medical case. This results in unique SP–medical case combinations (SP-cases). In this study, we explore SP performance given a particular case; that is, we explore SP performance at the SP-case level.

Current Practices for SP Quality Assurance

Because of the complex dynamic between SPs and examinees encountering each other in the Step 2 CS setting, evaluating SP performance using a combination of qualitative and quantitative approaches is important. Video monitoring is a qualitative method of quality assurance, and for selected encounters it provides detailed portrayal information for SP–examinee interactions where quantitative statistics cannot. However, quantitative methods have the advantage of taking each instance of SP performance into account in evaluating SP rating behavior. Thus, quantitative methods can establish a pattern of rater behavior that review of a random selection of videos cannot. Therefore, the two methods complement each other.

Quantitative measures are geared toward determining the leniency and the discrimination of SP rating behavior. Our study focuses on two theoretically different methods of measuring difficulty and discrimination parameters, dissecting their relative usefulness in evaluating SP performance over time. These two methods are based on classical measurement theory and the common factor model.


Data and time constructs

Because we evaluated SPs over time, the SP-cases included in the study needed to be active (used in scoring) in the exam throughout the entire year. SP availability, in tandem with test construction (i.e., exam form or test content), influences how often a given SP-case is active in the examination. Such considerations resulted in a sample of 88 SP-case combinations that were consistently active in test administrations at one site throughout 2010.

We formulated four time segments to compute and evaluate whether and to what extent the difficulty and discrimination parameters of these 88 SP-cases changed over time. Because SP statistics are calculated from the number of encounters an SP has with examinees on a particular case, our examination required that enough examinees be in each time segment to enable us to compute useful SP-case statistics to make inferences about SP rating performance. Hence, we aligned time segments with CS scoring cohort schedules used in 2010 and merged them, when feasible, to provide a sufficient number of examinees on which to conduct the analyses.

We constructed the first time segment (T1) from test administrations that occurred between January and March. All remaining time segments were consecutive such that the second (T2) was constructed from test sessions administered between April and mid-July, the third (T3) from test sessions administered between mid-July and early November, and the fourth (T4) from test sessions administered from early November to the end of December.

Ethical considerations

Participants had given prior approval for their scores to be used for research purposes. We collected data for this study as part of routine test administration; therefore, these data were exempt from institutional review board inspection. We removed personal identifying information from examinee records to ensure anonymity, and we have reported only SP-case results.

Computing SP-case statistics

To compare both psychometric methods (classical measurement theory and the common factor model) in determining SP-case difficulty and discrimination parameters over time, we computed two sets of SP-case statistics. We produced one set of difficulty and discrimination statistics using the classical method and another set of difficulty and discrimination statistics using the common factor model. We did this for each SP-case in each of the four time segments.

The first set of statistics, based on the classical measurement theory, consisted of means and case-corrected total score correlations. Such SP-case statistics are currently the main quantitative method of assessment in maintaining and assuring SP-case quality. The SP-case mean represents the expected total score for a random examinee (random in that his or her representation in the sample is free of bias and subject to equal probability of representation in the sample). This statistic measures the leniency of an individual SP’s rating behavior. On a given case, higher values indicate more lenient SPs.

The case-corrected total score correlation represents the correlation between an examinee’s score on a given case and that examinee’s total score. This correlation is computed excluding the case of interest from the examinee’s total score to avoid erroneously high correlations between the case communication score and the examinee total communication score. The case-corrected total score correlation is a measure of SP discrimination. Correlation values can in principle range from -1 to 1, although values between 0 and 1 are expected in this setting. Values closer to 1 indicate SPs with more ability to discriminate between examinees of low and high ability. We computed classical parameter estimates using SPSS (PASW Statistics 18.0; Armonk, New York).
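As a rough illustration of the two classical statistics described above, the following Python sketch computes an SP-case mean and a case-corrected total score correlation. It is not the SPSS procedure used in the study; the function names and the four-examinee data set are hypothetical and exist only for illustration.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has no variance.
    """
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def classical_sp_case_stats(case_scores, total_scores):
    """Return (difficulty, discrimination) for one SP-case.

    case_scores[i]:  communication score (3 to 27) examinee i received
                     from this SP on this case.
    total_scores[i]: the same examinee's total communication score
                     across all cases, including this one.
    """
    # Remove the case of interest from each total so the case score
    # is not correlated with itself.
    corrected_totals = [t - c for c, t in zip(case_scores, total_scores)]
    difficulty = mean(case_scores)  # higher = more lenient SP
    discrimination = pearson(case_scores, corrected_totals)
    return difficulty, discrimination


# Hypothetical data for four examinees (not taken from the study).
case = [18, 20, 22, 24]
total = [200, 215, 240, 250]
difficulty, discrimination = classical_sp_case_stats(case, total)
print(difficulty, round(discrimination, 3))
```

With real data, the same computation would be repeated for each SP-case within each time segment.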

The second set of statistics consisted of intercepts and factor loadings, which we computed using a common factor model. This is a new approach to quantifying SP-case quality using a latent trait modeling framework.6 The latent trait in this analysis is examinee ability, so we estimated common factor model statistics taking examinee ability into account. Using Mplus version 5 (Los Angeles, California), we estimated the common factor model statistics, fixing the latent trait distribution to have a mean of zero and a variance of one7 (so that we could interpret average performance at zero along the performance distribution and interpret the relative variation of performance around the mean in standardized units of one).

The SP-case intercept computed by the common factor model is comparable to the mean computed by the classical measurement approach and reflects the expected SP rating on a case for a random examinee of average ability; higher values indicate easier cases. The factor loading computed by the common factor model is comparable to the case-corrected total score correlation, as both are measures of discrimination. The factor loading on an SP-case represents the expected increase in an examinee’s communication score on that SP-case per unit increase in examinee ability; higher values indicate more discriminative SP-cases. Unlike the discriminations computed by the classical measurement model (case-corrected total score correlations), the discriminations computed by the common factor model are on the original communication scale, which facilitates interpretation.

For the remainder of the report, we refer to means and case-corrected total score correlations based on classical measurement theory as classical parameters. We refer to intercepts and loadings based on the common factor model as model-based parameters. We evaluated the SP-case statistics resulting from the classical measurement and common factor models to determine the relative usefulness of each in evaluating SP-case difficulty and discrimination over time.

Additionally, in a similar fashion, we compared the qualitative method of video review to estimates from the common factor model.


In the first of three sections to follow, we selected three illustrative SP-case examples to depict SP performance at different levels of efficacy and to explain how to interpret and compare classical and model-based difficulty and discrimination parameters (Table 1). In the second section, to further determine the relative usefulness of a particular method in determining SP-case difficulty and discrimination, we ultimately categorized all SP-cases into performance groups based on their relative standing in the sample (Table 2). Finally, in the third section we addressed how model-based statistics may be a helpful addition to qualitative SP quality assurance measures.

Table 1: Three Illustrative Standardized Patient (SP)-Cases: Communication Difficulty and Discrimination Statistics

Table 2: Total Sample: Number of Standardized Patient (SP)-Cases Across Performance Categories* for the Classical Measurement Model and for the Common Factor Model

Evaluating individual SP-case performance over time using difficulty and discrimination statistics: Three illustrative examples

Table 1 provides three representative examples of how SP-case statistics can be used to evaluate SP performance within and across time segments, along with the sample mean statistics for difficulty and discrimination estimates under both models. Beginning with SP-case 1, these examples represent the range of performance from favorable to unfavorable given group averages, and in so doing, they imply the degree of statistical adjustment needed to calibrate rating behavior. In the first time segment (T1), SP-case 1 had a classical difficulty parameter estimate of 19.31, suggesting that this SP was somewhat stringent on the case when compared with the classical parameter mean of 20.34 in the total sample. The model-based difficulty estimate of 19.28 was very similar to the classical difficulty estimate and slightly lower than the model-based average of the total sample (20.35). The classical discrimination estimate was 0.50 (compared with the mean of 0.47 from the total sample), indicating a moderate association between this SP’s ratings on the case and the total communication score for examinees.

The corresponding model-based estimate (which is on the original communication scale) was 1.17, slightly lower than the group average (1.31). This suggests that the expected score on this SP-case would be 1.17 points higher for an examinee whose ability is one standard deviation (SD) above the mean ability of the population. That is, such an examinee would have an expected communication score of 20.45 on the case if he or she encountered this particular SP in the examination. An expected score of 20.45 is the sum of the expected score for a random examinee of average ability encountering this SP-case and the discrimination (on the communication scale) for this SP-case (19.28 + 1.17). When we compare SP-case statistics similarly over the remaining three time segments, SP-case 1 appears to be consistent over time with just slightly less than average expected ratings for model-based discriminations as well as for difficulty estimates across both methods. This reflects a reasonably favorable scenario in which the communication ratings of this SP-case exhibit stability over time and would not need substantial statistical adjustments.
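The arithmetic behind the expected score of 20.45 can be sketched in a few lines of Python. The function name is hypothetical; the intercept (19.28) and loading (1.17) are the T1 model-based estimates for SP-case 1 reported above, and ability is expressed in SD units because the latent trait was fixed to a mean of zero and a variance of one. The sketch does not clip results to the 3 to 27 range of the communication scale.

```python
def expected_case_score(intercept, loading, ability_z=0.0):
    """Expected communication score on an SP-case under a linear
    one-factor model: intercept + loading * ability.

    ability_z is examinee ability in SD units (0 = average ability).
    """
    return intercept + loading * ability_z


# SP-case 1, time segment T1 (estimates quoted in the text).
average_examinee = expected_case_score(19.28, 1.17)        # ability at the mean
one_sd_above = expected_case_score(19.28, 1.17, 1.0)       # ability 1 SD above

print(round(average_examinee, 2))  # 19.28
print(round(one_sd_above, 2))      # 20.45
```

By the same logic, an examinee one SD below average ability would be expected to score 19.28 - 1.17 = 18.11 on this SP-case.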

The change observed in the performance of SP-case 2 over time is slightly more than that for SP-case 1 and is more typical of the study sample. SP-case 2 difficulty estimates indicate that this SP was more stringent when compared with those in the sample (with classical and common factor model estimates approximately 1 SD below the pool average) for T1, T2, and T4, and most stringent in T3 (with estimates approximately 1.5 SDs below the pool average). The classical discrimination estimates show a consistently moderate association between the SP’s ratings on this case and examinee total scores on communication (with correlations ranging from 0.55 to 0.58). These classical discriminations, compared with the corresponding averages over time, suggest that this SP-case was more discriminating relative to the pool average. Corresponding model-based discrimination estimates suggest similar results (with discriminations ranging from 1 to 2 SDs above time-segment-pool averages), further revealing that this SP-case’s discrimination improved over time, especially in time segments T3 and T4 (where discrimination values for this SP-case were highest at 2.53 and 2.17, respectively). These high values may be due to the more sensitive nature of the common factor model in capturing SP-case discriminations. SP-case 2 is an example of an SP who demonstrates relatively high discrimination and slightly less-than-average expected ratings over time. However, mostly because of performance in T3, his or her rating behavior is less stable than that of SP-case 1, thus requiring relatively more statistical adjustment over time.

SP-case 3 is a less typical example. The classical and model-based difficulty estimates for SP-case 3 are very similar and slightly above average at each time point, suggesting that the SP’s ratings on the case are consistently more lenient than the average SP rating. Rater consistency in SP-case 3 is lower relative to SP-case 1 and SP-case 2 because of greater variability in discrimination parameters. The classical discrimination estimate for T1 is 0.35, somewhat lower than the average (0.47) observed for the pool. The SP-case discrimination improves over time, demonstrating a moderate association that is more typical of the sample. Model-based discriminations are in alignment with the classical discriminations but appear to capture time segment differences more explicitly (with estimates over time that differ by 0.5 to 2 SDs from the total sample average). This degree of variation in discrimination over time suggests that a relatively more extensive degree of statistical adjustment would be necessary for this SP-case compared with the other two examples.

Results from these three examples selected from the SP-case performance distribution indicate that both the classical and model-based parameters can be practical for monitoring trends in SP-case difficulty. However, model-based estimates appear more refined in their ability to detect SP changes in discrimination over time, thus offering another potentially useful element to monitor SP-case discrimination parameters longitudinally.

Evaluating SP-case performance over time using difficulty and discrimination statistics: The total sample

To evaluate performance patterns in the entire SP-case sample over time, we examined the difficulty and discrimination distributions for both the classical and model-based statistics. To classify group performance in each time segment, we created five SP-case performance groups: unsatisfactory, below average, average, above average, and exceptional. SP-cases with difficulty statistics that were more than 2 SDs below the group mean were assigned to the unsatisfactory performance group. Those with difficulty statistics that were 1 to 2 SDs below the group mean were assigned to the below average group. SP-cases with difficulty statistics within ±1 SD of the mean were assigned to the average group. SP-cases with difficulty statistics 1 to 2 SDs above the group mean were assigned to the above average group. Finally, SP-cases with difficulty statistics that were more than 2 SDs above the mean were assigned to the exceptional group. We applied the same categorization to the discrimination parameters.
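The five-group rule described above can be sketched as follows. This is an illustrative Python version, not the authors' procedure: the function names are hypothetical, the sample SD is assumed (the text does not say whether a sample or population SD was used), and values falling exactly on a boundary are assumed to go to the category nearer the mean, since the text leaves those cases unspecified. The seven input values are made up.

```python
from statistics import mean, stdev


def performance_group(value, group_mean, group_sd):
    """Assign one statistic to a performance group by its distance
    from the group mean in SD units."""
    if group_sd <= 0:
        raise ValueError("group_sd must be positive")
    z = (value - group_mean) / group_sd
    if z < -2:
        return "unsatisfactory"   # more than 2 SDs below the mean
    if z < -1:
        return "below average"    # 1 to 2 SDs below the mean
    if z <= 1:
        return "average"          # within +/- 1 SD of the mean
    if z <= 2:
        return "above average"    # 1 to 2 SDs above the mean
    return "exceptional"          # more than 2 SDs above the mean


def classify_all(values):
    """Classify every SP-case statistic in one time segment against
    the mean and (sample) SD of that same set of values."""
    m, sd = mean(values), stdev(values)
    return [performance_group(v, m, sd) for v in values]


# Hypothetical difficulty estimates for seven SP-cases in one segment.
estimates = [18, 19, 20, 20, 20, 21, 22]
for value, group in zip(estimates, classify_all(estimates)):
    print(value, group)
```

The same function applies unchanged to discrimination estimates, whether classical correlations or model-based loadings, because it only uses each value's standing relative to its own distribution.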

Table 2 lists the number of SP-cases observed in each category for both classical and model-based statistics and shows that the classical and model-based difficulty parameter estimates have compellingly similar distributions. SP-case discrimination distributions, however, tell a different story. Table 2 illustrates that the classical and model-based distributions for discrimination are not so perfectly aligned. Although the number of SP-cases classified as average in discrimination is similar across methods and time segments, discrimination estimates vary around the mean much more for model-based estimates. The model-based distribution identified more SP-cases in categories other than the average category in T1 and T3—and in T4, despite the same number of SP-cases identified as average, the model-based distribution dispersed the remaining SP-cases across more performance categories than the classical distribution. Overall, the common factor model demonstrates a wider performance distribution of SP-cases. This wider distribution suggests that the model-based discriminations, over time, may be more consistently capable of identifying SP-cases across a wider performance range than the classical model discriminations.

A comparison of a psychometric approach to a qualitative approach in diagnosing problematic SP-cases

To further investigate the usefulness of a quantitative approach in diagnosing problematic SP-cases, we compared model-based estimates with the current qualitative method of SP-case monitoring, which is video review. For every encounter reviewed by video, SP-cases are assigned a letter grade based on a four-point grade point average (GPA) scale (4 = A, 3 = B, 2 = C, 1 = D). A grade of “A” indicates exceptional performance. Differences in performance categorization between the GPA statistics and the model-based estimates are of particular interest because, on the basis of practical thresholds of acceptability, they may indicate discord in how quality assurance resources are allocated.

To make GPA statistics comparable with the SP-case performance categories in Table 2, we calculated a mean GPA for each SP-case for each time segment and subsequently computed an overall SP-case mean GPA for each time segment. Performance group criteria remained the same as those for comparisons between the classical and model-based methods. Overall, the distribution patterns of the GPA and model-based statistics were similar to those observed in the comparison between the classical and model-based distributions. Explicitly, the GPA and model-based distributions were quite similar for the difficulty parameter. However, just as in the comparison between the classical and model-based methods, the quantitative approach based on SP-case statistics was able to detect finer SP performance differences than the qualitative approach because of the availability of discrimination parameters. Over time, there was more dispersion of SP-cases across the performance distribution for the discrimination parameter with the model-based estimates than with GPA statistics. The observance of larger differences in discrimination signifies more variation in discrimination when model-based parameters are compared with SP video monitoring outcomes, which further demonstrates the greater potential of the common factor model to capture variation in SP-case discrimination.
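The GPA aggregation described above might be sketched as below. The grade-to-point mapping comes from the text (4 = A, 3 = B, 2 = C, 1 = D); the record layout, function names, and example reviews are hypothetical, and the overall mean for a time segment is assumed here to be the unweighted mean of the SP-case means.

```python
from collections import defaultdict
from statistics import mean

GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1}


def mean_gpa_by_sp_case(reviews):
    """reviews: iterable of (sp_case_id, time_segment, letter_grade),
    one tuple per video-reviewed encounter.

    Returns {(sp_case_id, time_segment): mean GPA}.
    """
    points = defaultdict(list)
    for sp_case, segment, grade in reviews:
        points[(sp_case, segment)].append(GRADE_POINTS[grade])
    return {key: mean(vals) for key, vals in points.items()}


def overall_mean_gpa_by_segment(sp_case_means):
    """Average the SP-case mean GPAs within each time segment."""
    by_segment = defaultdict(list)
    for (_sp_case, segment), gpa in sp_case_means.items():
        by_segment[segment].append(gpa)
    return {segment: mean(vals) for segment, vals in by_segment.items()}


# Hypothetical video-review records.
reviews = [
    ("sp_case_a", "T1", "A"),
    ("sp_case_a", "T1", "B"),
    ("sp_case_a", "T2", "A"),
    ("sp_case_b", "T1", "C"),
]
case_means = mean_gpa_by_sp_case(reviews)
print(case_means)
print(overall_mean_gpa_by_segment(case_means))
```

The resulting SP-case means can then be placed into performance groups relative to the segment mean in the same way as the difficulty and discrimination statistics.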


Research has shown that the interpersonal relationship between physician and patient in the practice of medicine may affect patient outcomes.8 Therefore, proper and accurate assessment of competency in communication and interpersonal skills is a vital part of medical education.9 SPs are an essential element in the high-stakes assessment of clinical skills. Although variations in SP performance can be adjusted psychometrically for scoring purposes, ensuring that SPs sustain an acceptable level of performance prior to such adjustment is imperative. Exploring methods to enhance the monitoring of SP performance is in keeping with the goal of state-of-the-art clinical skills assessment.

Video monitoring and evaluation of classical difficulty and discrimination estimates by SP-case are a customary part of the approach to comprehensive SP quality assurance in high-stakes assessment of clinical skills. Video monitoring covers the qualitative observation of all aspects of selected SP encounters, whereas statistics based on classical measurement theory have proven useful for providing quantitative assessments of leniency and discrimination among SPs portraying identical cases. The use of both types of evaluation triggers more qualitative investigation into seemingly aberrant patterns of SP-case behavior to inform training interventions. Longitudinally, incorporation of the classical measurement model or the common factor model in evaluating SP-case difficulty and discrimination can establish performance trends to compel interventions to assure SP quality. Although SP-case difficulty evaluation appears to benefit equally from the use of either the classical measurement method or the common factor model method, the common factor model offers an additional layer of inference with regard to evaluating SP-case discrimination. The common factor model may also dictate better allocation of training intervention resources, helping to guard against both missed opportunities for remediation and unnecessary remediation.

Although the application presented here is limited to the communication component of the clinical skills examination, the methodology illustrated could easily apply to the data gathering or the spoken English components that are also evaluated by SPs. The potential value of such a method to infer rater behavior over time can also extend to other arenas of medical assessment and training10 or general practice research11 that involve SPs.

The application of the common factor model, however, is not without its challenges. After comparing the SP-case statistics of the classical and common factor models, it may seem that the effort to implement the common factor model approach outweighs the rewards of detecting finer, albeit generally small, differences in discrimination among SPs. However, considering that in high-stakes assessment every rater has to achieve and maintain an acceptable level of performance for eligibility to participate in the exam, immense differences in SP performance are rare. Therefore, detecting what may be considered small differences in SP performance is useful to identify SPs who are not performing as well as their counterparts. In a high-stakes environment where raters are trained to standardized performance, any detectable difference is noteworthy and potentially actionable to maintain acceptable performance standards. We believe that maintaining acceptable performance with more refined methods, whenever possible, adds value to the process of identifying and, if necessary, retraining SPs who are not performing well.

Large datasets are necessary to implement common factor model analysis, which limits how often such analyses can be conducted longitudinally. We recognize this as a potential barrier in settings such as medical schools, where large datasets may be hard to obtain. In the absence of large datasets, classical estimates are more readily available and easier to compute, and they may therefore be the more viable option. Ultimately, however, model-based estimates are an explicitly predictive measure of SP performance and could better inform training interventions and avert potential fallacies associated with simple comparison of SP classical statistics on a given case. Therefore, in our assessment environment we compromise by using classical and model-based performance measures together, in a coordinated way, over time. We feel this combined approach may augment timely and effective quality assurance practices. Although both in concert appear to be useful, model-based parameter estimates seem to provide a more sensitive method of detecting changes in SP rating behavior and may be preferred when sample sizes and other data design conditions permit model estimation.

Acknowledgments: The authors sincerely appreciate the support and contributions of Jeannette Sanger in the process of compiling sample data for this research.

Funding/Support: None.

Other disclosures: None.

Ethical approval: Data were collected as part of routine test administration and were, therefore, exempt from institutional review board inspection. All participants had given prior consent for scores to be used for educational research, and all personal information was removed to ensure anonymity.


1. Hawkins RE, Swanson DB, Dillon GF, et al. The introduction of clinical skills assessment into the United States Medical Licensing Examination (USMLE): A description of USMLE Step 2 Clinical Skills (CS). J Med Licensure Discipline. 2005;91:21–25
2. Iramaneerat C, Yudkowsky R. Rater errors in a clinical skills assessment of medical students. Eval Health Prof. 2007;30:266–283
3. Margolis MJ, Clauser BE, Swanson DB, Boulet JR. Analysis of the relationship between score components on a standardized patient clinical skills examination. Acad Med. 2003;78(10 suppl):S68–S71
4. Clauser BE, Harik P, Margolis MJ. A multivariate generalizability analysis of data from a performance assessment of physicians’ clinical skills. J Educ Meas. 2006;43:173–191
5. De Champlain AF, Macmillan MK, Margolis MJ, King AM, Klass DJ. Do discrepancies in standardized patients’ checklist recording affect case and examination mastery-level decisions? Acad Med. 1998;73(10 suppl):S75–S77
6. Kahraman N, De Champlain A, Raymond M. Modeling the psychometric properties of complex performance assessment tasks using confirmatory factor analysis: A multi-stage model for calibrating tasks. Appl Meas Educ. 2012;25:79–95
7. Muthén LK, Muthén BO. Mplus: Statistical Analysis With Latent Variables. User’s Guide (Version 5). Los Angeles, Calif: Muthén & Muthén; 2007
8. Beck RS, Daughtridge R, Sloane PD. Physician–patient communication in the primary care office: A systematic review. J Am Board Fam Pract. 2002;15:25–38
9. Carraccio C, Wolfsthal SD, Englander R, Ferentz K, Martin C. Shifting paradigms: From Flexner to competencies. Acad Med. 2002;77:361–367
10. Adamo G. Simulated and standardized patients in OSCEs: Achievements and challenges 1992–2003. Med Teach. 2003;25:262–270
11. Beullens J, Rethans JJ, Goedhuys J, Buntinx F. The use of standardized patients in research in general practice. Fam Pract. 1997;14:58–62
© 2013 Association of American Medical Colleges