Measurement Precision of Spoken English Proficiency Scores on the USMLE Step 2 Clinical Skills Examination

Raymond, Mark R.; Clauser, Brian E.; Swygert, Kimberly; van Zanten, Marta

Section Editor(s): Szauter, Karen MD; Blackmore, David PhD

doi: 10.1097/ACM.0b013e3181b37d01
Psychometrics of High-Stakes Examinations

Background: Previous research has shown that ratings of English proficiency on the United States Medical Licensing Examination Clinical Skills Examination are highly reliable. However, the score distributions for native and nonnative speakers of English are sufficiently different to suggest that reliability should be investigated separately for each group.

Method: Generalizability theory was used to obtain reliability indices separately for native and nonnative speakers of English (N = 29,084). Conditional standard errors of measurement were also obtained for both groups to evaluate measurement precision for each group at specific score levels.

Results: Overall indices of reliability (phi) exceeded 0.90 for both native and nonnative speakers, and both groups were measured with nearly equal precision throughout the score distribution. However, measurement precision decreased at lower levels of proficiency for all examinees.

Conclusions: The results of this and future studies may be helpful in understanding and minimizing sources of measurement error at particular regions of the score distribution.

Correspondence: Mark Raymond, PhD, National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104; e-mail: (

Clear communication between physicians and patients is essential for the effective delivery of health care. As part of providing successful care, physicians must possess the language skills necessary for tasks such as soliciting a description of their patients' current symptoms, past medical histories, and pertinent family disease information. In addition, physicians must explain possible diagnoses and treatment regimens in a manner fully understandable to their patients, and they must ask unambiguous questions to verify comprehension. Ineffective communication may lead to patient misunderstanding, noncompliance, or other negative outcomes.1,2 To ensure that doctors can communicate effectively with their patients, standardized patients (SPs) have been widely used to assess communication skills in a simulated clinical environment.3

Although numerous factors contribute to satisfactory patient–physician dialog, physicians must possess the basic language skills necessary to communicate effectively with their patients. Because of workforce needs and training opportunities, many physicians who do not speak English as their first language choose to practice medicine in the United States. A measure of the spoken English skills of physicians entering the workforce in an English-speaking country is an important aspect of ensuring physician competence.4 As part of the United States Medical Licensing Examination Step 2 Clinical Skills (CS) Examination, SPs rate each examinee's English proficiency skills on each of 12 cases. Research has shown that these ratings are highly reliable,5,6 with generalizability coefficients ranging from 0.95 to 0.98. However, high reliability does not necessarily ensure adequate measurement precision. The score distribution for spoken English proficiency (SEP) tends to be bimodal, with the majority of native English speakers receiving high scores and large numbers of nonnative speakers receiving ratings in the middle range of the scale. This substantial variability in examinee proficiency for the combined group helps produce a high reliability index, so the pooled coefficient may overstate how precisely scores are measured within either group. Thus, the purpose of the present study was to more closely evaluate the precision of the Step 2 CS SEP scores for both native and nonnative speakers of English.

Method


Data source

The Step 2 CS Examination requires that each examinee interact with 12 SPs, each of whom portrays a specific clinical problem. Typically, 11 of these encounters are scored; one may be used for pilot testing. For each encounter, examinees have up to 15 minutes to obtain a history from the patient and perform a physical examination. After completing each interaction, examinees have 10 minutes to document their findings in a structured patient note. The SPs complete three scoring tools for each encounter: (1) a dichotomously scored checklist to assess the examinee's ability to collect a history and perform a focused physical exam, (2) three rating scales to evaluate communication and interpersonal skills, and (3) a nine-point rating scale to assess SEP. The SEP scale addresses, among other things, the frequency of mispronunciations, word choice errors, and the extent to which the examinee was asked to repeat himself or herself. In addition to the ratings provided by the SPs, a documentation score is assigned by trained physicians who evaluate the patient notes.

The present analyses are based on SEP scores for 29,084 first-time examinees who tested between July 15, 2007, and July 2, 2008, under normal test administration conditions. Forty-six percent of examinees were international medical graduates, 46% were female, and 39% indicated that English was not their native language. Participants had given prior approval for their data to be used for research purposes. All data were collected as part of routine testing and were deemed exempt from IRB approval. To ensure anonymity, personal identifying information had been removed from examinee records, and only group-level results are reported.

Analysis


Generalizability theory provides a comprehensive framework for evaluating the precision of performance ratings and other types of complex assessments.5,7 It uses analysis of variance to isolate the sources of variability in ratings. The present investigation is concerned with three sources of variance: the examinee effect (person variance, $\sigma^2_p$), the case effect (case variance, $\sigma^2_c$), and the examinee-by-case interaction effect (error or residual variance, $\sigma^2_e$). The examinee effect acknowledges that a rating assigned to an examinee should be influenced by the examinee's true level of proficiency. The case effect arises because some cases are inherently more difficult than others, so the rating an examinee receives depends in part on which cases he or she encounters. The interaction recognizes that certain cases will be easier for some examinees and more difficult for others. The interaction effect is almost always unpredictable and is considered random error variance. One further source of variability cannot be isolated in these data: the effect associated with the SP. Because each SP usually portrays a single case for an entire testing session, the case variance and SP variance are inseparable. The term case variance as used here therefore refers to both sources.
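The decomposition described above can be illustrated with a minimal sketch. It assumes a fully crossed examinee-by-case table with one rating per cell and uses the standard expected-mean-square estimators for that design; this is a simplification for illustration and is not necessarily the estimation procedure used in the study, in which cases are nested in examinees. The function name `variance_components` and the dictionary keys are hypothetical.

```python
import numpy as np


def variance_components(ratings):
    """Estimate person, case, and residual variance components.

    `ratings` is a (persons x cases) array with one rating per cell.
    Assumption: a fully crossed design, used here only for illustration.
    """
    x = np.asarray(ratings, dtype=float)
    n_p, n_c = x.shape
    grand = x.mean()
    person_means = x.mean(axis=1)
    case_means = x.mean(axis=0)

    # Mean squares for persons, cases, and the person-by-case residual.
    ms_p = n_c * np.sum((person_means - grand) ** 2) / (n_p - 1)
    ms_c = n_p * np.sum((case_means - grand) ** 2) / (n_c - 1)
    resid = x - person_means[:, None] - case_means[None, :] + grand
    ms_e = np.sum(resid ** 2) / ((n_p - 1) * (n_c - 1))

    # Solve the expected-mean-square equations for the components.
    # Estimates can come out negative in small samples; they are
    # returned as-is here.
    return {
        "person": (ms_p - ms_e) / n_c,
        "case": (ms_c - ms_e) / n_p,
        "residual": ms_e,
    }
```

With purely additive data (each rating equal to an examinee effect plus a case effect), the residual component is zero and the person and case components equal the sample variances of the two sets of effects, which is a convenient check on the arithmetic.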

Two sets of analyses were completed. First, variance components were estimated for the total group of examinees and separately for native and nonnative speakers of English. For each group, variance components were obtained for persons ($\sigma^2_p$), for cases ($\sigma^2_c$), and for the interaction effect ($\sigma^2_e$). These values were subsequently used to calculate the standard error of measurement (SEM) and an index of dependability (reliability) known as phi ($\Phi$). These indices are given by

$$\sigma^2_\Delta = \frac{\sigma^2_c + \sigma^2_e}{N_c}, \qquad \mathrm{SEM} = \sqrt{\sigma^2_\Delta}, \qquad \Phi = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_\Delta},$$

where $N_c$ refers to the number of cases ($N_c = 11$). The term $\sigma^2_\Delta$ represents the absolute error variance, and its square root is the SEM. Note that $\sigma^2_\Delta$ includes both random error and the more systematic case effect. The inclusion of $\sigma^2_c$ recognizes that scores are less precise when examinees see different cases and those cases are not equally difficult. The difference between the phi coefficient and the more common generalizability coefficient is that phi includes the case effect. However, in the present design in which cases are nested in examinees, the phi and generalizability coefficients are identical; the term phi is used for clarity.
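As a worked illustration of these indices, the sketch below computes the absolute error variance, the SEM, and phi from a set of variance components. The function names and the numeric components are hypothetical and chosen only to make the arithmetic easy to follow; they are not the estimates reported in Table 1.

```python
import math


def absolute_error_variance(var_case, var_resid, n_cases):
    """Absolute error variance: (case + residual variance) / number of cases."""
    return (var_case + var_resid) / n_cases


def sem(var_case, var_resid, n_cases):
    """Standard error of measurement: square root of the absolute error variance."""
    return math.sqrt(absolute_error_variance(var_case, var_resid, n_cases))


def phi(var_person, var_case, var_resid, n_cases):
    """Index of dependability: person variance over person plus error variance."""
    err = absolute_error_variance(var_case, var_resid, n_cases)
    return var_person / (var_person + err)


# Hypothetical components (not taken from Table 1), with 11 scored cases.
example_sem = sem(var_case=0.11, var_resid=0.44, n_cases=11)    # about 0.224
example_phi = phi(var_person=1.0, var_case=0.11, var_resid=0.44, n_cases=11)  # about 0.952
```

Because the error variance is divided by the number of cases, both indices improve as more cases are scored, provided the variance components themselves stay the same.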

Phi and the SEM are average indices of score precision; therefore, they do not express the extent to which precision can vary at different points along the score continuum. Given that the goal of this study is to evaluate measurement precision for two groups of examinees whose scores can be expected to fall in different ranges, it was important to evaluate measurement error at each score level. Generalizability theory includes provisions for calculating the SEM for specific score levels. This index is referred to as the conditional SEM (CSEM) and is obtained by:

$$\mathrm{CSEM}_p = \sqrt{\frac{\sum_{c=1}^{N_c} \left(X_{pc} - \bar{X}_{p\cdot}\right)^2}{N_c \,(N_c - 1)}},$$

where $X_{pc}$ is the rating assigned to examinee $p$ by case $c$, and $\bar{X}_{p\cdot}$ is the mean of all ratings for an examinee. The magnitude of the CSEM is determined by the extent to which the ratings from the 11 cases agree for an examinee, with larger values indicating less agreement.
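The conditional SEM for an examinee is the standard error of the mean of that examinee's ratings across cases. A minimal sketch follows, assuming the usual within-person formula from generalizability theory (sum of squared deviations from the examinee's mean, divided by N_c(N_c - 1), then square-rooted); the function name `conditional_sem` and the example ratings are hypothetical.

```python
import numpy as np


def conditional_sem(ratings):
    """Conditional SEM for one examinee (1-D input) or many (2-D, one row each).

    Each row holds the ratings an examinee received across cases.
    """
    x = np.asarray(ratings, dtype=float)
    n_c = x.shape[-1]
    dev = x - x.mean(axis=-1, keepdims=True)
    return np.sqrt(np.sum(dev ** 2, axis=-1) / (n_c * (n_c - 1)))


# An examinee rated 9 on all 11 cases has perfect agreement, so CSEM is 0.
uniform = conditional_sem([9] * 11)

# An examinee whose ratings range from 5 to 7 (mean 6) has squared
# deviations summing to 6, so CSEM = sqrt(6 / 110), about 0.234.
mixed = conditional_sem([5, 6, 7, 6, 5, 7, 6, 6, 5, 7, 6])
```

Averaging these values for examinees with similar mean scores gives the kind of score-level error curve that the overall SEM cannot show.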

Results


Table 1 summarizes observed SEP scores, variance components, and indices of measurement precision. Given that results were very similar across the five test sites, data were combined. The values in Table 1 are means weighted by the sample size for each site. Several observations about these results can be made. First, mean SEP scores are generally high, but they are significantly higher for native English-speaking examinees. Also note that the standard deviation for native speakers is much smaller than for nonnative speakers. These differences reflect the fact that nonnative speakers, most of whom were international medical graduates (IMGs), were more heterogeneous in terms of background and experience. The differences in variability lead to the expectation that the variance components for the two groups will also be different, which is evident in Table 1. However, the percentages associated with each component are similar across groups, and, consequently, the values of phi for native and nonnative English speakers are nearly identical. The phi coefficient for the total group is quite high (0.954), a value generally consistent with prior research.5,6 Finally, the overall SEMs for nonnative speakers of English are two to three times greater than the SEMs for native speakers. To completely understand this phenomenon, it is important to examine measurement error at specific score levels.
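The combination of nearly identical phi values and very different SEMs follows directly from the formulas: if every variance component in one group is a constant multiple of the corresponding component in the other, phi is unchanged while the SEM grows by the square root of that multiple. The sketch below illustrates this with hypothetical components (not the Table 1 estimates) and an assumed multiple of 6.25, which yields an SEM ratio of 2.5.

```python
import math


def phi_and_sem(var_person, var_case, var_resid, n_cases=11):
    """Return (phi, SEM) for one group's variance components."""
    err = (var_case + var_resid) / n_cases
    return var_person / (var_person + err), math.sqrt(err)


# Hypothetical components for a homogeneous group: (person, case, residual).
group_a = (0.16, 0.0176, 0.0704)

# A more heterogeneous group with every component 6.25 times larger,
# so the percentage of variance for each component is the same.
scale = 6.25
group_b = tuple(scale * v for v in group_a)

phi_a, sem_a = phi_and_sem(*group_a)
phi_b, sem_b = phi_and_sem(*group_b)

# phi_a and phi_b are equal (about 0.952); sem_b / sem_a is sqrt(6.25) = 2.5.
```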

Table 1


Figure 1 presents the CSEMs for the two groups of examinees. It is apparent that scores toward the middle and lower ends of the score scale are less precise than scores at the upper end, and that the overall SEM, represented by the gray horizontal line, incompletely describes the distribution of measurement error. Despite the elevated values in the middle region, it is encouraging that the CSEMs are very similar for native and nonnative English speakers.

Figure 1


Discussion


The findings of this study are consistent with the previous work indicating that SEP scores are, in general, very reliable.5,6 The average phi coefficient for the full group was above 0.95, which is remarkable for a performance assessment. The analyses also show that native and nonnative speakers of English are measured with about equal precision, but the measurement error for both groups increases toward the middle and bottom of the scale. There are several possible reasons for this. Some of the increase can be attributed to statistical artifact. Scores near the top of the scale for proficient examinees provide limited opportunity for disagreement due to ceiling effects (i.e., no rating can be greater than nine), while scores toward the middle of the scale present SPs with more scale points from which to choose. The issue may not be that measurement error increases toward the middle of the scale but, rather, that it declines toward the top. Rater leniency also seems to be a contributing factor. Subsequent analyses indicated that very lenient SPs were reluctant to assign low ratings to examinees viewed by other SPs as being less proficient, and this disagreement among ratings is a source of error. Moderate and stringent raters did not have this restriction-of-range problem; they tended to use more scale points.
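The ceiling explanation can be illustrated with a toy simulation. The sketch below is not the authors' analysis: it assumes each rating is an examinee's true proficiency plus normally distributed rater noise, rounded to the nearest whole number and clipped to the 1 to 9 scale. The noise level, sample size, seed, and the two true-score values are arbitrary assumptions.

```python
import numpy as np


def mean_csem(true_score, noise_sd=1.0, n_cases=11, n_examinees=20000, seed=0):
    """Average conditional SEM for simulated examinees sharing one true score."""
    rng = np.random.default_rng(seed)
    raw = true_score + rng.normal(0.0, noise_sd, size=(n_examinees, n_cases))
    # Ratings are whole numbers confined to the nine-point scale.
    ratings = np.clip(np.rint(raw), 1, 9)
    dev = ratings - ratings.mean(axis=1, keepdims=True)
    csem = np.sqrt((dev ** 2).sum(axis=1) / (n_cases * (n_cases - 1)))
    return csem.mean()


mid_scale = mean_csem(6.0)   # true score in the middle of the scale
near_top = mean_csem(8.8)    # true score close to the ceiling of 9
```

Under these assumptions the average CSEM is smaller near the top of the scale than in the middle, even though the underlying rater noise is identical, because clipping at 9 removes much of the room for disagreement.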

There is also the possibility that the performance of less proficient speakers varies throughout the testing day. An examinee's speech may improve as he or she acclimates to the test setting; alternatively, it may decline because of the fatigue associated with speaking a second language for an extended time. In addition, there may be short-term variability in performance associated with certain types of cases. An examinee with limited English language skills may become anxious and suffer a temporary deterioration in performance on a case that involves an emotional or otherwise difficult clinical problem. Although the present study does not address these issues, prior research has detected systematic changes in scores throughout the day.8 That research found that scores trended upward over the first several encounters for three of the four Step 2 CS score scales. Interestingly, SEP was the one score scale for which this sequence effect was not evident.

Most of the preceding discussion about the potential sources of increased measurement error is speculative and beyond the scope of this research. Nonetheless, such an understanding could provide insight into how measurement precision might be improved. The results of the present analyses do suggest one avenue for improving score precision. The case effect, $\sigma^2_c$, accounted for a nontrivial portion of the overall error variance. This value represents the combined effects of the case and the SP who portrays that case, and it is commonly designated as leniency or stringency error. It is possible to reduce or eliminate this effect by applying a statistical adjustment to correct for leniency or stringency. The National Board of Medical Examiners applies such an adjustment to other Step 2 CS scales to help improve reliability.6 Data from the present study suggest that statistical adjustment would reduce the overall SEM from about 0.23 to 0.20 for the total group of examinees. It is reasonable to expect that the improvement to the CSEM would be greatest in the middle region of the score scale, where precision is least. In the past, adjustments to SEP scores have not been made because of high reliability. The results of this study are consistent with the view that SEP scores are quite precise, but they also suggest that reconsideration of steps to enhance precision may be warranted.
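The effect of such an adjustment on the SEM can be sketched by removing the case term from the absolute error variance. The example below assumes an idealized adjustment that removes the case/SP effect entirely, whereas a real adjustment would only estimate it. The variance components are hypothetical values chosen so that the resulting SEMs fall near the 0.23 and 0.20 cited in the text; they are not the Table 1 estimates.

```python
import math


def sem(var_case, var_resid, n_cases=11, adjust_for_case=False):
    """SEM with or without the case (leniency/stringency) effect in the error."""
    # Assumption: a perfect adjustment removes the case effect completely.
    case_term = 0.0 if adjust_for_case else var_case
    return math.sqrt((case_term + var_resid) / n_cases)


# Hypothetical components, not taken from Table 1.
VAR_CASE = 0.14
VAR_RESID = 0.44

unadjusted = sem(VAR_CASE, VAR_RESID)                       # about 0.23
adjusted = sem(VAR_CASE, VAR_RESID, adjust_for_case=True)   # about 0.20
```

In this sketch the case effect makes up roughly a quarter of the error variance, so removing it lowers the SEM by about 13%.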

References


1 Stewart MA. Effective physician–patient communication and health outcomes: A review. CMAJ. 1995;152:1423–1433.
2 Levinson W, Roter DL, Mullooly JP, Dull VT, Frankel RM. Physician–patient communication. The relationship with malpractice claims among primary care physicians and surgeons. JAMA. 1997;277:553–559.
3 Rothman AI, Cusimano M. Assessment of English proficiency in international medical graduates by physician examiners and standardized patients. Med Educ. 2001;35:762–766.
4 Boulet JR, van Zanten M, McKinley DW, Gary NE. Evaluating the spoken English proficiency of graduates of foreign medical schools. Med Educ. 2001;35:767–773.
5 Clauser BE, Harik P, Margolis MJ. A multivariate generalizability analysis of data from a performance assessment of physicians' clinical skills. J Educ Meas. 2006;43:173–191.
6 Harik P, Clauser BE, Grabovsky I, Nungester RJ, Swanson D, Nandakumar R. An examination of rater drift within a generalizability framework. J Educ Meas. 2009;46:43–58.
7 Brennan RL. Generalizability Theory. New York, NY: Springer-Verlag; 2001.
8 Ramineni C, Harik P, Margolis MJ, Clauser B, Swanson D, Dillon GF. Sequence effects in the USMLE Step 2 Clinical Skills examination. Acad Med. 2007;82(10 suppl):S101–S104.
© 2009 Association of American Medical Colleges