International Variation in Performance by Clinical Discipline and Task on the United States Medical Licensing Examination Step 2 Clinical Knowledge Component

Holtzman, Kathleen Z.; Swanson, David B. PhD; Ouyang, Wenli MS; Dillon, Gerard F. PhD; Boulet, John R. PhD

doi: 10.1097/ACM.0000000000000488

Purpose To investigate country-to-country variation in performance across clinical science disciplines and tasks for examinees taking the Step 2 Clinical Knowledge (CK) component of the United States Medical Licensing Examination.

Method In 2012 the authors analyzed demographic characteristics, total scores, and percent-correct clinical science discipline and task scores for more than 88,500 examinees taking Step 2 CK for the first time during the 2008–2010 academic years. For each examinee and score, differences between the score and the mean performance of examinees at U.S. MD-granting medical schools were calculated, and mean differences by country of medical school were tabulated for analysis of country-to-country variation in performance by clinical discipline and task.

Results Controlling for overall performance relative to U.S. examinees, results showed that international medical graduates (IMGs) performed best in Surgery and worst in Psychiatry for clinical discipline scores; for clinical tasks, IMGs performed best in Understanding Mechanisms of Disease and worst in Promoting Preventive Medicine and Health Maintenance. The pattern of results was strongest for IMGs attending schools in the Middle East and Australasia, present to a lesser degree for IMGs attending schools in Europe, and absent for IMGs attending Caribbean medical schools.

Conclusions Country-to-country differences in relative performance were present for both clinical discipline and task scores. Possible explanations include differences in learning outcomes, curriculum emphasis and clinical experience, standards of care, and culture, as well as the effects of English as a second language and relative emphasis on preparing students to take the Step 2 CK exam.

Ms. Holtzman is director, Assessment and International Operations, American Board of Medical Specialties, Chicago, Illinois.

Dr. Swanson is vice president, Assessment Programs Unit, National Board of Medical Examiners, Philadelphia, Pennsylvania.

Ms. Ouyang is measurement analyst, Measurement Consulting Services Unit, National Board of Medical Examiners, Philadelphia, Pennsylvania.

Dr. Dillon is vice president, Assessment Programs Unit, National Board of Medical Examiners, Philadelphia, Pennsylvania.

Dr. Boulet is associate vice president, Research and Data Resources, Foundation for Advancement of International Medical Education and Research, Philadelphia, Pennsylvania.

Funding/Support: None reported.

Other disclosures: None reported.

Ethical approval: This study was determined to be exempt from further review by the American Institutes for Research, the National Board of Medical Examiners’ external institutional review board.

Previous presentations: Abstract presented at the Annual Conference of the Association for Medical Education in Europe, Lyon, France, August 2012.

Correspondence should be addressed to Ms. Holtzman, American Board of Medical Specialties, 353 North Clark St., Suite 1400, Chicago, IL 60654; telephone: (312) 436-2699; e-mail:

Undergraduate medical education (UME) is remarkably diverse around the world.1 Length of training varies from three to seven years, with some students entering medical school immediately after completing secondary school (around age 18), while others first complete a baccalaureate degree. Individual medical schools may accept both types of students, with different tracks for each. There is also substantial country-to-country variation in clinical instruction and trainees’ patient experiences. Students in some medical schools have relatively little direct experience with individual patients, while others have early and continual exposure to patient care, including clerkship experiences with substantial one-on-one direct patient care.2 Depending on the country and school, clerkships may be predominantly hospital based or may include multiple ambulatory experiences.3 Health care systems also differ by country, ranging from a UK-style national health service to a Canadian-style single-payer system to a U.S.-style employer-based insurance system coupled with public programs. In addition, standards of care, drug formularies, and cultural practices are not consistent around the world. Finally, while the mission of many medical schools is to prepare their graduates to practice in that country, some medical schools focus on preparing students to practice in other countries. For example, students in some Caribbean schools are being prepared to enter postgraduate training in the United States or other countries.4,5 These contextual differences potentially influence the training that undergraduate medical students receive and the learning outcomes they achieve.

There is some research on the extent of variation in cognitive learning outcomes among medical students trained in countries outside the United States. A small-scale study of recent administrations of the International Foundations of Medicine (IFOM) Clinical Science Examination6,7 compared the performance of students attending medical schools in the United States with that of medical students from seven countries in Europe, the Middle East, and the Pacific Rim. Although U.S. medical students performed better in all areas, after controlling for these differences in overall performance, international medical students scored relatively better in Medicine and Surgery and relatively worse in Psychiatry. The study had limitations, however: The IFOM examination was taken at different points in the medical curriculum at different schools, and at some institutions it was used as a purely formative assessment, while at others it was used to inform residency selection. Given these limitations, it is hard to know whether the observed differences reflect true differences in cognitive performance or are due to other confounding factors.

In our study, we aimed to build on the findings of the IFOM study while addressing several of its limitations. We sought to determine the extent to which patterns observed on the IFOM examination are also present on the Step 2 Clinical Knowledge (Step 2 CK) component of the United States Medical Licensing Examination (USMLE). Step 2 CK is a one-day, computer-based test administered throughout the year to assess medical students’ understanding of clinical science knowledge. Examinees attending U.S. medical schools typically sit for Step 2 CK during their senior year, while the timing for international medical graduates (IMGs) varies. In addition to their use in medical licensure, Step 2 CK scores are also widely used in the resident selection process in the United States.

With more than 30,000 examinees per year, drawn from a wide array of countries and schools, Step 2 CK provides a good context for investigating country-to-country differences in cognitive learning outcomes. Because of its use in both U.S. medical licensure and the resident selection process, it carries consistently high stakes for everyone taking it. In addition, examinees choose when they are ready to sit for the examination, and detailed information about examination content and format is widely available. We reasoned that these factors would narrow differences among schools and countries, so any differences observed in our analyses were probably larger before examinees prepared for Step 2 CK. Thus, in the context of the globalization of medical education, a large-scale study to better understand possible performance differences would provide important information to inform curricular development at the UME and graduate medical education (GME) levels, selection of GME trainees, and strategies for examinees preparing for the examination.

Method

We used test forms that were administered during the 2008, 2009, and 2010 academic years; analyses were conducted during the 2012 calendar year. Each test form included 352 items divided into eight 1-hour, 44-item sections. Virtually all items were in clinical vignette format, describing a patient care situation and requiring examinees to answer a question related to mechanisms of disease, diagnosis, or the next step in care. The dataset included first-time examinees from U.S. MD-granting medical schools, as well as from medical schools in countries averaging 60 or more first-time examinees annually in the Australasian, Caribbean, European, and Middle Eastern regions.

Analyses were conducted on total test scores, five clinical discipline scores (Medicine, Surgery, Obstetrics and Gynecology, Pediatrics, and Psychiatry), and four clinical task scores (Promoting Preventive Medicine and Health Maintenance, Understanding Mechanisms of Disease, Establishing a Diagnosis, and Applying Principles of Management). In the aggregate, across test forms and academic years, more than 9,000 unique items were included, and more than 1,000 items contributed to calculation of mean scores for each clinical discipline and task.

We applied the following procedure to each clinical discipline and task score.

  1. We calculated the percent-correct score for each examinee taking Step 2 CK for the first time.
  2. We then subtracted the mean percent-correct score on the corresponding test form for first-time U.S. examinees from the examinee percent-correct score, yielding a “difference score.” The difference score reflected each examinee’s performance relative to the U.S. mean for the associated score and form on a percent-correct scale.
  3. For each country of medical school, we averaged the difference scores across examinees from that country to obtain mean difference scores relative to the United States. Because we used U.S. means in the previous step, this transformation resulted in scores with a mean of 0 on each form for examinees from U.S. medical schools.
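The three-step procedure above can be sketched in a few lines of Python. The examinee records and country names below are hypothetical placeholders, not data from the study; the sketch only illustrates the arithmetic of form-specific difference scores and country means:

```python
import statistics
from collections import defaultdict

# Hypothetical records: (country of medical school, test form,
# fraction of items answered correctly). Values are illustrative only.
examinees = [
    ("USA", "FormA", 0.78), ("USA", "FormA", 0.82), ("USA", "FormA", 0.80),
    ("X",   "FormA", 0.70), ("X",   "FormA", 0.74),
]

# Steps 1-2: compute each percent-correct score, then subtract the mean
# percent-correct of U.S. first-time examinees on the same test form.
us_scores = defaultdict(list)
for country, form, frac in examinees:
    if country == "USA":
        us_scores[form].append(frac * 100)
us_mean_by_form = {form: statistics.mean(v) for form, v in us_scores.items()}

diffs = defaultdict(list)
for country, form, frac in examinees:
    diffs[country].append(frac * 100 - us_mean_by_form[form])

# Step 3: average difference scores by country of medical school.
# The U.S. mean is 0 on each form by construction.
mean_diff = {country: statistics.mean(v) for country, v in diffs.items()}
print(mean_diff)
```

In the actual study this calculation was repeated for the total test score and for each of the five clinical discipline and four clinical task scores, so each examinee contributed a full set of difference scores.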

Because a score was calculated for each examinee for each clinical discipline and task, examinees served as their own controls in comparison of difference scores. This approach allowed us to identify areas of relative strength and weakness by region and country of medical school, as well as allowing comparisons of total Step 2 CK performance.

This study was reviewed by the American Institutes for Research, which serves as the National Board of Medical Examiners’ external institutional review board (IRB). It was determined to be exempt from further IRB review because the analysis involved deidentified extant data, where the research team did not have access to any identifying information.

Results

In total, we analyzed 88,692 first-time examinees’ Step 2 CK performance: 54,955 from U.S. MD-granting medical schools and 33,737 from schools outside the United States. Table 1 provides information about examinees’ Step 2 CK performance broken down by region and country, along with mean difference scores by clinical discipline and task.

Table 1

Across regions and countries, pass rates ranged from 66% to 97%, and mean total scores ranged from 196 to 230. For the 54,955 examinees from U.S. medical schools, the pass rate was 97%, and the mean Step 2 CK score was 232 with a standard deviation (SD) of 22. By design, mean percent-correct difference scores for the U.S. group were 0.0 for all clinical disciplines and tasks. The SD for total test difference scores was 6.9% (approximately three points on the Step 2 CK score reporting scale).

Outside of the Caribbean, mean difference scores in clinical disciplines tended to be higher for Surgery and lower for Psychiatry. For example, the mean difference score in Psychiatry for examinees from Egyptian medical schools was 16.9% below that for examinees from U.S. schools, while total test scores were 10.0% lower, and Surgery scores were only 8.6% lower. For most countries, performance in Medicine was similar to the total test, reflecting, in part, that a large percentage of Step 2 CK items contribute to calculation of the Medicine score. With some variation across countries, mean difference scores for Obstetrics and Gynecology and Pediatrics were also similar to total test scores.

Although the amount varied by country, there were higher mean difference scores (relative to the total test) for the clinical task Understanding Mechanisms of Disease. Mean differences for Establishing a Diagnosis and Applying Principles of Management were generally close to the total test mean difference. Lower mean differences were seen for Promoting Preventive Medicine and Health Maintenance in all regions except the Caribbean, where differences were similar to the total test mean.

Discussion

Although total test scores by country were highly variable, we found some commonalities in country-to-country variations in relative Step 2 CK performance. For clinical discipline scores, the performance of IMGs tended to be strongest in Surgery and weakest in Psychiatry. Across regions, the overall pattern of differences was similar to that reported in the IFOM study of Clinical Science Examination scores,6 as well as a study of USMLE Step 2 scores reported more than 15 years ago.8 For clinical task scores, we found that IMGs performed best in Understanding Mechanisms of Disease and worst in Promoting Preventive Medicine and Health Maintenance. This pattern varied somewhat across regions and countries, but it was strongest for students attending schools in the Middle East and Australasia, less pronounced for students from schools in Europe, and almost absent for students from Caribbean schools.

The performance of examinees attending medical school in Caribbean countries was notably different from other regions of the world in that all mean difference scores tended to be closely clustered around the country’s total test mean difference. This pattern is consistent with the goal of many Caribbean schools to prepare their students for entry into U.S. postgraduate training, and in fact, many of these students also complete clerkship rotations in the United States. As a consequence, we would expect greater consistency with U.S. medical schools in curricular emphasis, patient experience, and standards of patient care. It is possible that performance patterns may also be similar for English-language tracks in European schools (and elsewhere) with goals and arrangements for clerkship instruction that are similar to many Caribbean schools. However, our study did not report on school-specific results except for countries with only a single medical school.

There are many potential reasons for the variation in the cognitive learning outcomes found in our study, including differences in curricula, clinical experiences, and patient populations encountered by trainees. Other potential factors that could explain our findings include variations in standards of care, cultural differences, differential effects of English as a second language, and a school’s emphasis on preparing students to take the three-step USMLE. Relative to the United States, international medical schools may place greater emphasis on mechanisms of disease, and students may spend more time on rotations in internal medicine and surgery. Several of these factors could have an adverse impact on performance in Psychiatry, where differences in culture, terminology, and standards of care may be particularly important. Recent investigations of medical school accreditation practices suggest that the structure and quality of educational programs also can have an impact on performance on Step 2 CK and USMLE more generally.5,9,10 Overall, more detailed investigations, including longitudinal studies, are needed to understand relationships between educational practices and student performance on Step 2 CK and USMLE overall.

While the variability in Step 2 CK performance may be attributable to a number of factors, our findings describe detailed performance information that should be useful for evaluation and development of curricula. International schools that want to improve their students’ performance on Step 2 CK could use our findings of relative strengths and weaknesses to modify their existing programs or pursue new curricular approaches. Similarly, students in some countries could guide their study efforts in areas where prior test takers have shown some weaknesses. For programs with large proportions of residents from international medical schools, this information should be useful in anticipating areas where it may be desirable to provide additional training during internship.

Our study has several limitations. First is the use of Step 2 CK performance as a measure of cognitive learning outcomes. Step 2 CK is designed to assess readiness for U.S. postgraduate training. As a consequence, performance on it is of substantial importance to examinees (and some medical schools). However, there are also traits not measured by Step 2 CK that are critical to success in residency training and the provision of patient care. Second, this is a descriptive study from which causal relationships can only be suggested for further investigation. Third, the examinees choosing to sit for Step 2 CK may or may not be representative samples of students from a school or country; therefore, it is unclear whether their total test performance is likely to be better or worse than that of others from the same school or country. Because each examinee contributed a full set of difference scores, however, the patterns of relative performance by clinical discipline and task should not be greatly affected, even if Step 2 CK examinees tend to be somewhat better or worse than typical graduates from the associated country or medical school. Finally, the use of difference scores relative to U.S. mean performance is not intended to imply that the U.S. way is the right way. Some of the large mean difference scores observed in this study may reflect an over- or underemphasis in U.S. medical school curricula. However, most graduates of international medical schools take Step 2 CK because they want to complete residency training and obtain licensure in the United States. The reality is that Step 2 CK scores are important for licensure and for interview and admission decisions for many GME programs.

Our study provides important information about international differences in the cognitive learning outcomes assessed by Step 2 CK. Taking total test performance into account, we found that there were large differences in scores in clinical disciplines such as Psychiatry; the causes for these differences are likely complex and multifactorial. Though not as large as for Psychiatry, score differences for Promoting Preventive Medicine and Health Maintenance were consistently more negative than those for total scores except in the Caribbean; this may primarily reflect international variation in standards of preventive care. Additional work is merited to determine the reasons for and the practical implications of these performance differences for medical schools, residency programs, and examinees.

References

1. Wijnen-Meijer M, Burdick W, Alofs L, Burgers C, Ten Cate O. Stages and transitions in medical education around the world: Clarifying structures and terminology. Med Teach. 2013;35:301–307
2. Dezee KJ, Artino AR, Elnicki DM, Hemmer PA, Durning SJ. Medical education in the United States of America. Med Teach. 2012;34:521–525
3. Clancy CM, Cronin K. Evidence-based decision making: Global evidence, local decisions. Health Aff (Millwood). 2005;24:151–162
4. Boulet JR, Norcini JJ, Whelan GP, Hallock JA, Seeling SS. The international medical graduate pipeline: Recent trends in certification and residency training. Health Aff (Millwood). 2006;25:469–477
5. van Zanten M, Boulet JR. Medical education in the Caribbean: Quantifying the contribution of Caribbean-educated physicians to the primary care workforce in the United States. Acad Med. 2013;88:276–281
6. Swanson DB, Holtzman KZ, Grabovsky I, Phebus J, Angelucci K. Country-to-country variation in item difficulty on the 2010 English-language version of the International Foundations of Medicine (IFOM) Clinical Science Examination. Paper presented at: Annual Conference of the Association for Medical Education in Europe; August 2011; Vienna, Austria
7. National Board of Medical Examiners. International Foundations of Medicine. Accessed July 12, 2014
8. Ripkey DR, Case SM, Swanson DB, Melnick DE, Bowles LT, Gary N. Performance of examinees from foreign schools on the clinical science component of the United States Medical Licensing Examination. In: Scherpbier AJJA, van der Vleuten CPM, Rethans JJ, van der Steeg AFW, eds. Advances in Medical Education. 1997 Dordrecht, Netherlands Kluwer Academic Publishers
9. van Zanten M, McKinley D, Durante Montiel I, Pijano CV. Medical education accreditation in Mexico and the Philippines: Impact on student outcomes. Med Educ. 2012;46:586–592
10. van Zanten M, Boulet JR. Medical education in the Caribbean: Variability in medical school programs and performance of students. Acad Med. 2008;83(10 suppl):S33–S36
© 2014 by the Association of American Medical Colleges