French Curricular Reforms
In 2000, sweeping reforms to the medical curriculum in France were instituted by a legislative commission mandated to carry out this task. These reforms directly affected the entering second-year class of the Deuxième Cycle des Études Médicales (DCEM) in October 2001 (year three of medical school, described more fully below).
The medical curriculum in France begins after high school and is divided into three phases. Phase I (Premier Cycle des Études Médicales or PCEM) is devoted to the basic sciences required for the study of medicine. At the end of the first year, intramural examinations are administered to select the small proportion (about 10%) of students who will be allowed to continue with their medical education. Phase II (Deuxième Cycle des Études Médicales or DCEM) covers some basic biomedical sciences in its first year (DCEM 1) but focuses primarily on the clinical sciences and clinical training (years two to four, i.e., DCEM 2–DCEM 4). Phase III (Internat) lasts three to five years and is devoted to supervised training. Students wishing to undertake specialty training during Phase III must pass a comprehensive clinical written examination prior to admission. Students who do not wish to undertake specialty training obtain positions in family medicine residency programs and graduate as general practitioners.
The recent reforms entail transitioning from a discipline-based pedagogic approach to an integrative model that favors active and cross-disciplinary learning. Additionally, the new curriculum will emphasize better integration of theory and practice, small-group learning, and self-directed learning. A final goal of these curricular reforms is to enhance the medical student's ability to critically analyze and synthesize data, and to foster the development of requisite skills for the third phase of professional training.
France's Need for a Mandatory National Residency Examination
Currently, French students wishing to enter specialty residency programs must pass the Internat examination during the DCEM 4. This examination is a composite written assessment with a relatively small number of multiple-choice items and essay questions. Separate examinations are administered regionally, and students must take local examinations in all regions where they wish to pursue postgraduate training. These examinations are not well related to the current curriculum; a more general examination of clinical sciences may be more appropriate. The French government has affirmed that an examination program for national residency selection will be implemented by 2004. For this 2004 examination, a one-to-three-hour written test, relying exclusively on essay items, is being proposed as the sole criterion for residency admission. Given the well-documented psychometric shortcomings of constructed-response items in general, it seems inadvisable to rely solely on this assessment modality as a selection tool for entry into French residency programs.1–3 Unacceptable score reliability estimates have repeatedly been reported in the literature with constructed-response items, largely because of the well-known phenomenon of content specificity.4 Consequently, an unacceptably large number of constructed-response items (more than is usually practically feasible) needs to be administered to obtain suitable reliability estimates for selection purposes in a high-stakes context.
Collaboration between France and the NBME
As part of its efforts to explore means of achieving a high-quality examination program for residency selection, an agency of the French Ministry of Health invited the leadership of the National Board of Medical Examiners (NBME) to present an overview of the development of national examination programs in the United States at a special conference convened in the fall of 2000. At the invitation of the dean of the Faculté de Médecine Université de Nice Sophia–Antipolis, we provided a similar presentation to the leadership of that institution. Following a series of discussions between our institutions, we agreed that a consortium of French medical schools and the NBME would develop a pilot multiple-choice national assessment for measuring knowledge and application of clinical sciences using the NBME's expertise and test materials.
The objective of this initial pilot was twofold. First, it was undertaken to permit a representative group of faculty from several French medical schools to obtain firsthand experience in part of the process of developing high-quality, multiple-choice examinations. Second, the pilot study tested the feasibility of sharing, translating, and incorporating NBME test material and expertise in the development of a prototype national examination prior to the 2004 implementation deadline. The pilot examination, the Évaluation Standardisée du Second Cycle (ESSC) (Standardized Evaluation of the Second Cycle), was administered to a sample of French medical students at four university test sites in January 2002.
Test Development Activities
As part of this project, a series of activities was undertaken jointly by the NBME and a committee of physicians representing the four participating French medical schools. Committee members and NBME project staff completed the following test development activities:
- As a starting point, we customized the blueprint and content specifications using the Comprehensive Clinical Sciences Examination (CCSE), an off-the-shelf test included in the NBME Medical Subject Examination Program that is similar in design to the United States Medical Licensing Examination (USMLE) Step 2.
- For possible inclusion in the examination, we selected a set of 250 items from the CCSE pool that we deemed acceptable in light of the French medical curriculum and practice patterns. The items were modified as necessary to reflect French requirements, e.g., available drugs and local practice.
- We translated the items. Each item was translated independently by a pair of French physicians, with the duplicate translations serving as a validity check; this approach has been successfully employed in translating the Canadian licensing examination from English to French.5 The two translations of a given item were compared, and any discrepancy was resolved between the physicians assigned to that test question. If the pair could not arrive at a mutually acceptable translation, the item was dropped. A panel of ten French physicians translated the 250-item set (i.e., about 50 items per pair of physicians).
- Finally, we assembled and approved the final 200-item multiple-choice test form. Fifty items were not retained for a variety of reasons, including inappropriateness to French practice and guidelines, as well as redundancy with content areas already targeted by other items among the 200.
These activities were largely completed over the course of two two-day meetings in Nice, France. Other activities included the translation of a proctor's manual and examinee preparatory materials, and the development of an answer sheet and a score profile for reporting purposes. These activities were completed jointly by NBME project staff and French committee members.
The purpose of our study was to estimate summary test- and performance-related statistics based on the administration of the January 2002 ESSC. Additionally, the performance of French students was compared with those of USMLE Step 2 and CCSE reference groups composed of students in the United States.
A total of 285 DCEM 4 students completed the ESSC in January 2002 at four French medical schools: Faculté de Médecine Université de Nice Sophia–Antipolis, Faculté de Médecine de Nancy, Faculté de Médecine de Saint-Antoine, Paris XI, and Faculté de Médecine de Créteil, Paris XII. Sample sizes ranged from 43 (Paris–Créteil) to 112 (Nancy), with Nice and Paris–Saint-Antoine fielding 70 and 60 examinees, respectively. The group was composed of 117 men (41.1%) and 109 women (38.2%); the remaining 59 examinees (20.7%) did not provide information about their gender. Ages ranged from 21 to 36 years, with a mean of 25.2 years and a standard deviation of 1.73. The DCEM 4 is comparable to the fourth year of a typical U.S. medical school curriculum. Completing the ESSC was compulsory for students at Nancy, Nice, and Paris–Saint-Antoine, while it was voluntary (though strongly encouraged) for students from Paris–Créteil. It is also important to note that, although compulsory, the ESSC counted for only a small proportion of the year-end grade for most participating French students (and it did not count at all for one school). As such, the test administration involved much lower stakes (and lower student motivation) than are typically associated with a Step 2 examination, a required part of the licensure process in the United States. In fact, a more apropos comparison could be made with the CCSE, an integrated achievement test covering material typically learned during core clinical clerkships; this exam is usually taken as a formative assessment in preparation for USMLE Step 2. Finally, the ESSC was administered under secure, proctored, and standardized conditions, identical to those that were in place with paper-and-pencil USMLEs (recall that prior to 1999, the USMLE was administered via paper-and-pencil; since 1999 it has been computer-based for all three Steps).
The four-hour examination contained 200 single-best-answer multiple-choice items, with the number of options per item ranging from four to ten. As previously stated, the examination blueprint was an adaptation of the CCSE and, as such, targeted a number of content and skill domains, including normal growth and development and general principles of care; individual organ systems or types of disorders; and physicians' tasks, including promoting health and health maintenance, understanding mechanisms of disease, establishing a diagnosis, and applying principles of management.
Item and test characteristics
Item p-values and biserial (rbis) correlation coefficients were calculated. The p-value, usually referred to as the item difficulty index, is the proportion of examinees correctly answering an item. The biserial correlation, computed between the response to an item (correct or incorrect) and the total number-right score, is commonly referred to as an item discrimination index: moderate to high positive values indicate that examinees with higher total scores answer the item correctly in a higher proportion than those with lower scores (i.e., the item discriminates between low- and high-proficiency examinees). Also, Cronbach's coefficient alpha was calculated as an indication of reliability, specifically the consistency with which the examination measured students' knowledge of clinical sciences from item to item throughout the test. Reliability coefficient values can vary from 0 to 1; the closer the value is to 1, the more accurately examinees' proficiencies are being measured by the set of items included in the examination. For large-scale, high-stakes, multiple-choice examinations, values usually approach .90.
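For illustration, the three classical indices just described can be computed directly from a 0/1 response matrix. The sketch below uses only the Python standard library; the small RESPONSES matrix and the function names are hypothetical, for demonstration only, and are not part of the ESSC analyses.

```python
import math
from statistics import NormalDist

# Hypothetical 0/1 response matrix: rows are examinees, columns are items.
RESPONSES = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
]

def p_values(resp):
    """Item difficulty: proportion of examinees answering each item correctly."""
    n, k = len(resp), len(resp[0])
    return [sum(row[j] for row in resp) / n for j in range(k)]

def biserial(resp, j):
    """Biserial correlation between item j and the number-right total score."""
    totals = [sum(row) for row in resp]
    n = len(resp)
    mean_t = sum(totals) / n
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in totals) / n)
    p = sum(row[j] for row in resp) / n
    if sd_t == 0 or p in (0.0, 1.0):
        return 0.0  # undefined when the item or the totals show no variance
    mean_correct = sum(t for row, t in zip(resp, totals) if row[j] == 1) / (p * n)
    # Ordinate of the standard normal density at the point splitting areas p and 1 - p.
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    return (mean_correct - mean_t) / sd_t * p / y

def cronbach_alpha(resp):
    """Internal-consistency (reliability) estimate for the total score."""
    n, k = len(resp), len(resp[0])
    totals = [sum(row) for row in resp]
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n
    item_var = sum(p * (1 - p) for p in p_values(resp))  # variance of a 0/1 item
    return k / (k - 1) * (1 - item_var / var_t)
```

Note that the biserial (unlike the simpler point-biserial) assumes a normally distributed latent trait underlying the dichotomous item response, which is why the normal-density ordinate appears in the formula.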
Also, a series of item response theory (IRT) model analyses was undertaken based on the Rasch or one-parameter model.6 The Rasch model, which provides a theoretical framework by which to estimate the probability of a correct response given an examinee's proficiency level and the difficulty of the item, possesses several advantages over classical test theory approaches (such as item difficulty, item discrimination, and reliability indices), including its ability to allow the user to estimate proficiencies that do not depend on the specific set of items that were taken (ability invariance property). This allows for the comparison of several groups of examinees with respect to the proficiency of interest, even though the examinees may have responded to nonidentical sets of items (with overlap). The Rasch model also allows the user to compute item statistics and examinee performance measures that can be represented on the same scale. This has important implications for targeting items to the appropriate proficiency levels of examinees. Little information is gained about the proficiency level of examinees if the items that are administered are either too easy or too difficult for the group of students. It is possible, with the Rasch model, to plot the difficulty of the items and the proficiency levels of examinees on the same scale to assess the extent to which they are matched to one another. This plot was produced for the items in the ESSC and for the proficiencies of the French medical students who participated in this field trial.
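The Rasch model's core equation is a one-parameter logistic: the probability of a correct response depends only on the difference between examinee proficiency (theta) and item difficulty (b), both expressed in logits on a common scale, which is what makes the joint difficulty-proficiency plot possible. A minimal sketch (the function name is ours, not from the source):

```python
import math

def rasch_prob(theta, b):
    """Rasch (one-parameter logistic) model: probability of a correct
    response given examinee proficiency theta and item difficulty b,
    both in logits on the same scale."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# When proficiency equals item difficulty, the probability is exactly .50;
# an easier item (lower b) yields a higher probability for the same examinee.
```

An item far below the group's proficiency range is answered correctly with near certainty and therefore contributes little information about individual examinees, which is the targeting concern the proficiency-difficulty plot is meant to reveal.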
We estimated the mean and standard deviation (SD) of ESSC Rasch-based proficiency values for the French group. Similarly, Step 2 and CCSE reference-group Rasch-based proficiency values were calculated. The Step 2 examination, part of the USMLE, assesses whether U.S. medical students can apply medical knowledge and understanding of clinical science essential for providing patient care under supervision, and emphasizes health promotion and disease prevention. The Step 2 reference group was composed of students from Liaison Committee on Medical Education (LCME)–accredited schools in the United States and Canada who took the Step 2 examination for the first time in 1999. Step 2 was administered via computer to these examinees. The LCME is the recognized accrediting authority for medical education programs leading to the MD degree in U.S. and Canadian medical schools. The CCSE population was composed of students from LCME-accredited schools in the United States and Canada who completed the examination for the first time in 2000. The CCSE, like the ESSC, is a paper-and-pencil examination. Finally, a plot indicating the frequency of omitted responses throughout the latter part of the ESSC was produced as a gross indicator of speededness, i.e., to assess whether the four-hour period allocated to complete the examination was sufficient.
The distribution of item difficulty values or p values for the ESSC, shown in Figure 1, ranged from .06 to .96, with a mean value of .63 and a standard deviation of .21. From a content perspective, obstetrics and gynecology, pregnancy, and mental disorders items appeared to be the most difficult for the French medical students. Conversely, items measuring knowledge of renal, urinary and male reproductive systems, blood and blood-forming organs, and cardiovascular disorders appeared to be the least difficult.
The distribution of item discrimination values (rbis) is shown in Figure 2. The rbis values ranged from −.18 to .55, with a mean of .22 and a standard deviation of .14. From a content perspective, nutritional and digestive disorders items were the most discriminating, while mental disorders items were the least discriminating of the items in the ESSC.
Following a conference call with the French committee during which we reviewed 33 items that were flagged based on weak p or rbis values, we decided that seven items would not be included in the final score. From a content perspective, four of these flagged items targeted knowledge of obstetrics and gynecology, whereas the remaining three items each targeted a distinct content category.
Cronbach's coefficient alpha value, calculated as an estimate of overall score reliability, was .91, suggesting that the examinees' scores were measured with a high level of precision.
Finally, a plot outlining the Rasch-based difficulty and proficiency estimates is shown in Figure 3. The distribution of these proficiency estimates is shown in the left half of the figure. Lower values indicate less proficiency in the areas targeted by the examination, whereas higher values indicate greater proficiency. The difficulty estimates are shown in the right half of Figure 3. Negative difficulty values (referred to as b-parameters in IRT parlance) indicate easier items, whereas positive values indicate more difficult items. Figure 3 shows clearly that most items were well targeted, in terms of their difficulty, to the proficiency level of examinees, i.e., there tends to be substantial overlap between the two distributions. However, approximately 20% of the items appeared to be too easy for the group of examinees that completed the ESSC. Nonetheless, these items are important for content coverage reasons, i.e., to ensure that the sample of items selected for inclusion in the ESSC adequately reflects the endorsed blueprint or domains to be measured by the test.
Proficiency estimates and projected Step 2 reference-group failure rates are presented in Table 1. The first column provides means and standard deviations for Rasch-based proficiency estimates for three groups of examinees: the sample of French medical students who completed the ESSC, and the CCSE and Step 2 reference groups. Again, since the items in the various exams were on the same scale, it is possible to compare proficiency estimates across groups. The second column in Table 1 shows projections of Step 2 failure rates for all three groups. That is, if each group had taken the Step 2 exam, what would have been their respective failure rates? The failure rate shown for the Step 2 reference group is the actual value for examinees who took the CBT Step 2 in 1999.7
Our results showed that the French medical students performed approximately 0.4 SD below the CCSE reference group in the United States, whereas they performed slightly more than one SD below the Step 2 reference group. Also, the CCSE group performed slightly less than one SD below the mean of the Step 2 reference group. CCSE performance by U.S. medical students is typically lower than their own Step 2 performance, in part because of differences in motivation. Thus, Step 2 failure rates for both the CCSE reference group in the United States and the French students would be substantially higher than that of the actual Step 2 reference group. Approximately 23.2% and 35.1% of the CCSE reference group and the French examinees, respectively, would fail Step 2 based on their Rasch proficiency estimates.
The numbers of students who omitted responses for items 150–200 (approximately the last quarter of the examination) are shown in Figure 4. As would be anticipated if the testing time were inadequate, the number of students omitting responses increased steadily over this portion of the examination, from about 2% for item 150 to approximately 18% for item 200, the final item of the ESSC. Using a 10% omission rate as a rough indicator of speededness, it would appear that the time allocated to complete the examination from item 175 onward was insufficient for this group of examinees. This degree of speededness is not seen in the CCSE.
The globalization of medicine has had important repercussions on various aspects of the profession.8 The sharing of educational resources and curricula is also gradually favoring the development of similar practices across countries, which may, in turn, lead to common standards.9 One of the consequences of adopting similar curricula and practices is that more common assessments of medical knowledge and skills will be appropriate for use by countries throughout the world.
Our study reflects a collaborative effort to develop an examination based on U.S. experiences that could be readily adapted for assessing French medical education. Specifically, the purpose of this research was to develop a multiple-choice item examination that could be used to assess French students' clinical sciences knowledge at the end of the second cycle (i.e., immediately prior to supervised professional training in France).
It was apparent over the course of the test development meetings held between French faculty members and NBME staff that the process of developing a suitable French clinical sciences examination could be accomplished with relatively little effort. Although some items needed to be adapted to reflect local medical practices and guidelines, it was interesting to note that less than 10% of initially selected CCSE items were rejected because they were inappropriate for the French medical context. The Rasch proficiency-item difficulty plot also clearly showed that the examination was, by and large, well targeted to the proficiency levels of French students. Consequently, the scores reported to schools were probably accurate estimates of students' proficiencies in the clinical sciences, a conclusion further supported by the high reliability coefficient. The examination administered to French students also appeared to have been slightly speeded, i.e., an insufficient amount of time was allocated to complete the examination. This finding is not surprising given that the French version of the examination contained over 1,800 more words than its English counterpart, although the time provided to complete both examinations was the same; the additional reading load is roughly equivalent to lengthening the examination by 10–12 items. The Medical Council of Canada (MCC) has reported similar trends with the French versions of some of its examinations.10 Based on these findings, it would probably be necessary to increase the amount of time allotted for completing any similar future examination.
Additionally, our results suggest that French students performed only slightly below the mean of a group of U.S. medical students taking a similar subject examination (the CCSE). This is encouraging, given that the French students had had little, if any, exposure to the clinical-vignette item format that is central to both the CCSE and Step 2. French multiple-choice items are typically fact-based (list a drug, treatment, etc.), whereas those included in NBME examinations present a clinical scenario to the student to elicit some expected response (diagnosis, treatment modality, etc.). NBME item formats attempt to target the student's higher cognitive processing skills.11 It is likely that the performance level of French students could improve even more with additional exposure to this item format and adjustment of the time allowed for testing. In fact, Figure 5, outlining the cumulative mean item difficulty level across the ESSC, seems to support this hypothesis. The values shown in Figure 5 correspond to the difficulty of an item (p value) averaged with all items preceding it; for example, the value shown for item 5 corresponds to the mean p value for items 1–5. It is interesting to note that the mean p value tends to stabilize from approximately item 60 onward (the dip shown at the end of the curve probably reflects the effect of speededness). In other words, it appears that it took French students about 60 items to become comfortable with the examination. Of course, this interpretation holds only if item difficulty is independent of item sequence in the examination; there is little evidence to suggest otherwise.
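The cumulative mean difficulty curve described above is simply a running average of p values in test order. A minimal sketch, assuming a list of per-item p values (the function name is hypothetical):

```python
def cumulative_mean_p(p_values):
    """Running mean of item difficulty: entry i is the mean p value of
    items 1 through i + 1, taken in test order (as plotted in Figure 5)."""
    out, running_total = [], 0.0
    for count, p in enumerate(p_values, start=1):
        running_total += p
        out.append(running_total / count)
    return out

# Example: cumulative_mean_p([0.5, 0.7, 0.9]) yields means over items
# 1, 1-2, and 1-3, i.e., approximately [0.5, 0.6, 0.7].
```

A flat stretch in this curve indicates that newly encountered items are, on average, no harder than those before them, which is why stabilization after item 60 was read as the students having acclimated to the format.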
Although encouraging, our findings need to be interpreted cautiously in light of our small sample of French medical schools (about 10% of all French medical schools) that participated. Furthermore, the students' levels of motivation to perform varied across schools. Nonetheless, it is encouraging that the first phase of this project has generated a considerable amount of enthusiasm for future collaborations with a larger number of medical schools in France. In fact, a second field trial is being discussed which will hopefully involve eight to 12 French medical schools and close to 1,000 examinees. Results from this trial will provide additional useful information that will probably generalize more readily to all medical schools in France.
Also, future research will need to focus on the comparability of the overall structure of the examination in both its English and French forms, as well as the extent to which individual test items are similar in terms of overall psychometric characteristics. Although similar analyses have been completed with French–English MCC examinations,10 methodologic flaws limit the extent to which we can draw sound conclusions about the degree of comparability between both test and item structures across languages. For example, the authors of the MCC study interpreted the absence of a significant mean score difference between French and English examinees as indicating identical test structures; however, groups might be comparable with respect to overall score while still using different skill composites to answer the test questions. We are presently undertaking a study that will compare the factor structures of the two examination forms to more clearly answer this question. Also, the authors of the MCC study compared the extent to which items performed similarly for both French and English examinees using the Mantel–Haenszel chi-square statistic.12 In addition to having sample sizes too small to detect differences between the two groups with adequate power,13 the authors assumed that the construct underlying the examination in both its English and French forms (central to the use of this procedure) was the same. This assertion is difficult to defend based on the analyses provided in the Bordage, Carretier, Bertrand, and Page study. The research that we intend to undertake in this area will focus on investigating the equivalence of several levels across both populations (i.e., construct domains, underlying structure, and items) via a systematic series of analyses based on a strong theoretical model proposed in the cross-cultural literature.14
Finally, we hope that the lessons learned in this field trial will be useful in the development of a sound examination program for French internship selection, i.e., one that will yield reliable and valid assessments of clinical sciences knowledge and application. We also hope that these collaborations and other similar joint ventures will foster an even higher level of cooperation between countries and contribute to making medical education and assessment truly global disciplines.