The objective structured clinical examination (OSCE) is widely used in the evaluation of medical students, the selection of foreign medical graduates for training, and medical licensure. The OSCE has proven to be a reliable and valid assessment of clinical skills.1 One area of controversy is who should observe and rate the encounters. In the Medical Council of Canada Qualifying Examination Part II (MCCQE Part II), physician examiners are used as raters and standard setters, whereas in the Educational Commission for Foreign Medical Graduates (ECFMG) examination, standardized patients (SPs) are used. One argument for the use of physician examiners is that experienced physicians are essential to judge the ability level of examinees when high-stakes decisions are made. However, with higher clinical demands on physicians’ time and difficulty recruiting physician examiners, the use of nonphysicians is an attractive alternative.
Although there are studies that support the use of SPs,2–4 several have identified concerns with their use as examiners. Rothman and Cusimano,5,6 in two separate studies of the Ontario International Medical Graduate Program OSCE, found poor consistency between physician examiners and SPs in their ratings of interviewing skills and little agreement between them in identifying potentially problematic examinees regarding English proficiency.
One challenge for SPs is that they commonly score from recall. In five history-taking stations, Martin and colleagues7 explicitly examined this issue by comparing physician examiners, SP observers, and SPs completing checklists from recall. Their findings suggest that physicians should be used to rate examinees whenever practical. SP observers were considered better than the SPs who rated from recall.
One source of alternative scorers is medical students, and Van der Vleuten et al.2 demonstrated that trained medical students were almost as good as trained faculty. One interpretation of this study is that individuals with some medical knowledge may be superior to laypersons. However, medical students are not appropriate for use in high-stakes examinations in which they will eventually be test-takers.
In Canada, the MCCQE Part II is a requirement for medical licensure. The examination is run twice per year and each administration requires between 400 and 900 physicians. Securing sufficient numbers of physician examiners is becoming more challenging. We therefore wanted to determine whether nonphysicians are a viable alternative. We chose individuals with a medically related background, as the literature suggested they might perform better. Physician and nonphysician raters were compared on checklist scores and global rating scales in a high-stakes OSCE.
The MCCQE Part II is a 12-station OSCE consisting of seven ten-minute patient encounters and five couplets. The couplets are five-minute patient encounters paired with five-minute written exercises. The patient problems of each station are derived from one of the five major disciplines of medicine (medicine, obstetrics and gynecology, pediatrics, psychiatry, and surgery). The study was based on the fall 2003 administration, and data were collected at four of the 15 sites. Only English-language sites were selected to exclude variance due to language. Sites were chosen from different regions to provide a broader sample of data.
Three five-minute history-taking stations were selected for this study. Two of the cases were from the domain of obstetrics and gynecology and the third was a pediatrics case. The objectives tested were vaginal bleeding, pelvic pain, and eliciting a history regarding a crying and fussing child. There were 24 to 30 checklist items for each of the stations. One station also contained a rating scale item for “questioning skills” and another for “rapport with person.”
Scoring procedures have been described elsewhere.8 In essence, for each station, examiners complete a checklist measuring the observed performance of the examinees’ clinical skills and subsequently complete a global rating. The global rating scale is a six-point scale ranging from “inferior” to “excellent,” with the two middle categories described as “borderline unsatisfactory” and “borderline satisfactory.” Cut-scores for each station were established by the modified borderline group method. With this method, each station cut-score was the mean of the case scores for individuals rated as “borderline.”9 The physician examiner score and cut-score were considered the “gold standard.”
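The borderline group calculation described above can be sketched in a few lines. This is an illustration only, not the examination's scoring software: it assumes that both middle categories of the global rating scale count as “borderline,” and the function name and the example data are hypothetical.

```python
from statistics import mean

# Assumption: both middle categories of the six-point scale count as "borderline".
BORDERLINE = {"borderline unsatisfactory", "borderline satisfactory"}


def borderline_group_cut_score(records):
    """Return the mean station score of examinees rated borderline.

    `records` is an iterable of (station_score, global_rating) pairs
    produced by one examiner type for one station.
    """
    borderline_scores = [score for score, rating in records if rating in BORDERLINE]
    if not borderline_scores:
        raise ValueError("no examinees were rated borderline")
    return mean(borderline_scores)


# Invented example data, not taken from the study.
example = [
    (82.0, "excellent"),
    (61.0, "borderline satisfactory"),
    (55.0, "borderline unsatisfactory"),
    (40.0, "inferior"),
    (58.0, "borderline satisfactory"),
]
cut_score = borderline_group_cut_score(example)  # mean of 61, 55, and 58
```

Because the cut-score depends only on which examinees receive a borderline global rating, two examiner types can produce identical checklist scores and still arrive at different cut-scores.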
Thirty-three nonphysicians were recruited to be trained assessors. They were trained to score one of the three history-taking stations selected for this study. Most (27/33) of the trained assessors had a medical background such as nursing, pharmacy, physiotherapy, occupational therapy, paramedics, or psychology. The other six had no medical background but had been SPs in previous examinations. The training involved a two-hour general training session in which trained assessors were provided with a self-study booklet and then participated in one hour of “dry runs” of their patient problem. This step involved watching four to eight SPs portray the case and completing a checklist along with a physician examiner. Each trained assessor scored only one case.
Each of the 33 trained assessors was paired with two physician examiners, one in the morning session and one in the afternoon session. Each pair viewed the encounter in real time and scored up to 32 examinees. They completed the same checklist and global rating scale but were not allowed to discuss results at any time. Each trained assessor scored up to 64 candidates and a total of 466 examinees completed all three stations.
The data were analyzed using SPSS 13.0 (SPSS Inc., Chicago, IL) to calculate correlations between examiner types and to conduct a 3 × 2 × 4 repeated-measures analysis of variance. For this latter analysis, the three stations and the two examiner types (trained assessor versus physician examiner) were treated as within-subject variables and the examination site (1–4) was treated as the between-subject variable.
The main effect for examiner was not significant (F(1, 462) = .01, p = .94). The mean scores and standard deviations are shown in Table 1. However, the interaction between station and examiner was significant (F(2, 924) = 17.46, p < .001), as was the three-way interaction among station, examiner, and site (F(6, 924) = 7.50, p < .001). As shown in Table 2, which displays the means and standard deviations by site for stations and examiners, the significant three-way interaction likely occurred because there was a difference between the scorers at some sites and stations that did not occur at other sites and stations. This observation was confirmed by running post hoc comparisons for each trained assessor and physician examiner pair as a function of site and station. Table 2 displays the resulting level of significance and effect size measure for these comparisons. To protect against an inflation of the family-wise error rate, a significance level of .02 was used for these comparisons. As shown in the table, there was a significant difference in mean scores between trained assessors and physician examiners at site 1/station 3, site 3/station 2, and site 4/station 1 that did not occur elsewhere.
Despite these differences in mean scores, the correlations between scores assigned by the examiners were relatively high. As shown in Table 2, the correlations between examiner scores ranged from .49 to .92, indicating a relatively high level of agreement between examiners for each site and station. The high correlations and similar mean scores between pairs of examiners suggest that there were few differences between trained assessors and physician examiners, other than some isolated differences due to the interaction of site and station. In the full examination, all three stations were psychometrically sound, with means and standard deviations well within expected norms. Item-total score correlations were station 1 = .347, station 2 = .421, and station 3 = .359.
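A high correlation and a difference in mean scores can coexist, because a correlation is insensitive to a constant shift between raters. The sketch below illustrates this with a Pearson coefficient; the choice of Pearson's r is an assumption (the type of correlation is not specified here), and the function name and scores are invented for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant scores")
    return sxy / sqrt(sxx * syy)


# Invented scores: the assessor rates every examinee one point higher.
physician_scores = [10, 12, 14, 16]
assessor_scores = [11, 13, 15, 17]

r = pearson_r(physician_scores, assessor_scores)
mean_difference = (
    sum(assessor_scores) / len(assessor_scores)
    - sum(physician_scores) / len(physician_scores)
)
```

In this toy case the two raters rank the examinees identically, so the correlation is perfect even though the assessor's mean is one point higher.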
Table 1 displays the cut scores for each station as determined by the global ratings of the trained assessors and physician examiners. Although the cut scores for each station appeared to be similar, the agreement in terms of pass/fail decisions was not high. Examinees were classified in opposing pass/fail categories as follows: station 1, 67/466 (14.4%); station 2, 78/466 (16.7%); station 3, 117/466 (25.1%). Physician examiners failed more examinees in every station compared to the trained assessors (136 versus 103 in station 1, 127 versus 109 in station 2, and 174 versus 99 in station 3).
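Counting examinees who fall in opposing pass/fail categories can be sketched as follows. The sketch assumes that an examinee passes a station when the score is at or above that examiner type's cut score; this pass rule, the function name, and the data are illustrative assumptions rather than the examination's documented procedure.

```python
def discordant_classifications(physician_scores, assessor_scores,
                               physician_cut, assessor_cut):
    """Count examinees whose pass/fail status differs between examiner types.

    Returns (number_discordant, proportion_discordant).
    Assumption: pass means score >= the examiner type's own cut score.
    """
    if len(physician_scores) != len(assessor_scores):
        raise ValueError("score lists must be paired")
    if not physician_scores:
        raise ValueError("no examinees to classify")
    discordant = 0
    for p_score, a_score in zip(physician_scores, assessor_scores):
        physician_pass = p_score >= physician_cut
        assessor_pass = a_score >= assessor_cut
        if physician_pass != assessor_pass:
            discordant += 1
    return discordant, discordant / len(physician_scores)


# Invented paired scores for four examinees on one station.
physician = [60, 55, 70, 48]
assessor = [62, 58, 69, 52]
count, proportion = discordant_classifications(physician, assessor, 57, 56)
```

In this toy example the cut scores differ by a single point and the scores track each other closely, yet one of four examinees changes category, which mirrors how similar cut scores can still yield disagreement near the pass mark.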
The purpose of this study was to determine whether a nonphysician trained to score examinees on a particular case could produce ratings similar to those of a physician. In this study there was very good agreement between the physician examiner and trained assessor checklist scores for history-taking stations that were administered as part of a high-stakes OSCE. There was poor agreement, however, on pass/fail decisions. Up to 25% of candidates were misclassified by the trained assessors. This study confirms the findings of previous research suggesting that trained observers are a viable alternative for scoring checklists. The findings also raise the same concern identified by other studies regarding the ability of nonphysicians to complete global rating scales.
The finding that nonphysicians may have difficulty making judgments regarding the appropriateness of certain lines of questioning should not be surprising. A physician examiner may interpret a certain line of questioning favorably, for example recognizing a candidate who is ruling in or out disease, which the nonphysician would not have the medical knowledge to credit. For the Ontario International Medical Graduate OSCE, Rothman and Cusimano5,6 reported good consistency between physician examiner and SP ratings of English proficiency, but less agreement in their ratings of interviewing skills and little agreement in identifying problematic candidates. For similar reasons, Colliver et al.10 recommended caution in the interpretation of scores obtained from a case checklist completed by multiple SPs, especially if scores would be used for pass/fail decisions.
This study differed from some of the other studies because it was based on real-time simultaneous observations by the physician examiner and trained assessor pairs. The qualitative loss that may be associated with viewing videotaped encounters was avoided. A second difference lies in the approach to the recruitment and training of the trained assessors. The trained assessors were required to have a university-level degree and a professional background that would support their role as an examiner in a clinical skills examination. In addition, they received three or more hours of training related to medical history taking and the case they would be observing. This is less than the 15 hours given to SPs who score the ECFMG examination,3 but these SPs are trained to portray a case as well as to score it. Newble and colleagues11 studied the effect of training on physician examiners. They concluded that training for physicians was not effective and that selection of inherently consistent raters was the critical factor. Van der Vleuten et al.2 reported similar results and concluded that training was least effective and least needed for medical faculty. However, they also noted that with only two hours of training, laypersons approached the accuracy of untrained faculty.
In conclusion, the study demonstrated that trained nonphysician assessors may be a valid alternative to physician examiners for scoring checklists in a high-stakes OSCE. As a preliminary study, this is encouraging. The next step is to develop a better understanding of the interaction effect that occurred at two of the four sites.
The ability of trained assessors to make valid global judgments that contribute to pass/fail decisions was not supported by the present study. This challenge to the standard setting methodology will need to be addressed before trained assessors are incorporated in this high-stakes OSCE.
The authors wish to acknowledge Dr. Richard Birtwhistle for supporting this project, as well as Jodi Harold McIlroy and Ilona Bartmann for their statistical expertise.
1 Petrusa E. Clinical Performance Assessment. In: Norman G, van der Vleuten CPM, Newble DI (eds). International Handbook of Research in Medical Education. Dordrecht, The Netherlands: Kluwer, 2002.
2 Van der Vleuten CPM, VanLuyk SJ, Van Ballegooijen AMJ, Swanson DB. Training and experience of examiners. Med Educ. 1989;23:290–96.
3 De Champlain AF, Macmillan MK, King AM, Klass DJ, Margolis MJ. Assessing the impacts of intra-site and inter-site checklist recording discrepancies on the reliability of scores obtained in a nationally administered standardized patient examination. Acad Med. 1999;74(10 suppl):S52–S54.
4 Colliver JA, Swartz MH, Robbs RS, Lofquist M, Cohen D, Verhulst SJ. The effect of using multiple standardized patients on the intercase reliability of a large-scale standardized-patient examination administered over an extended testing period. Acad Med. 1998;73(10 suppl):S81–S83.
5 Rothman AI, Cusimano M. Assessment of English proficiency in international medical graduates by physician examiners and standardized patients. Med Educ. 2001;35:762–66.
6 Rothman AI, Cusimano M. A comparison of physician examiners’, standardized patients’, and communication experts’ ratings of international medical graduates’ English proficiency. Acad Med. 2000;75:1206–11.
7 Martin JA, Reznick RK, Rothman A, Tamblyn RM, Regehr G. Who should rate examinees in an objective structured clinical examination? Acad Med. 1996;71:170–75.
8 Reznick RK, Blackmore DE, Dauphinee WD, Smee SM, Rothman AI. An OSCE for licensure: the Canadian experience. In: Scherpbier AJJA, van der Vleuten CPM, Rethans JJ, van der Steeg AFW (eds). Advances in Medical Education. Dordrecht, The Netherlands: Kluwer, 1997.
9 Dauphinee WD, Blackmore DE, Smee S, et al. Using the judgments of physician examiners in setting standards for a national multicenter high stakes OSCE. Adv Health Sci Educ. 1997;2:204.
10 Colliver JA, Robbs RS, Nu VV. Effects of using two or more standardized patients to simulate the same case on case means and case failure rates. Acad Med. 1991;66:616–18.
11 Newble DI, Hoare J, Sheldrake PF. The selection and training of examiners for clinical examinations. Med Educ. 1980;14:345–49.
Moderator: John Boker, PhD
Discussant: Brian Clauser, PhD
© 2005 Association of American Medical Colleges