Common to both lay and professional views of medical competence is the belief that practitioners must maintain additional qualities beyond intellect. Health care providers are expected to be compassionate, articulate, and capable of managing ethical dilemmas. Unfortunately there has traditionally been a divergence between the consensus that these personal qualities are important and the evidence that they can be measured in a valid manner. In an attempt to narrow this divide, we have developed and are testing a new form of structured interview modeled on the objective structured clinical examination (OSCE). This Multiple Mini-Interview (MMI) has been shown to provide a reliable estimate of candidates’ performance,1 but the new protocol demands that attention be paid to the biases that might arise as a result of the different vantage points held by heterogeneous raters. In this report we highlight the development of the MMI and present data pertaining to the issue of variability in raters’ backgrounds.
In a recent review article, Albanese et al.2 highlighted the increasing importance that the Association of American Medical Colleges (AAMC) has placed on the assessment of applicants’ personal qualities. This is a difficult challenge, to be sure. The evidence that traditional measures of personal qualities (e.g., the personal interview) can accurately identify those candidates who maintain strength in noncognitive domains is equivocal at best. Although the personal interview is used by nearly all medical programs in North America,3 its reported interrater reliability has ranged from .23 to .96.2 Structured interviews have typically yielded better interrater reliability than unstructured interviews.4 Eva et al.1 have recently shown, however, that interrater reliability is an insufficient standard by which to judge the generalizability of interview ratings.
The problem is one of content specificity. In making selection decisions, as indicated by Albanese et al. “one is most interested in stable qualities that have a high probability of occurrence in an almost infinite number of different situations.”2, p. 317 Although debate exists regarding whether such “stable qualities” exist, it has become clear in various contexts that the average performance an individual displays over the course of many encounters is a more generalizable indication of that individual's qualities than is any single encounter.5 The implication is that multiple interviews are required to gather an accurate depiction of an individual's personal qualities in much the same way that multiple clinical encounters (i.e., an OSCE) are required to gather an accurate conception of an individual's clinical competence. Two individuals might agree on the rating a candidate deserves after a single interview, but that rating will be fraught with bias if it is not predictive of how the same candidate will perform in a second interview, let alone in a more disparate context.
The Multiple Mini-Interview
In response to faculty concern over the validity of traditional admissions processes, we developed a novel admissions protocol at McMaster University and began testing the protocol in Spring 2002. Details of the MMI's development have been published previously.1 Modeled after the OSCE, the MMI maintains many of the characteristics of structured interviews; specific content is discussed and potential questions, accompanied by sample answers, are provided to examiners. In contrast to traditional structured interviews, however, the MMI de-emphasizes the need for a panel of interviewers and emphasizes the need for a series of interviews. If each interview is kept short, an OSCE-style interview process can be completed with fewer human resources than traditional interviews. Although the MMI is essentially an admissions OSCE, we have opted to change the name of the protocol to make explicit the facts that the judgments are not objective and the stations are intentionally nonclinical.
To create MMI stations, we advocate that admissions committees undergo a blueprinting process in which they determine the qualities for which they hope the interview will select. This process should be informed by the educational philosophy adopted by the institution in which the admissions committee works as well as broader documents that outline the key competencies of practicing physicians.6,7 A process for doing so has been developed by Reiter and Eva.8 As a starting point for our own institution, we chose to focus on four areas: professionalism, collaboration, scholarship, and health care advocacy.
The stations used in the current study are described in the Appendix. They provide evidence of the flexibility with which mini-interviews can be created. Station 1 addressed general communication skills. An actor portraying a frantic individual attempts to gather information from the candidate pertaining to his/her undergraduate experience. In this station two examiners acted merely as observers, watching the interaction between actor and candidate and rating the communication skills displayed by the candidate. Stations 4 and 6 measured Professionalism. Each involved a computerized presentation of a video in which a medical student was placed in a compromised position. The task for the candidate was to watch the video and then discuss appropriate and inappropriate strategies for dealing with the situation displayed. Stations 7 and 8 were concerned with Collaboration. These paired stations challenged two candidates to work together to demonstrate their collaborative skills by attempting an origami task without the benefit of seeing one another. The remaining stations were more traditional interviews in which candidates discussed issues related to Scholarship and Health care Advocacy with the interviewers. Personal history stations in which candidates are asked to describe their past experiences can be used, but they were not included in this study as these types of stations have been used in past studies.1
Previous research has shown that the MMI provides a reliable assessment of candidates’ abilities, that the overall test reliability improves to a greater extent by maximizing the number of stations rather than by maximizing the number of observers per station, and that the MMI is viewed positively by candidates and examiners alike.1 Remaining unanswered, however, is the question of whether faculty members and nonfaculty members are distinguishable by their ratings. At McMaster, heterogeneity has always been a fundamental principle because it is believed that breadth of experiences across students enriches the scholastic experience.9 To try to maximize heterogeneity across students, interviewers have traditionally been drawn from various populations, including faculty members, medical students, and individuals from the community at large. As we propose assigning a single interviewer to each station, the question of whether faculty members and individuals from the community assign performance ratings consistent with one another becomes increasingly important. Such questions have been addressed in the realm of evaluation,10 but no effort has been made to determine the rating tendencies of interviewers with different characteristics.2 To this end we designed a nine-station MMI within which three stations were assigned two faculty examiners, three were assigned two examiners from the community at large, and three were assigned one examiner from each group.
Letters were sent to all 198 candidates to McMaster University's undergraduate medical program who had been invited to interview on the first of two interview weekends conducted by the program in Spring 2003. The letter stressed that participation in the MMI was being requested for a research study, and agreement (or lack thereof) would in no way influence the decision of the admissions committee. Candidates were offered $40 in an attempt to make it clear that this initiative was completely separate from the regular admissions process. The first 54 who replied affirmatively and could be scheduled into six prearranged research sessions were included in the study. Mean age of the participants was 24.2 years (minimum = 19, maximum = 37). Of the 54 participants, 38 (70%) were women, a proportion commensurate with the gender ratio of interviewees that weekend.
In addition, 18 health sciences faculty members and 18 community members drawn from the legal profession and human resource departments of both local businesses and the university were recruited to act as examiners. In two instances, faculty members had to withdraw—they were replaced with current medical students. The Standardized Patient Program at McMaster University recruited and trained an actor who played the role of Frankie in Station 1 (see Appendix).
One week before the study, examiners were sent an MMI booklet that contained a description of the procedure, the “instructions to the applicant” for the station to which the examiner had been assigned, a list of potential points of discussion, a page of background information on issues pertaining to the content of the station, and a copy of the scoring sheet with which examiners were expected to rate the performance of each candidate.
On the study weekend, three sessions were run sequentially on each of two days with a 40-minute break for the examiners between sessions. Two examiners were assigned to each station. Three of the nine stations were staffed by two faculty members, three by two community members, and three by one member of each group. Before the first MMI on each day the authors of this article met with the examiners to ensure that the procedure was clear, to answer any last-minute queries, and to reinforce that the ratings should be assigned independently.
Nine candidates were assigned to each of the six sessions. To begin, the authors of this article met with the candidates to explain the process and have them sign informed consent forms. Each candidate was randomly assigned to begin at one of the nine stations and given two minutes to read the “instructions to applicants” posted on an office door. A buzzer sounded to inform candidates they could enter the room and to signify the beginning of the eight-minute period for completing the station. After eight minutes, another buzzer sounded at which time candidates concluded their discussions and proceeded to the next station. During the two-minute interval between stations, examiners completed an evaluation form that rated each candidate (using a seven-point scale) on communication skills, strength of arguments raised, suitability for the health sciences, and overall performance. At the conclusion of the last station, candidates were surveyed regarding their perceptions of the process. Examiners were asked to complete a similar questionnaire at the end of the day.
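The circuit logistics described above can be sketched in code. This is only a minimal illustration, assuming that "the next station" means a fixed cyclic order (Station 9 is followed by Station 1) and that a two-minute reading interval precedes every station. The function name `build_circuit`, the candidate labels, and the random seed are hypothetical and do not come from the study.

```python
import random

READ_MIN = 2      # minutes to read the posted "instructions to applicants"
STATION_MIN = 8   # minutes spent inside each station


def build_circuit(candidates, n_stations=9, seed=None):
    """Return {candidate: [station numbers in visiting order]}.

    Assumption: candidates rotate through the stations in cyclic order,
    so no two candidates ever occupy the same station at the same time.
    """
    if len(candidates) != n_stations:
        raise ValueError("this sketch assumes one candidate per station")
    rng = random.Random(seed)
    starts = list(range(n_stations))
    rng.shuffle(starts)  # random starting station for each candidate
    return {
        cand: [(start + step) % n_stations + 1 for step in range(n_stations)]
        for cand, start in zip(candidates, starts)
    }


candidates = [f"C{i}" for i in range(1, 10)]   # hypothetical labels
circuit = build_circuit(candidates, seed=1)
total_minutes = 9 * (READ_MIN + STATION_MIN)   # length of one full circuit
```

Under these assumptions every candidate visits each of the nine stations exactly once, and each station hosts exactly one candidate in every round.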
Table 1 shows the average score and standard deviation assigned to candidates for each of the four items on the evaluation form. The internal consistency (i.e., the average relationship between pairs of questions) was found to equal .96, indicating a high degree of redundancy. As a result, only the “overall performance” score was used in subsequent analyses.
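The text does not name the internal-consistency statistic that was used. One common index is Cronbach's alpha, and the sketch below assumes that choice purely for illustration. The ratings are hypothetical seven-point scores invented for the example; they are not data from Table 1.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of rating forms.

    rows: one list of item scores per completed form, with the same
    items in the same order on every form.
    """
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical forms: communication, arguments, suitability, overall
forms = [
    [5, 5, 6, 5],
    [3, 4, 3, 3],
    [6, 6, 6, 7],
    [4, 4, 5, 4],
    [2, 3, 2, 2],
]
alpha = cronbach_alpha(forms)
```

When the four items rise and fall together across candidates, as in these invented forms, alpha approaches 1, which is the kind of redundancy that would justify carrying only one item forward.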
The effects of day and session were analyzed using a mixed-design analysis of variance (ANOVA) with Day, Session, Station, Rater (nested within Day and Station), and Candidate (nested within Day and Session) included as independent variables. No main effects reached significance, indicating that the mean scores were comparable across day (Day 1 = 4.96, Day 2 = 5.02; F1,48 < 1.0, mean squared error = 6.175, not significant) and that no drift in scores occurred during the day (11:00 am = 5.10, 1:30 pm = 5.05, 4:00 pm = 4.82; F1,48 = 1.20, mean squared error = 6.1754, not significant).
To determine whether the ratings faculty members assigned were biased relative to those community members assigned, a repeated measures ANOVA was performed on the data collected within the three stations that were staffed by both a community and a faculty member. The mean score assigned by faculty members (4.66) bordered on being significantly less than that assigned by community members (4.96; F1,53 = 3.972, mean squared error = 1.790, p = .06).
Using Generalizability Theory, variance components were calculated from the ANOVA described above and used to determine the reliability of the MMI. The generalizability of a single rating was .20. Table 2 reports the variance components. Overall test generalizability (i.e., the reliability of the average of all 18 ratings) was .78. Table 3 reports the results of a Decision Study performed to determine the optimal combination of stations and raters, assuming that 18 observations can be collected. It is clear from this Decision Study that increasing the number of stations has a greater impact on the reliability of the interview than does increasing the number of raters within each interview. Had only ten stations been used with one rater per station (as in past studies), the overall test generalizability would have been expected to be .71, using the formula $\sigma^2_{\text{candidate}} / \left(\sigma^2_{\text{candidate}} + \sigma^2_{\text{candidate} \times \text{station}}/10 + \sigma^2_{\text{candidate} \times \text{rater within station}}/10\right)$.
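The Decision Study logic can be illustrated with a short sketch. It assumes the standard generalizability formula for a design in which raters are nested within stations, which reduces to the formula quoted above when there is one rater per station. The variance components below are illustrative placeholders, not the values in Table 2; they were chosen only so that the resulting coefficients round to the .20, .78, and .71 quoted in the text. The function names are hypothetical.

```python
def g_coefficient(var_c, var_cs, var_crs, n_stations, n_raters):
    """Generalizability of a mean rating, raters nested within stations.

    var_c   : candidate variance
    var_cs  : candidate x station variance
    var_crs : candidate x rater-within-station variance (incl. residual)
    """
    error = var_cs / n_stations + var_crs / (n_stations * n_raters)
    return var_c / (var_c + error)


# Illustrative components only (NOT the study's Table 2 values)
VAR_C, VAR_CS, VAR_CRS = 1.0, 1.0, 3.0


def d_study(total_observations=18):
    """G coefficient for every stations x raters split of a fixed total."""
    results = {}
    for n_stations in range(1, total_observations + 1):
        if total_observations % n_stations == 0:
            n_raters = total_observations // n_stations
            results[(n_stations, n_raters)] = g_coefficient(
                VAR_C, VAR_CS, VAR_CRS, n_stations, n_raters
            )
    return results


table = d_study(18)
single_rating = g_coefficient(VAR_C, VAR_CS, VAR_CRS, 1, 1)
ten_by_one = g_coefficient(VAR_C, VAR_CS, VAR_CRS, 10, 1)
```

Because adding a station divides both error terms while adding a rater divides only the second, any split of the 18 observations that favors stations yields a higher coefficient under this model.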
The Relationship between Interviewers’ Characteristics and Ratings
To determine whether community members assigned candidates’ ratings comparable to those assigned by faculty members, a generalizability analysis was performed separately for each of the three groups of stations. The generalizability for the three stations that were staffed by two community members was highest at .58. The three stations that were staffed by two faculty members revealed the second highest generalizability = .46. Least reliable were the three stations that were staffed by one member of each group (generalizability = .31). Each pairwise difference is statistically significant: .58 versus .46, z(106) = 2.78, p < .05; .46 versus .31, z(106) = 3.12, p < .05; .58 versus .31, z(106) = 5.90, p < .05. If we were to assume a full nine-station MMI, the anticipated reliability with two raters per station would be .81, .72, and .58 for the three groups, respectively. In either case, the generalizability of the MMI appears to be lowest among stations evaluated by one community member and one faculty member, suggesting that there are larger inconsistencies in the way that community members rate candidates relative to the way that faculty members rate candidates than there are within either group of raters.
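The projection from three stations to a full nine-station MMI can be approximated with the Spearman-Brown prophecy formula. The text does not say which method was used, so treating it as Spearman-Brown is an assumption; the group labels in the dictionary are also just illustrative names.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Observed three-station generalizability coefficients from the text
observed = {"two community": 0.58, "two faculty": 0.46, "mixed": 0.31}

# Lengthen from 3 stations to 9 stations (factor of 3)
projected = {
    group: spearman_brown(g, 9 / 3) for group, g in observed.items()
}
```

With the rounded coefficients as inputs, the projections land close to the reported .81, .72, and .58; the small discrepancy for the mixed group (about .57 here) is presumably due to rounding in the published inputs.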
Table 4 illustrates the responses given by both candidates and examiners regarding their views of the MMI. The responses were positive, indicating that participants did not view the MMI as any more onerous or anxiety provoking than a more traditional interview.
Replicating the work of Eva et al.,1 this study revealed that the MMI can be a reliable evaluation instrument for medical school admissions. Also consistent with this past work and the context specificity phenomenon were the results of the Decision Study that showed that the number of interviews (i.e., stations) is a more important determinant of the overall reliability of the instrument than is the number of panelists within any given interview. This is the first time the optimal balance between interviews and interviewers has been examined with actual candidates to an undergraduate MD program, but Eva et al. were able to derive the same conclusion based on a sample of graduate students who participated in a pilot study.
These findings suggest that the demonstration of adequate interrater reliability, which has been used in the past as an argument for standardized interviews, is insufficient evidence to ensure that an interview is measuring stable and generalizable applicant characteristics. By contrast, the findings suggest that applicants will vary considerably, in unpredictable fashion, from one interview to another. Consequently, the scores derived from any one interview will be a poor predictor of performance in a second interview. The situation is exactly analogous to the assessment of competence in an OSCE, where performance on one station is a poor predictor of performance on a second station, so that it is necessary to sample 15–20 stations to achieve reproducible assessment.11 Analogous to the OSCE, the MMI provides candidates with a fresh start after each station, thereby providing an independent assessment of each candidate's performance in multiple situations. Similarly, any chance effects of being randomly assigned to an “easy” or “hard” panel of interviewers will be diluted with the MMI as candidates are exposed to a greater number of examiners. It is still true that all stations are presented within a single, high-stakes interview context and that, as a result, the situational consistency is likely greater in the MMI than in the world at large, but the low generalizability of ratings assigned to a single MMI station strongly suggests that even within this constrained context there is a substantial amount of situational dependency, thereby emphasizing the need for a multiple-sample approach.
Of further interest is the finding that community members’ ratings were less consistent with those provided by faculty members than were the ratings provided within either group. More research is required to determine whether one group's ratings are more predictive of success in medical practice than are the other groups’ ratings, or if both groups simply provided equally valid ratings of different aspects of each candidate's performance. At the very least these results support Ferrier et al.'s9 claim that using heterogeneous raters may result in a more heterogeneous class. The difference we observed in the mean scores faculty and community raters provide may be overcome with further training, but the absolute difference in scores will not matter as long as all circuits contain an equal proportion of examiners from each group. It should be noted that the distinction drawn in this study between raters of different backgrounds is very broad. Additional examination of the influence of characteristics of individual examiners is warranted, but it is less clear how this information could be used because it is likely infeasible for admissions committees to interview the interviewers before determining their suitability to participate. Instead it might be more fruitful to further consider differences between faculty members, medical students, and community members from various backgrounds (e.g., law, community physicians, nonprofessionals).
A further limitation of this study is the fact that participants were aware that the results would have no impact on the admission decision. Although only time will tell whether the psychometric characteristics of the MMI will change when the stakes are higher, the results have been robust and participants approached the task with sufficient realism that many candidates have subsequently requested feedback on their performances.
Additional advantages to the MMI include the potential to achieve the four purposes of admissions interviews identified by Edwards et al.4 (i.e., information gathering, decision making, verification, and recruitment) without confounding these purposes within a single interview (e.g., one station could be designed as a recruitment station without the goal of attracting the best candidates affecting the rest of the interview process). The MMI also corrects for the inefficient use of time that has been identified by Litton-Hawes et al.12 as a problem in more traditional interviews. Candidates believed that the eight minutes they were provided for each station was sufficient and, anecdotally, various examiners thought the time could be shortened, which also may be an indication of how rapidly interviewers form a judgment in a traditional long interview. Staffing each station with a single interviewer also has the potential to correct the imbalance in numbers between interviewers and candidates that has drawn criticism from candidates as an intimidating feature of traditional interviews.2 Finally, the MMI maintains the ability of the admissions protocol to demonstrate the value that the institution places on interpersonal interactions, a “human touch” criterion that was rightly identified by Albanese et al.2 as an important goal of the interview process. Using the MMI so that the face-to-face component of the admissions protocol is conducted in a reliable manner that takes into account context specificity might also demonstrate the value that the selecting program places on the virtues of “critical appraisal” and “scholarship.”
The authors would like to express their gratitude to the examiners and candidates. They thank Jane Bennett, Annette Schrapp, and Jessica Johnston for organizing the logistics of the process. Sara Cymbalisty recruited and trained the simulator, Shiphra Ginsburg provided the professionalism videos used in Stations 4 and 6, and Mary Lou Schmuck performed some data analyses. They are also grateful for the support and insight of Alan Neville and Susan Denburg. Finally, they are grateful for the financial support provided by the Programme for Educational Research and Development, McMaster University.
1.Eva KW, Rosenfeld J, Reiter HI, Norman GR. An admissions OSCE: the multiple mini-interview. Med Educ. 2004;38:314–26.
2.Albanese MA, Snow MH, Skochelak SE, Huggett KN, Farrell PM. Assessing personal qualities in medical school admissions. Acad Med. 2003;78:313–21.
3.Puryear JB, Lewis LA. Description of the interview process in selecting students for admission to US medical schools. J Med Educ. 1981;56:881–5.
4.Edwards JC, Johnson EK, Molidor JB. The interview in the admission process. Acad Med. 1990;65:167–75.
5.Eva KW. On the generality of specificity. Med Educ. 2003;37:587–8.
6.CanMEDs 2000 Project. Skills for the new millennium: report of the societal needs working group 〈http://rcpsc.medical.org/english/publications/CanMEDS_e.pdf〉. Accessed 16 July 2003. Ottawa, Ontario: Royal College of Physicians and Surgeons of Canada, 1996.
7.Price PB, Lewis EG, Loughmiller GC, Nelson DE, Murray SL, Taylor CW. Attributes of a good practicing physician. J Med Educ. 1971;46:229–37.
8.Reiter HI, Eva KW. Reflecting the relative values of community, faculty, and students in the admissions tools of medical school. Submitted manuscript.
9.Ferrier BM, McAuley RG, Roberts RS. Selection of medical students at McMaster University. J Roy Coll Phys. 1978;12:365–78.
10.Reiter HI, Rosenfeld J, Nandagopal K, Eva KW. Do clinical clerks provide candidates with adequate formative assessment during objective structured clinical examinations? Adv Health Sci Educ. In press.
11.Newble DI, Swanson DB. Psychometric characteristics of the objective structured clinical examination. Med Educ. 1988;22:325–34.
12.Litton-Hawes E, MacLean IC, Hines MH. An analysis of the communication process in the medical admissions interview. J Med Educ. 1976;51:332–4.