Purpose: Traditional admissions personal interviews provide flexible faculty–student interactions but are plagued by low inter-interview reliability. Axelson and Kreiter (2009) retrospectively showed that multiple independent sampling (MIS) may improve reliability of personal interviews; thus, the authors incorporated MIS into the admissions process for medical students applying to the University of Toronto’s Leadership Education and Development Program (LEAD). They examined the reliability and resource demands of this modified personal interview (MPI) format.
Method: In 2010–2011, LEAD candidates submitted written applications, which were used to screen for participation in the MPI process. Selected candidates completed four brief (10–12 minutes) independent MPIs, each with a different interviewer. The authors blueprinted MPI questions to (i.e., aligned them with) leadership attributes, and interviewers assessed candidates’ eligibility on a five-point Likert-type scale. The authors analyzed inter-interview reliability using generalizability theory.
Results: Sixteen candidates submitted applications; 10 proceeded to the MPI stage. Reliability of the written application components was 0.75. The MPI process had overall inter-interview reliability of 0.79. Correlation between the written application and MPI scores was 0.49. A decision study showed acceptable reliability of 0.74 with only three MPIs scored using one global rating. Furthermore, a traditional admissions interview format would take 66% more time than the MPI format.
Conclusions: The MPI format, used during the LEAD admissions process, achieved high reliability with minimal faculty resources. The MPI format’s reliability and effective resource use were possible through MIS and employment of expert interviewers. MPIs may be useful for other admissions tasks.
Dr. Hanson is associate dean, Undergraduate Medicine, Admissions and Student Finances, and associate professor of psychiatry, Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada.
Mr. Kulasegaram is PhD candidate, McMaster University, Hamilton, Ontario, Canada, and research fellow, Wilson Centre for Health Professions Education Research, Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada.
Dr. Woods is a scientist, Wilson Centre for Health Professions Education Research, Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada.
Ms. Fechtig is curriculum coordinator, Manager and Collaborator Competencies, Center for Interprofessional Education, and Leadership Education and Development program, Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada.
Dr. Anderson is professor, Institute of Health Policy, Management and Evaluation, and program director, Leadership Education and Development program, Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada.
Correspondence should be addressed to Dr. Hanson, Faculty of Medicine, University of Toronto, 1 King’s College Circle, Room 2135, Toronto, Ontario, Canada M5S 1A1; telephone: (416) 946-7972; e-mail: firstname.lastname@example.org.
The admissions personal interview is used ubiquitously for the purpose of medical student selection.1 As an admissions tool, the personal interview focuses on the assessment of personal characteristics1 and lends a humanistic dimension to a high-stakes selection process.2 These benefits must, however, be considered alongside the personal interview’s well-known psychometric limitations. Particularly, personal interviews have low overall reliability—lack of agreement among interviewers and lack of consistency across different interview occasions—which, in turn, limits their predictive power.3,4 Admissions committees that ignore this psychometric dimension of personal interviews may, therefore, inadvertently diminish both the effectiveness of interviews and their valued humanistic dimension.
One common solution to increase the reliability of a performance measurement is to assess samples of the performance independently multiple times—that is, the multiple independent sampling (MIS) method. In other words, multiple raters evaluate performance on separate occasions, and each evaluation involves different raters. Averaging across multiple independent samples of an applicant’s performance in multiple interviews has been shown to increase reliability.5 The most notable use of this measurement technique is the Multiple Mini-Interview (MMI),6,7 which uses up to 10 highly structured, scenario-based interviews to assess applicants. Interestingly, the Association of American Medical Colleges recently reported that a preponderance of schools use the admissions personal interview, not the MMI,1 despite not only the evidence regarding the MMI’s high reliability and moderate validity7–9 but also the critical psychometric limitations of the personal interview.2–4
Two critical factors contributing to the low uptake of the MIS method in admissions interviews are its high resourcing requirements (10 interviewers needed to attain reliable scores) and the potential effects on recruitment (due to the associated alterations to campus recruitment-focused activities).10 Additionally, the flexibility and intuitive simplicity of the personal interview may make admissions committees (and interviewers) reluctant to abandon it altogether.
To provide an improved personal interviewing option for schools facing such impediments, Axelson and Kreiter10 investigated the application of MIS to the admissions personal interview itself. In their 2009 investigation, they reviewed the multiple interview scores of applicants who had been interviewed twice in consecutive years by a panel of two interviewers for admission to medical school. They estimated, after analyzing the scores of 168 candidates across four years who had interviewed twice, that reasonable reliability could be achieved using a traditional personal interview format by reducing the number of interviewers in the panel while increasing the number of separate personal interviews. Thus—instead of relying on a large number of structured scenarios—admissions committees might be able to depend on multiple, brief, single-rater interviews to enhance the reliability of the personal interview. Axelson and Kreiter presented new (but retrospective) evidence that a traditional personal interview format can be adapted to the MIS strategy to achieve acceptable reliability without automatically increasing either the financial or time costs. Further, they presented evidence that the MIS format could be employed within the admissions context of personal interaction and recruitment.
We report here the first prospective empirical test of the reliability of a similar modification to the admissions personal interview format: the modified personal interview (MPI), which applies an MIS methodology. In 2010–2011, the Leadership Education and Development Program (LEAD) was initiated at the University of Toronto. The program’s inaugural admissions process (to select a cohort of students for LEAD in the summer of 2011) provided the opportunity to employ the MPI format and, thereby, assess its reliability as a selection strategy.
The University of Toronto Faculty of Medicine is the sole medical school within Canada’s largest city. The first-year medical student class (N = 250) constituted the eligible LEAD applicant pool.
The University of Toronto’s ethics review board has deemed this project to be program evaluation and, thus, exempt from research ethics review. The project’s conclusions, which we present herein, are secondary to a quality assurance project conducted to evaluate admissions processes at the Faculty of Medicine within the University of Toronto.
We informed the first-year students about LEAD and its selection process via announcements made during class and notifications sent over e-mail. The selection process consisted of the submission of written materials followed by, for a selected subset of candidates, the MPI process. We derived the attributes of successful LEAD candidates from the literature on leadership11,12 and through LEAD faculty consensus. These desired attributes were communicated to the pool of potential applicants and blueprinted onto (aligned with) questions asked during the MPI process. We applied MIS to both the written submission materials and the MPIs, as detailed below.
Written submission materials
The written submission materials comprised three components: a two-page curriculum vitae (CV) summarizing applicants’ academic and leadership experiences, three brief descriptions of leadership experiences reported in the CV, and a brief vision statement of leadership goals and career aspirations. We removed identifying information from these written submission materials for faculty review in order to lessen the potential for assessor bias. We subdivided each applicant’s submission package into its components, and specific assessors independently rated each subcomponent. Further, the eight assessors rated different, specific leadership attributes for each particular component of the written submissions. For example, the raters of the CV assessed the applicant’s history of academic achievement and leadership experiences, as well as the evidence of his or her communication skills in diverse contexts and of his or her community service/social responsibility. The CV raters rated each of these four items as not assessed, unacceptable, acceptable, or outstanding for a total possible numeric score of 8. Similarly, the raters who examined leadership experiences rated these descriptions on an additional set of four items (attributes), again for a total possible score of 8. The raters who assessed leadership goals and career aspirations assessed the students’ statements on three items for a total possible numeric score of 6. The LEAD director (G.A.) rated all written submission components of all applicants in an effort to gain an impression of the overall caliber of the applicant pool. There was some rating overlap among the components because the CV, brief descriptions, and vision statements all assessed a range of attributes associated with leadership potential. For instance, the CV and brief descriptions both assessed the nature of applicants’ leadership experiences.
Selected attributes of leadership potential were also mapped onto the interview component. Selection for interview was based on the sum of scores on each component, with a maximum possible total of 22. The top-ranked applicants, on the basis of final score, were invited to the MPI stage.
The MPI process
Candidates who proceeded on to the interview stage moved among four interview rooms to complete the MPIs in succession. Each MPI was about 10 to 12 minutes long; a few interviews were longer at the discretion of the faculty interviewer. The four interviewers, all of whom had participated in the review of the written materials, framed all questions as behavioral descriptive questions (e.g., “When you entered a new workplace in the past, how did you go about meeting and developing relationships with new colleagues and supervisors?”), which have strong validity in assessing personal characteristics.13–15
Interviewers received training on the MPI format and on the focus of the interviews. Three of the interviews were semistructured, and the interviewers used a list of predetermined questions. The faculty director for LEAD (G.A.) conducted the remaining interview to assess students’ vision and expectations of LEAD; however, to foster the personalized aspect of a mentor/mentee relationship, the questions for this interview were not predetermined. We deemed this modification for the interview with the program director necessary because he would have to further develop individualized mentorship relationships with each selected LEAD student. This modification also allowed students to freely ask questions about the program in a one-on-one personal interview with the program director.
All four interviewers rated three common attributes—maturity, communication skills, and interpersonal skills—and a fourth attribute unique to their MPI.
The four leadership attributes unique to each respective MPI were (1) ability to work in teams, (2) vision and expectations of LEAD, (3) bandwidth and adaptability, and (4) self-reflection and personal insight. Ability to work in teams encompassed collaboration and team leadership. Vision and expectations of LEAD focused on the specific goals that applicants had set for themselves as leaders and the fit of these with the overall goals of LEAD. We defined bandwidth as the ability to take a wide-ranging approach to current objectives while managing multiple competing priorities and interests. Self-reflection referred to the ability of candidates to consider anew a specific past leadership experience.
The interviewers evaluated each attribute as a separate item on a five-point Likert-type scale (1 = poor, 2 = good, 3 = very good, 4 = excellent, and 5 = outstanding) to increase the scoring range available to interviewers. All items were totaled for a final MPI score out of 20, and overall total scores were used for selection.
LEAD interviewers had access to applicants’ written submission materials (but not the scores for the written materials) both before and during the MPIs. Interviewers were told to adjust interview scores retrospectively if needed as they assessed more candidates and developed a better appreciation for the range of responses. They were also instructed to base their scoring on the actual MPI performance and not the written material.
We analyzed all types of reliability using generalizability theory (i.e., via a g-study). The following formula was used for overall reliability:
$$\text{Generalizability (overall)} = \frac{VC(p)}{VC(p) + \dfrac{VC(i)}{n_i} + \dfrac{VC(q{:}i)}{n_{q:i}} + \dfrac{VC(pi)}{n_i} + \dfrac{VC(pq{:}i)}{n_i \, n_{q:i}}}$$
where
* VC is the variance component,
* p is participant,
* i is MPI (interview),
* q is question (item),
* q:i indicates questions nested within interview, and
* n denotes the number of levels of the subscripted facet (n_i interviews; n_{q:i} questions per interview).
(See also Results.)
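As a rough illustration only, the overall generalizability formula above can be written as a short Python function. The function name and the variance components passed to it are hypothetical placeholders chosen for this sketch; they are not the values reported in Table 1, and the code is not the G-String IV analysis itself.

```python
def generalizability(vc_p, vc_i, vc_qi, vc_pi, vc_pqi, n_i, n_qi):
    """Overall generalizability coefficient, term by term as in the formula above.

    vc_p   : variance component for participants, VC(p)
    vc_i   : variance component for interviews, VC(i)
    vc_qi  : variance component for questions nested within interview, VC(q:i)
    vc_pi  : participant-by-interview interaction, VC(pi)
    vc_pqi : participant-by-question interaction nested within interview, VC(pq:i)
    n_i    : number of interviews (MPIs)
    n_qi   : number of questions (items) per interview
    """
    error = (
        vc_i / n_i
        + vc_qi / n_qi
        + vc_pi / n_i
        + vc_pqi / (n_i * n_qi)
    )
    return vc_p / (vc_p + error)


# Hypothetical variance components for illustration only (NOT from Table 1).
hypothetical = dict(vc_p=1.0, vc_i=0.05, vc_qi=0.02, vc_pi=0.9, vc_pqi=0.4)

# A d-study style sweep: vary the number of interviews, hold items per interview at 4.
for n_i in (1, 2, 3, 4):
    g = generalizability(n_i=n_i, n_qi=4, **hypothetical)
    print(f"{n_i} interview(s): G = {g:.2f}")
```

Under these placeholder values, the coefficient rises as interviews are added, because the interview-level error terms are divided by n_i.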
A decision study (d-study) was also conducted to determine how changing the number of interviews and items affected reliability. We also examined the correlation of total scores from the written submission materials and the MPI component. Analysis of generalizability was conducted with G-String IV software (Bloch & Norman, Hamilton, Ontario, Canada). Other analyses were conducted with SPSS 16 (Chicago, Illinois).
Sixteen candidates submitted initial applications to LEAD. Of these, we selected 10 for the MPI stage, 8 of whom were selected for the program. The entire set of MPIs was completed in three hours in one afternoon.
The averaged total score for the written component was 14.5 (standard deviation [SD] = 2.3). We ran the generalizability study for the written submission materials with 16 participants (p) crossed with the 3 components (c) with raters nested within components (r:c) and with items nested within components (i:c). The overall reliability for the written submission materials’ assessment was 0.75.
The average score on the individual MPIs ranged from 14.8 to 16.1 across the four MPIs, and the average overall score was 61.35 (SD = 9.6). We conducted the g-study for the MPI format with 10 participants (p) crossed with MPIs (i) as a random factor and with questions nested within interview (q:i). The dependent variable was the score assigned to each participant. Results of the g-study for MPIs are reported in Table 1.
The majority of variance among MPI scores (58%) was attributable to the participant–interview interaction (pi) and the participant–question interaction nested within MPI (pq:i), which suggests that these facets caused random error in the assessment of applicants. Together, these interaction terms (pi and pq:i) contributed considerably more variance than the “true score” variance (i.e., true differences) of participants (40%). Variance due to systematic differences between interviewers and items nested within MPIs was minimal.
Overall reliability of the MPI component and subsequent average MPI reliability was 0.79. The reliability of questions nested within MPIs (q:i) was 0.97. A separate analysis of items assessing only attributes common to all MPIs (attributes assessed across all interviews; i.e., maturity, communication skills, and interpersonal skills) showed a high inter-item reliability at 0.74. A d-study of inter-interview reliability as a function of the number of MPIs is presented in Figure 1. An acceptable reliability of 0.74 can be achieved by averaging the scores of three MPIs. Reducing the number of items nested within rater had less impact on reliability because the variance due to participant–question interaction nested within interviews (VC[pq:i]) is smaller than variance due to participant–interview interaction (VC[pi]). The Pearson correlation between the total score on the written submission materials and total score on the MPI component was 0.49 (P < .01) before correction for range restriction and 0.60 after correction.
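The relationship between the four-MPI reliability (0.79) and the three-MPI d-study estimate (0.74) can be approximated with the Spearman-Brown prophecy formula. This is a simplified sketch, not the d-study reported here: it assumes the interviews behave as parallel measurements and ignores the nested question facet, and the function names are introduced only for illustration.

```python
def single_unit_reliability(r_k, k):
    """Back out the reliability of one interview from the reliability of k averaged interviews."""
    return r_k / (k - (k - 1) * r_k)


def spearman_brown(r_1, m):
    """Project the reliability of the average of m parallel interviews."""
    return m * r_1 / (1 + (m - 1) * r_1)


# Reported overall reliability with four MPIs.
r_one = single_unit_reliability(0.79, 4)

for m in range(1, 7):
    print(f"{m} MPI(s): projected reliability = {spearman_brown(r_one, m):.2f}")
```

Under this approximation, the projection for three MPIs rounds to 0.74, in line with the d-study value, and each additional interview beyond four adds progressively less reliability.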
The item–total correlation between the total score from the unstructured MPI and overall interview score was 0.70. The average item–total correlation for the semistructured MPIs was 0.71. Total faculty time spent in the MPI process was 8 hours (12 minutes per MPI for 10 applicants, multiplied by the number [n = 4] of faculty members).
Discussion and Conclusions
This report provides some evidence that MIS as applied within the MPI format is a reliable selection strategy. High reliability was achieved with just four MPIs, and a d-study revealed that future MPIs can achieve reliability greater than 0.7 with only three MPIs. A total of only 8 faculty hours was spent conducting the MPI process. A comparable traditional admissions personal interview of 40 minutes’ duration with two interviewers would take more than 13 faculty hours (66% more time) for the same number (n = 10) of applicants. Therefore, although the current study assessed selection for a small, selective program within the medical school, the results suggest that MPIs might also prove useful in general medical school admissions.
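The faculty-time comparison above is simple arithmetic, sketched below using only the figures stated in the text (12-minute MPIs with four interviewers versus a 40-minute interview with two interviewers, for 10 applicants). The helper name `faculty_hours` is introduced for this illustration, and the calculation assumes no transition or scoring time between interviews.

```python
def faculty_hours(minutes_per_interview, interviewers_per_applicant, applicants):
    """Total faculty time in hours, counting each interviewer's minutes separately."""
    return minutes_per_interview * interviewers_per_applicant * applicants / 60


mpi_hours = faculty_hours(12, 4, 10)          # four single-rater MPIs per applicant
traditional_hours = faculty_hours(40, 2, 10)  # one two-interviewer panel per applicant

extra_fraction = (traditional_hours - mpi_hours) / mpi_hours

print(f"MPI format:         {mpi_hours:.1f} faculty hours")
print(f"Traditional format: {traditional_hours:.1f} faculty hours")
print(f"Additional time:    {extra_fraction:.1%}")
```

The traditional format works out to roughly two-thirds more faculty time than the MPI format, which matches the figure of about 66% quoted above.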
This modification of the personal interview (that is, employing multiple interviewers to rate applicants on interviewer-specific attributes in a series of short interviews) has the potential to increase the uptake of MIS in admissions interviews. Previous application of MIS in the MMI showed that at least 10 separate interviews were needed to achieve acceptable reliability.6,7 The MPI here met a minimum threshold at 3 interviews. A potential explanation for this finding is the specialized selection context of the LEAD admissions process. The MPI as used in this process focused on a narrow set of attributes related to leadership qualities as determined by LEAD faculty. Other aspects of nonacademic performance had already been assessed in the medical school admissions process. This specialized context also enabled the use of expert raters, which may have further enhanced interview reliability. The interviewers comprised a small cadre of faculty who not only were intimately involved in the development of both the program and its selection process but also held the requisite expertise in leadership education. These factors, in addition to the semistructured MPI format, could have contributed to the high reliability of the MPIs after just four interviews. Nonetheless, despite these potential confounders, the MPI process demonstrates the potential to save valuable time while still producing reliable results.
The specialization of this selection process (with interviewers rating applicants’ performances according to a predetermined, defined suite of attributes aligned with a specific physician role, in this case the role of physician leader) also lends face validity to the MPI format. Face validity has been described as the extent to which the applicants believe the application process is relevant to the job in question16 or, to extrapolate to the medical school context, to the medical school curriculum. Applicant acceptance of admissions processes has been associated with face validity.16 Thus, we would not expect the recruitment of applicants to decrease through the use of the MPI format. Future studies of the MPI will need to explicitly assess the format’s effect on recruitment and on applicants’ views through exit surveys with a larger sample or through qualitative research methods, such as interviews. Any such investigation will be a valuable extension of this body of research, consistent with Axelson and Kreiter’s10 assertion that the admissions tools a medical school employs potentially affect that school’s recruitment.
In the current study, the overlap of attributes across the written application and MPIs enhanced content validity (and reliability). The strong association between written application scores and MPI performance is likely a result both of the intentional attribute mapping or blueprinting we performed in developing the LEAD application process and of the availability of written application materials during MPI occasions. The use of raters from the written application as interviewers may have also contributed to the strong association of scores across both evaluations, even though we removed all personal identifying information from the candidates’ written application materials. Although the availability of written materials during MPI occasions could have contributed to this high correlation, the scores for these materials were not available. The availability of the written components of the application is a limitation of this study that will be addressed in further research. The use of different raters without access to written materials in MPIs may begin to clarify whether the association of scores between the written application and the MPI is the result of a halo bias formed while reviewing the written materials, or the result of a desired convergence across two assessment measures indicating construct validity.
Our findings from this evaluation of the MPI format provide evidence that MIS may be a viable means for resurrecting reliable personal interviews during the admissions process while balancing the needs of recruitment and resourcing. The MPI format represents another interview option in the admissions toolbox—that is, another alternative available to admissions committees. In particular, for smaller-scale, specialized selection tasks (e.g., subspecialty residency programs, MD/PhD programs, special medical school curricula), the MPI format can maximize inherently limited resource capabilities to achieve reliable selection data to use for decisions.
Future research is needed to address the reliability of the MPI in general medical school admissions that weigh a broader range of attributes, assess more applicants, and employ more interviewers with less specialized experience. Future research will determine the predictive validity of MPI scores for applicants accepted into LEAD and for other medical education programs or institutions.
Other disclosures: None.
Ethical approval: The University of Toronto’s ethics review board has deemed this project to be program evaluation and, thus, exempt from research ethics review.
Previous presentations: This study was delivered as an oral presentation at the Wilson Centre Richard Reznick Research Day by Mr. Kulasegaram in October 2011 in Toronto, Ontario, Canada.