Using admission interviews, among other admission criteria, to select medical students is a common practice with a long history.1,2 The types of interviews that are conducted vary widely, from traditional, unstructured interviews to structured interviews with objective scoring protocols. Unstructured interviews, characterized by a conversational, informal style, questions that are not specified in advance, and a lack of objective scoring criteria,3 appear to be most commonly used among medical schools.4,5 This preference is surprising in light of the susceptibility of unstructured interviews to a variety of biases. In their review of selection interview literature, Edwards and colleagues1 noted several sources of bias, including rater tendencies (leniency, severity, “halo” effects), demographic factors, stereotypes regarding the characteristics of a “good” applicant, and order effects. Edwards and others have called for the use of structured interviews, which involve consistent, predesignated questions and quantitative scoring criteria.2,4,6 Structured interviews can be evaluated for reliability and validity (including predictive utility), and, as such, are more scientifically and ethically defensible than are unstructured interviews.
In 1996, interviews were reinstated as a standard part of the admission process at the University of Iowa College of Medicine. Because of the previously stated problems with unstructured interviews, the interview had not been part of the standard admission process since 1973.7 However, interest in developing an effective interview grounded in the most recent research on the selection of medical professionals and the use of structured interviews led to the formation of an interdisciplinary committee to develop and evaluate a structured interview for the admission process.
The development of the interview proceeded in five phases that are briefly outlined here (the interested reader is encouraged to contact the authors for an extended description):
Phase 1. About seven months prior to the first interview, the committee decided on the essential features of the interview (i.e., should the interview contain standard and structured questions, a common rating template, etc.?).
Phase 2. Candidates' credentials were reviewed to determine which domains would be assessed by the interview and which would be assessed via other means (e.g., letters of recommendation). Phase 2 concluded with the development of questions and scoring templates.
Phase 3. The interview questions and templates were pilot tested and revised.
Phase 4. Faculty interviewers were trained.
Phase 5. The interviews conducted by the trained faculty were observed and evaluated, and feedback was given to the faculty to improve their interviewing and rating.
The present study reports on the initial reliability and validity of the structured interview in Phase 5, and compares its scores with those of other standard admission criteria. We conclude with a discussion of the implications our study's findings have for future validation of the interview protocol and for other medical schools seeking to develop a structured interview.
The interviewers were 73 volunteers from the faculty of the University of Iowa College of Medicine (63 men, 10 women). After responding affirmatively to a recruitment letter from the director of medical admissions, the interviewers were trained and scheduled in pairs to conduct the interviews. (Complete procedures are available from the authors.)
In the fall of 1996, following an initial screening of standard application materials for 2,406 applicants, 490 candidates participated in on-campus interviews. The racial and ethnic makeup of this group (according to self-report in the application materials) was 35 (7%) African American, 3 (.6%) American Indian or Alaskan Native, 53 (11%) Asian American, 29 (6%) Hispanic or Latino(a), 357 (73%) white, and 13 (3%) did not respond. A total of 289 (59%) interviewees were men, and 286 (58%) were in-state applicants.
The final version of the interview consisted of nine questions, each with its own scoring template. The candidates' responses to each question were rated on a scale of 1 to 5 (Chart 1 shows an example of a scoring template). The interviewers also assigned overall interview scores on a scale of 1 to 5 (1 = a problematic candidate, 3 = an average candidate, 5 = a truly outstanding candidate). Examples of the interview questions used were “Thinking back over the past few years, what is one experience you have had that influenced or changed your life in a significant way?” and “The practice of medicine is changing rapidly and, as a physician, you will be involved in this evolution. How do you see those changes affecting your role in the practice of medicine?”
Each interview took approximately 15 minutes to complete and an additional ten minutes to score. Two faculty members participated in the interview of each applicant, except in rare instances when one interviewer was not available. Faculty interviewers scored each answer at its conclusion, but scored the overall interview after the interview was completed. When faculty participated in more than one interview, it was not always possible to pair them again with the same faculty partner.
Descriptive statistics, interrater agreement, and correlations with standard selection criteria [undergraduate grade-point average (GPA), Medical College Admission Test (MCAT) scores, and Iowa Evaluation Form (IEF) scores] were calculated. To determine how much the overall interview score accounted for variance in admission status beyond the other admission criteria, a hierarchical stepwise regression was conducted.
Descriptive statistics were obtained for the entire set of interviews. Means and standard deviations (SDs) calculated for each interview question, as well as for the overall score, are presented in Table 1. Mean scores were all above the scoring scale's midpoint (3 on a scale of 1 to 5). Question 1 (interest in medicine as a career) had the highest mean score (4.03, SD = 0.79), and question 9 (future of medicine) had the lowest (3.43, SD = 1.00). The mean (SD) for the overall score was 3.74 (0.84). For the sake of comparison, Table 1 also presents descriptive statistics for the subset of 148 students who were accepted to and subsequently enrolled at the university.
Mean scores (SDs) for each interview question, based on applicants' admission status (accepted, not accepted, or wait-listed on decision date of May 15, 1997), are presented in Table 2. The admission status for many applicants changed as some individuals declined acceptance and others from the wait-list were eventually accepted. Thus, the data in Table 2 represent the admission committee's initial decisions before applicants were notified. To assess differences in interview performances among the applicants who were initially accepted, rejected, or wait-listed, a one-way analysis of variance (ANOVA) was conducted on scores for each interview question, plus the overall score. For all interview questions, applicants who were accepted scored, on average, significantly higher than did applicants who were wait-listed. Moreover, with the exception of question 2, the average scores of the applicants who were wait-listed were significantly higher than those of the applicants who were not accepted.
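The group comparison described above rests on a standard variance partition. The sketch below (hypothetical scores, not the study's data) computes the one-way ANOVA F statistic from scratch, splitting total variability into between-group and within-group components:

```python
# One-way ANOVA F statistic, computed from scratch (hypothetical data).
# The three groups mimic the study's admission-status categories.

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    k = len(groups)                           # number of groups
    n_total = sum(len(g) for g in groups)     # total observations
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-groups sum of squares: weighted squared deviations of group means
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: squared deviations around each group mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical overall interview scores by initial admission status
accepted    = [5, 4, 5]
wait_listed = [3, 3, 4]
rejected    = [2, 2, 1]

f, df1, df2 = one_way_anova_f([accepted, wait_listed, rejected])
print(f"F({df1}, {df2}) = {f:.2f}")
```

The F ratio is then referred to an F distribution with (df_between, df_within) degrees of freedom; in the study this test was run once per question plus once for the overall score.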
An appropriate estimate of interrater agreement is the kappa statistic; however, this method relies on consistent rater pairings (i.e., interviewer A always being paired with interviewer B), which our study could not maintain. Therefore, to provide an estimate of interviewer agreement, we calculated the agreement between raters for each interview question and for the overall score. Table 3 summarizes these data according to the proportion of rater pairs who gave the same score on a given question, the proportion of rater pairs whose scores were within one point of each other, and the proportion of rater pairs whose scores differed by more than one point. Overall, interrater agreement was good; the percentages of rater pairs whose scores differed by one point or less ranged from 87% (question 9) to 98% (overall score).
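The tallies in Table 3 reduce to simple counting over rater pairs. A minimal sketch, using invented score pairs since the raw ratings are not reproduced here:

```python
# Interrater agreement as summarized in Table 3: for each pair of ratings,
# tally exact agreement, agreement within one point, and larger differences.

def agreement_proportions(pairs):
    """pairs: list of (rater_a, rater_b) scores on the 1-to-5 scale."""
    n = len(pairs)
    exact = sum(1 for a, b in pairs if a == b)
    within_one = sum(1 for a, b in pairs if abs(a - b) <= 1)
    return {
        "same score": exact / n,
        "within one point": within_one / n,
        "more than one point": (n - within_one) / n,
    }

# Hypothetical ratings for one interview question
pairs = [(4, 4), (3, 4), (5, 5), (2, 4), (4, 3)]
print(agreement_proportions(pairs))
```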
Relationship of the Structured Interview to Other Selection Criteria
Correlation and regression analyses were conducted to examine the extent to which the structured interview, independent of the other selection criteria (GPA, MCAT scores, IEF scores), was associated with admission status and uniquely predictive of it. For these analyses, each applicant's interview performance was represented by the mean of the two raters' interview scores. We assessed correlations for the entire group of interviewees as well as for the subset of individuals who subsequently enrolled at the university.
For the entire sample, the overall interview score correlated .10 (p <.05) with cumulative undergraduate GPA. Although this correlation reached statistical significance at p =.05, the strength of the relationship was low and the correlation with GPAs of students who subsequently enrolled at the university was even weaker. The −.05 correlation for the enrolled students' subset did not approach statistical significance (p =.52).
Again, even in cases where the relationship between interview performance and MCAT scores reached statistical significance, the correlations were quite low. For the entire sample, overall performance on the interview correlated .18 (p <.01) with MCAT Biological Science, .08 (p >.05) with MCAT Physical Science, and .10 (p <.05) with MCAT Verbal Reasoning. For the subgroup of students who enrolled at the university, these correlations were .00 (p =.99), −.10 (p =.24), and .04 (p =.65), respectively.
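Each coefficient above is a Pearson r. The sketch below (hypothetical paired scores) computes r and the t statistic from which such p-values are derived, referring t to a t distribution with n − 2 degrees of freedom:

```python
import math

# Pearson correlation and its associated t statistic (the p-values in the
# text come from referring t to a t distribution with n - 2 df).

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical overall interview scores paired with MCAT Verbal Reasoning
interview = [1, 2, 3, 4]
mcat_vr   = [1, 3, 2, 4]

r = pearson_r(interview, mcat_vr)
n = len(interview)
t = r * math.sqrt((n - 2) / (1 - r ** 2))   # test statistic for H0: rho = 0
print(f"r = {r:.2f}, t({n - 2}) = {t:.2f}")
```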
Finally, the overall interview score was correlated with the IEF, an in-house evaluation form, usually completed by an academic supervisor, that is designed to estimate an applicant's noncognitive qualities in the broad areas of synthesis and integration of information (scale 1), interpersonal skills (scale 2), flexibility and/or compassion (scale 3), and confidence and/or professionalism (scale 4). None of the correlations between the overall interview score and any of the scores on the four IEF scales reached statistical significance at p =.05. This finding held true for the overall sample as well as for the subset of enrolled students.
Relationship of the Interview to Admission Status
Higher scores on the interview predicted a greater likelihood of being accepted to the medical school. Admission status was coded in reverse order of desirability (1 = accepted, 2 = wait-listed, 3 = rejected). We found a significant negative correlation between the overall interview score and admission status for the entire sample (r = −.537, p <.001). This strong inverse relationship was in accord with the findings of the previously reported one-way ANOVAs.
To examine the extent to which the overall interview score accounted for admission status variability above and beyond the variance accounted for by traditional objective measures, we conducted a hierarchical stepwise regression in which the overall interview score was entered after undergraduate GPA, MCAT scores, and IEF scores. The results are reported in Table 4. Taken together, the traditional objective measures accounted for 16% of the variance in admission status, which was a statistically significant amount, F(10, 433) = 8.40, p <.001. When entered into the regression equation after the traditional variables, the overall interview score accounted for a significant additional percentage of variance in admission status, with an R2 change of .20, F change = 142.44, p <.001. Remarkably, the interview uniquely accounted for an additional 20% of the variance in admission status among students accepted, wait-listed, or rejected. The overall model was also highly significant, F(11, 432) = 23.08, p <.001, suggesting that the predictor variables accounted for a substantial proportion of variance in admission status.
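The increment test reported above follows the standard R² change formula. The sketch below applies it to the rounded values in the text (ΔR² = .20, full-model R² taken as .36, 432 residual degrees of freedom); because these inputs are rounded, the result differs somewhat from the published F change of 142.44, which was computed from unrounded values.

```python
# F test for the increment in R^2 when a block of predictors is added,
# as in the hierarchical regression reported in Table 4.

def f_change(r2_reduced, r2_full, n_added, df_resid_full):
    """F statistic for the R^2 increase from adding n_added predictors."""
    numerator = (r2_full - r2_reduced) / n_added
    denominator = (1 - r2_full) / df_resid_full
    return numerator / denominator

# Rounded values from the text: traditional measures R^2 = .16, the
# interview adds .20, and the full model leaves 432 residual df.
f = f_change(0.16, 0.36, 1, 432)
print(f"F change = {f:.1f}")   # roughly 135 with these rounded inputs
```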
Considering the investment of personnel, time, and other resources in the project needed to develop and test a structured admission interview, the interview's performance during its first year of implementation was a matter of high interest. The distributions of scores for the interview questions and the overall interview score were quite acceptable from a psychometric standpoint. Mean scores for the individual questions were somewhat higher than the midpoint of the five-point scale; however, positively skewed scores were expected given the select group of applicants invited for interviews. Despite the slightly positive skew in the distribution of interview scores, the standard deviations of the individual scores suggested satisfactory variability.
When we compared scores with applicants' admission status, the applicants who were accepted to the program scored significantly higher than did those held on the wait-list, who in turn scored significantly higher than did the applicants who were rejected. These results were obtained despite the intention of the admission committee not to assign significant weight to the interview during the first year of its development. We considered the fact that interview scores varied consistently with admission status to be positive; however, the relationship must be viewed with caution until the interview's prediction of medical student performance can be assessed.
Agreement between interviewers' ratings was relatively good. The overwhelming majority of interviewer pairs rated questions within one point of one another. The agreement rate on the overall interview score was particularly strong. This result is quite important, given the low degree of interrater reliability known to exist in traditional, unstructured interviews. Providing interviewers with specific questions and criteria for scoring applicants' responses appears to have met the goal of a selection tool with good interrater reliability.
We found considerable evidence to support the structured interview as a unique instrument for measuring content and as a unique contributor to admission decisions. Correlations between the interview scores and other predictor variables were low. Even IEF scores had low correlations with interview scores, despite appearing a priori to have some conceptual overlap with the interview. Such evidence is key to arguing for the use of a structured interview in addition to traditional selection criteria. The time and effort involved in interview development and implementation are justified only to the extent that the interview provides unique information about candidates.
It is difficult to make direct comparisons between traditional, unstructured interviews and the structured interview presented in this study. By their very nature, traditional interviews do not lend themselves well to quantitative scrutiny. Nevertheless, the inadequacies of unstructured interviews, in medicine and in other areas of personnel selection, have been well documented. We developed our structured interview through an extensive process involving input from parties with expertise in medical student education as well as personnel psychology. Although the process of development was intensive, the steps involved can be replicated at other institutions and the structured interviews can be evaluated.
Our study reports on only the beginning phase of implementing a medical student structured interview that performs well psychometrically and justifies the use of the resources required to complete such a project. Once students' performance data are available, we will follow up to describe the evaluation of the interview's predictive validity and summarize any changes made to the interview format in subsequent years.
One critical issue that is left unanswered by our research is the proper role of the interview in the medical school admission process. What characteristics do we seek in medical students, other than intellectual ones that can be measured quantitatively, and how can these be best measured? We argue that the personal interview is an essential measure of a student's characteristics provided that the interview is carefully developed as a selection measure. Faculty and staff must decide on what domains or constructs of performance are best measured by the structured interview. A structured interview allows predictive research to validate the importance of these domains in medical students' performances both in medical school and beyond. It is possible, for example, for an interview to measure a domain with reliability, but for that domain to be unimportant to later performance in medical school; thus, longitudinal research is needed to validate both the choice of domain and its measurement.