Hanson, Mark D. MD, MEd; Kulasegaram, Kulamakan Mahan; Coombs, Deborah L. MMS; Herold, Jodi PhD
The use of applicant written materials, including personal statements and letters of recommendation (LORs), in medical school admissions is ubiquitous.1 However, some researchers have identified potential drawbacks affecting the psychometric rigor of these materials.2–4 One specific concern is the potential for cognitive rater bias associated with their review—specifically, the halo effect or halo bias.5 The halo effect occurs when a rater’s first impression of an applicant unconsciously influences future assessment.6 For example, a strong academic transcript can positively influence a rater’s perception of a weak nonacademic record. Use of a single rater to evaluate an applicant file can therefore introduce the halo bias into the admissions file review process, and this halo effect increases measurement error, thus decreasing reliability and validity.
Dore and colleagues5 identified evidence of the halo effect, connected to the autobiographical submission (ABS) component used in the admissions process for the McMaster University MD program. They noted that, in a previous McMaster MD program study of admissions tools, the ABS had a low reported inter-rater reliability (0.45),7 leading them to postulate that a halo effect was operative. To investigate this potential halo effect, Dore and coinvestigators undertook a new study that compared the traditional method for evaluating the ABS (i.e., raters evaluated all ABS responses for one applicant and then moved on to evaluate all of the responses of the next applicant) with a new method for rating the ABS (i.e., raters evaluated one ABS response across all applicants before scoring the next response across all applicants). Dore and colleagues hypothesized that the use of this new rating method would produce ratings that were relatively more “independent” and that raters would gain an enhanced awareness of the possible range of responses to each item. In their study, three independent raters also scored applicants’ responses to five ABS items. The authors found that the new rating method decreased the halo effect, as demonstrated by lower inter-item reliability with the new method (G < 0.49) relative to the traditional method (G < 0.87), and enhanced inter-rater reliability with the new method (0.69) compared with the traditional method (0.03).5
Brothers and Wetherholt8 have reported another example of the halo effect in academic medicine. They identified a possible halo effect in faculty raters’ evaluations of surgical residency candidates’ LORs that were bundled together with academic performance records (e.g., United States Medical Licensing Examination Step 1 scores) for faculty review. This potential for a halo effect could act to limit the reported predictive validity of LORs on subsequent surgical residency performance.8 Other academic medicine researchers have also reported a halo effect operative among admissions interview raters who were not blind to interviewed applicants’ academic materials (e.g., undergraduate grade-point average [uGPA]).9
The use of strategies that employ multiple independent sampling (MIS) is important in evaluating many aspects of the health professions education enterprise, including, in the case of admissions, past performance as an indication of future potential. The MIS method involves multiple assessments of an applicant, usually by different raters on different occasions, so that one assessment does not influence another. The most familiar example is the objective structured clinical examination, which uses multiple station scenarios to assess clinical skills. MIS enhances the psychometric properties of evaluation tools; the average of multiple scores derived from multiple performances across different problems as assessed by different raters produces a more reliable and valid estimate of a student’s ability than scores derived from a single rater on a single performance regarding a single problem.10 The MIS strategy has been successfully used in the admissions process for the selection of medical students through the Multiple Mini-Interview,11 through Computer-Based Multiple Sample Evaluation of Non-Cognitive Skills,12 and, as described above, through the Autobiographical Screening Tool.5,13,14
The purpose of this study was to evaluate the MIS method as applied to admissions file review in an effort to determine whether this approach minimizes the halo effect in ratings used for selection of applicants to the University of Toronto (U of T) medical school.
We hypothesized that if the halo effect is indeed operative in holistic assessment, then the ratings from the holistic review would result in high inter-component reliability scores even among seemingly unrelated components (e.g., academic component and ABS) and that all components would load heavily on a single factor. Likewise, we hypothesized that an MIS approach aimed at mitigating the halo effect would result in lower inter-component reliability and in the components loading onto multiple factors.
The holistic file review process
U of T has thus far employed a holistic approach for reviewing admissions applications such that a single rater reviews all the components of an applicant’s file. Applicants must submit four components: (1) the academic transcript, including uGPA, (2) an ABS listing up to 48 activities or accomplishments (e.g., work, volunteering, extracurricular activities, research, academic awards), (3) a personal statement describing why the applicant wishes to attend medical school and how he or she has prepared, and (4) three LORs. The academic component includes the candidate’s transcript, undergraduate program of study, awards, journal publications, and other achievements. The ABS showcases an applicant’s depth and breadth of activities. The personal statement is a loosely structured essay in which applicants discuss their reasons for pursuing a career in medicine, their unique qualities, and the actions they have taken to prepare for a medical career. Each LOR comprises both a letter and a standardized rating form, both of which are used in the assessment of LORs.
Each component receives a separate score that contributes to the composite score. The composite score is weighted as follows: 60% the academic component, 10% the ABS, 9% the personal statement, and 21% the LORs. The composite scores are used to select applicants for interviews, and they also contribute to the final composite score (post interview) used to select applicants into the MD program.
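The weighting described above can be illustrated with a minimal Python sketch. It assumes that each component score has first been rescaled to a fraction (0 to 1) of its maximum and that the composite is a simple linear combination; the text does not specify the rescaling, and the function and key names are hypothetical.

```python
# Weights as stated in the text: 60% academic, 10% ABS,
# 9% personal statement, 21% LORs.
WEIGHTS = {
    "academic": 0.60,
    "abs": 0.10,
    "personal_statement": 0.09,
    "lors": 0.21,
}


def composite_score(fractions):
    """Return a 0-100 composite from component scores given as fractions.

    Assumption: each value in `fractions` is the component score divided by
    its maximum, and the composite is a plain weighted sum. The actual
    scoring rubric may differ.
    """
    missing = set(WEIGHTS) - set(fractions)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return 100.0 * sum(WEIGHTS[name] * fractions[name] for name in WEIGHTS)
```

Because the academic component carries 60% of the weight, an applicant's academic score dominates the composite under this scheme.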
U of T has used a blended (individual and team) rater model. Each team comprises three raters (two faculty members and one student). First, each rater independently rates 11 applicant files. Subsequently, the team members review their collective 33 files together and arrive at a final team consensus score for each applicant file. In this model, each applicant file is assessed by one rater team; the assessment comprises an initial detailed assessment by one individual rater and subsequent review and input by this rater’s team.
The MIS review process
Applicant files. In the late summer of 2011, we randomly selected 300 applicant files from the 2010–2011 admissions cycle so that we could reassess them using an MIS method. We used a stratified random sampling approach whereby we selected an equal number of files (n = 100) from each of three applicant groups: (1) those who were not selected for interview, (2) those who were interviewed but not offered admission, and (3) those who were interviewed, offered a position, and then accepted, deferred, or declined a position in the September 2011 class. We used the random number function in Microsoft Excel (Redmond, Washington) for random selection within each group. We duplicated 40 of these files to assess inter-rater reliability. We divided the 300 selected files (and the 40 duplicate files) into their 4 components and rebundled them into packages of, on average, 38 same-component items (e.g., a package of 38 academic transcripts or a package of 38 personal statements). Each package included five file components that were duplicates of those from another rater’s package (of the same component), again to allow for the assessment of inter-rater reliability. Applicant files from all three strata were represented in each package.
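The stratified selection described above can be sketched in Python as follows. This is an illustration only: the study used the random number function in Microsoft Excel, not Python, and the function name, group labels, and pool sizes here are hypothetical.

```python
import random


def stratified_sample(files_by_group, n_per_group, seed=None):
    """Draw n_per_group files at random, without replacement, from each group.

    `files_by_group` maps a group label to the list of file IDs in that group.
    Returns a dict with the same labels mapped to the sampled IDs.
    """
    rng = random.Random(seed)  # a seed makes the draw reproducible
    sample = {}
    for group, files in files_by_group.items():
        if len(files) < n_per_group:
            raise ValueError(f"group {group!r} has fewer than {n_per_group} files")
        sample[group] = rng.sample(list(files), n_per_group)
    return sample


# Hypothetical pools; the real group sizes are not given in the text.
pools = {
    "not_interviewed": list(range(0, 500)),
    "interviewed_not_offered": list(range(500, 800)),
    "interviewed_offered": list(range(800, 1000)),
}
selected = stratified_sample(pools, n_per_group=100, seed=2011)
```

Sampling the same number from each group yields 300 files in which all three admission outcomes are equally represented, regardless of how large each group was in the applicant pool.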
We removed all personal identifiers, including names, schools, and dates of birth, from all files, and we blacked out the referee and applicant names in the LORs to ensure anonymity. However, to link each application’s scores from the two approaches, we retained the Ontario Medical School Application Service ID.
Raters. A research assistant contacted 124 faculty raters who had previously participated in the holistic file review to solicit participation in the study. Raters received general information on how the MIS method differed from the holistic approach. Rater instructions and assessment forms in the file review packages were the same as those used in the holistic approach, with one exception. In the holistic approach, the academic scores can be adjusted for publications, scholarships, etc., found in other file components, but adjusting the academic score was not possible in the MIS method because the raters did not have access to the other file components. Raters, therefore, scored the academic component out of 58 rather than 60 total points.
Raters received a survey to assess, on a Likert scale, their impressions of the fairness and efficiency of the MIS approach as well as their overall confidence in the final ratings. They were asked to list up to three strengths and weaknesses of the MIS method as compared with the holistic approach.
Inter-rater reliability. As mentioned, we randomly selected 40 applicant files from the sample of 300 for duplicate scoring. Duplicate files were assigned and distributed randomly among raters who were not informed of the duplications.
Holistic versus MIS ratings of applicant files. For our primary analysis, we conducted a generalizability study (G-study), which is a method of analyzing reliability. This method also allowed us to conduct a decision study (D-study), which provides an estimation of the impact of increasing or decreasing the number of measurements on the reliability of ratings. We conducted our G-study using urGENOVA (R. Brennan, University of Iowa, Iowa City, Iowa) through the G-String IV shell (R. Bloch & G. Norman, Hamilton, Ontario, Canada).
The formula we used to determine reliability is as follows:
Generalizability = Variance (applicants) / [Variance (applicants) + Variance (components) + Variance (applicants × components)].
We calculated inter-component reliability both for scores generated through the holistic approach and for scores generated through the MIS method. The resulting G-coefficient is analogous to an alpha coefficient or internal consistency.
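As a minimal sketch, the reliability formula can be computed from estimated variance components as shown below. The code follows our reading of the formula as stated, with the applicant variance divided by the sum of all three terms; the function and parameter names are hypothetical, and the variance components themselves would come from the G-study software rather than from this code.

```python
def g_coefficient(var_applicants, var_components, var_interaction):
    """Inter-component generalizability coefficient.

    Computes applicant variance as a share of the total of the three variance
    components named in the text: applicants, components, and the
    applicants x components interaction.
    """
    total = var_applicants + var_components + var_interaction
    if total <= 0:
        raise ValueError("total variance must be positive")
    return var_applicants / total
```

A coefficient near 1 means that most score variance is attributable to differences among applicants rather than among components, which is the pattern that a halo effect would be expected to produce.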
To clarify the extent of the halo effect, we conducted exploratory factor analysis (principal components analysis) to determine the underlying factors and the number of factors onto which the four components loaded in both the holistic approach and in the MIS method. The presence of a halo effect would be evidenced by all four components loading exclusively onto a single factor.
We calculated inter-rater reliability for the duplicate scoring of study files for each component, compared the composite scores from the two approaches using paired t tests, and examined predictive validity against the interview scores through Pearson correlations.
We examined the correlation of both interview performance and uGPA to scores from the holistic approach and the MIS method. We used Z scores to conduct all analyses comparing the holistic scores with the MIS scores. These and other comparison analyses were conducted with SPSS 16 (Chicago, Illinois).
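Standardization places scores from the two approaches on a common scale, which is useful when the raw scales differ (e.g., the academic component was scored out of 58 in the MIS method and out of 60 in the holistic approach). The following is a minimal Python sketch; the choice of the sample standard deviation (n − 1 denominator) is an assumption, and the function name is hypothetical.

```python
from statistics import mean, stdev


def z_scores(scores):
    """Standardize scores to mean 0 and SD 1.

    Assumption: the sample SD (n - 1 denominator) is used; the text does not
    say which SD the authors applied.
    """
    m = mean(scores)
    s = stdev(scores)
    if s == 0:
        raise ValueError("scores have zero variance")
    return [(x - m) / s for x in scores]
```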
Study participant feedback. We analyzed the results from the rater feedback forms using frequency distributions for the three survey items answered on the Likert scale.
This study received approval from the U of T health sciences research ethics review board.
Of the 124 raters we invited to participate, 36 accepted and completed the training, and 35 returned rated packages. One rater did not complete the task; therefore, we were missing the scores on the academic component for 33 applicants. We excluded these applicants from most analyses but retained their scores on the other three components for our inter-rater reliability analysis and our analysis of correlations with other admissions measures. Two applicants did not have their academic component scored, so we substituted in the group mean score for the rater who assessed their files.
Table 1 reports the mean scores and standard deviations (SDs) for each component under the two approaches. The overall inter-component reliability was 0.69 for the holistic approach and 0.29 for the MIS method. Variance components are shown in Table 2.
Factor analysis for the holistic scores revealed that all four components loaded heavily (>0.7) onto one factor and that the academic component had the strongest loading on this factor. Factor analysis of the MIS scores revealed that three factors accounted for 57% of the variance. Further analysis (varimax rotation) clarified that the academic and personal statement scores loaded on the first factor at, respectively, 0.81 and 0.70, whereas the ABS loaded on the second factor at 0.91, and the LORs loaded on the third factor at 0.84.
The single-rating inter-rater reliability coefficients for the 40 duplicate files in the MIS method for each component were as follows: 0.49 for the academic component, 0.51 for the ABS, and 0.56 for the LORs. The variance component for applicant scores in the personal statement was negative. The Pearson correlation between the duplicate scores of the personal statement was −0.31. Averaging across two raters increased reliability coefficients for all components, with the exception of the personal statement (see Table 3).
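The effect of averaging across raters can be approximated with the Spearman-Brown formula, which matches a D-study projection for a simple design in which raters are the only facet. The sketch below is illustrative and is not the authors' urGENOVA analysis; the function name is hypothetical.

```python
def spearman_brown(single_rater_reliability, n_raters):
    """Project the reliability of the mean of n_raters ratings.

    Uses the Spearman-Brown formula: n * r / (1 + (n - 1) * r).
    """
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)
```

With the single-rating coefficient of 0.49 reported for the academic component, this projection gives about 0.66 for two raters. The values in Table 3 come from the D-study itself and may differ from this approximation.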
Correlations among application file components
Within the MIS scores, the academic component correlated with the personal statement scores at 0.21 (P < .001); no other components correlated with each other. Academic scores from the holistic approach correlated with the ABS, personal statement, and LORs at, respectively, 0.23, 0.22, and 0.28 (all values P < .001).
Correlations between the MIS component scores and holistic components are given in Table 4. The overall composite score for the holistic approach and the overall composite score for the MIS method correlated at 0.47 (P < .001).
Correlation between file review scores, interview, and GPA
For our analysis of the correlations between the file review scores and the interview and GPA, we used only the subset of 200 files of applicants who had an interview in the 2010–2011 admissions cycle. In the holistic approach, neither the scores of any of the individual components nor the final composite score correlated with total interview scores. In the MIS method, the scores of LORs correlated with the total interview score at 0.23 (P < .001) and at 0.36 after correction for attenuation. Correlation with the faculty interviewer and student interviewer subscores was, respectively, 0.17 (P < .009) and 0.26 (P < .001). No other MIS or holistic scoring components significantly correlated with the interview.
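Correction for attenuation divides an observed correlation by the square root of the product of the two measures' reliabilities. The sketch below shows the standard formula; the text does not state which reliability estimates were used for the correction, so the values in the usage line are hypothetical, as is the function name.

```python
import math


def disattenuate(r_xy, rel_x, rel_y):
    """Correct an observed correlation for unreliability in both measures.

    Standard formula: r_xy / sqrt(rel_x * rel_y), where rel_x and rel_y are
    the reliabilities of the two measures (each in the interval (0, 1]).
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must be in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical inputs chosen only to illustrate the arithmetic.
example = disattenuate(0.30, 0.50, 0.72)
```

Because the divisor is at most 1, the corrected value is never smaller in magnitude than the observed correlation, and the adjustment grows as the reliabilities fall.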
The academic component scores from the holistic approach and the MIS method correlated with the uGPA at, respectively, 0.62 (P < .001, 95% CI: 0.54–0.71) and 0.50 (P < .001, 95% CI: 0.41–0.59).
Results of the rater feedback surveys are reported in Figure 1. We received 35 rater study packages; 32 included a completed rater feedback form, 2 included feedback forms that were not completed, and 1 package did not include the feedback form. Whereas a small majority (n = 18/32; 56%) of raters either agreed or strongly agreed that MIS was a fair method of assessing applicants, fewer than half expressed similar agreement that MIS was efficient (n = 14/32; 44%). A minority (n = 14/32; 44%) felt confident in their MIS ratings. Qualitative comments indicated that raters found the MIS approach appealing in that they had an opportunity to see a greater breadth of the applicant pool. Conversely, participants felt that rating only a single component limited the extent to which information from other components could inform their impression of each candidate. In addition, they noted that rating 30+ applicant files was onerous or fatiguing.
Discussion and Conclusions
The use of the holistic approach to review written application materials leads to high inter-component reliability among components intended to assess different attributes. In particular, the high association between academic and nonacademic components of the file review with the holistic approach (0.69) likely indicates the presence of a halo effect. The MIS method demonstrated lower inter-component reliability (0.29) and, thus, provides some evidence of a diminished halo effect. Further, the factor loadings resulting from our principal components analysis confirm that overlap among the constructs assessed by the components was lower using the MIS method compared with the holistic approach. Thus, using MIS to assess written application materials may limit rater biases that can affect the psychometric properties of scores. Despite this important difference between the two approaches, some important outcomes were similar. All components (regardless of the scoring approach) resulted in low to moderate correlations, suggesting that they assess similar, but not redundant, constructs.
The inter-rater reliabilities of components in the MIS method were generally moderate, with the exception of the personal statement. There is cause for optimism in that the G-study showed that averaging over two raters for each component increased reliability to within acceptable ranges. However, the negative correlation between rater scores on the personal statement suggests that the current format of the statement required by the U of T may be too unstructured to effectively evaluate applicants’ potential and that it may be in need of modification. Inter-rater reliability was not assessed in the holistic approach because we did not assemble the usual rating teams used for application file review at the U of T. Although this is a limitation of our study, previous research has established that the halo effect increases measurement error, thereby reducing inter-rater reliability.5,13
The file review composite scores from both approaches were poor predictors of interview performance. The only component that demonstrated low but significant prediction of interview performance was the set of three LORs within the MIS method. This significant relationship is likely an effect of the increased variability in the MIS scores of the LORs, which had a larger SD (2.7) and range (6–21) compared with the holistic approach (SD of 2.3 and range of 10–21). The more restricted range in the holistic approach likely decreases the correlation between the holistic LORs and interviews. Interviewers have access to applicant files before and during the interviews; however, we do not know whether the impressions they form about applicants while reviewing their files—and, in particular, their LORs—actually cause the correlation between these measures. The low correlation between file review and interview scores also likely indicates both that these two components are not redundant and that each of the processes should be retained. Future research will further examine the association between components and interview performance.
Our findings showed that MIS may reduce the halo effect, but this decrease may be moot in the absence of rater acceptability. Rater feedback shows differences of opinion regarding the MIS method. The structural limitations of the MIS method may explain some of the raters’ views. Typically at U of T, raters using the holistic approach assess the full file, which provides a range of application materials, for 11 applicants. Conversely, raters using the MIS method rated only one component of 30+ files, which may have led to rater fatigue. Reducing rater workload by providing fewer files could limit this drawback. Raters routinely examined all file components in holistic reviews, which enabled them to form a general impression of an applicant. The MIS method prevented raters from forming this general impression and may have dimmed their confidence in their ratings. Despite these limitations and our raters’ familiarity and comfort with the holistic approach, most participants found the MIS experience to be positive, as evidenced by 56% of raters either agreeing or strongly agreeing that MIS was a fair assessment method.
Furthermore, the application file materials reviewed in this study were intended for assessment by the holistic approach, not the MIS method. Addressing rater concerns about confidence by maximizing the information provided by the MIS method will require that components be redesigned specifically for MIS rating. For example, the current ABS component is simply a list of up to 48 activities, and raters receive minimal guidance regarding how to assess each applicant’s list of accomplishments. The holistic approach, however, enables raters to consult an applicant’s personal statement and LORs for more in-depth information as to the importance of any one specific ABS activity. Redesign of this ABS component (and other file components) for rating by the MIS method, such that raters can explicitly evaluate each component of an applicant’s file independent of the other components, could improve the MIS method, increase raters’ confidence in ratings, and reduce the reliance on holistic impressions. Further, specifically training raters to assess application materials that are designed expressly for the MIS method and not the holistic approach could also improve their confidence in their ratings and in the MIS method overall.
Apart from these practical changes, adopting MIS requires a shift in the evaluation paradigm for admissions. MIS emphasizes the importance of collective, yet independent, evaluations of individual applicants via psychometrically defensible tools. The holistic approach relies on the expertise of one rater to evaluate an applicant, whereas the MIS method relies on multiple raters to provide a collective and fair evaluation of an applicant.
Given our study findings, the MIS method will be implemented within the U of T medical school admissions file review process for the 2012–2013 admissions cycle. Future research will examine the predictive validity of MIS scoring and rater acceptability of the MIS method.
Funding/Support: Financial and in-kind support from the Office of Admissions and Student Finances, Undergraduate Medicine, Faculty of Medicine, University of Toronto.
Other disclosures: None.
Ethical approval: This study received approval from the University of Toronto health sciences research ethics review board.
Previous presentations: This study was delivered as an oral presentation at the 2012 Canadian Conference on Medical Education by Dr. Mark D. Hanson in April 2012 in Banff, Alberta, Canada.