Dore, Kelly L.; Hanson, Mark; Reiter, Harold I.; Blanchard, Melanie; Deeth, Karen; Eva, Kevin W.
Medical schools are faced with the challenge of selecting a few students from many qualified applicants. To make this selection, most schools rely upon a combination of cognitive measures, such as undergraduate grade point average (uGPA) and Medical College Admission Test (MCAT) scores, and measures of noncognitive characteristics, such as personal interviews and written submission materials.1 Like many schools, the Michael G. DeGroote School of Medicine at McMaster University invites candidates to interview based on uGPA and a candidate-written autobiographical submission (ABS). Local research has demonstrated strong reliability and validity for uGPA,2,3 but the reliability of the ABS has been weak.2 Although the subsequent onsite Multiple Mini-Interview (MMI)4–6 provides a good measure of applicants’ noncognitive characteristics, with reliability and predictive validity, it has the limitation that it can only be implemented for applicants invited to interview. Resource limitations demand, however, that a preinterview screening tool be used; institutional philosophy demands that reliable prescreening avoid total reliance on cognitive measures. For these reasons, we have initiated research designed to improve upon the reliability and validity of the ABS.
The ABS is an applicant-written submission. It is completed pre-MMI by means of the web-based system operated by the Ontario Medical School Application Service. The Michael G. DeGroote School of Medicine at McMaster University currently receives approximately 4,000 applications annually for 148 student positions. The ABS is composed of five questions designed to evaluate noncognitive characteristics such as applicants’ personal experiences, suitability for McMaster, and suitability for a career in medicine. Each applicant’s five ABS questions, stripped of any personal identifiers, are scored by three independent raters: one health science faculty member, one community member, and one medical student. Each applicant’s ABS score is, therefore, derived from 15 scores (5 questions times 3 raters). Each rater scores 30–60 ABS submissions, and upwards of 150 raters participate annually. Raters receive a one-hour training session in evaluation of the ABS. The ABS score, in combination with uGPA, is used to select applicants for the MMI. Two major concerns have arisen pertaining to the use of the ABS: (1) nonindependence of the ratings; and (2) nonindependence of the ratees.
Non-independence of the ratings
Scores on the ABS have been shown to correlate poorly with performance both within medical school and on the national licensing examinations written postgraduation.2 One reason identified by Kulatunga-Moruzi and Norman is that the interrater reliability of ABS scoring is less than adequate (0.45). The ABS has, however, been seen to have high internal consistency (0.88). Although high internal consistency may be seen as supportive of the reliability of a measure, it may in fact be a negative indication that the scores assigned to the individual questions do not provide independent measures of the applicant. That is, the halo effect may be afflicting this measure; if performance on the first question influences the raters’ perceptions of performance on subsequent questions, then the initial overall impression of the candidate will determine the scores assigned to individual questions rather than the individual questions summing to provide a global assessment. This is an important distinction, because it would indicate that, functionally, only three observations (from three raters) are being collected in the current system instead of the desired fifteen, and because it is common for reliability (and, in turn, validity) to improve as a function of the number of observations collected.7
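The dependence of reliability on the number of independent observations can be illustrated with the Spearman-Brown prophecy formula. The text above does not name this formula, so its use here is an assumption, and the single-observation reliability of 0.20 is a hypothetical value chosen only for illustration. The sketch contrasts 3 functionally independent observations with the intended 15.

```python
def spearman_brown(single_obs_reliability: float, n_observations: float) -> float:
    """Projected reliability of the mean of n parallel, independent observations.

    Standard Spearman-Brown prophecy formula: n*r / (1 + (n - 1)*r).
    """
    r = single_obs_reliability
    return n_observations * r / (1 + (n_observations - 1) * r)


# Hypothetical single-observation reliability (not a value from the study).
r_single = 0.20

# 3 observations: what a halo effect would leave (one impression per rater).
# 15 observations: 5 questions x 3 raters, if every score were independent.
for n in (3, 15):
    print(f"{n:>2} observations -> projected reliability "
          f"{spearman_brown(r_single, n):.2f}")
```

The formula assumes parallel, independent observations, which is exactly the assumption a halo effect would violate.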
To test whether or not this was an issue, we altered the direction in which ratings were collected. That is, we had raters evaluate one ABS question across multiple candidates before scoring the next ABS question. It was anticipated that this scoring method would increase the predictive capacity of the ABS by: (1) providing a greater number of truly independent ABS scores for compilation of each applicant’s total ABS score, and (2) providing raters with a better sense of the distribution of responses received for each question.
Non-independence of the ratees
The extent to which any ABS is completed independently when administered without proctoring would be difficult to quantify, but it is certainly variable. Undoubtedly a small percentage of candidates are less than scrupulous and hire ghostwriters in an attempt to generate a more appealing ABS. More commonly, however, candidates will pass their submissions around to friends, family, current students, or practicing physicians for feedback to improve the submission. The extent of the changes that result is again likely to be highly variable, but it creates a pair of measurement problems. First, given that there is an upper limit on how good an ABS can appear, and assuming that the collection of feedback results in improvement, the submissions may end up being more homogeneous than the candidates, thus lowering the maximum achievable reliability and validity. Second, even without restriction of range, the validity itself must be questioned, because it becomes unclear whether one is discriminating between candidates or between candidate support systems.
A second purpose of this research was testing whether or not administration of an ABS in a proctored manner would improve upon the capacity of the tool to predict MMI scores. During the 2005 admissions cycle applicants selected for the MMI also completed a second, onsite, invigilated ABS as part of the admissions process. Herein, we report a comparison of those scores and those received for applicants’ offsite submissions.
Offers of interview to the Michael G. DeGroote School of Medicine at McMaster University in 2005 were predicated, in part, upon single-rater (a faculty member, community member, or student) evaluation of offsite ABSs provided by each of 3,907 applicants. The top 1,000 applicants then had their ABS reviewed by one member of each of the two additional rater groups, for a total of three raters per applicant. The three ABS scores were combined with uGPA and the top 696 ranking applicants were invited to interview. The interview process then consisted of a 12-station MMI and an 8-question ABS. Half of the applicants completed the MMI in the morning and the ABS in the afternoon, and the other half began with the ABS and ended with the MMI. The onsite ABS questions were comparable with, but not identical to, the noninvigilated questions participants answered offsite, with questions focusing on ethical decision making, advocacy, and personal experiences. The reliability of both the onsite and offsite ABS was computed using Generalizability Theory,8 and the scores were compared to scores received on the MMI using Pearson’s correlations. Approval for the study was obtained through the Protocol Review Committee of the Michael G. DeGroote School of Medicine at McMaster University.
For a subset of 30 randomly selected candidates, two scoring methods were compared for each ABS. The first method was the traditional method; raters evaluated all ABS questions for a given candidate before moving on to score the next candidate. If one envisions a matrix with each candidate organized into columns and each question organized into rows, this scoring method could be considered to be a vertical scoring method, as illustrated in Figure 1. The mean scores and reliability of this method were compared to those of a horizontal method in which raters evaluated all candidates for a given question before moving on to score the next question (Figure 1). The scores resulting from each scoring method were then compared to scores received on the MMI. The MMI, as it is used at McMaster, does not provide a reliable measure of any one noncognitive quality, but it is blueprinted to address many noncognitive issues identified as important by McMaster stakeholders, with ethical decision making being predominant.9
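The difference between the two scoring methods is purely one of the order in which a rater works through the candidate-by-question matrix. The following is a minimal sketch of that ordering; the function name `scoring_order` and the candidate and question labels are hypothetical and are not part of the study materials.

```python
from itertools import product


def scoring_order(candidates, questions, method):
    """Return the sequence of (candidate, question) pairs a rater would score.

    "vertical":   all questions for one candidate, then the next candidate.
    "horizontal": all candidates for one question, then the next question.
    """
    if method == "vertical":
        return [(c, q) for c, q in product(candidates, questions)]
    if method == "horizontal":
        return [(c, q) for q, c in product(questions, candidates)]
    raise ValueError(f"unknown scoring method: {method!r}")


# Hypothetical labels for illustration only.
candidates = ["cand_A", "cand_B", "cand_C"]
questions = ["Q1", "Q2"]

print(scoring_order(candidates, questions, "vertical"))
print(scoring_order(candidates, questions, "horizontal"))
```

Both orders visit the same set of cells; only the horizontal order separates a rater's successive judgments of the same candidate.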
All results are reported for the subset of 30 candidates for whom ABS scores were collected using both horizontal and vertical scoring. Whenever data were available for a larger cohort, analyses were repeated, but the findings were consistent with those of this subset of 30 in all cases, suggesting that these 30 were representative of the larger cohort of applicants. The scores for the ABS completed offsite (mean = 4.4) were significantly higher than those completed onsite (mean = 4.1; F = 5.7, p < .05). A significant interaction between site and scoring method (p < .01) revealed that this main effect was driven by a higher mean score in the traditional (offsite, vertical) scoring method (mean = 4.7) relative to the other three groups (mean = 4.0 to 4.2).
The internal consistency (i.e., the average correlation between questions) varied as a function of scoring methodology, but not as a function of site. In general, greater consistency was observed across questions when the traditional vertical scoring methodology was used (G > 0.87 in all analyses) relative to the new horizontal scoring methodology (G < 0.49 in all analyses). When the interrater reliability was assessed, it was found to be high for ABSs completed onsite (0.81 with vertical scoring, 0.78 with horizontal scoring). However, the offsite ABS interrater reliability was moderate when horizontal scoring was used (0.69), but poor when vertical scoring was used (0.03).
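To make the notion of internal consistency concrete, the sketch below computes Cronbach's alpha, which coincides with the relative G coefficient only in a simple single-facet (candidates crossed with questions) design. This is an assumption for illustration: the analyses reported here used Generalizability Theory, and the exact design is not reproduced. Both score tables are hypothetical toy data, not study data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores.

    scores: list of rows, one row per candidate, one column per question.
    """
    k = len(scores[0])  # number of questions
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least 2 questions and 2 candidates")
    # Sample variance of each question's scores across candidates.
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of each candidate's total score.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical toy data: a halo-like pattern (scores track one overall
# impression) versus a pattern in which questions vary more independently.
halo_like = [[5, 5, 6, 5, 5], [3, 3, 3, 4, 3], [6, 6, 6, 6, 7], [2, 2, 3, 2, 2]]
independent_like = [[5, 2, 6, 3, 4], [3, 6, 2, 5, 4], [6, 3, 4, 2, 5], [2, 5, 5, 6, 3]]

print("halo-like alpha:       ", round(cronbach_alpha(halo_like), 2))
print("independent-like alpha:", round(cronbach_alpha(independent_like), 2))
```

A very high value under vertical scoring is therefore compatible with either genuinely consistent candidates or a rater's single impression being copied across questions; the coefficient alone cannot distinguish the two.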
Perhaps more importantly, the ABS scores correlated better with the MMI when the horizontal scoring method was used (r = 0.44 offsite and 0.65 onsite) relative to when the vertical scoring method was used (r = 0.12 offsite and 0.28 onsite).
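For readers who wish to reproduce this kind of comparison on their own data, the following is a minimal sketch of Pearson's correlation coefficient. The function name `pearson_r` and the score lists are hypothetical; the values are illustrative only and are not taken from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical ABS and MMI scores for five applicants (illustration only).
abs_scores = [4.1, 3.8, 4.6, 4.0, 4.4]
mmi_scores = [4.9, 4.2, 5.3, 4.8, 4.7]

print(round(pearson_r(abs_scores, mmi_scores), 2))
```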
The results of the horizontal scoring method create new, but cautious, optimism for the value of the ABS as a screen of noncognitive characteristics and its ability to determine which applicants should be invited to a medical school interview. The higher internal consistency achieved using the vertical scoring method provides evidence for our concern that the halo effect may have been biasing ABS assessments (i.e., that the ratings across questions were nonindependent). The new horizontal method lessens this concern by increasing the number of independent observations that are attained for each applicant. Similar mechanisms are believed to have yielded the success of the MMI protocol.
Seen in a vacuum, the method of ABS administration that performed best is clearly application of the horizontal scoring method to submissions collected in onsite, invigilated, time-controlled circumstances. Invigilation ensures independence of the ratees. However, the ABS does not function in a vacuum. Given the choice between MMI and onsite ABS, the MMI is preferred for a number of reasons. First, in terms of overall test generalizability, the MMI is at least as strong as the onsite ABS.4 Second, with respect to predictive validity, the MMI has demonstrated significant positive correlation with in-school measures5 and national licensing examination scores.6 Although the ABS may yet demonstrate such a positive outcome, to date it has not done so. Third, scoring of onsite ABSs requires rater time subsequent to the date of interview, thus delaying decision making, whereas MMI scores are available immediately on that date.
Instead, the greater value of the ABS as scored under the new method may be seen elsewhere. Although it has proven its worth, the MMI leaves untested all those who have applied but could not be accommodated for interview. Given the resource implications for applicant and program alike, a reliable, valid, feasible, and acceptable noncognitive measure to be used as a screening mechanism is desirable. That void is presently filled largely with unreliable, and thus largely random, noncognitive measures.1 The same void could now be filled by a horizontally scored ABS, as it appears to be at least moderately correlated with the MMI.
One limitation of the results of this study is the small sample of applicants who had all combinations of evaluation (onsite, offsite, horizontal scoring, and vertical scoring). Due to the limited availability of raters, only 30 randomly selected applicants had their offsite ABS rescored using the new method. Further study should be performed, but for now we take solace in the finding that the reliability analyses we performed on larger samples for whom a subset of the data were available mimicked the results reported here for this subsample. Further limitations include the fact that only minimal rater training was provided. Increased rater training would not be expected to invalidate the conclusions of this study, but may suggest that the reliabilities and correlations reported herein could be improved. Indeed, recent reports regarding written submission materials utilized within medical and dental school admissions procedures in Israel have outlined psychometric enhancement with adoption of well-defined guidelines for rater evaluation of written submission materials.10,11 For both programs, detailed job analyses were completed to develop questions regarding noncognitive characteristics relevant to each profession. Accompanying scoring guidelines were developed for these questions, and workshops for rater training were offered. The medical school training workshop included physicians and PhDs and was one-half day in duration. The National Institute for Testing and Evaluation (NITE) in Israel went further by recruiting raters with backgrounds marked by extensive psychological training.12 In all three studies, interrater reliabilities reached 0.94 (on average). These results provide a revealing contrast with those of this study by suggesting that the psychometric properties can also be improved in other, arguably more resource-intensive, ways.
Which method yields higher predictive validity (or whether the two methods complement one another in terms of predictive capacity) remains to be determined.
1 Salvatori P. Reliability and validity of admissions tools used to select students for the health professions. Adv Health Sci Educ. 2001;6:159–75.
2 Kulatunga-Moruzi C, Norman GR. Validity of admissions measures in predicting performance outcomes: the contribution of cognitive and non-cognitive dimensions. Teach Learn Med. 2002;14:34–42.
3 Trail C, Reiter HI, Bridge M, et al. Grade point average stability and its implications. Adv Health Sci Educ. In Press.
4 Eva KW, Rosenfeld J, Reiter HI, Norman GR. An admissions OSCE: The Multiple Mini-Interview. Med Educ. 2004;38:314–26.
5 Eva KW, Reiter HI, Rosenfeld J, Norman GR. The ability of the Multiple Mini-Interview to predict pre-clerkship performance in medical school. Acad Med. 2004;79:S40–S42.
6 Reiter HI, Eva KW, Rosenfeld J, Norman GR. Multiple mini-interview predicts clerkship and licensing exam performance. Paper presented at the 12th Annual Ottawa Conference on Clinical Competence, New York, NY, May 2006.
7 Streiner DL, Norman GR. Health measurement scales: A practical guide to their development and use. 3rd edition. Oxford University Press, 2003.
8 Shavelson RJ, Webb NM, Rowley GL. Generalizability theory. Am Psychol. 1989;44:922–32.
9 Reiter HI, Eva KW. Reflecting the relative values of community, faculty, and students in the admissions tools of medical school. Teach Learn Med. 2005;17:4–8.
10 Gafni N, Moshinsky A, Kapitulnik J. A standardized open-ended questionnaire as a substitute for personal interview in dental admissions. J Dental Educ. 2003;67:348–53.
11 Moshinsky A, Rubin O. The development and structure of an assessment center for the selection of students for medical school. Presented at American Educational Research Association Annual Meeting, April 12, 2005, Montreal, Canada.
12 Ziv A, Rubin O, Moshe Mittelman M, Lichtenberg D. Development and Application of a Simulation-Based Assessment Center for Non-Cognitive Attributes: Screening of Candidates to Tel-Aviv University Medical School (http://cre.med.utoronto.ca/omen/rounds.htm). Accessed 15 December 2005.