Like most educational testing programs, the Association of American Medical Colleges (AAMC) allows examinees to take the Medical College Admission Test (MCAT) multiple times. The AAMC recommends that students retake the MCAT exam when there is a discrepancy between their coursework and MCAT scores, they did not adequately prepare, they misunderstood directions, they had a serious illness during the exam, or a medical school admissions officer recommended that they retake the exam.1 Each year approximately one-third of examinees take the MCAT exam more than once and, thus, have multiple MCAT scores included in their score reports.
In a recent national study of admissions policies and practices conducted by AAMC, medical school admissions officers revealed that they use different approaches to compute scores from applicants who test multiple times in their admissions process: 43% of schools consider scores from the applicant's most recent administration, 31% consider scores from the administration in which students received the highest total score (highest-within-administration), 8% compute a total score by summing the highest section scores across administrations (highest-across-administration), and 3% compute a total score by averaging scores across administrations. Thirteen percent of schools indicated that they use an approach that was not listed on the survey and wrote in a description of that approach. While a variety of approaches were described, most of these schools stated that they consider all scores in context and look for any improvement in scores based on additional student preparation.
The multiple approaches lead to a practical empirical question: Which approach is the most valid method to use in medical school admission decisions? The lack of research on this topic in medical school admissions has made it a challenging issue for practitioners. Previous research on the MCAT exam has addressed the predictive validity of each approach.2 The results indicate that the predictive validity coefficients for each scoring approach were similar; however, the average MCAT score resulted in slightly higher predictive validity coefficients than other approaches. The finding is consistent with the educational testing literature, which has found that the simple arithmetic average of scores across administrations often provides the best prediction of subsequent performance in school for repeaters from the perspective of predictive validity.3,4 However, the evidence is weak because of the very small differences in predictive coefficients.
Research has also focused on using the differential prediction method with other testing programs, investigating the differences in predicted outcome(s) between nonrepeaters and repeaters associated with using each scoring approach (referred to as prediction errors). While no previous research has examined prediction errors for repeaters on the MCAT exam, the educational testing literature suggests that more prediction error is associated with using nonaverage approaches to computing scores for repeaters.4–7 We assert that one methodological weakness of prior studies is that they did not take into account the number of times an examinee tested. This is problematic because examinees who test more often than others may be more likely to capitalize on positive measurement error. As such, scores computed using a nonaverage approach could mean different things for examinees who test more times than for examinees who test fewer times (e.g., four times versus one time). In particular, a score computed using a nonaverage approach might correspond to a lower expected level of performance on a given criterion measure for examinees attaining that score based on multiple test administrations than for those attaining the same score based on fewer attempts (e.g., a single attempt).
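The idea of capitalizing on positive measurement error can be illustrated with a small simulation. The sketch below is not part of the study: it assumes a fixed true score, normally distributed measurement error, and no learning or selection effects between attempts, and the values 25 and 2 are arbitrary illustrative choices.

```python
import random
from statistics import mean

def simulate_scores(true_score, error_sd, n_attempts, n_examinees, seed=0):
    """Return (mean highest score, mean average score) over simulated examinees.

    Assumption: every attempt is the same true score plus independent
    normal error; real repeaters may also improve between attempts.
    """
    rng = random.Random(seed)
    highest, average = [], []
    for _ in range(n_examinees):
        observed = [rng.gauss(true_score, error_sd) for _ in range(n_attempts)]
        highest.append(max(observed))    # a "highest score" style approach
        average.append(mean(observed))   # the average approach
    return mean(highest), mean(average)

# Hypothetical settings: true score 25, error SD 2, 5,000 examinees per group.
results = {n: simulate_scores(25.0, 2.0, n, 5000) for n in (1, 2, 4)}
for n, (mean_highest, mean_average) in results.items():
    print(f"{n} attempt(s): highest={mean_highest:.2f}, average={mean_average:.2f}")
```

Under these assumptions the mean highest score rises with the number of attempts even though the true score never changes, whereas the mean average score stays close to the true score.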
The purpose of this study was to investigate the implications of the four most common approaches for computing MCAT scores for repeaters using the differential prediction method. In particular, this study expanded previous MCAT validity research on repeaters by exploring prediction error associated with (1) average scores, (2) most recent scores, (3) highest-within-administration scores, and (4) highest-across-administration scores (computing a total score by summing the highest section scores across administrations) for repeaters who tested two times, three times, and four times. With this study, we hope to provide guidance regarding the best approach for using MCAT scores for repeaters in medical school admissions.
The sample consisted of students who matriculated to medical school between 2000 and 2002, and included 30,965 matriculants (65%) who took the MCAT exam once, 12,523 matriculants (26%) who took the exam twice, 2,959 matriculants (6%) who took the exam three times, and 699 matriculants (1%) who took the exam four times. Two hundred twenty-eight matriculants (less than 1%) who took the exam more than four times were excluded from the analyses. Personal information was removed and only group-level results were reported. The American Institutes for Research IRB committee retrospectively reviewed the study and decided that no harm was imposed on the participants.
The MCAT total scores were computed based on four approaches: the average, most recent, highest-within-administration, and highest-across-administration approach score. As nonrepeaters had only one MCAT score, their scores were the same regardless of the approach.
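The four scoring approaches can be sketched as follows. This is an illustration rather than the study's code: the function names and the section scores are made up, and each administration is assumed to be a tuple of the three multiple-choice section scores listed in chronological order.

```python
from statistics import mean

def total(admin):
    """Total score for one administration (sum of its section scores)."""
    return sum(admin)

def average_score(admins):
    """Average of the total scores across all administrations."""
    return mean(total(a) for a in admins)

def most_recent_score(admins):
    """Total score from the last administration (list is chronological)."""
    return total(admins[-1])

def highest_within_score(admins):
    """Total score from the administration with the highest total."""
    return max(total(a) for a in admins)

def highest_across_score(admins):
    """Sum of the highest score on each section across administrations."""
    return sum(max(section) for section in zip(*admins))

# Hypothetical three-time test taker (made-up section scores).
attempts = [(8, 9, 7), (9, 8, 9), (10, 8, 8)]
scores = {
    "average": average_score(attempts),                # 76 / 3, about 25.33
    "most_recent": most_recent_score(attempts),        # 26
    "highest_within": highest_within_score(attempts),  # 26
    "highest_across": highest_across_score(attempts),  # 28
}
print(scores)
```

For a single administration all four functions return the same total, which mirrors the statement that nonrepeaters' scores do not depend on the approach.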
The MCAT exam used in this study was administered in paper-and-pencil format and consisted of three multiple-choice sections (i.e., Physical Sciences, Biological Sciences, and Verbal Reasoning) and a writing sample. The science sections of the MCAT exam assessed scientific knowledge and problem solving, and the verbal section assessed the ability to understand and evaluate information. The writing sample score was not used to compute the total score as it is reported on an alphabetic score scale.
The first-attempt United States Medical Licensing Exam (USMLE) Step 1 total score was used as the criterion. The Step 1 exam assesses understanding and application of basic science concepts to the practice of medicine.8
We investigated whether there is differential prediction of Step 1 total scores between four MCAT groups (nonrepeaters, two-time, three-time, and four-time test takers). If there is no differential prediction, the predicted Step 1 scores will be the same for each group, the regression lines will overlap, and the magnitude of the differences in the expected Step 1 scores between MCAT nonrepeaters and each repeater group (referred to as prediction error in this study) will be approximately zero. As nonrepeaters have the same MCAT scores based on each scoring approach, they are used as the control group; the three repeater groups are used as the comparison groups. To estimate slopes and intercepts, Step 1 total scores were regressed on MCAT total scores computed based on each scoring approach for each group (the nonrepeater and three repeater groups). Sixteen separate regressions (one for each of the four groups under each of the four scoring approaches) were estimated by ordinary least squares.
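The comparison described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's analysis: the helper names are hypothetical, the data points are invented, and only one repeater group and one scoring approach are shown.

```python
def ols_fit(x, y):
    """Simple least-squares fit of y on x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def predict(fit, mcat):
    """Expected Step 1 score at a given MCAT total for a fitted line."""
    intercept, slope = fit
    return intercept + slope * mcat

def prediction_error(fit_nonrepeaters, fit_repeaters, mcat):
    """Expected Step 1 for nonrepeaters minus repeaters at the same MCAT total."""
    return predict(fit_nonrepeaters, mcat) - predict(fit_repeaters, mcat)

# Invented (MCAT total, Step 1 total) data for two groups.
mcat = [20, 25, 30, 35]
step1_nonrepeaters = [188, 205, 222, 239]
step1_repeaters = [180, 196, 212, 228]

fit_non = ols_fit(mcat, step1_nonrepeaters)
fit_rep = ols_fit(mcat, step1_repeaters)
error_at_25 = prediction_error(fit_non, fit_rep, 25)
print(round(error_at_25, 2))
```

A positive value means that using the nonrepeater line would overpredict the repeater group's Step 1 score at that MCAT total; a value near zero corresponds to overlapping regression lines.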
Table 1 shows the means and standard deviations for Step 1 and MCAT total scores computed using each scoring approach for examinees who tested one, two, three, and four times. Within each repeater group, MCAT total scores differ based on the scoring approach, with the highest-across-administration approach resulting in the highest MCAT total score, followed by the highest-within-administration, most recent, and average approaches. We examined whether these differences were statistically significant using SAS PROC GLM (repeated statement) for each repeater group and found that the overall mean differences were statistically significant at the P < .01 level.
We then investigated whether there is differential prediction for MCAT nonrepeaters and repeaters related to number of attempts. Figure 1a–d plots the prediction relationship between Step 1 total scores and MCAT total scores computed in each scoring approach for the four matriculant groups (i.e., students taking the MCAT exam one, two, three, and four times). To describe the most likely MCAT scores attained by each group, the regression lines in Figure 1a–d were plotted within two standard deviations of the mean MCAT scores for each group.
Figure 1a shows the relationship between Step 1 scores and MCAT scores computed by taking the average across all administrations. The four regression lines overlap, suggesting that examinees with the same average MCAT score are expected to score the same on the Step 1 exam, regardless of the number of attempts they took to attain that MCAT score. For example, if two applicants, one who tested once and the other who tested four times, each had an average MCAT score of 25, both would be expected to score about 207 on the Step 1 exam, and the effect size of the prediction error (i.e., the difference in the expected Step 1 scores for nonrepeaters and four-time test takers) is about zero.
In contrast, Figure 1b–d shows that using a nonaverage scoring approach (i.e., most recent, highest-within-administration, and highest-across-administration approaches) for repeaters is likely to result in overprediction of their Step 1 scores relative to nonrepeaters with the same MCAT score. The regression lines for repeaters are below the regression line for nonrepeaters when the most recent, highest-across-administration, or highest-within-administration scoring approaches are used. This result suggests that repeaters who have the same MCAT score are expected to score differently on Step 1 as a function of the number of times they took the MCAT exam.
For example, if two applicants—one who tested once and the other who tested four times—received MCAT scores of 25 using the highest-across-administration approach, the applicant who tested once would be expected to score 207 on Step 1, whereas the applicant who tested four times would be expected to score 194 (Figure 1d), and the effect size is 0.55. Slightly lower effect sizes were observed for the most recent and highest-within-administration approaches (0.42 and 0.45) based on the same MCAT total score of 25. This effect holds for all three nonaverage approaches to computing scores for repeaters; however, the effect is the largest for the highest-across-administration approach, followed by the highest-within-administration approach, and then the most recent score approach.
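The arithmetic behind such an effect size can be sketched as below. The text does not state how the effect size was standardized, so this sketch assumes it is the difference in expected Step 1 scores divided by a Step 1 standard deviation; the function name is hypothetical, and the standard deviation is back-solved from the reported figures rather than taken from the study.

```python
def standardized_difference(expected_ref, expected_cmp, sd):
    """Difference in expected scores expressed in standard deviation units."""
    return (expected_ref - expected_cmp) / sd

# Reported values for the highest-across-administration example:
# expected Step 1 of 207 (one attempt) vs. 194 (four attempts), effect size 0.55.
raw_difference = 207 - 194

# Assumption: effect size = raw difference / SD, so the implied SD is
# raw difference / effect size. This is an inference, not a reported value.
implied_sd = raw_difference / 0.55

print(raw_difference, implied_sd)
print(standardized_difference(207, 194, implied_sd))
```

Under that assumption the 13-point gap corresponds to a standard deviation of roughly 23 to 24 Step 1 points.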
In this study, we examined the level of prediction error associated with four common approaches for computing MCAT total scores for repeaters. The common approaches for computing MCAT total scores were the average score across administrations, the most recent score, the highest-within-administration score (the total score from the administration in which an examinee received the highest total score), and the highest-across-administration score (a total score computed by summing the highest section scores across administrations).
From the perspective of minimizing prediction error, the results of this study suggest that the best approach for computing repeaters' scores is to take the average across all administrations, which is consistent with previous educational testing research.2–7 MCAT average scores have the same meaning (in terms of expected Step 1 total scores): Students with the same MCAT average score are expected to perform the same on the Step 1 exam regardless of the number of attempts taken to achieve that average. As such, admissions committees can have confidence that MCAT total scores computed by averaging across all administrations are comparable (at least with respect to Step 1 total scores) regardless of the number of times a student took the MCAT exam.
In contrast, the most recent, highest-within-administration, and highest-across-administration score approaches are less favorable because they are associated with more prediction error. Specifically, these approaches result in overprediction of repeaters' Step 1 total scores because repeaters are expected to perform less well on Step 1 than are nonrepeaters and repeaters who tested fewer times but have the same MCAT total score computed using one of these approaches. As a result, admissions committees should be wary of comparing MCAT total scores computed using any of the three nonaverage computation approaches when they are concerned about Step 1 total scores, because the resulting scores have different meanings for nonrepeaters and repeaters who tested multiple times.
The results are very stable because of the large sample size used in this study, and our analysis based on later cohorts (not reported in this study) has indicated that the relationship between MCAT and Step 1 scores has been extremely stable over time. As with all research, there are limitations to our study. First, the only criterion used in this study was total scores on the Step 1 exam. Future research should investigate whether our findings generalize to other aspects of medical school performance, such as medical school grades and the likelihood of graduating on time. Second, the study evaluated only one aspect (prediction error) of each approach for computing repeater scores in medical school admissions. Future work should evaluate other considerations, such as the impact of these approaches on subgroup differences. This is particularly important because financial considerations may come into play in deciding whether or not to retake the exam. Third, this study did not look at the time between repeaters' test administrations (a long time lag with the possibility of additional coursework in between may change the relationship). Last, it is important to note that the MCAT multiple-choice sections were reduced in length by 33% when the MCAT switched to computer-based administration in 2007. This study should be replicated when outcome data for the students who took the MCAT exam in 2007 and beyond are available.