Since 1952, resident recruitment has taken place via the National Resident Matching Program (NRMP),1 a system in which each residency program must submit a single list with an ordinal ranking of all acceptable applicants, while each applicant submits a corresponding ranked list of programs. Developing a process that ensures that the program’s rank order list (ROL) accurately reflects the best applicants is one of the most important tasks of program leadership. In most programs, this process includes each applicant being interviewed by a small number of faculty members. Faculty members then numerically rate each applicant using Likert-scale questions, and an ROL is generated as an arithmetic mean of these scores. Although program directors and the resident selection committee may ultimately make changes to this list, the initial ROL plays a prominent (if not preeminent) role in the process.
Despite the importance of the rank order process, relatively little academic scholarship has examined how to optimize ROL construction.2–5 This is surprising given the number of manifest limitations intrinsic to a traditional approach, including the questionable construct validity of the assessment measures, uncertain inter- and intrarater reliability, and the difficulty of quantifying and accounting for the intrinsic variability in the system. In combination, these factors may have a significant negative impact on the accuracy of a program’s ROL and, correspondingly, on its Match outcome.
One example of how seemingly small factors can lead to large adverse consequences was the recognition in our own program of the “Bob effect.” Of two faculty members with the same first name, one had a reputation for being consistently positive in his evaluations of applicants and the other for being more negative. By measuring their respective biases we were able to calculate that for an applicant seen by four faculty members, the simple substitution of one Bob for the other could lead to a rank shift of more than 30 places on our list of 150 applicants. This discovery prompted us to explore alternate approaches for constructing our ROL.
Though there is little medical literature on the ranking process, residency programs are not alone in struggling with this type of problem. Collegiate sports organizations face a similar challenge on a weekly basis when they publish national rankings of top teams who may have never played against each other. Though a sports definition of what constitutes a “better” team may be more easily defined (e.g., better teams are expected to win more games), and the consequences of rankings may be different (including tournament seeding and a host of financial implications), at its heart the problem and process are the same: A group of experts are expected to predict future performance based on limited data.
We describe a new approach for ranking applicants to residency programs that is adapted from the core conceptual methods used in collegiate sports rankings. We then compare this approach with traditional methods by running a series of computer-simulated “recruitment seasons.”
Traditional ranking approaches
Because of the competitiveness and prestige of U.S. training programs and the ease of submitting many applications online via the Electronic Residency Application Service,6 most residency programs receive a large number of applications. These are screened and reviewed and then, on the basis of academic record, top candidates are invited to interview. In some programs, the ROL may then be generated by group consensus of the resident selection committee or program leadership.
More commonly, a quantitative measure is also included. Typically, this entails each interviewer using a standardized assessment tool that is customized to the program and incorporates data from both the written materials and the interview. Scores from each rater are averaged for each candidate and arranged in descending order to create the preliminary ROL, which may ultimately be modified at the discretion of the program director or resident selection committee. We refer to this process as traditional ranking (TR).
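As an illustration only (the following code is our own sketch, not part of any program’s actual system), the TR calculation reduces to averaging each applicant’s ratings and sorting in descending order. The function and applicant names here are hypothetical:

```python
from statistics import mean

def traditional_ranking(scores):
    """Build the preliminary rank order list (ROL) by averaging each
    applicant's interviewer ratings and sorting in descending order.

    `scores` maps applicant -> list of Likert-scale ratings from the
    interviewers who assessed that applicant."""
    averages = {applicant: mean(ratings) for applicant, ratings in scores.items()}
    # Higher average score places the applicant earlier on the list.
    return sorted(averages, key=averages.get, reverse=True)

# Hypothetical ratings from three interviewers per applicant:
rol = traditional_ranking({"A": [4, 5, 4], "B": [3, 4, 3], "C": [5, 5, 4]})
# rol == ["C", "A", "B"]
```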
The Moore Optimized Ordinal Rank Estimator
In academic year 2010–2011 we developed a new algorithm for ranking applicants. The Moore Optimized Ordinal Rank Estimator (MOORE) assumes that applicants will still be assessed by multiple faculty members. This process will likely include a review of written materials, an interview, and the completion of a rating scale. However, the only component required by MOORE is that each faculty member generate an ROL of all of the applicants he or she has assessed.
Conceptually, these data can be treated as if they reflected the course of a sports season: Each ordinal list functions as if it contained the results of a series of games played between the teams on the list. For example, for the number 1 applicant on a list of 10, it is as if that applicant had “defeated” each of the 9 individuals below him or her on the list in direct head-to-head competition. For the applicant ranked number 2, it is as if he or she achieved an 8–1 record, losing only to the applicant ranked number 1.
The MOORE algorithm is based on the college hockey concept of pairwise rankings (PWR).7 Applicants are initially ranked based on win percentage as weighted by “strength of schedule.” The “strength of schedule” adjustment is essential to an ordinal ranking system because it normalizes each applicant’s performance to the quality of his or her opponents. Thus, a strong applicant is not excessively penalized for “losing” to other strong applicants, and a weak applicant is not excessively rewarded for “defeating” other weak applicants. We refer to this measure as the weighted win percentage (WWP), and solving the equation iteratively is the first step in the algorithm. Note that this calculation is similar to college hockey’s ratings percentage index, or RPI:
WWP = Win Percentage + Average WWP of Opponents
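One way to solve this equation iteratively is sketched below. This is our own illustration: the equal weighting of the two terms is an assumption made here so that the iteration converges numerically (the article states the unweighted sum, and college hockey’s RPI uses its own fixed weights); the relative ordering it produces follows the same logic.

```python
def weighted_win_pct(games, n_iter=50):
    """Iteratively compute each applicant's weighted win percentage (WWP):
    raw win percentage adjusted by the average WWP of 'opponents'
    (everyone who appeared on a shared faculty list).

    `games` is a list of (winner, loser) pairs. NOTE: the 0.5/0.5 blend
    of the two terms is our own assumption, chosen so the fixed-point
    iteration converges."""
    opponents, wins, played = {}, {}, {}
    for w, l in games:
        for x, y in ((w, l), (l, w)):
            opponents.setdefault(x, []).append(y)
            played[x] = played.get(x, 0) + 1
        wins[w] = wins.get(w, 0) + 1

    win_pct = {a: wins.get(a, 0) / played[a] for a in played}
    wwp = dict(win_pct)                # start from raw win percentage
    for _ in range(n_iter):            # iterate toward the fixed point
        wwp = {a: 0.5 * win_pct[a]
                  + 0.5 * sum(wwp[o] for o in opponents[a]) / len(opponents[a])
               for a in played}
    return wwp
```

With the three-game example above (A beats B and C; B beats C), the resulting WWP preserves the ordering A > B > C while crediting each applicant for the strength of those he or she faced.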
The second step of the algorithm identifies a set of top applicants who will be examined more closely in head-to-head comparisons (as below). In the PWR for men’s college hockey, the top teams are referred to as “teams under consideration” and are defined as the top half of teams in the country as based on RPI. In the present study, we similarly define “top-tier applicants” (TTAs) as the top half of applicants as based on WWP.
In the third step of the algorithm, MOORE separately conducts head-to-head comparisons for all TTAs using four factors: (1) WWP, (2) head-to-head record, (3) performance against common opponents, and (4) performance against other TTAs.
Within each comparison, one point is awarded for each category except head-to-head record, for which one point is awarded for each head-to-head victory. In the event of a tie, the WWP is used to determine the victor. Regardless of the margin of victory, the winning applicant is awarded a single point for the comparison.
Finally, the ROL is calculated based on the number of pairwise comparisons won by each applicant. The maximum possible number of points is one less than the number of TTAs. Ties are again broken by WWP. The weaker applicants not included on the TTA list remain in descending order of WWP.
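The pairwise-comparison and final-ranking steps can be sketched as follows. This is a simplified illustration under stated assumptions: only the WWP factor and head-to-head victories are scored (the common-opponent and record-versus-TTA factors follow the same one-point pattern and are omitted for brevity), and all function names are our own:

```python
def compare_ttas(a, b, wwp, games):
    """Head-to-head comparison of two top-tier applicants (TTAs).
    Simplified sketch: one point for the higher WWP, plus one point per
    head-to-head victory; ties are broken by WWP."""
    pts = {a: 0, b: 0}
    pts[a if wwp[a] >= wwp[b] else b] += 1      # factor 1: WWP
    for w, l in games:                          # factor 2: head-to-head record
        if {w, l} == {a, b}:
            pts[w] += 1
    if pts[a] == pts[b]:                        # tie broken by WWP
        return a if wwp[a] >= wwp[b] else b
    return a if pts[a] > pts[b] else b

def moore_rol(ttas, wwp, games):
    """Order the top tier by the number of pairwise comparisons won
    (maximum = number of TTAs minus one); ties broken by WWP."""
    comparison_wins = {t: 0 for t in ttas}
    for i, a in enumerate(ttas):
        for b in ttas[i + 1:]:
            comparison_wins[compare_ttas(a, b, wwp, games)] += 1
    return sorted(ttas, key=lambda t: (comparison_wins[t], wwp[t]), reverse=True)
```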
Figure 1 illustrates the TR and MOORE algorithms with a simplified example of the “Bob effect.”
At the same time that we developed MOORE, we also designed the Recruitment Outcomes Simulation System (ROSS), a computer program that generates simulated recruitment seasons according to prespecified criteria (see Figure 2). This system allows us to test the comparative accuracy of MOORE and TR methods.
To accomplish this simulation, the system must begin by doing what cannot be done in real life: assuming that it is possible to know the “true value” of a cohort of applicants on a gold standard scale (though it should be emphasized that the defining characteristics of the scale are not actually relevant—the simulation would be equally valid for any scale). Additional input parameters to the simulation include number of applicants, a specified distribution from which applicants are drawn, number of interviewers, and number of applicants seen by each interviewer. On the basis of these parameters, ROSS assigns applicant–interviewer pairings.
The model then introduces two types of variability into the system. The first is intrarater variability. This parameter reflects the intrinsic amount of “noise” in any assessment system—essentially, if the same faculty member interviewed multiple individuals with the same “true” score, how much variability would there be in the ratings? Second, the model incorporates interrater variability. This is systematic bias that occurs in the way that different interviewers assess applicants (i.e., some interviewers may tend to rate all applicants uniformly higher or lower than other interviewers).
For each simulated interview, ROSS creates an assessment score by adding to the applicant’s true value a unique “noise” value and a fixed “bias” value that is linked to the interviewer. Both forms of variability are drawn from normal distributions (as specified).
Based on the assessment scores, ROSS constructs ROLs using the TR and MOORE approaches. Each ROL is compared with the known true values of the applicants, and the primary dependent variable is the average number of rank errors per applicant. ROSS repeats each simulation a specified number of times according to the desired power for discriminating between methods. In the present study, we present data generated using 30 repetitions of the simulation, which is powered to be 90% confident of detecting a difference of an average rank error of 0.3 between the two methods. The inputs to ROSS used for the present simulations are given in Table 1.
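The score-generation step and the primary outcome measure can be sketched in simplified form. This is not the ROSS source code; all names and the seeding scheme are our own, and the only assumptions carried over from the description above are the additive score model and the normally distributed noise and bias terms:

```python
import random

def simulate_scores(true_values, assignments, noise_sd, bias_sd, seed=0):
    """One simulated interview season: each assessment score is the
    applicant's true value, plus a unique per-interview 'noise' draw,
    plus a fixed per-interviewer 'bias', both normally distributed."""
    rng = random.Random(seed)
    interviewers = sorted({i for pairing in assignments.values() for i in pairing})
    bias = {i: rng.gauss(0, bias_sd) for i in interviewers}  # fixed per interviewer
    scores = {}
    for applicant, interviewer_list in assignments.items():
        scores[applicant] = [true_values[applicant]
                             + rng.gauss(0, noise_sd)  # intrarater variability
                             + bias[i]                 # interrater bias
                             for i in interviewer_list]
    return scores

def avg_rank_error(rol, true_values):
    """Primary dependent variable: average absolute difference between
    each applicant's ROL position and his or her true-rank position."""
    true_order = sorted(true_values, key=true_values.get, reverse=True)
    true_pos = {a: i for i, a in enumerate(true_order)}
    return sum(abs(i - true_pos[a]) for i, a in enumerate(rol)) / len(rol)
```

With both variability parameters set to zero, the simulated scores equal the true values and a correctly ordered ROL yields an average rank error of zero.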
Figure 3 illustrates the accuracy of TR versus MOORE with no bias and increasing amounts of intrarater variability (i.e., noise). Data are shown as the average number of rank errors per applicant, and error bars represent 95% confidence intervals. MOORE produces ROLs that are statistically indistinguishable from TR (P = .08 for variability = 1, and P ≥ .2 for variability ≥ 2), and the accuracy of both approaches decreases with higher levels of intrarater variability.
Figure 4 illustrates the accuracy of TR versus MOORE with no intrarater variability and increasing amounts of interrater variability (i.e., bias). When the pool of interviewers exhibits biases with a standard deviation less than 1, TR produces a more accurate ROL. For higher levels of bias, MOORE produces an increasingly more accurate ROL (P < .10 for bias = 2; P < .01 for bias ≥ 3). For any amount of bias greater than 1, MOORE performs more accurately than TR at all places on the rank list. Because the present simulations draw applicants from a normal distribution, the number of rank errors (and, accordingly, the differential advantage of MOORE over TR) tends to be higher in the middle of the list because applicants are more tightly clustered. Supplemental Digital Figure 1, http://links.lww.com/ACADMED/A142, illustrates the average number of incorrect assignments at each place on the overall rank list for MOORE and TR approaches.
Given the centrality of the NRMP Match in U.S. graduate medical education, it may seem surprising that the rank process has not been the focus of more intense investigation—and yet, as scientists, it can be daunting to imagine how we could conduct such research in a rigorous manner. At the most basic level, such inquiry is predicated on the idea that it is possible to define what makes for the “best” applicant. One would then need to create an assessment tool, demonstrate its validity, and prospectively evaluate how well it predicts “success.” Each step of this process is rife with logistical, methodological, and philosophical difficulties. As such, it is no surprise that most studies attempting to correlate some measure of applicant assessment with success in residency have been inconclusive.5,8–14
Yet each year, in the face of these concerns, programs still invest vast resources in the assessment and ranking of applicants and are highly emotionally attached to the perceived success of their Match outcome. Imagine, for example, a program director’s response to filling two positions with the top 2 candidates on the program’s ROL as opposed to those ranked numbers 49 and 50. We take this as de facto evidence that program directors believe it is possible to rank applicants and that rank order has some predictive value.
Although the nature of the Match prohibits direct scientific experimentation, this does not mean that the ranking of applicants cannot be done scientifically. In the present study we demonstrate that computer simulations of recruitment “seasons” open the door for a systematic examination of different rank strategies. A first finding is that concerns about the potential negative effect of intrinsic variability on ROL accuracy using a traditional approach are well founded. Though the numerical procedure used in TR may carry the illusion of scientific precision, lists generated by this method do not typically include error bars. Many programs might be surprised to realize how dramatically the “Bob effect” or other variability can affect the accuracy of their ROL. Further, this effect is likely to be most prominent among closely matched applicants—which is often the most important part of the list. So what can be done to mitigate the major sources of variability?
By definition, the use of an ordinal approach eliminates any impact of bias between interviewers. It does not matter whether one Bob rates a candidate higher or lower than the other Bob does on an absolute scale as long as each faculty member is internally consistent. The only data used in an ordinal system are the direct comparisons that occur between applicants within each faculty member’s list.
As others have noted, though, the implementation of an ordinal system is complicated.2 Two major obstacles are knowing how to appropriately weight data from lists of differing lengths and accounting for heterogeneity in the quality of applicants across lists. By borrowing from the conceptual approach of PWRs, the MOORE algorithm resolves these issues. Each comparison of applicants that occurs on a list is treated as a “win” or a “loss.” Longer lists are automatically imbued with greater power because it is as if applicants on those lists have played a greater number of games. As discussed above, differences in quality of victory are accounted for by incorporating a measure of “strength of schedule.” This is a critical addition to an ordinal approach, as variability of applicant assignment could otherwise have a dramatic impact. If, for example, a faculty member were by chance assigned only weak applicants, the number 1 applicant on his or her list might be ranked disproportionately high. MOORE goes even further by including increased weight for performance in head-to-head comparisons, against common opponents, and against other top applicants.
Our data show that there is no apparent benefit of the MOORE approach over a traditional system in compensating for intrarater variability (“noise”)—on the contrary, at low levels of bias, traditional methods performed better. This is because ordinal approaches necessarily sacrifice some data. A ranking of applicants from 1 to n contains less information than a set of raw scores that also orders applicants in the same way. Thus, in a system with minimal variability, it would be beneficial to continue the use of a traditional system. However, we believe that it is highly unlikely to find a system with this little “noise.”
Although we have modeled intrarater variability as a single factor, there are many sources of “noise” in real-world systems. As alluded to above, these may include inadequate scale validity; poor reliability in implementation of the scale; intrinsic variability in how well applicants and faculty each “perform” from one interview to the next; unique and idiosyncratic interactions that inevitably occur between individual personalities; and, in some cases, perhaps deliberate noise (e.g., one faculty member admitted in the past to using “grantsmanship” tactics and always giving candidates that he liked perfect ratings to try to inflate their overall scores). Each source of noise will have different properties for modeling it and for trying to mitigate it, and each will be differentially susceptible to traditional versus ordinal ranking approaches.
In considering all of these factors, it becomes clear that recruitment systems are enormously complex and that one could spend years studying their minutiae and testing different strategies for developing ROLs. In this study, we introduce a new paradigm for approaching the problem. We demonstrate that it is possible to simulate basic properties of recruitment systems and, thereby, to test the efficacy of different ranking methods. We then show that the MOORE algorithm is superior to TR methods in the presence of bias. We do not consider the present findings to be generalizable to all settings, nor do we believe that any system’s could be. Rather, a strength of the present approach is that it creates a frame for further experimentation.
To this end, ROSS (or other simulation software) may be easily customized to the procedural logistics of different programs (e.g., the number of applicants seen, number of interviewers, and number of interviews per applicant). In conjunction with known historical data, including variability and bias, the quantitative output may serve as a reference guide, specific to that program, of the comparative accuracy of MOORE versus traditional approaches. For example, on the basis of the present simulations, a program interviewing 100 applicants would know that if the standard deviation of interviewer bias exceeds 2, an ordinal approach would be more accurate than TR.
There are additional benefits of ROSS (or other simulation approaches). First, it is relatively easy to program software to incorporate additional ranking strategies so as to test their comparative accuracy (e.g., if one wanted to compare MOORE with a median-based ranking strategy or against other ordinal approaches; see Supplemental Digital Figure 2, http://links.lww.com/ACADMED/A142). Second, simulation systems can allow programs to test different ways in which they could structure their interview process so as to optimize the accuracy of their ROL (e.g., by testing the impact on ROL accuracy of changing either the number of interviewers or the number of interviews per applicant). And finally, of course, it is possible to modify the software so as to incorporate greater degrees of complexity into the model (e.g., by distinguishing between different types of “noise” and allowing for independent modeling of each).
We are also continuing to work to refine and customize other aspects of MOORE to cater to the nuances of the resident recruitment process. One such modification that might help mitigate the data loss intrinsic to the method would be to incorporate raw data into the algorithm so as to generate weighted ordinal rankings. This would allow MOORE to factor into the algorithm the “quality” of a win or loss, distinguishing between a “big win” and a near-tie. Other ongoing efforts include differentially weighting the input of interviewers (e.g., if one wanted to add value to the opinion of a particularly experienced interviewer), experimenting with different weightings for the component factors in the pairwise comparisons, and modifying the range of TTAs in the pairwise comparisons, either to focus on a smaller subset at the top of the list or to identify a narrow range in the middle of the list for which one might want improved precision (e.g., if there were a region in the list where a program typically expects to fill its last slot). We also believe that the MOORE algorithm may be useful in other contexts, such as medical school admissions or as part of the ongoing assessment of trainees during their clinical rotations, and we are exploring how best to adapt it for these other applications.
Finally, though we have emphasized that MOORE need not be linked to a traditional assessment scale, we believe there may be value in continuing this practice in parallel with the ordinal approach. To this end, we have recently developed a new assessment scale with questions designed to reflect the core values of our program and numbers that are anchored to how we expect residents to perform. We hope that the use of this scale will focus the entire assessment process—from review of written materials, to the interview itself, to the discussions in our resident selection committee—on the core values of our program. We also find these data to be a useful supplement to MOORE as a way of examining outlier rank assignments, as a supplement to guide any discretionary changes made to the list (e.g., if we obtain new information about an applicant following the initial ROL), and to generate a “traditional” rank list that we can use as a point of comparison as we hone the MOORE algorithm. To date, we have been pleased to discover that the MOORE lists have been subjectively well received as matching faculty expectations for what the initial ROL “should” look like. Finally, we intend to use data from our new scale to follow residents prospectively so as to test how well it predicts performance in residency and beyond.
Acknowledgments: The authors are grateful to Robert M. Rohrbaugh, MD, and Melissa Arbuckle, MD, PhD, for reviewing this manuscript and offering their insights.