Editor’s Note: This is a commentary on Ross DA, Moore EZ. A quantitative experimental paradigm to optimize construction of rank order lists in the National Resident Matching Program: The ROSS-MOORE approach. Acad Med. 2013;88:1281–1286.
Senior medical students engage in a unique matching process to determine which graduate medical education (GME) program they will attend for internship and residency. Students apply to a variety of GME programs, are invited to interview by some, and then visit and interview at these programs. GME programs use information from applications and interviews to determine which applicants are most likely to succeed in their program. At the same time, applicants are determining the converse—namely, which programs are most likely to meet their expectations for training. Each GME program and applicant then creates a rank order list (ROL) reflecting the most desirable to least desirable applicants or programs, respectively. The National Resident Matching Program (NRMP) then fills programs with applicants in an order that optimizes program and applicant rank choices. Throughout this process, faculty members in GME programs implicitly assume that top-ranked applicants will perform better than lower-ranked applicants in their GME programs. Thus, the ROL should predict subsequent performance in GME programs, but, curiously, this is not always the case. An applicant’s rank position on an ROL is better described as a probability of performing well in a GME program but with very wide confidence intervals on these probabilities. I expect that every program director in GME can cite examples of highly ranked applicants who subsequently performed poorly during residency, and vice versa.
Why Are ROLs Weakly Correlated With Clinical Performance During GME?
ROLs of medical students are weak predictors of subsequent GME performance for at least six reasons. First, the skills needed to be a high-performing medical student do not necessarily transfer to the GME setting. For example, scoring well on a gross anatomy exam does not guarantee that an individual will make good medical decisions with this information when he or she applies it to patients in the GME setting. From the field of expert performance we know that expertise is domain-specific, meaning that expertise in one domain does not transfer to expertise in a different domain.1 Thus, all the skills and competencies needed to be a superior medical student do not always transfer to the GME setting, where quite different skills and competencies may be required. It is no wonder, then, that outstanding medical students may or may not be excellent resident physicians, though there is reason to expect at least some correlation.
Second, the assessment nomenclature used by medical schools to describe medical student performance is remarkably variable and imprecise.2 This naturally leads to difficulty in determining how students compare to each other, which then makes it difficult to reliably select the better-performing students. This is true when comparing students from different medical schools and even when comparing students from the same medical school.
Third, medical schools provide very optimistic portrayals of their students. The language used in medical student performance evaluations (MSPEs) is remarkably positive. This may result from the inherent interest that medical schools have in helping their students get selected by residency programs for GME. Alternatively, it may reflect a conflict of interest: schools may be reluctant to say negative things about students who have typically paid significant sums of money to attend. The fact that medical schools rarely fail students is remarkable and raises concerns about biased student evaluation. Nearly every student is portrayed as “excellent,” “outstanding,” “superior,” and the like. Since these terms are largely nondiagnostic, when I read an application I “read for the negative,” which means that I look for anything not positive, and I look for the absence of very strong positive comments. Such things are likely just “the tip of the iceberg” and deserve careful examination.3 Recently, many medical schools have begun to ask GME programs to assess the accuracy and usefulness of the MSPE for each of their graduates. I applaud this change and hope this action is adopted by more schools.
Fourth, standardized tests such as the United States Medical Licensing Examination are used by medical schools and residency programs as part of the medical student performance assessment even though they were explicitly not designed for this purpose. These scores are not highly diagnostic of future clinical performance, and thus any ROL decisions that are strongly influenced by these scores are likely to yield low correlations with future clinical performance during residency. These test results do, however, strongly predict the next exam score.
Fifth, the interview is often a nondiagnostic activity for detecting how an applicant will perform during GME. When interviewers are blinded to information contained in the formal application, the interview appears to offer a very modest amount of useful information in determining how an applicant will perform as a resident. Structured interviewing shows some promise in gaining useful information from an interview, but even structured interviewing leaves much room for improvement. It appears that applicants typically know what to say to faculty members during an interview to make a positive impression. In short, what applicants say in an interview and how applicants act in an interview may bear little resemblance to what they do in daily performance situations.
Lastly, resident selection committees are composed of faculty members who differ in how they rank order applicants. The first four issues in this section relate to medical schools, are not under the control of program directors in GME, and pose tremendous threats to achieving an ROL that strongly predicts subsequent clinical performance in GME settings. In contrast, interviewing and creating an ROL in the face of significant faculty member variance in scoring are both under the control of program directors and resident selection committees.
Creating a “MOORE” Accurate ROL
Typically, faculty members on selection committees read an application, interview the applicant, and then render a numerical score which reflects the faculty member’s interest in having that applicant in the GME program. Individual faculty member scores are then averaged for each applicant, and this average is used to create the ROL. This method works well if all faculty members score all applicants. In such cases, any faculty member who tends to give high or low scores (compared with other members on the selection committee) will not affect the ROL because his or her bias is present for all applicants. In contrast, if a biased (i.e., harsh or lenient) faculty member scores only some applicants, then the scores of these applicants will be disproportionately affected, and as a result their position on the ROL will change, sometimes dramatically.
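The effect described above can be illustrated with a small worked example. The sketch below is illustrative only: the applicant labels, rater names, "true" quality values, and the 3-point penalty applied by the harsh rater are all hypothetical assumptions, not data from any program. It compares the averaged ROL when every rater scores every applicant with the averaged ROL when the harsh rater scores only two of the four applicants.

```python
from statistics import mean

# Hypothetical "true" quality of four applicants (higher is better).
true_quality = {"A": 9.0, "B": 8.0, "C": 7.0, "D": 6.0}

# Hypothetical rater biases: "harsh" subtracts 3 points from every score.
rater_bias = {"fair1": 0.0, "fair2": 0.0, "harsh": -3.0}


def average_scores(assignments):
    """Average each applicant's scores over the raters who scored them."""
    return {
        applicant: mean(true_quality[applicant] + rater_bias[r] for r in raters)
        for applicant, raters in assignments.items()
    }


def rank_order(scores):
    """Return applicants sorted from highest to lowest average score."""
    return sorted(scores, key=scores.get, reverse=True)


# Case 1: every rater scores every applicant.
complete = {a: ["fair1", "fair2", "harsh"] for a in true_quality}

# Case 2: the harsh rater scores only applicants A and B.
partial = {
    "A": ["fair1", "harsh"],
    "B": ["fair2", "harsh"],
    "C": ["fair1", "fair2"],
    "D": ["fair1", "fair2"],
}

rol_complete = rank_order(average_scores(complete))
rol_partial = rank_order(average_scores(partial))

print(rol_complete)  # ['A', 'B', 'C', 'D']
print(rol_partial)   # ['A', 'C', 'B', 'D']
```

With complete scoring, the harsh rater lowers every average by the same amount and the order is unchanged. With partial scoring, applicant B falls below applicant C solely because of who did the scoring.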
This problem of biased scoring has been cleverly managed by Ross and Moore in this issue.4 Their method derives from a technique used to rank order college sports teams, taking into account the applicants’ “strength of schedule” in a given residency application “season,” and dealing with biased faculty members. Importantly, they simulate the effects of such bias on a known or “true” ROL and verify that their method renders the known (correct) ROL even in the face of significant interrater variance (bias). They also describe a method for simulating how a true or known ROL will be affected by differing degrees or types of interviewer bias. This allows direct comparison of the true ROL with ROLs created by the conventional averaging method and the ROSS-MOORE (Recruitment Outcomes Simulation System–Moore Optimized Ordinal Rank Estimator) approach. Their study shows that random noise (intrarater variance) in scoring will negatively and equally affect both conventionally made ROLs and ROLs made using the ROSS-MOORE approach.
The real advance of the ROSS-MOORE approach is how it reacts to significant interrater variance in score assignment (e.g., a faculty member who routinely gives very low scores which will selectively hurt the applicants he or she scores). The authors demonstrate that even when a conventionally produced ROL would be corrupted by harsh or lenient graders, the ROSS-MOORE approach can provide an ROL that is very close to the known or true ROL. Although the ROSS-MOORE approach generates an accurate ROL in the face of biased faculty member scores, the authors do not demonstrate how their method would perform in the face of intermittent bias (e.g., a faculty member who gives low scores to all the applicants when he or she is having a bad day, but returns to normal scoring for the rest of the interview days) or how it would perform in the face of selective bias, such as when a faculty member scores men systematically lower than women. I anticipate that this would act as “noise” and negatively affect both conventionally made ROLs and those made using the ROSS-MOORE approach. Fortunately, the simulation method the authors describe can answer this question and others.
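To make the idea of compensating for a consistently harsh or lenient grader concrete, here is a minimal sketch of a generic additive adjustment. It is not the ROSS-MOORE algorithm, whose details are not reproduced in this commentary; it simply assumes that each observed score equals an applicant's quality plus a constant offset for the rater, and estimates both by alternating averages. The scores, rater names, and the function name `adjust_for_rater_offsets` are hypothetical.

```python
from statistics import mean

# Hypothetical observed scores keyed by (applicant, rater).
# The rater "harsh" scores only A and B, and scores them 3 points low.
scores = {
    ("A", "fair1"): 9.0, ("A", "harsh"): 6.0,
    ("B", "fair2"): 8.0, ("B", "harsh"): 5.0,
    ("C", "fair1"): 7.0, ("C", "fair2"): 7.0,
    ("D", "fair1"): 6.0, ("D", "fair2"): 6.0,
}


def adjust_for_rater_offsets(scores, iterations=100):
    """Fit score ~ quality[applicant] + offset[rater] by alternating averages.

    Only differences between applicants (and between raters) are
    identified; that is sufficient for producing a rank order.
    """
    applicants = sorted({a for a, _ in scores})
    raters = sorted({r for _, r in scores})
    offset = {r: 0.0 for r in raters}
    quality = {}
    for _ in range(iterations):
        # Re-estimate each applicant after removing current rater offsets.
        for a in applicants:
            quality[a] = mean(
                s - offset[r] for (x, r), s in scores.items() if x == a
            )
        # Re-estimate each rater's offset from the residuals.
        for r in raters:
            offset[r] = mean(
                s - quality[a] for (a, y), s in scores.items() if y == r
            )
    return quality, offset


def rank_order(values):
    """Return applicants sorted from highest to lowest value."""
    return sorted(values, key=values.get, reverse=True)


# Conventional approach: plain average of whatever scores exist.
naive = {
    a: mean(s for (x, _), s in scores.items() if x == a)
    for a in {a for a, _ in scores}
}
quality, offset = adjust_for_rater_offsets(scores)

naive_rol = rank_order(naive)
adjusted_rol = rank_order(quality)
print(naive_rol)     # ['A', 'C', 'B', 'D']
print(adjusted_rol)  # ['A', 'B', 'C', 'D']
```

In this noise-free toy case the adjustment recovers the intended order because the harsh rater's offset is constant. Intermittent or selective bias of the kind described above would violate that assumption, and this sketch would not be expected to correct for it.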
What Should an ROL “Mean”?
An ROL implies at least two things. First, it is expected to reflect the judgments of the faculty members who constructed it, and second, it is presumed that higher-ranked applicants will perform better than lower-ranked applicants in subsequent GME programs. The ROSS-MOORE approach advances our ability to create an ROL that better reflects the ROL of each member of the selection committee. Unfortunately, since resident selection committees use information with limited diagnostic value to determine applicant scores, any method that depends on these scores to create an ROL remains limited by the data used on the input side of the process. Thus, we sorely need improvements in the quality of the input data. In this context, “quality” of data refers to the ability of the input data to meaningfully predict GME performance. Such prognostic data are dependent on improved performance evaluation methods being developed and used at the medical school level so that those of us trying to select the best applicants for GME can do so in a reliable fashion. In the meantime, for those who endeavor to create an ROL that largely compensates for interrater variability, the ROSS-MOORE approach is a big step forward.
In our program we have chosen to use mean scores to create our ROL instead of a measure of central tendency that is less sensitive to outlier scores, such as the median. This decision was a philosophical one, implying that “everyone’s score is equal” and acknowledging that bias will arise. We allow such bias to exist since it reflects the belief of the faculty member and we have no access to the truth about which applicant will perform best in our residency. That said, we discuss strong outlier scores, which must be defended by the person rendering the score. Although the ROSS-MOORE approach is a real advance in creating an ROL that better reflects each faculty member’s rank ordering of the applicants, this improvement is only the tip of the iceberg in making ROLs that predict subsequent performance in GME programs.