Standardized Interview Scoring Methodology for Neurosurgical Residency Applicant Selection : Neurosurgery

Special Article: Socioeconomics, Health Policy and Law

Soni, Pranay MD*; Davison, Mark A. MD*; Battisti, Elizabeth A. MHA*; Schmidt, Eric S. MD*; Benzel, Edward C. MD*,‡; Steinmetz, Michael P. MD*,‡; Schlenk, Richard P. MD*,‡; Benzil, Deborah L. MD*,‡

doi: 10.1227/neu.0000000000002141

Residency interviews are a vital component of the application process for competitive subspecialties and have been shown to predict clinical success.1-4 Neurosurgery is no exception to this trend. In a recent study surveying neurosurgery program directors, the interview process was rated as the most important component in resident selection, followed closely by US Medical Licensing Examination performance and letters of recommendation.5 An interview allows for a comprehensive assessment of an applicant's noncognitive characteristics including motivation, integrity, interpersonal skills, confidence, and maturity, all of which are critical considerations in the overall evaluation of an applicant.

Although the interview is crucial, evaluation is subjective and has potential for variation and bias. Attempts have been made to standardize the selection process for residency applicants, with limited success. A structured screening system was devised for rating both the objective and subjective elements from the applications of prospective orthopedic surgery residents.6 The results revealed favorable reliability statistics for overall applicant scores, but the correlation among reviewers was poor for subjective elements including quality of personal statement or letters of recommendation. More recently, a “semistructured” interview assessment tool was implemented in one institution's rank list formulation; scores were subjectively assigned to applicants according to interview performance, and the tool demonstrated good interobserver reliability.7

In 2020, the COVID-19 pandemic resulted in the implementation of unprecedented travel restrictions and social distancing behaviors. As such, the Accreditation Council for Graduate Medical Education strongly recommended that all residency interview activities be conducted virtually. The combination of the new interviewing format and the need to ensure high reliability in interview evaluation led us to consider implementing a standardized scoring algorithm by which faculty and resident interviewers evaluated applicants on their noncognitive attributes and program compatibility as demonstrated during their virtual interview. When implementing new strategies, especially in a process as high stakes as resident selection, it is important to study them to evaluate whether the intended purpose is realized. In this study, we assessed the feasibility, reliability, and biases associated with this standardized interviewer evaluation survey for the neurosurgical residency interview process.


In preparation for the 2020 to 2021 virtual neurosurgery residency interview season, we developed an interviewer survey to assess applicant interview performance and program compatibility. This survey was designed by the residency program director, department chairman, and 2 senior residents and consisted of 5 questions to assess each applicant's qualitative attributes (Table 1). Each interviewer scored each applicant on a numeric rating scale from 1 (lowest) to 5 (highest) for each of the 5 questions. Interviewers included neurosurgery faculty and senior residents. Interviews were conducted in a multiple mini interview format, and each interview session consisted of 1 applicant and a pair of interviewers (most often a faculty member coupled with a resident). Although there was no predetermined transcript or standardized question list for the interviewers to adhere to, interviewers were oriented to the process and survey questions before conducting the interviews. Therefore, interviewers were aware of the attributes they were assessing and were afforded the autonomy to evaluate these attributes however they saw fit. Although a portion of the interviewers relied on standardized questioning, this was not universal in our cohort of evaluators.

TABLE 1. - Interviewer Survey Questions and Scores
Interviewer survey question Score (mean ± SD)
How would you rate this applicant's maturity/professionalism? 4.39 ± 0.69
How would you rate this applicant's motivation/drive? 4.29 ± 0.71
How does this applicant respond to difficult questions? 4.02 ± 0.82
How well do you think this applicant would “fit in” with the neurosurgery department? 3.98 ± 0.82
How should we rank this applicant? 3.79 ± 0.81

Overall scores were calculated based on the weighted average of each of the individual question scores. A total score for each applicant, defined as the average of all interviewers' scores for that applicant, was computed and directly used to assemble the final rank list. More specifically, the applicant rank list was created by anonymously ordering the interviewees by descending total score. The only deviation from this methodology came when placing the applicants who completed a visiting acting internship rotation because they did not participate in the formal interviewing process.
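The scoring steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the article does not report the weights applied to the 5 questions, so equal weights are assumed here, and the applicant identifiers, interviewer labels, and scores are made up.

```python
from statistics import mean

# ASSUMPTION: equal weights for the 5 survey questions. The article does
# not report the weights that were actually used.
WEIGHTS = [1, 1, 1, 1, 1]


def overall_score(question_scores, weights=WEIGHTS):
    """Weighted average of one interviewer's 1-5 scores for one applicant."""
    if len(question_scores) != len(weights):
        raise ValueError("one score per survey question is required")
    return sum(w * s for w, s in zip(weights, question_scores)) / sum(weights)


def total_score(scores_by_interviewer):
    """Average of all interviewers' overall scores for one applicant."""
    return mean(overall_score(s) for s in scores_by_interviewer.values())


def rank_list(survey):
    """Order anonymized applicant identifiers by descending total score."""
    totals = {applicant: total_score(by_interviewer)
              for applicant, by_interviewer in survey.items()}
    return sorted(totals, key=totals.get, reverse=True)


# Hypothetical data: 3 anonymized applicants, each scored by 2 interviewers.
survey = {
    "A01": {"faculty_1": [5, 4, 4, 4, 4], "resident_1": [4, 4, 3, 4, 4]},
    "A02": {"faculty_1": [5, 5, 5, 5, 5], "resident_1": [5, 5, 4, 5, 5]},
    "A03": {"faculty_1": [3, 4, 3, 3, 3], "resident_1": [4, 3, 3, 3, 3]},
}
ranking = rank_list(survey)
```

With these made-up scores, the applicant with the highest mean across interviewers is placed first. A different weighting scheme could be tried by passing another `weights` list to `overall_score`.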

After the application cycle, we retrospectively reviewed our interview scoring metrics and compiled data regarding interviewer status (faculty or resident), interviewer sex, applicant sex, and numerical scores for individual questions and total scores. Data were anonymized before analysis. To limit selection bias, we only included interviewers who interviewed at least 90% of the applicants, and any applicant who was not interviewed by these interviewers was excluded from the analysis, so that all included interviewers interviewed all included applicants.
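The inclusion rule above yields a complete block in which every included interviewer scored every included applicant. A minimal sketch of that filter follows; the function name, the data layout (a mapping from interviewer to the set of applicants interviewed), and the example data are assumptions made for illustration.

```python
def select_complete_block(interviewed, all_applicants, threshold=0.90):
    """Return (included_interviewers, included_applicants).

    interviewed: dict mapping interviewer -> set of applicants interviewed.
    all_applicants: set of every applicant who completed the process.
    """
    n = len(all_applicants)
    # Step 1: keep interviewers who saw at least `threshold` of applicants.
    included_interviewers = sorted(
        i for i, seen in interviewed.items()
        if len(seen & all_applicants) / n >= threshold
    )
    # Step 2: keep applicants seen by every included interviewer.
    included_applicants = sorted(
        a for a in all_applicants
        if all(a in interviewed[i] for i in included_interviewers)
    )
    return included_interviewers, included_applicants


# Hypothetical data: 10 applicants and 3 interviewers.
applicants = {f"A{n:02d}" for n in range(1, 11)}
interviewed = {
    "X": set(applicants),                       # interviewed all 10
    "Y": applicants - {"A10"},                  # interviewed 9 of 10
    "Z": {"A01", "A02", "A03", "A04", "A05"},   # interviewed 5 of 10
}
kept_interviewers, kept_applicants = select_complete_block(interviewed, applicants)
```

In this example, interviewer Z falls below the 90% threshold and is dropped, and applicant A10 is dropped because interviewer Y did not see that applicant.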

Multiple cohort analyses were performed by dividing the interviewers into cohorts based on status (faculty or resident) and sex (male or female). Average, minimum, and maximum scores were compared between these cohorts using the Student t test. Applicants were also divided into groups based on sex, and scores were compared between these cohorts for all reviewers and for subgroups of faculty reviewers, resident reviewers, female reviewers, and male reviewers.
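A two-sample comparison of this kind can be sketched from summary statistics alone. The article specifies only a "Student t test"; the sketch below assumes the unequal-variance (Welch) form, which is the default behavior of R's `t.test`, and that choice is an assumption rather than something the article states. The inputs are the rounded female and male interviewer values from Table 3, so the result need not match the published P value exactly. The P value is obtained by numerically integrating Student's t density with Simpson's rule, using only the Python standard library.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t, df) for an unequal-variance two-sample t test."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


def two_sided_p(t, df, steps=20000):
    """Two-sided P value from Student's t density via Simpson's rule.

    `steps` must be even for Simpson's rule.
    """
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return const * (1 + x * x / df) ** (-(df + 1) / 2)

    upper = abs(t)
    h = upper / steps
    area = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    central = area * h / 3  # probability mass between 0 and |t|
    return 1 - 2 * central


# Rounded summary statistics from Table 3 (female vs male interviewers).
t_stat, dof = welch_t(4.14, 0.03, 3, 3.97, 0.14, 12)
p_value = two_sided_p(t_stat, dof)
```

Replacing `welch_t` with a pooled-variance statistic would give the equal-variance form of the test. With group sizes as unequal as 3 and 12, the two variants can differ noticeably.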

To assess the reliability of our survey, the intraclass correlation coefficient (ICC) was calculated for each survey question and for the overall score. As reported previously,8 the following guideline was used to evaluate ICC measures: poor—below 0.50, moderate—between 0.50 and 0.75, good—between 0.75 and 0.90, and excellent—above 0.90. All statistical analyses were performed using RStudio (RStudio Team), and P < .05 was considered to be statistically significant.
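For readers unfamiliar with the ICC, the following sketch computes it from a complete applicants-by-raters score matrix using two-way ANOVA mean squares. The article does not state which ICC form was used; a two-way random-effects, absolute-agreement model is assumed here, and both the single-rater and the average-of-raters versions are returned. The ratings matrix is illustrative only and is not the study data.

```python
def icc_two_way_random(ratings):
    """Two-way random-effects, absolute-agreement ICC.

    ratings: list of rows, one row per applicant and one column per rater,
    with no missing values. Returns (single_rater_icc, average_rater_icc).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]

    # Sums of squares for applicants (rows), raters (columns), and total.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))

    single = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
    average = (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)
    return single, average


# Illustrative ratings: 6 applicants (rows) scored by 4 raters (columns).
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc_single, icc_average = icc_two_way_random(example)
```

The average-of-raters value is always at least as large as the single-rater value when raters agree at all, so it matters which form is reported when a result is compared against the guideline thresholds.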


General Results

A total of 22 interviewers (14 faculty and 8 residents) completed the interview process. Of these, 15 (68%) interviewed at least 90% of the applicants and were included in this study. Eight (53%) of the included interviewers were departmental faculty. Similarly, a total of 47 applicants completed the interview process. Of these, 35 (74%) were interviewed by all 15 of the included interviewers. Female applicants (17%) and interviewers (20%) constituted a minority of their respective cohorts. The Figure shows a boxplot of each interviewer's scores (8 faculty and 7 residents), and Table 1 presents the mean scores for each survey question.

Boxplot depicting each interviewer's overall scores (8 faculty and 7 residents). The dark line represents the median score; the upper and lower box edges denote the 75th and 25th percentiles, respectively; the whiskers mark 1.5 times the interquartile range; and the black dots indicate statistical outliers.

Cohort Analysis

In cohort analysis, there was no difference in average, minimum, or maximum total scores between faculty and resident interviewers (Table 2); however, the average total score for female interviewers was significantly higher than that for male interviewers (4.14 vs 3.97, P = .003; Table 3). There was no difference in total scores between female and male applicants when evaluating all reviewers or subgroups of faculty reviewers, resident reviewers, female reviewers, or male reviewers (Table 4).

TABLE 2. - Average, Minimum, and Maximum Scores From Faculty and Resident Interviewers
Variable Faculty (N = 8) Residents (N = 7) P value
Average score 3.97 ± 0.17 4.05 ± 0.11 .310
Minimum score 2.42 ± 0.61 2.57 ± 0.46 .606
Maximum score 4.96 ± 0.08 5.00 ± 0.00 .186

TABLE 3. - Average, Minimum, and Maximum Scores From Female and Male Interviewers
Variable Female (N = 3) Male (N = 12) P value
Average score 4.14 ± 0.03 3.97 ± 0.14 .003
Minimum score 2.80 ± 0.66 2.41 ± 0.49 .419
Maximum score 5.00 ± 0.00 4.97 ± 0.07 .180
Statistical significance: P < .05.

TABLE 4. - Average, Minimum, and Maximum Scores for Female and Male Applicants by All Reviewers, Faculty Reviewers, Resident Reviewers, Female Reviewers, and Male Reviewers
Variable Female applicants (N = 6) Male applicants (N = 29) P value
All reviewers
 Average score 4.02 4.00 .928
 Minimum score 3.11 3.02 .709
 Maximum score 4.86 4.80 .527
Faculty reviewers
 Average score 3.92 3.98 .730
 Minimum score 3.11 3.13 .955
 Maximum score 4.79 4.72 .626
Resident reviewers
 Average score 4.14 4.03 .565
 Minimum score 3.52 3.33 .376
 Maximum score 4.61 4.62 .957
Female reviewers
 Average score 4.15 4.14 .967
 Minimum score 3.47 3.77 .229
 Maximum score 4.71 4.49 .229
Male reviewers
 Average score 3.99 3.97 .931
 Minimum score 3.16 3.05 .673
 Maximum score 4.67 4.74 .608

ICC Analysis

In our ICC analysis, question 1 (ICC 0.796), question 2 (ICC 0.867), question 3 (ICC 0.870), and question 4 (ICC 0.851) were found to have good reliability. Question 5 (ICC 0.900) and the overall score (ICC 0.913) were found to have excellent reliability (Table 5).

TABLE 5. - Intraclass Correlation Coefficient Between all Reviewers for Each Individual Question and Overall Score
Question Intraclass correlation coefficient 95% CI
1. How would you rate this applicant's maturity/professionalism? 0.796 0.681-0.883
2. How would you rate this applicant's motivation/drive? 0.867 0.790-0.924
3. How does this applicant respond to difficult questions? 0.870 0.796-0.926
4. How well do you think this applicant would fit in with the neurosurgery department? 0.851 0.767-0.915
5. How should we rank this applicant? 0.900 0.843-0.943
Overall score 0.913 0.864-0.950


In this study, we introduced a methodology for standardized interviewer evaluation of neurosurgery applicants and assessed both its reliability and resultant biases in our program's experience. Our data revealed good or excellent interrater reliability for every question and for the total score. Moreover, there was consistency in total scores between resident and faculty interviewers. Female interviewers scored applicants higher than male interviewers (female interviewers: 4.14, male interviewers: 3.97; P = .003), although the difference was small in practical terms. In multiple cohort analyses, there was no evidence of sex bias toward applicants when controlling for interviewer status (faculty or resident) or sex. To the best of our knowledge, this study is the first of its kind to implement and evaluate a standardized interviewer survey within the application process for neurosurgical residency.

Before our novel interview scoring methodology, resident rank lists were formulated through round table discussions conducted after each interview day. This traditional format was significantly affected by subjectivity and recall and was biased by the more vocal and assertive discussion participants. Moreover, there was no means of trending interviewer reliability or assessing for biases within the process. Our aim was to introduce a more standardized and assessable interviewer evaluation system to ultimately drive rank list formulation.

Previous attempts have been made toward standardizing the resident applicant assessment processes within competitive subspecialties.1,6,7,9 In 2007, one of the first trials of a more structured process was completed, in which faculty interviewers completed a 5-point survey tool evaluating a general surgery residency candidate's personal characteristics, which included attitude, motivation, and professional integrity.1 The authors found that personal characteristic survey scores strongly correlated with clinical performance and final match list. Unfortunately, that analysis assessed neither survey reliability nor potential bias.

A more recent study used a standardized interviewer grading rubric to evaluate various qualitative domains, such as emotional maturity, commitment to patient care, and leadership, in applicants for a pharmacy school position.10 Applicants interviewed with 2 faculty members, and the authors investigated the reliability of their survey between the interviewing pair. The results demonstrated that mean scores were similar between most of the interviewing pairs, and evaluations from 8 of the 11 pairs had high interrater reliability. The authors suggested that a multiple mini interview format could have resulted in more favorable interviewer reliability, a concept prevalent in the medical school literature.11

It is well known that a sex disparity exists in neurosurgery. Although women account for 46.4% of graduating medical students, they represent only 12% of the neurosurgical workforce in the United States and Canada.12,13 This disparity is prevalent at the training level as well, with recent data reporting that only 17.1% of the 1323 neurosurgery residents and fellows in US programs are female.14 A retrospective review of historical residency matching data from 1990 to 2007 was performed to identify applicant characteristics associated with a successful match into a neurosurgery program.15 Through a multivariate analysis that considered US Medical Licensing Examination scores, medical school prestige, and Alpha Omega Alpha status, the authors found that female sex was associated with significantly lower odds of matching into neurosurgery (odds ratio 0.57, 95% CI 0.34-0.94). Knowing that this disparity persists today, it is critical to continually evaluate our residency selection procedures to identify and eliminate potential sources of sex bias, which are prevalent throughout other forms of evaluation.16-18 In our study data, we were unable to identify sources of applicant sex bias. Programs with a paucity of female residents should consider adopting a more objective interview evaluation process that lends itself to an analysis for unconscious bias.


The results of this study must be considered in the context of its limitations. First, this was a retrospective investigation and is therefore susceptible to all the biases and shortcomings intrinsic to this study design. To mitigate selection bias, we believed it appropriate to omit reviewers who interviewed fewer than 90% of the applicants, which unfortunately resulted in the loss of 32% of the interviewer cohort. The particularly small proportion of women in the interviewer and applicant cohorts also limits the ability to detect subtle sex biases that may be observed in a larger cohort. Finally, our interviewer survey was introduced and evaluated in a virtual interview format, and therefore, our findings may not translate directly to an in-person interview structure.


In this study, we demonstrated that our standardized interviewer survey was a feasible and reliable method for evaluating noncognitive attributes in neurosurgery residency interviews, and we were able to confidently use this system as designed to help with resident selection. In addition, there was no perceptible evidence of sex bias in our small, single-program experience. Future studies should aim to optimize the neurosurgery residency selection process and work to identify additional sources of bias, including ethnicity and academic background, to facilitate the selection of a more diverse residency cohort. In addition, it would be prudent to assess the outcomes of our selection process intervention by objectively evaluating our current and future residency cohort in both the academic and noncognitive domains. However, collecting enough data to draw meaningful conclusions will take a number of years. Moreover, future iterations of this process should allow interviewers to voice serious concerns regarding an applicant's qualification for a neurosurgery residency position to the rest of the interviewers, because a legitimate concern identified by a minority of the interviewers risks being obscured by the overall survey statistics. Finally, our experience may serve as an example for other residency specialties.


This study did not receive any funding or financial support.


The authors have no personal, financial, or institutional interest in any of the drugs, materials, or devices described in this article. Dr Steinmetz receives financial support from Zimmer Biomet, Elsevier, and Globus.


1. Brothers TE, Wetherholt S. Importance of the faculty interview during the resident application process. J Surg Educ. 2007;64(6):378-385.
2. Daly KA, Levine SC, Adams GL. Predictors for resident success in otolaryngology. J Am Coll Surg. 2006;202(4):649-654.
3. Downard CD, Goldin A, Garrison MM, Waldhausen J, Langham M, Hirschl R. Utility of onsite interviews in the pediatric surgery match. J Pediatr Surg. 2015;50(6):1042-1045.
4. Nallasamy S, Uhler T, Nallasamy N, Tapino PJ, Volpe NJ. Ophthalmology resident selection: current trends in selection criteria and improving the process. Ophthalmology. 2010;117(5):1041-1047.
5. Al Khalili K, Chalouhi N, Tjoumakaris S, et al. Programs selection criteria for neurological surgery applicants in the United States: a national survey for neurological surgery program directors. World Neurosurg. 2014;81(3-4):473-477.e472.
6. Dirschl DR. Scoring of orthopaedic residency applicants: is a scoring system reliable? Clin Orthop Relat Res. 2002;399:260-264.
7. Schenker ML, Baldwin KD, Israelite CL, Levin LS, Mehta S, Ahn J. Selecting the best and brightest: a structured approach to orthopedic resident selection. J Surg Educ. 2016;73(5):879-885.
8. Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15(2):155-163.
9. Lyons J, Bingmer K, Ammori J, Marks J. Utilization of a novel program-specific evaluation tool results in a decidedly different interview pool than traditional application review. J Surg Educ. 2019;76(6):e110-e117.
10. Kelsch MP, Friesner DL. Evaluation of an interview process for admission into a school of pharmacy. Am J Pharm Educ. 2012;76(2):22.
11. Lemay JF, Lockyer JM, Collin VT, Brownell AK. Assessment of non-cognitive traits through the admissions multiple mini-interview. Med Educ. 2007;41(6):573-579.
12. AAMC. Applicants, Matriculants, Enrollment, Graduates, MD-PhD, and Residency Applicants Data. Accessed December 20, 2021.
13. Odell T, Toor H, Takayanagi A, et al. Gender disparity in academic neurosurgery. Cureus. 2019;11(5):e4628.
14. AAMC. ACGME Residents and Fellows by Sex and Specialty, 2015. Accessed December 20, 2021.
15. Durham SR, Donaldson K, Grady MS, Benzil DL. Analysis of the 1990-2007 neurosurgery residency match: does applicant gender affect neurosurgery match outcome? J Neurosurg. 2018;129(2):282-289.
16. Goldin C, Rouse C. Orchestrating impartiality: the impact of “Blind” auditions on female musicians. Am Econ Rev. 2000;90(4):715-741.
17. Gonzalez MJ, Cortina C, Rodriguez J. The role of gender stereotypes in hiring: a field experiment. Eur Sociol Rev. 2019;35(2):187-204.
18. Moss-Racusin CA, Dovidio JF, Brescoll VL, Graham MJ, Handelsman J. Science faculty's subtle gender biases favor male students. Proc Natl Acad Sci U S A. 2012;109(41):16474-16479.


Comment

The authors present a pilot study of a standardized interviewer survey about residency candidates' “noncognitive attributes and program compatibility.” Assessment science convincingly supports the use of standardized, multiobserver evaluations, which increase reliability and reduce the impact of bias. In this study, high intraclass correlation coefficients validate survey reliability. Reassuringly, there were no differences in reliability related to the sex or career stage (faculty or resident) of the interviewer.

The structured survey included 4 formative questions and a single summative question, “How should we rank this applicant?” Three of the formative questions assessed applicants' “noncognitive” (emotional/behavioral) attributes. The fourth formative question, directed at “program compatibility,” was “How well do you think this applicant would ‘fit in’ with the neurosurgery department?” Because “fit” is undefined, the utility of this question is low; moreover, “fit” questions, in general, are an invitation to conscious or unconscious bias in hiring decisions.

The authors' overall approach is important and could equally be applied to cognitive, technical, and experiential applicant attributes, in addition to emotional/behavioral ones. Furthermore, although the rationale given for the project is the imperative to conduct virtual interviews during the COVID-19 pandemic, this approach is equally valid in the in-person setting. In future work, I encourage the authors to drop “fit” questions and to extend their approach to a holistic assessment of specific behavioral, cognitive, and technical attributes correlated with success in residency and beyond.

Nathan R. Selden

Portland, Oregon, USA


Keywords: Internship and residency; Interview; Neurosurgery

© Congress of Neurological Surgeons 2022. All rights reserved.